There are many reasons to make it easy to rerun our analyses. The gapminder data is updated periodically, and we may want to pull in that new information later and rerun our analysis. We may also obtain similar data from a different source in the future.

In this lesson, we’ll learn how to write a function so that we can repeat a set of operations with a single command. Once we have a function that is known to work, we can use it repeatedly without worrying about how it works, just as we have used functions like min and max.

Functions gather a sequence of operations into a whole, preserving it for ongoing use.

As the basic building block of most programming languages, user-defined functions constitute “programming” as much as any single abstraction can. If you have written a function, you are a computer programmer.

Defining a function

We define a function by assigning the output of function to a variable. The list of argument names is contained within parentheses. Next, the body of the function (the statements that are executed when it runs) is contained within curly braces ({}). The statements in the body are indented by two spaces. This makes the code easier to read but does not affect how the code operates.

When we call the function, the values we pass to it are assigned to those variables so that we can use them inside the function. Inside the function, we use a return statement to send a result back to whoever asked for it.

Calling our own function is no different from calling any other function.

In R, the return statement is not required: R automatically returns the value of the last expression evaluated in the body of the function. Since we are just learning, we will write the return statement explicitly.
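As a minimal sketch of this behaviour, the two functions below return the same value, one with an explicit return and one without. The names double_explicit and double_implicit are hypothetical and used only for illustration.

```r
# Explicit return: the result is sent back with return().
double_explicit <- function(x) {
  result <- x * 2
  return(result)
}

# Implicit return: the value of the last evaluated expression is returned.
double_implicit <- function(x) {
  x * 2
}

double_explicit(21)
## [1] 42
double_implicit(21)
## [1] 42
```

Both styles behave the same here; the explicit form simply makes the returned value easier to spot while learning.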

Previously we calculated the GDP by multiplying the population by the GDP per capita. Rather than writing out this calculation every time we want the GDP for a dataset, let’s turn it into a function.

# Takes a dataset (dat) and multiplies the pop column with the gdpPercap column.
calcGDP <- function(dat) {
  gdp <- dat$gdpPercap * dat$pop
  
  return(gdp)
}

We define calcGDP by assigning the output of function to it. The list of argument names is contained within parentheses. Next, the body of the function (the statements executed when you call the function) is contained within curly braces ({}).

We’ve indented the statements in the body by two spaces. This makes the code easier to read but does not affect how it operates.

When we call the function, the values we pass to it are assigned to the arguments, which become variables inside the body of the function.

Inside the function, we use the return function to send back the result. This return function is optional: R will automatically return the value of the last expression evaluated in the function.

# curl() comes from the curl package, which must be installed and loaded first.
library(curl)
gapminder_location <- curl(url = "https://raw.githubusercontent.com/resbaz/r-novice-gapminder-files/master/data/gapminder-FiveYearData.csv")
gapminder <- read.csv(gapminder_location)

calcGDP(head(gapminder))
## [1]  6567086330  7585448670  8758855797  9648014150  9678553274 11697659231

That’s not very informative. Let’s also output the information from the other columns.

# Takes a dataset and multiplies the population column with the GDP per capita column.
calcGDP <- function(dat) {
  gdp <- dat$pop * dat$gdpPercap  
  dat <- cbind(dat, gdp)
  return(dat)
}

calcGDP(head(gapminder))
##       country year      pop continent lifeExp gdpPercap         gdp
## 1 Afghanistan 1952  8425333      Asia  28.801  779.4453  6567086330
## 2 Afghanistan 1957  9240934      Asia  30.332  820.8530  7585448670
## 3 Afghanistan 1962 10267083      Asia  31.997  853.1007  8758855797
## 4 Afghanistan 1967 11537966      Asia  34.020  836.1971  9648014150
## 5 Afghanistan 1972 13079460      Asia  36.088  739.9811  9678553274
## 6 Afghanistan 1977 14880372      Asia  38.438  786.1134 11697659231

Note we can specify any dataset or subset of our data.

calcGDP(gapminder[20:30,])
##    country year      pop continent lifeExp gdpPercap         gdp
## 20 Albania 1987  3075321    Europe  72.000  3738.933 11498418358
## 21 Albania 1992  3326498    Europe  71.581  2497.438  8307722183
## 22 Albania 1997  3428038    Europe  72.950  3193.055 10945912519
## 23 Albania 2002  3508512    Europe  75.651  4604.212 16153932130
## 24 Albania 2007  3600523    Europe  76.423  5937.030 21376411360
## 25 Algeria 1952  9279525    Africa  43.077  2449.008 22725632678
## 26 Algeria 1957 10270856    Africa  45.685  3013.976 30956113720
## 27 Algeria 1962 11000948    Africa  48.303  2550.817 28061403854
## 28 Algeria 1967 12760499    Africa  51.407  3246.992 41433235247
## 29 Algeria 1972 14760787    Africa  54.518  4182.664 61739408943
## 30 Algeria 1977 17152804    Africa  58.014  4910.417 84227416174

We can use == to subset data by a particular value.

head(calcGDP(gapminder[gapminder$year == 2007, ]))
##        country year      pop continent lifeExp  gdpPercap          gdp
## 12 Afghanistan 2007 31889923      Asia  43.828   974.5803  31079291949
## 24     Albania 2007  3600523    Europe  76.423  5937.0295  21376411360
## 36     Algeria 2007 33333216    Africa  72.301  6223.3675 207444851958
## 48      Angola 2007 12420476    Africa  42.731  4797.2313  59583895818
## 60   Argentina 2007 40301927  Americas  75.320 12779.3796 515033625357
## 72   Australia 2007 20434176   Oceania  81.235 34435.3674 703658358894

We can get values for two different years by specifying one year OR another using | (use & instead when both conditions must hold).

head(calcGDP(gapminder[gapminder$year == 2007 | gapminder$year == 1952, ]))
##        country year      pop continent lifeExp gdpPercap          gdp
## 1  Afghanistan 1952  8425333      Asia  28.801  779.4453   6567086330
## 12 Afghanistan 2007 31889923      Asia  43.828  974.5803  31079291949
## 13     Albania 1952  1282697    Europe  55.230 1601.0561   2053669902
## 24     Albania 2007  3600523    Europe  76.423 5937.0295  21376411360
## 25     Algeria 1952  9279525    Africa  43.077 2449.0082  22725632678
## 36     Algeria 2007 33333216    Africa  72.301 6223.3675 207444851958
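To see how these logical operators combine conditions, here is a small sketch on a made-up vector of years (the vector and variable names are hypothetical, not part of the gapminder data).

```r
# A hypothetical vector of years, used only for illustration.
years <- c(1952, 1957, 2002, 2007)

# == compares each element with a single value.
is_2007 <- years == 2007
is_1952 <- years == 1952

# | is TRUE where either condition holds.
either <- is_2007 | is_1952

# & is TRUE only where both conditions hold; a year cannot be
# both 1952 and 2007, so every element is FALSE.
both <- is_2007 & is_1952

years[either]
## [1] 1952 2007
years[both]
## numeric(0)
```

Subsetting rows of a data frame with these logical vectors works the same way as subsetting the vector itself.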

Because this is getting unwieldy to read, let’s put all this subsetting into our function. When we call the function, we want to specify the dataset, the year(s), and the country(ies).

calcGDP(gapminder, 1952, "Afghanistan")

To do that, we need to add some more arguments to our function so we can extract year and country.

Inside the function, we can use the matching operator %in% to subset data by a set of values.

# Takes a dataset, subsets it by year and country, and multiplies the population column with the GDP per capita column.
calcGDP <- function(dat, year, country) {
  dat <- dat[dat$year %in% year, ]
  dat <- dat[dat$country %in% country,]
  
  gdp <- dat$pop * dat$gdpPercap
  
  dat <- cbind(dat, gdp)
  return(dat)
}

The function now takes a subset of the rows for all columns by year. It then subsets this subset by country. Then it calculates the GDP for the subset of the previous two steps. The function then adds the GDP as a new column to the subsetted data and returns this as the final result. Because we have defined all of these pieces of code in one function, we can now repeat this process for any dataset.

We can now calculate the GDP for a single combination of year and country.
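As a rough, self-contained sketch of such a call, the block below rebuilds three rows of the data shown earlier in this lesson as a small stand-in data frame (the name toy is hypothetical) and repeats the function definition so it can run without downloading anything.

```r
# A small stand-in for the gapminder data, built from rows shown earlier
# in this lesson (the name `toy` is hypothetical).
toy <- data.frame(
  country   = c("Afghanistan", "Afghanistan", "Albania"),
  year      = c(1952, 2007, 2007),
  pop       = c(8425333, 31889923, 3600523),
  gdpPercap = c(779.4453, 974.5803, 5937.0295)
)

# Same definition as above, repeated so this block is self-contained.
calcGDP <- function(dat, year, country) {
  dat <- dat[dat$year %in% year, ]
  dat <- dat[dat$country %in% country, ]
  gdp <- dat$pop * dat$gdpPercap
  dat <- cbind(dat, gdp)
  return(dat)
}

# One year and one country select a single row.
result <- calcGDP(toy, 1952, "Afghanistan")
nrow(result)
## [1] 1
```

With the full gapminder data the call has the same shape: calcGDP(gapminder, 1952, "Afghanistan").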

By using %in% we can also give multiple years or countries to those arguments.

calcGDP(gapminder, 1952:1962, country = "Afghanistan")
##       country year      pop continent lifeExp gdpPercap        gdp
## 1 Afghanistan 1952  8425333      Asia  28.801  779.4453 6567086330
## 2 Afghanistan 1957  9240934      Asia  30.332  820.8530 7585448670
## 3 Afghanistan 1962 10267083      Asia  31.997  853.1007 8758855797
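The reason a range such as 1952:1962 returns three rows can be sketched with %in% on its own. The vector below is a made-up example (hypothetical names), chosen to mirror the five-year spacing of the data.

```r
# Hypothetical vector of years spaced five years apart.
years <- c(1952, 1957, 1962, 1967)

# %in% tests each element on the left for membership in the set on the right.
in_range <- years %in% 1952:1962
in_range
## [1]  TRUE  TRUE  TRUE FALSE

# == against a single value matches only that value.
is_1952 <- years == 1952
is_1952
## [1]  TRUE FALSE FALSE FALSE
```

Only the years that actually occur in the data (1952, 1957, 1962) match, even though the range lists every year in between.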

Note that we haven’t changed our original dataset. The subsetting applies only to the copy of the data inside the function.

dim(gapminder)
## [1] 1704    6

Now let’s expand this function to check whether the year and country are specified. If they aren’t, then we can use all of them. We can use conditional statements to set actions to occur only if a condition or a set of conditions is met.

# if
if (condition is true) {
  perform action
}

# if ... else
if (condition is true) {
  perform action
} else {  # that is, if the condition is false,
  perform alternative action
}

A common use of an if statement is to compare values. For example:

x <- 1001
if (x == 1001) {
  print('x is 1001')
} else {
  print('x is not 1001')
}
## [1] "x is 1001"
x <- 1001
if (x > 1000) {
  print('x is greater than 1000')
} else {
  print('x is not greater than 1000')
}
## [1] "x is greater than 1000"

If we are coding properly, we would put this check in a function.

check1000 <- function(x) {
  if (x > 1000) {
    print(x)
    print('is greater than 1000')
  } else {
    print(x)
    print('is not greater than 1000')
  }
}
check1000(1001)
## [1] 1001
## [1] "is greater than 1000"

To calculate GDP information, we first set the default value of year and country to NULL. We then use an if statement and the is.null function to check whether year or country was specified when the function was called, or whether the default value is in use.

# Takes a dataset, optionally subsets it by year and country, and multiplies the population column with the GDP per capita column.
calcGDP <- function(dat, year=NULL, country=NULL) {
  if(!is.null(year)) {
    dat <- dat[dat$year %in% year, ]
  }
  if (!is.null(country)) {
    dat <- dat[dat$country %in% country,]
  }
  gdp <- dat$pop * dat$gdpPercap

  dat <- cbind(dat, gdp=gdp)
  return(dat)
}

The function now subsets the provided data by year if the year argument isn’t empty, then subsets the result by country if the country argument isn’t empty. Then it calculates the GDP for whatever subset emerges from the previous two steps. The function then adds the GDP as a new column to the subsetted data and returns this as the final result. You can see that the output is much more informative than just getting a vector of numbers.

Let’s take a look at what happens when we specify the year:

head(calcGDP(gapminder, year=2007))
##        country year      pop continent lifeExp  gdpPercap          gdp
## 12 Afghanistan 2007 31889923      Asia  43.828   974.5803  31079291949
## 24     Albania 2007  3600523    Europe  76.423  5937.0295  21376411360
## 36     Algeria 2007 33333216    Africa  72.301  6223.3675 207444851958
## 48      Angola 2007 12420476    Africa  42.731  4797.2313  59583895818
## 60   Argentina 2007 40301927  Americas  75.320 12779.3796 515033625357
## 72   Australia 2007 20434176   Oceania  81.235 34435.3674 703658358894

Or for a specific country:

calcGDP(gapminder, country="Australia")
##      country year      pop continent lifeExp gdpPercap          gdp
## 61 Australia 1952  8691212   Oceania  69.120  10039.60  87256254102
## 62 Australia 1957  9712569   Oceania  70.330  10949.65 106349227169
## 63 Australia 1962 10794968   Oceania  70.930  12217.23 131884573002
## 64 Australia 1967 11872264   Oceania  71.100  14526.12 172457986742
## 65 Australia 1972 13177000   Oceania  71.930  16788.63 221223770658
## 66 Australia 1977 14074100   Oceania  73.490  18334.20 258037329175
## 67 Australia 1982 15184200   Oceania  74.740  19477.01 295742804309
## 68 Australia 1987 16257249   Oceania  76.320  21888.89 355853119294
## 69 Australia 1992 17481977   Oceania  77.560  23424.77 409511234952
## 70 Australia 1997 18565243   Oceania  78.830  26997.94 501223252921
## 71 Australia 2002 19546792   Oceania  80.370  30687.75 599847158654
## 72 Australia 2007 20434176   Oceania  81.235  34435.37 703658358894

Or both:

calcGDP(gapminder, year=2007, country="Australia")
##      country year      pop continent lifeExp gdpPercap          gdp
## 72 Australia 2007 20434176   Oceania  81.235  34435.37 703658358894

Let’s walk through the body of the function:

calcGDP <- function(dat, year=NULL, country=NULL) {
}

Here we’ve added two arguments, year and country. We’ve set the default value of both to NULL using the = operator in the function definition. This means that those arguments will take on those values unless the user specifies otherwise.

  if(!is.null(year)) {
    dat <- dat[dat$year %in% year, ]
  }

  if (!is.null(country)) {
    dat <- dat[dat$country %in% country,]
  }

Here, we check whether each additional argument is NULL. Whenever an argument is not NULL, we overwrite the dataset stored in dat with the subset selected by that argument.
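The same pattern of a NULL default plus an is.null check can be sketched on a plain vector. The function keep_at_least and its arguments are hypothetical, introduced only to isolate the idea.

```r
# Hypothetical helper: keep only values at or above `minimum`,
# or keep everything when `minimum` is left at its NULL default.
keep_at_least <- function(values, minimum = NULL) {
  if (!is.null(minimum)) {
    values <- values[values >= minimum]
  }
  return(values)
}

# No minimum given: the default NULL is used and nothing is filtered.
keep_at_least(c(1, 5, 10))
## [1]  1  5 10

# Minimum given: the subset is applied.
keep_at_least(c(1, 5, 10), minimum = 5)
## [1]  5 10
```

Using NULL as the default lets the function distinguish "no filter requested" from any real filter value.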

We can now ask the function to calculate the GDP for the whole dataset, a single year, a single country, or a single combination of year and country.

Tip: Pass by value

Functions in R almost always make copies of the data to operate on inside of a function body. When we modify dat inside the function we are modifying the copy of the gapminder dataset stored in dat, not the original variable we gave as the first argument.

This is called “pass-by-value” and it makes writing code much safer: you can always be sure that whatever changes you make within the body of the function stay inside the body of the function.
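A small sketch of this behaviour, using a hypothetical function add_column and a made-up data frame rather than the gapminder data:

```r
# Hypothetical function that modifies its argument.
add_column <- function(dat) {
  dat$flag <- TRUE  # changes only the copy inside the function
  return(dat)
}

original <- data.frame(x = 1:3)
modified <- add_column(original)

# The data frame we passed in is unchanged.
ncol(original)
## [1] 1

# The returned copy carries the extra column.
ncol(modified)
## [1] 2
```

To keep a change made inside a function, assign the returned value to a variable, as done with modified here.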

Tip: Function scope

Another important concept is scoping: any variables (or functions!) you create or modify inside the body of a function only exist for the lifetime of the function’s execution. When we call calcGDP, the variables dat and gdp only exist inside the body of the function. Even if we have variables of the same name in our interactive R session, they are not modified in any way when executing a function.
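As an illustrative sketch, the hypothetical function local_gdp below creates its own variable called gdp while a variable of the same name already exists in the session.

```r
# A variable defined in the interactive session.
gdp <- "defined in the session"

# Hypothetical function that creates its own local variable called gdp.
local_gdp <- function() {
  gdp <- 42  # local to this call; the session variable is untouched
  return(gdp)
}

local_gdp()
## [1] 42

# The session variable still holds its original value.
gdp
## [1] "defined in the session"
```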

Challenge

What is the expected result from the following script?

add3 <- function(y) {
  y + 3
}
x <- 10
y <- add3(x)
print(x)
print(y)