Previously we looked at how you can use functions to simplify your code. We defined the calcGDP function, which takes the gapminder dataset, and multiplies the population and GDP per capita column. We also defined additional arguments so we could filter by year and country:

# Takes a dataset and multiplies the population column
# with the GDP per capita column.
calcGDP <- function(dat, year=NULL, country=NULL) {
  if(!is.null(year)) {
    dat <- dat[dat$year %in% year, ]
  }
  if (!is.null(country)) {
    dat <- dat[dat$country %in% country,]
  }
  gdp <- dat$pop * dat$gdpPercap

  new <- cbind(dat, gdp=gdp)
  return(new)
}

A common task you’ll encounter when working with data, is that you’ll want to run calculations on different groups within the data. In the above, we were simply calculating the GDP by multiplying two columns together. But what if we wanted to calculated the mean GDP per continent?

We could run calcGDP and then take the mean of each continent:

withGDP <- calcGDP(gapminder)
mean(withGDP[withGDP$continent == "Africa", "gdp"])
## [1] 20904782844
mean(withGDP[withGDP$continent == "Americas", "gdp"])
## [1] 379262350210
mean(withGDP[withGDP$continent == "Asia", "gdp"])
## [1] 227233738153

But this isn’t very nice. Yes, by using a function, you have reduced a substantial amount of repetition. That is nice. But there is still repetition. Repeating yourself will cost you time, both now and later, and potentially introduce some nasty bugs.

We could write a new function that is flexible like calcGDP, but this also takes a substantial amount of effort and testing to get right.

The abstract problem we’re encountering here is know as “split-apply-combine”. We want to split our data into groups, in this case continents, apply some calculations on that group, then optionally combine the results together afterwards.

The plyr package

You should be familiar with the apply family of functions. While R’s built in functions do work, we’re going to introduce you to another method for solving the “split-apply-combine” problem. The plyr package provides a set of functions that we find more user friendly for solving this problem.

Plyr has functions for operating on lists, data.frames and arrays (matrices, or n-dimensional vectors). Each function performs:

  1. A splitting operation
  2. Apply a function on each split in turn.
  3. Recombine output data as a single data object.

The functions are named based on the data structure they expect as input, and the data structure you want returned as output: [a]rray, [l]ist, or [d]ata.frame. The first letter corresponds to the input data structure, the second letter to the output data structure, and then the rest of the function is named “ply”.

This gives us 9 core functions **ply. There are an additional three functions which will only perform the split and apply steps, and not any combine step. They’re named by their input data type and represent null output by a _ (see table)

Note here that plyr’s use of “array” is different to R’s, an array in ply can include a vector or matrix.

Each of the xxply functions (daply, ddply, llply, laply, …) has the same structure and has 4 key features and structure:

xxply(.data, .variables, .fun)

Now we can quickly calculate the mean GDP per continent:

ddply(
 .data = calcGDP(gapminder),
 .variables = "continent",
 .fun = function(x) mean(x$gdp)
)
##   continent           V1
## 1    Africa  20904782844
## 2  Americas 379262350210
## 3      Asia 227233738153
## 4    Europe 269442085301
## 5   Oceania 188187105354

Let’s walk through the previous code:

What if we want a different type of output data structure?:

dlply(
 .data = calcGDP(gapminder),
 .variables = "continent",
 .fun = function(x) mean(x$gdp)
)
## $Africa
## [1] 20904782844
## 
## $Americas
## [1] 379262350210
## 
## $Asia
## [1] 227233738153
## 
## $Europe
## [1] 269442085301
## 
## $Oceania
## [1] 188187105354
## 
## attr(,"split_type")
## [1] "data.frame"
## attr(,"split_labels")
##   continent
## 1    Africa
## 2  Americas
## 3      Asia
## 4    Europe
## 5   Oceania

We called the same function again, but changed the second letter to an l, so the output was returned as a list.

We can specify multiple columns to group by and then arrange the results as a matrix:

daply(
 .data = calcGDP(gapminder),
 .variables = c("continent", "year"),
 .fun = function(x) mean(x$gdp)
)
##           year
## continent          1952         1957         1962         1967
##   Africa     5992294608   7359188796   8784876958  11443994101
##   Americas 117738997171 140817061264 169153069442 217867530844
##   Asia      34095762661  47267432088  60136869012  84648519224
##   Europe    84971341466 109989505140 138984693095 173366641137
##   Oceania   54157223944  66826828013  82336453245 105958863585
##           year
## continent          1972         1977         1982         1987
##   Africa    15072241974  18694898732  22040401045  24107264108
##   Americas 268159178814 324085389022 363314008350 439447790357
##   Asia     124385747313 159802590186 194429049919 241784763369
##   Europe   218691462733 255367522034 279484077072 316507473546
##   Oceania  134112109227 154707711162 176177151380 209451563998
##           year
## continent          1992         1997         2002         2007
##   Africa    26256977719  30023173824  35303511424  45778570846
##   Americas 489899820623 582693307146 661248623419 776723426068
##   Asia     307100497486 387597655323 458042336179 627513635079
##   Europe   342703247405 383606933833 436448815097 493183311052
##   Oceania  236319179826 289304255183 345236880176 403657044512

We can apply slightly more complex functions to the grouped data as well. Note paste concatenates vectors. format allows numbers to be printed “prettier”.

d_ply(
  .data=gapminder,
  .variables = "continent",
  .fun = function(x) {
    meanGDPperCap <- mean(x$gdpPercap)
    print(paste(
      "The mean GDP per capita for", unique(x$continent),
      "is", format(meanGDPperCap, big.mark=",")
   ))
  }
)
## [1] "The mean GDP per capita for Africa is 2,193.755"
## [1] "The mean GDP per capita for Americas is 7,136.11"
## [1] "The mean GDP per capita for Asia is 7,902.15"
## [1] "The mean GDP per capita for Europe is 14,469.48"
## [1] "The mean GDP per capita for Oceania is 18,621.61"

You can find out more about a function and how to use it by putting a ? in front of a function. For example ?paste.

Challenge 1

Calculate the average life expectancy per continent. Which has the longest? Which had the shortest?

Challenge 2

Calculate the average life expectancy per continent and year. Which had the longest and shortest in 2007? Which had the greatest change in between 1952 and 2007?