Here I will reemphasize that learning how to think about working with data is more important than learning how to use specific tools. You can use the dplyr package to get the same results as with plyr, but in a slightly different way. dplyr implements the following verbs useful for data manipulation:
select(): focus on a subset of variables
filter(): focus on a subset of rows
mutate(): add new columns
summarise(): reduce each group to a smaller number of summary statistics
arrange(): re-order the rows
do(): apply any R function to each group of the data
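As a rough illustration of the first few verbs, here is a minimal sketch on a small made-up data frame. The object `toy` and its values are hypothetical; only the column names are borrowed from gapminder so the calls look familiar.

```r
library(dplyr)

# Hypothetical toy data with gapminder-style column names (values invented)
toy <- data.frame(
  country   = c("A", "B", "C", "D"),
  continent = c("X", "X", "Y", "Y"),
  gdpPercap = c(1000, 3000, 2000, 4000),
  pop       = c(10, 20, 30, 40)
)

# select(): keep only some columns
toy_small <- select(toy, country, gdpPercap)

# filter(): keep only the rows that match a condition
toy_rich <- filter(toy, gdpPercap > 1500)

# mutate(): add a new column computed from existing ones
toy_gdp <- mutate(toy, gdp = gdpPercap * pop)

# arrange(): re-order the rows, here from highest to lowest gdpPercap
toy_sorted <- arrange(toy, desc(gdpPercap))
```

Each verb takes a data frame as its first argument and returns a new data frame, which is what makes them easy to chain together.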
Here we take our data, use the group_by function to do the splitting, and then summarise each group with a function (here we take the mean of gdpPercap).
library(dplyr)
# Split the data by continent, then compute the mean gdpPercap for each group
grouped_gap <- group_by(gapminder, continent)
summarise(grouped_gap, gdp = mean(gdpPercap))
## # A tibble: 5 × 2
## continent gdp
## <fctr> <dbl>
## 1 Africa 2193.755
## 2 Americas 7136.110
## 3 Asia 7902.150
## 4 Europe 14469.476
## 5 Oceania 18621.609
The way I have written the commands above works, but it is difficult to read. The %>% operator works as a pipe in R the way that | did in bash. It loads with dplyr or the magrittr package. We can pipe the data to group_by and then pipe that to summarise, which makes our workflow more readable. You usually need to group prior to summarising. You can also group by more than one variable.
gapminder %>% group_by(continent) %>% summarise(gdp = mean(gdpPercap))
## # A tibble: 5 × 2
## continent gdp
## <fctr> <dbl>
## 1 Africa 2193.755
## 2 Americas 7136.110
## 3 Asia 7902.150
## 4 Europe 14469.476
## 5 Oceania 18621.609
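Grouping by more than one variable might look like the sketch below. It uses a small made-up data frame (`toy`, with invented values) rather than the full gapminder data, so the result is easy to check by hand.

```r
library(dplyr)

# Hypothetical toy data with gapminder-style column names (values invented)
toy <- data.frame(
  continent = c("X", "X", "X", "Y", "Y", "Y"),
  year      = c(2002, 2007, 2007, 2002, 2002, 2007),
  gdpPercap = c(1000, 2000, 4000, 3000, 5000, 6000)
)

# Group by continent AND year, then take the mean within each combination
by_cont_year <- toy %>%
  group_by(continent, year) %>%
  summarise(gdp = mean(gdpPercap)) %>%
  ungroup()  # drop the remaining grouping so later steps see a plain table

by_cont_year  # one row per continent-year combination (4 rows here)
```

The same pattern should carry over to the real data, for example `gapminder %>% group_by(continent, year)`.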
Note this is a tibble, which is “a modern reimagining of the data.frame, keeping what time has proven to be effective, and throwing out what is not.” The result is the same as when we used dlply previously.
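If other code expects a plain data.frame, a tibble can be converted back. The sketch below uses a hypothetical toy data frame (`toy`, values invented) to check what class the summarised result has.

```r
library(dplyr)

# Hypothetical toy data (values invented for illustration)
toy <- data.frame(
  continent = c("X", "X", "Y"),
  gdpPercap = c(1000, 3000, 2000)
)

res <- toy %>% group_by(continent) %>% summarise(gdp = mean(gdpPercap))

# A tibble is still a data frame; it just carries the extra class "tbl_df"
is_tbl <- inherits(res, "tbl_df")
is_df  <- is.data.frame(res)

# Convert to a plain data.frame when needed
res_df <- as.data.frame(res)
```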
You can even use pipes to send your results to ggplot directly.
gapminder %>% group_by(continent) %>% summarise(gdp = mean(gdpPercap)) %>%
  ggplot(aes(x = continent, y = gdp)) + geom_point()
Here is an example of filtering with filter. Note the simplicity compared to our previous approach to subsetting data.
gapminder %>% filter(year == 2007) %>%
  ggplot(aes(x = continent, y = gdpPercap * pop)) + geom_point() + scale_y_log10()
Here is an example of using mutate and sending the resulting output into the pipe, rather than calculating the y value when ggplot is called.
mutate(gapminder, gdp = gdpPercap * pop) %>% filter(year == 2007) %>%
  ggplot(aes(x = continent, y = gdp)) + geom_point() + scale_y_log10()
Some of this material was taken from the dplyr GitHub README.