Programming with R
Analyzing data
Learning Objectives
- Select individual values and subsections from data.
- Perform operations on a data frame of data.
- Display simple graphs.
Manipulating Data
Now that our data is loaded in memory, we can start doing things with it.
If we want to get a single value from the data frame, we can provide an index in square brackets, just as we do in math:
# first value in gapminder
gapminder[1, 1]
[1] Afghanistan
142 Levels: Afghanistan Albania Algeria Angola Argentina Australia Austria Bahrain Bangladesh ... Zimbabwe
# middle value in gapminder
gapminder[3, 3]
[1] 10267083
An index like [3, 3]
selects a single element of a data frame, but we can select whole sections as well. For example, we can select the first two rows of values for the first four columns like this:
gapminder[1:2, 1:4]
country year pop continent
1 Afghanistan 1952 8425333 Asia
2 Afghanistan 1957 9240934 Asia
The slice 1:4
means, “Start at index 1 and go to index 4.”
The slice does not need to start at 1, e.g. the line below selects rows 5 through 10:
gapminder[5:10, 1:4]
We can use the function c
, which stands for combine, to select non-contiguous values:
gapminder[c(3, 8, 37, 56), c(1, 4, 5)]
country continent lifeExp
3 Afghanistan Asia 31.997
8 Afghanistan Asia 40.822
37 Angola Africa 30.015
56 Argentina Americas 70.774
We also don’t have to provide a slice for either the rows or the columns. If we don’t include a slice for the rows, R returns all the rows; if we don’t include a slice for the columns, R returns all the columns. If we don’t provide a slice for either rows or columns, e.g. gapminder[, ]
, R returns the full data frame.
# All columns from row 5
gapminder[5, ]
country year pop continent lifeExp gdpPercap
5 Afghanistan 1972 13079460 Asia 36.088 739.9811
# All rows from column 16
gapminder[, 6]
We can also calculate some information from our data. For example, the maximum and minimum of a list are as follows.
max(gapminder[,6])
min(gapminder[,6])
Now that we have our data in R, let’s do something with it.
We have the GDP per capita as well as the population of the country. Let’s calculate the GDP and add that to our data frame. First, we multiply gdpPercap by pop and save that to a variable. Multiplying two columns automatically multiplies each pair of values and makes a vector containing each result.
gdp<-gapminder$gdpPercap * gapminder$pop
gdp is a vector. A vector is the most common and basic data structure in R
. It can only contain one data type. They are the building blocks of every other data structure.
A vector can contain any of five types:
- logical (e.g.,
TRUE
,FALSE
) - integer (e.g.,
as.integer(3)
) - numeric (real or decimal) (e.g,
2
,2.0
,pi
) - complex (e.g,
1 + 0i
,1 + 4i
) - character (e.g,
"a"
,"swc"
)
Now we can add the vector gdp to the gapminder data frame using cbind
which stands for “column bind”, as in bind the vector gdp to the gapminder data frame.
gapminder <- cbind(gapminder, gdp)
We can check that our results are correct by looking at the first item in gdp and comparing it to the expected result.
gdp[1]
gapminder$gdp[1]
gapminder$gdpPercap[1]*gapminder$pop[1]
We can also view the first few rows of the data again.
head(gapminder)
country year pop continent lifeExp gdpPercap gdp
1 Afghanistan 1952 8425333 Asia 28.801 779.4453 6567086330
2 Afghanistan 1957 9240934 Asia 30.332 820.8530 7585448670
3 Afghanistan 1962 10267083 Asia 31.997 853.1007 8758855797
4 Afghanistan 1967 11537966 Asia 34.020 836.1971 9648014150
5 Afghanistan 1972 13079460 Asia 36.088 739.9811 9678553274
6 Afghanistan 1977 14880372 Asia 38.438 786.1134 11697659231
Challenge 1
Let’s try this on the pop
column of the gapminder
dataset.
Make a new column in the gapminder
data frame that contains population in units of millions of people. Check the head or tail of the data frame to make sure it worked.
Now let’s use our data to answer a biological question. For example:
What is the relationship between life expectancy and year?
We won’t go into too much detail, but briefly:
lm
estimates linear statistical models- The first argument is a formula, with
a ~ b
meaning thata
, the dependent (or response) variable, is a function ofb
, the independent variable. - We tell
lm
to use the gapminder data frame, so it knows where to find the variableslifeExp
andyear
.
reg <- lm(lifeExp ~ year, data=gapminder)
Let’s look at the output:
reg
Call:
lm(formula = lifeExp ~ year, data = gapminder)
Coefficients:
(Intercept) year
-585.6522 0.3259
Not much there right? But if we look at the structure…
str(reg)
List of 12
$ coefficients : Named num [1:2] -585.652 0.326
..- attr(*, "names")= chr [1:2] "(Intercept)" "year"
$ residuals : Named num [1:1704] -21.7 -21.8 -21.8 -21.4 -20.9 ...
..- attr(*, "names")= chr [1:1704] "1" "2" "3" "4" ...
$ effects : Named num [1:1704] -2455.1 232.2 -20.8 -20.5 -20.2 ...
..- attr(*, "names")= chr [1:1704] "(Intercept)" "year" "" "" ...
$ rank : int 2
$ fitted.values: Named num [1:1704] 50.5 52.1 53.8 55.4 57 ...
..- attr(*, "names")= chr [1:1704] "1" "2" "3" "4" ...
$ assign : int [1:2] 0 1
$ qr :List of 5
..$ qr : num [1:1704, 1:2] -41.2795 0.0242 0.0242 0.0242 0.0242 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:1704] "1" "2" "3" "4" ...
.. .. ..$ : chr [1:2] "(Intercept)" "year"
.. ..- attr(*, "assign")= int [1:2] 0 1
..$ qraux: num [1:2] 1.02 1.03
..$ pivot: int [1:2] 1 2
..$ tol : num 1e-07
..$ rank : int 2
..- attr(*, "class")= chr "qr"
$ df.residual : int 1702
$ xlevels : Named list()
$ call : language lm(formula = lifeExp ~ year, data = gapminder)
$ terms :Classes 'terms', 'formula' length 3 lifeExp ~ year
.. ..- attr(*, "variables")= language list(lifeExp, year)
.. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:2] "lifeExp" "year"
.. .. .. ..$ : chr "year"
.. ..- attr(*, "term.labels")= chr "year"
.. ..- attr(*, "order")= int 1
.. ..- attr(*, "intercept")= int 1
.. ..- attr(*, "response")= int 1
.. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. ..- attr(*, "predvars")= language list(lifeExp, year)
.. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. ..- attr(*, "names")= chr [1:2] "lifeExp" "year"
$ model :'data.frame': 1704 obs. of 2 variables:
..$ lifeExp: num [1:1704] 28.8 30.3 32 34 36.1 ...
..$ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
..- attr(*, "terms")=Classes 'terms', 'formula' length 3 lifeExp ~ year
.. .. ..- attr(*, "variables")= language list(lifeExp, year)
.. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. .. ..- attr(*, "dimnames")=List of 2
.. .. .. .. ..$ : chr [1:2] "lifeExp" "year"
.. .. .. .. ..$ : chr "year"
.. .. ..- attr(*, "term.labels")= chr "year"
.. .. ..- attr(*, "order")= int 1
.. .. ..- attr(*, "intercept")= int 1
.. .. ..- attr(*, "response")= int 1
.. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. .. ..- attr(*, "predvars")= language list(lifeExp, year)
.. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. .. ..- attr(*, "names")= chr [1:2] "lifeExp" "year"
- attr(*, "class")= chr "lm"
There’s a great deal stored in nested lists! The structure function allows you to see all the data available, in this case, the data that was returned by the lm
function.
For now, we can look at the summary
:
summary(reg)
Call:
lm(formula = lifeExp ~ year, data = gapminder)
Residuals:
Min 1Q Median 3Q Max
-39.949 -9.651 1.697 10.335 22.158
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -585.65219 32.31396 -18.12 <2e-16 ***
year 0.32590 0.01632 19.96 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.63 on 1702 degrees of freedom
Multiple R-squared: 0.1898, Adjusted R-squared: 0.1893
F-statistic: 398.6 on 1 and 1702 DF, p-value: < 2.2e-16
As you might expect, life expectancy has slowly been increasing over time, so we see a significant positive association!