Programming with R

Analyzing data

Learning Objectives

  • Select individual values and subsections from data.
  • Perform operations on a data frame of data.
  • Display simple graphs.

Manipulating Data

Now that our data is loaded in memory, we can start doing things with it.

If we want to get a single value from the data frame, we can provide an index in square brackets, just as we do in math:

# first value in gapminder
gapminder[1, 1]
[1] Afghanistan
142 Levels: Afghanistan Albania Algeria Angola Argentina Australia Austria Bahrain Bangladesh ... Zimbabwe

# middle value in gapminder
gapminder[3, 3]
[1] 10267083

An index like [3, 3] selects a single element of a data frame, but we can select whole sections as well. For example, we can select the first two rows of values for the first four columns like this:

gapminder[1:2, 1:4]
      country year     pop continent
1 Afghanistan 1952 8425333      Asia
2 Afghanistan 1957 9240934      Asia

The slice 1:4 means, “Start at index 1 and go to index 4.”

The slice does not need to start at 1, e.g. the line below selects rows 5 through 10:

gapminder[5:10, 1:4]

We can use the function c, which stands for combine, to select non-contiguous values:

gapminder[c(3, 8, 37, 56), c(1, 4, 5)]
       country continent lifeExp
3  Afghanistan      Asia  31.997
8  Afghanistan      Asia  40.822
37      Angola    Africa  30.015
56   Argentina  Americas  70.774

We also don’t have to provide a slice for either the rows or the columns. If we don’t include a slice for the rows, R returns all the rows; if we don’t include a slice for the columns, R returns all the columns. If we don’t provide a slice for either rows or columns, e.g. gapminder[, ], R returns the full data frame.

# All columns from row 5
gapminder[5, ]
      country year      pop continent lifeExp gdpPercap
5 Afghanistan 1972 13079460      Asia  36.088  739.9811
# All rows from column 16
gapminder[, 6]

We can also calculate some information from our data. For example, the maximum and minimum of a list are as follows.

max(gapminder[,6])
min(gapminder[,6])

Now that we have our data in R, let’s do something with it.

We have the GDP per capita as well as the population of the country. Let’s calculate the GDP and add that to our data frame. First, we multiply gdpPercap by pop and save that to a variable. Multiplying two columns automatically multiplies each pair of values and makes a vector containing each result.

gdp<-gapminder$gdpPercap * gapminder$pop

gdp is a vector. A vector is the most common and basic data structure in R. It can only contain one data type. They are the building blocks of every other data structure.

A vector can contain any of five types:

  • logical (e.g., TRUE, FALSE)
  • integer (e.g., as.integer(3))
  • numeric (real or decimal) (e.g, 2, 2.0, pi)
  • complex (e.g, 1 + 0i, 1 + 4i)
  • character (e.g, "a", "swc")

Now we can add the vector gdp to the gapminder data frame using cbind which stands for “column bind”, as in bind the vector gdp to the gapminder data frame.

gapminder <- cbind(gapminder, gdp)

We can check that our results are correct by looking at the first item in gdp and comparing it to the expected result.

gdp[1]
gapminder$gdp[1]
gapminder$gdpPercap[1]*gapminder$pop[1]

We can also view the first few rows of the data again.

head(gapminder)
      country year      pop continent lifeExp gdpPercap         gdp
1 Afghanistan 1952  8425333      Asia  28.801  779.4453  6567086330
2 Afghanistan 1957  9240934      Asia  30.332  820.8530  7585448670
3 Afghanistan 1962 10267083      Asia  31.997  853.1007  8758855797
4 Afghanistan 1967 11537966      Asia  34.020  836.1971  9648014150
5 Afghanistan 1972 13079460      Asia  36.088  739.9811  9678553274
6 Afghanistan 1977 14880372      Asia  38.438  786.1134 11697659231

Challenge 1

Let’s try this on the pop column of the gapminder dataset.

Make a new column in the gapminder data frame that contains population in units of millions of people. Check the head or tail of the data frame to make sure it worked.

Now let’s use our data to answer a biological question. For example:

What is the relationship between life expectancy and year?

We won’t go into too much detail, but briefly:

  • lm estimates linear statistical models
  • The first argument is a formula, with a ~ b meaning that a, the dependent (or response) variable, is a function of b, the independent variable.
  • We tell lm to use the gapminder data frame, so it knows where to find the variables lifeExp and year.
reg <- lm(lifeExp ~ year, data=gapminder)

Let’s look at the output:

reg

Call:
lm(formula = lifeExp ~ year, data = gapminder)

Coefficients:
(Intercept)         year  
  -585.6522       0.3259  

Not much there right? But if we look at the structure…

str(reg)
List of 12
 $ coefficients : Named num [1:2] -585.652 0.326
  ..- attr(*, "names")= chr [1:2] "(Intercept)" "year"
 $ residuals    : Named num [1:1704] -21.7 -21.8 -21.8 -21.4 -20.9 ...
  ..- attr(*, "names")= chr [1:1704] "1" "2" "3" "4" ...
 $ effects      : Named num [1:1704] -2455.1 232.2 -20.8 -20.5 -20.2 ...
  ..- attr(*, "names")= chr [1:1704] "(Intercept)" "year" "" "" ...
 $ rank         : int 2
 $ fitted.values: Named num [1:1704] 50.5 52.1 53.8 55.4 57 ...
  ..- attr(*, "names")= chr [1:1704] "1" "2" "3" "4" ...
 $ assign       : int [1:2] 0 1
 $ qr           :List of 5
  ..$ qr   : num [1:1704, 1:2] -41.2795 0.0242 0.0242 0.0242 0.0242 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:1704] "1" "2" "3" "4" ...
  .. .. ..$ : chr [1:2] "(Intercept)" "year"
  .. ..- attr(*, "assign")= int [1:2] 0 1
  ..$ qraux: num [1:2] 1.02 1.03
  ..$ pivot: int [1:2] 1 2
  ..$ tol  : num 1e-07
  ..$ rank : int 2
  ..- attr(*, "class")= chr "qr"
 $ df.residual  : int 1702
 $ xlevels      : Named list()
 $ call         : language lm(formula = lifeExp ~ year, data = gapminder)
 $ terms        :Classes 'terms', 'formula' length 3 lifeExp ~ year
  .. ..- attr(*, "variables")= language list(lifeExp, year)
  .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:2] "lifeExp" "year"
  .. .. .. ..$ : chr "year"
  .. ..- attr(*, "term.labels")= chr "year"
  .. ..- attr(*, "order")= int 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. ..- attr(*, "predvars")= language list(lifeExp, year)
  .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. ..- attr(*, "names")= chr [1:2] "lifeExp" "year"
 $ model        :'data.frame':  1704 obs. of  2 variables:
  ..$ lifeExp: num [1:1704] 28.8 30.3 32 34 36.1 ...
  ..$ year   : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
  ..- attr(*, "terms")=Classes 'terms', 'formula' length 3 lifeExp ~ year
  .. .. ..- attr(*, "variables")= language list(lifeExp, year)
  .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:2] "lifeExp" "year"
  .. .. .. .. ..$ : chr "year"
  .. .. ..- attr(*, "term.labels")= chr "year"
  .. .. ..- attr(*, "order")= int 1
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. .. ..- attr(*, "predvars")= language list(lifeExp, year)
  .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. .. ..- attr(*, "names")= chr [1:2] "lifeExp" "year"
 - attr(*, "class")= chr "lm"

There’s a great deal stored in nested lists! The structure function allows you to see all the data available, in this case, the data that was returned by the lm function.

For now, we can look at the summary:

summary(reg)

Call:
lm(formula = lifeExp ~ year, data = gapminder)

Residuals:
    Min      1Q  Median      3Q     Max 
-39.949  -9.651   1.697  10.335  22.158 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -585.65219   32.31396  -18.12   <2e-16 ***
year           0.32590    0.01632   19.96   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.63 on 1702 degrees of freedom
Multiple R-squared:  0.1898,    Adjusted R-squared:  0.1893 
F-statistic: 398.6 on 1 and 1702 DF,  p-value: < 2.2e-16

As you might expect, life expectancy has slowly been increasing over time, so we see a significant positive association!