Programming with R

Command-Line Programs

Learning Objectives

Use the values of command-line arguments in a program.
Handle flags and files separately in a command-line program.
Read data from standard input in a program so that it can be used in a pipeline.

The R Console and other interactive tools like RStudio are great for prototyping code and exploring data, but sooner or later we will want to use our program in a pipeline or run it in a shell script to process thousands of data files. To do that, we need to make our programs work like other Unix command-line tools. For example, we may want a script that runs the same analyses and makes the same plots that we produced earlier for the gapminder dataset, but for any dataset that we specify on the command line:

Rscript analysis.R gapminder-FiveYearData.csv

Command-Line Arguments

Using the text editor of your choice, save the following line of code in a text file called session-info.R:

sessionInfo()

The function sessionInfo reports the version of R you are running, the platform and operating system you are using, and the versions of any packages that have been loaded. This is very useful information to include when asking others for help with your R code.

Now we can run the code in the file we created from the Unix Shell using Rscript:

Rscript session-info.R

R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  base

Now let’s create another script that does something more interesting. Write the following lines in a file named print-args.R:

args <- commandArgs()
cat(args, sep = "\n")

The function commandArgs extracts all the command-line arguments and returns them as a character vector. The function cat, like the cat command of the Unix Shell, writes out the contents of the variable. Since we did not specify a file to write to, cat sends its output to standard output, which we can then pipe to other Unix commands. Because we set the argument sep to "\n", the newline character, each element of the vector is printed on its own line. Let’s see what happens when we run this program in the Unix Shell:
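As a quick illustration of how sep changes what cat writes, here is a minimal sketch that can be pasted into the R Console. The vector example_args is a made-up stand-in for real command-line arguments, not something R provides.

```r
# A made-up character vector standing in for command-line arguments.
example_args <- c("first", "second", "third")

# With the default separator (a single space), all elements share one line.
# cat does not add a trailing newline, so we append one explicitly.
cat(example_args, "\n")

# With sep = "\n", each element is written on its own line.
cat(example_args, sep = "\n")
```

The second call is the form used in print-args.R: one element per line is easy to feed into line-oriented Unix tools.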

Rscript print-args.R

/usr/lib/R/bin/exec/R
--slave
--no-restore
--file=print-args.R

From this output, we learn that Rscript is just a convenience command for running R scripts. The first element of the vector is the path to the R executable. The remaining elements are command-line arguments that affect the behavior of R. Paraphrasing the output of R --help:

--slave: Make R run as quietly as possible (this flag was renamed --no-echo in R 4.0.0)
--no-restore: Don’t restore anything saved from a previous R session, such as the workspace or history
--file: Run this file
--args: Pass the arguments that follow to the file being run rather than to R itself

Thus running a file with Rscript is an easier way to run the following:

R --slave --no-restore --file=print-args.R --args

/usr/lib/R/bin/exec/R
--slave
--no-restore
--file=print-args.R
--args

If we run it with a few arguments, however:

Rscript print-args.R first second third

/usr/lib/R/bin/exec/R
--slave
--no-restore
--file=print-args.R
--args
first
second
third

then commandArgs adds each of those arguments to the vector it returns. Since the first elements of the vector are always the same, we can tell commandArgs to only return the arguments that come after --args. Let’s update print-args.R and save it as print-args-trailing.R:

args <- commandArgs(trailingOnly = TRUE)
cat(args, sep = "\n")

And then run print-args-trailing.R from the Unix Shell:

Rscript print-args-trailing.R first second third

first
second
third

Now commandArgs returns only the arguments that we listed after print-args-trailing.R.
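To make the role of --args concrete, the following sketch reproduces the same filtering by hand on a vector copied from the output above. The helpers trailing_args and script_name are hypothetical names introduced here for illustration; in a real script, commandArgs(trailingOnly = TRUE) is the simpler choice.

```r
# Hypothetical helper: keep only the elements that follow "--args".
trailing_args <- function(full_args) {
  marker <- match("--args", full_args)  # index of the first "--args", or NA
  if (is.na(marker) || marker == length(full_args)) {
    return(character(0))                # no user-supplied arguments
  }
  full_args[(marker + 1):length(full_args)]
}

# Hypothetical helper: recover the script name from the "--file=" element.
script_name <- function(full_args) {
  file_arg <- grep("^--file=", full_args, value = TRUE)
  sub("^--file=", "", file_arg)
}

# The full vector, as printed by print-args.R in the earlier example.
full_args <- c("/usr/lib/R/bin/exec/R", "--slave", "--no-restore",
               "--file=print-args.R", "--args", "first", "second", "third")

print(trailing_args(full_args))
print(script_name(full_args))
```

Parsing the "--file=" element this way is one common approach to finding the name of the running script, which can be handy for usage messages.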

With this in hand, let’s build a version of analysis.R that makes a plot for publication. The first step is to write functions for plotting.

library(ggplot2)
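plot_gap <- function(dat) {
  # Inside a function, a ggplot object is drawn only when printed explicitly.
  # Plot 1: GDP per capita against life expectancy on the raw scale.
  print(ggplot(data = dat, aes(x = lifeExp, y = gdpPercap)) + geom_point())
  # Plot 2: the same data with a log10 y axis and a linear-model trend line.
  print(ggplot(data = dat, aes(x = lifeExp, y = gdpPercap)) + geom_point() +
    scale_y_log10() + geom_smooth(method = "lm"))
}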

The second step is to add a function that loads the data and calls our plotting function.

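main <- function() {
  # Keep only the arguments supplied after the script name.
  args <- commandArgs(trailingOnly = TRUE)
  # The first argument is the path to the CSV file to analyze.
  filename <- args[1]
  dat <- read.csv(file = filename, header = TRUE)
  plot_gap(dat)
}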
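If the script is run without a filename, args[1] is NA and read.csv fails with an error that does not say what went wrong from the user's point of view. One possible way to guard against this is sketched below. The helper get_filename and the wording of its messages are assumptions made for illustration, not part of the lesson's analysis.R.

```r
# Hypothetical helper: return the first argument, or stop with a clear message.
get_filename <- function(args) {
  if (length(args) < 1) {
    stop("usage: Rscript analysis.R <data-file.csv>", call. = FALSE)
  }
  filename <- args[1]
  if (!file.exists(filename)) {
    stop("cannot find file: ", filename, call. = FALSE)
  }
  filename
}

# Small demonstration using a temporary file so the sketch runs anywhere.
demo_file <- tempfile(fileext = ".csv")
writeLines(c("x,y", "1,2"), demo_file)
print(get_filename(c(demo_file, "ignored")) == demo_file)
```

Inside main, the line filename <- args[1] could then become filename <- get_filename(args).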

Now we need to add a call to our main function.

main()

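Now we can run this script on our gapminder data (or any other dataset with the same columns). Because Rscript runs non-interactively, the plot output is written to a file named Rplots.pdf in the current working directory.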

Rscript analysis.R gapminder-FiveYearData.csv
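Since the goal stated at the start is to process many data files, a natural next step is to accept several filenames and derive one output name per input. The sketch below only shows the bookkeeping. The helpers output_name and plan_outputs are hypothetical, the second filename is made up, and the assumption that inputs end in .csv is ours.

```r
# Hypothetical helper: map "data/other.csv" to "other.pdf".
# Names that do not end in ".csv" are returned unchanged (minus the directory).
output_name <- function(filename) {
  sub("\\.csv$", ".pdf", basename(filename))
}

# Report what would be produced for each file named on the command line.
# In a real script, the loop body would read the file and plot it instead.
plan_outputs <- function(filenames) {
  for (f in filenames) {
    cat(f, "->", output_name(f), "\n")
  }
}

plan_outputs(c("gapminder-FiveYearData.csv", "data/other.csv"))
```

In a script, filenames would come from commandArgs(trailingOnly = TRUE), so the shell can expand a wildcard such as *.csv into the full list.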