The R Console and other interactive tools like RStudio are great for prototyping code and exploring data, but sooner or later we will want to use our program in a pipeline or run it in a shell script to process thousands of data files. In order to do that, we need to make our programs work like other Unix command-line tools.

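For example, by the end of this section we will be able to run an analysis script on a data file directly from the shell, like this: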

Rscript gap_analysis.R gapminder.csv

Command-Line Arguments

Using the text editor of your choice, save the following line of code in a text file called session-info.R:

sessionInfo()

The function sessionInfo outputs the version of R you are running, the type of computer you are using, and the versions of the packages that have been loaded. This is very useful information to include when asking others for help with your R code.

Now we can run the code in the file we created from the Unix Shell using Rscript:

Rscript session-info.R

If that did not work, check that your shell's current working directory is the one that contains session-info.R.

Now let’s create another script that does something more interesting. Write the following lines in a file named print-args.R:

args <- commandArgs()
cat(args, sep = "\n")

The function commandArgs extracts all the command-line arguments and returns them as a character vector. The function cat, similar to the cat command of the Unix Shell, outputs the contents of the variable. Since we did not specify a filename for writing, cat sends the output to standard output, which we can then pipe to other Unix commands. Because we set the argument sep to "\n", the newline character, each element of the vector is printed on its own line. Let’s see what happens when we run this program in the Unix Shell:
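To see the effect of sep without leaving R, here is a minimal sketch. The vector example_args is a made-up stand-in for what commandArgs might return, not real output, and capture.output is used only so the printed lines can be inspected as a vector.

```r
# Hypothetical stand-in for a vector of command-line arguments
example_args <- c("first", "second", "third")

# With the default separator (a single space), everything lands on one line
one_line <- capture.output(cat(example_args))

# With sep = "\n", each element is printed on its own line
per_line <- capture.output(cat(example_args, sep = "\n"))

length(one_line)  # number of lines printed with the default separator
length(per_line)  # number of lines printed with sep = "\n"
```

The first call produces a single line and the second produces one line per element, which is the form that is easiest to pipe into line-oriented Unix commands.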

Rscript print-args.R

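From this output, we learn that Rscript is just a convenience command for running R scripts. The first element of the vector is the path to the R executable. The elements that follow are command-line arguments that affect the behavior of R. According to the R help, --slave makes R run as quietly as possible (R 4.0.0 and later call this option --no-echo), --no-restore tells R not to restore anything saved from a previous session, --file names the script to run, and --args tells R to skip the rest of the command line so that whatever follows is left for the script.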

Thus running a file with Rscript is an easier way to run the following:

R --slave --no-restore --file=print-args.R --args

If we run it with a few arguments, however:

Rscript print-args.R first second third

commandArgs adds each of those arguments to the vector it returns. Since the first elements of the vector are always the same, we can tell commandArgs to only return the arguments that come after --args. Let’s update print-args.R and save it as print-args-trailing.R:

args <- commandArgs(trailingOnly = TRUE)
cat(args, sep = "\n")

And then run Rscript print-args-trailing.R first second third from the Unix Shell.

Now commandArgs returns only the arguments that we listed after print-args-trailing.R.
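The idea behind trailingOnly can be sketched on an ordinary character vector. The helper trailing_only and the vector full_args below are hypothetical names invented for illustration, and the executable path is made up; the sketch only assumes that the trailing arguments are the ones that come after the first --args marker.

```r
# Hypothetical helper that mimics commandArgs(trailingOnly = TRUE) on a plain vector
trailing_only <- function(all_args) {
  # Position of the first "--args" marker, or 0 if there is none
  marker <- match("--args", all_args, nomatch = 0L)
  if (marker == 0L) {
    return(character(0))  # no marker, so there are no trailing arguments
  }
  # Drop everything up to and including the marker
  all_args[-seq_len(marker)]
}

# Made-up example of a full argument vector; the path is illustrative only
full_args <- c("/usr/lib/R/bin/exec/R", "--slave", "--no-restore",
               "--file=print-args-trailing.R", "--args",
               "first", "second", "third")

trailing <- trailing_only(full_args)
cat(trailing, sep = "\n")
```

In a real script there is no need for such a helper, since commandArgs(trailingOnly = TRUE) already does this; the sketch is only meant to show which part of the vector is kept.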

With this in hand, let’s build a version of analysis.R that makes a plot for publication. The first step is to write a function for plotting.

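library(ggplot2)

# Scatter plot of life expectancy against year with a fitted linear trend
plot_gap <- function(dat) {
    p <- ggplot(data = dat, aes(x = year, y = lifeExp)) +
      geom_point() + geom_smooth(method = "lm")
    p  # return the plot object explicitly
}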

The second step is to load the data and call our plotting function.

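# Read the data file named by the first command-line argument
args <- commandArgs(trailingOnly = TRUE)
filename <- args[1]
dat <- read.csv(file = filename, header = TRUE)
p <- plot_gap(dat)

# Write the plot to a 6 x 4 inch PDF file
pdf(file = "gapplot.pdf", height = 4, width = 6)
print(p)
dev.off()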

Now we can run this script on our gapminder data (or any other similar dataset).

Rscript analysis.R gapminder.csv
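If the script is run without a filename, args[1] is NA and read.csv fails with a message that does not point at the real problem. One possible way to guard against this is sketched below. The helper get_input_file and the wording of its messages are assumptions made for illustration and are not part of the script above.

```r
# Hypothetical helper: return the first trailing argument,
# or stop with a readable message if it is missing or not a file
get_input_file <- function(args) {
  if (length(args) < 1) {
    stop("usage: Rscript analysis.R <data-file.csv>", call. = FALSE)
  }
  filename <- args[1]
  if (!file.exists(filename)) {
    stop("cannot find input file: ", filename, call. = FALSE)
  }
  filename
}

# In the script, this could replace the plain args[1] lookup:
# filename <- get_input_file(commandArgs(trailingOnly = TRUE))
```

Because stop is called in a non-interactive Rscript session, the script halts with the message on standard error and a non-zero exit status, which is what a surrounding shell script or pipeline would typically check.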