Chapter 6 Getting your data

For this workshop we will be using genome sequence data from a group of Andean plants. For starters we will take a peak at these data so you know what it looks like before we use standard programs to analyze it.

The sequence data is in a shared folder on this server. Relative to your workshop folder you can find the data at ../../shared/AndesWorkshop2023/.

If you are not able to access the server click here to download the data. If the text shows up in your web browser select File - Save Page As and rename the file to example.fq in the Save As bar before saving. Save the file to your workshop folder.

If you cd to the the workshop directory and list the contents you can see folders containing files of data. If you are getting errors when you cd you are probably not in your project folder. You may need to try cd ../shared/AndesWorkshop2023/ or cd ../../../shared/AndesWorkshop2023/ depending on where you are. Remember to check with pwd.

If you are working on your personal computer you should be able to see the folder in your directory by listing files.

If you are on the server, sequence files for samples can be found in different folders labeled by taxon set. Start by going to the folder for your assigned taxon set and listing its contents. You should see two folders. Go into the one for your assigned dataset.

You generally do not want to view sequence data files because they can be large and are rather hard to read. If you want to see how large use ls -lh. If you are on your personal computer you can still do this but you will see just the example file, which isn’t too large. You are already familiar with the ls command for listing the contents of folders. The dash followed by additional letters are called flags. The l flag indicates that we are listing the contents in long form to provide additional information. The h flag means we’ll view the data in human readable format (i.e. with size given in Kb or Mb).

Raw data is in the form of zipped fastq files, which are generated by a sequencer. You can tell this by the very last bit of the name of the files, which is .fastq.gz or .fq.gz (just like you might see Word files labeled as .docx). Just as a note, you often need to do some cleanup of the data that comes of the sequencer so you only have high quality data. We have done this step for you already. Take a look at a little bit of one file. The less command lets you scroll through the (very large) file. zless is the less command that works on zipped files.

If you are on the server use the following command replacing [] with a specific filename.

zless [].fastq.gz

If you are working with the downloaded example file use less.

less example.fq

Fastq files contain sets of four lines. The first line is the name of the sequence. The second line is the sequence. The fourth line is the quality of the sequence. As you scroll through you should see many sets of four (one for each sequence).

Once you are done scrolling use q (for quit) to get back to your prompt.

Before we look at our large files we are going to practice on a small example file. If you are on the server move back to the main AndesWorkshop2023 folder. List the contents and notice the example.fq file. For this example I have unzipped it. You can view the entire contents of the file using the command cat.

cat example.fq

Alternatively you can use the command head to view the first few lines of the file.

head example.fq

We can also get a sense of the amount of data by counting the number of lines using the word count command wc.

wc example.fq

This command tells you the number of lines, words, and characters. If you want to focus just on the number of lines you should use the l flag.

wc -l example.fq

If we only want to get the names of the sequences we can search for lines that match a particular pattern and print just those. We search with the command grep. The pattern we are searching for is lines that start with @. And then we provide the name of the file.

grep '^@' example.fq

Because this is a small example file we are able to print all the header lines without it being overwhelming. However, with our larger files we don’t want to do this. However, we know that the head command allows us to view just the first few lines of our output. Now we’ll combine those two commands using a pipe (|). First we enter the grep command (with it’s two arguments) then we “pipe” the output to the head command rather than printing it to the screen.

grep '^@' example.fq |  head

Notice that the head command does not have arguments in this case because it accepts the output from the grep command.

The head command accepts a flag with the number of lines you want to print. For example head -4 will print the first four lines.

Your first challenge in this course is to combine the information that you have learned to get the number of header lines in one of your zipped fastq files (using zgrep). If you are on your personal computer you can practice with the example file.

6.1 Repeating analyses

If you are working on your personal computer download these files into a folder labeled TaxonSet5_SISRSreads then cd into this folder.

One approach to analyzing multiple files in the same way is to copy your code and edit each copy to include the name of a particular file. That might be feasible (if slow) for the 10 files you are working with here, but it is inconvenient for the many files you might work with on a larger genomics project. Instead, we can make a list of the files we want to work with and tell the computer to repeat the analysis on each file.

In the prior challenge you figured out what analysis you want to do (list the number of header lines). Now we need to learn how to repeat this. In most programming languages, including our command line, we use a loop to go through a list and repeat that command. On the command line the format to do this looks like

for thing in list_of_things
do
    command $thing
done

We start with for to indicate that we are about to make a loop. Each time the loop runs (called an iteration), an item in the list is assigned in sequence to the variable thing, and the commands inside the loop are executed. Think of a variable as a box that holds the item we are working with. For example, if we are working with a list of files then the first time through the loop thing is the first file name in the list, the second time through thing is the second name, and so on. This allows us to repeat our command for each item in the list using the same variable name. Inside the loop, we indicate the variable by putting $ in front of it. The $ tells the shell interpreter to treat the variable as a variable name and substitute its value in its place, rather than treat it as text or an external command.

Start by figuring out what goes on the third line of the loop
Now figure out what goes on the first line (i.e. your list of files written out)
When you run your loop it should print the number of header lines for each file

We might want to add one command to this loop to print out the name of the file prior to printing the number of header lines so we can more easily keep track of each file’s information. The echo command will print whatever comes after it. Try inserting this command into your loop in order to print the name of the file following by the number of header lines.

Did you need to list all your files by hand? Copying and pasting a lot of names can be time consuming and error prone. Logically, you want to make your list following a particular rule, for example all of the files in this folder. We can use a wildcard (denoted *) to do this. So instead of writing out all the file names, replace all of that just with * as your list. In other situations keep in mind that we might want to be more precise. * really suggests any number of any character. To indicate all files that end in fastq.gz we could use *fastq.gz as our list.