Chapter 8 Target capture data

8.1 Background

Much of the information in this chapter is drawn from the HybPiper tutorial, which can be found at https://github.com/mossmatters/HybPiper/wiki/Tutorial

First make a folder in your home directory to hold your outputs for this component of the project.

We have pre-installed the program HybPiper on the server, which assembles target sequences from short high-throughput DNA sequencing reads. By assembling each target sequence for each species we will be able to produce a dataset that allows us to understand the evolutionary history of the species.

To access HybPiper run

conda activate /mnt/homes4celsrs/shared/envs/hybpiper

This activates a conda environment (don’t worry about this!).

If you need to obtain Hybpiper on your personal computer:

Download Miniconda for Windows or Mac and install following the instructions
Follow the instructions for bioconda and hybpiper installation: https://github.com/mossmatters/HybPiper/wiki/Installation
Use the data you downloaded previously

In this case, our target sequences are either the Angiosperm 353 standard targets or a set of target data designed specifically for these taxa. One goal of this workshop is to compare the results from different target sets. Target capture is a standard technique for sequencing many known loci simultaneously using a library of sequences. We have provided you with low-coverage sequence data from using these target sequences as probes (your fastq files). HybPiper first assigns each read to a particular target using alignment software (BWA). These reads are then assembled using a standard genome assembler (Spades). HybPiper outputs a fasta file of the (in frame) CDS portion of each sample for each target region.

8.2 Running HybPiper to assemble genes

The basic command for HybPiper is as follows:

hybpiper assemble -t_dna targets.fasta -r species1_R*_test.fastq --prefix species1 --bwa --cpu 3

Notice the commands and flags. Move to the results directory in your workshop directory to run the following commands. You should run this command, with the following changes:

The argument for the t_dna flag is the relevant fasta target file in the shared workshop folder for your dataset - you will need to indicate a full relative path not just the file name (e.g. ../../shared/AndesWorkshop2023/Campanulaceae_353.fasta). If you are using SISRS data your target file is specific to your taxon set; if you are using 353 data use the Campanulaceae file. If you are working on your personal computer download this target file and use the correct path to the file.
The argument for the the r flag is the filename(s) including paths for the fastq files of interest. Note the * indicates that we will use both pairs of the sequencing read files. Because you do not have a file named species1 you will need to change this name.
The prefix should be indicative of the species name
Note that we are sharing a server so we are specifying 3 cpu’s per group. If you are running on your own machine you can omit this flag and HybPiper will use all available resources. With limited shared resources this run may take a few minutes.

To view your results, list the contents of the folder and folders inside this. You should see a folder for each gene. Within each folder you can see several fasta files, which you can view.

To automatically run all samples consecutively you will need to loop through them. Go back and look at the prior example of loops. One approach is to make a list of all the names of the species in a file and then read that and use them one at a time.

First we need to make the list. We want to automate this process because in some cases you could have a lot of samples. Additionally, copying and pasting names into a list is prone to error. Additionally, once you automate this process you could repeat it easily and quickly for other datasets. For starters make a list of just the R1 files (you don’t want R2 files because they have the same name so you would have duplicates). First cd back to your data folder. Output your list to a file in your results folder named namelist.txt by “redirecting” the output using >.

ls *R1* > ~/evolution_workshop/results/namelist.txt

Check that this worked using cat or by viewing the file and counting the number of lines. We will need to adjust this output to remove the R1 label so that we can use wildcards to list both files together. One approach is to notice a pattern in filenames: the files start with a sample identifier followed by an underscore followed by another identifier. Thus, we can “cut off” all of the name after the second underscore and still have a unique name for our sample. We use the “cut” command to make our data into columns using _ (specify a delimiter with -d), then we select the first two of these (use the field flag -f).

ls *R1* | cut -d '_' -f 1,2 > ~/evolution_workshop/results/namelist.txt

Note that if you are on your personal computer your output path will be different

Check this worked. Now we can run HybPiper on all of our data sequentially. In the following example, the species names are entered in the namelist.txt file. The loop will iterate through them one at a time and run HybPiper.

while read name
do 
    [insert command here]
done < [path to]/namelist.txt

This while loop looks a lot like the for loop we saw previously, except that now we are reading in the file (it’s at the end for the command) and going through it line by line. name here is the variable you will use. In other cases I might start off with while read line. The while and read are commands but name or line is the information read in from the file that you use in the command (don’t forget to use $name or $line in your command).

Make sure your command is reading the correct files each time through the loop. It is helpful to put echo in front of your command to print rather than run the command to check that it works.

This analysis will take a couple of hours. Check with an instructor before running this command to ensure you have done everything correctly. Additionally, if you are on the server it’s advisable to run this command in the background. That means the command will run but you will continue to have access to your prompt so you can work. Additionally you can log out of the server (eg for lunch) and the command will continue to run. To run the command in background enclose the entire command (from before while to after txt) in ( ) and then add &> output.log & at the end to write the information that would normally print to the screen (as you saw when running a single command) to the file output.log. You can view this file (less or tail) to watch the progress of your command. The following is an example to see how running in background works. Assume you want to run the following example command (if you try this you need to create examplefile.txt first or use another short file).

while read line
do
  echo $line
  sleep 5
done < examplefile.txt

You can run this command in background as

(while read line
do
  echo $line
  sleep 5
done < examplefile.txt ) &> output.log &

If you cat output.log you will see information being printed to this file (rather than to the screen as was done when not running in background). The & runs the command in background while the &> sends what would have been printed to the screen to the designated file.

8.3 Hybpiper outputs

To obtain some information about the results of your HybPiper run use the hybpiper stats command as follows.

hybpiper stats -t_dna [target file] gene [path to]/namelist.txt

Hybpiper stats will generate two files: seq_lengths.tsv and hybpiper_stats.tsv. The first line of seq_lengths.tsv has the names of each gene. The second line has the length of the target gene, averaged over each “source” for that gene. The rest of the lines are the length of the sequence recovered by HybPiper for each gene. If there was no sequence for a gene, a 0 is entered.

8.3.1 Viewing information

While you can cat or head this file it is difficult to view this information on the command line. However, R is excellent for reading data in this format.

Select File - New File - R script
Save this file in the scripts folder
Load the tidyverse library

We will use the read.table command because our data are plain text separated by tabs (not commas as for csv files). We have three arguments: the filename (including path), how the columns are separated, and whether the file has a header line.

Note: R assumes you are in your project folder so all paths should be relative to this.

stats <- read.table("results/hybpiper_stats.tsv", sep = "\t", header = TRUE)

View your stats data.

Read and view seq_lengths.tsv.

Given your observation of the output tables, you may be interested in some of the following questions:

What are the min, max, and average number of genes retrieved across samples?
What is the range of lengths retrieved for a given gene?

You should take a look at the data and imagine how you would calculate this. Unfortunately,this process (e.g. calculating the min value of each column individually) could be challenging to communicate to the computer effectively. Furthermore, when we look at data in this “rectangular” form, we often want to ensure that our data are “tidy”. Tidy data usually has one sample per row. Currently our data have many observations per gene per row. If we were to look at the lengths file and make a new row that includes our calculation of the standard deviation of the gene lengths this would be a row with summary information not sample information.

Instead of working with these data frames directly we are going to take a look at an example that I have made for you that includes stats for just two samples in “long form.”

Read the file stats_example.csv in the shared workshop folder using the read_csv command.
Take a look at how these data are organized. Can you see how the data are organized that each row contains one piece of information?
Plot value v. stat as a scatter plot.

We can add a couple of tweaks to make our view better. First we can specify that we want each sample to be a different color by including a color = Name argument in our aes. Second we can rotate the graph 90 degrees to better view the data and labels by adding a layer coord_flip().

Now that you can see this example data better you can see it is trivial with only two samples with low information. However, you can imagine that we would like to have our real data in this format.

In order to work with your stats table you need to know how to produce a long format table. This is tricky and will take some practice so don’t worry if it seems complicated initially. We will use the pivot_longer function. If you haven’t noticed already, when you type a function and hit the tab button on your keyboard you will see a list of arguments. The first argument you need is the data. The second argument is the list of columns that will not be in the long table and currently contain the observation data. For us this is all the columns from NumReads to GenesWithChimeraWarning. Now we need to envision our new table. Go back and look at the example table for some help here. I specified a table with a column to indicate the particular stat we are measuring and a second column for the actual value observed. This command, with these four arguments looks like the following:

stats_long <- pivot_longer(stats, cols = NumReads:GenesWithChimeraWarning, names_to = 'stat', values_to = 'value')

We can also generate a summary table with one row per stat and information in that row including the mean, standard deviation, etc. across genes.

The first component of this is to develop groups of rows by stat. I like to imagine this as drawing boxes around all rows that contain a particular stat. We use the group_by function to communicate to the computer these “boxes”.

group_by(stats_long, stat)

If you run this command the data won’t appear to be any different. Now we need to generate a summary table where each group is collapsed into a single row in our new table containing the stat, its average, and standard deviation.

mean_stats <- group_by(stats_long, stat) %>% summarize(mean_across_species = mean(value),
                                                       sd = sd(value))

Note that here we use %>% as the pipe command to send the output of one command to the next. You have send the shell pipe | and this works the same way. In our summarize command we provide new column labels and what the contents of these columns will be.

Click on your mean_stats variable to view your summary table.

Now try to work with the lengths table. First make a long format table. You want to output a table with the columns Species, gene, and length.

Did you notice that the mean lengths are already included in this table? That’s really useful but also these data are not information about each sample and if we want to make summary tables and graphs we don’t want to include them.

We can use the filter command to only include data that does not (!= is not equal to) specify MeanLength in the Species column.

lengths_long_filtered <- filter(lengths_long, Species != "MeanLength")

Now can you filter this output to include only data where the value in the length column is greater than 0.

Now you should be able to make a summary table including the the mean length per gene, standard deviation of length per gene, and count of genes. Note that we did our second filtering step so these values would be accurate. Additionally, the counting function is n() (no arguments required).

Save your script!

8.3.2 Obtain gene data

At this point, we will go back to our Terminal. Remind yourself of your folders and data and note each sequence is in its own folder. We can use HybPiper to fetch the sequences recovered for the same gene from multiple samples and generate a file for each gene. Use the following command:

hybpiper retrieve_sequences dna -t_dna [target file]  --sample_names [path to]/namelist.txt

You should now see one file per gene in your results folder. Each file is fasta formatted with the data for each available species. In the next section we will create an alignment of all species for each gene.

Note that HybPiper has additional features we will not use in this workshop due to a lack of time.

8.4 Visual inspection of data

To check that your genes assembled correctly and are likely the correct species you may use BLAST, which is found at https://blast.ncbi.nlm.nih.gov/Blast.cgi

Select Nucleotide BLAST
Paste one of the contigs from one of the genes from one of the species
Select Somewhat similar sequences (blastn)
BLAST
Examine the species and the match