Chapter 9 Results
9.1 Alignment
In order to use these data to build our phylogenies, we need to align each gene. This allows our tree-building software to compare species using nucleotides that we believe share ancestry. We will use the software MAFFT. Due to the nature of this server different programs are in different “environments” so you’ll need to deactivate the hybpiper environment and activate one for mafft.
If you are on your local computer skip this step - mafft should be included with the hybpiper installation and therefore in your current environment.
conda deactivate
conda activate /mnt/homes4celsrs/shared/envs/mafft
Now MAFFT is available to run.
However, before we align our data let’s take a look at the output.
Do you remember how to print out particular lines of a file?
Try printing all of the lines that start with '>' in a particular FNA file. Make sure to use '>' including the single quotes when searching. Otherwise, the shell interprets > as writing to a new file as we have seen previously.
You should notice that HybPiper has added some information to these lines so they include more than just the name of the sample. Because these additions differ across genes subsequent analyses may treat them as separate samples.
Before we do the alignment we need to remove this extra information.
Again, we can think about the pattern we are looking for: we want to go through each FNA file (this should suggest using a loop) and keep the first “word” in each line (think of a word as a set of characters not separated by spaces).
We use the cut command to divide each line of our file into columns.
We want to keep the first “field” and use a space (’ ’) as our delimiter.
Try this with a single FNA file and output to that same file with .fa added to the end of the file name.
Now use a loop to repeat this command for all FNA files.
Remember the list of items you are iterating through is all of your FNA files, you then indicate that file in the command using the variable name (with a $) and output to that same variable followed by .fa.
Now we can loop through each of these files and output an alignment. The basic mafft command looks like the following.
mafft --auto --thread 1 7577.FNA.fa > 7577.FNA.fa.fasta
Remember you have 350 genes (i.e. 350 fasta files) so you want to run these alignments in a loop.
As before, write a loop to go through each FNA.fa file and output an alignment - you can use the input name and add .fasta to the end to create the output name as in the above example.
9.2 View in R
The best way to view an alignment is in a specialized program, but because we have R easily available we’ll view an example here.
Make a new script to run this alignment view.
You should first load the ape and ggmsa library. (This uses the gg as in ggplot and msa, which stands for multiple sequence alignment.)
If you are working on your personal computer you need to do the following:
install.packages("BiocManager")
BiocManager::install("ggmsa")
- Read in a single fasta file with ape’s
read.FASTA. - Use the ggmsa command as follows, substituting the particular variable that contains your fasta file. Allow a little while for this to run. You may need to click the Zoom button above the plot to see it more clearly.
ggmsa(seqs_7577, start = 20, end = 120, char_width = 0.5, seq_name = T, color = "Chemistry_NT") +
geom_msaBar()
Now repeat this process for the same gene with the unaligned data. Can you see the difference?
9.3 Concatenated data analysis
There are multiple ways to analyze data. The first is to concatenate everything - we will use the python tool AMAS. Make sure you are in your Terminal.
conda deactivate
conda activate /mnt/homes4celsrs/shared/envs/amas
Run AMAS using the following command. You should be able to decipher the main AMAS python script, the concat command, the various file formats, and the input data.
python3 /mnt/homes4celsrs/shared/envs/amas/bin/AMAS.py concat -f fasta \
-d dna --out-format fasta --part-format raxml -i *FNA.fa.fasta \
-t concatenated.fasta -p partitions.txt
Note that if some data was not found for some genes you may need to delete files (use the rm command) to get AMAS to concatenate your data.
9.3.1 Building trees with IQTree
We will build our first tree using our complete concatenated dataset. The program IQtree infers phylogenetic trees by maximum likelihood. This approach starts with a tree and calculates the likelihood of the data on the tree (i.e. calculating the probability of each site fitting the tree given a model of substitution and multiplying them together). Another tree is then selected to determine whether the data are more likely to have occurred on this tree. This process is repeated.
If you are on the server switch environments:
conda deactivate
conda activate /mnt/homes4celsrs/shared/envs/iqtree
If you need to install IQTree go here.
The following is an example command for iqtree
iqtree2 -nt 2 -s concatenated.fasta -spp partitions.txt -pre iqtree_tree -B 1000 -m GTR+G
- We use the General Time Reversible (GTR) model to allow sites to change back and forth among different bases with particular probabilities. We also allow a Gamma (G) distribution of rates of substitution across sites. (-m)
- We allow partitioning of the data by gene so that different genes can evolve according to different models. (-spp)
- Additionally we have included bootstrapping in our analysis to get a measure of support for relationships. (-B)
9.3.2 Viewing trees
- Make a new RScript to view your tree in R.
- Load the libraries tidyverse and ape.
- Use the
read.treecommand and provide the path to the tree as the argument - Use the
plotcommand providing the tree variable as the argument, then add, show.node.label = TRUE
9.3.2.1 Rooting your tree
Our tree-building programs create trees that are unrooted because we do not know the direction of changes among species (e.g. for a single difference we are unable to say if an A mutated to a T or vice versa). Thus, we should properly view out trees as unrooted.
plot(tax1, "u", show.node.label = TRUE, cex = .5)
Note that I have added an argument to make the font size bit smaller so we can read all of the labels. You should adjust this value as needed and use the Zoom feature to better view your tree.
We have created taxon sets where one of the included species is known to be more distantly related.
We can use this to set the root for the tree.
You should view the list of taxa for your group in the file in the shared folder for this workshop.
The last taxon in your list is the outgroup.
You can root your tree as in the following example using the root command and the outgroup argument.
Note that your outgroup will be different if you are using a different taxon set.
Additionally you must use the tip label that you can see on your tree not the labels in this spreadsheet.
If you want to view the tip labels as a list use
tax1$tip.label
Now root your tree using the root relevant to your taxon set.
tax1_root <- root(tax1, outgroup = "A217_CKDN220062756-1A")
Now that you have rooted your tree you can plot this new tree as before.
9.3.2.2 Relabeling your tips with species information
There are a couple of ways to relabel the tips of your trees with correct species names. We will use a straightforward, manual approach. We will look at the list of tips and then manually replace them with a list of species names. You should keep in mind that this approach is prone to error and challenging for many taxa; however an automated approach is a bit more challenging to set up in this course.
- List the tip labels as you did in the previous step
- Make a list of new tip labels - it should look something like the following but with more and different species
sp_names <- c("Centropogon mandonis", "Siphocampylus andinus", "Centropogon mandonis")
- Assign these species names to the tip labels
tax1$tip.label <- sp_names
If you relabeled the tips of your unrooted tree you need to root the tree with the appropriate outgroup name. You can then plot this rooted and corrected tree.
9.3.3 Tree for all data using output of individual hybpiper runs
In the prior analysis we examined all of the data combined. For the next analysis we will estimate individual gene trees and then combine these into a species tree.
iqtree -s concatenated.fasta -S partitions.txt -pre iqtree.loci -nt 2
-S tells IQ-TREE to infer separate trees for each partition. The output files are the same, except that now your treefile will contain a set of gene trees.
For a more extensive tutorial on IQTree see http://www.iqtree.org/workshop/molevol2019
9.4 Species tree analyses
An alternative is to use a species tree approach. We will use the software ASTRAL. The following is an example command that takes the output of the iqtree gene trees and estimates a species tree.
java -jar /mnt/homes4celsrs/shared/ASTRAL/astral.5.7.8.jar -i iqtree.loci.treefile -o astral_output.tre 2>astral.log
- -i is the flag for the input file
- -o is the flag for the output file
- 2> saves the output log information
The output in is Newick format and gives:
the species tree topology
branch lengths in coalescent units (only for internal branches or for terminal branches if that species has multiple individuals)
branch supports measured as local posterior probabilities
View this tree as before and compare it to the other tree you estimated using IQTree for partitioned concatenated data.
More information on astral can be found at https://github.com/smirarab/ASTRAL
9.5 Support for relationships
In our initial IQTree analysis we obtained bootstrap support for each node. An alternative approach are gene and site concordance factors. For more information see http://www.robertlanfear.com/blog/files/concordance_factors.html . You can compute gCF and sCF for the tree inferred under the partition model:
iqtree -t iqtree_tree.treefile --gcf iqtree.loci.treefile -s concatenated.fasta --scf 100 -nt 2
-t specifies a tree
–gcf specifies the gene-trees file
–scf 100 to draw 100 random quartets when computing sCF.
Repeat for the astral tree if it differs.
View the tree (iqtree_tree.treefile.cf.tree) in R with these support values. The tree will show boostrap/gcf/scf on the nodes.
9.6 MrBayes
The next approach we will use for tree building is implemented in MrBayes.
We need to do some setup before we can run this software.
MrBayes requires Nexus format, with an added block giving instructions to MrBayes.
R has some tools we can use to convert our data to the right format and add some instructions.
Note that some of the following was adapted from https://gtpb.github.io/MEVR16/bayes/mb_example.html
- Make a new script for organizational purposes
- Load the
apeandtidyverselibraries - Load your data into R
myseqs <- read.dna("results/concatenated.fasta",format="fasta",as.matrix=FALSE)
We first save the data as Nexus format, and read back in to manipulate further.
write.nexus.data(as.character(myseqs),"results/concatenated.nex",interleaved=TRUE,gap="-",missing="N")
myseqs_nex <- readLines("results/concatenated.nex")
We need to fix the information about how missing is coded because AMAS uses ? for missing.
myseqs_nex <- gsub('\\?','n',myseqs_nex)
We crease our first execution block as plain text stored in a variable.
mbblock1 <- "
begin mrbayes;
set autoclose=yes;
"
We create a second execution block has information about partitions. We get this from the partitions file generated by AMAS but we have to do some reformatting.
- Read the file with
read.table - Get just the last column using the
pullfunction - Now you can convert this list to a string to be inserted into the output file
partition_string <- toString(partitions,sep = ", ")
Finally, put all the information necessary for MrBayes into a single string
mbblock2 <- paste0(" partition favored = ",nrow(partition_file),":",partition_list,";")
Add a block for the MCMC parameters.
mbblock2 <- "
mcmc ngen=10000000 nruns=2 nchains=2 samplefreq=1000;
sump;
sumt;
end;
"
We then paste the blocks together and write to a file.
myseqs_nexus_withmb <- paste(paste(myseqs_nex,collapse="\n"),mbblock1,mbblock2,sep="")
write(myseqs_nexus_withmb,file="concatenated.nex.mb")
Now run (on the command line) in background as you did previously. This analysis may take several hours.
../../shared/bin/mb "results/concatenated.nex.mb"
Before looking at your tree check the output. In some cases you may need to run your analysis for longer and a note to this effect will be toward the end of the output. Make sure to give the log file for this run a relevant name. You don’t want to overwrite the output from HybPiper and you might want to keep this output if you do a longer run. Additionally, if you rerun the analysis with additional time make sure you don’t overwrite the current tree output.
You can find your tree in your results folder with the extension .con.tre.
Because Mr. Bayes generates output in a particular format you will need to read it in slightly differently and convert it to the standard phylo format.
First read the tree using the treeio package. Note that rather than load the package using library, here the command is specified as being part of the package directly.
mrbayes_tree <- treeio::read.mrbayes("results/concatenated.nex.mb.con.tre")
Now we convert the tree to phylo format used by ape and other packages.
Format conversion is a common problem in bioinformatics.
mrbayes_phylo <- treeio::as.phylo(mrbayes_tree)
If you click on your mrbayes_phylo variable in your environment you can see how the data are structured including the branch (edge) lengths and tip labels.
However, when we do the tree conversion we lose the node support information.
If you click on your mrbayes_tree variable you can see this information is stored
in the data variable.
We obtain this information by converting our tree to a “data frame” (i.e. the sort of rectangular data we often view in R). In this case we will use the tidyverse version of a dataframe called a tibble.
as_tibble(mrbayes_tree)
The information in the tree will print to the console and you can see it as a dataframe. To obtain the prob_percent column, we “pull” this out of the dataframe as follows.
probs <- as_tibble(mrbayes_tree) %>% pull(prob_percent)
We now assign this information to our mrbayes_phylo$node.label.
However, before we do this, print the contents of the variable to the console.
Note the number of items in this list.
Some of these are due to support values assigned to the tips rather than the nodes.
Of course all tips are supported to 100%.
We need to subset the list of supports to remove the first items related to tips (i.e. as many items as there are tips).
First we figure out how many tips there are. You could count them but we would like to make this general to all datasets and so putting in a single number would require changes when the dataset is changed.
num_tips <- length(mrbayes_phylo$tip.label)
Now we can get the items in probs from the first node value to the end of the vector.
node_probs <- probs[(num_tips+1):length(probs)]
Finally we can assign these values to the node labels in our tree.
mrbayes_phylo$node.label <- node_probs
Now
- Reroot your tree
- Fix the names to be species
- Plot your tree