Chapter 15 Discrete variable modeling
In this section, we will work on modeling a continuous trait and understanding what Brownian motion is.
- In addition to the libraries loaded in the previous chapters, load
expmandcorHMM. - Ensure that the tree and dataset from the prior chapters are loaded.
15.1 Plotting discrete data on a phylogeny
There are multiple ways to plot discrete data on a phylogeny.
We will use phytools, but there are also tools in ape and ggtree.
Using phytools we will plot a tree with tip data as colors based on hummingbird pollination.
plotTree(pole_tree,ftype="i",offset=0.6,fsize=0.3)If you look at pole_dataset you see that P1 (the category of whether butterflies visit or not)
is either 1 or 0. To the researcher this is like true/false, but the computer considers this vector to be numeric. As we did in a prior chapter (where we converted character data to numeric), we want to let the computer know this is categorical (factor) data.
aux <- as.factor(pole_dataset$P1)
aux <- setNames(aux, pole_tree$tip.label)
Now we need to create a matrix for the pie chart.
discrete.matrix <- to.matrix(aux,levels(aux))
And finally we add these pies to the tips of the tree. You may need to zoom in to see the labels on the tips of the tree.
tiplabels(pie=discrete.matrix,piecol=c("blue","red"),cex=0.1)
What conclusions do you draw about the evolution of polination based on this tree?
For clarity add a legend to your figure.
legend("topleft",levels(aux),pch=21,pt.bg=c("blue","red"),
pt.cex=2.2)
Notice that the function tiplabels() can draw pies, meaning it can draw more than one state for a tip.
15.2 Discrete character evolution
For models of discrete character evolution we will use a mathematical model called Continuous-time Markov chains (CTMC). CTMCs are stochastic (random) processes that allow us to follow the evolution of a trait(s) of interest over the branches of the tree. In this example we will focus on the evolution of hummingbird pollination.
CTMCs are usually denoted in mathematical notation as \[\{X(t), t\geq 0 \}\]; the stochastic process \[X(t)\] follows the pollination throughout time \[t\]. Time is continuous and it is measured using branch length of our phylogeny. For discrete traits the process \[X(t)\] takes values in the natural numbers \[0,1,2,...\]. CTMCs have many (cool) mathematical properties, but one of the most important property is the Markovian property that states that the probability distribution of future states of the process conditional on both past and present states depends only upon the present state.
Mathematically, CTMC can is defined through a circle and arrow diagram and that gets converted into a matrix that we called the Q-matrix.
15.2.1 What is a Q-matrix?
In mathematics, the Q-matrix is the derivative of the probability matrix (\[P'(t)=P(t)Q\]). The elements outside of the diagonal of the \[Q\] matrix are transition rates and in the diagonal we have the negative sum of all the elements of the row. The row adds to zero this is because probability matrices rows add to 1 so the derivative of the constant 1 is zero.
Let’s stop here to think through this problem by drawing it.
15.3 Modeling using CTMCs
In phylogenetic comparative methods we have different types of CTMCs. Examples of these models are:
- Evolution of tempo of a continuous trait (past lesson) -> BM, OU, EB, Levy, implemented in
geiger,phylolm,bayouR packages. - Evolution of the tempo between states-> Markov models (Mkn) implemented in
phytools,corHMM - Evolution of the tempo between states considering other potential sources promoting evolution-> Hidden state Markov models (HMM) implemented in
corHMM - Phylogeographic models that include dispersal or extirpation and cladogenesis -> DEC, DECj, DiVA implemented in
Biogeobears - Diversification (speciation and extinction) that changes over time -> Birth and death models implemented in
castor - Diversification (speciation and extinction) that changes based on clades or density -> Birth and death models implemented in
BAMM,CLADS - Diversification (speciation and extinction) that changes based on state values-> State-dependent diversification models for example BiSSE, HiSSE implemented in
hisse - Diversification (speciation and extinction) that changes based on biogeography or cladogenesis -> State-dependent diversification models for example GeoSSE implemented in
hisse
Learning about CTMCs is then critical for us to develop hypotheses about trait evolution, diversification.
15.3.1 Modeling the transition rates between two states
We are going to use a CTMC model called Mk2 (Markov model with two states). Let’s take a minute to draw define the model. We are interested in the rate of evolution of hummingbird pollination in Phlox species.
data.discrete <- cbind(pole_tree$tip.label, pole_dataset$P1)
mk2 <- corHMM(phy=pole_tree, data=data.discrete, rate.cat=1)
mk2 #Let's read and interpret the results carefully
plotMKmodel(mk2)Assumption: Equal rates
LegendAndRateMat <- getStateMat4Dat(data.discrete)
RateMat <- LegendAndRateMat$rate.mat
pars2equal <- list(c(1,2))
constrained.mat<-equateStateMatPars(RateMat,pars2equal)
constrained.mat
mk2_equalrates <- corHMM(phy=pole_tree, data=data.discrete, rate.cat=1, rate.mat=constrained.mat)
mk2_equalrates
plotMKmodel(mk2_equalrates)15.3.2 Model comparison
It is important to use model comparison in this context to investigate whether the rates of evolution between the two states are equal or different. In likelihood context we use the Akaike Information Criterion known as AIC that is defined as \[AIC= -2 \log(\textrm{Likelihood evaluated in the MLE)}+ 2\times \textrm{number of parameters}\] AIC allows us to compare between models with different number of parameters because it corrects for it. We want to choose for the model with the smallest AIC. A difference of 2 or larger between two AIC values is considered as evidence in favor of a given model.
So knowing this, which model of evolution would you choose for this example?
15.3.3 Stochastic mapping
Visualizing the history of evolution of a trait using stochastic maps.
phy <- mk2_equalrates$phy
data <-mk2_equalrates$data
model <- mk2_equalrates$solution
model[is.na(model)] <- 0
diag(model) <- -rowSums(model)
# run get simmap (can be plotted using phytools)
simmap <- makeSimmap(tree=phy, data=data, model=model, rate.cat=1, nSim=10, nCores=1)
# we import phytools plotSimmap for plotting
dev.off()
plotSimmap(simmap[[1]], fsize=.4)
colors <- setNames(c("black","red"),0:1)
dev.off()
densityTree(simmap,method="plotSimmap",lwd=8,nodes="intermediate", ylim=c(1,170),colors=colors, compute.consensus=FALSE, fsize=0.2, show.axis=FALSE, fix.depth=TRUE)