Morphological evolution in R

InfoInfo
Search:    
Primary Contact(s)
[WWW]Samantha Price
Created
March 2012
Required Software
[WWW]R package: phytools [WWW]R package: auteur
Example Datafile
gruntsimmaptrees.nex gruntsdata.txt
Presentation
Price_Presentation2012.pdf Morph_rateinR2012.R BMsims.R

Please note the code in Morph_rateinR2012.R to calculate the relative likelihoods of the 2-rate models is correct (despite what I said in the lecture and will run on the terminal using the R CMD BATCH command).

Comparing rates of phenotypic evolution in R

This tutorial will combine various aspects of the phylogenetic comparative methods covered earlier in the week to show you how to estimate and compare rates of continuous character evolution in species with different discrete states e.g. habitats or dietary strategies. Rates of character evolution are not only interesting in their own right but are also a useful tool for other comparative analyses such as ancestral state reconstruction and, in a morphological context, Brownian rates of evolution can be used as phylogenetically corrected estimates of morphological variance (see O'Meara et al. 2004 and Hutcheon & Garland 2004). This method has been used to answer a variety questions such as “Are rates of floral evolution in the large parasitic Rafflesia faster than their smaller relatives?” (Barkman et al. 2008), “Does piscivory limit morphological diversification?” (Collar et al. 2009) “Do coral reefs promote morphological diversification in fishes?” (Price et al. in press).

This tutorial will show you how to run analyses in the free open source statistical software environment ‘R’ using functions in the phytools package written by Liam Revell. This code is based on methods developed by Brian O’Meara and in previous years (prior to 2011) we used his Brownie program (see 2010 tutorial) but it can be tricky to get it to run on linux or PC, which is why we have switched to R.

See the papers cited in this tutorial for more details. All the code (with annotations) in this tutorial can be downloaded in Morph_ratein2012R.R this will open in R.


Is a Brownian motion model appropriate?

This depends on two things: your traits and your hypothesis.

It is always good practice to test for BM-like behaviour of your traits prior to using any analysis that assumes traits evolve under BM. You heard about this from Luke Mahler and Rich Glor earlier this week but as a quick recap Brownian motion (BM) assumes that at any point in time a trait can increase or decrease in value, and the direction and magnitude of that change is independent of current or past states (i.e. no trends towards increasing or decreasing character values such as Cope’s rule etc.). It also assumes that evolution is gradual, with no punctuated evolutionary leaps or slowing down/speeding up. One of several ways to test whether your traits fit a BM model is to use 3 branch length transformations λ, κ and δ developed by Mark Pagel (1997, 1999). As mentioned by Rich Glor on Tuesday afternoon and Justen Whitall and Carl Boettiger on Wednesday morning.

Your hypothesis can also determine whether BM is appropriate, if you expect the different discrete states to select for different optimal phenotypes within the continuous characters then you may want to consider Ornstein-Uhlenbeck models (see Hansen 1997 and Butler and King 2004). OU models add an adaptive peak (mean) and strength of selection to a BM model, to model selection towards a particular phenotype. BM is a special case of OU, when the strength of selection towards the peak approaches zero, the model collapses to a BM model. These models can be implemented in the R package OUCH (Butler and King 2004). I have written a tutorial on how to do multi-peak OU models in OUCH which can be found [WWW]here. Multi-peak OU models can also be implemented in Brian O’Meara’s Brownie program (see [WWW]http://www.brianomeara.info/brownie) with the benefit that it will easily read SIMMAP V1.0 trees. In contrast if your hypothesis concerns phenotypic diversity or rates of evolution then a BM is more appropriate.


Installation and Getting Started

You will need to install the phytools package from CRAN with all its dependencies.

Change the path to the files as appropriate; I am assuming you have a folder on your Desktop called Bodega where you keep all the code for this workshop.


Dataset and Rationale

We are using a dataset of grunts a group of primarily nocturnal marine fish. We are testing the hypothesis that complex reef environments promote the rate of morphological evolution within the trophic apparatus of these fishes. So we will be comparing reef-living species (1) to non-reef species (0). The dataset represents the 3 functionally important trophic characters. We are going to incorporate uncertainty in tree topology, branch lengths and mapping of reef living onto the phylogeny by sampling from the Bayesian posterior distribution of trees from a BEAST run and using stochastic mapping upon these trees (which you heard about earlier in the week from Rich Glor). Then when we integrate the rate of evolution across each of these trees we will have an estimate of the rate of character evolution incorporating the uncertainty in all of these factors.

We have 10,000,000 trees from a Bayesian posterior distribution generated using BEAST, which we have sub-sampled to give us 9 trees (please note you would need to use many more trees for this to be a valid procedure but due to time constraints we are going to limit ourselves). On each of these trees we have used stochastic character mapping (see Nielsen 2002, Huelsenbeck et al. 2003) implemented in SIMMAP v1.0 (see [WWW]http://www.simmap.com) to create a distribution of character histories of reef (1) or non-reef living (0) sampled in proportion to their Bayesian posterior probability. We have sampled 10 character histories for each tree (again this is far too few for a proper analysis). The gruntsimmaptrees.nex file contains 90 trees with the reef/non-reef character mapped onto every topology. This can be read into R using read.simmap.R this creates a multiPhylo object with an additional element called mapped.edge

gruntrs<-read.simmap("Desktop/Bodega/gruntsimmaptrees.nex", format='nexus') # you can also use format='phylip' which is the default so if you get an error you might need to check your tree format. Also note you want to the the default rev.order=TRUE when reading in simmap output as it uses the counterintuitive way of placing the states and times along each branch is given (from root to tip) in right-to-left order.

Plot the first few trees to see if they look okay

cols<-c("black", "red"); names(cols)<-c("0", "1") #If you just use the default colors it will not plot the branches associated with the 0 character as it only recognizes 1<
par(mfrow=c(3,3))
for (i in 1:9) plotSimmap(gruntrs[[i]], cols)

Now read in the data

dat<-read.table("Desktop/Bodega/gruntsdata.txt")

and extract each individual trait into its own dataframe

trait1<-dat[1] # this extracts the first element in the dataframe which is the first column if you did dat[,1] this would only extract the values for trait 1 without the row names
trait2<-dat[2]
trait3<-dat[3]


Estimating Rate of Brownian Motion Evolution

For trait 1 we are going to run two models, one that fixes the rate of BM to be the same across reef and non-reef habitats (1-rate model) and the other that allows the rate to vary across habitats (2-rate model). We first create an empty results matrix with 90 rows (1 for each tree) and 10 columns taken from the output from brownie.lite or modifications thereof and name the columns with the names of the different elements of the output from brownie.lite.

resultstrait1<-matrix(nrow=90,ncol=10, dimnames=list(1:90, c("BM_singlerate","logL_singlerate", "k1", "1rate_AICc", "Nonreef_rate", "Reef_rate","logL_multiplerate", "k2", "2rate_AICc", "pchisq")))

converge<-vector() # this is an empty vector which we are setting up and will eventually contain information about convergence of the likelihood analyses

trait<-list() # this is an empty list and will contain the sorted traits to match the tip labels of each tree

We then write a for loop that runs brownie.lite over each of the 90 trees first sorting the character data so the names are in the same order as the tip label for the tree and then taking the useful information from the temporary results vector (res) to the results table (resultstrait1). We are also going to calculate the Akaike Information Criterion for small sample size (AICc) from the maximum likelihood estimate using the standard formula.

for(i in 1:90){ # This tells R to run the commands below 90 times, for us this is for every tree in gruntrs

Now would be a good point to make sure that R thinks that all of the likelihood analyses have converged as well as take a look at the results and write them to a table.

converge

Most, if not all, of the analyses have converged – if there are ones that haven’t you may want to consider removing them for the next set of calculations. If none of them have converged try to find out what has prevented the likelihood from being optimized and above all do not trust the results.

resultstrait1
write.table(resultstrait1, file="Desktop/Bodega/Trait1BMrateresults.txt")

We are now going to calculate the column means and standard deviation; the standard deviation will be useful in the calculation of the Standard Errors.

trait1res<-rbind(colMeans(resultstrait1), apply(resultstrait1, 2, sd)) # this binds the results from the following two commands by row (rbind). The first command calculates the means of the columns for resultstrait1 and the second uses the apply function to quickly calculate the standard deviation for every column (that is what the 2 specifies) in resultstrait1.

rownames(trait1res)<-c("mean","sd")
trait1res


Analysing Output

You now have all the information at your fingertips to examine the hypothesis. This section mainly uses model selection and multi-model inference and I suggest you refer to Burnham & Anderson 2002 if you are not familiar with this approach.

The first way of deciding whether a 1 or 2-rate model fits the data better is to use the likelihood ratio test: D=-2[ln(likelihood for 1-rate model) - ln(likelihood for 2-rate model)] the significance of this is determined by a chi-squared distribution with 1df. This is calculated for you by brownie.lite and is the p-value in the results.

The next way is to use model selection. First calculate the difference in AICc score (aka ΔAICc) for the two models (model - best fitting model). Where the best fitting model is the one with the lowest AICc score, which in our case is the 2-rate model for trait1:

deltaAICc<-trait1res[1,][4]-trait1res[1,][9] # this takes the 4th element in the 1st row of trait1res which is the AICc for the 1rate model and and the 9th element in the 1st row which is the AICc score for the best-fitting model whicg is the two-rate model here.

The general rule of thumb is if the ΔAICc is >4 there is substantial support for the best fitting model (see Burnham & Anderson 2002).

We could also use model averaging to incorporate uncertainty about model choice by actually using the estimated rates from both models. For this we need to calculate the AICc weight using the relative likelihood (relLrate) of the 1 and 2-rate models which is calculated as: exp(-0.5*ΔAICc score for that model(best fitting model gets 0)). You then sum the rates of evolution for each model for each habitat weighted by their AICc weights to calculate the model averaged rate parameters.

relL1rate<-exp(-0.5*(trait1res[1,][5]-trait1res[1,][14])) # relative likelihood of the 1-rate model
relL2rate<-exp(-0.5*0) # relative likelihood of the 2-rate model

rate1aiccweight<-relL1rate/(relL1rate+relL2rate)
rate2aiccweight<-relL2rate/(relL1rate+relL2rate)

Modelaveragedreefrate=((trait1res[1,][1]* rate1aiccweight)+(trait1res[1,][6]*rate2aiccweight))

Modelaveragednonreefrate=((trait1res[1,][1]* rate1aiccweight)+(trait1res[1,][5]*rate2aiccweight))


Using R on a cluster (non-interactively)

Note that is analysis may take a while if you have a 1000 trees with lots of taxa so you may want to consider running it on a machine that is not your laptop. If you have access to multiple CPU's you could consider splitting the analysis into several portions to run on different CPU's (batch processing) and then concatenating the output. To convert this script to submit to a cluster/server you need to make sure all the important data is written to a file and that you plot to a pdf or other file type (don't forget to close it too - the default is a pdf called Rplots.pdf which is what you will generate if you run this script as is - you will notice you would probably want to make the size of the plotting area larger etc.). You also should ensure that you have the the right path to your files. Finally, you need to add quit() at the end of your script. To run R non-interactively (on a cluster) you need to use the command R CMD BATCH input-filename (i.e this script) output-filename(optional).

Try running this script, available to download at the top of the page as Morph_rateinR2012.R non-interactively. On a mac you should be able to open the Terminal window and navigate to the folder where you have saved the script (using cd) and then type:

R CMD BATCH Morph_rateinR2012.R

You should generate a default output file name Morph_rateinR2012.Rout this will show you exactly the commands that have been executed and can be used to figure out where in the analysis it currently is.


Identifying rate shifts without having a priori hypotheses

Over the last year or so, several different methods have been published that allow the identification of shifts in the rate of phenotypic evolution without a priori hypotheses (Revell et al., 2012; Eastman et al. 2011, Venditti et al. 2011 and Thomas & Freckleton 2011). We are going to have a little fun comparing two of these approaches: Revell et al. 2012 and Eastman et al. 2011. This is just a quick introduction to these new methods to show you what is possible.

Both these methods implement MCMC so you will need to apply what you have learnt throughout the workshop to implement them correctly - in particular refer to Brian Moore's tutorial on MCMC diagnosis. So if you want to implement these methods in your thesis or papers please spend *a lot* of time familiarizing yourself with the method you choose; its benefits and pitfalls as well as how to implement it correctly i.e. checking that the MCMC has converged, is appropriately mixed and not autocorrelated etc..

Eastman et al. auteur package
auteur implements a reversible-jump MCMC approach to find one or more shifts in the Brownian motion rate throughout the topology. For precise details about the RJMCMC approach they implement please see the paper: Eastman et al. 2011 Evolution 65(12), 3578-3589. This method is ideal for data exploration, to help you identify what may be interesting patterns pursue, it is unlikely to be sufficient on its own.

trait<-dat[,1]
names(trait)<-row.names(dat)

Now we will run 2 chains of 100,000 each (this is no guarantee of convergence).

rjmcmc.bm(gruntrs[[1]], trait, summary=TRUE, reml=TRUE, ngen=100000, sample.freq=500, constrainK=1, fileBase="trait1tr1_test1" ) # We are constraining it to only evaluate models with an additional independent rate using constrainK=1 so that we can more easily compare it to the Revell et al. approach.

rjmcmc.bm(gruntrs[[1]], trait, summary=TRUE,reml=TRUE, ngen=100000, sample.freq=500, fileBase="trait1tr1_test2")

This is where you should do your MCMC diagnostics - we are going to pretend you have done it for now and just get rid of 25% as burn-in

pool.rjmcmcsamples(base.dirs=c("BM.trait1tr1_test1.parameters", "BM.trait1tr1_test2.parameters"), lab="trait1tr1")

shifts.plot(phy=gruntrs[[1]], base.dir="trait1tr1.combined.rjmcmc", burnin=0.25, legend=TRUE, edge.width=2)

Revell et al. 2012 evol.rate.mcmc phytools package
This method uses a Bayesian MCMC approach to estimate a single rate shift upon a tree.

gruntr<-read.tree(text=write.tree(gruntrs[[1]]))

gruntres<-evol.rate.mcmc(gruntr, trait, ngen=100000, control=list(sample=200))

The defaults: symmetric exponential proposal distribution and Gaussian proposal distributions centered on 0 for the two rates and the ancestral value (at root). Here we have set it to run 100,000 generations and sampling every 200.

This is where you should do all your MCMC diagnosis - for now we are going to pretend you have done it! For now we will remove the first 25% as burn-in which is done by only using the last 75% of the output gruntres$mcmc[125:501, ]

Now calculate the split with the smallest summed distances to all other splits.

gruntsminsplit<-minSplit(gruntr, gruntres$mcmc[125:501, c("node", "bp")], method="sum")

Extract the posterior sample of the average split

gruntpost<-posterior.evolrate(gruntr, ave.shift=gruntsminsplit,gruntres$mcmc[125:501,], gruntres$tips[125:501])

pies<-hist(gruntres$mcmc[125:501,][,"node"], breaks=0.5:length(gruntr$tip)+gruntr$Nnode+0.5)

plot(gruntr)

We can plot the posterior probabilities on the tree using pie charts on the edges (rate shift has to appear somewhere so the probabilities across the tree have to sum to 1)

edgelabels(edge=match(which(pies$density>0.01), gruntr$edge[,2]), pie=pies$density[which(pies$density>0.01)], piecol=c("red", "black"), cex=0.5)


Miscellaneous

a) Common errors: Sometimes the default optimization methods for Maximum Likelihood will not work with your dataset so you may need to change methods you can do this by altering the method in the brownie.lite code (try adding method ="L-BFGS-B",lower=c(0,-Inf))

b) Best practices: This is beyond the scope of this tutorial but if you intend to run this kind of analysis for a publication I highly recommend looking at the other output from Brownie.lite. In var.single and vcv.multiple it provides information on the shape of the likelihood surface for the 1 and 2 rate models respectively. This should be incorporated into any estimate of the Standard Error of the parameter estimates along with the error associated with different tree topologies, branch lengths and character histories, which is estimated from the standard deviation of the parameter estimates across the simmap trees.


References

Barkman, Bendiksby, Lim, Salleh, Nais, Madulid and Schumacher (2008) Accelerated rates of floral evolution at the upper size limit for flowers. Current Biology 18(19) pg 1508-1513.

Burnham, K. P., and D. R. Anderson. 2002. Model Selection and Multimodel
Inference: A practical information theoretic approach. Springer-Verlag, New York.

Butler, M. A., and A. A. King. 2004. Phylogenetic comparative analysis: a
modeling approach for adaptive evolution. American Naturalist 164:683-695.

Collar D. et al. 2009. Piscivory limits diversification of feeding morphology in centrarchid fishes. Evolution 63(3), 1557-1573.

Eastman, J. M., Alfaro, M. E., Joyce, P., Hipp, A. L. & Harmon, L. J. (2011) A novel comparative method for identifying shifts in the rate of character evolution on trees. Evolution, 65(12), 3578-3589.

Hansen, T. F. 1997. Stabilizing selection and the comparative analysis of
adaptation. Evolution 51:1341-1351.

Huelsenbeck, J. P., Nielsen, R., Bollback, J. P. 2003. Stochastic mapping of morphological characters. Systematic Biology 52(3), 131-158.

Hutcheon & Garland 2004. Are megabats big? Journal of Mammalian Evolution 11 (3-4), pg 257-277

Nielsen, R. 2002. Mapping mutations on phylogenies. Systematic Biology 51(5), 729-739.

O'Meara et al. 2006. Testing for different rates of continuous trait evolution using likelihood. Evolution 60(5), pg 922-933.

Pagel, M. (1997) Inferring evolutionary processes from phylogenies. Zoologica Scripta 26(4), 331-348.

Pagel, M (1999) Inferring the historical patterns of biological evolution. Nature 401, 877-884.

Price, S.A., Holzman, R. A., Near, T. & Wainwright, P. (2012) Coral reefs promote the evolution of morphological diversity and ecological novelty in labrid fishes. Ecology Letters.

Revell, L. J., Mahler, D. L., Peres-Neto,P. R. & Redelings, B. D. (2012) A new phylogenetic method for identifying exceptional phenotypic diversification. Evolution 66 (1), 135-146.

Thomas, G. & Freckleton, R. P. (2012) MOTMOT: models of trait macroevolution on trees. Methods in Ecology and Evolution 3(1), 145-151 (online early 2011)

Venditti,C., Meade, A. & Pagel, M. (2011) Multiple routes to mammalian diversity. Nature 479 (393),

This is a Wiki Spot wiki. Wiki Spot is a 501(c)3 non-profit organization that helps communities collaborate via wikis.