Columbine,
photo credit
| Primary Contact(s) |
|
|
| Created |
| 7 March 2011 |
| Required Software |
| R (and several packages, follow the link to a tutorial on what to do) |
|
|
|
|
| Example Datafiles |
columbine.dat |
columbine.nex |
columbine100.nex |
felsenstein32.nex |
felsenstein128.nex |
yule100.nex |
Summary
We introduce independent contrasts in R by confirming Felsenstein's warning that phylogeny introduces an apparent correlation between two traits that are evolving independently along the tree. We compare this effect under other models of evolution and see how to handle independent contrasts under such models. We will then explore model choice and independent contrasts on continuous character data of columbines.
Tutorial
Felsenstein's Contrasts
Launch R, load the library and read in Felsenstein's tree with 32 tips, and simulate two separate traits, x, and y, under Brownian Motion on the tree using sigma=10.
# Load the phylogenetics library
require(geiger)
# Read in the data file
fels32 <- read.nexus("felsenstein32.nex")
# simulate BM with sigma=10 (the [,1,1] says return only 1 trait, 1 replicate
x <- sim.char(fels32, matrix(10) )[,1,1]
y <- sim.char(fels32, matrix(10) )[,1,1]
We can plot the tree to take a look at it. We've added a little color bar indicating the value of the x trait using a little R plotting magic. Note that the tree contains two unresolved clades, and the lower clade all tends to have higher values (more red), while the upper clade has more blue. Of course this may differ between simulations.
The traits x and y (think: body size and brain size) were evolved independently — we know that because we simulated them separately! Let's see if the tree has created any correlation between them. We use the linear model command, lm, to do this.
fit <- lm(y~x)
Figure 2: X & Y data produced by independent BM runs on Felsenstein's tree
Output should look like
> summary(fit)
[...]
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.7754 0.9344 1.900 0.067069 .
x 0.6464 0.1638 3.946 0.000443 ***
---
Don't be concerned if your numbers aren't exactly the same as those seen above, remember that we're analyzing the results of a simulation so we expect different results each time we conduct the simulations. For the simulation above, R is telling us that there is a significant correlation between x & y, p = 0.000443 in the above example (again, results from your random simulation may differ). We can see why by plotting the traits x and y and then coloring the points from each of the clades Figure 2. To recreate this figure we can use the following commands:
plot(x,y) #Plots the x and y data that we just simulated points(x[1:16], y[1:16], col="red", pch=19) #Fills the points for clade 1 (taxa 1-16) with red points(x[17:32], y[17:32], col="blue", pch=19) #Fills the points for clade 2 (taxa 17-32) with blue
We see the apparent correlation by eye, and we can tell that it's driven by two clusters corresponding to the clades. Phylogenetic independent contrasts are designed precisely for this problem. We create independent contrasts using the "pic" command, and repeat the regression. Note the lm command uses -1 so that the regression is forced through the origin.
x_ic <- pic(x, fels32)
y_ic <- pic(y, fels32)
ic_fit <- lm(y_ic~x_ic-1)
While we can use the "summary" function to get the statistics, of course we'd like to visualize the data. Here's a little plotting code, making the graph in Figure 3.
Figure 3: Raw data and contrasts
par(mfrow=c(1,2)) # a plot window for 2 graphs
plot(x,y, pch=19) # plot the data
## add the regression line:
lines(x,fit$coefficients[1] + x*fit$coefficients[2])
## write the p value on the plot
mtext(paste("Non-zero slope with p = ",
format.pval(summary(fit)$coefficients[2,4], digits=2)))
## plot the contrasts and add the regression
plot(x_ic,y_ic, pch=19)
lines(x_ic, x_ic*ic_fit$coefficients[1])
mtext(paste("Non-zero slope with p = ",
format.pval(summary(ic_fit)$coefficients[1,4], digits=2)))
We simulated the data under Brownian Motion and can see the correlation in the left panel. In the right panel, where we've plotted our independent contrasts, we do not expect to observe a correlation. The two panels of this figure nicely illustrate why independent contrasts are required for comparative analyses of data that has evolved along a phylognetic tree.
What if the data were created under an Ornstein-Uhlenbeck (OU) process? To simulate data under the OU processes we can use the same sim.char command discussed above. Before doing so, however, we need to transform the tree to reflect evolution under the OU process. (There are lots of other ways to simulate too...)
x <- sim.char(ouTree(fels32, alpha=50), matrix(10) )[,1,1] #The ouTree command here is transforming our tree with an alpha parameter of 50 y <- sim.char(ouTree(fels32, alpha=50), matrix(10) )[,1,1]
Now we'd like to repeat the analyses conducted above with the data simulated under Brownian motion (bm). To do this we can just use copy and paste, but we'll make our lives easier if we a make an R function to do this for us. Functions are short scripts that perform a set of operations on a given set of input variables. Learning how to write functions is key to taking full advantage of R. The function below does independent contrasts on our tree and simulated data and reports the results as in Figure 3:
Figure 4: Raw data and contrasts for OU
plot_ic <- function(x, y, tree){
summary(fit <- lm(y~x) )
x_ic <- pic(x, tree)
y_ic <- pic(y, tree)
summary(ic_fit <- lm(y_ic~x_ic-1))
par(mfrow=c(1,2)) # a plot window for 2 graphs
plot(x,y) # plot the data
## add the regression line
lines(x,fit$coefficients[1] + x*fit$coefficients[2])
## write the p value on the plot
mtext(paste("Non-zero slope with p = ",
format.pval(summary(fit)$coefficients[2,4], digits=2)))
## plot the contrasts and add the regression
plot(x_ic,y_ic); lines(x_ic, x_ic*ic_fit$coefficients[1])
mtext(paste("Non-zero slope with p = ",
format.pval(summary(ic_fit)$coefficients[1,4], digits=2)))
}
The function above is nothing more than a convenient way to wrap our previous commands in a custom script we've called plot_ic. Creating functions is very convenient way to conduct tasks that we're going to call repeatedly. Now we can use a single line to run all the code above with a new set of simulated data Figure 4.
Before we run our function on the OU data, let's predict what will happen. Why?
plot_ic(x,y, fels32)
The raw traits simulated under OU are uncorrelated, unlike the raw traits simulated under BM. Thus, the tree had a large effect on the correlation structure when the data really was Brownian, as Felsenstein assumes, but had a rather different effect when the data was produced by the OU model.
Thus far, we've considered two traits that were simulated independently of one another. What if there is a true biological constraint in the evolution of one trait relative to another that introduces such a correlation? We can introduce a real correlation in our data evolving under OU model:
Figure 5: Raw data and contrasts when y is truly a function of x
y <- .5*x + y # linear model with random term plot_ic(x,y, fels32)
Not unexpectedly, we now observe a strong correlation in both plots (Fig. 5).
Model Choice for Contrasts
Once we have data, can we figure out which model created these data? To do this, we're going to fit different models to our data using Geiger's fitContinuous function. This function will provide us with AIC scores that can be used to identify the model with the best fit (the best model should have the lower AIC score):
bm <- fitContinuous(fels32, x, model="BM") #Use fitContinuous to assess fit of the BM model to the data ou <- fitContinuous(fels32, x, model="OU") #Use fitContinuous to assess fit of the OU model to the data bm[[1]]$aic #Display the AIC score for the BM model ou[[1]]$aic #Display the AIC score for the OU model
Can we do independent contrasts if OU is the best model? Of course, but we must first transform the tree:
x_pic <- pic(x, ouTree(fels32, alpha = ou[[1]]$alpha) )
Now we have the correct form of the independent contrast (under OU), and we can perform the linear regression as before.
Applying this to Columbine Data
Now that we know a bit about alternative models of character evolution and how they can be analyzed using independent contrasts, we're ready for some read data. We're going to work with data for columbine flowers provided by Justen Whittall. Justen's data includes both a tree (columbine.nex) and continuously coded data on flower morphology (columbine.data):
First we have to read in the data and the tree:
traits <- read.table("columbine.dat")
tree <- read.nexus("columbine.nex")
name.check(tree, traits)
The ape package that provides "pic" is a bit picky about data formats. The data has to be a "numeric" and have names matching the species (i.e. can't be a data.frame etc.)
LogSpurLength <- traits[["LogSpurLength"]] FlowerOrientation <- traits[["FlowerOrientation"]] names(LogSpurLength) <- rownames(traits) names(FlowerOrientation) <- rownames(traits)
With the data in the correct format we can check the model:
## we'll use the data.frame structure to do these both at the same time bm <- fitContinuous(tree, data.frame(LogSpurLength, FlowerOrientation), model="BM") ou <- fitContinuous(tree, data.frame(LogSpurLength, FlowerOrientation), model="OU") bm$LogSpurLength$aic ou$LogSpurLength$aic bm$FlowerOrientation$aic ou$FlowerOrientation$aic
Aside: Model fitting in OUCH
The ouch package allows one to fit multiple OU peaks. We'll see examples of this tomorrow. Meanwhile, ouch can also be used to fit and simulate simple Brownian and single-peak OU processes. It tends to be a little faster, and uses what programmers call an object-oriented design. We'll demonstrate briefly what this means. One disadvantage is that the data requires a slightly different format for both the tree and traits. Traits have to be in a numeric vector of length of the total number of nodes with NAs on the internal nodes. Because this tends to frustrate users, we provide a little code to convert data into ouch format. (We also convert the tree using ouch's provided package. Ouch's ape2ouch function defaults to scaling the tree to unit length, which makes it hard to compare).
### What if I want to use OUCH? ouch <- data2ouch(tree, traits) ## See what the data looks like ouch ## run these models in ouch bm_ouch <- brown(ouch$data$LogSpurLength, ouch$tree) ou_ouch <- hansen(ouch$data$LogSpurLength, ouch$tree, ouch$regimes, 1, 1) ## These should agree (Geiger calls it beta = sigma^2 ) bm_ouch@sigma^2 bm[[1]]$beta
These fits from ouch are rather smart objects. For instance, guess what this does:
X <- simulate(bm_ouch) update(bm_ouch, X)
they have methods associated with them, which is not true of geiger/ape fits.
Independent Contrasts for Columbine Data
To perform independent contrasts:
LogSpurLength_contrast <- pic(LogSpurLength, tree) FlowerOrientation_contrast <- pic(FlowerOrientation, ouTree(tree, alpha=ou$FlowerOrientation$alpha)) summary(lm(FlowerOrientation_contrast ~ LogSpurLength_contrast - 1))
Traditionally this is the point where we'd start worrying if we had the right transform for the data, and we might look at the node variances vs their contrasts. Remember that transforms are biological hypotheses of different mechanisms. If the logarithm of a trait evolves under Brownian motion, the trait is evolving under a geometric Brownian motion process instead — a very different model. That's okay, for lengths we often think geometrically anyway, or by percent. It makes more sense to say a random mutation may increase or decrease the size by +/- 1% then by +/- 1 cm. Because these biological processes involve different models, we can just compare the models directly:
bm <- fitContinuous(tree, traits["LogSpurLength"]) Fitting BM model: bm2 <- fitContinuous(tree, traits["SpurLength"]) Fitting BM model: bm$Trait1$aic [1] 52.47771 bm2$Trait1$aic [1] 628.744
And we see log(spur length) is well justified over the use of spur length — or put another way, spur length seems to evolve geometrically rather than linearly. Of course we can do this at the beginning, before we get to pic or other model estimation. This kind of test isn't done as often as it should be.
We can automate both choosing the model and performing the contrasts using the custom function from the appendix below. First copy and paste in the code, and then run:
output <- p_matrix(traits, tree) output$p < 0.05 output$model[[1]]
Uncertainty in the phylogeny
There's uncertainty in the phylogeny: often we would rather consider a distribution of possible trees rather than assume we the phylogeny is known exactly.
We read in the collection of 100 trees just like we read in a single tree:
chrono <- read.nexus("columbine100.nex")
We then just add a loop over the trees. sapply is a smart looping function which will run a bit faster and figure out what format to use in outputting our data — here we'll get a matrix whose columns each correspond to a different tree. For instance, we can see if the correlation is significant over the distribution of trees:
x <- sapply(1:length(chrono), function(i) pic(traits[["LogSpurLength"]], chrono[[i]])) y <- sapply(1:length(chrono), function(i) pic(traits[["FlowerOrientation"]], chrono[[i]])) p <- sapply(1:length(chrono), function(i) summary(lm(y[,i]~x[,i]-1))$coefficients[1,4]) mean(p) plot(density(p))
We can loop over trees to see how the uncertainty in the tree estimation impacts our estimate of phylogenetic signal, lambda:
tree_uncertainty <- sapply(1:length(chrono), function(i) fitContinuous(chrono[[i]], LogSpurLength, model="lambda")[[1]]$lambda )
It is not uncommon to see comparative methods integrate over uncertainty in the tree but not over uncertainty in the model. Just because we are returning a maximum likelihood estimate does not mean we are free to ignore model uncertainty. Felsenstein's far more cited paper from 1985 explained how to just that for tree inference. Here we bootstrap the our estimate of phylogenetic signal, lambda. Note that we've written the function to bootstrap any model. Such abstraction is good practice.
bootstrap_model <- function(tree, x, model="lambda", nboot=50){
fit <- fitContinuous(tree, x, model=model)
sims <- sim.char(transformTree(tree, model, fit[[1]]), matrix(fit[[1]][[3]]), nsims=nboot )[,1,]
sapply(1:nboot, function(i){
fitContinuous(tree, sims[,i], model=model)[[1]][3]
})
}
bootstrap_uncertainty <- bootstrap_model(chrono[[1]], LogSpurLength, model="lambda")
Figure 7: Uncertainty in model estimate, from tree, and combined
We can do the same for other model parameters, such as sigma in Brownian motion. Figure 7 shows us that the uncertainty from the model estimate is much greater than that from the tree. Indeed, we can integrate over both sources of uncertainty, (right panel, code below) and we get basically the uncertainty from the model alone. Treating uncertainty in the tree without treating the uncertainty in the model would be rather misleading...
Loop over both model uncertainty and tree uncertainty:
sigma_over_trees <-
sapply(1:length(chrono), function(i){
bootstrap_model(chrono[[i]], LogSpurLength, model="lambda")
})
Plotting commands
par(mfrow=c(1,2))
plot(density(unlist(bootstrap_uncertainty)), ylim=c(0, max(density(unlist(tree_uncertainty))$y)), lwd=3, xlab="estimated lambda", main="separate sources of uncertainty")
lines(density(unlist(tree_uncertainty)), lwd=3, lty=2, col="darkblue")
lines(density(unlist(sigma_over_trees)), xlab="estimated lambda", lwd=3, lty=3, col="red", main="cumulative uncertainty")
legend("topleft", c("over sims", "over trees", "combined"), lty=c(1,2, 3), col=c("black", "darkblue", "red"))
Model Exploration
If we want to know if traits are correlated, how do we know if we have the right model? How important is it?
If we want to know if shared ancestry explains patterns in morphology or ecology...
* Simulate data under one of the following models
x <- sim.char(lambdaTree(tree, lambda=0.5), matrix(1))[,1,1]
x <- sim.char(ouTree(tree, alpha=0.5), matrix(1))[,1,1]
x <- sim.char(tree, matrix(1))[,1,1]
Write the tree and data to a file,
write.table(x, file="example.dat") write.nexus(tree, file="example.nex")
Quit and restart R. Swap machines with your neighbor. Attempt to estimate which model they used, and with what parameter values. i.e.
x <- read.table("example.dat")
tree <- read.nexus("example.nex")
bm <- fitContinuous(tree, x, model="BM")
lambda <- fitContinuous(tree, x, model="lambda")
...
References
Code Appendix
Here's the plotting function to visualize the tree and data as shown above.
plot_cts_data <- function(phy, dat){
# A function to plot current states using continuous colormap
# Args:
# phy: an ape-style tree (object of class phylo)
# dat: named data-set
# Returns:
# plots the phylogeny with a heat map indicating cts traits.
plot(phy) # just to get treelength
treelen <- max(axisPhylo() )
plot(phy, cex=1, x.lim=1.3*treelen, edge.width=2, show.tip.label=F)
mycol <- function(x){
tmp = (x - min(dat)) /max(x-min(dat))
rgb(blue = 1-tmp, green=0, red = tmp )
}
tiplabels(pch=19, col=mycol(dat), cex=1.5, adj=0.55*treelen) # add tip color code
}
Below is a custom function that will take a trait matrix and a tree and test every pairwise comparison. The method will first check which model is most appropriate (OU, BM, lambda, or delta at the moment) and use that model in the independent contrast. If select_model=FALSE it will just assume BM and run much faster.
p_matrix <- function(traits, tree, select_model=TRUE){
## Args
## traits -- a data.frame of traits with rownames corresponding to taxa and colnames corresponding to traits
## tree -- a ultrametric tree of class phlyo, "ape"-package
## select_model -- logical, attempt model selection first? Otherwise will assume Brownian motion
## Returns
## output$p -- matrix of p values
## output$model -- list with model choice for each trait and the parameter values estimated for each model.
p_vals <- matrix(NA, nrow=dim(traits)[2], ncol=dim(traits)[2])
if(select_model){
choose <- choosemodel(traits, tree)
names(choose[[1]]) <- colnames(traits)
} else { choose <- NULL
}
for(i in 1:(dim(traits)[2]-1) ){
for(j in (i+1):dim(traits)[2]){
p_vals[i,j] <- mypic(i, j, traits, tree, choose)
p_vals[j,i] <- mypic(i, j, traits, tree, choose)
}
}
rownames(p_vals) <- colnames(traits)
colnames(p_vals) <- colnames(traits)
list(p=p_vals, models=choose)
}
mypic <- function(i,j, traits, tree, choose=NULL){
# reformat data, add names (for pic)
x <- traits[[i]]
names(x) <-rownames(traits)
y <- traits[[j]]
names(y) <- rownames(traits)
y_tree <- tree # in case select_model=FALSE these are the same
x_tree <- tree
if(!is.null(choose)){
x_tree <- transformTree(tree, model=choose[[1]][i], fit=choose[[2]][,i])
y_tree <- transformTree(tree, model=choose[[1]][i], fit=choose[[2]][,i])
}
x_ic <- pic(x, x_tree)
y_ic <- pic(y, y_tree)
ic_fit <- lm(y_ic~x_ic)
format.pval(summary(ic_fit)$coefficients[2,4])
}
transformTree <- function(tree, model, fit){
if(model == "BM") out <- tree
else if(model == "OU") out <- ouTree(tree, fit$alpha)
else if(model == "lambda") out <- lambdaTree(tree, fit$lambda)
else if(model == "kappa") out <- kappaTree(tree, fit$kappa)
else if(model == "delta") out <- deltaTree(tree, fit$delta)
else print(paste("Transform model ", model, " not recognized"))
out
}
choosemodel <- function(x, tree){
lambda <- fitContinuous(tree, x, model="lambda")
bm <- fitContinuous(tree, x, model="BM")
ou <- fitContinuous(tree, x, model="OU")
delta <- fitContinuous(tree, x, model="delta")
scaling_par <- sapply(1:length(bm), function(i) c(lambda[[i]][3], 1, ou[[i]][3], delta[[i]][3]) )
options <- c("lambda", "BM", "OU", "delta")
model <- sapply(1:length(bm), function(i) options[ which.min(c(lambda[[i]]$aic, bm[[i]]$aic, ou[[i]]$aic, delta[[i]]$aic)) ] )
list(model, scaling_par)
}
data2ouch <- function(tree, data){
ot <- ape2ouch(tree, scale=FALSE)
if(is(data, "numeric")) data <- as.data.frame(data)
otd <- data.frame(nodes = as.integer(ot@nodes), ancestors = ot@ancestors, times = ot@times, labels = ot@nodelabels)
data$labels <- rownames(data)
otd <-merge(otd, data, by="labels", all=T)
otd$regimes <- as.factor("global")
rownames(otd) = otd$nodes
otd <- otd[order(otd$nodes),]
otd$nodes <- as.character(otd$nodes)
dat <- vector("list", length(otd)-5)
for(i in 5:(length(otd)-1)){
dat[[i-4]] <- otd[[i]]
names(dat[[i-4]]) <- otd$nodes
}
regimes <- otd$regimes
names(regimes) <- otd$nodes
names(dat) <- names(traits)
list(data=dat, tree=with(otd,ouchtree(nodes=nodes,ancestors=ancestors,times=times,labels=labels)), regimes=regimes)
}
Comments:
Note: You must be logged in to add comments
2013-01-04 13:14:30 In the mypic function, the linear regression of contrasts should be forced through the origin. If this is done, then the format.pval line needs to be changed as well, since there are no intercept coefficients. So the last lines should be
ic_fit <- lm(y_ic~x_ic-1)
summary(ic_fit)
format.pval(summary(ic_fit)$coefficients[4])

