|Primary Contact(s)||Created||Required Software|
|Rich Glor||7 March 2009||R|
|See Introduction||R for Phylogenetics parts I, II|
R is becoming increasingly important to implementation of phylogenetic comparative analyses. Many new methods are developed specifically for the R platform while many older methods are being translated into R packages. We’re going to tackle comparative analyses in R in several stages. Through the course of this exercise we’ll learn how to (1) load comparative data into R, (2) ensure that our comparative data and tree are compatible, (3) conduct basic diagnostic tests of phylogenetic signal, and (4) conduct a variety of standard phylogenetic comparative analyses.
A. Loading trait data into R
If you have some character data for the species in your phylogeny, you can easily upload and analyze this data in R. Perhaps the easiest way to get your data into R is to start with a simple comma delimited table. You can generate such a table in EXCEL by saving your EXCEL worksheet as a .csv file using the Save As function under the File menu in EXCEL.
I've provided you with a comma-delimited table of data called anolis_data.csv. This file contains ecological and morphological data for Greater Antillean Anolis lizards. This file includes three columns: the first contains species names, the second includes discretely coded information about anole microhabitat specialization (0=grass-bush, 1=trunk-ground, 2=trunk, 3=trunk-crown, 4=crown-giant, 5=twig), and the third contains continuously-coded data on adult body size of adult males (measured as the snout to vent length, or SVL).
1. Start by opening the anolis_data.csv file in EXCEL to see what you’re dealing with.
2. Now let’s open this same file in R using the read.csv function included in R’s base installation:
>read.csv(“anolis_data.csv”) -> anolis.dat
You can see if all has gone well by typing anolis.dat and hitting return. You should see your data scrolling across the screen.
a. We now need to perform a few basic functions that will make our data easier to work with in R’s framework. First, we’re going to convert the table into a dataframe. A dataframe is an R format that breaks a matrix down so that each column is represented as a distinct object in R’s working environment. We want the columns of our dataframe to consist solely of our ecological and morphological character data - the taxon names in the first column of our .csv file will be stored as another component of our dataframe in just a moment. To get our two columns of character data (columns 2 and 3 of our original .csv file) into a dataframe we simply type:
>data.frame(anolis.dat[,2:3]) -> anolisData
This should convert the second and third columns of anolis.dat into a dataframe called anolisData. It is critically important that you begin to understand how and why the square brackets are being used after anolis.dat in this line: this is another case where the square brackets are being used to indicate specific elements of an R object (recall our use of double square brackets to indicate that we wanted only one of the two trees in our MrBayes .con file). When we have a matrix with multiple rows and columns, specific elements of this matrix are indicated by enclosing the specific rows and columns in square brackets separated by a comma (e.g., [rows,columns]). If we leave the area for rows or columns blank, all rows or columns will be selected. anolis.dat[1,] for example, will recover the data in row 1 across the three columns in our table. See what happens when you type anolis.dat[1,]. If we want to obtain multiple rows or columns, we can use the colon, which tells R we want a series extending from the number before the colon to the number after the colon (1:5, for example, should return the series ‘1 2 3 4 5’). Hopefully now it is becoming clear why we’ve typed [,2:3] after anolis .dat. We did this because we want to extract the character data found in each row of colums 2-3. Try typing anolisData followed by return in R - you should see the data included in columns two and three of your .csv file.
b. Let's make one more change to this data before going any further by log transforming our body size data. To do this, we're simply going to replace the original raw data with log-transformed data:
>anolisData[,2] <- log(anolisData[,2])
c. Now let’s add names to the rows in our dataframe by telling R that these names can be found in the first column of anolis.dat.
>rownames(anolisData) <- anolis.dat[,1]
Type anolisData and hit return to confirm that your rownames have been added.
d. Next we’re going to make our dataframe the default dataframe in R’s system by using the attach function. This will allow us to call elements of the dataframe without constantly making reference to the name of the dataframe itself. Before using the attach command, try typing the name of one of the columns of your dataframe (e.g., micro). You should end up with an error message. Now attach your dataframe by typing:
Now try typing ‘micro’ again. This time you should get a vector listing your microhabitat data.
e. To make things even easier for subsequent analyses, we're going to reapply taxon names to the microhabitat and body size data.
>names(micro) <- rownames(anolisData)
>names(SVL) <- rownames(anolisData)
B. Checking overlap of phylogenetic and trait datasets
Once you have a tree and associated data in R you're well positioned to conduct a phylogenetic comparative analysis. Before doing this however we need to make sure that we have overlapping phylogenetic and character data. The R package Geiger includes a function called name.check that makes it relatively easy to (1) check for overlap between the taxa in a phylogenetic trees and the taxa for which you have some other sort of character data, and (2) prune taxa missing from your tree or trait data in subsequent analyses.
1. Let’s consider the two datasets from Anolis lizards that we’ve already loaded into R, which include a phylogenetic tree (anolisChronogram - this tree was made during Part II of this exercise, so go back and complete this exercise if you don't still have this tree stored in R) and data on microhabitat specialization and body size (anolisData). To determine which taxa are overlapping in these two files, type:
The output should consist of two lists of taxon names, one called $Tree.not.data and another called $Data.not.tree. In our case, the fact that numerous names are present in the $Tree.not.data list means that some anole species are represented in our tree, but not in our set of character data (the reason for this is that our tree was reconstructed using anoles from across the genus, whereas character data was only available for species from the Greater Antilles whose microhabitat specialization has been characterized). By contrast, the $Data.not.tree list should be empty (indicated by character(0)), suggesting that there are no taxa in our comparative dataset that are absent from the tree.
2. In order to conduct phylogenetic comparative analyses, we want to prune all the taxa from our tree that are not also present among our trait data. To do this we’re going to save the two lists generated by name.check under the name anolisOverlap and then use the names in the $Tree.not.data portion of this object in an application of the drop.tip function:
>name.check(anolisChronogram, anolisData) -> anolisOverlap >drop.tip(anolisChronogram, anolisOverlap$Tree.not.data) -> anolisComparativeTree
Take a look at your newly pruned tree by typing plot(anolisComparativeTree).
C. Visualizing Trait Data on a Phylogeny
1. We often would like to plot the data for our comparative analysis across the tips of our tree. With a bit of practice, this can be done fairly easily in R.
a. Now let's add colored boxes to indicate microhabitat specialist states. This may seem a bit complicated at first, but you'll appreciate the flexibility of the method if you stick with it. We're going to start by making a dummy vector that is formatted to store character data. The names of the cells in this dummy vector will correspond with the names of our tip taxa. We're soon going to fill this vector in with the names of the colors we'd like to use for labeling.
microLabel <- character(length(anolisComparativeTree$tip.label))
names(microLabel) <- names(micro)
microLabel[micro==0] <- "red"
microLabel[micro==1] <- "yellow"
microLabel[micro==2] <- "green"
microLabel[micro==3] <- "light blue"
microLabel[micro==4] <- "blue"
microLabel[micro==5] <- "purple"
If you type microLabel now, you should see a vector of names and associated colors.
b. Our next challenge is to plot these colors across the tips of our tree. To do this, we're going to use the points command. This command is used to give the X and Y coordinates of places on our figure where we'd like to place points. We can also specify what types of points we'd like to use with the pch command and what color we'd like these points to be using the previously generated microLabel vector. Let's first replot our tree, using the label.offset command to make room for the character data (you can play around with the label offset value to see how higher values move your taxon labels increasingly further away from the tree tips).
plot(anolisComparativeTree, cex=0.5, label.offset=2)
Once the tree is replotted, we're ready to add the points.
points(rep(104, length(anolisComparativeTree$tip.label)), 1:length(anolisComparativeTree$tip.label), pch=22, bg=microLabel[anolisComparativeTree$tip.label], cex=0.5, lwd=0.25)
It is important to realize that microLabel[anolisComparativeTree$tip.label] is crucial as it sorts the microLabel to match the the tip labels on the tree!
NOTE: The points command will only work if you already have a plotting window open. This is because points is trying to add points to an already existing figure. Go back and repeat the plotting function implemented in step 1a if you closed your Quartz window between then and now.
c. Now let's add scale bars that indicate the size of each species to this tree. The way we're going to do this is by telling R to make segments at the tips of our tree that scale with body size. To keep things simple, lets just put these segments to the right of the taxon names. The segments command works by connecting points in R's X/Y coordinate space with a line, or segment. We need to tell R where to start a segment on the X/Y axis and where to end this segment along the X/Y axis. The syntax for doing this involves telling R where to start and stop the segment involves (Xstart, Ystart, Xfinish, Yfinish). To make segments after the taxon names, we're going to start at 123 units down the x-axis (a value that we determine from the starting knowledge that our tree is 100 units long combined and subsequently playing around with possible values until we get the segments to appear after the taxon names). From this point, we extend a line whose length corresponds with the lizards body size using the following command:
>segments(rep(123,85), 1:85, rep(123,85) + SVL[anolisComparativeTree$tip.label], 1:85, lwd=2)
Further explanation of the segments command: This command is broken down into four parts: (1) starting X coordinates, (2) starting Y coordinates, (3) ending X coordinates, (4) ending Y coordinates. The first part of this command rep(123,85) uses the rep command to generate a vector that contains the value 123 repeated 85 times. We're doing this because we want to create a line for each of the 85 taxa in our tree, and we want all of these lines to start at an x-value of 123 (you may find that you need to adjust this value on a case-by-case basis to make the lines appear where you want them relative to the tips of your tree). The next command 1:85 tells R that we want lines to begin at integer values ranging from 1-85 on the Y axis: by convention, the tips of phylogentic trees are found at integer values along the Y axis. The next part of this command rep(123,85) + SVL[anolisComparativeTree$tip.label]} tells R how far lines will extend along the X-axis, in this case, they're going to be as long as the SVL value for the species in our dataset. The final part tells R that our segments will be horizontal lines by repeating the vector with integer values ranging from 1 to 85 for the final Y coordinates of each line.