7 March 2010 (Updated 6 March 2011)
RAxML version 7.0.3
For many datasets, a full maximum likelihood analysis is extremely computationally expensive. As a result, several software packages have been developed to conduct rapid maximum likelihood analyses. These programs include GARLI, IQPNNI, PHYML, and RAxML. In general, these programs rely on modifications to the standard hill-climbing approach to find a good, though not necessarily the best, likelihood tree. This tutorial introduces likelihood tree reconstruction and bootstrapping using RAxML (Randomized Axelerated Maximum Likelihood), which was developed by Alexandros Stamatakis.
A Note on Alternatives to RAxML
Tutorials for GARLI (Genetic Algorithm for Rapid Likelihood Inference) also exist on the Phylowiki. The current version of GARLI is GARLI 1.0, which is the finalized version of the beta release, GARLI 0.96b. Historically, one disadvantage of GARLI relative to RAxML was that GARLI did not allow partitioned analyses. However, Derrick Zwickl has made available a version of GARLI, GARLI-PART 0.97, that allows partitioned analyses; it is available here. This version is still considered experimental, but it has undergone extensive testing.
The GARLI tutorials available on the PhyloWiki are for versions 0.96, 0.97, and 1.0 (in which settings are specified in a config file that you then execute from the command prompt) and for GARLI 0.95 (which uses a graphical user interface for setting up the run).
What is RAxML?
Here's a brief description from the RAxML 7.0.4 Manual by Alexandros Stamatakis: “RAxML (Randomized Axelerated Maximum Likelihood) is a program for sequential and parallel Maximum Likelihood-based inference of large phylogenetic trees. It was originally derived from fastDNAml which in turn was derived from Joe Felsenstein’s dnaml, which is part of the PHYLIP package.”
RAxML employs several heuristics to drastically reduce likelihood search times (see Stamatakis et al., 2005 for more details). These heuristics include: 1) building an initial starting tree under parsimony using random stepwise addition; 2) using Lazy Subtree Rearrangements for branch swapping, which saves time for multiple reasons, primarily because branch optimization only occurs for branches adjacent to the insertion point; 3) using GTR + CAT (GTR with per-site rate categories) instead of GTR + GAMMA; and 4) using simulated annealing, which incorporates a cooling schedule and allows “backward steps” during the hill-climbing process.
Remember, there is a trade-off at work here. RAxML and other fast ML programs get a likelihood tree quickly, but you are not guaranteed to get the optimal tree.
Because of the limited amount of time available during the Bodega Workshop, we will be using RAxML 7.0.3, which is the latest version with a pre-compiled executable. If you are using this tutorial outside of the workshop, you should check Alexis' website for the latest version (version 7.2.8 as of early March 2011) and download and compile the source code (there are instructions on how to do this in Alexis' RAxML tutorial). The commands covered below are still applicable in these more recent versions. Note also that Alexis releases updates of RAxML frequently, so check back often.
Download the software program RAxML 7.0.3 and the example datafile. Unzip the software package and place it wherever you want. This tutorial will assume you just have the RAxML program and the example file in a folder named 'raxml' on your desktop. The tutorial also assumes that the program file is simply called 'raxml'; if yours has a different name (e.g., it's called 'raxmlHPC_iMac'), either rename it or substitute the name of your file where appropriate.
Run the program by opening a terminal window and going to the folder containing raxml and the example files:
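As a minimal sketch, assuming the layout described above (a folder named 'raxml' on your desktop, which on most Mac and Linux systems is ~/Desktop/raxml), the following moves into that folder and lists its contents. The path is an assumption; adjust it to wherever you placed the files.

```shell
# Assumed location: a folder named "raxml" on the desktop (adjust as needed).
RAXML_DIR="$HOME/Desktop/raxml"

if cd "$RAXML_DIR" 2>/dev/null; then
    # You should see the raxml program and the example datafile listed here.
    ls
else
    echo "Folder not found: $RAXML_DIR"
fi
```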
You run RAxML by typing a single command line at the command prompt in the terminal. The command line is also written to your ‘info’ file for easy lookup later. If you are using a partitioned dataset, you will also need to have a text file defining the partitions. We will cover this below.
IMPORTANT!!! Sometimes cutting and pasting command lines directly from this tutorial or a text file (e.g., an info file that you used previously) will cause problems in running your commands. This happens because the pasted text can contain unseen whitespace characters that prevent your commands from working properly. If you are having problems, then stop pasting command lines and just type them out.
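One way to check a pasted line before running it is sketched here, under the assumption that the trouble comes from non-ASCII characters (for example a typographic en dash in place of a hyphen, or a non-breaking space). The variable names `pasted` and `cleaned` are illustrative only. The check flags any character outside printable ASCII, and the cleanup step handles only the en dash case.

```shell
# Example of a pasted command: the dash before "m" is an en dash (U+2013), not a hyphen.
pasted='./raxml -s primates.phy -n testrun –m GTRGAMMA -# 10'

# Report whether the line contains any byte outside printable ASCII.
if printf '%s\n' "$pasted" | LC_ALL=C grep -q '[^ -~]'; then
    echo "Warning: non-ASCII characters found in the command line"
fi

# Replace en dashes with plain hyphens and show the cleaned command.
cleaned=$(printf '%s\n' "$pasted" | sed 's/–/-/g')
echo "$cleaned"
```

If the warning still appears after this cleanup, retyping the command by hand remains the safest fix.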
Running an analysis on an unpartitioned dataset
To run RAxML, all you need to do is type a single line at the command prompt. However, there are many options, so be sure to read through the list of program commands and their descriptions in the RAxML manual.
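Here’s a list of the basic options used in the command below:
-s sequenceFileName (the alignment file, here primates.phy)
-n outputFileName (a run name that is appended to the name of every output file)
-m substitutionModel (here GTRGAMMA)
-# numberOfRuns (the number of independent searches)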
Run an analysis on the primates.phy dataset. Do 10 runs using GTRGAMMA.
./raxml -s primates.phy -n testrun -m GTRGAMMA -# 10
Open the info file in a text editor.
Did all searches get pretty similar likelihood values? For the primates dataset the answer should be yes. For other datasets, variation in search results could suggest that some searches are terminating in sub-optimal peaks. This might indicate a more complicated likelihood surface for which you might want to conduct an increased number of replicates to ensure adequate searching.
The best-scoring tree is reported in the run log in the terminal and in the resulting info file. Open the best-scoring tree in your favorite tree viewer (e.g., FigTree). Remember that this is the best-scoring tree found, not necessarily the optimal tree.
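To compare the likelihood values across searches without scrolling through the whole info file, something like the following may help. It is a rough sketch: the helper name `show_likelihoods` is hypothetical, and it assumes only that the scores appear on lines containing the word "likelihood", since the exact wording of the info file may differ between RAxML versions.

```shell
# Print every line of a RAxML info file that mentions a likelihood score.
# Assumption: scores appear on lines containing the word "likelihood"
# (matched case-insensitively); the exact wording may vary by version.
show_likelihoods() {
    grep -i 'likelihood' "$1"
}

# Usage, assuming the run name "testrun" used in the command above.
if [ -f RAxML_info.testrun ]; then
    show_likelihoods RAxML_info.testrun
fi
```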
Running a bootstrap analysis on an unpartitioned dataset
RAxML can run a standard bootstrap and a rapid bootstrap, with the rapid bootstrap at least an order of magnitude faster than the standard bootstrap. The rapid bootstrap is a slimmed down version of the standard RAxML search algorithm. The critical question is how well do values from the rapid bootstrap compare with those from the standard bootstrap? Stamatakis (RAxML 7.0.4 Manual) argues that "the results obtained by the rapid bootstrapping algorithm are qualitatively comparable to those obtained via the standard RAxML BS algorithm and, more importantly, the deviations in support values between the rapid and the standard RAxML BS algorithm are smaller than those induced by using a different search strategy, e.g., GARLI or PHYML." Unfortunately, thorough comparisons examining variation in support values among approaches have not yet been published.
Thus, it is safest to only use the rapid bootstrap when you can compare the recovered support values with support values from other approaches (e.g., Bayesian posterior probabilities) to ensure that the results are consistent across approaches. For the same reason, if you have the time and the computational resources, it is safest to use the standard bootstrap. Below we will first run the standard bootstrap and then run a rapid bootstrap. After we run each of those we will compare values from standard and rapid bootstraps from two different datasets.
The standard bootstrap.
Here are the functions that we will be using.
-b BootstrapRandomNumberSeed (use -x for rapid bootstrap)
The -f option tells RAxML what type of function or analysis you want to execute. By specifying "i", we are telling RAxML to conduct a standard bootstrap analysis.
Use the -b option to specify the random number seed; reusing the same seed makes the bootstrap reproducible across runs.
Use -# to specify the number of bootstrap pseudoreplicates.
To run this analysis, type:
./raxml -f i -s primates.phy -n boot -m GTRGAMMA -b 1234 -# 20
(Obviously 20 is a ridiculously low number of bootstrap replicates. Normally you would want this to be much higher, for example 1000 replicates.)
When your analysis is complete, you will have 2 analysis files:
Information on the analysis written to: RAxML_info.boot
All bootstrapped trees written to: RAxML_bootstrap.boot
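A quick sanity check on the bootstrap file is to count the trees it contains and confirm that the number matches the replicates you requested. This sketch assumes that RAxML writes one Newick tree per line, each ending in a semicolon. The helper name `count_trees` is hypothetical.

```shell
# Count the trees in a bootstrap file.
# Assumption: one Newick tree per line, each terminated by ";".
count_trees() {
    grep -c ';' "$1"
}

# Usage, assuming the run name "boot" used in the command above.
if [ -f RAxML_bootstrap.boot ]; then
    echo "Bootstrap trees found: $(count_trees RAxML_bootstrap.boot)"
fi
```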
We now need to summarize the bootstrap results and put the values on our best-scoring ML tree. To do this, we will use the -f b command, in which we summarize the set of bootstrap trees identified by -z and draw the bipartitions onto a topology specified by -t. Note that your best-scoring tree file will likely have a different name than the one below because it will likely be from a different replicate and you may not have added the .tre suffix.
./raxml -f b -t RAxML_result.testrun.RUN.4 -z RAxML_bootstrap.boot -m GTRGAMMA -s primates.phy -n boottree
Open the resulting tree in FigTree or your favorite tree viewer. If using FigTree, you will be asked what to name the labels on each node. Name these ‘bootstraps’. To view the support values on the tree, check Node Labels and select ‘bootstraps’ from the Display menu.
The rapid bootstrap.
Here we use the same commands specified above except that we use the -f a option and -x instead of -b to specify the rapid bootstrap random number seed.
./raxml -f a -s primates.phy -n boot2 -m GTRGAMMA -x 1234 -# 100
The analysis will conduct the 100 bootstrap replicates, conduct a final likelihood search, and then draw the bootstrap support values on the tree found in the combined likelihood analysis.
When your analysis is complete, you will have 4 analysis files:
Information on the analysis written to: RAxML_info.boot2
All bootstrapped trees written to: RAxML_bootstrap.boot2
Best-scoring ML tree written to: RAxML_bestTree.boot2
Best-scoring ML tree with support values written to: RAxML_bipartitions.boot2
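To confirm that the run finished and produced all four files, a small check along these lines can be used. The helper name `check_outputs` is hypothetical. The file prefixes are the ones listed above.

```shell
# Report which of the four expected output files exist for a given run name.
# Prints one line per file and returns non-zero if any file is missing.
check_outputs() {
    run_name=$1
    missing=0
    for prefix in RAxML_info RAxML_bootstrap RAxML_bestTree RAxML_bipartitions; do
        if [ -f "$prefix.$run_name" ]; then
            echo "found:   $prefix.$run_name"
        else
            echo "missing: $prefix.$run_name"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}

# Usage, assuming the run name "boot2" used in the command above.
check_outputs boot2 || echo "Some output files are missing; check the terminal log."
```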
Open the RAxML_bipartitions file in FigTree or your favorite tree viewer.
A cursory comparison of support values from standard and rapid RAxML bootstraps.
As discussed above, the rapid RAxML bootstrap is a "quick and dirty" approximation of the standard RAxML bootstrap. Thus, it is important to assess whether the rapid bootstrap yields similar values to the standard bootstrap. To the right is the best-scoring tree from a RAxML likelihood search for our example primates dataset with standard and rapid RAxML bootstraps reported along branches. Support values are from 500 bootstrap replicates with values along branches reported as standard / rapid bootstrap support values. For most nodes, the values recovered in the two approaches are similar, but note that support for the chimpanzee-human relationship differs by 10 between analyses (in red).
To the right is the best-scoring tree from a RAxML search of a second dataset. Support values along branches are reported as Bayesian posterior probabilities / RAxML standard bootstrap / RAxML rapid bootstrap / GARLI bootstrap. Note that for this dataset, support values for a given node are more similar between the RAxML standard bootstrap and the GARLI bootstrap than between the standard and rapid bootstraps in RAxML. Four nodes shown in red have especially large discrepancies between the standard and rapid RAxML bootstraps, with values from the rapid bootstrap sometimes larger and sometimes smaller than those from the standard search. These results suggest that 1) detailed comparisons of support values among programs and strategies are much needed, and 2) careful consideration should be given to deciding when to use the rapid bootstrap.
Running an analysis on a partitioned dataset
To run an analysis on a partitioned dataset, you need to specify the partitions in a separate file. We will set up a text file that specifies two partitions: one is for first and second positions, and the second is for third positions. In your text editor, make a file with the following information (DO NOT cut and paste these lines from the Phylowiki! If you do, you might end up with unseen characters in the whitespace of your text file that cause error messages when you run the analysis.):
DNA, codon1codon2 = 1-898\3,2-898\3
DNA, codon3 = 3-898\3
NOTE that you use a backslash, not a forward slash, in defining codon positions.
Save this file as "partition".
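If you prefer to create the file from the terminal rather than a text editor, a here-document is one option. This is a sketch, not a required step: quoting the `EOF` marker keeps the shell from interpreting the backslashes, and the check afterwards flags any character outside printable ASCII.

```shell
# Write the two partition definitions to a file named "partition".
# The quoted 'EOF' keeps the backslashes exactly as typed.
cat > partition <<'EOF'
DNA, codon1codon2 = 1-898\3,2-898\3
DNA, codon3 = 3-898\3
EOF

# Confirm the file holds only printable ASCII (no hidden characters).
if LC_ALL=C grep -q '[^ -~]' partition; then
    echo "Warning: unexpected characters in partition file"
else
    echo "partition file looks clean"
fi
```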
In terminal, run a partitioned analysis by typing:
./raxml -s primates.phy -n testrun2 -m GTRGAMMA -q partition -# 5
Similarly, if you wanted to run a bootstrap analysis on a partitioned dataset, you merely need to specify the partition file using -q in the bootstrap command line.
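For instance, a standard bootstrap on the partitioned dataset might look like the following. This is only a sketch that combines options already used in this tutorial. The run name "bootpart" is a hypothetical choice, and the guard simply avoids an error if the program or partition file is not in the current folder.

```shell
run_name="bootpart"   # hypothetical run name; choose your own

# Run only if the program and the partition file are in the current folder.
if [ -x ./raxml ] && [ -f partition ]; then
    # Standard bootstrap (-f i, -b seed) with the partition file given by -q.
    ./raxml -f i -s primates.phy -n "$run_name" -m GTRGAMMA -q partition -b 1234 -# 20
else
    echo "raxml or the partition file is missing from the current folder"
fi
```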
References
Stamatakis, A. 2006. RAxML-VI-HPC: Maximum Likelihood-based Phylogenetic Analyses with Thousands of Taxa and Mixed Models. Bioinformatics 22(21):2688–2690.
Stamatakis, A., T. Ludwig, and H. Meier. 2005. RAxML-III: a fast program for maximum likelihood-based inference of large phylogenetic trees. Bioinformatics 21(4):456–463.
Stamatakis, A. 2005. An Efficient Program for phylogenetic Inference Using Simulated Annealing. In Proceedings of the 19th IEEE/ACM International Parallel and Distributed Processing Symposium (IPDPS2005), High Performance Computational Biology Workshop, Proceedings on CD, Denver, Colorado, April 2005. (PDF available on Stamatakis’ web page)