| Primary Contact(s) |
| Bob Thomson |
| Created |
| 6 March 2009 |
| Required Software |
|
|
| Example Datafile |
| This tutorial uses two of the example files for BEST |
example.nex |
yeast.nex |
dummy.txt |
Summary
BEST is a method developed by Liang Liu and Dennis Pearl for inferring species trees from a set of gene alignments. The program is a modification of MrBayes, so installing and using the two programs is similar. BEST attempts to estimate the posterior distribution of the species tree using a model that allows for incongruity among gene trees due to incomplete lineage sorting. The method allows single or multiple alleles to be sampled for each species, but limits the user to a single outgroup. The program estimates gene trees, theta for each ancestral population, and the species tree (including branch lengths). Standard nexus files are used as input. This tutorial covers the features of BEST version 2.2. Note that the 2.x versions of BEST are a major departure from earlier version in that gene trees and species tree are jointly estimated in a single MCMC run, rather than in two separate runs as in earlier versions.
Introduction
This tutorial uses BEST version 2.2, two example data files, example.nex and yeast.nex, and a dummy file called dummy.txt.
Tutorial
Installation
Download the software and the example files. Unzip the software package and place it wherever you want. On OSX or Linux, it may be easier to copy BEST into your /usr/bin or /usr/local/bin, but this tutorial will assume you just have it in a folder with the example files on your desktop.
Run the program by opening a terminal window and going to the folder containing BEST and the example files:
cd ~/Desktop/best_tutorial
Check to make sure the software runs properly by typing ‘./mbbest’ (on OS X or Linux) or by double clicking the executable on windows. You should see the standard MrBayes introduction with the BEST title inserted at the top. If this works, the program is running ok and you are ready to set up an input file.
Note that, while it is possible to run BEST by entering commands one at a time at the command prompt, setting up an input file is generally easier and less error-prone. Note also that the additional functions that BEST uses have not been added into the MrBayes help files, so typing ‘help <some_command>’ won’t get you anywhere for BEST specific commands.
Setting up and running with single alleles
Open the first example file called ‘example.nex’ in a text editor. This file contains simulated data from 6 genes (each 500 basepairs long) for 4 different species. The format of the input file is the same as for MrBayes, the programs differ in how the mrbayes block is set up at the end of the file.
BEST requires that you specify an outgroup, which it uses to root the individual gene trees. We do this, as in MrBayes, by typing ‘outgroup X’ where ‘X’ is the number of the taxon you wish to use as an outgroup. BEST also requires that you specify taxon sets that define which species each allele belongs to. This is particularly important when you are using multiple alleles per species, but even in the single allele case BEST requires these taxon sets. Following the taxon sets, we need to define which sequences belong to which gene by defining character sets. For this example file, the data is interleaved with each gene in separate chunks of text, this is generally a good practice to adopt but isn’t necessary. Define each character set using
charset gene1 = start – finish;
Where ‘gene1’ is the name of the gene (e.g., RAG, CMOS, etc.), ‘start’ and ‘finish’ are the positions in the alignment where the data for that gene starts and ends, respectively. Once each character set is defined, partition the dataset by gene
partition Genes = 6; gene1, gene2,… geneN
set partition = Genes;
You’ll likely want to set different models for each of your genes, the process for doing this is exactly as in a partioned MrBayes analysis, see that tutorial if you don’t know how to do this. For this tutorial, we’ll leave the model on the simple default settings so that our example runs quickly. Note that when you set up more complex models, you’ll also need to specify the appropriate priors in the ‘prset’ command below.
The parameters specific to BEST are set using the prset command. The prior for Theta (REPLACE) is an inverse gamma distribution, the shape of which is set using two parameters (alpha and beta). Setting these parameters to high values can cause this prior to strongly influence the analysis, and so the authors suggest setting these to low values. A typical setting is alpha=3 and beta =0.03. Thus:
Thetapr=invgamma(3,0.03)
Note that the mean of the inverse gamma is:
β/(α – 1)
And the variance is:
β2 /(α-1)2(α-2)
So alpha should be set higher than 2, otherwise the prior probability for theta will have an undefined mean and/or variance.
BEST allows a different mutation rate for each gene, which is estimated along with the gene trees and species trees. The prior distribution used for this value is a uniform distribution, bounded with a minimum and maximum value. The standard setting is:
Genemupr=uniform(0.5,1.5)
If there is a lot of variation in mutation rate between your genes, the bounds of this distribution may need to be increased.
BEST proposes new species trees for a vector of gene trees by modifying the ‘maximum tree’ at a number of nodes. The maximum tree is the species tree with the longest possible branches that is still temporally compatible with the vector of gene trees. That is, no split in the species tree can take place earlier in time than in any of the gene trees. The number of nodes that this maximum tree is then modified at is controlled by a poisson distribution with a mean that we specify using the poissonmean command. A typical setting is poissonmean=5, though this may need to be changed (particularly for larger trees).
To speed convergence, BEST employs an annealing step (which down-weights the effect of the prior) in the beginning of the burnin period in order to allow the chain to move to areas of parameter space with high likelihood more quickly. Recall that the acceptance probabilities for a proposed change depend on both the likelihood and the prior, when the prior is downweighted, the Likelihood plays a proportionally larger role and the chain behaves more like a hill-climbing algorithm, moving rapidly toward areas of high likelihood. The annealing phase of the run is controlled by the propTemp command, which specifies the proportion of generations in which the annealing step is employed, e.g.
Proptemp = 0.05
would use annealing in the first 5% of generations. This should be set to include fewer generations than you plan to discard as burnin. Setting this too low can cause your run to require more generations to reach convergence, though setting it too high causes you to have to discard more of the run as burnin.
Finally, we need to tell BEST to actually do the species tree analysis (rather than a typical partitioned bayesian analysis) using the command
BEST=1
In all, your prset command should look like
prset thetapr=invgamma(3,0.003) GeneMuPr=uniform(0.5,1.5) poissonmean=5 BEST=1;
Plus any additional priors necessary for your model of molecular evolution.
Once the priors are properly set, all that remains is to unlink the proper parameters so that they are estimated independently for each gene
unlink topology=(all) brlens=(all) genemu=(all);
Then we can run the MCMC, using settings similar to MrBayes. For this tutorial, we’ll do 2 runs, each with one cold chain and one heated chain, for a small number of generations. Because the parameter space for a BEST run is enlarged relative to that for a MrBayes run, BEST runs typically take longer (sometimes a lot longer) to converge, so for a thorough analysis you’ll want to greatly increase the number of generations, make sure your chains are mixing efficiently, etc. See the MrBayes tutorial for more information about this.
mcmc ngen=100000 nrun = 2 nchain = 2 samplefreq=100;
Run the analysis!
(back in the terminal window now)
execute example.nex
Summarizing output
When the run finishes, the program will ask if you want to continue. Notice that the program prints the average standard deviation of split frequencies for each of the gene trees (rather than just one tree, as in a MrBayes analysis). As with MrBayes, you want these values to be well below 0.1, if they aren’t you’ll want to keep the run going. It’s likely that the chain hasn’t converged yet for this run because we ran very few generations, we’ll proceed anyway so say ‘no’.
If you look at the folder that your example file was in, you’ll notice that there are ‘.p’ files for each of the two runs and ‘.t’ files for each of two runs for each of the gene partitions, plus extra ‘.t’ files that are named ‘example.nex.sptree.X.t’. These latter two are the files containing the sampled species trees.
In the single allele case, summarizing runs is similar to MrBayes.
sump burnin=250
The output looks similar to MrBayes, except that we have information about the individual trees and mutation rates for each gene.
sumt burnin=250 filename=example.nex.sptree
This will give you a partition table with information about the mean and variance for the branch lengths and theta. In the background, the sumt command will produce a ‘sptree.con’ file. This file contains the consensus species tree with branch lengths, posterior probabilities, and the inferred theta for each node. The notation for this tree file is a little different than normal. The tree looks like:
(O:0.018476,((H:0.008843,C:0.008843)0.67:0.000356#0.000998,G:0.009199)1.00:0.009277#0.004647)
Which looks like a standard tree file except for those extra ‘#’ symbols followed by numbers. These are the inferred theta for each ancestral population. To get these to display in figtree, replace the ‘#’ with something like ‘&theta=’ and enclose the whole number in brackets, like so:
(O:0.018476,((H:0.008843,C:0.008843)0.67:0.000356[&theta=0.000998],G:0.009199)1.00:0.009277[&theta=0.004647])
Note: This seems to work only if the tree is in a nexus formatted file. See the comments at the bottom of this tutorial if you are trying to do something else.
BEST also outputs a ‘.parts’ and ‘.trprobs’ file as in MrBayes. If you’d like to summarize the individual gene ‘.t’ files, I would suggest starting up MrBayes and doing it there, I’ve run into some odd behavior when trying to do this in BEST.
Multiple alleles
Using BEST when you have multiple alleles per species involves two changes. First, the taxon sets have to be set up properly. See the example file yeast.nex for an example.
taxset s1= 1-9; taxset s2= 10-14; taxset s3= 15-17; taxset s4= 18; taxset s5= 19-21; taxset s6= 22;
Here, you define names for the taxon sets (the s1 through s6) and tell the program which taxa (in numerical order) belong to which species; thus, the first 9 individuals in the alignment belong to species 1. After this, you can set up and run the analysis exactly as before. Note that the yeast.nex example file also contains an example of a more complex mrbayes block that demonstrates using different models of molecular evolution for different partitions.
After the run ends, you can summarize the parameter file as before, but summarizing the tree files is slightly different. The issue is that the program currently has 22 taxa in memory, but the species trees contain only 6 taxa. Because of this, if you attempt to run ‘sumt’ on the species trees as before, you’ll get an error. The solution is to make a dummy file that will set the taxa in memory appropriately. See ‘dummy.txt’ for an example.
#NEXUS Begin data; Dimensions ntax=6 nchar=5; Format datatype=dna gap=-; Matrix s1 AAAAA s2 AAAAA s3 AAAAA s4 AAAAA s5 AAAAA s6 AAAAA ; end; begin mrbayes; prset best=1; sumt nrun=2 burnin=10 filename= yeast.nex.sptree; end;
Execute this file in BEST and you should get your summarized tree files as before.
References
Edwards, S.V., L. Liu, D.K. Pearl. 2007. High-resolution species trees without concatenation. PNAS 104:5936-5941
Liu, L., D.K. Pearl. 2007. Species trees from gene trees: Reconstructing bayesian posterior distributions of a species phylogeny using estimated gene tree distributions. Systematic Biology 56:504-514
Liu, L., D.K. Pearl, R.T. Brumfield, S.V. Edwards. 2008. Estimating species trees using multiple-allele DNA sequence data. Evolution 62: 2080-2091. doi:10.111/j.1558-5646.2008.00414.x
Edwards, S.V. 2009. Is a new and general theory of molecular systematics emerging?. Evolution 63: 1-19. doi:10.1111/j.1558-5646.2008.00549.x
for more references see the BEST webpage at:
http://www.stat.osu.edu/~dkp/BEST/references.html
Comments:
2009-03-26 18:57:44 I don't think (O:0.018476,((H:0.008843,C:0.008843)0.67:0.000356[&theta=0.000998],G:0.009199)1.00:0.009277[&theta=0.004647])
works. You need to use something like
(O:0.018476,((H:0.008843,C:0.008843)[&theta=0.000998]:0.000356,G:0.009199)[&theta=0.004647]:0.009277); —118.90.62.121
2009-03-27 08:26:18 Interesting. I'm not quite sure whats going on here. It works fine as a nexus tree file, but not as a newick. So if you simply replace the '#'s in the .con nexus file that you get back from the sumt command, it should work. If you extract that tree, and put it in its own file, you'd want to do something like what you suggest. Thanks for pointing it out! I'll add a note in the text above. —BobThomson


