| Primary Contact(s) | Created | Required Software | Example Datafile |
|
|
8 March 2009 |
|
monkeys.nex |
Summary
This tutorial is based on the MrBayes 3.1 Manual (F. Rhonquist, J. P. Huelsenbeck, and P. van der Mark, 2005) that is distributed with the software.
Tutorial
Set up
Installation
Instructions for downloading and installing MrBayes can be found in the manual. If you are able to install MrBayes in your “path” then you can call the program from any folder. If not, then your data set needs to be in the same folder as MrBayes.
Data files
We will start by viewing a data file. Get the monkeys.nex data file and put it in your MrBayes folder. Navigate to your MrBayes folder and open the monkeys.nex dataset. Open this file in TextWrangler (or other text editor). A text editor like TextWrangler is very useful for creating/editing nexus files, and can aid in finding errors in your files. Monkeys.nex is composed of 12 mitochondrial ND4 sequences (898 bp). The monkeys.nex data set is currently a single partition, and has no model specified. Therefore, an analysis of this data set would invoke the default settings in MrBayes.
Starting an Analysis
Start MrBayes by double-clicking the program icon (Windows or Macintosh) or by typing ./mb or mb (Unix).
To read data into MrBayes from the file monkeys.nex, type execute monkeys.nex or simply exe monkeys.nex. At the prompt, type help to see a list of the commands available in MrBayes. Most commands allow you to set values (options) for a range of parameters. If you type help <command>, where <command> is any of the listed commands, you will see the help information for that command as well as a description of the available options.
Specifying a model
1) lset –set the parameters of the likelihood model
2) prset –set the priors of the model
In general, lset is used to define the structure of the model and prset is used to define the prior probability distributions on the parameters of the model. Type help lset. By default, MrBayes divides the data into one partition for each type of data you have in your DATA block (below we will look at a partitioned model). We will analyze the data using a standard nucleotide substitution model, in which case the default 4by4 option is appropriate.
Starting an analysis now would invoke the default settings. However, we will assume that the appropriate model for this data set is the GTR+G+I model, and apply this model to monkeys.nex so we need to change the number of substitution types from the default Nst=1 to Nst=6, and rate setting from the default equal (no rate variation across sites) to Invgamma (gamma-shaped rate variation with a proportion of invariable sites). Do this by typing lset nst=6 rates=invgamma. MrBayes will acknowledge that it has changed the settings.
Setting the priors
We now need to set the priors for our model. There are six types of parameters in the model: 1) the topology, 2) the branch lengths, 3) the four stationary frequencies of the nucleotides, 4) the six different nucleotide substitution rates, 5) the proportion of invariable sites, and 6) the shape parameter of the gamma distribution of rate variation. The default priors in MrBayes work well for most analyses, and we will not change any of them here. However, it is good practice to go through all the priors and make sure you understand the default settings and the available options.Type help prset to obtain a list of the default settings for the parameters in your model.
We will focus on:
Revmatpr (for the six substitution rates of the GTR rate matrix),
Statefreqpr (for the stationary nucleotide frequencies),
Shapepr (for the shape parameter of the gamma distribution of rate variation),
Pinvarpr (for the proportion of invariable sites),
Topologypr (for the topology), and
Brlenspr (for the branch lengths).
The default prior probability density is a flat Dirichlet (all values are 1.0) for both Revmatpr and Statefreqpr. This is appropriate if we want to estimate these parameters from the data assuming no prior knowledge about their values. It is possible to fix the rates and nucleotide frequencies but this is generally not recommended. However, it is occasionally necessary to fix the nucleotide frequencies to be equal, for instance in specifying the JC and SYM models. This would be achieved by typing prset statefreqpr=fixed(equal).
*Note that MrBayes does not support some submodels of the GTR model.
The Shapepr parameter determines the prior for the shape parameter of the gamma
distribution of rate variation. We will leave it at its default setting, a uniform distribution
spanning a wide range of values. The prior for the proportion of invariable sites is set
with Pinvarpr. The default setting is a uniform distribution between 0 and 1, an
appropriate setting if we don’t want to assume any prior knowledge about the proportion
of invariable sites. For topology, the default Uniform setting for the Topologypr parameter puts equal probability on all distinct, fully resolved topologies. The Brlenspr parameter can either be set to unconstrained or clock-constrained. For trees without a molecular clock (unconstrained) the branch length prior can be set either to exponential or uniform. *The default exponential prior with parameter 10.0 should work well for most analyses.
* However, recent work has shown that in some circumstances, MrBayes can produce trees with artificially-long branch lengths (Brown et al., in press). In these cases, one would need to modify the branch length prior to put more emphasis on shorter branch lengths. A good way to check for abnormally long branch lengths is to compare ML tree lengths with those output from your Bayes analysis-they should be similar.
Checking the model
To check the model before we start the analysis, type showmodel. This will give an
overview of the model settings. Note that we have six types of parameters in our model. All of these parameters will be estimated during the analysis.
Setting up the analysis
Review the run settings by typing help mcmc. Analyses are started by issuing the mcmc command. However, the help mcmc command will show the default starting conditions without actually starting the analysis.
The Ngen setting is the number of generations for which the analysis will be run. It is useful to run a small number of generations first to make sure the analysis is correctly set up and to get an idea of how long it will take to complete a longer analysis. We will start with 10,000 generations. To change the Ngen setting without starting the analysis we use the mcmcp command, which is equivalent to mcmc except that it does not start the analysis. Type mcmcp ngen=10000 to set the number of generations to 10,000. You can type help mcmc to confirm that the setting was changed appropriately.
By default, MrBayes will run two simultaneous, completely independent analyses starting from different random trees (Nruns = 2). Running more than one analysis simultaneously is very helpful in determining when you have a good sample from the posterior probability distribution, so leave this setting as is. The idea is to start each run from different randomly chosen trees. In the early phases of the run, the two runs will sample very different trees but when they have reached convergence (when they produce a good sample from the posterior probability distribution), the two tree samples should be very similar.
By default, MrBayes uses Metropolis coupling to improve the MCMC sampling of the target distribution, and will discard the first 25% samples from the cold chain. The Swapfreq, Nswaps, Nchains, and Temp settings together control the Metropolis coupling behavior. When Nchains is set to 1, no heating is used. When Nchains is set to a value n larger than 1, then n - 1 heated chains are used. By default, Nchains is set to 4, meaning that MrBayes will use 3 heated chains and one "cold" chain. The effect of the heating is to flatten out the posterior probability, such that the heated chains more easily find isolated peaks in the posterior distribution and can help the cold chain move more rapidly between these peaks.
The Samplefreq setting determines how often the chain is sampled. By default, the chain is sampled every 100th generation, and this works well for most analyses. However, our analysis is so small that we are likely to get convergence quickly, so it makes sense to sample the chain more frequently, say every 10th generation (this will ensure that we get at least 1,000 samples when the number of generations is set to 10,000). To change the sampling frequency, type mcmcp samplefreq=10. When the chain is sampled, the current values of the model parameters are printed to files. The substitution model parameters are printed to a .p file (in our case, there will be one file for each independent analysis, and they will be called monkeys.nex.run1.p and monkeys.nex.run2.p). The .p files are tab delimited text files that can be imported into most statistics and graphing programs. The topology and branch lengths are printed to a .t file (in our case, there will be two files called monkeys.nex.run1.t and monkeys.nex.run2.t). The .t files are Nexus tree files that can be imported into programs like PAUP* and FigTree.
The Printfreq parameter controls the frequency with which the state of the chains is printed to the screen. Leave Printfreq at the default value (print to screen every 100th generation). The default behavior of MrBayes is to save trees with branch lengths to the .t file. Since this is what we want, we leave this setting as is.
Running the Analysis
Finally, we are ready to start the analysis. Type mcmc. MrBayes will first print information about the model and then list the proposal mechanisms that will be used in sampling from the posterior distribution. The first column lists the generation number. The following four columns with negative numbers each correspond to one chain in the first run. Each column corresponds to one physical location in computer memory, and the chains actually shift positions in the columns as the run proceeds. The numbers are the log likelihood values of the chains. The chain that is currently the cold chain has its value surrounded by square brackets, whereas the heated chains have their values surrounded by parentheses. When two chains successfully change states, they trade column positions (places in computer memory). If the Metropolis coupling works well, the cold chain should move around among the columns; this means that the cold chain successfully swaps states with the heated chains. If the cold chain gets stuck in one of the columns, then the heated chains are not successfully contributing states to the cold chain, and the Metropolis coupling is inefficient. The analysis may then have to be run longer or the temperature difference between chains may have to be lowered. The star column separates the two different runs. The last column gives the time left to completion of the specified number of generations.
*Note that all the commands you just typed in can be added to the MrBayes block (see below).
When to stop the analysis
Generally, a conservative approach is to set up your analyses to sample millions of generations in order to generate a large sample of the stationary posterior probability distribution. However, this can take several hours or days to complete so we will sample a relatively brief run.
At the end of the run, MrBayes asks whether or not you want to continue with the analysis (you can set MrBayes to automatically stop the analysis when it has reached the specified number of generations by using the set autoclose=yes command before starting the analysis). Before answering that question, examine the average standard deviation of split frequencies. As the two runs converge onto the stationary distribution, we expect the average standard deviation of split frequencies to approach zero, reflecting the fact that the two tree samples become increasingly similar. In our case, the average standard deviation is about 0.07 after 1,000 generations and then drops to less than 0.000001 towards the end of the run. Your values can differ slightly because of stochastic effects. Given the extremely low value of the average standard deviation at the end of the run, there appears to be no need to continue the analysis beyond 10,000 generations so when
MrBayes asks “Continue with analysis? (yes/no):”, stop the analysis by typing no.
Using a convergence diagnostic is recommended, such as the standard deviation of split frequencies or other software (such as AWTY or MrConverge) to determine run length. In the beginning of the run, the values typically increase rapidly (the absolute values decrease, since these are negative numbers). This phase of the run is referred to as the “burn-in” and the samples from this phase are typically discarded. Once the likelihood of the cold chain stops increasing and starts to randomly fluctuate within a more or less stable range, the run may have reached stationarity, that is, it is producing a good sample from the posterior probability distribution. At stationarity, we also expect different, independent runs to sample similar likelihood values.
Summarizing Samples of Substitution Model Parameters
sump
During the run, samples of the substitution model parameters have been written to the .p files every samplefreq generation. The first number, in square brackets, is a randomly generated ID number that lets you identify the analysis from which the samples come. The next line contains the column headers, and is followed by the sampled values. From left to right, the columns contain: (1) the generation number (Gen); (2) the log likelihood of the cold chain (LnL); (3) the total tree length (the sum of all branch lengths, TL); (4) the six GTR rate parameters (r(A<->C), r(A<->G) etc); (5) the four stationary nucleotide frequencies (pi(A), pi(C) etc); (6) the shape parameter of the gamma distribution of rate variation (alpha); and (7) the proportion of invariable sites (pinvar). If you use a different model for your data set, the .p files will of course be different. MrBayes provides the sump command to summarize the sampled parameter values. Before using it, we need to decide on the burn-in. Since the convergence diagnostic we used previously to determine when to stop the analysis discarded the first 25% of the samples and indicated that convergence had been reached after 10,000 generations, it makes sense to discard 25 % of the samples obtained during the first 10,000 generations. Since we sampled every 10th generation, there are 1,000 samples (1,001 to be exact, since the first generation is always sampled) and 25% translates to 250 samples. Thus, summarize the information in the .p file by typing sump burnin=250. By default, sump will summarize the information in the .p file generated most recently, but the filename can be changed if necessary. The sump command will first generate a plot of the generation versus the log probability of the data (the log likelihood values). If we are at stationarity, this plot should look like ‘white noise’, that is, there should be no tendency of increase or decrease over time. If you see an obvious trend in your plot, either increasing or decreasing, you probably need to run the analysis longer to get an adequate sample from the posterior probability distribution. At the bottom of the sump output, there is a table summarizing the samples of the parameter values:For each parameter, the table lists the mean and variance of the sampled values, the lower and upper boundaries of the 95% credibility interval, and the median of the sampled values. The parameters are the same as those listed in the .p files: the total tree length (TL), the six reversible substitution rates (r(A<->C), r(A<->G), etc), the four stationary state frequencies (pi(A), pi(C), etc), the shape of the gamma distribution of rate variation across sites (alpha), and the proportion of invariable sites (pinvar). Note that the six rate parameters of the GTR model are given as proportions of the rate sum (the Dirichlet parameterization). This parameterization has some advantages in the Bayesian context; in particular, it allows convenient formulation of priors. If you want to scale the rates relative to the G-T rate, just divide all rate proportions by the G-T rate proportion. The last column in the table contains a convergence diagnostic, the Potential Scale Reduction Factor (PSRF). If we have a good sample from the posterior probability distribution, these values should be close to 1.0. If you have a small number of samples, there may be some spread in these values, indicating that you may need to sample the analysis more often or run it longer.
Summarizing samples of trees and branch lengths
Trees and branch lengths are printed to the .t files. These files are Nexus-formatted tree files. To summarize the tree and branch length information, type sumt burnin=250. The sumt and sump commands each have separate burn-in settings so it is necessary to give
the burn-in here again.
The sumt command will output, among other things, summary statistics for the taxon bipartitions, a tree with clade credibility (posterior probability) values, and a phylogram. Then it gives the number of times the partition was sampled (#obs), the probability of the partition (Probab.), the standard deviation of the partition frequency (Stdev(s)), the mean (Mean(v)) and variance (Var(v)) of the branch length, the Potential Scale Reduction Factor (PSRF), and finally the number of runs in which the partition was sampled (Nruns). In our analysis, there is overwhelming support for a single tree, so all partitions have a posterior probability of 1.0. The clade credibility tree gives the probability of each partition or clade in the tree, and the phylogram gives the branch lengths measured in expected substitutions per site. The sumt command creates three additional files. The first is a .parts file, which contains the list of taxon bipartitions, their posterior probability (the proportion of sampled trees containing them), and the associated branch lengths. The second generated file has the suffix .con and includes two consensus trees. The first one has both the posterior probability of clades (as interior node labels) and the branch lengths.
Analyzing a Partitioned Data Set
MrBayes handles a wide variety of data types and models, as well as any mix of these models. In this example we will set up a partitioned analysis of the ND4 mitochondrial data since these sequences are coding regions. First, open up the monkeys.nex file in TextWrangler. The Nexus file contains a MrBayes block at the bottom of the file enclosed by [ and ] MrBayes blocks begin with begin mrbayes; and end with
end;
The block contains various instructions to the MrBayes program. MrBayes will skip everything enclosed within square brackets so this MrBayes block was ignored during the previous analyses.
Dividing the Data into Partitions
To divide the data into, in this case, codon positions, it is convenient to first specify character sets. In principle, this can be done from the command line but, it is much more convenient to create a MrBayes block that contains all of this information.
begin mrbayes; charset 1st=1-898\3; charset 2nd=2-898\3; charset 3rd=3-898\3; partition codons=3:1st,2nd,3rd; set partition=codons; unlink shape=(all) pinvar=(all) statefreq=(all) revmat=(all); prset applyto=(all) ratepr=variable; lset applyto=(1,2) nst=2 rates=gamma; lset applyto=(3) nst=6 rates=gamma; prset applyto=(3) pinvarpr=fixed(0); mcmcp ngen=10000 samplefreq=10 file=monkeys_partitioned; end;
The new commands in this file are: charset, partition, set partition, unlink, applyto, and a new parameter: ratepr.
The charset command simply associates a name with a set of characters. In our example, we assigned all first codon positions to a single partition using the command charset 1st=1-898\3; the character set “1st” is defined above as the first codon positions of the coding ND4 gene (i.e. nucleotides 1,4,7...898). The next step is to define a partition of the data according to codon position. This is accomplished with the line partition codons=3:1st,2nd,3rd;
Next, we tell MrBayes that we want to work with this partitioning of the data instead of the default partitioning. We do this using the set command: set partition=codons;
When we read this block into MrBayes, we will get a partitioned model with the first character division being the first position of the codon, the second division being the second position of the codon etc. In addition, we want a separate model for each partition. Thus, all the parameters should be estimated separately for the individual data partitions. Assume further that we want the overall evolutionary rate to be allowed to vary across different partitions. We can obtain this model by using lset and prset with the applyto mechanism, which allows us to apply the settings to specific partition. With the MrBayes block above, we applied an HKY + G model to the 1st and 2nd partitions, and a GTR+G+I model to the 3rd partition, we could also enter these commands through the command line by typing lset applyto=(1,2) nst=2 rates=gamma, and prset applyto=(1,2) pinvarpr=fixed(0). This takes care of partitions 1 and 2, and lset applyto=(3) nst=6 rates=gamma would set the model for the 3rd partition.
As you can see, all molecular partitions now evolve under the desired model. However, by default, all parameters (statefreq, revmat, shape, pinvar) would be shared (linked) across partitions unless we unlink allowing each partition to have its own set of parameters. The parameters are unlinked using the commands: unlink shape=(all) pinvar=(all) statefreq=(all) revmat=(all).
One more important detail is to allow the overall rate to be different across partitions. This is achieved using the ratepr parameter of the prset command. By default, ratepr is set to fixed, meaning that all partitions have the same overall rate. By changing this to variable, the rates are allowed to vary under a flat Dirichlet prior. We allow all our partitions to evolve under different rates using the commands:
prset applyto=(all) ratepr=variable.
The model has been completely specified so now we can proceed with the analysis.
Type 'showmodel' to confirm that each partition has its own model, and run the analysis.
A final word on partitioned data sets—What is the appropriate number of partitions? Our goal is to improve the fit, but avoid overparameterization of the model. In addition, even a relatively simple data set like the monkeys.nex example can be partitioned in 5 different ways: 1 partition, 2 partitions (1st combined with 2nd + 3rd), 2 partitions (1st, 3rd + 2nd), 2 partitions (2nd, 3rd + 1st), and 3 partitions (1st + 2nd + 3rd). Obviously, the number of possible partitioning schemes increases greatly with additional data. So how do you determine the “best” partitioning scheme? A common approach is to use comparisons of Bayes factors. A good example of this approach can be found in Nylander et al. (2004), but this approach is not practical for data sets with numerous partitions. Li et al. (2008) present a cluster analysis method to determine optimal data partitioning, and this method appears to be able to assess numerous partitioning schemes, but has not yet been rigorously assessed. In any event, simply partitioning your data by “obvious” partitions (i.e. by codon, or by gene) is not necessarily the optimal partitioning strategy.
References
Li, C., G. Lu, and G. Orti. 2008. Optimal data partitioning and a test case for Ray-finned fishes (Actinopterygii) based on ten nuclear loci. Systematic Biology, 57:519-539.
Nylander, J. A. A., F. Ronquist, J. P. Huelsenbeck, and J. L. Nieves-Aldrey. 2004. Bayesian phylogenetic analysis of combined data. Systematic Biology 53:47–67.
Comments:
Note: You must be logged in to add comments

