Slides from Jeremy Brown's lecture on MrBayes: Bodega12_JMB.pdf

MrBayes Tutorial for Modena: MrBayes_tutorial.zip

Jeremy Brown introduces the class to Bayesian inference in MrBayes (March 7, 2010)

Primary Contact(s) |

Jeremy Brown |

Created |

7 March 2010 (updated 13 March 2012) |

Required Software |

MrBayes3.2.1 |

Example Datafile |

This tutorial uses one example file for MrBayes |

primates.nex |

## Summary

A description from the MrBayes webpage:

MrBayes is a program for the Bayesian estimation of phylogeny. Bayesian inference of phylogeny is based upon a quantity called the posterior probability distribution of trees, which is the probability of a tree conditioned on the observations. The conditioning is accomplished using Bayes's theorem. The posterior probability distribution of trees is impossible to calculate analytically; instead, MrBayes uses a simulation technique called Markov chain Monte Carlo (or MCMC) to approximate the posterior probabilities of trees.

## Introduction

This tutorial uses MrBayes version 3.2.1 and one example data file: primates.nex.

## Tutorial

### Installation

MrBayes can be downloaded from the MrBayes website. Pre-compiled, executable versions of MrBayes for PCs or Macs are available for direct download. Both serial and parallel versions are available. Parallel versions allow a single analysis to take advantage of multiple processors, if they are available.

The source code can also be downloaded and compiled locally for use on Unix systems or for command-line use on other systems. Command-line versions will facilitate batch and cluster-based analyses. Instructions for compilation are given on the website.

### Starting MrBayes

If you have downloaded a pre-compiled executable version of MrBayes, simply double-click on the icon to start MrBayes. If you have compiled the source code locally, you can execute MrBayes from the command line (e.g., ./mb in Unix), once you have set your working directory properly.

Once MrBayes has started, it should print out an intro splash screen and then wait for your commands at the command prompt:

`MrBayes > `

### Getting Help

The single most important MrBayes command is *help*. Simply typing *help* on the command line,

`MrBayes > help `

will provide a list of the possible MrBayes commands and a brief description of their purpose. If you would like more information on a particular command, including a list of its associated parameters and their current settings, use *help <command_name>* where *<command_name>* should be replaced by the name of the command in which you are interested. For example, to get help on the *execute* command, type:

`MrBayes > help execute `

### Loading and Manipulating Data

The first command that you will need to use MrBayes is *execute*. This command loads information (either a data matrix or a set of MrBayes commands) from a file. The file should be nexus-formatted and placed in the same directory as the MrBayes executable. To load the primate sequence data, place the primates.nex file in the same folder as MrBayes and then type:

`MrBayes > execute primates.nex`

The *execute* command can be used identically with a file containing MrBayes commands. If this file contains all the commands needed for an analysis, one could simply open MrBayes and execute this command file at the command prompt. You could also run MrBayes with a command file by specifying the file name (e.g., mb_cmds.nex) on the command line when executing MrBayes. For instance, in Unix:

`mb mb_cmds.nex`

[Note: This command line should work if you've run the MrBayes installer, so the MrBayes binary has been automatically placed in your path. If you've compiled MrBayes from source code on your own, you'll need to move the binary to the folder where you want to run your analysis. You'll then need to type `./mb mb_cmds.nex` instead.]

Once data has been loaded into MrBayes, sites can be excluded from an analysis using the *exclude* command. Sites are designated for exclusion based on the number corresponding to their column in an alignment. For instance, to exclude sites 90-99 in the primates data that we already loaded, type:

`MrBayes > exclude 90-99`

If sites need to be re-included, use the *include* command:

`MrBayes > include 90-99`

Often, one might want to include/exclude every *n*-th character. For instance, defining a character set that starts with the first character and includes every 3rd character would circumscribe all 1st codon positions. This type of inclusion/exclusion is accomplished by using the notation \3 at the end of the character list. For example, to exclude all 1st codon positions, type:

`MrBayes > exclude 1-. \3`

The `"."` symbol is used to mean the last position in our data set. To exclude all 2nd codon positions, use a very similar command, but start the character list with the 2nd site:

`MrBayes > exclude 2-. \3`

The inclusion/exclusion status of all sites in our dataset can be viewed using the *charstat* command.

### Specifying a Model of Sequence Evolution

#### Basic Model Manipulations

Most model specification is done using the *lset* command. The most common model specifications performed in *lset* involve the number of substitution types and the model of rate variation across sites. For instance, the number of unique substitution rates can be specified using the *nst* parameter. Possible values for *nst* are 1 (as in the Jukes-Cantor model), 2 (as in the HKY model), and 6 (as in the GTR model). Possible models of rate variation specified with the *rates* parameter include a proportion of invariable sites (I), gamma-distributed rates across sites (G), or a combination of the two (I+G). For example, you could specify a GTR+G model by typing:

`MrBayes > lset nst=6 rates=gamma`

In the newest version of MrBayes (3.2), one can also avoid having to specify only one scheme of substitution types by allowing MrBayes to move across different schemes as part of its MCMC sampling. This procedure is known as reversible jump MCMC (RJ-MCMC). To set up reversible jumping, use the following command:

`MrBayes > lset nst=mixed rates=gamma`

Reversible jumping is not currently set up for different models of rate variation across sites, so you will still need to specify +I, +G, or +I+G.

While I will not provide detailed instructions here on analyses with doublet and codon models in MrBayes, these models can also be specified using the *lset* command. See the MrBayes manual for more information.

By default, MrBayes allows the stationary frequencies of the four nucleotides to be estimated from the data. To fix these frequencies at particular values (often 0.25 for all four nucleotides), use the *prset* command. MrBayes considers the fixation of base frequencies to be a prior setting, rather than a setting of the likelihood model. For instance, to fix all frequencies to be equal, type:

`MrBayes > prset statefreqpr=fixed(equal)`

#### Partitioning

It is now widely recognized that the evolutionary process has not been homogeneous across sites. One approach to accomodating heterogeneity in the evolutionary process across sites is to divide sites into distinct subsets (a process called partitioning) and model the evolution of each subset using an independent Markov model of nucleotide substitution.

The first step in performing a partitioned analysis is to define the distinct subsets of your data. These definitions are made using the *charset* command. The syntax for specifying sites with *charset* is the same as with the *include* and *exclude* commands. For instance, to specify a subset corresponding to the first codon position, type:

`MrBayes > charset cod.pos.1 = 1-. \3`

Subsets corresponding to codon positions 2 and 3 could be similarly specified as:

`MrBayes > charset cod.pos.2 = 2-. \3` `MrBayes > charset cod.pos.3 = 3-. \3`

Note that each site must be assigned to one and only one subset to perform a partitioned analysis.

Once all of the appropriate character sets have been defined, they need to be explicity combined into a partition. Appropriately, this is done with a command called *partition*. For instance, to specify a partitioning scheme with 3 subsets corresponding to the three codon positions, type

`MrBayes > partition cod.pos = 3:cod.pos.1,cod.pos.2,cod.pos.3`

The number immediately preceding the `":"` gives the number of subsets in the partioning scheme.

After defining a partitioning scheme, you need to tell MrBayes that you actually want to use that scheme. This is done with the *set* command. So, to use the `"cod.pos"` partitioning scheme, type

`MrBayes > set partition=cod.pos`

Now that the partitioning scheme is all set, you need to define models of evolution for each subset. As above, this is done with the *lset* command. The one additional consideration is that you need to tell MrBayes to which subsets you want any particular *lset* call to apply. To apply one *lset* command to all subsets within a partitioning scheme, type

`MrBayes > lset applyto=(all) nst=6 rates=gamma`

If you want to apply different *lset* calls to different partitions, use the numbers corresponding to the subset (defined by the order in which the subsets are listed in the partition definition). So, to apply an HKY model to codon positions 1 & 2 and a GTR+G model to codon position 3, type

`MrBayes > lset applyto=(1,2) nst=2` `MrBayes > lset applyto=(3) nst=6 rates=gamma`

After defining the appropriate models, you need to explicitly tell MrBayes that parameters (some or all) need to be estimated independently (or `"unlinked"`) for each partition. The `"unlinking"` of parameters across parititions is accomplished with the *unlink* command

`MrBayes > unlink revmat=(all) tratio=(all) statefreq=(all) shape=(all)`

Each parameter (or related set of parameters) needs to be unlinked individually. `"revmat"` corresponds to the relative rates of substitution, `"tratio"` corresponds to the transition/transversion rate ratio, `"statefreq"` corresponds to the stationary nucleotide frequencies, and `"shape"` corresponds to the alpha shape parameter of the gamma distribution that models rate variation across sites. You may also need to unlink the proportion of invariable sites across partitions if they're included in your models.

To check that you've unlinked parameters appropriately across partitions, use the *showmodel* command.

It is also possible to allow the overall tree length to be estimated independently for each subset, through the use of independent `"scaling parameters"`. By default, these scaling parameters are linked across partitions. Unlike the unlinking of other parameters, the scaling parameters are unlinked using *prset* by typing

`MrBayes > prset ratepr=variable`

Note that the use of rate multipliers across partitions still assumes that the relative branch lengths in the tree are identical across partitions. If, instead, one wants to model different relative branch lengths across partitions, *every* branch length can be estimated independently across partitions. The unlinking of every branch length across partitions is done with *unlink*

`MrBayes > unlink brlens=(all)`

*Unlink* commands can be reversed using the *link* command, with identical syntax.

### Specifying Priors on Model Parameters

Priors for nearly all parameters (i.e., relative rates of substitution, base frequencies, branch lengths, etc) can be set using the *prset* command. The available prior distributions vary by parameter type. Lots of information on possibile distributions can be gained from *help prset*. As one example, a more informative prior around equal base frequencies (but not fixed at equal frequencies) can be set by increasing the values of the Dirichlet parameters

`MrBayes > prset applyto=(all) statefreqpr=Dirichlet(100,100,100,100)`

One particularly important prior to consider is the branch-length prior. This prior is important because this single distribution is applied to *all* the branch lengths in the tree, which can be a very large number of independent parameters. This is a particularly good prior to try testing for prior sensitivity. In other words, run multiple analyses with different branch-length priors and examine the sensitivity of the resulting posteriors to the specification of the prior. Also note that the parameter specified for the exponential prior on branch lengths is the *rate* of the exponential distribution, which is the inverse of its mean. To specify an exponential prior that is pushed up more tightly around small branch lengths, you should specify a *larger* rate parameter. For instance,

`MrBayes > prset brlenspr=Unconstrained:Exp(100)` - Exponential prior on branch lengths with a mean of 0.01. `MrBayes > prset brlenspr=Unconstrained:Exp(10)` - Exponential prior on branch lengths with a mean of 0.1. (Default in MrBayes) `MrBayes > prset brlenspr=Unconstrained:Exp(1)` - Exponential prior on branch lengths with a mean of 1.

It has also been shown that the branch-length prior specification can affect the inferred posterior probabilities of the topology. Therefore, when performing prior sensitivity analyses with different branch-length priors, it might be important to monitor sensitivity to inferred branch lengths and topologies.

### Manipulating Markov chain Monte Carlo (MCMC) Settings

There are a huge number of potential settings pertaining to the Markov chain Monte Carlo (MCMC) analysis that can be manipulated in MrBayes. Most of these changes are made using the *mcmcp* and *mcmc* commands. These commands differ only in whether or not they begin the Markov chain when the command is issued. *mcmcp* just sets the parameter values *without* starting the analysis. *mcmc* will begin the analysis after setting the values of parameters. Some of the MCMC parameters that can be set using these commands include:

*ngen* - The number of MCMC generations MrBayes will run before pausing *nruns* - The number of independent MCMC analyses run by MrBayes *nchains* - The number of Metropolis-coupled chains for each independent MCMC analysis *samplefreq* - The frequency with which MrBayes writes output to files *printfreq* - The frequency with which MrBayes prints output to the screen *filename* - The root name for output file names *temp* - The degree of heating for Metropolis-coupled chains

Many other MCMC options can be set with these commands. The full list of these options can be obtained by typing *help mcmcp* or *help mcmc*.

The other command useful in tweaking the details of the MCMC analysis is the *props* command. *props* can alter the relative frequency of proposals to different parameters, as well as the distributions from which proposals are drawn. Proposal distributions differ according to the parameter of interest. The final page of the MrBayes manual lists the proposal distributions for each parameter type and their associated tuning parameters. It is also important to note that in the current version of MrBayes (3.1.2), the *props* command can only be issued `"interactively"`. In other words, you cannot put a *props* command in a MrBayes block of a command file. Typing *props* will cause MrBayes to prompt you for more information about how to change proposal frequencies and distributions. Finally, a **'word of caution**'. Certain changes to proposals can make it virtually impossible for the MCMC analysis to mix properly across the posterior distribution and give biased results. Using rigorous checks of convergence should allow you to detect cases of very poor mixing.

### Summarizing Output

After you have run an MCMC analysis in MrBayes and made sure that your runs have converged (a topic not covered here), you can summarize the estimated posterior distributions of both the parameter values and trees using the *sump* (parameters) and *sumt* (trees) commands. You can also discard those samples biased by the starting point of an MCMC analysis, a process known as discarding the `"burn-in"`. Removing burn-in when summarizing the posterior distributions is done by specifying a value for the *burn-in* parameter with *sump* and *sumt*

`MrBayes > sump burnin=1000` - Summarizing posterior samples of parameter values, discarding 1,000 *samples* (NOT generations) `MrBayes > sumt burnin=1000` - Summarizing posterior samples of trees, discarding 1,000 *samples*

Several other parameters can be specified when performing posterior summaries. A full list can be obtained by typing *help sump* or *help sumt*.

The output of the *sump* command will include means, medians, variances, and 95% credible intervals for all scalar parameters.

The output of the *sumt* command includes estimated posterior probabilities of branches (.parts file), estimated posterior probabilities of trees (.trprobs file), and a majority-rule consensus tree calculated from the posterior tree samples (.con file).

Other software (some listed below) can help you summarize posterior distributions of trees and parameter values, as well as diagnose convergence.

### Estimating Marginal Likelihoods Using Steppingstone Sampling

The marginal likelihood of a model is an important quantity that allows comparison of fit across different models, accounting for uncertainty in the values of model parameters. The simplest way to estimate the marginal likelihood of a model based on the output of MCMC sampling of the posterior is to calculate the harmonic mean of their likelihood scores. However, the harmonic mean can have some highly undesirable statistical properties, including bias and extremely high variance. This leads to low repeatability across replicates for estimates that are inaccurate. A much more accurate and repeatable approach was recently developed, also based on importance sampling, and is known as steppingstone sampling (Xie et al., 2011; Fan et al., 2011). In this approach, an MCMC chain is still used, but it samples a series of "power posterior distributions" that move progressively between the posterior and the prior (or vice versa). A theoretical treatment of this approach is beyond the scope of this tutorial, but I encourage any user interested in estimating marginal likelihoods to read the papers describing and characterizing this approach (Xie et al., 2011; Fan et al., 2011).

To estimate marginal likelihoods via steppingstone in MrBayes, one first needs to specify appropriate settings. This can be done using the `mcmcp` and `ssp` commands. Relevant parameters include the total number of generations sampled (set with `mcmcp`), the burn-in before the first step of the sampling (set with `ssp`), the burn-in for each sampling step (set with `mcmcp`), and the number of different power posterior distributions sampled (set with `ssp`).

After everything is set up, start the steppingstone sampling process using the simple `ss` command:

`MrBayes > ss`

Once sampling is completed across all power posterior distributions, MrBayes will provide an estimate of the marginal likelihood. It is good practice to estimate these values several times with independent runs to ensure their repeatability.

NOTE: Because MrBayes is sampling many different distributions that vary between the posterior and prior when steppingstone is run, do NOT use the trees and parameter files from an `ss` run to estimate posterior probabilities and credible intervals.

### Related Software

#### Tracer

Tracer is a really nice piece of software written by Andrew Rambaut to summarize posterior distributions of scalar values (e.g., log-likelihoods and model parameters). It was originally written to summarize output from BEAST, but it also works quite nicely with MrBayes .p files. Tracer allows the generation of trace plots, plots of marginal distributions (both single and joint), and summary statistics of marginal distributions.

#### AWTY

AWTY is a system for the graphical exploration of MCMC convergence, written by Jim Wilgenbusch, Dan Warren, and David Swofford. It's currently available as a web-based tool and will generate plots comparing the estimated posterior distributions of branch support from multiple independent MCMC runs.

#### TreeSetViz

TreeSetViz is a module for Mesquite implementing multidimensional scaling to visualize the distances between a set of phylogenetic trees in two dimensions, written by a number of students and faculty at City University of New York (CUNY) and the Univ. of Texas. It is extremely useful for visualizing the movement of an MCMC run through tree space, as well as comparing the tree topologies sampled by independent analyses.

NOTE: As of the last update of this wiki page, TreeSetViz was not compatible with the most recent version of Mesquite. However, older versions of Mesquite that remain compatible with TreeSetViz are still available.

#### TreeScaper

TreeScaper allows visualization of the distances between a set of trees in 2D (similar to TreeSetViz) or 3D using multidimensional scaling. TreeScaper is stand-alone software and provides a good deal of flexibility in projecting the tree-to-tree distances. It is useful for many applications, including the comparison of trees sampled across independent MrBayes runs or the comparison of trees sampled during analyses of different genes.

## References

Fan, Y., R. Wu, M.-H. Chen, L. Kuo, and P.O. Lewis. 2011. Choosing among partition models in Bayesian phylogenetics. *Systematic Biology* 28: 523-532. Available Here

Hillis, D.M., T.A. Heath, and K. St. John. 2005. Analysis and visualization of tree space. *Systematic Biology* 54: 471-482. Available Here

Ronquist, F. and J.P. Huelsenbeck. 2003. MrBayes 3: Bayesian phylogenetic inference under mixed models. *Bioinformatics* 19:1572-1574 Available Here

Ronquist, F., M. Teslenko, P. van der Mark, D.L. Ayres, A. Darling, S. Hohna, B. Larget, L. Liu, M.A. Suchard, and J.P. Huelsenbeck. 2012. MrBayes3.2: Efficient Bayesian phylogenetic inference and model choice across a large model space. *Systematic Biology* In Press Available Here

Wilgenbusch, J.C., D.L. Warren, and D.L. Swofford. 2004. AWTY: A system for graphical exploration of MCMC convergence in Bayesian phylogenetic inference Available Here

Xie, W., P.O. Lewis, Y. Fan, L. Kuo, and M.-H. Chen. 2011. Improving marginal likelihood estimation for Bayesian phylogenetic model selection. *Systematic Biology* 60:150-160 Available Here

### Comments:

**Note: You must be logged in to add comments**