Created by Brian Moore
BEAST reads input files written in XML (the Extensible Markup Language), a format similar to the more familiar HTML (HyperText Markup Language) used in web applications. This may seem an odd choice: XML can be intimidating for users who are not familiar with it, and it tends to be quite verbose. On the other hand, the format affords great flexibility for specifying almost arbitrarily complex analyses. In any case, if we want to use BEAST, we need to gain some familiarity with this file format, which we hope to achieve through the following exercise.
There are two main steps to formatting an input file for divergence time estimation using BEAST: generating a base file with BEAUti, and then modifying it with a text editor. The latter is described in Part 3 of the tutorial.
Open the program BEAUti. This is a ‘helper’ application for BEAST that reads the more standard NEXUS file format and generates (almost usable) XML files for basic analyses with BEAST.
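For readers who have not worked with XML before, the following minimal Python sketch shows the basic ingredients of the format: a single root element, nested child elements, and attributes. The element and attribute names used here (beast, taxa, taxon, id) and the taxon ‘species_A’ are illustrative assumptions; the fragment is not actual BEAUti output and is not guaranteed to match what BEAUti writes for this data set.

```python
import xml.etree.ElementTree as ET

# A tiny, hypothetical XML fragment (NOT actual BEAUti output).
# 'species_A' is a made-up placeholder taxon.
SNIPPET = """
<beast>
  <taxa id="taxa">
    <taxon id="Platanus_kerrii"/>
    <taxon id="species_A"/>
  </taxa>
</beast>
"""


def list_taxon_ids(xml_text):
    """Return the 'id' attribute of every <taxon> element, in document order."""
    root = ET.fromstring(xml_text)  # raises ParseError if the text is not well-formed
    return [taxon.get("id") for taxon in root.iter("taxon")]


if __name__ == "__main__":
    print(list_taxon_ids(SNIPPET))
```

If a closing tag is deleted from the fragment, ET.fromstring raises a ParseError, which is a quick way to check that a hand-edited file is still well-formed.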
Step 1: Getting a NEXUS file into BEAUti
From the File menu, select the ‘Import NEXUS’ option, and navigate to the directory that contains the file ‘Platanus_DTE.nex’.
Step 2: The data window
The alignment will be displayed in the Data pane of the BEAUti window. We are estimating divergence times for a species phylogeny, so ensure that dates are specified as ‘years since some time in the past’.
Step 3: Defining groups of taxa
Move to the Taxa pane. This provides options for specifying one or more subsets of species in your data that you may wish to name and/or constrain to be monophyletic. We will first describe how to designate a set of taxa, which we will call ‘clade_1’, and how to enforce the monophyly of this clade using BEAUti.
Click on the ‘+’ button located in the lower left hand corner of the BEAUti window. This will cause the text ‘untitled1’ to appear in the left-most panel, and a complete list of taxa to appear in the center panel of the window.
Click on the text ‘untitled1’ located in the left-most panel, and change the name to ‘clade_1’, then check the box labeled ‘monophyletic’ to enforce the monophyly of clade_1.
To define the set of species that belong to clade_1, select the lower four species (i.e., all but ‘Platanus_kerrii’), and then click the arrow in the center of the window to move these four species from the center ‘Excluded taxa’ panel to the rightmost ‘Included taxa’ panel.
NOTE: To help you become comfortable with XML, we are going to specify these options manually using a text editor in Part 3 of the tutorial. Accordingly, please click on the ‘-’ button located in the lower left-hand corner of the BEAUti window to delete the taxon set that we defined above.
Step 4: Model specification
Move to the Model pane. This provides options for specifying a model of nucleotide substitution, a model for accommodating among-site substitution-rate variation, and a model for accommodating variation in substitution rate across branches. Let’s imagine that we have previously performed a model selection analysis (e.g., by means of the AIC as implemented in ModelTest), and that this procedure has identified the general time reversible (GTR) model with Gamma-distributed rate variation across sites. Furthermore, let’s imagine that we have detected significant substitution-rate variation among lineages in this data set (e.g., by means of an hLRT implemented in PAUP*), which suggests that these data do not conform to the molecular clock hypothesis. Accordingly, we want to reflect these findings by making the following model specifications:
Select GTR from the ‘Substitution Model’ pull-down menu, for which we will use empirical base frequencies.
Next, select ‘Gamma’ from the ‘Site Heterogeneity Model’ pull-down menu; we will approximate the continuous Gamma distribution with four discrete rate categories (note that run time increases roughly in proportion to N, the number of rate categories).
These sequence data apparently deviate from the molecular clock, so uncheck the ‘Fix mean substitution rate’ box; this will generate a warning that we will need to impose a prior on the substitution rate or on the age of one or more nodes in the tree. Click ‘OK’, and select the ‘Relaxed Clock: Uncorrelated Lognormal’ option from the ‘Molecular Clock Model’ pull-down menu.
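The ‘empirical’ option above means that base frequencies are taken from the observed proportions of A, C, G and T in the alignment. A minimal Python sketch of that calculation follows; the toy sequences are made up, and the choice to skip gaps and ambiguity codes is an assumption of this sketch rather than a statement about how BEAST handles them.

```python
from collections import Counter


def empirical_base_frequencies(sequences):
    """Proportion of A, C, G and T across all sequences.

    Assumption of this sketch: gaps and ambiguity codes are simply ignored.
    """
    counts = Counter()
    for seq in sequences:
        counts.update(base for base in seq.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no unambiguous bases found")
    return {base: counts[base] / total for base in "ACGT"}


if __name__ == "__main__":
    toy_alignment = ["ACGTAC-T", "ACGTTCNT"]  # made-up data, not Platanus_DTE.nex
    print(empirical_base_frequencies(toy_alignment))
```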
Step 5: Prior specification
Move to the Priors pane. This provides options for specifying prior probability distributions for the tree topology and all of the other parameters in the nucleotide substitution model and relaxed clock model.
The sequence data were sampled from species (not individuals within a population), so we should select a speciation prior from the ‘Tree Prior’ pull-down menu: select the ‘Speciation: Yule Process’ option (which generates a prior distribution of topology and divergence times under a pure-birth stochastic branching process model). Note that you have the option of initiating the MCMC from a UPGMA starting tree; however, we are going to use a better starting tree that will be specified by hand later on.
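To give a feel for what a pure-birth process implies, here is a minimal Python sketch that simulates the waiting times between successive speciation events. It is a generic illustration of a Yule process, not BEAST's implementation; the function names, the birth rate and the number of tips are assumptions chosen for the example.

```python
import random


def simulate_yule_waiting_times(n_tips, birth_rate, rng):
    """Waiting times between speciation events under a pure-birth process.

    Starts with 2 lineages (just after the root split) and stops at the
    moment the n_tips-th lineage arises. While k lineages exist, the time
    to the next split is exponential with rate k * birth_rate.
    """
    if n_tips < 2 or birth_rate <= 0:
        raise ValueError("need n_tips >= 2 and birth_rate > 0")
    return [rng.expovariate(k * birth_rate) for k in range(2, n_tips)]


def expected_time_to_n_lineages(n_tips, birth_rate):
    """Expected time from the root split until n_tips lineages exist."""
    return sum(1.0 / (k * birth_rate) for k in range(2, n_tips))


if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed so the run is repeatable
    times = simulate_yule_waiting_times(n_tips=5, birth_rate=1.0, rng=rng)
    print("simulated waiting times:", times)
    print("expected total:", expected_time_to_n_lineages(5, 1.0))
```

Because the waiting time shrinks as the number of lineages grows, trees drawn from this prior tend to have longer intervals near the root and shorter ones toward the tips.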
The GTR substitution model (which describes the evolution of the sequence data over the phylogeny) includes parameters for five of the six nucleotide substitution types specified by the GTR model: which of these six revmat parameters is missing, and why? Furthermore, where did the parameters for the four nucleotide base frequencies go?
The discrete Gamma model (for accommodating among-site substitution rate heterogeneity) includes a single parameter, alpha, which describes the shape of the Gamma distribution.
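The effect of the shape parameter can be explored with a short Python sketch that approximates a mean-one Gamma distribution with equal-probability rate categories. This version uses Monte Carlo draws and the mean rate within each category; both choices are assumptions made to keep the example dependency-free, and it is not a description of how BEAST computes its category rates.

```python
import random


def discrete_gamma_rates(alpha, n_categories=4, n_draws=100_000, seed=1):
    """Approximate mean rate in each equal-probability category.

    The Gamma has shape alpha and scale 1/alpha, so its mean is 1.
    Results are Monte Carlo estimates, not exact quantile-based values.
    """
    if alpha <= 0 or n_categories < 1:
        raise ValueError("need alpha > 0 and n_categories >= 1")
    rng = random.Random(seed)
    draws = sorted(rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_draws))
    size = n_draws // n_categories  # any leftover draws at the top are dropped
    return [sum(draws[i * size:(i + 1) * size]) / size
            for i in range(n_categories)]


if __name__ == "__main__":
    for alpha in (0.5, 2.0, 50.0):
        rates = discrete_gamma_rates(alpha)
        print(alpha, [round(r, 3) for r in rates])
```

Small values of alpha give strongly unequal category rates (most sites evolve slowly, a few very fast), whereas large values give categories that are all close to 1.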
The UCLN relaxed clock model (for accommodating among-lineage substitution-rate variation) includes parameters for the mean and standard deviation of the underlying lognormal distribution from which substitution rates at individual branches are sampled, and parameters that describe the magnitude of substitution-rate variation across the tree (the coefficientOfVariation parameter) and the degree to which variation in substitution rates is autocorrelated across ancestor-descendant branches (the covariance parameter).
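As a rough illustration of these quantities, the Python sketch below draws independent branch rates from a lognormal distribution and computes their coefficient of variation (standard deviation divided by mean). The function names and the parameterisation (a mean in real space plus a standard deviation in log space) are assumptions of this sketch, not a statement about BEAST's internals, and the covariance statistic is not covered.

```python
import math
import random


def sample_branch_rates(n_branches, mean_rate, log_sd, rng):
    """Draw independent branch rates from a lognormal distribution.

    mean_rate is the mean in real space; log_sd is the standard deviation
    of log(rate). The log-space mean is shifted so the real-space mean
    equals mean_rate.
    """
    mu = math.log(mean_rate) - 0.5 * log_sd ** 2
    return [rng.lognormvariate(mu, log_sd) for _ in range(n_branches)]


def coefficient_of_variation(rates):
    """Standard deviation of the rates divided by their mean."""
    mean = sum(rates) / len(rates)
    variance = sum((r - mean) ** 2 for r in rates) / len(rates)
    return math.sqrt(variance) / mean


if __name__ == "__main__":
    rng = random.Random(3)
    rates = sample_branch_rates(100_000, mean_rate=1.0, log_sd=0.5, rng=rng)
    print("sample CV:     ", coefficient_of_variation(rates))
    print("theoretical CV:", math.sqrt(math.exp(0.5 ** 2) - 1.0))
```

A coefficient of variation near zero indicates nearly clock-like data, while larger values indicate substantial rate variation among branches.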
Finally, we need to specify the age of one or more nodes in the tree in order to estimate absolute divergence times/substitution rates, for which there is a parameter that describes the age of the root node. However, we are going to impose a calibration prior based on a fossil that is placed at an internal node of the tree (rather than at the root), and we will do this manually with a text editor in Part 3 of this tutorial.
Step 6: Proposal mechanisms
Move to the Operators pane. This provides options for controlling aspects of the proposal mechanisms used to update parameter values during the MCMC sampling, including the magnitude of proposed changes to each of the parameters (the ‘tuning’ values) and the frequency with which attempts will be made to update each of the parameters (the ‘weight’ values). The default tuning and weight proposal values should work fine for our data set. However, ensure that the ‘Auto Optimize’ box is checked, so that the tuning values will be automatically adjusted during the MCMC in order to ensure the efficiency of parameter mixing.
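The role of the ‘weight’ values can be sketched in a few lines of Python: at each step one proposal mechanism is chosen with probability proportional to its weight. The operator names and weights below are hypothetical and are not BEAUti's defaults.

```python
import random
from collections import Counter


def choose_operators(weights, n_steps, rng):
    """Pick one operator per step with probability proportional to its weight.

    Returns a Counter mapping operator name -> number of times it was chosen.
    """
    names = list(weights)
    picks = rng.choices(names, weights=[weights[n] for n in names], k=n_steps)
    return Counter(picks)


if __name__ == "__main__":
    # Hypothetical operators and weights, for illustration only.
    example_weights = {"alpha_scale": 1.0, "rate_scale": 3.0, "tree_move": 6.0}
    counts = choose_operators(example_weights, 20_000, random.Random(7))
    for name, count in counts.most_common():
        print(name, count / 20_000)
```

An operator with weight 6 out of a total of 10 is attempted in roughly 60% of steps, so raising a weight devotes more of the chain to updating that parameter.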
Step 7: MCMC
Move to the MCMC pane. This provides options for controlling aspects of the MCMC sampling used to approximate the joint posterior probability density of model parameters, phylogeny and divergence times.
The chain length specifies the number of generations that we want to run the MCMC. Since we want the demonstration to complete quickly, specify 1 million cycles.
Specify the MCMC samples to be printed to the screen and logged to files every 1000 cycles.
Specify the name of the parameter log file to be saved as ‘Platanus_DTE.log’.
Specify the name of the tree log file to be saved as ‘Platanus_DTE.trees’.
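The chain length and logging frequency together determine how many samples end up in the log files. The small Python helper below does the arithmetic; it assumes that the initial state is logged in addition to every 1000th cycle, which you should verify against your own log file.

```python
def logged_sample_count(chain_length, log_every, include_initial_state=True):
    """Number of rows written to a log file for a given chain length.

    include_initial_state is an assumption of this sketch: set it to False
    if the starting state is not written to the log.
    """
    if chain_length < 0 or log_every <= 0:
        raise ValueError("need chain_length >= 0 and log_every > 0")
    return chain_length // log_every + (1 if include_initial_state else 0)


if __name__ == "__main__":
    print(logged_sample_count(1_000_000, 1000))
```

With the settings above, the parameter log and the tree log will each hold on the order of a thousand samples, which is enough for a quick demonstration but probably too few for a publication-quality analysis.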
Step 8: Generate the XML file!
Click the ‘Generate BEAST file’ button in the lower right-hand corner of the BEAUti window, and save the generated file as ‘Platanus_DTE.xml’. Congratulations, you are halfway there! Now we need to make some manual modifications to our newly generated ‘base’ XML file, which we will describe in the next part of the tutorial.