BayesTraits with Discrete Traits (Glor)

InfoInfo
Search:    
Primary Contact(s)
Rich Glor
Created
8 March 2011
Required Software
[WWW]BayesTraits
Example Datafile
This tutorial uses two example data files
Adox.trees
Adox.trait1.txt
Adox_trait1_commands.txt

1. Download [WWW]BayesTraits and uncompress the resulting file.

2. Although BayesTraits is available for both Macs and PCs, this tutorial will focus on the Mac OSX version. If you're using a PC you should be able to follow along. In OSX, BayesTraits is run through the Terminal application. As with any Terminal-based application, we need to the locations of the program and associated files. We're going to put the BayesTraits folder in the Applications folder. Because your computer likely has multiple Applications folders, you should be sure to put it in the one that is found in your root directory (the simplest way to get to this directory is to click the Applications link in the Favorites menu that appears in the Finder window (see below)).

Applications.jpg

We're also going to put the BayesTraits directory in our path. By putting BayesTraits in our path, we're able to tell Terminal where to look for the program. This might seem like a nuisance, but it's going to save us some trouble in the longrun by permitting us to run BayesTraits from anywhere on our computer. To put BayesTraits in our path, you need to open the Terminal application and type vi .profile. This opens a file called ".profile" in the vi editor. Once ".profile" is open, type "i" to begin editing (you should see "—Insert—" at the bottom of the Terminal window). Now simply type export PATH="$PATH:/Applications/BayesTraits/" to add BayesTraits to your path. When finished, press esc to stop editing and then close the ".profile" file by typing :wq.

profile_fixed.jpg

3. We're now ready to open BayesTraits and the relevant data. We're going to explore the evolution of growth form [i.e., (0) woody, and (1) herbaceous] in a group of plants, Adoxaceae. The character data are stored in a comma delimited text file: Adox_trait1.txt. The trees used in this analysis are from a marginal distribution of dated trees inferred using BEAST: Adox.trees. BayesTraits requires a rooted tree with branch lengths/durations. Move both of these files to your BayesTraits folder and navigate to this folder in the Terminal by typing cd ../../Applications/BayesTraits. Now we're ready to run BayesTraits and we can do so by specifying both the tree and the dataset when we call BayesTraits:

./BayesTraits Adox.trees Adox.trait1.txt

4. If your data have loaded successfully you should see the following text:

Rand Seed    1299555524 
Please Select the model of evolution to use.
1)      MultiState.
2)      Discrete: Independent
3)      Discrete: Depend
4)      Continuous: Random Walk (Model A)
5)      Continuous: Directional (Model B)

This is BayesTraits' way of prompting us to select the type of analysis we'd like to do. Although we're dealing with a binary trait, we're going to select the Multistate option because the Discrete option in BayesTraits is designed to investigate character correlations and requires two characters with binary coding (we have only one character with binary coding in our dataset). Type 1 and hit enter to choose the Multistate option.

5. We're now prompted to choose whether we want to do Maximum Likelihood or Bayesian MCMC:

Please Select the analsis method to use. 
1)      Maximum Likelihood.
2)      MCMC

We're going to start with ML analyses, even though our ultimate goal is to use Bayesian inference. We're doing this because ML can provide us with a reasonable first approximation of the rate coefficient(s) that might be useful for specifying (hyer)priors in the subsequent Bayesian analysis. Type 1 and hit enter to initiate your Maximum Likelihood analyses.

5. BayesTraits should now provide us with some basic information about our dataset and model settings:

Options: 
Options:
Model:                           Multistates
Tree File Name:                  Adox.trees
Data File Name:                  Adox_trait1.txt
Log File Name:                   Adox_trait1.txt.log.txt
Summary:                         False
Analsis Type:                    Maximum Likelihood
ML attempt per tree:             10
No of Rates:                     2
Base frequency (PI's)            None
Character Symbols                0,1
Using a covarion model:          False
Restrictions:
    q01                          None
    q10                          None
Tree Information
     Trees:                      501
     Taxa:                       63
     Sites:                      1
     States:                     2

Most of the information we're seeing here has a pretty intuitive interpretation. We have 501 trees with 63 taxa in each tree and 1 character (or site) with two states (i.e., a binary character). We're currently working with two different rate parameters, one for transitions from woody to herbaceous (q01) and the other for transitions from herbaceous to woody (q10). Let's do a basic ML search under this two-rate model using the default settings for other parameters. When doing Maximum Likelihood analyses on a set of more than 1 tree, BayesTraits is going to provide results calculated independently for each of the 501 trees in our dataset. To run your analysis, simply type run.

Output for the first 10 trees in our sample should look something like this:

Tree No      Lh      q01     q10     Root P(0)       Root P(1) 
1       -9.727674       0.000000        0.017002        0.000000        1.000000
2       -10.138425      0.000000        0.018218        0.000000        1.000000
3       -10.311666      0.000000        0.017754        0.000000        1.000000
4       -10.437176      0.000000        0.020893        0.000000        1.000000
5       -9.824290       0.000000        0.017578        0.000000        1.000000
6       -9.924709       0.000000        0.019461        0.000000        1.000000
7       -10.249299      0.000000        0.020936        0.000000        1.000000
8       -10.254006      0.000000        0.022155        0.000000        1.000000
9       -10.257945      0.000000        0.020684        0.000000        1.000000
10      -10.494423      0.000000        0.020988        0.000000        1.000000

Let's focus on several aspects of these results. First, we should see that all of our trees produce similar log likelihoods. This suggests that variation among the trees in our posterior distribution is not having a strong impact on the inference of evolution in this trait. Second, the rate for transitions from the woody to herbaceous (q01 = 0.000) is much lower than that of herbaceous to woody (q10 = 0.018-0.022). Finally, the probability of an herbaceous state at the root is substantially higher than the probability of a woody state at the root (0.000 v. 1.000).

6. Before we leave likelihood land, we could consider whether the difference in the two estimated rate parameters has implications for model selection. To do this we're going to re-run our analysis with the two parameters constrained to be equal to one another. To do this, we're going to use the command restrict q01 q10 (just before entering run), remember you have to specify the matrix, data and parameters for each run. We can confirm that this operation was successful by typing info. We should now see that q01 is restricted to be the same as q10 while q10 has no restrictions. Let's now re-run the analysis by typing run. The first ML scores for the first 10 trees look something like this:

Tree No      Lh      q01     q10     Root P(0)       Root P(1) 
1       -13.946363      0.005398        0.005398        0.454504        0.545496
2       -14.400731      0.005738        0.005738        0.457616        0.542384
3       -14.588476      0.005608        0.005608        0.459535        0.540465
4       -14.873274      0.006408        0.006408        0.445273        0.554727
5       -14.165872      0.005521        0.005521        0.439266        0.560734
6       -14.396200      0.005913        0.005913        0.438098        0.561902
7       -14.479001      0.006861        0.006861        0.439984        0.560016
8       -14.641807      0.006948        0.006948        0.443282        0.556718
9       -14.595005      0.006512        0.006512        0.445936        0.554064
10      -14.833338      0.006706        0.006706        0.427272        0.572728

The fact that the resulting substantial decrease in likelihoods suggests that the two-rate parameter model is superior to the single-rate parameter model. We could formally test this hypothesis using AIC or the likelihood ratio test (with one degree of freedom).

7. Now that we have a rough idea of what our data look like, let's move to MCMC world. Let's get started with this analysis ./BayesTraits Adox.trees Adox_trait1.txt. As before, we're going to select Multistate. However, when we get the next prompt, we're going to choose option 2 (MCMC) rather than option 1 (Maximum Likelihood). When you do this, you get different information on your starting parameters that you do with Maximum Likelihood:

Options: 
Model:                           Multistates
Tree File Name:                  Adox.trees
Data File Name:                  Adox_trait1.txt
Log File Name:                   Adox_trait1.txt.log.txt
Summary:                         False
Analysis Type:                   MCMC
Sample Period:                   100
Iterations:                      5050000
Burn in:                         50000
Rate Dev:                        2.000000
No of Rates:                     2
Base frequency (PI's)            None
Character Symbols                0,1
Using a covarion model:          False
Restrictions:
    q01                          None
    q10                          None
Prior Information:
    Prior Categories:            100
    q01                          uniform 0.00 100.00
    q10                          uniform 0.00 100.00
Tree Information
     Trees:                      501
     Taxa:                       63
     Sites:                      1
     States:                     2

8. We're also going to reduce the number of iterations so that things run a bit faster than they would otherwise. Let's drop down to 100,000 iterations by typing Iterations 100000. We're going to begin by leaving the prior parameters at their default settings for now and simply run the analysis with uniform rate priors by typing run. The first few lines of the resulting output should look something like this:

Iteration    Lh      Harmonic Mean   Tree No q01     q10     Root P(0)       Root P(1)       Acceptance 
50000   -21.583288      -21.583288      95      10.550017       62.602856       0.500000        0.500000        0.940000
50100   -21.368018      -21.516551      103     5.856247        55.458663       0.500000        0.500000        0.960000
50200   -21.340524      -21.454796      313     8.247111        57.741515       0.500000        0.500000        0.950000
50300   -21.803621      -21.498049      22      4.482239        55.257891       0.500000        0.500000        1.000000
50400   -21.660432      -21.555659      176     10.657151       61.080610       0.500000        0.500000        0.910000
50500   -21.453740      -21.556891      390     10.324308       65.802350       0.500000        0.500000        0.980000
50600   -21.591820      -21.552075      383     11.692942       69.103778       0.500000        0.500000        0.920000
50700   -21.558930      -21.555231      260     12.985093       77.978155       0.500000        0.500000        0.990000
50800   -21.781473      -21.570270      455     5.967007        72.842356       0.500000        0.500000        0.890000
50900   -21.499887      -21.579031      106     6.850427        72.032596       0.500000        0.500000        0.920000
51000   -22.844072      -21.690110      1       18.280377       76.235973       0.500000        0.500000        0.910000
51100   -23.511776      -21.967507      232     19.162965       71.179811       0.500000        0.500000        0.870000

Note that this output includes the iteration in the left column as well as the Harmonic Mean estimate of the marginal log likelihood that we'll need later to calculate Bayes Factor scores, and indicates which tree is being sampled at each iteration. Note that the likelihood scores are considerably worse than they were previously, that's not surprising because we are no longer maximizing the likelihood. The most disturbing aspect of the results is that the transition rates are crazy high. This is not too surprising, given that the prior (uniform [0,100]) places the vast majority of the prior mass is on very high rates. Accordingly, we'll modify the prior to reflect more reasonable; specifically, we'll change the prior to a uniform [0,1]. This prior is more reasonable in light of the range of MLE values for these parameters (that would make this an example of an 'empirical prior'—one that is informed by a ML 'peek' at the data. Run the analysis under this new prior and see how it looks.

9. How is the MCMC mixing? Ideally, the acceptance rates to for proposed changes to the rate parameters should fall in the range between ~20-40%. We can control the acceptance rates (and, thus, the mixing of the MCMC) by changing the magnitude of proposed changes to the transition rates (i.e., RateDev parameter). All else being equal, we expect higher Rate Dev values to result in lower acceptance rates because they will propose larger changes to the rate parameters. Accordingly, if acceptance rates are too high, specify a larger value for the RateDev parameter (and vice versa). Experiment with different values until you get the chain to mix well.

10. Now we're going to repeat the above Bayesian analyses under a one-rate model. To do this, we're going to use the command restrict q01 q10 (just before entering run). We can confirm that this operation was successful by typing info. We should now see that q01 is restricted to be the same as q10 while q10 has no restrictions. Let's now re-run the analysis by typing run. As before, iteratively adjust the value of the RateDev parameter until the acceptance rates are in the target range (2-–40%).

11. We have estimates of the marginal likelihoods (based on the harmonic mean estimator) for each of the preceding Bayesian analyses (under the two-rate and one-rate models). We can therefore assess the support in the data for the two models (by taking the difference in the estimated marginal log likelihoods). Which model is preferred?

12. The two previous Bayesian analyses conditioned on a model—the first on a two-rate model, the second on a one-rate model. If we want to get really Bayesian, we can treat the model itself as a random variable. That is, we can use a special type of MCMC, reversible-jump MCMC to average over the set of possible trait models (for binary traits, there are a total of 5 possible models). This procedure accommodates uncertainty in model choice so the the inference of ancestral states and transition rates does not assume any one model of trait evolution. Let' specify a model-averaged inference of trait evolution by specifying the prior and hyperpriorm, rjhp exp 0 30, for the rjMCMC—this specifies an exponential prior for the transition rates, and the rate parameter for the exponential is itself sampled from a uniform hyperprior on the interval [0,30] . After running this analysis, we should see some lines that look like this:

Iteration    Lh      Harmonic Mean   Tree No No Off Paramiters       Model string    q01     q10     Root P(0)       Root P(1)       RJ Prior Mean   Acceptance 
50000   -10.345161      -10.345161      207     1       'Z0     0.000000        0.010113        0.000000        1.000000        3.301091        0.090000
50100   -9.419777       -10.120508      260     1       'Z0     0.000000        0.025599        0.000000        1.000000        0.703027        0.120000
50200   -9.617302       -9.922255       455     1       'Z0     0.000000        0.021933        0.000000        1.000000        0.907404        0.080000
50300   -10.040932      -9.902513       124     1       'Z0     0.000000        0.021933        0.000000        1.000000        0.998135        0.080000
50400   -10.507461      -10.005833      199     1       'Z0     0.000000        0.031622        0.000000        1.000000        0.210905        0.160000
50500   -10.099262      -10.071735      116     1       'Z0     0.000000        0.012638        0.000000        1.000000        0.303869        0.090000
50600   -9.741163       -10.052036      117     1       'Z0     0.000000        0.014276        0.000000        1.000000        0.261923        0.090000
50700   -10.413453      -10.063185      306     1       'Z0     0.000000        0.032846        0.000000        1.000000        0.287645        0.120000
50800   -11.500539      -10.256669      349     1       'Z0     0.000000        0.039525        0.000000        1.000000        2.166462        0.100000
50900   -10.339216      -10.382841      94      1       'Z0     0.000000        0.014670        0.000000        1.000000        3.414595        0.120000
51000   -10.450154      -10.384123      87      1       'Z0     0.000000        0.008764        0.000000        1.000000        4.106023        0.090000
51100   -11.107678      -10.432085      249     1       'Z0     0.000000        0.040083        0.000000        1.000000        9.399662        0.150000

13. Again, lets iteratively tweak the RateDev parameter to ensure that the rjMCMC is mixing well—the current acceptance rates are a bit low (<20%). Let's try to improve mixing by reducing the RateDev parameter, setting it to 0.1 RateDev 0.1. If we do this, we find that we have good mixing by the end of the analysis, but that we have yet to reach stationarity by the previously set burn-in. In this case, therefore, we're going to need to increase the burn-in point to ensure adequate mixing.

Iteration    Lh      Harmonic Mean   Tree No No Off Paramiters       Model string    q01     q10     Root P(0)       Root P(1)       RJ Prior Mean   Acceptance 
50000   -21.746725      -21.746725      377     2       '01     0.865084        4.784077        0.500000        0.500000        3.008918        0.870000
50100   -21.549190      -21.685116      33      2       '01     0.778098        4.692790        0.500000        0.500000        3.066121        0.820000
50200   -21.350093      -21.599172      336     2       '01     0.639613        4.431222        0.500000        0.500000        4.419139        0.820000
50300   -21.303282      -21.528656      203     2       '01     0.599025        4.426538        0.500000        0.500000        1.618403        0.830000
50400   -21.286093      -21.481195      20      2       '01     0.594383        4.614545        0.500000        0.500000        2.442129        0.860000
50500   -21.487544      -21.465539      38      2       '01     0.746834        4.661760        0.500000        0.500000        2.790109        0.810000
...
99400   -9.979530       -23.099905      303     1       'Z0     0.000000        0.013126        0.000000        1.000000        0.116057        0.480000
99500   -11.423252      -23.097885      334     1       'Z0     0.000000        0.053344        0.000000        1.000000        1.404885        0.340000
99600   -11.333112      -23.095868      247     1       'Z0     0.000000        0.052508        0.000000        1.000000        0.057220        0.390000
99700   -10.792404      -23.093856      449     1       'Z0     0.000000        0.044798        0.000000        1.000000        0.243009        0.360000
99800   -10.967647      -23.091848      372     1       'Z0     0.000000        0.035442        0.000000        1.000000        2.761527        0.360000
99900   -10.462960      -23.089844      312     1       'Z0     0.000000        0.010897        0.000000        1.000000        2.425204        0.530000
100000  -12.391982      -23.087844      158     1       'Z0     0.000000        0.003631        0.000000        1.000000        0.085852        0.360000

We can increase the burnin of subsequent runs using the burnin command. Try increasing the burnin so that it is beyond the point of burnin in the analysis you just ran.

14. The rjMCMC simulation provides an estimate of the marginal probabilities of the various trait models. What are the marginal probabilities for the one- and two rate models? How does this result jive with the results of the Bayes factor test that you performed previously?

15. Suppose we are particularly interested in knowing what the growth form was at the root of our tree. As we've already seen, our analyses strongly favor a herbaceous growth form at the root. We can assess how well supported this outcome is by using the fossil command. By fixing the state at each possible alternative and running the analysis, we can compare the resulting Bayes Factor scores to generate an estimate for the relative support of each alternative hypothesis. Let's fix the root to be a woody form Fossil root 0 1 12, ie set the root to state 0 (woody) between taxa number 1 and 12. The syntax here requires that fossil [node name] [state to fix] [taxon numbers]. Once this operation is complete, we can re-run a second time after fixing a root state of 1. Compare the resulting Harmonic mean likelihoods to see how well supported each alternative state is.

This is a Wiki Spot wiki. Wiki Spot is a 501(c)3 non-profit organization that helps communities collaborate via wikis.