1. Overview of the Analysis Pipeline

Created by Brian Moore

Introduction

The series of steps described in the BEAST tutorial is embedded in a more comprehensive analysis pipeline. More specifically, our ability to rigorously estimate divergence times requires a series of preliminary analyses to specify nucleotide substitution models, detect substitution rate variation, estimate a starting tree, etc. In this first section of the tutorial, we provide a brief overview of these preliminary analyses, which are depicted below:

[Figure: pipeline.png]

Step I: Multiple Sequence Alignment

Estimates of phylogeny (and, thus, all downstream inferences based on those estimates) are typically conditioned on a specific nucleotide sequence alignment. In other words, the statements of homology specified in a sequence alignment are assumed to be known without error. Nevertheless, multiple sequence alignment is known to be an extremely difficult statistical inference problem (e.g., like phylogeny estimation, it is NP-complete), and different alignments can substantially affect estimates of phylogeny and other model parameters. Accordingly, multiple sequence alignment is a critical issue in phylogenetics, and the choice among available methods should be considered carefully.

Surprisingly, the choice of an alignment algorithm/implementation in practice is often based on somewhat arbitrary conventions of phylogenetic data analysis rather than on more objective criteria. Fortunately, several statistics have been developed to assess alignment accuracy, and a growing number of simulation studies have applied these metrics to evaluate the statistical behavior and relative performance of available alignment methods. We encourage you to explore this literature and to assess how the choice of algorithm affects your sequence alignment.
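One commonly used accuracy statistic is the sum-of-pairs (SP) score: the fraction of residue pairs aligned in a reference alignment that are also aligned in the alignment being tested. The following is a minimal sketch of that idea, not a replacement for a published benchmarking tool. The function names and the toy sequences are illustrative assumptions, and both alignments are assumed to list the same sequences in the same order.

```python
from itertools import combinations


def aligned_pairs(alignment, gap="-"):
    """Return the set of residue pairs that share a column.

    Each residue is identified by (sequence index, ungapped position),
    so the result does not depend on where gaps were inserted.
    """
    next_position = [0] * len(alignment)
    pairs = set()
    for column in zip(*alignment):
        present = []
        for seq_index, char in enumerate(column):
            if char != gap:
                present.append((seq_index, next_position[seq_index]))
                next_position[seq_index] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sum_of_pairs_score(reference, test):
    """Fraction of reference residue pairs recovered by the test alignment."""
    reference_pairs = aligned_pairs(reference)
    if not reference_pairs:
        raise ValueError("reference alignment has no aligned residue pairs")
    return len(reference_pairs & aligned_pairs(test)) / len(reference_pairs)


# Toy data (hypothetical): the same two sequences aligned in two ways.
reference = ["ACG-T", "A-GCT"]
test = ["ACGT-", "A-GCT"]
print(sum_of_pairs_score(reference, test))
```

In practice the reference would be a simulated (true) alignment or a curated benchmark alignment, which is why scores of this kind are mostly reported in simulation studies.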

Step II: Selecting a Model of Nucleotide Substitution

[Under construction] Selection of a stochastic model of nucleotide substitution for the combined sequence alignment and each of the relevant data partitions.
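As a rough illustration of one model-selection criterion, the sketch below ranks candidate substitution models by the Akaike information criterion, AIC = 2k - 2lnL, where k is the number of free parameters. The log-likelihood values are hypothetical placeholders, not results for any real data set, and the function names are assumptions made for this example.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2lnL (lower is better)."""
    return 2.0 * n_params - 2.0 * log_likelihood


def rank_models(candidates):
    """Sort models by AIC and report each model's difference from the best.

    candidates maps a model name to (lnL, number of free parameters).
    """
    scores = {name: aic(lnl, k) for name, (lnl, k) in candidates.items()}
    best = min(scores.values())
    ranked = [(name, score, score - best) for name, score in scores.items()]
    return sorted(ranked, key=lambda row: row[1])


# Hypothetical lnL values. Parameter counts cover the substitution model
# only; branch lengths are shared by all models on a fixed tree, so they
# do not change the AIC differences.
candidates = {
    "JC": (-3120.4, 0),
    "HKY": (-3051.7, 4),
    "GTR": (-3046.9, 8),
    "GTR+G": (-3012.3, 9),
}

for name, score, delta in rank_models(candidates):
    print(f"{name:6s} AIC = {score:8.1f}  delta = {delta:6.1f}")
```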

Step III: Assessing the Level of Substitution Rate Variation

[Under construction] Evidence of significant substitution rate variation among lineages in a sequence alignment (i.e., departure from 'clock-like' rates of substitution) is typically assessed by means of a likelihood-ratio test (LRT). In outline, this entails estimating the maximum likelihood score for the sequence alignment (or subset thereof) under the selected nucleotide model both with and without enforcing substitution rate constancy, and then comparing the likelihood-ratio test statistic to a χ² distribution with (N–2) degrees of freedom, where N is the number of taxa.

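Here we illustrate how to assess whether rates of nucleotide substitution depart significantly from expectations under a stochastically constant molecular clock in the Platanus data set using PAUP*. The test statistic is computed from the two maximized log-likelihood scores as follows (because the clock model is nested within the no-clock model, LR cannot be negative):
LR = 2*(lnL_no-clock - lnL_clock)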

begin paup;
        [Build a neighbor-joining tree from JC distances to serve as the test topology]
        set criterion=distance;
        log file=Molecular_Clock_Test.log;
        DSet distance=JC objective=ME base=equal rates=equal pinv=0
        subst=all negbrlen=setzero;
        NJ showtree=no breakties=random;
        [Score the tree under GTR+gamma with fixed parameter values and no clock]
        set criterion=likelihood;
        lset clock=no;
        lscores all/Base=(0.3586 0.1470 0.1474)  Nst=6  Rmat=(1.0741 1.6419 0.1952 0.5207 1.6419)  Rates=gamma  Shape=1.2815;
        [!Non-clock score above, clock score below]
        [Root the tree, then rescore it with the clock enforced]
        roottrees;
        lset clock=yes;
        lscores all/Base=(0.3586 0.1470 0.1474)  Nst=6  Rmat=(1.0741 1.6419 0.1952 0.5207 1.6419)  Rates=gamma  Shape=1.2815;
        [!Clock score above]
        tstatus;
        log stop;
end;
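
Once the two scores have been read from the log file, the test itself is a short calculation. The sketch below computes the statistic as 2*(lnL_no-clock - lnL_clock) and its p-value from a χ² distribution with N-2 degrees of freedom, using only the Python standard library. The scores and taxon count shown are hypothetical placeholders, not the Platanus results, and the function names are assumptions made for this example. PAUP* typically reports scores as -lnL, so reported values would need their sign flipped before being passed in.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1.

    Uses the recurrence Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
    for the regularized upper incomplete gamma function, with y = x / 2.
    """
    if df < 1 or df != int(df):
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)                 # Q(1, y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))      # Q(1/2, y)
    while a < df / 2.0:
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return q


def clock_lrt(lnl_no_clock, lnl_clock, n_taxa):
    """Return (LR statistic, degrees of freedom, p-value) for the clock test."""
    statistic = 2.0 * (lnl_no_clock - lnl_clock)
    df = n_taxa - 2
    return statistic, df, chi2_sf(statistic, df)


# Hypothetical log-likelihood scores for a 10-taxon alignment.
statistic, df, p_value = clock_lrt(lnl_no_clock=-3000.0, lnl_clock=-3010.0, n_taxa=10)
print(f"LR = {statistic:.2f}, df = {df}, p = {p_value:.4g}")
```

A small p-value indicates that the clock-constrained model fits significantly worse, i.e., that substitution rates vary among lineages.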

Step IV: Estimating Phylogeny Under Candidate Partition Schemes

It is widely acknowledged that the pattern of nucleotide substitution across a sequence alignment can exhibit heterogeneity, and that this variation can cause problems for phylogenetic analysis unless it is accommodated. Deviations from a homogeneous substitution process include both simple rate heterogeneity (i.e., among-site rate variation), stemming from site-to-site differences in selection-mediated functional constraints, systematic differences in mutation rate, etc., and more fundamental process heterogeneity, where the sites in an alignment are evolving under qualitatively different evolutionary processes. Process heterogeneity might occur within a single gene region (e.g., between stem and loop regions of ribosomal sequences), or among gene regions in a concatenated alignment (e.g., comprising multiple nuclear loci and/or gene regions sampled from different genomes).

To avoid these problems, investigators typically adopt a ‘mixed-model’ approach. The sequence alignment is first parsed into a number of process partitions that are intended to capture plausible process heterogeneity within the data (corresponding to different gene regions, codon positions of protein-coding gene regions, stem and loop regions of ribosomal genes, etc.). A substitution model is then specified for each process partition (using various model-selection criteria, such as hLRT, AIC, etc., as outlined in Step II), and the phylogeny and other parameters are estimated under the resulting composite model. In this approach, therefore, the partition scheme is an assumption of the inference (i.e., the estimate is conditioned on the specified mixed model), and the parameters of each process partition are estimated independently.
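To make the idea of a process partition concrete, the sketch below splits a protein-coding alignment into three codon-position partitions. It assumes the alignment is in frame and starts at the first codon position. The function name, taxon labels, and sequences are hypothetical, and real analyses would normally define such partitions through the character sets of the inference program rather than by copying columns.

```python
def partition_by_codon_position(alignment):
    """Split an in-frame alignment into first, second, and third positions.

    alignment maps a taxon name to its aligned sequence. The result maps
    'pos1', 'pos2', and 'pos3' to alignments containing only those columns.
    """
    return {
        f"pos{offset + 1}": {
            taxon: sequence[offset::3] for taxon, sequence in alignment.items()
        }
        for offset in range(3)
    }


# Toy alignment (hypothetical) of two codons per taxon.
alignment = {"taxonA": "ATGGCC", "taxonB": "ATGGCT"}

for name, columns in partition_by_codon_position(alignment).items():
    print(name, columns)
```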

For most sequence alignments, several partitioning schemes are plausible a priori, so we need an objective way to identify the scheme that balances the estimation bias and error variance associated with under- and over-parameterized mixed models, respectively. Increasingly, mixed-model selection is based on Bayes factors, which are ratios of the marginal likelihoods of alternative partitioning schemes. The marginal likelihood of a mixed model is the likelihood of the data integrated over the joint prior probability density of all of its parameters.
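Because marginal likelihoods are usually reported on the log scale, a Bayes factor comparison reduces to a difference of two numbers. The sketch below picks the scheme with the highest marginal log likelihood and reports 2lnBF against each alternative. The scheme names and values are hypothetical placeholders, and the function names are assumptions made for this example.

```python
def two_ln_bayes_factor(ln_marginal_1, ln_marginal_0):
    """Return 2 * ln(BF_10) from two marginal log likelihoods."""
    return 2.0 * (ln_marginal_1 - ln_marginal_0)


def best_scheme(ln_marginals):
    """Return the preferred scheme and its 2lnBF against every other scheme.

    ln_marginals maps a partitioning-scheme name to its estimated
    marginal log likelihood.
    """
    best = max(ln_marginals, key=ln_marginals.get)
    comparisons = {
        name: two_ln_bayes_factor(ln_marginals[best], value)
        for name, value in ln_marginals.items()
        if name != best
    }
    return best, comparisons


# Hypothetical marginal log likelihoods for three candidate schemes.
ln_marginals = {
    "unpartitioned": -5230.1,
    "by_gene": -5198.6,
    "by_gene_and_codon": -5171.2,
}

best, comparisons = best_scheme(ln_marginals)
print("preferred scheme:", best)
for name, value in comparisons.items():
    print(f"2lnBF versus {name}: {value:.1f}")
```

Values of 2lnBF above roughly 10 are conventionally read as very strong support for the preferred model, following the guidelines of Kass and Raftery.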

As illustrated in the figure above, the posterior probability distribution of phylogeny is approximated using MCMC sampling under a series of S biologically plausible data partitioning schemes. For each candidate partitioning scheme, we perform R replicate MCMC analyses in order to assess the performance of the MCMC. These analyses provide harmonic-mean estimates of the marginal log likelihood under each of the S partitioning schemes, which can then be used to choose among the schemes using Bayes factors. Bayesian estimation of phylogeny using MrBayes is described elsewhere on this wiki, and we will cover the diagnosis of MCMC sampling and the use of Bayes factors in a later section of this tutorial.
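The harmonic-mean estimate can be computed directly from the log-likelihood values sampled by the MCMC after burn-in: ln(marginal likelihood) is approximated by ln(n) minus the log of the summed inverse likelihoods. The sketch below works entirely on the log scale to avoid numerical overflow. The sample values and function names are hypothetical, and the estimator is known to be unstable, so this is an illustration of the calculation rather than a recommendation.

```python
import math


def log_sum_exp(values):
    """Compute ln(sum(exp(v))) without overflowing for large |v|."""
    largest = max(values)
    return largest + math.log(sum(math.exp(v - largest) for v in values))


def harmonic_mean_ln_marginal(ln_likelihoods):
    """Harmonic-mean estimate of the marginal log likelihood.

    ln_likelihoods holds the post-burn-in log likelihoods sampled by one
    MCMC run. Returns ln(n) - ln(sum(exp(-lnL_i))).
    """
    if not ln_likelihoods:
        raise ValueError("at least one sampled log likelihood is required")
    inverse = [-value for value in ln_likelihoods]
    return math.log(len(ln_likelihoods)) - log_sum_exp(inverse)


# Hypothetical samples from R = 2 replicate runs of one partitioning scheme.
replicates = [
    [-5172.4, -5170.9, -5173.8, -5171.5],
    [-5171.1, -5172.7, -5170.6, -5173.0],
]
for index, samples in enumerate(replicates, start=1):
    print(f"replicate {index}: {harmonic_mean_ln_marginal(samples):.2f}")
```

Computing the estimate separately for each replicate gives a quick check on whether the R runs agree before the schemes are compared.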

Step V: Estimating Divergence Times Under the Selected Mixed-Model

Steps I through IV of the analysis pipeline (outlined above) provide several pieces of information required to estimate divergence times (described in the remainder of this tutorial). Specifically, the divergence-time analyses will rely on the following results:

Our tutorial on "estimating divergence times from molecular sequence data" with BEAST continues here.
