This tutorial estimates the rate of evolution from a set of virus sequences isolated at different points in time (heterochronous or time-stamped data). The data consist of 129 sequences from the G (attachment protein) gene of human respiratory syncytial virus subgroup A (RSVA). These viral sequences come from various parts of the world at isolation dates ranging from 1956 to 2002 ((Zlateva, Lemey, Vandamme, & Van Ranst, 2004), (Zlateva, Lemey, Moës, Vandamme, & Van Ranst, 2005)). RSVA infects the lower respiratory tract and causes symptoms that are often indistinguishable from the common cold. Nearly all children are infected by age 3, and a small percentage (<3%) develops a more serious inflammation of the bronchioles requiring hospitalisation.
The aim of this tutorial is to estimate:
the rate of molecular evolution
the date of the most recent common ancestor
the phylogenetic relationships with measures of statistical support.
The following software will be used in this tutorial:
BEAST - this package contains the BEAST program, BEAUti, DensiTree, TreeAnnotator and other utility programs. This tutorial is written for BEAST v
2.6.x, which has support for multiple partitions. It is available for download from http://www.beast2.org.
Tracer - this program is used to explore the output of BEAST (and other Bayesian MCMC programs). It graphically and quantitively summarises the distributions of continuous parameters and provides diagnostic information. At the time of writing, the current version is v
1.7.2. It is available for download from http://beast.community/tracer.
FigTree - this is an application for displaying and printing molecular phylogenies, in particular those obtained using BEAST. At the time of writing, the current version is v
1.4.4. It is available for download from http://beast.community/figtree.
Beagle (optional) - this is a high-performance library that can perform the core calculations at the heart of
most Bayesian and maximum likelihood phylogenetics packages. https://github.com/beagle-dev/beagle-lib/wiki.
The NEXUS alignment
The data is in a file called
This file contains an alignment of 129 sequences from the G gene of RSVA
virus. These sequences are 629 nucleotides long.
Click the link above and the raw data will be displayed on your web browser. You can then right-click,
"Save Page As", and save the data as
RSV2.nex in your working directory.
Alternatively, you can find this data in the
examples/nexus directory in the directory
where BEAST was installed. If you prefer to use BEAUti,
you can select the menu
Set working dir =>
This will make it easier to find example and tutorial files
BEAST, precluding directory navigation.
Import and split the alignment
There are two options to import this alignment into BEAUti.
From the menu, to click
Drag and drop the file into the
Partitionspanel. If there is a pop-up dialog to ask you
what to add, then you need to select
Import Alignmentfrom the drop-down list as shown in Figure 1.
The gene we have at hand codes for a protein, meaning its sequence can be seen as a series of triplets, or codons, with each codon being later translated into an amino acid (remember the central dogma of molecular biology!). A string of aminoacids then constitutes a protein.
For our purposes, what matters is that we have now known for a while that different positions in each codon evolve at different rates, which has to do with what we call the "genetic code" (look up "genetic code redundancy" online). In knowing this, we can incorporate this difference in evolutionary rates by splitting the alignment into three partitions, each representing one of the three codon positions.
To do this we will click the
Split button at the bottom of the
panel and then select the
1 + 2 + 3 frame 3 from the drop-down menu (Figure 2).
This specifies that the first full codon starts at the third nucleotide in the alignment (it just so happens that our gene sequence starts at the third position in the sequence), and creates three rows in the partitions panel.
Linking or unlinking models
For the reasons mentioned above, we want to allow each partition to have its own substitution model. This will make it possible for each codon position to have a different relative evolutionary rates. But we want to still have all codon positions (i) evolving along the same phylogenetic tree, and (ii) keeping their evolutionary rates proportionally the same along that tree.
In order to achieve this, you will have to link (i) the tree and (ii) clock models (these
deal with how rates vary along the tree) across the three partitions, and name them
This requires selecting all three partitions first, and clicking the buttons
Link Clock Models
The partition panel should now look something like this:
By default all the taxa are assumed to have a date of zero (i.e., the
sequences are assumed to be sampled at the same time). In our case, the
RSVA sequences have been sampled at various dates going back to the
1950s. The actual year of sampling is given in the name of each taxon
and we could simply edit the value in the
Date column of the table to
Knowing the absolute times of sampling and having samples taken over time
-- rather than all samples having been collected only at the present moment --
is crucial because otherwise we can't tell the rate
r and time
where both contribute to the accumulated evolutionary change,
D = r * t.
We could obtain any value of
D as the result of
1/2r * 2t) or (
2r * 1/2t), etc. Can you see how
t will be
conflated? We thus need to have one or more samples taken at known absolute
t values to be able to disentangle
t. This is called "calibrating"
Back to our data, when taxa names contain the calibration information, as
is our case, then a convenient way to specify the dates of the sequences
in to let BEAUti figure it out.
You can do that by clicking the checkbox
Use tip dates, and then clicking
Auto-configure button at the top right of the
Tip Dates panel.
This will make a dialog box appear.
Select the option to
use everything, choose
after last from the
drop-down box and type
s into the corresponding text box. This will
extract the trailing numbers from the taxon names after the last lower-case letter
s, which are interpreted as the year (since 1900 in our case) that
the sample was isolated.
You may have noticed these dates are specified as
Since some time in the past:
this is known as forward time in a phylodynamic analysis.
The dates panel should now look something like this:
Thanks to calibration, the node height of the sampled phylogenetic tree
will be scaled to some unit of absolute time (here the unit is
which makes our phylogenetic trees be what we call "time trees".
Tips having a zero node height will be the latest (i.e., youngest) samples.
The root height of this time tree represents the time to the most recent
common ancestor (tMRCA) of all samples.
To make sure that you selectd the correct option, you can simply look at the
Height column. The heights of earlier samples should always be larger than
those of later (recent) samples.
Setting the substitution model
We will use the HKY model and estimate base frequencies for all three partitions.
To do this, first switch to the
Site Model panel,
and then choose
HKY from the
Subst Model drop-boxes,
Estimated from the
Also remember to check the
estimate checkbox for the
After three mutation rates are all set to estimate,
it will eventually trigger to check the
Fix mean mutation rate box.
Here, we can use
Clone function to replicate the configuration.
shift key to select all site models on the left side,
OK to clone the settings from a selected site model (Figure 7).
Go through each site model, as you can see, their
configurations are same now.
The main objective here is to set up the analysis to estimate the relative mutation rates of codons, which are relative to a general rate defined in the molecular clock model.
Molecular clock model
We are going to use a strict clock model in our analyses.
This is the simplest clock model that one can use, and it assumes that rates
remain constant throughout the whole tree.
The strict clock model is also the default clock model in BEAUti, so no changes are
necessary in the clock model panel.
If you want, you can specify a starting value for the
Priors are part of a Bayesian model, and describe our beliefs or prior inferences about the model parameters. Each parameter (including the tree!) is assigned a prior distribution, and this must be done before looking at the data.
To set up the priors, select the
We are going to choose a simple tree prior for this analysis,
Coalescent Constant Population.
For our molecular clock model, we will set the prior on the
to a log-normal distribution with mean of -5, and standard deviation of 1.25
The plot of this prior distribution and its quantiles can be visualised on the right
The further reading about priors can be seen from the tutorial Prior selection
Setting the MCMC options
For this dataset, let’s initially set the chain length to
1000000 as this will
run reasonably quickly on most modern computers.
Set the sampling frequencies for the screen logging to
the trace log file to
500 and the trees file to
(So, how many samples are we expecting to have in the trace log file?)
Also, change the log file name to
RSV2.log and tree log file name to
If you keep the default tree log file name,
$(tree) will be replaced by the name defined at the
Tree column in the
Save the BEAST
.xml specification file (e.g.,
Now run BEAST and when it asks for an input file, provide your newly
.xml file as input.
We recommend you to use BEAGLE library, if it is installed on your
BEAST will then run until it has finished reporting information to the
screen. The actual results files are saved to disk in the same location
as your input file. The output to the screen will look something like this:
BEAST v2.6.4, 2002-2021 Bayesian Evolutionary Analysis Sampling Trees Designed and developed by Remco Bouckaert, Alexei J. Drummond, Andrew Rambaut & Marc A. Suchard Department of Computer Science University of Auckland firstname.lastname@example.org email@example.com Institute of Evolutionary Biology University of Edinburgh firstname.lastname@example.org David Geffen School of Medicine University of California, Los Angeles email@example.com Downloads, Help & Resources: http://beast2.org/ Source code distributed under the GNU Lesser General Public License: http://github.com/CompEvol/beast2 BEAST developers: Alex Alekseyenko, Trevor Bedford, Erik Bloomquist, Joseph Heled, Sebastian Hoehna, Denise Kuehnert, Philippe Lemey, Wai Lok Sibon Li, Gerton Lunter, Sidney Markowitz, Vladimir Minin, Michael Defoin Platel, Oliver Pybus, Tim Vaughan, Chieh-Hsi Wu, Walter Xie Thanks to: Roald Forsberg, Beth Shapiro and Korbinian Strimmer ... ... 990000 -6059.7592 -5474.8288 -584.9303 1m58s/Msamples 1000000 -6072.4343 -5468.5631 -603.8711 1m57s/Msamples Operator Tuning #accept #reject Pr(m) Pr(acc|m) ScaleOperator(StrictClockRateScaler.c:clock) 0.79081 9087 26702 0.03589 0.25390 UpDownOperator(strictClockUpDownOperator.c:clock) 0.81735 462 35232 0.03589 0.01294 Try setting scaleFactor to about 0.904 ScaleOperator(KappaScaler.s:RSV2_1) 0.37574 290 885 0.00120 0.24681 DeltaExchangeOperator(FixMeanMutationRatesOperator) 0.33349 4839 19026 0.02392 0.20277 ScaleOperator(KappaScaler.s:RSV2_2) 0.43759 362 849 0.00120 0.29893 ScaleOperator(KappaScaler.s:RSV2_3) 0.45654 287 905 0.00120 0.24077 ScaleOperator(CoalescentConstantTreeScaler.t:tree) 0.73805 287 35575 0.03589 0.00800 Try setting scaleFactor to about 0.859 ScaleOperator(CoalescentConstantTreeRootScaler.t:tree) 0.55544 2111 33800 0.03589 0.05878 Try setting scaleFactor to about 0.745 Uniform(CoalescentConstantUniformOperator.t:tree) - 192969 166503 0.35885 0.53681 SubtreeSlide(CoalescentConstantSubtreeSlide.t:tree) 3.65244 30871 148436 0.17943 0.17217 Exchange(CoalescentConstantNarrow.t:tree) - 43760 135429 0.17943 0.24421 Exchange(CoalescentConstantWide.t:tree) - 99 35864 0.03589 0.00275 WilsonBalding(CoalescentConstantWilsonBalding.t:tree) - 198 35573 0.03589 0.00554 ScaleOperator(PopSizeScaler.t:tree) 0.57008 8862 27174 0.03589 0.24592 DeltaExchangeOperator(FrequenciesExchanger.s:RSV2_2) 0.06824 434 760 0.00120 0.36348 DeltaExchangeOperator(FrequenciesExchanger.s:RSV2_1) 0.07452 391 766 0.00120 0.33794 DeltaExchangeOperator(FrequenciesExchanger.s:RSV2_3) 0.06210 452 761 0.00120 0.37263 Tuning: The value of the operator's tuning parameter, or '-' if the operator can't be optimized. #accept: The total number of times a proposal by this operator has been accepted. #reject: The total number of times a proposal by this operator has been rejected. Pr(m): The probability this operator is chosen in a step of the MCMC (i.e. the normalized weight). Pr(acc|m): The acceptance probability (#accept as a fraction of the total proposals for this operator). Total calculation time: 120.762 seconds End likelihood: -6072.434319012309
Analysing the BEAST output
Drag and drop the BEAST log file
RSV2.log to the left panel of the software
Note that the effective sample sizes (ESSs) for many of the logged
quantities are small (ESSs less than 100 will be highlighted in red by
Tracer). This is not good. A low ESS means that the trace contains a lot
of correlated samples and thus may not represent the posterior
distribution well. In the bottom right of the window is a frequency plot
of the samples which is expected given the low ESSs is extremely rough.
If we select the tab on the right-hand-side labelled
Trace we can view
the raw trace, that is, the sampled values against the step in the MCMC
Here you can see how the samples are correlated. There are 2000 samples
in the trace (we ran the MCMC for steps sampling every 500) but adjacent
samples often tend to have similar values.
The ESS for the
clockRate is about
18, after removing the first 10% of
the samples as burn-in.
So we are only getting 1 independent sample to every
100 ~ 1800/18 actual samples).
With a short run such as this one, it may also be the case that
the default burn-in (10%) of the chain length is inadequate.
Not excluding enough of the start of the chain as burn-in will render estimates of ESS unreliable.
The simple response to this situation is that we need to run the chain for longer.
Given the lowest ESS (e.g., for the constant coalescent parameter) is
it would suggest that we have to run a much longer chain (e.g. 18 times of the current length).
But this is a simple dataset, the length of 8 million would generate samples
providing the reasonable ESSs that are
So let's go back to BEAUti, set the chain length to 8000000 and log every 4000
MCMC panel. We will also rename the log file to
and tree log file name to
Then we can create a new BEAST
RSV2-long.xml with a longer chain length.
Now run BEAST again and load the new log file into Tracer
(you can leave the old one loaded for comparison).
Please note BEAST does not support multiple instances from GUI,
so you have to close the previous run before you can start a new one.
Click on the
Trace tab and look at the raw trace plot.
We have chosen options that produce the same number of samples
but with a larger ESS.
There is still auto-correlation between the samples but
>200 effectively independent samples will now provide a better
estimate of the posterior distribution. There are no obvious trends in
the plot which would suggest that the MCMC has not yet converged, and
there are no significant long range fluctuations in the trace which
would suggest poor mixing.
As we are satisfied with the mixing, we can now move on to one of the
parameters of interest: the substitution rate. Select
clockRate in the
left-hand table. This is the average substitution rate across all sites
in the alignment. Now choose the density plot by selecting the tab
Marginal Density. This shows a plot of the marginal
posterior probability density of this parameter. You should see a plot
similar to this:
As you can see, the posterior probability density is roughly bell-shaped.
There is some sampling noise that would be reduced if we ran the chain
for longer or sampled more often, but we already have a good estimate of
the mean and HPD interval. You can overlay the density plots of multiple
traces in order to compare them (it is up to the user to determine
whether they are comparable on the the same axis or not). Select the
relative substitution rates for all three codon positions in the table
to the left (labelled
mutationRate.3). You will now see the posterior probability densities
for the relative substitution rate at all three codon positions
We will use the program
TreeAnnotator to summarise the
maximum clade credibility (MCC) tree from the sampled trees (in
the tree log file).
TreeAnnotator is an application that comes with BEAST.
First of all, we need to set the burn-in percentage to
Then we select the
Node heights to
and provide input and output file names.
This will rescale the node height to reflect the posterior mean node heights
for the clades contained in the MCC tree.
More details about settings are available from here.
Summary trees can be viewed by
FigTree (a program separate from BEAST)
DensiTree (distributed with BEAST).
FigTree can only see one tree at a time,
so we normally use it to visualise the MCC tree.
Let us open the MCC tree in
FigTree. First, open the
Trees tab on the left panel,
and tick the checkbox
Tick the checkbox of the
Node Bars tab and open it.
height_95%_HPD, you will be able to display the 95% HPD-intervals on
internal node ages (as bars) over each internal node.
Then tick the
Node labels checkbox, and switch the
height (mean) and
In order to interpret the estimated node heights, note that the latest sample in our data set is from 2002. We can thus convert our node heights to years by subtracting their heights from 2002.
Now load all your posterior trees to
show tab and tick the
Root Canal checkbox.
The "root canal tree", drawn in thick blue lines,
represents the MCC tree.
Grid tab, choose
Short grid, pick
Reverse for the
scale axis, and set the
Origin to 2002.
Please be aware that the origin here means the date of the youngest tips.
Before we can show the 95% HPD interval on the node heights,
we need to click the
Central button on the top right corner under the
Then open the
Clades tab, set the
Smallest text filed to 0.5,
to select only the clades with over 50% support.
Then, tick the
Show clades, and switch
draw option from
The error bars representing the 95% HPD interval of internal nodes will now be displayed.
If you want to show a particular node, such as the root,
you can tick the
Select only checkbox, and select it from the panel.
What are the absolute mutation rates for the three codon positions?
In what year did the most recent common ancestor of all RSVA samples live? What is the 95% HPD?
Bonus section: Bayesian Skyline plot
We can reconstruct the population history using the Bayesian Skyline
plot. In order to do so, load the
.xml file into BEAUti using the menu
Priors panel and change the tree prior from
Coalescent Constant Population
Coalescent Bayesian Skyline.
Note that an extra item is added to the priors called
Markov chained population sizes
which is a prior that ensures dependence between population sizes.
By default the number of groups used in the skyline analysis is set to
5. To change this, select menu
Show Initialization panel,
and then a list of parameters is shown in the
bPopSizes.t:tree and change the dimension to 3.
bGroupSizes.t:tree and change
its dimension to 3. The dimensions of the two parameters should be the
same. More groups mean more population changes can be detected, but it
also means more parameters need to be estimated, and that the MCMC chain must
run longer. The extended Bayesian skyline plot automatically detects the
number of changes, so it could be used as an alternative tree prior.
This analysis takes longer to converge, so we will change the MCMC
chain length to 10 million, and the log intervals for the trace log and
tree log to 10 thousand. Then, save the file and run BEAST. You can also
download the log (
RSV2-bs.log) and tree (
RSV2-bs.trees) files from
To plot the population history, load the log file in tracer and select
Bayesian Skyline Reconstruction.
A dialog is shown where you can specify the tree file associated with the log file. Also, since the youngest sample is from 2002, change the entry for age of youngest tip to 2002.
After some calculation, a graph appears showing a population history where
the median and 95% HPD intervals are plotted. After selecting the
solid interval checkbox, the graph should look something like this.
By what amount did the effective population size of RSVA grow from 1970 to 2002 according to the BSP?
What are the underlying assumptions of the BSP? Are they violated by this data set?
Change the Bayesian skyline prior to extended Bayesian skyline plot
(EBSP) prior and run until convergence. EBSP produces an extra log file,
$(seed) is replaced by the seed you used
to run BEAST. A plot can be created by running the
AppLauncher, and loading the output file in a spreadsheet.
How many groups are indicated by the EBSP analysis? This is much lower than for BSP. How does this affect the population history plots?
- Bayesian Evolutionary Analysis with BEAST 2 (Drummond & Bouckaert, 2014)
- BEAST 2 website and documentation: http://www.beast2.org/
- Join the BEAST user discussion: http://groups.google.com/group/beast-users
- Zlateva, K. T., Lemey, P., Vandamme, A.-M., & Van Ranst, M. (2004). Molecular evolution and circulation patterns of human respiratory syncytial virus subgroup a: positively selected sites in the attachment g glycoprotein. J Virol, 78(9), 4675–4683.
- Zlateva, K. T., Lemey, P., Moës, E., Vandamme, A.-M., & Van Ranst, M. (2005). Genetic variability and molecular evolution of the human respiratory syncytial virus subgroup B attachment G protein. J Virol, 79(14), 9157–9167. https://doi.org/10.1128/JVI.79.14.9157-9167.2005
- Drummond, A. J., & Bouckaert, R. R. (2014). Bayesian evolutionary analysis with BEAST 2. Cambridge University Press.
If you found Taming the BEAST helpful in designing your research, please cite the following paper:
Joëlle Barido-Sottani, Veronika Bošková, Louis du Plessis, Denise Kühnert, Carsten Magnus, Venelin Mitov, Nicola F. Müller, Jūlija Pečerska, David A. Rasmussen, Chi Zhang, Alexei J. Drummond, Tracy A. Heath, Oliver G. Pybus, Timothy G. Vaughan, Tanja Stadler (2018). Taming the BEAST – A community teaching material resource for BEAST 2. Systematic Biology, 67(1), 170–-174. doi: 10.1093/sysbio/syx060