Tutorial SIR2 Treelikelihood (original) (raw)

XML Generation: Calculating a structured tree likelihood

Tutorial objectives:

Show how to use PhyDyn's XML generator to create and test a PhyDyn STreeLikelihoodODE XML object.
Describe the basics of STreeLikelihoodODE.

Introduction

The goal of these tutorials is to introduce, step-by-step, PhyDyn's (BEAST2) components, analyses and tools. The tutorial sequence also suggests a work-flow. The first step of this work-flow consists on the definition and execution (test) of the population model we are planning to use in our analyses. The output of this initial step (Tutorial) is an XML PopModelODE specification that, when plugged to a TrajectoryOutput object, generates a csv file with the trajectory time-series.

In this tutorial we show how to use PhyDyn's XML generator to create and test a PhyDyn STreeLikelihoodODE XML object. A STreeLikelihoodODE object is a PhyDyn/BEAST CalculationNode that computes the likelihood of a given phylogenetic tree with respect to a population model trajectory. In the previous tutorial, we defined the two-deme PhyDyn population model by working directly on an XML file. This is the recommended way of defining PhyDyn models i.e. to use a text editor to create and modify a BEAST/PhyDyn XML file. However, XML editing becomes a challenging and time-consuming task for even simple BEAST/PhyDyn analyses. For example, a STreeLikelihoodODE object requires the following components:

A population model (PopModelODE) object.
A phylogenetic tree. Our example uses a TreeParser object associated to a Newick string.
A TaxonSet used to link the tree's nodes to data (TraitSet).
A Type trait and a Date Trait (optional).

In order to facilitate this task, BEAST2 is complemented by BEAUti, an XML generation tool. Similarly, PhyDyn comes equipped with XMLGenerator, a runnable XML object that generates basic PhyDyn analysis XML files. This tutorial deals with the simplest of XMLGenerator's outputs, an XML file made of:

A complete STreeLikelihoodODE component generated from a PopModelODE model specification (we use the two-deme defined in the previous tutorial) and a Newick tree file.
A LikelihoodOut component, a runnable object (effectively, a wrapper) that prints the log-likelihood computed by the STreeLikelihoodODE object.

Simulation: Phylogenetic trees for our two-deme SIR model

Phylogenetic trees are key components to BEAST analyses. In the context of PhyDyn, we want to assess the likelihood of a structured tree given a population process trajectory or history. A BEAST analysis can use a fixed tree e.g. a maximum likelihood tree generated by another analysis, or work through a sequence of trees generated via sampling. For this example (and the following tutorial) we will work with a 100-tip Newick tree generated by sampling on a population process using MASTER.

Our example Newick tree, sir2Master.nwk, was generated by executing the genSIR2Trees.R script:

  > Rscript genSIR2Trees.R <num-trees>

The following steps are executed until <num-trees> are generated:

The script calls BEAST with MASTER file sir2Master.xml as input. MASTER is a BEAST package so, in order for this to work, MASTER should be loaded beforehand (standalone MASTER versions can be used by changing the master variable in the script).
MASTER generates an annotated NEXUS file. The script processes the tree (trees with less than 200 tips are discarded, internal node annotations are removed) and writes it back in Newick format.
Trees are named sir2Master_i.nwk, with i varying from 1 to <num-trees>.

The MASTER model describes the same two-model population model we are working with but specified in terms of Chemical Master Equations (CME). The specification instructs MASTER to start the simulation with initial values 999,1 and 0 for S, I0 and I1, respectively, and sample 200 demes (tree tips) at time 20. The resulting tree is stored in Nexus format by MASTER, processed by the script (trees with less than 200 tips are discarded, internal node annotations are removed) and written back in Newick format.

The script produces an annotated phylogenetic tree. In our context, it produces a structured phylogenetic tree with tips labeled with deme information (population names). For example, the following extract from sir2Master.nwk

((95_I0:0.009020825955,96_I0:0.009020825955):1.633546797,97_I1:1.642567623):1.326547235

shows that tips 95,96 and 97 correspond to demes I0,I0 and I1, respectively.

XML Generation

This part of the tutorial shows how to use XMLGenerator XML element to generate ready-to-use PhyDyn XML files. We'll start by generating a PhyDyn analysis that outputs the likelihood of a phylogenetic tree given a population trajectory. In order to do this, we need a population model specification (two-deme) and a phylogenetic tree (sir2Master.nwk).

Copy the PhyDyn file sir2Trajectory0.xml (or the sir2TrajectoryFinal.xml file created in the previous tutorial) and name it sir2Generate.xml.

  > cp sir2Trajectory0.xml sir2Generate.xml.

Open the new file in a text editor. Replace the TrajectoryOut component

with

The parameters above instruct XMLGenerator to generate the XML file sir2Likelihood.xml made of the STreeLikelihoodODE and LikelihoodOut components (xmlType="likelihood") with information extracted from the twodeme population model and the Newick phylogenetic tree stored in sir2Master.nwk. Furthermore,

adjustTipHeights="true" fixes branch lengths so all tip-to-root distances are the same. It only makes sense if the tree is intended to be isochronous, that is, if all tips are supposed to be sampled at the same time.
createTypeTrait="false" tells PhyDyn to extract deme information directly from tree node annotations e.g. 96_I0 corresponds to deme I0.
createDateTrait="false" generates a tree without a Date Trait. If the tree is not dated, PhyDyn dates the most recent tip to the value of t0 or, if missing, to the height of the tree (max root to tip distance).

Execute the file by running BEAST2/PhyDyn:

  > beast sir2Generate.xml
File: sir2Generate.xml seed: 127 threads: 1
Generating Likelihood test xml
output xml file = sir2Likelihood.xml
root height: 13.225650562640002
num tips = 200
-- Population model
model-name = twodeme;
... <model-definition>
Output sent to XML file: sir2Likelihood.xml

You should have a new file, sir2Likelihood.xml (as indicated by the outputFile parameter). The new file is ready to go:

> beast sir2Likelihood.xml 
File: sir2Likelihood.xml seed: 127 threads: 1
model-name = twodeme;
.... model definition
logP=-1138.0153050879976

Let's pause for a moment to take a look at the generated XML (sir2Likelihood.xml). We can identify the following components:

id="twodeme.t": A Tree XML element (spec TreeParser) with the contents of the Newick input tree file sirMaster.nwk verbatim.
id="twodeme.taxa": A TaxonSet element with taxa ids extracted from the annotated tree tips.
id="twodeme":The population model specification (PopModelODE) included in the original file.
id="twodeme.stlh": A StreeLikelihoodODE object with default parameters, and references to the TreeParser and PopModelODE objects.
A LikelihoodOut XML element pointing to the likelihood element stlh:

As we mentioned above, the LikelihoodOut component is essentially a wrapper, a runnable component used to extract and print the log-likelihood computed by StreeLikelihoodODE. LikelihoodOut does not take part of proper PhyDyn/BEAST analyses but it's useful at this stage of the analysis design process i.e. when testing our population model in combination with phylogenetic trees, and the corresponding BEAST components such as Taxa and Traits. It also helps to catch syntactical errors when creating/editing the XMLs "by hand".

Tip dates and t1

The XML file generated above encodes the simplest STreeLikelihoodODE: likelihood and tree components do not point to any date or type traits. In this example, a type trait is not needed because tip annotations can be used to extract deme information. In the absence of a date trait, PhyDyn uses the value of t1 to date the most recent tip i.e. tips with zero height.

Modify t1: The current value of t1 is 20; this gives us a log-likelihood of -1138.0153 . Given that the tips of our example tree were sampled at time 20, and that we are using the same rate parameters as the MASTER model used to generate the tree, we should expect that the value of t1 that maximises the likelihood should be around 20. We could check this by trying different values of t1.

Start your file editor with sir2Likelihood.xml and change the value of t1 to 15. Run the analysis:

> beast sir2Likelihood.xml
File: sir2Likelihood.xml seed: 127 threads: 1
model-name = twodeme;
... <model-definitions>
Structured tree likelihood
logP=-16727.9757

If you do the same for values of t1 between 15 and 26 (space 1), you will get the following table:

t1	LogLh	t1	LogLh	t1	LogLh	t1	LogLh
15	-16727.97	16	-7865.60	17	-1413.93	18	-1136.62
19	-1133.79	20	-1138.01	21	-1148.75	22	-1165.00
23	-1185.21	24	-1207.52	25	-1230.46	26	-1252.72

The likelihood values behave as expected. For this particular set of times, the maximum log-likelihood value corresponds to t1=19. Proper likelihood maximisation will probably throw a value of t1 different than 20 but, definitely, in the interval [18,21].

** Removing t1** If we remove t1 from our population model, and in the absence of dating information (DateTrait), PhyDyn will date the most recent tip with the height of the root. We can verify this. Open sir2Likelihood.xml in a text editor and:

Remove t1 from the trajectory parameters component. Run BEAST with sir2Likelihood.xml as input. The returned log-likelihood is -32660.4176 .
The height of the tree is 13.225650562640002. Re-introduce t1 to the trajectory parameters with the height of the tree as value t1="13.225650562640002". The returned log-likelihood should be the same as the previous value.

Introducing Date and Type Traits

Most BEAST analyses (for example, analyses generated by BEAUti from DNA sequences) use a Date Trait to provide sampling dates. Similarly, Type traits are used to map tips to deme names. We can introduce both traits to our example by setting the createDateTrait and createTypeTrait XMLGenerator parameters to true.

Open sir2Generate.xml with a text editor and replace the XMLGenerator component with:

Run BEAST with sir2Generate.xml:

> beast sir2Generate.xml
Generating Likelihood test xml
output xml file = sir2LikelihoodTraits.xml
--- tree file: sir2Master.nwk
root height: 13.225650562640002
num tips = 200
-- Population model
model-name = twodeme;
... <model-definitions>
Output sent to XML file: sir2LikelihoodTraits.xml

The generated file sir2LikelihoodTraits.xml has the following additional components:

A Date Trait mapping each tip/taxon to 20. The values/dates reflect the value of t1.

<trait id="twodeme.dates" spec="beast.evolution.tree.TraitSet" traitname="date" value="1_I1=20.0,2_I1=20.0,3_I1=20.0,4_I1=20.0 ..."

A Type trait mapping each taxa to a deme name:

value="1_I1=I1,2_I1=I1,3_I1=I1,4_I1=I1,5_I1=I1,6_I1=I1,7_I1=I1,8_I1=I1,9_I0=I0,..."

A dated tree. The previous TreeParser object with the extra twodeme.dates component
The STreeLikelihoodODE object now uses the new Type Trait (twodeme.types) to identify the demes associated to taxa/tip:

We expect to get the same log-likelihood back:

> beast sir2LikelihoodTraits.xml
File: sir2LikelihoodTraits.xml seed: 127 threads: 1
<trait-information>
<model-information>
Structured tree likelihood
logP=-1138.0153050879976

Let's now generate sir2LikelihoodTraits.xml using the same values of t1 as the ones used in the previous section.

Edit sir2Generate.xml so that t1="15".
Run BEAST with sir2Generate.xml. The new sir2LikelihoodTraits.xml file should have a date trait with all dates set to 15. Check.
Run BEAST with sir2LikelihoodTraits.xml. Check that the returned log-likelihod is -16727.9757. This is the same loglh we obtained in the previous section.