Tutorial SIR2 Treelikelihood (original) (raw)

XML Generation: Calculating a structured tree likelihood

Tutorial objectives:

Introduction

The goal of these tutorials is to introduce, step-by-step, PhyDyn's (BEAST2) components, analyses and tools. The tutorial sequence also suggests a work-flow. The first step of this work-flow consists on the definition and execution (test) of the population model we are planning to use in our analyses. The output of this initial step (Tutorial) is an XML PopModelODE specification that, when plugged to a TrajectoryOutput object, generates a csv file with the trajectory time-series.

In this tutorial we show how to use PhyDyn's XML generator to create and test a PhyDyn STreeLikelihoodODE XML object. A STreeLikelihoodODE object is a PhyDyn/BEAST CalculationNode that computes the likelihood of a given phylogenetic tree with respect to a population model trajectory. In the previous tutorial, we defined the two-deme PhyDyn population model by working directly on an XML file. This is the recommended way of defining PhyDyn models i.e. to use a text editor to create and modify a BEAST/PhyDyn XML file. However, XML editing becomes a challenging and time-consuming task for even simple BEAST/PhyDyn analyses. For example, a STreeLikelihoodODE object requires the following components:

In order to facilitate this task, BEAST2 is complemented by BEAUti, an XML generation tool. Similarly, PhyDyn comes equipped with XMLGenerator, a runnable XML object that generates basic PhyDyn analysis XML files. This tutorial deals with the simplest of XMLGenerator's outputs, an XML file made of:

Simulation: Phylogenetic trees for our two-deme SIR model

Phylogenetic trees are key components to BEAST analyses. In the context of PhyDyn, we want to assess the likelihood of a structured tree given a population process trajectory or history. A BEAST analysis can use a fixed tree e.g. a maximum likelihood tree generated by another analysis, or work through a sequence of trees generated via sampling. For this example (and the following tutorial) we will work with a 100-tip Newick tree generated by sampling on a population process using MASTER.

Our example Newick tree, sir2Master.nwk, was generated by executing the genSIR2Trees.R script:

  > Rscript genSIR2Trees.R <num-trees>

The following steps are executed until <num-trees> are generated:

The MASTER model describes the same two-model population model we are working with but specified in terms of Chemical Master Equations (CME). The specification instructs MASTER to start the simulation with initial values 999,1 and 0 for S, I0 and I1, respectively, and sample 200 demes (tree tips) at time 20. The resulting tree is stored in Nexus format by MASTER, processed by the script (trees with less than 200 tips are discarded, internal node annotations are removed) and written back in Newick format.

The script produces an annotated phylogenetic tree. In our context, it produces a structured phylogenetic tree with tips labeled with deme information (population names). For example, the following extract from sir2Master.nwk

((95_I0:0.009020825955,96_I0:0.009020825955):1.633546797,97_I1:1.642567623):1.326547235

shows that tips 95,96 and 97 correspond to demes I0,I0 and I1, respectively.

XML Generation

This part of the tutorial shows how to use XMLGenerator XML element to generate ready-to-use PhyDyn XML files. We'll start by generating a PhyDyn analysis that outputs the likelihood of a phylogenetic tree given a population trajectory. In order to do this, we need a population model specification (two-deme) and a phylogenetic tree (sir2Master.nwk).

Copy the PhyDyn file sir2Trajectory0.xml (or the sir2TrajectoryFinal.xml file created in the previous tutorial) and name it sir2Generate.xml.

  > cp sir2Trajectory0.xml sir2Generate.xml.

Open the new file in a text editor. Replace the TrajectoryOut component

with

The parameters above instruct XMLGenerator to generate the XML file sir2Likelihood.xml made of the STreeLikelihoodODE and LikelihoodOut components (xmlType="likelihood") with information extracted from the twodeme population model and the Newick phylogenetic tree stored in sir2Master.nwk. Furthermore,

Execute the file by running BEAST2/PhyDyn:

  > beast sir2Generate.xml
File: sir2Generate.xml seed: 127 threads: 1
Generating Likelihood test xml
output xml file = sir2Likelihood.xml
root height: 13.225650562640002
num tips = 200
-- Population model
model-name = twodeme;
... <model-definition>
Output sent to XML file: sir2Likelihood.xml

You should have a new file, sir2Likelihood.xml (as indicated by the outputFile parameter). The new file is ready to go:

> beast sir2Likelihood.xml 
File: sir2Likelihood.xml seed: 127 threads: 1
model-name = twodeme;
.... model definition
logP=-1138.0153050879976

Let's pause for a moment to take a look at the generated XML (sir2Likelihood.xml). We can identify the following components:

As we mentioned above, the LikelihoodOut component is essentially a wrapper, a runnable component used to extract and print the log-likelihood computed by StreeLikelihoodODE. LikelihoodOut does not take part of proper PhyDyn/BEAST analyses but it's useful at this stage of the analysis design process i.e. when testing our population model in combination with phylogenetic trees, and the corresponding BEAST components such as Taxa and Traits. It also helps to catch syntactical errors when creating/editing the XMLs "by hand".

Tip dates and t1

The XML file generated above encodes the simplest STreeLikelihoodODE: likelihood and tree components do not point to any date or type traits. In this example, a type trait is not needed because tip annotations can be used to extract deme information. In the absence of a date trait, PhyDyn uses the value of t1 to date the most recent tip i.e. tips with zero height.

Modify t1: The current value of t1 is 20; this gives us a log-likelihood of -1138.0153 . Given that the tips of our example tree were sampled at time 20, and that we are using the same rate parameters as the MASTER model used to generate the tree, we should expect that the value of t1 that maximises the likelihood should be around 20. We could check this by trying different values of t1.

> beast sir2Likelihood.xml
File: sir2Likelihood.xml seed: 127 threads: 1
model-name = twodeme;
... <model-definitions>
Structured tree likelihood
logP=-16727.9757

If you do the same for values of t1 between 15 and 26 (space 1), you will get the following table:

t1 LogLh t1 LogLh t1 LogLh t1 LogLh
15 -16727.97 16 -7865.60 17 -1413.93 18 -1136.62
19 -1133.79 20 -1138.01 21 -1148.75 22 -1165.00
23 -1185.21 24 -1207.52 25 -1230.46 26 -1252.72

The likelihood values behave as expected. For this particular set of times, the maximum log-likelihood value corresponds to t1=19. Proper likelihood maximisation will probably throw a value of t1 different than 20 but, definitely, in the interval [18,21].

** Removing t1** If we remove t1 from our population model, and in the absence of dating information (DateTrait), PhyDyn will date the most recent tip with the height of the root. We can verify this. Open sir2Likelihood.xml in a text editor and:

Introducing Date and Type Traits

Most BEAST analyses (for example, analyses generated by BEAUti from DNA sequences) use a Date Trait to provide sampling dates. Similarly, Type traits are used to map tips to deme names. We can introduce both traits to our example by setting the createDateTrait and createTypeTrait XMLGenerator parameters to true.

Run BEAST with sir2Generate.xml:

> beast sir2Generate.xml
Generating Likelihood test xml
output xml file = sir2LikelihoodTraits.xml
--- tree file: sir2Master.nwk
root height: 13.225650562640002
num tips = 200
-- Population model
model-name = twodeme;
... <model-definitions>
Output sent to XML file: sir2LikelihoodTraits.xml

The generated file sir2LikelihoodTraits.xml has the following additional components:

<trait id="twodeme.dates" spec="beast.evolution.tree.TraitSet" traitname="date" value="1_I1=20.0,2_I1=20.0,3_I1=20.0,4_I1=20.0 ..."

value="1_I1=I1,2_I1=I1,3_I1=I1,4_I1=I1,5_I1=I1,6_I1=I1,7_I1=I1,8_I1=I1,9_I0=I0,..."

We expect to get the same log-likelihood back:

> beast sir2LikelihoodTraits.xml
File: sir2LikelihoodTraits.xml seed: 127 threads: 1
<trait-information>
<model-information>
Structured tree likelihood
logP=-1138.0153050879976

Let's now generate sir2LikelihoodTraits.xml using the same values of t1 as the ones used in the previous section.