Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Approximating the Bacterial Ancestral Recombination Graph Xavier Didelot (University of Warwick) Daniel Lawson (University of Bristol) Aaron Darling (University of California-Davis) Peter Green (University of Bristol) Daniel Falush (University College Cork) Graphical Models and Genetic Applications University of Warwick 15-17 April 2009 Approximating the Bacterial Ancestral Recombination Graph – p. 1/2 Bacterial Reproduction Mutation • Bacteria reproduce clonally: • Single bacteria splits into two identical copies Approximating the Bacterial Ancestral Recombination Graph – p. 2/2 Bacterial Reproduction Mutation Recombination • • Bacteria reproduce clonally: • Single bacteria splits into two identical copies But exchange DNA: • Transformation - from the environment • Transduction - transmission via bacteriophage • Conjugation - direct from another bacterium Approximating the Bacterial Ancestral Recombination Graph – p. 2/2 Forward in time models Moran model with constant population size N = 6. Timestep: 1. Each individual gives birth at rate b. 2. A randomly chosen individual is killed for each birth. Time Observed Sequences Approximating the Bacterial Ancestral Recombination Graph – p. 3/2 The coalescent Pure death process with rate k(k − 1)/2 where k is the number of lineages alive at a given time. Coalescent time Time Observed Sequences Allows us to calculate probability of a given tree. Approximating the Bacterial Ancestral Recombination Graph – p. 4/2 Recombination model Allow recombination to occur with probability r for each birth; a second parent is chosen at random. Time Observed Sequences Approximating the Bacterial Ancestral Recombination Graph – p. 5/2 Ancestral recombination graph Birth-death process with recombination (birth) at rate ρk/2 and coalescence (death) at rate k(k − 1)/2. Allows us to calculate probability of a given graph. Approximating the Bacterial Ancestral Recombination Graph – p. 6/2 ARG with crossover When a recombination occurs, a point S is chosen uniformly at random on the gene. The material on the left of S follows one line of descent and the material on the right the other. • Symmetry of the two parents; • This is the version of the ARG found in most of the literature. Approximating the Bacterial Ancestral Recombination Graph – p. 7/2 ARG with gene conversion When a recombination occurs, a small fragment is contributed by the donor cell and the rest by the recipient. • Asymmetry of the two parents (recipient/donor); • This is the version of the ARG which is relevant to bacteria; • Studied by Wiuf and Hein (2000). Approximating the Bacterial Ancestral Recombination Graph – p. 8/2 Clonal genealogy If we always follow the line of the recipient, we get a tree called “clonal genealogy”. • Concept unique to ARG with gene conversion; • The clonal genealogy is distributed as a coalescent tree. Approximating the Bacterial Ancestral Recombination Graph – p. 9/2 Ancestral material and local trees For each branch, the material that ends up in the leaves is called “ancestral material” and shown in bold. If we follow the ancestral material for a given site, we get the “local tree” of that site. Local tree for the first DNA base Local tree for the last DNA base Approximating the Bacterial Ancestral Recombination Graph – p. 10/2 Full ARG inference • Probability of the model: P (Graph|Data) ∝ P (Data|Graph)P (Graph) • Inference under the full ARG model is difficult: • The state-space of ARGs is huge; • The data are uninformative about the actual ARG; • An ARG is more than the sum of its local trees. • Importance sampling algorithm of Fearnhead and Donnelly (2001): a month for 31 sequences of 500bp; • Can we find a good approximation? Approximating the Bacterial Ancestral Recombination Graph – p. 11/2 Approximating the Bacterial Ancestral Recombination Graph Xavier Didelot (University of Warwick) Daniel Lawson (University of Bristol) Aaron Darling (University of California-Davis) Peter Green (University of Bristol) Daniel Falush (University College Cork) Graphical Models and Genetic Applications University of Warwick 15-17 April 2009 Approximating the Bacterial Ancestral Recombination Graph – p. 12/2 Approximating the ARG • A good approximation to the ARG should: • Contain the parts of the ARG that are well informed and biologically important • Simplify the rest of the graph to make inference possible • The clonal genealogy is of central interest in most studies. It is also likely to be well informed when the data is large and the recombination rate not too big. • Thus we use the clonal genealogy as the centre of our approximation, and simplify the recombination process. Approximating the Bacterial Ancestral Recombination Graph – p. 13/2 The weak ARG model • Two differences with full ARG: • Recombinant edges not allowed to coalesce with each other; • Recombinant edges not allowed to recombine. • The weak ARG is a “tree with added edges” rather than a graph Approximating the Bacterial Ancestral Recombination Graph – p. 14/2 Probability of a wARG • A weak ARG is made of a clonal genealogy T and a set R of recombinant edges. • Each recombinant edge in R is characterized by an origin ai on T , a destination bi on T , a starting site xi and an ending site yi . • Probability of a wARG: P (Graph) = P (T )P (R|T ) with: N Y i exp − ti P (T ) = 2 i=2 P (R|T ) = exp(−ρT /2) ρT 2 R Y R P (xi , yi )T −1 exp(−L(ai , bi )) i=1 Approximating the Bacterial Ancestral Recombination Graph – p. 15/2 Bayesian Inference • Inference problem: given a set of observed sequences, can we infer the ancestry graph? • Likelihood calculation: as for the full ARG, the wARG defines a local tree for each site, so we can use Felsenstein’s pruning algorithm. • Reversible-jump Monte-Carlo Markov Chain • Moves: • Subtree pruning and regrafting for T ; • Add and remove recombinant edges in R (only the local likelihood needs recalculating); • Updates for mutation rate θ, recombination rate ρ and mean import length δ. Approximating the Bacterial Ancestral Recombination Graph – p. 16/2 Implementation • Implementation in progress, should be ready this year • Current status: infer the recombination process given the clonal genealogy • In progress: jointly infer the clonal genealogy • The current implementation is interesting in itself, when the clonal genealogy is clear (eg. because ρ is low or lots of data) and we want to characterize patterns of recombination • Example: Multi-Locus Sequence Typing (MLST) data for 57 isolates from the Bacillus cereus group (Priest et al., 2004). • These are very preliminary results! Approximating the Bacterial Ancestral Recombination Graph – p. 17/2 Bacillus: clonal genealogy 0 1 2 9 54 24 44 42 23 28 4 29 6 5 27 25 35 3 21 16 19 47 11 26 14 49 45 31 50 43 7 22 12 41 15 51 36 37 48 30 56 17 55 8 40 10 13 20 53 52 46 18 32 38 33 34 39 Anthracis Cereus Tolworthi Kurstaki Thuringiensis Sotto Others 0.1 Approximating the Bacterial Ancestral Recombination Graph – p. 18/2 Bacillus: a single iteration 0 1 2 9 54 24 44 42 23 28 4 29 6 5 27 25 35 3 21 16 19 47 11 26 14 49 45 31 50 43 7 22 12 41 15 51 36 37 48 30 56 17 55 8 40 10 13 20 53 52 46 18 32 38 33 34 39 Anthracis Cereus Tolworthi Kurstaki Thuringiensis Sotto Others 0.1 Problem: how can we summarize all iterations? Approximating the Bacterial Ancestral Recombination Graph – p. 19/2 Bacillus analysis 0 1 2 9 54 24 44 42 23 28 4 29 6 5 27 25 35 3 21 16 19 47 11 26 14 49 45 31 50 43 7 22 12 41 15 51 36 37 48 30 56 17 55 8 40 10 13 20 53 52 46 18 32 38 33 34 39 0.1 Anthracis Cereus Tolworthi Kurstaki Thuringiensis Sotto Others 0.1 Distribution of orgins for the imports on a given branch Approximating the Bacterial Ancestral Recombination Graph – p. 20/2 Bacillus analysis 0 1 2 9 54 24 44 42 23 28 4 29 6 5 27 25 35 3 21 16 19 47 11 26 14 49 45 31 50 43 7 22 12 41 15 51 36 37 48 30 56 17 55 8 40 10 13 20 53 52 46 18 32 38 33 34 39 1 Anthracis Cereus Clade 1 Tolworthi Kurstaki Clade 2 Thuringiensis Sotto Others Clade 3 0.1 Distribution of arrivals colour-coded by origin Approximating the Bacterial Ancestral Recombination Graph – p. 21/2 Bacillus analysis 25.964 0 1 2838 Distribution of recombination along the genes Approximating the Bacterial Ancestral Recombination Graph – p. 22/2 Conclusions • The ancestral recombination graph is a powerful model of ancestry, but is not usable in inference • Need for efficient approximations • Ignoring recombination completely (as most methods do) is dangerous • The weak ARG is based on the existence in bacteria of a clonal genealogy, and approximates the recombination process by modelling the origin of each import but not its full ancestry • Our approximation is simple enough to perform inference • Our program can reveal patterns of recombination • It can also be used to infer the clonal genealogy Approximating the Bacterial Ancestral Recombination Graph – p. 23/2