* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
Download Document
Survey
Document related concepts
Transcript
DNA Sequencing with Longer Reads
Byung G. Kim
Computer Science Dept.
Univ. of Mass. Lowell
kim@cs.uml.edu
Outline
DNA distribution
Counting problem
DNA sequencing
Reassemble DNA fragments
Part of $1,000 genome sequencing
Bioinformatics
What is bioinformatics ?
Intersection
biology
statistics
computer science
Bioinformatics problems
DNA sequences
Protein sequences/structures
Modeling/inference
DNA and RNA
DNA (deoxyribonucleic acid) and RNA (ribonucleic acid) are composed
of linear chains of monomeric units of nucleotides
A nucleotide has three parts: a sugar, a phophate and a base
Four bases (monomers)
DNA Secondary Structure
Double helix – 1953 Watson and Crick
using X-ray diffraction
Sugar-phosphate backbone is the
outer part of the helix
Two strands run in antiparallel
directions
Two strands are complementary
Base pairing: A-T; G-C
Monomer Counts in DNA
In double strands
In a single strand ?
# of A = # of T; # of G = # of C
Erwin Chargaff’s 1st Parity Rule, 1951
# of A = # of T; # of G = # of C
Erwin Chargaff’s 2nd Parity Rule
How about oligomer (a few successive bases) frequencies ?
Oligomer Frequencies
Oligomer length = k
Window of k sliding by one base (overlapping k-1 bases)
A C T A A G C G ……
A C T A A G C G ……
A C T A A G C G ……
A simple counting program
May have to contend with long sequences
An oligomer and its reverse complement
ACT vs. AGT
Trimer Frequencies in Yeast
Trimer Frequencies in a Few Species
Observations
Symmetric in oligomer and reverse complements
Very similar among chromosomes
Issues
Two successive windows share (k-1) bases
Are they independent ? -- probably not (consider a large k)
Need jumping window
Given an oligomer distribution with k, generate a random sequence
according to the same distribution
Distributions with (k+1) from the random and the real sequences
DNA Sequencing Problem
DNA Sequencing Problem
A sample (genome) is amplified, and broken into multiple fragments
Reconstruct the original sequence from its fragments (reads)
Shot-gun Sequencing
Base coverage
A sample (genome) is amplified
A base in the sample is copied into many reads
But, reads are randomly generated
Poisson distribution
How many copies of reads to cover the whole genome ?
Called sequencing depth (c)
From Poisson, prob. of non-coverage is P(X=0) = exp(-c)
# of uncovered bases G*exp(-c)<1
Human genome (3 Gb): c>22
Sequencing By Hybridization (SBH)
More recent sequencing machines produce short k-mers,
called reads
Given an unknown DNA sequence, DNA array provides
All strings of length l that the sequence contains
No information about their positions
Sequencing by Hybridization (SBH) Problem
Reconstruct a string from its k-mer composition
SBH
Spectrum (s, k)
For string s of length n, the k-mer composition
with a multiset of n-k+1 k-mers in s
k=3, s=TATGGTGC
Spectrum(s.k) = {TAT, ATG, TGG, GGT, GTG, TGC}
Sequencing by Hybridization (SBH) Problem
Reconstruct a string from its l-mer composition
Input: A set S, representing all k-mers from unknown string s
Output: string s such that Spectrum(s, k) = S
Shortest Superstring Problem
Find superstring of the reads, but shortest one
Shortest Superstring Problem
Given a set of strings, find a shortest string that contain all of them
Input: Strings s1, s2, …., sn
Output: A shortest string s that contains all strings s1, s2, …., sn
{ 000 001 010 011 100 101 110 111 }
010
110
011
000
0001110100
001
111
101
100
SBH as a Hamiltonian Path Problem
Two k-mers overlap if overlap(p,q) = k-1
Hamiltonian Path Problem
Given Spectrum (s, k), and a vertex for every k-mer in Spectrum (s, k)
Connect every two vertices if two vertices overlap, and visit every vertex
Overlap-Layout-Consensus (OLC)
NP-complete
OLC
Conventional shotgun sequencing
Overlap-layout-consensus
Use a computer to search for overlap: trying for all possible
pairs of fragments
Layout: putting fragments together
Consensus: error correction
Good for short prokaryote genomes
For n fragments, # of possible overlaps is 2n(n-1)
Difficult
No solution for “repeat problem” to find correct path in the
layout step
Produce sequencing errors
Programs
PHRAP, CAP, TIGR, CELERA
SBH as an Eulerian Path Problem
Break up reads into short k-mers (k = 27, for example)
A graph with all (k-1)-mers : De Bruijn Graph
Edges corresponding to k-mers from Spectrum (s, k)
Find a path visiting every edge exactly once
Eulerian Path Problem
Repeatedly find Eulerian cycles in the graph
Linear time
OLC vs. De Bruijn Graph
Scaffold linkage
Connecting contigs
Paired-end read
Mate-pair
SBH Programs
ABySS (Assembly By Short Sequencing)
Simpson, 2009
Velvet
Zerbino and Birny, 2008
www.cs.uml.edu/~kim/580/08_velvet.pdf
Euler
Pevzner, 2001
www.cs.uml.edu/~kim/580/08_abyss.pdf
www.cs.uml.edu/~kim/580/01_pevzner.pdf
www.cs.uml.edu/~kim/580/09_chaisson.pdf
SOAPdenovo (Short Oligonucleotide Alignment Program)
Beijing Genomics Institute
www.cs.uml.edu/~kim/580/09_soap.pdf
SBH Schematic Overview
ABySS
Proceeds in two stages
Stage 1
Stage 2
All possible k-mers are generated from reads
Remove read errors and construct initial contigs
Use mate-pairs to extend contigs
Distributed implementation of de Bruijn graph in a cluster
using Message Passing Interface (MPI) over multiple
computers
Velvet
Construct a graph
Transform reads into roadmaps
From a read, generate k-mers with read ID and position in the
read (called roadmaps)
Each read is transformed to a set of k-mers with overlaps and
hash links to previous reads with the same k-mers
2nd database
For each read, which k-mers are overlapped by subsequent
reads
Trace reads through the graph using roadmaps
EULER, 2001
EULER
Implement Eulerian Path problem
Issues with real data
Reads may have errors
Error correction is typically done in ‘consensus’ stage
EULER corrects errors in the first step
Repeat problem
De Bruijn graph
EULER +, 2004
EULER+, 2004
A-Bruijn graph
To handle errors in reads, introduce vertices with ungapped
alignments that allow mismtaches rather than exact l-mer in de
Bruijn assembly
Graph simplication algorithms to remove errors in edges
De Bruijn graph is proportional to the coverage and requires a large
memory with a higher coverage with short reads than long–read
sequencing
EULER – SR, 2008
EULER-SR, 2008
Focus on memory-efficient algorithm dealing with Short Reads
Results
Error correction
E. coli – 68% error-free reads
99.6% errors are corrected
12h on a single processor, 1/2h for assembly
EULER – USR, 2009
EULER-USR
Show results of EULER-SR on error-prone Illumina read
data
Show 35-nt reads are sufficient when mate-pairs are
used
Comparisons
OLC vs. de Bruijn Graph
Difficulties in Graph-based Approaches
De Bruijn Graph (DBG) approach
A read (150-500 bp) is broken into a number of k-mers
A read of 150 bp produces 123 k-mers when k=27 (n-k+1)
New sequencing machines produce longer reads
Problem Sizes
OLC
For a large genome, more than giga reads
Eulerian Path
k = 27 => (k -1) = 26
4**(26) = 2**(52) (2**30 = 1 G)
New OLC Approach
Preserve read info
Data Partition
Partition data according to front and rear k-mers
Utilize known oligomer distribution of species
New OLC Approach
Native OLC
Pair-wise fragment overlap
Partial OLC
Partition data
Create consensus string
New OLC Approach
Data Partition
Join F and R buckets with the same k-mers
Front buckets of identical first k-mers of reads
Rear buckets of identical last k-mers of reads
k=15, 415 = 230 = 1 billion buckets * 2
Produce consensus contigs with base matches over a threshold
Repeat generating new F & R buckets,
Each time dealing with longer contigs
Per-bucket sequencing
Benefits
Hybrid of OLC and fixed-size de Bruijn graph
First k or last k bp’s can be rapidly compared
Parallelize the program with all possible combinations of s
Implementation
Dual Index Bucket Map data structure according to prefix
and suffix
A set of reads
A hash table (Java HashMap) mapping prefixes to a Java Vector of
references (pointers) to fragments in the set
Another hash table mapping suffixes to a vector of references to
fragments in the set.
Bucket size is a parameter
Algorithms
Focused merge
Sweep merge
Start with a random (or the largest fragment) and keep merging
from there until there is no way to merge further or the resulting
contig is at least as large as the original input sequence.
Sweep through each bucket repeatedly, making the best merge
each time until no merges can be done
Cutoff for maximum difference between merge of 2 fragments’
base ratios and the original sequence’s base ratios.
Heuristic “oversize penalty” for merges that are significantly larger
than the original sequence
Heuristics for base ratio difference
Average: calculate the base ratios for each sequence, average the
differences for each base
Maximum: take the maximum of the differences of ratio for any
base.
Comparison of OLC algorithms
Yeast genome
Artificial Data with no read errors
* with optional simulated annealing.
Strategy
Contigs
Heuristic
Time
Mixed
Chromosome
Support
Worst
Case
Accuracy
Ease
of Use
Focused
Beam
Search*
Single
Base
ratios
2-15
seconds
Unreliable/
Poor
Low
Not
too
bad
Sweep
Wide
Search*
Multiple
Base
ratios
1-30**
seconds
Excellent
Good
Needs
Micro
managi
ng
Naive
Brute
Force
Single
Actual
overlap
½ hour
to a few
hours
None
Optimal
Simple
Summary
Base and oligomer distributions in species are unique
New OLC sequencing
DBG may not be suitable for longer reads
Utilize
Data partition
Base distribution