Considerations for Analyzing Targeted NGS Data – HLA

Tim Hague, CTO

Introduction
• Human leukocyte antigen (HLA) is the major histocompatibility complex (MHC) in humans.
• A group of genes (a 'superregion') on chromosome 6.
• Essentially encodes cell-surface antigen-presenting proteins.

Functions
HLA genes have functions in:
• combating infectious diseases
• graft/transplant rejection
• autoimmunity
• cancer

Alleles
• Large number of alleles (and proteins).
• Many alleles are already known.
• The number of known alleles is increasing.

HLA Class I
Gene      A     B     C
Alleles   2013  2605  1551
Proteins  1448  1988  1119

HLA Class II
Gene      DRA  DRB*  DQA1  DQB1  DPA1  DPB1
Alleles   7    1260  47    176   34    155
Proteins  2    901   29    126   17    134

HLA Class II - DRB Alleles
Gene      DRB1  DRB3  DRB4  DRB5
Alleles   1159  58    15    20
Proteins  860   46    8     17

Analysis Challenges
HLA genes have specific analysis challenges regardless of the sequencing technology.

High Polymorphism
• High rate of polymorphism – up to 100 times the average human mutation rate.
• The HLA-DRB1 and HLA-B loci have the highest sequence variation rate within the human genome.
• High degree of heterozygosity – homozygotes are the exception in this region.

Duplications
• High level of segmental duplications.
• Lots of similar genes and lots of very similar pseudogenes.
• Within an individual, duplicated segments can be more similar to each other than to the corresponding segments of the reference genome.

Complex Genetics
• Particularly HLA-DRB*.
• The DR β-chain is encoded by 4 loci; however, no more than 3 functional loci are present in a single individual, and no more than 2 per chromosome.

Mitigating Factors
It's not all bad news:
• Many HLA alleles are already well known, both in terms of sequence and of frequencies within the population.
• The HLA region is fairly small, so there is a high degree of linkage disequilibrium, and therefore lots of known haplotypes.
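The linkage disequilibrium mentioned above is commonly quantified as D = p(AB) - p(A) * p(B), the difference between the observed frequency of a two-locus haplotype and the frequency expected if the two alleles were inherited independently. The sketch below is only an illustration of that standard formula: the allele labels and haplotype counts are made up, and the function name is hypothetical rather than taken from any HLA tool.

```python
def linkage_disequilibrium(haplotypes, allele_a, allele_b):
    """Return (D, r_squared) for allele_a at locus 1 and allele_b at locus 2.

    haplotypes: list of (allele_at_locus_1, allele_at_locus_2) tuples,
    one tuple per observed chromosome. Each allele is treated as
    "this allele" versus "any other allele" at its locus.
    """
    n = len(haplotypes)
    p_a = sum(1 for h in haplotypes if h[0] == allele_a) / n
    p_b = sum(1 for h in haplotypes if h[1] == allele_b) / n
    p_ab = sum(1 for h in haplotypes if h == (allele_a, allele_b)) / n

    # D is zero when the two alleles co-occur exactly as often as chance predicts.
    d = p_ab - p_a * p_b

    # r^2 rescales D to the range 0..1 so different allele pairs can be compared.
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = d * d / denom if denom > 0 else 0.0
    return d, r_squared


# Hypothetical sample of 100 chromosomes (labels and counts are invented).
example = (
    [("A1", "B1")] * 45
    + [("A2", "B2")] * 45
    + [("A1", "B2")] * 5
    + [("A2", "B1")] * 5
)

d, r_squared = linkage_disequilibrium(example, "A1", "B1")
print(f"D = {d:.2f}, r^2 = {r_squared:.2f}")
```

In this invented sample the two alleles travel together far more often than chance predicts, which is the situation that makes a small set of known haplotypes useful as prior knowledge during typing.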
Traditional Typing
• SSO – low resolution, high throughput, cheap
• SSP – very fast results, low resolution
• SBT – sequence-based typing, high resolution, usually done by Sanger sequencing

NGS Typing
High resolution, an alternative to Sanger-based SBT.
Why is it needed?

Sanger and HLA
• Sanger data is still the gold standard in the genomic sequencing industry, even though it is very expensive compared to NGS.
• Base error rate of 1 in 1,000; if both forward and reverse typing are done, the error rate drops to 1 in 1,000,000.
So why is it bad for HLA?

Phase Resolution
• 2x chromosome 6
• Many loci, many alleles
• Lots of heterozygosity

Allele Phasing Problem
[Figure: a consensus sequence compared with the reference sequence shows two heterozygous positions, G/T and T/A. The two possible assignments of these bases to Allele 1 and Allele 2 cannot be told apart from the consensus alone.]

The Problem with Sanger
• There is only one signal.
• High degree of heterozygosity = high degree of ambiguity.
• Requires statistical techniques based on known allele frequencies, plus manual intervention by trained operators.
• Ambiguity can only be resolved statistically, which can lead to wrong assignments for rare types.

HLA typing by Sanger method
GGACSGGRASACACGGAAWGTGAAGGCCCACTCACAGACTSACCGAGYGRACCTGGGGACCCTGCGCGGCTACTACAACCAGAGCGAGGMCGGT
[Figure: number of potential alleles (axis from 0 to 550) plotted along the consensus sequence.]

NGS Advantages
• Can reduce ambiguity
• Phase resolution: two signals, but lots of short reads
• Cheaper and faster than Sanger
• Less manual intervention required

NGS Data - Unphased

NGS Data - Phased

NGS Approaches
• HLA*IMP – chip-based imputation engine
• Reference-based alignment, followed by an HLA call based on the variants detected during alignment
• Search against a database of known alleles

NGS Reference-based
• Fraught with difficulties
• Very hard to align reads to this region
• The variant/HLA call is only as good as the alignment
• No coverage = no call
Has been attempted by the Broad Institute (HLA Caller) and Roche.

Alignment Efforts
RainDance provides a targeted HLA amplification kit called HLAseq.
Target: the whole MHC superregion (except for some tandem repeat regions)
Goal: align this data before doing the variant/HLA call.

Diverse variant “density” in the MHC superregion
Based on a single sample

Default BWA alignment – No coverage at an exon of HLA-DMB

Low coverage and orphaned reads at an HLA-DRB1 exon

BWA vs more permissive alignment: higher coverage = higher noise

Large targeted region without usable coverage

NGS Reference-based
Does not provide enough coverage everywhere.
What about de novo?

De novo assembly (MIRA)
• 287 contigs (longest contig: 2,199 bp)
• Mean contig size: 268 bp
• Median contig size: 209 bp
• Total consensus: 77,084 bp
• RainDance target: ~3,800,000 bp

NGS De Novo Alignment
Not enough contigs produced, not enough coverage of the target region.
What about a hybrid approach?

De novo assembly with “backbone”
First alignment to a backbone, then de novo assembly.
Backbone: 2,220 contigs from HG19 chr 6 (sum: 3,554,852 bp) → almost the whole RainDance target
Results:
• Max reads per backbone contig: 197
• Max coverage: 71

NGS Typing - Alignment Based
We tried:
• Burrows-Wheeler aligner
• More sensitive seed-and-extend aligner
• De novo aligner
• 'Hybrid' de novo aligner
The variant/HLA call is only as good as the alignment.
The alignments were not good enough.

NGS Database Based
• Search against a 'database' of known alleles
• Such as the IMGT/HLA database, available from the EBI web site
Stanford, Connexio, JSI Medical, BC Cancer Agency and Omixon have all tried this approach.
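The idea of searching reads against a database of known alleles, rather than aligning them to a single reference, can be sketched roughly as follows. This is a toy illustration under strong assumptions, not the method of any of the tools named above: reads are assumed error-free and on the forward strand, a read "supports" an allele only if it occurs verbatim in that allele's sequence, and the allele names and sequences are invented rather than taken from IMGT/HLA.

```python
from itertools import combinations_with_replacement


def call_genotype(reads, alleles):
    """Pick the pair of known alleles that explains the most reads.

    reads:   list of read sequences (assumed error-free, forward strand).
    alleles: dict mapping allele name -> known allele sequence.
    Returns ((allele_1, allele_2), number_of_reads_explained).
    """
    # For each known allele, record which reads it contains verbatim.
    explained = {
        name: {i for i, read in enumerate(reads) if read in seq}
        for name, seq in alleles.items()
    }

    def score(pair):
        covered = explained[pair[0]] | explained[pair[1]]
        # Tie-break: if one allele alone explains as many reads as a
        # heterozygous pair, prefer the simpler homozygous call.
        return (len(covered), pair[0] == pair[1])

    best = max(combinations_with_replacement(sorted(alleles), 2), key=score)
    return best, len(explained[best[0]] | explained[best[1]])


# Invented toy "database": three alleles differing at one or two positions.
toy_alleles = {
    "toy*01": "ACGTACGGTCAGTTAGC",
    "toy*02": "ACGTACGCTCAGTTAGC",
    "toy*03": "ACGTACGGTCAGATAGC",
}

# Invented reads from a sample carrying toy*01 and toy*02.
toy_reads = ["TACGGTCA", "TACGCTCA", "CAGTTAGC", "ACGTACG", "GGTCAGTT"]

print(call_genotype(toy_reads, toy_alleles))
```

Note that the read GGTCAGTT spans both variable positions; without it, the pairs toy*01/toy*02 and toy*02/toy*03 would explain the remaining reads equally well. A sample carrying an allele that is absent from the database cannot be typed correctly by a search of this kind.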
DB Based Approach
Advantages:
• Fewer mapping headaches
• Unambiguous results
• Potential to be fast
Difficulties:
• Novel allele detection
• Homozygous alleles

Results with Exome data

Exon level detail

Detailed results - short read pileup

Conclusions
• The DB based approach to HLA typing is new but very promising.
• NGS approaches can resolve much of the ambiguity of Sanger SBT.
• The DB based approach can also overcome the limitations of NGS reference-based alignment.

Conclusions
Available DB based HLA typing tools differ in:
• Speed
• Sequencers supported
• Types of sequencing data supported (targeted, exome, whole genome)
• Ease of use
• Ambiguity of results
• Degree of manual intervention required
• Novel allele detection capabilities
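As a closing illustration of the phase ambiguity that NGS reads can resolve, the sketch below enumerates every pair of alleles consistent with an unphased consensus written with two-base IUPAC ambiguity codes, then keeps only the pairs compatible with a read that spans two heterozygous positions. The short consensus and the read are invented for illustration, the function names are hypothetical, and reads are again assumed to be error-free.

```python
from itertools import product

# Two-base IUPAC ambiguity codes; three- and four-base codes (B, D, H, V, N)
# are deliberately not handled in this sketch.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
}


def candidate_pairs(consensus):
    """Enumerate unordered allele pairs consistent with an unphased consensus."""
    het = [i for i, code in enumerate(consensus) if len(IUPAC[code]) == 2]
    pairs = set()
    for choice in product((0, 1), repeat=len(het)):
        first, second = list(consensus), list(consensus)
        for pos, pick in zip(het, choice):
            bases = IUPAC[consensus[pos]]
            first[pos] = bases[pick]        # one base goes to the first allele
            second[pos] = bases[1 - pick]   # the other base goes to the second
        pairs.add(frozenset(("".join(first), "".join(second))))
    return pairs


def resolve_with_reads(pairs, reads):
    """Keep only pairs in which every read occurs in at least one allele."""
    return {
        pair for pair in pairs
        if all(any(read in allele for allele in pair) for read in reads)
    }


# Invented consensus with two heterozygous positions: K = G/T, W = A/T.
pairs = candidate_pairs("ACKTGWC")
print(len(pairs), "candidate pairs before phasing")

# A single read covering both positions shows which bases share a chromosome.
print(resolve_with_reads(pairs, ["GTGA"]))
```

With n heterozygous positions there are 2^(n-1) candidate pairs, so the ambiguity doubles with each additional position. The consensus string on the Sanger typing slide contains eight ambiguity codes, which by this count allows 128 pairings before known allele sequences and frequencies are used to narrow them down.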
