Download Complex inheritance/disease

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Complex inheritance/disease
Many Other
Genes
SNP Resources: Variation Discovery,
HapMap and the EGP
Variant
Gene
Environment
Disease
Mark J. Rieder
Department of Genome Sciences
mrieder@u.
washington..edu
mrieder@u.washington
Diabetes
Obesity
Cancer
Strategies for Genetic Analysis
C/C
C/C
C/T
C/T
C/C
C/C C/T
Strategies for Genetic Analysis (GWAS)
Populations
Association Studies
Samples with phenotype data (continuous, case-control)
(n = 100's - 1000's)
Genotype samples with commercial 'chips'
• Affymetrix - Random SNP design (v.5, v.6)
• lllumina - preselected tagSNP design (650Y, 1M)
C/T
C/T
C/T
Schizophrenia
Celiac Disease
Autism
Two hypotheses:
1- common disease/common variant?
2- common disease/many rare variants?
NIEHS SNPs Workshop
Jan 10-11, 2008
Families
Linkage Studies
Heart Disease
Multiple Sclerosis
Asthma
C/C
C/C
C /C C/C
40% T, 60% C
Cases
15% T, 85% C
Perform statistical association with each SNP
Calculate p-value for each SNP
Controls
Simple Inheritance
Complex Inheritance
Single Gene
Multiple Genes
Rare Variants
Common Variants
~600 Short Tandem Repeat Markers
Polymorphic Markers > 300,000 -1,000,000
Single Nucleotide Polymorphisms (SNPs)
Region with plausible statistical association
1
SNP Resources: SNP discovery and cataloging
Development of a genome-wide SNP map: How many SNPs?
1. SNP discovery/genotyping: Genome-wide approaches
 SNP Consortium
 HapMap
2. The current state of SNP resources (dbSNP)
3. Comprehensive SNP discovery
NIEHS SNPs - Environmental Genome Project
Nickerson and Kruglyak, Nature Genetics, 2001
~ 10 million common SNPs (> 1- 5% MAF) - 1/300 bp
SNP Databases - “How to”
to” Manual for finding SNPs
In class - Tutorial
How has SNP discovery progressed toward this goal?
HapMap Project: Create a genome-wide SNP map
Phase I 1M SNPs
Phase II 4M SNPs
Genotype SNPs in four populations:
•
•
•
•
CEPH (CEU) (Europe - n = 90, trios)
Yoruban (YRI) (Africa - n = 90, trios)
Japanese (JPT) (Asian - n = 45)
Chinese (HCB) (Asian - n =45)
Density ~ 1 SNP/kb
626 genes
15 Mbp
91,000 SNPs
To produce a genome-wide map of common variation
Common Variant/Common Disease
Density - 1 SNP/166 bp
2
Finding SNPs: Marker Discovery and Methods
Finding SNPs: Sequence-based SNP Mining
Genomic
mRNA
SNP discovery has proceeded in two distinct phases:
RT errors?
cDNA
Library
1 - SNP Identification/Discovery
Define the alleles
Map this to a unique place in the genome
Sequencing
Quality
EST
Overlap
2 - SNP Characterization (HapMap genotyping)
Determination of the genotype in many individuals
Population frequency of SNPs
DNA
SEQUENCING
BAC
Library
RRS
Library
BAC
Overlap
Shotgun
Overlap
Sequence Overlap - SNP Discovery
GTTACGCCAATACAGG
GTTACGCCAATACAGGATCCAGGAGATTACC
GTTACGCCAATACAGC
GTTACGCCAATACAGCATCCAGGAGATTACC
HapMap SNP Discovery: Prior to Genotyping
Finding SNPs: Sequence-based SNP Mining
Initiation of project planning (July 2001):
2.8 million SNPs (1.4 million validated) - 1/1900 bp
Nov 2003 - 5.7 million (2 million validated) - 1/1500 bp
Feb 2004 - 7.2 million (3.3 million validated) - 1/900 bp
BAC = Bacterial Artificial Chromosome
Primary vector for DNA cloning in the HGP
DNA from multiple individuals
Clone large fragments into BACs
(unknown sequence)
Generate more SNPs:
Random Shotgun Sequencing
Fragment DNA
Sequence and Reassemble
(known sequence)
Genomic DNA
(multiple individuals)
Other Sources of SNPs:
Perlegen (Affymetrix
(Affymetrix chips) SNP data (chr22)
Sequence chromatograms from Celera project
Assembly with other overlapping BACs
Sequence and align
(reference sequence)
TACGCCT
TACGCCT ATA
TCA
TCA AGGAGAT
GTTACGCCA
GTTACGCCAATACAGGATCC
ATACAGGATCC AGGAGATTACC
GTTACGCCAATACAGG
GTTACGCCAATACAGGATCCAGGAGATTACC
GTTACGCCAATACAGC
GTTACGCCAATACAGCATCCAGGAGATTACC
3
Draft Human Genome
SNP Discovery: dbSNP database
SNP data submitted to dbSNP: Clustering
dbSNP processing of SNPs
dbSNP
-NCBI SNP database
SNPs submitted
by research community
(submitted SNPs = ss#)
Unique mapping
to a genome location
(reference
eference SNP
NP = rs#)
Validated
Unvalidated
(by 2hit-2allele)
Finding SNPs: Marker Discovery and Methods
HapMap Discovery Increased SNP Density and Validated SNPs
12.0
TotalTotalSNPs
Reference SNPs
HapMap SNP Discovery
12.00
10.0
Validated SNPs
Validated
SNPs
11+ million
rs SNPs
10.00
8.0
SNPs(millions)
SNPs (millions)
SNP discovery has proceeded in two distinct phases:
14.00
6.0
4.0
5.6 million
validated
rs SNPs
6.00
4.00
2.0
1 - SNP Identification/Discovery
Define the alleles
Map this to a unique place in the genome
8.00
2 - SNP Characterization (HapMap genotyping)
Determination of the genotype in many individuals
Population frequency of SNPs
2.00
0.00
0.0
Jan-03
Mar-03
Jan-03 Mar-03 Jun-03 Aug-03 Oct-03 Jan-04 Mar-04 Jun-04 Sep-04 Jan-05 Mar-05 Jun-06 Mar-07
Jun-03
Aug-03
Nov-03
Jan-04
Mar-04
Jun-04
dbSNP Release
Sep-04
Jan-05
Mar-05
dbSNP Release
4
HapMap Project: Create a genome-wide SNP map
Finding SNPs: Genotype Data Adds Value to SNPs
Indiv.#1:
Indiv.#1: GTTACGCCAATACAG[G
GTTACGCCAATACAG[G/G]ATCCAGGAGATTACC
Indiv.#2:
Indiv.#2: GTTACGCCAATACAG[C
GTTACGCCAATACAG[C/G]ATCCAGGAGATTACC
Indiv.#3:
Indiv.#3: GTTACGCCAATACAG[C
GTTACGCCAATACAG[C/C]ATCCAGGAGATTACC
Genotype SNPs in four populations:
•
•
•
•
Confirms SNP as “real”
real” and “informative”
informative”
Minor Allele Frequency (MAF) - common or rare
MAF differs by different population
Detection of SNP x SNP correlations
(Linkage Disequilibrium)
 Determine tagging SNPs (tagSNPs
(tagSNPs))
 Determine haplotypes




CEPH (CEU) (Europe - n = 90, trios)
Yoruban (YRI) (Africa - n = 90, trios)
Japanese (JPT) (Asian - n = 45)
Chinese (HCB) (Asian - n =45)
To produce a genome-wide map of common variation
Interleukin 6 – Collapsed Dataset: Minor Allele Frequencey (MAF) > 10 %
dbSNP: All validated SNPs now have been genotyped!
6.00
African-Am.
Validated SNPs
SNPs with Genotypes
SNPs(millions)
5.00
European-Am.
HapMap Genotyping
5.6 million
validated
rs SNPs
4.00
3.00
2.00
1.00
0.00
Jan-03
Mar-03
Jun-03 Aug-03 Oct-03
Jan-04
Mar-04
Jun-04 Sep-04
Jan-05
Mar-05 May-06 Mar-07
dbSNP Release
• Most of the SNPs in the genome are rare
• Significant differences in SNP frequency can exist
• Correlation between SNPs can be used to select the most informative SNPs
HapMap release = 4 M validated QC SNPs
5
Development of a genome-wide SNP map: How many SNPs?
Finding SNPs: Sequence-based SNP Mining
Genomic
DNA
SEQUENCING
Nickerson and Kruglyak, Nature Genetics, 2001
~ 10 million common SNPs (> 1-5% MAF) - 1/300 bp
Feb 2001 - 1.42 million (1/1900 bp)
Nov 2003 - 2.0 million (1/1500 bp)
Feb 2004 - 3.3 million (1/900 bp)
Mar 2007 - 5.6 million (validated - 1/535 bp)
Fraction of SNPs Discovered
1.0
BAC
Overlap
Shotgun
Overlap
Random
Shotgun
cDNA
Library
EST
Align to
Reference Overlap
GTTACGCCAATACAGG
GTTACGCCAATACAGGATCCAGGAGATTACC
GTTACGCCAATACAGC
GTTACGCCAATACAGCATCCAGGAGATTACC
SNP discovery is dependent on your sample population size
{
RRS
Library
RANDOM Sequence Overlap - SNP Discovery
When will we have them all?
2 chromosomes
BAC
Library
mRNA
SNP Characterization/Genotyping
GTTACGCCAATACAGG
GTTACGCCAATACAGGATCCAGGAGATTACC
GTTACGCCAATACAGC
GTTACGCCAATACAGCATCCAGGAGATTACC
HapMap data doesn’
doesn’t capture
low frequency SNPs
88
0.5
Nickerson and Kruglyak, Nature Genetics, 2001
~ 10 million common SNPs (>1-5% MAF) - 1/300 bp
Mar 2007 - 5.6 million (validated/mapped - 1/535 bp)
2
0.0
0.0
0.1
0.2
0.3
0.4
When will we have them all?
0.5
HapMap data doesn’
doesn’t capture low frequency alleles
Used as a proxy for all common variation through linkage disequilibrium
Minor Allele Frequency (MAF)
6
Targeted SNP Discovery
Increasing Sample Size Improves SNP Discovery
Directed analysis: cSNPs
GTTACGCCAATACAGGATCCAGGAGATTACC
{ GTTACGCCAATACAGG
GTTACGCCAATACAGC
GTTACGCCAATACAGCATCCAGGAGATTACC
2 chromosomes
Arg-Cys
3’
Val-Val
Fraction of SNPs Discovered
5’
1.0
PCR amplicons
Complete analysis: cSNPs, Linkage Disequilbrium and Haplotype Data
5’
3’
Arg-Cys
Val-Val
96
Resequencing
(n=48)
48
24
16
HapMap
Based on
~ 8
chromosomes
88
0.5
2
PCR amplicons
0.0
•Generate SNP data from complete genomic resequencing
(i.e. 5’ regulatory, exon, intron, 3’ regulatory sequence)
0.0
0.1
0.2
0.3
0.4
0.5
Minor Allele Frequency (MAF)
Increasing SNP Density: HapMap ENCODE Project
ENCODE = ENCyclopedia
ENCyclopedia Of DNA Elements
Catalog all functional elements in 1% of the genome (30 Mb)
Goal: Comprehensively identify all common sequence variation in
candidate genes
10 Regions x 500 kb/region (Pilot Project)
David Altschuler (Broad), Richard Gibbs (Baylor)
16 CEU, 16 YRI, 8 HCB, 8 JPT
Comprehensive PCR based resequencing across these regions
Initial biological focus: Candidate environmental response genes
involved in DNA repair, cell cycle, apoptosis, metabolism, cell
signaling, and oxidative stress.
15,367 dbSNP
16,248 New SNPs
50% of SNPs in dbSNP
Approach: Direct resequencing of genes
Samples:
PDR-90 ethnically diverse individuals representative of U.S. population (397 genes)
EGP95-95 samples from four ethnic groups (227 genes)
5 Mb/31,500 SNPs =
1/160 bp
(24 HapMap Asians, 22 HapMap Europeans, 12 HapMap Yorubans, 15 African
Americans, 22 Hispanic)
7
Summary of NIEHS SNPs genotypes in dbSNP
Development of a genome-wide SNP map: How many SNPs?
Nickerson and Kruglyak, Nature Genetics, 2001
~ 10 million SNPs (>1% MAF-estimated) - 1/300 bp
626 genes sequenced
15 Mb scanned
> 90,000 genotyped SNPs identified
> 8 million genotypes deposited in dbSNP
NIEHS SNPs = 1/166 bp (n = 95, 4 pops)
HapMap ENCODE = 1/160 bp (n = 48, 3 pops)
Comprehensive resequencing can identify the
vast majority of SNPs in a region
Nov 2005 - Zaitlen et al. Genome Research 15:1594-1600
SNP Discovery: dbSNP database
dbSNP (Perlegen/HapMap)
Transcription
coupled repair
Mismatch repair
NIEHS SNPs
Double strand
break repair
60%
Pathways
15%
{
{
25%
25%
Cell cycle
Apoptosis
Minor Allele Freq. (MAF)
Rarer and population specific SNPs are found by resequencing
8
Protein disrupting SNPs in EGP data
nsSNPs are found in ~60% of genes
Deleterious nsSNPs are preferentially low frequency
Computational inference of nsSNP function
SIFT = Sorting Intolerant From Tolerant
Evolutionary comparison of non-synonymous SNPs
Classifications:
Tolerant, Intolerant
Ng and Henikoff, Genome Research, 12:436-446, 2002
PolyPhen - Polymorphism Phenotyping
Structural protein characteristics and evolutionary comparison
Classifications:
benign, possibly damaging, probably damaging
Ramensky, et al., Nucleic Acids Research, 30:17:3894-3900, 2002
9
Illumina NIEHS SNPs Genotyping
EGP and HapMap Genotyping
PDR = 90 ethnically diverse individuals representative of U.S. population
Genotypes
Allele Frequency
HWE
European (CEU,n=60)
EGP p1 panel
PDR
ethnically diverse
representation of US
n = 90 (anonymous)
397 genes/55,000 SNPs
African (YRI, n =60)
Each well samples 1536 SNPs in one individual
For each HapMap sample 5 x 1536 (7680 genotyped SNPs)
3,000,000 genotypes generated (total ~400 samples)
Asian (HCB, n = 45)
Select SNPs
Asian (JPT, n = 45)
~7600 SNPs
Illumina
Afr.-American
Afr.-American (n = 62)
Hispanic (n = 60)
NIEHS SNPs Genotype Data
CEPH
0.341
Japanese
Chinese
African
American
0.274
0 . 282
0.931
0 . 610
0.626
0.533
Hispanic
0.415
Yoruban
PDR (397 genes)
SNPs characterized in six
different major
major populations.
populations.
0.841
CEPH
0 . 952
0 . 410
0 . 761
Japanese
0 . 420
0 . 765
Chinese
0 . 589
African
American
Population Allele Frequency Correlations
Illumina NIEHS SNPs Genotyping
10
Hispanic
EGP and HapMap Compatibility
HapMap/US Pop.
n = 332
Panel 2 (HapMap based, n=95)
European (CEU,n=60)
European (CEU,n=22)
African (YRI, n =60)
EGP p2 DNA panel
HapMap Integration (~4 million SNPs)
High Density Genic Coverage (EGP)
African (YRI, n =12)
Asian (HCB, n = 45)
Asian (JPT, n = 45)
Asian (JPT & HCB, n = 24
Low Density Genome Coverage (HapMap)
Afr.-American
Afr.-American (n = 15)
Afr.-American
Afr.-American (n = 62)
Hispanic (n = 60)
= EGP SNP discovery (1/200 bp)
= HapMap SNPs (~1/1000 bp)
Hispanic (n = 22)
Summary: The Current State of SNP Resources
 Approximately 10 million common SNPs exist in the human genome
(1/300 bp).
 Random SNP discovery processes generate many SNPs (HapMap)
 Random approaches to SNP discovery have reached limits of discovery
and validation (1/600 bp; 50% SNP validation)
 Most validated SNPs (5+ million) have been genotyped by the
HapMap (3 pops)
 Resequencing approaches continue to catalog important variants (rare
and common not captured by the HapMap)
 NIEHS SNPs has generated SNP data on >600 candidate genes
and 90 K SNPs along with large-scale HapMap+ characterization data
for of SNPs generated using the PDR
11
Related documents