Introduction to Gene-Finding: Linkage and Association
Danielle Dick, Sarah Medland, (Ben Neale)

Aim of QTL mapping…
LOCALIZE and then IDENTIFY a locus that regulates a trait (QTL)
• Locus: nucleotide or sequence of nucleotides with variation in the population, with different variants associated with different trait levels

Location and Identification
• Linkage
  - Localize the region of the genome where a QTL that regulates the trait is likely to be harboured
  - Family-specific phenomenon: affected individuals in a family share the same ancestral predisposing DNA segment at a given QTL

Location and Identification
• Association
  - Identify a QTL that regulates the trait
  - Population-specific phenomenon: affected individuals in a population share the same ancestral predisposing DNA segment at a given QTL

Linkage Overview

[Figure: Progress of the Human Genome Project]
[Figure: Human Chromosome 4]

Genetic markers (DNA polymorphisms)
• Microsatellite markers: can be di (2), tri (3), or tetra (4) nucleotide repeats
    ATGCTTGCCACGCE
    ATGCTTCTTGCCATGCE
• Single nucleotide polymorphism
    ATGCTTGCCACGCE
    ATGCTTGCCATGCE

DNA polymorphisms
• Can occur in a gene, but be silent
• Can change the gene product (protein)
  - Alter the amino acid sequence (a lot or a little)
• Can regulate the gene product
  - Upregulate or downregulate protein production
  - Turn the gene off or on
• Can occur in a noncoding region
  - This happens most often!

Mutations

How do we map genes?
• Deviation from Mendel's Independent Assortment Law
  - Aa & Bb = ¼ AB, ¼ Ab, ¼ aB, ¼ ab
  - We're looking for variation from this

Recombination
• Another way of introducing genetic diversity
• Allows us to map genes!
• Crossovers are more likely to occur between genes that are further apart; the likelihood of a recombination event is proportional to the distance
  - Interference: we tend not to see 2 crossovers in a small area
• Alleles that are very close together are more likely to stay together; they don't assort independently

Linkage Mapping (is a marker "linked" to the disease gene?)
• Collect families with affected individuals
• Genome scan: test markers evenly spaced across the entire genome (~every 10 cM, ~400 markers)
• Lod score ("log of the odds"): what are the odds of observing the family marker data if the marker is linked to the disease (less recombination than expected), compared to if the marker is not linked to the disease

Thomas Hunt Morgan: discoverer of linkage

Linkage = Co-segregation
[Pedigree figure with marker genotypes A3A4, A1A2, A1A3, A1A2, A1A4, A2A4, A3A4, A2A3, A3A2]
Marker allele A1 cosegregates with dominant disease

Lod scores
• > 3.0: evidence for linkage
• < -2.0: can rule out linkage
• In between: inconclusive, collect more families

Linkage = Co-segregation
• Parametric linkage has been used very successfully to map disease genes for Mendelian disorders
• Problematic for complex disorders: requires a disease model and penetrance, assumes a gene of major effect, and needs phenotypic precision

Nonparametric Linkage
• Based on allele-sharing
• More appropriate for phenotypes with multiple genes of small effect and environmental influences; no disease model assumed
• Basic unit of data: affected relative (often sibling) pairs

IDENTITY BY DESCENT
Number of parental alleles shared by two sibs, for each combination of the four equally likely (1/4 each) genotypes a sib can inherit:

                 Sib 2
              1/4  1/4  1/4  1/4
  Sib 1  1/4   2    1    1    0
         1/4   1    2    0    1
         1/4   1    0    2    1
         1/4   0    1    1    2

• 4/16 = 1/4 of sibs share BOTH parental alleles: IBD = 2
• 8/16 = 1/2 of sibs share ONE parental allele: IBD = 1
• 4/16 = 1/4 of sibs share NO parental alleles: IBD = 0

Genotypic similarity between relatives
• IBS: alleles shared Identical By State "look the same" and may have the same DNA sequence, but they are not necessarily derived from a known common ancestor (focus for association)
• IBD: alleles shared Identical By Descent are a copy of the same ancestor allele (focus for linkage)
[Pedigree figure: parents M1/M2, Q1/Q2 and M3/M3, Q3/Q4; sibs M1/M3, Q1/Q3 and M1/M3, Q1/Q4. At the marker the sibs are IBS = 2 but IBD = 1]

Genotypic similarity: basic principles
• Loci that are close together are more likely to be inherited together than loci that are further apart
• Loci are likely to be inherited in context, i.e. with their surrounding loci
• Because of this, knowing that a locus is transmitted from a common ancestor is more informative than simply observing that it is the same allele
• Critical to have parental data when possible

Linkage Markers…
• For disease traits (affected/unaffected): affected sib pairs selected
  [Figure: number of sib pairs with IBD = 2, IBD = 1, IBD = 0 at markers 1 to 3, compared with the expected counts]
• For continuous measures: unselected sib pairs
  [Figure: correlation between sibs (0.00 to 1.00) for IBD = 0, IBD = 1, IBD = 2]

So how does all this fit into Mx?
• In biometrical modeling, A is correlated at 1 for MZ twins and .5 for DZ twins
• .5 is the average genome-wide sharing of genes between full siblings (DZ twin relationship)
[Path diagram: ACE model for T1 and T2 with paths a, c, e; A correlated 1 or .5, C correlated 1]
• In linkage analysis we will be estimating an additional variance component Q
• For each locus under analysis, the coefficient of sharing for this parameter will vary for each pair of siblings
• The coefficient will be the probability that the pair of siblings have both inherited the same alleles from a common ancestor
[Path diagram: ACE model plus Q for PTwin1 and PTwin2 with paths q, a, c, e; Q correlated at pi-hat, A at MZ = 1.0 / DZ = 0.5, C at 1.0 for MZ & DZ]

Break down of time spent during a linkage/association study
[Figure: Linkage: cleaning & preparing genotype data; running linkage analyses; estimating significance]

How do we do this?
1. Genotyping data
Microsatellite data
• Ideally positioned at equal genetic distances across the chromosome
• Mostly di/tri nucleotide repeats
http://research.marshfieldclinic.org/genetics/GeneticResearch/screeningsets.asp

Microsatellite data
• Raw data consists of allele lengths/calls (bp)
• Different primers give different lengths
  - So to compare data you MUST know which primers were used

Binning
• Raw allele lengths are converted to allele numbers or lengths
• Example: D1S1646, tri-nucleotide repeat, size range 130-150
  - Logically: work with binned lengths
  - Commonly: assign allele 1 to the 130 allele, 2 to the 133 allele, …
  - Commercially: allele numbers are often assigned based on reference populations (CEPH). So if the first CEPH allele was 136, that would be assigned 1, and 130 & 133 would be assigned the next free allele numbers
• Conclusion: whenever possible, start from the RAW allele size and work with allele length

Error checking
• After binning, check for errors
  - Family relationships (GRR, Rel-pair)
  - Mendelian errors (Sib-pair)
  - Double recombinants (MENDEL, ASPEX, ALLEGRO)
• An iterative process

'Clean' data
• ped file
  - Family, individual, father, mother, sex, dummy, genotypes

Estimating genotypic sharing…
• The ped file is used with 'map' files to obtain estimates of genotypic sharing between relatives at each of the locations under analysis
• Merlin will give you probabilities of sharing 0, 1, 2 alleles for every pair of individuals
[Figure: Merlin output]

Why isn't P0, P1, P2 exact for everyone?
- missing parental genotypes
- low informativeness at marker
[Figure: example genotypes 1/2, 1/2, 2/2, 2/2]

[Path diagram: ACE model plus Q for PTwin1 and PTwin2 with paths q, a, c, e; Q correlated at pi-hat, A at MZ = 1.0 / DZ = 0.5, C at 1.0 for MZ & DZ]

Genotypic similarity between relatives
• IBD: alleles shared Identical By Descent are a copy of the same ancestor allele
• Pairs of siblings may share 0, 1 or 2 alleles IBD
• The probability of a pair of relatives being IBD is known as pi-hat:
    $\hat{\pi} = p(\mathrm{IBD}=2) + 0.5 \cdot p(\mathrm{IBD}=1)$
[Pedigree figure: parents M1/M2, Q1/Q2 and M3/M3, Q3/Q4; sibs M1/M3, Q1/Q3 and M1/M3, Q1/Q4. At the marker the sibs are IBS = 2 but IBD = 1]

Estimating genotypic sharing…
• Output: $\hat{\pi} = p(\mathrm{IBD}=2) + 0.5 \cdot p(\mathrm{IBD}=1)$
[Figure: Merlin output for two pairs, $\hat{\pi}$ = ? for each]

Distribution of pi-hat
• Adult Dutch DZ pairs: distribution of pi-hat at 65 cM on chromosome 19
[Histogram of PIHAT65 from 0.00 to 1.00: Mean = .45, Std. Dev = .30, N = 117]
• pi-hat < 0.25: IBD = 0 group
• pi-hat > 0.75: IBD = 2 group
• others: IBD = 1 group
• pi65cat = (0,1,2)

Linkage Analyses
• Advantage
  - Systematically scan the genome
• Disadvantages
  - Not very powerful
  - Need hundreds to thousands of family members
  - Broad peaks
[Figure: Lod scores]
• Rough conversions: 1 cM ≈ 1 Mb; 1 Mb = 1000 kb; 1 kb = 1000 bp; so 1 cM ≈ 1,000,000 bp

Strategy
1. Ascertain families with multiple affecteds
2. Linkage analyses to identify chromosomal regions
   - allele-sharing among affecteds within a family
   [Figure: Lod scores (0 to 2.5) by position (0 to 160 cM) for Wave 1, Wave 2, and Combined]
3. Association analyses to identify specific genes
   [Figure: Gene A, Gene B, Gene C]

BREAK

Linkage vs. Association
• Linkage analyses look for a relationship between a marker and disease within a family (could be a different marker in each family)
• Association analyses look for a relationship between a marker and disease between families (must be the same marker in all families)

Allelic Association: Extension of linkage to the population
[Pedigree figure: two families with genotypes 3/5, 3/6, 2/6, 5/6 and 3/5, 3/2, 2/6, 5/2]
Both families are 'linked' with the marker, but a different allele is involved

Allelic Association: Extension of linkage to the population
[Pedigree figure: families with genotypes 3/5, 3/6, 2/6, 5/6, 3/6, 3/2, 2/4, 6/2, 4/6, 6/6, 2/6, 6/6]
All families are 'linked' with the marker
Allele 6 is 'associated' with disease

Localization
• Linkage analysis yields broad chromosome regions harbouring many genes
  - Resolution comes from recombination events (meioses) in the families assessed
  - 'Good' in terms of needing few markers, 'poor' in terms of finding the specific variants involved
• Association analysis yields fine-scale resolution of genetic variants
  - Resolution comes from ancestral recombination events
  - 'Good' in terms of finding specific variants, 'poor' in terms of needing many markers

Allelic Association: Three Common Forms
• Direct association
  - Mutant or 'susceptible' polymorphism
  - Allele of interest is itself involved in the phenotype
• Indirect association
  - Allele itself is not involved, but a nearby correlated marker changes the phenotype
• Spurious association
  - Apparent association not related to genetic aetiology (most common outcome…)

Indirect and Direct Allelic Association
[Figure: disease variant D (marked *) flanked by markers M1, M2, …, Mn]
• Direct association: measure disease relevance (*) directly, ignoring correlated markers nearby
• Indirect association & LD: assess trait effects on D via correlated markers (Mi) rather than susceptibility/etiologic variants
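
Indirect association depends on how strongly a marker is correlated with the unobserved variant D. As a minimal sketch of that correlation, the block below computes the standard two-locus LD measures D, D' and r² from a haplotype frequency and two allele frequencies. The function name `ld_measures`, its arguments, and the example frequencies are illustrative assumptions, not values from the slides.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return D, D' and r^2 for two biallelic loci.

    p_ab : frequency of the haplotype carrying allele A (locus 1) and B (locus 2)
    p_a  : frequency of allele A at locus 1
    p_b  : frequency of allele B at locus 2
    All three inputs are hypothetical population frequencies.
    """
    if not (0.0 < p_a < 1.0 and 0.0 < p_b < 1.0):
        raise ValueError("both loci must be polymorphic")

    # D: departure of the haplotype frequency from independence
    d = p_ab - p_a * p_b

    # D' scales |D| by the largest value it could take given the allele frequencies
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0

    # r^2: squared correlation between the alleles at the two loci
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return {"D": d, "D_prime": d_prime, "r2": r2}


# Matched allele frequencies, alleles always together: D' = 1 and r^2 = 1
matched = ld_measures(p_ab=0.5, p_a=0.5, p_b=0.5)

# Discrepant allele frequencies (0.5 vs 0.1): D' is still 1, but r^2 is only about 0.11
discrepant = ld_measures(p_ab=0.1, p_a=0.5, p_b=0.1)
```

The second call illustrates why D' and r² can disagree: D' stays at its maximum whenever one of the four haplotypes is absent, while r² also drops when the allele frequencies at the two loci differ.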
Semantic distinction between
• Linkage disequilibrium: correlation between (any) markers in a population
• Allelic association: correlation between a marker allele and a trait

Decay of Linkage Disequilibrium
[Figure] Reich et al., Nature 2001

Average Levels of LD along chromosomes
[Figure: D' (0.00 to 1.00) against position (0 to 30 Mb) on Chr22, for CEPH W.Eur and Estonian samples] Dawson et al., Nature 2002

Characterizing Patterns of Linkage Disequilibrium
• Average LD decay vs physical distance
• Mean trends along chromosomes
• Haplotype blocks
[Figure: D' (0.00 to 1.00) against position (0 to 30 Mb)]

Linkage Disequilibrium Maps & Allelic Association
[Figure: markers 1, 2, 3, …, n in LD with D]
Primary aim of LD maps: use relationships amongst background markers (M1, M2, M3, …, Mn) to learn something about D for association studies
Something =
* Efficient association study design by reduced genotyping
* Predict approximate location of (fine-map) disease loci
* Assess complexity of local regions
* Attempt to quantify/predict underlying (unobserved) patterns
···
Deliverables: sets of haplotype tagging SNPs

Building Haplotype Maps for Gene-finding
1. Human Genome Project (Sept 01, Feb 02, April 04)
   - Good for consensus, not good for individual differences
2. Identify genetic variants (April 1999 to Dec 01)
   - Anonymous with respect to traits
3. Assay genetic variants (Oct 2002 - present; Oct 04)
   - Verify polymorphisms, catalogue correlations amongst sites
   - Anonymous with respect to traits

Haplotype Tagging for Efficient Genotyping
Cardon & Abecasis, TIG 2003
• Some genetic variants within haplotype blocks give redundant information
• A subset of variants, 'htSNPs', can be used to 'tag' the conserved haplotypes with little loss of information (Johnson et al., Nat Genet, 2001)
• … Initial detection of htSNPs should facilitate future genetic association studies

HapMap Strategy
• Samples
  - Four populations, small samples
• Genotyping
  - 5 kb initial density across the genome (600K markers)
  - Subsequent focus on low LD regions
  - Recent NIH RFA for deeper coverage

HapMap is validating millions of SNPs. Are they the right SNPs?
• Distribution of allele frequencies in public markers is biased toward common alleles
[Figure: population frequency (0 to 0.6) by minor allele frequency bin (1-10%, 11-20%, 21-30%, 31-40%, 41-50%), expected frequency in population vs frequency of public markers] Phillips et al., Nat Genet 2003
• Updated with phase 2: more similar to expectation

Summary of Role of Linkage Disequilibrium on Association Studies
• Marker characterization is becoming extensive and genotyping throughput is high
• Tagging studies will yield panels for immediate use
  - Need to be clear about assumptions/aims of each panel
  - Density of the eventual HapMap will probably cover much of the genome in high LD, but not all
Challenges
• Just having more markers doesn't mean that the success rate will improve
• Expectations of association success via LD are too high
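
The tagging idea above (a subset of SNPs standing in for the rest) can be sketched with a simple greedy rule. This is not the haplotype-based htSNP method of Johnson et al.; it is a swapped-in pairwise r² approach, shown only as an illustration. The function name `greedy_tag_snps`, the 0.8 threshold, and the toy r² matrix are all assumptions.

```python
def greedy_tag_snps(r2, threshold=0.8):
    """Greedily choose tag SNPs from a symmetric pairwise r^2 matrix.

    A SNP counts as 'tagged' once it is itself a tag or has
    r^2 >= threshold with a chosen tag. The threshold is an assumed
    value, not one taken from the slides.
    """
    n = len(r2)
    untagged = set(range(n))
    tags = []

    while untagged:
        def covered_by(i):
            # Untagged SNPs that SNP i would capture if chosen as a tag
            return {j for j in untagged if j == i or r2[i][j] >= threshold}

        # Pick the SNP capturing the most untagged SNPs (ties: lowest index)
        best = max(range(n), key=lambda i: (len(covered_by(i)), -i))
        tags.append(best)
        untagged -= covered_by(best)

    return tags


# Toy example: SNPs 0-2 form one high-LD block, SNPs 3-4 form another
toy_r2 = [
    [1.00, 0.90, 0.90, 0.10, 0.10],
    [0.90, 1.00, 0.90, 0.10, 0.10],
    [0.90, 0.90, 1.00, 0.10, 0.10],
    [0.10, 0.10, 0.10, 1.00, 0.85],
    [0.10, 0.10, 0.10, 0.85, 1.00],
]
tags = greedy_tag_snps(toy_r2)  # one tag per block: [0, 3]
```

In this toy case two genotyped SNPs capture all five. Raising the threshold keeps more information but needs more tags, which is the trade-off behind being clear about the assumptions and aims of each panel.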
Two types of association studies
• Case-control
• Family-based

Allelic Association
[Figure: marker genotypes of controls and cases]
Allele 6 is 'associated' with disease

Main Blame: Primary Concern with Case-Control Analyses
Population stratification
Analysis of mixed samples having different allele frequencies is a primary concern in human genetics, as it leads to false evidence for allelic association.

Population Stratification
• Leads to spurious association
• Requirements:
  - Group differences in allele frequencies, AND
  - Group differences in outcome
• In epidemiology, this is a classic matching problem, with genetics as a confounding variable

Population Stratification

  Sample 'A'      M      m     Freq.
  Affected        50     50    .10
  Unaffected     450    450    .90
  Freq.          .50    .50
  χ² (1 df) is n.s.

  Sample 'B'      M      m     Freq.
  Affected         1      9    .01
  Unaffected      99    891    .99
  Freq.          .10    .90
  χ² (1 df) is n.s.

  Sample 'A' + Sample 'B'
                  M      m     Freq.
  Affected        51     59    .055
  Unaffected     549   1341    .945
  Freq.          .30    .70
  χ² (1 df) = 14.84, p < 0.001: spurious association

Family-based association methods
TDT: Transmission Disequilibrium Test
[Pedigree figure: parents 1/2 and 3/3, child 2/3]
• 50/50 chance that the 2 is transmitted
• Looking for overtransmission of a particular allele across affected individuals (undertransmission to unaffecteds)

TDT Advantages/Disadvantages
Advantages
• Robust to stratification
• Genotyping error detectable via Mendelian inconsistencies
• Estimates of haplotypes possible
Disadvantages
• Detection/elimination of genotyping errors causes bias (Gordon et al., 2001)
• Uses only heterozygous parents
• Inefficient for genotyping
  - 3 individuals yield 2 founders: 1/3 of the information is not used
• Can be difficult/impossible to collect
  - Late-onset disorders, psychiatric conditions, pharmacogenetic applications

Association studies < 2000: TDT
• TDT virtually ubiquitous over the past decade
  - Grant, manuscript referees & editors mandated the design
• View of case/control association studies greatly diminished due to the perceived role of stratification

Association Studies 2000+: Return to population
• Case/controls, using extra genotyping
• + families, when available

Detecting and Controlling for Population Stratification with Genetic Markers
Idea
• Take advantage of the availability of large N genetic markers
• Use a case/control design
• Genotype genetic markers across the genome (number depends on different factors)
• Look for any evidence of background population substructure and account for it

Two types of association studies
• Case-control
  - Adv: more powerful
  - Disadv: population stratification
  - Disadv: limited by case/control definition
• Family-based
  - Adv: population stratification not a problem
  - Disadv: less powerful, hard to collect parents for some phenotypes

Association Analyses vs Linkage
• Advantage
  - More powerful
• Disadvantage
  - Not systematic (in the past)
• Now! Genome-wide association scans

Current Association Study Challenges
1) Genome-wide screen or candidate gene
Genome-wide screen
• Hypothesis-free
• High-cost: large genotyping requirements
• Multiple-testing issues
  - Possibly many false positives, fewer misses
Candidate gene
• Hypothesis-driven
• Low-cost: small genotyping requirements
• Multiple-testing less important
  - Possibly many misses, fewer false positives

Current Association Study Challenges
2) What constitutes a replication?
GOLD standard for association studies: replicating association results in different laboratories is often seen as the most compelling piece of evidence for a 'true' finding
But… in any sample, we measure
• Multiple traits
• Multiple genes
• Multiple markers in genes
and we analyse all this using multiple statistical tests

What is a true replication?
  Replication outcome                                         Explanation
  Association to same trait, but different gene               Genetic heterogeneity
  Association to same trait, same gene, different SNPs        Allelic heterogeneity
    (or haplotypes)
  Association to same trait, same gene, same SNP, but in      Allelic heterogeneity/pop differences
    opposite direction (protective vs. disease)
  Association to different, but correlated phenotype(s)       Phenotypic heterogeneity
  No association at all                                       Sample size too small

Current Association Study Challenges
3) Do we have the best set of genetic markers?
There exist 6+ million putative SNPs in the public domain. Are they the right markers?
• Allele frequency distribution is biased toward common alleles
[Figure: population frequency (0 to 0.6) by minor allele frequency bin (1-10%, 11-20%, 21-30%, 31-40%, 41-50%), expected frequency in population vs frequency of public markers]

Current Association Study Challenges
3) Do we have the best set of genetic markers?
Tabor et al, Nat Rev Genet 2003
Greatest power comes from markers that match allele freq with trait loci

  Axes: Disease Allele Frequency, Marker Allele Frequency

           0.1     0.3     0.5     0.7     0.9
  0.1      248     626    1306    2893   10830
  0.3     1018     238     466     996    3651
  0.5     2874     702     267     556    2002
  0.7     9169    2299     925     337    1187
  0.9    73783   18908    7933    3229     616

  $\lambda_s = 1.5$, $\alpha = 5 \times 10^{-8}$, Spielman TDT (Müller-Myhsok and Abel, 1997)

Current Association Study Challenges
4) Integrating the sampling, LD and genetic effects
Questions that don't stand alone:
• How much LD is needed to detect complex disease genes?
• What effect size is big enough to be detected?
• How common (rare) must a disease variant(s) be to be identifiable?
• What marker allele frequency threshold should be used to find complex disease genes?

Complexity of System
• In any indirect association study, we measure marker alleles that are correlated with trait variants… We do not measure the trait variants themselves
• But, for study design and power, we concern ourselves with frequencies and effect sizes at the trait locus….
  This can only lead to underpowered studies and inflated expectations
• We should concern ourselves with the apparent effect size at the marker, which results from
  1) difference in frequency of marker and trait alleles
  2) LD between the marker and trait loci
  3) effect size of trait allele

Practical Implications of Allele Frequencies
• Strongest argument for using common markers is not CD-CV (common disease, common variant). It is practical: for small effects, common markers are the only ones for which sufficient sample sizes can be collected
• There are situations where indirect association analysis will not work
  - Discrepant marker/disease freqs, low LD, heterogeneity, …
  - Linkage approach may be the only genetics approach in these cases
• At present, no way to know when association will/will not work
  - Balance with linkage

Current Association Study Challenges
5) How to analyse the data
• Allele-based test?
  - 2 alleles → 1 df
  - $E(Y) = a + bX$, with X = 0/1 for presence/absence
• Genotype-based test?
  - 3 genotypes → 2 df
  - $E(Y) = a + b_1 A + b_2 D$, with A = 0/1 additive (hom); D = 0/1 dom (het)
• Haplotype-based test?
  - For M markers, $2^M$ possible haplotypes → $2^M - 1$ df
  - $E(Y) = a + bH$, with H coded for haplotype effects
• Multilocus test?
  - Epistasis, G x E interactions, many possibilities

Current Association Study Challenges
6) Multiple Testing
• Candidate genes: a few tests (probably correlated)
• Linkage regions: 100's to 1000's of tests (some correlated)
• Whole genome association: 100,000s to 1,000,000s of tests (many correlated)
• What to do?
  - Bonferroni (conservative)
  - False discovery rate?
  - Permutations?
  - … Area of active research

Despite challenges: upcoming association studies hold some promise
• Availability of millions of genetic markers
• Genotyping costs decreasing rapidly
  - Cost per SNP: 2001 ($0.25) → 2003 ($0.10) → 2004 ($0.01)
• Background LD patterns being characterized
  - International HapMap and other projects

Genome Wide Association Studies (GWAS) Underway:
• Genetic Association Information Network (GAIN)
  - Psoriasis, ADHD, Schizophrenia, Bipolar Disorder, Depression, Type 1 Diabetes
• Wellcome Trust Case Control Consortium
  - Bipolar Disorder, Coronary Artery Disease, Crohn's Disease, Rheumatoid Arthritis, Type 1 Diabetes, Type 2 Diabetes
• Genes, Environment, & Health Initiative (Gene/Environment Association Studies: GENEVA)
  - Addiction, Diabetes, Heart Disease, Oral Clefts, Maternal Metabolism and Birth Weight, Lung Cancer, Pre-Term Birth, Dental Caries
• Genes, Environment, & Development Initiative (GEDI)