Download data analysis - DCU School of Computing

DATA ANALYSIS
Module Code: CA660
Lecture Block 2

PROBABILITY – Inferential Basis
• COUNTING RULES – Permutations, Combinations
• BASICS – Sample Space, Event, Probabilistic Experiment
• DEFINITION / Probability Types
• AXIOMS (Basic Rules)
  $P\{E\} \ge 0$ for any event $E$
  $\sum_i P\{E_i\} = 1 = P\{S\}$ for the certain event $S$ (the $E_i$ exhaust $S$ without overlap)
  OR: $P\{E_i \cup E_j\} = P\{E_i\} + P\{E_j\}$ when $E_i \cap E_j = \emptyset$
• ADDITION RULE – general and special, from Union (of events or sets of points in space)

Basics contd.
• CONDITIONAL PROBABILITY (reduction in sample space)
• MULTIPLICATION RULE – general and special, from Intersection (of events or sets of points in space)
  $P\{B \cap A\} = P\{A \mid B\}\,P\{B\}$
• Chain Rule for multiple intersections
• Probability distributions, from sets of possible outcomes
• Examples – come up with one of each

Conditional Probability: BAYES
A move towards “Likelihood” Statistics

More formally: Theorem of Total Probability (Rule of Elimination)
If the events $B_1, B_2, \dots, B_k$ constitute a partition of the sample space $S$, such that $P\{B_i\} \ne 0$ for $i = 1, 2, \dots, k$, then for any event $A$ of $S$
$$P\{A\} = \sum_{i=1}^{k} P\{B_i \cap A\} = \sum_{i=1}^{k} P\{B_i\}\,P\{A \mid B_i\}$$

BAYES RULE
So, if the events $B_i$ partition the space as above, then for any event $A$ in $S$ with $P\{A\} \ne 0$
$$P\{B_r \mid A\} = \frac{P\{B_r \cap A\}}{\sum_{i=1}^{k} P\{B_i \cap A\}} = \frac{P\{B_r\}\,P\{A \mid B_r\}}{\sum_{i=1}^{k} P\{B_i\}\,P\{A \mid B_i\}}$$

Example – Bayes
400 people in a population of 2 million carry a particular virus, so P{Virus} = $P\{V_1\}$ = 0.0002.
Tests to show presence/absence of the virus give results:
  $P\{T \mid V_1\} = 0.99$ and $P\{T \mid V_2\} = 0.01$
  $P\{N \mid V_2\} = 0.99$ and $P\{N \mid V_1\} = 0.01$
where $V_2$ is the event "virus absent", $T$ the event "positive test" and $N$ the event "negative test".
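The virus-test figures above feed directly into Bayes' rule. The following is a minimal Python sketch of that calculation, assuming only the prior $P\{V_1\} = 0.0002$ and the two positive-test probabilities quoted in the example; the function name `posterior` is illustrative and does not come from the lecture.

```python
def posterior(priors, likelihoods):
    """Bayes' rule over a partition B_1..B_k.

    priors[i]      = P{B_i}
    likelihoods[i] = P{A | B_i}
    Returns the list of P{B_i | A}.
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P{B_i and A}
    total = sum(joint)  # theorem of total probability: P{A}
    if total == 0:
        raise ValueError("P{A} is zero, so the posterior is undefined")
    return [j / total for j in joint]


# Values from the virus example: V1 = virus present, V2 = virus absent.
p_v1 = 0.0002
priors = [p_v1, 1 - p_v1]
p_positive_given_v = [0.99, 0.01]  # P{T | V1}, P{T | V2}

post = posterior(priors, p_positive_given_v)
print(f"P(V1 | T) = {post[0]:.3f}")
```

Even with a test that is right 99% of the time, a positive result leaves only about a 2% chance of carrying the virus, because the prior is so small.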
The probabilities given for the virus test are all a priori. Bayes' rule then gives the a posteriori probability
$$P\{V_1 \mid T\} = \frac{P\{V_1\}\,P\{T \mid V_1\}}{\sum_{i=1}^{k} P\{V_i\}\,P\{T \mid V_i\}} \approx 0.019$$
where the events $V_i$ partition the sample space and the denominator is the total probability of a positive test.

BAYES Bioinformatics Example: Accuracy of Assembled DNA Sequences
• Want an estimate of the probability that the ith letter of an assembled sequence is A, C, G, T or – (unknown)
• Assume each fragment assembly is correct, all portions are equally reliable, and sequencing errors are independent and uniform throughout the sequence. Assume the letters in the sequence are IID.
• Let $F^* = \{f_1, f_2, \dots, f_N\}$ be the set of fragments
• Positions in the aligned, assembled sequence correspond to columns $i$ of a matrix, while fragments correspond to rows $j$
• Matrix elements $x_{ij}$ are members of $B^* = \{A, C, G, T, -, 0\}$
• True sequence (in $n$ columns) is $s = \{s_1, s_2, \dots, s_n\}$, where each $s_i$ is contained in $A^* = \{A, C, G, T, -\}$

BAYES contd.
Track fragment orientation:
$$t_j = \begin{cases} 0 & \text{fragment } j \text{ is used as is} \\ 1 & \text{fragment } j \text{ is reverse complemented} \end{cases}$$
Thus need an estimate of
$$P_i(M) = P\{s_i = M \mid x_{ij},\ j = 1, \dots, N\}$$
the probability that the ith letter is from molecule “M”, given the matrix elements (of the fragments).
Assuming knowledge of the sequencing error rates
$$P\{b \mid M\} = P\{x_{ij} = b \mid s_i = M\}, \quad M \in A^*,\ b \in B^*$$
Bayes gives
$$P_i(M) = \frac{P(M) \prod_{j=1}^{N} \left[(1 - t_j)\,P(x_{ij} \mid M) + t_j\,P(\bar{x}_{ij} \mid \bar{M})\right]}{\sum_{b \in A^*} P(b) \prod_{j=1}^{N} \left[(1 - t_j)\,P(x_{ij} \mid b) + t_j\,P(\bar{x}_{ij} \mid \bar{b})\right]}$$
The numerator takes the context $s_i = M$. The denominator is the total probability, summed over the options $b \in A^*$. An overbar denotes the complementary base, which applies when fragment $j$ is reverse complemented.

Example: probability in other bioinformatics problems, e.g. POPULATION GENETICS
• Counts – genotypic “frequencies”: a GENE with $n$ alleles has $n(n+1)/2$ possible genotypes
• Population Equilibrium – HARDY-WEINBERG: genes and “genotypic frequencies” are constant from generation to generation (so simple relationships hold between genotypic and allelic frequencies)
e.g. 2-allele model: $p_A$, $p_a$ are the allelic frequencies of A, a respectively, so the genotypic ‘frequencies’ $p_{AA}$, $p_{Aa}$, $p_{aa}$ are
  $p_{AA} = p_A p_A = p_A^2$
  $p_{Aa} = p_A p_a + p_a p_A = 2 p_A p_a$
  $p_{aa} = p_a^2$
These are the terms of $(p_A + p_a)^2 = p_A^2 + 2 p_a p_A + p_a^2$, reached after one generation of random mating.
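The two-allele Hardy-Weinberg relationships can be checked numerically. This is a minimal Python sketch; the function name `hardy_weinberg_two_alleles` and the allelic frequency 0.3 are illustrative assumptions, not values from the lecture.

```python
def hardy_weinberg_two_alleles(p_A):
    """Expected genotype frequencies for alleles A and a under H-W equilibrium."""
    if not 0.0 <= p_A <= 1.0:
        raise ValueError("allelic frequency must lie in [0, 1]")
    p_a = 1.0 - p_A  # the two allelic frequencies sum to 1
    return {
        "AA": p_A * p_A,      # p_A^2
        "Aa": 2 * p_A * p_a,  # p_A p_a + p_a p_A
        "aa": p_a * p_a,      # p_a^2
    }


# Hypothetical allelic frequency, chosen only for illustration.
freqs = hardy_weinberg_two_alleles(0.3)
for genotype, freq in freqs.items():
    print(f"{genotype}: {freq:.2f}")
print(f"sum: {sum(freqs.values()):.2f}")
```

The three frequencies sum to 1 because they are the terms of $(p_A + p_a)^2$ with $p_A + p_a = 1$.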
H-W at single locus
POPULATION PICTURE at one locus under H-W equilibrium
NB: the heterozygote ‘frequency’ is at its maximum when both allelic frequencies = 0.5 (see Fig.)
Also, if A is a rare allele, the proportion of A-carrying genotypes that are heterozygous is
$$\frac{p_{Aa}}{p_{Aa} + p_{AA}} = \frac{2 p_A p_a}{2 p_A p_a + p_A^2} = \frac{2 p_a}{1 + p_a}$$
So, if an allele is rare, the probability is high that it is carried in the heterozygous state: e.g. 99% chance for $p_A = 0.01$, say.

Extended: Multiple Alleles, Single Locus
• $p_1, p_2, \dots, p_i, \dots, p_n$ = “frequencies” of alleles $A_1, A_2, \dots, A_i, \dots, A_n$. Possible genotypes = $A_{11}, A_{12}, \dots, A_{ij}, \dots, A_{nn}$
• Under H-W equilibrium, the expected genotype frequencies are the terms of
$$(p_1 + p_2 + \dots + p_i + \dots + p_n)(p_1 + p_2 + \dots + p_j + \dots + p_n) = p_1^2 + 2 p_1 p_2 + \dots + 2 p_i p_j + \dots + 2 p_{n-1} p_n + p_n^2$$
e.g. for 4 alleles, have 10 genotypes.
• Proportion of heterozygosity in the population is clearly $P_H = 1 - \sum_i p_i^2$, used in screening of genetic markers

Example revisited: expected genotypic frequencies for a 4-allele system under H-W equilibrium, and proportion of heterozygosity in F2 progeny

Genotype  Expected    p1=0.25  p1=0.3  p1=0.4  p1=0.4  p1=0.7
          frequency   p2=0.25  p2=0.3  p2=0.4  p2=0.3  p2=0.1
                      p3=0.25  p3=0.2  p3=0.1  p3=0.2  p3=0.1
                      p4=0.25  p4=0.2  p4=0.1  p4=0.1  p4=0.1
A1A1      p1p1        0.0625   0.09    0.16    0.16    0.49
A1A2      2p1p2       0.125    0.18    0.32    0.24    0.14
A1A3      2p1p3       0.125    0.12    0.08    0.16    0.14
A1A4      2p1p4       0.125    0.12    0.08    0.08    0.14
A2A2      p2p2        0.0625   0.09    0.16    0.09    0.01
A2A3      2p2p3       0.125    0.12    0.08    0.12    0.02
A2A4      2p2p4       0.125    0.12    0.08    0.06    0.02
A3A3      p3p3        0.0625   0.04    0.01    0.04    0.01
A3A4      2p3p4       0.125    0.08    0.02    0.04    0.02
A4A4      p4p4        0.0625   0.04    0.01    0.01    0.01
P_H                   0.75     0.74    0.66    0.70    0.48

GENERALISING: PROBABILITY RULES and PROPERTIES – Other Examples in brief
• For $\ell$ loci, no. of genotypes, where $n_i$ = no. of alleles for locus $i$:
$$\prod_{i=1}^{\ell} \tfrac{1}{2}\, n_i (n_i + 1)$$
• Changes in gene frequency – from migration, mutation, selection
Suppose the native population has allelic frequency $p_{n0}$. A proportion $m_i$ (relative to the native population) migrates from the ith of $k$ populations to the native population every generation, the immigrants having allelic frequency $p_i$.
So the allelic frequency in the mixed population is
$$p_{n1} = \left(1 - \sum_{i=1}^{k} m_i\right) p_{n0} + \sum_{i=1}^{k} m_i p_i = p_{n0} + \sum_{i=1}^{k} m_i \left(p_i - p_{n0}\right)$$
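The migration update can be sketched in a few lines of Python. The function names and the numbers in the example (a native frequency of 0.2 and two source populations) are hypothetical, chosen only to show that the two forms of the formula agree.

```python
def mixed_allele_frequency(p_native, migrants):
    """Allelic frequency after one generation of migration.

    p_native: allelic frequency p_n0 in the native population
    migrants: list of (m_i, p_i) pairs, one per source population,
              with m_i the migrating proportion and p_i its allelic frequency
    """
    total_m = sum(m for m, _ in migrants)
    if not 0.0 <= total_m <= 1.0:
        raise ValueError("migrating proportions must sum to a value in [0, 1]")
    # First form: (1 - sum m_i) * p_n0 + sum m_i * p_i
    return (1.0 - total_m) * p_native + sum(m * p for m, p in migrants)


def mixed_allele_frequency_delta(p_native, migrants):
    """Equivalent second form: p_n0 + sum m_i * (p_i - p_n0)."""
    return p_native + sum(m * (p - p_native) for m, p in migrants)


# Hypothetical values for illustration only.
p_n0 = 0.2
migrants = [(0.05, 0.6), (0.10, 0.4)]

p_n1 = mixed_allele_frequency(p_n0, migrants)
p_n1_delta = mixed_allele_frequency_delta(p_n0, migrants)
print(f"{p_n1:.2f} {p_n1_delta:.2f}")
```

The second form makes the direction of change easy to read: each source population pulls the native frequency towards its own $p_i$, in proportion to $m_i$.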
