Download data analysis - DCU School of Computing

DATA ANALYSIS
Module Code: CA660
Lecture Block 2
PROBABILITY – Inferential Basis
• COUNTING RULES – Permutations, Combinations
• BASICS – Sample Space, Event, Probabilistic Expt.
• DEFINITION / Probability Types
• AXIOMS (Basic Rules)

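$$P\{E\} \ge 0 \quad \text{for any event } E$$
$$\sum_i P\{E_i\} = 1 = P\{S\} \quad \text{for the certain event } S$$
OR
$$P\{E_i \cup E_j\} = P\{E_i\} + P\{E_j\} \quad \text{iff } \{E_i \cap E_j\} = \emptyset$$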
• ADDITION RULE – general and special
from Union (of events or sets of points in space)
Basics contd.
• CONDITIONAL PROBABILITY
(Reduction in sample space)
• MULTIPLICATION RULE – general and special from
Intersection (of events or sets of points in space)
$$P\{B \cap A\} = P\{A \mid B\}\,P\{B\}$$
• Chain Rule for multiple intersections
• Probability distributions, from sets of possible
outcomes.
• Examples - come up with one of each
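As one possible worked example of the rules above, the sketch below checks the general addition rule and the general multiplication rule on a single roll of a fair die. The die, the events A and B, and the helper names `prob` and `cond_prob` are illustrative assumptions, not part of the lecture.

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die (an illustrative choice).
S = frozenset(range(1, 7))

def prob(event):
    """Classical probability: favourable outcomes / total outcomes in S."""
    return Fraction(len(event & S), len(S))

def cond_prob(a, b):
    """P{A | B}: probability of A within the reduced sample space B."""
    return Fraction(len(a & b), len(b))

A = frozenset({2, 4, 6})   # event: the roll is even
B = frozenset({4, 5, 6})   # event: the roll is greater than 3

# General addition rule: P{A or B} = P{A} + P{B} - P{A and B}
assert prob(A | B) == prob(A) + prob(B) - prob(A & B)

# General multiplication rule: P{B and A} = P{A | B} P{B}
assert prob(A & B) == cond_prob(A, B) * prob(B)

print(prob(A | B), cond_prob(A, B))
```

Exact fractions are used so that the two identities can be checked with `==` rather than with a floating-point tolerance.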
Conditional Probability: BAYES
A move towards “Likelihood” Statistics
More formally Theorem of Total Probability (Rule of Elimination)
If the events $B_1, B_2, \dots, B_k$ constitute a partition of the sample space S, such that $P\{B_i\} \ne 0$ for i = 1, 2, …, k, then for any event A of S
$$P\{A\} = \sum_{i=1}^{k} P\{B_i \cap A\} = \sum_{i=1}^{k} P\{B_i\}\,P\{A \mid B_i\}$$
So, if the events $B_i$ partition the space as above, then for any event A in S, where $P\{A\} \ne 0$,
$$P\{B_r \mid A\} = \frac{P\{B_r \cap A\}}{\sum_{i=1}^{k} P\{B_i \cap A\}} = \frac{P\{B_r\}\,P\{A \mid B_r\}}{\sum_{i=1}^{k} P\{B_i\}\,P\{A \mid B_i\}}$$
BAYES RULE
Example - Bayes
400 people in a population of 2 million carry a
particular virus, so P{Virus} = P{V1} = 400 / 2,000,000 = 0.0002
Tests for the presence/absence of the virus give the results:
P{T | V1} = 0.99 and P{T | V2} = 0.01
P{N | V2} = 0.99 and P{N | V1} = 0.01
where V1 is the event virus present, V2 the event virus absent,
T the event of a positive test and N the event of a negative test.
(All a priori probabilities.)
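So
$$P\{V_1 \mid T\} = \frac{P\{V_1\}\,P\{T \mid V_1\}}{\sum_{i=1}^{k} P\{V_i\}\,P\{T \mid V_i\}} \approx 0.019 \quad \text{(a posteriori)}$$
where the events $V_i$ partition the sample space and the denominator is the total probability of T.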
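The arithmetic of this example can be checked with a short sketch of Bayes' rule over a finite partition. The function name `bayes_posterior` and the list layout are assumptions made for illustration. The inputs are the values quoted in the example, P{V1} = 0.0002, P{T | V1} = 0.99 and P{T | V2} = 0.01.

```python
def bayes_posterior(priors, likelihoods, r):
    """Posterior P{B_r | A} for a partition B_1..B_k.

    priors[i]      = P{B_i}
    likelihoods[i] = P{A | B_i}
    """
    # Theorem of total probability: P{A} = sum_i P{B_i} P{A | B_i}
    total = sum(p * l for p, l in zip(priors, likelihoods))
    return priors[r] * likelihoods[r] / total

priors = [0.0002, 1 - 0.0002]   # P{V1}, P{V2}
likelihoods = [0.99, 0.01]      # P{T | V1}, P{T | V2}

posterior = bayes_posterior(priors, likelihoods, 0)
print(round(posterior, 3))      # about 0.019, as on the slide
```

Even with a test that is correct 99% of the time, the low prior means that a positive result leaves only about a 2% chance that the virus is present.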
BAYES Bioinformatics Example:
Accuracy of Assembled DNA sequences
• Want an estimate of the probability that the ith letter of an
assembled sequence is A, C, G, T or – (unknown)
• Assume each fragment assembly is correct, all portions are
equally reliable, and sequencing errors are independent and uniform
throughout the sequence. Assume letters in the sequence are IID.
• Let F* = {f1, f2, …, fN} be the set of fragments
• Positions in the assembled sequence correspond to columns i of a matrix, while fragments
correspond to rows j
• Matrix elements $x_{ij}$ are members of B* = {A, C, G, T, –, 0}
• True sequence (in n columns) is s = {s1, s2, …, sn},
where each $s_i$ is a member of A* = {A, C, G, T, –}
BAYES contd.
Track fragment orientation:
$$t_j = \begin{cases} 0 & \text{i.e. fragment } j \text{ as is} \\ 1 & \text{fragment } j \text{ is reverse complemented} \end{cases}$$
Thus need estimation of
$$P_i(M) = P\{s_i = M \mid x_{ij},\ j = 1, \dots, N\}$$
= probability that the ith letter is from molecule "M", given the matrix elements (of fragments).
Assuming knowledge of sequencing error rates:
$$P\{b \mid M\} = P\{x_{ij} = b \mid s_i = M\}, \quad M \in A^*,\ b \in B^*$$
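so that Bayes gives
$$P_i(M) = \frac{P(M) \prod_{j=1}^{N} \left[ (1 - t_j)\,P(x_{ij} \mid M) + t_j\,P(\bar{x}_{ij} \mid \bar{M}) \right]}{\sum_{b \in A^*} P(b) \prod_{j=1}^{N} \left[ (1 - t_j)\,P(x_{ij} \mid b) + t_j\,P(\bar{x}_{ij} \mid \bar{b}) \right]}$$
Here the bar denotes the complementary letter, used when fragment j is reverse complemented.
Numerator: context = M. Denominator: total probability, summed over the options b for M.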
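A minimal sketch of this posterior for a single column i follows. Several pieces are assumptions made only for illustration and are not stated in the lecture: the uniform prior P(M), the error model `p_obs` (a letter is read correctly with probability 1 - err, otherwise as one of the 4 other symbols with equal probability), the treatment of the symbol 0 as contributing a factor of 1, and the reading of the $t_j$ term as applying to complemented letters. All function and variable names are hypothetical.

```python
from math import prod

A_STAR = ["A", "C", "G", "T", "-"]
# Assumed complement map; the gap symbol is taken to be its own complement.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "-": "-"}

def p_obs(b, m, err=0.05):
    """Hypothetical error model P{b | M}: correct with probability 1 - err,
    otherwise one of the 4 other symbols, equally likely."""
    return 1.0 - err if b == m else err / 4.0

def column_posterior(column, t, prior=None, err=0.05):
    """Posterior P_i(M) over A* for one column of the fragment matrix.

    column: observed symbols x_ij, one per fragment j ('0' is skipped).
    t:      orientations t_j (0 = as is, 1 = reverse complemented).
    """
    if prior is None:
        prior = {m: 1.0 / len(A_STAR) for m in A_STAR}  # assumed uniform prior

    def factor(x, tj, m):
        if x == "0":
            return 1.0  # assumption: '0' carries no information about s_i
        direct = p_obs(x, m, err)
        flipped = p_obs(COMPLEMENT[x], COMPLEMENT[m], err)
        return (1 - tj) * direct + tj * flipped

    weights = {
        m: prior[m] * prod(factor(x, tj, m) for x, tj in zip(column, t))
        for m in A_STAR
    }
    total = sum(weights.values())  # denominator: total probability
    return {m: w / total for m, w in weights.items()}

post = column_posterior(["A", "A", "C", "0", "A"], [0, 1, 0, 0, 1])
best = max(post, key=post.get)
print(best, round(post[best], 4))
```

Under this symmetric error model the orientation $t_j$ does not change the result; it would matter only with an error model that differs between strands.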
Example: probability in other Bioinformatics
problems, e.g. POPULATION GENETICS
• Counts – Genotypic “frequencies”
GENE with n alleles, so n(n+1)/2 possible genotypes
• Population Equilibrium HARDY-WEINBERG
Genes and “genotypic frequencies” constant from
generation to generation (so simple relationships for
genotypic and allelic frequencies)
e.g. 2-allele model: $p_A$, $p_a$ are the allelic frequencies of A, a respectively, so
genotypic 'frequencies' are $p_{AA}$, $p_{Aa}$, $p_{aa}$, with
$$p_{AA} = p_A p_A = p_A^2$$
$$p_{Aa} = p_A p_a + p_a p_A = 2 p_A p_a$$
$$p_{aa} = p_a^2$$
$$(p_A + p_a)^2 = p_A^2 + 2 p_a p_A + p_a^2$$
One generation of Random mating. H-W at single locus
POPULATION PICTURE at one locus
under H-W equilibrium
NB: the heterozygote 'frequency' is at its maximum when both allelic
frequencies = 0.5 (see Fig.)
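Also, if A is a rare allele, the proportion of carriers of A that are heterozygous is
$$\frac{p_{Aa}}{p_{Aa} + p_{AA}} = \frac{2 p_A p_a}{2 p_A p_a + p_A^2} = \frac{2 p_a}{2 p_a + p_A} = \frac{2 p_a}{1 + p_a}$$
using $p_A = 1 - p_a$ in the last step.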
So, if an allele is rare, there is a high probability that it is carried in the
heterozygous state: e.g. about a 99% chance for $p_A$ = 0.01, say
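The two-allele relationships can be checked numerically with the sketch below. The function names are assumptions for illustration. It computes the H-W genotype frequencies from $p_A$ and then the proportion of carriers of A that are heterozygous, Aa / (Aa + AA).

```python
def hw_two_allele(p_A):
    """Hardy-Weinberg genotype frequencies for alleles A and a."""
    p_a = 1.0 - p_A
    return {"AA": p_A ** 2, "Aa": 2 * p_A * p_a, "aa": p_a ** 2}

def prob_het_given_carrier(p_A):
    """Proportion of individuals carrying A that are heterozygous."""
    f = hw_two_allele(p_A)
    return f["Aa"] / (f["Aa"] + f["AA"])

# Rare allele: p_A = 0.01 gives roughly a 99% chance of the heterozygous state.
rare = prob_het_given_carrier(0.01)
print(round(rare, 4))

# Heterozygote frequency 2 p_A p_a over a grid of allele frequencies.
grid = [k / 100 for k in range(1, 100)]
p_max = max(grid, key=lambda p: hw_two_allele(p)["Aa"])
print(p_max)
```

The grid search is only a numerical illustration of the NB above; the maximum of $2 p_A (1 - p_A)$ at 0.5 also follows directly by differentiation.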
Extended: Multiple Alleles, Single Locus
• $p_1, p_2, \dots, p_i, \dots, p_n$ = "frequencies" of alleles $A_1, A_2, \dots, A_i, \dots, A_n$. Possible genotypes = $A_{11}, A_{12}, \dots, A_{ij}, \dots, A_{nn}$
• Under H-W equilibrium, expected genotype frequencies:
$$(p_1 + p_2 + \dots + p_i + \dots + p_n)(p_1 + p_2 + \dots + p_j + \dots + p_n) = p_1^2 + 2 p_1 p_2 + \dots + 2 p_i p_j + \dots + 2 p_{n-1} p_n + p_n^2$$
e.g. for 4 alleles, have 10 genotypes.
• Proportion of heterozygosity in the population is clearly
$P_H = 1 - \sum_i p_i^2$, used in screening of genetic markers
Example revisited: Expected genotypic
frequencies for a 4-allele system under H-W equilibrium;
proportion of heterozygosity in F2 progeny

Genotype  Expected    p1=0.25  p1=0.3  p1=0.4  p1=0.4  p1=0.7
          frequency   p2=0.25  p2=0.3  p2=0.4  p2=0.3  p2=0.1
                      p3=0.25  p3=0.2  p3=0.1  p3=0.2  p3=0.1
                      p4=0.25  p4=0.2  p4=0.1  p4=0.1  p4=0.1
A1A1      p1p1        0.0625   0.09    0.16    0.16    0.49
A1A2      2p1p2       0.125    0.18    0.32    0.24    0.14
A1A3      2p1p3       0.125    0.12    0.08    0.16    0.14
A1A4      2p1p4       0.125    0.12    0.08    0.08    0.14
A2A2      p2p2        0.0625   0.09    0.16    0.09    0.01
A2A3      2p2p3       0.125    0.12    0.08    0.12    0.02
A2A4      2p2p4       0.125    0.12    0.08    0.06    0.02
A3A3      p3p3        0.0625   0.04    0.01    0.04    0.01
A3A4      2p3p4       0.125    0.08    0.02    0.04    0.02
A4A4      p4p4        0.0625   0.04    0.01    0.01    0.01
pH                    0.75     0.74    0.66    0.70    0.48
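The table entries can be reproduced with a short sketch. The function names are assumptions for illustration. It enumerates the n(n+1)/2 unordered genotypes and applies $p_i^2$ to homozygotes and $2 p_i p_j$ to heterozygotes.

```python
from itertools import combinations_with_replacement

def hw_genotype_freqs(p):
    """Expected H-W genotype frequencies for allele frequencies p[0..n-1]."""
    freqs = {}
    for i, j in combinations_with_replacement(range(len(p)), 2):
        name = f"A{i + 1}A{j + 1}"
        # Homozygote: p_i^2; heterozygote: 2 p_i p_j
        freqs[name] = p[i] ** 2 if i == j else 2 * p[i] * p[j]
    return freqs

def heterozygosity(p):
    """Proportion of heterozygosity: 1 minus the sum of p_i squared."""
    return 1.0 - sum(pi ** 2 for pi in p)

# Fourth column of the table: p = (0.4, 0.3, 0.2, 0.1)
p = [0.4, 0.3, 0.2, 0.1]
freqs = hw_genotype_freqs(p)
for genotype, f in freqs.items():
    print(genotype, round(f, 4))
print("pH", round(heterozygosity(p), 2))
```

The heterozygosity also equals the sum of the heterozygote entries, which gives a quick consistency check on any column of the table.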
GENERALISING: PROBABILITY RULES
and PROPERTIES – Other Examples in brief
• For $L$ loci, No. of genotypes, where $n_i$ = No. of alleles for locus i:
$$\prod_{i=1}^{L} \tfrac{1}{2}\, n_i (n_i + 1)$$
• Changes in gene frequency–from migration, mutation, selection
Suppose the native population has allelic freq. $p_{n0}$. A proportion $m_i$
(relative to the native population) migrates from the ith of k populations
to the native population every generation, the immigrants having allelic
frequency $p_i$.
So the allelic frequency in the mixed population is:
$$p_{n1} = \left(1 - \sum_{i=1}^{k} m_i\right) p_{n0} + \sum_{i=1}^{k} m_i p_i = p_{n0} + \sum_{i=1}^{k} m_i (p_i - p_{n0})$$
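Both formulas in this block can be sketched in a few lines. The function names and the numerical inputs below are hypothetical, chosen only to illustrate the calculation. The first function multiplies the per-locus genotype counts n(n+1)/2; the second applies one generation of migration.

```python
from math import prod

def n_genotypes(alleles_per_locus):
    """Number of multi-locus genotypes: product over loci of n_i (n_i + 1) / 2."""
    return prod(n * (n + 1) // 2 for n in alleles_per_locus)

def allele_freq_after_migration(p_native, m, p_immigrant):
    """Allelic frequency in the mixed population after one generation.

    p_native:    allelic frequency in the native population
    m[i]:        proportion migrating in from population i
    p_immigrant: allelic frequency among immigrants from population i
    """
    return p_native + sum(mi * (pi - p_native) for mi, pi in zip(m, p_immigrant))

print(n_genotypes([4]))       # one locus with 4 alleles
print(n_genotypes([2, 4]))    # two loci with 2 and 4 alleles

# Hypothetical values: two source populations contributing 10% and 5%.
p_next = allele_freq_after_migration(0.5, [0.10, 0.05], [0.8, 0.2])
print(round(p_next, 4))
```

The second form of the migration equation is used because it shows directly that each source population pulls the native frequency towards its own, in proportion to its migration rate.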