APAM 1080 Extended Syllabus
Spring 2015
Professor Charles Lawrence
Office hours: Tuesday 3:30-4:30 and Wednesday 11 AM to noon
(also posted on the AM web site).
I. Introduction (week 1)
a. Assignment: Review probability and math statistics from AM165
b. The genome and DNA
i. Sequencing of the genome as the first killer application
1. A new research paradigm in biology
2. The hypothesis-driven paradigm
3. The genome as the 800-pound gorilla
ii. Fundamentals of DNA structure and function
1. The genome is not enough
a. Variables of greatest interest still unknown
b. Inference of unknown variables
2. The central dogma
3. Fundamentals of DNA structure
4. Heterogeneous organization of the genome
5. Repeated sequences in genomes
c. Fundamental concepts of Inference
i. Definition of Inference
ii. Inference as a daily activity
iii. Inference of unobserved variables is inherently uncertain
1. Probability as the language for describing uncertainty
2. A coins example
Quiz 5%
II. Basics of probability and statistics
a. Probabilistic models (week 2: May vary depending on quiz results)
i. Review of the fundamentals of probability theory
1. Probabilities of events
a. Sample space of events
b. Mutually exclusive events
c. Conditional probability and independence
d. Two laws of probability
e. Total probability and Bayes rule
2. Random variables (RV)
a. Functional assignment of real numbers to events
b. Multivariate distribution of RVs
i. Joint, conditional
ii. Marginalization
3. Expected values
ii. Review: Some basic probabilistic models
1. Binomial/multinomial
2. Beta/Dirichlet
3. Poisson
4. Gamma
a. Chi-square
b. Inverse chi-square
5. Normal
a. Multivariate normal
b. Statistical inference
i. Review: Fundamentals of statistical inference
1. Definitions
2. Probability theory to deal with the inherent uncertainty in inference
a. The likelihood
b. Inference as reverse engineering
i. Two major paradigms
1. The sampling distribution and frequentist concepts of behavior on repeated samples
2. Bayesian concept of a probability distribution of the unknowns given the data
ii. Review: Frequentist estimation (Week 3)
1. Example: estimating the proportion of students with blue eyes
2. Data: A random sample from a population
a. One of many possible samples
3. Estimator as a procedure to find point estimates
a. Statistics as function of the data
b. Distribution of a statistic
i. Statistic: a function of the data
ii. Given an independent random sample of N students, derive the probability that K = k of them will have blue eyes if the proportion at Brown is p
iii. Which of the numbers in your formula is an RV?
iv. What is your estimate of p, p^hat?
1. Is p^hat the same as p?
2. Is p^hat an RV?
3. How would p^hat behave on repeated samples?
iii. Review: Maximum likelihood estimation (MLE)
1. Maximum likelihood principle
a. Define a maximum likelihood estimate
i. In your own words
ii. Mathematically, for the binomial used for blue eyes
iii. In general, for any likelihood
iv. Show that the MLE of p in the binomial is p^hat
1. Hint: take logs
2. Characteristics of estimators
a. Unbiased
i. Minimum variance unbiased estimators
b. Efficiency
c. Consistency
d. Minimum squared error loss estimators
i. Squared bias and variance
ii. Bias/variance tradeoff
3. Properties of MLEs
a. Consistent, asymptotically unbiased, and asymptotically normally distributed
i. Asymptotics depend on the sample growing large relative to the number of unknowns.
iv. Bayesian Inference (Weeks 4-5)
1. Likelihood * prior: P(K=k|p,n) f(p|alpha,beta) = f(K=k,p|n,alpha,beta)
a. For binomial
b. What are the RVs?
2. Bayes rule and posterior
a. What are we trying to infer?
b. Binomial, likelihood
c. Beta prior
i. Conjugate priors general case
3. Derive posterior distribution for p
a. Bayesian inference for binomial emissions
i. Binomial likelihood
ii. Beta prior
iii. Derive posterior
b. Hierarchical model
i. Capturing similarities and differences in multiple coins
ii. Validation example
4. MCMC algorithms
a. Gibbs sampling
b. Metropolis-Hastings algorithm
In class exam 20%
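As an illustration of the Bayesian binomial topics in Section II, the following is a minimal sketch of the conjugate Beta-binomial update for the blue-eyes example. It assumes a Beta(alpha, beta) prior on p and k blue-eyed students in a sample of n; the counts and the uniform prior below are hypothetical, chosen only for illustration, and the function names are not from the course materials.

```python
def beta_binomial_posterior(k, n, alpha, beta):
    """Return the (alpha, beta) parameters of the Beta posterior for p.

    With a Beta(alpha, beta) prior and a binomial likelihood for k
    successes in n trials, the posterior is Beta(alpha + k, beta + n - k).
    """
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    return alpha + k, beta + (n - k)


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical data: 12 of 40 sampled students have blue eyes.
k, n = 12, 40
# Assumed prior: uniform on [0, 1], i.e. Beta(1, 1).
a_post, b_post = beta_binomial_posterior(k, n, alpha=1.0, beta=1.0)

mle = k / n                            # frequentist point estimate p^hat
post_mean = beta_mean(a_post, b_post)  # Bayesian posterior mean of p
print(f"MLE p^hat = {mle:.4f}, posterior mean = {post_mean:.4f}")
```

The posterior mean is pulled slightly from p^hat toward the prior mean of 1/2; as n grows with k/n fixed, the two estimates converge.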
III. Modeling Sequences (weeks 6-8)
a. Probabilistic models for sequences
i. Markov models
1. DNA composition example
2. Markov Chain
a. Conditional independence
b. Recursion
ii. Hidden Markov models (HMM)
1. Heterogeneity in DNA composition example
a. Yeast promoter example
b. Two dice example
2. Generative hidden Markov models
a. Hidden State model
i. Markov transition model
1. Geometric length assumption
ii. Emission models
1. Categorical
2. Discrete
3. Continuous
b. Inference with Hidden Variables and HMMs
i. HMM algorithms
1. Derive: Forward algorithm
2. Derive: Back sampling algorithm
3. Backward algorithm
ii. Other examples of hidden variables
1. Alignment
a. Indices in each sequence that correspond
2. RNA secondary structure
a. Indices of bases that form base pairs
iii. Estimation of unknown parameters
1. States known
a. MLEs
b. Bayesian Inference
2. States unknown
a. Gibbs sampling
i. Sample states
ii. Sample from parameter distributions
b. EM algorithm for HMMs
i. Expectation step
ii. Maximization step
iii. EM theory
iv. Other Emission models
1. Normal emissions
a. Geo-science proxy example
2. Poisson emissions
a. High-throughput sequencing example
Project Report 25%
c. Change point (CP) algorithm (Week 9)
i. The challenge
ii. Examples
1. Coins
2. Sequence composition
3. Paleoclimate problem
iii. Identify: Conditional independence nature of CP
iv. Number of change points and combinatorial prior
v. Forward algorithm
vi. Inference of number of states
vii. Back sampling
viii. Inference of change points
d. Tree Models (Week 10)
i. Phylogeny
ii. Trees and conditional independence
iii. Generative model
iv. Upward algorithm
v. Sampling algorithm
vi. Gibbs sampling and EM
e. The two sides of genomic inference (Week 11)
i. Discrete high-D unknowns
1. Curse of dimensionality
2. Application of decision theory in genomics
ii. Population parameters
1. Asymptotics come to bear
2. So much data that the frequentist model of repeated samples is attainable.
a. Bootstrap and other resampling approaches
Take home exam 15% (Combine with Review paper?)
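As an illustration of the HMM material in Section III, the following is a minimal sketch of the forward algorithm for a discrete-emission HMM, applied to a two-dice setting. The transition and emission probabilities below are assumptions made up for illustration (a fair die and a die loaded toward six); they are not values from the course.

```python
def forward_likelihood(obs, init, trans, emit):
    """Return P(obs) under a discrete HMM by summing over all hidden paths.

    init[s]     : probability of starting in state s
    trans[r][s] : probability of moving from state r to state s
    emit[s][x]  : probability that state s emits symbol x
    """
    if not obs:
        return 1.0
    n_states = len(init)
    # alpha[s] = P(obs[0..t], state at time t = s)
    alpha = [init[s] * emit[s][obs[0]] for s in range(n_states)]
    for x in obs[1:]:
        # Recursion: sum over the previous state, then emit the next symbol.
        alpha = [
            emit[s][x] * sum(alpha[r] * trans[r][s] for r in range(n_states))
            for s in range(n_states)
        ]
    return sum(alpha)


# Assumed two-dice model: state 0 = fair die, state 1 = loaded die.
init = [0.5, 0.5]
trans = [[0.95, 0.05],
         [0.10, 0.90]]
emit = [[1.0 / 6.0] * 6,          # fair: all faces equally likely
        [0.1] * 5 + [0.5]]        # loaded: face six (index 5) half the time

rolls = [5, 5, 2, 5]              # hypothetical rolls, faces coded 0..5
print(forward_likelihood(rolls, init, trans, emit))
```

The recursion costs on the order of (sequence length) x (number of states)^2 operations, whereas direct enumeration of hidden paths grows exponentially with sequence length. For long sequences the alpha values underflow, so a practical version would rescale them or work in log space.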
IV. Hypothesis testing (weeks 12-14)
a. Basic concepts
i. Frequentist error types
1. Type I and p-values
2. Type II and power
ii. Bayesian evidence
b. Multiple comparisons in high-D settings
i. Family-wise p-values
ii. FDR
1. Expecting some false positives
a. Controlling the proportion above the critical value
2. q-values
3. Storey & Tibshirani, PNAS, (2003) 100: 9440–9445
iii. Local fdr: a Bayesian approach
1. Using high-D data to create an empirical density distribution of p-values
2. Robust assessment of departures from the uniform distribution of p-values
a. Z-transform
b. Estimating mu and sigma from the core distribution
c. Local fdr and the Bayesian posterior
3. Genomics as a discovery science
a. An observational science
b. Untoward effects in observational science
i. Confounding
ii. Unseen correlation
4. Efron, JASA, (2004) 99:96-104
iv. Ridge Regression and LASSO
1. Regularizer
a. Bayesian prior
b. Penalty on high-D estimates
2. Optimization
a. Max{Likelihood – regularizer (penalty)}
3. Tibshirani, JRSS-B, (1996) 58: 267-288
Review Paper 15% (Combine with take home exam?)
Class participation 20%
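As an illustration of the multiple-comparisons topics in Section IV, the following is a minimal sketch contrasting family-wise control with FDR control. The syllabus does not name a specific FDR procedure; this sketch assumes the Benjamini-Hochberg step-up rule as a stand-in for FDR control and the Bonferroni correction for family-wise control. The p-values are hypothetical.

```python
def bonferroni_reject(pvals, alpha=0.05):
    """Family-wise control: reject test i when p_i <= alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def benjamini_hochberg_reject(pvals, alpha=0.05):
    """FDR control via the Benjamini-Hochberg step-up rule.

    Sort the p-values, find the largest rank k with p_(k) <= k * alpha / m,
    and reject the k hypotheses with the smallest p-values.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k_max = rank
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject


# Hypothetical p-values from ten tests.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(sum(bonferroni_reject(pvals)), "rejected under Bonferroni")
print(sum(benjamini_hochberg_reject(pvals)), "rejected under Benjamini-Hochberg")
```

Because the step-up rule accepts some expected false positives in exchange for power, it never rejects fewer hypotheses than Bonferroni at the same alpha.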
Course reading materials
1) Biological Sequence Analysis, Durbin et al.
2) Mathematical Statistics, Wackerly et al.
3) Bayesian Data Analysis, Gelman A, et al. (Brown Library e-Book)
4) Papers by Storey & Tibshirani, Efron, and Tibshirani
Computer questions: Prof. Bill Thompson: william_thompson_1@brown.edu