ML course, 2016 fall
What you should know:

Weeks 1, 2 and 3: Basic issues in Probabilities and Information Theory

Read: Chapter 2 from the Foundations of Statistical Natural Language Processing book by Christopher Manning and Hinrich Schütze, MIT Press, 2002, and/or Probability Theory Review for Machine Learning, Samuel Ieong, November 6, 2006 (https://see.stanford.edu/materials/aimlcs229/cs229prob.pdf).

Week 1: Random events

Concepts/definitions:
• event/sample space, random event
• probability function
• conditional probabilities
• independent random events (2 forms); conditionally independent random events (2 forms)

Theoretical results/formulas:
• elementary probability formula: (# favorable cases) / (# all possible cases)
• the “multiplication” rule; the “chain” rule
• “total probability” formula (2 forms)
• Bayes formula (2 forms)

Exercises illustrating the above concepts/definitions and theoretical results/formulas, in particular: proofs for certain properties derived from the definition of the probability function, for instance: P(∅) = 0, P(Ā) = 1 − P(A), A ⊆ B ⇒ P(A) ≤ P(B)

Ciortuz et al.’s exercise book: ch. 1, ex. 1-5 [6-7], 43-46 [47-49]

Week 2: Random variables

Concepts/definitions:
• random variables; random variables obtained through function composition
• discrete random variables; probability mass function (pmf); examples: Bernoulli, binomial, geometric, Poisson distributions
• cumulative distribution function
• continuous random variables; probability density function (pdf); examples: Gaussian, exponential, Gamma distributions
• expectation (mean), variance, standard deviation; covariance. (See definitions!)
• multi-valued random functions; joint, marginal, conditional distributions
• independence of random variables; conditional independence of random variables

Theoretical results/formulas:
• for any discrete variable X: ∑_x p(x) = 1, where p is the pmf of X; for any continuous variable X: ∫ p(x) dx = 1, where p is the pdf of X
• E[X + Y] = E[X] + E[Y]; E[aX + b] = aE[X] + b
• Var[aX] = a² Var[X]; Var[X] = E[X²] − (E[X])²
• Cov(X, Y) = E[XY] − E[X]E[Y]
• X, Y independent variables ⇒ Var[X + Y] = Var[X] + Var[Y]
• X, Y independent variables ⇒ Cov(X, Y) = 0, i.e. E[XY] = E[X]E[Y]
(A small numerical check of some of these identities is sketched after this week’s exercise references, below.)

Advanced issues:
• the likelihood function (see also Week 12)
• vector of random variables; covariance matrix for a vector of random variables; positive [semi-]definite matrices, negative [semi-]definite matrices
• for any vector of random variables, the covariance matrix is symmetric and positive semidefinite
Ciortuz et al.’s exercise book: ch. 1, ex. 25

Exercises illustrating the above concepts/definitions and theoretical results/formulas, concentrating especially on:
− identifying in a given problem’s text the underlying probabilistic distribution: either a basic one (e.g., Bernoulli, binomial, categorical, multinomial, etc.), or one derived [by function composition or] by summation of identically distributed random variables (see for instance pr. 14, 51, 53);
− computing probabilities (see for instance pr. 51, 53a,c);
− computing means / expected values of random variables (see for instance pr. 2, 14, 52, 53b,d);
− verifying the [conditional] independence of two or more random variables (see for instance pr. 12, 17, 53, 54).

Ciortuz et al.’s exercise book: ch. 1, ex. 8-17 [18-22], 50-56 [57-59]
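The following is a minimal sketch, not part of the course materials: it checks two of the Week 2 identities (Var[X] = E[X²] − (E[X])², and independence ⇒ Cov(X, Y) = 0, i.e. E[XY] = E[X]E[Y]) numerically, on an arbitrary, made-up discrete joint distribution.

```python
# Minimal numerical check of some Week 2 identities, using an arbitrary
# two-variable discrete distribution (the joint pmf below is illustrative only).

# X and Y are independent by construction: p(x, y) = p_X(x) * p_Y(y).
p_X = {0: 0.3, 1: 0.7}          # a Bernoulli(0.7) variable
p_Y = {1: 0.2, 2: 0.5, 3: 0.3}  # an arbitrary categorical variable
joint = {(x, y): p_X[x] * p_Y[y] for x in p_X for y in p_Y}

def E(f):
    """Expectation of f(X, Y) under the joint pmf."""
    return sum(p * f(x, y) for (x, y), p in joint.items())

EX  = E(lambda x, y: x)
EY  = E(lambda x, y: y)
EX2 = E(lambda x, y: x * x)
EXY = E(lambda x, y: x * y)

var_X  = E(lambda x, y: (x - EX) ** 2)
cov_XY = E(lambda x, y: (x - EX) * (y - EY))

# Var[X] = E[X^2] - (E[X])^2
print(abs(var_X - (EX2 - EX ** 2)) < 1e-12)
# X, Y independent  =>  Cov(X, Y) = 0, i.e. E[XY] = E[X] E[Y]
print(abs(cov_XY) < 1e-12, abs(EXY - EX * EY) < 1e-12)
```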
Week 3: Elements of Information Theory

Concepts/definitions:
• entropy; specific conditional entropy; average conditional entropy; joint entropy; information gain (mutual information)

Advanced issues:
• relative entropy
• cross-entropy

Theoretical results/formulas:
• 0 ≤ H(X) ≤ H(1/n, 1/n, ..., 1/n)  (n times)
• IG(X; Y) ≥ 0
• H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y) (generalisation: the chain rule)
• IG(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)
• H(X, Y) = H(X) + H(Y) iff X and Y are independent
• IG(X; Y) = 0 iff X and Y are independent

Exercises illustrating the above concepts/definitions and theoretical results/formulas, concentrating especially on:
− computing different types of entropies (see ex. 35 and 37);
− proofs of some basic properties (see ex. 34a, 67), including the functional analysis of the entropy of the Bernoulli distribution, as a basis for drawing its plot;
− finding the best IG(Xi; Y), given a dataset with X1, ..., Xn as input variables and Y as output variable (see pr. LDP-104, i.e. CMU, 2013 fall, E. Xing, W. Cohen, Sample Questions, pr. 4).

Ciortuz et al.’s exercise book: ch. 1, ex. 33ab-35, 37, 66-70 [33c-g, 36, 38-41, 71]

Weeks 4 and 5: Decision Trees

Read: Chapter 3 from Tom Mitchell’s Machine Learning book.

Important Note: See the Overview (Rom.: “Privire de ansamblu”) document for Ciortuz et al.’s exercise book, chapter 2. It is in fact a “road map” for what we will be doing here. (This note applies also to all chapters.)

Week 4: decision trees and their properties; application of the ID3 algorithm; properties of ID3:
Ciortuz et al.’s exercise book, ch. 2, ex. 1-7, 20-28

Week 5: extensions to the ID3 algorithm:
− handling of continuous attributes: Ciortuz et al.’s ex. book, ch. 2, ex. 8-9 [10], 29-30 [31-33]
− overfitting: Ciortuz et al.’s ex. book, ch. 2, ex. 8, 19bc, 29, 37b
− pruning strategies for decision trees: Ciortuz et al.’s ex. book, ch. 2, ex. 17, 18, 34, 35
− using other impurity measures as [local] optimality criterion in ID3: Ciortuz et al.’s ex. book, ch. 2, ex. 13
− other issues: Ciortuz et al.’s ex. book, ch. 2, ex. 11, 12, 14-16, 19ad, 36, 37a

Weeks 6 and 7: Bayesian Classifiers

Read: Chapter 6 from Tom Mitchell’s Machine Learning book (except subsections 6.11 and 6.12.2).

Week 6: application of the Naive Bayes and Joint Bayes algorithms:
Ciortuz et al.’s exercise book, ch. 3, ex. 1-4, 11-14

Week 7: computation of the [training] error rate of Naive Bayes: Ciortuz et al.’s exercise book, ch. 3, ex. 5-7, 15-17;
some properties of the Naive Bayes and Joint Bayes algorithms: Ciortuz et al.’s exercise book, ch. 3, ex. 8, 20;
comparisons with other classifiers: Ciortuz et al.’s exercise book, ch. 3, ex. 18-19.

Advanced issues: classes of ML hypotheses: MAP hypotheses vs. ML hypotheses
Ciortuz et al.’s exercise book, ch. 1 (Probabilities and Statistics), ex. 32, 63, and ch. 3, ex. 9-10.

Week 7++: The Gaussian distribution

Read: “Pattern Classification”, R. Duda, P. Hart, D. Stork, 2nd ed. (Wiley-Interscience, 2000), Appendices A.4.12, A.5.1 and A.5.2.

The uni-variate Gaussian distribution: pdf, mean and variance, and cdf; the variable transformation that enables passing from the non-standard case to the standard case (a small numerical sketch follows below); [Advanced issue: the Central Limit Theorem – the i.i.d. case]
The multi-variate Gaussian distribution: pdf; the case of a diagonal covariance matrix

Ciortuz et al.’s exercise book: ch. 1, ex. 23, 24, [27,] 26; (parameter estimation: 31, 61, 64)
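As a companion to the Week 7++ material on the uni-variate Gaussian, the sketch below is an illustration only: it evaluates the non-standard pdf and uses the standardizing transformation z = (x − μ)/σ, together with the standard normal cdf expressed via the error function, to compute P(X ≤ a). The values μ = 5, σ = 2, a = 7 are arbitrary.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """pdf of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def std_normal_cdf(z):
    """cdf of the standard normal N(0, 1), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Arbitrary illustrative values: X ~ N(mu = 5, sigma^2 = 4).
mu, sigma = 5.0, 2.0
a = 7.0

# Standardization: P(X <= a) = P(Z <= (a - mu)/sigma), with Z ~ N(0, 1).
z = (a - mu) / sigma
print(gaussian_pdf(a, mu, sigma))   # density of X at a
print(std_normal_cdf(z))            # P(X <= 7) = P(Z <= 1) ≈ 0.8413
```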
Advanced issues: Gaussian Bayes classifiers and their relationship to logistic regression
Read: the draft of an additional chapter to T. Mitchell’s Machine Learning book: Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression (especially section 3)
Ciortuz et al.’s exercise book, ch. 3, ex. 21-26.

Week 8: midterm EXAM

Week 9: Instance-Based Learning

Read: Chapter 8 from Tom Mitchell’s Machine Learning book.

application of the k-NN algorithm: Ciortuz et al.’s exercise book, ch. 4, pr. 1-7, 15-21, 23a;
comparisons with the ID3 algorithm: Ciortuz et al.’s exercise book, ch. 4, pr. 11, 13, 23b;
Shepard’s algorithm: Ciortuz et al.’s exercise book, ch. 4, pr. 8.
Advanced issues: some theoretical properties of the k-NN algorithm: Ciortuz et al.’s exercise book, ch. 4, pr. 9, 10, 22, 24.

Weeks 10-11, 12-14: Clustering

Read: Chapter 14 from Manning and Schütze’s Foundations of Statistical Natural Language Processing book.

Week 10: Hierarchical Clustering:
see section B in the overview (Rom.: “Privire de ansamblu”) of the Clustering chapter in Ciortuz et al.’s exercise book; pr. 1-6, 25-28, 46-47.

Week 11: The k-Means Algorithm:
see section C1 in the overview of the Clustering chapter in Ciortuz et al.’s exercise book;
application: pr. 7-11, 29-30;
properties (convergence and other issues): pr. 12-13, 31-34;
a “kernelized” version of k-Means: pr. 35;
comparison with the hierarchical clustering algorithms: pr. 14, 36;
implementation: pr. 48.

Week 12: Estimating the parameters of probabilistic distributions: the MAP and MLE methods.
Read: Tom Mitchell, Estimating Probabilities, additional chapter to the Machine Learning book (draft of 24 January 2016).
Ciortuz et al.’s exercise book, ch. 1, pr. 28-31, 60-62, 64, 65.

Week 13: Using the EM algorithm to solve GMMs (Gaussian Mixture Models), the uni-variate case (see the EM sketch at the end of this document).
Read: Tom Mitchell, Machine Learning, sections 6.12.1 and 6.12.3;
see section C2 in the overview of the Clustering chapter in Ciortuz et al.’s exercise book; pr. 15bc-18, 38.

Week 14: EM for GMMs (Gaussian Mixture Models), the multi-variate case:
Ciortuz et al.’s exercise book, the Clustering chapter: pr. 19-22, 23 (the general version), 39-45; implementation: pr. 49.

Advanced issues: The EM Algorithmic Schemata
Read: Tom Mitchell, Machine Learning, sect. 6.12.2
convergence: Ciortuz et al.’s exercise book, ch. 6, pr. 1-2;
EM for solving mixtures of certain distributions: Bernoulli: pr. 3, 11; categorical: pr. [4,] 5, 12; Gamma: pr. 14; exponential: pr. 9;
EM for solving a mixture (linear combination) of unspecified distributions: pr. 8;
application to text categorization: pr. [6,] 13;
a “hard” version of the EM algorithm: pr. 15.

Weeks 15-16: final EXAM
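Relating to Week 13 (EM for uni-variate GMMs), the following is a minimal sketch rather than the pseudocode from the readings: a plain EM loop for a two-component uni-variate Gaussian mixture, run on a small made-up sample with arbitrary initial parameters.

```python
import math

def normal_pdf(x, mu, var):
    """pdf of N(mu, var) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, pi, mu, var, n_iter=50):
    """EM for a two-component uni-variate Gaussian mixture.
    pi, mu, var are lists of length 2 holding the mixing weights,
    means and variances; the updated parameters are returned."""
    for _ in range(n_iter):
        # E-step: responsibilities gamma[i][k] = P(component k | x_i).
        gamma = []
        for x in data:
            w = [pi[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)]
            s = sum(w)
            gamma.append([wk / s for wk in w])
        # M-step: re-estimate mixing weights, means and variances.
        for k in range(2):
            nk = sum(g[k] for g in gamma)
            pi[k] = nk / len(data)
            mu[k] = sum(g[k] * x for g, x in zip(gamma, data)) / nk
            var[k] = sum(g[k] * (x - mu[k]) ** 2 for g, x in zip(gamma, data)) / nk
    return pi, mu, var

# Made-up one-dimensional sample, loosely grouped around 0 and around 5.
data = [-0.3, 0.1, 0.4, -0.2, 0.0, 4.8, 5.1, 5.3, 4.9, 5.2]
# Arbitrary initial parameters.
pi, mu, var = [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]
print(em_gmm_1d(data, pi, mu, var))
```

The “hard” version of EM mentioned above would replace the responsibilities in the E-step with a 0/1 assignment of each point to its most probable component.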