Center for Complex Engineering Systems
Introduction to Applied Machine Learning
Abdullah Almaatouq (amaatouq@mit.edu)
Abdulaziz Alhassan (a.alhassan@cces-kacst-mit.org)
Ahmad Alabdulkareem (kareem@mit.edu)

Tutorial Outline
• What is machine learning
• How to learn
• How to learn well
• Practical Session

Examples of Machine Learning

What is Machine Learning
• Giving computers the ability to learn without being explicitly programmed. – Arthur Samuel (1959)
• Machine learning: a computer program that improves its performance on a certain task with experience. – Tom Mitchell (1998)

What is Machine Learning
• Making predictions based on patterns learned from historical data.
• Not to be confused with data mining
  – Data mining focuses on exploiting patterns in historical data.
• New field?
  – Statistical generalization
  – Computational statistics

Why Machine Learning?
• Huge amounts of data!
• Humans are expensive, computers are cheap.
  – 88 ECUs, 32 CPUs and 244 GB of RAM ≈ $3.50/h
  – Minimum worker wage in the US is $7.25/h
• Rule-based systems are (often):
  – Difficult to capture all relevant patterns
  – You can't add rules for patterns you don't know exist
  – Very hard to maintain
  – (often) not working well
• Psychic superpowers (sometimes!)

When to Use Machine Learning
• There is a pattern to be detected
• You have data
• You can NOT pin it down mathematically (unlike, say, distance as a function of time: d = r × t)

Types of Problems
• Supervised
  – Training on seen, labeled examples to generalize to unseen new observations.
  – Given some input x_i, predict the class/value y_i
• Unsupervised
  – Find hidden structures in unlabeled data

Types of Problems
• Machine learning problems
  – Supervised: regression, classification
  – Unsupervised: clustering, outlier detection

Supervised Learning
• Classification
  – Binary: given x_i, find y_i in {-1, 1}
  – Multicategory: given x_i, find y_i in {1, ..., K}
• Regression
  – Given x_i, find y_i in R (or R^d)
• Examples: spam vs. not spam, images to digits, beak depth prediction

Supervised Learning Framework
• An unknown function f: x → y generates the training examples (x_1, y_1), ..., (x_n, y_n).
• The machine learning algorithm, guided by an error function and optimization routines, chooses a hypothesis h(x) from the hypothesis set.
• The result is a known function g ≈ f that predicts the value/category y for a new observation x.

Linear Regression
• Linear form: y = ax + b
• Capture the collective behavior of credit officers
• Predict the credit line for new customers
• Input x = [bias = 1, age = 23 years, annual salary = $144,000, years in residence = 4, years in job = 2, ...] = [1, 23, 144000, 4, 2, ...]
• h(x) = θ_1 x_1 + ... + θ_d x_d + θ_0 x_0
• Linear regression output: h(x) = Σ_{i=0}^{d} θ_i x_i = θ^T x

Linear Regression
• The historical dataset is (x_1, y_1), (x_2, y_2), ..., (x_N, y_N)
• y_n ∈ R is the credit line for customer x_n
• How well does h(x) = θ^T x approximate f(x)?
  – Squared error: (h(x) − f(x))^2
  – E(h) = (1/N) Σ_{n=1}^{N} (h(x_n) − y_n)^2
• The goal is to choose the h that minimizes E(h)

Linear Regression
[Figure: regression line in 2D and regression plane in 3D; it works the same way in higher dimensions.]

Setting Up the Problem
• Stack the observations as rows of a feature matrix: X = [x_1^T; x_2^T; ...; x_N^T] (rows are observations, columns are features)
• E(θ) = (1/N) Σ_{n=1}^{N} (θ^T x_n − y_n)^2 = (1/N) ‖Xθ − y‖^2

Minimizing E(h)
• E(θ) = (1/N) ‖Xθ − y‖^2
• ∇E(θ) = (2/N) X^T (Xθ − y) = (2/N) (X^T X θ − X^T y) = 0
• θ = (X^T X)^{-1} X^T y (the inverse here is the pseudo-inverse)
• Python: theta = linalg.inv(X.T*X)*X.T*y
• MATLAB: theta = inv(X'*X)*X'*y
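The one-line solvers above assume X and y are already assembled. A slightly fuller sketch in NumPy, using the pseudo-inverse as the slide notes; the credit-line numbers below are made up purely to illustrate the shapes:

  import numpy as np

  # Toy historical data: each row is [bias, age, annual salary, years in residence, years in job]
  # (hypothetical numbers, only to show the shapes involved)
  X = np.array([
      [1, 23, 144000, 4, 2],
      [1, 35,  90000, 7, 5],
      [1, 41, 120000, 2, 9],
      [1, 29,  60000, 1, 3],
  ], dtype=float)
  y = np.array([20000, 15000, 25000, 8000], dtype=float)  # hypothetical historical credit lines

  # Normal equation: theta = (X^T X)^(-1) X^T y
  # pinv is the pseudo-inverse, so this still works when X^T X is singular
  theta = np.linalg.pinv(X.T @ X) @ X.T @ y

  # Predict the credit line for a new applicant
  x_new = np.array([1, 30, 100000, 3, 4], dtype=float)
  print(theta, x_new @ theta)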
Linear Regression in the Supervised Learning Framework
• Unknown function f: the credit-line officers
• Training examples (x_1, y_1), ..., (x_n, y_n): historical credit lines
• Hypothesis set: h(x) = ax + b (linear regression)
• Error function: squared error
• Optimization routine: gradient descent
• New observations x: new applications
• Output: the predicted credit line y

Support Vector Machines
• Separating plane with maximum margin
• Margins are configurable with parameters to allow for mistakes
• The gold-standard black box for many practitioners

Support Vector Machines
• Effective in high-dimensional spaces
• Effective even when #features > #observations
• Can use different kernel functions: linear, polynomial, RBF

Supervised Learning Framework
(Recap of the supervised learning framework diagram.)

Types of Problems
(Recap: supervised = regression, classification; unsupervised = clustering, outlier detection.)

Unsupervised Learning
• Trying to find hidden structure in unlabeled data.
• Clustering
  – Group similar data points into "clusters"
  – k-means, mixture models, hierarchical clustering

Unsupervised Learning: k-means
• "Birds of a feather flock together"
• Group samples around a "mean"
  – Start with random "means" and assign each point to the nearest "mean"
  – Calculate new "means"
  – Re-assign points to the new "means", and repeat...
[Figure sequence: random initial means → assignment → mean recalculation → re-assignment → mean recalculation → re-assignment.]

Unsupervised Learning: Mixture Models
• Presence of subpopulations within an overall population

Anomaly Detection
• Detect abnormal data
• Whitelisting good behavior

Anomaly Detection: Statistical Anomaly Detection
• Model each feature of the historical observations, e.g.:

  #    X1   X2   ...  Xn
  1    12   8    ...  48
  2    14   7    ...  52
  3    16   8    ...  57
  4    11   6    ...  53
  ...  ...  ...  ...  ...
  m    13   7    ...  56

• P(x) = p(x1; α1, μ1) · p(x2; α2, μ2) · ... · p(xn; αn, μn)
• Observations with low P(x) are flagged as anomalies

Anomaly Detection: Local Outlier Factor (LOF)
• Based on local density
• Locality is given by the nearest neighbors

How to Learn (Well)

Initial Critical Obstacles
• Choosing features:
  – By domain experts' knowledge and expertise
  – Or by feature-extraction algorithms
• Choosing parameters, for example:
  – Degree of the regression equation
  – SVM parameters (C, gamma)
  – Number of clusters
    • k-means (pre)
    • Agglomerative (post)

A Simple Example: Fitting a Polynomial
• The green curve is the true function (which is not a polynomial)
• We will use a loss function that measures the squared error in the prediction of y(x) from x.
• Some fits to the data: which is best?

Occam's Razor
• "The simplest explanation for some phenomenon is more likely to be accurate than more complicated explanations."
• So which fit is best?

A Simple Way to Reduce Model Complexity
• Add penalties (regularization)
[Figures from Bishop.]

Example: Mixture Models

Using a Validation Set
• Divide the total dataset into three subsets:
  – Training data: for learning the parameters of the model.
  – Validation data: for deciding what type of model and what amount of regularization work best.
  – Test data: used to get a final, unbiased estimate of how well the algorithm works. We expect this estimate to be worse than on the validation data.
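The slides do not prescribe a tool for making this split. A minimal sketch, assuming scikit-learn is available and an arbitrary 60/20/20 split chosen only for illustration:

  import numpy as np
  from sklearn.model_selection import train_test_split

  X = np.random.rand(100, 5)   # stand-in feature matrix
  y = np.random.rand(100)      # stand-in targets

  # First carve out 20% as the final test set ...
  X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
  # ... then split the remainder into 60% training / 20% validation (0.25 of the remaining 80%)
  X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

  # Fit candidate models on (X_train, y_train), pick the one that does best on (X_val, y_val),
  # and report its error on (X_test, y_test) only once, at the very end.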
PCA
• Data visualization
• Noise reduction
• Data reduction
• Can help with the curse of dimensionality
• And more

What is PCA
• Orthogonal directions of greatest variance in the data
[Figure: data plotted against Original Variable A and Original Variable B, with PC 1 and PC 2 drawn as new axes.]

Principal Components
• The first PC is the direction of maximum variance from the origin
• Subsequent PCs are orthogonal to the 1st PC and describe the maximum residual variance
[Figures: Wavelength 1 vs. Wavelength 2 scatter plots with PC 1 and PC 2 drawn through the data.]

Principal Components
[Figure: each observation (x_i1, x_i2) re-expressed in the principal-component coordinates (y_i,1, y_i,2).]

PCA on a Real Example
[Figure: an original image compared with PCA reconstructions using r = 4, 64, 256, and 3600 components.]

Thank You
• Make sure you have Python and the IPython notebook
  – The easiest way to get this running: http://continuum.io/downloads
• Download the tutorial IPython notebook: http://www.amaatouq.com/notebook.zip
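For reference before the practical session, a minimal sketch of the projection-and-reconstruction idea behind the image example above. The use of scikit-learn, the random stand-in data, and the variable names are all illustrative choices, not part of the slides:

  import numpy as np
  from sklearn.decomposition import PCA

  X = np.random.rand(200, 3600)             # stand-in data: 200 observations, 3600 original dimensions

  r = 4                                      # number of principal components to keep
  pca = PCA(n_components=r)
  scores = pca.fit_transform(X)              # each observation expressed in the r PC coordinates
  X_approx = pca.inverse_transform(scores)   # reconstruction from only r components

  # Fraction of the original variance captured by the r components
  print(pca.explained_variance_ratio_.sum())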