					Supervised Learning
Regression, Classification
Linear regression, k-NN classification
Debapriyo Majumdar
Data Mining – Fall 2014
Indian Statistical Institute Kolkata
August 11, 2014
An Example: Size of Engine vs Power

[Scatter plot of the training data: engine displacement (cc), 0-2500, on the x-axis; power (bhp), 0-200, on the y-axis]

 An unknown car has an engine of size 1800 cc. What is likely
to be the power of the engine?
An Example: Size of Engine vs Power

[Same scatter plot; the target variable is power (bhp), to be predicted from engine displacement (cc)]

 Intuitively, the two variables have a relation
 Learn the relation from the given data
 Predict the target variable after learning
Exercise: on a simpler set of data points

x     y
1     1
2     3
3     7
4     10
2.5   ?

[Scatter plot of the four known (x, y) points]

 Predict y for x = 2.5
Linear Regression

[Scatter plot of the training set: engine displacement (cc), 0-2500, vs power (bhp), 0-200]

 Assume: the relation is linear
 Then for a given x (= 1800), predict the value of y
Linear Regression

Engine (cc)   Power (bhp)
 800           60
1000           90
1200           80
1200          100
1200           75
1400           90
1500          120
1800          160
2000          140
2000          170
2400          180

[Scatter plot of the same training set with a fitted regression line]

 Linear regression
 Assume y = a . x + b
 Try to find suitable a and b (optional exercise)
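To make "find suitable a and b" concrete, here is a minimal least-squares sketch in Python over the table above (the closed-form formulas for a and b are the standard ones; they are not derived on these slides):

    # Training data from the slide: engine displacement (cc) and power (bhp)
    xs = [800, 1000, 1200, 1200, 1200, 1400, 1500, 1800, 2000, 2000, 2400]
    ys = [60, 90, 80, 100, 75, 90, 120, 160, 140, 170, 180]

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Standard closed-form least-squares solution for y = a*x + b
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x

    print(a, b)          # roughly a = 0.0795, b = -4.3
    print(a * 1800 + b)  # predicted power for an 1800 cc engine: ~139 bhp

For this data the fit comes out to roughly a = 0.08 and b = -4.3, predicting about 139 bhp for the unknown 1800 cc car.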
Exercise: using Linear Regression

x     y
1     1
2     3
3     7
4     10
2.5   ?

[Scatter plot of the four known (x, y) points]

 Define a regression line of your choice
 Predict y for x = 2.5
Choosing the parameters right

[Plot: a candidate regression line through the data points; goal: minimize the deviation of the line from the actual data points]

The data points: (x1, y1), (x2, y2), … , (xm, ym)
The regression line: f(x) = y = a . x + b
Least-square cost function: J = Σi (f(xi) – yi)2
Goal: minimize J over choices of a and b
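To make the cost concrete, here is J evaluated on the earlier exercise data (x = 1, 2, 3, 4 and y = 1, 3, 7, 10); the two candidate (a, b) pairs below are illustrative guesses, not values from the slides:

    # Data points from the earlier exercise
    xs = [1, 2, 3, 4]
    ys = [1, 3, 7, 10]

    def J(a, b):
        # Least-square cost: sum of squared deviations of f(x) = a*x + b
        # from the actual data points
        return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys))

    print(J(3, -2))      # 1.0 : cost of one hand-picked line
    print(J(3.1, -2.5))  # 0.7 : a nearby line with lower cost

A lower J means the line deviates less from the data; (3.1, -2.5) is in fact the least-squares optimum for these four points.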
How to Minimize the Cost Function?

[Surface plot of J as a function of the two parameters a and b]

Goal: minimize J over all values of a and b
Start from some a = a0 and b = b0
Compute: J(a0, b0)
Simultaneously change a and b towards the negative
gradient and eventually hope to arrive at an optimum
 Question: Can there be more than one optimum?
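A minimal gradient-descent sketch for the same cost and data (the learning rate and iteration count are arbitrary hand-picked choices, not from the slides):

    xs = [1, 2, 3, 4]
    ys = [1, 3, 7, 10]

    a, b = 0.0, 0.0   # start from some a0, b0
    lr = 0.01         # learning rate: hand-picked step size

    for _ in range(10000):
        # Partial derivatives of J = sum_i (a*x_i + b - y_i)^2
        da = sum(2 * (a * x + b - y) * x for x, y in zip(xs, ys))
        db = sum(2 * (a * x + b - y) for x, y in zip(xs, ys))
        # Simultaneously move both parameters against the gradient
        a, b = a - lr * da, b - lr * db

    print(a, b)  # converges to about a = 3.1, b = -2.5

For this particular quadratic cost the surface is convex, so gradient descent cannot get stuck in a spurious local optimum; for other cost functions it can.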
Another example: high blood sugar

[Plot of the training set: age, 0-80, on the x-axis; high blood sugar (Y / N) on the y-axis]

 Given that a person’s age is 24, predict if (s)he has
high blood sugar
 Discrete values of the target variable (Y / N)
 Many ways of approaching this problem
Classification problem

[Same plot, now with an unlabeled query point at age 24]

 One approach: what other data points are nearest to
the new point?
 Other approaches?
Classification Algorithms
 The k-nearest neighbor classification
 Naïve Bayes classification
 Decision Tree
 Linear Discriminant Analysis
 Logistic Regression
 Support Vector Machine
Classification or Regression?
Given data about some cars: engine size, number of
seats, petrol / diesel, has airbag or not, price
 Problem 1: Given engine size of a new car, what is
likely to be the price?
 Problem 2: Given the engine size of a new car, is it
likely that the car is run by petrol?
 Problem 3: Given the engine size, is it likely that the
car has airbags?
Classification

Example: Age, Income and Owning a flat

[Scatter plot of the training set: age, 0-70, on the x-axis; monthly income (thousand rupees), 0-250, on the y-axis; points labeled “owns a flat” and “does not own a flat”]

 Given a new person’s age and income, predict – does
(s)he own a flat?
Example: Age, Income and Owning a flat

[Same scatter plot of the training set]

 Nearest neighbor approach
 Find nearest neighbors among the known data points
and check their labels
Example: Age, Income and Owning a flat

[Same scatter plot of the training set]

 The 1-Nearest Neighbor (1-NN) Algorithm:
– Find the closest point in the training set
– Output the label of the nearest neighbor
The k-Nearest Neighbor Algorithm

[Same scatter plot of the training set]

 The k-Nearest Neighbor (k-NN) Algorithm:
– Find the closest k points in the training set
– Majority vote among the labels of the k points
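A minimal k-NN sketch for this task (the training points below are made-up illustrative data, not the points plotted on the slide):

    from collections import Counter
    import math

    # Hypothetical training set: ((age, monthly income in thousand rupees), owns_flat)
    train = [((25, 40), "N"), ((45, 150), "Y"), ((35, 90), "Y"),
             ((22, 30), "N"), ((50, 120), "Y"), ((28, 35), "N")]

    def knn_predict(query, k):
        # Find the k training points closest to the query (Euclidean distance)
        nearest = sorted(train, key=lambda pt: math.dist(pt[0], query))[:k]
        # Majority vote among the labels of the k points
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]

    print(knn_predict((30, 60), k=1))  # 1-NN: label of the single closest point
    print(knn_predict((30, 60), k=3))  # 3-NN: majority vote among three neighbors

Note that age and income live on very different scales, so a raw Euclidean distance is dominated by the income coordinate; that is one motivation for the alternative distance measures on the next slide.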
Distance measures

 How to measure distance to find closest points?
 Euclidean distance between vectors x = (x1, … , xk)
and y = (y1, … , yk): sqrt( Σi (xi – yi)2 )
 Manhattan distance: Σi | xi – yi |
 Generalized squared interpoint distance: (x – y)T S–1 (x – y),
where S is the covariance matrix; this gives the Mahalanobis
distance (1936)
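The three measures as code, assuming numpy; for the Mahalanobis distance, the covariance matrix S is estimated here from a hypothetical data matrix:

    import numpy as np

    def euclidean(x, y):
        return np.sqrt(np.sum((x - y) ** 2))

    def manhattan(x, y):
        return np.sum(np.abs(x - y))

    def mahalanobis(x, y, S):
        # S: covariance matrix of the data; rescales each dimension by the
        # data's spread and accounts for correlation between features
        d = x - y
        return np.sqrt(d @ np.linalg.inv(S) @ d)

    # Example: S estimated from a (hypothetical) data matrix, one row per point
    data = np.array([[25, 40], [45, 150], [35, 90], [22, 30], [50, 120]])
    S = np.cov(data, rowvar=False)
    print(mahalanobis(data[0].astype(float), np.array([30.0, 60.0]), S))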
Classification setup
 Training data / set: set of input data points and given
answers for the data points
 Labels: the list of possible answers
 Test data / set: inputs to the classification algorithm
for finding labels
– Used for evaluating the algorithm in case the answers are
known (but not revealed to the algorithm)
 Classification task: Determining labels of the data
points for which the label is not known or not passed
to the algorithm
 Features: attributes that represent the data
Evaluation
 Test set accuracy: the correct performance measure
 Accuracy = # of correct answers / # of all answers
 Need to know the true test labels
– Option: use the training set itself
– Parameter selection (for k-NN) by accuracy on the training set
 Overfitting: a classifier performs too well on the training
set compared to new (unlabeled) test data
Better validation methods

 Leave one out:
– For each training data point x of training set D
– Construct training set D – x, test set {x}
– Train on D – x, test on x
– Overall accuracy = average over all such cases
– Expensive to compute
 Hold out set:
– Randomly choose x% (say 25-30%) of the training data, set
aside as test set
– Train on the rest of the training data, test on the test set
– Easy to compute, but tends to have higher variance
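A leave-one-out sketch; classify(train, x) -> label is a hypothetical classifier interface (any classifier fits), not a function defined on the slides:

    def leave_one_out_accuracy(D, classify):
        # D: list of (x, label) pairs; classify(train, x) -> predicted label
        correct = 0
        for i, (x, label) in enumerate(D):
            rest = D[:i] + D[i+1:]   # training set D - x; test set is {x}
            if classify(rest, x) == label:
                correct += 1
        # Overall accuracy = average over all |D| single-point test cases
        return correct / len(D)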
The k-fold Cross Validation Method

 Randomly divide the training data D into k partitions
D1, …, Dk of (roughly) equal size
 For each fold Di
– Train a classifier with training data = D – Di
– Test and validate with Di
 Overall accuracy: average accuracy over all k folds
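A matching k-fold sketch under the same hypothetical classify(train, x) -> label interface (assumes len(D) >= k):

    import random

    def k_fold_accuracy(D, classify, k=5):
        # Randomly divide the training data into k roughly equal partitions
        D = D[:]                      # copy so the caller's list is untouched
        random.shuffle(D)
        folds = [D[i::k] for i in range(k)]
        accuracies = []
        for i, fold in enumerate(folds):
            # Train on D - Di, i.e. all folds except the i-th
            train = [pt for j, f in enumerate(folds) if j != i for pt in f]
            correct = sum(classify(train, x) == label for x, label in fold)
            accuracies.append(correct / len(fold))
        # Overall accuracy: average accuracy over all k folds
        return sum(accuracies) / k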
References

 Lecture videos by Prof. Andrew Ng, Stanford University,
available on Coursera (Course: Machine Learning)
 Data Mining Map: http://www.saedsayad.com/