					Supervised Learning
Regression, Classification
Linear regression, k-NN classification
Debapriyo Majumdar
Data Mining – Fall 2014
Indian Statistical Institute Kolkata
August 11, 2014
An Example: Size of Engine vs Power

[Scatter plot of the training data: engine displacement (cc), 0-2500, on the x-axis; power (bhp), 0-200, on the y-axis]

 An unknown car has an engine of size 1800 cc. What is likely
to be the power of the engine?
An Example: Size of Engine vs Power

[Same scatter plot; the target variable is power (bhp), to be predicted from engine displacement (cc)]

 Intuitively, the two variables have a relation
 Learn the relation from the given data
 Predict the target variable after learning
Exercise: on a simpler set of data points

x     y
1     1
2     3
3     7
4     10
2.5   ?

[Scatter plot of the four known (x, y) points]

 Predict y for x = 2.5
Linear Regression

[Scatter plot of the training set: engine displacement (cc), 0-2500, vs power (bhp), 0-200]

 Assume: the relation is linear
 Then for a given x (= 1800), predict the value of y
Linear Regression

Engine (cc)   Power (bhp)
 800           60
1000           90
1200           80
1200          100
1200           75
1400           90
1500          120
1800          160
2000          140
2000          170
2400          180

[Scatter plot of the same training set with a fitted regression line]

 Linear regression
 Assume y = a . x + b
 Try to find suitable a and b (optional exercise)
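To make "find suitable a and b" concrete, here is a minimal least-squares sketch in Python over the table above (the closed-form formulas for a and b are the standard ones; they are not derived on these slides):

    # Training data from the slide: engine displacement (cc) and power (bhp)
    xs = [800, 1000, 1200, 1200, 1200, 1400, 1500, 1800, 2000, 2000, 2400]
    ys = [60, 90, 80, 100, 75, 90, 120, 160, 140, 170, 180]

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Standard closed-form least-squares solution for y = a*x + b
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x

    print(a, b)          # roughly a = 0.0795, b = -4.3
    print(a * 1800 + b)  # predicted power for an 1800 cc engine: ~139 bhp

For this data the fit comes out to roughly a = 0.08 and b = -4.3, predicting about 139 bhp for the unknown 1800 cc car.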
Exercise: using Linear Regression

x     y
1     1
2     3
3     7
4     10
2.5   ?

[Scatter plot of the four known (x, y) points]

 Define a regression line of your choice
 Predict y for x = 2.5
Choosing the parameters right

[Plot: a candidate regression line through the data points; goal: minimize the deviation of the line from the actual data points]

The data points: (x1, y1), (x2, y2), … , (xm, ym)
The regression line: f(x) = y = a . x + b
Least-square cost function: J = Σi (f(xi) – yi)2
Goal: minimize J over choices of a and b
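To make the cost concrete, here is J evaluated on the earlier exercise data (x = 1, 2, 3, 4 and y = 1, 3, 7, 10); the two candidate (a, b) pairs below are illustrative guesses, not values from the slides:

    # Data points from the earlier exercise
    xs = [1, 2, 3, 4]
    ys = [1, 3, 7, 10]

    def J(a, b):
        # Least-square cost: sum of squared deviations of f(x) = a*x + b
        # from the actual data points
        return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys))

    print(J(3, -2))      # 1.0 : cost of one hand-picked line
    print(J(3.1, -2.5))  # 0.7 : a nearby line with lower cost

A lower J means the line deviates less from the data; (3.1, -2.5) is in fact the least-squares optimum for these four points.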
How to Minimize the Cost Function?

[Surface plot of J as a function of the two parameters a and b]

Goal: minimize J over all values of a and b
Start from some a = a0 and b = b0
Compute: J(a0, b0)
Simultaneously change a and b towards the negative
gradient and eventually hope to arrive at an optimum
 Question: Can there be more than one optimum?
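A minimal gradient-descent sketch for the same cost and data (the learning rate and iteration count are arbitrary hand-picked choices, not from the slides):

    xs = [1, 2, 3, 4]
    ys = [1, 3, 7, 10]

    a, b = 0.0, 0.0   # start from some a0, b0
    lr = 0.01         # learning rate: hand-picked step size

    for _ in range(10000):
        # Partial derivatives of J = sum_i (a*x_i + b - y_i)^2
        da = sum(2 * (a * x + b - y) * x for x, y in zip(xs, ys))
        db = sum(2 * (a * x + b - y) for x, y in zip(xs, ys))
        # Simultaneously move both parameters against the gradient
        a, b = a - lr * da, b - lr * db

    print(a, b)  # converges to about a = 3.1, b = -2.5

For this particular quadratic cost the surface is convex, so gradient descent cannot get stuck in a spurious local optimum; for other cost functions it can.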
Another example: high blood sugar

[Plot of the training set: age, 0-80, on the x-axis; high blood sugar (Y / N) on the y-axis]

 Given that a person’s age is 24, predict if (s)he has
high blood sugar
 Discrete values of the target variable (Y / N)
 Many ways of approaching this problem
Classification problem

[Same plot, now with an unlabeled query point at age 24]

 One approach: what other data points are nearest to
the new point?
 Other approaches?
Classification Algorithms
 The k-nearest neighbor classification
 Naïve Bayes classification
 Decision Tree
 Linear Discriminant Analysis
 Logistic Regression
 Support Vector Machine
Classification or Regression?
Given data about some cars: engine size, number of
seats, petrol / diesel, has airbag or not, price
 Problem 1: Given engine size of a new car, what is
likely to be the price?
 Problem 2: Given the engine size of a new car, is it
likely that the car is run by petrol?
 Problem 3: Given the engine size, is it likely that the
car has airbags?
Classification

Example: Age, Income and Owning a flat

[Scatter plot of the training set: age, 0-70, on the x-axis; monthly income (thousand rupees), 0-250, on the y-axis; points labeled “owns a flat” and “does not own a flat”]

 Given a new person’s age and income, predict – does
(s)he own a flat?
Example: Age, Income and Owning a flat

[Same scatter plot of the training set]

 Nearest neighbor approach
 Find nearest neighbors among the known data points
and check their labels
Example: Age, Income and Owning a flat

[Same scatter plot of the training set]

 The 1-Nearest Neighbor (1-NN) Algorithm:
– Find the closest point in the training set
– Output the label of the nearest neighbor
The k-Nearest Neighbor Algorithm

[Same scatter plot of the training set]

 The k-Nearest Neighbor (k-NN) Algorithm:
– Find the closest k points in the training set
– Majority vote among the labels of the k points
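A minimal k-NN sketch for this task (the training points below are made-up illustrative data, not the points plotted on the slide):

    from collections import Counter
    import math

    # Hypothetical training set: ((age, monthly income in thousand rupees), owns_flat)
    train = [((25, 40), "N"), ((45, 150), "Y"), ((35, 90), "Y"),
             ((22, 30), "N"), ((50, 120), "Y"), ((28, 35), "N")]

    def knn_predict(query, k):
        # Find the k training points closest to the query (Euclidean distance)
        nearest = sorted(train, key=lambda pt: math.dist(pt[0], query))[:k]
        # Majority vote among the labels of the k points
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]

    print(knn_predict((30, 60), k=1))  # 1-NN: label of the single closest point
    print(knn_predict((30, 60), k=3))  # 3-NN: majority vote among three neighbors

Note that age and income live on very different scales, so a raw Euclidean distance is dominated by the income coordinate; that is one motivation for the alternative distance measures on the next slide.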
Distance measures

 How to measure distance to find closest points?
 Euclidean distance between vectors x = (x1, … , xk)
and y = (y1, … , yk): sqrt( Σi (xi – yi)2 )
 Manhattan distance: Σi | xi – yi |
 Generalized squared interpoint distance: (x – y)T S–1 (x – y),
where S is the covariance matrix; this gives the Mahalanobis
distance (1936)
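The three measures as code, assuming numpy; for the Mahalanobis distance, the covariance matrix S is estimated here from a hypothetical data matrix:

    import numpy as np

    def euclidean(x, y):
        return np.sqrt(np.sum((x - y) ** 2))

    def manhattan(x, y):
        return np.sum(np.abs(x - y))

    def mahalanobis(x, y, S):
        # S: covariance matrix of the data; rescales each dimension by the
        # data's spread and accounts for correlation between features
        d = x - y
        return np.sqrt(d @ np.linalg.inv(S) @ d)

    # Example: S estimated from a (hypothetical) data matrix, one row per point
    data = np.array([[25, 40], [45, 150], [35, 90], [22, 30], [50, 120]])
    S = np.cov(data, rowvar=False)
    print(mahalanobis(data[0].astype(float), np.array([30.0, 60.0]), S))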
Classification setup
 Training data / set: set of input data points and given
answers for the data points
 Labels: the list of possible answers
 Test data / set: inputs to the classification algorithm
for finding labels
– Used for evaluating the algorithm in case the answers are
known (but not revealed to the algorithm)
 Classification task: Determining labels of the data
points for which the label is not known or not passed
to the algorithm
 Features: attributes that represent the data
Evaluation
 Test set accuracy: the correct performance measure
 Accuracy = # of correct answers / # of all answers
 Need to know the true test labels
– Option: use the training set itself
– Parameter selection (for k-NN) by accuracy on the training set
 Overfitting: a classifier performs too well on the training
set compared to new (unlabeled) test data
Better validation methods

 Leave one out:
– For each training data point x of training set D
– Construct training set D – x, test set {x}
– Train on D – x, test on x
– Overall accuracy = average over all such cases
– Expensive to compute
 Hold out set:
– Randomly choose x% (say 25-30%) of the training data, set
aside as test set
– Train on the rest of the training data, test on the test set
– Easy to compute, but tends to have higher variance
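A leave-one-out sketch; classify(train, x) -> label is a hypothetical classifier interface (any classifier fits), not a function defined on the slides:

    def leave_one_out_accuracy(D, classify):
        # D: list of (x, label) pairs; classify(train, x) -> predicted label
        correct = 0
        for i, (x, label) in enumerate(D):
            rest = D[:i] + D[i+1:]   # training set D - x; test set is {x}
            if classify(rest, x) == label:
                correct += 1
        # Overall accuracy = average over all |D| single-point test cases
        return correct / len(D)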
The k-fold Cross Validation Method

 Randomly divide the training data D into k partitions
D1, …, Dk of (roughly) equal size
 For each fold Di
– Train a classifier with training data = D – Di
– Test and validate with Di
 Overall accuracy: average accuracy over all k folds
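A matching k-fold sketch under the same hypothetical classify(train, x) -> label interface (assumes len(D) >= k):

    import random

    def k_fold_accuracy(D, classify, k=5):
        # Randomly divide the training data into k roughly equal partitions
        D = D[:]                      # copy so the caller's list is untouched
        random.shuffle(D)
        folds = [D[i::k] for i in range(k)]
        accuracies = []
        for i, fold in enumerate(folds):
            # Train on D - Di, i.e. all folds except the i-th
            train = [pt for j, f in enumerate(folds) if j != i for pt in f]
            correct = sum(classify(train, x) == label for x, label in fold)
            accuracies.append(correct / len(fold))
        # Overall accuracy: average accuracy over all k folds
        return sum(accuracies) / k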
References

 Lecture videos by Prof. Andrew Ng, Stanford University,
available on Coursera (Course: Machine Learning)
 Data Mining Map: http://www.saedsayad.com/