Evaluation of Learning Models

Literature:
- T. Mitchell, Machine Learning, chapter 5
- I. H. Witten and E. Frank, Data Mining, chapter 5

Fayyad's KDD Methodology
[Diagram of Fayyad's KDD process: data → selection → target data → preprocessing, feature selection & cleaning → processed data → transformation → transformed data → data mining → patterns → interpretation / evaluation → knowledge.]

Contents
- Estimation of errors for one hypothesis
- Comparison of hypotheses
- Comparison of learning models
- Practical aspects

Two definitions of error
The true error of hypothesis h with respect to target function f and distribution D is the probability that h will misclassify an instance drawn at random according to D:
$\mathrm{error}_D(h) \equiv \Pr_{x \in D}[f(x) \neq h(x)]$

Two definitions of error (2)
The sample error of hypothesis h with respect to target function f and data sample S is the proportion of examples h misclassifies:
$\mathrm{error}_S(h) \equiv \frac{1}{n} \sum_{x \in S} \delta(f(x) \neq h(x))$
where $\delta(f(x) \neq h(x))$ is 1 if $f(x) \neq h(x)$ and 0 otherwise.

Two definitions of error (3)
How well does $\mathrm{error}_S(h)$ estimate $\mathrm{error}_D(h)$?

Problems estimating error
1. Bias: if S is the training set, $\mathrm{error}_S(h)$ is optimistically biased:
$\mathrm{bias} \equiv E[\mathrm{error}_S(h)] - \mathrm{error}_D(h)$
For an unbiased estimate, h and S must be chosen independently.
2. Variance: even with an unbiased S, $\mathrm{error}_S(h)$ may still vary from $\mathrm{error}_D(h)$.

Example
Hypothesis h misclassifies 12 of the 40 examples in S:
$\mathrm{error}_S(h) = \frac{12}{40} = 0.30$
What is $\mathrm{error}_D(h)$?

Estimators
Experiment:
1. Choose a sample S of size n according to distribution D.
2. Measure $\mathrm{error}_S(h)$.
$\mathrm{error}_S(h)$ is a random variable (i.e., the result of an experiment), and it is an unbiased estimator for $\mathrm{error}_D(h)$. Given an observed $\mathrm{error}_S(h)$, what can we conclude about $\mathrm{error}_D(h)$?

Confidence intervals
If S contains n examples, drawn independently of h and of each other, and n ≥ 30, then with approximately 95% probability, $\mathrm{error}_D(h)$ lies in the interval
$\mathrm{error}_S(h) \pm 1.96 \sqrt{\frac{\mathrm{error}_S(h)(1 - \mathrm{error}_S(h))}{n}}$

Confidence intervals (2)
Under the same conditions, with approximately N% probability, $\mathrm{error}_D(h)$ lies in the interval
$\mathrm{error}_S(h) \pm z_N \sqrt{\frac{\mathrm{error}_S(h)(1 - \mathrm{error}_S(h))}{n}}$
where:
N%:  50%   68%   80%   90%   95%   98%   99%
z_N: 0.67  1.00  1.28  1.64  1.96  2.33  2.58

errorS(h) is a random variable
Rerun the experiment with a different randomly drawn S (of size n). The probability of observing r misclassified examples is
$P(r) = \frac{n!}{r!(n-r)!} \, \mathrm{error}_D(h)^r \, (1 - \mathrm{error}_D(h))^{n-r}$

Binomial probability distribution
The probability P(r) of r heads in n coin flips, if p = Pr(heads), is
$P(r) = \frac{n!}{r!(n-r)!} \, p^r (1-p)^{n-r}$
[Figure: binomial distribution for n = 10 and p = 0.3.]
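To make the calculations above concrete, here is a minimal Python sketch (assuming Python 3.8+ for math.comb; the function names are illustrative, not from the slides). It computes the binomial probability of observing r misclassifications and the approximate 95% confidence interval for error_D(h) in the 12-out-of-40 example, which comes out to roughly 0.30 ± 0.14.

```python
from math import comb, sqrt

def binomial_prob(r, n, p):
    """P(r): probability of observing exactly r misclassifications in n examples."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

def error_confidence_interval(error_s, n, z_n=1.96):
    """Approximate confidence interval for error_D(h) given error_S(h).
    z_n defaults to 1.96 (95%); see the z_N table above for other levels."""
    half_width = z_n * sqrt(error_s * (1 - error_s) / n)
    return error_s - half_width, error_s + half_width

# The example from the slides: h misclassifies 12 of the 40 examples in S.
error_s = 12 / 40                      # 0.30
low, high = error_confidence_interval(error_s, n=40)
print(f"error_S(h) = {error_s:.2f}, 95% CI for error_D(h): ({low:.3f}, {high:.3f})")

# The plotted binomial example: n = 10 coin flips with p = 0.3.
print([round(binomial_prob(r, 10, 0.3), 3) for r in range(11)])
```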
Binomial probability distribution (2)
The expected, or mean, value of X is
$E[X] = \sum_{i=0}^{n} i \, P(i) = np$
The variance of X is
$\mathrm{Var}(X) = E[(X - E[X])^2] = np(1-p)$
The standard deviation of X, $\sigma_X$, is
$\sigma_X = \sqrt{E[(X - E[X])^2]} = \sqrt{np(1-p)}$

Normal distribution approximates binomial
$\mathrm{error}_S(h)$ follows a binomial distribution, with
mean $\mu_{\mathrm{error}_S(h)} = \mathrm{error}_D(h)$
and standard deviation $\sigma_{\mathrm{error}_S(h)} = \sqrt{\frac{\mathrm{error}_D(h)(1 - \mathrm{error}_D(h))}{n}}$
Approximate this by a normal distribution with
mean $\mu_{\mathrm{error}_S(h)} = \mathrm{error}_D(h)$
and standard deviation $\sigma_{\mathrm{error}_S(h)} \approx \sqrt{\frac{\mathrm{error}_D(h)(1 - \mathrm{error}_D(h))}{n}}$

Normal probability distribution
$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$
The probability that X will fall into the interval (a, b) is given by $\int_a^b p(x)\,dx$.
The expected, or mean, value of X is $E[X] = \mu$, the variance is $\mathrm{Var}(X) = \sigma^2$, and the standard deviation is $\sigma_X = \sigma$.

Normal probability distribution (2)
80% of the area (probability) lies in $\mu \pm 1.28\sigma$.
N% of the area (probability) lies in $\mu \pm z_N \sigma$, where:
N%:  50%   68%   80%   90%   95%   98%   99%
z_N: 0.67  1.00  1.28  1.64  1.96  2.33  2.58

Confidence intervals, more correctly
If S contains n examples, drawn independently of h and of each other, and n ≥ 30, then with approximately 95% probability, $\mathrm{error}_S(h)$ lies in the interval
$\mathrm{error}_D(h) \pm 1.96 \sqrt{\frac{\mathrm{error}_D(h)(1 - \mathrm{error}_D(h))}{n}}$
and $\mathrm{error}_D(h)$ approximately lies in the interval
$\mathrm{error}_S(h) \pm 1.96 \sqrt{\frac{\mathrm{error}_S(h)(1 - \mathrm{error}_S(h))}{n}}$

Central Limit Theorem
Consider a set of independent, identically distributed random variables $Y_1, \ldots, Y_n$, all governed by an arbitrary probability distribution with mean $\mu$ and finite variance $\sigma^2$. Define the sample mean
$\bar{Y} \equiv \frac{1}{n} \sum_{i=1}^{n} Y_i$
Central Limit Theorem: as $n \to \infty$, the distribution governing $\bar{Y}$ approaches a normal distribution with mean $\mu$ and variance $\sigma^2 / n$.

Difference between hypotheses
Test $h_1$ on sample $S_1$ and $h_2$ on $S_2$.
1. Pick the parameter to estimate:
$d \equiv \mathrm{error}_D(h_1) - \mathrm{error}_D(h_2)$
2. Choose an estimator:
$\hat{d} \equiv \mathrm{error}_{S_1}(h_1) - \mathrm{error}_{S_2}(h_2)$
3. Determine the probability distribution that governs the estimator:
$\sigma_{\hat{d}} \approx \sqrt{\frac{\mathrm{error}_{S_1}(h_1)(1 - \mathrm{error}_{S_1}(h_1))}{n_1} + \frac{\mathrm{error}_{S_2}(h_2)(1 - \mathrm{error}_{S_2}(h_2))}{n_2}}$

Difference between hypotheses (2)
4. Find the interval (L, U) such that N% of the probability mass falls in the interval:
$\hat{d} \pm z_N \sqrt{\frac{\mathrm{error}_{S_1}(h_1)(1 - \mathrm{error}_{S_1}(h_1))}{n_1} + \frac{\mathrm{error}_{S_2}(h_2)(1 - \mathrm{error}_{S_2}(h_2))}{n_2}}$

Paired t test to compare hA, hB
1. Partition the data into k disjoint test sets $T_1, T_2, \ldots, T_k$ of equal size, where this size is at least 30.
2. For i from 1 to k, do:
$\delta_i \leftarrow \mathrm{error}_{T_i}(h_A) - \mathrm{error}_{T_i}(h_B)$
3. Return the value $\bar{\delta}$, where
$\bar{\delta} \equiv \frac{1}{k} \sum_{i=1}^{k} \delta_i$

Paired t test to compare hA, hB (2)
The N% confidence interval estimate for d is
$\bar{\delta} \pm t_{N,k-1} \, s_{\bar{\delta}}$, where $s_{\bar{\delta}} \equiv \sqrt{\frac{1}{k(k-1)} \sum_{i=1}^{k} (\delta_i - \bar{\delta})^2}$
Note that $\delta_i$ is approximately normally distributed. (A code sketch of this estimate follows below.)

Comparing learning algorithms LA and LB
What we would like to estimate:
$E_{S \subset D}\left[\mathrm{error}_D(L_A(S)) - \mathrm{error}_D(L_B(S))\right]$
where L(S) is the hypothesis output by learner L using training set S; i.e., the expected difference in true error between the hypotheses output by learners $L_A$ and $L_B$ when trained on randomly selected training sets S drawn according to distribution D.

Comparing learning algorithms LA and LB (2)
But, given limited data $D_0$, what is a good estimator? One could partition $D_0$ into a training set $S_0$ and a test set $T_0$, and measure
$\mathrm{error}_{T_0}(L_A(S_0)) - \mathrm{error}_{T_0}(L_B(S_0))$
Even better, repeat this many times and average the results.
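The paired t interval above can be sketched as follows, assuming SciPy is available for the t critical value; the per-test-set error rates and the function name are hypothetical, illustrative values only.

```python
from math import sqrt
from scipy.stats import t

def paired_t_interval(errors_a, errors_b, confidence=0.95):
    """Confidence interval for d = error_D(hA) - error_D(hB), given the
    error rates of hA and hB on the same k disjoint test sets."""
    k = len(errors_a)
    deltas = [ea - eb for ea, eb in zip(errors_a, errors_b)]
    delta_bar = sum(deltas) / k
    # s_delta_bar = sqrt( (1 / (k(k-1))) * sum_i (delta_i - delta_bar)^2 )
    s_delta_bar = sqrt(sum((d - delta_bar) ** 2 for d in deltas) / (k * (k - 1)))
    t_crit = t.ppf((1 + confidence) / 2, k - 1)  # two-sided t_{N, k-1}
    return delta_bar - t_crit * s_delta_bar, delta_bar + t_crit * s_delta_bar

# Hypothetical per-test-set error rates for hA and hB on k = 5 disjoint test sets.
errors_a = [0.12, 0.15, 0.10, 0.14, 0.13]
errors_b = [0.16, 0.17, 0.15, 0.18, 0.14]
print(paired_t_interval(errors_a, errors_b))
```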
Comparing learning algorithms LA and LB (3): k-fold cross-validation
1. Partition the data $D_0$ into k disjoint test sets $T_1, T_2, \ldots, T_k$ of equal size, where this size is at least 30.
2. For i from 1 to k, use $T_i$ as the test set and the remaining data as the training set $S_i$.
3. Return the average of the errors on the test sets.

Practical aspects: a note on parameter tuning
It is important that the test data are not used in any way to create the classifier. Some learning schemes operate in two stages:
- Stage 1 builds the basic structure.
- Stage 2 optimizes parameter settings.
The test data cannot be used for parameter tuning! The proper procedure uses three sets: training data, validation data, and test data; the validation data are used to optimize the parameters.

Holdout estimation, stratification
What shall we do if the amount of data is limited? The holdout method reserves a certain amount for testing and uses the remainder for training, usually one third for testing and the rest for training. Problem: the samples might not be representative; for example, a class might be missing entirely from the test data. An advanced version uses stratification, which ensures that each class is represented with approximately equal proportions in both subsets.

More on cross-validation
The standard method for evaluation is stratified ten-fold cross-validation. Why ten? Extensive experiments have shown that this is the best choice for getting an accurate estimate, and there is also some theoretical evidence for it. Stratification reduces the estimate's variance. Even better is repeated stratified cross-validation, e.g., ten-fold cross-validation repeated ten times with the results averaged (which further reduces the variance).

Issues in evaluation
- Statistical reliability of estimated differences in performance.
- Choice of performance measure: number of correct classifications, accuracy of probability estimates, error in numeric predictions, or costs assigned to different types of errors. Many practical applications involve costs.

Counting the costs
In practice, different types of classification errors often incur different costs. Examples:
- Predicting when cows are in heat ("in estrus"): "not in estrus" is correct 97% of the time.
- Loan decisions
- Oil-slick detection
- Fault diagnosis
- Promotional mailing

Taking costs into account
The confusion matrix:

             predicted yes     predicted no
actual yes   true positive     false negative
actual no    false positive    true negative

There are many other types of costs, e.g., the cost of collecting training data.

Lift charts
In practice, costs are rarely known, so decisions are usually made by comparing possible scenarios. Example: a promotional mailout.
- Situation 1: the classifier predicts that 0.1% of all households will respond.
- Situation 2: the classifier predicts that 0.4% of the 10000 most promising households will respond.
A lift chart allows a visual comparison of such scenarios.

Generating a lift chart
Instances are sorted according to their predicted probability of being a true positive:

Rank   Predicted probability   Actual class
1      0.95                    Yes
2      0.93                    Yes
3      0.93                    No
4      0.88                    Yes
…      …                       …

In the lift chart, the x axis is the sample size and the y axis is the number of true positives. (A code sketch of this construction follows below.)

A hypothetical lift chart
[Figure: a hypothetical lift chart.]

Summary of measures
[Table: summary of evaluation measures.]
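As a sketch of the lift-chart construction referenced above (plain Python; the function name and the small ranked example are illustrative only):

```python
def lift_chart_points(predicted_probs, actual_classes):
    """Return (sample_size, true_positives) pairs for a lift chart:
    instances are ranked by predicted probability, and for each sample size
    we count the cumulative number of actual positives."""
    ranked = sorted(zip(predicted_probs, actual_classes),
                    key=lambda pair: pair[0], reverse=True)
    points, true_positives = [], 0
    for sample_size, (_, actual) in enumerate(ranked, start=1):
        if actual == "yes":
            true_positives += 1
        points.append((sample_size, true_positives))
    return points

# The ranked example from the slide (truncated).
probs = [0.95, 0.93, 0.93, 0.88]
classes = ["yes", "yes", "no", "yes"]
print(lift_chart_points(probs, classes))
# [(1, 1), (2, 2), (3, 2), (4, 3)]
```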
Model selection criteria
Model selection criteria attempt to find a good compromise between:
A. the complexity of a model, and
B. its prediction accuracy on the training data.
The reasoning is that a good model is a simple model that achieves high accuracy on the given data. This is also known as Occam's razor: the best theory is the smallest one that describes all the facts.

Warning
Suppose you are gathering hypotheses that each have a 95% probability of having an error level below 10%. What if you have found 100 hypotheses satisfying this condition? Then the probability that all of them have an error below 10% is
$(0.95)^{100} \approx 0.006$
i.e., about 0.6%. So the probability that at least one hypothesis has an error above 10% is about 99.4%!
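A quick plain-Python check of the arithmetic in the warning, assuming (as the slide implicitly does) that the individual 95% statements are independent:

```python
# Probability that all hypotheses stay below the error bound, assuming the
# individual 95% confidence statements are independent.
def prob_all_within_bound(per_hypothesis_confidence, num_hypotheses):
    return per_hypothesis_confidence ** num_hypotheses

p_all = prob_all_within_bound(0.95, 100)
print(round(p_all, 4))      # ~0.0059: all 100 hypotheses have error below 10%
print(round(1 - p_all, 4))  # ~0.9941: at least one has error above 10%
```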