Introduction to Predictive Learning
LECTURE SET 2: Basic Learning Approaches and Complexity Control
Electrical and Computer Engineering
OUTLINE
2.0 Objectives
2.1 Data Encoding + Preprocessing
2.2 Terminology and Common Learning Tasks
2.3 Basic Learning Approaches
2.4 Generalization and Complexity Control
2.5 Application Example
2.6 Summary
2.0 Objectives
1. Quantify the notions of explanation, prediction, and model
2. Introduce terminology
3. Describe basic learning approaches
4. Explain the importance of complexity control for generalization
Learning as Induction
Induction ~ function estimation from data
Deduction ~ prediction for new inputs
aka the Standard Inductive Learning Setting
2.1 Data Encoding + Preprocessing
Common Types of Input & Output Variables
(input variables ~ features)
• Real-valued
• Categorical (class labels)
• Ordinal (or fuzzy) variables
• Classification: categorical output
• Regression: real-valued output
• Ranking: ordinal output
Data Preprocessing and Scaling
• Preprocessing is required with observational data (step 4 of the general experimental procedure)
• Basic preprocessing includes:
  - summary univariate statistics: mean, standard deviation, min + max values, range, boxplot, computed independently for each input/output variable
  - detection (and removal) of outliers
  - scaling of input/output variables (may be necessary for some learning algorithms)
• Visual inspection of the data is tedious but useful
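As a concrete illustration of these preprocessing steps, here is a minimal NumPy sketch (the toy data values and variable names are made up; this is not the code used for the slides):

```python
import numpy as np

# Toy data matrix: rows = samples, columns = input variables (hypothetical values)
X = np.array([[62.0, 1320.0],
              [3.3,   44.5],
              [52.2,  440.0],
              [27.7,  115.0]])

# Summary univariate statistics, computed independently for each variable (column)
print("mean:", X.mean(axis=0))
print("std: ", X.std(axis=0))
print("min: ", X.min(axis=0))
print("max: ", X.max(axis=0))
print("range:", X.max(axis=0) - X.min(axis=0))

# Scale each variable to the [0, 1] range (min-max scaling)
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_scaled)
```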
Animal Body & Brain Weight Data
(original, unscaled)
Removing Outliers
Remove the outliers (Brachiosaurus, Diplodocus, Triceratops, African elephant, Asian elephant)
and plot the data scaled to the [0, 1] range:
[Scatter plot: Brain weight vs. Body weight, both scaled to the [0, 1] range]
2.2 Terminology and Learning Problems
• Input and output variables
  [System diagram: inputs x, other factors z, outputs y]
• Learning ~ estimation of a mapping f(x): x → y
• Loss function L(y, f(x)) measures the quality of the prediction ŷ = f(x)
• Loss function:
  - defined for common learning tasks
  - has to be related to application requirements
Supervised Learning: Regression
• Data in the form (x, y), where
  - x is multivariate input (i.e., a vector)
  - y is univariate real-valued output
• Regression loss function: L(y, f(x)) = (y − f(x))²
• Estimation of a real-valued function x → y
Supervised Learning: Classification
• Data in the form (x, y), where
  - x is multivariate input (i.e., a vector)
  - y is categorical output (class label)
• Loss function for binary classification:
  L(y, f(x)) = 0 if y = f(x), and 1 if y ≠ f(x)
• Estimation of an indicator function x → y
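The loss functions on this and the previous slide are simple to express in code; a short sketch with hypothetical predictions (not part of the original slides):

```python
import numpy as np

def squared_loss(y, f_x):
    """Regression loss L(y, f(x)) = (y - f(x))^2."""
    return (y - f_x) ** 2

def zero_one_loss(y, f_x):
    """Binary classification loss: 0 if y == f(x), 1 otherwise."""
    return np.where(y == f_x, 0, 1)

# Average (empirical) loss over a few hypothetical samples
y_reg, f_reg = np.array([1.2, 0.7]), np.array([1.0, 0.9])
y_cls, f_cls = np.array([1, -1, 1]), np.array([1, 1, 1])
print(squared_loss(y_reg, f_reg).mean())   # mean squared error
print(zero_one_loss(y_cls, f_cls).mean())  # misclassification rate
```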
Unsupervised Learning
• Data in the form (x), where
  - x is multivariate input (i.e., a vector)
• Goal: data reduction or clustering
• Clustering = estimation of a mapping x → c (cluster index)
Inductive Learning Setting
• The predictive estimator observes samples (x, y) and returns an estimated response ŷ = f(x, w)
• Recall 'first-principle' vs 'empirical' knowledge
  Two modes of inference: identification vs imitation
• Minimization of risk:
  R(w) = ∫ Loss(y, f(x, w)) dP(x, y) → min
Example: Regression estimation
Given: training data (x_i, y_i), i = 1, 2, ..., n
Find a function f(x, w) that minimizes the squared error over a large number (N) of future samples:
  Σ_{k=1..N} [y_k − f(x_k, w)]² → min
  equivalently, ∫ (y − f(x, w))² dP(x, y) → min
BUT future data is unknown ~ P(x, y) is unknown
Discussion
• The mathematical formulation is useful for quantifying
  - explanation ~ fitting error (on training data)
  - generalization ~ prediction error (on future data)
• Natural assumptions
  - the future is similar to the past: stationary P(x, y), i.i.d. data
  - a discrepancy measure or loss function, e.g., MSE
• What if these assumptions do not hold?
OUTLINE
2.0 Objectives
2.1 Data Encoding + Preprocessing
2.2 Terminology and Common Learning Tasks
2.3 Basic Learning Approaches
- Parametric Modeling
- Non-parametric Modeling
- Data Reduction
2.4 Generalization and Complexity Control
2.5 Application Example
2.6 Summary
Parametric Modeling
Given training data (x_i, y_i), i = 1, 2, ..., n
(1) Specify a parametric model
(2) Estimate its parameters (via fitting to the data)
• Example: Linear regression f(x) = (w · x) + b
  Σ_{i=1..n} (y_i − (w · x_i) − b)² → min
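A minimal sketch of this least-squares fit using NumPy (the synthetic data-generating function and its coefficients below are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
X = rng.uniform(0, 1, size=(n, 2))            # two input variables
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 + 0.1 * rng.standard_normal(n)

# Augment the inputs with a constant column so the bias b is estimated too
X_aug = np.hstack([X, np.ones((n, 1))])

# Least-squares solution of sum_i (y_i - (w . x_i) - b)^2 -> min
coef, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w, b = coef[:-1], coef[-1]
print("w =", w, "b =", b)
```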
Parametric Modeling: Classification
Given training data (x_i, y_i), i = 1, 2, ..., n
(a) Estimate a linear decision boundary
(b) Estimate a third-order decision boundary
Non-Parametric Modeling
Given training data (x_i, y_i), i = 1, 2, ..., n
Estimate the model (at a given point x_0) as a 'local average' of the training data.
Note: need to define 'local' and 'average'
• Example: k-nearest-neighbors regression
  f(x_0) = (1/k) Σ_{j=1..k} y_j, the average output of the k nearest neighbors of x_0
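A direct sketch of this local-average estimate (the 1-D toy data below is an assumption, not the data behind the slides' figures):

```python
import numpy as np

def knn_regression(x0, X_train, y_train, k):
    """Predict f(x0) as the average output of the k nearest training samples."""
    dists = np.linalg.norm(X_train - x0, axis=1)   # Euclidean distances to x0
    nearest = np.argsort(dists)[:k]                # indices of the k nearest neighbors
    return y_train[nearest].mean()

# Hypothetical 1-D example
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 1, size=(10, 1))
y_train = X_train[:, 0] ** 2 + 0.5 * rng.standard_normal(10)
print(knn_regression(np.array([0.5]), X_train, y_train, k=4))
```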
Example of kNN Regression
• Ten training samples from
  y = x² + 0.1x + N(0, σ²), where σ² = 0.25
• Using k-nn regression with k = 1 and k = 4:
Data Reduction Approach
Given training data, estimate the model as a 'compact encoding' of the data.
Note: ‘compact’ ~ # of bits to encode the model
• Example: piece-wise linear regression
  How many parameters are needed for a two-linear-component model?
Data Reduction Approach (cont’d)
Data Reduction approaches are commonly used
for unsupervised learning tasks.
• Example: clustering.
  Training data encoded by 3 points (cluster centers)
  Issues:
  - How to find the centers?
  - How to select the number of clusters?
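The slides leave these two issues open; one standard way to find the centers for a fixed number of clusters is the k-means algorithm (not named on the slide). A minimal sketch, where the synthetic data and the cluster count are assumptions:

```python
import numpy as np

def kmeans(X, n_clusters, n_iter=50, seed=0):
    """Basic k-means: alternate assigning points to the nearest center
    and recomputing each center as the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), n_clusters, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(n_clusters):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

# Three synthetic clusters in two dimensions
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.1, size=(20, 2)) for loc in (0.2, 0.5, 0.8)])
centers, labels = kmeans(X, n_clusters=3)
print(centers)
```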
Standard Inductive Setting
• Model estimation ~ inductive step, i.e., estimate a function from data samples
• Prediction ~ deductive step
(Standard) Inductive Learning Setting
• Discussion: which of the 3 modeling approaches follow the standard inductive learning setting?
• How do humans perform inductive inference?
OUTLINE
2.0 Objectives
2.1 Data Encoding + Preprocessing
2.2 Terminology and Common Learning Tasks
2.3 Basic Learning Approaches
2.4 Generalization and Complexity Control
- Prediction Accuracy (generalization)
- Complexity Control: examples
- Resampling
2.5 Application Example
2.6 Summary
Prediction Accuracy
• All modeling approaches implement 'data fitting' ~ explaining the data
• BUT the true goal is prediction
• Model explanation ~ fitting error, training error, empirical risk
• Prediction accuracy ~ generalization, test error, prediction risk
• The trade-off between training and test error is controlled by 'model complexity'
Explanation vs Prediction
(a) Classification
(b) Regression
Complexity Control: parametric modeling
Consider regression estimation:
• Ten training samples from
  y = x² + N(0, σ²), where σ² = 0.25
• Fitting linear and 2nd-order polynomial models:
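A sketch of this comparison using NumPy's polynomial fitting; only the data-generating function follows the slide, while the random seed and x-locations are assumptions:

```python
import numpy as np

# Ten training samples from y = x^2 + N(0, 0.25); sigma^2 = 0.25 -> sigma = 0.5
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=10)
y = x ** 2 + rng.normal(0, 0.5, size=10)

# Fit polynomials of degree 1 (linear) and degree 2, and compare training MSE
for degree in (1, 2):
    coeffs = np.polyfit(x, y, deg=degree)
    y_hat = np.polyval(coeffs, x)
    print("degree", degree, "training MSE:", ((y - y_hat) ** 2).mean())
```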
Complexity Control: local estimation
Consider regression estimation:
• Ten training samples from
  y = x² + N(0, σ²), where σ² = 0.25
• Using k-nn regression with k = 1 and k = 4:
Complexity Control (cont’d)
• Complexity (of admissible models)
affects generalization (for future data)
• Specific complexity indices for
– Parametric models: ~ # of parameters
– Local modeling: size of local region
– Data reduction: # of clusters
• Complexity control = choosing good complexity (~ good generalization) for a given (training) data set
How to Control Complexity?
• Two approaches: analytic and resampling
• Analytic criteria estimate the prediction error as a function of the fitting error and the model complexity.
  For regression problems:
  R = r(DoF/n) · R_emp
  Example analytic criteria for regression:
• Schwartz Criterion: r(p, n) = 1 + (p / (1 − p)) · ln n
• Akaike's FPE: r(p) = (1 + p) / (1 − p)
  where p = DoF/n, n ~ sample size, DoF ~ degrees of freedom
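Assuming the penalization factors as reconstructed above, the analytic estimate of the prediction risk is easy to compute; a small sketch with hypothetical values:

```python
import math

def schwartz_penalty(dof, n):
    """Schwartz criterion factor r(p, n) = 1 + p/(1 - p) * ln(n), with p = DoF/n."""
    p = dof / n
    return 1.0 + p / (1.0 - p) * math.log(n)

def fpe_penalty(dof, n):
    """Akaike's FPE factor r(p) = (1 + p)/(1 - p), with p = DoF/n."""
    p = dof / n
    return (1.0 + p) / (1.0 - p)

# Estimated prediction risk = penalization factor * empirical (training) risk
n, dof, r_emp = 30, 3, 0.12        # hypothetical sample size, DoF, training risk
print("FPE estimate:     ", fpe_penalty(dof, n) * r_emp)
print("Schwartz estimate:", schwartz_penalty(dof, n) * r_emp)
```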
Resampling
• Split the available data into 2 sets: Training + Validation
  (1) Use the training set for model estimation (via data fitting)
  (2) Use the validation data to estimate the prediction error of the model
• Change the model complexity index and repeat (1) and (2)
• Select the final model providing the lowest (estimated) prediction error
BUT the results are sensitive to the data splitting
K-fold cross-validation
1. Divide the training data Z into k randomly selected disjoint subsets {Z_1, Z_2, ..., Z_k} of size n/k
2. For each 'left-out' validation set Z_i:
   - use the remaining data to estimate the model ŷ = f_i(x)
   - estimate the prediction error on Z_i:  r_i = (k/n) Σ_{Z_i} (f_i(x) − y)²
3. Estimate the average prediction risk as R_cv = (1/k) Σ_{i=1..k} r_i
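A sketch of this k-fold procedure for selecting the polynomial degree, as in the example on the next slide; the noise level, random seed, and use of np.polyfit here are assumptions for illustration:

```python
import numpy as np

def kfold_cv_risk(x, y, degree, k=5, seed=0):
    """Estimate prediction risk of a polynomial model of the given degree
    by k-fold cross-validation with squared-error loss."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)
    risks = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train], y[train], deg=degree)
        risks.append(((y[val] - np.polyval(coeffs, x[val])) ** 2).mean())
    return np.mean(risks)

# Hypothetical data loosely following the next slide's setting (noise level assumed)
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=25)
y = np.sin(2 * np.pi * x) ** 2 + rng.normal(0, 0.1, size=25)
for m in range(1, 11):
    print("degree", m, "estimated risk:", kfold_cv_risk(x, y, degree=m))
```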
Example of model selection (1)
• 25 samples are generated as y = sin²(2πx) + noise,
  with x uniformly sampled in [0, 1] and noise ~ N(0, 1)
• Regression is estimated using polynomials of degree m = 1, 2, ..., 10
• Polynomial degree m = 5 is chosen via 5-fold cross-validation.
  The curve shows the polynomial model, along with training (*) and validation (*) data points, for one partitioning.

m     Estimated R via cross-validation
1     0.1340
2     0.1356
3     0.1452
4     0.1286
5     0.0699
6     0.1130
7     0.1892
8     0.3528
9     0.3596
10    0.4006
Example of model selection(2)
• Same data set, but estimated using k-nn regression.
• Optimal value k = 7 chosen according to 5-fold cross-validation
model selection. The curve shows the k-nn model, along with
training (* ) and validation (*) data points, for one partitioning.
k     Estimated R via cross-validation
1     0.1109
2     0.0926
3     0.0950
4     0.1035
5     0.1049
6     0.0874
7     0.0831
8     0.0954
9     0.1120
10    0.1227
Test Error
• The previous example shows two models (estimated from the same data by different methods)
• Which model is better (i.e., has lower test error)?
  Note: the polynomial model has lower cross-validation error
• Double resampling (for estimating the test error):
  partition the data into learning / validation / test sets
• Test data should never be used for model estimation
Application Example
• Haberman’s Survival Data Set
- 5-year survival of female patients (following
surgery for breast cancer)
- 306 cases (patients)
- inputs: Age, number of positive axillary nodes
• Method: k-NN classifier (only odd k-values)
Note: input values are pre-scaled to [0,1]
• Model selection via LOO cross-validation
• Optimal k=45 yields min LOO error 22.75%
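A sketch of LOO cross-validation model selection for a k-NN classifier; a synthetic two-input data set stands in for the Haberman data (loading and scaling the real file is not shown):

```python
import numpy as np

def knn_loo_error(X, y, k):
    """Leave-one-out error of a k-NN classifier (majority vote, Euclidean distance)."""
    errors = 0
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                               # exclude the left-out sample itself
        vote = y[np.argsort(d)[:k]].mean() >= 0.5   # fraction of class-1 neighbors
        errors += int(vote != y[i])
    return errors / len(X)

# Synthetic stand-in for the two-input data set, already scaled to [0, 1]
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 2))
y = (X[:, 0] + 0.3 * rng.standard_normal(100) > 0.5).astype(int)

# Model selection: pick the odd k with the smallest LOO error
loo = {k: knn_loo_error(X, y, k) for k in range(1, 100, 2)}
best_k = min(loo, key=loo.get)
print("best k:", best_k, "LOO error:", loo[best_k])
```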
Model selection for k-NN classifier via cross-validation
Optimal decision boundary for k=45
k     Error (%)
1     42
3     30.67
7     26
15    24.33
...   ...
45    21.67
47    22.33
51    23
53    24.33
57    24
61    25
99    26.33
Estimating test error of a method
• For the same example (Haberman's data), what is the true test error of the k-NN method?
• Use double resampling, i.e., 5-fold cross-validation to estimate the test error, and LOO cross-validation to estimate the optimal k for each training fold:

Fold   Optimal k   LOO error   Test error
1      11          22.5%       28.33%
2      37          25%         13.33%
3      37          23.33%      16.67%
4      33          24.17%      25%
5      35          18.75%      35%
mean               22.75%      23.67%

Note: the optimal k-values differ across folds;
the average test error is larger than the average validation error.
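A sketch of the double-resampling procedure described above, continuing the previous sketch (it reuses knn_loo_error and the synthetic X, y defined there, so it is not standalone):

```python
import numpy as np

def knn_test_error(X_tr, y_tr, X_te, y_te, k):
    """Error of a k-NN classifier trained on (X_tr, y_tr), evaluated on (X_te, y_te)."""
    errors = 0
    for x0, y0 in zip(X_te, y_te):
        d = np.linalg.norm(X_tr - x0, axis=1)
        vote = y_tr[np.argsort(d)[:k]].mean() >= 0.5
        errors += int(vote != y0)
    return errors / len(X_te)

# Outer 5-fold split estimates the test error of the method;
# the inner LOO-CV (knn_loo_error from the previous sketch) picks k on each training fold
rng = np.random.default_rng(1)
folds = np.array_split(rng.permutation(len(X)), 5)
fold_test_errors = []
for i in range(5):
    te = folds[i]
    tr = np.concatenate([folds[j] for j in range(5) if j != i])
    best_k = min(range(1, 30, 2), key=lambda k: knn_loo_error(X[tr], y[tr], k))
    fold_test_errors.append(knn_test_error(X[tr], y[tr], X[te], y[te], best_k))
print("estimated test error of k-NN:", np.mean(fold_test_errors))
```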
Summary and Discussion
• Learning as function estimation (from data)
~ standard inductive learning setting
• Common types of learning problems:
classification, regression, clustering
• Non-standard learning settings
• Model estimation via data fitting (ERM)
• Model complexity and generalization:
- how to measure model complexity
- several complexity (tuning) parameters