Kernel methods - overview
Data Mining and Statistical Learning, 2008
• Kernel smoothers
• Local regression
• Kernel density estimation
• Radial basis functions
Introduction
Kernel methods are regression techniques used to estimate a response function $y = f(X)$, $X \in \mathbb{R}^d$, from noisy data.
Properties:
• Different models are fitted at each query point, and only those
observations close to that point are used to fit the model
• The resulting function is smooth
• The models require only a minimum of training
A simple one-dimensional kernel smoother
$$\hat f(x_0) = \frac{\sum_{i=1}^{N} K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)}$$

where

$$K_\lambda(x_0, x) = \begin{cases} 1 - \dfrac{|x - x_0|}{\lambda}, & \text{if } |x - x_0| \le \lambda \\ 0, & \text{otherwise} \end{cases}$$

[Figure: observed data and the fitted kernel smoother]
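A minimal NumPy sketch of this smoother, using the triangular kernel above; the data, the bandwidth and the function names are made up for illustration:

```python
import numpy as np

def triangular_kernel(x0, x, lam):
    """K_lambda(x0, x) = 1 - |x - x0|/lambda inside the window, 0 outside."""
    u = np.abs(x - x0) / lam
    return np.where(u <= 1.0, 1.0 - u, 0.0)

def kernel_smoother(x0, x, y, lam):
    """Kernel-weighted average of the responses y at the query point x0."""
    w = triangular_kernel(x0, x, lam)
    return np.sum(w * y) / np.sum(w)

# Toy example (made-up data): smooth a noisy sine curve
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 25, 100))
y = 5.5 + 0.4 * np.sin(x / 4) + rng.normal(0, 0.1, size=x.size)
fitted = np.array([kernel_smoother(x0, x, y, lam=2.0) for x0 in x])
```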
Kernel methods, splines and ordinary least squares
regression (OLS)
• OLS: A single model is fitted to all data
• Splines: Different models are fitted to different
subintervals (cuboids) of the input domain
• Kernel methods: Different models are fitted at each
query point
Kernel-weighted averages and moving averages
The Nadaraya-Watson kernel-weighted average
$$\hat f(x_0) = \frac{\sum_{i=1}^{N} K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)},
\qquad
K_\lambda(x_0, x) = D\!\left(\frac{x - x_0}{\lambda}\right)$$

where λ indicates the window size and the function D shows how the weights change with distance within this window.
The estimated function is smooth!
K-nearest neighbours
$$\hat f(x) = \mathrm{Ave}\big(y_i \mid x_i \in N_k(x)\big)$$
The estimated function is piecewise constant!
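For contrast, a sketch of the k-nearest-neighbour average, which produces a piecewise-constant estimate (the function name is my own):

```python
import numpy as np

def knn_average(x0, x, y, k):
    """Average the responses of the k training points closest to x0
    (a piecewise-constant estimate of f)."""
    idx = np.argsort(np.abs(x - x0))[:k]
    return y[idx].mean()
```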
Examples of one-dimensional kernel smoothers
• Epanechnikov kernel
$$D(t) = \begin{cases} \tfrac{3}{4}\,\big(1 - t^2\big), & \text{if } |t| \le 1 \\ 0, & \text{otherwise} \end{cases}$$

• Tri-cube kernel
$$D(t) = \begin{cases} \big(1 - |t|^3\big)^3, & \text{if } |t| \le 1 \\ 0, & \text{otherwise} \end{cases}$$
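A small sketch of these two kernels and their use in the Nadaraya-Watson average (function names are assumptions, not from the slides):

```python
import numpy as np

def epanechnikov(t):
    """D(t) = 3/4 (1 - t^2) for |t| <= 1, else 0."""
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)

def tricube(t):
    """D(t) = (1 - |t|^3)^3 for |t| <= 1, else 0."""
    return np.where(np.abs(t) <= 1, (1 - np.abs(t)**3)**3, 0.0)

def nadaraya_watson(x0, x, y, lam, D=epanechnikov):
    """Kernel-weighted average with K_lambda(x0, x) = D((x - x0)/lambda)."""
    w = D((x - x0) / lam)
    return np.sum(w * y) / np.sum(w)
```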
Issues in kernel smoothing
• The smoothing parameter λ has to be defined
• When there are ties at x_i: compute an average y value and introduce weights representing the number of points (see the sketch after this list)
• Boundary issues
• Varying density of observations:
– bias is constant
– the variance is inversely proportional to the density
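A possible way to handle ties, as suggested above: collapse the tied observations into one weighted observation (a sketch; the helper name is hypothetical):

```python
import numpy as np

def collapse_ties(x, y):
    """Replace tied observations at the same x by their average y,
    with a weight equal to the number of points at that x."""
    ux, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    y_mean = np.bincount(inverse, weights=y) / counts
    return ux, y_mean, counts
```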
Boundary effects of one-dimensional
kernel smoothers
Locally weighted averages can be badly biased at the boundaries if the response function has a significant slope there.
Remedy: apply local linear regression.
Local linear regression
Find the intercept and slope parameters $\alpha(x_0)$ and $\beta(x_0)$ by solving

$$\min_{\alpha(x_0),\,\beta(x_0)} \sum_{i=1}^{N} K_\lambda(x_0, x_i)\,\big[y_i - \alpha(x_0) - \beta(x_0)\,x_i\big]^2$$

and estimate $\hat f(x_0) = \alpha(x_0) + \beta(x_0)\,x_0$. The solution is a linear combination of the $y_i$:

$$\hat f(x_0) = \sum_{i=1}^{N} l_i(x_0)\, y_i$$
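A sketch of local linear regression at a single query point, solving the weighted least-squares problem above with an Epanechnikov kernel (the kernel choice and the names are illustrative assumptions):

```python
import numpy as np

def local_linear(x0, x, y, lam):
    """Fit intercept and slope at x0 by kernel-weighted least squares
    and return the fitted value alpha(x0) + beta(x0) * x0."""
    t = (x - x0) / lam
    w = np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)   # Epanechnikov weights
    B = np.column_stack([np.ones_like(x), x])               # design matrix with rows (1, x_i)
    W = np.diag(w)
    # Solve the normal equations (B^T W B) theta = B^T W y for theta = (alpha, beta)
    theta = np.linalg.solve(B.T @ W @ B, B.T @ W @ y)
    return theta[0] + theta[1] * x0
```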
Kernel smoothing vs local linear regression
Kernel smoothing
Solve the minimization problem
$$\min_{\alpha(x_0)} \sum_{i=1}^{N} K_\lambda(x_0, x_i)\,\big[y_i - \alpha(x_0)\big]^2$$
Local linear regression
Solve the minimization problem
$$\min_{\alpha(x_0),\,\beta(x_0)} \sum_{i=1}^{N} K_\lambda(x_0, x_i)\,\big[y_i - \alpha(x_0) - \beta(x_0)\,x_i\big]^2$$
Properties of local linear regression
• Automatically modifies the kernel weights to correct for bias
• The bias depends only on terms of order higher than one in the Taylor expansion of f
Local polynomial regression
• Fitting polynomials instead of straight lines
Behavior of the estimated response function: [figure]
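A sketch of a local polynomial (e.g., quadratic) fit, obtained by adding higher-order columns to the local design matrix (names and kernel choice are illustrative):

```python
import numpy as np

def local_polynomial(x0, x, y, lam, degree=2):
    """Kernel-weighted polynomial fit at x0; returns the fitted value there."""
    t = (x - x0) / lam
    w = np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)        # Epanechnikov weights
    B = np.vander(x, N=degree + 1, increasing=True)              # columns 1, x, x^2, ...
    Wsqrt = np.sqrt(w)
    theta, *_ = np.linalg.lstsq(Wsqrt[:, None] * B, Wsqrt * y, rcond=None)
    x0_row = np.vander(np.array([x0]), N=degree + 1, increasing=True)
    return (x0_row @ theta)[0]
```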
Local polynomial vs local linear regression
Advantages:
• Reduces the ”Trimming of hills and filling of valleys”
Disadvantages:
• Higher variance (tails are more wiggly)
Selecting the width of the kernel
Bias-variance tradeoff: selecting a narrow window leads to high variance and low bias, while selecting a wide window leads to high bias and low variance.
Selecting the width of the kernel
$$\hat{\mathbf{f}} = \mathbf{S}_\lambda\, \mathbf{y}, \qquad \big(\mathbf{S}_\lambda\big)_{ij} = l_j(x_i)$$

1. Automatic selection (cross-validation)
2. Fixing the degrees of freedom
$$\mathrm{df} = \mathrm{trace}\big(\mathbf{S}_\lambda\big)$$
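Because local linear regression is linear in y, the rows of S_λ can be computed explicitly; a sketch that builds S_λ with Epanechnikov weights, from which df = trace(S_λ) follows (all names are my own):

```python
import numpy as np

def smoother_matrix(x, lam):
    """Build S_lambda for local linear regression: row i contains the
    weights l_j(x_i) such that fhat(x_i) = sum_j l_j(x_i) * y_j."""
    N = x.size
    S = np.zeros((N, N))
    B = np.column_stack([np.ones_like(x), x])
    for i, x0 in enumerate(x):
        t = (x - x0) / lam
        w = np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)
        # l(x0)^T = b(x0)^T (B^T W B)^{-1} B^T W, with b(x0) = (1, x0)
        M = np.linalg.solve(B.T @ (w[:, None] * B), B.T * w)
        S[i] = np.array([1.0, x0]) @ M
    return S

# Effective degrees of freedom: df = trace(S_lambda)
# df = np.trace(smoother_matrix(x, lam=2.0))
```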
Local regression in $\mathbb{R}^p$
The one-dimensional approach is easily extended to p
dimensions by
• Using the Euclidean norm as a measure of distance in the kernel
• Modifying the polynomial basis; e.g., for two inputs and degree two,
$$b(X) = \big(1,\; X_1,\; X_2,\; X_1^2,\; X_1 X_2,\; X_2^2\big)$$
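A sketch of local linear regression in $\mathbb{R}^p$, with Euclidean-distance kernel weights and the linear basis $b(X) = (1, X_1, \dots, X_p)$ (names are my own):

```python
import numpy as np

def local_linear_rp(x0, X, y, lam):
    """Local linear regression in R^p at the query point x0 (shape (p,)).
    Kernel weights are based on the Euclidean distance ||x_i - x0||."""
    d = np.linalg.norm(X - x0, axis=1) / lam
    w = np.where(d <= 1, 0.75 * (1 - d**2), 0.0)          # Epanechnikov weights
    B = np.column_stack([np.ones(len(X)), X])              # basis b(X) = (1, X_1, ..., X_p)
    Wsqrt = np.sqrt(w)
    theta, *_ = np.linalg.lstsq(Wsqrt[:, None] * B, Wsqrt * y, rcond=None)
    return np.concatenate(([1.0], x0)) @ theta
```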
Local regression in $\mathbb{R}^p$
”The curse of dimensionality”
• The fraction of points close to the boundary of the input
domain increases with its dimension
• Observed data do not cover the whole input domain
Structured local regression models
Structured kernels (standardize each variable): introduce a matrix A that weights the coordinates in the kernel,
$$K_{\lambda,A}(x_0, x) = D\!\left(\frac{(x - x_0)^{T} A\,(x - x_0)}{\lambda}\right)$$
Note: A is positive semidefinite.
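A sketch of such a structured kernel; the choice A = diag(1/Var(X_j)), which standardizes each variable, is only one illustrative option:

```python
import numpy as np

def structured_kernel(x0, X, A, lam,
                      D=lambda t: np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)):
    """K_{lambda,A}(x0, x_i) = D((x_i - x0)^T A (x_i - x0) / lambda) for each row x_i of X."""
    diff = X - x0
    q = np.einsum('ij,jk,ik->i', diff, A, diff)   # quadratic forms (x_i - x0)^T A (x_i - x0)
    return D(q / lam)

# Example choice of A: standardize each coordinate by its variance
# A = np.diag(1.0 / X.var(axis=0))
```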
Structured local regression models
Structured regression functions
• ANOVA decompositions (e.g., additive models); backfitting algorithms can be used for fitting
• Varying coefficient models: partition the predictors into $(X_1, \dots, X_q)$ and $Z$, and let the coefficients depend on $Z$:
$$f(X) = \alpha(Z) + \beta_1(Z)\,X_1 + \dots + \beta_q(Z)\,X_q$$
Structured local regression models
Varying coefficient
models (example)
Local methods
• Assumption: the model is locally linear; maximize the log-likelihood locally at $x_0$ with kernel weights:
$$l\big(\beta(x_0)\big) = \sum_{i=1}^{N} K_\lambda(x_0, x_i)\, l\big(y_i,\, x_i^T \beta(x_0)\big)$$
• Autoregressive time series: $y_t = \beta_0 + \beta_1 y_{t-1} + \dots + \beta_k y_{t-k} + e_t$, i.e. $y_t = z_t^T \beta + e_t$ with lag vector $z_t = (1, y_{t-1}, \dots, y_{t-k})$; fit by local least squares with kernel $K(z_0, z_t)$ (the sketch below illustrates this)
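A rough sketch of the autoregressive case: build the lag vectors z_t, weight them with a kernel around a query vector z_0, and solve local least squares (the Gaussian kernel and all names are assumptions for illustration):

```python
import numpy as np

def local_ar_fit(z0, y, k, lam):
    """Fit y_t = z_t^T beta + e_t locally around the lag vector z0,
    with z_t = (1, y_{t-1}, ..., y_{t-k}) and Gaussian kernel weights."""
    T = len(y)
    # Lag matrix Z: one row per t = k, ..., T-1, columns (1, y_{t-1}, ..., y_{t-k})
    Z = np.column_stack([np.ones(T - k)] + [y[k - j - 1:T - j - 1] for j in range(k)])
    target = y[k:]
    # Gaussian kernel weights K(z0, z_t) based on Euclidean distance
    w = np.exp(-0.5 * (np.linalg.norm(Z - z0, axis=1) / lam) ** 2)
    Wsqrt = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(Wsqrt[:, None] * Z, Wsqrt * target, rcond=None)
    return beta
```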
Kernel density estimation
• Straightforward estimates of the density are bumpy
• Instead, Parzen's smooth estimate is preferred:
$$\hat f_X(x_0) = \frac{1}{N\lambda} \sum_{i=1}^{N} K_\lambda(x_0, x_i)$$
Normally, Gaussian kernels are used.
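A minimal sketch of the Parzen estimate with a Gaussian kernel (the bandwidth lam is assumed given):

```python
import numpy as np

def parzen_density(x0, x, lam):
    """Parzen kernel density estimate at x0 with a Gaussian kernel:
    fhat(x0) = (1 / (N * lam)) * sum_i phi((x0 - x_i) / lam)."""
    u = (x0 - x) / lam
    phi = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # standard normal density
    return phi.sum() / (len(x) * lam)
```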
Radial basis functions and kernels
Using the idea of basis expansion, we treat kernel functions as basis functions:
$$f(x) = \sum_{j=1}^{M} K_{\lambda_j}(\xi_j, x)\, \beta_j = \sum_{j=1}^{M} D\!\left(\frac{\|x - \xi_j\|}{\lambda_j}\right) \beta_j$$
where $\xi_j$ is a prototype (location) parameter and $\lambda_j$ a scale parameter.
Radial basis functions and kernels
Choosing the parameters:
• Estimate $\{\lambda_j, \xi_j\}$ separately from $\beta_j$ (often by using the distribution of X alone), and then solve a least-squares problem for $\beta_j$
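A sketch of this two-stage approach with Gaussian radial basis functions: the prototypes ξ_j are taken as a random subset of the data (standing in for any unsupervised choice), a common scale λ is fixed, and β is found by least squares (all choices here are illustrative):

```python
import numpy as np

def rbf_design(x, centers, lam):
    """Matrix of Gaussian radial basis functions D(|x_i - xi_j| / lam)."""
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / lam) ** 2)

def fit_rbf(x, y, n_centers=10, lam=1.0, seed=0):
    """Pick prototypes from the data, then solve least squares for beta."""
    rng = np.random.default_rng(seed)
    centers = np.sort(rng.choice(x, size=n_centers, replace=False))
    H = rbf_design(x, centers, lam)
    beta, *_ = np.linalg.lstsq(H, y, rcond=None)
    return centers, beta

def predict_rbf(x_new, centers, beta, lam=1.0):
    """f(x) = sum_j D(|x - xi_j| / lam) * beta_j."""
    return rbf_design(x_new, centers, lam) @ beta
```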