Kernel methods – overview
• Kernel smoothers
• Local regression
• Kernel density estimation
• Radial basis functions

Introduction

Kernel methods are regression techniques used to estimate a response function y = f(X), X ∈ R^d, from noisy data.

Properties:
• A different model is fitted at each query point, and only the observations close to that point are used to fit it
• The resulting function is smooth
• The models require only a minimum of training

A simple one-dimensional kernel smoother

$$\hat f(x_0) = \frac{\sum_{i=1}^N K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^N K_\lambda(x_0, x_i)}$$

where

$$K_\lambda(x_0, x) = \begin{cases} 1 & \text{if } |x - x_0| \le \lambda \\ 0 & \text{otherwise} \end{cases}$$

[Figure: observed data points and the fitted (smoothed) curve for a one-dimensional example.]

Kernel methods, splines and ordinary least squares regression (OLS)

• OLS: a single model is fitted to all data
• Splines: different models are fitted to different subintervals (cuboids) of the input domain
• Kernel methods: a different model is fitted at each query point

Kernel-weighted averages and moving averages

The Nadaraya–Watson kernel-weighted average

$$\hat f(x_0) = \frac{\sum_{i=1}^N K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^N K_\lambda(x_0, x_i)}, \qquad K_\lambda(x_0, x) = D\!\left(\frac{|x - x_0|}{\lambda}\right)$$

where λ indicates the window size and the function D determines how the weights change with distance within this window. The estimated function is smooth.

K-nearest neighbours

$$\hat f(x) = \mathrm{Ave}(y_i \mid x_i \in N_k(x))$$

The estimated function is piecewise constant.

Examples of one-dimensional kernel smoothers

• Epanechnikov kernel
$$D(t) = \begin{cases} \tfrac{3}{4}(1 - t^2) & \text{if } |t| \le 1 \\ 0 & \text{otherwise} \end{cases}$$
• Tri-cube kernel
$$D(t) = \begin{cases} (1 - |t|^3)^3 & \text{if } |t| \le 1 \\ 0 & \text{otherwise} \end{cases}$$

Issues in kernel smoothing

• The smoothing parameter λ has to be chosen
• When there are ties at x_i: compute an average y value and introduce weights representing the number of points
• Boundary issues
• Varying density of observations:
  – the bias is constant
  – the variance is inversely proportional to the density

Boundary effects of one-dimensional kernel smoothers

Locally weighted averages can be badly biased at the boundaries if the response function has a significant slope there; the remedy is to apply local linear regression.

Local linear regression

Find the intercept and slope parameters at each query point by solving

$$\min_{\alpha(x_0),\,\beta(x_0)} \sum_{i=1}^N K_\lambda(x_0, x_i)\,[y_i - \alpha(x_0) - \beta(x_0) x_i]^2$$

The solution is a linear combination of the y_i:

$$\hat f(x_0) = \sum_{i=1}^N l_i(x_0)\, y_i$$

Kernel smoothing vs local linear regression

Kernel smoothing solves the minimization problem

$$\min_{\alpha(x_0)} \sum_{i=1}^N K_\lambda(x_0, x_i)\,[y_i - \alpha(x_0)]^2$$

Local linear regression solves the minimization problem

$$\min_{\alpha(x_0),\,\beta(x_0)} \sum_{i=1}^N K_\lambda(x_0, x_i)\,[y_i - \alpha(x_0) - \beta(x_0) x_i]^2$$

Properties of local linear regression

• Automatically modifies the kernel weights to correct for bias
• The remaining bias depends only on terms of order higher than one in the expansion of f
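As an illustration of the two estimators above, here is a minimal NumPy sketch (not part of the original lecture) of the Nadaraya–Watson average and the local linear fit at a single query point, assuming the Epanechnikov kernel and a fixed window size λ; all function and variable names are illustrative.

import numpy as np

def epanechnikov(t):
    """Epanechnikov kernel D(t) = 3/4 (1 - t^2) for |t| <= 1, else 0."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t**2), 0.0)

def nadaraya_watson(x0, x, y, lam):
    """Kernel-weighted average at the query point x0 (assumes at least
    one observation falls inside the window of width lam)."""
    w = epanechnikov(np.abs(x - x0) / lam)
    return np.sum(w * y) / np.sum(w)

def local_linear(x0, x, y, lam):
    """Weighted least-squares fit of an intercept and slope around x0,
    evaluated at x0; this corrects the boundary bias of the plain average."""
    w = epanechnikov(np.abs(x - x0) / lam)
    B = np.column_stack([np.ones_like(x), x])   # local basis (1, x)
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(sw[:, None] * B, sw * y, rcond=None)
    return coef[0] + coef[1] * x0

# toy data: noisy sine curve (illustrative only)
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 100))
y = np.sin(x) + rng.normal(scale=0.2, size=x.size)

grid = np.linspace(0, 10, 200)
nw_fit = np.array([nadaraya_watson(g, x, y, lam=1.0) for g in grid])
llr_fit = np.array([local_linear(g, x, y, lam=1.0) for g in grid])

Comparing nw_fit and llr_fit near the ends of the grid shows the boundary bias of the plain kernel average and how the locally fitted slope reduces it.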
Local polynomial regression

• Fit local polynomials instead of straight lines

[Figure: behaviour of the estimated response function for local polynomial fits.]

Polynomial vs local linear regression

Advantages:
• Reduces the "trimming of hills and filling of valleys"

Disadvantages:
• Higher variance (the tails are more wiggly)

Selecting the width of the kernel

Bias–variance tradeoff: a narrow window leads to high variance and low bias, whilst a wide window leads to high bias and low variance.

The fit is linear in the observed responses,

$$\hat{\mathbf f} = \mathbf S_\lambda \mathbf y, \qquad (\mathbf S_\lambda)_{ij} = l_j(x_i)$$

1. Automatic selection of λ (e.g. by cross-validation)
2. Fixing the effective degrees of freedom, $\mathrm{df} = \mathrm{trace}(\mathbf S_\lambda)$

Local regression in R^p

The one-dimensional approach is easily extended to p dimensions by
• using the Euclidean norm as the measure of distance in the kernel, and
• modifying the polynomial basis, e.g. $b(X) = (1, X_1, X_2, X_1^2, X_1 X_2, X_2^2, \ldots)$

"The curse of dimensionality"
• The fraction of points close to the boundary of the input domain increases with its dimension
• Observed data do not cover the whole input domain

Structured local regression models

Structured kernels: standardize each variable, or more generally downweight or omit coordinates by using a quadratic form $(x - x_0)^T \mathbf A\,(x - x_0)$ in the kernel. Note: A is positive semidefinite.

Structured regression functions:
• ANOVA decompositions (e.g. additive models); backfitting algorithms can be used
• Varying coefficient models: partition X and let the coefficients of some variables vary with the remaining ones,
$$f(X) = \alpha(Z) + \beta_1(Z)\, X_1 + \cdots + \beta_q(Z)\, X_q$$

[Figure: example of a varying coefficient model.]

Local methods

• Assumption: the model is locally linear, so maximize the log-likelihood locally at x0:
$$l(\beta(x_0)) = \sum_{i=1}^N K_\lambda(x_0, x_i)\, l\big(y_i, x_i^T \beta(x_0)\big)$$
• Autoregressive time series: $y_t = \beta_0 + \beta_1 y_{t-1} + \cdots + \beta_k y_{t-k} + \varepsilon_t$, i.e. $y_t = z_t^T \beta + \varepsilon_t$ with $z_t = (1, y_{t-1}, \ldots, y_{t-k})$; fit by local least squares with kernel $K(z_0, z_t)$

Kernel density estimation

• Straightforward estimates of the density (local counts) are bumpy
• Instead, Parzen's smooth estimate is preferred:
$$\hat f_X(x_0) = \frac{1}{N\lambda} \sum_{i=1}^N K_\lambda(x_0, x_i)$$
Normally, Gaussian kernels are used.

Radial basis functions and kernels

Using the idea of basis expansions, we treat kernel functions as basis functions:
$$f(x) = \sum_{j=1}^M K_{\lambda_j}(\xi_j, x)\, \beta_j = \sum_{j=1}^M D\!\left(\frac{\|x - \xi_j\|}{\lambda_j}\right) \beta_j$$
where ξ_j is a prototype (location) parameter and λ_j a scale parameter.

Choosing the parameters:
• Estimate {λ_j, ξ_j} separately from the β_j (often by using the distribution of X alone), and then solve a least-squares problem for the β_j.
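As an illustration of Parzen's estimate, here is a minimal NumPy sketch (not part of the original lecture) of a one-dimensional kernel density estimator with a Gaussian kernel; the bandwidth value, the toy data, and all names are illustrative assumptions.

import numpy as np

def parzen_kde(x0, x, lam):
    """Parzen density estimate at query points x0 from samples x.

    Uses a Gaussian kernel with bandwidth lam:
        f_hat(x0) = (1/N) * sum_i phi_lam(x0 - x_i),
    where phi_lam is the N(0, lam^2) density (the 1/lam factor from the
    formula in the text is absorbed into the density)."""
    x0 = np.atleast_1d(np.asarray(x0, dtype=float))
    diffs = (x0[:, None] - x[None, :]) / lam            # (queries, samples)
    kernel_vals = np.exp(-0.5 * diffs**2) / (lam * np.sqrt(2 * np.pi))
    return kernel_vals.mean(axis=1)

# toy example: samples from a mixture of two normals (illustrative data)
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(1, 1.0, 700)])

grid = np.linspace(-4, 4, 200)
density = parzen_kde(grid, x, lam=0.3)   # smooth, non-negative estimate

A small λ reproduces the bumpiness of the raw data, while a large λ oversmooths the two modes, mirroring the bias–variance tradeoff discussed for kernel smoothers above.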