
\slidehead{Instance Based Learning \red \ }
\bk
\centerline{[Read Ch. 8]}
\bi
\item $k$-Nearest Neighbor
\item Locally weighted regression
\item Radial basis functions
\item Case-based reasoning
\item Lazy and eager learning
\ei

\slidehead{Instance-Based Learning \red \ }
\bk
Key idea: just store all training examples $\langle x_i, f(x_i) \rangle$

\bigskip
Nearest neighbor:
\bi
\item Given query instance $x_q$, first locate nearest training example $x_n$, then estimate $\hat{f}(x_q) \la f(x_n)$
\ei
$k$-Nearest neighbor:
\bi
\item Given $x_q$, take vote among its $k$ nearest nbrs (if discrete-valued target function)
\item take mean of $f$ values of $k$ nearest nbrs (if real-valued)
\[ \hat{f}(x_{q}) \la \frac{\sum_{i=1}^{k}f(x_{i})}{k} \]
\ei

\slidehead{When To Consider Nearest Neighbor \red \ }
\bk
\bi
\item Instances map to points in $\Re^n$
\item Fewer than 20 attributes per instance
\item Lots of training data
\ei
Advantages:
\bi
\item Training is very fast
\item Learn complex target functions
\item Don't lose information
\ei
Disadvantages:
\bi
\item Slow at query time
\item Easily fooled by irrelevant attributes
\ei

\slidehead{Voronoi Diagram \red \ }
\bk
\centerline{\hbox{\psfig{file=./bookps/knn-f1.ps,width=6.5in}}}

\slidehead{Behavior in the Limit \red \ }
\bk
Let $p(x)$ define the probability that instance $x$ will be labeled 1 (positive) versus 0 (negative).

\bigskip
Nearest neighbor:
\bi
\item As number of training examples $\ra \infty$, approaches Gibbs Algorithm
\item[] Gibbs: with probability $p(x)$ predict 1, else 0
\ei

\bigskip
$k$-Nearest neighbor:
\bi
\item As number of training examples $\ra \infty$ and $k$ gets large, approaches Bayes optimal
\item[] Bayes optimal: if $p(x) > .5$ then predict 1, else 0
\ei

\bigskip
Note Gibbs has at most twice the expected error of Bayes optimal

\slidehead{Distance-Weighted $k$NN \red \ }
\bk
Might want to weight nearer neighbors more heavily...
\[ \hat{f}(x_{q}) \la \frac{\sum_{i=1}^{k} w_{i} f(x_{i})}{\sum_{i=1}^{k} w_{i}} \]
where
\[ w_{i} \equiv \frac{1}{d(x_{q}, x_{i})^{2}} \]
and $d(x_{q}, x_{i})$ is the distance between $x_{q}$ and $x_{i}$

\bigskip
\bigskip
Note it now makes sense to use {\em all} training examples instead of just $k$
\bi
\item[$\ra$] Shepard's method
\ei
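\slidehead{Distance-Weighted $k$NN: Code Sketch \red \ }
\bk
A minimal Python sketch of the two estimates above, assuming Euclidean distance for $d$ and a real-valued target; the names (\pgm{knn\_predict}, \pgm{train\_x}, etc.) are illustrative, not from the text.
\begin{verbatim}
import numpy as np

def knn_predict(train_x, train_f, query, k=3):
    """Plain k-NN: mean of f over the k nearest training examples."""
    d = np.linalg.norm(train_x - query, axis=1)   # Euclidean distances
    nearest = np.argsort(d)[:k]                   # indices of k nearest nbrs
    return train_f[nearest].mean()

def weighted_knn_predict(train_x, train_f, query, k=3):
    """Distance-weighted k-NN with w_i = 1 / d(x_q, x_i)^2."""
    d = np.linalg.norm(train_x - query, axis=1)
    nearest = np.argsort(d)[:k]
    if d[nearest[0]] == 0.0:                      # exact match: return f(x_i)
        return train_f[nearest[0]]
    w = 1.0 / d[nearest] ** 2
    return np.dot(w, train_f[nearest]) / w.sum()

# Toy usage on made-up data: f(x) = x1 + x2 on random points in [0,1]^2.
rng = np.random.default_rng(0)
X = rng.random((100, 2))
f = X.sum(axis=1)
print(knn_predict(X, f, np.array([0.5, 0.5]), k=5))
print(weighted_knn_predict(X, f, np.array([0.5, 0.5]), k=5))
\end{verbatim}
Using {\em all} training examples instead of the $k$ nearest (Shepard's method) would amount to dropping the \pgm{argsort} truncation and weighting every example.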
\slidehead{Curse of Dimensionality \red \ }
\bk
Imagine instances described by 20 attributes, but only 2 are relevant to the target function

\bigskip
{\em Curse of dimensionality}: nearest nbr is easily misled when $X$ is high-dimensional

\bigskip
One approach:
\bi
\item Stretch the $j$th axis by weight $z_j$, where $z_1, \ldots, z_n$ are chosen to minimize prediction error
\item Use cross-validation to automatically choose the weights $z_1, \ldots, z_n$
\item Note setting $z_j$ to zero eliminates this dimension altogether
\ei

\bigskip
\centerline{see [Moore and Lee, 1994]}

\slidehead{Locally Weighted Regression \red \ }
\bk
Note $k$NN forms a local approximation to $f$ for each query point $x_q$

\bigskip
Why not form an explicit approximation $\hat{f}(x)$ for the region surrounding $x_q$?
\bi
\item Fit linear function to $k$ nearest neighbors
\item Fit quadratic, ...
\item Produces ``piecewise approximation'' to $f$
\ei
Several choices of error to minimize:
\bi
\item Squared error over $k$ nearest neighbors
\[ E_{1}(x_q) \equiv \frac{1}{2} \sum_{x \in\ k\ nearest\ nbrs\ of\ x_q} (f(x) - \hat{f}(x))^2 \]
\item Distance-weighted squared error over all nbrs
\[ E_{2}(x_q) \equiv \frac{1}{2} \sum_{x \in D} (f(x) - \hat{f}(x))^2\ K(d(x_{q}, x)) \]
\item $\ldots$
\ei

\slidehead{Radial Basis Function Networks \red \ }
\bk
\bi
\item Global approximation to target function, in terms of linear combination of local approximations
\item Used, e.g., for image classification
\item A different kind of neural network
\item Closely related to distance-weighted regression, but ``eager'' instead of ``lazy''
\ei

\slidehead{Radial Basis Function Networks \red \ }
\bk
\centerline{\hbox{\psfig{file=./bookps/rbf2.ps,height=3in}}}
where $a_i(x)$ are the attributes describing instance $x$, and
\[ f(x) = w_0 + \sum_{u=1}^{k} w_u K_u(d(x_u,x)) \]
One common choice for $K_u(d(x_u,x))$ is
\[ K_u(d(x_u,x)) = e^{- \frac{1}{2 \sigma_u^2} d^2(x_u,x)} \]

\slidehead{Training Radial Basis Function Networks \red \ }
\bk
Q1: What $x_u$ to use for each kernel function $K_u(d(x_u,x))$?
\bi
\item Scatter uniformly throughout instance space
\item Or use training instances (reflects instance distribution)
\ei
Q2: How to train the weights (assume here Gaussian $K_u$)?
\bi
\item First choose the variance (and perhaps mean) for each $K_u$
\bi
\item e.g., use EM
\ei
\item Then hold the $K_u$ fixed, and train the linear output layer
\bi
\item efficient methods exist to fit linear functions
\ei
\ei

\slidehead{Case-Based Reasoning \red \ }
\bk
Can apply instance-based learning even when $X \neq \Re^n$
\bi
\item[$\ra$] need a different ``distance'' metric
\ei

\bigskip
Case-Based Reasoning is instance-based learning applied to instances with symbolic logic descriptions

\bigskip
\begin{verbatim}
((user-complaint error53-on-shutdown)
 (cpu-model PowerPC)
 (operating-system Windows)
 (network-connection PCIA)
 (memory 48meg)
 (installed-applications Excel Netscape VirusScan)
 (disk 1gig)
 (likely-cause ???))
\end{verbatim}

\slidehead{Case-Based Reasoning in CADET \red \ }
\bk
CADET: 75 stored examples of mechanical devices
\bi
\item each training example: $\langle$qualitative function, mechanical structure$\rangle$
\item new query: desired function
\item target value: mechanical structure for this function
\ei
Distance metric: match qualitative function descriptions

\slidehead{Case-Based Reasoning in CADET \red \ }
\bk
\centerline{\hbox{\psfig{file=./bookps/cbr.ps,width=6.5in}}}

\slidehead{Case-Based Reasoning in CADET \red \ }
\bk
\bi
\item Instances represented by rich structural descriptions
\item Multiple cases retrieved (and combined) to form solution to new problem
\item Tight coupling between case retrieval and problem solving
\ei
Bottom line:
\bi
\item Simple matching of cases is useful for tasks such as answering help-desk queries (see the sketch on the next slide)
\item Area of ongoing research
\ei
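\slidehead{Case Matching: Code Sketch \red \ }
\bk
A minimal Python sketch of ``simple matching of cases'': rank stored cases by how many attribute values they share with the query, in the spirit of the help-desk example above.  The case library, attribute values, and similarity function are illustrative assumptions, not from the text.
\begin{verbatim}
def similarity(case, query):
    """Count attributes on which the stored case agrees with the query."""
    return sum(1 for attr, value in query.items()
               if case.get(attr) == value)

def retrieve(case_library, query, k=1):
    """Return the k stored cases most similar to the query."""
    ranked = sorted(case_library, key=lambda c: similarity(c, query),
                    reverse=True)
    return ranked[:k]

# Two stored cases and a new problem description (hypothetical data).
case_library = [
    {"user-complaint": "error53-on-shutdown", "cpu-model": "PowerPC",
     "operating-system": "Windows", "memory": "48meg",
     "likely-cause": "bad-system-extension"},
    {"user-complaint": "slow-startup", "cpu-model": "PowerPC",
     "operating-system": "MacOS", "memory": "32meg",
     "likely-cause": "fragmented-disk"},
]
query = {"user-complaint": "error53-on-shutdown",
         "cpu-model": "PowerPC",
         "operating-system": "Windows"}

best = retrieve(case_library, query, k=1)[0]
print(best["likely-cause"])   # reuse the retrieved case's diagnosis
\end{verbatim}
Richer systems such as CADET replace this attribute overlap with structural matching over qualitative function descriptions, and may combine several retrieved cases.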
\slidehead{Lazy and Eager Learning \red \ }
\bk
Lazy: wait for query before generalizing
\bi
\item \pgm{$k$-Nearest Neighbor}, Case-based reasoning
\ei

\bigskip
Eager: generalize before seeing query
\bi
\item Radial basis function networks, ID3, Backpropagation, NaiveBayes, $\ldots$
\ei
Does it matter?
\bi
\item Eager learner must create a global approximation
\item Lazy learner can create many local approximations
\item If they use the same $H$, lazy can represent more complex fns (e.g., consider $H$ = linear functions), as the sketch on the next slide illustrates
\ei
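\slidehead{Lazy vs. Eager with $H$ = Linear Functions: Code Sketch \red \ }
\bk
A minimal, illustrative Python sketch (names and the toy target are assumptions, not from the text): the eager learner commits to one global line, while the lazy learner fits a fresh line to the $k$ nearest neighbors of each query, so its predictions trace out a nonlinear function of $x$ even though every local hypothesis is linear.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 2 * np.pi, 200))   # 1-D inputs
f = np.sin(X)                                   # toy nonlinear target

def eager_linear(X, f):
    """Fit one global line w0 + w1*x by least squares, before any query."""
    A = np.column_stack([np.ones_like(X), X])
    w, *_ = np.linalg.lstsq(A, f, rcond=None)
    return lambda xq: w[0] + w[1] * xq

def lazy_local_linear(X, f, xq, k=20):
    """At query time, fit a line to the k nearest neighbors of xq."""
    nearest = np.argsort(np.abs(X - xq))[:k]
    A = np.column_stack([np.ones(k), X[nearest]])
    w, *_ = np.linalg.lstsq(A, f[nearest], rcond=None)
    return w[0] + w[1] * xq

g = eager_linear(X, f)
for xq in (0.5, np.pi / 2, np.pi, 4.5):
    print(xq, np.sin(xq), g(xq), lazy_local_linear(X, f, xq))
# The eager predictions lie on one straight line; the lazy predictions
# track sin(x) closely, although each local hypothesis is linear.
\end{verbatim}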