\slidehead{Computational Learning Theory \red \ } \bk
\centerline{[read Chapter 7]}
\centerline{[Suggested exercises: 7.1, 7.2, 7.5, 7.8]}
\bigskip
\bi
\item Computational learning theory
\item Setting 1: learner poses queries to teacher
\item Setting 2: teacher chooses examples
\item Setting 3: randomly generated instances, labeled by teacher
\item Probably approximately correct (PAC) learning
\item Vapnik-Chervonenkis Dimension
\item Mistake bounds
\ei

\slidehead{Computational Learning Theory \red \ } \bk
What general laws constrain inductive learning?
\bigskip
We seek theory to relate:
\bi
\item Probability of successful learning
\item Number of training examples
\item Complexity of hypothesis space
\item Accuracy to which target concept is approximated
\item Manner in which training examples presented
\ei

\slidehead{Prototypical Concept Learning Task \red \ } \bk
\bi
\item {\bf Given:}
  \bi
  \item Instances $X$: Possible days, each described by the attributes {\em Sky, AirTemp, Humidity, Wind, Water, Forecast}
  \item Target function $c$: $EnjoySport: X \rightarrow \{0,1\}$
  \item Hypotheses $H$: Conjunctions of literals.  E.g.
        \[\langle ?, Cold, High, ?, ?, ? \rangle. \]
  \item Training examples $D$: Positive and negative examples of the target function
        \[\langle x_1, c(x_1) \rangle , \ldots \langle x_m, c(x_m) \rangle\]
  \end{itemize}
\item {\bf Determine:}
  \begin{itemize}
  \item A hypothesis $h$ in $H$ such that $h(x)=c(x)$ for all $x$ in $D$?
  \item A hypothesis $h$ in $H$ such that $h(x)=c(x)$ for all $x$ in $X$?
  \end{itemize}
\ei

\slidehead{Sample Complexity \red \ } \bk
How many training examples are sufficient to learn the target concept?
\be
\item If learner proposes instances, as queries to teacher
  \bi
  \item Learner proposes instance $x$, teacher provides $c(x)$
  \ei
\item If teacher (who knows $c$) provides training examples
  \bi
  \item teacher provides sequence of examples of form $\langle x, c(x) \rangle$
  \ei
\item If some random process (e.g., nature) proposes instances
  \bi
  \item instance $x$ generated randomly, teacher provides $c(x)$
  \ei
\ee

\slidehead{Sample Complexity: 1 \red \ } \bk
Learner proposes instance $x$, teacher provides $c(x)$
(assume $c$ is in learner's hypothesis space $H$)
\bigskip
Optimal query strategy: play 20 questions
\bi
\item pick instance $x$ such that half of hypotheses in $VS$ classify $x$ positive, half classify $x$ negative
\item When this is possible, need $\lceil \log_2 |H| \rceil$ queries to learn $c$ (see the sketch after the next slide)
\item when not possible, need even more
\ei

\slidehead{Sample Complexity: 2 \red \ } \bk
Teacher (who knows $c$) provides training examples
(assume $c$ is in learner's hypothesis space $H$)
\bigskip
Optimal teaching strategy: depends on $H$ used by learner
\bigskip
Consider the case $H =$ conjunctions of up to $n$ boolean literals and their negations
\bi
\item[] e.g., $(AirTemp = Warm) \tand (Wind = Strong)$, where $AirTemp, Wind, \ldots$ each have 2 possible values.
\ei
\bi
\item if $n$ possible boolean attributes in $H$, $n + 1$ examples suffice
\item why?
\ei
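To make the $\lceil \log_2 |H| \rceil$ query bound of ``Sample Complexity: 1'' concrete, here is a minimal, illustrative simulation of the halving query strategy.  It is not part of the original slides: the tuple encoding of conjunctions, the toy size $n = 3$, and the particular target are assumptions made for illustration.

\begin{verbatim}
from itertools import product

n = 3                                    # toy number of boolean attributes
instances = list(product([0, 1], repeat=n))
hypotheses = list(product([0, 1, None], repeat=n))   # |H| = 3^n conjunctions

def classify(h, x):
    # conjunction h labels x positive iff every non-None constraint matches
    return all(c is None or c == xi for c, xi in zip(h, x))

target = (1, None, 0)                    # assumed target concept c in H
version_space = hypotheses[:]
queries = 0

while len(version_space) > 1:
    # pick the instance whose positive/negative split of VS is most balanced
    x = min(instances,
            key=lambda cand: abs(sum(classify(h, cand) for h in version_space)
                                 - len(version_space) / 2))
    label = classify(target, x)          # teacher provides c(x)
    version_space = [h for h in version_space if classify(h, x) == label]
    queries += 1

print("queries used:", queries)          # compare with ceil(log2(3**3)) = 5
\end{verbatim}

When an evenly splitting instance exists at every step, the loop uses about $\lceil \log_2 |H| \rceil$ queries; when it does not, more queries are needed, as noted on the slide.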
\slidehead{Sample Complexity: 3 \red \ } \bk
Given:
\bi
\item set of instances $X$
\item set of hypotheses $H$
\item set of possible target concepts $C$
\item training instances generated by a fixed, unknown probability distribution ${\cal D}$ over $X$
\ei
Learner observes a sequence $D$ of training examples of form $\langle x, c(x) \rangle$, for some target concept $c \in C$
\bi
\item instances $x$ are drawn from distribution ${\cal D}$
\item teacher provides target value $c(x)$ for each
\ei
Learner must output a hypothesis $h$ estimating $c$
\bi
\item $h$ is evaluated by its performance on subsequent instances drawn according to ${\cal D}$
\ei
Note: randomly drawn instances, noise-free classifications

\slidehead{True Error of a Hypothesis \red \ } \bk
\centerline{\hbox{\psfig{file=./bookps/pac-err.ps,width=5.0in}}}
\bigskip
\begin{quote}
{\bf Definition:} The {\bf true error} (denoted $error_{\cal D}(h)$) of hypothesis $h$ with respect to target concept $c$ and distribution ${\cal D}$ is the probability that $h$ will misclassify an instance drawn at random according to ${\cal D}$.
\[error_{\cal D}(h) \equiv \Pr_{x \in {\cal D}}[c(x) \neq h(x)] \]
\end{quote}

\slidehead{Two Notions of Error \red \ } \bk
{\em Training error} of hypothesis $h$ with respect to target concept $c$
\bi
\item How often $h(x) \neq c(x)$ over training instances
\ei
\bigskip
{\em True error} of hypothesis $h$ with respect to $c$
\bi
\item How often $h(x) \neq c(x)$ over future random instances
\ei
\bigskip \bigskip
Our concern:
\bi
\item Can we bound the true error of $h$ given the training error of $h$?
\item First consider when training error of $h$ is zero (i.e., $h \in VS_{H,D}$)
\ei

\slidehead{Exhausting the Version Space \red \ } \bk
\centerline{\hbox{\psfig{file=./bookps/pac-vs-exhausted.ps,width=5.0in}}}
\centerline{($r =$ training error, $error =$ true error)}
\bigskip
\begin{quote}
{\bf Definition:} The version space $VS_{H,D}$ is said to be $\epsilon$-{\bf exhausted} with respect to $c$ and ${\cal D}$, if every hypothesis $h$ in $VS_{H,D}$ has error less than $\epsilon$ with respect to $c$ and ${\cal D}$.
\[(\forall h \in VS_{H,D})\ error_{\cal D}(h) < \epsilon \]
\end{quote}

\slidehead{\normalsize How many examples will $\epsilon$-exhaust the VS? \red \ } \bk
{\bf Theorem:} [Haussler, 1988].
\begin{quote}
If the hypothesis space $H$ is finite, and $D$ is a sequence of $m \geq 1$ independent random examples of some target concept $c$, then for any $0 \leq \epsilon \leq 1$, the probability that the version space with respect to $H$ and $D$ is not $\epsilon$-exhausted (with respect to $c$) is less than
\[|H| e^{-\epsilon m} \]
\end{quote}
\bigskip
Interesting!
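A quick sketch of the standard argument behind this bound (not spelled out on the slide): a single hypothesis with true error at least $\epsilon$ is consistent with one randomly drawn example with probability at most $(1 - \epsilon)$, hence with all $m$ independent examples with probability at most $(1 - \epsilon)^m$.  A union bound over the (at most $|H|$) such hypotheses, together with $(1 - \epsilon) \leq e^{-\epsilon}$, gives
\[\Pr[\, VS_{H,D} \mbox{ not } \epsilon\mbox{-exhausted} \,] \;\leq\; |H| (1 - \epsilon)^m \;\leq\; |H| e^{-\epsilon m} \]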
This bounds the probability that any consistent learner will output a hypothesis $h$ with $error(h) \geq \epsilon$
\bigskip
If we want this probability to be below $\delta$
\[|H|e^{- \epsilon m} \leq \delta \]
then
\[m \geq \frac{1}{\epsilon}(\ln|H| + \ln(1/\delta)) \]

\slidehead{Learning Conjunctions of Boolean Literals \red \ } \bk
How many examples are sufficient to assure with probability at least $(1 - \delta)$ that
\bi
\item[] every $h$ in $VS_{H,D}$ satisfies $error_{\cal D}(h) \leq \epsilon$
\ei
\bigskip
Use our theorem:
\[m \geq \frac{1}{\epsilon}(\ln|H| + \ln(1/\delta)) \]
Suppose $H$ contains conjunctions of constraints on up to $n$ boolean attributes (i.e., $n$ boolean literals).  Then $|H| = 3^n$, and
\[m \geq \frac{1}{\epsilon}(\ln 3^n + \ln(1/\delta)) \]
or
\[m \geq \frac{1}{\epsilon}(n \ln 3 + \ln(1/\delta)) \]

\slidehead{How About $EnjoySport$? \red \ } \bk
\[m \geq \frac{1}{\epsilon}(\ln|H| + \ln(1/\delta)) \]
If $H$ is as given in $EnjoySport$ then $|H| = 973$, and
\[m \geq \frac{1}{\epsilon}(\ln 973 + \ln(1/\delta)) \]
\bigskip
... if we want to assure that with probability 95\%, $VS$ contains only hypotheses with $error_{\cal D}(h) \leq .1$, then it is sufficient to have $m$ examples, where
\[m \geq \frac{1}{.1}(\ln 973 + \ln(1/.05)) \]
\[m \geq 10 (\ln 973 + \ln 20) \]
\[m \geq 10 (6.88 + 3.00) \]
\[m \geq 98.8 \]

\slidehead{PAC Learning \red \ } \bk
Consider a class $C$ of possible target concepts defined over a set of instances $X$ of length $n$, and a learner $L$ using hypothesis space $H$.
\begin{quote}
{\em Definition:} $C$ is {\bf PAC-learnable} by $L$ using $H$ if for all $c \in C$, distributions ${\cal D}$ over $X$, $\epsilon$ such that $0 < \epsilon < 1/2$, and $\delta$ such that $0 < \delta < 1/2$, learner $L$ will with probability at least $(1-\delta)$ output a hypothesis $h \in H$ such that $error_{\cal D}(h) \leq \epsilon$, in time that is polynomial in $1/\epsilon$, $1/\delta$, $n$ and $size(c)$.
\end{quote}

\slidehead{Agnostic Learning \red \ } \bk
So far, assumed $c \in H$
\vspace*{.1in}
Agnostic learning setting: don't assume $c \in H$
\bi
\item What do we want then?
  \bi
  \item The hypothesis $h$ that makes fewest errors on training data
  \ei
\item What is sample complexity in this case?
\ei
\[ m \geq \frac{1}{2 \epsilon^{2}}(\ln|H| + \ln(1/\delta)) \]
derived from Hoeffding bounds:
\[ \Pr[error_{\cal D}(h) > error_D(h) + \epsilon] \leq e^{-2m\epsilon^{2}} \]

\slidehead{Shattering a Set of Instances \red \ } \bk
\begin{quote}
{\em Definition:} a {\bf dichotomy} of a set $S$ is a partition of $S$ into two disjoint subsets.
\end{quote}
\bigskip
\begin{quote}
{\em Definition:} a set of instances $S$ is {\bf shattered} by hypothesis space $H$ if and only if for every dichotomy of $S$ there exists some hypothesis in $H$ consistent with this dichotomy.
\end{quote}
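As a concrete illustration of the definition (assumed here, not taken from the slides), the sketch below checks shattering for the simple hypothesis space of closed intervals $[a, b]$ on the real line; the helper names are invented for this example.

\begin{verbatim}
from itertools import combinations

def interval_realizes(S, positives):
    # Is there an interval whose positive set within S is exactly `positives`?
    if not positives:
        return True            # an interval lying entirely outside S works
    a, b = min(positives), max(positives)
    return {x for x in S if a <= x <= b} == positives

def shattered(S):
    # S is shattered iff every dichotomy (subset marked positive) is realizable
    return all(interval_realizes(S, set(pos))
               for r in range(len(S) + 1)
               for pos in combinations(S, r))

print(shattered({1.0, 4.0}))        # True:  intervals shatter any 2 points
print(shattered({1.0, 4.0, 9.0}))   # False: {1.0, 9.0} positive, 4.0 negative
\end{verbatim}

Any two distinct points are shattered, but for three points the dichotomy that labels the outer two positive and the middle one negative cannot be realized by an interval, so this class has VC dimension 2, anticipating the definition that follows.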
\slidehead{Three Instances Shattered \red \ } \bk
\centerline{\hbox{\psfig{file=./bookps/pac-shatter.ps,width=6.5in}}}

\slidehead{The Vapnik-Chervonenkis Dimension \red \ } \bk
\begin{quote}
{\em Definition:} The {\bf Vapnik-Chervonenkis dimension}, $VC(H)$, of hypothesis space $H$ defined over instance space $X$ is the size of the largest finite subset of $X$ shattered by $H$.  If arbitrarily large finite sets of $X$ can be shattered by $H$, then $VC(H) \equiv \infty$.
\end{quote}

\slidehead{VC Dim. of Linear Decision Surfaces \red \ } \bk
\centerline{\hbox{\psfig{file=./bookps/pac-pts.ps,width=6.5in}}}

\slidehead{Sample Complexity from VC Dimension \red \ } \bk
How many randomly drawn examples suffice to $\epsilon$-exhaust $VS_{H,D}$ with probability at least $(1 - \delta)$?
\[ m \geq \frac{1}{\epsilon}(4\log_2(2/\delta) + 8 VC(H) \log_2 (13/\epsilon)) \]

\slidehead{Mistake Bounds \red \ } \bk
So far: how many examples needed to learn?
\vspace*{.1in}
What about: how many mistakes before convergence?
\vspace*{.2in}
Let's consider a similar setting to PAC learning:
\bi
\item Instances drawn at random from $X$ according to distribution ${\cal D}$
\item Learner must classify each instance before receiving correct classification from teacher
\item Can we bound the number of mistakes learner makes before converging?
\ei

\slidehead{Mistake Bounds: Find-S \red \ } \bk
Consider Find-S when $H =$ conjunction of boolean literals
\begin{quote}
\pgm{Find-S}:
\bi
\item Initialize $h$ to the most specific hypothesis $l_{1} \tand \neg l_{1} \tand l_{2} \tand \neg l_{2} \ldots l_{n} \tand \neg l_{n}$
\item For each positive training instance $x$
  \bi
  \item Remove from $h$ any literal that is not satisfied by $x$
  \ei
\item Output hypothesis $h$.
\ei
\end{quote}
\vspace*{.2in}
How many mistakes before converging to correct $h$?

\slidehead{Mistake Bounds: Halving Algorithm \red \ } \bk
Consider the Halving Algorithm:
\bi
\item Learn concept using version space \pgm{Candidate-Elimination} algorithm
\item Classify new instances by majority vote of version space members
\ei
\vspace*{.2in}
How many mistakes before converging to correct $h$?
\bi
\item ... in worst case?
\item ... in best case?
\ei

\slidehead{Optimal Mistake Bounds \red \ } \bk
Let $M_{A}(C)$ be the max number of mistakes made by algorithm $A$ to learn concepts in $C$.  (maximum over all possible $c \in C$, and all possible training sequences)
\[M_{A}(C) \equiv \max_{c \in C} M_{A}(c)\]
\vspace*{.3in}
{\em Definition:} Let $C$ be an arbitrary non-empty concept class.  The {\bf optimal mistake bound} for $C$, denoted $Opt(C)$, is the minimum over all possible learning algorithms $A$ of $M_{A}(C)$.
\[Opt(C) \equiv \min_{A \in learning\ algorithms} M_{A}(C) \]
\vspace*{.3in}
\[ VC(C) \leq Opt(C) \leq M_{Halving}(C) \leq \log_{2}(|C|). \]
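To illustrate the $M_{Halving}(C) \leq \log_{2}(|C|)$ term of this chain, here is a minimal simulation of the Halving Algorithm over the $3^n$ conjunction hypotheses used earlier.  It is a sketch, not from the chapter: the target, the stream length, and $n = 4$ are arbitrary toy choices.

\begin{verbatim}
import random
from itertools import product
from math import ceil, log2

n = 4
hypotheses = list(product([0, 1, None], repeat=n))   # |C| = 3^n conjunctions

def classify(h, x):
    # conjunction h labels x positive iff every non-None constraint matches
    return all(c is None or c == xi for c, xi in zip(h, x))

target = (1, None, 0, None)        # assumed target concept c in C
version_space = hypotheses[:]
mistakes = 0

random.seed(0)
for _ in range(200):               # arbitrary random training stream
    x = tuple(random.randint(0, 1) for _ in range(n))
    votes = sum(classify(h, x) for h in version_space)
    prediction = votes > len(version_space) / 2       # majority vote of VS
    label = classify(target, x)                       # teacher reveals c(x)
    if prediction != label:
        mistakes += 1              # each mistake removes at least half of VS
    version_space = [h for h in version_space if classify(h, x) == label]

print("mistakes:", mistakes, " bound log2|C| =", ceil(log2(len(hypotheses))))
\end{verbatim}

Whenever the majority vote is wrong, at least half of the version space is eliminated, which is why the number of mistakes can never exceed $\log_{2}(|C|)$.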