CS 189/289A: Introduction to Machine Learning

CS 189/289A Introduction to Machine Learning
Jonathan Shewchuk (Please send email only if you don't want the TAs to see it; otherwise, use Piazza.)
Spring 2016
Mondays and Wednesdays, 6:30–8:00 pm
2050 Valley Life Sciences Building

Final exam solutions are available.

This class introduces algorithms for learning, which constitute an important part of artificial intelligence.

Topics include classification, regression, kernel methods, decision trees, neural networks, clustering, and dimensionality reduction; see the lecture schedule below for details.

Prerequisites

If you want to brush up on prerequisite material, Stanford's machine learning class provides nice reviews of linear algebra and probability theory. Other suggestions for review material appear in this Piazza post.

Textbooks

Both textbooks for this class are available free online. Hardcover and Kindle/eTextbook versions are also available.

Homework

You have a total of five slip days to spend on homework over the semester. Late homework that would push your total past five slip days receives no credit.

Homework 1 is due February 10.

Homework 2 is due February 18.

Homework 3 is due March 3.

Homework 4 is due March 31.

Homework 5 is due April 12.

Homework 6 is due April 26.

Homework 7 is due May 3.

The CS 289A Project has a proposal due Monday, April 4. The video and final report are due Friday, May 6.

Previous midterms (including this semester's) are available: Spring 2013, Spring 2014, Spring 2015, Fall 2015, Spring 2016.

Previous final exams (including this semester's) are available. Without solutions: Spring 2013, Spring 2014, Spring 2015, Fall 2015, Spring 2016. With solutions: Spring 2013, Spring 2014, Spring 2015, Fall 2015, Spring 2016.

Lectures

Lecture 1 (January 20): Introduction. Classification, training, and testing. Validation and overfitting. Read ESL, Chapter 1. My lecture notes (text). The screencast.

Lecture 2 (January 25): Linear classifiers. Predictor functions and decision boundaries. The centroid method. Perceptrons. Read parts of the Wikipedia Perceptron page. Optional: Read ESL, Section 4.5–4.5.1. My lecture notes (text). The screencast.

Lecture 3 (January 27): Gradient descent, stochastic gradient descent, and the perceptron learning algorithm. Feature space versus weight space. The maximum margin classifier, aka hard-margin support vector machine (SVM). Read ISL, Section 9–9.1. My lecture notes (text). The screencast.
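For concreteness, here is a minimal NumPy sketch of the perceptron learning algorithm as a stochastic-gradient-style update (illustrative code, not taken from the lecture notes; the toy data and variable names are made up):

```python
import numpy as np

def perceptron_train(X, y, epochs=100):
    """Perceptron learning rule: update w whenever a point is misclassified.
    X is an n-by-d array of sample points; y holds labels in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        mistakes = 0
        for i in range(n):
            if y[i] * (X[i] @ w) <= 0:      # point i is on the wrong side of the boundary
                w += y[i] * X[i]            # stochastic gradient step (step size 1)
                mistakes += 1
        if mistakes == 0:                   # data linearly separated: stop early
            break
    return w

# Toy usage: two separable clusters, with a bias feature appended as a third column.
X = np.array([[2.0, 1.0, 1.0], [1.5, 2.0, 1.0], [-1.0, -1.5, 1.0], [-2.0, -0.5, 1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron_train(X, y)
print(np.sign(X @ w))   # should match y
```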

Lecture 4 (February 1): The support vector classifier, aka soft-margin support vector machine (SVM). Features and nonlinear decision boundaries. Read ESL, Section 12.2 up to and including the first paragraph of 12.2.1. My lecture notes (text). The screencast.

Lecture 5 (February 3): Machine learning abstractions: application/data, model, optimization problem, optimization algorithm. Common types of optimization problems: unconstrained, constrained (with equality constraints), linear programs, quadratic programs, convex programs. Optional: Read (selectively) the Wikipedia page on mathematical optimization. My lecture notes (text). The screencast.

Lecture 6 (February 8): Decision theory: the Bayes decision rule and optimal risk. Generative and discriminative models. Read ISL, Section 4.4.1. My lecture notes (text). The screencast.

Lecture 7 (February 10): Gaussian discriminant analysis, including quadratic discriminant analysis (QDA) and linear discriminant analysis (LDA). Maximum likelihood estimation (MLE) of the parameters of a statistical model. Fitting an isotropic Gaussian distribution to sample points. Read ISL, Section 4.4. Optional: Read (selectively) the Wikipedia page on maximum likelihood. My lecture notes (text). The screencast.
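As a small illustration of the MLE for an isotropic Gaussian (a sketch, not code from the course): the estimated mean is the sample mean, and the single shared variance is the mean squared deviation averaged over all dimensions.

```python
import numpy as np

def fit_isotropic_gaussian(X):
    """MLE for an isotropic Gaussian N(mu, sigma^2 I) fit to the rows of X."""
    n, d = X.shape
    mu = X.mean(axis=0)                          # sample mean
    sigma2 = np.sum((X - mu) ** 2) / (n * d)     # one variance shared by all d dimensions
    return mu, sigma2

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(1000, 2))
mu, sigma2 = fit_isotropic_gaussian(X)
print(mu, sigma2)   # roughly [3, 3] and 4
```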

February 15 is Presidents' Day.

Lecture 8 (February 17): Eigenvectors, eigenvalues, and the eigendecomposition. The Spectral Theorem for symmetric real matrices. The quadratic form and ellipsoidal isosurfaces as an intuitive way of understanding symmetric matrices. Application to anisotropic normal distributions (aka Gaussians). Read Chuong Do's notes on the multivariate Gaussian distribution. My lecture notes (text). The screencast.

Lecture 9 (February 22): Anisotropic normal distributions (aka Gaussians). MLE, QDA, and LDA revisited for anisotropic Gaussians. Read ISL, Sections 4.4 and 4.5. My lecture notes (text). The screencast.

Lecture 10 (February 24): Regression: fitting curves to data. The 3-choice menu of regression function + loss function + cost function. Least-squares linear regression as quadratic minimization and as orthogonal projection onto the column space. The design matrix, the normal equations, the pseudoinverse, and the hat matrix (projection matrix). Logistic regression; how to compute it with gradient ascent or stochastic gradient descent. Read ISL, Sections 4-4.3. My lecture notes (text). The screencast.
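A minimal sketch of least-squares linear regression via the normal equations (illustrative data; NumPy's least-squares solver applies the pseudoinverse, which is more numerically stable than forming an explicit inverse):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.uniform(0, 10, n)
X = np.column_stack([x, np.ones(n)])          # design matrix with a bias column
y = 2.5 * x - 1.0 + rng.normal(0, 0.5, n)     # noisy points near a line

# Solve the normal equations X^T X w = X^T y.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)            # approximately [2.5, -1.0]
y_hat = X @ w       # the hat (projection) matrix applied to y: fitted values
```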

Lecture 11 (February 29): My Mom's 18th birthday (not kidding). Newton's method and its application to logistic regression. LDA vs. logistic regression: advantages and disadvantages. ROC curves. Weighted least-squares regression. Least-squares polynomial regression. Read ISL, Sections 4.4.3, 7.1, 9.3.3; ESL, Section 4.4.1. My lecture notes (text). The screencast. Happy birthday, Mom!
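The Newton step for logistic regression has a clean closed form because the Hessian of the log-likelihood is -X^T S X with S diagonal. A sketch under the assumption of labels in {0, 1} (not the notation from lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_newton(X, y, iters=10):
    """Newton's method for logistic regression with labels y in {0, 1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        p = sigmoid(X @ w)                 # predicted probabilities
        grad = X.T @ (y - p)               # gradient of the log-likelihood
        S = p * (1 - p)                    # diagonal weights
        H = -(X.T * S) @ X                 # Hessian of the log-likelihood
        w -= np.linalg.solve(H, grad)      # Newton update: w <- w - H^{-1} grad
    return w

# Toy usage on non-separable data (a bias column is appended to the feature).
rng = np.random.default_rng(1)
x = rng.normal(size=100)
X = np.column_stack([x, np.ones(100)])
y = (x + rng.normal(scale=1.0, size=100) > 0).astype(float)
print(logistic_newton(X, y))
```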

Lecture 12 (March 2): Statistical justifications for regression. The empirical distribution and empirical risk. How the principle of maximum likelihood motivates the cost functions for least-squares linear regression and logistic regression. The bias-variance decomposition; its relationship to underfitting and overfitting; its application to least-squares linear regression. Read ESL, Sections 2.5 and 2.9. Optional: Read the Wikipedia page on the bias-variance trade-off. My lecture notes (text). The screencast.

Lecture 13 (March 7): Ridge regression: penalized least-squares regression for reduced overfitting. How the principle of maximum a posteriori (MAP) motivates the penalty term (aka Tikhonov regularization). Kernels. Kernel ridge regression. The polynomial kernel. Read ISL, Sections 6.2-6.2.1 and ESL, Sections 12.3-12.3.2. Optional: This CrossValidated page on ridge regression is pretty interesting. My lecture notes (text). The screencast.
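As a sketch of the closed forms (illustrative code, not the homework's; `lam` is a made-up regularization weight): primal ridge regression solves the penalized normal equations, and kernel ridge regression solves the same problem in the dual with a kernel matrix.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Primal ridge regression: solve (X^T X + lam I) w = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def poly_kernel(A, B, degree=2):
    """Polynomial kernel k(a, b) = (a . b + 1)^degree for all pairs of rows."""
    return (A @ B.T + 1.0) ** degree

def kernel_ridge_fit(X, y, lam, degree=2):
    """Dual (kernel) ridge regression: a = (K + lam I)^{-1} y."""
    K = poly_kernel(X, X, degree)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def kernel_ridge_predict(X_train, a, X_test, degree=2):
    """Prediction: h(x) = sum_i a_i k(X_i, x)."""
    return poly_kernel(X_test, X_train, degree) @ a
```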

Lecture 14 (March 9): Kernel perceptrons. Kernel logistic regression. The Gaussian kernel. Subset selection. Lasso: penalized least-squares regression for reduced overfitting and subset selection. Read ISL, Sections 6-6.1.2 and the last part of 6.1.3 on validation; and ESL, Sections 3.4-3.4.3. Optional: Read ISL, Section 9.3.2 if you're curious about kernel SVM. My lecture notes (text). The screencast.

Lecture 15 (March 14): Decision trees; algorithms for building them. Entropy and information gain. Read ISL, Sections 8-8.1. My lecture notes (text). The screencast.
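A short sketch of entropy and information gain, the split criterion discussed in this lecture (the toy labels below are made up):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, left_mask):
    """Entropy reduction from splitting `labels` with the boolean mask `left_mask`."""
    left, right = labels[left_mask], labels[~left_mask]
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

# Toy example: a split that isolates most of class 0.
y = np.array([0, 0, 0, 1, 1, 1, 1, 1])
split = np.array([True, True, True, True, False, False, False, False])
print(information_gain(y, split))   # about 0.55 bits
```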

The Midterm takes place in class on Wednesday, March 16. You are permitted one “cheat sheet” of letter-sized (8½" × 11") paper.

March 21–25 is Spring Recess.

Lecture 16 (March 28): More decision trees: multivariate splits; decision tree regression; stopping early; pruning. Ensemble learning: bagging (bootstrap aggregating), random forests. Read ISL, Section 8.2. My lecture notes (text). The screencast.

Lecture 17 (March 30): Neural networks. Gradient descent and the backpropagation algorithm. Read ESL, Sections 11.3-11.4. Optional: I've heard positive recommendations for Welch Labs' video tutorial Neural Networks Demystified on YouTube. Also of special interest is this Javascript neural net demo that runs in your browser. My lecture notes come in two parts: a text part and a handwritten part (PDF) on backpropagation. The screencast.
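A minimal sketch of one backpropagation step for a network with a single sigmoid hidden layer, a sigmoid output, and squared-error loss (illustrative only; not the exact network or notation from lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(X, y, W1, W2, lr=0.1):
    """One gradient-descent step. Shapes: X is n-by-d, W1 is d-by-m, W2 is m-by-1, y is n-by-1."""
    # Forward pass
    h = sigmoid(X @ W1)            # hidden activations, n-by-m
    y_hat = sigmoid(h @ W2)        # outputs, n-by-1
    # Backward pass (chain rule, one layer at a time)
    delta_out = (y_hat - y) * y_hat * (1 - y_hat)   # dLoss/d(pre-activation) at the output
    delta_hid = (delta_out @ W2.T) * h * (1 - h)    # propagated back through W2
    W2 -= lr * h.T @ delta_out     # gradient of the loss with respect to W2
    W1 -= lr * X.T @ delta_hid     # gradient of the loss with respect to W1
    return W1, W2
```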

Lecture 18 (April 4): Neuron biology: axons, dendrites, synapses, action potentials. Differences between traditional computational models and neuronal computational models. Backpropagation with softmax outputs and logistic loss. Unit saturation, aka the vanishing gradient problem, and ways to mitigate it. Heuristics for avoiding bad local minima. My lecture notes come in two parts: a text part and a handwritten part (PDF) on backpropagation. The screencast.

Lecture 19 (April 6): Heuristics for avoiding bad local minima. Heuristics for faster training. Heuristics to avoid overfitting. Convolutional neural networks. Neurology of retinal ganglion cells in the eye and simple and complex cells in the V1 visual cortex. Read ESL, Sections 11.5 and 11.7. Optional: A fine paper on heuristics for better neural network learning is Yann LeCun, Leon Bottou, Genevieve B. Orr, and Klaus-Robert Müller, “Efficient BackProp,” in G. Orr and K.-R. Müller (Eds.), Neural Networks: Tricks of the Trade, Springer, 1998. Also of special interest is this Javascript convolutional neural net demo that runs in your browser. The lecture notes come in two parts: a text part and some slides (PDF) I stole from Alyosha Efros, Stephen Palmer, Fei-Fei Li, Andrej Karpathy, Marc'Aurelio Ranzato, and probably others. The screencast.

Lecture 20 (April 11): Unsupervised learning. Principal components analysis (PCA). Derivations from maximum likelihood estimation, maximizing the variance, and minimizing the sum of squared projection errors. Eigenfaces for face recognition. Read ISL, Sections 10–10.2 and the Wikipedia page on Eigenface. My lecture notes (text). The screencast.
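A sketch of PCA by eigendecomposition of the sample covariance matrix (illustrative code; the random data below are made up):

```python
import numpy as np

def pca(X, k):
    """Project centered data onto the top-k principal components
    (eigenvectors of the sample covariance matrix)."""
    Xc = X - X.mean(axis=0)                     # center the data
    cov = Xc.T @ Xc / len(X)                    # sample covariance (MLE normalization)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh: symmetric input, ascending eigenvalues
    top = eigvecs[:, ::-1][:, :k]               # columns = top-k principal directions
    return Xc @ top, top, eigvals[::-1][:k]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated features
Z, components, variances = pca(X, 2)
print(Z.shape, variances)                                  # (200, 2) and the top 2 variances
```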

Lecture 21 (April 13): The singular value decomposition (SVD) and its application to PCA. Clustering: _k_-means clustering aka Lloyd's algorithm; _k_-medoids clustering; hierarchical clustering; greedy agglomerative clustering. Dendrograms. Read ISL, Section 10.3. My lecture notes (text). The screencast.
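A sketch of Lloyd's algorithm for _k_-means (illustrative, not from the lecture notes): alternate between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Lloyd's algorithm for k-means clustering of the rows of X."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]     # initialize at k random points
    for _ in range(iters):
        # Assignment step: distance from every point to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its cluster (kept fixed if empty).
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):            # converged
            break
        centroids = new_centroids
    return labels, centroids
```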

Lecture 22 (April 18): Spectral graph partitioning and graph clustering. Relaxing a discrete optimization problem to a continuous one. The Fiedler vector, the sweep cut, and Cheeger's inequality. The vibration analogy. Greedy divisive clustering. The normalized cut and image segmentation. Read my survey of Spectral and Isoperimetric Graph Partitioning, Sections 1.2–1.4, 2.1, 2.2, 2.4, 2.5, and optionally A and E.2. For reference: Jianbo Shi and Jitendra Malik, Normalized Cuts and Image Segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8):888–905, 2000. My lecture notes (text). The screencast.
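A small sketch of spectral bisection with the Fiedler vector (illustrative only; it uses a simple sign cut rather than the full sweep cut from lecture, and the toy graph is made up):

```python
import numpy as np

def fiedler_bisect(W):
    """Spectral bisection of a graph given its weighted adjacency matrix W.
    The Fiedler vector is the eigenvector of the Laplacian L = D - W
    associated with the second-smallest eigenvalue."""
    D = np.diag(W.sum(axis=1))
    L = D - W
    _, eigvecs = np.linalg.eigh(L)       # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]
    return fiedler >= 0                  # sign cut; a sweep cut would try every threshold

# Two triangles joined by a single edge; the cut should separate them.
W = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    W[i, j] = W[j, i] = 1.0
print(fiedler_bisect(W))   # one triangle True, the other False (or the complement)
```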

Lecture 23 (April 20): Graph clustering with multiple eigenvectors. Latent factor analysis (aka latent semantic indexing). Nearest neighbor classification and its relationship to the Bayes risk. The geometry of high-dimensional spaces. The exhaustive algorithm for _k_-nearest neighbor queries. Read ISL, Section 2.2.3. Optional: Read the Wikipedia page on latent semantic analysis. For reference: Andrew Y. Ng, Michael I. Jordan, Yair Weiss, On Spectral Clustering: Analysis and an Algorithm, Advances in Neural Information Processing Systems 14 (Thomas G. Dietterich, Suzanna Becker, and Zoubin Ghahramani, editors), pages 849–856, the MIT Press, September 2002. My lecture notes (text). The screencast.
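A sketch of the exhaustive _k_-nearest-neighbor classifier (illustrative code, not from the lecture notes): for each query point, compute the distance to every training point and take a majority vote among the k closest.

```python
import numpy as np

def knn_classify(X_train, y_train, X_query, k=3):
    """Exhaustive k-nearest-neighbor classification by majority vote."""
    predictions = []
    for q in X_query:
        dists = np.linalg.norm(X_train - q, axis=1)     # distance to every training point
        nearest = np.argsort(dists)[:k]                 # indices of the k closest points
        votes = y_train[nearest]
        labels, counts = np.unique(votes, return_counts=True)
        predictions.append(labels[counts.argmax()])     # most common label among the neighbors
    return np.array(predictions)
```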

Lecture 24 (April 25): Guest lecture by Anca Dragan on her research on robot planning and learning for interaction with humans. How to make robot intentions transparent to human observers, and how to plan actions that coordinate with people. (Not examinable.)

Lecture 25 (April 27): Speeding up nearest neighbor queries. Voronoi diagrams, order-_k_ Voronoi diagrams, and point location. _k_-d trees. Application of nearest neighbor search to the problem of geolocalization: given a query photograph, determine where in the world it was taken. The lecture notes come in two parts: a text part and a handwritten part (PDF). For reference: the IM2GPS web page, which includes links to the paper. The screencast.

The Final Exam takes place on Friday, May 13, 3–6 PM in Wheeler Auditorium. (CS 189 is in exam group 19.)

Discussion Sections and Teaching Assistants

Sections begin to meet on January 28. Sections 102 and 103 are cancelled.

| Time | Section | TA | Room |
|---|---|---|---|
| Thursday, 10 am | 101 | Tuomas | 102 Latimer |
| Thursday, 1 pm | 104 | Rohan | 228 Dwinelle |
| Thursday, 1 pm | 105 | Brian | B56 Hildebrand |
| Thursday, 2 pm | 106 | Rohan | 105 Dwinelle |
| Thursday, 2 pm | 107 | Tuomas | 255 Dwinelle |
| Thursday, 3 pm | 108 | Marvin | 179 Dwinelle |
| Thursday, 3 pm | 115 | Daylen | 405 Soda |
| Thursday, 4 pm | 110 | Marvin | 255 Dwinelle |
| Thursday, 4 pm | 111 | Aldo | 254 Dwinelle |
| Thursday, 5 pm | 116 | Vashisht | 83 Dwinelle |
| Friday, 9 am | 112 | Aldo | 81 Evans |
| Friday, 10 am | 113 | Shaun | 71 Evans |
| Friday, 11 am | 114 | Brian | 85 Evans |
| Friday, 2 pm | 109 | Shaun | 110 Barker |

Your teaching assistants:
Rohan Chitnis, ronuchit@berkeley.edu
Tuomas Haarnoja, haarnoja@berkeley.edu
Brian Hou, brian.hou@berkeley.edu
Vashisht Madhavan, vashisht.madhavan@berkeley.edu
Aldo Pacchiano, pacchiano@berkeley.edu
Shaun Singh, sdsingh@berkeley.edu
Ajay Solanky, ajaysolanky@berkeley.edu
Daylen Yang, daylen@berkeley.edu
Marvin Zhang, zhangmarvin@berkeley.edu

Grading

Supported in part by the National Science Foundation under Awards CCF-0430065, CCF-0635381, IIS-0915462, and CCF-1423560, in part by a gift from the Okawa Foundation, and in part by an Alfred P. Sloan Research Fellowship.