CS246 | Home
Content
What is this course about? [Info Handout]
The course covers data mining and machine learning algorithms for analyzing very large amounts of data. The emphasis is on MapReduce and Spark as tools for creating parallel algorithms that can process data at this scale.
Topics include: Frequent itemsets and Association rules, Near Neighbor Search in High Dimensional Data, Locality Sensitive Hashing (LSH), Dimensionality reduction, Recommendation Systems, Clustering, Link Analysis, Large-scale Supervised Machine Learning, Data streams, Mining the Web for Structured Data, Web Advertising.
Previous offerings
The previous version of the course was CS345A: Data Mining, which also included a course project. CS345A has now been split into two courses, CS246 and CS341.
You can access class notes and slides of previous versions of the course here:
Prerequisites
Students are expected to have the following background:
- Knowledge of basic computer science principles and skills, at a level sufficient to write a reasonably non-trivial computer program (e.g., CS107 or CS145 or equivalent are recommended).
- Good knowledge of Java and Python will be extremely helpful since most assignments will require the use of Spark.
- Familiarity with basic probability theory (CS109 or Stat116 or equivalent is sufficient but not necessary).
- Familiarity with writing rigorous proofs (at a minimum, at the level of CS 103).
- Familiarity with basic linear algebra (e.g., any of Math 51, Math 103, Math 113, CS 205, or EE 263 would be much more than necessary).
- Familiarity with algorithmic analysis (e.g., CS 161 would be much more than necessary).
The recitation sessions in the first weeks of the class will give an overview of the expected background.
Reference Text
The following text is useful, but not required. It can be downloaded for free, or purchased from Cambridge University Press.
Leskovec-Rajaraman-Ullman: Mining of Massive Datasets
Schedule
Lecture slides will be posted here shortly before each lecture. If you wish to view slides further in advance, refer to the 2024 course offering's slides, which are mostly similar.
This schedule is subject to change. All deadlines are at 11:59pm PST.
Date | Description | Suggested Readings | Events | Deadlines |
---|---|---|---|---|
Tue Jan 7 | Introduction; MapReduce and Spark [slides] | Ch1: Data Mining; Ch2: Large-Scale File Systems and Map-Reduce | | |
Thu Jan 9 | Frequent Itemsets Mining [slides] | Ch6: Frequent itemsets | Colab 0, Colab 1, Homework 1 out | |
Sat Jan 11 | Recitation: Spark tutorial | | | |
Tue Jan 14 | Locality-Sensitive Hashing I [slides] | Ch3: Finding Similar Items (Sect. 3.1-3.4) | | |
Thu Jan 16 | Locality-Sensitive Hashing II [slides] | Ch3: Finding Similar Items (Sect. 3.5-3.8) | Colab 2 out | Colab 0, Colab 1 due |
Thu Jan 16 | Recitation: Linear Algebra | | | |
Fri Jan 17 | Recitation: Probability and Proof Techniques | | | |
Tue Jan 21 | Clustering [slides] | Ch7: Clustering (Sect. 7.1-7.4) | | |
Thu Jan 23 | Dimensionality Reduction [slides] | Ch11: Dimensionality Reduction (Sect. 11.4) | Colab 3, Homework 2 out | Colab 2, Homework 1 due |
Tue Jan 28 | Recommender Systems I [slides] | Ch9: Recommendation systems | | |
Thu Jan 30 | Recommender Systems II [slides] | Ch9: Recommendation systems | Colab 4 out | Colab 3 due |
Tue Feb 4 | PageRank [slides] | Ch5: Link Analysis (Sect. 5.1-5.3, 5.5) | | |
Thu Feb 6 | Extensions of PageRank to Recommendations and Spam [slides] | Ch5: Link Analysis (Sect. 5.4); Ch10: Analysis of Social Networks (Sect. 10.1-10.2, 10.6) | Colab 5, Homework 3 out | Colab 4, Homework 2 due |
Tue Feb 11 | Community Detection in Graphs [slides] | Ch10: Analysis of Social Networks (Sect. 10.3-10.5) | | |
Thu Feb 13 | Graph Representation Learning [slides] | Inductive Representation Learning on Large Graphs; Do Transformers Really Perform Bad for Graph Representation?; Sign and Basis Invariant Networks for Spectral Graph Representation Learning | Colab 6 out | Colab 5 due |
Tue Feb 18 | Graph Neural Networks [slides] | How Powerful Are Graph Neural Networks?; Identity-aware Graph Neural Networks; Graph Neural Networks are More Powerful than We Think; Position-aware Graph Neural Networks | | |
Thu Feb 20 | Relational Deep Learning [slides] | Relational Deep Learning - Graph Representation Learning on Relational Databases; RelBench: A Benchmark for Deep Learning on Relational Databases | Colab 7, Homework 4 out | Colab 6, Homework 3 due |
Tue Feb 25 | Decision Trees [slides] | Ch12: Large-Scale Machine Learning | | |
Thu Feb 27 | Mining Data Streams I & II [slides] | Ch4: Mining data streams | Colab 8 out | Colab 7 due |
Tue Mar 4 | Computational Advertising [slides] | Ch8: Advertising on the Web | | |
Thu Mar 6 | Optimizing Submodular Functions [slides] | | Colab 9 out | Colab 8, Homework 4 due |
Tue Mar 11 | Bandits [slides] | Turning Down the Noise in the Blogosphere by El-Arini, Veda, Shahaf, Guestrin. KDD 2009. | | |
Thu Mar 13 | Scaling ML [slides] | | | Colab 9 due |
Thu Mar 20 | Exam | | | |