From Languages to Information (Winter 2025)
Welcome to CS124!! Some FAQ
- When do I need to come to class? TL;DR: Tuesdays! All 10 Tuesdays we have one of the 5 live lectures or 5 labs!
More nuanced answer: CS124 is a flipped class. This means there are mainly recorded lectures and we mostly do labs in class! Note that the 5 live lectures and the 5 labs are not recorded.
- Required attendance in Hewlett 200: 6 days: 4 of the 5 live lectures (Tuesday of weeks 1, 4, 6, 10; not the Transformer lecture in week 7), Lab #1 (week 2), and Lab #5 (week 8). These lectures and labs will not be recorded and the material is fair game for the quizzes.
- **Recommended attendance in Hewlett 200 (and required completion):** Lab #2, Lab #3, Lab #4, and the Transformers Lecture. Attendance at these (non-recorded) labs gives you extra credit, but if you want you can skip the extra credit and do them yourself at home.
- Completely optional attendance in Hewlett 200:
* The 2 Thursday tutorial sessions on Jupyter and Numpy. No extra credit, but you should come if you want help with getting set up with PA0 and getting a refresher on jupyter and numpy.
* Special office hours in Hewlett 200 all the other Thursdays!
- There is no final exam and (new for this year!) no midterm. We will not use the scheduled final exam time at all. The last class requirement is the final programming assignment, which is due Wednesday March 12 at 5pm. The last time you have to come to class in person is Tuesday Mar 11 for my last lecture.
- Should I come to class if I am sick? Goodness no, of course not! Stay home and get better and don't make everyone else sick! Mail the staff in advance (see the email below) and they will note that you are sick and give you advice about how to keep up.
- Can I take this class asynchronously, i.e., if I have another class at the same time?
No. The class must be taken synchronously, so that you have the ability to come to all the 12 in-person class sessions; we have found over the years that the labs are necessary and result in everyone learning more. However, if you have a medical reason why you cannot take the class synchronously, or if you are required to take another class at the same time and you need the classes or else **you cannot graduate this quarter,** you can ask Dan for explicit permission to take the class asynchronously. The in-person material is not recorded, so if you are taking the class asynchronously, you must do the labs yourself at home, but won't get the extra credit that the folks who come in person will get. Note: this is only for people with medical or must-graduate-this-quarter excuses.
Another exception: since the required events are all on Tuesdays, if you have a class at this time that happens to meet only on Thursdays, that's ok!
- Can I audit this class?
No and yes. No because the TAs are very hard-working and their workload helping the almost 400 students each year is assigned by the University based on enrollment, so auditors are asking the TAs to work for free, which isn't reasonable or sustainable. Yes because all the course material is public, and we encourage anyone who wants to learn the material to go through it on their own, as long as they don't ask the TAs to work for them! So that means if you're not in the class we encourage you to watch the videos on YouTube, and do the programming homeworks, but don't submit homeworks or quizzes to be graded, and don't ask questions on Ed that the TAs will need to spend time answering.
- What are the prereqs? Prerequisites: CS106B, Python (at the level of CS106A), CS109 (or equivalent background in probability), and programming maturity and knowledge of UNIX equivalent to CS107 (or taking CS107 or CS1U concurrently). In some previous years CS107 and CS109 were optional. Many students advised us that it would have been helpful to have 107 and 109 first. So now both are required. You may take them concurrently, and if you have equivalent knowledge that's fine. It's also useful to have had Math 51, but not required; we'll try to give you pointers to places to make up missing background. Note: Because this class often functions as an introduction to text data science for PhD students in other areas, we will definitely make exceptions for PhD students in other departments who haven't taken 107 and 109, just talk to us!
- Can I take this course as a PhD student in a department that isn't CS?: Yes, we'd love to have non-CS PhDs! This course is not appropriate for CS grad students (because there are CS graduate versions of all the material in this course, i.e. CS224N, 224U, 224V, 224W, 224C, CS246, CS276), but it's very commonly taken by PhD students in the social sciences or humanities who plan to use text processing methods in their research. Feel free to contact us if you're worried whether you have the right background as a PhD student from another department.
- If I have permission to take the course despite not having CS107 (for example because I am a PhD student in another department), what should I do to catch up with the UNIX background I've missed: Watch some of the "getting start with UNIX" mini-videos from this old cs107 archive: (1) The "Logging in" videos that are relevant for whatever computer you are using (mac/windows/linux), (2) the first 7 "File System" videos, (3) the first 7 "Useful Commands" videos, (4) the first 3 "Shell/Productivity" videos, and (5) the "Vim" section from "Editors".
- More details? See below. You are responsible for reading this entire syllabus before the 2nd day of class, January 9! The textbook is free online as PDFs, mostly here
- Who do I email if I have questions about personal issues or missing class? For personal issues or extenuating circumstances, mail cs124_requests@lists.stanford.edu. This address will be read by Dan and the head TA and the course manager and coordinator to get you answers about missing classes, or about personal issues like OAE. If you have questions that are in any way technical in nature, post them on Ed.
- Most important: Have fun and learn lots!!!!
Course Staff
Dan Jurafsky
Professor
Priti Rangnekar
Head TA
Veronica Rivera
Ethics Postdoc
Xuheng Cai
TA
Adam Chun
TA
Gabriela Cortes Arias
TA
Kate Eselius
TA
Daniel Guo
TA
Sri Jaladi
TA
Jonathan Lee
TA
Kasey Luo
TA
Gabe Magaña
TA
Elena Recaldini
TA
Jeong Shin
Ethics TA
Savitha Srinivasan
TA
Rachel Yixing Wang
TA
Pannisy Zhao
TA
Schedule
Week | Date | Homework | Quiz | In-class | Video Lectures and Readings (to be done by the Monday of the week unless I specify another date) |
---|---|---|---|---|---|
1 | Jan 7, 9 | PA 0: Setup and Tutorial [starter code] Due Fri Jan 10, 5:00pm (We'll also go over this in Thursday Jan 9's in-person tutorial ) | - | Tue Jan 7: Dan in-person Lecture: Intro (not recorded) [slides pptx slides pdf] Thurs Jan 9: In-person tutorial: Jupyter notebooks and PA0 (Optional!) | Watch before Thursday: PA 0 Windows Setup Video PA 0 Mac Setup Video |
2 | Jan 14 and 16 | PA 1: Regular Expressions [starter code] Due Fri Jan 17, 5:00pm | Quiz 1: Text Processing/Edit Distance [quiz 1 on gradescope] Due Tue Jan 14, 11:59pm | Tue Jan 14: Lab #1: Text Processing with Unix tools (Slides: Solutions are generally on the following slide; don't look at each solution til you've done the problem:) [Lab 1 pptx] [Lab 1 pdf] [Lab 1 quick command ref] [Lab 1 solutions w/numbers] [secret_ec.txt] Thur Jan 16: In-person Tutorial: NumPy (Optional) [numpy tutorial] | Basic Text Processing Canvas Videos (watch videos before Mon Jan 13) [canvas slides pptx] [ canvas slides pdf] J+M 3rd Chapter 2 "Regular Expressions, Text Normalization, Edit Distance", pages 1-21 Edit Distance Canvas Videos (watch videos before Mon Jan 13) [canvas slides pptx] [canvas slides pdf] J+M New Chapter 2 "Regular Expressions, Text Normalization, Edit Distance", pages 22-26 Just for historical reference: Ken Church's original tutorial Unix for Poets, slides/pages 1-19 |
3 | Jan 21 and 23 | PA 2: Naive Bayes and Sentiment Analysis! [starter code] Due Fri Jan 24, 5:00pm | Quiz 2: Language Modeling/Naive Bayes [gradescope] Due Tuesday Jan 21, 11:59pm | Tue Jan 21: Lab #2: Naive Bayes and Classification and its harms (watch NB videos beforehand) (don't look at the solution until you've completed all the questions!) [Lab 2] [Lab 2 Solutions] Thu Jan 23: No class: extra in-person TA office hours during class time in classroom | Language Modeling Canvas Videos (watch before Monday Jan 20) [canvas slides pptx] [canvas slides pdf] J+M (3ed) Chapter 3, "Language Modeling with N-grams" pages 1-16 (plus section 3.7, "The Web and Stupid Backoff"). Naive Bayes and Text Classification Canvas Videos (watch before Monday Jan 20) [canvas slides pptx] [canvas slides pdf] J+M (3ed) Chapter 4, "Naive Bayes and Sentiment Classification" pages 1-14 plus page 18, sections 4.1 through 4.8 and 4.10. Optional: Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment Classification using Machine Learning Techniques. EMNLP 2002, pages 79--86 |
4 | Jan 28 and Jan 30 | PA 3: Logistic Regression! [starter code] Due Fri Jan 31, 5:00pm | Quiz 3: Logistic Regression [gradescope] Due Tuesday Jan 28, 11:59pm | Tuesday: Dan in-person Lecture (required and not recorded): "Social NLP/ NLP for Computational Social Science" [slides pdf] Thursday: No class: extra in-person TA office hours during class time in Hewlett 200 | Logistic Regression Canvas Videos (watch/read before Mon Jan 27) [canvas slides pptx] [canvas slides pdf] J+M (3ed) Chapter 5, "Logistic Regression" pages 1-17. Page 18 and 21 may also be useful! |
5 | Feb 4 and 6 | PA 4: Information Retrieval [starter code] Due Fri Feb 7, 5:00pm | Quiz 4: Information Retrieval [gradescope] Due Tuesday Feb 4, 11:59pm | Tuesday: Lab #3: Information Retrieval [lab 3] [solutions] Thursday: No class: extra in-person TA office hours during class time in Hewlett 200 | Chris Manning Canvas Video: Information Retrieval (I) (watch/read before Monday Feb 3) [canvas slides pptx] [canvas slides pdf] MR+S Chapter 1: Boolean Retrieval (pages 1-17) MR+S Chapter 2: Term vocabulary and postings lists (only pages 33-42) Chris Manning Canvas Video: Information Retrieval (II) (watch/read before Monday Feb 3) [canvas slides pptx] [canvas slides pdf] J+M (3ed) Chapter 14, "QA and IR", just pages 1-6 MR+S Chapter 6: Scoring, term weighting, and the vector space model, (only pages 100 and 107-116) MR+S Chapter 8: Evaluation in Information Retrieval (only pages 139-149) |
6 | Feb 11 and 13 | PA 5: Embeddings and Vector Semantics [starter code] Due Fri Feb 14, 5:00pm. | Quiz 5: Embeddings and Vector Semantics [gradescope] Due Tue Feb 11, 11:59pm | Tuesday: Guest Lecture (required and not recorded): Dora Demszky, Graduate School of Education: "Empowering educators via language technology" (plus Dan mini-lecture on contextual embeddings) [Demszky slides pdf] [Dan embeddings slides pdf] Thursday: No class: extra in-person TA office hours during class time in Hewlett 200 | Vector Semantics and Embeddings Canvas Videos [slides pptx] [slides pdf] J+M (3ed) Chapter 6: Vector Semantics, 1-7, 17-26, and review 7-16 (should already be familiar) |
7 | Feb 18 and 20 | PA 6: Neural Networks [starter code] Due Fri Feb 21, 5:00pm | Quiz 6: Neural Networks [gradescope] Due Tue Feb 18, 11:59pm. | Tuesday: Dan live Lecture (optional but strongly recommended, not recorded): "LLMs and Transformers! (Plus more on backprop if we have time)" [slides pptx] [slides pdf] Thursday: No class: extra in-person TA office hours during class time in Hewlett 200 | Neural Networks Video [slides pptx] [slides pdf] J+M (3ed) Chapter 7: Neural Networks (skip section 7.7 on Training Neural Language Models). |
8 | Feb 25 and Feb 27 | - | Quiz 7: Transformers [gradescope] Due Tue Feb 25, 11:59pm. | Tuesday: Lab #5: PA7 and Git [Lab 5] [Lab 5 Solutions] Thursday: No class: extra in-person TA office hours during class time in Hewlett 200 | The lecture material is the LLM/Transformer lecture that I gave live on Feb 18. There is no recording, but the slides are listed above in week 7 and the reading (which is similar to the lecture) is: J+M (3ed) Chapter 9: The Transformer. J+M (3ed) Chapter 10: Large Language Models (only pages 1-11 and page 17). |
9 | Mar 4 and 6 | PA 7: Chatbot [starter code] Due Wed Mar 12, 5:00pm | Quiz 8: Recommendation Systems Due Tues Mar 4, 11:59pm | Tuesday Lab #4: Collaborative Filtering and Ethical Use of LLMs in the Classroom [Lab 4] [Lab 4 solutions] Thursday: No class: extra in-person TA office hours during class time in Hewlett 200 | Chat Bot Videos (watch by Monday Mar 3) [slides pptx] [slides pdf] J+M (3ed) Chapter 15: Chatbots and Dialogue Systems Recommender systems and Collaborative Filtering Canvas videos (watch by Monday Mar 3) [slides pptx] [slides pdf] Jure Leskovec, Anand Rajaraman, Jeff Ullman. 2014. Mining of Massive Datasets. Chapter 9 3rd edition. pages 319-339 (sections 9.1, 9.2, 9.3; you can skip 9.2.7). |
10 | Mar 11 and 13 | Reminder: PA 7: Chatbot due Wed Mar 12, 5:00pm | Quiz 9: Pagerank and Networks Due Tues Mar 11, 11:59pm | Tuesday: Dan Live Lecture (required and not recorded): "Large Language Models Continued" [slides pdf] [slides pptx] Thursday: No class (but no extra office hours) | Web graphs, Links, and PageRank (watch by Mon Mar 10) [slides pptx] [slides pdf] MR+S Chapter 21: Link Analysis, just pages 421-433 (Skip section 21.3 and 21.4) Social Networks Canvas Videos (watch by Mon Mar 10) [slides pptx] [slides pdf] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Chapter 2, Sections 3.1-3.3 and Secs 18.1-18.5. Cambridge |
Logistics
Instructor
Dan Jurafsky (jurafsky@stanford.edu)
Office: Margaret Jacks 117
Office Hours: Book an appointment for Thursdays 3-4:20 (except week 2, in which case they are Tues Jan 14, 4:30-5:40).
(instructions for booking are here)
TA Office Hours
(see more details here including the virtual queue and the zoom link)
- Monday 3:30 PM - 6:30 PM - GESB 150 (Green Earth Sciences B 150)
- Tuesday 5 PM - 8 PM - virtual
- Wednesday 3:30 PM - 6:30 PM - 260-003, Pigott Hall, Main Quad
- Thursday
- 3 - 4:20 PM in Hewlett 200
- 4:30 - 7:30 PM week 2 / 4:30 - 6 PM week 3 onwards - 540-108, Bldg.540, Blume Earthquake Center
- Friday 12:30 - 3:30 PM - 100-101K, Bldg.100, Main Quad
- Saturday 9:30 AM - 12:30 PM - virtual
Class Time
Tuesday and Thursday 3:00-4:20
Attendance
We require you to come to 6 classes: 4 of the 5 live lectures (all except Transformers), Lab #1, and Lab #5. We strongly, strongly recommend the other lecture, the other 3 labs, and the 2 tutorials; you will learn more from doing them with other people (I won't require attendance at labs 2/3/4 but I will give extra credit for attending them). Any lab you miss, you must still do at home yourself. The course can be taken asynchronously only if you have permission from Dan due to a required conflict or medical issue. Also: different people learn better from different combinations of videos/lectures, reading the chapters, coming to the labs, and coming to office hours. But I will say that students who do all four tend to do the best on quizzes and in the course in general.
Alas, we can't reply to email sent to individual staff members. If you have a question that is not confidential or personal, post it on the Ed Discussion forum! Responses are quicker and you'll also be helping others with the same question! To contact the teaching staff directly, come see us in office hours!
If that is not possible, you can also email (non-technical questions) to the course staff list, cs124_requests@lists.stanford.edu. For urgent requests: We check the staff email list very frequently, but please don't worry if you don't hear from us right away. We will do our best to get back to you within a day or so. Just make sure to send an email as soon as you have the request so it's timestamped!
If you have a matter to be discussed privately, come to office hours or use cs124_requests@lists.stanford.edu to make an appointment. For grading questions, please talk to us after class or during office hours.
Class announcements will be on Ed Discussion (although we will occasionally try Canvas and mailing lists). We will assume that everyone reads all announcements.
Honor Code
Since we occasionally reuse homeworks from previous years, we expect students not to copy, refer to, or look at the solutions in preparing their answers. It is an honor code violation to intentionally refer to a previous year's solutions. This applies both to the official solutions and to solutions that you or someone else may have written up in a previous year. It is also an honor code violation to find some way to look at the test set, to interfere in any way with programming assignment scoring, or to tamper with the submit script. It's also an honor code violation to use ChatGPT or any automatic coding system to write your code for you.
Unlike prior years, students are allowed to collaborate on quizzes. However, you must each do your own work and only discuss after; each person will be uploading their own work in addition to the answers.
CS124 follows the general Stanford policy on generative AI which is that "use of or consultation with generative AI shall be treated analogously to assistance from another person. In particular, using generative AI tools to substantially complete an assignment or quiz (e.g. by entering quiz or assignment questions) is not permitted", just as having someone do your homework or quizzes for you is not permitted.
Textbooks
- There is no required textbook, but I'll expect you to know the textbook/reading material listed above, and will test it on the quizzes.
- Online new chapters from Jurafsky and Martin. third edition August 20, 2024 release. Speech and Language Processing.
- Chapters from Manning, Raghavan, and Schutze. 2008. Introduction to Information Retrieval. Cambridge University Press. You can buy the book, get it from the library, or it's also available online *HERE*.
Course Description
Extracting meaning, information, and structure from human language text, speech, web pages, social networks. Introducing methods (string algorithms, edit distance, language modeling, machine learning, logistic regression, neural networks, neural embeddings, inverted indices, collaborative filtering, PageRank), applications (chatbots, sentiment analysis, information retrieval, text classification, social networks, recommender systems), and ethical issues.
Prerequisites
CS106B, Python (at the level of CS106A), CS109 (or equivalent background in probability), and programming maturity and knowledge of UNIX equivalent to CS107 (or taking CS107 or CS1U concurrently).
Required Work
From Languages to Information is a flipped class with much of the material online. All the lectures (except 5 live lectures) have been prerecorded, and you can watch them at home. The weekly quizzes and programming homeworks will be automatically uploaded and graded. Lectures are available in the Modules section on Canvas. Quizzes and homeworks are on Gradescope and github, but you can find them all on this webpage!!
Prerecorded Video Lectures
Most weeks, we will ask you to watch a set of video lectures (2 to 2.5 hours total). Most videos will have some in-video questions embedded in them, which you should answer. You are required to watch the videos but the embedded quizzes are not counted toward the final grade.
In-class Lectures
5 lectures will be live, and are required (except Week 7 Transformers lecture is only Strongly Recommended). For all 5 the material will be on the quizzes.
Labs
There are 5 in-class labs in which we do group problem-solving activities. The labs are required and will be tested on the quizzes, meaning that if you can't make a particular in-person lab, you must still do the exercises at home instead. But Lab 1 and Lab 5 must be attended in class; the other 3 you can do at home.
Automated Review Quizzes
After watching a week's video lectures, we will ask you to answer an open-notes, open-book review quiz (about 5-6 questions) on the content that you just learned. These quizzes are not timed, they are open book, and they may be attempted an infinite number of times. The questions, as well as the options for each question, may change and be randomly selected from a larger pool each time you take a quiz. You will not see your quiz grade/correct answers until after the due date, but the system will take the score from the last submission of all your infinitely-allowed submissions for the quiz. So if you worry you might have got something wrong, just submit another one! Review Quizzes for each week are due 11:59pm Tuesday of the following week. There are no late days for review quizzes. Because of the strict no-late-day policy, we will drop your lowest scoring quiz (i.e. we will only count your best 8 of the 9 quizzes in your final grade).
Can I work with my friends on the quiz? Yes, you can work with your pair programming partner. But you must each do the problem yourselves, and only then discuss with your partner, and you each submit separately (and you will have to show your own work in the "show your work" section when you upload the quiz answers).
Class Participation
You have to watch all lectures, and attendance for the 5 live lectures is required (except for the Transformer lecture, which is only Strongly Recommended). The labs are required and we will test material from them on the quizzes, and labs 1 and 5 must be attended in person. However, attendance for labs 2, 3, and 4 is only strongly recommended; you may do them yourself at home if you really cannot come to class. You can get extra credit for class participation and other things by: coming to labs 2/3/4 in person, particularly good answers on the class Ed forum, helping out other students in office hours or labs, being the first person to find typos in the textbook (not counting bugs in figure or chapter numbering), and speaking up in the labs. Plus there will be extra credit problems on some of the labs and possibly PAs.
Programming Assignments
7 Python programming assignments. All are due Fridays at 5pm, except PA 7, which is due Wednesday March 12 at 5pm.
Programming Assignment Collaboration for PA 1-6: You may talk to anybody you want about the assignments and bounce ideas off each other. And if you want, you can also choose a partner and do pair programming for PA 1-6. Pair programming has many advantages for learning!!! You and your pair-partner can discuss code, but it's important that each of you work on each part of the assignment so that you're comfortable with the whole assignment, since assignments build on each other (and we will test concepts from the assignments on the quizzes). If you choose to pair-program, you should specify in the submission who your partner is. We will use the normal automatic checks for overlap between your code and other students' code who are not your pair partner. You must describe in your writeup exactly who did what in your code.
Programming Assignment Collaboration for PA 7: PA7 is a group homework that must be done in groups. You will work together with your group, and write code together. Groups must be of size 3 or 4. To work in a group of size 2, you must get special permission from the staff. You cannot work by yourself on PA 7, because part of the goal of this homework is to learn to work on group projects. You must describe in your writeup in detail exactly who in your group did what, and who worked on which parts of the assignment/code.
Late homeworks
You have a total of 4 free late (calendar) days to use on programming assignments 1-6. If you are pair programming, late days are still individual (i.e., if one of you has used up late days, and one has not, and you submit a homework late one day, only the student without remaining late days will be penalized). You cannot use late days on PA 7. Once late days are exhausted, any PA turned in late will be penalized 20% per late day. Each 24 hours or part thereof that a homework is late uses up one full late day. However, no assignment will be accepted more than four days after its due date.
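To make the late-day arithmetic concrete, here is a minimal Python sketch of the policy above for PAs 1-6. It is only an illustration, not the staff's actual grading script: the function and constant names are hypothetical, and it assumes that free late days are used up before any penalty applies and that the 20% penalty is subtracted from the assignment's score for each penalized day.

```python
import math
from datetime import datetime

FREE_LATE_DAYS = 4       # total across PAs 1-6, per the policy above
PENALTY_PER_DAY = 0.20   # assumed: 20% of the PA score per penalized late day
MAX_DAYS_LATE = 4        # nothing is accepted more than four days late


def days_late(due: datetime, submitted: datetime) -> int:
    """Each 24 hours or part thereof counts as one full late day."""
    seconds = (submitted - due).total_seconds()
    if seconds <= 0:
        return 0
    return math.ceil(seconds / 86400)


def apply_late_policy(due: datetime, submitted: datetime, free_days_left: int):
    """Return (accepted, score_multiplier, free_days_left_after)."""
    late = days_late(due, submitted)
    if late > MAX_DAYS_LATE:
        # Too late to be accepted; no late days are consumed.
        return False, 0.0, free_days_left
    # Assumption: free late days are spent first, then the penalty kicks in.
    free_used = min(late, free_days_left)
    penalized_days = late - free_used
    multiplier = max(0.0, 1.0 - PENALTY_PER_DAY * penalized_days)
    return True, multiplier, free_days_left - free_used
```

For example, a PA due Friday at 5:00pm and submitted Saturday at 9:00am counts as one late day, because any part of a 24-hour period uses up a full day.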
Readings
This class has a significant amount of textbook reading. Most weeks have around 25 textbook pages. The homeworks and quizzes are based heavily on the readings.
Final grade computation
- 73% homeworks: PAs 1-6 are each worth the same, 9% (ignore the different point values for each homework); PA7 is worth 18%, double the others; and PA0 is worth 1%.
- 27% weekly review quizzes (each identically worth 27%/8, because the lowest quiz is dropped)
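If you want to check the weights for yourself, the breakdown above can be sketched in a few lines of Python. This is an illustration under assumptions, not the official grade calculator: the names are hypothetical, scores are taken to be fractions between 0 and 1, and extra credit is assumed to be added directly as percentage points.

```python
# Weights in percentage points, taken from the breakdown above.
PA_WEIGHTS = {"PA0": 1.0, **{f"PA{i}": 9.0 for i in range(1, 7)}, "PA7": 18.0}
QUIZ_TOTAL_WEIGHT = 27.0
QUIZZES_COUNTED = 8  # best 8 of 9; the lowest quiz is dropped


def final_percentage(pa_scores, quiz_scores, extra_credit=0.0):
    """pa_scores: dict of PA name -> fraction in [0, 1].
    quiz_scores: list of quiz fractions in [0, 1].
    extra_credit: assumed to be expressed in percentage points.
    """
    # A missing PA counts as 0.
    homework = sum(w * pa_scores.get(name, 0.0) for name, w in PA_WEIGHTS.items())
    # Keep only the best 8 quizzes, each worth 27/8 percentage points.
    best = sorted(quiz_scores, reverse=True)[:QUIZZES_COUNTED]
    quizzes = sum(best) * QUIZ_TOTAL_WEIGHT / QUIZZES_COUNTED
    return homework + quizzes + extra_credit
```

With full marks on every PA and on at least 8 of the 9 quizzes, the function returns 100.0, matching 73% + 27%.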
Final letter grades
(the numerator will include your extra credit, the denominator does not include possible extra credit (otherwise it wouldn't be extra credit))
- A+: 102.000% and above strictly no rounding (i.e., not 101.99% or below)
- A: 93% and above of the total points
- A-: 90% and above of the total points
- B+: 87% and above of the total points
- B: 83% and above of the total points
- B-: 80% and above of the total points
- C+: 77% and above of the total points
- etc.
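The cutoffs listed above amount to a simple lookup, sketched below in Python. The function name is hypothetical, and the sketch deliberately stops at C+ because the cutoffs below that are not spelled out here.

```python
# (letter, minimum percentage), highest first, exactly as listed above.
CUTOFFS = [
    ("A+", 102.0),
    ("A", 93.0),
    ("A-", 90.0),
    ("B+", 87.0),
    ("B", 83.0),
    ("B-", 80.0),
    ("C+", 77.0),
]


def letter_grade(percentage):
    """Return the letter for a final percentage, with no rounding.

    Returns None below the C+ cutoff, since lower cutoffs are not listed.
    """
    for letter, cutoff in CUTOFFS:
        if percentage >= cutoff:
            return letter
    return None
```

Because there is no rounding, `letter_grade(101.99)` gives "A" while `letter_grade(102.0)` gives "A+".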