CSC401/2511 :: Natural Language Computing :: University of Toronto

Contact information

Instructor Gerald Penn
Office PT 283
Office hours M 4-6pm
Email gpenn@teach.cs.toronto.edu (please put CSC 401/2511 in the subject line)
Forum (Piazza) Piazza - (signup)
Quercus https://q.utoronto.ca/courses/352606
Email policy For non-confidential inquiries, consult the Piazza forum first. For confidential assignment-related inquiries, contact the TA associated with the particular assignment. Emails sent from a University of Toronto email address with the appropriate subject line (CSC 401/2511) are the least likely to be redirected to a junk folder.

Course overview

This course presents an introduction to natural language computing in applications such as information retrieval and extraction, intelligent web searching, speech recognition, and machine translation. These applications will involve various statistical and machine learning techniques. Assignments will be completed in Python. All code must run on the 'teaching servers'.

Prerequisites: CSC207/CSC209/APS105/APS106/ESC180/CSC180 and STA237/STA247/STA255/STA257/STAB52/ECE302/STA286/CHE223/CME263/MIE231/MIE236/MSE238/ECE286, and a CGPA of 3.0 or higher or a CSC subject POSt. MAT 223 or MAT 240, and CSC 311 (or equivalent), are strongly recommended.

See also the course information sheet.

Meeting times

Location BA (Bahen Centre for Information Technology)
Lectures MW 10-11h at BA 1180; 11-12h at BA 1190
Tutorials F 10-11h at BA 1180; 11-12h at BA 1190

Syllabus

The following is an estimate of the topics to be covered in the course and is subject to change.

  1. Introduction to corpus-based linguistics
  2. N-gram models, linguistic features, word embeddings
  3. Entropy and information theory
  4. Intro to deep neural networks and neural language models
  5. Machine translation (statistical and neural) (MT)
  6. Transformers, attention-based models and variants
  7. Large language models (LLMs)
  8. Acoustics and phonetics
  9. Speech features and speaker identification
  10. Dynamic programming for speech recognition
  11. Speech synthesis (TTS)
  12. Information Retrieval (IR)
  13. Text Summarization
  14. Ethics in NLP

Calendar

4 September First lecture
18 September Last day to enrol
24 September Part 1 of Assignment 1 due
8 October Assignment 1 due
28 October Last day to drop CSC 2511
28 October - 1 November Reading week -- no lectures or tutorial
4 November Last day to drop CSC 401
5 November Assignment 2 due
3 December Last lecture
3 December Assignment 3 due
6-21 December Final exam period

See Dates for undergraduate students.

See Dates for graduate students.

Readings for this course

Optional Foundations of Statistical Natural Language Processing, C. Manning and H. Schütze. Errata. Online edition (free if you're on a UofT computer or VPN)
Optional Speech and Language Processing, D. Jurafsky and J.H. Martin (2nd ed.). Errata. 3rd ed. N.B. all reading sections refer to the 2nd ed.
Optional Deep Learning, I. Goodfellow, Y. Bengio, and A. Courville

Supplementary reading

Please see additional lecture-specific supplementary resources under the Lecture Materials section.

Good-Turing Smoothing: "A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams" (Kenneth Church and William Gale)
ML History: "What is science for? The Lighthill report on artificial intelligence reinterpreted" (Jon Agar)
Smoothing: "An Empirical Study of Smoothing Techniques for Language Modeling" (Stanley F. Chen and Joshua Goodman)
Hidden Markov models: "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition" (Lawrence R. Rabiner)
Sentence alignment: "A Program for Aligning Sentences in Bilingual Corpora" (William A. Gale and Kenneth W. Church)
Transformation-based learning: "Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging" (Eric Brill)
Sentence boundaries: "Sentence boundaries" (J. Read, R. Dridan, S. Oepen, and L. J. Solberg)
Seq2Seq: "Sequence to Sequence Learning with Neural Networks" (Ilya Sutskever, Oriol Vinyals, and Quoc V. Le)
Transformer: "Attention Is All You Need" (Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin)
Attention-based NMT: "Effective Approaches to Attention-based Neural Machine Translation" (Minh-Thang Luong, Hieu Pham, and Christopher D. Manning)
NMT: "Neural machine translation by jointly learning to align and translate" (Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio)
NMT: "Massive exploration of neural machine translation architectures" (Denny Britz et al.)

Evaluation policies

General

You will be graded on three homework assignments, two ethics surveys, and a final exam. The relative proportions of these grades are as follows:

Assignment 1 20%
Assignment 2 20%
Assignment 3 20%
Ethics Surveys (2x) 1%
Final exam 39%

Lateness

A 10% (absolute) deduction is applied to late homework beginning one minute after the due time. An additional 10% deduction is applied for each further 24 hours of lateness, up to 72 hours late, at which point the homework receives a mark of zero. No exceptions will be made except in cases of documented emergencies.
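
For concreteness, here is a minimal, unofficial sketch of one plausible reading of the schedule above (the function name, the boundary handling at exactly 72 hours, and expressing the penalty as an absolute percentage deducted are illustrative assumptions, not part of the policy):

```python
# Illustrative only -- not an official grade calculator.
def late_penalty(hours_late: float) -> float:
    """Absolute percentage deducted from a homework submitted `hours_late` hours late."""
    if hours_late <= 0:
        return 0.0      # on time: no deduction
    if hours_late >= 72:
        return 100.0    # 72 hours or more: mark of zero
    # 10% immediately after the due time, plus a further 10% per full 24 hours late
    return 10.0 + 10.0 * int(hours_late // 24)

# Example: a submission 30 hours late loses 10% + 10% = 20% (absolute).
print(late_penalty(30))  # 20.0
```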

Final

The final exam will be a timed 3-hour test. A mark of at least 50 on the final exam is required to pass the course. In other words, if you receive a 49 or less on the final exam then you automatically fail the course, regardless of your performance in the rest of the course.

Collaboration and plagiarism

No collaboration on the homeworks is permitted. The work you submit must be your own. 'Collaboration' in this context includes but is not limited to sharing of source code, correction of another's source code, copying of written answers, and sharing of answers prior to or after submission of the work (including the final exam). Failure to observe this policy is an academic offense, carrying a penalty ranging from a zero on the homework to suspension from the university. The use of AI writing assistance (ChatGPT, Copilot, etc.) is allowed only for refining the English grammar and/or spelling of text that you have already written. Submitting any Python code generated or modified by any AI assistant is strictly prohibited. See Academic integrity at the University of Toronto.

Lecture materials

  1. Introduction
    • Date: 4 Sep.
    • Reading: Manning & Schütze: Sections 1.3-1.4.2, Sections 6.0-6.2.1
  2. Corpora and Smoothing
    • Dates: 9-16 Sep.
    • Reading: Manning & Schütze: Section 1.4.3, Sections 6.1-6.2.2, Section 6.2.5, Section 6.3
    • Reading: Jurafsky & Martin: 3.4-3.5
    • See also the supplementary reading for Good-Turing smoothing
  3. Features and Classification
    • Dates: 18-23 Sep.
    • Reading: Manning & Schütze: Section 1.4.3, Sections 6.1-6.2.2, Section 6.2.5, Section 6.3
    • Reading: Jurafsky & Martin: 3.4-3.5
  4. Entropy and information theory
    • Dates: 25-30 Sep.
    • Reading: Manning & Schütze: Sections 2.2, 5.3-5.5
  5. Intro. to NNs and Neural Language Models
    • Dates: 7, 9 Oct.
    • Reading: DL (Goodfellow et al.). Sections: 6.3, 6.6, 10.2, 10.5, 10.10
    • (Optional) Supplementary resources and readings:
      • Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space. (2013)" link
      • Xin Rong. "word2vec Parameter Learning Explained". link
      • Bolukbasi, Tolga, et al. "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings." NeurIPS (2016). link
      • Greff, Klaus, et al. "LSTM: A search space odyssey." IEEE (2016). link
      • Jozefowicz, Sutskever et al. "An empirical exploration of recurrent network architectures." ICML (2015). link
      • GRU: Cho, et al. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." (2014). link
      • ELMo: Peters, Matthew E., et al. "Deep contextualized word representations. (2018)." link
        Blogs:
      • The Unreasonable Effectiveness of Recurrent Neural Networks. link
      • Colah's Blog. "Understanding LSTM Networks". link.
  6. Machine Translation (MT)
    • Dates: 16,21,23 Oct.
    • Readings:
      • Manning & Schütze: Sections 13.0, 13.1.2, 13.1.3, 13.2, 13.3, 14.2.2
      • DL (Goodfellow et al.). Sections: 10.3, 10.4, 10.7
    • (Optional) Supplementary resources and readings:
      • Papineni, et al. "BLEU: a method for automatic evaluation of machine translation." ACL (2002). link
      • Sutskever, Ilya, Oriol Vinyals et al. "Sequence to sequence learning with neural networks."(2014). link
      • Bahdanau, Dzmitry, et al. "Neural machine translation by jointly learning to align and translate."(2014). link
      • Luong, Manning, et al. "Effective approaches to attention-based neural machine translation." arXiv (2015). link
      • Britz, Denny, et al. "Massive exploration of neural machine translation architectures."(2017). link
      • BPE: Sennrich, et al. "Neural machine translation of rare words with subword units." arXiv (2015). link
      • Wordpiece: Wu, Yonghui, et al. "Google's neural machine translation system: Bridging the gap between human and machine translation." arXiv (2016). link
        Blogs:
      • Distill: Olah & Carter "Attention and Augmented RNNs"(2016). link
  7. Transformers
    • Dates: 4,6 Nov.
    • Readings:
      • Vaswani et al. "Attention is all you need." (2017). link
    • (Optional) Supplementary resources and readings:
      • RoPE: Su, Jianlin, et al. "Roformer: Enhanced transformer with rotary position embedding." (2021). [arxiv]
      • Ba, Kiros, and Hinton. "Layer normalization." (2016). [link]
      • Xiong, Ruibin, et al. "On layer normalization in the transformer architecture." ICML PMLR (2020). [link]
      • Xie et al. "ResiDual: Transformer with Dual Residual Connections." (2023). [arxiv] [github]
        BERTology:
      • Devlin et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." (2019). link
      • Clark et al. "What does bert look at? an analysis of bert's attention." (2019). link
      • Rogers, Anna et al. "A primer in BERTology: What we know about how bert works." TACL(2020). link
      • Tenney et al. "BERT rediscovers the classical NLP pipeline." (2019). link
      • Niu et al. "Does BERT rediscover a classical NLP pipeline." (2022). link
      • Lewis et al. "BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension." (2019). link
      • T5: Raffel et al. "Exploring the limits of transfer learning with a unified text-to-text transformer." J. Mach. Learn. Res. 21.140 (2020). link
      • GPT3: Brown et al. "Language models are few-shot learners." (2020). link
        Attention-free models:
      • Fu, Daniel, et al. "Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture." (2023). [arxiv]. [blog].
        Token-free models:
      • Clark et al. "CANINE: Pre-training an efficient tokenization-free encoder for language representation." (2021). link
      • Xue et al. "ByT5: Towards a token-free future with pre-trained byte-to-byte models." (2022). link
        Blogs:
      • Harvard NLP. "The Annotated Transformer". link.
      • Jay Allamar. "The Illustrated Transformer". link.
  8. Acoustics and Phonetics
  9. Speech Features and Speaker Identification
    • Dates: 13,18 Nov.
    • Readings:
      • Jurafsky & Martin SLP3 (3rd ed.): Chapter 16. link
  10. Dynamic Programming for Speech Recognition
  11. Information Retrieval (IR)
  12. Text Summarization
  13. Guest Lectures on Ethics: [Module 1], [Module 2]
  14. Summary and Review (last lecture)

Tutorial materials

Assignments

Here is the ID template that you must submit with your assignments.

Head TA: Ken Shi

Extension requests: All extension requests must be made to the head TA. All undergrads should follow the FAS student absences policy. Specifically, undergrads must file an ACORN absence declaration when it is allowed, and a VOI form for extensions due to illness when it is not allowed (because an ACORN declaration has already been filed this term). Grads should always use a VOI form for extensions due to illness.

Remark requests: Please follow the remarking policy.

General Tips & F.A.Q.:

Assignment 1: Financial Sentiment Analysis

Assignment 2: Neural Machine Translation with Transformers

Assignment 3: ASR, Speakers, and Lies

News and announcements
