CS 288: Statistical Natural Language Processing

This course will explore current statistical techniques for the automatic analysis of natural (human) language data. The dominant modeling paradigm is corpus-driven statistical learning, with a split focus between supervised and unsupervised methods.

In the first part of the course, we will examine the core tasks in natural language processing, including language modeling, syntactic analysis, semantic interpretation, coreference resolution, and discourse analysis. In each case, we will discuss the underlying linguistic phenomena, the features relevant to the task, how to design efficient models that can accommodate those features, and how to learn such models. In the second part of the course, we will explore how these core techniques can be applied to user applications such as information extraction, question answering, speech recognition, machine translation, and interactive dialog systems.

Course assignments will highlight several core NLP tasks and methods. For each task, you will construct a basic system, then improve it through a cycle of linguistic error analysis and model redesign. There will also be a final project, which will investigate a single topic or application in greater depth. This course assumes a good background in basic probability and a strong ability to program in Java. Prior experience with linguistics or natural languages is helpful, but not required. There will be a lot of statistics, algorithms, and coding in this class.

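Note that M+S (Manning and Schütze, Foundations of Statistical Natural Language Processing) is free online. Also, make sure you get the purple 2nd edition of J+M (Jurafsky and Martin, Speech and Language Processing), not the white 1st edition.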

| Week | Date | Topics | Techniques | Readings | Assignments (Out) | Assignments (Due) |
|------|------|--------|------------|----------|-------------------|-------------------|
| 1 | Jan 20 | Course Introduction | | J+M 1, M+S 1-3 | HW1: Language Models | |
| 2 | Jan 25 | Words: Language Modeling | N-Grams, Smoothing | J+M 4, M+S 6, Chen & Goodman, Interpreting KN Massive Data | | |
| | Jan 27 | Words: LMs II | Smoothing, Naive Bayes | M+S 7, Event Models | | |
| 3 | Feb 1 | Words: Text Cat | | | | |
| | Feb 3 | Words: WSD | Maxent | Classification Tutorial, Maxent Tutorial 1, 2, J+M 6 | HW2: PNP Classification | HW1 |
| 4 | Feb 8 | Parts-of-Speech: Tagging | HMMs/CRFs | J+M 5, Toutanova & Manning, Brants, Brill | | |
| | Feb 10 | Parts-of-Speech: Induction | EM | J+M 6, M+S 9-10, HMM Learning, Distributional Clustering, Johnson | | |
| 5 | Feb 15 | NO CLASS | | | | |
| | Feb 17 | Speech Recognition | Speech Signal | J+M 7 | HW3: POS Tagging | HW2 |
| 6 | Feb 22 | Speech Recognition II | Acoustic Modeling | J+M 9 | | |
| | Feb 24 | Interlude: Competitive Parsing | | | | |
| 7 | Mar 1 | Interlude: Competitive Parsing | | | | |
| | Mar 3 | Syntax: PCFGs | | M+S 3.2, 12.1, J+M 11 | | |
| 8 | Mar 8 | Syntax: Algorithms | | M+S 11, J+M 12, Best-First, A*, K-best | HW4: Parsing | HW3 |
| | Mar 10 | Syntax: Richer Models | | Unlexicalized, Split, Lexicalized | | |
| 9 | Mar 15 | NO CLASS | | | | |
| | Mar 17 | Syntax: Grammar Induction | | | | |
| 10 | Mar 22 | Spring Break | | | | |
| | Mar 24 | Spring Break | | | | |
| 11 | Mar 29 | Machine Translation I | Word-Based Models | J+M 25, IBM Models, HMM Agreement Discriminative, Decoding | | |
| | Mar 31 | Machine Translation II | Phrase-Based Systems | Decoding, Learning Phrases | FP Guidelines | HW4 |
| 12 | Apr 5 | Machine Translation III | Syntactic Systems | GHKM, Vs Phrases, Decoding | HW5: Machine Translation | |
| | Apr 7 | Semantics: Roles | | J+M 16, 19 | | |
| 13 | Apr 12 | Semantics: Compositional | | Manning, J+M 18 | | |
| | Apr 14 | Semantics: Interpretation | | Parsing to LF | | |
| 14 | Apr 19 | Discourse: Coreference | | Supervised, Unsupervised, J+M 21 | | HW5 |
| | Apr 21 | Discourse: Summarization | | Topic-based, N-gram based | | |
| 15 | Apr 26 | Question Answering | | N-gram-based, Grammar-based | | |
| | Apr 28 | Diachronics | | Reconstruction | | |

FP Due May 21