Natural Language Processing and Social Interaction, Fall 2018
Prerequisites, enrollment, related classes
Prerequisites All of the following: CS 2110 or equivalent programming experience; a course in artificial intelligence or any relevant subfield (e.g., NLP, information retrieval, machine learning, Cornell CS courses numbered 47xx or 67xx); proficiency with using machine learning tools (e.g., fluency at training an SVM, comfort with assessing a classifier’s performance using cross-validation)
Enrollment Limited to [[PhD and [CS MS] students] who meet the prerequisites]. Auditing (either officially or unofficially) is not permitted.
Related classes: see Cornell's NLP course list, plus INFO 6750 Causal Inference and Design of Experiments and INFO 6310 Behavior and Information Technology.
The homepage for the previous running of CS6742 may also be useful. Here is the list of all prior runnings: 2017 fall :: 2016 fall :: 2015 fall :: 2014 fall :: 2013 fall :: 2011 spring
Administrative info and overall course structure
Course homepage http://www.cs.cornell.edu/courses/cs6742/2018fa. Main site for course info, assignments, readings, lecture references, etc.; updated frequently.
CMS page http://cms.csuglab.cornell.edu. Site for submitting assignments, unless otherwise noted.
Piazza page http://piazza.com/cornell/Fall2018/cs6742 Course announcements and Q&A/discussion site. Social interaction and all that, you know. (Access code provided on first day of classes.)
Contacting the instructor
- Office hours and contact info: see Prof. Cristian Danescu-Niculescu-Mizil's homepage
Overview of course schedule. Details subject to change. Full schedule is maintained on the main course webpage.
Lecture | Agenda | Pedagogical purpose | Assignments |
---|---|---|---|
#1 | Course overview | Pilot empirical study for a research idea based on readings provided. | |
#2 - #3 | A1 Brainstorming (Prof. DNM out) | | |
#4 - #7 | Lecture topics related to the A1 readings: Online reviews: individual expression, community dynamics; Online asynchronous conversations. | Case studies to explore some interesting topics and research styles. Get-to-know-you exercises to get everyone familiar and comfortable with each other. | |
Next block of meetings | Lectures on, potentially, linguistic coordination, linguistic adaptation, influence, persuasion, diffusion, discourse structure, advanced language modeling | Foundational material | Potentially some assignments based on the lectures. |
Next block of meetings | Discussion of proposed projects based on the readings | Practice with fast research-idea generation. Feedback as to which proposals are most interesting, most feasible, etc. | Discussion of student project proposals, based on the readings for that class meeting. Each class meeting thus involves everyone reading at least one of the two assigned papers and posting a new research proposal based on the reading to Piazza. Thoughtfulness and creativity are most important, but take feasibility into account. |
Remainder of the course | Activities related to course projects | Development of a "full-blown" research project (although time restrictions may limit ambitions). For our purposes, "interesting" is more important than "thorough". | |
Some time in December (to be determined by the registrar): final project writeup due |
Grading Of most interest are productive research-oriented discussion participation (in class and on Piazza), interesting research proposals and pilot studies, and a good-faith final research project.
Academic Integrity Academic and scientific integrity compels one to properly attribute to others any work, ideas, or phrasing that one did not create oneself. To do otherwise is fraud.
We emphasize certain points here. In this class, talking to and helping others is strongly encouraged. You may also, with attribution, use the code from other sources. The easiest rule of thumb is, acknowledge the work and contributions and ideas and words and wordings of others. Do not copy or slightly reword portions of papers, Wikipedia articles, textbooks, other students' work, Stack Overflow answers, something you heard from a talk or a conversation or saw on the Internet, or anything else, really, without acknowledging your sources. See http://www.cs.cornell.edu/courses/cs6742/2011sp/handouts/ack-others.pdf and http://www.theuniversityfaculty.cornell.edu/AcadInteg/ for more information and useful examples.
This is not to say that you can receive course credit for work that is not your own, e.g., taking someone else's report and putting your name at the top, next to the other person(s)' names. However, violations of academic integrity (e.g., fraud) undergo the academic-integrity hearing process on top of any grade penalties imposed, whereas not following the rules of the assignment only risks grade penalties.
Resources
- Webpage of the Fall 2017 offering of this course
- ACL anthology of all conferences, journals and workshops published under the aegis of the Association for Computational Linguistics; ACM digital library proceedings publication archive for WWW; AAAI proceedings archive for ICWSM
- ACL wiki of resources - corpora, datasets, tools, software, lexicons (organized by language)
- Toolkits: Cornell Conversational Analysis Toolkit (Python3) :: CMU twitter tools (Java) :: GATE (Java) :: Illinois tools (Java?) :: Lingpipe (Java) :: Mallet (Java) :: OpenNLP (Java) :: NLTK (Python) :: Stanford tools (Java) :: tm (R)
- NLP at Cornell
#1 Aug 23: Course overview: scope, course goals, course design
Details will appear here before each lecture.
Assignment A1 released
Student-information assignment released: see handout
Class images, links and handouts
- Handout
- Inspirational image: The_School_of_Athens
- An Honest Facebook Political Argument: hypothetical comment thread with re-entry
- Wikipedia Article for Deletion discussion ("not a vote"); annotated version; Wikipedia essay on arguments to avoid in deletion discussions; notabilia.net visualization of vote dynamics on selected AfD discussions
- Poster depicting expansionary ("guestbook") vs. focused ("repeated-engagement") conversation threads
- Politeness web app and a particular instance of it in action
- whichlight visualization of a reddit conversation thread
Datasets
References
- Althoff, Tim, Cristian Danescu-Niculescu-Mizil, Dan Jurafsky. 2014. How to ask for a favor: A case study on the success of altruistic requests. ICWSM, pp. 12–21.
- Backstrom, Lars, Jon Kleinberg, Lillian Lee, and Cristian Danescu-Niculescu-Mizil. 2013. Characterizing and curating conversation threads: Expansion, focus, volume, re-entry. WSDM, pp. 13–22.
- Brown, Penelope and Stephen C. Levinson. 1987. Politeness: Some Universals in Language Usage. Reissued with new introduction by Cambridge University Press
- Bryan, Christopher J., Gregory M. Walton, Todd Rogers, and Carol S. Dweck. 2011. Motivating voter turnout by invoking the self. Proceedings of the National Academy of Sciences 108 (31): 12653-12656.
- Chong, Dennis and James N. Druckman. 2007. Framing theory. Annual Review of Political Science 10:103–26.
- Hopkins, Daniel J. 2017. The exaggerated life of death panels? The limited but real influence of elite rhetoric in the 2009–2010 health care debate. Political Behavior. [official link] ["ungated" version]
- Danescu-Niculescu-Mizil, Cristian, Moritz Sudhof, Dan Jurafsky, Jure Leskovec, Christopher Potts. 2013. A computational approach to politeness with application to social factors. ACL, pp. 250–259.
- Taraborelli, Dario and Giovanni Luca Ciampaglia. Beyond notability. Collective deliberation on content inclusion in Wikipedia. Second international workshop on quality in techno-social systems, pp. 122-125. [alt link]
#2 Aug 28: No Lecture: Prof. DNM out
#3 Aug 30: A1 Brainstorming with Jonathan P. Chang
#4 Sep 4: Types and properties of conversations
Class images, links and handouts
- Dialectic vs eristic (Blount, Millard, Weal 2014, from the 14th workshop on Computational Models of Natural Argument)
Datasets
- Cornell Conversational Analysis Toolkit (Convokit): Includes several large conversational datasets and tools to process them.
- UBC BC3 Blog Corpus: 7000 blog conversations with user-labeled comments from 6 popular websites (Slashdot, Macrumors, AndroidCentral, Dailykos, BusinessInsider, TSN). Slashdot includes "Funny" tags.
- CORPS: corpus of political speeches tagged with specific audience reactions, such as APPLAUSE or LAUGHTER.
- Intelligence Squared Debate Dataset: a collection of public debates with metadata (audience voting results pre- and post-debate, and audience reaction markers)
- Supreme Court Dialog Corpus: a collection of conversations from the U.S. Supreme Court Oral Arguments (http://www.supremecourt.gov/oral_arguments/) with metadata. Includes "laughter".
- HCRC Map Task Corpus: 128 dialogues recorded, transcribed, and annotated for a wide range of behaviours. It references other related corpora:
The DCIEM Map Task Corpus uses very similar materials to the HCRC Map Task Corpus, but with a different structure designed to test the effects of sleep deprivation under a number of pharmaceutical conditions. The subjects were Canadian army reservists. The Map Task Corpus has been replicated in whole or in part in a number of languages including Dutch, Italian, Japanese, Swedish, Occitan, and Portuguese. It has also been replicated in part for other forms of English besides the original Glaswegian speakers, including American English, Australian English, and some urban British dialects. The Occitan site has a list of some other language replications. The Map Task has been used to test the effects of many conditions on human communication, including stuttering, computer mediation, textual communication, and the use of avatars.
#5 Sep 6: A1 check-ins, Instrumentation, Conversational Structure
- Gonzalez-Bailon, Sandra, Andreas Kaltenbrunner, and Rafael E. Banchs. 2010. The structure of political discussion networks: A model for the analysis of online deliberation. Journal of Information Technology 25(2): 230-243. [author-posted version]
- Zhang, Justine, Cristian Danescu-Niculescu-Mizil, Christy Sauper, and Sean Taylor. 2018. Characterizing online public discussions through patterns of participant interactions. Proceedings of CSCW.
- Sample conversations:
* Slack. Image from Fortune.com
* Slashdot (useful to look at in conjunction with Wikipedia explanation of Slashdot moderation), with a txt version from the UBC BC3 Blog corpus.
* Reddit. Online thread visualizer by Kawandeep Virdee and different types of threads.
#6 Sep 11: From monologues to conversations; Case study: from hypothesis to research (Coordination)
Class images, links and handouts
References
- Fay, Nicolas, Simon Garrod, and Jean Carletta. 2000. Group discussion as interactive dialogue or as serial monologue: The influence of group size. Psychological Science 11(6): 481-486.
- Related quote: "There is no such thing as conversation. There are intersecting monologues, that's all". Rebecca West's short story, "There is no conversation".
- Pickering, Martin J. and Simon Garrod. 2004. Toward a mechanistic psychology of dialogue. Behavioral and Brain Sciences 27(02): 169-190. [alt link]
- Danescu-Niculescu-Mizil, Cristian, Michael Gamon, and Susan Dumais. 2011. Mark my words! Linguistic style accommodation in social media. In Proceedings of WWW.
- Levelt, Willem J M and Stephanie Kelter. 1982. Surface form and memory in question answering. Cogn Psychol 14 (1): 78 - 106.
- Giles, Howard, Justine Coupland, and Nikolas Coupland. 1991. Accommodation theory: Communication, context, and consequence. In Contexts of Accommodation: Developments in Applied Sociolinguistics. Cambridge Univ Pr.
- Gonzales, Amy L., Jeffrey T. Hancock, and James W. Pennebaker. 2010. Language style matching as a predictor of social dynamics in small groups. Communication Research 37 (1): 3-19.
- Feng, S, R Banerjee, and Y Choi. 2012. Characterizing stylistic elements in syntactic structure. Proceedings of EMNLP.
- Bramsen, Philip, Martha Escobar-Molana, Ami Patel, and Rafael Alonso. 2011. Extracting social power relationships from natural language. Proceedings of ACL HLT.
#7 Sep 13: Social aspects of coordination; Second case study (Socialization)
- Upcoming deadlines: A1 writeup and presentations due next week
Class images, links and handouts
Lecture references
- Bell, Allan. 1984. Language style as audience design. Language in Society 13(2): 145-204.
- Danescu-Niculescu-Mizil, Cristian, Lillian Lee, Bo Pang, and Jon Kleinberg. 2012. Echoes of power: Language effects and power differences in social interaction. WWW, pp. 699--708. [ACM link] [paper "homepage" (paper, slides, data, etc.)]
- Danescu-Niculescu-Mizil, Cristian and Lillian Lee. 2011. Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs. Proceedings of the ACL Workshop on Cognitive Modeling and Computational Linguistics.
- Daniel M. Romero, Roderick I. Swaab, Brian Uzzi, Adam D. Galinsky. 2015. Mimicry Is Presidential: Linguistic Style Matching in Presidential Debates and Improved Polling Numbers
- Tim Althoff, Kevin Clark, Jure Leskovec. 2016. Large-scale Analysis of Counseling Conversations: An Application of Natural Language Processing to Mental Health. TACL.
- Danescu-Niculescu-Mizil, Cristian, Robert West, Dan Jurafsky, Jure Leskovec, and Christopher Potts. 2013. No country for old members: User lifecycle and linguistic change in online communities. WWW, pp. 307--318. Best paper award. [ACM link] [paper "homepage"]
- Doyle, Gabriel, Amir Goldberg, Sameer B. Srivastava, and Michael C. Frank. 2017. Alignment at work: Accommodation and enculturation in corporate communication. ACL, 604--612.
#8 Sep 18: No Lecture: Prof. DNM out
#9 Sep 20: A1 presentations (fun! fun! fun!)
#10 Sep 25: From hypothesis to research: Second case study (Socialization)
- Inspiration-readings assignments released
Class images, links and handouts
Lecture references
- Danescu-Niculescu-Mizil, Cristian, Robert West, Dan Jurafsky, Jure Leskovec, and Christopher Potts. 2013. No country for old members: User lifecycle and linguistic change in online communities. WWW, pp. 307--318. [ACM link] [paper "homepage"]
- Doyle, Gabriel, Amir Goldberg, Sameer B. Srivastava, and Michael C. Frank. 2017. Alignment at work: Accommodation and enculturation in corporate communication. ACL, 604--612.
#11 Sep 27: News, influence and information propagation, part 1
Class images, links and handouts
- MemeTracker visualization (uses Flash), including variants of the "Lipstick on a pig" quote
- QUOTUS visualization: how media outlets quote the President.
Lecture references
- Bakshy, Eitan, Jake M. Hofman, Winter A. Mason, and Duncan J. Watts. 2011. Everyone's an influencer: Quantifying influence on Twitter. WSDM, 65--74.
- Friggeri, Adrien, Lada A. Adamic, Dean Eckles, and Justin Cheng. 2014. Rumor cascades. ICWSM.
- Leskovec, Jure, Lars Backstrom, and Jon Kleinberg. 2009. Meme-tracking and the dynamics of the news cycle. KDD, pp. 497--506. [paper "homepage"]
- Niculae, Vlad, Caroline Suen, Justine Zhang, Cristian Danescu-Niculescu-Mizil, and Jure Leskovec. 2015. QUOTUS: The structure of political media coverage as revealed by quoting patterns. WWW, 798-808. [paper "homepage"]
- Rotabi, Rahmtin, Cristian Danescu-Niculescu-Mizil, and Jon Kleinberg. 2017. Competition and selection among conventions. Conference on World Wide Web, 1361-1370. [paper "homepage"]
- Matei, Sorin Adam. Is two-step flow theory still relevant for social media research? (accessed September 26, 2018).
- Prabhumoye, Shrimai, Samridhi Choudhary, Evangelia Spiliopoulou, Christopher Bogart, Carolyn Penstein Rosé, and Alan W. Black. 2017. Linguistic markers of influence in informal interactions. In the Workshop on Natural Language Processing and Computational Social Science, 53--62.
- Simmons, Matthew P., Lada A. Adamic, and Eytan Adar. 2011. Memes online: Extracted, subtracted, injected, and recollected. ICWSM, pp. 353--360.
- Tan, Chenhao, Adrien Friggeri, and Lada A. Adamic. 2016. Lost in propagation? Unfolding news cycles from the source. In ICWSM, 378-387. [paper homepage]
- Tan, Chenhao, Lillian Lee, and Bo Pang. 2014. The effect of wording on message propagation: Topic- and author-controlled natural experiments on Twitter. In ACL, pp. 175--185. [paper "homepage"]
- Tsur, Oren and Ari Rappoport. 2015. Don't let me be #misunderstood: Linguistically motivated algorithm for predicting the popularity of textual memes. In ICWSM.
- Watts, Duncan J. and Peter Sheridan Dodds. December 2007. Influentials, networks, and public opinion formation. Journal of Consumer Research 34(4): 441-458.
Other references
- "Special Report With Brit Hume", September 10, 2008: panel-discussion transcript regarding Obama's "lipstick on a pig" utterance
#12 Oct 2: Proposals discussion (A2)
The readings
- Murgia, Alessandro, Daan Janssens, Serge Demeyer, and Bogdan Vasilescu. 2016. Among the machines: Human-bot interaction on social Q&A websites. In CHI Extended Abstracts, 1272-1279. [author-posted version]
- Sap, Maarten, Marcella Cindy Prasettio, Ari Holtzman, Hannah Rashkin, and Yejin Choi. 2017. Connotation frames of power and agency in modern films. EMNLP, 2319-2324. [paper "homepage" with connotation frames data and query interface]
#13 Oct 4: News, influence and information propagation, part 2
Lecture references
- Bakshy, Eitan, Jake M. Hofman, Winter A. Mason, and Duncan J. Watts. 2011. Everyone's an influencer: Quantifying influence on Twitter. WSDM, 65--74.
- Friggeri, Adrien, Lada A. Adamic, Dean Eckles, and Justin Cheng. 2014. Rumor cascades. ICWSM.
- Matei, Sorin Adam. Is two-step flow theory still relevant for social media research? (accessed September 26, 2018).
- Prabhumoye, Shrimai, Samridhi Choudhary, Evangelia Spiliopoulou, Christopher Bogart, Carolyn Penstein Rosé, and Alan W. Black. 2017. Linguistic markers of influence in informal interactions. In the Workshop on Natural Language Processing and Computational Social Science, 53--62.
- Simmons, Matthew P., Lada A. Adamic, and Eytan Adar. 2011. Memes online: Extracted, subtracted, injected, and recollected. ICWSM, pp. 353--360.
- Tan, Chenhao, Adrien Friggeri, and Lada A. Adamic. 2016. Lost in propagation? Unfolding news cycles from the source. In ICWSM, 378-387. [paper homepage]
- Danescu-Niculescu-Mizil, Cristian, Justin Cheng, Jon Kleinberg, and Lillian Lee. 2012. You had me at hello: How phrasing affects memorability. ACL, pp. 892-901.
- Tan, Chenhao, Lillian Lee, and Bo Pang. 2014. The effect of wording on message propagation: Topic- and author-controlled natural experiments on Twitter. In ACL, pp. 175--185. [paper "homepage"]
- Tsur, Oren and Ari Rappoport. 2015. Don't let me be #misunderstood: Linguistically motivated algorithm for predicting the popularity of textual memes. In ICWSM.
- Watts, Duncan J. and Peter Sheridan Dodds. December 2007. Influentials, networks, and public opinion formation. Journal of Consumer Research 34(4): 441-458.
Other references
- "Special Report With Brit Hume", September 10, 2008: panel-discussion transcript regarding Obama's "lipstick on a pig" utterance
#14 Oct 11: Proposals discussion (A3)
A5, the final-project proposal assignment, has been released. Note the multiple phases and due-dates.
The readings
- Chandrasekharan, E., Samory, M., Jhaver, S., Charvat, H., Bruckman, A., Lampe, C., Eisenstein, J., Gilbert, E. 2018. The Internet’s Hidden Rules: An Empirical Study of Reddit Norm Violations at Micro, Meso, and Macro Scales. Proceedings of CSCW.
- Sap, Maarten, Marcella Cindy Prasettio, Ari Holtzman, Hannah Rashkin, and Yejin Choi. 2017. Connotation frames of power and agency in modern films. EMNLP, 2319-2324. [paper "homepage" with connotation frames data and query interface]
#15 Oct 16: Proposals discussion (A4)
The readings
- Nguyen, Dong, Elijah Mayfield, and Carolyn P Rosé. 2010. An analysis of perspectives in interactive settings. In Proceedings of the First Workshop on Social Media Analytics, 44-52. [alt link]
- Sudhof, Moritz, Andrés Goméz Emilsson, Andrew L Maas, and Christopher Potts. Sentiment expression conditioned by affective transitions and social forces. In Proceedings of KDD
#16 Oct 18: (Breaking) conversation rules
Class images, links and handouts
References
- Pickering, Martin J. and Simon Garrod. 2004. Toward a mechanistic psychology of dialogue. Behavioral and Brain Sciences 27(02): 169-190. [alt link]
- Grice, H.P. 1975. Logic and Conversation. In Cole et al., Syntax and Semantics 3: Speech Acts. and 1978.
- Galantucci, Bruno and Gareth Roberts. 2014. Do we notice when communication goes awry? An investigation of people's sensitivity to coherence in spontaneous conversation. PLoS ONE 9(7).
- Langer, Ellen J, Arthur Blank, and Benzion Chanowitz. 1978. The mindlessness of ostensibly thoughtful action: The role of "placebic" information in interpersonal interaction. J Pers Soc Psychol 36 (6): 635.
- Rogers, Todd and Michael I. Norton. 2011. The artful dodger: Answering the wrong question the right way. Journal of Experimental Psychology: Applied 17 (2). [alt link]
- Zhang, Justine, Cristian Danescu-Niculescu-Mizil, Christy Sauper, and Sean Taylor. 2018. Characterizing online public discussions through patterns of participant interactions. Proceedings of CSCW.
- Danescu-Niculescu-Mizil, Cristian, Moritz Sudhof, Dan Jurafsky, Jure Leskovec, Christopher Potts. 2013. A computational approach to politeness with application to social factors. ACL, pp. 250–259.
#17 Oct 25: Advanced yet “off-the-shelf” features roundup
Assignments/announcements
Class images, links and handouts
- Louis, Annie and Ani Nenkova. 2013. What makes writing great? First experiments on article quality prediction in the science journalism domain. Transactions of the Association for Computational Linguistics 1:341-352.
- Flesch, Rudolf. June 1948. A new readability yardstick. Journal of Applied Psychology 32(3): 221-33. [Alternative link: the paper is bundled in the collection The Classic Readability Studies, ed. William H. DuBay. Published as Unlocking Language: The Classic Studies in Readability, BookSurge Publishing, 2007.]
- MRC Psycholinguistic database. Wilson, M.D. (1988) The MRC Psycholinguistic Database: Machine Readable Dictionary, Version 2. Behavioural Research Methods, Instruments and Computers, 20(1), 6-11.
- Sentiment/subjectivity lexicons: Connotation Lexicon, MPQA lexica (goodFor/badFor, +/-affect, arguing, subjectivity), opinion lexicon, SentiWordNet. Financial sentiment (2014 version). See also the ...
- Multi-category lexicons: Harvard General Inquirer. The LIWC lexicon (2015 version). The NRC lexicons.
- Danescu-Niculescu-Mizil, Cristian, Moritz Sudhof, Dan Jurafsky, Jure Leskovec, Christopher Potts. 2013. A computational approach to politeness with application to social factors. ACL, pp. 250–259.
References
- Concreteness ratings. Brysbaert, M., Warriner, A.B., & Kuperman, V. (2014). Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods 46:904-911.
- Valence, arousal, dominance ratings. Warriner, A.B., Kuperman, V., & Brysbaert, M. (2013). Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods 45:1191-1207.
- Hedge-annotated data. Farkas, Richárd, Veronika Vincze, György Móra, János Csirik, and György Szarvas. 2010. The CONLL-2010 shared task: Learning to detect hedges and their scope in natural language text. Fourteenth Conference on Computational Natural Language Learning---Shared Task, pp. 1-12.
#18 Oct 30: What makes two sub-languages different?
Class images, links and handouts
Image source: http://www.keepcalm-o-matic.co.uk/p/keep-calm-and-never-tell-me-the-odds-6/.
- Handout
- Slides adapted from the relevant section of Cristian Danescu-Niculescu-Mizil and Lillian Lee, 2016. Natural Language Processing for Computational Social Science. Invited Tutorial at NIPS.
- Percy Liang and Dan Klein. 2007. Structured Bayesian nonparametric models with variational inference. Tutorial. We started at slide 11.
- Monroe, Burt L., Michael P. Colaresi, and Kevin M. Quinn. 2008. Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis 16(4): 372-403. [alt link]
Lecture references
- Kenneth W. Church. 2000. Empirical estimates of adaptation: The chance of two Noriegas is closer to p/2 than p^2. COLING.
- Liberman, Mark. Jan 3, 2016. The case of the disappearing determiners. Language Log blog post.
- Kleinberg, Jon. 2004. Temporal dynamics of on-line information streams. In Data Stream Management: Processing High-speed Data Streams, published in 2016. [Author-posted 2004 preprint] See section "two-point trends" (pg. 28ff)
- Implementations:
  - Hessel, Jack (who took this class!). FightingWords.
  - Lim, Kenneth (who took this class!). fightin-words 1.0.4. Compliant with scikit-learn and distributed by PyPI; borrows (with acknowledgment) from Jack's version.
  - Marzagão, Thiago. mcq.py
- Fredette, Marc and Jean-François Angers. 2002. A new approximation of the posterior distribution of the log-odds ratio. Statistica Neerlandica 56(3): 314-329. [Author's institution link] Attributes the Monroe et al. fact to Chapter 10 of O'Hagan's Kendall's Advanced Theory of Statistics, vol. 2B. The little-o analysis of the error appears in section 2.1 of Newson, 2008, Asymptotic distributions of linear combinations of logs of multinomial parameter estimates. Wikipedia attributes the approximation of the confidence interval to a 1988 article that cites a 1978 Biometrika article.
- Liberman, Mark. The most Kasichoid, Cruzian, Trumpish, and Rubiositous words, 2016. The most Trumpish (and Bushish) words, 2015. Obama's favored (and disfavored) SOTU words, 2014. Draft words (descriptions of white vs black NFL prospects), 2014. Male and female word usage, 2014.
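For readers who want a feel for the kind of comparison the Monroe et al. (2008) reference makes, here is a minimal sketch of a z-scored log-odds ratio between two token lists. It is not any of the implementations listed above: the function name `log_odds_z_scores` and the `prior_strength` parameter are hypothetical, and a uniform Dirichlet prior is assumed purely for brevity, whereas an informative prior estimated from a background corpus may be preferable in practice.

```python
import math
from collections import Counter


def log_odds_z_scores(tokens_a, tokens_b, prior_strength=0.01):
    """z-scored log-odds ratio of each word between two corpora.

    Sketch in the spirit of Monroe et al. (2008). Assumption: a uniform
    Dirichlet prior adding `prior_strength` pseudo-counts to every word.
    The vocabulary must contain at least two word types.
    """
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    vocab = set(counts_a) | set(counts_b)
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    a0 = prior_strength * len(vocab)  # total prior mass over the vocabulary

    z = {}
    for w in vocab:
        y_a = counts_a[w] + prior_strength  # smoothed count in corpus A
        y_b = counts_b[w] + prior_strength  # smoothed count in corpus B
        # Difference between the smoothed log-odds of w in A and in B.
        delta = (math.log(y_a / (n_a + a0 - y_a))
                 - math.log(y_b / (n_b + a0 - y_b)))
        # Approximate variance of that difference.
        variance = 1.0 / y_a + 1.0 / y_b
        z[w] = delta / math.sqrt(variance)
    return z
```

Positive scores mark words relatively over-used in the first corpus, negative scores words over-used in the second; dividing by the standard deviation keeps rare words from dominating the ranking.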
Nov 6, Nov 8: No class — CSCW
#21 Nov 13: N-Gram Language Models
Final project writeup due Thursday Dec. 13, 4:30pm (date determined by the registrar). Submit both your presentation materials and your final project writeup; but don't spend time post-editing your presentation materials after the fact, as I will only be using them as a reference while evaluating your writeup.
The main evaluation criteria will be the reasonableness (in approach and amount of effort), thoughtfulness, and creativity of what you tried, as documented in your writeup. Individual effort within team projects will be taken into account; see item 3 below.
1. Use the ICWSM style files provided by AAAI (LaTeX style and bib files, Word template).
   a. We make this requirement to facilitate submission to ICWSM 2019. However, note that your final-project submission should have your names and acknowledgments included, in a particular format (see items 1c and 2b below); in contrast, you will want to strip any identifying information for ICWSM submissions.
   b. AAAI prefers non-numbered section headings. You may change the style files to include section numbers in your headings for the purposes of CS6742 submission.
   c. For the author heading, list only the names of your teammates that are enrolled in the class, even if you had external collaborators. (Reason: only students in the class are submitting the paper for a grade.) But see item 2b1 below.
2. Include the following sections:
   a. "Content" sections: abstract, introduction/motivation (broad question), data description (how you gathered, cleaned, and processed it), methods (discuss operationalization of the high-level question), experiments (highlight which controls you've done and why they are needed), related work, references, conclusions (what you learned), directions for future work.
      1. Make sure that your introduction section explicitly sets out your hypothesis or hypotheses.
      2. Throughout, highlight your most interesting findings (positive or negative).
      3. For the purposes of CS6742 submission, your related-work section does not need to be exhaustive; you may cover just a few most-related papers.
   b. An "acknowledgments" section: give the name and state the contribution of those from whom you received significant help. (This may or may not include your advisor(s), your instructor, fellow students in the class.)
      1. Authorship statement: if you intend to ask, or have already arranged, to have people other than your CS6742-enrolled teammates as co-authors, also name each such person.
3. Projects done collaboratively must also include a section describing who did what. External collaborators should be included in this enumeration.
4. Use the number of pages you feel is appropriate.
Class images, links and handouts
- Quote memorability quiz
- MIT Language Modeling Toolkit
- SRILM - The SRI Language Modeling Toolkit
- N-gram language models in Python
- Mention of the bug in NLTK (the point being that language modeling can actually be quite subtle)
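To make the "language modeling can actually be quite subtle" point concrete, here is a minimal sketch of a bigram model with add-one (Laplace) smoothing, written from scratch rather than with any of the toolkits above. The function names and the `<s>`/`</s>` padding symbols are choices made for this illustration only, and add-one is used because it is the simplest scheme to state, not because it is recommended.

```python
from collections import Counter


def train_bigram_counts(sentences):
    """Count contexts and bigrams, padding each sentence with <s> and </s>."""
    context_counts, bigram_counts = Counter(), Counter()
    vocab = {"</s>"}  # events the model can predict; <s> is never predicted
    for sentence in sentences:
        tokens = ["<s>"] + sentence + ["</s>"]
        vocab.update(sentence)
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts, vocab


def add_one_prob(prev, cur, context_counts, bigram_counts, vocab):
    """P(cur | prev) with add-one smoothing over the predictable vocabulary."""
    return (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + len(vocab))


if __name__ == "__main__":
    corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
    contexts, bigrams, vocab = train_bigram_counts(corpus)
    # Sanity check: the conditional distribution given "the" should sum to 1.
    total = sum(add_one_prob("the", w, contexts, bigrams, vocab) for w in vocab)
    print(round(total, 12))
```

Checking that each conditional distribution sums to one is a cheap way to catch mistakes in which symbols are counted as predictable events, which is one place where smoothing code can go quietly wrong.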
Lecture references
- Chen, Stanley F. and Joshua Goodman. 1996. An empirical study of smoothing techniques for language modeling. In Proceedings of the 34th Annual Meeting on Association for Computational Linguistics, 310-318. More detailed technical report version (recommended)
- Danescu-Niculescu-Mizil, Cristian, Justin Cheng, Jon Kleinberg, and Lillian Lee. 2012. You had me at hello: How phrasing affects memorability. Proceedings of the ACL, pp. 892--901.
- F. Jelinek, R.L. Mercer and S. Roukos. Principles of Lexical Language Modeling for Speech Recognition. Advances in Speech Signal Processing, S. Furui and J. Sondhi, Eds. M. Dekker Publishers, New York, NY 1991. pp. 651-700.
- Gale, William A. and Kenneth W. Church. 1994. What's wrong with adding one. Corpus-based Research Into Language: In Honour of Jan Aarts, pp. 189--200.
#22 Nov 15: Mandatory project progress-and-problems appointments
By 2pm on Wednesday, post a Piazza followup to your proposal that summarizes your progress and the discussion points or problems you'd like to bring up with me. Ideally, this followup post will be the agenda for your team's appointment, and will make the meeting efficient and useful for you.
#23 Nov 20: Entropy
Assignments/announcements
Nov 22: No class — Thanksgiving Break
#24 Nov 27: Cross-entropy and divergence, Language models in practice
- Fu, Liye, Cristian Danescu-Niculescu-Mizil, and Lillian Lee. 2016. Tie-breaker: Using language models to quantify gender bias in sports journalism. In Proceedings of the IJCAI Workshop on NLP Meets Journalism. (CS6742 class project.) New York Times writeup and visualization.
- Juola, Patrick and Harald R. Baayen. A controlled-corpus experiment in authorship identification by cross-entropy. Literary and Linguistic Computing 20 (Suppl 1): 59-67.
- Purohit, Hemant, Yiye Ruan, David Fuhry, Srinivasan Parthasarathy and Amit Sheth. 2014. On understanding the divergence of online social group discussion. Proceedings of ICWSM
- Tran, Trang and Mari Ostendorf. 2016. Characterizing the language of online communities and its relation to community reception. In EMNLP
- Genzel, Dmitriy and Eugene Charniak. 2002. Entropy rate constancy in text. ACL, pp. 199--206.
- Doyle, Gabriel and Michael C Frank. 2015. Audience size and contextual effects on information density in Twitter conversations. Proceedings of the ACL Workshop on Cognitive Modeling and Computational Linguistics (CMCL), pp. 19--28.
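As a reminder of the quantities in this lecture's title, here is a small sketch of entropy, cross-entropy, and KL divergence for discrete distributions stored as dictionaries. The function names are chosen for this illustration, and the distributions are assumed to be already normalized; estimating them from text (with smoothing) is a separate step not shown here.

```python
import math


def entropy(p):
    """H(p) = -sum_x p(x) log2 p(x), in bits."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)


def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log2 q(x); infinite if q gives zero mass to an event p can produce."""
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0:
            return math.inf
        total -= px * math.log2(qx)
    return total


def kl_divergence(p, q):
    """D(p || q) = H(p, q) - H(p); non-negative and not symmetric in p and q."""
    return cross_entropy(p, q) - entropy(p)
```

The infinite value returned when `q` assigns zero probability to something `p` can produce is one reason language models used for cross-entropy comparisons are smoothed.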
#25 Nov 28: Project presentations (attendance by all is mandatory)
Aim for 10-15 minute presentations (include results, challenges and questions). Class participation is important.
#26 Dec 4: Final lecture
Code for generating the calendar formatting adapted from the original versions created by Andrew Myers