Natural Language Processing and Social Interaction, Fall 2017

Prerequisites All of the following: CS 2110 or equivalent programming experience; a course in artificial intelligence or any relevant subfield (e.g., NLP, information retrieval, machine learning, Cornell CS courses numbered 47xx or 67xx); proficiency with using machine learning tools (e.g., fluency at training an SVM, comfort with assessing a classifier’s performance using cross-validation)

Enrollment Limited to [[PhD and [CS MS] students] who meet the prerequisites]. Auditing (either officially or unofficially) is not permitted.

Related classes: see Cornell's NLP course list, plus GOVT 6461, Public Opinion [the 2012 syllabus; time, location, some material, and paper coverage differ in fall 2017] and COMM 6750, Research methods for social networks and social media.

The homepage for the previous running of CS6742 may also be useful. Here is the list of all prior runnings: 2016 fall :: 2015 fall :: 2014 fall :: 2013 fall :: 2011 spring

Administrative info

CMS page http://cmsx.csuglab.cornell.edu. Site for submitting assignments, unless otherwise noted. You may find this graphically-oriented guide to common operations useful: see how to replace a prior submission (point 1), how to tell if CMS successfully received your files (point 2), how to form a group (point 4).

Course discussion site https://blogs.cornell.edu/nlpsoc2017fa (access restricted to enrolled students). Course announcements and Q&A/discussion site. Social interaction and all that, you know.

Office hours and contact info See Prof. Lee's homepage and scroll to the section on Contact and availability info.

Grading Of most interest to me are productive research-oriented discussion participation (in class and/or on the course discussion site), interesting research proposals and pilot studies, and a good-faith final research project.

Academic Integrity Academic and scientific integrity compels one to properly attribute to others any work, ideas, or phrasing that one did not create oneself. To do otherwise is fraud.

Certain points deserve emphasis here. In this class, talking to and helping others is strongly encouraged. You may also, with attribution, use code from other sources. The easiest rule of thumb is, acknowledge the work and contributions and ideas and words and wordings of others. Do not copy or slightly reword portions of papers, Wikipedia articles, textbooks, other students' work, Stack Overflow answers, something you heard in a talk or a conversation or saw on the Internet, or anything else, really, without acknowledging your sources. See "Acknowledging the Work of Others" in The Essential Guide to Academic Integrity at Cornell and http://www.theuniversityfaculty.cornell.edu/AcadInteg/ for more information and useful examples.

This is not to say that you can receive course credit for work that is not your own — e.g., taking someone else's report and putting your name at the top, next to the other person(s)' names. However, violations of academic integrity (e.g., fraud) undergo the academic-integrity hearing process on top of any grade penalties imposed, whereas not following the rules of the assignment “only” risks grade penalties.

Overall course structure

Lecture	Agenda	Pedagogical purpose	Assignments
#1	Course overview		A1 released: pilot empirical study for a research idea based on the given readings.
#2 - #4	Lectures on topics related to the A1 readings	Case studies to explore some topics and research styles I find interesting. Get-to-know-you exercises to get everyone familiar and comfortable with each other.
Next block of meetings	Discussion of proposed projects based on the readings	Practice with fast research-idea generation. Feedback as to what proposals are most interesting, most feasible, etc.	Discussion of student project proposals, based on the readings for that class meeting. Each class meeting involves everyone reading at least one of the two assigned papers and posting a new research proposal based on the reading to the course discussion site. Thoughtfulness and creativity are most important to me, but take feasibility into account.
Next block of meetings	Lectures on, potentially, linguistic coordination, linguistic adaptation, influence, persuasion, diffusion, discourse structure, advanced language modeling.	Foundational material	Potentially some assignments based on the lectures.
Remainder of the course	Activities related to course projects	Development of a "full-blown" research project (although time restrictions may limit ambitions). For my purposes, "interesting" is more important than "thorough".

Resources

Cornell's Passkey for your web browser: "If you find yourself on a web page that has access restrictions, click on the bookmarklet icon and you will be redirected to the Cornell Web log-in screen to check for your valid Cornell affiliation.
You will be automatically led to the page you were trying to read, this time recognized for your right to gain access to the library's licensed resources."
Upcoming conference deadlines: NAACL 2018, long paper deadline Dec 15th, short paper deadline Jan 10 :: ICWSM 2018: deadline not yet announced, expected early Jan :: ACL 2018: Feb 22 :: CSCW 2018: second deadline during spring 2018 :: SIGDIAL 2018: not yet announced
All ACL conferences, journals, workshops proceedings/volumes :: WWW proceedings :: ICWSM proceedings
ACL wiki of resources — corpora, datasets, tools, software, lexicons, organized by language
Books, surveys, and tutorials: Dan Jurafsky and James Martin, 2009: Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (3rd edition draft chapters and slides) :: Jacob Eisenstein, 2017: A Technical Introduction to Natural Language Processing (book and slides) :: Cristian Danescu-Niculescu-Mizil and Lillian Lee, 2016: Natural Language Processing for Computational Social Science. Invited Tutorial at NIPS. :: Atefeh Farzindar and Diana Inkpen, 2015: NLP for Social Media (access via Cornell, review by Annie Louis) :: Yoav Goldberg, 2017: Neural Network Methods for Natural Language Processing (access via Cornell, JAIR version) :: Dong Nguyen, A. Seza Doğruöz, Carolyn P. Rosé and Franciska de Jong, 2016: Computational Sociolinguistics: A Survey. Computational Linguistics 42(3):537--593.
Toolkits, alphabetically: CMU twitter tools (Java) :: GATE (Java) :: Gensim (Python) :: Illinois tools (Java?) :: Lingpipe (Java) :: Mallet (Java) :: OpenNLP (Java) :: NLTK (Python) :: SpaCy (Cython) :: Stanford tools (Java) :: CRAN NLP tools (R)
Pretrained word embeddings: a recent list
NLP at Cornell

#1 Aug 22: Introduction

Assignment A1: Pilot empirical research study. Note the first deadline (of several) on Friday Aug. 25.

Class images, links and handouts

Handout
Inspirational image: Raphael's The School of Athens
Wikipedia Article for Deletion discussion ("not a vote"); annotated version; Wikipedia essay on arguments to avoid in deletion discussions; notabilia.net visualization of vote dynamics on selected AfD discussions
Poster depicting expansionary ("guestbook") vs. focused ("repeated-engagement") conversation threads
Slashdot conversation thread

Lecture references

Backstrom, Lars, Jon Kleinberg, Lillian Lee, and Cristian Danescu-Niculescu-Mizil. 2013. Characterizing and curating conversation threads: Expansion, focus, volume, re-entry. WSDM, pp. 13–22.
Bryan, Christopher J., Gregory M. Walton, Todd Rogers, and Carol S. Dweck. 2011. Motivating voter turnout by invoking the self. Proceedings of the National Academy of Sciences 108 (31): 12653-12656.
Chong, Dennis and James N. Druckman. 2007. Framing theory. Annual Review of Political Science 10:103–26.
Hopkins, Daniel J. 2017. The exaggerated life of death panels? The limited but real influence of elite rhetoric in the 2009–2010 health care debate. Political Behavior. [official link] ["ungated" version]
- Cf. Nyhan, Brendan, 2014. Can we have a fact-based conversation about end-of-life planning? The New York Times, Sept. 10.

#2 Aug 24: A1 inspiration: Overview of conversations

Class images, links and handouts

Gespraechsgemetzel
Image: photo of entry 106 of Ben Schott, Schottenfreude: German Words for the Human Condition (2013)

Some ways in which a conversation can go wrong: photo of part of a page in Schottenfreude
ConVis demo (Slashdot threads visualization): live demo, source code

Lecture references

Blount, Thomas, David Millard, and Mark Weal. 2014. Towards modelling dialectic and eristic argumentation on the social web. In 14th Workshop on Computational Models of Natural Argument (CMNA).
Fay, Nicolas, Simon Garrod, and Jean Carletta. 2000. Group discussion as interactive dialogue or as serial monologue: The influence of group size. Psychological Science 11(6): 481-486. [author-posted version]
- Related quote: "There is no such thing as conversation. There are intersecting monologues, that's all". Rebecca West's short story, "There is no conversation".
Gonzalez-Bailon, Sandra, Andreas Kaltenbrunner, and Rafael E. Banchs. 2010. The structure of political discussion networks: A model for the analysis of online deliberation. Journal of Information Technology 25(2): 230-243. [author-posted version]
On "forest-fire" sampling of graphs to preserve certain structural aspects:
- Leskovec, Jure, Jon Kleinberg, and Christos Faloutsos. 2005. Graphs over time: Densification laws, shrinking diameters and possible explanations. In KDD, 177-187. KDD Test of Time Award. [ACM link] [author-posted version]
  Leskovec, Jure, Jon Kleinberg, and Christos Faloutsos. Graph evolution: densification and shrinking diameters. ACM Transactions on Knowledge Discovery from Data 1(1), 2007. [arxiv link]
- Leskovec, Jure and Christos Faloutsos. 2006. Sampling from large graphs. In KDD, 631-636. [ACM link] [author-posted version]

Other references

Hoque, Enamul and Giuseppe Carenini. 2014. ConVis: A visual text analytic system for exploring blog conversations. Computer Graphics Forum 33(3):221-230 (Proceedings of EuroVis). [author-posted version]

#3 Aug 29: More A1 inspiration: discussion and persuasion

First time in the new room (Gates 344 breakout room)

Class images, links and handouts

Wondermark cartoon
Image credit: David Malki !, In which Debate is debated, Feb 21st, 2014.

Chenhao Tan's curated hedge list, which merges several pre-existing data sources; see README. Appears in Tan, Chenhao and Lillian Lee. 2016. Talk it up or play it down? (Un)expected correlations between (de-)emphasis and recurrence of discussion points in consequential US economic policy meetings. arXiv preprint arXiv:1612.06391.
Harvard General Inquirer lexicon: homepage; documentation about the categories
The LIWC lexicon, 2015 version. A standard reference: Tausczik, Yla R. and James W. Pennebaker. 2010. The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology 29(1): 24-54.
A github page with code for readability scores
ChangeMyView wiki, describing rules, regulations, rationale. Indeed, "three hours" was correct.

Lecture references

Flynn, D.J., Brendan Nyhan, and Jason Reifler. 2017. The nature and origins of misperceptions: Understanding false and unsupported beliefs about politics. Advances in Political Psychology 38:127-150. [journal page] [author-posted version]
- The manuscript regarding graphical vs. textual information: Nyhan, Brendan and Jason Reifler. The role of information deficits and identity threat in the prevalence of misperceptions. Version was dated Feb. 24, 2017 when posted. One of three 2015 winners of the Frank Prize in Public Interest Research.
Lukin, Stephanie, Pranav Anand, Marilyn Walker, and Steve Whittaker. 2017. Argument strength is in the eye of the beholder: Audience effects in persuasion. In EACL: Volume 1, Long Papers, 742-753.
McRaney, David. 2011. The Backfire Effect. Entry on the "You are Not So Smart" blog. Note the many links given.
- McRaney's post is credited as the inspiration behind this Oatmeal cartoon, You're not going to believe what I'm about to tell you, by Matthew Inman. (link is to the classroom-friendly (=profanity-filtered) version)
Paxton, Alexandra and Rick Dale. 2014. Leveraging linguistic content and debater traits to predict debate outcomes. In Cognitive Science Society.
Tan, Chenhao, Vlad Niculae, Cristian Danescu-Niculescu-Mizil, and Lillian Lee. 2016. Winning arguments: Interaction dynamics and persuasion strategies in good-faith online discussions. WWW, pp. 613–624. [ACM link] [paper "homepage" (paper, slides, data, etc.)]

#4 Aug 31: Linguistic coordination

Upcoming deadlines (default - 5pm unless otherwise noted): Friday Sept. 1, 2:30pm; Monday Sept 4

Class images, links and handouts

Slides on Asymmetric language synchronization in social interaction

Lecture references

Bell, Allan. 1984. Language style as audience design. Language in Society 13(2): 145-204.
Danescu-Niculescu-Mizil, Cristian, Lillian Lee, Bo Pang, and Jon Kleinberg. 2012. Echoes of power: Language effects and power differences in social interaction. WWW, pp. 699--708. [ACM link] [paper "homepage" (paper, slides, data, etc.)]
Danescu-Niculescu-Mizil, Cristian, Robert West, Dan Jurafsky, Jure Leskovec, and Christopher Potts. 2013. No country for old members: User lifecycle and linguistic change in online communities. WWW, pp. 307--318. Best paper award. [ACM link] [paper "homepage"]
Doyle, Gabriel, Amir Goldberg, Sameer B. Srivastava, and Michael C. Frank. 2017. Alignment at work: Accommodation and enculturation in corporate communication. ACL, 604--612.
Levelt, Willem J. M. and Stephanie Kelter. 1982. Surface form and memory in question answering. Cognitive Psychology 14 (1):78--106. [author institution-posted link]
Nguyen, Viet-An, Jordan Boyd-Graber, Philip Resnik, Deborah A. Cai, Jennifer E. Midberry, and Yuanxin Wang. 2014. Modeling topic control to detect influence in conversations using nonparametric topic models. Machine Learning 95:381--421. [author-posted version] [code] [data (local copy donated by author)]

#5 Sep 5: Real-time measurement of coordination; A1 check-ins

Life can be easier:
- View discussion-site comments in reverse-chronological order by clicking on the speech balloon in the top bar
- Cornell's Passkey for accessing restricted content in your browser. (So I will stop posting Cornell-access-specific URLs.)
Remember what we talked about sharing on the course discussion site!

References

Boyd-Graber, Jordan, David Mimno, and David Newman. 2014. Care and feeding of topic models: Problems, diagnostics, and improvements. In Handbook of Mixed Membership Models and Their Applications. [author-posted version]
Leshed, Gilly, Diego Perez, Jeffrey T. Hancock, Dan Cosley, Jeremy Birnholtz, Soyoung Lee, Poppy L. McLeod, and Geri Gay. 2009. Visualizing real-time language-based feedback on teamwork behavior in computer-mediated groups. In CHI, 537-546. [author-posted version]
Niculae, Vlad, Srijan Kumar, Jordan Boyd-Graber, and Cristian Danescu-Niculescu-Mizil. 2015. Linguistic harbingers of betrayal: A case study on an online strategy game. ACL/IJCNLP (Volume 1: Long Papers), 1650-1659. [paper "homepage"]
Tausczik, Yla R. and James W. Pennebaker. 2013. Improving teamwork using real-time language feedback. CHI, 459-468. [author-posted version]
On the cosine measure being still vulnerable to length effects:
- Notes by Lakshmi Ganesh and Navin Sivakumar from a lecture by Lillian Lee on pivoted document length normalization, Spring 2010.
- Original paper, stating that "cosine normalization tends to favor short documents in retrieval": Singhal, Amit, Chris Buckley, and Mandar Mitra. 1996. Pivoted document length normalization. SIGIR, 21--29. [author-posted version]
- Further reading can be found in the reference list of Lecture 3 of CS6740, spring 2010.
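
A toy illustration of the length effect mentioned above, offered only as a sketch: the example documents and the `cosine` helper are made up for this page and are not taken from the cited notes or paper. Repeating a document verbatim leaves its cosine score unchanged, but a document that matches the query equally well while containing more distinct other terms scores lower.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (dict-like)."""
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical one-word query and three hypothetical documents.
query = Counter("coordination".split())
short_doc = Counter("coordination matters".split())
long_doc = Counter("coordination matters in many different online settings".split())
doubled_doc = Counter(("coordination matters " * 2).split())

short_score = cosine(query, short_doc)      # one match among 2 distinct terms
long_score = cosine(query, long_doc)        # one match among 7 distinct terms
doubled_score = cosine(query, doubled_doc)  # short_doc repeated twice

print(short_score, long_score, doubled_score)
```

Both `short_doc` and `long_doc` contain the query term exactly once, yet the shorter one gets the higher score; pure repetition (`doubled_doc`) is the one kind of length difference that cosine normalization fully cancels.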

#6 Sep 7: Appointments (see email for signup link)

#7 Sep 12: A1 presentations

#8 Sep 14: News, influence and information propagation, part 1

Heads-up: final-project proposals due Fri Oct. 6 11:59pm
Inspiration-readings assignments released

Class images, links and handouts

Image source: David Malki ! Wondermark 1209: Talk and Awe

MemeTracker visualization (uses Flash), including variants of the "Lipstick on a pig" quote
QUOTUS visualization: how media outlets quote the President.

Lecture references

Bakshy, Eitan, Jake M. Hofman, Winter A. Mason, and Duncan J. Watts. 2011. Everyone's an influencer: Quantifying influence on Twitter. WSDM, 65--74.
Friggeri, Adrien, Lada A. Adamic, Dean Eckles, and Justin Cheng. 2014. Rumor cascades. ICWSM.
Leskovec, Jure, Lars Backstrom, and Jon Kleinberg. 2009. Meme-tracking and the dynamics of the news cycle. KDD, pp. 497--506. [paper "homepage"]
Niculae, Vlad, Caroline Suen, Justine Zhang, Cristian Danescu-Niculescu-Mizil, and Jure Leskovec. 2015. QUOTUS: The structure of political media coverage as revealed by quoting patterns. WWW, 798-808. [paper "homepage"]
Rotabi, Rahmtin, Cristian Danescu-Niculescu-Mizil, and Jon Kleinberg. 2017. Competition and selection among conventions. Conference on World Wide Web, 1361-1370. [paper "homepage"]

Other references

"Special Report With Brit Hume", September 10, 2008: panel-discussion transcript regarding Obama's "lipstick on a pig" utterance

#9 Sep 19: News, influence and information propagation, part 2

Class images, links and handouts

ICWSM 2011 Spinn3r dataset: "386 million blog posts, news articles, classifieds, forum posts and social media content between January 13th and February 14th"

Lecture references

Matei, Sorin Adam. Is two-step flow theory still relevant for social media research? (accessed September 19, 2017).
Prabhumoye, Shrimai, Samridhi Choudhary, Evangelia Spiliopoulou, Christopher Bogart, Carolyn Penstein Rosé, and Alan W. Black. 2017. Linguistic markers of influence in informal interactions. In the Workshop on Natural Language Processing and Computational Social Science, 53--62.
Simmons, Matthew P., Lada A. Adamic, and Eytan Adar. 2011. Memes online: Extracted, subtracted, injected, and recollected. ICWSM, pp. 353--360.
Tan, Chenhao, Adrien Friggeri, and Lada A. Adamic. 2016. Lost in propagation? Unfolding news cycles from the source. In ICWSM, 378-387. [paper homepage]
Tan, Chenhao, Lillian Lee, and Bo Pang. 2014. The effect of wording on message propagation: Topic- and author-controlled natural experiments on Twitter. In ACL, pp. 175--185. [paper "homepage"]
Tsur, Oren and Ari Rappoport. 2015. Don't let me be #misunderstood: Linguistically motivated algorithm for predicting the popularity of textual memes. In ICWSM.
Watts, Duncan J. and Peter Sheridan Dodds. December 2007. Influentials, networks, and public opinion formation. Journal of Consumer Research 34(4): 441-458.

#10 Sep 21: Proposals discussion (A2)

The readings

Murgia, Alessandro, Daan Janssens, Serge Demeyer, and Bogdan Vasilescu. 2016. Among the machines: Human-bot interaction on social Q&A websites. In CHI Extended Abstracts, 1272-1279. [author-posted version]
Sap, Maarten, Marcella Cindy Prasettio, Ari Holtzman, Hannah Rashkin, and Yejin Choi. 2017. Connotation frames of power and agency in modern films. EMNLP, 2319-2324. [paper "homepage" with connotation frames data and query interface]

Class images, links and handouts

Image source: Dorothy Gambrell, Cat and Girl: Steal This Cat and Girl

Past CS/IS 6742 projects on related topics that became publications:
- Fu, Liye, Cristian Danescu-Niculescu-Mizil, and Lillian Lee. 2016. Tie-breaker: Using language models to quantify gender bias in sports journalism. In IJCAI Workshop on NLP Meets Journalism. Best paper award. [paper homepage] [writeup in the New York Times' Upshot section]
- Schofield, Alexandra and Leo Mehr. 2016. Gender-distinguishing features in film dialogue. In NAACL-HLT Workshop on Computational Linguistics for Literature, 32-39.
Possible movie-script or -dialog sources:
- Cornell Movie-Dialogs Corpus
- Film Corpus 2.0, UCSC
- Nara dataset, described in Lasguido Nio, Sakriani Sakti, Graham Neubig, Tomoki Toda, Satoshi Nakamura. 2014. Conversation dialog corpora from television and movie scripts. Co-ordination and Standardization of Speech Databases and Assessment Techniques (COCOSDA). [author-posted version]
- Scriptbase (University of Edinburgh), described in Gorinski, Philip John and Mirella Lapata. 2015. Movie script summarization as graph-based scene extraction. In HLT-NAACL, 1066-1076.
Vincent, Alice. 2017. Tom Hiddleston criticised for 'white saviour' Golden Globes speech. The Telegraph.

Lecture references

Greene, Stephan and Philip Resnik. 2009. More than words: Syntactic packaging and implicit sentiment. NAACL, pp. 503--511.
Herring, Susan C. 2003. Gender and power in on-line communication. Chapter 9 in The Handbook of Language and Gender.
Jung, Malte F., Nikolas Martelaro, and Pamela J. Hinds. 2015. Using robots to moderate team conflict: The case of repairing violations. In ACM/IEEE International Conference on Human-Robot Interaction (HRI), 229-236. [movie]
King, Gary, Jennifer Pan, and Margaret E. Roberts. May 2013. How censorship in China allows government criticism but silences collective expression. American Political Science Review 107(02): 326-343. [paper homepage]
King, Gary, Jennifer Pan, and Margaret E. Roberts. August 2014. Reverse-engineering censorship in China: Randomized experimentation and participant observation. Science 345(6199). [paper homepage]
Memmott, Mark. 2013. It's True: 'Mistakes Were Made' Is The King Of Non-Apologies. NPR's "the two-way", May 14. Notice the reference to the term "the past exonerative", attributed to William Schneider.
Munger, Kevin. 2016. Tweetment effects on the tweeted: Experimentally reducing racist harassment. Political Behavior 39(3):629-649. [author slides] [replication data]
Munger, Kevin. 2017. Don't @ Me: Experimentally Reducing Partisan Incivility on Twitter. PolMeth 2017. [author-posted version (downloaded Sep. 20, 2017)] [author slides] [author poster]

Other references

Larson, Brian N. 2017. Gender as a variable in natural-language processing: Ethical considerations. In Workshop on Ethics in Natural Language Processing, 1-11.
Milli, Smitha and David Bamman. 2016. Beyond canonical texts: A computational analysis of fanfiction. In EMNLP, 2048--2053.
Prabhakaran, Vinodkumar and Owen Rambow. 2017. Dialog structure through the lens of gender, gender environment, and power. Dialogue and Discourse 8(2): 21-55. [arxiv version]
Varol, Onur, Emilio Ferrara, Clayton A. Davis, Filippo Menczer, and Alessandro Flammini. 2017. Online human-bot interactions: Detection, estimation, and characterization. In ICWSM.

#11 Sep 26: Words across space, community, and time

Class images, links and handouts

entry for "ard" in Urban Dictionary

Lecture references

Eisenstein, Jacob, Brendan O'Connor, Noah A. Smith, and Eric P. Xing. 2014. Diffusion of lexical change in social media. PLoS One 9(11): e113114.

Other references

Altmann, Eduardo G., Janet B. Pierrehumbert, and Adilson E. Motter. 2011. Niche as a determinant of word fate in online groups. PLoS One 6(5). doi:10.1371/journal.pone.0019009.
Garley, Matthew E. 2012. Crossing the lexicon: Anglicisms in the German hip hop community. Ph.D. Thesis, University of Illinois at Urbana-Champaign.
Nguyen, Dong, A. Seza Doğruöz, Carolyn P. Rosé, and Franciska de Jong. 2016. Computational sociolinguistics: A survey. Computational Linguistics 42(3): 537--593.
Pierrehumbert, Janet B. 2012. The dynamic lexicon. In Handbook of Laboratory Phonology. [author-posted version]

#12 Sep 28: Proposals discussion (A3)

Class images, links and handouts

Image source: English Language & Usage Stack Exchange. Click through for some interesting answers!

Lecture references (thanks to everyone for these pointers!)

Frankfurt, Harry. 1986. On bullshit. Raritan: A Quarterly Review 6(2). [a pdf]
Christian, Jon. 2013. From 'preggers' to 'pizzle': Android's bizarre list of banned words. Wired. [purported word lists]
Lukes, Steven. 2005. Power: A Radical View. Houndmills, Basingstoke, Hampshire; New York: Palgrave Macmillan.
Wakeman, Joshua. Bullshit as a problem of social epistemology. 35(1):15--38.
Weber, Max. 1922 (translation date: 1978). Class, Status, Party. In Economy and Society. Berkeley: University of California Press. [an abridged version] "In general, we understand by 'power' the chance of a man or of a number of men to realize their own will in a communal action even against the resistance of others who are participating in the action."

Other references

Lavigne, Sam and Tega Brain. 2016. Simulating Enron: The undead corpus of emails from a massive corporate fraud. Rhizome. (accessed September 28, 2017). Mentions some of the redaction process. The Good Life (an Enron simulator) "recreates the experience of receiving all 500,000 emails from the Enron email archive via a chronological timescale of the viewer's choosing".
Leber, Jessica. The Immortal Life of the Enron E-mails. MIT Technology Review.
Romero, Daniel M., Brian Uzzi, and Jon Kleinberg. 2016. Social networks under stress. In WWW, 9-20. Best paper award. [author slides]
Shoemark, Philippa, Debnil Sur, Luke Shrimpton, Iain Murray, and Sharon Goldwater. 2017. Aye or naw, whit dae ye hink? Scottish independence and linguistic identity on social media. In EACL, 1239-1248.

#13 Oct 3: (Misc.) topics and power

Class images, links and handouts

Polymath Project
Random acts of pizza subreddit

Lecture references

Althoff, Tim, Cristian Danescu-Niculescu-Mizil, and Dan Jurafsky. 2014. How to ask for a favor: A case study on the success of altruistic requests. In Proceedings of ICWSM.
Cranshaw, Justin and Aniket Kittur. 2011. The Polymath project: Lessons from a successful online collaboration in mathematics. In CHI, 1865-1874.
Kloumann, Isabel Mette, Chenhao Tan, Jon Kleinberg, and Lillian Lee. 2016. Internet collaboration on extremely difficult problems: Research versus olympiad questions on the Polymath site. In WWW, 1283-1292.
Mitra, Tanushree and Eric Gilbert. 2014. The language that gets people to give: Phrases that predict success on Kickstarter. In CSCW.
Nguyen, Viet-An, Jordan Boyd-Graber, Philip Resnik, Deborah A. Cai, Jennifer E. Midberry, and Yuanxin Wang. 2014. Modeling topic control to detect influence in conversations using nonparametric topic models. Machine Learning 95(3 - special issue on computational social science and social computing): 381--421. [author-posted version] [author "keynote-level" talk slides] [author code]
Prabhakaran, Vinodkumar, Ashima Arora, and Owen Rambow. 2014. Staying on topic: An indicator of power in political debates. EMNLP (Short Papers), 1481-1486.

#14 Oct 5: Proposals discussion (A4)

The readings

Cunha, Tiago O., Ingmar Weber, Hamed Haddadi, and Gisele L. Pappa. 2016. The effect of social feedback in a Reddit weight loss community. In the International Conference on Digital Health, 99-103. [arxiv version] [followup 2017 paper that takes reported weight loss into account]
Zhong, Changtao, Hau-wen Chang, Dmytro Karamshuk, Dongwon Lee, and Nishanth Sastry. 2017. Wearing many (social) hats: How different are your different social network personae? In ICWSM.

Lecture references

Chandrasekharan, Eshwar, Umashanthi Pavalanathan, Anirudh Srinivasan, Adam Glynn, Jacob Eisenstein, and Eric Gilbert. 2018. You can’t stay here: The efficacy of Reddit's 2015 ban examined through hate speech. In CSCW.
On the legality of scraping (H/T Charles Tong): Daniel Tunkelang Aug 2017 blog post, On hiQ v. LinkedIn; Prayag Narula Sep 2017 Forbes article, LinkedIn Vs. hiQ Ruling Casts A Long Shadow Over The Tech Industry; the ruling granting a preliminary injunction, thus preventing LinkedIn from preventing scraping of public data.

Other references

Callison-Burch, Chris, Lyle Ungar, and Ellie Pavlick. Crowdsourcing for NLP [NAACL 2015 tutorial homepage, including video] [UPenn Spring 2016 class homepage, Chris Callison-Burch and Ellie Pavlick]

Oct 10: No class — Fall Break

#15 Oct 12: Optional project-proposal appointments

You don't need to come to class unless you made an appointment; see the optional deadline described in A5.

#16 Oct 17: What makes two sub-languages different?

Class images, links and handouts

Image source: http://www.keepcalm-o-matic.co.uk/p/keep-calm-and-never-tell-me-the-odds-6/.

Handout
Slides (pptx, pdf) adapted from the relevant section of Cristian Danescu-Niculescu-Mizil and Lillian Lee, 2016. Natural Language Processing for Computational Social Science. Invited Tutorial at NIPS.
Percy Liang and Dan Klein. 2007. Structured Bayesian nonparametric models with variational inference. Tutorial. We started at slide 11.
Monroe, Burt L., Michael P. Colaresi, and Kevin M. Quinn. 2008. Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis 16(4): 372-403. [alt link]

Lecture references

Church, Kenneth W. 2000. Empirical estimates of adaptation: The chance of two Noriegas is closer to p/2 than p^2. COLING.
Liberman, Mark. Jan 3, 2016. The case of the disappearing determiners. Language Log blog post.
Kleinberg, Jon. 2004. Temporal dynamics of on-line information streams. In Data Stream Management: Processing High-speed Data Streams, published in 2016. [author-posted 2004 preprint] See the section "two-point trends" (pg. 28ff).
Implementations:
- Hessel, Jack (who took this class!). FightingWords.
- Lim, Kenneth (who took this class!). fightin-words 1.0.4. Compliant with scikit-learn and distributed by PyPI; borrows (with acknowledgment) from Jack's version.
- Marzagão, Thiago. mcq.py
Fredette, Marc and Jean-François Angers. 2002. A new approximation of the posterior distribution of the log-odds ratio. Statistica Neerlandica 56(3): 314-329. [author's institution link] Attributes the Monroe et al. fact to Chapter 10 of O'Hagan's Kendall's Advanced Theory of Statistics, vol. 2B. The little-o analysis of the error appears in section 2.1 of Newson, 2008, Asymptotic distributions of linear combinations of logs of multinomial parameter estimates. Wikipedia attributes the approximation of the confidence interval to a 1988 article that cites a 1978 Biometrika article.
Liberman, Mark. The most Kasichoid, Cruzian, Trumpish, and Rubiositous words, 2016. The most Trumpish (and Bushish) words, 2015. Obama's favored (and disfavored) SOTU words, 2014. Draft words (descriptions of white vs black NFL prospects), 2014. Male and female word usage, 2014.
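
For orientation only, here is a minimal sketch of the kind of computation the Monroe et al. paper and the implementations listed above revolve around: a Dirichlet-smoothed log-odds ratio per word, divided by the square root of its approximate variance. It is not any of the listed implementations. The function name, the toy word lists, and the flat pseudo-count of 0.5 per word are assumptions made for this example; a prior estimated from a background corpus would be the more usual choice.

```python
import math
from collections import Counter

def log_odds_z_scores(counts_i, counts_j, prior_counts):
    """z-scored log-odds ratio of each word between groups i and j.

    prior_counts maps every vocabulary word to a positive Dirichlet
    pseudo-count; the vocabulary must contain more than one word.
    """
    n_i = sum(counts_i.values())
    n_j = sum(counts_j.values())
    a_0 = sum(prior_counts.values())
    z = {}
    for w, a_w in prior_counts.items():
        y_i = counts_i.get(w, 0) + a_w  # smoothed count in group i
        y_j = counts_j.get(w, 0) + a_w  # smoothed count in group j
        delta = (math.log(y_i / (n_i + a_0 - y_i))
                 - math.log(y_j / (n_j + a_0 - y_j)))
        variance = 1.0 / y_i + 1.0 / y_j  # approximate variance of delta
        z[w] = delta / math.sqrt(variance)
    return z

# Hypothetical toy "corpora" for two groups of speakers.
group_a = Counter("tax cut tax cut jobs".split())
group_b = Counter("health care jobs health care".split())
vocab = set(group_a) | set(group_b)
flat_prior = {w: 0.5 for w in vocab}  # assumed uninformative prior

z = log_odds_z_scores(group_a, group_b, flat_prior)
for word, score in sorted(z.items(), key=lambda kv: kv[1]):
    print(f"{word:8s} {score:+.3f}")
```

Words with large positive scores are relatively characteristic of the first group, large negative scores of the second, and a word used equally by both (here "jobs") lands at zero.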

#17 Oct 19: How different are two language models?

Reminder: Phase 3 of A5 due on Monday; sign up beforehand for, and attend, a mandatory feasibility-check appointment on Tuesday.

Class images, links and handouts

Lecture references

Monroe, Burt L., Michael P. Colaresi, and Kevin M. Quinn. 2008. Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis 16(4): 372-403. [alt link]

Other references

An example reference for deriving the maximum-likelihood estimate (and a little about Dirichlet priors) for the multinomial is the following slide set: Ronald Williams, CSG 200, Spring 2007 Maximum Likelihood vs. Bayesian Parameter Estimation. Some of the slides from there were taken with attribution from "apparently ... Nir Friedman", PGM: Tirgul 10, Parameter Learning and Priors, which goes into more depth and covers more topics.
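
As a small companion to the slide sets above, and not drawn from them, the following sketch contrasts the maximum-likelihood estimate of a multinomial with the posterior mean under a symmetric Dirichlet prior. The function names, the toy data, and the choice of alpha = 1 are assumptions made for illustration.

```python
from collections import Counter

def mle(counts):
    """Maximum-likelihood estimate: relative frequency of each observed item."""
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def dirichlet_posterior_mean(counts, vocab, alpha=1.0):
    """Posterior mean under a symmetric Dirichlet(alpha) prior over vocab."""
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts.get(w, 0) + alpha) / total for w in vocab}

# Hypothetical observations: "c" is in the vocabulary but never observed.
observed = Counter("a a b".split())
vocab = {"a", "b", "c"}

ml_estimate = mle(observed)
smoothed = dirichlet_posterior_mean(observed, vocab, alpha=1.0)
print(ml_estimate)
print(smoothed)
```

The maximum-likelihood estimate gives the unseen item "c" no probability at all, whereas the Dirichlet pseudo-counts reserve some mass for it.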

#18 Oct 24: Feasibility-check appointments

Only come to class during your scheduled appointment; see Phase 3 of A5.

#19 Oct 26: Language modeling and differences between language models, cont.

A5 "our week" commitment statements due tonight

Class images, links and handouts

Inspirational, thought-provoking image by Chenhao Tan (see handout for explanation):

Handout
Visualization of the behavior of different distributional distance functions. The embedded data plot was generated by this gnuplot script.
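
For readers who want to experiment with such functions themselves, below is a minimal Python sketch of three commonly used measures: KL divergence, Jensen-Shannon divergence, and L1 distance. This is an assumption about which measures are of interest, based on the references listed for this lecture; it is not a reproduction of the handout or of the gnuplot script, and the function names are illustrative only.

```python
import math


def kl_divergence(p, q):
    """KL divergence D(p || q) in bits.

    p and q are equal-length sequences of probabilities.
    Returns infinity if q assigns zero mass where p does not.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # 0 * log(0 / q) is taken to be 0
        if qi == 0:
            return math.inf
        total += pi * math.log2(pi / qi)
    return total


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL divergence to the midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def l1_distance(p, q):
    """L1 (Manhattan) distance between two distributions."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))


if __name__ == "__main__":
    p = [0.5, 0.5, 0.0]
    q = [0.25, 0.25, 0.5]
    print(kl_divergence(p, q), kl_divergence(q, p))
    print(js_divergence(p, q), l1_distance(p, q))
```

One behavior worth noticing when running this: KL divergence is asymmetric and becomes infinite when the second distribution gives zero probability to an event the first considers possible, whereas Jensen-Shannon divergence and L1 distance stay symmetric and bounded.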

Lecture references

Danescu-Niculescu-Mizil, Cristian, Robert West, Dan Jurafsky, Jure Leskovec, and Christopher Potts. 2013. No country for old members: User lifecycle and linguistic change in online communities. WWW, pp. 307--318. Best paper award. [paper "homepage"]
Eisenstein, Jacob. 2013. What to do about bad language on the internet. NAACL-HLT, 359-369.
Lee, Lillian. 1997. Chapter 2.3, Measures of distributional Similarity, in Similarity-Based Approaches to Natural Language Processing. Ph.D. Thesis.

Other references

Brysbaert, Marc, Michaël Stevens, Paweł Mandera, and Emmanuel Keuleers. 2016. How many words do we know? Practical estimates of vocabulary size dependent on word definition, the degree of language input and the participant's age. Frontiers in Psychology 7(1116).
Eisenstein, Jacob, Amr Ahmed, and Eric P. Xing. 2011. Sparse additive generative models of text. In ICML, 1041-1048. [code]
Kavaler, David, Sasha Sirovica, Vincent Hellendoorn, Raul Aranovich, and Vladimir Filkov. 2017. Perceived language complexity in GitHub issue discussions and their effect on issue resolution. In IEEE/ACM Conference on Automated Software Engineering.
Lee, Lillian. 1999. Measures of distributional similarity. In ACL, 25-32. [paper "homepage"]
Lin, Jianhua. 1991. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory 37(1): 145-151. [alt link]

#20 Oct 31: Models of local language structure: vocabulary space

Class images, links and handouts

Twitter word clusters, by Olutobi Owoputi et al. (reference below). A "mild curse words" cluster.
Illustration of nouns as inducing distributions over the verbs that take them as direct objects. (Old-school illustrations from a talk given in 1997!)

Lecture references

Brown, Peter F., Vincent J. Della Pietra, Peter V. DeSouza, Jennifer C. Lai, and Robert L. Mercer. 1992. Class-based n-gram models of natural language. Computational Linguistics 18(4): 467-479. Percy Liang's implementation
Genzel, Dmitriy and Eugene Charniak. 2002. Entropy rate constancy in text. In ACL, 199-206.
Owoputi, Olutobi, Brendan O'Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. Improved part-of-speech tagging for online conversational text with word clusters. In NAACL, 380--390.

Other references

Levy, Omer, Yoav Goldberg, and Ido Dagan. May 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the ACL 3:211--225.
Pereira, Fernando, Naftali Tishby, and Lillian Lee. 1993. Distributional clustering of English words. In ACL, 183--190.

#21 Nov 2: Foreshadowing: some connections between information theory and psycholinguistics; the Brown clustering algorithm for deriving structure of vocabulary space

Handout
Coincidence: today's news (10:53am) about Robert Mercer: Bloomberg

Lecture references

Brown, Peter F., Vincent J. Della Pietra, Peter V. DeSouza, Jennifer C. Lai, and Robert L. Mercer. 1992. Class-based n-gram models of natural language. Computational Linguistics 18(4): 467-479.
Doyle, Gabriel and Michael C. Frank. 2015. Audience size and contextual effects on information density in Twitter conversations. ACL Workshop on Cognitive Modeling and Computational Linguistics (CMCL), 19-28. Potential contrast: same authors, 2015, Shared common ground influences information density in microblog texts, NAACL, 1587--1596.
Hale, John. 2001. A probabilistic Earley parser as a psycholinguistic model. NAACL, pp. 1-8.
Levy, Roger and T. Florian Jaeger. 2007. Speakers optimize information density through syntactic reduction. In NIPS, 849-856.

Other references

Hale, John. 2016. Information-theoretical complexity metrics. Language and Linguistics Compass 10(9): 397-412.

#22 Nov 7: Local structure: phrase and sentence space

A6 (due Fri Nov. 10, 11:59pm): post, as a comment to your final project posting, your planned project schedule from now until Dec 11th (the project due date)
No lecture on November 14

Class images, links and handouts

Image source: AZ Quotes.

The Paraphrase Database (PPDB), version 2.0
Example sentential paraphrase template
Example alignment and a first-year undergraduate-level schematic of IBM Model 1.

Lecture references

Bannard, Colin and Chris Callison-Burch. 2005. Paraphrasing with bilingual parallel corpora. In ACL, 597-604.
Barzilay, Regina and Lillian Lee. 2003. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In NAACL, 16-23. [paper homepage]
Brown, Peter F., Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics 19(2 (Special issue on using large corpora: II)): 263--311.
Lin, Dekang and Patrick Pantel. December 2001. Discovery of inference rules for question-answering. Natural Language Engineering 7:343-360.

Other references

Knight, Kevin. 1999. A statistical MT Tutorial Workbook. "The basic text that this tutorial relies on is Brown et al., “The Mathematics of Statistical Machine Translation”, Computational Linguistics, 1993. On top of this excellent presentation, I can only add some perspective and perhaps some sympathy for the poor reader, who has (after all) done nothing wrong."
Prescher, Detlef. 2005. A tutorial on the Expectation-Maximization algorithm including maximum-likelihood estimation and EM training of probabilistic context-free grammars.

#23 Nov 9: Latent discourse/dialog structure

Class images, links and handouts

Left: Garry Kasparov, Maurice Ashley, Yasser Seirawan and a bunch of soft drinks at the 1996 match against Deep Blue. Photo by Kenneth Thompson, provided at computerhistory.org
Right: Maurice Ashley and Yasser Seirawan commentating on the 1997 re-match. Photo by Monroe Newborn, provided at computerhistory.org

Handout
Summary of and source for transcripts of live commentary on the first Kasparov/Deep Blue match
AMI Guidelines for Dialogue Act and Addressee Annotation Version 1.0

Lecture references

Clark, Herbert H. and Jean E. Fox Tree. 2002. Using uh and um in spontaneous speaking. Cognition 84(1): 73-111.
Jurafsky, Dan, Rajesh Ranganath, and Dan McFarland. 2009. Extracting Social Meaning: Identifying Interactional Style in Spoken Conversation. NAACL, 638--646.
Chris Potts's material on the Switchboard Dialog Act corpus
Ranganath, Rajesh, Dan Jurafsky, and Dan McFarland. 2009. It's not you, it's me: Detecting flirting and its misperception in speed-dates. In EMNLP, 334-342.
Schegloff, Emanuel A. 2009. A practice for (re-)exiting a sequence: And/but/so + uh(m) + silence. In K. Turner and B. Fraser (Eds.), Language in life, and a life in language: Jacob Mey --- A festschrift, pp. 365-374. [Researchgate link] [automatic pdf request site] [sound clip]
Zarisheva, Elina and Tatjana Scheffler. 2015. Dialog act annotation for Twitter conversations. In SIGDIAL, 114--123.

Other references

Clark, Herbert H. and Jean E. Fox Tree. November 11, 2014. On thee-yuh fillers uh and um. Language Log post.
Schegloff, Emanuel A. 2010. Some other "uh(m)"s. Discourse Processes 47(2): 130–174. [automatic pdf request site] [sound clip]

#24 Nov 14: No class

#25 Nov 16: Latent discourse/dialog structure, part two

Project presentations after Thanksgiving Break

Class images, links and handouts

Clip source: hill35billy's YouTube channel; the movie is The Pink Panther Strikes Again. Start at 50s.

Handout

Lecture references

Allen, James. 1995. Natural Language Understanding. Second ed. Redwood City, Calif.: Benjamin/Cummings.
Galantucci, Bruno and Gareth Roberts. 2014. Do we notice when communication goes awry? An investigation of people's sensitivity to coherence in spontaneous conversation. PLoS ONE 9(7).
Grice, H.P. 1975. Logic and Conversation. In Cole et al., Syntax and Semantics 3: Speech Acts. (link is not to an official site)
Grishman, Ralph. 1986. Computational Linguistics: An Introduction. Cambridge [Cambridgeshire]; New York: Cambridge University Press.
Grosz, Barbara J., and Sidner, Candace L. 1986. Attention, intentions, and the structure of discourse. Computational Linguistics 12(3): 175-204.
Rogers, Todd and Michael I. Norton. 2011. The artful dodger: Answering the wrong question the right way. Journal of Experimental Psychology: Applied 17 (2). [alt link]
Sidner, Candace Lee. 1979. Towards a computational theory of definite anaphora comprehension in English discourse. MIT AITR-537.

Other references

Section 24.1.5 of Jurafsky, Daniel and James H. Martin. 2009. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. 2nd edition (there does not seem to be an electronic version of the third edition's relevant chapter available) [chapter link at UCSC]
Grosz, Barbara J., Weinstein, Scott, and Joshi, Aravind K. 1995. Centering: A framework for modeling the local coherence of discourse. Computational Linguistics 21 (June): 203-225. A theory said to account for the "wine on the table" example.

#26 Nov 21: Latent discourse/dialog structure, part three

Class images, links and handouts

Pinker, Steven and the Royal Society for the Encouragement of Arts, Manufactures and Commerce (RSA) Animate, posted to YouTube on Feb 10, 2011. Language as a Window into Human Nature

Handout
Rhetorical Structure Theory page, created by Bill Mann and maintained by Maite Taboada
Code for a recent RST parser, by Yangfeng Ji and Jacob Eisenstein (2014)
RST LaTeX typesetting package by David Reitter

Lecture references

Clark, Herbert H. 2004. Pragmatics of language performance. In L. R. Horn & G. Ward (Eds.), Handbook of pragmatics. Oxford: Blackwell, pp. 365-382. [alt link]
Grosz, Barbara J., and Sidner, Candace L. 1986. Attention, intentions, and the structure of discourse. Computational Linguistics 12(3): 175-204.

Other references

Ji, Yangfeng and Jacob Eisenstein. 2014. Representation learning for text-level discourse parsing. In ACL, 13-24.
Section 27.4.2 of Jurafsky, Daniel and James H. Martin. 2009. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. 2nd edition (there does not seem to be an electronic version of the third edition's relevant chapter available) [chapter link at UCSC]
Mann, William C. and Sandra A. Thompson. 1988. Rhetorical structure theory: Toward a functional theory of text organization. Text: Interdisciplinary Journal for the Study of Discourse 8(3): 243-281. [link at Mann/Taboada's site]
Marcu, Daniel. 2000. Extending a formal and computational model of rhetorical structure theory with intentional structures à la Grosz and Sidner. COLING, 523-529.
Walker, Marilyn A. 1996. Limited attention and discourse structure. Computational Linguistics 22(2): 255-264.

Nov 23: No class — Thanksgiving Break

#27 Nov 28: Project presentations (attendance by all is mandatory)

A7 posted on CMS, due Mon Dec. 11, 4:30pm (date determined by the registrar). Submit both your presentation materials and your final project writeup; but don't spend time post-editing your presentation materials after the fact, as I will only be using them as a reference while evaluating your writeup.
The main evaluation criteria will be the reasonableness (in approach and amount of effort), thoughtfulness, and creativity of what you tried, as documented in your writeup. Individual effort within team projects will be taken into account; see item 3 below.
1. For the author heading, list only the names of your teammates that are enrolled in the class, even if you had external collaborators. (Reason: only students in the class are submitting the paper for a grade.) But see item 2bi below.
2. Include the following sections:
  1. "content" sections: abstract, introduction/motivation, data description (how you gathered, cleaned, and processed it), methods, experiments/results, related work, conclusions (what you learned), directions for future work, references
    - Make sure that your introduction section explicitly sets out your hypotheses/research questions.
    - Throughout, highlight your most interesting findings (positive or negative).
    - For the purposes of CS/IS 6742 submission, your related-work section does not need to be exhaustive; you may cover just a few most-related papers.
  2. An "acknowledgments" section: give the name and state the contribution of each person from whom you received significant help. (This may or may not include your advisor(s), one or both of your instructors, or fellow students in the class.)
    1. Authorship statement: if you intend to ask, or have already arranged, to have people other than your CS6742-enrolled teammates as co-authors, also name each such person.
3. Projects done collaboratively must also include a section describing who did what. External collaborators should be included in this enumeration.

Class images, links and handouts

References

#28 Nov 30: Project presentations (attendance by all is mandatory)

Class images, links and handouts

Lecture references

Mon Dec. 11, 4:30pm: Final project writeup due

Code for generating the calendar formatting adapted from Andrew Myers. Portions of the content of this website and course were created by collaboration between Cristian Danescu-Niculescu-Mizil and Lillian Lee over multiple runnings of this course.
