A Large-Scale Corpus for Conversation Disentanglement
Related papers
ArXiv, 2018
Disentangling conversations mixed together in a single stream of messages is a difficult task with no large annotated datasets. We created a new dataset that is 25 times the size of any previous publicly available resource, has samples of conversation from 152 points in time across a decade, and is annotated with both threads and a within-thread reply-structure graph. We also developed a new neural network model, which extracts conversation threads substantially more accurately than prior work. Using our annotated data and our model we tested assumptions in prior work, revealing major issues in heuristically constructed resources, and identifying how small datasets have biased our understanding of multi-party multi-conversation chat.
You Talking to Me? A Corpus and Algorithm for Conversation Disentanglement
Meeting of the Association for Computational Linguistics, 2008
When multiple conversations occur simultaneously, a listener must decide which conversation each utterance is part of in order to interpret and respond to it appropriately. We refer to this task as disentanglement. We present a corpus of Internet Relay Chat (IRC) dialogue in which the various conversations have been manually disentangled, and evaluate annotator reliability.
Appendices: A Large-Scale Corpus for Conversation Disentanglement
2019
We do not include a range of other tools for pre- and post-processing of the data, collection of statistics, and other forms of format conversion (e.g. Elsner's format to/from ours). Format: The data is stored with stand-off annotations. There is a set of ASCII text files containing the original logs. For each file there is a matching annotation file with lines of the format "number number -", where the two numbers refer to lines in the text file, indicating that the two lines are linked. If the two numbers are equal, the link is a self-link, indicating the start of a conversation.
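The stand-off format above is simple enough to parse directly. Below is a minimal sketch of a reader for such annotation files; `parse_annotations` is a hypothetical helper name (the released tools may structure this differently), and the direction of each link is left unspecified, as in the format description.

```python
def parse_annotations(path):
    """Parse a stand-off annotation file of "number number -" lines.

    The two integers index lines in the matching log file and mean
    those two lines are linked; an equal pair is a self-link marking
    the start of a conversation. (Hypothetical helper; link direction
    is not specified by the format description.)
    """
    starts, links = [], []
    with open(path) as f:
        for raw in f:
            parts = raw.split()
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            a, b = int(parts[0]), int(parts[1])
            if a == b:
                starts.append(a)      # self-link: new conversation
            else:
                links.append((a, b))  # cross-line link
    return starts, links
```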
Computational Linguistics, 2010
When multiple conversations occur simultaneously, a listener must decide which conversation each utterance is part of in order to interpret and respond to it appropriately. We refer to this task as disentanglement. We present a corpus of Internet Relay Chat dialogue in which the various conversations have been manually disentangled, and evaluate annotator reliability. We propose a graph-based clustering model for disentanglement, using lexical, timing, and discourse-based features. The model's predicted disentanglements are highly correlated with manual annotations. We conclude by discussing two extensions to the model, specificity tuning and conversation start detection, both of which are promising but do not currently yield practical improvements.
POSLAN: Disentangling Chat with Positional and Language encoded Post Embeddings
2021
Most online message threads are inherently cluttered, and any new user, or an existing user visiting after a hiatus, will have a difficult time understanding what is being discussed in the thread. Similarly, cluttered responses in a message thread make analyzing the messages a difficult problem. The need for disentangling the clutter is even higher when the platform where the discussion takes place does not provide functions to retrieve the reply relations of the messages. This introduces an interesting problem, which [6] phrases as a structural learning problem. We create vector embeddings for posts in a thread so that they capture both linguistic and positional features relative to the context in which a given message appears. Using these post embeddings we compute a similarity-based connectivity matrix, which is then converted into a graph. After employing a pruning mechanism, the resultant graph can be used to discover the reply relations for the posts in the thread. The process ...
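The similarity-matrix-to-pruned-graph step described above can be sketched as follows. This is an illustrative toy, not POSLAN's actual pipeline: the embeddings, similarity measure, and pruning threshold here are all assumptions.

```python
import numpy as np

def reply_graph(embeddings, threshold=0.5):
    """Toy sketch: cosine-similarity connectivity matrix, pruned into
    a candidate reply graph. embeddings is an (n_posts, dim) array in
    temporal order; each post may only reply to an earlier post.
    """
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize rows
    sim = X @ X.T                                     # cosine similarity matrix
    graph = {}
    for i in range(1, len(X)):
        # prune: keep only earlier posts with similarity above threshold
        graph[i] = [j for j in range(i) if sim[i, j] >= threshold]
    return graph
```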
Disentangling utterances and recovering coherent multi party distinct conversations
2014
Automatic separation of an interposed sequence of utterances into distinct conversations is an essential prerequisite for any kind of higher-level dialogue analysis. Unlike most models that involve highly computationally intensive methods such as clustering techniques, our proposed approach uses a simple and efficient sequential thread detection method which is less computationally intensive. It uses the waiting time (the time gap between the current speaker and the next speaker), similarity between utterances, turn-taking, and participant-based features. 1 Motivation: Disentanglement problem. Chat rooms are where people can meet each other to chat on the internet (Davies, 2010). Chat room logs are not a single continuous conversation of two people or a group of people at a time; rather, each time window is a sequence of frequently broken utterances (Elsner and Charniak, 2008). A typical conversation, therefore, does not form an adjacent segment of the chat-room transcript, but sequences of frequently brok...
Unsupervised modeling of dialog acts in asynchronous conversations
2011
We present unsupervised approaches to the problem of modeling dialog acts in asynchronous conversations; i.e., conversations where participants collaborate with each other at different times. In particular, we investigate a graph-theoretic deterministic framework and two probabilistic conversation models (i.e., HMM and HMM+Mix) for modeling dialog acts in emails and forums. We train and test our conversation models on (a) temporal order and (b) graph-structural order of the datasets. Empirical evaluation suggests (i) the graph-theoretic framework that relies on lexical and structural similarity metrics is not the right model for this task, (ii) conversation models perform better on the graph-structural order than the temporal order of the datasets and (iii) HMM+Mix is a better conversation model than the simple HMM model.
A Graph-Based Context-Aware Model to Understand Online Conversations
arXiv (Cornell University), 2022
Online forums that allow for participatory engagement between users have been transformative for the public discussion of many important issues. However, such conversations can sometimes escalate into full-blown exchanges of hate and misinformation. Existing approaches in natural language processing (NLP), such as deep learning models for classification tasks, use as inputs only a single comment or a pair of comments depending upon whether the task concerns the inference of properties of the individual comments or the replies between pairs of comments, respectively. But in online conversations, comments and replies may be based on external context beyond the immediately relevant information that is input to the model. Therefore, being aware of the conversations' surrounding contexts should improve the model's performance for the inference task at hand. We propose GraphNLI, a novel graph-based deep learning architecture that uses graph walks to incorporate the wider context of a conversation in a principled manner. Specifically, a graph walk starts from a given comment and samples "nearby" comments in the same or parallel conversation threads, which results in additional embeddings that are aggregated together with the initial comment's embedding. We then use these enriched embeddings for downstream NLP prediction tasks that are important for online conversations. We evaluate GraphNLI on two such tasks, polarity prediction and misogynistic hate speech detection, and find that our model consistently outperforms all relevant baselines for both tasks. Specifically, GraphNLI with a biased root-seeking random walk performs with a macro-F1 score 3 and 6 percentage points better than the best-performing BERT-based baselines for the polarity prediction and hate speech detection tasks, respectively. We also perform extensive ablative experiments and hyperparameter searches to understand the efficacy of GraphNLI.
This demonstrates the potential of context-aware models to capture the global context along with the local context of online conversations for these two tasks.
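The biased root-seeking walk mentioned above can be illustrated with a toy sampler over a reply tree: at each step the walk moves toward the root with some probability, otherwise stays put, collecting the comments it visits. This is an assumption-level sketch; GraphNLI's actual walk, bias parameter, and aggregation are not reproduced here.

```python
import random

def root_seeking_walk(parent, start, length=4, p_up=0.75, seed=None):
    """Toy biased root-seeking walk over a reply tree.

    parent: dict mapping comment id -> parent id (root maps to None).
    With probability p_up move to the parent (toward the root),
    otherwise stay; returns the list of visited comment ids.
    """
    rng = random.Random(seed)
    node, visited = start, [start]
    for _ in range(length):
        if parent.get(node) is not None and rng.random() < p_up:
            node = parent[node]   # biased step toward the root
        visited.append(node)
    return visited
```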
Who Did They Respond to? Conversation Structure Modeling Using Masked Hierarchical Transformer
Proceedings of the AAAI Conference on Artificial Intelligence
Conversation structure is useful both for understanding the nature of conversation dynamics and for providing features for many downstream applications such as summarization of conversations. In this work, we define the problem of conversation structure modeling as identifying the parent utterance(s) to which each utterance in the conversation responds. Previous work usually took a pair of utterances to decide whether one utterance is the parent of the other. We believe the entire ancestral history is a very important information source for making accurate predictions. Therefore, we design a novel masking mechanism to guide the ancestor flow, and leverage the transformer model to aggregate all ancestors to predict parent utterances. Our experiments are performed on the Reddit dataset (Zhang, Culbertson, and Paritosh 2017) and the Ubuntu IRC dataset (Kummerfeld et al. 2019). In addition, we also report experiments on a new larger corpus from the Reddit platform and release this datase...
Discourse-Wizard: Discovering Deep Discourse Structure in your Conversation with RNNs
ArXiv, 2018
Spoken language understanding is one of the key factors in a dialogue system, and the context in a conversation plays an important role in understanding the current utterance. In this work, we demonstrate the importance of context within the dialogue for neural network models through an online web interface live demo. We developed two different neural models: a model that does not use context and a context-based model. The no-context model classifies dialogue acts at an utterance level, whereas the context-based model takes some preceding utterances into account. We make these trained neural models available as a live demo called Discourse-Wizard using a modular server architecture. The live demo provides an easy-to-use interface for conversational analysis and for discovering deep discourse structures in a conversation.