Topic-Focused Summarization of Chat Conversations (original) (raw)

Summarizing Online Conversations: A Machine Learning Approach

Summarization has emerged as an increasingly useful approach to tackle the problem of information overload. Extracting information from online conversations can be of very good commercial and educational value. But majority of this information is present as noisy unstructured text making traditional document summarization techniques difficult to apply. In this paper, we propose a novel approach to address the problem of conversation summarization. We develop an automatic text summarizer which extracts sentences from the conversation to form a summary. Our approach consists of three phases. In the first phase, we prepare the dataset for usage by correcting spellings and segmenting the text. In the second phase, we represent each sentence by a set of predefined features. These features capture the statistical, linguistic and sentimental aspects along with the dialogue structure of the conversation. Finally, in the third phase we use a machine learning algorithm to train the summarizer on the set of feature vectors. Experiments performed on conversations taken from the technical domain show that our system significantly outperforms the baselines on ROUGE F-scores.

Unsupervised Summarization for Chat Logs with Topic-Oriented Ranking and Context-Aware Auto-Encoders

Proceedings of the AAAI Conference on Artificial Intelligence

Automatic chat summarization can help people quickly grasp important information from numerous chat messages. Unlike conventional documents, chat logs usually have fragmented and evolving topics. In addition, these logs contain a quantity of elliptical and interrogative sentences, which make the chat summarization highly context dependent. In this work, we propose a novel unsupervised framework called RankAE to perform chat summarization without employing manually labeled data. RankAE consists of a topic-oriented ranking strategy that selects topic utterances according to centrality and diversity simultaneously, as well as a denoising auto-encoder that is carefully designed to generate succinct but context-informative summaries based on the selected utterances. To evaluate the proposed method, we collect a large-scale dataset of chat logs from a customer service environment and build an annotated set only for model evaluation. Experimental results show that RankAE significantly outp...

Summarizing spoken and written conversations

Proceedings of the Conference on Empirical Methods in Natural Language Processing - EMNLP '08, 2008

In this paper we describe research on summarizing conversations in the meetings and emails domains. We introduce a conversation summarization system that works in multiple domains utilizing general conversational features, and compare our results with domain-dependent systems for meeting and email data. We find that by treating meetings and emails as conversations with general conversational features in common, we can achieve competitive results with state-of-theart systems that rely on more domain-specific features.

Summarizing emails with conversational cohesion and subjectivity

2008

In this paper, we study the problem of summarizing email conversations. We first build a sentence quotation graph that captures the conversation structure among emails. We adopt three cohesion measures: clue words, semantic similarity and cosine similarity as the weight of the edges. Second, we use two graph-based summarization approaches, Generalized ClueWordSummarizer and Page-Rank, to extract sentences as summaries. Third, we propose a summarization approach based on subjective opinions and integrate it with the graph-based ones. The empirical evaluation shows that the basic clue words have the highest accuracy among the three cohesion measures. Moreover, subjective words can significantly improve accuracy.

Summarizing web forum threads based on a latent topic propagation process

Proceedings of the 20th ACM international conference on Information and knowledge management - CIKM '11, 2011

With an increasingly amount of information in web forums, quick comprehension of threads in web forums has become a challenging research problem. To handle this issue, this paper investigates the task of Web Forum Thread Summarization (WFTS), aiming to give a brief statement of each thread that involving multiple dynamic topics. When applied to the task of WFTS, traditional summarization methods are cramped by topic dependencies, topic drifting and text sparseness. Consequently, we explore an unsupervised topic propagation model in this paper, the Post Propagation Model (PPM), to burst through these problems by simultaneously modeling the semantics and the reply relationship existing in each thread. Each post in PPM is considered as a mixture of topics, and a product of Dirichlet distributions in previous posts is employed to model each topic dependencies during the asynchronous discussion. Based on this model, the task of WFTS is accomplished by extracting most significant sentences in a thread. The experimental results on two different forum data sets show that WFTS based on the PPM outperforms several state-of-the-art summarization methods in terms of ROUGE metrics.

Multi-Topic Multi-Document Summarizer

Current multi-document summarization systems can successfully extract summary sentences, however with many limitations including: low coverage, inaccurate extraction to important sentences, redundancy and poor coherence among the selected sentences. The present study introduces a new concept of centroid approach and reports new techniques for extracting summary sentences for multi-document. In both techniques keyphrases are used to weigh sentences and documents. The first summarization technique (Sen-Rich) prefers maximum richness sentences. While the second (Doc-Rich), prefers sentences from centroid document. To demonstrate the new summarization system application to extract summaries of Arabic documents we performed two experiments. First, we applied Rouge measure to compare the new techniques among systems presented at TAC2011. The results show that Sen-Rich outperformed all systems in ROUGE-S. Second, the system was applied to summarize multi-topic documents. Using human evaluators, the results show that Doc-Rich is the superior, where summary sentences characterized by extra coverage and more cohesion.

Review of Topic Modeling and Summarization

IRJET, 2022

Topic Modeling is a technique of unsupervised machine learning which is used in discovering topics that occur in a collection of documents. Latent Dirichlet Allocation (LDA) is one of the most used algorithm for topic modeling. It considers that documents are mixture of topics and each topic is a mixture of different tokens or words. While considering many documents, one can think that the topics extracted by the LDA algorithm relate to all of the documents together. But if we consider only one text document, if we try to extract topics from it using LDA algorithm, we can say that these are the keywords of the text document as it summarizes the entire idea of the document in a concise form. This can be useful in summarization of the document. Summarization with respect to text is shortening of the text document such that it highlights all the important points of the text document. In this paper, we represent a LDA model which helps to identify the dominant topics in the text document, then identifies sentences that reflect these dominant topics and stiches them together to formulate a human readable summary.

Automatic summarisation of discussion fora

2010

Web-based discussion fora proliferate on the Internet. These fora consist of threads about specific matters. Existing forum search facilities provide an easy way for finding threads of interest. However, understanding the content of threads is not always trivial. This problem becomes more pressing as threads become longer. It frustrates users that are looking for specific information and also makes it more difficult to make valuable contributions to a discussion. We postulate that having a concise summary of a thread would greatly help forum users. But, how would we best create such summaries? In this paper, we present an automated method of summarising threads in discussion fora. Compared with summarisation of unstructured texts and spoken dialogues, the structural characteristics of threads give important advantages. We studied how to best exploit these characteristics. Messages in threads contain both explicit and implicit references to each other and are structured. Therefore, we term the threads hierarchical dialogues. Our proposed summarisation algorithm produces one summary of an hierarchical dialogue by 'cherry-picking' sentences out of the original messages that make up a thread. We try to select sentences usable for obtaining an overview of the discussion. Our method is built around a set of heuristics based on observations of real fora discussions. The data used for this research was in Dutch, but the developed method equally applies to other languages. We evaluated our approach using a prototype. Users judged our summariser as very useful, half of them indicating they would use it regularly or always when visiting fora.

Summarization of discussion groups

Proceedings of the tenth international conference on Information and knowledge management - CIKM'01, 2001

Abstract In this paper, we describe an algorithm to generate textual summaries of discussion groups. Our system combines sentences extracted from individual postings into variable-length summaries by utilizing the hierarchical discourse context provided by discussion ...

Creating a reference data set for the summarization of discussion forum threads

Language Resources and Evaluation

In this paper we address extractive summarization of long threads in online discussion fora. We present an elaborate user evaluation study to determine human preferences in forum summarization and to create a reference data set. We showed long threads to ten different raters and asked them to create a summary by selecting the posts that they considered to be the most important for the thread. We study the agreement between human raters on the summarization task, and we show how multiple reference summaries can be combined to develop a successful model for automatic summarization. We found that although the inter-rater agreement for the summarization task was slight to fair, the automatic summarizer obtained reasonable results in terms of precision, recall, and ROUGE. Moreover, when human raters were asked to choose between the summary created by another human and the summary created by our model in a blind side-by-side comparison, they judged the model's summary equal to or better than the human summary in over half of the cases. This shows that even for a summarization task with low inter-rater agreement,