Automatic summarisation of discussion fora
Related papers
Creating a reference data set for the summarization of discussion forum threads
Language Resources and Evaluation
In this paper we address extractive summarization of long threads in online discussion fora. We present an elaborate user evaluation study to determine human preferences in forum summarization and to create a reference data set. We showed long threads to ten different raters and asked them to create a summary by selecting the posts that they considered to be the most important for the thread. We study the agreement between human raters on the summarization task, and we show how multiple reference summaries can be combined to develop a successful model for automatic summarization. We found that although the inter-rater agreement for the summarization task was only slight to fair, the automatic summarizer obtained reasonable results in terms of precision, recall, and ROUGE. Moreover, when human raters were asked to choose between the summary created by another human and the summary created by our model in a blind side-by-side comparison, they judged the model's summary equal to or better than the human summary in over half of the cases. This shows that even for a summarization task with low inter-rater agreement, a useful automatic summarization model can still be learned by combining multiple human reference summaries.
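The abstract describes combining several raters' post selections into a single reference summary and then scoring a model against it with precision and recall. A minimal sketch of that idea (the voting threshold and the helper names are illustrative assumptions, not the paper's procedure):

```python
from collections import Counter

def combine_references(rater_selections, n_raters, threshold=0.5):
    """Combine per-rater post selections into one reference set:
    a post joins the reference if at least `threshold` of raters chose it.
    (Illustrative vote-pooling; the paper's combination rule may differ.)"""
    votes = Counter(post for sel in rater_selections for post in sel)
    return {post for post, v in votes.items() if v / n_raters >= threshold}

def precision_recall(predicted, reference):
    """Standard set-based precision and recall over selected post IDs."""
    tp = len(predicted & reference)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    return precision, recall

# Toy thread: three raters each select post IDs they find most important.
raters = [{1, 2, 5}, {1, 2}, {2, 5, 6}]
reference = combine_references(raters, n_raters=3)   # posts chosen by >= 2 of 3 raters
model_summary = {1, 2, 6}
p, r = precision_recall(model_summary, reference)
```

Even with only slight-to-fair agreement between individual raters, the pooled reference tends to be more stable than any single rater's selection, which is consistent with the paper's finding.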
Summarization of discussion groups
Proceedings of the tenth international conference on Information and knowledge management - CIKM'01, 2001
Abstract: In this paper, we describe an algorithm to generate textual summaries of discussion groups. Our system combines sentences extracted from individual postings into variable-length summaries by utilizing the hierarchical discourse context provided by discussion ...
SUMMARIZATION USING NTC APPROACH BASED ON KEYWORD EXTRACTION FOR DISCUSSION FORUMS
The Internet has become a ubiquitous medium of communication, whether through social networking websites such as Facebook and Twitter or discussion forums such as Yahoo Answers, Quora, and Stack Overflow. One can participate in discussions ranging from politics, education, spirituality, philosophy, science, and geography to medicine and more. Most discussion forums are loaded with data. Hence, when a new user wants to know the public opinion, it is impossible for him/her to go through all the tens or hundreds of threads or comments under a particular thematic discussion. The problem here is that we are buried in data but starving for information. To solve this problem, we propose a novel approach called Discussion Summarization, which aims to present the user with the most relevant summary containing all the important points of the discussion. This allows the user to easily and quickly grasp and catch up on the ongoing conversation in a discussion thread. The summary generation follows a CRS approach (Clustering, Ranking, and Score calculation for each sentence). The cluster-based summarization technique is coupled with Nested Thematic Clustering (NTC) and Corpus Based Semantic Similarity (CBSS) approaches. The summary produced is the set of top-ranked (highest-scoring) sentences. Results have shown that a completely unbiased summary reflecting the multidimensionality of comments is generated.
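The CRS pipeline the abstract outlines (cluster, rank, score, then keep top sentences) can be sketched in a few lines. This is a deliberately crude stand-in: corpus term frequency replaces the paper's keyword extraction, and grouping sentences by their highest-frequency term replaces NTC/CBSS clustering:

```python
from collections import Counter, defaultdict

STOPWORDS = {"the", "a", "is", "of", "to", "and", "in", "for"}

def tokenize(sentence):
    """Lowercase, strip basic punctuation, drop stopwords."""
    return [w.lower().strip(".,!?") for w in sentence.split()
            if w.lower().strip(".,!?") not in STOPWORDS]

def summarize(sentences, top_per_cluster=1):
    # Corpus-wide term frequencies stand in for keyword extraction.
    tf = Counter(w for s in sentences for w in tokenize(s))
    # Crude thematic clustering: file each sentence under its highest-TF term.
    clusters = defaultdict(list)
    for s in sentences:
        words = tokenize(s)
        if not words:
            continue
        theme = max(words, key=lambda w: tf[w])
        # Score = sum of corpus TFs of the sentence's terms.
        clusters[theme].append((sum(tf[w] for w in words), s))
    # Rank within each cluster and keep the top sentences per theme.
    summary = []
    for theme, scored in clusters.items():
        scored.sort(reverse=True)
        summary.extend(s for _, s in scored[:top_per_cluster])
    return summary
```

Taking the best sentence per cluster, rather than the globally top sentences, is what gives the summary coverage of each theme rather than one dominant topic.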
Summarizing web forum threads based on a latent topic propagation process
Proceedings of the 20th ACM international conference on Information and knowledge management - CIKM '11, 2011
With an increasing amount of information in web forums, quick comprehension of threads in web forums has become a challenging research problem. To handle this issue, this paper investigates the task of Web Forum Thread Summarization (WFTS), aiming to give a brief statement of each thread, which may involve multiple dynamic topics. When applied to the task of WFTS, traditional summarization methods are hampered by topic dependencies, topic drift, and text sparseness. Consequently, we explore an unsupervised topic propagation model in this paper, the Post Propagation Model (PPM), to overcome these problems by simultaneously modeling the semantics and the reply relationships existing in each thread. Each post in PPM is considered a mixture of topics, and a product of Dirichlet distributions over previous posts is employed to model topic dependencies during the asynchronous discussion. Based on this model, the task of WFTS is accomplished by extracting the most significant sentences in a thread. The experimental results on two different forum data sets show that WFTS based on the PPM outperforms several state-of-the-art summarization methods in terms of ROUGE metrics.
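The core intuition of topic propagation along reply relationships can be illustrated without the full Dirichlet machinery: each post's topic distribution is blended with its parent's propagated distribution, so topics flow down the reply tree. The convex blend below is a simplified assumption standing in for PPM's product-of-Dirichlets formulation:

```python
def propagate(own, parent, alpha=0.5):
    """Propagate topic distributions down a reply tree.

    own:    per-post topic distributions (lists summing to 1), in reply order
            (every post appears after its parent).
    parent: parent[i] is the index of post i's parent, or -1 for the root.
    alpha:  weight on a post's own topics vs. its parent's propagated topics.
    """
    out = []
    for i, dist in enumerate(own):
        if parent[i] < 0:
            out.append(list(dist))            # root keeps its own topics
        else:
            p = out[parent[i]]
            out.append([alpha * d + (1 - alpha) * pd
                        for d, pd in zip(dist, p)])
    return out
```

A reply that on its own looks purely about topic 2 still inherits mass on topic 1 from its parent, which is how this family of models captures topic dependencies across an asynchronous discussion.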
A Web-Based Experiment on Dialogue Summarisation
This document describes technical details of a web experiment carried out at the State University of Campinas, Brazil, on summarisation of a set of dialogues. Our intention here is to present the strategy we followed to build the experiment's website, as well as the statistical issues and the technicalities involved. In this report, we do not present the experiment's results.
Summarizing Online Discussions by filtering posts
2009 IEEE International Conference on Information Reuse & Integration, 2009
In this paper, we attempt to summarize online discussions by filtering posts. Selecting the most relevant posts from the discussion boards leads to a summarized version of the discussion. The Online Discussion Summarizer (ODS) is based on unsupervised information retrieval techniques. Four features are used in the summarization function: term frequency-inverse post frequency, title term frequency, description term frequency, and author reputation. This paper shows that combining the four features in the same function results in higher accuracy than using each alone. ODS was able to summarize online discussions with an accuracy of 72%, precision of 83%, and recall of 62%.
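A weighted combination of the four ODS features can be sketched as below. The weights and the exact feature formulas are illustrative assumptions; the paper does not publish its coefficients here:

```python
import math
from collections import Counter

def score_post(post_terms, title_terms, desc_terms, author_rep,
               post_freq, n_posts, weights=(0.4, 0.2, 0.2, 0.2)):
    """Score a post by a weighted sum of four ODS-style features:
    TF-IPF, title term frequency, description term frequency, and
    author reputation. Weights are illustrative, not the paper's.

    post_freq[t] = number of posts in the thread containing term t.
    """
    tf = Counter(post_terms)
    # TF-IPF: frequent in this post, rare across posts => high weight.
    tf_ipf = sum(tf[t] * math.log(n_posts / post_freq[t]) for t in tf)
    title_tf = sum(tf[t] for t in title_terms)   # overlap with thread title
    desc_tf = sum(tf[t] for t in desc_terms)     # overlap with thread description
    w1, w2, w3, w4 = weights
    return w1 * tf_ipf + w2 * title_tf + w3 * desc_tf + w4 * author_rep
```

Filtering then amounts to keeping posts whose score exceeds a threshold (or the top-k posts), which is what yields the summarized version of the discussion.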
Summarizing Online Conversations: A Machine Learning Approach
Summarization has emerged as an increasingly useful approach to tackle the problem of information overload. Extracting information from online conversations can be of considerable commercial and educational value. But the majority of this information is present as noisy, unstructured text, making traditional document summarization techniques difficult to apply. In this paper, we propose a novel approach to address the problem of conversation summarization. We develop an automatic text summarizer which extracts sentences from the conversation to form a summary. Our approach consists of three phases. In the first phase, we prepare the dataset by correcting spellings and segmenting the text. In the second phase, we represent each sentence by a set of predefined features. These features capture the statistical, linguistic, and sentimental aspects along with the dialogue structure of the conversation. Finally, in the third phase, we use a machine learning algorithm to train the summarizer on the set of feature vectors. Experiments performed on conversations taken from the technical domain show that our system significantly outperforms the baselines on ROUGE F-scores.
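The second and third phases (featurize each sentence, then train a supervised extractor) can be sketched as follows. The specific features and the perceptron learner are stand-in assumptions; the paper does not specify its feature set or classifier here:

```python
def featurize(sentence, position, n_sentences, keywords, sentiment_words):
    """Hypothetical per-sentence feature vector covering statistical,
    positional, and sentiment cues."""
    words = sentence.lower().split()
    return [
        len(words),                                 # sentence length
        1.0 - position / max(n_sentences - 1, 1),   # earlier sentences score higher
        sum(w in keywords for w in words),          # keyword hits
        sum(w in sentiment_words for w in words),   # sentiment-word hits
    ]

def train_perceptron(X, y, epochs=20, lr=0.1):
    """Tiny linear learner as a stand-in for the paper's (unspecified)
    machine learning algorithm; labels y are 1 = in summary, 0 = not."""
    w = [0.0] * (len(X[0]) + 1)                     # last weight is the bias
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if sum(a * b for a, b in zip(w, xi)) + w[-1] > 0 else 0
            err = yi - pred
            for j in range(len(xi)):
                w[j] += lr * err * xi[j]
            w[-1] += lr * err
    return w
```

At inference time, sentences classified positive (or ranked by the linear score) are concatenated in document order to form the extractive summary.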
2021
While online conversations can cover a vast amount of information in many different formats, abstractive text summarization has primarily focused on modeling solely news articles. This research gap is due, in part, to the lack of standardized datasets for summarizing online discussions. To address this gap, we design annotation protocols motivated by an issues–viewpoints–assertions framework to crowdsource four new datasets on diverse online conversation forms of news comments, discussion forums, community question answering forums, and email threads. We benchmark state-of-the-art models on our datasets and analyze characteristics associated with the data. To create a comprehensive benchmark, we also evaluate these models on widely-used conversation summarization datasets to establish strong baselines in this domain. Furthermore, we incorporate argument mining through graph construction to directly model the issues, viewpoints, and assertions present in a conversation and filter noi...
Summarization of Online Document Repositories
International journal of engineering research and technology, 2018
As websites on the Internet in the Web 2.0 era have become more interactive, there has been an explosion of new user-generated content. The goal of the Summarization Pipeline for Online Repositories of Knowledge (SPORK) is to identify important key topics presented in multi-document texts, such as online comment threads. While most other automatic summarization systems simply focus on finding the top sentences represented in the text, SPORK separates the text into clusters, and identifies different topics and opinions presented in the text. SPORK has shown results of managing to identify 72% of key topics present in any discussion and up to 80% of key topics in a well-structured discussion.
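Separating a comment thread into topical clusters, as SPORK does before identifying key topics, can be approximated with a greedy single-pass grouping on word-set overlap. This is a much simpler stand-in for SPORK's actual clustering stage, with an assumed Jaccard threshold:

```python
def jaccard(a, b):
    """Jaccard similarity of two word sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_comments(comments, threshold=0.2):
    """Greedy single-pass clustering: each comment joins the first cluster
    whose representative word set it overlaps enough, else starts a new one."""
    clusters = []                      # list of (representative word set, member indices)
    for i, text in enumerate(comments):
        words = set(text.lower().split())
        for rep, members in clusters:
            if jaccard(words, rep) >= threshold:
                members.append(i)
                rep |= words           # grow the cluster's representative set
                break
        else:
            clusters.append((words, [i]))
    return [members for _, members in clusters]
```

Each resulting cluster can then be labeled with its most frequent terms, giving the per-topic view of the discussion that plain top-sentence extraction misses.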
International Journal of Computer-Supported Collaborative Learning, 2012
Research related to online discussions frequently faces the problem of analyzing huge corpora. Natural Language Processing (NLP) technologies may allow automating this analysis. However, the state-of-the-art in machine learning and text mining approaches yields models that do not transfer well between corpora related to different topics. Also, segmenting is a necessary step, but frequently, trained models are very sensitive to the particulars of the segmentation that was used when the model was trained. Therefore, in prior published research on text classification in a CSCL context, the data was segmented by hand. We discuss work towards overcoming these challenges. We present a framework for developing coding schemes optimized for automatic segmentation and context-independent coding that builds on this segmentation. The key idea is to extract the semantic and syntactic features of each single word by using the techniques of part-of-speech tagging and named-entity recognition before the raw data can be segmented and classified. Our results show that the coding on the micro-argumentation dimension can be fully automated. Finally, we discuss how fully automated analysis can enable context-sensitive support for collaborative learning.
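The key idea above (per-word semantic/syntactic features feeding a segmenter) can be illustrated with deliberately crude rules; a real system would use trained POS taggers and NER models rather than the suffix and capitalization heuristics assumed here, and the discourse-marker list is invented for the example:

```python
# Hypothetical marker list; real segmenters learn or curate these.
DISCOURSE_MARKERS = {"because", "however", "therefore", "but"}

def word_features(word, position):
    """Per-word features in the spirit of the framework: a (very) crude
    POS guess from suffixes and a capitalization-based named-entity flag."""
    pos = ("VERB" if word.endswith("ing") or word.endswith("ed")
           else "ADV" if word.endswith("ly")
           else "NOUN")
    is_entity = word[:1].isupper() and position > 0   # skip sentence-initial caps
    return {"word": word.lower(), "pos": pos, "entity": is_entity}

def segment(words):
    """Split a contribution into argument segments at discourse markers,
    so each segment can be coded independently of its neighbors."""
    segments, current = [], []
    for w in words:
        if w.lower() in DISCOURSE_MARKERS and current:
            segments.append(current)
            current = []
        current.append(w)
    if current:
        segments.append(current)
    return segments
```

Extracting features before segmentation, as the paper argues, means the coder sees the same word-level representation regardless of where segment boundaries fall, which is what makes the downstream coding less sensitive to segmentation particulars.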