Improving Keyword-Based Topic Classification in Cancer Patient Forums with Multilingual Transformers
Related papers
Identification of Disease or Symptom terms in Reddit to Improve Health Mention Classification
Proceedings of the ACM Web Conference 2022
In user-generated text, such as that on social media platforms and online forums, people often use disease or symptom terms in ways other than to describe their health. In data-driven public health surveillance, the health mention classification (HMC) task aims to identify posts where users are discussing health conditions rather than using disease and symptom terms for other reasons. Existing computational research typically studies health mentions only on Twitter, covers a limited set of disease or symptom terms, and ignores both user behavior information and the other ways people use disease or symptom terms. To advance HMC research, we present the Reddit health mention dataset (RHMD), a new dataset of multi-domain Reddit data for HMC. RHMD consists of 10,015 manually labeled Reddit posts that mention 15 common disease or symptom terms and are annotated with four labels: personal health mentions, non-personal health mentions, figurative health mentions, and hyperbolic health mentions. With RHMD, we propose HMC-NET, which hierarchically combines target keyword (disease or symptom term) identification and user behavior to improve HMC. Experimental results demonstrate that the proposed approach outperforms state-of-the-art methods with an F1-Score of 0.75 (an increase of 11% over the state-of-the-art) and show that our new dataset poses a strong challenge to existing HMC methods. CCS CONCEPTS • Applied computing → Health informatics; • Computing methodologies → Natural language processing.
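The problem this abstract describes can be illustrated with a minimal sketch of plain keyword matching. The keyword list and the example posts below are invented for illustration and are not taken from RHMD; the code is not HMC-NET, only a naive baseline that shows why a keyword hit alone cannot separate personal from figurative or hyperbolic use.

```python
import re

# Hypothetical keyword list, invented for illustration only.
KEYWORDS = ["heart attack", "cancer", "migraine"]


def keyword_hits(post, keywords=KEYWORDS):
    """Return the keywords that occur in the post as whole words (case-insensitive)."""
    text = post.lower()
    return [kw for kw in keywords
            if re.search(r"\b" + re.escape(kw) + r"\b", text)]


# Hypothetical posts: the comments give the label a human reader would assign.
posts = [
    "I was diagnosed with cancer last spring and start chemo on Monday.",  # personal
    "I nearly had a heart attack when I saw the final score!",             # hyperbolic
    "That new policy is a cancer on the whole organisation.",              # figurative
]

for post in posts:
    # Every post produces a hit, although only the first describes the writer's health.
    print(keyword_hits(post), "<-", post)
```

All three posts match a keyword, so a surveillance system that counts matches would treat them alike. Distinguishing them requires a classifier that looks at the context around the keyword.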
Ibérica, 2020
This study aimed to characterize medical terms in an online cancer forum, with particular focus on specialization and semantic features. A three-step analysis was carried out on a 60-million-word corpus to detect and characterize the most typical medical terms used in a cancer forum by means of (1) keywords contrastive, (2) co-text-guided, and (3) semantic analyses. More than half of the 1000 words analysed were medical terms according to the co-text-guided analysis carried out. Most of them (73%) were dictionary-defined medical terms, followed by co-text-defined terms (9%) and medical initialisms (8.5%). The semantic analysis showed a higher number of terms within the fields of Anatomy, Treatment, Hospital and Symptoms. Our findings suggest that medical terms are commonly used in cancer forums, especially to share e-patients' concerns about treatment, symptoms and hospital environment. The method followed is efficient and could be applied in future studies. Altogether, this art...
Knowledge-Aware Neural Networks for Medical Forum Question Classification
Proceedings of the 30th ACM International Conference on Information & Knowledge Management
Online medical forums have become a predominant platform for answering health-related information needs of consumers. However, with a significant rise in the number of queries and the limited availability of experts, it is necessary to automatically classify medical queries based on a consumer's intention, so that these questions may be directed to the right set of medical experts. Here, we develop a novel medical knowledge-aware BERT-based model (MedBERT) that explicitly gives more weight to medical concept-bearing words and utilizes domain-specific side information obtained from a popular medical knowledge base. We also contribute a multi-label dataset for the Medical Forum Question Classification (MFQC) task. MedBERT achieves state-of-the-art performance on two benchmark datasets and performs very well in low-resource settings. CCS CONCEPTS • Applied computing → Consumer health; • Computing methodologies → Supervised learning by classification.
2021
The widespread influence of social media impacts every aspect of life, including the healthcare sector. Although medics and health professionals are the final decision makers, the advice and recommendations obtained from fellow patients are significant. In this context, the present paper explores the topics of discussion posted by breast cancer patients and survivors on online forums. The study examines an online forum, Breastcancer.org, maps the discussion entries to several topics, and proposes a machine learning model based on a classification algorithm to characterize the topics. To explore the topics of breast cancer patients and survivors, approximately 1000 posts are selected and manually labeled with annotations. In contrast, millions of posts are available to build the labels. A semi-supervised learning technique is used to build the labels for the unlabeled data; hence, the large data are classified using a deep learning algorithm. The deep learning algorithm BiLSTM with B...
Medicine Radar - a Tool for Exploring Online Health Discussions
2018
Research focusing on online health discussions provides valuable insights into the use of medicines, as well as into health-related experiences and difficulties that are currently not well understood. We introduce Medicine Radar, a tool for exploring health-related online discussions obtained from the Finnish Suomi24 chat forum. The health subset of the entire Suomi24 data consists of 19 million messages written over a time span of 16 years. We outline the method, identify some challenges in analyzing Finnish texts and explain how we overcame them in this specific domain. In particular, we present a novel method for generating domain vocabularies from colloquial texts, which utilizes a combination of machine learning and human input. Medicine Radar is accessible as an open-source web interface that we hope will inspire and facilitate further research.
forumBERT: Topic Adaptation and Classification of Contextualized Forum Comments in German
2021
Online user comments in public forums are often associated with low quality, hate speech or even excessive demands for moderation. To better exploit their constructive and deliberate potential, we present forumBERT. forumBERT is built on top of the BERT architecture and uses a shared-weight and late-fusion technique to better determine the quality and relevance of a comment on a forum article. Our model integrates article context with comments for the online/offline comment moderation task. This is done using a two-step procedure: self-supervised BERT language model fine-tuning for topic adaptation, followed by integration into the forumBERT architecture for online/offline classification. We present evaluation results on various classification tasks of the public One Million Post dataset, as well as on the online/offline comment moderation task on 998,158 labelled comments from NDR.de, a popular German broadcaster’s website. forumBERT significantly outperforms baseline models on the ...
Health Information Science and Systems, 2019
Online remedy finders and health-related discussion forums have become increasingly popular in recent years. Ordinary web users describe their health problems there and request suggestions from experts or other users. As a result, these forums have become a huge repository of information and discussions on various health issues. An intelligent information retrieval system can help to utilize this repository in various applications. In this paper, we propose a system that, given a new post, automatically identifies similar existing forum posts. The system is based on computing the similarity between two patient-authored texts. To compute the similarity between the current post and existing posts, the system uses a hybrid strategy based on template information, topic modelling, and latent semantic indexing. The system is tested using a set of real questions collected from a homeopathy forum, namely abchomeopathy.com. The relevance of the posts retrieved by the system is evaluated by human experts. The evaluation results demonstrate that the precision of the system is 88.87%.
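The retrieval idea in this abstract, ranking existing posts by their similarity to a new post, can be sketched as follows. This is a deliberately simplified stand-in: it uses cosine similarity over raw term counts, not the authors' hybrid of template information, topic modelling, and latent semantic indexing. The function names and the example posts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(new_post, existing_posts, top_k=2):
    """Return the top_k (score, post) pairs, highest similarity first."""
    query = Counter(tokenize(new_post))
    scored = [(cosine(query, Counter(tokenize(p))), p) for p in existing_posts]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical forum posts, invented for illustration only.
existing = [
    "I get a headache and nausea every morning after waking up",
    "My knee hurts when I climb stairs",
    "Skin rash on both arms since last week",
]

for score, post in most_similar("headache and nausea every morning", existing):
    print(round(score, 3), post)
```

Raw term counts only match posts that share surface words. Latent semantic indexing, as named in the abstract, projects the term vectors into a lower-dimensional space so that posts using different but related vocabulary can still be matched.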
Expertise in French health forums
Health Informatics Journal, 2019
More and more health websites hire medical experts (physicians, medical students, experienced volunteers, etc.) and explicitly indicate their medical role to signal that they provide high-quality answers. However, medical experts may participate in forum discussions even when their role is not officially indicated. Detecting posts written by medical experts facilitates quick access to posts that are more likely to be correct and informative. The main objective of this work is to learn classification models that can be used to detect posts written by medical experts in any health forum discussion. Two French health forums have been used to discover the best features and methods for this text categorization task. The results confirm that models learned on appropriate websites may be used efficiently on other websites (an F1-measure above 98% was obtained using a Random Forest classifier). A study of misclassified posts highlights the participation ...
Journal of Big Data
Twitter and social media as a whole have great potential as a source of disease surveillance data; however, the general messiness of tweets presents several challenges for standard information extraction methods. Most deployed systems rely on simple keyword matching and do not distinguish between relevant and irrelevant keyword mentions, making them susceptible to false positives, because keyword volume can be influenced by several social phenomena that may be unrelated to disease occurrence. Furthermore, most solutions are intended for a single language, and those meant for multilingual scenarios do not incorporate semantic context. In this paper, we experimentally examine different approaches for classifying text for epidemiological surveillance on the social web. In addition, we offer a systematic comparison of the impact of different input representations on performance. Specifically, we compare continuous representations against one-hot encodi...