Min-Yen Kan | National University of Singapore
Papers by Min-Yen Kan
With a rise in false, inaccurate, and misleading information in propaganda, news, and social media, real-world Question Answering (QA) systems face the challenge of synthesizing and reasoning over misinformation-polluted contexts to derive correct answers. This urgency gives rise to the need to make QA systems robust to misinformation, a topic previously unexplored. We study the risk misinformation poses to QA models by investigating the sensitivity of open-domain QA models to corpus pollution with misinformation documents. We curate both human-written and model-generated false documents that we inject into the evidence corpus of QA models, and assess the impact on the performance of these systems. Experiments show that QA models are vulnerable to even small amounts of evidence contamination brought by misinformation, with large absolute performance drops on all models. Misinformation attacks pose a greater threat when fake documents are produced at scale by neural models or when the attacker targets specific questions of interest. To defend against such threats, we discuss the necessity of building a misinformation-aware QA system that integrates question answering and misinformation detection in a joint fashion.
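A minimal sketch of the kind of contamination experiment described above. The `qa_system(question, corpus)` callable and list-of-documents corpus representation are assumptions for illustration, not the paper's actual interface:

```python
def em_score(preds, golds):
    """Exact-match accuracy between predicted and gold answers."""
    return sum(p.strip().lower() == g.strip().lower()
               for p, g in zip(preds, golds)) / len(golds)

def pollution_drop(qa_system, questions, golds, clean_corpus, fake_docs):
    """Absolute EM drop after injecting misinformation documents into the
    evidence corpus of an open-domain QA pipeline (retriever + reader)."""
    before = em_score([qa_system(q, clean_corpus) for q in questions], golds)
    after = em_score([qa_system(q, clean_corpus + fake_docs) for q in questions], golds)
    return before - after
```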
We explore zero- and few-shot generalization for fact verification (FV), which aims to generalize an FV model trained on well-resourced domains (e.g., Wikipedia) to low-resourced domains that lack human annotations. To this end, we first construct a benchmark dataset collection containing 11 FV datasets representing 6 domains. We conduct an empirical analysis of generalization across these FV datasets, finding that current models generalize poorly. Our analysis reveals that several factors affect generalization, including dataset size, length of evidence, and the type of claims. Finally, we show that two directions of work improve generalization: 1) incorporating domain knowledge via pretraining on specialized domains, and 2) automatically generating training data via claim generation.
Current scientific fact-checking benchmarks exhibit several shortcomings, such as biases arising from crowd-sourced claims and an overreliance on text-based evidence. We present SCITAB, a challenging evaluation dataset consisting of 1.2K expert-verified scientific claims that 1) originate from authentic scientific publications and 2) require compositional reasoning for verification. The claims are paired with evidence-containing scientific tables annotated with labels. Through extensive evaluations, we demonstrate that SCITAB poses a significant challenge to state-of-the-art models, including table-based pretraining models and large language models. All models except GPT-4 achieved performance barely above random guessing. Popular prompting techniques, such as Chain-of-Thought, do not achieve much performance gain on SCITAB. Our analysis uncovers several unique challenges posed by SCITAB, including table grounding, claim ambiguity, and compositional reasoning. Our code and data are publicly available at https://github.com/XinyuanLu00/SciTab.
[Figure: example SCITAB claims drawn from arXiv:1911.00225v1. Supported: "A's productivity of 57.5% expresses that it appears 7.5% more often than expected by random chance." Refuted: the same claim with "9.5%" in place of "7.5%". Not Enough Info: "The low performance of 'to' can be explained by the fact that it is responsible for only 4.6% of the inference in the training set."]
Annotated data plays a critical role in Natural Language Processing (NLP), both for training models and for evaluating their performance. Given recent developments in Large Language Models (LLMs), models such as ChatGPT demonstrate zero-shot capability on many text-annotation tasks, comparable with or even exceeding human annotators. Such LLMs can serve as alternatives to manual annotation, due to lower costs and higher scalability. However, little work has leveraged LLMs as complementary annotators or explored how annotation work is best allocated between humans and LLMs to achieve both quality and cost objectives. We propose CoAnnotating, a novel paradigm for human-LLM co-annotation of unstructured texts at scale. Under this framework, we utilize uncertainty to estimate LLMs' annotation capability. Our empirical study shows CoAnnotating to be an effective means of allocating work, with results on different datasets showing up to a 21% performance improvement over a random baseline. For the code implementation, see https://github.com/SALT-NLP/CoAnnotating.
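A toy sketch of uncertainty-driven allocation in this spirit: entropy over the labels an LLM produces under repeated or paraphrased prompts routes hard instances to humans. The threshold and the interface names are illustrative assumptions, not the paper's exact estimator:

```python
from collections import Counter
from math import log

def label_entropy(labels):
    """Shannon entropy of an LLM's label distribution across repeated prompts."""
    counts = Counter(labels)
    n = sum(counts.values())
    return -sum(c / n * log(c / n) for c in counts.values())

def allocate(instances, llm_labels, threshold=0.5):
    """Route high-uncertainty instances to humans, the rest to the LLM.
    `llm_labels[i]` holds the labels the LLM produced for instance i."""
    to_human, to_llm = [], []
    for inst, labels in zip(instances, llm_labels):
        (to_human if label_entropy(labels) > threshold else to_llm).append(inst)
    return to_human, to_llm
```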
Fact-checking real-world claims often requires complex, multi-step reasoning due to the absence of direct evidence to support or refute them. However, existing fact-checking systems often lack transparency in their decision-making, making it challenging for users to comprehend their reasoning process. To address this, we propose the Question-guided Multi-hop Fact-Checking (QACHECK) system, which guides the model's reasoning process by asking a series of questions critical for verifying a claim. QACHECK has five key modules: a claim verifier, a question generator, a question-answering module, a QA validator, and a reasoner. Users can input a claim into QACHECK, which then predicts its veracity and provides a comprehensive report detailing its reasoning process, guided by a sequence of (question, answer) pairs. QACHECK also provides the source of evidence supporting each question, fostering a transparent, explainable, and user-friendly fact-checking process.
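A sketch of how the five modules could compose into the question-guided loop the abstract describes; the callable interfaces and the `max_hops` cutoff are assumptions for illustration:

```python
def qacheck(claim, verifier, gen_question, answer, validate, reason, max_hops=5):
    """Iteratively generate and answer questions until the claim verifier
    decides there is enough context, then let the reasoner give a verdict."""
    context = []                              # accumulated (question, answer) pairs
    for _ in range(max_hops):
        if verifier(claim, context):          # enough information to decide?
            break
        q = gen_question(claim, context)      # next question critical to the claim
        a = answer(q)                         # question-answering module
        if validate(q, a):                    # QA validator keeps only useful pairs
            context.append((q, a))
    return reason(claim, context)             # final verdict plus rationale
```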
With NLP research now quickly being transferred into real-world applications, it is important to be aware of and think through the consequences of our scientific investigations. Such ethical considerations are important in both authoring and reviewing. This tutorial will equip participants with basic guidelines for thinking deeply about ethical issues and review common considerations that recur in NLP research. The methodology is interactive and participatory, including case studies and working in groups. Importantly, the participants will co-build the tutorial outcomes and will work to create further tutorial materials to share as public outcomes.
We investigate the potential misuse of modern Large Language Models (LLMs) for generating credible-sounding misinformation and its subsequent impact on information-intensive applications, particularly Open-Domain Question Answering (ODQA) systems. We establish a threat model and simulate potential misuse scenarios, both unintentional and intentional, to assess the extent to which LLMs can be utilized to produce misinformation. Our study reveals that LLMs can act as effective misinformation generators, leading to a significant degradation (up to 87%) in the performance of ODQA systems. Moreover, we uncover disparities in the attributes associated with persuading humans and machines, presenting an obstacle to current human-centric approaches to combating misinformation. To mitigate the harm caused by LLM-generated misinformation, we propose three defense strategies: misinformation detection, vigilant prompting, and reader ensemble. These approaches have demonstrated promising results, albeit with certain associated costs. Lastly, we discuss the practicality of utilizing LLMs as automatic misinformation generators and provide relevant resources and code to facilitate future research in this area.
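A minimal sketch of the "reader ensemble" defense idea named above, assuming readers are callables that map a question and retrieved documents to an answer string (the aggregation rule here, majority vote, is an illustrative choice):

```python
from collections import Counter

def reader_ensemble(readers, question, docs):
    """Aggregate answers from several reader models by majority vote,
    diluting the influence of any single polluted evidence passage."""
    answers = [reader(question, docs) for reader in readers]
    return Counter(answers).most_common(1)[0][0]
```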
arXiv (Cornell University), Nov 21, 2018
In discussions hosted on forums for Massive Open Online Courses (MOOCs), references to online learning resources are often of central importance. They contextualize the discussion, anchoring the participants' presentation of the issues and their understanding. However, they are usually mentioned in free text, without appropriate hyperlinking to their associated resource. Automated learning resource mention hyperlinking and categorization will facilitate discussion and searching within MOOC forums, and also benefit the contextualization of such resources across disparate views. We propose the novel problem of learning resource mention identification in MOOC forums; i.e., to identify resource mentions in discussions and classify them into pre-defined resource types. As this is a novel task with no publicly available data, we first contribute a large-scale labeled dataset, dubbed the Forum Resource Mention (FoRM) dataset, to facilitate current and future research on this task. FoRM contains over 10,000 real-world forum threads, collected in collaboration with Coursera, with more than 23,000 manually labeled resource mentions. We then formulate this task as a sequence tagging problem and investigate solution architectures to address it. Importantly, we identify…
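To make the sequence-tagging formulation concrete, a toy BIO-encoded example; the tokens and the resource-type labels here are invented for illustration, not drawn from FoRM:

```python
# Each token gets a tag: B-/I- mark the beginning/inside of a resource
# mention of a given type, O marks tokens outside any mention.
tokens = ["please", "rewatch", "lecture", "3.2", "before", "quiz", "1"]
tags   = ["O",      "O",       "B-VIDEO", "I-VIDEO", "O",   "B-QUIZ", "I-QUIZ"]
```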
Lecture Notes in Computer Science, 2018
New learning resources are created and minted in Massive Open Online Courses every week: new videos, quizzes, assessments, and discussion threads are deployed and interacted with in this era of on-demand online learning. However, these resources are often artificially siloed between platforms and web application models. Linking such resources facilitates learning and multimodal understanding, bettering learners' experience. We create a framework for MOOC Uniform Identifier for Resources (MUIR). MUIR enables applications to refer and link to such resources in a cross-platform way, allowing the easy minting of identifiers for MOOC resources, akin to #hashtags. We demonstrate the feasibility of this approach on the automatic identification, linking, and resolution (a task known as Wikification) of learning resources mentioned on MOOC discussion forums, drawn from a harvested collection of 100K+ resources. Our Wikification system achieves a high initial rate of 54.6% successful resolutions on key resource mentions found in discussion forums, demonstrating the utility of the MUIR framework. Our analysis of this new problem shows that context is a key factor in determining the correct resolution of such mentions.
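Purely to illustrate the hashtag-like minting idea: a hypothetical identifier builder. The `muir://` scheme and path layout below are invented for this sketch; the actual MUIR scheme is defined in the paper:

```python
def mint_muir(platform: str, course: str, rtype: str, rid: str) -> str:
    """Mint a cross-platform, hashtag-like identifier for a MOOC resource
    (hypothetical scheme for illustration only)."""
    return f"muir://{platform}/{course}/{rtype}/{rid}"

# mint_muir("coursera", "ml-001", "video", "3.2")
# -> "muir://coursera/ml-001/video/3.2"
```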
arXiv (Cornell University), Aug 15, 2017
The bipartite graph is a ubiquitous data structure that can model the relationship between two entity types: for instance, users and items, or queries and webpages. In this paper, we study the problem of ranking the vertices of a bipartite graph, based on the graph's link structure as well as prior information about the vertices (which we term a query vector). We present a new solution, BiRank, which iteratively assigns scores to vertices and finally converges to a unique stationary ranking. In contrast to traditional random walk-based methods, BiRank iterates towards optimizing a regularization function, which smooths the graph under the guidance of the query vector. Importantly, we establish how BiRank relates to the Bayesian methodology, enabling future extensions in a probabilistic way. To show the rationale and extensibility of the ranking methodology, we further extend it to rank the more generic n-partite graphs. BiRank's generic modeling of both the graph structure and vertex features enables it to model various ranking hypotheses flexibly. To illustrate its functionality, we apply the BiRank and TriRank (ranking for tripartite graphs) algorithms to two real-world applications: a general ranking scenario that predicts the future popularity of items, and a personalized ranking scenario that recommends items of interest to users. Extensive experiments on both synthetic and real-world datasets demonstrate BiRank's soundness (fast convergence), efficiency (linear in the number of graph edges), and effectiveness (achieving state-of-the-art results in the two real-world tasks).
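A minimal NumPy sketch of a BiRank-style iteration on a dense adjacency matrix: symmetrically normalize the graph, then alternate score updates smoothed toward the query vectors. The damping values and convergence test are illustrative defaults, not the paper's tuned settings:

```python
import numpy as np

def birank(W, u0, p0, alpha=0.85, beta=0.85, tol=1e-6, max_iter=200):
    """Iterate mutually-reinforcing scores on a bipartite graph W (|U| x |P|),
    smoothed toward query vectors u0 and p0, until a stationary ranking."""
    du = np.maximum(W.sum(axis=1), 1e-12)                 # U-side degrees
    dp = np.maximum(W.sum(axis=0), 1e-12)                 # P-side degrees
    S = W / np.sqrt(du)[:, None] / np.sqrt(dp)[None, :]   # symmetric normalization
    u, p = u0.copy(), p0.copy()
    for _ in range(max_iter):
        u_new = alpha * S @ p + (1 - alpha) * u0
        p_new = beta * S.T @ u_new + (1 - beta) * p0
        converged = np.abs(u_new - u).sum() + np.abs(p_new - p).sum() < tol
        u, p = u_new, p_new
        if converged:
            break
    return u, p
```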
Existing multimodal tasks mostly target the complete-input-modality setting, i.e., each modality is either complete or completely missing in both the training and test sets. However, randomly missing modalities remain underexplored. In this paper, we present a novel approach named MM-Align to address the missing-modality inference problem. Concretely, we propose 1) an alignment dynamics learning module based on the theory of optimal transport (OT) for indirect missing-data imputation; and 2) a denoising training algorithm to simultaneously enhance the imputation results and backbone network performance. Compared with previous methods devoted to reconstructing the missing inputs, MM-Align learns to capture and imitate the alignment dynamics between modality sequences. Results of comprehensive experiments on three datasets covering two multimodal tasks empirically demonstrate that our method can perform more accurate and faster inference and relieve overfitting under various missing conditions.
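For readers unfamiliar with the OT machinery the abstract leans on: a generic entropic-OT (Sinkhorn) solver that computes a soft alignment plan between two sequences' positions. This is standard Sinkhorn, not the paper's module; marginals `a` and `b` are assumed to sum to the same mass:

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.1, n_iter=200):
    """Entropic optimal transport: return a transport plan P with row
    marginals a and column marginals b that (approximately) minimizes
    <P, cost> - eps * entropy(P)."""
    K = np.exp(-cost / eps)          # Gibbs kernel from the cost matrix
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)            # scale columns to match marginal b
        u = a / (K @ v)              # scale rows to match marginal a
    return u[:, None] * K * v[None, :]
```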
Fact-checking real-world claims often requires collecting multiple pieces of evidence and applying complex multi-step reasoning. In this paper, we present Program-Guided Fact-Checking (PROGRAMFC), a novel fact-checking model that decomposes complex claims into simpler sub-tasks that can be solved using a shared library of specialized functions. We first leverage the in-context learning ability of large language models to generate reasoning programs to guide the verification process. Afterward, we execute the program by delegating each sub-task to the corresponding sub-task handler. This process makes our model both explanatory and data-efficient, providing clear explanations of its reasoning process and requiring minimal training data. We evaluate PROGRAMFC on two challenging fact-checking datasets and show that it outperforms seven fact-checking baselines across different settings of evidence availability, with explicit output programs that benefit human debugging. Example claim: "Both James Cameron and the director of the film Interstellar were born in Canada."
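A toy interpreter illustrating the execute-by-delegation step: a generated program is a list of sub-task calls, each dispatched to a handler, with later steps able to reference earlier answers. The operation names, placeholder syntax, and stubbed handlers are assumptions for this sketch, not the paper's program format:

```python
# Hypothetical sub-task handlers; in PROGRAMFC these are specialized modules.
def answer_question(q: str) -> str:
    raise NotImplementedError("plug in a QA model")

def verify(claim: str) -> bool:
    raise NotImplementedError("plug in a claim verifier")

def run_program(program):
    """Execute a reasoning program: a list of ('QUESTION'|'VERIFY', text) steps.
    Later steps may reference earlier answers via {ANS_i} placeholders."""
    answers, verdicts = {}, []
    for i, (op, text) in enumerate(program):
        text = text.format(**answers)             # substitute earlier answers
        if op == "QUESTION":
            answers[f"ANS_{i}"] = answer_question(text)
        elif op == "VERIFY":
            verdicts.append(verify(text))
    return all(verdicts)

# A program for the example claim above might look like:
# [("QUESTION", "Who directed the film Interstellar?"),
#  ("VERIFY",   "James Cameron was born in Canada."),
#  ("VERIFY",   "{ANS_0} was born in Canada.")]
```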
arXiv (Cornell University), Jul 9, 2023
Image-text models (ITMs) are the prevalent architecture for solving video question-answering tasks. ITMs require only a few input frames, saving significant computation compared with video-language models. However, we find existing ITM video question-answering approaches either 1) adopt simplistic and unintentional sampling strategies, which may miss key frames that offer answer clues; or 2) sample a large number of frames into divided groups, which computational resources cannot accommodate. We develop an efficient sampling method for the few-frame scenario. We first summarize a family of prior sampling methods based on question-frame correlation into a unified one, dubbed Most Implied Frames (MIF). Through analysis, we form the hypothesis that question-aware sampling is not necessary, from which we further propose a second method, Most Dominant Frames (MDF). Results on four public datasets and three ITMs demonstrate that MIF and MDF boost the performance of image-text pretrained models and apply widely across both model architectures and datasets. Code is available at https://github.com/declare-lab/Sealing.
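A toy version of question-aware sampling in the MIF spirit: score each frame by cosine similarity to the question embedding and keep the top-k. Feature extraction (e.g., a CLIP-style encoder) is assumed done upstream; this is not the paper's exact scoring function:

```python
import numpy as np

def most_implied_frames(frame_feats, question_feat, k=4):
    """Return indices of the k frames whose embeddings are most similar to
    the question embedding (cosine similarity, highest first)."""
    F = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    q = question_feat / np.linalg.norm(question_feat)
    return np.argsort(F @ q)[-k:][::-1]
```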
arXiv (Cornell University), Sep 23, 2022
Prerequisites can play a crucial role in users' decision-making, yet recommendation systems have not fully utilized such contextual background knowledge. Traditional recommendation systems (RS) mostly enrich user-item interactions where the context consists of static user profiles and item descriptions, ignoring the contextual logic and constraints that underlie them. For example, an RS may recommend an item on the condition that the user has interacted with another item as its prerequisite. Modeling prerequisite context from conceptual side information can overcome this weakness. We propose Prerequisite Driven Recommendation (PDR), a generic context-aware framework in which prerequisite context is explicitly modeled to facilitate recommendation. We first design a Prerequisite Knowledge Linking (PKL) algorithm to curate datasets facilitating PDR research. Employing it, we build a 75k+ high-quality prerequisite concept dataset that spans three domains. We then contribute PDRS, a neural instantiation of PDR. By jointly optimizing both the prerequisite learning and recommendation tasks through multi-layer perceptrons, we find PDRS consistently outperforms baseline models in all three domains, by an average margin of 7.41%. Importantly, PDRS performs especially well in cold-start scenarios, with improvements of up to 17.65%.
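One plausible reading of "jointly optimizing both tasks" is a weighted sum of per-task losses; the sketch below shows that pattern. The trade-off weight `lam` and the binary-cross-entropy choice are assumptions, not details confirmed by the abstract:

```python
import torch
import torch.nn.functional as F

def joint_loss(rec_logits, rec_labels, prereq_logits, prereq_labels, lam=0.5):
    """Joint training objective sketch: recommendation loss plus a weighted
    prerequisite-prediction loss. Labels are float tensors in {0, 1}."""
    return (F.binary_cross_entropy_with_logits(rec_logits, rec_labels)
            + lam * F.binary_cross_entropy_with_logits(prereq_logits, prereq_labels))
```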
Findings of the Association for Computational Linguistics: ACL 2022, 2022
Modern Natural Language Processing (NLP) models are known to be sensitive to input perturbations, and their performance can decrease when applied to real-world, noisy data. However, it is still unclear why models are less robust to some perturbations than others. In this work, we test the hypothesis that the extent to which a model is affected by an unseen textual perturbation (robustness) can be explained by the learnability of the perturbation (defined as how well the model learns to identify the perturbation with a small amount of evidence). We further give a causal justification for the learnability metric. We conduct extensive experiments with four prominent NLP models (TextRNN, BERT, RoBERTa, and XLNet) over eight types of textual perturbations on three datasets. We show that a model which is better at identifying a perturbation (higher learnability) becomes worse at ignoring such a perturbation at test time (lower robustness), providing empirical support for our hypothesis.
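A toy proxy for the learnability measurement: train a small probe on limited evidence to distinguish perturbed from clean text and report held-out accuracy. TF-IDF plus logistic regression stand in for the paper's actual probing setup, which uses the NLP models themselves:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def learnability(clean_texts, perturbed_texts, n_train=100):
    """Held-out accuracy of a probe trained on only n_train examples to
    tell perturbed text from clean text (higher = more learnable)."""
    texts = clean_texts + perturbed_texts
    y = [0] * len(clean_texts) + [1] * len(perturbed_texts)
    X = TfidfVectorizer().fit_transform(texts)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=n_train, stratify=y, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)
```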
arXiv (Cornell University), Oct 23, 2020
Obtaining training data for multi-hop question answering (QA) is time-consuming and resource-intensive. We explore the possibility of training a well-performing multi-hop QA model without referencing any human-labeled multi-hop question-answer pairs, i.e., unsupervised multi-hop QA. We propose MQA-QG, an unsupervised framework that can generate human-like multi-hop training data from both homogeneous and heterogeneous data sources. MQA-QG generates questions by first selecting or generating relevant information from each data source and then integrating that information to form a multi-hop question. Using only generated training data, we can train a competent multi-hop QA model which achieves 61% and 83% of the supervised-learning performance on the HybridQA and HotpotQA datasets, respectively. We also show that pretraining the QA system with the generated data greatly reduces the demand for human-annotated training data. Our code is publicly available at https://github.com/teacherpeterpan/Unsupervised-Multi-hop-QA.
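To illustrate the "integrate information into a multi-hop question" step, a naive textual composition in this spirit: a bridge entity in one single-hop question is replaced by a descriptive clause derived from another source. This string-level fusion is a simplification of MQA-QG's actual operators:

```python
def compose_bridge_question(hop2_question, bridge_entity, bridge_clause):
    """Fuse two hops: replace the bridge entity in a single-hop question
    with a clause that the reader must resolve first."""
    return hop2_question.replace(bridge_entity, bridge_clause)

# compose_bridge_question(
#     "Where was Christopher Nolan born?",
#     "Christopher Nolan",
#     "the director of the film Interstellar",
# )  # -> "Where was the director of the film Interstellar born?"
```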
IEEE Transactions on Multimedia, 2020
Modeling the structure of culinary recipes is the core of recipe representation learning. Current approaches mostly focus on extracting the workflow graph from recipes based on text descriptions. Process images, which constitute an important part of cooking recipes, have rarely been investigated in recipe structure modeling. We study this recipe structure problem from a multi-modal learning perspective, proposing a prerequisite tree to represent recipes with cooking images at a step-level granularity. We propose a simple-yet-effective two-stage framework to automatically construct the prerequisite tree for a recipe by (1) utilizing a trained classifier that fuses multi-modal features as input to detect pairwise prerequisite relations; then (2) applying different strategies (greedy method, maximum weight, and beam search) to build the tree structure. Experiments on the MM-ReS dataset demonstrate the advantages of introducing process images for recipe structure modeling. Also, compared with neural methods that require large amounts of training data, we show that our two-stage pipeline can achieve promising results using only 400 labeled prerequisite trees as training data.
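A sketch of the greedy variant of stage (2): given pairwise prerequisite scores from the stage-(1) classifier, attach each step to its highest-scoring earlier step. The `score(i, j)` interface and root convention are assumptions for illustration:

```python
def greedy_prereq_tree(n_steps, score):
    """Build a prerequisite tree greedily: for each step j, pick as parent
    the earlier step i maximizing the classifier's score(i, j)."""
    parent = {0: None}                     # treat step 0 as the root
    for j in range(1, n_steps):
        parent[j] = max(range(j), key=lambda i: score(i, j))
    return parent                          # maps each step to its prerequisite
```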
Springer eBooks, Nov 1, 2006
This paper contributes improvements to both the effectiveness and efficiency of Matrix Factorization (MF) methods for implicit feedback. We highlight two critical issues in existing work. First, due to the large space of unobserved feedback, most existing works resort to assigning a uniform weight to the missing data to reduce computational complexity. However, such a uniform assumption is invalid in real-world settings. Second, most methods are also designed in an offline setting and fail to keep up with the dynamic nature of online data. We address these two issues in learning MF models from implicit feedback. We first propose to weight the missing data based on item popularity, which is more effective and flexible than the uniform-weight assumption. However, such non-uniform weighting poses an efficiency challenge in learning the model. To address this, we design a new learning algorithm based on the element-wise Alternating Least Squares (eALS) technique for efficiently optimizing an MF model with variably-weighted missing data. We exploit this efficiency to seamlessly devise an incremental update strategy that instantly refreshes an MF model given new feedback. Through comprehensive experiments on two public datasets in both offline and online protocols, we show that our eALS method consistently outperforms state-of-the-art implicit MF methods. Our implementation is available at https://github.com/hexiangnan/sigir16-eals.
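A small sketch of the popularity-based weighting idea: the confidence that an unobserved (user, item) pair is a true negative grows with the item's popularity. The exact parametrization below (overall weight `c0`, exponent `alpha`, frequency normalization) is an assumed form for illustration, not necessarily the paper's final formula:

```python
import numpy as np

def popularity_weights(item_interaction_counts, c0=512.0, alpha=0.75):
    """Per-item weights for missing entries: more popular items receive
    higher negative-confidence weight than a uniform assignment would."""
    f = np.asarray(item_interaction_counts, dtype=float)
    w = f ** alpha                     # dampened popularity
    return c0 * w / w.sum()            # normalized, scaled by overall weight
```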
arXiv (Cornell University), May 26, 2019
Course instructors often need to participate selectively in student discussion threads, due to time constraints, limited bandwidth, and the lopsided student-instructor ratio on online forums. We propose the first deep learning models for this binary prediction problem. We propose novel attention-based models to infer the amount of latent context necessary to predict instructor intervention. Such models can also be tuned to an instructor's preference to intervene early or late. Our four proposed attentive model variants improve over the state of the art by a significant, large margin of 11% in F1 and 10% in recall, on average. Further, introspection of the attention weights helps us better understand which aspects of a discussion post propagate through the discussion thread and prompt instructor intervention.