Uncovering the Risks and Drawbacks Associated With the Use of Synthetic Data for Grammatical Error Correction

Synthetic Alone: Exploring the Dark Side of Synthetic Data for Grammatical Error Correction

arXiv, 2023

The data-centric AI approach aims to enhance model performance without modifying the model itself and has been shown to impact model performance positively. While recent attention has been given to data-centric AI based on synthetic data, due to its potential for performance improvement, data-centric AI has long been validated exclusively using real-world data and publicly available benchmark datasets. In this respect, data-centric AI still depends heavily on real-world data, and the verification of models using synthetic data has not yet been thoroughly carried out. Given the challenges above, we ask the question: "Does data quality control (noise injection and balanced data), a data-centric AI methodology acclaimed to have a positive impact, exhibit the same positive impact in models trained solely with synthetic data?" To address this question, we conducted comparative analyses between models trained on synthetic and real-world data for the grammatical error correction (GEC) task. Our experimental results reveal that the data quality control method has a positive impact on models trained with real-world data, as previously reported in existing studies, while a negative impact is observed in models trained solely on synthetic data.
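As a rough illustration of the noise-injection idea the abstract refers to, the sketch below corrupts clean sentences into (noisy, clean) pseudo pairs for GEC training. The corruption operations and the probability `p` are illustrative assumptions, not the paper's actual configuration.

```python
import random

# Illustrative noise-injection corruptor for building GEC pseudo data.
# The operations (drop / duplicate / reverse) and the probability p are
# assumptions for demonstration, not the paper's configuration.

def corrupt(tokens, p=0.1, rng=random):
    """Corrupt a clean token list to form a (noisy, clean) training pair."""
    noisy = []
    for tok in tokens:
        r = rng.random()
        if r < p / 3:
            continue                      # deletion noise
        elif r < 2 * p / 3:
            noisy.extend([tok, tok])      # duplication noise
        elif r < p:
            noisy.append(tok[::-1])       # character reversal as spelling noise
        else:
            noisy.append(tok)             # keep the token unchanged
    return noisy

clean = "the cat sat on the mat".split()
print(corrupt(clean), "->", clean)  # pseudo GEC pair: noisy source -> clean target
```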

Data-centric AI: Perspectives and Challenges

SDM, 2023

The role of data in building AI systems has recently been significantly magnified by the emerging concept of data-centric AI (DCAI), which advocates a fundamental shift from model advancements to ensuring data quality and reliability. Although our community has continuously invested effort into enhancing data in different aspects, these efforts are often isolated initiatives on specific tasks. To facilitate the collective initiative in our community and push DCAI forward, we draw a big picture and bring together three general missions: training data development, inference data development, and data maintenance. We provide a top-level discussion of representative DCAI tasks and share perspectives. Finally, we list open challenges. More resources are summarized at https://github.com/daochenzha/data-centric-AI

Data-centric Artificial Intelligence: A Survey

Artificial Intelligence (AI) is making a profound impact in almost every domain. A vital enabler of its great success is the availability of abundant and high-quality data for building machine learning models. Recently, the role of data in AI has been significantly magnified, giving rise to the emerging concept of data-centric AI. The attention of researchers and practitioners has gradually shifted from advancing model design to enhancing the quality and quantity of the data. In this survey, we discuss the necessity of data-centric AI, followed by a holistic view of three general data-centric goals (training data development, inference data development, and data maintenance) and the representative methods. We also organize the existing literature from automation and collaboration perspectives, discuss the challenges, and tabulate the benchmarks for various tasks. We believe this is the first comprehensive survey that provides a global view of a spectrum of tasks across various stages of the data lifecycle. We hope it can help the readers efficiently grasp a broad picture of this field, and equip them with the techniques and further research ideas to systematically engineer data for building AI systems. A companion list of data-centric AI resources will be regularly updated on https://github.com/daochenzha/data-centric-AI

DataPerf: Benchmarks for Data-Centric AI Development

arXiv, 2022

Machine learning (ML) research has generally focused on models, while the most prominent datasets have been employed for everyday ML tasks without regard for the breadth, difficulty, and faithfulness of these datasets to the underlying problem. Neglecting the fundamental importance of datasets has caused major problems involving data cascades in real-world applications and saturation of dataset-driven criteria for model quality, hindering research growth. To solve this problem, we present DataPerf, a benchmark package for evaluating ML datasets and dataset-working algorithms. We intend it to enable the "data ratchet," in which training sets will aid in evaluating test sets on the same problems, and vice versa. Such a feedback-driven strategy will generate a virtuous loop that will accelerate the development of data-centric AI. The MLCommons Association will maintain DataPerf.
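The "data ratchet" idea lends itself to a simple illustration: hold the model and the test set fixed and score competing training sets. The sketch below is a hypothetical toy version of such a dataset benchmark loop; the candidate-set names and the scoring protocol are assumptions, not DataPerf's actual harness.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy dataset-benchmark loop: a fixed model and test set score
# competing training sets. Dataset names are illustrative.
X, y = make_classification(n_samples=2000, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

candidate_training_sets = {
    "full_pool": (X_pool, y_pool),
    "first_half": (X_pool[:700], y_pool[:700]),
}

for name, (X_tr, y_tr) in candidate_training_sets.items():
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(name, model.score(X_test, y_test))  # higher score = better training set
```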

Augment & Valuate: A Data Enhancement Pipeline for Data-Centric AI

arXiv, 2021

Data scarcity and noise are important issues in industrial applications of machine learning. However, it is often challenging to devise a scalable and generalized approach to address the fundamental distributional and semantic properties of a dataset with black-box models. For this reason, data-centric approaches are crucial for the automation of the machine learning operations pipeline. To serve as the basis for this automation, we suggest a domain-agnostic pipeline for refining the quality of data in image classification problems. This pipeline comprises data valuation, cleansing, and augmentation. With an appropriate combination of these methods, we achieved 84.711% test accuracy (ranked #6, Honorable Mention in the Most Innovative category) in the Data-Centric AI competition using only the provided dataset.
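To make the valuate-cleanse-augment pattern concrete, here is a minimal toy sketch. The KNN-based value proxy and the jitter augmentation are stand-in assumptions for demonstration, not the authors' actual pipeline.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy valuate -> cleanse -> augment pipeline on synthetic features.
# The KNN agreement score and Gaussian jitter are illustrative proxies.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))
y = (X[:, 0] > 0).astype(int)
y[:25] = 1 - y[:25]  # simulate label noise

# 1) Valuation: score each point by agreement with its neighbours.
knn = KNeighborsClassifier(n_neighbors=10).fit(X, y)
value = (knn.predict(X) == y).astype(float)

# 2) Cleansing: drop the lowest-value (likely mislabelled) points.
keep = value > 0
X_clean, y_clean = X[keep], y[keep]

# 3) Augmentation: jitter the retained points to enlarge the set.
X_aug = np.vstack([X_clean, X_clean + rng.normal(scale=0.05, size=X_clean.shape)])
y_aug = np.concatenate([y_clean, y_clean])
print(len(X), "->", len(X_aug), "examples after cleansing and augmentation")
```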

A Comprehensive Review of AI's Dependence on Data

IAEME PUBLICATION, 2024

AI relies heavily on data to function effectively, drawing upon vast datasets to train algorithms and optimize model performance. The relationship between AI and data is multifaceted, with theoretical frameworks emphasizing the critical role of high-quality data in AI development. Insufficient or biased data can significantly impact the outcomes of AI systems, highlighting the importance of data quality assurance processes. In the context of generative AI, data science plays a pivotal role in training and validating models, shaping their ability to generate realistic outputs. The integration of AI and data analytics offers valuable insights for businesses, enabling them to make informed decisions and drive innovation. Moving forward, further research into AI's dependence on data and its implications is crucial for advancing both theoretical understanding and practical applications in various domains.

Use of Synthetic Data to Train AI Models

Policy Guideline produced by the United Nations University, 2024

Using synthetic or artificially generated data in training Artificial Intelligence (AI) algorithms is a burgeoning practice with significant potential to affect society directly. It can address data scarcity, privacy, and bias issues but raises concerns about data quality, security, and ethical implications. While some systems use only synthetic data, in most cases synthetic data are used together with real-world data to train AI models. Our recommendations in this document apply to any system where some synthetic data are used. The use of synthetic data has the potential to enhance existing data and allow for more efficient and inclusive practices and policies. However, we cannot assume synthetic data to be automatically better than, or even equivalent to, data from the physical world. There are many risks to using synthetic data, including cybersecurity risks, bias propagation, and increased model error. This document sets out recommendations for the responsible use of synthetic data in AI training.

DC-Check: A Data-Centric AI checklist to guide the development of reliable machine learning systems

Cornell University - arXiv, 2022

While there have been a number of remarkable breakthroughs in machine learning (ML), much of the focus has been placed on model development. However, to truly realize the potential of machine learning in real-world settings, additional aspects must be considered across the ML pipeline. Data-centric AI is emerging as a unifying paradigm that could enable such reliable end-to-end pipelines. However, this remains a nascent area with no standardized framework to guide practitioners to the necessary data-centric considerations or to communicate the design of data-centric ML systems. To address this gap, we propose DC-Check, an actionable checklist-style framework to elicit data-centric considerations at different stages of the ML pipeline: Data, Training, Testing, and Deployment. This data-centric lens aims to promote thoughtfulness and transparency prior to system development. Additionally, we highlight specific data-centric AI challenges and research opportunities. DC-Check is aimed at both practitioners and researchers to guide day-to-day development. As such, to make it easy to engage with and use DC-Check and the associated resources, we provide a DC-Check companion website (https://www.vanderschaar-lab.com/dc-check/). The website will also serve as an updated resource as methods and tooling evolve over time.
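A checklist-style framework of this kind can be encoded directly in code. The sketch below is a hypothetical rendering: the four stage names follow the abstract, but the individual check items are invented examples, not the official DC-Check contents.

```python
from dataclasses import dataclass, field

# Hypothetical encoding of a checklist-style framework such as DC-Check.
# Stage names follow the abstract; the check items are invented examples.

@dataclass
class Stage:
    name: str
    checks: list = field(default_factory=list)
    done: set = field(default_factory=set)   # indices of completed checks

    def report(self):
        for i, check in enumerate(self.checks):
            mark = "x" if i in self.done else " "
            print(f"[{mark}] {self.name}: {check}")

pipeline = [
    Stage("Data", ["Sources documented?", "Label quality audited?"]),
    Stage("Training", ["Data augmentation choices justified?"]),
    Stage("Testing", ["Subgroup performance evaluated?"]),
    Stage("Deployment", ["Data drift monitoring in place?"]),
]
pipeline[0].done.add(0)   # mark one Data-stage check as complete
for stage in pipeline:
    stage.report()
```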

AI System Engineering—Key Challenges and Lessons Learned

Machine Learning and Knowledge Extraction

The main challenges are discussed together with the lessons learned from past and ongoing research along the development cycle of machine learning systems. This is done by taking into account the intrinsic conditions of today's deep learning models, data and software quality issues, and human-centered artificial intelligence (AI) postulates, including confidentiality and ethical aspects. The analysis outlines a fundamental theory-practice gap that superimposes the challenges of AI system engineering at the levels of data quality assurance, model building, software engineering, and deployment. The aim of this paper is to pinpoint research topics to explore approaches to address these challenges.

Massive Exploration of Pseudo Data for Grammatical Error Correction

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Collecting a large amount of training data for grammatical error correction (GEC) models has been an ongoing challenge in the field. Recently, it has become common to use data-demanding deep neural models such as encoder-decoders for GEC; thus, tackling the problem of data collection has become increasingly important. The incorporation of pseudo data into the training of GEC models is one of the main approaches to mitigating data scarcity. However, a consensus is lacking on experimental configurations, namely (i) the methods for generating pseudo data, (ii) the seed corpora used as the source of the pseudo data, and (iii) the means of optimizing the model. In this study, these configurations are thoroughly explored through a massive number of experiments, with the aim of providing an improved understanding of pseudo data. Our main experimental finding is that pretraining a model with pseudo data generated by a back-translation-based method is the most effective approach. Our findings are supported by the achievement of state-of-the-art performance on multiple benchmark test sets (the CoNLL-2014 test set and the official test set of the BEA-2019 shared task) without requiring any modifications to the model architecture. We also perform an in-depth analysis of our model with respect to the grammatical error type and proficiency level of the text. Finally, we suggest future directions for further improving model performance.
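The back-translation-based generation scheme can be sketched in a few lines: a reverse (clean-to-errorful) model turns a clean seed corpus into (noisy, clean) pairs for pretraining. In the sketch below, `reverse_model` is a hypothetical stub standing in for the learned reverse translator the paper describes.

```python
# Minimal sketch of back-translation-style pseudo data generation for GEC.
# `reverse_model` is a stand-in: the paper uses a trained clean->errorful
# model; this stub merely injects a single spelling error for illustration.

def reverse_model(clean_sentence: str) -> str:
    """Hypothetical stand-in for a learned clean-to-errorful translator."""
    return clean_sentence.replace(" the ", " teh ", 1)

seed_corpus = [
    "She walked to the store yesterday.",
    "They have finished the report on time.",
]

# Each (noisy, clean) pair becomes a pretraining example for the GEC model.
pseudo_pairs = [(reverse_model(s), s) for s in seed_corpus]
for noisy, clean in pseudo_pairs:
    print(f"src: {noisy!r} -> tgt: {clean!r}")
```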