Uncovering the Risks and Drawbacks Associated With the Use of Synthetic Data for Grammatical Error Correction

Synthetic Alone: Exploring the Dark Side of Synthetic Data for Grammatical Error Correction

arXiv, 2023

The data-centric AI approach aims to enhance model performance without modifying the model itself and has been shown to impact model performance positively. While recent attention has been given to data-centric AI based on synthetic data, due to its potential for performance improvement, data-centric AI has long been validated exclusively using real-world data and publicly available benchmark datasets. In this respect, data-centric AI still depends heavily on real-world data, and the verification of models using synthetic data has not yet been thoroughly carried out. Given the challenges above, we ask the question: "Does data quality control (noise injection and balanced data), a data-centric AI methodology acclaimed to have a positive impact, exhibit the same positive impact in models trained solely with synthetic data?" To address this question, we conducted comparative analyses between models trained on synthetic and real-world data for the grammatical error correction (GEC) task. Our experimental results reveal that the data quality control method has a positive impact on models trained with real-world data, as previously reported in existing studies, while a negative impact is observed in models trained solely on synthetic data.
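As a rough illustration of the noise-injection idea the abstract refers to, the sketch below corrupts clean sentences into (noisy, clean) pseudo pairs for GEC training. The corruption operations and the probability `p` are illustrative assumptions, not the paper's actual configuration.

```python
import random

# Illustrative noise-injection corruptor for building GEC pseudo data.
# The operations (drop / duplicate / reverse) and the probability p are
# assumptions for demonstration, not the paper's configuration.

def corrupt(tokens, p=0.1, rng=random):
    """Corrupt a clean token list to form a (noisy, clean) training pair."""
    noisy = []
    for tok in tokens:
        r = rng.random()
        if r < p / 3:
            continue                      # deletion noise
        elif r < 2 * p / 3:
            noisy.extend([tok, tok])      # duplication noise
        elif r < p:
            noisy.append(tok[::-1])       # character reversal as spelling noise
        else:
            noisy.append(tok)             # keep the token unchanged
    return noisy

clean = "the cat sat on the mat".split()
print(corrupt(clean), "->", clean)  # pseudo GEC pair: noisy source -> clean target
```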

Data-centric AI: Perspectives and Challenges

SDM, 2023

The role of data in building AI systems has recently been significantly magnified by the emerging concept of data-centric AI (DCAI), which advocates a fundamental shift from model advancements to ensuring data quality and reliability. Although our community has continuously invested effort into enhancing data in different aspects, these efforts are often isolated initiatives on specific tasks. To facilitate the collective initiative in our community and push DCAI forward, we draw a big picture and bring together three general missions: training data development, inference data development, and data maintenance. We provide a top-level discussion of representative DCAI tasks and share perspectives. Finally, we list open challenges. More resources are summarized at https://github.com/daochenzha/data-centric-AI

Data-centric Artificial Intelligence: A Survey

Artificial Intelligence (AI) is making a profound impact in almost every domain. A vital enabler of its great success is the availability of abundant and high-quality data for building machine learning models. Recently, the role of data in AI has been significantly magnified, giving rise to the emerging concept of data-centric AI. The attention of researchers and practitioners has gradually shifted from advancing model design to enhancing the quality and quantity of the data. In this survey, we discuss the necessity of data-centric AI, followed by a holistic view of three general data-centric goals (training data development, inference data development, and data maintenance) and the representative methods. We also organize the existing literature from automation and collaboration perspectives, discuss the challenges, and tabulate the benchmarks for various tasks. We believe this is the first comprehensive survey that provides a global view of a spectrum of tasks across various stages of the data lifecycle. We hope it can help the readers efficiently grasp a broad picture of this field, and equip them with the techniques and further research ideas to systematically engineer data for building AI systems. A companion list of data-centric AI resources will be regularly updated on https://github.com/daochenzha/data-centric-AI

DataPerf: Benchmarks for Data-Centric AI Development

arXiv, 2022

Machine learning (ML) research has generally focused on models, while the most prominent datasets have been employed for everyday ML tasks without regard for the breadth, difficulty, and faithfulness of these datasets to the underlying problem. Neglecting the fundamental importance of datasets has caused major problems involving data cascades in real-world applications and saturation of dataset-driven criteria for model quality, hindering research growth. To solve this problem, we present DataPerf, a benchmark package for evaluating ML datasets and dataset-working algorithms. We intend it to enable the "data ratchet," in which training sets will aid in evaluating test sets on the same problems, and vice versa. Such a feedback-driven strategy will generate a virtuous loop that will accelerate the development of data-centric AI. The MLCommons Association will maintain DataPerf.
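The "data ratchet" idea lends itself to a simple illustration: hold the model and the test set fixed and score competing training sets. The sketch below is a hypothetical toy version of such a dataset benchmark loop; the candidate-set names and the scoring protocol are assumptions, not DataPerf's actual harness.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy dataset-benchmark loop: a fixed model and test set score
# competing training sets. Dataset names are illustrative.
X, y = make_classification(n_samples=2000, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

candidate_training_sets = {
    "full_pool": (X_pool, y_pool),
    "first_half": (X_pool[:700], y_pool[:700]),
}

for name, (X_tr, y_tr) in candidate_training_sets.items():
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(name, model.score(X_test, y_test))  # higher score = better training set
```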

Augment & Valuate: A Data Enhancement Pipeline for Data-Centric AI

arXiv, 2021

Data scarcity and noise are important issues in industrial applications of machine learning. However, it is often challenging to devise a scalable and generalized approach to address the fundamental distributional and semantic properties of a dataset with black-box models. For this reason, data-centric approaches are crucial for the automation of the machine learning operations pipeline. To serve as the basis for this automation, we suggest a domain-agnostic pipeline for refining the quality of data in image classification problems. This pipeline comprises data valuation, cleansing, and augmentation. With an appropriate combination of these methods, we achieved 84.711% test accuracy (ranked #6, Honorable Mention in the Most Innovative category) in the Data-Centric AI competition using only the provided dataset.
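To make the valuate-cleanse-augment pattern concrete, here is a minimal toy sketch. The KNN-based value proxy and the jitter augmentation are stand-in assumptions for demonstration, not the authors' actual pipeline.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy valuate -> cleanse -> augment pipeline on synthetic features.
# The KNN agreement score and Gaussian jitter are illustrative proxies.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))
y = (X[:, 0] > 0).astype(int)
y[:25] = 1 - y[:25]  # simulate label noise

# 1) Valuation: score each point by agreement with its neighbours.
knn = KNeighborsClassifier(n_neighbors=10).fit(X, y)
value = (knn.predict(X) == y).astype(float)

# 2) Cleansing: drop the lowest-value (likely mislabelled) points.
keep = value > 0
X_clean, y_clean = X[keep], y[keep]

# 3) Augmentation: jitter the retained points to enlarge the set.
X_aug = np.vstack([X_clean, X_clean + rng.normal(scale=0.05, size=X_clean.shape)])
y_aug = np.concatenate([y_clean, y_clean])
print(len(X), "->", len(X_aug), "examples after cleansing and augmentation")
```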

A Comprehensive Review of AI's Dependence on Data

IAEME PUBLICATION, 2024

AI relies heavily on data to function effectively, drawing upon vast datasets to train algorithms and optimize model performance. The relationship between AI and data is multifaceted, with theoretical frameworks emphasizing the critical role of high-quality data in AI development. Insufficient or biased data can significantly impact the outcomes of AI systems, highlighting the importance of data quality assurance processes. In the context of generative AI, data science plays a pivotal role in training and validating models, shaping their ability to generate realistic outputs. The integration of AI and data analytics offers valuable insights for businesses, enabling them to make informed decisions and drive innovation. Moving forward, further research into AI's dependence on data and its implications is crucial for advancing both theoretical understanding and practical applications in various domains.

Use of Synthetic Data to Train AI Models

Policy Guideline produced by the United Nations University, 2024

Using synthetic or artificially generated data in training Artificial Intelligence (AI) algorithms is a burgeoning practice with significant potential to affect society directly. It can address data scarcity, privacy, and bias issues but raises concerns about data quality, security, and ethical implications. While some systems use only synthetic data, in most cases synthetic data are used together with real-world data to train AI models. Our recommendations in this document apply to any system where some synthetic data are used. The use of synthetic data has the potential to enhance existing data and allow for more efficient and inclusive practices and policies. However, we cannot assume synthetic data to be automatically better than, or even equivalent to, data from the physical world. There are many risks to using synthetic data, including cybersecurity risks, bias propagation, and increased model error. This document sets out recommendations for the responsible use of synthetic data in AI training.

DC-Check: A Data-Centric AI checklist to guide the development of reliable machine learning systems

Cornell University - arXiv, 2022

While there have been a number of remarkable breakthroughs in machine learning (ML), much of the focus has been placed on model development. However, to truly realize the potential of machine learning in real-world settings, additional aspects must be considered across the ML pipeline. Data-centric AI is emerging as a unifying paradigm that could enable such reliable end-to-end pipelines. However, this remains a nascent area with no standardized framework to guide practitioners to the necessary data-centric considerations or to communicate the design of data-centric ML systems. To address this gap, we propose DC-Check, an actionable checklist-style framework to elicit data-centric considerations at different stages of the ML pipeline: Data, Training, Testing, and Deployment. This data-centric lens aims to promote thoughtfulness and transparency prior to system development. Additionally, we highlight specific data-centric AI challenges and research opportunities. DC-Check is aimed at both practitioners and researchers to guide day-to-day development. As such, to make it easy to engage with and use DC-Check and the associated resources, we provide a DC-Check companion website (https://www.vanderschaar-lab.com/dc-check/). The website will also serve as an updated resource as methods and tooling evolve over time.
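A checklist-style framework of this kind can be encoded directly in code. The sketch below is a hypothetical rendering: the four stage names follow the abstract, but the individual check items are invented examples, not the official DC-Check contents.

```python
from dataclasses import dataclass, field

# Hypothetical encoding of a checklist-style framework such as DC-Check.
# Stage names follow the abstract; the check items are invented examples.

@dataclass
class Stage:
    name: str
    checks: list = field(default_factory=list)
    done: set = field(default_factory=set)   # indices of completed checks

    def report(self):
        for i, check in enumerate(self.checks):
            mark = "x" if i in self.done else " "
            print(f"[{mark}] {self.name}: {check}")

pipeline = [
    Stage("Data", ["Sources documented?", "Label quality audited?"]),
    Stage("Training", ["Data augmentation choices justified?"]),
    Stage("Testing", ["Subgroup performance evaluated?"]),
    Stage("Deployment", ["Data drift monitoring in place?"]),
]
pipeline[0].done.add(0)   # mark one Data-stage check as complete
for stage in pipeline:
    stage.report()
```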

AI System Engineering—Key Challenges and Lessons Learned

Machine Learning and Knowledge Extraction

The main challenges are discussed together with the lessons learned from past and ongoing research along the development cycle of machine learning systems. This is done by taking into account the intrinsic conditions of today's deep learning models, data and software quality issues, and human-centered artificial intelligence (AI) postulates, including confidentiality and ethical aspects. The analysis outlines a fundamental theory-practice gap that superimposes the challenges of AI system engineering at the levels of data quality assurance, model building, software engineering, and deployment. The aim of this paper is to pinpoint research topics to explore approaches to address these challenges.

Massive Exploration of Pseudo Data for Grammatical Error Correction

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Collecting a large amount of training data for grammatical error correction (GEC) models has been an ongoing challenge in the field. Recently, it has become common to use data-demanding deep neural models such as encoder-decoders for GEC; thus, tackling the problem of data collection has become increasingly important. The incorporation of pseudo data into the training of GEC models is one of the main approaches to mitigating data scarcity. However, a consensus is lacking on experimental configurations, namely (i) the methods for generating pseudo data, (ii) the seed corpora used as the source of the pseudo data, and (iii) the means of optimizing the model. In this study, these configurations are thoroughly explored through a massive number of experiments, with the aim of providing an improved understanding of pseudo data. Our main experimental finding is that pretraining a model with pseudo data generated by a back-translation-based method is the most effective approach. Our findings are supported by the achievement of state-of-the-art performance on multiple benchmark test sets (the CoNLL-2014 test set and the official test set of the BEA-2019 shared task) without requiring any modifications to the model architecture. We also perform an in-depth analysis of our model with respect to the grammatical error type and proficiency level of the text. Finally, we suggest future directions for further improving model performance.
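The back-translation-based generation scheme can be sketched in a few lines: a reverse (clean-to-errorful) model turns a clean seed corpus into (noisy, clean) pairs for pretraining. In the sketch below, `reverse_model` is a hypothetical stub standing in for the learned reverse translator the paper describes.

```python
# Minimal sketch of back-translation-style pseudo data generation for GEC.
# `reverse_model` is a stand-in: the paper uses a trained clean->errorful
# model; this stub merely injects a single spelling error for illustration.

def reverse_model(clean_sentence: str) -> str:
    """Hypothetical stand-in for a learned clean-to-errorful translator."""
    return clean_sentence.replace(" the ", " teh ", 1)

seed_corpus = [
    "She walked to the store yesterday.",
    "They have finished the report on time.",
]

# Each (noisy, clean) pair becomes a pretraining example for the GEC model.
pseudo_pairs = [(reverse_model(s), s) for s in seed_corpus]
for noisy, clean in pseudo_pairs:
    print(f"src: {noisy!r} -> tgt: {clean!r}")
```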