Multi-Dialect Speech Corpus Creation for Enhancing Tamil Automatic Speech Recognition
Abstract
Voice assistance systems are now widely used. Automatic Speech Recognition (ASR) technology lets a computer recognize spoken words and transcribe them into text. Conversational systems have found many applications, such as operating equipment through speech, accessing maps for hands-free driving, and answering queries for information retrieval. Like human-to-human communication, human–machine communication should ideally be in spoken form to ensure accessibility and usability, and it should accommodate dialect variation so that a broad population can use speech-enabled applications seamlessly. In general, however, the spoken form of a language differs from its written form, and borrowed or unique words from other languages make advanced ASR systems harder to develop. Emerging evidence suggests that ASR performance can vary significantly across demographic groups, with certain subpopulations facing considerable hurdles in using these technologies effectively. There remains a pressing need for specialized systems that accurately recognize speech in regional languages, particularly to serve underrepresented and underserved populations. Typically, only native speakers can accurately render these spoken forms, which are often not documented in written text. Tamil is a prime example of such a language, with numerous dialects spoken across the regions of Tamil Nadu. Digital texts and labeled dialect speech data in Tamil remain scarce, and collecting labeled dialect speech data is demanding and time-consuming.
This paper addresses that gap by presenting the development and evaluation of a real-time, multi-dialect ASR system tailored to Tamil, a regional language with unique characteristics that pose distinct challenges for conventional ASR technologies. To this end, we collected dialect-specific Tamil speech data from the southern, northern, western, and central regions of Tamil Nadu and, using open-source pre-trained ASR models, developed a proof-of-concept ASR system. Our data set comprises 24 h and 27 min of Tamil dialect speech spoken by 240 individuals. The corpus supports the training and evaluation of the proposed multi-dialect ASR system. We anticipate that our approach will improve opportunities for developing systems that accurately recognize and interact through spoken language across a diverse range of dialects, speakers, and environmental conditions.
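The abstract does not say which metric is used to evaluate the ASR system. Word error rate (WER) is the usual choice for ASR, so the sketch below assumes it. It computes WER as the word-level edit distance between a reference transcript and a hypothesis, divided by the reference length. The function name and the example sentences are illustrative and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Illustrative pair (an assumption, not corpus data): a colloquial reference
# against a hypothesis that produced the literary verb form instead.
colloquial = "நான் வீட்டுக்கு போறேன்"
literary = "நான் வீட்டுக்கு போகிறேன்"
print(round(word_error_rate(colloquial, literary), 3))
```

Because a dialectal form counts as a full word substitution, a model trained mostly on literary text can score poorly on colloquial speech even when the meaning is preserved, which is one reason dialect-labeled data matters for evaluation.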
Data availability
Data will be made available on request.
Notes
- Amrrs/wav2vec2-large-xlsr-53-tamil, Rajaram1996/wav2vec2-large-xlsr-53-tamil.
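The two identifiers in this note look like Hugging Face Hub checkpoints. Assuming they are the open-source pre-trained models referred to in the abstract, a minimal sketch of transcribing one audio file with the `transformers` ASR pipeline could look like the following. The helper name `transcribe` and the file path in the usage comment are hypothetical, and the paper's actual inference setup is not described on this page.

```python
# Checkpoint names copied from the note above. Treating them as Hugging Face
# Hub model IDs is an assumption.
MODEL_IDS = [
    "Amrrs/wav2vec2-large-xlsr-53-tamil",
    "Rajaram1996/wav2vec2-large-xlsr-53-tamil",
]


def transcribe(audio_path: str, model_id: str = MODEL_IDS[0]) -> str:
    """Transcribe one audio file with a pre-trained wav2vec2 checkpoint.

    Needs the `transformers` and `torch` packages, ffmpeg for decoding audio
    files, and network access the first time the checkpoint is downloaded.
    """
    # Imported lazily so the module can be loaded without the heavy dependency.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=model_id)
    return asr(audio_path)["text"]


# Hypothetical usage; the path is a placeholder, not a file from the corpus:
# print(transcribe("sample_utterance.wav"))
```

Running both checkpoints on the same dialect recordings and comparing their transcripts is one simple way to see how well models trained on other Tamil data carry over to regional speech.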
Acknowledgements
The authors express their gratitude to the Tamil Nadu Virtual Academy (TVA/TC-EOI/2022/006-1) for providing funding for this project. The dedication and contributions of the numerous volunteers were instrumental in making the creation of the Tamil dialect speech corpus achievable. We extend our sincere appreciation to all the individuals involved, including workers, students, research scholars, lab assistants, faculty members within our institution, and external volunteers, for generously offering their time and efforts to support this endeavour.
Author information
Authors and Affiliations
- Department of Computer Science and Engineering, Sri Sivasubramaniya Nadar College of Engineering, Chennai, 603110, Tamil Nadu, India: B. Bharathi & S. Saranya
- Department of Electronics and Communication Engineering, Sri Sivasubramaniya Nadar College of Engineering, Chennai, 603110, Tamil Nadu, India: P. Vijayalakshmi
- Department of Computer Science and Engineering, Shiv Nadar University, Chennai, 603110, Tamil Nadu, India: T. Nagarajan
Corresponding author
Correspondence to B. Bharathi.
Ethics declarations
Conflict of interest
The authors declare that they do not have any conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Bharathi, B., Saranya, S., Vijayalakshmi, P. et al. Multi-Dialect Speech Corpus Creation for Enhancing Tamil Automatic Speech Recognition. Circuits Syst Signal Process 44, 9101–9119 (2025). https://doi.org/10.1007/s00034-025-03181-y
- Received: 18 January 2025
- Revised: 06 May 2025
- Accepted: 06 May 2025
- Published: 10 July 2025
- Version of record: 10 July 2025
- Issue date: December 2025
- DOI: https://doi.org/10.1007/s00034-025-03181-y