Multi-Dialect Speech Corpus Creation for Enhancing Tamil Automatic Speech Recognition

Abstract

Voice assistance systems are now widely used. Automatic Speech Recognition (ASR) technology lets a computer recognize spoken words and transcribe them into text. Conversational systems have found many applications, such as operating equipment by voice, accessing maps for hands-free driving, and query response systems for information retrieval. Like human-to-human communication, human–machine communication should ideally be in spoken form to ensure accessibility and usability, and it should accommodate dialect variation so that a broad population can use speech-enabled applications seamlessly. In general, however, the spoken form of a language differs from its written form. Spoken language is used worldwide, but borrowed or unique words from other languages pose challenges to developing advanced ASR systems. Typically, only native speakers can accurately render these spoken forms, which are often not documented in written text. Emerging evidence also suggests that the performance of ASR systems can vary significantly across demographic groups, with certain subpopulations facing considerable hurdles in using these technologies effectively. There remains a pressing need for specialized systems that accurately recognize speech in regional languages, particularly to serve underrepresented and underserved populations.
Tamil is a prime example of such a language, with numerous dialects spoken across the regions of Tamil Nadu. Digital texts and labeled dialect speech data in Tamil remain scarce, and collecting labeled dialect speech data is demanding and time-consuming. This paper addresses that gap by presenting the development and evaluation of a real-time, multi-dialect ASR system tailored to Tamil, a language whose regional dialects have unique characteristics that pose distinct challenges for conventional ASR technologies. To this end, we collected dialect-specific Tamil speech data from the southern, northern, western, and central regions of Tamil Nadu and, using open-source pre-trained ASR models, developed a proof-of-concept ASR system. Our data set comprises 24 h and 27 min of Tamil dialect speech spoken by 240 individuals. The corpus supports the training and evaluation of the proposed multi-dialect ASR system. We anticipate that our approach will improve opportunities for developing systems capable of accurately recognizing and interacting through spoken language across a diverse range of dialects, speakers, and environmental conditions.
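To give a sense of the corpus scale, the two figures stated in the abstract (24 h 27 min of speech, 240 speakers) can be turned into a mean recording time per speaker. The short Python sketch below does only that arithmetic. It assumes nothing about how speech is actually distributed across speakers or regions, which the abstract does not report; the function name is illustrative.

```python
# Figures taken from the abstract; everything else here is illustrative.
TOTAL_HOURS = 24
TOTAL_MINUTES = 27
NUM_SPEAKERS = 240


def average_seconds_per_speaker(hours: int, minutes: int, speakers: int) -> float:
    """Return the mean recording time per speaker, in seconds."""
    total_seconds = hours * 3600 + minutes * 60
    return total_seconds / speakers


avg = average_seconds_per_speaker(TOTAL_HOURS, TOTAL_MINUTES, NUM_SPEAKERS)
whole_minutes, seconds = divmod(avg, 60)
print(f"{int(whole_minutes)} min {seconds:.2f} s per speaker on average")
# prints "6 min 6.75 s per speaker on average"
```

This is a simple mean (88,020 s divided by 240 speakers); individual contributions may be much shorter or longer.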

Data availability

Data will be made available on request.

Notes

  1. Amrrs/wav2vec2-large-xlsr-53-tamil, Rajaram1996/wav2vec2-large-xlsr-53-tamil.
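The checkpoints named in the note above are pre-trained Tamil wav2vec2 models. The text available here does not state which metric is used to evaluate them, so the following is only a sketch under the assumption that word error rate (WER), a common ASR metric, is of interest. The function name and the placeholder tokens are hypothetical, and no real transcripts or results are implied.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumed metric for illustration only; words are split on whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens, not real transcripts: one substitution and one
# deletion against a four-word reference.
print(word_error_rate("w1 w2 w3 w4", "w1 wX w3"))  # prints 0.5
```

Because dialect speech often lacks a standard written form, the choice of reference transcription strongly affects any such score.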

Acknowledgements

The authors express their gratitude to the Tamil Nadu Virtual Academy (TVA/TC-EOI/2022/006-1) for providing funding for this project. The dedication and contributions of the numerous volunteers were instrumental in making the creation of the Tamil dialect speech corpus achievable. We extend our sincere appreciation to all the individuals involved, including workers, students, research scholars, lab assistants, faculty members within our institution, and external volunteers, for generously offering their time and efforts to support this endeavour.

Author information

Authors and Affiliations

  1. Department of Computer Science and Engineering, Sri Sivasubramaniya Nadar College of Engineering, Chennai, 603110, Tamil Nadu, India
    B. Bharathi & S. Saranya
  2. Department of Electronics and Communication Engineering, Sri Sivasubramaniya Nadar College of Engineering, Chennai, 603110, Tamil Nadu, India
    P. Vijayalakshmi
  3. Department of Computer Science and Engineering, Shiv Nadar University, Chennai, 603110, Tamil Nadu, India
    T. Nagarajan

Authors

  1. B. Bharathi
  2. S. Saranya
  3. P. Vijayalakshmi
  4. T. Nagarajan

Corresponding author

Correspondence to B. Bharathi.

Ethics declarations

Conflict of interest

The authors declare that they do not have any conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Bharathi, B., Saranya, S., Vijayalakshmi, P. et al. Multi-Dialect Speech Corpus Creation for Enhancing Tamil Automatic Speech Recognition. Circuits Syst Signal Process 44, 9101–9119 (2025). https://doi.org/10.1007/s00034-025-03181-y
