Natural Human Voice Text To Speech Engine (Linnet), for the degree of MSc Applied AI & Data Science

Text to Speech Synthesis: A Systematic Review, Deep Learning Based Architecture and Future Research Direction

Journal of Advances in Information Technology

Text to Speech (TTS) synthesis is the process of translating natural language text into speech. Synthesized speech can be generated from pieces of recorded speech, which are stored in a database. A speech synthesizer's output is judged by its resemblance to the human voice and by its intelligibility. In recent years, of the two main subfields of Artificial Intelligence (AI), machine learning and deep learning, deep learning has achieved huge success in the domain of text to speech synthesis. This literature introduces a taxonomy of the deep learning-based architectures and models popularly used in speech synthesis, which fall into two broad classes: autoregressive (AR) and non-autoregressive (NAR). The autoregressive models and architectures are discussed in Section III, and the non-autoregressive models are summarized in a table in the last part of Section III. Different datasets used in TTS are also discussed, and some of the widely used evaluation metrics for judging the quality of synthesized speech are described. Finally, the paper concludes with the challenges and future directions of text-to-speech synthesis systems. Index Terms: Text to Speech (TTS), deep learning, acoustic features, parametric synthesis, concatenative synthesis, text analysis, autoregressive (AR), non-autoregressive (NAR).
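To make the AR/NAR distinction in the survey's taxonomy concrete, here is a minimal, hypothetical sketch contrasting the two decoding styles; the module names, the GRU/convolution choices, and the 80-dimensional frame size are illustrative assumptions, not any surveyed model's design:

```python
import torch
import torch.nn as nn

class AutoregressiveDecoder(nn.Module):
    """Generates one acoustic frame at a time, conditioned on its own past output."""
    def __init__(self, dim=80):
        super().__init__()
        self.dim = dim
        self.rnn = nn.GRUCell(dim, dim)
        self.out = nn.Linear(dim, dim)

    def generate(self, steps):
        frame = torch.zeros(1, self.dim)  # start from a silent frame
        h = torch.zeros(1, self.dim)
        frames = []
        for _ in range(steps):            # O(T) dependent steps: slow but flexible
            h = self.rnn(frame, h)
            frame = self.out(h)
            frames.append(frame)
        return torch.stack(frames, dim=1)

class NonAutoregressiveDecoder(nn.Module):
    """Predicts all frames in parallel from an (upsampled) text encoding."""
    def __init__(self, dim=80):
        super().__init__()
        self.net = nn.Conv1d(dim, dim, kernel_size=5, padding=2)

    def generate(self, text_encoding):    # (batch, dim, T)
        return self.net(text_encoding)    # one parallel pass over all T frames

ar, nar = AutoregressiveDecoder(), NonAutoregressiveDecoder()
print(ar.generate(10).shape)                       # torch.Size([1, 10, 80])
print(nar.generate(torch.randn(1, 80, 10)).shape)  # torch.Size([1, 80, 10])
```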

WaveNet-Based Speech Synthesis Applied to Czech

Text, Speech, and Dialogue, 2018

WaveNet is a recently developed deep neural network for generating high-quality synthetic speech; it produces raw audio samples directly. This paper describes the first application of WaveNet-based speech synthesis to the Czech language. We used the basic WaveNet architecture. The durations of particular phones and the fundamental frequency required for local conditioning were estimated by additional LSTM networks. We conducted a MUSHRA listening test to compare WaveNet with two traditional synthesis methods: unit selection and HMM-based synthesis. Experiments were performed on four large speech corpora. Though our implementation of WaveNet did not outperform the unit selection method as reported in other studies, there is still much scope for improvement, while unit selection TTS has probably reached its quality limit.
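The core of the basic WaveNet architecture the authors build on is a stack of dilated causal convolutions over raw audio; a toy sketch of that building block follows, where the channel count, eight-layer stack, and kernel size are illustrative assumptions rather than the paper's configuration:

```python
import torch
import torch.nn as nn

class DilatedCausalConv(nn.Module):
    """One WaveNet-style layer: left-padded so output at time t sees only inputs <= t."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.pad = (2 - 1) * dilation  # kernel_size = 2
        self.conv = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)

    def forward(self, x):                            # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.pad, 0))      # pad on the left only => causal
        return self.conv(x)

# Doubling dilations (1, 2, 4, ...) grow the receptive field exponentially,
# which is how WaveNet covers enough past audio context to model raw samples.
stack = nn.Sequential(*[DilatedCausalConv(32, 2 ** i) for i in range(8)])
audio = torch.randn(1, 32, 16000)  # one second at 16 kHz, already embedded
print(stack(audio).shape)          # torch.Size([1, 32, 16000])
```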

Wasserstein GAN and Waveform Loss-Based Acoustic Model Training for Multi-Speaker Text-to-Speech Synthesis Systems Using a WaveNet Vocoder

IEEE Access, 2018

WaveNet, which learns directly from speech waveform samples, has been used as an alternative to vocoders and achieves very high-quality synthetic speech in terms of both naturalness and speaker similarity, even in multi-speaker text-to-speech synthesis systems. However, the WaveNet vocoder uses acoustic features as local condition parameters, and these parameters need to be accurately predicted by another acoustic model. It is not yet clear how best to train this acoustic model, which is problematic because the final quality of the synthetic speech is significantly affected by its performance. Significant degradation occurs especially when the predicted acoustic features have characteristics that are mismatched with natural ones. In order to reduce this mismatch between natural and generated acoustic features, we propose new frameworks that incorporate either a conditional generative adversarial network (GAN) or its variant, Wasserstein GAN with gradient penalty (WGAN-GP), into multi-speaker speech synthesis using the WaveNet vocoder. The GAN generator acts as the acoustic model, and its outputs are used as the local condition parameters of the WaveNet. We also extend the GAN frameworks by using the discretized-mixture-of-logistics (DML) loss of a well-trained WaveNet, in addition to mean squared error and adversarial losses, as parts of the objective function. Experimental results show that acoustic models trained with the WGAN-GP framework and back-propagated DML loss achieve the highest subjective evaluation scores in terms of both quality and speaker similarity. INDEX TERMS: Generative adversarial network, multi-speaker modeling, speech synthesis, WaveNet.
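The gradient-penalty term that distinguishes WGAN-GP from a vanilla GAN can be sketched generically as follows; the critic, the (batch, frames, feature_dim) acoustic-feature shapes, and the standard lambda of 10 are assumptions for illustration, not the paper's code:

```python
import torch

def gradient_penalty(critic, real, fake, lambda_gp=10.0):
    """WGAN-GP: penalize deviation of the critic's gradient norm from 1,
    evaluated at random interpolates between real and generated features."""
    batch = real.size(0)
    eps = torch.rand(batch, 1, 1, device=real.device)  # broadcast over (frames, dims)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(interp)
    grads = torch.autograd.grad(outputs=scores.sum(), inputs=interp,
                                create_graph=True)[0]
    grad_norm = grads.view(batch, -1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1) ** 2).mean()

# Usage sketch with a toy linear critic over 100 frames of 80-dim features.
critic = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(100 * 80, 1))
real = torch.randn(4, 100, 80)
fake = torch.randn(4, 100, 80)
loss_gp = gradient_penalty(critic, real, fake)
loss_gp.backward()  # added alongside the adversarial + MSE (+ DML) objectives
```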

Text to Speech Synthesizer for Afaan Oromo Using Deep Neural Network Technique

International Journal of Computer Science and Information Security (IJCSIS), Vol. 22, No. 5, October, 2024

Text-to-speech synthesis systems are concerned with the artificial generation of a natural and intelligible human voice from given text transcriptions. Despite the potential applications of text-to-speech systems, it is a language-dependent discipline, and most attempts have focused on resource-rich languages, specifically English. Afaan Oromo is one of the under-resourced languages with a shortage of language resources for developing a text-to-speech system. In this study, a speech dataset containing 8076 text and audio pairs was collected and prepared from legitimate sources to develop a text-to-speech synthesizer for Afaan Oromo. Apart from standard words and names, the proposed model incorporates non-standard words including numbers, abbreviations, currency, and acronyms. A deep neural network was chosen for this work because it can map complex linguistic features to acoustic feature parameters. Several experiments were conducted to determine the best-performing model. The attention error test and the mean opinion score (MOS) test were used for the objective and subjective evaluations, respectively. According to the objective evaluation, Tacotron 2 made only 2 attention errors while Deep Voice 3 made 16, out of 148 words in the evaluation sentence list. In addition, Tacotron 2 achieved MOS results of 4.32 and 4.21 out of five, and Deep Voice 3 achieved 3.28 and 3.02, in terms of intelligibility and naturalness respectively. Consequently, the Tacotron 2 model provided an encouraging result, which makes it appropriate for a variety of applications such as recommendation systems, telephone inquiry services, and smart education.
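A mean opinion score like the 4.32 reported here is simply the mean of listeners' 1-to-5 ratings, usually reported with a confidence interval; a small sketch of that computation, using made-up ratings rather than the study's data:

```python
import math
import statistics

def mean_opinion_score(ratings):
    """MOS: mean of 1-5 listener ratings, with a normal-approximation 95% CI."""
    mos = statistics.mean(ratings)
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, 1.96 * sem

# Hypothetical ratings for one stimulus; the paper's figures would come
# from averaging real listener responses like these across many sentences.
ratings = [5, 4, 5, 4, 4, 5, 4, 3, 5, 4]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```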

An Implementation of Advanced NLP for High-Quality Text-To-Speech Synthesis

2021

In this paper, we utilize Bengali voice information (from Bangladesh and Kolkata) and convert it into text format. To build a speech-to-text conversion framework, one must provide two key components: an NLP (Natural Language Processing) stage, which works on the information in the input speech, and a text generation stage to produce the desired output. These two distinct levels must exchange both data and commands to supply the text. As the task relies on many distinct scientific areas, any achievement toward standardization can minimize the effort and improve the consistency of the results. Advances in communication technologies and AI (especially machine learning and deep learning) have led researchers to the convolutional neural network (CNN), which has attracted attention because of its high performance. Nonetheless, the most common issue with deep learning architectures such as CNN is that they require a large amount of training data. This paper gives an overview of the NLP...

Rendering Of Voice By Using Convolutional Neural Network And With The Help Of Text-To-Speech Module

IRJET, 2022

This paper describes a novel text-to-speech (TTS) technique based on deep convolutional neural networks (CNN), without any recurrent units. The recurrent neural network (RNN) has recently been a standard technique for modeling sequential data, and it is used in some cutting-edge neural TTS techniques. However, training an RNN component often requires a very powerful computer or a very long time, typically several days or weeks. Other recent studies, on the other hand, have shown that CNN-based sequence synthesis can be much faster than RNN-based techniques because of its high parallelizability. The objective of this paper is to present an alternative neural TTS system, based only on CNN, that can alleviate these economic costs of training. In our experiment, the proposed Deep Convolutional TTS could be sufficiently trained in one night (15 hours) using an ordinary gaming PC equipped with two GPUs, while the quality of the synthesized speech was almost acceptable.
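The speed claim rests on parallelizability: a convolution processes every timestep of a training utterance in one tensor operation, while an RNN must step through the sequence because each hidden state depends on the previous one. A toy comparison, with arbitrary assumed shapes:

```python
import torch
import torch.nn as nn

seq = torch.randn(8, 256, 512)  # (batch, channels, time): a batch of utterances

# CNN path: one op covers every timestep, so the GPU parallelizes freely.
conv = nn.Conv1d(256, 256, kernel_size=3, padding=1)
out_cnn = conv(seq)             # all 512 positions computed at once

# RNN path: each step depends on the previous hidden state, forcing a
# sequential loop of length T during training as well as inference.
rnn_cell = nn.GRUCell(256, 256)
h = torch.zeros(8, 256)
outputs = []
for t in range(seq.size(2)):    # 512 dependent steps
    h = rnn_cell(seq[:, :, t], h)
    outputs.append(h)
out_rnn = torch.stack(outputs, dim=2)

print(out_cnn.shape, out_rnn.shape)  # both torch.Size([8, 256, 512])
```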

Development and Evaluation of Speech Synthesis System Based on Deep Learning Models

Symmetry

This study concentrates on the investigation, development, and evaluation of text-to-speech synthesis systems based on deep learning models for the Azerbaijani language. We selected and compared state-of-the-art models, Tacotron and Deep Convolutional Text-to-Speech (DC TTS), to identify the most suitable model. Both systems were trained on a 24-hour speech dataset of the Azerbaijani language collected and processed from a news website. To analyze the quality and intelligibility of the speech signals produced by the two systems, 34 listeners participated in an online survey containing subjective evaluation tests. The results of the study indicated that, according to the Mean Opinion Score, Tacotron demonstrated better results for in-vocabulary words, whereas DC TTS showed higher performance on out-of-vocabulary word synthesis.

Prosody modelling using machine learning techniques for neutral and emotional speech synthesis

2000

Speech is the basic means of communication among people and, together with writing, is one of the two main means of exchanging opinions, views, ideas, knowledge, and culture. Nonetheless, oral communication is faster, more direct, and more convenient than written communication, making speech essential in all fields of human life and communication. The development of technology over the last decades has allowed the creation of systems, mainly focused on specific tasks, which are in a position to replace humans in those tasks. Since such systems are in direct communication and contact with humans, they must follow the way humans communicate with each other in order to be user-friendly. Such systems may be automatic call centres, information services (tourist information, cinema), financial services (banking), speech-based security control for access to rooms or buildings, smart houses and cars, etc. Speech technology is the field of science that deals with communication between human and machine (human-computer interaction, HCI) in the most natural way. In a typical spoken dialogue system, the speech signal from the human user is processed by an automatic speech recognizer (ASR), where it is converted to a sequence of recognized words; a natural language processing system, which is the heart of the dialogue system, then processes the input data and deals with the conversion of text to concepts. Speech synthesis is the artificial production of human-like speech. Speech synthesizers are also called text-to-speech (TTS) systems, since their task is to convert normal (or tagged) text to speech (Allen et al., 1987). Over the years, speech synthesis has been used in a great range of applications covering different fields of human life and human needs, such as helping people with visual impairment as screen readers, helping people with dyslexia or other reading difficulties as a learning tool, or producing various voices and speaking styles for games and animation in the entertainment area.

High Quality, Lightweight and Adaptable TTS Using LPCNet

Interspeech 2019

We present a lightweight, adaptable neural TTS system with high-quality output. The system is composed of three separate neural network blocks: prosody prediction, acoustic feature prediction, and Linear Prediction Coding Net (LPCNet) as a neural vocoder. The system can synthesize speech with close-to-natural quality while running 3 times faster than real-time on a standard CPU. The modular setup of the system allows for simple adaptation to new voices with a small amount of data. We first demonstrate the ability of the system to produce high-quality speech when trained on large, high-quality datasets. Following that, we demonstrate its adaptability by mimicking unseen voices using 5 to 20 minute long datasets with lower recording quality. Large-scale Mean Opinion Score quality and similarity tests are presented, showing that the system can adapt to unseen voices with a quality gap of 0.12 and a similarity gap of 3% compared to natural speech for male voices, and a quality gap of 0.35 and a similarity gap of 9% for female voices.
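LPCNet's efficiency comes from classic linear predictive coding: each sample is a weighted sum of previous samples plus a residual excitation, so the neural network only has to model the cheap-to-predict residual. A minimal NumPy sketch of that all-pole synthesis filter, with arbitrary illustrative coefficients rather than LPCNet's learned ones:

```python
import numpy as np

def lpc_synthesize(excitation, coeffs):
    """All-pole LPC synthesis: s[n] = sum_k a[k] * s[n-1-k] + e[n].
    LPCNet keeps this linear filter and uses a small neural net
    only to model the residual excitation e[n]."""
    order = len(coeffs)
    s = np.zeros(len(excitation) + order)  # leading zeros = initial filter state
    for n in range(len(excitation)):
        past = s[n:n + order][::-1]        # most recent output sample first
        s[n + order] = np.dot(coeffs, past) + excitation[n]
    return s[order:]

# Hypothetical 4th-order filter (sum of |coeffs| < 1, so it is stable),
# driven by white-noise excitation standing in for the residual.
a = np.array([0.5, 0.2, 0.1, 0.05])
e = np.random.randn(16000) * 0.01
audio = lpc_synthesize(e, a)
print(audio.shape)  # (16000,)
```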

Speech synthesis with neural networks

arXiv preprint cs/9811031, 1998

Text-to-speech conversion has traditionally been performed either by concatenating short samples of speech or by using rule-based systems to convert a phonetic representation of speech into an acoustic representation, which is then converted into speech. This paper describes a system that uses a time-delay neural network (TDNN) to perform this phonetic-to-acoustic mapping, with another neural network to control the timing of the generated speech. The neural network system requires less memory than a concatenation system, and performed well in tests comparing it to commercial systems using other technologies.
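A time-delay neural network applies the same weights across a sliding window of time frames, which in modern terms is a 1-D convolution over the sequence; a minimal sketch of the phonetic-to-acoustic mapping idea, where the feature dimensions and layer sizes are assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

# A TDNN layer = shared weights over a window of neighbouring frames,
# i.e. a Conv1d: each output frame sees a fixed context of input frames.
phonetic_dim, acoustic_dim, hidden = 60, 40, 128

tdnn = nn.Sequential(
    nn.Conv1d(phonetic_dim, hidden, kernel_size=5, padding=2),  # +/-2 frame context
    nn.Tanh(),
    nn.Conv1d(hidden, acoustic_dim, kernel_size=3, padding=1),  # +/-1 frame context
)

phones = torch.randn(1, phonetic_dim, 200)  # 200 frames of phonetic features
acoustic = tdnn(phones)                     # (1, 40, 200): acoustic parameters
print(acoustic.shape)
```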