A (Heavily Documented) TensorFlow Implementation of Tacotron: A Fully End-to-End Text-To-Speech Synthesis Model

Requirements

Data

We train the model on three different speech datasets.

  1. LJ Speech Dataset
  2. Nick Offerman's Audiobooks
  3. The World English Bible

The LJ Speech Dataset has recently become a widely used benchmark for TTS because it is publicly available. It contains 24 hours of reasonable-quality samples. Nick Offerman's audiobooks (18 hours) are used in addition to see whether the model can learn from less, and more variable, speech data. The World English Bible is a public-domain update of the American Standard Version of 1901 into modern English. Its original audio recordings are freely available here. Kyubyong manually split each chapter by verse and aligned the segmented audio clips to the text, yielding 72 hours in total. You can download them at Kaggle Datasets.

Training

Sample Synthesis

We generate speech samples from the Harvard Sentences, as the original paper does. The sentence list is already included in the repo.

Training Curve

Attention Plot

Generated Samples

Pretrained Files

Notes

Initially, at each decoding timestep the model predicted r non-sequential (interleaved) frames:
```
t    frame numbers
-----------------------
0    [ 0  3  6  9 12]
1    [ 1  4  7 10 13]
2    [ 2  5  8 11 14]
...
```

After much experimentation, we were unable to get the model to learn anything useful this way. We then switched to predicting r sequential frames at each decoding step:
```
t    frame numbers
-----------------------
0    [ 0  1  2  3  4]
1    [ 5  6  7  8  9]
2    [10 11 12 13 14]
...
```
With this setup we observed improved attention alignments and have kept it since.
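
To make the two groupings concrete, below is a minimal NumPy sketch (not code from this repo; the variable names, and the exact interleaving order shown above, are illustrative assumptions) that produces both frame groupings for 15 mel frames with r = 5:

```python
import numpy as np

r = 5                         # reduction factor: frames predicted per decoding step
num_frames = 15               # total mel frames (assumed divisible by r)
mels = np.arange(num_frames)  # stand-ins for mel frames; real frames are spectrogram vectors
num_steps = num_frames // r   # number of decoding steps

# Sequential grouping (the scheme we kept): step t predicts frames t*r .. t*r + r - 1.
sequential = mels.reshape(num_steps, r)
# [[ 0  1  2  3  4]
#  [ 5  6  7  8  9]
#  [10 11 12 13 14]]

# Interleaved grouping (the earlier attempt): step t predicts every
# num_steps-th frame starting at frame t.
interleaved = mels.reshape(r, num_steps).T
# [[ 0  3  6  9 12]
#  [ 1  4  7 10 13]
#  [ 2  5  8 11 14]]

print(sequential)
print(interleaved)
```

One plausible reading of the result is that sequential grouping makes each decoding step responsible for a contiguous chunk of audio, giving the attention a monotonic alignment target.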

Differences from the original paper

Papers that referenced this repo

Jan. 2018, Kyubyong Park & Tommy Mulc