Qiao Tian - Academia.edu
Papers by Qiao Tian
arXiv, 2022
Speech restoration aims to remove distortions in speech signals. Prior methods mainly focus on a single type of distortion, such as speech denoising or dereverberation. However, speech signals can be degraded by several different distortions simultaneously in the real world. It is thus important to extend speech restoration models to deal with multiple distortions. In this paper, we introduce VoiceFixer, a unified framework for high-fidelity speech restoration. VoiceFixer restores speech from multiple distortions (e.g., noise, reverberation, and clipping) and can expand degraded speech (e.g., noisy speech) with a low bandwidth to 44.1 kHz full-bandwidth high-fidelity speech. We design VoiceFixer based on (1) an analysis stage that predicts intermediate-level features from the degraded speech, and (2) a synthesis stage that generates waveforms using a neural vocoder. Both objective and subjective evaluations show that VoiceFixer is effective on severely degraded speech, such as real-world speech recordings.
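To make the analysis/synthesis split concrete, here is a minimal sketch of the two-stage idea, not the authors' implementation: an analysis model maps degraded mel spectrograms to restored ones, and a stand-in neural vocoder then renders the waveform. All module names, layer choices, and dimensions are illustrative assumptions.

```python
# Minimal sketch of a two-stage restoration pipeline (hypothetical code,
# not the VoiceFixer implementation).
import torch
import torch.nn as nn

class AnalysisStage(nn.Module):
    """Predicts clean mel spectrograms from degraded ones."""
    def __init__(self, n_mels: int = 128, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, n_mels, kernel_size=3, padding=1),
        )

    def forward(self, degraded_mel: torch.Tensor) -> torch.Tensor:
        # degraded_mel: (batch, n_mels, frames)
        return self.net(degraded_mel)

class SynthesisStage(nn.Module):
    """Stand-in for a neural vocoder that upsamples restored mel frames
    to waveform samples (a real system would use a trained vocoder)."""
    def __init__(self, n_mels: int = 128, hop: int = 512):
        super().__init__()
        self.proj = nn.Conv1d(n_mels, hop, kernel_size=3, padding=1)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        x = self.proj(mel)                        # (batch, hop, frames)
        # Flatten frames into a waveform of frames * hop samples in [-1, 1].
        return torch.tanh(x).transpose(1, 2).reshape(x.size(0), -1)

analysis, vocoder = AnalysisStage(), SynthesisStage()
degraded = torch.randn(1, 128, 200)               # dummy degraded mels
restored_wave = vocoder(analysis(degraded))       # shape: (1, 200 * 512)
```

The key design point the abstract describes is that the analysis stage only has to produce intermediate-level features, leaving the hard problem of full-bandwidth waveform generation to the vocoder.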
Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, 2020
This paper presents the Tencent speech synthesis system for Blizzard Challenge 2019. The corpus released to the participants this year is about 8 hours of speech data from an internet talk show hosted by a well-known Chinese personality. We built an end-to-end speech synthesis system for this task. First, a multi-speaker Tacotron-like acoustic model, fed with unaligned linguistic features and BERT sentence embeddings, was employed for mel-spectrogram modeling. The model was then fine-tuned on the released corpus only. Finally, instead of a conventional vocoder, a modified multi-speaker WaveNet model conditioned on the predicted mel features was trained to generate 16-bit speech waveforms at 24 kHz. To achieve higher quality, a channel embedding was incorporated into WaveNet. The evaluation results show that the submitted system performs well across various criteria, indicating its strength.
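A minimal sketch of how a channel embedding can be folded into a WaveNet-style vocoder's conditioning, assuming the embedding is simply broadcast over time and added to the mel conditioning. This is a hypothetical illustration of the idea, not the system described above; all names and sizes are assumptions.

```python
# Hypothetical sketch: a tiny WaveNet-like stack whose local conditioning
# is the sum of projected mel features and a learned channel embedding.
import torch
import torch.nn as nn

class ConditionedDilatedBlock(nn.Module):
    def __init__(self, channels: int, dilation: int):
        super().__init__()
        # Left-pad then trim below, so the convolution stays causal.
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size=2,
                              dilation=dilation, padding=dilation)
        self.cond = nn.Conv1d(channels, 2 * channels, kernel_size=1)

    def forward(self, x, c):
        h = self.conv(x)[..., :x.size(-1)] + self.cond(c)
        a, b = h.chunk(2, dim=1)
        return x + torch.tanh(a) * torch.sigmoid(b)   # gated residual unit

class TinyWaveNet(nn.Module):
    def __init__(self, n_mels=80, channels=64, n_channel_ids=4):
        super().__init__()
        self.mel_proj = nn.Conv1d(n_mels, channels, kernel_size=1)
        self.chan_emb = nn.Embedding(n_channel_ids, channels)
        self.inp = nn.Conv1d(1, channels, kernel_size=1)
        self.blocks = nn.ModuleList(
            ConditionedDilatedBlock(channels, 2 ** i) for i in range(6))
        self.out = nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, wave, mel_upsampled, channel_id):
        # wave: (B, 1, T); mel_upsampled: (B, n_mels, T); channel_id: (B,)
        c = self.mel_proj(mel_upsampled)
        c = c + self.chan_emb(channel_id).unsqueeze(-1)  # broadcast over time
        x = self.inp(wave)
        for blk in self.blocks:
            x = blk(x, c)
        return self.out(x)

net = TinyWaveNet()
y = net(torch.randn(2, 1, 1000), torch.randn(2, 80, 1000),
        torch.tensor([0, 3]))                      # y: (2, 1, 1000)
```

The design intuition is that recording-channel variation (microphone, room, show segment) is factored into a single learned vector, so the dilated stack does not have to absorb it implicitly.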
Interspeech 2020, 2020
In this paper, we propose FeatherWave, yet another variant of the WaveRNN vocoder, combining multi-band signal processing with linear predictive coding. LPCNet, a recently proposed neural vocoder that exploits the linear predictive characteristics of speech signals within the WaveRNN architecture, can generate high-quality speech faster than real time on a single CPU core. However, LPCNet is still not efficient enough for online speech generation tasks. To address this issue, we adopt multi-band linear predictive coding in the WaveRNN vocoder. The multi-band method enables the model to generate several speech samples in parallel at each step, which significantly improves the efficiency of speech synthesis. The proposed model with 4 sub-bands needs less than 1.6 GFLOPS for speech generation. In our experiments, it generates 24 kHz high-fidelity audio 9x faster than real time on a single CPU, which is much faster than the LPCNet vocoder. Furthermore, our subjective listening tests show that FeatherWave generates speech with better quality than LPCNet.
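The efficiency argument rests on two ingredients: linear prediction supplies most of each sample cheaply, and sub-band decomposition lets one autoregressive step emit several full-rate samples. The sketch below illustrates both in plain numpy under simplifying assumptions; it is not the FeatherWave implementation, the filterbank is omitted, and the excitation (which the neural network would predict) is replaced by noise.

```python
# Simplified sketch of multi-band linear predictive generation
# (hypothetical code, not FeatherWave). Each of M sub-bands is extended
# by LPC prediction plus a residual, so one step yields M output samples.
import numpy as np

M, ORDER = 4, 8          # number of sub-bands, LPC order per band

def lpc_coeffs(x: np.ndarray, order: int) -> np.ndarray:
    """Solve the LPC normal equations (a production system would use the
    Levinson-Durbin recursion instead of a direct solve)."""
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R + 1e-6 * np.eye(order), r[1:])

def generate_band(history: np.ndarray, excitation: np.ndarray,
                  a: np.ndarray) -> np.ndarray:
    """Autoregressively extend one sub-band: prediction + residual."""
    buf = list(history[-ORDER:])
    out = []
    for e in excitation:              # e would come from the network
        pred = np.dot(a, buf[::-1])   # linear prediction from past samples
        s = pred + e
        out.append(s)
        buf = buf[1:] + [s]
    return np.array(out)

# Toy usage: the M bands advance "in parallel" each step; a real system
# would merge them with a PQMF synthesis filterbank afterwards.
rng = np.random.default_rng(0)
bands = [rng.standard_normal(200) * 0.1 for _ in range(M)]
coeffs = [lpc_coeffs(b, ORDER) for b in bands]
new_bands = [generate_band(b, rng.standard_normal(50) * 0.01, a)
             for b, a in zip(bands, coeffs)]
# 50 steps * M=4 sub-band samples per step -> 200 output samples.
```

Since each sub-band runs at 1/M of the output rate, the recurrent network only needs 1/M as many steps per second of audio, which is where the reported speedup over sample-by-sample LPCNet comes from.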