A unified accent estimation method based on multi-task learning for Japanese text-to-speech

04.04.2022

Accepted to Interspeech 2022

Authors

Abstract

We propose a unified accent estimation method for Japanese text-to-speech (TTS). Unlike conventional two-stage methods, which train separate models for predicting accent phrase boundaries and accent nucleus positions, our method merges the two models and jointly optimizes the entire model in a multi-task learning framework. Furthermore, considering the hierarchical linguistic structure of intonation phrases (IPs), accent phrases, and accent nuclei, we generalize the proposed approach to model the IP boundaries simultaneously with the accent information. Objective evaluation results show that the proposed method achieves an accent estimation accuracy of 80.4%, which is 6.67 percentage points higher than the conventional two-stage method. When the proposed method is incorporated into a neural TTS framework, the system achieves a mean opinion score of 4.29 for prosody naturalness.
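
As a rough illustration of what "jointly optimizes" can mean in a multi-task setting, the sketch below combines per-task losses into a single objective by a weighted sum. This is a common multi-task formulation and only an assumption here: the page does not state how the task losses are combined, and the function name `joint_loss`, the weights, and the loss values are all hypothetical.

```python
def joint_loss(task_losses, task_weights=None):
    """Combine per-task losses into one scalar by a weighted sum.

    task_losses:  dict mapping a task name (e.g. "IP", "AP", "AN") to its loss.
    task_weights: optional dict with one weight per task; defaults to 1.0 each.
    NOTE: the weighted sum is an assumed, generic multi-task objective,
    not necessarily the objective used by the authors.
    """
    if task_weights is None:
        task_weights = {task: 1.0 for task in task_losses}
    return sum(task_weights[task] * loss for task, loss in task_losses.items())


# Hypothetical loss values for the three tasks named on this page.
losses = {"IP": 0.25, "AP": 0.5, "AN": 1.0}

print(joint_loss(losses))                                      # equal weights
print(joint_loss(losses, {"IP": 0.5, "AP": 1.0, "AN": 2.0}))   # custom weights
```

Because a single scalar is minimized, gradients from every task reach the shared encoder, which is the practical difference from training two separate models.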


Demo

TTS setup

The detailed model structure and training conditions of these two models were the same as those in [3].

Target tasks

accent

Systems used for comparison

Models
Model | Encoder | Decoder | Task
(a) [4] | - | CRF | AP
(b) [4] | - | CRF | AN
(c) | Bi-LSTM | CRF | IP
(d) | Bi-LSTM | CRF | AP
(e) | Bi-LSTM | AR | AN
(f) | Bi-LSTM | CRF, AR | AP+AN
(g) | Bi-LSTM | CRF, CRF, AR | IP+AP+AN
Systems
System | Components
Reference (Recording) 1 | -
Reference (TTS) 2 | -
(A) | (a), (b), (c)
(B) | (c), (d), (e)
(C) | (c), (f)
(D) | (g)
Audio samples (Japanese)
Tags for simplified output
+ accent with high pitch
- accent with low pitch
/ accent phrase boundary
# intonation phrase boundary
(blue tag) is correct (e.g., +: correct accent)
(red tag) is incorrect (e.g., +: incorrect accent)
_ phrase boundary missing
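
The tag legend above is enough to read the simplified output mechanically. The following is a minimal sketch, assuming tokens are separated by spaces, every mora token ends in `+` or `-`, and `.` marks the end of the sentence. The function names `parse_simplified` and `accent_nucleus` are hypothetical, and the nucleus rule (the last high mora before a fall inside the phrase) is the usual reading of a high/low pitch pattern, not something this page defines.

```python
def parse_simplified(output):
    """Parse a simplified output string into a nested structure.

    Returns a list of intonation phrases; each is a list of accent phrases;
    each accent phrase is a list of (mora, pitch) pairs with pitch "+" or "-".
    """
    intonation_phrases = [[[]]]
    for token in output.split():
        if token == "#":            # intonation phrase boundary
            intonation_phrases.append([[]])
        elif token == "/":          # accent phrase boundary
            intonation_phrases[-1].append([])
        elif token in ("_", "."):   # missing-boundary marker / sentence end
            continue
        elif token[-1] in "+-":     # mora with high (+) or low (-) pitch
            intonation_phrases[-1][-1].append((token[:-1], token[-1]))
        else:
            raise ValueError(f"unexpected token: {token!r}")
    return intonation_phrases


def accent_nucleus(accent_phrase):
    """Return the 1-based index of the mora after which pitch falls, or 0.

    0 means no high-to-low fall was found inside the phrase (assumed to
    correspond to an unaccented phrase).
    """
    for i in range(len(accent_phrase) - 1):
        if accent_phrase[i][1] == "+" and accent_phrase[i + 1][1] == "-":
            return i + 1
    return 0


# Reference (TTS) row of Sample 1 on this page.
sample = "ni- chi+ da+ i+ / a- me+ fu+ to+ / ha- N+ so+ ku+ mo+ N- da- i- ."
phrases = parse_simplified(sample)
for accent_phrase in phrases[0]:
    moras = "".join(mora for mora, _ in accent_phrase)
    print(moras, accent_nucleus(accent_phrase))
```

For this sample the parser yields one intonation phrase with three accent phrases; only the last one contains a fall, after its fifth mora.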
Sample 1: "日大アメフト反則問題。"
System Audio Simplified output
Reference(TTS) ni- chi+ da+ i+ / a- me+ fu+ to+ / ha- N+ so+ ku+ mo+ N- da- i- .
System(A) ni- chi+ da+ i+ _ a+ me- fu- to- / ha- N+ so+ ku+ mo+ N- da- i- .
System(B) ni- chi+ da+ i+ _ a- me- fu- to- / ha- N+ so+ ku+ mo+ N+ da+ i+ .
System(C) ni- chi+ da+ i+ / a- me+ fu+ to+ / ha- N+ so+ ku+ mo+ N- da- i- .
System(D) ni- chi+ da+ i+ / a- me+ fu+ to+ / ha- N+ so+ ku+ mo+ N- da- i- .
Sample 2: "あなたとおはなししている時以外のスケジュールは、内緒ですよ。"
System Audio Simplified output
Reference(Recording) -
Reference(TTS) a- na+ ta- to- / o- ha+ na+ shi+ shi+ te+ i+ ru+ / to- ki+ i+ ga- i- no- / su- ke+ jyuu+ ru- wa- # na- i+ shyo+ de+ su- yo- .
System(A) a- na+ ta- to- / o- ha+ na+ shi+ shi+ te+ i+ ru+ / to+ ki- / i+ ga- i- no- / su- ke+ jyuu+ ru- wa- # na- i+ shyo+ de+ su- yo- .
System(B) a- na+ ta- to- / o- ha+ na+ shi+ shi+ te+ i+ ru+ / to+ ki- i- ga- i- no- / su+ ke- jyuu- ru- wa- # na- i+ shyo+ de- su- yo- .
System(C) a- na+ ta- to- / o- ha+ na+ shi+ shi+ te+ i+ ru+ / to+ ki- / i+ ga- i- no- / su- ke+ jyuu+ ru- wa- # na- i+ shyo+ de+ su+ yo+ .
System(D) a- na+ ta- to- / o- ha+ na+ shi+ shi+ te+ i+ ru+ / to+ ki- / i+ ga- i- no- / su- ke+ jyuu+ ru- wa- # na- i+ shyo+ de+ su+ yo+ .
Sample 3: "後あとトラブルになる可能性が高いですよ。"
System Audio Simplified output
Reference(Recording) -
Reference(TTS) a- to+ a+ to+ / to- ra+ bu- ru- ni- / na+ ru- / ka- noo+ see+ ga+ / ta- ka+ i- de- su- yo- .
System(A) a- to+ a+ to- / to- ra+ bu- ru- ni- / na+ ru- / ka- noo+ see+ ga+ / ta- ka+ i- de- su- yo- .
System(B) a+ to- a- to- _ to- ra- bu- ru- ni- / na+ ru- / ka- noo+ see+ ga+ / ta- ka+ i- de- su- yo- .
System(C) a+ to- a- to- / to- ra+ bu- ru- ni- / na+ ru- / ka- noo+ see+ ga+ / ta- ka+ i- de- su- yo- .
System(D) a+ to- a- to- / to- ra+ bu- ru- ni- / na+ ru- / ka- noo+ see+ ga+ / ta- ka+ i- de- su- yo- .
Sample 4: "「今世紀に入ってから1番楽しみな食事会」だと期待を寄せた。"
System Audio Simplified output
Reference(Recording) -
Reference(TTS) ko- N+ see+ ki- ni- / ha+ i- Q- te- ka- ra- # i- chi+ ba+ N+ / ta- no+ shi+ mi- na- / shyo- ku+ ji+ ka+ i+ da+ to- # ki- ta+ i+ o+ / yo- se+ ta+ .
System(A) ko- N+ see+ ki- ni- / ha+ i- Q- te- ka- ra- # i- chi+ ba+ N+ / ta- no+ shi+ mi- na- / shyo- ku+ ji+ ka+ i- da- to- / ki- ta+ i+ o+ / yo- se+ ta+ .
System(B) ko+ N- see- ki- ni- / ha+ i- Q- te- ka- ra- # i+ chi- ba- N- / ta- no+ shi+ mi+ na+ / shyo- ku+ ji+ ka- i- da- to- / ki- ta+ i+ o- / yo- se+ ta+ .
System(C) ko- N+ see+ ki- ni- / ha+ i- Q- te- ka- ra- # i- chi+ ba+ N+ / ta- no+ shi+ mi- na- / shyo- ku+ ji+ ka- i- da- to- / ki- ta+ i+ o+ / yo- se+ ta+ .
System(D) ko+ N- / see+ ki- ni- / ha+ i- Q- te- ka- ra- # i- chi+ ba+ N+ / ta- no+ shi+ mi- na- / shyo- ku+ ji+ ka- i- da- to- # ki- ta+ i+ o+ / yo- se+ ta+ .
Sample 5: "23日に心臓発作で倒れ、ロス市内の病院に入院していた女優のキャリー・フィッシャーさんが現地時間の27日朝、60歳で死去。"
System Audio Simplified output
Reference(Recording) -
Reference(TTS) ni+ jyuu- / sa+ N- ni- chi- ni- # shi- N+ zoo+ ho+ Q- sa- de- / ta- o+ re- # ro- su+ shi+ na- i- no- / byoo- i+ N+ ni+ / nyuu- i+ N+ shi+ te+ i+ ta+ # jyo- yuu+ no+ / kya- rii+ fi+ Q- shyaa- sa- N- ga- # ge- N+ chi+ ji+ ka- N- no- / ni+ jyuu- / shi- chi+ ni+ chi+ / a+ sa- # ro- ku+ jyu+ Q- sa- i- de- / shi+ kyo- .
System(A) ni+ jyuu- / sa+ N- ni- chi- ni- # shi- N+ zoo+ ho+ Q- sa- de- / ta- o+ re- # ro- su+ shi+ na- i- no- / byoo- i+ N+ ni+ / nyuu- i+ N+ shi+ te+ i+ ta+ # jyo- yuu+ no+ # kya- rii+ fi+ Q- shyaa- sa- N- ga- # ge- N+ chi+ ji+ ka- N- no- / ni+ jyuu- / shi- chi+ ni+ chi+ / a+ sa- # ro- ku+ jyu+ Q- sa- i- de- / shi+ kyo- .
System(B) ni+ jyuu- / sa+ N- ni- chi- ni- # shi- N+ zoo+ ho+ Q- sa- de- / ta- o+ re- # ro- su+ shi+ na- i- no- / byoo- i+ N+ ni+ / nyuu- i+ N+ shi+ te+ i+ ta+ # jyo- yuu+ no+ # kya- rii+ fi+ Q- shyaa- sa- N- ga- # ge+ N- chi- ji- ka- N- no- / ni- jyuu+ / shi+ chi- ni- chi- / a- sa+ # ro+ ku- jyu- Q- sa- i- de- / shi- kyo+ .
System(C) ni+ jyuu- / sa+ N- ni- chi- ni- # shi- N+ zoo+ ho+ Q- sa- de- / ta- o+ re- # ro- su+ shi+ na- i- no- / byoo- i+ N+ ni+ / nyuu- i+ N+ shi+ te+ i+ ta+ # jyo- yuu+ no+ # kya- rii+ fi+ Q- shyaa- sa- N- ga- # ge- N+ chi+ ji+ ka- N- no- / ni+ jyuu- / shi- chi+ ni+ chi+ / a+ sa- # ro- ku+ jyu+ Q- sa- i- de- / shi+ kyo- .
System(D) ni+ jyuu- / sa+ N- ni- chi- ni- # shi- N+ zoo+ ho+ Q- sa- de- / ta- o+ re- # ro- su+ shi+ na- i- no- / byoo- i+ N+ ni+ / nyuu- i+ N+ shi+ te+ i+ ta+ # jyo- yuu+ no+ # kya- rii+ fi+ Q- shyaa- sa- N- ga- # ge- N+ chi+ ji+ ka- N- no- / ni+ jyuu- / shi- chi+ ni+ chi+ / a+ sa- # ro- ku+ jyu+ Q- sa- i- de- / shi+ kyo- .
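
Since the colour coding of correct and incorrect tags does not survive in plain text, the differences between a system row and the Reference (TTS) row can be recovered by comparing the two strings mora by mora. The sketch below is one way to do that, assuming both rows contain the same moras in the same order; the helper names are hypothetical, and the comparison is only an illustration, not the accuracy metric reported in the abstract.

```python
def to_mora_labels(output):
    """Convert a simplified output string into per-mora labels.

    Each label is [mora, pitch, boundary_after], where boundary_after is
    "/" (accent phrase), "#" (intonation phrase) or "" (no boundary).
    The "_" (missing boundary) and "." (sentence end) tokens are skipped.
    """
    labels = []
    for token in output.split():
        if token in ("/", "#"):
            if labels:
                labels[-1][2] = token
        elif token in ("_", "."):
            continue
        else:
            labels.append([token[:-1], token[-1], ""])
    return labels


def compare(reference, hypothesis):
    """Return 0-based mora indices where pitch or the following boundary differ."""
    ref = to_mora_labels(reference)
    hyp = to_mora_labels(hypothesis)
    if [label[0] for label in ref] != [label[0] for label in hyp]:
        raise ValueError("reference and hypothesis contain different moras")
    pairs = list(zip(ref, hyp))
    return {
        "pitch": [i for i, (r, h) in enumerate(pairs) if r[1] != h[1]],
        "boundary": [i for i, (r, h) in enumerate(pairs) if r[2] != h[2]],
    }


# Sample 1 rows copied from the table above.
reference = "ni- chi+ da+ i+ / a- me+ fu+ to+ / ha- N+ so+ ku+ mo+ N- da- i- ."
system_a = "ni- chi+ da+ i+ _ a+ me- fu- to- / ha- N+ so+ ku+ mo+ N- da- i- ."
system_c = "ni- chi+ da+ i+ / a- me+ fu+ to+ / ha- N+ so+ ku+ mo+ N- da- i- ."

print(compare(reference, system_a))  # {'pitch': [4, 5, 6, 7], 'boundary': [3]}
print(compare(reference, system_c))  # {'pitch': [], 'boundary': []}
```

Attaching each boundary to the preceding mora keeps the two sequences aligned even when a system inserts or drops a boundary, which a plain token-by-token comparison would not handle.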

States used for comparison

Audio samples (Japanese)
Sample 1: "最初にすることは早起き、次にするのが二度寝、その次が後悔。"
State Audio Simplified output
Recording -
X sa- i+ shyo+ ni+ / su- ru+ / ko- to+ wa- # ha- ya+ o- ki- # tsu- gi+ ni- / su- ru+ no- ga- / ni+ do- / ne- # so- no+ / tsu- gi+ ga- / koo+ ka- i- .
IP sa- i+ shyo+ ni+ / su- ru+ / ko- to+ wa- # ha- ya+ o- ki- # tsu- gi+ ni- / su- ru+ no- ga- # ni+ do- / ne- # so- no+ / tsu- gi+ ga- # koo+ ka- i- .
AP sa- i+ shyo+ ni+ / su- ru+ / ko- to+ wa- # ha- ya+ o- ki- # tsu- gi+ ni- / su- ru+ no- ga- / ni+ do- ne- # so- no+ / tsu- gi+ ga- / koo+ ka- i- .
AN sa- i+ shyo+ ni+ / su- ru+ / ko- to+ wa- # ha- ya+ o- ki- # tsu- gi+ ni- / su- ru+ no- ga- / ni- do+ / ne+ # so- no+ / tsu- gi+ ga- / koo+ ka- i- .
IP+AP sa- i+ shyo+ ni+ / su- ru+ / ko- to+ wa- # ha- ya+ o- ki- # tsu- gi+ ni- / su- ru+ no- ga- # ni+ do- ne- # so- no+ / tsu- gi+ ga- # koo+ ka- i- .
IP+AN sa- i+ shyo+ ni+ / su- ru+ / ko- to+ wa- # ha- ya+ o- ki- # tsu- gi+ ni- / su- ru+ no- ga- # ni- do+ / ne+ # so- no+ / tsu- gi+ ga- # koo+ ka- i- .
AP+AN sa- i+ shyo+ ni+ / su- ru+ / ko- to+ wa- / ha- ya+ o- ki- # tsu- gi+ ni- / su- ru+ no- ga- / ni- do+ ne+ # so- no+ / tsu- gi+ ga- / koo+ ka- i- .
IP+AP+AN sa- i+ shyo+ ni+ / su- ru+ / ko- to+ wa- # ha- ya+ o- ki- # tsu- gi+ ni- / su- ru+ no- ga- # ni- do+ ne+ # so- no+ / tsu- gi+ ga- # koo+ ka- i- .
Sample 2: "メキシコ留学中の、エーケービーフォーティーエイト入山杏奈が一時帰国。"
State Audio Simplified output
Recording -
X me- ki+ shi+ ko+ _ ryuu+ ga+ ku+ chyuu+ no+ # ee- kee+ bii+ _ foo- tii- e- i- to- # i- ri+ ya+ ma+ / a+ N- na- ga- # i- chi+ ji- / ki- ko+ ku+ .
IP me- ki+ shi+ ko+ _ ryuu+ ga+ ku+ chyuu+ no+ # ee- kee+ bii+ _ foo- tii- e- i- to- # i- ri+ ya+ ma+ / a+ N- na- ga- / i- chi+ ji- / ki- ko+ ku+ .
AP me- ki+ shi+ ko+ / ryuu+ ga- ku- chyuu- no- # ee- kee+ bii+ / foo+ tii- e- i- to- # i- ri+ ya+ ma+ / a+ N- na- ga- # i- chi+ ji- / ki- ko+ ku+ .
AN me- ki+ shi+ ko+ _ ryuu- ga- ku- chyuu- no- # ee- kee+ bii+ _ foo+ tii+ e+ i+ to+ # i- ri+ ya+ ma+ / a+ N- na- ga- # i- chi+ ji- / ki- ko+ ku+ .
IP+AP me- ki+ shi+ ko+ / ryuu+ ga- ku- chyuu- no- # ee- kee+ bii+ / foo+ tii- e- i- to- # i- ri+ ya+ ma+ / a+ N- na- ga- / i- chi+ ji- / ki- ko+ ku+ .
IP+AN me- ki+ shi+ ko+ _ ryuu- ga- ku- chyuu- no- # ee- kee+ bii+ _ foo+ tii+ e+ i+ to+ # i- ri+ ya+ ma+ / a+ N- na- ga- / i- chi+ ji- / ki- ko+ ku+ .
AP+AN me- ki+ shi+ ko+ / ryuu- ga+ ku+ chyuu+ no+ # ee- kee+ bii+ / foo- tii+ e+ i- to- # i- ri+ ya+ ma+ / a+ N- na- ga- # i- chi+ ji- / ki- ko+ ku+ .
IP+AP+AN me- ki+ shi+ ko+ / ryuu- ga+ ku+ chyuu+ no+ # ee- kee+ bii+ / foo- tii+ e+ i- to- # i- ri+ ya+ ma+ / a+ N- na- ga- / i- chi+ ji- / ki- ko+ ku+ .
Sample 3: "私なんて、若い頃の貯金食いつぶし生活で、40代になってからはあんまり働いてないですもん。"
State Audio Simplified output
Recording -
X wa- ta+ shi+ na+ N- te- # wa- ka+ i- / ko+ ro- no- / chyo- ki+ N+ # ku- i+ tsu+ bu+ shi- / see- ka+ tsu+ de+ # yo- N+ jyuu+ da- i- ni- / na+ Q- te- ka- ra- wa- # a- N+ ma+ ri+ / ha- ta+ ra+ i+ te+ na+ i+ de+ su+ mo+ N+ .
IP wa- ta+ shi+ na+ N- te- # wa- ka+ i- / ko+ ro- no- / chyo- ki+ N+ / ku- i+ tsu+ bu+ shi- / see- ka+ tsu+ de+ # yo- N+ jyuu+ da- i- ni- / na+ Q- te- ka- ra- wa- # a- N+ ma+ ri+ / ha- ta+ ra+ i+ te+ na+ i+ de+ su+ mo+ N+ .
AP wa- ta+ shi+ na+ N- te- # wa- ka+ i- / ko+ ro- no- / chyo- ki+ N+ # ku- i+ tsu+ bu+ shi- see- ka- tsu- de- # yo- N+ jyuu+ da- i- ni- / na+ Q- te- ka- ra- wa- # a- N+ ma+ ri+ / ha- ta+ ra+ i+ te+ na+ i+ de+ su+ mo+ N+ .
AN wa- ta+ shi+ na+ N- te- # wa- ka+ i- / ko+ ro- no- / chyo- ki+ N+ # ku- i+ tsu+ bu+ shi+ / see+ ka- tsu- de- # yo- N+ jyuu+ da- i- ni- / na+ Q- te- ka- ra- wa- # a- N+ ma+ ri+ / ha- ta+ ra+ i+ te+ na+ i- de- su- mo- N- .
IP+AP wa- ta+ shi+ na+ N- te- # wa- ka+ i- / ko+ ro- no- / chyo- ki+ N+ / ku- i+ tsu+ bu+ shi- see- ka- tsu- de- # yo- N+ jyuu+ da- i- ni- / na+ Q- te- ka- ra- wa- # a- N+ ma+ ri+ / ha- ta+ ra+ i+ te+ na+ i+ de+ su+ mo+ N+ .
IP+AN wa- ta+ shi+ na+ N- te- # wa- ka+ i- / ko+ ro- no- / chyo- ki+ N+ / ku- i+ tsu+ bu+ shi+ / see+ ka- tsu- de- # yo- N+ jyuu+ da- i- ni- / na+ Q- te- ka- ra- wa- # a- N+ ma+ ri+ / ha- ta+ ra+ i+ te+ na+ i- de- su- mo- N- .
AP+AN wa- ta+ shi+ na+ N- te- # wa- ka+ i- / ko+ ro- no- / chyo- ki+ N+ # ku- i+ tsu+ bu+ shi+ see+ ka- tsu- de- # yo- N+ jyuu+ da- i- ni- / na+ Q- te- ka- ra- wa- # a- N+ ma+ ri+ / ha- ta+ ra+ i+ te+ na+ i- de- su- mo- N- .
IP+AP+AN wa- ta+ shi+ na+ N- te- # wa- ka+ i- / ko+ ro- no- / chyo- ki+ N+ / ku- i+ tsu+ bu+ shi+ see+ ka- tsu- de- # yo- N+ jyuu+ da- i- ni- / na+ Q- te- ka- ra- wa- # a- N+ ma+ ri+ / ha- ta+ ra+ i+ te+ na+ i- de- su- mo- N- .

References

Acknowledgements

This work was supported by Clova Voice, NAVER Corp., Seongnam, Korea. The authors would like to thank Yuma Shirahata and Kosuke Futamata at LINE Corp., Tokyo, Japan, for their support.