Unsupervised prominence prediction for speech synthesis
We propose an unsupervised prominence prediction method for expressive speech synthesis. Prominence patterns are learned by statistical analysis of prosodic features extracted from speech data. Because the approach is unsupervised and data-driven, it adapts easily to new speakers, speech styles, and even languages without requiring expert knowledge or complicated linguistic rules. First, prosodic features that are predictive of prominence are extracted at the foot level. Next, the extracted features are clustered, with each cluster representing a prominence level. The optimal number of perceptually distinct prominence levels is then determined from the just-noticeable differences of the prosodic features. Finally, the proposed prominence prediction is applied to prosody prediction for unit selection speech synthesis. Perceptual evaluation results show a preference for a 4-level unsupervised prominence prediction over a rule-based baseline in terms of naturalness and expressiveness of the synthesized speech.
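The clustering and level-selection steps can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the clustering algorithm, the feature set, or the exact just-noticeable-difference (JND) criterion. The sketch assumes a single scalar feature per foot (a hypothetical F0 range in semitones), plain k-means as the clustering method, and the rule "keep adding levels while all adjacent cluster centroids stay at least one JND apart". The function names, the sample values, and the JND of 1.5 are all illustrative.

```python
from typing import List, Sequence


def kmeans_1d(values: Sequence[float], k: int, iters: int = 50) -> List[float]:
    """Lloyd's k-means on scalars; returns the k centroids in ascending order.

    Initial centroids are taken at evenly spaced quantiles so the result
    is deterministic (an illustrative choice, not taken from the paper).
    """
    data = sorted(values)
    n = len(data)
    centroids = [data[int((i + 0.5) * n / k)] for i in range(k)]
    for _ in range(iters):
        groups: List[List[float]] = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            groups[nearest].append(v)
        # An empty cluster keeps its previous centroid.
        updated = [sum(g) / len(g) if g else centroids[i]
                   for i, g in enumerate(groups)]
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids)


def choose_levels(values: Sequence[float], jnd: float, max_k: int = 6) -> List[float]:
    """Return centroids for the largest k (tried in increasing order) whose
    adjacent centroids all differ by at least `jnd`.

    Assumed criterion: levels closer than one JND are treated as
    perceptually indistinguishable, so the search stops at the first k
    that violates it.
    """
    best = [sum(values) / len(values)]  # a single level as the fallback
    for k in range(2, max_k + 1):
        if k > len(set(values)):
            break
        centroids = kmeans_1d(values, k)
        if all(b - a >= jnd for a, b in zip(centroids, centroids[1:])):
            best = centroids
        else:
            break
    return best


def prominence_level(value: float, centroids: Sequence[float]) -> int:
    """Map a foot-level feature value to the index of the nearest centroid."""
    return min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))


# Hypothetical foot-level F0 ranges in semitones (made-up illustration data).
foot_f0_range = [0.9, 1.0, 1.1, 3.9, 4.0, 4.1, 7.9, 8.0, 8.1, 11.9, 12.0, 12.1]
levels = choose_levels(foot_f0_range, jnd=1.5)
labels = [prominence_level(v, levels) for v in foot_f0_range]
```

On this toy data the search settles on four levels, because a fifth cluster would have to split one of the tight groups into centroids less than one JND apart. A realistic setup would cluster several prosodic features jointly and apply a per-feature JND, which this one-dimensional sketch deliberately leaves out.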