Book ; Online: ED-TTS
Multi-Scale Emotion Modeling using Cross-Domain Emotion Diarization for Emotional Speech Synthesis
2024
Abstract: Existing emotional speech synthesis methods often utilize an utterance-level style embedding extracted from reference audio, neglecting the inherent multi-scale property of speech prosody. We introduce ED-TTS, a multi-scale emotional speech synthesis ... ...
Abstract | Existing emotional speech synthesis methods often utilize an utterance-level style embedding extracted from reference audio, neglecting the inherent multi-scale property of speech prosody. We introduce ED-TTS, a multi-scale emotional speech synthesis model that leverages Speech Emotion Diarization (SED) and Speech Emotion Recognition (SER) to model emotions at different levels. Specifically, our proposed approach integrates the utterance-level emotion embedding extracted by SER with fine-grained frame-level emotion embedding obtained from SED. These embeddings are used to condition the reverse process of the denoising diffusion probabilistic model (DDPM). Additionally, we employ cross-domain SED to accurately predict soft labels, addressing the challenge of a scarcity of fine-grained emotion-annotated datasets for supervising emotional TTS training. Comment: Accepted by 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP2024) |
---|---|
Keywords | Electrical Engineering and Systems Science - Audio and Speech Processing ; Computer Science - Sound |
Subject code | 004 |
Publishing date | 2024-01-16 |
Publishing country | us |
Document type | Book ; Online |
Database | BASE - Bielefeld Academic Search Engine (life sciences selection) |
Full text online
More links
Kategorien
Inter-library loan at ZB MED
Your chosen title can be delivered directly to ZB MED Cologne location if you are registered as a user at ZB MED Cologne.