LIVIVO - The Search Portal for Life Sciences

Search results

Results 1–7 of 7

  1. Book ; Online: ED-TTS

    Tang, Haobin / Zhang, Xulong / Cheng, Ning / Xiao, Jing / Wang, Jianzong

    Multi-Scale Emotion Modeling using Cross-Domain Emotion Diarization for Emotional Speech Synthesis

    2024  

    Abstract Existing emotional speech synthesis methods often utilize an utterance-level style embedding extracted from reference audio, neglecting the inherent multi-scale property of speech prosody. We introduce ED-TTS, a multi-scale emotional speech synthesis model that leverages Speech Emotion Diarization (SED) and Speech Emotion Recognition (SER) to model emotions at different levels. Specifically, our proposed approach integrates the utterance-level emotion embedding extracted by SER with fine-grained frame-level emotion embedding obtained from SED. These embeddings are used to condition the reverse process of the denoising diffusion probabilistic model (DDPM). Additionally, we employ cross-domain SED to accurately predict soft labels, addressing the challenge of a scarcity of fine-grained emotion-annotated datasets for supervising emotional TTS training.

    Comment: Accepted by 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP2024)
    Keywords Electrical Engineering and Systems Science - Audio and Speech Processing ; Computer Science - Sound
    Subject code 004
    Publishing date 2024-01-16
    Publishing country us
    Document type Book ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)
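
    The snippet below is a minimal, hypothetical sketch (not the authors' code) of the multi-scale conditioning idea in the abstract: an utterance-level emotion embedding from SER is broadcast over time and combined with frame-level embeddings derived from SED soft labels, giving one condition vector per frame for the DDPM denoiser. All sizes and the projection matrix are arbitrary stand-ins.

    import numpy as np

    rng = np.random.default_rng(0)
    T, D_UTT, D_FRAME, N_EMOTIONS = 80, 128, 64, 5                 # assumed sizes

    # Stand-ins for real model outputs:
    utt_emb = rng.normal(size=(D_UTT,))                            # utterance-level embedding from an SER model
    sed_soft_labels = rng.dirichlet(np.ones(N_EMOTIONS), size=T)   # frame-level soft labels from SED
    label_proj = rng.normal(size=(N_EMOTIONS, D_FRAME)) * 0.1      # toy projection into an embedding space

    frame_emb = sed_soft_labels @ label_proj                       # (T, D_FRAME) fine-grained emotion embedding
    cond = np.concatenate(
        [np.tile(utt_emb, (T, 1)), frame_emb], axis=1              # (T, D_UTT + D_FRAME) per-frame condition
    )
    print(cond.shape)  # (80, 192) -> would condition the reverse diffusion process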

  2. Book ; Online: QI-TTS

    Tang, Haobin / Zhang, Xulong / Wang, Jianzong / Cheng, Ning / Xiao, Jing

    Questioning Intonation Control for Emotional Speech Synthesis

    2023  

    Abstract Recent expressive text-to-speech (TTS) models focus on synthesizing emotional speech, but some fine-grained styles, such as intonation, are neglected. In this paper, we propose QI-TTS, which aims to better transfer and control intonation so as to further deliver the speaker's questioning intention while transferring emotion from reference speech. We propose a multi-style extractor to extract style embeddings at two different levels: the sentence level represents emotion, and the final-syllable level represents intonation. For fine-grained intonation control, we use relative attributes to represent intonation intensity at the syllable level. Experiments have validated the effectiveness of QI-TTS for improving intonation expressiveness in emotional speech synthesis.

    Comment: Accepted by ICASSP 2023
    Keywords Computer Science - Sound ; Computer Science - Computation and Language ; Electrical Engineering and Systems Science - Audio and Speech Processing
    Subject code 430
    Publishing date 2023-03-14
    Publishing country us
    Document type Book ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)
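
    A rough, hypothetical illustration of the two-level style control described in the abstract, under assumed shapes: a sentence-level embedding carries the emotion, while the final-syllable embedding is rescaled by a relative-attribute score to set intonation intensity. The ranking direction w is a stand-in for a learned relative-attribute model, and the rescaling is a deliberately crude simplification.

    import numpy as np

    rng = np.random.default_rng(1)
    D = 64
    sentence_emb = rng.normal(size=(D,))      # sentence-level (emotion) style embedding, stand-in
    syllable_emb = rng.normal(size=(D,))      # final-syllable (intonation) embedding, stand-in
    w = rng.normal(size=(D,))                 # assumed relative-attribute ranking direction

    current_intensity = float(w @ syllable_emb)                    # scalar intonation-intensity score
    target_intensity = 1.5                                         # desired questioning strength (arbitrary)
    scaled_syllable_emb = syllable_emb * (target_intensity / max(abs(current_intensity), 1e-8))

    style = np.concatenate([sentence_emb, scaled_syllable_emb])    # multi-level style condition for the TTS model
    print(style.shape)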

  3. Book ; Online: EmoMix

    Tang, Haobin / Zhang, Xulong / Wang, Jianzong / Cheng, Ning / Xiao, Jing

    Emotion Mixing via Diffusion Models for Emotional Speech Synthesis

    2023  

    Abstract There has been significant progress in emotional Text-To-Speech (TTS) synthesis technology in recent years. However, existing methods primarily focus on synthesizing a limited number of emotion types and achieve unsatisfactory performance in intensity control. To address these limitations, we propose EmoMix, which can generate emotional speech with a specified intensity or a mixture of emotions. Specifically, EmoMix is a controllable emotional TTS model based on a diffusion probabilistic model and a pre-trained speech emotion recognition (SER) model used to extract emotion embeddings. Mixed emotion synthesis is achieved by combining the noise predicted by the diffusion model conditioned on different emotions within a single sampling process at run time. We further mix Neutral with a specific primary emotion in varying degrees to control intensity. Experimental results validate the effectiveness of EmoMix for synthesizing mixed emotions and for intensity control.

    Comment: Accepted by 24th Annual Conference of the International Speech Communication Association (INTERSPEECH 2023)
    Keywords Computer Science - Sound ; Electrical Engineering and Systems Science - Audio and Speech Processing
    Subject code 410
    Publishing date 2023-06-01
    Publishing country us
    Document type Book ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)
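
    A minimal sketch, with a toy denoiser, of the run-time mixing step described in the abstract: the noise predictions of the same diffusion model conditioned on two different emotion embeddings are blended within one reverse step, and mixing Neutral with a primary emotion in varying ratios acts as intensity control. The predict_noise function and all shapes are assumptions, not the paper's implementation.

    import numpy as np

    rng = np.random.default_rng(2)

    def predict_noise(x_t, t, emotion_emb):
        # Stand-in for the conditional denoiser eps_theta(x_t, t, emotion).
        return 0.1 * x_t + 0.01 * emotion_emb

    T_MEL, D = 80, 64
    x_t = rng.normal(size=(T_MEL, D))          # noisy mel-spectrogram at diffusion step t
    emo_happy = rng.normal(size=(D,))          # emotion embeddings from a pre-trained SER model (stand-ins)
    emo_neutral = rng.normal(size=(D,))

    alpha = 0.7                                # 0.0 = pure Neutral, 1.0 = full primary emotion -> intensity control
    eps_mix = (alpha * predict_noise(x_t, t=10, emotion_emb=emo_happy)
               + (1 - alpha) * predict_noise(x_t, t=10, emotion_emb=emo_neutral))

    # eps_mix would then be plugged into the usual DDPM posterior update for x_{t-1}.
    print(eps_mix.shape)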

  4. Article ; Online: Dilatation Eustachian tuboplasty with a Eustachian tube video endoscope and supporting balloon.

    Zhang, Huasong / Zhang, Qing / He, Kunwu / Chen, Minqi / Chen, Yucheng / Su, Dongliang / Tang, Haobin / Lin, Weifen / Chen, Shuhua

    The Journal of laryngology and otology

    2023  Volume 138, Issue 3, Page(s) 246–252

    Abstract Objective: To evaluate the feasibility and safety of employing a Eustachian tube video endoscope with a supporting balloon as a viable treatment and examination option for patients with Eustachian tube dysfunction.
    Methods: A study involving nine fresh human cadaver heads was conducted to investigate the potential of balloon dilatation Eustachian tuboplasty using a Eustachian tube video endoscope and a supporting balloon catheter. The Eustachian tube cavity was examined with the Eustachian tube video endoscope during the procedure, which involved the dilatation of the cartilaginous portion of the Eustachian tube with the supporting balloon catheter.
    Results: The utilisation of the Eustachian tube video endoscope in conjunction with the supporting balloon catheter demonstrated technical ease during the procedure, with no observed damage to essential structures, particularly the Eustachian tube cavity.
    Conclusion: This newly introduced method of dilatation and examination of the Eustachian tube cavity using a Eustachian tube video endoscope and the supporting balloon is a feasible, safe procedure.
    MeSH term(s) Humans ; Eustachian Tube/surgery ; Dilatation/methods ; Tympanoplasty ; Ear Diseases/diagnosis ; Endoscopes ; Treatment Outcome
    Language English
    Publishing date 2023-07-26
    Publishing country England
    Document type Journal Article
    ZDB-ID 218299-3
    ISSN (online) 1748-5460
    ISSN (print) 0022-2151
    DOI 10.1017/S0022215123001202
    Database MEDical Literature Analysis and Retrieval System OnLINE

  5. Book ; Online: Dynamic Alignment Mask CTC

    Zhang, Xulong / Tang, Haobin / Wang, Jianzong / Cheng, Ning / Luo, Jian / Xiao, Jing

    Improved Mask-CTC with Aligned Cross Entropy

    2023  

    Abstract By predicting all the target tokens in parallel, non-autoregressive models greatly improve the decoding efficiency of speech recognition compared with traditional autoregressive models. In this work, we present dynamic alignment Mask CTC, introducing two methods: (1) Aligned Cross Entropy (AXE), which finds the monotonic alignment that minimizes the cross-entropy loss through dynamic programming, and (2) Dynamic Rectification, which creates new training samples by replacing some masks with model-predicted tokens. AXE ignores the absolute position alignment between the prediction and the ground-truth sentence and focuses on matching tokens in relative order. The dynamic rectification method makes the model capable of simulating non-mask but possibly wrong tokens, even if they have high confidence. Our experiments on the WSJ dataset demonstrated that both the AXE loss and the rectification method improve the WER of Mask CTC.

    Comment: Accepted by ICASSP 2023
    Keywords Computer Science - Sound ; Computer Science - Computation and Language ; Electrical Engineering and Systems Science - Audio and Speech Processing
    Subject code 501
    Publishing date 2023-03-14
    Publishing country us
    Document type Book ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)
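
    The dynamic program below is a simplified, hypothetical sketch of an AXE-style loss: it searches for the monotonic alignment between the model's per-position predictions and the target tokens that minimizes cross-entropy, instead of scoring absolute positions. The axe_like_loss name, the fixed skip penalty, and other details are placeholders that differ from the paper's formulation.

    import numpy as np

    def axe_like_loss(log_probs, target, skip_penalty=3.0):
        """log_probs: (T, V) log-softmax outputs; target: (L,) token ids."""
        T, V = log_probs.shape
        L = len(target)
        INF = 1e9
        dp = np.full((T + 1, L + 1), INF)
        dp[0, 0] = 0.0
        for i in range(T + 1):
            for j in range(L + 1):
                if i > 0 and j > 0:   # align prediction i-1 with target token j-1
                    dp[i, j] = min(dp[i, j], dp[i - 1, j - 1] - log_probs[i - 1, target[j - 1]])
                if i > 0:             # skip a prediction slot (fixed penalty)
                    dp[i, j] = min(dp[i, j], dp[i - 1, j] + skip_penalty)
                if j > 0:             # skip a target token (fixed penalty)
                    dp[i, j] = min(dp[i, j], dp[i, j - 1] + skip_penalty)
        return dp[T, L]

    rng = np.random.default_rng(3)
    logits = rng.normal(size=(6, 10))                                    # 6 positions, 10-token vocabulary
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    print(axe_like_loss(log_probs, np.array([1, 4, 4, 7])))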

  6. Book ; Online: Speech Augmentation Based Unsupervised Learning for Keyword Spotting

    Luo, Jian / Wang, Jianzong / Cheng, Ning / Tang, Haobin / Xiao, Jing

    2022  

    Abstract In this paper, we investigated a speech-augmentation-based unsupervised learning approach for the keyword spotting (KWS) task. KWS is a useful speech application, yet it also depends heavily on labeled data. We designed a CNN-Attention architecture to conduct the KWS task: CNN layers focus on local acoustic features, and attention layers model the long-time dependency. To improve the robustness of the KWS model, we also proposed an unsupervised learning method. The unsupervised loss is based on the similarity between the original and augmented speech features, as well as on audio-reconstruction information. Two speech augmentation methods are explored in the unsupervised learning: speed and intensity. Experiments on the Google Speech Commands V2 dataset demonstrated that our CNN-Attention model achieves competitive results. Moreover, augmentation-based unsupervised learning could further improve the classification accuracy of the KWS task. In our experiments, with augmentation-based unsupervised learning, our KWS model achieved better performance than other unsupervised methods such as CPC, APC, and MPC.

    Comment: accepted by WCCI 2022
    Keywords Computer Science - Sound ; Computer Science - Computation and Language ; Electrical Engineering and Systems Science - Audio and Speech Processing
    Subject code 004
    Publishing date 2022-05-28
    Publishing country us
    Document type Book ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)
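
    A hypothetical, minimal sketch of the augmentation-similarity idea from the abstract: the unsupervised objective pulls together the representations of an utterance and its speed- and intensity-augmented versions. The encoder, augmentations, and framing below are crude stand-ins, not the paper's CNN-Attention model or exact loss.

    import numpy as np

    rng = np.random.default_rng(4)

    def encoder(frames):
        """Stand-in encoder: mean-pool the frames into one embedding."""
        return frames.mean(axis=0)

    def intensity_augment(wave, gain_db=6.0):
        return wave * (10.0 ** (gain_db / 20.0))

    def speed_augment(wave, rate=1.1):
        # naive speed change by linear interpolation of the sample index
        idx = np.arange(0, len(wave), rate)
        return np.interp(idx, np.arange(len(wave)), wave)

    def cosine_similarity(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    wave = rng.normal(size=16000)                               # 1 s of fake audio at 16 kHz
    frames = wave[: len(wave) // 160 * 160].reshape(-1, 160)    # crude 10 ms "features"
    z = encoder(frames)

    losses = []
    for aug in (intensity_augment(wave), speed_augment(wave)):
        f = aug[: len(aug) // 160 * 160].reshape(-1, 160)
        losses.append(1.0 - cosine_similarity(z, encoder(f)))   # similarity term of the unsupervised loss
    print(sum(losses) / len(losses))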

  7. Book ; Online: SAR

    Wang, Jianzong / Zhang, Xulong / Tang, Haobin / Sun, Aolan / Cheng, Ning / Xiao, Jing

    Self-Supervised Anti-Distortion Representation for End-To-End Speech Model

    2023  

    Abstract In recent Text-to-Speech (TTS) systems, a neural vocoder often generates speech samples by conditioning solely on acoustic features predicted by an acoustic model. However, there are always distortions in the predicted acoustic features compared to those of the ground truth, especially in the common case of poor acoustic modeling due to low-quality training data. To overcome these limits, we propose a Self-supervised learning framework to learn an Anti-distortion acoustic Representation (SAR) that replaces human-crafted acoustic features, by introducing a distortion prior into an auto-encoder pre-training process. The learned acoustic representation is shown to be more robust to distortion than the most commonly used mel-spectrogram through both objective and subjective evaluation.

    Comment: Accepted by IJCNN2023. 2023 International Joint Conference on Neural Networks (IJCNN2023)
    Keywords Computer Science - Sound ; Electrical Engineering and Systems Science - Audio and Speech Processing
    Publishing date 2023-04-23
    Publishing country us
    Document type Book ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)
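
    As a loose, hypothetical sketch of the distortion-prior idea (not the paper's framework), the toy auto-encoder below is trained on mel frames corrupted with additive noise and temporal over-smoothing, i.e., distortions of the kind an acoustic model might introduce, while the reconstruction target stays clean, so the learned code tends toward distortion invariance. The distort function, sizes, and the short plain-SGD loop are arbitrary and not tuned to converge.

    import numpy as np

    rng = np.random.default_rng(5)

    def distort(mel):
        noisy = mel + 0.05 * rng.normal(size=mel.shape)           # additive "prediction error" noise
        kernel = np.ones(5) / 5.0                                 # temporal over-smoothing
        return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"), 0, noisy)

    T, D_MEL, D_CODE = 200, 80, 32
    mel = rng.normal(size=(T, D_MEL))                             # clean mel frames (stand-in data)
    W_enc = rng.normal(size=(D_MEL, D_CODE)) * 0.05               # toy linear encoder
    W_dec = rng.normal(size=(D_CODE, D_MEL)) * 0.05               # toy linear decoder

    lr = 0.1
    for step in range(200):                                       # a few illustrative SGD steps
        x = distort(mel)                                          # apply the distortion prior to the input
        code = x @ W_enc
        recon = code @ W_dec
        err = recon - mel                                         # reconstruct the *clean* frames
        g_recon = 2.0 * err / (T * D_MEL)                         # gradient of the mean squared error
        W_dec -= lr * (code.T @ g_recon)
        W_enc -= lr * (x.T @ (g_recon @ W_dec.T))

    print(float(np.mean((distort(mel) @ W_enc @ W_dec - mel) ** 2)))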
