LIVIVO - The Search Portal for Life Sciences

Search results

Results 1 - 10 of 30

  1. Book ; Online: E-PANNs

    Singh, Arshdeep / Liu, Haohe / Plumbley, Mark D.

    Sound Recognition Using Efficient Pre-trained Audio Neural Networks

    2023  

    Abstract Sounds carry an abundance of information about activities and events in our everyday environment, such as traffic noise, road works, music, or people talking. Recent machine learning methods, such as convolutional neural networks (CNNs), have been shown to be able to automatically recognize sound activities, a task known as audio tagging. One such method, pre-trained audio neural networks (PANNs), provides a neural network which has been pre-trained on over 500 sound classes from the publicly available AudioSet dataset, and can be used as a baseline or starting point for other tasks. However, the existing PANNs model has high computational complexity and a large storage requirement. This could limit the potential for deploying PANNs on resource-constrained devices, such as on-the-edge sound sensors, and could lead to high energy consumption if many such devices were deployed. In this paper, we reduce the computational complexity and memory requirement of the PANNs model by taking a pruning approach to eliminate redundant parameters. The resulting Efficient PANNs (E-PANNs) model, which requires 36% fewer computations and 70% less memory, also slightly improves the sound recognition (audio tagging) performance. The code for the E-PANNs model has been released under an open source license.

    Comment: Accepted in Internoise 2023 conference
    Keywords Computer Science - Sound ; Computer Science - Artificial Intelligence ; Electrical Engineering and Systems Science - Audio and Speech Processing ; Electrical Engineering and Systems Science - Signal Processing
    Subject code 780
    Publishing date 2023-05-29
    Publishing country us
    Document type Book ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)
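
    The released E-PANNs code is linked from the record above; as a rough illustration of the pruning idea, the sketch below removes the least important output channels of a single convolutional layer by L1 weight norm. It is a generic sketch, not the authors' implementation; the keep ratio and the layer are invented, and in a full network the following layer's input channels would also have to be pruned to match.

        import torch
        import torch.nn as nn

        def prune_conv_channels(conv: nn.Conv2d, keep_ratio: float = 0.64) -> nn.Conv2d:
            """Keep the output channels of `conv` with the largest L1 weight norms."""
            with torch.no_grad():
                importance = conv.weight.abs().sum(dim=(1, 2, 3))   # one score per output channel
                n_keep = max(1, int(keep_ratio * conv.out_channels))
                keep = torch.topk(importance, n_keep).indices.sort().values
                pruned = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                                   stride=conv.stride, padding=conv.padding,
                                   bias=conv.bias is not None)
                pruned.weight.copy_(conv.weight[keep])
                if conv.bias is not None:
                    pruned.bias.copy_(conv.bias[keep])
            return pruned

        # Usage: shrink a layer, then fine-tune the smaller model to recover accuracy.
        layer = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        smaller = prune_conv_channels(layer, keep_ratio=0.64)
        print(smaller.out_channels)   # 81 output channels instead of 128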

  2. Article ; Online: NaturalSpeech: End-to-End Text-to-Speech Synthesis With Human-Level Quality.

    Tan, Xu / Chen, Jiawei / Liu, Haohe / Cong, Jian / Zhang, Chen / Liu, Yanqing / Wang, Xi / Leng, Yichong / Yi, Yuanhao / He, Lei / Zhao, Sheng / Qin, Tao / Soong, Frank / Liu, Tie-Yan

    IEEE transactions on pattern analysis and machine intelligence

    2024  Volume 46, Issue 6, Page(s) 4234–4245

    Abstract Text-to-speech (TTS) has made rapid progress in both academia and industry in recent years. Some questions naturally arise: whether a TTS system can achieve human-level quality, how to define and judge that quality, and how to achieve it. In this paper, we answer these questions by first defining human-level quality based on the statistical significance of a subjective measure and introducing appropriate guidelines to judge it, and then developing a TTS system called NaturalSpeech that achieves human-level quality on benchmark datasets. Specifically, we leverage a variational auto-encoder (VAE) for end-to-end text-to-waveform generation, with several key modules to enhance the capacity of the prior from text and reduce the complexity of the posterior from speech, including phoneme pre-training, differentiable duration modeling, bidirectional prior/posterior modeling, and a memory mechanism in the VAE. Experimental evaluations on the popular LJSpeech dataset show that our proposed NaturalSpeech achieves a CMOS (comparative mean opinion score) of -0.01 relative to human recordings at the sentence level, with a Wilcoxon signed-rank test at p >> 0.05, which demonstrates no statistically significant difference from human recordings for the first time.
    MeSH term(s) Humans ; Algorithms ; Signal Processing, Computer-Assisted ; Speech/physiology ; Natural Language Processing ; Databases, Factual ; Sound Spectrography/methods
    Language English
    Publishing date 2024-05-07
    Publishing country United States
    Document type Journal Article ; Research Support, Non-U.S. Gov't ; Research Support, U.S. Gov't, Non-P.H.S.
    ISSN 1939-3539
    ISSN (online) 1939-3539
    DOI 10.1109/TPAMI.2024.3356232
    Database MEDical Literature Analysis and Retrieval System OnLINE
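
    The core modelling idea above (a VAE whose posterior comes from speech and whose prior is predicted from text) can be illustrated with a deliberately tiny sketch. Everything here, including module sizes, names, and the MSE reconstruction loss, is a toy stand-in and not the NaturalSpeech architecture.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class ToyTextToWaveVAE(nn.Module):
            """Toy VAE: prior from a text embedding, posterior from a speech frame."""
            def __init__(self, text_dim=256, latent_dim=64, wave_dim=400):
                super().__init__()
                self.prior_net = nn.Linear(text_dim, 2 * latent_dim)       # prior p(z | text)
                self.posterior_net = nn.Linear(wave_dim, 2 * latent_dim)   # posterior q(z | speech)
                self.decoder = nn.Linear(latent_dim, wave_dim)             # latent -> waveform frame

            def forward(self, text_emb, wave_frame):
                p_mu, p_logvar = self.prior_net(text_emb).chunk(2, dim=-1)
                q_mu, q_logvar = self.posterior_net(wave_frame).chunk(2, dim=-1)
                z = q_mu + torch.randn_like(q_mu) * (0.5 * q_logvar).exp()  # reparameterization
                recon = self.decoder(z)
                # KL( q(z|speech) || p(z|text) ) for diagonal Gaussians, in closed form
                kl = 0.5 * (p_logvar - q_logvar
                            + (q_logvar.exp() + (q_mu - p_mu) ** 2) / p_logvar.exp() - 1).sum(-1)
                return F.mse_loss(recon, wave_frame) + kl.mean()

        loss = ToyTextToWaveVAE()(torch.randn(8, 256), torch.randn(8, 400))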

  3. Book ; Online: Leveraging Pre-trained AudioLDM for Text to Sound Generation

    Yuan, Yi / Liu, Haohe / Liang, Jinhua / Liu, Xubo / Plumbley, Mark D. / Wang, Wenwu

    A Benchmark Study

    2023  

    Abstract Deep neural networks have recently achieved breakthroughs in sound generation with text prompts. Despite their promising performance, current text-to-sound generation models face issues on small-scale datasets (e.g., overfitting), significantly limiting their performance. In this paper, we investigate the use of pre-trained AudioLDM, the state-of-the-art model for text-to-audio generation, as the backbone for sound generation. Our study demonstrates the advantages of using pre-trained models for text-to-sound generation, especially in data-scarce scenarios. In addition, experiments show that different training strategies (e.g., training conditions) may affect the performance of AudioLDM on datasets of different scales. To facilitate future studies, we also evaluate various text-to-sound generation systems on several frequently used datasets under the same evaluation protocols, which allows fair comparison and benchmarking of these methods on common ground.

    Comment: EUSIPCO 2023
    Keywords Computer Science - Sound ; Computer Science - Artificial Intelligence ; Computer Science - Multimedia ; Electrical Engineering and Systems Science - Audio and Speech Processing
    Subject code 004
    Publishing date 2023-03-07
    Publishing country us
    Document type Book ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)
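
    As a rough sketch of the transfer-learning setup discussed above, one common recipe is to load a pre-trained backbone, freeze most of its parameters, and fine-tune only a small subset on the target data. The loader call and module-name prefix below are placeholders, not the actual AudioLDM API.

        import torch
        import torch.nn as nn

        def prepare_for_finetuning(backbone: nn.Module, trainable_prefixes=("cross_attention",)):
            """Freeze every parameter except those whose names start with a given prefix."""
            for name, param in backbone.named_parameters():
                param.requires_grad = any(name.startswith(p) for p in trainable_prefixes)
            return [p for p in backbone.parameters() if p.requires_grad]

        # backbone = load_pretrained_backbone("audioldm-checkpoint")   # placeholder loader
        # params = prepare_for_finetuning(backbone)
        # optimizer = torch.optim.AdamW(params, lr=1e-5)               # small learning rate for fine-tuning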

  4. Book ; Online: Latent Diffusion Model Based Foley Sound Generation System For DCASE Challenge 2023 Task 7

    Yuan, Yi / Liu, Haohe / Liu, Xubo / Kang, Xiyuan / Plumbley, Mark D. / Wang, Wenwu

    2023  

    Abstract Foley sound provides the background sound for multimedia content, and generating Foley sound involves computationally modelling sound effects with specialized techniques. In this work, we propose a system for DCASE 2023 challenge task 7: Foley Sound Synthesis. The proposed system is based on AudioLDM, a diffusion-based text-to-audio generation model. To alleviate the data-hungry problem, the system is first trained with large-scale datasets and then adapted to this DCASE task via transfer learning. Through experiments, we found that the features extracted by the encoder can significantly affect the performance of the generation model. Hence, we improve the results by leveraging the input label with related text embedding features obtained by a contrastive language-audio pretraining (CLAP) model. In addition, we utilize a filtering strategy to further refine the output, selecting the best results from the candidate clips according to the similarity score between the generated sound and the target label. The overall system achieves a Frechet audio distance (FAD) score of 4.765 on average across all seven classes, substantially outperforming the baseline system's FAD score of 9.7.

    Comment: DCASE 2023 task 7 technical report, ranked 1st in the challenge
    Keywords Computer Science - Sound ; Computer Science - Multimedia ; Electrical Engineering and Systems Science - Audio and Speech Processing
    Subject code 004
    Publishing date 2023-05-25
    Publishing country us
    Document type Book ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)
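
    The candidate-filtering step mentioned above can be sketched as a simple similarity ranking: among several generated clips, keep the one whose audio embedding is closest to the text embedding of the target label. The random tensors below stand in for CLAP embeddings; this is not the challenge submission's code.

        import torch
        import torch.nn.functional as F

        def pick_best_candidate(candidate_embs: torch.Tensor, label_emb: torch.Tensor) -> int:
            """candidate_embs: (num_candidates, dim); label_emb: (dim,). Returns the best index."""
            sims = F.cosine_similarity(candidate_embs, label_emb.unsqueeze(0), dim=-1)
            return int(sims.argmax())

        # Stand-in embeddings for 8 candidate clips and one target label:
        candidates = torch.randn(8, 512)
        label = torch.randn(512)
        print("selected candidate:", pick_best_candidate(candidates, label))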

  5. Book ; Online: Retrieval-Augmented Text-to-Audio Generation

    Yuan, Yi / Liu, Haohe / Liu, Xubo / Huang, Qiushi / Plumbley, Mark D. / Wang, Wenwu

    2023  

    Abstract Despite recent progress in text-to-audio (TTA) generation, we show that the state-of-the-art models, such as AudioLDM, trained on datasets with an imbalanced class distribution, such as AudioCaps, are biased in their generation performance. Specifically, they excel in generating common audio classes while underperforming in the rare ones, thus degrading the overall generation performance. We refer to this problem as long-tailed text-to-audio generation. To address this issue, we propose a simple retrieval-augmented approach for TTA models. Specifically, given an input text prompt, we first leverage a Contrastive Language Audio Pretraining (CLAP) model to retrieve relevant text-audio pairs. The features of the retrieved audio-text data are then used as additional conditions to guide the learning of TTA models. We enhance AudioLDM with our proposed approach and denote the resulting augmented system as Re-AudioLDM. On the AudioCaps dataset, Re-AudioLDM achieves a state-of-the-art Frechet Audio Distance (FAD) of 1.37, outperforming the existing approaches by a large margin. Furthermore, we show that Re-AudioLDM can generate realistic audio for complex scenes, rare audio classes, and even unseen audio types, indicating its potential in TTA tasks.

    Comment: Accepted by ICASSP 2024
    Keywords Computer Science - Sound ; Computer Science - Artificial Intelligence ; Computer Science - Multimedia ; Electrical Engineering and Systems Science - Audio and Speech Processing
    Subject code 420
    Publishing date 2023-09-14
    Publishing country us
    Document type Book ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)
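
    A minimal sketch of the retrieval step described above: embed the input prompt, find its k nearest text-audio pairs in a datastore by cosine similarity, and pass their features to the generator as extra conditions. The random matrices below stand in for CLAP embeddings and are not Re-AudioLDM code.

        import torch
        import torch.nn.functional as F

        def retrieve_top_k(prompt_emb: torch.Tensor, datastore_embs: torch.Tensor, k: int = 3):
            """prompt_emb: (dim,); datastore_embs: (n_pairs, dim). Returns the top-k indices."""
            sims = F.cosine_similarity(datastore_embs, prompt_emb.unsqueeze(0), dim=-1)
            return torch.topk(sims, k).indices

        datastore = F.normalize(torch.randn(1000, 512), dim=-1)   # stand-in for embedded text-audio pairs
        prompt = F.normalize(torch.randn(512), dim=0)
        print(retrieve_top_k(prompt, datastore))                   # indices of the retrieved pairs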

  6. Book ; Online: Channel-wise Subband Input for Better Voice and Accompaniment Separation on High Resolution Music

    Liu, Haohe / Xie, Lei / Wu, Jian / Yang, Geng

    2020  

    Abstract This paper presents a new input format, channel-wise subband input (CWS), for convolutional neural network (CNN) based music source separation (MSS) models in the frequency domain. We aim to address the major issues in CNN-based high-resolution MSS models: high computational cost and weight sharing between distinctly different bands. Specifically, in this paper, we decompose the input mixture spectra into several bands and concatenate them channel-wise as the model input. The proposed approach enables effective weight sharing in each subband and introduces more flexibility between channels. For comparison purposes, we perform voice and accompaniment separation (VAS) on models with different scales, architectures, and CWS settings. Experiments show that the CWS input is beneficial in many aspects. We evaluate our method on the musdb18hq test set, focusing on the SDR, SIR and SAR metrics. Across all our experiments, CWS enables models to obtain a 6.9% performance gain on the averaged metrics. Even with fewer parameters, less training data, and a shorter training time, our MDenseNet with 8-band CWS input still surpasses the original MMDenseNet by a large margin. Moreover, CWS also reduces computational cost and training time to a large extent.

    Comment: Accepted in INTERSPEECH 2020
    Keywords Electrical Engineering and Systems Science - Audio and Speech Processing ; Computer Science - Sound
    Subject code 780 ; 006
    Publishing date 2020-08-12
    Publishing country us
    Document type Book ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)
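
    The channel-wise subband (CWS) input itself is easy to sketch: split the spectrogram's frequency axis into equal bands and stack the bands along the channel axis before feeding the CNN. The band count and tensor shapes below are illustrative choices, not the paper's exact configuration.

        import torch

        def channel_wise_subbands(spec: torch.Tensor, n_bands: int = 8) -> torch.Tensor:
            """spec: (batch, channels, freq, time) with freq divisible by n_bands.
            Returns (batch, channels * n_bands, freq // n_bands, time)."""
            b, c, f, t = spec.shape
            assert f % n_bands == 0, "frequency bins must divide evenly into bands"
            spec = spec.reshape(b, c, n_bands, f // n_bands, t)
            return spec.reshape(b, c * n_bands, f // n_bands, t)

        mixture = torch.randn(4, 2, 2048, 256)          # stereo magnitude spectrogram
        print(channel_wise_subbands(mixture).shape)     # torch.Size([4, 16, 256, 256])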

  7. Book ; Online: Decoupling Magnitude and Phase Estimation with Deep ResUNet for Music Source Separation

    Kong, Qiuqiang / Cao, Yin / Liu, Haohe / Choi, Keunwoo / Wang, Yuxuan

    2021  

    Abstract Deep neural network based methods have been successfully applied to music source separation. They typically learn a mapping from a mixture spectrogram to a set of source spectrograms, all with magnitudes only. This approach has several limitations: 1) its incorrect phase reconstruction degrades the performance, 2) it limits the magnitude of masks to between 0 and 1, while we observe that 22% of time-frequency bins have ideal ratio mask values over 1 in a popular dataset, MUSDB18, 3) its potential on very deep architectures is under-explored. Our proposed system is designed to overcome these. First, we propose to estimate phases by estimating complex ideal ratio masks (cIRMs), where we decouple the estimation of cIRMs into magnitude and phase estimation. Second, we extend the separation method to effectively allow the magnitude of the mask to be larger than 1. Finally, we propose a residual UNet architecture with up to 143 layers. Our proposed system achieves a state-of-the-art MSS result on the MUSDB18 dataset, in particular an SDR of 8.98 dB on vocals, outperforming the previous best performance of 7.24 dB. The source code is available at: https://github.com/bytedance/music_source_separation

    Comment: 6 pages
    Keywords Computer Science - Sound ; Electrical Engineering and Systems Science - Audio and Speech Processing
    Subject code 780
    Publishing date 2021-09-11
    Publishing country us
    Document type Book ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)
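
    The decoupled mask described above can be illustrated in a few lines: a network predicts a mask magnitude (allowed to exceed 1) and a mask phase, which together form a complex mask applied to the mixture STFT. The dummy tensors below stand in for the ResUNet's outputs; see the linked repository for the actual system.

        import torch

        def apply_complex_mask(mix_stft, mask_mag, mask_phase):
            """mix_stft: complex (freq, time); mask_mag >= 0 (may exceed 1); mask_phase in radians."""
            mask = mask_mag * torch.exp(1j * mask_phase)    # complex ideal-ratio-mask estimate
            return mix_stft * mask                          # separated source STFT

        wave = torch.randn(16000)
        window = torch.hann_window(1024)
        mix_stft = torch.stft(wave, n_fft=1024, hop_length=256, window=window, return_complex=True)
        mask_mag = torch.rand_like(mix_stft.real) * 2.0     # dummy magnitude, can exceed 1
        mask_phase = torch.rand_like(mix_stft.real) * 3.14  # dummy phase
        source_stft = apply_complex_mask(mix_stft, mask_mag, mask_phase)
        source = torch.istft(source_stft, n_fft=1024, hop_length=256, window=window, length=wave.shape[0])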

  8. Book ; Online: Simple Pooling Front-ends For Efficient Audio Classification

    Liu, Xubo / Liu, Haohe / Kong, Qiuqiang / Mei, Xinhao / Plumbley, Mark D. / Wang, Wenwu

    2022  

    Abstract Recently, there has been increasing interest in building efficient audio neural networks for on-device scenarios. Most existing approaches are designed to reduce the size of audio neural networks using methods such as model pruning. In this work, we show that instead of reducing model size using complex methods, eliminating the temporal redundancy in the input audio features (e.g., mel-spectrogram) could be an effective approach for efficient audio classification. To do so, we propose a family of simple pooling front-ends (SimPFs) which use simple non-parametric pooling operations to reduce the redundant information within the mel-spectrogram. We perform extensive experiments on four audio classification tasks to evaluate the performance of SimPFs. Experimental results show that SimPFs can reduce the number of floating point operations (FLOPs) of off-the-shelf audio neural networks by more than half, with negligible degradation or even some improvement in audio classification performance.

    Comment: ICASSP 2023
    Keywords Electrical Engineering and Systems Science - Audio and Speech Processing ; Computer Science - Artificial Intelligence ; Computer Science - Sound ; Electrical Engineering and Systems Science - Signal Processing
    Subject code 006
    Publishing date 2022-10-03
    Publishing country us
    Document type Book ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)
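
    A non-parametric pooling front-end in the spirit of SimPFs can be sketched as plain average pooling of the mel-spectrogram along time, so that everything downstream sees fewer frames. The pooling factor and shapes below are illustrative, not the paper's settings.

        import torch
        import torch.nn.functional as F

        def pooling_frontend(mel: torch.Tensor, factor: int = 2) -> torch.Tensor:
            """mel: (batch, n_mels, time). Returns (batch, n_mels, time // factor)."""
            return F.avg_pool1d(mel, kernel_size=factor, stride=factor)

        mel = torch.randn(8, 64, 1000)            # a batch of mel-spectrograms
        reduced = pooling_frontend(mel, factor=2)
        print(reduced.shape)                      # torch.Size([8, 64, 500]); downstream FLOPs roughly halved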

  9. Book ; Online: Segment-level Metric Learning for Few-shot Bioacoustic Event Detection

    Liu, Haohe / Liu, Xubo / Mei, Xinhao / Kong, Qiuqiang / Wang, Wenwu / Plumbley, Mark D.

    2022  

    Abstract Few-shot bioacoustic event detection is a task that detects the occurrence time of a novel sound given a few examples. Previous methods employ metric learning to build a latent space with the labeled part of different sound classes, also known as positive events. In this study, we propose a segment-level few-shot learning framework that utilizes both the positive and negative events during model optimization. Training with negative events, which are larger in volume than positive events, can increase the generalization ability of the model. In addition, we use transductive inference on the validation set during training for better adaptation to novel classes. We conduct ablation studies on our proposed method with different setups of input features, training data, and hyper-parameters. Our final system achieves an F-measure of 62.73 on the DCASE 2022 challenge task 5 (DCASE2022-T5) validation set, outperforming the baseline prototypical network's F-measure of 34.02 by a large margin. Using the proposed method, our submitted system ranks 2nd in DCASE2022-T5. The code for this paper is fully open-sourced at https://github.com/haoheliu/DCASE_2022_Task_5.

    Comment: 2nd place in the DCASE 2022 Challenge Task 5. Submitted to the DCASE 2022 workshop
    Keywords Electrical Engineering and Systems Science - Audio and Speech Processing ; Computer Science - Artificial Intelligence ; Computer Science - Sound ; Electrical Engineering and Systems Science - Signal Processing
    Subject code 006
    Publishing date 2022-07-15
    Publishing country us
    Document type Book ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)
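
    The segment-level metric-learning idea above can be sketched with simple prototypes: average the embeddings of the labelled positive segments and of the more numerous negative segments, then label each query segment by whichever prototype is closer. The random embeddings below stand in for a learned encoder's outputs; the actual system is in the linked repository.

        import torch

        def classify_by_prototype(query, pos_emb, neg_emb):
            """query: (n_query, dim); pos_emb / neg_emb: (n_support, dim). Returns a bool mask of positives."""
            pos_proto = pos_emb.mean(dim=0, keepdim=True)
            neg_proto = neg_emb.mean(dim=0, keepdim=True)
            d_pos = torch.cdist(query, pos_proto).squeeze(-1)   # distance to the positive prototype
            d_neg = torch.cdist(query, neg_proto).squeeze(-1)
            return d_pos < d_neg                                # predicted event segments

        positives = torch.randn(5, 128)    # a few labelled positive segments
        negatives = torch.randn(50, 128)   # abundant negative segments
        queries = torch.randn(20, 128)
        print(classify_by_prototype(queries, positives, negatives))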

  10. Book ; Online: Surrey System for DCASE 2022 Task 5

    Liu, Haohe / Liu, Xubo / Mei, Xinhao / Kong, Qiuqiang / Wang, Wenwu / Plumbley, Mark D.

    Few-shot Bioacoustic Event Detection with Segment-level Metric Learning

    2022  

    Abstract Few-shot audio event detection is a task that detects the occurrence time of a novel sound class given a few examples. In this work, we propose a system based on segment-level metric learning for the DCASE 2022 challenge of few-shot bioacoustic event detection (task 5). We make better use of the negative data within each sound class to build the loss function, and use transductive inference for better adaptation to the evaluation set. For the input feature, we find per-channel energy normalization concatenated with delta mel-frequency cepstral coefficients to be the most effective combination. We also introduce new data augmentation and post-processing procedures for this task. Our final system achieves an F-measure of 68.74 on the DCASE task 5 validation set, outperforming the baseline performance of 29.5 by a large margin. Our system is fully open-sourced at https://github.com/haoheliu/DCASE_2022_Task_5.

    Comment: Technical Report of the system that ranks 2nd in the DCASE Challenge Task 5. arXiv admin note: text overlap with arXiv:2207.07773
    Keywords Computer Science - Sound ; Electrical Engineering and Systems Science - Audio and Speech Processing
    Subject code 004
    Publishing date 2022-07-21
    Publishing country us
    Document type Book ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)
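
    The input feature named above (per-channel energy normalization concatenated with delta MFCCs) can be computed with standard librosa calls, as in the sketch below. Default parameters are used throughout; the submitted system's exact settings may differ.

        import numpy as np
        import librosa

        def pcen_plus_delta_mfcc(y: np.ndarray, sr: int) -> np.ndarray:
            mel = librosa.feature.melspectrogram(y=y, sr=sr)     # (n_mels, frames)
            pcen = librosa.pcen(mel, sr=sr)                      # per-channel energy normalization
            mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)   # (20, frames)
            delta = librosa.feature.delta(mfcc)                  # first-order deltas
            return np.concatenate([pcen, delta], axis=0)         # stacked feature matrix

        y, sr = librosa.load(librosa.example("trumpet"))         # any mono recording works here
        print(pcen_plus_delta_mfcc(y, sr).shape)                 # (n_mels + 20, frames)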
