xianjun xia - Academia.edu
Papers by xianjun xia
This paper introduces the speech synthesis system developed by USTC for Blizzard Challenge 2012. An audiobook speech corpus is adopted as the training data for system construction this year. As in our previous systems, the hidden Markov model (HMM) based unit selection and waveform concatenation approach is followed to develop our speech synthesis system on this corpus. Considering the inconsistent recording conditions and the narrator's expressiveness within the corpus, we add channel- and expressiveness-related labels to each sentence, besides the conventional segmental and prosodic labels, for system construction. The evaluation results of Blizzard Challenge 2012 show that our system performs well in all evaluation tests, which demonstrates the effectiveness of the HMM-based unit selection approach in coping with a non-standard speech synthesis corpus.
arXiv (Cornell University), Jul 16, 2020
In this technical report, we present a joint effort of four groups, namely GT, USTC, Tencent, and UKE, to tackle Task 1, Acoustic Scene Classification (ASC), in the DCASE 2020 Challenge. Task 1 comprises two sub-tasks: (i) Task 1a focuses on ASC of audio signals recorded with multiple (real and simulated) devices into ten fine-grained classes, and (ii) Task 1b concerns classification of data into three higher-level classes using low-complexity solutions. For Task 1a, we propose a novel two-stage ASC system leveraging an ad-hoc score combination of two convolutional neural networks (CNNs), classifying the acoustic input into three classes and then ten classes, respectively. Four different CNN-based architectures are explored to implement the two-stage classifiers, and several data augmentation techniques are also investigated. For Task 1b, we leverage a quantization method to reduce the complexity of two of our top-accuracy three-class CNN-based architectures. On the Task 1a development data set, an ASC accuracy of 76.9% is attained using our best single classifier and data augmentation. An accuracy of 81.9% is then attained by a final model fusion of our two-stage ASC classifiers. On the Task 1b development data set, we achieve an accuracy of 96.7% with a model size smaller than 500 KB.
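The two-stage score combination described above can be sketched in a few lines. The report does not spell out the exact fusion rule, so the multiplicative rescaling, the class names, and the uniform example probabilities below are illustrative assumptions rather than the challenge system itself:

```python
# Map each fine-grained (10-class) label to one of three broad classes
# (an assumed grouping for illustration).
BROAD_OF = {
    "airport": "indoor", "shopping_mall": "indoor", "metro_station": "indoor",
    "park": "outdoor", "public_square": "outdoor", "street_pedestrian": "outdoor",
    "street_traffic": "outdoor",
    "bus": "transport", "metro": "transport", "tram": "transport",
}

def two_stage_scores(broad_probs, fine_probs):
    """Rescale each 10-class probability by its parent 3-class probability,
    then renormalize -- one simple way to combine the two CNN outputs."""
    fused = {c: fine_probs[c] * broad_probs[BROAD_OF[c]] for c in fine_probs}
    total = sum(fused.values())
    return {c: p / total for c, p in fused.items()}

broad = {"indoor": 0.7, "outdoor": 0.2, "transport": 0.1}
fine = {c: 0.1 for c in BROAD_OF}  # uniform second-stage output for illustration
fused = two_stage_scores(broad, fine)
best = max(fused, key=fused.get)
```

With a uniform second-stage output, the first-stage scores alone decide the winner, which shows how the broad classifier steers the fine-grained one.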
ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Target speaker information can be utilized in speech enhancement (SE) models to more effectively extract the desired speech. Previous works introduce the speaker embedding into speech enhancement models by means of concatenation or affine transformation. In this paper, we propose a speaker attentive module that calculates attention scores between the speaker embedding and the intermediate features, which are then used to rescale the features. By merging this module into a state-of-the-art SE model, we construct a personalized SE model for the ICASSP Signal Processing Grand Challenge: DNS Challenge 5 (2023). Our system achieves a final score of 0.529 on the blind test set of Track 1 and 0.549 on Track 2.
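A minimal sketch of such speaker attentive rescaling, assuming a dot-product score squashed by a sigmoid; the paper's actual module, feature shapes, and embedding dimensionality may differ:

```python
import math

def speaker_attentive_rescale(features, spk_emb):
    """Rescale each frame's feature vector by its attention score against
    the speaker embedding (sigmoid of a dot product; illustrative choice)."""
    rescaled = []
    for frame in features:
        score = 1.0 / (1.0 + math.exp(-sum(f * s for f, s in zip(frame, spk_emb))))
        rescaled.append([f * score for f in frame])
    return rescaled

# Frames aligned with the target speaker embedding keep most of their energy;
# frames pointing away from it are attenuated.
frames = [[1.0, 0.0], [-1.0, 0.0]]
emb = [2.0, 0.0]
out = speaker_attentive_rescale(frames, emb)
```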
ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Acoustic echo cancellation is a key issue in hands-free communication systems. In this paper, we propose a hybrid signal processing and deep echo cancellation method, where a two-stage neural network is designed to remove residual echo progressively. For personalized acoustic echo cancellation, we propose to decouple the tasks of echo cancellation and target speech extraction, and introduce a speaker attentive module for personalized separation, where ECAPA-TDNN is used for speaker embedding generation. The proposed method (ByteAudio-18) ranked first on both Track 1 and Track 2 in the ICASSP 2023 AEC Challenge.
ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
For the ICASSP 2023 Speech Signal Improvement Challenge, we developed a dual-stage neural model which improves speech signal quality degraded by different distortions in a stage-wise divide-and-conquer fashion. Specifically, in the first stage, the speech improvement network focuses on recovering the missing components of the spectrum, while in the second stage, our model aims to further suppress noise, reverberation, and artifacts introduced by the first-stage model. Achieving 0.446 in the final score and 0.517 in the P.835 score, our system ranks 4th in the non-real-time track.
ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Beamforming weights prediction via deep neural networks has been one of the main methods in multi-channel speech enhancement tasks. Spectral-spatial cues are crucial in beamforming weights estimation; however, many existing works fail to optimally predict the beamforming weights in the absence of adequate spectral-spatial information learning. To tackle this challenge, we propose a Fourier convolutional attention encoder (FCAE) to provide a global receptive field over the frequency axis and boost the learning of spectral contexts and cross-channel features. In addition, a new convolutional recurrent encoder-decoder (CRED) structure is proposed in this work, which involves FCAEs, attention blocks with skip connections, and a deep feedback sequential memory network (DFSMN) serving as the recurrent module. The proposed CRED structure is exploited to capture the joint spectral-spatial information and obtain an accurate estimation of the beamforming weights. Experimental results demonstrate the superiority of the proposed approach, with only 0.74M parameters and a PESQ improvement from 2.225 to 2.359 on the ConferencingSpeech2021 challenge development test set.
arXiv (Cornell University), Jul 3, 2021
We propose a novel neural model compression strategy combining data augmentation, knowledge transfer, pruning, and quantization for device-robust acoustic scene classification (ASC). Specifically, we tackle the ASC task in a low-resource environment leveraging a recently proposed advanced neural network pruning mechanism, namely the Lottery Ticket Hypothesis (LTH), to find a sub-network neural model with a small number of non-zero model parameters. The effectiveness of LTH for low-complexity acoustic modeling is assessed by investigating various data augmentation and compression schemes, and we report an efficient joint framework for low-complexity multi-device ASC, called Acoustic Lottery. Acoustic Lottery can largely compress an ASC model and attain superior performance (validation accuracy of 79.4% and log loss of 0.64) compared to its uncompressed seed model. All results reported in this work are based on a joint effort of four groups, namely GT-USTC-UKE-Tencent, aiming to address "Low-Complexity Acoustic Scene Classification (ASC) with Multiple Devices" in DCASE 2021 Challenge Task 1a.
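The pruning step behind an LTH-style sub-network search can be illustrated with a simple magnitude mask. This toy function is an assumption for illustration, not the Acoustic Lottery code: it zeroes out the smallest-magnitude fraction of a flat weight list, which is the masking criterion LTH typically uses per pruning round:

```python
def magnitude_prune(weights, sparsity):
    """Return a copy of `weights` with the `sparsity` fraction of
    smallest-magnitude entries set to zero (one LTH pruning round)."""
    k = int(round(len(weights) * sparsity))
    if k == 0:
        return list(weights)
    cutoff = sorted(abs(w) for w in weights)[k - 1]
    pruned, removed = [], 0
    for w in weights:
        if abs(w) <= cutoff and removed < k:
            pruned.append(0.0)   # masked weight
            removed += 1
        else:
            pruned.append(w)     # surviving weight, kept at its value
    return pruned

# Prune 40% of five weights: the two smallest magnitudes are masked.
pruned = magnitude_prune([0.5, -0.1, 0.05, 2.0, -0.01], 0.4)
```

In the full LTH procedure the surviving weights would then be rewound to their initialization values and retrained; that loop is omitted here.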
ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021
To improve device robustness, a highly desirable key feature of a competitive data-driven acoustic scene classification (ASC) system, a novel two-stage system based on fully convolutional neural networks (CNNs) is proposed. Our two-stage system leverages an ad-hoc score combination of two CNN classifiers: (i) the first CNN classifies acoustic inputs into one of three broad classes, and (ii) the second CNN classifies the same inputs into one of ten finer-grained classes. Three different CNN architectures are explored to implement the two-stage classifiers, and a frequency sub-sampling scheme is investigated. Moreover, novel data augmentation schemes for ASC are also investigated. Evaluated on DCASE 2020 Task 1a, our results show that the proposed ASC system attains state-of-the-art accuracy on the development set, where our best system, a two-stage fusion of CNN ensembles, delivers an 81.9% average accuracy on multi-device test data and obtains a significant improvement on unseen devices. Finally, neural saliency analysis with class activation mapping (CAM) gives new insights into the patterns learnt by our models.
2017 IEEE International Conference on Multimedia and Expo (ICME), 2017
This paper deals with random forest regression based acoustic event detection (AED) by combining acoustic features with bottleneck (BN) features. Bottleneck features have a good reputation for being inherently discriminative in acoustic signal processing. To deal with unstructured and complex real-world acoustic events, an acoustic event detection system is constructed using bottleneck features combined with acoustic features. Evaluations were carried out on the UPC-TALP and ITC-Irst databases, which consist of highly variable acoustic events. Experimental results demonstrate the usefulness of the low-dimensional and discriminative bottleneck features, with relative decreases in error rates of 5.33% and 5.51%, respectively.
Lecture Notes in Networks and Systems, 2019
Acoustic event detection aims to perceive the surrounding auditory scene and is popularly performed with multi-label classification based approaches, where the concatenated acoustic features of consecutive frames and the hard boundary labels are adopted as the input and output, respectively. However, the different input frames are treated equally, and the hard boundary based outputs are error-prone. To deal with these issues, this paper proposes to utilize sequential attention together with soft boundary information. Experimental results on the latest TUT Sound Event database demonstrate the superior performance of the proposed technique.
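One way to realize the soft boundary information mentioned above is to replace hard 0/1 activity labels with targets that ramp down near event boundaries. The linear two-frame ramp below is an illustrative assumption; the paper may define its soft labels differently:

```python
def soft_boundary_labels(onset, offset, n_frames, ramp=2):
    """Turn a hard [onset, offset) activity region into soft per-frame
    targets that decay linearly over `ramp` frames outside each boundary."""
    labels = []
    for t in range(n_frames):
        if onset <= t < offset:
            labels.append(1.0)                       # inside the event
        else:
            # linear decay with distance to the nearest boundary frame
            d = min(abs(t - onset), abs(t - (offset - 1)))
            labels.append(max(0.0, 1.0 - d / ramp))
    return labels

# An event active on frames 3..5 of a 10-frame clip:
labels = soft_boundary_labels(3, 6, 10)
```

Frames adjacent to the boundary receive partial credit (0.5 here), so an annotation that is off by a frame no longer produces a fully wrong target.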
ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020
The classification framework has been popularly adopted to perform sound event detection. However, existing neural network based classification approaches treat each feature dimension equally, and the varying influence of feature dimensions has not been taken into consideration. To deal with this, we propose a feature space attention based convolutional recurrent neural network approach that utilizes the varying importance of each feature dimension to perform acoustic event detection. The convolution layers are used to extract high-level information from the audio signals. Then the feature space attention scheme is applied to the extracted features to automatically determine the importance of each feature dimension. Experimental results on the latest TUT Sound Event 2017 dataset demonstrate the improved performance of the proposed approach compared to existing acoustic event detection systems.
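The idea of weighting feature dimensions rather than treating them equally can be sketched without any learned parameters. This toy version (an assumption, not the paper's attention layer) derives a softmax weight per dimension from its average magnitude and rescales the features with it:

```python
import math

def feature_space_attention(frames):
    """Weight each feature dimension by a softmax over its average
    magnitude across frames, then rescale the features (toy sketch)."""
    dims = len(frames[0])
    avg = [sum(abs(f[d]) for f in frames) / len(frames) for d in range(dims)]
    exps = [math.exp(a) for a in avg]
    z = sum(exps)
    weights = [e / z for e in exps]          # one attention weight per dimension
    weighted = [[x * w for x, w in zip(f, weights)] for f in frames]
    return weighted, weights

# Dimension 0 carries more energy than dimension 1, so it gets more weight.
frames = [[1.0, 0.1], [0.8, 0.2]]
weighted, weights = feature_space_attention(frames)
```

In the actual model such weights would come from a trained attention layer applied after the convolutional feature extractor.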
Journal of Crystal Growth, 2019
4H-silicon carbide (SiC) epitaxial layers on C-face SiC substrates are potentially useful for fabricating high-performance power SiC MOSFET devices. In this research, 4H-SiC epilayers are prepared on 4° off-angle C-face 4H-SiC substrates through low-pressure chemical vapor deposition (CVD). Surface morphologies of the epilayers show a strong dependence on the C/Si ratio, growth temperature, and etching time. A specular surface morphology was obtained at a temperature of about 1550 °C, with an etching time of 15 min and a C/Si ratio of 1.2.
IEEE Transactions on Multimedia, 2019
Acoustic event detection deals with acoustic signals to determine the sound type and to estimate the audio event boundaries. Multi-label classification based approaches are commonly used to detect the frame-wise event types, with a median filter applied to determine the active acoustic events. However, the multi-label classifiers are trained only on the acoustic event types, ignoring the frame position within the audio events. To deal with this, this paper proposes to construct a joint learning based multi-task system. The first task performs the acoustic event type detection, and the second task predicts the frame position information. By sharing representations between the two tasks, the acoustic models generalize better than the original classifier, as averaging the respective noise patterns acts as implicit regularization. Experimental results on the monophonic UPC-TALP and the polyphonic TUT Sound Event datasets demonstrate the superior performance of the joint learning method, achieving a lower error rate and higher F-score compared to the baseline AED system.
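The shared-representation structure of such a multi-task system can be sketched as a single shared layer feeding two heads. The linear layers and hand-picked weights below are a toy stand-in for the paper's network, chosen only to make the data flow explicit:

```python
def multitask_forward(x, shared_w, event_w, position_w):
    """One shared linear layer feeding two heads: event-type scores and a
    scalar frame-position estimate (illustrative sketch, no nonlinearity)."""
    # shared representation h = shared_w @ x
    h = [sum(xi * wi for xi, wi in zip(x, row)) for row in shared_w]
    # head 1: event-type scores; head 2: frame-position estimate
    event_scores = [sum(hi * wi for hi, wi in zip(h, row)) for row in event_w]
    position = sum(hi * wi for hi, wi in zip(h, position_w))
    return event_scores, position

# Identity weights keep the arithmetic easy to follow.
scores, pos = multitask_forward([1.0, 2.0],
                                shared_w=[[1.0, 0.0], [0.0, 1.0]],
                                event_w=[[1.0, 0.0], [0.0, 1.0]],
                                position_w=[0.5, 0.5])
```

Because both losses backpropagate through the same `shared_w`, the representation is pulled toward features useful for both tasks, which is the regularization effect the abstract describes.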
Interspeech 2017, 2017
Acoustic event detection, the determination of the acoustic event type and the localisation of the event, has been widely applied in many real-world applications. Many works adopt multi-label classification techniques to perform polyphonic acoustic event detection with a global threshold to detect the active acoustic events. However, the global threshold has to be set manually and is highly dependent on the database being tested. To deal with this, we replace the fixed threshold method with a frame-wise dynamic threshold approach in this paper. Two novel approaches, namely contour based and regressor based dynamic threshold approaches, are proposed in this work. Experimental results on the popular TUT Acoustic Scenes 2016 database of polyphonic events demonstrate the superior performance of the proposed approaches.
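A contour-style dynamic threshold can be sketched as a per-frame cutoff that follows the score profile instead of staying fixed. The fraction-of-the-frame-maximum rule and `alpha = 0.5` below are illustrative assumptions, not the paper's exact formulation:

```python
def contour_threshold_detect(frame_probs, alpha=0.5):
    """Mark a class active in a frame when its probability exceeds a
    per-frame threshold set to `alpha` times that frame's maximum score."""
    active = []
    for probs in frame_probs:
        th = alpha * max(probs.values())        # threshold adapts per frame
        active.append({c for c, p in probs.items() if p >= th})
    return active

frames = [
    {"speech": 0.9, "dog": 0.5, "car": 0.1},    # loud frame: high threshold
    {"speech": 0.2, "dog": 0.05, "car": 0.04},  # quiet frame: low threshold
]
events = contour_threshold_detect(frames)
```

A single global threshold of, say, 0.5 would miss the quiet frame entirely; the adaptive cutoff still recovers its dominant class.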
Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 1995
The HPD is a non-multiplicative light detector with a typical gain of 1000 to 5000. Its development project, mainly supported by the CERN LAA project and by INFN Group V, was originally intended to find a replacement for photomultiplier (PM) tubes for scintillating fibre calorimeter readout. After five years of development, the HPD has become a versatile light detector, commercially available for everyday use, that can outperform PM tubes in photon counting efficiency and resolution, multi-tesla magnetic field operation, uniformity of response, fast-pulse dynamic range, and gain stability. The HPD also has a wide edge over PMs in pixelization potential and is becoming more and more competitive in timing properties. A review of the HPD performance and its latest advances is reported.
2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018
Acoustic event detection, the determination of the acoustic event type and the localisation of the event, has been widely applied in many real-world applications. Many works adopt the multi-label classification technique to perform polyphonic acoustic event detection with a global threshold to detect the active acoustic events. However, the manually labeled boundaries are error-prone and cannot always be accurate, especially when the frame length is too short to be accurately labeled by human annotators. To deal with this, a confidence is assigned to each frame, and acoustic event detection is performed using a multi-variable regression approach in this paper. Experimental results on the latest TUT Sound Event 2017 database of polyphonic events demonstrate the superior performance of the proposed approach compared to the multi-label classification based AED method.
Pattern Recognition, 2018
The variety of event categories and event boundary information have resulted in limited success for acoustic event detection systems. To deal with this, we propose to utilize long contextual information, low-dimensional discriminative global bottleneck features, and category-specific bottleneck features. By concatenating several adjacent frames together, the use of contextual information makes it easier to cope with acoustic signals of long duration. Global and category-specific bottleneck features can capture prior knowledge of the event category and boundary, which is well matched to the task of an event detection system. Evaluations on the UPC-TALP and ITC-IRST databases of highly variable acoustic events demonstrate the effectiveness of the proposed approaches, achieving 5.30% and 4.44% absolute error rate improvements, respectively, compared to the state-of-the-art technique.
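The adjacent-frame concatenation used for long contextual information is a standard context-window stack; the symmetric window and edge padding by repetition below are illustrative choices, since the abstract does not fix a window size:

```python
def stack_context(frames, left=2, right=2):
    """Concatenate each frame with its `left`/`right` neighbours (edges
    padded by repeating the first/last frame) to form a long-context input."""
    n = len(frames)
    out = []
    for t in range(n):
        ctx = []
        for k in range(t - left, t + right + 1):
            ctx.extend(frames[min(max(k, 0), n - 1)])  # clamp index at edges
        out.append(ctx)
    return out

# Three one-dimensional frames with a one-frame context on each side:
stacked = stack_context([[1], [2], [3]], left=1, right=1)
```

Each output vector is `(left + right + 1)` times the frame dimension, which is what the downstream detector would consume.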
This paper introduces the speech synthesis system developed by USTC for Blizzard Challenge 2012. ... more This paper introduces the speech synthesis system developed by USTC for Blizzard Challenge 2012. An audiobook speech corpus is adopted as the training data for system construction this year. Similar to our previous systems, the hidden Markov model (HMM) based unit selection and waveform concatenation approach is followed to develop our speech synthesis system using this corpus. Considering the inconsistent recording conditions and the narrator's expressiveness within the corpus, we add some channel and expressiveness related labels to each sentence besides the conventional segmental and prosodic labels for system construction. The evaluation results of Blizzard Challenge 2012 show that our system performs well in all evaluation tests, which proves the effectiveness of the HMM-based unit selection approach in coping with a non-standard speech synthesis corpus.
arXiv (Cornell University), Jul 16, 2020
In this technical report, we present a joint effort of four groups, namely GT, USTC, Tencent, and... more In this technical report, we present a joint effort of four groups, namely GT, USTC, Tencent, and UKE, to tackle Task 1-Acoustic Scene Classification (ASC) in the DCASE 2020 Challenge. Task 1 comprises two different sub-tasks: (i) Task 1a focuses on ASC of audio signals recorded with multiple (real and simulated) devices into ten different fine-grained classes, and (ii) Task 1b concerns with classification of data into three higher-level classes using lowcomplexity solutions. For Task 1a, we propose a novel two-stage ASC system leveraging upon ad-hoc score combination of two convolutional neural networks (CNNs), classifying the acoustic input according to three classes, and then ten classes, respectively. Four different CNN-based architectures are explored to implement the two-stage classifiers, and several data augmentation techniques are also investigated. For Task 1b, we leverage upon a quantization method to reduce the complexity of two of our top-accuracy threeclasses CNN-based architectures. On Task 1a development data set, an ASC accuracy of 76.9% is attained using our best single classifier and data augmentation. An accuracy of 81.9% is then attained by a final model fusion of our two-stage ASC classifiers. On Task 1b development data set, we achieve an accuracy of 96.7% with a model size smaller than 500KB 1 .
Acoustic event detection, the determination of the acoustic event type and the localisation of th... more Acoustic event detection, the determination of the acoustic event type and the localisation of the event, has been widely applied in many real-world applications. Many works adopt multi-label classification techniques to perform the polyphonic acoustic event detection with a global threshold to detect the active acoustic events. However, the global threshold has to be set manually and is highly dependent on the database being tested. To deal with this, we replaced the fixed threshold method with a frame-wise dynamic threshold approach in this paper. Two novel approaches, namely contour and regressor based dynamic threshold approaches are proposed in this work. Experimental results on the popular TUT Acoustic Scenes 2016 database of polyphonic events demonstrated the superior performance of the proposed approaches.
ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Target speaker information can be utilized in speech enhancement (SE) models to more effectively ... more Target speaker information can be utilized in speech enhancement (SE) models to more effectively extract the desired speech. Previous works introduce the speaker embedding into speech enhancement models by means of concatenation or affine transformation. In this paper, we propose a speaker attentive module to calculate the attention scores between the speaker embedding and the intermediate features, which are used to rescale the features. By merging this module in the state-of-the-art SE model, we construct the personalized SE model for ICASSP Signal Processing Grand Challenge: DNS Challenge 5 (2023). Our system achieves a final score of 0.529 on the blind test set of track1 and 0.549 on track2.
ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Acoustic echo cancellation is a key issue in hand-free communication systems. In this paper, we p... more Acoustic echo cancellation is a key issue in hand-free communication systems. In this paper, we proposed a hybrid signal processing and deep echo cancellation method, where a two-stage neural network is designed to remove residual echo progressively. For the personalized acoustic echo cancellation, we proposed to decouple the tasks of echo cancellation and target speech extraction, and introduced a speaker attentive module for personalized separation, where the ECAPA-TDNN is used for speaker embedding generation. The proposed method (ByteAudio-18) ranked first on both Track 1 and Track 2 in ICASSP 2023 AEC Challenge.
ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
In ICASSP 2023 speech signal improvement challenge, we developed a dual-stage neural model which ... more In ICASSP 2023 speech signal improvement challenge, we developed a dual-stage neural model which improves speech signal quality induced by different distortions in a stage-wise divide-andconquer fashion. Specifically, in the first stage, the speech improvement network focuses on recovering the missing components of the spectrum, while in the second stage, our model aims to further suppress noise, reverberation, and artifacts introduced by the first-stage model. Achieving 0.446 in the final score and 0.517 in the P.835 score, our system ranks 4th in the non-real-time track.
ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Beamforming weights prediction via deep neural networks has been one of the main methods in multi... more Beamforming weights prediction via deep neural networks has been one of the main methods in multi-channel speech enhancement tasks. The spectral-spatial cues are crucial in beamforming weights estimation, however, many existing works fail to optimally predict the beamforming weights with an absence of adequate spectral-spatial information learning. To tackle this challenge, we propose a Fourier convolutional attention encoder (FCAE) to provide a global receptive field over the frequency axis and boost the learning of spectral contexts and cross-channel features. Besides, a new convolutional recurrent encoder-decoder (CRED) structure is proposed in this work, within which FCAEs, attention blocks with skip connections and a deep feedback sequential memory network (DFSMN) serving as recurrent module are involved. The proposed CRED structure is exploited to capture the spectral-spatial joint information to obtain accurate estimation of beamforming weights. Experimental results demonstrate the superiority of the proposed approach with only 0.74M parameters and a PESQ improvement from 2.225 to 2.359 on the ConferencingSpeech2021 challenge development test set.
arXiv (Cornell University), Jul 3, 2021
We propose a novel neural model compression strategy combining data augmentation, knowledge trans... more We propose a novel neural model compression strategy combining data augmentation, knowledge transfer, pruning, and quantization for device-robust acoustic scene classification (ASC). Specifically, we tackle the ASC task in a low-resource environment leveraging a recently proposed advanced neural network pruning mechanism, namely Lottery Ticket Hypothesis (LTH), to find a sub-network neural model associated with a small amount non-zero model parameters. The effectiveness of LTH for low-complexity acoustic modeling is assessed by investigating various data augmentation and compression schemes, and we report an efficient joint framework for low-complexity multi-device ASC, called Acoustic Lottery. Acoustic Lottery could largely compress an ASC model and attain a superior performance (validation accuracy of 79.4% and Log loss of 0.64) compared to its not compressed seed model. All results reported in this work are based on a joint effort of four groups, namely GT-USTC-UKE-Tencent, aiming to address the "Low-Complexity Acoustic Scene Classification (ASC) with Multiple Devices" in the DCASE 2021 Challenge Task 1a.
ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021
To improve device robustness, a highly desirable key feature of a competitive data-driven acousti... more To improve device robustness, a highly desirable key feature of a competitive data-driven acoustic scene classification (ASC) system, a novel two-stage system based on fully convolutional neural networks (CNNs) is proposed. Our two-stage system leverages on an ad-hoc score combination based on two CNN classifiers: (i) the first CNN classifies acoustic inputs into one of three broad classes, and (ii) the second CNN classifies the same inputs into one of ten finergrained classes. Three different CNN architectures are explored to implement the two-stage classifiers, and a frequency sub-sampling scheme is investigated. Moreover, novel data augmentation schemes for ASC are also investigated. Evaluated on DCASE 2020 Task 1a, our results show that the proposed ASC system attains a state-of-theart accuracy on the development set, where our best system, a twostage fusion of CNN ensembles, delivers a 81.9% average accuracy among multi-device test data, and it obtains a significant improvement on unseen devices. Finally, neural saliency analysis with class activation mapping (CAM) gives new insights on the patterns learnt by our models.
2017 IEEE International Conference on Multimedia and Expo (ICME), 2017
This paper deals with random forest regression based acoustic event detection (AED) by combining ... more This paper deals with random forest regression based acoustic event detection (AED) by combining acoustic features with bottleneck features (BN). The bottleneck features have a good reputation of being inherently discriminative in acoustic signal processing. To deal with the unstructured and complex real-world acoustic events, an acoustic event detection system is constructed using bottleneck features combined with acoustic features. Evaluations were carried out on the UPC-TALP and ITC-Irst databases which consist of highly variable acoustic events. Experimental results demonstrate the usefulness of the low-dimensional and discriminative bottleneck features with relative 5.33% and 5.51% decreases in error rates respectively.
Lecture Notes in Networks and Systems, 2019
Acoustic event detection is to perceive the surrounding auditory sound and popularly performed by... more Acoustic event detection is to perceive the surrounding auditory sound and popularly performed by the multi-label classification based approaches. The concatenated acoustic features of consecutive frames and the hard boundary labels are adopted as the input and output respectively. However, the different input frames are treated equally and the hard boundary based outputs are error-prone. To deal with these, this paper proposes to utilize the sequential attention together with the soft boundary information. Experimental results on the latest TUT Sound Event database demonstrate the superior performance of the proposed technique.
ArXiv, 2020
In this technical report, we present a joint effort of four groups, namely GT, USTC, Tencent, and... more In this technical report, we present a joint effort of four groups, namely GT, USTC, Tencent, and UKE, to tackle Task 1 - Acoustic Scene Classification (ASC) in the DCASE 2020 Challenge. Task 1 comprises two different sub-tasks: (i) Task 1a focuses on ASC of audio signals recorded with multiple (real and simulated) devices into ten different fine-grained classes, and (ii) Task 1b concerns with classification of data into three higher-level classes using low-complexity solutions. For Task 1a, we propose a novel two-stage ASC system leveraging upon ad-hoc score combination of two convolutional neural networks (CNNs), classifying the acoustic input according to three classes, and then ten classes, respectively. Four different CNN-based architectures are explored to implement the two-stage classifiers, and several data augmentation techniques are also investigated. For Task 1b, we leverage upon a quantization method to reduce the complexity of two of our top-accuracy three-classes CNN-b...
ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020
The classification framework has been popularly adopted to perform sound event detection. However, existing neural network classification approaches treat each feature dimension equally, and the varying influence of feature dimensions has not been taken into consideration. To deal with this, we propose a feature space attention based convolutional recurrent neural network approach that utilizes the varying importance of each feature dimension to perform acoustic event detection. Convolution layers are used to extract high-level information from the audio signals. The feature space attention scheme is then applied to the extracted features to automatically determine the importance of each feature dimension. Experimental results on the latest TUT Sound Event 2017 dataset demonstrate the improved performance of the proposed approach compared to existing acoustic event detection systems.
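The feature-space attention step can be sketched as a softmax over a per-dimension score vector, applied multiplicatively to the extracted features. In this illustrative NumPy sketch, random values stand in for the CNN output and the learned attention parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

def feature_space_attention(feats, score_vec):
    """Weight each feature dimension by a softmax over per-dimension
    scores, so informative dimensions dominate the representation."""
    e = np.exp(score_vec - score_vec.max())  # stable softmax
    weights = e / e.sum()                    # (feat_dim,), sums to 1
    return feats * weights, weights          # broadcast over all frames

feats = rng.standard_normal((50, 16))        # CNN output: frames x feature dims
score_vec = rng.standard_normal(16)          # would be learned during training
attended, w = feature_space_attention(feats, score_vec)
print(attended.shape)  # (50, 16)
```

Because the weights are normalized across the feature axis rather than the time axis, this attends over feature dimensions, not frames.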
Journal of Crystal Growth, 2019
4H-silicon carbide (SiC) epitaxial layers on C-face SiC substrates are potentially useful for fabricating high-performance power SiC MOSFET devices. In this research, 4H-SiC epilayers are prepared on 4° off-angle C-face 4H-SiC substrates through low-pressure chemical vapor deposition (CVD). Surface morphologies of the epilayers show a strong dependence on C/Si ratio, growth temperature and etching time. A specular surface morphology was obtained at a temperature of about 1550 °C, with an etching time of 15 min and a C/Si ratio of 1.2.
IEEE Transactions on Multimedia, 2019
Acoustic event detection processes acoustic signals to determine the sound type and to estimate the audio event boundaries. Multi-label classification based approaches are commonly used to detect the frame-wise event types, with a median filter applied to determine the occurring acoustic events. However, the multi-label classifiers are trained only on the acoustic event types, ignoring the frame position within the audio events. To deal with this, this paper proposes a joint learning based multi-task system: the first task performs the acoustic event type detection, and the second task predicts the frame position information. By sharing representations between the two tasks, the acoustic models generalize better than the original classifier, since the respective noise patterns average out and act as an implicit regularizer. Experimental results on the monophonic UPC-TALP and the polyphonic TUT Sound Event datasets demonstrate the superior performance of the joint learning method, which achieves a lower error rate and a higher F-score than the baseline AED system.
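The median-filter post-processing mentioned above can be sketched as follows; the window length is an illustrative choice, and real systems apply it per event class.

```python
import numpy as np

def median_smooth(pred, win=5):
    """Smooth a binary frame-wise activity sequence with a running
    median so isolated spurious frames (and short gaps) are removed."""
    half = win // 2
    padded = np.pad(pred, half, mode="edge")
    return np.array([np.median(padded[i:i + win]) for i in range(len(pred))])

raw = np.array([0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0])
smoothed = median_smooth(raw)
print(smoothed)  # the lone spike is removed and the one-frame gap is filled
```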
Interspeech 2017, 2017
Acoustic event detection, the determination of the acoustic event type and the localisation of the event, has been widely applied in many real-world applications. Many works adopt multi-label classification techniques to perform polyphonic acoustic event detection, with a global threshold to detect the active acoustic events. However, the global threshold has to be set manually and is highly dependent on the database being tested. To deal with this, we replace the fixed threshold with a frame-wise dynamic threshold approach in this paper. Two novel approaches, namely contour based and regressor based dynamic thresholding, are proposed in this work. Experimental results on the popular TUT Acoustic Scenes 2016 database of polyphonic events demonstrate the superior performance of the proposed approaches.
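One plausible reading of a contour based frame-wise dynamic threshold is sketched below: derive a per-frame threshold from a smoothed statistic of the score contour instead of one fixed value. The window and ratio are assumptions for illustration, not the paper's formulation.

```python
import numpy as np

def dynamic_threshold(scores, win=7, ratio=0.8):
    """Per-frame threshold = `ratio` times a moving average of the
    maximum class score, so the threshold tracks the score contour."""
    frame_max = scores.max(axis=1)                    # (n_frames,)
    half = win // 2
    padded = np.pad(frame_max, half, mode="edge")
    contour = np.array([padded[i:i + win].mean() for i in range(len(frame_max))])
    return ratio * contour                            # (n_frames,)

rng = np.random.default_rng(2)
scores = rng.random((30, 6))             # frame-wise event posteriors
thr = dynamic_threshold(scores)
active = scores > thr[:, None]           # active where score exceeds the local threshold
print(thr.shape, active.shape)
```

Unlike a global constant, the threshold here adapts frame by frame, so no database-specific value needs to be tuned by hand.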
Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 1995
The HPD is a non-multiplicative light detector with a typical gain of 1000 to 5000. Its development project, mainly supported by the CERN LAA project and by the INFN group V, was originally intended to find a replacement for photomultiplier (PM) tubes for scintillating fibre calorimeter readout. After five years of development, the HPD has become a versatile light detector, commercially available for everyday use, that can outperform PM tubes in photon counting efficiency and resolution, multi-tesla magnetic field operation, uniformity of response, fast-pulse dynamic range, and gain stability. The HPD also has a wide edge over PMs in pixelization potential and is becoming increasingly competitive in timing properties. A review of the HPD performance and its latest advances is reported.
2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018
Acoustic event detection, the determination of the acoustic event type and the localisation of the event, has been widely applied in many real-world applications. Many works adopt the multi-label classification technique to perform polyphonic acoustic event detection with a global threshold to detect the active acoustic events. However, the manually labeled boundaries are error-prone and cannot always be accurate, especially when the frame length is too short to be accurately labeled by human annotators. To deal with this, a confidence is assigned to each frame, and acoustic event detection is performed using a multi-variable regression approach in this paper. Experimental results on the latest TUT Sound Event 2017 database of polyphonic events demonstrate the superior performance of the proposed approach compared to the multi-label classification based AED method.
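The regression-based detection can be sketched with an ordinary least-squares regressor predicting a per-frame confidence that is thresholded at test time. The data here is synthetic and the regressor is a stand-in; the paper's actual model and confidence assignment differ.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic frames: active frames have a shifted feature distribution.
n, d = 200, 12
labels = (rng.random(n) > 0.5).astype(float)
feats = rng.standard_normal((n, d)) + labels[:, None] * 2.0

# Regression targets: here simply the labels; near real annotated
# boundaries they would be down-weighted to reflect label uncertainty.
X = np.hstack([feats, np.ones((n, 1))])          # add a bias column
w, *_ = np.linalg.lstsq(X, labels, rcond=None)   # least-squares regressor

conf = X @ w                  # predicted per-frame confidence
detected = conf > 0.5         # event active where confidence exceeds 0.5
accuracy = (detected == labels.astype(bool)).mean()
print(round(accuracy, 2))
```

Treating the target as a continuous confidence rather than a hard class lets boundary frames carry intermediate values instead of forcing a possibly wrong 0/1 decision into training.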
Pattern Recognition, 2018
The variety of event categories and event boundary information has resulted in limited success for acoustic event detection systems. To deal with this, we propose to utilize long contextual information, low-dimensional discriminative global bottleneck features and category-specific bottleneck features. By concatenating several adjacent frames together, the use of contextual information makes it easier to cope with acoustic signals of long duration. Global and category-specific bottleneck features can extract prior knowledge of the event category and boundary, which is well matched to the task of an event detection system. Evaluations on the UPC-TALP and ITC-IRST databases of highly variable acoustic events demonstrate the effectiveness of the proposed approaches, achieving 5.30% and 4.44% absolute error rate improvements, respectively, compared to the state-of-the-art technique.
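The frame-concatenation step for long contextual information can be sketched as a context window around each frame; the context size and edge padding are illustrative choices.

```python
import numpy as np

def stack_context(feats, left=2, right=2):
    """Concatenate each frame with its `left` and `right` neighbours,
    edge-padding at the sequence boundaries, so every output row holds
    a (left + 1 + right)-frame context window."""
    padded = np.pad(feats, ((left, right), (0, 0)), mode="edge")
    n = len(feats)
    return np.hstack([padded[i:i + n] for i in range(left + right + 1)])

feats = np.arange(20, dtype=float).reshape(10, 2)   # 10 frames, 2 dims each
ctx = stack_context(feats)
print(ctx.shape)  # (10, 10): 5 stacked frames x 2 dims
```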