Speech Processing Research Papers - Academia.edu (original) (raw)

Objective: Discrimination against nonnative speakers is widespread and largely socially acceptable. Nonnative speakers are evaluated negatively because accent is a sign that they belong to an outgroup and because understanding their speech requires unusual effort from listeners. The present research investigated intergroup bias, based on stronger support for hierarchical relations between groups (social dominance orientation [SDO]), as a predictor of hiring recommendations of nonnative speakers. Method: In an online experiment using an adaptation of the thin-slices methodology, 65 U.S. adults (54% women; 80% White; M age = 35.91, range 18–67) heard a recording of a job applicant speaking with an Asian (Mandarin Chinese) or a Latino (Spanish) accent. Participants indicated how likely they would be to recommend hiring the speaker, answered questions about the text, and indicated how difficult it was to understand the applicant. Results: Independent of objective comprehension, participants high in SDO reported that it was more difficult to understand a Latino speaker than an Asian speaker. SDO predicted hiring recommendations of the speakers, but this relationship was mediated by the perception that nonnative speakers were difficult to understand. This effect was stronger for speakers from lower status groups (Latinos relative to Asians) and was not related to objective comprehension. Conclusions: These findings suggest a cycle of prejudice toward nonnative speakers: Not only do perceptions of difficulty in understanding cause prejudice toward them, but also prejudice toward low-status groups can lead to perceived difficulty in understanding members of these groups.

This paper deals with the problem of noise cancellation of speech signals in an acoustic environment. For this task, different adaptive filter algorithms are generally employed, many of which lack flexibility in controlling the convergence rate, the range of variation of the filter coefficients, and the consistency of the error within a tolerance limit. To achieve these desirable attributes as well as to cancel noise effectively, unlike conventional approaches, we formulate the task of noise cancellation as a coefficient optimization problem, for which we introduce and exploit the particle swarm optimization (PSO) algorithm. In this problem, the PSO is designed to perform the error minimization in the frequency domain. Extensive experiments show that the proposed PSO-based acoustic noise cancellation method provides high performance in terms of SNR improvement, with a satisfactory convergence rate, in comparison with some state-of-the-art methods.
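
As a rough, self-contained sketch of the idea (not the authors' implementation), the snippet below uses a basic global-best PSO to search for FIR filter coefficients that minimize the mean-squared residual between a filtered noise reference and the noise observed at a microphone. All signals, swarm sizes, and PSO constants are invented for illustration; the paper performs the minimization in the frequency domain, while this toy works in the time domain for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: recover 4 FIR coefficients that cancel a known noise path.
true_w = np.array([0.6, -0.3, 0.2, 0.1])
x = rng.standard_normal(256)                # noise reference signal
d = np.convolve(x, true_w, mode="same")     # noise as observed at the mic

def mse(w):
    """Residual error energy between filtered reference and observed noise."""
    y = np.convolve(w, x, mode="same") if len(w) > len(x) else np.convolve(x, w, mode="same")
    return np.mean((d - y) ** 2)

# Basic global-best PSO over the coefficient space.
n_particles, n_dims, n_iters = 30, 4, 200
pos = rng.uniform(-1, 1, (n_particles, n_dims))   # candidate coefficient sets
vel = np.zeros_like(pos)
pbest = pos.copy()
pbest_cost = np.array([mse(p) for p in pos])
gbest = pbest[pbest_cost.argmin()].copy()

w_inertia, c1, c2 = 0.7, 1.5, 1.5                 # standard convergent settings
for _ in range(n_iters):
    r1, r2 = rng.random((2, n_particles, n_dims))
    vel = w_inertia * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = pos + vel
    cost = np.array([mse(p) for p in pos])
    improved = cost < pbest_cost
    pbest[improved], pbest_cost[improved] = pos[improved], cost[improved]
    gbest = pbest[pbest_cost.argmin()].copy()
```

After the loop, `gbest` should sit close to `true_w`, driving the residual error toward zero.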

During the last few years, spoken language technologies have improved considerably thanks to Deep Learning. However, Deep Learning-based algorithms require large amounts of data that are often difficult and costly to gather. In particular, modeling the variability in the speech of different speakers, styles, or emotions with little data remains challenging. In this paper, we investigate how to leverage fine-tuning of a pre-trained Deep Learning-based TTS model to synthesize speech with a small dataset of another speaker. We then investigate the possibility of obtaining emotional TTS by fine-tuning the neutral TTS model with a small emotional dataset.

IRCAM has long experience in the analysis, synthesis, and transformation of voice. Natural voice transformations are of great interest for many applications and can be combined with a text-to-speech system, leading to a powerful creation tool. We present research conducted at IRCAM on voice transformations over the last few years. Transformations can be achieved in a global way by modifying pitch, spectral envelope, durations, etc. While it sacrifices the possibility of attaining a specific target voice, the ...

Negotiation is essential in settings where computational agents have conflicting interests and a desire to cooperate. Mechanisms in which agents exchange potential agreements according to various rules of interaction have become very popular in recent years, as evident, for example, in the auction and mechanism design community. These can be seen as models of negotiation in which participants focus on their positions. It is argued, however, that if agents focus instead on the interests behind their positions, they may increase the likelihood and quality of an agreement. In order to achieve that, agents need to argue over each other's goals and beliefs during the process of negotiation. In this paper, we identify concepts that seem essential for supporting this type of dialogue. In particular, we investigate the types of arguments agents may exchange about each other's interests, and we begin an analysis of dialogue moves involving goals.

Automatic verification of a person's identity from their voice is part of modern telecommunication services. To execute a verification task, the speech signal has to be transmitted to a remote server, so the performance of the verification system can be affected by the various distortions that occur when a speech signal passes through a communication channel. This paper studies the effect of state-of-the-art wideband (WB) speech codecs on the performance of automatic speaker verification in the context of a channel/codec mismatch between enrollment and test utterances. The speaker verification system is based on the GMM-UBM method. The results show that the EVS codec provides the best performance across all scenarios investigated in this study. Moreover, deploying the G.729.1 codec when training the verification system yields the best equal error rate in the fully codec-mismatched scenario. However, the differences between the equal error rates reported for the codecs involved in this scenario are mostly nonsignificant.
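
Verification performance here is reported as the equal error rate (EER), the operating point where the false-accept and false-reject rates meet. As a small illustration (the scores below are invented toy values, not results from the paper), the function scans candidate thresholds over lists of genuine and impostor scores:

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """EER: threshold where false-accept and false-reject rates are closest.

    `genuine` holds scores for matching-speaker trials, `impostor` for
    non-matching trials; higher scores mean "accept".
    """
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.asarray(impostor, dtype=float)
    best_gap, eer = np.inf, 1.0
    for th in np.unique(np.concatenate([genuine, impostor])):
        far = np.mean(impostor >= th)   # false accepts at this threshold
        frr = np.mean(genuine < th)     # false rejects at this threshold
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Toy scores: overlapping distributions give a nonzero EER.
eer = equal_error_rate([2.0, 3.0, 4.0], [0.0, 1.0, 2.5])
```

A lower EER means better verification; a codec that distorts speaker-discriminative information pushes the two score distributions together and raises the EER.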

Noise reduction of speech signals plays an important role in telecommunication systems. Various types of additive noise, such as babble, crowd, large-city, and highway noise, can be introduced; these are the main factors degrading perceived speech quality. In some cases on the receiver side of telecommunication systems, the interfering noise itself is not directly available and only the noisy speech is accessible. In these cases the noise cannot be cancelled entirely, but it may be possible to reduce it in a sensible way by exploiting the statistics of the noise and speech signals. In this paper the proposed method for noise reduction is a Bayesian recursive state-space Kalman filter, which estimates the speech signal from its noisy version. It utilizes the prior probability distributions of the signal and noise processes, which are assumed to be zero-mean Gaussian processes. The performance of the Kalman filter is assessed for different types of noise, including babble, crowd, large-city, and highway noise. Noise cancellation is implemented for each of the aforementioned noises, with noise powers varying over a range of values. This method of noise reduction yields better perceived speech quality and more efficient results compared with the Wiener filter.
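
The recursive state-space Kalman filter can be illustrated on a toy scalar model, with the clean signal taken as a zero-mean Gaussian AR(1) process observed in white Gaussian noise. This is a minimal sketch under invented parameters, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy state-space model: s(t) = a*s(t-1) + w(t),  y(t) = s(t) + v(t),
# with w and v zero-mean Gaussian (variances q and r).
a, q, r, n = 0.95, 0.1, 1.0, 2000
s = np.zeros(n)
for t in range(1, n):
    s[t] = a * s[t - 1] + np.sqrt(q) * rng.standard_normal()
y = s + np.sqrt(r) * rng.standard_normal(n)   # noisy observation

# Recursive Kalman filter: predict from the signal model, then correct
# with the Kalman gain weighting the new observation.
s_hat = np.zeros(n)
p = 1.0                                       # estimate-error variance
for t in range(1, n):
    s_pred = a * s_hat[t - 1]                 # predict
    p_pred = a * a * p + q
    k = p_pred / (p_pred + r)                 # Kalman gain
    s_hat[t] = s_pred + k * (y[t] - s_pred)   # correct
    p = (1 - k) * p_pred

mse_noisy = np.mean((y - s) ** 2)             # error before filtering
mse_filt = np.mean((s_hat - s) ** 2)          # error after filtering
```

With these parameters the filtered estimate should cut the error energy substantially relative to the raw noisy observation.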

The basic theory of Markov chains has been known to mathematicians and engineers for close to 80 years, but it is only in the past decade that it has been applied explicitly to problems in speech processing. One of the major reasons why speech models based on Markov chains were not developed until recently was the lack of a method for optimizing the parameters of the Markov model to match observed signal patterns. Such a method was proposed in the late 1960s and was immediately applied to speech processing in several research institutions. Continued refinements in the theory and implementation of Markov modelling techniques have greatly enhanced the method, leading to a wide range of applications of these models. It is the purpose of this tutorial paper to give an introduction to the theory of Markov models, and to illustrate how they have been applied to problems in speech recognition.
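
As a minimal illustration of Markov-model evaluation (the first of the classic hidden-Markov-model problems), the snippet below computes the likelihood of an observation sequence under a small discrete HMM using the forward algorithm; the model parameters and observations are invented for the example:

```python
import numpy as np

# A 2-state discrete HMM with 3 observation symbols (illustrative numbers).
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])        # state transition probabilities A[i, j]
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])   # emission probabilities B[state, symbol]
pi = np.array([0.6, 0.4])         # initial state distribution

def forward_likelihood(obs):
    """P(observation sequence | model) via the forward recursion.

    alpha[i] holds the probability of the prefix seen so far and being
    in state i; each step sums over predecessor states and multiplies
    by the emission probability of the next symbol.
    """
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

p = forward_likelihood([0, 1, 2])
```

The recursion costs O(N^2 T) for N states and T observations, versus O(N^T) for naive enumeration over state paths, which is what makes Markov-model evaluation practical for speech.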

In this paper, the AM–FM modulation model is applied to speech analysis, synthesis and coding. The AM–FM model represents the speech signal as the sum of formant resonance signals each of which contains amplitude and frequency modulation. Multiband filtering and demodulation using the energy separation algorithm are the basic tools used for speech analysis. First, multiband demodulation analysis (MDA) is
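
The energy separation mentioned above builds on the Teager-Kaiser energy operator. The sketch below implements one standard discrete variant (DESA-2; this specific choice is an assumption, since the abstract is truncated before detailing the algorithm) and checks it on a pure sinusoid, where it recovers the instantaneous amplitude and frequency:

```python
import numpy as np

def teager(x):
    """Teager-Kaiser energy operator: psi[x](n) = x(n)^2 - x(n-1)*x(n+1)."""
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def desa2(x):
    """DESA-2 energy separation: instantaneous amplitude and frequency.

    Uses psi applied to the signal and to its symmetric difference
    x(n+1) - x(n-1); omega is in radians per sample.
    """
    psi_x = teager(x)              # supported on samples 1..N-2
    z = x[2:] - x[:-2]             # x(n+1) - x(n-1), same support as psi_x
    psi_z = teager(z)              # supported on samples 2..N-3
    psi_x = psi_x[1:-1]            # trim to the common support
    omega = 0.5 * np.arccos(1.0 - psi_z / (2.0 * psi_x))
    amp = 2.0 * psi_x / np.sqrt(psi_z)
    return amp, omega

# Check on a pure "formant" sinusoid of known amplitude and frequency.
n = np.arange(400)
x = 1.5 * np.cos(0.3 * n + 0.7)
amp, omega = desa2(x)
```

For a constant-parameter sinusoid the operator identities hold exactly, so `amp` and `omega` come out flat at 1.5 and 0.3; on real speech, each bandpass-filtered formant band yields slowly varying AM and FM tracks instead.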

An observation regarding the nature of the spectrum of a windowed swept-frequency sinusoid is exploited to time-scale (stretch or compress) time-variant sinusoids within the window or frame of an otherwise basic phase vocoder process. Nonstationary sinusoids are more closely represented as a series of windowed linearly swept sinusoids than as a series of windowed constant-frequency sinusoids. Both the frequency sweep rate and the amplitude ramp rate can be identified and modified according to the time-scale ratio. The mathematics behind this concept is developed in the continuous-time domain, whereas a MATLAB program demonstrating the concept works in discrete time. An example using a swept and ramped sinusoid shows that there are visible differences between using this correction and not using it. These differences predictably become more pronounced as the rate of change of frequency or amplitude increases.
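
The sweep-rate scaling can be checked numerically: stretching a linear chirp by a factor equal to the time-scale ratio, while preserving its pitch trajectory, requires dividing the sweep rate by that ratio. The sketch below (invented parameters; a numpy-only analytic signal stands in for the paper's phase-vocoder machinery) compares instantaneous frequencies of an original and a time-scaled chirp:

```python
import numpy as np

fs = 8000                        # sample rate in Hz (illustrative)
f0, sweep = 500.0, 800.0         # start frequency (Hz) and sweep rate (Hz/s)
ratio = 2.0                      # time-scale ratio (2x stretch)

def chirp(duration, start_hz, sweep_rate):
    """Linearly swept sinusoid cos(2*pi*(f0*t + sweep*t^2/2))."""
    t = np.arange(int(duration * fs)) / fs
    return np.cos(2 * np.pi * (start_hz * t + 0.5 * sweep_rate * t ** 2))

def inst_freq(x):
    """Instantaneous frequency (Hz) from the analytic signal's phase."""
    n = len(x)
    h = np.zeros(n)
    h[0], h[1:(n + 1) // 2] = 1.0, 2.0   # FFT-domain Hilbert weighting
    if n % 2 == 0:
        h[n // 2] = 1.0
    a = np.fft.ifft(np.fft.fft(x) * h)   # numpy-only analytic signal
    return np.diff(np.unwrap(np.angle(a))) * fs / (2 * np.pi)

# Stretched version: same frequency trajectory traversed `ratio` times
# slower, so the sweep rate is divided by the time-scale ratio.
x = chirp(1.0, f0, sweep)
y = chirp(ratio, f0, sweep / ratio)
fx, fy = inst_freq(x), inst_freq(y)

# Frequency at original time t should match stretched time ratio*t.
t = 0.5
f_orig, f_stretched = fx[int(t * fs)], fy[int(ratio * t * fs)]
```

Both measured frequencies should sit near f0 + sweep*t = 900 Hz, confirming that scaling the sweep rate by the time-scale ratio preserves the frequency trajectory.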

Abstract: Speech processing has undergone tremendous development in recent years, driven by technological advances in digital signal processing components and the growing digitization of networks. This article provides an analysis of the main techniques that have recently become established in the fields of speech coding, recognition, and synthesis. In

In this paper, we present our work on the analysis and classification of smiled vowels, chuckling (or shaking) vowels, and laughter syllables. This work is part of a larger framework that aims at assessing the level of amusement in speech using the audio modality only. Indeed, all three categories occur in amused speech and are considered to contribute to the expression of different levels of amusement. We first analyze these three amused-speech components at the acoustic level. Then, we improve a classification system we previously developed. With a limited amount of data and features, we obtain good classification results with different systems. Among the compared systems, the best achieved 82.8% accuracy, well above chance.

Over the past couple of decades, research has established that infants are sensitive to the predominant stress pattern of their native language. However, the degree to which the stress pattern shapes infants' language development has yet to be fully determined. Whether stress is merely a cue to help organize the patterns of speech or whether it is an important part of the representation of speech sound sequences has still to be explored. Building on research in the areas of infant speech perception and segmentation, we asked how several months of exposure to the target language shapes infants' speech processing biases with respect to lexical stress. We hypothesized that infants represent stressed and unstressed syllables differently, and employed analyses of child-directed speech to show how this change to the representational landscape results in better distribution-based word segmentation as well as an advantage for stress-initial syllable sequences. A series of experiments then tested 9- and 7-month-old infants on their ability to use lexical stress without any other cues present to parse sequences from an artificial language. We found that infants adopted a stress-initial syllable strategy and that they appear to encode stress information as part of their proto-lexical representations. Together, the results of these studies suggest that stress information in the ambient language not only shapes how statistics are calculated over the speech input, but that it is also encoded in the representations of parsed speech sequences.

The glottal source is an important component of voice, as it can be considered the excitation signal to the voice apparatus. Nowadays, new speech processing techniques such as speech recognition and speech synthesis use the glottal closure and opening instants. Current models of the glottal waves derive their shape from approximate information rather than from exactly measured data. Conventional methods concentrate on assessing the glottis opening using optical or acoustical methods, or on visualizing the larynx using ultrasound, computed tomography, or magnetic resonance imaging. In this work, a circuit model of the human glottis using MOS transistors is designed by mapping fluid volume velocity to current, fluid pressure to voltage, and linear and nonlinear mechanical impedances to linear and nonlinear electrical impedances. The glottis, modeled as a current source, includes linear and nonlinear impedances to represent laminar and turbulent flow, respectively, in the vocal tract. The MOS modelling and simulation of the glottal circuit were carried out with the BSIM 3v3 model in TSMC 0.18 µm technology using the ELDO simulator.