Marius Cotescu - Academia.edu

Papers by Marius Cotescu

PUB Entry in the Blizzard Challenge 2011

The paper presents the entry of the Politehnica University of Bucharest in this year’s Blizzard Challenge. We present a parametric speech synthesis system based on HTS that pursues two goals: gaining better control over the vocal tract filter, and allowing greater variability in the excitation source features by separating the two processes as much as possible. We propose spectral tilt as a feature of both voiced and unvoiced excitation that can be easily and reliably estimated and extracted from the smoothed spectrum, leaving more consistent data for the vocal tract model. We also address the modelling of the STRAIGHT aperiodicity coefficients in a new manner, which adds detail to the synthetic speech. This was the laboratory's first entry, and both the limited experience and the scarce human resources deployed decisively influenced the results. Index Terms: speech synthesis, spectral tilt, glottal flow, HMM

Enhancement Influence of the Aperiodicity Coefficients in Speech Synthesis

The paper presents a study on improving the quality of parametric synthetic speech using aperiodicity coefficients extracted with the STRAIGHT speech analysis method. To this end, three synthetic voices were built for English, using the ARCTIC_SLT synthesis corpus and the HTS toolkit, illustrating three approaches to generating the sequences of aperiodicity coefficients: one classic and two proposed by the authors. The three voices were evaluated by a group of listeners in order to compare their naturalness and their similarity to a natural voice.

An adaptive lighting system using the simulated annealing algorithm

Within our European project, ALADIN, light is intended to support the elderly and enhance their daily performance. Performance is assessed through activity-specific values of psychophysiological parameters that can be modified by light. This paper presents and discusses the implementation of a light controller using the Simulated Annealing algorithm and analyses the test results obtained by running the system on two subjects.
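As an illustration of the kind of search a Simulated Annealing light controller performs, the following is a minimal generic sketch, not the ALADIN implementation. The cost function, the two normalised lighting parameters, the target values, and the cooling schedule are all hypothetical stand-ins; in a real controller the cost would presumably be derived from measured psychophysiological parameters.

```python
import math
import random


def simulated_annealing(cost, start, neighbour, t_start=1.0, t_end=1e-3,
                        cooling=0.95, steps_per_temp=20, rng=None):
    """Minimise cost(x) with a basic simulated annealing loop."""
    rng = rng or random.Random(0)
    current, current_cost = start, cost(start)
    best, best_cost = current, current_cost
    temp = t_start
    while temp > t_end:
        for _ in range(steps_per_temp):
            candidate = neighbour(current, rng)
            candidate_cost = cost(candidate)
            delta = candidate_cost - current_cost
            # Always accept improvements; accept worse moves with
            # probability exp(-delta / temp) (Metropolis criterion).
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                current, current_cost = candidate, candidate_cost
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        temp *= cooling  # geometric cooling schedule (an assumption)
    return best, best_cost


# Hypothetical optimum for two normalised lighting parameters,
# e.g. (brightness, colour temperature). Not taken from the paper.
TARGET = (0.7, 0.4)


def toy_cost(setting):
    """Squared distance from the assumed optimum (stand-in for a measured response)."""
    return sum((s - t) ** 2 for s, t in zip(setting, TARGET))


def toy_neighbour(setting, rng):
    """Perturb each parameter slightly and clamp it to [0, 1]."""
    return tuple(min(1.0, max(0.0, s + rng.uniform(-0.1, 0.1))) for s in setting)


best_setting, best_cost = simulated_annealing(toy_cost, (0.0, 0.0), toy_neighbour)
```

Accepting occasional worse settings at high temperature lets the search escape local minima, while the falling temperature makes it progressively greedier.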

Intangible Cultural Heritage and New Technologies: Challenges and Opportunities for Cultural Preservation and Development

Mixed Reality and Gamification for Cultural Heritage, 2017

Intangible Cultural Heritage (ICH) is a relatively recent term coined to represent living cultural expressions and practices, which are recognised by communities as distinct aspects of identity. The safeguarding of ICH has become a topic of international concern primarily through the work of UNESCO (United Nations Educational, Scientific and Cultural Organisation). However, little research has been done on the role of new technologies in the preservation and transmission of intangible heritage. The chapter examines resources, projects and technologies providing access to ICH and identifies gaps and constraints. It draws on research conducted within the scope of the collaborative research project, i-Treasures. In so doing, it covers the state of the art in technologies that could be employed for access, capture and analysis of ICH in order to highlight how specific new technologies can contribute to the transmission and safeguarding of ICH.

Voice Conversion for Whispered Speech Synthesis

IEEE Signal Processing Letters

We present an approach to synthesize whisper by applying a handcrafted signal processing recipe and Voice Conversion (VC) techniques to convert normally phonated speech to whispered speech. We investigate using Gaussian Mixture Models (GMM) and Deep Neural Networks (DNN) to model the mapping between acoustic features of normal speech and those of whispered speech. We evaluate naturalness and speaker similarity of the converted whisper on an internal corpus and on the publicly available wTIMIT corpus. We show that applying VC techniques is significantly better than using rule-based signal processing methods and that it achieves results that are indistinguishable from copy-synthesis of natural whisper recordings. We investigate the ability of the DNN model to generalize to unseen speakers when trained with data from multiple speakers. We show that excluding the target speaker from the training set has little or no impact on the perceived naturalness and speaker similarity of the converted whisper. The proposed DNN method is used in the newly released Whisper Mode of Amazon Alexa. Index Terms: whispered speech conversion, voice conversion (VC), whispered text to speech (TTS).

A Multimodal Approach for the Safeguarding and Transmission of Intangible Cultural Heritage: The Case of i-Treasures

IEEE Intelligent Systems

Intangible Cultural Heritage (ICH) creations include, amongst others, music, dance, singing, theatre, human skills, and craftsmanship. These cultural expressions are usually transmitted orally and/or using gestures and are modified over a period of time, through a process of collective recreation. As the world becomes more interconnected and many different cultures come into contact, local communities run the risk of losing important elements of their ICH, while young people find it difficult to maintain the connection with the cultural heritage treasured by their elders. In this paper, we present a novel holistic approach for the safeguarding and transmission of ICH that goes beyond the mere digitization of ICH content. Based on multisensory technology for the capturing of ICH, the proposed approach enables the generation of completely novel cultural content. High-level semantics are extracted from the acquired data, enabling researchers to identify possible implicit or hidden correlations between different ICH expressions or interpretation styles and study the evolution of a specific ICH. These data, coupled with other cultural resources, are accessible through the i-Treasures Web-platform, which provides the means for supporting knowledge exchange between researchers as well as know-how transmission from ICH bearers to apprentices.

Optimal Unit Stitching in a Unit Selection Singing Synthesis System

Interspeech 2016, 2016

Unit Selection based speech synthesis systems are currently the best performing, producing natural sounding speech with minimal CPU load. One of the important reasons behind their success is the amount of recordings that are now commonly used in synthesis applications. However, in the case of singing applications, it is quite hard for a database to cover a large phonetic space due to the relative inefficiency of the recording process. Thus, due to the reduced catalogue of units, singing unit selection systems are more likely to produce spectral discontinuity artefacts. Taking advantage of the quasi-stable nature of articulation during singing, we propose a novel unit stitching method. The method was implemented in the system that was used for the "Fill-In the Gap" Singing Synthesis Challenge.

Stochastic Algorithms for Adaptive Lighting Control using Psycho-Physiological Features

Light has a significant impact on our lives, determining the circadian rhythm, the rhythm of our daily activity. Light is beneficial for healthy people, but it can also be very helpful in treating disease or in enhancing comfort and wellbeing. Within our European project, ALADIN, light is intended to support the elderly and enhance their daily performance. Performance is assessed through activity-specific values of psycho-physiological features that can be modified by light. This paper describes the signal processing techniques deployed for extracting useful features and the algorithms used for developing an adaptive light controller. Two algorithms were used to implement the light controller: Monte Carlo and Simulated Annealing. Experimental results obtained using the Simulated Annealing algorithm are presented.

Using VAEs and Normalizing Flows for One-Shot Text-To-Speech Synthesis of Expressive Speech

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 1, 2020

We propose a Text-to-Speech method to create an unseen expressive style using one utterance of expressive speech of around one second. Specifically, we enhance the disentanglement capabilities of a state-of-the-art sequence-to-sequence based system with a Variational AutoEncoder (VAE) and a Householder Flow. The proposed system provides a 22% KL-divergence reduction while jointly improving perceptual metrics over the state of the art. At synthesis time we use one example of expressive style as a reference input to the encoder for generating any text in the desired style. Perceptual MUSHRA evaluations show that we can create a voice with a 9% relative naturalness improvement over standard Neural Text-to-Speech, while also improving the perceived emotional intensity (59 compared to the 55 of neutral speech).

Quantized Dynamic Time Warping (DTW) algorithm

The DTW algorithm compares the parameters of an unknown spoken word with the parameters of one or more reference templates. The more reference templates are used for the same word, the higher the recognition rate. However, increasing the number of reference templates for the same word leads to an increase in memory resources and computing time. The proposed
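The template matching described above can be sketched with the textbook DTW recurrence. This is a minimal illustration of classic DTW only, not the quantized variant the paper proposes (the abstract is truncated before describing it). The scalar "parameters", the absolute-difference local distance, and the names `dtw_distance` and `recognise` are assumptions made for the example.

```python
def dtw_distance(seq_a, seq_b, dist=lambda x, y: abs(x - y)):
    """Accumulated cost of the best monotonic alignment of two sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # acc[i][j]: minimal accumulated cost aligning seq_a[:i] with seq_b[:j]
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of the three allowed predecessor cells.
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


def recognise(unknown, templates):
    """Return (label, score) of the reference template closest to `unknown`.

    `templates` maps each word label to a list of reference sequences, so
    every extra template costs one more full DTW comparison.
    """
    best_label, best_score = None, float("inf")
    for label, refs in templates.items():
        for ref in refs:
            score = dtw_distance(unknown, ref)
            if score < best_score:
                best_label, best_score = label, score
    return best_label, best_score


# Toy one-dimensional feature tracks standing in for real speech parameters.
templates = {"yes": [[1, 2, 3, 2, 1]], "no": [[3, 3, 1, 1, 3]]}
label, score = recognise([1, 1, 2, 3, 3, 2, 1], templates)
```

Each comparison fills an n by m table, so both memory and computing time grow with the number of stored templates, which is the trade-off the abstract points out.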

Lighting as Support for Enhancing Well-Being, Health and Mental Fitness of an Ageing Population - The FP6 EU Funded ALADIN Project

The paper presents the ALADIN prototype for adaptive lighting control, designed to assist the elderly in achieving a state of well-being and developed as an FP6 EU funded project. It uses psycho-physiological features extracted from Electro-Dermal Activity (EDA) and Pulse signals to determine the subject’s mental state and adapts the lighting parameters in order to achieve a certain desired state. One of the controller implementations was done using Simulated Annealing. Field test evaluations of this implementation are discussed.

Sources of increased variability in HMM synthetic voices

A study on the influence of prosody and excitation source model on synthetic speech

The paper presents a study regarding two methods for improving the naturalness of synthesized speech. We have modeled the excitation source for an LPC vocoder as an impulse train which is passed through a filter to form the excitation signal. The delay between two impulses can be constant, or it can be modulated by the pitch contour extracted from the original utterance. A Glottal Pulse Filter is extracted from the LPC residual so that its frequency response best fits the spectrum of the residual. Four excitation generators were implemented: two unfiltered and two filtered impulse generators. Synthetic speech obtained using the four generators was evaluated and scored by a group of ten people. Festival voices were also evaluated for reference.
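The impulse-train excitation can be illustrated with a short sketch. It is only a rough illustration under stated assumptions, not the paper's generators: the frame length, sample rate, F0 values, and the single LPC coefficient are made up, unvoiced frames are simply left silent, and the Glottal Pulse Filter stage is omitted. A constant contour gives a fixed delay between impulses; a varying contour modulates it.

```python
def impulse_train(f0_contour, sample_rate, frame_len):
    """Build an excitation with one unit impulse per pitch period.

    f0_contour holds one F0 value in Hz per frame; a value of 0 marks an
    unvoiced frame, which is left silent here for simplicity.
    """
    excitation = [0.0] * (len(f0_contour) * frame_len)
    next_pulse = 0.0
    for n in range(len(excitation)):
        f0 = f0_contour[n // frame_len]
        if f0 <= 0:
            next_pulse = n + 1  # restart pulse timing at the next sample
            continue
        if n >= next_pulse:
            excitation[n] = 1.0
            next_pulse += sample_rate / f0  # delay follows the local pitch
    return excitation


def all_pole_filter(excitation, lpc_coeffs):
    """Apply the synthesis filter 1 / (1 + a1 z^-1 + ... + ap z^-p)."""
    out = []
    for n, x in enumerate(excitation):
        y = x
        for k, a in enumerate(lpc_coeffs, start=1):
            if n - k >= 0:
                y -= a * out[n - k]
        out.append(y)
    return out


# Hypothetical contour: two voiced frames at 100 Hz, one unvoiced frame,
# then one voiced frame at 125 Hz (8 kHz sampling, 80-sample frames).
f0 = [100.0, 100.0, 0.0, 125.0]
excitation = impulse_train(f0, sample_rate=8000, frame_len=80)
pulse_positions = [n for n, v in enumerate(excitation) if v == 1.0]

# A single made-up coefficient stands in for a real LPC analysis result.
speech = all_pole_filter(excitation, [-0.9])
```

A filtered generator in the sense of the abstract would insert an additional shaping filter between `impulse_train` and `all_pole_filter`.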
