Alexandra Markó | Eötvös Loránd University
Papers by Alexandra Markó
Articulatory studies performed in Hungary date back to the sixties, when different methods were applied for the description of the segment inventory of Hungarian and various other languages (e.g. Russian, German, English, Polish). Palato- and linguography, labiography, and X-ray were used in the analyses of both typical and atypical speech. However, coarticulation, which requires dynamic methods, was not analysed until recently, when suitable tools and methods (electromagnetic articulography, ultrasound tongue imaging and electroglottography) also became available in Hungary. The paper presents an overview of the main issues of articulatory studies on Hungarian in the past and the present. It summarizes the main findings from some studies on gemination and degemination, transparent vowels, and phonatory characteristics of emotion, and gives a few examples of possible future applications.
Glottal marking is well described for adult speakers; however, children's speech has so far been less documented. The present study analysed the appearance of glottal marking in the read-aloud speech of 16 adolescent (16- and 17-year-old) and 16 adult (20- to 45-year-old) speakers (with an equal number of males and females in both age groups). Data were compared across gender and age based on four parameters of frequency of occurrence. The results showed that although glottal marking was in general somewhat less frequent in adolescent speech than in adult speech, and gender-specific differences had not yet appeared, the positional triggers for glottal marking affected the frequency of the phenomenon similarly in the two age groups. The results and further research may contribute to a better understanding of both the appearance of glottal marking and the emergence of gender-specific characteristics of speech.
ArXiv, 2021
Articulatory information has been shown to be effective in improving the performance of HMM-based and DNN-based text-to-speech synthesis. Speech synthesis research focuses traditionally on text-to-speech conversion, where the input is text or an estimated linguistic representation, and the target is synthesized speech. However, a research field that has risen in the last decade is articulation-to-speech synthesis (with a target application of a Silent Speech Interface, SSI), where the goal is to synthesize speech from some representation of the movement of the articulatory organs. In this paper, we extend traditional (vocoder-based) DNN-TTS with articulatory input, estimated from ultrasound tongue images. We compare text-only, ultrasound-only, and combined inputs. Using data from eight speakers, we show that the combined text and articulatory input can have advantages in limited-data scenarios, namely, it may increase the naturalness of synthesized speech compared to single text i...
In our research we aim to examine this allophonic alternation of the laryngeal fricative from a phonetic point of view, in an attempt to shed more light on the phonetic and phonological factors that may facilitate or restrain the occurrence of [ɦ] in Hungarian, and thus to test previous claims of phonology and phonetics on this issue. As a first step, the present study investigated the effect of two vowel quality features, vowel openness and backness, and a phonological conditioner, pitch accent, on the ratio of voicing that occurs in intervocalic /h/ in laboratory speech. As a secondary aim, we also tried to raise questions regarding the very specific type of voice quality this unique fricative exhibits, breathy voice. For this purpose, we also analyzed two more acoustic parameters, center of gravity and the harmonics-to-noise ratio, which are traditionally suggested to reliably and informatively quantify voice quality in fricatives. The results confirmed our hypothesis, that the in...
Introduction. Mothers tend to speak differently to infants than to adults. This register is referred to as motherese or infant-directed speech (IDS), whereas the one used when talking to adults is called adult-directed speech (ADS) [1]. Higher fundamental frequency and slower speech rate are the most typical characteristics of IDS compared to ADS [1, 2]. Data on the voice quality features of IDS are rather sparse, and these features have not been the focus of previous research. Meanwhile, in other areas of speech entrainment studies, voice quality is gaining increasing attention. Its several measurable acoustic parameters make voice quality a useful indicator of speech entrainment [3] and facilitate the identification of positive emotion expressions [4], which are fundamental features of IDS. A recent study has found that on average vowels were produced with more breathy voice quality in Japanese speech directed to 20-month-old infants than in ADS [5]. However, it remains unknown whe...
Articulatory organization of geminates in Hungarian. It is traditionally assumed that geminates undergo degemination when flanked by another consonant in Hungarian. As duration is considered to be the main acoustic cue to the singleton-geminate opposition in Hungarian, it appears valid to study the phonetic implementation of this process in the acoustic domain. However, previous acoustic analyses led to inconclusive results on the status of the “degeminated” consonant, while articulatory data on Japanese singletons and geminates imply that it is revealing to study degemination on the level of gestural timing. The present study compared the gestural organization of geminates, degeminated and singleton consonants in heterorganic C-clusters, and in intervocalic positions. We obtained EMA data from 10 female speakers of Hungarian (mean age 27.7 years). Consonant durations, plateau durations and tongue rise data showed that degemination does not yield realizations equivalent to intervocalic s...
Interspeech 2019, 2019
Recently it was shown that within the Silent Speech Interface (SSI) field, the prediction of F0 is possible from Ultrasound Tongue Images (UTI) as the articulatory input, using Deep Neural Networks for articulatory-to-acoustic mapping. Moreover, text-to-speech synthesizers were shown to produce higher quality speech when using a continuous pitch estimate, which takes non-zero pitch values even when voicing is not present. Therefore, in this paper on UTI-based SSI, we use a simple continuous F0 tracker which does not apply a strict voiced/unvoiced decision. Continuous vocoder parameters (ContF0, Maximum Voiced Frequency and Mel-Generalized Cepstrum) are predicted using a convolutional neural network, with UTI as input. The results demonstrate that during the articulatory-to-acoustic mapping experiments, the continuous F0 is predicted with lower error, and the continuous vocoder produces slightly more natural synthesized speech than the baseline vocoder using standard discontinuous F0.
Interspeech 2018, 2018
Silent Speech Interface systems apply two different strategies to solve the articulatory-to-acoustic conversion task. The recognition-and-synthesis approach applies speech recognition techniques to map the articulatory data to a textual transcript, which is then converted to speech by a conventional text-to-speech system. The direct synthesis approach seeks to convert the articulatory information directly to speech synthesis (vocoder) parameters. In both cases, deep neural networks are an evident and popular choice to learn the mapping task. Recognizing that learning the speech recognition targets and learning the speech synthesis targets (acoustic model states vs. vocoder parameters) are two closely related tasks over the same ultrasound tongue image input, here we experiment with the multi-task training of deep neural networks, which seeks to solve the two tasks simultaneously. Our results show that the parallel learning of the two types of targets is indeed beneficial for both tasks. Moreover, we obtained further improvements by using multi-task training as a weight initialization step before task-specific training. Overall, we report a relative error rate reduction of about 7% in both the speech recognition and the speech synthesis tasks.
In the present paper the realization of vowel clusters in Hungarian speech is analyzed. We focus our attention on cases in which the speaker wishes to highlight, rather than resolve, a hiatus by employing irregular phonation. Glottalization occurred most frequently (31.1%) across word boundaries; sometimes (with a frequency below 10%) it also occurred morpheme-internally or across compound boundaries. Glottalized word transitions were realized mostly at phrase boundaries (stress also influenced the occurrence of glottalization). Another major motivation for a glottalized realization of V(#)V clusters was to avoid the use of some phonological/articulatory mechanism (hiatus resolution or deletion). The frequency of occurrence of glottalization showed a large amount of inter-speaker variance.
7th International Conference on Speech Prosody 2014, May 20, 2014
Speakers tend to mark boundaries of larger prosodic units with glottalization and the deceleration of articulation rate. In the present study, the final parts of Hungarian read and spontaneous utterances were analyzed in the temporal domain (compared to the other parts of the utterances) and in terms of glottalization. We investigated how glottalization and deceleration are related to each other in read and spontaneous speech in Hungarian. We also analyzed whether these phenomena depend on the speech mode. Our results revealed a connection between glottalization and deceleration in spontaneous speech, whereas for read speech no such relation could be detected. Speech modes were also found to differ in the frequency of occurrence of glottalization and in the magnitude of the deceleration at utterance-final positions.
The effects that speakers’ disfluencies have on the listener are rather complex. Speech perception is an incredibly fast process, given that while the mechanism interprets the incoming waveform as a series of linguistic segments and suprasegmentals, it is also continuously ready to receive and correct incoming erroneous messages. The goal of the present experiment was to describe the correction process and determine its efficiency. Various types of disfluency were tested with nine-year-old children, young adults, and older adults. The results show that the time span of the corrective process depends upon the type of disfluency, the context, and the listener’s age. The higher the operational level the production error involves, the more time is required to correct it, and the corrections are poorer than at lower operational levels.
9th International Conference on Speech Prosody 2018, Jun 13, 2018
In the present study three members of the Hungarian vowel inventory (/i/, /u/, /ɒ/) were analysed as a function of prominence, with respect to gender and vowel quality. The theoretically most prominent (stressed and accented) and nonprominent (unstressed and unaccented) realizations were compared in terms of duration, f0, formants, and OQ. The last two of these parameters were analysed systematically for the first time in the study of Hungarian. On duration, there was a significant interaction between the effect of prominence and vowel quality: prominence led to longer duration for the vowels /ɒ/ and /i/, but had no significant effect on /u/. On f0, we found a three-way interaction effect between prominence, vowel quality and gender, due to different patterns observed in males and females in the case of the vowel /i/. Formant analysis based on Euclidean distance from the vowel space centroid did not reveal any significant effect of prominence. The comparison of F1 and F2 values showed considerable differences between the prominence conditions in the case of the second formant of /ɒ/. For OQ, we found different patterns for genders and vowels: prominence led to higher OQ values for women and lower OQ values for men. These between-gender differences were the most pronounced for the vowel /ɒ/.