Helmer Strik | Radboud University Nijmegen
Papers by Helmer Strik
6th International Conference on Spoken Language Processing (ICSLP 2000)
In this paper a Bottom-Up (BU) method of obtaining information about pronunciation variation is proposed. BU transcriptions (Tbu) were obtained by letting a CSR decide for each phone whether it was deleted or not. The Tbu were compared to transcriptions obtained automatically with a Top-Down method, and the agreement appeared to be very high. Subsequently, the Tbu were aligned with canonical reference transcriptions (Tref) and on the basis of this alignment, deletion rules were derived. The BU rules were employed to generate variants which were used in recognition experiments. The results of these recognition experiments show that the information about pronunciation variation obtained using the BU method can be used to improve recognition performance.
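The rule-derivation step described above can be sketched as follows. The phone symbols, the greedy deletion-only alignment, and the (left, phone, right) context format are illustrative assumptions, not details taken from the paper:

```python
from collections import Counter

def deletion_rules(t_ref, t_bu):
    """Collect context-dependent deletion rules (left, phone, right) by
    aligning a canonical transcription (t_ref) with a bottom-up
    transcription (t_bu) that may only delete phones."""
    rules = Counter()
    j = 0
    for i, phone in enumerate(t_ref):
        if j < len(t_bu) and t_bu[j] == phone:
            j += 1  # phone was realized
        else:
            left = t_ref[i - 1] if i > 0 else "#"
            right = t_ref[i + 1] if i + 1 < len(t_ref) else "#"
            rules[(left, phone, right)] += 1  # phone was deleted
    return rules

# Word-final /n/ deleted after schwa, a frequent process in Dutch:
rules = deletion_rules(["l", "o", "p", "@", "n"], ["l", "o", "p", "@"])
print(rules)  # Counter({('@', 'n', '#'): 1})
```

Counting such triples over a whole corpus yields frequency-ranked candidate deletion rules.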
5th International Conference on Spoken Language Processing (ICSLP 1998)
This paper describes how the performance of a continuous speech recognizer for Dutch has been improved by modeling pronunciation variation. We used three methods to model pronunciation variation. First, within-word variation was dealt with. Phonological rules were applied to the words in the lexicon, thus automatically generating pronunciation variants. Secondly, crossword pronunciation variation was modeled using two different approaches. The first approach was to model crossword processes by adding the variants as separate words to the lexicon and in the second approach this was done by using multi-words. For each of the methods, recognition experiments were carried out. A significant improvement was found for modeling within-word variation. Furthermore, modeling crossword processes using multi-words leads to significantly better results than modeling them using separate words in the lexicon.
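Rule-based variant generation of the kind described above can be sketched minimally as follows. The two rules and the space-separated phone notation are made-up examples, not the rule set used in the paper:

```python
import re

# Transcriptions are space-separated phone strings; both rules are
# optional, so each application may add an extra variant.
RULES = [
    (re.compile(r"@ n$"), "@"),  # word-final /n/-deletion after schwa
    (re.compile(r"^h "), ""),    # onset /h/-deletion (illustrative)
]

def variants(canonical):
    """Generate pronunciation variants by optionally applying each rule."""
    forms = {canonical}
    for pattern, repl in RULES:
        forms |= {pattern.sub(repl, f) for f in forms}
    return sorted(forms)

print(variants("l o p @ n"))  # ['l o p @', 'l o p @ n']
```

Every generated variant would then be added to the recognition lexicon alongside the canonical form.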
In this article, the performance of an automatic transcription tool is evaluated. The transcription tool is a continuous speech recognizer (CSR) operating in forced recognition mode. For the evaluation, the performance of the CSR was compared with that of nine expert listeners. Machine and humans performed exactly the same task: deciding whether a segment was present or not in 467 cases. The performance of the CSR turned out to be comparable to that of the experts.
Symposium on Languages, Applications and Technologies, 2015
The DigLin project aims at providing concrete solutions for low-literate and illiterate adults who have to learn a second language (L2). Besides learning the L2, they thus also have to acquire literacy in the L2. To allow intensive practice and feedback in reading aloud, appropriate speech technology is developed for the four targeted languages: Dutch, English, German and Finnish. Since relatively limited resources are available for this application in the four studied languages, this had to be taken into account while developing the speech technology. Exercises with suitable content were developed for the four languages and were tested in four countries: the Netherlands, the United Kingdom, Germany, and Finland. Preliminary results are presented in the paper, and suggestions for future directions are discussed.
In this paper we report on the ASR-based CALL system DISCO: Development and Integration of Speech technology into COurseware for language learning. The DISCO system automatically detects pronunciation and grammar errors in Dutch L2 speaking and generates appropriate, detailed feedback on the errors detected. We briefly introduce DISCO and present the results of a first evaluation of the complete system.
This paper describes how the performance of a continuous speech recognizer for Dutch has been improved by modeling pronunciation variation. We used three methods to model pronunciation variation. First, within-word variation was dealt with. Phonological rules were applied to the words in the lexicon, thus automatically generating pronunciation variants. Secondly, crossword pronunciation variation was accounted for by adding multi-words and their variants to the lexicon. Thirdly, probabilities of pronunciation variants were incorporated in the language model (LM), and thresholds were used to choose which pronunciation variants to add to the LMs. For each of the methods, recognition experiments were carried out. A significant improvement in error rates was measured.
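The thresholding step for selecting which variants enter the language model can be sketched as follows; the variant probabilities and the threshold value are invented for illustration:

```python
def select_variants(variant_probs, threshold=0.1):
    """Keep only the pronunciation variants whose estimated probability
    reaches the threshold; only these are added to the language model."""
    return {word: {v: p for v, p in forms.items() if p >= threshold}
            for word, forms in variant_probs.items()}

probs = {"lopen": {"l o p @ n": 0.70, "l o p @": 0.25, "l o p m": 0.05}}
print(select_variants(probs))  # the 0.05 variant is pruned
```

Raising the threshold trades coverage of rare pronunciations against confusability in the LM.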
This paper describes how the performance of a continuous speech recognizer for Dutch has been improved by modeling within-word and crossword pronunciation variation. A relative improvement of 8.8% in WER was found compared to baseline system performance. However, as WERs do not reveal the full effect of modeling pronunciation variation, we performed a detailed analysis of the differences in recognition results that occur due to modeling pronunciation variation, and found that many of these differences are indeed not reflected in the error rates. Furthermore, error analysis revealed that testing sets of variants in isolation does not predict their behavior in combination. However, these results appeared to be corpus dependent.
The production and perception of L2 vowels are influenced by the L1 vowel system. Most studies on L2 vowel production evaluate the learners' pronunciation using subjective listening tests. In this study we present a novel objective method for investigating learner vowel confusability based on acoustic measurements. Monosyllabic words uttered by Spanish learners of Dutch are analyzed, and basic acoustic features (formant frequencies and duration) are extracted. Native Dutch speakers' measurements are used to obtain models for the Dutch vowels, which are employed to compute likelihood ratios and similarity distributions of the Spanish realizations in comparison to the Dutch target vowels. The likelihood ratios are presented in a matrix format similar to a confusion matrix, crossing the target vowels by the vowels as classified. Results based on spectral features alone confirm the existence of an attractor effect of L1 vowels on L2 vowels. Overall, including duration in the analyses decreases the number of confusions. Comparing the confusion values on different feature sets helps to analyze the impact of the specific features. The results of the present study suggest that although the Spanish learners' use of duration is not native-like, it does help reduce confusability among Dutch vowels.
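The core of such a method can be sketched with one diagonal-covariance Gaussian model per native vowel over (F1, F2, duration). The vowel set, the formant and duration values, and the standard deviations below are invented for illustration, not the measurements used in the study:

```python
import math

def gaussian_loglik(x, means, stds):
    """Log-likelihood of a feature vector under a diagonal Gaussian."""
    return sum(-0.5 * ((xi - m) / s) ** 2 - math.log(s * math.sqrt(2 * math.pi))
               for xi, m, s in zip(x, means, stds))

# Hypothetical native-Dutch models: vowel -> (means, stds) of (F1, F2, duration).
MODELS = {
    "a:": ((795.0, 1301.0, 0.22), (80.0, 120.0, 0.05)),
    "A":  ((679.0, 1051.0, 0.11), (70.0, 110.0, 0.03)),
}

def classify(token):
    """Assign a learner token to the most likely native vowel model;
    accumulating these decisions over tokens fills a confusion-like matrix."""
    return max(MODELS, key=lambda v: gaussian_loglik(token, *MODELS[v]))

# A learner /a:/ produced too short drifts toward /A/ once duration counts:
print(classify((700.0, 1100.0, 0.10)))  # 'A'
```

Dropping the third feature from the vectors gives the spectral-only analysis the abstract contrasts with the spectral-plus-duration one.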
An important objective in health technology is the ability to gather information about people’s well-being. Structured interviews can be used to obtain this information, but are time-consuming and not scalable. Questionnaires provide an alternative way to extract such information, though they typically lack depth. In this paper, we present our first prototype of the BLISS agent, an artificially intelligent agent which aims to automatically discover what makes people happy and healthy. The goal of Behaviour-based Language-Interactive Speaking Systems (BLISS) is to understand the motivations behind people’s happiness by conducting a personalized spoken dialogue based on a happiness model. We built our first prototype of the model to collect 55 spoken dialogues, in which the BLISS agent asked questions to users about their happiness and well-being. Apart from a description of the BLISS architecture, we also provide details about our dataset, which contains over 120 activities and 100 motiv...
Pronunciation variability is present in both native and foreign words. Since pronunciation variability constitutes a problem for automatic speech recognition (ASR) systems, modeling pronunciation variation for ASR has been the topic of various studies. In most studies, modeling pronunciation variation was attempted within the standard framework used in mainstream ASR systems. Given that some assumptions made within this framework are not in line with the properties of speech signals and the findings in human speech recognition, and that the improvements obtained by modeling pronunciation variation within this framework have generally been small, it might be better to look for a new paradigm in which pronunciation variation can be modeled more accurately. In this paper a novel paradigm for ASR is presented, which has many potential advantages for modeling pronunciation variation.
The current proposal is about a completely automatic Transcription Quality Evaluation (TQE) tool. Input is a corpus with audio files and phone transcriptions (PTs). Audio and PTs are aligned, phone boundaries are derived, and for each segment-phone combination it is determined how well they fit together; i.e., for each phone a TQE measure (a confidence measure) is determined, e.g. ranging from 0 to 100%, indicating how good the fit is and thus what the quality of the phone transcription is. The output of the TQE tool will consist of a TQE measure and the segment boundaries for each phone in the corpus. The tool will be useful for validating, obtaining, and selecting phone transcriptions, and for detecting phone strings (e.g. words) with deviating pronunciation; in general, it can be usefully applied in all research in the various (sub-)fields of the humanities and language and speech technology (L&ST) in which audio and PTs are involved. Target Start Date: 01-01-2010 Target End Date: 01-07-2010 Type: ...
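One plausible way to obtain such a per-phone confidence, sketched here as an assumption rather than the tool's actual method, is to normalize the transcribed phone's acoustic log-likelihood against all competing phone models; the log-likelihood values below are invented, where a real tool would obtain them from forced alignment:

```python
import math

def tqe_score(loglik_phone, logliks_all):
    """Map a phone's acoustic log-likelihood to a 0-100% quality score
    by normalizing it over all competing phone models (a posterior)."""
    total = sum(math.exp(l) for l in logliks_all)
    return 100.0 * math.exp(loglik_phone) / total

# Hypothetical per-phone acoustic log-likelihoods for one segment:
logliks = {"a": -2.0, "e": -5.0, "o": -6.0}
print(round(tqe_score(logliks["a"], logliks.values()), 1))  # 93.6
```

A low score would flag the segment-phone pair for manual checking or exclusion.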
The 10th annual conference of LESLLA. Adult Literacy, Second Language and Cognition, 28 August 2014
In this paper we describe the language resources developed within the project Feedback and the Acquisition of Syntax in Oral Proficiency (FASOP), which is aimed at investigating the effectiveness of various forms of practice and feedback on the acquisition of syntax in second language (L2) oral proficiency, as well as their interplay with learner characteristics such as education level, learner motivation and confidence. For this purpose, use is made of a Computer Assisted Language Learning (CALL) system that employs Automatic Speech Recognition (ASR) technology to allow spoken interaction and to create an experimental environment that guarantees as much control over the language learning setting as possible. The focus of the present paper is on the resources that are being produced in FASOP. In line with the theme of this conference, we present the different types of resources developed within this project and the way in which these could be used to pursue innovative research in ...
Interspeech 2013, 2013
In this paper we report on a study on pronunciation errors by Spanish learners of Dutch, which was aimed at obtaining information to develop a dedicated Computer Assisted Pronunciation Training (CAPT) program for this fixed language pair (Spanish L1, Dutch L2). The results of our study indicate that, first, vowel errors are more frequent and variable than consonant mispronunciations. Second, Spanish natives appear to have problems with vowel length, vowel height, and front rounded vowels. Third, they tend to fall back on the pronunciation of their L1 vowels.
Interspeech 2021, 2021
We investigated speech intelligibility in dysarthric and non-dysarthric speakers as measured by two commonly used metrics: ratings through the Visual Analogue Scale (VAS) and word accuracy (AcW) through orthographic transcriptions. To gain a better understanding of how acoustic-phonetic correlates could be employed to obtain more objective measures of speech intelligibility and a better classification of dysarthric and non-dysarthric speakers, we studied the relation between these measures of intelligibility and some important acoustic-phonetic correlates. We found that the two intelligibility measures are related but distinct, and that they might refer to different components of the intelligibility construct. The acoustic-phonetic features showed no difference in mean values between the two speaker types at the utterance level, but more than half of them played a role in classifying the two speaker types. We computed an acoustic-phonetic probability index (API) at the speaker level. API is moderately correlated with VAS ratings but not correlated with AcW. In addition, API and VAS complement each other in classifying dysarthric and non-dysarthric speakers. This suggests that the intelligibility measures assigned by human raters and acoustic-phonetic features relate to different constructs of intelligibility.
Interspeech 2020, 2020
Speech intelligibility is an essential though complex construct in speech pathology. It is affected by multiple contextual variables and it is often measured in different ways. In this paper, we evaluate various measures of speech intelligibility based on orthographic transcriptions, with respect to their reliability and validity. For this study, different speech tasks were analyzed together with their respective perceptual ratings assigned by five experienced speech-language pathologists: a Visual Analogue Scale (VAS) and two types of orthographic transcriptions, one in terms of existing words and the other in terms of perceived segments, including nonsense words. Six subword measures concerning graphemes and phonemes were derived automatically from these transcriptions. All measures exhibit high degrees of reliability. Correlations between the six subword measures and three independent measures, VAS, word accuracy, and severity level, reveal that the measures extracted automatically from the orthographic transcriptions are valid predictors of speech intelligibility. The results also indicate differences between the speech tasks, suggesting that a comprehensive assessment of speech intelligibility requires materials from different speech tasks in combination with measures at different granularity levels: utterance, word, and subword. We discuss these results in relation to those of previous research and suggest possible avenues for future research.
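A subword measure of the kind described can be sketched as a normalized grapheme edit distance between the target word and the orthographic transcription; the measure definition below is a plausible simplification, not the exact formula from the paper:

```python
def levenshtein(a, b):
    """Minimal number of edits (insert, delete, substitute) between two
    symbol sequences, computed by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def grapheme_accuracy(target, perceived):
    """A hypothetical subword measure: 1 minus the normalized grapheme
    edit distance between target word and orthographic transcription."""
    return 1.0 - levenshtein(target, perceived) / max(len(target), len(perceived))

print(grapheme_accuracy("tafel", "tavel"))  # one substitution in five graphemes
```

Running the same computation over phoneme sequences instead of graphemes gives the phoneme-level counterpart.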
SLaTE 2019: 8th ISCA Workshop on Speech and Language Technology in Education, 2019
Although speech intelligibility has been studied in different fields such as speech pathology, language learning, psycholinguistics, and speech synthesis, it is still unclear which concrete speech features most impact intelligibility. Commonly used subjective measures of speech intelligibility based on labour-intensive human ratings are time-consuming and expensive, so objective procedures based on automatically calculated features are needed. In this paper, we investigate possible correlations between a set of objective features and speech intelligibility. Specifically, we study the usability of acoustic features in the eGeMAPS feature set for predicting phoneme intelligibility by using stepwise linear multiple regression analysis. The results showed that the acoustic features are potentially usable for predicting intelligibility. This finding may help to boost the development of automatic procedures to measure speech intelligibility with the underlying relevant acoustic phonetic characteristics. Our analysis also covers the comparison between two speech types (dysarthric and normal), and between two different types of speech material (isolated words and running text). Finally, we discuss possible avenues for future research on speech intelligibility and implications for clinical practice.
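A single forward step of such a stepwise analysis can be sketched as picking the feature most correlated with the intelligibility scores; this is a deliberate simplification of stepwise multiple regression, and the feature names and values are invented, not actual eGeMAPS measurements:

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

def forward_step(features, target):
    """One forward-selection step: pick the acoustic feature whose values
    correlate most strongly (in absolute value) with the target scores."""
    return max(features, key=lambda name: abs(pearson(features[name], target)))

# Hypothetical per-utterance features and intelligibility ratings:
feats = {
    "jitter":   [1.2, 1.4, 2.9, 3.1, 3.0],
    "loudness": [60, 62, 61, 59, 63],
}
intelligibility = [90, 85, 55, 50, 52]
print(forward_step(feats, intelligibility))  # jitter tracks the ratings best
```

A full stepwise procedure would repeat this on the regression residuals and also test whether previously entered features can be removed.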
SLaTE 2019: 8th ISCA Workshop on Speech and Language Technology in Education, 2019
We present an overview of the second edition of the Spoken CALL Shared Task. Groups competed on a prompt-response task using English-language data collected, through an online CALL game, from Swiss German teens in their second and third years of learning English. Each item consists of a written German prompt and an audio file containing a spoken response. The task is to accept linguistically correct responses and reject linguistically incorrect ones, with "linguistically correct" defined by a gold standard derived from human annotations. Scoring was performed using a metric defined as the ratio of the relative rejection rates on incorrect and correct responses. The second edition received eighteen entries and showed very substantial improvement on the first edition; all entries were better than the best entry from the first edition, and the best score was about four times higher. We present the task, the resources, the results, a discussion of the metrics used, and an analysis of what makes items challenging. In particular, we present quantitative evidence suggesting that incorrect responses are much more difficult to process than correct responses, and that the most significant factor in making a response challenging is its distance from the closest training example.
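The scoring metric described above, the ratio of the relative rejection rates on incorrect and correct responses, can be computed directly from a system's accept/reject counts (the counts in the example are invented):

```python
def differential_response(correct_accepted, correct_rejected,
                          incorrect_accepted, incorrect_rejected):
    """Ratio of the relative rejection rate on incorrect responses to the
    relative rejection rate on correct responses; higher is better."""
    reject_incorrect = incorrect_rejected / (incorrect_rejected + incorrect_accepted)
    reject_correct = correct_rejected / (correct_rejected + correct_accepted)
    return reject_incorrect / reject_correct

# A system that rejects 80% of incorrect but only 10% of correct responses:
print(differential_response(90, 10, 20, 80))  # 0.8 / 0.1 = 8.0
```

A system that rejects correct and incorrect responses at the same rate scores 1, so anything above 1 indicates useful discrimination.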
6th International Conference on Spoken Language Processing (ICSLP 2000)
In this paper a Bottom-Up (BU) method of obtaining information about pronunciation variation is p... more In this paper a Bottom-Up (BU) method of obtaining information about pronunciation variation is proposed. BU transcriptions (Tbu) were obtained by letting a CSR decide for each phone whether it was deleted or not. The Tbu were compared to transcriptions obtained automatically with a Top-Down method, and the agreement appeared to be very high. Subsequently, the Tbu were aligned with canonical reference transcriptions (Tref) and on the basis of this alignment, deletion rules were derived. The BU rules were employed to generate variants which were used in recognition experiments. The results of these recognition experiments show that the information about pronunciation variation obtained using the BU method can be used to improve recognition performance.
5th International Conference on Spoken Language Processing (ICSLP 1998)
This paper describes how the performance of a continuous speech recognizer for Dutch has been imp... more This paper describes how the performance of a continuous speech recognizer for Dutch has been improved by modeling pronunciation variation. We used three methods to model pronunciation variation. First, within-word variation was dealt with. Phonological rules were applied to the words in the lexicon, thus automatically generating pronunciation variants. Secondly, crossword pronunciation variation was modeled using two different approaches. The first approach was to model crossword processes by adding the variants as separate words to the lexicon and in the second approach this was done by using multi-words. For each of the methods, recognition experiments were carried out. A significant improvement was found for modeling within-word variation. Furthermore, modeling crossword processes using multi-words leads to significantly better results than modeling them using separate words in the lexicon.
Dans cet article, les performances d'un outil de transcription automatique sont évaluées. L'outil... more Dans cet article, les performances d'un outil de transcription automatique sont évaluées. L'outil de transcription est un reconnaisseur de parole continue (CSR) fonctionnant en mode de reconnaissance forcée. Pour l'évaluation les performances du CSR ont été comparées à celles de neuf auditeurs experts. La machine et l'humain ont effectué exactement la même tâche: décider si un segment était présent ou non dans 467 cas. Il s'est avéré que les performances du CSR étaient comparables à celle des experts.
Symposium on Languages, Applications and Technologies, 2015
The DigLin project aims at providing concrete solutions for low-literate and illiterate adults wh... more The DigLin project aims at providing concrete solutions for low-literate and illiterate adults who have to learn a second language (L2). Besides learning the L2, they thus also have to acquire literacy in the L2. To allow intensive practice and feedback in reading aloud, appropriate speech technology is developed for the four targeted languages: Dutch, English, German and Finnish. Since relatively limited resources are available for this application for the four studied languages, this had to be taken into account while developing the speech technology. Exercises with suitable content were developed for the four languages, and are tested in four countries: Netherlands, United Kingdom, Germany, and Finland. Preliminary results are presented in the paper, and suggestions for future directions are discussed.
This paper describes how the performance of a continuous speech recognizer for Dutch has been imp... more This paper describes how the performance of a continuous speech recognizer for Dutch has been improved by modeling pronunciation variation. We used three methods to model pronunciation variation. First, within-word variation was dealt with. Phonological rules were applied to the words in the lexicon, thus automatically generating pronunciation variants. Secondly, crossword pronunciation variation was modeled using two different approaches. The first approach was to model crossword processes by adding the variants as separate words to the lexicon and in the second approach this was done by using multi-words. For each of the methods, recognition experiments were carried out. A significant improvement was found for modeling within-word variation. Furthermore, modeling crossword processes using multi-words leads to significantly better results than modeling them using separate words in the lexicon.
In this paper we report on the ASR-based CALL system DISCO: Development and Integration of Speech... more In this paper we report on the ASR-based CALL system DISCO: Development and Integration of Speech technology into COurseware for language learning. The DISCO system automatically detects pronunciation and grammar errors in Dutch L2 speaking and generates appropriate, detailed feedback on the errors detected. We briefly introduce DISCO and present the results of a first evaluation of the complete system.
This paper describes how the performance of a continuous speech recognizer for Dutch has been imp... more This paper describes how the performance of a continuous speech recognizer for Dutch has been improved by modeling pronunciation variation. We used three methods in order to model pronunciation variation. First, withinword variation was dealt with. Phonological rules were applied to the words in the lexicon, thus automatically generating pronunciation variants. Secondly, crossword pronunciation variation was accounted for by adding multi-2. METHOD AND MATERIAL words and their variants to the lexicon. Thirdly, probabilities of pronunciation variants were incorporated in the language model (LM), and thresholds were used to choose which pronunciation variants to add to the LMs. For each of the methods, recognition experiments were carried out. A significant improvement in error rates was measured.
This paper describes how the performance of a continuous speech recognizer for Dutch has been imp... more This paper describes how the performance of a continuous speech recognizer for Dutch has been improved by modeling within-word and crossword pronunciation variation. A relative improvement of 8.8% in WER was found compared to baseline system performance. However, as WERs do not reveal the full effect of modeling pronunciation variation, we performed a detailed analysis of the differences in recognition results that occur due to modeling pronunciation variation and found that indeed a lot of the differences in recognition results are not reflected in the error rates. Furthermore, error analysis revealed that testing sets of variants in isolation does not predict their behavior in combination. However, these results appeared to be corpus dependent.
The production and perception of L2 vowels are influenced by the L1 vowel system. Most studies on... more The production and perception of L2 vowels are influenced by the L1 vowel system. Most studies on L2 vowel production evaluate the learners' pronunciation using subjective listening tests. In this study we present a novel objective method for investigating learner vowel confusability based on acoustic measurements. Monosyllabic words uttered by Spanish learners of Dutch are analyzed, and basic acoustic featuresformant frequencies and duration-are extracted. Native Dutch speakers' measurements are used to obtain models for the Dutch vowels, which are employed to compute likelihood ratios and similarity distributions of the Spanish realizations in comparison to the Dutch target vowels. The likelihood ratios are presented in a matrix format similar to a confusion matrix crossing the target vowels by the vowels as classified. Results based on spectral features alone confirm the existence of an attractor effect of L1 vowels on L2 vowels. Overall, including duration in the analyses decreases the number of confusions. Comparing the confusion values on different feature sets helps analyzing the impact of the specific features. The results of the present study suggest that although the Spanish learners' use of duration is not native-like, it does help reduce confusability among Dutch vowels.
An important objective in health-technology is the ability to gather information about people’s w... more An important objective in health-technology is the ability to gather information about people’s well-being. Structured interviews can be used to obtain this information, but are time-consuming and not scalable. Questionnaires provide an alternative way to extract such information, though typically lack depth. In this paper, we present our first prototype of the BLISS agent, an artificial intelligent agent which intends to automatically discover what makes people happy and healthy. The goal of Behaviour-based Language-Interactive Speaking Systems (BLISS) is to understand the motivations behind people’s happiness by conducting a personalized spoken dialogue based on a happiness model. We built our first prototype of the model to collect 55 spoken dialogues, in which the BLISS agent asked questions to users about their happiness and well-being. Apart from a description of the BLISS architecture, we also provide details about our dataset, which contains over 120 activities and 100 motiv...
Pronunciation variability is present in both native and foreign words. Since pronunciation variab... more Pronunciation variability is present in both native and foreign words. Since pronunciation variability constitutes a problem for automatic speech recognition (ASR) systems, modeling pronunciation variation for ASR has been the topic of various studies. In most studies, modeling pronunciation variation was attempted within the standard framework used in mainstream ASR systems. Given that some assumptions made within this framework are not in line with the properties of speech signals and the findings in human speech recognition, and that the improvements obtained by modeling pronunciation variation within this framework have generally been small, it might be better to look for a new paradigm in which pronunciation variation can be modeled more accurately. In this paper a novel paradigm for ASR is presented, which has many potential advantages for modeling pronunciation variation.
The current proposal is about a completely automatic Transcription Quality Evaluation (TQE) tool.... more The current proposal is about a completely automatic Transcription Quality Evaluation (TQE) tool. Input is a corpus with audio files and phone transcriptions (PTs). Audio and PTs are aligned, phone boundaries are derived, and for each segment-phone combination it is determined how well they fit together, i.e. for each phone a TQE measure (a confidence measure) is determined, e.g. ranging from 0-100%, indicating how well the fit is, what the quality of the phone transcription is. The output of the TQE tool will consist of a TQE measure and the segment boundaries for each phone in the corpus. The tool will be useful for validating, obtaining, and selecting phone transcriptions, for detecting phone strings (e.g. words) with deviating pronunciation, and, in general, it can be usefully applied in all research in various (sub-)fields of humanities and language and speech technology (L&ST) in which audio and PTs are involved. Target Start Date: 01-01-2010 Target End Date: 01-07-2010 Type: ...
The 10th annual conference of LESLLA. Adult Literacy, Second Language and Cognition, 28 augustus ... more The 10th annual conference of LESLLA. Adult Literacy, Second Language and Cognition, 28 augustus 2014
In this paper we describe the language resources developed within the project Feedback and the A... more In this paper we describe the language resources developed within the project Feedback and the Acquisition of Syntax in Oral Proficiency (FASOP), which is aimed at investigating the effectiveness of various forms of practice and feedback on the acquisition of syntax in second language (L2) oral proficiency, as well as their interplay with learner characteristics such as education level, learner motivation and confidence. For this purpose, use is made of a Computer Assisted Language Learning (CALL) system that employs Automatic Speech Recognition (ASR) technology to allow spoken interaction and to create an experimental environment that guarantees as much control over the language learning setting as possible. The focus of the present paper is on the resources that are being produced in FASOP. In line with the theme of this conference, we present the different types of resources developed within this project and the way in which these could be used to pursue innovative research in ...
Interspeech 2013, 2013
In this paper we report on a study on pronunciation errors by Spanish learners of Dutch, which was aimed at obtaining information to develop a dedicated Computer Assisted Pronunciation Training (CAPT) program for this fixed language pair (Spanish L1, Dutch L2). The results of our study indicate that, first, vowel errors are more frequent and variable than consonant mispronunciations. Second, Spanish natives appear to have problems with vowel length, vowel height, and front rounded vowels. Third, they tend to fall back on the pronunciation of their L1 vowels.
Interspeech 2021, 2021
We investigated speech intelligibility in dysarthric and non-dysarthric speakers as measured by two commonly used metrics: ratings on the Visual Analogue Scale (VAS) and word accuracy (AcW) derived from orthographic transcriptions. To gain a better understanding of how acoustic-phonetic correlates could be employed to obtain more objective measures of speech intelligibility and a better classification of dysarthric and non-dysarthric speakers, we studied the relation between these measures of intelligibility and some important acoustic-phonetic correlates. We found that the two intelligibility measures are related but distinct, and that they might refer to different components of the intelligibility construct. The acoustic-phonetic features showed no difference in mean values between the two speaker types at the utterance level, but more than half of them played a role in classifying the two speaker types. We computed an acoustic-phonetic probability index (API) at the speaker level. API is moderately correlated with VAS ratings but not with AcW. In addition, API and VAS complement each other in classifying dysarthric and non-dysarthric speakers. This suggests that the intelligibility measures assigned by human raters and the acoustic-phonetic features relate to different constructs of intelligibility.
Interspeech 2020, 2020
Speech intelligibility is an essential though complex construct in speech pathology. It is affected by multiple contextual variables and it is often measured in different ways. In this paper, we evaluate various measures of speech intelligibility based on orthographic transcriptions, with respect to their reliability and validity. For this study, different speech tasks were analyzed together with their respective perceptual ratings assigned by five experienced speech-language pathologists: a Visual Analogue Scale (VAS) and two types of orthographic transcriptions, one in terms of existing words and the other in terms of perceived segments, including nonsense words. Six subword measures concerning graphemes and phonemes were derived automatically from these transcriptions. All measures exhibit high degrees of reliability. Correlations between the six subword measures and three independent measures, VAS, word accuracy, and severity level, reveal that the measures extracted automatically from the orthographic transcriptions are valid predictors of speech intelligibility. The results also indicate differences between the speech tasks, suggesting that a comprehensive assessment of speech intelligibility requires materials from different speech tasks in combination with measures at different granularity levels: utterance, word, and subword. We discuss these results in relation to those of previous research and suggest possible avenues for future research.
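One plausible instance of such an automatically derived subword measure is grapheme-level accuracy: one minus the length-normalised Levenshtein distance between the target word and the listener's orthographic transcription. This sketch is illustrative only; the example words are invented and the paper's six measures are not necessarily defined this way.

```python
def grapheme_accuracy(target, transcribed):
    """A possible subword measure: 1 - normalised Levenshtein distance
    between the target word and the listener's orthographic
    transcription, computed at the grapheme (character) level."""
    m, n = len(target), len(transcribed)
    # Standard single-row dynamic-programming edit distance.
    dist = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dist[0] = dist[0], i
        for j in range(1, n + 1):
            cur = dist[j]
            cost = 0 if target[i - 1] == transcribed[j - 1] else 1
            dist[j] = min(dist[j] + 1,      # deletion
                          dist[j - 1] + 1,  # insertion
                          prev + cost)      # substitution or match
            prev = cur
    return 1.0 - dist[n] / max(m, n)

print(grapheme_accuracy("kapot", "kapot"))  # identical -> 1.0
print(grapheme_accuracy("kapot", "tapot"))  # one substitution -> 0.8
```

A phoneme-level variant would apply the same alignment to phonemized strings instead of raw characters.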
SLaTE 2019: 8th ISCA Workshop on Speech and Language Technology in Education, 2019
Although speech intelligibility has been studied in different fields such as speech pathology, language learning, psycholinguistics, and speech synthesis, it is still unclear which concrete speech features most impact intelligibility. Commonly used subjective measures of speech intelligibility based on labour-intensive human ratings are time-consuming and expensive, so objective procedures based on automatically calculated features are needed. In this paper, we investigate possible correlations between a set of objective features and speech intelligibility. Specifically, we study the usability of acoustic features in the eGeMAPS feature set for predicting phoneme intelligibility by using stepwise linear multiple regression analysis. The results showed that the acoustic features are potentially usable for predicting intelligibility. This finding may help to boost the development of automatic procedures to measure speech intelligibility with the underlying relevant acoustic phonetic characteristics. Our analysis also covers the comparison between two speech types (dysarthric and normal), and between two different types of speech material (isolated words and running text). Finally, we discuss possible avenues for future research on speech intelligibility and implications for clinical practice.
SLaTE 2019: 8th ISCA Workshop on Speech and Language Technology in Education, 2019
We present an overview of the second edition of the Spoken CALL Shared Task. Groups competed on a prompt-response task using English-language data collected, through an online CALL game, from Swiss German teens in their second and third years of learning English. Each item consists of a written German prompt and an audio file containing a spoken response. The task is to accept linguistically correct responses and reject linguistically incorrect ones, with "linguistically correct" defined by a gold standard derived from human annotations. Scoring was performed using a metric defined as the ratio of the relative rejection rates on incorrect and correct responses. The second edition received eighteen entries and showed very substantial improvement on the first edition; all entries were better than the best entry from the first edition, and the best score was about four times higher. We present the task, the resources, the results, a discussion of the metrics used, and an analysis of what makes items challenging. In particular, we present quantitative evidence suggesting that incorrect responses are much more difficult to process than correct responses, and that the most significant factor in making a response challenging is its distance from the closest training example.
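The scoring metric described above, the ratio of the relative rejection rates on incorrect and correct responses, can be sketched directly from a system's confusion counts. This is a literal reading of the description; the shared task's exact definition may include additional details such as smoothing, and the counts below are hypothetical.

```python
def rejection_ratio(correct_accepted, correct_rejected,
                    incorrect_accepted, incorrect_rejected):
    """Ratio of the rejection rate on incorrect responses to the
    (false) rejection rate on correct responses. Higher is better:
    a good system rejects bad answers while accepting good ones."""
    reject_rate_incorrect = incorrect_rejected / (incorrect_accepted + incorrect_rejected)
    reject_rate_correct = correct_rejected / (correct_accepted + correct_rejected)
    return reject_rate_incorrect / reject_rate_correct

# Hypothetical confusion counts for one system.
print(rejection_ratio(correct_accepted=90, correct_rejected=10,
                      incorrect_accepted=20, incorrect_rejected=80))  # -> 8.0
```

Note that a system rejecting everything scores exactly 1.0 (both rejection rates are 100%), so the metric rewards discrimination rather than blanket caution.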