Jonas Lindh | University of Gothenburg
Papers by Jonas Lindh
This thesis has three main objectives. The first objective (A) includes Study I, which investigates the parameter fundamental frequency (F0) and its robustness in different acoustic contexts by using different measures. The study concludes that using the alternative baseline as a measure will diminish the effect of low-quality recordings or varying speaking liveliness. However, both creaky voice and raised vocal effort induce intra-variation problems that are yet to be solved. The second objective (B) includes Studies II, III and IV. Study II investigates the differences between the results from an ear witness line-up experiment and the pairwise perceptual judgments of voice similarity performed by a large group of listeners. The study shows that humans seem to be much more focused on similarities of speech style than features connected to voice quality, even when recordings are played backwards. Study III investigates the differences between an automatic voice comparison system and...
The most successful methods for inducing emotions in state-of-the-art unit selection speech synthesis have been built by switching speech databases depending on the desired emotion. These methods require a substantial increase in memory compared to a single database and are computationally slow. The model-based approach is an attempt to reshape a neutrally recorded utterance (comparable to the desired output from a modern unit selection system) so that it simulates a recorded model of a desired emotion. Factors for manipulation of duration, amplitude and formant shift ratio are calculated by comparing the recorded neutral utterance with
This volume contains the contributions to FONETIK 2005, the Eighteenth Swedish Phonetics Conference, organized by the Phonetics group at Göteborg University on May 25–27, 2005. The papers appear in the order they were presented at the conference. Only a limited number of copies of this publication have been printed for distribution among the authors and those attending the conference. For access to electronic versions of the contributions, please look under:
Ph.D. dissertation at University of Gothenburg, Sweden, 2017. Title: Forensic Comparison of Voices, Speech and Speakers. Author: Jonas Lindh. Language: English, with a Swedish summary. Department: Department of Philosophy, Linguistics and Theory of Science, University of Gothenburg, Box 200, SE-405 30 Göteborg. ISBN: 978-91-629-0142-4 (digital). ISBN: 978-91-629-0141-7 (print). This thesis has three main objectives. The first objective (A) includes Study I, which investigates the parameter fundamental frequency (F0) and its robustness in different acoustic contexts by using different measures. The study concludes that using the alternative baseline as a measure will diminish the effect of low-quality recordings or varying speaking liveliness. However, both creaky voice and raised vocal effort induce intra-variation problems that are yet to be solved. The second objective (B) includes Studies II, III and IV. Study II investigates the differences between the results from an ear witness ...
Text-independent speaker verification can be a useful tool as a substitute for passwords or as an added security check. The tool can also be used in forensic phonetic casework. A text-independent speaker verification Praat plug-in was created using tools from the open source Mistral/Alize toolkit. A gatekeeper setup was created for 13 department employees and tested for verification. Two different universal background models were trained, and the same set was tested and evaluated. The results are promising and indicate that such a tool could be useful in research on voice quality.
IEEE Signal Processing Letters, 2017
Automatic alignment of text and sound is of great help and saves a lot of time when labelling speech databases, either for research or for developing speech technology tools such as automatic speech recognition or text-to-speech systems. It is also a very useful tool in forensic speaker identification, as one often receives a tapped recording together with an orthographic transcription. The orthographic transcription can be used together with the sound file to provide information about where in the recording significant events occur to a greater or lesser extent. Even if the alignment is sometimes not perfect, it replaces some of the time-consuming manual labelling. To perform automatic alignment, common speech recognition techniques are applied at various levels. In this case, a framework for automatic alignment, called EasyAlign, was developed for the free software Praat (Goldman, 2007). Praat is distributed as open source software under a GPL license. On top of the source code a bui...
Assessing the perceptual similarity of voices is necessary for the creation of voice parades, along with media applications such as voice casting. These applications are normally prohibitively expensive to administer, requiring significant amounts of ‘expert listening’. The ability to automatically assess voice similarity could benefit these applications by increasing efficiency and reducing subjectivity, while enabling the use of a much larger search space of candidate voices. In this paper, the use of automatically extracted phonetic features within an i-vector speaker recognition system is proposed as a means of identifying cohorts of perceptually similar voices. Features considered include formants (F1-F4), fundamental frequency (F0), semitones of F0, and their derivatives. To demonstrate the viability of this approach, a subset of the Interspeech 2016 special session ‘Speakers In The Wild’ (SITW) dataset is used in a pilot study comparing subjective listener ratings of similari...
Many studies of automatic speaker recognition have investigated which parameters perform best. This paper presents an experiment where graphic representations of LTAS (Long Time Average Spectrum) were used to identify speakers from a closed set of disguised voices and to determine how well the graphic method performed compared to an aural approach. Nine different speakers were recorded uttering a fake threat. The speakers used different disguises such as dialect, accent, whisper, falsetto etc., and also produced the verbatim “threat” in a normal voice. Using high quality recordings, visual comparison of the Praat “vocal tract” graphs of LTAS outperformed the aural approach in identifying the disguised voices. Performing speaker identification aurally does not mean analyzing a different sample than the one being analyzed acoustically. Studies of aural perception show a hypothesizing, top-down, active process, which raises interesting questions regarding aural speaker identification with bad quali...
A procedure for comparing the performance of humans and machines on speaker recognition and on forensic voice comparison is proposed and demonstrated. The procedure is consistent with the new paradigm for forensic-comparison science (use of the likelihood-ratio framework and testing of the validity and reliability of the results). The use of the procedure is demonstrated using a small database of Swedish voice recordings.
Fundamental frequency has been used for a long time in speaker identification (Braun, 1995; Rose, 2003). The within-speaker variation in F0 is affected by several factors. In Braun (1995), they are categorized as technical, physiological and psychological factors. Tape speed, which surprisingly is still an issue for forensic samples, and sample size are examples of technical factors. Smoking and age are examples of physiological factors, while emotional state and background noise are examples of psychological factors. However, fundamental frequency has been shown to be a successful forensic phonetic parameter (Nolan, 1983). To be able to study differences, it is suggested to use long-term distribution measures such as arithmetical mean and standard deviation (Rose, 2002). The duration of the samples should be more than 60 seconds according to Nolan (1983), but Rose (1991) reports that F0 measurements for seven Chinese speakers stabilized much earlier, implying that the values may be ...
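The "semitones of F0" and "derivatives" features named in the abstract above can be illustrated with a minimal sketch. The 12·log2 conversion is the standard semitone formula; the 100 Hz reference, the function names, and the first-difference approximation of the derivative are assumptions made for illustration, not details taken from the paper.

```python
import math

def hz_to_semitones(f0_hz, ref_hz=100.0):
    """Convert an F0 value in Hz to semitones relative to ref_hz.

    ref_hz=100.0 is an assumed reference; the paper's choice is not stated.
    """
    if f0_hz <= 0 or ref_hz <= 0:
        raise ValueError("frequencies must be positive")
    return 12.0 * math.log2(f0_hz / ref_hz)

def first_differences(values):
    """Approximate a derivative as the difference between adjacent frames.

    This is one simple choice; real systems often use a regression window.
    """
    return [b - a for a, b in zip(values, values[1:])]

# Hypothetical F0 track in Hz (voiced frames only).
f0_track = [100.0, 200.0, 400.0]
semitones = [hz_to_semitones(f) for f in f0_track]
deltas = first_differences(semitones)
```

Each doubling of frequency corresponds to 12 semitones, which is why the semitone scale is often preferred when comparing speakers with different pitch ranges.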
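The long-term distribution measures mentioned in the abstract above (arithmetic mean and standard deviation of F0) could be computed along these lines. This is a minimal sketch: the convention that unvoiced frames are stored as 0, the function name, and the sample values are assumptions for illustration only.

```python
import statistics

def long_term_f0_stats(f0_track_hz):
    """Return (mean, sample standard deviation) of F0 over voiced frames.

    Assumes unvoiced frames are encoded as 0 (or negative) and skips them.
    """
    voiced = [f for f in f0_track_hz if f > 0]
    if len(voiced) < 2:
        raise ValueError("need at least two voiced frames")
    return statistics.fmean(voiced), statistics.stdev(voiced)

# Hypothetical F0 track in Hz; zeros mark unvoiced frames.
track = [0.0, 100.0, 110.0, 0.0, 120.0, 130.0]
mean_f0, sd_f0 = long_term_f0_stats(track)
```

Recomputing these statistics over growing prefixes of a recording is one way to check how quickly they stabilize with sample duration.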
first step towards a text-independent speaker verification
The Encyclopedia of Applied Linguistics, 2012
Proceedings of the Fourth European Conference on Tone and Intonation, 2010
Earlier studies on the perception of Chinese tones have almost exclusively used one-syllable words for the listening tests (Kiriloff, 1969; Chuang, 1971; Klatt, 1973; Gandour, 1978). In these earlier studies the misperception between tone 2 and tone 3 has been shown to be the most ...
Motor Control, Jan 26, 2015
In this study we systematically compared syllable repetition and finger tapping in healthy adults, and explored possible impacts of tempi, metronome, musical experience, and age on motor timing ability. One hundred healthy adults used finger tapping and syllable repetition to perform an isochronous pulse in three different tempi, with and without a metronome. Results showed that motor timing was more accurate with finger tapping than with syllable repetition in the slowest tempo, and that motor timing ability was better with the metronome than without. Persons with musical experience showed better motor timing accuracy than persons without such experience, and the timing asynchrony increased with increasing age. The slowest tempo, 90 bpm, posed extra challenges to the participants. We speculate that this pattern reflects the fact that the slow tempo lies outside the 3-8 Hz syllable rate of natural speech, which in turn has been linked to theta-based oscillations in the brain.
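The claim that 90 bpm lies outside the 3-8 Hz syllable-rate band can be checked with simple unit conversion, assuming one syllable or tap per beat. The function names below are hypothetical; only the 90 bpm tempo and the 3-8 Hz band come from the abstract.

```python
def bpm_to_hz(bpm):
    """Convert beats per minute to events per second (Hz)."""
    return bpm / 60.0

def hz_to_bpm(hz):
    """Convert events per second (Hz) to beats per minute."""
    return hz * 60.0

def in_syllable_rate_band(bpm, low_hz=3.0, high_hz=8.0):
    """True if one event per beat at this tempo falls within the band."""
    return low_hz <= bpm_to_hz(bpm) <= high_hz

slowest_hz = bpm_to_hz(90)               # 90 / 60 = 1.5 Hz
inside = in_syllable_rate_band(90)       # 1.5 Hz is below the 3 Hz lower edge
band_in_bpm = (hz_to_bpm(3.0), hz_to_bpm(8.0))  # band edges as tempi
```

Under this one-event-per-beat assumption, the 3-8 Hz band corresponds to tempi of 180 to 480 bpm, so a 90 bpm pulse runs at half the band's lower edge.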