Aural and automatic forensic speaker recognition in mismatched conditions

The effect of mismatched recording conditions on human and automatic speaker recognition in forensic applications

In this paper, we analyse mismatched technical conditions in the training and testing phases of speaker recognition and their effect on forensic human and automatic speaker recognition. We use perceptual tests performed by non-experts and compare their performance with that of a baseline automatic speaker recognition system. The degradation of the accuracy of human recognition in mismatched recording conditions is contrasted with that of the automatic system under similar recording conditions. The conditions considered are public switched telephone network (PSTN) transmission, global system for mobile communications (GSM) transmission, and background noise. The perceptual cues that the human subjects use to perceive differences in voices are studied along with their importance in different conditions. We discuss the possibility of increasing the accuracy of automatic systems using the perceptual cues that remain robust to mismatched conditions. We estimate the strength of evidence for both humans and automatic systems, calculating likelihood ratios using the perceptual scores for humans and the log-likelihood scores for automatic systems.
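As a rough illustration of the final step described in the abstract above, the sketch below turns a comparison score into a likelihood ratio by fitting one Gaussian to same-speaker scores and another to different-speaker scores. The Gaussian score model, the function names, and every score value are assumptions made for illustration; the abstract does not say how the score distributions were modelled.

```python
from statistics import NormalDist

def fit_score_model(scores):
    """Fit a Gaussian to a list of scores (an assumed, simplistic score model)."""
    return NormalDist.from_samples(scores)

def likelihood_ratio(score, same_model, diff_model):
    """LR = p(score | same speaker) / p(score | different speakers)."""
    return same_model.pdf(score) / diff_model.pdf(score)

# Hypothetical scores from comparisons whose ground truth is known.
same_speaker_scores = [3.0, 4.0, 5.0, 6.0, 7.0]
diff_speaker_scores = [-2.0, -1.0, 0.0, 1.0, 2.0]

same_model = fit_score_model(same_speaker_scores)
diff_model = fit_score_model(diff_speaker_scores)

# Evaluate the LR for a few hypothetical questioned-recording scores.
for score in (0.0, 2.5, 5.0):
    lr = likelihood_ratio(score, same_model, diff_model)
    print(f"score={score:+.1f}  LR={lr:.4g}")
```

An LR above 1 supports the same-speaker hypothesis and an LR below 1 supports the different-speaker hypothesis. The same procedure could in principle be applied to listeners' perceptual scores or to an automatic system's log-likelihood scores, which is what makes the two comparable on a common scale.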

The impact of mismatched recordings on an automatic-speaker-recognition system and human listeners

Acta Universitatis Carolinae. Philologica, 2023

The so-called 'mismatch' is a factor which experts in the forensic voice comparison field encounter regularly. Therefore, we decided to explore to what extent the ability of an automatic-speaker-recognition system and of earwitnesses to identify speakers is influenced when recordings are acquired in different languages and at different times. 100 voices in a database of 300 recordings (100 speakers recorded in three mutually mismatched sessions) were compared with the automatic-speaker-recognition software VOCALISE, based on i-vectors and x-vectors, and by 39 respondents in simulated voice parades. Both the automatic and the perceptual approaches seem to have yielded similar results in that the less complex the mismatch type, the more successful the identification. The results point to the superiority of the x-vector approach, and also to varying identification abilities of listeners.

Calibration and weight of the evidence by human listeners. The ATVS-UAM submission to NIST HUMAN-aided speaker recognition 2010

2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011

This work analyzes the performance of speaker recognition when carried out by human lay listeners. In forensics, judges and jurors often hold the intuition that people are proficient at distinguishing others by their voices, and opinions about speech evidence are therefore easily elicited simply by listening to it, or by means of panels of listeners. There is a danger, however, since little attention has been paid to scientifically measuring the performance of human listeners, or the strength with which they should express their opinions. In this work we perform such a rigorous analysis in the context of NIST Human-Aided Speaker Recognition 2010 (HASR). We recruited a panel of listeners who gave their opinions in the form of scores. We then calibrated these scores using a development set in order to generate calibrated likelihood ratios. Thus, the discriminating power of human lay listeners and the strength with which they should express their opinions about the speech evidence can be assessed, giving a measure of the amount of information that human listeners contribute to the speaker recognition process.
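The calibration step mentioned in the abstract above can be sketched as follows, assuming a linear logistic-regression calibration, which is one common choice; the abstract does not state which method was actually used. The function names, the 1-to-5 rating scale, and the development scores are hypothetical.

```python
import math

def train_calibration(scores, labels, lr=0.5, epochs=5000):
    """Fit llr = a * score + b by gradient descent on the logistic loss.

    labels: 1 for same-speaker trials, 0 for different-speaker trials.
    Scores are centred internally so that plain gradient descent converges quickly.
    """
    n = len(scores)
    mean = sum(scores) / n
    a, b = 0.0, 0.0
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * (s - mean) + b)))
            grad_a += (p - y) * (s - mean)
            grad_b += p - y
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b - a * mean  # undo the centring

def calibrated_llr(score, a, b, prior_same):
    """Remove the training-set prior log-odds so the output is a log-LR."""
    return a * score + b - math.log(prior_same / (1.0 - prior_same))

# Hypothetical development set: listener ratings on a 1-5 scale.
dev_scores = [5, 4, 4, 3, 2, 1, 2, 2, 3, 4]
dev_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
prior_same = sum(dev_labels) / len(dev_labels)

a, b = train_calibration(dev_scores, dev_labels)
for rating in (1, 3, 5):
    llr = calibrated_llr(rating, a, b, prior_same)
    print(f"rating={rating}  natural-log LR={llr:+.3f}")
```

Because the fitted slope reflects how well the ratings actually separate same-speaker from different-speaker trials, a panel whose ratings overlap heavily ends up with log-likelihood ratios close to zero, however confident the individual listeners sound.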

Forensic speaker recognition

IEEE Signal Processing Magazine, 2009

There has long been a desire to be able to identify a person on the basis of his or her voice. For many years, judges, lawyers, detectives, and law enforcement agencies have wanted to use forensic voice authentication to investigate a suspect or to confirm a judgment of guilt or innocence [3] [35]. Challenges, realities, and cautions regarding the use of speaker recognition applied to forensic-quality samples are presented. Identifying a voice using forensic-quality samples is generally a challenging task for automatic, semiautomatic, and human-based methods. The speech samples being compared may be recorded in different situations; e.g., one sample could be yelling over the telephone, whereas the other might be a whisper in an interview room. A speaker could be disguising his or her voice, ill, or under the influence of drugs, alcohol, or stress in one or more of the samples. The speech samples will most likely contain noise, may be very short, and may not contain enough relevant speech material for comparative purposes. Each of these variables, in addition to the known variability of speech in general, makes reliable discrimination of speakers a complicated and daunting task. Although the scientific basis of authentication of a person by using his or her voice has been questioned by researchers (e.g., by scientists in 1970 [4], British academic phoneticians in 1983 [5], and the French speech communication community from 1990 to today [6]), there is a perception among the

The case for aural perceptual speaker identification

Forensic Science International, 2016

It was expected that valid computer-based speaker identification (SI) systems would be developed well before the start of the 21st century. This has not happened. One reason for the lack of progress was that the complexity of the problem was seriously underestimated. Another was that appropriate protocols were not followed during system development. Appropriate standards are now available and will be reviewed. Nevertheless, there is a critical, and growing, demand for effective SI systems. This situation creates a need for some sort of stopgap procedure to fill the temporary void. Fortunately, a large amount of SI research has been carried out during the cited period. Its results (which will be summarized) lead to the postulation that an aural perceptual (AP SI) system, although somewhat subjective, could be used to meet this requirement. A number of such systems have been developed (an example will be provided), and relevant research supports the position that, if rigorously controlled and cautiously interpreted, they can provide useful SI information. It is also recommended that, if an appropriate research model is followed, the speech/voice analysis approach be seriously considered as a platform for the creation of digital speaker identification systems.

Speaker Recognition System and its Forensic Implications

2013

Speaker recognition comprises all those activities which attempt to link a speech sample to its speaker through its acoustic or perceptual properties [1]. The speech signal is a multidimensional acoustic wave (Figure 1), which provides information regarding speaker characteristics, the spoken phrase, speaker emotions, additional noise, channel transformations, etc. [2,3]. The human voice is a unique personal trait. For two voices to be indistinguishable, the two individuals would need to have identical vocal mechanisms and identical coordination of their articulators, which is highly improbable. However, some variation also occurs among speech exemplars obtained from the same speaker. This is because a speaker cannot reproduce the same utterance exactly again and again. Even the signature of an individual shows variation from trial to trial.

Forensic automatic speaker recognition using Bayesian interpretation and statistical compensation for mismatched conditions

International Journal of Speech Language and The Law, 2007

Keywords: automatic speaker recognition, strength of evidence, mismatched recording conditions, statistical compensation techniques

State-of-the-art automatic speaker recognition systems show very good performance in discriminating between voices of speakers under controlled recording conditions. However, the conditions in which recordings are made in investigative activities (e.g., anonymous calls and wire-tapping) cannot be controlled and pose a challenge to automatic speaker recognition. Differences in the telephone handset, in the transmission channel and in the recording devices can introduce variability over and above that of the voices in the recordings. The strength of evidence, estimated using statistical models of within-source variability and between-source variability, is expressed as a 'likelihood ratio'. The likelihood ratio is estimated using a probabilistic Bayesian interpretation which gives the probability of observing the features of the questioned recording in the statistical model of the suspected speaker's voice, given two competing hypotheses: first, that the suspected speaker is the same speaker as that on the questioned recording, and second, that the speaker heard on

Impact of the Passage of Time on the Correct Identification of the Speaker Using the Auditory Method

Archives of Acoustics, 2024

Courts in Poland, as well as in most countries in the world, allow for the identification of a person on the basis of his/her voice using the so-called voice presentation method, i.e., the auditory method. This method is used in situations where there is no sound recording and the perpetrator of the criminal act was masked and the victim heard only his or her voice. However, psychologists, forensic acousticians, as well as researchers in the field of auditory perception and forensic science more broadly describe many cases in which such testimony resulted in misjudgement. This paper presents the results of an experiment designed to investigate, in a Polish language setting, the extent to which the passage of time impairs the correct identification of a person. The study showed that 31 days after the speaker's voice was first heard, the correct identification for a female voice was 30% and for a male voice 40%.

Perceptual and memory factors in simulated machine-aided speaker verification

International Journal of Man-Machine Studies, 1979

Speaker verification by machine alone may be more accurate than by human listener but it is slower and demands powerful programs and peripherals. Simple recording devices can juxtapose a claimant utterance with a stored sample to provide rapid verification by human judgement, but this raises the question of how to optimize the sample size between insufficient information and an overload of auditory memory. To identify the processes at work in such judgements, a simulation was conducted of the situation where a human operator verifies claimant speakers against stored samples of a standard utterance. Realism was incorporated by restricting signals to telephone frequency bandwidth, while both control and a stringent level of difficulty were incorporated by the selection of five better-than-average imposters and five more-than-averagely imitable male speakers. Naive, unselected listeners participated. With a 9-syllable sentence lasting about 2 seconds, correct acceptances varied from 92% to 100% and false acceptances from 54% to 21%. Conditions in which the length of the sample was reduced in various ways gave lower performance. The major factor differentiating the performance of individual subjects was a bias factor: the degree to which "same" responses predominated over "different" responses. Despite this, the different sample conditions tended to produce a fixed percentage of acceptance responses rather than a proportion varying with the available sensitivity in the fashion of an optimal decision-maker. The data justify several conclusions. (1) Listeners can integrate speaker information over periods as long as 2 seconds and probably longer. (2) Improvement in performance can result from increasing the length of either the claimant utterance or the stored sample even when the other cannot be increased. Thus it appears that listeners are extracting and storing parameters characterising the style of a speaker rather than matching a raw sound image. (3) Speaker verification by skilled listeners should be able to reach levels of sensitivity which, in combination with manipulations of the acceptance criterion, would ensure tolerably low false acceptance rates. (4) Training of the listener in speaker verification should involve training of acceptance criteria as well as perceptual discrimination training.
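The distinction the abstract above draws between sensitivity and response bias can be made concrete with the standard equal-variance signal detection indices d' and c. This is a minimal sketch that assumes the equal-variance Gaussian model; the abstract does not state which indices the study used, and the rates below are hypothetical rather than taken from the paper.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and standard deviation 1

def z(p):
    """Inverse standard-normal CDF; p must lie strictly between 0 and 1."""
    return _STD_NORMAL.inv_cdf(p)

def d_prime(hit_rate, false_accept_rate):
    """Sensitivity: separation between 'same' and 'different' trials."""
    return z(hit_rate) - z(false_accept_rate)

def criterion(hit_rate, false_accept_rate):
    """Response bias c: negative values mean a tendency to respond 'same'."""
    return -0.5 * (z(hit_rate) + z(false_accept_rate))

# Hypothetical listener: 90% correct acceptances, 30% false acceptances.
hit, false_accept = 0.90, 0.30
print(f"d' = {d_prime(hit, false_accept):.3f}")
print(f"c  = {criterion(hit, false_accept):.3f}")
```

Rates of exactly 0% or 100%, such as the 100% correct-acceptance figure reported above, have no finite z-score, so in practice they need a correction before these indices can be computed.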