Automatic named identification of speakers using diarization and ASR systems

Speaker diarization from speech transcripts

Proc. ICSLP, 2004

The aim of this study is to investigate the use of the linguistic information present in the audio signal to structure broadcast news data, and in particular to associate speaker identities with audio segments. While speaker recognition has been an active area of research for many years, addressing the problem of identifying speakers in huge audio corpora is relatively recent and has been mainly concerned with speaker tracking. The speech transcriptions contain a wealth of linguistic information that is useful for speaker diarization. Patterns which can be used to identify the current, previous or next speaker have been developed based on the analysis of 150 hours of manually transcribed broadcast news data. Each pattern is associated with one or more rules. After validation on the training transcripts, these patterns and rules were tested on an independent data set containing transcripts of 10 hours of broadcasts.
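
The pattern-and-rule idea can be illustrated with a minimal sketch. The three patterns below, the name regex, and the sample segments are hypothetical stand-ins, since the abstract does not list the study's actual patterns. Each rule pairs a cue phrase with the segment (previous, current, or next) that receives the captured name.

```python
import re

# Hypothetical name shape: two capitalized words. Real name detection is harder.
NAME = r"([A-Z][a-z]+ [A-Z][a-z]+)"

# Hypothetical rules: (pattern, offset), where the offset says which segment
# the captured name labels: -1 = previous, 0 = current, +1 = next speaker.
RULES = [
    (re.compile(r"\b(?:I'm|[Tt]his is) " + NAME), 0),
    (re.compile(r"\b[Tt]hank you,? " + NAME), -1),
    (re.compile(r"\b(?:[Oo]ver to|[Hh]ere is) " + NAME), +1),
]


def name_segments(segments):
    """Return one speaker name (or None) per transcript segment."""
    labels = [None] * len(segments)
    for i, text in enumerate(segments):
        for pattern, offset in RULES:
            match = pattern.search(text)
            target = i + offset
            # Keep the first name assigned to a segment; skip out-of-range targets.
            if match and 0 <= target < len(segments) and labels[target] is None:
                labels[target] = match.group(1)
    return labels


segments = [
    "Good evening, I'm Anna Keller. For more, over to Paul Simon.",
    "Flooding has closed several roads in the region tonight.",
    "Thank you, Paul Simon. In other news, markets were calm.",
]
labels = name_segments(segments)
print(labels)
```

On this toy input the first two segments receive a name and the third stays unlabeled, because no pattern points at it.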

Speaker Diarization of Broadcast News in Albayzin 2010 Evaluation Campaign

2012

Abstract In this article, we present the evaluation results for the task of speaker diarization of broadcast news, which was part of the Albayzin 2010 evaluation campaign of language and speech technologies. The evaluation data consists of a subset of the Catalan broadcast news database recorded from the 3/24 TV channel. The description of five submitted systems from five different research labs is given, marking the common as well as the distinctive system features.

Combining transcription-based and acoustic-based speaker identifications for broadcast news

2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012

In this paper, we consider the issue of speaker identification within audio recordings of broadcast news. The speaker identity information is extracted from both transcript-based and acoustic-based speaker identification systems. This information is combined in the belief functions framework, which makes the knowledge representation of the problem coherent. The Kuhn-Munkres algorithm is used to solve the assignment problem between speaker identities and speaker clusters. Experiments carried out on French broadcast news from the French evaluation campaign ESTER show the efficiency of the proposed combination method.
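
As an illustration of the assignment step, here is a minimal pure-Python sketch of the Kuhn-Munkres (Hungarian) algorithm applied to a cluster-by-name score matrix. The scores and the function name `assign_names` are assumptions made for the example. How the paper derives its scores from belief functions is not shown here.

```python
def assign_names(scores):
    """One-to-one cluster -> name assignment maximizing the total score.

    scores[c][n] is a hypothetical compatibility score between cluster c and
    name n. Requires len(scores) <= len(scores[0]).
    """
    n, m = len(scores), len(scores[0])
    inf = float("inf")
    # Kuhn-Munkres minimizes cost, so negate the scores.
    cost = [[-s for s in row] for row in scores]
    u = [0.0] * (n + 1)   # row potentials
    v = [0.0] * (m + 1)   # column potentials
    p = [0] * (m + 1)     # p[j] = row matched to column j (1-based, 0 = free)
    way = [0] * (m + 1)   # previous column on the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0 = p[j0]
            delta, j1 = inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j] = cur
                        way[j] = j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            # Shift potentials so that one more edge becomes tight.
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the matching along the augmenting path.
        while True:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
            if j0 == 0:
                break
    return {p[j] - 1: j - 1 for j in range(1, m + 1) if p[j] != 0}


# Rows: speaker clusters, columns: candidate names (hypothetical scores).
scores = [
    [0.9, 0.1, 0.3],
    [0.8, 0.7, 0.2],
    [0.2, 0.3, 0.6],
]
assignment = assign_names(scores)
total = sum(scores[c][k] for c, k in assignment.items())
print(assignment, total)
```

Picking the best name independently for each cluster would give name 0 to both cluster 0 and cluster 1, whereas the one-to-one assignment resolves that conflict. Where SciPy is available, `scipy.optimize.linear_sum_assignment` solves the same problem.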

Towards using STT for broadcast news speaker diarization

Proc. DARPA RT04, …, 2004

The aim of this study is to investigate the use of the linguistic information present in the audio signal to structure broadcast news data, and in particular to associate speaker identities with audio segments. While speaker recognition has been an active area of research for many years, addressing the problem of identifying speakers in huge audio corpora is relatively recent and has been mainly concerned with speaker tracking. The speech transcriptions contain a wealth of linguistic information that is useful for speaker diarization. Patterns which can be used to identify the current, previous or next speaker have been developed based on the analysis of 150 hours of manually transcribed broadcast news data. Each pattern is associated with one or more rules to assign speaker identities. After validation on the training transcripts, these patterns and rules were tested on an independent data set containing transcripts of 9 hours of broadcasts, and a speaker diarization error rate of about 11% was obtained. Future work will validate the approach on automatically generated transcripts and also combine the linguistic information with information derived from the acoustic level.

Diarization-Based Speaker Retrieval for Broadcast Television Archives

In this study we extend a query-by-example diarization-based speaker retrieval system to a full speaker retrieval system for broadcast television. The envisioned system is capable of finding all speakers in an archive using their names instead of example speech fragments. Information extracted from a television guide is used to label speaker clusters that most likely correspond to the found names. As part of the labeling process, all speaker clusters are first classified automatically based on their role in the programs they appear in. The role classification accuracy is 64% on our evaluation set. Speaker names can automatically be attributed to a fraction of the speaker clusters with an accuracy of 70%.

Speaker diarization: about whom the speaker is talking?

2006

Automatic speaker diarization consists of splitting the signal into homogeneous segments and clustering them by speaker. However, the speaker segments are given anonymous labels. This paper proposes a way to identify those speakers by extracting their full names as pronounced in French broadcast news. A semantic classification tree is automatically built on a training corpus and associates each full name detected in the transcription of a segment with that segment or with one of its neighbors. A merging method then makes it possible to assign a full name to a speaker cluster in place of the anonymous label provided by the diarization.
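
The abstract does not specify the merging method, so the following is only one plausible sketch, assuming score-weighted voting: every segment-level name hypothesis votes for its cluster, and each cluster takes the name with the highest accumulated score. The cluster labels, names, and scores are hypothetical.

```python
from collections import defaultdict


def name_clusters(hypotheses):
    """Merge segment-level name hypotheses into one name per cluster.

    hypotheses: iterable of (cluster_label, full_name, score) tuples, one per
    segment-level decision. Score-weighted voting is an assumption of this
    sketch, not the paper's stated method.
    """
    votes = defaultdict(lambda: defaultdict(float))
    for cluster, name, score in hypotheses:
        votes[cluster][name] += score
    # Pick the name with the largest accumulated score in each cluster.
    return {cluster: max(names, key=names.get) for cluster, names in votes.items()}


hypotheses = [
    ("S1", "Claire Dubois", 0.8),
    ("S1", "Marc Petit", 0.5),
    ("S1", "Claire Dubois", 0.4),
    ("S2", "Marc Petit", 0.9),
]
cluster_names = name_clusters(hypotheses)
print(cluster_names)
```

Clusters that receive no hypothesis are absent from the result and would keep their anonymous diarization label.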

A comparative study using manual and automatic transcriptions for diarization

… Speech Recognition and …, 2005

This paper describes recent studies on speaker diarization from automatic broadcast news transcripts. Linguistic information revealing the true names of who speaks during a broadcast (the next, the previous and the current speaker) is detected by means of linguistic patterns. To associate the true speaker names with the speech segments, a set of rules is defined for each pattern. Since the effectiveness of linguistic patterns for diarization depends on the quality of the transcription, the performance obtained using automatic transcripts generated with an LVCSR system is compared with that obtained using manual transcriptions. On about 150 hours of broadcast news data (295 shows), the global ratio of false identity association is about 13% for the automatic and the manual transcripts.

Speaker diarization: From broadcast news to lectures

2006

This paper presents the LIMSI speaker diarization system for lecture data, in the framework of the Rich Transcription 2006 Spring (RT-06S) meeting recognition evaluation. This system builds upon the baseline diarization system designed for broadcast news data. The baseline system combines agglomerative clustering based on the Bayesian information criterion with a second clustering stage using state-of-the-art speaker identification techniques. In the RT-04F evaluation, the baseline system provided an overall diarization error of 8.5% on broadcast news data. However, since it has a high missed speech error rate on lecture data, a different speech activity detection approach was explored, based on the log-likelihood ratio between speech and non-speech models trained on the seminar data. The new speaker diarization system integrating this module provides an overall diarization error of 20.2% on the RT-06S Multiple Distant Microphone (MDM) data.
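
The log-likelihood-ratio idea behind the speech activity detector can be sketched in a few lines. This toy version assumes a single one-dimensional Gaussian per class over a frame-level log-energy feature. The model parameters, frame values, and threshold are hypothetical, and the actual system's models and features are not described in the abstract.

```python
import math


def gaussian_logpdf(x, mean, var):
    """Log density of a 1-D Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def detect_speech(frames, speech, nonspeech, threshold=0.0):
    """Mark a frame as speech when log p(x|speech) - log p(x|nonspeech) > threshold.

    speech and nonspeech are (mean, variance) pairs.
    """
    return [
        gaussian_logpdf(x, *speech) - gaussian_logpdf(x, *nonspeech) > threshold
        for x in frames
    ]


# Hypothetical (mean, variance) of frame log-energy for each class.
speech_model = (-2.0, 1.0)
nonspeech_model = (-6.0, 1.0)
frames = [-6.2, -5.8, -2.5, -1.9, -2.2, -6.5]
decisions = detect_speech(frames, speech_model, nonspeech_model)
print(decisions)
```

Raising the threshold trades missed speech for fewer false alarms, which is the kind of trade-off a high missed speech error rate makes relevant.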

Automatic named identification of speakers using belief functions

2010

In this paper, we consider the extraction of speaker identity (first name and last name) from audio recordings of broadcast news. Using an automatic speech recognition system, we present improvements to a method that extracts speaker identities from automatic transcripts and assigns them to speaker turns. The detected full names are chosen as potential candidates for these assignments. All this information, which is often contradictory, is described and combined in the Belief Functions formalism, which makes the knowledge representation of the problem coherent. Belief Function theory has proven well suited to managing uncertainty about speaker identity. Experiments are carried out on French broadcast news recordings from a French evaluation campaign of automatic speech recognition.
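
To show how contradictory name evidence can be combined in the belief functions framework, here is a minimal sketch of Dempster's rule of combination. The two candidate names and the mass values are hypothetical. The abstract does not say which combination rule the paper uses or how it builds its mass functions, so this is an assumption rather than the paper's method.

```python
from itertools import product


def dempster_combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps a frozenset of candidate names to its mass,
    and the masses of each function sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            # Mass that would go to the empty set measures the conflict.
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("total conflict: the sources cannot be combined")
    # Renormalize by the non-conflicting mass.
    return {s: w / (1.0 - conflict) for s, w in combined.items()}


anne = frozenset({"Anne Morel"})
luc = frozenset({"Luc Bernard"})
frame = anne | luc  # full set of candidates, i.e. "unknown"

# Two hypothetical sources for the same speaker turn, partly contradictory.
m_source_a = {anne: 0.6, frame: 0.4}
m_source_b = {luc: 0.5, frame: 0.5}
fused = dempster_combine(m_source_a, m_source_b)
print(fused)
```

Mass left on the full candidate set represents ignorance rather than support for any particular name, which is what lets the formalism represent uncertain sources.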