SCANMail: browsing and searching speech data by content

SCANMail: Browsing and Searching Speech Data by Content

Eurospeech 2001 - Scandinavia, 2001

Increasing amounts of public, corporate, and private audio data are available, but their usefulness is limited by the lack of tools for browsing and searching them. In this paper, we describe SCANMail, a system that employs automatic speech recognition, information retrieval, information extraction, and human-computer interaction technology to let users browse and search their voicemail messages by content through a graphical user interface. The SCANMail client also provides note-taking capabilities as well as browsing and querying features. A CallerId server proposes caller names from existing caller acoustic models and is trained from user feedback. An Email server sends the original message plus its transcription to a mailing address specified in the user's profile.
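
The content search step a SCANMail-style browser relies on can be pictured as ordinary keyword retrieval over ASR transcripts. The following is a minimal sketch, assuming a simple TF-IDF ranker and made-up message transcripts; it is not SCANMail's actual retrieval component.

```python
# Hedged sketch: ranking voicemail transcripts against a keyword query with
# TF-IDF, roughly the kind of information-retrieval step described above.
# The message texts and IDs are illustrative only.
import math
from collections import Counter

def tfidf_rank(query, transcripts):
    """Return (score, message_id) pairs, best match first."""
    docs = {mid: Counter(text.lower().split()) for mid, text in transcripts.items()}
    n_docs = len(docs)
    # document frequency per term
    df = Counter()
    for counts in docs.values():
        df.update(counts.keys())
    scores = []
    for mid, counts in docs.items():
        total = sum(counts.values())
        score = 0.0
        for term in query.lower().split():
            if term in counts:
                tf = counts[term] / total
                idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
                score += tf * idf
        scores.append((score, mid))
    return sorted(scores, reverse=True)

messages = {
    "msg-001": "hi this is pat call me back about the budget meeting tomorrow",
    "msg-002": "your dentist appointment is confirmed for friday at nine",
}
print(tfidf_rank("budget meeting", messages))  # msg-001 ranks first
```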

SCANMail: audio navigation in the voicemail domain

2000

This paper describes SCANMail, a system that allows users to browse and search their voicemail messages by content through a GUI. Content-based navigation is realized by use of automatic speech recognition, information retrieval, information extraction, and human-computer interaction technology. In addition to the browsing and querying functionalities, acoustics-based caller ID technology is used to propose caller names from existing caller acoustic models trained from user feedback. The GUI browser also provides a note-taking capability. In a user study comparing SCANMail to a regular voicemail interface, SCANMail performed better in terms of both objective measures (time to and quality of solutions) and subjective measures.

SCANMail

Proceedings of the first international conference on Human language technology research - HLT '01, 2001

This paper describes SCANMail, a system that allows users to browse and search their voicemail messages by content through a GUI. Content-based navigation is realized by use of automatic speech recognition, information retrieval, information extraction, and human-computer interaction technology. In addition to the browsing and querying functionalities, acoustics-based caller ID technology is used to propose caller names from existing caller acoustic models trained from user feedback. The GUI browser also provides a note-taking capability. In a user study comparing SCANMail to a regular voicemail interface, SCANMail performed better in terms of both objective measures (time to and quality of solutions) and subjective measures.

An introduction to voice search

IEEE Signal Processing Magazine, 2008

Voice search is the technology underlying many spoken dialog systems (SDSs) that provide users with the information they request with a spoken query. The information normally exists in a large database, and the query has to be compared with a field in the database to obtain the relevant information. The contents of the field, such as business or product names, are often unstructured text. For example, directory assistance (DA) [1] is one of the most popular voice search applications, in which users issue a spoken query and an automated system returns the phone number and address information of a business or an individual. Other voice search applications include music/video management [2], business and product reviews [3], stock price quotes, and conference information systems [4], [5].

Figure 1 shows the typical architecture of a voice search system, where a user's utterance is first recognized with an automatic speech recognizer (ASR) that utilizes an acoustic model (AM), pronunciation model (PM), and language model (LM). The m-best results from the ASR are passed to a search component to obtain the n-best semantic interpretations, i.e., a list of up to n entries in the database. The interpretations are subsequently passed to a dialog manager (DM). The DM utilizes confidence measures, which indicate the certainty of the interpretations, to decide how to present the n-best results. If the system has high confidence in a few entries, it presents them to the user directly. Otherwise, a disambiguation module is used to interact with the user to understand what they actually need.

VOICE SEARCH AND OTHER SPOKEN DIALOG TECHNOLOGIES: SDSs are often chronologically categorized into three generations: informational, transactional, and problem solving [6], [7] (earlier command-and-control speech applications in the 1980s are not considered SDSs in this categorization). The first-generation SDSs focus on providing users with the information they request, such as flight status and weather information. The second-generation SDSs conduct transactions automatically with users, e.g., to book air flight tickets or perform bank balance transfers. The third-generation SDSs are often used in customer support, interacting with callers to diagnose the problems they are experiencing with a device or a service.
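
The ASR-to-search-to-dialog-manager flow described above can be sketched in a few lines. The snippet below is a toy illustration, assuming a trivial word-overlap matcher and invented listing names and thresholds; it mirrors only the confidence-based dispatch logic, not any production voice search stack.

```python
# Hedged sketch of the dispatch logic described above: an ASR m-best list is
# matched against a listings database, and a dialog manager either presents
# high-confidence entries directly or falls back to disambiguation. All names
# (LISTINGS, thresholds, the matcher) are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class Interpretation:
    entry: str          # database listing, e.g. a business name
    confidence: float   # certainty of the match, in [0, 1]

LISTINGS = ["joe's pizza", "joey's pizzeria", "pizza planet"]

def search(hypothesis: str, n: int = 3) -> list[Interpretation]:
    """Toy matcher: score listings by shared words with the ASR hypothesis."""
    words = set(hypothesis.split())
    scored = [
        Interpretation(e, len(words & set(e.split())) / max(len(words), 1))
        for e in LISTINGS
    ]
    return sorted(scored, key=lambda i: i.confidence, reverse=True)[:n]

def dialog_manager(m_best: list[str], accept_threshold: float = 0.5) -> str:
    # pool interpretations from every ASR hypothesis, keep the best score per entry
    best: dict[str, float] = {}
    for hyp in m_best:
        for interp in search(hyp):
            best[interp.entry] = max(best.get(interp.entry, 0.0), interp.confidence)
    confident = [e for e, c in best.items() if c >= accept_threshold]
    if confident:
        return f"Presenting: {confident}"
    return "Low confidence: asking a disambiguation question"

print(dialog_manager(["joe's pizza in midtown", "joes pizza in midtown"]))
```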

Foldering voicemail messages by caller using text independent speaker recognition

2000

The ability to automatically scan voicemail messages for content and caller identity cues would be a useful service. This paper describes a system which automatically files voicemail messages into caller folders using text-independent speaker recognition techniques. Callers are represented by Gaussian mixture models (GMMs). The speech for an incoming message is processed and scored against caller models created for a subscriber. A message whose matching score exceeds a threshold is filed in the matching caller folder; otherwise it is tagged as "unknown". The subscriber has the ability to listen to an "unknown" message and file it in the proper folder, if it exists, or create a new folder, if it does not. Such subscriber-labelled messages are used to train and adapt caller models. The system has been evaluated on a database of voicemail messages collected at AT&T Labs. A set of 20 callers from this database is designated as "ingroup". Each of these callers has recorded at least 20 messages totalling 10 or more minutes in duration. A distinct set of 220 messages, each from a different caller, is designated as "outgroup". Representative performance figures, with threshold parameters set to ensure that outgroup acceptance is low compared with ingroup rejection, are the following: the average ingroup message rejection rate is 11.0%, the average ingroup message confusion rate (matching the wrong caller) is 1.0%, and the average outgroup message accept rate is 2.7%.
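
The foldering decision can be sketched as follows. This is a minimal illustration, assuming scikit-learn GMMs, synthetic stand-in feature vectors, and a placeholder threshold rather than the paper's cepstral features and tuned operating point.

```python
# Hedged sketch of the foldering decision described above: score an incoming
# message's feature vectors against per-caller GMMs and file it, or tag it
# "unknown" when no model clears the threshold. Feature extraction is out of
# scope here; the random vectors and threshold value are placeholders.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Train one GMM per enrolled ("ingroup") caller from subscriber-labelled messages.
caller_models = {}
for caller, center in {"alice": 0.0, "bob": 3.0}.items():
    features = rng.normal(loc=center, scale=1.0, size=(200, 12))  # stand-in for cepstra
    caller_models[caller] = GaussianMixture(n_components=4, random_state=0).fit(features)

def folder_for(message_features, threshold=-25.0):
    """Return the best-matching caller folder, or 'unknown' if no score clears the threshold."""
    scores = {c: gmm.score(message_features) for c, gmm in caller_models.items()}
    best_caller, best_score = max(scores.items(), key=lambda kv: kv[1])
    return best_caller if best_score > threshold else "unknown"

incoming = rng.normal(loc=0.0, scale=1.0, size=(50, 12))   # resembles "alice"
print(folder_for(incoming))
```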

Speechbot: a speech recognition based audio indexing system for the web

… on Computer-Assisted …, 2000

We have developed an audio search engine incorporating speech recognition technology. This allows indexing of spoken documents from the World Wide Web when no transcription is available. This site indexes several talk and news radio shows covering a wide range of topics and speaking styles from a selection of public Web sites with multimedia archives. Our Web site is similar in spirit to normal Web search sites; it contains an index, not the actual multimedia content. The audio from these shows suffers in acoustic quality due to bandwidth limitations, coding, compression, and poor acoustic conditions. The shows are typically sampled at 8 kHz and transmitted, RealAudio compressed, at 6.5 kbps. Our word-error rate results using appropriately trained acoustic models show remarkable resilience to the high compression, though many factors combine to increase the average word-error rates over standard broadcast news benchmarks. We show that, even if the transcription is inaccurate, we can still achieve good retrieval performance for typical user queries (69%). Because the archive is large (over 5000 hours of content, totaling 47 million words, and growing at a rate of 100 hours per week), we measure performance in terms of the precision of the top-ranked matches returned to the user.
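
The retrieval measure mentioned at the end, precision of the top-ranked matches, is straightforward to compute; the snippet below uses an invented ranked list and relevance set purely for illustration.

```python
# Hedged sketch: precision of the top-k ranked matches, the kind of retrieval
# measure the abstract reports, computed over a toy relevance-judged result list.
def precision_at_k(ranked_doc_ids, relevant_ids, k=10):
    top = ranked_doc_ids[:k]
    return sum(1 for d in top if d in relevant_ids) / max(len(top), 1)

ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d1", "d4", "d8"}
print(precision_at_k(ranked, relevant, k=5))   # 3 of the top 5 are relevant -> 0.6
```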

Information extraction from voicemail

Proceedings of the 39th Annual Meeting on Association for Computational Linguistics - ACL '01, 2001

In this paper we address the problem of extracting key pieces of information from voicemail messages, such as the identity and phone number of the caller. This task differs from the named entity task in that the information we are interested in is a subset of the named entities in the message, and consequently, the need to pick the correct subset makes the problem more difficult. Also, the caller's identity may include information that is not typically associated with a named entity. In this work, we present three information extraction methods, one based on hand-crafted rules, one based on maximum entropy tagging, and one based on probabilistic transducer induction. We evaluate their performance on both manually transcribed messages and on the output of a speech recognition system.
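
The first of the three methods, hand-crafted rules, can be illustrated with a couple of toy patterns for the caller's self-identification and a phone number; these regular expressions are placeholders, not the paper's actual rule set.

```python
# Hedged sketch of a rule-based extractor on a toy transcript: a phone-number
# pattern over loosely formatted digit strings and a trigger-phrase rule for
# the caller's self-identification. Patterns are illustrative assumptions.
import re

PHONE = re.compile(r"\b(?:\d[\s.-]?){7,10}\d\b")          # 8-11 digits, loosely formatted
CALLER = re.compile(r"\b(?:this is|it's|my name is)\s+([A-Z][a-z]+(?:\s[A-Z][a-z]+)?)")

def extract(transcript: str) -> dict:
    caller = CALLER.search(transcript)
    phone = PHONE.search(transcript)
    return {
        "caller": caller.group(1) if caller else None,
        "phone": phone.group(0) if phone else None,
    }

msg = "Hi this is Jane Smith calling about Friday, you can reach me at 555 867 5309"
print(extract(msg))   # {'caller': 'Jane Smith', 'phone': '555 867 5309'}
```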

Samsa: a speech analysis, mining and summary application for outbound telephone calls

Proceedings of the 34th Annual Hawaii International Conference on System Sciences, 2001

We have applied speech recognition and text-mining technologies to a set of 522 recorded outbound marketing calls and analyzed the results. Since speaker-independent speech recognition technology results in a significantly lower recognition rate than that found when the recognizer is trained for a particular speaker, we applied a number of post-processing algorithms to the output of the recognizer to render it suitable for the Textract text mining system. We indexed the call transcripts using a search engine and used Textract and associated Java technologies to place the relevant terms for each document in a relational database. Following a search query, we generated a thumbnail display of the results of each call with the salient terms highlighted. We illustrate these results and discuss their utility. We describe a distinct document genre based on the note-taking concept of document content, and propose a significant new method for measuring speech recognition accuracy.
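
The thumbnail display with salient terms highlighted can be approximated with a small helper; the term list and markup below are placeholders, not Samsa's actual output format.

```python
# Hedged sketch of the thumbnail idea described above: wrap salient terms from a
# call transcript in markers so a results page can highlight them.
import re

def highlight(transcript: str, salient_terms: list[str]) -> str:
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, salient_terms)) + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: f"**{m.group(0)}**", transcript)

snippet = "thanks for your time today we wanted to discuss the new calling plan rates"
print(highlight(snippet, ["calling plan", "rates"]))
```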
