Luboš Šmídl - Academia.edu

Papers by Luboš Šmídl

Czech dialog system for examination registration at UWB

False alarms reduction in keyword spotting system

The 8th World Multi-Conference on Systemics, Cybernetics and Informatics, Vol. VI: Image, Acoustic, Signal Processing and Optical Systems, Technologies and Applications, Jul 18, 2004

Dialogový systém pro přihlašování studentů na zkoušky (Dialog system for registering students for examinations)

Multi-agent decision support systems with remote multimedia access

Adjusting BERT’s Pooling Layer for Large-Scale Multi-Label Text Classification

Text, Speech, and Dialogue, 2020

In this paper, we present our experiments with BERT models in the task of Large-scale Multi-label Text Classification (LMTC). In the LMTC task, each text document can have multiple class labels, while the total number of classes is on the order of thousands. We propose a pooling layer architecture on top of BERT models, which improves the quality of classification by using information from the standard [CLS] token in combination with pooled sequence output. We demonstrate the improvements on Wikipedia datasets in three different languages using public pre-trained BERT models.
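The idea of combining the [CLS] vector with a pooled sequence output can be illustrated with a minimal sketch. The abstract does not say which pooling operator is used or how the two vectors are combined, so mean pooling, plain concatenation, and the function name `combine_cls_and_pooled` are assumptions made only for illustration.

```python
from typing import List


def combine_cls_and_pooled(token_vectors: List[List[float]]) -> List[float]:
    """Concatenate the first ([CLS]) vector with the mean over all token vectors.

    Mean pooling and concatenation are illustrative assumptions, not the
    architecture described in the paper.
    """
    if not token_vectors:
        raise ValueError("need at least one token vector")
    cls_vector = token_vectors[0]
    dim = len(cls_vector)
    # Average each dimension over every token position (including [CLS]).
    mean_pooled = [
        sum(vec[i] for vec in token_vectors) / len(token_vectors)
        for i in range(dim)
    ]
    return cls_vector + mean_pooled
```

In a real model the combined vector would feed a classification layer with one output per label; that part is omitted here.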

An Analysis of the RNN-Based Spoken Term Detection Training

Speech and Computer, 2017

This paper studies the training process of the recurrent neural networks used in the spoken term detection (STD) task. The method used in the paper employs two jointly trained Siamese networks using unsupervised data. The grapheme representation of a searched term and the phoneme realization of a putative hit are projected into the pronunciation embedding space using these networks. The score is estimated as the relative distance of these embeddings. The paper studies the influence of different loss functions, the amount of unsupervised data and the meta-parameters on the performance of the STD system.

Automatic Correction of i/y Spelling in Czech ASR Output

Text, Speech, and Dialogue, 2020

This paper concentrates on the design and evaluation of a method that automatically corrects the spelling of i/y in Czech words at the output of the ASR decoder. After analyzing both the Czech grammar rules and the data, we decided to deal only with endings consisting of the consonants b/f/l/m/p/s/v/z followed by i/y in both short and long forms. The correction is framed as a classification task in which a word can belong to the “i” class, the “y” class or the “empty” class. Using the state-of-the-art Bidirectional Encoder Representations from Transformers (BERT) architecture, we were able to substantially improve the correctness of the i/y spelling on both the simulated and the real ASR output. Since the misspelling of i/y in Czech texts is seen by the majority of native Czech speakers as a blatant error, the corrected output greatly improves the perceived quality of the ASR system.
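The candidate selection described in this abstract can be sketched as a simple filter. The sketch assumes one reading of the abstract: a word is a candidate when it ends in one of the listed consonants followed by short or long i/y (i, í, y, ý). The names `is_candidate` and `candidate_positions` are hypothetical, and the BERT classifier that assigns the “i”, “y” or “empty” class is not shown.

```python
import re

# Assumed reading of the abstract: the word ends in b/f/l/m/p/s/v/z
# followed by i, í, y or ý.
CANDIDATE_ENDING = re.compile(r"[bflmpsvz][iíyý]$", re.IGNORECASE)


def is_candidate(word: str) -> bool:
    """Return True if the word ending falls within the assumed scope."""
    return CANDIDATE_ENDING.search(word) is not None


def candidate_positions(words):
    """Return indices of the words a classifier would be asked to label."""
    return [i for i, word in enumerate(words) if is_candidate(word)]
```

Words outside the filter would presumably fall into the “empty” class, but that mapping is also an assumption.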

STAZKA – Speech recordings from vehicles

The database contains two sets of recordings, both recorded in moving or stationary vehicles (passenger cars or trucks). All data were recorded within the project “Intelligent Electronic Record of the Operation and Vehicle Performance”, whose aim is to develop voice-operated software for registering vehicle operation data. The first part (full_noises.zip) consists of relatively long recordings from the vehicle cabin, containing spontaneous speech from the vehicle crew. The recordings are accompanied by detailed transcripts in the Transcriber XML-based format (.trs). Due to the recording settings, the audio contains many different noises, only sparsely interspersed with speech. As such, the set is suitable for robust estimation of voice activity detector parameters. The second set (prompts.zip) consists of short prompts that were recorded in a controlled setting – the speakers either answered simple questions or repeated commands and short phrases. The ...

Rejection and Key-Phrase Spotting Techniques Using a Mumble Model in a Czech Telephone Dialog System

Sixth International Conference on …, 2000

This paper describes an implementation of a mumble model in a real-time Czech telephone dialog system, and its contribution to speech recognition. A short overview of the Czech telephone dialog system with a special focus on its speech recognition module is given. ...

Initial Experiments on Question Answering from the Intrinsic Structure of Oral History Archives

How to Detect Speech in Telephone Dialogue Systems

In practical telephone dialogue applications there are many problems with speech detection. This paper discusses two main problems with silence detection and presents techniques for increasing the reliability and recognition accuracy of the whole telephone dialogue system. The first problem is caused by the differing levels of signal and background noise across incoming calls. A possible solution is to store the local speech/silence decisions in a buffer and to use an adaptive threshold to make a global decision about each frame. The second problem is the detection of the spoke-too-soon error and recovery from it. In addition, this paper describes a dialogue system with a speaker-independent recognition module for Czech continuous speech. The recognition module has been designed at the University of West Bohemia (UWB) and is used for telephone applications. The task of this module is to recognize and transcribe continuous Czech telephone speech to a word sequence r...
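The buffered decision with an adaptive threshold can be sketched as follows. The abstract gives no formulas, so every detail here is an assumption: the noise floor is an exponential moving average updated on frames judged silent, the local decision compares frame energy with `ratio * noise_floor`, and the global decision is a majority vote over the buffer. The function name `detect_speech` and its parameters are hypothetical.

```python
from collections import deque


def detect_speech(energies, buffer_size=5, ratio=3.0, alpha=0.9):
    """Return one global speech/silence decision (True = speech) per frame.

    All update rules are illustrative assumptions, not the paper's method.
    """
    energies = list(energies)
    if not energies:
        return []
    noise_floor = energies[0]            # start from the level of this call
    buffer = deque(maxlen=buffer_size)   # most recent local decisions
    decisions = []
    for energy in energies:
        local = energy > ratio * noise_floor
        if not local:
            # Track the background level only on frames judged silent.
            noise_floor = alpha * noise_floor + (1 - alpha) * energy
        buffer.append(local)
        # Global decision: strict majority of the buffered local decisions.
        decisions.append(sum(buffer) * 2 > len(buffer))
    return decisions
```

Because the threshold is relative to the estimated noise floor, scaling all energies by a constant leaves the decisions unchanged, which is the point of adapting to each call's signal level.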

Fast Keyword Spotting from Acoustic Baseforms

This paper describes a filler model, used in our keyword spotting system, which is implemented as a phoneme recognizer. The filler model produces a sequence of phones corresponding to the input utterance and can be used as a phoneme recognizer. The dependency of accuracy and correctness on the filler model back-loop penalty, as well as the influence of the filler model's language model, are shown. The output of the phoneme recognizer can be used for keyword spotting. Two modifications of the basic DTW algorithm are presented. The advantage of this keyword spotting approach is the possibility of two-pass detection. The first pass (slow) is done only once. The second pass (fast) is done when a keyword search is requested and uses only the sequence of phones generated by the first pass. All the tests are performed on a telephone speech corpus.
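The fast second pass can be pictured as a search over the stored phone sequence. The abstract does not describe the two DTW modifications, so the sketch below swaps in a plain approximate substring match (edit distance with a free starting point) as a stand-in; the function name `best_match_cost` and the unit costs are assumptions.

```python
def best_match_cost(keyword, phones):
    """Minimum edit distance between `keyword` and any substring of `phones`.

    A stand-in for the paper's modified DTW; unit costs are an assumption.
    """
    # prev[j]: cost of aligning the keyword prefix seen so far with some
    # substring of `phones` ending at position j. The first row is all zeros
    # so that a match may start anywhere in the phone stream.
    prev = [0] * (len(phones) + 1)
    for i, k in enumerate(keyword, start=1):
        curr = [i] + [0] * len(phones)
        for j, p in enumerate(phones, start=1):
            substitution = prev[j - 1] + (k != p)
            deletion = prev[j] + 1       # keyword phone missing in the stream
            insertion = curr[j - 1] + 1  # extra phone in the stream
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return min(prev)
```

A keyword would be reported when the cost falls below some threshold; the threshold value is application-specific and not given in the source.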

Benefit of Mumble Model to the Czech Telephone Dialogue System

This paper discusses the use of a mumble model in a Czech telephone dialogue system designed and constructed at the Department of Cybernetics, University of West Bohemia, and describes the benefits of the mumble model to speech recognition, namely to a rejection method. Firstly, an overview of the Czech telephone dialogue system and its recognition engine is given. The recognition is based on a statistical approach. Triphones are used and modeled by three-state left-to-right HMMs with an output probability density function expressed as a multivariate Gaussian mixture. Stochastic regular grammars are used as a language model to reduce the task perplexity. Secondly, the mumble model is introduced as a recursive network of Czech phone HMMs connected in parallel, and an implementation of a rejection method and a key-word spotting method, both based on the mumble model, is explained. Finally, the experimental results providing the 19.4 % equal error rate (EER) of the rejection and 16.7...

Expanding Decision Support Systems Outside Company Gates

Multimodal Dialog with the MALACH Audiovisual Archive

Interspeech, 2019

In this paper, we present a multimodal dialog system capable of information retrieval from the large audiovisual archive MALACH of Holocaust testimonies. The users can use spoken natural language queries to search the archive. A graphical user interface allows the users to quickly view footage with the answers and explore their context. The dialog system was deployed in two languages, English and Czech. The system uses automatic speech recognition and natural language processing for knowledge base construction and for processing of the user’s input.

Towards Network Simplification for Low-Cost Devices by Removing Synapses

Lecture Notes in Computer Science, 2018

The deployment of robust neural network based models on low-cost devices runs into hardware constraints such as limited memory footprint and computing power. This work presents a general method for a rapid reduction of parameters (80-90%) in a trained (DNN or LSTM) network by removing its redundant synapses, while the classification accuracy is not significantly hurt. The massive reduction of parameters leads to a notable decrease in the model's size and the actual prediction time of on-board classifiers. We show the pruning results on a simple speech recognition task; however, the method is applicable to any classification data.
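To give a sense of what removing 80-90% of the parameters means in practice, here is a minimal sketch of magnitude pruning. The abstract does not state how redundant synapses are identified, so the smallest-magnitude criterion is a swapped-in, commonly used heuristic, and the function name `prune_smallest` is hypothetical.

```python
def prune_smallest(weights, fraction):
    """Zero out the given fraction of weights with the smallest magnitude.

    Magnitude is an assumed redundancy criterion, not the paper's method.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    n_remove = int(len(weights) * fraction)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    removed = set(order[:n_remove])
    return [0.0 if i in removed else w for i, w in enumerate(weights)]
```

Zeroed weights only save memory and time when the model is stored and evaluated in a sparse form, which this sketch does not cover.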

Improving a Keyword Spotting System Using Phoneme Sequence Generated by a Filler Model

This paper presents a technique for improving keyword spotting. The main idea is based on the combination of a standard keyword spotting system working with a filler model with a phone recognizer. If the filler model is well designed, it produces a phoneme sequence corresponding to the investigated utterance. This phoneme sequence generated by a filler model can be used to reject incorrectly detected keywords. The quality of this technique depends on how good the acoustic match is between an incoming utterance and a bigram phoneme filler model.

Improving a keyword spotting system using phoneme sequence

This paper presents a technique for improving keyword spotting. The main idea is based on the combination of a standard keyword spotting system working with a filler model with a phone recognizer. If the filler model is well designed, it produces a phoneme sequence corresponding to the investigated utterance. This phoneme sequence generated by a filler model can be used to reject incorrectly detected keywords. The quality of this technique depends on how good the acoustic match is between an incoming utterance and a bigram phoneme filler model. Key-Words: keyword spotting, filler model, confidence measure, acoustic baseforms.

Choosing a Dialogue System's Modality in Order to Minimize User's Workload

Communication during human-machine interaction often happens only as a secondary task that distracts the user’s main focus from a primary task. In our study, the primary task was driving a vehicle and the secondary task was an interaction with a dialogue system on a tablet device using touch and speech. In this paper we present the design and the analysis of a study that can be used to create an optimal strategy for a dialogue manager that takes several metrics into consideration. These include the type of information we require from the user, the expected cognitive load on the user, the expected duration of the user’s response and the expected error rate.

Air Traffic Control Communication

The corpus contains recordings of communication between air traffic controllers and pilots. The speech is manually transcribed and labeled with information about the speaker (pilot/controller, not the full identity of the person). The corpus is currently small (20 hours) but we plan to search for additional data next year. The audio data format is: 8 kHz, 16-bit PCM, mono.
