Pavel Ircing | University of West Bohemia

Papers by Pavel Ircing

Research paper thumbnail of Transformer-Based Automatic Punctuation Prediction and Word Casing Reconstruction of the ASR Output

Lecture Notes in Computer Science, 2021

Research paper thumbnail of BERT-Based Sentiment Analysis Using Distillation

Lecture Notes in Computer Science, 2020

In this paper, we present our experiments with BERT (Bidirectional Encoder Representations from Transformers) models in the task of sentiment analysis, which aims to predict the sentiment polarity of a given text. We trained an ensemble of BERT models on a large self-collected movie-review dataset and distilled the knowledge into a single production model. Moreover, we proposed an improved pooling layer architecture for BERT, which outperforms the standard classification layer while enabling per-token sentiment predictions. We demonstrate our improvements on a publicly available dataset of Czech movie reviews.
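A minimal sketch of the distillation step described above: the ensemble's averaged, temperature-softened predictions serve as soft targets for a single student model via cross-entropy. The logits, temperature, and class count here are made-up illustrations, not values from the paper, and the BERT models themselves are not reproduced.

```python
import numpy as np

def softmax(z, t=1.0):
    """Temperature-scaled softmax over a logit vector."""
    e = np.exp((z - z.max()) / t)
    return e / e.sum()

# Hypothetical logits from two ensemble members over 3 sentiment classes.
teacher_logits = [np.array([2.0, 0.5, -1.0]),
                  np.array([1.5, 1.0, -0.5])]
T = 2.0  # distillation temperature (assumed value)

# Soft targets: average of the teachers' softened distributions.
soft_target = np.mean([softmax(z, T) for z in teacher_logits], axis=0)

# Student prediction and the distillation loss (cross-entropy vs. soft targets).
student_logits = np.array([1.8, 0.7, -0.8])
student_probs = softmax(student_logits, T)
loss = -np.sum(soft_target * np.log(student_probs))
```

In training, `loss` would be minimized with respect to the student's parameters, letting one compact model approximate the ensemble's behavior.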

Research paper thumbnail of Towards Processing of the Oral History Interviews and Related Printed Documents

Language Resources and Evaluation, May 1, 2018

In this paper, we describe the initial stages of our project, the goal of which is to create an integrated archive of recordings, scanned documents, and photographs that would be accessible online and would provide multifaceted search capabilities (spoken content, biographical information, relevant time period, etc.). The recordings contain retrospective interviews with witnesses of the totalitarian regimes in Czechoslovakia; the vocabulary used in such interviews includes many archaic words and named entities that are now quite rare in everyday speech. The scanned documents consist of text materials and photographs, mainly from the home archives of the interviewees or the archive of the State Security. These documents are usually typewritten or even handwritten and are of very poor optical quality. In order to build the integrated archive, we will employ mainly methods of automatic speech recognition (ASR), automatic indexing and search in the recognized recordings and, to a certain extent, also optical character recognition (OCR). Other natural language processing techniques, such as topic detection, are planned for the later stages of the project. This paper focuses on the processing of the speech data using ASR and of the scanned typewritten documents using OCR, and describes the initial experiments.

Research paper thumbnail of A study of different weighting schemes for spoken language understanding based on convolutional neural networks

This paper describes the development of a stateless spoken language understanding (SLU) module based on artificial neural networks that is able to deal with the uncertainty of the automatic speech recognition (ASR) output. The work builds upon the concept of weighted neurons introduced previously by the authors and presents a generalized weighting term for such a neuron. The effect of different forms and parameter estimation methods of the weighting term is experimentally evaluated on a multi-task training corpus created by merging two different semantically annotated corpora. The robustness of the best-performing weighting schemes is then demonstrated by experiments involving hybrid word-semantic (WSE) lattices and a limited-data scenario.

Research paper thumbnail of Combining Textual and Speech Features in the NLI Task Using State-of-the-Art Machine Learning Techniques

Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

We summarize the involvement of our CEMI team in the "NLI Shared Task 2017", which deals with both textual and speech input data. We submitted the results achieved by three different system architectures; each of them combines multiple supervised learning models trained on various feature sets. As expected, better results are achieved by the systems that use both the textual data and the spoken responses; combining the input data of the two modalities led to a rather dramatic improvement in classification performance. Our best-performing method is based on a set of feed-forward neural networks whose hidden-layer outputs are combined using a softmax layer. We achieved a macro-averaged F1 score of 0.9257 on the (unseen) evaluation test set, and our team placed first in the main task together with three other teams.
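The fusion scheme described above can be sketched as follows: several feed-forward networks produce hidden-layer outputs that are concatenated and passed through a final softmax layer. All weights, feature dimensions, and the ReLU non-linearity here are illustrative stand-ins, not the trained CEMI models.

```python
import numpy as np

rng = np.random.default_rng(0)

def hidden_output(x, w, b):
    """One network's hidden-layer activation (ReLU assumed for illustration)."""
    return np.maximum(0.0, x @ w + b)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

n_classes = 11                   # the NLI 2017 task distinguishes 11 L1 languages
x_text = rng.normal(size=20)     # textual feature vector (assumed dimensionality)
x_speech = rng.normal(size=15)   # acoustic feature vector (assumed dimensionality)

# Each modality has its own feed-forward network with an 8-unit hidden layer.
h_text = hidden_output(x_text, rng.normal(size=(20, 8)), np.zeros(8))
h_speech = hidden_output(x_speech, rng.normal(size=(15, 8)), np.zeros(8))

# Fusion: concatenate the hidden representations, then one shared softmax layer.
fused = np.concatenate([h_text, h_speech])
w_out = rng.normal(size=(16, n_classes))
probs = softmax(fused @ w_out)
```

The key design point is that the per-modality networks are combined at the hidden-representation level rather than by averaging their final predictions.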

Research paper thumbnail of Prague Dependency Treebank - Consolidated 1.0 (PDT-C 1.0)

Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL), 2020

Research paper thumbnail of Prague Dependency Treebank of Spoken Czech 2.0 (PDTSC 2.0)

Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL), 2017

Research paper thumbnail of Facilitating Digital Humanities Research in Central Europe through the Advanced Speech and Image Processing Technologies

Research paper thumbnail of USC-SFI MALACH Interviews and Transcripts Czech

Research paper thumbnail of Access to Large Spoken Archives

The paper presents the issues encountered in processing spontaneous Czech speech in the MALACH project. Specific problems connected with the frequent occurrence of colloquial words in spontaneous Czech are analyzed; a partial solution is proposed and experimentally evaluated.

Research paper thumbnail of Benefit of Proper Language Processing for Czech Speech Retrieval in the CL-SR Task at CLEF 2006?

The paper describes the system built by the team from the University of West Bohemia for participation in the CLEF 2006 CL-SR track. We decided to concentrate only on monolingual searching in the Czech test collection and to investigate the effect of proper language processing on the retrieval performance. We employed the Czech morphological analyser and tagger for that purpose. For the actual search system, we used the classical tf.idf approach with blind relevance feedback as implemented in the Lemur toolkit. The results indicate that suitable linguistic preprocessing is indeed crucial for Czech IR performance.
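The tf.idf core of such a retrieval system can be sketched in a few lines. This is a toy illustration with invented lemmatized documents and query; the actual system used the Lemur toolkit, Czech morphological preprocessing, and blind relevance feedback, none of which are reproduced here.

```python
import math
from collections import Counter

# Toy collection of already-lemmatized Czech documents (invented content).
docs = {
    "d1": "prazske povstani svedek vypoved",
    "d2": "svedek vypoved archiv nahravka",
    "d3": "nahravka archiv archiv",
}
N = len(docs)
tokenized = {d: t.split() for d, t in docs.items()}

# Document frequency: in how many documents each term occurs.
df = Counter(w for toks in tokenized.values() for w in set(toks))

def score(query, doc_tokens):
    """Sum of tf * idf over query terms present in the collection."""
    tf = Counter(doc_tokens)
    return sum(tf[w] * math.log(N / df[w]) for w in query.split() if w in df)

# Rank documents for the query "svedek vypoved" (witness testimony).
ranked = sorted(docs, key=lambda d: score("svedek vypoved", tokenized[d]),
                reverse=True)
```

Blind relevance feedback would then expand the query with frequent terms from the top-ranked documents and re-run the scoring, which is what the Lemur toolkit automates.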

Research paper thumbnail of Automatic Correction of i/y Spelling in Czech ASR Output

Text, Speech, and Dialogue, 2020

This paper concentrates on the design and evaluation of a method able to automatically correct the spelling of i/y in Czech words at the output of the ASR decoder. After analyzing both the Czech grammar rules and the data, we decided to deal only with the endings consisting of the consonants b/f/l/m/p/s/v/z followed by i/y in both short and long forms. The correction is framed as a classification task in which each word can belong to the "i" class, the "y" class, or the "empty" class. Using the state-of-the-art Bidirectional Encoder Representations from Transformers (BERT) architecture, we were able to substantially improve the correctness of the i/y spelling on both simulated and real ASR output. Since the misspelling of i/y in Czech texts is seen by the majority of native Czech speakers as a blatant error, the corrected output greatly improves the perceived quality of the ASR system.
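The classification framing above can be illustrated with a toy correction step: each word receives a predicted label from {"i", "y", "empty"}, and a trailing i/y (short or long) after one of the consonants b/f/l/m/p/s/v/z is rewritten accordingly. The classifier itself (a BERT model in the paper) is replaced here by hard-coded stub labels, and the example sentence is invented.

```python
import re

CONSONANTS = "bflmpsvz"

def apply_label(word, label):
    """Rewrite a trailing i/y (or long í/ý) after a relevant consonant.

    The "empty" class, and words with no matching ending, are left unchanged.
    """
    m = re.search(rf"([{CONSONANTS}])([iy]|í|ý)$", word)
    if not m or label == "empty":
        return word
    # Preserve vowel length: map the label to short i/y or long í/ý.
    if m.group(2) in ("i", "y"):
        repl = label
    else:
        repl = "í" if label == "i" else "ý"
    return word[:m.start(2)] + repl

# Suppose the ASR decoder emitted "psy štěkaly" where correct Czech
# (animate masculine plural) is "psi štěkali".
asr_output = ["psy", "štěkaly"]
labels = ["i", "i"]   # hypothetical classifier predictions
corrected = [apply_label(w, lab) for w, lab in zip(asr_output, labels)]
```

The real system predicts the labels per token from the sentence context, which is exactly what makes the i/y decision tractable for BERT-style models.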

Research paper thumbnail of Prague DaTabase of Spoken Czech 1.0

PDTSC 1.0 is a multi-purpose corpus of spoken language. 768,888 tokens, 73,374 sentences and 7,324 minutes of spontaneous dialog speech have been recorded, transcribed and edited in several interlinked layers: audio recordings, automatic and manual transcription, and manually reconstructed text. PDTSC 1.0 is a delayed release of data annotated in 2012. It is an update of Prague Dependency Treebank of Spoken Language (PDTSL) 0.5 (published in 2009). In 2017, Prague Dependency Treebank of Spoken Czech (PDTSC) 2.0 was published as an update of PDTSC 1.0.

Research paper thumbnail of Czech Malach Cross-lingual Speech Retrieval Test Collection

The package contains Czech recordings of the Visual History Archive, which consists of interviews with Holocaust survivors. The archive comprises audio recordings, four types of automatic transcripts, manual annotations of selected topics, and interview metadata. In total, the archive contains 353 recordings and 592 hours of interviews.

Research paper thumbnail of An Engine for Online Video Search in Large Archives of the Holocaust Testimonies

In this paper we present an online system for cross-lingual lexical (full-text) search in a large archive of Holocaust testimonies. Video interviews recorded in two languages (English and Czech) were automatically transcribed and indexed in order to provide efficient access to the lexical content of the recordings. The engine takes advantage of a state-of-the-art speech recognition system and performs fast spoken term detection (STD), providing direct access to the segments of interviews containing the queried words or short phrases.

Research paper thumbnail of Adjusting BERT’s Pooling Layer for Large-Scale Multi-Label Text Classification

Text, Speech, and Dialogue, 2020

In this paper, we present our experiments with BERT models in the task of Large-scale Multi-label Text Classification (LMTC). In the LMTC task, each text document can have multiple class labels, while the total number of classes is on the order of thousands. We propose a pooling layer architecture on top of BERT models which improves the quality of classification by using information from the standard [CLS] token in combination with the pooled sequence output. We demonstrate the improvements on Wikipedia datasets in three different languages using publicly available pre-trained BERT models.
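The pooling idea described above can be sketched as combining the [CLS] vector with a pooled summary of the remaining token vectors before classification. The shapes, the mean-pooling choice, and the concatenation strategy below are illustrative assumptions; the paper's exact architecture may differ, and random numbers stand in for real BERT hidden states.

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, hidden = 128, 768                          # typical BERT-base shapes
hidden_states = rng.normal(size=(seq_len, hidden))  # stand-in for the last layer

cls_vec = hidden_states[0]                 # standard [CLS] representation
mean_vec = hidden_states[1:].mean(axis=0)  # pooled sequence output (mean assumed)

# Combine both views of the document, then classify.
pooled = np.concatenate([cls_vec, mean_vec])        # shape (2 * hidden,)

n_labels = 5000                                     # LMTC: thousands of labels
w = rng.normal(size=(2 * hidden, n_labels)) * 0.01
logits = pooled @ w
probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid per label: multi-label setting
```

Note the per-label sigmoid rather than a softmax: in LMTC a document may carry many labels at once, so each label is an independent binary decision.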

Research paper thumbnail of On the Use of Grapheme Models for Searching in Large Spoken Archives

2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018

This paper explores the possibility of using grapheme-based word and sub-word models in the task of spoken term detection (STD). The use of grapheme models eliminates the need for expert-prepared pronunciation lexicons (which are often far from complete) and/or trainable grapheme-to-phoneme (G2P) algorithms, which are frequently rather inaccurate, especially for rare words (e.g., words coming from a different language). Moreover, the G2P conversion of the search terms, which needs to be performed on-line, can substantially increase the response time of the STD system. Our results show that with various grapheme-based models we can achieve STD performance (measured in terms of ATWV) comparable to phoneme-based models, but without the additional burden of G2P conversion.

Research paper thumbnail of OCR Improvements for Images of Multi-page Historical Documents

Speech and Computer, 2021

Research paper thumbnail of An Automated Pipeline for Robust Image Processing and Optical Character Recognition of Historical Documents

Speech and Computer, 2020

In this paper we propose a pipeline for processing scanned historical documents into an electronic text form that can then be indexed and stored in a database. The nature of the documents presents a substantial challenge for standard automated techniques: not only is there a mix of typewritten and handwritten documents of varying quality, but the scanned pages often contain multiple documents at once. Moreover, the language of the texts alternates mostly between Russian and Ukrainian, but other languages also occur. The paper focuses mainly on segmentation, document type classification, and image preprocessing of the scanned documents; the output of those methods is then passed to off-the-shelf OCR software and a baseline performance is evaluated on a simplified OCR task.

Research paper thumbnail of Air traffic control communication (ATCC) speech corpora and their use for ASR and TTS development

Language Resources and Evaluation, 2019

The paper introduces the motivation for creating dedicated speech corpora of air traffic control communication, describes in detail the process of preparing the corpora for both automatic speech recognition and text-to-speech synthesis, presents an illustrative example of a speech recognition system developed using the automatic speech recognition corpora, and finally describes the technical aspects of the data and the distribution channel.
