Alexei Ivanov - Academia.edu
Papers by Alexei Ivanov
Automatic Turn Segmentation in Spoken Conversations. Alexei V. Ivanov, Giuseppe Riccardi. Department of Information Engineering and Computer Science, University of Trento, Povo (Trento), Italy. ivanov@disi.unitn.it, riccardi@disi.unitn.it
arXiv (Cornell University), Dec 16, 2022
In this paper, we perform an exhaustive evaluation of different representations to address the intent classification problem in a Spoken Language Understanding (SLU) setup. We benchmark three types of systems for the SLU intent detection task: 1) text-based, 2) lattice-based, and 3) a novel multimodal approach. Our work provides a comprehensive analysis of the achievable performance of different state-of-the-art SLU systems under different circumstances, e.g., automatically- vs. manually-generated transcripts. We evaluate the systems on the publicly available SLURP spoken language resource corpus. Our results indicate that using richer forms of Automatic Speech Recognition (ASR) output, namely word-consensus-networks, allows the SLU system to improve over the 1-best setup (5.5% relative improvement). However, cross-modal approaches, i.e., learning from acoustic and text embeddings, obtain performance similar to the oracle setup, a relative improvement of 17.8% over the 1-best configuration, making them a recommended alternative for overcoming the limitations of working with automatically generated transcripts.
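A minimal sketch of the idea behind using richer ASR output: instead of classifying only the 1-best transcript, an intent classifier can score several hypotheses (e.g., paths read off a word-consensus network) and combine the posteriors. The classifier and intent labels below are placeholders, not the SLURP setup.

```python
# Minimal sketch (not the authors' code): combining intent posteriors over
# multiple ASR hypotheses instead of trusting only the 1-best transcript.
import numpy as np

INTENTS = ["set_alarm", "play_music", "weather_query"]  # hypothetical label set

def classify(text: str) -> np.ndarray:
    # Placeholder intent classifier returning a probability vector over INTENTS.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    p = rng.random(len(INTENTS))
    return p / p.sum()

def intent_from_hypotheses(hypotheses: list[tuple[str, float]]) -> str:
    # hypotheses: (transcript, ASR confidence) pairs, e.g. read off a
    # word-consensus network or an n-best list.
    weights = np.array([w for _, w in hypotheses])
    weights = weights / weights.sum()
    posteriors = np.stack([classify(text) for text, _ in hypotheses])
    combined = (weights[:, None] * posteriors).sum(axis=0)  # confidence-weighted average
    return INTENTS[int(combined.argmax())]

print(intent_from_hypotheses([("wake me at seven", 0.6), ("wake me it's seven", 0.4)]))
```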
Interspeech 2015, 2015
This paper investigates the connection between intelligibility and pronunciation accuracy. We compare which words in non-native English speech are likely to be misrecognized and which words are likely to be marked as pronunciation errors. We found that only 16% of the variability in word-level intelligibility can be explained by the presence of obvious mispronunciations. In some cases, a word remained recognizable or could be identified from the context despite obvious pronunciation errors. In many other cases, the annotators were unable to identify the word when listening to the audio but did not perceive it as mispronounced when presented with its transcription. At the same time, we see high agreement when the results are aggregated across all words from the same speaker.
Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
In this paper, we evaluate different alternatives to process richer forms of Automatic Speech Recognition (ASR) output based on lattice expansion algorithms for Spoken Document Retrieval (SDR). Typically, SDR systems employ ASR transcripts to index and retrieve relevant documents. However, ASR errors negatively affect the retrieval performance. Multiple alternative hypotheses can also be used to augment the input to document retrieval to compensate for the erroneous one-best hypothesis. In Weighted Finite State Transducer-based ASR systems, using the n-best output (i.e. the top "n" scoring hypotheses) for the retrieval task is common, since they can easily be fed to a traditional Information Retrieval (IR) pipeline. However, the n-best hypotheses are highly redundant and do not sufficiently encapsulate the richness of the ASR output, which is represented as an acyclic directed graph called the lattice. In particular, we utilize the lattice's constrained minimum path cover to generate a minimum set of hypotheses that serve as input to the reranking phase of IR. The novelty of our proposed approach is the incorporation of the lattice as an input for neural reranking by considering a set of hypotheses that represents every arc in the lattice. The obtained hypotheses are encoded through sentence embeddings using BERT-based models, namely SBERT and RoBERTa, and the final ranking of the retrieved segments is obtained with a max-pooling operation over the scores computed between the input query and the hypothesis set. We present our evaluation on the publicly available AMI meeting corpus. Our results indicate that the proposed use of hypotheses from the expanded lattice improves the SDR performance significantly over the 1-best ASR output.
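To make the max-pooling step concrete, here is an illustrative sketch (assuming the sentence-transformers package; the model name is an arbitrary stand-in for the SBERT/RoBERTa encoders described in the paper): a retrieved segment is scored by its best-matching lattice hypothesis.

```python
# Illustrative sketch of max-pooled query-hypothesis scoring for reranking.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for the SBERT/RoBERTa encoders

def segment_score(query: str, hypotheses: list[str]) -> float:
    # `hypotheses` would be the minimum set of lattice paths covering every arc.
    q_emb = model.encode(query, convert_to_tensor=True)
    h_emb = model.encode(hypotheses, convert_to_tensor=True)
    sims = util.cos_sim(q_emb, h_emb)   # shape: 1 x len(hypotheses)
    return float(sims.max())            # max-pool over the hypothesis set

query = "budget discussion for the remote control project"
hypotheses = [
    "we should discuss the budget for the remote control",
    "we should discuss the budget for the remote patrol",
]
print(segment_score(query, hypotheses))
```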
Multimodal Interaction with W3C Standards, 2016
As dialog systems become increasingly multimodal and distributed in nature with advances in technology and computing power, they become that much more complicated to design and implement. However, open industry and W3C standards provide a silver lining here, allowing the distributed design of different components that are nonetheless compliant with each other. In this chapter we examine how an open-source, modular, multimodal dialog system, HALEF, can be seamlessly assembled, much like a jigsaw puzzle, by putting together multiple distributed components that are compliant with the W3C recommendations or other open industry standards. We highlight the specific standards that HALEF currently uses, along with a perspective on other useful standards that could be included in the future. HALEF has an open codebase to encourage progressive community contribution and a common standard testbed for multimodal dialog system development and benchmarking.
Proceedings of the 11th …, 2010
We investigate the clarification strategies exhibited by a hybrid POMDP dialog manager based on data obtained from a phone-based user study. The dialog manager combines task structures with a number of POMDP policies, each optimized for obtaining an individual concept. We investigate the relationship between dialog length and task completion. In order to measure the effectiveness of the clarification strategies, we compute concept precisions for two different mentions of the concept in the dialog: first mentions and final values after clarifications and similar strategies, and compare this to a rule-based system on the same task. We observe an improvement in concept precision of 12.1% for the hybrid POMDP compared to 5.2% for the rule-based system.
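As a rough illustration (with a hypothetical data layout, not the study's actual logs), the concept-precision comparison amounts to measuring how often the value captured at first mention versus the final value after clarification matches the user's true goal:

```python
# Rough sketch of the concept-precision comparison with a hypothetical layout.
def concept_precision(dialogs: list[dict], stage: str) -> float:
    # Each dialog is assumed to record, per concept, the value at `first_mention`,
    # the value held at `final` (after clarifications), and the ground `truth`.
    correct = total = 0
    for dialog in dialogs:
        for concept in dialog["concepts"].values():
            total += 1
            correct += int(concept[stage] == concept["truth"])
    return correct / total if total else 0.0

dialogs = [
    {"concepts": {"destination": {"first_mention": "Boston", "final": "Austin", "truth": "Austin"}}},
    {"concepts": {"date": {"first_mention": "Friday", "final": "Friday", "truth": "Friday"}}},
]
print(concept_precision(dialogs, "first_mention"), concept_precision(dialogs, "final"))
```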
Dialogue interaction is a difficult application area for speech recognition technology because of the limited acoustic context, the narrow-band signal, the high variability of spontaneous speech and timing constraints. It is even more difficult in the case of interacting with non-native speakers because of the broader allophonic variation, less canonical prosodic patterns, a higher rate of false starts and incomplete words, unusual word choice and a lower probability of a grammatically well-formed sentence. We present a comparative study of various approaches to speech recognition in a non-native dialogic context. Comparing accuracy and real-time factor, we find that a Kaldi-based Deep Neural Network Acoustic Model (DNN-AM) system with online speaker adaptation by far outperforms other available methods.
Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue
We present the implementation of a large-vocabulary continuous speech recognition (LVCSR) system on NVIDIA's Tegra K1 hybrid GPU-CPU embedded platform. The system is trained on a standard 1000-hour corpus, LibriSpeech, features a trigram WFST-based language model, and achieves state-of-the-art recognition accuracy. The fact that the system runs in real time and consumes less than 7.5 watts at peak makes it perfectly suitable for fast, but precise, offline spoken dialog applications, such as in robotics, portable gaming devices, or in-car systems.
Interspeech 2016, 2016
Recently, text-independent speaker recognition systems with phonetically-aware DNNs, which allow the comparison among different speakers with "soft-aligned" phonetic content, have significantly outperformed standard i-vector based systems [9-12]. However, when applied to speaker recognition on a non-native spontaneous corpus, DNN-based speaker recognition does not show its superior performance due to the relatively lower accuracy of phonetic content recognition. In this paper, noise-aware features and multi-task learning are investigated to improve the alignment of speech feature frames into the sub-phonemic "senone" space and to "distill" the L1 (native language) information of the test takers into bottleneck features (BNFs), which we refer to as metadata-sensitive BNFs. Experimental results show that the system with metadata-sensitive BNFs can improve speaker recognition performance by a 23.9% relative reduction in equal error rate (EER) compared to the baseline i-vector system. In addition, the L1 information is only used to train the BNF extractor, so it does not need to be supplied as input for BNF extraction, i-vector extraction, or scoring on the enrollment and evaluation sets, which avoids relying on erroneous L1s claimed by impostors.
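For reference, a small sketch of how the equal error rate reported above is typically computed from trial scores (standard practice, not the paper's code; the labels and scores below are synthetic):

```python
# Standard EER computation from target/impostor trial scores.
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels: np.ndarray, scores: np.ndarray) -> float:
    # labels: 1 for target trials, 0 for impostor trials; scores: system scores.
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.argmin(np.abs(fnr - fpr))   # operating point where FA rate == miss rate
    return float((fpr[idx] + fnr[idx]) / 2.0)

labels = np.array([1, 1, 1, 0, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.4, 0.35, 0.2, 0.6, 0.7, 0.1])
print(f"EER = {equal_error_rate(labels, scores):.3f}")
# A 23.9% relative reduction means eer_new = eer_baseline * (1 - 0.239).
```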
2006 14th European Signal Processing Conference, Sep 1, 2006
This paper explores properties of the spiking neuron model of the auditory nerve fiber. As follows from the described reasoning, the model response in the form of a spike sequence is in fact a first-order Markov chain of certain non-overlapping sub-sequences, which, taken separately, encode the incoming signal on the corresponding time intervals. This observation comes as a direct consequence of the finite precision of the spike registration process at the higher levels of neural signal processing. The result has important implications for the modelling of the auditory apparatus and for the algorithmic, signal-processing interpretation of hearing physiology.
ETS Research Report Series, 2016
Neuromorphic audio processing: A model simulation of the way auditory neurons encode signals. Alexei V. Ivanov*, Alexander A. Petrovsky**. * Speech Technology Center, Moscow, Russian Federation; ** Real Time ...
Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 2015
Dialogue interaction with remote interlocutors is a difficult application area for speech recognition technology because of the limited duration of acoustic context available for adaptation, the narrow-band and compressed signal encoding used in telecommunications, the high variability of spontaneous speech and the processing time constraints. It is even more difficult in the case of interacting with non-native speakers because of the broader allophonic variation, less canonical prosodic patterns, a higher rate of false starts and incomplete words, unusual word choice and a lower probability of a grammatically well-formed sentence. We present a comparative study of various approaches to speech recognition in a non-native context. Comparing systems in terms of their accuracy and real-time factor, we find that a Kaldi-based Deep Neural Network Acoustic Model (DNN-AM) system with online speaker adaptation by far outperforms other available methods.
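The two comparison metrics used here are straightforward to compute; the sketch below (synthetic strings and timings, assuming the jiwer package for edit-distance scoring) shows the word error rate and the real-time factor, defined as decoding time divided by audio duration:

```python
# Sketch of the two system-comparison metrics: WER and real-time factor (RTF).
import jiwer  # assumption: jiwer is installed; any edit-distance WER tool works

def real_time_factor(decode_seconds: float, audio_seconds: float) -> float:
    # RTF < 1.0 means the recognizer decodes faster than real time.
    return decode_seconds / audio_seconds

reference = "please book a flight to boston on friday"
hypothesis = "please book a flight to austin on friday"

print(f"WER = {jiwer.wer(reference, hypothesis):.3f}")
print(f"RTF = {real_time_factor(decode_seconds=0.8, audio_seconds=3.2):.2f}")
```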
Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 2015
We have previously presented HALEF, an open-source spoken dialog system that supports telephonic interfaces and has a distributed architecture. In this paper, we extend this infrastructure to be cloud-based, and thus truly distributed and scalable. This cloud-based spoken dialog system can be accessed both via telephone interfaces as well as through web clients with WebRTC/HTML5 integration, allowing in-browser access to potentially multimodal dialog applications. We demonstrate the versatility of the system with two conversation applications in the educational domain.
2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012
Automatic emotion recognition from speech is limited by the ability to discover the relevant predicting features. The common approach is to extract a very large set of features over a generally long analysis time window. In this paper we investigate the applicability of the two-sample Kolmogorov-Smirnov statistical test (KST) to the problem of segmental speech emotion recognition. We train emotion classifiers for each speech segment within an utterance. The segment labels are then combined to predict the dominant emotion label. Our findings show that KST can be successfully used to extract statistically relevant features. The KST criterion is used to optimize the parameters of the statistical segmental analysis, namely the window segment size and shift. We carry out seven binary-class emotion classification experiments on the Emo-DB corpus and evaluate the impact of the segmental analysis and emotion-specific feature selection.
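As an illustrative sketch (synthetic features, not the Emo-DB setup), the two-sample Kolmogorov-Smirnov test can rank candidate features by how well their distributions separate two emotion classes; larger KS statistics indicate more discriminative features:

```python
# KS-test-based feature ranking for a binary emotion classification task.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Hypothetical segmental features: rows = speech segments, columns = features.
features_angry = rng.normal(loc=[0.0, 1.0, 0.2], scale=1.0, size=(200, 3))
features_neutral = rng.normal(loc=[0.0, 0.0, 0.2], scale=1.0, size=(200, 3))

ks_stats = [
    ks_2samp(features_angry[:, j], features_neutral[:, j]).statistic
    for j in range(features_angry.shape[1])
]
ranking = np.argsort(ks_stats)[::-1]   # most discriminative feature first
print("KS statistics:", np.round(ks_stats, 3), "ranking:", ranking)
```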
2013 IEEE Workshop on Automatic Speech Recognition and Understanding, 2013
We explore the impact of speech- and speaker-specific modeling on the Modulation Spectrum Analysis-Kolmogorov-Smirnov feature Testing (MSA-KST) characterization method in the task of automated prediction of a cognitive impairment diagnosis, namely dysphasia and pervasive developmental disorder. Phoneme-synchronous capturing of speech dynamics is a reasonable choice for a segmental speech characterization system, as it allows comparing speech dynamics in similar phonetic contexts. Speaker-specific modeling aims at reducing the "within-the-class" variability of the characterized speech or speaker population by removing the effect of speaker properties that should have no relation to the characterization. Specifically, the vocal tract length of a speaker has nothing to do with the diagnosis attribution and, thus, the feature set should be normalized accordingly. The resulting system compares favorably to the baseline system of the Interspeech 2013 Computational Paralinguistics Challenge.
Proceedings of the 3rd ACM SIGCHI symposium on Engineering interactive computing systems - EICS '11, 2011
Wall Street Journal, 2004
Anthropomorphic feature extraction algorithm for speech recognition in adverse environments. Alexei V. Ivanov (1), Alexander A. Petrovsky (2). Computer Engineering Department at the Belarusian State University of Informatics ...
We are interested in understanding human personality and its manifestations in human interactions. The automatic analysis of such personality traits in natural conversation is quite complex due to the acquisition of user-profiled corpora, the annotation task and multidimensional modeling. While this topic has been addressed extensively in experimental psychology research, speech and language scientists have so far engaged in only limited experiments. In this paper we describe an automated system for speaker-independent personality prediction in the context of human-human spoken conversations. The evaluation of such a system is carried out on the PersIA human-human spoken dialog corpus, annotated with user self-assessments of the Big-Five personality traits. The personality predictor has been trained on paralinguistic features, and its evaluation on five personality traits shows encouraging results for the conscientiousness and extroversion labels.
We are interested in the problem of extracting meaning structures from spoken utterances in human communication. In Spoken Language Understanding (SLU) systems, parsing of meaning structures is carried out over the word hypotheses generated by the Automatic Speech Recognizer (ASR). This approach suffers from high word error rates and ad-hoc conceptual representations. In contrast, in this paper we aim at discovering meaning components from direct measurements of acoustic and non-verbal linguistic features. The meaning structures are taken from the frame semantics model proposed in FrameNet, a consistent and extendable semantic structure resource covering a large set of domains. We give a quantitative analysis of meaning structures in terms of speech features across human-human dialogs from the manually annotated LUNA corpus. We show that the correlations between acoustic features (pitch, formant trajectories, intensity and harmonicity) and meaning features are statistically significant over the whole corpus, and that these features are relevant for classifying the target words evoked by a semantic frame.
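A toy sketch of the kind of significance test this implies (synthetic data, hypothetical feature name): the point-biserial correlation between a continuous acoustic measurement and a binary frame-evocation label.

```python
# Point-biserial correlation between an acoustic feature and a frame label.
import numpy as np
from scipy.stats import pointbiserialr

rng = np.random.default_rng(1)
is_frame_target = rng.integers(0, 2, size=300)   # 1 = word evokes the target frame
pitch_mean = 120 + 15 * is_frame_target + rng.normal(0, 20, size=300)  # Hz, synthetic

r, p_value = pointbiserialr(is_frame_target, pitch_mean)
print(f"point-biserial r = {r:.3f}, p = {p_value:.4f}")
```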