Konstantin Markov | University of Aizu

Papers by Konstantin Markov

Research paper thumbnail of Application of Articulatory Movement Dynamics to Speech Recognition (Hearing and Speech / General)

Scientific Programming, Jun 19, 2003

Research paper thumbnail of Conclusions and Future Directions

Springer eBooks, Feb 26, 2009

Research paper thumbnail of Acoustic and Articulatory Information Combination Using Generalized Distillation

Research paper thumbnail of Personality Prediction from Social Media Posts using Text Embedding and Statistical Features

2019 Federated Conference on Computer Science and Information Systems (FedCSIS), Sep 26, 2022

Recent advances in deep learning based language models have boosted performance in many downstream tasks such as sentiment analysis, text summarization, and question answering. Personality prediction from text is a relatively new task that has attracted researchers' attention due to the increased interest in personalized services as well as the availability of social media data. In this study, we propose a personality prediction system where text embeddings from large language models such as BERT are combined with multiple statistical features extracted from the input text. For the combination, we use the self-attention mechanism, which is a popular choice when several information sources need to be merged. Our experiments with the Kaggle MBTI dataset clearly show that adding text statistical features improves system performance relative to using only BERT embeddings. We also analyze the influence of personality type words on the overall results.
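
As an illustration of this kind of fusion, the sketch below combines a placeholder BERT sentence embedding with a handful of text statistics through a single self-attention layer; the feature choices, dimensions, and module names are assumptions for illustration, not the paper's exact architecture.

```python
# Hypothetical sketch: fusing a text embedding with statistical features
# via self-attention. Dimensions and feature choices are illustrative only.
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, embed_dim=768, stat_dim=8, model_dim=128, n_classes=16):
        super().__init__()
        # Project both information sources into a common space.
        self.embed_proj = nn.Linear(embed_dim, model_dim)
        self.stat_proj = nn.Linear(stat_dim, model_dim)
        # Self-attention merges the two "tokens" (embedding, statistics).
        self.attn = nn.MultiheadAttention(model_dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(model_dim, n_classes)  # 16 MBTI types

    def forward(self, text_emb, stat_feats):
        tokens = torch.stack(
            [self.embed_proj(text_emb), self.stat_proj(stat_feats)], dim=1
        )  # (batch, 2, model_dim)
        fused, _ = self.attn(tokens, tokens, tokens)
        return self.head(fused.mean(dim=1))  # average the two fused tokens

# Toy usage with random tensors standing in for BERT output and statistics
# (e.g., word count, average word length, punctuation rate, ...).
model = FusionClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 8))
print(logits.shape)  # torch.Size([4, 16])
```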

Research paper thumbnail of Long-Term Effect Removal for Noisy Speech Recognition

IPSJ SIG Notes, Dec 21, 2000

Noisy speech recognition has recently attracted great interest in speech research. To make an automatic speech recognition system robust to noise, two problems likely have to be solved. One is the detection and identification of the noise; the other is accounting for the noise effect during the recognition process. In this paper, we present a new method for estimating the noise effect using long-term Fourier analysis. We then discuss how to remove the noise effect from corrupted speech to make the recognition system immune to ...
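
The abstract is truncated, but the core idea it names — estimating a long-term spectral characteristic of the noise and removing it from the corrupted signal — can be sketched roughly as follows. This is a generic long-term spectral subtraction, assumed for illustration rather than the paper's exact algorithm.

```python
# Rough sketch (an assumption, not the paper's method): estimate a long-term
# noise spectrum from a noise-only segment and subtract it frame by frame.
import numpy as np

def long_term_spectral_subtraction(signal, noise, n_fft=512, hop=256):
    """Remove an average (long-term) noise magnitude spectrum from a signal."""
    def frames(x):
        n = 1 + max(0, (len(x) - n_fft) // hop)
        return np.stack([x[i * hop : i * hop + n_fft] for i in range(n)])

    window = np.hanning(n_fft)
    noise_spec = np.abs(np.fft.rfft(frames(noise) * window, axis=1))
    long_term_noise = noise_spec.mean(axis=0)          # long-term estimate

    out = np.zeros(len(signal), dtype=float)
    for i, frame in enumerate(frames(signal) * window):
        spec = np.fft.rfft(frame)
        mag = np.maximum(np.abs(spec) - long_term_noise, 0.0)  # subtract, floor at 0
        clean = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=n_fft)
        out[i * hop : i * hop + n_fft] += clean        # overlap-add
    return out

# Toy usage: a sine wave corrupted by white noise.
t = np.arange(16000) / 16000.0
noise = 0.3 * np.random.randn(16000)
noisy = np.sin(2 * np.pi * 440 * t) + noise
enhanced = long_term_spectral_subtraction(noisy, noise)
```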

Research paper thumbnail of Unified User-Interface and Protocol for Managing Heterogeneous Deep Learning Services

New Trends in Software Methodologies, Tools and Techniques, 2017

Research paper thumbnail of Psychoacoustic features explain creakiness classifications made by naive and non-naive listeners

Speech Communication, Feb 1, 2023

Research paper thumbnail of Music Emotion Recognition

The recognition of music emotions using deep learning is one of the latest challenges in the field of speech processing. In this paper, we introduce music emotion recognition using a deep neural network (DNN). In existing methods, multi-linear regression and support vector machines are used to train the model. The simulation results show that recognition with a deep neural network is better than with the traditional methods.
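
As a minimal sketch of what such a DNN-based recognizer might look like — the feature set, layer sizes, and emotion classes here are assumptions, not the paper's configuration:

```python
# Minimal sketch (assumed architecture): a small feed-forward DNN mapping
# audio features (e.g., MFCC statistics) to emotion classes.
import torch
import torch.nn as nn

emotions = ["happy", "sad", "angry", "calm"]  # hypothetical label set

model = nn.Sequential(
    nn.Linear(40, 128),   # 40-dim feature vector per music clip (assumed)
    nn.ReLU(),
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, len(emotions)),
)

features = torch.randn(8, 40)              # batch of 8 clips
probs = model(features).softmax(dim=-1)    # per-clip emotion probabilities
print(probs.argmax(dim=-1))                # predicted emotion indices
```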

Research paper thumbnail of Incorporating Knowledge Sources into Statistical Speech Recognition

Lecture notes in electrical engineering, 2009

Incorporating Knowledge Sources into Statistical Speech Recognition addresses the problem of developing efficient automatic speech recognition (ASR) systems that maintain a balance between utilizing wide knowledge of speech variability and keeping the training and recognition effort feasible, while improving speech recognition performance. The book provides an efficient general framework for incorporating additional knowledge sources into state-of-the-art statistical ASR systems. It can be applied to many existing ASR problems, with their respective model-based likelihood functions, in flexible ways.

Research paper thumbnail of Articulatory and spectrum features integration using generalized distillation framework

It has been shown that combining acoustic and articulatory information can yield significant performance improvements in the automatic speech recognition (ASR) task. In practice, however, articulatory information is not available during recognition, and the general approach is to estimate it from the acoustic signal. In this paper, we propose a different approach based on the generalized distillation framework, in which acoustic-articulatory inversion is not necessary. We trained two DNN models: one, called the "teacher", learns from both acoustic and articulatory features; the other, called the "student", is trained on acoustic features only, but its training is guided by the "teacher" model and can reach a performance that cannot be obtained by regular training, even without articulatory feature inputs at test time. The paper is organized as follows: Section 1 gives the introduction and briefly discusses related work, Section 2 describes the distillation training process, Section 3 describes the ASR system used in this paper, Section 4 presents the experiments, and Section 5 concludes the paper.
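
A minimal sketch of the generalized distillation idea described above, with assumed loss weighting and temperature (the paper's exact settings are not given here):

```python
# Sketch of generalized distillation (assumed hyperparameters): the student
# sees acoustic features only; the teacher's soft outputs act as extra targets.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of hard-label loss and soft teacher-imitation loss."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2                 # standard temperature scaling
    return alpha * hard + (1 - alpha) * soft

# Toy usage: the teacher was trained on acoustic + articulatory features,
# the student receives acoustic features only (tensors here are stand-ins).
student_logits = torch.randn(32, 40, requires_grad=True)   # 40 phoneme classes
teacher_logits = torch.randn(32, 40)
labels = torch.randint(0, 40, (32,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```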

Research paper thumbnail of Design and Implementation of HMM/BN Acoustic Models

IEICE technical report. Speech, Dec 13, 2004

Research paper thumbnail of Sentence embedding based emotion recognition from text data

Research paper thumbnail of Evaluation of Advanced Language Modeling Techniques for Russian LVCSR

Lecture Notes in Computer Science, 2013

The Russian language is characterized by very flexible word order, which limits the ability of standard n-grams to capture important regularities in the data. Moreover, it is a highly inflectional language with rich morphology, which leads to high out-of-vocabulary (OOV) word rates. In this paper, we present a comparison of two advanced language modeling techniques, the factored language model (FLM) and the recurrent neural network (RNN) language model, applied to Russian large vocabulary speech recognition. Evaluation experiments showed that the FLM built using a training corpus of 10M words was better, reducing the perplexity and word error rate (WER) by 20% and 4.0%, respectively. A further WER reduction of 7.4% was achieved when the training data were increased to 40M words and the 3-gram, FLM, and RNN language models were combined by linear interpolation.
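
The linear interpolation mentioned at the end is the standard mixture of model probabilities; a minimal sketch follows, with assumed weights (in practice they are tuned on held-out data, e.g., via EM):

```python
# Sketch of linear language-model interpolation (weights are assumptions;
# in practice they are optimized on a held-out set).
def interpolate(p_ngram, p_flm, p_rnn, weights=(0.4, 0.3, 0.3)):
    """Mix word probabilities from three language models."""
    w1, w2, w3 = weights
    assert abs(w1 + w2 + w3 - 1.0) < 1e-9  # weights must sum to 1
    return w1 * p_ngram + w2 * p_flm + w3 * p_rnn

# Toy usage: probabilities each model assigns to the same next word.
p = interpolate(p_ngram=0.012, p_flm=0.020, p_rnn=0.031)
print(f"interpolated probability: {p:.4f}")  # 0.4*0.012 + 0.3*0.020 + 0.3*0.031
```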

Research paper thumbnail of Future-generation personality prediction from digital footprints

Future Generation Computer Systems

Research paper thumbnail of Large Vocabulary ASR System based on the Hybrid HMM/BN model

Research paper thumbnail of Music Genre and Emotion Recognition Using Gaussian Processes

Gaussian Processes (GPs) are Bayesian nonparametric models that are becoming more and more popular for their superior capabilities to capture highly nonlinear data relationships in various tasks, such as dimensionality reduction, time series analysis, novelty detection, as well as classical regression and classification tasks. In this paper, we investigate the feasibility and applicability of GP models for music genre classification and music emotion estimation. These are two of the main tasks in the music information retrieval (MIR) field. So far, the support vector machine (SVM) has been the dominant model used in MIR systems. Like SVMs, GP models are based on kernel functions and Gram matrices; but, in contrast, they produce truly probabilistic outputs with an explicit degree of prediction uncertainty. In addition, there exist algorithms for GP hyperparameter learning, something the SVM framework lacks. In this paper, we built two systems, one for music genre classification and another for...
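
To make the contrast with SVMs concrete, the sketch below fits a GP classifier and reads off its probabilistic outputs; the data and kernel choice are illustrative assumptions, not the paper's setup.

```python
# Illustrative sketch (not the paper's system): a GP classifier produces
# class probabilities, i.e., predictions with an explicit uncertainty,
# unlike a plain SVM decision function.
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 12))                # 12-dim audio features (assumed)
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)  # two "genres"

gpc = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0))
gpc.fit(X, y)

proba = gpc.predict_proba(X[:3])              # probabilistic outputs
print(proba)  # rows sum to 1; values near 0.5 signal high uncertainty
```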

Research paper thumbnail of High level feature extraction for the self-taught learning algorithm

Eurasip Journal on Audio, Speech, and Music Processing, Apr 9, 2013

The availability of large amounts of raw unlabeled data has sparked the recent surge in semi-supervised learning research. In most works, however, it is assumed that labeled and unlabeled data come from the same distribution. This restriction is removed in the self-taught learning algorithm, where the unlabeled data can be different but must nevertheless have similar structure. First, a representation is learned from the unlabeled samples by decomposing their data matrix into two matrices, called the bases and activations matrices respectively. This procedure is justified by the assumption that each sample is a linear combination of the columns of the bases matrix, which can be viewed as high level features representing the knowledge learned from the unlabeled data in an unsupervised way. Next, activations of the labeled data are obtained using the bases, which are kept fixed. Finally, a classifier is built using these activations instead of the original labeled data. In this work, we investigated the performance of three popular methods for matrix decomposition: Principal Component Analysis (PCA), Non-negative Matrix Factorization (NMF), and Sparse Coding (SC), as unsupervised high level feature extractors for the self-taught learning algorithm. We implemented this algorithm for the music genre classification task using two different databases: one as an unlabeled data pool and the other as data for supervised classifier training. The music pieces come from 10 and 6 genres for each database respectively, while only one genre is common to both of them. Results from a wide variety of experimental settings show that the self-taught learning method improves the classification rate when the amount of labeled data is small and, more interestingly, that consistent improvement can be achieved for a wide range of unlabeled data sizes. The best performance among the matrix decomposition approaches was shown by the Sparse Coding method.
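
The three-step recipe above — learn bases on unlabeled data, compute activations for labeled data, train a classifier on the activations — can be sketched with standard tools. The sparse-coding variant is shown below; the dimensions, data, and classifier are assumptions for illustration only.

```python
# Sketch of the self-taught learning pipeline (illustrative dimensions):
# bases learned on unlabeled data, activations used as features for labeled data.
import numpy as np
from sklearn.decomposition import DictionaryLearning
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_unlabeled = rng.normal(size=(200, 30))     # unlabeled pool (e.g., other genres)
X_labeled = rng.normal(size=(60, 30))        # labeled training data
y_labeled = rng.integers(0, 6, size=60)      # 6 genre labels (assumed)

# Step 1: learn the bases matrix from unlabeled data via sparse coding.
dico = DictionaryLearning(n_components=20, alpha=1.0, max_iter=50,
                          transform_algorithm="lasso_lars", random_state=0)
dico.fit(X_unlabeled)

# Step 2: with the bases kept fixed, compute activations of the labeled data.
activations = dico.transform(X_labeled)

# Step 3: train the classifier on activations instead of raw features.
clf = LogisticRegression(max_iter=1000).fit(activations, y_labeled)
print(clf.score(activations, y_labeled))
```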

Research paper thumbnail of Music genre classification using self-taught learning via sparse coding

The availability of large amounts of raw unlabeled data has sparked the recent surge in semi-supervised learning research. In most works, however, it is assumed that labeled and unlabeled data come from the same distribution. This restriction is removed in the self-taught learning approach, where the unlabeled data can be different but must nevertheless have similar structure. First, a representation is learned from the unlabeled data via sparse coding, and then it is applied to the labeled data used for classification. In this work, we implemented this method for the music genre classification task using two different databases: one as an unlabeled data pool and the other for supervised classifier training. The music pieces come from 10 and 6 genres for each database respectively, while only one genre is common to both of them. Results from a wide variety of experimental settings show that the self-taught learning method improves the classification rate when the amount of labeled data is small and, more interestingly, that consistent improvement can be achieved for a wide range of unlabeled data sizes.

Research paper thumbnail of Emotional Analysis of Music

Music as a form of art is intentionally composed to be emotionally expressive. The emotional features of music are invaluable for music indexing and recommendation. In this paper we present a cross-comparison of automatic emotional analysis of music. We created a public dataset of Creative Commons licensed songs. Using the valence-arousal model, the songs were annotated both in terms of the emotions expressed by the whole excerpt and dynamically with 1 Hz temporal resolution. Each song received 10 annotations on Amazon Mechanical Turk, and the annotations were averaged to form a ground truth. Four different systems from three teams and the organizers were employed to tackle this problem in an open challenge. We compare their performances and discuss best practices. While the effect of a larger feature set was not very apparent in static emotion estimation, the combination of a comprehensive feature set and a recurrent neural network that models temporal dependencies largely outperformed the other proposed methods for dynamic music emotion estimation.

Research paper thumbnail of Articulatory and Spectrum Information Fusion Based on Deep Recurrent Neural Networks

IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019

Many studies have shown that articulatory features can significantly improve the performance of automatic speech recognition systems. Unfortunately, such features are not available at recognition time. There are two main approaches to solving this problem: a feature-based approach, the most popular example of which is acoustic-to-articulatory inversion, where the missing articulatory features are generated from the speech signal, and a model-based approach, where articulatory information is embedded in the model structure and parameters in a way that allows recognition using only acoustic features. In this paper, we propose two new methods to integrate articulatory information into a phoneme recognition system. One of them is feature based, and the other is model based. In both cases, the underlying acoustic model (AM) is a deep neural network-hidden Markov model (DNN-HMM) hybrid. In the feature-based method, the articulatory inversion DNN and the acoustic model DNN are trained jointly using a linear combination of their loss functions. In the model-based method, we utilize the generalized distillation framework to train the AM DNN. In this case, first, a teacher DNN is trained on both the acoustic and articulatory features, and then its outputs are used as additional targets during the AM DNN training with acoustic features only. 7-fold cross-validation experiments using 42 speakers from the XRMB database showed that both of the proposed methods provide about 22% to 25% performance improvement with respect to the DNN acoustic model trained with acoustic features only.
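
A minimal sketch of the feature-based variant — joint training with a linear combination of the inversion loss and the recognition loss. The weighting and network shapes are assumptions, not the paper's configuration:

```python
# Sketch of joint training (assumed weighting): an inversion network predicts
# articulatory features from acoustics, and its loss is linearly combined
# with the phoneme-recognition loss of the acoustic model.
import torch
import torch.nn as nn
import torch.nn.functional as F

acoustic_dim, artic_dim, n_phones = 40, 14, 48   # illustrative sizes
inversion_net = nn.Linear(acoustic_dim, artic_dim)        # acoustic -> articulatory
acoustic_model = nn.Linear(acoustic_dim + artic_dim, n_phones)

opt = torch.optim.Adam(
    list(inversion_net.parameters()) + list(acoustic_model.parameters())
)

# One toy training step on random stand-in data.
acoustics = torch.randn(16, acoustic_dim)
artic_targets = torch.randn(16, artic_dim)       # available only during training
phone_labels = torch.randint(0, n_phones, (16,))

artic_pred = inversion_net(acoustics)
logits = acoustic_model(torch.cat([acoustics, artic_pred], dim=1))

lam = 0.5  # assumed interpolation weight between the two losses
loss = lam * F.mse_loss(artic_pred, artic_targets) \
     + (1 - lam) * F.cross_entropy(logits, phone_labels)

opt.zero_grad()
loss.backward()
opt.step()
```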
