Ivan Tashev - Profile on Academia.edu

Papers by Ivan Tashev

With the mass proliferation of cellular phones and other small form factor devices such as PDAs and other handhelds, their usage in noise-adverse environments has increased substantially. With the adoption of 3G and 4G wireless technologies, the transition to a videophone mode of communication is imminent. In addition, most modern mobile phones have integrated cameras and are able to

SiMPE

Proceedings of the 11th International Conference on Human-Computer Interaction with Mobile Devices and Services - MobileHCI '09, 2009

SiMPE

Proceedings of the 13th International Conference on Human Computer Interaction with Mobile Devices and Services - MobileHCI '11, 2011

Proceedings of the tenth ACM international conference on Multimedia - MULTIMEDIA '02, 2002

The common meeting is an integral part of everyday life for most workgroups. However, due to travel, time, or other constraints, people are often not able to attend all the meetings they need to. Teleconferencing and recording of meetings can address this problem. In this paper we describe a system that provides these features, as well as a user study evaluation of the system. The system uses a variety of capture devices (a novel 360° camera, a whiteboard camera, an overview camera, and a microphone array) to provide a rich experience for people who want to participate in a meeting from a distance. The system is also combined with speaker clustering, spatial indexing, and time compression to provide a rich experience for people who miss a meeting and want to watch it afterward.

Mobile devices are being used in more and more adverse noise environments. This increases the requirements for designing headsets for these devices. Despite the limitations imposed by the need for longer battery life, some designs use multiple microphones to achieve appropriate noise suppression. Unfortunately, the most common approaches for designing the beamformer do not provide good results due to the complex way that the sound generated by the mouth travels around the head to reach the area around the ear, where the headset is usually positioned. In this paper we propose a data-driven approach for designing a time-invariant beamformer using a set of calibration files. The approach is illustrated with the evaluation of a beamformer for a binaural headset. The evaluation criteria are the improvement in output signal-to-noise ratio (SNR) and an objective evaluation of the perceptual sound quality. The proposed design approach delivers a 6.8 dBC improvement in SNR and a 0.46 MOS point improvement in sound quality, compared to a single microphone.

Speech recognition technology is prone to mistakes, but this is not the only source of errors that cause speech recognition systems to fail; sometimes the user simply does not utter the command correctly. Usually, user mistakes are not considered when a system is designed and evaluated. This creates a gap between the claimed accuracy of the system and the actual accuracy perceived by the users. We address this issue quantitatively in our in-car infotainment media search task and propose expanding the capability of voice command to accommodate user mistakes while retaining a high percentage of the performance for queries with correct syntax. As a result, failures caused by user mistakes were reduced by an absolute 70% at the cost of a drop in accuracy of only 0.28%.

The availability of digital maps and mapping software has led to significant growth in location-based software and services. To safely use these applications in mobile and automotive scenarios, users must be able to input precise locations using speech. In this paper, we propose a novel method for location understanding based on spoken intersections. The proposed approach

2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07, 2007

While current post-filtering algorithms for microphone array applications can enhance beamformer output signals, they assume that the noise is either incoherent or diffuse, and make no allowances for point noise sources which may be strongly correlated across the microphones. In this paper, we present a novel post-filtering algorithm that relaxes this assumption by tracking the spatial as well as spectral distribution of the speech and noise sources present. A generative statistical model is employed to model the speech and noise sources at distinct regions in the soundfield, and incremental Bayesian learning is used to track the model parameters over time. This approach allows a post-filter derived from these parameters to effectively suppress both diffuse ambient noise and interfering point sources. The performance of the proposed approach is evaluated on multiple recordings made in a realistic office environment.

2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 2010

In this paper, independent component analysis (ICA) in a subband domain has been extended into a feed-forward network. The feed-forward network maximizes mutual independence of separated current frames using information from both current and previous multi-channel frames of speech signals captured by a microphone array. To guide the network toward a proper separation and prevent permutation and arbitrary scaling, we not only rely on the steered response for the first tap of the demixing filter but also penalize on the direction, thus drastically increasing the mean squared error with the spatial filtered output. After convergence, by applying instantaneous direction of arrival (IDOA) based post-processing, we can additionally suppress the leakage of interference as well as the reverberated target signal. The signal-to-interference ratio (SIR) is improved by more than 20 dBC for distances up to 2.7 m and angle differences down to 26°.

SiMPE

Proceedings of the 12th international conference on Human computer interaction with mobile devices and services - MobileHCI '10, 2010

IEEE Signal Processing Magazine, 2011

Over the last decade, our ability to access, store, and consume huge amounts of media and information on mobile devices has skyrocketed. While this has allowed people who are on the go to be more entertained, informed, and connected, the small form factor of mobile devices makes managing all of this content a difficult task. This difficulty is significantly amplified when we consider how many people are using these devices while driving in automobiles and the high risk of driver distraction such devices present. A recent government study concluded that drivers performing complex secondary tasks such as operating or viewing a mobile device or personal digital assistant (PDA) were between 1.7 and 5.5 times more likely to be involved in a crash or near crash.

2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07, 2007

In this paper, we propose a novel adaptive beamforming algorithm with enhanced noise suppression capability. The proposed algorithm incorporates the sound-source presence probability into the adaptive blocking matrix, which is estimated based on the instantaneous direction of arrival of the input signals and voice activity detection. The proposed algorithm guarantees robustness to steering vector errors without imposing ad hoc constraints on the adaptive filter coefficients. It can provide good suppression performance for both directional interference signals and isotropic ambient noise. In an in-car environment, the proposed beamformer shows an SNR improvement of up to 12 dB without using an additional noise suppressor.

IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2005., 2005

The need for hands-free communication has led to an increased popularity in the use of headsets with mobile phones. Comfort and portability concerns have led to the desire for headsets with a small form factor. Unfortunately, this size constraint typically requires that the microphone be placed farther from the user's mouth, making it highly susceptible to environmental noise. One long-term goal of our work is to develop a headset that can achieve the sound capture performance of a close-talking microphone located at the user's mouth, while maintaining the desired compact size. Toward this end, we have designed a headset consisting of three air microphones and a bone-conductive sensor. The speech enhancement is performed in two stages, a fixed beamformer followed by a single-channel adaptive post-filter. Unlike other techniques, the beamformer is calibrated in a purely data-driven manner. We present preliminary experimental results using real data collected in multiple environments. The proposed approach results in significant improvements in both speech recognition accuracy and SNR.

2006 IEEE International Conference on Multimedia and Expo, 2006

Group-to-individual (G2I) distributed meetings are an important but understudied area. Because of the asymmetry between different parties in G2I meetings, they pose two unique challenges: 1) the remote participant tends to be ignored by the local participants; and 2) the remote participant has an inferior audio, video, and data experience compared with the local participants. To address these issues, in this paper we present PING, a system explicitly designed for G2I distributed meetings that combines recent advances in both hardware, e.g., microphone arrays, remote person stand-in devices, and software, e.g., audio-video processing, to improve users' G2I meeting experience. We report how PING addresses the above two challenges and its system design and implementation.

2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011

Speaker identification is a well-established research problem but has not been a major application used in gaming scenarios. In this paper, we propose a new algorithm for the open-set, text-independent, speaker ID problem, applied as an important component (among other cues) of a game player identification system. This scenario poses new challenges: far-field, limited training and very short test data, and almost real-time processing. To tackle this, we introduce new and more informative feature sets. The scores given by these feature sets are then combined in an optimal way to construct the final score. Experimental results on the gaming device's processed reverberated speech show the effectiveness of the new features, and that reliable decisions can be made after the very short (2-5 second) test utterances required by the gaming scheme.

2009 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, 2009

In this paper we describe a generic architecture for single-channel speech enhancement. We assume processing in the frequency domain and suppression-based speech enhancement methods. The framework consists of a two-stage voice activity detector, a noise variance estimator, a suppression rule, and an uncertain presence of the speech signal modifier. The evaluation corpus is a synthetic mixture of clean speech (TIMIT database) and in-car recorded noises. Using the framework, multiple speech enhancement algorithms are tuned for maximum performance. We propose a formalized procedure for automated tuning of these algorithms. The optimization criterion is a weighted sum of the mean opinion score (PESQ-MOS), signal-to-noise ratio (SNR), log-spectral distance (LSD), and mean square error (MSE). The proposed framework provides a complete speech enhancement chain and can be used for evaluation and tuning of other suppression rules and voice activity detector algorithms.
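The weighted-sum criterion named in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the weights, the frame length, and the sign convention (higher score is better, so LSD and MSE get negative weights) are all assumptions, and the PESQ-MOS value is simply passed in as a precomputed number.

```python
import cmath
import math

def mse(clean, enhanced):
    """Mean square error between two equal-length signals."""
    return sum((c - e) ** 2 for c, e in zip(clean, enhanced)) / len(clean)

def snr_db(clean, enhanced):
    """SNR in dB, treating (enhanced - clean) as the residual noise."""
    signal = sum(c * c for c in clean)
    noise = sum((e - c) ** 2 for c, e in zip(clean, enhanced))
    return float("inf") if noise == 0 else 10.0 * math.log10(signal / noise)

def power_spectrum(frame):
    """Power spectrum of one frame via a naive DFT (fine for short frames)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))) ** 2
            for k in range(n // 2 + 1)]

def log_spectral_distance(clean, enhanced, frame_len=32, eps=1e-12):
    """Mean over frames of the RMS log-spectral difference in dB."""
    dists = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        pc = power_spectrum(clean[start:start + frame_len])
        pe = power_spectrum(enhanced[start:start + frame_len])
        sq = [(10.0 * math.log10((a + eps) / (b + eps))) ** 2
              for a, b in zip(pc, pe)]
        dists.append(math.sqrt(sum(sq) / len(sq)))
    return sum(dists) / len(dists)

def tuning_score(clean, enhanced, pesq_mos, weights=(1.0, 0.1, -0.1, -1.0)):
    """Weighted sum of PESQ-MOS, SNR, LSD and MSE (weights are hypothetical)."""
    w_mos, w_snr, w_lsd, w_mse = weights
    return (w_mos * pesq_mos
            + w_snr * snr_db(clean, enhanced)
            + w_lsd * log_spectral_distance(clean, enhanced)
            + w_mse * mse(clean, enhanced))
```

An automated tuning loop would evaluate `tuning_score` for each candidate parameter set and keep the best one; how the weights are chosen is not stated in the abstract.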

2009 IEEE International Conference on Acoustics, Speech and Signal Processing, 2009

This paper addresses the problem of using unstructured queries to search a structured database in voice search applications. By incorporating structural information in music metadata, the end-to-end search error has been reduced by 15% on text queries and up to 11% on spoken queries. Based on that, an HMM sequential rescoring model has reduced the error rate by 28% on text queries and up to 23% on spoken queries compared to the baseline system. Furthermore, a phonetic similarity model has been introduced to compensate for speech recognition errors, which has improved the end-to-end search accuracy consistently across different levels of speech recognition accuracy.

Microphone array post-processing using instantaneous direction of arrival

In this paper we describe a novel algorithm for post-processing a microphone array's beamformer output to achieve better spatial filtering under noise and reverberation. For each audio frame and frequency bin the algorithm estimates the spatial probability for sound source presence and applies a spatio-temporal filter towards the look-up direction. It is implemented as a real-time post-processor after a time-invariant beamformer and it substantially improves the directivity of the microphone array. The algorithm is CPU efficient and adapts quickly when the listening direction changes. It was evaluated with a linear four-element microphone array. The directivity index improvement is up to 8 dB, and the suppression of a jammer 40° from the sound source is up to 17 dB.

The Journal of the Acoustical Society of America, 2010

Voice activity detectors (VAD) are an integral part of modern speech processing, speech enhancement, and speech encoding systems. One of the major problems in practical realizations is to achieve robust VAD in conditions of background noise. Most of the statistical model-based approaches employ the Gaussian assumption in the discrete Fourier transform (DFT) domain, which deviates from the real observation. In this paper, we propose a class of VAD algorithms based on several statistical models of the probability density functions of the magnitudes. In addition, we evaluate several approaches for combining the likelihoods for each frequency bin for estimation of the likelihood for the entire frame. A data corpus with in-car noise is then used to evaluate the VAD and the results are discussed.
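For reference, the Gaussian baseline that this abstract contrasts against can be sketched as below. This is the standard likelihood-ratio test under the complex Gaussian model, not the class of algorithms proposed in the paper. Averaging the per-bin log-likelihood ratios is only one possible way to combine the bins, and the function names, the fixed a priori SNR, and the threshold are illustrative assumptions.

```python
import math

def bin_log_likelihood_ratio(power, noise_var, prior_snr):
    """Log-likelihood ratio (speech present vs. absent) for one DFT bin
    under the complex Gaussian model."""
    post_snr = power / noise_var  # a posteriori SNR of this bin
    return post_snr * prior_snr / (1.0 + prior_snr) - math.log(1.0 + prior_snr)

def frame_log_likelihood_ratio(powers, noise_vars, prior_snr=1.0):
    """Combine the bins by averaging their log-likelihood ratios."""
    llrs = [bin_log_likelihood_ratio(p, n, prior_snr)
            for p, n in zip(powers, noise_vars)]
    return sum(llrs) / len(llrs)

def is_speech(powers, noise_vars, prior_snr=1.0, threshold=0.0):
    """Declare speech when the frame-level statistic exceeds the threshold."""
    return frame_log_likelihood_ratio(powers, noise_vars, prior_snr) > threshold
```

In practice the a priori SNR and the noise variances would be estimated per bin and tracked over time; they are held fixed here to keep the sketch short.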
