Benoît Maison - Academia.edu
Papers by Benoît Maison
Proceedings of 1994 IEEE International Symposium on Information Theory, 1994
2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221), 2001
We explore a novel approach for handwriting recognition tasks whose intrinsic vocabularies are too large to be applied directly as constraints during recognition. Our approach makes use of vocabulary constraints, and addresses the fact that some parts of words may be written more recognizably than others. An initial pass is made with an HMM recognizer, without vocabulary constraints, generating a lattice of character-hypothesis arcs representing likely segmentations of the handwriting signal. Arc confidence scores are computed using a posteriori probabilities. The most confidently recognized characters are used to filter the overall vocabulary, generating a word subset small enough to constrain a second recognition pass. With a vocabulary of 273,000 words, we can limit the second pass to 50,000 words and eliminate 39.3% of the word errors made by a one-pass recognizer without vocabulary constraints, and 18.3% of the errors made using a fixed 30,000-word set.
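The vocabulary-filtering step described above can be sketched as follows. This is a minimal illustration of the idea, not the paper's implementation: it assumes the first pass yields per-position character posteriors, and keeps only words consistent with the characters recognized above a confidence threshold (the function name and data layout are hypothetical).

```python
def filter_vocabulary(vocabulary, char_hypotheses, threshold=0.9):
    """Keep only words consistent with confidently recognized characters.

    char_hypotheses: list of (position, char, posterior) tuples, as might
    come from a first unconstrained recognition pass (illustrative format).
    """
    # Retain only the character hypotheses we trust.
    confident = [(pos, ch) for pos, ch, p in char_hypotheses if p >= threshold]
    subset = []
    for word in vocabulary:
        # A word survives if it agrees with every confident character.
        if all(pos < len(word) and word[pos] == ch for pos, ch in confident):
            subset.append(word)
    return subset

vocab = ["house", "horse", "mouse", "haste"]
hyps = [(0, "h", 0.95), (1, "o", 0.92), (2, "r", 0.40)]
print(filter_vocabulary(vocab, hyps))  # ['house', 'horse']
```

In the paper's setting the surviving subset (e.g., 50,000 of 273,000 words) then serves as the language constraint for the second recognition pass.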
Stereoscopic Displays and Virtual Reality Systems XIV, 2007
We present a point-based reconstruction and transmission pipeline for a collaborative tele-immersion system. Two or more users in different locations collaborate with each other in a shared, simulated environment as if they were in the same physical room. Each user perceives point-based models of distant users along with collaborative data such as molecule models. Disparity maps, computed by a commercial stereo solution, are filtered and transformed into clouds of 3D points. The clouds are compressed and transmitted over the network to distant users. On the receiving side, the clouds are decompressed and incorporated into the 3D scene. The viewpoint used to display the 3D scene depends on the position of the user's head. Collaborative data is manipulated through natural hand gestures. We analyse the performance of the system in terms of computation time, latency and photorealistic quality of the reconstructed models.
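The conversion from a disparity map to a 3D point cloud follows the standard stereo back-projection relation Z = f·B/d. The sketch below assumes a rectified pinhole pair with the principal point at the image centre; the focal length and baseline values are placeholders, not the paper's calibration.

```python
import numpy as np

def disparity_to_points(disparity, focal_px, baseline_m):
    """Back-project a disparity map into a cloud of 3D points.

    Assumes a rectified stereo pair, principal point at the image centre.
    focal_px: focal length in pixels; baseline_m: camera baseline in metres.
    """
    h, w = disparity.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    valid = disparity > 0                       # drop unmatched pixels
    z = focal_px * baseline_m / disparity[valid]
    x = (us[valid] - w / 2) * z / focal_px      # back-project image coords
    y = (vs[valid] - h / 2) * z / focal_px
    return np.stack([x, y, z], axis=1)          # one (x, y, z) row per pixel

d = np.array([[0.0, 2.0],
              [4.0, 8.0]])
pts = disparity_to_points(d, focal_px=500.0, baseline_m=0.1)
print(pts.shape)  # (3, 3): three valid pixels, each an (x, y, z) point
```

Filtering invalid disparities before back-projection (the `valid` mask) corresponds to the filtering step the abstract mentions before the clouds are compressed and transmitted.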
Eighth European Conference on Speech …, 2003
... version of Yvon's overlapping chunks [4]. Our objective is to allow any phoneme sequence that is ... Those languages are: Spanish, Italian, French, German, Mandarin, Hindi, Czech and Russian. ... 3.1. Multiple Alignments We represent a multiple alignment as an array Φ of N rows ...
Audio-based speaker identification degrades severely when there is a mismatch between training and test conditions, due either to channel effects or to noise. In this paper, we explore various techniques to fuse video-based speaker identification with audio-based speaker ...
Eighth European Conference on Speech …, 2003
Page 1. Using Place Name Data to Train Language Identification Models Stanley F. Chen, Benoît Maison IBM TJ Watson Research Center PO Box 218, Yorktown Heights, NY 10598 {stanchen,bmaison}@us.ibm.com Abstract ...
… Speech Recognition and …, 2004
Page 1. PRONUNCIATION MODELING FOR NAMES OF FOREIGN ORIGIN Benoît Maison, Stanley F. Chen and Paul S. Cohen IBM TJ Watson Research Center PO Box 218, Yorktown Heights, NY 10598, USA {bmaison,stanchen}@us.ibm.com, pausyl@aol.com ...
Information fusion in the context of combining multiple streams of data (e.g., audio streams and video streams corresponding to the same perceptual process) is considered in a generalized setting. Specifically, we consider the problem of combining visual cues with audio signals for the purpose of improved automatic machine recognition of descriptors, e.g., speech recognition/transcription, speaker change detection, speaker identification and speaker event detection. These are important descriptors of multimedia (video) content for efficient search and retrieval. A general framework for treating all of these fusion problems in a unified setting is presented.
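A common baseline for this kind of audio-visual fusion, sketched below, is a weighted combination of per-class log-likelihoods from each stream. This is an illustrative scheme, not necessarily the framework of the paper; the function name, score dictionaries and the weight `lam` are assumptions.

```python
def fuse_stream_scores(audio_logp, video_logp, lam=0.7):
    """Linear log-likelihood fusion of two streams.

    audio_logp, video_logp: dicts mapping class label -> log-likelihood.
    lam weights the audio stream; (1 - lam) weights the video stream.
    """
    return {c: lam * audio_logp[c] + (1 - lam) * video_logp[c]
            for c in audio_logp}

# Audio alone favours spk1; the video stream tips the decision to spk2.
a = {"spk1": -10.0, "spk2": -12.0}
v = {"spk1": -20.0, "spk2": -8.0}
scores = fuse_stream_scores(a, v, lam=0.5)
best = max(scores, key=scores.get)
print(best)  # spk2
```

In practice the stream weight is often tuned on held-out data, or made dependent on the estimated reliability of each stream (e.g., audio noise level), which connects to the channel/noise mismatch problem discussed above.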
Acoustics, Speech, and Signal …, 2001
We are looking for confidence scoring techniques that perform well on a broad variety of tasks. Our main focus is on word-level error rejection, but most results apply to other scenarios as well. A variation of the normalized cross entropy that is adapted to that ...
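For context, the (unmodified) normalized cross entropy used to evaluate word confidence scores can be computed as below. This is the standard NIST-style definition, not the paper's variation of it; it assumes confidences strictly between 0 and 1 and at least one correct and one incorrect word.

```python
import math

def normalized_cross_entropy(confidences, correct):
    """Standard normalized cross entropy for word confidence scores.

    confidences: estimated P(word is correct) per word, each in (0, 1).
    correct: matching booleans (True if the word was actually correct).
    Returns 1.0 for perfect confidences, 0.0 for the trivial system that
    always outputs the empirical correct rate.
    """
    n = len(correct)
    n_c = sum(correct)
    p_c = n_c / n                                  # empirical correct rate
    # Entropy of the trivial constant-confidence system.
    h_max = -(n_c * math.log2(p_c) + (n - n_c) * math.log2(1 - p_c))
    # Cross entropy of the actual confidence scores.
    h = -sum(math.log2(p) if ok else math.log2(1 - p)
             for p, ok in zip(confidences, correct))
    return (h_max - h) / h_max
```

A confidence estimator that always outputs the base rate scores exactly 0; sharper, well-calibrated confidences push the score toward 1.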
Proc. DARPA Speech Transcription Workshop, 2000
We describe the system used by IBM in the 1999 HUB4 Evaluation under the 10-times-real-time constraint. We detail the system architecture and show that this system is over 20 percent more accurate at the same speed than the system used in the 1998 Evaluation. Furthermore, we have closed the gap between our unlimited-resource system and our 10-times-real-time system from 45 percent to 14 percent.
1996 IEEE Digital Signal Processing Workshop Proceedings, 1996
A new transform/subband coding algorithm is proposed. Its original characteristic is that the transform operator is continuously updated during the encoding process on the basis of previously decoded samples. Unlike other content-based algorithms, no data overhead is implied and the method can be easily incorporated into most image compression schemes. Simulation results demonstrate that a substantial benefit can be ...
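The key property claimed above, adapting the coder from previously decoded samples so that no side information is needed, is the backward-adaptation principle. The sketch below illustrates it with a simple LMS-adapted linear predictor rather than the paper's transform operator (the functions, order and step size are illustrative): because both sides update from decoded data only, the decoder mirrors the encoder exactly.

```python
import numpy as np

def backward_adaptive_encode(samples, order=2, mu=0.01):
    """Encode samples as prediction residuals with a backward-adapted
    LMS predictor. The coefficients are updated only from data the
    decoder will also have, so no coefficients need to be transmitted."""
    w = np.zeros(order)          # predictor coefficients
    history = np.zeros(order)    # most recent decoded samples
    residuals = []
    for x in samples:
        pred = w @ history
        e = x - pred             # only the residual is coded
        residuals.append(e)
        w += mu * e * history    # adapt from decoded data only
        history = np.roll(history, 1)
        history[0] = x
    return np.array(residuals)

def backward_adaptive_decode(residuals, order=2, mu=0.01):
    """Mirror of the encoder: replays the identical adaptation."""
    w = np.zeros(order)
    history = np.zeros(order)
    out = []
    for e in residuals:
        x = w @ history + e      # same predictor state as the encoder
        out.append(x)
        w += mu * e * history
        history = np.roll(history, 1)
        history[0] = x
    return np.array(out)
```

The round trip is exact by construction; in a real codec the residuals (or transform coefficients) would additionally be quantized, and the adaptation would then run on the quantized reconstruction on both sides.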
Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing, 1994
Proceedings of 1st International Conference on Image Processing, 1994