François Grondin | Université de Sherbrooke (University of Sherbrooke)
Papers by François Grondin
Preprint, 2021
SpeechBrain is an open-source and all-in-one speech toolkit. It is designed to facilitate the research and development of neural speech processing technologies by being simple, flexible, user-friendly, and well-documented. This paper describes the core architecture designed to support several tasks of common interest, allowing users to naturally conceive, compare and share novel speech processing pipelines. SpeechBrain achieves competitive or state-of-the-art performance in a wide range of speech benchmarks. It also provides training recipes, pretrained models, and inference scripts for popular speech datasets, as well as tutorials which allow anyone with basic Python proficiency to familiarize themselves with speech technologies.
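As an illustration of the toolkit's intended simplicity, the sketch below loads a pretrained ASR model and transcribes an audio file. The model identifier and the local file name are assumptions based on the pretrained models SpeechBrain distributes; exact names may vary by release.

```python
# Minimal sketch: transcribing a file with a pretrained SpeechBrain ASR model.
# The model identifier and "example.wav" are illustrative assumptions.
from speechbrain.pretrained import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-crdnn-rnnlm-librispeech",  # pretrained recipe hosted on the Hugging Face hub
    savedir="pretrained_models/asr-crdnn-rnnlm-librispeech",
)
print(asr_model.transcribe_file("example.wav"))
```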
This project is part of a broader study on artificial audition to support natural human-robot interaction in rehabilitation. The objective is to leverage capabilities in sound localization and discrimination, speech recognition, and vocal emotion recognition to address concrete needs in rehabilitation. The research team will develop the technology according to the needs of potential users, which will guide the methodology of this research.
Rotary-Wing Unmanned Aerial Vehicles (RW-UAVs), also referred to as drones, have gained in popularity over the last few years. Intrusions over secured areas have become common, and authorities are actively looking for solutions to detect and localize undesired drones. The sound generated by the propellers of an RW-UAV is powerful enough to be perceived by a human observer nearby. In this paper, we examine the use of particle filtering to detect and localize the 3D position of an RW-UAV based on sound source localization (SSL) over distributed microphone arrays (MAs). Results show that the proposed method is able to detect and track a drone with precision, as long as the noise emitted by the RW-UAV dominates the background noise.
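To make the idea concrete, here is a minimal bootstrap particle filter for 3D tracking from direction-of-arrival (DOA) measurements produced by SSL on each array. The random-walk motion model, the Gaussian angular likelihood, and all parameter values are illustrative assumptions, not the paper's exact formulation.

```python
# Bootstrap particle filter sketch for 3-D drone tracking from DOA measurements.
import numpy as np

rng = np.random.default_rng(0)
N = 1000                                   # number of particles (assumed)
particles = rng.uniform(-50.0, 50.0, (N, 3))  # initial 3-D positions in metres
weights = np.full(N, 1.0 / N)

def update(array_pos, doa, sigma=0.2):
    """Incorporate one array's DOA unit vector into the particle set."""
    global particles, weights
    # Predict: random-walk motion model (assumed dynamics).
    particles += rng.normal(0.0, 0.5, particles.shape)
    # Likelihood: angular error between predicted and measured DOA.
    vec = particles - array_pos
    vec /= np.linalg.norm(vec, axis=1, keepdims=True)
    err = np.arccos(np.clip(vec @ doa, -1.0, 1.0))
    weights = weights * np.exp(-0.5 * (err / sigma) ** 2)
    weights /= weights.sum()
    # Resample when the effective sample size collapses.
    if 1.0 / np.sum(weights ** 2) < N / 2:
        idx = rng.choice(N, size=N, p=weights)
        particles = particles[idx]
        weights = np.full(N, 1.0 / N)

# Example: one update from an array at the origin hearing the drone along +x.
update(np.zeros(3), np.array([1.0, 0.0, 0.0]))
estimate = weights @ particles             # weighted mean = position estimate
```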
A telepresence mobile robot is a remote-controlled, wheeled device with wireless internet connectivity for bidirectional audio, video and data transmission. In health care, a telepresence robot could be used to have a clinician or a caregiver assist seniors in their homes without having to travel to these locations. Many mobile telepresence robotic platforms have recently been introduced on the market, bringing mobility to telecommunication and vital sign monitoring at reasonable costs. What is missing to make them effective remote telepresence systems for home care assistance are capabilities specifically needed to assist the remote operator in controlling the robot and perceiving the environment through the robot's sensors or, in other words, minimizing cognitive load and maximizing situation awareness. This paper describes our approach for adding navigation, artificial audition and vital sign monitoring capabilities to a commercially available telepresence mobile robot. This requires the use of a robot control architecture to integrate the autonomous and teleoperation capabilities of the platform.
Audition is a rich source of spatial, identity, linguistic and paralinguistic information. Processing all this information requires acquisition, processing and interpretation of sound sources, which are instantaneous, invisible and noisy signals. This can lead to different responses by the system depending on the information perceived. This paper presents our first implementation of an integration framework for speech processing. Acquisition includes sound capture, sound source localization, tracking, separation and enhancement, and voice activity detection. Processing involves speech and emotion recognition. Interpretation consists of translating speech utterances into commands that can influence interaction through dialogue management and speech synthesis. The paper also describes two visualization interfaces, inspired by comic strips, to represent live vocal interactions in real life environments. These interfaces are used to demonstrate how the framework performs in live interactions and its use in a usability study.
To be used on a mobile robot, speech/non-speech discrimination must be robust to environmental noise and to the position of the interlocutor, without necessarily having to satisfy low-latency requirements. To address these conditions, this paper presents a speech/non-speech discrimination approach based on pitch estimation. Pitch features are robust to noise and reverberation, and can be estimated over a few seconds. Results suggest that our approach is more robust than the use of Mel-Frequency Cepstrum Coefficients with Gaussian Mixture Models (MFCC-GMM) under high reverberation levels and additive noise (with an accuracy above 98% at a latency of 2.21 sec), which makes it ideal for mobile robot applications. The approach is also validated on a mobile robot equipped with an 8-microphone array, using speech/non-speech discrimination based on pitch estimation as a post-processing module of a localization, tracking and separation system.
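A minimal sketch of the underlying idea, under the assumption of a simple autocorrelation pitch estimator (the paper's estimator may differ): frames with a strong autocorrelation peak in the human pitch range count as voiced, and a segment is labelled speech when enough frames are voiced.

```python
# Pitch-based speech/non-speech discrimination sketch (illustrative thresholds).
import numpy as np

def is_speech(signal, fs=16000, frame=1024, hop=512,
              fmin=60.0, fmax=400.0, peak_thresh=0.4, voiced_ratio=0.3):
    lag_min, lag_max = int(fs / fmax), int(fs / fmin)  # pitch range as lags
    voiced = []
    for start in range(0, len(signal) - frame, hop):
        x = signal[start:start + frame] * np.hanning(frame)
        ac = np.correlate(x, x, mode="full")[frame - 1:]  # non-negative lags
        if ac[0] <= 0:                      # silent frame, no energy
            voiced.append(False)
            continue
        ac /= ac[0]                         # normalize by zero-lag energy
        peak = ac[lag_min:lag_max].max()    # strongest peak in pitch range
        voiced.append(peak > peak_thresh)
    return np.mean(voiced) > voiced_ratio if voiced else False
```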
Sound source localization is an important challenge for mobile robots operating in real life settings. Sound sources of interest, such as speech, are often corrupted by broadband coherent noise sources that are non-stationary during transitions between steady-state segments. The interfering noise introduces localization ambiguities, leading to the localization of invalid sound sources. Masks to reduce such interferences perform well under stationary noise, but performance degrades as invalid sound sources generated by noise appear and disappear suddenly during transitions between steady-state segments. This paper presents a new mask based on speech non-stationarity to discriminate between the time difference of arrival (TDOA) of a speech source and noise transitions. Simulations and experiments on a mobile robot suggest that the proposed technique improves TDOA discrimination and significantly reduces the localization of invalid sound sources caused by noise.
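One plausible way to build such a per-bin mask, shown below as an assumption rather than the paper's exact definition, is to compare each frequency bin's instantaneous power against a slowly smoothed estimate and keep only bins that fluctuate the way speech does, rejecting both the stationary noise floor and abrupt noise transitions.

```python
# Speech non-stationarity mask sketch over an STFT power matrix.
import numpy as np

def nonstationarity_mask(stft_power, alpha=0.9, low=2.0, high=100.0):
    """stft_power: (frames, bins) magnitude-squared STFT; returns a 0/1 mask."""
    smoothed = np.empty_like(stft_power)
    smoothed[0] = stft_power[0]
    for t in range(1, len(stft_power)):     # recursive (leaky) average per bin
        smoothed[t] = alpha * smoothed[t - 1] + (1 - alpha) * stft_power[t]
    ratio = stft_power / (smoothed + 1e-12)
    # Speech-like bins rise well above the smoothed floor, but not as
    # explosively as an abrupt noise transition (thresholds are assumed).
    return ((ratio > low) & (ratio < high)).astype(float)
```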
Localization of sound sources in adverse environments is an important challenge in robot audition. The target sound source is often corrupted by coherent broadband noise, which introduces localization ambiguities as noise is often mistaken for the target source. To discriminate the time difference of arrival (TDOA) parameters of the target source and noise, this paper presents a binary mask for weighted generalized cross-correlation with phase transform (GCC-PHAT). Simulations and experiments on a mobile robot suggest that the proposed technique improves TDOA discrimination. It also brings the additional benefit of modulating the computing load according to voice activity.
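For reference, a standard GCC-PHAT TDOA estimator between two microphone signals is sketched below; the binary mask proposed in the paper would weight the cross-spectrum before the inverse transform, which is only hinted at here as an optional argument.

```python
# Standard GCC-PHAT TDOA estimation between two microphone signals.
import numpy as np

def gcc_phat(x1, x2, fs, max_tdoa=1e-3, mask=None):
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12          # PHAT: keep phase, drop magnitude
    if mask is not None:                    # optional per-bin weights, e.g. a binary mask
        cross *= mask
    cc = np.fft.irfft(cross, n)
    max_lag = int(fs * max_tdoa)            # physical limit set by mic spacing
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    return (np.argmax(np.abs(cc)) - max_lag) / fs   # TDOA in seconds
```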
Robots that Talk and Listen - Technology and Social Impact, Nov 2014
Vision and audition provide crucial information for adapting to dynamic and changing environments. While vision is commonly used on robots, robot audition is relatively new and could greatly contribute to enhancing human-robot communication capabilities. Up to now, most speech-recognition systems have been designed to work in a close-talking microphone environment (as is the case for cell phones). However, when a human interacts with a robot, the close-talking microphone condition is no longer satisfied, and many additional phenomena, such as reverberation and additive noise, can no longer be neglected. Moreover, since these robots are mobile, the acoustic conditions change over time. Mobile robots also have to carry and power their own computing resources. To handle these challenges, the approach described in this chapter involves the use of a microphone array and the development of low-complexity, real-time algorithms for sound-source localization, tracking and separation. Localization is used to determine the direction from which each sound source emanates. Tracking allows the system to follow sources as they or the robot move. Separation algorithms are then used to extract and enhance speech signals from simultaneous speakers or sound sources. This results in separated audio streams that can provide improved recognition performance for speaker, sound or speech recognition. A speaker-identification system is described, and related work for speech and sound recognition is also presented.
Proceedings of the IEEE International Symposium on Robot and Human Interactive Communication, Aug 25, 2014
Recognizing a person from a distance is important to establish meaningful social interaction and to provide additional cues regarding the situations experienced by a robot. To do so, face recognition and speaker identification are commonly used biometrics, with identification performance that is influenced by the distance between the person and the robot. This paper presents a system that combines these biometrics with human metrology (HM) to increase identification performance and range. HM measures are derived from 2D silhouettes extracted online using a dynamic background subtraction approach, processing in parallel 45 front features and 24 side features in 400 ms, compared to 38 front and 22 side features extracted in sequence in 30 sec by the approach presented by Lin and Wang. By having each modality identify a set of up to five possible candidates, results suggest that combining modalities provides better performance than each individual modality alone, over a wider range of distances.
IEEE Systems Journal, Jul 30, 2014
One typical remote consultation envisioned for in-home telerehabilitation involves having the patient exercise on a stationary bike. Making sure that the patient is breathing well while pedaling is of primary concern for the remote clinician. One key requirement for in-home telerehabilitation is to make the system as simple as possible for the patients, avoiding, for instance, having them wear sensors and devices. This paper presents a contact-free respiration rate monitoring system measuring temperature variations between inspired and expired air in the mouth–nose region using thermal imaging. The thermal camera is installed on a pan–tilt unit and coupled to a tracking algorithm, allowing the system to keep track of the mouth–nose region as the patient exercises. Results demonstrate that the system works in real time even when the patient moves or rotates their head while exercising. Recommendations are also made to minimize limitations of the system, such as sensitivity to people in the background or to the patient talking, for its eventual use in in-home telerehabilitation sessions.
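Assuming the tracker already yields a per-frame mean temperature for the mouth–nose region, the rate estimation step can be as simple as picking the dominant spectral peak in a plausible breathing band, as in this illustrative sketch (the paper's exact estimator may differ).

```python
# Respiration rate from a mouth-nose region temperature signal (sketch).
import numpy as np

def respiration_rate(temps, fps):
    """temps: mean region temperature per frame; fps: thermal camera frame rate."""
    x = temps - np.mean(temps)              # remove the DC (baseline) component
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fps)
    band = (freqs >= 0.1) & (freqs <= 1.0)  # 6 to 60 breaths/min (assumed band)
    peak = freqs[band][np.argmax(spectrum[band])]
    return peak * 60.0                      # breaths per minute
```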
Demonstration session of the IEEE International Conference on Human-Robot Interaction, Mar 2013
Autonomous robots must be able to perceive sounds from their environment in order to interact naturally with humans. ManyEars is an open framework for microphone array-based audio processing, which allows a robot to localize, track, and separate multiple sound sources, for improved speech and sound recognition in real-world settings. The system runs in real time on a personal computer and is able to reliably localize and track up to four of the loudest sound sources in reverberant and noisy environments when eight microphones are used. It can also separate up to three sources in an adverse environment with a signal-to-noise ratio improvement suitable for speech recognition.
Autonomous Robots, Feb 2013
This paper presents an open framework for microphone array-based audio processing named ManyEars. ManyEars is a sound source localization, tracking and separation system that uses an array of eight microphones, and can provide an enhanced speaker signal for improved speech and sound recognition in real-world settings. The ManyEars software framework is composed of a portable and modular C library, along with a graphical user interface for tuning parameters and for real-time monitoring. Integration of the ManyEars library is demonstrated with the Robot Operating System (ROS). To facilitate the use of ManyEars on various robotic platforms, a customized microphone board and sound card are also distributed as an open hardware solution for the implementation of robotic audition systems.
Proceedings of the IEEE International Conference on Robotics and Automation, May 2012
This paper presents WISS, a speaker identification system for mobile robots integrated with ManyEars, a sound source localization, tracking and separation system. Speaker identification consists of recognizing an individual among a group of known speakers. For mobile robots, performing speaker identification in the presence of noise that changes over time is one important challenge. To deal with this issue, WISS uses Parallel Model Combination (PMC) and masks to adapt, in real time, the speaker models (obtained in clean conditions) to both additive and convolutive noise. The results show that the weighted rate of good speaker identifications is 96% on average for a Signal-to-Noise Ratio (SNR) of 16 dB, and decreases only to 84% when the SNR drops to 2 dB.
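For intuition, a heavily simplified PMC step for a single Gaussian mean is sketched below: the clean cepstral mean is mapped back to the linear spectral domain, the noise estimate is added there (where additive noise actually adds), and the result is mapped back. Full-length (untruncated) DCT cepstra are assumed so the transform is invertible; the real system also adapts variances and handles convolutive noise.

```python
# Simplified Parallel Model Combination (PMC) for one Gaussian mean.
import numpy as np
from scipy.fftpack import dct, idct

def pmc_adapt_mean(clean_cep_mean, noise_log_spec, gain=1.0):
    """Both inputs must share the same dimension (assumed untruncated cepstra)."""
    # Cepstrum -> log spectrum -> linear spectrum.
    clean_lin = np.exp(idct(clean_cep_mean, norm="ortho"))
    noise_lin = np.exp(noise_log_spec)
    # Combine in the linear domain, where additive noise is actually additive.
    noisy_lin = gain * clean_lin + noise_lin
    # Back to the cepstral domain for the adapted speaker model.
    return dct(np.log(noisy_lin), norm="ortho")
```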