Nils Peters | Friedrich-Alexander-Universität Erlangen-Nürnberg

Papers by Nils Peters

Debiasing Strategies for Conversational AI: Improving Privacy and Security Decision-Making

Digital Society

With numerous conversational AI (CAI) systems being deployed in homes, cars, and public spaces, people are faced with an increasing number of privacy and security decisions. They need to decide which personal information to disclose and how their data can be processed by providers and developers. On the other hand, designers, developers, and integrators of conversational AI systems must consider users’ privacy and security during development and make appropriate choices. However, users as well as other actors in the CAI ecosystem can suffer from cognitive biases and other mental flaws in their decision-making, resulting in adverse privacy and security choices. Debiasing strategies can help to mitigate these biases and improve decision-making. In this position paper, we establish a novel framework for categorizing debiasing strategies, show how existing privacy debiasing strategies can be adapted to the context of CAI, and assign them to relevant stakeholders of the CAI ecosystem. We ...

Uncertain yet Rational - Uncertainty as an Evaluation Measure of Rational Privacy Decision-Making in Conversational AI

Lecture Notes in Computer Science, 2023

Name that room

This paper presents a system for identifying the room in an audio or video recording through the analysis of acoustical properties. The room identification system was tested using a corpus of 13,440 reverberant audio samples. With no common content between the training and testing data, an accuracy of 61% for musical signals and 85% for speech signals was achieved. This approach could be applied in a variety of scenarios where knowledge about the acoustical environment is desired, such as location estimation, music recommendation, or emergency response systems.

Deep Learning-based F0 Synthesis for Speaker Anonymization

arXiv (Cornell University), Jun 29, 2023

Voice conversion for speaker anonymization is an emerging concept for privacy protection. In a deep learning setting, this is achieved by extracting multiple features from speech, altering the speaker identity, and synthesizing a waveform. However, many existing systems do not modify fundamental frequency (F0) trajectories, which convey prosody information and can reveal speaker identity. Moreover, a mismatch between F0 and other features can degrade speech quality and intelligibility. In this paper, we formally introduce a method that synthesizes F0 trajectories from other speech features and evaluate its reconstruction capabilities. We then test our approach within a speaker anonymization framework, comparing it to a baseline and a state-of-the-art F0 modification that utilizes speaker information. The results show that our method improves both speaker anonymity, measured by the equal error rate, and utility, measured by the word error rate.
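
The interface of such an F0 synthesizer can be illustrated with a toy per-frame regression: feature vectors in, one F0 value per frame out. The sketch below uses ridge regression on purely synthetic data as a stand-in for the paper's low-complexity DNN; the feature dimensions and values are invented for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-frame inputs: 8-dim "content" features plus a 4-dim
# speaker embedding, mapped to one F0 value (Hz) per frame.
n_frames, n_feat = 500, 12
X = rng.normal(size=(n_frames, n_feat))
true_w = rng.normal(size=n_feat)
f0 = 120.0 + X @ true_w + rng.normal(scale=0.5, size=n_frames)

# Ridge regression as a linear stand-in for the DNN:
# solve (X^T X + lam I) w = X^T (y - mean(y))
lam = 1e-2
A = X.T @ X + lam * np.eye(n_feat)
w = np.linalg.solve(A, X.T @ (f0 - f0.mean()))
f0_hat = f0.mean() + X @ w

rmse = np.sqrt(np.mean((f0 - f0_hat) ** 2))
print(f"frame-wise F0 RMSE: {rmse:.2f} Hz")
```

In the actual system, the inputs would be bottleneck content features and an anonymized x-vector rather than random vectors, and the mapping is learned by a neural network.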

Towards blind reverberation time estimation for non-speech signals

Proceedings of Meetings on Acoustics, 2013

Reverberation time (RT) is an important parameter for room acoustics characterization, for intelligibility and quality assessment of reverberant speech, and for dereverberation. Commonly, RT is estimated from the room impulse response (RIR). In practice, however, RIRs are often unavailable or continuously changing. As such, blind estimation of RT based only on the recorded reverberant signals is of great interest. To date, blind RT estimation has focused on reverberant speech signals. Here, we propose to blindly estimate RT from non-speech signals, such as solo instrument recordings and music ensembles. To this end, we propose a blind estimator based on an auditory-inspired modulation spectrum signal representation, which measures the modulation frequency of temporal envelopes computed from a 23-channel gammatone filterbank. We show that the higher modulation frequency bands are more sensitive to reverberation than the modulation bands below 20 Hz. When tested on a database of non-speech sounds under 23 different reverberation conditions with reverberation time (T40) ranging from 0.18 to 15.62 s, a blind estimator based on the ratio of high-to-low modulation frequencies outperformed two state-of-the-art methods and achieved correlations with EDT as high as 0.92 for solo instruments and 0.87 for ensembles.
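
As a toy illustration of a modulation-spectrum ratio, the sketch below gates white noise at 30 Hz, convolves it with a synthetic exponentially decaying "room", and compares the envelope energy above and below 20 Hz. This is only a crude stand-in: the paper's estimator operates on the outputs of a 23-channel gammatone filterbank and is calibrated against measured reverberation times, neither of which is reproduced here.

```python
import numpy as np
from scipy.signal import hilbert, fftconvolve

rng = np.random.default_rng(1)
fs = 8000
t = np.arange(2 * fs) / fs

# Toy signal: white noise gated on/off at 30 Hz (strong fast modulation).
carrier = rng.normal(size=t.size)
gate = (np.sin(2 * np.pi * 30 * t) > 0).astype(float)
dry = carrier * gate

# Synthetic RIR: exponentially decaying noise with T60 = 1 s.
t60 = 1.0
rir = rng.normal(size=fs) * np.exp(-6.9 * np.arange(fs) / (t60 * fs))
wet = fftconvolve(dry, rir)[: dry.size]

def high_low_modulation_ratio(x, fs, split_hz=20.0):
    """Envelope-spectrum energy above split_hz divided by energy below it (DC excluded)."""
    env = np.abs(hilbert(x))
    env = env - env.mean()
    spec = np.abs(np.fft.rfft(env)) ** 2
    freqs = np.fft.rfftfreq(env.size, 1 / fs)
    lo = spec[(freqs > 0.5) & (freqs <= split_hz)].sum()
    hi = spec[(freqs > split_hz) & (freqs <= 150)].sum()
    return hi / lo

r_dry = high_low_modulation_ratio(dry, fs)
r_wet = high_low_modulation_ratio(wet, fs)
print(f"high/low modulation ratio dry: {r_dry:.3f}, reverberant: {r_wet:.3f}")
```

In this toy setup, reverberation smooths the temporal envelope and suppresses the 30 Hz gating, so the ratio drops for the reverberant signal; the full method turns such band-wise modulation statistics into a calibrated RT estimate.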

Privacy Strategies for Conversational AI and their Influence on Users' Perceptions and Decision-Making

Proceedings of the 2023 European Symposium on Usable Security

Conversational AI (CAI) systems are on the rise and have been widely adopted in homes, cars, and public spaces. Yet, people report privacy concerns and mistrust in these systems. Current data protection regulations ask providers to communicate data practices transparently and provide users with options to control their data. However, even if users are given control, their decisions can be subject to heuristics and biases, leaving people frustrated and regretful. Based on the ideas of conversational privacy and debiasing, we design three privacy strategies for CAI that allow people to have their data deleted while at the same time promoting rational decision-making. We conduct a user study to test our strategies in two widespread scenarios using a text-based CAI system and evaluate their impact on people's privacy perception, usability, and attitude-behaviour alignment. We find that our strategies can significantly change people's behaviour but do not influence their privacy perception. Finally, we discuss evaluation metrics and future research directions for investigating privacy controls in Conversational AI systems.

Comparison of Position Estimation Methods for the Rotating Equatorial Microphone

17th International Workshop on Acoustic Signal Enhancement (IWAENC), 2022

We present a prototype of a microphone that moves rapidly along the equator of a rigid spherical scatterer. Our prototype allows for up to 100 rotations per second. It will enable processing methods like beamforming or sound field decomposition that are conventionally performed using microphone arrays. Solutions that assume one or more moving microphones have already been proposed in the literature but have not been verified in practice. Most of these methods require precise knowledge of the instantaneous microphone position, for which no convenient practical solution exists. Recent advancements in microphone array processing enable employing simple microphone trajectories, and recent advancements in 3D printing, microcontrollers, and high-speed electric motors allow for the required control of the movement. This paper presents the design of our prototype and evaluates its performance. Of particular interest is the accuracy of the estimation of the microphone’s instantaneous position. This paper demonstrates that monitoring the passing time instants of a photodiode that is integrated into the rotating sphere provides the highest precision and robustness.
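
The photodiode-based scheme can be sketched as follows: each pulse marks the start of a revolution, and the instantaneous azimuth is interpolated between consecutive pulses under a constant-speed-per-revolution assumption. The timestamps below are simulated; the paper's hardware details are not reproduced.

```python
import numpy as np

# Simulated photodiode trigger times: one pulse per revolution at roughly
# 100 rotations per second, with slight speed drift (timestamps in seconds).
rng = np.random.default_rng(2)
n_rev = 200
periods = 0.01 * (1 + 0.002 * rng.standard_normal(n_rev))  # ~10 ms per turn
pulse_times = np.concatenate(([0.0], np.cumsum(periods)))

def angle_at(t, pulses):
    """Instantaneous azimuth (rad), assuming constant speed within each revolution."""
    i = np.searchsorted(pulses, t, side="right") - 1
    i = np.clip(i, 0, len(pulses) - 2)
    frac = (t - pulses[i]) / (pulses[i + 1] - pulses[i])
    return 2 * np.pi * frac

# Query the azimuth at arbitrary sample instants, e.g. for beamforming:
t_query = np.linspace(0.0, pulse_times[-1] - 1e-6, 1000)
phi = angle_at(t_query, pulse_times)
print(phi.min(), phi.max())
```

A once-per-revolution reference like this absorbs slow motor-speed drift automatically, which is one reason the paper finds it more robust than open-loop position models.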

Exploring a Long-term Dataset of Nature Reserve Ambisonics Recordings

AudioMostly 2022

Since 2017, monthly 3D audio recordings of a nature preserve have captured the acoustic environment over seasons and years. The recordings are made at the same location and with the same recording equipment, capturing one hour before and after sunset. The recordings, annotated with real-time weather data and manually labeled for acoustic events, are made to understand if and how a natural soundscape evolves over time, allowing for data-driven speculation about transformations of the soundscape that might be caused by climate change. After a short description of the general project and its current state, the methods and results of the algorithmic analyses are presented and discussed. Further methods of collecting additional data and expanded analyses of the body of data are suggested.

On The Effect Of Coding Artifacts On Acoustic Scene Classification

arXiv (Cornell University), Dec 9, 2021

Previous DCASE challenges contributed to an increase in the performance of acoustic scene classification systems. State-of-the-art classifiers demand significant processing capabilities and memory, which is challenging for resource-constrained mobile or IoT edge devices. Thus, it is more likely that these models are deployed on more powerful hardware to classify audio recordings previously uploaded (or streamed) from low-power edge devices. In such a scenario, the edge device may apply perceptual audio coding to reduce the transmission data rate. This paper explores the effect of perceptual audio coding on the classification performance using a DCASE 2020 challenge contribution [1]. We found that classification accuracy can degrade by up to 57% compared to classifying original (uncompressed) audio. We further demonstrate how lossy audio compression techniques during model training can improve the classification accuracy of compressed audio signals, even for audio codecs and codec bitrates not included in the training process.

Index Terms: acoustic scene classification, data augmentation, audio coding, internet of things. (Footnote 1: taken from the challenge results at http://dcase.community/challenge2020/task-acoustic-scene-classification-results-a)
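
The training-time augmentation idea can be sketched as randomly "coding" each clip before feature extraction, so the classifier sees coding artifacts during training. Since real perceptual codecs require external encoders, the sketch below substitutes a crude µ-law quantization round-trip as a placeholder; a real pipeline would encode/decode with actual codecs (e.g. AAC or Opus) at several bitrates.

```python
import numpy as np

rng = np.random.default_rng(3)

def mu_law_roundtrip(x, bits):
    """Crude lossy-coding stand-in: mu-law companding plus quantization."""
    mu = 255.0
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)   # compress
    levels = 2 ** bits
    yq = np.round((y + 1) / 2 * (levels - 1)) / (levels - 1) * 2 - 1  # quantize
    return np.sign(yq) * np.expm1(np.abs(yq) * np.log1p(mu)) / mu     # expand

def augment_batch(batch, bit_choices=(4, 6, 8, 16)):
    """Randomly degrade each clip; 16 'bits' means the clip passes unchanged."""
    out = []
    for clip in batch:
        bits = rng.choice(bit_choices)
        out.append(clip if bits == 16 else mu_law_roundtrip(clip, bits))
    return out

batch = [rng.uniform(-1, 1, size=1000) for _ in range(8)]
aug = augment_batch(batch)
print(len(aug), aug[0].shape)
```

Drawing the degradation level at random per clip mirrors the paper's finding that mixing coding conditions into training generalizes even to codecs and bitrates not seen during training.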

VoicePrivacy 2022 System Description: Speaker Anonymization with Feature-matched F0 Trajectories

arXiv (Cornell University), Oct 31, 2022

We introduce a novel method to improve the performance of the VoicePrivacy Challenge 2022 baseline B1 variants. Among the known deficiencies of x-vector-based anonymization systems is the insufficient disentangling of the input features: in particular, the fundamental frequency (F0) trajectories are used for voice synthesis without any modification. Especially in cross-gender conversion, this causes unnatural-sounding voices, increases word error rates (WERs), and leaks personal information. Our submission overcomes this problem by synthesizing an F0 trajectory that better harmonizes with the anonymized x-vector. We utilize a low-complexity deep neural network to estimate an appropriate F0 value per frame, using the linguistic content from the bottleneck features (BN) and the anonymized x-vector. Our approach results in a significantly improved anonymization system and increased naturalness of the synthesized voice. Consequently, our results suggest that F0 extraction is not required for voice anonymization.

Adapting Debiasing Strategies for Conversational AI

Zenodo (CERN European Organization for Nuclear Research), Jul 12, 2022

The Internet of Sounds: Convergent Trends, Insights and Future Directions

IEEE Internet of Things Journal

Current sound-based practices and systems developed in both academia and industry point to convergent research trends that bring together the field of Sound and Music Computing with that of the Internet of Things. This paper proposes a vision for the emerging field of the Internet of Sounds (IoS), which stems from these disciplines. The IoS relates to the network of Sound Things, i.e., devices capable of sensing, acquiring, processing, actuating, and exchanging data serving the purpose of communicating sound-related information. In the IoS paradigm, which merges under a unique umbrella the emerging fields of the Internet of Musical Things and the Internet of Audio Things, heterogeneous devices dedicated to musical and non-musical tasks can interact and cooperate with one another and with other things connected to the Internet to facilitate sound-based services and applications that are globally available to the users. We survey the state of the art in this space, discuss the technological and non-technological challenges ahead of us, and propose a comprehensive research agenda for the field.

Loudness Perception of Scene-Based Audio across Loudspeaker Configurations and HOA Orders

154th AES Convention, 2023

Reliable loudness calculation of audio material is an essential component of professional content production and delivery. While ITU-R BS.1770-4 has become a well-established standard for loudness measurement of channel-based content, it remains unclear how well the loudness of Scene-Based Audio content, i.e., (Higher-Order) Ambisonics, can be estimated with it. This paper presents a listening test to assess the perceived loudness of Scene-Based Audio at different HOA orders, rendered to different loudspeaker layouts using the EBU ADM renderer. The results indicate that perceived loudness is mainly a function of the audio material and differs only slightly across loudspeaker configurations and HOA orders. For comparison, the loudness is also estimated directly from the 0th-order coefficient and indirectly from the respective loudspeaker feeds in accordance with ITU-R BS.1770-4. Both methods moderately but consistently underestimated the perceived loudness.
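
For reference, the channel-based measurement the paper compares against can be sketched as K-weighting (a shelving pre-filter followed by the RLB high-pass, with the coefficients specified for 48 kHz in ITU-R BS.1770) and a mean-square measurement. The standard's gating stages and multi-channel weighting are omitted here for brevity; this is a single-channel sketch, not a compliant meter.

```python
import numpy as np
from scipy.signal import lfilter

FS = 48000  # the filter coefficients below are the BS.1770 values for 48 kHz

# K-weighting: shelving pre-filter followed by the RLB high-pass.
SHELF_B = [1.53512485958697, -2.69169618940638, 1.19839281085285]
SHELF_A = [1.0, -1.69065929318241, 0.73248077421585]
HP_B = [1.0, -2.0, 1.0]
HP_A = [1.0, -1.99004745483398, 0.99007225036621]

def k_weighted_loudness(x):
    """Ungated loudness (LKFS) of a single full-weight channel, gating omitted."""
    y = lfilter(HP_B, HP_A, lfilter(SHELF_B, SHELF_A, x))
    return -0.691 + 10 * np.log10(np.mean(y ** 2))

# Sanity check from the standard: a 997 Hz, 0 dBFS sine reads about -3.01 LKFS.
t = np.arange(FS) / FS
loudness = k_weighted_loudness(np.sin(2 * np.pi * 997 * t))
print(f"{loudness:.2f} LKFS")
```

For Ambisonics, the same measurement could be applied either to the (suitably scaled) 0th-order coefficient signal or to each rendered loudspeaker feed, which are the two estimation routes compared in the paper.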

Improving Scene Classification Models for Audio Coding Artifacts

MPEG-I Immersive Audio - The Upcoming New Audio Standard for Virtual / Augmented Reality

DAGA, 2023

Research in the field of auditory virtual environments has a long history. When combined with a visual counterpart and motion tracking, virtual environments can be used in widespread Virtual Reality (VR) or Augmented Reality (AR) application fields. Only recently has the visual part of such VR implementations achieved acceptable characteristics (resolution, latency, price, etc.) for the mass market. International standards have often helped an application field become widely used by providing formats and software that enable both interoperability between different implementations and format stability over long periods of time. In this way, consumer electronics (CE) companies and service providers creating infrastructure and content for different platforms can jointly establish a healthy ecosystem and reach the population beyond early adopters. Currently, the MPEG Audio working group is defining the new MPEG-I Immersive Audio standard for Virtual and Augmented Reality. The normative bitstream and renderer are defined to provide a real-time auditory world consisting of spatially distributed sound sources and listeners with six Degrees of Freedom (6DoF), both of which can move interactively. This includes the modelling of point sources, sized sources, coupled rooms, and the realistic auralization of (room) acoustic phenomena like reflections, diffraction, occlusion, late reverberation, Doppler shifts, and more. A specifically developed encoder input format allows the definition of such feature-rich acoustic worlds.

Feature Selection using Alternating Direction Method of Multiplier for Low-Complexity Acoustic Scene Classification

DCASE Workshop, 2022

Acoustic Scene Classification (ASC) is a common task for many resource-constrained devices, e.g., mobile phones or hearing aids. Limiting the complexity and memory footprint of the classifier is crucial, and the number of input features directly relates to these two metrics. In this contribution, we evaluate a feature selection algorithm that we also used in this year's challenge [1]. We propose a binary search with hard constraints on the feature set and solve the optimization problem with the Alternating Direction Method of Multipliers (ADMM). With minimal impact on accuracy and log loss, the results show that model complexity is halved by masking 50% of the Mel input features. Furthermore, we found that training convergence is more stable across random seeds, which also facilitates the hyperparameter search. Finally, the remaining Mel features provide an insight into the properties of the DCASE ASC data set.
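
The hard-constrained selection idea can be sketched with a small ADMM loop: a ridge-like least-squares update alternating with a projection onto the set of k-sparse vectors. The toy problem below is synthetic and omits both the paper's binary search over the feature budget and the actual classifier; it only illustrates the ADMM mechanics.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy data: 40 candidate "Mel" features, of which only 8 carry information.
n, d, k = 400, 40, 8
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:k] = rng.normal(size=k) + 2.0
y = X @ w_true + 0.1 * rng.normal(size=n)

def admm_k_sparse(X, y, k, rho=1.0, n_iter=100):
    """ADMM for min ||Xw - y||^2 subject to ||w||_0 <= k (hard feature budget)."""
    d = X.shape[1]
    A = X.T @ X + rho * np.eye(d)
    Xty = X.T @ y
    z = np.zeros(d)
    u = np.zeros(d)
    for _ in range(n_iter):
        w = np.linalg.solve(A, Xty + rho * (z - u))  # ridge-like w-update
        v = w + u
        z = np.zeros(d)
        top = np.argsort(np.abs(v))[-k:]             # project onto k-sparse set
        z[top] = v[top]
        u += w - z                                   # dual update
    return z != 0                                    # selected-feature mask

mask = admm_k_sparse(X, y, k)
print("selected features:", np.flatnonzero(mask))
```

In the full method, the resulting binary mask is applied to the Mel input, and a binary search over k trades accuracy against the complexity constraint.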

VoicePrivacy 2022 System Description: Speaker Anonymization with Feature-matched F0 Trajectories

VoicePrivacy Challenge, 2022

We introduce a novel method to improve the performance of the VoicePrivacy Challenge 2022 baseline B1 variants. Among the known deficiencies of x-vector-based anonymization systems is the insufficient disentangling of the input features: in particular, the fundamental frequency (F0) trajectories are used for voice synthesis without any modification. Especially in cross-gender conversion, this causes unnatural-sounding voices, increases word error rates (WERs), and leaks personal information. Our submission overcomes this problem by synthesizing an F0 trajectory that better harmonizes with the anonymized x-vector. We utilize a low-complexity deep neural network to estimate an appropriate F0 value per frame, using the linguistic content from the bottleneck features (BN) and the anonymized x-vector. Our approach results in a significantly improved anonymization system and increased naturalness of the synthesized voice. Consequently, our results suggest that F0 extraction is not required for voice anonymization.

Comparison of Position Estimation Methods for the Rotating Equatorial Microphone

International Workshop on Acoustic Signal Enhancement (IWAENC), 2022

We present a prototype of a microphone that moves rapidly along the equator of a rigid spherical scatterer. Our prototype allows for up to 100 rotations per second. It will enable processing methods like beamforming or sound field decomposition that are conventionally performed using microphone arrays. Solutions that assume one or more moving microphones have already been proposed in the literature but have not been verified in practice. Most of these methods require precise knowledge of the instantaneous microphone position, for which no convenient practical solution exists. Recent advancements in microphone array processing enable employing simple microphone trajectories, and recent advancements in 3D printing, microcontrollers, and high-speed electric motors allow for the required control of the movement. This paper presents the design of our prototype and evaluates its performance. Of particular interest is the accuracy of the estimation of the microphone’s instantaneous position. This paper demonstrates that monitoring the passing time instants of a photodiode that is integrated into the rotating sphere provides the highest precision and robustness.

Exploring a Long-term Dataset of Nature Reserve Ambisonics Recordings

17th International Audio Mostly Conference, 2022

Since 2017, monthly 3D audio recordings of a nature preserve have captured the acoustic environment over seasons and years. The recordings are made at the same location and with the same recording equipment, capturing one hour before and after sunset. The recordings, annotated with real-time weather data and manually labeled for acoustic events, are made to understand if and how a natural soundscape evolves over time, allowing for data-driven speculation about transformations of the soundscape that might be caused by climate change. After a short description of the general project and its current state, the methods and results of the algorithmic analyses are presented and discussed. Further methods of collecting additional data and expanded analyses of the body of data are suggested.

DCASE 2022 Task 1: Structured Filter Pruning and Feature Selection for Low Complexity Acoustic Scene Classification

DCASE Challenge, 2022

The DCASE challenge track 1 provides a dataset for Acoustic Scene Classification (ASC), a popular problem in machine learning. This year's challenge shortens the provided audio clips to 1 s, adds a multiply-accumulate operations (MAC) constraint, and additionally counts all parameters of the model. We tackle the problem using three approaches: first, we use a linear model with global moments of the spectrogram, coming within reach of the baseline; then, we use feature selection to reduce the generalization gap and the MACs; and finally, we apply structured filter pruning to bring the number of parameters below the parameter constraint. Using the evaluation split of the development dataset, our results show an increase to 49.1% overall accuracy compared to the baseline system with 42.9% accuracy.

Research paper thumbnail of Debiasing Strategies for Conversational AI: Improving Privacy and Security Decision-Making

Digital Society

With numerous conversational AI (CAI) systems being deployed in homes, cars, and public spaces, p... more With numerous conversational AI (CAI) systems being deployed in homes, cars, and public spaces, people are faced with an increasing number of privacy and security decisions. They need to decide which personal information to disclose and how their data can be processed by providers and developers. On the other hand, designers, developers, and integrators of conversational AI systems must consider users’ privacy and security during development and make appropriate choices. However, users as well as other actors in the CAI ecosystem can suffer from cognitive biases and other mental flaws in their decision-making resulting in adverse privacy and security choices. Debiasing strategies can help to mitigate these biases and improve decision-making. In this position paper, we establish a novel framework for categorizing debiasing strategies, show how existing privacy debiasing strategies can be adapted to the context of CAI, and assign them to relevant stakeholders of the CAI ecosystem. We ...

Research paper thumbnail of Uncertain yet Rational - Uncertainty as an Evaluation Measure of Rational Privacy Decision-Making in Conversational AI

Lecture Notes in Computer Science, 2023

Research paper thumbnail of Name that room

This paper presents a system for identifying the room in an audio or video recording through the ... more This paper presents a system for identifying the room in an audio or video recording through the analysis of acoustical properties. The room identification system was tested using a corpus of 13440 reverberant audio samples. With no common content between the training and testing data, an accuracy of 61% for musical signals and 85% for speech signals was achieved. This approach could be applied in a variety of scenarios where knowledge about the acoustical environment is desired, such as location estimation, music recommendation, or emergency response systems.

Research paper thumbnail of Deep Learning-based F0 Synthesis for Speaker Anonymization

arXiv (Cornell University), Jun 29, 2023

Voice conversion for speaker anonymization is an emerging concept for privacy protection. In a de... more Voice conversion for speaker anonymization is an emerging concept for privacy protection. In a deep learning setting, this is achieved by extracting multiple features from speech, altering the speaker identity, and waveform synthesis. However, many existing systems do not modify fundamental frequency (F0) trajectories, which convey prosody information and can reveal speaker identity. Moreover, mismatch between F0 and other features can degrade speech quality and intelligibility. In this paper, we formally introduce a method that synthesizes F0 trajectories from other speech features and evaluate its reconstructional capabilities. Then we test our approach within a speaker anonymization framework, comparing it to a baseline and a state-of-the-art F0 modification that utilizes speaker information. The results show that our method improves both speaker anonymity, measured by the equal error rate, and utility, measured by the word error rate.

Research paper thumbnail of Towards blind reverberation time estimation for non-speech signals

Proceedings of Meetings on Acoustics, 2013

Reverberation time (RT) is an important parameter for room acoustics characterization, intelligib... more Reverberation time (RT) is an important parameter for room acoustics characterization, intelligibility and quality assessment of reverberant speech, and for dereverberation. Commonly, RT is estimated from the room impulse response (RIR). In practice, however, RIRs are often unavailable or continuously changing. As such, blind estimation of RT based only on the recorded reverberant signals is of great interest. To date, blind RT estimation has focused on reverberant speech signals. Here, we propose to blindly estimate RT from non-speech signals, such as solo instrument recordings and music ensembles. To estimate the RT of non-speech signals, we propose a blind estimator based on an auditoryinspired modulation spectrum signal representation, which measures the modulation frequency of temporal envelopes computed from a 23channel gammatone filterbank. We show that the higher modulation frequency bands are more sensitive to reverberation than the modulation bands below 20 Hz. When tested on a database of non-speech sounds under 23 different reverberation conditions with reverberation time (T40) ranging from 0.18 to 15.62 s, a blind estimator based on the ratio of high-to-low modulation frequencies outperformed two state-of-the-art methods and achieved correlations with EDT as high as 0.92 for solo instruments and 0.87 for ensembles.

Research paper thumbnail of Privacy Strategies for Conversational AI and their Influence on Users' Perceptions and Decision-Making

Proceedings of the 2023 European Symposium on Usable Security

Conversational AI (CAI) systems are on the rise and have been widely adopted in homes, cars and p... more Conversational AI (CAI) systems are on the rise and have been widely adopted in homes, cars and public spaces. Yet, people report privacy concerns and mistrust in these systems. Current data protection regulations ask providers to communicate data practices transparently and provide users with options to control their data. However, even if users are given control, their decisions can be subject to heuristics and biases leaving people frustrated and regretful. Based on the idea of conversational privacy and debiasing, we design three privacy strategies for CAI that allow people to have their data deleted while at the same time promoting rational decisionmaking. We conduct a user study to test our strategies in two widespread scenarios using a text-based CAI system and evaluate their impact on peoples' privacy perception, usability and attitudebehaviour alignment. We find that our strategies can significantly change people's behaviour, but do not influence peoples' privacy perception. Finally, we discuss evaluation metrics and future research directions to investigate privacy controls in Conversational AI systems.

Research paper thumbnail of Comparison of Position Estimation Methods for the Rotating Equatorial Microphone

17th International Workshop on Acoustic Signal Enhancement (IWAENC), 2022

We present a prototype of a microphone that moves rapidly along the equator of a rigid spherical ... more We present a prototype of a microphone that moves rapidly along the equator of a rigid spherical scatterer. Our prototype allows for up to 100 rotations per second. It will enable processing methods like beamforming or sound field decomposition that are conventionally performed using microphone arrays. Solutions that assume one or more moving microphones have already been proposed in the literature but have not been verified in practise. Most of these methods require precise knowledge of the instantaneous microphone position, for which no convenient practical solution exists. Recent advancements in microphone array processing enable employing simple microphone trajectories, and recent advancements in 3D printing, microcontrollers, and high-speed electric motors allow for the required control of the movement. This paper presents the design of our prototype and evaluates its performance. Of particular interest is the accuracy of the estimation of the microphone’s instantaneous position. This paper demonstrates that monitoring the passing time instants of a photodiode that is integrated into the rotating sphere provides the highest precision and robustness.

Research paper thumbnail of Exploring a Long-term Dataset of Nature Reserve Ambisonics Recordings

AudioMostly 2022

Since 2017, monthly 3D audio recordings of a nature preserve capture the acoustic environment ove... more Since 2017, monthly 3D audio recordings of a nature preserve capture the acoustic environment over seasons and years. The recordings are made at the same location and using the same recording equipment, capturing one hour before and after sunset. The recordings, annotated with real-time weather data and manually labeled for acoustic events, are made to understand if and how a natural soundscape evolves over time allowing for data-driven speculation about transformations of the soundscape that might be caused by climate change. After a short description of the general project and its current state, methods and results of algorithmic analysis used are presented and the results are discussed. Further methods of collecting additional data and expanded analyses of the body of data are suggested.

Research paper thumbnail of On The Effect Of Coding Artifacts On Acoustic Scene Classification

arXiv (Cornell University), Dec 9, 2021

Previous DCASE challenges contributed to an increase in the performance of acoustic scene classif... more Previous DCASE challenges contributed to an increase in the performance of acoustic scene classification systems. State-of-the-art classifiers demand significant processing capabilities and memory which is challenging for resource-constrained mobile or IoT edge devices. Thus, it is more likely to deploy these models on more powerful hardware and classify audio recordings previously uploaded (or streamed) from low-power edge devices. In such scenario, the edge device may apply perceptual audio coding to reduce the transmission data rate. This paper explores the effect of perceptual audio coding on the classification performance using a DCASE 2020 challenge contribution [1]. We found that classification accuracy can degrade by up to 57% compared to classifying original (uncompressed) audio. We further demonstrate how lossy audio compression techniques during model training can improve classification accuracy of compressed audio signals even for audio codecs and codec bitrates not included in the training process. Index Termsacoustic scene classification, data augmentation, audio coding, internet of things 1 taken from challenge results at http://dcase.community/ challenge2020/task-acoustic-scene-classificationresults-a

Research paper thumbnail of VoicePrivacy 2022 System Description: Speaker Anonymization with Feature-matched F0 Trajectories

arXiv (Cornell University), Oct 31, 2022

We introduce a novel method to improve the performance of the VoicePrivacy Challenge 2022 baseline B1 variants. Among the known deficiencies of x-vector-based anonymization systems is the insufficient disentangling of the input features; in particular, the fundamental frequency (F0) trajectories are used for voice synthesis without any modification. Especially in cross-gender conversion, this causes unnatural-sounding voices, increased word error rates (WERs), and personal information leakage. Our submission overcomes this problem by synthesizing an F0 trajectory that better harmonizes with the anonymized x-vector. We utilized a low-complexity deep neural network to estimate an appropriate F0 value per frame, using the linguistic content from the bottleneck features (BN) and the anonymized x-vector. Our approach results in a significantly improved anonymization system and increased naturalness of the synthesized voice. Consequently, our results suggest that F0 extraction is not required for voice anonymization.
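The per-frame F0 regression described in the abstract amounts to concatenating the frame-wise bottleneck features with the (broadcast) anonymized x-vector and passing them through a small network. The sketch below shows the forward pass only; the feature dimensions, layer sizes, and the two-layer MLP are hypothetical stand-ins for the paper's low-complexity DNN, and the random weights are untrained.

```python
import numpy as np

def f0_regressor_forward(bn, xvec, w1, b1, w2, b2):
    """Per-frame F0 estimate from bottleneck features (bn: n_frames x d_bn)
    plus the anonymized x-vector (xvec: d_xv), tiled to every frame.
    A two-layer MLP with ReLU; dimensions are illustrative."""
    n_frames = bn.shape[0]
    x = np.concatenate([bn, np.tile(xvec, (n_frames, 1))], axis=1)
    h = np.maximum(x @ w1 + b1, 0.0)   # ReLU hidden layer
    return (h @ w2 + b2).ravel()       # one F0 value per frame

rng = np.random.default_rng(0)
bn, xvec = rng.normal(size=(50, 256)), rng.normal(size=512)
w1, b1 = rng.normal(scale=0.01, size=(768, 64)), np.zeros(64)
w2, b2 = rng.normal(scale=0.01, size=(64, 1)), np.zeros(1)
f0 = f0_regressor_forward(bn, xvec, w1, b1, w2, b2)  # shape (50,)
```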

Research paper thumbnail of Adapting Debiasing Strategies for Conversational AI

Zenodo (CERN European Organization for Nuclear Research), Jul 12, 2022

Research paper thumbnail of The Internet of Sounds: Convergent Trends, Insights and Future Directions

IEEE Internet of Things Journal

Current sound-based practices and systems developed in both academia and industry point to convergent research trends that bring together the field of Sound and Music Computing with that of the Internet of Things. This paper proposes a vision for the emerging field of the Internet of Sounds (IoS), which stems from such disciplines. The IoS relates to the network of Sound Things, i.e., devices capable of sensing, acquiring, processing, actuating, and exchanging data serving the purpose of communicating sound-related information. In the IoS paradigm, which merges under a unique umbrella the emerging fields of the Internet of Musical Things and the Internet of Audio Things, heterogeneous devices dedicated to musical and non-musical tasks can interact and cooperate with one another and with other things connected to the Internet to facilitate sound-based services and applications that are globally available to the users. We survey the state of the art in this space, discuss the technological and non-technological challenges ahead of us, and propose a comprehensive research agenda for the field.

Research paper thumbnail of Loudness Perception of Scene-Based Audio across Loudspeaker Configurations and HOA Orders

154th AES Convention , 2023

Reliable loudness calculation of audio material is an essential component of professional content production and delivery. While ITU-R BS.1770-4 has become a well-established standard for loudness measurement of channel-based content, it remains unclear how well the loudness of Scene-Based Audio content, i.e., (Higher-Order) Ambisonics, can be estimated with it. This paper presents a listening test to assess the perceived loudness of Scene-Based Audio at different HOA orders, rendered to different loudspeaker layouts using the EBU ADM renderer. The results indicate that perceived loudness is mainly a function of the audio material and only slightly differs across loudspeaker configurations and HOA orders. For comparison, the loudness is also estimated directly from the 0th-order coefficient, and indirectly from the respective loudspeaker feeds in accordance with ITU-R BS.1770-4. Both methods were found to moderately but consistently underestimate the perceived loudness.
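The "estimated directly from the 0th-order coefficient" comparison can be sketched as measuring the integrated loudness of the Ambisonics W channel with a BS.1770-style meter. The sketch below is a simplified mono approximation of the standard (non-overlapping 400 ms blocks instead of the specified 75%-overlapped ones, single channel only); the K-weighting filter coefficients are the published values for 48 kHz.

```python
import math
import numpy as np

# K-weighting biquads for 48 kHz (ITU-R BS.1770-4): shelving + high-pass.
SHELF_B = (1.53512485958697, -2.69169618940638, 1.19839281085285)
SHELF_A = (1.0, -1.69065929318241, 0.73248077421585)
HPF_B = (1.0, -2.0, 1.0)
HPF_A = (1.0, -1.99004745483398, 0.99007225036621)

def biquad(x, b, a):
    """Direct-form II transposed biquad filter."""
    y = np.empty_like(x)
    z1 = z2 = 0.0
    for n, xn in enumerate(x):
        yn = b[0] * xn + z1
        z1 = b[1] * xn - a[1] * yn + z2
        z2 = b[2] * xn - a[2] * yn
        y[n] = yn
    return y

def integrated_loudness_mono(x, fs=48000):
    """Gated integrated loudness (LUFS) of a mono signal, e.g. the W channel."""
    k = biquad(biquad(np.asarray(x, dtype=float), SHELF_B, SHELF_A), HPF_B, HPF_A)
    block = int(0.4 * fs)  # 400 ms blocks (non-overlapping in this sketch)
    ms = [np.mean(k[i:i + block] ** 2) for i in range(0, len(k) - block + 1, block)]
    lk = [-0.691 + 10 * math.log10(m + 1e-12) for m in ms]
    gated = [(m, l) for m, l in zip(ms, lk) if l > -70.0]        # absolute gate
    ref = -0.691 + 10 * math.log10(sum(m for m, _ in gated) / len(gated))
    gated = [m for m, l in gated if l > ref - 10.0]              # relative gate
    return -0.691 + 10 * math.log10(sum(gated) / len(gated))
```

A full-scale 997 Hz sine should read close to -3 LUFS, the calibration point of the standard.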

Research paper thumbnail of Improving Scene Classification Models for Audio Coding Artifacts

Research paper thumbnail of MPEG-I Immersive Audio - The Upcoming New Audio Standard for Virtual / Augmented Reality

DAGA, 2023

Research in the field of auditory virtual environments has a long history. When combined with a visual counterpart and motion tracking, virtual environments can be used in widespread Virtual Reality (VR) or Augmented Reality (AR) application fields. Only recently has the visual part of such VR implementations achieved acceptable characteristics (resolution, latency, price, etc.) for the mass market. International standards have often helped an application field become widely adopted by providing formats and software that enable both interoperability between different implementations and format stability over long periods of time. In this way, consumer electronics (CE) companies and service providers creating infrastructure and content for different platforms can jointly establish a healthy ecosystem and reach the population beyond early adopters. Currently, the MPEG Audio working group is defining the new MPEG-I Immersive Audio standard for Virtual and Augmented Reality. The normative bitstream and renderer are defined to provide a real-time auditory world consisting of spatially distributed sound sources and listeners with six Degrees of Freedom (6DoF), both of which can move interactively. This includes the modelling of point sources, sized sources, coupled rooms, and the realistic auralization of (room) acoustic phenomena such as reflections, diffraction, occlusion, late reverberation, Doppler shift, and more. A specifically developed encoder input format allows the definition of such feature-rich acoustic worlds.

Research paper thumbnail of Feature Selection using Alternating Direction Method of Multiplier for Low-Complexity Acoustic Scene Classification

DCASE Workshop, 2022

Acoustic Scene Classification (ASC) is a common task for many resource-constrained devices, e.g., mobile phones or hearing aids. Limiting the complexity and memory footprint of the classifier is crucial, and the number of input features directly relates to these two metrics. In this contribution, we evaluate a feature selection algorithm that we also used in this year's challenge [1]. We propose binary search with hard constraints on the feature set and solve the optimization problem with the Alternating Direction Method of Multipliers (ADMM). With minimal impact on accuracy and log loss, results show that model complexity is halved by masking 50% of the Mel input features. Furthermore, we found that training convergence is more stable across random seeds, which also facilitates the hyperparameter search. Finally, the remaining Mel features provide insight into the properties of the DCASE ASC data set.
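The effect of masking half of the Mel input features can be illustrated with a much simpler stand-in than the paper's ADMM-based optimization: rank the Mel bins by a crude importance score (here, variance across examples, which is only an assumption) and keep the top half, halving the input size and, with it, the first-layer multiply-accumulates.

```python
import numpy as np

def select_mel_mask(features, keep_ratio=0.5):
    """features: (n_examples, n_mels) pooled Mel features.
    Returns a boolean mask keeping the highest-variance bins.
    Variance is a stand-in importance score; the paper instead solves
    a hard-constrained optimization problem with ADMM."""
    n_mels = features.shape[1]
    n_keep = int(round(keep_ratio * n_mels))
    score = features.var(axis=0)             # per-bin importance proxy
    keep = np.argsort(score)[::-1][:n_keep]  # top-k bins by score
    mask = np.zeros(n_mels, dtype=bool)
    mask[keep] = True
    return mask

rng = np.random.default_rng(0)
X = rng.normal(scale=np.linspace(0.1, 2.0, 40), size=(200, 40))
mask = select_mel_mask(X, keep_ratio=0.5)
X_small = X[:, mask]  # halved input -> roughly halved first-layer MACs
```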

Research paper thumbnail of VoicePrivacy 2022 System Description: Speaker Anonymization with Feature-matched F0 Trajectories

VoicePrivacy Challenge, 2022

We introduce a novel method to improve the performance of the VoicePrivacy Challenge 2022 baseline B1 variants. Among the known deficiencies of x-vector-based anonymization systems is the insufficient disentangling of the input features; in particular, the fundamental frequency (F0) trajectories are used for voice synthesis without any modification. Especially in cross-gender conversion, this causes unnatural-sounding voices, increased word error rates (WERs), and personal information leakage. Our submission overcomes this problem by synthesizing an F0 trajectory that better harmonizes with the anonymized x-vector. We utilized a low-complexity deep neural network to estimate an appropriate F0 value per frame, using the linguistic content from the bottleneck features (BN) and the anonymized x-vector. Our approach results in a significantly improved anonymization system and increased naturalness of the synthesized voice. Consequently, our results suggest that F0 extraction is not required for voice anonymization.

Research paper thumbnail of Comparison of Position Estimation Methods for the Rotating Equatorial Microphone

International Workshop on Acoustic Signal Enhancement (IWAENC), 2022

We present a prototype of a microphone that moves rapidly along the equator of a rigid spherical scatterer. Our prototype allows for up to 100 rotations per second. It will enable processing methods like beamforming or sound field decomposition that are conventionally performed using microphone arrays. Solutions that assume one or more moving microphones have already been proposed in the literature but have not been verified in practice. Most of these methods require precise knowledge of the instantaneous microphone position, for which no convenient practical solution exists. Recent advancements in microphone array processing enable the use of simple microphone trajectories, and recent advancements in 3D printing, microcontrollers, and high-speed electric motors allow for the required control of the movement. This paper presents the design of our prototype and evaluates its performance. Of particular interest is the accuracy of the estimation of the microphone's instantaneous position. This paper demonstrates that monitoring the passing time instants of a photodiode integrated into the rotating sphere provides the highest precision and robustness.
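The photodiode-based position estimate described above reduces to interpolating the rotation angle between recorded pulse time stamps. A minimal sketch, assuming one pulse per revolution at azimuth 0 and an approximately constant rate between consecutive pulses (names and conventions hypothetical):

```python
import bisect
import math

def angle_at(t, pulse_times):
    """Instantaneous azimuth (rad) of the rotating microphone at time t,
    given the time stamps at which the photodiode fired (one pulse per
    revolution, at azimuth 0). Linearly interpolates between the two
    pulses surrounding t."""
    i = bisect.bisect_right(pulse_times, t) - 1
    if i < 0 or i + 1 >= len(pulse_times):
        raise ValueError("t must lie between two recorded pulses")
    t0, t1 = pulse_times[i], pulse_times[i + 1]
    return 2 * math.pi * (t - t0) / (t1 - t0)

# 100 rotations per second -> one pulse every 10 ms
pulses = [k * 0.01 for k in range(101)]
theta = angle_at(0.0025, pulses)  # a quarter revolution after a pulse
```

At the prototype's 100 rotations per second, a quarter revolution after a pulse (2.5 ms later) yields an azimuth of pi/2.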

Research paper thumbnail of Exploring a Long-term Dataset of Nature Reserve Ambisonics Recordings

17th International Audio Mostly Conference, 2022

Since 2017, monthly 3D audio recordings of a nature preserve capture the acoustic environment over seasons and years. The recordings are made at the same location and using the same recording equipment, capturing one hour before and after sunset. The recordings, annotated with real-time weather data and manually labeled for acoustic events, are made to understand if and how a natural soundscape evolves over time, allowing for data-driven speculation about transformations of the soundscape that might be caused by climate change. After a short description of the general project and its current state, the methods and results of the algorithmic analysis are presented and discussed. Further methods of collecting additional data and expanded analyses of the body of data are suggested.

Research paper thumbnail of DCASE 2022 Task 1: Structured Filter Pruning and Feature Selection for Low Complexity Acoustic Scene Classification

DCASE Challenge, 2022

The DCASE challenge track 1 provides a dataset for Acoustic Scene Classification (ASC), a popular problem in machine learning. This year's challenge shortens the provided audio clips to 1 s, adds a constraint on Multiply-Accumulate operations (MACs), and additionally counts all parameters of the model. We tackle the problem with three approaches: first, we use a linear model with global moments of the spectrogram, getting within reach of the baseline; then we use feature selection to reduce the generalization gap and the number of MACs; and finally, structured filter pruning brings the number of parameters below the parameter constraint. Using the evaluation split of the development dataset, our results show an increase to 49.1% overall accuracy, compared to the baseline system with 42.9% accuracy.
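The "global moments of the spectrogram" feature set for the linear model can be sketched as collapsing the time axis of a log-Mel spectrogram into a few statistics per Mel bin; the exact moment set used in the submission is an assumption here (mean, standard deviation, skewness, excess kurtosis).

```python
import numpy as np

def global_moments(spec):
    """spec: (n_mels, n_frames) log-Mel spectrogram.
    Collapse the time axis into four global moments per Mel bin
    (mean, std, skewness, excess kurtosis), giving a fixed-size
    feature vector for a linear classifier."""
    mu = spec.mean(axis=1)
    sd = spec.std(axis=1)
    z = (spec - mu[:, None]) / (sd[:, None] + 1e-12)  # standardized frames
    skew = (z ** 3).mean(axis=1)
    kurt = (z ** 4).mean(axis=1) - 3.0                # excess kurtosis
    return np.concatenate([mu, sd, skew, kurt])

rng = np.random.default_rng(0)
spec = rng.normal(size=(40, 100))    # fake 1 s log-Mel clip
feats = global_moments(spec)         # 4 * 40 = 160 features
```

The feature dimension is independent of the clip length, which fits this year's shortened 1 s clips and keeps the linear model's parameter count and MACs tiny.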