George Tzanetakis | University of Victoria
Papers by George Tzanetakis
ACM Multimedia, Nov 30, 2011
It is our great pleasure to welcome you to the 1st International ACM Workshop on Music Information Retrieval with User-Centered and Multimodal Strategies (MIRUM). MIRUM was proposed to gather experts from the Music and Multimedia Information Retrieval communities, as well as other neighboring fields, and aims to provide a high-profile platform for presenting current work on Music Information Retrieval, with a strong focus on user-centered and multimodal approaches. Music content is multifaceted and exists in many different representations, including audio recordings, symbolic scores, folksonomy descriptions, and accompanying video material. No single representation can account for the full music experience, which is strongly guided by affective and subjective context- and user-dependent factors. The existence of complementary representations and information sources in multiple modalities makes music multimedia content by definition. Furthermore, the subjective and affective aspects of music pose challenges that are shared by the broader Multimedia community. Thus, we believe it is appropriate to discuss these topics in a Multimedia context. The MIRUM 2011 Call for Papers attracted 22 international technical submissions. The program committee accepted 9 papers that cover a wide variety of topics, ranging from beat tracking techniques to affective analysis of music videos. In addition, the full-day program includes a keynote speech by Dr. Roeland Ordelman (Netherlands Institute for Sound and Vision & University of Twente, The Netherlands) on exploitation possibilities of audiovisual data in the networked information society, as well as a panel on bridging opportunities for the music and multimedia domains, featuring multiple experts from both communities. We hope that these proceedings will serve as a valuable reference for researchers in the fields of Music and Multimedia Information Retrieval, as well as neighboring fields.
arXiv (Cornell University), Mar 6, 2022
What audio embedding approach generalizes best to a wide range of downstream tasks across a variety of everyday domains without fine-tuning? The aim of the HEAR benchmark is to develop a general-purpose audio representation that provides a strong basis for learning in a wide variety of tasks and scenarios. HEAR evaluates audio representations using a benchmark suite spanning speech, environmental sound, and music. HEAR was launched as a NeurIPS 2021 shared challenge. In the spirit of shared exchange, each participant submitted an audio embedding model following a common API that is general-purpose, open-source, and freely available to use. Twenty-nine models by thirteen external teams were evaluated on nineteen diverse downstream tasks derived from sixteen datasets. Open evaluation code, submitted models, and datasets are key contributions, enabling comprehensive and reproducible evaluation, as well as previously impossible longitudinal studies. It remains an open question whether a single general-purpose audio representation can perform as holistically as the human ear.
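To make the common-API idea concrete, below is a minimal sketch of a module in the spirit of HEAR's submission interface (load_model / get_scene_embeddings / get_timestamp_embeddings). The random-projection "model" and all sizes are illustrative stand-ins, not an actual submission.

```python
# Sketch of a module implementing an embedding interface in the spirit
# of HEAR's common API. The trivial "model" is a stand-in; real
# submissions wrap a trained network behind the same three entry points.
import torch

class ToyEmbedder(torch.nn.Module):
    sample_rate = 16000
    scene_embedding_size = 64
    timestamp_embedding_size = 64

    def __init__(self):
        super().__init__()
        # Fixed random projection of 400-sample frames to 64 dimensions.
        self.proj = torch.nn.Linear(400, 64, bias=False)

def load_model(model_file_path: str = "") -> torch.nn.Module:
    return ToyEmbedder()

def get_timestamp_embeddings(audio: torch.Tensor, model: ToyEmbedder):
    # audio: (n_sounds, n_samples) at model.sample_rate
    frames = audio.unfold(1, 400, 160)        # (n_sounds, n_frames, 400)
    emb = model.proj(frames)                  # (n_sounds, n_frames, 64)
    hop_ms = 160 / model.sample_rate * 1000.0
    n = frames.shape[1]
    # Frame start times in milliseconds, one row per sound.
    timestamps = (torch.arange(n, dtype=torch.float32) * hop_ms)
    timestamps = timestamps.expand(audio.shape[0], n)
    return emb, timestamps

def get_scene_embeddings(audio: torch.Tensor, model: ToyEmbedder):
    # One embedding per sound: pool the timestamp embeddings.
    emb, _ = get_timestamp_embeddings(audio, model)
    return emb.mean(dim=1)                    # (n_sounds, 64)
```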
IGI Global eBooks, Aug 4, 2010
Multimedia Tools and Applications, Feb 16, 2017
We propose and evaluate a system for content-based visualization and exploration of music collections. The system is based on a modification of Kohonen’s Self-Organizing Map algorithm and allows users to choose the locations of clusters containing acoustically similar tracks within the music space. A user study conducted to evaluate the system shows that personalizing the music space was perceived as difficult. Conversely, the user study and objective metrics derived from users’ interactions with the interface demonstrate that the proposed system helped individuals create playlists faster and, under some circumstances, more effectively. We believe that personalized browsing interfaces are an important area of research in Multimedia Information Retrieval, and both the system and the user study contribute to the growing work in this field.
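The sketch below illustrates one way such a user-anchored SOM could work: ordinary online SOM training, with user-chosen grid cells pinned to the features of representative tracks. The anchoring scheme and all names are illustrative assumptions, not the paper's exact modification.

```python
# Minimal self-organizing-map sketch (NumPy) with user-pinned anchors.
import numpy as np

def train_som(feats, grid=(8, 8), epochs=20, anchors=None, rng=None):
    """feats: (n_tracks, d) audio features.
    anchors: {cell_index: feature_vec} pinning a representative
    track's features to a user-chosen grid cell (hypothetical scheme)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n_cells = grid[0] * grid[1]
    w = rng.normal(size=(n_cells, feats.shape[1]))    # prototype vectors
    if anchors:
        for cell, vec in anchors.items():
            w[cell] = vec                  # seed user-chosen locations
    xy = np.stack(np.unravel_index(np.arange(n_cells), grid), axis=1)
    for epoch in range(epochs):
        lr = 0.5 * (1 - epoch / epochs)                # decaying rate
        sigma = max(grid) / 2 * (1 - epoch / epochs) + 0.5
        for x in rng.permutation(feats):
            bmu = np.argmin(((w - x) ** 2).sum(axis=1))  # best match
            h = np.exp(-((xy - xy[bmu]) ** 2).sum(axis=1)
                       / (2 * sigma ** 2))             # neighborhood
            w += lr * h[:, None] * (x - w)
        if anchors:
            for cell, vec in anchors.items():
                w[cell] = vec              # keep anchors fixed each epoch
    return w, xy  # prototypes and their grid coordinates
```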
Journal of The Audio Engineering Society, May 1, 2007
The papers at this Convention have been selected on the basis of a submitted abstract and extended précis that have been peer reviewed by at least two qualified anonymous reviewers. This convention paper has been reproduced from the author's advance manuscript, without editing, corrections, or consideration by the Review Board. The AES takes no responsibility for the contents. Additional papers may be obtained by sending request and remittance to the Audio Engineering Society, 60 East 42nd Street,
Journal of New Music Research, Jun 1, 2003
The majority of existing work in Symbolic Music Information Retrieval (MIR) represents musical content using pitch and timing information. Symbolic representations such as MIDI allow such information to be easily calculated and manipulated. In contrast, most existing work in Audio MIR uses timbral and beat information, which can be calculated using automatic computer audition techniques. In this paper, Pitch Histograms are defined and proposed as a way to represent the pitch content of music signals in both symbolic and audio form. This representation is evaluated in the context of automatic musical genre classification. A multiple-pitch detection algorithm for polyphonic signals is used to calculate Pitch Histograms for audio signals. To evaluate the extent and significance of errors resulting from automatic multiple-pitch detection, automatic musical genre classification results from symbolic and audio data are compared. The comparison indicates that Pitch Histograms provide valuable information for musical genre classification. The results for both the symbolic and audio cases indicate that although pitch errors degrade classification performance in the audio case, Pitch Histograms can be effectively used for classification in both cases.
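For the symbolic case, a pitch histogram reduces to counting MIDI note numbers, optionally folded to the twelve pitch classes. The sketch below shows that computation; the per-note normalization is an assumption for comparability across pieces, not necessarily the paper's exact choice.

```python
# Folded and unfolded pitch histograms from a list of MIDI note numbers.
import numpy as np

def pitch_histograms(midi_notes):
    notes = np.asarray(midi_notes)
    unfolded = np.bincount(notes, minlength=128)[:128]  # one bin per MIDI note
    folded = np.bincount(notes % 12, minlength=12)      # twelve pitch classes
    # Normalize by note count so pieces of different lengths are comparable
    # (illustrative choice).
    n = max(len(notes), 1)
    return unfolded / n, folded / n

unf, fol = pitch_histograms([60, 64, 67, 60, 62, 64, 65, 67])  # C-major-ish
print(fol.argmax())  # 0 -> pitch class C dominates
```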
Journal of The Audio Engineering Society, Feb 24, 2021
Music listening is an important activity for many people. Advances in technology have made possible the creation of music collections with thousands of songs in portable music players. Navigating these large music collections is challenging, especially for users with vision and/or motion disabilities. In this paper we describe our current efforts to build effective music browsing interfaces for people with disabilities. The foundation of our approach is the automatic extraction of features describing musical content and the use of self-organizing maps to create two-dimensional representations of music collections. The ultimate goal is effective browsing without using any metadata. We also describe different control interfaces to the system: a regular desktop application, an iPhone implementation, an eye tracker, and a smart room interface based on Wii-mote tracking.
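One reason a 2-D SOM layout supports so many control interfaces is that every device ultimately reduces to a pointer position over the map. A hypothetical helper like the one below (not the paper's code) is all that is needed to let a mouse, eye tracker, or Wii-mote drive the same browsing logic.

```python
# Map a 2-D pointer position from any input device to a SOM grid cell.
def pointer_to_cell(x, y, width, height, grid=(8, 8)):
    """Normalize screen coordinates to a (row, col) SOM cell index."""
    col = min(int(x / width * grid[1]), grid[1] - 1)
    row = min(int(y / height * grid[0]), grid[0] - 1)
    return row, col

print(pointer_to_cell(512, 300, 1024, 768))  # -> (3, 4)
```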
arXiv (Cornell University), Apr 26, 2021
We release synth1B1, a multi-modal audio corpus consisting of 1 billion 4-second synthesized sounds, paired with the synthesis parameters used to generate them. The dataset is 100x larger than any audio dataset in the literature. We also introduce torchsynth, an open-source modular synthesizer that generates the synth1B1 samples on the fly at 16200x faster than real time (714MHz) on a single GPU. Additionally, we release two new audio datasets: FM synth timbre and subtractive synth pitch. Using these datasets, we demonstrate new rank-based evaluation criteria for existing audio representations. Finally, we propose a novel approach to synthesizer hyperparameter optimization.
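A hedged usage sketch, assuming torchsynth's documented Voice entry point as described in the project README: each integer batch number deterministically reproduces one batch of synth1B1, returning the audio together with the parameters that generated it.

```python
# Reproduce a batch of synth1B1 on the fly with torchsynth
# (usage per the project README; verify against the installed version).
import torch
from torchsynth.synth import Voice

voice = Voice()                        # default config: batched 4-second sounds
if torch.cuda.is_available():
    voice = voice.to("cuda")           # GPU is where the quoted speed applies
audio, parameters, is_train = voice(0)  # batch 0 of synth1B1
print(audio.shape)                     # (batch_size, n_samples)
```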
IEEE Transactions on Speech and Audio Processing, Jul 1, 2002
Musical genres are categorical labels created by humans to characterize pieces of music. The members of a genre share common characteristics, typically related to the instrumentation, rhythmic structure, and harmonic content of the music. Genre hierarchies are commonly used to structure the large collections of music available on the Web. Currently, musical genre annotation is performed manually. Automatic musical genre classification can assist or replace the human user in this process and would be a valuable addition to music information retrieval systems. In addition, automatic musical genre classification provides a framework for developing and evaluating features for any type of content-based analysis of musical signals. In this paper, the automatic classification of audio signals into a hierarchy of musical genres is explored. More specifically, three feature sets for representing timbral texture, rhythmic content, and pitch content are proposed. The performance and relative importance of the proposed features are investigated by training statistical pattern recognition classifiers on real-world audio collections. Both whole-file and real-time frame-based classification schemes are described. Using the proposed feature sets, a classification accuracy of 61% for ten musical genres is achieved. This result is comparable to results reported for human musical genre classification.
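The overall recipe — frame-level timbral features summarized over a texture window, then a statistical classifier — can be sketched with modern stand-ins (librosa for features, scikit-learn for the classifier) rather than the original implementation:

```python
# Timbral-texture features plus a statistical classifier, in the spirit
# of the paper's pipeline. Library choices are modern stand-ins.
import librosa
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def timbral_features(path):
    y, sr = librosa.load(path, duration=30.0)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    cent = librosa.feature.spectral_centroid(y=y, sr=sr)
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)
    zcr = librosa.feature.zero_crossing_rate(y)
    frames = np.vstack([mfcc, cent, rolloff, zcr])   # (16, n_frames)
    # Texture window: summarize frame-level features by mean and std.
    return np.concatenate([frames.mean(axis=1), frames.std(axis=1)])

# Whole-file classification over a labeled collection (paths, labels):
# X = np.array([timbral_features(p) for p in paths])
# clf = LinearDiscriminantAnalysis().fit(X, labels)
```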
Journal of The Audio Engineering Society, May 28, 2020
The design and implementation of multimedia signal processing systems is challenging, especially when efficiency and real-time performance are desired. In many modern applications, software systems must be able to handle multiple flows of various types of multimedia data, such as audio and video. Researchers frequently have to rely on a combination of different software tools for each modality to assemble proof-of-concept systems that are inefficient, brittle, and hard to maintain. Marsyas is a software framework originally developed to address these issues in the domain of audio processing. In this paper we describe MarsyasX, a new open-source cross-modal analysis framework that aims at a broader scope of applications. It follows a dataflow architecture in which complex networks of processing objects can be assembled to form systems that handle multiple and different types of multimedia flows with expressiveness and efficiency.
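A toy illustration of the dataflow idea — processing objects with a common interface, composed into a network that pushes data blocks downstream. Names and structure are illustrative only, not MarsyasX's actual API.

```python
# Toy dataflow network: compose processing objects and push blocks through.
import numpy as np

class Gain:
    def __init__(self, g): self.g = g
    def process(self, x): return self.g * x

class RMS:
    def process(self, x): return np.sqrt(np.mean(x ** 2))

class Series:
    """Run a chain of processing objects on each incoming block."""
    def __init__(self, *stages): self.stages = stages
    def process(self, x):
        for stage in self.stages:
            x = stage.process(x)
        return x

net = Series(Gain(0.5), RMS())
block = np.sin(2 * np.pi * 440 * np.arange(512) / 44100)  # one audio block
print(net.process(block))   # RMS of the attenuated block
```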
ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
We introduce a data-driven approach to automatic pitch correction of solo singing performances. The proposed approach predicts note-wise pitch shifts from the relationship between the respective spectrograms of the singing and accompaniment. This approach differs from commercial systems, where vocal track notes are usually shifted to be centered around pitches in a user-defined score, or mapped to the closest pitch among the twelve equal-tempered scale degrees. The proposed system treats pitch as a continuous value rather than relying on a set of discretized notes found in musical scores, thus allowing for improvisation and harmonization in the singing performance. We train our neural network model using a dataset of 4,702 amateur karaoke performances selected for good intonation. Our model is trained on both incorrect intonation, for which it learns a correction, and intentional pitch variation, which it learns to preserve. The proposed deep neural network, with gated recurrent units on top of convolutional layers, shows promising performance on the real-world, score-free singing pitch correction task of autotuning.
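An architectural sketch of the described design — a convolutional front end over stacked vocal/backing spectrograms feeding a GRU that emits continuous pitch shifts. Layer sizes are assumptions, and the per-frame outputs here stand in for the paper's note-wise shifts.

```python
# CNN front end + GRU emitting one continuous pitch shift per time step.
import torch
import torch.nn as nn

class PitchShiftNet(nn.Module):
    def __init__(self, n_bins=128, hidden=64):
        super().__init__()
        # Vocal and accompaniment spectrograms enter as 2 channels.
        self.conv = nn.Sequential(
            nn.Conv2d(2, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),                # pool frequency, keep time
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        self.gru = nn.GRU(32 * (n_bins // 4), hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)         # continuous shift value

    def forward(self, spec):                     # (batch, 2, n_bins, n_frames)
        h = self.conv(spec)                      # (batch, 32, n_bins//4, n_frames)
        h = h.flatten(1, 2).transpose(1, 2)      # (batch, n_frames, features)
        out, _ = self.gru(h)
        return self.head(out).squeeze(-1)        # (batch, n_frames)

model = PitchShiftNet()
shifts = model(torch.randn(4, 2, 128, 100))
print(shifts.shape)  # torch.Size([4, 100])
```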
Percussion robots have successfully used a variety of actuator technologies to activate a wide array of striking mechanisms. Popular types of actuators include solenoids and DC motors. However, industrial-strength voice coil actuators provide a compelling alternative, combining a desirable set of features that spans those of traditional devices. Characteristics such as high acceleration and accurate positioning enable the rendering of highly accurate and expressive percussion performances.
Comparative studies require a baseline reference and a documented process for capturing new subject data. This paper, combined with its principal reference [1], presents a definitive dataset of snare drum performances, along with a procedure for data acquisition and a methodology for quantitative analysis. The dataset contains video, audio, and discrete two-dimensional motion data for forty standardized percussive rudiments.
The study of periodic biological processes, such as when plants flower and when birds arrive in the spring, is known as phenology. In recent years this field has gained interest from the scientific community because of the applicability of its data to the study of climate change and other ecological processes. In this paper we propose the use of tangible interfaces for interactive sonification, with a specific example of a multimodal tangible interface consisting of a physical paper map with fiducial-marker tracking combined with a novel drawing interface. The designed interface enables one or more users to specify point queries with the map interface and time queries with the drawing interface. This allows the user to explore both time and space while receiving immediate sonic feedback on their actions. This system can be used to study and explore the effects of climate change, both as a tool for scientists and as a way to educate and involve members of the general public.
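A minimal sonification sketch, assuming one hypothetical mapping: a yearly observation series (e.g., first-flowering day-of-year) is rendered as one sine tone per year, with earlier events sounding higher. All parameter choices are illustrative, not the system's.

```python
# Map a phenology time series to a sequence of sine tones.
import numpy as np

def sonify(series, sr=22050, tone_dur=0.25, f_lo=220.0, f_hi=880.0):
    series = np.asarray(series, dtype=float)
    norm = (series - series.min()) / max(series.ptp(), 1e-9)
    freqs = f_hi - norm * (f_hi - f_lo)   # earlier (smaller) -> higher pitch
    t = np.arange(int(sr * tone_dur)) / sr
    return np.concatenate([np.sin(2 * np.pi * f * t) for f in freqs])

audio = sonify([110, 108, 105, 103, 99, 95])  # flowering day-of-year by year
```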
As digital music and sound collections increase in size, there has been a lot of work on developing novel interfaces for browsing them. Many of these interfaces rely on automatic content analysis techniques to create representations that reflect similarities between the music pieces or sounds in the collection. Representations in 3D have the potential to convey more information but can be difficult to navigate using traditional computer input devices such as a keyboard and mouse. Using sensors capable of sensing motion in three dimensions, we propose a new system for browsing music in augmented reality. Our system places audio files in a virtual cube. The placement of the files into the cube is realized through the use of audio feature extraction and self-organizing maps (SOMs). The system is controlled using gestures, and sound spatialization provides auditory cues about the topography of the music or sound collection.
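To illustrate one way of placing tracks in a virtual cube: reduce each feature vector to three dimensions and quantize each axis into grid cells. The paper uses self-organizing maps for this; the PCA-based quantization below is a simpler stand-in for illustration.

```python
# Assign tracks to cells of a virtual cube from their feature vectors.
import numpy as np
from sklearn.decomposition import PCA

def place_in_cube(feats, cells_per_axis=8):
    coords = PCA(n_components=3).fit_transform(feats)   # 3-D projection
    lo, hi = coords.min(axis=0), coords.max(axis=0)
    norm = (coords - lo) / np.maximum(hi - lo, 1e-9)    # scale to [0, 1]
    cells = (norm * cells_per_axis).astype(int)
    return np.minimum(cells, cells_per_axis - 1)        # (n_tracks, 3) indices

cells = place_in_cube(np.random.default_rng(0).normal(size=(100, 40)))
```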