Frank Zalkow - Profile on Academia.edu (original) (raw)

Papers by Frank Zalkow

Audio-visual speech synthesis using vision transformer–enhanced autoencoders with ensemble of loss functions

Applied intelligence, Mar 27, 2024

arXiv (Cornell University), Jun 10, 2024

In this work, we take on the challenging task of building a single text-to-speech synthesis syste... more In this work, we take on the challenging task of building a single text-to-speech synthesis system that is capable of generating speech in over 7000 languages, many of which lack sufficient data for traditional TTS development. By leveraging a novel integration of massively multilingual pretraining and meta learning to approximate language representations, our approach enables zero-shot speech synthesis in languages without any available data. We validate our system's performance through objective measures and human evaluation across a diverse linguistic landscape. By releasing our code and models publicly, we aim to empower communities with limited linguistic resources and foster further innovation in the field of speech technology.

Wagner Ring Dataset

Zenodo (CERN European Organization for Nuclear Research), Feb 23, 2023

The AudioLabs System for the Blizzard Challenge 2023

Meinardmueller/Libfmp: V1.2.2

libfmp - Python package for teaching and learning Fundamentals of Music Processing (FMP)

Librosa/Librosa: 0.7.2

This is primarily a bug-fix release, and most likely the last release in the 0.7 series. It inclu... more This is primarily a bug-fix release, and most likely the last release in the 0.7 series. It includes fixes for errors in dynamic time warping (DTW) and RMS energy calculation, and several corrections to the documentation. Inverse-liftering is now supported in MFCC inversion, and an implementation of mu-law companding has been added. Please refer to the documentation for a full list of changes.

Journal on computing and cultural heritage, May 8, 2021

This article presents a multimodal dataset comprising various representations and annotations of ... more This article presents a multimodal dataset comprising various representations and annotations of Franz Schubert's song cycle Winterreise. Schubert's seminal work constitutes an outstanding example of the Romantic song cycle-a central genre within Western classical music. Our dataset unifies several public sources and annotations carefully created by music experts, compiled in a comprehensive and consistent way. The multimodal representations comprise the singer's lyrics, sheet music in different machine-readable formats, and audio recordings of nine performances, two of which are freely accessible for research purposes. By means of explicit musical measure positions, we establish a temporal alignment between the different representations, thus enabling a detailed comparison across different performances and modalities. Using these alignments, we provide for the different versions various musicological annotations describing tonal and structural characteristics. This metadata comprises chord annotations in different granularities, local and global annotations of musical keys, and segmentations into structural parts. From a technical perspective, the dataset allows for evaluating algorithmic approaches to tasks such as automated music transcription, cross-modal music alignment, or tonal analysis, and for testing these algorithms' robustness across songs, performances, and modalities. From a musicological perspective, the dataset enables the systematic study of Schubert's musical language and style in Winterreise and the comparison of annotations regarding different annotators and granularities. Beyond the research domain, the data may serve further purposes such as the didactic preparation of Schubert's work and its presentation to a wider public by means of an interactive multimedia experience. With this article, we provide a detailed description of the dataset, indicate its potential for computational music analysis by means of several studies, and point out possibilities for future research. CCS Concepts: • Information systems → Music retrieval; Content analysis and feature selection; • Applied computing → Sound and music computing; Digital libraries and archives; Document metadata; • Human-centered computing → Information visualization;

Librosa/Librosa: 0.6.3

This release contains a few minor bugfixes and many improvements to documentation and usability.

Librosa/Librosa: 0.8.1RC2

Second release candidate for 0.8.1.

FMP Notebooks

Springer eBooks, Feb 24, 2012

International Symposium/Conference on Music Information Retrieval, 2020

Even though local tempo estimation promises musicological insights into expressive musical perfor... more Even though local tempo estimation promises musicological insights into expressive musical performances, it has never received as much attention in the music information retrieval (MIR) research community as either beat tracking or global tempo estimation. One reason for this may be the lack of a generally accepted definition. In this paper, we discuss how to model and measure local tempo in a musically meaningful way using a cross-version dataset of Frédéric Chopin's Mazurkas as a use case. In particular, we explore how tempo stability can be measured and taken into account during evaluation. Comparing existing and newly trained systems, we find that CNN-based approaches can accurately measure local tempo even for expressive classical music, if trained on the target genre. Furthermore, we show that different training-test splits have a considerable impact on accuracy for difficult segments.

International Symposium/Conference on Music Information Retrieval, 2017

Richard Wagner's cycle Der Ring des Nibelungen, consisting of four music dramas, constitutes a co... more Richard Wagner's cycle Der Ring des Nibelungen, consisting of four music dramas, constitutes a comprehensive work of high importance for Western music history. In this paper, we indicate how MIR methods can be applied to explore this large-scale work with respect to tonal properties. Our investigations are based on a data set that contains 16 audio recordings of the entire Ring as well as extensive annotations including measure positions, singer activities, and leitmotif regions. As a basis for the tonal analysis, we make use of common audio features, which capture local chord and scale information. Employing a crossversion approach, we show that global histogram representations can reflect certain tonal relationships in a robust way. Based on our annotations, a musicologist may easily select and compare passages associated with dramatic aspects, for example, the appearance of specific characters or the presence of particular leitmotifs. Highlighting and investigating such passages may provide insights into the role of tonality for the dramatic conception of Wagner's Ring. By giving various concrete examples, we indicate how our approach may open up new ways for exploring large musical corpora in an intuitive and interactive way.

International Symposium/Conference on Music Information Retrieval, Nov 4, 2019

In this paper, we introduce a novel collection of educational material for teaching and learning ... more In this paper, we introduce a novel collection of educational material for teaching and learning fundamentals of music processing (FMP) with a particular focus on the audio domain. This collection, referred to as FMP notebooks, discusses well-established topics in Music Information Retrieval (MIR) as motivating application scenarios. The FMP notebooks provide detailed textbook-like explanations of central techniques and algorithms in combination with Python code examples that illustrate how to implement the theory. All components including the introductions of MIR scenarios, illustrations, sound examples, technical concepts, mathematical details, and code examples are integrated into a consistent and comprehensive framework based on Jupyter notebooks. The FMP notebooks are suited for studying the theory and practice, for generating educational material for lectures, as well as for providing baseline implementations for many MIR tasks, thus addressing students, teachers, and researchers.

International Symposium/Conference on Music Information Retrieval, Oct 11, 2020

From the 19th century on, several composers of Western opera made use of leitmotifs (short musica... more From the 19th century on, several composers of Western opera made use of leitmotifs (short musical ideas referring to semantic entities such as characters, places, items, or feelings) for guiding the audience through the plot and illustrating the events on stage. A prime example of this compositional technique is Richard Wagner's four-opera cycle Der Ring des Nibelungen. Across its different occurrences in the score, a leitmotif may undergo considerable musical variations. Additionally, the concrete leitmotif instances in an audio recording are subject to acoustic variability. Our paper approaches the task of classifying such leitmotif instances in audio recordings. As our main contribution, we conduct a case study on a dataset covering 16 recorded performances of the Ring with annotations of ten central leitmotifs, leading to 2403 occurrences and 38448 instances in total. We build a neural network classification model and evaluate its ability to generalize across different performances and leitmotif occurrences. Our findings demonstrate the possibilities and limitations of leitmotif classification in audio recordings and pave the way towards the fully automated detection of leitmotifs in music recordings.

Applied sciences, Mar 14, 2018

In magnetic resonance imaging (MRI), a patient is exposed to beat-like knocking sounds, often int... more In magnetic resonance imaging (MRI), a patient is exposed to beat-like knocking sounds, often interrupted by periods of silence, which are caused by pulsing currents of the MRI scanner. In order to increase the patient's comfort, one strategy is to play back ambient music to induce positive emotions and to reduce stress during the MRI scanning process. To create an overall acceptable acoustic environment, one idea is to adapt the music to the locally periodic acoustic MRI noise. Motivated by this scenario, we consider in this paper the general problem of adapting a given music recording to fulfill certain temporal constraints. More concretely, the constraints are given by a reference time axis with specified time points (e.g., the time positions of the MRI scanner's knocking sounds). Then, the goal is to temporally modify a suitable music recording such that its beat positions align with the specified time points. As one technical contribution, we model this alignment task as an optimization problem with the objective to fulfill the constraints while avoiding strong local distortions in the music. Furthermore, we introduce an efficient algorithm based on dynamic programming for solving this task. Based on the computed alignment, we use existing timescale modification procedures for locally adapting the music recording. To illustrate the outcome of our procedure, we discuss representative synthetic and real-world examples, which can be accessed via an interactive website. In particular, these examples indicate the potential of automated methods for noise beautification within the MRI application scenario.

Transactions of the International Society for Music Information Retrieval, 2020

Musical themes are essential elements in Western classical music. In this paper, we present the M... more Musical themes are essential elements in Western classical music. In this paper, we present the Musical Theme Dataset (MTD), a multimodal dataset inspired by "A Dictionary of Musical Themes" by Barlow and Morgenstern from 1948. For a subset of 2067 themes of the printed book, we created several digital representations of the musical themes. Beyond graphical sheet music, we provide symbolic music encodings, audio snippets of music recordings, alignments between the symbolic and audio representations, as well as detailed metadata on the composer, work, recording, and musical characteristics of the themes. In addition to the data, we also make several parsers and web-based interfaces available to access and explore the different modalities and their relations through visualizations and sonifications. These interfaces also include computational tools, bridging the gap between the original dictionary and music information retrieval (MIR) research. The dataset is of relevance for various subfields and tasks in MIR, such as cross-modal music retrieval, music alignment, optical music recognition, music transcription, and computational musicology.

Journal of open source software, Jul 20, 2021

The revolution in music distribution, storage, and consumption has fueled tremendous interest in ... more The revolution in music distribution, storage, and consumption has fueled tremendous interest in developing techniques and tools for organizing, structuring, retrieving, navigating, and presenting music-related data. As a result, the academic field of music information retrieval (MIR) has matured over the last 20 years into an independent research area related to many different disciplines, including engineering, computer science, mathematics, and musicology. In this contribution, we introduce the Python package libfmp, which provides implementations of well-established model-based algorithms for various MIR tasks (with a focus on the audio domain), including beat tracking, onset detection, chord recognition, music synchronization, version identification, music segmentation, novelty detection, and audio decomposition. Such traditional approaches not only yield valuable baselines for modern data-driven strategies (e.g., using deep learning) but are also instructive from an educational viewpoint deepening the understanding of the MIR task and music data at hand. Our libfmp package is inspired and closely follows conventions as introduced by librosa, which is a widely used Python library containing standardized and flexible reference implementations of many common methods in audio and music processing (McFee et al., 2015). While the two packages overlap concerning basic feature extraction and MIR algorithms, libfmp contains several reference implementations of advanced music processing pipelines not yet covered by librosa (or other open-source software). Whereas the librosa package is intended to facilitate the high-level composition of basic methods into complex pipelines, a major emphasis of libfmp is on the educational side, promoting the understanding of MIR concepts by closely following the textbook on Fundamentals of Music Processing (FMP) (Müller, 2015). In this way, we hope that libfmp constitutes a valuable complement to existing open-source toolboxes such as librosa while fostering education and research in MIR.

IEEE/ACM transactions on audio, speech, and language processing, 2021

This paper deals with a score-audio music retrieval task where the aim is to find relevant audio ... more This paper deals with a score-audio music retrieval task where the aim is to find relevant audio recordings of Western classical music, given a short monophonic musical theme in symbolic notation as a query. Strategies for comparing score and audio data are often based on a common mid-level representation, such as chroma features, which capture melodic and harmonic properties. Recent studies demonstrated the effectiveness of neural networks that learn task-specific mid-level representations. Usually, such supervised learning approaches require score-audio pairs where the score's individual note events are aligned to the corresponding time positions of the audio excerpt. However, in practice, it is tedious to generate such strongly aligned training pairs. As one contribution, we show how to apply the Connectionist Temporal Classification (CTC) loss in the training procedure, which only uses weakly aligned training pairs. In such a pair, only the time positions of the beginning and end of a theme occurrence are annotated in an audio recording, rather than requiring local alignment annotations. We evaluate the resulting features in our theme retrieval scenario and show that they improve the state of the art for this task. As a main result, we demonstrate that with the CTC-based training procedure using weakly annotated data, we can achieve results almost as good as with strongly annotated data. Furthermore, we assess our chroma features in depth by inspecting their temporal smoothness or granularity as an important property and by analyzing the impact of different degrees of musical complexity on the features.

IEEE Transactions on Multimedia, 2022

Attention-based Transformer models have been increasingly employed for automatic music generation... more Attention-based Transformer models have been increasingly employed for automatic music generation. To condition the generation process of such a model with a user-specified sequence, a popular approach is to take that conditioning sequence as a priming sequence and ask a Transformer decoder to generate a continuation. However, this prompt-based conditioning cannot guarantee that the conditioning sequence would develop or even simply repeat itself in the generated continuation. In this paper, we propose an alternative conditioning approach, called theme-based conditioning, that explicitly trains the Transformer to treat the conditioning sequence as a thematic material that has to manifest itself multiple times in its generation result. This is achieved with two main technical contributions. First, we propose a deep learning-based approach that uses contrastive representation learning and clustering to automatically retrieve thematic materials from music pieces in the training data. Second, we propose a novel gated parallel attention module to be used in a sequence-to-sequence (seq2seq) encoder/decoder architecture to more effectively account for a given conditioning thematic material in the generation process of the Transformer decoder. We report on objective and subjective evaluations of variants of the proposed Theme Transformer and the conventional promptbased baseline, showing that our best model can generate, to some extent, polyphonic pop piano music with repetition and plausible variations of a given condition.

Applied sciences, Dec 18, 2019

Cross-version music retrieval aims at identifying all versions of a given piece of music using a ... more Cross-version music retrieval aims at identifying all versions of a given piece of music using a short query audio fragment. One previous approach, which is particularly suited for Western classical music, is based on a nearest neighbor search using short sequences of chroma features, also referred to as audio shingles. From the viewpoint of efficiency, indexing and dimensionality reduction are important aspects. In this paper, we extend previous work by adapting two embedding techniques; one is based on classical principle component analysis, and the other is based on neural networks with triplet loss. Furthermore, we report on systematically conducted experiments with Western classical music recordings and discuss the trade-off between retrieval quality and embedding dimensionality. As one main result, we show that, using neural networks, one can reduce the audio shingles from 240 to fewer than 8 dimensions with only a moderate loss in retrieval accuracy. In addition, we present extended experiments with databases of different sizes and different query lengths to test the scalability and generalizability of the dimensionality reduction methods. We also provide a more detailed view into the retrieval problem by analyzing the distances that appear in the nearest neighbor search.