Accompaniment separation and karaoke application based on automatic melody transcription

A review of vocal track separation methods for karaoke generation

Generation of clean karaoke tracks remains an open, fundamental research problem in digital music technology. Today there is high demand for karaoke tracks among amateur and professional singers. Although karaoke tracks for many popular songs are available online, far more songs have no karaoke track at all. This paper reviews some of the existing automatic karaoke generation methods, which suppress or extract the vocal part of music audio signals.

A Query-by-Singing System for Retrieving Karaoke Music

IEEE Transactions on Multimedia, 2000

This paper investigates the problem of retrieving karaoke music using query-by-singing techniques. Unlike regular CD music, where the stereo sound involves two audio channels that usually sound the same, karaoke music encompasses two distinct channels in each track: one is a mixture of the lead vocals and background accompaniment, and the other consists of accompaniment only. Although the two audio channels are distinct, the accompaniments in the two channels often resemble each other. We exploit this characteristic to (i) infer the background accompaniment for the lead vocals from the accompaniment-only channel, so that the main melody underlying the lead vocals can be extracted more effectively; and (ii) detect phrase onsets based on the Bayesian Information Criterion (BIC) to predict the onset points of a song where a user's sung query may begin, so that the similarity between the melodies of the query and the song can be examined more efficiently. To further refine extraction of the main melody, we propose correcting potential errors in the estimated sung notes by exploiting a composition characteristic of popular songs whereby the sung notes within a verse or chorus section usually vary no more than two octaves. In addition, to facilitate an efficient and accurate search of a large music database, we employ multiple-pass Dynamic Time Warping (DTW) combined with multiple-level data abstraction (MLDA) to compare the similarities of melodies. The results of experiments conducted on a karaoke database comprising 1,071 popular songs demonstrate the feasibility of query-by-singing retrieval for karaoke music.
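As a rough sketch of the melody-matching step, the following single-pass DTW over two pitch sequences illustrates the core comparison; the paper's multiple-pass DTW with multiple-level data abstraction is considerably more elaborate, and the semitone note representation, function names, and onset candidates below are hypothetical.

```python
import numpy as np

def dtw_distance(query, melody):
    """Plain DTW between two pitch sequences (MIDI note numbers).

    A single-pass stand-in for the paper's multiple-pass DTW with
    multiple-level data abstraction; the local cost is the absolute
    semitone difference, and the total cost is length-normalised so
    that queries of different lengths stay comparable.
    """
    n, m = len(query), len(melody)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(query[i - 1] - melody[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)

# Compare the sung query against the melody starting at each detected
# phrase onset (e.g. BIC-detected), keeping the best-matching onset.
query = np.array([60, 62, 64, 65, 64, 62])            # hypothetical sung notes
song = np.array([55, 57, 60, 62, 64, 65, 64, 62, 60])  # hypothetical song melody
onsets = [0, 2]
best = min(onsets, key=lambda o: dtw_distance(query, song[o:o + len(query) + 2]))
print("best matching onset:", best)
```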

Automatic Lyric Transcription from Karaoke Vocal Tracks: Resources and a Baseline System

Interspeech 2019

Automatic sung speech recognition is a relatively understudied topic that has been held back by a lack of large and freely available datasets. This has recently changed thanks to the release of the DAMP Sing! dataset, a 1,100-hour karaoke dataset originating from the social music-making company Smule. This paper presents work undertaken to define an easily replicable automatic speech recognition benchmark for this data. In particular, we describe how transcripts and alignments have been recovered from karaoke prompts and timings; how suitable training, development and test sets have been defined with varying degrees of accent variability; and how language models have been developed using lyric data from the LyricWikia website. Initial recognition experiments have been performed using factored-layer TDNN acoustic models with lattice-free MMI training in Kaldi. The best WER is 19.60%, a new state-of-the-art for this type of data. The paper concludes with a discussion of the many challenging problems that remain to be solved. Dataset definitions and Kaldi scripts have been made available so that the benchmark is easily replicable.
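For readers unfamiliar with the reported metric, a minimal word error rate computation (standard Levenshtein alignment over words, not taken from the paper's Kaldi scripts) makes the 19.60% figure concrete:

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# Two substitutions and one deletion over ten reference words -> 0.3.
print(wer("you raise me up to more than i can be",
          "you praise me up more than i can see"))
```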

Vocal melody extraction in the presence of pitched accompaniment in polyphonic music

IEEE Transactions on Audio, Speech, and Language Processing, 2010

Melody extraction algorithms for single-channel polyphonic music typically rely on the salience of the lead melodic instrument, considered here to be the singing voice. However, the simultaneous presence of one or more pitched instruments in the polyphony can cause such a predominant-F0 tracker to switch between tracking the pitch of the voice and that of an instrument of comparable strength, resulting in reduced voice-pitch detection accuracy. We propose a system that, in addition to biasing the salience measure in favor of singing voice characteristics, acknowledges that the voice may not dominate the polyphony at all instants and therefore tracks an additional pitch to better deal with the potential presence of locally dominant pitched accompaniment. A feature based on the temporal instability of voice harmonics is used to finally identify the voice pitch. The proposed system is evaluated on test data that is representative of polyphonic music with strong pitched accompaniment. Results show that the proposed system is indeed able to recover melodic information lost to its single-pitch tracking counterpart, and also outperforms another state-of-the-art melody extraction system designed for polyphonic music.
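A minimal sketch of one way such a temporal-instability feature could be computed, assuming a magnitude spectrogram and candidate F0 tracks are already available; the exact feature in the paper differs, and all names and thresholds here are our own:

```python
import numpy as np

def harmonic_instability(spec_mag, freqs, f0_track, n_harmonics=5):
    """Crude voice/instrument discriminator for a candidate pitch track.

    For each frame, sample the magnitude spectrum at the first few
    harmonics of the candidate F0 and measure the mean absolute
    frame-to-frame change in log magnitude. Singing-voice harmonics
    tend to fluctuate more than steady instrument harmonics, so of two
    candidate tracks the one with the higher score is taken to be the
    voice. This is a simplified stand-in for the paper's feature.

    spec_mag: (n_bins, n_frames) magnitude spectrogram
    freqs:    (n_bins,) bin centre frequencies in Hz
    f0_track: per-frame F0 in Hz, <= 0 for unvoiced frames
    """
    logmags = []
    for t, f0 in enumerate(f0_track):
        if f0 <= 0:  # unvoiced frame
            logmags.append(None)
            continue
        bins = [np.argmin(np.abs(freqs - h * f0))
                for h in range(1, n_harmonics + 1)]
        logmags.append(np.log(spec_mag[bins, t] + 1e-9))
    diffs = [np.mean(np.abs(a - b))
             for a, b in zip(logmags, logmags[1:])
             if a is not None and b is not None]
    return float(np.mean(diffs)) if diffs else 0.0
```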

Automatic Transcription of Polyphonic Vocal Music

This paper presents a method for automatic music transcription applied to audio recordings of a cappella performances with multiple singers. We propose a system for multi-pitch detection and voice assignment that integrates an acoustic and a music language model. The acoustic model performs spectrogram decomposition, extending probabilistic latent component analysis (PLCA) using a six-dimensional dictionary with pre-extracted log-spectral templates. The music language model performs voice separation and assignment using hidden Markov models that apply musicological assumptions. By integrating the two models, the system is able to detect multiple concurrent pitches in polyphonic vocal music and assign each detected pitch to a specific voice type such as soprano, alto, tenor or bass (SATB). We compare our system against multiple baselines, achieving state-of-the-art results for both multi-pitch detection and voice assignment on a dataset of Bach chorales and another of barbershop quartets. We also present an additional evaluation of our system using varied pitch tolerance levels to investigate its performance at 20-cent pitch resolution.
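PLCA's EM updates are numerically equivalent to multiplicative NMF updates under the KL divergence; the following minimal sketch estimates activations for fixed pre-extracted spectral templates, a drastic simplification of the paper's six-dimensional dictionary model:

```python
import numpy as np

def plca_activations(V, W, n_iter=100):
    """Estimate activations H so that V ~ W @ H, with templates W fixed.

    V: (n_bins, n_frames) non-negative magnitude spectrogram
    W: (n_bins, n_templates) pre-extracted log-spectral templates

    Multiplicative KL-divergence updates; numerically these match the
    EM updates of a two-factor PLCA model. Pitches active in a frame
    correspond to templates with high activation in that column of H.
    """
    rng = np.random.default_rng(0)
    H = rng.random((W.shape[1], V.shape[1]))
    for _ in range(n_iter):
        R = W @ H + 1e-9                                  # current reconstruction
        H *= (W.T @ (V / R)) / (W.sum(axis=0)[:, None] + 1e-9)
    return H
```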

Transcription of the Singing Melody in Polyphonic Music

International Conference on Music Information Retrieval (ISMIR), 2006

This paper proposes a method for the automatic transcription of singing melodies in polyphonic music. The method is based on multiple-F0 estimation followed by acoustic and musicological modeling. The acoustic model consists of separate models for singing notes and for no-melody segments. The musicological model uses key estimation and note bigrams to determine the transition probabilities between notes.
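The decoding stage of such a note-event HMM reduces to Viterbi search over note states with bigram transition probabilities. A generic sketch, assuming log-domain acoustic scores and transition matrices are given (all shapes and names here are our own, not the paper's):

```python
import numpy as np

def viterbi(log_obs, log_trans, log_init):
    """Most likely note-state path given per-frame acoustic scores.

    log_obs:   (n_states, n_frames) acoustic log-likelihoods
    log_trans: (n_states, n_states) note-bigram log transition probs
    log_init:  (n_states,) initial log probabilities

    A toy stand-in for the paper's note/no-melody HMM decoding.
    """
    n_states, n_frames = log_obs.shape
    delta = log_init + log_obs[:, 0]
    back = np.zeros((n_states, n_frames), dtype=int)
    for t in range(1, n_frames):
        scores = delta[:, None] + log_trans            # (from-state, to-state)
        back[:, t] = np.argmax(scores, axis=0)         # best predecessor per state
        delta = scores[back[:, t], np.arange(n_states)] + log_obs[:, t]
    # Backtrace from the best final state.
    path = [int(np.argmax(delta))]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(back[path[-1], t]))
    return path[::-1]
```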

Performance Analysis and Scoring of the Singing Voice

2010

In this article we describe the approach we follow to analyze the performance of a singer singing a reference song. The idea is to rate the performance of a singer the way a music tutor would, not only giving a score but also giving feedback about how the user has performed with respect to expression, tuning and tempo/timing. We also discuss what visual feedback is most relevant for the user. Segmentation at an intra-note level is done using an algorithm based on untrained HMMs with probabilistic models built out of a set of heuristic rules that determine regions and their probability of being expressive features. A real-time karaoke-like system is presented in which a user can sing and simultaneously visualize feedback and results of the performance. The technology can be applied to a wide range of applications, from pure entertainment to more serious education-oriented uses.
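As an illustration of the tuning dimension of such scoring, here is a toy per-note rating based on deviation in cents; the mapping from cents to a 0-100 score is our own assumption, not the paper's scoring function:

```python
import numpy as np

def tuning_score(sung_f0, reference_f0):
    """Toy per-note tuning rating in the spirit of karaoke feedback.

    Deviation is measured in cents (100 cents = 1 semitone); the score
    maps 0 cents to 100 and deviations of 100 cents or more to 0.
    """
    cents = 1200.0 * np.abs(np.log2(np.asarray(sung_f0, dtype=float) /
                                    np.asarray(reference_f0, dtype=float)))
    return np.clip(100.0 - cents, 0.0, 100.0)

# Perfectly in tune vs. roughly 47 cents sharp.
print(tuning_score([440.0, 452.0], [440.0, 440.0]))
```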

Polyphonic Listening: Real-time accompaniment of polyphonic audio

Australasian Computer Music Conference, 2007

This paper outlines a technique for generative musical accompaniment of a polyphonic audio stream. The process involves the real-time extraction of salient harmonic features and the generation of relevant musical accompaniment. We outline a new system for polyphonic pitch tracking of an audio signal which draws upon and extends previous pitch tracking techniques. We demonstrate how this machine listening system can be used as the basis for a generative music improvisation system with the potential to jam with a live ensemble without prior training.
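A crude approximation of the harmonic features such a machine listener extracts is a per-frame chroma vector. The sketch below folds FFT magnitudes into twelve pitch classes, which is enough to pick chords for a generative accompanist; the paper's polyphonic pitch tracker is far more sophisticated, and the frequency range and normalisation are our own choices:

```python
import numpy as np

def frame_chroma(frame, sr=44100):
    """Fold one audio frame's FFT magnitudes into a 12-bin chroma vector.

    Salient harmonic content is summarised per pitch class (bin 0 = A),
    a rough stand-in for full polyphonic pitch tracking.
    """
    mags = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    chroma = np.zeros(12)
    for f, m in zip(freqs[1:], mags[1:]):            # skip the DC bin
        if 55.0 <= f <= 2000.0:                      # musically useful range
            pc = int(round(12.0 * np.log2(f / 440.0))) % 12
            chroma[pc] += m
    return chroma / (chroma.max() + 1e-9)
```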

Efficient Implementation of a System for Solo and Accompaniment Separation in Polyphonic Music

2012

Our goal is to obtain improved perceptual quality for separated solo instruments and accompaniment in polyphonic music. The proposed approach uses a pitch detection algorithm in conjunction with spectral-filtering-based source separation. The algorithm was designed to work with polyphonic signals regardless of the main instrument, type of accompaniment or musical style. Our approach features a fundamental frequency estimation stage, a refined harmonic structure for the spectral mask and a post-processing stage to reduce artifacts. The processing chain has been kept computationally light. The use of perceptual measures for quality assessment revealed improved quality in the extracted signals with respect to our previous approach. The results obtained with our algorithm were also compared with other state-of-the-art algorithms in the SiSEC 2011 evaluation campaign.
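The basic idea of pitch-informed spectral masking can be sketched in a few lines, assuming an F0 track for the solo instrument is already available. The soft triangular mask and its width are our own choices, and the paper's refined harmonic structure and artifact post-processing are omitted:

```python
import numpy as np
from scipy.signal import stft, istft

def separate_solo(x, f0_track, sr=44100, n_fft=2048, hop=512, width_hz=40.0):
    """Minimal pitch-informed separation via a soft harmonic mask.

    For each frame, STFT bins within width_hz of a harmonic of the
    tracked F0 go to the solo estimate; the residual is the
    accompaniment. f0_track holds one F0 in Hz per STFT frame
    (<= 0 for unvoiced frames).
    """
    f, _, X = stft(x, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    mask = np.zeros(X.shape, dtype=float)
    for j, f0 in enumerate(f0_track[:X.shape[1]]):
        if f0 <= 0:
            continue
        harmonics = np.arange(f0, f[-1], f0)                  # f0, 2*f0, ...
        dist = np.min(np.abs(f[:, None] - harmonics[None, :]), axis=1)
        mask[:, j] = np.clip(1.0 - dist / width_hz, 0.0, 1.0)  # triangular lobes
    _, solo = istft(X * mask, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    _, accomp = istft(X * (1.0 - mask), fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    return solo, accomp
```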