WikiSpeech – enabling open source text-to-speech for Wikipedia

Mining the Spoken Wikipedia for Speech Data and Beyond

2016

We present a corpus of time-aligned spoken data of Wikipedia articles as well as the pipeline that allows such corpora to be generated for many languages. There are initiatives to create and sustain spoken Wikipedia versions in many languages, and hence the data is freely available, grows over time, and can be used for automatic corpus creation. Our pipeline automatically downloads and aligns this data. The resulting German corpus currently totals 293h of audio, of which we align 71h in full sentences and another 86h of sentences with some missing words. The English corpus consists of 287h, of which we align 27h in full sentences and 157h with some missing words. Results are publicly available.
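
As a rough illustration of the kind of sentence-level statistics reported above (fully aligned sentences vs. sentences with some missing words), the following Python sketch partitions word-level alignment output into the two categories. The record layout and field names are assumptions for illustration, not the actual format produced by the pipeline.

```python
# Sketch: classify sentences as fully vs. partially aligned from word-level
# alignment output. The (word, start, end) layout is an assumption, not the
# actual output format of the Spoken Wikipedia pipeline.

def sentence_alignment_stats(sentences):
    """sentences: list of sentences, each a list of dicts like
    {"word": "Haus", "start": 1.23, "end": 1.57}; start/end are None
    for words the aligner could not place."""
    full, partial = 0.0, 0.0
    for words in sentences:
        timed = [w for w in words if w["start"] is not None and w["end"] is not None]
        if not timed:
            continue  # nothing aligned in this sentence
        duration = timed[-1]["end"] - timed[0]["start"]
        if len(timed) == len(words):
            full += duration      # every word received a time stamp
        else:
            partial += duration   # some words are missing
    return full / 3600.0, partial / 3600.0  # hours

if __name__ == "__main__":
    demo = [
        [{"word": "Das", "start": 0.0, "end": 0.2},
         {"word": "Haus", "start": 0.2, "end": 0.6}],
        [{"word": "ist", "start": 1.0, "end": 1.2},
         {"word": "alt", "start": None, "end": None}],
    ]
    print(sentence_alignment_stats(demo))
```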

The Spoken Wikipedia Corpus collection: Harvesting, alignment and an application to hyperlistening

Language Resources and Evaluation

Spoken corpora are important for speech research, but are expensive to create and do not necessarily reflect (read or spontaneous) speech 'in the wild'. We report on our conversion of the preexisting and freely available Spoken Wikipedia into a speech resource. The Spoken Wikipedia project unites volunteer readers of Wikipedia articles. There are initiatives to create and sustain Spoken Wikipedia versions in many languages, and hence the available data grows over time. Thousands of spoken articles are available to users who prefer a spoken over the written version. We turn these semi-structured collections into structured and time-aligned corpora, keeping the exact correspondence with the original hypertext as well as all available metadata. Thus, we make the Spoken Wikipedia accessible for sustainable research. We present our open-source software pipeline that downloads, extracts, normalizes and text-speech aligns the Spoken Wikipedia. Additional language versions can be exploited by adapting configuration files or extending the software where necessary for language peculiarities. We also present and analyze the resulting corpora for German, English, and Dutch, which presently total 1005 h and grow at an estimated 87 h per year. The corpora, together with our software, are available via http://islrn.org/resources/684-927-624-257-3/. As a prototype usage of the time-aligned corpus, we describe an experiment about the preferred modalities for interacting with information-rich read-out hypertext. We find that alignments help improve user experience and factual information access by enabling targeted interaction.
Keywords: Wikipedia • speech corpus • found data • annotation • robust text-speech alignment • spoken hypertext • eyes-free speech access
This article extends and consolidates previous research (Köhn et al., 2016; Rohde and Baumann, 2016). This work was partly supported by a PostDoc grant from the Daimler and Benz Foundation to the first author.
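
The pipeline described above normalizes article text before text-speech alignment. The sketch below shows what such a normalization step might look like in Python; the rules are toy assumptions for illustration, since the actual pipeline applies its own, per-language normalization.

```python
import re

# Sketch of a text normalization step of the kind an alignment pipeline needs
# before matching written text against speech. The abbreviation table and
# regexes below are illustrative assumptions, not the pipeline's actual rules.

ABBREVIATIONS = {"e.g.": "for example", "Dr.": "Doctor"}  # assumed toy table

def normalize(text: str) -> str:
    for abbr, expansion in ABBREVIATIONS.items():
        text = text.replace(abbr, expansion)
    # strip wiki-style footnote markers such as [1]
    text = re.sub(r"\[[0-9]+\]", "", text)
    # collapse whitespace; lowercase for matching against recognizer output
    return re.sub(r"\s+", " ", text).strip().lower()

print(normalize("Dr. Smith lives in a house [1]  built in 1900."))
```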

Navigating the Spoken Wikipedia

SLPAT 2016 Workshop on Speech and Language Processing for Assistive Technologies, 2016

The Spoken Wikipedia project unites volunteer readers of encyclopedic entries. Their recordings make encyclopedic knowledge accessible to persons who are unable to read (due to alexia, visual impairment, or because their sight is otherwise occupied, e.g. while driving). However, on Wikipedia, recordings are available as raw audio files that can only be consumed linearly, without the possibility for targeted navigation or search. We present a reading application which uses an alignment between the recording, the text and the article structure, and which allows users to navigate spoken articles through a graphical user interface or a voice-based user interface (VUI), or a combination thereof. We present the results of a usability study in which we compare the two interaction modalities. We find that both types of interaction enable users to navigate articles and to find specific information much more quickly compared to a sequential presentation of the full article. In particular, when the VUI is not restricted by speech recognition and understanding issues, this interface is on par with the graphical interface and thus a real option for browsing Wikipedia without the need for vision or reading.
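
To make concrete how a section-level alignment enables the targeted navigation described above, here is a minimal Python sketch that resolves a voice command to a playback position. The alignment format (section title mapped to a start time) and the command grammar are assumptions; the actual application works on richer structures.

```python
# Sketch: resolving a navigation command against a section-level alignment.
# The mapping below (section title -> start time in seconds) is a hypothetical
# alignment for one spoken article, not the application's real data model.

SECTION_TIMES = {
    "Introduction": 0.0,
    "History": 142.5,
    "Geography": 305.0,
    "References": 611.2,
}

def seek_position(command: str):
    """Map a command like 'go to section History' to a playback position."""
    target = command.lower().removeprefix("go to section").strip()
    for title, start in SECTION_TIMES.items():
        if title.lower() == target:
            return start
    return None  # unknown section; a real VUI would ask the user back

print(seek_position("go to section History"))  # -> 142.5
```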

Crowdsourcing a Multilingual Speech Corpus: Recording, Transcription and Annotation of the CROWDED CORPUS

2016

We announce the release of the CROWDED CORPUS: a pair of speech corpora collected via crowdsourcing, containing a native speaker corpus of English (CROWDED_ENGLISH), and a corpus of German/English bilinguals (CROWDED_BILINGUAL). Release 1 of the CROWDED CORPUS contains 1000 recordings amounting to 33,400 tokens collected from 80 speakers and is freely available to other researchers. We recruited participants via the Crowdee application for Android. Recruits were prompted to respond to business-topic questions of the type found in language learning oral tests. We then used the CrowdFlower web application to pass these recordings to crowdworkers for transcription and annotation of errors and sentence boundaries. Finally, the sentences were tagged and parsed using standard natural language processing tools. We propose that crowdsourcing is a valid and economical method for corpus collection, and discuss the advantages and disadvantages of this approach.
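
The abstract mentions tagging and parsing the transcribed sentences with standard natural language processing tools. As one example of such a tool (not necessarily the authors' actual toolchain), the following sketch tags and parses a sentence with spaCy.

```python
# Sketch: tagging and dependency parsing a crowdsourced transcription with
# spaCy, as one example of "standard natural language processing tools".
# Requires: pip install spacy && python -m spacy download en_core_web_sm

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("We discussed the quarterly sales figures in the meeting.")

for token in doc:
    # surface form, part-of-speech tag, dependency label, head word
    print(f"{token.text:12s} {token.tag_:6s} {token.dep_:10s} {token.head.text}")
```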

Crowdsourcing the Paldaruo Speech Corpus of Welsh for Speech Technology

Information

Collecting speech data for a low-resource language is challenging when funding and resources are limited. This paper describes the process of designing, creating and using the Paldaruo Speech Corpus for developing speech technology for Welsh. Specifically, this paper focuses on the crowdsourcing of data using an app on smartphones and mobile devices, allowing speakers from across Wales to contribute. We discuss the development of reading prompts: isolated words and full sentences, as well as the metadata collected from contributors. We also provide background on the design of the Paldaruo App as well as the main uses for the corpus and its availability and licensing. The corpus was designed for the development of speech recognition for Welsh and has been used to create a number of other resources. These methods can be extended to other languages, and suggestions for other low-resource languages are discussed.
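
To illustrate the kind of reading prompts (isolated words and full sentences) and contributor metadata discussed above, here is a small Python sketch of how one recording session might be represented. Field names, values, and the Welsh examples are illustrative assumptions, not the Paldaruo App's actual schema.

```python
# Sketch: a hypothetical representation of one crowdsourced recording session,
# combining contributor metadata with a mixed prompt set. Not the real schema.

session = {
    "contributor": {
        "age_group": "25-34",        # self-reported metadata
        "gender": "female",
        "dialect_region": "Gwynedd",
        "native_speaker": True,
    },
    "prompts": [
        {"type": "isolated_word", "text": "ysgol"},
        {"type": "isolated_word", "text": "bore"},
        {"type": "sentence", "text": "Mae'r tywydd yn braf heddiw."},
    ],
}

for prompt in session["prompts"]:
    print(f"[{prompt['type']}] {prompt['text']}")
```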

VoicePedia: Towards Speech-based Access to Unstructured Information

Interspeech, 2007

Currently there are no dialog systems that enable purely voice-based access to the unstructured information on websites such as Wikipedia. Such systems could be revolutionary for non-literate users in the developing world. To investigate interface issues in such a system, we developed VoicePedia, a telephone-based dialog system for searching and browsing Wikipedia. In this paper, we present the system, as well as a user study comparing the use of VoicePedia to SmartPedia, a smartphone GUI-based alternative. Keyword entry through the voice interface was significantly faster, while search result navigation and page browsing were significantly slower. Although users preferred the GUI-based interface, task success rates between the two systems were comparable, a promising result for regions where smartphones and data plans are not viable.

WikiBABEL: A System for Multilingual Wikipedia Content

Proc. of AMTA …, 2010

This position paper outlines our project, WikiBABEL, which will be released as an open source project for the creation of multilingual Wikipedia content, and has the potential to produce parallel data as a by-product for Machine Translation systems research. We ...

Palabras: Crowdsourcing Transcriptions of L2 Speech

2016

We developed a web application for crowdsourcing transcriptions of Dutch words spoken by Spanish L2 learners. In this paper we discuss the design of the application and the influence of metadata and various forms of feedback. Useful data were obtained from 159 participants, with an average of over 20 transcriptions per item, which seems a satisfactory result for this type of research. Informing participants about how many items they still had to complete, rather than how many they had already completed, turned out to be an incentive to do more items. Assigning participants a score for their performance made it more attractive for them to carry out the transcription task, but this seemed to influence their performance. We discuss possible advantages and disadvantages in connection with the aim of the research and consider possible lessons for designing future experiments.
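
With over 20 crowdsourced transcriptions collected per item, one common follow-up step is to aggregate them into a single label. The sketch below uses a simple plurality vote; this is a generic aggregation strategy shown for illustration, not necessarily the analysis the authors performed.

```python
from collections import Counter

# Sketch: aggregating multiple crowdsourced transcriptions of one item into a
# single label by plurality vote. Generic strategy, not the paper's own method.

def aggregate(transcriptions: list[str]) -> str:
    normalized = [t.strip().lower() for t in transcriptions if t.strip()]
    counts = Counter(normalized)
    label, _ = counts.most_common(1)[0]
    return label

print(aggregate(["huis", "Huis", "huiz", "huis ", "huis"]))  # -> "huis"
```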

Samrómur: Crowd-sourcing Data Collection for Icelandic Speech Recognition

2020

This contribution describes an ongoing project of speech data collection, using the web application Samrómur, which is built upon Common Voice, Mozilla Foundation’s web platform for open-source voice collection. The goal of the project is to build a large-scale speech corpus for Automatic Speech Recognition (ASR) for Icelandic. Upon completion, Samrómur will be the largest open speech corpus for Icelandic collected from the public domain. We discuss the methods used for the crowd-sourcing effort and show the importance of marketing and good media coverage when launching a crowd-sourcing campaign. Preliminary results exceed our expectations, and in one month we collected data that we had estimated would take three months to obtain. Furthermore, our initial dataset of around 45 thousand utterances has good demographic coverage, is gender-balanced, and has a proper age distribution. We also report on the task of validating the recordings, which we have not promoted, but have had numerous ...
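
The demographic coverage claims above (gender balance, age distribution) are the kind of summary that can be computed directly from per-utterance speaker metadata. The sketch below shows one way to do this in Python; the field names are assumptions for illustration and may differ from Samrómur's real metadata schema.

```python
from collections import Counter

# Sketch: summarizing demographic coverage of a crowd-sourced speech corpus
# from per-utterance speaker metadata. Field names ("gender", "age_group")
# are assumed for illustration, not Samrómur's actual schema.

def demographic_summary(utterances: list[dict]) -> dict:
    genders = Counter(u["gender"] for u in utterances)
    ages = Counter(u["age_group"] for u in utterances)
    total = len(utterances)
    return {
        "gender_share": {g: n / total for g, n in genders.items()},
        "age_share": {a: n / total for a, n in ages.items()},
    }

demo = [
    {"gender": "female", "age_group": "18-29"},
    {"gender": "male", "age_group": "30-49"},
    {"gender": "female", "age_group": "30-49"},
]
print(demographic_summary(demo))
```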