GitHub - facebookresearch/llama-hd-dataset: This is a balanced dataset for English homograph disambiguation (HD), generated with Meta's Llama 2-Chat 70B model.

This repository provides a balanced dataset for training and evaluating English homograph disambiguation (HD) models, generated with Meta's Llama 2-Chat 70B model.

The dataset contains 3,260 sentences covering the same set of 162 homograph words as in Google's Wikipedia HD dataset. Most words have two pronunciations; two words (august and mobile) have three. For each pronunciation of each homograph word, this dataset contains 10 sentences in which the homograph word occurs exactly once. The 10 sentences are divided evenly into a training set and an evaluation set.
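The sentence count follows directly from the figures above. As a quick sanity check, here is a minimal sketch of the arithmetic; the variable names are illustrative only and do not come from the dataset:

```python
# Figures taken from the description above.
num_words = 162                    # homograph words in the dataset
num_three_way = 2                  # "august" and "mobile" have three pronunciations
sentences_per_pronunciation = 10   # split evenly between train and eval

# Every remaining word has two pronunciations.
num_pronunciations = (num_words - num_three_way) * 2 + num_three_way * 3

total_sentences = num_pronunciations * sentences_per_pronunciation
sentences_per_split = total_sentences // 2

print(num_pronunciations, total_sentences, sentences_per_split)
```

This gives 326 pronunciations, 3,260 sentences in total, and 1,630 sentences in each of the training and evaluation sets.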

This dataset is intended to be balanced and diverse:

Data Format

The dataset comes in two files: llama_hd_train.tsv and llama_hd_eval.tsv. Each file contains five fields, separated by a tab:

Notes:

  1. All the fields are enclosed in double quotes ("), and double quotes within the fields are escaped as two double quotes (""). You can read and unescape the fields in Python using csv.reader(..., delimiter="\t", quotechar='"').
  2. The homograph word in the sentence may be capitalized, have surrounding punctuation, take the possessive suffix ('s), or be part of a hyphenated word (e.g. lead-free). You need to make sure your tokenizer can treat the homograph word itself as a token.
  3. The start and end indices aren't strictly necessary, since the homograph word always occurs exactly once in the sentence. They are included to be consistent with the Wikipedia dataset. Note that the indices are counted in UTF-8 bytes, not Unicode characters, e.g. é (C3 A9 in UTF-8) counts as two bytes.
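The three notes above can be sketched in Python as follows. This is a minimal illustration, not code shipped with the dataset: the helper names (read_rows, byte_to_char_offset, find_homograph) are hypothetical, and the two-field sample row is made up for the demonstration and does not reflect the real five-field layout.

```python
import csv
import io
import re


def read_rows(handle):
    """Iterate over rows as lists of unescaped field strings (note 1)."""
    # Tab-delimited, fields wrapped in double quotes, "" stands for a literal quote.
    return csv.reader(handle, delimiter="\t", quotechar='"')


def byte_to_char_offset(sentence, byte_offset):
    """Convert a UTF-8 byte offset into a Python str offset (note 3)."""
    return len(sentence.encode("utf-8")[:byte_offset].decode("utf-8"))


def find_homograph(sentence, homograph):
    """Return the (start, end) character span of the homograph, or None (note 2).

    The match ignores case and only requires that the neighbouring characters
    are not letters, so "Lead", "(lead)", "lead's" and "lead-free" all match.
    """
    not_letter_before = r"(?<![^\W\d_])"
    not_letter_after = r"(?![^\W\d_])"
    pattern = not_letter_before + re.escape(homograph) + not_letter_after
    match = re.search(pattern, sentence, flags=re.IGNORECASE)
    return match.span() if match else None


# Made-up sample row with only two fields: a sentence and a homograph.
sample = '"The café sells ""lead-free"" paint."\t"lead"\n'

for sentence, homograph in read_rows(io.StringIO(sample)):
    span = find_homograph(sentence, homograph)
    print(sentence)
    print(homograph, span)
```

When reading the actual files, open them with open(path, newline="", encoding="utf-8") and pass the file object to the reader. Because the homograph occurs exactly once per sentence, a search like find_homograph is an alternative to the stored indices; if you use the stored indices instead, convert them with something like byte_to_char_offset before slicing a Python string.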

License

This dataset is released under the Creative Commons Attribution Non-Commercial (CC-BY-NC) 4.0 license.

This dataset is only intended for training models for homograph disambiguation (HD) and related elementary NLP tasks, such as text normalization (TN) and POS tagging.

Citing

If you use this data in a publication, please cite the following paper:

Wonjune Kang, Yun Wang, Shun Zhang, Arthur Hinsvark, Qing He. Multi-task learning for front-end text processing in TTS. In Proceedings of the 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Seoul, Korea, April 2024.