Inputs — Bootleg v1.1.0dev1 documentation

Given an input sentence, Bootleg outputs the entities that participate in the text. For example, given the sentence

Where is Lincoln in Logan County

Bootleg should output that Lincoln refers to Lincoln IL and Logan County to Logan County IL.

This disambiguation occurs in two parts. The first, described here, is mention extraction and candidate generation, where phrases in the input text are extracted to be disambiguated. For example, in the sentence above, the phrases “Lincoln” and “Logan County” should be extracted. Each phrase to be disambiguated is called a mention (or alias). Instead of disambiguating against all entities in Wikipedia, Bootleg uses predefined candidate maps that provide a small subset of possible entity candidates for each mention. The second step, described in Bootleg Model, is the disambiguation using Bootleg’s neural model.
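To make the first step concrete, here is a minimal sketch of mention extraction and candidate generation against a candidate map. It assumes a greedy, longest-first n-gram lookup, which may differ from what Bootleg actually does. The map contents, the placeholder entity identifiers, and the names CANDIDATE_MAP and extract_mentions are all hypothetical and are not part of Bootleg's API.

```python
# Hypothetical candidate map: lowercased alias -> list of candidate entity IDs.
# The identifiers are placeholders, not real Wikidata QIDs.
CANDIDATE_MAP = {
    "lincoln": ["Q_LINCOLN_IL", "Q_LINCOLN_NE", "Q_ABRAHAM_LINCOLN"],
    "logan county": ["Q_LOGAN_COUNTY_IL", "Q_LOGAN_COUNTY_OH"],
}


def extract_mentions(sentence, candidate_map, max_ngram=3):
    """Return (alias, [start, end], candidates) triples in sentence order.

    Spans are word offsets with an exclusive end. Longer n-grams are tried
    first, and words already covered by a mention are not reused.
    """
    words = sentence.split()
    taken = [False] * len(words)
    mentions = []
    for n in range(min(max_ngram, len(words)), 0, -1):
        for start in range(len(words) - n + 1):
            end = start + n
            if any(taken[start:end]):
                continue  # overlaps a longer mention that was already extracted
            alias = " ".join(words[start:end]).lower()
            if alias in candidate_map:
                mentions.append((alias, [start, end], candidate_map[alias]))
                for i in range(start, end):
                    taken[i] = True
    mentions.sort(key=lambda m: m[1][0])  # restore left-to-right order
    return mentions


mentions = extract_mentions("Where is Lincoln in Logan County", CANDIDATE_MAP)
for alias, span, candidates in mentions:
    print(alias, span, candidates)
```

Under these assumptions, every extracted mention is guaranteed to have at least one candidate, because a phrase is only extracted when it appears in the map.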

To understand how we do mention extraction and candidate generation, we first need to describe the profile data we have associated with an entity. Then we will describe how we perform mention extraction. Finally, we will provide details on the input data provided to Bootleg. Take a look at our tutorials to see it in action.

Entity Data

Bootleg uses Wikipedia and Wikidata to collect and generate an entity database of metadata associated with each entity. This is all located in entity_db and contains mappings from entities to structural data and possible mentions. We describe the entity profiles in more detail, and how to generate them, on our entity profile page. For reference, we have an EntityProfile class that loads and manages this metadata.

Since our profile data gives us the mentions associated with each entity, we now need to describe how we extract mentions from input text.
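As an illustration of how per-entity mentions can feed candidate generation, the sketch below inverts an entity-to-mentions mapping into an alias-to-candidates index. The profile layout, the placeholder identifiers, and the helper build_alias_index are assumptions made for this example. They do not reflect the actual entity_db format or the EntityProfile API.

```python
from collections import defaultdict

# Hypothetical entity profiles: entity ID -> metadata, including known mentions.
# Identifiers are placeholders, not real Wikidata QIDs.
entity_profiles = {
    "Q_A": {"title": "Lincoln, Illinois", "mentions": ["Lincoln", "Lincoln IL"]},
    "Q_B": {"title": "Abraham Lincoln", "mentions": ["Lincoln", "Abraham Lincoln"]},
}


def build_alias_index(profiles):
    """Invert entity -> mentions into lowercased alias -> list of entity IDs."""
    alias_to_ids = defaultdict(list)
    for entity_id, profile in profiles.items():
        for mention in profile["mentions"]:
            alias_to_ids[mention.lower()].append(entity_id)
    return dict(alias_to_ids)


alias_index = build_alias_index(entity_profiles)
print(alias_index["lincoln"])
```

An ambiguous alias such as "lincoln" maps to several entities, which is exactly the set the disambiguation model then has to choose from.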

Textual Input

Once we have mentions and candidates, we are ready to run our Bootleg model. The raw input is in JSONL format, where each line is a JSON object. We have one JSON object per sentence in our training data, with the following fields: sentence, sent_idx_unq, aliases, qids, spans, gold, and slices.

For example, the input for the sentence above is

{ "sentence": "Where is Lincoln in Logan County", "sent_idx_unq": 0, "aliases": ["lincoln", "logan county"], "qids": ["Q121", "Q???"], "spans": [[2,3], [4,6]], "gold": [true, true], "slices": {} }
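The following sketch shows one way to sanity-check such a record and serialize it as a JSONL line. It assumes, based only on the example, that spans are word offsets with an exclusive end and that each alias is the lowercased text of its span. The helper check_example is hypothetical and not part of Bootleg, and the "Q???" placeholder is kept exactly as it appears in the example.

```python
import json


def check_example(example):
    """Raise ValueError if the per-mention lists are inconsistent."""
    words = example["sentence"].split()
    n_mentions = len(example["aliases"])
    for key in ("qids", "spans", "gold"):
        if len(example[key]) != n_mentions:
            raise ValueError(f"{key} has {len(example[key])} items, expected {n_mentions}")
    for alias, (start, end) in zip(example["aliases"], example["spans"]):
        # Assumption: spans index words, end is exclusive, aliases are lowercased.
        surface = " ".join(words[start:end]).lower()
        if surface != alias:
            raise ValueError(f"span [{start}, {end}] covers {surface!r}, not {alias!r}")


example = {
    "sentence": "Where is Lincoln in Logan County",
    "sent_idx_unq": 0,
    "aliases": ["lincoln", "logan county"],
    "qids": ["Q121", "Q???"],  # second ID is left elided, as in the example
    "spans": [[2, 3], [4, 6]],
    "gold": [True, True],
    "slices": {},
}

check_example(example)
line = json.dumps(example)  # one JSON object per line; Python True becomes JSON true
print(line)
```

Serializing through json.dumps rather than writing the line by hand avoids mixing Python literals such as True into a file that must contain valid JSON.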

For more details on training, see our training tutorial.