Configuring Bootleg — Bootleg v1.1.0dev1 documentation
Bootleg loads its default config from bootleg/utils/parser/bootleg_args.py. When running a Bootleg model, the user may pass in a custom JSON or YAML config via:
python3 bootleg/run.py --config_script <path_to_config>
This will override the default values. Further, if a user wishes to override a param from the command line, they can pass in the value using the dotted path of the argument. For example, to override the data directory (the param data_config.data_dir), the user can enter:

python3 bootleg/run.py --config_script <path_to_config> --data_config.data_dir <path_to_data>
Bootleg will save the run config (as well as a fully parsed version with all defaults) in the log directory.
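The same merged view of defaults and user values can also be produced programmatically. Below is a minimal sketch, assuming the parse_boot_and_emm_args helper in bootleg.utils.parser.parser_utils (the parser package referenced above) and a hypothetical config path; check the repo for the exact import.

from bootleg.utils.parser.parser_utils import parse_boot_and_emm_args

# Parse a user config merged with all Bootleg and Emmental defaults
# (the path below is hypothetical).
config = parse_boot_and_emm_args("configs/my_wiki_config.yaml")

# The fully parsed config exposes the same dotted paths used on the command line.
print(config.data_config.data_dir)
print(config.emmental.lr)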
Finally, when evaluating Bootleg using the annotator, Bootleg processes possible mentions in text with three environment flags: BOOTLEG_STRIP, BOOTLEG_LOWER, and BOOTLEG_LANG_CODE. BOOTLEG_STRIP controls whether punctuation is stripped from mentions (set to False by default), BOOTLEG_LOWER controls whether .lower() is called on mentions (set to True by default), and BOOTLEG_LANG_CODE sets the language to use for Spacy.
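As a concrete illustration, the flags can be set as environment variables before the annotator is built. This is a minimal sketch, assuming the BootlegAnnotator class from bootleg.end2end.bootleg_annotator and that the boolean flags accept "true"/"false" strings; check the repo for the exact values it expects.

import os

# Hypothetical flag values; set them before constructing the annotator so
# mention extraction picks them up.
os.environ["BOOTLEG_LANG_CODE"] = "en"  # Spacy language to use
os.environ["BOOTLEG_STRIP"] = "false"   # keep punctuation on mentions
os.environ["BOOTLEG_LOWER"] = "true"    # lowercase mentions

from bootleg.end2end.bootleg_annotator import BootlegAnnotator

ann = BootlegAnnotator()
print(ann.label_mentions("How many people follow Lincoln on Twitter?"))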
Emmental Config
As Bootleg uses Emmental, the training parameters (e.g., learning rate) are set and handled by Emmental. We provide all Emmental params, as well as our defaults, at bootleg/utils/parser/emm_parse_args.py. All Emmental params are under the emmental
configuration group. For example, to change the learning rate and number of epochs in a config, add
emmental:
  lr: 1e-4
  n_epochs: 10
run_config:
  ...
You can also change Emmental params from the command line with --emmental.<emmental_param> <value>, e.g., --emmental.lr 1e-4.
Example Training Config

An example training config is shown below:
emmental:
  lr: 2e-5
  n_epochs: 3
  evaluation_freq: 0.2
  warmup_percentage: 0.1
  lr_scheduler: linear
  log_path: logs/wiki
  l2: 0.01
  grad_clip: 1.0
  fp16: true
run_config:
  eval_batch_size: 32
  dataloader_threads: 4
  dataset_threads: 50
train_config:
  batch_size: 32
model_config:
  hidden_size: 200
data_config:
  data_dir: bootleg-data/data/wiki_title_0122
  data_prep_dir: prep
  use_entity_desc: true
  entity_type_data:
    use_entity_types: true
    type_symbols_dir: type_mappings/wiki
  entity_kg_data:
    use_entity_kg: true
    kg_symbols_dir: kg_mappings
  entity_dir: bootleg-data/data/wiki_title_0122/entity_db
  max_seq_len: 128
  max_seq_window_len: 64
  max_ent_len: 128
  overwrite_preprocessed_data: false
  dev_dataset:
    file: dev.jsonl
    use_weak_label: true
  test_dataset:
    file: test.jsonl
    use_weak_label: true
  train_dataset:
    file: train.jsonl
    use_weak_label: true
  train_in_candidates: true
  word_embedding:
    cache_dir: bootleg-data/embs/pretrained_bert_models
    bert_model: bert-base-uncased
Default Config

The default Bootleg config is shown below:
"""Bootleg default configuration parameters.
In the json file, everything is a string or number. In this python file, if the default is a boolean, it will be parsed as such. If the default is a dictionary, True and False strings will become booleans. Otherwise they will stay string. """ import multiprocessing
config_args = {
    "run_config": {
        "spawn_method": (
            "forkserver",
            "multiprocessing spawn method. forkserver will save memory but have slower startup costs.",
        ),
        "eval_batch_size": (128, "batch size for eval"),
        "dump_preds_accumulation_steps": (
            1000,
            "number of eval steps to accumulate the output tensors for before saving results to file",
        ),
        "dump_preds_num_data_splits": (
            1,
            "number of chunks to split the input file; helps with OOM issues",
        ),
        "overwrite_eval_dumps": (False, "overwrite dumped eval data"),
        "dataloader_threads": (16, "data loader threads to feed gpus"),
        "log_level": ("info", "logging level"),
        "dataset_threads": (
            int(multiprocessing.cpu_count() * 0.9),
            "data set threads for prepping data",
        ),
        "result_label_file": (
            "bootleg_labels.jsonl",
            "file name to save predicted entities in",
        ),
        "result_emb_file": (
            "bootleg_embs.npy",
            "file name to save contextualized embs in",
        ),
    },
    # Parameters for hyperparameter tuning
    "train_config": {
        "batch_size": (32, "batch size"),
    },
    "model_config": {
        "hidden_size": (300, "hidden dimension for the embeddings before scoring"),
        "normalize": (False, "normalize embeddings before dot product"),
        "temperature": (1.0, "temperature for softmax in loss"),
    },
    "data_config": {
        "eval_slices": ([], "slices for evaluation"),
        "train_in_candidates": (
            True,
            "Train in candidates (if False, this means we include NIL entity)",
        ),
        "data_dir": ("data", "where training, testing, and dev data is stored"),
        "data_prep_dir": (
            "prep",
            "directory where data prep files are saved inside data_dir",
        ),
        "entity_dir": (
            "entity_data",
            "where entity profile information and prepped embedding data is stored",
        ),
        "entity_prep_dir": (
            "prep",
            "directory where prepped embedding data is saved inside entity_dir",
        ),
        "entity_map_dir": (
            "entity_mappings",
            "directory where entity json mappings are saved inside entity_dir",
        ),
        "alias_cand_map": (
            "alias2qids",
            "name of alias candidate map file, should be saved in entity_dir/entity_map_dir",
        ),
        "alias_idx_map": (
            "alias2id",
            "name of alias index map file, should be saved in entity_dir/entity_map_dir",
        ),
        "qid_cnt_map": (
            "qid2cnt.json",
            "name of qid to count map file, should be saved in data_dir",
        ),
        "max_seq_len": (128, "max token length sentences"),
        "max_seq_window_len": (64, "max window around an entity"),
        "max_ent_len": (128, "max token length for entire encoded entity"),
        "context_mask_perc": (
            0.0,
            "mask percent for context tokens in addition to tail masking",
        ),
        "popularity_mask": (
            True,
            "whether to use popularity masking for training in the entity and context encoders",
        ),
        "overwrite_preprocessed_data": (False, "overwrite preprocessed data"),
        "print_examples_prep": (True, "whether to print examples during prep or not"),
        "use_entity_desc": (True, "whether to use entity descriptions or not"),
        "entity_type_data": {
            "use_entity_types": (False, "whether to use entity type data"),
            "type_symbols_dir": (
                "type_mappings/wiki",
                "directory to type symbols inside entity_dir",
            ),
            "max_ent_type_len": (20, "max WORD length for type sequence"),
        },
        "entity_kg_data": {
            "use_entity_kg": (False, "whether to use entity kg data"),
            "kg_symbols_dir": (
                "kg_mappings",
                "directory to kg symbols inside entity_dir",
            ),
            "max_ent_kg_len": (60, "max WORD length for kg sequence"),
        },
        "train_dataset": {
            "file": ("train.jsonl", ""),
            "use_weak_label": (True, "Use weakly labeled mentions"),
        },
        "dev_dataset": {
            "file": ("dev.jsonl", ""),
            "use_weak_label": (True, "Use weakly labeled mentions"),
        },
        "test_dataset": {
            "file": ("test.jsonl", ""),
            "use_weak_label": (True, "Use weakly labeled mentions"),
        },
        "word_embedding": {
            "bert_model": ("bert-base-uncased", ""),
            "context_layers": (12, ""),
            "entity_layers": (12, ""),
            "cache_dir": (
                "pretrained_bert_models",
                "Directory where word embeddings are cached",
            ),
        },
    },
}