Le Boucher d'Amsterdam
Boudams, or "Le boucher d'Amsterdam", is a deep-learning tool built for tokenizing texts in Latin and Medieval French.
How to cite
An article has been published about this work: https://hal.archives-ouvertes.fr/hal-02154122v1
@unpublished{clerice:hal-02154122,
  TITLE = {{Evaluating Deep Learning Methods for Word Segmentation of Scripta Continua Texts in Old French and Latin}},
  AUTHOR = {Cl{\'e}rice, Thibault},
  URL = {https://hal.archives-ouvertes.fr/hal-02154122},
  NOTE = {working paper or preprint},
  YEAR = {2019},
  MONTH = Jun,
  KEYWORDS = {convolutional network ; scripta continua ; tokenization ; Old French ; word segmentation},
  PDF = {https://hal.archives-ouvertes.fr/hal-02154122/file/Evaluating_Deep_Learning_Methods_for_Tokenization_of_Scripta_Continua_in_Old_French_and_Latin%284%29.pdf},
  HAL_ID = {hal-02154122},
  HAL_VERSION = {v1},
}
How to
Install the usual way you install Python packages: python setup.py install (requires Python >= 3.6).
The config file can be kickstarted using boudams template config.json. We recommend the following settings: linear-conv-no-pos for the model, as it is not limited by the input size; and normalize and lower set to True, depending on your dataset size.
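For illustration only, here is a minimal Python sketch of applying these recommended settings to a freshly kickstarted config.json; it assumes the keys match the example configuration shown further down, and the file name is simply the one passed to boudams template.

import json

# Illustrative only: open a kickstarted config and set the recommended options.
# The keys ("model", "label_encoder") are assumed to match the example config below.
with open("config.json", encoding="utf-8") as f:
    config = json.load(f)

config["model"] = "linear-conv-no-pos"       # not limited by the input size
config.setdefault("label_encoder", {})
config["label_encoder"]["normalize"] = True  # depending on your dataset size
config["label_encoder"]["lower"] = True

with open("config.json", "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2)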
The initial dataset is pretty small, but building your own is fairly simple: you need data in the following shape: "samesentence<TAB>same sentence", where the first element is the same as the second but with the spaces removed, and the two are separated by a tab (\t, marked here as <TAB>).
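As an illustration (not part of boudams, and the file names are placeholders), a minimal Python sketch that derives such a TSV from a plain-text file containing one correctly spaced sentence per line:

# Build "samesentence<TAB>same sentence" training rows from spaced sentences.
with open("sentences.txt", encoding="utf-8") as src, \
        open("train.tsv", "w", encoding="utf-8") as out:
    for line in src:
        sentence = line.strip()
        if not sentence:
            continue
        # First column: the sentence with spaces removed; second column: the
        # original spaced sentence; the two are separated by a tab.
        out.write(sentence.replace(" ", "") + "\t" + sentence + "\n")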
{ "name": "model", "max_sentence_size": 150, "network": { "emb_enc_dim": 256, "enc_n_layers": 10, "enc_kernel_size": 3, "enc_dropout": 0.25 }, "model": "linear-conv-no-pos", "learner": { "lr_grace_periode": 2, "lr_patience": 2, "lr": 0.0001 }, "label_encoder": { "normalize": true, "lower": true }, "datasets": { "test": "./test.tsv", "train": "./train.tsv", "dev": "./dev.tsv", "random": true } }
The best architecture I found for Medieval French was Conv to Linear without POS, using the following setup:
{ "network": { "emb_enc_dim": 256, "enc_n_layers": 10, "enc_kernel_size": 5, "enc_dropout": 0.25 }, "model": "linear-conv-no-pos", "batch_size": 64, "learner": { "lr_grace_periode": 2, "lr_patience": 2, "lr": 0.00005, "lr_factor": 0.5 } }
Credits
Inspiration, bits of code, and the resources that helped me understand how Seq2Seq works and write my own Torch modules come from both Ben Trevett and Enrique Manjavacas.