Add pretokenizer utility by andreaskoepf · Pull Request #3654 · LAION-AI/Open-Assistant

The pretokenizer utility (pretokenizer/pretokenize.py) tokenizes datamixes in advance for use with the epfLLM/Megatron-LLM trainer.

The datamix configuration is defined in a YAML file, similar to the classic training configurations of trainer_sft.py. The datasets are loaded with the functions from model_training, so the model_training module must be installed.
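
To illustrate the general idea of pretokenizing a datamix, here is a minimal, self-contained sketch. It is not the code of pretokenize.py: the config keys (`datasets`, `name`, `texts`), the whitespace "tokenizer", the JSON-lines output format, and all function names are assumptions made for this example. The actual utility reads its configuration from YAML, loads datasets through model_training, and writes output in whatever format the Megatron-LLM trainer expects.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical datamix. In practice this would be parsed from a YAML file;
# the keys "datasets", "name", and "texts" are illustrative only.
DATAMIX = {
    "datasets": [
        {"name": "toy_a", "texts": ["hello world", "hello again"]},
        {"name": "toy_b", "texts": ["world of tokens"]},
    ]
}


def build_vocab(config):
    """Assign an integer id to every whitespace-separated word in the mix."""
    vocab = {}
    for dataset in config["datasets"]:
        for text in dataset["texts"]:
            for word in text.split():
                vocab.setdefault(word, len(vocab))
    return vocab


def pretokenize(config, vocab, output_dir):
    """Tokenize every dataset once and write the ids as JSON lines."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = {}
    for dataset in config["datasets"]:
        path = output_dir / f"{dataset['name']}.jsonl"
        with path.open("w", encoding="utf-8") as f:
            for text in dataset["texts"]:
                ids = [vocab[word] for word in text.split()]
                f.write(json.dumps({"tokens": ids}) + "\n")
        written[dataset["name"]] = path
    return written


def load_tokens(path):
    """Read pretokenized sequences back, as a trainer-side loader might."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line)["tokens"] for line in f]


if __name__ == "__main__":
    vocab = build_vocab(DATAMIX)
    with tempfile.TemporaryDirectory() as tmp:
        files = pretokenize(DATAMIX, vocab, tmp)
        for name, path in files.items():
            print(name, load_tokens(path))
```

The point of the split is that tokenization runs once, ahead of time, and the training job only reads integer ids from disk instead of re-tokenizing text on every run.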