Add pretokenizer utility by andreaskoepf · Pull Request #3654 · LAION-AI/Open-Assistant
The pretokenizer utility (pretokenizer/pretokenize.py) tokenizes datamixes in advance so they can be used with the epfLLM/Megatron-LLM trainer.
The datamix configuration is defined in a YAML file, similar to the classic training configurations of trainer_sft.py. The datasets are loaded with the functions from model_training, so the model_training module needs to be installed.
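As a rough illustration of what a layered datamix configuration could look like once the YAML file has been parsed, here is a minimal sketch. The section names (`defaults`, `example_mix`), the keys (`datasets`, `max_length`, `output_dir`), the dataset names, and the `resolve_config` helper are all hypothetical and chosen for illustration only; they are not the actual schema or code of pretokenize.py. The dict stands in for the result of parsing a YAML file (for example with PyYAML's `yaml.safe_load`).

```python
from copy import deepcopy

# Hypothetical stand-in for a parsed YAML config file.
# Every section name, key, and dataset name here is an illustrative assumption.
PARSED_CONFIG = {
    "defaults": {
        "max_length": 2048,
        "datasets": [],
        "output_dir": "output",
    },
    "example_mix": {
        "datasets": [
            {"dataset_a": {"fraction": 0.5}},  # use only part of this dataset
            "dataset_b",                       # use this dataset in full
        ],
        "max_length": 4096,
    },
}


def resolve_config(parsed, section_names):
    """Layer the named sections, in order, over the 'defaults' section."""
    resolved = deepcopy(parsed.get("defaults", {}))
    for name in section_names:
        if name not in parsed:
            raise KeyError(f"unknown config section: {name}")
        # Later sections override keys set by earlier ones.
        resolved.update(deepcopy(parsed[name]))
    return resolved


config = resolve_config(PARSED_CONFIG, ["example_mix"])
print(config["max_length"], config["output_dir"], len(config["datasets"]))
```

Keeping shared settings in a defaults section and overriding them per datamix means each mix only has to list what differs, which is one plausible reason to reuse the trainer's configuration style here.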