neuropark/sahajBERT · Hugging Face (original) (raw)
Collaboratively pre-trained model on Bengali language using masked language modeling (MLM) and Sentence Order Prediction (SOP) objectives.
Model description
sahajBERT is a model composed of 1) a tokenizer specially designed for Bengali and 2) an ALBERT architecture collaboratively pre-trained on a dump of Wikipedia in Bengali and the Bengali part of OSCAR.
Intended uses & limitations
You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to be fine-tuned on a downstream task that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering.
We trained our model on 2 of these downstream tasks: sequence classification and token classification
How to use
You can use this model directly with a pipeline for masked language modeling:
from transformers import AlbertForMaskedLM, FillMaskPipeline, PreTrainedTokenizerFast
# Initialize tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("neuropark/sahajBERT")
# Initialize model
model = AlbertForMaskedLM.from_pretrained("neuropark/sahajBERT")
# Initialize pipeline
pipeline = FillMaskPipeline(tokenizer=tokenizer, model=model)
raw_text = "ধন্যবাদ। আপনার সাথে কথা [MASK] ভালো লাগলো" # Change me
pipeline(raw_text)
Here is how to use this model to get the features of a given text in PyTorch:
from transformers import AlbertModel, PreTrainedTokenizerFast
# Initialize tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("neuropark/sahajBERT")
# Initialize model
model = AlbertModel.from_pretrained("neuropark/sahajBERT")
text = "ধন্যবাদ। আপনার সাথে কথা বলে ভালো লাগলো" # Change me
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
Limitations and bias
WIP
Training data
The tokenizer was trained on he Bengali part of OSCAR and the model on a dump of Wikipedia in Bengali and the Bengali part of OSCAR.
Training procedure
This model was trained in a collaborative manner by volunteer participants.
Contributors leaderboard
Hardware used
Eval results
We evaluate sahajBERT model quality and 2 other model benchmarks (XLM-R-large and IndicBert) by fine-tuning 3 times their pre-trained models on two downstream tasks in Bengali:
- NER: a named entity recognition on Bengali split of WikiANN dataset
- NCC: a multi-class classification task on news Soham News Category Classification dataset from IndicGLUE
| Base pre-trained Model | NER - F1 (mean ± std) | NCC - Accuracy (mean ± std) |
|---|---|---|
| sahajBERT | 95.45 ± 0.53 | 91.97 ± 0.47 |
| XLM-R-large | 96.48 ± 0.22 | 90.05 ± 0.38 |
| IndicBert | 92.52 ± 0.45 | 74.46 ± 1.91 |
BibTeX entry and citation info
Coming soon!