neuropark/sahajBERT

A collaboratively pre-trained model for the Bengali language, trained with masked language modeling (MLM) and sentence order prediction (SOP) objectives.

Model description

sahajBERT is a model composed of 1) a tokenizer specially designed for Bengali and 2) an ALBERT architecture collaboratively pre-trained on a dump of Wikipedia in Bengali and the Bengali part of OSCAR.
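
As a rough illustration of the SOP objective: ALBERT-style pre-training shows the model two consecutive text segments, either in their original order or swapped, and asks it to tell which case it is. The sketch below builds such pairs from plain strings. `make_sop_example` and `make_sop_batch` are hypothetical helpers, not the project's training code. The label convention (0 = original order, 1 = swapped) follows the one documented for `AlbertForPreTraining` in transformers, and the actual sahajBERT data pipeline may differ.

```python
import random
from typing import List, Tuple


def make_sop_example(first: str, second: str, swap: bool) -> Tuple[str, str, int]:
    """Return (segment_a, segment_b, label) for one SOP example.

    `first` and `second` are consecutive segments from the same document.
    Label 0 means original order, 1 means the segments were swapped.
    """
    if swap:
        return second, first, 1
    return first, second, 0


def make_sop_batch(segments: List[str], rng: random.Random) -> List[Tuple[str, str, int]]:
    """Pair each segment with its successor and swap roughly half of the pairs."""
    examples = []
    for first, second in zip(segments, segments[1:]):
        examples.append(make_sop_example(first, second, swap=rng.random() < 0.5))
    return examples


# Placeholder segments standing in for consecutive sentences of one document.
segments = ["segment one", "segment two", "segment three"]
batch = make_sop_batch(segments, random.Random(0))
```

Each element of `batch` is a `(segment_a, segment_b, label)` triple; a real pipeline would tokenize both segments before feeding them to the model.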

Intended uses & limitations

You can use the raw model for either masked language modeling or sentence order prediction, but it is mostly intended to be fine-tuned on downstream tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering.

We fine-tuned our model on two of these downstream tasks: sequence classification and token classification.

How to use

You can use this model directly with a pipeline for masked language modeling:


from transformers import AlbertForMaskedLM, FillMaskPipeline, PreTrainedTokenizerFast

# Initialize tokenizer

tokenizer = PreTrainedTokenizerFast.from_pretrained("neuropark/sahajBERT")

# Initialize model

model = AlbertForMaskedLM.from_pretrained("neuropark/sahajBERT")

# Initialize pipeline

pipeline = FillMaskPipeline(tokenizer=tokenizer, model=model)

raw_text = "ধন্যবাদ। আপনার সাথে কথা [MASK] ভালো লাগলো" # Change me

pipeline(raw_text)
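
The fill-mask pipeline returns a list of candidate fillings, each a dict with `score`, `token`, `token_str` and `sequence` keys. Below is a small sketch of picking the most likely candidates from such a list. `top_candidates` is a hypothetical helper, and the sample entries are placeholders with the right keys rather than real model output.

```python
from typing import Dict, List, Tuple


def top_candidates(predictions: List[Dict], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (token_str, score) pairs."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["token_str"], p["score"]) for p in ranked[:k]]


# Placeholder entries with the same keys the pipeline produces (not real output).
sample = [
    {"score": 0.10, "token": 11, "token_str": "<cand-b>", "sequence": "..."},
    {"score": 0.70, "token": 12, "token_str": "<cand-a>", "sequence": "..."},
    {"score": 0.05, "token": 13, "token_str": "<cand-c>", "sequence": "..."},
]

best = top_candidates(sample, k=2)
```

With the real pipeline, the equivalent call would be `top_candidates(pipeline(raw_text))`.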

Here is how to use this model to get the features of a given text in PyTorch:


from transformers import AlbertModel, PreTrainedTokenizerFast

# Initialize tokenizer

tokenizer = PreTrainedTokenizerFast.from_pretrained("neuropark/sahajBERT")

# Initialize model

model = AlbertModel.from_pretrained("neuropark/sahajBERT")

text = "ধন্যবাদ। আপনার সাথে কথা বলে ভালো লাগলো" # Change me

encoded_input = tokenizer(text, return_tensors='pt')

output = model(**encoded_input)
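
The `output` object exposes one vector per token as `output.last_hidden_state` (batch × tokens × hidden size). A common way to get a single vector per text, though not the only one, is to average the token vectors while skipping padding. The sketch below shows that arithmetic on plain Python lists so that it runs without any dependencies. `mean_pool` is a hypothetical helper, and this card does not say which pooling, if any, the authors used.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1, skipping padding."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    hidden_size = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(hidden_size)]


# Toy input: three token vectors of size 2, the last one being padding.
vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
sentence_vector = mean_pool(vectors, mask)
```

With real model outputs, the inputs would be `output.last_hidden_state[0].tolist()` and `encoded_input["attention_mask"][0].tolist()`, assuming the tokenizer returned an attention mask. In practice the same averaging is usually done directly on tensors.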

Limitations and bias

WIP

Training data

The tokenizer was trained on the Bengali part of OSCAR, and the model on a dump of Wikipedia in Bengali and the Bengali part of OSCAR.

Training procedure

This model was trained in a collaborative manner by volunteer participants.

Contributors leaderboard

Hardware used

Eval results

We evaluate the quality of sahajBERT against two baseline models (XLM-R-large and IndicBert) by fine-tuning each pre-trained model three times on two downstream tasks in Bengali:

NER: named entity recognition on the Bengali split of the WikiANN dataset
NCC: multi-class news classification on the Soham News Category Classification dataset from IndicGLUE

Base pre-trained Model	NER - F1 (mean ± std)	NCC - Accuracy (mean ± std)
sahajBERT	95.45 ± 0.53	91.97 ± 0.47
XLM-R-large	96.48 ± 0.22	90.05 ± 0.38
IndicBert	92.52 ± 0.45	74.46 ± 1.91
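
For readers who want to reproduce the "mean ± std" format from their own fine-tuning runs, here is a minimal sketch. `summarize_runs` is a hypothetical helper and the example scores are made up. The card does not state whether the sample or population standard deviation was reported, so the use of the sample standard deviation here is an assumption.

```python
import statistics
from typing import Sequence


def summarize_runs(scores: Sequence[float]) -> str:
    """Format repeated-run scores as 'mean ± std' with two decimals."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample std (assumption, see text above)
    return f"{mean:.2f} ± {std:.2f}"


# Hypothetical scores from three fine-tuning runs (not the reported results).
example_runs = [90.0, 92.0, 94.0]
summary = summarize_runs(example_runs)
```

Swapping `statistics.stdev` for `statistics.pstdev` gives the population variant, which yields smaller values for the same three runs.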

BibTeX entry and citation info

Coming soon!