PyTorch Huggingface BERT-NLP for Named Entity Recognition · Issue #328 · huggingface/transformers
I have been using HuggingFace's PyTorch implementation of Google's BERT with the MADE 1.0 dataset for quite some time now. Up until recently (11 Feb), I had been using the library and getting an F-score of 0.81 on my Named Entity Recognition task by fine-tuning the model. But this week, when I ran the exact same code that had run without errors before, it threw an error when executing this statement:
input_ids = pad_sequences([tokenizer.convert_tokens_to_ids(txt) for txt in tokenized_texts], maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")
ValueError: Token indices sequence length is longer than the specified
maximum sequence length for this BERT model (632 > 512). Running this
sequence through BERT will result in indexing errors
The full code is available in this Colab notebook.
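For what it's worth, a quick check like the sketch below (reusing the same tokenized_texts as above) confirms that some of my sequences exceed BERT's 512-token limit:

```python
# Count the tokenized sequences that exceed BERT's 512-token
# positional-embedding limit; these are the ones that now trigger
# the error above during convert_tokens_to_ids.
too_long = [len(txt) for txt in tokenized_texts if len(txt) > 512]
print(f"{len(too_long)} sequences over 512 tokens, longest = {max(too_long)}")
```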
To get around this error, I modified the above statement to the one below by taking only the first 512 tokens of each sequence, and made the necessary changes to add the index of [SEP] to the end of the truncated/padded sequence, as required by BERT.
input_ids = pad_sequences([tokenizer.convert_tokens_to_ids(txt[:512]) for txt in tokenized_texts], maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")
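In other words, the truncation plus [SEP] handling looks roughly like this sketch (truncate_with_sep is a hypothetical helper name; my notebook inlines the same logic):

```python
from keras.preprocessing.sequence import pad_sequences

def truncate_with_sep(tokens, limit):
    # Hypothetical helper: keep at most `limit` tokens, but make sure the
    # sequence still ends with BERT's [SEP] marker after truncation.
    if len(tokens) > limit:
        tokens = tokens[:limit - 1] + ["[SEP]"]
    return tokens

input_ids = pad_sequences(
    [tokenizer.convert_tokens_to_ids(truncate_with_sep(txt, 512))
     for txt in tokenized_texts],
    maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")
```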
The result shouldn't have changed, because I am only considering the first 512 tokens of each sequence and later truncating to 75 (my MAX_LEN=75), but my F-score has dropped to 0.40 and my precision to 0.27, while the recall remains the same (0.85). I am unable to share the dataset, as I have signed a confidentiality clause, but I can assure you that all the preprocessing required by BERT has been done, and that all extended tokens (e.g. Johanson -> Johan ##son) have been tagged with X and replaced after prediction, as described in the BERT paper. A toy sketch of what I mean is below.
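This is the WordPiece alignment scheme from the BERT paper, not my actual preprocessing code; the example words and tags are made up:

```python
# Each sub-token after the first gets the dummy label "X", which is
# stripped out again after prediction so labels realign with words.
words, tags = ["Johanson", "arrived"], ["B-PER", "O"]
tokens, labels = [], []
for word, tag in zip(words, tags):
    pieces = tokenizer.tokenize(word)          # e.g. ["Johan", "##son"]
    tokens.extend(pieces)
    labels.extend([tag] + ["X"] * (len(pieces) - 1))
# tokens -> ["Johan", "##son", "arrived"], labels -> ["B-PER", "X", "O"]
```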
Has anyone else faced a similar issue, or can anyone elaborate on what the problem might be, or what changes HuggingFace has recently made to the PyTorch implementation?