PhoBERT


Overview

The PhoBERT model was proposed in PhoBERT: Pre-trained language models for Vietnamese by Dat Quoc Nguyen and Anh Tuan Nguyen.

The abstract from the paper is the following:

We present PhoBERT with two versions, PhoBERT-base and PhoBERT-large, the first public large-scale monolingual language models pre-trained for Vietnamese. Experimental results show that PhoBERT consistently outperforms the recent best pre-trained multilingual model XLM-R (Conneau et al., 2020) and improves the state-of-the-art in multiple Vietnamese-specific NLP tasks including Part-of-speech tagging, Dependency parsing, Named-entity recognition and Natural language inference.

This model was contributed by dqnguyen. The original code can be found at https://github.com/VinAIResearch/PhoBERT.

Usage example

```python
import torch
from transformers import AutoModel, AutoTokenizer

phobert = AutoModel.from_pretrained("vinai/phobert-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")

# Input text must already be word-segmented (note the underscores).
line = "Tôi là sinh_viên trường đại_học Công_nghệ ."

input_ids = torch.tensor([tokenizer.encode(line)])

with torch.no_grad():
    features = phobert(input_ids)
```

The PhoBERT implementation is the same as BERT's, except for tokenization. Refer to the BERT documentation for information on configuration classes and their parameters. The PhoBERT-specific tokenizer is documented below.
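Since tokenization is the only PhoBERT-specific step, the standard high-level tokenizer call can replace the manual tokenizer.encode call above. A minimal sketch, assuming the vinai/phobert-base checkpoint and an already word-segmented input:

```python
import torch
from transformers import AutoModel, AutoTokenizer

phobert = AutoModel.from_pretrained("vinai/phobert-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")

# The tokenizer call builds input_ids and attention_mask in one step.
inputs = tokenizer("Tôi là sinh_viên trường đại_học Công_nghệ .", return_tensors="pt")

with torch.no_grad():
    outputs = phobert(**inputs)

# Contextual embeddings, shape (batch_size, sequence_length, hidden_size).
print(outputs.last_hidden_state.shape)
```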

PhobertTokenizer

class transformers.PhobertTokenizer

( vocab_file merges_file bos_token = '&lt;s&gt;' eos_token = '&lt;/s&gt;' sep_token = '&lt;/s&gt;' cls_token = '&lt;s&gt;' unk_token = '&lt;unk&gt;' pad_token = '&lt;pad&gt;' mask_token = '&lt;mask&gt;' **kwargs )

Construct a PhoBERT tokenizer, based on Byte-Pair Encoding.

This tokenizer inherits from PreTrainedTokenizer which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

add_from_file

( f )

Loads a pre-existing dictionary from a text file and adds its symbols to this instance.
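A minimal sketch of extending a loaded tokenizer's vocabulary from a local file. The file name extra_vocab.txt is hypothetical; following the expected dictionary format, each line holds a token followed by its count, separated by a space:

```python
from transformers import PhobertTokenizer

tokenizer = PhobertTokenizer.from_pretrained("vinai/phobert-base")

# "extra_vocab.txt" is a hypothetical file, e.g.:
#   xin_chào 1000
#   đại_học 500
tokenizer.add_from_file("extra_vocab.txt")
```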

build_inputs_with_special_tokens

( token_ids_0: typing.List[int] token_ids_1: typing.Optional[typing.List[int]] = None ) → List[int]

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. A PhoBERT sequence has the following format:

single sequence: &lt;s&gt; X &lt;/s&gt;
pair of sequences: &lt;s&gt; A &lt;/s&gt;&lt;/s&gt; B &lt;/s&gt;

Returns: List[int], the list of input IDs with the appropriate special tokens.
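A minimal sketch of the resulting layout, assuming the vinai/phobert-base checkpoint:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")

ids_a = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("Tôi là sinh_viên"))
ids_b = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("Tôi yêu Việt_Nam"))

single = tokenizer.build_inputs_with_special_tokens(ids_a)
pair = tokenizer.build_inputs_with_special_tokens(ids_a, ids_b)

# Single sequence: <s> ... </s>
print(tokenizer.convert_ids_to_tokens(single))
# Pair of sequences: <s> ... </s> </s> ... </s>
print(tokenizer.convert_ids_to_tokens(pair))
```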

convert_tokens_to_string

Converts a sequence of tokens (string) into a single string.

create_token_type_ids_from_sequences

( token_ids_0: typing.List[int] token_ids_1: typing.Optional[typing.List[int]] = None ) → List[int]

Create a mask from the two sequences passed to be used in a sequence-pair classification task. PhoBERT does not make use of token type ids; therefore, a list of zeros is returned.

Returns: List[int], a list of zeros.
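A short sketch showing the all-zeros mask, assuming the vinai/phobert-base checkpoint:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")

ids_a = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("Tôi là sinh_viên"))
ids_b = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("Tôi yêu Việt_Nam"))

# One entry per position (special tokens included), and every entry is zero.
print(tokenizer.create_token_type_ids_from_sequences(ids_a, ids_b))
```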

get_special_tokens_mask

( token_ids_0: typing.List[int] token_ids_1: typing.Optional[typing.List[int]] = None already_has_special_tokens: bool = False ) → List[int]

Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer's prepare_for_model method.

Returns: List[int], a list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
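A minimal sketch of both calling conventions, assuming the vinai/phobert-base checkpoint:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")

ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("Tôi là sinh_viên"))

# IDs without special tokens: the mask describes where <s> and </s> would go.
print(tokenizer.get_special_tokens_mask(ids))  # [1, 0, ..., 0, 1]

# IDs that already contain special tokens: set already_has_special_tokens=True.
with_special = tokenizer.encode("Tôi là sinh_viên")
print(tokenizer.get_special_tokens_mask(with_special, already_has_special_tokens=True))
```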
