bert - Pretrained BERT model - MATLAB

Pretrained BERT model

Since R2023b

Syntax

[net,tokenizer] = bert
[net,tokenizer] = bert(Name=Value)

Description

A Bidirectional Encoder Representations from Transformers (BERT) model is a transformer neural network that can be fine-tuned for natural language processing tasks such as document classification and sentiment analysis. The network uses attention layers to analyze text in context and capture long-range dependencies between words.

[net,tokenizer] = bert returns a pretrained BERT-Base model and the corresponding tokenizer.


[net,tokenizer] = bert(Name=Value) specifies additional options using one or more name-value arguments.

Examples


Load Pretrained BERT Neural Network

Load a pretrained BERT-Base neural network and the corresponding tokenizer using the bert function. If the Text Analytics Toolbox™ Model for BERT-Base Network support package is not installed, then the function provides a link to the required support package in the Add-On Explorer. To install the support package, click the link, and then click Install.
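
[net,tokenizer] = bert;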

View the network properties.
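
net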

net = dlnetwork with properties:

     Layers: [129x1 nnet.cnn.layer.Layer]
Connections: [164x2 table]
 Learnables: [197x3 table]
      State: [0x3 table]
 InputNames: {'input_ids'  'attention_mask'  'seg_ids'}
OutputNames: {'enc12_layernorm2'}
Initialized: 1

View summary with summary.
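
summary(net)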

View the tokenizer.
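
tokenizer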

tokenizer = bertTokenizer with properties:

    IgnoreCase: 1
  StripAccents: 1
  PaddingToken: "[PAD]"
   PaddingCode: 1
    StartToken: "[CLS]"
     StartCode: 102
  UnknownToken: "[UNK]"
   UnknownCode: 101
SeparatorToken: "[SEP]"
 SeparatorCode: 103
   ContextSize: 512

Input Arguments


Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Example: [net,tokenizer] = bert(Model="tiny") returns a pretrained BERT-Tiny model and the corresponding tokenizer.

Model — BERT model

"base" (default) | "tiny" | "mini" | "small" | "large" | "multilingual"

BERT model, specified as "base", "tiny", "mini", "small", "large", or "multilingual". The default is "base", which returns the BERT-Base model.

Head — Model head

"none" (default) | "document-classifier"

Model head, specified as "none" or "document-classifier". If you specify "document-classifier", the model includes a document classification head, with the number of classes given by the NumClasses argument. The default is "none".

NumClasses — Number of classes for document classification head

2 (default) | positive integer

Number of classes for the document classification head, specified as a positive integer.

This option applies only when Head is "document-classifier".
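
For example, a call along these lines (with an illustrative class count of 5) returns a BERT-Tiny model with a document classification head:

[net,tokenizer] = bert(Model="tiny",Head="document-classifier",NumClasses=5);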

DropoutProbability — Probability of dropping out input elements in dropout layers

0.1 (default) | scalar in the range [0, 1)

Probability of dropping out input elements in dropout layers, specified as a scalar in the range [0, 1).

When you train a neural network with dropout layers, the layer randomly sets input elements to zero using the dropout mask rand(size(X)) < p, where X is the layer input and p is the layer dropout probability. The layer then scales the remaining elements by 1/(1-p).

This operation helps to prevent the network from overfitting [2], [3]. A higher number results in the network dropping more elements during training. At prediction time, the output of the layer is equal to its input.
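
A rough sketch of this operation in MATLAB (illustrative only, not the layer implementation; p and X are example values):

p = 0.1;                  % dropout probability
X = rand(4,3,"single");   % example layer input
mask = rand(size(X)) < p; % elements to drop out
Y = X.*~mask./(1-p);      % zero the dropped elements and rescale the rest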

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64

AttentionDropoutProbability — Probability of dropping out input elements in attention layers

0.1 (default) | scalar in the range [0, 1)

Probability of dropping out input elements in attention layers, specified as a scalar in the range [0, 1).

When you train a neural network with attention layers, the layer randomly sets attention scores to zero using the dropout mask rand(size(scores)) < p, where scores is the matrix of attention scores and p is the layer dropout probability. The layer then scales the remaining elements by 1/(1-p).

This operation helps to prevent the network from overfitting [2], [3]. A higher number results in the network dropping more elements during training. At prediction time, the output of the layer is equal to its input.

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64

Output Arguments


net — Pretrained BERT model

dlnetwork object

Pretrained BERT model, returned as a dlnetwork (Deep Learning Toolbox) object.

tokenizer — BERT tokenizer

bertTokenizer object

BERT tokenizer, returned as a bertTokenizer object.
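
As a minimal usage sketch (assuming the encode object function of the bertTokenizer object), you can convert text into the token codes the network expects:

str = "Bidirectional Encoder Representations from Transformers";
tokenCodes = encode(tokenizer,str);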

References

[1] Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding." Preprint, submitted May 24, 2019. https://doi.org/10.48550/arXiv.1810.04805.

[2] Srivastava, Nitish, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. "Dropout: A Simple Way to Prevent Neural Networks from Overfitting." The Journal of Machine Learning Research 15, no. 1 (January 1, 2014): 1929–58.

[3] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." Communications of the ACM 60, no. 6 (May 24, 2017): 84–90. https://doi.org/10.1145/3065386.

Version History

Introduced in R2023b