PLBart

PyTorch

Overview

The PLBART model was proposed in Unified Pre-training for Program Understanding and Generation by Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. This is a BART-like model which can be used to perform code summarization, code generation, and code translation tasks. The pre-trained model plbart-base has been trained using a multilingual denoising task on Java, Python and English.

According to the abstract:

Code summarization and generation empower conversion between programming language (PL) and natural language (NL), while code translation avails the migration of legacy code from one PL to another. This paper introduces PLBART, a sequence-to-sequence model capable of performing a broad spectrum of program and language understanding and generation tasks. PLBART is pre-trained on an extensive collection of Java and Python functions and associated NL text via denoising autoencoding. Experiments on code summarization in the English language, code generation, and code translation in seven programming languages show that PLBART outperforms or rivals state-of-the-art models. Moreover, experiments on discriminative tasks, e.g., program repair, clone detection, and vulnerable code detection, demonstrate PLBART’s effectiveness in program understanding. Furthermore, analysis reveals that PLBART learns program syntax, style (e.g., identifier naming convention), logical flow (e.g., if block inside an else block is equivalent to else if block) that are crucial to program semantics and thus excels even with limited annotations.

This model was contributed by gchhablani. The authors' code can be found here.

Usage examples

PLBart is a multilingual encoder-decoder (sequence-to-sequence) model primarily intended for code-to-text, text-to-code, and code-to-code tasks. Because the model is multilingual, it expects sequences in a particular format: a special language id token is added to both the source and target text. The source text format is X [eos, src_lang_code], where X is the source text. The target text format is [tgt_lang_code] X [eos]. bos is never used.

However, in some fine-tuning cases no language token is provided when a single language is used. Please refer to the paper to learn more about this.

In cases where the language code is needed, the regular __call__() will encode the source text format when you pass texts as the first argument or with the keyword argument text, and will encode the target text format if it's passed with the text_target keyword argument.
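A minimal sketch of the two encoding paths, assuming the uclanlp/plbart-python-en_XX checkpoint; printing the tokens shows where the language id and eos tokens end up:

from transformers import PLBartTokenizer

tokenizer = PLBartTokenizer.from_pretrained("uclanlp/plbart-python-en_XX", src_lang="python", tgt_lang="en_XX")

# Passing text as the first argument (or via text=...) encodes the source format
src_ids = tokenizer("def f(): pass")["input_ids"]
# Passing text via text_target=... encodes the target format (used as labels)
tgt_ids = tokenizer(text_target="A function that does nothing.")["input_ids"]

print(tokenizer.convert_ids_to_tokens(src_ids))
print(tokenizer.convert_ids_to_tokens(tgt_ids))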

Supervised training

from transformers import PLBartForConditionalGeneration, PLBartTokenizer

tokenizer = PLBartTokenizer.from_pretrained("uclanlp/plbart-base", src_lang="python", tgt_lang="en_XX")
example_python_phrase = "def maximum(a,b,c):NEW_LINE_INDENTreturn max([a,b,c])"
expected_translation_english = "Returns the maximum value of a b c."
inputs = tokenizer(example_python_phrase, text_target=expected_translation_english, return_tensors="pt")
model = PLBartForConditionalGeneration.from_pretrained("uclanlp/plbart-base")
outputs = model(**inputs)  # outputs.loss is the supervised training loss

Generation

While generating the target text, set the decoder_start_token_id to the target language id. The following example shows how to translate Python to English using the uclanlp/plbart-python-en_XX model.

from transformers import PLBartForConditionalGeneration, PLBartTokenizer

tokenizer = PLBartTokenizer.from_pretrained("uclanlp/plbart-python-en_XX", src_lang="python", tgt_lang="en_XX")
example_python_phrase = "def maximum(a,b,c):NEW_LINE_INDENTreturn max([a,b,c])"
inputs = tokenizer(example_python_phrase, return_tensors="pt")
model = PLBartForConditionalGeneration.from_pretrained("uclanlp/plbart-python-en_XX")
translated_tokens = model.generate(**inputs, decoder_start_token_id=tokenizer.lang_code_to_id["en_XX"])
tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
"Returns the maximum value of a b c."

Resources

PLBartConfig

class transformers.PLBartConfig


( vocab_size = 50005 max_position_embeddings = 1024 encoder_layers = 6 encoder_ffn_dim = 3072 encoder_attention_heads = 12 decoder_layers = 6 decoder_ffn_dim = 3072 decoder_attention_heads = 12 encoder_layerdrop = 0.0 decoder_layerdrop = 0.0 use_cache = True is_encoder_decoder = True activation_function = 'gelu' d_model = 768 dropout = 0.1 attention_dropout = 0.1 activation_dropout = 0.0 init_std = 0.02 classifier_dropout = 0.0 scale_embedding = True pad_token_id = 1 bos_token_id = 0 eos_token_id = 2 forced_eos_token_id = 2 **kwargs )


This is the configuration class to store the configuration of a PLBartModel. It is used to instantiate a PLBART model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the uclanlp/plbart-base architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example:

from transformers import PLBartConfig, PLBartModel

configuration = PLBartConfig()

model = PLBartModel(configuration)

configuration = model.config
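The defaults can also be overridden at construction time to define a custom architecture; a minimal sketch with illustrative (not recommended) values:

from transformers import PLBartConfig, PLBartModel

# Hypothetical smaller variant, for illustration only
configuration = PLBartConfig(
    d_model=512,
    encoder_layers=3,
    decoder_layers=3,
    encoder_attention_heads=8,
    decoder_attention_heads=8,
)
model = PLBartModel(configuration)  # randomly initialized with the custom sizes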

PLBartTokenizer

class transformers.PLBartTokenizer

( vocab_file bos_token = '&lt;s&gt;' eos_token = '&lt;/s&gt;' sep_token = '&lt;/s&gt;' cls_token = '&lt;s&gt;' unk_token = '&lt;unk&gt;' pad_token = '&lt;pad&gt;' mask_token = '&lt;mask&gt;' language_codes = 'base' tokenizer_file = None src_lang = None tgt_lang = None sp_model_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None additional_special_tokens = None clean_up_tokenization_spaces = True **kwargs )

Construct a PLBART tokenizer.

Adapted from RobertaTokenizer and XLNetTokenizer. Based on SentencePiece.

The tokenization method is &lt;tokens&gt; &lt;eos&gt; &lt;language code&gt; for source language documents, and &lt;language code&gt; &lt;tokens&gt; &lt;eos&gt; for target language documents.

Examples:

from transformers import PLBartTokenizer

tokenizer = PLBartTokenizer.from_pretrained("uclanlp/plbart-python-en_XX", src_lang="python", tgt_lang="en_XX")
example_python_phrase = "def maximum(a,b,c):NEW_LINE_INDENTreturn max([a,b,c])"
expected_translation_english = "Returns the maximum value of a b c."
inputs = tokenizer(example_python_phrase, text_target=expected_translation_english, return_tensors="pt")

build_inputs_with_special_tokens


( token_ids_0: typing.List[int] token_ids_1: typing.Optional[typing.List[int]] = None ) → List[int]

Returns

List[int]: the list of input IDs with the appropriate special tokens.

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. A PLBART sequence has the following format, where X represents the sequence:

input_ids (for the encoder): X [eos, src_lang_code]
decoder_input_ids (for the decoder): X [eos, tgt_lang_code]

BOS is never used. Pairs of sequences are not the expected use case, but they will be handled without a separator.
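A minimal sketch of calling this method directly, assuming the uclanlp/plbart-base checkpoint:

from transformers import PLBartTokenizer

tokenizer = PLBartTokenizer.from_pretrained("uclanlp/plbart-base", src_lang="python")
token_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("def f(): pass"))
with_special = tokenizer.build_inputs_with_special_tokens(token_ids)
# eos and the src_lang id token should appear at the end of the sequence
print(tokenizer.convert_ids_to_tokens(with_special))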

PLBartModel

class transformers.PLBartModel


( config: PLBartConfig )


The bare PLBART Model outputting raw hidden-states without any specific head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward


( input_ids: typing.Optional[torch.LongTensor] = None attention_mask: typing.Optional[torch.LongTensor] = None decoder_input_ids: typing.Optional[torch.LongTensor] = None decoder_attention_mask: typing.Optional[torch.Tensor] = None head_mask: typing.Optional[torch.Tensor] = None decoder_head_mask: typing.Optional[torch.LongTensor] = None cross_attn_head_mask: typing.Optional[torch.Tensor] = None encoder_outputs: typing.Optional[typing.List[torch.FloatTensor]] = None past_key_values: typing.Optional[typing.List[torch.FloatTensor]] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None decoder_inputs_embeds: typing.Optional[torch.FloatTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None cache_position: typing.Optional[torch.Tensor] = None ) → transformers.modeling_outputs.Seq2SeqModelOutput or tuple(torch.FloatTensor)

Returns

A transformers.modeling_outputs.Seq2SeqModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (PLBartConfig) and inputs.

The PLBartModel forward method, overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
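A minimal sketch of a forward pass through the bare model, assuming the uclanlp/plbart-base checkpoint (as in BART, when no decoder inputs are given they default to a shifted copy of input_ids):

import torch
from transformers import PLBartModel, PLBartTokenizer

tokenizer = PLBartTokenizer.from_pretrained("uclanlp/plbart-base", src_lang="python")
model = PLBartModel.from_pretrained("uclanlp/plbart-base")

inputs = tokenizer("def maximum(a,b,c):NEW_LINE_INDENTreturn max([a,b,c])", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Decoder hidden states: (batch_size, sequence_length, d_model)
print(outputs.last_hidden_state.shape)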

PLBartForConditionalGeneration

class transformers.PLBartForConditionalGeneration


( config: PLBartConfig )


The PLBART Model with a language modeling head. Can be used for code-to-text, text-to-code, and code-to-code tasks.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward


( input_ids: typing.Optional[torch.LongTensor] = None attention_mask: typing.Optional[torch.LongTensor] = None decoder_input_ids: typing.Optional[torch.LongTensor] = None decoder_attention_mask: typing.Optional[torch.Tensor] = None head_mask: typing.Optional[torch.Tensor] = None decoder_head_mask: typing.Optional[torch.LongTensor] = None cross_attn_head_mask: typing.Optional[torch.Tensor] = None encoder_outputs: typing.Optional[typing.List[torch.FloatTensor]] = None past_key_values: typing.Optional[typing.List[torch.FloatTensor]] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None decoder_inputs_embeds: typing.Optional[torch.FloatTensor] = None labels: typing.Optional[torch.Tensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None cache_position: typing.Optional[torch.Tensor] = None ) → transformers.modeling_outputs.Seq2SeqLMOutput or tuple(torch.FloatTensor)

Returns

A transformers.modeling_outputs.Seq2SeqLMOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (PLBartConfig) and inputs.

The PLBartForConditionalGeneration forward method, overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example of mask-filling:

from transformers import AutoTokenizer, PLBartForConditionalGeneration

model = PLBartForConditionalGeneration.from_pretrained("uclanlp/plbart-base")
tokenizer = AutoTokenizer.from_pretrained("uclanlp/plbart-base")

# en_XX is the language id token for English
TXT = "&lt;/s&gt; Is 0 the &lt;mask&gt; Fibonacci number ? &lt;/s&gt; en_XX"
input_ids = tokenizer([TXT], add_special_tokens=False, return_tensors="pt").input_ids

logits = model(input_ids).logits
masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item()
probs = logits[0, masked_index].softmax(dim=0)
values, predictions = probs.topk(5)

tokenizer.decode(predictions).split()
['first', 'same', 'highest', 'result', 'number']

PLBartForSequenceClassification

class transformers.PLBartForSequenceClassification


( config: PLBartConfig **kwargs )


PLBart model with a sequence classification head on top (a linear layer on top of the pooled output), e.g. for code classification.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward


( input_ids: typing.Optional[torch.LongTensor] = None attention_mask: typing.Optional[torch.Tensor] = None decoder_input_ids: typing.Optional[torch.LongTensor] = None decoder_attention_mask: typing.Optional[torch.LongTensor] = None head_mask: typing.Optional[torch.Tensor] = None decoder_head_mask: typing.Optional[torch.Tensor] = None cross_attn_head_mask: typing.Optional[torch.Tensor] = None encoder_outputs: typing.Optional[typing.List[torch.FloatTensor]] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None decoder_inputs_embeds: typing.Optional[torch.FloatTensor] = None labels: typing.Optional[torch.LongTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None cache_position: typing.Optional[torch.LongTensor] = None ) → transformers.modeling_outputs.Seq2SeqSequenceClassifierOutput or tuple(torch.FloatTensor)

Returns

A transformers.modeling_outputs.Seq2SeqSequenceClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (PLBartConfig) and inputs.

The PLBartForSequenceClassification forward method, overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example of single-label classification:

import torch
from transformers import AutoTokenizer, PLBartForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("uclanlp/plbart-base")
model = PLBartForSequenceClassification.from_pretrained("uclanlp/plbart-base")

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted_class_id = logits.argmax().item()
model.config.id2label[predicted_class_id]
...

num_labels = len(model.config.id2label)
model = PLBartForSequenceClassification.from_pretrained("uclanlp/plbart-base", num_labels=num_labels)

labels = torch.tensor([1])
loss = model(**inputs, labels=labels).loss
round(loss.item(), 2)
...

Example of multi-label classification:

import torch
from transformers import AutoTokenizer, PLBartForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("uclanlp/plbart-base")
model = PLBartForSequenceClassification.from_pretrained("uclanlp/plbart-base", problem_type="multi_label_classification")

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted_class_ids = torch.arange(0, logits.shape[-1])[torch.sigmoid(logits).squeeze(dim=0) > 0.5]

num_labels = len(model.config.id2label)
model = PLBartForSequenceClassification.from_pretrained(
    "uclanlp/plbart-base", num_labels=num_labels, problem_type="multi_label_classification"
)

labels = torch.sum(
    torch.nn.functional.one_hot(predicted_class_ids[None, :].clone(), num_classes=num_labels), dim=1
).to(torch.float)
loss = model(**inputs, labels=labels).loss

PLBartForCausalLM

class transformers.PLBartForCausalLM


( config )

forward


( input_ids: typing.Optional[torch.LongTensor] = None attention_mask: typing.Optional[torch.Tensor] = None encoder_hidden_states: typing.Optional[torch.FloatTensor] = None encoder_attention_mask: typing.Optional[torch.FloatTensor] = None head_mask: typing.Optional[torch.Tensor] = None cross_attn_head_mask: typing.Optional[torch.Tensor] = None past_key_values: typing.Optional[typing.List[torch.FloatTensor]] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None labels: typing.Optional[torch.LongTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None cache_position: typing.Optional[torch.LongTensor] = None ) → transformers.modeling_outputs.CausalLMOutputWithCrossAttentions or tuple(torch.FloatTensor)

Returns

A transformers.modeling_outputs.CausalLMOutputWithCrossAttentions or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (PLBartConfig) and inputs.

The PLBartForCausalLM forward method, overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example:

from transformers import AutoTokenizer, PLBartForCausalLM

tokenizer = AutoTokenizer.from_pretrained("uclanlp/plbart-base")
model = PLBartForCausalLM.from_pretrained("uclanlp/plbart-base", add_cross_attention=False)
assert model.config.is_decoder, f"{model.__class__} has to be configured as a decoder."
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)

logits = outputs.logits
expected_shape = [1, inputs.input_ids.shape[-1], model.config.vocab_size]
list(logits.shape) == expected_shape
True
