
OpenAI GPT-3

Last Updated : 9 Nov, 2022

OpenAI GPT-3 was introduced by researchers at OpenAI as the next model in the GPT series, in the paper titled "Language Models are Few-Shot Learners". It has 175 billion parameters, which is 10x more than any previous non-sparse language model. It can perform a variety of tasks, from machine translation to code generation.

The model is not currently available for download because of concerns about misuse. OpenAI provides access to GPT-3 through a premium API, which is currently in beta.

Zero-shot, one-shot and few-shot learning

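These three settings are forms of in-context learning: the model is given a description of the task together with zero, one, or a few worked examples in its prompt, and must then complete a new instance of the task. No gradient updates are performed, so the model's weights stay fixed. This is the main way GPT-3 is evaluated.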
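As a rough illustration of how the three settings differ only in the prompt, here is a minimal sketch in Python. The helper `build_prompt` and the `=>` separator are hypothetical choices made for this example, not part of any OpenAI API; the translation examples follow the style used in the paper.

```python
def build_prompt(task_description, examples, query):
    """Assemble an in-context prompt (hypothetical helper).

    The number of examples decides the setting:
    0 -> zero-shot, 1 -> one-shot, k > 1 -> few-shot.
    """
    lines = [task_description]
    for source, target in examples:
        # Each demonstration shows an input followed by its expected output.
        lines.append(f"{source} => {target}")
    # The final line is left open for the model to complete.
    lines.append(f"{query} =>")
    return "\n".join(lines)


task = "Translate English to French:"
demos = [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")]

zero_shot = build_prompt(task, [], "cheese")
one_shot = build_prompt(task, demos[:1], "cheese")
few_shot = build_prompt(task, demos, "cheese")

print(few_shot)
```

In all three cases the model's weights are untouched; only the text placed in front of the query changes.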

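Fine-Tuning: In this approach, the weights of a pre-trained model are updated by training it on a labelled dataset specific to the target task. Gradient updates are performed after every example or batch, as in the ordinary training of neural networks. The GPT-3 paper does not fine-tune the model; it evaluates GPT-3 in the zero-shot, one-shot and few-shot settings only.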
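To make the contrast with in-context learning concrete, the sketch below performs a gradient update after every example on a toy one-parameter-pair linear model. It is only an illustration of per-example gradient updates, assuming a squared-error loss; the function `sgd_fine_tune`, the toy data, and the hyperparameters are all hypothetical and have nothing to do with how a model of GPT-3's size would actually be fine-tuned.

```python
def sgd_fine_tune(weight, bias, examples, learning_rate=0.1, epochs=200):
    """Update (weight, bias) with one gradient step per example (toy sketch)."""
    for _ in range(epochs):
        for x, y in examples:
            error = (weight * x + bias) - y
            # Gradient of 0.5 * error**2 with respect to weight and bias.
            weight -= learning_rate * error * x
            bias -= learning_rate * error
    return weight, bias


# Toy "task-specific" data drawn from y = 2x + 1.
task_examples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]

# Start from arbitrary "pre-trained" values and adapt them to the task.
weight, bias = sgd_fine_tune(0.0, 0.0, task_examples)
print(weight, bias)
```

Unlike the in-context settings, the parameters themselves change here, which is why fine-tuning needs a separate labelled dataset for each task.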

Architecture: GPT-3 was trained in eight model sizes, with the number of parameters ranging from 125 million to 175 billion. The architectural details of the different GPT-3 models are given below.

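| Model Name | n_params | n_layers | d_model | n_heads | d_head | Batch Size (tokens) | Learning Rate |
|---|---|---|---|---|---|---|---|
| GPT-3 Small | 125 M | 12 | 768 | 12 | 64 | 0.5 M | 6.0 × 10^-4 |
| GPT-3 Medium | 350 M | 24 | 1024 | 16 | 64 | 0.5 M | 3.0 × 10^-4 |
| GPT-3 Large | 760 M | 24 | 1536 | 16 | 96 | 0.5 M | 2.5 × 10^-4 |
| GPT-3 XL | 1.3 B | 24 | 2048 | 24 | 128 | 1 M | 2.0 × 10^-4 |
| GPT-3 2.7 B | 2.7 B | 32 | 2560 | 32 | 80 | 1 M | 1.6 × 10^-4 |
| GPT-3 6.7 B | 6.7 B | 32 | 4096 | 32 | 128 | 2 M | 1.2 × 10^-4 |
| GPT-3 13 B | 13 B | 40 | 5140 | 40 | 128 | 2 M | 1.0 × 10^-4 |
| GPT-3 175 B | 175 B | 96 | 12288 | 96 | 128 | 3.2 M | 0.6 × 10^-4 |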
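The parameter counts in the table can be roughly recovered from the number of layers and the model width. The sketch below uses the common rule of thumb of about 12 × d_model² parameters per transformer layer (attention plus a feed-forward block four times as wide as d_model), ignoring biases and layer norms. The vocabulary size (50257) and context length (2048) are assumptions taken from outside this article, and `estimate_params` is a hypothetical helper, so treat the output as an approximation only.

```python
VOCAB_SIZE = 50257      # assumption: BPE vocabulary size, not stated in this article
CONTEXT_LENGTH = 2048   # assumption: context window, not stated in this article


def estimate_params(n_layers, d_model):
    """Approximate parameter count of a decoder-only transformer (rule of thumb)."""
    # Attention: 4 * d_model^2 (query, key, value, output projections).
    # Feed-forward: 8 * d_model^2 (two matrices with a 4x wider hidden layer).
    per_layer = 12 * d_model ** 2
    # Token embeddings plus learned position embeddings.
    embeddings = (VOCAB_SIZE + CONTEXT_LENGTH) * d_model
    return n_layers * per_layer + embeddings


# (n_layers, d_model) pairs copied from the table above.
MODELS = {
    "GPT-3 Small": (12, 768),
    "GPT-3 Medium": (24, 1024),
    "GPT-3 Large": (24, 1536),
    "GPT-3 XL": (24, 2048),
    "GPT-3 2.7 B": (32, 2560),
    "GPT-3 6.7 B": (32, 4096),
    "GPT-3 13 B": (40, 5140),
    "GPT-3 175 B": (96, 12288),
}

for name, (n_layers, d_model) in MODELS.items():
    estimate = estimate_params(n_layers, d_model)
    print(f"{name}: ~{estimate / 1e9:.2f} B parameters")
```

The estimates for the smallest and largest models land within a fraction of a percent of the table's 125 M and 175 B figures, which suggests that nearly all of the parameters sit in the attention and feed-forward weight matrices.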

Result Details:

Results on SuperGLUE benchmarks

Sample news article generated by GPT-3:

Title: United Methodists Agree to Historic Split
Subtitle: Those who oppose gay marriage will form their own denomination
Article: After two days of intense debate, the United Methodist Church has agreed to a historic split - one that is expected to end in the creation of a new denomination, one that will be "theologically and socially conservative," according to The Washington Post. The majority of delegates attending the church's annual General Conference in May voted to strengthen a ban on the ordination of LGBTQ clergy and to write new rules that will "discipline" clergy who officiate at same-sex weddings. But those who opposed these measures have a new plan: They say they will form a separate denomination by 2020, calling their church the Christian Methodist denomination. The Post notes that the denomination, which claims 12.5 million members, was in the early 20th century the "largest Protestant denomination in the U.S.," but that it has been shrinking in recent decades. The new split will be the second in the church's history. The first occurred in 1968, when roughly 10 percent of the denomination left to form the Evangelical United Brethren Church. The Post notes that the proposed split "comes at a critical time for the church, which has been losing members for years," which has been "pushed toward the brink of a schism over the role of LGBTQ people in the church." Gay marriage is not the only issue that has divided the church. In 2016, the denomination was split over ordination of transgender clergy, with the North Pacific regional conference voting to ban them from serving as clergy, and the South Pacific regional conference voting to allow them.

Datasets Used: Five datasets are used in training. The largest is the Common Crawl dataset, which contains nearly a trillion words before filtering; after filtering and preprocessing, about 410 billion tokens remain. The other datasets are an expanded version of the WebText dataset (WebText2), two internet-based book corpora (Books1 and Books2), and English Wikipedia.

| Dataset | Quantity (Num Tokens) | Weight in Training Mix |
|---|---|---|
| Common Crawl (filtered) | 410 billion | 60% |
| WebText2 | 19 billion | 22% |
| Books1 | 12 billion | 8% |
| Books2 | 55 billion | 8% |
| Wikipedia | 3 billion | 3% |
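The training-mix weights are not proportional to dataset size. As a small worked calculation, the sketch below compares each dataset's weight in the mix with its share of the total token count, using only the numbers from the table above. The helper `sampling_ratio` is hypothetical; a ratio above 1 means the dataset is sampled more often than its size alone would suggest.

```python
# (tokens, weight in training mix) copied from the table above.
DATASETS = {
    "Common Crawl (filtered)": (410e9, 0.60),
    "WebText2": (19e9, 0.22),
    "Books1": (12e9, 0.08),
    "Books2": (55e9, 0.08),
    "Wikipedia": (3e9, 0.03),
}


def sampling_ratio(datasets):
    """Return mix weight divided by share of total tokens, per dataset."""
    total_tokens = sum(tokens for tokens, _ in datasets.values())
    return {
        name: weight / (tokens / total_tokens)
        for name, (tokens, weight) in datasets.items()
    }


ratios = sampling_ratio(DATASETS)
for name, ratio in ratios.items():
    label = "oversampled" if ratio > 1 else "undersampled"
    print(f"{name}: {ratio:.2f}x ({label})")
```

Under these numbers, the smaller curated sources (WebText2, Books1, Wikipedia) come out oversampled relative to their size, while Common Crawl and Books2 come out undersampled.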

Training Details:

All versions of GPT-3 are (pre-)trained with the Adam optimizer, using β1 = 0.9, β2 = 0.95, and ε = 10^-8. The batch size is increased linearly from a small value (32k tokens) to its full value over the first 4-12 billion tokens of training, depending on the model size. The data is sampled without replacement during training to minimize overfitting.
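The linear batch-size ramp described above could be sketched as follows. This is only an illustration: the function `batch_size_tokens` is hypothetical, "32k" is assumed to mean 32,000 tokens, and the ramp length of 4 billion tokens in the usage example is an arbitrary pick from the stated 4-12 billion range.

```python
def batch_size_tokens(tokens_seen, full_batch, ramp_tokens, start_batch=32_000):
    """Batch size (in tokens) after `tokens_seen` training tokens.

    Grows linearly from `start_batch` to `full_batch` over the first
    `ramp_tokens` tokens, then stays at `full_batch`.
    """
    if tokens_seen >= ramp_tokens:
        return full_batch
    fraction = tokens_seen / ramp_tokens
    return int(start_batch + fraction * (full_batch - start_batch))


# Example: a 3.2 M-token full batch with an assumed 4-billion-token ramp.
FULL_BATCH = 3_200_000
RAMP_TOKENS = 4_000_000_000

for seen in (0, 1_000_000_000, 2_000_000_000, 4_000_000_000, 10_000_000_000):
    size = batch_size_tokens(seen, FULL_BATCH, RAMP_TOKENS)
    print(f"after {seen:>14,} tokens: batch of {size:,} tokens")
```

Starting with small batches and growing them gradually is a common way to keep early training steps cheap while still reaching the large batch sizes listed in the architecture table.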

Limitations:

Despite its strong improvements in qualitative and quantitative results, GPT-3 also has some limitations:
