Meta releases new Llama 3.1 models, including highly anticipated 405B parameter variant - IBM Blog

On Tuesday, July 23, Meta announced the launch of the Llama 3.1 collection of multilingual large language models (LLMs). Llama 3.1 comprises both pretrained and instruction-tuned text in/text out open source generative AI models in sizes of 8B, 70B and—for the first time—405B parameters.

The instruction-tuned Llama 3.1-405B, which figures to be the largest and most powerful open source language model available today and competitive with the best proprietary models on the market, will be available on IBM® watsonx.ai™ today, where it can be deployed on IBM Cloud, in a hybrid cloud environment or on-premises.

The Llama 3.1 release follows the April 18 launch of Llama 3 models. In the accompanying launch announcement, Meta stated that “[their] goal in the near future is to make Llama 3 multilingual and multimodal, have longer context, and continue to improve overall performance across LLM capabilities such as reasoning and coding.”

Today’s launch of Llama 3.1 demonstrates significant progress toward that goal, from dramatically increased context length to expanded tool use and multilingual capabilities.

An important step forward for accessible, open, responsible AI innovation

In December of 2023, Meta and IBM launched the AI Alliance in collaboration with over 50 global founding members and collaborators. Bringing together leading organizations across industry, startups, academia, research and government, the AI Alliance aspires to shape the evolution of AI to best reflect the needs and complexity of our societies. Since its founding, the Alliance has grown to over 100 members.

More specifically, the AI Alliance is dedicated to fostering an open community that enables developers and researchers to accelerate responsible innovation while ensuring trust, safety, security, diversity, scientific rigor and economic competitiveness. To that end, the Alliance supports projects that develop and deploy benchmarks and evaluation standards, help address society-wide challenges, support global AI skills building and encourage open development of AI in safe and beneficial ways.

Llama 3.1 furthers that mission by providing the global AI community with an open, state-of-the-art model family and development ecosystem to build, experiment and responsibly scale new ideas and approaches. Alongside its powerful new models, the release includes robust system level safety measures, new cyber security evaluation measures and updated inference-time guardrails. Collectively, these resources encourage standardization of the development and usage of trust and safety tools for generative AI.

How Llama 3.1-405B compares to leading models

Upcoming Llama models with “over 400B parameters” were discussed in the April announcement of Llama 3, including some preliminary evaluation of model performance, but their exact size and specifics were not made public until today’s launch. While Llama 3.1 represents major upgrades across all model sizes, the new 405B open source model achieves unprecedented parity with leading proprietary, closed source LLMs.

Updated figures released by Meta today paint a comprehensive picture of how impressively the 405B model stacks up against other state-of-the-art offerings. Here’s how it compares to leading LLMs across common benchmarks.[1]

Undergraduate level knowledge (MMLU, 5-shot): With a score of 87.3%, the instruction-tuned Llama 405B more than matched OpenAI’s GPT-4-Turbo (86.5%), Anthropic’s Claude 3 Opus (86.8%) and Google’s Gemini 1.5 Pro (85.9%) while cleanly outperforming Gemini 1.0 Ultra (83.7%), Google’s largest Gemini model.
Graduate level reasoning (GPQA, 0-shot): Llama 405B Instruct’s GPQA score (50.7%) matched Claude 3 Opus (50.4%), edged GPT-4T (48.0%) and significantly exceeded that of Claude 3 Sonnet (40.4%), Claude 3 Haiku (33.3%) and GPT-4 (35.7%).
Math problem solving (MATH, 0-shot CoT): Llama 405B Instruct (73.8%) was beaten only by GPT-4o (76.6%). It edged GPT-4T (72.6%) and Anthropic’s newest model, Claude 3.5 Sonnet (71.1%) and significantly beat Claude 3 Opus (60.1%). Even when comparing Llama’s 0-shot MATH score to 4-shot MATH scores of other models, Llama dramatically outperformed GPT-4 (42.5%), Gemini Ultra 1.0 (53.2%) and Gemini Pro 1.5 (58.5%).
Reading comprehension (DROP, F1): The base pre-trained Llama 405B (84.8) outperformed GPT-4o (83.4), Claude 3 Opus (83.1), Gemini 1.0 Ultra (82.4) and Gemini 1.5 Pro (78.9). It was outmatched only by GPT-4T (86.0) and Claude 3.5 Sonnet (87.1).
Knowledge Q&A (ARC-Challenge, 25-shot): The pre-trained Llama 400B+ (96.1%) matched the performance of GPT-4 (96.3%) and Claude 3 Opus (96.4%).
Code (HumanEval, 0-shot): The instruct-tuned Llama model (89.0%) is nearly best in class, beating all models except Claude 3.5 Sonnet and GPT-4o by a comfortable margin.

Looking beyond the numbers

When comparing the 405B to other cutting-edge models, performance benchmarks are not the only factor to consider. Unlike its closed source peers, accessible only through an API wherein the underlying model might be changed without notice, Llama 3.1-405B is a stable platform that can be built upon, modified and even run on-premises. That level of control and predictability is a boon to researchers, enterprises and other entities that value consistency and reproducibility.

How to best use Llama-3.1-405B

IBM, like Meta, believes that the availability of viable open models facilitates better, safer products, accelerates innovation and contributes to an overall healthier AI market. The scale and capability of a sophisticated 405B-parameter open source model present unique opportunities and use cases for organizations of all sizes.

Aside from direct use of the model for inference and text generation (which, given its size and corresponding computational demands, might require quantization or other optimization methods to run locally on most hardware setups), the 405B can be leveraged for:

Synthetic data generation: When suitable data for pre-training, fine-tuning or instruction tuning is scarce or prohibitively expensive, synthetic data can bridge the gap. The 405B can generate high quality task- and domain-specific synthetic data for training another LLM. IBM’s Large-scale Alignment for chatBots (LAB) is a phased-training protocol for efficiently updating LLMs with synthetic data while preserving the model’s present knowledge.
Knowledge distillation: The knowledge and emergent abilities of the 405B model can be distilled into a smaller model, blending the capabilities of a large “teacher” model with the fast and cost-effective inference of a “student” model (like an 8B or 70B Llama 3.1). Knowledge distillation, particularly through instruction tuning on synthetic data generated by larger GPT models, was essential to the creation of influential Llama-based models like Alpaca and Vicuna.
LLM-as-a-judge: Evaluating LLMs can be tricky due to the subjectivity of human preferences and the imperfect ability of existing benchmarks to approximate them. As demonstrated in the Llama 2 research paper, for example, larger models can serve as an impartial judge of response quality in other models. (For more on the efficacy of the LLM-as-a-judge technique, this 2023 paper is a good place to start.)
A powerful, domain-specific fine-tune: Many leading closed models grant permission for fine-tuning only on a case-by-case basis, for only older or smaller model versions, or not at all. Conversely, Meta has made Llama 3.1-405B fully available for continual pre-training (to keep the model’s general knowledge up to date) or fine-tuning on a specific domain—coming soon to the watsonx Tuning Studio.
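
As a rough illustration of the synthetic data generation and knowledge distillation use cases above, the sketch below collects prompt/response pairs from a large “teacher” model and serializes them as JSON Lines for fine-tuning a smaller “student.” The `teacher_generate` function is a hypothetical placeholder, not a watsonx.ai or Meta API, and the `instruction`/`response` field names are assumptions rather than a required schema.

```python
import json
from typing import Callable, Iterable


def teacher_generate(prompt: str) -> str:
    """Hypothetical placeholder for a call to a large teacher model.

    Replace with a real inference call. It returns a canned string here
    so the sketch runs without any model or network access.
    """
    return f"[teacher answer to: {prompt}]"


def build_distillation_set(
    prompts: Iterable[str],
    generate: Callable[[str], str] = teacher_generate,
) -> list[dict]:
    """Pair each prompt with the teacher's response, skipping empty outputs."""
    records = []
    for prompt in prompts:
        response = generate(prompt).strip()
        if response:  # drop empty generations rather than train on them
            records.append({"instruction": prompt, "response": response})
    return records


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


seed_prompts = [
    "Summarize the main risks of long context windows.",
    "Explain knowledge distillation in two sentences.",
]
dataset = build_distillation_set(seed_prompts)
jsonl_text = to_jsonl(dataset)
```

In practice the generated pairs would usually be filtered for quality before any student model is trained on them. That step is omitted here.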

For a successful launch with the Llama 3.1 models, Meta AI “strongly recommends” use of a platform that, like IBM® watsonx, offers core features for model evaluation, safety guardrails and retrieval augmented generation (RAG).

Upgrades for every Llama 3.1 model size

The long-awaited 405B model may be the most noteworthy aspect of the Llama 3.1 release, but it is far from the only one. While Llama 3.1 models share the same dense transformer architecture as Llama 3, they bring several significant upgrades over their Llama 3 counterparts at all model sizes.

Longer context windows

For all pre-trained and instruction-tuned Llama 3.1 models, the context length has been expanded from 8,192 tokens in Llama 3 to 128,000 tokens in Llama 3.1, a roughly 16-fold increase. This makes Llama 3.1’s context length equal to that of the version of GPT-4o offered to enterprise users, significantly greater than that of GPT-4 (or the version of GPT-4o in ChatGPT Free) and comparable to the 200,000 token window offered by Claude 3. Because Llama 3.1 can be deployed on the user’s hardware or cloud provider of choice, its context length is not subject to temporary curtailing during periods of high demand. Likewise, Llama 3.1 is not generally subject to broad usage limits.

A model’s context length, alternatively called its context window, refers to the total amount of text (in tokens) that an LLM can consider or “remember” at any given time. When a conversation, document or code base exceeds a model’s maximum context length, it must be trimmed or summarized for the model to proceed. Llama 3.1’s expanded context window means Llama models can now carry out far longer conversations without forgetting details and ingest much larger documents or code samples during training and inference.
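
To make the trimming step concrete, here is a minimal sketch of one common approach: keep the most recent messages that fit within a token budget and drop the oldest. The function names are hypothetical, and the token count is a crude word-based estimate rather than a real tokenizer, so treat the numbers as illustrative only.

```python
def estimate_tokens(text: str, tokens_per_word: float = 1.5) -> int:
    """Crude token estimate from a whitespace word count.

    This is an assumption for illustration, not a real tokenizer.
    """
    return int(len(text.split()) * tokens_per_word + 0.5)


def trim_to_context(messages: list[str], max_tokens: int) -> list[str]:
    """Keep the most recent messages whose estimated total fits max_tokens."""
    kept = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(message)
        if used + cost > max_tokens:
            break  # this message and everything older is dropped
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = ["one two three four", "five six", "seven eight nine ten"]
recent = trim_to_context(history, max_tokens=10)
```

With a budget of 10 estimated tokens, the oldest message no longer fits and only the last two are kept. A larger context window simply raises the budget at which this kind of truncation starts.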

Though converting text to tokens doesn’t entail any fixed word-to-token “exchange rate,” a decent estimate would be roughly 1.5 tokens per word. Llama 3.1’s 128,000 token context window thus equates to around 85,000 words. The Tokenizer Playground on Hugging Face is an easy way to see and experiment with how different models tokenize text inputs.
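
The estimate above is simple division, sketched here for both context lengths mentioned in this article. The 1.5 tokens-per-word ratio is the rough figure quoted in the text. Real ratios vary by language, content and tokenizer, and `tokens_to_words` is a name made up for this example.

```python
TOKENS_PER_WORD = 1.5  # rough estimate quoted in the text; varies in practice


def tokens_to_words(tokens: int, tokens_per_word: float = TOKENS_PER_WORD) -> int:
    """Approximate how many words fit in a given token budget."""
    return int(tokens / tokens_per_word)


llama_3_words = tokens_to_words(8_192)
llama_3_1_words = tokens_to_words(128_000)
print(llama_3_words, llama_3_1_words)  # prints "5461 85333"
```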

Llama 3.1 models continue to enjoy the benefits of the new tokenizer rolled out for Llama 3, which encodes language much more efficiently than did Llama 2.

Preserving security and safety

In keeping with its responsible approach to innovation, Meta has been cautious and thorough in its approach to expanded context length. It’s worth noting that previous experimental open source efforts have yielded Llama derivatives with 128,000 token windows, or even 1M token windows. Though these projects are an excellent example of the benefits of Meta’s commitment to open models, they should be approached with caution: recent research indicates that very long context windows “present a rich new attack surface for LLMs” in the absence of stringent countermeasures.

Fortunately, the Llama 3.1 release also includes a new set of inference guardrails. Alongside updated versions of Llama Guard and CyberSec Eval, the release is supported by the introduction of Prompt Guard, which provides direct and indirect prompt injection filtering. Meta provides further risk mitigation with CodeShield, a robust inference time filtering tool engineered to prevent the introduction of insecure code generated by LLMs into production systems.

As with any implementation of generative AI, it’s always wise to only deploy models on a platform with robust security, privacy and safety measures.

Multilingual models

Both the pretrained and instruction-tuned Llama 3.1 models, in all sizes, are now multilingual. Beyond English, Llama 3.1 models are conversant in additional languages including Spanish, Portuguese, Italian, German and Thai. Meta has noted that “a few other languages” are still in post-training validation and could be released in the future.

Optimized for tool use

The Llama 3.1 Instruct models are fine tuned for “tool use,” meaning Meta has optimized their ability to interface with certain programs that complement or expand the LLM’s capabilities. This includes training for generating tool calls for specific search, image generation, code execution and mathematical reasoning tools as well as support for zero-shot tool use—that is, an ability to smoothly integrate with tools previously unseen in training.
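
On the application side, tool use generally means parsing a structured call emitted by the model and running the matching function. The sketch below shows that dispatch step in isolation. The JSON shape (`name` plus `parameters`) and the two toy tools are assumptions for illustration, not a statement of the exact format Llama 3.1 emits or of any Meta or IBM API.

```python
import json


def add(a: float, b: float) -> float:
    """Toy stand-in for a math tool."""
    return a + b


def word_count(text: str) -> int:
    """Toy stand-in for a text utility tool."""
    return len(text.split())


# Registry of tools the application is willing to execute.
TOOLS = {"add": add, "word_count": word_count}


def dispatch_tool_call(model_output: str):
    """Parse a JSON tool call and run the matching registered tool.

    Raises ValueError for malformed output or unknown tools, so the
    application never runs something the model merely named.
    """
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError as exc:
        raise ValueError("model output is not valid JSON") from exc
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")
    name = call.get("name")
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name!r}")
    return TOOLS[name](**call.get("parameters", {}))


# Example: a hand-written string standing in for model output.
result = dispatch_tool_call('{"name": "add", "parameters": {"a": 2, "b": 3}}')
```

Restricting execution to an explicit registry is a deliberate choice here: the model proposes a call, but the application decides which functions can actually run.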

Getting started with Llama 3.1

Meta’s latest release is an unprecedented opportunity to tune and tailor truly state-of-the-art generative AI models to your specific use case.

Support for Llama 3.1 is part of IBM’s commitment to furthering open source innovation in AI and providing our clients with access to best-in-class open models in watsonx, including both third party models and the IBM Granite model family.

IBM watsonx helps clients customize their implementation of open source models like Llama 3.1 to best fit their needs, from the flexibility to deploy models on-premises or in their preferred cloud environment to intuitive workflows for fine-tuning, prompt engineering and integration with enterprise applications. Readily build custom AI applications for your business, manage all data sources, and accelerate responsible AI workflows, all on one platform.

Llama 3.1-405B will be available in IBM watsonx.ai today, with the 8B and 70B models soon to follow.

Try out Llama 3.1-405B in watsonx.ai™

Get started on RAG tutorials with Llama 3.1-405B and watsonx.ai today:

[1] Cited benchmark evaluations for proprietary models are drawn from self-reported figures from Anthropic on 20 June, 2024 (for Claude 3.5 Sonnet and Claude 3 Opus) and 4 March, 2024 (for Claude 3 Sonnet and Haiku), OpenAI on 13 May, 2024 (for GPT models) and Google DeepMind in May 2024 (for Gemini models).


Vice President, Product Management, AI Platform (watsonx.ai & watsonx.gov)

Director, Product Management, Data & AI Strategic Partnerships