Rate Limits - GroqDocs (original) (raw)

Rate limits act as control measures to regulate how frequently users and applications can access our API within specified timeframes. These limits help ensure service stability, fair access, and protection against misuse so that we can serve reliable and fast inference for all.

Understanding Rate Limits

Rate limits are measured in:

Rate limits apply at the organization level, not individual users. You can hit any limit type depending on which threshold you reach first.

Example: Let's say your RPM = 50 and your TPM = 200K. If you were to send 50 requests with only 100 tokens within a minute, you would reach your limit even though you did not send 200K tokens within those 50 requests.

The following is a high level summary and there may be exceptions to these limits. You can view the current, exact rate limits for your organization on the limits page in your account settings.

Need higher rate limits? Upgrade to Developer plan to access higher limits, Batch and Flex processing, and more. Note that the limits shown below are the base limits for the Developer plan, and higher limits are available for select workloads and enterprise use cases.

allam-2-7b 30 7K 6K 500K - -
canopylabs/orpheus-arabic-saudi 10 100 1.2K 3.6K - -
canopylabs/orpheus-v1-english 10 100 1.2K 3.6K - -
groq/compound 30 250 70K - - -
groq/compound-mini 30 250 70K - - -
llama-3.1-8b-instant 30 14.4K 6K 500K - -
llama-3.3-70b-versatile 30 1K 12K 100K - -
meta-llama/llama-4-scout-17b-16e-instruct 30 1K 30K 500K - -
meta-llama/llama-prompt-guard-2-22m 30 14.4K 15K 500K - -
meta-llama/llama-prompt-guard-2-86m 30 14.4K 15K 500K - -
moonshotai/kimi-k2-instruct 60 1K 10K 300K - -
moonshotai/kimi-k2-instruct-0905 60 1K 10K 300K - -
openai/gpt-oss-120b 30 1K 8K 200K - -
openai/gpt-oss-20b 30 1K 8K 200K - -
openai/gpt-oss-safeguard-20b 30 1K 8K 200K - -
qwen/qwen3-32b 60 1K 6K 500K - -
whisper-large-v3 20 2K - - 7.2K 28.8K
whisper-large-v3-turbo 20 2K - - 7.2K 28.8K

In addition to viewing your limits on your account's limits page, you can also view rate limit information such as remaining requests and tokens in HTTP response headers as follows:

The following headers are set (values are illustrative):

Header Value Notes
retry-after 2 In seconds
x-ratelimit-limit-requests 14400 Always refers to Requests Per Day (RPD)
x-ratelimit-limit-tokens 18000 Always refers to Tokens Per Minute (TPM)
x-ratelimit-remaining-requests 14370 Always refers to Requests Per Day (RPD)
x-ratelimit-remaining-tokens 17997 Always refers to Tokens Per Minute (TPM)
x-ratelimit-reset-requests 2m59.56s Always refers to Requests Per Day (RPD)
x-ratelimit-reset-tokens 7.66s Always refers to Tokens Per Minute (TPM)

Handling Rate Limits

When you exceed rate limits, our API returns a 429 Too Many Requests HTTP status code.

Note: retry-after is only set if you hit the rate limit and status code 429 is returned. The other headers are always included.

Was this page helpful?