Getting 429 Too many request for NIM cloud api (original) (raw)
Hi Team,
We are currently experiencing 429 (Too Many Requests) response codes when using the NVIDIA NIM cloud API. Previously, the same integration was working smoothly in our project. However, we are now encountering this issue even with as few as 10 requests in a loop.
Could you please confirm if there have been any recent changes in rate limits or if there is an ongoing issue on your end? This will help us make the necessary adjustments in our implementation.
API Endpoint: https://integrate.api.nvidia.com/v1/chat/completions
Looking forward to your response.
Below attached a curl request to 10 request :
Hi,
We are currently seeing a range of issues with NGC services, including API Endpoints.
You can track the issue here https://status.ngc.nvidia.com/.
If your error persists once the issue has been resolved please reach back out and we will look into this for you.
Best,
Sophie
Hi Sophie,
The status shows as issues with NGC services are resolved but I still see issue when hitting multiple request to Nvidia API endpoints, I get 429 response code (Too many requests).
Please update if Nvidia NGC services are completely up and operational.
Response :
{“id”:“chat-2a7b029f403741aba5f99022486bcdc5”,“object”:“chat.completion”,“created”:1749628365,“model”:“meta/llama-3.1-8b-instruct”,“choices”:[{“index”:0,“message”:{“role”:“assistant”,“content”:“Hello! How can I assist you today?”},“logprobs”:null,“finish_reason”:“stop”,“stop_reason”:null}],“usage”:{“prompt_tokens”:11,“total_tokens”:20,“completion_tokens”:9},“prompt_logprobs”:null}{“status”:429,“title”:“Too Many Requests”}
{“status”:429,“title”:“Too Many Requests”}{“status”:429,“title”:“Too Many Requests”}{“status”:429,“title”:“Too Many Requests”}
{“status”:429,“title”:“Too Many Requests”}{“status”:429,“title”:“Too Many Requests”}{“status”:429,“title”:“Too Many Requests”}{“status”:429,“title”:“Too Many Requests”}{“status”:429,“title”:“Too Many Requests”}
Hi @sumit.mehta,
I’m trying to get clarity on whether there have been changes to the trial API for the model you are using.
In the mean time, you could try adding a sleep command in your loop iterations.
(I see the same ‘429 Too many requests’ error when I run your code.)
Please note that the API Endpoints are only to be used for experimentation, development, testing and research. NVIDIA NIM FAQ
Best,
Sophie
ophis June 12, 2026, 6:03am 5
Hi Sophie,
Thank you for looking into this!
I would like to add some details from my own testing, as it seems the issue might be a bit more complex than standard rate limiting (RPM).
When evaluating the GLM 5.1 model, after just a few initial requests, I receive the “429 Too Many Requests” error. However, unlike standard rate limits, it doesn’t reset after a minute or two. I waited for over 2 hours, and the endpoint was still returning 429 errors. Once triggered, it behaves more like a permanent lock rather than a temporary rate limit.
Regarding the sleep workaround: unfortunately, I am evaluating the API through an open-source integration tool. Because of this setup, I don’t have direct access to the request loop to easily inject delays.
Just to assure you, my usage is strictly for non production, R&D, and evaluation purposes to see how the model integrates with external tools.
Could you please ask the engineering team to check if there might be a bug causing the endpoint to stay locked indefinitely instead of resetting the limit?
Any insights would be greatly appreciated.
Thank you!
hi sophie, you have to know the followed things:
- when 429 is reached, you have to wait before re do any request because each request in 429 status reste the timer and so you extend the waiting time . to manage that issue you have to apply expodential waiting time on you agent like first time wait for 6 second if not okey wait for 18 sec etc etc then you will no more wait for 2 hour ^^
ophis June 12, 2026, 6:26am 7
Hi bozoweed,
I think there is a slight misunderstanding. I believe you meant to address me, not Sophie, regarding the client-side logic?
What you are describing is called “Exponential Backoff”. Please note that tool, which I am using, already has this exact mechanism builtin. It does not spam the API when a 429 status is returned. The retry intervals increase progressively (a few seconds, then several dozens of seconds, then minutes, and finally tens of minutes!).
Therefore, the API is absolutely not being hammered with requests during the lockout, so the timer is not being “reset” by continuous pinging on my end. The endpoint simply remains unresponsive and locked for hours despite the exponential backoff strategy being correctly applied by the tool.
This is exactly why I am reporting this as a potential server side issue.
Regards.
my bad, look like you allready using correctly the api ^^’ i hope that nvidia will found a better way to help you so
I’ve also run into this same issue as well. I was previously experimenting with using the GLM-5.1 endpoint in an agentic coding setup (Claude Code), and all was working well until probably around a few days ago. When I came back to my work today, I was constantly hitting 429 errors, despite no change from my end.
From my (admittedly brief) observation, I’ve noticed:
- I start receiving 429 errors after making ~20 requests to the messages endpoint within a minute. This is below the 40 requests per minute I expected from the free tier.
- Exponential back-off does not seem to really work? Once I received this error, every subsequent API call returns a 429, even after waiting 30 minutes until the next request. It seems to around an hour for the rate limit to reset.
I would appreciate some clarification on whether this is intended behavior (the rate limit for this model has permanently been changed) or whether this is some temporary or dynamic measure due to server load.
yes can confirm,
my code already have high backoff per successful request, and exponential backoff for any error
so I don’t think its RPM or ant real rate limti
but is still 429
i’v even pause my code for a hour but still 429
can you clarify if its new limit or back end rate limiting bug, miss?
bozoweed June 12, 2026, 3:19pm 13
it’s not a bug just saye thanks to all free tiers abus , now glm5.1 have low RPH rate instead of 40 RPM is more like 10-20 RPH maybe TPH instead of request is maybe also token limited
Dear Mr Bozo Weed,
Please do not reply if you have nothing to add other than baseless accusation, or baseless speculation, that fuels panic and anxiety of the community.
I have never said I’m using GLM, did I? Please see an optometrist, the original poster claimed to hit 429 with low request volume with LLaMA 3.1 8b, which is 93x smaller than GLM 5.1.
The poster who said they are using GLM-5.1 is using a sophisticated backoff mechanism, which you are accusing/blaming the poster for not using such mechanism, which you have apologised, but then suddenly, you accused me of using GLM and abusing free tier, which I did not.
The Nvidia NIM API offers 40 RPM limit, and using it is not an abuse, it’s using it as intended. In fact, myself, @ophis, and @aptenodyte have clarified that we all are using it well it below its limit, using exponential backoff, and waiting hours, yet we still got 429.
Please, users are seeking clarification whether there’s a structural change to the limit or a bug in the backend that fails to reset the limit for users, so we’d appreciate it that if you have nothing to add, please reconsider replying.
Best,
John Deere
It’s started those 429 annoying errors 12h ago with GLM5.1 here too.
I agree that what’s going on here definitely does not seem like standard rate limiting. I did some further testing, and it seems like what sets off the rate limiting is either bursts of many requests within a small time frame (but still within the ~40 RPM window) or simultaneous requests (e.g. making multiple tool calls at once).
it’s not a bug just saye thanks to all free tiers abus , now glm5.1 have low RPH rate instead of 40 RPM is more like 10-20 RPH maybe TPH instead of request is maybe also token limited
@bozoweed I don’t think that the rate limiting has become per hour instead of per minute. I set up a cron job to make a simple request to the model every minute, and left it running for around 15 minutes without an issue. I think it is more likely that the penalty for violating the rate limit is around an hour instead of the rate limit itself being per hour.
Token usage is an interesting theory, although I’m not sure if that’s it either. I started a new session and ran into problems after consuming around ~20-25k tokens. However, I haven’t run a separate test for how token usage affects rate limiting independently of the simultaneity test I ran either, so it may be worth looking into that in more detail.
As for free tier abuse, it’s a probable theory, but I couldn’t find anything from NVIDIA directly supporting that claim. If that is the case, however, I would hope for at least some public announcement or alert about a changed rate limit instead of just leaving developers blindly guessing in the dark. Especially when it seems to add additional restrictions on top of the already established 40 RPM limit.
bozoweed June 12, 2026, 8:35pm 18
interssting test, but appear my key is in 429 from couple of hour now, i get stuck from 7pm and still in 429 for now , i think they have also limit the number of token available per day. it’s the only things that can explain why my key i locked for 3hours now ( i have stop all my job and try one time per hour to see if lock is over but still 429 )
Huh, that’s odd. I got hit with a 429 last night, but was able to start again earlier today. I hit an additional two 429 errors earlier today as well, but was able to resume after I had taken a break for around an hour or so in each case.
bozoweed June 12, 2026, 8:45pm 20
appear they have few rules on the 429, i have also hit 429 few times befor get the full lock down, at begun i get 429 and api resum after few minutes, then i finaly get the stong 429 issue let me out for 3 hours at now ^^’ maybe reset tomorow
PS: 429 status code (no body) just tested again at now and still 429 ^^’
I agree and have the same issue when I start my software/agentic setup with claude code connected to nim platform with glm 5.1 endpoint or even others models like kimi, deepseek, stepfun. I implemented all the techniques to resolve all the errors from the nim platform of course I have the exponentiel back off and even more than that. this week , i dont use my account because I m hitting the 429 error every day even if I dont use it for 3 days , the error 429 comes in 4 minutes all the time this week. On the past it was simple to use , we dont have that many errors a month ago. I respect 40 rpm because I use 10 rpm so I should not hit the error 429 or too many requests but I do hit, I also respect the tokens per minute(TPM), with TPM throttling. I trim also the tool payload. I have also max_retries and max_bacckoff.Seems the account stays locked to error 429 in all models ,if I hit in glm 5.1 , I cannot use other models. My usage is not for production but for personal research in ai. I dont use openclaw, I use the last claude code and I follow strictly the guidelines of the nim platform. I thanks nvidia to give us such nice platform for research purposes in ai.
please, could they ask the engineering team to check if there might be a bug or something who cause a bug with the error 429?
Thank you for your help
hubaibm9 June 13, 2026, 9:24am 24
I am facing the exact same issue since a day or two no matter which model I use. It was working fine and now it keeps on giving 429 (Too Many Requests) response and it doesn’t seem to reset after minutes or even hours.
