Retries with exponential backoff – the scaling special sauce
So, how do we control (or queue) messages when using multiple Azure OpenAI instances (accounts)?
And how do we handle returned error codes efficiently to optimize the AOAI experience?
As a best practice, Microsoft, like any other cloud vendor, recommends using “retry logic” or a “retry pattern” whenever you consume a cloud service. In cloud applications, this pattern helps deal with temporary (transient) failures by reconnecting to the service and repeating the failed request, automatically and without additional user intervention. Because applications and users are remote from the cloud-based service, this retry pattern is paramount: it improves the stability of the application and provides a better end user experience.
Using a cloud-based service, such as ChatGPT on Azure OpenAI, especially at scale via an application, is no exception.
While you can add some retry logic directly to your application, that approach quickly becomes limiting as you scale across the enterprise. Are you going to re-implement the retry logic in every application? What if an application was written by a third party? In that scenario, you usually can’t edit the code directly.
Instead, to achieve stability and high scalability, the APIM service described previously can provide the necessary retry pattern/logic. For example, if your application sends a prompt and the server is too busy or some other error occurs, APIM can resend the same prompt without any additional end user interaction. This all happens seamlessly.
APIM allows us to do this easily using the scaling special sauce: retries with exponential backoff, which allows for extremely high concurrent user loads.
Retries with exponential backoff is a technique that retries a failed operation with a wait time that grows exponentially between attempts (the exponential backoff), until a maximum number of retries is reached. It accepts that cloud resources may occasionally be unreachable for more than a few seconds (a transient error), or may return an error because too many tokens per second are being processed in a large-scale deployment.
This can be accomplished via APIM’s retry policy. Here’s an example:
<retry condition="@(context.Response.StatusCode == 429 || context.Response.StatusCode >= 500)"
       interval="1" delta="1" max-interval="30" count="3">
    <!-- buffer-request-body="true" allows the prompt body to be resent on each retry -->
    <forward-request buffer-request-body="true" />
</retry>
In this example, the condition catches an HTTP response status code of 429, the return code for “too many requests” (along with any 5xx server error). A 429 states that too many concurrent requests were sent to a particular model, measured at a per-second rate. This can occur as an enterprise organization scales to a large number of users.
Here are the detailed values and explanation of the APIM policy statement:
<retry
    condition="Boolean expression or literal"
    count="number of retry attempts"
    interval="retry interval in seconds"
    max-interval="maximum retry interval in seconds"
    delta="retry interval delta in seconds"
    first-fast-retry="boolean expression or literal">
    <!-- One or more child policies. No restrictions. -->
</retry>
The format and what each value means is fairly evident; however, for a deeper dive into the parameters, you can read the documentation linked at the end of this chapter.
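To see where this policy sits, here is a minimal sketch of a complete APIM policy document, with the retry wrapped around the backend call. This is one common placement, and the attribute values are the illustrative ones from the earlier example, not prescriptive settings:
<policies>
    <inbound>
        <base />
    </inbound>
    <backend>
        <!-- Retry the backend call on 429 (Too Many Requests) and 5xx errors -->
        <retry condition="@(context.Response.StatusCode == 429 || context.Response.StatusCode >= 500)"
               interval="1" delta="1" max-interval="30" count="3">
            <forward-request buffer-request-body="true" />
        </retry>
    </backend>
    <outbound>
        <base />
    </outbound>
</policies>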
The main and extremely important point to understand is that when APIM’s interval, max-interval, and delta parameters are specified, as they are in the preceding example, an exponential interval retry algorithm is automatically applied by APIM. This is what we call the scaling special sauce: the exponential interval retry logic needed to scale using any combination of multiple AOAI accounts to meet the most demanding business/user requirements.
For those interested in the mathematical logic behind this, here is the calculation that’s used by APIM for the exponential interval retry formula:
interval + 2^(count - 1) * random(delta * 0.8, delta * 1.2), up to the maximum interval (max-interval)
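Plugging in the values from the earlier example (interval of 1, delta of 1, count of 3, where count in the formula is the current retry attempt), the first retry waits roughly 1 + 2^0 × 1 ≈ 2 seconds, the second roughly 1 + 2^1 × 1 ≈ 3 seconds, and the third roughly 1 + 2^2 × 1 ≈ 5 seconds, with each wait randomized by ±20 percent and capped at the 30-second max-interval.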
Without the scaling special sauce (APIM performing retries with exponential backoff), once the initial rate limit is hit, say due to too many concurrent users sending too many prompts, a 429 (Too Many Requests) response code is sent back.
Furthermore, as subsequent prompts/completions are sent, the issue can compound quickly: more 429 errors are returned, and the error rate climbs further and further. It is with retries with exponential backoff that you can scale to many thousands of concurrent users with very low error rates, providing scalability to the AOAI service.
In addition to using retries with exponential backoff, APIM also supports content-based routing, where the message routing endpoint is determined by the content of the message at runtime. You can leverage this to send AOAI prompts to multiple AOAI accounts, including both PTUs and TPMs, to meet further scaling requirements. For example, if your model API request states a specific version, say gpt-35-turbo-16k, you can route this request to your GPT 3.5 Turbo (16K) PTU deployment, as sketched below. This is true whether you’re in the same region or a multi-region deployment.
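As a rough sketch of what such a routing rule could look like (the backend URLs and the path check here are hypothetical examples, not values from this chapter):
<choose>
    <!-- Route requests for the 16K model deployment to the PTU-backed account -->
    <when condition="@(context.Request.Url.Path.Contains(&quot;gpt-35-turbo-16k&quot;))">
        <set-backend-service base-url="https://my-ptu-account.openai.azure.com" />
    </when>
    <!-- Everything else goes to the pay-as-you-go (TPM) account -->
    <otherwise>
        <set-backend-service base-url="https://my-tpm-account.openai.azure.com" />
    </otherwise>
</choose>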
We could write an entire book on all the wonderful scaling features APIM provides, but for additional details on APIM, please check out the APIM link at the end of this chapter. Alternatively, you can refer to the great book Enterprise API Management, by Luis Weir, published by Packt Publishing.