Understanding TPM, RPM, and PTUs – Deploying ChatGPT in the Cloud: Architecture Design and Scaling Strategies

Understanding TPM, RPM, and PTUs

As we scale, we will need to understand some additional terminology, such as tokens per minute (TPM), requests per minute (RPM), and provisioned throughput units (PTUs), as well as additional services, such as Azure API Management (APIM), that support a cloud environment in Azure.

TPMs

With a cloud provider such as Microsoft Azure, Azure OpenAI’s quota management service built into Azure AI Studio enables you to assign quota limits for your deployments, up to the specified limit – that is, your “quota.” You can assign a quota to an Azure subscription on a per-region, per-model basis in units of TPM. The billing component of TPM is also known as pay-as-you-go, where pricing is based on the pay-as-you-go consumption model, with a price per unit specific to each type of model deployed. Please refer to Figure 7.2 for a list of some models and their token limits.

When you create an Azure OpenAI service within a subscription, you will receive the default TPM quota size. You can then adjust the TPM assigned to that deployment or to any additional deployments you create, at which point the overall available quota for that model is reduced by that amount. TPM/pay-as-you-go is also the default billing mechanism within the Azure OpenAI (AOAI) service. We will cover some of the costs a bit later, but for more details on AOAI quota management, take a look at the link provided at the end of this chapter.
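Since pay-as-you-go billing is priced per unit of tokens consumed, a quick back-of-the-envelope estimate can help when planning a deployment. The sketch below is illustrative only: the per-1K-token price is an assumed placeholder, not a published Azure rate, and it assumes (unrealistically, as a worst case) that the full TPM allocation is consumed every minute.

```python
# Rough pay-as-you-go cost estimate for an Azure OpenAI deployment.
# ASSUMPTION: the price below is a placeholder, not a real published rate;
# check the current Azure OpenAI pricing page for your model and region.

ASSUMED_PRICE_PER_1K_TOKENS = 0.002  # USD, illustrative only

def estimate_monthly_cost(tokens_per_minute,
                          minutes=60 * 24 * 30,
                          price_per_1k=ASSUMED_PRICE_PER_1K_TOKENS):
    """Worst-case cost if the full TPM allocation is consumed every minute."""
    total_tokens = tokens_per_minute * minutes
    return total_tokens / 1000 * price_per_1k

# Worst case for a deployment using the full 240K TPM quota for a month:
print(estimate_monthly_cost(240_000))  # 20736.0
```

In practice, actual spend is usually far lower, since pay-as-you-go bills only for tokens actually consumed, not for the quota you have assigned.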

If you are using OpenAI directly, scaling works very similarly – in OpenAI models, you can scale by adjusting the TPM slider to the maximum under the advanced options.

Now, let’s look at an example and deep dive into TPMs.

In the Microsoft Azure cloud, for example, there is an overall limit (quota) of 240,000 TPM for GPT-35-Turbo in the Azure East US region. This means you can have a single deployment of 240K TPM per Azure OpenAI account, two deployments of 120K TPM each, or any other combination of deployments, so long as their TPMs add up to 240K (or less) in total in the East US region.
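The quota arithmetic above can be sketched in a few lines of Python. The bookkeeping structure and deployment names here are hypothetical – the real assignment happens in Azure AI Studio – but the constraint being modeled is the one from the example: deployments of a model in a region may not exceed that region’s quota.

```python
# Hypothetical model of per-region, per-model TPM quota being consumed
# as deployments are created. Names and structures are illustrative only.

REGION_QUOTA = {("eastus", "gpt-35-turbo"): 240_000}  # TPM per (region, model)

deployments = []  # entries of (name, region, model, tpm)

def remaining_quota(region, model):
    used = sum(tpm for _, r, m, tpm in deployments if (r, m) == (region, model))
    return REGION_QUOTA[(region, model)] - used

def create_deployment(name, region, model, tpm):
    if tpm > remaining_quota(region, model):
        raise ValueError("requested TPM exceeds remaining regional quota")
    deployments.append((name, region, model, tpm))

create_deployment("chat-a", "eastus", "gpt-35-turbo", 120_000)
create_deployment("chat-b", "eastus", "gpt-35-turbo", 100_000)
print(remaining_quota("eastus", "gpt-35-turbo"))  # 20000
```

A third deployment requesting more than the remaining 20K TPM would be rejected, just as the portal would refuse an over-allocation.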

So, one way to scale up is by adding additional (Azure) OpenAI accounts. With additional AOAI accounts, you can stack, or add, limits together. So, in this example, rather than having a single 240K GPT-35-Turbo limit, we can add an additional 240K times X, where X is 30 or less.

The maximum number of Azure OpenAI accounts (or resources) per region per Azure subscription is 30 (at the time of writing) and is also dependent on regional capacity availability. We expect this number to be increased over time as additional GPU-based capacity continues to be made available.
