Scale Azure OpenAI for Java chat using RAG with Azure Container Apps
Learn how to add load balancing to your application to extend the chat app beyond the Azure OpenAI Service token and model quota limits. This approach uses Azure Container Apps to create three Azure OpenAI endpoints and a primary container to direct incoming traffic to one of the three endpoints.
This article requires you to deploy two separate samples:
Chat app
If you haven't deployed the chat app yet, wait until after the load balancer sample is deployed.
If you already deployed the chat app once, change the environment variable to support a custom endpoint for the load balancer, and then redeploy it.
The chat app is available in these languages:
Load balancer app
Note
This article uses one or more AI app templates as the basis for the examples and guidance in the article. AI app templates provide you with well-maintained reference implementations that are easy to deploy. They help to ensure a high-quality starting point for your AI apps.
Architecture for load balancing Azure OpenAI with Azure Container Apps
Because the Azure OpenAI resource has specific token and model quota limits, a chat app that uses a single Azure OpenAI resource is prone to conversation failures when it reaches those limits.
To use the chat app without hitting those limits, use a load-balanced solution with Container Apps. This solution seamlessly exposes a single endpoint from Container Apps to your chat app server.
The container app sits in front of a set of Azure OpenAI resources. The container app handles two scenarios: normal and throttled. During a normal scenario where token and model quota is available, the Azure OpenAI resource returns a 200 status back through the container app and app server.
When a resource is in a throttled scenario because of quota limits, the container app can retry a different Azure OpenAI resource immediately to fulfill the original chat app request.
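The normal and throttled scenarios can be sketched as a small failover loop. This is a minimal illustration, not the load balancer sample's actual proxy code: the `Backend` interface, the `FailoverBalancer` class, and the use of HTTP status 429 as the throttling signal are assumptions made for this example.

```java
import java.util.List;

/** Hypothetical stand-in for one Azure OpenAI resource; returns an HTTP status code. */
interface Backend {
    int send(String request);
}

/** Illustrative failover: try each backend in order until one is not throttled. */
class FailoverBalancer {
    // Assumption: a throttled resource answers with HTTP 429.
    private static final int THROTTLED = 429;

    private final List<Backend> backends;

    FailoverBalancer(List<Backend> backends) {
        this.backends = backends;
    }

    /** Returns the first non-throttled status, or 429 if every backend is throttled. */
    int forward(String request) {
        for (Backend backend : backends) {
            int status = backend.send(request);
            if (status != THROTTLED) {
                return status; // normal scenario: pass the response through
            }
            // throttled scenario: fall through and retry on the next backend
        }
        return THROTTLED;
    }
}
```

With three backends where only the first is throttled, `forward` returns the second backend's status, so the chat app never sees the throttled response.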
Prerequisites
An Azure subscription. Create one for free
Access granted to Azure OpenAI in the desired Azure subscription. Currently, access to this service is granted only by application. You can apply for access to Azure OpenAI by completing the form at https://aka.ms/oai/access.
Dev containers are available for both samples, with all dependencies required to complete this article. You can run the dev containers in GitHub Codespaces in a browser, or locally using Visual Studio Code.
- A GitHub account.
Open the Container Apps load balancer sample app
GitHub Codespaces runs a development container managed by GitHub with Visual Studio Code for the Web as the user interface. For the most straightforward development environment, use GitHub Codespaces so that you have the correct developer tools and dependencies preinstalled to complete this article.
Important
All GitHub accounts can use GitHub Codespaces for up to 60 hours free each month with two core instances. For more information, see GitHub Codespaces monthly included storage and core hours.
Deploy the Azure Container Apps load balancer
Sign in to the Azure Developer CLI to provide authentication to the provisioning and deployment steps:
azd auth login --use-device-code
Set the following configuration value so that the post-provision step uses Azure CLI authentication:
azd config set auth.useAzCliAuth "true"
Deploy the load balancer app:
azd up
Select a subscription and region for the deployment. They don't have to be the same subscription and region as the chat app.
Wait for the deployment to finish before you continue.
Get the deployment endpoint
Use the following command to display the deployed endpoint for the container app:
azd env get-values
Copy the CONTAINER_APP_URL value. You use it in the next section.
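If you prefer to extract the value programmatically, the following sketch parses the command's output. It assumes the output consists of `KEY="value"` lines; the `EnvValues` class is a hypothetical helper and isn't part of either sample.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative parser for KEY="value" lines (assumed output format). */
class EnvValues {
    static Map<String, String> parse(String output) {
        Map<String, String> values = new LinkedHashMap<>();
        for (String line : output.split("\\R")) {
            int eq = line.indexOf('=');
            if (eq <= 0) {
                continue; // skip blank or malformed lines
            }
            String key = line.substring(0, eq).trim();
            String value = line.substring(eq + 1).trim();
            // Strip one pair of surrounding double quotes, if present.
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1);
            }
            values.put(key, value);
        }
        return values;
    }
}
```

Calling `EnvValues.parse(output).get("CONTAINER_APP_URL")` on the captured output returns the endpoint without the surrounding quotes.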
Redeploy the chat app with the load balancer endpoint
These examples are completed on the chat app sample.
Open the chat app sample's dev container by using one of the following choices.
The dev container is available for these languages, in GitHub Codespaces or Visual Studio Code: .NET, JavaScript, Python.
Sign in to the Azure Developer CLI (AZD):
azd auth login
Finish the sign-in instructions.
Create an AZD environment with a name such as chat-app:
azd env new <name>
Add the following environment variable, which tells the chat app's backend to use a custom URL for the Azure OpenAI requests:
azd env set OPENAI_HOST azure_custom
Add the following environment variable. Replace <CONTAINER_APP_URL> with the URL from the previous section. This action tells the chat app's backend the value of the custom URL for the Azure OpenAI request:
azd env set AZURE_OPENAI_CUSTOM_URL <CONTAINER_APP_URL>
Deploy the chat app:
azd up
You can now use the chat app with the confidence that it's built to scale across many users without running out of quota.
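To illustrate how a backend could interpret the two environment variables set above, here is a hedged sketch. It is not the chat app's actual code: the `OpenAiEndpointResolver` class and the `defaultEndpoint` parameter are hypothetical, and only the variable names `OPENAI_HOST` and `AZURE_OPENAI_CUSTOM_URL` and the value `azure_custom` come from this article.

```java
import java.util.Map;

/** Hypothetical endpoint selection based on OPENAI_HOST and AZURE_OPENAI_CUSTOM_URL. */
class OpenAiEndpointResolver {
    /**
     * @param env             environment variables, for example System.getenv()
     * @param defaultEndpoint endpoint used when no custom host is configured (assumed)
     */
    static String resolve(Map<String, String> env, String defaultEndpoint) {
        if ("azure_custom".equals(env.get("OPENAI_HOST"))) {
            String custom = env.get("AZURE_OPENAI_CUSTOM_URL");
            if (custom == null || custom.trim().isEmpty()) {
                throw new IllegalStateException(
                        "OPENAI_HOST is azure_custom but AZURE_OPENAI_CUSTOM_URL is not set");
            }
            return custom; // route requests through the load balancer
        }
        return defaultEndpoint; // no custom host: keep the default resource
    }
}
```

Failing fast when the custom URL is missing surfaces a misconfigured environment at startup instead of at the first chat request.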
Stream logs to see the load balancer results
In the Azure portal, search for your resource group.
From the list of resources in the group, select the Azure Container Apps resource.
Select Monitoring > Log stream to view the log.
Use the chat app to generate traffic in the log.
Look for the log entries that reference the Azure OpenAI resources. Each of the three resources has its numeric identity in the log comment that begins with Proxying to https://openai3, where 3 indicates the third Azure OpenAI resource.
When the load balancer receives a status indicating that the request exceeds quota, the load balancer automatically rotates to another resource.
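To see how traffic is spread across the resources, you can count the proxy entries in a saved copy of the log. The following sketch assumes only the `Proxying to https://openai<N>` prefix described above; the rest of each log line is unknown, and the `ProxyLogCounter` class is a hypothetical helper.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative tally of log entries per Azure OpenAI resource number. */
class ProxyLogCounter {
    // Captures N from "Proxying to https://openai<N>".
    private static final Pattern PROXY =
            Pattern.compile("Proxying to https://openai(\\d+)");

    static Map<Integer, Integer> countByResource(String logText) {
        Map<Integer, Integer> counts = new TreeMap<>();
        Matcher matcher = PROXY.matcher(logText);
        while (matcher.find()) {
            int resource = Integer.parseInt(matcher.group(1));
            counts.merge(resource, 1, Integer::sum); // increment the tally
        }
        return counts;
    }
}
```

A resource number that appears only briefly before the entries move to another number suggests that the load balancer rotated away from it.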
Configure the TPM quota
By default, each of the Azure OpenAI instances in the load balancer is deployed with a capacity of 30,000 tokens per minute (TPM). Change this value when:
- You get deployment capacity errors: Lower the value.
- You need higher capacity: Raise the value.
Use the following command to change the value:
azd env set OPENAI_CAPACITY 50
Redeploy the load balancer:
azd up
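The effect of the capacity setting on total throughput can be estimated with simple arithmetic. This sketch assumes that one `OPENAI_CAPACITY` unit corresponds to 1,000 TPM, so that the default of 30,000 TPM maps to a value of 30 and the command above sets 50,000 TPM per instance. The `QuotaEstimate` class is hypothetical, and the unit is an assumption to verify against the sample's infrastructure files.

```java
/** Illustrative estimate of aggregate TPM across load-balanced instances. */
class QuotaEstimate {
    // Assumption: one OPENAI_CAPACITY unit equals 1,000 tokens per minute.
    static final int TPM_PER_CAPACITY_UNIT = 1_000;

    /** Total TPM when every instance is deployed with the same capacity. */
    static int aggregateTpm(int capacityPerInstance, int instanceCount) {
        return capacityPerInstance * TPM_PER_CAPACITY_UNIT * instanceCount;
    }
}
```

Under that assumption, three instances at the default give `aggregateTpm(30, 3)`, or 90,000 TPM, and raising the value to 50 gives `aggregateTpm(50, 3)`, or 150,000 TPM.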
Clean up resources
When you're finished with the chat app and the load balancer, clean up the resources. The Azure resources created in this article are billed to your Azure subscription. If you don't expect to need these resources in the future, delete them to avoid incurring more charges.
Clean up chat app resources
Return to the chat app article to clean up the resources:
Clean up load balancer resources
Run the following Azure Developer CLI command to delete the Azure resources:
azd down --purge --force
The switches provide:
- purge: Deleted resources are immediately purged so that you can reuse the Azure OpenAI Service tokens per minute.
- force: The deletion happens silently, without requiring user consent.
Clean up GitHub Codespaces and Visual Studio Code
Deleting the GitHub Codespaces environment helps you make the most of the free per-core hours that your account is entitled to.
Important
For more information about your GitHub account's entitlements, see GitHub Codespaces monthly included storage and core hours.
Sign in to the GitHub Codespaces dashboard.
Locate your currently running codespaces that are sourced from the azure-samples/openai-aca-lb GitHub repository.
Open the context menu for the codespace, and then select Delete.
Get help
If you have trouble deploying the Azure Container Apps load balancer, add your issue to the repository's Issues webpage.
Sample code
The following samples are used in this article:
Next step
- Use Azure Load Testing to load test your chat app.