hi Sam
The first thing that comes to mind is which of the above API calls SK actually decided to invoke. It's worth turning on SK's tracing and checking the API calls in the log to be 100% sure.
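A minimal sketch of turning that on with the SK 1.x C# SDK (model id and key are placeholders; `AddConsole` needs the Microsoft.Extensions.Logging.Console package):

```csharp
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Logging;
using Microsoft.SemanticKernel;

var builder = Kernel.CreateBuilder();

// Placeholder model id and key -- substitute your own.
builder.AddOpenAIChatCompletion(modelId: "gpt-4o", apiKey: "<your-key>");

// Route SK's internal logs, including function invocations,
// to the console at Trace level so each tool call shows up.
builder.Services.AddLogging(logging =>
    logging.AddConsole().SetMinimumLevel(LogLevel.Trace));

var kernel = builder.Build();
```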
1) If SK uses GetRecordAsync(int id), that potentially means SK calls the OpenAI endpoint ~1000 times, once per record, so you are likely limited by the number of HTTPS requests rather than total tokens consumed. Be aware that OpenAI rate-limits both request count and total tokens. If this happens, you might force SK to use the bulk call instead; see the sketch below.
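One blunt way to force the bulk path is to only advertise the bulk function to the model. A hypothetical plugin sketch built from your two signatures (the stub bodies are mine):

```csharp
using System.Collections.Generic;
using System.ComponentModel;
using System.Threading.Tasks;
using Microsoft.SemanticKernel;

public class RecordPlugin
{
    // Advertised to the model: one call returns everything, so the
    // model cannot fan out into ~1000 per-record round trips.
    [KernelFunction, Description("Returns the text of all records in one call.")]
    public Task<IReadOnlyList<string>> GetRecordAsync() =>
        Task.FromResult<IReadOnlyList<string>>(new[] { "record 1", "record 2" }); // stub data

    // Deliberately NOT marked [KernelFunction]: still callable from your
    // own code, but invisible to the model, so it can never be chosen.
    public Task<string> GetRecordAsync(int id) =>
        Task.FromResult($"record {id}"); // stub data
}

// Registration: kernel.Plugins.AddFromType<RecordPlugin>("Records");
```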
2) If SK uses GetRecordAsync(), SK sends all 1000 records to the OpenAI endpoint in one go with all the text (assuming it fits the context length; worst case you might just split it into 2-3 calls, as in the sketch below). In this case, request count is likely not an issue, but total tokens consumed could be rate-limited.
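If the records don't fit in one context, a rough chunking sketch (plain C#, no SK dependency; the 4-chars-per-token heuristic and the budget are assumptions you'd tune):

```csharp
using System.Collections.Generic;

public static class RecordBatcher
{
    // Crude heuristic: ~4 characters per English token. Swap in a real
    // tokenizer for accurate counts; this only sizes the batches.
    private static int EstimateTokens(string s) => s.Length / 4 + 1;

    public static IEnumerable<List<string>> ChunkByTokenBudget(
        IEnumerable<string> records, int maxTokensPerCall)
    {
        var batch = new List<string>();
        var used = 0;
        foreach (var record in records)
        {
            var cost = EstimateTokens(record);
            if (used + cost > maxTokensPerCall && batch.Count > 0)
            {
                yield return batch;          // emit a full batch
                batch = new List<string>();
                used = 0;
            }
            batch.Add(record);
            used += cost;
        }
        if (batch.Count > 0) yield return batch; // emit the remainder
    }
}
```

Each chunk then becomes its own summarization request, and a final pass merges the partial summaries (map-reduce style).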
3) Continuing from 2): OpenAI needs to consume all the text records and summarize them to answer your question, so you can't really reduce the total tokens for a single user question. What you might be able to do is take advantage of OpenAI's prompt caching when the same prompt prefix (the record dump) is sent repeatedly; see the sketch below.
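If I remember the OpenAI docs right, caching kicks in automatically once the prompt exceeds a minimum length (around 1024 tokens) and only matches on an exact prefix, so keep the stable content first and the varying question last. A rough sketch with SK's ChatHistory (instruction text and record dump are placeholders):

```csharp
using Microsoft.SemanticKernel.ChatCompletion;

string allRecordsText = "...all 1000 record texts, concatenated..."; // placeholder for your record dump
string userQuestion = "How many records changed last week?";          // varies per request

var history = new ChatHistory();

// Stable prefix: byte-identical across requests over the same data
// snapshot, so it is eligible for OpenAI's automatic prompt caching.
history.AddSystemMessage("You summarize the records provided below."); // placeholder instructions
history.AddUserMessage(allRecordsText);

// Varying suffix: only the question changes between requests.
history.AddUserMessage(userQuestion);
```

Cached input tokens are discounted on cost; whether they also ease your rate limits is worth checking in the current docs. Either way the cache only helps if the record dump is byte-identical between requests.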
4) I don't think fine-tuning is going to help in your example. Your scenario is really just RAG: pulling a live data set for OpenAI to produce a summary. Fine-tuning is more about changing the default behavior of the LLM so it answers questions with the right approach in a very specific domain; it doesn't bake your live data into the model.