As AI applications grow, the cost of running them rises and latency tends to worsen. In many cases this becomes a real problem: at some point the cost is no longer affordable, infrastructure has to be scaled back to fit the budget, and the application effectively gets worse over time.
In this tutorial, we will explore how to cut LLM costs with Upstash Vector and semantic caching, using Elysia.js to show how a semantic cache works in practice and how much it can save.
Let's dive right in!
Before we commence, you will need to have a few things in place:
Before we dive right in, we need to understand some key terms that will be used in this tutorial.
bun create elysia app
bun install @upstash/semantic-cache @upstash/vector
Now that you have an Upstash Vector database, you will need its URL and token credentials to connect your semantic cache. Be sure to select one of the pre-made embedding models during the database creation process.
Different embedding models excel in various scenarios. For instance, if low latency is crucial, opt for a model with smaller dimensions, such as bge-small-en-v1.5. If accuracy is a priority, select a model with more dimensions.
Next, create a .env file in the root directory of your project and include your Upstash Vector URL and token:
UPSTASH_VECTOR_REST_URL=https://example.upstash.io
UPSTASH_VECTOR_REST_TOKEN=your_secret_token_here
Semantic caching is a powerful technique that can help reduce the costs associated with running LLMs. By caching the results of expensive operations based on their semantic meaning, you can avoid recomputing them every time they are needed. This can lead to significant cost savings and improved performance. A good proof of concept is a RAG application, where cached answers enable faster responses and lower latency.
Think of all the use cases where you do not need to rerun your model for a response that has already been generated before, which, you would agree, happens more often than not in AI applications.
The image above summarizes the main point of this post.
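In code, the idea boils down to a simple cache-aside pattern: check the semantic cache first and only call the model on a miss. Below is a minimal sketch of that pattern, assuming a hypothetical callLLM helper in place of your real model client; the Upstash setup it uses is covered in detail in the next section.

import { SemanticCache } from "@upstash/semantic-cache";
import { Index } from "@upstash/vector";

const index = new Index({
  url: process.env.UPSTASH_VECTOR_REST_URL,
  token: process.env.UPSTASH_VECTOR_REST_TOKEN,
});
const cache = new SemanticCache({ index, minProximity: 0.85 });

// Hypothetical stand-in for a real model call (OpenAI, Anthropic, etc.)
async function callLLM(prompt: string): Promise<string> {
  return `LLM answer for: ${prompt}`;
}

async function answer(prompt: string): Promise<string> {
  // Cache hit: a semantically similar prompt was already answered, so skip the LLM
  const cached = await cache.get(prompt);
  if (cached) return cached;

  // Cache miss: pay for one model call, then store the answer for future prompts
  const fresh = await callLLM(prompt);
  await cache.set(prompt, fresh);
  return fresh;
}

Every prompt that is semantically close to one you have already paid for is answered from the cache instead of the model, which is exactly where the savings come from.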
Here is how you can use semantic cache in your Elysia server:
import { SemanticCache } from "@upstash/semantic-cache";
import { Index } from "@upstash/vector";
const DBConfig = {
  url: process.env.UPSTASH_VECTOR_REST_URL,
  token: process.env.UPSTASH_VECTOR_REST_TOKEN,
};

// 👇 your vector database
const index = new Index(DBConfig);

// 👇 your semantic cache; minProximity (0 to 1) is how similar a query must be
// to a stored entry before it counts as a cache hit
const semanticCache = new SemanticCache({ index, minProximity: 0.85 });
async function runCache() {
  // Store an answer under a natural-language key
  await semanticCache.set("largest city in USA by population", "New York");

  // Give the index a moment to make the new entry queryable
  await delay(1000);

  // Ask a differently worded but semantically similar question
  const result = await semanticCache.get(
    "which is the most populated city in the USA?",
  );
  console.log(result); // "New York"
}

function delay(ms: number) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

runCache();
In the code above, we start by setting up a vector index, create a semantic cache on top of it, and then run a function that handles the caching: it stores a question and its answer in the vector database, then retrieves that answer with a differently phrased but similar question.
To try the script in your setup, run it with:
bun ./src/{PATH_TO_YOUR_FILE}
Now, in the code above, if I changed the question from "what is the largest city in the USA by population?" to "largest population in USA by city", the answer would still be "New York", because the semantic cache stored the answer to the first question and can retrieve it for the second.
Notice how both questions are only semantically similar, yet return the same answer? That is the power of semantic caching. Now imagine this small demo running at scale.
For each question, you will see that the question and its answer are added to the vector database you created almost instantly.
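To take the demo closer to a real application, the same check-then-call pattern can sit inside an Elysia route. The sketch below is one way you might wire it up: the /ask endpoint, the q query parameter, and the callLLM helper are illustrative names rather than anything prescribed by Elysia or Upstash.

import { Elysia } from "elysia";
import { SemanticCache } from "@upstash/semantic-cache";
import { Index } from "@upstash/vector";

const index = new Index({
  url: process.env.UPSTASH_VECTOR_REST_URL,
  token: process.env.UPSTASH_VECTOR_REST_TOKEN,
});
const semanticCache = new SemanticCache({ index, minProximity: 0.85 });

// Hypothetical stand-in for your real LLM call
async function callLLM(prompt: string): Promise<string> {
  return `LLM answer for: ${prompt}`;
}

const app = new Elysia()
  .get("/ask", async ({ query }) => {
    const prompt = query.q ?? "";

    // Serve semantically similar questions straight from the cache
    const cached = await semanticCache.get(prompt);
    if (cached) return { answer: cached, cached: true };

    // Only genuinely new questions cost an LLM call
    const answer = await callLLM(prompt);
    await semanticCache.set(prompt, answer);
    return { answer, cached: false };
  })
  .listen(3000);

console.log(`Elysia is running at http://localhost:${app.server?.port}`);

With this in place, only questions that are not close enough (below minProximity) to anything already cached ever reach the model.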
This was a very fun tutorial to build as a proof of concept, and I hope you enjoyed it as much as I did. Here are some key takeaways from this tutorial: