Maximize AI Efficiency with Upstash Vector

Introduction

As AI applications grow, their running costs increase and their latency worsens. In most cases this causes significant problems: at some point the cost becomes unaffordable and the infrastructure has to be scaled back to fit the budget, which means the application gets worse over time.

In this tutorial, we will explore how to cut LLM costs with Upstash Vector and semantic caching. We will use Elysia.js to demonstrate how semantic caching works in practice and how much it can help with reducing LLM costs.

Let's dive right in!

Prerequisites

Before we commence, you will need to have a few things in place:

- Bun installed on your machine (Elysia.js runs on Bun)
- A free Upstash account with a Vector database created
- Basic familiarity with TypeScript

Definitions

Before we dive right in, we need to understand some key terms that will be used in this tutorial.

- LLM (Large Language Model): a model that generates text responses; each call typically costs money and adds latency.
- Embedding model: a model that converts text into vectors so that semantically similar texts end up close together.
- Vector database: a database that stores embeddings and retrieves entries by similarity rather than by exact match.
- Semantic caching: caching keyed on the meaning of a query rather than its exact wording, so paraphrased questions can reuse a stored answer.

Setup

First, scaffold a new Elysia app with Bun and install the Upstash packages:

bun create elysia app
cd app
bun install @upstash/semantic-cache @upstash/vector

Now that you have an Upstash Vector database, you will need its URL and token credentials to connect your semantic cache. Be sure to select one of the pre-made embedding models during the database creation process, since the semantic cache relies on it to embed your keys.

Note

Different embedding models excel in various scenarios. For instance, if low latency is crucial, opt for a model with smaller dimensions, such as bge-small-en-v1.5. If accuracy is a priority, select a model with more dimensions.

Next, create a .env file in the root directory of your project and include your Upstash Vector URL and token:

UPSTASH_VECTOR_REST_URL=https://example.upstash.io
UPSTASH_VECTOR_REST_TOKEN=your_secret_token_here

Why Semantic Caching?

Semantic caching is a powerful technique for reducing the costs of running LLMs. By caching the results of expensive operations based on their semantic meaning, you avoid recomputing them every time a similar request comes in. This can lead to significant cost savings and improved performance; RAG applications, for example, can use it to serve repeated questions with noticeably lower latency.

Think of all the cases where you should not have to rerun your model for a response that has already been generated before; as you might agree, this happens more often than not in AI applications. The sketch below shows what this cache-first pattern looks like in code.
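To make this concrete, here is a minimal sketch of the cache-first pattern. The callLLM helper is hypothetical, a stand-in for whatever model provider you actually call:

import { SemanticCache } from "@upstash/semantic-cache";
import { Index } from "@upstash/vector";

// hypothetical stand-in for your real LLM provider call
async function callLLM(prompt: string): Promise<string> {
  return `generated answer for: ${prompt}`;
}

const cache = new SemanticCache({
  index: new Index({
    url: process.env.UPSTASH_VECTOR_REST_URL!,
    token: process.env.UPSTASH_VECTOR_REST_TOKEN!,
  }),
  minProximity: 0.85,
});

// check the cache before paying for a model call
async function answer(prompt: string): Promise<string> {
  const cached = await cache.get(prompt);
  if (cached) return cached; // semantic hit: no model call, no cost

  const fresh = await callLLM(prompt); // semantic miss: pay once
  await cache.set(prompt, fresh); // future paraphrases will hit the cache
  return fresh;
}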

[Image: how semantic caching works]

The image above summarizes the core idea of this post.

Semantic Caching In Action

Here is how you can use the semantic cache in your Elysia project:

import { SemanticCache } from "@upstash/semantic-cache";
import { Index } from "@upstash/vector";

const DBConfig = {
  url: process.env.UPSTASH_VECTOR_REST_URL!,
  token: process.env.UPSTASH_VECTOR_REST_TOKEN!,
};

// 👇 your vector database
const index = new Index(DBConfig);

// 👇 your semantic cache
const semanticCache = new SemanticCache({ index, minProximity: 0.85 });

function delay(ms: number) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function runCache() {
  // store an answer under a natural-language key
  await semanticCache.set("largest city in USA by population", "New York");

  // give the index a moment to finish embedding and storing the entry
  await delay(1000);

  // look it up with a differently worded but semantically similar question
  const result = await semanticCache.get(
    "which is the most populated city in the USA?",
  );

  console.log(result); // "New York"
}

runCache();

In the code above, we start by setting up the vector database client, create a semantic cache on top of it, and then run a function that exercises the cache: it stores a question-and-answer pair in the vector database, waits briefly for indexing, and then retrieves the answer using a similarly worded question.

To run the script, point Bun at the file where you saved it:

 bun ./src/{PATH_TO_YOUR_FILE}
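The demo above is a standalone script, but the same cache drops straight into an Elysia route. Here is a minimal sketch, assuming the semanticCache instance from the script above; the /ask endpoint and response shape are illustrative, not a fixed API:

import { Elysia } from "elysia";

// illustrative route: ?q=... is answered from the semantic cache when possible
const app = new Elysia()
  .get("/ask", async ({ query }) => {
    const question = query.q;
    if (!question) return { error: "missing ?q= query parameter" };

    // cache-first lookup: answered for free on a semantic hit
    const cached = await semanticCache.get(question);
    return { answer: cached ?? "no cached answer yet" };
  })
  .listen(3000);

console.log(`Server running at http://localhost:${app.server?.port}`);

You could then try it with something like curl "http://localhost:3000/ask?q=most+populated+city+in+the+USA".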

Caveats

Now, in the code above, if we changed the lookup question from "which is the most populated city in the USA?" to "Largest population in USA by city", the answer would still be "New York", because the semantic cache stored the answer to the original question and can retrieve it for any semantically similar one, as the quick check below shows.
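As a quick check (whether a given rewording clears the 0.85 minProximity threshold depends on your embedding model):

// a rewording of the original key; no exact string match involved
const rephrased = await semanticCache.get("Largest population in USA by city");
console.log(rephrased); // "New York", retrieved by meaning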

Notice how both questions are only semantically similar, yet return the same answer? That is the power of semantic caching. Now imagine this small demo at scale.

[Image: the Upstash Vector database console]

For each question-and-answer pair you cache, you will see the entry appear in the vector database you created almost instantly.
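If you would rather verify this from code than in the Upstash console, here is a small sketch using the vector SDK's info() method, assuming the index instance from the script above:

// fetch index statistics; each cached question-and-answer pair adds one vector
const info = await index.info();
console.log(`vectors stored: ${info.vectorCount}`);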

Conclusions and Takeaways

This was a fun tutorial to build as a proof of concept, and I hope you enjoyed it as much as I did. Here are some key takeaways from this tutorial:

- Semantic caching stores responses by meaning, so paraphrased queries can reuse answers without another LLM call.
- Upstash Vector plus @upstash/semantic-cache makes this a few lines of code: create an index, wrap it in a SemanticCache, then set and get.
- The minProximity threshold controls how strict a match must be; raise it for precision, lower it for more cache hits.
- Your choice of embedding model trades off latency (smaller dimensions) against accuracy (larger dimensions).