AI · Embeddings

What are Embeddings? How to use them with OpenAI API?

By Johannes Hayer

Embeddings are a key component in natural language processing, and they help to convert text data into a numerical format. In simple terms, an embedding is a vector representation of a word or phrase, which captures the context and meaning of the text.
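To make "vector representation" concrete, here is a minimal sketch with made-up four-dimensional vectors (real embedding models produce hundreds or thousands of dimensions). Words with similar meanings get vectors that point in similar directions, which cosine similarity measures:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (1 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" with invented values, for illustration only
cat = [0.8, 0.1, 0.05, 0.05]
kitten = [0.75, 0.15, 0.05, 0.05]
car = [0.05, 0.05, 0.8, 0.1]

print(cosine_similarity(cat, kitten))  # close to 1: similar meaning
print(cosine_similarity(cat, car))     # much lower: unrelated meaning
```
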

With the OpenAI API, embeddings are the bridge between raw text and numbers: you send text to the API and receive a vector that you can store, compare, and feed into machine learning models. The API also offers a range of other natural language features for extracting insight from unstructured text data.

To use embeddings with OpenAI API, you must first create an account on their website and obtain an API key. Once you have the API key, you can use it to make requests to the API and get the embeddings for your text data.

A key benefit of embeddings is that they support a wide range of natural language processing tasks, such as semantic search, sentiment analysis, and text classification. Because an embedding captures the meaning and context of a text, models built on embedding features are often more accurate than ones built on raw keywords.

Embeddings are a foundational tool for natural language processing, and using them well can noticeably improve the accuracy and effectiveness of machine learning models. To get the most out of the OpenAI API, it helps to understand how embeddings are produced and how to compare them, which is what the rest of this post walks through.

Here are some examples of how embeddings can be used with OpenAI API:

  • Translation Matching: with a multilingual embedding model, text in two languages can be embedded into the same vector space, and comparing the vectors finds the closest candidate translation. Embeddings do not translate by themselves, but they are well suited to retrieving or aligning parallel text.
  • Text Classification: embeddings can be used to classify text into categories, such as positive or negative sentiment. By training a machine learning model on embedding features, you can accurately classify large volumes of text data.
  • Question Answering: embeddings can retrieve the passages from a large corpus that are most similar to a question; a language model such as GPT-3 then composes an answer from those passages. This retrieval-augmented approach is exactly what the walkthrough below builds.
  • Recommendations and Clustering: because related texts end up with nearby vectors, embeddings can power "more like this" recommendations and group documents by topic.

These are just a few examples. As a rule of thumb, embeddings are a good fit anywhere you need to measure how similar two pieces of text are in meaning.
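As an illustration of the classification idea, here is a hypothetical sketch. It pretends we already have embeddings for a few labeled texts (the three-dimensional vectors below are invented) and assigns a new text to whichever label's centroid its vector is closest to:

```python
def centroid(vectors):
    """Average a list of equal-length vectors component-wise."""
    return [sum(dims) / len(vectors) for dims in zip(*vectors)]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy embeddings of labeled training texts (values are made up)
positive = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
negative = [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3]]

centroids = {"positive": centroid(positive), "negative": centroid(negative)}

def classify(embedding):
    """Return the label whose centroid is nearest to the embedding."""
    return min(centroids, key=lambda label: squared_distance(embedding, centroids[label]))

print(classify([0.85, 0.15, 0.05]))  # lands near the positive centroid
```

In practice you would use the real embeddings from the API and likely a proper classifier, but the geometry is the same: nearby vectors mean similar texts.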

How to get embeddings

The embeddings API endpoint of OpenAI allows you to obtain embeddings for your text data. To get an embedding, you need to send your text string to the endpoint, along with a choice of embedding model ID (e.g., text-embedding-ada-002). The response will contain an embedding, which you can extract, save, and use.

Here is an example of how to get embeddings using OpenAI API:

import openai

# Authenticate with the OpenAI API
openai.api_key = "YOUR_API_KEY"

# Call the embeddings endpoint and extract the vector from the response
embedding = openai.Embedding.create(
    input="Your text goes here", model="text-embedding-ada-002"
)["data"][0]["embedding"]
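The endpoint also accepts a list of inputs and returns one embedding per input, in order. A mocked response below shows how to pull several vectors out of a batched call; the structure mirrors the API's JSON shape, while the vector values and token counts are invented:

```python
# Mocked response for a batched request with two inputs
# (structure mirrors the embeddings API; the numbers are made up)
mock_response = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 0, "embedding": [0.01, -0.02, 0.03]},
        {"object": "embedding", "index": 1, "embedding": [0.04, 0.00, -0.01]},
    ],
    "model": "text-embedding-ada-002",
    "usage": {"prompt_tokens": 8, "total_tokens": 8},
}

# One embedding per input, in the same order as the inputs
embeddings = [item["embedding"] for item in mock_response["data"]]
```
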


Let's have a look at an embedding example from the OpenAI Python Cookbook 🍳

1. Prepare search data

import ast  # for converting embeddings saved as strings back to lists

import pandas as pd  # for storing text and embeddings data

# download pre-chunked text and pre-computed embeddings
# this file is ~200 MB, so may take a minute depending on your connection speed
embeddings_path = "https://cdn.openai.com/API/examples/data/winter_olympics_2022.csv"

df = pd.read_csv(embeddings_path)

# convert embeddings from CSV str type back to list type
df['embedding'] = df['embedding'].apply(ast.literal_eval)
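The ast.literal_eval step is needed because writing a list to CSV serializes it as a string. A tiny round-trip (with invented numbers) shows what the conversion recovers:

```python
import ast

# When an embedding is written to CSV, it comes back as a string...
raw = "[0.0023, -0.0091, 0.0147]"

# ...and literal_eval safely parses it back into a Python list of floats
vec = ast.literal_eval(raw)
print(type(vec), len(vec))
```
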

2. Search

Now we'll define a search function that:

  • Takes a user query and a dataframe with text & embedding columns

  • Embeds the user query with the OpenAI API

  • Uses distance between query embedding and text embeddings to rank the texts

  • Returns two lists:

    • The top N texts, ranked by relevance
    • Their corresponding relevance scores

import openai
from scipy import spatial  # for calculating vector distances

EMBEDDING_MODEL = "text-embedding-ada-002"

# search function
def strings_ranked_by_relatedness(
    query: str,
    df: pd.DataFrame,
    relatedness_fn=lambda x, y: 1 - spatial.distance.cosine(x, y),
    top_n: int = 100
) -> tuple[list[str], list[float]]:
    """Returns a list of strings and relatednesses, sorted from most related to least."""
    query_embedding_response = openai.Embedding.create(
        model=EMBEDDING_MODEL,
        input=query,
    )
    query_embedding = query_embedding_response["data"][0]["embedding"]
    strings_and_relatednesses = [
        (row["text"], relatedness_fn(query_embedding, row["embedding"]))
        for i, row in df.iterrows()
    ]
    strings_and_relatednesses.sort(key=lambda x: x[1], reverse=True)
    strings, relatednesses = zip(*strings_and_relatednesses)
    return strings[:top_n], relatednesses[:top_n]
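The ranking logic in strings_ranked_by_relatedness can be exercised without calling the API. In the sketch below, invented toy vectors stand in for both the query embedding and the text embeddings; only the sort-by-cosine-relatedness step is real:

```python
from math import sqrt

def cosine_relatedness(a, b):
    """Cosine similarity: higher means more related."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Toy corpus of (text, embedding) pairs; vectors are made up
corpus = [
    ("curling results", [0.9, 0.1]),
    ("ski jumping results", [0.5, 0.5]),
    ("stock market news", [0.1, 0.9]),
]

query_embedding = [0.95, 0.05]  # pretend this came from the embeddings endpoint

# Rank texts from most to least related to the query
ranked = sorted(
    corpus, key=lambda pair: cosine_relatedness(query_embedding, pair[1]), reverse=True
)
top_texts = [text for text, _ in ranked]
```
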

3. Ask

With the search function above, we can now automatically retrieve relevant knowledge and insert it into messages to GPT.

Below, we define a function ask that:

  • Takes a user query
  • Searches for text relevant to the query
  • Stuffs that text into a message for GPT
  • Sends the message to GPT
  • Returns GPT's answer

import tiktoken  # for counting tokens against the model's context limit

GPT_MODEL = "gpt-3.5-turbo"

def num_tokens(text: str, model: str = GPT_MODEL) -> int:
    """Return the number of tokens in a string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

def query_message(
    query: str,
    df: pd.DataFrame,
    model: str,
    token_budget: int
) -> str:
    """Return a message for GPT, with relevant source texts pulled from a dataframe."""
    strings, relatednesses = strings_ranked_by_relatedness(query, df)
    introduction = 'Use the below articles on the 2022 Winter Olympics to answer the subsequent question. If the answer cannot be found in the articles, write "I could not find an answer."'
    question = f"\n\nQuestion: {query}"
    message = introduction
    for string in strings:
        next_article = f'\n\nWikipedia article section:\n"""\n{string}\n"""'
        if (
            num_tokens(message + next_article + question, model=model)
            > token_budget
        ):
            break
        else:
            message += next_article
    return message + question

def ask(
    query: str,
    df: pd.DataFrame = df,
    model: str = GPT_MODEL,
    token_budget: int = 4096 - 500,
    print_message: bool = False,
) -> str:
    """Answers a query using GPT and a dataframe of relevant texts and embeddings."""
    message = query_message(query, df, model=model, token_budget=token_budget)
    if print_message:
        print(message)
    messages = [
        {"role": "system", "content": "You answer questions about the 2022 Winter Olympics."},
        {"role": "user", "content": message},
    ]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=0
    )
    response_message = response["choices"][0]["message"]["content"]
    return response_message

Example questions

Finally, let's ask our system our original question about gold medal curlers:

ask('Which athletes won the gold medal in curling at the 2022 Winter Olympics?')

"There were two gold medal-winning teams in curling at the 2022 Winter Olympics: the Swedish men's team consisting of Niklas Edin, Oskar Eriksson, Rasmus Wranå, Christoffer Sundgren, and Daniel Magnusson, and the British women's team consisting of Eve Muirhead, Vicky Wright, Jennifer Dodds, Hailey Duff, and Mili Smith."
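The heart of query_message above is its greedy token-budget loop: keep appending articles until the next one would overflow the budget. The same loop can be sketched in isolation, using a word count as a stand-in for tiktoken (an assumption for illustration; real token counts differ):

```python
def fake_num_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: count whitespace-separated words."""
    return len(text.split())

def build_message(introduction, question, articles, token_budget):
    """Greedily append articles until the next one would exceed the budget."""
    message = introduction
    for article in articles:
        if fake_num_tokens(message + article + question) > token_budget:
            break
        message += article
    return message + question

intro = "Use the articles below to answer the question."
question = " Question: who won?"
articles = [" Article one has six words here.", " Article two also has six words."]

# With a small budget, only the first article fits before the loop stops
msg = build_message(intro, question, articles, token_budget=20)
```
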

Summary

Embeddings are vector representations of text that capture context and meaning, making them a key building block in natural language processing. The OpenAI API makes them easy to obtain: create an account, get an API key, and call the embeddings endpoint. Comparing the resulting vectors powers tasks such as semantic search, sentiment analysis, text classification, and the retrieval-augmented question answering demonstrated above.
