What Are Embeddings, and How Do You Use Them with the OpenAI API?
Embeddings are a key component of natural language processing: they convert text into a numerical format. In simple terms, an embedding is a vector representation of a word or phrase that captures the context and meaning of the text.
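To make the "vector representation" idea concrete, here is a toy sketch using made-up 3-dimensional vectors (real embedding models produce vectors with hundreds or thousands of dimensions). Similarity between two embeddings is commonly measured with cosine similarity:

```python
import numpy as np

def cosine_similarity(a, b):
    # dot product of the vectors divided by the product of their norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy 3-dimensional "embeddings"; real models produce much longer vectors
cat = np.array([0.9, 0.1, 0.2])
kitten = np.array([0.85, 0.15, 0.25])
car = np.array([0.1, 0.9, 0.3])

print(cosine_similarity(cat, kitten))  # close to 1.0: similar meanings
print(cosine_similarity(cat, car))     # noticeably smaller: different meanings
```

Texts with similar meanings end up close together in the vector space, which is the property all of the applications below rely on.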
With the OpenAI API, embeddings transform text into a format that machine learning models can work with. The API offers a range of natural language processing features for extracting insights from unstructured text data.
To use embeddings with the OpenAI API, first create an account on the OpenAI website and obtain an API key. With the key, you can make requests to the API and retrieve embeddings for your text data.
A key benefit of embeddings is that they enable a wide range of natural language processing tasks, such as language translation, sentiment analysis, and text classification. Because embeddings capture the meaning and context of text, they can improve the accuracy and effectiveness of your machine learning models. To get the most out of them, it helps to have a clear understanding of how they work and how to apply them.
Here are some examples of how embeddings can be used with OpenAI API:
- Language Translation: You can use embeddings to translate text from one language to another. By embedding the text in both languages, you can compare the vectors to find the most similar translation.
- Text Classification: Embeddings can be used to classify text into different categories, such as positive or negative sentiment. By training a machine learning model on embeddings, you can accurately classify large volumes of text data.
- Question Answering: Embeddings can help answer questions by finding the passages in a large corpus whose vectors are most similar to the question. Combined with large language models such as OpenAI's GPT-3, this retrieval approach underpins powerful question-answering systems.
- Text Generation: Language models represent text internally as embeddings and build on those representations to generate realistic and diverse text samples.
These are just a few examples of how embeddings can be used with OpenAI API. With the right approach, embeddings can be a powerful tool for natural language processing and can help to unlock valuable insights from unstructured text data.
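To make the text-classification use case concrete, here is a minimal sketch of training a classifier on top of embeddings. The tiny 3-dimensional vectors and labels below are made up for illustration; in practice each vector would come from the embeddings API, and scikit-learn is one of several libraries you could use:

```python
from sklearn.linear_model import LogisticRegression

# made-up 3-d "embeddings"; real ones would come from the embeddings API
X = [
    [0.9, 0.1, 0.0],  # e.g. "great product, loved it"
    [0.8, 0.2, 0.1],  # e.g. "works perfectly"
    [0.1, 0.9, 0.2],  # e.g. "terrible, broke after a day"
    [0.2, 0.8, 0.1],  # e.g. "very disappointed"
]
y = [1, 1, 0, 0]  # 1 = positive sentiment, 0 = negative

# train a simple linear classifier on the embedding vectors
clf = LogisticRegression().fit(X, y)
print(clf.predict([[0.85, 0.15, 0.05]]))  # a new vector near the positive cluster
```

Because the embeddings already encode meaning, even a simple linear model on top of them can separate the classes well.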
How to get embeddings
OpenAI's embeddings API endpoint allows you to obtain embeddings for your text data. To get an embedding, send your text string to the endpoint along with an embedding model ID (e.g., `text-embedding-ada-002`). The response contains the embedding, which you can extract, save, and use.
Here is an example of how to get embeddings using OpenAI API:
```python
import openai

# Authenticate with the OpenAI API
openai.api_key = "YOUR_API_KEY"

# Call the embeddings API endpoint
embedding = openai.Embedding.create(
    input="Your text goes here", model="text-embedding-ada-002"
)["data"][0]["embedding"]
```
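The extracted value is simply a Python list of floats; for `text-embedding-ada-002` it has 1536 dimensions. The snippet below sanity-checks the result, using a placeholder list of the right length so it runs without a live API call:

```python
# placeholder standing in for the `embedding` returned by the API call above;
# text-embedding-ada-002 returns 1536-dimensional vectors
embedding = [0.0] * 1536

print(type(embedding))  # <class 'list'>
print(len(embedding))   # 1536
```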
Let's have a look at an embedding example from the OpenAI Cookbook 🍳
1. Prepare search data
```python
import ast

import pandas as pd

# download pre-chunked text and pre-computed embeddings
# this file is ~200 MB, so may take a minute depending on your connection speed
embeddings_path = "https://cdn.openai.com/API/examples/data/winter_olympics_2022.csv"
df = pd.read_csv(embeddings_path)

# convert embeddings from CSV str type back to list type
df['embedding'] = df['embedding'].apply(ast.literal_eval)
```
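The `ast.literal_eval` step is needed because a round-trip through CSV turns each embedding into a string like `"[0.1, 0.2]"`. A minimal standalone illustration of the conversion:

```python
import ast

cell = "[0.1, 0.2, 0.3]"        # how an embedding cell looks after a CSV round-trip
vector = ast.literal_eval(cell)  # safely parse the string back into a list of floats
print(vector[0] + vector[1])     # the elements are numbers again, not characters
```

Unlike `eval`, `ast.literal_eval` only accepts Python literals, so it is safe to run on data loaded from a file.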
2. Search
Now we'll define a search function that:
- Takes a user query and a dataframe with text & embedding columns
- Embeds the user query with the OpenAI API
- Uses distance between query embedding and text embeddings to rank the texts
- Returns two lists:
  - The top N texts, ranked by relevance
  - Their corresponding relevance scores
```python
# search function
def strings_ranked_by_relatedness(
    query: str,
    df: pd.DataFrame,
    relatedness_fn=lambda x, y: 1 - spatial.distance.cosine(x, y),
    top_n: int = 100,
) -> tuple[list[str], list[float]]:
    """Returns a list of strings and relatednesses, sorted from most related to least."""
    query_embedding_response = openai.Embedding.create(
        model=EMBEDDING_MODEL,
        input=query,
    )
    query_embedding = query_embedding_response["data"][0]["embedding"]
    strings_and_relatednesses = [
        (row["text"], relatedness_fn(query_embedding, row["embedding"]))
        for i, row in df.iterrows()
    ]
    strings_and_relatednesses.sort(key=lambda x: x[1], reverse=True)
    strings, relatednesses = zip(*strings_and_relatednesses)
    return strings[:top_n], relatednesses[:top_n]
```
3. Ask
With the search function above, we can now automatically retrieve relevant knowledge and insert it into messages to GPT.
Below, we define a function `ask` that:
- Takes a user query
- Searches for text relevant to the query
- Stuffs that text into a message for GPT
- Sends the message to GPT
- Returns GPT's answer
```python
def num_tokens(text: str, model: str = GPT_MODEL) -> int:
    """Return the number of tokens in a string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))


def query_message(
    query: str,
    df: pd.DataFrame,
    model: str,
    token_budget: int,
) -> str:
    """Return a message for GPT, with relevant source texts pulled from a dataframe."""
    strings, relatednesses = strings_ranked_by_relatedness(query, df)
    introduction = 'Use the below articles on the 2022 Winter Olympics to answer the subsequent question. If the answer cannot be found in the articles, write "I could not find an answer."'
    question = f"\n\nQuestion: {query}"
    message = introduction
    for string in strings:
        next_article = f'\n\nWikipedia article section:\n"""\n{string}\n"""'
        if (
            num_tokens(message + next_article + question, model=model)
            > token_budget
        ):
            break
        else:
            message += next_article
    return message + question


def ask(
    query: str,
    df: pd.DataFrame = df,
    model: str = GPT_MODEL,
    token_budget: int = 4096 - 500,
    print_message: bool = False,
) -> str:
    """Answers a query using GPT and a dataframe of relevant texts and embeddings."""
    message = query_message(query, df, model=model, token_budget=token_budget)
    if print_message:
        print(message)
    messages = [
        {"role": "system", "content": "You answer questions about the 2022 Winter Olympics."},
        {"role": "user", "content": message},
    ]
    response = openai.ChatCompletion.create(
        model=model, messages=messages, temperature=0
    )
    response_message = response["choices"][0]["message"]["content"]
    return response_message
```
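The core of `query_message` is its budget loop: keep appending retrieved sections until the next one would push the prompt past the token budget. Here is a simplified standalone sketch of that pattern, using character counts in place of real token counts and made-up section strings:

```python
def build_message(sections, question, budget, count=len):
    # greedily append sections until adding the next one would exceed the budget
    message = "Use the articles below to answer the question."
    for section in sections:
        candidate = message + "\n\n" + section
        if count(candidate + question) > budget:
            break
        message = candidate
    return message + question

sections = ["article one " * 5, "article two " * 5, "article three " * 50]
msg = build_message(sections, "\n\nQuestion: who won?", budget=200)
print("article one" in msg, "article three" in msg)
```

In the real function, `count` is `num_tokens` (backed by tiktoken) and the budget is the model's context limit minus headroom reserved for the answer, which is why `ask` defaults to `4096 - 500`.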
Example questions
Finally, let's ask our system our original question about gold medal curlers:
```python
ask('Which athletes won the gold medal in curling at the 2022 Winter Olympics?')
```
"There were two gold medal-winning teams in curling at the 2022 Winter Olympics: the Swedish men's team consisting of Niklas Edin, Oskar Eriksson, Rasmus Wranå, Christoffer Sundgren, and Daniel Magnusson, and the British women's team consisting of Eve Muirhead, Vicky Wright, Jennifer Dodds, Hailey Duff, and Mili Smith."
Summary
Embeddings are vector representations of text that capture context and meaning, and they are a key component of natural language processing. The OpenAI API uses embeddings to support tasks such as language translation, sentiment analysis, and text classification. To use embeddings with the API, create an account and obtain an API key; from there, embeddings can power a wide range of natural language processing tasks and improve the accuracy and effectiveness of machine learning models.
- More about embeddings: https://platform.openai.com/docs/guides/embeddings/what-are-embeddings
- OpenAI cookbook: https://github.com/openai/openai-cookbook