OpenAI’s Embedding Model With Vector Database

date

Apr 14, 2023

slug

openai_embedding_vector_db

status

Published

OpenAI’s Embedding Model With Vector Database

The updated Embedding model offers state-of-the-art performance with a 4x longer context window. The new model is 90% cheaper. The smaller embedding dimensions reduce the cost of storing embeddings in vector databases.

OpenAI’s Embedding model: 300 Fine Food Reviews¹ clustered with K-means

Introduction

In December 2022, OpenAI updated the Embedding model to text-embedding-ada-002. The new model offers:

90%-99.8% lower price

1/8th the embedding dimensions, which reduces vector database costs

Endpoint unification for ease of use

State-of-the-art performance for text search, code search, and sentence similarity

Context window increased from 2048 to 8192 tokens.

This tutorial guides you through the Embedding endpoint with a clustering task. I store the resulting embeddings in a vector database and retrieve them from it. I cover the following questions about the Embedding model and vector databases: Why was cost an issue with the prior version of the Embedding endpoint? How can I use the Embedding model in practice for NLP tasks? What is a vector database? How do I integrate OpenAI text embeddings into a vector database service? How do I query a vector database?

This tutorial requires OpenAI API access. The tokens cost a few cents for 300 reviews. The vector database requires only a free Pinecone account.

OpenAI’s Embedding endpoint

In December 2022, OpenAI released the updated version of the Embedding endpoint. The model is useful for many NLP tasks. It offers “state-of-the-art” performance for text search, code search, and sentence similarity. Text classification performance is good, too. The BEIR benchmark⁶ evaluates performance on such tasks. The Embedding model performs well on this benchmark, considering it is a commercial product.

I can use the Embedding model to perform NLP tasks on text, such as:

clustering

recommendations

classification

anomaly detection, and

diversity measurement.

Many NLP tasks rely on a concept called “text embeddings”, which are vectors (lists) of floating point numbers. Two text strings are closely related if the distance between their vectors is small. Similarly, two text strings are unrelated if the distance is large.
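To make the idea of "distance between vectors" concrete, here is a minimal sketch using cosine distance, one common choice for embeddings. The three-dimensional vectors and their labels are made up for illustration; real text-embedding-ada-002 vectors have 1536 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Small for related texts, large for unrelated ones."""
    return 1.0 - cosine_similarity(a, b)


# Hypothetical toy embeddings (illustration only).
coffee = [0.9, 0.1, 0.0]
espresso = [0.8, 0.2, 0.1]
dog_food = [0.0, 0.2, 0.9]

print(cosine_distance(coffee, espresso))  # small distance: related texts
print(cosine_distance(coffee, dog_food))  # large distance: unrelated texts
```

In this toy example the two coffee-related vectors end up much closer to each other than either is to the dog food vector.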

Enterprises have a vast number of potential use cases for NLP. Invoice disputes and product reviews are examples of text strings that can be converted into text embeddings. This tutorial uses the publicly available “Fine Food Reviews” dataset. The dataset includes 500k reviews in CSV format. Because such a large dataset would consume a lot of API tokens, I sample a smaller subset of the data.

NLP models use tokens as the basis for pricing. Tokens are “pieces of words” with varying numbers of characters:

OpenAI states that a word is roughly 1.3 tokens³

Cohere states that a word is around 2–3 tokens⁴.

The longer the CSV file of text strings to be processed, the more tokens are charged.
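As a rough illustration of the word-to-token rule of thumb, the sketch below estimates a token count from a word count using the 1.3 tokens-per-word figure quoted above. This is only an approximation; an exact count requires the model's own tokenizer. The function name and the sample sentence are made up for this example.

```python
def estimate_tokens(text, tokens_per_word=1.3):
    """Rough token estimate from a whitespace word count (not an exact tokenizer)."""
    words = len(text.split())
    return round(words * tokens_per_word)


# Made-up sample text with 14 words.
sample = "Great taffy at a great price. There was a wide assortment of yummy taffy."
print(estimate_tokens(sample))  # 18
```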

This was a limiting factor with the previous Embedding model: the API calls were too expensive.

Let’s look at this in practice. A developer may be considering how to split an API budget between two projects. Each project uses a different OpenAI endpoint. How would the developer allocate tokens between the Text completion and the Embedding models?

Let’s start with the Text completion model. My billing statement includes several prompts that each consumed fewer than 1000 tokens:

OpenAI pricing for text generation. $0.0200 / 1K tokens. Image by Author.

Such API calls could be used, for example, to summarize text or to translate it from one language to another. These API calls are cheap as long as the length and number of the text strings are controlled.

Let’s next look at the Embedding API call. I processed a single CSV file of product reviews. It consumed roughly 33x more tokens, even though I processed only a small subset of the data in the CSV file:

OpenAI pricing for embeddings. $0.0004 / 1K tokens. Image by Author.

Needless to say, these endpoints are used for different purposes. I compare the pricing here for the following reason. The Embedding model price was lowered from $0.2 / 1K tokens to $0.0004 / 1K tokens. In other words, the 27k tokens used to cost approximately $5. I can now process the same tokens for only about $0.01.
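The arithmetic behind these figures can be checked with a few lines. The prices and the 27k token count are the ones quoted above; the helper name is only for illustration.

```python
def cost_usd(tokens, price_per_1k_tokens):
    """Cost of a request billed per 1K tokens."""
    return tokens / 1000 * price_per_1k_tokens


tokens_used = 27_000
old_price = 0.2      # $ per 1K tokens, previous Embedding model
new_price = 0.0004   # $ per 1K tokens, text-embedding-ada-002

old_cost = cost_usd(tokens_used, old_price)
new_cost = cost_usd(tokens_used, new_price)
reduction = 1 - new_price / old_price

print(f"old: ${old_cost:.2f}")         # old: $5.40
print(f"new: ${new_cost:.4f}")         # new: $0.0108
print(f"reduction: {reduction:.1%}")   # reduction: 99.8%
```

The 99.8% result matches the upper end of the price reduction range that OpenAI reports.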

The updated Embedding endpoint is between 90% and 99.8% cheaper than the old endpoint².

The price reduction enables developers to build products that were previously too costly even to test with the prior Embedding model.

This was not such an issue with the Text completion endpoint in the past.

OpenAI unified the multiple models behind the Embedding endpoint into a single, better-performing model. This made it possible to lower the model pricing. I noticed the change immediately because the API is now easier to use.

The new Embedding endpoint increases the context window to 8192 tokens. This makes it possible to work efficiently with 4x longer text strings.

Many Large Language Models (LLMs) still rely on context windows of 2048 tokens or fewer. The increase may not sound significant. Yet many NLP tasks involve long texts, such as documents or legal contracts. I believe a longer context window enables completely new use cases for LLMs. For example, I can now text search an entire podcast by combining Whisper with the Embedding model.

Cluster reviews with the Embedding model

Next, I import all the libraries required in this tutorial. Pinecone is needed only in the final part, to store the text embeddings in the Pinecone vector database.

The next step is to add your own API keys. This example retrieves them from environment variables saved on Windows.

The API key itself is available from the OpenAI website under “View API keys”.

Remember not to type the API key directly into the code. It is not a secure practice.
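One way to keep keys out of the source code is to read them from environment variables at runtime. The sketch below is a minimal helper under that assumption; the function name is made up, and the variable names in the usage note (OPENAI_API_KEY and PINECONE_API_KEY) are assumed, so use whatever names you saved your keys under.

```python
import os


def read_api_key(variable_name):
    """Return the key stored in an environment variable, or fail with a clear message."""
    key = os.environ.get(variable_name)
    if not key:
        raise RuntimeError(
            f"Environment variable {variable_name!r} is not set; "
            "define it in your system settings instead of hard-coding the key."
        )
    return key
```

With the variables defined, a call such as `read_api_key("OPENAI_API_KEY")` or `read_api_key("PINECONE_API_KEY")` returns the key without it ever appearing in the script.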

This example uses the Amazon Fine Food Reviews dataset published on Kaggle¹. The dataset includes old reviews with information such as product ratings and plain-text reviews.

I will load the dataset and generate a vector embedding of each text review. Then, I will cluster these embeddings and plot them in 2D space.

I then count the number of tokens in the combined data column with the tokenizer. I keep the most recent reviews that are below 8000 tokens. This keeps each input within the model's context window and filters out very long reviews.
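The filtering step can be sketched without any external library. The version below works on a list of dictionaries and takes the token counter as a parameter, so the real tokenizer can be plugged in. The `Time` field name is an assumption about the dataset's timestamp column, and `rough_token_count` is a crude stand-in that counts words, not real tokens.

```python
def select_recent_reviews(reviews, count_tokens, max_tokens=8000, limit=300):
    """Return up to `limit` of the most recent reviews shorter than `max_tokens`."""
    if limit <= 0:
        return []
    ordered = sorted(reviews, key=lambda review: review["Time"])
    short_enough = [r for r in ordered if count_tokens(r["combined"]) < max_tokens]
    return short_enough[-limit:]


def rough_token_count(text):
    """Stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


# Made-up rows for illustration; the second one is far too long.
reviews = [
    {"Time": 1, "combined": "Title: Good; Content: Tasty coffee"},
    {"Time": 3, "combined": "Title: Long; Content: " + "word " * 9000},
    {"Time": 2, "combined": "Title: Bad; Content: Stale chips"},
]

selected = select_recent_reviews(reviews, rough_token_count, limit=2)
print([review["Time"] for review in selected])  # [1, 2]
```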

I use 300 reviews in this tutorial to limit the API charges. This is my own choice, not a limit imposed by the API.

I add an extra column to the CSV file and save it under a new name. Finally, I retrieve the similarity and search vectors from OpenAI’s Embedding endpoint.

I can now use these vectors to cluster the reviews.
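For readers who want to see what K-means does with such vectors, here is a small pure-Python sketch. It is not the code used to produce the plot in this tutorial: it uses a simple deterministic farthest-point initialization and made-up two-dimensional vectors, whereas in practice a library implementation such as scikit-learn's `KMeans` on the 1536-dimensional embeddings would be the usual choice.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def k_means(vectors, k, iterations=20):
    """Return (labels, centroids) for a list of equal-length vectors."""
    if not 1 <= k <= len(vectors):
        raise ValueError("k must be between 1 and the number of vectors")
    # Deterministic initialization: start from the first vector, then keep
    # adding the vector farthest from all centroids chosen so far.
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(
            vectors,
            key=lambda v: min(squared_distance(v, c) for c in centroids),
        )
        centroids.append(list(farthest))

    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            new_centroids.append(mean_vector(members) if members else centroids[c])
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids


# Made-up 2D "embeddings" forming two obvious groups.
vectors = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
           [5.0, 5.1], [5.1, 5.0], [4.9, 5.0]]
labels, centroids = k_means(vectors, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```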

I can plot the reviews in 2D space using t-SNE dimensionality reduction:

The resulting plot illustrates the clusters in the reviews.

OpenAI’s Embedding model: 300 Fine Food Reviews¹ clustered with K-means

Next, I summarize the clusters under common themes and print a few examples of each.

The results are useful for finding themes, which are quick to validate against the accompanying reviews:

Clustering is not a new technique, even in NLP, but I cannot stress enough the usefulness of the OpenAI API. The model is state-of-the-art, which removes the barrier of having to settle for lower quality. Next, I add the text embedding vectors to a vector database.

Vector Databases

Vector data may be stored in many ways; the simplest is perhaps a CSV file. Yet this may not be a suitable long-term approach. Vector databases store and retrieve vector data, as floating point numbers, in a scalable and secure way. The vector database saves them as a series of bits in its internal storage format.

Vector databases make it possible to store and retrieve text embedding vectors efficiently.

OpenAI’s Embedding model dimensions were reduced from 12288 to 1536. This is the number of floating point numbers that each text embedding vector contains. The change significantly reduces the cost of operating the new Embedding model with vector databases.
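A back-of-the-envelope calculation shows why the dimension count matters for storage. The sketch assumes each value is stored as a 4-byte float and uses a hypothetical collection of one million vectors; actual databases add index and metadata overhead on top of this raw size.

```python
def raw_vector_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw storage for the vector values alone (assumes 4-byte floats)."""
    return num_vectors * dimensions * bytes_per_value


num_vectors = 1_000_000  # hypothetical collection size
old_bytes = raw_vector_bytes(num_vectors, 12288)
new_bytes = raw_vector_bytes(num_vectors, 1536)

print(f"12288 dimensions: {old_bytes / 1e9:.1f} GB")  # 12288 dimensions: 49.2 GB
print(f"1536 dimensions: {new_bytes / 1e9:.1f} GB")   # 1536 dimensions: 6.1 GB
print(f"ratio: {old_bytes // new_bytes}x")            # ratio: 8x
```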

Vector database services are priced based on the number of vectors stored and their dimensions:

Pinecone’s standard tier starts from $0.0960⁵ / pod-hour (with an s1 or p1 pod; each s1 pod fits 5M 768-dim vectors). It includes collections.

Weaviate pricing starts from $0.050⁷ per 1M vector dimensions.

The Embedding model dimensions directly affect vector database costs. Lower-dimensional vectors are cheaper to store. This aspect becomes very important as solutions are scaled up!

Pinecone console with a 1536-dimension vector database

Vector database with OpenAI Embeddings

This tutorial integrates OpenAI’s “text embedding” vectors into a commercial vector database. A few options include Faiss and Weaviate; in this tutorial I use Pinecone. Pinecone offers a free plan, which is sufficient for completing this tutorial.

The first step is to create a Pinecone account and obtain the API key and environment name from your Pinecone profile. I installed and imported the Pinecone package earlier, so there is no need to do it again. I can therefore define my login parameters directly:

I then define the model dimension and the name of the index I want to use for my vectors. The dimension is 1536 in this case, matching the Embedding model dimensions. I then check whether an index with this name already exists and, if so, delete it so that I can create a new one.

I create the new index with cosine similarity as the metric, due to its fast computation:

I can then upsert OpenAI’s text embeddings from the previously generated df dataframe. I need to specify the index name so that Pinecone knows which index to add the data to.

Pinecone receives the ada_similarity embedding vectors together with the ProductId and combined columns. The “combined” column includes both the Summary and the Text, which are passed to the vector database as metadata. These are the only steps required to store data in a vector database!

I can now query this vector database. I take a query text and send it to the OpenAI API to get its text embedding. The vector of floating point numbers is saved in the “query_response_embeddings” variable.

I then query the vector database with this vector, specifying the number of results to return. I also request the metadata. I then print the results:

This returns semantic search results based on text similarity, queried directly from the vector database.
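To show conceptually what the store-and-query cycle does, here is a minimal in-memory stand-in. It is not the Pinecone client API: the `InMemoryIndex` class, its methods, the product IDs, and the three-dimensional vectors are all made up for illustration, and a real vector database uses approximate nearest-neighbour indexes rather than this brute-force scan.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


class InMemoryIndex:
    """Toy stand-in for a vector index (hypothetical, not Pinecone's API)."""

    def __init__(self, dimension):
        self.dimension = dimension
        self._records = {}

    def upsert(self, record_id, vector, metadata):
        """Insert a record, or overwrite it if the id already exists."""
        if len(vector) != self.dimension:
            raise ValueError(f"expected {self.dimension} dimensions")
        self._records[record_id] = (list(vector), dict(metadata))

    def query(self, vector, top_k=3):
        """Return the top_k (score, id, metadata) tuples by cosine similarity."""
        scored = [
            (cosine_similarity(vector, stored), record_id, metadata)
            for record_id, (stored, metadata) in self._records.items()
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:top_k]


index = InMemoryIndex(dimension=3)
index.upsert("B001", [0.9, 0.1, 0.0], {"combined": "Title: Great coffee; Content: Rich and smooth"})
index.upsert("B002", [0.1, 0.9, 0.0], {"combined": "Title: Dog treats; Content: My dog loves them"})
index.upsert("B003", [0.8, 0.2, 0.1], {"combined": "Title: Espresso; Content: Strong roast"})

# In the tutorial this vector comes from the Embedding endpoint; here it is made up.
query_response_embeddings = [1.0, 0.0, 0.0]
results = index.query(query_response_embeddings, top_k=2)
for score, record_id, metadata in results:
    print(f"{score:.3f} {record_id} {metadata['combined']}")
```

The two coffee-related records are returned first because their vectors point in nearly the same direction as the query vector.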

Conclusions

The tutorial started by explaining Embedding model pricing. I covered the context window, model unification, and smaller dimensions. I demonstrated the Embedding endpoint with a practical clustering exercise.

I then introduced the concept of vector databases and their pricing, which depends on vector dimensions. Finally, I illustrated the use of vector databases for both storage and text search.

References

[1] Amazon Fine Food Reviews dataset. https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews?resource=download. Kaggle.

[2] New and Improved Embedding Model. https://openai.com/blog/new-and-improved-embedding-model/. OpenAI.

[3] What are tokens and how to count them? https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them. OpenAI.

[4] Tokens. https://docs.cohere.ai/docs/tokens. co:here.

[5] Pinecone pricing. https://www.pinecone.io/pricing/. Pinecone.

[6] Thakur et al., 2021. BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. https://openreview.net/forum?id=wCu6T5xFjeJ. OpenReview.net.

[7] Weaviate pricing. https://weaviate.io/pricing.html. Weaviate.