[LangChain Series] 4. Vector Stores

LangChain No.4


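  • Text embedding models
    • Introduction to text embedding models
    • Using a text embedding model
  • Using vector stores
    • Creating an index from texts
    • Creating an index from a file
  • Types of vector stores
  • How to choose a vector store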


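A key step in using a vector store is turning text into vectors and storing them in the database. This is normally done with an embedding model, so before working with vector stores you need to be familiar with text embedding models.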


  • Introduction to text embedding models

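Text embedding models are available from many providers, such as OpenAI, Cohere, and Hugging Face. LangChain defines an abstract Embeddings class as a common interface to these models. An embedding is a vector representation of a piece of text. Representing text as numbers lets us reason about it in a vector space: semantic search, for example, becomes a search for the stored vectors closest to the query vector, which is very useful for tasks such as finding similar documents.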
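To make "searching in a vector space" concrete, here is a minimal sketch of cosine similarity, one common way to compare two embeddings. The function name cosineSimilarity and the tiny 3-dimensional vectors are made up for illustration; real embeddings have many more dimensions, and a given vector store may use a different distance metric.

```typescript
// Cosine similarity between two equal-length vectors: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Convention used in this sketch: a zero vector is similar to nothing.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical 3-dimensional "embeddings", invented for this example.
const query = [1, 0, 1];
const docA = [2, 0, 2]; // points in the same direction as the query
const docB = [0, 1, 0]; // orthogonal to the query

console.log(cosineSimilarity(query, docA)); // close to 1 (up to floating-point rounding)
console.log(cosineSimilarity(query, docB)); // 0
```

A score near 1 means the two vectors point in nearly the same direction, which is read as "semantically similar"; a score near 0 means they are unrelated.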


  • Using a text embedding model


import { OpenAIEmbeddings } from "langchain/embeddings/openai";

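After importing OpenAIEmbeddings, create an instance of the embedding model. The example below calls embedQuery to embed the short text "Hello world". As the output shows, the result is a vector of numbers.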

/* Create instance */
const embeddings = new OpenAIEmbeddings();

/* Embed queries */
const res = await embeddings.embedQuery("Hello world");
/*
[
  -0.004845875, 0.004899438, -0.016358767, -0.024475135, -0.017341806,
  0.012571548, -0.019156644, 0.009036391, -0.010227379, -0.026945334,
  0.022861943, 0.010321903, -0.023479493, -0.0066544134, 0.007977734,
  0.0026371893, 0.025206111, -0.012048521, 0.012943339, 0.013094575,
  -0.010580265, -0.003509951, 0.004070787, 0.008639394, -0.020631202,
  -0.0019203906, 0.012161949, -0.019194454, 0.030373365, -0.031028723,
  0.0036170771, -0.007813894, -0.0060778237, -0.017820721, 0.0048647798,
  -0.015640393, 0.001373733, -0.015552171, 0.019534737, -0.016169721,
  0.007316074, 0.008273906, 0.011418369, -0.01390117, -0.033347685,
  0.011248227, 0.0042503807, -0.012792102, -0.0014595914, 0.028356876,
  0.025407761, 0.00076445413, -0.016308354, 0.017455231, -0.016396577,
  0.008557475, -0.03312083, 0.031104341, 0.032389853, -0.02132437,
  0.003324056, 0.0055610985, -0.0078012915, 0.006090427, 0.0062038545,
  0.0169133, 0.0036391325, 0.0076815626, -0.018841568, 0.026037913,
  0.024550753, 0.0055264398, -0.0015824712, -0.0047765584, 0.018425668,
  0.0030656934, -0.0113742575, -0.0020322427, 0.005069579, 0.0022701253,
  0.036095154, -0.027449455, -0.008475555, 0.015388331, 0.018917186,
  0.0018999106, -0.003349262, 0.020895867, -0.014480911, -0.025042271,
  0.012546342, 0.013850759, 0.0069253794, 0.008588983, -0.015199285,
  -0.0029585673, -0.008759124, 0.016749462, 0.004111747, -0.04804285,
  ... 1436 more items
]
*/

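You can also try the embedDocuments method and inspect the resulting document embeddings. The code below passes the array ["Hello world", "Bye bye"] to embedDocuments to embed two short documents. You can also use a document loader to load a complete document and compare how different documents are embedded.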

/* Embed documents */
const documentRes = await embeddings.embedDocuments(["Hello world", "Bye bye"]);




  • Creating an index from texts


import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { OpenAIEmbeddings } from "langchain/embeddings/openai";


// Embed three texts, each paired with its own metadata object
const vectorStore = await MemoryVectorStore.fromTexts(
  ["Hello world", "Bye bye", "hello nice world"],
  [{ id: 2 }, { id: 1 }, { id: 3 }],
  new OpenAIEmbeddings()
);

// Return the single most similar document to the query
const resultOne = await vectorStore.similaritySearch("hello world", 1);
console.log(resultOne);


/*
[
  Document {
    pageContent: "Hello world",
    metadata: { id: 2 }
  }
]
*/
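To show roughly what happens behind fromTexts and similaritySearch, here is a self-contained sketch that needs no API key. It is not LangChain's implementation: TinyVectorStore, toyEmbed, and cosine are hypothetical names, and toyEmbed (a count of the letters a to z) is only a crude stand-in for a real embedding model. The idea is the same though: embed every text once, embed the query, rank the stored vectors by similarity, and return the top k documents.

```typescript
interface Doc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Hypothetical stand-in for an embedding model: a 26-dimensional letter count.
function toyEmbed(text: string): number[] {
  const vector: number[] = [];
  for (let i = 0; i < 26; i++) vector.push(0);
  const lower = text.toLowerCase();
  for (let i = 0; i < lower.length; i++) {
    const index = lower.charCodeAt(i) - 97; // 'a' -> 0 ... 'z' -> 25
    if (index >= 0 && index < 26) vector[index] += 1;
  }
  return vector;
}

// Cosine similarity; returns 0 if either vector is all zeros.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

class TinyVectorStore {
  private entries: { vector: number[]; doc: Doc }[] = [];

  // Embed each text once and keep the vector next to its document.
  static fromTexts(texts: string[], metadatas: Record<string, unknown>[]): TinyVectorStore {
    const store = new TinyVectorStore();
    texts.forEach((text, i) => {
      store.entries.push({
        vector: toyEmbed(text),
        doc: { pageContent: text, metadata: metadatas[i] ?? {} },
      });
    });
    return store;
  }

  // Embed the query, score every entry, and return the k best documents.
  similaritySearch(query: string, k = 4): Doc[] {
    const q = toyEmbed(query);
    return this.entries
      .map((e) => ({ doc: e.doc, score: cosine(q, e.vector) }))
      .sort((x, y) => y.score - x.score)
      .slice(0, k)
      .map((e) => e.doc);
  }
}

const store = TinyVectorStore.fromTexts(
  ["Hello world", "Bye bye", "hello nice world"],
  [{ id: 2 }, { id: 1 }, { id: 3 }]
);
const top = store.similaritySearch("hello world", 1);
console.log(top); // the "Hello world" document, whose metadata is { id: 2 }
```

This sketch scans every stored vector on each query, which is fine for small collections. Dedicated vector databases typically add indexing structures so that search stays fast as the collection grows.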
  • Creating an index from a file


import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { OpenAIEmbeddings } from "langchain/embeddings/openai";
import { TextLoader } from "langchain/document_loaders/fs/text";


// Create docs with a loader
const loader = new TextLoader("src/document_loaders/example_data/example.txt");
const docs = await loader.load();

// Load the docs into the vector store
const vectorStore = await MemoryVectorStore.fromDocuments(
  docs,
  new OpenAIEmbeddings()
);

// Search for the most similar document
const resultOne = await vectorStore.similaritySearch("hello world", 1);
console.log(resultOne);
/*
[
  Document {
    pageContent: "Hello world",
    metadata: { id: 2 }
  }
]
*/


abstract class BaseVectorStore implements VectorStore {
  static fromTexts(
    texts: string[],
    metadatas: object[] | object,
    embeddings: Embeddings,
    dbConfig: Record<string, any>
  ): Promise<VectorStore>;

  static fromDocuments(
    docs: Document[],
    embeddings: Embeddings,
    dbConfig: Record<string, any>
  ): Promise<VectorStore>;
}


interface VectorStore {
  /**
   * Add more documents to an existing VectorStore.
   * Some providers support additional parameters, e.g. to associate custom ids
   * with added documents or to change the batch size of bulk inserts.
   * Returns an array of ids for the documents or nothing.
   */
  addDocuments(
    documents: Document[],
    options?: Record<string, any>
  ): Promise<string[] | void>;

  /**
   * Search for the most similar documents to a query
   */
  similaritySearch(
    query: string,
    k?: number,
    filter?: object | undefined
  ): Promise<Document[]>;

  /**
   * Search for the most similar documents to a query,
   * and return their similarity score
   */
  similaritySearchWithScore(
    query: string,
    k?: number, // defaults to 4
    filter?: object | undefined
  ): Promise<[object, number][]>;

  /**
   * Turn a VectorStore into a Retriever
   */
  asRetriever(k?: number): BaseRetriever;

  /**
   * Delete embedded documents from the vector store matching the passed in parameter.
   * Not supported by every provider.
   */
  delete(params?: Record<string, any>): Promise<void>;

  /**
   * Advanced: Add more documents to an existing VectorStore,
   * when you already have their embeddings
   */
  addVectors(
    vectors: number[][],
    documents: Document[],
    options?: Record<string, any>
  ): Promise<string[] | void>;

  /**
   * Advanced: Search for the most similar documents to a query,
   * when you already have the embedding of the query
   */
  similaritySearchVectorWithScore(
    query: number[],
    k: number,
    filter?: object
  ): Promise<[Document, number][]>;
}
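The two "advanced" methods in the interface, addVectors and similaritySearchVectorWithScore, are for the case where you already hold the embeddings and do not want the store to call an embedding model. The sketch below illustrates that idea under stated assumptions: PrecomputedVectorStore is a hypothetical class, the methods are synchronous instead of returning Promises, the score is a plain dot product (equal to cosine similarity only when the vectors are unit length), and the filter is modeled as a predicate function, which may differ from what a particular provider accepts.

```typescript
interface Doc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Assumption for this sketch: a filter is a predicate over documents.
type DocFilter = (doc: Doc) => boolean;

class PrecomputedVectorStore {
  private vectors: number[][] = [];
  private docs: Doc[] = [];

  // Store vectors that were computed elsewhere, paired with their documents.
  addVectors(vectors: number[][], documents: Doc[]): void {
    if (vectors.length !== documents.length) {
      throw new Error("Expected one vector per document");
    }
    this.vectors.push(...vectors);
    this.docs.push(...documents);
  }

  // Score each stored vector against the query vector with a dot product,
  // skip documents rejected by the filter, and return the k best
  // [document, score] pairs, highest score first.
  similaritySearchVectorWithScore(
    query: number[],
    k: number,
    filter?: DocFilter
  ): [Doc, number][] {
    const scored: [Doc, number][] = [];
    this.docs.forEach((doc, i) => {
      if (filter && !filter(doc)) return;
      const score = this.vectors[i].reduce((sum, v, j) => sum + v * query[j], 0);
      scored.push([doc, score]);
    });
    return scored.sort((a, b) => b[1] - a[1]).slice(0, k);
  }
}

// Hypothetical unit-length 2-dimensional vectors, invented for this example.
const precomputed = new PrecomputedVectorStore();
precomputed.addVectors(
  [[1, 0], [0, 1], [0.6, 0.8]],
  [
    { pageContent: "a", metadata: { id: 1 } },
    { pageContent: "b", metadata: { id: 2 } },
    { pageContent: "c", metadata: { id: 3 } },
  ]
);

const best = precomputed.similaritySearchVectorWithScore([1, 0], 2);
console.log(best.map(([doc, score]) => [doc.pageContent, score]));

// Same query, but excluding the document whose metadata id is 1.
const filtered = precomputed.similaritySearchVectorWithScore(
  [1, 0],
  2,
  (doc) => doc.metadata["id"] !== 1
);
console.log(filtered.map(([doc, score]) => [doc.pageContent, score]));
```

Returning the score alongside each document lets the caller apply a relevance threshold instead of always accepting the top k results.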


There are many vector stores to choose from. Common ones include the in-memory store (Memory), Elasticsearch, Redis, and Chroma.



  • If you're after something that can just run inside your Node.js application, in-memory, without any other servers to stand up, then go for HNSWLib, Faiss, or LanceDB.
  • If you're looking for something that can run in-memory in browser-like environments, then go for MemoryVectorStore.
  • If you come from Python and you were looking for something similar to FAISS, try HNSWLib or Faiss.
  • If you're looking for an open-source, full-featured vector database that you can run locally in a Docker container, then go for Chroma.
  • If you're looking for an open-source vector database that offers low-latency, local embedding of documents and supports apps on the edge, then go for Zep.
  • If you're looking for an open-source, production-ready vector database that you can run locally (in a Docker container) or hosted in the cloud, then go for Weaviate.
  • If you're using Supabase already, then look at the Supabase vector store to use the same Postgres database for your embeddings too.
  • If you're looking for a production-ready vector store you don't have to worry about hosting yourself, then go for Pinecone.
  • If you are already using SingleStore, or if you need a distributed, high-performance database, you might want to consider the SingleStore vector store.
  • If you are looking for an online MPP (Massively Parallel Processing) data warehousing service, you might want to consider the AnalyticDB vector store.
  • If you're in search of a cost-effective vector database that lets you run vector search with SQL, look no further than MyScale.