Question Answering over Documents with Milvus and LangChain

This guide demonstrates how to build an LLM-driven question-answering application with Milvus and LangChain.

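Before you begin

The code snippets on this page require pymilvus and langchain. Embedding the documents into the vector store also uses OpenAI's embedding API, so the openai and tiktoken libraries are needed as well. If these libraries are not installed on your machine, run the following command to install them.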

! python -m pip install --upgrade pymilvus langchain openai tiktoken

Global parameters

In this section, set the parameters that the following code snippets use.

from os import environ
 
MILVUS_HOST = "localhost"
MILVUS_PORT = "19530"
OPENAI_API_KEY = "sk-******" # example: "sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
 
# Set up environment variables
environ["OPENAI_API_KEY"] = OPENAI_API_KEY
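As an alternative to writing the key into the script, the key can be read from an environment variable that is already set in the shell. The following is a minimal sketch under that assumption; the helper name resolve_api_key is hypothetical and not part of LangChain or the OpenAI library.

```python
import os


def resolve_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key stored under `name`, or raise if it is missing or blank."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {name} environment variable before running the snippets"
        )
    return key


# Demonstration with an explicit mapping, so the sketch runs without a real key.
demo_env = {"OPENAI_API_KEY": "sk-demo-not-a-real-key"}
print(resolve_api_key(demo_env))
```

Calling resolve_api_key() with no arguments reads from the real process environment, which keeps the secret out of source control.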

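Prepare data

Before diving in, complete the following steps:

Prepare the documents that the LLM will consult when it generates an answer.
Set up an embedding model to convert the documents into vector embeddings.
Set up a vector store to hold the vector embeddings.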

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Milvus
from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import CharacterTextSplitter
 
# Use the WebBaseLoader to load specified web pages into documents
loader = WebBaseLoader([
    "https://milvus.io/docs/overview.md",
])
 
docs = loader.load()
 
# Split the documents into smaller chunks
text_splitter = CharacterTextSplitter(chunk_size=1024, chunk_overlap=0)
docs = text_splitter.split_documents(docs)

The text splitter prints output similar to the following:

Created a chunk of size 1745, which is longer than the specified 1024
Created a chunk of size 1278, which is longer than the specified 1024
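Messages like these appear when a single piece of text between two separators is already longer than chunk_size, so the splitter keeps it whole instead of cutting it mid-paragraph. The sketch below is a simplified illustration of that behavior in plain Python. It is not LangChain's implementation, and the function name split_by_separator and the "\n\n" separator are assumptions made for the example.

```python
def split_by_separator(text, chunk_size, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of at most
    chunk_size characters.

    A single piece that is already longer than chunk_size is kept whole,
    which is why oversized chunks can appear in the result.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        # The candidate is too long: close the current chunk and start a new one.
        if current:
            chunks.append(current)
        current = piece
    if current:
        chunks.append(current)
    return chunks


text = "a" * 30 + "\n\n" + "b" * 5 + "\n\n" + "c" * 4
for chunk in split_by_separator(text, chunk_size=12):
    # Print each chunk length and whether it exceeds the requested size.
    print(len(chunk), len(chunk) > 12)
```

If oversized chunks are a problem for the embedding model, a smaller separator or a splitter that falls back to finer separators is the usual remedy.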

Once the documents are ready, convert them into vector embeddings and save them to the vector store.

# Set up an embedding model to convert document chunks into vector embeddings.
embeddings = OpenAIEmbeddings(model="ada")
 
# Set up a vector store used to save the vector embeddings. Here we use Milvus as the vector store.
vector_store = Milvus.from_documents(
    docs,
    embedding=embeddings,
    connection_args={"host": MILVUS_HOST, "port": MILVUS_PORT}
)

You can try a text similarity search with the following snippet. The results are the document chunks most relevant to the query.

query = "What is milvus?"
docs = vector_store.similarity_search(query)
 
print(docs)

The output should be similar to the following:

[Document(page_content='Milvus workflow.', metadata={'source': 'https://milvus.io/docs/overview.md', 'title': 'Introduction Milvus documentation', 'description': 'Milvus is an open-source vector database designed specifically for AI application development, embeddings similarity search, and MLOps v2.2.x.', 'language': 'en'}), Document(page_content="Installat...rved.", metadata={'source': 'https://milvus.io/docs/overview.md', 'title': 'Introduction Milvus documentation', 'description': 'Milvus is an open-source vector database designed specifically for AI application development, embeddings similarity search, and MLOps v2.2.x.', 'language': 'en'}), Document(page_content='Introduction ... Milvus is able to analyze the correlation between two vectors by calculating their similarity distance. If the two embedding vectors are very similar, it means that the original data sources are similar as well.', metadata={'source': 'https://milvus.io/docs/overview.md', 'title': 'Introduction Milvus documentation', 'description': 'Milvus is an open-source vector database designed specifically for AI application development, embeddings similarity search, and MLOps v2.2.x.', 'language': 'en'}), Document(page_content="Key concepts...search algorithms are used to accelerate the searching process. If the two embedding vectors are very similar, it means that the original data sources are similar as well.Why Milvus?", metadata={'source': 'https://milvus.io/docs/overview.md', 'title': 'Introduction Milvus documentation', 'description': 'Milvus is an open-source vector database designed specifically for AI application development, embeddings similarity search, and MLOps v2.2.x.', 'language': 'en'})]
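To make the idea behind similarity search concrete, the sketch below ranks a few stored texts by cosine similarity to a query vector using brute force. It only illustrates the concept: Milvus uses approximate nearest neighbor indexes, and the metric it applies depends on how the collection is configured. The three-dimensional vectors and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k=2):
    """Return the k (text, score) pairs most similar to query_vector."""
    scored = [(text, cosine_similarity(query_vector, vec)) for text, vec in stored]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical 3-dimensional embeddings; real embeddings have far more dimensions.
stored = [
    ("Milvus is a vector database.", [0.9, 0.1, 0.0]),
    ("Installation instructions.", [0.1, 0.9, 0.1]),
    ("Similarity search with embeddings.", [0.7, 0.2, 0.3]),
]
for text, score in top_k([1.0, 0.0, 0.1], stored):
    print(f"{score:.3f}  {text}")
```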

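Ask a question

With the documents ready, you can set up a chain that includes them in the prompt, so that the LLM uses them as a reference when preparing its answer.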

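Note that LangChain provides four chain types for question answering with sources: stuff, map_reduce, refine, and map_rerank. In short, the stuff chain takes all of the documents as input in a single prompt, so it suits only small documents. Because most LLMs limit the number of tokens a prompt can contain, one of the other three chain types is recommended for larger inputs. These chain types split the input documents into smaller pieces and feed them to the LLM in different ways. For details, see the index-related chain types section of the LangChain documentation.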
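The difference between stuff and map_reduce can be sketched without any LLM at all. In the example below, fake_llm is a stand-in that only reports how much text it received, and stuff_chain and map_reduce_chain are hypothetical helpers written for illustration. They are not LangChain functions, and the real chains use more elaborate prompts.

```python
def fake_llm(prompt):
    """Stand-in for a real LLM call: reports how many characters it was given."""
    return f"<answer from {len(prompt)} chars>"


def stuff_chain(chunks, question, llm):
    # One call: every chunk is placed in a single prompt.
    prompt = "\n\n".join(chunks) + "\n\nQuestion: " + question
    return llm(prompt)


def map_reduce_chain(chunks, question, llm):
    # Map: one call per chunk. Reduce: one more call over the partial answers.
    partial = [llm(chunk + "\n\nQuestion: " + question) for chunk in chunks]
    final = llm("\n\n".join(partial) + "\n\nQuestion: " + question)
    return {"intermediate_steps": partial, "output_text": final}


chunks = ["chunk one", "chunk two", "chunk three"]
print(stuff_chain(chunks, "What is Milvus?", fake_llm))
print(map_reduce_chain(chunks, "What is Milvus?", fake_llm))
```

With stuff, the prompt grows with the total size of the documents. With map_reduce, each prompt stays roughly the size of one chunk, at the cost of one extra LLM call per chunk.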

The following snippet sets up a chain that uses OpenAI as the LLM and the map_reduce chain type.

from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain.llms import OpenAI
 
chain = load_qa_with_sources_chain(OpenAI(temperature=0), chain_type="map_reduce", return_intermediate_steps=True)
query = "What is Milvus?"
chain({"input_documents": docs, "question": query}, return_only_outputs=True)

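The result includes intermediate_steps and output_text. The former shows the intermediate output produced for each document consulted along the way, and the latter is the final answer to the question.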

{'intermediate_steps': [' No relevant text.',
  ' What is Milvus vector database?',
  'What is Milvus? Milvus was created in 2019 with a singular goal: store, index, and manage massive embedding vectors generated by deep neural networks and other machine learning (ML) models. As a database specifically designed to handle queries over input vectors, it is capable of indexing vectors on a trillion scale. Unlike existing relational databases which mainly deal with structured data following a pre-defined pattern, Milvus is designed from the bottom-up to handle embedding vectors converted from unstructured data.',
  ' Milvus is a vector database and similarity search platform that enables users to quickly and accurately search for semantically similar vectors in an unstructured data repository. It uses modern embedding techniques to convert unstructured data to embedding vectors, and approximate nearest neighbor (ANN) search algorithms to accelerate the searching process.'],
 'output_text': ' Milvus is a vector database and similarity search platform that enables users to quickly and accurately search for semantically similar vectors in an unstructured data repository. It uses modern embedding techniques to convert unstructured data to embedding vectors, and approximate nearest neighbor (ANN) search algorithms to accelerate the searching process.SOURCES: https://milvus.io/docs/overview.md'}
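In the sample above, output_text holds the answer followed by a SOURCES: marker and a URL. If the answer and the sources are needed separately, they can be split on that marker. The sketch below assumes the marker text is exactly "SOURCES:" and that multiple sources, if any, are comma-separated. The sample shows only one source, so the comma handling is an assumption, and the helper name is hypothetical.

```python
def split_answer_and_sources(output_text, marker="SOURCES:"):
    """Split text of the form '<answer>SOURCES: <a>, <b>' into (answer, [sources])."""
    answer, found, tail = output_text.partition(marker)
    # Assumption: multiple sources are separated by commas.
    sources = [s.strip() for s in tail.split(",") if s.strip()] if found else []
    return answer.strip(), sources


output_text = " Milvus is a vector database.SOURCES: https://milvus.io/docs/overview.md"
answer, sources = split_answer_and_sources(output_text)
print(answer)
print(sources)
```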
