LangChain Docs Q&A#

Let's evaluate your architecture on a Q&A dataset over the LangChain Python documentation. For more examples of how to test different embeddings, indexing strategies, and architectures, see the Evaluating RAG Architectures on Benchmark Tasks notebook.

Prerequisites#

Since we're comparing many techniques and models, we'll install quite a few prerequisites for this example.

We'll use LangSmith to capture the evaluation traces. You can make a free account at smith.langchain.com. Once you've done so, you can create an API key and set it below.

%pip install -U --quiet langchain langsmith langchainhub langchain_benchmarks
%pip install --quiet chromadb openai huggingface pandas langchain_experimental sentence_transformers pyarrow anthropic tiktoken

For this code to work, please configure the LangSmith environment variables with your credentials.

import os

os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = "ls_..."  # Your API key
# Update these with your own API keys
os.environ["ANTHROPIC_API_KEY"] = "sk-..."
os.environ["OPENAI_API_KEY"] = "sk-..."
# Silence warnings from HuggingFace
os.environ["TOKENIZERS_PARALLELISM"] = "false"
import uuid

# Generate a unique run ID for this experiment
run_uid = uuid.uuid4().hex[:6]

Review the Q&A task#

The registry provides configurations for testing common architectures on curated datasets. You can view the retrieval tasks by filtering on the Type.

from langchain_benchmarks import clone_public_dataset, registry
registry = registry.filter(Type="RetrievalTask")
registry
| Name | Type | Dataset ID | Description |
| --- | --- | --- | --- |
| LangChain Docs Q&A | RetrievalTask | 452ccafc-18e1-4314-885b-edd735f17b9d | Questions and answers based on a snapshot of the LangChain Python docs. The environment provides the documents and the retriever information. Each example is composed of a question and a reference answer. Success is measured based on the accuracy of the answer relative to the reference answer. We also measure the faithfulness of the model's response relative to the retrieved documents (if any). |
| Semi-structured Reports | RetrievalTask | c47d9617-ab99-4d6e-a6e6-92b8daf85a7d | Questions and answers based on PDFs containing tables and charts. The task provides the raw documents as well as factory methods to easily index them and create a retriever. Each example is composed of a question and a reference answer. Success is measured based on the accuracy of the answer relative to the reference answer. We also measure the faithfulness of the model's response relative to the retrieved documents (if any). |
langchain_docs = registry["LangChain Docs Q&A"]
langchain_docs
| Name | LangChain Docs Q&A |
| --- | --- |
| Type | RetrievalTask |
| Dataset ID | 452ccafc-18e1-4314-885b-edd735f17b9d |
| Description | Questions and answers based on a snapshot of the LangChain Python docs. The environment provides the documents and the retriever information. Each example is composed of a question and a reference answer. Success is measured based on the accuracy of the answer relative to the reference answer. We also measure the faithfulness of the model's response relative to the retrieved documents (if any). |
| Retriever Factories | basic, parent-doc, hyde |
| Architecture Factories | conversational-retrieval-qa |
| get_docs | |

Clone the dataset#

Once you've selected the LangChain Docs Q&A task, clone the dataset to your LangSmith tenant. This step requires that LANGCHAIN_API_KEY be set above.

clone_public_dataset(langchain_docs.dataset_id, dataset_name=langchain_docs.name)
Dataset LangChain Docs Q&A already exists. Skipping.
You can access the dataset at https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/3f29798f-5939-4643-bd99-008ca66b72ed.

Create the index#

When creating a retrieval Q&A system, the first step is to prepare the retriever. How you construct the index significantly impacts your system's performance. Before trying anything too tricky, it's good to benchmark a reliable baseline.

In this case, our baseline will be to generate a single vector for each raw source document and store them directly in a vector store.

Below, fetch the source docs from a cache in GCS. The cache was created using an ingestion script that scraped the LangChain documentation. To save time and to ensure the dataset answers are still correct, we will use these source docs for all benchmark approaches.

docs = list(langchain_docs.get_docs())
print(repr(docs[0])[:100] + "...")
Document(page_content="LangChain cookbook | 🦜️🔗 Langchain\n\n[Skip to main content](#docusaurus_skip...

Now we'll populate our vector store. We will use LangChain's indexing API to cache the embeddings.

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores.chroma import Chroma

embeddings = HuggingFaceEmbeddings(
    model_name="thenlper/gte-base",
    # model_kwargs={"device": 0},  # Uncomment to use GPU
)

vectorstore = Chroma(
    collection_name="lcbm-b-huggingface-gte-base",
    embedding_function=embeddings,
    persist_directory="./chromadb",
)

vectorstore.add_documents(docs)
retriever = vectorstore.as_retriever(search_kwargs={"k": 6})
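
The cell above re-embeds every document each time it runs. Caching the embeddings, as mentioned above, avoids that on re-runs; here is a minimal sketch using LangChain's CacheBackedEmbeddings (the ./embedding_cache path and the namespace are arbitrary choices for illustration, not part of the benchmark code):

# Optional sketch: cache embeddings so re-runs don't re-embed unchanged documents.
from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import LocalFileStore

store = LocalFileStore("./embedding_cache/")
cached_embeddings = CacheBackedEmbeddings.from_bytes_store(
    embeddings,  # the HuggingFaceEmbeddings instance defined above
    store,
    namespace="gte-base",  # key the cache by model name to avoid collisions
)
# Pass cached_embeddings as embedding_function when constructing the Chroma store.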

Define the response generator#

Our RAG system is halfway complete. We've created the Retriever. Now it's time to create the response Generator.

from operator import itemgetter
from typing import Sequence

from langchain.chat_models import ChatAnthropic
from langchain.prompts import ChatPromptTemplate
from langchain.schema.document import Document
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable.passthrough import RunnableAssign


# After the retriever fetches documents, this
# function formats them in a string to present for the LLM
def format_docs(docs: Sequence[Document]) -> str:
    formatted_docs = []
    for i, doc in enumerate(docs):
        doc_string = (
            f"<document index='{i}'>\n"
            f"<source>{doc.metadata.get('source')}</source>\n"
            f"<doc_content>{doc.page_content}</doc_content>\n"
            "</document>"
        )
        formatted_docs.append(doc_string)
    formatted_str = "\n".join(formatted_docs)
    return f"<documents>\n{formatted_str}\n</documents>"


prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an AI assistant answering questions about LangChain."
            "\n{context}\n"
            "Respond solely based on the document content.",
        ),
        ("human", "{question}"),
    ]
)
llm = ChatAnthropic(model="claude-2.1", temperature=1)

response_generator = (prompt | llm | StrOutputParser()).with_config(
    run_name="GenerateResponse",
)
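
Before wiring in retrieval, you can sanity-check the generator on its own by passing it a hand-written context string (the document content below is a made-up placeholder, not from the dataset):

response_generator.invoke(
    {
        "context": "<documents>\n<document index='0'>LangChain is a framework for building LLM applications.</document>\n</documents>",
        "question": "What is LangChain?",
    }
)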

# This is the final response chain.
# It fetches the "question" key from the input dict,
# passes it to the retriever, then formats as a string.

chain = (
    RunnableAssign(
        {
            "context": (itemgetter("question") | retriever | format_docs).with_config(
                run_name="FormatDocs"
            )
        }
    )
    # The "RunnableAssign" above returns a dict with keys
    # question (from the original input) and
    # context: the string-formatted docs.
    # This is passed to the response_generator above
    | response_generator
)
chain.invoke({"question": "What's expression language?"})
' The LangChain Expression Language (LCEL) is a declarative way to easily compose chains of different components like prompts, models, parsers, etc. \n\nSome key things it provides:\n\n- Streaming support - Ability to get incremental outputs from chains rather than waiting for full completion. Useful for long-running chains.\n\n- Async support - Chains can be called synchronously (like in a notebook) or asynchronously (like in production). Same code works for both.\n\n- Optimized parallel execution - Steps that can run in parallel (like multiple retrievals) are automatically parallelized to minimize latency.\n\n- Retries and fallbacks - Help make chains more robust to failure.\n\n- Access to intermediate results - Useful for debugging or showing work-in-progress.\n\n- Input and output validation via schemas - Enables catching issues early.\n\n- Tracing - Automatic structured logging of all chain steps for observability.\n\n- Seamless deployment - LCEL chains can be easily deployed with LangServe.\n\nThe key idea is it makes it very easy to take a prototype LLM application made with components like prompts and models and turn it into a robust, scalable production application without changing any code.'
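
Because the chain is composed with LCEL, the same object supports streaming as well; a quick sketch with an arbitrary question:

# Stream the response incrementally instead of waiting for the full answer.
for chunk in chain.stream({"question": "How do I add memory to a chain?"}):
    print(chunk, end="", flush=True)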

Evaluate#

Let's now evaluate your RAG architecture on the dataset.

from langsmith.client import Client

from langchain_benchmarks.rag import get_eval_config
client = Client()
RAG_EVALUATION = get_eval_config()

test_run = client.run_on_dataset(
    dataset_name=langchain_docs.name,
    llm_or_chain_factory=chain,
    evaluation=RAG_EVALUATION,
    project_name=f"claude-2 qa-chain simple-index {run_uid}",
    project_metadata={
        "index_method": "basic",
    },
    verbose=True,
)
View the evaluation results for project 'claude-2 qa-chain simple-index 1bdbe5' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/projects/p/3fe31959-95e8-4413-aa09-620bd49bd0d3?eval=true

View all tests for Dataset LangChain Docs Q&A at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/3f29798f-5939-4643-bd99-008ca66b72ed
[------------------------------------------------->] 86/86
 Eval quantiles:
                                 0.25        0.5       0.75       mean       mode
 embedding_cosine_distance   0.088025   0.115760   0.159969   0.129161   0.048622
 score_string:accuracy       0.500000   0.700000   1.000000   0.645349   0.700000
 faithfulness                0.700000   1.000000   1.000000   0.812791   1.000000
 execution_time             27.098772  27.098772  27.098772  27.098772  27.098772
test_run.get_aggregate_feedback()
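
get_aggregate_feedback() displays the aggregate feedback statistics printed above. To inspect individual examples instead, the test result can typically be converted to a dataframe; a sketch, assuming the result object exposes to_dataframe() (as the TestResult class in langchain.smith does):

# Sketch: view per-example inputs, outputs, and feedback scores.
df = test_run.to_dataframe()
df.head()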