Semi-structured RAG#
Let's evaluate your architecture on a small semi-structured Q&A dataset. The dataset is composed of QA pairs over PDFs that contain tables.
Prerequisites#
We will install quite a few prerequisites for this example, since we are comparing various techniques and models.
%pip install -U langchain langsmith langchainhub langchain_benchmarks langchain_experimental
%pip install --quiet chromadb openai huggingface pandas "unstructured[all-docs]"
For this code to work, configure the LangSmith environment variables with your credentials.
import os
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = "sk-..." # Your API key
# Silence warnings from HuggingFace
os.environ["TOKENIZERS_PARALLELISM"] = "false"
Review the Q&A Tasks#
The registry provides configurations to test out common architectures on curated datasets.
from langchain_benchmarks import clone_public_dataset, registry
registry = registry.filter(Type="RetrievalTask")
registry
Name | Type | Dataset ID | Description |
---|---|---|---|
LangChain Docs Q&A | RetrievalTask | 452ccafc-18e1-4314-885b-edd735f17b9d | Questions and answers based on a snapshot of the LangChain python docs. The environment provides the documents and the retriever information. Each example is composed of a question and reference answer. Success is measured based on the accuracy of the answer relative to the reference answer. We also measure the faithfulness of the model's response relative to the retrieved documents (if any). |
Semi-structured Reports | RetrievalTask | c47d9617-ab99-4d6e-a6e6-92b8daf85a7d | Questions and answers based on PDFs containing tables and charts. The task provides the raw documents as well as factory methods to easily index them and create a retriever. Each example is composed of a question and reference answer. Success is measured based on the accuracy of the answer relative to the reference answer. We also measure the faithfulness of the model's response relative to the retrieved documents (if any). |
task = registry["Semi-structured Reports"]
task
Name | Semi-structured Reports |
Type | RetrievalTask |
Dataset ID | c47d9617-ab99-4d6e-a6e6-92b8daf85a7d |
Description | Questions and answers based on PDFs containing tables and charts. The task provides the raw documents as well as factory methods to easily index them and create a retriever. Each example is composed of a question and reference answer. Success is measured based on the accuracy of the answer relative to the reference answer. We also measure the faithfulness of the model's response relative to the retrieved documents (if any). |
Retriever Factories | basic, parent-doc, hyde |
Architecture Factories | |
get_docs |
clone_public_dataset(task.dataset_id, dataset_name=task.name)
Dataset Semi-structured Reports already exists. Skipping.
You can access the dataset at https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/f8f24935-cf57-4cb3-a30f-8df303a46962.
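If you want to peek at the benchmark questions before building anything, you can list a few examples through the LangSmith client (a small optional sketch, assuming the API key configured above is valid):
from langsmith.client import Client

sample_client = Client()
# Print the question from the first few dataset examples
for example in list(sample_client.list_examples(dataset_name=task.name))[:3]:
    print(example.inputs["question"])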
Now, index the documents#
You can see the raw file paths, or use unstructured to process the PDFs.
from langchain_benchmarks.rag.tasks.semi_structured_reports import get_file_names
# If you want to completely customize the document processing, you can use the files directly
file_names = list(get_file_names())
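For instance, you can simply print the paths to see which reports the task ships with:
# Each entry is a local path to one of the benchmark PDFs
for file_name in file_names:
    print(file_name)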
from langchain.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(
model_name="thenlper/gte-base",
model_kwargs={"device": 0}, # Comment out to use CPU
)
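As a quick optional sanity check, you can embed a short string and inspect the vector size (gte-base should produce 768-dimensional embeddings):
# Embed a sample query to confirm the model loads and note the embedding dimension
sample_vector = embeddings.embed_query("quarterly operating expenses")
print(len(sample_vector))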
# Arguments to pass to partition_pdf
unstructured_config = {
# Unstructured first finds embedded image blocks
"extract_images_in_pdf": False,
# Use layout model (YOLOX) to get bounding boxes (for tables) and find titles
# Titles are any sub-section of the document
"infer_table_structure": True,
# Post processing to aggregate text once we have the title
"chunking_strategy": "by_title",
# Chunking params to aggregate text blocks
# Attempt to start a new chunk at 3800 chars (hard cap at 4000)
# Combine small sections so chunks stay above ~2000 chars
"max_characters": 4000,
"new_after_n_chars": 3800,
"combine_text_under_n_chars": 2000,
}
docs = list(task.get_docs(unstructured_config=unstructured_config))
Some weights of the model checkpoint at microsoft/table-transformer-structure-recognition were not used when initializing TableTransformerForObjectDetection: ['model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked']
- This IS expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
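Under the hood these arguments are forwarded to unstructured's partition_pdf. If you want to inspect the raw partitioning yourself, a rough sketch on a single file looks like this (illustrative only; the task's get_docs factory handles this for you and its internals may differ):
from unstructured.partition.pdf import partition_pdf

# Partition one benchmark PDF with the same settings and preview the first elements
elements = partition_pdf(filename=str(file_names[0]), **unstructured_config)
print(elements[:3])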
retriever_factory = task.retriever_factories["basic"]
# Indexes the documents with the specified embeddings
retriever = retriever_factory(embeddings, docs=docs)
Chroma/semi-structured-earnings-b_Chroma_HuggingFaceEmbeddings_raw
[]
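You can optionally spot-check the index before evaluating (the query string here is just an illustrative example):
# Retrieve the top documents for a sample question and preview the first hit
found_docs = retriever.get_relevant_documents("What were the operating expenses in Q3 2023?")
print(found_docs[0].page_content[:200])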
Time to evaluate#
We will compose our retriever with an LLM (Anthropic's claude-2 here) to form the question-answering chain.
from langchain.chat_models import ChatAnthropic
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable.passthrough import RunnableAssign
def create_chain(retriever):
prompt = ChatPromptTemplate.from_messages(
[
(
"system",
"Answer based solely on the retrieved documents below:\n\n<Documents>\n{docs}</Documents>",
),
("user", "{question}"),
]
)
llm = ChatAnthropic(model="claude-2")
return (
RunnableAssign({"docs": (lambda x: next(iter(x.values()))) | retriever})
| prompt
| llm
| StrOutputParser()
)
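Before kicking off the full benchmark run, it can help to smoke-test the chain on a single question (again, the question text is just an illustrative example):
# Build a chain over the basic retriever and answer one sample question
smoke_test_chain = create_chain(retriever)
print(smoke_test_chain.invoke({"question": "What were the operating expenses in Q3 2023?"}))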
from langsmith.client import Client
from langchain_benchmarks.rag import get_eval_config
client = Client()
RAG_EVALUATION = get_eval_config()
chain = create_chain(retriever)
test_run = client.run_on_dataset(
dataset_name=task.name,
llm_or_chain_factory=chain,
evaluation=RAG_EVALUATION,
verbose=True,
)
View the evaluation results for project 'cold-attachment-88' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/projects/p/d8e512b7-b63d-4eb5-8d73-d95f7fa7ffc2?eval=true
View all tests for Dataset Semi-structured Reports at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/f8f24935-cf57-4cb3-a30f-8df303a46962
[------------------------------------------------->] 5/5
Eval quantiles:
| | inputs.question | feedback.embedding_cosine_distance | feedback.faithfulness | feedback.score_string:accuracy | error | execution_time |
|---|---|---|---|---|---|---|
| count | 5 | 5.000000 | 5.0 | 5.0 | 0 | 5.000000 |
| unique | 5 | NaN | NaN | NaN | 0 | NaN |
| top | Analyzing the operating expenses for Q3 2023, ... | NaN | NaN | NaN | NaN | NaN |
| freq | 1 | NaN | NaN | NaN | NaN | NaN |
| mean | NaN | 0.137066 | 1.0 | 0.1 | NaN | 7.940625 |
| std | NaN | 0.011379 | 0.0 | 0.0 | NaN | 1.380190 |
| min | NaN | 0.123112 | 1.0 | 0.1 | NaN | 6.416387 |
| 25% | NaN | 0.129089 | 1.0 | 0.1 | NaN | 7.272528 |
| 50% | NaN | 0.137871 | 1.0 | 0.1 | NaN | 7.324673 |
| 75% | NaN | 0.143398 | 1.0 | 0.1 | NaN | 8.831243 |
| max | NaN | 0.151860 | 1.0 | 0.1 | NaN | 9.858293 |
Document processing example#
A RAG application is only as good as the information it is able to retrieve. Let's try indexing summaries of the tables to improve the chance that they are retrieved whenever a user asks a relevant question.
We will use unstructured's partition_pdf functionality and generate summaries with an LLM.
You can define your own indexing pipeline to see how it impacts downstream performance.
from operator import itemgetter
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.document import Document
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable.passthrough import RunnableAssign
# Prompt
prompt = ChatPromptTemplate.from_messages(
[
(
"system",
"You are summarizing semi-structured tables or text in a pdf.\n\n```document\n{doc}\n```",
),
("user", "Write a concise summary."),
]
)
# Summary chain
model = ChatOpenAI(temperature=0, model="gpt-3.5-turbo-16k")
def create_doc(x) -> Document:
return Document(
page_content=x["output"],
metadata=x["doc"].metadata,
)
summarize_chain = (
{"doc": lambda x: x}
| RunnableAssign({"prompt": prompt})
| {
"output": itemgetter("prompt") | model | StrOutputParser(),
"doc": itemgetter("doc"),
}
| create_doc
)
summaries = summarize_chain.batch(
[doc for doc in docs if doc.metadata["element_type"] == "table"]
)
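It can be worth eyeballing one of the generated summaries (and the table metadata it inherits) before indexing:
# Preview the first table summary produced by the summarization chain
print(summaries[0].metadata)
print(summaries[0].page_content[:300])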
Index the documents and create the retriever. We will again use the basic retriever factory, this time adding the table summaries to the index alongside the original documents.
# Indexes the documents with the specified embeddings
retriever_with_summaries = retriever_factory(
embeddings,
docs=docs + summaries,
# Specify a unique transformation name to avoid local cache collisions with other indices.
transformation_name="docs-with_summaries",
)
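If you want to see the effect of the summaries directly, you can compare what the two indices return for a table-oriented question (a rough sketch; the query text is only an example):
# Compare the top hits from the raw index and the summary-augmented index
sample_query = "Analyzing the operating expenses for Q3 2023"
for index_name, index_retriever in [("raw docs", retriever), ("docs + summaries", retriever_with_summaries)]:
    top_hits = index_retriever.get_relevant_documents(sample_query)[:2]
    print(index_name, "->", [d.page_content[:80] for d in top_hits])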
Evaluate#
We will evaluate the new chain on the same dataset.
chain_2 = create_chain(retriever_with_summaries)
test_run_with_summaries = client.run_on_dataset(
dataset_name=task.name,
llm_or_chain_factory=chain_2,
evaluation=RAG_EVALUATION,
verbose=True,
)
View the evaluation results for project 'crazy-harmony-39' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/projects/p/b69d796f-6ba4-4cde-822f-db363cf81f8f?eval=true
View all tests for Dataset Semi-structured Reports at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/f8f24935-cf57-4cb3-a30f-8df303a46962
[------------------------------------------------->] 5/5
Eval quantiles:
| | inputs.question | feedback.score_string:accuracy | feedback.faithfulness | feedback.embedding_cosine_distance | error | execution_time |
|---|---|---|---|---|---|---|
| count | 5 | 5.000000 | 5.0 | 5.000000 | 0 | 5.000000 |
| unique | 5 | NaN | NaN | NaN | 0 | NaN |
| top | Analyzing the operating expenses for Q3 2023, ... | NaN | NaN | NaN | NaN | NaN |
| freq | 1 | NaN | NaN | NaN | NaN | NaN |
| mean | NaN | 0.720000 | 1.0 | 0.069363 | NaN | 8.659120 |
| std | NaN | 0.408656 | 0.0 | 0.023270 | NaN | 2.611724 |
| min | NaN | 0.100000 | 1.0 | 0.039593 | NaN | 6.283505 |
| 25% | NaN | 0.500000 | 1.0 | 0.050176 | NaN | 6.723136 |
| 50% | NaN | 1.000000 | 1.0 | 0.078912 | NaN | 7.441743 |
| 75% | NaN | 1.000000 | 1.0 | 0.084389 | NaN | 10.673265 |
| max | NaN | 1.000000 | 1.0 | 0.093747 | NaN | 12.173952 |