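Semi-structured RAG

Let's evaluate your architecture on a small semi-structured Q&A dataset. The dataset consists of question-answer pairs over PDFs that contain tables.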

Prerequisites

This example installs a fairly large set of prerequisites, because we are comparing several techniques and models.

%pip install -U langchain langsmith langchainhub  langchain_benchmarks langchain_experimental
%pip install --quiet chromadb openai huggingface pandas "unstructured[all-docs]"

For this code to run correctly, configure the LangSmith environment variables with your credentials.

import os

os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = "sk-..."  # Your API key

# Silence warnings from HuggingFace
os.environ["TOKENIZERS_PARALLELISM"] = "false"
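Before running the rest of the notebook, it can help to confirm that the variables configured above are actually set. The following is a minimal sketch under that assumption; the helper name `missing_env_vars` is hypothetical and is not part of LangSmith or LangChain.

```python
import os
from typing import Iterable, List, Mapping, Optional


def missing_env_vars(
    names: Iterable[str], env: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# The two LangSmith variables configured above.
REQUIRED = ["LANGCHAIN_ENDPOINT", "LANGCHAIN_API_KEY"]

missing = missing_env_vars(REQUIRED)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Passing an explicit mapping as `env` makes the check easy to exercise without touching the real process environment.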

Review Q&A tasks

The registry provides configurations for testing common architectures on curated datasets.

from langchain_benchmarks import clone_public_dataset, registry
registry = registry.filter(Type="RetrievalTask")
registry
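| Name | Type | Dataset ID | Description |
|---|---|---|---|
| LangChain Docs Q&A | RetrievalTask | 452ccafc-18e1-4314-885b-edd735f17b9d | Questions and answers based on a snapshot of the LangChain Python docs. The environment provides the documents and the retriever information. Each example consists of a question and a reference answer. Success is measured by the accuracy of the answer relative to the reference answer. We also measure the faithfulness of the model's response relative to the retrieved documents (if any). |
| Semi-structured Reports | RetrievalTask | c47d9617-ab99-4d6e-a6e6-92b8daf85a7d | Questions and answers based on PDFs containing tables and charts. The task provides the raw documents as well as factory methods to easily index them and create a retriever. Each example consists of a question and a reference answer. Success is measured by the accuracy of the answer relative to the reference answer. We also measure the faithfulness of the model's response relative to the retrieved documents (if any). |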
task = registry["Semi-structured Reports"]
task
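Name: Semi-structured Reports
Type: RetrievalTask
Dataset ID: c47d9617-ab99-4d6e-a6e6-92b8daf85a7d
Description: Questions and answers based on PDFs containing tables and charts. The task provides the raw documents as well as factory methods to easily index them and create a retriever. Each example consists of a question and a reference answer. Success is measured by the accuracy of the answer relative to the reference answer. We also measure the faithfulness of the model's response relative to the retrieved documents (if any).
Retriever Factories: basic, parent-doc, hyde
Architecture Factories:
get_docs: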
clone_public_dataset(task.dataset_id, dataset_name=task.name)
Dataset Semi-structured Reports already exists. Skipping.
You can access the dataset at https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/f8f24935-cf57-4cb3-a30f-8df303a46962.

Now, index the documents

You can inspect the raw file paths directly, or use unstructured to process the PDFs.

from langchain_benchmarks.rag.tasks.semi_structured_reports import get_file_names

# If you want to completely customize the document processing, you can use the files directly
file_names = list(get_file_names())
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="thenlper/gte-base",
    model_kwargs={"device": 0},  # Comment out to use CPU
)

# Arguments to pass to partition_pdf
unstructured_config = {
    # Unstructured first finds embedded image blocks
    "extract_images_in_pdf": False,
    # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles
    # Titles are any sub-section of the document
    "infer_table_structure": True,
    # Post processing to aggregate text once we have the title
    "chunking_strategy": "by_title",
    # Chunking params to aggregate text blocks
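    # max_characters is the hard cap on chunk size (4000 chars)
    # Attempt to start a new chunk once the current one reaches 3800 chars
    # Attempt to combine small sections so chunks stay above 2000 chars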
    "max_characters": 4000,
    "new_after_n_chars": 3800,
    "combine_text_under_n_chars": 2000,
}
docs = list(task.get_docs(unstructured_config=unstructured_config))
Some weights of the model checkpoint at microsoft/table-transformer-structure-recognition were not used when initializing TableTransformerForObjectDetection: ['model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked']
- This IS expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
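To make the interaction between `max_characters`, `new_after_n_chars`, and `combine_text_under_n_chars` more concrete, here is a toy chunker over `(is_title, text)` pairs. It is a simplified illustration under stated assumptions, not unstructured's actual `by_title` algorithm: the function name `toy_chunk_by_title` and the tuple input format are hypothetical, and elements longer than the hard limit are kept whole rather than split.

```python
from typing import List, Tuple


def toy_chunk_by_title(
    elements: List[Tuple[bool, str]],
    max_characters: int = 4000,
    new_after_n_chars: int = 3800,
    combine_text_under_n_chars: int = 2000,
) -> List[str]:
    """Toy title-based chunker over (is_title, text) pairs.

    Simplification: an element longer than max_characters is kept whole
    rather than split.
    """
    chunks: List[str] = []
    current = ""
    for is_title, text in elements:
        candidate = f"{current}\n{text}" if current else text
        # A title opens a new chunk unless the current chunk is still small.
        new_section = is_title and len(current) >= combine_text_under_n_chars
        # Hard limit: appending this element would exceed the maximum size.
        over_hard_limit = len(candidate) > max_characters
        # Soft limit: the current chunk is already "full enough".
        over_soft_limit = len(current) >= new_after_n_chars
        if current and (new_section or over_hard_limit or over_soft_limit):
            chunks.append(current)
            current = text
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Small limits so the behavior is visible on tiny inputs.
elements = [
    (True, "Intro"),
    (False, "a" * 10),
    (True, "Results"),
    (False, "b" * 30),
    (True, "Next"),
    (False, "c" * 5),
]
chunks = toy_chunk_by_title(
    elements, max_characters=60, new_after_n_chars=50, combine_text_under_n_chars=20
)
print([len(c) for c in chunks])
```

In this toy run the short "Intro" section is merged with "Results" because it is below the combine threshold, while "Next" starts a fresh chunk because the preceding chunk is already large enough.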
retriever_factory = task.retriever_factories["basic"]
# Indexes the documents with the specified embeddings
retriever = retriever_factory(embeddings, docs=docs)
Chroma/semi-structured-earnings-b_Chroma_HuggingFaceEmbeddings_raw
[]

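Time to evaluate

We will combine our retriever with a simple LLM chain; the code below uses Anthropic's claude-2 model through ChatAnthropic.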

from langchain.chat_models import ChatAnthropic
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable.passthrough import RunnableAssign


def create_chain(retriever):
    prompt = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                "Answer based solely on the retrieved documents below:\n\n<Documents>\n{docs}</Documents>",
            ),
            ("user", "{question}"),
        ]
    )
    llm = ChatAnthropic(model="claude-2")
    return (
        RunnableAssign({"docs": (lambda x: next(iter(x.values()))) | retriever})
        | prompt
        | llm
        | StrOutputParser()
    )
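The first step of the chain above takes the first value of the input dictionary as the query, retrieves documents for it, and stores them under a new `docs` key next to the original inputs. The following plain-Python sketch mirrors that data flow without LangChain; `first_value`, `assign_docs`, and `fake_retrieve` are hypothetical names used only for illustration, and the retriever is a canned stand-in.

```python
from typing import Callable, Dict, List


def first_value(inputs: Dict[str, str]) -> str:
    """Mirror `next(iter(x.values()))`: take the first value, whatever its key."""
    return next(iter(inputs.values()))


def assign_docs(
    inputs: Dict[str, str], retrieve: Callable[[str], List[str]]
) -> Dict[str, object]:
    """Keep the original keys and add a "docs" key holding the retrieved texts."""
    return {**inputs, "docs": retrieve(first_value(inputs))}


def fake_retrieve(query: str) -> List[str]:
    """Hypothetical stand-in for a real retriever; returns canned text."""
    return [f"(placeholder passage retrieved for: {query})"]


state = assign_docs({"question": "What were operating expenses?"}, fake_retrieve)
print(state)
```

Taking the first value rather than indexing by name means the chain does not depend on what the dataset calls its input key, at the cost of assuming that the query is the first (or only) input field.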
from langsmith.client import Client

from langchain_benchmarks.rag import get_eval_config

client = Client()
RAG_EVALUATION = get_eval_config()
chain = create_chain(retriever)
test_run = client.run_on_dataset(
    dataset_name=task.name,
    llm_or_chain_factory=chain,
    evaluation=RAG_EVALUATION,
    verbose=True,
)
View the evaluation results for project 'cold-attachment-88' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/projects/p/d8e512b7-b63d-4eb5-8d73-d95f7fa7ffc2?eval=true

View all tests for Dataset Semi-structured Reports at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/f8f24935-cf57-4cb3-a30f-8df303a46962
[------------------------------------------------->] 5/5
 Eval quantiles:
                                          inputs.question  \
count                                                   5   
unique                                                  5   
top     Analyzing the operating expenses for Q3 2023, ...   
freq                                                    1   
mean                                                  NaN   
std                                                   NaN   
min                                                   NaN   
25%                                                   NaN   
50%                                                   NaN   
75%                                                   NaN   
max                                                   NaN   

        feedback.embedding_cosine_distance  feedback.faithfulness  \
count                             5.000000                    5.0   
unique                                 NaN                    NaN   
top                                    NaN                    NaN   
freq                                   NaN                    NaN   
mean                              0.137066                    1.0   
std                               0.011379                    0.0   
min                               0.123112                    1.0   
25%                               0.129089                    1.0   
50%                               0.137871                    1.0   
75%                               0.143398                    1.0   
max                               0.151860                    1.0   

        feedback.score_string:accuracy error  execution_time  
count                              5.0     0        5.000000  
unique                             NaN     0             NaN  
top                                NaN   NaN             NaN  
freq                               NaN   NaN             NaN  
mean                               0.1   NaN        7.940625  
std                                0.0   NaN        1.380190  
min                                0.1   NaN        6.416387  
25%                                0.1   NaN        7.272528  
50%                                0.1   NaN        7.324673  
75%                                0.1   NaN        8.831243  
max                                0.1   NaN        9.858293  
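The `feedback.embedding_cosine_distance` column above compares the embedding of each predicted answer with the embedding of its reference answer, where lower is closer. Assuming the usual definition (one minus cosine similarity), the metric can be sketched in plain Python as follows; this is an illustration of the formula, not the evaluator's own code.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Vectors pointing the same way have distance 0; orthogonal vectors have distance 1.
print(cosine_distance([1.0, 0.0], [2.0, 0.0]))
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
```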

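Document processing example

A RAG application is only as good as the information it can retrieve. Let's try indexing summaries of the tables to improve the chance that they are retrieved when a user asks a relevant question.

We will use unstructured's partition_pdf functionality and generate the summaries with an LLM.

You can define your own indexing pipeline to see how it affects downstream performance.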

from operator import itemgetter

from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.document import Document
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable.passthrough import RunnableAssign

# Prompt
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are summarizing semi-structured tables or text in a pdf.\n\n```document\n{doc}\n```",
        ),
        ("user", "Write a concise summary."),
    ]
)

# Summary chain
model = ChatOpenAI(temperature=0, model="gpt-3.5-turbo-16k")


def create_doc(x) -> Document:
    return Document(
        page_content=x["output"],
        metadata=x["doc"].metadata,
    )


summarize_chain = (
    {"doc": lambda x: x}
    | RunnableAssign({"prompt": prompt})
    | {
        "output": itemgetter("prompt") | model | StrOutputParser(),
        "doc": itemgetter("doc"),
    }
    | create_doc
)
summaries = summarize_chain.batch(
    [doc for doc in docs if doc.metadata["element_type"] == "table"]
)
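The summarization step above keeps only the table elements, replaces their content with a generated summary, and carries the original metadata over to the new document. A plain-Python analogue of that logic is sketched below; the `Doc` dataclass, `summarize_tables`, and `fake_summarize` are hypothetical stand-ins (the real chain uses LangChain's `Document` and an LLM call), and the sample data is made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Doc:
    """Hypothetical minimal stand-in for a document with text and metadata."""

    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def summarize_tables(docs: List[Doc], summarize: Callable[[str], str]) -> List[Doc]:
    """Summarize table elements only, preserving each table's metadata."""
    return [
        Doc(page_content=summarize(d.page_content), metadata=d.metadata)
        for d in docs
        if d.metadata.get("element_type") == "table"
    ]


def fake_summarize(text: str) -> str:
    """Placeholder for the LLM call; simply truncates the text."""
    return text[:20]


docs_example = [
    Doc("col_a | col_b\n1 | 2", {"element_type": "table", "source": "a.pdf"}),
    Doc("Narrative paragraph.", {"element_type": "text", "source": "a.pdf"}),
]
summaries_example = summarize_tables(docs_example, fake_summarize)
print(summaries_example)
```

Unlike the notebook code, this sketch uses `metadata.get(...)` so that elements without an `element_type` key are skipped instead of raising a `KeyError`.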

Index the documents and create the retriever. We will again use the same retriever factory as before.

# Indexes the documents with the specified embeddings
retriever_with_summaries = retriever_factory(
    embeddings,
    docs=docs + summaries,
    # Specify a unique transformation name to avoid local cache collisions with other indices.
    transformation_name="docs-with_summaries",
)
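The comment above notes that a unique `transformation_name` avoids local cache collisions between indices. One way to picture this is a cache directory whose name is derived from the dataset, vector store, embedding class, and transformation name. The sketch below is purely illustrative: `index_directory` and its naming scheme are assumptions loosely modeled on the directory name printed earlier in this page, and the library's actual layout may differ.

```python
from pathlib import PurePosixPath


def index_directory(
    base: str,
    dataset: str,
    vectorstore: str,
    embeddings: str,
    transformation_name: str = "raw",
) -> PurePosixPath:
    """Build a hypothetical cache path that differs per transformation name."""
    return PurePosixPath(base) / "_".join(
        [dataset, vectorstore, embeddings, transformation_name]
    )


raw_dir = index_directory("Chroma", "demo", "Chroma", "HuggingFaceEmbeddings")
summary_dir = index_directory(
    "Chroma", "demo", "Chroma", "HuggingFaceEmbeddings", "docs-with_summaries"
)
print(raw_dir)
print(summary_dir)
```

Because the two paths differ only in the final component, an index built from the raw documents would not be reused for the documents plus summaries.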

Evaluate

We will evaluate the new chain on the same dataset.

chain_2 = create_chain(retriever_with_summaries)

test_run_with_summaries = client.run_on_dataset(
    dataset_name=task.name,
    llm_or_chain_factory=chain_2,
    evaluation=RAG_EVALUATION,
    verbose=True,
)
View the evaluation results for project 'crazy-harmony-39' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/projects/p/b69d796f-6ba4-4cde-822f-db363cf81f8f?eval=true

View all tests for Dataset Semi-structured Reports at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/f8f24935-cf57-4cb3-a30f-8df303a46962
[------------------------------------------------->] 5/5
 Eval quantiles:
                                          inputs.question  \
count                                                   5   
unique                                                  5   
top     Analyzing the operating expenses for Q3 2023, ...   
freq                                                    1   
mean                                                  NaN   
std                                                   NaN   
min                                                   NaN   
25%                                                   NaN   
50%                                                   NaN   
75%                                                   NaN   
max                                                   NaN   

        feedback.score_string:accuracy  feedback.faithfulness  \
count                         5.000000                    5.0   
unique                             NaN                    NaN   
top                                NaN                    NaN   
freq                               NaN                    NaN   
mean                          0.720000                    1.0   
std                           0.408656                    0.0   
min                           0.100000                    1.0   
25%                           0.500000                    1.0   
50%                           1.000000                    1.0   
75%                           1.000000                    1.0   
max                           1.000000                    1.0   

        feedback.embedding_cosine_distance error  execution_time  
count                             5.000000     0        5.000000  
unique                                 NaN     0             NaN  
top                                    NaN   NaN             NaN  
freq                                   NaN   NaN             NaN  
mean                              0.069363   NaN        8.659120  
std                               0.023270   NaN        2.611724  
min                               0.039593   NaN        6.283505  
25%                               0.050176   NaN        6.723136  
50%                               0.078912   NaN        7.441743  
75%                               0.084389   NaN       10.673265  
max                               0.093747   NaN       12.173952
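To compare the two runs side by side, the mean values reported in the two result tables can be subtracted directly. The sketch below copies those means by hand; the dictionary layout and the `metric_deltas` helper are hypothetical conveniences, not part of LangSmith.

```python
from typing import Dict

# Mean values copied from the two evaluation summaries above.
baseline: Dict[str, float] = {
    "accuracy": 0.1,
    "faithfulness": 1.0,
    "embedding_cosine_distance": 0.137066,
    "execution_time": 7.940625,
}
with_summaries: Dict[str, float] = {
    "accuracy": 0.72,
    "faithfulness": 1.0,
    "embedding_cosine_distance": 0.069363,
    "execution_time": 8.659120,
}


def metric_deltas(before: Dict[str, float], after: Dict[str, float]) -> Dict[str, float]:
    """Return after - before for every metric, rounded to 6 decimals."""
    return {name: round(after[name] - before[name], 6) for name in before}


deltas = metric_deltas(baseline, with_summaries)
for name, delta in deltas.items():
    print(f"{name}: {delta:+.6f}")
```

Mean accuracy rises and mean embedding distance falls when table summaries are indexed, while faithfulness is unchanged. With only five examples in the dataset, these differences are indicative rather than statistically robust.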