Multi-modal eval: GPT-4 with multi-modal embeddings and multi-vector retriever#
Multi-modal slide decks is a public dataset that contains question-answer pairs derived from slide decks with visual content.
The QA pairs are based on the visual content of the decks, testing the ability of RAG to perform visual reasoning.
We evaluate this dataset with two approaches:
(1) a vectorstore with multi-modal embeddings
(2) a multi-vector retriever with indexed image summaries
Prerequisites#
# %pip install -U langchain langsmith langchain_benchmarks
# %pip install -U openai chromadb pypdfium2 open-clip-torch pillow
import getpass
import os
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
env_vars = ["LANGCHAIN_API_KEY", "OPENAI_API_KEY"]
for var in env_vars:
if var not in os.environ:
os.environ[var] = getpass.getpass(prompt=f"Enter your {var}: ")
Dataset#
We can browse the LangChain benchmark datasets available for retrieval tasks.
from langchain_benchmarks import clone_public_dataset, registry
registry = registry.filter(Type="RetrievalTask")
registry
Name | Type | Dataset ID | Description |
---|---|---|---|
LangChain Docs Q&A | RetrievalTask | 452ccafc-18e1-4314-885b-edd735f17b9d | Questions and answers based on a snapshot of the LangChain python docs. The environment provides the documents and the retriever information. Each example is composed of a question and reference answer. Success is measured based on the accuracy of the answer relative to the reference answer. We also measure the faithfulness of the model's response relative to the retrieved documents (if any). |
Semi-structured Reports | RetrievalTask | c47d9617-ab99-4d6e-a6e6-92b8daf85a7d | Questions and answers based on PDFs containing tables and charts. The task provides the raw documents as well as factory methods to easily index them and create a retriever. Each example is composed of a question and reference answer. Success is measured based on the accuracy of the answer relative to the reference answer. We also measure the faithfulness of the model's response relative to the retrieved documents (if any). |
Multi-modal slide decks | RetrievalTask | 40afc8e7-9d7e-44ed-8971-2cae1eb59731 | This public dataset is a work-in-progress and will be extended over time. Questions and answers based on slide decks containing visual tables and charts. Each example is composed of a question and reference answer. Success is measured based on the accuracy of the answer relative to the reference answer. |
Multi-modal slide decks is the relevant dataset for our task.
task = registry["Multi-modal slide decks"]
task
Name | Multi-modal slide decks |
Type | RetrievalTask |
Dataset ID | 40afc8e7-9d7e-44ed-8971-2cae1eb59731 |
Description | This public dataset is a work-in-progress and will be extended over time. Questions and answers based on slide decks containing visual tables and charts. Each example is composed of a question and reference answer. Success is measured based on the accuracy of the answer relative to the reference answer. |
Retriever Factories |  |
Architecture Factories |  |
get_docs | {} |
Clone the dataset so that it is available in our LangSmith datasets.
clone_public_dataset(task.dataset_id, dataset_name=task.name)
Finished fetching examples. Creating dataset...
New dataset created you can access it at https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08a29acb-5ad6-42ce-a482-574c9e2e5306.
Done creating dataset.
Fetch the PDFs associated with the dataset from a remote cache so that we can run ingestion.
from langchain_benchmarks.rag.tasks.multi_modal_slide_decks import get_file_names
file_names = list(get_file_names()) # PosixPath
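As a quick sanity check, we can list the files that were fetched (the loop below simply prints each file name):
for fi in file_names:
    print(fi.name)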
Load#
For each presentation, extract an image of every slide.
import os
from pathlib import Path
import pypdfium2 as pdfium
def get_images(file):
"""
Get PIL images from PDF pages and save them to a specified directory
:param file: Path to file
:return: A list of PIL images
"""
# Get presentation
pdf = pdfium.PdfDocument(file)
n_pages = len(pdf)
# Extracting file name and creating the directory for images
file_name = Path(file).stem # Gets the file name without extension
img_dir = os.path.join(Path(file).parent, "img")
os.makedirs(img_dir, exist_ok=True)
# Get images
pil_images = []
print(f"Extracting {n_pages} images for {file.name}")
for page_number in range(n_pages):
page = pdf.get_page(page_number)
bitmap = page.render(scale=1, rotation=0, crop=(0, 0, 0, 0))
pil_image = bitmap.to_pil()
pil_images.append(pil_image)
# Saving the image with the specified naming convention
image_path = os.path.join(img_dir, f"{file_name}_image_{page_number + 1}.jpg")
pil_image.save(image_path, format="JPEG")
return pil_images
images = []
for fi in file_names:
images.extend(get_images(fi))
Extracting 30 images for DDOG_Q3_earnings_deck.pdf
Next, we convert each PIL image to a Base64-encoded string and set the image size.
The Base64-encoded strings can be passed to GPT-4V.
import base64
import io
from io import BytesIO
from PIL import Image
def resize_base64_image(base64_string, size=(128, 128)):
"""
Resize an image encoded as a Base64 string
:param base64_string: Base64 string
:param size: Image size
:return: Re-sized Base64 string
"""
# Decode the Base64 string
img_data = base64.b64decode(base64_string)
img = Image.open(io.BytesIO(img_data))
# Resize the image
resized_img = img.resize(size, Image.LANCZOS)
# Save the resized image to a bytes buffer
buffered = io.BytesIO()
resized_img.save(buffered, format=img.format)
# Encode the resized image to Base64
return base64.b64encode(buffered.getvalue()).decode("utf-8")
def convert_to_base64(pil_image):
"""
Convert PIL images to Base64 encoded strings
:param pil_image: PIL image
:return: Re-sized Base64 string
"""
buffered = BytesIO()
pil_image.save(buffered, format="JPEG") # You can change the format if needed
img_str = base64.b64encode(buffered.getvalue()).decode("utf-8")
img_str = resize_base64_image(img_str, size=(960, 540))
return img_str
images_base_64 = [convert_to_base64(i) for i in images]
If desired, we can plot an image to confirm that the slides were extracted correctly.
from IPython.display import HTML, display
def plt_img_base64(img_base64):
"""
    Display a base64-encoded string as an image
:param img_base64: Base64 string
"""
# Create an HTML img tag with the base64 string as the source
image_html = f'<img src="data:image/jpeg;base64,{img_base64}" />'
# Display the image by rendering the HTML
display(HTML(image_html))
i = 10
plt_img_base64(images_base_64[i])
Indexing#
We will test both approaches.
Option 1: Vectorstore with multi-modal embeddings#
Here we use OpenCLIP multi-modal embeddings.
There are many to choose from.
By default it will use model_name="ViT-H-14", checkpoint="laion2b_s32b_b79k".
This model strikes a good balance between memory and performance.
However, you can test other models by passing them to OpenCLIPEmbeddings as model_name=, checkpoint=, as in the sketch below.
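For example, a minimal sketch of swapping in a smaller OpenCLIP model; the specific model_name / checkpoint pair below is just an illustrative choice from the OpenCLIP model list, not a recommendation:
from langchain_experimental.open_clip import OpenCLIPEmbeddings

# Illustrative example: a smaller OpenCLIP model that trades some accuracy for lower memory use
clip_embd = OpenCLIPEmbeddings(model_name="ViT-B-32", checkpoint="laion2b_s34b_b79k")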
from langchain.vectorstores import Chroma
from langchain_experimental.open_clip import OpenCLIPEmbeddings
# Make vectorstore
vectorstore_mmembd = Chroma(
collection_name="multi-modal-rag",
embedding_function=OpenCLIPEmbeddings(),
)
# Read images we extracted above
img_dir = os.path.join(Path(file_names[0]).parent, "img")
image_uris = sorted(
[
os.path.join(img_dir, image_name)
for image_name in os.listdir(img_dir)
if image_name.endswith(".jpg")
]
)
# Add images
vectorstore_mmembd.add_images(uris=image_uris)
# Make retriever
retriever_mmembd = vectorstore_mmembd.as_retriever()
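Optionally, we can sanity-check the retriever with a sample query; the question below is illustrative only:
# The retriever embeds the query text with OpenCLIP and returns the most similar slide images
docs = retriever_mmembd.get_relevant_documents("How has quarterly revenue trended?")
print(f"Retrieved {len(docs)} documents")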
Option 2: Multi-vector retriever#
This approach will generate and index image summaries. See here for details.
It will then retrieve the raw images to pass to GPT-4V for final answer synthesis.
The idea is that retrieval over image summaries does not rely on multi-modal embeddings, and may better retrieve slide content that is visually or semantically similar but quantitatively distinct.
Note: OpenAI's GPT-4V API can raise a non-deterministic BadRequestError, which we handle below. Hopefully this is resolved soon.
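If you would rather retry a failed request than skip the image, a minimal retry helper along these lines can wrap the summarization call; this helper is a sketch and is not part of the benchmark code:
import time

def with_retries(fn, max_attempts=3, delay=2):
    """Call fn(), retrying on failure with a fixed delay (hypothetical helper)."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as e:
            if attempt == max_attempts - 1:
                raise
            print(f"Attempt {attempt + 1} failed ({e}); retrying in {delay}s...")
            time.sleep(delay)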
from langchain.chat_models import ChatOpenAI
from langchain.schema.messages import HumanMessage
def image_summarize(img_base64, prompt):
"""
Make image summary
:param img_base64: Base64 encoded string for image
    :param prompt: Text prompt for summarization
    :return: Image summary
"""
chat = ChatOpenAI(model="gpt-4-vision-preview", max_tokens=1024)
msg = chat.invoke(
[
HumanMessage(
content=[
{"type": "text", "text": prompt},
{
"type": "image_url",
"image_url": {"url": f"data:image/jpeg;base64,{img_base64}"},
},
]
)
]
)
return msg.content
def generate_img_summaries(img_base64_list):
"""
Generate summaries for images
:param img_base64_list: Base64 encoded images
:return: List of image summaries and processed images
"""
# Store image summaries
image_summaries = []
processed_images = []
# Prompt
prompt = """You are an assistant tasked with summarizing images for retrieval. \
These summaries will be embedded and used to retrieve the raw image. \
Give a concise summary of the image that is well optimized for retrieval."""
# Apply summarization to images
    for i, base64_image in enumerate(img_base64_list):
        try:
            image_summaries.append(image_summarize(base64_image, prompt))
            processed_images.append(base64_image)
        except Exception:
            # Skip images that fail (e.g., the occasional non-deterministic BadRequestError)
            print(f"BadRequestError with image {i+1}")
return image_summaries, processed_images
# Image summaries
image_summaries, images_base_64_processed = generate_img_summaries(images_base_64)
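Before indexing, it can be useful to spot-check the generated summaries and confirm how many images survived summarization:
# Number of summaries should match the number of successfully processed images
print(len(image_summaries), len(images_base_64_processed))
print(image_summaries[0])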
Add the raw images and the image summaries to the Multi Vector Retriever: store the raw images in the docstore, and store the image summaries in the vectorstore for semantic retrieval.
import uuid
from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.schema.document import Document
from langchain.schema.output_parser import StrOutputParser
from langchain.storage import InMemoryStore
def create_multi_vector_retriever(vectorstore, image_summaries, images):
"""
Create retriever that indexes summaries, but returns raw images or texts
    :param vectorstore: Vectorstore to store embedded image summaries
:param image_summaries: Image summaries
:param images: Base64 encoded images
:return: Retriever
"""
# Initialize the storage layer
store = InMemoryStore()
id_key = "doc_id"
# Create the multi-vector retriever
retriever = MultiVectorRetriever(
vectorstore=vectorstore,
docstore=store,
id_key=id_key,
)
# Helper function to add documents to the vectorstore and docstore
def add_documents(retriever, doc_summaries, doc_contents):
doc_ids = [str(uuid.uuid4()) for _ in doc_contents]
summary_docs = [
Document(page_content=s, metadata={id_key: doc_ids[i]})
for i, s in enumerate(doc_summaries)
]
retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, doc_contents)))
add_documents(retriever, image_summaries, images)
return retriever
# The vectorstore to use to index the summaries
vectorstore_mvr = Chroma(
collection_name="multi-modal-rag-mv", embedding_function=OpenAIEmbeddings()
)
# Create retriever
retriever_multi_vector_img = create_multi_vector_retriever(
vectorstore_mvr,
image_summaries,
images_base_64_processed,
)
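As a quick check, the retriever should return raw Base64-encoded images (not summaries) for a query; the question below is illustrative only:
# The multi-vector retriever searches over summaries but returns the raw base64 images from the docstore
docs = retriever_multi_vector_img.get_relevant_documents("quarterly revenue")
print(f"Retrieved {len(docs)} items; first item is a base64 string of length {len(docs[0])}")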
RAG#
Create a pipeline that retrieves relevant images based on semantic similarity to the input question, then passes the images to GPT-4V for answer synthesis.
from langchain.schema.runnable import RunnableLambda, RunnablePassthrough
def prepare_images(docs):
"""
    Prepare images for prompt
:param docs: A list of base64-encoded images from retriever.
:return: Dict containing a list of base64-encoded strings.
"""
b64_images = []
for doc in docs:
if isinstance(doc, Document):
doc = doc.page_content
b64_images.append(doc)
return {"images": b64_images}
def img_prompt_func(data_dict, num_images=2):
"""
GPT-4V prompt for image analysis.
:param data_dict: A dict with images and a user-provided question.
:param num_images: Number of images to include in the prompt.
:return: A list containing message objects for each image and the text prompt.
"""
messages = []
if data_dict["context"]["images"]:
for image in data_dict["context"]["images"][:num_images]:
image_message = {
"type": "image_url",
"image_url": {"url": f"data:image/jpeg;base64,{image}"},
}
messages.append(image_message)
text_message = {
"type": "text",
"text": (
"You are an analyst tasked with answering questions about visual content.\n"
"You will be give a set of image(s) from a slide deck / presentation.\n"
"Use this information to answer the user question. \n"
f"User-provided question: {data_dict['question']}\n\n"
),
}
messages.append(text_message)
return [HumanMessage(content=messages)]
def multi_modal_rag_chain(retriever):
"""
Multi-modal RAG chain
"""
# Multi-modal LLM
model = ChatOpenAI(temperature=0, model="gpt-4-vision-preview", max_tokens=1024)
# RAG pipeline
chain = (
{
"context": retriever | RunnableLambda(prepare_images),
"question": RunnablePassthrough(),
}
| RunnableLambda(img_prompt_func)
| model
| StrOutputParser()
)
return chain
# Create RAG chain
chain_multimodal_rag = multi_modal_rag_chain(retriever_multi_vector_img)
chain_multimodal_rag_mmembd = multi_modal_rag_chain(retriever_mmembd)
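Before running the full evaluation, we can invoke a chain directly on a single question; the question below is illustrative and assumes the Datadog earnings deck shown above has been indexed:
# Ask one question end-to-end: retrieve slide images, then synthesize an answer with GPT-4V
answer = chain_multimodal_rag.invoke("What was Datadog's revenue in Q3 2023?")
print(answer)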
Evaluation#
Run the evaluation on our dataset:
- task.name is the dataset of QA pairs that we cloned
- eval_config specifies the LangSmith evaluator for our dataset, which will use GPT-4 as the grader
- the grader will evaluate the chain-generated answer to each question relative to the ground truth
import uuid
from langchain.smith import RunEvalConfig
from langsmith.client import Client
# Evaluator configuration
client = Client()
eval_config = RunEvalConfig(
evaluators=["cot_qa"],
)
# Experiments
chain_map = {
"multi_modal_mvretriever_gpt4v": chain_multimodal_rag,
"multi_modal_mmembd_gpt4v": chain_multimodal_rag_mmembd,
}
# Run evaluation
run_id = uuid.uuid4().hex[:4]
test_runs = {}
for project_name, chain in chain_map.items():
test_runs[project_name] = client.run_on_dataset(
dataset_name=task.name,
llm_or_chain_factory=lambda: (lambda x: x["Question"]) | chain,
evaluation=eval_config,
verbose=True,
project_name=f"{project_name}-{run_id}",
project_metadata={"chain": project_name},
)
View the evaluation results for project 'multi_modal_mvretriever_gpt4v-f6f7' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08a29acb-5ad6-42ce-a482-574c9e2e5306/compare?selectedSessions=15dd3901-382c-4f0f-8433-077963fc4bb7
View all tests for Dataset Multi-modal slide decks at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08a29acb-5ad6-42ce-a482-574c9e2e5306
[------------------------------------------------->] 10/10
Experiment results
 | output | feedback.COT Contextual Accuracy | error | execution time |
---|---|---|---|---|
count | 10 | 10.0 | 0 | 10.000000 |
unique | 10 | NaN | 0 | NaN |
top | As of the third quarter of 2023 (Q3 2023), Dat... | NaN | NaN | NaN |
freq | 1 | NaN | NaN | NaN |
mean | NaN | 1.0 | NaN | 13.430077 |
std | NaN | 0.0 | NaN | 3.656360 |
min | NaN | 1.0 | NaN | 10.319160 |
25% | NaN | 1.0 | NaN | 10.809424 |
50% | NaN | 1.0 | NaN | 11.675873 |
75% | NaN | 1.0 | NaN | 15.971083 |
max | NaN | 1.0 | NaN | 20.940341 |
View the evaluation results for project 'multi_modal_mmembd_gpt4v-f6f7' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08a29acb-5ad6-42ce-a482-574c9e2e5306/compare?selectedSessions=ed6255b4-23b5-45ee-82f7-bcf6744c3f8e
View all tests for Dataset Multi-modal slide decks at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08a29acb-5ad6-42ce-a482-574c9e2e5306
[------------------------------------------------->] 10/10
Experiment results
 | output | feedback.COT Contextual Accuracy | error | execution time |
---|---|---|---|---|
count | 10 | 10.000000 | 0 | 10.000000 |
unique | 10 | NaN | 0 | NaN |
top | The provided images do not contain the information... | NaN | NaN | NaN |
freq | 1 | NaN | NaN | NaN |
mean | NaN | 0.500000 | NaN | 15.596197 |
std | NaN | 0.527046 | NaN | 2.716853 |
min | NaN | 0.000000 | NaN | 11.661625 |
25% | NaN | 0.000000 | NaN | 12.941465 |
50% | NaN | 0.500000 | NaN | 16.246343 |
75% | NaN | 1.000000 | NaN | 17.723280 |
max | NaN | 1.000000 | NaN | 18.488639 |