Multi-modal: Gemini#
Let's benchmark Gemini on the Multi-modal slide decks dataset.
As before, we will test the model with two types of indexing:
(1) a vectorstore with multi-modal embeddings
(2) a multi-vector retriever with indexed image summaries
Pre-requisites#
%pip install -U --quiet langchain langchain-google-genai langchain_benchmarks
%pip install -U --quiet openai chromadb pypdfium2 open-clip-torch pillow
import getpass
import os
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
env_vars = ["LANGCHAIN_API_KEY", "GOOGLE_API_KEY"]
for var in env_vars:
    if var not in os.environ:
        os.environ[var] = getpass.getpass(prompt=f"Enter your {var}: ")
Dataset#
We can browse the retrieval benchmark datasets offered by LangChain, or directly select the Multi-modal slide decks task.
from langchain_benchmarks import clone_public_dataset, registry
task = registry["Multi-modal slide decks"]
task
| Name | Multi-modal slide decks |
| --- | --- |
| Type | RetrievalTask |
| Dataset ID | 40afc8e7-9d7e-44ed-8971-2cae1eb59731 |
| Description | This public dataset is a work in progress and will be extended over time. Questions and answers are based on slide decks containing visual tables and charts. Each example is composed of a question and a reference answer. Success is measured based on the accuracy of the answer relative to the reference answer. |
| Retriever Factories | |
| Architecture Factories | |
| get_docs | {} |
Clone the dataset so that it is available in our LangSmith datasets.
clone_public_dataset(task.dataset_id, dataset_name=task.name)
Dataset Multi-modal slide decks already exists. Skipping.
You can access the dataset at https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08a29acb-5ad6-42ce-a482-574c9e2e5306.
Fetch the PDFs associated with the dataset from the remote cache so that we can run ingestion.
from langchain_benchmarks.rag.tasks.multi_modal_slide_decks import get_file_names
file_names = list(get_file_names()) # PosixPath
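If you want to confirm what was fetched, you can print the file names (a minimal optional check; the extraction output further below shows a single Datadog earnings deck at the time of writing):
# Optional: list the fetched decks
for f in file_names:
    print(f.name)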
Load#
For each presentation, extract one image per slide.
import os
from pathlib import Path
import pypdfium2 as pdfium
def get_images(file):
    """
    Get PIL images from PDF pages and save them to a specified directory.

    :param file: Path to file
    :return: A list of PIL images
    """
    # Get presentation
    pdf = pdfium.PdfDocument(file)
    n_pages = len(pdf)

    # Extract the file name and create the directory for images
    file_name = Path(file).stem  # File name without extension
    img_dir = os.path.join(Path(file).parent, "img")
    os.makedirs(img_dir, exist_ok=True)

    # Get images
    pil_images = []
    print(f"Extracting {n_pages} images for {file.name}")
    for page_number in range(n_pages):
        page = pdf.get_page(page_number)
        bitmap = page.render(scale=1, rotation=0, crop=(0, 0, 0, 0))
        pil_image = bitmap.to_pil()
        pil_images.append(pil_image)

        # Save the image with the specified naming convention
        image_path = os.path.join(img_dir, f"{file_name}_image_{page_number + 1}.jpg")
        pil_image.save(image_path, format="JPEG")
    return pil_images
images = []
for fi in file_names:
    images.extend(get_images(fi))
Extracting 30 images for DDOG_Q3_earnings_deck.pdf
Now, convert each PIL image to a Base64-encoded string and resize it.
The Base64-encoded strings can be passed as input to Gemini.
import base64
import io
from io import BytesIO
from PIL import Image
def resize_base64_image(base64_string, size=(128, 128)):
    """
    Resize an image encoded as a Base64 string.

    :param base64_string: Base64 string
    :param size: Image size
    :return: Re-sized Base64 string
    """
    # Decode the Base64 string
    img_data = base64.b64decode(base64_string)
    img = Image.open(io.BytesIO(img_data))

    # Resize the image
    resized_img = img.resize(size, Image.LANCZOS)

    # Save the resized image to a bytes buffer
    buffered = io.BytesIO()
    resized_img.save(buffered, format=img.format)

    # Encode the resized image to Base64
    return base64.b64encode(buffered.getvalue()).decode("utf-8")
def convert_to_base64(pil_image):
    """
    Convert a PIL image to a resized, Base64-encoded string.

    :param pil_image: PIL image
    :return: Re-sized Base64 string
    """
    buffered = BytesIO()
    pil_image.save(buffered, format="JPEG")  # You can change the format if needed
    img_str = base64.b64encode(buffered.getvalue()).decode("utf-8")
    img_str = resize_base64_image(img_str, size=(960, 540))
    return img_str
images_base_64 = [convert_to_base64(i) for i in images]
If desired, we can plot the images to confirm that they were extracted correctly.
from IPython.display import HTML, display
def plt_img_base64(img_base64):
    """
    Display a Base64-encoded string as an image.

    :param img_base64: Base64 string
    """
    # Create an HTML img tag with the Base64 string as the source
    image_html = f'<img src="data:image/jpeg;base64,{img_base64}" />'

    # Display the image by rendering the HTML
    display(HTML(image_html))
i = 10
plt_img_base64(images_base_64[i])
Indexing#
We will test two approaches.
Option 1: Vectorstore with multi-modal embeddings#
Here we will use OpenCLIP multi-modal embeddings.
There are many models to choose from.
By default, it will use model_name="ViT-H-14", checkpoint="laion2b_s32b_b79k".
This model strikes a good balance between memory footprint and performance.
However, you can test other models by passing them to OpenCLIPEmbeddings via the model_name= and checkpoint= arguments.
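For instance, here is a minimal sketch of swapping in a smaller model (the model_name / checkpoint pair below is illustrative; check the OpenCLIP model zoo for valid combinations):
from langchain_experimental.open_clip import OpenCLIPEmbeddings

# Illustrative: a smaller OpenCLIP model; verify the checkpoint name against the OpenCLIP model zoo
clip_embd = OpenCLIPEmbeddings(model_name="ViT-B-32", checkpoint="laion2b_s34b_b79k")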
from langchain.vectorstores import Chroma
from langchain_experimental.open_clip import OpenCLIPEmbeddings
# Make vectorstore
vectorstore_mmembd = Chroma(
    collection_name="multi-modal-rag",
    embedding_function=OpenCLIPEmbeddings(),
)
# Read images we extracted above
img_dir = os.path.join(Path(file_names[0]).parent, "img")
image_uris = sorted(
    [
        os.path.join(img_dir, image_name)
        for image_name in os.listdir(img_dir)
        if image_name.endswith(".jpg")
    ]
)
# Add images
vectorstore_mmembd.add_images(uris=image_uris)
# Make retriever
retriever_mmembd = vectorstore_mmembd.as_retriever()
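Optionally, we can sanity-check this retriever with a text query. With add_images, Chroma stores each image as a Base64 string, so the returned Documents carry Base64 image content (the query below is illustrative):
# Optional sanity check: query with text, get back Base64-encoded slide images (illustrative query)
docs = retriever_mmembd.get_relevant_documents("Datadog revenue growth")
plt_img_base64(docs[0].page_content)  # display the top retrieved slide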
Option 2: Multi-vector retriever#
This approach will generate and index image summaries (see the multi-vector retriever documentation for details).
It will then retrieve the raw images and pass them to Gemini for final answer synthesis.
The idea is that image-summary retrieval, which does not rely on multi-modal embeddings, performs better at retrieving slide content that is visually/semantically similar but quantitatively different.
Note: the image-summarization calls can occasionally fail with a non-deterministic BadRequestError, which we handle below. Hopefully this will be resolved soon.
from langchain.schema.messages import HumanMessage
from langchain_google_genai import ChatGoogleGenerativeAI
def image_summarize(img_base64, prompt):
    """
    Make image summary.

    :param img_base64: Base64-encoded string for image
    :param prompt: Text prompt for summarization
    :return: Image summary
    """
    chat = ChatGoogleGenerativeAI(model="gemini-pro-vision", temperature=0)
    msg = chat.invoke(
        [
            HumanMessage(
                content=[
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{img_base64}"},
                    },
                ]
            )
        ]
    )
    return msg.content
def generate_img_summaries(img_base64_list):
    """
    Generate summaries for images.

    :param img_base64_list: Base64-encoded images
    :return: List of image summaries and processed images
    """
    # Store image summaries
    image_summaries = []
    processed_images = []

    # Prompt
    prompt = """You are an assistant tasked with summarizing images for retrieval. \
These summaries will be embedded and used to retrieve the raw image. \
Give a detailed summary of the image that is well optimized for retrieval."""

    # Apply summarization to images
    for i, base64_image in enumerate(img_base64_list):
        try:
            image_summaries.append(image_summarize(base64_image, prompt))
            processed_images.append(base64_image)
        except Exception as e:
            print(f"BadRequestError with image {i+1}. {e}")
    return image_summaries, processed_images
# Image summaries
image_summaries, images_base_64_processed = generate_img_summaries(images_base_64)
Add the raw documents and document summaries to the Multi Vector Retriever: store the raw images in the docstore, and store the image summaries in the vectorstore for semantic retrieval.
import uuid
from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.schema.document import Document
from langchain.schema.output_parser import StrOutputParser
from langchain.storage import InMemoryStore
def create_multi_vector_retriever(vectorstore, image_summaries, images):
    """
    Create a retriever that indexes summaries, but returns raw images or texts.

    :param vectorstore: Vectorstore to store embedded image summaries
    :param image_summaries: Image summaries
    :param images: Base64-encoded images
    :return: Retriever
    """
    # Initialize the storage layer
    store = InMemoryStore()
    id_key = "doc_id"

    # Create the multi-vector retriever
    retriever = MultiVectorRetriever(
        vectorstore=vectorstore,
        docstore=store,
        id_key=id_key,
    )

    # Helper function to add documents to the vectorstore and docstore
    def add_documents(retriever, doc_summaries, doc_contents):
        doc_ids = [str(uuid.uuid4()) for _ in doc_contents]
        summary_docs = [
            Document(page_content=s, metadata={id_key: doc_ids[i]})
            for i, s in enumerate(doc_summaries)
        ]
        retriever.vectorstore.add_documents(summary_docs)
        retriever.docstore.mset(list(zip(doc_ids, doc_contents)))

    add_documents(retriever, image_summaries, images)
    return retriever
# The vectorstore to use to index the summaries
vectorstore_mvr = Chroma(
    collection_name="multi-modal-rag-mv", embedding_function=OpenAIEmbeddings()
)
# Create retriever
retriever_multi_vector_img = create_multi_vector_retriever(
    vectorstore_mvr,
    image_summaries,
    images_base_64_processed,
)
RAG#
Create a pipeline that retrieves relevant images based on semantic similarity to the input question, then passes the images to Gemini for answer synthesis.
from langchain.schema.runnable import RunnableLambda, RunnablePassthrough
def prepare_images(docs):
    """
    Prepare images for the prompt.

    :param docs: A list of base64-encoded images from the retriever.
    :return: Dict containing a list of base64-encoded strings.
    """
    b64_images = []
    for doc in docs:
        if isinstance(doc, Document):
            doc = doc.page_content
        b64_images.append(doc)
    return {"images": b64_images}
def img_prompt_func(data_dict, num_images=2):
    """
    Gemini prompt for image analysis.

    :param data_dict: A dict with images and a user-provided question.
    :param num_images: Number of images to include in the prompt.
    :return: A list containing message objects for each image and the text prompt.
    """
    messages = []
    if data_dict["context"]["images"]:
        for image in data_dict["context"]["images"][:num_images]:
            image_message = {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{image}"},
            }
            messages.append(image_message)
    text_message = {
        "type": "text",
        "text": (
            "You are an analyst tasked with answering questions about visual content.\n"
            "You will be given a set of image(s) from a slide deck / presentation.\n"
            "Use this information to answer the user question. \n"
            f"User-provided question: {data_dict['question']}\n\n"
        ),
    }
    messages.append(text_message)
    return [HumanMessage(content=messages)]
def multi_modal_rag_chain(retriever):
    """
    Multi-modal RAG chain.
    """
    # Multi-modal LLM
    model = ChatGoogleGenerativeAI(temperature=0, model="gemini-pro-vision")

    # RAG pipeline
    chain = (
        {
            "context": retriever | RunnableLambda(prepare_images),
            "question": RunnablePassthrough(),
        }
        | RunnableLambda(img_prompt_func)
        | model
        | StrOutputParser()
    )
    return chain
# Create RAG chain
chain_multimodal_rag = multi_modal_rag_chain(retriever_multi_vector_img)
chain_multimodal_rag_mmembd = multi_modal_rag_chain(retriever_mmembd)
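Before running the full evaluation, we can optionally smoke-test one of the chains on a single question (the question below is illustrative, not taken from the dataset):
# Optional smoke test on an illustrative question
answer = chain_multimodal_rag.invoke("How many customers does Datadog have?")
print(answer)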
Eval#
Run evaluation on our dataset:
- task.name is the dataset of QA pairs that we cloned.
- eval_config specifies the LangSmith evaluator for our dataset, which will use GPT-4 as the grader.
- The grader will evaluate the chain-generated answer to each question relative to ground truth.
import uuid
from langchain.smith import RunEvalConfig
from langsmith.client import Client
# Evaluator configuration
client = Client()
eval_config = RunEvalConfig(
    evaluators=["cot_qa"],
)
# Experiments
chain_map = {
"multi_modal_mvretriever_gemini-pro-vision": chain_multimodal_rag,
"multi_modal_mmembd_gemini-pro-vision": chain_multimodal_rag_mmembd,
}
# Run evaluation
run_id = uuid.uuid4().hex[:4]
test_runs = {}
for arch_name, chain in chain_map.items():
    test_runs[arch_name] = client.run_on_dataset(
        dataset_name=task.name,
        llm_or_chain_factory=lambda: (lambda x: x["Question"]) | chain,
        evaluation=eval_config,
        verbose=True,
        project_name=f"{arch_name}-{run_id}",
        project_metadata={"arch": arch_name, "model": "gemini-pro-vision"},
    )
View the evaluation results for project 'multi_modal_mvretriever_gemini-pro-vision-f276' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08a29acb-5ad6-42ce-a482-574c9e2e5306/compare?selectedSessions=e2890853-d80d-432f-a91d-60902f8244c6
View all tests for Dataset Multi-modal slide decks at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08a29acb-5ad6-42ce-a482-574c9e2e5306
[------------------------------------------------->] 10/10
Experiment Results

| | output | feedback.COT Contextual Accuracy | error | execution time | run id |
| --- | --- | --- | --- | --- | --- |
| count | 10 | 10.000000 | 0 | 10.000000 | 10 |
| unique | 10 | NaN | 0 | NaN | 10 |
| top | ~26,800 | NaN | NaN | NaN | afc9745a-8f9f-4092-8344-4617b1d91a7c |
| freq | 1 | NaN | NaN | NaN | 1 |
| mean | NaN | 0.800000 | NaN | 11.835799 | NaN |
| std | NaN | 0.421637 | NaN | 0.746122 | NaN |
| min | NaN | 0.000000 | NaN | 10.604844 | NaN |
| 25% | NaN | 1.000000 | NaN | 11.587839 | NaN |
| 50% | NaN | 1.000000 | NaN | 12.004324 | NaN |
| 75% | NaN | 1.000000 | NaN | 12.307524 | NaN |
| max | NaN | 1.000000 | NaN | 12.919055 | NaN |
View the evaluation results for project 'multi_modal_mmembd_gemini-pro-vision-f276' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08a29acb-5ad6-42ce-a482-574c9e2e5306/compare?selectedSessions=575112d2-d7a8-472b-bafe-0698ff496606
View all tests for Dataset Multi-modal slide decks at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08a29acb-5ad6-42ce-a482-574c9e2e5306
[------------------------------------------------->] 10/10
Experiment Results

| | output | feedback.COT Contextual Accuracy | error | execution time | run id |
| --- | --- | --- | --- | --- | --- |
| count | 10 | 10.000000 | 0 | 10.000000 | 10 |
| unique | 9 | NaN | 0 | NaN | 10 |
| top | The answer is not found in the images. | NaN | NaN | NaN | a17bb468-499b-4663-89a3-eeeb6207763c |
| freq | 2 | NaN | NaN | NaN | 1 |
| mean | NaN | 0.300000 | NaN | 14.063835 | NaN |
| std | NaN | 0.483046 | NaN | 3.145174 | NaN |
| min | NaN | 0.000000 | NaN | 10.307565 | NaN |
| 25% | NaN | 0.000000 | NaN | 11.285314 | NaN |
| 50% | NaN | 0.000000 | NaN | 14.042789 | NaN |
| 75% | NaN | 0.750000 | NaN | 16.025788 | NaN |
| max | NaN | 1.000000 | NaN | 19.055173 | NaN |