Multi-modal: Gemini#

Let's benchmark Gemini on the Multi-modal slide decks dataset.

As before, we will test the model with two indexing approaches:

(1) A vectorstore with multi-modal embeddings

(2) A multi-vector retriever with indexed image summaries

Prerequisites#

%pip install -U --quiet langchain langchain-google-genai langchain_benchmarks
%pip install -U --quiet openai chromadb pypdfium2 open-clip-torch pillow
import getpass
import os

os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
env_vars = ["LANGCHAIN_API_KEY", "GOOGLE_API_KEY"]
for var in env_vars:
    if var not in os.environ:
        os.environ[var] = getpass.getpass(prompt=f"Enter your {var}: ")

Dataset#

We can browse the retrieval benchmark datasets available from LangChain, or directly select the Multi-modal slide decks task.
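As a minimal sketch of browsing first (assuming the registry.filter API in langchain_benchmarks accepts a Type keyword):

from langchain_benchmarks import registry

# Purely illustrative: list the retrieval tasks before picking one
registry.filter(Type="RetrievalTask")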

from langchain_benchmarks import clone_public_dataset, registry
task = registry["Multi-modal slide decks"]
task
Name: Multi-modal slide decks
Type: RetrievalTask
Dataset ID: 40afc8e7-9d7e-44ed-8971-2cae1eb59731
Description: This public dataset is a work in progress and will be extended over time. Question answering is based on slide decks containing visual tables and charts. Each example consists of a question and a reference answer. Success is measured by the accuracy of the answer relative to the reference answer.
Retriever Factories:
Architecture Factories:
get_docs: {}

Clone the dataset so that it is available in our LangSmith datasets.

clone_public_dataset(task.dataset_id, dataset_name=task.name)
Dataset Multi-modal slide decks already exists. Skipping.
You can access the dataset at https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08a29acb-5ad6-42ce-a482-574c9e2e5306.

Fetch the PDFs associated with the dataset from the remote cache so that we can perform ingestion.

from langchain_benchmarks.rag.tasks.multi_modal_slide_decks import get_file_names

file_names = list(get_file_names())  # PosixPath
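
A quick, purely illustrative sanity check confirms which decks were fetched:

# Each entry in file_names is a PosixPath pointing at a cached PDF
for f in file_names:
    print(f.name)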

Load#

For each presentation, extract one image per slide.

import os
from pathlib import Path

import pypdfium2 as pdfium


def get_images(file):
    """
    Get PIL images from PDF pages and save them to a specified directory
    :param file: Path to file
    :return: A list of PIL images
    """

    # Get presentation
    pdf = pdfium.PdfDocument(file)
    n_pages = len(pdf)

    # Extracting file name and creating the directory for images
    file_name = Path(file).stem  # Gets the file name without extension
    img_dir = os.path.join(Path(file).parent, "img")
    os.makedirs(img_dir, exist_ok=True)

    # Get images
    pil_images = []
    print(f"Extracting {n_pages} images for {file.name}")
    for page_number in range(n_pages):
        page = pdf.get_page(page_number)
        bitmap = page.render(scale=1, rotation=0, crop=(0, 0, 0, 0))
        pil_image = bitmap.to_pil()
        pil_images.append(pil_image)

        # Saving the image with the specified naming convention
        image_path = os.path.join(img_dir, f"{file_name}_image_{page_number + 1}.jpg")
        pil_image.save(image_path, format="JPEG")

    return pil_images


images = []
for fi in file_names:
    images.extend(get_images(fi))
Extracting 30 images for DDOG_Q3_earnings_deck.pdf

Now we convert each PIL image to a Base64 encoded string and set the image size.

Base64 encoded strings can be passed as input to Gemini.

import base64
import io
from io import BytesIO

from PIL import Image


def resize_base64_image(base64_string, size=(128, 128)):
    """
    Resize an image encoded as a Base64 string

    :param base64_string: Base64 string
    :param size: Image size
    :return: Re-sized Base64 string
    """
    # Decode the Base64 string
    img_data = base64.b64decode(base64_string)
    img = Image.open(io.BytesIO(img_data))

    # Resize the image
    resized_img = img.resize(size, Image.LANCZOS)

    # Save the resized image to a bytes buffer
    buffered = io.BytesIO()
    resized_img.save(buffered, format=img.format)

    # Encode the resized image to Base64
    return base64.b64encode(buffered.getvalue()).decode("utf-8")


def convert_to_base64(pil_image):
    """
    Convert PIL images to Base64 encoded strings

    :param pil_image: PIL image
    :return: Re-sized Base64 string
    """

    buffered = BytesIO()
    pil_image.save(buffered, format="JPEG")  # You can change the format if needed
    img_str = base64.b64encode(buffered.getvalue()).decode("utf-8")
    img_str = resize_base64_image(img_str, size=(960, 540))
    return img_str


images_base_64 = [convert_to_base64(i) for i in images]

If needed, we can plot the images to confirm that they were extracted correctly.

from IPython.display import HTML, display


def plt_img_base64(img_base64):
    """
    Display a base64 encoded string as an image

    :param img_base64:  Base64 string
    """
    # Create an HTML img tag with the base64 string as the source
    image_html = f'<img src="data:image/jpeg;base64,{img_base64}" />'
    # Display the image by rendering the HTML
    display(HTML(image_html))


i = 10
plt_img_base64(images_base_64[i])

Index#

We will test two approaches.

Option 1: Vectorstore with multi-modal embeddings#

Here we will use OpenCLIP multi-modal embeddings. There are many models to choose from. By default it uses model_name="ViT-H-14", checkpoint="laion2b_s32b_b79k", which strikes a good balance between memory footprint and performance. However, you can test other models by passing different model_name= and checkpoint= arguments to OpenCLIPEmbeddings, as sketched below.
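
For example, a lighter-weight checkpoint could be swapped in. This is a minimal sketch; the specific model/checkpoint pair is just one of the pretrained options published by OpenCLIP:

from langchain_experimental.open_clip import OpenCLIPEmbeddings

# Smaller alternative to the ViT-H-14 default (trades some accuracy for lower memory use)
clip_embd = OpenCLIPEmbeddings(model_name="ViT-B-32", checkpoint="laion2b_s34b_b79k")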

from langchain.vectorstores import Chroma
from langchain_experimental.open_clip import OpenCLIPEmbeddings

# Make vectorstore
vectorstore_mmembd = Chroma(
    collection_name="multi-modal-rag",
    embedding_function=OpenCLIPEmbeddings(),
)

# Read images we extracted above
img_dir = os.path.join(Path(file_names[0]).parent, "img")
image_uris = sorted(
    [
        os.path.join(img_dir, image_name)
        for image_name in os.listdir(img_dir)
        if image_name.endswith(".jpg")
    ]
)

# Add images
vectorstore_mmembd.add_images(uris=image_uris)

# Make retriever
retriever_mmembd = vectorstore_mmembd.as_retriever()

Option 2: Multi-vector retriever#

This approach generates and indexes image summaries. See here for details.

It then retrieves the raw images and passes them to Gemini for final answer synthesis.

The idea is that retrieval over image summaries:

  • Does not rely on multi-modal embeddings

  • Can perform better when retrieving slide content that is visually/semantically similar but quantitatively different

Note: the vision model API can occasionally throw a non-deterministic BadRequestError, which we handle below. Hopefully this will be resolved over time.

from langchain.schema.messages import HumanMessage
from langchain_google_genai import ChatGoogleGenerativeAI


def image_summarize(img_base64, prompt):
    """
    Make image summary

    :param img_base64: Base64 encoded string for image
    :param prompt: Text prompt for summarization
    :return: Image summary

    """
    chat = ChatGoogleGenerativeAI(model="gemini-pro-vision", temperature=0)

    msg = chat.invoke(
        [
            HumanMessage(
                content=[
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{img_base64}"},
                    },
                ]
            )
        ]
    )
    return msg.content


def generate_img_summaries(img_base64_list):
    """
    Generate summaries for images

    :param img_base64_list: Base64 encoded images
    :return: List of image summaries and processed images
    """

    # Store image summaries
    image_summaries = []
    processed_images = []

    # Prompt
    prompt = """You are an assistant tasked with summarizing images for retrieval. \
    These summaries will be embedded and used to retrieve the raw image. \
    Give a detailed summary of the image that is well optimized for retrieval."""

    # Apply summarization to images
    for i, base64_image in enumerate(img_base64_list):
        try:
            image_summaries.append(image_summarize(base64_image, prompt))
            processed_images.append(base64_image)
        except Exception as e:
            print(f"BadRequestError with image {i+1}. {e}")

    return image_summaries, processed_images


# Image summaries
image_summaries, images_base_64_processed = generate_img_summaries(images_base_64)

Add the raw documents and document summaries to the Multi Vector Retriever:

  • Store the raw images in the docstore.

  • Store the image summaries in the vectorstore for semantic retrieval.

import uuid

from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.schema.document import Document
from langchain.schema.output_parser import StrOutputParser
from langchain.storage import InMemoryStore


def create_multi_vector_retriever(vectorstore, image_summaries, images):
    """
    Create retriever that indexes summaries, but returns raw images or texts

    :param vectorstore: Vectorstore to store embedded image summaries
    :param image_summaries: Image summaries
    :param images: Base64 encoded images
    :return: Retriever
    """

    # Initialize the storage layer
    store = InMemoryStore()
    id_key = "doc_id"

    # Create the multi-vector retriever
    retriever = MultiVectorRetriever(
        vectorstore=vectorstore,
        docstore=store,
        id_key=id_key,
    )

    # Helper function to add documents to the vectorstore and docstore
    def add_documents(retriever, doc_summaries, doc_contents):
        doc_ids = [str(uuid.uuid4()) for _ in doc_contents]
        summary_docs = [
            Document(page_content=s, metadata={id_key: doc_ids[i]})
            for i, s in enumerate(doc_summaries)
        ]
        retriever.vectorstore.add_documents(summary_docs)
        retriever.docstore.mset(list(zip(doc_ids, doc_contents)))

    add_documents(retriever, image_summaries, images)

    return retriever


# The vectorstore to use to index the summaries
vectorstore_mvr = Chroma(
    collection_name="multi-modal-rag-mv", embedding_function=OpenAIEmbeddings()
)

# Create retriever
retriever_multi_vector_img = create_multi_vector_retriever(
    vectorstore_mvr,
    image_summaries,
    images_base_64_processed,
)

RAG#

Create a pipeline that retrieves the relevant images based on semantic similarity to the input question.

Pass the images to Gemini for answer synthesis.

from langchain.schema.runnable import RunnableLambda, RunnablePassthrough


def prepare_images(docs):
    """
    Prepare images for prompt

    :param docs: A list of base64-encoded images from retriever.
    :return: Dict containing a list of base64-encoded strings.
    """
    b64_images = []
    for doc in docs:
        if isinstance(doc, Document):
            doc = doc.page_content
        b64_images.append(doc)
    return {"images": b64_images}


def img_prompt_func(data_dict, num_images=2):
    """
    Gemini prompt for image analysis.

    :param data_dict: A dict with images and a user-provided question.
    :param num_images: Number of images to include in the prompt.
    :return: A list containing message objects for each image and the text prompt.
    """
    messages = []
    if data_dict["context"]["images"]:
        for image in data_dict["context"]["images"][:num_images]:
            image_message = {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{image}"},
            }
            messages.append(image_message)
    text_message = {
        "type": "text",
        "text": (
            "You are an analyst tasked with answering questions about visual content.\n"
            "You will be give a set of image(s) from a slide deck / presentation.\n"
            "Use this information to answer the user question. \n"
            f"User-provided question: {data_dict['question']}\n\n"
        ),
    }
    messages.append(text_message)
    return [HumanMessage(content=messages)]


def multi_modal_rag_chain(retriever):
    """
    Multi-modal RAG chain
    """

    # Multi-modal LLM
    model = ChatGoogleGenerativeAI(temperature=0, model="gemini-pro-vision")

    # RAG pipeline
    chain = (
        {
            "context": retriever | RunnableLambda(prepare_images),
            "question": RunnablePassthrough(),
        }
        | RunnableLambda(img_prompt_func)
        | model
        | StrOutputParser()
    )

    return chain


# Create RAG chain
chain_multimodal_rag = multi_modal_rag_chain(retriever_multi_vector_img)
chain_multimodal_rag_mmembd = multi_modal_rag_chain(retriever_mmembd)
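
Before running the full evaluation, it can help to smoke-test one of the chains on a single question. The question below is a hypothetical example, not taken from the dataset:

# Quick smoke test of the multi-vector chain; the question is illustrative only
chain_multimodal_rag.invoke("What was Datadog's revenue in the most recent quarter?")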

Evaluation#

Run evaluation on our dataset:

  • task.name is the dataset of QA pairs that we cloned

  • eval_config specifies the LangSmith evaluator for our dataset, which will use GPT-4 as the grader

  • The grader will evaluate each chain-generated answer against the ground truth reference answer

import uuid

from langchain.smith import RunEvalConfig
from langsmith.client import Client

# Evaluator configuration
client = Client()
eval_config = RunEvalConfig(
    evaluators=["cot_qa"],
)

# Experiments
chain_map = {
    "multi_modal_mvretriever_gemini-pro-vision": chain_multimodal_rag,
    "multi_modal_mmembd_gemini-pro-vision": chain_multimodal_rag_mmembd,
}

# Run evaluation
run_id = uuid.uuid4().hex[:4]
test_runs = {}
for arch_name, chain in chain_map.items():
    test_runs[arch_name] = client.run_on_dataset(
        dataset_name=task.name,
        llm_or_chain_factory=lambda: (lambda x: x["Question"]) | chain,
        evaluation=eval_config,
        verbose=True,
        project_name=f"{arch_name}-{run_id}",
        project_metadata={"arch": arch_name, "model": "gemini-pro-vision"},
    )
View the evaluation results for project 'multi_modal_mvretriever_gemini-pro-vision-f276' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08a29acb-5ad6-42ce-a482-574c9e2e5306/compare?selectedSessions=e2890853-d80d-432f-a91d-60902f8244c6

View all tests for Dataset Multi-modal slide decks at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08a29acb-5ad6-42ce-a482-574c9e2e5306
[------------------------------------------------->] 10/10

Experiment Results

         output    feedback.COT Contextual Accuracy    error    execution time    run ID
count    10        10.000000                           0        10.000000         10
unique   10        NaN                                 0        NaN               10
top      ~26,800   NaN                                 NaN      NaN               afc9745a-8f9f-4092-8344-4617b1d91a7c
freq     1         NaN                                 NaN      NaN               1
mean     NaN       0.800000                            NaN      11.835799         NaN
std      NaN       0.421637                            NaN      0.746122          NaN
min      NaN       0.000000                            NaN      10.604844         NaN
25%      NaN       1.000000                            NaN      11.587839         NaN
50%      NaN       1.000000                            NaN      12.004324         NaN
75%      NaN       1.000000                            NaN      12.307524         NaN
max      NaN       1.000000                            NaN      12.919055         NaN
View the evaluation results for project 'multi_modal_mmembd_gemini-pro-vision-f276' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08a29acb-5ad6-42ce-a482-574c9e2e5306/compare?selectedSessions=575112d2-d7a8-472b-bafe-0698ff496606

View all tests for Dataset Multi-modal slide decks at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08a29acb-5ad6-42ce-a482-574c9e2e5306
[------------------------------------------------->] 10/10

Experiment Results

         output                              feedback.COT Contextual Accuracy    error    execution time    run ID
count    10                                  10.000000                           0        10.000000         10
unique   9                                   NaN                                 0        NaN               10
top      The answer is not in the images.    NaN                                 NaN      NaN               a17bb468-499b-4663-89a3-eeeb6207763c
freq     2                                   NaN                                 NaN      NaN               1
mean     NaN                                 0.300000                            NaN      14.063835         NaN
std      NaN                                 0.483046                            NaN      3.145174          NaN
min      NaN                                 0.000000                            NaN      10.307565         NaN
25%      NaN                                 0.000000                            NaN      11.285314         NaN
50%      NaN                                 0.000000                            NaN      14.042789         NaN
75%      NaN                                 0.750000                            NaN      16.025788         NaN
max      NaN                                 1.000000                            NaN      19.055173         NaN