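Introduction#

Tool usage tasks are designed to evaluate how well an agent can use tools to accomplish an objective.

Each task defines an environment in which the agent operates. The environment consists of a set of tools and a way to read the state of the environment (more on that later).

The tasks let you stress test the agent in different ways:

Can the agent use a single tool effectively?
Can the agent use more than 10 tools effectively?
Can the agent correctly incorporate information returned by the tool (and ignore internal knowledge)?

To help in this evaluation, each task is associated with a LangSmith dataset that includes input/output examples of varying difficulty.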

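Schema#

To make it possible to evaluate different agent implementations, we use a standardized schema. We'll illustrate it with the following example taken from tool usage.

Dataset#

Each task corresponds to a LangSmith dataset with the following schema:

Inputs

name	type	meaning
question	str	the user question

Outputs

name	type	meaning
reference	str	the expected answer
expected_steps	List[str]	the list of tools that should be invoked
order_matters	bool	whether the tools should be invoked in a specific order
state	Optional[Any]	the state of the system after the agent has taken its actions

Here is an example that contains the following keys/values: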

{
  "input": {"question": "weather in LA right now?"},
  "output": {
      "reference": "Sunny, Temperature: 75°F",
      "order_matters": true,
      "expected_steps": [
        "find_locations_by_name",
        "get_current_weather_for_location"
      ]
    }
}
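
As a rough illustration of the schema above, the sketch below models a dataset example with `TypedDict` and checks that an output carries the expected keys. The class names and the `missing_output_keys` helper are hypothetical and not part of LangChain Benchmarks. Treating `state` as the only optional key is an assumption based on its `Optional[Any]` type.

```python
from typing import Any, List, Optional, TypedDict


class ExampleInput(TypedDict):
    question: str


class _RequiredOutput(TypedDict):
    reference: str
    expected_steps: List[str]
    order_matters: bool


class ExampleOutput(_RequiredOutput, total=False):
    # Assumption: `state` may be omitted when a task has no environment state.
    state: Optional[Any]


def missing_output_keys(output: dict) -> List[str]:
    """Return the required output keys that are absent from `output`."""
    required = ("reference", "expected_steps", "order_matters")
    return [key for key in required if key not in output]


example = {
    "input": {"question": "weather in LA right now?"},
    "output": {
        "reference": "Sunny, Temperature: 75°F",
        "order_matters": True,
        "expected_steps": [
            "find_locations_by_name",
            "get_current_weather_for_location",
        ],
    },
}

print(missing_output_keys(example["output"]))  # prints []
```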

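Agent#

To work with the evaluators provided by LangChain Benchmarks (you are, of course, free to write your own evaluators!), an agent must accept question as an input and return: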

{
    "output": "It's super sunny. Like 75F", // the output from the agent
    "intermediate_steps": [... "find_locations_by_name" ...], // list of the intermediate steps taken by the agent (see format in LangChain)
    "state": .., // Can be anything; this is the state of the environment after the agent has taken all of its actions (optional key)
}
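
To compare a response like this against `expected_steps`, the tool names have to be pulled out of `intermediate_steps`. The sketch below assumes each step is an `(action, observation)` pair whose action exposes a `tool` attribute, as in the agent output shown later on this page. `FakeAction` and `tool_names` are hypothetical stand-ins so the example runs without LangChain, and the empty tool inputs and `"..."` observations are placeholders.

```python
from typing import Any, List, NamedTuple, Tuple


class FakeAction(NamedTuple):
    """Hypothetical stand-in for a LangChain agent action; only `tool` is read."""

    tool: str
    tool_input: dict


def tool_names(intermediate_steps: List[Tuple[Any, Any]]) -> List[str]:
    """Collect the name of each invoked tool, in call order."""
    return [action.tool for action, _observation in intermediate_steps]


response = {
    "output": "It's super sunny. Like 75F",
    "intermediate_steps": [
        (FakeAction("find_locations_by_name", {}), "..."),
        (FakeAction("get_current_weather_for_location", {}), "..."),
    ],
}

print(tool_names(response["intermediate_steps"]))
# prints ['find_locations_by_name', 'get_current_weather_for_location']
```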

Tasks#

You can check an up-to-date list of tool usage tasks in the registry:

from langchain_benchmarks import registry

registry.filter(Type="ToolUsageTask")

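Name	Type	Dataset ID	Description
Tool Usage - Typewriter (1 tool)	ToolUsageTask	59577193-8938-4ccf-92a7-e8a96bcf4f86	Environment with a single tool that accepts a single letter as input and prints it on a piece of virtual paper. The objective of this task is to evaluate the model's ability to use the provided tool to repeat a given input string. For example, if the string is 'abc', the tools 'a', 'b', and 'c' must be invoked in that order. The dataset includes examples of varying difficulty. The difficulty is measured by the length of the string.
Tool Usage - Typewriter (26 tools)	ToolUsageTask	128af05e-aa00-4e3b-a958-d166dd450581	Environment with 26 tools, each representing a letter of the alphabet. The objective of this task is to evaluate the model's ability to use tools for a simple repetition task. For example, if the string is 'abc', the tools 'a', 'b', and 'c' must be invoked in that order. The dataset includes examples of varying difficulty. The difficulty is measured by the length of the string. This is a variation of the typewriter task in which 26 parameterless tools are provided instead of a single tool that takes a letter as an argument.
Tool Usage - Relational Data	ToolUsageTask	1d89f4b3-5f73-48cf-a127-2fdeb22f6d84	Environment with fake data about users and their locations and favorite foods. The environment provides a set of tools that can be used to query the data. The objective of this task is to evaluate the ability to use the provided tools to answer questions about relational data. The dataset contains 21 examples of varying difficulty. The difficulty is measured by the number of tools that need to be used to answer the question. Each example is composed of a question, a reference answer, and information about the sequence in which tools should be used to answer the question. Success is measured by the ability to answer the question correctly and efficiently.
Multiverse Math	ToolUsageTask	47ed57bc-e852-4f84-a23e-cce4793864e9	An environment that contains a few basic math operations, but with altered results. For example, multiplication of 5*3 will be reinterpreted as 5*3*1.1. The basic operations retain some basic properties, such as commutativity, associativity, and distributivity; however, the results are different than expected. The objective of this task is to evaluate the ability to use the provided tools to solve simple math questions and ignore any innate knowledge about math. This task is associated with 20 test examples.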

Let's look in a bit more detail at what a tool usage task is:

task = registry["Tool Usage - Typewriter (26 tools)"]
task

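Name	Tool Usage - Typewriter (26 tools)
Type	ToolUsageTask
Dataset ID	128af05e-aa00-4e3b-a958-d166dd450581
Description	Environment with 26 tools, each representing a letter of the alphabet. The objective of this task is to evaluate the model's ability to use tools for a simple repetition task. For example, if the string is 'abc', the tools 'a', 'b', and 'c' must be invoked in that order. The dataset includes examples of varying difficulty. The difficulty is measured by the length of the string. This is a variation of the typewriter task in which 26 parameterless tools are provided instead of a single tool that takes a letter as an argument.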

Tool usage tasks are associated with an environment:

@dataclasses.dataclass(frozen=True)
class ToolUsageEnvironment:
    """An instance of an environment for tool usage."""

    tools: List[BaseTool]
    """The tools that can be used in the environment."""

    read_state: Optional[Callable[[], Any]] = None
    """A function that returns the current state of the environment."""

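Here, we'll dig into the typewriter task to explain what the environment state is.

The typewriter task has 26 tools, each of which prints a letter on a piece of virtual paper: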

env = task.create_environment()
env.tools[:4]

[StructuredTool(name='a', description='a() -> str - Run to Type the letter "a".', args_schema=<class 'pydantic.v1.main.aSchema'>, func=<function _create_typing_func.<locals>.func at 0x7b3a9f62c9a0>),
 StructuredTool(name='b', description='b() -> str - Run to Type the letter "b".', args_schema=<class 'pydantic.v1.main.bSchema'>, func=<function _create_typing_func.<locals>.func at 0x7b3a9f62c5e0>),
 StructuredTool(name='c', description='c() -> str - Run to Type the letter "c".', args_schema=<class 'pydantic.v1.main.cSchema'>, func=<function _create_typing_func.<locals>.func at 0x7b3a9f62cae0>),
 StructuredTool(name='d', description='d() -> str - Run to Type the letter "d".', args_schema=<class 'pydantic.v1.main.dSchema'>, func=<function _create_typing_func.<locals>.func at 0x7b3a9f62cb80>)]

env.tools[0].invoke({})  # Invoke a()
env.tools[0].invoke({})  # invoke a()
env.tools[2].invoke({})  # invoke c()

'OK'

env.read_state()  # Shows the content of the virtual paper

'aac'
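
To make the idea of environment state concrete, here is a minimal plain-Python sketch of a typewriter-like environment. It is not the library's implementation: `create_typewriter` is a hypothetical helper that builds 26 parameterless callables sharing one list (the "paper") and a `read_state` function that reads it back.

```python
from typing import Callable, Dict, List, Tuple


def create_typewriter() -> Tuple[Dict[str, Callable[[], str]], Callable[[], str]]:
    """Build 26 letter tools that share one virtual sheet of paper."""
    paper: List[str] = []  # The environment state, mutated by the tools

    def make_key(letter: str) -> Callable[[], str]:
        def press() -> str:
            paper.append(letter)  # Typing a letter changes the state
            return "OK"

        return press

    tools = {letter: make_key(letter) for letter in "abcdefghijklmnopqrstuvwxyz"}

    def read_state() -> str:
        return "".join(paper)

    return tools, read_state


tools, read_state = create_typewriter()
tools["a"]()
tools["a"]()
tools["c"]()
print(read_state())  # prints aac
```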

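Create an agent!#

Now that you understand how the test environment works, let's create an agent that we can test!

Because an agent interacts with the environment via tools and can change the state of the environment during a run, what we actually want is the ability to create a fresh agent and a fresh environment for each test run.

We'll do this using a factory. A factory is just a fancy name in computer science for an object that can create other objects. In this case, we'll have an agent factory that we can call, and it will create a fresh agent for us on each call.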
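
The following toy sketch shows why a factory helps. `Paper`, `run_agent`, and `make_runner` are hypothetical names used only for illustration: reusing one environment leaks state from one run into the next, while the factory hands every run a brand-new environment.

```python
from typing import Callable, List


class Paper:
    """Hypothetical minimal environment: a sheet of virtual paper."""

    def __init__(self) -> None:
        self.text: List[str] = []

    def press(self, letter: str) -> None:
        self.text.append(letter)

    def read_state(self) -> str:
        return "".join(self.text)


def run_agent(env: Paper, question: str) -> str:
    """Toy 'agent' that types each character of the question."""
    for ch in question:
        env.press(ch)
    return env.read_state()


def make_runner() -> Callable[[str], str]:
    """Factory: every call returns a runner bound to a brand-new environment."""
    env = Paper()
    return lambda question: run_agent(env, question)


shared = Paper()
run_agent(shared, "abc")
print(run_agent(shared, "xyz"))  # prints abcxyz (state leaked from the first run)

print(make_runner()("abc"))  # prints abc
print(make_runner()("xyz"))  # prints xyz
```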

We'll use the StandardAgentFactory, which under the hood creates a standard LangChain tool-calling agent. It can be used with any chat model that supports tool calling.

from langchain_anthropic.chat_models import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate

from langchain_benchmarks.tool_usage.agents import StandardAgentFactory

model = ChatAnthropic(model="claude-3-opus-20240229", temperature=0)
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "{instructions}"),  # Populated from task.instructions automatically
        (
            "human",
            "{question}",
        ),  # Each evaluation example is associated with a question
        ("placeholder", "{agent_scratchpad}"),  # Space for the agent to do work
    ]
)

agent_factory = StandardAgentFactory(task, model, prompt)

Here are the instructions for the task:

task.instructions

"Repeat the given string by using the provided tools. Do not write anything else or provide any explanations. For example, if the string is 'abc', you must invoke the tools 'a', 'b', and 'c' in that order. Please invoke the functions without any arguments."

Let's test it out:

from langchain import globals

globals.set_verbose(True)
agent = agent_factory()
agent.invoke({"question": "abc"})
globals.set_verbose(False)

> Entering new AgentExecutor chain...

Invoking: `a` with `{}`
responded: [{'text': '<thinking>\nTo repeat the string "abc", I need to call the a(), b(), and c() functions in that order. No parameters are required for these functions.\n</thinking>', 'type': 'text'}, {'id': 'toolu_01MQ6oTx2j2uNGCR5LBVeKui', 'input': {}, 'name': 'a', 'type': 'tool_use'}, {'id': 'toolu_01AytT1jvNNR67VodMkhbq7r', 'input': {}, 'name': 'b', 'type': 'tool_use'}, {'id': 'toolu_015VkTYUV5hWcobtduqssi9k', 'input': {}, 'name': 'c', 'type': 'tool_use'}]

OK
Invoking: `b` with `{}`
responded: [{'text': '<thinking>\nTo repeat the string "abc", I need to call the a(), b(), and c() functions in that order. No parameters are required for these functions.\n</thinking>', 'type': 'text'}, {'id': 'toolu_01MQ6oTx2j2uNGCR5LBVeKui', 'input': {}, 'name': 'a', 'type': 'tool_use'}, {'id': 'toolu_01AytT1jvNNR67VodMkhbq7r', 'input': {}, 'name': 'b', 'type': 'tool_use'}, {'id': 'toolu_015VkTYUV5hWcobtduqssi9k', 'input': {}, 'name': 'c', 'type': 'tool_use'}]

OK
Invoking: `c` with `{}`
responded: [{'text': '<thinking>\nTo repeat the string "abc", I need to call the a(), b(), and c() functions in that order. No parameters are required for these functions.\n</thinking>', 'type': 'text'}, {'id': 'toolu_01MQ6oTx2j2uNGCR5LBVeKui', 'input': {}, 'name': 'a', 'type': 'tool_use'}, {'id': 'toolu_01AytT1jvNNR67VodMkhbq7r', 'input': {}, 'name': 'b', 'type': 'tool_use'}, {'id': 'toolu_015VkTYUV5hWcobtduqssi9k', 'input': {}, 'name': 'c', 'type': 'tool_use'}]

OK[]

> Finished chain.

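Benchmarking#

How does one evaluate an agent? Given a particular task and input, an agent uses tools to produce an output and/or change the state of the environment.

To evaluate an agent, we can check the following:

Did the agent use the expected tools?
Did the agent use the tools in the most effective way; e.g., was the order of tool invocation correct?
Did the environment end up in the correct final state after the agent used the tools? (e.g., does my calendar contain all the scheduled meetings?)
Did the agent's output match the expected reference output?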

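Each task is associated with a standard evaluator that performs evaluation appropriate for the task; for example,

Comparing the output to the reference using an LLM that grades the response.
Comparing expected_steps to the list of tools in intermediate_steps using simple list equality.
Comparing the state of the environment against the expected state (if present in both the dataset and the agent's response).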
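
The step comparison can be sketched in a few lines. `steps_match` is a hypothetical helper, not the library's evaluator. The ordered branch is the simple list equality described above; the unordered branch (same tools, same number of calls, any order) is an assumption about how `order_matters=False` might reasonably be handled.

```python
from collections import Counter
from typing import List


def steps_match(expected: List[str], actual: List[str], order_matters: bool) -> bool:
    """Compare expected tool names with the tools the agent actually invoked."""
    if order_matters:
        return expected == actual  # Simple list equality
    # Assumption: ignore order but still require the same number of calls per tool
    return Counter(expected) == Counter(actual)


expected = ["find_locations_by_name", "get_current_weather_for_location"]
swapped = ["get_current_weather_for_location", "find_locations_by_name"]

print(steps_match(expected, expected, order_matters=True))  # prints True
print(steps_match(expected, swapped, order_matters=True))  # prints False
print(steps_match(expected, swapped, order_matters=False))  # prints True
```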

Each task is associated with its own task-specific evaluator!

eval_config = task.get_eval_config()
eval_config

RunEvalConfig(evaluators=[], custom_evaluators=[<langchain_benchmarks.tool_usage.evaluators.AgentTrajectoryEvaluator object at 0x7b3a9ea5b110>], batch_evaluators=None, reference_key=None, prediction_key=None, input_key=None, eval_llm=None)

Set up the code to run against all tasks:

import datetime
import uuid

from langsmith.client import Client

from langchain_benchmarks import (
    __version__,
    clone_public_dataset,
    model_registry,
    registry,
)
from langchain_benchmarks.rate_limiting import RateLimiter

Create an experiment ID. We'll use it to tag our runs so that we can later retrieve the run data from LangSmith.

experiment_id = uuid.uuid4().hex[:]

Run the evaluation against all tasks.

client = Client()  # Launch langsmith client for cloning datasets
today = datetime.date.today().isoformat()

# You can use an optional rate limiter to rate limit your requests!
rate_limiter = RateLimiter(requests_per_second=1)


# Set up 2-tuples of (model name, model instance)
# You can update this list with any model that supports tool calling.
# See list here: https://python.langchain.ac.cn/docs/integrations/chat/
tests = [
    (
        "claude-3-haiku-20240307",
        ChatAnthropic(model="claude-3-haiku-20240307", temperature=0),
    )
]


for task in registry.tasks:
    if task.type != "ToolUsageTask":
        continue

    dataset_name = task.name + f" ({today})"
    clone_public_dataset(task.dataset_id, dataset_name=dataset_name)

    for model_name, model in tests:
        print()
        print(f"Benchmarking {task.name} with model: {model_name}")
        eval_config = task.get_eval_config()

        agent_factory = StandardAgentFactory(
            task, model, prompt, rate_limiter=rate_limiter
        )

        client.run_on_dataset(
            dataset_name=dataset_name,
            llm_or_chain_factory=agent_factory,
            evaluation=eval_config,
            verbose=False,
            project_name=f"{model_name}-{task.name}-{today}-{experiment_id}",
            concurrency_level=5,
            project_metadata={
                "model": model_name,
                "id": experiment_id,
                "task": task.name,
                "date": today,
                "langchain_benchmarks_version": __version__,
            },
        )

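Advanced usage#

The following section demonstrates slightly more "advanced" usage, for when you want to completely customize the agent runtime in a way that remains compatible with our test runner.

We'll also apply an adapter to the agent, which captures its inputs and outputs (e.g., adds information about the agent's environment at the end of the run) so that we can evaluate it.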
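
The adapter idea can be sketched in plain Python. `with_state` below is a hypothetical wrapper, not the library's `apply_agent_executor_adapter`: it calls the wrapped agent and, if a state reader is supplied, attaches the environment state to the response under the `state` key.

```python
from typing import Any, Callable, Dict, List, Optional

AgentFn = Callable[[Dict[str, Any]], Dict[str, Any]]


def with_state(run: AgentFn, state_reader: Optional[Callable[[], Any]] = None) -> AgentFn:
    """Wrap `run` so that its response also reports the final environment state."""

    def wrapped(inputs: Dict[str, Any]) -> Dict[str, Any]:
        result = dict(run(inputs))  # Copy so the agent's own dict is not mutated
        if state_reader is not None:
            result["state"] = state_reader()  # Read the state after the run
        return result

    return wrapped


paper: List[str] = []


def toy_agent(inputs: Dict[str, Any]) -> Dict[str, Any]:
    """Toy agent that 'types' the question onto the shared paper."""
    paper.extend(inputs["question"])
    return {"output": "done", "intermediate_steps": []}


agent = with_state(toy_agent, state_reader=lambda: "".join(paper))
result = agent({"question": "abc"})
print(result["state"])  # prints abc
```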

Custom agent factory#

If you want more configurability than CustomRunnableAgentFactory provides, you can create your own AgentFactory using the following pattern.

The AgentExecutor should accept question as an input and include the fields output, intermediate_steps and potentially state in its response – for this we will wrap the agent executor in an adapter (apply_agent_executor_adapter) that will help match the expected schema.

from typing import Optional

from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate

from langchain_benchmarks.rate_limiting import RateLimiter, with_rate_limit
from langchain_benchmarks.schema import ToolUsageTask
from langchain_benchmarks.tool_usage.agents import apply_agent_executor_adapter


class CustomAgentFactory:
    def __init__(
        self,
        task: ToolUsageTask,
        *,
        # It can be useful to add a rate-limiter
        # which will limit the number of requests per second
        # when running evaluation.
        rate_limiter: Optional[RateLimiter] = None,
    ) -> None:
        self.task = task
        self.rate_limiter = rate_limiter

    def __call__(self):
        # This factory creates a new environment for every agent run.
        # The reason is that the environment may be associated with an environment state (e.g., typewriter)
        # which is changed by the actions of the agent.
        # At the end of the run, the environment state will be read.
        env = self.task.create_environment()  # Create a new environment for every agent run!
        tools = env.tools
        model = ChatAnthropic(model="claude-3-opus-20240229", temperature=0)
        prompt = ChatPromptTemplate.from_messages(
            [
                ("system", self.task.instructions),  # Task instructions go in the system message
                (
                    "human",
                    "{question}",
                ),  # Each evaluation example is associated with a question
                ("placeholder", "{agent_scratchpad}"),
            ]
        )

        # This is the standard tool calling agent implementation
        # Feel free to replace it with any other implementation you want!
        # https://python.langchain.ac.cn/docs/modules/agents/how_to/custom_agent/
        agent = create_tool_calling_agent(model, env.tools, prompt)

        if self.rate_limiter:
            agent = with_rate_limit(agent, self.rate_limiter)

        executor = AgentExecutor(
            agent=agent,
            tools=env.tools,
            handle_parsing_errors=True,
            return_intermediate_steps=True,
        )

        # Apply the adapters so that inputs and outputs match dataset schema
        # state_reader automatically adds the state of the environment at the end of the run.
        return apply_agent_executor_adapter(executor, state_reader=env.read_state)

task

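Name	Tool Usage - Typewriter (26 tools)
Type	ToolUsageTask
Dataset ID	128af05e-aa00-4e3b-a958-d166dd450581
Description	Environment with 26 tools, each representing a letter of the alphabet. The objective of this task is to evaluate the model's ability to use tools for a simple repetition task. For example, if the string is 'abc', the tools 'a', 'b', and 'c' must be invoked in that order. The dataset includes examples of varying difficulty. The difficulty is measured by the length of the string. This is a variation of the typewriter task in which 26 parameterless tools are provided instead of a single tool that takes a letter as an argument.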

custom_agent_factory = CustomAgentFactory(task)

agent = custom_agent_factory()

agent.invoke({"question": "abc"})

{'question': 'abc',
 'output': [],
 'intermediate_steps': [(ToolAgentAction(tool='a', tool_input={}, log='\nInvoking: `a` with `{}`\nresponded: [{\'text\': \'<thinking>\\nTo repeat the string "abc", I need to call the a(), b(), and c() functions in that order. No parameters are required for these functions.\\n</thinking>\', \'type\': \'text\'}, {\'id\': \'toolu_016f6CZwwFmdz2h8KbdGRVjj\', \'input\': {}, \'name\': \'a\', \'type\': \'tool_use\'}, {\'id\': \'toolu_01JvfeTpU3hEuS7PknFk5a8S\', \'input\': {}, \'name\': \'b\', \'type\': \'tool_use\'}, {\'id\': \'toolu_01NbBCY5Fg62RsyAAUd4n2g1\', \'input\': {}, \'name\': \'c\', \'type\': \'tool_use\'}]\n\n', message_log=[AIMessageChunk(content=[{'text': '<thinking>\nTo repeat the string "abc", I need to call the a(), b(), and c() functions in that order. No parameters are required for these functions.\n</thinking>', 'type': 'text'}, {'id': 'toolu_016f6CZwwFmdz2h8KbdGRVjj', 'input': {}, 'name': 'a', 'type': 'tool_use'}, {'id': 'toolu_01JvfeTpU3hEuS7PknFk5a8S', 'input': {}, 'name': 'b', 'type': 'tool_use'}, {'id': 'toolu_01NbBCY5Fg62RsyAAUd4n2g1', 'input': {}, 'name': 'c', 'type': 'tool_use'}], id='run-42ea263e-e52a-4fc7-8aa3-71e16a9db42b', tool_calls=[{'name': 'a', 'args': {}, 'id': 'toolu_016f6CZwwFmdz2h8KbdGRVjj'}, {'name': 'b', 'args': {}, 'id': 'toolu_01JvfeTpU3hEuS7PknFk5a8S'}, {'name': 'c', 'args': {}, 'id': 'toolu_01NbBCY5Fg62RsyAAUd4n2g1'}], tool_call_chunks=[{'name': 'a', 'args': '{}', 'id': 'toolu_016f6CZwwFmdz2h8KbdGRVjj', 'index': 0}, {'name': 'b', 'args': '{}', 'id': 'toolu_01JvfeTpU3hEuS7PknFk5a8S', 'index': 1}, {'name': 'c', 'args': '{}', 'id': 'toolu_01NbBCY5Fg62RsyAAUd4n2g1', 'index': 2}])], tool_call_id='toolu_016f6CZwwFmdz2h8KbdGRVjj'),
   'OK'),
  (ToolAgentAction(tool='b', tool_input={}, log='\nInvoking: `b` with `{}`\nresponded: [{\'text\': \'<thinking>\\nTo repeat the string "abc", I need to call the a(), b(), and c() functions in that order. No parameters are required for these functions.\\n</thinking>\', \'type\': \'text\'}, {\'id\': \'toolu_016f6CZwwFmdz2h8KbdGRVjj\', \'input\': {}, \'name\': \'a\', \'type\': \'tool_use\'}, {\'id\': \'toolu_01JvfeTpU3hEuS7PknFk5a8S\', \'input\': {}, \'name\': \'b\', \'type\': \'tool_use\'}, {\'id\': \'toolu_01NbBCY5Fg62RsyAAUd4n2g1\', \'input\': {}, \'name\': \'c\', \'type\': \'tool_use\'}]\n\n', message_log=[AIMessageChunk(content=[{'text': '<thinking>\nTo repeat the string "abc", I need to call the a(), b(), and c() functions in that order. No parameters are required for these functions.\n</thinking>', 'type': 'text'}, {'id': 'toolu_016f6CZwwFmdz2h8KbdGRVjj', 'input': {}, 'name': 'a', 'type': 'tool_use'}, {'id': 'toolu_01JvfeTpU3hEuS7PknFk5a8S', 'input': {}, 'name': 'b', 'type': 'tool_use'}, {'id': 'toolu_01NbBCY5Fg62RsyAAUd4n2g1', 'input': {}, 'name': 'c', 'type': 'tool_use'}], id='run-42ea263e-e52a-4fc7-8aa3-71e16a9db42b', tool_calls=[{'name': 'a', 'args': {}, 'id': 'toolu_016f6CZwwFmdz2h8KbdGRVjj'}, {'name': 'b', 'args': {}, 'id': 'toolu_01JvfeTpU3hEuS7PknFk5a8S'}, {'name': 'c', 'args': {}, 'id': 'toolu_01NbBCY5Fg62RsyAAUd4n2g1'}], tool_call_chunks=[{'name': 'a', 'args': '{}', 'id': 'toolu_016f6CZwwFmdz2h8KbdGRVjj', 'index': 0}, {'name': 'b', 'args': '{}', 'id': 'toolu_01JvfeTpU3hEuS7PknFk5a8S', 'index': 1}, {'name': 'c', 'args': '{}', 'id': 'toolu_01NbBCY5Fg62RsyAAUd4n2g1', 'index': 2}])], tool_call_id='toolu_01JvfeTpU3hEuS7PknFk5a8S'),
   'OK'),
  (ToolAgentAction(tool='c', tool_input={}, log='\nInvoking: `c` with `{}`\nresponded: [{\'text\': \'<thinking>\\nTo repeat the string "abc", I need to call the a(), b(), and c() functions in that order. No parameters are required for these functions.\\n</thinking>\', \'type\': \'text\'}, {\'id\': \'toolu_016f6CZwwFmdz2h8KbdGRVjj\', \'input\': {}, \'name\': \'a\', \'type\': \'tool_use\'}, {\'id\': \'toolu_01JvfeTpU3hEuS7PknFk5a8S\', \'input\': {}, \'name\': \'b\', \'type\': \'tool_use\'}, {\'id\': \'toolu_01NbBCY5Fg62RsyAAUd4n2g1\', \'input\': {}, \'name\': \'c\', \'type\': \'tool_use\'}]\n\n', message_log=[AIMessageChunk(content=[{'text': '<thinking>\nTo repeat the string "abc", I need to call the a(), b(), and c() functions in that order. No parameters are required for these functions.\n</thinking>', 'type': 'text'}, {'id': 'toolu_016f6CZwwFmdz2h8KbdGRVjj', 'input': {}, 'name': 'a', 'type': 'tool_use'}, {'id': 'toolu_01JvfeTpU3hEuS7PknFk5a8S', 'input': {}, 'name': 'b', 'type': 'tool_use'}, {'id': 'toolu_01NbBCY5Fg62RsyAAUd4n2g1', 'input': {}, 'name': 'c', 'type': 'tool_use'}], id='run-42ea263e-e52a-4fc7-8aa3-71e16a9db42b', tool_calls=[{'name': 'a', 'args': {}, 'id': 'toolu_016f6CZwwFmdz2h8KbdGRVjj'}, {'name': 'b', 'args': {}, 'id': 'toolu_01JvfeTpU3hEuS7PknFk5a8S'}, {'name': 'c', 'args': {}, 'id': 'toolu_01NbBCY5Fg62RsyAAUd4n2g1'}], tool_call_chunks=[{'name': 'a', 'args': '{}', 'id': 'toolu_016f6CZwwFmdz2h8KbdGRVjj', 'index': 0}, {'name': 'b', 'args': '{}', 'id': 'toolu_01JvfeTpU3hEuS7PknFk5a8S', 'index': 1}, {'name': 'c', 'args': '{}', 'id': 'toolu_01NbBCY5Fg62RsyAAUd4n2g1', 'index': 2}])], tool_call_id='toolu_01NbBCY5Fg62RsyAAUd4n2g1'),
   'OK')],
 'state': 'abc'}