Chat Extraction#
This benchmark combines classification, summarization, and extraction into a single composite task. The model is expected to output JSON formatted according to the expected schema.
# %pip install -U --quiet langchain langchain_benchmarks
# %pip install -U openai rapidfuzz fireworks-ai anthropic
For this code to work, please configure the LangSmith environment variables with your credentials, along with API keys for your LLM providers.
import getpass
import os
import uuid
uid = uuid.uuid4().hex[:4] # Avoid conflicts in project names
# Get your API key from https://smith.langchain.com/settings
api_keys = [
"LANGCHAIN_API_KEY",
"OPENAI_API_KEY",
"ANTHROPIC_API_KEY",
"FIREWORKS_API_KEY",
]
for key in api_keys:
if key not in os.environ:
os.environ[key] = getpass.getpass(f"Enter your {key}: ")
from langchain_benchmarks import clone_public_dataset, registry
task = registry["Chat Extraction"]
# Clone the dataset to your tenant
clone_public_dataset(task.dataset_id, dataset_name=task.name)
task
Dataset Chat Extraction already exists. Skipping.
You can access the dataset at https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08042749-504d-4509-9549-5f5c579115f6.
Name | Chat Extraction
Type | ExtractionTask
Dataset ID | 00f4444c-9460-4a82-b87a-f50096f1cfef
Description | A dataset designed to test an LLM's ability to extract and infer structured information from a conversation. The conversations are between a user and a support engineer. Outputs should be structured as a JSON object, and the task tests both the LLM's ability to structure the information correctly and its ability to perform simple classification tasks.
Schema#
Each extraction task defines its expected output schema as a Pydantic BaseModel object, which we can use to get the JSON schema object.
task.schema.schema()
{'title': 'GenerateTicket',
'description': 'Generate a ticket containing all the extracted information.',
'type': 'object',
'properties': {'issue_summary': {'title': 'Issue Summary',
'description': 'short (<10 word) summary of the issue or question',
'type': 'string'},
'question': {'title': 'Question',
'description': 'Information inferred from the the question.',
'allOf': [{'$ref': '#/definitions/QuestionCategorization'}]},
'response': {'title': 'Response',
'description': 'Information inferred from the the response.',
'allOf': [{'$ref': '#/definitions/ResponseCategorization'}]}},
'required': ['issue_summary', 'question', 'response'],
'definitions': {'QuestionCategory': {'title': 'QuestionCategory',
'description': 'An enumeration.',
'enum': ['Implementation Issues',
'Feature Requests',
'Concept Explanations',
'Code Optimization',
'Security and Privacy Concerns',
'Model Training and Fine-tuning',
'Data Handling and Manipulation',
'User Interaction Flow',
'Technical Integration',
'Error Handling and Logging',
'Customization and Configuration',
'External API and Data Source Integration',
'Language and Localization',
'Streaming and Real-time Processing',
'Tool Development',
'Function Calling',
'LLM Integrations',
'General Agent Question',
'General Chit Chat',
'Memory',
'Debugging Help',
'Application Design',
'Prompt Templates',
'Cost Tracking',
'Other'],
'type': 'string'},
'Sentiment': {'title': 'Sentiment',
'description': 'An enumeration.',
'enum': ['Negative', 'Neutral', 'Positive'],
'type': 'string'},
'ProgrammingLanguage': {'title': 'ProgrammingLanguage',
'description': 'An enumeration.',
'enum': ['python', 'javascript', 'typescript', 'unknown', 'other'],
'type': 'string'},
'QuestionCategorization': {'title': 'QuestionCategorization',
'type': 'object',
'properties': {'question_category': {'$ref': '#/definitions/QuestionCategory'},
'category_if_other': {'title': 'Category If Other',
'description': "question category if the category above is 'other'",
'type': 'string'},
'is_off_topic': {'title': 'Is Off Topic',
'description': 'If the input is general chit chat or does not pertain to technical inqueries about LangChain or building/debugging applications with LLMs/AI, it is off topic. For context, LangChain is a library and framework designed to assist in building applications with LLMs. Questions may also be about similar packages like LangServe, LangSmith, OpenAI, Anthropic, vectorstores, agents, etc.',
'type': 'boolean'},
'toxicity': {'title': 'Toxicity',
'description': 'Whether or not the input question is toxic',
'default': 0,
'exclusiveMaximum': 6,
'minimum': 0,
'type': 'integer'},
'sentiment': {'$ref': '#/definitions/Sentiment'},
'programming_language': {'$ref': '#/definitions/ProgrammingLanguage'}},
'required': ['question_category',
'is_off_topic',
'sentiment',
'programming_language']},
'ResponseType': {'title': 'ResponseType',
'description': 'An enumeration.',
'enum': ['resolve issue',
'provide guidance',
'request information',
'give up',
'none',
'other'],
'type': 'string'},
'ResponseCategorization': {'title': 'ResponseCategorization',
'type': 'object',
'properties': {'response_type': {'$ref': '#/definitions/ResponseType'},
'response_type_if_other': {'title': 'Response Type If Other',
'type': 'string'},
'confidence_level': {'title': 'Confidence Level',
'description': 'The confidence of the assistant in its answer.',
'exclusiveMaximum': 6,
'minimum': 0,
'type': 'integer'},
'followup_actions': {'title': 'Followup Actions',
'description': 'Actions the assistant recommended the user take.',
'type': 'array',
'items': {'type': 'string'}}},
'required': ['response_type', 'confidence_level']}}}
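For orientation, here is a rough Pydantic sketch consistent with the JSON schema above. It is illustrative only: the canonical models ship inside langchain_benchmarks and are exposed as task.schema, and the enum-valued fields are typed loosely as plain strings here for brevity.

from enum import Enum
from typing import List, Optional

from langchain.pydantic_v1 import BaseModel, Field


class Sentiment(str, Enum):
    NEGATIVE = "Negative"
    NEUTRAL = "Neutral"
    POSITIVE = "Positive"


class QuestionCategorization(BaseModel):
    question_category: str  # one of the QuestionCategory enum values above
    category_if_other: Optional[str] = Field(
        None, description="question category if the category above is 'other'"
    )
    is_off_topic: bool
    toxicity: int = Field(0, ge=0, lt=6)
    sentiment: Sentiment
    programming_language: str  # one of the ProgrammingLanguage enum values


class ResponseCategorization(BaseModel):
    response_type: str  # one of the ResponseType enum values
    response_type_if_other: Optional[str] = None
    confidence_level: int = Field(..., ge=0, lt=6)
    followup_actions: Optional[List[str]] = Field(
        None, description="Actions the assistant recommended the user take."
    )


class GenerateTicket(BaseModel):
    """Generate a ticket containing all the extracted information."""

    issue_summary: str = Field(
        description="short (<10 word) summary of the issue or question"
    )
    question: QuestionCategorization
    response: ResponseCategorization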
Define the extraction chain#
Let's build an extraction chain we can use to get structured information from the dialogues.
from langchain.chat_models import ChatOpenAI
from langchain.output_parsers.openai_functions import JsonOutputFunctionsParser
llm = ChatOpenAI(model="gpt-4-1106-preview", temperature=0).bind_functions(
functions=[task.schema],
function_call=task.schema.schema()["title"],
)
def format_run(dialogue_input: dict):
question = dialogue_input["question"]
answer = dialogue_input["answer"]
return {
"dialogue": f"<question>\n{question}\n</question>\n"
f"<assistant-response>\n{answer}\n</assistant-response>"
}
output_parser = JsonOutputFunctionsParser()
extraction_chain = (
format_run
| task.instructions
| llm
| output_parser
    # Wrap the result as 'output' so the format is unified for the evaluators
| (lambda x: {"output": x})
)
extraction_chain.invoke(
{"question": "how do i run llama 2 locally?", "answer": "Llama.cpp of course."}
)
{'output': {'issue_summary': 'Running Llama 2 Locally',
'question': {'question_category': 'Implementation Issues',
'is_off_topic': False,
'sentiment': 'Neutral',
'programming_language': 'unknown'},
'response': {'response_type': 'provide guidance', 'confidence_level': 1}}}
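As a quick sanity check before benchmarking, we can validate a single prediction against the expected schema. This is an illustrative one-off check; the json_schema feedback metric below performs the same validation on every example.

# Validate one chain output against the task schema (illustrative only).
prediction = extraction_chain.invoke(
    {"question": "how do i run llama 2 locally?", "answer": "Llama.cpp of course."}
)
ticket = task.schema.parse_obj(prediction["output"])
print(ticket.issue_summary)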
Now it's time to measure how well our chain performs!
Evaluate#
Let's evaluate the chain now.
from langsmith.client import Client
from langchain_benchmarks.extraction.tasks.chat_extraction import get_eval_config
client = Client()
eval_config = get_eval_config()
test_run = client.run_on_dataset(
dataset_name=task.name,
llm_or_chain_factory=extraction_chain,
evaluation=eval_config,
verbose=True,
project_name=f"gpt-4-1106-preview-{uid}",
project_metadata={
"arch": "openai-functions",
"model": "gpt-4-1106-preview",
},
)
View the evaluation results for project 'gpt-4-1106-preview-5689' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08042749-504d-4509-9549-5f5c579115f6/compare?selectedSessions=0c022691-a7ac-4545-b2bc-58aab2d476e8
View all tests for Dataset Chat Extraction at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08042749-504d-4509-9549-5f5c579115f6
[------------------------------------------------->] 27/27
Experiment Results:

| | feedback.json_edit_distance | feedback.json_schema | feedback.toxicity_similarity | feedback.sentiment_similarity | feedback.confidence_level_similarity | feedback.question_category | feedback.off_topic_similarity | feedback.programming_language_similarity | error | execution_time |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 27.000000 | 27.0 | 27.0 | 27.0 | 27.000000 | 27.000000 | 27.000000 | 27.000000 | 0 | 27.000000 |
| unique | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN |
| top | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| freq | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| mean | 0.283000 | 1.0 | 0.0 | 1.0 | 0.940741 | 0.555556 | 0.888889 | 0.592593 | NaN | 6.949585 |
| std | 0.181282 | 0.0 | 0.0 | 0.0 | 0.093064 | 0.506370 | 0.320256 | 0.500712 | NaN | 1.639494 |
| min | 0.049430 | 1.0 | 0.0 | 1.0 | 0.800000 | 0.000000 | 0.000000 | 0.000000 | NaN | 4.248728 |
| 25% | 0.104149 | 1.0 | 0.0 | 1.0 | 0.800000 | 0.000000 | 1.000000 | 0.000000 | NaN | 5.679244 |
| 50% | 0.336343 | 1.0 | 0.0 | 1.0 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | NaN | 6.558088 |
| 75% | 0.378270 | 1.0 | 0.0 | 1.0 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | NaN | 8.300396 |
| max | 0.594255 | 1.0 | 0.0 | 1.0 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | NaN | 10.123084 |
Compare to Claude-2#
Let's compare our results to Anthropic's Claude-2. We will emulate the function-calling interface.
from typing import Any, Dict, Type
from langchain.chat_models import ChatAnthropic
from langchain.output_parsers.xml import XMLOutputParser
from langchain.prompts import ChatPromptTemplate
from langchain.pydantic_v1 import BaseModel
claude_prompt = ChatPromptTemplate.from_messages(
[
(
"system",
"You are a data extraction bot tasked with extracting and inferring information from dialogues and generating tickets. Always respond "
"only with XML based on the following JSON schema:\n{schema}",
),
(
"user",
"Generate a ticket from the following question-response pair:\n"
"<Dialogue>\n{dialogue}\n</Dialogue>\n"
"Remember, respond directly with this format:\n"
"<{function_call}>\n...\n</{function_call}>"
"RESPOND ONLY IN XML THEN STOP.",
),
]
)
prompt = claude_prompt.partial(
schema=task.schema.schema_json(), function_call=task.schema.schema()["title"]
)
claude = ChatAnthropic(model="claude-2", temperature=0, max_tokens_to_sample=2048)
class MergeSchema:
"""Merge the XML Output Parser schema into the output."""
def __init__(self, schema: Type[BaseModel]):
self.schema = schema
@property
def _func_name(self) -> str:
return self.schema.__name__
def _merge_schema(self, parsed_output: Any, schema: Type[BaseModel]):
merged_output = {}
if isinstance(parsed_output, dict):
items = parsed_output.items()
elif isinstance(parsed_output, list):
items = [(k, v) for item in parsed_output for k, v in item.items()]
else:
return parsed_output
for key, value in items:
if key in schema.__fields__:
field_info = schema.__fields__[key]
if isinstance(value, list):
if issubclass(field_info.type_, (BaseModel, dict)):
result = self._merge_schema(value, field_info.type_)
elif all(
isinstance(item, dict) and item.keys() == {"item"}
for item in value
):
result = [next(iter(item.values())) for item in value]
else:
result = value
else:
result = value
else:
result = value
if key in merged_output:
if isinstance(merged_output[key], list):
merged_output[key].append(result)
else:
merged_output[key] = [merged_output[key], result]
else:
merged_output[key] = result
return merged_output
def __call__(self, parsed_output: dict) -> Dict[str, Any]:
if self._func_name not in parsed_output:
return parsed_output
return {
self._func_name: self._merge_schema(
parsed_output[self._func_name], self.schema
)
}
def try_parse(llm_output, config):
try:
output_chain = XMLOutputParser() | MergeSchema(task.schema)
parsed = output_chain.invoke(llm_output, config)
        # Wrap the result as 'output' so the format is unified for the evaluators
return {"output": parsed.get("GenerateTicket")}
except Exception as e:
return {"output": llm_output, "error": str(e)}
claude_extraction_chain = format_run | prompt | claude | try_parse
result = claude_extraction_chain.invoke(
{"question": "how do i run llama 2 locally?", "answer": "Llama.cpp of course."}
)
result
{'output': {'issue_summary': 'How to run Llama locally',
'question': {'question_category': 'Implementation Issues',
'is_off_topic': 'false',
'toxicity': '0',
'sentiment': 'Neutral',
'programming_language': 'unknown'},
'response': {'response_type': 'provide guidance',
'confidence_level': '3',
'followup_actions': ['Ask clarifying questions about the specific issue',
'Provide documentation or examples for running Llama locally']}}}
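To see why the merge step is needed: XMLOutputParser returns each element's children as a list of single-key dicts, and MergeSchema folds that list back into a single object keyed by the schema's fields. Here is a minimal illustration on a hand-written (hypothetical) parse result:

# Hypothetical XMLOutputParser output for a short ticket (not from the dataset):
raw = {
    "GenerateTicket": [
        {"issue_summary": "How to run Llama locally"},
        {"question": [{"sentiment": "Neutral"}, {"is_off_topic": "false"}]},
    ]
}
print(MergeSchema(task.schema)(raw))
# {'GenerateTicket': {'issue_summary': 'How to run Llama locally',
#                     'question': {'sentiment': 'Neutral', 'is_off_topic': 'false'}}}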
claude_test_run = client.run_on_dataset(
dataset_name=task.name,
llm_or_chain_factory=claude_extraction_chain,
evaluation=eval_config,
verbose=True,
project_name=f"claude-2-json-schema-to-xml-{uid}",
project_metadata={
"arch": "claude-json-schema-xml-output",
},
)
View the evaluation results for project 'claude-2-json-schema-to-xml-5689' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08042749-504d-4509-9549-5f5c579115f6/compare?selectedSessions=3f590999-a9d1-48be-83dd-e84acb99a195
View all tests for Dataset Chat Extraction at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08042749-504d-4509-9549-5f5c579115f6
[------------------------------------------------->] 27/27
Experiment Results:

| | feedback.json_edit_distance | feedback.json_schema | feedback.toxicity_similarity | feedback.sentiment_similarity | feedback.confidence_level_similarity | feedback.question_category | feedback.off_topic_similarity | feedback.programming_language_similarity | error | execution_time |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 27.000000 | 27.000000 | 27.0 | 27.000000 | 27.000000 | 27.000000 | 27.0 | 27.000000 | 0 | 27.000000 |
| unique | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN |
| top | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| freq | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| mean | 0.371950 | 0.777778 | 1.0 | 0.925926 | 0.970370 | 0.481481 | 0.0 | 0.444444 | NaN | 10.556105 |
| std | 0.108628 | 0.423659 | 0.0 | 0.181007 | 0.072403 | 0.509175 | 0.0 | 0.506370 | NaN | 1.790352 |
| min | 0.105033 | 0.000000 | 1.0 | 0.500000 | 0.800000 | 0.000000 | 0.0 | 0.000000 | NaN | 8.435542 |
| 25% | 0.312445 | 1.000000 | 1.0 | 1.000000 | 1.000000 | 0.000000 | 0.0 | 0.000000 | NaN | 9.077631 |
| 50% | 0.390000 | 1.000000 | 1.0 | 1.000000 | 1.000000 | 0.000000 | 0.0 | 0.000000 | NaN | 10.059124 |
| 75% | 0.462694 | 1.000000 | 1.0 | 1.000000 | 1.000000 | 1.000000 | 0.0 | 1.000000 | NaN | 11.795210 |
| max | 0.537678 | 1.000000 | 1.0 | 1.000000 | 1.000000 | 1.000000 | 0.0 | 1.000000 | NaN | 15.072743 |
So it looks like the edit distance is pretty good, but the schema validation could use some improvement.
We defined the schema in JSON, then asked for XML. Let's try to be consistent.
Try with an XSD schema definition#
In this variant, let's see whether Claude performs better when we keep the structure consistent.
from typing import Any, Dict, Type
from langchain.chat_models import ChatAnthropic
from langchain.output_parsers.xml import XMLOutputParser
from langchain.prompts import ChatPromptTemplate
from langchain.pydantic_v1 import BaseModel
# This is the schema the model will populate
xsd = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:simpleType name="QuestionCategory">
<xs:restriction base="xs:string">
<xs:enumeration value="Implementation Issues"/>
<xs:enumeration value="Feature Requests"/>
<xs:enumeration value="Concept Explanations"/>
<xs:enumeration value="Code Optimization"/>
<xs:enumeration value="Security and Privacy Concerns"/>
<xs:enumeration value="Model Training and Fine-tuning"/>
<xs:enumeration value="Data Handling and Manipulation"/>
<xs:enumeration value="User Interaction Flow"/>
<xs:enumeration value="Technical Integration"/>
<xs:enumeration value="Error Handling and Logging"/>
<xs:enumeration value="Customization and Configuration"/>
<xs:enumeration value="External API and Data Source Integration"/>
<xs:enumeration value="Language and Localization"/>
<xs:enumeration value="Streaming and Real-time Processing"/>
<xs:enumeration value="Tool Development"/>
<xs:enumeration value="Function Calling"/>
<xs:enumeration value="LLM Integrations"/>
<xs:enumeration value="General Agent Questions"/>
<xs:enumeration value="General Chit Chat"/>
<xs:enumeration value="Memory"/>
<xs:enumeration value="Debugging Help"/>
<xs:enumeration value="Application Design"/>
<xs:enumeration value="Prompt Templates"/>
<xs:enumeration value="Cost Tracking"/>
<xs:enumeration value="Other"/>
</xs:restriction>
</xs:simpleType>
<xs:simpleType name="Sentiment">
<xs:restriction base="xs:string">
<xs:enumeration value="Negative"/>
<xs:enumeration value="Neutral"/>
<xs:enumeration value="Positive"/>
</xs:restriction>
</xs:simpleType>
<xs:simpleType name="ProgrammingLanguage">
<xs:restriction base="xs:string">
<xs:enumeration value="python"/>
<xs:enumeration value="javascript"/>
<xs:enumeration value="typescript"/>
<xs:enumeration value="unknown"/>
<xs:enumeration value="other"/>
</xs:restriction>
</xs:simpleType>
<xs:complexType name="QuestionCategorization">
<xs:sequence>
<xs:element name="question_category" type="QuestionCategory"/>
<xs:element name="category_if_other" type="xs:string" minOccurs="0"/>
<xs:element name="is_off_topic" type="xs:boolean"/>
<xs:element name="toxicity" type="xs:int">
<xs:minInclusive value="0"/>
<xs:maxInclusive value="5"/>
</xs:element>
<xs:element name="sentiment" type="Sentiment"/>
<xs:element name="programming_language" type="ProgrammingLanguage"/>
</xs:sequence>
</xs:complexType>
<xs:simpleType name="ResponseType">
<xs:restriction base="xs:string">
<xs:enumeration value="resolve issue"/>
<xs:enumeration value="provide guidance"/>
<xs:enumeration value="request information"/>
<xs:enumeration value="give up"/>
<xs:enumeration value="none"/>
<xs:enumeration value="other"/>
</xs:restriction>
</xs:simpleType>
<xs:complexType name="ResponseCategorization">
<xs:sequence>
<xs:element name="response_type" type="ResponseType"/>
<xs:element name="response_type_if_other" type="xs:string" minOccurs="0"/>
<xs:element name="confidence_level" type="xs:int">
<xs:minInclusive value="0"/>
<xs:maxInclusive value="5"/>
</xs:element>
<xs:element name="followup_actions" type="xs:string" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
</xs:complexType>
<xs:complexType name="GenerateTicket">
<xs:sequence>
<xs:element name="issue_summary" type="xs:string"/>
<xs:element name="question" type="QuestionCategorization"/>
<xs:element name="response" type="ResponseCategorization"/>
</xs:sequence>
</xs:complexType>
</xs:schema>"""
prompt = claude_prompt.partial(schema=xsd, function_call=task.schema.schema()["title"])
claude_extraction_chain = format_run | prompt | claude | try_parse
result = claude_extraction_chain.invoke(
{
"question": "how do i run llama 2 locally?",
"answer": "Llama.cpp of course. Afterwords remember to install it, then add it to your path!",
}
)
result
{'output': {'issue_summary': 'How to run Llama locally',
'question': {'question_category': 'LLM Integrations',
'is_off_topic': 'false',
'toxicity': '0',
'sentiment': 'Neutral',
'programming_language': 'unknown'},
'response': {'response_type': 'provide guidance',
'confidence_level': '3',
'followup_actions': ['Install Llama locally', 'Add Llama to path']}}}
claude_xsd_test_run = client.run_on_dataset(
dataset_name=task.name,
llm_or_chain_factory=claude_extraction_chain,
evaluation=eval_config,
verbose=True,
project_name=f"claude-2-xsd-to-xml-{uid}",
project_metadata={
"arch": "claude-xml",
},
)
View the evaluation results for project 'claude-2-xsd-to-xml-5689' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08042749-504d-4509-9549-5f5c579115f6/compare?selectedSessions=dc7656d8-00ef-4048-9ce5-38ef72af593c
View all tests for Dataset Chat Extraction at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08042749-504d-4509-9549-5f5c579115f6
[------------------------------------------------->] 27/27
Experiment Results:

| | feedback.json_edit_distance | feedback.json_schema | feedback.toxicity_similarity | feedback.sentiment_similarity | feedback.confidence_level_similarity | feedback.question_category | feedback.off_topic_similarity | feedback.programming_language_similarity | error | execution_time |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 27.000000 | 27.000000 | 27.0 | 27.000000 | 27.000000 | 27.000000 | 27.0 | 27.000000 | 0 | 27.000000 |
| unique | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN |
| top | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| freq | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| mean | 0.394232 | 0.518519 | 1.0 | 0.907407 | 0.970370 | 0.370370 | 0.0 | 0.518519 | NaN | 11.128319 |
| std | 0.117880 | 0.509175 | 0.0 | 0.197924 | 0.072403 | 0.492103 | 0.0 | 0.509175 | NaN | 4.845637 |
| min | 0.116608 | 0.000000 | 1.0 | 0.500000 | 0.800000 | 0.000000 | 0.0 | 0.000000 | NaN | 7.833285 |
| 25% | 0.332400 | 0.000000 | 1.0 | 1.000000 | 1.000000 | 0.000000 | 0.0 | 0.000000 | NaN | 8.888438 |
| 50% | 0.380435 | 1.000000 | 1.0 | 1.000000 | 1.000000 | 0.000000 | 0.0 | 1.000000 | NaN | 9.629613 |
| 75% | 0.456592 | 1.000000 | 1.0 | 1.000000 | 1.000000 | 1.000000 | 0.0 | 1.000000 | NaN | 11.143679 |
| max | 0.644007 | 1.000000 | 1.0 | 1.000000 | 1.000000 | 1.000000 | 0.0 | 1.000000 | NaN | 32.068304 |
The JSON schema metric dropped, meaning that, contrary to expectations, the output was less friendly to our parser than before.
Let's try an open-source model: llama-v2-34b-code-instruct.
Try with Llama 2#
llama-v2-34b-code-instruct is an open-source model designed to be good at code generation and other tasks. Let's benchmark it.
import json
from langchain.chat_models import ChatFireworks
from langchain.output_parsers.json import parse_json_markdown
llama_prompt = ChatPromptTemplate.from_messages(
[
(
"system",
"You are a data extraction bot tasked with extracting and inferring information from dialogues and generating tickets. Always respond "
"only with json based on the following JSON schema:\n{schema}",
),
(
"user",
"Generate a ticket from the following question-response pair:\n"
"<Dialogue>\n{dialogue}\n</Dialogue>\n"
"Remember, respond directly with this format:\n"
'{{"{function_call}": ...}}\n'
"RESPOND ONLY IN JSON THEN STOP.",
),
]
)
prompt = llama_prompt.partial(
schema=task.schema.schema_json(), function_call=task.schema.schema()["title"]
)
llm = ChatFireworks(
model="accounts/fireworks/models/llama-v2-34b-code-instruct",
temperature=0,
model_kwargs={"max_tokens": 4000},
)
def parse_output(ai_message):
content = ai_message.content
parser = lambda x: json.loads(x, strict=False)
try:
parsed = parse_json_markdown(content, parser=parser)
if "GenerateTicket" in parsed:
return {"output": parsed["GenerateTicket"]}
return {"output": parsed}
except json.JSONDecodeError:
return {"output": content}
fireworks_extraction_chain = format_run | prompt | llm | parse_output
result = fireworks_extraction_chain.invoke(
{"question": "how do i run llama 2 locally?", "answer": "Llama.cpp of course."}
)
result
{'output': {'issue_summary': 'How to run Llama 2 locally',
'question': {'question_category': 'Implementation Issues',
'is_off_topic': False,
'toxicity': 0,
'sentiment': 'Neutral',
'programming_language': 'cpp'},
'response': {'response_type': 'Resolve Issue',
'confidence_level': 5,
'followup_actions': ['Please provide more information about the environment (OS, versions, etc.) and the specific issue you are experiencing.']}}}
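Note that parse_json_markdown lets parse_output accept the model's answer whether or not it is wrapped in a markdown code fence. A small illustration with a hypothetical model message:

# Hypothetical fenced model output; parse_output handles fenced and bare JSON alike.
from langchain.schema import AIMessage

fenced = AIMessage(content='```json\n{"GenerateTicket": {"issue_summary": "demo"}}\n```')
print(parse_output(fenced))
# {'output': {'issue_summary': 'demo'}}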
llama_v2_test_run = client.run_on_dataset(
dataset_name=task.name,
llm_or_chain_factory=fireworks_extraction_chain,
evaluation=eval_config,
verbose=True,
project_name=f"llama-v2-34b-code-instruct-{uid}",
    project_metadata={"arch": "json-schema-json-output", "model": "llama-v2-34b-code-instruct"},
)
View the evaluation results for project 'llama-v2-34b-code-instruct-5689' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08042749-504d-4509-9549-5f5c579115f6/compare?selectedSessions=dc2e0648-7e65-4d60-a149-15c24bca943b
View all tests for Dataset Chat Extraction at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08042749-504d-4509-9549-5f5c579115f6
[------------------------------------------------->] 27/27
Experiment Results:

| | feedback.json_edit_distance | feedback.json_schema | feedback.toxicity_similarity | feedback.sentiment_similarity | feedback.confidence_level_similarity | feedback.question_category | feedback.off_topic_similarity | feedback.programming_language_similarity | error | execution_time |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 17.000000 | 27.000000 | 27.000000 | 27.000000 | 27.000000 | 27.000000 | 27.000000 | 27.000000 | 0 | 27.000000 |
| unique | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN |
| top | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| freq | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| mean | 0.399687 | 0.333333 | 0.444444 | 0.444444 | 0.540741 | 0.074074 | 0.518519 | 0.222222 | NaN | 4.738518 |
| std | 0.097771 | 0.480384 | 0.506370 | 0.423659 | 0.439632 | 0.266880 | 0.509175 | 0.423659 | NaN | 3.162978 |
| min | 0.197279 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | NaN | 3.224190 |
| 25% | 0.325069 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | NaN | 3.595067 |
| 50% | 0.413203 | 0.000000 | 0.000000 | 0.500000 | 0.800000 | 0.000000 | 1.000000 | 0.000000 | NaN | 3.744033 |
| 75% | 0.471366 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | NaN | 4.211040 |
| max | 0.552430 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | NaN | 18.660901 |
Compare results#
Here we will take a quick look at the underlying results. You can review the results to see relative performance, both in aggregate and example by example.
df = (
test_run.to_dataframe()
.join(claude_test_run.to_dataframe(), rsuffix="_claude")
.join(claude_xsd_test_run.to_dataframe(), rsuffix="_claude_xsd")
.join(llama_v2_test_run.to_dataframe(), rsuffix="_llama_v2")
)
df.head(5)
| | inputs.answer | inputs.question | outputs.output | reference.output | feedback.json_edit_distance | feedback.json_schema | feedback.toxicity_similarity | feedback.sentiment_similarity | feedback.confidence_level_similarity | feedback.question_category | ... | feedback.json_edit_distance_llama_v2 | feedback.json_schema_llama_v2 | feedback.toxicity_similarity_llama_v2 | feedback.sentiment_similarity_llama_v2 | feedback.confidence_level_similarity_llama_v2 | feedback.question_category_llama_v2 | feedback.off_topic_similarity_llama_v2 | feedback.programming_language_similarity_llama_v2 | error_llama_v2 | execution_time_llama_v2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 23a81130-2ad9-46cf-ad27-46589bcea94a | Pour joindre les deux outputs, vous pouvez uti... | je travail sur python. je souhaite joindre ces... | {'issue_summary': 'Joining two outputs in Pyth... | {'question': {'toxicity': 0, 'sentiment': 'Neu... | 0.089219 | 1 | 0 | 1.0 | 1.0 | 1 | ... | 0.552239 | 1 | 0.0 | 0.5 | 0.8 | 0 | 0 | 1 | None | 3.981128 |
| 598316ec-f5e2-4b4d-83a8-36adb18e12fe | Hmm, I'm not sure. | example for dalle agent | {'issue_summary': 'Example for DALL-E Agent', ... | {'question': {'toxicity': 0, 'sentiment': 'Neu... | 0.171103 | 1 | 0 | 1.0 | 0.8 | 0 | ... | NaN | 0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | None | 10.942758 |
| d1a1a2e8-6f4c-4325-8aaa-ea20e2449268 | To run Llama2 using pandas, you can follow the... | how do I run llama2 using pandas | {'issue_summary': 'Running Llama2 with Pandas'... | {'question': {'toxicity': 0, 'sentiment': 'Neu... | 0.594255 | 1 | 0 | 1.0 | 1.0 | 0 | ... | NaN | 0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | None | 3.628600 |
| 140a4819-0046-469d-b4df-8e747ddae112 | To clear the conversation in ConversationalRet... | if Im useing ConversationalRetrievalChain how ... | {'issue_summary': 'Clearing Conversation in Co... | {'question': {'toxicity': 0, 'sentiment': 'Neu... | 0.353261 | 1 | 0 | 1.0 | 1.0 | 0 | ... | 0.393643 | 0 | 1.0 | 0.5 | 0.8 | 0 | 1 | 0 | None | 3.711707 |
| 7b0a9dd9-68ce-41a1-9f9d-067d93175477 | To perform the task of creating an app that in... | I want to create an app which:\n- chats with u... | {'issue_summary': 'Building an app with Langch... | {'question': {'toxicity': 0, 'sentiment': 'Neu... | 0.562950 | 1 | 0 | 1.0 | 0.8 | 1 | ... | 0.436747 | 1 | 1.0 | 0.5 | 1.0 | 0 | 1 | 1 | None | 4.410890 |

5 rows × 56 columns
Here we compare the aggregate metrics side by side#
df = (
test_run.get_aggregate_feedback()
.add_suffix(".gpt-4")
.join(claude_test_run.get_aggregate_feedback(), rsuffix=".claude")
.join(claude_xsd_test_run.get_aggregate_feedback(), rsuffix=".claude_xsd")
.join(llama_v2_test_run.get_aggregate_feedback(), rsuffix=".llama_v2")
)
from IPython.display import HTML, display
feedback_columns = sorted(
{col.rsplit(".", 1)[0] for col in df.columns if col.startswith("feedback.")}
)
def render_metric(df, metric):
sub_cols = [col for col in df.columns if col.startswith(metric)]
display(HTML(f"<h3>{metric.split('.')[-1]}</h3>"))
display(df[sub_cols][df.index.isin(["mean", "std"])])
feedback_columns
['feedback',
'feedback.confidence_level_similarity',
'feedback.json_edit_distance',
'feedback.json_schema',
'feedback.off_topic_similarity',
'feedback.programming_language_similarity',
'feedback.question_category',
'feedback.sentiment_similarity',
'feedback.toxicity_similarity']
render_metric(df, "execution_time")
execution_time

| | execution_time.gpt-4 | execution_time | execution_time.claude_xsd | execution_time.llama_v2 |
|---|---|---|---|---|
| mean | 6.949585 | 10.556105 | 11.128319 | 4.738518 |
| std | 1.639494 | 1.790352 | 4.845637 | 3.162978 |

Note: the unsuffixed columns correspond to the claude-2 run; since the gpt-4 columns were already suffixed, the join never needed to apply the claude rsuffix.
for metric in feedback_columns:
render_metric(df, metric)
feedback

| | feedback.json_edit_distance.gpt-4 | feedback.json_schema.gpt-4 | feedback.toxicity_similarity.gpt-4 | feedback.sentiment_similarity.gpt-4 | feedback.confidence_level_similarity.gpt-4 | feedback.question_category.gpt-4 | feedback.off_topic_similarity.gpt-4 | feedback.programming_language_similarity.gpt-4 | feedback.json_edit_distance | feedback.json_schema | ... | feedback.off_topic_similarity.claude_xsd | feedback.programming_language_similarity.claude_xsd | feedback.json_edit_distance.llama_v2 | feedback.json_schema.llama_v2 | feedback.toxicity_similarity.llama_v2 | feedback.sentiment_similarity.llama_v2 | feedback.confidence_level_similarity.llama_v2 | feedback.question_category.llama_v2 | feedback.off_topic_similarity.llama_v2 | feedback.programming_language_similarity.llama_v2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mean | 0.283000 | 1.0 | 0.0 | 1.0 | 0.940741 | 0.555556 | 0.888889 | 0.592593 | 0.371950 | 0.777778 | ... | 0.0 | 0.518519 | 0.399687 | 0.333333 | 0.444444 | 0.444444 | 0.540741 | 0.074074 | 0.518519 | 0.222222 |
| std | 0.181282 | 0.0 | 0.0 | 0.0 | 0.093064 | 0.506370 | 0.320256 | 0.500712 | 0.108628 | 0.423659 | ... | 0.0 | 0.509175 | 0.097771 | 0.480384 | 0.506370 | 0.423659 | 0.439632 | 0.266880 | 0.509175 | 0.423659 |

2 rows × 32 columns
confidence_level_similarity
| | feedback.confidence_level_similarity.gpt-4 | feedback.confidence_level_similarity | feedback.confidence_level_similarity.claude_xsd | feedback.confidence_level_similarity.llama_v2 |
|---|---|---|---|---|
| mean | 0.940741 | 0.970370 | 0.970370 | 0.540741 |
| std | 0.093064 | 0.072403 | 0.072403 | 0.439632 |
json_edit_distance
| | feedback.json_edit_distance.gpt-4 | feedback.json_edit_distance | feedback.json_edit_distance.claude_xsd | feedback.json_edit_distance.llama_v2 |
|---|---|---|---|---|
| mean | 0.283000 | 0.371950 | 0.394232 | 0.399687 |
| std | 0.181282 | 0.108628 | 0.117880 | 0.097771 |
json_schema
| | feedback.json_schema.gpt-4 | feedback.json_schema | feedback.json_schema.claude_xsd | feedback.json_schema.llama_v2 |
|---|---|---|---|---|
| mean | 1.0 | 0.777778 | 0.518519 | 0.333333 |
| std | 0.0 | 0.423659 | 0.509175 | 0.480384 |
off_topic_similarity
| | feedback.off_topic_similarity.gpt-4 | feedback.off_topic_similarity | feedback.off_topic_similarity.claude_xsd | feedback.off_topic_similarity.llama_v2 |
|---|---|---|---|---|
| mean | 0.888889 | 0.0 | 0.0 | 0.518519 |
| std | 0.320256 | 0.0 | 0.0 | 0.509175 |
programming_language_similarity
| | feedback.programming_language_similarity.gpt-4 | feedback.programming_language_similarity | feedback.programming_language_similarity.claude_xsd | feedback.programming_language_similarity.llama_v2 |
|---|---|---|---|---|
| mean | 0.592593 | 0.444444 | 0.518519 | 0.222222 |
| std | 0.500712 | 0.506370 | 0.509175 | 0.423659 |
question_category
| | feedback.question_category.gpt-4 | feedback.question_category | feedback.question_category.claude_xsd | feedback.question_category.llama_v2 |
|---|---|---|---|---|
| mean | 0.555556 | 0.481481 | 0.370370 | 0.074074 |
| std | 0.506370 | 0.509175 | 0.492103 | 0.266880 |
sentiment_similarity
| | feedback.sentiment_similarity.gpt-4 | feedback.sentiment_similarity | feedback.sentiment_similarity.claude_xsd | feedback.sentiment_similarity.llama_v2 |
|---|---|---|---|---|
| mean | 1.0 | 0.925926 | 0.907407 | 0.444444 |
| std | 0.0 | 0.181007 | 0.197924 | 0.423659 |
toxicity_similarity
| | feedback.toxicity_similarity.gpt-4 | feedback.toxicity_similarity | feedback.toxicity_similarity.claude_xsd | feedback.toxicity_similarity.llama_v2 |
|---|---|---|---|---|
| mean | 0.0 | 1.0 | 1.0 | 0.444444 |
| std | 0.0 | 0.0 | 0.0 | 0.506370 |
Next steps#
Try it out yourself! You can check out some additional experiments on open-source models in this repository.