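Query Analyzer
A query analyzer is a component commonly used in information retrieval systems to help improve the relevance of retrieved results.
The analyzer takes the user's raw search query (e.g., "cheap restaurants near me"), along with additional metadata, as input and translates the query into a more precise structured query.
The resulting structured query might look like this: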
{
    "text": null,
    "entity_type": "restaurant",
    "filters": [
        {
            "attribute": "price",
            "op": "<",
            "value": 100
        },
        {
            "attribute": "location",
            "op": "near",
            "value": "user_geo_location"
        }
    ]
}
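A structured query of this shape can be applied to data mechanically. The following is a minimal sketch, not part of Kork or langchain: it assumes in-memory records (the restaurant data is hypothetical), handles only simple comparison operators, and ignores operators it does not know, such as `near`, which would need a geo lookup.

```python
import operator
from typing import Any, Dict, List

# Comparison operators this sketch knows how to evaluate.
COMPARATORS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "=": operator.eq,
    "!=": operator.ne,
}


def record_matches(record: Dict[str, Any], query: Dict[str, Any]) -> bool:
    """Return True if the record satisfies the entity type and all supported filters."""
    entity_type = query.get("entity_type")
    if entity_type is not None and record.get("entity_type") != entity_type:
        return False
    for flt in query.get("filters", []):
        compare = COMPARATORS.get(flt["op"])
        if compare is None:
            continue  # Unsupported operator (e.g. "near"): ignored in this sketch.
        value = record.get(flt["attribute"])
        if value is None or not compare(value, flt["value"]):
            return False
    return True


def apply_structured_query(
    records: List[Dict[str, Any]], query: Dict[str, Any]
) -> List[Dict[str, Any]]:
    """Keep only the records that match the structured query."""
    return [record for record in records if record_matches(record, query)]


# Hypothetical data, for illustration only.
restaurants = [
    {"entity_type": "restaurant", "name": "Noodle Bar", "price": 40},
    {"entity_type": "restaurant", "name": "Chez Luxe", "price": 250},
    {"entity_type": "hotel", "name": "Budget Inn", "price": 60},
]
query = {
    "text": None,
    "entity_type": "restaurant",
    "filters": [
        {"attribute": "price", "op": "<", "value": 100},
        {"attribute": "location", "op": "near", "value": "user_geo_location"},
    ],
}

matches = apply_structured_query(restaurants, query)
print([record["name"] for record in matches])  # ['Noodle Bar']
```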
To illustrate how to use Kork, let's re-implement the query analyzer that langchain implements with its QueryConstructor chain.
import langchain
from langchain.llms import OpenAI
from typing import List, Any
from kork import CodeChain
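Filters
We'll define a few functions that can be used for query analysis and expose them as foreign functions.
Note
Representing each binary operator with its own function seems to work better than a more general representation that uses a single function:
    def f_(attribute: str, op: Any, value: Any) -> Any:
        """Apply a filter to the given attribute using an operator (">", "<", "=", etc.) and a value."""
        ...
As an experiment, you can try switching to the more general representation and see whether you can get it to work well!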
def gt(attribute: str, value: Any) -> Any:
    """Filter to where attribute > value"""
    return {"attribute": attribute, "op": ">", "value": value}


def gte(attribute: str, value: Any) -> Any:
    """Filter to where attribute >= value"""
    return {"attribute": attribute, "op": ">=", "value": value}


def eq(attribute: str, value: Any) -> Any:
    """Filter to where attribute = value"""
    return {"attribute": attribute, "op": "=", "value": value}


def neq(attribute: str, value: Any) -> Any:
    """Filter to where attribute != value"""
    return {"attribute": attribute, "op": "!=", "value": value}


def lte(attribute: str, value: Any) -> Any:
    """Filter to where attribute <= value"""
    return {"attribute": attribute, "op": "<=", "value": value}


def lt(attribute: str, value: Any) -> Any:
    """Filter to where attribute < value"""
    return {"attribute": attribute, "op": "<", "value": value}


def and_(filters: List[Any]) -> Any:
    """Combine a list of filters using an AND operator."""
    return {
        "op": "and",
        # `filters` is wrapped in another list, so the output nests it one level deeper.
        "filters": [filters],
    }


def or_(filters: List[Any]) -> Any:
    """Combine a list of filters using an OR operator."""
    return {
        "op": "or_",  # Recorded with the trailing underscore.
        # `filters` is wrapped in another list, so the output nests it one level deeper.
        "filters": [filters],
    }


def in_(attribute: str, value: List[Any]) -> Any:
    """Filter to where attribute is one of the items in value"""
    return {"attribute": attribute, "op": "in", "value": value}


def request(query: str, filters: List[Any]) -> Any:
    """Build a search request from a search string and filters."""
    return {
        "query": query,
        "filters": filters,
    }
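For the experiment suggested in the note above, the single-function representation could start from something like the sketch below. It returns the same dictionary shape as the per-operator helpers; the `ALLOWED_OPS` check is an added assumption for illustration and is not something the original functions do.

```python
from typing import Any, Dict

# Operators covered by the dedicated helper functions (gt, gte, eq, neq, lte, lt, in_).
# Validating against this set is an assumption of this sketch.
ALLOWED_OPS = {">", ">=", "=", "!=", "<=", "<", "in"}


def f_(attribute: str, op: str, value: Any) -> Dict[str, Any]:
    """Apply a filter to the given attribute using an operator and a value."""
    if op not in ALLOWED_OPS:
        raise ValueError(f"Unsupported operator: {op!r}")
    return {"attribute": attribute, "op": op, "value": value}


print(f_("price", "<", 100))  # {'attribute': 'price', 'op': '<', 'value': 100}
```

With this form the model has to choose the operator string itself, which may be one reason the per-operator functions behave more reliably.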
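Prompt
Let's define a prompt that explains the task.
You can experiment with the prompt to see whether you can improve it!
See the prompt in the Langchain query constructor chain for inspiration.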
from langchain.prompts import PromptTemplate
template = """\
Your task is to analyze the user's query and translate it into a search request composed \
of a search string and a filter.
Here is a set of functions that you can rely on:
{external_functions_block}
Here is a schema of the data being queried.
```TypeScript
type schema = {{
author: string // The author of the document
pub_year: string // 4 digit format year representing year when doc was published
price: number // how much it costs to buy the document
}}
```
Filter attributes must match the data schema. If the query seems to include other attributes, \
assume those are not filters, but part of the search string.
Pay attention to the doc string in the schema for each attribute. If it doesn't look like the \
usage of the filter does not match the description in comment treat it as part of the search query.
Filters can be combined using `and_` and `or_`.
Please encapsulate the code in <code> and </code> tags.
"""
prompt_template = PromptTemplate(
    template=template, input_variables=["external_functions_block"]
)
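The doubled braces around the TypeScript schema are there so that they survive formatting as literal braces, while `{external_functions_block}` is replaced. The sketch below illustrates this with plain `str.format` on a shortened stand-in for the template; it assumes that `PromptTemplate`'s default format follows the same brace rules, and the function listing is a hypothetical placeholder, not what Kork actually generates.

```python
# Shortened stand-in for the template above.
short_template = """\
Here is a set of functions that you can rely on:

{external_functions_block}

type schema = {{
    price: number // how much it costs to buy the document
}}
"""

# Hypothetical listing of available functions, for illustration only.
functions_block = "gt(attribute, value)  // Filter to where attribute > value"

rendered = short_template.format(external_functions_block=functions_block)
print(rendered)
```

In the rendered text the placeholder is replaced by the listing and each `{{` or `}}` becomes a single brace.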
Examples
Now, let's define a few example programs.
from kork.parser import parse
examples = [
    (
        "documents published after 2020",
        'var result = request(null, gte("pub_year", 2020))',
    ),
    (
        "toy models of cars by john smith or ones that were published after 2012",
        'var result = request("toy models of cars", or_([eq("author", "john smith"), gte("pub_year", 2012)]))',
    ),
    (
        "share prices by john or oliver",
        # Use `in_`, the name of the function defined above.
        'var result = request("share prices", in_("author", ["john", "oliver"]))',
    ),
]
examples_in_ast = [(query, parse(code)) for query, code in examples]
examples_in_ast[0]
('documents published after 2020',
Program(stmts=(VarDecl(name='result', value=FunctionCall(name='request', args=(Literal(value=None), FunctionCall(name='gte', args=(Literal(value='pub_year'), Literal(value=2020)))))),)))
Let's test it out
llm = OpenAI(
    model_name="text-davinci-003",
    temperature=0,
    max_tokens=3000,
    frequency_penalty=0,
    presence_penalty=0,
    top_p=1.0,
    verbose=True,
)
chain = CodeChain.from_defaults(
    llm=llm,
    examples=examples_in_ast,
    context=[gte, gt, eq, neq, lte, lt, in_, and_, or_, request],
    instruction_template=prompt_template,
    input_formatter=None,
)
langchain.verbose = False
queries = [
    "publications by mama bear published after 2013",
    "documents about florida from 2013 or docs written by mama bear",
    "smells like sunshine",
    "documents that discuss $5",
    "documents that discuss the $50 debt of the bank",
    "docs that cost more than $150",
]
results = []
for query in queries:
    results.append(chain(inputs={"query": query}))
from kork.display import as_html_dict, display_html_results
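Look carefully at the results below! Not all of the generated requests are correct. An empty errors list only means that no run-time exception occurred; it does not mean that the result is correct.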
display_html_results(
    [as_html_dict(r) for r in results], columns=["query", "code", "result", "errors"]
)
| | query | code | result | errors |
|---|---|---|---|---|
| 0 | publications by mama bear published after 2013 | var result = request("publications", and_([eq("author", "mama bear"), gte("pub_year", 2013)])) | {'query': 'publications', 'filters': {'op': 'and', 'filters': [[{'attribute': 'author', 'op': '=', 'value': 'mama bear'}, {'attribute': 'pub_year', 'op': '>=', 'value': 2013}]]}} | [] |
| 1 | documents about florida from 2013 or docs written by mama bear | var result = request("documents about florida", or_([gte("pub_year", 2013), eq("author", "mama bear")])) | {'query': 'documents about florida', 'filters': {'op': 'or_', 'filters': [[{'attribute': 'pub_year', 'op': '>=', 'value': 2013}, {'attribute': 'author', 'op': '=', 'value': 'mama bear'}]]}} | [] |
| 2 | smells like sunshine | var result = request("smells like sunshine", null) | {'query': 'smells like sunshine', 'filters': None} | [] |
| 3 | documents that discuss $5 | var result = request("documents that discuss $5", eq("price", 5)) | {'query': 'documents that discuss $5', 'filters': {'attribute': 'price', 'op': '=', 'value': 5}} | [] |
| 4 | documents that discuss the $50 debt of the bank | var result = request("documents that discuss the $50 debt of the bank", null) | {'query': 'documents that discuss the $50 debt of the bank', 'filters': None} | [] |
| 5 | docs that cost more than $150 | var result = request(null, gt("price", 150)) | {'query': None, 'filters': {'attribute': 'price', 'op': '>', 'value': 150}} | [] |
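Since the errors column only reports run-time exceptions, some extra checks on the generated requests can help. The sketch below is one possible check, not something Kork provides: it walks a request dictionary of the shape shown in the result column (including the nested lists produced by `and_` and `or_`) and reports filter attributes that are not in the schema from the prompt. The names `SCHEMA_ATTRIBUTES`, `collect_attributes`, and `unknown_attributes` are hypothetical.

```python
from typing import Any, Dict, Set

# Attribute names taken from the schema in the prompt.
SCHEMA_ATTRIBUTES = {"author", "pub_year", "price"}


def collect_attributes(node: Any) -> Set[str]:
    """Recursively collect every attribute name used in a filter structure."""
    if isinstance(node, list):
        found: Set[str] = set()
        for item in node:
            found |= collect_attributes(item)
        return found
    if isinstance(node, dict):
        if "attribute" in node:
            return {node["attribute"]}
        # Combined filters ("and" / "or_") keep their children under "filters".
        return collect_attributes(node.get("filters"))
    return set()  # None or anything else carries no attributes.


def unknown_attributes(request: Dict[str, Any]) -> Set[str]:
    """Return the filter attributes that do not appear in the schema."""
    return collect_attributes(request.get("filters")) - SCHEMA_ATTRIBUTES


# Result of the first query in the table above.
first = {
    "query": "publications",
    "filters": {
        "op": "and",
        "filters": [
            [
                {"attribute": "author", "op": "=", "value": "mama bear"},
                {"attribute": "pub_year", "op": ">=", "value": 2013},
            ]
        ],
    },
}
print(unknown_attributes(first))  # set()

# Hypothetical request that filters on an attribute missing from the schema.
suspicious = {
    "query": None,
    "filters": {"attribute": "location", "op": "=", "value": "florida"},
}
print(unknown_attributes(suspicious))  # {'location'}
```

A check like this only catches structural problems. Whether a filter matches the intent of the query, for example whether a price filter is appropriate for "documents that discuss $5", still needs to be reviewed by hand.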