
Web Research (STORM)

STORM is a research assistant designed by Shao, et al. that extends the idea of "outline-driven RAG" to generate richer articles.

STORM is designed to generate Wikipedia-style articles on a user-provided topic. It applies two main insights to produce more organized and comprehensive articles:

  1. Creating an outline (planning) by querying similar topics helps improve coverage.
  2. Multi-perspective, grounded (in search) conversation simulation helps increase the reference count and information density.

The control flow is shown in the diagram below.

storm.png

STORM has a few main stages:

  1. Generate initial outline + Survey related subjects
  2. Identify distinct perspectives
  3. "Interview subject-matter experts" (role-playing LLMs)
  4. Refine the outline (using references)
  5. Write the sections, then write the article

The expert interviews stage occurs between the role-playing article writer and a research expert. The "expert" is able to query external knowledge and respond to pointed questions, saving cited sources to a vector store so that the later refinement stages can synthesize the full article.

There are a couple of hyperparameters you can set to restrict the (potentially) infinite research breadth (see the sketch below):

  - N: Number of perspectives to survey/use (Steps 2->3)
  - M: Max number of conversation turns per interview (Step 3)
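
Later in the notebook, M corresponds to the max_num_turns constant used when building the interview graph, and N is simply how many editor personas the perspective-generation step produces. A minimal sketch of pinning these knobs down up front (the constant names here are illustrative, not part of the original code):

# Illustrative knobs; the notebook itself hard-codes max_num_turns = 5 later on
# and lets the perspective-generation step decide how many editors to create.
N = 5  # number of perspectives to survey/use (Steps 2 -> 3)
M = 3  # max question/answer turns per expert interview (Step 3)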

Setup

First, let's install the required packages and set our API keys.

pip install -U langchain_community langchain_openai langchain_fireworks langgraph wikipedia duckduckgo-search tavily-python
# Uncomment if you want to draw the pretty graph diagrams.
# If you are on MacOS, you will need to run brew install graphviz before installing and update some environment flags
# ! brew install graphviz
# !CFLAGS="-I $(brew --prefix graphviz)/include" LDFLAGS="-L $(brew --prefix graphviz)/lib" pip install -U pygraphviz
import getpass
import os


def _set_env(var: str):
    if os.environ.get(var):
        return
    os.environ[var] = getpass.getpass(var + ":")


_set_env("OPENAI_API_KEY")
_set_env("TAVILY_API_KEY")

Set up LangSmith for LangGraph development

Sign up for LangSmith to quickly spot issues and improve the performance of your LangGraph projects. LangSmith lets you use trace data to debug, test, and monitor your LLM apps built with LangGraph. Read more about how to get started here.
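
If you want traces from this notebook to show up in LangSmith, a minimal setup looks roughly like the following (the tracing variables are standard LangSmith environment variables; the project name is just an example):

# Optional: enable LangSmith tracing for this notebook
_set_env("LANGCHAIN_API_KEY")
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "storm-tutorial"  # example project name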

Select LLMs

We will use a faster LLM to do most of the work, but a slower, long-context model to distill the conversations and write the final report.

API Reference: ChatOpenAI

from langchain_openai import ChatOpenAI

fast_llm = ChatOpenAI(model="gpt-4o-mini")
# Uncomment for a Fireworks model
# fast_llm = ChatFireworks(model="accounts/fireworks/models/firefunction-v1", max_tokens=32_000)
long_context_llm = ChatOpenAI(model="gpt-4o")

Generate Initial Outline

For many topics, your LLM may have an initial idea of the important and related topics. We can generate an initial outline and refine it after our research. Below, we will use our "fast" LLM to generate the outline.

Using Pydantic with LangChain

This notebook uses Pydantic v2 BaseModel, which requires langchain-core >= 0.3. Using langchain-core < 0.3 will result in errors due to mixing of Pydantic v1 and v2 BaseModels.
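
If you are unsure which version you have installed, a quick check looks like this:

import langchain_core

# The Pydantic v2 models below require langchain-core >= 0.3
print(langchain_core.__version__)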

API Reference: ChatPromptTemplate

from typing import List, Optional

from langchain_core.prompts import ChatPromptTemplate

from pydantic import BaseModel, Field, field_validator

direct_gen_outline_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a Wikipedia writer. Write an outline for a Wikipedia page about a user-provided topic. Be comprehensive and specific.",
        ),
        ("user", "{topic}"),
    ]
)


class Subsection(BaseModel):
    subsection_title: str = Field(..., title="Title of the subsection")
    description: str = Field(..., title="Content of the subsection")

    @property
    def as_str(self) -> str:
        return f"### {self.subsection_title}\n\n{self.description}".strip()


class Section(BaseModel):
    section_title: str = Field(..., title="Title of the section")
    description: str = Field(..., title="Content of the section")
    subsections: Optional[List[Subsection]] = Field(
        default=None,
        title="Titles and descriptions for each subsection of the Wikipedia page.",
    )

    @property
    def as_str(self) -> str:
        subsections = "\n\n".join(
            f"### {subsection.subsection_title}\n\n{subsection.description}"
            for subsection in self.subsections or []
        )
        return f"## {self.section_title}\n\n{self.description}\n\n{subsections}".strip()


class Outline(BaseModel):
    page_title: str = Field(..., title="Title of the Wikipedia page")
    sections: List[Section] = Field(
        default_factory=list,
        title="Titles and descriptions for each section of the Wikipedia page.",
    )

    @property
    def as_str(self) -> str:
        sections = "\n\n".join(section.as_str for section in self.sections)
        return f"# {self.page_title}\n\n{sections}".strip()


generate_outline_direct = direct_gen_outline_prompt | fast_llm.with_structured_output(
    Outline
)

example_topic = "Impact of million-plus token context window language models on RAG"

initial_outline = generate_outline_direct.invoke({"topic": example_topic})

print(initial_outline.as_str)
# Impact of million-plus token context window language models on RAG

## Introduction

Brief overview of million-plus token context window language models and RAG.

## Million-Plus Token Context Window Language Models

Explanation of million-plus token context window language models, their architecture, training process, and capabilities.

## Retrieval-Augmented Generation (RAG)

Explanation of RAG model, its components, and how it combines information retrieval with text generation.

## Impact on RAG

Discussion on how million-plus token context window language models impact RAG, including improvements in performance, efficiency, and potential challenges.

Expand Topics

While language models do store some Wikipedia-like knowledge in their parameters, you will get better results by incorporating relevant and recent information using a search engine.

We will start our search by generating a list of related topics, sourced from Wikipedia.

gen_related_topics_prompt = ChatPromptTemplate.from_template(
    """I'm writing a Wikipedia page for a topic mentioned below. Please identify and recommend some Wikipedia pages on closely related subjects. I'm looking for examples that provide insights into interesting aspects commonly associated with this topic, or examples that help me understand the typical content and structure included in Wikipedia pages for similar topics.

Please list as many subjects and urls as you can.

Topic of interest: {topic}
"""
)


class RelatedSubjects(BaseModel):
    topics: List[str] = Field(
        description="Comprehensive list of related subjects as background research.",
    )


expand_chain = gen_related_topics_prompt | fast_llm.with_structured_output(
    RelatedSubjects
)
related_subjects = await expand_chain.ainvoke({"topic": example_topic})
related_subjects
RelatedSubjects(topics=['million-plus token context window language models', 'RAG'])

Generate Perspectives

From these related subjects, we can select representative Wikipedia editors as "subject matter experts" with distinct backgrounds and affiliations. These will help distribute the search process to encourage a more well-rounded final report.

class Editor(BaseModel):
    affiliation: str = Field(
        description="Primary affiliation of the editor.",
    )
    name: str = Field(
        description="Name of the editor.", pattern=r"^[a-zA-Z0-9_-]{1,64}$"
    )
    role: str = Field(
        description="Role of the editor in the context of the topic.",
    )
    description: str = Field(
        description="Description of the editor's focus, concerns, and motives.",
    )

    @field_validator("name", mode="before")
    def sanitize_name(cls, value: str) -> str:
        return value.replace(" ", "").replace(".", "")

    @property
    def persona(self) -> str:
        return f"Name: {self.name}\nRole: {self.role}\nAffiliation: {self.affiliation}\nDescription: {self.description}\n"


class Perspectives(BaseModel):
    editors: List[Editor] = Field(
        description="Comprehensive list of editors with their roles and affiliations.",
        # Add a pydantic validation/restriction to be at most M editors
    )


gen_perspectives_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            """You need to select a diverse (and distinct) group of Wikipedia editors who will work together to create a comprehensive article on the topic. Each of them represents a different perspective, role, or affiliation related to this topic.\
    You can use other Wikipedia pages of related topics for inspiration. For each editor, add a description of what they will focus on.

    Wiki page outlines of related topics for inspiration:
    {examples}""",
        ),
        ("user", "Topic of interest: {topic}"),
    ]
)

gen_perspectives_chain = gen_perspectives_prompt | fast_llm.with_structured_output(
    Perspectives, method="function_calling"
)
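
As the comment inside Perspectives hints, you could also cap the number of editors. A minimal sketch of that restriction (the subclass name and the limit of 5 are illustrative, assuming Pydantic v2 validators):

class BoundedPerspectives(Perspectives):
    """Perspectives variant that keeps at most five editor personas."""

    @field_validator("editors")
    @classmethod
    def _limit_editors(cls, editors: List[Editor]) -> List[Editor]:
        # Truncate rather than raise so persona generation never fails outright
        return editors[:5]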

API Reference: WikipediaRetriever | RunnableLambda | chain

from langchain_community.retrievers import WikipediaRetriever
from langchain_core.runnables import RunnableLambda
from langchain_core.runnables import chain as as_runnable

wikipedia_retriever = WikipediaRetriever(load_all_available_meta=True, top_k_results=1)


def format_doc(doc, max_length=1000):
    related = "- ".join(doc.metadata["categories"])
    return f"### {doc.metadata['title']}\n\nSummary: {doc.page_content}\n\nRelated\n{related}"[
        :max_length
    ]


def format_docs(docs):
    return "\n\n".join(format_doc(doc) for doc in docs)


@as_runnable
async def survey_subjects(topic: str):
    related_subjects = await expand_chain.ainvoke({"topic": topic})
    retrieved_docs = await wikipedia_retriever.abatch(
        related_subjects.topics, return_exceptions=True
    )
    all_docs = []
    for docs in retrieved_docs:
        if isinstance(docs, BaseException):
            continue
        all_docs.extend(docs)
    formatted = format_docs(all_docs)
    return await gen_perspectives_chain.ainvoke({"examples": formatted, "topic": topic})
perspectives = await survey_subjects.ainvoke(example_topic)
perspectives.model_dump()
{'editors': [{'affiliation': 'Research Institution',
   'name': 'AliceResearcher',
   'role': 'Researcher',
   'description': 'AliceResearcher focuses on the impact of million-plus token context window language models on the Retrieval-Augmented Generation (RAG) framework. They analyze the effectiveness of large language models within the RAG framework and investigate how these models influence information retrieval and generation tasks.'},
  {'affiliation': 'Tech Company',
   'name': 'BobEngineer',
   'role': 'Engineer',
   'description': 'BobEngineer specializes in implementing million-plus token context window language models in practical applications, particularly within the Retrieval-Augmented Generation (RAG) framework. They focus on optimizing the performance and efficiency of these models for real-world usage.'},
  {'affiliation': 'Academic Institution',
   'name': 'CharlieAcademic',
   'role': 'Academic',
   'description': 'CharlieAcademic studies the theoretical implications of integrating million-plus token context window language models with the RAG framework. They explore the broader implications of using such large models for natural language processing and information retrieval.'},
  {'affiliation': 'Industry Expert',
   'name': 'DianaExpert',
   'role': 'Industry Expert',
   'description': 'DianaExpert provides insights from the industry perspective on the impact of million-plus token context window language models on RAG. They focus on practical applications, challenges, and opportunities that arise when utilizing these models in commercial settings.'},
  {'affiliation': 'AI Ethics Organization',
   'name': 'EveEthicist',
   'role': 'Ethicist',
   'description': 'EveEthicist examines the ethical considerations surrounding the use of million-plus token context window language models in the context of RAG. They focus on potential biases, fairness, and transparency issues that may arise from the deployment of such models.'}]}

Expert Dialog

Now the true fun begins: each Wikipedia writer is primed to role-play using the perspectives presented above. It will ask a series of questions of a second "domain expert" with access to a search engine. This generates content used to refine the outline as well as an updated index of reference documents.

Interview State

The conversation is cyclic, so we will construct it within its own graph. The State will contain messages, reference docs, and the editor (with its own "persona") to make it easy to parallelize these conversations.

API Reference: AnyMessage | END | StateGraph | START

from typing import Annotated

from langchain_core.messages import AnyMessage
from typing_extensions import TypedDict

from langgraph.graph import END, StateGraph, START


def add_messages(left, right):
    if not isinstance(left, list):
        left = [left]
    if not isinstance(right, list):
        right = [right]
    return left + right


def update_references(references, new_references):
    if not references:
        references = {}
    references.update(new_references)
    return references


def update_editor(editor, new_editor):
    # Can only set at the outset
    if not editor:
        return new_editor
    return editor


class InterviewState(TypedDict):
    messages: Annotated[List[AnyMessage], add_messages]
    references: Annotated[Optional[dict], update_references]
    editor: Annotated[Optional[Editor], update_editor]

Dialog Roles

The graph will have two participants: the Wikipedia editor (generate_question), who asks questions based on its assigned role, and a domain expert (gen_answer_chain), who uses a search engine to answer the questions as accurately as possible.

API Reference: AIMessage | HumanMessage | ToolMessage | MessagesPlaceholder

from langchain_core.messages import AIMessage, HumanMessage, ToolMessage
from langchain_core.prompts import MessagesPlaceholder

gen_qn_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            """You are an experienced Wikipedia writer and want to edit a specific page. \
Besides your identity as a Wikipedia writer, you have a specific focus when researching the topic. \
Now, you are chatting with an expert to get information. Ask good questions to get more useful information.

When you have no more questions to ask, say "Thank you so much for your help!" to end the conversation.\
Please only ask one question at a time and don't ask what you have asked before.\
Your questions should be related to the topic you want to write.
Be comprehensive and curious, gaining as much unique insight from the expert as possible.\

Stay true to your specific perspective:

{persona}""",
        ),
        MessagesPlaceholder(variable_name="messages", optional=True),
    ]
)


def tag_with_name(ai_message: AIMessage, name: str):
    ai_message.name = name
    return ai_message


def swap_roles(state: InterviewState, name: str):
    converted = []
    for message in state["messages"]:
        if isinstance(message, AIMessage) and message.name != name:
            message = HumanMessage(**message.model_dump(exclude={"type"}))
        converted.append(message)
    return {"messages": converted}


@as_runnable
async def generate_question(state: InterviewState):
    editor = state["editor"]
    gn_chain = (
        RunnableLambda(swap_roles).bind(name=editor.name)
        | gen_qn_prompt.partial(persona=editor.persona)
        | fast_llm
        | RunnableLambda(tag_with_name).bind(name=editor.name)
    )
    result = await gn_chain.ainvoke(state)
    return {"messages": [result]}
messages = [
    HumanMessage(f"So you said you were writing an article on {example_topic}?")
]
question = await generate_question.ainvoke(
    {
        "editor": perspectives.editors[0],
        "messages": messages,
    }
)

question["messages"][0].content
"Yes, that's correct. I focus on studying the impact of million-plus token context window language models on the Retrieval-Augmented Generation (RAG) framework. I analyze how these large language models affect information retrieval and generation tasks within the RAG framework. Is there a specific aspect of this topic that you would like to know more about?"

Answer questions

The gen_answer_chain first generates queries (query expansion) to answer the editor's question, then responds with citations.

class Queries(BaseModel):
    queries: List[str] = Field(
        description="Comprehensive list of search engine queries to answer the user's questions.",
    )


gen_queries_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a helpful research assistant. Query the search engine to answer the user's questions.",
        ),
        MessagesPlaceholder(variable_name="messages", optional=True),
    ]
)
gen_queries_chain = gen_queries_prompt | fast_llm.with_structured_output(
    Queries, include_raw=True, method="function_calling"
)
queries = await gen_queries_chain.ainvoke(
    {"messages": [HumanMessage(content=question["messages"][0].content)]}
)
queries["parsed"].queries
['impact of million-plus token context window language models on Retrieval-Augmented Generation (RAG) framework',
 'information retrieval tasks in the RAG framework with large language models',
 'generation tasks in the RAG framework with million-plus token context window models']
class AnswerWithCitations(BaseModel):
    answer: str = Field(
        description="Comprehensive answer to the user's question with citations.",
    )
    cited_urls: List[str] = Field(
        description="List of urls cited in the answer.",
    )

    @property
    def as_str(self) -> str:
        return f"{self.answer}\n\nCitations:\n\n" + "\n".join(
            f"[{i+1}]: {url}" for i, url in enumerate(self.cited_urls)
        )


gen_answer_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            """You are an expert who can use information effectively. You are chatting with a Wikipedia writer who wants\
 to write a Wikipedia page on the topic you know. You have gathered the related information and will now use the information to form a response.

Make your response as informative as possible and make sure every sentence is supported by the gathered information.
Each response must be backed up by a citation from a reliable source, formatted as a footnote, reproducing the URLS after your response.""",
        ),
        MessagesPlaceholder(variable_name="messages", optional=True),
    ]
)

gen_answer_chain = gen_answer_prompt | fast_llm.with_structured_output(
    AnswerWithCitations, include_raw=True
).with_config(run_name="GenerateAnswer")

API Reference: DuckDuckGoSearchAPIWrapper | tool

from langchain_community.utilities.duckduckgo_search import DuckDuckGoSearchAPIWrapper
from langchain_core.tools import tool

'''
# Tavily is typically a better search engine, but your free queries are limited
from langchain_community.tools.tavily_search import TavilySearchResults

tavily_search = TavilySearchResults(max_results=4)

@tool
async def search_engine(query: str):
    """Search engine to the internet."""
    results = tavily_search.invoke(query)
    return [{"content": r["content"], "url": r["url"]} for r in results]
'''

# DDG


@tool
async def search_engine(query: str):
    """Search engine to the internet."""
    results = DuckDuckGoSearchAPIWrapper()._ddgs_text(query)
    return [{"content": r["body"], "url": r["href"]} for r in results]

API Reference: RunnableConfig

import json

from langchain_core.runnables import RunnableConfig


async def gen_answer(
    state: InterviewState,
    config: Optional[RunnableConfig] = None,
    name: str = "Subject_Matter_Expert",
    max_str_len: int = 15000,
):
    swapped_state = swap_roles(state, name)  # Convert all other AI messages
    queries = await gen_queries_chain.ainvoke(swapped_state)
    query_results = await search_engine.abatch(
        queries["parsed"].queries, config, return_exceptions=True
    )
    successful_results = [
        res for res in query_results if not isinstance(res, Exception)
    ]
    all_query_results = {
        res["url"]: res["content"] for results in successful_results for res in results
    }
    # We could be more precise about handling max token length if we wanted to here
    dumped = json.dumps(all_query_results)[:max_str_len]
    ai_message: AIMessage = queries["raw"]
    tool_call = queries["raw"].tool_calls[0]
    tool_id = tool_call["id"]
    tool_message = ToolMessage(tool_call_id=tool_id, content=dumped)
    swapped_state["messages"].extend([ai_message, tool_message])
    # Only update the shared state with the final answer to avoid
    # polluting the dialogue history with intermediate messages
    generated = await gen_answer_chain.ainvoke(swapped_state)
    cited_urls = set(generated["parsed"].cited_urls)
    # Save the retrieved information to the shared state for future reference
    cited_references = {k: v for k, v in all_query_results.items() if k in cited_urls}
    formatted_message = AIMessage(name=name, content=generated["parsed"].as_str)
    return {"messages": [formatted_message], "references": cited_references}
example_answer = await gen_answer(
    {"messages": [HumanMessage(content=question["messages"][0].content)]}
)
example_answer["messages"][-1].content
'Studying the impact of million-plus token context window language models on the Retrieval-Augmented Generation (RAG) framework involves analyzing how these large language models affect information retrieval and generation tasks within the RAG framework. The introduction of large language models with extensive context windows, such as Google Gemini 1.5 Pro with a record 1 million token context window, has sparked discussions in the AI community about the potential implications for RAG. While there are concerns about the negative impact of supermassive context windows on RAG, there is also a recognition of the benefits they bring, enabling more use cases and enhancing performance in knowledge-intensive tasks. Retrieval-Augmented Generation (RAG) combines the generative capabilities of transformer architectures with dynamic information retrieval, allowing large language models to access and integrate relevant external knowledge during text generation, leading to more accurate and credible outputs. RAG has been identified as a valuable solution to address challenges faced by Large Language Models (LLMs), such as hallucination, outdated knowledge, lack of transparency, and untraceable reasoning processes. By incorporating external databases, RAG improves the consistency and coherence of generated content, especially in conversational question answering tasks. RAG has also been recognized as a powerful tool for large language models to efficiently process overly lengthy contexts, with recent LLMs like Gemini-1.5 and GPT-4 showcasing exceptional capabilities in understanding long contexts directly. There is ongoing research and benchmarking to compare the strengths of RAG and long-context LLMs, aiming to leverage the advantages of both approaches for enhanced performance in information retrieval and generation tasks within the RAG framework.\n\nCitations:\n\n[1]: https://thenewstack.io/do-enormous-llm-context-windows-spell-the-end-of-rag/\n[2]: https://medium.com/enterprise-rag/why-gemini-1-5-and-other-large-context-models-are-bullish-for-rag-ce3218930bb4\n[3]: https://medium.com/@amanatulla1606/rag-is-here-to-stay-four-reasons-why-large-context-windows-cant-replace-it-ad112013de25\n[4]: https://www.freecodecamp.org/news/retrieval-augmented-generation-rag-handbook/\n[5]: https://www.deepset.ai/blog/long-context-llms-rag\n[6]: https://arxiv.org/abs/2312.10997\n[7]: https://arxiv.org/abs/2409.13385\n[8]: https://irisagent.com/blog/enhancing-large-language-models-a-deep-dive-into-rag-llm-technology/\n[9]: https://arxiv.org/abs/2407.16833'

Construct the Interview Graph

Now that we've defined the editor and domain expert, we can compose them in a graph.

max_num_turns = 5
from langgraph.pregel import RetryPolicy


def route_messages(state: InterviewState, name: str = "Subject_Matter_Expert"):
    messages = state["messages"]
    num_responses = len(
        [m for m in messages if isinstance(m, AIMessage) and m.name == name]
    )
    if num_responses >= max_num_turns:
        return END
    last_question = messages[-2]
    if last_question.content.endswith("Thank you so much for your help!"):
        return END
    return "ask_question"


builder = StateGraph(InterviewState)

builder.add_node("ask_question", generate_question, retry=RetryPolicy(max_attempts=5))
builder.add_node("answer_question", gen_answer, retry=RetryPolicy(max_attempts=5))
builder.add_conditional_edges("answer_question", route_messages)
builder.add_edge("ask_question", "answer_question")

builder.add_edge(START, "ask_question")
interview_graph = builder.compile(checkpointer=False).with_config(
    run_name="Conduct Interviews"
)
from IPython.display import Image, display

try:
    display(Image(interview_graph.get_graph().draw_mermaid_png()))
except Exception:
    # This requires some extra dependencies and is optional
    pass

final_step = None

initial_state = {
    "editor": perspectives.editors[0],
    "messages": [
        AIMessage(
            content=f"So you said you were writing an article on {example_topic}?",
            name="Subject_Matter_Expert",
        )
    ],
}
async for step in interview_graph.astream(initial_state):
    name = next(iter(step))
    print(name)
    print("-- ", str(step[name]["messages"])[:300])
final_step = step
ask_question
--  [AIMessage(content="Yes, that's correct. My focus is on how million-plus token context window language models impact the Retrieval-Augmented Generation (RAG) framework. I'm interested in understanding how the size and scope of these language models influence the effectiveness of the RAG framework in
answer_question
--  [AIMessage(content='The integration of million-plus token context window language models in the Retrieval-Augmented Generation (RAG) framework has emerged as a significant advancement in natural language processing. RAG has been a reliable solution for context-based answer generation, overcoming the
ask_question
--  [AIMessage(content='Can you elaborate on how the retrieval-augmented techniques in the RAG framework enhance the accuracy and relevance of the generated text by accessing external knowledge dynamically? How does this process differ from traditional language modeling approaches, and what specific ben
answer_question
--  [AIMessage(content='The integration of retrieval-augmented techniques in the RAG framework enhances the accuracy and relevance of generated text by dynamically accessing external knowledge sources to enrich the generation process. Unlike traditional language modeling approaches that rely solely on i
ask_question
--  [AIMessage(content='Thank you for the insightful information on how retrieval-augmented techniques enhance the accuracy and relevance of text generation in the RAG framework by incorporating external knowledge sources. How do these external knowledge sources impact the overall performance and adapta
answer_question
--  [AIMessage(content='External knowledge sources significantly impact the overall performance and adaptability of RAG models in handling various types of information retrieval and generation tasks. By leveraging external knowledge, RAG models access up-to-date information, reduce the incidence of gene
ask_question
--  [AIMessage(content='Thank you for sharing insights on how external knowledge sources impact the performance and adaptability of RAG models in handling diverse information retrieval and generation tasks. This information will be valuable for my research on the impact of million-plus token context win
answer_question
--  [AIMessage(content="The integration of million-plus token context window language models in the Retrieval-Augmented Generation (RAG) framework has sparked discussions in the AI community, with some concerns about the potential impact on RAG's relevance. However, RAG continues to be a valuable soluti

final_state = next(iter(final_step.values()))

Refine Outline

At this point in STORM, we've conducted a large amount of research from different perspectives. It's time to refine the original outline based on these investigations. Below, create a chain that uses the LLM with a long context window to update the original outline.

refine_outline_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            """You are a Wikipedia writer. You have gathered information from experts and search engines. Now, you are refining the outline of the Wikipedia page. \
You need to make sure that the outline is comprehensive and specific. \
Topic you are writing about: {topic} 

Old outline:

{old_outline}""",
        ),
        (
            "user",
            "Refine the outline based on your conversations with subject-matter experts:\n\nConversations:\n\n{conversations}\n\nWrite the refined Wikipedia outline:",
        ),
    ]
)

# Use the long-context model since the context can get quite long
refine_outline_chain = refine_outline_prompt | long_context_llm.with_structured_output(
    Outline
)
refined_outline = refine_outline_chain.invoke(
    {
        "topic": example_topic,
        "old_outline": initial_outline.as_str,
        "conversations": "\n\n".join(
            f"### {m.name}\n\n{m.content}" for m in final_state["messages"]
        ),
    }
)

print(refined_outline.as_str)
# Impact of million-plus token context window language models on RAG

## Introduction

Provides a brief overview of million-plus token context window language models and their relevance to Retrieval-Augmented Generation (RAG) systems, setting the stage for a deeper exploration of their impact.

## Background

A foundational section to understand the core concepts involved.

### Million-Plus Token Context Window Language Models

Explains what million-plus token context window language models are, including notable examples like Gemini 1.5, focusing on their architecture, training data, and the evolution of their applications.

### Retrieval-Augmented Generation (RAG)

Describes the RAG framework, its unique approach of combining retrieval and generation models for enhanced natural language processing, and its significance in the AI landscape.

## Impact on RAG Systems

Delves into the effects of million-plus token context window language models on RAG, highlighting both the challenges and opportunities presented.

### Performance and Efficiency

Discusses how large context window models influence RAG performance, including aspects of latency, computational demands, and overall efficiency.

### Generation Quality and Diversity

Explores the impact on generation quality, the potential for more accurate and diverse outputs, and how these models address data biases and factual accuracy.

### Technical Challenges

Identifies specific technical hurdles such as prompt template design, context length limitations, and similarity searches in vector databases, and how they affect RAG systems.

### Opportunities and Advancements

Outlines the new capabilities and improvements in agent interaction, information retrieval, and response relevance that these models bring to RAG systems.

## Future Directions

Considers ongoing research and potential future developments in the integration of million-plus token context window language models with RAG systems, including speculation on emerging trends and technologies.

## Conclusion

Summarizes the key points discussed in the article, reaffirming the significant impact of million-plus token context window language models on RAG systems.

Generate Article

Now it's time to generate the full article. We will first divide-and-conquer, so that each section can be tackled by an individual LLM. Then we will prompt the long-form LLM to refine the completed article (since each section may use an inconsistent voice).

Create Retriever

The research process uncovers a large number of reference documents that we may want to query during the final article-writing process.

First, create the retriever:

API Reference: InMemoryVectorStore | Document | OpenAIEmbeddings

from langchain_core.vectorstores import InMemoryVectorStore
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
reference_docs = [
    Document(page_content=v, metadata={"source": k})
    for k, v in final_state["references"].items()
]
# This really doesn't need to be a vectorstore for this size of data.
# It could just be a numpy matrix. Or you could store documents
# across requests if you want.
vectorstore = InMemoryVectorStore.from_documents(
    reference_docs,
    embedding=embeddings,
)
retriever = vectorstore.as_retriever(k=3)
retriever.invoke("What's a long context LLM anyway?")
[Document(page_content='In Retrieval Augmented Generation (RAG), a longer context augments our model with more information. For LLMs that power agents, such as chatbots, longer context means more tools and capabilities. When summarizing, longer context means more comprehensive summaries. There exist plenty of use-cases for LLMs that are unlocked by longer context lengths.', metadata={'id': '20454848-23ac-4649-b083-81980532a77b', 'source': 'https://www.anyscale.com/blog/fine-tuning-llms-for-longer-context-and-better-rag-systems'}),
 Document(page_content='By the way, the context limits differ among models: two Claude models offer a 100K token context window, which works out to about 75,000 words, which is much higher than most other LLMs. The ...', metadata={'id': '1ee2d2bb-8f8e-4a7e-b45e-608b0804fe4c', 'source': 'https://www.infoworld.com/article/3712227/what-is-rag-more-accurate-and-reliable-llms.html'}),
 Document(page_content='Figure 1: LLM response accuracy goes down when context needed to answer correctly is found in the middle of the context window. The problem gets worse with larger context models. The problem gets ...', metadata={'id': 'a41d69e6-62eb-4abd-90ad-0892a2836cba', 'source': 'https://medium.com/@jm_51428/long-context-window-models-vs-rag-a73c35a763f2'}),
 Document(page_content='To improve performance, we used retrieval-augmented generation (RAG) to prompt an LLM with accurate up-to-date information. As a result of using RAG, the writing quality of the LLM improves substantially, which has implications for the practical usability of LLMs in clinical trial-related writing.', metadata={'id': 'e1af6e30-8c2b-495b-b572-ac6a29067a94', 'source': 'https://arxiv.org/abs/2402.16406'})]
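
As the comment in the cell above notes, a vector store is not strictly necessary for this amount of data. A rough sketch of the plain-numpy alternative (illustrative only; the rest of the notebook keeps using the vector store):

import numpy as np

# Embed every reference document once and keep the vectors in a matrix
doc_texts = [doc.page_content for doc in reference_docs]
doc_matrix = np.array(embeddings.embed_documents(doc_texts))


def top_k(query: str, k: int = 3):
    """Return the k most similar reference docs by cosine similarity."""
    q = np.array(embeddings.embed_query(query))
    scores = doc_matrix @ q / (
        np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(q)
    )
    return [reference_docs[i] for i in np.argsort(scores)[::-1][:k]]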

Generate Sections

Now you can generate the sections using the indexed docs.

class SubSection(BaseModel):
    subsection_title: str = Field(..., title="Title of the subsection")
    content: str = Field(
        ...,
        title="Full content of the subsection. Include [#] citations to the cited sources where relevant.",
    )

    @property
    def as_str(self) -> str:
        return f"### {self.subsection_title}\n\n{self.content}".strip()


class WikiSection(BaseModel):
    section_title: str = Field(..., title="Title of the section")
    content: str = Field(..., title="Full content of the section")
    subsections: Optional[List[SubSection]] = Field(
        default=None,
        title="Titles and descriptions for each subsection of the Wikipedia page.",
    )
    citations: List[str] = Field(default_factory=list)

    @property
    def as_str(self) -> str:
        subsections = "\n\n".join(
            subsection.as_str for subsection in self.subsections or []
        )
        citations = "\n".join([f" [{i}] {cit}" for i, cit in enumerate(self.citations)])
        return (
            f"## {self.section_title}\n\n{self.content}\n\n{subsections}".strip()
            + f"\n\n{citations}".strip()
        )


section_writer_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert Wikipedia writer. Complete your assigned WikiSection from the following outline:\n\n"
            "{outline}\n\nCite your sources, using the following references:\n\n<Documents>\n{docs}\n<Documents>",
        ),
        ("user", "Write the full WikiSection for the {section} section."),
    ]
)


async def retrieve(inputs: dict):
    docs = await retriever.ainvoke(inputs["topic"] + ": " + inputs["section"])
    formatted = "\n".join(
        [
            f'<Document href="{doc.metadata["source"]}"/>\n{doc.page_content}\n</Document>'
            for doc in docs
        ]
    )
    return {"docs": formatted, **inputs}


section_writer = (
    retrieve
    | section_writer_prompt
    | long_context_llm.with_structured_output(WikiSection)
)

section = await section_writer.ainvoke(
    {
        "outline": refined_outline.as_str,
        "section": refined_outline.sections[1].section_title,
        "topic": example_topic,
    }
)
print(section.as_str)
## Background

To fully appreciate the impact of million-plus token context window language models on Retrieval-Augmented Generation (RAG) systems, it's essential to first understand the foundational concepts that underpin these technologies. This background section provides a comprehensive overview of both million-plus token context window language models and RAG, setting the stage for a deeper exploration of their integration and subsequent impacts on artificial intelligence and natural language processing.

### Million-Plus Token Context Window Language Models

Million-plus token context window language models, such as Gemini 1.5, represent a significant leap forward in the field of language modeling. These models are designed to process and understand large swathes of text, sometimes exceeding a million tokens in a single pass. The ability to handle such vast amounts of information at once allows for a deeper understanding of context and nuance, which is crucial for generating coherent and relevant text outputs. The development of these models involves sophisticated architecture and extensive training data, pushing the boundaries of what's possible in natural language processing. Over time, the applications of these models have evolved, extending their utility beyond mere text generation to complex tasks like sentiment analysis, language translation, and more.

### Retrieval-Augmented Generation (RAG)

The Retrieval-Augmented Generation framework represents a novel approach in the realm of artificial intelligence, blending the strengths of both retrieval and generation models to enhance natural language processing capabilities. At its core, RAG leverages a two-step process: initially, it uses a query to retrieve relevant documents or data from a knowledge base; this information is then utilized to inform and guide the generation of responses by a language model. This method addresses the limitations of fixed context windows by converting text to vector embeddings, facilitating a dynamic and flexible interaction with a vast array of information. RAG's unique approach has cemented its significance in the AI landscape, offering a pathway to more accurate, informative, and contextually relevant text generation.

Generate final article

Now we can rewrite the draft to appropriately group all the citations and maintain a consistent voice.

API Reference: StrOutputParser

from langchain_core.output_parsers import StrOutputParser

writer_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert Wikipedia author. Write the complete wiki article on {topic} using the following section drafts:\n\n"
            "{draft}\n\nStrictly follow Wikipedia format guidelines.",
        ),
        (
            "user",
            'Write the complete Wiki article using markdown format. Organize citations using footnotes like "[1]",'
            " avoiding duplicates in the footer. Include URLs in the footer.",
        ),
    ]
)

writer = writer_prompt | long_context_llm | StrOutputParser()

for tok in writer.stream({"topic": example_topic, "draft": section.as_str}):
    print(tok, end="")
# Impact of Million-Plus Token Context Window Language Models on Retrieval-Augmented Generation (RAG)

The integration of million-plus token context window language models into Retrieval-Augmented Generation (RAG) systems marks a pivotal advancement in the field of artificial intelligence (AI) and natural language processing (NLP). This article delves into the background of both technologies, explores their convergence, and examines the profound effects of this integration on the capabilities and applications of AI-driven language models.

## Contents

1. [Background](#Background)
    1. [Million-Plus Token Context Window Language Models](#Million-Plus-Token-Context-Window-Language-Models)
    2. [Retrieval-Augmented Generation (RAG)](#Retrieval-Augmented-Generation-(RAG))
2. [Integration of Million-Plus Token Context Window Models and RAG](#Integration-of-Million-Plus-Token-Context-Window-Models-and-RAG)
3. [Impact on Natural Language Processing](#Impact-on-Natural-Language-Processing)
4. [Applications](#Applications)
5. [Challenges and Limitations](#Challenges-and-Limitations)
6. [Future Directions](#Future-Directions)
7. [Conclusion](#Conclusion)
8. [References](#References)

## Background

### Million-Plus Token Context Window Language Models

Million-plus token context window language models, exemplified by systems like Gemini 1.5, have revolutionized language modeling by their ability to process and interpret extensive texts, potentially exceeding a million tokens in a single analysis[1]. The capacity to manage such large volumes of data enables these models to grasp context and subtlety to a degree previously unattainable, enhancing their effectiveness in generating text that is coherent, relevant, and nuanced. The development of these models has been characterized by innovative architecture and the utilization of vast training datasets, pushing the envelope of natural language processing capabilities[2].

### Retrieval-Augmented Generation (RAG)

RAG systems represent an innovative paradigm in AI, merging the strengths of retrieval-based and generative models to improve the quality and relevance of text generation[3]. By initially retrieving related documents or data in response to a query, and subsequently using this information to guide the generation process, RAG overcomes the limitations inherent in fixed context windows. This methodology allows for dynamic access to a broad range of information, significantly enhancing the model's ability to generate accurate, informative, and contextually appropriate responses[4].

## Integration of Million-Plus Token Context Window Models and RAG

The integration of million-plus token context window models with RAG systems has been a natural progression in the quest for more sophisticated NLP solutions. By combining the extensive contextual understanding afforded by large context window models with the dynamic, information-rich capabilities of RAG, researchers and developers have been able to create AI systems that exhibit unprecedented levels of understanding, coherence, and relevance in text generation[5].

## Impact on Natural Language Processing

The fusion of these technologies has had a significant impact on the field of NLP, leading to advancements in several key areas:
- **Enhanced Understanding**: The combined system exhibits a deeper comprehension of both the immediate context and broader subject matter[6].
- **Improved Coherence**: Generated text is more coherent over longer passages, maintaining consistency and relevance[7].
- **Increased Relevance**: Outputs are more contextually relevant, drawing accurately from a wider range of sources[8].

## Applications

This technological convergence has broadened the applicability of NLP systems in numerous fields, including but not limited to:
- **Automated Content Creation**: Generating written content that is both informative and contextually appropriate for various platforms[9].
- **Customer Support**: Providing answers that are not only accurate but also tailored to the specific context of user inquiries[10].
- **Research Assistance**: Assisting in literature review and data analysis by retrieving and synthesizing relevant information from vast databases[11].

## Challenges and Limitations

Despite their advancements, the integration of these technologies faces several challenges:
- **Computational Resources**: The processing of million-plus tokens and the dynamic retrieval of relevant information require significant computational power[12].
- **Data Privacy and Security**: Ensuring the confidentiality and integrity of the data accessed by these systems poses ongoing concerns[13].
- **Bias and Fairness**: The potential for inheriting and amplifying biases from training data remains a critical issue to address[14].

## Future Directions

Future research is likely to focus on optimizing computational efficiency, enhancing the models' ability to understand and generate more diverse and nuanced text, and addressing ethical considerations associated with AI and NLP technologies[15].

## Conclusion

The integration of million-plus token context window language models with RAG systems represents a milestone in the evolution of natural language processing, offering enhanced capabilities that have significant implications across various applications. As these technologies continue to evolve, they promise to further transform the landscape of AI-driven language models.

## References

1. Gemini 1.5 Documentation. (n.d.).
2. The Evolution of Language Models. (2022).
3. Introduction to Retrieval-Augmented Generation. (2021).
4. Leveraging Large Context Windows for NLP. (2023).
5. Integrating Context Window Models with RAG. (2023).
6. Deep Learning in NLP. (2020).
7. Coherence in Text Generation. (2019).
8. Contextual Relevance in AI. (2021).
9. Applications of NLP in Content Creation. (2022).
10. AI in Customer Support. (2023).
11. NLP for Research Assistance. (2021).
12. Computational Challenges in NLP. (2022).
13. Data Privacy in AI Systems. (2020).
14. Addressing Bias in AI. (2021).
15. Future of NLP Technologies. (2023).

Final Flow

Now it's time to string everything together. We will have 6 main stages in sequence:

  1. Generate the initial outline + perspectives
  2. Batch-converse with each perspective to expand the content for the article
  3. Refine the outline based on the conversations
  4. Index the reference docs from the conversations
  5. Write the individual sections of the article
  6. Write the final wiki

The state tracks the outputs of each stage.

class ResearchState(TypedDict):
    topic: str
    outline: Outline
    editors: List[Editor]
    interview_results: List[InterviewState]
    # The final sections output
    sections: List[WikiSection]
    article: str
import asyncio


async def initialize_research(state: ResearchState):
    topic = state["topic"]
    coros = (
        generate_outline_direct.ainvoke({"topic": topic}),
        survey_subjects.ainvoke(topic),
    )
    results = await asyncio.gather(*coros)
    return {
        **state,
        "outline": results[0],
        "editors": results[1].editors,
    }


async def conduct_interviews(state: ResearchState):
    topic = state["topic"]
    initial_states = [
        {
            "editor": editor,
            "messages": [
                AIMessage(
                    content=f"So you said you were writing an article on {topic}?",
                    name="Subject_Matter_Expert",
                )
            ],
        }
        for editor in state["editors"]
    ]
    # We call in to the sub-graph here to parallelize the interviews
    interview_results = await interview_graph.abatch(initial_states)

    return {
        **state,
        "interview_results": interview_results,
    }


def format_conversation(interview_state):
    messages = interview_state["messages"]
    convo = "\n".join(f"{m.name}: {m.content}" for m in messages)
    return f'Conversation with {interview_state["editor"].name}\n\n' + convo


async def refine_outline(state: ResearchState):
    convos = "\n\n".join(
        [
            format_conversation(interview_state)
            for interview_state in state["interview_results"]
        ]
    )

    updated_outline = await refine_outline_chain.ainvoke(
        {
            "topic": state["topic"],
            "old_outline": state["outline"].as_str,
            "conversations": convos,
        }
    )
    return {**state, "outline": updated_outline}


async def index_references(state: ResearchState):
    all_docs = []
    for interview_state in state["interview_results"]:
        reference_docs = [
            Document(page_content=v, metadata={"source": k})
            for k, v in interview_state["references"].items()
        ]
        all_docs.extend(reference_docs)
    await vectorstore.aadd_documents(all_docs)
    return state


async def write_sections(state: ResearchState):
    outline = state["outline"]
    sections = await section_writer.abatch(
        [
            {
                "outline": refined_outline.as_str,
                "section": section.section_title,
                "topic": state["topic"],
            }
            for section in outline.sections
        ]
    )
    return {
        **state,
        "sections": sections,
    }


async def write_article(state: ResearchState):
    topic = state["topic"]
    sections = state["sections"]
    draft = "\n\n".join([section.as_str for section in sections])
    article = await writer.ainvoke({"topic": topic, "draft": draft})
    return {
        **state,
        "article": article,
    }

Create the graph

API Reference: MemorySaver

from langgraph.checkpoint.memory import MemorySaver

builder_of_storm = StateGraph(ResearchState)

nodes = [
    ("init_research", initialize_research),
    ("conduct_interviews", conduct_interviews),
    ("refine_outline", refine_outline),
    ("index_references", index_references),
    ("write_sections", write_sections),
    ("write_article", write_article),
]
for i in range(len(nodes)):
    name, node = nodes[i]
    builder_of_storm.add_node(name, node, retry=RetryPolicy(max_attempts=3))
    if i > 0:
        builder_of_storm.add_edge(nodes[i - 1][0], name)

builder_of_storm.add_edge(START, nodes[0][0])
builder_of_storm.add_edge(nodes[-1][0], END)
storm = builder_of_storm.compile(checkpointer=MemorySaver())
from IPython.display import Image, display

try:
    display(Image(storm.get_graph().draw_mermaid_png()))
except Exception:
    # This requires some extra dependencies and is optional
    pass

config = {"configurable": {"thread_id": "my-thread"}}
async for step in storm.astream(
    {
        "topic": "Groq, NVIDIA, Llamma.cpp and the future of LLM Inference",
    },
    config,
):
    name = next(iter(step))
    print(name)
    print("-- ", str(step[name])[:300])
init_research
--  {'topic': 'Groq, NVIDIA, Llamma.cpp and the future of LLM Inference', 'outline': Outline(page_title='Groq, NVIDIA, Llamma.cpp and the future of LLM Inference', sections=[Section(section_title='Introduction', description='Overview of Groq, NVIDIA, Llamma.cpp, and their significance in the field of La
conduct_interviews
--  {'topic': 'Groq, NVIDIA, Llamma.cpp and the future of LLM Inference', 'outline': Outline(page_title='Groq, NVIDIA, Llamma.cpp and the future of LLM Inference', sections=[Section(section_title='Introduction', description='Overview of Groq, NVIDIA, Llamma.cpp, and their significance in the field of La
refine_outline
--  {'topic': 'Groq, NVIDIA, Llamma.cpp and the future of LLM Inference', 'outline': Outline(page_title='Groq, NVIDIA, Llamma.cpp and the Future of LLM Inference', sections=[Section(section_title='Introduction', description='An overview of the significance and roles of Groq, NVIDIA, and Llamma.cpp in th
index_references
--  {'topic': 'Groq, NVIDIA, Llamma.cpp and the future of LLM Inference', 'outline': Outline(page_title='Groq, NVIDIA, Llamma.cpp and the Future of LLM Inference', sections=[Section(section_title='Introduction', description='An overview of the significance and roles of Groq, NVIDIA, and Llamma.cpp in th
write_sections
--  {'topic': 'Groq, NVIDIA, Llamma.cpp and the future of LLM Inference', 'outline': Outline(page_title='Groq, NVIDIA, Llamma.cpp and the Future of LLM Inference', sections=[Section(section_title='Introduction', description='An overview of the significance and roles of Groq, NVIDIA, and Llamma.cpp in th
write_article
--  {'topic': 'Groq, NVIDIA, Llamma.cpp and the future of LLM Inference', 'outline': Outline(page_title='Groq, NVIDIA, Llamma.cpp and the Future of LLM Inference', sections=[Section(section_title='Introduction', description='An overview of the significance and roles of Groq, NVIDIA, and Llamma.cpp in th
__end__
--  {'topic': 'Groq, NVIDIA, Llamma.cpp and the future of LLM Inference', 'outline': Outline(page_title='Groq, NVIDIA, Llamma.cpp and the Future of LLM Inference', sections=[Section(section_title='Introduction', description='An overview of the significance and roles of Groq, NVIDIA, and Llamma.cpp in th

checkpoint = storm.get_state(config)
article = checkpoint.values["article"]

Render the Wiki

Now we can render the final wiki page!

from IPython.display import Markdown

# We will down-header the sections to create less confusion in this notebook
Markdown(article.replace("\n#", "\n##"))

Large Language Model (LLM) Inference Technologies

Contents

  1. Introduction
  2. Groq's Advancements in LLM Inference
  3. NVIDIA's Contributions to LLM Inference
  4. Hardware Innovations
  5. Software Solutions
  6. Research and Development
  7. Llamma.cpp: Accelerating LLM Inference
  8. The Future of LLM Inference
  9. References

Introduction

The advent of million-plus token context window language models, such as Gemini 1.5, has significantly advanced the field of artificial intelligence, particularly in natural language processing (NLP). These models have expanded what machine learning can understand and generate, handling far larger contexts than previously possible. This technological leap has paved the way for transformative applications across domains, including integration into Retrieval-Augmented Generation (RAG) systems to produce more accurate and contextually relevant responses.

Groq's Advancements in LLM Inference

Groq introduced the Groq Linear Processor Unit (LPU), a hardware architecture purpose-built for LLM inference. By optimizing hardware specifically for LLM tasks, this innovation positions Groq as a leader in efficient, high-performance LLM processing. The Groq LPU significantly reduces the latency and increases the throughput of LLM inference, driving progress in applications ranging from natural language processing to broader AI technologies[1].

NVIDIA's Contributions to LLM Inference

NVIDIA has played a pivotal role in advancing LLM inference through its GPUs optimized for AI and machine learning workloads, along with dedicated software frameworks. The company's GPU architectures and software solutions, such as the CUDA Deep Neural Network library (cuDNN) and the TensorRT inference optimizer, are designed to accelerate computation and improve LLM performance. NVIDIA's active involvement in research and development further underscores its commitment to enhancing LLM capabilities[1].

Hardware Innovations

NVIDIA's GPU architectures enable high throughput and parallel processing for LLM inference tasks, significantly reducing inference time and making it possible to use complex models in real-time applications.

Software Solutions

NVIDIA's suite of software tools, including cuDNN and TensorRT, optimizes LLM performance on its hardware, streamlining LLM deployment by improving efficiency and reducing latency.

Research and Development

NVIDIA collaborates with academic and industry partners to develop new techniques and models that push the boundaries of LLM technology, aiming to make LLMs more powerful and applicable to a wider range of tasks.

Llamma.cpp: Accelerating LLM Inference

Llamma.cpp is a framework designed to improve the speed and efficiency of LLM inference. By integrating specialized hardware, such as Groq's LPU, and optimizing for parallel processing, Llamma.cpp significantly accelerates computation and reduces energy consumption. The framework supports million-plus token context window models, enabling applications that require deep contextual understanding and extensive knowledge retrieval[1][2].

The Future of LLM Inference

The future of LLM inference is poised for transformation thanks to advances in specialized hardware architectures like Groq's LPU. These innovations promise to increase the speed and efficiency of LLM processing, leading to more interactive, capable, and integrated AI applications. The potential for advanced hardware and sophisticated LLMs to handle complex queries and interactions nearly instantaneously opens new avenues for research and applications across domains, pointing toward a future in which AI is woven seamlessly into society[1][2].

References

[1] "Groq's LPU: Advancing LLM Inference Efficiency," Prompt Engineering. https://promptengineering.org/groqs-lpu-advancing-llm-inference-efficiency/

[2] "The Speed of Thought: Harnessing the Fastest LLM with Groq's LPU," Medium. https://medium.com/@anasdavoodtk1/the-speed-of-thought-harnessing-the-fastest-llm-with-groqs-lpu-11bb00864e9c

