Add RAG agent and ReAct agent implementation for llama3.1 served by TGI-gaudi (#722)

* add ragagent and react agent for llama3.1

Signed-off-by: minmin-intel <[email protected]>

* update ut

Signed-off-by: minmin-intel <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update test

Signed-off-by: minmin-intel <[email protected]>

* update test

Signed-off-by: minmin-intel <[email protected]>

* debug ut

Signed-off-by: minmin-intel <[email protected]>

* update test

Signed-off-by: minmin-intel <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update test and readme

Signed-off-by: minmin-intel <[email protected]>

* update ragagent llama docgrader

Signed-off-by: minmin-intel <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: minmin-intel <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
minmin-intel and pre-commit-ci[bot] committed Sep 26, 2024
1 parent cc80c1b commit e7fdf53
Showing 15 changed files with 669 additions and 38 deletions.
13 changes: 11 additions & 2 deletions comps/agent/langchain/README.md
@@ -8,8 +8,8 @@ This agent microservice is built on Langchain/Langgraph frameworks. Agents integ

We currently support the following types of agents:

1. ReAct: use `react_langchain` or `react_langgraph` as strategy. First introduced in this seminal [paper](https://arxiv.org/abs/2210.03629). The ReAct agent engages in "reason-act-observe" cycles to solve problems. Please refer to this [doc](https://python.langchain.com/v0.2/docs/how_to/migrate_agent/) to understand the differences between the langchain and langgraph versions of react agents.
2. RAG agent: `rag_agent` strategy. This agent is specifically designed for improving RAG performance. It has the capability to rephrase query, check relevancy of retrieved context, and iterate if context is not relevant.
1. ReAct: use `react_langchain`, `react_langgraph`, or `react_llama` as strategy. First introduced in this seminal [paper](https://arxiv.org/abs/2210.03629). The ReAct agent engages in "reason-act-observe" cycles to solve problems. Please refer to this [doc](https://python.langchain.com/v0.2/docs/how_to/migrate_agent/) to understand the differences between the langchain and langgraph versions of react agents. See the table below for the validated LLMs for each ReAct strategy.
2. RAG agent: use the `rag_agent` or `rag_agent_llama` strategy. This agent is specifically designed to improve RAG performance. It can rephrase the query, check the relevancy of the retrieved context, and iterate if the context is not relevant. See the table below for the validated LLMs for each RAG agent strategy.
3. Plan and execute: `plan_execute` strategy. This type of agent first makes a step-by-step plan given a user request, and then executes the plan sequentially (parallel execution to be implemented in the future). If the execution results can solve the problem, the agent outputs an answer; otherwise, it replans and executes again.
For advanced developers who want to implement their own agent strategies, please refer to [Section 5](#5-customize-agent-strategy) below.

@@ -20,6 +20,15 @@ Agents use LLM for reasoning and planning. We support two LLM engine options:
1. Open-source LLMs served with TGI-gaudi. To use open-source LLMs, follow the instructions in [Section 2](#222-start-microservices) below. Note: we recommend using state-of-the-art LLMs, such as llama3.1-70B-instruct, to get a higher success rate.
2. OpenAI LLMs via API calls. To use OpenAI LLMs, specify `llm_engine=openai` and `export OPENAI_API_KEY=<your-openai-key>`.

| Agent type | `strategy` arg | Validated LLMs | Notes |
| ---------------- | ----------------- | ---------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------ |
| ReAct | `react_langchain` | GPT-4o-mini, [llama3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) | Only allows tools with one input variable |
| ReAct | `react_langgraph` | GPT-4o-mini | Currently does not work for open-source LLMs served with TGI-Gaudi |
| ReAct | `react_llama` | [llama3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) | Recommended for open-source LLMs served with TGI-Gaudi |
| RAG agent | `rag_agent` | GPT-4o-mini | Currently does not work for open-source LLMs served with TGI-Gaudi |
| RAG agent | `rag_agent_llama` | [llama3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) | Recommended for open-source LLMs served with TGI-Gaudi, only allows 1 tool with input variable to be "query" |
| Plan and execute | `plan_execute` | GPT-4o-mini | |
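For orientation, these strategy strings resolve to agent classes in `comps/agent/langchain/src/agent.py` (see the diff below); a minimal sketch of that mapping, with class names taken from the diffs in this commit:

```python
# Sketch of the strategy -> agent class dispatch (see src/agent.py below).
# "rag_agent" and "rag_agent_llama" share the RAGAgent class; the Llama-specific
# query writer and document grader nodes are selected inside RAGAgent.
STRATEGY_TO_AGENT = {
    "react_langchain": "ReActAgentwithLangchain",
    "react_langgraph": "ReActAgentwithLanggraph",
    "react_llama": "ReActAgentLlama",
    "plan_execute": "PlanExecuteAgentWithLangGraph",
    "rag_agent": "RAGAgent",
    "rag_agent_llama": "RAGAgent",
}
```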

### 1.3 Tools

The tools are registered with a yaml file. We support the following types of tools:
1 change: 1 addition & 0 deletions comps/agent/langchain/agent.py
@@ -53,6 +53,7 @@ async def llm_generate(input: Union[LLMParamsDoc, ChatCompletionRequest]):
logger.info(f"args: {args}")
input.streaming = args.streaming
config = {"recursion_limit": args.recursion_limit}
print("========initiating agent============")
agent_inst = instantiate_agent(args, args.strategy)
if logflag:
logger.info(type(agent_inst))
8 changes: 7 additions & 1 deletion comps/agent/langchain/src/agent.py
@@ -11,12 +11,18 @@ def instantiate_agent(args, strategy="react_langchain", with_memory=False):
from .strategy.react import ReActAgentwithLanggraph

return ReActAgentwithLanggraph(args, with_memory)
elif strategy == "react_llama":
print("Initializing ReAct Agent with LLAMA")
from .strategy.react import ReActAgentLlama

return ReActAgentLlama(args, with_memory)
elif strategy == "plan_execute":
from .strategy.planexec import PlanExecuteAgentWithLangGraph

return PlanExecuteAgentWithLangGraph(args, with_memory)

elif strategy == "rag_agent":
elif strategy == "rag_agent" or strategy == "rag_agent_llama":
print("Initializing RAG Agent")
from .strategy.ragagent import RAGAgent

return RAGAgent(args, with_memory)
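A minimal usage sketch of this dispatcher follows; the `SimpleNamespace` stands in for the service's parsed args, and every field other than `strategy` and `model` is an assumption here (the real constructors read more fields, e.g. the tools yaml path and `llm_engine`):

```python
from types import SimpleNamespace

# Hypothetical stand-in for the argparse namespace the microservice builds.
args = SimpleNamespace(
    strategy="react_llama",
    model="meta-llama/Llama-3.1-70B-Instruct",
    recursion_limit=10,
    streaming=False,
)
agent_inst = instantiate_agent(args, strategy=args.strategy, with_memory=False)
```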
146 changes: 136 additions & 10 deletions comps/agent/langchain/src/strategy/ragagent/planner.py
@@ -4,8 +4,8 @@
from typing import Annotated, Any, Literal, Sequence, TypedDict

from langchain.output_parsers import PydanticOutputParser
from langchain_core.messages import BaseMessage, HumanMessage, ToolMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_core.messages import AIMessage, BaseMessage, HumanMessage, ToolMessage
from langchain_core.output_parsers import JsonOutputParser, StrOutputParser
from langchain_core.output_parsers.openai_tools import PydanticToolsParser
from langchain_core.prompts import PromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field
@@ -17,7 +17,7 @@
from langgraph.prebuilt import ToolNode, tools_condition

from ..base_agent import BaseAgent
from .prompt import DOC_GRADER_PROMPT, RAG_PROMPT
from .prompt import DOC_GRADER_PROMPT, RAG_PROMPT, QueryWriterLlamaPrompt

instruction = "Retrieved document is not sufficient or relevant to answer the query. Reformulate the query to search knowledge base again."
MAX_RETRY = 3
@@ -61,6 +61,8 @@ def __call__(self, state):
class Retriever:
@classmethod
def create(cls, tools_descriptions):
for tool in tools_descriptions:
print(tool.name)
return ToolNode(tools_descriptions)


@@ -132,20 +134,23 @@ def __init__(self, llm_endpoint, model_id=None):
self.rag_chain = prompt | llm_endpoint | StrOutputParser()

def __call__(self, state):
from .utils import aggregate_docs

print("---GENERATE---")
messages = state["messages"]
question = messages[0].content
query_time = state["query_time"]

# find the latest retrieved doc
# which is a ToolMessage
for m in state["messages"][::-1]:
if isinstance(m, ToolMessage):
last_message = m
break
# for m in state["messages"][::-1]:
# if isinstance(m, ToolMessage):
# last_message = m
# break
# docs = last_message.content

question = messages[0].content
docs = last_message.content
docs = aggregate_docs(messages)

# Run
response = self.rag_chain.invoke({"context": docs, "question": question, "time": query_time})
@@ -159,8 +164,13 @@ def __init__(self, args, with_memory=False):
super().__init__(args)

# Define Nodes
document_grader = DocumentGrader(self.llm_endpoint, args.model)
query_writer = QueryWriter(self.llm_endpoint, args.model, self.tools_descriptions)

if args.strategy == "rag_agent":
query_writer = QueryWriter(self.llm_endpoint, args.model, self.tools_descriptions)
document_grader = DocumentGrader(self.llm_endpoint, args.model)
elif args.strategy == "rag_agent_llama":
query_writer = QueryWriterLlama(self.llm_endpoint, args.model, self.tools_descriptions)
document_grader = DocumentGraderLlama(self.llm_endpoint, args.model)
text_generator = TextGenerator(self.llm_endpoint)
retriever = Retriever.create(self.tools_descriptions)

@@ -248,3 +258,119 @@ async def non_streaming_run(self, query, config):
return last_message.content
except Exception as e:
return str(e)


class QueryWriterLlama:
"""Temporary workaround to use LLM with TGI-Gaudi.
Use custom output parser to parse text string from LLM into tool calls.
Only support one tool. Does NOT support multiple tools.
The tool input variable must be "query".
Only validated with llama3.1-70B-instruct.
Output of the chain is AIMessage.
Streaming=false is required for this chain.
"""

def __init__(self, llm_endpoint, model_id, tools):
from .utils import QueryWriterLlamaOutputParser

assert len(tools) == 1, "Only one tool is supported; got {} tools".format(len(tools))
output_parser = QueryWriterLlamaOutputParser()
prompt = PromptTemplate(
template=QueryWriterLlamaPrompt,
input_variables=["question", "history", "feedback"],
)
llm = ChatHuggingFace(llm=llm_endpoint, model_id=model_id)
self.tools = tools
self.chain = prompt | llm | output_parser

def __call__(self, state):
from .utils import assemble_history, convert_json_to_tool_call

print("---CALL QueryWriter---")
messages = state["messages"]

question = messages[0].content
history = assemble_history(messages)
feedback = instruction

response = self.chain.invoke({"question": question, "history": history, "feedback": feedback})
print("Response from query writer llm: ", response)

### Code below assumes one tool call in the response ##############
# if "query" in response:
# add_kw_tc, tool_call = convert_json_to_tool_call(response, self.tools[0])
# # print("Tool call:\n", tool_call)
# response = AIMessage(content="", additional_kwargs=add_kw_tc, tool_calls=[tool_call])
# # print(response)
# else:
# response = AIMessage(content=response["answer"])
# We return a list, because this will get added to the existing list
# return {"messages": [response], "output": response}
######################################################################

############ allow multiple tool calls in one AI message ############
tool_calls = []
for res in response:
if "query" in res:
add_kw_tc, tool_call = convert_json_to_tool_call(res, self.tools[0])
# print("Tool call:\n", tool_call)
tool_calls.append(tool_call)

if tool_calls:
# NOTE: additional_kwargs reflects only the last tool call converted in the loop above
ai_message = AIMessage(content="", additional_kwargs=add_kw_tc, tool_calls=tool_calls)
else:
ai_message = AIMessage(content=response[0]["answer"])

return {"messages": [ai_message], "output": ai_message.content}


class DocumentGraderLlama:
"""Determines whether the retrieved documents are relevant to the question.
Args:
state (messages): The current state
Returns:
dict: a "doc_score" of "generate" (docs relevant) or "rewrite" (docs not relevant)
"""

def __init__(self, llm_endpoint, model_id=None):
from .prompt import DOC_GRADER_Llama_PROMPT

# Prompt
prompt = PromptTemplate(
template=DOC_GRADER_Llama_PROMPT,
input_variables=["context", "question"],
)

if isinstance(llm_endpoint, HuggingFaceEndpoint):
llm = ChatHuggingFace(llm=llm_endpoint, model_id=model_id)
elif isinstance(llm_endpoint, ChatOpenAI):
llm = llm_endpoint
self.chain = prompt | llm

def __call__(self, state):
from .utils import aggregate_docs

print("---CALL DocumentGrader---")
messages = state["messages"]

question = messages[0].content # the original query
docs = aggregate_docs(messages)
print("@@@@ Docs: ", docs)

scored_result = self.chain.invoke({"question": question, "context": docs})

score = scored_result.content
print("@@@@ Score: ", score)

# if score.startswith("yes"):
if "yes" in score:
print("---DECISION: DOCS RELEVANT---")
return {"doc_score": "generate"}

else:
print("---DECISION: DOCS NOT RELEVANT---")

return {"messages": [HumanMessage(content=instruction)], "doc_score": "rewrite"}
32 changes: 32 additions & 0 deletions comps/agent/langchain/src/strategy/ragagent/prompt.py
@@ -34,3 +34,35 @@
),
]
)


QueryWriterLlamaPrompt = """\
Given the user question, think step by step.
If you can answer the question without searching the knowledge base, provide your answer.
If you need to search for information in the knowledge base, provide the search query.
Decompose a complex question into a set of simple tasks, and issue search queries for each task.
Here is the history of search queries that you have issued.
{history}
Here is the feedback on the documents retrieved with your search queries.
{feedback}
What is the new query that you should issue to the knowledge base to answer the user question?
Output the new query in JSON format as below.
{{"query": "your new query here"}}
If you plan to issue multiple queries, you must output JSON in multiple lines like the example below.
{{"query": "your first query here"}}
{{"query": "your second query here"}}
If you can directly answer the user question, output your answer in JSON format as below.
{{"answer": "your answer here"}}
User Question: {question}
Your Output:\n
"""

DOC_GRADER_Llama_PROMPT = """\
Given the QUERY, determine if the DOCUMENT contains all the information to answer the query.\n
QUERY: {question} \n
DOCUMENT:\n{context}\n\n
Give score 'yes' if the document provides all the information needed to answer the question. Otherwise, give score 'no'. ONLY answer with 'yes' or 'no'. NOTHING ELSE."""
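As an illustration, rendering `DOC_GRADER_Llama_PROMPT` with toy inputs shows exactly what the grader LLM receives (the question and context values are invented):

```python
from langchain_core.prompts import PromptTemplate

# Render the grader prompt with made-up inputs.
prompt = PromptTemplate(
    template=DOC_GRADER_Llama_PROMPT,
    input_variables=["context", "question"],
)
print(prompt.format(question="Who introduced the ReAct paradigm?",
                    context="ReAct was introduced by Yao et al. (2022)."))
```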
78 changes: 78 additions & 0 deletions comps/agent/langchain/src/strategy/ragagent/utils.py
@@ -0,0 +1,78 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import json
import uuid

from huggingface_hub import ChatCompletionOutputFunctionDefinition, ChatCompletionOutputToolCall
from langchain_core.messages import AIMessage, BaseMessage, HumanMessage, ToolMessage
from langchain_core.messages.tool import ToolCall
from langchain_core.output_parsers import BaseOutputParser


class QueryWriterLlamaOutputParser(BaseOutputParser):
def parse(self, text: str):
print("raw output from llm: ", text)
json_lines = text.split("\n")
print("json_lines: ", json_lines)
output = []
for line in json_lines:
try:
output.append(json.loads(line))
except Exception as e:
print("Exception happened in output parsing: ", str(e))
if output:
return output
else:
return None


def convert_json_to_tool_call(json_str, tool):
tool_name = tool.name
tcid = str(uuid.uuid4())
add_kw_tc = {
"tool_calls": [
ChatCompletionOutputToolCall(
function=ChatCompletionOutputFunctionDefinition(
arguments={"query": json_str["query"]}, name=tool_name, description=None
),
id=tcid,
type="function",
)
]
}
tool_call = ToolCall(name=tool_name, args={"query": json_str["query"]}, id=tcid)
return add_kw_tc, tool_call


def assemble_history(messages):
"""
messages: AI (query writer), TOOL (retriever), HUMAN (Doc Grader), AI, TOOL, HUMAN, etc.
"""
query_history = ""
n = 1
for m in messages[1:]: # exclude the first message
if isinstance(m, AIMessage):
# if there is tool call
if hasattr(m, "tool_calls") and len(m.tool_calls) > 0:
for tool_call in m.tool_calls:
query = tool_call["args"]["query"]
query_history += f"{n}. {query}\n"
n += 1
return query_history


def aggregate_docs(messages):
"""
messages: AI (query writer), TOOL (retriever), HUMAN (Doc Grader), etc.
"""
docs = []
context = ""
for m in messages[::-1]:
if isinstance(m, ToolMessage):
docs.append(m.content)
elif isinstance(m, AIMessage):
break
for doc in docs[::-1]:
context = context + doc + "\n"
return context
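A quick, self-contained exercise of the parsing and conversion helpers above (the raw LLM text is a made-up example of the JSON-lines format the query writer prompt requests; `DummyTool` is a hypothetical stand-in, since `convert_json_to_tool_call` only reads the tool's `name`):

```python
raw = '{"query": "ReAct paper authors"}\n{"query": "ReAct publication year"}'
parsed = QueryWriterLlamaOutputParser().parse(raw)
# parsed == [{"query": "ReAct paper authors"}, {"query": "ReAct publication year"}]

class DummyTool:
    # Hypothetical stand-in: only the .name attribute is used by the converter.
    name = "search_knowledge_base"

add_kw_tc, tool_call = convert_json_to_tool_call(parsed[0], DummyTool())
print(tool_call["name"], tool_call["args"])
# search_knowledge_base {'query': 'ReAct paper authors'}
```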
1 change: 1 addition & 0 deletions comps/agent/langchain/src/strategy/react/__init__.py
@@ -3,3 +3,4 @@

from .planner import ReActAgentwithLangchain
from .planner import ReActAgentwithLanggraph
from .planner import ReActAgentLlama