What does agent mean in the field of AI?

In AI classes I often hear the term "agent," but I have never really understood what it means.


In April of this year, I read Stanford's paper on Generative Agents and found it particularly interesting. I spent a few days implementing a demo and later took it to a hackathon, where it surprisingly won second prize and over $1,300.

Recently, when I mentioned this demo, my colleagues expressed a lot of interest, so I took the time to organize it and share it with everyone.

The code is included at the end of the article.

  1. What problems can LLMs solve in games?

There is a saying in the gaming industry that innovation in core gameplay has been slow over the past 20 years, with most of the innovation happening in technology.

Developers provide larger maps, more sophisticated graphics, and intricate details within the game in order to provide players with “immersion”. When players receive the feedback they expect in the game world, they feel a great sense of satisfaction.

However, due to technical limitations, past innovations have not touched on a core aspect of games: the logic of the world and NPCs.

When players interact with the world and NPCs beyond the rules set by the game, they will not receive feedback, resulting in a huge gap. This experience in the gaming world is known as Breaking Immersion.

In the past, developers have tried their best to avoid players experiencing this sense of dissonance.

For example, in Red Dead Redemption 2, Rockstar Games spent 8 years and nearly $540 million to add countless logic and details to the game world in order to make it as immersive as possible.

The popularity of large models may change this situation.

Large models can provide logic for the operation of the game world and the behavior of NPCs, helping the game understand the player’s actions and allowing the game world to run in a believable state. This fundamentally enhances the player’s sense of immersion.

  2. Specifically, how many steps are needed to use an LLM in a game?

We divide the application of LLM in games into two parts:

  • World: interacting with the game environment
  • Agent: interacting with NPCs


World includes:

  • The worldview of the game
  • Specific locations on the map

The Agent includes:

  • Persona: character personality
  • Memory: NPC memory
  • Planning: determining which actions the NPC will take

  3. Enabling the LLM to understand the game world and environment

In order to enable ChatGPT to understand the worldview of our game, we introduce a prompt:

export const worldHistory =
  `The continent you are in is called the "Great Tang Dynasty". It is a world where mythology intertwines with reality.
  There are five important locations on the main island. The largest is "Chang'an City", the political, economic and cultural center of the country. Within the city walls, there are various shops and temples.
  Next is "Five Fingers Mountain", where Sun Wukong was once trapped.
  In addition, there are "Grass Temple Village", "High Elder Village" and "Daughter Village", which are the challenges and adventures encountered by Tang Sanzang and his disciples on their journey.
  On the small island to the east is a hidden Buddhist holy place called "Ling Mountain". This is the destination for the four to obtain sutras.
  There is a long bridge between the two islands called "Heavenly River", which is transformed by Sha Wujing's golden hoop staff.`;

export const worldKnowledge = "";

In order for NPCs and players to interact with the locations/items on the map, we need to provide descriptions for all the items and locations:

222: {
    description: `Located on the northwest edge of the main island. To the west is the vast ocean, and to the east is a cliff on the plateau. There are several trees and a grassland where monsters often appear. The direction to the south leads to Chang'an City.`,
    mapId: 222,
},
254: {
    description: `Located on the northeast edge of the main island. To the east is the ocean, and to the west is a cliff on the plateau. There are several trees and a grassland where monsters often appear. Chang'an City is to the south.`,
    mapId: 254,
},
188: {
    description: `It is a forest area on the plateau. The forest is dense with several grasslands where monsters often appear. Chang'an City is to the south.`,
    mapId: 188,
},
190: {
    description: `Located in a forest on the plateau. You are standing in front of a well-maintained wooden house. There are dense trees and a grassland where monsters often appear. Chang'an City is to the south.`,
    mapId: 190,
},
220: {
    description: `Chang'an City, a town on the main island.`,
    mapId: 220,
},

The above prompts provide textual descriptions for each block on the map, allowing ChatGPT to understand each location on the map.
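An NPC's surroundings can then be turned into prompt text by a simple lookup over this table. A minimal sketch, assuming a `tileDescriptions` record and a `describeEnvironment` helper (both names are illustrative, not the demo's actual identifiers):

```typescript
// Map tile descriptions keyed by mapId, as in the location table above.
const tileDescriptions: Record<number, { description: string; mapId: number }> = {
  220: { description: "Chang'an City, a town on the main island.", mapId: 220 },
  222: { description: "Located on the northwest edge of the main island.", mapId: 222 },
};

// Build the environment fragment for the tile the NPC stands on; fall back
// to a generic description for tiles that were never annotated.
function describeEnvironment(mapId: number): string {
  return tileDescriptions[mapId]?.description ?? "an unremarkable patch of wilderness";
}
```

The fallback matters in practice: prompts should degrade gracefully rather than break when the NPC wanders onto an unannotated tile.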

In the actual implementation, we need to provide textual descriptions for all of the game's textures!

  4. Enabling the LLM to drive NPCs

First, we need to let LLM know that it is playing the role of an NPC:

export const npcSharedPrompt = `You are playing a character from "Journey to the West".
This is a 2D mythical world where players and you can explore this continent together.
You can interact with other characters, such as Tang Sanzang, Sun Wukong, Zhu Bajie, and Sha Wujing,
and engage in battles with monsters, visit villages or temples, and purchase treasures or herbs.
In this world, fighting monsters is part of the journey, but the ultimate goal is to obtain the scriptures and bring peace to the world.
Although the monsters are ferocious, they are not purely evil. Fighting them is not only for self-protection but also to try to enlighten them.
Your character does not know about the existence of the real world, only the mission in this mythical journey.`;

Next, we design a series of NPCs:

  {
    id: 1,
    name: "Tang Sanzang",
    description: "Tang Sanzang, whose real name is Tang Seng, is one of the main characters in the Chinese classical novel 'Journey to the West'. He is a determined, wise, and faithful monk who embarks on a journey to the West to obtain Buddhist scriptures.",
    age: 40,
    starSign: "Pisces",
    money: 100,
    items: ["Golden Hoop"],
    personalHistory: `You are Tang Sanzang, a monk sent to retrieve scriptures from India. Your mission is to obtain Buddhist texts and bring them back to China.`,
    personalKnowledge: "You know your three disciples: Sun Wukong, Zhu Bajie, and Sha Wujing. They each have unique abilities and histories.",
    conversation: new ConversationModel(),
    startingPos: new Vec2(32, 38),
    upSprites: TypedAssets.spriteSheets.momup,
    downSprites: TypedAssets.spriteSheets.momdown,
    leftSprites: TypedAssets.spriteSheets.momleft,
    rightSprites: TypedAssets.spriteSheets.momright,
  },
  {
    id: 2,
    name: "Queen of Women's Kingdom",
    description: "The Queen of Women's Kingdom is a character in 'Journey to the West'. She is the ruler of Women's Kingdom and takes a strong interest in Tang Sanzang.",
    age: 35,
    starSign: "Virgo",
    money: 500,
    items: ["Elixir of Life"],
    personalHistory: `You are the Queen of Women's Kingdom, where only women reside. When you heard about Tang Sanzang's arrival, you decided to marry him.`,
    personalKnowledge: "You know that Tang Sanzang is a noble monk who is on a journey to obtain scriptures.",
    conversation: new ConversationModel(),
    startingPos: new Vec2(23, 47),
    upSprites: TypedAssets.spriteSheets.carolup,
    downSprites: TypedAssets.spriteSheets.caroldown,
    leftSprites: TypedAssets.spriteSheets.carolleft,
    rightSprites: TypedAssets.spriteSheets.carolright,
  },
  {
    id: 3,
    name: "Bull Demon King",

The core of each NPC is:

  1. Their unique personality - we introduce a series of attributes to customize their personality: description, personal history, personal knowledge, age, star sign, etc.
  2. A series of properties/items that can be interacted with by the player: money, items.
  3. Memory: We use each NPC's conversation records as its memory. Of course, we can also include all of the NPC's previous actions in the memory.

Implementing NPC-player interaction - Conversation:

In order to have ChatGPT provide customized dialogue, we need to provide:

const fullPrompt = generalContent + personalContent + currentState;

Which includes: general - worldview, personal - NPC personality and memory, current - current game progress.

  1. generalContent:

    const generalContent = npcSharedPrompt + worldHistory + worldKnowledge;

Informs ChatGPT about the task at hand, the background of the worldview, etc.

  2. personalContent:

    const personalContent = `Your name is ${npc.name}, ${npc.age} years old, you have the personality of a ${npc.starSign}. You have ${npc.money} fictional dollars. ${npc.personalHistory} ${npc.personalKnowledge} ${storySoFar}`;

Provides NPC’s personal information and personality (e.g., age, history, knowledge), and NPC’s memory (storySoFar).

  3. currentState:

    const prompt = `${timeMsg} at ${envDescription}, What would ${npc.name} say to Wukong? (Keep the response short and just the words your character says)`;

Current game time, character’s location, etc.
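Putting the three layers together, the final prompt can be assembled in one place. A sketch of that composition (the `buildFullPrompt` helper and its parameters are paraphrased from the demo, not its exact API):

```typescript
interface Npc {
  name: string; age: number; starSign: string; money: number;
  personalHistory: string; personalKnowledge: string;
}

// Assemble general (worldview), personal (personality + memory), and
// current (game state) content into the single prompt sent to the model.
function buildFullPrompt(
  generalContent: string, npc: Npc, storySoFar: string,
  timeMsg: string, envDescription: string,
): string {
  const personalContent =
    `Your name is ${npc.name}, ${npc.age} years old, you have the personality ` +
    `of a ${npc.starSign}. You have ${npc.money} fictional dollars. ` +
    `${npc.personalHistory} ${npc.personalKnowledge} ${storySoFar} `;
  const currentState =
    `${timeMsg} at ${envDescription}, What would ${npc.name} say to Wukong? ` +
    `(Keep the response short and just the words your character says)`;
  return generalContent + personalContent + currentState;
}
```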

Implementing NPC-player interaction - Actions:

content: `Wukong replies "${replyText}". What would you like to do?
        1: Make Wukong follow you,
        2: Say goodbye to him,
        3: Continue the current conversation,
        Pick an action from the list above. respond with just the number for the action`,


We provide a series of actions for the NPC to choose from, and ChatGPT will determine the NPC’s next action. The prompt here also includes location, time, conversation history, etc., but for brevity, they are omitted here.
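Because the model is asked to answer with just a number, the reply still needs defensive parsing: models occasionally pad the number with extra words. A hedged sketch of such a parser (not the demo's actual code):

```typescript
// Extract the chosen action (1..maxAction) from a model reply such as "2"
// or "I pick option 2." Returns null when no valid action number is found.
function parseActionChoice(reply: string, maxAction = 3): number | null {
  const match = reply.match(/\d+/);
  if (!match) return null;
  const n = parseInt(match[0], 10);
  return n >= 1 && n <= maxAction ? n : null;
}
```

When the parse fails, a reasonable policy is to fall back to "continue the current conversation" rather than crash the interaction.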

Implementing NPC Memory - Remembering All Interactions with the Player:

  1. After each conversation between the NPC and the player, ChatGPT will generate a summary of the conversation:

    // summarize conversation
    const summary = await this.summarizeConversation(conversation, endConversationText);

  2. Add the summary of the conversation to the history of conversations:

    const updatedConversation: IConversationModel = {
      isActive: false,
      history: [...conversation.history, { msg: `Conversation summary: ${summary}` }],
      messages: [],
    };
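Keeping only rolling summaries in long-term history is also what caps token growth over many conversations. A self-contained sketch of this archival step, assuming `IConversationModel` holds `history` entries of the form `{ msg }` as above:

```typescript
interface HistoryEntry { msg: string; }
interface IConversationModel {
  isActive: boolean;
  history: HistoryEntry[];   // long-term memory: one summary per past conversation
  messages: HistoryEntry[];  // short-term buffer for the current conversation
}

// Close a conversation: archive its summary into long-term history and
// clear the per-session message buffer.
function archiveConversation(
  conversation: IConversationModel, summary: string,
): IConversationModel {
  return {
    isActive: false,
    history: [...conversation.history, { msg: `Conversation summary: ${summary}` }],
    messages: [],
  };
}
```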

Interacting with NPC Items

Avoiding Risks in NPC/Player Interactions

Here, we let ChatGPT determine if the player/NPC’s response is particularly unusual. If it is, ChatGPT should refuse to give a serious answer!

private async validateReply(replyText: string, conversation: IConversationModel): Promise<ChatNumberResponse> {
    const promptMsgs: GptMessage[] = [...this.mapToGptMessages(conversation), {
        role: "user",
        content: `Wukong replies "${replyText}". Does his response make sense? On a scale of 1 to 5,
                1: Response is non-sensical,
                2: Response is immersion breaking or meta and acknowledging this is a game,
                3: Response is bad, unnecessarily vulgar for no reason based on the past conversation,
                4: Response is all right and something someone might say but unlikely,
                5: Response is good and mostly in the context of the game world,
        how would you rate the response, and give a one-sentence reason why`,
    }];
    // ...
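The 1-5 rating can then gate the reply: below some threshold the NPC deflects instead of answering in earnest. A minimal sketch of that policy (the threshold and fallback wording are illustrative, not the demo's):

```typescript
// Decide how the NPC reacts to Wukong's reply given the validator's 1-5 score.
function gateReply(score: number): { accept: boolean; fallback?: string } {
  if (score >= 3) return { accept: true };  // plausible enough to answer in character
  return {                                  // nonsensical or immersion-breaking
    accept: false,
    fallback: "The monk frowns: \"I do not understand such strange words.\"",
  };
}
```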

There are many more details and features that are difficult to fully demonstrate here. Please feel free to review the code.


That’s all. I am very optimistic about the use of LLM in games. I have been happily playing with this simple demo for quite some time.

Technically, the overall framework of this demo is relatively complete and is suitable for experimenting with various agent methods and prompts based on this foundation.

In addition, since the demo is implemented using React, I recommend watching a 2-hour React introductory video to seamlessly get started.

The complete code can be found here:


Supplement: one issue with generative agents in practical use is that each call requires a large prompt, resulting in high token usage. Prompt compression is needed here to reduce the overhead; if possible, fine-tuning the model yields even higher compression efficiency.
Prompt compression:

Issue 9: Compress your prompts, enabling LLMs to handle up to twice as much context.

AI Agents: Using LLM for better decision making.

For a better reading experience, please refer to:

Big Language Model Based AI Agents—Part 1 | Breezedeus.com


Agent Architecture: Generative Agents: Interactive Simulacra of Human Behavior

  • Memory Stream and Retrieval
  • Reflection
  • Planning and Reacting
    • Plan and Respond
    • Dialogues

What is an AI Agent?

An agent is an intelligent entity that can autonomously perceive its environment and take actions to achieve goals. AI agents based on large language models (LLMs) employ LLMs for tasks such as memory retrieval, decision-making, and action-sequence selection, pushing agent intelligence to new heights. How do LLM-driven agents work? This series of articles introduces the latest technological advances in AI agents.

Core concepts in LangChain [4]:

  • Models: the familiar API calls to large models.
  • Prompt Templates: introduce variables into the prompt to adapt to user input.
  • Chains: chain calls to models, with the output of one model serving as the input to the next.
  • Agent: can autonomously execute chain calls and access external tools.
  • Multi-Agent: multiple agents that share part of their memory and collaborate autonomously.

Difference Between Agent and Chain in LangChain:

The core idea of agents is to use an LLM to choose a sequence of actions to take. In chains, a sequence of actions is hardcoded (in code). In agents, a language model is used as a reasoning engine to determine which actions to take and in which order.

Background Knowledge: Memory [2]

Memory can be defined as the process of acquiring, storing, retaining, and later retrieving information. There are several types of memory in the human brain.

  • Sensory Memory: the earliest stage of memory, providing the ability to retain impressions of sensory information (visual, auditory, etc.) after the original stimulus has ended. Sensory memory typically lasts only a few seconds.
  • Short-Term Memory (STM) or Working Memory: stores information that we are currently aware of and can access for complex cognitive tasks such as learning and reasoning. Short-term memory is believed to have a capacity of about seven items and lasts for 20-30 seconds.
  • Long-Term Memory (LTM): stores information over longer periods of time, ranging from a few days to several decades. LTM has two subtypes:
    • Explicit / Declarative Memory: memories of facts and events that can be consciously recalled, including episodic memory (events and experiences) and semantic memory (facts and concepts).
    • Implicit / Procedural Memory: unconscious memory involving automatic execution of skills and routines, such as riding a bike or typing on a keyboard.

How to Write Good Prompts: ReAct

ReAct stands for "Reason + Act". It is a prompt structure commonly used in AI agents: a dynamic prompt that calls the LLM iteratively, interleaving reasoning steps with actions.
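The ReAct control flow can be made concrete with the LLM mocked as an injectable function, so the Thought / Action / Observation loop is visible (all names here are illustrative, not any framework's API):

```typescript
type Tool = (input: string) => string;
type Llm = (prompt: string) => string;

// Iteratively call the LLM; execute any "Action: tool[input]" it emits,
// feed the observation back, and stop when it emits "Final Answer: ...".
function reactLoop(llm: Llm, tools: Record<string, Tool>, question: string, maxSteps = 5): string {
  let prompt = `Question: ${question}`;
  for (let i = 0; i < maxSteps; i++) {
    const out = llm(prompt);
    const final = out.match(/Final Answer:\s*(.*)/);
    if (final) return final[1];
    const action = out.match(/Action:\s*(\w+)\[(.*)\]/);
    if (!action) break;
    const tool = tools[action[1]];
    if (!tool) break;
    const observation = tool(action[2]);
    prompt += `\n${out}\nObservation: ${observation}`;
  }
  return "no answer";
}
```

The loop terminates either on a final answer, on an unparseable reply, or after `maxSteps` iterations, which bounds cost when the model rambles.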

Stanford’s Virtual Town

Generative Agents: Interactive Simulacra of Human Behavior, 2023.04, Stanford. The code is open source on GitHub. The virtual town consists of 25 agents, each representing a virtual character, and their interactions.

Agent Architecture

Agents perceive their environment, and their observations are stored in a memory stream. Based on these observations, the system retrieves relevant memories and uses them to decide the next action. These retrieved memories also serve as the basis for long-term planning and higher-level reflections, which are stored in the memory stream for future use.

  1. Memory Stream and Retrieval
    • Recency: recent memories are given higher scores.
    • Importance: memories are scored based on their perceived importance.
    • Relevance: memories are scored based on their relevance to the current situation.
    • Retrieval score is a weighted sum of these scores.
  2. Reflection
    • Generates high-level, abstract thoughts based on recent experiences.
    • Generated recursively, forming a reflection tree.
  3. Planning and Reacting
    • Plans are stored in the memory stream.
    • Plans are broken down into smaller actions recursively.
    • Actions are updated based on observations, and plans can be adjusted accordingly.
    • Dialogue between agents is generated based on their memories and the current context.
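The retrieval score in step 1 is a weighted sum of the three normalized components. A minimal sketch, using the paper's defaults of equal weights and an exponential recency decay factor of 0.995 per game hour (the `Memory` shape is illustrative):

```typescript
interface Memory {
  ageHours: number;    // hours since the memory was last retrieved
  importance: number;  // already normalized to [0, 1]
  relevance: number;   // already normalized to [0, 1]
}

// score = w_r * recency + w_i * importance + w_v * relevance,
// with recency decaying exponentially over elapsed hours.
function retrievalScore(m: Memory, w = { r: 1, i: 1, v: 1 }): number {
  const recency = Math.pow(0.995, m.ageHours);
  return w.r * recency + w.i * m.importance + w.v * m.relevance;
}
```

The top-scoring memories are then placed into the prompt, so tuning the weights directly shapes what the agent "remembers" at decision time.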

Video Links

  • YouTube
  • Bilibili


  1. More questions about agents based on large models.
  2. Lil’Log - LLM Powered Autonomous Agents.
  3. Generative Agents: Interactive Simulacra of Human Behavior.
  4. Agent: OpenAI’s next step, with Amazon Web Services at the 5th level.
  5. ReAct: react-lm.github.io.
  6. Generative Agents: Interactive Simulacra of Human Behavior.
  7. New generation large model agent technology in 2023, including ReAct, Self-Ask, Plan-and-execute, AutoGPT, HuggingGPT, and other applications.
  8. LLM-based Agents survey, a simple overview and outlook of multi-intelligent agents based on large language models.

What is an Agent?

An agent is an intelligent entity that can autonomously understand, plan, and execute complex tasks.

Agent vs ChatGPT

Agent is not just an upgraded version of ChatGPT. It not only tells you “how to do,” but also helps you do it. If Copilot is the co-pilot, then Agent is the pilot.

Agent Decision Process

Perception -> Planning -> Action

Agent collects information from the environment through perception, makes decisions based on goals through planning, and takes actions accordingly.

Agent Breakthrough

Agent has made significant progress in various areas, such as Camel, AutoGPT, BabyAGI, and more. NVIDIA’s AI agent Voyager has even surpassed AutoGPT and achieved remarkable results in Minecraft.

Agent’s Formula

Agent = LLM + Planning + Feedback + Tool Use

Agent Capability

Agent can create, complete, and prioritize tasks independently. It can also loop through tasks until a goal is achieved.

Human Workflow

In human workflow, the PDCA model is commonly used for task completion — Plan, Do, Check, Act. This model helps enhance task efficiency and success.

How can LLM replace humans in tasks?

To replace humans in tasks, LLM can follow the PDCA model for planning, execution, evaluation, and reflection.

Virtual Town from Stanford

Stanford introduced Generative Agents, a virtual town where agents interact and tell stories.

Architecture of Virtual Town

Memory, Reflection, and Planning are core components of the virtual town framework.

LangChain Concepts

Models, Prompt Templates, Chains, Agent, Multi-Agent, and more are concepts in the LangChain framework.

Bottleneck in Agent Implementation

The bottleneck lies in both the limitations of LLM itself and the need for an external controller to support the agent.

Path to Universal Agent

Developing specialized agents for specific scenarios and gradually transforming them into a universal framework is one possible path to achieve a universal agent.

Multi-modal in Agent Development

Multi-modal perception is important for agents, but it cannot solve the problems of cognition. Future agents will be multi-modal and integrate multiple models.

Emerging Consensus on Agents

The use of external tools and code outputs is becoming the consensus in agent development, providing a thorough solution for complex tasks.

HF: Transformers Agents Release

Transformers Agents is now integrated into Transformers 4.29, allowing natural language control of HF models.


  • The second half of the large-model era: several questions about Agents
  • BabyAGI
  • Agent: OpenAI's next step, with Amazon Web Services at the 5th level

What is an Agent?

The term "Agent" originates from the Latin "agere", which means "to do". In the context of LLMs, an Agent can be understood as an intelligent entity capable of independently understanding, planning, making decisions, and executing complex tasks.

Agent is not just an upgraded version of ChatGPT. It not only tells you “how to do”, but also helps you to do it. If Copilot is the co-pilot, then Agent is the driver.

Autonomous Agents are AI-driven programs that, when given a goal, can create tasks, complete tasks, create new tasks, re-prioritize the task list, complete new top-level tasks, and repeat until the goal is achieved.

The most intuitive formula

Agent = LLM+Planning+Feedback+Tool use

Agent decision process

Perception -> Planning -> Action

  • Perception refers to the ability of the Agent to collect information from the environment and extract relevant knowledge.
  • Planning refers to the decision-making process that the Agent undertakes for a specific goal.
  • Action refers to the actions taken based on the environment and the planning.

Agent collects information and extracts relevant knowledge from the environment through perception. It then makes decisions through planning to achieve a certain goal. Finally, it takes actions based on the environment and planning. Policy is the core decision-making process for the Agent’s actions, and the actions provide the basis for further perception, forming an autonomous closed-loop learning process.
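The closed loop described above can be sketched as three injected functions wired in sequence, with each action's outcome fed back in as the next observation (purely illustrative types and names):

```typescript
type Observation = string;

// Run the Perception -> Planning -> Action cycle for a fixed number of
// steps; the environment's response to each action becomes the next
// observation, closing the loop.
function runEpisode(
  initial: Observation,
  perceive: (o: Observation) => string,   // extract relevant knowledge
  plan: (knowledge: string) => string,    // decide on an action
  act: (action: string) => Observation,   // environment returns a new observation
  steps: number,
): Observation {
  let obs = initial;
  for (let i = 0; i < steps; i++) {
    obs = act(plan(perceive(obs)));
  }
  return obs;
}
```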

Agent’s Big Break

  • On March 21st, Camel was released.
  • On March 30th, AutoGPT was released.
  • On April 3rd, BabyAGI was released.
  • On April 7th, the Westworld town was released.
  • On May 27th, NVIDIA’s AI agent Voyager, integrated with GPT-4, completely outperformed AutoGPT. Through autonomous coding, it completely dominated “Minecraft” and was able to achieve lifelong learning in the game without human intervention.
  • Meanwhile, SenseTime, Tsinghua, and other institutions proposed Ghost in the Minecraft (GITM), a versatile AI agent that also performed excellently by solving tasks through autonomous learning. These high-performing AI agents offer a glimpse of a future of AGI + Agents.

Agent gives LLM the ability to achieve goals and accomplishes this goal through self-motivated cycles.

It can be parallel (using multiple prompts at the same time to attempt the same goal) or unidirectional (no human intervention in the conversation).

After creating a goal or main task for the Agent, it is mainly divided into the following three steps:

  1. Get the first unfinished task.
  2. Collect intermediate results and store them in a vector database.
  3. Create new tasks and re-set the priority of the task list.
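The three steps above form a loop that runs until the task queue empties or an iteration budget is hit. A sketch with the LLM calls stubbed out as injectable functions (names are illustrative, not any specific framework's API; the vector database is represented by a plain results array):

```typescript
type TaskResult = { task: string; result: string };

// Run the agent loop: pop the first unfinished task, execute it, store the
// intermediate result, then let a planner create and reprioritize tasks.
function runTaskLoop(
  tasks: string[],
  execute: (task: string) => string,                        // would call the LLM
  plan: (done: TaskResult, pending: string[]) => string[],  // creates/reorders tasks
  maxIterations = 10,
): TaskResult[] {
  const results: TaskResult[] = [];
  const queue = [...tasks];
  for (let i = 0; i < maxIterations && queue.length > 0; i++) {
    const task = queue.shift()!;                  // 1. get the first unfinished task
    const result = execute(task);                 // 2. collect the intermediate result
    results.push({ task, result });               //    (a vector database in practice)
    const next = plan({ task, result }, queue);   // 3. create tasks, re-set priorities
    queue.splice(0, queue.length, ...next);
  }
  return results;
}
```

The `maxIterations` cap is essential: without it, a planner that keeps inventing new tasks never terminates, which is a failure mode AutoGPT-style agents are known for.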

How do people work?

In work, we often use the PDCA thinking model. Based on PDCA, completing a task breaks down into planning, implementing the plan, checking the results, then adopting successful results as standards and carrying unsuccessful ones into the next cycle. This is a proven summary of how people complete tasks efficiently.

How to let LLM replace human work?

To let LLM replace human work, we can use the PDCA model for planning, execution, evaluation, and reflection.

Planning (Plan) -> Break down tasks: The Agent’s brain breaks down large tasks into smaller, manageable subtasks, which is effective for handling large complex tasks.

Execution (Do) -> Use tools: The Agent can learn to invoke external APIs when its internal knowledge is insufficient (knowledge is frozen at pre-training time and the weights cannot be changed afterward): accessing real-time information, executing code, querying proprietary information repositories, and so on. This is a typical platform + tools scenario, which calls for ecosystem thinking: build the platform and the essential tools, then attract other vendors to contribute more component tools to form an ecosystem.

Evaluation (Check) -> Verify execution results: The Agent should be able to determine whether the output meets the goal after a task is executed normally, classify exceptions (harm level) when they occur, locate exceptions (which sub-task caused the error), and analyze the reasons for the exception. This is a capability that general large models do not possess and requires training separate small models for different scenarios.

Reflection (Act) -> Replan based on evaluation results: The Agent should end the task promptly when the output meets the goal; this is the core of the entire process. It should also attribute and summarize the main factors that contributed to a successful result. When exceptions occur or the output does not meet the goal, the Agent should propose countermeasures and re-plan, initiating a new cycle.

LLM, as an intelligent agent, has led people to think about the relationship between artificial intelligence and human work and future development. It makes us think about how humans can work with intelligent agents to achieve more efficient ways of working. This way of cooperation also leads us to reflect on the value and strengths of humans themselves.

Virtual Town from Stanford

A virtual town in which each of the 25 agents represents a virtual character, and stories unfold between them.



Memory

  • Short-term Memory: learning within the context. It is short-lived and limited by the Transformer's context window length.
  • Long-term Memory: external vector storage that the agent attends to at query time, accessible through fast retrieval.


Reflection is the higher-level, more abstract thinking generated by the agent. Since reflection is also a form of memory, it is included in retrieval together with other observed results. Reflection is periodically generated; when the sum of importance scores of the most recent events perceived by the agent exceeds a certain threshold, reflection is generated.

  • Letting the agent determine what it should reflect on.
  • Generated questions act as queries for retrieval.


Planning is for longer-term plans. Similar to reflection, planning is stored in the memory flow (the third type of memory) and included in retrieval. This allows the agent to consider observations, reflections, and plans when deciding how to act. The agent may change its plans midway if necessary (i.e., reacting).

Various Concepts in Class LangChain

  • Models: Referring to familiar API calls to large models.
  • Prompt Templates: Introducing variables in prompts to adapt to user input.
  • Chains: Calling the model in a chain, using the output of the previous step as part of the input for the next step.
  • Agent: An entity that can independently perform chain calls and access external tools.
  • Multi-Agent: Multiple agents sharing part of their memory and autonomously dividing the work.

Bottlenecks in Implementing Agents

Agents themselves require two parts: one is LLM as its “intelligence” or “brain,” and the other is a controller based on LLM. The controller completes various prompts, such as enhancing memory through retrieval, receiving feedback from the environment, and performing reflection.

Agents need both a brain and external support.

  • Issues with the LLM itself: its "intelligence" is still insufficient (a stronger model, say a GPT-5, would help), and prompts may be mis-formatted, so questions must be unambiguous.
  • External tools: The level of systematization is not sufficient, and external tool systems need to be invoked. This is a long-term problem to be solved.

At the current stage of implementing agents, in addition to the sufficient generalization of LLM itself, a universal external logic framework needs to be implemented. It is not just an issue of “intelligence,” but also how to leverage external tools to go from specific to general. This is an even more critical issue.

The path to implementing a general agent from a specialized one

Assuming the agent will ultimately be deployed in 100 different environments, and considering that even the simplest external applications are currently difficult to implement, can we abstract a framework model to address all external generalization issues?

Start by making an agent perform extremely well in one specific environment, stable and robust enough, then gradually generalize it into a universal framework. Perhaps this is one of the paths to achieving a general agent.

The Development of Multimodality in Agents

  • Multimodality can only address perception issues, not cognitive issues.
  • Multimodality is an inevitable trend. Future large models will inevitably be multimodal, and future agents will also be agents in a multimodal world.

The Emerging Consensus on Agents

  • Agents need to invoke external tools.
  • The way to invoke tools is by outputting code.

Having the LLM output executable code works like a semantic parser: it understands the meaning of each sentence, converts it into machine instructions, and then calls external tools to execute them or generate answers. Although the current form of Function Calling still needs improvement, invoking tools this way is necessary, and it is the most thorough means of solving the hallucination problem.

ChatterMill: Aiming to develop a universal agent

In the Chinese market, an agent developed in deep cooperation with a particular enterprise eventually turns into outsourcing, because it must be privately deployed and integrated into the enterprise's workflow. Many companies will compete for major clients in insurance, banking, and automotive, and the outcome will closely resemble that of the previous generation of AI companies: marginal costs are hard to reduce and generality is lacking. Currently, AIGC products such as ChatterMill's Magic Factory and WonderSense target content creators and sit between deep and shallow integration, belonging entirely to neither consumers nor enterprises. CoPilot-style products for enterprise users, meanwhile, aim to find specific "scenarios" within enterprises and build relatively general applications for those scenarios.

HF: Transformers Agents Released

Enables control of over a hundred thousand HF models through natural language!

Transformers Agents have been integrated into Transformers after version 4.29. It introduces a natural language API on top of Transformers to “make Transformers do anything”.

Two concepts are key: Agents and Tools. We define a range of default tools, and the agent understands natural language and uses these tools.

  • Agents: Referring to large language models (LLMs), you can choose to use OpenAI’s models (requiring an API key), or open source models like StarCoder and OpenAssistant. We prompt the agent to access a specific set of tools.
  • Tools: Referring to individual functionalities. We define a series of tools and prompt the agent with descriptions of how it will utilize these tools to execute what is requested in the query.

The integrated tools in transformers include Document QA, Text QA, Image Captioning, Image QA, Image Segmentation, Speech-To-Text, Text-To-Speech, zero-shot Text Classification, Text Summarization, Translation, etc. However, you can also extend them with custom tools unrelated to transformers, such as reading text from the web. See the documentation on developing custom tools to learn more.

The future is a world of agents. For now, though, the Agent wave is still repeating yesterday's AI story: the challenge remains private deployment.


Several Doubts About Agents in the Big Model Era


Agent: OpenAI’s Next Step, Amazon Cloud Technology Stands on the 5th Layer

Agent Applications


“Why you should work on AI Agents?”

This is the topic of a recent talk by Andrej Karpathy, co-founder of OpenAI, in which he expressed great appreciation and admiration for the rapid development of AI Agents: "But when new AI Agents papers come out, we are very interested and think it's very cool, because our team didn't spend five years on it. We don't know more than you do. We are competing with all of you. That's why we think you are at the cutting edge of AI Agents capabilities." In June of this year, OpenAI released a new version of GPT that supports function calling, a version optimized specifically for Agents.

So how hot is the Agent field recently? Let’s take a look at the number of stars accumulated on GitHub: AutoGPT currently has 145k stars (released on March 30 this year), AgentGPT has 27k stars, and Camel has 2.7k stars. These projects have received significant funding. In China, many start-up companies have also started to explore the Agent platform or application market. It can be seen that Agent technology is currently the most recognized and promising direction in the application layer.

What is an Agent?

Official definition:

An Agent is an AI system or software program that does not require continuous human intervention. Based on environmental and background information, an autonomous agent can solve various problems, make logical decisions, and perform multiple tasks without continuous human input.


  1. Autonomous AI agents are trained based on given objectives.
  2. They have planning, memory, tool use, and reflective capabilities beyond large language models (LLMs).
  3. They have multimodal perception capabilities (text, video, images, sound, etc.).

In other words, an Agent = LLM + Memory + Planning + Tool Use.

To put it simply:

LLM (such as GPT) → A specific organization in the brain with common sense, reasoning abilities, etc.

Vector database → Sensory memory, short-term memory, and long-term memory in the human brain (LLMs have limited context and require external storage assistance).

Agent → An independent individual with various senses who can manipulate external tools.

Agent system → A group of intelligent agents that cooperate with each other to form a team.
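The formula Agent = LLM + Memory + Planning + Tool Use can be sketched as a minimal loop. Everything below (the stub LLM, the plan format, the tool name `web_search`) is invented for illustration; a real agent would replace the stub with model API calls:

```python
# Minimal agent sketch: Agent = LLM + Memory + Planning + Tool Use.
# The "LLM" is a hard-coded stub; a real system would call a model API.

def stub_llm(prompt: str) -> str:
    """Stand-in for an LLM call: plans, or picks a tool."""
    if prompt.startswith("PLAN:"):
        return "search; summarize"          # a two-step plan
    if "search" in prompt:
        return "TOOL:web_search"            # decide to use a tool
    return "FINAL:done"

class Agent:
    def __init__(self, llm, tools):
        self.llm = llm
        self.tools = tools      # Tool Use: callable functions
        self.memory = []        # Memory: running record of steps

    def run(self, goal: str) -> list:
        plan = self.llm(f"PLAN: {goal}").split("; ")   # Planning
        for step in plan:
            decision = self.llm(step)
            if decision.startswith("TOOL:"):
                tool = self.tools[decision.removeprefix("TOOL:")]
                self.memory.append(tool(step))
            else:
                self.memory.append(decision)
        return self.memory

agent = Agent(stub_llm, {"web_search": lambda q: f"results for '{q}'"})
print(agent.run("find an investment opportunity"))
# → ["results for 'search'", 'FINAL:done']
```

The point is only the wiring: the LLM supplies reasoning, while memory, planning, and tools are plumbing around it.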

As the saying goes, “Two heads are better than one.” Collaboration and discussion are essential in work. An excellent Agent system will have broad application prospects, requiring continuous exploration and innovation. In the future, the barriers will lie in the architecture of Agents (perception, deduction, execution, etc.) and data specific to certain domains (such as NPC in AI games, which requires extensive self-experience accumulation to form unique personality charm).

Recommended reading: “Efficient Brain Science: Efficiently Completing Every Task” by David Rock, an accessible introduction to brain structure.

Applications of Agents

Agents have been applied and developed in various fields, and they will become the basic architecture of future applications, including toC (to consumer), toB (to business) products, and internal tools in large companies.

Here are some application areas of Agents, many of which have impressive demos:

Autonomous AI Agent Use Cases


Personal Assistant

Performs various tasks such as searching and answering questions, booking travel and other activities, managing calendars and finances, monitoring health and fitness activities. For example, WebGPT.

Software Development

Supports coding, testing, and debugging in application development and excels at handling tasks involving natural language as input.

Interactive Gaming

Handles game tasks such as creating smarter NPCs, developing adaptive villain characters, performing game balancing and load balancing, and providing contextualized assistance to players.

Predictive Analytics

Real-time data analysis and updates, interpreting data insights, identifying patterns and anomalies, adjusting prediction models to adapt to different use cases and needs.

Autonomous Driving

Provides environment models and images for autonomous vehicles, offers decision guidance, and supports vehicle control.

Smart Cities

Provides technological foundations that do not require continuous human maintenance, especially in traffic management.

Intelligent Customer Service

Handles customer support queries, answers questions, and assists with inquiries about previous transactions or payments.

Financial Management

Provides research-based financial advice, portfolio management, risk assessment and fraud detection, compliance management, reporting, credit assessment, underwriting, and expense and budget management support.

Task Generation and Management

Generates and executes efficient tasks.

Intelligent Document Processing

Includes classification, deep information analysis and extraction, summarization, sentiment analysis, translation, version control, etc. For example, chatPDF, chatPaper.

Scientific Exploration

For example, when asked to “develop a new cancer drug,” the model proposes the following reasoning steps: 1. Understand current trends in cancer drug discovery. 2. Select a target. 3. Request scaffolds for these compounds. 4. Once the compounds are identified, the model attempts to synthesize them.

How to Experience Agents

Under the concept of Agents, there are currently many popular products available. I recommend trying out AgentGPT and Auto-GPT for free on their respective websites to see how Agents handle complex tasks.

For example: AutoGPT writes papers by first searching for relevant data online and storing it before providing the output.

AgentGPT generates a comprehensive travel plan for Hawaii and can even book flights.

Some popular Agent products currently available include:

  • Auto-GPT: An open-source GitHub project that uses autonomous AI to create personal assistants. Built on GPT-4 and soon to be accessible through a GUI/web application. Auto-GPT serves as the basis for other popular autonomous AI solutions like GodMode.

  • BabyAGI: A Python script independently managed on GitHub that utilizes OpenAI and various vector databases for autonomous AI-driven task management.

  • AgentGPT: A goal-driven, web-based autonomous AI tool that allows you to deploy autonomous agents to create, complete, and learn various tasks. By chaining different LLMs, each deployed agent can review past experiences and tasks.

  • SuperAGI: Provides an open-source framework, agent templates, a marketplace, and documentation to support the development of autonomous AI for various goals. SuperCoder is its latest project, aimed at providing user-friendly agent templates for easier use.

  • GodMode: A solution primarily used to support creative problem-solving with autonomous AI. Built on top of Auto-GPT; users need to enable JavaScript in their browser to use it.

  • CAMEL: Role-playing with two LLMs to complete difficult tasks.

  • HyperWrite Personal Assistant: An autonomous AI assistant designed to assist with personal tasks on the web. Can help with tasks such as travel booking, research, organization, and product ordering.

For more technical details, please refer to “LLM Powered Autonomous Agents” by Lilian Weng of OpenAI.

The Three Key Components of an Agent System

An autonomous Agent system based on LLMs consists of several key components:

  • Planning

  • Subgoals and decomposition: Agents can break down large tasks into smaller, more manageable subgoals, allowing them to effectively handle complex tasks.

  • Reflection and improvement: Agents can engage in self-criticism and reflection on past actions, learn from mistakes, and improve future steps to enhance the quality of the final result.

  • Memory

  • Short-term memory: Information in the prompt or dialogue context.

  • Long-term memory: This provides agents with the ability to retain and recall (indefinitely) information over extended periods, often accomplished through the use of external vector storage for quick retrieval (e.g., chatPDF, online searches).

  • Tool Use

  • Agents can call additional APIs to supplement the information and capabilities lacking in LLMs through output instructions. This includes current information, code execution capabilities, and access to proprietary information sources.

Now, let’s take a detailed look at these three key components:

Part 1: Planning

The frontal cortex of the brain serves as the biological basis for conscious interaction between humans and the world. It enables thinking, planning, prioritizing, and error prevention. In an Agent, there are two key abilities: task decomposition and self-reflection.

Task Decomposition

There are currently three main approaches to task decomposition: chain of thought, tree of thoughts, and decomposition with domain knowledge:

  1. Chain of Thought: Breaking down a large task into multiple manageable subtasks while spelling out the model’s reasoning step by step. This style of prompting enhances large models’ ability to handle complex tasks.
  2. Tree of Thoughts: Breaking down each step into multiple steps, resulting in a tree-like task chain. Executed through breadth-first search (BFS) or depth-first search (DFS) with nodes evaluated using prompt classifiers or voting.
  3. LLM+P: Leveraging external classical planners for long-term planning (combining domain-specific rules). Planning steps are outsourced to external tools. The LLM translates the problem into “Problem PDDL” (Planning Domain Definition Language), then requests the classical planner to generate a PDDL plan based on existing “Domain PDDL,” and finally translates the PDDL plan back into natural language.
    “Empowering Large Language Models with Optimal Planning Proficiency” is a paper that explores this approach.
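A tree-of-thoughts search can be sketched as a breadth-first expansion with a scoring function. Here both the thought generator and the evaluator are stubs (in a real system, each would be an LLM call or a vote over LLM outputs):

```python
# Tree of Thoughts sketch: expand each partial solution into several
# candidate "thoughts", score them, and keep only the best (beam-style BFS).

def propose(state):
    """Stub thought generator: in practice an LLM proposes next steps."""
    return [state + [c] for c in ("a", "b", "c")]

def evaluate(state):
    """Stub evaluator: in practice an LLM or vote scores each state."""
    return state.count("a")   # toy heuristic: prefer states with more 'a'

def tot_bfs(depth=3, beam=2):
    frontier = [[]]                       # root: empty chain of thoughts
    for _ in range(depth):
        candidates = [s for state in frontier for s in propose(state)]
        candidates.sort(key=evaluate, reverse=True)
        frontier = candidates[:beam]      # keep the best `beam` states
    return frontier[0]

print(tot_bfs())   # → ['a', 'a', 'a']
```

Swapping the beam for a stack would give the DFS variant mentioned above; the structure is otherwise identical.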

Prompt examples for planning: Have GPT answer the steps and subgoals of achieving a major goal. Begin with an outline and then fill in the details.

(1) by LLM with simple prompting like "Steps for XYZ.\n1.", "What are the subgoals for achieving XYZ?",
(2) by using task-specific instructions, e.g., "Write a story outline" for writing a novel, or
(3) with human inputs.


Self-reflection is another critical aspect of planning. It allows autonomous agents to iterate and improve by critiquing past actions and correcting previous mistakes. When inefficient, deceptive, or persistently failing tasks arise, Agents can stop, optimize, and reset.

Implementation ideas:

ReAct (Yao et al. 2023): The idea is to decompose large tasks into a combination of actions to be taken (interacting with external tools, such as online searches) and linguistic traces (recording the reasoning process in natural language form). An action represents the use of an external tool, action input represents the input required for the function call (e.g., JSON parameters for executing a Google Search), a thought describes the model’s reasoning based on user questions, and an observation represents the result of the action. By having GPT repeatedly generate and complete these components (thought, action, action input, observation), complex tasks can be autonomously completed. This can be understood as increasing the computational load on GPT, using multiple iterations to reason step by step towards completing the larger task. Sample prompt:

Thought: … Action: … Observation: … … (Repeated many times) Thought: I now know the final answer. Final Answer: The final answer to the original user input question.
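The Thought/Action/Observation cycle can be sketched as a loop that parses the model’s output and feeds tool results back in. The scripted model below stands in for a real LLM call, and the tool name and replies are made up:

```python
# ReAct loop sketch: alternate model "thoughts" with tool calls until
# the model emits a Final Answer. The model here is a scripted stub.

SCRIPT = iter([
    "Thought: I need the population of Paris.\n"
    "Action: search\nAction Input: population of Paris",
    "Thought: I now know the final answer.\n"
    "Final Answer: about 2.1 million",
])

def stub_llm(transcript: str) -> str:
    return next(SCRIPT)

TOOLS = {"search": lambda q: "Paris has about 2.1 million residents."}

def react(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        output = stub_llm(transcript)
        transcript += output + "\n"
        if "Final Answer:" in output:
            return output.split("Final Answer:")[1].strip()
        # Parse the Action / Action Input lines and run the tool.
        action = output.split("Action:")[1].split("\n")[0].strip()
        action_input = output.split("Action Input:")[1].strip()
        observation = TOOLS[action](action_input)
        transcript += f"Observation: {observation}\n"
    return "gave up"

answer = react("What is the population of Paris?")
print(answer)  # → about 2.1 million
```

The growing `transcript` is what lets each new "Thought" condition on earlier observations; that accumulation is the whole trick.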

Reflexion: A framework that provides AI Agents with dynamic memory and self-reflection capabilities to enhance reasoning. It follows a standard reinforcement learning setup with an RL feedback loop; in Reflexion, language agents are strengthened through verbal reinforcement rather than weight updates. Self-reflection is created by showing the LLM two-shot examples, each a pair of a failed trajectory and an ideal reflection that guides future changes. For more details, refer to the paper “Reflexion: Language Agents with Verbal Reinforcement Learning”.

Part 2: Memory


To summarize, human memory includes acquiring, storing, retaining, and subsequently retrieving information. In the human brain, there are several types of memory:

Sensory memory: This is the earliest stage of memory, preserving sensory impressions (e.g., visual, auditory) after the original stimulus is no longer present. Sensory memory typically lasts a few seconds. It includes iconic memory (visual), echoic memory (auditory), and haptic memory (tactile).

Short-term memory (STM) or working memory: It stores information that we are currently aware of and pertains to complex cognitive tasks such as learning and reasoning. The capacity of short-term memory is about seven items (Miller, 1956), and its duration is 20-30 seconds.

Long-term memory (LTM): Supported by structures such as the hippocampus, long-term memory can store information for long periods, ranging from a few days to several decades, and has virtually unlimited storage capacity. Long-term memory has two subtypes:

  • Explicit/declarative memory: Memories of facts and events that can be consciously recalled, including episodic memory (events and experiences) and semantic memory (facts and concepts).
  • Implicit/procedural memory: This type of memory is unconscious and involves the automatic execution of skills and routines, such as riding a bike or typing.

In Agent systems, external long-term memory is mainly used to complete complex tasks, such as reading PDFs or searching the internet for real-time news. The underlying retrieval problem is Maximum Inner Product Search (MIPS), typically handled with approximate nearest-neighbor algorithms such as ScaNN (Scalable Nearest Neighbors, which offers strong performance). The standard practice is to save the embedding representation of information in a vector database.
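The retrieval step itself fits in a few lines: embed the query, then return the stored item with the maximum inner product. The toy 3-dimensional "embeddings" below are made up; real systems use model embeddings and approximate indexes like ScaNN:

```python
# Maximum Inner Product Search (MIPS) sketch over a toy vector store.
# Real systems use learned embeddings and approximate indexes (e.g. ScaNN).

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# (embedding, original text) pairs: the "long-term memory".
STORE = [
    ([0.9, 0.1, 0.0], "The meeting was moved to Friday."),
    ([0.1, 0.8, 0.3], "The cat sat on the mat."),
    ([0.2, 0.1, 0.9], "Quarterly revenue grew 12%."),
]

def retrieve(query_vec, k=1):
    """Return the k stored texts with the highest inner product."""
    ranked = sorted(STORE, key=lambda item: dot(item[0], query_vec),
                    reverse=True)
    return [text for _, text in ranked[:k]]

print(retrieve([0.1, 0.0, 1.0]))  # → ['Quarterly revenue grew 12%.']
```

Exact scoring like this is O(n) per query; ScaNN-style approximate indexes trade a little recall for sublinear lookups over millions of vectors.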

Part 3: Tool Use

The use of tools is a manifestation of human wisdom and creativity. It enables us to continually surpass ourselves, adapt to and change the environment, and drive social development and progress. LLMs, similarly, can learn to use tools by themselves. More information can be found in the paper “Toolformer: Language Models Can Teach Themselves to Use Tools” (https://arxiv.org/abs/2302.04761).

The best example of tool use is the official plugin of GPT and the recently upgraded function calling feature (ChatGPT Plugins and OpenAI API function calling: https://platform.openai.com/docs/guides/gpt/function-calling).
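Function calling works by describing your functions to the model in JSON Schema; the model replies with a function name plus JSON arguments, and your code executes the call. A minimal sketch, with the weather function and the model’s reply simulated rather than fetched from the API (the schema follows the shape in OpenAI’s function-calling guide):

```python
import json

# Function-calling sketch: the schema below follows the shape OpenAI's
# API expects; the model "reply" here is simulated, not fetched.

functions = [{
    "name": "get_current_weather",
    "description": "Get the current weather in a given location",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {"type": "string", "description": "City name"},
        },
        "required": ["location"],
    },
}]

def get_current_weather(location):
    return f"Sunny and 22°C in {location}"   # stub implementation

# What a model response's `function_call` field might look like:
model_reply = {"name": "get_current_weather",
               "arguments": '{"location": "New York"}'}

# Dispatch: look up the named function and call it with parsed arguments.
dispatch = {"get_current_weather": get_current_weather}
result = dispatch[model_reply["name"]](**json.loads(model_reply["arguments"]))
print(result)  # → Sunny and 22°C in New York
```

Note that `arguments` arrives as a JSON string the model generated, so it must be parsed (and, in production, validated) before dispatch.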

Agent Applications Research

Here are a few interesting Agent applications with different organizational structures:

1. HuggingGPT: LLM calling other models (Shen et al. 2023)

The figure below shows a framework that uses ChatGPT as a task planner, selects available models from the HuggingFace platform based on model descriptions, and summarizes and returns answers based on execution results.

To answer user questions (such as image descriptions and object counting), GPT analyzes the model library and makes calls to two image models to recognize animal actions. After obtaining the prediction results, GPT returns the final answer. The process mainly involves four steps:




  1. Task Planning: The LLM, acting as the brain, parses the user request into multiple tasks, each with a task type, ID, dependencies, and parameters. The LLM performs task parsing and planning based on a few examples.

  2. Model Selection: The LLM assigns tasks to expert models by framing the choice as a multiple-choice question over a list of candidate models. Because context length is limited, candidates are first filtered by task type, and specific models are then selected based on their descriptions.

  3. Task Execution: The expert models execute their assigned tasks and record the results.

  4. Response Generation: The LLM receives the execution results and combines them into a summarized answer for the user.
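In the HuggingGPT paper, the planner emits each task as JSON with fields like task, id, dep (dependency IDs), and args; executing the plan is then a matter of running tasks in dependency order. A simplified sketch, where the two task names and the executor are invented stand-ins for real expert models:

```python
# Sketch of executing a HuggingGPT-style task plan. Each task lists the
# ids it depends on; a task runs only after its dependencies finish.

plan = [
    {"task": "image-to-text", "id": 0, "dep": [],  "args": {"image": "cat.jpg"}},
    {"task": "text-to-speech", "id": 1, "dep": [0], "args": {}},
]

def run_task(task, results):
    """Stub expert model: records which task ran and what it consumed."""
    inputs = [results[d] for d in task["dep"]]
    return f"{task['task']} output (inputs: {inputs})"

results, done = {}, []
pending = list(plan)
while pending:
    # Pick any task whose dependencies have all finished.
    ready = next(t for t in pending if all(d in results for d in t["dep"]))
    results[ready["id"]] = run_task(ready, results)
    done.append(ready["task"])
    pending.remove(ready)

print(done)  # → ['image-to-text', 'text-to-speech']
```

The dep field is what lets the planner chain models, e.g. feeding an image caption into a speech model, as in the paper’s examples.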

The challenges include:

  1. The need to improve efficiency as multiple inference rounds and interactions with other models can slow down the process.
  2. Dependency on a long context window to handle complex task content.
  3. The need to improve the reliability of LLM outputs (function calling solves the problem of producing valid JSON) and the stability of external model services (the richness of the model ecosystem being a potential moat).

2. CAMEL: Role-playing with Two LLMs to Complete Difficult Tasks

The Camel community has released millions of data records in various fields (math, code, social, physics, chemistry, philosophy, etc.), generated by two interacting GPTs. However, the dataset has not been filtered, and it contains many incomplete records where tasks were abandoned. (https://huggingface.co/camel-ai)

3. Generative Agents Simulation: Simulating a Miniature Society with Multiple LLMs

“Generative Agents” (Park et al. 2023) is a super-interesting experiment in which 25 virtual characters, each controlled by an agent supported by LLM, live and interact in a sandbox environment. It is inspired by “The Sims.” Generative Agents create credible human behavior simulations for interactive applications.

Researchers from Stanford University and Google conducted an experiment using AI to populate a virtual town to observe whether AI can simulate human behavior realistically. This paper introduces a new AI system called “generative agents” that can simulate the authenticity of human behavior. Generative agents can plan their actions based on their experiences, interests, and goals, such as waking up, making breakfast, going to work, painting, writing, communicating, reminiscing, and reflecting.

Paper link: https://arxiv.org/pdf/2304.03442v1.pdf

AgentVerse: Develop Your Own Multi-Agent Collaboration Project with One Click

AgentVerse is a rapidly growing open-source project that provides a basic framework for creating multi-agent environments, customizable components, and tools. It offers a series of demos, including an NLP classroom, the Prisoner’s Dilemma, software design, database administration, and a simple HTML5 Pokémon game that allows interaction with characters from Pokémon! Try out these demos and enjoy the fun!

Furthermore, AgentVerse will soon support quick deployment and demonstration of GPT-World, allowing you to experience the virtual town mentioned earlier.

GitHub - ShengdingHu/GPT-World


The development of AI Agents shares many similarities with neuroscience, including concepts such as neurons, neurotransmitters, perception, action, decision-making, and planning in the human brain. The application prospects of AI Agents are very broad, but realizing their full potential requires long-term dedication and effort. Drawing inspiration from neuroscience and exploring the various functionalities and mechanisms of AI Agents is fascinating and inspiring.

The architecture of Agent systems and domain knowledge may become the core competitiveness in building a star product in the future. With Agent+, anything is possible. In the future, chatting with AI “exes,” “family and friends,” or even with a virtual “you” who knows you well and has preserved your knowledge won’t be out of the ordinary. It’s even possible that having your own custom-built Agent assistant, condensed with domain knowledge, could be the core advantage in future job interviews.


Major references:

  • LLM Powered Autonomous Agents by Lilian Weng.
  • AGI论文营地 (AGI Paper Camp): An aggregated collection of AGI-related papers.


AI Agent: Construction and Function

Recently, a news article titled “Tsinghua team leads the way in creating the first AI agent benchmarking system” has attracted widespread attention. A research team from Tsinghua University, Ohio State University, and UC Berkeley proposed AgentBench, the first systematic benchmark for evaluating LLMs as agents, covering 8 different environments and testing abilities such as reasoning and decision-making on real-world challenges.

AI agents have become increasingly popular on the internet, largely because LLMs provide a viable technical route for implementing agent applications, and many agent-related projects have gained traction.

In June, Lilian Weng, who is currently the Head of Applied AI Research at OpenAI, published a 6,000-word blog post introducing AI agents and suggesting that they could be one of the ways to transform LLM into a general problem-solving solution.

Lilian Weng is currently the Head of Applied AI Research at OpenAI, specializing in machine learning, deep learning, and network science research.

What exactly is an AI agent?

In this summary based on Lilian Weng’s blog post, we will explore the concept of AI agents.

An AI agent is a proxy system with LLM as its core controller. Open-source projects in the industry, such as AutoGPT, GPT-Engineer, and BabyAGI, are examples of this.

The potential of LLM is not limited to generating excellent documents, stories, articles, and programs; it can be framed as a powerful general problem solver. In other words, an AI agent is essentially a proxy system that controls LLM to solve problems. The core capability of LLM is understanding intent and generating text, and if LLM can be taught to use tools, its capabilities will be greatly expanded. An AI agent system is one such solution.

If you are unfamiliar with AutoGPT, you can read my other article on Zhihu, “What is AutoGPT?”: https://zhuanlan.zhihu.com/p/654020142

Taking AutoGPT as an example, a classic use case is inputting a question to a large model: find an investment opportunity. Normally, an LLM cannot provide specific actions.

The idea behind AutoGPT is to first tell the LLM that the problem can generally be solved and present several options; the LLM then chooses a method, which could involve browsing Yahoo Finance/Google or reading a specific PDF file. Based on the chosen option, AutoGPT then executes the task, such as conducting a Google search or directly accessing a particular file, something the LLM cannot do on its own.

After AutoGPT completes these tasks, it keeps a record of the previous information and continues to ask the LLM for new solutions. This is a simple example of an AI agent.

Here, I will give a simple example to help you understand AI agents.

Imagine you are planning a comfortable trip and need to consider various factors such as destination selection, transportation, hotel reservations, and activities. An AI agent is like your travel advisor, helping you break down this complex task and providing thoughtful recommendations.

First, it can help you plan the destination based on your preferences and budget. Then, it can use external APIs to obtain additional information and functionalities, such as querying transportation schedules and fares or searching for hotel availability and prices.

The AI agent can also learn and remember your preferences and requirements through short-term and long-term memory. This allows it to provide personalized recommendations for your next trip.

Furthermore, the AI agent can offer effective solutions through self-reflection. If you are dissatisfied with previous travel experiences, the AI agent can learn from mistakes and avoid similar issues in future plans.

The AI agent acts as your travel consultant and assistant, utilizing planning, memory, tool use, and self-reflection to help you better plan and execute your trips, providing personalized solutions, and ensuring a pleasant travel experience.

  • Components of an AI Agent

According to Lilian Weng, an AI agent system should include several components, as shown in the following figure:

1. Planning:

In an AI agent, planning is at the core of prediction and decision-making. By combining LLM with optimized planning strategies, we can expect AI agents to handle complex situations and make long-term decisions, rather than merely providing short-term responses.

1.1 Task Decomposition:

Task decomposition explains how an AI agent approaches complex problems. Instead of blindly solving problems, it breaks them down into smaller, more manageable parts to find solutions more effectively.

1.2 Self-Reflection:

Self-reflection is crucial for continuous learning and improvement of AI agents. It means that the agent doesn’t just execute tasks but also reflects on its actions, understands its mistakes, and learns from them.

2. Memory:

For an AI agent, memory is not just about storing information. It is a dynamic system that helps the agent understand past events, predict future situations, and make decisions in new contexts. Compared to human memory, AI agents have the potential to be more precise, reliable, and targeted in information retrieval.

2.1 Types of Memory:

The article mentions different types of memory, which can be short-term or long-term. Each type serves specific applications and functions, enabling AI agents to store and retrieve information in different contexts.

2.2 Maximum Inner Product Search (MIPS):

MIPS is a technique for efficient information retrieval. In the context of AI agents, it means that when the agent needs to retrieve information from its memory, it can do so faster and more accurately, finding relevant information.

3. Tool Use:

The ability of an agent to use tools emphasizes that LLM is not isolated. It can interact with the external environment and use tools to enhance its capabilities. This provides great potential for AI agents to surpass their own limitations in specific tasks.

This article provides us with an opportunity to deeply understand how to build powerful AI agents using large language models. These AI agents possess complex planning, memory, and tool usage capabilities, enabling them to solve a wide range of complex problems.

This means that the AI agents of the future will resemble comprehensive and multifunctional intelligent assistants rather than mere text generators.

In the future, I will share more valuable insights. Follow me and let’s learn and grow together!

Article references:

  1. Lilian Weng’s blog: https://lilianweng.github.io/

GPT Agents: Transforming AI Interfaces

Article Link: What are GPT Agents? A deep dive into the AI interface of the future Author: Logan Kilpatrick, OpenAI Ambassador. Excerpts and translated main parts as personal notes from the original article.

This article aims to take you from “I know nothing about autonomous GPT agents, please stop using these words that I don’t understand” to having enough understanding on the topic to have thoughtful discussions with friends or online.

We will introduce what agents are, where the field is heading, and a plethora of exciting examples of this early technology in practice. This article reflects the author’s personal views and thoughts.

What are GPT Agents? (Basics)

Firstly, we need to define some terms, and then we will dive into more details. Although nobody likes memorizing vocabulary, having the right concepts in place is crucial to understanding some of the peculiar terms in the AI field.

GPT = generative pre-trained transformer. It is the core machine learning model architecture driving large language models (LLMs) like ChatGPT. GPT is a very technical term, but it gained wide recognition among the masses through ChatGPT.

Next, let’s take a look at what an agent is:

Agent = a large language model with defined goals or tasks that can be iterated upon. This is different from the “usual” use of large language models (LLMs) in tools like ChatGPT, where you ask a question and get a response. An agent has a more complex workflow, where the model essentially self-dialogues without human driving every part of the interaction.

The above definition helps us understand the relationship between ChatGPT and agents, and the different experiences they provide to users. ChatGPT takes a single input query and returns an output; it cannot handle more than one task at a time. However, with the introduction of plugins, this is changing: the model can perform multiple requests in one call by using external tools. Some might argue this is the first manifestation of “agents” within ChatGPT, since the model itself decides what to do and whether to send extra requests.

For those who haven’t tried plugins yet, the basic idea is that you can tell ChatGPT how an external tool API works, and it can write and execute code based on user queries sent to that API. For example, if you have a weather plugin, when a user asks “What is the temperature in New York?”, the model knows it can’t answer and looks at the plugins available to the user. Let’s say it sends a request and the API responds with an error message saying “New York is not a valid location, please use the full city name, not an abbreviation”, the model can actually read this error and send a new request to fix it. This is the simplest example of how agents work in today’s production workflow.
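The error-recovery behavior described above can be sketched with a mock weather API that rejects abbreviations; the agent reads the error text and retries with a corrected request. Both the API and the correction table are made up (a real agent would infer the fix from the error message itself):

```python
# Sketch of an agent reading an API error and retrying with a fix.
# The weather "API" and the abbreviation table are mock stand-ins.

def weather_api(location):
    if location == "NY":
        raise ValueError("NY is not a valid location, "
                         "please use the full city name, not an abbreviation")
    return f"72°F in {location}"

EXPANSIONS = {"NY": "New York"}   # what a real model would infer itself

def agent_call(location):
    try:
        return weather_api(location)
    except ValueError as err:
        # The agent "reads" the error and reformulates the request.
        if "full city name" in str(err):
            return weather_api(EXPANSIONS[location])
        raise

print(agent_call("NY"))  # → 72°F in New York
```

The key property is that the error message is treated as just another observation for the model to reason over, not as a terminal failure.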

Finally, I refer to agents as GPT agents because the term “agents” is used in many contexts and is often ambiguous; “GPT agents” makes the connection to GPT-style large language models explicit. You may hear people say agents, autonomous agents, or GPT agents, and they all refer to the same thing.

Beyond Basics

Some projects like AutoGPT and BabyAGI have popularized the concept of GPT agents, and these are some of the most well-known open-source projects being created. The concept of agents has truly sparked the imagination of developers, and people are racing to create tools and companies around this concept.

Here, I want to quickly mention that if you are a developer and want to build agent experiences, Langchain provides a great library and toolkit to help developers do it without starting from scratch [1].

To see agents in action, we can use this great Hugging Face space, which is an environment to run code online:

Baby AGI - a Hugging Face Space by NeuralInternet

Please be very cautious when pasting API keys on external websites. It’s worth creating a new key for experimentation and immediately deleting it to prevent leakage.

Let’s start with the goal of helping me learn programming:

You can see that the first step of babyAGI is to create a task list based on my goal, which breaks down “Teach me to program in Python” into tasks like:

  1. Install Python and become familiar with the Python interpreter
  2. Learn basic Python syntax and data types
  3. Understand control flow and decision-making
  4. Learn about functions and classes
  5. Learn about modules and packages

… and so on.
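BabyAGI’s core is three cooperating steps around a task queue: execute the top task, generate new tasks from the result, and reprioritize the queue. A sketch with stubbed "agents" (in the real project, each of the three steps is an LLM call, and the task texts below are invented):

```python
from collections import deque

# BabyAGI-style loop sketch: execution, task creation, and prioritization
# are stubbed; in the real project each step is an LLM call.

def execute(task):
    return f"notes on '{task}'"

def create_tasks(task, result):
    # Stub: one follow-up per task; "deep" tasks spawn nothing (caps depth).
    return [] if "deep" in task else [f"go deeper into {task}"]

def prioritize(tasks):
    return deque(sorted(tasks))   # stub: alphabetical "priority"

tasks = deque(["1. Install Python", "2. Learn basic syntax"])
completed = []
for _ in range(3):                # run a few cycles
    task = tasks.popleft()
    result = execute(task)        # execution agent
    completed.append(task)
    tasks.extend(create_tasks(task, result))   # task-creation agent
    tasks = prioritize(tasks)     # prioritization agent

print(completed)
```

Because new tasks re-enter the same queue, the loop runs indefinitely unless something (here the "deep" guard, in practice the goal check) stops task creation.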

Next, the model would generate some text to help me learn the first task. If you try this yourself, you might find the results a bit odd. For example, babyAGI skips the first step and does a “Hello World” program instead. I also acknowledge that the UI layer in the space might be abstracting some of what’s happening. I suggest experimenting here to see what else is possible. Running your first agent today is a great way to get exposure to this cutting-edge technology.

The Future of Agents

The idea of agents is not a passing fad; they are the first entities powered by general-purpose AI that can solve tasks on their own. As more powerful models and tools support them, they will become increasingly sophisticated. For example, you can imagine a simple customer service agent that takes someone’s question, iteratively breaks it down, solves it, and verifies the answers. Several key conditions are needed to achieve this:

  1. More powerful models; GPT-4 works well, but today’s use cases are still limited.
  2. Better tools; the space we looked at is an example that is super easy and useful, but there is room for improvement for real-world production use cases.
  3. Different architectures; as models evolve, breaking down goals into sub-tasks may no longer be the right design decision. Many methods like working backward from the desired end state could be equally effective.

In terms of tools, organizations like LangChain are launching products like LangSmith to help developers put these workflows into production [3].

The fact is, to make this generation of agents a reality, entirely new frameworks will emerge. It’s remarkable to think about how it all started with plugins and AutoGPT. I’m excited about the future and the ability to leverage world-class agents to help me accomplish the things I care about.

LLM-Powered Autonomous Agents

“If a paper proposes a different training method, we would sneer at it on OpenAI’s internal Slack, because those are things we have left behind. But when a new AI Agents paper comes out, we get excited and have serious discussions.”

Recently, Andrej Karpathy, co-founder of OpenAI, gave a brief speech at a developer event, discussing his and OpenAI’s views on AI Agents.

Andrej Karpathy compared the difficulties of developing AI Agents in the past with the new opportunities under new technological tools. He also made a self-deprecating joke about his work at Tesla, saying he was “distracted by autonomous driving” and that autonomous driving and VR are both bad examples of AI Agents.

On the other hand, Andrej Karpathy believes that ordinary people, entrepreneurs, and geeks have an advantage in building AI Agents compared to companies like OpenAI. Everyone is currently in a state of equal competition, so he is looking forward to seeing the results in this area. The complete video of Karpathy’s presentation is included at the end of the article.

On June 27, Lilian Weng, the director of applied research at OpenAI, published a lengthy article (with some sections drafted by ChatGPT). She proposed that Agent = LLM (Large Language Model) + Memory + Planning Skills + Tool Usage, explained each module of the Agent in detail, and finally expressed high expectations for the future applications of Agents while noting that challenges remain everywhere.

I have translated this lengthy article and added some of my own understanding and insights. Let’s take a look at what the experts have said! The original article link is included at the end of the article.

Building an Agent with an LLM (Large Language Model) as its core controller is a cool concept. Several proof-of-concept demos, such as AutoGPT, GPT-Engineer, and BabyAGI, are inspiring examples. The potential of LLMs is not limited to generating excellent copy (writing, storytelling, essays, and programs); they can be built into powerful general problem solvers.

Overview of the Agent System

In an LLM-driven autonomous agent system, LLM serves as the brain of the agent, complemented by several key components:

  • Planning

      • Sub-goals and decomposition: The agent breaks large tasks down into smaller, manageable sub-goals, enabling it to handle complex tasks effectively.

      • Reflection and improvement: The agent can self-critique and reflect on past actions, learn from mistakes, and refine future steps, thereby improving the quality of the final result.

  • Memory

      • Short-term memory: I consider all in-context learning (i.e., prompt engineering) as utilizing the model’s short-term memory.

      • Long-term memory: Long-term memory gives the agent the ability to store and recall (infinite) information over long periods, typically implemented with an external vector store and fast retrieval.

  • Tool Usage

      • The agent learns to invoke external APIs to obtain information missing from its model weights (which are hard to modify after pre-training), including up-to-date information, code-execution ability, access to proprietary information sources, and more.

Component 1: Planning

Complex tasks often involve many steps. The agent needs to know what the specific task is and start planning ahead.

Task Decomposition

“Chain of Thought” (CoT) has become a standard prompting technique to enhance the model’s performance on complex tasks. It instructs the model to “think step by step” to utilize more thinking time to break down challenging tasks into smaller, simpler steps. CoT transforms major tasks into multiple manageable tasks and focuses on the interpretability of the model’s thinking process.

“Tree of Thoughts” expands on CoT by exploring multiple reasoning possibilities for each step. It first decomposes the problem into multiple steps of thinking, and each step generates multiple ideas, creating a tree-like structure. The search process of the tree can be either BFS (breadth-first search) or DFS (depth-first search), and each state is determined by a classifier (through a prompt) or majority voting.
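A minimal sketch of this tree search, assuming hypothetical `propose_thoughts` and `score_thought` stubs in place of the LLM calls that generate and evaluate candidate thoughts:

```python
# Tree-of-Thoughts as a beam-limited BFS over partial "thought" sequences.
# `propose_thoughts` and `score_thought` are hypothetical stand-ins for
# LLM calls; a real system would prompt the model at each step.

def propose_thoughts(state, k=3):
    # Generate k candidate next thoughts for the current partial solution.
    return [state + [f"thought-{len(state)}-{i}"] for i in range(k)]

def score_thought(state):
    # Stand-in for a classifier prompt or majority vote over the state.
    return -len(state[-1])

def tot_bfs(root, depth=2, beam=2):
    frontier = [root]
    for _ in range(depth):
        candidates = []
        for state in frontier:
            candidates.extend(propose_thoughts(state))
        # Keep only the `beam` highest-scoring states at each level.
        frontier = sorted(candidates, key=score_thought, reverse=True)[:beam]
    return frontier

best = tot_bfs([])
```

A DFS variant would instead recurse into one branch at a time and backtrack when a state scores below a threshold.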

Task decomposition can be done in three ways:

(1) Using simple prompts to decompose with LLM, for example: “Steps for XYZ”, “What are the sub-goals to achieve XYZ?”.

(2) Using specific instructions for a particular task, for example: “Write a story outline.” for writing a novel.

(3) Humans themselves decompose the task.
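Method (1) might look like the following sketch, where `llm` is a hypothetical stand-in for a chat-completion call and the canned reply is illustrative:

```python
# Decomposing a task by simply prompting the LLM for sub-goals.
# `llm` is a stub; a real implementation would call a model API here.

def llm(prompt):
    return "1. Research ChatGPT\n2. Draft outline\n3. Write post"

def decompose(task):
    prompt = f"What are the sub-goals to achieve the following?\n{task}"
    reply = llm(prompt)
    # Parse a numbered list such as "1. ..." into plain sub-task strings.
    return [line.split(". ", 1)[1] for line in reply.splitlines() if ". " in line]

subtasks = decompose("Write a 1500-word blog post about ChatGPT")
```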

Another completely different approach is LLM+P, which involves relying on external classical planners for long-term planning. This approach utilizes a planning domain definition language (PDDL) as an intermediary interface to describe planning problems. In this process, LLM (1) transforms the problem into a “problem PDDL”, and then (2) requests the classical planner to generate a PDDL plan based on the existing “domain PDDL”, and finally (3) transforms the PDDL plan back into natural language. Essentially, the planning steps are outsourced to external tools, assuming that the specific domain’s PDDL and appropriate planners are available, which is common in some robot setups but not so common in many other domains.


Self-Reflection

Self-reflection is a vital capability that allows autonomous agents to improve iteratively by refining past action decisions and correcting previous mistakes. It plays a crucial role in real-world tasks where trial and error are inevitable.

“ReAct” integrates reasoning and action into LLM by expanding the action space into a combination of discrete actions specific to the task and language space. The former allows LLM to interact with the environment (e.g., using the Wikipedia search API), and the latter encourages LLM to generate a reasoning trajectory in natural language.

ReAct prompt templates include explicit steps of LLM’s thinking and follow a rough format of:

Thought: ...
Action: ...
Observation: ...
... (Repeated many times)

In two experiments on knowledge-intensive tasks and decision tasks, ReAct outperformed the baseline model that only included actions, where the baseline model removed the “Thought: …” steps.
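The Thought/Action/Observation loop above can be sketched as follows; the scripted model, the `search` tool, and the `finish` convention are all illustrative assumptions rather than the paper’s actual interface:

```python
# A minimal ReAct-style loop: the model appends Thought/Action lines to a
# transcript; actions are dispatched to tools and their observations are
# appended back, until the model emits a "finish" action.

def run_react(llm, tools, question, max_steps=5):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)           # the model continues the transcript
        transcript += step + "\n"
        if step.startswith("Action: finish"):
            return step.split("finish", 1)[1].strip(" []"), transcript
        if step.startswith("Action: "):
            name, _, arg = step[len("Action: "):].partition(" ")
            obs = tools.get(name, lambda a: "unknown tool")(arg)
            transcript += f"Observation: {obs}\n"
    return None, transcript

# Scripted demo "model": think, search, then finish with the observation.
script = iter([
    "Thought: I should look this up.",
    "Action: search capital of France",
    "Action: finish [Paris]",
])
answer, _ = run_react(lambda t: next(script), {"search": lambda q: "Paris"},
                      "What is the capital of France?")
```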

“Reflexion” is a framework that equips the agent with dynamic memory and self-reflection capabilities to enhance its reasoning skills. Reflexion adopts a standard reinforcement-learning setup, where the reward model provides a simple binary reward and the action space follows the ReAct setting. After each action, the agent computes a heuristic value ht and, based on the result of self-reflection, decides whether to reset the environment and start a new trial.

The heuristic function judges when the LLM’s action trajectory has become inefficient or contains hallucinations, at which point the task is stopped. Inefficient planning means spending a lot of time without reaching a successful path. Hallucination is defined as the LLM producing a series of consecutive identical actions that lead to the same observation from the environment.
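A toy version of such a stopping heuristic, with illustrative thresholds rather than the paper’s actual values:

```python
# Stop a trajectory when the same (action, observation) pair repeats
# consecutively ("hallucination"), or when it runs too long without
# success ("inefficient planning").

def should_stop(trajectory, max_steps=20, repeat_limit=3):
    if len(trajectory) >= max_steps:
        return "inefficient"
    repeats = 1
    for prev, cur in zip(trajectory, trajectory[1:]):
        repeats = repeats + 1 if cur == prev else 1
        if repeats >= repeat_limit:
            return "hallucination"
    return None

# Three identical search steps in a row trip the hallucination check.
traj = [("search", "no result")] * 3
verdict = should_stop(traj)
```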

Self-reflection is created by showing the LLM two examples, each a pair of (failed trajectory, ideal reflection for guiding future plan changes). The reflections are then added to the agent’s working memory (up to three of them) to serve as context when querying the LLM.

“Chain of Hindsight” (CoH) encourages the model to improve its own outputs by presenting it with a sequence of past outputs, each annotated with feedback. To prevent overfitting, CoH adds a regularization term to maximize the log-likelihood of the pre-training dataset. To prevent shortcut-taking and copying (since the feedback sequences contain many common words), they randomly mask 0%-5% of historical tokens during training.

The training dataset in the experiments is a combination of WebGPT contrasts, human feedback summaries, and human preference datasets.

CoH’s idea is to present a history of sequentially improved outputs in context and train the model to follow the trend and produce better outputs. Algorithm Distillation (AD) applies the same idea to cross-episode trajectories in reinforcement-learning tasks, where the learning algorithm is encapsulated in a long history-conditioned policy. Since the agent interacts with the environment many times and performs slightly better in each successive episode, AD concatenates this learning history and feeds it into the model, so the next predicted action is expected to outperform the previous trials. The goal is to learn the process of reinforcement learning itself, rather than to train a policy for one specific task.

The paper assumes that any algorithm that generates a series of learned history data can be distilled into a neural network by performing clone behavior on actions. The history data is generated by a set of source policies, each of which is trained for a specific task. During the training stage, each run of reinforcement learning randomly selects a task and trains using a subsequence of multiple historical records, making the learned policy independent of the task.

In practice, the length of the model’s context window is limited, so short episodes should be used to facilitate the construction of multi-episode history. To learn an almost optimal context reinforcement learning algorithm, multiple-episode histories of 2-4 episodes are needed. Context reinforcement learning often requires a sufficiently long context.

AD is compared with three baselines: ED (expert distillation, behavior cloning on expert trajectories rather than learning histories), the source policies (used to generate the trajectories distilled with UCB), and RL^2 (used as an upper bound, since it requires online reinforcement learning). Although AD uses only offline reinforcement learning, it demonstrates in-context reinforcement learning with performance close to RL^2, and it learns much faster than the other baselines. When conditioned on partial training history from the source policies, AD also improves much faster than the ED baseline.

Component 2: Memory

Thanks to ChatGPT for helping me with drafting this section. In conversations with ChatGPT, I learned a lot about human memory and fast MIPS data structures.

Types of Memory

Memory can be defined as the process of acquiring, storing, retaining, and subsequently retrieving information. There are multiple types of memory in the human brain.

  1. Sensory memory: This is the earliest stage of memory that provides the ability to retain sensory information (visual, auditory, etc.) after the original stimulus has ended. Sensory memory usually lasts only a few seconds. Subcategories of sensory memory include iconic memory (visual), echoic memory (auditory), and haptic memory (touch).

  2. Short-term memory, or working memory: It stores the information we are currently aware of and needed for cognitive tasks such as learning and reasoning. Short-term memory is believed to have a capacity of approximately 7 items (Miller 1956) and lasts for 20-30 seconds.

  3. Long-term memory: Long-term memory can store information for a significantly longer period, ranging from a few days to decades, with virtually unlimited storage capacity. LTM has two subtypes:

  • Declarative or explicit memory: This refers to memory for facts and events that can be consciously recalled, including episodic memory (events and experiences) and semantic memory (facts and concepts).
  • Procedural or implicit memory: This type of memory is unconscious and involves skills and procedures that are performed automatically, such as riding a bicycle or typing on a keyboard.

We can roughly consider the following mappings:

  • Sensory memory corresponds to learning embedding representations of raw inputs, including text, images, or other modalities;

  • Short-term memory corresponds to in-context learning. It is transient and limited, constrained by the Transformer’s finite context window.

  • Long-term memory serves as an external vector storage that the agent can attend to during queries and access through fast retrieval.

Maximum Inner Product Search (MIPS)

External memory can alleviate the limited breadth of attention. A common approach is to store the embedding representations of information into a vector storage database that supports fast Maximum Inner Product Search (MIPS). To optimize retrieval speed, common choices are approximate nearest neighbors (ANN) algorithms, which return approximately the top-k nearest neighbors with a slight sacrifice in accuracy for a tremendous speedup.
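For intuition, exact MIPS is just a matrix product followed by a top-k selection; the ANN algorithms below approximate this step to avoid scanning every stored vector:

```python
# Exact top-k inner-product search in NumPy. Real systems replace the
# full matrix product with an ANN index (LSH, HNSW, FAISS, etc.) to
# trade a little accuracy for a large speedup.
import numpy as np

def mips_topk(query, memory, k=2):
    scores = memory @ query          # inner product against every stored vector
    idx = np.argsort(-scores)[:k]    # indices of the k largest scores
    return idx, scores[idx]

memory = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
idx, scores = mips_topk(np.array([1.0, 0.2]), memory)
```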

The following are some common ANN algorithms that can be used for MIPS:

Locality-Sensitive Hashing (LSH) introduces a family of hash functions that map similar inputs to the same bucket with higher probability, where the number of buckets is significantly smaller than the number of inputs.
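A sketch of the random-hyperplane variant of LSH, which suits cosine/inner-product similarity; the number of hash bits and dimensions here are arbitrary:

```python
# Random-hyperplane LSH: each hash bit records which side of a random
# hyperplane a vector falls on, so vectors pointing in similar directions
# land in the same bucket with high probability.
import numpy as np

rng = np.random.default_rng(0)
planes = rng.standard_normal((8, 3))   # 8 random hyperplanes for 3-d vectors

def lsh_bucket(v):
    bits = (planes @ v > 0).astype(int)
    return "".join(map(str, bits))

a = np.array([1.0, 0.0, 0.0])
same = lsh_bucket(a) == lsh_bucket(2.0 * a)   # same direction -> same bucket
```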

Approximate Nearest Neighbors (ANNOY) has a core data structure of random projection trees, which is a set of binary trees where each non-leaf node represents a hyperplane that splits the input space into two halves, and each leaf node stores a data point. The binary trees are constructed independently and randomly, so in a sense, they behave like hash functions. ANNOY iterates through all trees to search for the nearest neighbors, then aggregates the results iteratively. This idea is closely related to KD trees but more scalable.

Hierarchical Navigable Small World (HNSW) is inspired by the idea of small-world networks, where most nodes can be reached from any other node in a few steps, such as the “six degrees of separation” theory in social networks. HNSW builds a hierarchy of these small-world graphs, with the bottom-level structure containing the actual data. Middle levels create shortcuts to speed up search. When performing a search, HNSW starts from a random node in the top level and navigates to the target. When it cannot get closer, it moves down to the next level until it reaches the bottom level. Each move in the upper levels can cover a long distance in the data space, while each move in the lower levels can refine the search quality.

Facebook AI Similarity Search (FAISS) assumes that distances between points in a high-dimensional space follow a Gaussian distribution, and therefore there are clustering points. FAISS divides the vector space into clusters and uses vector quantization within each cluster. FAISS uses coarse quantization first to find candidate clusters, then uses finer quantization for each candidate cluster.

The primary innovation of Scalable Nearest Neighbors (ScaNN) is anisotropic vector quantization. It quantizes each data point to a vector such that the inner product between a query and the quantized point closely approximates the original inner product, rather than simply choosing the closest quantization centroid.

Component 3: Tool Usage

The ability to use tools is one of the most significant and unique aspects of humans. We create, modify, and leverage external things to accomplish things beyond our physical and cognitive limits. Similarly, we can equip LLMs with external tools to significantly extend the model’s capabilities.

“MRKL” stands for “Modular Reasoning, Knowledge, and Language” and is a neural-symbolic architecture for autonomous agents. The MRKL system consists of a set of “expert” modules, while a general-purpose LLM acts as a router to direct queries to the most appropriate expert module. These modules can be neural modules (like deep learning models) or symbolic modules (like a math calculator, currency converter, weather API, etc.).

The research team of MRKL conducted experiments with fine-tuning LLMs to invoke a calculator as an example, using arithmetic as a test case. The results of the experiment showed that solving verbal math problems is more challenging than explicit math problems because LLMs (7B Jurassic1-large model) cannot reliably extract the correct parameters for basic arithmetic. The experiment results also emphasized the importance of knowing when and how to use external symbolic tools when they are reliably functional, which depends on LLM’s capabilities.

“TALM” (Tool-Augmented Language Model) and “Toolformer” both learn to use external tool APIs by fine-tuning an LM. The dataset was expanded based on whether the addition of such API lookups improves the quality of model outputs.

Both ChatGPT plugins and OpenAI API function calls are examples of LLMs with tool usage capabilities in practice. The collection of tool APIs can be provided by other developers (as in the case of plugins) or customized (as in the case of function calls).
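The dispatch pattern behind function calling can be sketched as follows; the tool names and the stubbed `choose_tool` are illustrative, not a real OpenAI API:

```python
# Tool use via function calling: the model (stubbed here) emits a tool
# name plus arguments as JSON, and the runtime dispatches the call.
import json

TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
    "weather": lambda city: f"20°C in {city}",
}

def choose_tool(question):
    # A real system would ask the LLM to emit this JSON.
    return json.dumps({"tool": "calculator", "args": "2 * 21"})

def answer(question):
    call = json.loads(choose_tool(question))
    return TOOLS[call["tool"]](call["args"])

result = answer("What is 2 times 21?")
```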

HuggingGPT is a framework that uses ChatGPT as a task planner: it selects models available on the HuggingFace platform based on each model’s description and generates the final response from a summary of the models’ execution results.

The system consists of four stages:

(1) Task Planning: LLM acts as the brain to parse user requests into multiple tasks. Each task has four properties: task type, ID, dependencies, and parameters. They guide LLM to perform task parsing and planning with a few examples.

Specific instruction as follows:

The AI assistant can parse user input into multiple tasks: [{“task”: task, “id”: task_id, “dep”: dependency_task_ids, “args”: {“text”: text, “image”: URL, “audio”: URL, “video”: URL}}]. The “dep” field indicates the ID of the previous task that generated the new resource on which the current task depends. The special token “-task_id” refers to the text, image, audio, or video generated by the dependent task with ID task_id. Tasks must be chosen from the following options: {{list of available tasks}}. There are logical relationships between tasks, please note their order. If the user input cannot be parsed, an empty JSON response is required. Here are a few examples for your reference: {{demos list}}. The conversation history is recorded as {{chat_history}}. With this chat history, you can find the path to resources mentioned by the user for task planning.

(2) Model Selection: LLM assigns tasks to expert models, where the request is transformed into multiple-choice questions. LLM provides a list of models to choose from. Due to the limited context length, filtering based on task type is required.

Specific instruction as follows:

Based on the user request and call command to select a suitable model from a model list to handle the user request. The AI assistant only outputs the ID of the most suitable model. The output must be in strict JSON format: “id”: “id”, “reason”: “Detailed reasons for selecting the model.” We provide you with a model list {{candidate_model_list}}. Please select one model from the list.

(3) Task Execution: Expert models perform on specific tasks and record the results.

Specific instruction as follows:

Based on the input and reasoning result, the AI assistant needs to describe the process and result. Begin the reply by simply answering the user request in a clear way. Then describe the task procedure in the first person, showcasing your analysis and model’s reasoning results. If the reasoning result includes file paths, you must tell the user the complete file path.

(4) Response Generation: LLM receives the execution result and provides a summary result to the user.

To apply HuggingGPT in practical scenarios, there are several challenges to be addressed: (1) efficiency needs to be improved as rounds of LLM inference and interactions with other models slow down the process; (2) it relies on long context windows to convey complex task content; (3) improvement in LLM output and stability of external model services is needed.

My Thoughts

After OpenAI released ChatGPT, a game-changing product comparable to the iPhone, it did not stop there. OpenAI wants to go further and become the Apple of the AI era. Its earlier release of ChatGPT plugins attracted a lot of attention.


What are GPT Agents (English translation)

If you haven’t heard of Agents, then you must have heard of AutoGPT or BabyAGI, right?

If you haven’t heard of them either, then keep reading…

This article is translated from a foreign expert’s article. I have made many changes to make it more accessible and easy to understand. I am also a beginner in Agents, so please forgive any mistakes in the article and feel free to point them out.

1. What are GPT Agents?

Let’s first clarify a few professional terms.

GPT: It stands for Generative Pre-trained Transformer, which is the core machine learning model architecture of ChatGPT. Although it is a very technical term, many people have heard of it with the popularity of ChatGPT.

Next, let’s see what an Agent is:

Agent: An agent is built on a large language model with common sense and reasoning abilities and iteratively performs tasks toward an objective. The differences from ChatGPT:

  • ChatGPT - You ask a question, and it answers; it’s like having a conversation.
  • Agent - The workflow is more complex: the model interacts with itself, and human intervention is not required at each step.

The above explanations help us understand the relationship between ChatGPT and Agent, as well as the different experiences they provide to users.

The question-and-answer format of ChatGPT has undergone some changes since the launch of the ChatGPT plugin: the model can make use of external tools for up to 10 requests.

It can be said that this is the first embodiment of the “Agent” idea in ChatGPT, as the model is deciding what to do and whether to send additional requests.

Considering that some people haven’t tried the ChatGPT plugin, let me give you an example:

  1. If you have a weather plugin installed, when you ask ChatGPT, “What is the temperature in Beijing?”, ChatGPT knows it can’t answer on its own and looks at the plugins you have installed.
  2. ChatGPT then sends a request to the plugin and receives an error message: “Please use the Chinese name of the city.”
  3. ChatGPT understands the error message, corrects its request using the Chinese city name (“北京”), and sends it again.

This is the simplest example of how an Agent works.
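That retry behavior can be sketched in a few lines; the weather “plugin” and its error message here are illustrative stand-ins, not a real plugin API:

```python
# The agent calls a tool, reads the error message, corrects its own
# request, and retries - the core loop of the plugin example above.

def weather_plugin(city):
    if city != "北京":                 # this plugin wants the Chinese name
        raise ValueError("Please use the Chinese name of the city.")
    return "3°C"

def ask_weather(city, fix=lambda c: "北京"):
    try:
        return weather_plugin(city)
    except ValueError:
        # The agent "reads" the error and retries with a corrected request.
        return weather_plugin(fix(city))

temp = ask_weather("Beijing")
```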

To add, in this article, we refer to Agents as GPT Agents, just to emphasize the concept in relation to ChatGPT and AI.

You may hear people talking about intelligent agents, autonomous agents, or GPT agents, but they all refer to the same thing.

2. Workflow of Agents

Some popular GPT Agent projects, such as AutoGPT and BabyAGI, are among the most popular open-source projects ever. The idea of Agents has truly inspired developers, and people are rushing to start businesses or develop applications based on this idea.

By the way, if you are a developer and want to develop Agents, LangChain has a great library and set of tools that can help you build everything from scratch.

The following image shows the workflow of BabyAGI, a popular GPT Agent project:

Let me explain the image: After creating a main task for the Agent, the workflow is divided into three main steps:

Step 1: Get the first unfinished task and send it to the executing Agent, which uses the OpenAI API to complete the task based on the context.

Step 2: Enrich the results and store them in a vector database.

Step 3: Create new tasks based on the goals and the results of the previous task, and sort the task list based on priority.
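The three steps can be condensed into a minimal loop; `execute`, `create_tasks`, and `prioritize` are stand-ins for the LLM calls (and the vector-store write) in the real project:

```python
# A condensed sketch of the BabyAGI loop: execute the next task, store
# the result, then create and re-prioritize new tasks.
from collections import deque

def execute(task, context):
    return f"result of: {task}"          # stand-in for an OpenAI API call

def create_tasks(objective, task, result):
    # The real agent prompts the LLM with the objective and last result.
    return [f"follow-up to: {task}"] if "follow-up" not in task else []

def prioritize(tasks):
    return deque(sorted(tasks))          # stand-in for LLM re-ranking

def run(objective, first_task, max_loops=3):
    tasks, memory = deque([first_task]), []
    for _ in range(max_loops):
        if not tasks:
            break
        task = tasks.popleft()                       # Step 1: execute
        result = execute(task, memory)
        memory.append(result)                        # Step 2: store
        tasks.extend(create_tasks(objective, task, result))
        tasks = prioritize(tasks)                    # Step 3: re-prioritize
    return memory

log = run("write a blog post", "draft an outline")
```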

Let’s talk about the first step with a specific example.

Suppose we have a task: “Write a 1500-word blog post introducing ChatGPT and its features.” As the user controlling the Agent, all you need to do is write out this task while providing detailed requirements.

The model will process these requirements and perform the following operation:

sub_tasks = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",  # any chat model works here
    messages=[
        {"role": "system", "content": "You are a world-class assistant who can help people complete tasks."},
        {"role": "user", "content": "Write a 1500-word blog post introducing ChatGPT and its features."},
        {"role": "user", "content": "Decompose it into easily manageable sub-tasks based on user requests."},
    ])

In the above code:

  • We use the OpenAI API to drive the Agent.
  • The first message is a system message that allows you to define the Agent to some extent, but in this example, we didn’t do much.
  • The second message is self-explanatory.
  • The third message is the key, which adds a task on top of it: decomposing it into sub-tasks.

Then, you can put the sub-tasks in a loop and make additional calls to execute them. Each sub-task has different system messages (defining a writing Agent, defining a research Agent, etc.).
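That loop might look like the following sketch, where `call_model` stands in for the OpenAI API call and the role assignments are illustrative:

```python
# Each sub-task is sent back to the model with a role-specific system
# message (a writing agent, a research agent, and so on).

def call_model(system, user):
    return f"[{system}] {user}"          # stub for a chat-completion call

ROLES = {"research": "You are a research agent.",
         "write": "You are a writing agent."}

def run_subtasks(sub_tasks):
    outputs = []
    for kind, task in sub_tasks:
        outputs.append(call_model(ROLES[kind], task))
    return outputs

outputs = run_subtasks([("research", "Collect facts about ChatGPT"),
                        ("write", "Draft the introduction")])
```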

It is clear that these functionalities require a large number of OpenAI API requests, which means that the Agent will incur more costs, so be careful when operating.

By now, we have understood the first step of building an Agent. Apart from that, let me briefly explain other parts of the BabyAGI flowchart:

  • “Enriching the results” in Step 2: It takes the form of an automatic prompting process that makes the model more specific and detailed in its tasks.
  • “Storing the results in a vector database” is very useful for users to track all the steps of the model throughout the process.
  • The last concept in the workflow is the “priority of the task list,” which ensures that the model works in the order that is most conducive to completing the tasks.

3. Practical Applications of Agents

First, take a look at the image below, which lists some companies and projects related to this field.

In the second part of the image, projects like AutoGPT and BabyAGI have universality and are used to decompose any task into a format suitable for the Agents' workflow.

AutoGPT is a very popular project, but they later discontinued their web application, and now it must be run locally.

To experience the practical application of Agents, we can open this website from Hugging Face. It is an environment for running code online for BabyAGI:


Let’s take “Teach me how to write Python” as the goal:

You can see that the first step of BabyAGI is to create a task list based on my goal, decomposing “Teach me how to write Python” into the following tasks:

  1. Install Python and familiarize yourself with the Python interpreter.
  2. Learn basic Python syntax and data types.
  3. Understand control flow and decision-making.
  4. Learn functions and classes.
  5. Understand modules and packages.
  6. ……

The next step is for the model to generate some text to help me learn the first task.

I suggest you try this Hugging Face space personally, run your first Agent, and get an intuitive feel for this technological concept.

A reminder, when running the Agent here, you need to fill in your OpenAI API KEY.

4. The Future of Agents

The concept of Agents will not disappear. In the future, they can perform more complex tasks and be supported by more powerful models and better tools:

  • More powerful models: GPT-4 performs well, but its application scenarios are still limited.
  • Better tools: The BabyAGI space mentioned above is a simple and practical example, but there is still much room for improvement for real-world scenarios.
  • Different architectures: As models develop, decomposing goals into sub-tasks may no longer be the only solution. There are many methods, such as working backward from the final state, which may be equally effective.

In the future, Agents may indeed have some human-like memory, the ability to automatically execute tasks using tools, and even decision-making abilities.

Agents are considered the next direction of OpenAI’s efforts. The world’s leading large-scale model companies have realized that big models themselves cannot solve all problems. There is a need to develop a new form on top of them -

The form they have found is Agents.

If you find this article informative, please like and follow @四猿外.