Unmasking AI Agents: The Basic Formula Powering the Hype

[Image: An agent hand morphed from structured text]

Agents have been all the rage on tech Twitter over the past few months. The startup Cognition promises to make software developers obsolete, while 11x intends to do the same to Sales Development Reps, typically the most junior role in a sales organization. If you sighed with relief because those didn’t threaten your job, check out the other 71 AI assistant startups Y Combinator has funded in recent years.

These companies and their investors all have an incentive to sell agents as complex, almost mystical technological systems. But behind each of them lies a simple underlying idea. Let’s uncover what an AI agent actually is and how it works.

OpenAI ushered in the current LLM hype era with the release of its GPT family of models. The basic idea of an LLM is to predict the next token in a sequence of tokens. To simplify, we can treat a token as roughly the next word following a preceding text. The interaction with an LLM can then be described as a function that takes text as input and returns text.
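In code, that interface amounts to a single function from string to string. A minimal sketch, where call_llm is a hypothetical wrapper whose body stands in for whichever model or API you actually use:

def call_llm(prompt: str) -> str:
    # Hypothetical wrapper: send the prompt to a model or API of your choice
    # and return the generated text.
    ...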

Some early applications of this technology seemed obvious. We could prompt a model with the start of a text, and it could complete it. We could also simulate a conversation by feeding previous messages to the LLM for it to complete a new message.
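A chat interface, for example, is just that same text-in, text-out function with the conversation so far packed into the prompt. A rough sketch, reusing the hypothetical call_llm from above:

# Simulate a conversation by replaying the history before each new turn.
history = [
    "User: What is the capital of France?",
    "Assistant: Paris.",
    "User: And how many people live there?",
]
prompt = "\n".join(history) + "\nAssistant:"
reply = call_llm(prompt)  # the model completes the next assistant message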

But pretty quickly, we realized that pure text generation by LLMs can be limiting. For one, LLMs hallucinate text that is not factually correct. In some artistic domains, that might be desirable, but for most business applications this is clearly a bug. While you could feed additional information to the LLM in the conversation, it wasn’t bound to stick to your guidance or return data in a specific format.

LLMs are not only great at generating text; they are also excellent at producing code. So we discovered we could ask the LLM to produce structured data using existing standards like JSON or XML. This opened up a whole new world of possibilities. Our LLM function had evolved from a purely text-based interface to one that returns structured data.
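A hedged sketch of that shift, reusing the hypothetical call_llm from above together with Python’s standard json parser:

import json

def call_llm_structured(prompt: str) -> dict:
    # Ask the model to answer in JSON, then parse the reply into a Python dict.
    reply = call_llm(prompt + "\n\nRespond only with valid JSON.")
    return json.loads(reply)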

Now we could integrate an LLM response much more tightly into existing applications. The use cases expanded from pure text generation to surfacing richer insights, such as reports, to users. We at Arro use this technique to extract rich qualitative analysis from automated user interviews.

So, do we have an agent? Not quite. According to Stuart Russell and Peter Norvig in their seminal book Artificial Intelligence: A Modern Approach, an agent is

Anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators

This definition is so broad it might as well include Microsoft’s Clippy assistant. Later, Kaplan and Haenlein argued that an agent would also need to learn from external data to achieve specific goals and tasks.1

Current LLMs have two limitations that keep them from being true agents. First, each call to an LLM is fundamentally stateless. The model does not change its weights in response to an inference call. The only time the model adapts is during training on predefined data.

Today, the problem of LLM memory is addressed in two ways. One is to extend the context windows of LLMs so they can fit ever larger amounts of data. This data can include not only contextual information but also a history of previous conversations and decisions. Google’s Gemini models have pushed the boundaries of what’s possible here: the Gemini 1.5 Pro model boasts a 2 million token context window that can easily fit 10 novels. But before you go out and build that Harry Potter fantasy bot, consider that the quality of responses drawn from such long contexts is debated. Analyses have shown that models weight the beginning and end of the context far more heavily than the middle when producing an answer. So far, large contexts are helpful in many cases but not a panacea for model memory.

Another approach to providing memory for LLMs has been Retrieval Augmented Generation (RAG). It is a technique in which context is stored as vectors in a database. That context can be searched, and the results appended to the prompt for the LLM. RAG can be very effective at providing factual answers and is scalable. It is currently the closest we have come to giving an LLM a memory of its interactions at inference time.
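A minimal sketch of the idea. The embed function below is a crude stand-in for a real embedding model, and a plain Python list plays the role of the vector database; both are assumptions for illustration only, with numpy used just for the vector math:

import numpy as np

def embed(text: str) -> np.ndarray:
    # Stand-in for a real embedding model: a normalized character-frequency vector.
    vec = np.zeros(128)
    for ch in text.lower():
        vec[ord(ch) % 128] += 1
    return vec / (np.linalg.norm(vec) + 1e-9)

documents = [
    "Orders can be cancelled within 30 days from the account page.",
    "Standard shipping takes 3-5 business days.",
]
index = [(doc, embed(doc)) for doc in documents]  # our tiny "vector database"

def retrieve(query: str, k: int = 1) -> list[str]:
    # Return the k documents whose vectors are most similar to the query vector.
    q = embed(query)
    ranked = sorted(index, key=lambda pair: float(q @ pair[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

question = "How can I cancel my order?"
context = "\n".join(retrieve(question))
answer = call_llm(f"Context:\n{context}\n\nQuestion: {question}")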

The second limitation is that our LLMs cannot truly act in the digital world. Their only way to express themselves is to return text. But as we learned with structured data, we can ask the LLM to format that text in a specific way. It only takes a small leap of imagination to get to acting agents: if we ask the LLM to pick from a selection of actions it can take in the real world (often called tools or functions), we enable it to act.

A simple prompt that exemplifies this concept might look like this:

You are a world-class customer support agent.

A customer has asked the following question:

"How can I cancel my order?"

Respond in valid JSON in one of the following ways:

[
	{
		"action_title": "search_knowledge_base",
		"arguments": {
			"search_query": {
				"type": "string"
			}
		}
	},
	{
		"action_title": "respond",
		"arguments": {
			"message_to_customer": {
				"type": "string"
			}
		}
	},
	{
		"action_title": "create_ticket_for_human_agent",
		"arguments": {
			"ticket_title": {
				"type": "string"
			},
			"customer_id": {
				"type": "string"
			}
		}
	}
]

In this case, our LLM would hopefully respond with the first option and search the knowledge base. In the next call to the model, we would include the context we received from the knowledge base in the prompt, and it would then likely be able to answer the question.

At its simplest, we have an agent!
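A hedged sketch of that loop, reusing the hypothetical call_llm_structured helper from above, with stubbed-out tools and the assumption that the model returns filled-in argument values rather than the schema itself. A real system would add validation, error handling, and retries:

def search_knowledge_base(search_query: str) -> str:
    # Stand-in for a real knowledge base lookup.
    return "Orders can be cancelled within 30 days from the account page."

def create_ticket_for_human_agent(ticket_title: str, customer_id: str) -> str:
    # Stand-in for a real ticketing system.
    return f"Created ticket '{ticket_title}' for customer {customer_id}."

TOOLS = {
    "search_knowledge_base": search_knowledge_base,
    "create_ticket_for_human_agent": create_ticket_for_human_agent,
}

def run_agent(question: str, max_steps: int = 5) -> str:
    context = ""
    for _ in range(max_steps):
        action = call_llm_structured(
            "You are a world-class customer support agent.\n"
            f"Context so far:\n{context}\n"
            f"Customer question: {question}\n"
            "Respond in valid JSON with an action_title and arguments."
        )
        if action["action_title"] == "respond":
            return action["arguments"]["message_to_customer"]
        # Otherwise call the chosen tool and feed its result back into the context.
        tool = TOOLS[action["action_title"]]
        context += str(tool(**action["arguments"])) + "\n"
    return "Escalating to a human agent."

answer = run_agent("How can I cancel my order?")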

More complex systems include many different tools with complex parameters. They are also often split into several agents with different responsibilities. This simplifies prompting, but it also lets a system mix many different models. A small, cheap model like Claude 3 Haiku might be excellent at planning the next step in a conversation with an AI lawyer, but might struggle to produce the final output, which could require a more powerful or even fine-tuned model.
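One hedged sketch of such a split, assuming two hypothetical wrappers, call_cheap_llm and call_strong_llm, around a small and a large model respectively, and reusing the stubbed search_knowledge_base tool from above:

def call_cheap_llm(prompt: str) -> str:
    ...  # hypothetical wrapper around a small, inexpensive model

def call_strong_llm(prompt: str) -> str:
    ...  # hypothetical wrapper around a larger, more capable model

def answer_question(question: str, history: str) -> str:
    # A cheap "planner" agent decides what to do next...
    plan = call_cheap_llm(
        f"Conversation so far:\n{history}\n"
        f"New question: {question}\n"
        "Reply with either 'answer_directly' or 'research_first'."
    )
    if plan == "research_first":
        history += search_knowledge_base(question) + "\n"
    # ...and a stronger model produces the user-facing answer.
    return call_strong_llm(f"Context:\n{history}\nAnswer the question: {question}")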

Agent systems are not trivial, and they solve real problems in business and personal workflows. But now you understand how they work behind the scenes and how you can start building one yourself.

Footnotes

  1. Kaplan, Andreas; Haenlein, Michael (1 January 2019). “Siri, Siri, in my hand: Who’s the fairest in the land? On the interpretations, illustrations, and implications of artificial intelligence”. Business Horizons.
