Key Takeaways
1. AI Agents Evolve Beyond Simple Chatbots
In this book's journey to build powerful agents, the word agent follows its dictionary definition: one that acts, exerts power, or produces an effect. That also means the term assistant will be treated as synonymous with agent.
Defining agency. AI agents are more than just conversational interfaces; they are entities designed to act, exert power, and produce effects on behalf of a guiding intelligence. This broad definition encompasses various forms, from simple assistants to complex autonomous systems. The core distinction lies in their capacity for independent decision-making and action execution, moving beyond mere information retrieval.
Spectrum of interaction. Agent interactions with Large Language Models (LLMs) exist on a spectrum: direct user interaction, agent/assistant proxies (like DALL-E 3 in ChatGPT), agents acting on behalf of users (with approval), and fully autonomous agents making independent decisions. Autonomous agents, while powerful, introduce ethical and safety concerns due to their self-directed nature, necessitating careful design and oversight.
Collaborative intelligence. For intricate problems, multi-agent systems leverage specialized "profiles" or "personas" that work together. A controller agent might orchestrate tasks between a coder agent and a tester agent, fostering internal feedback and evaluation loops. This collaborative approach magnifies the benefits of single agents, allowing for parallel task execution and reduced errors in complex assignments.
2. LLMs Form the Core, But Agents Add Action
For our journey to build powerful agents in this book, we focus on the class of LLMs called chat completions models.
Generative foundation. Large Language Models (LLMs), particularly those built on Generative Pretrained Transformers (GPTs), are generative models trained to create content rather than merely predict or classify. They are defined by their vast training data, architecture (like parameter count), specific training for use cases (e.g., chat completions), and fine-tuning processes. Chat completion models, optimized for iterative refinement, are ideal for agent development due to their conversational nature.
Beyond direct interaction. While direct interaction with LLMs is powerful, agents elevate their utility by providing structure and external capabilities. An agent acts as an intermediary, interpreting user requests, formulating optimal prompts, and orchestrating external tools. This layer of abstraction allows LLMs to perform tasks they weren't inherently designed for, such as interacting with external services or managing complex workflows, making them more versatile.
Strategic LLM selection. Choosing the right LLM for an agent involves evaluating criteria like performance on specific tasks (e.g., coding), model size (affecting hardware and speed), use case (chat completions for agents), training data, and cost. For learning and research, commercial models like GPT-4 Turbo are often recommended due to their capabilities and ease of access, though open-source alternatives are rapidly advancing and can be hosted locally with tools like LM Studio.
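To make this concrete, here is a minimal sketch of a chat completions call using the OpenAI Python client (v1.x); the model name and the LM Studio URL are illustrative assumptions, not prescriptions.

```python
# A minimal chat completions request; model name is an assumption.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
# To target a local model hosted with LM Studio instead, point the client
# at its OpenAI-compatible endpoint (default port shown; adjust as needed):
# client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What makes chat completions models good for agents?"},
    ],
)
print(response.choices[0].message.content)
```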
3. Effective Prompt Engineering is Foundational
Prompt engineering is an emerging field that attempts to bring a structured methodology to building prompts.
Guiding intelligence. Prompt engineering is the art and science of crafting messages to Large Language Models (LLMs) to elicit better, more consistent, and desired outputs. It's an iterative process, where refining queries and providing context significantly improves the LLM's response quality. This discipline is crucial for transforming generic LLM capabilities into targeted agent behaviors, ensuring precision and relevance.
Clear instructions. A core strategy involves writing clear instructions, which includes tactics like providing detailed queries, adopting specific personas, using delimiters to separate content, specifying step-by-step instructions, offering examples, and defining output length. These tactics ensure the LLM understands the task, its role, and the expected format of its response, minimizing ambiguity and improving accuracy. For example, asking for "3 examples" or "summarize in 50 words" guides the output.
Persona power. Adopting personas is a particularly potent tactic, allowing agents to frame all responses within a defined role, background, or personality. For instance, instructing a cooking agent to "speak as Julia Child" not only adds a fun tone but also guides its culinary advice. This persona-driven approach is fundamental to creating specialized and engaging agent profiles, making interactions more natural and effective.
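As a sketch of these tactics in combination, the prompt below adopts a persona, separates content with delimiters, spells out steps, and bounds the output length; the wording is illustrative.

```python
# Illustrative system/user prompts combining several tactics:
# persona, delimiters, step-by-step instructions, and output length.
system_prompt = """You are Julia Child, the famous chef. Answer every
cooking question in her warm, enthusiastic voice.

Follow these steps:
1. Identify the dish the user wants to make.
2. List the key ingredients.
3. Give 3 practical tips, in under 100 words total.

The user's question is delimited by triple quotes."""

user_prompt = '"""How do I make a simple omelet?"""'
# These strings become the "system" and "user" messages of a
# chat completions request.
```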
4. Actions and Tools Extend Agent Capabilities
Actions, therefore, are extensions of plugins—they give a plugin its abilities.
External empowerment. Agents transcend mere conversation by leveraging "actions," "tools," or "skills" to interact with the outside world. These capabilities, often encapsulated as plugins or functions, allow agents to perform tasks like searching the web, calling APIs, generating images, or executing code. This external interaction transforms an LLM from a passive responder into an active participant in complex workflows, enabling real-world impact.
OpenAI function calling. OpenAI introduced a standard specification for defining these actionable interfaces, enabling LLMs to recognize user requests that match a function's description and extract the necessary parameters. The LLM doesn't execute the function itself but rather returns the suggested function call and its arguments, which an external system then processes. This delegation allows for powerful, context-aware tool use, as demonstrated by ChatGPT plugins.
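A minimal sketch of this flow with the OpenAI Python client follows; the get_weather tool and its schema are hypothetical, defined only to show the round trip.

```python
# The LLM returns a suggested function call plus arguments; the agent
# executes it. `get_weather` is a hypothetical tool for illustration.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)

call = response.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)
print(call.function.name, args)  # e.g. get_weather {'city': 'Paris'}
# The agent, not the LLM, now runs the real function and sends the result
# back to the model in a follow-up "tool" message.
```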
Semantic Kernel's role. Microsoft's Semantic Kernel (SK) is a robust framework for building and managing these agent actions, referring to them as "semantic plugins." SK can wrap both "semantic functions" (prompt templates) and "native functions" (code-based operations) into reusable plugins. This allows for the creation of a "GPT interface" that exposes any service or API through natural language, making it accessible to chat interfaces or other agents, such as a movie database API.
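The distinction between the two function types can be sketched without SK's own syntax; the snippet below is a framework-agnostic illustration of the idea, not Semantic Kernel's actual API.

```python
# Native vs. semantic functions, in the spirit of SK plugins.

def lookup_movie(title: str) -> str:
    """Native function: plain code, e.g. wrapping a movie database API."""
    return f"(stub) metadata for {title}"

SUMMARIZE_TEMPLATE = "Summarize the plot of {title} in two sentences."

def summarize_movie(title: str, llm) -> str:
    """Semantic function: a prompt template rendered, then sent to an LLM."""
    return llm(SUMMARIZE_TEMPLATE.format(title=title))

# A plugin is a named bundle of such functions that a chat interface or
# another agent can invoke through natural language.
movie_plugin = {"lookup_movie": lookup_movie,
                "summarize_movie": summarize_movie}
```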
5. Multi-Agent Systems Tackle Complex Problems Collaboratively
Multi-agent systems incorporate many of the same tools single-agent systems use but benefit from the ability to provide outside feedback and evaluation to other agents.
Collaborative intelligence. Multi-agent systems enhance problem-solving by distributing tasks among specialized agents that communicate and collaborate. This setup allows for internal feedback and evaluation, significantly reducing errors and improving the quality of solutions compared to single-agent approaches. Agents can specialize in distinct roles, such as a "coder" and a "tester," working in tandem to achieve a common goal.
AutoGen's conversational power. Microsoft's AutoGen platform exemplifies conversational multi-agent systems, where agents communicate using natural language. A "UserProxy" agent can orchestrate tasks, directing an "AssistantAgent" to generate code, then evaluating its output and providing feedback. This iterative loop continues until the task is satisfactorily completed, in many cases replacing human oversight entirely; the agents can even install needed packages on their own.
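A minimal two-agent loop in AutoGen might look like the sketch below; the model name and configuration details are assumptions.

```python
# Two-agent AutoGen loop: the proxy runs the assistant's code and feeds
# results back until the task completes. Config values are illustrative.
from autogen import AssistantAgent, UserProxyAgent

assistant = AssistantAgent(
    name="coder",
    llm_config={"config_list": [{"model": "gpt-4-turbo"}]},
)
user_proxy = UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",  # no human in the loop
    code_execution_config={"work_dir": "coding", "use_docker": False},
)

user_proxy.initiate_chat(
    assistant,
    message="Plot a sine wave and save it as sine.png.",
)
```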
CrewAI's structured approach. CrewAI, designed for enterprise applications, offers a more structured approach with role-based and autonomous agents. It supports both sequential and hierarchical task management, allowing agents to focus on specific areas of a goal. Observability tools like AgentOps are crucial for monitoring these complex interactions, tracking performance and costs and identifying inefficiencies in agent collaboration; they can reveal, for example, that generating a single joke may cost more than 50 cents in LLM calls.
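A small role-based crew can be sketched as follows; the roles, tasks, and wording are illustrative assumptions.

```python
# A sequential two-agent crew with CrewAI; content is illustrative.
from crewai import Agent, Task, Crew, Process

researcher = Agent(
    role="Researcher",
    goal="Gather key facts on a topic",
    backstory="A meticulous analyst who verifies every claim.",
)
writer = Agent(
    role="Writer",
    goal="Turn research notes into a short article",
    backstory="A concise technical writer.",
)

research = Task(
    description="Collect five key facts about AI agents.",
    expected_output="A bulleted list of five facts.",
    agent=researcher,
)
write = Task(
    description="Write a 150-word article from the research notes.",
    expected_output="A 150-word article.",
    agent=writer,
)

crew = Crew(agents=[researcher, writer],
            tasks=[research, write],
            process=Process.sequential)
print(crew.kickoff())
```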
6. Autonomous Agents Require Structured Control (Behavior Trees)
Behavior trees are a long-established pattern used to control robotics and AI in games.
Orchestrating complexity. Autonomous agents, capable of making independent decisions, require robust control mechanisms. Behavior trees, a pattern originating from robotics and game AI, provide a scalable and modular framework for orchestrating complex agent behaviors. They define a hierarchical structure of nodes—selectors, sequences, conditions, and actions—that dictate the flow of execution based on success or failure.
Execution logic. Unlike traditional Boolean logic, behavior trees operate on "success" or "failure" states. Execution flows from top to bottom, left to right, with composite nodes (selectors, sequences) determining which child nodes to execute. This clear, intuitive structure makes behavior trees excellent for debugging and visualizing an agent's decision-making process, ensuring predictable and controlled autonomy. A classic toy example is a tree that decides whether an AI character eats an apple or a pear, sketched below.
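The following is a minimal, self-contained sketch of that execution model; the node and state names are invented for illustration.

```python
# Composite nodes return "success" or "failure" and decide which
# children run; execution is top-to-bottom, left-to-right.
SUCCESS, FAILURE = "success", "failure"

class Sequence:
    """Runs children in order; fails on the first failure."""
    def __init__(self, *children): self.children = children
    def tick(self, state):
        for child in self.children:
            if child.tick(state) == FAILURE:
                return FAILURE
        return SUCCESS

class Selector:
    """Tries children in order; succeeds on the first success."""
    def __init__(self, *children): self.children = children
    def tick(self, state):
        for child in self.children:
            if child.tick(state) == SUCCESS:
                return SUCCESS
        return FAILURE

class Condition:
    def __init__(self, test): self.test = test
    def tick(self, state): return SUCCESS if self.test(state) else FAILURE

class Action:
    def __init__(self, act): self.act = act
    def tick(self, state): self.act(state); return SUCCESS

eat = Selector(
    Sequence(Condition(lambda s: "apple" in s["fruit"]),
             Action(lambda s: print("Eating the apple"))),
    Sequence(Condition(lambda s: "pear" in s["fruit"]),
             Action(lambda s: print("Eating the pear"))),
)
eat.tick({"fruit": ["pear"]})  # prints "Eating the pear"
```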
Agentic Behavior Trees (ABTs). When applied to AI agents, these become Agentic Behavior Trees (ABTs), where prompts direct actions and conditions. Tools like the GPT Assistants Playground facilitate building ABTs with OpenAI assistants, allowing for complex workflows like coding challenges or social media posting. ABTs can combine siloed agent interactions with conversational threads, leveraging the strengths of both patterns for unbiased review and emergent behaviors.
7. Memory and Knowledge Augment Agent Context (RAG)
Retrieval in agent and chat applications is a mechanism for obtaining knowledge kept in storage that is typically external and long-lived.
Contextual enrichment. Agents require both knowledge and memory to provide relevant context to their prompts, moving beyond the limitations of their initial training data. Knowledge refers to augmenting prompts with information from external, unstructured documents (like PDFs or code), while memory pertains to contextualizing prompts with conversation history, facts, or preferences. Both rely on "retrieval augmented generation" (RAG) patterns.
RAG workflow. The RAG process involves several steps: loading documents, transforming them into manageable "chunks," embedding these chunks into high-dimensional vectors, and storing them in a vector database (e.g., Chroma DB). When a query is made, it's also embedded, and a "semantic similarity search" retrieves the most relevant chunks, which then augment the LLM's prompt to generate a more informed response. This is more effective than sending the entire document.
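A compact sketch of this pipeline with Chroma's default embedding function follows; the chunks and query are stand-ins, and real systems use proper document loaders and splitters.

```python
# Store chunks in a vector database, then retrieve by semantic
# similarity to augment the prompt. Chunk text is illustrative.
import chromadb

client = chromadb.Client()  # in-memory instance
collection = client.create_collection("docs")

chunks = [  # in practice, produced by a loader and splitter
    "Agents use tools such as web search and code execution.",
    "Agent memory stores user preferences between sessions.",
    "Behavior trees give autonomous agents structured control.",
]
collection.add(documents=chunks,
               ids=[f"chunk-{i}" for i in range(len(chunks))])

# The query is embedded too; similarity search returns the most
# relevant chunks, which augment the LLM's prompt.
hits = collection.query(query_texts=["What tools do agents use?"],
                        n_results=2)
context = "\n".join(hits["documents"][0])
prompt = (f"Answer using only this context:\n{context}\n\n"
          f"Question: What tools do agents use?")
```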
Memory types. Agent memory mirrors human cognitive functions, categorized into sensory, short-term (conversational history), and long-term (semantic, episodic, procedural). Long-term memory, especially semantic memory, allows agents to store and retrieve facts, concepts, and even preferences. Platforms like Nexus enable the creation of configurable memory stores, where LLMs process conversations into semantically relevant memories, enhancing personalized interactions and reducing redundancy through compression.
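One way to picture the short-term/long-term split is the sketch below; the fact-extraction prompt and the llm callable are assumptions rather than any specific platform's API.

```python
# Short-term memory is the rolling history; long-term memory holds
# facts distilled from conversation. `llm` is a hypothetical callable.
short_term = []  # recent turns, replayed into each new prompt
long_term = []   # durable semantic memories

def remember_turn(user_msg: str, assistant_msg: str, llm) -> None:
    short_term.append({"user": user_msg, "assistant": assistant_msg})
    # Compress the turn into a semantically relevant memory, if any.
    fact = llm(
        "Extract one durable user fact or preference from this exchange, "
        f"or reply NONE:\nUser: {user_msg}\nAssistant: {assistant_msg}"
    )
    if fact.strip() != "NONE":
        long_term.append(fact.strip())
```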
8. Systematic Evaluation is Key to Agent Performance (Prompt Flow)
Evaluation of prompt/profile performance isn’t something we can typically do using a measure of accuracy or correct percentage.
Beyond intuition. While iterative prompt engineering is effective, systematically evaluating prompt and agent profile performance is crucial for building reliable AI agents. This involves defining clear criteria and standards to assess how well an agent accomplishes its given task, moving beyond subjective judgment to objective measurement. This systematic approach ensures genuine improvements and consistent outputs.
Prompt Flow's power. Microsoft's Prompt Flow, an open-source tool, excels at this systematic evaluation. It allows developers to build, test, and compare multiple prompt variations at scale, leveraging multi-threaded batch processing. This capability is invaluable for quickly assessing the performance of different agent profiles, LLM configurations (e.g., temperature, max tokens), and even comparing different LLM models, like GPT-3.5 vs. GPT-4.
Rubrics and grounding. Evaluation in Prompt Flow often employs "rubrics" – structured sets of criteria and rating scales – to assess prompt responses. "Grounding" refers to how well a response aligns with these predefined criteria, objectives, and context. By using a second LLM to automatically evaluate responses against a rubric, developers can establish objective baselines, compare profile variants, and iteratively refine their agents for optimal performance, ensuring recommendations meet specific criteria.
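A minimal version of the grader-LLM pattern is sketched below; the rubric wording and the 1-to-5 scale are illustrative assumptions.

```python
# A second LLM scores responses against a rubric, giving an objective
# baseline for comparing prompt/profile variants.
from openai import OpenAI

client = OpenAI()

RUBRIC = """Rate the response from 1 (poor) to 5 (excellent) on:
- Grounding: does it stick to the given context and objectives?
- Relevance: does it answer the user's actual question?
Reply with a single integer."""

def grade(question: str, response_text: str) -> int:
    result = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user",
             "content": f"Question: {question}\nResponse: {response_text}"},
        ],
    )
    return int(result.choices[0].message.content.strip())
```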
9. Reasoning and Planning Drive Agent Intelligence
While an LLM isn’t designed to reason, the training material fed into the model provides an understanding of reasoning, planning, and thought.
Eliciting intelligence. Although LLMs are not inherently trained to "reason" or "plan," careful prompt engineering can elicit these behaviors by leveraging the vast knowledge embedded in their training data. Reasoning involves understanding thought processes and applying actions to solve tasks, while planning is the ability to order these actions to achieve a goal. These capabilities are crucial for agents tackling complex, multi-step problems.
Chain of Thought (CoT). CoT prompting is a powerful technique that guides LLMs through a problem-solving process by providing few-shot examples that demonstrate explicit reasoning steps. This encourages the LLM to "think step-by-step," breaking down complex challenges into manageable parts and showing its internal logic. This method significantly improves accuracy on intricate problems like time travel paradoxes or mathematical puzzles.
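Here is what a few-shot CoT prompt can look like; the worked example is invented for illustration.

```python
# The worked example demonstrates explicit reasoning steps for the
# model to imitate on the new question.
cot_prompt = """Q: A train leaves at 3 PM and arrives at 7 PM the next
day. How long is the trip?
A: From 3 PM to 3 PM the next day is 24 hours. From 3 PM to 7 PM is
4 more hours. 24 + 4 = 28 hours. The answer is 28 hours.

Q: A time traveler departs in 2024, jumps back 30 years, then forward
12 years. What year do they land in?
A:"""
```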
Advanced reasoning. Techniques like "Zero-shot CoT" (using phrases like "Let's think step by step" without examples) and "Prompt Chaining" (sequencing multiple prompts to decompose and solve a problem) further enhance reasoning. "Self-consistency" generates multiple solutions and selects the most frequent, while "Tree of Thought" (ToT) explores multiple reasoning paths, evaluating each step to prune invalid ones. These methods, though computationally intensive, push the boundaries of LLM problem-solving.
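Self-consistency in particular reduces to a short loop: sample several reasoning paths and keep the majority answer. In the sketch below, extract_answer is a hypothetical parser for the model's final line, and the model name is an assumption.

```python
# Sample n diverse reasoning paths, then majority-vote the answers.
from collections import Counter
from openai import OpenAI

client = OpenAI()

def self_consistent_answer(question: str, extract_answer, n: int = 5) -> str:
    answers = []
    for _ in range(n):
        r = client.chat.completions.create(
            model="gpt-4-turbo",
            temperature=0.8,  # nonzero temperature diversifies paths
            messages=[{"role": "user",
                       "content": f"{question}\nLet's think step by step."}],
        )
        answers.append(extract_answer(r.choices[0].message.content))
    return Counter(answers).most_common(1)[0][0]
```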
10. Feedback Loops Ensure Continuous Agent Improvement
Planning can only go so far, and an often-unrecognized element is feedback.
Adaptive intelligence. While planning and reasoning are critical, feedback is the essential, often overlooked, component that enables agents to continuously adapt and improve. Feedback mechanisms allow agents to learn from their execution, correct errors, and refine their strategies, moving beyond static plans to dynamic, self-correcting behavior. This is vital for agents operating in unpredictable or evolving environments.
Internal and external feedback. Feedback can be integrated internally within an LLM (as seen in advanced models like OpenAI's Strawberry, which can self-critique and suggest improvements) or externally through human oversight or other evaluation agents. For instance, if an agent's plan fails, explicit feedback can guide it to review its assumptions, adjust its approach, or even modify its internal instructions for future similar tasks, correcting concrete errors such as a miscalculated number of time-travel days.
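A simple execute-critique-retry loop captures the idea; execute_plan and llm below are hypothetical stand-ins for an agent's tools and model.

```python
# Plans are revised from explicit failure feedback rather than
# being executed once and discarded.
def run_with_feedback(task: str, llm, execute_plan, max_attempts: int = 3):
    plan = llm(f"Write a step-by-step plan for: {task}")
    for _ in range(max_attempts):
        ok, result = execute_plan(plan)
        if ok:
            return result
        # Feed the failure back so the agent revises its assumptions.
        plan = llm(
            f"The plan failed with: {result}\n"
            f"Review your assumptions and produce a corrected plan for: "
            f"{task}\nPrevious plan:\n{plan}"
        )
    raise RuntimeError("Task failed after feedback-driven retries")
```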
Application across systems. Feedback is crucial across various agent applications: personal assistants learn user preferences, customer service bots refine responses based on satisfaction surveys, and autonomous agents adjust complex workflows. In collaborative multi-agent systems, agents provide feedback to each other, fostering a collective learning environment. Rigorous feedback, combined with evaluation, builds confidence in agent performance and drives long-term system improvement.
11. Agent Platforms Streamline Development and Deployment
Nexus is an open source platform developed with this book to teach the core concepts of building full-featured AI agents.
Simplifying complexity. The proliferation of AI agent tools and frameworks highlights the need for platforms that simplify development and deployment. Platforms like Nexus, AutoGen, and CrewAI abstract away much of the underlying complexity of LLM interaction, tool orchestration, and multi-agent coordination, allowing developers to focus on agent logic and problem-solving. This reduces the barrier to entry for building sophisticated AI systems.
Modular architecture. Nexus, built with Streamlit for its intuitive web interface, exemplifies a modular agent platform. It dynamically discovers and integrates agent components such as profiles/personas, actions/tools, knowledge/memory stores, and planners through a plugin system. This architecture allows for easy customization and extension, enabling users to experiment with different combinations of agent capabilities, from a "talkative AI" persona to specific Wikipedia search actions.
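Dynamic discovery of this kind is often built on Python's import machinery; the sketch below shows the general pattern, not Nexus's actual API, and the ACTIONS convention is assumed.

```python
# Import every module in a plugins package and collect its actions.
import importlib
import pkgutil

def discover_plugins(package_name: str) -> dict:
    package = importlib.import_module(package_name)
    plugins = {}
    for info in pkgutil.iter_modules(package.__path__):
        module = importlib.import_module(f"{package_name}.{info.name}")
        # Assumed convention: each plugin module exposes an ACTIONS dict
        # mapping action names to callables.
        plugins.update(getattr(module, "ACTIONS", {}))
    return plugins
```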
Practical implementation. These platforms provide concrete environments for building and testing agents. For instance, Nexus allows users to define agent personas, attach custom actions (native or semantic functions), configure knowledge and memory stores, and select planning strategies. This hands-on approach, coupled with features like logging and debugging, makes the intricate process of agent development accessible and manageable for all levels of developers.
Review Summary
AI Agents in Action receives mixed reviews, with ratings ranging from 1 to 5 stars. Some readers praise its practical focus and comprehensive coverage of AI agent development, while others criticize its reliance on third-party tools and lack of depth. Positive reviews highlight the book's value for learning to build AI agents, especially using OpenAI capabilities. Negative reviews mention outdated content, overreliance on code snippets, and insufficient explanation of core concepts. Several readers note the rapid pace of change in the field, making parts of the book quickly obsolete.