Master agent observability: learn what to log when AI systems call other AIs. Spans, traces, audit logs, and debugging strategies for multi-agent systems.
When you deploy AI agents into production, you're no longer dealing with a single model making isolated predictions. You're building systems where agents call APIs, invoke other agents, make decisions based on tool outputs, and chain multiple steps together. This complexity creates a fundamental observability challenge: how do you see what's happening when your AI calls other AIs?
Agent observability is the practice of instrumenting your AI systems to capture, correlate, and analyze the full execution flow of multi-step agent workflows. Unlike traditional application observability—which focuses on request latency, error rates, and resource utilization—agent observability requires visibility into the decision-making process itself: what did the agent see, what did it decide, what tools did it invoke, and why?
This is critical because agent failures are often silent. A model might call a tool incorrectly, get a result it didn't expect, and then hallucinate a response without any obvious error signal. Or an agent might call another agent, which calls a third system, and the failure point becomes buried three layers deep in the execution trace. Without proper observability, you're flying blind.
Agent observability rests on three interconnected concepts that form the foundation of what you need to log:
A span is a single, measurable unit of work within your agent system. It has a start time, an end time, a name, and attributes that describe what happened. When your agent calls an LLM, that's a span. When it invokes a tool, that's another span. When it retrieves data from a database, that's another span.
Each span should capture:
Spans are lightweight and granular. You might have dozens or hundreds of spans in a single agent execution. The goal is to make each one queryable and analyzable so you can pinpoint exactly where latency comes from or where a decision went wrong.
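As a rough illustration of the idea, here is a minimal sketch of a span as a plain Python data structure, with a context manager that records start and end times. The names Span, span, and finished_spans are hypothetical and exist only for this example; a real tracing library would provide its own equivalents.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List, Optional


@dataclass
class Span:
    """One unit of work: a name, start/end times, and descriptive attributes."""
    name: str
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start_time: float = 0.0
    end_time: Optional[float] = None
    attributes: Dict[str, Any] = field(default_factory=dict)

    @property
    def duration_ms(self) -> Optional[float]:
        # None while the span is still open.
        if self.end_time is None:
            return None
        return (self.end_time - self.start_time) * 1000.0


finished_spans: List[Span] = []  # hypothetical stand-in for a real exporter


@contextmanager
def span(name: str, **attributes: Any) -> Iterator[Span]:
    """Time the enclosed block and record it as a finished span."""
    current = Span(name=name, start_time=time.perf_counter(), attributes=dict(attributes))
    try:
        yield current
    except Exception as exc:
        # Record the failure on the span, then let the error propagate.
        current.attributes["error"] = repr(exc)
        raise
    finally:
        current.end_time = time.perf_counter()
        finished_spans.append(current)


# Usage: wrap any unit of work, such as an LLM call or a tool invocation.
with span("llm_call", model="example-model") as s:
    s.attributes["output_tokens"] = 42  # attributes can be added while the span is open
```

Because every span ends up in one queryable collection, you can filter by name or attribute to find slow or failed operations.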
A trace is a collection of related spans that together represent one complete execution of your agent workflow. A trace has a unique ID that propagates through all the spans it contains, creating a causal chain: agent A calls tool B, which calls API C, which triggers agent D. All of those spans share the same trace ID, allowing you to reconstruct the entire execution path from a single entry point.
Traces answer the question: "What happened during this agent run?" They let you see:
Without trace IDs, you lose the ability to correlate events across your system. A log entry from agent A and a log entry from agent B might be related, but without a shared trace ID, you have no way to know. With traces, correlation is automatic.
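To make the correlation concrete, the sketch below takes flat log entries from several components and rebuilds one execution path from a shared trace ID. The entry fields (trace_id, span_id, parent_span_id, name) and the sample data are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Any, Dict, List

# Hypothetical flat log entries, as they might arrive from several services.
entries: List[Dict[str, Any]] = [
    {"trace_id": "t1", "span_id": "a", "parent_span_id": None, "name": "agent_a.run"},
    {"trace_id": "t2", "span_id": "x", "parent_span_id": None, "name": "agent_a.run"},
    {"trace_id": "t1", "span_id": "b", "parent_span_id": "a", "name": "tool_b.call"},
    {"trace_id": "t1", "span_id": "c", "parent_span_id": "b", "name": "api_c.request"},
    {"trace_id": "t1", "span_id": "d", "parent_span_id": "c", "name": "agent_d.run"},
]


def render_trace(entries: List[Dict[str, Any]], trace_id: str) -> List[str]:
    """Return the spans of one trace as an indented call tree."""
    children = defaultdict(list)
    for entry in entries:
        if entry["trace_id"] == trace_id:
            children[entry["parent_span_id"]].append(entry)

    lines: List[str] = []

    def walk(parent_id, depth: int) -> None:
        for entry in children.get(parent_id, []):
            lines.append("  " * depth + entry["name"])
            walk(entry["span_id"], depth + 1)

    walk(None, 0)  # root spans have no parent
    return lines


for line in render_trace(entries, "t1"):
    print(line)
```

Entries from the unrelated trace t2 are ignored, which is exactly the separation a shared trace ID buys you.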
Audit logs are a different beast. They're not about debugging; they're about compliance, accountability, and forensics. An audit log records what happened and who made it happen in a way that is tamper-evident and can stand up to legal scrutiny.
For agents, audit logs should capture:
Audit logs are essential for compliance (HIPAA, SOX, GDPR), for security investigations, and for understanding agent behavior in regulated environments. Unlike debug logs, which you can sample or truncate, audit logs must be comprehensive and permanent.
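One common way to make an audit trail resistant to silent edits is hash chaining: each entry stores a hash of its own content plus the previous entry's hash, so changing any record breaks every hash after it. The sketch below is an illustration under that assumption, not a complete compliance solution; the function names and record fields are hypothetical, and it detects tampering rather than preventing it.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, record: Dict[str, Any]) -> str:
    # Serialize deterministically so the same record always hashes the same way.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_audit_record(log: List[Dict[str, Any]], record: Dict[str, Any]) -> None:
    """Append a record whose hash covers both its content and the previous hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS_HASH
    log.append({"record": record, "prev_hash": prev_hash, "hash": _digest(prev_hash, record)})


def verify_audit_log(log: List[Dict[str, Any]]) -> bool:
    """Recompute the chain; any edited, removed, or reordered entry breaks it."""
    prev_hash = GENESIS_HASH
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True


audit_log: List[Dict[str, Any]] = []
append_audit_record(audit_log, {"actor": "user_456", "agent_id": "research_agent", "action": "tool_call"})
append_audit_record(audit_log, {"actor": "user_456", "agent_id": "research_agent", "action": "agent_call"})
print(verify_audit_log(audit_log))
```

In practice you would also store the log somewhere append-only and anchor the latest hash outside the system being audited.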
Observability gets much harder when agents call other agents, because every extra hop adds spans, failure points, and sources of latency that have to be correlated. Consider this scenario:
You have a "research agent" that gathers information. It calls a "summarization agent" to condense findings. The summarization agent calls a "fact-check agent" to verify claims. Each agent might call multiple tools, each tool might call an API, and each API might be rate-limited or slow.
Now, a user reports that their research took 45 seconds when it should take 5. Where did the time go?
Without proper instrumentation, you're guessing. With proper observability, you can see the entire trace, identify that the fact-check agent spent 40 seconds waiting for an external API response, and know exactly where to optimize.
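Here is a small sketch of how that diagnosis could work on trace data. It computes each span's self time (its duration minus the time spent in its direct children) and ranks spans by it. The span timings are invented to mirror the 45-second scenario, and the calculation assumes child spans run sequentially rather than in parallel.

```python
from typing import Any, Dict, List, Tuple

# Invented timings (in seconds) that mirror the scenario above.
spans: List[Dict[str, Any]] = [
    {"span_id": "1", "parent_span_id": None, "name": "research_agent", "start_s": 0.0, "end_s": 45.0},
    {"span_id": "2", "parent_span_id": "1", "name": "summarization_agent", "start_s": 1.0, "end_s": 44.0},
    {"span_id": "3", "parent_span_id": "2", "name": "fact_check_agent", "start_s": 2.0, "end_s": 43.0},
    {"span_id": "4", "parent_span_id": "3", "name": "external_api_call", "start_s": 2.5, "end_s": 42.5},
]


def self_times(spans: List[Dict[str, Any]]) -> List[Tuple[str, float]]:
    """Rank spans by time spent in the span itself, excluding direct children."""
    duration = {s["span_id"]: s["end_s"] - s["start_s"] for s in spans}
    child_total = {s["span_id"]: 0.0 for s in spans}
    for s in spans:
        parent = s["parent_span_id"]
        if parent in child_total:
            child_total[parent] += duration[s["span_id"]]
    ranked = [(s["name"], duration[s["span_id"]] - child_total[s["span_id"]]) for s in spans]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


bottleneck = self_times(spans)[0]
print(bottleneck)
```

Total duration alone would point at the research agent, since it wraps everything else; self time points at the span that actually did the waiting.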
Multi-agent observability requires:
Here's what you need to instrument in a production agent system:
Every time your agent calls an LLM, log:
This data is essential for understanding model behavior, tracking costs, and debugging why an agent made a particular decision.
When your agent calls a tool (database query, API, calculator, web search, etc.), log:
Tools are where agents interact with the real world, so this is where failures often originate. Comprehensive tool logging lets you distinguish between tool failures and agent decision failures.
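A decorator is one lightweight way to get this kind of tool logging without touching each tool's body. The sketch below is illustrative only: logged_tool, tool_log, and the entry fields are hypothetical names, and a real system would write to a log pipeline rather than an in-memory list.

```python
import functools
import time
from typing import Any, Callable, Dict, List

tool_log: List[Dict[str, Any]] = []  # hypothetical stand-in for a real log sink


def logged_tool(tool_name: str) -> Callable:
    """Wrap a tool function so every call records input, outcome, and latency."""
    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            entry: Dict[str, Any] = {
                "event_type": "tool_call",
                "tool_name": tool_name,
                "input": {"args": args, "kwargs": kwargs},
            }
            start = time.perf_counter()
            try:
                result = fn(*args, **kwargs)
                entry["status"] = "success"
                entry["output"] = result
                return result
            except Exception as exc:
                # The tool itself failed; record that and re-raise for the agent to handle.
                entry["status"] = "error"
                entry["error"] = f"{type(exc).__name__}: {exc}"
                raise
            finally:
                entry["latency_ms"] = round((time.perf_counter() - start) * 1000, 3)
                tool_log.append(entry)
        return wrapper
    return decorator


@logged_tool("calculator")
def divide(a: float, b: float) -> float:
    return a / b


divide(10, 2)
try:
    divide(1, 0)
except ZeroDivisionError:
    pass
```

An entry with status "error" indicates the tool failed. An entry with status "success" followed by a bad final answer suggests the problem lies in how the agent used the result.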
When your agent makes a decision (which tool to call next, how to interpret a tool result, when to stop), log:
This is the "why" behind agent behavior. When an agent makes a mistake, understanding its reasoning helps you decide whether to retrain, add guardrails, or improve the tool set.
When one agent calls another, log:
Inter-agent calls are where complexity explodes. Logging them properly is how you avoid getting lost in the call stack.
When anything fails, log:
Error logging is where most observability systems fall short. Errors are where you learn the most, so log them comprehensively.
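As a sketch of what a comprehensive error record might contain, the helper below captures the exception type, message, and stack trace alongside the trace context. The function name build_error_entry and the field names are assumptions for this example, not a standard schema.

```python
import json
import traceback
from datetime import datetime, timezone
from typing import Any, Dict


def build_error_entry(exc: BaseException, *, trace_id: str, span_id: str,
                      agent_id: str, context: Dict[str, Any]) -> Dict[str, Any]:
    """Turn an exception into a structured, JSON-serializable log entry."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event_type": "error",
        "trace_id": trace_id,
        "span_id": span_id,
        "agent_id": agent_id,
        "error_type": type(exc).__name__,
        "error_message": str(exc),
        "stack_trace": "".join(traceback.format_exception(type(exc), exc, exc.__traceback__)),
        "context": context,  # what the agent was doing when it failed
    }


try:
    raise TimeoutError("fact-check API did not respond")
except TimeoutError as exc:
    entry = build_error_entry(
        exc,
        trace_id="abc123def456",
        span_id="xyz789",
        agent_id="fact_check_agent",
        context={"tool_name": "web_search", "attempt": 3},
    )

print(json.dumps(entry, indent=2))
```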
For every agent execution, capture:
This context is essential for security, compliance, and debugging user-specific issues.
Knowing what to log is one thing; actually implementing it is another. Here's how to approach it:
Don't build your own tracing system. Use an existing framework. For Python, LangChain's built-in debugging and tracing capabilities provide a solid foundation. For broader observability, Phoenix, an open-source observability tool for AI agents, offers a tracing system designed specifically for agent workloads.
A good tracing framework should:
You don't need to log every single operation. Log at the level of:
Avoid logging individual token processing or internal model computations unless you're debugging a specific issue.
Log structured data (JSON, key-value pairs), not free-form text. Structured logs are:
A structured log entry might look like:
    {
      "timestamp": "2024-01-15T10:23:45.123Z",
      "trace_id": "abc123def456",
      "span_id": "xyz789",
      "parent_span_id": "xyz788",
      "event_type": "tool_call",
      "tool_name": "database_query",
      "tool_version": "1.2.3",
      "input": {"query": "SELECT * FROM users WHERE id = ?"},
      "output": {"rows": 1, "data": {...}},
      "status": "success",
      "latency_ms": 145,
      "user_id": "user_456",
      "agent_id": "research_agent",
      "environment": "production"
    }

Debug logs can be sampled, truncated, or deleted. Audit logs must be comprehensive and permanent. Use different log levels and storage backends:
Agent logs often contain sensitive information: user data, API keys, proprietary queries, financial information. You need to:
For AI Agent Logging & Audit Trails: Debugging and Compliance, this is non-negotiable.
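One way to approach this is to redact entries before they leave the process. The sketch below masks values under known sensitive keys and scrubs email addresses from free text. The key list and the email pattern are illustrative assumptions; real deployments need a policy that matches their own data.

```python
import re
from typing import Any

# Hypothetical list of keys whose values should never be logged.
SENSITIVE_KEYS = {"api_key", "password", "authorization", "ssn"}
# A simple email pattern for illustration; it will not catch every valid address.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(value: Any) -> Any:
    """Return a copy of a log entry with sensitive values masked."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    if isinstance(value, str):
        return EMAIL_PATTERN.sub("[EMAIL]", value)
    return value


entry = {
    "tool_name": "crm_lookup",
    "input": {"api_key": "sk-123", "query": "find jane@example.com"},
    "latency_ms": 145,
}
safe_entry = redact(entry)
print(safe_entry)
```

The function returns a new structure and leaves the original entry untouched, so the agent can keep using the real values while only the masked copy is written to logs.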
Here are patterns that work in production:
When a user makes a request, create a trace ID and inject it into all downstream operations. If the user reports an issue, you can query by user ID and request ID to see the entire trace.
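    import uuid
    from contextvars import ContextVar

    # Holds the trace ID for the current request; each thread or asyncio task has its own context.
    trace_id = ContextVar('trace_id')

    def handle_user_request(user_id, request):
        # Create one trace ID per incoming request.
        trace_id.set(str(uuid.uuid4()))
        # All downstream operations automatically have access to trace_id.get()
        result = run_agent(user_id, request)  # run_agent is your application's agent entry point
        return result

Every log entry should include the trace ID, making it trivial to reconstruct the entire user journey.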
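With Python's standard logging module, one way to attach the trace ID to every entry is a logging filter that reads it from the context variable, paired with a formatter that emits JSON. This is a minimal sketch: the class names TraceIdFilter and JsonFormatter and the logger name agent_demo are made up for the example, and the output goes to an in-memory stream so it can be inspected.

```python
import io
import json
import logging
import uuid
from contextvars import ContextVar

trace_id = ContextVar("trace_id", default="-")  # "-" when no request is active


class TraceIdFilter(logging.Filter):
    """Copy the current trace ID onto every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = trace_id.get()
        return True  # never drop the record


class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object per line."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "trace_id": record.trace_id,
            "message": record.getMessage(),
        })


stream = io.StringIO()  # in-memory sink for the example
handler = logging.StreamHandler(stream)
handler.addFilter(TraceIdFilter())
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("agent_demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example's output out of the root logger

trace_id.set(uuid.uuid4().hex)
logger.info("tool call started")
logger.info("tool call finished")

lines = [json.loads(line) for line in stream.getvalue().splitlines()]
print(lines)
```

Call sites stay unchanged: code that logs does not need to know about trace IDs at all.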
Logging everything can be expensive. Use sampling to control costs:
This gives you visibility into problems while keeping costs manageable.
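A sampling decision can be sketched in a few lines. The version below always keeps errors and keeps a fixed fraction of successful traces, deciding by a hash of the trace ID so that every span in a trace gets the same answer. The function name should_log and the 10% default rate are assumptions for illustration, not recommendations.

```python
import hashlib


def should_log(trace_id: str, status: str, sample_rate: float = 0.1) -> bool:
    """Keep every error; keep a deterministic fraction of all other traces."""
    if status == "error":
        return True
    # Hash the trace ID so the decision is stable across processes and services.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < sample_rate * 2**64


kept = sum(should_log(f"trace-{i}", "success") for i in range(10_000))
print(f"kept {kept} of 10000 successful traces")
```

Hashing the trace ID, rather than calling a random number generator per log entry, avoids traces where some spans were kept and others dropped.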
From your logs, extract key metrics:
These metrics should be tracked separately from logs, in a time-series database like Prometheus or Datadog. This lets you set alerts and dashboards without querying raw logs.
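To illustrate the kind of aggregation involved, the sketch below derives an error rate and latency percentiles from structured log entries in plain Python. The entries are synthetic and the nearest-rank percentile is one of several common definitions; in production you would normally let your metrics backend compute these.

```python
import math
from typing import Any, Dict, List


def percentile(sorted_values: List[float], pct: int) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(entries: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Compute count, error rate, and latency percentiles from log entries."""
    latencies = sorted(e["latency_ms"] for e in entries)
    errors = sum(1 for e in entries if e["status"] == "error")
    return {
        "count": len(entries),
        "error_rate": errors / len(entries),
        "p50_latency_ms": percentile(latencies, 50),
        "p95_latency_ms": percentile(latencies, 95),
    }


# Synthetic data: 19 successful calls plus one slow failure.
entries = [{"status": "success", "latency_ms": 100 + 10 * i} for i in range(19)]
entries.append({"status": "error", "latency_ms": 5000})

summary = summarize(entries)
print(summary)
```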
If your agents call external services (databases, APIs, other microservices), propagate trace IDs to those services. If a service supports distributed tracing (for example, through the W3C Trace Context traceparent header), it can include the trace ID in its own logs, creating a unified trace across your entire system.
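A widely used convention for this is the W3C Trace Context traceparent header, which carries a version, a 32-hex-character trace ID, a 16-hex-character parent span ID, and a flags byte. The sketch below builds and parses that header by hand to show the mechanics; the helper names inject and extract are hypothetical, only version 00 is handled, and in practice a tracing library would do this for you.

```python
import re
import secrets
from typing import Dict, Optional, Tuple

# Version 00 layout: 00-<32 hex trace id>-<16 hex parent span id>-<2 hex flags>
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_trace_id() -> str:
    return secrets.token_hex(16)  # 16 random bytes -> 32 hex characters


def new_span_id() -> str:
    return secrets.token_hex(8)  # 8 random bytes -> 16 hex characters


def inject(headers: Dict[str, str], trace_id: str, span_id: str, sampled: bool = True) -> Dict[str, str]:
    """Add a traceparent header to an outgoing request's headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return headers


def extract(headers: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Return (trace_id, parent_span_id) from incoming headers, or None if absent or invalid."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_span_id, _flags = match.groups()
    if set(trace_id) == {"0"} or set(parent_span_id) == {"0"}:
        return None  # all-zero IDs are invalid
    return trace_id, parent_span_id


# Caller side: attach the current trace context to the outgoing request.
outgoing = inject({}, new_trace_id(), new_span_id())
# Callee side: continue the same trace, using the caller's span as parent.
print(extract(outgoing))
```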
Several platforms specialize in agent observability. "5 best AI agent observability tools for agent reliability in 2026" provides a comprehensive comparison, covering platforms like Braintrust, Vellum, Fiddler, Helicone, and Galileo.
When evaluating a tool, look for:
"Best AI Observability Tools for Autonomous Agents in 2026" offers another detailed evaluation, focusing on proxy vs. SDK architectures and lifecycle tracking.
For those running on open-source stacks, "Tracing Agents with Helicone" provides a solid example of how to instrument agents without proprietary tooling.
Logging only errors means you have no baseline for success. You can't calculate error rates, latency percentiles, or cost per operation. Log successes too, at least a sample of them.
If your agents call external services and those services don't receive trace IDs, you lose visibility. Always propagate trace IDs in HTTP headers or message metadata.
If you only log tool outputs, you can't debug why a tool behaved unexpectedly. Log inputs too (being careful with sensitive data).
Some spans are more important than others. Use span tags or attributes to mark critical paths, slow operations, or error cases. This lets you filter and alert on what matters.
When you deploy a new agent version, you need to know which version ran which operation. Always log agent version alongside agent ID.
If an agent run takes 30 seconds, you need to know: was it 25 seconds waiting for an API, 3 seconds in LLM calls, and 2 seconds in decision-making? Use span timing to attribute latency to specific components.
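As a sketch of that attribution, the snippet below sums span durations by component and reports each component's share of the total. The component labels and durations are invented to match the 30-second example, and the arithmetic assumes spans do not overlap in time.

```python
from collections import defaultdict
from typing import Any, Dict, List

# Invented spans matching the example: 25 s API wait, 3 s LLM calls, 2 s decision-making.
spans: List[Dict[str, Any]] = [
    {"component": "external_api", "duration_s": 25.0},
    {"component": "llm_call", "duration_s": 1.5},
    {"component": "llm_call", "duration_s": 1.5},
    {"component": "decision", "duration_s": 2.0},
]


def latency_by_component(spans: List[Dict[str, Any]]) -> Dict[str, float]:
    """Total seconds spent in each component."""
    totals: Dict[str, float] = defaultdict(float)
    for s in spans:
        totals[s["component"]] += s["duration_s"]
    return dict(totals)


def shares(totals: Dict[str, float]) -> Dict[str, float]:
    """Each component's percentage of total time, rounded to one decimal place."""
    total = sum(totals.values())
    return {name: round(100 * seconds / total, 1) for name, seconds in totals.items()}


totals = latency_by_component(spans)
print(totals)
print(shares(totals))
```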
Once you have comprehensive observability data, you can use it for analytics. This is where platforms like D23 become valuable—you can ingest your observability data and build dashboards to understand agent behavior at scale.
For example, you might build dashboards showing:
With D23's embedded analytics capabilities, you can also embed these dashboards into your product, letting users see how their agents are performing without leaving your application.
The key is that observability data is only valuable if you can analyze it. Invest in both logging infrastructure and analytics infrastructure.
Agent observability has compliance and security implications:
If you're subject to GDPR, HIPAA, or other regulations, your logs might contain regulated data. Ensure your logging platform stores data in compliant jurisdictions.
Different regulations require different retention periods. Debug logs might be kept for 30 days, but audit logs might need to be kept for 7 years. Configure retention policies accordingly.
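A retention policy can be expressed as a small lookup table that a cleanup job consults. The sketch below uses the 30-day and 7-year figures from the paragraph above purely as example values; the names RETENTION and is_expired are hypothetical, and the 7 years are approximated as 7 × 365 days.

```python
from datetime import datetime, timedelta, timezone

# Example retention periods only; use the periods your regulations actually require.
RETENTION = {
    "debug": timedelta(days=30),
    "audit": timedelta(days=7 * 365),  # roughly 7 years, ignoring leap days
}


def is_expired(log_type: str, created_at: datetime, now: datetime) -> bool:
    """True if a log of this type is older than its retention period."""
    return now - created_at > RETENTION[log_type]


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
created = datetime(2024, 11, 1, tzinfo=timezone.utc)  # 61 days before "now"

print(is_expired("debug", created, now))
print(is_expired("audit", created, now))
```

Keeping the policy in one table makes it easy to review with compliance stakeholders and to apply consistently across storage backends.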
Not everyone should be able to read logs. Implement role-based access control so only authorized personnel can view sensitive logs.
If you're logging for compliance, you need to log who accessed the logs. This creates an audit trail of the audit trail.
Under GDPR, you should only log data you actually need. Don't log full prompts and completions if you only need token counts. Don't log full user records if you only need user IDs.
Agent observability is still maturing. Expect to see:
For now, the best approach is to build observability into your agents from day one. "Lanai Introduces AI Observability Agent" shows how specialized agents can even help monitor other agents.
If you're building multi-agent systems, here's what to do:
Agent observability is not a one-time project; it's an ongoing practice. As your agents get more complex and handle more critical workloads, your observability needs will grow. Build with that in mind from the start.
For organizations using Apache Superset and looking to track agent performance across their analytics stack, D23's managed Superset platform provides the infrastructure to ingest observability data and build dashboards without managing your own BI infrastructure. You can focus on building better agents while D23 handles the analytics layer.
The bottom line: when your AI calls other AIs, you need to see every call, understand every decision, and track every failure. That's what agent observability gives you.