
AI Analytics | 18 Apr 2026

Agent Observability: What to Log When Your AI Calls Other AIs

Master agent observability: learn what to log when AI systems call other AIs. Spans, traces, audit logs, and debugging strategies for multi-agent systems.

D23 Team

14-minute read

Understanding Agent Observability in Multi-Agent Systems

When you deploy AI agents into production, you're no longer dealing with a single model making isolated predictions. You're building systems where agents call APIs, invoke other agents, make decisions based on tool outputs, and chain multiple steps together. This complexity creates a fundamental observability challenge: how do you see what's happening when your AI calls other AIs?

Agent observability is the practice of instrumenting your AI systems to capture, correlate, and analyze the full execution flow of multi-step agent workflows. Unlike traditional application observability—which focuses on request latency, error rates, and resource utilization—agent observability requires visibility into the decision-making process itself: what did the agent see, what did it decide, what tools did it invoke, and why?

This is critical because agent failures are often silent. A model might call a tool incorrectly, get a result it didn't expect, and then hallucinate a response without any obvious error signal. Or an agent might call another agent, which calls a third system, and the failure point becomes buried three layers deep in the execution trace. Without proper observability, you're flying blind.

The Core Observability Stack: Spans, Traces, and Audit Logs

Agent observability rests on three interconnected concepts that form the foundation of what you need to log:

Spans: The Atomic Unit of Work

A span is a single, measurable unit of work within your agent system. It has a start time, an end time, a name, and attributes that describe what happened. When your agent calls an LLM, that's a span. When it invokes a tool, that's another span. When it retrieves data from a database, that's another span.

Each span should capture:

Timing data: when it started, when it ended, and how long it took
Input and output: what went in, what came out
Status: did it succeed, fail, or get throttled?
Metadata: which agent, which model, which tool, which user context?

Spans are lightweight and granular. You might have dozens or hundreds of spans in a single agent execution. The goal is to make each one queryable and analyzable so you can pinpoint exactly where latency comes from or where a decision went wrong.
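
To make the span fields above concrete, here is a minimal sketch of a span record and a context manager that fills it in. The names `Span`, `start_span`, and `FINISHED_SPANS` are hypothetical and only illustrate the idea; a real tracing framework provides its own equivalents.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Optional

FINISHED_SPANS = []  # stand-in for a real exporter or tracing backend


@dataclass
class Span:
    name: str
    trace_id: str
    parent_span_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    started_at: float = 0.0   # wall-clock start, Unix seconds
    duration_ms: float = 0.0
    status: str = "unset"     # success, error, throttled, ...
    input: Any = None
    output: Any = None
    attributes: dict = field(default_factory=dict)  # agent, model, tool, user


@contextmanager
def start_span(name, trace_id, parent_span_id=None, **attributes):
    """Time a block of work and record how it ended."""
    span = Span(name, trace_id, parent_span_id, attributes=attributes)
    span.started_at = time.time()
    clock_start = time.perf_counter()
    try:
        yield span
        if span.status == "unset":
            span.status = "success"
    except Exception as exc:
        span.status = "error"
        span.attributes["error"] = repr(exc)
        raise
    finally:
        span.duration_ms = (time.perf_counter() - clock_start) * 1000.0
        FINISHED_SPANS.append(span)


with start_span("tool.web_search", "trace-1", agent="research_agent") as span:
    span.input = {"query": "agent observability"}
    span.output = {"hits": 3}
```

The body of the `with` block can set `span.status = "throttled"` itself; the context manager only fills in "success" when nothing else was recorded.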

Traces: Connecting the Dots Across Agent Calls

A trace is a collection of related spans that together represent one complete execution of your agent workflow. A trace has a unique ID that propagates through all the spans it contains, creating a causal chain: agent A calls tool B, which calls API C, which triggers agent D. All of those spans share the same trace ID, allowing you to reconstruct the entire execution path from a single entry point.

Traces answer the question: "What happened during this agent run?" They let you see:

The sequence of decisions and tool calls
Where time was spent (which span took 5 seconds?)
Which spans failed and which succeeded
How data flowed from one component to the next
Whether the agent got stuck in a loop or made progress

Without trace IDs, you lose the ability to correlate events across your system. A log entry from agent A and a log entry from agent B might be related, but without a shared trace ID, you have no way to know. With traces, correlation is automatic.
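
As a rough illustration, the sketch below takes a flat list of span records and rebuilds the execution path for one trace ID. The record layout and the `render_trace` helper are assumptions made for this example, not the format of any particular tool.

```python
from collections import defaultdict

# Hypothetical flat span records, as they might arrive from several agents.
spans = [
    {"trace_id": "t1", "span_id": "a", "parent_span_id": None, "name": "research_agent.run"},
    {"trace_id": "t1", "span_id": "b", "parent_span_id": "a", "name": "tool.web_search"},
    {"trace_id": "t1", "span_id": "c", "parent_span_id": "a", "name": "summarization_agent.run"},
    {"trace_id": "t1", "span_id": "d", "parent_span_id": "c", "name": "fact_check_agent.run"},
    {"trace_id": "t2", "span_id": "e", "parent_span_id": None, "name": "research_agent.run"},
]


def render_trace(all_spans, trace_id):
    """Return one trace as indented lines, parents before their children."""
    children = defaultdict(list)
    for span in all_spans:
        if span["trace_id"] == trace_id:
            children[span["parent_span_id"]].append(span)

    lines = []

    def walk(parent_id, depth):
        for span in children.get(parent_id, []):
            lines.append("  " * depth + span["name"])
            walk(span["span_id"], depth + 1)

    walk(None, 0)  # root spans have no parent
    return lines


print("\n".join(render_trace(spans, "t1")))
```

The span from trace `t2` never appears in the output, which is the point: the shared trace ID is what separates one run from another.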

Audit Logs: The Legal and Operational Record

Audit logs are a different beast. They're not about debugging; they're about compliance, accountability, and forensics. An audit log records what happened and who made it happen in a way that's tamper-evident and legally defensible.

For agents, audit logs should capture:

Agent identity: which agent, which version
User context: who triggered this agent, what permissions did they have
Tool invocations: which tools were called, with what parameters
Data access: what data did the agent read or write
Decisions: what decisions did the agent make, and what was the reasoning
Timestamps: immutable record of when events occurred
Signatures or hashes: proof that the log hasn't been tampered with

Audit logs are essential for compliance (HIPAA, SOX, GDPR), for security investigations, and for understanding agent behavior in regulated environments. Unlike debug logs, which you can sample or truncate, audit logs must be complete and retained for as long as the applicable regulation requires.
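
One common way to get the "hashes" item from the list above is a hash chain, where each entry's hash covers the previous entry's hash. The sketch below is a simplified illustration with hypothetical helper names (`append_entry`, `verify_chain`). On its own a chain only detects edits made by someone who cannot rewrite the whole log, so in practice you would also sign entries or anchor the latest hash somewhere external.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(prev_hash, event):
    payload = json.dumps(event, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(log, event):
    """Append an event whose hash also covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS_HASH
    log.append({"event": event, "prev_hash": prev_hash,
                "hash": _entry_hash(prev_hash, event)})


def verify_chain(log):
    """Return True only if no entry was altered, removed, or reordered."""
    prev_hash = GENESIS_HASH
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


audit_log = []
append_entry(audit_log, {"agent": "research_agent", "version": "1.4.0",
                         "user": "user_456", "action": "tool_call",
                         "tool": "database_query"})
append_entry(audit_log, {"agent": "research_agent", "version": "1.4.0",
                         "user": "user_456", "action": "data_read",
                         "table": "users"})
print(verify_chain(audit_log))
```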

The Challenge: Multi-Agent Observability at Scale

Observability gets much harder when agents call other agents, because every extra hop adds another place where time and errors can hide. Consider this scenario:

You have a "research agent" that gathers information. It calls a "summarization agent" to condense findings. The summarization agent calls a "fact-check agent" to verify claims. Each agent might call multiple tools, each tool might call an API, and each API might be rate-limited or slow.

Now, a user reports that their research took 45 seconds when it should take 5. Where did the time go?

Was it the research agent's decision-making loop?
Was it the summarization agent's LLM call?
Was it the fact-check agent waiting for an external API?
Was it network latency between agents?
Was it the database query that the research agent ran?

Without proper instrumentation, you're guessing. With proper observability, you can see the entire trace, identify that the fact-check agent spent 40 seconds waiting for an external API response, and know exactly where to optimize.

Multi-agent observability requires:

Propagating context: Every agent must pass trace IDs and parent span IDs to downstream agents so the chain remains connected
Capturing inter-agent calls: When agent A calls agent B, that call itself must be a span, with timing and status
Handling concurrency: Agents might call multiple tools or other agents in parallel; your observability system must track concurrent spans correctly
Correlating across boundaries: If agent A calls agent B via an API, and agent B writes to a database that agent C reads, you need to be able to trace that causal chain
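
The first three requirements can be sketched inside a single process with `contextvars` and `asyncio`. Everything here (`traced`, `RECORDED`, the agent functions) is a hypothetical stand-in; calls that cross a process or network boundary additionally need the IDs carried in headers or message metadata.

```python
import asyncio
import uuid
from contextvars import ContextVar

trace_id_var = ContextVar("trace_id", default=None)
span_id_var = ContextVar("span_id", default=None)
RECORDED = []  # stand-in for a span exporter


async def traced(name, work):
    """Run the coroutine function `work` as a child of the current span."""
    span_id = uuid.uuid4().hex[:16]
    parent_id = span_id_var.get()
    token = span_id_var.set(span_id)
    status = "error"  # overwritten only if `work` finishes normally
    try:
        result = await work()
        status = "success"
        return result
    finally:
        span_id_var.reset(token)
        RECORDED.append({"name": name, "trace_id": trace_id_var.get(),
                         "span_id": span_id, "parent_span_id": parent_id,
                         "status": status})


async def fact_check_agent():
    await asyncio.sleep(0.01)  # pretend to wait on an external API
    return "verified"


async def citation_lookup():
    await asyncio.sleep(0.01)
    return ["source-1"]


async def summarization_agent():
    # Two downstream calls run in parallel; each gets its own span.
    return await asyncio.gather(
        traced("fact_check_agent", fact_check_agent),
        traced("tool.citation_lookup", citation_lookup),
    )


async def research_agent():
    return await traced("summarization_agent", summarization_agent)


async def main():
    trace_id_var.set(uuid.uuid4().hex)  # one trace ID for the whole run
    return await traced("research_agent", research_agent)


asyncio.run(main())
```

`asyncio.gather` runs each call in its own task with a copy of the current context, so the two parallel spans both see the summarization span as their parent without overwriting each other's span ID.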

What to Log: The Practical Checklist

Here's what you need to instrument in a production agent system:

LLM Calls

Every time your agent calls an LLM, log:

Model name and version: which model, which version, which provider (OpenAI, Anthropic, local, etc.)
Prompt and completion: the full prompt sent and the full response received (be careful with sensitive data)
Token counts: input tokens, output tokens, total tokens (critical for cost tracking and understanding model behavior)
Latency: how long the API call took
Temperature and other parameters: what sampling parameters were used
Cost: how much did this call cost (increasingly important as you scale)
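Model reasoning: if the provider returns reasoning output (as with Claude's extended thinking), capture that separately; for models such as o1 that keep their raw reasoning hidden, log the reasoning token count instead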
Stop reason: did the model finish naturally, hit the max tokens, or get stopped for another reason?

This data is essential for understanding model behavior, tracking costs, and debugging why an agent made a particular decision.
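
A minimal sketch of such a record follows. The class name, the model name, and the prices in `PRICES_PER_MILLION` are made up for illustration; substitute your provider's actual rates and field names.

```python
from dataclasses import asdict, dataclass
from typing import Optional

# Hypothetical prices in USD per 1M tokens; not real provider rates.
PRICES_PER_MILLION = {"example-model-v1": {"input": 3.00, "output": 15.00}}


@dataclass
class LLMCallRecord:
    provider: str
    model: str
    prompt: str
    completion: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    temperature: float
    stop_reason: str                 # e.g. "end_turn", "max_tokens"
    reasoning: Optional[str] = None  # kept separate from the completion

    @property
    def total_tokens(self):
        return self.input_tokens + self.output_tokens

    @property
    def cost_usd(self):
        price = PRICES_PER_MILLION[self.model]
        return (self.input_tokens * price["input"]
                + self.output_tokens * price["output"]) / 1_000_000

    def to_log(self):
        """Flatten to a dict that can be emitted as one structured log entry."""
        entry = asdict(self)
        entry["event_type"] = "llm_call"
        entry["total_tokens"] = self.total_tokens
        entry["cost_usd"] = round(self.cost_usd, 6)
        return entry


record = LLMCallRecord(
    provider="example-provider", model="example-model-v1",
    prompt="Summarize the findings.", completion="Here is a summary.",
    input_tokens=1200, output_tokens=300, latency_ms=842.0,
    temperature=0.2, stop_reason="end_turn",
)
print(record.to_log())
```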

Tool Invocations

When your agent calls a tool (database query, API, calculator, web search, etc.), log:

Tool name and version: which tool, which version
Input parameters: what arguments were passed to the tool
Output: what did the tool return
Status: did it succeed, fail, get rate-limited, time out?
Latency: how long did the tool take
Error details: if it failed, what was the error message and stack trace?
Resource usage: if the tool consumed resources (compute, memory, API quota), track that
Data sensitivity: if the tool accessed sensitive data, note that for compliance

Tools are where agents interact with the real world, so this is where failures often originate. Comprehensive tool logging lets you distinguish between tool failures and agent decision failures.
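
One way to capture these fields without touching each tool's body is a decorator. This is a sketch under stated assumptions: `logged_tool` and `TOOL_LOG` are hypothetical names, and a real system would send the record to its tracing backend instead of a list.

```python
import functools
import time
import traceback

TOOL_LOG = []  # stand-in for a log sink


def logged_tool(name, version="0.0.0"):
    """Wrap a tool function so every call produces one structured record."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            record = {"event_type": "tool_call", "tool_name": name,
                      "tool_version": version,
                      "input": {"args": args, "kwargs": kwargs}}
            start = time.perf_counter()
            try:
                record["output"] = func(*args, **kwargs)
                record["status"] = "success"
                return record["output"]
            except Exception as exc:
                # Timeouts get their own status so they can be counted separately.
                record["status"] = "timeout" if isinstance(exc, TimeoutError) else "error"
                record["error"] = {"type": type(exc).__name__,
                                   "message": str(exc),
                                   "stack_trace": traceback.format_exc()}
                raise
            finally:
                record["latency_ms"] = (time.perf_counter() - start) * 1000.0
                TOOL_LOG.append(record)
        return wrapper
    return decorator


@logged_tool("calculator.divide", version="1.2.3")
def divide(a, b):
    return a / b


divide(10, 4)
try:
    divide(1, 0)
except ZeroDivisionError:
    pass  # the agent would decide how to recover here
```

Because inputs are logged even when the call fails, the second record shows that the tool was handed a zero divisor, which points at the agent's decision and not at the tool.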

Agent Decisions and Reasoning

When your agent makes a decision (which tool to call next, how to interpret a tool result, when to stop), log:

Decision point: what was the agent deciding?
Available options: what choices did the agent have?
Decision made: which option did it choose?
Reasoning: why did it make that choice? (If your agent outputs reasoning, capture it)
Confidence: how confident was the agent in this decision? (If the model provides a confidence score)
Alternatives considered: what else could it have done?

This is the "why" behind agent behavior. When an agent makes a mistake, understanding its reasoning helps you decide whether to retrain, add guardrails, or improve the tool set.

Inter-Agent Communication

When one agent calls another, log:

Caller agent: which agent initiated the call
Callee agent: which agent was called
Request: what did the caller ask the callee to do?
Response: what did the callee return?
Latency: how long did the inter-agent call take?
Status: did the call succeed or fail?
Trace ID propagation: ensure the trace ID is passed and received correctly

Inter-agent calls are where complexity explodes. Logging them properly is how you avoid getting lost in the call stack.

Error and Exception Handling

When anything fails, log:

Error type: what kind of error (API error, timeout, invalid input, etc.)
Error message: the actual error text
Stack trace: full stack trace if available
Context: what was the agent doing when it failed?
Recovery attempt: did the agent retry, fall back to another tool, or escalate?
Impact: did this error cause the overall agent run to fail or just a sub-task?

Error logging is where most observability systems fall short. Errors are where you learn the most, so log them comprehensively.
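
The sketch below shows one way to record the error fields above together with the recovery attempt. The helper `call_with_fallback` and the retry policy are assumptions chosen for illustration, not a recommended policy.

```python
import traceback

ERROR_LOG = []  # stand-in for a log sink


def call_with_fallback(primary, fallback, context, max_attempts=2):
    """Try `primary` up to `max_attempts` times, then use `fallback`.

    Every failure is logged with what the agent was doing and what it
    did next, so retries and fallbacks are visible afterwards.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return primary()
        except Exception as exc:
            will_retry = attempt < max_attempts
            ERROR_LOG.append({
                "error_type": type(exc).__name__,
                "error_message": str(exc),
                "stack_trace": traceback.format_exc(),
                "context": context,
                "attempt": attempt,
                "recovery": "retry" if will_retry else "fallback",
                # The fallback keeps the run alive, so only the sub-task failed.
                "impact": "sub_task",
            })
    return fallback()


def flaky_search():
    raise ConnectionError("search API unreachable")


def cached_search():
    return "cached answer"


result = call_with_fallback(flaky_search, cached_search,
                            context={"agent": "research_agent",
                                     "step": "gather_sources"})
print(result, [entry["recovery"] for entry in ERROR_LOG])
```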

User and Context Information

For every agent execution, capture:

User ID: who triggered this agent
Session ID: what broader session is this part of
Request ID: unique identifier for this specific request
Timestamp: when did this start
Environment: production, staging, development
Feature flags: what feature flags were enabled
Permissions: what was the user allowed to do

This context is essential for security, compliance, and debugging user-specific issues.

Implementing Observability: From Theory to Code

Knowing what to log is one thing; actually implementing it is another. Here's how to approach it:

Use a Tracing Framework

Don't build your own tracing system. Use an existing framework. For Python, LangChain's built-in debugging and tracing capabilities provide a solid foundation. For broader observability, Phoenix, an open-source observability tool for AI agents, offers a comprehensive tracing system designed specifically for agent workloads.

A good tracing framework should:

Automatically generate and propagate trace IDs
Provide decorators or context managers to define spans
Capture timing automatically
Handle concurrent spans correctly
Export traces to a backend for storage and analysis

Instrument at the Right Granularity

You don't need to log every single operation. Log at the level of:

Agent execution: one span per agent run
Tool calls: one span per tool invocation
LLM calls: one span per LLM API call
Sub-agent calls: one span per call to another agent
Decision points: one span per major decision

Avoid logging individual token processing or internal model computations unless you're debugging a specific issue.

Use Structured Logging

Log structured data (JSON, key-value pairs), not free-form text. Structured logs are:

Queryable: you can filter by any field
Parseable: tools can automatically extract data
Comparable: you can aggregate and analyze logs across runs

A structured log entry might look like:

{
  "timestamp": "2024-01-15T10:23:45.123Z",
  "trace_id": "abc123def456",
  "span_id": "xyz789",
  "parent_span_id": "xyz788",
  "event_type": "tool_call",
  "tool_name": "database_query",
  "tool_version": "1.2.3",
  "input": {"query": "SELECT * FROM users WHERE id = ?"},
  "output": {"rows": 1, "data": {...}},
  "status": "success",
  "latency_ms": 145,
  "user_id": "user_456",
  "agent_id": "research_agent",
  "environment": "production"
}
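
An entry like this can be produced with Python's standard `logging` module and a small JSON formatter. This is a minimal sketch: the `JsonFormatter` class and the `fields` convention for passing structured data are assumptions of this example, and the in-memory stream is there only so the output can be inspected; in practice you would write to stdout or a file.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "event_type": record.getMessage(),
        }
        # Structured fields are passed via `extra={"fields": {...}}`.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry, default=str)


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("agent.structured")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example's output in one place

logger.info("tool_call", extra={"fields": {
    "trace_id": "abc123def456",
    "span_id": "xyz789",
    "tool_name": "database_query",
    "status": "success",
    "latency_ms": 145,
    "agent_id": "research_agent",
}})
print(stream.getvalue())
```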

Separate Debug Logs from Audit Logs

Debug logs can be sampled, truncated, or deleted. Audit logs must be complete and retained for the full period your regulations require. Use different log levels and storage backends:

Debug logs: sent to a log aggregation service (Datadog, Splunk, etc.), sampled in production, kept for 30 days
Audit logs: written to immutable storage (cloud object storage, append-only database), cryptographically signed, kept for years

Handle Sensitive Data

Agent logs often contain sensitive information: user data, API keys, proprietary queries, financial information. You need to:

Identify sensitive fields: which fields in your logs might contain sensitive data?
Mask or redact: before logging, mask API keys, PII, financial data, etc.
Encrypt in transit and at rest: logs should be encrypted when transmitted and stored
Restrict access: only authorized personnel should be able to read logs
Comply with regulations: GDPR gives individuals the right to have their personal data erased in many circumstances; ensure your logging system can find and delete a given person's data

For agent logging and audit trails that have to serve both debugging and compliance, this is non-negotiable.
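
The "mask or redact" step can be sketched as a function that runs over every record before it is logged. The key list and the email pattern below are illustrative assumptions and far from exhaustive; a production system needs a reviewed list of sensitive fields and usually a dedicated PII detection step.

```python
import re

# Hypothetical list of field names whose values must never be logged.
SENSITIVE_KEYS = {"api_key", "password", "authorization", "ssn"}
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(value, key=None):
    """Return a copy of `value` with sensitive fields and emails masked."""
    if key is not None and str(key).lower() in SENSITIVE_KEYS:
        return "[REDACTED]"
    if isinstance(value, dict):
        return {k: redact(v, k) for k, v in value.items()}
    if isinstance(value, list):
        return [redact(item) for item in value]
    if isinstance(value, str):
        return EMAIL_RE.sub("[EMAIL]", value)
    return value


entry = {
    "tool_name": "crm_lookup",
    "api_key": "sk-123",
    "input": {"prompt": "Email jane@example.com about the invoice"},
    "latency_ms": 145,
}
print(redact(entry))
```

The function returns a new structure and leaves the original untouched, so the agent can keep using the real values while only the masked copy reaches the log.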

Real-World Observability Patterns

Here are patterns that work in production:

Pattern 1: Request-Scoped Tracing

When a user makes a request, create a trace ID and inject it into all downstream operations. If the user reports an issue, you can query by user ID and request ID to see the entire trace.

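import uuid
from contextvars import ContextVar

# Holds the trace ID for the current request. A ContextVar keeps the value
# isolated per thread and per asyncio task.
trace_id = ContextVar("trace_id", default=None)

def run_agent(user_id, request):
    # Placeholder for your agent entry point. It can read trace_id.get()
    # without the ID being passed as an argument.
    return {"trace_id": trace_id.get(), "user_id": user_id, "request": request}

def handle_user_request(user_id, request):
    # Set once at the entry point; all downstream operations in this
    # context can then call trace_id.get().
    trace_id.set(str(uuid.uuid4()))
    return run_agent(user_id, request)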

Every log entry should include the trace ID, making it trivial to reconstruct the entire user journey.
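
One way to get the trace ID into every log entry without passing it around is a `logging.Filter` that reads it from a context variable. This is a sketch with hypothetical names (`trace_id_var`, `TraceIdFilter`); the in-memory stream is used only so the result can be checked.

```python
import io
import logging
import uuid
from contextvars import ContextVar

trace_id_var = ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every record that passes through."""

    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True  # never drop the record, only annotate it


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(trace_id)s %(levelname)s %(message)s"))
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("agent.trace")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False


def handle_user_request(user_id, request):
    trace_id_var.set(uuid.uuid4().hex)
    logger.info("request received user=%s", user_id)
    # ... the agent would run here; its log calls need no trace argument ...
    logger.info("agent finished")
    return trace_id_var.get()


request_trace_id = handle_user_request("user_456", "summarize Q3 sales")
print(stream.getvalue())
```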

Pattern 2: Sampling for Cost Control

Logging everything can be expensive. Use sampling to control costs:

Log 100% of errors
Log 100% of slow operations (latency > threshold)
Log 10% of successful, fast operations
Log 100% of operations for specific users or feature flags

This gives you visibility into problems while keeping costs manageable.
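
These rules fit in a single decision function. In the sketch below, the 2000 ms threshold and the `should_log` name are assumptions, and the random source is injectable so the behavior can be tested deterministically.

```python
import random


def should_log(record, slow_ms=2000, sample_rate=0.10,
               always_log_users=frozenset(), rng=random.random):
    """Decide whether to keep a log record under the sampling rules."""
    if record.get("status") != "success":
        return True   # keep 100% of errors, timeouts, and throttles
    if record.get("latency_ms", 0) > slow_ms:
        return True   # keep 100% of slow operations
    if record.get("user_id") in always_log_users:
        return True   # keep 100% for selected users
    return rng() < sample_rate  # keep a fraction of the rest


fast_ok = {"status": "success", "latency_ms": 120, "user_id": "user_1"}
print(should_log({"status": "error", "latency_ms": 50}))
print(should_log(fast_ok, always_log_users={"user_1"}))
```

If you sample per record, a trace can end up with some spans kept and others dropped. Making the decision once per trace and reusing it for every span in that trace avoids partial traces.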

Pattern 3: Metric Extraction from Logs

From your logs, extract key metrics:

Agent latency percentiles: p50, p95, p99 latency for each agent
Tool success rates: what percentage of tool calls succeed?
Error rates: what percentage of operations fail?
Cost per operation: how much do different agents cost to run?
Token usage: how many tokens are different agents consuming?

These metrics should be tracked separately from logs, in a metrics system such as Prometheus or Datadog. This lets you set alerts and dashboards without querying raw logs.
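
As a rough sketch of the extraction step, the function below computes per-agent percentiles, error rate, and token totals from a batch of log records using only the standard library. The record fields (`agent_id`, `latency_ms`, `status`, `tokens`) are assumed for this example.

```python
import statistics
from collections import defaultdict


def summarize(records):
    """Aggregate log records into per-agent metrics."""
    by_agent = defaultdict(list)
    for record in records:
        by_agent[record["agent_id"]].append(record)

    summary = {}
    for agent, rows in by_agent.items():
        latencies = [row["latency_ms"] for row in rows]
        if len(latencies) >= 2:
            # 99 cut points: index 49 is p50, 94 is p95, 98 is p99.
            cuts = statistics.quantiles(latencies, n=100, method="inclusive")
        else:
            cuts = [latencies[0]] * 99
        failures = sum(row["status"] != "success" for row in rows)
        summary[agent] = {
            "p50_ms": cuts[49],
            "p95_ms": cuts[94],
            "p99_ms": cuts[98],
            "error_rate": failures / len(rows),
            "total_tokens": sum(row.get("tokens", 0) for row in rows),
        }
    return summary


records = [
    {"agent_id": "research_agent", "latency_ms": ms, "status": status, "tokens": 10}
    for ms, status in [(100, "success"), (200, "success"), (300, "success"),
                       (400, "success"), (500, "error")]
]
print(summarize(records))
```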

Pattern 4: Correlation Across Services

If your agents call external services (databases, APIs, other microservices), propagate trace IDs to those services. Services that support a trace-context standard such as W3C Trace Context will include the trace ID in their own logs, creating a unified trace across your entire system.
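
A widely used carrier for this is the W3C Trace Context `traceparent` HTTP header, which packs a version, a 32-hex-digit trace ID, a 16-hex-digit parent span ID, and flags into one value. The sketch below builds and parses that header by hand to show the shape; the `inject` and `extract` helpers are hypothetical, validation is simplified (the specification also rejects all-zero IDs, for instance), and a tracing library would normally do this for you.

```python
import re
import secrets

# version 00 - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(headers, trace_id, span_id, sampled=True):
    """Caller side: add the trace context to outgoing request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return headers


def extract(headers):
    """Callee side: read the trace context, or None if absent or malformed."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_span_id, flags = match.groups()
    return {"trace_id": trace_id,
            "parent_span_id": parent_span_id,
            "sampled": bool(int(flags, 16) & 1)}


# Agent A calls agent B over HTTP and passes its context along.
trace_id = secrets.token_hex(16)  # 32 hex characters
span_id = secrets.token_hex(8)    # 16 hex characters
outgoing = inject({"content-type": "application/json"}, trace_id, span_id)

# Agent B continues the same trace, with A's span as the parent.
incoming = extract(outgoing)
print(incoming)
```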

Tools and Platforms for Agent Observability

Several platforms specialize in agent observability. "5 best AI agent observability tools for agent reliability in 2026" provides a comprehensive comparison, covering platforms like Braintrust, Vellum, Fiddler, Helicone, and Galileo.

When evaluating a tool, look for:

Trace capture: does it automatically capture LLM calls, tool invocations, and inter-agent calls?
Latency analysis: can you see where time is spent in your agent workflows?
Error tracking: does it surface errors and exceptions clearly?
Cost tracking: can you see how much each agent costs to run?
Compliance features: can you generate audit logs, redact sensitive data, and prove compliance?
Integration: does it work with your existing tools and frameworks?

"Best AI Observability Tools for Autonomous Agents in 2026" offers another detailed evaluation, focusing on proxy vs. SDK architectures and lifecycle tracking.

For those running on open-source stacks, "Tracing Agents with Helicone" provides a solid example of how to instrument agents without proprietary tooling.

Common Observability Mistakes to Avoid

Mistake 1: Logging Only Failures

Logging only errors means you have no baseline for success. You can't calculate error rates, latency percentiles, or cost per operation. Log successes too, at least a sample of them.

Mistake 2: Losing Context Across Service Boundaries

If your agents call external services and those services don't receive trace IDs, you lose visibility. Always propagate trace IDs in HTTP headers or message metadata.

Mistake 3: Forgetting to Log Tool Inputs

If you only log tool outputs, you can't debug why a tool behaved unexpectedly. Log inputs too (being careful with sensitive data).

Mistake 4: Treating All Spans Equally

Some spans are more important than others. Use span tags or attributes to mark critical paths, slow operations, or error cases. This lets you filter and alert on what matters.

Mistake 5: Not Versioning Agents

When you deploy a new agent version, you need to know which version ran which operation. Always log agent version alongside agent ID.

Mistake 6: Ignoring Latency Attribution

If an agent run takes 30 seconds, you need to know: was it 25 seconds waiting for an API, 3 seconds in LLM calls, and 2 seconds in decision-making? Use span timing to attribute latency to specific components.
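
The arithmetic behind that breakdown can be sketched as follows. The span layout and the `attribute_latency` helper are assumptions, and the calculation only holds when the child spans run one after another; with parallel children you would need to account for overlap.

```python
from collections import Counter


def attribute_latency(root_ms, child_spans):
    """Split a run's total time across span kinds.

    Time not covered by any child span is attributed to the agent's own
    decision-making. Assumes child spans do not overlap.
    """
    totals = Counter()
    for span in child_spans:
        totals[span["kind"]] += span["duration_ms"]
    totals["agent_decision_making"] = root_ms - sum(totals.values())
    return {kind: {"ms": ms, "share": round(ms / root_ms, 3)}
            for kind, ms in totals.items()}


child_spans = [
    {"name": "tool.external_api", "kind": "api_wait", "duration_ms": 25000},
    {"name": "llm.plan", "kind": "llm", "duration_ms": 2000},
    {"name": "llm.answer", "kind": "llm", "duration_ms": 1000},
]
breakdown = attribute_latency(30000, child_spans)
for kind, stats in breakdown.items():
    print(kind, stats)
```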

Connecting Observability to Data Analytics

Once you have comprehensive observability data, you can use it for analytics. This is where platforms like D23 become valuable—you can ingest your observability data and build dashboards to understand agent behavior at scale.

For example, you might build dashboards showing:

Agent performance: latency, error rates, cost for each agent type
Tool reliability: which tools fail most often, which are slowest
User impact: which users experience the slowest agents, which hit errors most often
Trend analysis: are agents getting faster or slower over time
Cost optimization: which agents cost the most to run, where can you optimize

With D23's embedded analytics capabilities, you can also embed these dashboards into your product, letting users see how their agents are performing without leaving your application.

The key is that observability data is only valuable if you can analyze it. Invest in both logging infrastructure and analytics infrastructure.

Compliance and Security Considerations

Agent observability has compliance and security implications:

Data Residency

If you're subject to GDPR, HIPAA, or other regulations, your logs might contain regulated data. Ensure your logging platform stores data in compliant jurisdictions.

Data Retention

Different regulations require different retention periods. Debug logs might be kept for 30 days, but audit logs might need to be kept for 7 years. Configure retention policies accordingly.

Access Control

Not everyone should be able to read logs. Implement role-based access control so only authorized personnel can view sensitive logs.

Audit Trail for the Audit Trail

If you're logging for compliance, you need to log who accessed the logs. This creates an audit trail of the audit trail.

Data Minimization

Under GDPR, you should only log data you actually need. Don't log full prompts and completions if you only need token counts. Don't log full user records if you only need user IDs.

The Future of Agent Observability

Agent observability is still maturing. Expect to see:

Automatic instrumentation: frameworks that automatically add observability without explicit logging code
AI-powered debugging: systems that use AI to analyze traces and suggest root causes
Causal inference: tools that can determine not just what happened, but why it happened
Real-time alerting: systems that detect anomalies in agent behavior in real-time
Cost optimization: platforms that automatically identify expensive agents and suggest optimizations

For now, the best approach is to build observability into your agents from day one. "Lanai Introduces AI Observability Agent" shows how specialized agents can even help monitor other agents.

Practical Next Steps

If you're building multi-agent systems, here's what to do:

Choose a tracing framework: Start with LangChain's built-in tracing or Phoenix for more advanced use cases
Define your logging schema: What fields will you log for LLM calls, tool calls, inter-agent calls, and decisions?
Instrument gradually: Start with the critical path (LLM calls and tool calls), then expand to include decision logging
Set up storage and analysis: Choose a log aggregation platform (Datadog, Splunk, etc.) and build dashboards
Implement alerting: Set up alerts for high error rates, high latency, and high costs
Review and iterate: Regularly review your logs to find blind spots and add instrumentation where needed

Agent observability is not a one-time project; it's an ongoing practice. As your agents get more complex and handle more critical workloads, your observability needs will grow. Build with that in mind from the start.

For organizations using Apache Superset and looking to track agent performance across their analytics stack, D23's managed Superset platform provides the infrastructure to ingest observability data and build dashboards without managing your own BI infrastructure. You can focus on building better agents while D23 handles the analytics layer.

The bottom line: when your AI calls other AIs, you need to see every call, understand every decision, and track every failure. That's what agent observability gives you.