
AI Observability Is a Security Requirement, Not a Dashboard

Petru Constantin
6 min read
#mlops#ai-security#observability#agentic-ai


McKinsey Built an AI Agent. It Got Hacked in Two Hours.

On February 28, 2026, security startup CodeWall pointed an autonomous AI agent at McKinsey's internal AI platform Lilli. No credentials. No insider help. No human guidance. Within two hours, the agent had full read-write access to the production database: 46.5 million chat messages about strategy, M&A, and client engagements. 728,000 confidential files. 57,000 user accounts. 95 system prompts.

The root cause was a SQL injection vulnerability on unauthenticated API endpoints. Twenty-two endpoints, no auth, JSON keys concatenated directly into SQL strings. A technique that has existed since the late 1990s.

Here is the part that should worry you: nobody at McKinsey noticed until CodeWall told them. The attack took two hours. Detection never happened at all, because there was nothing in place to detect it.

That is what happens when you deploy AI agents without observability.

Observability Is Not Monitoring

Let me be specific about what I mean. Traditional monitoring tells you that your API returned a 500 error. AI observability tells you that your agent made 14 tool calls across 3 reasoning chains, the 9th call contained a SQL injection payload in the user query field, and the response included data from a table the agent should never have accessed.

These are fundamentally different problems. An HTTP 200 from an AI agent tells you almost nothing. The agent could have hallucinated, leaked PII, executed a prompt injection, or accessed data outside its authorization scope. All while returning a perfectly healthy status code.

Microsoft made this distinction explicit on March 18, 2026 when they reclassified AI observability from an optional diagnostic tool to a mandatory security requirement for enterprise AI systems. Their reasoning: prompt injection attacks, data exfiltration through AI interactions, and unintended agent behaviors create risks that traditional monitoring cannot detect.

This is not Microsoft being cautious. This is Microsoft telling you what they learned from watching the 80% of Fortune 500 companies that run active AI agents.

What AI Agent Observability Actually Looks Like

If you are running agentic AI in production and your observability stack cannot answer these questions, you have a gap:

  1. What tools did the agent call, in what order, with what arguments? Every tool invocation is an attack surface. If your agent can call a database, an API, or a file system, you need to see every call with full argument logging.

  2. What was the full reasoning chain? Agentic systems make multi-step decisions. A prompt injection at step 3 can redirect the agent's behavior at step 7. Without trace-level visibility into the full chain, you cannot reconstruct what happened.

  3. What data did the agent access versus what data it was authorized to access? This is the McKinsey problem. The agent accessed 46.5 million messages because nothing checked whether it should.

  4. What is the token cost and latency per reasoning chain? This is the ops side. An agent stuck in a tool-calling loop will burn through your API budget before anyone notices, unless you have cost alerting on traces.
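Question 4 above can be answered with simple arithmetic over exported spans. Here is a minimal sketch that sums the `gen_ai.usage.*` attributes per reasoning chain and converts them to dollars; the per-token prices are placeholders, so substitute your provider's actual rates:

```python
# Back-of-the-envelope cost per reasoning chain, computed from the
# gen_ai.usage.* attributes on exported spans. Prices are placeholders.
PRICE_PER_1K_INPUT = 0.005    # USD per 1K input tokens (assumed rate)
PRICE_PER_1K_OUTPUT = 0.015   # USD per 1K output tokens (assumed rate)

def trace_cost_usd(spans: list) -> float:
    """Sum token usage across all spans in one trace and price it."""
    input_tokens = sum(s.get("gen_ai.usage.input_tokens", 0) for s in spans)
    output_tokens = sum(s.get("gen_ai.usage.output_tokens", 0) for s in spans)
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT + \
           (output_tokens / 1000) * PRICE_PER_1K_OUTPUT

spans = [{"gen_ai.usage.input_tokens": 847, "gen_ai.usage.output_tokens": 312}]
print(round(trace_cost_usd(spans), 4))  # 0.0089
```

Feed this into an alerting rule (cost per trace above a threshold, or cost per hour trending up) and a runaway tool-calling loop becomes a page instead of a surprise invoice.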

The good news: OpenTelemetry now has semantic conventions for GenAI agent spans. Every conversation turn, LLM call, tool execution, and speaker selection can be captured as structured spans connected by a shared trace ID. Datadog natively supports these conventions as of OTel v1.37. So does New Relic, which launched a dedicated AI agent platform in February.

A Minimal Observability Stack for AI Agents

Here is what a basic setup looks like with OpenTelemetry and Python. This is not production-grade, but it shows the pattern:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Initialize tracer with OTLP exporter
provider = TracerProvider()
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("ai-agent")

def execute_query(sql: str) -> list:
    # Placeholder: wire this to your real database client
    return []

def run_agent_step(user_query: str):
    with tracer.start_as_current_span("agent.run") as span:
        span.set_attribute("gen_ai.system", "openai")
        span.set_attribute("gen_ai.request.model", "gpt-4")
        span.set_attribute("user.query", user_query)

        # Trace each tool call as a child span
        with tracer.start_as_current_span("tool.database_query") as tool_span:
            tool_span.set_attribute("tool.name", "sql_query")
            tool_span.set_attribute("tool.arguments", "SELECT * FROM users LIMIT 10")
            tool_span.set_attribute("tool.authorized", True)
            result = execute_query("SELECT * FROM users LIMIT 10")
            tool_span.set_attribute("tool.result_rows", len(result))

        # Token usage for cost tracking (hardcoded here; read these
        # from the LLM response in a real agent)
        span.set_attribute("gen_ai.usage.input_tokens", 847)
        span.set_attribute("gen_ai.usage.output_tokens", 312)
```

The key attributes: tool.name, tool.arguments, tool.authorized. That last one is what McKinsey was missing. Every tool call needs an authorization check, and every check needs to be logged.
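What does "every check needs to be logged" look like in practice? A minimal sketch, shown without the OTel plumbing so the logic is clear: the returned dict is exactly the attribute set you would put on the tool span before deciding whether to execute. The policy table and role names are illustrative assumptions, not a real API:

```python
# Hypothetical allow-list: which tools each agent role may invoke.
# In a real system this would come from your policy engine or IAM layer.
TOOL_POLICY = {
    "support-agent": {"kb_search", "ticket_lookup"},
    "analyst-agent": {"kb_search", "sql_query"},
}

def authorize_tool_call(role: str, tool_name: str, arguments: str) -> dict:
    """Gate a tool call and return the span attributes to record."""
    authorized = tool_name in TOOL_POLICY.get(role, set())
    return {
        "tool.name": tool_name,
        "tool.arguments": arguments,
        "tool.authorized": authorized,
    }

attrs = authorize_tool_call("support-agent", "sql_query", "SELECT * FROM users")
print(attrs["tool.authorized"])  # False -- record the denial, refuse the call
```

The important design choice: the check happens per call, not per session, and the denial itself is logged. A spike in `tool.authorized = false` spans is often the first visible sign of a prompt injection steering the agent toward tools it should not touch.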

Where the EU AI Act Fits

If you are deploying AI agents in the EU, this is not optional for a different reason. Article 15 of the EU AI Act requires high-risk AI systems to have "appropriate levels of accuracy, robustness and cybersecurity." Article 12 mandates automatic logging capabilities that enable monitoring of the operation of the high-risk AI system.

The connection is direct: if your AI agent processes personal data, makes decisions about people, or operates in an Annex III high-risk category, you need logging that a regulator can audit. OpenTelemetry traces exported to a durable store satisfy the technical documentation requirements under Articles 11 and 12.
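For the durable store, one low-friction option is to tee traces through an OpenTelemetry Collector that writes an audit copy to append-only storage alongside your normal backend. A minimal sketch, assuming the contrib Collector distribution (which ships the `file` exporter) and an illustrative log path:

```yaml
# Minimal Collector config: receive OTLP traces, keep a durable
# JSONL audit copy. Path and pipeline names are assumptions.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
exporters:
  file:
    path: /var/log/otel/agent-traces.jsonl
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [file]
```

Point the file at storage with retention controls and you have a trace archive a regulator can audit without touching your live observability backend.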

The August 2, 2026 enforcement deadline for high-risk systems is less than 5 months away. Whether the Digital Omnibus proposal delays this or not, the core documentation requirements remain identical.

Why Your Current Stack Probably Has Gaps

Most teams I talk to have one of two setups:

Setup A: "We log to CloudWatch." You have application logs. You can see that the agent ran. You cannot see what it decided, what tools it called, what data it accessed, or whether the response was correct. This is the McKinsey setup.

Setup B: "We use LangSmith for dev." You have trace-level visibility in development. You probably turned it off in production because of cost or latency concerns. So your dev environment is observable and your production environment, where the actual attacks happen, is not.

The fix is straightforward: instrument your agent with OpenTelemetry, export traces to whatever backend you already run (Datadog, Grafana, Elastic), and set up alerts on anomalous patterns. An agent that suddenly makes 50 tool calls instead of the usual 5 is either broken or under attack. Either way, you want to know.
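The "50 tool calls instead of 5" alert can be sketched in a few lines over exported spans. This is a toy baseline check, not a real detector; the threshold, field names, and span shape are assumptions matching the tracing example earlier:

```python
# Minimal anomaly check: count tool-call spans per trace and flag
# traces far above the usual baseline. Threshold values are assumed.
from collections import Counter

TOOL_CALL_BASELINE = 5        # typical tool calls per reasoning chain
TOOL_CALL_ALERT_FACTOR = 10   # flag at 10x baseline (50 calls)

def check_tool_call_anomalies(spans: list) -> list:
    """spans: exported span dicts with 'trace_id' and 'name' fields."""
    calls_per_trace = Counter(
        s["trace_id"] for s in spans if s["name"].startswith("tool.")
    )
    threshold = TOOL_CALL_BASELINE * TOOL_CALL_ALERT_FACTOR
    return [tid for tid, n in calls_per_trace.items() if n >= threshold]

spans = [{"trace_id": "t1", "name": "tool.sql_query"}] * 60 + \
        [{"trace_id": "t2", "name": "tool.sql_query"}] * 4
print(check_tool_call_anomalies(spans))  # ['t1']
```

In production you would express the same logic as a backend alert rule (a Datadog monitor or Grafana alert over span counts) rather than a script, but the signal is identical: tool-call volume per trace, compared against a baseline.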

What This Means For Your Team

McKinsey is a $16 billion consultancy with, presumably, a sizable security budget. Their AI platform got compromised through a vulnerability class older than most junior developers. Not because they could not afford better security, but because they did not instrument their AI agent to detect abnormal behavior.

If McKinsey's AI agent had observability, someone would have noticed 46.5 million database reads in two hours. They did not, because nobody was looking.

The tools exist. OpenTelemetry GenAI semantic conventions are stable. Datadog, New Relic, and Grafana all support them. Microsoft now requires observability as part of their SDL for AI systems.

The question is not whether you need AI observability. The question is whether you will add it before or after something goes wrong.


About DeviDevs: We build ML platforms, secure AI systems, and help companies comply with the EU AI Act. devidevs.com
