5 questions AI agent vendors hope you don't ask - Imperative Business Ventures Limited

Every vendor pitch right now involves an agent completing a 12-step workflow flawlessly in a sandboxed environment with clean data and zero edge cases. The demo always works. Production, as every practitioner already knows, is a different matter entirely.

Gartner forecasts that 40% of enterprise applications will embed task-specific AI agents by the end of 2026, up from less than 5% in 2025 (one of the steepest adoption curves in enterprise software history). And yet, Gartner also predicts over 40% of agentic AI projects will be canceled by the end of 2027, citing escalating costs, unclear business value, and inadequate risk controls. That gap between deployment volume and deployment quality is exactly where AI decision makers need to focus.

Here’s what separates a solid agent deployment from an expensive disappointment.

The metric vendors skip in their slides

Most agent demos optimize for task completion rate on a curated benchmark. That number matters, but the metric that actually predicts real-world ROI is error recovery rate: How does the agent behave when a tool call fails, an API returns unexpected data, or a human-in-the-loop step times out?

An agent that completes 94% of tasks in ideal conditions but halts clumsily on failure is a liability in any workflow where reliability is the point. Ask vendors for their failure mode documentation. If that documentation is sparse, you already have useful information.

💡 According to McKinsey, security and risk concerns rank as the number-one barrier to scaling agentic AI in 2026. Agents gaining autonomy across tools, data, and systems create failure surfaces where small issues cascade quickly into compliance violations. The failure taxonomy matters as much as the capability set.

5 questions AI agent vendors hope you don’t ask

Most vendor evaluations focus on capability demos. These questions shift the conversation to operational reality:

How does the agent behave when confidence drops below the threshold? Does it escalate, halt, or proceed? Who decides the threshold, and can your team configure it directly?
How does the system handle tool call rate limits at scale? If your agent is calling Salesforce, Jira, and an internal API simultaneously across 500 concurrent sessions, where are the bottlenecks and who owns them?
What does the audit trail look like? For regulated industries, you need a complete, queryable log of every action, every decision point, and every human override. Ask to see an actual log from a live production environment. A screenshot from a demo tells you very little.
How does model versioning work? When the underlying model is updated by the vendor, does your agent’s behavior change? How are breaking changes communicated and tested?
What is the latency profile under load? A 2-second response in a demo becomes a different problem at 10,000 requests per hour. Get numbers from real deployments, with named reference customers if possible.

Orchestration architecture is the real decision

The agent interface is visible. The orchestration layer is where the actual architecture decision lives, and it carries significant downstream consequences.

Single-agent architectures (one LLM calling tools sequentially) are simpler to debug and audit but hit ceilings on complex, multi-domain tasks.

💡 Multi-agent architectures (orchestrator plus specialist agents, as in LangGraph, AutoGen, or CrewAI) scale better but introduce coordination overhead and failure surfaces that compound quickly.
An agent that delegates to five sub-agents has five additional places to fail.

For most enterprise deployments in 2026, the right architecture depends less on raw capability and more on your team’s ability to observe, debug, and intervene. The 2026 Gartner Hype Cycle for Agentic AI flags agentic AI governance and security as profiles now distributed across the curve, reflecting enterprise concern about accountability emerging early in the adoption cycle.

A well-monitored single-agent setup will outperform a sophisticated multi-agent system your team can barely inspect. Arize AI, LangSmith, and Weights & Biases all offer observability tooling worth evaluating alongside the agents themselves.

Context window management is a hidden cost

Agentic tasks are long-context tasks. An agent working through a complex procurement workflow might accumulate 80,000 tokens of context across tool call results, intermediate reasoning, and prior steps. At current GPT-4o pricing, this could get expensive fast if context management is handled carelessly.

Retrieval-augmented approaches that pull only relevant context at each step are more cost-efficient than naive full-history approaches. Ask vendors how their system handles context pruning and whether you have visibility into token consumption per workflow run. The 2026 Hype Cycle calls out FinOps for agentic AI as a rising concern. The industry is beginning to treat agent cost management as a discipline in its own right. If your vendor struggles to give you a price estimate, build your own cost model before you sign.

The evaluation framework most teams are missing

Agent evaluation is a discipline the industry is still constructing.
Established LLM eval frameworks like RAGAS, PromptFoo, and DeepEval have added agentic evaluation features, but coverage remains uneven. A mature eval suite should cover:

Faithfulness to instructions across multi-step tasks, including cases where following instructions exactly produces a suboptimal outcome, a real and under-examined edge case.
Tool call accuracy: The agent calls the right tool with the right parameters. A plausible-looking tool call that returns a result is a separate, lower bar.
Trajectory evaluation: Comparing the actual sequence of steps taken against the optimal path, assessed at the process level rather than the final output alone.
Adversarial inputs, including prompt injections delivered via tool call results, a live attack surface in any agent that reads external content.

Running this suite on a vendor’s system before deployment is a reasonable ask. Production-ready vendors will have already done versions of this internally and should be able to share findings.

A word on build vs. buy

The build-vs-buy dilemma for agents is more nuanced than it was for traditional software. The core LLM capability is accessible to anyone via API.
The differentiation in commercial products sits in the workflow tooling, the pre-built integrations, and the fine-tuned domain-specific models underneath.

Generic horizontal tasks (email triage, meeting summarization, document processing): Commercial products are almost always faster to value.
Specialized vertical workflows (clinical documentation, financial compliance, engineering code review): Purpose-built vendors with domain fine-tuning may justify the price premium.
The trap to avoid: Building a generic agent in-house at significant engineering cost to do something a mature commercial product already does adequately.

💡 The production-readiness gap tells the story clearly: in 2026, 79% of enterprises have adopted AI agents in some form, with just 11% running them in production. Most organizations are still iterating on pilot infrastructure. The build-vs-buy decision should factor in where your team’s bandwidth genuinely lives.

So, what will the next 12 months look like?

The agent market is consolidating around orchestration standards. Anthropic’s Model Context Protocol (MCP) and Google’s Agent2Agent protocol are both gaining adoption as interoperability frameworks across vendors and tools. Betting heavily on a proprietary orchestration layer today carries real lock-in risk as these standards mature.

The organizations getting the most value from agents in 2026 are treating the first deployment as an infrastructure investment, building observability from day one, and defining clear human-escalation paths before automating anything. Gartner’s strategic predictions for 2026 flag that “death by AI” legal claims will exceed 2,000 by the end of 2026 due to insufficient risk guardrails, particularly in healthcare, finance, and public safety.

Governance has moved from best practice to table stakes. Unglamorous advice, reliably correct.
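
For readers who want to act on the advice to build their own cost model, here is a minimal sketch of the context-window arithmetic discussed earlier. It compares a naive run that re-sends the full history at every step with a run that caps the context sent per step. Everything in it is an assumption for illustration: the function names, the 20-step workflow adding 4,000 tokens per step (chosen so the history reaches the 80,000 tokens mentioned above), the 8,000-token cap, and the placeholder price, which is not a quoted rate for any model. Output tokens and caching discounts are ignored.

```python
from typing import Optional


def total_input_tokens(tokens_per_step: int, steps: int,
                       max_context: Optional[int] = None) -> int:
    """Sum the input tokens sent over a run in which every step re-sends context.

    With max_context=None the full history is re-sent at each step (naive).
    With a cap, at most max_context tokens are sent per step (pruned).
    """
    total = 0
    history = 0
    for _ in range(steps):
        history += tokens_per_step  # context grows with each step's results
        sent = history if max_context is None else min(history, max_context)
        total += sent
    return total


def run_cost(input_tokens: int, usd_per_million_input: float) -> float:
    """Convert an input-token count into a cost at an assumed price."""
    return input_tokens / 1_000_000 * usd_per_million_input


# Hypothetical workflow: 20 steps, each adding 4,000 tokens (80,000 by the end).
PRICE = 2.50  # placeholder USD per million input tokens, not a quoted rate
naive = total_input_tokens(4_000, 20)
pruned = total_input_tokens(4_000, 20, max_context=8_000)
print(f"naive:  {naive:>7} tokens, ${run_cost(naive, PRICE):.2f}")
print(f"pruned: {pruned:>7} tokens, ${run_cost(pruned, PRICE):.2f}")
```

Under these assumptions the naive run sends 840,000 input tokens against 156,000 for the capped run, because re-sending a growing history makes total input grow roughly with the square of the step count. Swapping in a vendor's real per-step token logs and actual prices turns the sketch into a usable estimate.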
