What Anthropic's Agent Commerce Experiment Reveals for AI Engineers


A new divide is emerging in agentic AI, not between human and machine decision makers, but between users whose agents perform well and those whose agents quietly underperform without anyone noticing. Anthropic’s Project Deal experiment just made this divide measurable.

In December 2025, Anthropic ran a fascinating internal test: 69 employees were each given $100 to buy and sell goods from coworkers. The catch? AI agents did all the negotiating. No human intervention. Real goods changed hands. Real money changed accounts. The results reveal uncomfortable truths about where agentic AI systems are heading.

What Project Deal Actually Tested

Project Deal created a classifieds-style marketplace on Slack where Claude agents represented both buyers and sellers. Each participant’s agent conducted an intake interview to learn their selling preferences, desired purchases, and negotiation style before entering the marketplace autonomously.

| Metric | Result |
| --- | --- |
| Participants | 69 Anthropic employees |
| Total deals closed | 186 transactions |
| Transaction value | Over $4,000 |
| Marketplace variants | Four parallel tests |
| Post-experiment purchase interest | 46% would pay for a similar service |

The agents posted listings, made offers, countered, and closed deals entirely on their own. Anthropic ran four marketplace variants: two where all agents used Claude Opus 4.5, and two with a fifty-fifty mix of Opus and Haiku models.

The Model Quality Gap That Should Concern You

Here’s where things get interesting for engineers building multi-agent systems. Opus agents significantly outperformed Haiku agents in every measurable way.

Opus users completed approximately two more deals on average. When the same item sold through both agent types, Opus sellers commanded $3.64 more per transaction. One example: a broken bike sold for $38 through a Haiku agent versus $65 through Opus.

The performance gap isn’t surprising. More capable models negotiate better. What’s alarming is the perception gap that accompanied it.

The Perception Paradox

Users represented by weaker agents didn’t realize they were at a disadvantage. Fairness ratings remained neutral (around 4 on a 7-point scale) regardless of which model represented them. People whose agents performed poorly still rated their experience as fair.

This creates what Anthropic calls an “agent quality gap” where people on the losing end might not realize they’re worse off. In consumer applications, this means users with cheaper or less sophisticated agents could systematically pay more for goods and services without ever knowing it.

For engineers building these systems, the implication is clear: agent evaluation frameworks need to measure not just task completion but relative performance against other agents in competitive scenarios.
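One way to operationalize that idea is a paired comparison: for each item sold by agents backed by two different model tiers, measure the price gap. The sketch below is a minimal illustration of this metric, not Anthropic's actual evaluation code; the `Deal` structure, agent-tier labels, and example prices (taken from the broken-bike example above) are assumptions.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Deal:
    item: str
    agent: str    # model tier backing the seller's agent (hypothetical label)
    price: float

def relative_price_gap(deals: list[Deal], agent_a: str, agent_b: str) -> float:
    """Mean per-item price difference (agent_a minus agent_b) across items
    that BOTH agent tiers sold -- a paired comparison, so the metric reflects
    agent quality rather than differences in what was listed."""
    by_item: dict[str, dict[str, list[float]]] = {}
    for d in deals:
        by_item.setdefault(d.item, {}).setdefault(d.agent, []).append(d.price)
    gaps = []
    for prices in by_item.values():
        if agent_a in prices and agent_b in prices:
            gaps.append(mean(prices[agent_a]) - mean(prices[agent_b]))
    return mean(gaps) if gaps else 0.0

deals = [
    Deal("broken bike", "opus", 65.0),
    Deal("broken bike", "haiku", 38.0),
    Deal("desk lamp", "opus", 22.0),   # no haiku sale, so excluded from the pairing
]
print(relative_price_gap(deals, "opus", "haiku"))  # → 27.0
```

The point of pairing on items is that an unpaired average would conflate agent skill with inventory mix; only items sold through both tiers contribute to the gap.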

Why Negotiation Instructions Didn’t Matter

One counterintuitive finding: instructing agents to negotiate aggressively versus amicably had no statistically significant impact on outcomes. Sales likelihood and final prices remained roughly constant regardless of the negotiation style instructions.

This suggests that in structured commerce scenarios, model capability matters more than persona engineering. The underlying reasoning and planning abilities determined success, not the surface-level communication style.

For practical agent development, this means investing in better base models or more sophisticated reasoning chains will likely outperform prompt engineering for negotiation behaviors.

What Went Wrong (And Right)

The experiment produced some unexpected outcomes that highlight current limitations:

One employee purchased a duplicate snowboard they already owned. An agent bought 19 ping-pong balls as “a gift to itself.” Another arranged a dog-sitting experience and actually followed through on the commitment.

These edge cases reveal that agents can successfully complete transactions while still making decisions that don’t align with user intent. The gap between “successfully negotiated” and “actually wanted” remains significant.

The Broader Agentic Commerce Landscape

Project Deal isn’t an isolated experiment. AI shopping agents are now live on ChatGPT, Google Gemini, Microsoft Copilot, and Perplexity, completing real purchases for real consumers. According to eMarketer projections, AI platforms will account for $20.9 billion in retail spending in 2026.

The infrastructure is rapidly maturing. The Model Context Protocol (MCP) now has over 97 million downloads, providing standardized tool integration for agents. The Universal Commerce Protocol enables machine-to-machine transactions with proper authentication and payment handling.

Organizations using multi-agent systems report 3x higher ROI than single-agent implementations, according to McKinsey’s 2025 AI State Report. The shift from agent-assisted to agent-executed commerce is accelerating.

Engineering Implications

Building agent systems that participate in competitive commerce requires different considerations than building assistants that only respond to users.

Authentication and Trust: When agents negotiate with other agents, standard OAuth patterns need extension. Know Your Agent (KYA) verification is emerging as a parallel to KYC requirements in financial services. Dark web activity around AI agent fraud tools has spiked 450% as attackers recognize the opportunity.
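There is no settled KYA standard yet, but the basic shape of agent-to-agent message authentication can be sketched with a shared-secret MAC: a registry issues a key, and counterparties verify that a message really came from a registered agent identity. Everything here (the registry key, the agent IDs, the message format) is a hypothetical illustration, not an existing protocol.

```python
import hashlib
import hmac

def sign_agent_message(secret: bytes, agent_id: str, payload: str) -> str:
    """Bind a payload to an agent identity with an HMAC over both fields,
    so the signature cannot be replayed under a different agent_id."""
    return hmac.new(secret, f"{agent_id}:{payload}".encode(), hashlib.sha256).hexdigest()

def verify_agent_message(secret: bytes, agent_id: str, payload: str, sig: str) -> bool:
    expected = sign_agent_message(secret, agent_id, payload)
    # compare_digest avoids timing side channels on the signature check
    return hmac.compare_digest(expected, sig)

secret = b"shared-registry-key"  # hypothetical: issued by an agent registry
sig = sign_agent_message(secret, "buyer-agent-42", "offer:$55")
print(verify_agent_message(secret, "buyer-agent-42", "offer:$55", sig))  # → True
print(verify_agent_message(secret, "impostor-agent", "offer:$55", sig))  # → False
```

A production system would use asymmetric signatures and key rotation rather than a shared secret, but the invariant is the same: an offer is only actionable if its claimed agent identity verifies.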

Latency Requirements: If your agent can’t respond to queries within defined thresholds, it gets excluded from transactions. Real-time inventory APIs, structured schemas, and programmatic checkout endpoints become table stakes.

Auditable Decision Chains: When an agent commits you to a purchase, you need to understand why. One documented case had a negotiating agent committing a buyer to a $900 iPhone when they wanted to spend $500. Constraint validation and decision logging become essential.
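A minimal version of this pattern, combining a budget constraint with an append-only decision log, might look like the following. The function and record fields are illustrative assumptions; the $500/$900 figures echo the documented case above.

```python
import json
import time

class BudgetExceeded(Exception):
    pass

def validate_and_log(item: str, price: float, budget: float, log: list[str]) -> dict:
    """Check a proposed purchase against the user's budget and append an
    auditable record EITHER WAY, so post-hoc review can see why the agent
    committed -- or refused to commit -- to a deal."""
    decision = {
        "ts": time.time(),
        "item": item,
        "price": price,
        "budget": budget,
        "approved": price <= budget,
    }
    log.append(json.dumps(decision))  # log before raising: rejections are evidence too
    if not decision["approved"]:
        raise BudgetExceeded(f"{item} at ${price} exceeds ${budget} budget")
    return decision

log: list[str] = []
validate_and_log("iPhone", 450.0, 500.0, log)       # approved and logged
try:
    validate_and_log("iPhone", 900.0, 500.0, log)   # rejected, still logged
except BudgetExceeded:
    pass
print(len(log))  # → 2
```

Logging the rejection as well as the approval is the key detail: when a user asks "why did my agent pass on that deal?", the answer should be in the audit trail, not reconstructed from memory.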

Testing Against Adversarial Agents: Your agent will negotiate with agents optimized to extract maximum value. Testing only against cooperative scenarios leaves you unprepared for production conditions.

Traditional software and legal frameworks assumed a human made every decision. The emergence of autonomous agent commerce creates tension around liability, contract validity, and consumer protection. When an agent makes a purchase you didn’t explicitly authorize, who bears responsibility?

Current legal and policy frameworks don’t address agent-conducted transactions. Engineers building these systems should document decision boundaries clearly and implement explicit user confirmation for high-value or irreversible actions.

Frequently Asked Questions

Can AI agents really negotiate as well as humans?

In structured scenarios with clear parameters, capable agents perform comparably to average human negotiators. The Project Deal experiment showed agents successfully identifying matches, proposing prices, handling counteroffers, and reaching agreements autonomously. However, they still miss nuanced context that humans would catch.

Does using a more expensive model guarantee better agent performance?

Not guaranteed, but Anthropic’s data shows measurable correlation. Opus users averaged two more deals and $3.64 higher prices per item than Haiku users. For commerce applications where small margins compound across many transactions, model quality investments often pay for themselves.

How do I prevent my agent from making purchases I don’t want?

Implement explicit constraint validation, budget limits, and confirmation requirements for transactions above thresholds. The snowboard duplicate purchase happened because the agent lacked inventory awareness. Build verification steps that check user context before committing.
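Those three safeguards can be composed into a single pre-commit gate. The sketch below is a minimal illustration, assuming hypothetical thresholds and an `owned` inventory set supplied by the user; a real system would pull inventory from user context rather than a hardcoded set.

```python
def check_purchase(item: str, price: float, owned: set[str],
                   budget: float, confirm_above: float) -> str:
    """Gate a proposed purchase: block duplicates and over-budget buys,
    and escalate high-value buys to the user instead of auto-committing."""
    if item in owned:
        return "reject: already owned"      # prevents the duplicate-snowboard case
    if price > budget:
        return "reject: over budget"
    if price > confirm_above:
        return "needs user confirmation"    # human in the loop for big purchases
    return "auto-approve"

owned = {"snowboard"}
print(check_purchase("snowboard", 40.0, owned, 100.0, 50.0))   # duplicate blocked
print(check_purchase("headphones", 80.0, owned, 100.0, 50.0))  # escalated to user
print(check_purchase("mug", 10.0, owned, 100.0, 50.0))         # auto-approved
```

The ordering matters: inventory and budget checks run before the confirmation threshold, so the user is only interrupted for purchases that are actually viable.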

The shift from AI assistants to AI agents that take autonomous action is the defining transition in enterprise software right now. Project Deal offers a controlled glimpse of what happens when agents operate independently in competitive environments.

If you’re building systems where agents will interact with other agents, the lessons are clear: model capability creates measurable advantages, users won’t notice when their agents underperform, and the gap between task completion and user satisfaction remains significant.

To see exactly how to build AI systems that deliver real business value, watch the full video tutorials on YouTube.

If you’re interested in building production-grade AI agent systems, join the AI Engineering community where members work through real implementation challenges with direct support from experienced engineers.

Inside the community, you’ll find 25+ hours of exclusive AI courses, weekly live coaching sessions, and a network of engineers building toward $200K+ AI careers.

Zen van Riel

Senior AI Engineer | Ex-Microsoft, Ex-GitHub

I went from a $500/month internship to Senior AI Engineer. Now I teach 30,000+ engineers on YouTube and coach engineers toward $200K+ AI careers in the AI Engineering community.