My morning “commute” on the train to NYC today gave me the perfect 30 minutes to dive into a LinkedIn post that Chris Ogboke had shared with me. Bevan Lane's provocative post caught my attention:
"A group of professors from Carnegie Mellon University recently decided to run an experiment. They built a fake company and staffed it entirely with AI agents... Out of all the AI models tested, including ones from OpenAI, Google, Anthropic, and Meta, Claude had the best result, completing just 24% of its tasks. The others did worse."
The post led me down a rabbit hole – first to an article on Futurism titled "Professors Built a Company Staffed Entirely With AI Agents and It Was a Total Disaster," which then referenced another on Yahoo Tech with the headline "Next Assignment: Babysitting AI."
As someone who works with AI agents daily, I found myself thinking: "Is this actually news to anyone in the field?"
Thanks to AI (doing what it's actually good at), I was able to quickly examine the original research paper, "TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks," to see what was really being tested.
What The Research Actually Tested
After thoroughly reviewing the paper, here's what I found when I asked some basic questions:
Did they use any prompt tuning techniques? No evidence in the paper (see §6, pp. 9-10).
Did they implement any prompt reinforcement to improve results? Not mentioned throughout the paper.
Did they use AI to help write their prompts? No indication of this in their methodology.
Did they provide the actual prompts used? No, surprisingly absent from both the main paper and appendices.
Did they implement any context management so agents could learn from mistakes? No such mechanisms described (see §7.3, pp. 13-14 on "Common Agent Failures").
Did they use frameworks like LangChain for workflow management? No, they relied solely on the CodeAct Agent with Browsing, version 0.14.2 (paper footnote 8, p. 9), without additional orchestration.
The researchers simply used default OpenHands CodeAct agents with basic instructions. They didn't implement any of the techniques that practitioners consider essential for effective AI agent deployment—though to be fair, the authors note that their aim was to provide a reproducible baseline rather than an optimized system.
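To make concrete what "context management so agents could learn from mistakes" means in practice, here is a minimal sketch of the kind of retry-with-feedback wrapper practitioners commonly add around a raw agent. This is my own illustration, not the paper's setup or OpenHands' API; call_llm, verify_output, and run_with_feedback are hypothetical placeholders you would swap for your actual model call and task-specific checks.

```python
# Minimal sketch (illustrative, not from the paper): each failed attempt's
# error feedback is folded back into the prompt so the next attempt can
# learn from the mistake.

from typing import Callable, Optional

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (OpenAI, Anthropic, a local model, etc.)."""
    raise NotImplementedError

def run_with_feedback(
    task: str,
    verify_output: Callable[[str], Optional[str]],  # returns None if OK, else an error message
    max_attempts: int = 3,
) -> Optional[str]:
    """Retry a task, appending each failure's error message to the context."""
    failures: list[str] = []
    for attempt in range(1, max_attempts + 1):
        prompt = task
        if failures:
            prompt += "\n\nPrevious attempts failed with:\n" + "\n".join(failures)
            prompt += "\nFix these issues in your next answer."
        answer = call_llm(prompt)
        error = verify_output(answer)
        if error is None:
            return answer                                   # success: checkpoint passed
        failures.append(f"Attempt {attempt}: {error}")      # remember the mistake
    return None                                             # escalate to a human after max_attempts
```

Nothing here is exotic; it is a few dozen lines of scaffolding, which is precisely why its absence from the benchmark matters.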
The Paper's Valuable Contributions
It would be unfair not to acknowledge what TheAgentCompany benchmark does well. The CMU team created:
A comprehensive set of realistic tasks - 175 long-horizon tasks covering software engineering, project management, financial analysis, and other professional domains (§4-5, pp. 5-8)
A fully reproducible, self-hosted environment - Using open-source alternatives to common workplace tools like GitLab and RocketChat (§3, pp. 4-5, Table 1)
A sophisticated evaluation methodology - Including granular checkpoints and a partial credit scoring system that provides nuanced performance metrics (§4.1, pp. 6-7)
Simulated colleagues for testing communication - AI-powered NPCs that agents could interact with via chat platforms (§3, pp. 4-5)
These contributions represent significant value to the research community, providing a standardized benchmark that future work can build upon.
Research Without Showing The Work
Most concerning is the complete absence of the actual prompts used in the study. In scientific research, this is akin to not sharing your experimental methods or data. Without seeing how tasks were framed to the agents, it's impossible to evaluate whether the poor performance was due to inherent limitations of AI or simply poorly structured instructions.
The paper states they used "OpenHands' main agent... CodeAct Agent with Browsing" but provides no details on prompt construction or refinement. This is a significant methodological limitation, particularly when the central conclusion relates to agent performance.
A Trend in Academic AI Research
This isn't just an issue with a single paper. A look at other recent publications reveals a pattern of academic AI agent research that overlooks practical implementation techniques:
"AI Agents That Matter" (Kapoor et al., July 2024) - This paper critiques current agent benchmarks for their "narrow focus on accuracy without attention to other metrics" resulting in "SOTA agents [that] are needlessly complex and costly." The authors specifically call out the lack of standardization in evaluation practices and the gap between benchmark performance and real-world usefulness.
"The 2025 AI Engineering Reading List" (Latent.Space, December 2024) - This expert-curated list notes that for critical areas like code generation and agents, "much of the frontier has moved from research to industry and practical engineering advice... are only found in industry blogposts and talks rather than research papers."
IEEE Spectrum's opinion piece "AI Prompt Engineering Is Dead" (March 2024) - This commentary highlights how academic approaches often fail to incorporate the automated prompt engineering techniques that have become standard in industry practice.
The gap between academic research and practitioner knowledge in AI is particularly pronounced in agent benchmarks, where the scaffolding, prompt engineering, and workflow management techniques that practitioners consider essential are frequently overlooked in formal evaluations.
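For readers unfamiliar with the automated prompt engineering mentioned above, here is a toy sketch of the common pattern: generate candidate prompt variants, evaluate them on a small set of test cases, and keep the best. It is an assumption about how such pipelines typically work, not the method of any cited paper; call_llm and score are hypothetical placeholders.

```python
# Toy prompt-optimization loop (illustrative sketch, not a cited method):
# hill-climb over prompt variants, keeping whichever scores best.

from typing import Callable

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # plug in your model API here

def optimize_prompt(
    seed_prompt: str,
    eval_cases: list[tuple[str, str]],       # (task input, expected output)
    score: Callable[[str, str], float],      # grading function returning a value in [0, 1]
    rounds: int = 5,
    candidates_per_round: int = 4,
) -> str:
    """Iteratively rewrite the prompt and keep the variant that scores best on eval_cases."""
    best_prompt, best_score = seed_prompt, -1.0
    for _ in range(rounds):
        variants = [best_prompt] + [
            call_llm(f"Rewrite this instruction to be clearer and more specific:\n{best_prompt}")
            for _ in range(candidates_per_round)
        ]
        for prompt in variants:
            avg = sum(
                score(call_llm(f"{prompt}\n\nInput: {x}"), expected)
                for x, expected in eval_cases
            ) / len(eval_cases)
            if avg > best_score:
                best_prompt, best_score = prompt, avg
    return best_prompt
```

Even a crude loop like this routinely lifts task success rates, which is why its absence from a benchmark's methodology is worth flagging.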
What Practitioners Already Know
Those of us working with agentic AI daily understand that:
Raw LLMs aren't agents - They need carefully engineered workflows, scaffolding, and context management systems
Effective AI systems combine the following (sketched in code after this list):
Specialized prompt engineering
Task decomposition frameworks
Error recovery mechanisms
Human oversight at critical junctures
The 70/30 rule applies - According to practitioner observations (as I've discussed elsewhere), AI can handle about 70% of knowledge work independently, but the final 30% requires human judgment and refinement
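Here is a minimal sketch of the workflow pattern described in the list above: decompose a task, let the agent attempt each subtask, and route low-confidence results to a human reviewer. It is my own illustration of the 70/30 split under stated assumptions, not a production system; call_llm, plan_subtasks, and needs_human_review are all hypothetical names.

```python
# Illustrative sketch: task decomposition plus a human-oversight checkpoint.

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for a real model call

def plan_subtasks(task: str) -> list[str]:
    """Ask the model to break the task into short, independently checkable steps."""
    plan = call_llm(f"Break this task into short, numbered subtasks:\n{task}")
    return [line.split(".", 1)[1].strip() for line in plan.splitlines() if "." in line]

def needs_human_review(result: str) -> bool:
    """Stand-in for a confidence check: uncertain or empty output goes to a person."""
    return "NOT SURE" in result.upper() or len(result.strip()) == 0

def run_task(task: str) -> dict:
    """The agent handles the routine portion; anything flagged lands in the human queue."""
    done, escalated = [], []
    for subtask in plan_subtasks(task):
        result = call_llm(f"Complete this subtask and say NOT SURE if uncertain:\n{subtask}")
        (escalated if needs_human_review(result) else done).append((subtask, result))
    return {"completed_by_agent": done, "needs_human_judgment": escalated}
```

The design choice is the point: the system is built to hand off the hard 30%, not to pretend it doesn't exist.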
The Real Revolution Is In The Tooling
The agentic revolution isn't about AI systems that can autonomously run a company. It's about the sophisticated tooling, scaffolding, and context management that allows AI to handle specific parts of complex workflows effectively.
Basic prompts with vanilla LLMs are interesting for academic research but bear little resemblance to how AI agents are actually implemented in production environments.
Missing The Forest For The Trees
This research (and the clickbait articles it spawned) fundamentally misunderstands what we're trying to achieve with AI agents. The goal isn't fully autonomous systems that replace humans entirely - it's augmented intelligence that handles routine aspects of complex work while keeping humans in control of judgment, direction, and quality.
When researchers build simplified test environments without implementing the engineering practices that practitioners consider essential, they're not measuring the effectiveness of AI agents as they exist in the real world - they're simply confirming limitations we already know exist in raw models.
Conclusion
Next time you see a headline proclaiming "AI agents fail at X," look carefully at what was actually tested. Was it a raw model with basic prompts? Or was it a carefully engineered system with the scaffolding and oversight mechanisms that real implementations require?
The gap between academic research and practitioner knowledge in AI remains wide - and studies like this, while generating attention-grabbing headlines, often do little to close it.
It's worth noting a limitation of my own critique: without access to the raw prompts used in the study, we can't definitively prove whether simple re-phrasing or basic prompt engineering would significantly improve the 24% success rate. The true impact of better prompt engineering remains an empirical question that would require additional experimentation.
P.S. The 30-minute train ride wasn't quite enough to research and write this entire article – I'm putting the finishing touches on it in Grand Central Terminal. But that should be the real story here, not a poorly executed study. The fact that I could review a research paper, analyze its methodology, compare it with other recent publications, and draft a substantive critique in roughly the time of a commute demonstrates exactly the kind of human-AI collaboration that these academic papers often miss.