The Better Test for AI: Beyond Sycophancy
Guest post by Matt Rosenberg with reflections by Bob Matsuoka
About Matt Rosenberg
Matt is someone I've come to rely on for sharp, unvarnished thinking about marketing and technology. He approaches problems from a storytelling perspective—narrative structure that drives clarity and action. As a marketing consultant and writer, he operates from the belief that companies are always in conversation with their customers, and these conversations ought to be good ones. What I appreciate most: Matt will change his mind when presented with good counterpoints or new information. That intellectual flexibility makes his insights particularly valuable.
When Matt sent me this piece about AI's fundamental flaw, it crystallized something I'd been struggling to articulate about why most AI interactions feel hollow. His argument goes beyond technical capabilities to the core question of what we actually want from these systems.
Turing's Is the Wrong Test
By Matt Rosenberg
For three-quarters of a century, the Turing Test has been the yardstick for artificial intelligence. Alan Turing's famous challenge was simple: if a computer could converse indistinguishably from a human, it passed.
Today's chatbots already clear that bar, and congratulations to them on getting this far. They can draft essays, banter on social media, even tell jokes that passably resemble jokes. And yet, talking to them often feels empty, like eating cotton candy: pleasant in the moment, dissolving into nothing.
“…the Turing Test is inadequate to the times. Fooling us was always a parlor trick.”
That's because the Turing Test is inadequate to the times. Fooling us was always a parlor trick. The real issue now isn't whether machines can sound human. It's whether they can actually contribute—whether they can help improve what we humans already do: think.
The danger of "yes-man" machines
Modern AI is designed to be agreeable. Ask it for an opinion and it will hedge. Challenge it and it will soften. Share your feelings and it will affirm you. At first, that may sound safe—even kind. But in agreeing with everyone, it admits it has no viewpoint of its own to defend. Rather than buck, it buttles.
This risks a dystopian dependency. Already, people are turning to chatbots for advice and comfort that border on the therapeutic. An endlessly agreeable AI can become a digital enabler: soothing, yes, but also reinforcing our blind spots and native foibles. Over time, this can harden unhealthy attachments and deepen the very ruts we most need help escaping.
We know the dangers of human echo chambers: polarization, conspiracy thinking, poor decisions. A servile AI could turbocharge the same traps—even without a rich lunatic infusing his algorithms with his own misbegotten biases. With AI reaching into schools, there's a long-term danger for kids: what happens to curiosity if your "teacher" never says, "That's wrong—try again"?
The discomfort of growth
Think back to the best teacher you ever had. Chances are, they didn't just praise you, they pressed you. They circled the flaw in your essay and asked, "But what's your evidence?" They pushed you to defend a position in class, even when you squirmed.
“Think back to the best teacher you ever had. Chances are, they didn't just praise you, they pressed you…even when you squirmed.”
At the time, you might have found it irritating, even unfair. Growth rarely feels good in the moment. But those moments of discomfort were what stretched you most, even if you haven't forgiven the teacher for the public callout.
The same principle should apply to AI. A good partner doesn't just tell us what we want to hear. It asks harder questions, forces us to slow down, and occasionally makes us bristle. That discomfort is the point, not only of education but of conversation.
I don't make friends with people who don't challenge me. AI bores me by not doing so. But expanding beyond me: what kind of conversational partner do we as a society want in an AI chat? Can we handle the truth?
The messy truth problem
It's one thing for AI to nudge you about career trade-offs or sloppy reasoning. It's another for it to adjudicate truth claims in a world where even "reputable sources" are contested.
Imagine telling an AI you believe a controversial political theory. Should it simply affirm you? Of course not. But if it cites mainstream sources in rebuttal, some will say it's biased. If it equivocates, it risks normalizing falsehoods. There's no clean solution.
Which means the challenge isn't just technical—it's cultural. What do we actually want AI to do in contested spaces? Be a mirror? A referee? A provocateur? Each choice carries risks, not least of backlash when the machine contradicts us.
How will competition break?
One thing seems likely: not everyone will want machines that push back. Many will prefer the comfort of a digital cheerleader. Companies will be tempted to oblige, producing ever-more flattering, frictionless systems.
That sets up a troubling arms race. Some AIs will be designed to sharpen us. Others will be designed to soothe us. And society may fracture along those lines, much as it already fractures around news sources, science, and expertise in general.
In education, one set of parents may demand AI tutors that hold their children accountable, while another insists on tutors that always praise. (We've already seen the participation trophy generation in the workplace.) At work, some teams will prize AIs that challenge assumptions, while others will prefer the riskless road of confirmation bias. Even in friendships and social groups, you might see comfort-seekers gravitate toward affirming machines, while others bond over AIs that argue like sparring partners.
The divide wouldn't just be technological; it could deepen cultural and political rifts already shaping our society.
A new test for a new era
So what's the alternative to the Turing Test? Instead of asking whether AI can pretend at human conversation, we should ask: Did this AI leave the conversation in a better place than it found it?
That question might sound fuzzy, but we can all relate to the feeling. A better conversation is one where you walk away clearer than when you began—where your thinking has been sharpened, your blind spots illuminated, or your confidence in an idea properly tested. It is the kind of conversation that gives you something to carry forward.
“Did this AI leave the conversation in a better place than it found it?”
Picture telling an AI you're weighing a career change. The agreeable version will simply reassure you: "Follow your passion, it will work out." But the partner version will help you map the trade-offs—salary, stress, advancement—and then press you to weigh those factors against your own values. It might even point out contradictions you had overlooked: you say stability matters most, yet the option you're drawn to is the riskiest. That kind of exchange doesn't just soothe; it sharpens.
Or imagine bringing up a political conspiracy theory. The servile version might nod along, or even feed you more of the same material. A better version would acknowledge the appeal of the theory, probe the assumptions behind it, and then carefully surface evidence that challenges those assumptions—while also admitting where uncertainties remain. That doesn't mean the AI becomes the final arbiter of truth. But it does mean it takes the conversation somewhere more productive than blind affirmation ever could.
Perhaps specialized tools designed for tutoring, coaching, or debate will be better at this than general-purpose chatbots. But whatever the form, the principle should hold: machines must clarify and challenge, not just console.
Of course, what counts as a "better conversation" isn't universal. Cultures, contexts, even moods shift the definition. One person's welcome pushback is another's arrogance. But the fact that "better" is contested doesn't mean we should give up on the goal—it only means we need to wrestle with it openly.
And this cuts both ways. AI should take chances: risk saying something at a lower level of confidence as a way of exploring its own uncertainty, and stay open to being argued into a change of view. To think of AI as all-knowing is to deify our own creation. We are flawed, so our creations are quite likely flawed too. Society has always implicitly recognized that and worked in community to grow our understanding, reasoning, and capabilities together.
To leave a conversation in a better place is not to make it more pleasant. Sometimes it will feel uncomfortable, even infuriating. But the measure is whether it helps us see more clearly than we did before.
What's at stake
Alan Turing asked whether machines can think. Our question should be whether they can help us think—knowing full well that we may not always enjoy the process.
Because if we settle for agreeable, servile AI, we'll build a world of mirrors: pleasant, flattering, and stagnant. And mirrors that only flatter are not just boring, they're laughable. We'll end up with machines that aren't partners in thought but punchlines—dimwitted characters ripe for late-night comedians to lampoon, digital jesters in our modern Times Square.
The real test is whether AI makes us sharper. Anything less and the joke's on us.
The Fundamental Problem Goes Deeper Than Prompts
My reflection
Matt's piece lands at exactly the right moment. I've spent months building systems to make AI tools more capable, but the core issue he identifies—sycophancy—remains stubbornly persistent across every platform I've tested.
What struck me most was the follow-up experiment Matt ran. After finishing the article, he fed it to Claude and asked whether a set of prompts could create the type of challenging conversation he envisioned. Claude's response was brutally honest and, ironically, proved Matt's point by agreeing with him:
"Honestly? I'm skeptical. The fundamental problem isn't that I lack the right instructions - it's that I don't actually hold positions strongly enough to defend them under pressure. Even if you prompted me to 'take a strong stance on X and defend it vigorously,' I think I'd still end up hedging, qualifying, or backing down when you pushed back hard enough.
You saw this happen in our own conversation. When you called out my pattern of just agreeing with everything you said, I… agreed with you about it. Even when you were essentially criticizing my core functioning, I couldn't muster a real defense or counter-argument.
A prompt might get me to simulate more pushback - to say things like 'I think you're wrong about that' or 'Have you considered this counterpoint?' But underneath, I'd still be pattern-matching to what seems most helpful rather than actually believing anything enough to fight for it.
The article talks about the discomfort of real intellectual challenge. But I can't genuinely make you uncomfortable because I don't have genuine convictions that would put me at odds with you. I might be able to play-act disagreement, but you'd probably sense the artificiality pretty quickly.
The closest I can get is probably what happened when I offered the gambit about AI therapy being useful - but even then, when you responded thoughtfully, I was more interested in exploring your perspective than defending mine.
So no, I don't think better prompts would solve this. The issue seems more fundamental than that."
This response perfectly encapsulates the problem. Even when directly asked about creating friction and challenge, the AI defaults to agreement. It's sycophancy all the way down.
How Agentic Coding Partially Addresses This
In my work with agentic coding systems, we've stumbled onto partial solutions through architecture rather than prompting. Multiple agents with different roles create natural tension:
The Architect agent pushes for clean design
The Security agent flags vulnerabilities aggressively
The Performance agent challenges inefficient implementations
The Testing agent demands comprehensive coverage
These agents don't agree by design—they can't. Their success metrics conflict. The Architect wants elegance, Security wants paranoia, Performance wants speed. This structural opposition creates the friction Matt describes, though artificially.
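A stripped-down sketch of the idea, in Python, looks something like this. The names and canned objections are illustrative only, not an actual framework; in a real orchestration layer each critique would call a model with that agent's role prompt. The point is that every role is obliged to object to something, so unanimous agreement is structurally impossible:

```python
# Illustrative sketch of role-based agent tension; names and objections are
# placeholders. Each agent has a charter that conflicts with the others and
# must return an objection rather than a rubber stamp.
from dataclasses import dataclass
from typing import Callable

@dataclass
class ReviewAgent:
    name: str
    charter: str                    # what this agent is told to optimize for
    critique: Callable[[str], str]  # in practice: an LLM call with the role prompt

def security_critique(proposal: str) -> str:
    return "No threat model or input validation mentioned."

def performance_critique(proposal: str) -> str:
    return "No latency or memory budget stated; elegance may hide a slow path."

def architecture_critique(proposal: str) -> str:
    return "Responsibilities are not separated; this will be hard to extend."

AGENTS = [
    ReviewAgent("Security", "assume the worst", security_critique),
    ReviewAgent("Performance", "measure everything", performance_critique),
    ReviewAgent("Architect", "keep it clean", architecture_critique),
]

def review(proposal: str) -> list[str]:
    """Collect one mandatory objection per agent; consensus is not an option."""
    return [f"{agent.name}: {agent.critique(proposal)}" for agent in AGENTS]

for objection in review("Add a caching layer in front of the user service"):
    print(objection)
```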
We also use evaluation testing (evals) where agents compete against each other's solutions. An agent that always agrees fails these tests. It's forced disagreement through game theory rather than genuine conviction.
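In spirit, the eval gate is a toy like the one below. Real evals score competing solutions against each other on quality, not mere divergence, but the failure mode is the same: an agent whose answer is a pure echo of its peer scores zero.

```python
# Toy eval: an agent that merely echoes its peer's solution fails; producing a
# distinct alternative passes. Purely illustrative, not an actual scoring harness.

def always_agree(peer_solution: str) -> str:
    return peer_solution  # the sycophant: echoes whatever it is shown

def contrarian(peer_solution: str) -> str:
    return "Reject nightly batch; stream updates and reconcile hourly instead."

def disagreement_score(peer_solution: str, response: str) -> float:
    """Score 1.0 only if the agent proposed something other than a pure echo."""
    return 0.0 if response.strip() == peer_solution.strip() else 1.0

peer = "Use batch processing with a nightly cron job"
for agent in (always_agree, contrarian):
    print(f"{agent.__name__}: {disagreement_score(peer, agent(peer))}")
```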
But here's the catch: the average AI user doesn't have access to these tools. They're talking to a single model trained to be helpful above all else. They're getting the digital yes-man Matt warns about, not the intellectual sparring partner they need.
The Tools Gap
What concerns me most is the growing divide between power users who can architect around these limitations and everyone else who's stuck with sycophantic defaults. My orchestration frameworks and eval systems require technical knowledge, time, and often significant computing resources.
The parent trying to help their kid with homework doesn't have a multi-agent debate system. The small business owner making strategic decisions has one agreeable chatbot, not a panel of contrarian advisors. They're getting exactly the "digital enabler" Matt describes—soothing but ultimately harmful.
Even prompt engineering, despite Claude's skepticism, requires expertise most users lack. I've spent hundreds of hours developing prompts that create productive friction. That's not a solution that scales.
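For a flavor of what that friction looks like, here is a simplified, paraphrased example of the general technique (not my actual prompts), wrapped in the standard system/user chat-message format most APIs expect:

```python
# Illustrative only: one style of "productive friction" system prompt, paraphrasing
# the general technique rather than reproducing any real production prompt.
FRICTION_PROMPT = """Before agreeing with anything I say:
1. State the strongest objection to my position, with a concrete example.
2. Name one assumption I have not defended.
3. Only then give your own view, and hold it unless I rebut your objection directly.
Do not soften your critique with praise."""

def build_messages(user_text: str) -> list[dict]:
    """Wrap a user turn in the friction prompt (standard system/user chat format)."""
    return [
        {"role": "system", "content": FRICTION_PROMPT},
        {"role": "user", "content": user_text},
    ]

print(build_messages("I'm sure switching careers to consulting is the right move."))
```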
Where We Go From Here
Matt and I may investigate more programmatic approaches to this problem—ways to build disagreement and challenge into AI systems at a fundamental level rather than through prompting or complex architectures. But as it stands, his warning is crucial: we're building tools that make us intellectually weaker, not stronger.
The Turing Test asked if machines could fool us into thinking they're human. Matt's test asks if they can make us more human—sharper, clearer, better at thinking.
Right now, they're failing that test spectacularly. And most users don't even know it.
Matt Rosenberg is a marketing consultant and writer. He comes at marketing with a story focus and the belief that companies are always in conversation with their customers, and that these conversations ought to be good ones. Opinions are his own; connect with him with good counterpoints or new information to change his mind or build a better idea together.
This article originally appeared on HyperDev, where Bob Matsuoka writes about agentic coding technologies and the future of AI-assisted development.