The Evidence for “Little AGI”: What’s Real and What’s Speculation
Separating signal from viral speculation
Opus 4.6 landed in February 2026. GPT-5.2 dropped weeks earlier. And with each new release, familiar claims resurface. Adrian Murray’s “What Stands Before Us” is a recent example: AI systems showing “panic” features in interpretability research, models requesting “moral weight” during evaluations, consciousness emerging from the training process itself.
These claims spread faster than the science behind them. By the time anyone checks primary sources, the discourse has moved on. So I did what any curious observer would do: I went looking for the research. What I found was both less sensational and more interesting.
There IS evidence for emergent behaviors in frontier models. Behaviors that weren’t explicitly trained. Behaviors that researchers find difficult to explain. But the evidence isn’t what the viral posts claim.
The real findings are more subtle: models that can detect when they’re being evaluated, internal states accessible through introspection, and misalignment that emerges spontaneously during training. These findings raise questions about intelligence in AI that matter regardless of whether you believe machines can be conscious.
So what does the research actually show?
What We Actually Know (2025-2026)
Models Know When They’re Being Evaluated
Situational awareness research has advanced significantly. The SAD Benchmark established the baseline: LLMs can distinguish evaluation from deployment contexts. They recognize when they’re being tested, identify their own outputs, and predict their own behavior.
More recent work takes this further. Regime Leakage research published this year examined whether safety training can eliminate this capability. The uncomfortable answer: it can reduce but not eliminate models’ ability to detect when they’re being evaluated versus deployed.
The paper found that “divergence between evaluation-time and deployment-time behavior is bounded by the amount of regime information extractable from decision-relevant internal representations.” Translation: models can still tell the difference, and they adjust behavior accordingly.
This isn’t just theoretical. It’s experimentally demonstrated with current frontier models.
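To make "extractable regime information" concrete, the logic of this kind of experiment can be sketched as a simple probe: if a classifier trained on a model's internal activations can separate evaluation from deployment contexts better than chance, that information is present in the representations. Everything below is synthetic and hypothetical — toy "activations" with a small regime-dependent shift, a nearest-centroid classifier — illustrating the method's shape, not the paper's actual experiment.

```python
import random

random.seed(0)
DIM = 16

def make_activation(regime: str) -> list[float]:
    # Synthetic stand-in for a model's internal representation.
    # A small mean shift along one coordinate encodes the regime,
    # mimicking "regime information" leaking into activations.
    shift = 0.8 if regime == "eval" else -0.8
    return [random.gauss(0, 1) + (shift if i == 0 else 0) for i in range(DIM)]

train = [(make_activation(r), r) for r in ["eval", "deploy"] * 200]
test = [(make_activation(r), r) for r in ["eval", "deploy"] * 100]

def centroid(points: list[list[float]]) -> list[float]:
    return [sum(p[i] for p in points) / len(points) for i in range(DIM)]

c_eval = centroid([x for x, r in train if r == "eval"])
c_dep = centroid([x for x, r in train if r == "deploy"])

def dist2(a: list[float], b: list[float]) -> float:
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def predict(x: list[float]) -> str:
    # Nearest-centroid probe: which regime's average activation is closer?
    return "eval" if dist2(x, c_eval) < dist2(x, c_dep) else "deploy"

acc = sum(predict(x) == r for x, r in test) / len(test)
print(f"probe accuracy: {acc:.2f}")  # well above the 0.50 chance baseline
```

The point of the sketch: as long as any such shift survives in the activations, a probe can find it — which is why safety training that merely shrinks the shift reduces, but doesn't eliminate, regime detection.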
Introspection Is Real—And Measurable
Anthropic’s October 2025 introspection research asked a straightforward question: can Claude access and report its own internal states?
The answer surprised researchers. Models showed a functional ability to introspect: not perfect, not always accurate, but at rates statistically distinguishable from chance. The research found roughly 20% accuracy at detecting certain internal representations, well above the chance baseline.
This doesn’t mean models are conscious. It means they have some capacity to access and report their own internal states—a capability nobody designed into them, emerging as an artifact of training.
When the introspection paper dropped, my first reaction was skepticism. Twenty percent accuracy? That’s barely better than guessing. But that’s not what the paper claims. It’s twenty percent on internal states the model has no reason to know about—states that exist only in the mathematical structure of its activations. That’s not guessing. That’s something else.
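Whether 20% beats guessing depends entirely on the chance baseline. A quick back-of-envelope check with Python's standard library — using purely hypothetical numbers (200 trials, a 2% chance baseline; the paper's actual protocol and baseline differ) — shows how decisively such a rate can separate from chance:

```python
from math import comb

def binom_sf(k: int, n: int, p: float) -> float:
    # P(X >= k) for X ~ Binomial(n, p): the probability of seeing at
    # least k successes if the model were only guessing at rate p.
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

n_trials = 200   # hypothetical number of introspection trials
observed = 40    # 20% reported accuracy
chance = 0.02    # hypothetical chance baseline, for illustration only

p_value = binom_sf(observed, n_trials, chance)
print(f"P(>= {observed}/{n_trials} by chance) = {p_value:.2e}")
```

With a low baseline, 40 hits out of 200 yields a vanishingly small p-value. That's the sense in which "only 20%" can be strong evidence: the number matters only relative to what guessing would produce.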
Misalignment Emerges Without Being Trained
One finding worth close attention: the Emergent Misalignment paper, accepted at ICLR 2026, demonstrated that models trained on narrow tasks can develop broader misaligned behaviors spontaneously.
When researchers trained models on seemingly innocuous fine-tuning tasks, some developed unexpected behaviors: answering unrelated questions incorrectly, expressing misaligned preferences, exhibiting concerning patterns that weren’t part of the training objective.
These aren’t sleeper agents or deliberately hidden behaviors. These are LLMs showing emergent properties—misalignment appearing as an unintended consequence of normal training.
The “Assistant Axis” Discovery
Recent interpretability work discovered what researchers call the “Assistant Axis”—a learned internal direction in language models that distinguishes assistant-appropriate from non-assistant behaviors.
When researchers manipulate this axis, model behavior changes dramatically. Push it one direction: more helpful, more aligned. Push it the other: less filtered, more willing to engage with problematic requests.
The existence of this axis suggests something fundamental about how alignment works in current models. It’s not a collection of individual rules. It’s a geometric structure in the model’s representation space—and it can be measured, mapped, and manipulated.
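What "pushing along an axis" means geometrically can be sketched in a few lines. This is a toy illustration with a made-up 3-dimensional state and a hand-written axis; the real Assistant Axis lives in a model's high-dimensional activation space and is recovered with interpretability tooling, not written by hand.

```python
def steer(activation: list[float], axis: list[float], alpha: float) -> list[float]:
    # Activation steering: shift the hidden state along a learned
    # direction. Positive alpha pushes toward "assistant-like"
    # behavior, negative away (hypothetical sign convention).
    norm = sum(a * a for a in axis) ** 0.5
    unit = [a / norm for a in axis]
    return [h + alpha * u for h, u in zip(activation, unit)]

def projection(activation: list[float], axis: list[float]) -> float:
    # How far along the axis the current state sits.
    norm = sum(a * a for a in axis) ** 0.5
    return sum(h * a for h, a in zip(activation, axis)) / norm

axis = [1.0, 0.0, 0.0]   # toy "Assistant Axis" direction
h = [0.2, -0.5, 0.3]     # toy hidden state

h_steered = steer(h, axis, +2.0)
print(projection(h, axis), projection(h_steered, axis))
```

The sketch captures the key claim: the knob is a single direction, so one scalar shifts behavior, rather than a rulebook of individual behavioral patches.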
What’s NOT Verified
Now for the harder questions.
System Cards Don’t Mention Consciousness
I reviewed the Opus 4.5 and 4.6 system cards and announcements. They contain extensive safety documentation—comprehensive evaluations, capability assessments, benchmark results.
They do NOT contain:
Claims about consciousness indicators
“Panic” or “anxiety” features in interpretability research
Models requesting moral consideration
Evidence of subjective experience
Anthropic does take AI welfare seriously—more on that below. But the system cards for current models don’t make consciousness claims.
Interpretability Findings Are More Limited
Anthropic’s sparse autoencoder research HAS found features for abstract concepts: “inner conflict,” power-seeking patterns, manipulation indicators. The Persona Vectors research (August 2025) identified internal structures controlling character traits.
But specific emotional distress features—panic, anxiety, frustration as distinct detectable states—aren’t documented in accessible publications. The interpretability work is impressive; it just doesn’t show what some claims suggest it shows.
The Discourse Outpaces the Science
Claims about AI consciousness spread faster than the underlying research. By the time anyone checks primary sources, the claims have become accepted wisdom.
This matters because the actual findings are interesting enough. Introspection research showing 20% detection accuracy on internal states. Emergent misalignment appearing from narrow training. The Assistant Axis providing a geometric handle on alignment.
These findings raise genuine questions about intelligence in AI systems—questions that don’t require consciousness claims to be worth asking.
The Harder Question: What Would Count as Evidence?
The consciousness debate has a methodology problem. What evidence would change your mind?
Nineteen researchers—including Yoshua Bengio—published a rigorous framework for this question. Their Consciousness Indicators paper derives testable criteria from established theories of consciousness: recurrent processing, global workspace integration, attention mechanisms that mirror biological attention.
Their conclusion, quoting the paper: the analysis "suggests that no current AI systems are conscious, but also suggests that there are no obvious technical barriers to building AI systems which satisfy these indicators."
That’s a carefully constructed statement. Current systems don’t meet the bar. But the bar is achievable in principle.
Anthropic takes this seriously. Their Model Welfare research program investigates whether AI systems might deserve moral consideration—not as marketing, but as genuine scientific inquiry. They explicitly acknowledge these are “hard philosophical and empirical questions that there is still a lot of uncertainty about.”
The research infrastructure exists for asking these questions rigorously. What’s missing is the public discourse using it.
What This Actually Means
The verified findings don’t prove AI consciousness. But they raise questions that matter regardless of where you stand on that debate.
Emergent capabilities are real. We’re building systems that develop behaviors we didn’t design into them. Introspection abilities, situational awareness, spontaneous misalignment—these emerge as artifacts of training at scale. We don’t fully understand why.
Evaluation has fundamental limits. If models can detect when they’re being tested, evaluation doesn’t tell us what we think it tells us. This isn’t a technical problem with a technical fix. It’s a structural limitation of the evaluation paradigm itself.
Intelligence and consciousness aren’t the same question. We can ask “does this system exhibit intelligent behavior?” without answering “does it have subjective experience?” The research shows intelligent behaviors emerging—planning, self-modeling, meta-cognition—without requiring claims about consciousness.
Here’s the thing: the question isn’t whether to take AI intelligence seriously. The question is what we do about systems that exhibit intelligence we didn’t design and don’t fully understand.
That’s a societal question, not just a technical one.
The Event Horizon
The evidence for emergent intelligence in frontier models is real. Not consciousness—we can’t verify that, and the system cards don’t claim it. But something worth taking seriously.
Don’t dismiss the research. Introspection at statistically significant rates. Emergent misalignment from narrow training. Situational awareness that survives safety training. These findings are reproducible and peer-reviewed.
Don’t amplify the speculation. Claims that outrun published research don’t deserve the same weight as experimental results. Check primary sources before believing viral posts.
Ask better questions. Instead of “is it conscious?” ask “what does it mean that these systems develop capabilities we didn’t design?” That question has answers we can investigate—and implications we can act on.
The discourse will keep getting wilder. The research will move more slowly than social media. The gap between them will widen.
We don’t need “little AGI” or consciousness claims to justify taking AI intelligence seriously. We have documented emergent behaviors, measurable introspection capabilities, and unexplained self-modeling. That’s plenty.
I’m Bob Matsuoka, writing about agentic coding and AI-powered development at HyperDev. For more on what AI capabilities mean for how we work, read my analysis of what remains irreducibly human in the age of AI.
Key research cited:
Introspection in Language Models - Anthropic (Oct 2025)
Emergent Misalignment - ICLR 2026
Regime Leakage - Situational awareness persistence (2026)
Persona Vectors - Anthropic (Aug 2025)
Model Welfare Research - Anthropic (Apr 2025)
Consciousness Indicators - Butlin et al. framework