Fantasia: The Sorcerer’s Apprentice in AI Development
Magic Days, Disaster Days, and the Chaotic Reality of Building with Agentic AI Tools
I’ve spent the last eight months building with AI orchestration tools while using AI to build them. Some days feel like conducting a symphony—Claude Code humming along, agents coordinating perfectly, client work flowing at 10x normal speed. Other days? The brooms multiply out of control.
If you’ve seen Disney’s Fantasia, you know the scene. Mickey Mouse discovers the sorcerer’s hat, animates a broom to carry water, and watches in delight as his spell works. Then he falls asleep. The broom keeps working. And working. And multiplying. Soon the workshop floods with an army of brooms, each carrying more water, creating more chaos. Mickey frantically chops one broom into pieces, but each piece becomes another broom. The magic that seemed so controllable spirals into disaster.
That’s what building with AI tools often feels like in 2025.
When Everything Aligns
Here’s what a magic day looks like. I’m building Claude MPM, my multi-agent project management framework. The orchestrator agent breaks down client requirements. Worker agents implement features in parallel, each in isolated context windows. The vector search finds exactly the right code patterns. The memory system recalls architectural decisions from three weeks ago. Everything clicks.
I ship features in hours that would’ve taken days. Client demos go flawlessly. The code quality rivals what I’d write manually, sometimes better. The tools anticipate what I need. Suggestions land exactly right. It’s disproportionately effective.
Then Tuesday happens.
Same codebase. Same configuration. Same prompts that worked perfectly yesterday. But now Claude Code takes 30 seconds to respond. Suggestions miss the mark. The orchestrator sends workers down rabbit holes. Context gets corrupted. Simple refactoring tasks spiral into debug sessions.
Nothing changed. Except everything did.
The Infrastructure Kept Changing Beneath Me
Turns out this wasn’t just me being paranoid. In September 2025, Anthropic published a rare, detailed postmortem documenting three infrastructure bugs that degraded Claude Code between early August and mid-September. Three separate issues, overlapping in time:
Bug #1 routed Claude Sonnet 4 requests to servers configured for 1M-token context windows instead of 200K. It started at 0.8% of requests and peaked at 16% after a load balancer update. The routing was “sticky”: once you hit a misconfigured server, you kept hitting it. Anthropic reported that roughly 30% of Claude Code users experienced at least one degraded message.
Bug #2 was worse. A TPU misconfiguration randomly selected wrong tokens during output generation. Thai characters appeared in English responses. Code generated with syntax errors. The system knew the right answer but physically couldn’t output it.
Bug #3 was a miscompilation. The XLA:TPU compiler occasionally caused Claude to exclude the most probable tokens during sampling, effectively breaking the mechanism for selecting appropriate responses.
Six weeks. Three overlapping bugs. And users couldn’t tell which issue they were hitting or whether it was a bug at all. As Anthropic noted in their postmortem: “Claude would often recover from mistakes, making degradation look like normal variance.”
The Non-Determinism Goes Deeper
But infrastructure bugs aren’t the whole story. Even when everything works “correctly,” AI tools exhibit fundamental non-determinism that traditional software never did.
A 2024 study on LLM non-determinism analyzing GPT-4o, Llama 3, Mixtral, and other models found accuracy variations of 5-15% across runs with identical settings (temperature=0, same prompt, same model). Some tasks showed max-min accuracy gaps of up to 72 percentage points: one Mixtral configuration scored 75% on college math problems in one run and 3% in another. Why?
Floating-point operations complete in non-deterministic order across GPUs. When you’re doing billions of calculations and each token depends on all previous tokens, tiny differences compound. Different GPU types (A100 vs L40) produce slightly different numerical results. Mixture-of-experts architectures route tokens to different “expert” sub-networks based on capacity limits—your output depends on which other users want the same experts at the same time.
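If the floating-point part of that explanation sounds abstract, here’s a minimal sketch of the mechanism in plain Python with NumPy. The array and seed are arbitrary; the point is that float32 addition is order-dependent, which is exactly what happens when GPU kernels schedule operations differently from run to run:

```python
import numpy as np

# Toy illustration (not LLM inference code): float32 addition is not
# associative, so summing the same numbers in a different order gives a
# slightly different result.
rng = np.random.default_rng(0)
values = rng.standard_normal(1_000_000).astype(np.float32)

forward = np.sum(values)                     # one summation order
shuffled = np.sum(rng.permutation(values))   # same numbers, different order

print(forward, shuffled, forward - shuffled)
# The two sums typically differ in the low-order bits. In an LLM, billions
# of these tiny discrepancies feed into the next token's logits, and because
# each token conditions on all previous tokens, the differences compound.
```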
For many hosted LLMs (including, likely, Claude), inference optimizations like continuous batching and prefix caching mean your request gets grouped with random other users’ requests. System-level factors beyond your prompt contribute to non-determinism. Cache hits versus cache misses create different computation paths. Memory pressure changes cache eviction policies.
As Anthropic’s SDK documentation explicitly states: “Even with temperature of 0.0, the results will not be fully deterministic.”
This isn’t a bug. It’s architecture.
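You can watch this for yourself by replaying one prompt at temperature 0 and counting the distinct completions. Here’s a minimal sketch using the anthropic Python SDK; the model id is a placeholder, and against a real hosted endpoint you’ll usually see more than one unique output:

```python
import anthropic
from collections import Counter

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

MODEL = "claude-sonnet-4-20250514"  # placeholder; use whichever model id you're on
PROMPT = "Write a one-line Python function that reverses a string."

outputs = []
for _ in range(10):
    msg = client.messages.create(
        model=MODEL,
        max_tokens=200,
        temperature=0.0,  # the "deterministic" setting
        messages=[{"role": "user", "content": PROMPT}],
    )
    outputs.append(msg.content[0].text)

# Count distinct completions across identical requests.
for text, count in Counter(outputs).most_common():
    print(f"{count}x: {text[:80]!r}")
```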
Building Tools While the Platform Evolves
The meta-development challenge compounds everything. I’m building Claude MPM to orchestrate multiple Claude agents. So when Claude Code has an off day, I can’t tell if:
The platform degraded
My orchestration code introduced bugs
My usage patterns changed
The task complexity increased
Some combination of all four (see the logging sketch below)
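What helps, at least a little, is recording enough per-request metadata that those possibilities can be separated after the fact. A generic sketch, not Claude MPM code; the fields are illustrative:

```python
import hashlib
import json
import time
from dataclasses import asdict, dataclass

# Generic sketch (not Claude MPM code): log enough per-request metadata that
# "the platform degraded" and "my orchestration changed" are at least
# distinguishable in hindsight.

@dataclass
class RequestRecord:
    timestamp: float
    model: str         # exact model id the request was sent to
    cli_version: str   # the Claude Code version string, captured however you like
    prompt_hash: str   # stable hash so identical prompts can be compared later
    latency_s: float
    output_chars: int

def log_request(model: str, cli_version: str, prompt: str,
                latency_s: float, output: str,
                path: str = "requests.jsonl") -> None:
    record = RequestRecord(
        timestamp=time.time(),
        model=model,
        cli_version=cli_version,
        prompt_hash=hashlib.sha256(prompt.encode()).hexdigest()[:12],
        latency_s=latency_s,
        output_chars=len(output),
    )
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")
```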
Traditional software has stable foundations. You build on Linux, you know what syscalls do. Build on AWS, the APIs stay consistent. Build with AI tools in late 2025? The foundation shifts frequently.
Anthropic released six major Claude versions in 14 months: Sonnet 3.5, Sonnet 3.7, Opus 4, Sonnet 4, Opus 4.1, Sonnet 4.5. Near-weekly updates to Claude Code itself. Context windows jumped from 100K to 200K to 1M tokens. Rate limits changed. Temperature defaults updated. Tool permissions evolved.
As one AI tool founder described it: “It’s like you’re not just building a product, you’re trying to build a raft while sailing it through a hurricane.”
My vector search worked perfectly with Sonnet 3.5. Then Sonnet 4 changed how it chunked text. The memory system stored architectural decisions assuming 200K context—then had to adapt when 1M became available. Integration points between my tools multiplied faster than I could document them.
And I’m using AI to build all of this. AI-generated code calling AI agents orchestrating AI workers. When something breaks, is it my code? The platform? Both? The recursive debugging gets absurd.
The Community Lives This Daily
I’m not alone in this experience. GitHub Issue #7683 captures it perfectly. A power user who’d generated “billions of tokens” wrote:
“Previously, working with this model felt like collaborating with a Senior Developer—I could trust the output and focus on higher-level concerns. Now, it feels like supervising a Junior Developer where I must minutely review every single line of code, catching basic errors and unwanted additions.”
They measured a 30-40% productivity loss. Tasks taking 1-2 days now required 2-3 days.
Another developer reported an extreme case: “Tasks that typically take me 30 minutes to complete using Cursor now require approximately 8 hours with Claude Code.” That’s 16x degradation.
The Reddit thread “Claude Is Dead” collected 841+ upvotes, more than double the engagement on Anthropic’s official response. Comments described Claude as “significantly dumber,” noting it “ignored its own plan” and “started to lie about the changes it made to code.” Developers reported waves of subscription cancellations and downgrades.
Semgrep’s research on Claude Code found remarkable inconsistency: “Running the exact same prompt on the exact same codebase multiple times often yielded vastly different results. In one application, three identical runs produced 3, 6, and then 11 distinct findings.”
The “magic day vs. disaster day” pattern is familiar to anyone who’s lived with these tools. One developer posted on GitHub: “Some days everything works perfectly, other days nothing gets done despite using the same approaches.”
The Psychology Amplifies the Perception
There’s a documented perception gap. METR’s “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity” tracked 16 developers doing 246 real tasks. Before starting, they expected AI to make them 24% faster. After completing tasks, they still believed AI had sped them up by 20%.
The actual measured outcome? They were 19% slower.
That’s a 39-point perception gap. Developers felt faster even when they were demonstrably slower.
Why? AI feels productive. Instant suggestions. Less typing. Immediate feedback. The actual time spent—reading AI-generated code, debugging suggestions, untangling context pollution—doesn’t register as “waiting.” It feels active. Traditional development has obvious “blocked” moments. AI development hides the inefficiency in constant activity.
Magic days reinforce the illusion. When tools align perfectly, productivity genuinely soars. Those experiences get remembered vividly. Disaster days trigger frustration but get rationalized as anomalies. “Must’ve been a server issue. Tomorrow will be better.”
Confirmation bias does the rest. Stack Overflow’s 2025 survey found only 16.3% of developers said AI made them “greatly” more productive. The largest group (41.4%) reported “little or no effect.” Yet in GitHub’s own Copilot research, 60-75% of developers reported feeling more fulfilled using AI.
Feeling faster doesn’t mean being faster.
When the Magic Works
But here’s the uncomfortable bit: the magic is real when conditions align. When everything works—stable infrastructure, appropriate tasks, good prompts, well-scoped context—AI tools deliver genuine breakthroughs.
Last month I needed to refactor a complex Next.js component that had grown organically over six months. Multiple state managers, tangled props, callbacks everywhere. Manual refactoring would’ve taken a day minimum, high risk of introducing subtle bugs.
Claude Code did it in 47 minutes. The orchestrator analyzed dependencies. Worker agents handled isolated sections. Testing agents validated behavior. The result was cleaner than what I’d write manually, properly typed, with edge cases I’d have missed.
That’s not perception. That’s measurable value.
Research on GitHub Copilot acceptance patterns found time-of-day differences in how often developers accepted suggestions: weekend acceptance rates differed from weekday patterns, and non-working hours looked different from working hours. That measures developer behavior as much as tool quality, but it matches the lived experience: these tools genuinely have good days and bad days. And when they’re good, they’re orders-of-magnitude faster.
Building Claude MPM taught me this viscerally. The framework coordinates multiple specialized agents for different project phases. When orchestration works, I watch agents collaborate like an experienced team. Planning agent breaks down requirements. Architecture agent designs components. Implementation agents build in parallel. Review agent catches issues before they reach me.
It feels like conducting a symphony. Every instrument knows its part. The timing lands perfectly. The result exceeds what any single instrument could achieve.
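In code, the shape of that pipeline looks roughly like the following. This is a toy sketch, not Claude MPM’s actual implementation; run_agent is a stand-in for whatever sends a task to a model primed with an agent-specific system prompt:

```python
from concurrent.futures import ThreadPoolExecutor

def run_agent(role: str, task: str) -> str:
    # Placeholder: in a real system this would call a model with a
    # role-specific system prompt and return its reply.
    return f"[{role}] {task[:60]}..."

def orchestrate(requirements: str) -> str:
    # Planning agent breaks requirements into independent work items.
    plan = run_agent("planner", f"Break this into independent work items:\n{requirements}")
    # Architecture agent designs the components the workers build against.
    design = run_agent("architect", f"Design components for:\n{plan}")

    work_items = [line for line in plan.splitlines() if line.strip()]
    # Implementation agents run in parallel, each with its own isolated context.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(
            lambda item: run_agent("implementer", f"{design}\n\nImplement: {item}"),
            work_items,
        ))

    # Review agent catches issues before anything reaches the human.
    return run_agent("reviewer", "Review these changes:\n" + "\n---\n".join(results))
```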
The Sorcerer’s Return
In Fantasia, the Sorcerer returns. One gesture, and the chaos stops. The brooms freeze mid-stride. Order is restored instantly. Mickey sheepishly returns the hat.
In AI development, there are moments like that. After the August-September bugs, Anthropic shipped fixes. For a few days, everything worked beautifully. Claude Code responses landed fast and accurate. Context management felt solid. Brief stability returned.
Then version 2.0.30 shipped. New features, new edge cases, new patterns to learn. The cycle begins again.
But that’s the nature of building on rapidly evolving infrastructure. The platforms will stabilize eventually. Vendors are investing heavily in reliability—Anthropic’s postmortem shows they take these issues seriously. The technical challenges of non-determinism and infrastructure optimization won’t disappear, but they’ll become more manageable.
What I don’t expect to change is the fundamental experience: sometimes the brooms carry water beautifully, sometimes they flood the workshop. You learn to work with both.
Living with the Magic and Chaos
I keep building Claude MPM. Keep shipping client projects. Keep writing about these tools. Not because the chaos disappeared; it didn’t. Because when everything aligns and the water flows exactly where it should, the results are worth every flooded workshop.
The August-September bugs taught me something valuable: keep backups. I maintain an Augment Code subscription now, not because I prefer it, but because when Claude Code has a bad day, having an alternative prevents complete workflow collapse. I’ve learned to recognize the early signs of platform degradation (unusually slow responses, repetitive suggestions, context confusion) and switch tools before wasting hours.
The meta-development challenge remains. Building orchestration tools while those tools depend on evolving platforms creates compound instability. But it also provides unique insight. I understand these tools deeply precisely because I build with them, on them, and for them.
The Sorcerer’s Apprentice analogy captures AI development perfectly. Mickey didn’t give up on magic after the workshop flooded. He learned respect for the power he was wielding. The next time he picked up the hat, he’d be more careful. More prepared. More aware of what could go wrong.
But he’d absolutely pick it up again. Because when it works, you can do things you never imagined possible.
Zooming out: 2025 feels like we’re all Mickey, learning to work with forces more powerful than we fully understand. The brooms still multiply sometimes. The workshop still floods. But on the days when everything aligns? Those days represent some of the most technically exciting work I’ve tackled in 25 years of software development.
I’m Bob Matsuoka, writing about agentic coding and AI-powered development at HyperDev. For more on the reality of building with AI tools, read my analysis of multi-agent orchestration in practice or my deep dive into the other shoe dropping on AI pricing.