The Year of the Fire Horse - Part 2

The Western Field

Jun 24, 2026

This is Part 2 of a three-part read on the Fire Horse year in AI coding. Part 1 set up the conceit, twelve coding models mapped to the twelve animals of the Chinese zodiac in a year (丙午, the once-in-sixty Fire Horse) that earned its reputation for upheaval, and walked through the Chinese front-runners and the $60 billion SpaceX-buys-Cursor deal that reframed the whole field. (See Part 1 for the benchmark caveats; every version number and vendor-versus-independent distinction below assumes the three-question rule laid out there.)

This part is the Western field, and it carries its own argument. The Western frontier still leads the hardest work, the off-script agentic jobs where finishing is the whole point. But the Western open-weight story collapsed. Meta and Mistral built the open movement, and both got outrun on code by the Chinese open models. The open coding lead crossed an ocean, and that is the thread this part follows to its end.

TL;DR

The Western frontier still leads the hardest agentic work, and on that tier the price premium earns out — the Sharda “$200 beat the $30” verdict lands here.
Gemini 3 Flash dethroned its own bigger sibling on coding (78% vs 76.2% SWE-bench Verified). The “cheap weak Flash” framing is obsolete; watch for the citation error that attributes the 78% to the old 2.5 Flash.
Gemma 4 jumped roughly 3x on coding in one generation (LiveCodeBench v6 80.0% vs Gemma 3 27B’s 29.1%), and Gemma 3n runs multimodal in 2–3GB on the edge.
Llama stalled: a stale lineup, a closed pivot (Muse Spark), 15.6% on Aider Polyglot against Chinese open models scoring roughly 4x to 5x higher, and Yann LeCun saying the Llama 4 results “were fudged a little bit.”
Mistral sells sovereignty more than raw capability now, and the EU-data edge is narrowing. Set Llama and Mistral side by side and the open coding lead has crossed an ocean.

The animals that hold the line

The order here runs roughly down a visibility-and-strength gradient: the frontier reasoners first, then the local family, then the two foundational Western open-weight players that built the movement and got passed.

Dog — Claude (Anthropic)

The dog is loyal, faithful, and protective, the one that finishes the job and guards the gate. Claude Opus is the model developers reach for when correctness matters and the one that holds the line on guardrails. Repeatedly in the research, it is also the one that finished.

The capability story is about consistency under pressure. DataScienceDojo ran Kimi K2.6 against Claude Sonnet 4.6 and found Kimi capable but Claude more consistent — Claude added a DELETE endpoint nobody asked for, flagged a Redis warning, and applied type-level validation unprompted. Composio’s harder test landed on the line Part 1 first quoted from the Monkey’s side: “Opus was expensive, but it finished. Kimi just could not put it all together once the task got real.”

And the GLM thread from Part 1 closes here, on the economics. Ashish Sharda’s much-quoted “I Tested GLM-4.6 for 2 Weeks and Went Back to Claude” landed on the argument against pure cost optimization: “The $200/month AI model beat the $30/month alternative. Sometimes expensive is worth it.” That was an older GLM, and the framing has aged — GLM-5.2 now posts the top open-weight index score in the field and sits #2 on Code Arena, so the cheaper model is no longer a lightweight you outgrow. The point that survives is narrower and still holds: on the hardest agentic work, Opus is the one that finishes, and the developers pairing GLM for volume with Claude for the hard problems are routing to exactly that tier. The dog is expensive to keep. It also guards the thing you cannot afford to lose.

Tiger — GPT (OpenAI)

The tiger is bold, fierce, and competitive, a predator that leads from the front. GPT-5.5 leads the field on terminal work, topping Terminal-Bench 2.0 by 13 points.

The tiger does not appear much in the open-weight cost debate because it is not playing that game. Its claim is one hard surface: the command line, where an agent has to chain real operations against a real environment and not lose the thread. Cursor’s routing guidance sends shell-heavy terminal work to GPT-5.5 for exactly this reason. On its territory, it leads.

Rooster — Gemini (Google)

The rooster is showy, observant, punctual, and loud at dawn. The whole Gemini family struts in fast and crowing, and the loudest crow is internal: the lean Gemini 3 Flash dethroned its own bigger sibling, Gemini 3 Pro, on coding. The framing most people carry for Flash is obsolete. “The cheap, weak sibling” stopped being true at the end of last year.

Gemini 3 Flash shipped December 17, 2025, and beat Gemini 3 Pro on coding: 78% SWE-bench Verified for Flash against 76.2% for Pro. The smaller, cheaper model won. That 78% figure is the source of a common citation error — people attribute it to Gemini 2.5 Flash, which is the previous generation. The number belongs to the 3 Flash family. If you see “Flash beats Pro at 78%,” confirm which Flash before you repeat it.

The current Flash earns the upset. Gemini 3 Flash runs roughly 4x cheaper than 3 Pro (around $0.50/M input) and about 3x faster (218 tokens/sec), and it scores 90.4% on GPQA Diamond. Cursor, Cline, JetBrains AI, and Gemini CLI adopted it immediately. The latest, Gemini 3.5 Flash, beats Gemini 3.1 Pro on Terminal-Bench 2.1 at 76.2% and is described as Google’s strongest agentic model.

None of which retires Pro. The developer consensus is hybrid routing, not replacement. Flash handles 80–90% of the work — summarization, classification, tagging, structured pipelines, autocomplete — indistinguishably from Pro. Pro still earns its keep on architectural reasoning, complex multi-file refactors, novel algorithms, and the off-script agentic tasks where, as one developer put it, “Flash gets confused when a task goes off-script.” Cursor’s routing still sends deep-architecture and long-context work to the heavyweight reasoning tier, and Pro is the Gemini family’s answer there. The rooster reasons deep in one body and runs fast in the other, and that “off-script” line could be the epigraph for this entire field.
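The hybrid-routing consensus can be sketched as a small dispatcher. This is an illustration of the idea, not any vendor's routing logic: the task categories come from the paragraph above, while the tier labels, the `Task` type, the `off_script` flag, and the $2.00/M Pro price (inferred from "roughly 4x cheaper" than $0.50/M) are assumptions made for the example.

```python
from dataclasses import dataclass

# Task kinds the article says Flash handles indistinguishably from Pro.
FLASH_TASKS = {"summarization", "classification", "tagging",
               "structured_pipeline", "autocomplete"}

# Task kinds the article says still justify the heavier model.
PRO_TASKS = {"architectural_reasoning", "multi_file_refactor",
             "novel_algorithm"}


@dataclass(frozen=True)
class Task:
    kind: str
    off_script: bool = False  # hypothetical flag: the task left the expected path


def route(task: Task) -> str:
    """Return a tier label, "flash" or "pro", for a task."""
    if task.off_script or task.kind in PRO_TASKS:
        return "pro"
    if task.kind in FLASH_TASKS:
        return "flash"
    # Unknown work falls back to the stronger tier (a conservative
    # default chosen for this sketch, not a documented rule).
    return "pro"


def blended_input_cost(flash_share: float,
                       flash_price: float = 0.50,
                       pro_price: float = 2.00) -> float:
    """Average input price per million tokens for a given Flash share.

    pro_price is an assumption derived from the "roughly 4x" figure.
    """
    if not 0.0 <= flash_share <= 1.0:
        raise ValueError("flash_share must be between 0 and 1")
    return flash_share * flash_price + (1.0 - flash_share) * pro_price
```

Under those assumed prices, sending 85% of input tokens to Flash gives `blended_input_cost(0.85)` of about $0.725 per million, against $2.00 for routing everything to Pro. The fallback to the stronger tier for unrecognized work reflects the "off-script" warning: the cheap tier is only trusted on tasks it is known to handle.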

Rabbit — Gemma (Google’s open-weight family)

The rabbit is gentle, quiet, and home-bound, and Gemma is Google’s open-weight family — the calm local helpers that never pretended to be frontier coders, then quietly got 3x better at code in one generation. The Gemma arc runs across three things: the workhorse, the leap, and the tiny one that runs where nothing else will.

Start with the leap, because it dates everyone’s article. Gemma 3 is the previous generation. Gemma 4 landed April 2, 2026 (up to 31B dense plus a 26B MoE) under a full Apache 2.0 license, and the jump is large. On LiveCodeBench v6, Gemma 4 31B reportedly scores 80.0% against Gemma 3 27B’s 29.1%, roughly 3x better at coding in a single generation (about 2.7x on those exact figures). That number is vendor-reported and secondary-sourced, not yet on an independent leaderboard, so hold it loosely; the generational jump is the durable part.

Gemma 3 was the quiet local utility the family was known for, not a coder: agentic tool use on τ²-bench Retail at 6.6%, and Fixstars’ hands-on found hallucinations on technical detail and no capacity for complex agentic VSCode work. What Google marketed was the LMArena 1338 score, which beats GPT-4o and Claude 3.7 Sonnet — but that measures chat preference, not coding ability, and the two should not get conflated. None of which made Gemma 3 useless: it earned real adoption as a free local utility for JSON extraction, log parsing, and code explanation.

Then there is Gemma 3n, the tiny edge variant that fits where nothing else does. Clever architecture lets the full 5B and 8B models run in 2GB and 3GB of memory, multimodal across image, audio, video, and text. A privacy and offline play, not a coding rival. The rabbit wins by being where the others cannot go: edge devices, air-gapped machines, the laptop with no connection.

Access runs through Ollama (override the default num_ctx of 2048 or earlier context gets dropped once a session outgrows the window), llama.cpp, LM Studio, and Vertex AI. There is no native Cursor or Windsurf integration.
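The num_ctx override mentioned above can be set per request. The sketch below builds a call to Ollama's local REST endpoint (`/api/generate`, with the context window passed under `options`) using only the standard library. The model tag `gemma3n:e4b` and the 8192 value are assumptions for illustration; substitute whatever tag `ollama list` shows on your machine and a window your hardware can hold.

```python
import json
import urllib.request

# Ollama's default local address; change it if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str,
                  model: str = "gemma3n:e4b",  # assumed tag, for illustration
                  num_ctx: int = 8192) -> urllib.request.Request:
    """Build (but do not send) a generate request with an explicit context window."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,                  # ask for one JSON object, not a stream
        "options": {"num_ctx": num_ctx},  # per-request context window override
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, **kwargs) -> str:
    """Send the request to a running Ollama server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["response"]
```

To make the setting stick without passing it on every call, a Modelfile line such as `PARAMETER num_ctx 8192` should have the same effect. Larger windows cost memory, which matters on the small machines this family is meant for.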

Pig — Llama (Meta)

The pig is the sign of abundance and generosity. That was Llama: Llama 2 and Llama 3 became the base layer for the open-weight movement, the foundation under thousands of derivatives, fine-tunes, and quantizations. Meta’s generosity built the ecosystem everyone else now competes in. The pig is also the sign of complacency, and the 2026 Llama story is a provider that got too well-fed to move while leaner animals ran past it on code.

The lineup went stale, then went closed. The Llama 4 models shipped in April 2025 and were never refreshed; the ~2T Behemoth was shelved as of May 2026; Muse Spark, released that same month, is Meta’s first closed-weight, API-only model, reportedly lagging on coding; and Llama 5 slipped to roughly 2027. Andrew Ng called the retreat from open weights “a significant loss for the developer community.” The biggest model in the family is stuck in the pen, and the family stopped being fully open.

The coding numbers are why none of that got forgiven. On Aider Polyglot, Llama 4 Maverick scores 15.6%, against Kimi K2 at 59.1%, Qwen3-235B at 59.6%, and DeepSeek-V3.2 Reasoner at 74.2%: a gap of roughly 4x to 5x to the Chinese open models. Llama 4 also doubles as the cleanest benchmark-trap example in the series, the callback to Part 1’s rule. For LMArena, Meta submitted a conversationality-tuned variant that ranked around #2; the actual public release ranked #32, and LMArena rebuked Meta for the swap. Yann LeCun, who left Meta in November 2025, told the Financial Times in January 2026 that the Llama 4 results “were fudged a little bit.” It is one case study in why a vendor number means nothing until an independent leaderboard confirms it.
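The size of that gap is simple division on the Aider Polyglot figures quoted above. The snippet only restates the article's numbers; the `gaps` helper is a name invented for the example.

```python
# Aider Polyglot scores as quoted in the paragraph above (percent).
AIDER_POLYGLOT = {
    "Llama 4 Maverick": 15.6,
    "Kimi K2": 59.1,
    "Qwen3-235B": 59.6,
    "DeepSeek-V3.2 Reasoner": 74.2,
}


def gaps(scores: dict, baseline: str) -> dict:
    """Return each model's score as a multiple of the baseline model's score."""
    base = scores[baseline]
    return {
        name: round(score / base, 2)
        for name, score in scores.items()
        if name != baseline
    }


if __name__ == "__main__":
    for name, ratio in gaps(AIDER_POLYGLOT, "Llama 4 Maverick").items():
        print(f"{name}: {ratio}x Llama 4 Maverick")
```

On these figures the multiples run from about 3.8x (Kimi K2, Qwen3-235B) to about 4.8x (DeepSeek-V3.2 Reasoner).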

Llama still ships under the Llama Community License, not an OSI-approved one, and has been passed in both openness and downloads. DeepSeek (MIT) and Qwen (Apache 2.0) are more open, and Qwen overtook Llama as the most-downloaded open family on Hugging Face. Chinese models hold four of the top five open-weight slots, with GLM-5.2 now the #1 open-weight model on the Artificial Analysis index, ahead of Qwen, Kimi, and DeepSeek. The money did not buy the code scores: Meta runs the best-funded open lab there has ever been, and r/LocalLLaMA’s reaction to Llama 4’s coding was negative anyway. Access is everywhere, but there is no first-party coding default anywhere, because the scores do not earn one.

Rat — Mistral (France)

The rat is first in the zodiac, clever and nimble, the outsider that thrives in the cracks and punches above its weight. Mistral fits cleanly: the European challenger named for the cold, fast wind off the Alps, a fraction of OpenAI’s war chest, surviving against far larger labs by being efficient and by owning a niche nobody else can — sovereignty. It made its name with Mistral 7B in September 2023, which beat Llama 2 13B at half the parameter count. The rat against the pig, literally, on the first release.

The current flagship open model is Mistral Large 3 (December 2025), a 256K-context MoE under Apache 2.0 that Microsoft markets as “the strongest fully open model developed outside of China” — praise and an admission in one sentence.

On coding, keep two models straight, because the names invite a mistake. Codestral (codestral-2508) is a fill-in-the-middle specialist: its FIM pass@1 is around 95.3% (vendor-reported, class-leading), and it quietly wins the tab-completion slot in a lot of editors. Its Aider Polyglot score is only 11.1% (independent) — but that is a FIM model measured on an agentic task, the same “which benchmark, which task” trap the Llama section just walked through. The agentic model is Devstral, which scores 53.6–61.6% on SWE-bench Verified. Use Devstral, not Codestral, for SWE-bench comparisons — and ignore any source citing “Codestral 2 with Apache 2.0,” which does not exist.
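Why a fill-in-the-middle model looks weak on an agentic benchmark is easier to see from the shape of its input. A FIM request hands the model the text before and after the cursor and asks only for the gap, with no planning, tool calls, or multi-file edits. The sketch below shows that split and the splice back into the buffer; the `prompt` and `suffix` keys mirror field names commonly used by FIM endpoints but should be treated as assumptions, and no real model is called.

```python
def split_at_cursor(source: str, cursor: int) -> dict:
    """Split an editor buffer into the two halves a FIM model is given."""
    if not 0 <= cursor <= len(source):
        raise ValueError("cursor is outside the buffer")
    return {"prompt": source[:cursor], "suffix": source[cursor:]}


def splice(source: str, cursor: int, completion: str) -> str:
    """Insert a completion at the cursor, leaving the rest of the buffer intact."""
    if not 0 <= cursor <= len(source):
        raise ValueError("cursor is outside the buffer")
    return source[:cursor] + completion + source[cursor:]


# A buffer with the cursor sitting right after "return ".
buffer = "def area(r):\n    return \n"
cursor = buffer.index("return ") + len("return ")

request = split_at_cursor(buffer, cursor)

# Stand-in for what a FIM model might return; hard-coded here, not generated.
completed = splice(buffer, cursor, "3.14159 * r * r")
```

An agentic benchmark such as SWE-bench Verified or Aider Polyglot asks for the opposite kind of work: reading a repository, deciding what to change, and editing across files. A low score there says little about how well a model fills the gap in `request`.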

The rat’s real moat is jurisdiction, not a leaderboard score. The pitch is sovereign AI: data never leaves the EU, GDPR and EU AI Act native. The French Ministry of Armed Forces signed a 2026–2030 framework; Macron told citizens to “download Le Chat rather than ChatGPT.”

Be fair about the limits. The EU AI Act edge is narrowing — Mistral, OpenAI, Anthropic, Google, and Microsoft all signed the EU AI Act Code of Practice in July 2025, while Meta declined and the Chinese firms did not, so the real divide is US/EU signatories versus China and Meta, not Mistral alone. And the skeptic’s line is hard to answer on capability: why pay Mistral on-prem when you could run Qwen for free?

The coding lead crossed an ocean

Set the Pig and the Rat side by side and the thesis arrives from a fresh direction. Llama and Mistral are the two foundational non-Chinese open-weight players, the labs that built the Western open movement, and both have been outrun on coding by the Chinese open models — the Dragon, Ox, Monkey, and Goat from Part 1. The Western open-weight story in 2026 is Meta retreating to a closed model and Mistral surviving on a sovereignty niche rather than on raw capability. The coding lead did not just shift between companies. It crossed an ocean.

That leaves the frontier still in Western hands for the hardest work, and the open flank fallen. Which sets up the question the whole series was built around. Twelve animals across two hemispheres, and underneath all of them, one question: who controls the inference, and can you trust it with your code? Part 3 is the governance reckoning.

Bob Matsuoka is CTO of Duetto and writes about AI-powered engineering at HyperDev.

Related reading:

AI Power Ranking — Tool comparisons and benchmarks for AI practitioners
LinkedIn Newsletter — Strategic AI insights for CTOs and engineering leaders
