The Year of the Fire Horse

Part 1 — The East Rises

Jun 22, 2026

This is Part 1 of a three-part read on the Fire Horse year in AI coding, the one that sets up the frame and walks through the disruptors.

2026 is the Year of the Horse in the Chinese zodiac, and not the ordinary kind. It is a Fire Horse year (丙午, Bing Wu), the pairing that comes around once every sixty years. The last one was 1966. Tradition gives the Fire Horse a reputation for boldness, intensity, and upheaval, the kind of year people warn about rather than wish for. As a frame for the state of AI coding tools, it is almost too neat. This is the most chaotic, multi-polar stretch the field has seen, and a large share of the disruption comes from Chinese labs. The animal fits twice.

I had a different article planned. A straight survey of Chinese coding models, Kimi and DeepSeek and Qwen, and whether you should trust them with your code. Then SpaceX agreed to acquire Cursor for $60 billion in stock. Anysphere, Cursor’s parent, would become a wholly owned SpaceX subsidiary; the deal was announced June 16 and is expected to close in Q3. SpaceX absorbed xAI back in February, which means the editor sitting inside 64% of Fortune 500 development workflows now answers, eventually, to the same org chart as Grok. The acquisition reframed the whole piece.

The question developers keep asking about DeepSeek and Kimi is “where does my code go, and who can read it?” That question now applies to the most popular AI editor in the enterprise. The model layer inside your IDE has become a governance problem, and it does not much matter whether the party on the other side is in Beijing or in Boca Chica. The shape of the concern is the same. Only the jurisdiction’s law changes.

So I kept the Chinese-model framing and pushed it further. To organize the field, I mapped twelve models to the twelve animals of the zodiac, one animal per model family, which forced the labs that ship a half-dozen variants into a single sign with their internal rivalries inside the section. The mapping is most certainly a device (but a fun one, don’t hate!), and it makes a quiet point along the way: the animals land on Chinese and Western labs alike, and a Chinese frame ends up describing the whole field. That is the through-line this series pays off in Part 3: provenance was always a weak proxy for the question that matters, namely who controls the inference and what they are allowed to do with what passes through it.

This part covers the frame, the methodology behind every number below, and the front-runners pushing the field. Part 2 turns to the Western incumbents and the collapse of the open-weight flank. Part 3 is the governance reckoning the series builds toward.

TL;DR

It is a Fire Horse year, the once-in-sixty-years sign of upheaval, and AI coding earned the label: a $60B acquisition, a collapsing capability gap, and a dozen credible model families from both hemispheres.
Almost every benchmark number you have seen for a Chinese coding model is either self-reported or conflates versions. Kimi K2 alone has shipped six named releases in eleven months. Treat headline scores as marketing until an independent leaderboard confirms them.
DeepSeek competes on price, not the hardest work: V4-Pro runs 34–86x cheaper than Opus 4.8, posts a vendor-reported 80.6% on SWE-bench Verified, and falls behind where the work gets difficult.
Qwen has the cleanest enterprise story of the Chinese front-runners: Bedrock, Azure, and Apache 2.0 licensing. It also just passed Llama as the most-downloaded open family on Hugging Face.
Cursor’s own default model, Composer 2.5, is built on Moonshot’s Kimi K2.5. The strongest model in the most-deployed enterprise editor is already a Chinese-origin model fine-tuned by an American company.
The quiet spoiler is GLM-5.2 (Zhipu): the #1 open-weight model on the independent Artificial Analysis index (#4 across all models) and #2 on LMArena’s Code Arena, from a lab most Western developers were not tracking.

The benchmark trap

Start with some discomfort. Most of the numbers in this debate are unreliable, and not because anyone is lying outright. They are unreliable because of two structural problems: version conflation and contamination. This section is the methodological spine for the whole series, so it sits up front.

Take Kimi K2. Here is the release history: K2 in July 2025, K2-Instruct-0905 in September, K2 Thinking in November, K2.5 in January 2026, K2.6 in April, and K2.7-Code this month. Six checkpoints. When a blog post says “Kimi scores 65.8% on SWE-bench,” that figure belongs to the original K2. K2.6 scores 80.2%. Those are different models with the same nickname, and people cite them interchangeably. The result is a comparison that means nothing.

Contamination is the second problem, and it is worse because it is invisible. SWE-bench Verified is a static dataset. Models trained after its release have, to varying and unmeasurable degrees, seen the answers. SWE-rebench, presented at NeurIPS 2025, rebuilt the evaluation on decontaminated tasks and watched scores fall across the board: GPT-4.1 dropped from 31.1% to 26.7%, LLaMA-3.3-70B from 18.1% to 11.2%. The Llama drop returns in Part 2, where Meta’s benchmark numbers get a chapter of their own. The headline numbers are inflated, and the inflation is not uniform.
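
To see that the inflation is not uniform, it helps to put the two drops quoted above on the same footing. Here is a minimal Python sketch that uses only the SWE-rebench figures cited in this section; the function name is mine, for illustration.

```python
def score_drop(original: float, decontaminated: float) -> tuple[float, float]:
    """Return (absolute drop in points, relative drop as a percentage)."""
    absolute = original - decontaminated
    relative = absolute / original * 100
    return absolute, relative


# Figures quoted above: score on the static set vs. the decontaminated rebuild.
quoted = {
    "GPT-4.1": (31.1, 26.7),
    "LLaMA-3.3-70B": (18.1, 11.2),
}

for model, (before, after) in quoted.items():
    points, percent = score_drop(before, after)
    print(f"{model}: -{points:.1f} points ({percent:.1f}% of the original score)")
```

On these figures GPT-4.1 gives up about 14% of its original score and the Llama model about 38%, which is why no single correction factor rescues a contaminated leaderboard.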

So here is the rule I use, and the one I would suggest you adopt. When you see a coding score, ask three questions. Which exact version? Self-reported or independent? On a contaminated benchmark or a fresh one? The most trustworthy independent sources right now are the Aider Polyglot leaderboard, the Artificial Analysis Intelligence Index, and Scale AI’s SWE-bench Pro. Everything else is a starting point, not a conclusion. A zodiac of twelve animals is also twelve moving targets. Every name below has a version number, and sometimes a whole product line, hiding behind it.
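
The three questions are easy to turn into a checklist you can run against any score before it goes in a slide. What follows is a rough Python sketch, not a standard: the class, field names, and verdict labels are all hypothetical, and the sample claim is the unversioned blog-post line quoted earlier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScoreClaim:
    model_family: str
    score: float
    benchmark: str
    exact_version: Optional[str] = None  # e.g. "K2.6", not just "Kimi"
    independent: bool = False            # measured by someone other than the vendor
    fresh_benchmark: bool = False        # decontaminated or recently rebuilt task set


def red_flags(claim: ScoreClaim) -> list[str]:
    """List which of the three questions the claim fails."""
    flags = []
    if not claim.exact_version:
        flags.append("no exact version given")
    if not claim.independent:
        flags.append("not independently measured")
    if not claim.fresh_benchmark:
        flags.append("static benchmark, possibly contaminated")
    return flags


def verdict(claim: ScoreClaim) -> str:
    """Illustrative labels only: any red flag demotes the score."""
    return "starting point" if red_flags(claim) else "usable evidence"


# The unversioned claim from the release-history example above.
blog_claim = ScoreClaim(model_family="Kimi", score=65.8, benchmark="SWE-bench")
print(verdict(blog_claim), red_flags(blog_claim))
```

The defaults are deliberately pessimistic: a claim that says nothing about version, source, or benchmark freshness fails all three checks until someone fills them in.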

The animals that lead

With that caveat doing its work, here is the field as of mid-June, one animal per family. The Horse and Dragon open the series — the acquisition that reframed it and the lab that moved the market — then the rest of the Chinese front-runners, before a closing reveal that ties the East to the West.

(Fire) Horse — Grok (via SpaceX)

The year’s own animal, and the only one that arrives by acquisition. The Fire Horse is fast, wild, and short on guardrails, which is a fair description of how Grok is charging into developer tooling — not by building a foothold but by buying one. SpaceX merged with xAI in February 2026. Grok itself has about 6% enterprise adoption and no developer-tools presence to speak of. Then, on June 16, SpaceX agreed to pay $60 billion in stock for Cursor — announced, not yet closed, with the deal expected to complete in Q3.

The strategic logic is plain: buy the distribution Grok could not earn. Cursor reportedly sits in 64% of the Fortune 500 and runs roughly $4 billion in annualized revenue, up from $2 billion in February. Grok had the model and none of the reach. The horse does not wait for permission.

No changes to model access have been announced, and no commitment to keep Cursor model-agnostic post-close has been made either. The analyst concerns are specific. Jason Andersen of Moor Insights: “xAI’s models and treatment of guardrails are very different than what Cursor has stood for,” and “Will Cursor be able to point at models other than Grok?” Justin Greis of Acceligence on the data posture: “For many enterprise customers, Cursor’s zero-data-retention policy was not simply a security feature” — it was foundational to procurement approval. Sanchit Vir Gogia of Greyhound put the structural point plainly: Cursor “sits inside the act of software creation, close to the intellectual-property bloodstream.”

Hold that last quote. It is the hinge of the governance argument, and Part 3 picks it up where this part leaves it.

Dragon — DeepSeek

The dragon is the only mythical creature in the zodiac, the sign of power and good fortune, and it goes to the only model that actually moved the market. DeepSeek is the lab that produced “the DeepSeek moment,” the release that made Western labs recalculate. It earns the dragon.

First, kill a rumor. There is no DeepSeek R2. The CEO was reportedly dissatisfied with it, Reuters noted no timeline, and every “R2 spec sheet” circulating online is speculation that contradicts the next one. If a comparison cites R2 numbers, the comparison is fiction.

The real timeline runs through the V-series, ending at V4, released April 24, 2026, as the current flagship — a trillion-plus-parameter MoE under an MIT license.

The much-cited V4 figure, SWE-bench Verified at 80.6%, is vendor-reported and does not reproduce independently; treat it as a DeepSeek claim, not a confirmed result. On the harder SWE-bench Pro, V4-Pro scores 55.4 against Claude Opus 4.7’s 64.3, meaningfully behind where the work gets difficult. The same pattern runs through the whole series.

Price is where DeepSeek actually competes. One dollar buys about 1.15 million output tokens from V4-Pro versus roughly 40,000 from Opus 4.8 — call it 34x cheaper on input and 86x on output. For high-volume, low-stakes generation, that math is hard to ignore.
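
The conversion between tokens per dollar and a price multiple is simple enough to check yourself. Here is a small Python sketch that uses only the two output-token figures quoted in this section; the function names are illustrative.

```python
def price_per_million(tokens_per_dollar: float) -> float:
    """Convert 'tokens per dollar' into 'dollars per million tokens'."""
    return 1_000_000 / tokens_per_dollar


def cost_multiple(cheap_tokens_per_dollar: float, pricey_tokens_per_dollar: float) -> float:
    """How many times more tokens a dollar buys from the cheaper model."""
    return cheap_tokens_per_dollar / pricey_tokens_per_dollar


# Output tokens per dollar, as quoted above (both approximate).
V4_PRO = 1_150_000
OPUS = 40_000

print(f"V4-Pro: ${price_per_million(V4_PRO):.2f} per million output tokens")
print(f"Opus:   ${price_per_million(OPUS):.2f} per million output tokens")
print(f"Multiple: {cost_multiple(V4_PRO, OPUS):.1f}x")
```

On those two figures alone the output multiple comes to roughly 29x rather than 86x, so the headline multiples and the tokens-per-dollar figures are likely drawn from different price sheets or tiers. Rerun this with the prices you actually pay before quoting a multiple.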

The hands-on reports split the way you would expect: capable, cheap, uneven. One developer found V4-Pro generated “subpar slightly buggy code” on sequential tasks; another said V4 Flash “often outdoes Kimi 2.6 on problems involving complex spatial reasoning.” One operational note that will bite people: the deepseek-chat and deepseek-reasoner endpoints retire July 24, 2026. If you have them wired into anything, migrate now.
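
Finding out whether you are exposed takes a few lines. This Python sketch scans text for the two retiring names mentioned above. The sample config is hypothetical, and the script deliberately does not suggest replacement identifiers, since those should come from DeepSeek's own migration notes.

```python
import re

# Identifiers named above as retiring on July 24, 2026.
RETIRING = ("deepseek-chat", "deepseek-reasoner")
PATTERN = re.compile("|".join(re.escape(name) for name in RETIRING))


def find_retiring_refs(text: str) -> list[tuple[int, str]]:
    """Return (line number, identifier) for each retiring name found in text."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in PATTERN.finditer(line):
            hits.append((line_no, match.group(0)))
    return hits


# Hypothetical config snippet, for illustration only.
sample = """\
provider: deepseek
model: deepseek-chat
fallback_model: deepseek-reasoner
"""

for line_no, name in find_retiring_refs(sample):
    print(f"line {line_no}: {name} retires July 24, 2026")
```

To sweep a repository, read each config or source file and pass its text through the same function.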

Ox — Qwen (Alibaba)

The ox is the methodical workhorse, dependable and unglamorous, and Qwen is the Chinese player most developers have not actually evaluated. That is a mistake, because it has the cleanest enterprise story of the Chinese front-runners: the broadest legitimate access, and the compliance paperwork to back it.

The current flagship is Qwen3-Coder-Next, released February 2026. Vendor-reported SWE-bench Verified lands at 70.6–71.3%, but the number to trust is independent: Scale AI scored last July’s Coder-480B at 38.70 (±3.55) on SWE-bench Pro, sobering against the vendor figures. The original Qwen3-235B marketing was inflated too — Artificial Analysis put it at 17, below average for its class — while the 2507 refresh is legitimately competitive at 25.

The hands-on reads are mixed. A Better Stack test pitted a smaller Qwen 3.5 variant against Claude Sonnet 4.5 and Claude won decisively; Qwen produced a blank project and could not self-diagnose. Artificial Analysis also flags Qwen3-235B-2507 as “very verbose,” burning 15M tokens to run their index versus an 8.1M average, which quietly inflates real-world cost even when the per-token price looks low.
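
The verbosity penalty is easy to quantify. The Python sketch below uses the two token counts quoted above; the per-token price is a placeholder chosen only to make the arithmetic visible, not Qwen's actual rate.

```python
def run_cost(price_per_million_tokens: float, tokens_used: float) -> float:
    """Dollar cost of a workload given a per-million-token price."""
    return price_per_million_tokens * tokens_used / 1_000_000


# Token counts quoted above for running the Artificial Analysis index.
QWEN_TOKENS = 15_000_000
AVERAGE_TOKENS = 8_100_000

verbosity = QWEN_TOKENS / AVERAGE_TOKENS
print(f"Verbosity multiple: {verbosity:.2f}x")

# Placeholder price (an assumption), identical for both rows on purpose.
PRICE = 1.00  # dollars per million tokens
print(f"Verbose model:  ${run_cost(PRICE, QWEN_TOKENS):.2f}")
print(f"Average model:  ${run_cost(PRICE, AVERAGE_TOKENS):.2f}")
```

A model that burns about 1.85x the tokens has to be priced at roughly 54% of a rival's per-token rate just to break even on the same job.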

Now the part that decides things for anyone in a regulated shop. Qwen has the broadest legitimate access of any Chinese model, and it just passed Llama to become the most-downloaded open family on Hugging Face. Most releases carry Apache 2.0 licensing (verify per version). It runs in your own environment on AWS Bedrock, where AWS guarantees customer data never trains Qwen, and on Azure AI Foundry inside your own tenancy, with local options through Ollama and vLLM. In Cursor it is not native; you need Cursor Pro+ and a custom-model config. That compliance posture is the Qwen story, and Part 3 ranks exactly where it lands on data safety.

Monkey — Kimi (Moonshot AI)

The monkey is clever, versatile, mischievous, and the sharpest tool-user in the zodiac — exactly the K2 family’s profile. It is the agentic one, brilliant and inconsistent in the same breath. “High ceiling, low floor” is a monkey’s whole personality.

The K2 family is the open-weight model that made Western labs pay attention. It is a trillion-parameter sparse mixture-of-experts design built for repo-scale work, and that is where developers praise it most: large context and a real understanding of how files relate to each other. The current flagship is Kimi K2.7-Code (June 12, 2026): 1T total parameters, 32B active, 256K context, thinking mode forced on, and reportedly about 30% fewer reasoning tokens per task than K2.6, a figure that comes from independent testing rather than the vendor. Pricing runs $0.95 per million input tokens and $4.00 per million output.

The independent numbers are respectable. On Aider Polyglot, the original K2 scored 60.0%, ahead of Claude Sonnet 4 at 56.4% and behind Opus 4 at 70.7%. On the Artificial Analysis Intelligence Index v4.1, K2.7-Code sits at 42 (K2.6 was 43), strong among open models but now behind GLM-5.2’s 51 for the open-weight lead. Where Kimi keeps its edge is agentic stability, not the raw index — the troop holds together under load better than the score alone suggests.

The community read is more textured. K2.6’s launch pulled 592 points and 303 comments on Hacker News, with several people calling it “another DeepSeek moment.” The praise centers on context and cost — one developer reported dropping Claude Pro at $20/month for Kimi via Ollama and “haven’t hit usage limits a single time.” The skepticism centers on reliability: “high ceiling but low floor” came up more than once, meaning capable but inconsistent run to run.

That inconsistency shows under load. In Composio’s hard agentic test, the verdict was blunt: “Opus was expensive, but it finished. Kimi just could not put it all together once the task got real.” Kimi is excellent until the task stops being routine. Part 2 returns to that line from the Dog’s side.

Access is easy. OpenRouter lists it at roughly $0.60–0.74 per million input tokens and $2.50–3.50 output; the direct API runs through platform.moonshot.cn; it works in Cursor as a custom model via OpenRouter. And the detail to hold onto: Cursor’s own Composer is built on Kimi K2.5. The Snake section, below, is where that lands.

Two vendor claims to flag. Kimi’s Agent Swarm reportedly spawns up to 100 sub-agents across roughly 1,500 tool calls — the monkey troop in literal form, interesting and unverified. And treat the HumanEval 99.0 figure for K2.5 with suspicion: HumanEval is near saturation, so a score that high tells you the benchmark is exhausted, not that the model is.

Goat — GLM (Zhipu)

The goat is gentle, creative, and persistently underrated — mild on the surface, a pleasant surprise once you give it a chance. GLM is the quietly capable one. In one widely shared account it was “the first open model that was actually usable” for code, with a Hacker News developer (handle MintsJohn) rating it around ChatGPT-4.0 level, more than most expected from an open model at the time.

GLM also anchors the most-quoted cautionary tale of the season. Ashish Sharda’s “I Tested GLM-4.6 for 2 Weeks and Went Back to Claude” reads as a knock, but it is two weeks of a developer living inside GLM before deciding the premium model was worth the money. The goat held its own long enough to make the comparison a real one — a better daily driver than the headline suggests. Part 2 carries the verdict that comparison reached, on the Dog’s home turf.

Snake — Composer (Cursor’s own model)

The snake keeps hidden wisdom and waits in the grass until you step on it. Composer fits because it conceals what it is: Cursor’s in-house model wears an American badge over a Chinese-origin base — Moonshot’s Kimi K2.5 — and most people who run it daily have no idea.

The naming confuses people, so be precise. There is no third-party product called Composer. It is Cursor’s in-house coding model, a mixture-of-experts design RL-trained for software engineering, built on Moonshot’s open Kimi K2.5.

The version history runs Composer 1 (October 2025, shipped with Cursor 2.0), Composer 1.5 (February 2026), Composer 2 (March 2026, Kimi K2.5 base confirmed by Cursor), and Composer 2.5 (May 18, 2026, current). Composer 2.5 scores 79.8% on SWE-bench Multilingual and 63.2% on CursorBench v3.1, trained on roughly 25x more synthetic tasks than Composer 2. Cursor claims it “matches Opus 4.7 at about 1/10th the cost.”

Keep the layers straight: Composer is the model, Cursor Agent is the harness that runs it, and Composer 2.5 is the default model in that harness. Cursor’s own May routing guidance: deep architecture and long context to Claude Opus 4.7, shell-heavy terminal work to GPT-5.5, general default to Composer 2.5.
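
Written down as a lookup table, that routing guidance looks like the Python sketch below. The model names are the ones given above; the task-category keys and the function are hypothetical and do not reflect any actual Cursor configuration format.

```python
# Model names as given in the routing guidance above.
# The task-category keys are hypothetical labels, not Cursor settings.
ROUTING = {
    "deep_architecture": "Claude Opus 4.7",
    "long_context": "Claude Opus 4.7",
    "shell_heavy_terminal": "GPT-5.5",
}
DEFAULT_MODEL = "Composer 2.5"


def pick_model(task_kind: str) -> str:
    """Return the suggested model for a task kind, falling back to the default."""
    return ROUTING.get(task_kind, DEFAULT_MODEL)


for kind in ("deep_architecture", "shell_heavy_terminal", "bug_fix"):
    print(f"{kind} -> {pick_model(kind)}")
```

The fallback is the line that matters: anything nobody routed explicitly lands on Composer 2.5.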

Sit with the implication. The default model in the most-deployed enterprise AI editor, the one SpaceX just agreed to pay $60 billion for, is a Chinese-origin open-weight model fine-tuned by an American company. The boundary a lot of enterprises thought they were enforcing (“no Chinese models in our stack”) was already crossed before the acquisition was announced. Provenance was the easy part to police. It turns out it was also the part nobody was actually policing.

That is where the East-rises story stops being about which labs ship from Beijing. The boundary enterprises thought they enforced was already inside their default editor. Part 2 turns to the Western field that is supposed to be the safe choice: the incumbents that still lead the hardest work, and the open-weight movement that has been outrun on code.

Bob Matsuoka is CTO of Duetto and writes about AI-powered engineering at HyperDev.

Related reading:

AI Power Ranking — Tool comparisons and benchmarks for AI practitioners
LinkedIn Newsletter — Strategic AI insights for CTOs and engineering leaders
