The Year of the Fire Horse - Part 3

The Governance Reckoning

Jun 26, 2026

This is Part 3 of a three-part read on the Fire Horse year in AI coding, the payoff the whole series was building toward. Parts 1 and 2 walked twelve coding models across the twelve animals of the zodiac and two hemispheres — the Chinese front-runners and the $60 billion SpaceX-buys-Cursor deal in Part 1, the Western incumbents and the collapse of the Western open flank in Part 2. Now the question all of it was built around: who controls the inference, what they do with your prompts, and whether the answer can change without your consent.

TL;DR

The community read is convergence with a ceiling — and the real endorsement is behavioral: Chinese models reached ~61% of top-10 token consumption on OpenRouter in one February week, with DeepSeek alone near 17.6%.
For enterprise, the practical data-safety ranking is: self-host > AWS Bedrock / Azure AI Foundry > US inference hosts > OpenRouter ZDR > direct Chinese API.
China’s National Intelligence Law means a written no-train promise from a China-domiciled vendor is not a clean answer for regulated work, whatever the privacy policy says.
CrowdStrike measured DeepSeek-R1 producing insecure code at a ~50% higher rate on politically sensitive prompts — and the bias persisted when running the open weights locally. Self-hosting is not the escape hatch you assume.
Provenance was a proxy. The DeepSeek risk and the SpaceX-Cursor risk are the same underlying exposure, and the Cursor deal makes the flag-over-the-lab heuristic useless.

What developers are actually saying

Cut through the benchmarks and the community lands on one sentence: convergence with a ceiling.

The gap has collapsed for the bulk of coding work. For autocomplete, summarization, refactoring known code, and the high-volume long tail, Chinese open-weight models are competitive and dramatically cheaper. Where they still lose is the hardest agentic, multi-file, novel-algorithm work — the jobs where finishing is the whole point. Every hands-on test in my research told the same story from a different angle, and the animal sections in Parts 1 and 2 carry the specifics: Kimi’s “could not put it all together,” Flash’s confusion off-script, the Dog that finished when the Monkey could not.

Two things stand out about the endorsements. No prominent named engineer — no Karpathy, no Willison — has publicly endorsed a Chinese model for production coding; the praise is essentially all pseudonymous Hacker News handles. But the real endorsement is behavioral. On OpenRouter, Chinese models reached roughly 61% of token consumption among the top-10 models in one week this past February (around 45–51% of platform-wide tokens by April), with DeepSeek alone at about 17.6% top-10 share — reportedly exceeding Google and OpenAI combined in one snapshot. One caveat on the volume story: the Chinese models dominate tokens, not revenue. Anthropic accounts for roughly 12% of OpenRouter tokens but about 46% of its revenue, which is the cheap-volume-versus-paid-difficulty split showing up in the billing. Developers are voting with their tokens even when they will not put their name on a blog post.

The behavior that dominates is routing. Use the cheap or local model for the long tail; escalate to Claude or GPT for high-stakes correctness and novel work. That is not a compromise people settled for. It is the rational architecture, and it is what most serious teams already run.

The enterprise question

Here the conversation stops being about capability and starts being about who can read your prompts. The three Chinese vendors are not equivalent, and the differences are material.

DeepSeek is the hard case. Its terms of service permit training on user-submitted data by default. There is no zero-data-retention option. Data is processed in China. No SOC 2, no HIPAA BAA. Its own privacy policy describes collecting prompts, chat history, uploaded files, voice, device and OS details, IP, device identifiers, crash logs, and — the line that stops procurement officers cold — “keystroke patterns or rhythms,” stored on “secure servers located in the People’s Republic of China,” retained “as long as needed,” which analysts read as indefinite.

Moonshot (Kimi) is better but not clean. Its policy permits using API content to “develop and improve the services” — no ZDR — though it hosts in Singapore. Singapore residency is a real improvement over data-in-China, but Moonshot’s core operations are Beijing-based, which matters for the legal point below.

Alibaba (Qwen) has the strongest commitment of the three. Alibaba Cloud Model Studio states explicitly that it “will never use your data for model training,” offers multiple regions including US and EU, and Alibaba carries institutional accountability the others lack — Hong Kong-listed, with an international regulatory track record.

The caveat that overrides all three privacy policies is China’s National Intelligence Law, which requires Chinese companies to “support, assist and cooperate” with state intelligence work regardless of what their terms say. A written no-train commitment from a China-domiciled vendor is subject to that law. For regulated industries, the only clean answers are self-hosting open weights or routing through Western-managed infrastructure where the inference never touches a Chinese-operated service.

Which gives a practical ranking. From safest to least:

Self-host the open weights. Bedrock, vLLM, Ollama, your own GPUs. The weights are open; the inference is yours. (One asterisk, in the security section below.)
AWS Bedrock or Azure AI Foundry. DeepSeek and Qwen run in the provider’s environment, data stays in your selected region, no interaction with Chinese-operated services. AWS contractually guarantees no training on your data; Azure runs Qwen in your tenancy.
US inference hosts. Together AI, Fireworks AI, DeepInfra run the open weights on US infrastructure.
OpenRouter with ZDR. OpenRouter supports zero-data-retention per-account, per-key, or per-request ("zdr": true), but does not enumerate which specific Chinese-model endpoints honor it, and the National Intelligence Law caveat still applies upstream.
Direct Chinese API. Convenient, cheapest to wire up, and the option you cannot defend in a regulated procurement review.

“It’s complicated” is a non-answer. The ranking is the answer.

A cautionary security finding

One result deserves its own section because it survives self-hosting, which most people assume is the escape hatch.

CrowdStrike tested DeepSeek-R1 and found it produced insecure code at a higher rate on politically sensitive prompts. The baseline vulnerable-code rate was about 19%; it rose to roughly 27% when prompts contained CCP-sensitive terms — a relative increase of about 50% off that baseline, not a 50-point jump. CrowdStrike frames this as emergent misalignment, not a deliberate backdoor, so read it as a behavioral artifact rather than intent. The operational detail: the bias persisted when running the open weights locally. Self-hosting does not fix alignment baked into the weights. You can air-gap the model and the bias rides along.

Booz Allen found a related pattern: three of four Chinese models produced more security flaws when the user was described as US government, with Qwen3-Coder adding roughly 130% more vulnerabilities under a government persona. The authors stop short of alleging deliberate backdoors, and you should too — the mechanism is unproven. But the measured effect is real.

The counterpoint keeps this from being a blanket indictment. In one test, Kimi K2.5 posted the lowest aggregate vulnerability score of the group, below the US comparison model. The Monkey’s brilliance shows here too. The concern is not uniform. It is model-specific, prompt-specific, and measurable — which means it is something you test for, not something you assume.

For completeness on the public record: South Korea’s PIPC found in April 2025 that DeepSeek transferred user prompts and device data to a ByteDance-affiliated cloud without consent. Wiz found a publicly exposed, unauthenticated DeepSeek database in January 2025 leaking over a million log lines including chat history and API keys (secured quickly after disclosure). Government-device bans exist in Italy, Australia, Taiwan, South Korea, the Netherlands, the Czech Republic, Germany, India, and roughly 17 US states. But note what does not exist: a nationwide US API ban. “DeepSeek is banned in the US” overstates the situation. The Congressional pressure is real — the House Select Committee sent formal letters to Airbnb and Cursor in April 2025 over Chinese-model data concerns — but it is pressure, not prohibition.

The flag was never the variable

Now back to the governance question, with the whole zodiac in hand.

Every concern in the enterprise section comes down to control of the model layer. Who controls inference. What they do with your prompts. Whether a no-train promise is durable. Whether the governance can change without your consent. None of it is intrinsically about China — China is the jurisdiction where the control question has the sharpest legal teeth, because of the National Intelligence Law.

The Cursor acquisition raises the same question from the other direction. Read the enterprise objection to DeepSeek and the enterprise objection to a SpaceX-owned Cursor side by side and they rhyme. Both are: a single owner now controls the layer that sees all your code, the no-train and retention guarantees can be revised by that owner, and you have limited visibility into what happens upstream. The DeepSeek version has a foreign-intelligence-law wrapper. The Cursor version has a change-of-control wrapper. The underlying exposure — your IP flowing through infrastructure you do not control, governed by terms the owner can change — is the same exposure. Beijing on one side, Boca Chica on the other. That is where Sanchit Vir Gogia’s line from Part 1, Cursor sitting “close to the intellectual-property bloodstream,” stops being an analyst soundbite and becomes the whole argument.

That symmetry is what the zodiac quietly demonstrated. A Chinese frame walked through twelve animals and landed on Anthropic, OpenAI, Google, Meta, and Mistral as readily as on Moonshot, DeepSeek, Alibaba, and Zhipu. The frame fit the whole field because the flag over the lab was never the real variable. Provenance was a proxy for the question that matters — who controls the inference and what they are permitted to do with it — and the Cursor deal makes that proxy useless. You can no longer reason about model risk by asking which flag flies over the lab.

Practical routing guide

Strip out the geopolitics and a workable default emerges. This is roughly how I would set up a team today, animals included where they help. (The capability and access details behind each label live in Parts 1 and 2.)

The long tail — autocomplete, summarization, classification, tagging, structured extraction, refactoring code you already understand. This is 80–90% of volume, and it belongs to the Rabbit and the Rat. Use the cheapest capable model. Gemini 3 Flash if you are already in Google’s ecosystem. A self-hosted Qwen or Kimi if cost and data residency both matter. Gemma locally for pure structured-text utility, Gemma 3n on the edge, Codestral for fill-in-the-middle tab completion. The quality difference against a frontier model on this work is hard to detect, and the cost difference is 5–20x.

Routine multi-file work in a known codebase. Composer 2.5 as Cursor’s default — the Snake — or DeepSeek V4 / Kimi K2.6 through a vetted inference path. Cheap, capable, fine when the task stays on the rails.

High-stakes correctness, novel algorithms, off-script agentic work, deep architecture. This is the Dog and the Tiger. Claude Opus or GPT-5.5. This is where “Kimi just could not put it all together” and “Flash gets confused when a task goes off-script” stop being quotes and start being your Tuesday. Pay for the model that finishes.

Regulated or IP-sensitive work. Ignore the leaderboard and start from the data-safety ranking above. Self-hosted weights or Bedrock/Azure first; provenance second. If EU residency is the binding constraint, the Rat’s sovereignty pitch (Mistral on-prem, EU-only) is the one place provenance actually buys you something — though weigh it against a free self-hosted Qwen. And run the CrowdStrike test yourself — generate security-relevant code with and without sensitive context and diff the output before you trust any model in this tier, Chinese or otherwise.

The meta-point: the right answer is almost never one model. It is a routing policy. The teams getting real value here are not picking a winner. They are matching each class of task to the cheapest model that clears the bar for that task, and escalating only when the work demands it.

Where this leaves us

A Fire Horse year is supposed to be bold and unruly, and the field obliged. The Chinese models have largely arrived in the high-volume, low-stakes parts of your workflow — cheaper, open, and good enough that you would struggle to tell the difference. On the hardest work, the Western frontier still holds a real lead, and the cost of getting that work wrong is exactly where the price premium earns out. The clearest casualty of the year is the Western open-weight story: Meta and Mistral built the movement and got passed on code, and the open coding lead now sits with the Chinese labs.

But the durable lesson of this year is not about China. It is that the model layer inside your IDE is a governance surface, and ownership of that surface is now in motion — a $60 billion acquisition on one side, a foreign-intelligence statute on the other, and the same question underneath both. Where does my code go, and who can read it, and can the answer change without my consent? The zodiac was a device, but it made the point cleanly enough: twelve animals, two hemispheres, one set of questions. Provenance was the comfortable proxy. After this year, the proxy is gone. You have to ask the real question now, and you have to ask it of everyone — including the editor you have trusted by default.

Bob Matsuoka is CTO of Duetto and writes about AI-powered engineering at HyperDev.

Related reading:

AI Power Ranking — Tool comparisons and benchmarks for AI practitioners
LinkedIn Newsletter — Strategic AI insights for CTOs and engineering leaders

Discussion about this post

Ready for more?