I came across OpenAI's Codex CLI while researching a broader State of the Union on agentic coding—a topic that’s about to get a lot more public in the next few days. Codex wasn’t the headline of that work, but it stuck with me.
Quick note: there are multiple tools named "Codex." This piece is about OpenAI's Codex CLI, released in April 2025. It's a standalone, open-source command-line assistant with support for GPT-4 and other models. It is not the same as the earlier OpenAI Codex model (2021–2023, used in GitHub Copilot), nor the community-built Codex CLI projects like the one by codingmoh that runs fully offline.¹
It dropped last week without much noise. No launch spectacle, no hype cycle. But it’s quietly one of the most thoughtfully engineered AI tools I’ve used this year.
Not because it’s novel. Because it’s practical.
Four Things Codex Gets Right
1. Model Choice With Real Leverage
Codex gives you access to multiple OpenAI models:
gpt-4o-mini: cheap, fast, and surprisingly conversational
gpt-4.1: best-in-class on coding tasks, and more cost-effective than Claude 3.5 in real-world use
Having both means you can balance cost and capability based on what you’re doing—debugging a script, reviewing code, or generating config boilerplate.
But here’s the kicker: Codex CLI is open source and supports any provider.
Want to use Claude, Gemini, or local models via Ollama, LM Studio, or custom APIs? You can. The system is designed to be provider-agnostic. Model definitions live in your .codexrc file, and you can route prompts through whatever stack suits your workflow, whether that's cloud, self-hosted, or air-gapped environments.
That flexibility turns Codex from a tool into a platform.
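For illustration only: I haven't verified the exact schema, so the key names below are hypothetical rather than Codex's documented config format. But the shape of provider-agnostic routing would look something like this:

```yaml
# Hypothetical sketch - these keys are illustrative, not documented Codex schema.
providers:
  openai:
    base_url: https://api.openai.com/v1
    api_key_env: OPENAI_API_KEY
  local:
    base_url: http://localhost:11434/v1  # e.g. an Ollama endpoint
    api_key_env: ""                      # no key needed for local models
default_model: gpt-4o-mini
```

The point is less the syntax than the design: the model and the endpoint are configuration, not assumptions baked into the tool.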
And if you're looking to save on usage, Codex also supports a --flex-mode flag. When enabled, it attempts to use lower-cost or opportunistically available compute, with a tradeoff: it may occasionally interrupt long-running sessions if those resources become unavailable. Great for budget-conscious devs, but not yet fully production-safe.
A hint to OpenAI: --flex-mode is a clever idea. But when compute runs out, sessions shouldn't just crash. Either exit gracefully, or pause until capacity is restored.
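For what it's worth, the pause-and-retry behavior I'm asking for is a well-worn pattern. A minimal Python sketch of it (hypothetical names, not Codex internals):

```python
import time

class CapacityError(Exception):
    """Raised when opportunistic compute is reclaimed mid-session."""

def run_with_pause(step, *, poll_seconds=1, max_waits=5):
    """Run a unit of work; on capacity loss, pause and retry
    instead of crashing the whole session."""
    waits = 0
    while True:
        try:
            return step()
        except CapacityError:
            waits += 1
            if waits > max_waits:
                raise RuntimeError("capacity not restored; exiting gracefully")
            time.sleep(poll_seconds)  # wait for capacity to come back

# Demo: the step fails twice, then succeeds.
attempts = {"n": 0}
def flaky_step():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise CapacityError
    return "done"

print(run_with_pause(flaky_step, poll_seconds=0))  # prints "done" after two pauses
```

Even a bounded wait like this beats an abrupt crash: the session either survives the blip or fails with a clear message.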
2. Big Context, Smart Scope
Codex reads your full codebase but doesn't send it all at once. It scopes intelligently, caches what it can, and only sends what’s relevant. This cuts down noise, improves response time, and avoids leaking sensitive data.
It behaves like someone who’s read your repo and knows what to ignore.
(It can actually read your entire codebase and act fully autonomously, à la Devin, via its "dangerously-auto-approve-everything" and "full-context" modes, but I haven't had the courage to try those yet.)
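To make "scopes intelligently" concrete, here's a toy sketch of relevance-based file selection under a size budget. This is my own illustration of the idea, not Codex's actual algorithm:

```python
def scope_context(files, prompt, budget_chars=2000):
    """Pick only the files most relevant to the prompt, under a size budget.
    A toy version of 'send what's relevant, skip the rest'."""
    prompt_words = set(prompt.lower().split())

    def relevance(item):
        _, text = item
        return len(prompt_words & set(text.lower().split()))

    selected, used = [], 0
    for name, text in sorted(files.items(), key=relevance, reverse=True):
        if relevance((name, text)) == 0 or used + len(text) > budget_chars:
            continue  # irrelevant or over budget: never leaves the machine
        selected.append(name)
        used += len(text)
    return selected

files = {
    "docker-compose.yml": "services volumes mount bind docker",
    "README.md": "project overview installation",
    "scripts/deploy.sh": "docker build push deploy",
}
print(scope_context(files, "fix the docker volume mount"))
# → ['docker-compose.yml', 'scripts/deploy.sh']
```

The real tool is doubtless smarter (embeddings, caching, token counting), but the payoff is the same: less noise, faster responses, and irrelevant files stay local.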
3. Privacy That’s Built In, Not Bolted On
Local-first execution
Explicit file access
No telemetry or background logging by default
If you work in regulated or high-trust environments, these aren’t features—they’re prerequisites.
4. Thoughtful UX, Conversational Flow
Codex is non-blocking. Ask a question, keep working, get a response later. No lock-in. No modal dead ends.
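The non-blocking flow is essentially fire-and-poll. A short Python sketch of the idea (my illustration, not Codex's implementation):

```python
import queue
import threading
import time

def ask_async(question, answer_fn):
    """Fire off a question on a background thread; returns a queue
    you can poll later, so the terminal never blocks."""
    out = queue.Queue()

    def worker():
        out.put(answer_fn(question))  # e.g. a slow model call

    threading.Thread(target=worker, daemon=True).start()
    return out

def slow_model(q):
    time.sleep(0.1)  # stand-in for network latency
    return f"answer to: {q}"

pending = ask_async("why does this mount fail?", slow_model)
# ... keep typing in your shell ...
print(pending.get())  # blocks only when you finally want the answer
```

The design choice matters: the cost of latency is paid when you ask for the result, not when you submit the question.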
And when you authenticate with OpenAI, it exposes its thought chain—a step-by-step breakdown of how it interpreted your request. Here’s what o4-mini said while resolving a style update:
"The user asked to ‘apply the same style to the playground pages,’ which likely means using the card style from the landing page. Since they didn’t mention adding breadcrumbs, I think it’s best to leave the child pages as they are... I confirmed the necessary Card components are imported... I’ll also remove any trailing spaces before committing the changes. Now it’s ready to go!"
That’s not just output—that’s understanding… or at least a better simulation of it.
Here’s another great example that showcases how it reasons through constraints and proposes a clear plan of action:
"I’m thinking it’s best to propose a revised plan: let’s roll back and start with something smaller instead. It seems like the changes to session.ts are too large, so I’ll check version control for the previous state because we don’t have that code locally. Given the constraints, I think the best course is to recommend to the user that we revert, and open a feature branch to implement the smaller changes."
It’s this kind of structured decision-making that makes Codex feel like more than just a code generator—it’s a tool that reasons with the project in mind.
In Practice
Fixed Docker mount bugs in under a minute
Deployed hybrid Next.js + Edge Function apps with cleaner env config
Wrote simple Bash helpers faster than switching to a browser tab
Tested code changes with model commentary I could inspect and trust
Limitations Worth Noting
Codex isn't a full agent, and it has some clear boundaries:
It doesn't do live web searches, which can be limiting when troubleshooting unknown errors or dependency conflicts.
It relies on local context only—you get what you give it.
The biggest issue with gpt-4.1 right now is instability around file editing. In many sessions, Codex will report successful patching, but changes don't actually get written to disk. This inconsistency can break workflows without clear feedback, making gpt-4.1 frustrating for multi-step code refactors.
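Until that's fixed, a defensive wrapper helps: verify that the write actually landed on disk instead of trusting the success message. A minimal sketch of my own workaround pattern (not part of Codex):

```python
import tempfile
from pathlib import Path

def patch_and_verify(path, old, new):
    """Apply a text substitution, then read the file back to confirm
    the change actually landed on disk - a guard against 'phantom' patches."""
    p = Path(path)
    text = p.read_text()
    if old not in text:
        return False
    p.write_text(text.replace(old, new))
    return new in p.read_text()  # trust the disk, not the report

with tempfile.TemporaryDirectory() as d:
    f = Path(d) / "config.py"
    f.write_text("DEBUG = True\n")
    print(patch_and_verify(f, "DEBUG = True", "DEBUG = False"))  # → True
```

The same read-back check works at the workflow level too: after any reported patch, `git diff` is the ground truth, not the model's summary.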
There is also a known Node.js warning that can occur during heavy usage:
(node:20662) MaxListenersExceededWarning: Possible EventTarget memory leak detected. 11 abort listeners added to [AbortSignal]. MaxListeners is 10. Use events.setMaxListeners() to increase limit
While this isn’t the cause of the main crash behavior, it's a symptom of process saturation that may require restarting the CLI.
My current takeaway: while the gpt-4.1 integration is promising—and still useful in situations that require deeper contextual understanding or multi-file reasoning—it's also buggy in key areas, especially around file edits and process stability. In contrast, I was pleasantly surprised by how well gpt-4o-mini performed. It took a bit longer to work through complex issues, but it was persistent, accurate, and refreshingly reliable.
It’s also worth saying: Claude still has legs, particularly where Codex falls short, most notably web search. While Codex outperforms it in structured coding workflows, Claude's ability to browse the web makes it a better general-purpose assistant. For problems that required research, external docs, or integrating with web scrapers, Claude's open context and tool access were invaluable. Claude also feels more polished in UI and integrated tooling, particularly for users working outside the CLI or across multiple workflows. It's still part of my toolbox, just not always my first stop.
Should You Use It?
Yes.
Codex CLI isn’t a gimmick. It’s not trying to be your copilot. I first encountered it while preparing a broader State of the Union on agentic coding—and it fits squarely into that evolving category: tools that can reason, act incrementally, and stay out of your way. It’s an AI-native command-line assistant that makes you faster without making a fuss:
It gets the model selection right
It respects your environment
It explains itself just enough
It lets you work without waiting
And in the broader context of agentic coding, which is about to get a major spotlight, it's a solid glimpse into what AI-native tools should look like: quiet, capable, and engineered with intent. Add in the affordability of gpt-4o-mini, and this may end up being my daily tool.
Given how new and raw it is, maybe hold off on putting it against your toughest problems, but at least have it take a look at them.
Footnotes
¹ Other "Codex" projects include:
OpenAI Codex (2021): the model behind GitHub Copilot, now deprecated.
Community Codex CLI: a local-first, offline-friendly command-line tool for coding assistance.
This article is focused on the OpenAI Codex CLI, released April 2025, available on GitHub.