(Note: this is the second in a four-part series; part 1 is here.)
In the past year, a surge of new entrants and experimental tools has pushed the boundaries of autonomy and integration in AI-assisted coding. This section highlights several notable recent launches (not covered in Part 1) that exemplify the latest developments in agentic coding – AI coding assistants that can act with greater independence.
Devin (Cognition Labs)
Description: Branded as "the first AI software engineer," Devin is an autonomous coding agent developed by the startup Cognition Labs. Unveiled in early 2024, Devin quickly garnered significant attention for its ability to handle entire software tasks with minimal input (its demos went viral on social media).
Capabilities: Devin can plan and execute complex engineering tasks requiring thousands of decisions, recalling relevant context at every step, learning over time, and fixing its own mistakes (as described in Cognition's introductory blog post). It comes equipped with common developer tools — a shell, code editor, and web browser — all inside a sandboxed environment, essentially everything a human developer would use. In practice, you give Devin a natural-language prompt describing a project or feature, and it will generate a step-by-step plan, write the code, run it, test it, debug as needed, and even iterate based on feedback. For example, in tests Devin built a functioning website in about 10 minutes and even recreated the classic game Pong from scratch in a similar timeframe. It can also search the web for documentation or solutions as it works (like a developer Googling), and it maintains a memory of its actions – for instance, it uses a running notes file to record important info and creates persistent knowledge entries that it can reuse in later sessions.
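Cognition hasn't published how this memory mechanism works, but the general pattern – append observations to a running notes file during a session, and persist reusable facts to a knowledge store that later sessions reload – can be sketched roughly as follows. Everything here (file names, function names) is a hypothetical illustration, not Cognition's implementation:

```python
# Hypothetical sketch of an agent memory pattern (running notes + persistent
# knowledge), loosely inspired by Devin's described behavior. Not Cognition's code.
import json
from pathlib import Path

NOTES_FILE = Path("agent_notes.md")            # scratch notes for the current session
KNOWLEDGE_FILE = Path("agent_knowledge.json")  # facts reused across sessions

def append_note(text: str) -> None:
    """Record an observation made mid-task (e.g. 'tests require Python 3.11')."""
    with NOTES_FILE.open("a", encoding="utf-8") as f:
        f.write(f"- {text}\n")

def remember(key: str, value: str) -> None:
    """Persist a fact so future sessions can reuse it."""
    store = json.loads(KNOWLEDGE_FILE.read_text()) if KNOWLEDGE_FILE.exists() else {}
    store[key] = value
    KNOWLEDGE_FILE.write_text(json.dumps(store, indent=2))

def recall_all() -> dict:
    """Load persisted knowledge at the start of a new session."""
    return json.loads(KNOWLEDGE_FILE.read_text()) if KNOWLEDGE_FILE.exists() else {}
```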
One standout aspect of Devin is its ability to collaborate in real time via Slack. The user can invite Devin to a Slack channel and converse with it as if it were a remote colleague. Devin provides updates on its progress, and you can interject with clarifications or new instructions mid-task (Slack is actually the primary interface for Devin's commands). This makes the interaction feel very natural in team settings, essentially bringing an AI pair-programmer into your workplace chat.
Innovations: Cognition has demonstrated Devin performing tasks that were previously unheard of for AI coding tools, including: learning to use an unfamiliar AI library by reading a blog and implementing it, building and deploying a web app incrementally based on user requests, autonomously finding and fixing bugs in an open-source codebase, fine-tuning its own AI models with minimal guidance, and even tackling freelance programming jobs from Upwork. Later revisions of Devin introduced multi-agent collaboration, where Devin can spin up additional specialized AI agents that work in concert on subtasks. In other words, Devin can manage a team of AI "developers," not just work as a single agent – a hint at how the system could scale to more complex projects.
Pricing & Access: Initially, access to Devin was exclusive and pricey. It became generally available to team customers in early 2024 at an eye-watering cost of $500 per month per team. Despite the steep price, demand was high following the viral demos. In April 2025, Cognition introduced a more accessible plan: a $20 upfront fee + pay-as-you-go usage billing model. The $20 gets you roughly 9 "AI Compute Units" (about 2.25 hours of active Devin work) before additional charges accrue. This change was likely enabled by Cognition's massive funding round (the company reportedly raised hundreds of millions of dollars in new capital) and reflects an effort to broaden adoption by lowering the upfront cost. Devin is still relatively expensive for heavy use, but these pricing experiments show Cognition trying to balance the agent's powerful capabilities with greater accessibility.
Effectiveness: Devin has drawn both praise and skepticism. Notably, Aravind Srinivas, CEO of Perplexity.ai, praised Devin as "the first demo of any agent… that seems to cross the threshold of human capability". In benchmarks like SWE-Bench – which involves resolving real-world GitHub issues – Devin significantly outperformed prior state-of-the-art agents, successfully resolving about 13.9% of issues end-to-end versus ~2% for others. This was a major leap forward in autonomous coding performance (showing that Devin can handle a non-trivial subset of coding tasks fully autonomously), though it still means the majority of complex issues remain beyond its abilities. On the flip side, independent evaluations have exposed Devin's limitations. For example, a review by The Register reported that Devin completed only 3 out of 20 challenging coding tasks successfully – highlighting that it can struggle with very complex or creative problems. Like any large language model (LLM)-based system, Devin can also misinterpret requirements or "hallucinate" solutions that don't actually work. Cognition claims the current Devin 2.0 (as of 2025) is roughly twice as productive as the initial release and has added features such as automatic project plan generation, code explanations with citations, and even generating project documentation wikis.
Users have noted that Devin works impressively on self-contained tasks but can be harder to integrate into a typical development workflow – since its primary interface is Slack or a web dashboard, using it alongside traditional IDEs requires some adjustment. Moreover, running such an autonomous agent requires a level of trust: Devin will attempt large changes to the codebase on its own, so teams often keep it on a tight leash (e.g. reviewing every commit it makes before merging). In summary, Devin represents the cutting edge of agentic coding: highly autonomous and capable of amazing feats, but also pushing into territory where human oversight and judgment remain crucial. It has effectively proven that an AI can take a high-level prompt like "Build me an e-commerce website" and produce non-trivial results. As the technology improves, Devin or its successors might eventually realize the vision of an "AI software engineer" that can handle projects start-to-finish – but for now, using Devin effectively still means knowing when to intervene and guide its creativity.
Firebase Studio
Firebase Studio represents a pragmatic step toward realizing the "apps-as-agents" paradigm described earlier. While many agentic tools focus on abstract reasoning or speculative execution, Firebase Studio grounds agent workflows in persistent data models, event streams, and UI generation that stays tightly coupled to the backend logic.
At its core, Firebase Studio is a visual development environment layered over Google's Firebase platform. It exposes Firestore schemas, authentication flows, Cloud Functions, and app state as editable primitives. These can be reasoned about not just by the developer but also by agentic copilots. An LLM agent can, for instance, respond to a user prompt like "create a feedback form tied to user sessions and notify the admin via email" by generating the Firestore schema, creating the appropriate rules, configuring triggers in Cloud Functions, and scaffolding out the UI in a few seconds—backed by Firebase's real-time sync and security model.
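To make that concrete, here is a rough sketch of the kind of backend scaffolding such an agent might generate for the feedback-form prompt above, written against the google-cloud-firestore Python client. The collection name and the notify_admin helper are illustrative assumptions; a real Firebase project would more likely wire the notification through a Cloud Functions trigger or an email extension:

```python
# Illustrative sketch only: backend scaffolding an agent might generate for
# "a feedback form tied to user sessions that notifies the admin".
from google.cloud import firestore  # pip install google-cloud-firestore

db = firestore.Client()

def submit_feedback(session_id: str, user_id: str, message: str) -> str:
    """Store one feedback entry keyed to the user's session and ping the admin."""
    doc_ref = db.collection("feedback").document()  # auto-generated document ID
    doc_ref.set({
        "sessionId": session_id,
        "userId": user_id,
        "message": message,
        "createdAt": firestore.SERVER_TIMESTAMP,
    })
    notify_admin(f"New feedback from {user_id}: {message[:80]}")
    return doc_ref.id

def notify_admin(body: str) -> None:
    # Hypothetical stub: in Firebase this would typically be a Cloud Functions
    # trigger on the 'feedback' collection or the Trigger Email extension.
    print(f"[admin notification] {body}")
```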
Crucially, Firebase Studio doesn't throw away the underlying APIs—it aligns with them. This makes it a strong candidate for integrating with hybrid agentic workflows where tools like GPT-4 or Claude can read/write specs, propose logic, and refine behavior—but still leave the actual deployment and observability in a human-readable, production-safe form. In that sense, Firebase Studio acts as a "constraint-aware canvas" that limits hallucination while encouraging intention.
For organizations seeking a bridge between no-code frontends and serverless backend orchestration, Firebase Studio offers a concrete manifestation of the promises made by agentic development: reducing boilerplate, accelerating iteration, and maintaining fidelity to app architecture. It's less about speculative agents spinning up imaginary worlds—and more about giving you a reliable, smart assistant for the real one.
OpenAI Codex CLI
Description: In April 2025, OpenAI launched Codex CLI, a new open-source AI coding agent designed to run locally in a developer's terminal. Codex CLI brings the power of OpenAI's latest models directly to the command line, allowing developers to interact with an AI agent that can read, write, and execute code on their own machine. Essentially, OpenAI built a CLI-based "autopilot" for coding tasks.
Key Features:
Local Execution: Codex CLI can autonomously carry out programming tasks on the user's system. It can read and modify local files and run code in a sandboxed environment. In effect, it behaves like an automated developer operating in your terminal: it can compile programs, run test suites, start servers, etc., all locally. This means code execution stays on your machine; only the prompts and relevant context are sent to OpenAI's models.
Approval Modes: Recognizing the need for safeguards, OpenAI built in three distinct modes controlling how independently the agent operates (a rough sketch of this gating pattern appears after the feature list):
Suggest Mode: The agent only suggests file edits or command executions, which the user must approve. (Great for code review assistance or learning a codebase, since you see every suggestion before it's applied.)
Auto-Edit Mode: The agent can modify files on its own, but still must ask permission to actually run any code. (Useful for letting it refactor or make mechanical code changes with a human gate on execution.)
Full Auto Mode: The agent can read, write, and execute commands autonomously, end-to-end, within a restricted sandbox. This mode is for longer tasks like fixing a broken build or prototyping a feature completely hands-free. OpenAI wisely sandboxed this mode (no network access and scope limited to the current directory) to limit potential harm. The CLI even warns users if version control is not enabled, encouraging them to use Git so all changes are tracked.
Multimodal Inputs: Codex CLI supports more than just text prompts – you can also input images (screenshots, diagrams) as part of your query. For instance, you could provide a screenshot of an error or a sketch of a UI, and the agent will take that into account while coding. This leverages OpenAI's latest "o4-mini" model, which has multimodal capabilities (processing both text and image inputs).
Model Flexibility: By default, Codex CLI uses OpenAI's new o4-mini model (a compact reasoning model well suited to coding tasks). However, developers can choose other models via a configuration flag or API setting – meaning you could plug in a larger model like GPT-4 or future OpenAI models as they become available. This makes the tool quite future-proof and adaptable to different needs and performance budgets.
Open-Source and Extensible: The CLI is open-source and available on GitHub, making it transparent and extensible. Developers have full control and can integrate it into their workflows – for example, hooking it into continuous integration (CI) pipelines or customizing it for their environment. OpenAI even offered grants and credits to encourage developers to experiment with Codex CLI in their projects, underscoring their hope that the community will build on top of it.
Language & Framework Support: Because Codex CLI leverages OpenAI's general-purpose models, it inherits very broad programming language support (Python, JavaScript/TypeScript, C/C++, Java, C#, Go, Ruby, etc. – essentially any language present in the models' training data). It isn't tied to a particular framework or tech stack. However, being a terminal-based tool, it's most convenient for languages and frameworks that have command-line build or run workflows. It can certainly work with front-end frameworks too (for example, it could run a development server for a React app or compile an Angular project), but it really shines in scenarios like backend development, writing scripts, performing DevOps tasks, and other use cases where interacting with the shell is natural.
Use Cases: Some scenarios where Codex CLI is especially useful include:
Automating the fix of failing builds or tests: For example, you can run your test suite and let Codex CLI find any failures and attempt fixes in Full Auto mode, iterating until the tests pass (a conceptual sketch of this loop appears after this list).
Codebase exploration in "suggest" mode: You can ask questions like "Where is user authentication handled in this codebase?" and the agent can search and open relevant files or suggest code pointers with your approval. This is like a super-charged code search that can follow instructions.
Routine refactoring or upgrades: For instance, "Migrate this codebase from Python 3.9 to 3.11." The agent can edit all the necessary files and run compatibility tests in a loop, only asking for confirmation to execute changes or when it's done.
Serving as a hands-free junior developer: You describe a new feature in natural language, and the agent implements it across the codebase, even running the project to verify it works, while you supervise in one of the approval modes. This lets you focus on high-level direction while the AI handles the grunt work of implementation.
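The failing-test use case above boils down to a simple loop: run the tests, hand the failure output to the model, apply the proposed patch, and repeat until the suite is green or a retry budget runs out. A stripped-down sketch of that loop is below; propose_and_apply_fix stands in for the model call and is purely hypothetical:

```python
# Conceptual fix-until-green loop (not the Codex CLI's internals).
import subprocess

MAX_ATTEMPTS = 5

def run_tests() -> tuple[bool, str]:
    """Run the project's test suite and capture its output."""
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

def propose_and_apply_fix(failure_output: str) -> None:
    # Hypothetical placeholder: a real agent would send the failure output (plus
    # relevant source files) to the model and apply the patch it returns.
    print("Model call goes here; no patch applied in this sketch.")

def fix_until_green() -> bool:
    for attempt in range(1, MAX_ATTEMPTS + 1):
        passed, output = run_tests()
        if passed:
            print(f"Tests green after {attempt - 1} fix attempt(s).")
            return True
        print(f"Attempt {attempt}: tests failing, asking the model for a patch...")
        propose_and_apply_fix(output)
    return False
```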
Effectiveness: Codex CLI is brand new (as of mid-2025), but it's poised to "square off with similar AI coding assistants from rivals like Anthropic and Cursor," as The Indian Express noted in its coverage. Early impressions highlight the convenience of having an AI agent tightly integrated with local development workflows. Many developers appreciate that the graded autonomy (the three modes) allows them to dial in the right balance between automation and control – this is seen as a smart approach for safety and trust. In Full Auto mode, Codex CLI essentially brings capabilities similar to Cursor or Devin onto the developer's own machine, but backed by OpenAI's latest models. A major benefit of this local approach is privacy and security: your source code doesn't leave your environment; typically only high-level prompts or diffs are sent to the model for processing. This addresses a common concern enterprises have with cloud-based coding AIs (where sensitive code might be sent to an external service).
OpenAI's push with Codex CLI also signals where things are heading: a future where agentic AI is a standard part of the developer toolkit. The fact that Codex CLI is open-source and free to use (you just need your API key for OpenAI) dramatically lowers the barrier to trying an autonomous coding agent. It could quickly gain popularity given OpenAI's brand recognition and distribution. It's also notable that there have been reports of OpenAI eyeing an acquisition of the coding AI startup Windsurf for around $3 billion – possibly to fold Windsurf's agentic IDE technology and user base into Codex CLI or a similar offering. This shows OpenAI's strong commitment to leading in the agentic coding arena. Overall, Codex CLI's effectiveness will depend largely on the model it uses. The default o4-mini model is new and still relatively untested in this role, but it's presumably a capable code model. If it performs well in practice, developers might find themselves increasingly delegating terminal-based chores to this AI agent and focusing more on higher-level design and decision-making, knowing that the AI can handle the iterative coding and debugging loop locally.
OpenAI Codex Web
Description: OpenAI quietly rolled out Codex Web last week as another entry in their increasingly confusing "Codex" product line (joining the original model powering GitHub Copilot and the CLI tool I covered previously). Unlike IDE plugins or editor extensions, Codex Web functions as a standalone web application focused on autonomous coding implementation. The closest comparisons would be Cognition's Devin or SweepAI - tools designed to autonomously implement entire features or fixes based on high-level instructions.
Key Features:
Repository Integration: After authenticating with 2FA, you grant Codex Web access to specific GitHub repositories. It then scans your codebase and suggests tasks it can perform - from explaining code structure to fixing bugs it identifies. What's particularly impressive is that it found several legitimate issues in my personal website repo during testing, including stray terminals, implementation flaws, and incorrect imports. These weren't just linting issues but actual functional bugs that would impact performance.
Execution Environment: When Codex tackles a task, it spins up a virtual environment to actually execute your code. This approach has profound implications. On the positive side, it can verify changes work before committing them, and execution reveals runtime issues static analysis might miss. You also don't risk breaking your local environment. The downside is that build cycles are excruciatingly long compared to modern development workflows relying on hot module reloading and incremental builds.
Branch Management: Each task operates in its own separate thread, with checkpointing directly tied to git branches. This seems like an intentional design choice that might feel more natural to developers in the long run, but currently adds complexity to the workflow as context doesn't transfer smoothly between operations.
Limited External Access: Despite having repository access, Codex Web can't access GitHub issues or external documentation. When I referenced an issue number during testing, it responded that "network access is blocked" and couldn't determine the requirements. I ended up manually copying and pasting the issue text. This reveals OpenAI's approach: Codex Web only accesses what's explicitly enabled, even when it technically has the permissions.
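Until that changes, a practical workaround is to fetch the issue body yourself and paste it into the task prompt. The GitHub REST endpoint below is real; the repository coordinates are placeholders:

```python
# Fetch a GitHub issue locally so its text can be pasted into a Codex Web task.
import requests  # pip install requests

def fetch_issue(owner: str, repo: str, number: int, token: str | None = None) -> str:
    url = f"https://api.github.com/repos/{owner}/{repo}/issues/{number}"
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    resp = requests.get(url, headers=headers, timeout=10)
    resp.raise_for_status()
    issue = resp.json()
    return f"{issue['title']}\n\n{issue['body'] or ''}"

# Example with a placeholder repository:
# print(fetch_issue("octocat", "hello-world", 1))
```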
Pricing: OpenAI is currently enabling Codex Web for Pro ($200/month) and Team subscribers, seemingly testing the waters before a wider rollout or a separate pricing tier. There's no indication yet whether this will remain bundled with existing subscriptions or eventually become a separate product with its own pricing structure.
Effectiveness: After several hours of testing on my personal site, I found Codex Web occupies an interesting middle ground in the autonomous coding landscape. Its code understanding and bug identification capabilities are genuinely impressive, and its autonomous implementation of fixes generally works well. The clean interface offers minimal setup friction compared to other tools.
However, the build process is painfully inefficient for iterative development, workflow integration with issues and CI/CD is limited, and there's no ability to leverage CLI tools or custom scripts. Context is also limited to your code rather than research or external documentation.
This positioning reflects the broader landscape splitting into three distinct approaches:
Full IDEs (Bolt, v0, Cursor) that replace your existing editor
CLI/Plugin Agents (like Claude Code) that enhance your tools
PR-style agents (Codex Web, Devin) that autonomously create pull requests
Codex Web feels caught between worlds - not seamless enough for daily coding, yet not fully autonomous enough to handle end-to-end workflows. It's a promising glimpse into an autonomous coding future, but several iterations away from practical reality for most development teams. I'll keep it in my arsenal for occasional code review and maintenance tasks, but it won't replace my existing workflow anytime soon.
Replit Ghostwriter (AI Agent Mode)
Description: Replit's Ghostwriter is an AI-powered coding assistant integrated into Replit's cloud development environment (the Replit online IDE). Ghostwriter has existed as an autocomplete and chat-based helper since 2022, but Replit has recently introduced more agentic features that allow Ghostwriter not only to suggest code but to actually execute code and iterate within the Replit IDE. In other words, Ghostwriter can now act as an AI agent for coding, especially for projects hosted on Replit, closing the loop from writing code to running and debugging it.
Notable Agentic Features:
Interactive Execution: Ghostwriter can run your code directly in the Replit environment to help identify runtime errors or verify outputs. Since every Repl (project) on Replit provides an execution sandbox for languages like Python, Node.js, C++, etc., the AI can compile and execute programs as needed. As one reviewer noted in early 2025, Replit's agent "literally runs your program (in a sandbox) and can pinpoint runtime errors, then attempt to fix them in subsequent iterations," acting like a tireless junior dev that keeps running and fixing until tests pass. This tight integration of coding and execution allows Ghostwriter to automatically move from writing code, to testing it, to fixing it – all in one flow.
Debugging Assistance: If a user's program crashes or throws an exception, Ghostwriter can detect that and immediately offer a fix. The latest Ghostwriter Chat interface even includes a "Debugger" feature that tells you when there's an error and suggests what change might resolve it (almost like a built-in pair programmer pointing out your mistake).
Project-Aware Context: Ghostwriter has access to the entire project's code (since all files are in the Replit workspace), so it can answer questions about the codebase or make multi-file changes in one go. It will automatically include relevant file context when you ask it to implement something. This is similar to how Cursor's agent mode works, in that it "knows" about all your project files, not just the single file you're editing.
Version Control & Commit: Replit added a feature where Ghostwriter can commit code changes on your behalf. It's not a full Git integration, but under the hood it saves snapshots of your code changes. This acts as an approval mechanism – you get to review what Ghostwriter has changed and then commit those changes to the project's history. It's a way to build trust, since you can always see a diff of what the AI did and roll it back if needed.
Multi-Language Support: Because Replit supports a wide array of programming languages, Ghostwriter's agent mode isn't limited to a particular stack. It can just as easily run a Python script or a Node.js app or compile C++ code. It also has knowledge of common frameworks (Flask, Django, React, etc.) and can set up boilerplate code for them. In practice, this means you can have a single project that includes, say, a Python backend and a JavaScript/HTML frontend, and Ghostwriter can handle both sides.
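To illustrate the review mechanism mentioned under Version Control & Commit, the core idea – snapshot the file before the AI edits it, then show a unified diff for the user to accept or roll back – fits in a few lines of standard-library Python. This is a generic illustration, not Replit's implementation:

```python
# Generic snapshot-and-review pattern (illustrative, not Replit's code).
import difflib
from pathlib import Path

def snapshot(path: Path) -> str:
    """Capture the file contents before the AI touches it."""
    return path.read_text()

def review_and_commit(path: Path, before: str) -> bool:
    """Show a diff of the AI's changes; keep them only if the user approves."""
    after = path.read_text()
    diff = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"{path} (before)",
        tofile=f"{path} (after)",
    )
    print("".join(diff))
    if input("Keep these changes? [y/N] ").strip().lower() == "y":
        return True
    path.write_text(before)  # roll back to the snapshot
    return False
```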
Pricing: Replit Ghostwriter is offered as a subscription service. There is a free tier on Replit that gives you a limited number of AI-driven completions or answers (and possibly uses a smaller/slower model for Ghostwriter), suitable for casual use. For heavy use, there's a Pro plan around $20 per month that provides unlimited Ghostwriter access and priority computing resources. Notably, Replit offers Ghostwriter for free to students and teachers as part of their education initiative – similar to how GitHub Copilot is free for students. This education program is meant to encourage adoption among the next generation of programmers, allowing classrooms to use Ghostwriter without additional cost.
Use Cases:
Learning and Experimentation: Beginners can use Ghostwriter to get live help as they code. The agent can explain error messages and even automatically fix simple bugs. This is a huge boon to someone learning programming, as it's like having a tutor who not only tells you what went wrong but also shows you how to fix it.
Rapid Prototyping: A developer can describe a program or feature in natural language and let Ghostwriter generate the initial code, run it, and then refine it in cycles. All of this happens in the browser with zero setup. This makes it incredibly fast to go from idea to a working prototype, especially for web apps or small scripts.
Integration with Deployments: Replit isn't just for coding; it also lets you deploy applications. Ghostwriter can thus not only build your code but also help deploy web applications or bots directly on Replit's platform. For example, you could ask Ghostwriter to "launch" or set up your project on a URL, and it can interface with Replit's deployment system.
Multi-language Projects: Because Replit supports front-end, back-end, and everything in between, you might create a project that has, say, a Python backend API and a JavaScript front-end. Ghostwriter can seamlessly assist with both parts in one environment. For instance, it could generate a Python Flask endpoint and also help write the corresponding HTML/JS for the front-end, all within the same Repl.
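As a concrete (and deliberately tiny) example of the kind of glue code Ghostwriter scaffolds for such a project, here is a minimal Flask endpoint of the sort described above; the route name and payload are illustrative assumptions rather than Ghostwriter output:

```python
# Minimal Flask API endpoint of the kind an assistant might scaffold for a
# Python backend + JavaScript front-end project (illustrative example).
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/api/greet", methods=["POST"])
def greet():
    data = request.get_json(silent=True) or {}
    name = data.get("name", "world")
    # The front-end would call this with fetch("/api/greet", {method: "POST", ...})
    return jsonify({"message": f"Hello, {name}!"})

if __name__ == "__main__":
    app.run(debug=True)
```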
Effectiveness: Ghostwriter's evolution into an agent has effectively made Replit a fully AI-powered development platform. In a sense, Replit can now offer something akin to other agentic dev tools (like Cursor's autonomous mode or even OpenAI's Codex CLI), but for any language and completely in the cloud. The convenience of having this in a single online IDE is significant: there's no need to manage API keys or local installations or custom setups – it just works in your browser with one click. Early users have found that Ghostwriter can often handle about 80% of the "glue code" – the boilerplate, setup, configuration, and minor logic – allowing the developer to focus on the core creative aspects of their project.
That said, Ghostwriter (by default) is a bit less aggressive or autonomous than tools like Devin or Cursor's agent mode. It often behaves more like a super-smart assistant unless you explicitly ask it to take over a task. This is partly by design, to avoid the AI making unintended large-scale changes without user oversight. One clear advantage Ghostwriter has is collaboration: Replit is multiplayer by design, so multiple users can see what Ghostwriter is doing in real time and even chat with it together. This opens up interesting possibilities, like pair programming where one of the "pair" is the AI agent and a whole team can observe or guide it.
In terms of raw code generation quality, Ghostwriter's underlying models are very good but perhaps not at the absolute bleeding edge of something like GPT-4 (Replit uses a combination of its own code model and OpenAI's models). In practice, Ghostwriter's suggestions are high-quality, though occasionally it may produce an error or require a nudge to get a complex algorithm exactly right. However, the ability to run and debug code closes that gap significantly – if Ghostwriter makes a mistake, it will often catch and correct it on the next run, or at least flag the issue to the user.
Overall, Replit Ghostwriter with its new agentic capabilities provides one of the most accessible entry points into autonomous coding. Any developer can hop into replit.com, create a project, and immediately have an AI co-developer who not only suggests code but also executes it and verifies it. As one user put it, it's like having an "AI programming buddy in the cloud" who is always available to help. For solo developers and learners, this can dramatically accelerate the coding process, and even for experienced devs it can automate the tedious parts of coding so they can focus on design and architecture.
Other Notable New Tools
In addition to the major tools described above, there are other emerging solutions worth mentioning:
Amazon CodeWhisperer / Amazon Q: Tech giant Amazon has developed its own AI coding assistant (CodeWhisperer), and more recently Amazon Q for Developers, which the company touts as "the most capable generative AI–powered assistant for software development." This assistant (announced in late 2023) is deeply integrated with AWS cloud services. It can help design cloud architectures, generate infrastructure-as-code (e.g. Terraform or CloudFormation scripts), and handle application code while performing built-in security scans. The introduction of Amazon Q reflects a trend of cloud providers embedding AI agents to help developers build, operate, and optimize software on their platforms. For enterprises already invested in the AWS ecosystem, Amazon Q offers an AI pair programmer that has intrinsic knowledge of AWS services and best practices out-of-the-box.
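For a flavor of the infrastructure-as-code output such an assistant produces, here is a minimal AWS CDK stack in Python (CDK synthesizes CloudFormation under the hood, so it's a stand-in for the raw templates mentioned above). The stack contents are an illustrative assumption, not something captured from Amazon Q:

```python
# Minimal AWS CDK (v2) stack in Python; `cdk synth` turns this into CloudFormation.
# Illustrative only - not output captured from Amazon Q.
from aws_cdk import App, Stack, aws_s3 as s3
from constructs import Construct

class StaticAssetsStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # A versioned S3 bucket for static assets.
        s3.Bucket(self, "AssetsBucket", versioned=True)

app = App()
StaticAssetsStack(app, "StaticAssetsStack")
app.synth()
```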
Open-Source Agent Frameworks: There has been an explosion of open-source projects and frameworks for building multi-agent AI systems. Examples include Microsoft's AutoGen (an open-source framework from Microsoft Research for LLM-based agents working together), and community-driven projects like GPT-Engineer and CAMEL that let developers script multiple AI agents collaborating on tasks (for instance, one agent might be the "coder," another the "tester," and another the "architect" in a project). These aren't turnkey end-user tools like the ones above; rather, they are toolkits that enable developers to create custom agentic coding solutions. This trend signals a future where developers might assemble their own bespoke AI dev teams for specific workflows. (For example, using AutoGen or CAMEL, a company could set up a coding agent that interacts with a testing agent and a documentation agent to fully automate a software development pipeline in a controlled way.)
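For a taste of what these frameworks look like in practice, here is a minimal two-agent setup in the style of Microsoft's AutoGen (pyautogen): a "coder" assistant paired with a proxy agent that executes the proposed code and feeds results back. Treat the exact class names and arguments as approximate – the API has shifted across versions – and the config assumes an OpenAI API key:

```python
# Minimal coder + executor pair in the style of AutoGen (pyautogen).
# Class names/arguments are approximate and may differ between versions.
from autogen import AssistantAgent, UserProxyAgent

llm_config = {"config_list": [{"model": "gpt-4o", "api_key": "YOUR_API_KEY"}]}

coder = AssistantAgent(
    name="coder",
    llm_config=llm_config,
    system_message="You write and fix Python code. Reply with runnable code blocks.",
)

# The user proxy executes the code the coder proposes and reports the results back.
executor = UserProxyAgent(
    name="executor",
    human_input_mode="NEVER",
    code_execution_config={"work_dir": "scratch", "use_docker": False},
)

executor.initiate_chat(
    coder,
    message="Write a function that parses a CSV of orders and prints total revenue, then test it.",
)
```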
Google's Gemini (evolving): As of early 2025, Google has not yet shipped a fully autonomous coding agent on the order of Devin, but its Gemini family of models already shows very strong coding capabilities and is steadily being woven into Google's developer tools – Gemini Code Assist (the successor to Codey) in IDEs and Google Cloud, and AI assistance in Google Colab and Cloud Shell. Earlier research from DeepMind – notably the AlphaCode project, which in 2022 performed at roughly the level of a median human competitor in programming contests – suggests that Google is not far behind in the autonomous coding race. In practical terms, we may soon see Google offering an AI agent that integrates with tools like Android Studio or Google Cloud Build to act as an AI co-developer. This is one space to watch, as a robust offering from Google could quickly become popular given how many developers use Google's ecosystem.
These emerging tools and frameworks underscore how quickly the field is evolving. Every major tech company – and many startups – are racing to add more autonomy and smarter capabilities to their coding AI offerings. The result is a rapidly expanding ecosystem of agentic coding assistants targeting different niches (from general-purpose tools, to cloud-specific helpers, to frameworks for low-code or no-code users).
Thanks, Bob, for the rundown. I've tried several of these tools myself, and there is clearly a lot of overlap between them. I find myself switching back and forth when one AI gets stuck, which turns out to be useful but feels a little like throwing shit at different walls to see what sticks.
I have also tried asking one AI to evaluate the work of another, which has yielded some interesting results, although I am still a little skeptical of the output: whenever I propose a refactoring, I get a reply starting with "that's an excellent idea", which, statistically speaking, cannot always be the case... Some of my ideas must not be great, and I would appreciate some constructive critical feedback. Perhaps I need to work on my prompts to elicit it?