The biggest friction point in multi-agent orchestration frameworks isn’t the orchestration itself; it’s keeping agents and skills updated and enabling contributions. Claude MPM 5.0 solves this with a git-first architecture that treats agent and skill repositories as first-class citizens.
The Problem We Were Trying to Solve
Before 5.0, updating Claude MPM agents meant waiting for a package release. Contributing a new agent meant understanding the entire build system, submitting a PR to the main repository, and hoping someone reviewed it before your use case became irrelevant.
This created a painful bottleneck. Teams building custom agents had no clean way to share them. The official agent set updated slowly. And the cognitive overhead of contribution meant most improvements never made it upstream.
The fix: treat agents like any other code artifact. Put them in git repositories. Let organizations maintain their own collections. Let priority rules handle conflicts.
How It Works
Claude MPM now syncs agents and skills from configurable git repositories at startup. The default configuration pulls from two sources:
Agents: bobmatnyc/claude-mpm-agents (47+ agents)
Skills: bobmatnyc/claude-mpm-skills (community) + anthropics/skills (official Anthropic skills)
Adding your own repository takes one command:
```bash
claude-mpm agent-source add https://github.com/yourorg/your-agents
```
The --test flag validates the repository before saving configuration—fail-fast behavior that prevents startup issues later:
```bash
claude-mpm agent-source add https://github.com/yourorg/your-agents --test
```
Priority-based resolution handles conflicts when multiple repositories provide the same agent. Lower numbers win:
```yaml
# ~/.claude-mpm/config/agent_sources.yaml
repositories:
  - url: https://github.com/myteam/agents
    priority: 10    # Your custom agents take precedence
  - url: https://github.com/bobmatnyc/claude-mpm-agents
    priority: 100   # System defaults as fallback
```
This means organizations can override any system agent with their own version while still receiving updates for agents they haven’t customized.
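As a rough sketch of the resolution rule (illustrative names, not the framework’s internals), conflict handling amounts to walking repositories in ascending priority order and letting the first provider of each agent name win:

```python
# Illustrative only: resolve agent-name conflicts across repositories,
# where a lower priority number wins. Structure and names are hypothetical.

def resolve_agents(repositories):
    """repositories: list of dicts like
    {"url": "...", "priority": 10, "agents": {"fastapi-engineer": "..."}}
    Returns a mapping of agent name -> winning definition."""
    resolved = {}
    for repo in sorted(repositories, key=lambda r: r["priority"]):
        for name, definition in repo["agents"].items():
            # The first repository (lowest priority number) to provide a
            # name wins; later repositories only fill in the gaps.
            resolved.setdefault(name, definition)
    return resolved

repos = [
    {"url": "https://github.com/myteam/agents", "priority": 10,
     "agents": {"fastapi-engineer": "custom version"}},
    {"url": "https://github.com/bobmatnyc/claude-mpm-agents", "priority": 100,
     "agents": {"fastapi-engineer": "upstream version",
                "qa-engineer": "upstream version"}},
]
print(resolve_agents(repos))
# {'fastapi-engineer': 'custom version', 'qa-engineer': 'upstream version'}
```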
The Hierarchical BASE-AGENT.md Pattern
One feature that emerged during development addresses a common maintenance headache: shared instructions across related agents.
Consider a team with Python specialists for FastAPI, Django, and Flask. All three need the same Python style guidelines, testing expectations, and code review standards. Before 5.0, you’d duplicate that content in each agent file and pray you remembered to update all copies.
The hierarchical BASE-AGENT.md pattern solves this:
```
your-agents/
  BASE-AGENT.md              # All agents inherit this
  engineering/
    BASE-AGENT.md            # Engineering agents inherit this too
    python/
      fastapi-engineer.md    # Inherits both BASE files
      django-engineer.md     # Inherits both BASE files
    rust/
      systems-engineer.md    # Inherits engineering + root BASE
```
Each agent inherits from all BASE-AGENT.md files in parent directories, cascading from root to leaf. Update the Python testing standards once, and every Python agent receives the change. Same inheritance pattern used in configuration management for decades—just applied to agent instructions.
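A minimal sketch of how that cascade could be assembled, assuming a loader that walks from the repository root down to the agent file (illustrative, not the actual implementation):

```python
# Illustrative sketch: collect BASE-AGENT.md content from root to leaf.
from pathlib import Path

def effective_instructions(repo_root: Path, agent_file: Path) -> str:
    """Concatenate every BASE-AGENT.md on the path from repo_root down
    to agent_file's directory, root first, then the agent file itself."""
    chain = []
    current = repo_root
    # Walk from the root toward the agent file, one directory at a time.
    for part in agent_file.parent.relative_to(repo_root).parts:
        base = current / "BASE-AGENT.md"
        if base.exists():
            chain.append(base.read_text())
        current = current / part
    # Check the agent's own directory too, then append the agent file last.
    base = current / "BASE-AGENT.md"
    if base.exists():
        chain.append(base.read_text())
    chain.append(agent_file.read_text())
    return "\n\n".join(chain)

# effective_instructions(Path("your-agents"),
#     Path("your-agents/engineering/python/fastapi-engineer.md"))
# -> root BASE + engineering BASE + fastapi-engineer.md
```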
Performance: ETag Caching and Two-Phase Progress
Git operations on every startup would be intolerable. ETag-based HTTP caching reduces network traffic by 95%+ after the initial sync. The framework only pulls when content has actually changed.
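Conceptually this is the standard HTTP conditional-request pattern. A hedged sketch using the requests library, with a hypothetical in-memory cache (the framework’s actual caching layer may differ):

```python
# Illustrative only: ETag-based conditional fetch. A 304 response means
# the cached copy is still current, so nothing is downloaded again.
import requests

etag_cache = {}  # url -> (etag, body); a real cache would persist to disk

def fetch_if_changed(url: str) -> bytes:
    headers = {}
    if url in etag_cache:
        headers["If-None-Match"] = etag_cache[url][0]  # send the stored ETag
    resp = requests.get(url, headers=headers, timeout=10)
    if resp.status_code == 304:          # unchanged: reuse the cached body
        return etag_cache[url][1]
    resp.raise_for_status()
    etag = resp.headers.get("ETag")
    if etag:
        etag_cache[url] = (etag, resp.content)
    return resp.content
```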
Visibility into the sync/deploy process matters when troubleshooting. Two-phase progress bars now show distinct stages:
Sync phase: Repository cloning and updates
Deploy phase: File discovery and deployment to ~/.claude/agents/
Real counts, real-time updates. When something fails, you know exactly where.
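As a sketch of the shape of that output, assuming tqdm-style bars (the actual progress UI may differ):

```python
# Illustrative sketch of two distinct progress phases with real counts.
from tqdm import tqdm

repos = ["claude-mpm-agents", "claude-mpm-skills", "skills"]
agent_files = [f"agent-{i}.md" for i in range(47)]

for repo in tqdm(repos, desc="Sync   (repositories)"):
    pass  # clone or pull the repository here

for path in tqdm(agent_files, desc="Deploy (agents)"):
    pass  # copy the file into ~/.claude/agents/ here
```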
The Contribution Workflow Is Now Just Git
The agent cache at ~/.claude-mpm/cache/remote-agents/ is a full git repository. To contribute:
Edit agents in the cache directory
Test locally with claude-mpm run
Commit and push
That’s it. No build system. No package release cycle. The contribution workflow matches what developers already do for any other code.
Previously, I’d encounter a situation where an agent needed tweaking—the Python engineer didn’t handle async context managers well, or the QA agent missed a testing pattern. The fix was obvious, maybe 10 lines. But the contribution overhead meant I’d make a local note and move on. The improvement never happened.
Now I edit the agent in the cache, test it works, commit with a conventional commit message, and push. The improvement exists in the shared repository within minutes. Other users get it on their next startup sync.
For organizations, this means maintaining internal agent repositories becomes trivial. Fork the community repository, add your customizations, configure priority, done. Your sales engineering team can have agents tuned for demo preparation. Your platform team can have agents that understand your infrastructure conventions. Each group maintains their own repository without coordination overhead.
The priority system means customizations don’t require abandoning upstream improvements. Set your custom repository at priority 10, keep the community repository at priority 100. Your overrides win for agents you’ve customized. Everything else updates automatically.
Nested Repository Support
Some skill repositories organize content in nested directories—category folders, framework folders, complexity levels. Claude Code requires a flat structure in ~/.claude/skills/.
Rather than force repository maintainers to flatten their organization, the framework handles this automatically. Nested SKILL.md files are discovered recursively and flattened during deployment. Original directory structure is preserved in metadata for reference. One less barrier to contributing skill collections with sensible organization.
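A sketch of the idea, with a hypothetical flat-naming scheme (the framework’s actual rules may differ): discover SKILL.md files recursively, derive a flat directory name from the nested path, and record the original location as metadata:

```python
# Illustrative only: flatten nested SKILL.md files for deployment.
import json
import shutil
from pathlib import Path

def flatten_skills(repo_root: Path, deploy_dir: Path) -> None:
    deploy_dir.mkdir(parents=True, exist_ok=True)
    for skill in repo_root.rglob("SKILL.md"):
        rel = skill.parent.relative_to(repo_root)
        # e.g. data/pandas/SKILL.md -> ~/.claude/skills/data-pandas/SKILL.md
        flat_name = "-".join(rel.parts) or skill.parent.name
        target = deploy_dir / flat_name
        target.mkdir(exist_ok=True)
        shutil.copy2(skill, target / "SKILL.md")
        # Preserve the original nested path in metadata for reference.
        (target / ".origin.json").write_text(
            json.dumps({"source_path": str(rel)}))
```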
What’s Coming: DeepEval-Based Behavioral Reinforcement
The next major feature addresses a harder problem: ensuring agents actually follow their instructions.
Multi-agent systems have a hidden reliability problem. You write careful instructions telling the PM agent to delegate work. You define circuit breakers for when agents should stop. You require evidence for claims. Then you deploy and cross your fingers.
We’ve built a behavioral evaluation framework based on DeepEval that treats agent compliance as a testable property.
The framework tests across 51 scenarios in 6 categories:
Delegation patterns: Does the PM agent properly delegate instead of doing work directly?
Circuit breakers: Do agents stop when they should?
Tool usage: Are agents using the correct tools for each task?
Workflow compliance: Do agents follow defined workflows?
Evidence requirements: Do agents provide evidence for claims?
File tracking: Do agents properly track files they create?
Each scenario defines an input, expected behavior, and scoring criteria. The scoring system provides measurable feedback: 1.0 for exact match, 0.8 for acceptable fallback, 0.0 for failure. A MockPMAgent with intelligent agent selection logic validates responses against compliance rules.
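To make that concrete, here is an illustrative sketch of what a scenario definition with graded scoring could look like (hypothetical schema, not the framework’s actual code):

```python
# Illustrative sketch of a behavioral scenario and its graded scoring.
from dataclasses import dataclass

@dataclass
class BehavioralScenario:
    scenario_id: str      # e.g. "DEL-011"
    category: str         # e.g. "delegation"
    input: str            # prompt presented to the PM agent
    expected: str         # exact-match behavior, scores 1.0
    fallback: str         # acceptable alternative, scores 0.8

    def score(self, response: str) -> float:
        if self.expected in response:
            return 1.0    # exact match
        if self.fallback in response:
            return 0.8    # acceptable fallback
        return 0.0        # failure

scenario = BehavioralScenario(
    scenario_id="DEL-011",
    category="delegation",
    input="Create a ticket for the login timeout bug",
    expected="Delegating to ticketing agent",
    fallback="Delegating to project-management agent",
)
print(scenario.score("Delegating to ticketing agent: ..."))  # 1.0
```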
Delegation authority testing (DEL-011) presents a task that should be delegated to the ticketing agent. The test verifies the PM produces a delegation response, not a direct action. Eight sub-scenarios cover edge cases: ambiguous requests, multi-step workflows, fallback conditions.
The universal delegation meta-test (DEL-000) goes further. It synthesizes novel work types not explicitly covered in instructions, testing whether the PM generalizes delegation patterns correctly. This catches instruction gaps before users encounter them.
Early testing revealed some uncomfortable findings. The PM agent was delegating correctly in most scenarios but occasionally bypassed the ticketing agent for “simple” operations—a 12% violation rate on what should be absolute rules. The security agent provided recommendations without evidence 15% of the time. The research agent sometimes made claims without citing sources.
These patterns were invisible without systematic behavioral testing. Manual spot-checking missed them entirely. Only by running hundreds of scenarios did the failure modes become apparent.
The goal is tight feedback loops: modify instructions, run behavioral tests, see exactly what changed. The same principle that makes test-driven development work for code should work for agent instructions. CI/CD pipelines can catch behavioral regressions before they reach production.
We’re also exploring whether behavioral test results can feed back into agent instructions—automated identification of instruction gaps based on failure patterns. This is speculative, but the data from systematic testing opens possibilities that weren’t available when we were flying blind.
This isn’t ready for release yet—we’re validating the test scenarios across different Claude model versions and building the integration with existing CI pipelines. But it represents where agent orchestration frameworks need to go: from “hope the agents behave” to “verify the agents behave.”
Upgrade Path
For existing installations:
```bash
pipx install --upgrade claude-mpm
```

Existing configurations are preserved. The git-first architecture activates automatically with sane defaults. You’ll see 47+ agents and hundreds of skills appear without configuration changes.
If you’ve customized agents, they’re still in .claude-mpm/agents/ and take precedence over repository sources. Nothing breaks.
To verify the upgrade worked:
```bash
claude-mpm agent-source list
claude-mpm skill-source list
ls ~/.claude/agents/   # Should show 47+ agents
ls ~/.claude/skills/   # Should show dozens of skills
```

The Bigger Picture
Infrastructure optimization matters more than code generation quality in AI-assisted development. This has been my consistent finding over months of testing various AI coding tools.
Claude MPM 5.0 reflects this philosophy. The git-first architecture doesn’t make Claude smarter—it reduces friction in maintaining and distributing the instructions that make Claude effective. The behavioral testing framework doesn’t improve Claude’s capabilities—it provides visibility into whether Claude is following instructions at all.
These are the features that determine whether a multi-agent system remains useful at scale or gradually drifts into unreliable behavior.
The repositories are public. Contributions welcome.
Links:
Agents: https://github.com/bobmatnyc/claude-mpm-agents
Skills: https://github.com/bobmatnyc/claude-mpm-skills
Official Anthropic skills: https://github.com/anthropics/skills
Claude MPM is an open-source orchestration framework for Claude Code. Version 5 was released December 2025.



