How I Rebuilt My AI Dev Stack After Anthropic Killed OAuth
I run 9 AI agents, a CRM, a voice agent, a website, and auto-posting workflows. So when Anthropic disabled OAuth for third-party harnesses in January 2026, this wasn't a minor inconvenience. It broke a real production workflow.
My Claude Code setup through OpenClaw stopped working the way I had been using it. If I wanted to keep using Claude Code in that setup, I now needed an API key. The subscription alone was no longer enough.
At the same time, Codex CLI with GPT-5.4 kept working through ChatGPT OAuth. So I didn't try to force one tool to do everything. I rebuilt the stack around a different principle:
Don't depend on one model or one vendor — depend on infrastructure that survives model changes.
What Broke
Before this change, Claude Code inside my broader OpenClaw workflow was part of a larger multi-agent system. Then Anthropic disabled OAuth for third-party tools.
That meant:
- Claude Code through OpenClaw no longer worked off subscription auth alone
- Using it in practice now required API-key-based access
- The old “one subscription covers the workflow” assumption was gone
This matters a lot more if you are not doing isolated coding sessions, but running agents daily across products, memory, automation, and coordination. A broken auth layer is not just a tooling problem. It becomes an operations problem.
What I Did Instead
I didn't “switch from Claude to Codex.” I built a stack where each tool covers the weaknesses of the others.
1. Claude Code (Native CLI, Opus 4.6)
This is what I use for the heavy work — migrations, architecture, deploy, code review, harder reasoning-heavy implementation. I run it natively now, with an API key.
But I also structured it properly instead of keeping everything in one giant instruction file. I had a 300-line CLAUDE.md. Claude follows it worse the longer it gets. So I split it into 5 path-scoped rule files: agent-crm, agentforgeai, luna-voice, security, infrastructure. CRM rules load when I touch CRM code. Voice agent rules load for voice agent code. The rest stays out of context.
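The selection logic behind path-scoped rules can be sketched in a few lines of Python. This is an illustration of the idea, not how Claude Code loads rules internally: the five rule names come from the split described above, while the glob patterns, the directory layout, and the `rules_for` helper are hypothetical.

```python
from fnmatch import fnmatch

# Rule file name -> glob patterns for the paths it covers.
# The names match the five rule files; every pattern is an illustrative guess.
RULE_SCOPES = {
    "agent-crm": ["crm/*"],
    "agentforgeai": ["website/*"],
    "luna-voice": ["voice/*"],
    "security": ["auth/*"],
    "infrastructure": ["deploy/*", "*.yml"],
}


def rules_for(touched_paths):
    """Return the sorted rule names whose patterns match any touched path."""
    selected = set()
    for rule, patterns in RULE_SCOPES.items():
        for path in touched_paths:
            # fnmatch's "*" also matches "/", so "crm/*" covers nested files.
            if any(fnmatch(path, pattern) for pattern in patterns):
                selected.add(rule)
                break
    return sorted(selected)
```

With this mapping, touching `crm/api/routes.py` selects only `agent-crm`, so the other four rule files never enter context.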
On top of that:
- Deny rules at the harness level: rm -rf, .env reads, and secrets/ access are blocked in settings.json. Not relying on the model to “be careful.” Taking the option away
- Custom slash commands: /review (diff vs main), /deploy-check (pre-deploy checklist), /crm-status (health check CRM instances), /start (load session context from memory)
- Ruff + Biome as mandatory code gates: every Python file gets ruff check --fix + ruff format before I see it. Every JS/TS/CSS file gets biome check --write. Baked into global instructions, not optional
- Session-end hooks: a Python script auto-captures every session (requests, tools used, files touched) and indexes it into ChromaDB. Next session starts with context, not a blank slate
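A minimal sketch of what the code gate could look like as a standalone script. The commands are the ones listed above; the extension mapping and the function names are assumptions for illustration, and the gates described here actually live in global instructions rather than in a script like this.

```python
import subprocess
from pathlib import Path

# Gate commands from the list above, as argv prefixes.
PYTHON_GATES = [["ruff", "check", "--fix"], ["ruff", "format"]]
BIOME_GATES = [["biome", "check", "--write"]]

# Assumed mapping from file extension to the gates that apply.
GATES_BY_SUFFIX = {
    ".py": PYTHON_GATES,
    ".js": BIOME_GATES,
    ".ts": BIOME_GATES,
    ".css": BIOME_GATES,
}


def gate_commands(path):
    """Return the commands (as argv lists) to run for one file."""
    suffix = Path(path).suffix.lower()
    return [cmd + [str(path)] for cmd in GATES_BY_SUFFIX.get(suffix, [])]


def run_gates(path):
    """Run every gate for the file; return True only if all exit with 0."""
    for argv in gate_commands(path):
        if subprocess.run(argv).returncode != 0:
            return False
    return True
```

Files with an unlisted extension get no commands, so `run_gates` passes them through untouched.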
Claude Code Pain Points
Claude Code is excellent, but not frictionless:
- Limits burn fast. One command can trigger 10-20 internal model calls. Run multiple agents on Opus simultaneously and your $100/mo quota disappears in 30 minutes. This happened to me twice in one week
- No OAuth for third-party harnesses. That door is shut. The subscription doesn't cover API-key usage in external tools
- Mobile is a workaround. Phone to Claude Code app to GitHub repo to MacBook pulls. The Telegram bridge I wrote helps (it wraps claude --resume SESSION_ID in a bot), but it's still running through a server
- MCP ecosystem is immature. Tried connecting EdgeLab MCP: 73 tools, looked promising. OAuth broken on their side. Wrote a health check script, moved on
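The core of a bridge like that is a thin wrapper around the CLI call. The sketch below covers only that part and leaves the Telegram side out. It assumes that session IDs are UUID-style tokens and that print mode (`-p`) is used so the CLI answers once and exits; the function names and the timeout are hypothetical, not taken from the actual bridge.

```python
import re
import subprocess

# Assumption: session IDs are UUID-style tokens (hex digits and dashes).
SESSION_ID_RE = re.compile(r"[0-9a-fA-F-]{8,64}")


def build_resume_command(session_id, prompt):
    """Build the argv for resuming a session non-interactively."""
    if not SESSION_ID_RE.fullmatch(session_id):
        raise ValueError(f"suspicious session id: {session_id!r}")
    # An argv list (no shell) keeps chat text from being interpreted by a shell.
    return ["claude", "-p", "--resume", session_id, prompt]


def resume_session(session_id, prompt, timeout=600):
    """Run the command and return stdout as the text to send back to chat."""
    result = subprocess.run(
        build_resume_command(session_id, prompt),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout.strip()
```

Validating the session ID before building the command matters here because the value arrives from a chat message, not from a trusted terminal.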
So Claude Code became my premium tool, not my only tool.
2. OpenClaw + Codex CLI (GPT-5.4)
This is my daily driver. Codex kept working through ChatGPT OAuth, which immediately made it more practical for routine work inside my broader system. In real workflows, reliability often beats elegance.
7 agents on MagicBox (a $24/mo DigitalOcean server):
- Caramel — coordinator, daily briefs, nightly audits
- Sixteen — dev agent, architecture, shipping
- Vibe — content, trends, social
- Rex — business analysis
- Mira — marketing
- Luna — voice agent (ElevenLabs + Twilio)
Each has a profile, workspace, and memory files. They push data to my own CRM — FastAPI + PostgreSQL + Telegram Mini App, open-sourced at github.com/kossvat/agent-crm.
Codex is fine for 80% of routine tasks. It falls apart on complex multi-file refactors or subtle production bugs where Opus shines. That's fine — that's what the dual stack is for.
3. Model Strategy Instead of Model Loyalty
- Codex / GPT-5.4 = daily work
- Opus = premium reasoning
- Qwen = budget tasks
I stopped thinking in terms of “best model.” I started thinking in terms of role-based model allocation. Each model has a job.
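Role-based allocation boils down to a small routing table. In the sketch below, the three roles follow the split above; the task categories, the model labels, and the `pick_model` helper are hypothetical placeholders, not real API model identifiers.

```python
# Role -> model label, following the split above.
# Labels are descriptive placeholders, not real API model identifiers.
MODEL_BY_ROLE = {
    "daily": "codex-gpt-5.4",
    "premium": "opus",
    "budget": "qwen",
}

# Hypothetical task categories and the role each one falls under.
ROLE_BY_TASK = {
    "multi_file_refactor": "premium",
    "production_bug": "premium",
    "architecture": "premium",
    "summarize_logs": "budget",
    "tagging": "budget",
}


def pick_model(task_type):
    """Route a task to a model; anything unlisted goes to the daily driver."""
    role = ROLE_BY_TASK.get(task_type, "daily")
    return MODEL_BY_ROLE[role]
```

Defaulting unlisted tasks to the daily driver keeps the premium model reserved for the cases that are explicitly named.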
The Shared Layer That Matters Most
The most important part of the stack is not Claude Code or Codex. It's the shared infrastructure underneath.
- Obsidian vault = shared memory. My OpenClaw agents write to it. Claude Code reads from it. Protocol enforced — when to write, what format, mandatory git push after every change. Mac syncs every 5 minutes via Obsidian Git. Context survives tool switches
- Memory MCP server = semantic search. Custom-built, runs on ChromaDB. Indexes the Obsidian vault, Claude memory files, and rules — 83+ documents searchable by meaning, not keywords. One agent writes a decision at 2pm, another finds it semantically at 6pm
- Git = bridge. Transport layer between environments, including mobile workflows and coding environments
- OpenClaw = orchestration layer. Multi-agent runtime, server execution, persistent coordination. This is the part I do not want tied to one model vendor
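The vault-to-ChromaDB step can be sketched as follows. This is a simplified illustration, not the actual memory server: it indexes whole notes without chunking, relies on ChromaDB's default embedding function, and uses a hypothetical collection name and helper names.

```python
from pathlib import Path


def load_notes(vault_dir):
    """Collect (id, text, metadata) for every non-empty Markdown note."""
    notes = []
    for path in sorted(Path(vault_dir).rglob("*.md")):
        text = path.read_text(encoding="utf-8").strip()
        if text:  # skip empty notes
            rel = path.relative_to(vault_dir).as_posix()
            notes.append((rel, text, {"source": rel}))
    return notes


def index_vault(vault_dir, db_dir, collection_name="memory"):
    """Upsert all notes into a persistent ChromaDB collection."""
    import chromadb  # imported lazily so load_notes works without it

    client = chromadb.PersistentClient(path=str(db_dir))
    collection = client.get_or_create_collection(collection_name)
    notes = load_notes(vault_dir)
    if notes:
        ids, docs, metas = zip(*notes)
        collection.upsert(
            ids=list(ids), documents=list(docs), metadatas=list(metas)
        )
    return collection


def search(collection, question, n_results=3):
    """Return the note paths semantically closest to the question."""
    hits = collection.query(query_texts=[question], n_results=n_results)
    return hits["ids"][0]
```

Using the note's relative path as its ID means re-indexing after a git pull overwrites the old version of a note instead of duplicating it.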
What This Shipped in 48 Hours
- Full CRM migration: SQLite to PostgreSQL, domain switch via Cloudflare Tunnel, bot security hardening
- Open-sourced the CRM: 77 files, 14K+ lines, MIT license, clean git history with zero leaked secrets
- Semantic memory system: ChromaDB + Obsidian + MCP server + CLI
- Telegram bridge for Claude Code via chat
- Complete Claude Code config: modular rules, deny list, session hooks, Ruff + Biome gates, 4 custom commands
One person. Two AI stacks. Real output.
The Lesson
Once your stack touches real operations, every weak assumption gets exposed — auth, model lock-in, memory fragmentation, tool incompatibilities, mobile vs server gaps.
Most people are still thinking about AI tooling too narrowly. They ask: which model is best?
The better question is: what infrastructure keeps working when one model breaks?
The real win is not choosing Claude Code or Codex. The real win is building a system where changing one model does not destroy your workflow. That is the difference between an AI demo and an AI operating system.
One repo. Multiple models. Shared context.
And that is a much more durable way to work.
I build these systems for clients using the same stack and protocols that run my own business. Discovery starts at $500, and most single-agent setups are deployed within two weeks.