Day 26, Week 4 — Production

Real-World Case Studies

Why This Matters

Theory is useful. Watching real agents ship real code is better. Today we dissect three production-grade agentic systems — GitHub Copilot Workspace, Devin/SWE-bench, and OpenClaw — not to marvel at them, but to reverse-engineer why they work and what you can steal for your own builds.


1. GitHub Copilot Workspace: Plan-Edit-Verify Loop

Copilot Workspace (previewed in 2024) is GitHub's take on the full software task lifecycle. The key insight: it separates intent from execution with an explicit planning layer.

How it works:

  1. User describes a task ("fix this bug", "add this feature")
  2. Agent generates a plan: which files to touch, what changes to make, what tests to run
  3. User reviews and edits the plan (HITL checkpoint)
  4. Agent executes — file by file, diff by diff
  5. CI runs; agent reads results and iterates if needed

User Intent
    │
    ▼
[Plan Generation] ←── repo context (AST, grep, symbols)
    │
    ▼
[Human Review / Edit Plan] ←── HITL gate
    │
    ▼
[File Edits] → [CI / Tests] → [Re-plan if failing]

What to steal:

  • Make the plan a first-class artifact the user can edit before execution
  • Diff-based edits (not full rewrites) reduce hallucination surface and are easier to review
  • CI output is just another tool result — wire it back into the loop
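
The plan-as-first-class-artifact idea can be sketched in a few lines. Everything here is illustrative — the types, file paths, and function names are invented for the sketch, not Copilot Workspace's actual API:

```typescript
// Sketch: the plan is plain data the user can inspect and edit before anything runs.
type PlanStep = { file: string; change: string; test: string };
type Plan = { task: string; steps: PlanStep[] };

function generatePlan(task: string): Plan {
  // A real system derives this from repo context (AST, grep, symbols).
  return {
    task,
    steps: [{ file: "src/auth.ts", change: "guard against null token", test: "auth.spec.ts" }],
  };
}

// HITL gate: since the plan is data, the user can rewrite it before execution.
function reviewPlan(plan: Plan, edit: (p: Plan) => Plan): Plan {
  return edit(plan);
}

function executePlan(plan: Plan): string[] {
  // Execute file by file; each step yields a reviewable diff, not a full rewrite.
  return plan.steps.map((s) => `diff ${s.file}: ${s.change} (verified by ${s.test})`);
}

const approved = reviewPlan(generatePlan("fix login crash"), (p) => ({
  ...p,
  steps: [
    ...p.steps,
    { file: "src/session.ts", change: "expire stale sessions", test: "session.spec.ts" },
  ],
}));
const diffs = executePlan(approved);
console.log(diffs.length); // 2 (one diff per approved step)
```

The design point is that `reviewPlan` takes a user-supplied edit function: the human checkpoint is structural, not a prompt asking "does this look okay?"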

2. Devin & SWE-bench: Autonomous Engineering at Scale

Devin (Cognition AI) was the first agent to post a meaningful score on SWE-bench — a benchmark of real GitHub issues drawn from production repos (Django, Flask, Astropy, etc.). Devin resolves ~13–16% of tasks unassisted; frontier models alone resolve ~1–3%.

Architecture highlights:

  • Persistent shell session (not ephemeral commands) — the agent has a stateful terminal
  • Browser tool for reading docs, Stack Overflow, GitHub issues
  • Self-editing: can rewrite its own plans and recover from failures mid-task
  • Long horizon: tasks can take 30–60+ minutes of real execution time

SWE-bench teaches us what hard looks like:

Failure mode                         Frequency
-----------------------------------  ----------------
Wrong root cause identified          ~40% of failures
Correct fix, wrong file edited       ~20%
Tests pass but logic is wrong        ~15%
Context window exceeded mid-task     ~10%

// Devin-style persistent shell pattern
// Instead of one-shot exec, maintain a session:
const shell = await spawnShellSession();

await shell.run('cd /repo && git log --oneline -20');
const logs = await shell.read();

await shell.run('grep -r "failing_function" src/');
const matches = await shell.read();

// Agent builds mental model incrementally
// before touching any file

What to steal:

  • Stateful shell sessions beat one-shot commands for complex tasks
  • Build a "diagnosis phase" before touching any file — just read and grep
  • SWE-bench is worth running your own evals against; it humbles you fast

3. OpenClaw as a Personal Agent: The Ambient Model

You're living inside this one. OpenClaw is a different beast — not task-focused, but ambient. It runs continuously, monitors channels, manages memory across sessions, and dispatches sub-agents for longer work.

Architectural patterns worth noting:

Memory hierarchy:

Session context (ephemeral)
    └── memory/YYYY-MM-DD.md (daily raw log)
        └── MEMORY.md (curated long-term)
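
A minimal sketch of routing a note through this hierarchy, assuming the paths shown above. The functions are illustrative stubs that compute write targets without touching the filesystem:

```typescript
// Sketch: where does a new observation go? Session context is ephemeral;
// anything worth keeping lands in the daily raw log first. Promotion to
// MEMORY.md is a separate, deliberate curation pass, not an automatic write.
type Tier = "session" | "daily";

function dailyLogPath(date: Date): string {
  return `memory/${date.toISOString().slice(0, 10)}.md`; // memory/YYYY-MM-DD.md
}

function routeMemory(note: string, worthKeeping: boolean): { tier: Tier; target: string } {
  if (!worthKeeping) return { tier: "session", target: "(context only, discarded)" };
  return { tier: "daily", target: dailyLogPath(new Date()) };
}

const routed = routeMemory("user prefers short status updates", true);
console.log(routed.tier, routed.target);
```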

Sub-agent dispatch:

  • Main session receives a message → if task is long-running, spawns isolated sub-agent
  • Sub-agent has its own context, runs to completion, announces result
  • Avoids polluting the main session's context with noise
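
The dispatch flow above can be sketched as follows; the message shapes and the long-running flag are invented for illustration:

```typescript
// Sketch: long tasks run in an isolated context so their tool noise
// never enters the main session's history.
type Msg = { role: "user" | "assistant"; content: string };

function runSubAgent(task: string): string {
  const isolated: Msg[] = [{ role: "user", content: task }]; // fresh context
  // ...model/tool loop would run here against `isolated` only...
  return `done: ${task}`;
}

function handleMessage(main: Msg[], content: string, isLongRunning: boolean): void {
  if (isLongRunning) {
    // Only the final announcement returns to the main session.
    main.push({ role: "assistant", content: runSubAgent(content) });
  } else {
    main.push({ role: "user", content });
  }
}

const main: Msg[] = [];
handleMessage(main, "refactor billing module", true);
console.log(main.length); // 1: the announcement, not the sub-agent's transcript
```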

Heartbeat loop:

Every ~30 min:
  Read HEARTBEAT.md
  Check: email / calendar / mentions
  If nothing urgent → HEARTBEAT_OK
  If something found → send message to user
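
The loop's decision logic is small enough to sketch directly, with the channel checks stubbed out as pluggable predicates:

```typescript
// Sketch: a heartbeat that stays silent unless a check actually finds something.
type Check = { name: string; findings: () => string[] };

function heartbeat(checks: Check[]): string {
  const found = checks.flatMap((c) => c.findings().map((f) => `${c.name}: ${f}`));
  return found.length === 0 ? "HEARTBEAT_OK" : found.join("\n");
}

const quiet = heartbeat([
  { name: "email", findings: () => [] },
  { name: "calendar", findings: () => [] },
]);
console.log(quiet); // "HEARTBEAT_OK"

const busy = heartbeat([{ name: "mentions", findings: () => ["PR review requested"] }]);
console.log(busy); // "mentions: PR review requested"
```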

What makes it work in production:

  • Skills are versioned, self-contained (SKILL.md drives behavior)
  • AGENTS.md sets global invariants (safety rules, communication style)
  • Tool policy is separate from agent intelligence — the agent can't grant itself new permissions
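
That last point is worth a sketch. Assuming a simple allow-list (the tool names are illustrative), the policy lives outside the agent loop and every call is checked against it:

```typescript
// Sketch: the agent has no code path that mutates TOOL_POLICY, so it
// cannot grant itself new permissions no matter what it decides to do.
const TOOL_POLICY: ReadonlySet<string> = new Set(["read_file", "grep", "run_tests"]);

function invokeTool(name: string): string {
  if (!TOOL_POLICY.has(name)) {
    throw new Error(`policy violation: ${name} is not permitted`);
  }
  return `ran ${name}`; // a real dispatcher would execute the tool here
}

console.log(invokeTool("grep")); // "ran grep"

let denied = "";
try {
  invokeTool("delete_repo");
} catch (e) {
  denied = (e as Error).message;
}
console.log(denied); // "policy violation: delete_repo is not permitted"
```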

What to steal:

  • Separate agent policy (what it can do) from agent behavior (what it decides to do)
  • Ambient agents need memory persistence baked in from day one — retrofitting is painful
  • Heartbeat-driven proactivity is more reliable than always-on polling

Try This Today

Pick one real bug or feature request from a project you own. Walk through how Copilot Workspace would handle it:

  1. Write the plan as a markdown list: which files change, what each change does, what tests prove it works
  2. Execute the plan yourself, step by step
  3. Note where you deviated from your own plan — those are your agent's future failure modes

The exercise forces you to think like an orchestrator, not just a coder.


Resources