Haoyu’s Substack

Rethinking Software Engineering Teams

Haoyu Zha — Mon, 02 Mar 2026 21:58:55 GMT

Something interesting is happening across the industry right now.

Teams gave their devs Claude Code. Engineers started writing code at 5x the old pace. Features that used to take a sprint were getting drafted in a day.

But then a quieter question started surfacing:

“If devs are writing code 5x faster... why aren’t we shipping 5x faster?”

Because writing code faster and shipping products faster are two very different things. And that gap is precisely the problem.

I’ve been thinking about this in four phases. Once you see it, you can’t unsee it.

Phase 1: The Production Line (Where We All Started)

This is the world we all grew up in. Software gets built like a factory assembly line. PM writes the spec. Design makes the mocks. Dev builds the thing. QA breaks the thing. Ship.

Each lane is a different team. Each team hands off to the next. The cycle time is the sum of all the lanes.

  ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
  │  PM      │────▶│  DESIGN  │────▶│   DEV    │────▶│    QA    │──▶ Ship
  │  (human) │     │  (human) │     │  (human) │     │  (human) │
  │ ████████ │     │ ████████ │     │ ████████ │     │ ████████ │
  └──────────┘     └──────────┘     └──────────┘     └──────────┘
    ~2 weeks         ~3 weeks         ~4 weeks         ~3 weeks

  Throughput per lane:

  PM      ████████████████████████████   ~2 wk
  Design  ████████████████████████████   ~3 wk
  Dev     ████████████████████████████   ~4 wk
  QA      ████████████████████████████   ~3 wk
          ──────────────────────────────
          Balanced. Predictable.           Total: ~12 weeks

It’s slow. Everyone knows it’s slow. But it’s predictably slow. Every lane takes roughly the same amount of time. The bottleneck is everywhere, which means the bottleneck is nowhere.

This is the world that AI was supposed to fix.

Phase 1.5: The Trap (Where A Lot of Teams Are Right Now)

AI tools rolled out. But they didn’t roll out evenly.

Dev got Claude Code, Cursor — the works. Suddenly a senior dev is generating code at 5x the old pace. Features that used to take a sprint are getting drafted in a day.

Design got some help too. AI-assisted prototyping, concept generation. Maybe a 1.5x improvement.

PM? About the same. QA? Barely touched.

Now look what happens to the pipeline:

  ┌──────────┐  ┌────────┐  ┌────┐         ┌───────────────────────┐
  │  PM      │─▶│ DESIGN │─▶│DEV │────────▶│          QA           │──▶ Ship
  │  (human) │  │ (human │  │(h+ │         │  (human, no AI help)  │
  │          │  │  + AI) │  │AI) │         │                       │
  │ ████████ │  │ ██████ │  │ ██ │         │ ████████████████████  │
  └──────────┘  └────────┘  └────┘         └───────────────────────┘
    ~2 wk        ~2 wk      ~1 wk     ⚠ BOTTLENECK   ~5 wk
                  (1.5x)     (5x)                      (swamped)

Dev went from 4 weeks to 1 week. But QA didn’t get faster — it got slower. It’s absorbing 5x the volume with the same headcount, the same manual processes, the same regression suite that takes three days to run.

  PM      ████████████████████████████   ~2 wk  (no change)
  Design  ████████████████████░░░░░░░░   ~2 wk  (modest gains)
  Dev     ████░░░░░░░░░░░░░░░░░░░░░░░   ~1 wk  (5x faster)
  QA      ████████████████████████████████████   ~5 wk  ◀── CONSTRAINT
          ─────────────────────────────────────

End-to-end: ~10 weeks. Not 5x faster. Barely faster at all.
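
The arithmetic is easy to check. A minimal sketch, using the lane durations from the two diagrams above; the dictionary and function names are illustrative only, not from any tool.

```python
# Lane durations in weeks, read off the Phase 1 and Phase 1.5 diagrams.
PHASE_1 = {"PM": 2, "Design": 3, "Dev": 4, "QA": 3}
PHASE_1_5 = {"PM": 2, "Design": 2, "Dev": 1, "QA": 5}


def cycle_time(lanes):
    """Sequential pipeline: end-to-end time is the sum of the lanes."""
    return sum(lanes.values())


def speedup(before, after):
    """How many times faster the whole pipeline got."""
    return cycle_time(before) / cycle_time(after)


print(cycle_time(PHASE_1))          # 12
print(cycle_time(PHASE_1_5))        # 10
print(speedup(PHASE_1, PHASE_1_5))  # 1.2
```

A 5x gain in one lane shows up as a 1.2x gain end to end, because the other lanes still add their full duration.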

This is the Theory of Constraints playing out in real time: a pipeline moves only as fast as its slowest stage. You didn’t eliminate the bottleneck; you just moved it downstream. And the faster dev produces code, the worse the QA bottleneck gets.
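
A toy simulation makes the constraint visible. The rates here are assumptions for illustration, not measured data: dev used to finish one feature per week and now finishes five, while QA still clears one per week. The function name is hypothetical.

```python
def simulate_backlog(dev_rate, qa_rate, weeks):
    """Track how many finished features sit waiting in front of QA.

    Rates are features per week. Returns (shipped, backlog) after `weeks`.
    """
    backlog = 0
    shipped = 0
    for _ in range(weeks):
        backlog += dev_rate              # dev hands work to the QA queue
        cleared = min(backlog, qa_rate)  # QA can only clear its own capacity
        backlog -= cleared
        shipped += cleared
    return shipped, backlog


# Hypothetical rates: matched lanes before AI, then dev at 5x.
print(simulate_backlog(dev_rate=1, qa_rate=1, weeks=10))  # (10, 0)
print(simulate_backlog(dev_rate=5, qa_rate=1, weeks=10))  # (10, 40)
```

Shipped output is identical in both runs. The only thing the faster dev lane changed is the size of the queue. The sketch leaves out the extra slowdown QA suffers from juggling that queue, so if anything it understates the problem.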

I hear this from teams constantly. “Dev is shipping so fast now, but we can’t get anything through QA.” Or: “We have 47 PRs waiting for review.” Or: “We’re shipping faster but quality is dropping because we’re cutting corners on testing to keep up.”

Phase 1.5 is a mirage. The velocity charts look great. The end-to-end delivery doesn’t.

Phase 2: Give AI to Everyone (Where Smart Teams Are Heading)

The natural next step: if uneven AI adoption created bottlenecks, give AI tools to every lane. PM gets AI. Design gets AI. Dev has AI. QA gets AI. Level the playing field.

This is a real improvement. It works. It’s worth doing.

  ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
  │  PM      │────▶│  DESIGN  │────▶│   DEV    │────▶│    QA    │──▶ Ship
  │ (human   │     │ (human   │     │ (human   │     │ (human   │
  │  + AI)   │     │  + AI)   │     │  + AI)   │     │  + AI)   │
  │ ██████   │     │ ██████   │     │ ██████   │     │ ██████   │
  └──────────┘     └──────────┘     └──────────┘     └──────────┘
    ~1 week          ~1.5 weeks       ~1 week          ~1.5 weeks

  Throughput per lane:

  PM      ██████████████░░░░░░░░░░░░░░   ~1 wk    (2x faster)
  Design  ████████████████░░░░░░░░░░░░   ~1.5 wk  (2x faster)
  Dev     ██████████████░░░░░░░░░░░░░░   ~1 wk    (4x faster)
  QA      ████████████████░░░░░░░░░░░░   ~1.5 wk  (2x faster)
          ──────────────────────────────
          Balanced again. Every lane improved.  Total: ~5 weeks

A genuine 2-3x improvement. You should absolutely pursue this.

But here’s the thing most people miss: it’s still a production line.

You still have four separate teams. Four separate handoffs. Four queues. Four sets of context lost in translation. AI made each station faster, but the architecture of how work flows is exactly the same.

  Phase 2 is still sequential:

  PM ━━━▶ Design ━━━━▶ Dev ━━━▶ QA ━━━━▶ Ship
  (fast)  (fast)       (fast)   (fast)
          ║                      ║
          ║  still a handoff     ║  still a handoff
          ║  still waiting       ║  still waiting

Phase 2 is like putting faster engines in every car on a single-lane road. The cars are faster. The road is still one lane.

5 weeks is great. But the real question is: why do we need four separate teams at all?

But First: What The Diagrams Don’t Show You

Every diagram above is a lie. A generous, best-case lie.

Real software development is never a clean left-to-right pipeline.

  What the diagram shows:

  PM ──▶ Design ──▶ Dev ──▶ QA ──▶ Ship


  What actually happens:

  PM ──▶ Design ──▶ Dev ──▶ QA ──┐
              ▲        ▲         │
              │        └─────────┤  "This doesn't match the spec"
              │                  │
              └──────────────────┤  "This flow doesn't work,
                                 │   we need to redesign"
                                 │
         PM ◀────────────────────┘  "Users are hitting a bug
                                     in production, we need to
                                     rethink the whole approach"

QA finds a bug — back to dev. Dev discovers the design breaks at edge cases — back to design. A production incident forces PM to reprioritize everything. Design and dev go back and forth three times before the interaction feels right.

These rework loops cross team boundaries every time. And that’s where three things break:

Context has to be re-hydrated at every boundary. The designer who made that mock is on a different project now. The dev has to write up a ticket, attach screenshots, and hope the designer can reload the mental model she had three weeks ago. Both sides pay this tax. Every loop. Every time.

Capacity utilization collapses. Dev is blocked waiting for design. Design is idle waiting for QA to surface issues. QA has nothing for two weeks, then gets slammed with five features at once. Work arrives in chunks that don’t match any team’s free capacity, so a bug that takes two hours to fix takes five days to schedule.

  Theoretical capacity:

  PM      ████████████████████████████  100%
  Design  ████████████████████████████  100%
  Dev     ████████████████████████████  100%
  QA      ████████████████████████████  100%


  Actual capacity:

  PM      ████░░██░░░░████░░░░██░░░░░░  ~50%
  Design  ░░████░░░░██░░░░████░░░░░░██  ~45%
  Dev     ██░░████░░░░██░░░░░░████░░██  ~55%
  QA      ░░░░░░██░░████░░██░░░░████░░  ~40%
          ──────────────────────────────
          ░░ = idle / blocked / context-switching / wrong priority

Coordination eats the remaining capacity. Every rework loop requires: file a ticket, triage it, assign it, wait for a sprint slot, re-explain context, review the fix, re-test. For a two-hour fix, you spend eight hours on coordination.

  Where time actually goes on a "12-week" feature:

  Productive work:   ██████████████████████░░░░░░░░░░░░░░  ~35%
  Rework:            ████████████░░░░░░░░░░░░░░░░░░░░░░░░  ~20%
  Waiting/blocked:   ██████████████░░░░░░░░░░░░░░░░░░░░░░  ~25%
  Coordination:      ████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░  ~20%
                     ──────────────────────────────────────
                     Only ~35% of the 12 weeks is actual building.
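
Converted to weeks, the same chart looks like this. A small sketch using only the percentages above; the variable and function names are illustrative.

```python
TOTAL_WEEKS = 12
# Percent of elapsed time per category, from the chart above.
SHARE_PCT = {"productive": 35, "rework": 20, "waiting": 25, "coordination": 20}


def weeks_by_category(total_weeks, share_pct):
    """Split the elapsed weeks according to each category's share."""
    return {name: total_weeks * pct / 100 for name, pct in share_pct.items()}


breakdown = weeks_by_category(TOTAL_WEEKS, SHARE_PCT)
overhead = TOTAL_WEEKS - breakdown["productive"]

print(breakdown["productive"])  # 4.2
print(round(overhead, 1))       # 7.8
```

About four weeks of building, wrapped in nearly eight weeks of everything else.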

This is true in Phase 1, Phase 1.5, and Phase 2. Even in Phase 2 — where every lane is faster — the rework loops still cross team boundaries, context still has to be re-hydrated, and scheduling is still a problem in itself. Phase 2 shrinks the loops from weeks to days, but they’re still there. Roughly 40% of Phase 2’s 5 weeks is still structural waste.

The production line was never just slow because the stations were slow. It was slow because the architecture — sequential handoffs, cross-team rework, context loss, scheduling overhead — was eating most of the time.

Which is exactly why Phase 3 changes everything.

Phase 3: One Person, A Team of AI Agents (The Real Unlock)

Phase 3 isn’t about giving AI tools to existing roles. It’s about merging the roles entirely — because AI agents can now own each function, and a single person can orchestrate all of them.

One person. A PM agent that writes specs and user stories. A design agent that generates mocks and flows. A dev agent that writes, reviews, and refactors code. A QA agent that writes tests, runs regressions, and flags issues.

You’re not a worker on the assembly line. You’re the conductor of an AI orchestra.

                      ┌───────────────────┐
                      │                   │
                      │    YOU            │
                      │    (orchestrator) │
                      │                   │
                      └─────────┬─────────┘
                                │
                  ┌─────────────┼──────────────┐
                  │             │              │
           ┌──────┴─────┐ ┌─────┴──────┐ ┌─────┴──────┐
           │            │ │            │ │            │
           ▼            ▼ ▼            ▼ ▼            ▼
   ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐
   │  PM Agent  │ │Design Agent│ │ Dev Agent  │ │  QA Agent  │
   │            │ │            │ │            │ │            │
   │  Writes    │ │  Generates │ │  Writes    │ │  Writes &  │
   │  specs,    │ │  mocks,    │ │  code,     │ │  runs      │
   │  user      │ │  flows,    │ │  reviews,  │ │  tests,    │
   │  stories,  │ │  assets    │ │  refactors │ │  flags     │
   │  priorities│ │            │ │            │ │ regressions│
   └──────┬─────┘ └──────┬─────┘ └──────┬─────┘ └──────┬─────┘
          │              │              │              │
          └──────────────┴──────┬───────┴──────────────┘
                                │
                                ▼
                            Ship it

This doesn’t just speed up the stations. It dissolves every structural problem we just talked about.

Context never leaves. You talked to the PM agent this morning, reviewed the design agent’s output over lunch, checked the dev agent’s code after that, and read the QA agent’s test results before dinner. The full context lives in one head. No re-hydration. No tickets explaining what happened three weeks ago. No designer who’s already moved on.

Rework loops shrink from weeks to minutes. Rework isn’t the enemy — slow rework across team boundaries is. When the loop is tight and the context is shared, rework is just iteration.

  Phase 1/2 rework loop (multi-team):

  QA finds bug
    └──▶ File ticket in dev backlog
           └──▶ Wait for next sprint (~1-2 weeks)
                  └──▶ Dev picks it up, needs design input
                         └──▶ Ping design team, wait (~3-5 days)
                                └──▶ Design responds
                                       └──▶ Dev fixes
                                              └──▶ Back to QA queue (~3-5 days)
                                                     └──▶ QA re-tests

  Elapsed time for one rework loop: 2-4 weeks


  Phase 3 rework loop (one orchestrator):

  QA agent flags bug
    └──▶ You see it immediately
           └──▶ Tell dev agent to fix, ping design agent to review
                  └──▶ Both respond in minutes
                         └──▶ QA agent re-tests

  Elapsed time for one rework loop: minutes to hours
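
The 2-4 week figure for the multi-team loop comes almost entirely from the waits, not the work. A quick sketch that adds them up, assuming calendar days and treating "~1-2 weeks" as 7 to 14 days; the names are illustrative.

```python
# Waiting steps from the multi-team loop above, as (min_days, max_days).
MULTI_TEAM_WAITS = {
    "wait for next sprint": (7, 14),   # ~1-2 weeks
    "wait for design input": (3, 5),
    "wait in QA queue": (3, 5),
}


def loop_range_days(waits):
    """Sum the best-case and worst-case waits separately."""
    low = sum(lo for lo, _ in waits.values())
    high = sum(hi for _, hi in waits.values())
    return low, high


low, high = loop_range_days(MULTI_TEAM_WAITS)
print(low, high)                              # 13 24
print(round(low / 7, 1), round(high / 7, 1))  # 1.9 3.4
```

The fix itself is not in that total at all. Add the hands-on time for the fix and the re-test and you land in the 2-4 week range.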

Capacity utilization jumps dramatically. No idle time waiting for another team. If the QA agent finds a bug, you route it to the dev agent right now. No scheduling. No sprint boundaries. No “we’ll get to it next week.” Sure, agents still iterate, still have gaps — but 80% utilization at machine speed is a different universe from 50% utilization at human speed.

  Phase 1/2 actual capacity:

  PM      ████░░██░░░░████░░░░██░░░░░░  ~50%
  Design  ░░████░░░░██░░░░████░░░░░░██  ~45%
  Dev     ██░░████░░░░██░░░░░░████░░██  ~55%
  QA      ░░░░░░██░░████░░██░░░░████░░  ~40%


  Phase 3 actual capacity:

  PM Agent      ████████░░████████░░██████  ~80%
  Design Agent  ██████░░████████░░████████  ~80%
  Dev Agent     ████████████░░██████████░░  ~85%
  QA Agent      ██████░░██████████░░██████  ~80%
                ──────────────────────────────
                Agents still iterate. Still idle sometimes.
                But the loops are minutes, not weeks.
                And 80% at machine speed beats 50% at human speed
                by orders of magnitude.

And here’s what most people underestimate: AI agents don’t work at human speed. They work at machine speed. A PM agent doesn’t need two weeks to write a spec — it needs two hours. A QA agent doesn’t need a week to run regressions — it needs an afternoon.

Parallel execution. Machine speed. Dramatically higher capacity utilization. Rework loops that cost minutes, not weeks.

  Phase 3 (parallel, machine speed, one orchestrator):

            Day 1      Day 2      Day 3      Day 4
         ┌──────────┬──────────┬──────────┬──────────┐
  PM     │ ░░░░░░░░ │ ░░░░░░░░ │          │          │
  Agent  │          │          │          │          │
         ├──────────┼──────────┼──────────┼──────────┤
  Design │ ░░░░░░░░ │ ░░░░░░░░ │ ░░░░░░░░ │          │
  Agent  │          │          │          │          │
         ├──────────┼──────────┼──────────┼──────────┤
  Dev    │          │ ░░░░░░░░ │ ░░░░░░░░ │ ░░░░░░░░ │
  Agent  │          │          │          │          │
         ├──────────┼──────────┼──────────┼──────────┤
  QA     │          │ ░░░░░░░░ │ ░░░░░░░░ │ ░░░░░░░░ │
  Agent  │          │          │          │          │
         └──────────┴──────────┴──────────┴──────────┘
                      Continuous, not gated.
                      One person driving all four.
                      Machine speed, not human speed.

  PM Agent      ██░░░░░░░░░░░░░░░░░░░░░░░   ~1 day
  Design Agent  ██░░░░░░░░░░░░░░░░░░░░░░░   ~1 day
  Dev Agent     ██░░░░░░░░░░░░░░░░░░░░░░░   ~1 day
  QA Agent      ██░░░░░░░░░░░░░░░░░░░░░░░   ~1 day
                ──────────────────────────
                Parallel. Machine speed.       Total: ~4 days
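
Why four days and not the sum of the agents' time: the windows overlap. A minimal sketch, with the active days for each agent read off the chart above. That reading is my interpretation of the diagram, and the function names are illustrative.

```python
# (first_day, last_day) each agent is active, read off the chart above.
SCHEDULE = {
    "PM Agent": (1, 2),
    "Design Agent": (1, 3),
    "Dev Agent": (2, 4),
    "QA Agent": (2, 4),
}


def parallel_days(schedule):
    """Overlapping work: elapsed time runs from first start to last finish."""
    start = min(first for first, _ in schedule.values())
    end = max(last for _, last in schedule.values())
    return end - start + 1


def sequential_days(schedule):
    """The same windows run as gated handoffs: the spans add up."""
    return sum(last - first + 1 for first, last in schedule.values())


print(parallel_days(SCHEDULE))    # 4
print(sequential_days(SCHEDULE))  # 11
```

The same windows laid end to end would take eleven days. Overlap alone accounts for most of the difference, before any gain from faster agents.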

4 days. Not 4 weeks. Not 5 weeks. Not 12 weeks. Four days.

And it’s not just the speed — it’s the quality of the time:

  Where time goes — Phase 1 vs Phase 3:

  Phase 1 (12 weeks):
  Productive work:   ██████████████████████░░░░░░░░░░░░░░  ~35%
  Rework:            ████████████░░░░░░░░░░░░░░░░░░░░░░░░  ~20%
  Waiting/blocked:   ██████████████░░░░░░░░░░░░░░░░░░░░░░  ~25%
  Coordination:      ████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░  ~20%

  Phase 3 (4 days):
  Productive work:   ████████████████████████████████░░░░   ~80%
  Rework/iteration:  ██████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░   ~12%
  Waiting/blocked:   ██░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░   ~5%
  Coordination:      █░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░   ~3%
                     ──────────────────────────────────────

What used to be a quarter of work — three months of cross-team coordination, standups, sprint planning, handoff meetings, QA cycles, and release trains — collapses into a single work week. One person, one laptop, four agents, shipping on Friday what used to ship in March.

Karl Marx wrote in 1859 that “at a certain stage of development, the material productive forces of society come into conflict with the existing relations of production... From forms of development of the productive forces these relations turn into their fetters.”

In plain language: the way work is organized initially helps technology grow — but eventually the old organizational model becomes the thing holding it back. The structure that once enabled progress becomes the constraint on it.

That’s exactly what’s happening right now.

The production line — PM, Design, Dev, QA as separate teams with sequential handoffs — was the right structure when every function required deep human specialization. It enabled massive scale. But AI agents have fundamentally changed what’s possible, and the org structure hasn’t caught up. The very structure that enabled the last era of productivity is now the fetter on the next one.

The steam engine existed for decades before it transformed manufacturing. The breakthrough wasn’t a better engine. It was the factory — a new organizational model designed around what the engine made possible. The resistance was never technical. It was structural. People who organized work around the old constraints couldn’t imagine organizing it differently.

12 weeks → 10 weeks → 5 weeks is what you get from better tools inside the old structure.

5 weeks → 4 days is what you get when you redesign the structure itself.

The tools are not the bottleneck. The org is.

Phase 1 → 1.5 was about giving some humans AI tools. Phase 1.5 → 2 was about giving every human AI tools. Phase 2 → 3 is about redesigning the organization around what AI makes possible.

The tools are ready. The question is whether the org is.

Environment Engineering: The Evolving Role of the Software Engineer

Haoyu Zha — Wed, 25 Feb 2026 23:19:33 GMT

An agent that needs your input every few minutes is a tool. An agent that works autonomously for hours is a colleague. The difference isn’t the agent’s intelligence — it’s the environment you put it in.

Most engineers are focused on what to say to agents — better prompts, better context, better instructions. That matters. But it’s the smaller lever. The bigger lever is what the agent can do: what tools it has, what feedback it gets, how completely it can close its own loop. I call this environment engineering, and it’s becoming the highest-leverage skill in the industry.

Agents Are Capable. They’re Also Trapped.

Coding agents have read more documentation than you ever will. They can assess whether something is working and they know what “done” looks like. These capabilities are already strong, and they’re improving fast — every model generation brings broader knowledge and sharper judgment.

But none of that matters if they can’t reach the real world. Your docs live in Notion. Your UI lives in a browser. Your data lives in a production database. Your requirements live in a Slack thread. The agent can’t grab any of it. It’s a brilliant worker locked in an empty room.

Context Engineering Is Necessary but Insufficient

Context engineering — giving agents the right information to reason well — has gotten a lot of attention. But context lets an agent think about the world. It doesn’t let it act, observe the outcome, and course-correct.

  Context only:

  Context ──→ Agent reasons ──→ Outputs code ──→ Human tests
     ↑                                               │
     └─────────── Human provides feedback ───────────┘

  The human IS the feedback loop.
  Agent stops and waits every cycle.

Environment engineering goes further. An environment is context plus tools and feedback:

Context — the information the agent needs. Instructions, codebase knowledge, documentation, requirements. This is what context engineering already covers.
Tools + Feedback — the actions the agent can take and the signal it gets back. Running tests and seeing pass/fail. Opening a browser and visually inspecting the result. Querying a database and checking the output. Tools without feedback are just buttons the agent pushes blindly. Tools with feedback are how the agent learns whether what it did was right.

Not all tools are equal. A test suite that runs in two seconds and returns a clear failure message is a different instrument than one that takes ten minutes and dumps a wall of stack traces. The engineering skill isn’t just giving agents tools — it’s designing tools that produce fast, precise signal the agent can iterate against. The quality of that signal determines whether the agent self-corrects or spirals.
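
Here is one way a tool can be shaped for signal. This is a sketch under my own assumptions, not the setup described later in this post: a hypothetical `run_check` wrapper around any command that returns pass/fail, elapsed time, and only the tail of the output.

```python
import subprocess
import sys
import time


def run_check(command, max_lines=15, timeout=60):
    """Run a command and return a short, structured result for an agent.

    Keeps only the tail of the output so a failure reads as a clear
    message rather than a wall of stack traces.
    """
    started = time.monotonic()
    try:
        proc = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
        passed = proc.returncode == 0
        output = proc.stdout + proc.stderr
    except subprocess.TimeoutExpired:
        passed = False
        output = f"timed out after {timeout}s"
    tail = output.strip().splitlines()[-max_lines:]
    return {
        "passed": passed,
        "seconds": round(time.monotonic() - started, 2),
        "summary": "\n".join(tail),
    }


# Stand-in "test suite" that fails with a one-line message.
result = run_check(
    [sys.executable, "-c", "raise SystemExit('button overlaps nav')"]
)
print(result["passed"])   # False
print(result["summary"])  # button overlaps nav
```

The timeout and the line cap are the design choices that matter here. A slow check gets cut off, and a noisy one gets trimmed to the part an agent can act on.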

The Closed Loop Changes Everything

When an agent has context and tools that return strong signal, it can close its own feedback loop:

  With a well-designed environment:

          ┌──── no ─────────────────────────────┐
          ↓                                     │
    Agent writes code ──→ Runs tests ──→ Pass? ─┘
                          Opens browser    │
                          Checks types    yes
                                           │
                                           ↓
                                         Done

  No human in the loop.
  Feedback steers the agent autonomously.
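
In code, the loop in the diagram is small. A minimal sketch: `write_code` and `run_checks` are hypothetical callables standing in for an agent and its tools, and the toy versions below exist only to make the example runnable.

```python
def closed_loop(write_code, run_checks, max_iterations=5):
    """Iterate until checks pass or the budget runs out.

    write_code(feedback) -> candidate
    run_checks(candidate) -> (ok, feedback)
    """
    feedback = None
    for attempt in range(1, max_iterations + 1):
        candidate = write_code(feedback)
        ok, feedback = run_checks(candidate)
        if ok:
            return candidate, attempt
    raise RuntimeError(f"no passing candidate after {max_iterations} attempts")


# Toy stand-ins: the "agent" bumps a margin until the "test" stops complaining.
def toy_agent(feedback):
    return 0 if feedback is None else feedback["tried"] + 8


def toy_checks(margin_px):
    ok = margin_px >= 16
    return ok, {"tried": margin_px, "message": "button overlaps nav bar"}


print(closed_loop(toy_agent, toy_checks))  # (16, 3)
```

The iteration cap is the one place a human comes back in. If the agent cannot converge, the loop fails loudly instead of spinning.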

This is the core unlock. The agent’s autonomous time — how long it can work without needing you — expands from minutes to hours. That expansion is the leverage.

METR’s research puts hard numbers on this. They measure the length of tasks, in terms of how long the work takes a human professional, that frontier AI agents can complete autonomously, and they have found that this time horizon has been doubling roughly every seven months.
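
To get a feel for what a seven-month doubling time implies, here is the compounding written out. This is plain extrapolation of the trend as stated, not a forecast, and the function name is illustrative.

```python
def horizon_multiplier(months, doubling_months=7):
    """Growth factor after `months` if the horizon doubles every `doubling_months`."""
    return 2 ** (months / doubling_months)


print(horizon_multiplier(7))             # 2.0
print(horizon_multiplier(14))            # 4.0
print(round(horizon_multiplier(24), 1))  # 10.8
```

If the trend holds, two years is roughly a 10x longer task.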

But METR measures the ceiling: what an agent can do under idealized conditions. Whether it actually reaches that ceiling depends on the completeness and quality of the environment. A frontier model in a weak environment (no tests, no browser, no feedback) can underperform a weaker model in a well-designed one. The raw capability is there. The environment is what unleashes it.

The agent is only as good as the signal it iterates against. And once you have a closed loop that works, you can multiply it — run multiple agents in parallel, each in its own isolated environment, each closing its own loop independently. You’re not multiplying your time. You’re multiplying autonomous agents.
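
The shape of that multiplication, as a sketch. Each task gets its own scratch directory, and a file write stands in for an agent doing real work in its own container or checkout. The task names and `run_in_isolation` are hypothetical.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_in_isolation(task_name):
    """Give one task its own scratch directory, then report what it produced."""
    with tempfile.TemporaryDirectory(prefix=f"{task_name}-") as workdir:
        output = Path(workdir) / "result.txt"
        output.write_text(f"{task_name}: done")  # stand-in for agent work
        return output.read_text()


tasks = ["checkout-page", "search-bar", "settings-form"]  # hypothetical features
with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
    results = list(pool.map(run_in_isolation, tasks))

print(results)
```

Nothing is shared between the workers, which is the point. Every task writes the same filename and none of them collide.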

What This Looks Like in Practice

I set up isolated Docker containers for frontend development — each with its own database, server, and full application stack — and gave the agent a browser tool so it could open the site, visually inspect the result, and click through the UI.

The feedback design is what makes it work. The agent runs the type checker, executes the test suite, opens the browser, and visually verifies the result. Each tool returns a different signal: types catch structural errors, tests catch logic errors, the browser catches everything a user would actually see. Fast, layered feedback the agent can iterate against.
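
A sketch of the layering idea, not my actual setup: run the checks in order, cheapest first, and stop at the first failure so the agent gets one clear message. The three lambdas are toy stand-ins for a type checker, a test suite, and a browser probe.

```python
def run_layers(candidate, layers):
    """Run checks in order; stop at the first failure.

    `layers` is a list of (name, check) pairs where check(candidate)
    returns an error string, or None when the layer passes.
    """
    for name, check in layers:
        error = check(candidate)
        if error is not None:
            return {"passed": False, "layer": name, "error": error}
    return {"passed": True, "layer": None, "error": None}


# Toy checks standing in for types, tests, and a visual inspection.
layers = [
    ("types", lambda c: None if isinstance(c["width"], int) else "width must be int"),
    ("tests", lambda c: None if c["width"] > 0 else "width must be positive"),
    ("browser", lambda c: None if c["width"] <= 320 else "button overlaps nav bar"),
]

print(run_layers({"width": 400}, layers)["layer"])   # browser
print(run_layers({"width": 200}, layers)["passed"])  # True
```

A width of 400 passes types and tests and only fails once something looks at the page. That last layer is the gap the browser tool closes.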

Before this environment, the loop ran through me. The agent would write code, I’d check the browser, describe what was broken, and the agent would try again. Every iteration cost me five minutes of context-switching, and the agent couldn’t do anything in the meantime.

Now the agent handles the full cycle on its own. It writes code, spins up the stack, opens the browser, sees that a button is overlapping the nav bar, fixes the CSS, re-checks — three iterations I never see. Multiple agents run in parallel across different features without stepping on each other.

This wasn’t a better prompt or a smarter model. It was an investment in the environment.

The Takeaway

The product we build as software engineers is changing. It used to be the software. Now it’s the environment that builds the software — and the quality of that environment directly determines how much of an agent’s raw capability you actually capture.

The engineers who figure this out first — who stop optimizing prompts and start designing environments — will have a compounding advantage as agents improve. Every improvement to the environment pays dividends across every task that runs inside it. Every leap in model capability makes a well-designed environment more powerful.

The agents are getting smarter. The question is whether your environment is ready for them.