AI Software Quality Needs a Factory Again

TL;DR

AI coding has increased code output faster than most teams have increased quality control.
The winning teams will treat software delivery more like a factory: small batches, tolerances, inspections, failure analysis, and yield.
Product leaders need to move budget from more features to better production systems: evals, observability, review loops, test design, and rollback discipline.

AI software quality is now a production-system problem. AI made code cheap, but it did not make review, testing, evals, rollback, observability, or product judgment automatic.

That distinction matters because a lot of teams are behaving as if generated code is the same thing as shipped software. It is not. Code is the raw material. Software is the thing that survives users, edge cases, permissions, latency, payment failures, timezone bugs, bad data, angry customers, and whatever your integration partner changed without telling you.

I have no patience for the lazy version of this argument, where someone waves at "quality" as a reason to reject AI coding. That is not the point. AI coding is useful. I use it every day. I shipped OpenChair and OpenTradie with AI in the loop, and the lesson was not that AI removes production discipline. The lesson was that AI makes production discipline the constraint.

The problem is that we compressed the creation step without rebuilding the production system around it.

AI coding exposed the missing factory

Factories are not great because they make parts quickly. They are great because they make parts repeatedly, inside tolerances, at useful yield.

Software forgot this lesson because for twenty years the constraint was labour. Hire enough engineers. Prioritise hard. Protect scope. Ship the highest-value slice because implementation was expensive.

AI changed the cost model. Now one builder can produce a week of code before lunch. A team can create five credible product paths in the time it used to take to write a design document. That is real progress.

It also creates a new failure mode: output rises, but the inspection system does not.

Google's DORA research is already showing the shape of the problem. The 2024 DORA report found that AI adoption increased individual productivity, flow, and job satisfaction, while negatively affecting software delivery stability and throughput. DORA's later impact report on generative AI in software development estimated that every 25% increase in AI adoption was associated with a 1.5% reduction in delivery throughput and a 7.2% reduction in delivery stability.

That is the smell of a production system under strain.

DORA offers one likely explanation: AI lets people produce more code in the same time, which can make changes larger. Larger changes are slower to review and more likely to break things. This is not anti-AI. It is basic delivery physics.

More code means more surface area. More surface area means more places for the system to fail.

Software needs tolerances

Hardware teams understand tolerances because physics punishes them immediately.

If two parts need to fit together and each part has variance, the product either assembles at scale or it does not. Nobody gets to argue with the factory floor. The part fits, squeaks, cracks, overheats, or fails inspection.

Software has tolerances too. We just hide them behind prettier words.

Latency tolerance. Data quality tolerance. Hallucination tolerance. Permission tolerance. Migration tolerance. Retry tolerance. Context-window tolerance. Human-review tolerance. Cost-per-run tolerance.

An AI feature that works 90% of the time is not "almost ready" if the missing 10% includes deleting the wrong record, messaging the wrong customer, or giving a regulated user advice it cannot defend.

I wrote about this in my piece on agent evals: manual vibe checks are not a production system. They are inspection theatre. If the model can take action, you need a repeatable way to measure whether that action is acceptable before and after every change.

The same is true for AI-generated software. Green CI is not enough. Someone still needs to understand what changed, why it changed, and which failure modes the model introduced while "helping".
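To make the idea of a repeatable check concrete, here is a minimal Python sketch of a gate that applies a separate tolerance to each severity and also blocks any regression against the previous baseline. Everything in it is hypothetical: the `EvalResult` record, the severity labels, and the tolerance numbers are illustrative assumptions, not a standard. The example at the bottom is the 90% case: nine of ten checks pass, but the one failure is critical, so the change is blocked.

```python
from dataclasses import dataclass

# Hypothetical tolerances: the maximum failure rate allowed per severity.
# The numbers are illustrative; each product has to set its own.
TOLERANCES = {"critical": 0.0, "major": 0.02, "minor": 0.10}


@dataclass(frozen=True)
class EvalResult:
    case_id: str
    severity: str  # one of the keys in TOLERANCES
    passed: bool


def failure_rates(results):
    """Return the failure rate for each severity present in the results."""
    totals, failures = {}, {}
    for r in results:
        totals[r.severity] = totals.get(r.severity, 0) + 1
        if not r.passed:
            failures[r.severity] = failures.get(r.severity, 0) + 1
    return {sev: failures.get(sev, 0) / n for sev, n in totals.items()}


def gate(candidate, baseline):
    """Return the reasons to block a change; an empty list means it may ship."""
    reasons = []
    cand_rates = failure_rates(candidate)
    base_rates = failure_rates(baseline)
    for sev, rate in cand_rates.items():
        if rate > TOLERANCES[sev]:
            reasons.append(f"{sev}: failure rate {rate:.0%} exceeds tolerance")
        # If the baseline has no cases of this severity, no regression is flagged.
        if rate > base_rates.get(sev, rate):
            reasons.append(f"{sev}: regressed against baseline")
    return reasons


# Nine minor cases pass in both runs; the single critical case fails after the change.
baseline = [EvalResult(f"c{i}", "minor", True) for i in range(9)]
baseline.append(EvalResult("c9", "critical", True))
candidate = [EvalResult(f"c{i}", "minor", True) for i in range(9)]
candidate.append(EvalResult("c9", "critical", False))

blocked = gate(candidate, baseline)
```

The design choice worth noting is that the gate never looks at the overall pass rate. A single number averages away exactly the failures that matter most.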

The new product bottleneck is yield

Factories care about yield because waste kills margin.

Software teams need the same lens. If AI lets your team create ten times more artefacts, the question is not "How many features did we ship?" The question is "What percentage of generated work reached production without rework, regressions, support tickets, or architectural debt?"

That is yield.

Most teams do not measure it. They measure velocity, tickets closed, pull requests merged, or features launched. Those metrics were already shaky when humans wrote all the code. With AI in the loop, they become worse because output is easier to inflate.

A team can look productive while quietly turning the product into wet cardboard.

Figure: Two-lane software yield workbench, showing messy generated code becoming defects beside a controlled path through tests, evals, traces, rollback, and release.

Useful yield metrics look different:

Percentage of AI-authored or AI-assisted changes reverted within 14 days
Defect rate per shipped change, segmented by human-authored, AI-assisted, and agent-authored work
Review time per meaningful architectural change, not per pull request
Escaped defects by workflow, not by component
Eval pass rate before and after prompt, model, retrieval, or tool changes
Cost per successful run, not cost per generated run

Those numbers are less flattering than "we shipped 40% more". Good. They tell you whether the factory works.
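As a rough illustration, three of the metrics above can be computed from a plain log of shipped changes. The Python sketch below assumes a hypothetical `Change` record with an authorship label, a ship date, an optional revert date, and a count of defects traced back to the change. None of these field names come from a real tool, and the sample data is invented for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

REVERT_WINDOW = timedelta(days=14)


@dataclass(frozen=True)
class Change:
    change_id: str
    authorship: str  # "human", "ai_assisted", or "agent"
    shipped_on: date
    reverted_on: Optional[date] = None
    escaped_defects: int = 0  # defects later traced back to this change


def revert_rate(changes, authorship):
    """Share of changes with this authorship reverted within 14 days of shipping."""
    group = [c for c in changes if c.authorship == authorship]
    if not group:
        return None  # no data is different from a 0% revert rate
    reverted = [
        c for c in group
        if c.reverted_on is not None
        and c.reverted_on - c.shipped_on <= REVERT_WINDOW
    ]
    return len(reverted) / len(group)


def defects_per_change(changes):
    """Average escaped defects per shipped change, segmented by authorship."""
    totals, defects = {}, {}
    for c in changes:
        totals[c.authorship] = totals.get(c.authorship, 0) + 1
        defects[c.authorship] = defects.get(c.authorship, 0) + c.escaped_defects
    return {a: defects[a] / totals[a] for a in totals}


def cost_per_successful_run(total_cost, successful_runs):
    """Total spend divided by runs that succeeded, not by runs attempted."""
    if successful_runs == 0:
        return None
    return total_cost / successful_runs


# Invented sample data, for illustration only.
changes = [
    Change("a", "ai_assisted", date(2025, 1, 1), reverted_on=date(2025, 1, 10)),
    Change("b", "ai_assisted", date(2025, 1, 2), escaped_defects=2),
    Change("c", "human", date(2025, 1, 3), reverted_on=date(2025, 2, 20)),
]

ai_revert_rate = revert_rate(changes, "ai_assisted")
human_revert_rate = revert_rate(changes, "human")
defect_rates = defects_per_change(changes)
run_cost = cost_per_successful_run(total_cost=50.0, successful_runs=40)
```

In the sample, the human-authored change was reverted well outside the 14-day window, so it does not count against the revert rate. Whether a late revert should count is a product decision, which is part of the point.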

Product managers need to own the factory design

This cannot sit only with engineering.

Product managers decide what kind of variability the product can tolerate. They decide when an AI workflow needs human review, when a failure can be retried, when uncertainty should be exposed to the user, and when the feature should refuse to act.

That is product work.

For a salon booking assistant, a wrong suggestion might be annoying. For a payment reversal, property valuation, care recommendation, or bank workflow, a wrong action can create legal, financial, or trust damage. The engineering pattern may look similar. The product tolerance is completely different.
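One way to make that tolerance explicit is to write it down as a policy the system can enforce. The Python sketch below is an illustration built on assumptions: the workflow names, the `confidence` score, and every threshold are hypothetical, and where the confidence score comes from is left open. A real product would need its own inputs and its own numbers.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ACT = "act"
    ASK_HUMAN = "ask_human"
    REFUSE = "refuse"


@dataclass(frozen=True)
class Tolerance:
    act_above: float     # at or above this confidence, act without review
    refuse_below: float  # below this confidence, refuse rather than escalate


# Hypothetical policy table, owned by product rather than buried in engineering code.
POLICY = {
    "salon_booking_suggestion": Tolerance(act_above=0.60, refuse_below=0.20),
    "payment_reversal": Tolerance(act_above=0.99, refuse_below=0.80),
}


def decide(workflow, confidence):
    """Map a workflow and a confidence score in [0, 1] to a decision."""
    if workflow not in POLICY:
        return Decision.REFUSE  # unknown workflows get no permission to act
    tol = POLICY[workflow]
    if confidence >= tol.act_above:
        return Decision.ACT
    if confidence < tol.refuse_below:
        return Decision.REFUSE
    return Decision.ASK_HUMAN


# The same confidence leads to different decisions in different workflows.
booking = decide("salon_booking_suggestion", 0.9)
reversal = decide("payment_reversal", 0.9)
```

The engineering is identical for both workflows. Only the table differs, and the table is the product decision.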

This is why passing tests are no longer enough. AI changes the shape of responsibility. A PM who cannot reason about evals, observability, rollbacks, and failure modes is not "staying strategic". They are opting out of the production system.

The factory needs product judgment baked into it.

What a software factory looks like in the AI era

It is not a heavyweight process office. It is not a return to six-month release trains.

A useful AI-era software factory has five operating habits.

Small batches. Generated code makes large changes tempting. Resist that temptation. Keep changes small enough that a human can inspect intent, not just syntax. AI should make batch sizes smaller, not larger.

Typed boundaries. Models are better inside contracts. Use schemas, structured outputs, explicit tool permissions, and narrow interfaces. The less ambiguity at the boundary, the less cleanup downstream.

Eval gates. Every AI workflow needs regression tests against real examples. Start with 20. Then 50. Then the weird cases that made support angry. Coverage should grow from production pain, not abstract imagination.

Traceability. If an AI workflow makes a decision, you need to know which prompt, model, retrieval result, tool call, and user input produced it. Without traces, debugging becomes folklore.

Fast rollback. AI systems change behaviour when prompts, models, tools, and context change. Treat those changes like deploys. Version them. Roll them back. Compare them against previous behaviour before users become the test suite.
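Three of these habits (typed boundaries, traceability, and fast rollback) can be sketched together in Python. This is an illustration under stated assumptions, not a reference design: `WorkflowConfig`, `ConfigRegistry`, `parse_action`, `trace`, and the `ALLOWED_TOOLS` set are hypothetical names, the model names are placeholders, and the model output is assumed to arrive as a JSON string.

```python
import json
from dataclasses import dataclass

ALLOWED_TOOLS = {"send_reminder", "reschedule"}  # hypothetical explicit tool permissions


@dataclass(frozen=True)
class WorkflowConfig:
    version: int
    model: str
    prompt: str


class ConfigRegistry:
    """Keeps every published config so a bad change can be rolled back."""

    def __init__(self, initial):
        self._history = [initial]

    @property
    def active(self):
        return self._history[-1]

    def publish(self, config):
        self._history.append(config)

    def rollback(self):
        # Drop the latest config, but never the last remaining one.
        if len(self._history) > 1:
            self._history.pop()
        return self.active


def parse_action(raw_output):
    """Typed boundary: accept only a JSON object with a permitted tool and a string argument."""
    data = json.loads(raw_output)  # invalid JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    tool, argument = data.get("tool"), data.get("argument")
    if not isinstance(tool, str) or tool not in ALLOWED_TOOLS:
        raise ValueError(f"tool not permitted: {tool!r}")
    if not isinstance(argument, str):
        raise ValueError("argument must be a string")
    return {"tool": tool, "argument": argument}


def trace(config, user_input, raw_output):
    """Record which config, input, and output produced a decision, including rejections."""
    record = {
        "config_version": config.version,
        "model": config.model,
        "prompt": config.prompt,
        "user_input": user_input,
        "raw_output": raw_output,
    }
    try:
        record["action"] = parse_action(raw_output)
        record["error"] = None
    except ValueError as exc:
        record["action"] = None
        record["error"] = str(exc)
    return record


registry = ConfigRegistry(WorkflowConfig(1, "model-a", "Be concise."))
registry.publish(WorkflowConfig(2, "model-b", "Be concise."))

ok = trace(registry.active, "move my 3pm", '{"tool": "reschedule", "argument": "16:00"}')
bad = trace(registry.active, "delete everything", '{"tool": "delete_account", "argument": "all"}')
restored = registry.rollback()
```

The rejected call is still traced, with the config version attached, so a spike in rejections after a prompt or model change points straight at the version to roll back. A fuller trace would also carry retrieval results and tool responses, which this sketch leaves out.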

None of this slows good teams down. It lets them keep going fast after the demo.

The teams that win will be boring underneath

The visible layer of AI product work is getting more magical. Voice agents. Coding agents. Multi-step automations. Generated interfaces. Personalised workflows.

Underneath, the winners will look dull: smaller changes, cleaner contracts, better monitoring, stronger evals, boring rollback drills, ruthless defect analysis.

That is the factory.

The gap between prototype and production is where most AI product strategies will die. Not because the model is weak. Because the organisation built a code generator and mistook it for a manufacturing line.

AI changed the cost of creation. Now rebuild the system that turns creation into software.

Frequently Asked Questions

Does AI coding reduce software quality?

AI coding does not automatically reduce quality. It increases code output, which stresses the review, testing, and release systems around the code. Quality drops when teams generate more change than their production system can safely absorb.

What is software yield?

Software yield is the percentage of generated or implemented work that reaches production without expensive rework, defects, rollbacks, or support load. In the AI era, yield matters more than raw output because generating more code is easy.

What should product managers learn about AI software quality?

Product managers should understand evals, traceability, human review thresholds, rollback design, and operational risk. They do not need to become release engineers, but they do need to define what "good enough to act" means for each workflow.