A practitioner’s guide for engineering organizations adopting AI code generation at scale. This guide defines a complete, tool-agnostic SDLC process that restores confidence in reviewing, merging, and deploying AI-generated code to production.
Reading time: ~30 minutes | Audience: Engineering leadership, architects, senior engineers
Contents
- The Confidence Crisis
- The Two Reviews
- The Governed SDLC
- Compilation — From Decisions to Enforcement
- The Maturity Model
- Reviewing at Scale — Why You Don’t Need to Read 500 Files
- Addenda
The Confidence Crisis
Something broke when engineering teams got fast at writing code.
Not the code itself — the code is often fine. What broke is the ability to trust it. To review it. To look at a pull request and say with confidence: “This is correct, this follows our standards, this is safe to deploy.”
AI code generation has solved the generation bottleneck. An engineer can now produce in an afternoon what used to take a week. A well-prompted AI assistant can scaffold an entire service, wire up database migrations, generate API endpoints, and write corresponding tests — all in a single session. The constraint is no longer “how fast can we write code.” The constraint is everything that happens after.
The Review Problem
Consider a pull request with 500 changed files. This is not a thought experiment. Teams using AI-assisted development produce PRs of this scale routinely. A single feature, fully generated, touching models, controllers, views, tests, migrations, and configuration.
Now ask the question every engineering organization depends on: “Who reviews this?”
The honest answer, for most teams, is nobody. Not meaningfully. A senior engineer opens the PR, scrolls through the first fifty files, skims the test coverage summary, and approves it. They have other PRs to review. They have their own work. They trust the person who submitted it. They trust the AI that generated it — or at least, they lack the time to distrust it rigorously.
This is not a failure of individual engineers. It is a structural failure. The review process was designed for a world where humans wrote code at human speed. In that world, a 500-file PR was a red flag — a sign that someone had gone too long without committing. In the new world, a 500-file PR is Tuesday. And the process that was designed to catch problems in 50-file PRs does not scale to ten times the volume.
Three Failure Modes
The confidence crisis manifests in three ways, each reinforcing the others.
Unreviewable volume. AI-generated PRs are large not because engineers are careless, but because AI generation does not naturally decompose work into reviewable units. The generator produces everything the task requires. The reviewer receives everything at once. There is no intermediate checkpoint. The result: reviews become perfunctory, rubber-stamping increases, and defects reach production undetected.
Ignored architecture decisions. Most engineering organizations have architecture decision records, coding conventions, and style guides. These documents represent real decisions made by experienced people. But they exist in documentation that nobody reads at generation time — not the AI, and often not the engineer prompting it. The AI does not know your team decided to use the repository pattern for data access. The AI does not know you banned direct database queries from controllers. The AI generates what it generates, and nobody checks compliance because nobody can review 500 files for architecture alignment.
Review fatigue. Engineers who are expected to review AI-generated code at scale burn out. Not because the code is bad, but because the volume exceeds what a human can meaningfully evaluate. The cognitive load of reviewing thousands of lines of generated code — most of it plausible, some of it subtly wrong — is qualitatively different from reviewing code a colleague wrote. There is no authorial intent to follow, no commit history that reveals the thought process, no conversation to reference. The reviewer is alone with the diff. The result is a slow erosion of trust in the entire merge-and-deploy pipeline. Teams either slow down — defeating the purpose of AI generation — or accept risk they cannot quantify.
The Compounding Loop
These three failure modes do not exist in isolation. They compound.
It starts upstream. When there are no governance constraints on how code is generated — no compiled architecture decisions, no enforced conventions, no specification to generate against — the AI receives ungoverned prompts. Ungoverned prompts produce ungoverned output. That ungoverned output becomes a specification for the next round of generation, each cycle drifting further from the team’s intended architecture.
Worse specs lead to worse code. Worse code leads to larger, more tangled PRs. Larger PRs lead to more superficial reviews. More superficial reviews let more drift through. And the drift accumulates silently — no single PR looks catastrophically wrong, but the codebase moves steadily away from the architecture the team thought it was building.
This loop has no natural feedback signal. There is no alarm that fires when architecture decisions are being ignored. There is no dashboard that tracks how far the generated code has drifted from the specification. The degradation is invisible until it surfaces in production: an outage, a security vulnerability, a data integrity failure, a performance cliff — the kind of incident that prompts a retrospective where someone asks, “How did we let this happen?”
The answer is always the same: one unreviewed PR at a time.
The Process Gap
It is tempting to frame this as a tooling problem. Build a better code review tool. Add AI-powered review assistants. Generate smaller PRs automatically. These are reasonable improvements, but they address the symptom — review overload — not the cause.
The cause is a process gap. The process that governs how software moves from idea to production has not adapted to the reality of AI-generated code. Teams are running a manual, single-reviewer, post-hoc review process against output that is generated at machine speed, machine scale, and machine scope. The mismatch is structural, and structural problems require structural solutions.
The answer is not to review code faster. The answer is to make code reviewable.
The Two Reviews
Traditional code review is a single-surface process. Code is written, a PR is opened, and a reviewer examines the output. The reviewer’s job is to evaluate everything at once: correctness, architecture compliance, security, performance, naming, test coverage, edge cases, and business logic — all by reading the diff.
This model worked when humans wrote code at human speed. A 200-line PR is reviewable. A 2,000-line PR is a stretch. A 50,000-line PR is fiction.
With AI-generated code, it is reality. And the single-surface model collapses under it.
The Reframe
The fix is not a faster reviewer or a smarter review tool. The fix is a different review model — one that recognizes that the work of evaluation happens across two distinct surfaces, at two different times, by two different groups of people.
Review the Intent. Before any code is generated, peers review the approach. What is being built? What are the requirements? What architecture decisions apply? What constraints does the generator need to follow? What does the specification say, and does the specification adequately capture what needs to be built?
This is the expensive review. It requires human judgment, domain expertise, and architectural knowledge. The participants are architects, senior engineers, and domain experts — the people who understand not just what the code should do, but why it should do it that way. They review the PRD, the technical specification, the execution plan, and the governance constraints that will bind the generator.
When this review is done well, the most consequential decisions have already been evaluated before a single line of code exists.
Review the Output. After code is generated, a second review verifies that the output conforms to the intent. This is a fundamentally different kind of review. It is not asking “Is this the right approach?” — that question was answered in the first review. It is asking “Did the generator follow the approach we agreed on?”
This question is partially automatable. Drift detection compares the generated code against the specification. Invariant compliance checks verify that hard constraints were not violated. Specialist domain checks (security, database, API design, infrastructure) examine the output through focused lenses. The human reviewer’s job shrinks to what humans are uniquely good at: evaluating edge cases, assessing naming and clarity, and confirming business logic correctness.
Why Two Surfaces
The traditional model asks one reviewer to do everything: evaluate the approach and verify the output. That was feasible when the output was small enough to serve as a proxy for the approach — you could infer the architectural intent from the code itself.
At AI-generation scale, you cannot. A 500-file PR does not reveal its architectural intent through reading. You cannot determine whether the generator followed the team’s conventions by scanning diffs. You cannot assess whether the specification was adequate by examining the output. The information is not in the code. It was in the decisions that preceded the code — and if nobody reviewed those decisions, they were never evaluated at all.
The two-review model makes this explicit. The intent is reviewed by the people best equipped to evaluate it, at the time when changes are cheapest. The output is reviewed with the intent as a reference, reducing the reviewer’s job from “understand everything” to “verify conformance.”
Where Human Judgment Goes
The principle underlying the two reviews is simple: move expensive human judgment upstream, move mechanical verification downstream.
Human judgment is scarce and slow. It is the bottleneck. Every hour a senior engineer spends reading generated code to infer architectural intent is an hour not spent evaluating the actual architecture. Every review cycle spent discovering that the specification was wrong is a cycle wasted — the code will be regenerated, re-reviewed, and re-merged.
Review the Intent is where human judgment has the highest leverage. A thirty-minute spec review by an architect prevents days of rework. A plan review that catches a missing database migration strategy prevents a production incident. A governance setup that enforces naming conventions prevents a thousand lines of inconsistency.
Review the Output is where automation has the highest leverage. Once the intent is fixed — specification approved, plan reviewed, governance compiled — the question “does the output match the intent?” is mechanical. Not trivial, but mechanical. It can be checked against a reference. It can be verified with rules. It can be confirmed by comparing the output to the specification it was supposed to implement.
The result: the PR review is no longer an investigation. It is a verification. The reviewer opens the pull request with the specification, the plan, and the conformance report already attached. Their job is not to understand the code from scratch. Their job is to confirm that what was agreed upon is what was delivered — and to apply human judgment only where automation cannot reach.
What Changes
This is not an incremental improvement to code review. It is a structural change in when, how, and by whom engineering work is evaluated.
In the traditional model, review is post-hoc. Everything is evaluated after the code exists. The reviewer bears the full cognitive burden.
In the two-review model, the most consequential evaluation happens before code exists. The cognitive burden is distributed: architects and domain experts evaluate the intent, automated checks verify conformance, and the PR reviewer focuses on the narrow set of concerns that require human eyes on generated output.
The Governed SDLC
The two-review model describes what to evaluate and when. The Governed SDLC describes how — the full, status-gated process that takes a task from inception to merge with confidence at every stage.
The process has six stages. Each stage has defined inputs, outputs, participants, and review criteria. Each stage contributes a specific layer of confidence. No stage is skipped. Progress is gated: the output of one stage becomes the input of the next, and a stage does not begin until the previous stage is complete.
The Six Stages
1. Govern
- Inputs: Architecture decisions, coding conventions, hard constraints, team standards
- Outputs: Compiled governance — enforceable rules, constraints, and checks
- Participants: Architects, tech leads
- Review: Are decisions recorded? Are they compiled into enforcement? Are they current?
This stage happens once and is maintained continuously. It is the foundation. The team records its architecture decisions, defines its invariants (constraints that must never be violated), and documents its conventions. Then — critically — these decisions are compiled into enforceable constraints that the generator and the CI pipeline will respect.
Governance without compilation is documentation. Documentation is necessary but not sufficient. The Govern stage produces enforcement, not just records.
Confidence contributed: “We have decided, and decisions are enforced.”
2. Specify
- Inputs: Product requirements (PRD), compiled governance, existing architecture
- Outputs: Technical specification with acceptance criteria, data models, API contracts
- Participants: Author (engineer or architect), reviewers (peers, domain experts)
- Review: Review the Intent — is the specification complete, testable, architecture-aligned, and security-aware?
This is the first review surface. The specification describes what will be built, how it will be built, and what constraints it must satisfy. Peer reviewers evaluate the approach, not the implementation. They check whether the specification captures all requirements, whether acceptance criteria are testable, whether the proposed design aligns with existing architecture decisions, and whether security and performance considerations are addressed.
The specification is the contract between the team and the generator. Everything generated downstream is measured against it.
Confidence contributed: “Peers reviewed the approach before generation.”
3. Plan
- Inputs: Accepted specification, compiled governance
- Outputs: Execution plan — phased breakdown with dependencies, specialist domain flags, risk identification
- Participants: Author, reviewers (tech leads, senior engineers)
- Review: Review the Intent — is the plan scoped correctly, are dependencies ordered, are specialist domains covered?
The plan translates the specification into an execution sequence. It identifies which parts of the work require specialist review (database migrations, security-sensitive changes, API surface modifications), what order work should proceed in, and where risks exist. It also defines the scope of each generation unit — how the total work is decomposed into pieces small enough to verify independently. The plan is reviewed for feasibility, sequencing, scope, and coverage — not for code.
Without a plan, generation happens in one pass, producing output that is difficult to verify in pieces. With a plan, each generation unit is scoped, sequenced, and reviewable.
Confidence contributed: “The execution path has been evaluated and sequenced.”
4. Generate
- Inputs: Accepted plan, compiled governance (rule files and constraints), specification
- Outputs: Implementation — code, tests, migrations, configuration
- Participants: Engineer (prompting the AI generator), the generator itself
- Review: The generator operates under compiled governance constraints. Rule files constrain behavior. The specification defines what to build. The plan defines the execution sequence.
Generation is the stage most teams focus on. In the Governed SDLC, it is the stage with the least human review burden — because the review already happened. The specification was approved. The plan was reviewed. Governance constraints are compiled and active. The generator is not working in a vacuum; it is working within a defined envelope.
Confidence contributed: “The generator was constrained by compiled decisions.”
5. Verify
- Inputs: Generated code, specification, plan, compiled governance
- Outputs: Verification results — drift reports, invariant compliance, specialist review findings
- Participants: Automated checks, specialist reviewers (database, security, API, infrastructure)
- Review: Review the Output — does the generated code match the specification? Were invariants respected? Do specialist domains pass?
This is the second review surface. Drift detection compares the generated output against the specification — every requirement, every acceptance criterion, every architectural constraint. Invariant compliance checks confirm that hard constraints were not violated — the global rules that apply to all code, regardless of the specific feature. Specialist reviewers examine the output through their domain lens — database, security, API, infrastructure, performance. The output of this stage is a set of findings — conformance confirmed, or deviations flagged — that becomes the primary input for the final merge review.
Confidence contributed: “Automated checks confirmed no drift from spec.”
6. Merge
- Inputs: Generated code, verification results, specification, plan
- Outputs: Merged code on the main branch
- Participants: PR reviewer (senior engineer or tech lead)
- Review: Final review — verification results attached, specification attached, reviewer confirms conformance and evaluates what automation cannot (edge cases, naming, business logic nuance)
The merge stage is where the traditional PR review lives — but it is a fundamentally different review than what most teams do today. The reviewer does not open a 500-file diff and try to understand it from scratch. The reviewer opens the PR with the specification, the plan, and the verification report already attached. Their job is to confirm that the output matches the intent and to apply human judgment where automation cannot reach.
Confidence contributed: “Every layer passed — this is as reviewed as hand-written code.”
The Confidence Ladder
Each stage of the Governed SDLC contributes a layer of confidence. Together, they form the Confidence Ladder — a cumulative model where each rung builds on the one below it.
| Stage | Confidence Contributed |
|---|---|
| Governance Setup | “We have decided, and decisions are enforced” |
| Planning & Spec | “Peers reviewed the approach before generation” |
| Generation | “The generator was constrained by compiled decisions” |
| Post-Generation | “Automated checks confirmed no drift from spec” |
| Merge & Deploy | “Every layer passed — this is as reviewed as hand-written code” |
At the base, confidence comes from the existence and enforcement of decisions. At the top, confidence comes from every preceding layer having been satisfied. The merge reviewer is not the sole source of quality assurance. They are the final confirmation in a chain that started with governance setup.
A team that skips stages — generating without a spec, merging without verification — removes rungs from the ladder. The confidence at merge time is only as strong as the weakest stage that was actually completed.
Where the Two Reviews Occur
The Governed SDLC makes the two-review model concrete:
- Review the Intent happens at the Specify and Plan stages (stages 2 and 3). Peers review the specification and the plan before any code is generated. This is where architects, domain experts, and senior engineers invest their judgment.
- Review the Output happens at the Verify and Merge stages (stages 5 and 6). Automated checks and specialist reviewers verify conformance. The PR reviewer confirms what automation cannot.
The Generate stage (stage 4) sits between the two reviews. It is the stage where the reviewed intent is transformed into output. The quality of that output is a function of how well the intent was reviewed and how tightly the generator was constrained.
The Governance Flywheel
Governance is not a one-time setup. It is a living system that improves with use.
Every implementation surfaces new decisions. A team building a feature discovers an edge case the architecture decisions did not anticipate. A security review reveals a pattern that needs to be codified as an invariant. A specification reviewer notices a convention that was never recorded but everyone follows implicitly.
These discoveries feed back into the Govern stage. New decisions are recorded. New constraints are compiled. The next generation cycle operates under tighter, more accurate governance. Each pass through the SDLC makes the next pass more governed — not because someone mandated it, but because the process itself surfaces what needs to be captured.
This is the governance flywheel: implementation surfaces decisions, decisions are compiled into enforcement, enforcement constrains the next generation, and that generation surfaces the next set of decisions.
The flywheel creates a tension that teams must manage deliberately. Prescriptive governance locks down decisions early and constrains the generator tightly. It reduces variance but can slow adaptation. Adaptive governance keeps the envelope wider, allowing implementation to surface decisions before codifying them. It preserves flexibility but accepts more variance in generated output.
Neither extreme is correct. Teams move along this spectrum as they mature. Early-stage governance is necessarily more adaptive — the team is still discovering its conventions. Mature governance is more prescriptive — the team has codified most of its decisions and the generator operates in a well-defined envelope. The art is knowing where you are on the spectrum and adjusting deliberately, not defaulting to one extreme because it is easier.
Specialist Review Domains
Not all output can be verified by a single generalist reviewer. Certain categories of change require domain-specific expertise. The Governed SDLC identifies specialist review domains and defines when each is triggered.
Database. Triggered when changes include schema modifications, migrations, query patterns, or index changes. A database specialist reviews for migration safety (data loss risk, locking behavior, rollback strategy), query performance (missing indexes, N+1 patterns, unbounded queries), and data integrity (constraint correctness, referential integrity, default values).
Security. Triggered when changes touch authentication, authorization, data handling, external integrations, or cryptographic operations. A security specialist reviews for access control correctness, input validation, secret management, and compliance with the team’s security invariants.
API. Triggered when changes modify public-facing interfaces — REST endpoints, GraphQL schemas, event contracts, or SDK surfaces. An API specialist reviews for backward compatibility, versioning strategy, contract completeness, and documentation accuracy.
Infrastructure. Triggered when changes affect deployment configuration, environment management, scaling behavior, or service dependencies. An infrastructure specialist reviews for deployment safety, resource sizing, failure modes, and environment parity.
Performance. Triggered when changes affect hot paths, data-intensive operations, caching strategies, or resource-constrained components. A performance specialist reviews for algorithmic complexity, memory allocation patterns, concurrency behavior, and load characteristics.
Architecture. Triggered when changes introduce new patterns, dependencies, or structural decisions that are not covered by existing governance. An architecture specialist reviews for alignment with existing decisions, identifies decisions that need to be captured, and flags structural drift that automated checks would not detect.
These domains are not exhaustive — teams add domains that match their risk profile. A team with heavy machine learning workloads adds a model evaluation domain. A team with strict regulatory requirements adds a compliance domain. The principle is consistent: the right expert reviews at the right time, triggered by the nature of the change, not by manual assignment.
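The triggering mechanism can be as simple as a map from changed-file patterns to specialist domains, evaluated automatically against each changeset. A minimal sketch in Python — the domain names, path patterns, and `triggered_domains` helper are illustrative assumptions, not a prescribed format:

```python
import fnmatch

# Hypothetical mapping from changed-file patterns to specialist review
# domains. Teams would tune both the patterns and the domain list to
# their own repository layout and risk profile.
DOMAIN_TRIGGERS = {
    "database": ["db/migrations/*", "**/models/*.py", "**/*_schema.sql"],
    "security": ["**/auth/**", "**/crypto/**"],
    "api": ["api/**", "**/openapi.yaml"],
    "infrastructure": ["deploy/**", "Dockerfile", "**/*.tf"],
}

def triggered_domains(changed_files):
    """Return the specialist domains a changeset triggers."""
    domains = set()
    for domain, patterns in DOMAIN_TRIGGERS.items():
        for path in changed_files:
            if any(fnmatch.fnmatch(path, p) for p in patterns):
                domains.add(domain)
                break
    return sorted(domains)

# A PR touching a migration and an auth module triggers database and
# security review with no manual assignment.
print(triggered_domains(["db/migrations/0042_add_index.sql",
                         "app/auth/session.py"]))
# → ['database', 'security']
```

In practice this lookup would run in CI on the PR's file list, posting the triggered domains as required reviewers.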
Compilation — From Decisions to Enforcement
Every engineering organization has decisions. Architecture decision records. Coding conventions. Style guides. Security policies. Database naming standards. API versioning rules.
Most of these decisions exist as documents. They were written thoughtfully, reviewed carefully, and approved formally. Then they were filed somewhere — a wiki, a docs folder, a knowledge base — and gradually forgotten by the humans and entirely unknown to the AI generators that now write most of the code.
This is the gap between documentation and governance. Recording a decision is not the same as enforcing it. An architecture decision record that says “all API endpoints must include rate limiting” does nothing if the generator has never seen it and the reviewer cannot verify it across 500 files. The decision exists. The enforcement does not. And without enforcement, the decision is aspirational — a wish, not a rule.
Architecture decision records without compilation are documentation, not governance.
Compilation bridges this gap. It is the act of transforming recorded decisions into enforceable constraints — things the generator is forced to follow and the verification pipeline can check.
The Compilation Principle
Decisions that are only documented are suggestions. Decisions that are compiled are governance.
Compilation means different things depending on the type of decision and the stage at which it operates. The principle is consistent: every decision must exist not only as a record of what was decided, but as a mechanism that makes violation detectable or, better, impossible.
The distinction matters because it changes the failure mode. When a decision is only documented, the failure is silent: the generator ignores the decision, the code passes review, and the violation is discovered months later — or never. When a decision is compiled, the failure is loud: the check fires, the violation is flagged, and the team knows immediately. Silent failures compound. Loud failures get fixed.
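To make the principle concrete, the rate-limiting decision mentioned earlier can be compiled into a detectable check. This sketch assumes, purely for illustration, that endpoints are declared with a `@route` decorator and rate limiting with a `@rate_limited` decorator; the helper and naming are hypothetical:

```python
import re

# Compiled form of the documented decision:
# "All API endpoints must include rate limiting."
ROUTE = re.compile(r"@route\b")
RATE_LIMIT = re.compile(r"@rate_limited\b")

def check_rate_limiting(source: str) -> list[int]:
    """Return line numbers of endpoint definitions missing rate limiting."""
    violations = []
    decorators = []  # decorator lines stacked above the next def
    for lineno, line in enumerate(source.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("@"):
            decorators.append(stripped)
        elif stripped.startswith("def "):
            if any(ROUTE.match(d) for d in decorators) and \
               not any(RATE_LIMIT.match(d) for d in decorators):
                violations.append(lineno)
            decorators = []
        elif stripped:
            decorators = []  # any other code breaks the decorator stack
    return violations

sample = (
    '@route("/users")\n'
    '@rate_limited(per_minute=100)\n'
    "def list_users(): ...\n"
    "\n"
    '@route("/orders")\n'
    "def list_orders(): ...\n"
)
print(check_rate_limiting(sample))  # → [6]: list_orders lacks rate limiting
```

The decision now fails loudly: the check fires on the violating line instead of the omission surviving unnoticed in a 500-file diff.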
Four patterns of compilation apply across any development process, regardless of toolchain:
Prompt-time constraints. The generator receives compiled decisions as part of its operating context. Conventions, patterns, and architectural boundaries are expressed as rules that shape the generator’s behavior at the moment of generation. The generator does not need to “remember” decisions — they are provided as constraints every time it operates.
Pre-merge gates. Before code enters the main branch, automated checks verify compliance with specific decisions. These gates block merging when violations are detected. They operate on the output, not the generator — any code, from any source, is subject to the same checks. The decisions that define what the gates enforce are the compiled form of the team’s architecture records and invariants.
Generation-time checks. As code is being generated, real-time validation identifies violations before the generation is complete. This is the tightest compilation loop — decisions constrain the output as it is produced, rather than waiting for a post-generation scan.
Post-generation scans. After generation is complete, comprehensive scans compare the output against the full set of compiled constraints. These scans catch what prompt-time constraints and generation-time checks may have missed. They produce a conformance report that becomes an input to the Verify stage of the Governed SDLC.
These four patterns are not mutually exclusive. Mature governance uses all four: the generator is constrained, the output is checked in real-time, pre-merge gates block violations, and post-generation scans catch edge cases. Each layer reduces the set of violations that reach the next layer.
The order of adoption matters. Most teams start with post-generation scans because they are the easiest to implement — they operate on output that already exists and do not require changing the generation process. Pre-merge gates follow, because they integrate into existing CI pipelines. Prompt-time constraints come next, requiring changes to how the generator is invoked. Generation-time checks are the most sophisticated and typically the last to be adopted. Each layer is independently valuable. Teams do not need all four to start seeing results — they need to start with one and add layers as their governance matures.
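A pre-merge gate — the usual second layer to adopt — can be sketched as a small script CI runs against the changed files, failing the build on any violation. The check registry and the example `no_print_statements` rule are hypothetical; each compiled decision would contribute one entry:

```python
# Minimal sketch of a pre-merge gate. Each compiled decision becomes a
# check function; the gate runs every check over the changed files and
# blocks the merge when any finding is produced.

def no_print_statements(path: str, text: str) -> list[str]:
    """Example compiled decision: no bare print() calls in production code."""
    return [
        f"{path}:{i}: print() call in production code"
        for i, line in enumerate(text.splitlines(), start=1)
        if line.lstrip().startswith("print(")
    ]

COMPILED_CHECKS = [no_print_statements]  # one entry per compiled decision

def gate(changed_files: dict[str, str]) -> list[str]:
    """Run every compiled check against every changed file."""
    findings = []
    for path, text in changed_files.items():
        for check in COMPILED_CHECKS:
            findings.extend(check(path, text))
    return findings

# In CI the exit code decides: sys.exit(1 if findings else 0).
findings = gate({"app/service.py": "x = 1\nprint('debug')\n"})
print(findings)
# → ["app/service.py:2: print() call in production code"]
```

Because the gate operates on output rather than on the generator, the same checks apply to hand-written and AI-generated code alike.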
Output Validation
Even with compilation, generated code sometimes violates the constraints it was supposed to follow. The generator may misinterpret a rule, encounter a conflict between two constraints, or simply produce output that no rule anticipated.
Output validation is the practice of detecting these violations after generation, before merge. Two methods apply regardless of toolchain:
Specification-based drift detection. The generated output is compared against the specification it was supposed to implement. For every requirement, acceptance criterion, and architectural constraint in the specification, the validation asks: “Is this present in the output? Is it implemented correctly? Does it match the agreed-upon approach?” Drift detection does not require reading every line of code. It requires checking every line of the specification against the output.
Invariant compliance scanning. The team’s hard constraints — invariants that must never be violated — are checked exhaustively against the generated output. Unlike drift detection (which checks against a specific specification), invariant scanning checks against the global set of rules that apply to all code. If the team has an invariant that says “no direct database access from controller layers,” every file is checked, regardless of what the specification says.
These two methods complement each other. Drift detection is specification-specific: it catches cases where the generator deviated from the plan. Invariant scanning is global: it catches cases where the generator violated constraints that apply everywhere.
When violations are found, the process is not to patch the output manually. It is to fix the constraint (if the constraint was wrong), update the specification (if the specification was incomplete), or regenerate the violating code under tighter constraints. The goal is a clean generation, not a manually corrected generation.
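An invariant compliance scan for the controller-layer example above might look like the following sketch. The `/controllers/` path convention and the banned call names are assumptions for illustration; a real scan would read the compiled invariant set rather than hard-code one rule:

```python
import ast

# Global invariant: "no direct database access from controller layers."
BANNED_DB_CALLS = {"execute", "raw_query"}

def scan_invariant(path: str, source: str) -> list[str]:
    """Scan one file; return invariant violations with line numbers."""
    if "/controllers/" not in path:
        return []  # this invariant scopes to controller code only
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            # Handle both db.execute(...) and bare execute(...) calls.
            name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", "")
            if name in BANNED_DB_CALLS:
                violations.append(f"{path}:{node.lineno}: direct DB call '{name}()'")
    return violations

controller = "def index():\n    return db.execute('SELECT * FROM users')\n"
print(scan_invariant("app/controllers/users.py", controller))
# → ["app/controllers/users.py:2: direct DB call 'execute()'"]
print(scan_invariant("app/repositories/users.py", controller))  # → []
```

Unlike drift detection, this scan needs no specification as input: it runs against every file in every generation unit, which is what makes it global.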
Scoping Guidance
Compilation and the Governed SDLC do not eliminate the problem of scale — they manage it. But the size of each generation unit still matters. A single generation pass that produces 500 files is harder to verify than five generation passes that produce 100 files each, even with identical governance constraints.
Scoping is the practice of sizing generation units for reviewability. Three criteria guide the decision:
Domain boundary. Each generation unit should correspond to a single domain or bounded context. If a feature touches the data layer, the API layer, and the presentation layer, generating all three in a single pass mixes concerns. Splitting generation along domain boundaries means each unit can be reviewed by the appropriate specialist and verified against a focused subset of the specification.
Dependency isolation. A generation unit should minimize dependencies on code that has not yet been generated or verified. If Unit B depends on interfaces defined in Unit A, generate and verify Unit A first. This is not always possible — circular dependencies exist — but the plan stage identifies these dependencies and sequences generation accordingly.
Verification tractability. A generation unit should be small enough that drift detection and invariant scanning can produce a meaningful report. If the unit is so large that the verification report is itself unreviewable, the unit is too large. A practical heuristic: if a specialist reviewer cannot read the verification report for a single unit in under fifteen minutes, the unit needs to be split.
These criteria are applied at the Plan stage, not the Generate stage. By the time generation begins, the scope of each unit is already defined and reviewed. Scoping is a planning decision, not a generation decision.
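Mechanically, the sequencing described above is a topological sort of generation units over their declared dependencies. A minimal sketch, using Python's standard library and hypothetical unit names, of how a plan stage might order units and surface circular dependencies before generation begins:

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical generation units mapped to their prerequisites.
plan = {
    "data-layer": set(),
    "api-layer": {"data-layer"},
    "presentation": {"api-layer"},
    "tests": {"api-layer", "data-layer"},
}

try:
    order = list(TopologicalSorter(plan).static_order())
except CycleError as err:
    # A cycle means two units must be split differently or generated
    # together; surface it at Plan time, not Generate time.
    raise SystemExit(f"Circular dependency in plan: {err.args[1]}")

# Every unit is generated and verified only after its prerequisites.
```

The point is not the algorithm, which is trivial, but where it runs: ordering is decided and reviewed at the Plan stage, so generation never starts against unverified dependencies.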
Teams often resist scoping because it feels like overhead — breaking one generation pass into five feels slower. It is not. Five scoped generation passes with focused verification are faster to deliver with confidence than one massive pass with an unreviewable verification report. The time saved in review more than compensates for the time spent in planning. Scoping is not a tax on speed. It is an investment in reviewability.
The Maturity Model
Not every team starts at the same place. Some teams adopted AI-generated code yesterday. Others have been using it for months with no governance at all. A few have pieces of the Governed SDLC in place but do not know what is missing.
This maturity model is a self-assessment tool. It describes five levels of governance maturity, from no governance to a fully adaptive system. Each level has observable characteristics, identifiable risks, and concrete process changes that move a team to the next level.
Most teams reading this guide are at Level 0 or Level 1. That is not a judgment — it is the industry baseline. Nearly every team that has adopted AI code generation has done so faster than their process could adapt. The question is not where you are. The question is what to adopt next.
Each level builds on the one below it. Skipping levels creates gaps — enforcement without documentation means the rules exist but nobody remembers why; governance without enforcement means the process is followed on trust alone. Move through the levels in order.
| Level | Name | Description |
|---|---|---|
| 0 | Unstructured | AI generates code with no governance. PRs are reviewed by exhausted humans. |
| 1 | Documented | Architecture decisions and conventions exist, but are not enforced. AI may or may not follow them. |
| 2 | Enforced | Decisions are compiled into enforcement. The generator is constrained. |
| 3 | Governed | Full Governed SDLC in operation. Review the Intent before generation. Review the Output after. |
| 4 | Adaptive | Governance flywheel — implementation surfaces new decisions that feed back into enforcement. |
Level 0: Unstructured
Characteristics. Engineers prompt AI generators directly from task descriptions, tickets, or verbal instructions. There are no specifications, no compiled constraints, and no formal review process beyond standard code review. PRs are opened, reviewed (or rubber-stamped), and merged. The team trusts individual engineers to make sound decisions and trusts the generator to produce reasonable output.
Risks of staying. Architecture drift is invisible and accelerating. Each generated PR moves the codebase further from any coherent architecture, but no single PR looks catastrophically wrong. Security and performance issues accumulate in code that was never reviewed against standards that were never compiled. The first sign of trouble is a production incident, not a review finding.
How to level up. Start recording decisions. Capture the architecture decisions the team already follows implicitly — the patterns, the conventions, the “everyone knows” rules. Write them down. This does not require a formal process. It requires a folder, a template, and a commitment to writing decisions down as they are made. The goal is not perfection. The goal is getting from unwritten tribal knowledge to written records.
Level 1: Documented
Characteristics. The team has architecture decision records, coding conventions, and possibly style guides. These documents exist in a known location. Some engineers read them. The AI generator does not — or if it does, only when an engineer manually includes them in a prompt. There is no enforcement mechanism. Compliance depends on individual memory and reviewer diligence.
Risks of staying. Documentation without enforcement creates a false sense of governance. The team believes it has standards because the documents exist. But the documents are not enforced, so violations go undetected. Over time, the codebase diverges from the documented decisions, and the documents themselves become stale — they describe what the team intended, not what the codebase actually does. New engineers (and AI generators) follow the code, not the docs, perpetuating the divergence.
How to level up. Compile decisions into enforcement. Take the recorded decisions and transform them into constraints the generator and the pipeline can act on. Start with the highest-impact decisions: the invariants that must never be violated and the conventions that apply to every file. Express these as rules the generator receives at prompt time and as checks the pipeline runs at merge time. This is the Compilation step described in the previous section. It does not require compiling everything at once — start with five rules and expand.
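To make "start with five rules" concrete: one lightweight approach is to keep each compiled decision as data, then render it both ways — as prompt-time context for the generator and as a list of checks the merge gate must run. A hedged sketch; the rule ids, statements, and rendering format below are illustrative, not a standard:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    id: str
    statement: str    # the decision, stated as a constraint
    enforced_at: str  # "prompt", "merge", or "both"

# Illustrative rules only; real ones are compiled from your decision records.
RULES = [
    Rule("ADR-012", "All database access goes through the repository layer.", "both"),
    Rule("CONV-03", "Every public endpoint requires the auth decorator.", "both"),
    Rule("SEC-01", "No secrets in source; read them from the environment.", "merge"),
]

def prompt_block(rules: list[Rule]) -> str:
    """Render the constraints the generator receives at prompt time."""
    lines = [f"[{r.id}] {r.statement}"
             for r in rules if r.enforced_at in ("prompt", "both")]
    return "You must follow these compiled decisions:\n" + "\n".join(lines)

def merge_gate_ids(rules: list[Rule]) -> list[str]:
    """Rule ids the pipeline must verify before merge."""
    return [r.id for r in rules if r.enforced_at in ("merge", "both")]
```

Keeping the rule as a single record means the prompt context and the merge gate can never drift apart — both are generated from the same source.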
Level 2: Enforced
Characteristics. The team’s most important decisions are compiled into enforceable constraints. The generator operates under rule files that encode conventions and architectural boundaries. Pre-merge gates check for invariant violations. Code that violates compiled decisions is flagged or blocked before it reaches the main branch. Engineers still generate code from task descriptions or tickets, but the generator is constrained.
Risks of staying. Enforcement without Review the Intent means the team is catching violations after generation, not preventing them. The generator produces code, the checks find problems, the code is regenerated or patched, and the cycle repeats. There is no specification to generate against, so “correctness” is defined only by the compiled rules — not by a peer-reviewed intent. Complex features are generated without an agreed-upon approach, leading to implementation debates at PR review time that should have happened before generation.
How to level up. Add specifications and Review the Intent. Before generating code for any non-trivial task, write a specification that describes what will be built and how. Have peers review the specification before generation begins. This is the Specify stage of the Governed SDLC — Review the Intent. It does not need to be a heavyweight process. A specification can be a single page. The review can be a thirty-minute conversation. The point is that the approach is evaluated before code exists, not after.
Level 3: Governed
Characteristics. The full Governed SDLC is in operation. Non-trivial tasks follow the chain: Govern, Specify, Plan, Generate, Verify, Merge. Specifications are peer-reviewed before generation. Plans sequence the work and identify specialist review domains. Generated code is verified against the specification for drift. PR reviews include the specification, plan, and verification reports as attached context. The team reviews the intent upstream and the output downstream.
Risks of staying. A governed process that does not evolve becomes rigid. The team’s decisions reflect the architecture as it was when governance was set up, not as it is now. New patterns emerge from implementation that are never captured. Edge cases surface that the invariants do not cover. The governance is correct but incomplete, and the gap between what is governed and what exists widens with each feature.
How to level up. Close the feedback loop. After each implementation cycle, conduct a governance sweep: did this implementation surface any decisions that are not captured? Any conventions that should be codified? Any invariants that need to be added? Feed these findings back into the Govern stage. This is the governance flywheel. It does not require a formal retrospective — it requires a deliberate habit of asking “what did we learn that our governance does not yet reflect?”
Level 4: Adaptive
Characteristics. The governance flywheel is active. Every implementation cycle surfaces new decisions, and those decisions are compiled into enforcement before the next cycle. The team actively manages the tension between prescriptive governance (locking down decisions) and adaptive governance (keeping the envelope open for discovery). Governance constraints are versioned, and the team tracks how the constraint set evolves over time. Drift detection runs not only against specifications but against the governance set itself — has the governance drifted from what the codebase actually does?
Risks of staying. Level 4 is not a destination; it is a practice. The risk is complacency — assuming the flywheel turns automatically. It does not. Without deliberate attention, the governance sweep becomes perfunctory, new decisions stop being captured, and the system decays toward Level 3. Maintaining Level 4 requires sustained discipline: someone owns the governance sweep, the results are visible, and the compiled constraints are actively maintained.
How to level up. There is no Level 5. Level 4 is the practice of continuous governance improvement. The goal is not to reach a final state but to maintain a system that improves itself with every cycle. Teams at Level 4 focus on the quality and coverage of their governance, the accuracy of their drift detection, and the efficiency of their compilation pipeline. The work is never done — it is designed not to be.
Reviewing at Scale — Why You Don’t Need to Read 500 Files
Return to the scenario from the beginning of this guide. A pull request lands with 500 changed files. A new service, fully generated: models, controllers, API endpoints, database migrations, tests, configuration. The PR sits in the review queue.
In the world before the Governed SDLC, this PR is a wall. The reviewer opens it, sees the file count, and faces an impossible task: understand the intent, verify the architecture, check for security issues, confirm test coverage, validate naming conventions, and assess business logic correctness — all by reading diffs. They cannot. Nobody can. The PR is approved with a cursory scan or it blocks the pipeline for days while a reviewer attempts a line-by-line pass they will never finish.
In the Governed SDLC, this PR arrives differently.
What Already Happened
Before the reviewer opens the PR, five stages of the Governed SDLC have already completed.
The team’s architecture decisions, conventions, and invariants were compiled into enforcement during the Govern stage. The generator operated under these constraints — not because an engineer remembered to include them, but because they were compiled into the generator’s operating context.
A technical specification was written, describing what the service does, how it is structured, what patterns it follows, and what constraints it must satisfy. That specification was peer-reviewed during the Specify stage. Architects and domain experts evaluated the approach. By the time the specification was approved, the most consequential decisions had already been reviewed.
An execution plan was written, breaking the work into phased generation units along domain boundaries. The plan was reviewed during the Plan stage. Dependencies were sequenced. Specialist review domains were flagged.
The code was generated during the Generate stage under the compiled governance constraints, against the approved specification, following the reviewed plan.
The generated code was verified during the Verify stage. Drift detection compared the output against every requirement and acceptance criterion in the specification. Invariant compliance scanning checked the output against the global constraint set. Specialist reviewers examined changes in their domains — database migrations reviewed by a database specialist, security-sensitive changes reviewed by a security specialist.
What the Reviewer Actually Does
The PR arrives with context attached: the specification, the plan, the drift detection report, the invariant compliance results, and any specialist review findings.
The reviewer does not need to understand the code from scratch. The intent is documented in the specification — which they may have already reviewed during the Specify stage. The execution approach is documented in the plan. The conformance between intent and output is documented in the drift report.
The reviewer’s job is verification, not investigation. They are confirming that a governed process produced the expected result — not discovering what was built or why.
Check the drift report. Did the generated code match the specification? If drift detection found zero deviations, the code implements what the team agreed to build. If deviations were found, they are listed specifically — the reviewer examines the flagged items, not the entire diff.
Check the invariant compliance results. Were any hard constraints violated? If the scan is clean, the code respects every invariant the team has defined. If violations were found, they are listed specifically.
Check the specialist review findings. Did the database specialist flag migration risks? Did the security specialist identify access control gaps? These findings are already attached. The reviewer reads the findings, not the raw code.
Apply human judgment where automation cannot reach. After the automated checks and specialist reviews, a narrow set of concerns remains that requires human eyes: edge case handling the specification did not anticipate, naming clarity that no rule can evaluate, business logic subtlety that requires domain knowledge, and user experience implications that require product understanding.
This is what the reviewer reads. Not 500 files. The specification, the reports, the findings, and the targeted areas where human judgment adds value.
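The reviewer's pass over these artifacts can itself be expressed as a simple rule: when the attached reports are clean, only the judgment areas need reading; when they are not, the flagged items are the entire reading list. A sketch of that logic, assuming a hypothetical structure for the attached verification bundle:

```python
from dataclasses import dataclass, field

@dataclass
class VerificationBundle:
    """Artifacts attached to a governed PR (structure is illustrative)."""
    drift_deviations: list[str] = field(default_factory=list)
    invariant_violations: list[str] = field(default_factory=list)
    specialist_findings: list[str] = field(default_factory=list)

def reading_list(bundle: VerificationBundle) -> list[str]:
    """What the human reviewer actually reads: flagged items only."""
    return (bundle.drift_deviations
            + bundle.invariant_violations
            + bundle.specialist_findings)

# A clean bundle leaves only the judgment-only concerns: edge cases,
# naming, business logic nuance.
assert reading_list(VerificationBundle()) == []
```

The file count of the PR never appears in this logic, which is the point: review effort scales with findings, not with diff size.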
The Math of Reviewing the Intent
The reason this works is arithmetic, not magic.
Eighty percent of the review happened at specification time. When peers review the specification, they review the architecture, the data model, the API design, the constraint compliance, and the approach. These are the decisions that determine whether the code is correct in the ways that matter most. A thirty-minute specification review by an architect covers more ground than a two-hour line-by-line code review.
Fifteen percent of the review is automated. Drift detection, invariant scanning, and specialist domain checks are mechanical. They compare output against reference. They do not require human judgment. They run in minutes, not hours.
Five percent of the review requires human eyes on generated code. Edge cases. Naming. Business logic nuance. The reviewer focuses their attention here — on the narrow set of concerns where human judgment is irreplaceable — instead of distributing that attention across 500 files where most of it is wasted.
The total review time may be similar. But the quality of the review is incomparably higher. Instead of a superficial pass across everything, the team invests deep review at each layer: architects review the intent, automation verifies conformance, and the PR reviewer applies focused judgment.
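With illustrative numbers (assumptions for the sake of arithmetic, not measurements), the comparison looks like this — similar total hours, concentrated where judgment matters instead of spread thin across every file:

```python
# Illustrative comparison; every number here is an assumption.
files = 500
old_review_hours = 4.0
per_file_minutes_old = old_review_hours * 60 / files   # attention per file, old process

# Governed split: ~80% at spec time, ~15% automated, ~5% human eyes on code.
spec_review_hours = 0.5      # architect reviews the specification
judgment_files = 25          # ~5% of files flagged for human judgment
per_file_minutes_new = 3.0   # deep attention on each flagged file

human_hours_new = spec_review_hours + judgment_files * per_file_minutes_new / 60

print(f"old: {per_file_minutes_old:.1f} min of attention per file, all {files} files")
print(f"new: {human_hours_new:.2f} human hours, {per_file_minutes_new:.0f} min per flagged file")
```

Under these assumptions the old process gives each file under thirty seconds of attention; the governed process gives each flagged file several minutes, plus an automated pass over everything else.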
What Changes for the Reviewer
The PR review is no longer the first and only quality gate. It is the last in a series.
The reviewer no longer asks “Is this the right approach?” — that was answered at spec time. They no longer ask “Does this follow our conventions?” — that was enforced at generation time and verified post-generation. They no longer ask “Is this safe to deploy?” as an open-ended question — specialist reviewers have already examined their domains.
The reviewer asks: “Given that the intent was reviewed, the generation was constrained, and the output was verified — does anything remain that requires my judgment?”
This is a question a human can answer, even for 500 files.
The New Standard
This is not an optimization of the old review process. It is a replacement.
The old process concentrated all quality assurance in a single moment: the PR review. One reviewer, one pass, one chance to catch everything. It worked when code was written slowly and PRs were small. It does not work when code is generated at machine speed and PRs span hundreds of files.
The Governed SDLC distributes quality assurance across the entire lifecycle. Governance ensures decisions are enforced. Specification review ensures the approach is sound. Plan review ensures the execution is sequenced. Generation constraints ensure the output is bounded. Verification ensures conformance. And the PR review — the final stage — confirms that every preceding layer did its job.
The 500-file PR is no longer a wall. It is evidence that a governed process produced a complete implementation — one whose intent was reviewed, whose generation was constrained, whose output was verified, and whose remaining review burden is scoped to the narrow set of concerns that require human judgment.
This is what it means to deliver with confidence. Not confidence that a single reviewer caught everything. Confidence that the process did.
Addendums
Standalone checklists and worksheets designed to be used independently of this guide. Print them, attach them to PRs, or adapt them to your existing tools.
- Governance Setup Checklist — Step-by-step actions for teams starting governance from zero.
- Spec Review Checklist — What to verify when peer-reviewing a technical specification before code generation.
- Plan Review Checklist — What to verify when reviewing an execution plan.
- PR Review Checklist — How to review a PR of AI-generated code, including which artifacts to attach.
- Maturity Self-Assessment Worksheet — Determine your team’s current maturity level and identify next actions.