BETA: Concordance is in early access. GitHub + Linear scanning is live — more integrations coming soon.
OPEN FRAMEWORK

AI Made Your Developers 10x Faster.
Did Your Engineering Practices Keep Up?

An open framework for measuring what actually matters in software engineering. 50 standards. 6 phases. Scored on evidence from your actual tools.

We Have a Measurement Problem

If you run an engineering org right now, you're getting asked two things at once: "Are we using AI?" and "How healthy is our engineering?" The first question is easy. The second one is where things fall apart.

DORA metrics are the closest thing we have to a standard. Deployment frequency, lead time, change failure rate, MTTR. They're useful for what they measure. But they were designed to answer a specific question about delivery throughput, and that question isn't the one engineering leaders are being asked anymore. The 2025 DORA Report found that AI adoption is actually increasing delivery instability even as individual developer velocity goes up. Teams are moving faster and breaking more things. DORA will tell you the "faster" part. It won't tell you why things are breaking.

Then McKinsey published their developer productivity framework, and the pushback was immediate and widespread. Kent Beck called it "absurdly naive." Gergely Orosz wrote a two-part takedown. Dan North did a line-by-line review. The core problem: McKinsey reduced software engineering to "time spent coding" and missed that the hard part of building software was never the typing.

On the other end you've got Gartner selling maturity assessments as consulting engagements. These are survey-based. You tell a consultant how mature you think you are, they put it in a proprietary model you can't see, and three months later you get a PDF with a score and some recommendations. It works for large enterprises with budget to burn, but it's a snapshot that starts going stale the moment it's delivered.

"The reality is engineering leaders cannot avoid the question of how to measure developer productivity, today. If they try to, they risk the CEO or CFO turning instead to McKinsey."

— Gergely Orosz & Kent Beck

That quote stuck with me. The demand for engineering measurement isn't going away. If engineering leaders don't provide something constructive, someone else will impose something destructive. So I started thinking about what a useful answer would actually look like.

The AI Amplifier Effect

Here's what the 2025 DORA data showed that I think matters most: AI amplifies whatever's already there. Teams with strong engineering foundations used AI to get even better. Teams with weak foundations just produced more problems, faster. The AI didn't fix anything. It accelerated everything, including the dysfunction.

I've seen this play out firsthand. A team with solid code review culture adopts Copilot and their throughput jumps because reviews catch the AI's blind spots quickly. A team with rubber-stamp reviews adopts the same tools and now they're merging AI-generated code that nobody actually read. Same tool. Wildly different outcomes. The difference was the practices around the code, not the code itself.

There's a more practical angle too. AI lets smaller teams punch above their weight. Five engineers can cover what used to take fifteen. But that means each person is more critical, institutional knowledge is more concentrated, and the impact of someone leaving is bigger. Documentation, code ownership, runbooks, operational processes — all of that is higher stakes with a lean team.

The teams that got CI/CD right early won the last decade. I think the teams that build engineering maturity around AI adoption are the ones that will win this one.

Practices, Not People

I want to be clear about what this framework does and doesn't do. It does not measure individual developer productivity. That debate has been settled. Every serious researcher has landed in the same place: individual metrics create perverse incentives and corrode team culture. We're not going there.

What we measure is engineering practice maturity. The organizational habits that determine whether a team can ship reliably regardless of how fast they're moving. Is main protected? Do PRs get meaningful reviews? Is CI gating merges? Are incidents followed by postmortems? Does someone own each service?

You can have great DORA numbers with terrible security scanning, zero documentation, no incident response process, and no tech debt tracking. I've seen it. Velocity metrics won't catch it. A practice maturity assessment will.

The other thing that matters: these scores come from observable evidence in your existing tools. Branch protection settings. CI workflow configurations. Commit history. Release tags. Issue labels. The data is already sitting there in your GitHub, your Jira, your PagerDuty. Nobody's been reading it systematically until now.
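To make "observable evidence" concrete, here's a minimal sketch of one such check: grading commit-message convention from recent commit subjects. This is a hypothetical illustration, not Concordance's actual implementation, and the thresholds are invented for the example, not the framework's official cutoffs.

```python
import re

# Conventional Commits subject line, e.g. "feat(api): add retry logic".
# The pattern and type list here are illustrative.
CONVENTIONAL = re.compile(
    r"^(feat|fix|docs|style|refactor|perf|test|build|ci|chore|revert)"
    r"(\([\w.-]+\))?!?: .+"
)

def commit_convention_score(subjects: list[str]) -> int:
    """Map the share of convention-following commit subjects to a 1-5 score."""
    if not subjects:
        return 1  # no evidence of any convention
    ratio = sum(bool(CONVENTIONAL.match(s)) for s in subjects) / len(subjects)
    if ratio >= 0.95:
        return 5
    if ratio >= 0.80:
        return 4
    if ratio >= 0.50:
        return 3
    if ratio >= 0.20:
        return 2
    return 1
```

In practice you'd feed this the last 100 subjects from something like `git log --format=%s -100`. The point is the shape of the check: it reads data your repo already produces, and the rule is plain enough to argue with.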

The Framework

50 standards across 6 phases of the SDLC. Each scored 1–5 based on what we can actually observe.

1. Reactive: No defined process
2. Emerging: Some practices exist but inconsistently
3. Defined: Processes are documented and followed
4. Managed: Practices are measured and actively improved
5. Optimizing: Continuous improvement is embedded in culture
📋 PHASE 1: Requirements (8 standards)

How work gets defined and tracked. Can you trace a line from a business requirement to the code that implemented it? Or does work just appear in PRs with no context?

AI can write tickets for you now. Great. That makes a disciplined requirements process more important, not less, because now you have well-formatted tickets that might be pointing nowhere.

Ticket Linkage · Sprint Planning · Estimation Practice · Traceability · Backlog Hygiene · Acceptance Criteria · Priority Management · Requirement Clarity
🏗️ PHASE 2: Design (7 standards)

How you make and record technical decisions. When someone asks "why did we build it this way?" six months from now, is there an answer anywhere?

Code generated by AI doesn't ship with an explanation of why it was designed that way. The architecture rationale gap gets worse fast.

ADR Practice · Design Review · Tech Documentation · API Contracts · Tech Debt Tracking · Dependency Management · Architecture Docs
💻 PHASE 3: Development (10 standards)

The mechanics of how code goes from someone's editor to the main branch. Is main protected? Are reviews meaningful or just a checkbox? Do commits tell a story or say "fix stuff"?

When half your codebase was written by a machine, the humans reviewing it need to actually read it. Rubber-stamp reviews on AI-generated code are how you end up with bugs nobody understands.

Branch Protection · Reviewer Requirement · PR-Based Workflow · Review Turnaround · PR Size · Review Depth · Commit Convention · Linting · Secrets Management · Docs Updated with Code
🧪 PHASE 4: Testing (9 standards)

What happens between "code complete" and "in production." Does CI actually block bad merges, or is it just a green checkmark that nobody waits for?

Your CI pipeline is now the bottleneck and the last line of defense. AI writes code fast. If CI doesn't gate it properly, you're just shipping bugs faster.

CI Pipeline · CI Gating · Coverage Reporting · Test File Ratio · CI Reliability · Security Scanning · Integration Tests · Build Time · Bug-Fix Tests
🚀 PHASE 5: Release (8 standards)

How software gets versioned, packaged, and shipped. Do you have a release process or do you just push to main and pray? Can you roll back when it breaks at 2am?

More code, more often, means more releases. Without rollback capability and proper versioning discipline, faster is just riskier.

Build Artifacts · Release Cadence · Lead Time · Release Approval · Semantic Versioning · Release Notes · Change Failure Rate · Rollback Capability
📡 PHASE 6: Operations (8 standards)

What happens after code hits production. When the pager goes off, do you have a process? Does anyone own this service? Is there a runbook or is it tribal knowledge in someone's head?

No amount of AI is going to run your incident response for you. This is the area with no shortcut. Lean teams need operational maturity the most.

Incident Response · Postmortem Practice · MTTR · Code Ownership · Runbooks · Monitoring as Code · SLO/SLA Defined · Operational Reviews

How Scoring Works

Every standard gets scored 1–5 based on what we can observe in your tools. Branch protection is either configured or it isn't. Your CI workflow either runs security scans or it doesn't. Your last 100 commits either follow a convention or they don't. Some signals are strong (branch rules are binary), some are weaker (inferring test quality from CI config). So every score carries a confidence weight.

// Overall score = confidence-weighted average
score = Σ(standard_score × confidence) / Σ(confidence)
// More integrations = higher confidence = clearer picture
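The weighted average above can be sketched in a few lines. This is an illustrative implementation of the formula, not the production scoring code; the example inputs and the 0–1 confidence scale are assumptions for the sketch.

```python
def overall_score(results: list[tuple[float, float]]) -> float:
    """Confidence-weighted average of (standard_score, confidence) pairs.

    Scores run 1-5; confidence is a 0-1 weight reflecting signal strength
    (a binary branch-protection rule weighs more than test quality
    inferred from CI config).
    """
    total_weight = sum(conf for _, conf in results)
    if total_weight == 0:
        return 0.0  # no evidence collected yet
    return sum(score * conf for score, conf in results) / total_weight

# One strong binary signal (5 at full confidence) pulls the overall
# score well above one weak inferred signal (2 at 0.4 confidence):
print(overall_score([(5, 1.0), (2, 0.4)]))  # ≈ 4.14
```

Note what falls out of the weighting: adding an integration doesn't just add standards, it raises the confidence denominator, so low-confidence guesses matter less as the evidence base grows.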

The whole thing is transparent. Every threshold, every scoring rule, every piece of evidence is visible. You can look at any score and see exactly what drove it. No black box. If you disagree with how a standard is scored, you can see the logic and decide whether it applies to your context.

Why Not Just Use What Exists?

DORA Metrics

GAP: Tells you how fast you ship. Doesn't tell you if your security scanning works, your docs exist, your incidents get postmortems, or your tech debt is tracked.

OUR TAKE: Covers the full SDLC. Speed is one of fifty things we look at.

Consulting Assessments

GAP: Months of work, six-figure cost, survey-based, and the PDF is outdated by the time the ink dries. Great for large enterprises. Not accessible to everyone else.

OUR TAKE: Scans your repos in minutes. Repeatable. Gets better every time you improve something.

Self-Assessment Surveys

GAP: You already know the problem. Teams overestimate their maturity. There's no verification. And the results are only as honest as the people filling them out.

OUR TAKE: Evidence from your actual tools. Branch rules don't lie about whether they're enabled.

Developer Productivity Tools

GAP: Measuring individual output (commits, PRs, lines of code) creates terrible incentives and the entire research community has said as much.

OUR TAKE: Measures team and org practices. Zero individual scoring. We care about whether the system works.

Who This Is For

This was built for the people who get asked "how mature is our engineering?" and need a better answer than "I think we're doing okay." VPs of Engineering, Directors, CTOs, Staff+ engineers who own technical strategy.

It's especially relevant if you're in the middle of AI adoption and someone on the leadership team wants proof that engineering quality isn't slipping as output increases. Or if you've been told to "measure engineering productivity" and you'd rather steer that conversation somewhere useful before a consulting firm steers it somewhere harmful.

The framework is open. Use it directly if you want. Adapt it. Throw out the standards that don't apply to your stack. The point isn't to create another rigid model that everyone argues about. It's to make engineering maturity something you can observe and talk about concretely.

We Built a Tool That Does This Automatically

Concordance connects to your GitHub repos and Linear workspace, then scores all 50 standards in minutes. We're in beta right now, adding more integrations, and building the whole thing in the open.

Free during beta. Early users get grandfathered pricing.

Try the Beta →
See the Roadmap

Not ready to try it? Get notified when we add new integrations.