Engineering Enablement August 7, 2024

Building a Competency Graph for Your Engineering Team

By James Nakamura

Most engineering organizations have some version of a skills matrix. It's a spreadsheet — or a spreadsheet masquerading as a table in Confluence — with engineers on one axis, skills on the other, and some kind of rating in each cell. Someone maintains it. Reviews happen quarterly or annually. It probably tells you which engineers are "strong" in Kubernetes and which ones still need "development."

A competency graph is not that. Understanding the difference matters if you want to build L&D programs that actually close the gaps causing incidents.

What a Skills Matrix Actually Represents

A skills matrix is a snapshot of self-reported or manager-assessed ratings at a point in time, mapped against a generic skill taxonomy. The taxonomy is usually borrowed from somewhere — a job ladder framework, an industry certification track, whatever the last L&D manager imported. The ratings are usually integers on a scale from one to five, assigned via survey or calibration discussion.

This structure has a few problems that become significant at scale:

The taxonomy is static. Your company's backend stack in 2023 is not your backend stack in 2026. If you migrated from a monolith to a service-based architecture, the skills that matter changed. If you switched observability tooling, the competencies required to run incidents changed. A static taxonomy captures neither the migration nor the new failure modes it introduced.

The ratings are point-in-time. An engineer who was "strong" in a particular domain eighteen months ago may have moved to a different codebase area. The knowledge decays. The matrix doesn't update.

The mapping is one-directional at best. A skills matrix tells you which engineers have which skills. It doesn't tell you which skills your codebase actually demands, which domains are underrepresented in your team, or which skill gaps correlate with real operational problems.

What a Competency Graph Actually Is

A competency graph starts from a different question: what does your codebase actually require? Not what skills are generically useful for engineers at your company's stage, not what appeared on the last job posting for the role — what technical competencies does operating and extending your specific infrastructure demand from people in specific roles?

The graph structure captures relationships that a matrix cannot. A competency node for "distributed tracing" connects to "incident response," which connects to "service mesh configuration," which in turn connects to "latency debugging." These aren't independent skills to be rated separately. They form a cluster that tends to show up together in incident postmortems when engineers who lack them run into distributed system failures. The graph encodes that dependency. The matrix doesn't.
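As a minimal sketch of that structure, the chain of competencies above can be held in a small undirected graph. The class name `CompetencyGraph`, its methods, and the unrelated "sql query tuning" cluster are all hypothetical, introduced only for illustration. An edge here means "tends to be needed together," which is one possible reading of the relationships, not the only one.

```python
from collections import defaultdict, deque


class CompetencyGraph:
    """Undirected graph; an edge means two competencies tend to be needed together."""

    def __init__(self):
        self.edges = defaultdict(set)

    def relate(self, a, b):
        # Store the relationship in both directions.
        self.edges[a].add(b)
        self.edges[b].add(a)

    def cluster(self, start):
        """Return every competency reachable from `start` (breadth-first search)."""
        seen = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in self.edges[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        return seen


graph = CompetencyGraph()
graph.relate("distributed tracing", "incident response")
graph.relate("incident response", "service mesh configuration")
graph.relate("service mesh configuration", "latency debugging")
graph.relate("sql query tuning", "schema design")  # hypothetical, unrelated cluster

tracing_cluster = graph.cluster("distributed tracing")
```

Asking for the cluster around "distributed tracing" returns the four connected competencies and leaves the unrelated pair out, which is the kind of grouping a flat matrix of independent columns cannot express.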

Critically, a live competency graph is updated from behavioral signals rather than periodic surveys. When a P2 incident involves repeated escalation because the on-call engineer couldn't navigate the tracing pipeline, that's a signal that gets recorded against the relevant competency node. When PR reviews in a specific service area consistently surface the same categories of feedback, that's a signal. The graph updates continuously rather than on a quarterly survey cadence.
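One way to picture continuous updating is a log of signals per engineer and competency, where older signals count for less. The sketch below is an assumption-heavy illustration: the `Signal` and `SignalLog` names, the signed weights, and the 180-day half-life are all invented here and would need tuning against real data.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

HALF_LIFE_DAYS = 180  # assumption: a signal loses half its weight every 180 days


@dataclass
class Signal:
    engineer: str
    competency: str
    kind: str        # e.g. "incident_escalation", "incident_resolution", "review_feedback"
    weight: float    # positive = competency demonstrated, negative = gap observed
    observed_at: datetime


@dataclass
class SignalLog:
    signals: list = field(default_factory=list)

    def record(self, signal):
        self.signals.append(signal)

    def score(self, engineer, competency, now):
        """Sum the signal weights for one engineer and competency, decayed by age."""
        total = 0.0
        for s in self.signals:
            if s.engineer == engineer and s.competency == competency:
                age_days = (now - s.observed_at).total_seconds() / 86400
                total += s.weight * 0.5 ** (age_days / HALF_LIFE_DAYS)
        return total


now = datetime(2024, 8, 7)
log = SignalLog()
# A resolution from 180 days ago counts for half; an escalation today counts in full.
log.record(Signal("ana", "distributed tracing", "incident_resolution", 1.0,
                  now - timedelta(days=180)))
log.record(Signal("ana", "distributed tracing", "incident_escalation", -1.0, now))

current = log.score("ana", "distributed tracing", now)
```

The decay term is what separates this from a point-in-time rating: evidence fades unless recent work renews it, so the score drifts without anyone having to re-survey the team.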

Building the Graph: What Signals You Need

Constructing a useful competency graph for an engineering team requires connecting three kinds of data that usually live in separate systems:

Codebase topology. Which services, languages, and infrastructure domains exist in your system? What are the ownership boundaries? Which parts of the codebase have the highest change frequency or incident rates? This gives you the demand side: which competencies your system actually requires.

Incident and postmortem data. What skills were exercised — or found lacking — in recent incidents? Which engineers escalated? Which engineers resolved? What domains appear repeatedly in postmortems? Incident data is the clearest signal you have about where skill gaps are affecting operational outcomes. Systems like PagerDuty capture escalation patterns. Postmortem documents, when structured consistently, record the contributing factors explicitly.

PR review patterns. Code review is where skill gaps become visible in non-incident contexts. When senior engineers leave consistent categories of feedback — error handling, resource management, security edge cases, performance assumptions — those patterns map directly to competency nodes. Review comment analysis at scale reveals which engineers are consistently encountering the same conceptual gaps across multiple PRs.

The competency graph is built by connecting these three data streams. The result is a model that answers questions a skills matrix cannot: Which engineers have operational exposure to your most critical infrastructure domains? Where is skill coverage dangerously thin? Which competency clusters are your highest-priority gaps relative to your actual incident patterns?
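A rough sketch of joining the three streams follows. Every service name, engineer name, and record is hypothetical and stands in for exports from real systems. The sketch also assumes that resolving an incident on a service counts as exposure to every competency that service demands, and that a reviewer who leaves feedback on a topic has that competency. Both are simplifications.

```python
from collections import defaultdict

# Codebase topology: service -> competencies it demands (hypothetical).
service_domains = {
    "checkout-api": {"service mesh configuration", "distributed tracing"},
    "billing-worker": {"queue consumers"},
}
# Incident tracker export: (service, engineer who resolved it) (hypothetical).
incidents = [
    ("checkout-api", "ana"),
    ("checkout-api", "ana"),
    ("billing-worker", "ben"),
    ("billing-worker", "chen"),
]
# Code review export: (reviewer, competency the feedback concerned) (hypothetical).
review_feedback = [
    ("ben", "distributed tracing"),
]


def coverage_by_competency(service_domains, incidents, review_feedback):
    """Map each demanded competency to the engineers with observed exposure."""
    exposed = defaultdict(set)
    for service, resolver in incidents:
        for competency in service_domains.get(service, ()):
            exposed[competency].add(resolver)
    for reviewer, competency in review_feedback:
        exposed[competency].add(reviewer)
    # Include demanded competencies even when nobody has shown exposure.
    demanded = set().union(*service_domains.values())
    return {c: exposed.get(c, set()) for c in demanded}


def thin_coverage(coverage, minimum=2):
    """Competencies with fewer than `minimum` exposed engineers, sorted by name."""
    return sorted(c for c, people in coverage.items() if len(people) < minimum)


coverage = coverage_by_competency(service_domains, incidents, review_feedback)
gaps = thin_coverage(coverage)
```

With this toy data, "service mesh configuration" is the only competency covered by a single engineer, so it is the one flagged as thin. The `minimum` threshold is a policy choice, not something the data decides.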

A Concrete Example: Platform Engineering Team Migration

Consider an engineering organization of about 80 people that moved from AWS to a multi-cloud architecture over twelve months. Their existing skills matrix showed strong Kubernetes competency across the team — a legacy of several years of container-based deployment work. What it didn't capture was the delta introduced by the multi-cloud migration: new networking patterns, different IAM models across providers, service mesh configuration in a heterogeneous environment.

Their incident data, by contrast, told a clear story within three months of the migration. Incidents in the new multi-cloud service mesh environment resolved significantly more slowly than incidents in the legacy AWS-only services. The skill gap wasn't in Kubernetes fundamentals — the matrix had that right. The gap was in the specific multi-cloud networking and IAM intersection that the migration introduced. That gap wasn't in anyone's skill taxonomy. It lived in the incident data, waiting to be read.

A competency graph built from post-migration incident patterns would have surfaced that cluster immediately. The static matrix never would have — because no one knew to add the competency until it showed up in postmortems.
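The comparison in this example can be approximated with a per-environment median of time to resolve. The numbers below are invented purely to make the sketch runnable and do not come from the organization described. The environment labels and the function name are likewise hypothetical.

```python
from collections import defaultdict
from statistics import median

# (environment, minutes to resolve): invented sample data, not real measurements.
incidents = [
    ("aws-legacy", 30), ("aws-legacy", 45), ("aws-legacy", 40),
    ("multi-cloud-mesh", 120), ("multi-cloud-mesh", 95), ("multi-cloud-mesh", 150),
]


def median_resolution_minutes(incidents):
    """Group incidents by environment and return the median resolution time of each."""
    by_env = defaultdict(list)
    for env, minutes in incidents:
        by_env[env].append(minutes)
    return {env: median(values) for env, values in by_env.items()}


medians = median_resolution_minutes(incidents)
# Ratio of the new environment's median to the legacy one.
slowdown = medians["multi-cloud-mesh"] / medians["aws-legacy"]
```

A median is used rather than a mean so that one unusually long incident does not dominate the comparison. A sustained ratio well above 1 for a new environment is the kind of signal that would prompt adding a competency node for it.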

What a Live Graph Enables That a Matrix Doesn't

When the competency graph updates from live signals rather than periodic surveys, a few things become possible that aren't possible with static matrices:

L&D priorities can respond to real operational events rather than quarterly planning cycles. When a new incident pattern appears, the competency node associated with it becomes visible as a gap immediately. Training investment can follow the signal.

Onboarding paths can be personalized to the specific codebase domains a new engineer will own. Rather than a generic onboarding track, the graph can identify which competency clusters are most relevant to their role and team, and in what sequence — based on what the actual work in their area demands.

Bus factor can be measured against the actual codebase rather than abstract skill categories. When only two engineers have operational exposure to a critical service area, that shows up as a graph coverage gap rather than a vague concern.
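The onboarding sequencing mentioned above can be sketched as a topological sort, assuming the graph also records which competencies should be learned before others. The prerequisite edges and the "kubernetes fundamentals" node ordering below are hypothetical, and `onboarding_path` is an illustrative helper built on the standard library's `graphlib` (Python 3.9+).

```python
from graphlib import TopologicalSorter

# competency -> competencies to learn first (hypothetical prerequisite edges).
prerequisites = {
    "latency debugging": {"distributed tracing"},
    "service mesh configuration": {"kubernetes fundamentals"},
    "incident response": {"distributed tracing", "service mesh configuration"},
    "sql query tuning": set(),
}


def onboarding_path(prerequisites, targets):
    """Order the target competencies and their transitive prerequisites for learning."""
    needed = set()
    stack = list(targets)
    while stack:
        competency = stack.pop()
        if competency not in needed:
            needed.add(competency)
            stack.extend(prerequisites.get(competency, ()))
    # Restrict the graph to what this engineer needs, then sort it.
    subgraph = {c: prerequisites.get(c, set()) & needed for c in needed}
    return list(TopologicalSorter(subgraph).static_order())


path = onboarding_path(prerequisites, ["incident response"])
```

For an engineer whose role centers on "incident response," the path contains that competency plus its three prerequisites, with "incident response" last and nothing unrelated included. Where several orders are valid, the sort picks one; a real system would likely break ties using the gap signals described earlier.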

The Honest Tradeoffs

We're not arguing that competency graphs eliminate all the complexity of skills assessment. Building one from workflow signals requires data access — to version control, incident tracking, and ideally code review patterns — and that access raises legitimate privacy and trust questions that need careful handling. We'll address those separately. The graph also requires maintenance: the codebase topology it maps to will evolve, and the graph needs to evolve with it.

A well-maintained skills matrix is still better than nothing. If you're choosing between a rigorous quarterly skills matrix process and no structured skills visibility at all, take the matrix. The argument here is narrower: if you have access to the workflow data your engineering organization already produces, using that data as input to a live competency model gives you signal quality that periodic self-assessment cannot match.

The skill gaps that are affecting your incident rate and code quality are visible in your workflow data. The question is whether your skills model is dynamic enough to read them.
