Skills Intelligence

Why Skill Surveys Lie (And What to Read Instead)

By James Nakamura

Every quarter, someone in your organization runs a skills survey. Engineers self-rate on a five-point scale across forty-odd competencies, the results get aggregated into a heat map, and the L&D team plans training priorities accordingly. The process feels rigorous. It produces numbers. It generates a document.

The problem is that the numbers are wrong, not because engineers are dishonest, but because the methodology asks people to measure what they cannot see about themselves.

The Epistemology of Self-Rating

When you ask an engineer to rate their Kubernetes proficiency, you're asking them to evaluate competence against failure modes they may never have encountered. An engineer who has managed deployments successfully, but only under normal conditions, will rate themselves as capable — because all their Kubernetes work has gone fine. The skill gap isn't visible to them. The gap lives in the scenarios they haven't faced yet.

This isn't a character flaw. It's how skill self-assessment works for any technical domain where competence only becomes visible under pressure. A distributed systems engineer who has never dealt with a split-brain scenario doesn't know to worry about split-brain scenarios. They're not hiding anything. They're reporting accurately from their own experience, which is necessarily incomplete.

There's a documented pattern in skill assessment research: people at intermediate levels tend to overestimate their competence relative to actual performance, while genuine experts often rate themselves more critically. Applied to engineering surveys, this creates a specific distortion. Your most experienced engineers may flag development areas in domains where they're actually strong, while engineers with operational blind spots report confidence. At its worst, the survey inverts your signal: the heat map shows gaps where your coverage is strongest and strength where it is weakest.

What Your Workflow Data Actually Shows

Your engineering organization produces a different category of evidence every day, and it doesn't require anyone to self-report anything.

PR review comments are behavioral observations. When a senior engineer leaves structurally similar feedback across multiple pull requests from the same junior contributor — "this doesn't handle the edge case where the upstream service times out," "you're assuming this lock is always acquired," "what happens when this queue is at capacity" — that's a pattern. It's not a rating on a form. It's observed skill gap, recorded in your source control system, timestamped and attached to specific engineers and specific domains.
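One rough way to make that pattern countable is sketched below. It assumes you can export review comments as (PR author, PR id, comment text) tuples. The keyword-to-theme map and the names THEMES, tag_themes, and recurring_themes are illustrative assumptions, not part of any real tool, and substring matching is crude: "lock" also matches "block", for example.

```python
from collections import defaultdict

# Hypothetical keyword -> theme map. A real one would be built from
# your own review history, not from this list.
THEMES = {
    "timeout": "failure handling",
    "times out": "failure handling",
    "lock": "concurrency",
    "queue": "backpressure",
    "capacity": "backpressure",
}


def tag_themes(comment):
    """Return the set of themes whose keywords appear in the comment."""
    text = comment.lower()
    return {theme for keyword, theme in THEMES.items() if keyword in text}


def recurring_themes(comments, min_prs=2):
    """comments: iterable of (pr_author, pr_id, comment_text).

    Returns {author: {theme: distinct PR count}} for themes that were
    raised on at least `min_prs` different PRs from the same author.
    """
    prs_seen = defaultdict(set)
    for author, pr_id, text in comments:
        for theme in tag_themes(text):
            prs_seen[(author, theme)].add(pr_id)

    result = defaultdict(dict)
    for (author, theme), prs in prs_seen.items():
        if len(prs) >= min_prs:
            result[author][theme] = len(prs)
    return dict(result)


# Made-up sample data for illustration only.
sample = [
    ("dana", 101, "this doesn't handle the edge case where the upstream service times out"),
    ("dana", 102, "What about a timeout on this call?"),
    ("dana", 103, "you're assuming this lock is always acquired"),
    ("lee", 104, "what happens when this queue is at capacity"),
]
print(recurring_themes(sample))  # {'dana': {'failure handling': 2}}
```

Counting distinct PRs rather than raw comments keeps one heavily reviewed pull request from looking like a recurring gap.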

Incident postmortems are even more direct. A P1 that required escalation because the on-call engineer hadn't encountered that class of failure before is a skills gap documented in your incident tracker. The escalation chain, the time-to-resolution, the fact that only two people in the organization knew how to handle it — all of that is signal. It tells you exactly where your competency coverage is thin and which failure domains your team isn't prepared for.
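The same reading can be sketched in a few lines, assuming each postmortem can be reduced to a failure domain, an escalation flag, a resolver, and a time-to-resolution. The Incident fields, the domain labels, and the function name coverage_by_domain are assumptions made for illustration. Your incident tracker will expose different fields.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import median


@dataclass
class Incident:
    domain: str              # failure class label (hypothetical taxonomy)
    escalated: bool          # True if the on-call engineer had to hand off
    resolver: str            # person who ultimately resolved the incident
    minutes_to_resolve: int


def coverage_by_domain(incidents):
    """Summarize escalation rate, resolver count, and median resolution time per domain."""
    grouped = defaultdict(list)
    for inc in incidents:
        grouped[inc.domain].append(inc)

    report = {}
    for domain, items in grouped.items():
        report[domain] = {
            "incidents": len(items),
            "escalation_rate": sum(i.escalated for i in items) / len(items),
            "distinct_resolvers": len({i.resolver for i in items}),
            "median_minutes": median(i.minutes_to_resolve for i in items),
        }
    return report


# Made-up sample data for illustration only.
incidents = [
    Incident("split-brain", True, "ana", 120),
    Incident("split-brain", True, "raj", 90),
    Incident("split-brain", False, "ana", 60),
    Incident("dns", False, "kim", 15),
]
for domain, stats in coverage_by_domain(incidents).items():
    print(domain, stats)
```

A domain with a high escalation rate and only one or two distinct resolvers is the thin coverage described above.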

Ticket assignment patterns tell a third version of the same story. When Jira tickets in a specific service area cycle through multiple reassignments before landing on the same two engineers every time, that's a bus factor problem sitting in your issue tracker. It means the skill is dangerously concentrated. A survey won't surface it because the people who lack the skill don't know they're missing it, and the two who have it are too busy doing the work to flag it on a form.
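A minimal sketch of that check follows. It assumes you can export each ticket as a service area plus its ordered list of assignees, with the last assignee being the person who closed it. The function name concentration_by_area and the sample data are hypothetical. Pulling assignment history out of Jira is left out here because it depends on your setup.

```python
from collections import Counter, defaultdict


def concentration_by_area(tickets, top_n=2):
    """tickets: iterable of (service_area, assignees).

    `assignees` is the ordered, non-empty list of people the ticket was
    assigned to. The last entry is assumed to be whoever resolved it.
    """
    finals = defaultdict(Counter)   # area -> resolver -> tickets closed
    handoffs = defaultdict(list)    # area -> reassignment count per ticket
    for area, assignees in tickets:
        finals[area][assignees[-1]] += 1
        handoffs[area].append(len(assignees) - 1)

    report = {}
    for area, counts in finals.items():
        total = sum(counts.values())
        top = counts.most_common(top_n)
        report[area] = {
            "mean_reassignments": sum(handoffs[area]) / total,
            # Share of tickets closed by the top_n most frequent resolvers.
            "top_share": sum(count for _, count in top) / total,
            "top_people": sorted(name for name, _ in top),
        }
    return report


# Made-up sample data for illustration only.
tickets = [
    ("ledger", ["a", "b", "mei"]),
    ("ledger", ["c", "omar"]),
    ("ledger", ["mei"]),
    ("ledger", ["d", "e", "omar"]),
    ("web", ["x"]),
    ("web", ["y"]),
    ("web", ["z"]),
]
print(concentration_by_area(tickets)["ledger"])
```

In this sample, every "ledger" ticket ends with the same two people after an average of more than one handoff, which is the reassignment pattern the paragraph above describes.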

The Confidence Calibration Problem at Scale

Consider a growing payments infrastructure team — around 60 engineers across three squads — that ran quarterly skill surveys for two years before layering in workflow-based analysis. Their survey results consistently showed strong self-reported confidence in observability tooling. Their incident data told a different story: a third of their P2 incidents involved extended resolution times attributable to engineers who couldn't navigate their distributed tracing setup effectively under pressure. The survey had never flagged this. Engineers used the tracing tools successfully in non-incident contexts and rated themselves accordingly.

The behavioral signal didn't require a new tool. It was already in the incident tracker and the PR history. The gap between what the surveys reported and what the workflow data showed wasn't subtle — it was the difference between "we're fine on observability" and "we have a recurring incident pattern we can prevent."
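To make that comparison concrete, here is a sketch that sets survey averages next to incident-derived signal and flags the skills where they disagree. The thresholds, the function name calibration_flags, and all of the numbers are invented for illustration. They are not the team's actual data, apart from the one-third observability figure echoing the example above.

```python
def calibration_flags(self_ratings, slow_incident_share,
                      rating_floor=4.0, share_ceiling=0.25):
    """Return skills that are rated confidently but implicated in slow incidents.

    self_ratings: {skill: mean self-rating on a 1-5 scale}
    slow_incident_share: {skill: fraction of incidents with extended
        resolution time attributed to that skill area}
    Both thresholds are arbitrary starting points, not recommendations.
    """
    flags = []
    for skill, rating in self_ratings.items():
        share = slow_incident_share.get(skill, 0.0)
        if rating >= rating_floor and share > share_ceiling:
            flags.append(skill)
    return sorted(flags)


# Made-up numbers for illustration only.
self_ratings = {"observability": 4.2, "kubernetes": 3.1, "sql": 4.5}
slow_incident_share = {"observability": 0.33, "kubernetes": 0.30, "sql": 0.05}

print(calibration_flags(self_ratings, slow_incident_share))  # ['observability']
```

Skills with a low self-rating and a high incident share are left out on purpose: the survey already surfaced those. The flag is reserved for the cases where the survey says "fine" and the incident tracker disagrees.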

Why Survey Design Can't Fix This

It's worth being direct here: the problem is not poorly designed surveys. You can write better questions. You can add behavioral anchors. You can ask engineers to describe what they'd do in a specific scenario rather than rate their abstract proficiency. These improvements help at the margins.

They don't solve the core problem, which is that self-report methodology has structural limits for skills that require encountering failure to recognize. You can't design a survey that asks engineers to identify their unknown unknowns. The phrase means what it says.

We're not arguing that surveys have no value. Team-level morale checks, learning preference discovery, and qualitative "where do you feel stretched" prompts are all legitimate uses of self-report. The problem is using surveys as the primary data source for gap identification and L&D investment decisions in an engineering context where behavioral data already exists.

Reading the Data Your Org Already Produces

The workflow signals your engineering organization generates have properties that make them structurally more reliable than surveys for skills gap identification:

  • Behavioral, not declarative. PR comments, incident escalations, and ticket assignments reflect what engineers actually do under real conditions — not what they believe they'd do in a hypothetical.
  • Continuous. Workflow data updates as new incidents happen, new PRs land, new tickets cycle. Survey data updates quarterly, at best. The skill gap that opened when your infrastructure moved to a new messaging architecture shows up in incident patterns within weeks. It shows up in your next survey in three months, if you write the right question.
  • Codebase-specific. Generic skill taxonomies map to abstract competencies. Workflow data maps to the actual domains your codebase creates exposure to — your specific orchestration layer, your specific failure modes, your specific architecture decisions. Training priorities derived from this signal are relevant to the work your engineers actually do.
  • Surfaces unknown unknowns. Engineers can't report gaps they haven't encountered. Workflow data records what happens when they do encounter them.

What This Means for L&D Planning

The practical implication isn't to abandon surveys entirely. It's to change what you use them for. Surveys are reasonable inputs for understanding learning preferences, identifying where engineers feel stretched or want to grow, and gathering qualitative feedback on existing training programs. They're poor inputs for identifying the skill gaps that are actually affecting your incident rate, your code quality, and your team's operational resilience.

For those decisions, you need data that records what happens when engineers encounter the hard parts of your system. That data already exists in your workflow tooling. The question is whether you're reading it, or whether you're asking engineers to describe the shape of their own blind spots instead.

"The survey isn't wrong because engineers are dishonest. It's wrong because it asks people to report the outline of what they can't see."

Your incident log knows what your team's Kubernetes blind spots are. Your PR review history knows which engineers are hitting the same conceptual wall. Your Jira reassignment patterns know which skills are dangerously concentrated in one or two people. That information is already there. The work is learning to read it.
