May 2, 2025

A Practical Discovery Call Scoring Framework You Can Build in an Afternoon


Before we built Tunlai, we built a spreadsheet. It was a manual discovery call scoring rubric that two of us used to review recordings every week — a simple grid with criteria down the left column, a 0-2 scale for each, and a comments field for the moments worth noting.

We're sharing that foundation here because we think a lot of teams skip straight to "we need a tool" before they've worked out what they're actually trying to measure. A manual rubric forces you to be explicit about what good discovery looks like for your specific buyer, your deal cycle, and your methodology. That specificity is what makes automated scoring useful — the automation should be scoring against criteria you've already validated, not criteria a vendor pre-configured for a generic sales motion.

This framework takes most teams about half a day to build and calibrate. It won't catch everything, and it doesn't scale beyond maybe 20% call coverage before it becomes a full-time job. But it will tell you, quickly, whether your discovery calls are as thorough as your team thinks they are.

Start With the Outcome You're Predicting

The purpose of scoring discovery calls is to predict deal outcomes and identify coaching leverage points. That means your scoring criteria should map to behaviors that correlate with later-stage progress — not behaviors that feel good in the moment or sound like thoroughness.

The calibration question to ask before you finalize any criterion: "If a rep does this consistently, does the deal tend to progress?" If you can't trace a plausible mechanism from the criterion to pipeline outcome, cut it. Rubrics with too many criteria become administrative overhead that nobody trusts.

We recommend starting with five to seven criteria for a first version. You can always add more after running it for a month and seeing which criteria are discriminating (scores vary meaningfully across reps) versus which are flat (everyone scores the same, so the criterion isn't telling you anything).
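One rough way to separate discriminating criteria from flat ones after a month of scoring is to compare each rep's average score per criterion and look at how much those averages vary. The sketch below assumes a hypothetical layout where scores are grouped by rep and criterion. The 0.25 threshold is an arbitrary starting point, not a validated cutoff.

```python
from statistics import mean, pstdev

def criterion_spread(scores_by_rep):
    """Map each criterion to the standard deviation of per-rep average scores.

    scores_by_rep: {rep_name: {criterion: [0-2 scores across calls]}}
    """
    criteria = sorted({c for rep in scores_by_rep.values() for c in rep})
    spread = {}
    for criterion in criteria:
        rep_means = [
            mean(rep[criterion])
            for rep in scores_by_rep.values()
            if rep.get(criterion)
        ]
        # With fewer than two reps there is nothing to compare.
        spread[criterion] = pstdev(rep_means) if len(rep_means) > 1 else 0.0
    return spread

def flat_criteria(scores_by_rep, threshold=0.25):
    """Criteria whose per-rep averages barely differ (threshold is an assumption)."""
    spread = criterion_spread(scores_by_rep)
    return [c for c, s in spread.items() if s < threshold]

# Hypothetical sample data for illustration only.
sample_scores = {
    "rep_a": {"pain_depth": [2, 2, 1], "next_step": [1, 1, 1]},
    "rep_b": {"pain_depth": [0, 1, 0], "next_step": [1, 1, 1]},
}
flat = flat_criteria(sample_scores)
```

In this made-up sample, both reps score identically on next_step, so it would be a candidate to cut or redefine, while pain_depth varies enough across reps to keep.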

The Core Discovery Criteria

These are the criteria we started with, derived from comparing discovery calls from deals that advanced to late stage against deals that stalled after the first or second conversation. They're written for a B2B SaaS context but translate across most complex sales motions.

Pain identification depth. Did the rep identify a specific problem, or did they stay at the surface level ("you want to improve sales performance")? Score 0 for surface-only, 1 for a named problem, 2 for a named problem with business impact quantified or articulated. The distinction between 1 and 2 is whether the rep got the prospect to say something like "we lose about two weeks of manager time per quarter to call review" versus just "our managers spend too much time on coaching."

Economic buyer clarified. Did the rep establish who makes the final purchase decision, or did they assume the person on the call is the decision maker? Score 0 for no mention, 1 for indirect indication ("I'll need to loop in my VP"), 2 for explicit confirmation of the decision process and the economic buyer's involvement or sign-off requirements.

Current solution probed. Did the rep ask what the prospect is currently doing about this problem — tool, process, or nothing? Score 0 for no probe, 1 for identifying current state, 2 for understanding why the current solution is insufficient (what's breaking, what's the workaround). This criterion separates reps who are pitching against a vacuum from those who understand what they're actually displacing.

Timeline and urgency established. Is there a driver behind this evaluation — a business deadline, a new initiative, a problem that's gotten worse recently? Score 0 for no timeline discussion, 1 for a general sense ("we want to implement something by year end"), 2 for a specific driver with a real consequence attached ("we're onboarding three new AEs in Q3 and our current process can't support that without significant manager hours").

Next step committed by prospect. Did the call end with the prospect committing to a specific next action, or just agreeing to continue the conversation? Score 0 for no next step, 1 for a vague next step ("let's find time to talk again"), 2 for a specific, dated commitment owned by the prospect ("I'll have our head of sales on the call next Thursday").

Add one or two criteria specific to your methodology. If you use MEDDIC, add criteria for Metrics and Champion. If your sales cycle requires legal or security review as a standard path, add a criterion for whether that path was mapped.
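If you want the rubric somewhere more structured than a spreadsheet, the five criteria above can be written down as plain data. This is a minimal sketch: the key names and the score_call helper are our own illustrative choices, and the level descriptions are condensed from the definitions above.

```python
# Each criterion maps every allowed score to its written definition.
RUBRIC = {
    "pain_depth": {
        0: "surface-level only",
        1: "named problem",
        2: "named problem with business impact quantified or articulated",
    },
    "economic_buyer": {
        0: "no mention",
        1: "indirect indication",
        2: "explicit confirmation of decision process and buyer sign-off",
    },
    "current_solution": {
        0: "no probe",
        1: "current state identified",
        2: "understands why the current solution is insufficient",
    },
    "timeline_urgency": {
        0: "no timeline discussion",
        1: "general sense of timing",
        2: "specific driver with a real consequence attached",
    },
    "next_step": {
        0: "no next step",
        1: "vague next step",
        2: "specific, dated commitment owned by the prospect",
    },
}

def score_call(scores, rubric=RUBRIC):
    """Validate one call's scorecard against the rubric and return the total."""
    missing = set(rubric) - set(scores)
    unknown = set(scores) - set(rubric)
    if missing or unknown:
        raise ValueError(f"missing={sorted(missing)} unknown={sorted(unknown)}")
    for criterion, value in scores.items():
        if value not in rubric[criterion]:
            raise ValueError(f"{criterion}: {value!r} is not a defined score")
    return sum(scores.values())

# Hypothetical scorecard for a single call.
example_call = {
    "pain_depth": 2,
    "economic_buyer": 1,
    "current_solution": 1,
    "timeline_urgency": 0,
    "next_step": 2,
}
total = score_call(example_call)
max_total = 2 * len(RUBRIC)
```

Methodology-specific criteria, such as Metrics and Champion for MEDDIC, would be added as further entries in the same dictionary.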

The 0-2 Scale and Why It Matters

A binary yes/no rubric collapses the distinction between "barely addressed" and "fully developed," which is exactly the distinction that coaching depends on. A 0-2 scale preserves that gradient without adding enough complexity to create scoring drift between reviewers.

The 1-score — the middle value — is your most important calibration point. Scorers will default to it when they're uncertain, which means your rubric's reliability depends on having a clear definition of what 1 means for each criterion. Write it down explicitly. "1 = named problem without impact quantification" is a definition. "1 = somewhat addressed" is not.

Run calibration sessions before you use the rubric for coaching. Have two reviewers score the same three calls independently, then compare. Any criterion with high scorer disagreement needs a sharper definition before it's useful. This calibration step is often skipped, which is why many manual scoring programs fall apart after a few weeks — the scores start meaning different things to different reviewers, and the data becomes noise.
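A calibration comparison can be as simple as counting, per criterion, how often the two reviewers gave the same score and how far apart they were on average. The sketch below assumes each reviewer's scorecards are listed in the same call order. The function names and the 0.67 agreement cutoff are illustrative assumptions, and three calls is far too small a sample for a formal agreement statistic.

```python
def calibration_report(reviewer_a, reviewer_b):
    """Compare two reviewers' scorecards for the same calls, criterion by criterion.

    Each argument is a list of {criterion: score} dicts in the same call order.
    """
    if len(reviewer_a) != len(reviewer_b) or not reviewer_a:
        raise ValueError("both reviewers must score the same non-empty set of calls")
    report = {}
    for criterion in reviewer_a[0]:
        gaps = [abs(a[criterion] - b[criterion]) for a, b in zip(reviewer_a, reviewer_b)]
        report[criterion] = {
            "exact_agreement": sum(g == 0 for g in gaps) / len(gaps),
            "mean_gap": sum(gaps) / len(gaps),
        }
    return report

def criteria_to_tighten(report, min_agreement=0.67):
    """Criteria where reviewers matched on fewer calls than the assumed cutoff."""
    return sorted(c for c, r in report.items() if r["exact_agreement"] < min_agreement)

# Hypothetical scores from two reviewers on the same three calls.
alex = [
    {"pain_depth": 2, "timeline_urgency": 1},
    {"pain_depth": 1, "timeline_urgency": 2},
    {"pain_depth": 0, "timeline_urgency": 0},
]
blake = [
    {"pain_depth": 2, "timeline_urgency": 2},
    {"pain_depth": 1, "timeline_urgency": 0},
    {"pain_depth": 0, "timeline_urgency": 0},
]
report = calibration_report(alex, blake)
needs_work = criteria_to_tighten(report)
```

Here the reviewers agree on pain_depth for every call but split on timeline_urgency for two of three, which suggests that criterion's definitions need rewriting before the scores are used in coaching.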

What to Do With the Scores

A rubric without a review loop is just documentation. The scores become useful when you plot them over time per rep, look for consistent gaps, and target those gaps in coaching.

The most common pattern we see is that reps cluster low on economic buyer clarification and timeline/urgency. These are the criteria that require the rep to ask questions that can feel presumptuous — "who else needs to be involved in this decision?" and "what happens if you don't solve this by Q3?" — so many reps avoid them to keep the conversation comfortable.

Low scores on those two criteria, sustained across multiple calls, predict deals that advance to proposal stage and then go quiet. The prospect liked the conversation but the rep never established whether there was a real evaluation happening or just an exploratory chat.
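To spot that pattern in your own data, you could average each rep's scores on those two criteria across their scored calls and flag anyone who stays low on both. This is a sketch under assumptions: the record layout, the three-call minimum, and the average-below-1.0 cutoff are all hypothetical and should be tuned against your own outcomes.

```python
from statistics import mean

WATCH_CRITERIA = ("economic_buyer", "timeline_urgency")

def reps_with_sustained_gaps(calls, watch=WATCH_CRITERIA, min_calls=3, max_avg=1.0):
    """Return reps whose average is below max_avg on every watched criterion.

    calls: list of {"rep": str, "scores": {criterion: int}} records.
    Reps with fewer than min_calls scored calls are skipped as too thin a sample.
    """
    by_rep = {}
    for call in calls:
        by_rep.setdefault(call["rep"], []).append(call["scores"])

    flagged = []
    for rep, cards in sorted(by_rep.items()):
        if len(cards) < min_calls:
            continue
        if all(mean(card[c] for card in cards) < max_avg for c in watch):
            flagged.append(rep)
    return flagged

# Hypothetical scored calls for illustration.
scored_calls = [
    {"rep": "dana", "scores": {"economic_buyer": 0, "timeline_urgency": 0}},
    {"rep": "dana", "scores": {"economic_buyer": 1, "timeline_urgency": 0}},
    {"rep": "dana", "scores": {"economic_buyer": 0, "timeline_urgency": 1}},
    {"rep": "lee", "scores": {"economic_buyer": 2, "timeline_urgency": 0}},
    {"rep": "lee", "scores": {"economic_buyer": 2, "timeline_urgency": 0}},
    {"rep": "lee", "scores": {"economic_buyer": 1, "timeline_urgency": 0}},
    {"rep": "sam", "scores": {"economic_buyer": 0, "timeline_urgency": 0}},
]
at_risk = reps_with_sustained_gaps(scored_calls)
```

In this sample only dana is flagged: lee is low on timeline but solid on economic buyer, and sam has a single scored call, which is not enough to call a pattern.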

When to Move to Automated Scoring

Manual scoring at any meaningful volume is genuinely hard. Two hours per week of call review covers maybe three to four calls, which is enough to track one rep but not a team of eight. Most managers using a manual rubric end up reviewing around 10-15% of total call volume, which means they're coaching from a thin sample that may not represent the rep's actual patterns.

The case for automated scoring isn't that it's more accurate on any individual call — a trained human reviewer often catches nuance that automated analysis misses. The case is coverage. If every discovery call gets scored, you stop coaching from a sample and start coaching from the full picture. You see the outlier calls — the unusually bad ones the rep never mentioned, the unusually good ones you can use as positive examples — that manual sampling never reaches.

We're not saying you should skip the manual rubric phase. We're saying the manual rubric is how you figure out what you're measuring before you automate it. Teams that jump straight to automated scoring without having done the manual work often end up with a scoring configuration they can't explain or defend because they never worked through what good discovery actually looks like for their specific motion.

Build the rubric first. Run it for four to six weeks. Calibrate it, cut the flat criteria, tighten the definitions on the discriminating ones. Then bring in automated scoring to do what humans can't — cover every call, every week, without burning out the people doing the reviewing.

A Note on Discovery Completeness vs. Discovery Quality

A rubric measures completeness — did the rep address the criterion? It's harder to measure quality, which is whether the rep actually understood what they heard and built a shared understanding with the prospect rather than just checking boxes.

You'll see reps who score well on completeness but whose deals still stall, because they went through the discovery motions without genuine curiosity — asking the questions but not listening in a way that led anywhere. That's a coaching problem the rubric surfaces indirectly. If a rep consistently scores 2 on every criterion but their deals still stall at proposal, the issue is usually quality — they're technically thorough but the calls feel transactional to the buyer.

The rubric is a floor. Helping reps rise above it is coaching work that requires listening to the calls themselves, not just reviewing scores. Think of the scoring as a triage system that tells you where to focus your listening time: on the calls and the criteria where gaps appear, rather than spreading review time evenly across everything.
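The triage idea can be sketched as a short listening queue: rank scored calls by how many criteria came back as 0 and note which ones, so review time goes to the biggest gaps first. The call_id field and the default limit of five are hypothetical, and treating only 0 as a gap is one possible choice.

```python
def listening_queue(calls, limit=5):
    """Rank calls by number of zero-scored criteria, most gaps first.

    calls: list of {"call_id": str, "scores": {criterion: int}} records.
    Returns up to `limit` (call_id, [criteria scored 0]) pairs.
    """
    ranked = []
    for call in calls:
        gaps = sorted(c for c, score in call["scores"].items() if score == 0)
        if gaps:
            ranked.append((call["call_id"], gaps))
    # Most gaps first; ties broken by call_id so the order is stable.
    ranked.sort(key=lambda item: (-len(item[1]), item[0]))
    return ranked[:limit]

# Hypothetical week of scored calls.
week = [
    {"call_id": "c-101", "scores": {"pain_depth": 2, "economic_buyer": 2, "next_step": 2}},
    {"call_id": "c-102", "scores": {"pain_depth": 1, "economic_buyer": 0, "next_step": 0}},
    {"call_id": "c-103", "scores": {"pain_depth": 2, "economic_buyer": 0, "next_step": 1}},
]
queue = listening_queue(week)
```

A queue like this only tells you where to start listening. A call with no zeros, such as c-101 here, can still be worth a spot check when the deal behind it stalls.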
