Performance & potential: how the 9-box grid works, and where it quietly goes wrong

You have probably been on a grid without knowing it. Somewhere in a talent review, a manager placed your name in one of nine boxes (high performance, medium potential, say, or solid performance, high potential), and that placement quietly shaped who got the stretch project, the succession nod, and the next promotion. The 9-box grid is one of the most-used talent tools in the world, and one of the most misused. This is what it actually does, and how to run it without lying to yourself.

The quick version

The 9-box grid plots each person on two axes: performance (results in their current role, looking back) and potential (capacity to grow into bigger or different roles, looking forward). Three levels on each axis make nine boxes.
Its value is not the labels; it is the calibration conversation it forces: a room of managers defending, comparing and aligning their ratings, which is where soft judgements get pressure-tested.
"Potential" is the slippery half. It is a forecast, not a fact, and forecasts are leaky: they absorb bias, especially about who "looks like" a future leader.
So use the grid as a prompt for action (develop, stretch, coach, move), revisit placements often, and never let a box become a verdict someone can't escape.

The idea in depth: two questions, not one

A performance rating answers a backward-looking question: did this person deliver in the job they hold now? A potential rating answers a forward-looking one: how far and how fast could they grow beyond it? The whole point of the grid is that these are different questions with frequently different answers. Your best individual contributor may be a low-potential fit for management; a middling performer in the wrong role may have high potential in a different one. Collapse the two into a single "talent score" and you lose exactly the signal the tool exists to surface.

The grid's lineage is usually traced to the GE–McKinsey nine-box matrix of the 1970s, a strategy tool that plotted business units by market attractiveness and competitive strength. The same logic was later borrowed for people (performance standing in for "competitive strength," potential for the future value that "market attractiveness" represented) and popularised inside General Electric under Jack Welch. This adaptation is reported across HR sources rather than in one canonical paper, so treat it as received history, not a citation. The practical discipline that follows: rate the two axes separately, and out loud. Make a manager say "strong results, limited reach" or "still learning the role, but the ceiling is high." The friction between those two statements is the insight, and a blended score buries it.

flowchart TB
  subgraph Grid["The 9-box grid"]
    direction TB
    A(["High potential /<br/>Low performance<br/>diamond in the rough"])
    B(["High potential /<br/>Solid performance<br/>rising star"])
    C(["High potential /<br/>High performance<br/>future leader"])
    D(["Med potential /<br/>Low performance<br/>inconsistent"])
    E(["Med potential /<br/>Solid performance<br/>core player"])
    F(["Med potential /<br/>High performance<br/>high performer"])
    G(["Low potential /<br/>Low performance<br/>under-performer"])
    H(["Low potential /<br/>Solid performance<br/>effective"])
    I(["Low potential /<br/>High performance<br/>trusted expert"])
  end

Performance runs left-to-right (poor → strong); potential runs bottom-to-top (limited → high). Box names vary by company. Leaders Loop
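
To make the two-axis idea concrete, here is a minimal sketch of the grid as a lookup table. It assumes the box names from the diagram above (yours may differ) and a three-level scale; the `Level` enum and `box_for` function are illustrative names, not part of any standard tool.

```python
from enum import IntEnum


class Level(IntEnum):
    """Three levels per axis. MID means 'solid' performance or 'medium' potential."""
    LOW = 1
    MID = 2
    HIGH = 3


# Keys are (performance, potential). Names follow the diagram above
# and are an assumption: box names vary by company.
BOX_NAMES = {
    (Level.LOW, Level.HIGH): "diamond in the rough",
    (Level.MID, Level.HIGH): "rising star",
    (Level.HIGH, Level.HIGH): "future leader",
    (Level.LOW, Level.MID): "inconsistent",
    (Level.MID, Level.MID): "core player",
    (Level.HIGH, Level.MID): "high performer",
    (Level.LOW, Level.LOW): "under-performer",
    (Level.MID, Level.LOW): "effective",
    (Level.HIGH, Level.LOW): "trusted expert",
}


def box_for(performance: Level, potential: Level) -> str:
    """Look up the box for two ratings that were made separately."""
    return BOX_NAMES[(performance, potential)]


print(box_for(Level.HIGH, Level.MID))  # high performer
```

The function takes two arguments on purpose: there is no way to call it with a single blended "talent score," which is the discipline the grid is meant to enforce.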

Calibration is the product, not the picture

Here is the reframe that earns the tool its place: the grid you print out is almost worthless. The meeting that produces it is where the value lives. When managers have to place their people on a shared map and defend each placement to peers, the loose, inconsistent way each one rates in private gets corrected against everyone else's standard. One leader's "high potential" turns out to mean "I like them"; the room makes them show their working. That cross-checking (calibration) is what turns a pile of opinions into something closer to a shared judgement.

This matters because the raw inputs are weak. A long line of research on the "idiosyncratic rater effect" in performance ratings, first documented by Scullen, Mount and Goff (Journal of Applied Psychology, 2000), finds that a large share of the variance in ratings reflects the rater, not the person being rated. If ratings of work a manager has actually observed are that noisy, a forecast of potential made alone is unlikely to be any cleaner, which is why placements should never be a solo exercise emailed in. Put managers in a room, or a structured async equivalent, show the whole grid at once, and ask the calibrating question for every name: "What would change my mind about this box?" The grid is the agenda; the argument is the deliverable.


An honest limitation. Calibration improves consistency, but consistency is not accuracy: a room can agree confidently on the wrong answer. Group settings also import their own distortions: the loudest advocate, the halo from one memorable win, the quiet pressure to keep your stars high so your team looks strong. Calibration is a discipline that makes ratings more comparable; it does not make "potential" a measured quantity. Treat the output as a better-informed opinion, not a measurement.
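
One way to arrive at calibration with better questions is to check, before the meeting, whose "high potential" rate sits far from the room's. The sketch below is an illustration only: the manager names, the data and the 0.25 tolerance are all assumptions, and with small teams these shares are noisy.

```python
from collections import defaultdict


def high_potential_share(placements):
    """Return {manager: share of their people rated 'high' on potential}."""
    totals = defaultdict(int)
    highs = defaultdict(int)
    for manager, potential in placements:
        totals[manager] += 1
        if potential == "high":
            highs[manager] += 1
    return {manager: highs[manager] / totals[manager] for manager in totals}


def managers_to_probe(placements, tolerance=0.25):
    """List managers whose high-potential share differs from the overall
    share by more than `tolerance` (an arbitrary, assumed threshold)."""
    placements = list(placements)
    if not placements:
        return []
    overall = sum(1 for _, p in placements if p == "high") / len(placements)
    shares = high_potential_share(placements)
    return sorted(m for m, s in shares.items() if abs(s - overall) > tolerance)


# Illustrative data, not real ratings.
placements = [
    ("manager_a", "high"), ("manager_a", "high"),
    ("manager_a", "high"), ("manager_a", "medium"),
    ("manager_b", "medium"), ("manager_b", "low"),
    ("manager_b", "high"), ("manager_b", "medium"),
    ("manager_c", "medium"), ("manager_c", "medium"),
    ("manager_c", "low"), ("manager_c", "high"),
]

print(managers_to_probe(placements))  # ['manager_a']
```

A flag here is a prompt, not a finding. The manager may simply have a stronger team; the point is that the room asks them to show their working.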

The bias hiding in "potential", and the evidence for it

The forward-looking axis is where the grid does real damage if you are not careful, because "potential" is a judgement about a future that hasn't happened, and such judgements quietly absorb who we expect leaders to be. This is not a hunch; it is one of the cleaner pieces of evidence in the talent literature. In "Potential and the Gender Promotion Gap" (American Economic Review, 2026), economists Alan Benson, Danielle Li and Kelly Shue analysed talent-review data on roughly 29,800 management-track employees at a large North American retailer. Women received, on average, 8.3% lower potential ratings than men, despite scoring higher on current performance. And those potential ratings were doing heavy lifting: a move from "medium" to "high" potential predicted a far bigger jump in promotion odds than the equivalent move on performance. The authors estimate that the gap in potential ratings accounts for roughly 30–50% of the overall gender promotion gap at the firm.

The sting is in what came next: women who were rated low on potential subsequently out-performed the men who had been given the same potential score, yet their potential ratings stayed low. The firm was persistently under-forecasting a whole group, and the grid was the instrument that encoded it. The lesson is blunt: audit your grid's distribution by gender, ethnicity, tenure and team before you act on it. If one group clusters in the low-potential boxes while out-performing on results, your problem is the rating process, not the people. Define "potential" against observable behaviours (learning agility, how someone handles a stretch, evidence of judgement under pressure), not against a vibe of "executive presence" that mostly rewards familiarity.

flowchart LR
  A(["Manager forms a<br/>'potential' judgement"]) --> B{"Anchored on<br/>observed behaviour,<br/>or on familiarity?"}
  B -->|"Behaviour:<br/>agility, stretch, judgement"| C(["Defensible forecast<br/>calibrate & act"])
  B -->|"Familiarity:<br/>'looks like a leader'"| D(["Bias enters the box"])
  D --> E(["Audit distribution by<br/>group; re-anchor on<br/>evidence"])
  E --> C

Potential is a forecast, and forecasts absorb bias unless they are anchored on observable behaviour and checked against the data.
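
The distribution audit described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: ratings are coded 1 to 3 on each axis, the group labels and figures are invented, and the pairwise comparison in `warning_groups` is one simple reading of "clusters in the low-potential boxes while out-performing on results," not a statistical test.

```python
from collections import defaultdict
from statistics import mean


def audit_by_group(records):
    """Summarise each group's size, mean performance (1-3 scale) and the
    share of the group rated low (1) on potential."""
    by_group = defaultdict(list)
    for record in records:
        by_group[record["group"]].append(record)
    return {
        group: {
            "n": len(rows),
            "mean_performance": mean(r["performance"] for r in rows),
            "low_potential_share": sum(r["potential"] == 1 for r in rows) / len(rows),
        }
        for group, rows in by_group.items()
    }


def warning_groups(summary):
    """Groups that out-perform at least one other group on results while
    also being rated low-potential more often than that group."""
    flagged = set()
    for g, a in summary.items():
        for h, b in summary.items():
            if g != h and a["mean_performance"] > b["mean_performance"] \
                    and a["low_potential_share"] > b["low_potential_share"]:
                flagged.add(g)
    return sorted(flagged)


# Illustrative records only; "X" and "Y" are hypothetical group labels.
records = [
    {"group": "X", "performance": 3, "potential": 1},
    {"group": "X", "performance": 3, "potential": 1},
    {"group": "X", "performance": 2, "potential": 2},
    {"group": "X", "performance": 2, "potential": 1},
    {"group": "Y", "performance": 2, "potential": 3},
    {"group": "Y", "performance": 2, "potential": 2},
    {"group": "Y", "performance": 3, "potential": 2},
    {"group": "Y", "performance": 1, "potential": 1},
]

summary = audit_by_group(records)
print(warning_groups(summary))  # ['X']
```

With real data you would want group sizes large enough to mean something, and you would run the same cut by tenure and team as well. A flag says "look at the rating process," nothing more.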

There is a second, quieter validity problem: even setting bias aside, we are not very good at predicting who rises. Harvard Business Review's "Turning Potential into Success" (Fernández-Aráoz, Roscoe & Aramaki, 2017) reports that while 66% of companies run high-potential programmes, only 24% of senior executives at those firms consider them a success, and confidence in the rising leaders coming through has been falling. The honest read: "high potential" is a useful bet, not a guarantee, so place the bet, then keep watching whether it pays off.

A worked example

Take a 40-person engineering org (call it Halden Systems) running its first proper talent review. (Illustrative figures and people throughout; this is a teaching example, not real data.) Two names land in interesting boxes.

Priya is the team's strongest delivery lead: ships on time, mentors juniors, the person everyone routes hard problems to. Her manager pencils her in as high performance / medium potential: "brilliant where she is, but I'm not sure she wants to lead." In calibration, a peer manager pushes back: has anyone actually asked Priya, or stretched her with something beyond her current scope? Nobody has. The honest conclusion is that her potential is untested, not low, so the action isn't a box; it's an experiment. Give her a cross-team initiative this quarter and see how she handles ambiguity she hasn't met before.

Marcus is the opposite trap: confident, polished in front of leadership, and rated high potential on the strength of being "definitely a future director," yet his last two projects slipped and his team's results are middling. Calibration exposes that the rating rests on presence, not evidence. The move is to separate the axes cleanly: solid-but-not-strong performance, and a potential rating that has to be earned with a turnaround, not assumed from polish. The grid didn't make these decisions. It made the room ask the right question about each person ("what is this rating actually based on?"), which is the only thing the grid is good for.

Frequently asked questions

What's the difference between performance and potential?

Performance is backward-looking: how well someone delivers in the role they hold now. Potential is forward-looking: their capacity to grow into bigger, broader or different roles. They often diverge: a top performer can be a poor bet for a step up, and a quiet mid-performer can have a high ceiling in the right role. Keeping the two ratings separate is the entire reason the grid has two axes instead of one.

How do you actually measure potential?

You don't measure it; you forecast it, which is why it's the riskier axis. The most defensible approach anchors the forecast on observable behaviour (learning agility, how someone performs when stretched beyond their comfort, evidence of sound judgement under pressure, drive) rather than on charisma or "executive presence," which tends to reward people who resemble the leaders already in the room. Define your potential criteria in writing before the review, not after.

Is the 9-box grid outdated or biased?

It can be both, if used badly. Peer-reviewed evidence shows potential ratings can carry significant bias: Benson, Li and Shue found women rated lower on potential despite higher performance, a gap they estimate accounts for roughly 30–50% of the firm's gender promotion gap. That is an argument for better criteria, calibration and distribution audits, not necessarily for binning the tool. A grid run as a forcing function for honest conversation is useful; a grid run as a once-a-year sorting hat is harmful.

Should employees see their box?

Sharing the literal grid placement is usually a mistake: "you're a low-potential" is demoralising and treats a noisy forecast as a verdict. What helps the person is the substance behind it: clear feedback on current performance, an honest read on what growth would require, and a concrete next step. Translate the box into a development conversation; don't hand someone a coordinate.

How often should we redo it?

Often enough that boxes stay provisional. An annual one-off lets placements calcify, and people who land low rarely climb out even when they improve. Lighter, more frequent reviews, tied to real development actions and revisited each cycle, keep the grid honest and let it register the change you were hoping to cause in the first place.
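
To check whether boxes are actually staying provisional, you can look for people whose results improved across cycles while their potential rating never moved. A minimal sketch, assuming each person's history is a list of (performance, potential) pairs on a 1 to 3 scale, oldest cycle first; the names and data are hypothetical.

```python
def stale_potential(history, min_cycles=2):
    """Return people whose performance rose from first to last cycle
    while their potential rating stayed identical in every cycle."""
    flagged = []
    for name, cycles in history.items():
        if len(cycles) < min_cycles:
            continue  # not enough history to say anything
        performances = [performance for performance, _ in cycles]
        potentials = {potential for _, potential in cycles}
        improved = performances[-1] > performances[0]
        if improved and len(potentials) == 1:
            flagged.append(name)
    return sorted(flagged)


# Illustrative histories only.
history = {
    "person_a": [(1, 1), (2, 1), (3, 1)],  # improved, potential never revisited
    "person_b": [(2, 2), (2, 3)],          # potential moved, performance flat
    "person_c": [(1, 2), (2, 3)],          # both moved
    "person_d": [(3, 1)],                  # only one cycle on record
}

print(stale_potential(history))  # ['person_a']
```

A name on this list is not proof the rating is wrong, since performance and potential are different questions. It is a reason to ask in the next review whether anyone has looked at that box since it was first filled in.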

Related in the Toolkit

The grid only earns its keep if the talent it sorts was found and grown well to begin with, which is why it sits next to how you recruit and assess talent on the way in, and how you handle career development and succession on the way up.

Employer brand & talent attraction: the grid sorts the people you attracted; a weak top of the funnel limits what any review can find.
Recruiting & assessing talent: the same evidence-over-vibes discipline that should govern a potential rating starts at hiring.
Interviewing & selection (structured, competency-based): structured, behaviour-anchored judgement is exactly what de-biases a potential call too.
Onboarding & ramp: a slow ramp can make a high-potential hire look like a low performer; read early boxes with care.
Career development & succession planning: the grid's whole reason to exist, turning placements into pipelines and next moves.
Leadership styles & models (situational, servant, transformational, adaptive): what "future leader" even means depends on the leadership model you're betting on.
People analytics & workforce metrics: auditing grid distribution by group is a people-analytics job, not a gut-feel one.
Diversity, equity & inclusion: the potential axis is where inequity hides; DEI rigour is how you catch it.

Where to go next

"Potential and the Gender Promotion Gap," Benson, Li & Shue (American Economic Review, 2026): the peer-reviewed study showing how potential ratings carry bias; the most important single read on this page.
"Turning Potential into Success," Fernández-Aráoz, Roscoe & Aramaki (HBR, 2017): why high-potential programmes so often disappoint, and what a more scientific approach to potential looks like.
"9 Box Grid: How To Use It for Talent Reviews," AIHR: a clear, practical walkthrough of running a grid, defining the axes, and avoiding the common traps.
"The 9 Box Grid in Talent Management Explained" (YouTube): a short visual explainer of the nine boxes and how teams use them, if you prefer to watch the model laid out.