Algorithmic bias, explainability & model risk

A bank's credit model quietly rejects more applicants from one postcode than another. Nobody coded that rule; nobody can point to where it lives. The model learned it from history, and now it's repeating history back to you at scale and at speed. That is the whole problem in one sentence: a model doesn't invent fairness or fault; it inherits both from its data and then applies them faster.

The quick version

Bias is usually inherited, not coded. Models learn patterns from past data, so they reproduce the inequities baked into that data, even when no protected attribute is used directly.
"Fair" has competing definitions, and you can't have them all. A landmark proof shows that several reasonable notions of fairness are mathematically incompatible, so you must choose which one matters here, on purpose.
Explainability is a property you decide on, not a feature you bolt on. For high-stakes calls, an inherently interpretable model often beats a black box plus an after-the-fact explanation.
Model risk is governance, not data science. Banking regulators have managed this for over a decade: every model needs an owner, independent validation, and limits on where it's allowed to decide.

The idea in depth

Three ideas sit underneath "responsible AI," and leaders tend to collapse them into one vague worry. They're separate: bias is whether the model is systematically unfair; explainability is whether anyone can say why it decided what it decided; model risk is the discipline that keeps the first two from hurting you. A model can be accurate, opaque, and ungoverned all at once, and that combination is where organisations get burned.

Bias is a data inheritance, not a coding error

The cleanest demonstration is Joy Buolamwini and Timnit Gebru's 2018 study Gender Shades, presented at the Conference on Fairness, Accountability and Transparency. They tested three commercial gender-classification systems and found error rates of up to 34.7% for darker-skinned women, against at most 0.8% for lighter-skinned men. The systems weren't malicious. The study found that widely used face benchmark datasets were overwhelmingly lighter-skinned, which points to the likely cause: the systems had learned faces like the ones they'd mostly seen.

The same mechanism shows up in hiring. In 2018 Reuters reported that Amazon had scrapped an experimental recruiting tool after discovering it downgraded résumés containing the word "women's" (as in "women's chess club captain"), because it had been trained on a decade of mostly male tech résumés and dutifully learned that male-coded language predicted "good hire." Crucially, the engineers never told it to prefer men. It inferred that from the past.

The shift to make: stop asking "did we use a protected attribute?" and start asking "what does our training data encode about a world we no longer want to reproduce?" Dropping the gender field doesn't help if a dozen other fields quietly stand in for it, a phenomenon called proxy bias. This is also why data governance, quality and lineage is not a back-office concern: if you can't trace where your training data came from, you can't reason about what it's teaching the model.
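A quick way to see proxy bias is to ask how well a field you kept can guess the field you dropped. The sketch below is illustrative only: the records are hand-made, and the field names (postcode, group) and the helper functions are hypothetical. It compares a blind guess of the protected attribute with a guess made from postcode alone, measured on the same records it learned from.

```python
from collections import Counter, defaultdict

def baseline_accuracy(rows, protected_field):
    """Accuracy of always guessing the single most common protected value."""
    overall = Counter(row[protected_field] for row in rows)
    return overall.most_common(1)[0][1] / len(rows)

def proxy_accuracy(rows, proxy_field, protected_field):
    """Accuracy of guessing the protected value from the proxy alone,
    by predicting the most common protected value seen for each proxy value."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[proxy_field]][row[protected_field]] += 1
    correct = sum(c.most_common(1)[0][1] for c in counts.values())
    return correct / len(rows)

# Hypothetical records: postcode is a model input, group is not.
applicants = (
    [{"postcode": "N1", "group": "A"}] * 45
    + [{"postcode": "N1", "group": "B"}] * 5
    + [{"postcode": "S9", "group": "A"}] * 10
    + [{"postcode": "S9", "group": "B"}] * 40
)

baseline = baseline_accuracy(applicants, "group")
with_proxy = proxy_accuracy(applicants, "postcode", "group")
print(f"blind guess: {baseline:.2f}, guess from postcode: {with_proxy:.2f}")
# prints "blind guess: 0.55, guess from postcode: 0.85"
```

In this made-up data, postcode lifts the guess from 55% to 85%, so a model that sees postcode effectively sees most of the attribute that was removed.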

flowchart LR
  A("Historical data<br/>(reflects past inequity)") --> B("Model learns<br/>the patterns")
  B --> C("Predictions at<br/>scale and speed")
  C --> D(["Decisions reinforce<br/>the original pattern"])
  D -.->|"feeds new data"| A

Bias is a feedback loop, not a one-off bug: yesterday's decisions become tomorrow's training data. Leaders Loop

You cannot be "fair" in every sense at once

Here is the result most leaders haven't heard, and it changes how you should run the conversation. In 2016, computer scientists Jon Kleinberg, Sendhil Mullainathan and Manish Raghavan published Inherent Trade-Offs in the Fair Determination of Risk Scores, proving that a risk score cannot simultaneously satisfy several intuitive fairness conditions except in special cases: when the score predicts perfectly, or when the groups have identical underlying rates of the outcome. Loosely: you can make a score equally calibrated across groups (a "70% risk" means the same thing for everyone), or equally balanced in its errors across groups (the same false-positive rate for everyone), but you generally cannot have both at once.

This isn't abstract. It's the heart of the 2016 ProPublica investigation into COMPAS, a US criminal-justice risk tool. ProPublica's reporters (Julia Angwin and colleagues) found Black defendants were almost twice as likely to be wrongly flagged as high-risk. The vendor, Northpointe, countered that the tool was equally calibrated across races. The unsettling answer from the maths is that both can be true at the same time: the two sides were measuring different, irreconcilable definitions of "fair."
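A small numeric sketch makes the conflict concrete. The numbers below are invented for illustration and are not COMPAS data; the group names, the two score buckets and the 0.5 flagging threshold are all assumptions. Both groups get a score that is perfectly calibrated, yet because the outcome is more common in one group, the false-positive rates come out different.

```python
def is_calibrated(buckets):
    """True if, in every score bucket, the observed outcome rate equals the score."""
    return all(abs(pos / n - score) < 1e-9 for score, (n, pos) in buckets.items())

def false_positive_rate(buckets, threshold=0.5):
    """Share of people who did NOT have the outcome but were flagged high-risk."""
    flagged_negatives = sum(
        n - pos for score, (n, pos) in buckets.items() if score >= threshold
    )
    all_negatives = sum(n - pos for n, pos in buckets.values())
    return flagged_negatives / all_negatives

# score -> (people given that score, people who actually had the outcome)
group_a = {0.8: (50, 40), 0.2: (50, 10)}  # outcome rate 50 in 100
group_b = {0.8: (20, 16), 0.2: (80, 16)}  # outcome rate 32 in 100

fpr_a = false_positive_rate(group_a)
fpr_b = false_positive_rate(group_b)
print(is_calibrated(group_a), is_calibrated(group_b))  # prints "True True"
print(f"FPR A: {fpr_a:.3f}, FPR B: {fpr_b:.3f}")       # prints "FPR A: 0.200, FPR B: 0.059"
```

A score of 0.8 means the same thing in both groups, and still a person in group A who would not have had the outcome is more than three times as likely to be flagged. Equalising those error rates would require breaking calibration, which is the trade-off the theorem describes.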

"Fairness" is not one thing you optimise. It's a set of trade-offs you choose between, and the choice is a value judgement, not a calculation.

What to do with that: make the trade-off a documented, accountable decision rather than an accident. Before you deploy a scoring model, name which fairness property matters most for this use and why, then have a non-technical owner sign off on it. The honest limitation is that no setting makes all stakeholders agree, because the disagreement is about values, not statistics. The US National Institute of Standards and Technology makes the same point in its 2022 publication on AI bias (NIST SP 1270), framing bias as a socio-technical problem (part data, part process, part human) that no purely technical fix resolves.

Explainability: decide it up front, don't bolt it on

When a model declines someone, "the algorithm said so" is not an answer a regulator, a customer, or a tribunal will accept. Two roads lead out of this. The popular one is post-hoc explanation: tools like LIME and SHAP that approximate, after the fact, which inputs pushed a black-box model's decision. They're useful, but they're an approximation of the model's reasoning, not the reasoning itself.

The other road is argued forcefully by Duke computer scientist Cynthia Rudin in her 2019 Nature Machine Intelligence paper, bluntly titled Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead. Her case: for consequential decisions, prefer a model that is inherently interpretable, one a human can actually follow, rather than an opaque one wrapped in an explanation you can't fully trust. And, she notes, the accuracy you supposedly sacrifice is often smaller than assumed.
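To make "a model a human can actually follow" tangible, here is a minimal sketch of a points-based scorecard. It is not taken from Rudin's paper: every rule, point value, field name and the approval cut-off are hypothetical. The point is only that each decline carries its reasons with it, because the reasons are the model.

```python
# Hypothetical scorecard: rules, points and cut-off are made up for illustration.
RULES = [
    # (reason shown when the rule is failed, check on the application, points when passed)
    ("Business has traded for less than 2 years",
     lambda a: a["years_trading"] >= 2, 30),
    ("Monthly revenue is below 3x the monthly repayment",
     lambda a: a["monthly_revenue"] >= 3 * a["monthly_repayment"], 40),
    ("A payment was missed in the last 12 months",
     lambda a: a["missed_payments_12m"] == 0, 30),
]
APPROVE_AT = 70  # assumed cut-off

def decide(application):
    """Score an application and list the plain-English reasons for any decline."""
    score = 0
    failed = []
    for reason, passed, points in RULES:
        if passed(application):
            score += points
        else:
            failed.append(reason)
    approved = score >= APPROVE_AT
    return {"approved": approved, "score": score, "reasons": [] if approved else failed}

result = decide({
    "years_trading": 1,
    "monthly_revenue": 5000,
    "monthly_repayment": 2000,
    "missed_payments_12m": 0,
})
print(result["approved"], result["score"])  # prints "False 30"
for reason in result["reasons"]:
    print("-", reason)
```

Anyone can audit this line by line, and the applicant hears exactly what would change the outcome. Whether a model this simple is accurate enough is an empirical question for each use case.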

The practical rule: classify the decision before you pick the model. For high-stakes, individual-level, contestable decisions (credit, hiring, eligibility), lean toward interpretable models, or be ready to defend why a black box is justified. This connects directly to the difference between probabilistic and deterministic systems: a model gives you a likelihood, not a rule, and a leader needs to know which kind of answer a given decision actually requires.

flowchart TD
  A(["Is the decision high-stakes,<br/>individual, and contestable?"]) -->|"Yes"| B("Prefer an interpretable model<br/>you can defend line-by-line")
  A -->|"No"| C("A black box may be fine,<br/>but monitor it for drift")
  B --> D("Document the reason given<br/>to each affected person")
  C --> D

Explainability is a design decision made before you choose the model, keyed to how much the decision matters.

Model risk is a governance problem with a track record

The good news: someone already wrote the playbook. After models contributed to the 2008 financial crisis, the US Federal Reserve and the OCC issued SR 11-7: Supervisory Guidance on Model Risk Management in 2011. Strip out the banking specifics and its principles are exactly what any organisation deploying AI needs: every model has a named owner; every material model gets independent validation by someone who didn't build it; and the whole thing sits inside governance with documentation, monitoring and clear limits on use. SR 11-7 also draws a distinction worth remembering: model risk comes both from a model being wrong and from a model being used wrongly. The second is the one leaders cause.

So treat any model that makes or shapes decisions as a governed asset, not a clever tool. Give it an owner, a validator who isn't its author, and a written boundary for where it is, and isn't, allowed to decide. Deciding where a model may and may not act is itself a reversible-vs-irreversible decision question: automate the cheap-to-reverse calls, keep a human in the loop on the ones you can't take back.
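The owner, validator and written boundary described above can be captured in something as small as a registry record. The sketch below is one possible shape, not anything prescribed by SR 11-7: the class, its field names, the routing function and the loan limit are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelRecord:
    name: str
    owner: str                # accountable business owner
    built_by: str             # team that developed the model
    validator: str            # independent reviewer
    auto_decide_limit: float  # largest loan the model may decide unaided

    def __post_init__(self):
        # Independent validation: the validator must not be the author.
        if self.validator == self.built_by:
            raise ValueError("validator must be independent of the team that built the model")

def route(record, loan_amount, model_recommendation):
    """Return who makes the final call for this loan."""
    if loan_amount <= record.auto_decide_limit:
        return f"auto: {model_recommendation}"
    return f"human review (model recommends {model_recommendation})"

record = ModelRecord(
    name="sme-preapproval",
    owner="Head of Lending",
    built_by="Data Science",
    validator="Model Risk",
    auto_decide_limit=25_000,
)
print(route(record, 10_000, "approve"))  # prints "auto: approve"
print(route(record, 80_000, "approve"))  # prints "human review (model recommends approve)"
```

The limit encodes the reversibility judgement: small, easily corrected calls are automated, and anything above the line reaches a person with the model's output as a recommendation only.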

A worked example

Picture a mid-sized lender rolling out an AI model to pre-approve small-business loans. The pilot looks great: faster decisions, lower default rates. Then a journalist asks why approval rates differ sharply by neighbourhood.

The data team is baffled: they never used race or postcode. But the model used "average account balance" and "years at current address," both of which correlate with the very thing they thought they'd excluded. That's proxy bias. When pressed for reasons, all they can produce is a SHAP chart that even they find hard to translate into a sentence a customer would accept.

A team that had done the toolkit work would have moved differently. They would have chosen a fairness definition up front (say, equal false-rejection rates across neighbourhoods) and measured against it (illustrative figures: roughly 6% false rejections in one area but 14% in another, a gap that should have stopped the launch). They would have preferred an interpretable scoring model so every decline came with a plain-English reason. And they would have named an owner and an independent validator with authority to keep the model in "recommend, human approves" mode above a loan threshold. None of that is exotic; it's the three ideas in this guide, applied before launch rather than after the headline.
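Measuring against the chosen definition is a few lines of work once the data exists. The sketch below reproduces the illustrative 6% and 14% figures with made-up records; the neighbourhood names, the 5-point maximum gap and the function names are assumptions. It also assumes the true repayment outcome is known for every applicant, which in practice it usually is not for people who were rejected.

```python
from collections import defaultdict

def false_rejection_rates(records):
    """Per group: share of applicants who would have repaid but were rejected."""
    creditworthy = defaultdict(int)
    wrongly_rejected = defaultdict(int)
    for group, would_repay, approved in records:
        if would_repay:
            creditworthy[group] += 1
            if not approved:
                wrongly_rejected[group] += 1
    return {g: wrongly_rejected[g] / creditworthy[g] for g in creditworthy}

def launch_blocked(rates, max_gap):
    """True if the spread between best and worst group exceeds the agreed gap."""
    return max(rates.values()) - min(rates.values()) > max_gap

# Hypothetical pilot records: (neighbourhood, would have repaid, was approved)
records = (
    [("north", True, True)] * 94 + [("north", True, False)] * 6
    + [("north", False, False)] * 20
    + [("east", True, True)] * 86 + [("east", True, False)] * 14
    + [("east", False, False)] * 20
)

rates = false_rejection_rates(records)
print({g: round(r, 2) for g, r in rates.items()})  # prints "{'north': 0.06, 'east': 0.14}"
print(launch_blocked(rates, max_gap=0.05))         # prints "True"
```

The maximum acceptable gap is not something the code can supply. It is the value judgement the owner signs off on before the pilot starts.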

Frequently asked questions

If we don't collect race or gender, are we safe from bias?

No, and this is the most common and most expensive misconception. Models reconstruct protected attributes from proxies: postcode, name, school, shopping patterns, language style. Removing the sensitive field can even make bias harder to detect, because you lose the data needed to measure whether outcomes differ across groups. You often need to keep the attribute to test for fairness, even if you never feed it to the model.

Can't we just buy a "fairness" tool that fixes this?

Tools help you measure and mitigate, but they can't resolve the underlying value choice. Because several fairness definitions are mathematically incompatible (Kleinberg et al., 2016), any tool has already picked a definition for you, usually in its defaults. The real work is human: deciding which definition fits your context, and owning that decision.

What's the difference between explainability and interpretability?

A useful distinction: an interpretable model is transparent by construction, so a human can follow its logic directly. Explainability usually means generating an after-the-fact account of an opaque model's behaviour (via tools like LIME or SHAP). The first gives you the reasoning; the second gives you an approximation of it. For high-stakes decisions, Rudin (2019) argues you should prefer the first.

We're not a bank. Does model risk governance really apply to us?

The principles do, even if the regulation doesn't. SR 11-7's logic (an owner, independent validation, limits on use) is operational hygiene for anything that decides automatically. The cost of skipping it isn't a fine; it's a model quietly making bad calls for months before anyone notices.

Who should own a model's fairness, data science or the business?

The business. Data scientists can surface the trade-offs and measure them, but choosing which fairness property matters, and accepting the residual risk, is a leadership decision with legal and reputational weight. If your only sign-off on a model's fairness comes from the team that built it, you don't have governance; you're marking your own homework.

Related in the Toolkit

Machine learning concepts & utility: how models learn from data in the first place, which is where inherited bias enters.
AI capabilities & limits (LLMs, generative AI, agents): the same bias and opacity issues, amplified in generative systems.
Probabilistic vs deterministic systems: why a model gives you a likelihood, not a rule, and when that distinction matters.
Data strategy & data as an asset: your training data is the model's worldview; treat it accordingly.
Data governance, quality, lineage & stewardship: you can't reason about bias you can't trace to its source.
First principles vs heuristics vs analogical reasoning: interrogating a model's logic instead of accepting its output.
Reversible vs irreversible decisions: where to automate and where to keep a human in the loop.
Jobs-to-be-Done & needs research: understanding who a model affects, not just how accurate it is.

Where to go next

Weapons of Math Destruction, Cathy O'Neil (Crown, 2016): the seminal popular book on how opaque models scale inequality; readable in a weekend, hard to unsee.
Stop Explaining Black Box Models…, Cynthia Rudin (Nature Machine Intelligence, 2019): the sharpest argument for interpretable-by-design over after-the-fact explanation.
Machine Bias, ProPublica (2016): the COMPAS investigation that made the fairness trade-off real for a general audience.
How I'm fighting bias in algorithms, Joy Buolamwini (TED talk): a short, vivid talk on the "coded gaze" from the researcher behind Gender Shades.
Towards a Standard for Identifying and Managing Bias in Artificial Intelligence, NIST SP 1270 (2022): the practical, vendor-neutral framework that treats bias as socio-technical.