A vendor walks into your office and says their model is "94% accurate." A good leader's next question isn't "how do I deploy it?" It's "accurate at what, measured how, on data that looked like what?" That instinct, to interrogate the task, the measure, and the data behind the number, is most of what you need to lead well in the age of machine learning. The rest is detail.

The quick version

  • Machine learning (ML) learns from examples instead of being told the rules. You give it lots of past data; it finds patterns that help it predict or classify new cases.
  • It produces guesses, not certainties. Every output is a probability dressed up as an answer, which is why it shines at fuzzy problems and fails quietly at exact ones.
  • It's only as good as its data and its measure. The same model can be brilliant in the demo and useless in your business if your data differs from what it was trained on, or your goal differs from what it was scored on.
  • Your job isn't to build it, it's to frame it. Pick problems where being usually-right is valuable, define what "right" means, and watch for where it drifts.

The idea in depth: learning from examples, not rules

Traditional software is a set of rules a human wrote down: if the invoice is over £10,000, route it to a manager. Machine learning flips that. Instead of writing the rules, you show the system thousands of examples (invoices that turned out to be fraudulent and invoices that didn't) and it works out the patterns itself. The term goes back to Arthur Samuel at IBM, whose 1959 paper described a checkers program that learned to play better than he could. He is widely credited with describing the field as giving computers "the ability to learn without being explicitly programmed" (Samuel, IBM Journal of Research and Development, 1959).

The cleanest definition came later, from Carnegie Mellon's Tom Mitchell in his 1997 textbook Machine Learning: a program learns from experience E with respect to a task T and a performance measure P if its performance at T, as measured by P, improves with E. That sounds abstract until you notice it's a checklist. The move: before you approve any ML project, force the three letters onto the table. What's the task (predict churn? flag fraud? rank CVs?), what's the measure (precision? revenue retained? false-alarm rate?), and what experience, what data, is it learning from? A project that can't answer all three crisply isn't ready, however good the demo looked.

flowchart LR
  A(["Past examples<br/>(labelled data)"]) --> B("Training:<br/>find the patterns")
  B --> C(["A model<br/>(the learned patterns)"])
  D(["A new, unseen case"]) --> C
  C --> E(["A prediction<br/>+ a confidence"])
  E --> F("Outcome happens →<br/>feed it back as new data")
  F -.-> A
The basic loop: a model is trained on past examples, then makes confidence-weighted guesses about new ones, and real outcomes become tomorrow's training data. (Leaders Loop)

Why it's always a probability, and why that's the point

An ML model never knows anything. It estimates. Ask it whether an image contains a cat and underneath the tidy "yes" is something like "0.91 probability of cat." This is the single most useful thing for a leader to internalise, and it's why machine learning belongs in the family of probabilistic rather than deterministic systems. A payroll calculation must be exactly right every time; a spam filter only has to be right often enough to be worth more than it costs in mistakes. ML is the wrong tool for the first job and the right tool for the second.

Because a model generalises from a finite set of examples rather than from known rules, it has a central failure mode, described by Pedro Domingos (University of Washington) in his widely read 2012 paper A Few Useful Things to Know about Machine Learning: overfitting. A model can memorise the quirks and noise of its training data so well that it looks superb on the data it has already seen and then falls apart on real-world data it has never seen. Domingos frames the underlying trade-off as bias versus variance: a model too simple to capture the real pattern (high bias) versus one so flexible that it learns the noise (high variance) (Domingos, Communications of the ACM, 2012). The practical guard-rail: never accept a performance number measured on the same data the model was trained on. Insist that it was tested on data the model has never seen, ideally from a different time period or region, because that is the number most likely to predict how it behaves in production.
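
To make the guard-rail concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the data is synthetic, 20% of labels are deliberately flipped to act as noise, and the "model" is a simple memoriser (a 1-nearest-neighbour lookup) rather than anything a vendor would ship.

```python
import random

def make_data(n, rng, noise=0.2):
    """Hypothetical data: one feature x in [0, 1). The true label is x > 0.5,
    but a fraction `noise` of labels are flipped at random."""
    rows = []
    for _ in range(n):
        x = rng.random()
        label = x > 0.5
        if rng.random() < noise:
            label = not label
        rows.append((x, label))
    return rows

def memoriser_predict(train, x):
    """1-nearest-neighbour: copy the label of the closest training example."""
    nearest = min(train, key=lambda row: abs(row[0] - x))
    return nearest[1]

def accuracy(train, rows):
    """Share of `rows` whose label the memoriser reproduces."""
    hits = sum(memoriser_predict(train, x) == label for x, label in rows)
    return hits / len(rows)

rng = random.Random(0)          # fixed seed so the run is repeatable
train = make_data(300, rng)
held_out = make_data(300, rng)  # same process, never shown to the model

train_acc = accuracy(train, train)
held_out_acc = accuracy(train, held_out)
print(f"measured on training data: {train_acc:.0%}")
print(f"measured on held-out data: {held_out_acc:.0%}")
```

The memoriser scores perfectly on the data it trained on because every example is its own nearest neighbour, noise included. The held-out figure comes out lower, and it is the only one of the two that says anything about new cases.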

"A model that's perfect on the test it studied for tells you nothing about the exam it hasn't sat."

flowchart TB
  A(["Same data the model<br/>trained on"]) --> B{"Where was the<br/>'94% accurate'<br/>measured?"}
  B -->|"On training data"| C(["Almost meaningless,<br/>could be overfitting"])
  B -->|"On fresh, held-out data"| D(["Trustworthy signal"])
  B -->|"On data from a new<br/>time / market"| E(["The number that<br/>survives contact<br/>with reality"])
The same accuracy figure means three very different things depending on what it was measured against. Always ask.

More data often beats a cleverer model, but not always

There's a counter-intuitive result worth carrying into vendor conversations. In 2009, three Google researchers, Alon Halevy, Peter Norvig and Fernando Pereira, published The Unreasonable Effectiveness of Data, arguing that for many messy, human problems, throwing more data at a simple model beats engineering a sophisticated one (Halevy, Norvig & Pereira, IEEE Intelligent Systems, 2009). This is a large part of why proprietary data has become a genuine competitive asset (see data strategy & data as an asset). So the move is: when you weigh an ML investment, audit the data you'd be feeding it as seriously as the algorithm. Whoever has more relevant, well-labelled examples usually wins.

The honest limitation: data is not a substitute for thinking. Domingos is blunt that "data alone is not enough": every learner smuggles in assumptions, and more data can't fix a model that's learning the wrong thing. Worse, if your historical data encodes past bias, more of it just teaches the bias more confidently. A hiring model trained on a decade of who you used to promote will faithfully reproduce yesterday's blind spots. That's not a bug you patch later; it's a property of learning from the past, and it's why bias, explainability and model risk are governance questions, not just technical ones.

A worked example: the support-ticket triage model

Picture a head of customer operations, call her Priya, drowning in inbound support tickets. A vendor pitches a model that auto-routes each ticket to the right team and flags the urgent ones. The demo is dazzling: on the vendor's sample, it sorts tickets with what they call 92% accuracy (an illustrative figure).

Priya runs the three-letter test. Task: route tickets and flag urgency, fine. Measure: here she pushes. "Accuracy" turns out to mean the overall share of tickets handled correctly, but 80% of her tickets are routine, so a model that just guessed "routine" every time would score 80% and miss every emergency. The numbers she actually cares about are how many genuinely urgent tickets it catches (recall on the urgent class) and how many of its urgent flags turn out to be false alarms (the flip side of precision). Experience: the model was trained on the vendor's other clients, whose product and customers differ from hers.
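
Priya's objection is plain arithmetic, and a short Python sketch shows it. The ticket counts are hypothetical (1,000 tickets, 800 routine and 200 urgent, to match the 80% figure), and treating every non-routine ticket as urgent is a simplifying assumption.

```python
def metrics(actual, predicted, positive="urgent"):
    """Accuracy overall, plus recall and precision for the `positive` class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # caught
    fp = sum(a != positive and p == positive for a, p in pairs)  # false alarms
    fn = sum(a == positive and p != positive for a, p in pairs)  # missed
    correct = sum(a == p for a, p in pairs)
    return {
        "accuracy": correct / len(pairs),
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }

# Hypothetical quarter: 80% routine, 20% urgent.
actual = ["routine"] * 800 + ["urgent"] * 200

# A "model" that never flags anything as urgent.
always_routine = ["routine"] * 1000

baseline = metrics(actual, always_routine)
print(baseline)
```

The do-nothing baseline reaches 0.8 accuracy with a recall of 0.0 on urgent tickets. Any accuracy figure a vendor quotes has to be read against that baseline, not against zero.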

So she doesn't sign. She asks for a two-week pilot on her last quarter's tickets, measured on urgent-ticket catch-rate and false-alarm rate, with the test tickets withheld from training. The model catches 70% of urgent tickets at a 15% false-alarm rate (illustrative). That's not the headline 92%, but now Priya can make a real decision: route the confident cases automatically, send the low-confidence ones to a human, and treat the model as a tireless first-pass filter rather than an oracle. This is a textbook reversible decision: pilot small, keep a human in the loop, expand only once the real numbers hold. The probabilistic tool earns a probabilistic role.
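
The "confident cases go through, uncertain ones go to a human" rule can be sketched in a few lines of Python. The names here (Prediction, route, the team labels, the 0.85 threshold) are hypothetical; in practice the threshold would be set from the pilot's own numbers.

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    ticket_id: str
    team: str          # the team the model thinks should handle the ticket
    confidence: float  # the model's estimated probability, 0.0 to 1.0

def route(pred, threshold=0.85):
    """Auto-route confident predictions; send the rest to a human reviewer."""
    if pred.confidence >= threshold:
        return pred.team
    return "human_review"

# Illustrative predictions, not real pilot output.
predictions = [
    Prediction("T-1", "billing", 0.97),
    Prediction("T-2", "outages", 0.62),
    Prediction("T-3", "accounts", 0.85),
]

queues = {p.ticket_id: route(p) for p in predictions}
print(queues)
```

Raising the threshold sends more tickets to people and fewer mistakes to customers; lowering it does the reverse. That dial is a business decision about the cost of errors, which is why it belongs to Priya rather than to the vendor.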

Frequently asked questions

Is machine learning the same as AI?

Not quite. "Artificial intelligence" is the broad ambition of getting machines to do things that seem intelligent. Machine learning is the most successful current method for doing that, learning patterns from data. Today's large language models and generative tools are ML systems of a particular kind; for how those differ in practice, see AI capabilities & limits.

How much data do I actually need?

There's no universal number; it depends on how complex the pattern is and how varied your cases are. The honest answer is that data quality and relevance usually matter more than raw volume: a thousand well-labelled examples that look like your real cases beat a million noisy, off-target ones. Start by asking whether you can even get clean labelled examples of the thing you want to predict; if you can't, ML is premature.

Can we just trust the accuracy number a vendor gives us?

No, and treating it as the only number is the most common ML buying mistake. Ask what was measured, on what data, and whether that data was held out from training. For imbalanced problems (fraud, urgent tickets, rare faults), a high overall accuracy can hide total failure on the cases you care about. Demand the metric that maps to your actual goal.

Why does a model that worked last year stop working?

Because the world it learned from moves on, a phenomenon practitioners call drift. Customer behaviour shifts, a competitor changes the market, a product launches. The model is still faithfully predicting the old world. ML isn't "set and forget"; it needs monitoring and periodic retraining, and that ongoing cost belongs in the business case from day one.
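
One simple form of monitoring is to compare recent hit-rate against the figure measured at launch. The Python sketch below assumes you log, for each prediction, whether it matched what actually happened; the function name, window size and tolerance are hypothetical, and real monitoring often also tracks shifts in the incoming data itself.

```python
def needs_review(outcomes, baseline_accuracy, window=100, tolerance=0.05):
    """outcomes: list of booleans, oldest first, True where the model's
    prediction matched what actually happened. Returns True when accuracy
    over the most recent `window` outcomes falls more than `tolerance`
    below the accuracy measured at launch."""
    recent = outcomes[-window:]
    if len(recent) < window:
        return False  # not enough evidence yet
    recent_accuracy = sum(recent) / len(recent)
    return recent_accuracy < baseline_accuracy - tolerance

# Illustrative logs: 90% right at launch, 75% right a year later.
last_year = [True] * 90 + [False] * 10
this_year = [True] * 75 + [False] * 25

print(needs_review(last_year, baseline_accuracy=0.90))
print(needs_review(this_year, baseline_accuracy=0.90))
```

The first check stays quiet and the second raises the flag. The point for the business case is that someone has to own this check, and the retraining it triggers, for as long as the model is in use.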

Should I build a model or buy one?

Buy when the problem is generic (transcription, translation, common-object detection) and someone has already trained a strong model on far more data than you have. Lean toward building, or fine-tuning on your own data, when the edge comes from data only you possess. The decision usually turns on the data question, not the algorithm one.
