Ranking ML Models: Why It’s Hard and How to Do It Right in Company Planning

Machine learning has become an invisible force behind many of the world’s most impactful products. At big tech companies, ML models power everything from recommendation engines to fraud prevention to ad delivery, shaping the user experience and driving billions in revenue. But one part of the process consistently lags behind the sophistication of the models themselves: how companies decide which models to prioritize.

Every quarter, product managers, ML leads, and engineering directors gather to answer a deceptively simple question: Which ML models should we invest in next? The stakes are high. Resources are limited. And yet, despite all our analytical firepower, this planning exercise often feels more like gut instinct than scientific decision-making.

Why? Because ranking ML models is hard—much harder than it looks.

The Illusion of Objectivity

Imagine you’re looking at a list of proposed ML models. One claims it can improve click-through rate by 5%. Another says it will increase install conversion by 3%. A third model doesn’t touch front-end metrics at all but promises to improve fill rate, ensuring more advertiser dollars are spent. Which do you choose?

If you’re like most people, you look for the biggest number. Or maybe you lean toward the domain you know best. Maybe you favor the model that’s easiest to build or the one with the most enthusiastic champion. All of these are human tendencies—and they’re understandable. But they don’t lead to consistently good outcomes.

The truth is that comparing ML models is like comparing apples to oranges… to pineapples. Different models aim to optimize different metrics, serve different parts of the business, and carry wildly different levels of risk, uncertainty, and effort. What looks like a 5% improvement on paper may fizzle out in production. A model that’s easy to build might do little for the bottom line. And a model that could generate $10 million in ROI may be locked behind six months of complex engineering work.

So how do we make sense of this chaos?

The Real Challenge: Multi-Dimensional Tradeoffs

To prioritize ML models wisely, you have to evaluate them on multiple axes:

  • What domain does the model operate in? (e.g., fill rate, CTR, CVR, LTV)
  • What business metric does it affect?
  • How confident are we in its projected lift?
  • How much revenue could it realistically drive?
  • How much effort and time will it take to build?
  • What are the risks—technical, legal, or organizational?
  • Is the model ready for experimentation, or still just a theoretical idea?

Most companies don’t have a consistent way to answer these questions. Some teams use spreadsheets. Others rely on back-of-the-napkin math. A few attempt formal ROI calculations—but often lack the inputs to do them well. And almost nobody has a cross-org framework that connects models, metrics, confidence, and effort into a unified prioritization engine.

That’s what we need to fix.

A Better Way: The Structured ML Model Framework

A more thoughtful approach begins with one mindset shift: treat model ranking not as a guessing game, but as a disciplined product prioritization exercise.

Start by defining your domains. Every company has its unique ML pillars—maybe it’s engagement, monetization, trust & safety, or personalization. Within each domain, capture the proposed models, along with a structured set of attributes:

  • Model name and description
  • Primary metric impacted (e.g., CTR, fill rate, ROAS)
  • Estimated lift, based on past experiments or expert judgment
  • Estimated annualized ROI, in dollars
  • Confidence level, based on data maturity and experimentation history
  • Engineering effort, ideally sized in relative story points
  • Offline model performance, using AUC, F1, or log loss
  • A/B test readiness, i.e., can it be safely deployed?
  • Retraining needs, ownership, and last updated timestamp
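
To make this concrete, here is a minimal sketch of what one registry entry could look like as a Python dataclass. The field names and types are illustrative assumptions, not a prescribed schema; adapt them to whatever attributes your teams actually track.

```python
# A minimal sketch of one model-registry entry. Field names are illustrative
# assumptions, not a required schema.
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class ModelProposal:
    name: str
    description: str
    domain: str                  # e.g., "monetization", "engagement", "trust & safety"
    primary_metric: str          # e.g., "CTR", "fill rate", "ROAS"
    estimated_lift_pct: float    # projected relative lift, from past experiments or expert judgment
    estimated_roi_usd: float     # estimated annualized ROI, in dollars
    confidence: float            # 0 to 1, based on data maturity and experimentation history
    effort_points: int           # relative story points for the engineering work
    offline_score: float         # e.g., AUC, F1, or a normalized log-loss-based score
    ab_test_ready: bool          # can it be safely deployed behind an experiment?
    owner: str
    last_updated: date
    retraining_cadence: Optional[str] = None  # e.g., "weekly", "monthly"
```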

Once this data is in place, use a transparent scoring formula. It doesn’t have to be perfect—it just has to be better than your gut. A simple version might look like:

Score = (ROI * Confidence Weight * Offline Score) / Effort Weight

This formula ensures that models with high expected value and strong validation float to the top—unless they’re prohibitively expensive. Models that are easy to build but low in potential still get considered. And high-risk, high-effort models remain in view but are harder to justify.
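
As a rough sketch, that formula translates into a few lines of Python, reusing the hypothetical ModelProposal fields from the earlier sketch. The minimum-effort guard and the lack of any normalization are simplifying assumptions.

```python
# A minimal sketch of the scoring formula above, assuming the ModelProposal
# dataclass from the earlier sketch is in scope. Weights and scales are illustrative.
def score(model: ModelProposal) -> float:
    # Guard against zero-effort entries; every model costs at least one point.
    effort = max(model.effort_points, 1)
    return (model.estimated_roi_usd * model.confidence * model.offline_score) / effort

def rank(proposals: list[ModelProposal]) -> list[ModelProposal]:
    # Highest score first: high expected value and strong validation float to the top,
    # unless the effort denominator makes them prohibitively expensive.
    return sorted(proposals, key=score, reverse=True)
```

In practice you would likely normalize ROI and effort onto comparable scales before multiplying, so that a single enormous dollar estimate doesn't swamp every other consideration.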

From Framework to Culture

But the framework is just the beginning. To make it stick, you need to operationalize it across the organization.

First, build a central registry—a living database of all models under consideration. Whether it lives in Notion, Airtable, or a custom dashboard, the goal is the same: give everyone visibility into what’s in play, what’s in flight, and what’s delivering value.

Second, create ML scorecards. For every deployed model, track its real-world performance: Did it live up to expectations? Did it move the business metric it claimed it would? Did it degrade over time? A healthy model lifecycle includes not just launching, but evaluating, retraining, and, if necessary, retiring.
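
A scorecard doesn't need to be fancy. Here is a minimal sketch, again using hypothetical field names; the tolerance and drift thresholds are placeholder values you would tune to your own business.

```python
# A minimal scorecard sketch: compare what a model promised with what it delivered.
# Observed numbers would come from your A/B tests and ongoing monitoring.
from dataclasses import dataclass

@dataclass
class Scorecard:
    model_name: str
    projected_lift_pct: float   # what the proposal claimed
    observed_lift_pct: float    # what the A/B test or holdout actually measured
    metric_drift_pct: float     # degradation since launch, if any

    def delivered(self, tolerance: float = 0.5) -> bool:
        # "Lived up to expectations" if observed lift reaches at least
        # this fraction of the projection (placeholder threshold).
        return self.observed_lift_pct >= self.projected_lift_pct * tolerance

    def needs_retraining(self, drift_threshold: float = 20.0) -> bool:
        # Flag models whose measured impact has decayed past the threshold.
        return self.metric_drift_pct > drift_threshold
```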

Third, bake model planning into your quarterly roadmap. Just like product features, ML models deserve space in strategic reviews. Make it part of your normal ritual: review the top-scoring models, ensure a balance across domains, flag risks and blockers, and align on what to ship next.

Finally, don’t treat the framework as dogma. It’s a decision support tool—not a replacement for human judgment. Some models will defy expectations. Others will fail spectacularly. That’s okay. The goal isn’t to guarantee success—it’s to maximize learning and increase the hit rate over time.

A Story: The Model That Almost Didn’t Happen

At one company, a small team proposed an install propensity model. The idea was simple: predict which users were most likely to install an app after seeing an ad. On paper, the impact looked modest—maybe a 3% lift in CVR. The confidence was medium, and the engineering effort was low.

The model barely made the cut. But it was prioritized because the framework highlighted its strong offline performance, low risk, and ease of experimentation. It shipped within a sprint, and the A/B test showed a surprising 6% lift. Because it was deployed quickly, the team was able to iterate and improve it within the quarter. By the end of the year, it had delivered more revenue than several high-profile moonshot models.

This is the power of structured decision-making. Without it, that model might have sat on the backlog indefinitely.

Final Thoughts: From Chaos to Clarity

Planning ML model development doesn’t have to be chaotic. With the right data, tools, and culture, it can be a deeply strategic exercise—one that aligns technical innovation with business impact.

The world of ML is inherently uncertain. But that’s not an excuse for vague planning. In fact, it’s the reason to be more rigorous. Because in a world full of clever models, what separates great ML orgs from average ones isn’t how smart their scientists are—it’s how well they decide what to build next.

So the next time you’re in a planning meeting, staring at a list of competing models, ask yourself: Do we have the data to make this decision? If not, it’s time to build the system that does.

The future of machine learning isn’t just in the models. It’s in the choices we make about them.
