Researchers Find That a Popular Algorithm Accuracy Metric Can Be Seriously Biased

Scientists rely heavily on numerical metrics to decide whether one algorithm performs better than another. In fields like machine learning, data science, and network research, these metrics often shape entire scientific conclusions. A new study, however, shows that one of the most widely trusted tools for this job may be giving researchers a distorted picture.

Researchers from the Santa Fe Institute, the University of Hong Kong, and the University of Michigan have revealed that Normalized Mutual Information (NMI), a metric used for decades to evaluate classification and clustering algorithms, can produce biased and misleading results. Their findings were published in Nature Communications in December 2025, and they raise important questions about how algorithm performance has been measured across thousands of studies.


What Is Normalized Mutual Information and Why It Matters

Normalized Mutual Information is a statistical measure used to compare two sets of labels. In simple terms, it tells researchers how closely an algorithm's output matches the "ground truth" data. Because it produces values between 0 and 1, it has become extremely popular. A score closer to 1 suggests strong agreement between predicted and true classifications, while a score near 0 suggests poor performance.
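
For readers who want to see the metric in action, here is a minimal sketch of how NMI is typically computed in practice. The labels are made up and scikit-learn is just one common implementation; neither comes from the study.

    # A minimal illustration (not from the paper): comparing a hypothetical
    # algorithm's labels against ground-truth labels using scikit-learn.
    from sklearn.metrics import normalized_mutual_info_score

    true_labels      = [0, 0, 0, 1, 1, 1, 2, 2, 2]   # "ground truth" classes
    predicted_labels = [0, 0, 1, 1, 1, 1, 2, 2, 2]   # one item misassigned

    score = normalized_mutual_info_score(true_labels, predicted_labels)
    print(f"NMI = {score:.3f}")   # well above 0 but below 1: mostly correct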

NMI is used everywhere:

  • To evaluate clustering algorithms
  • To test community detection methods in networks
  • To assess classification models in medical, biological, and social data

Because it works across many types of problems and seems mathematically sound, NMI has long been treated as a reliable yardstick for algorithm comparison.


The Core Problem Researchers Discovered

The new study shows that this trust may be misplaced.

According to the researchers, NMI suffers from two major built-in biases that can significantly skew results. These biases are not small technical quirks; they are large enough to change which algorithm appears "best" in many cases.

The research team includes Max Jerdee, a postdoctoral fellow at the Santa Fe Institute; Alec Kirkley from the University of Hong Kong; and Mark Newman, an external professor at the Santa Fe Institute and professor at the University of Michigan. Together, they carefully analyzed how NMI behaves under different conditions.


Bias One: Rewarding Algorithms That Over-Divide Data

The first issue is that NMI can favor algorithms that create too many categories.

Imagine two algorithms trying to classify patient health conditions. One algorithm correctly identifies diabetes but lumps all patients into a single diabetes category. Another algorithm separates diabetes into many subgroups, some of which are not medically meaningful. Even if the second algorithm is less accurate overall, NMI may still score it higher simply because it produces more categories.

This happens because mutual information tends to increase as the number of groups increases. When this value is normalized, the effect doesn't disappear; it gets baked into the final score. As a result, algorithms that over-split data can look artificially impressive.
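
A toy simulation helps make this concrete. The sketch below is not taken from the paper; it simply scores random, uninformative predictions against the same ground truth, assuming scikit-learn's NMI implementation. Neither prediction carries any real signal, yet the one that over-divides the data scores noticeably higher.

    # A toy illustration (not from the paper) of how NMI inflates scores for
    # predictions that use many groups. Both "predictions" below are pure noise,
    # so neither says anything real about the ground truth, yet the over-divided
    # one receives a noticeably higher score on average.
    import numpy as np
    from sklearn.metrics import normalized_mutual_info_score

    rng = np.random.default_rng(0)
    n = 100
    truth = np.repeat([0, 1], n // 2)        # two equally sized true classes

    def mean_nmi(num_groups, trials=200):
        """Average NMI of random labelings with the given number of groups."""
        scores = [
            normalized_mutual_info_score(truth, rng.integers(num_groups, size=n))
            for _ in range(trials)
        ]
        return float(np.mean(scores))

    print(f"random 2-group predictions : NMI ~ {mean_nmi(2):.3f}")    # close to 0
    print(f"random 50-group predictions: NMI ~ {mean_nmi(50):.3f}")   # noticeably higher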


Bias Two: Favoring Artificially Simple Models

The second bias comes from the way NMI is normalized.

Most commonly used versions of NMI are symmetric, meaning they treat the algorithm's output and the ground truth as equally important in the normalization step. While this seems fair on the surface, it introduces a subtle problem: the normalization depends on the algorithm's own structure.

This can create a bias toward overly simple models in some situations. Depending on how the normalization is done, NMI can penalize algorithms that capture meaningful complexity and instead reward simpler, but less informative, solutions.

The researchers found that different normalization choices can point to entirely different "best" algorithms, even when evaluating the same data.
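
A small hand-built example (again, not from the paper) shows how the normalization choice alone can change which prediction looks best on the same data. scikit-learn exposes the choice through its average_method parameter; the labels below are invented.

    # Two hypothetical predictions scored against the same ground truth under
    # different NMI normalizations. Which one "wins" depends on the normalization.
    from sklearn.metrics import normalized_mutual_info_score

    truth = [0] * 10 + [1] * 10 + [2] * 10    # three true classes of ten items

    # Prediction A over-divides each true class into five tiny sub-groups.
    pred_a = [i // 2 for i in range(30)]      # fifteen groups, each inside one class

    # Prediction B uses the right number of groups but misplaces three items.
    pred_b = list(truth)
    pred_b[0], pred_b[10], pred_b[20] = 1, 2, 0

    for method in ("min", "arithmetic", "max"):
        nmi_a = normalized_mutual_info_score(truth, pred_a, average_method=method)
        nmi_b = normalized_mutual_info_score(truth, pred_b, average_method=method)
        print(f"{method:>10}: A = {nmi_a:.2f}, B = {nmi_b:.2f}")
    # Under 'min' the over-divided prediction A looks perfect; under 'arithmetic'
    # and 'max' it falls behind B.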


Why This Is a Big Deal for Science

Normalized Mutual Information has been used or cited in thousands of scientific papers over the past several decades. In many cases, researchers relied on it to choose between competing models, publish benchmark results, or validate new methods.

If the metric itself is biased, then conclusions drawn from it may be questionable or incomplete. This is especially concerning in areas like:

  • Medical diagnostics, where classification accuracy can influence clinical decisions
  • Network science, where community detection shapes how researchers interpret social or biological systems
  • Machine learning benchmarks, where small metric differences can determine whether a method is considered state-of-the-art

The researchers stress that the problem is not with mutual information itself, but with how it has been normalized and interpreted.


A New Solution: Reduced, Asymmetric Mutual Information

To address these issues, the team developed a revised metric: an asymmetric, reduced version of mutual information.

This new measure removes both sources of bias by changing how normalization is handled. Instead of normalizing with respect to both the algorithm's output and the ground truth, the new metric normalizes only against the true classification. This eliminates the incentive to over-divide data and removes dependence on the algorithm's internal structure.

The result is a metric that is asymmetric by design, but more stable, consistent, and meaningful for real-world comparisons.
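
The exact definition of the authors' reduced mutual information is given in the paper. The sketch below only illustrates the asymmetric-normalization idea, dividing mutual information by the entropy of the ground truth alone, and should not be read as the authors' full measure, which also corrects for the information needed to describe the contingency table.

    # A rough sketch of asymmetric normalization only: mutual information divided
    # by the entropy of the ground-truth labels, so the denominator no longer
    # depends on the candidate labeling. This is NOT the full reduced mutual
    # information from the paper, which adds a further correction term.
    import numpy as np
    from sklearn.metrics import mutual_info_score

    def entropy(labels):
        """Shannon entropy (in nats) of a label sequence."""
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return float(-np.sum(p * np.log(p)))

    def asymmetric_nmi(labels_true, labels_pred):
        """Mutual information normalized only by the ground-truth entropy."""
        return mutual_info_score(labels_true, labels_pred) / entropy(labels_true)

    truth = [0] * 10 + [1] * 10 + [2] * 10
    pred = [0] * 10 + [1] * 10 + [1] * 5 + [2] * 5
    print(f"asymmetric NMI = {asymmetric_nmi(truth, pred):.2f}")   # about 0.7 here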


Testing the New Metric on Real Algorithms

To see how well their approach works, the researchers applied the revised metric to a range of popular community detection algorithms.

They found that standard NMI often produced inconsistent rankings. Depending on the exact normalization formula used, the same algorithm could move from best to mediocre. In contrast, the new metric delivered consistent results, providing a clearer picture of which algorithms genuinely captured meaningful structure in the data.

This suggests that many previous comparisons based on NMI may have been influenced more by metric behavior than algorithm quality.


A Broader Look at Algorithm Evaluation Metrics

This study highlights a larger issue in data science: metrics are not neutral tools. Every evaluation measure encodes assumptions, and those assumptions can quietly shape outcomes.

Other commonly used metrics, such as accuracy, precision, recall, F1 score, and the adjusted Rand index, also have known limitations. Choosing the wrong metric can unintentionally favor certain types of errors or behaviors.
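
As one illustration of how much the choice of metric matters, a chance-level prediction can receive a clearly positive NMI while the chance-corrected adjusted Rand index stays near zero. The comparison below uses invented labels and scikit-learn's implementations; it is not drawn from the study.

    # The same meaningless, fine-grained prediction scored with two metrics:
    # NMI comes out clearly above zero, while the adjusted Rand index, which
    # corrects for chance agreement, stays near zero.
    import numpy as np
    from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

    rng = np.random.default_rng(1)
    truth = np.repeat([0, 1, 2], 20)                   # three true classes
    random_pred = rng.integers(30, size=truth.size)    # meaningless 30-group labeling

    print(f"NMI           = {normalized_mutual_info_score(truth, random_pred):.3f}")
    print(f"adjusted Rand = {adjusted_rand_score(truth, random_pred):.3f}")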

The authors' work reinforces the importance of understanding what a metric truly measures, rather than treating it as an unquestioned standard.


What Researchers and Practitioners Should Take Away

The takeaway is not that NMI should be abandoned entirely, but that it should be used with caution. Researchers should be aware of its biases and consider alternative measures, especially when comparing algorithms with very different structural properties.

The new reduced mutual information metric offers a promising option, particularly for problems involving complex or ambiguous group structures.

As data-driven research continues to expand, careful evaluation will be just as important as algorithm design itself. A bent yardstick, after all, can lead even careful scientists to the wrong conclusions.


Research Paper

https://www.nature.com/articles/s41467-025-66150-8
