Researchers Find That a Popular Algorithm Accuracy Metric Can Be Seriously Biased
Scientists rely heavily on numerical metrics to decide whether one algorithm performs better than another. In fields like machine learning, data science, and network research, these metrics often shape entire scientific conclusions. A new study, however, shows that one of the most widely trusted tools for this job may be giving researchers a distorted picture.
Researchers from the Santa Fe Institute, the University of Hong Kong, and the University of Michigan have revealed that Normalized Mutual Information (NMI), a metric used for decades to evaluate classification and clustering algorithms, can produce biased and misleading results. Their findings were published in Nature Communications in December 2025, and they raise important questions about how algorithm performance has been measured across thousands of studies.
What Is Normalized Mutual Information and Why It Matters
Normalized Mutual Information is a statistical measure used to compare two sets of labels. In simple terms, it tells researchers how closely an algorithm's output matches the "ground truth" data. Because it produces values between 0 and 1, it has become extremely popular. A score closer to 1 suggests strong agreement between predicted and true classifications, while a score near 0 suggests poor performance.
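To make the idea concrete, here is a minimal sketch of computing NMI with scikit-learn's normalized_mutual_info_score on made-up labels (the study itself is not tied to any particular library):

```python
# Minimal sketch: comparing an algorithm's labels against a ground truth with NMI.
# The labels below are invented for illustration, not data from the study.
from sklearn.metrics import normalized_mutual_info_score

true_labels      = [0, 0, 0, 0, 1, 1, 1, 1]   # hypothetical ground-truth classes
predicted_labels = [0, 0, 0, 1, 1, 1, 1, 1]   # an algorithm's output, one point off

score = normalized_mutual_info_score(true_labels, predicted_labels)
print(f"NMI = {score:.3f}")   # near 1 = strong agreement, near 0 = little agreement
```

Note that NMI compares groupings rather than label names, so relabeling the predicted clusters (swapping 0 and 1, say) leaves the score unchanged.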
NMI is used everywhere:
- To evaluate clustering algorithms
- To test community detection methods in networks
- To assess classification models in medical, biological, and social data
Because it works across many types of problems and seems mathematically sound, NMI has long been treated as a reliable yardstick for algorithm comparison.
The Core Problem Researchers Discovered
The new study shows that this trust may be misplaced.
According to the researchers, NMI suffers from two major built-in biases that can significantly skew results. These biases are not small technical quirks; they are large enough to change which algorithm appears "best" in many cases.
The research team includes Max Jerdee, a postdoctoral fellow at the Santa Fe Institute; Alec Kirkley from the University of Hong Kong; and Mark Newman, an external professor at the Santa Fe Institute and professor at the University of Michigan. Together, they carefully analyzed how NMI behaves under different conditions.
Bias One: Rewarding Algorithms That Over-Divide Data
The first issue is that NMI can favor algorithms that create too many categories.
Imagine two algorithms trying to classify patient health conditions. One correctly identifies which patients have diabetes but places them all in a single diabetes category. Another separates the diabetic patients into many subgroups, some of which are not medically meaningful. Even if the second algorithm is less accurate overall, NMI may still score it higher simply because it produces more categories.
This happens because mutual information tends to increase as the number of groups increases. When this value is normalized, the effect doesn't disappear; it gets baked into the final score. As a result, algorithms that over-split data can look artificially impressive.
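A toy experiment, not taken from the paper, makes the effect easy to see. Below, a labeling that keeps the correct two groups but mislabels 10 of 100 points is compared, under scikit-learn's default normalization, with a labeling that splits each true group into two arbitrary halves; the over-split output comes out ahead:

```python
# Toy illustration (not from the paper) of NMI rewarding over-divided output.
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

# Hypothetical ground truth: 100 patients in two equally sized conditions.
truth = np.array([0] * 50 + [1] * 50)

# Algorithm A: keeps two groups but mislabels 10 patients (5 from each condition).
coarse = truth.copy()
coarse[:5] = 1
coarse[50:55] = 0

# Algorithm B: separates the two conditions cleanly but then splits each one
# into two arbitrary sub-groups, producing four labels in total.
fine = truth * 2 + (np.arange(100) % 2)

print("A (2 groups, 10 errors): ", round(normalized_mutual_info_score(truth, coarse), 3))
print("B (4 groups, over-split):", round(normalized_mutual_info_score(truth, fine), 3))
# B scores roughly 0.67 to A's roughly 0.53, even though B's extra sub-groups
# are arbitrary and add nothing meaningful about the two conditions.
```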
Bias Two: Favoring Artificially Simple Models
The second bias comes from the way NMI is normalized.
Most commonly used versions of NMI are symmetric, meaning they treat the algorithm's output and the ground truth as equally important in the normalization step. While this seems fair on the surface, it introduces a subtle problem: the normalization depends on the algorithm's own structure.
This can create a bias toward overly simple models in some situations. Depending on how the normalization is done, NMI can penalize algorithms that capture meaningful complexity and instead reward simpler but less informative solutions.
The researchers found that different normalization choices can point to entirely different "best" algorithms, even when evaluating the same data.
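The same toy setup used above illustrates the point (again as an illustrative sketch, not an analysis from the paper). scikit-learn exposes four common normalization variants through its average_method option, and they do not agree on which labeling is better:

```python
# Toy check: the ranking of two candidate labelings flips with the normalization.
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

truth = np.array([0] * 50 + [1] * 50)

coarse = truth.copy()                      # two groups, 10 points mislabeled
coarse[:5] = 1
coarse[50:55] = 0

fine = truth * 2 + (np.arange(100) % 2)    # four groups, each true class split in two

for method in ["min", "geometric", "arithmetic", "max"]:
    a = normalized_mutual_info_score(truth, coarse, average_method=method)
    b = normalized_mutual_info_score(truth, fine, average_method=method)
    better = "A (coarse)" if a > b else "B (over-split)"
    print(f"{method:>10}: A={a:.3f}  B={b:.3f}  -> {better} ranked higher")
# Under "max" normalization A wins; under the other three B wins, so the
# "best" algorithm here depends entirely on which formula is chosen.
```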
Why This Is a Big Deal for Science
Normalized Mutual Information has been used or cited in thousands of scientific papers over the past several decades. In many cases, researchers relied on it to choose between competing models, publish benchmark results, or validate new methods.
If the metric itself is biased, then conclusions drawn from it may be questionable or incomplete. This is especially concerning in areas like:
- Medical diagnostics, where classification accuracy can influence clinical decisions
- Network science, where community detection shapes how researchers interpret social or biological systems
- Machine learning benchmarks, where small metric differences can determine whether a method is considered state-of-the-art
The researchers stress that the problem is not with mutual information itself, but with how it has been normalized and interpreted.
A New Solution: Reduced, Asymmetric Mutual Information
To address these issues, the team developed a revised metric: an asymmetric, reduced version of mutual information.
This new measure removes both sources of bias by changing how normalization is handled. Instead of normalizing with respect to both the algorithm output and the ground truth, the new metric normalizes only against the true classification. This eliminates the incentive to over-divide data and removes dependence on the algorithm's internal structure.
The result is a metric that is asymmetric by design but more stable, consistent, and meaningful for real-world comparisons.
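As a very rough sketch of the asymmetric idea alone: if the denominator is the entropy of the true labels and nothing else, it no longer changes with the structure of the algorithm's output. The snippet below shows only that bare normalization step, with a made-up helper name; it is not the authors' full reduced mutual information, which involves additional corrections not reproduced here.

```python
# Bare sketch of asymmetric normalization: mutual information divided by the
# entropy of the ground truth alone. This simplified illustration is NOT the
# authors' full reduced mutual information, which adds further corrections.
import numpy as np
from sklearn.metrics import mutual_info_score

def truth_normalized_mi(truth, candidate):
    """Mutual information divided by the entropy of the true labels (nats)."""
    counts = np.bincount(np.asarray(truth))
    probs = counts[counts > 0] / counts.sum()
    h_truth = -np.sum(probs * np.log(probs))
    return mutual_info_score(truth, candidate) / h_truth

truth = np.array([0] * 50 + [1] * 50)
candidate = truth.copy()
candidate[:5] = 1          # a two-group labeling with ten errors, as before
candidate[50:55] = 0

# The denominator depends only on `truth`, so it stays fixed no matter how many
# groups a candidate labeling produces.
print(round(truth_normalized_mi(truth, candidate), 3))
```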
Testing the New Metric on Real Algorithms
To see how well their approach works, the researchers applied the revised metric to a range of popular community detection algorithms.
They found that standard NMI often produced inconsistent rankings. Depending on the exact normalization formula used, the same algorithm could move from best to mediocre. In contrast, the new metric delivered consistent results, providing a clearer picture of which algorithms genuinely captured meaningful structure in the data.
This suggests that many previous comparisons based on NMI may have been influenced more by metric behavior than algorithm quality.
A Broader Look at Algorithm Evaluation Metrics
This study highlights a larger issue in data science: metrics are not neutral tools. Every evaluation measure encodes assumptions, and those assumptions can quietly shape outcomes.
Other commonly used metrics, such as accuracy, precision, recall, F1 score, and the adjusted Rand index, also have known limitations. Choosing the wrong metric can unintentionally favor certain types of errors or behaviors.
The authors' work reinforces the importance of understanding what a metric truly measures, rather than treating it as an unquestioned standard.
What Researchers and Practitioners Should Take Away
The takeaway is not that NMI should be abandoned entirely, but that it should be used with caution. Researchers should be aware of its biases and consider alternative measures, especially when comparing algorithms with very different structural properties.
The new reduced mutual information metric offers a promising option, particularly for problems involving complex or ambiguous group structures.
As data-driven research continues to expand, careful evaluation will be just as important as algorithm design itself. A bent yardstick, after all, can lead even careful scientists to the wrong conclusions.