AI Is Transforming How Natural History Collections Are Digitized and Mapped Around the World
Artificial intelligence is beginning to reshape many areas of science, and a new study from the University of North Carolina at Chapel Hill shows just how powerful that transformation could be for natural history collections. Researchers at UNC-Chapel Hill have demonstrated that large language models (LLMs) can dramatically speed up the process of digitizing plant specimens by accurately identifying where those specimens were originally collected. This process, known as georeferencing, has long been one of the biggest obstacles to fully digitizing natural history collections.
The study, published in the journal Nature Plants, focuses on a practical and costly problem faced by herbaria and museums worldwide. Plant specimens often come with handwritten or typed labels describing where they were collected, sometimes decades or even centuries ago. Turning those descriptions into usable geographic coordinates has traditionally required manual interpretation by experts, specialized software, and multiple rounds of verification. The work is slow, expensive, and difficult to scale.
The UNC research team set out to answer a clear question: can modern AI tools automate this step without sacrificing accuracy? According to their findings, the answer is yes.
How AI Handles Georeferencing Tasks
Georeferencing involves reading locality descriptions such as place names, landmarks, distances, or historical references and translating them into latitude and longitude coordinates. For humans, this often means consulting old maps, gazetteers, or regional knowledge. For AI, it means interpreting natural language and connecting it to geographic data.
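To make the input and output of this step concrete, here is a minimal Python sketch of the two ends of an LLM-based workflow: wrapping a label's locality text in an instruction, and turning a model's free-text reply into a validated coordinate pair. The prompt wording, the "latitude, longitude" reply format, and the function names are all assumptions made for illustration; the article does not describe the prompts or parsing used in the study, and the call to the model itself is deliberately left out.

```python
import re
from typing import Optional, Tuple


def build_prompt(locality: str) -> str:
    """Wrap a label's locality text in a hypothetical instruction (wording is an assumption)."""
    return (
        "Estimate where this specimen was collected. "
        "Reply with decimal degrees as 'latitude, longitude'.\n"
        f"Locality: {locality.strip()}"
    )


# Matches the first "number, number" pair in a reply, e.g. "35.9132, -79.0558".
_COORD = re.compile(r"(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)")


def parse_reply(reply: str) -> Optional[Tuple[float, float]]:
    """Return (lat, lon) from a reply, or None if no valid pair is found."""
    match = _COORD.search(reply)
    if match is None:
        return None
    lat, lon = float(match.group(1)), float(match.group(2))
    # Reject values that cannot be real coordinates.
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        return None
    return lat, lon
```

Returning None for unparseable or out-of-range replies, rather than raising an error, keeps a batch run going and leaves those records to be handled separately.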
The researchers tested advanced large language models, including state-of-the-art systems, to see how well they could perform this task. The results were striking. The models were able to determine specimen collection locations with a median error of less than 10 kilometers, a level of accuracy comparable to trained human experts. In many cases, the AI outperformed traditional automated tools that are currently used for georeferencing.
Just as important as accuracy were speed and cost. What can take humans minutes or longer per specimen, AI systems were able to complete in seconds. At scale, this translates into enormous savings in time and resources.
Why This Matters for Global Biodiversity Data
Natural history collections are vast. Scientists estimate that between 2 and 3 billion specimens are stored in institutions around the world. Only a small fraction of these have been fully digitized with usable geographic data. Without that data, researchers face serious limitations.
Georeferenced specimens are critical for understanding biodiversity distribution, tracking species movement, and studying how ecosystems respond to climate change. When location data is missing or incomplete, entire collections become far less useful for modern ecological research.
By applying AI-powered georeferencing, researchers believe that millions of records currently sitting in cabinets could be unlocked. This would give scientists access to historical biodiversity data at a scale that was previously impossible.
How the UNC Study Was Conducted
The UNC-Chapel Hill team evaluated LLM performance using thousands of plant specimen records with known coordinates. By comparing AI-generated locations to verified reference points, the researchers could directly measure accuracy.
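A comparison of this kind can be sketched in a few lines of Python: compute the great-circle distance between each predicted point and its verified reference point, then take the median of those distances. This is an illustrative sketch, not the study's evaluation code. The haversine formula, the function names, and the (latitude, longitude) tuple layout are assumptions chosen for the example.

```python
import math
from statistics import median
from typing import List, Tuple

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius

Point = Tuple[float, float]  # (latitude, longitude) in decimal degrees


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two points on a sphere."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # min() guards against rounding pushing the value slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def median_error_km(predicted: List[Point], reference: List[Point]) -> float:
    """Median distance between paired predicted and reference coordinates."""
    if not predicted or len(predicted) != len(reference):
        raise ValueError("need two non-empty lists of equal length")
    errors = [haversine_km(*p, *r) for p, r in zip(predicted, reference)]
    return median(errors)
```

The median is a natural summary here because a handful of badly misplaced specimens would distort a mean far more than they shift the middle of the distribution.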
The models were also compared against traditional georeferencing methods, including widely used tools such as GEOLocate. In these comparisons, LLMs consistently matched or exceeded existing approaches while requiring far less manual input.
Another important finding was scalability. Human-based workflows struggle as collections grow larger. AI models, by contrast, can process massive datasets quickly, making them particularly well-suited for institutions with hundreds of thousands or even millions of specimens.
Reducing Cost and Labor Barriers
One of the most practical outcomes of this research is the potential reduction in cost. Digitizing large herbarium collections can require years of labor and substantial funding. For many institutions, especially smaller ones, these costs are prohibitive.
The study suggests that using LLMs could reduce georeferencing costs to a fraction of traditional methods, making large-scale digitization more accessible. Instead of replacing experts, AI can handle routine cases while specialists focus on complex or ambiguous records.
This hybrid approach could help institutions digitize collections faster without compromising data quality.
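One way to picture such a hybrid workflow is a simple triage step that accepts confident results automatically and queues the rest for a specialist. The Python sketch below is only an illustration under stated assumptions: it supposes each AI result carries a self-reported uncertainty radius in kilometers, and the Georeference record, the triage function, and the 10 km default threshold are all hypothetical placeholders. The article does not say how the study or any institution separates routine records from ambiguous ones.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Georeference:
    """Hypothetical record for one AI-produced georeference."""
    record_id: str
    lat: Optional[float]
    lon: Optional[float]
    uncertainty_km: Optional[float]  # assumed self-reported radius, not from the study


def triage(
    records: List[Georeference],
    max_uncertainty_km: float = 10.0,  # placeholder threshold, to be tuned per collection
) -> Tuple[List[Georeference], List[Georeference]]:
    """Split records into (auto-accepted, needs expert review)."""
    accepted: List[Georeference] = []
    review: List[Georeference] = []
    for rec in records:
        complete = all(v is not None for v in (rec.lat, rec.lon, rec.uncertainty_km))
        if complete and rec.uncertainty_km <= max_uncertainty_km:
            accepted.append(rec)
        else:
            # Missing coordinates or a wide uncertainty radius goes to a specialist.
            review.append(rec)
    return accepted, review
```

Sending incomplete records to review by default means a failed or partial AI result can never be accepted silently.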
Why Large Language Models Are Suited for This Work
LLMs are particularly effective at georeferencing because they excel at understanding context in natural language. Many specimen labels include vague or outdated place names, references to historical boundaries, or informal descriptions that rule-based systems struggle to interpret.
By drawing on extensive training data, LLMs can infer meaning, resolve ambiguity, and connect textual descriptions to modern geographic references. This ability gives them a significant advantage over older automated tools.
Broader Implications for Natural History Collections
The success of AI-based georeferencing opens the door to wider applications in natural history digitization. Beyond plants, similar approaches could be used for insects, fossils, animals, and geological specimens, many of which face the same challenges with incomplete or hard-to-interpret labels.
Faster digitization also supports data sharing across institutions and countries. Once collections are digitized and georeferenced, they can be integrated into global databases used by researchers, policymakers, and conservation organizations.
Understanding Georeferencing in Simple Terms
Georeferencing may sound technical, but at its core it is about answering one basic question: where was this specimen collected? Without that answer, a specimen's scientific value is limited. With it, the same specimen becomes a data point in understanding how life on Earth is distributed and how it changes over time.
Historically, the challenge has not been collecting specimens, but converting legacy records into digital formats that modern science can use. AI is now showing that this long-standing bottleneck can be addressed efficiently.
What Comes Next
The UNC study is among the first to show that large language models can handle georeferencing at scale with high accuracy. While further testing and refinement are still needed, the results suggest a future where digitizing natural history collections is no longer constrained by time and cost.
As AI tools continue to improve, their role in biodiversity research is likely to expand. For scientists studying climate change, conservation, and ecosystem health, access to comprehensive, georeferenced historical data could prove invaluable.
At its core, this research highlights a simple but powerful idea: by combining human expertise with intelligent automation, centuries of biological knowledge stored in museum collections can finally be brought fully into the digital age.
Research paper: https://www.nature.com/articles/s41477-025-02162-y