Compressed Data Technique Enables Pangenomics at an Unprecedented Scale

An illustration highlighting PanMAN’s ability to handle vast quantities of genetic data with remarkably low storage demands. Credit: Alice Grishchenko.

Engineers at the University of California have developed a powerful new way to store and analyze massive amounts of genetic data, potentially changing how scientists study genomes at scale. The technique introduces a new data structure and compression method that allows pangenomics, the study of many genomes from a single species, to operate at scales that were previously impractical due to storage and computational limits. The work is led by Yatish Turakhia, a professor of electrical and computer engineering at UC San Diego, and has been published in Nature Genetics.

Pangenomics has become increasingly important as genome sequencing technologies have improved. Instead of relying on a single reference genome, researchers now compare thousands or even millions of genomes from the same species to understand genetic variation, mutation patterns, and evolutionary history. This approach is especially useful in areas such as infectious disease research, where tracking mutations can help explain increased transmissibility, immune escape, or drug resistance. However, while sequencing has become faster and cheaper, the data itself has grown overwhelming.

Why Pangenomics Faces a Data Problem

Modern sequencing technologies generate vast amounts of data. Representing and analyzing millions of genomes requires not just storage space, but also data structures that can efficiently capture relationships between genomes. Most current pangenomic methods rely on graph-based formats, which represent genetic variation across genomes. While these formats are useful, they have two major drawbacks.

First, they typically focus on representing variation but do not fully capture evolutionary and mutational histories: that is, how genomes are related over time and how specific mutations arose. Second, these formats require enormous storage space and do not scale well as datasets grow into the millions.

This is where the new approach comes in. The research team recognized that how genetic data is structured fundamentally determines both how efficiently it can be stored and what biological questions it can answer. Their solution was to rethink pangenomic representation from the ground up.

Introducing PanMAN: A New Way to Represent Genomes

The team developed a new data structure and file format called the Pangenome Mutation-Annotated Network, or PanMAN. This approach is designed to be both highly compressible and biologically expressive, allowing it to represent not just variation, but also ancestry, mutations, and evolutionary relationships.

At the core of PanMAN are structures known as mutation-annotated trees, or PanMATs. Each PanMAT starts with a single ancestral genome sequence at its root. As the tree branches out, mutations (such as substitutions, insertions, and deletions) are annotated along the branches where they occurred. This means that each mutation is recorded once, at the point in evolutionary history where it arose, rather than being duplicated across every descendant genome.
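The idea of a mutation-annotated tree can be illustrated with a small Python sketch. This is not PanMAN's actual file format or API: the `Node` class, the `(kind, position, payload)` mutation tuples, and the toy sequences are all assumptions made for illustration. It shows how one stored root sequence plus per-branch mutations is enough to rebuild any genome in the tree.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One tree node; `mutations` describe the branch leading to it (hypothetical layout)."""
    name: str
    # Each mutation is (kind, position, payload), relative to the parent's sequence:
    #   ("sub", pos, base), ("ins", pos, inserted_bases), ("del", pos, length)
    mutations: list = field(default_factory=list)
    children: list = field(default_factory=list)


def apply_mutations(seq, mutations):
    """Return a copy of `seq` with one branch's mutations applied."""
    bases = list(seq)
    # Apply from the highest position down so earlier edits do not shift later coordinates.
    for kind, pos, payload in sorted(mutations, key=lambda m: m[1], reverse=True):
        if kind == "sub":
            bases[pos] = payload
        elif kind == "ins":
            bases[pos:pos] = list(payload)
        elif kind == "del":
            del bases[pos:pos + payload]
        else:
            raise ValueError(f"unknown mutation kind: {kind}")
    return "".join(bases)


def reconstruct(node, parent_seq, target):
    """Walk down from `node`, replaying mutations, until `target` is reached."""
    seq = apply_mutations(parent_seq, node.mutations)
    if node.name == target:
        return seq
    for child in node.children:
        found = reconstruct(child, seq, target)
        if found is not None:
            return found
    return None


# Toy example: only the root sequence is stored in full.
ROOT_SEQ = "ACGTACGT"
root = Node("root", children=[
    Node("A", mutations=[("sub", 2, "T")], children=[
        Node("A1", mutations=[("ins", 4, "GG")]),
        Node("A2", mutations=[("del", 0, 2)]),
    ]),
    Node("B", mutations=[("sub", 7, "A")]),
])

print(reconstruct(root, ROOT_SEQ, "A1"))  # ACTTGGACGT
print(reconstruct(root, ROOT_SEQ, "A2"))  # TTACGT
```

Note that the substitution on the branch to `A` is written once, yet both `A1` and `A2` inherit it. That sharing is the source of the redundancy savings described above.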

Multiple PanMATs are then connected together to form a network, creating the full PanMAN structure. These connections allow PanMAN to represent complex biological events like recombination and horizontal gene transfer, where genetic material comes from multiple parent sequences instead of following a simple, tree-like inheritance pattern. This is something many existing pangenome formats struggle to model effectively.
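A rough way to picture such a connection is an edge that derives one sequence from two parents. The Python sketch below is a deliberately simplified assumption, not the representation used in PanMAN: the `RecombinationEdge` class, its single `breakpoint` field, and the toy sequences are hypothetical, and real recombination or horizontal gene transfer events can be far more involved.

```python
from dataclasses import dataclass


@dataclass
class RecombinationEdge:
    """Hypothetical network edge: a child built from two parent sequences."""
    child: str
    left_parent: str   # contributes bases before the breakpoint
    right_parent: str  # contributes bases from the breakpoint onward
    breakpoint: int


def resolve(edge, sequences):
    """Build the recombinant sequence from its two parents."""
    left = sequences[edge.left_parent]
    right = sequences[edge.right_parent]
    return left[:edge.breakpoint] + right[edge.breakpoint:]


# Toy parent sequences, imagined as nodes from two different trees.
sequences = {"tree1/X": "AAAAAAAA", "tree2/Y": "CCCCCCCC"}
edge = RecombinationEdge("R", "tree1/X", "tree2/Y", breakpoint=3)

print(resolve(edge, sequences))  # AAACCCCC
```

A plain tree cannot express this, because the child `R` has two parents. Linking trees with extra edges is what turns the structure into a network.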

Compression Without Losing Biological Meaning

One of the most striking advantages of PanMAN is its compression capability. By exploiting shared ancestry and storing mutations only once, PanMAN dramatically reduces redundancy in genomic data. At the same time, it preserves a rich set of biologically meaningful information.

PanMAN explicitly stores data such as mutations, phylogenetic relationships, genome annotations, and the ancestral root sequence. From this information, researchers can also derive additional insights, including ancestral genome sequences, whole-genome alignments, and detailed maps of genetic variation across populations.

Crucially, the researchers designed PanMAN so that analysis can be performed directly on the compressed data. This eliminates the need to decompress massive datasets before running computational analyses, saving both time and computing resources and making large-scale studies far more practical.
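As a small illustration of what analysis on the compressed form can look like, the Python sketch below answers the question "how many sampled genomes carry this mutation?" by counting leaves beneath the branch where the mutation is annotated, without rebuilding a single sequence. The dictionary layout and mutation labels such as `G2T` are assumptions for this example, not PanMAN's real interface, and the count assumes the mutation is not reverted or re-acquired further down the tree.

```python
# Hypothetical toy tree: each node lists the mutations on the branch leading to it.
tree = {
    "root": {"mutations": [], "children": ["A", "B"]},
    "A": {"mutations": ["G2T"], "children": ["A1", "A2"]},
    "A1": {"mutations": ["C5A"], "children": []},
    "A2": {"mutations": [], "children": []},
    "B": {"mutations": ["T7A"], "children": ["B1"]},
    "B1": {"mutations": [], "children": []},
}


def count_leaves(tree, name):
    """Number of leaf genomes at or below the node `name`."""
    children = tree[name]["children"]
    if not children:
        return 1
    return sum(count_leaves(tree, child) for child in children)


def genomes_with_mutation(tree, mutation):
    """Count leaves under every branch annotated with `mutation`.

    Assumes the mutation is never reverted below, and never annotated
    on two branches where one is an ancestor of the other.
    """
    return sum(
        count_leaves(tree, name)
        for name, node in tree.items()
        if mutation in node["mutations"]
    )


print(genomes_with_mutation(tree, "G2T"))  # 2
print(genomes_with_mutation(tree, "T7A"))  # 1
```

The work done depends on the size of the tree rather than on the total length of all genomes, which is why skipping decompression matters at the scale of millions of samples.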

Real-World Results: Millions of Genomes in Megabytes

So far, the researchers have applied PanMAN primarily to microbial genomes, where large-scale sequencing is common. The results have been dramatic. PanMAN has proven to be the most compressible format among pangenomic representations that preserve genetic variation, achieving hundreds to thousands of times more compression than existing methods.

One standout example is the construction of the largest pangenome to date for SARS-CoV-2, the virus responsible for COVID-19. The team analyzed more than 8 million viral genomes, an enormous dataset by any standard. Using PanMAN, this entire pangenome required only 366 megabytes of storage. For comparison, the equivalent whole-genome alignment would have required roughly 3,000 times more space.

Building such a massive alignment was itself a major challenge. To address this, the team relied on another computational tool developed in Turakhia’s lab called TWILIGHT, which is designed to efficiently construct large-scale genome alignments. Together, TWILIGHT and PanMAN form a powerful pipeline for compressive pangenomics.
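To put the SARS-CoV-2 figures in perspective, a quick back-of-the-envelope calculation in Python follows. It uses only the numbers quoted above and assumes decimal units (1 MB = 1,000,000 bytes) and exactly 8 million genomes, so the results are rough estimates rather than reported measurements.

```python
PANMAN_MB = 366            # PanMAN size quoted above
ALIGNMENT_RATIO = 3_000    # "roughly 3,000 times more space"
GENOMES = 8_000_000        # "more than 8 million", so treated as a lower bound

# Estimated size of the equivalent whole-genome alignment, in terabytes.
alignment_tb = PANMAN_MB * ALIGNMENT_RATIO / 1_000_000

# Average PanMAN storage per genome, in bytes (an upper bound, since
# the true genome count is above 8 million).
bytes_per_genome = PANMAN_MB * 1_000_000 / GENOMES

print(f"alignment: about {alignment_tb:.1f} TB")            # about 1.1 TB
print(f"PanMAN: at most {bytes_per_genome:.2f} bytes/genome")  # 45.75
```

In other words, each additional genome costs a few dozen bytes on average, far less than the length of the genome itself.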

Expanding Beyond Microbes to Human Genomes

While microbial genomes were the first testing ground, the researchers are already looking ahead. The next major goal is to apply compressive pangenomics to human genomes, a far more complex and data-intensive challenge. Turakhia and Melissa Gymrek, a professor of computer science and engineering at UC San Diego, have received a Jacobs School Early Career Faculty Development Award to support this effort.

Extending PanMAN to human genomes could fundamentally change how large-scale human genetic data is stored, analyzed, and shared. Beyond enabling studies of genetic diversity, disease, and evolution at unprecedented scale, the approach has the potential to capture the detailed evolutionary and mutational histories that shape different human populations, information that current genome representations often fail to capture.

A Closer Look at Pangenomics as a Field

Pangenomics represents a shift away from the idea of a single “reference” genome. In reality, no single genome can fully represent the diversity of a species. By studying many genomes together, researchers gain a more accurate picture of how genes vary, how mutations spread, and how evolution shapes populations over time.

This approach has become especially important in pathogen surveillance, where tracking mutations across millions of samples can reveal how viruses or bacteria adapt in response to vaccines, drugs, or environmental pressures. However, without scalable data structures, the promise of pangenomics has been limited by practical constraints. Techniques like PanMAN aim to remove those barriers.

Why This Matters Going Forward

As sequencing continues to accelerate, the gap between data generation and data analysis will only widen unless new methods are adopted. PanMAN represents a step toward doing more with less (less storage, less computation, and less redundancy) while still capturing the full biological complexity of genomes.

By combining compression with rich biological representation, this approach opens the door to pangenomic studies at scales that were once unimaginable. Whether tracking the evolution of a virus, exploring microbial diversity, or mapping human genetic variation, compressive pangenomics could become a foundational tool for the next era of genomics research.

Research paper: https://www.nature.com/articles/s41588-025-02478-7
