Scientists Created A Genetic Code Search Engine Like 'Google For DNA'

DNA sequencing is one of today's most critical scientific fields, powering leaps in humanity's understanding of the genetic causes of cancer, neurodegenerative diseases, and diabetes. One issue facing the field is an overabundance of information. With scientists sharing their sequencing results in unprecedented volumes, massive datasets numbering in the petabytes are now stored in repositories like the American Sequence Read Archive and the European Nucleotide Archive. These datasets contain almost as much information as all the text on the internet, and harnessing them has proven as difficult as analyzing them. Researchers at ETH Zurich have begun to tackle this problem by creating a DNA search engine that allows scientists to look up and isolate genetic sequences. In a paper published in the scientific journal Nature, the team describes how its search engine, dubbed MetaGraph, transforms these massive, disparate databases into a single searchable database housing nearly 600 million distinct sequences and 21 million gigabytes of sequence data.

Such advancements build on the chain-termination method of Nobel laureate Fred Sanger, who pioneered the field with his 1977 breakthrough in genome sequencing. Since then, scientists have used next-generation sequencing technologies to develop tests that can identify almost any infection, catalog the SARS-CoV-2 genome behind the COVID-19 pandemic, and even revive the dire wolf species. Professor Gunnar Rätsch, a data scientist at the Department of Computer Science at ETH Zurich, describes MetaGraph as a "Google for DNA," and researchers hope that its search functionality will vastly accelerate this form of genetic research.

A searchable genome database

The research team at ETH Zurich has been building MetaGraph since 2020. Its strength lies in its ability to streamline searches through DNA and RNA sequencing data by compressing that data into full-text searchable indexes, reducing the average data size by a factor of 300. To achieve this, all data entering the system goes through a refining process that turns raw reads into error-corrected graphs, which are then merged into the group's unified index. This has allowed researchers to compress 100 TB datasets like GTEx and TCGA into just 10 GB each.

The datasets feature viral, microbial, fungal, plant, bacterial, and human DNA sequences, including human gut metagenome and metazoan samples. The scientists also added raw metagenomic data and other critical datasets. The team used advanced mathematical graphs to organize the datasets efficiently, similar to how values are ordered in a spreadsheet. The connections between raw data and metadata have allowed the team to remove many redundancies, vastly compressing the dataset.
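To make the idea of a searchable sequence index more concrete, here is a minimal Python sketch. It is not MetaGraph's code or API, and it performs no compression or error correction. It simply assumes that sequences are split into short overlapping substrings (k-mers) and that each k-mer is mapped to the samples that contain it, so a query can be answered from the index without rescanning the raw data. The function names, sample labels, and toy sequences are all hypothetical.

```python
from collections import defaultdict


def kmers(sequence, k):
    """Yield every overlapping substring of length k."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def build_index(samples, k):
    """Map each distinct k-mer to the set of sample labels that contain it."""
    index = defaultdict(set)
    for label, sequence in samples.items():
        for kmer in kmers(sequence, k):
            index[kmer].add(label)
    return index


def search(index, query, k, min_fraction=1.0):
    """Return {label: fraction of query k-mers found} for labels at or above min_fraction."""
    query_kmers = list(kmers(query, k))
    if not query_kmers:
        return {}  # query is shorter than k, so there is nothing to look up
    hits = defaultdict(int)
    for kmer in query_kmers:
        # .get avoids creating empty entries in the defaultdict during lookups
        for label in index.get(kmer, ()):
            hits[label] += 1
    total = len(query_kmers)
    return {
        label: count / total
        for label, count in hits.items()
        if count / total >= min_fraction
    }


# Hypothetical toy data: three made-up samples, indexed with k = 4.
samples = {
    "sample_a": "ACGTACGTGA",
    "sample_b": "TTGACGTCAC",
    "sample_c": "GGGGCCCC",
}
index = build_index(samples, 4)

exact = search(index, "ACGTGA", 4)                      # samples containing every query k-mer
partial = search(index, "ACGTGA", 4, min_fraction=0.3)  # samples sharing at least 30% of them
```

In this toy version, each distinct k-mer is stored only once no matter how many samples share it, which gives a small-scale intuition for why removing redundancy shrinks the data. A real system would need far more compact data structures than a Python dictionary.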

One benefit of MetaGraph is that it allows researchers to search through the dataset without downloading large volumes of data. Previously, researchers needed to download individual datasets before searching through the raw data sequences, making the research process slow and expensive. Another benefit is that this form of search is much more cost-efficient than previous data collation methods. For instance, the entire body of publicly available biological sequencing data can now fit on a few hard drives, with each search costing a matter of cents, making the total cost roughly $2,500.

The future of DNA sequencing

As it stands, roughly half of the world's sequencing datasets are available through MetaGraph's search functions. The team at ETH expects the rest of the publicly available data to be online by the end of 2025. Critically, MetaGraph's approach is scalable, ensuring that users continue to experience high search speeds even as its dataset multiplies. Because MetaGraph is open source, the team believes that it will attract a wide range of users, from pharmaceutical companies, educators, scientists, and researchers to, possibly, private individuals. As Dr. André Kahles, a member of the Biomedical Informatics Group at ETH Zurich, said in a university press release, "In the early days, even Google didn't know exactly what a search engine was good for. If the rapid development in DNA sequencing continues, it may become commonplace to identify your balcony plants more precisely."

MetaGraph's team of developers hopes the new program will facilitate genetic research. For instance, scientists used genomic sequencers to map out the SARS-CoV-2 virus, a key step in developing the COVID vaccine. Others have analyzed the DNA sequences of earthworms to study evolution. MetaGraph's database could support this kind of research by making it faster and cheaper to search, structure, and test genome sequences. Such developments could make the next generation of genome sequencing technologies better and cheaper, and ultimately benefit human health.

If you want to try it yourself, you can visit MetaGraph's Open Data repository to run searches within the group's cloud database. For amateurs and prospective users looking to visualize the database's results, several examples are available on the project's website, including visualizations of famous proteins and antimicrobial resistance genes.
