DNA sequencing is one of today's most critical scientific fields, powering leaps in humanity's understanding of the genetic causes of cancer, neurodegenerative diseases, and diabetes. One issue facing the field is an overabundance of information. With scientists sharing their sequencing results in unprecedented volumes, datasets measured in petabytes have accumulated in repositories like the American Sequence Read Archive and the European Nucleotide Archive. These datasets contain almost as much information as all the text on the internet, and harnessing them has proven as difficult as analyzing them. Researchers at ETH Zurich have begun to tackle this problem by creating a DNA search engine that will allow scientists to look up and isolate genetic sequences. In a paper published in the scientific journal Nature, the team describes how its search engine, dubbed MetaGraph, transforms these massive, disparate databases into a single searchable database housing nearly 600 million distinct sequences and 21 million gigabytes of sequence data.

Such advancements build on the chain-termination method of Nobel laureate Fred Sanger, who pioneered the field with his 1977 breakthrough in genome sequencing. Since then, scientists have used next-generation sequencing technologies to develop tests that identify almost any infection, catalog the SARS-CoV-2 genome behind the COVID-19 pandemic, and even revive the dire wolf species. Professor Gunnar Rätsch, a data scientist at the Department of Computer Science at ETH Zurich, describes MetaGraph as a "Google for DNA," and researchers hope that its search functionality will vastly accelerate this form of genetic research.