For most of modern history, people have overlooked viruses even though they are the most abundant biological entity on the planet and carry immense ecological significance. Viruses are found in every nook and corner of the world — from soil and water to the atmosphere and even extreme environments like hot springs and hydrothermal vents.
Viruses are obligate parasites: they require a host to infect and replicate. This relationship goes both ways. Thanks to advances in research, scientists are increasingly recognising viruses not only as agents of disease but also as integral components of ecosystems. Viruses drive genetic evolution through horizontal gene transfer, control microbial population balance, and even affect biogeochemical cycles.
They play critical roles in maintaining biodiversity and may even influence climate regulation. Understanding their influence is thus key to unravelling the complexities of life on earth. Yet only a small fraction of the estimated 100 million to a trillion viral species has been identified to date.
Beyond their environmental roles, understanding viruses is crucial for us to anticipate emerging infectious diseases. Some studies have estimated there are around 300,000 mammalian viruses yet to be discovered, many of which could pose zoonotic threats. Unlike microbes, which scientists have studied using culture-based methods, viruses have remained understudied because of the challenges of culturing them.
The rapidly improving scale and declining costs of nucleotide sequencing have resulted in the widespread use of genome-sequencing approaches to understand microbes in the environment, particularly in metagenomic studies. In the last decade, these approaches have transformed our ability to explore the vast diversity of microbes and viruses. In a metagenomic study, researchers analyse genetic material directly from environmental samples, allowing them to identify and study an organism without the need for culturing organic material like tissues in an intermediate step.
In recent years, metagenomics has helped scientists identify a staggering number of previously unknown microbes in diverse environments. These discoveries have significantly expanded our understanding of microbial ecosystems. As sequencing technologies continue to improve — becoming more accurate, faster, and more affordable — alongside better global data-sharing practices, scientists are beginning to unlock the secrets of the microbial world at an unprecedented pace.
In this regard, RNA viruses are of especial significance primarily because they mutate rapidly and adapt quickly to new conditions. More specifically, DNA viruses have more stable genomes and their genome-replicating mechanism makes fewer ‘mistakes’ when they proliferate — whereas RNA viruses replicate faster with higher error rates. This characteristic is also particularly relevant in the context of emerging infectious diseases: COVID-19, Ebola, and influenza are all caused by RNA viruses.
One way to identify an RNA virus is to track down and isolate fragments of a specific gene that is essential for the virus to replicate: the gene that encodes RNA-dependent RNA polymerase, or RdRP. It is one of the most ancient of genes, so much so that many researchers believe it was among the world’s first genes. RdRP proteins have regions that are well-conserved (i.e. which the organism preserves as it evolves) and motifs that are essential for the protein’s function, which is to replicate RNA using a template.
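To make the idea of a conserved motif concrete, here is a minimal sketch in Python. It assumes one widely described RdRP signature, the glycine-aspartate-aspartate ("GDD") triplet in the catalytic core, with "SDD" as a variant seen in some lineages. The function name and the toy sequences are hypothetical, and real pipelines rely on far richer signals than a three-residue pattern.

```python
import re

# Simplifying assumption: treat a GDD (or SDD) triplet as a stand-in for the
# conserved catalytic motif of RdRP. A three-residue pattern is far too short
# to identify an RdRP on its own; this only illustrates the concept.
CATALYTIC_TRIPLET = re.compile(r"[GS]DD")

def find_catalytic_triplets(protein):
    """Return the 0-based start positions of GDD/SDD triplets in a protein."""
    return [m.start() for m in CATALYTIC_TRIPLET.finditer(protein.upper())]

# Hypothetical toy sequences, not real viral proteins.
toy_candidates = {
    "contig_A": "MKTLLVAGDDSVYAQ",  # carries a GDD triplet
    "contig_B": "MKTLLVAAAASVYAQ",  # carries none
}

hits = {name: find_catalytic_triplets(seq) for name, seq in toy_candidates.items()}
print(hits)
```

Only the first toy sequence is flagged, at the position where the triplet begins. In practice a hit like this would be at most a first filter before more careful comparison.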
In 2022, Canadian researchers led by Artem Babaian built an open-source tool called Serratus. Given sequence data, Serratus could match it against sequences known to be related to viral RdRP proteins. The researchers collected more than 10 petabases of sequencing data encompassing 5.7 million sequencing libraries from diverse ecologies. When they fed this dataset to Serratus, it uncovered the presence of more than 100,000 viruses, considerably expanding the diversity of viruses known to humankind. Their findings were published in Nature in January 2022.
In another study published in Science in the same year, U.S. researchers led by Ahmad Zayed at Ohio State University used computational tools to sift through terabytes of RNA sequence data and identify thousands of new RNA virus species. In particular, this team identified a new viral species that fills an important gap in scientists’ understanding of RNA virus evolution; a new species that dominated the oceans; and another species that could infect mitochondria (organelles in cellular organisms that serve as the energy source, believed to have originated from microbes).
An important shortcoming of the metagenomic approach is that computational algorithms typically look for proteins very similar to sequences already in databases. As a result they risk missing proteins that have evolved and changed form. This risk may not hold for long, however. In a recent study, researchers from multiple Chinese research organisations combined genomics with a transformer.
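The shortcoming described above can be illustrated with a small Python sketch. It is a deliberate simplification: real tools score alignments rather than comparing equal-length strings position by position, and the 50% cutoff, the function names, and the toy sequences are all assumptions made for illustration.

```python
def percent_identity(a, b):
    """Fraction of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("this toy comparison needs equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Hypothetical values chosen for illustration only.
IDENTITY_CUTOFF = 0.5
KNOWN = "MKTLLVAGDDSVYAQ"      # stands in for a sequence already in a database
CLOSE = "MKTLLVAGDDSVYAE"      # differs from KNOWN at one position
DIVERGENT = "MRSIVLSGDDTAFGN"  # keeps the GDD core but little else

def is_similar(candidate, reference=KNOWN, cutoff=IDENTITY_CUTOFF):
    """True if the candidate clears the identity cutoff against the reference."""
    return percent_identity(candidate, reference) >= cutoff

for name, seq in [("close", CLOSE), ("divergent", DIVERGENT)]:
    print(name, round(percent_identity(seq, KNOWN), 2), is_similar(seq))
```

The divergent toy sequence still carries the conserved GDD core, yet it falls below the cutoff and would be discarded. That is the kind of miss a purely similarity-based search risks.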
In deep learning, a transformer is a type of machine-learning model known for its ability to be trained rapidly to identify specific patterns in data. In the study, the researchers fed genome-sequencing data, along with data from ESMFold, another machine-learning model adept at predicting the structures of proteins, to their transformer and trained it to spot genetic patterns corresponding to RdRP.
Then they used the transformer to analyse large tranches of metagenomic data, where it identified more than 160,000 new RNA viruses. More than half of these viruses were described for the first time and many came from unique and/or extreme environmental niches, including hot springs, salt lakes, and air. Their findings are to be published in a forthcoming issue of Cell.
Because transformers look for patterns rather than exact amino-acid sequences, they can find proteins even when those proteins have diverged significantly. They can also help computers design proteins based on these patterns, to perform functions that no natural proteins can. The discovery of new RNA viruses from new places in the environment is also important to our understanding of public health. Each new discovery improves our ability to identify and characterise similar viruses, teaches us what to keep an eye out for and how and where to improve our methods, and helps us discover more species faster.
On the ground, a key advantage of such discoveries lies in pandemic preparedness. As sequencing technology becomes more widespread and data-sharing increasingly the norm, we are better equipped than ever to identify pathogenic viruses with zoonotic potential (i.e. those that could spill over from animals to humans) long before they pose a significant threat. Early detection gives us the opportunity for timely intervention and even the chance to prevent large-scale outbreaks.
Looking ahead, a deeper understanding of viruses and their evolution through genomics, with help from ecological surveillance and machine learning, will enhance our preparedness against pandemics. By continuously mapping viral diversity in nature and improving our understanding of virus-host interactions, we can also develop machine-learning models that help anticipate and mitigate viral spillovers. This future holds the promise of not only managing emerging viruses but also tackling the risk of pandemics at the microscopic rather than at the planetary scale.
The authors work at Karkinos Healthcare and are adjunct professors at IIT Kanpur and the Dr D.Y. Patil Medical College, Hospital and Research Centre. Views expressed are personal.
Published - October 24, 2024 05:30 am IST