fusing biology and computer science?
When two of the leading experts in genomics published their thoughts about the future of genetics in Nature’s retrospective on the Human Genome Project, one of them, J. Craig Venter, didn’t devote his editorial to a simple laundry list of goals and challenges in his field. Instead, he called on a seemingly unlikely crowd to pitch in and help unravel the mysteries of genetics: computer scientists. He wants supercomputers and data-crunching applications that can breeze through DNA and RNA sequences, quickly and inexpensively analyze the billions of nucleotides in a typical genome, and give experts the tools to navigate the oceans of data they will collect. With the proper technology, says Venter, biologists could help personalize medicine, find potential new cancer treatments, and gain new, fundamental insights into our complex phenotypes.
At first, it may seem odd that Venter is asking for a supercomputer to sequence a genome, since we’re constantly hearing about DNA tests in criminal investigations, so much so that pretty much every one of the 865 forensic shows on TV mentions one at least twice per episode. However, the DNA analysis used to catch criminals and a fully fledged genome sequencing project are very different things. Forensic DNA analysis tracks down a small set of markers called short tandem repeats, or STRs, within the 3 million or so nucleotides separating one individual from another, then checks whether this genetic profile matches that of a suspect.
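To see why that comparison is computationally cheap, here is a hypothetical sketch of forensic-style STR matching. A profile is just a handful of repeat counts at a few marker loci; the locus names below are real CODIS markers, but the repeat counts and suspect names are invented for illustration.

```python
# A profile maps each STR locus to the pair of repeat counts (one allele
# inherited from each parent). Matching is a simple equality check.
crime_scene = {"D8S1179": (12, 14), "TH01": (6, 9), "FGA": (21, 24)}

suspects = {
    "suspect_a": {"D8S1179": (12, 14), "TH01": (6, 9), "FGA": (21, 24)},
    "suspect_b": {"D8S1179": (10, 13), "TH01": (7, 9), "FGA": (20, 22)},
}

# Keep only the suspects whose profile matches the crime-scene sample.
matches = [name for name, profile in suspects.items() if profile == crime_scene]
print(matches)  # ['suspect_a']
```

Comparing a few dozen numbers per person is trivial work for any computer, which is exactly why forensic DNA matching doesn’t need the hardware Venter is asking for.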
Scientific genome sequencing is very different because it tries to accurately identify and record every one of the 3 billion nucleotides in our DNA, reading them in fragments of about 900 bases that specialized software assembles into a comprehensive map. The task is roughly a thousand times bigger and far more complex, since just knowing the order of all the As, Cs, Ts, and Gs isn’t enough. The important thing is identifying the various genes, what they do, and how they do it. And for that, you need the kind of technology that can parse through, and keep track of, torrents of data.
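The assembly step above can be illustrated with a toy greedy algorithm: repeatedly merge the two fragments that share the longest suffix/prefix overlap. This is not how real assembly software works at scale (real assemblers handle billions of error-prone reads), and the fragments and minimum overlap below are made-up examples, but it shows the basic idea of stitching reads into a longer sequence.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    olen = overlap(a, b, min_len)
                    if olen > best[0]:
                        best = (olen, i, j)
        olen, i, j = best
        if olen == 0:
            break  # no overlaps left; remaining reads stay separate
        merged = reads[i] + reads[j][olen:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)]
        reads.append(merged)
    return reads

fragments = ["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG"]
print(greedy_assemble(fragments))  # ['ATTAGACCTGCCGGAA']
```

With three ten-base reads this finishes instantly; with the tens of millions of reads in a real genome project, the all-pairs comparison alone becomes enormous, which is where the supercomputing demand comes from.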
Unfortunately for biologists, modern supercomputers would take weeks to sort even the best-quality genomic data into the kind of information required for personalized medical treatments and for delving further into human phenotypes. This is why Venter is asking for a computer capable of performing at a rate of an exaflop, or one quintillion floating-point operations per second.
At about a thousand times faster than the majority of today’s supercomputers, such a machine should be able to perform an extensive analysis of raw genomic data within a few days. Hopefully, with future economies of scale and multiple machines, several decades from now every patient could afford a detailed genetic analysis that would give her and her doctors an idea of what health risks she might face in the future. Likewise, data from thousands of individuals, processed in just a few years, may give researchers new insights into human evolution and functional genomics. And to make this happen, computer scientists and biologists will have to learn quite a bit from each other, working together on the bleeding edges of both biology and supercomputing.
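The scale of the speedup is easy to check with back-of-the-envelope arithmetic. The FLOP rates below are real units (a petaflop machine is roughly a top supercomputer circa 2010), but the workload figure, the number of operations for one hypothetical analysis, is a made-up placeholder chosen only to show the ratio between weeks-to-years and days.

```python
petaflop = 10**15  # operations per second, roughly a top supercomputer in 2010
exaflop = 10**18   # operations per second, one quintillion: what Venter wants

print(exaflop // petaflop)  # 1000, the "thousand times faster" figure

# A hypothetical analysis requiring 10^24 operations (invented number):
work = 10**24
print(work / petaflop / 86400, "days at a petaflop")  # ~11574 days, ~32 years
print(work / exaflop / 86400, "days at an exaflop")   # ~11.6 days
```

The same job drops from decades to under two weeks, which is the difference between a research curiosity and a clinically useful turnaround time.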
See: Venter, J. C. (2010). Multiple personal genomes await. Nature, 464(7289), 676–677. DOI: 10.1038/464676a