
why your dna is nothing like a database

2009 October 21
by gfish

How many times have you heard DNA being compared to a computer database? How many creationists took this metaphor to heart and started wandering off into statements about how the information contained in our genome must have been engineered to mimic the back ends of complex software systems? And how many science classes still make this analogy? On the surface, it seems more or less accurate, since DNA stores a great deal of hereditary information that’s read and executed by cellular machinery. But when we actually take a deeper look at this concept, the similarities quickly begin to break down, which makes perfect sense since we’re trying to compare a dynamic, evolving system to a static tool created to perform a limited set of tasks.


One of the first problems we would encounter when treating our genome like a database would be the way it encodes amino acids. As we know from basic biology, the bases along the DNA strands form a triplet code that is transcribed into mRNA (or messenger RNA), which carries the message to ribosomes to assemble the dictated proteins. However, both DNA and mRNA use four nucleotides to do the encoding, so there are 64 possible triplets for only 20 amino acids. That means the same amino acid is often specified by several different codons. And indeed, the breakdown of what each codon means is full of redundancies. It’s like each entry in a database having several unique keys, something most modern database development tools won’t even let you do. Without a truly unique identifier for each piece of data, database structures quickly break down and you’re going to have a whole lot of problems retrieving and working with their contents.
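The arithmetic is easy to check. As a minimal sketch in Python, with illustrative names that are not from any particular library, we can enumerate every possible triplet of the four mRNA bases and compare the count to the 20 amino acids:

```python
from itertools import product

BASES = "UCAG"          # the four mRNA nucleotides
AMINO_ACID_COUNT = 20   # amino acids to encode, as stated in the text

# Every ordered triplet of bases is a possible codon.
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]

print(len(codons))                      # 4**3 = 64
print(len(codons) - AMINO_ACID_COUNT)   # 44 more codons than amino acids
```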

Sure, you could trick the database into successfully associating one piece of data with several identifiers, but it’s a very sloppy and bug-prone way to organize anything. When I worked on a little app that allowed high school biology students to enter a DNA or an mRNA triplet and see the detailed chemical structure of the amino acid that it encoded, I had to trick the computer into realizing that GGU, GGC, GGA and GGG could all mean glycine by storing all the amino acids and all the codons separately, then pointing them to each other in the code. So every time I’ve heard creationists talk about DNA being perfectly designed, my mind just flashes back to the challenges I had with designing that little app. If someone designed DNA to work like a blueprint, it’s a very messy and jury-rigged design at best. It would’ve been far more efficient to make each amino acid correspond to only one codon. All you’d need would be 20 combinations for a one-to-one match.
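To make that workaround concrete, here is a minimal sketch of this kind of two-way lookup in Python. It is not the code from the original app. The names `CODONS_BY_AMINO_ACID`, `AMINO_ACID_BY_CODON` and `amino_acid_for` are hypothetical, only a handful of the 64 codons are included, and the sketch assumes a DNA triplet is given as the coding strand, so T simply becomes U.

```python
# Each amino acid is stored once, with every mRNA codon that specifies it.
# Partial table for illustration only.
CODONS_BY_AMINO_ACID = {
    "glycine": ["GGU", "GGC", "GGA", "GGG"],
    "methionine": ["AUG"],
    "tryptophan": ["UGG"],
    "phenylalanine": ["UUU", "UUC"],
}

# Invert the table so that several codons point to the same amino acid.
AMINO_ACID_BY_CODON = {
    codon: amino_acid
    for amino_acid, codons in CODONS_BY_AMINO_ACID.items()
    for codon in codons
}


def amino_acid_for(triplet):
    """Return the amino acid for a DNA or mRNA triplet, or None if it is not in the table."""
    # Assumption: a DNA triplet is the coding strand, so T maps directly to U.
    codon = triplet.upper().replace("T", "U")
    return AMINO_ACID_BY_CODON.get(codon)


for triplet in ["GGU", "GGC", "GGA", "GGG", "ggt"]:
    print(triplet, "->", amino_acid_for(triplet))
```

All five lookups land on glycine, which is exactly the many-keys-to-one-record arrangement that a database with strictly unique keys is built to avoid.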

But of course there’s a reason why the redundancy of DNA has been kept by natural selection. Unlike our computer systems, genomes change due to replication errors, environmental damage and so on. When a computer database starts randomly changing its data on you, it’s called corruption and it renders your piece of software pretty much useless. In fact, just watch an IT person’s reaction to the words “data corruption.” It will probably involve cursing, groaning, and if a big deadline is looming overhead, uncontrolled crying or a panic attack. But when we’re talking about DNA, the redundancy in its encoding is one of the things that protects us from potentially harmful side effects of mutations. Even if there are snips in the strand, there’s still a fairly good chance that the right amino acid will be encoded. And again, if DNA were well designed and organized, there’d be no need for the redundancy and mutations would always be corrected, not just patched up here and there or countered so the code can function as intended.
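How much protection does the redundancy buy? A rough way to put a number on it is to go through every amino-acid-coding codon in the standard genetic code, change one base at a time, and count how often the amino acid stays the same. The Python sketch below does that. It is a simplification that treats every substitution as equally likely, which real mutations are not, and the helper name `count_substitutions` is made up for this example.

```python
BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def count_substitutions(table):
    """Count single-base substitutions in coding codons: (synonymous, total)."""
    synonymous = total = 0
    for codon, amino_acid in table.items():
        if amino_acid == "*":
            continue  # stop codons are not counted as starting points
        for position in range(3):
            for base in BASES:
                if base == codon[position]:
                    continue
                mutant = codon[:position] + base + codon[position + 1:]
                total += 1
                if table[mutant] == amino_acid:
                    synonymous += 1
    return synonymous, total


synonymous, total = count_substitutions(CODON_TABLE)
print(f"{synonymous} of {total} single-base changes keep the same amino acid "
      f"({synonymous / total:.0%})")
```

Under these assumptions, roughly a quarter of all single-base changes leave the encoded amino acid untouched, and most of those are changes to the third base of the triplet.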

Finally, we have to remember that a good part of our genomes doesn’t actively code for a protein or pass any important information during embryo development. In any database, having data that’s simply an archive and wouldn’t be required for day-to-day function is considered wasteful and an administrator would run routines to get rid of it. Leaner databases mean faster execution. If this archival data suddenly becomes necessary, the developers would simply request that the needed bits be added back and work with them. When we add this to our list of problems with DNA running like some sort of digital blueprint, it seems that our genomes are over-engineered, redundant, inefficient and prone to random changes that would easily cripple any computer system designed to carry out a specific task. But that’s OK because biological systems are formed bottom up, not top down. They evolve for change and propagation, not self-contained data processing or a rather limited data exchange dictated by a strict, inflexible system of rules and regulations. And this is exactly why we should not be comparing the two in books or in science classes.

2 Comments
  1. Dave Martin
    October 24, 2009

    I think the computer code example is a good analogy, so far as it goes. This attempt to show how DNA is somehow “inefficient” is simply a failure to adequately understand its task.

    We have four nitrogenous bases for evolutionary reasons – they are, in fact, a consequence of chemistry and of the double stranded DNA structure. Because of chemistry, we have the purine/pyrimidine base combinations, and because of the double stranded DNA structure, the unit of information becomes a base pair, rather than a base. This system requires a minimum of 4 bases to work.

    If you are working with 4 bases, then the triplet code is a must. If each codon used only 2 bases instead of 3, you would have 4^2 = 16 possible combinations. This is not enough to code for 20 amino acids, not to mention start and stop codons. So we need a triplet code, which means 4^3 = 64 combinations are available. This allows for some redundancy as well as start/stop codons, but it is also the minimum number necessary. So it’s not “inefficient” in that sense – you really can’t do the job with 2, you NEED 3. Not to mention that this redundancy offers protection against mutations which can kill you.

    You could argue “well, if I were the designer, I wouldn’t go with 4 bases to begin with, I’d pick fewer”. Then you would get into much deeper details, none of which really make an argument for efficiency. The double stranded DNA structure protects against mutations – an “open” single strand is much less stable, so this is not a waste or “inefficiency”. You can’t just pick arbitrary molecules as bases either – the purine and pyrimidine bases are well suited to the task because of their high affinities with each other, because of the way they integrate with the sugar/phosphate backbone, because of how that allows the molecule to coil, so you can have a meter long molecule packed inside a tiny cell nucleus, and yet have any part of the molecule instantly available for transcription without uncoiling the whole thing.

    In short, none of this is inefficient – unless by efficiency you mean “gimme ECC memory and no other constraints other than those imposed by these silicon computers and programming languages I am used to”. Those computers and languages also have their constraints, which have to do with the nature of the electronics they run on, and the history of how we developed computing hardware and software. In the same way, biology has its own constraints. It is not “inefficient” just because those two sets of constraints are different.

    Just as the creationist/ID nuts go overboard with their design argument, I think some supposedly scientific people go overboard with their “there is no design – this is all a jury rigged inefficient PoS” argument. There are indeed some cases where biology has produced some strange results, that we as human engineers find kludgy. But much more often, when we call something “inefficient” the real problem is that we don’t understand the system well enough.

    I have no problem comparing DNA to computers, since many people are familiar with computers. I would take the trouble to point out some of the facts I have mentioned in this reply, however, so that people realize that analogies only approximate the truth to help explain some specific facet of the comparison, and the reality is of course much more complex.

  2. gfish
    October 24, 2009

    Dave,

    Most of the things you noted were mentioned in the article. Yes, it’s true that when we’re looking at DNA, we’re looking at a product of more than 3.5 billion years of selection and it’s going to be pretty efficient and well adapted as far as biology is concerned.

    However, it’s the redundancies and the necessity of dealing with mutations that make it an inefficient design. It’s not about how many bases you choose, but how you encode the data you need and how you protect it. A good designer should’ve made a simple encoding that matches the 20 amino acids one to one and protected the genome from mutations by backups which replace altered DNA.

    Funnily enough, though, some organisms can actually do something like that with single strand annealing. But that process is less of a backup and more of a deletion and substitution.
