whoops, ray manages to do it again…

August 21, 2010

Ray Kurzweil’s recent prognostications about reverse-engineering the human brain within unrealistically short time frames have been making the rounds on the web, inspiring a web comic parody, a rant from PZ calling him the Deepak Chopra for the tech-minded, and, of course, a rebuttal from yours truly. Now, Ray responds with a clarification, claiming that he’s been taken out of context and that his talk was over-simplified, and offering a more detailed and nuanced version of his predictions. Okay, fair enough. The media does get ahead of itself, and tech blogs like Gizmodo are notorious for tripping over a story before they’ve learned enough about its details or its real-world significance. But the problem with Ray’s response is that even with some caveats and additional details, he’s still making significant mistakes about the brain and how it works, as well as about what it takes to accurately simulate an organ this complex, and you bet I’m going to take issue with his arguments again.

Ray’s first correction is that he expects the brain to be reverse-engineered by 2030 rather than 2020, which is really not much of an improvement to those who don’t subscribe to the numerology of exponential technological and evolutionary advancement he so passionately advocates. According to him, we accomplish so much more with each passing year and decade that the extra decade basically gives neurologists and computer scientists enough leeway and then some in his Singularity schedule. But how much more time? What’s the formula for measuring how much more we’ll accomplish in a given time span compared to the previous one? Ray says those who doubt his predictions don’t understand his exponential curve, and that may be true. After all, it’s his creation, and he picked all the arbitrary milestones and did his own calculations. And on top of his attempts to play the tech world’s Nostradamus, he also makes major mistakes about biology, like this one…

The amount of information in the genome (after lossless compression, which is feasible because of the massive redundancy in the genome) is about 50 million bytes (down from 800 million bytes in the uncompressed genome). It is true that information in the genome goes through a complex route to create a brain, but the information in the genome constrains the amount of information in the brain prior to the brain’s interaction with its environment.

Let’s see, where do we start with this one? Remember that in my previous post on the subject, I noted that the redundancy is there due to natural selection and that simply filtering out redundancies isn’t going to be of any help in retrieving only the information you need to recreate the instructions for a brain’s growth and bottom-up development. Also, how does Ray get the 800 million byte figure? If you store each nucleotide in your DNA as a character, in a language that allocates two bytes per character you’ll need two bytes per nucleobase in memory. So with 6 billion nucleobases, you’re looking at roughly 12 billion bytes, or about 11.2 GB, of data that you’d need to save to a persistent object, which will still be hundreds of megabytes even after some serious compression. Why do you think experts in genomics are calling for supercomputers to analyze genetic data? It’s not because the immense amount of information is easy to parse. And those “redundancies,” like repeated genes and STRs, are actually rather important to growth and development in ways we still don’t understand, because we lack both the biological knowledge and a really good algorithm for parsing nucleotide sequences in an efficient and practical manner.
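To put rough numbers on this, here’s a quick back-of-envelope sketch; the byte counts assume the naive representations discussed above, not any real genomics file format:

```python
# Rough storage math for a diploid human genome of ~6 billion nucleobases,
# under a few naive in-memory representations.
bases = 6_000_000_000

as_two_byte_chars = bases * 2   # two-byte characters, as in Java or C# strings
as_ascii_bytes = bases          # one byte per base (plain ASCII)
as_two_bit_packed = bases // 4  # 2 bits per base, 4 bases per byte

print(as_two_byte_chars)  # 12000000000, roughly 12 billion bytes
print(as_ascii_bytes)     # 6000000000
print(as_two_bit_packed)  # 1500000000, still 1.5 GB before real compression
```

Even the tightest fixed-width packing leaves you with hundreds of megabytes to gigabytes of raw sequence before any actual compression algorithm runs.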

With that in mind, let’s move on to Ray’s bizarre claim that the genome limits the amount of information in the brain prior to it actually growing and developing. He keeps using the word information, but I wonder what he actually means by it. In computer science, information is anything to be stored or computed in some way, shape or form; it comes from a database, or user input, or one of the processes of the program with which we’re working. But in biology, the genome guides the brain through key stages of its formation based on environmental cues and genetic triggers, and those triggers and cues begin as soon as cells start to divide into a new embryo. Where’s the limit on the information? What do you gain by knowing a string of nucleobases and the amino acids they encode when it isn’t actually a strict blueprint in the sense we’d expect from a program? And what information could be in the brain when that brain hasn’t even started to form and create connections? That’s what makes our brains what they are, and that should be the focus, instead of how we’re going to take a shortcut in our efforts to reverse-engineer them by ignoring their redundancies…

For example, the cerebellum contains a module of four types of neurons. That module is repeated about ten billion times. The cortex, a region that only mammals have and that is responsible for our ability to think symbolically and in hierarchies of ideas, also has massive redundancy. It has a basic pattern-recognition module that is considerably more complex than the repeated module in the cerebellum, but that cortex module is repeated about a billion times. There is also information in the interconnections, but there is massive redundancy in the connection pattern as well.

You know, for someone who says he studied this topic for four decades and is supposed to be up on virtually everything new in neurology and computer science, it’s pretty amazing that Ray is suggesting that all of those redundant connections could be ignored to get the structure of the brain and derive how it works on a neuron by neuron level. See, those redundancies are associated with higher cognition, and every living thing has a certain degree of redundancy as dictated by evolutionary processes. So what Ray is suggesting here is a terrific plan not to study what actually allows us to develop self-awareness and intellect, which we could really only get by studying the entire growth and development of the brain from day one. There are no shortcuts here and if Ray actually took the time to follow what neurologists and biologists say about these redundancies and how important their linking seems to be for high-level and complex brain functions, he would know that. Along the way, he also would’ve realized the true scope of the challenge. But then again, he really wants to become immortal so facing our limitations would also mean facing his fears. And he’s just not ready to do that.

addendum 08.24.2010: Okay, so it looks like I missed that Ray was thinking of a two-bit data type for the base pairs, which really would yield an appreciably smaller file. That said, his reasoning behind the million lines of code it would take to simulate a human brain is still wrong (please see the comments for an elaboration as to why), and considering that we would only capture a sequence of amino acids, we would still need far more data. In fact, as biologists responding to Kurzweil have pointed out, you can’t derive a brain from the genome, and we can’t even derive complete micro-structures from proteins yet because of all the complex interactions we have to take into account, interactions that depend on the development and the environment of the organism rather than its genome. Some of Ray’s defenders say that he’s not actually proposing to derive the brain from protein sequences, but if that’s the case, why even bother with DNA at all? Those sequences encode which proteins are to be made and how, so it’s how these proteins interact that’s of vital importance if you want to understand how a brain is built during development, not just which amino acids are being encoded with no additional context.

[ illustration by Goro Fujita, spotted on io9 ]

  • Don Roberto

    OK, it’s 0530 and I’m somewhat under the weather, so I’m probably about to make a fool of myself (I’m not so sick as to forget how often that happens), but isn’t “800 million bytes” in fact “hundreds of megabytes?”

  • Greg Fish

    Yes, you’re right, it would be hundreds of megabytes. But the problem is that this is his uncompressed value for a digital representation of a genome, which he says could somehow be reduced down to 50 MB, then halved to isolate the genes responsible for the nervous system, with each line of code assumed to take up 25 bytes of memory, so that the brain’s building blocks should be represented by a million lines of code or so.

    Which is not how code works. One line of code could result in megabytes of data being moved around in memory (like when you retrieve data from a database and load it into a data grid using wrappers), or conversely, return a 2 byte element. There’s no way to just average code together like Ray does.
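    To make both points concrete, here’s a sketch of the arithmetic I’m describing, followed by a counterexample showing why a fixed bytes-per-line figure is meaningless; the 25-byte number is Kurzweil’s, not any real metric:

```python
import sys

# Kurzweil's chain of arithmetic as I understand it:
compressed_genome = 50_000_000          # his ~50 MB compressed estimate
brain_related = compressed_genome // 2  # his assumed brain-related half
lines_of_code = brain_related // 25     # his arbitrary 25 bytes per line
print(lines_of_code)                    # 1000000

# But a "line of code" has no fixed data footprint:
big = list(range(1_000_000))  # one line, megabytes of list storage
small = 7                     # one line, a single small integer
print(sys.getsizeof(big) > sys.getsizeof(small))  # True, by a huge margin
```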

  • sriram srinivasan

    Actually, Kurzweil’s number is correct.

    We have 2.9 Billion base pairs; you merely need to encode one element of the pair (the other element is complementary)

    2.9 billion pairs

    = 2,900 million letters (A, C, G, T)

    = 5,800 million bits (at 2 bits per letter)

    = 725 million bytes
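    For what it’s worth, that arithmetic does check out if you assume two bits per letter:

```python
pairs = 2_900_000_000   # ~2.9 billion base pairs, one strand encoded
bits = pairs * 2        # 2 bits are enough to distinguish A, C, G, T
byte_count = bits // 8
print(byte_count)       # 725000000, i.e. 725 million bytes
```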

  • Greg Fish

    Ok, I see how Kurzweil arrived at this approximate number, but that brings up another issue. Why do we need to make a custom data type to encode the base pairs and then calculate their complements every time we load them into memory just to save a few hundred megabytes?

    You’re going to work with the data in bits and pieces anyway, and whatever you gain by shrinking genome data on a hard drive would be offset by the overhead of decoding the custom data type and calculating the complement. In the grand scheme of things, Kurzweil’s numerology is still wrong for the simple reason that you can’t just declare that each line of code is 25 bytes, as per my previous reply.

  • pwm

    Rethink your calculations: you do not need 2 bytes per base, you need 2 bits. After all, you only have 4 choices, so you could encode 4 bases per 8-bit byte, reducing your requirements by a factor of 8; thus 3 billion base pairs could be stored in 750 MB. This preserves the base information as laid out, unlike sriram’s calculation. The complementarity of the base pairs is important for replication and packaging. The actual information being stored is in base 3, with the codons, 3 bases coding for each amino acid, giving you 27 potential amino acid codes (humans use about 20), plus control codes.
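    A minimal sketch of the two-bit packing described here, four bases to a byte (purely illustrative, not any real genomics format):

```python
# Map each base to a 2-bit code and pack four bases into each byte.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def pack(seq):
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        b = 0
        for base in chunk:
            b = (b << 2) | CODE[base]
        b <<= 2 * (4 - len(chunk))  # left-pad a short final chunk
        out.append(b)
    return bytes(out)

print(len(pack("ACGT" * 1000)))  # 1000 bytes for 4000 bases
```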

    Some connections have to be “hardwired” in, such as controlling breathing, as well as the autonomic system. All higher animals have a cerebrum, the part used to store memories and make associations. In an individual who is decerebrate, having lost the connections with the cerebrum, certain motor functions and reflexes remain, but responses requiring cognition, such as speech and voluntary movement, are absent.

    You may want to look at the research relating chaos theory with cell growth and control.

  • Greg Fish

    giving you 27 potential amino acid codes (humans use about 20), plus control codes.

    All known living things on this planet use 20 amino acids with the exception of one bacterium which can synthesize one more. Again, I see that you can make a custom data type to compress the information in a genome, but what do you gain other than saving some space when that space isn’t a problem? I’ve made apps that decipher codons into their amino acids and found that using chars for nucleobases didn’t exactly compromise performance.
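    The kind of codon-deciphering app I mean boils down to a lookup like this; the tiny table below is illustrative, with only a handful of the 64 codons filled in:

```python
# A toy codon-to-amino-acid translator using plain one-byte characters;
# only a few of the 64 codons are included here for illustration.
CODON_TABLE = {
    "ATG": "M",  # methionine, the usual start codon
    "TGG": "W",  # tryptophan
    "GGC": "G",  # glycine
    "TAA": "*",  # stop
}

def translate(seq):
    """Read a DNA sequence three bases at a time until a stop codon."""
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "?")  # "?" marks codons missing from the toy table
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate("ATGTGGGGCTAA"))  # MWG
```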

    The days when we had to watch every errant bit and byte are gone, so Kurzweil trying to make a point of how much you can compress a genome’s data really isn’t addressing the real problem. We work with databases which have tens of gigabytes of data and so a single DNA strand isn’t that difficult to store. It’s the rate at which it can be read and the time it takes to actually analyze the genes and what they’re doing that takes the most effort.

    What Ray is trying to do is make the task seem a lot simpler than it is, which is why he’s worried about compressing the genome and converting it into a suspiciously small amount of code, dividing megabytes by a totally arbitrary number of bytes moved per line of code, even though that’s not how code works. And even more importantly, it doesn’t matter how many lines of code we need to simulate a brain. What matters is what we’ll be simulating.

    Some connections have to be “hardwired” in, such as controlling breathing…

    But that hardwiring takes place at development and is not necessarily encoded in the DNA strands. Just because we think something should be essentially spelled out in the genetic code, doesn’t mean it is.

    All higher animals have a cerebrum- the part used to store memories and make associations.

    Terms like “higher” and “lower” in relation to living things are arbitrary labels. If you’re talking about species with big nervous systems, they share many of the same patterns of neuron connections with all other living things, they just have more of them, as noted in the post’s links.

    You may want to look at the research relating chaos theory with cell growth and control.

    Quantifying biological entities doesn’t mean understanding them. Simulations and models do have their limits.