for the good of science, publish your code?

February 15, 2012

Today, we’re drowning in data. Scientists can collect so many facts and figures that it’s humanly impossible to sit down, roll up one’s sleeves, and sort it all out. Meanwhile, torrents of papers are being published, so many that computers may soon be assigned to create reading lists for scientists. And since there’s so much data for mining, analyzing, and eventually feeding conclusions in research projects, it’s no wonder that a lot of code is custom written to build models and analyze important metrics. However, a lot of this code doesn’t necessarily get published with the conclusions, and according to an op-ed in Bloomberg, this hurts peer review so much that there ought to be a law requiring scientists using public funds to disclose any and all source code they used to arrive at their conclusions. The idea is that when someone can just run your code and replicate what you did in the lab, they can better verify your conclusions and build on them to make new contributions. This is terrific as an idea, but in practice there may be far more difficulty in reproducing someone’s code-related work than the authors seem to think. If anything, the code may drown reviewers in minutiae and added expense…

We’re used to programs being more or less universal nowadays and expect a program we downloaded on a computer at work to behave exactly the same way at home as long as the operating system is the same, though with the new platform-independent UI frameworks the OS is becoming less of a limitation. Academics don’t have to worry about consistent cross-platform, cross-browser behavior because they are not creating a tool for millions of potential users. They’re working on tools for their labs, on their systems, using their own setups, which might have special tweaks that may be very expensive to duplicate. The software could be written in a language that’s outdated as far as other systems are concerned, and be troublesome to compile or run. Try to run an application in FORTRAN on Windows 7 and you’ll find that while you can do it, you’ll likely need to emulate a chunk of Linux and use an open source compiler that may or may not be well suited for what you’re doing. In addition, any special tweaks for parallel processing on your lab’s mainframes or server farms could well have to be rewritten since the computer on which the code is being run is different. How will the program perform? Could it run into unforeseen restrictions? Will it play nice with the reviewers’ systems? You don’t know until you try, debugging any issues along the way, which may mean that you end up with very different code by the time you’re done.
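
To make the worry about rewritten parallel tweaks a little more concrete, here is a minimal sketch in Python of one way a port can quietly change results. It assumes nothing about any particular lab’s code: the `sum_in_order` helper and the `readings` values are made up for illustration. All it shows is that floating point addition is sensitive to order, so splitting the same work differently across processors can shift the last digits of a total.

```python
# Hypothetical illustration: the same three numbers, added in two different orders.
def sum_in_order(values):
    """Add floats strictly one after another, the way a simple serial loop would."""
    total = 0.0
    for v in values:
        total += v
    return total


readings = [0.1, 0.2, 0.3]  # made-up values, not taken from any study

forward = sum_in_order(readings)             # the order used on the original machine
backward = sum_in_order(reversed(readings))  # an order a re-partitioned job might use

print(repr(forward), repr(backward))
print("identical:", forward == backward)
```

The two totals differ only in the last digit, which is harmless here, but a long simulation that feeds each result into the next step can amplify that kind of discrepancy, and a reviewer would have to work out whether it comes from the port or from the science.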

But again, it’s possible, although the task is likely to take you some time and troubleshooting. Now you could run into another problem, one that has nothing to do with the code and everything to do with the humans who need to review it. Not everyone is an expert in the same programming languages, and simply knowing how to code, or how to learn a language, is often not enough, especially when it comes to evaluating advanced tricks to boost performance. The reviewers may decide to just give the code a pass and risk letting erroneous work be published, or call in an expert who deals with the particular language all the time to scrutinize every line. Considering that I’ve rarely met programmers who don’t complain about others’ code (a professional illness, I’m afraid), it’s very easy to imagine a stickler surveying the code and flagging as a potential problem with the research every loop, index value, or variable he doesn’t agree with or whose rationale he doesn’t see. Far from making the research more easily reproducible, this would leave it just as difficult to verify at best and harder at worst. It won’t be impossible, but it will certainly add cost and time to the peer review process as the lab setup is recreated and the code scrutinized. Make the review process too long and drawn out and you discourage scientists from publishing, since their universities are breathing down their necks about publications and new grants from past research.

Of course we could try to design a set of standardized systems on which scientists are to write and execute their programs, and mandate that the code must compile and run on one of these systems for review, just like undergraduate programming students are instructed to do when they turn in their homework. But restricting the platforms available for bleeding edge research also stifles innovation and would be unworkable in both principle and cost. We could and should encourage scientists to share their code with the public when their methodology is very complex and errors can easily creep into the complicated code, but we certainly shouldn’t ensnare the government in something it’s both ill-prepared and ill-equipped to handle. Just consider its recent history with bills relating to technology in general. Adding a scientific aspect to all this, especially when politicians already have a very tense relationship with the scientific community, seems like asking for trouble. Yes, making sure that code used to arrive at potentially major discoveries, or code which may be used to drive big policy decisions and alter the flow of grant dollars, is up to snuff is certainly a good idea, and yes, it certainly can be done. However, making it a legal requirement and hanging peer review on what could be tens of thousands of lines of code not written with easy recompilation and execution in mind would cause as many problems as it’s intended to solve, with one of them being as drastic as having reviewers test code radically different from the code used in the study.