Back in September, news outlets worldwide reported on a paper which claimed that a supercomputer had a knack for predicting revolutions and key global events, able to pick up on the events of Tahrir Square in Cairo and even get a fix on Osama bin Laden’s location. After reviewing the paper in question, I quickly got a strong vibe of many previous projects that tried to use computing data to predict the future, projects a lot like Nexus 7, an attempt to mine reams of correlated data for predictive markers. Amazingly, after decades of failure to do that, there are still computer scientists who believe that all they really need is more data and then they’ll find what they want. As I wrote before about such attempts, more data alone cannot yield accurate predictions. The supposed success of the supercomputer in question is actually a retroactive look at speculation, followed by the claim that because negative sentiment about Mubarak in Egypt was widespread and because rumors of bin Laden hiding out in Pakistan persisted for years, the supercomputer effectively predicted both. And this is essentially what economist Tim Harford astutely called the God Complex in a relevant TED presentation.
Now, let’s say that the supercomputer in question was given a set of events like the sudden chain of extreme protests in the Middle East, which saw over a dozen people self-immolate in front of government offices, and in response spat out a chain of events for the Arab Spring, predicting the toppling of the autocrats in Tunisia and Egypt, the civil war in Libya, and the assassination attempt on Yemen’s Saleh. That would be an impressive result and certainly the methodology used to arrive at these conclusions would merit further study. However, I am not aware of any computer coming up with such results. In fact, the paper’s model simply reflected all the buzz about the growing protest movements in Egypt and managed to pinpoint the FATA region of Pakistan as bin Laden’s hiding spot, which is not even close to where he was actually found. It was simply echoing the pundits who said that FATA was home to Taliban groups and al Qaeda elements which would be happy to harbor him and very loath to cooperate with any authorities looking for him, no matter what those authorities offered in return. This means that we’re not looking at a predictive model but at a news aggregator which knows how to search for a few preset keywords in the articles it’s fed and arrive at a general “mood” of the media.
As an attitude barometer, this machine is fairly effective. But as a predictive model? Not even close. You could even make the same kind of model at home and see its shortfalls for yourself. Simply make a list of negative words like “autocratic,” “tyrannical,” “aggressive,” and “outcry,” a list of positive words like “approval,” “cheers,” “welcomed,” and “helpful,” and a list of neutral words like “consensus,” “mediation,” “satisfied,” and “relaxed,” then feed them into a script that parses a news article and identifies said words. Then, have the script score each word it finds according to its category. For example, a positive word would be a 1, a neutral word would be a zero, and a negative word would of course be a -1. Average your scores together to get a number between the -1 and 1 bounds and assign that to the news article. Likewise, you should also identify the cities and countries the news comes from (virtually always listed in the header of a wire service release) so you can map the location. Finally, assign a location flag and a color between green and red with which to flag your article on a map. Keep scanning article after article until you get a lot of data points, connections, and red and green flags. This step may take you a while unless you have a supercomputer. Then, after you’re all done, take a look at your map and try to predict the next war, revolution, and scientific breakthrough.
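The scoring recipe above can be sketched in a few lines of Python. This is a minimal illustration and not the paper’s actual method: the word lists are just the examples given here, the function names score_article and flag_color are made up for this sketch, an article with no listed words is assumed to score zero, and pulling the location out of the wire service header is left out entirely.

```python
import re

# Example word lists from the text; a real run would need far larger ones.
NEGATIVE = {"autocratic", "tyrannical", "aggressive", "outcry"}
POSITIVE = {"approval", "cheers", "welcomed", "helpful"}
NEUTRAL = {"consensus", "mediation", "satisfied", "relaxed"}


def score_article(text):
    """Average the +1 / 0 / -1 scores of every listed word found in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    scores = []
    for word in words:
        if word in POSITIVE:
            scores.append(1)
        elif word in NEUTRAL:
            scores.append(0)
        elif word in NEGATIVE:
            scores.append(-1)
    # Assumption: an article with no listed words counts as neutral.
    if not scores:
        return 0.0
    return sum(scores) / len(scores)


def flag_color(score):
    """Map a score in [-1, 1] to an (R, G, B) tuple running from red to green."""
    t = (score + 1) / 2  # 0.0 at fully negative, 1.0 at fully positive
    return (round(255 * (1 - t)), round(255 * t), 0)


if __name__ == "__main__":
    sample = "The autocratic regime met an outcry, though some welcomed mediation."
    mood = score_article(sample)
    print(mood, flag_color(mood))
```

Notice that nothing in the script looks forward in time. It only summarizes the word choices of whatever articles it is fed.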
Kind of a challenge, isn’t it? How accurate do you think you will be? And keep in mind that you have to have an extremely well-balanced base of news sources. Your map after a few thousand Fox News articles and your map after roughly the same number of AlterNet articles are bound to look very different since the reporting biases will influence word choices, and remember that your entire model runs on those bias-affected words. A world pictured by writers who are on the far right is rather different than the world pictured by those on the far left. Which one would you choose as the most reliable model? Do you trust your own worldviews and those of your news sources to be as impartial as possible, and trust sheer quantity to balance out every bit of spin and bias, no matter how slight? It would also be interesting to include foreign language sources and see what they say. Come to think of it, this might actually be a very interesting experiment to conduct and it might tell us even more about the state of the press at any given time period. Just don’t use the results to try and predict what will happen over the next year. Many sages have tried and failed, and for good reason. A mutation of post hoc ergo propter hoc is very limited in what it can offer an aspiring soothsayer, so if you really want to try to be one, I suggest cold reading. It’s about as effective and requires a lot less coding and a lot less math.
See: Leetaru, K. (2011). Culturomics 2.0: Forecasting large-scale human behavior using global news media tone in time and space. First Monday Online Journal, 16 (9).