censoring the web one keyword at a time
Digital rights activists often bemoan legitimate sites being blacklisted by telecommunication companies that government agencies around the world have asked to filter certain types of content. The companies' lax attitude towards what may or may not get blocked, the activists say, restricts the flow of free speech across the web: a careless flip of a switch shuts down a legitimate site just as easily as it shuts down truly objectionable content. Sure, this may seem dire when viewed from that angle, but the fact of the matter is that the companies being put in charge of filtering the web can't help yanking the plug on perfectly legit community websites and legally protected blogs, since most filtering technology is a rather crude, blunt instrument. Instead of surgically pinpointing sites for a blacklist the way a gardener scans a flower bed for weeds, they spray a thick cloud of herbicide and hope it kills fewer plants than the offending flora. We generate so much complex data online that trying to filter it effectively would be like telling an incoming tsunami to stop, separate itself into liters, and have each liter tested for its chemical content and salinity. But that's what these companies are tasked with doing, and a lot of small, good sites get hit.
Of course the big question is why governments are using filters in the first place. In countries with authoritarian governments, it's obviously a means of control. In democratic countries, it's usually a response to an amorphous threat like terrorism or child pornography. Both types of nations use the same basic approach: index as many sites as possible along with their IPs and URLs, programmatically create a custom-tailored blacklist, then set critical routers to reject traffic to and from the blacklisted sites. Since telecommunication companies generally control these networks, they're the ones who carry this out. One rather basic strategy for even beginning such an enormous task is to create a web crawler, a simple and efficient program that reads the HTML loaded from URLs and parses it for keywords, key phrases, or anything else its authors request. Drop this crawler into a directory like DMOZ, have it follow the thousands of links in every category you want as a starting point, and let it branch off until the trail of hyperlinks brings it full circle or it runs out of links to scan. In a matter of days, you could easily have an index covering several million sites.
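To make the crawl-and-flag approach concrete, here's a minimal sketch in Python using only the standard library. The keyword list, URL, and sample page are all hypothetical, and a real crawler would fetch pages over the network and maintain a visited-set and queue; this just shows the core step of parsing one page for outgoing links and objectionable terms.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical keyword list supplied by whoever commissions the filter.
BLOCK_KEYWORDS = {"contraband", "forbidden"}

class PageScanner(HTMLParser):
    """Collects outgoing links and visible text from one HTML page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # Resolve every <a href="..."> against the page's own URL.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

    def handle_data(self, data):
        self.text_parts.append(data)

def scan_page(url, html):
    """Return (outgoing_links, flagged) for one fetched page."""
    scanner = PageScanner(url)
    scanner.feed(html)
    words = set(re.findall(r"[a-z']+", " ".join(scanner.text_parts).lower()))
    return scanner.links, bool(words & BLOCK_KEYWORDS)

# Demo on a canned page instead of a live fetch:
sample = '<html><body><a href="/page2">next</a>Nothing forbidden here.</body></html>'
links, flagged = scan_page("http://example.test/", sample)
print(links)    # ['http://example.test/page2']
print(flagged)  # True
```

The crawl loop itself would just pop a URL, fetch it, call `scan_page`, add flagged URLs to the blacklist, and push any unseen links back onto the queue.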
Whenever the crawler finds objectionable keywords, it adds the URL and IP to the blacklist, which routers then use to block any requests to those sites. As the crawlers keep crawling and re-crawling, updating the blacklist, the routers keep blocking and unblocking traffic. So where do humans join the process? As administrators and debuggers when something goes awry. There's simply too much data for a human to parse otherwise, and the team supervising the implementation of the blacklist may never know what sites it actually shut down, or why, unless they're explicitly pointed to a particular case and their crawlers left thorough logs of how they made their decisions. The whole thing is as close as you can get to arbitrary without fitting the formal definition, because it works by simply checking whether a site's HTML contains the wrong word more than a few times. You could conduct a contextual analysis, but that's no guarantee of accurate decision-making on the crawler's part, because computers aren't great at following human context. And while we do have machines that can summon the computing grunt to work with natural language, it's unlikely that many telecommunication companies are going to invest the cash needed to run a sophisticated analysis of what particular websites actually say. It's cheaper for them to block, then unblock after a complaint or two.
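The "wrong word more than a few times" decision can be sketched in a few lines; the threshold, flagged terms, and sample article below are hypothetical. The point of the demo is the failure mode the paragraph describes: a news report *about* a riot trips a frequency-based filter exactly as incitement would.

```python
import re

# Hypothetical policy: block any page with this many hits on a flagged term.
HIT_THRESHOLD = 3
FLAGGED_TERMS = {"riot"}

def should_block(page_text):
    """Naive keyword-count decision: no context, just raw frequency."""
    words = re.findall(r"[a-z]+", page_text.lower())
    hits = sum(1 for w in words if w in FLAGGED_TERMS)
    return hits >= HIT_THRESHOLD

# A straight news report trips the filter just like incitement would:
article = ("Police contained the riot by evening. Officials said the riot "
           "began downtown, and riot damage is still being assessed.")
print(should_block(article))  # True
```

This is why the false positives the activists complain about are baked in: the function has no notion of who is saying the word or why, only how often it appears.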
But again, why are governments that generally allow free speech, free expression, and freedom of religion, and do nothing about their citizens accessing adult entertainment sites, in the web censorship business? My opinion is that this can be chalked up to a glaring ignorance of how the web works, combined with politicians' need to be seen taking action when their constituents complain. Riots and looting haphazardly coordinated via social media? A child porn sharing ring hiding in some dark recess of the web? Minors regularly viewing porn tube sites that'll just take their word for it that they're 18? Block it! After all, that's how it worked in the old days, right? You saw a questionable or objectionable bit of content on TV, in a newspaper, or tucked away on bookstore shelves, a complaint was written, and the content was pulled. So obviously, we'll just have someone block the content on the web, because that's how we've always dealt with objectionable material, right? Actually, no. It's one thing to shut down a site peddling illegal wares or doing something both exploitative and criminal, but proactive scans of the web don't work that way. There's no good way to run a site off the web: it can always be moved to an unreachable server farm in a country where its pursuers have no jurisdiction. Meanwhile, trying to block sites by a crawler's keyword analysis only makes people upset that they can't get to some of their favorite sites after a contributor used the wrong word or the wrong quote and got flagged by an automated system.