
how a simple a.i. beat a crowdsourcing service

On the surface, a basic AI handily beat humans in a classification task. But if you look at the details, its victory wasn't exactly a clean one.
philosobot v2

Last year, we reviewed how robots were taking over more and more administrative tasks usually done by flesh and blood humans, and what that could mean for the global economy of the future. As 2010 ended, tech writer Christopher Mims decided to highlight a study in which an AI beat humans at a simple, repetitive, but very necessary task: classifying businesses for a popular website. Since this sort of classification helps search engines display better and more relevant listings, the study’s results would seem to indicate that tech companies which need to tag specific places should steer clear of humans and rely on a very basic system that uses Bayesian statistics to do the tagging for them. That would challenge the long-established DMOZ model, in which a human-reviewed and organized list of links is crawled by most commercial web spiders to create an initial set of relevant and reliable search results. But if you actually read the study in question, you’ll find that far from showing that crowdsourcing can be replaced by a computer, it’s actually a tangled mess.

Here’s the setup. Using the anonymous crowdsourcing tool Mechanical Turk, employees of Yelp presented a qualification test on classifying businesses and verifying their basic information to 4,660 people. Only 79 passed, and out of those 79 anonymous workers, few actually did their tasks well; too many hovered around 50% accuracy, the equivalent of random guessing. The workers performed worst at tagging business categories, with the most thorough freelancers achieving an accuracy rate of just 67%, and the group as a whole topping out at 79%. Why was the performance so poor? Partially, the authors blame the extremely low wages paid for the task and a potential language barrier, something the Yelp team was aware of but decided not to pursue beyond some vague guesstimates based on timestamps. The team also says it sporadically tried different methods for letting the anonymous workers earn bonuses or complete their work, without providing any detail other than a quick note about the attempts failing. Considering that the bonuses in question were literally a few cents per task, it’s hardly surprising that they didn’t work. In fact, the paper emphasizes that the task is probably boring, the pay is way too low, the workers don’t really care, and that completion times can’t be measured reliably since a lot of workers are constantly competing for tasks and accept them before actually working on them.

So if Mechanical Turk is so bad, why does Yelp trot out a Bayes classifier and start comparing it to the subpar work it received? Of course creating a statistical model of a business, based on how many times the key words for a proper classification appear in the 12 million user reviews in its training set, will yield a far better and more reliable set of results than bored and underpaid humans. The AI can search through every last review about a particular business, calculate the frequency of a certain word appearing in those reviews, then factor in the probability of that word appearing for any business at all. A human who doesn’t have a big set of user reviews to navigate before rendering a decision has to resort to a web search if the business’s name is in any way ambiguous, and considering that she’s being paid pennies per task, running searches is probably not worth it when she can just guess to complete the job and move on. The idea that this demonstrates a computer is better than humans at tasks that supposedly require human attention to complete properly is not supported by four pages of waffling between brief mentions of different experiments and hypotheses built on nothing more than speculation, backed by no supporting data. This is why, in his conclusion, Mims has to backpedal away from his sensationalistic title and his questions about how inept the Turkers must be to have over 98% of their qualification tests rejected and then be beaten by a computer at their jobs.
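
For readers curious what a classifier like this actually does, here’s a minimal sketch in Python. It assumes a word-frequency naive Bayes model trained on review text paired with category labels, which is the general approach the paper describes; the toy reviews, category names, and library choice (scikit-learn) are illustrative assumptions on my part, not details from Yelp’s system.

```python
# Minimal sketch of a word-frequency naive Bayes business classifier.
# The toy reviews and labels below are illustrative only, not Yelp's data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Training set: review text paired with the business category it describes.
reviews = [
    "great tacos and cheap margaritas",
    "the burrito was enormous and the salsa fresh",
    "my stylist gave me the best haircut in years",
    "walked in without an appointment and got a quick trim",
]
categories = ["restaurant", "restaurant", "salon", "salon"]

# Count how often each word appears in the reviews for each category...
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews)

# ...then estimate P(word | category) along with each category's prior.
model = MultinomialNB()
model.fit(X, categories)

# Classify an unseen business from a snippet of its reviews.
print(model.predict(vectorizer.transform(["amazing enchiladas and friendly staff"])))
# -> ['restaurant']
```

All the model does is weigh relative word frequencies, which is exactly why feeding it millions of reviews makes it look unbeatable next to workers who are guessing their way through the same labels for pennies.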

Again, this study doesn’t really make sense and seems more like a ploy for Yelp to get companies that need classification tags for local businesses to ask it for help and technical advice. Considering the extremely questionable reputation Yelp has in the tech world, and that the site used to be better known for throwing an occasional wild party or three than for providing honest or fair business reviews, I really wouldn’t be surprised to find out that this was the case. After all, if their goal was to honestly compare human accuracy to that of their supervised learning algorithm, they would’ve tried harder to find more qualified humans whose performance wasn’t tied to their pay, or at least figured out what exactly was going on with Mechanical Turk to yield such abysmal results, rather than guessing at their workers’ home countries based on a timestamp, which only tells you when a task was done and nothing else, and then speculating on how bored they probably are. If you want to declare that an AI beat humans in a classification contest based on Yelp’s mess of an experiment, you’ll also have to admit that the contest was rigged in favor of the machine, either for business or PR reasons.

# tech // artificial intelligence / business / computer science / media

