putting visual recognition software to the test
Look around you for a second. In front of you there’s probably a computer, your fingers are on a keyboard or a mouse, and maybe there are cars driving by outside your window on tree-lined streets, or through a maze of buildings of all shapes and sizes. And it probably took you just a few hundred milliseconds to identify virtually every object within your line of sight. For living beings, recognizing things in view is so easy and happens so often that we don’t even think about it. But when it comes to the machines we use, the story is very different. As mentioned in an older post about machines learning unusual skills like sarcasm, computers haven’t a clue about the objects around them, even if they’re equipped with cameras. Even more troubling, all the top-performing prototypes for image recognition systems we have right now may actually perform as well as they do thanks to a lack of rigid controls in experiments rather than the soundness of their algorithms, according to a neuroscientist’s study of the different approaches adopted by those who want to make machines see.
When it comes to tasks usually tackled by artificial intelligence researchers, there are two main approaches: statistical models, in which machines find patterns in data programmatically, and biological models, which try to emulate the minds of living things. Not all biological models are as accurate as we’d like, and the standard yardstick for comparing image recognition systems of either stripe is a benchmark known as Caltech101. As its name implies, it contains 101 different categories of images, with a background category thrown in as a bonus for a total of 102 sets of pictures, and a system being tested has to recognize the object in each picture by statistically matching its features against the labeled examples in its database. Top performance on the benchmark tends to hover right around the 60% accuracy mark. Not exactly impressive to us, who do this virtually every waking moment of our lives, but for a collection of transistors and logic gates it’s pretty good, especially since with 102 categories to choose from, random guessing would give a near zero chance of getting through the whole experimental set. But MIT’s Nicolas Pinto had some questions about how well those benchmark scores really reflect vision, and decided to pit a simple computer representation of the V1 area of the visual cortex, built by his team, against the Caltech image set and a special set of his own choosing, alongside a very successful recognition algorithm. The results were both interesting and somewhat unnerving.
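To make the numbers concrete, here’s a toy sketch of the kind of database matching described above, done as a nearest-neighbor lookup over 102 categories. Everything in it is fabricated for illustration (the feature vectors, example counts, and noise levels are invented, not taken from Caltech101 or Pinto’s paper); the point is just that the chance baseline for random guessing sits below 1%.

```python
import numpy as np

rng = np.random.default_rng(0)
n_categories = 102        # 101 object classes plus the background class
n_train_per_class = 15    # hypothetical number of stored examples per class
n_features = 64           # hypothetical feature-vector length

# Fabricated stand-in features: each class clusters around its own prototype.
prototypes = rng.normal(size=(n_categories, n_features))
train = prototypes[:, None, :] + 0.5 * rng.normal(
    size=(n_categories, n_train_per_class, n_features))
train = train.reshape(-1, n_features)
labels = np.repeat(np.arange(n_categories), n_train_per_class)

def classify(x):
    """Nearest-neighbor match: return the label of the closest stored example."""
    dists = np.linalg.norm(train - x, axis=1)
    return labels[np.argmin(dists)]

# A query drawn near class 7's prototype should usually match class 7...
query = prototypes[7] + 0.5 * rng.normal(size=n_features)
print(classify(query))

# ...while blind guessing across 102 categories is right less than 1% of the time.
print(1 / n_categories)  # ≈ 0.0098
```

On cleanly separated data like this, nearest-neighbor matching looks almost effortless, which is exactly why a 60% score against a fixed benchmark can be hard to interpret on its own.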
The V1 model did extremely well out of the box, performing right at the level of today’s top image recognition systems and correctly identifying objects up to 67% of the time. Then Pinto switched gears and added a little bit of variation, and that’s when the results started to go south in a hurry. You see, variation in images is what makes computer scientists working on visual AI systems wake up covered in cold sweat. It’s the Achilles heel of every optical recognition algorithm, because computers “see” images as a series of pixels. Change how a previously familiar pixel outline looks just a little bit and the machine is totally confused, because its code tells it that it’s looking at something completely new. Where we can not only recognize keyboards at any angle and in virtually any light, but know that touch keypads on smartphones, desktop keyboards, and their laptop-based siblings are all just different versions of keyboards, a computer wouldn’t know a phone’s keypad from a flying saucer if you turned the phone sideways. So when Pinto started adding images with a lot of variation to the mix, the accuracy rates for both recognition systems used in the experiment plummeted to the levels we could expect from haphazard guessing. And as variation increased, the computers got worse and worse, just as Pinto’s team predicted. Of course, you may wonder how these computers ever got to above-mediocre levels in the first place, and the answer to that question shines a rather negative light on today’s vision-oriented AI.
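That pixel-level fragility is easy to demonstrate. In the toy sketch below (the “keyboard” is just a fabricated stripe pattern standing in for a row of keys, not real image data), nudging the very same pattern a single pixel sideways wipes out a naive pixel-wise match entirely:

```python
import numpy as np

# Toy "keyboard": thin vertical stripes standing in for a row of keys.
img = np.zeros((32, 32))
img[10:20, 8:24:2] = 1.0

# The exact same object, shifted one pixel to the right.
shifted = np.roll(img, 1, axis=1)

def pixel_overlap(a, b):
    """Fraction of 'on' pixels the two images share (a crude pixel-wise match)."""
    return (a * b).sum() / max(a.sum(), b.sum())

print(pixel_overlap(img, img))      # identical images: 1.0
print(pixel_overlap(img, shifted))  # same object, one-pixel nudge: 0.0
```

A one-pixel shift takes the match from perfect to zero, even though any human observer would call the two images the same thing. Real systems use smarter features than raw pixel products, of course, but the underlying sensitivity to position, angle, and lighting is the same weakness Pinto’s variation exploited.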
Usually, when trying to teach an optical recognition system to differentiate between objects, researchers use something they call “natural images,” much like the Caltech101 approach. Those images are usually a pseudo-random collection of pictures containing the objects the researchers want the system to recognize in a wide variety of lights, settings, and angles. However, as Pinto points out, very few people take truly random pictures. Instead, we zoom in on our subjects and center them, making it easier for recognition algorithms to grab on to certain patterns. Not mentioned in his paper, though, is another very important caveat of even the most advanced and sophisticated software tasked with making sense of images. When using any type of statistical association, it’s difficult to predict what exactly the computer will latch onto as the object it will keep recognizing. It could be the sky, it could be a part of the object, or it could be something present in a lot of backgrounds, like a certain mountain. A picture shot near the ocean, at night, and with the object at a different angle will bring successful object recognition algorithms to their knees. Clearly, there’s still plenty of work to do before computers can parse images in a consistent manner, and Pinto’s experiment is an important reminder that we need to keep testing our software as brutally as we can before concluding that we’re really making headway in an area where we’ve just begun to scratch the surface.
See: Pinto, N., et al. (2008). Why is Real-World Visual Object Recognition Hard? PLoS Computational Biology, 4(1). DOI: 10.1371/journal.pcbi.0040027