Monkeys, Men and Machines: How Mismatched Mistakes Might Help Improve Computer Vision.




Machines make different mistakes in image recognition than you or me or the average macaque, and that might tell us something about how machines see - and how we can help them to improve.


State-of-the-art deep artificial neural networks (such as GoogLeNet, ResNet, and Inception) perform splendidly on categorization tasks over datasets such as ImageNet, but they can be easily confused by adversarial examples: inputs engineered to produce errors. Researchers in the DiCarlo Lab at the Massachusetts Institute of Technology (MIT) found that while these neural networks perform similarly to primates in object category recognition, they differ in image recognition.

Object category recognition concerns performance over all of the examples in a category, whereas image recognition concerns the individual images. Monkeys, men and machines are similar in which categories they find difficult, but they differ in which individual images trip them up. Say there are 1,000 images of an iPhone. A monkey, a man or a deep convolutional neural network (DCNN) might do well in this category, getting 950 correct. But the 50 that the man and the monkey get wrong will be different from the 50 that the neural network gets wrong.
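
The arithmetic in this example can be made concrete with a short sketch. The image indices, and the choice of which images each observer misses, are invented purely for illustration and are not data from the study. The sketch assumes the two error sets overlap only partly, so both observers score the same on the category while making mostly different mistakes.

```python
# Hypothetical illustration: 1,000 images of one category, indexed 0..999.
N_IMAGES = 1000

# Assumption (not study data): the primate misses images 0-49,
# while the network misses images 25-74.
primate_errors = set(range(0, 50))
network_errors = set(range(25, 75))


def accuracy(errors, n_images=N_IMAGES):
    """Category-level score: fraction of images classified correctly."""
    return (n_images - len(errors)) / n_images


def error_overlap(a, b):
    """Jaccard overlap between two error sets (1.0 means identical mistakes)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


primate_acc = accuracy(primate_errors)
network_acc = accuracy(network_errors)
overlap = error_overlap(primate_errors, network_errors)

print(f"primate accuracy: {primate_acc:.2f}")
print(f"network accuracy: {network_acc:.2f}")
print(f"share of mistakes in common: {overlap:.2f}")
```

Both accuracies come out identical, so a category-level score alone cannot tell the two observers apart. Only the comparison of the error sets reveals the mismatch.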

The ImageNet dataset includes over 14 million images in 21,000 categories that DCNNs can recognize with over 97% accuracy. The human benchmark of 95% was surpassed in 2015. Since then, research has shown that adversarial examples can fool DCNNs into miscategorizing any image. This presents a grave danger for computer vision systems that need to be reliable, such as those used in autonomous vehicles.

One avenue to building more robust computer vision systems may be to take inspiration from the human visual system. Humans, after all, are not fooled by these adversarial examples.

Performance on large datasets is generally measured as accuracy over the entire dataset, but information about performance within a specific object category, e.g. “Dog - Shiba Inu,” and on specific images can be used to fine-tune aspects of DCNNs to achieve better overall performance. Researchers in the Department of Brain and Cognitive Sciences at MIT compared these category-level and image-level performance measures from several DCNNs with human and monkey behavior.
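
To show what separates a category-level measure from an image-level one, here is a simplified sketch that groups the same trial records in two ways. The trial data and the helper `mean_accuracy` are hypothetical, and plain accuracy is used as a stand-in; this is not the metric the researchers computed.

```python
from collections import defaultdict

# Hypothetical trial records: (category, image_id, answered_correctly).
trials = [
    ("dog", "dog_01", True), ("dog", "dog_01", True),
    ("dog", "dog_02", False), ("dog", "dog_02", True),
    ("car", "car_01", True), ("car", "car_01", True),
    ("car", "car_02", True), ("car", "car_02", False),
]


def mean_accuracy(records, key_index):
    """Average correctness, grouped by the tuple field at key_index."""
    totals = defaultdict(lambda: [0, 0])  # key -> [correct, seen]
    for record in records:
        entry = totals[record[key_index]]
        entry[0] += int(record[2])
        entry[1] += 1
    return {key: correct / seen for key, (correct, seen) in totals.items()}


category_accuracy = mean_accuracy(trials, 0)  # one number per category
image_accuracy = mean_accuracy(trials, 1)     # one number per image

print(category_accuracy)
print(image_accuracy)
```

In this toy data the two categories look equally hard, yet the per-image scores show that the difficulty is concentrated in particular images. That finer-grained pattern is the kind of signal an image-level comparison between observers relies on.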

Rajalingham and colleagues first generated 2,400 images of 24 objects. These images were presented to humans via Amazon’s Mechanical Turk and to rhesus macaque monkeys via a system the researchers dubbed MonkeyTurk. Both humans and monkeys were briefly shown an image and then had to determine to which of two categories it belonged. The researchers captured the results of over one million behavioral trials. The same images were also presented to DCNNs (including AlexNet, NYU, VGG, GoogLeNet, ResNet, and Inception-v3) that had already been trained on the ImageNet dataset.

DCNNs show superhuman performance on the ImageNet dataset, but the way they achieve this level of performance may differ significantly from the way humans do. Object categories that were difficult to recognize in this study were difficult for humans, monkeys and machines alike, but primates and DCNNs differed in which specific images were hard to categorize.

That can tell us something about how machines see and may allow researchers to improve computer vision.

An analogous problem emerged earlier this year with Psychlab, a testing suite for deep learning and reinforcement learning from Google’s DeepMind. Psychlab makes it possible for reinforcement learning agents to take human psychological tests so that their performance can be compared with that of humans.

In one case, researchers asked one of their best-performing and most highly tuned artificial agents to take a basic cognitive-science test that involves looking at different sets of concentric circles. When the agent performed badly relative to humans, the researchers hypothesized that this might be because the agent’s vision system works differently from that of humans.

With a traditional convolutional neural network, the receptive field is uniform, but humans have ‘foveal vision,’ with more receptors at the center of their field of vision than at the periphery. So the researchers bunched up the receptive field of the agent’s neural network, imitating human vision, and the agent then started to pass these tests with closer to human-like performance.
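
The contrast between uniform and foveal sampling can be illustrated with a minimal one-dimensional sketch. The function `foveated_positions` and its power-law warp are assumptions chosen for simplicity; they are not the scheme used in the DeepMind work.

```python
def foveated_positions(n_samples, power=2.0):
    """Return n_samples positions in [-1, 1], packed more densely near 0.

    Assumption: a simple power-law warp stands in for the fovea.
    Requires n_samples >= 2.
    """
    positions = []
    for i in range(n_samples):
        u = -1.0 + 2.0 * i / (n_samples - 1)  # uniform grid in [-1, 1]
        sign = 1.0 if u >= 0 else -1.0
        positions.append(sign * abs(u) ** power)  # pull samples toward 0
    return positions


def count_central(positions, radius=0.3):
    """Number of sample positions within `radius` of the center."""
    return sum(1 for x in positions if abs(x) <= radius)


uniform = [-1.0 + 2.0 * i / 8 for i in range(9)]
foveated = foveated_positions(9)

print("uniform :", uniform)
print("foveated:", foveated)
print("central samples:", count_central(uniform), "vs", count_central(foveated))
```

With the same number of samples, the warped grid spends more of them near the center of the field and fewer at the edges, which is the basic idea behind imitating foveal vision.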

Further research may identify the difference between primate and DCNN vision so that DCNNs can be tweaked to be closer to primates and less susceptible to adversarial attacks that don’t fool humans.

Rajalingham and colleagues at MIT also examined whether individual variability between humans might allow DCNNs to accurately model certain individuals, but showed that this is very unlikely mathematically. They also modified the most human-like DCNN tested, Inception-v3, to determine whether minor modifications or fine-tuning might result in human-like performance. None of these changes resulted in human-like performance at the image level. Moreover, none of the manipulated image attributes could account for differences in behavior. It is likely that the network architecture or optimization process will have to change before human-like image-level performance can be seen in artificial neural networks.

The idea that fundamental flaws in DCNN architectures make these networks incapable of capturing human-like behavior may appear discouraging, but the methods described in this research may provide a benchmarking tool for future computer vision architectures. Such a benchmark may well drive innovation that results in networks that, like humans, can trust what they see.