Capsule Networks As a New Approach to Image Recognition

2018-01-16

"Finally something that works well" (G. Hinton)

Several months ago, the IT world was stirred by news of a completely new way of organizing neural networks, one that could become a successful alternative to traditional convolutional networks. For now, we are talking about a new approach to image and video recognition, although it can be assumed that the full potential of capsule networks has not yet been revealed and that their real uses will be even wider. Or is everything exactly the opposite? The first results turned out to be promising; however, significant work still lies ahead.

Now that the hype over CapsNets has settled down a little, it's time to take a sober look at them and evaluate their pros and cons. Let's try to answer the following questions:

Why do we need to replace traditional neural networks?
What is a capsule network and what can it give us?
What's next and what difficulties should we expect?

How Did It All Begin?

The idea of capsule networks is not new at all. For image recognition in general, and for capsule networks in particular, we are indebted to Geoffrey Hinton and his team. Back in 1986, Hinton explained how backpropagation could be used to train deep networks. However, the required computing power was not available at that time, so it was only in 2012 that he was able to demonstrate his idea, which became a serious step in the development of deep learning.

Although this was a breakthrough in image recognition that found application in many areas over the following five years, Hinton was always aware of the shortcomings of his idea and did not stop searching for a better solution, considering the concept of capsule networks for almost 40 years: "I think the way we're doing computer vision is just wrong. It works better than anything else at present, but that doesn't mean it's right." The idea of capsule networks was first raised in 1979 and formulated in 2011. At last, in the fall of 2017, Hinton published two papers, "Dynamic Routing Between Capsules" and "Matrix Capsules With EM Routing", which explained how CapsNets are organized and initiated further research on them.

What Is Wrong With CNN?

Some properties of convolutional neural networks (CNNs) prevent us from using them as often as we would like. First of all, they require a significant amount of training data. When it comes to recognizing pictures of animals or people, this is not a problem: the Internet is full of samples of every kind. However, things are more complicated in specific areas such as medicine, where a sufficient number of examples may be unavailable. Hence, new networks should be able to learn from small datasets.

The next issue is the inaccuracy of traditional neural networks, which stems from the way they work. CNNs pay attention to the presence of certain key features in the image while ignoring the position of those features relative to each other. The accuracy of such networks is far from ideal, and it drops even further when you rotate the image, pick a suitable background, and so on. Such a network is easy to deceive, which makes it vulnerable to "white box" attacks.

To increase the accuracy of neural networks, we need to bring the way they work closer to the human way of thinking. A child does not need to see thousands of images to understand what a house looks like. The ability to generalize also helps a child recognize the same object from different angles, which cannot be said of many current CNNs. Moreover, the presence of certain features alone does not tell a child what object is depicted if their mutual arrangement does not make sense. Fortunately, capsule networks seem to fix these disadvantages.

So What Are Capsule Networks?

The main distinguishing feature of capsule networks is the introduction of an intermediate building block between a neuron and a layer: the so-called capsule. A capsule is a nested set of neural layers. While an ordinary neuron outputs a single scalar value, a capsule outputs a vector that represents a generalized set of related properties. A capsule tries to capture many features of an object inside an image, such as pose (position, angle of view, size), deformation, velocity, texture and so on, to estimate the probability that some entity exists.

The output vector is sent to all possible parents in the CapsNet. For each parent, the capsule computes a prediction vector by multiplying its own output by a weight matrix. The parent whose output has the largest scalar product with this prediction vector gets a stronger coupling to the capsule, while the couplings to all other parents decrease. This is called routing-by-agreement. A squashing function is applied to the output vector of each capsule so that its length stays between 0 and 1. The main steps of how capsules work are well explained here.
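
The routing steps described above can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function names (squash, softmax, route), the toy input u_hat and the choice of three iterations are ours, and the prediction vectors are taken as given instead of being computed from learned weight matrices.

```python
import math

def squash(s):
    """Shrink vector s to a length in [0, 1) without changing its direction."""
    sq_norm = sum(x * x for x in s)
    if sq_norm == 0.0:
        return [0.0 for _ in s]
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm)
    return [scale * x for x in s]

def softmax(logits):
    """Turn routing logits into coupling coefficients that sum to 1."""
    top = max(logits)
    exps = [math.exp(b - top) for b in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(u_hat, iterations=3):
    """u_hat[i][j] is the prediction vector of child capsule i for parent j.

    Returns the parent outputs v[j] and the coupling coefficients c[i][j]
    used in the last iteration.
    """
    n_children, n_parents, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_parents for _ in range(n_children)]  # routing logits
    for _ in range(iterations):
        c = [softmax(row) for row in b]
        v = []
        for j in range(n_parents):
            # Weighted sum of the predictions sent to parent j, then squash.
            s = [sum(c[i][j] * u_hat[i][j][k] for i in range(n_children))
                 for k in range(dim)]
            v.append(squash(s))
        for i in range(n_children):
            for j in range(n_parents):
                # Agreement = scalar product of prediction and parent output.
                b[i][j] += sum(p * q for p, q in zip(u_hat[i][j], v[j]))
    return v, c

# Hypothetical toy input: three children agree on parent 0, disagree on parent 1.
u_hat = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [-1.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
v, c = route(u_hat)
lengths = [math.sqrt(sum(x * x for x in vec)) for vec in v]
print(lengths, c)
```

In this toy example the three children make the same prediction for parent 0, so its output grows longer and every child shifts its coupling towards it, while parent 1, which receives conflicting predictions, stays short.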

There are many possible ways to implement a capsule network. The following papers, written by the authors of the CapsNet idea itself, describe in detail how to implement the architecture and how the capsules interact, compare the results with the performance of convolutional networks, and give the authors' own assessment of the work done. The studies examine two uses of capsule neural networks: image classification and digit recognition.

Dynamic Routing Between Capsules
The paper explains dynamic routing using a capsule network for digit recognition as an example. A capsule here is a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity, such as an object or an object part. The length of the activity vector represents the probability that the entity exists, and its orientation represents the instantiation parameters. Active capsules at one level make predictions about the instantiation parameters of higher-level capsules. When multiple predictions agree, a higher-level capsule is activated.
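
To make the role of the vector length concrete, here is a small sketch that reads a class prediction off a set of capsule outputs. The helper names and the two-dimensional toy vectors are assumptions made for illustration (the capsules in the paper are higher-dimensional); only the idea that the longest activity vector marks the most probable entity comes from the text above.

```python
import math

def capsule_lengths(capsules):
    """Length of each activity vector, read as the probability that its entity exists."""
    return [math.sqrt(sum(x * x for x in vec)) for vec in capsules]

def predict(capsules):
    """Return the index of the longest capsule vector together with all lengths."""
    lengths = capsule_lengths(capsules)
    best = max(range(len(lengths)), key=lengths.__getitem__)
    return best, lengths

# Hypothetical outputs of three class capsules (already squashed, so length < 1).
capsules = [[0.1, 0.0], [0.6, 0.6], [0.0, 0.3]]
best, lengths = predict(capsules)

# Same length, different orientation: the same confidence that the entity
# exists, but different instantiation parameters (for example, another pose).
same_entity_other_pose = capsule_lengths([[0.6, 0.6], [0.6, -0.6]])
print(best, lengths, same_entity_other_pose)
```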

The first test results on the MNIST dataset show that a discriminatively trained, multilayer capsule network achieves better accuracy than a CNN at recognizing highly overlapping handwritten digits. The test error for the CapsNet was about 0.25% on a 3-layer network; previously, such results had been achieved only by deeper CNNs. Besides MNIST, the team also tested the network on the CIFAR10 dataset with a 10.6% error (similar to the first attempts at applying CNNs), on the smallNORB dataset with a 2.7% test error, and on a small subset of SVHN.

Matrix Capsules With EM Routing
The next paper is devoted to recognizing objects photographed from different angles. A capsule in the network is a group of neurons whose outputs represent different properties of the same object. Each layer in a capsule network contains several capsules. Hinton describes a capsule network model in which each capsule has a logistic unit to represent the presence of an entity and a 4x4 matrix that can learn to represent the pose of the entity relative to the viewer. The research team introduced a new iterative routing procedure between capsule layers, based on the EM algorithm, which routes the output of each lower-level capsule to a capsule in the layer above in such a way that active capsules receive a cluster of similar pose votes. The transformation matrices are trained discriminatively by backpropagation through the unrolled iterations of EM between each pair of adjacent capsule layers.
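
The notion of a "pose vote" can be illustrated with plain matrix arithmetic. In the sketch below, a vote is the product of a lower-level capsule's 4x4 pose matrix and a transformation matrix. The spread function is our own simplified stand-in for measuring how tightly the votes cluster; it is not the EM routing procedure from the paper, and all names and matrices here are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[r][k] * b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]

def vote(pose, transform):
    """Vote of a lower-level capsule for a parent's pose: pose times transform."""
    return matmul(pose, transform)

def spread(votes):
    """Summed variance of the vote entries; a small value means the votes agree.

    Simplified agreement measure assumed for illustration, not EM routing.
    """
    n = len(votes)
    total = 0.0
    for r in range(len(votes[0])):
        for c in range(len(votes[0][0])):
            mean = sum(v[r][c] for v in votes) / n
            total += sum((v[r][c] - mean) ** 2 for v in votes) / n
    return total

IDENTITY = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
SHIFT = [row[:] for row in IDENTITY]
SHIFT[0][3] = 2.0  # hypothetical pose: identity plus a shift along one axis

# Two parts with different poses and transformations that cast the same vote.
agreeing = [vote(IDENTITY, SHIFT), vote(SHIFT, IDENTITY)]
# A third part whose vote does not fit the cluster.
mixed = agreeing + [vote(IDENTITY, IDENTITY)]
print(spread(agreeing), spread(mixed))
```

A parent capsule that receives tightly clustered votes, like the first pair here, is the kind of capsule that the routing procedure would activate.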

The preliminary tests show that capsules can be used successfully both for dealing with viewpoint variation and for improving segmentation decisions. The approach was tested on the smallNORB dataset and showed better accuracy than a traditional CNN: the error was about 1.4% for the best model. Moreover, the resulting network was more resistant to "white box" attacks than a CNN in an image classification task (though its robustness to "black box" attacks turned out to be the same as for CNN models). However, the team does not intend to stop at the NORB set and plans to implement a capsule model for much larger datasets such as ImageNet.

Main Advantages of CapsNets

We still have to learn the main pros and cons of CapsNets through further research; however, some of them can already be listed based on the first studies. Here are the main advantages of capsule networks:

good results for image classification, segmentation and object detection
high recognition accuracy on the MNIST dataset, promising results on the CIFAR10 dataset
less training data is required
information about pose and related parameters is considered (and saved) to identify entities
capsule activations map naturally onto a hierarchy of parts
robustness to input data transformations and "white box" attacks

What's the Catch?

It would be rather reckless to claim that capsule networks are flawless after a couple of months of testing. Although the idea itself is quite promising, it has some shortcomings that still have to be solved:

slow training (compared with CNNs), caused by the expensive inner loop of the routing-by-agreement algorithm
a lack of testing on large image datasets (e.g. ImageNet), which doesn't allow informed conclusions about further prospects
the so-called "crowding" effect: a network cannot distinguish identical objects located very close to each other (this problem is also inherent in human vision)
higher implementation complexity and computing requirements

The authors of the papers and other enthusiasts are currently engaged in further research on capsule networks. Will they be able to overcome the existing shortcomings and make a breakthrough in deep learning? Or will it take another 40 years before we achieve the next significant results?

Riter development team