Capsule Networks Beyond Image Recognition

Given how new capsule networks are, it is not surprising that some researchers still question their value and their ability to surpass more traditional approaches to image recognition. Despite the initial successes in this field, a lot of work lies ahead, to say nothing of applications outside computer vision. But while skeptics doubt, others are testing the idea in practice and looking for new directions. Following in the footsteps of their predecessors, ConvNets, capsule networks may prove useful in two other fields: computer games and natural language processing (NLP).


When we hear about convolutional neural networks (CNNs), we usually think of computer vision. Since their inception, convolutional networks have made great strides in object recognition thanks to their ability to generalize and pick out important features. But this is not their only achievement. Results in several other kinds of tasks show that CNNs often provide better accuracy and performance than previous methods. This applies first of all to NLP, prediction tasks, some computer games and problems with small training data sets. So we should not restrict capsule networks to their original purpose either. The fact that CapsNets have barely been explored beyond image recognition shouldn't stop us, particularly as the first attempts have already been made.


Behavior trees and state machines have been successfully applied in many computer games, but they may fall short in advanced environments with large state spaces. Here convolutional and capsule networks could be useful if we give them a chance. Several implementations of game AI using CNNs already exist, for example for Checkers, Go, Life, a real-time strategy game and 2048. As for CapsNets, we could find only one study in this field, by Per-Arne Andersen. The paper covers deep reinforcement learning with capsule networks in the following game environments:

  • Flash RL
  • Deep Line Wars
  • Deep RTS
  • Deep Maze
  • Flappy Bird

The study seeks to apply the CapsNet architecture to Deep Q-Learning based algorithms for game AI. The objective of the network is to analyze given states (game views) and identify recommended actions. Instead of classifying objects, the capsules now estimate a vector of likelihoods that each action is sensible in the current state. So capsule networks are used here in almost the usual way - for image processing - but with a new purpose. Looking at the problem from a new angle can give us interesting (although not necessarily perfect) results.
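To make the idea of "capsules as action estimators" more concrete, here is a minimal sketch. It assumes one output capsule per action and treats the length of each capsule vector as the score for that action, by analogy with how class capsules are read in image classification. The function names, the action list and the example vectors are hypothetical, and this is not taken from the paper's implementation.

```python
import math

def squash(vector):
    """Squash nonlinearity used in capsule networks:
    keeps the direction of the vector and maps its length into [0, 1)."""
    norm_sq = sum(x * x for x in vector)
    if norm_sq == 0.0:
        return [0.0 for _ in vector]
    norm = math.sqrt(norm_sq)
    scale = norm_sq / (1.0 + norm_sq) / norm
    return [scale * x for x in vector]

def capsule_length(vector):
    """Euclidean length of a capsule vector."""
    return math.sqrt(sum(x * x for x in vector))

def select_action(action_capsules, action_names):
    """Pick the action whose squashed capsule is longest.
    Assumption: one capsule per action, length read as a likelihood-like score."""
    lengths = [capsule_length(squash(c)) for c in action_capsules]
    best = max(range(len(lengths)), key=lengths.__getitem__)
    return action_names[best], lengths

# Hypothetical raw capsule outputs for three actions in one game state
raw_capsules = [[0.1, 0.0], [3.0, 4.0], [1.0, 0.0]]
action, scores = select_action(raw_capsules, ["left", "right", "jump"])
```

In a real agent the raw capsule outputs would come from the network's final capsule layer, and an exploration strategy such as epsilon-greedy would usually sit on top of this greedy choice.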

The author also introduces a generative modeling approach that produces artificial training data for Deep Reinforcement Learning (DRL) models. To this end, he describes a Conditional Convolution Deconvolution Network (CCDN) architecture. Its purpose is to generate data sets that can be used to train RL algorithms without self-play and to reduce the amount of exploration that game developers need.

Not all results of the research are great. For example, CCDN has issues in game environments with a sparse state-space representation. Capsule networks do not scale as well as ConvNets and show worse results in some environments. In simple environments, the models tend toward overestimation. In several other cases, however, they cope well. In summary, the author is optimistic about the initial achievements, though he recognizes the need for further research.

CapsNets vs ConvNets


NLP is another area where capsule networks, like ConvNets, might be useful. In recent years, ConvNets have been repeatedly applied to speech recognition, semantic parsing, sentence and document classification, machine translation, audio synthesis, language modeling and similar language processing tasks. If a convolutional network structure can do this with small adaptations, then capsules can probably be tried for the same goals.

Capsule networks seem to be a good fit for language modeling problems because they let us address some weaknesses of convolutional networks. A convolution layer is usually followed by a pooling layer, which is useful for revealing the most important features. However, max pooling loses information about the structure and mutual arrangement of the parts of the input (in the original setting, parts of an image). This is where capsule networks can succeed. Such networks are good where we need not only to pick out some sequence from the whole, but also to take into account the order or structure of individual elements. While convolutional networks may be good at tasks such as spam detection or identifying text entities and key ideas, CapsNets could go further and cope with hierarchically more complex tasks, for example cipher analysis or detecting errors in programs. Theoretically, they could be better wherever there are high requirements for language generation. Chatbots, program translators, search engines and content generators could become cleverer with CapsNets.
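The information loss from pooling is easy to see on a toy example. The sketch below applies global max pooling ("max over time") to two sequences of made-up feature activations that contain the same rows in a different order. The numbers and the two-feature layout are assumptions for illustration only. Real convolution filters span windows of several words and so keep some local order, but position in the sequence is discarded in the same way.

```python
def max_pool_over_time(feature_rows):
    """Global max pooling: keep the largest value in each feature column,
    regardless of which position in the sequence it came from."""
    return [max(column) for column in zip(*feature_rows)]

# Hypothetical per-token activations (one row per token, two features each)
dog_bites_man = [[0.9, 0.1], [0.2, 0.8], [0.4, 0.3]]
man_bites_dog = [[0.4, 0.3], [0.2, 0.8], [0.9, 0.1]]  # same rows, reversed order

pooled_a = max_pool_over_time(dog_bites_man)
pooled_b = max_pool_over_time(man_bites_dog)

# Both sequences collapse to the same pooled vector,
# so a classifier that sees only the pooled output cannot tell them apart.
same_after_pooling = pooled_a == pooled_b
```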

But this is only in theory. Capsule networks are very new and unpredictable. They may make a breakthrough in the software world within a few years, or die out and be replaced by another, more advanced technology. Some researchers hold a middle ground here: CapsNets are likely to excel in video intelligence and object tracking, but not necessarily in NLP.

However, although it is difficult to predict the outcome of such attempts, they are quite feasible in practice. The main task is to convert the input data (text) into a matrix form that suits the CapsNet architecture. We could draw on the experience of ConvNets and use word embeddings (low-dimensional representations) produced by tools like Word2Vec or GloVe. These tools take a text corpus as input and produce word vectors as output. It is also possible to work at the character level, where each row of the matrix corresponds to a symbol. Thus, instead of image pixels, the input to most NLP tasks is a sentence or document represented as a matrix. Applying the embedding to each word in the text, we get the array required for further processing. That is our "image". As for channels, we could use them for the same sentence represented in different languages, or phrased in different ways. Of course, such an approach requires specific filters and layer processing algorithms.
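The conversion from text to a matrix can be sketched as follows. The tiny embedding table, its three-dimensional vectors, the zero vector for unknown words and the fixed `max_len` padding are all assumptions made for illustration. In practice the vectors would come from a trained model such as Word2Vec or GloVe and would have far more dimensions.

```python
# Hypothetical toy embedding table (real ones are learned from a corpus)
TOY_EMBEDDINGS = {
    "capsule": [0.2, 0.7, 0.1],
    "networks": [0.5, 0.1, 0.4],
    "read": [0.9, 0.3, 0.0],
    "text": [0.1, 0.8, 0.6],
}
EMBEDDING_DIM = 3
UNKNOWN = [0.0] * EMBEDDING_DIM  # assumption: unknown words and padding map to zeros

def sentence_to_matrix(sentence, max_len):
    """Turn a sentence into a max_len x EMBEDDING_DIM matrix:
    one row per word, truncated or zero-padded to a fixed height."""
    tokens = sentence.lower().split()[:max_len]
    rows = [list(TOY_EMBEDDINGS.get(token, UNKNOWN)) for token in tokens]
    rows += [list(UNKNOWN) for _ in range(max_len - len(rows))]
    return rows

# The resulting matrix plays the role of a single-channel "image"
matrix = sentence_to_matrix("Capsule networks read text", max_len=6)
```

A second channel could be built the same way from a paraphrase or a translation of the sentence, giving an input of shape channels x max_len x EMBEDDING_DIM.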

But what for? Other neural networks already show excellent results and are easier to use because they are better understood. We could continue experimenting with them until all their possibilities are exhausted. However, all these long-studied and proven technologies once looked dubious and unpromising, until someone decided to develop them "just because he could".

The point is that the possibilities of CapsNets are not limited to image classification. We just need to adapt the input data of a particular task to this architecture in order to take advantage of it. Or, at least, begin applying it to a wider range of image processing and video analysis tasks, rather than just telling dogs from cats.

Riter development team