Is the word "pose" used correctly in the paper "Matrix Capsules with EM Routing"?



In traditional computer vision and computer graphics, the pose matrix is a $4 \times 4$ matrix of the form

$$ \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_{1} \\ r_{21} & r_{22} & r_{23} & t_{2} \\ r_{31} & r_{32} & r_{33} & t_{3} \\ 0 & 0 & 0 & 1 \end{bmatrix} $$

and is a transformation to change viewpoints from one frame to another.
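To make the form above concrete, here is a minimal sketch (using numpy; the angle and translation values are illustrative) of assembling such a pose matrix from a rotation $R$ and translation $t$ and applying it to a point in homogeneous coordinates:

```python
import numpy as np

# Illustrative example: a pose matrix built from a 90-degree rotation
# about the z-axis and a translation of (1, 2, 3).
theta = np.pi / 2
R = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta),  np.cos(theta), 0.0],
    [0.0,            0.0,           1.0],
])
t = np.array([1.0, 2.0, 3.0])

# Assemble the 4x4 pose matrix [R | t; 0 0 0 1].
T = np.eye(4)
T[:3, :3] = R
T[:3, 3] = t

# Transform a point given in homogeneous coordinates:
# (1, 0, 0) rotates to (0, 1, 0), then translates by (1, 2, 3).
p = np.array([1.0, 0.0, 0.0, 1.0])
p_new = T @ p
```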

In the Matrix Capsules with EM Routing paper, the authors say that each capsule in a lower layer encodes the "pose" of a sub-object of an object. But from the procedure described in the paper, I understand that the pose matrix they talk about doesn't conform to the definition above: there is no restriction that keeps it in the form of the pose matrix shown.

  1. So, is it right to use the word "pose" to describe the $4 \times 4$ matrix of each capsule?

  2. Moreover, since the claim is that the capsules learn the pose matrices of the sub-objects of an object, does it mean they learn the viewpoint transformations of the sub-objects, since the pose matrix is actually a transformation?

Aahan Singh

Posted 2018-06-28T11:00:52.403

Reputation: 53



Great question, and one that I think we could have done a better job of answering in the paper.

Essentially, the pose matrix of each capsule is set up so that it could learn to represent the affine transformation between the object and the viewer, but we are not restricting it to necessarily do that. So we talk about the output of a capsule as though it is an affine transformation matrix, but we can't ensure that it will be. We do things explicitly that make it more like such a matrix — like adding in the coordinates to the right-hand column — but we can't be sure. This somewhat embodies a large part of the capsule network theory — we set up scaffolding so that the network can learn to be equivalent to transformations that we think it ought to be invariant to, but we don't ensure that it is.

Nicholas Frosst

Posted 2018-06-28T11:00:52.403

Reputation: 46


I have tried to make it learn the affine transformation by giving it this as the label, and it works just fine. I'm really impressed and excited by capsule networks, and can't figure out why no one thought of this before, because it's so obvious and simple. Spiking neurons also tell us that information between neurons can't be one-dimensional only; it should be represented by vectors of some kind.


In the above comment I claim that it works "fine" when I make a capsule network learn the affine transformation by giving it this as the label. This is not true: it doesn't work! I'm sorry, I was too quick there.

I assume the reason is that the affine 4x4 matrix representation is redundant. Also, it is impossible to make sensible linear interpolations between such transformations, which affects the gradient (it will not point in the direction of the minimum).

What I have succeeded in doing is making the capsule network learn a quaternion (rotation) and a 3D vector (position), 7 parameters in all. These can be contained in a 3x3 matrix when fixing 2 of the parameters. But training is slow, and the network cannot encode skews etc. in this 3x3 setup.

Learning affine transformations from images with capsule networks (matrix capsules) can also be achieved by just letting the network learn its own 4x4 pose representation through the decoder part. A small network can then be trained to transform these poses into a 7D vector (quaternion and 3D vector), from which the affine 4x4 transformation obviously can be calculated. This I have also succeeded in doing. It seems the rotation encoded in the pose has a more quaternion-like nature, which makes sense.
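For reference, the final step above (7D vector to affine 4x4) is a standard conversion. This is a hedged sketch with illustrative names, not code from the experiments described: it builds the rotation block from a unit quaternion via the usual quaternion-to-matrix formula and places the translation in the right-hand column.

```python
import numpy as np

def quat_to_pose(q, t):
    """Build a 4x4 pose matrix from a quaternion q = (w, x, y, z)
    and a 3D translation vector t."""
    w, x, y, z = q / np.linalg.norm(q)  # normalize to a unit quaternion
    R = np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# Identity quaternion and translation (1, 2, 3): the result is a pure
# translation matrix.
T = quat_to_pose(np.array([1.0, 0.0, 0.0, 0.0]),
                 np.array([1.0, 2.0, 3.0]))
```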

Jens Overby

Posted 2018-06-28T11:00:52.403

Reputation: 11

Thanks for your insights, Jens. It's really interesting to see your result. – Aahan Singh – 2018-10-01T12:25:06.003