In traditional computer vision and computer graphics, the pose matrix is a $4 \times 4$ matrix of the form

$$ \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_{1} \\ r_{21} & r_{22} & r_{23} & t_{2} \\ r_{31} & r_{32} & r_{33} & t_{3} \\ 0 & 0 & 0 & 1 \end{bmatrix} $$

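where the upper-left $3 \times 3$ block $[r_{ij}]$ is a rotation and $(t_1, t_2, t_3)$ is a translation. It is a transformation that changes the viewpoint from one coordinate frame to another.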
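To make the definition concrete, here is a minimal sketch of such a matrix in plain Python. The helper names `make_pose` and `transform_point` are hypothetical, and the rotation is restricted to the z-axis only to keep the example short.

```python
import math

def make_pose(yaw, t):
    """Build a 4x4 pose: rotation by `yaw` radians about the z-axis, then translation t."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [
        [c,   -s,  0.0, t[0]],
        [s,    c,  0.0, t[1]],
        [0.0, 0.0, 1.0, t[2]],
        [0.0, 0.0, 0.0, 1.0],   # fixed bottom row of a pose matrix
    ]

def transform_point(pose, p):
    """Apply the pose to a 3D point via homogeneous coordinates (x, y, z, 1)."""
    hp = [p[0], p[1], p[2], 1.0]
    out = [sum(pose[i][j] * hp[j] for j in range(4)) for i in range(4)]
    return out[:3]

# Rotate (1, 0, 0) by 90 degrees about z, then shift by (1, 2, 3).
pose = make_pose(math.pi / 2, (1.0, 2.0, 3.0))
moved = transform_point(pose, (1.0, 0.0, 0.0))
```

Here `moved` is the same point expressed in the new frame; the twelve free entries of the matrix are constrained by the requirement that the $3 \times 3$ block be a rotation.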

In the Matrix Capsules with EM Routing paper, the authors say that the "pose" of the various sub-objects of an object is encoded by the capsules in a lower layer. But from the procedure described in the paper, I understand that the pose matrix they talk about doesn't conform to this definition: nothing restricts the learned matrix to the form shown above.
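The mismatch can be stated as a check. The sketch below is my own illustration, not something from the paper: the hypothetical function `is_rigid_pose` tests whether a $4 \times 4$ matrix has the classical form, i.e. bottom row $[0, 0, 0, 1]$ and an upper-left block that is orthogonal with determinant $+1$. The example matrix `unconstrained` is made up to stand in for an arbitrary learned matrix.

```python
def is_rigid_pose(m, tol=1e-6):
    """Return True if the 4x4 matrix m is a rotation-plus-translation pose."""
    # Bottom row must be [0, 0, 0, 1].
    if any(abs(a - b) > tol for a, b in zip(m[3], [0.0, 0.0, 0.0, 1.0])):
        return False
    r = [row[:3] for row in m[:3]]
    # Orthogonality: R^T R must equal the identity.
    for i in range(3):
        for j in range(3):
            dot = sum(r[k][i] * r[k][j] for k in range(3))
            if abs(dot - (1.0 if i == j else 0.0)) > tol:
                return False
    # Determinant +1 rules out reflections.
    det = (r[0][0] * (r[1][1] * r[2][2] - r[1][2] * r[2][1])
           - r[0][1] * (r[1][0] * r[2][2] - r[1][2] * r[2][0])
           + r[0][2] * (r[1][0] * r[2][1] - r[1][1] * r[2][0]))
    return abs(det - 1.0) <= tol

identity = [[1.0, 0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]

# Made-up values standing in for a matrix with no structural constraint.
unconstrained = [[0.5, 1.2, -0.3, 0.1],
                 [0.0, 0.7, 2.0, -1.0],
                 [1.1, 0.0, 0.4, 0.3],
                 [0.2, -0.5, 0.9, 1.4]]

print(is_rigid_pose(identity))       # True
print(is_rigid_pose(unconstrained))  # False
```

A matrix with all 16 entries free will generally fail this check, which is the gap between the two uses of "pose" that I am asking about.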

So, is it right to use the word "pose" to describe the $4 \times 4$ matrix of each capsule?

Moreover, since the claim is that the capsules learn the pose matrices of the sub-objects of an object, does it mean they learn the viewpoint transformations of the sub-objects, since the pose matrix is actually a transformation?

Thanks for your insights, Jens. It's really interesting to see your result. – Aahan Singh – 2018-10-01T12:25:06.003