I've put together some example TensorFlow code to help explain (the full, working code is in this gist). This code implements the capsule network from the first part of section 2 of the paper you linked:
import tensorflow as tf
from tensorflow.contrib import image  # provides image.translate

N_REC_UNITS = 10
N_GEN_UNITS = 20
N_CAPSULES = 30

# input placeholders
img_input_flat = tf.placeholder(tf.float32, shape=(None, 784))
d_xy = tf.placeholder(tf.float32, shape=(None, 2))

# translate the image according to d_xy
img_input = tf.reshape(img_input_flat, (-1, 28, 28, 1))
trans_img = image.translate(img_input, d_xy)
flat_img = tf.layers.flatten(trans_img)
capsule_img_list = []
# build several capsules and store the generated output in a list
for i in range(N_CAPSULES):
    # hidden recognition layer
    h_rec = tf.layers.dense(flat_img, N_REC_UNITS, activation=tf.nn.relu)
    # inferred xy values
    xy = tf.layers.dense(h_rec, 2) + d_xy
    # inferred probability that the capsule's feature is present
    p = tf.layers.dense(h_rec, 1, activation=tf.nn.sigmoid)
    # hidden generative layer
    h_gen = tf.layers.dense(xy, N_GEN_UNITS, activation=tf.nn.relu)
    # the flattened generated image, gated by p
    cap_img = p * tf.layers.dense(h_gen, 784, activation=tf.nn.relu)
    capsule_img_list.append(cap_img)
# combine the generated images
gen_img_stack = tf.stack(capsule_img_list, axis=1)
gen_img = tf.reduce_sum(gen_img_stack, axis=1)
Does anyone know how the mapping between input pixels to capsules should work?
This depends on the network structure. For the first experiment in that paper (and the code above), each capsule has a receptive field that includes the entire input image. That's the simplest arrangement. In that case, it's a fully-connected layer between the input image and the first hidden layer in each capsule.
Alternatively, the capsule receptive fields can be arranged more like CNN kernels with strides, as in the later experiments in that paper.
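For intuition, here's a hypothetical numpy sketch (not part of the TF graph above) of how local receptive fields could be carved out of the input, CNN-style. The patch and stride sizes are made up for illustration:

```python
import numpy as np

def capsule_patches(img, patch=10, stride=6):
    """Return a list of flattened local windows, one per capsule."""
    patches = []
    for top in range(0, img.shape[0] - patch + 1, stride):
        for left in range(0, img.shape[1] - patch + 1, stride):
            patches.append(img[top:top + patch, left:left + patch].ravel())
    return patches

img = np.zeros((28, 28), dtype=np.float32)
patches = capsule_patches(img)
# patch=10, stride=6 gives 4 window positions per axis:
# 16 capsules, each seeing 100 pixels instead of all 784
```

Each capsule's recognition layer would then be fully connected to its own patch rather than to the whole image.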
What exactly should be happening in the recognition units?
The recognition units are an internal representation that each capsule has. Each capsule uses this internal representation to calculate p, the probability that the capsule's feature is present, and xy, the inferred translation values. Figure 2 in that paper is a check to make sure the network is learning to use xy correctly (it is).
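To make the data flow through one capsule concrete, here's a hypothetical numpy sketch of a single capsule's forward pass, with random weights standing in for learned parameters (the shapes mirror the TF code above):

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda a: np.maximum(a, 0.0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

flat_img = rng.random((1, 784)).astype(np.float32)  # one flattened 28x28 image
d_xy = np.array([[1.0, -2.0]], dtype=np.float32)    # the requested shift

# recognition layer: 784 -> 10 hidden units
h_rec = relu(flat_img @ rng.normal(0, 0.05, (784, 10)))
# capsule outputs computed from the recognition units:
p = sigmoid(h_rec @ rng.normal(0, 0.05, (10, 1)))     # feature presence
xy = h_rec @ rng.normal(0, 0.05, (10, 2)) + d_xy      # inferred + applied shift
# generative layers: 2 -> 20 -> 784, gated by p
h_gen = relu(xy @ rng.normal(0, 0.05, (2, 20)))
cap_img = p * relu(h_gen @ rng.normal(0, 0.05, (20, 784)))
```

The key point is the bottleneck: everything the generative side knows about the image has to pass through p and xy.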
How should it be trained? Is it just standard backprop between every connection?
You should train it as an autoencoder, using a loss that penalizes differences between the generated output and the original image. Mean squared error works well here. Aside from that, yes, it's standard backpropagation through every connection:
loss = tf.losses.mean_squared_error(img_input_flat, gen_img)
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
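If `tf.contrib.image.translate` isn't available in your TF build, you can do the shift outside the graph when assembling each training batch. Here's a hypothetical numpy sketch; `np.roll` stands in for a proper translation (it wraps at the borders), and I'm assuming each d_xy row stores (dx, dy):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_batch(images, max_shift=3):
    """images: (batch, 28, 28) -> flattened originals, shifts, shifted copies."""
    n = images.shape[0]
    d_xy = rng.integers(-max_shift, max_shift + 1, size=(n, 2))
    shifted = np.stack([
        np.roll(img, (dy, dx), axis=(0, 1))  # rows move by dy, columns by dx
        for img, (dx, dy) in zip(images, d_xy)
    ])
    return images.reshape(n, -1), d_xy.astype(np.float32), shifted.reshape(n, -1)

imgs = rng.random((8, 28, 28)).astype(np.float32)
flat, d_xy, shifted_flat = make_batch(imgs)
```

You'd then feed the shifted images and their d_xy values through the placeholders each step.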