## What are deconvolutional layers?


I recently read Fully Convolutional Networks for Semantic Segmentation by Jonathan Long, Evan Shelhamer, Trevor Darrell. I don't understand what "deconvolutional layers" do / how they work.

The relevant part is

3.3. Upsampling is backwards strided convolution

Another way to connect coarse outputs to dense pixels is interpolation. For instance, simple bilinear interpolation computes each output $y_{ij}$ from the nearest four inputs by a linear map that depends only on the relative positions of the input and output cells.
In a sense, upsampling with factor $f$ is convolution with a fractional input stride of 1/f. So long as $f$ is integral, a natural way to upsample is therefore backwards convolution (sometimes called deconvolution) with an output stride of $f$. Such an operation is trivial to implement, since it simply reverses the forward and backward passes of convolution.
Thus upsampling is performed in-network for end-to-end learning by backpropagation from the pixelwise loss.
Note that the deconvolution filter in such a layer need not be fixed (e.g., to bilinear upsampling), but can be learned. A stack of deconvolution layers and activation functions can even learn a nonlinear upsampling.
In our experiments, we find that in-network upsampling is fast and effective for learning dense prediction. Our best segmentation architecture uses these layers to learn to upsample for refined prediction in Section 4.2.

I don't think I really understood how convolutional layers are trained.

What I think I've understood is that convolutional layers with a kernel size $k$ learn filters of size $k \times k$. The output of a convolutional layer with kernel size $k$, stride $s \in \mathbb{N}$ and $n$ filters is of dimension $\frac{\text{Input dim}}{s^2} \cdot n$. However, I don't know how the learning of convolutional layers works. (I understand how simple MLPs learn with gradient descent, if that helps).

So if my understanding of convolutional layers is correct, I have no clue how this can be reversed.

Could anybody please help me to understand deconvolutional layers?


Hoping it could be useful to anyone, I made a notebook exploring how convolution and transposed convolution can be used in TensorFlow (0.11). Maybe having some practical examples and figures will help a bit more to understand how they work.

– AkiRoss – 2016-11-22T14:56:19.567


This video lecture explains deconvolution/upsampling: https://youtu.be/ByjaPdWXKJ4?t=16m59s

– user199309 – 2016-04-22T13:14:07.030


For me, this page gave a better explanation; it also explains the difference between deconvolution and transposed convolution: https://towardsdatascience.com/types-of-convolutions-in-deep-learning-717013397f4d

– T.Antoni – 2018-06-28T17:37:26.943

Isn't upsampling more like backwards pooling than backwards strided convolution, since it has no parameters? – Ken Fehling – 2018-07-11T17:29:52.783


**Note: The name "deconvolutional layer" is misleading because this layer does not perform deconvolution.**

– user76284 – 2019-09-30T20:18:11.643

The best article about how transposed conv layers work: https://towardsdatascience.com/what-is-transposed-convolutional-layer-40e5e6e31c11

– amin msh – 2020-04-26T09:18:28.233

## Answers


"Deconvolution layer" is a very unfortunate name; it should rather be called a transposed convolutional layer.

Visually, for a transposed convolution with stride one and no padding, we just pad the original input (blue entries) with zeroes (white entries) (Figure 1).

In case of stride two and padding, the transposed convolution would look like this (Figure 2):
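As a sanity check of this picture, here is a minimal NumPy sketch (my own, not from the original answer) showing in 1D that a stride-2 transposed convolution is the same as interleaving zeros between the inputs and then running an ordinary stride-1 convolution, exactly as the figures depict:

```python
import numpy as np

def conv_transpose_1d(x, k, stride):
    # Each input value stamps a scaled copy of the kernel into the
    # output; overlapping stamps are summed.
    out = np.zeros((len(x) - 1) * stride + len(k))
    for i, v in enumerate(x):
        out[i * stride : i * stride + len(k)] += v * k
    return out

x = np.array([1.0, 2.0, 3.0])
k = np.array([1.0, 0.5, 0.25])

# The picture from the figures: interleave stride-1 zeros between the
# inputs, then do a plain stride-1 (full) convolution.
up = np.zeros((len(x) - 1) * 2 + 1)
up[::2] = x

print(conv_transpose_1d(x, k, stride=2))   # [1.   0.5  2.25 1.   3.5  1.5  0.75]
print(np.convolve(up, k, mode="full"))     # identical
```

Both views give the same numbers, which is why the animations look like "convolution with zeros inserted".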

All credits for the great visualisations go to

You can find more visualisations of convolution arithmetic here.

Just to make sure I understood it: "Deconvolution" is pretty much the same as convolution, but you add some padding? (Around the image / when s > 1 also around each pixel)? – Martin Thoma – 2016-06-08T05:00:37.477

Yes, a deconvolution layer also performs convolution! That is why transposed convolution fits so much better as a name, and the term deconvolution is actually misleading. – David Dao – 2016-06-30T20:47:57.480

Why do you say "no padding" in Figure 1, if the input actually is zero-padded? – Stas S – 2016-07-30T13:06:04.790


By the way: It is called transposed convolution now in TensorFlow: https://www.tensorflow.org/versions/r0.10/api_docs/python/nn.html#conv2d_transpose

– Martin Thoma – 2016-08-08T14:08:48.023

Thanks for this very intuitive answer, but I'm confused about why the second one is the 'stride two' case; it behaves exactly like the first one when the kernel moves. – Demonedge – 2016-08-10T14:01:13.950

Personally, I prefer to see it as a convolution with fractional stride. In the cited paper, there is no real deconvolution in the sense of "reversing the effect of a convolution". The only objective is to learn the up-sampling kernels. In that case, both "deconvolution" and "transposed convolution" are not correct. – Mikael Rousson – 2016-08-17T14:54:47.423


Could you link the arxiv paper A guide to convolution arithmetic for deep learning, please?

– Martin Thoma – 2016-11-22T21:29:14.497

@MartinThoma, is the filter that we use in transposed convolution flipped, or transposed? Or in transposed convolution do we just add padding and do convolution again? – RockTheStar – 2016-12-20T23:36:18.707

@DavidDao To be more precise, I think transposed convolution = add padding and then do convolution with (vertically & horizontally) flipped filters (from the original conv layer) – RockTheStar – 2016-12-22T00:26:57.007


I'm sorry, but I think your answer is wrong regarding the first animation. As seen here (http://deeplearning.net/software/theano_versions/dev/tutorial/conv_arithmetic.html#no-zero-padding-unit-strides-transposed) you are showing a convolution and not a transposed convolution. It will however have the same result as a transposed convolution with 0 padding going from a 2x2 space to a 4x4 space. That's why transposed convolutions are more efficient (no padding required).

– Andrei – 2017-01-15T22:05:24.183

Why do you call it "transpose"? What does transpose mean? – Bill Yan – 2017-03-01T05:51:32.427

@Demonedge Shown above is a transposed convolution. 'Stride two' means the stride in the corresponding original convolution is two. This is precisely why you have 1 (= 2-1, 2 being the original stride) layer of zeros in between rows and columns. Transposed convolution is generally used in the backward pass. It is called transposed because of the analogy with a fully connected layer, where you multiply with the transpose of the weight matrix during a backward pass. – stillanoob – 2017-03-30T06:54:50.107

This is still very confusing. I don't see how the stride, padding, and transpose are affecting the operation... The first gif has zero padding, but you said there is no padding. The second gif uses a stride of 1, and you said the stride is two. – Curious – 2017-12-23T22:46:20.770

In the second image there is 1 zero-padded pixel between each input pixel, vertically and horizontally. So performing convolution with stride 1 here actually works like stride 2, as a zero-padded pixel now sits between each input pixel. @Curious – Tahlil – 2018-04-30T13:37:37.457

Could you explain more than visually? – meduz – 2019-07-07T20:36:33.777

Is the animation yours, or could you give the right credit? – meduz – 2019-07-07T20:37:25.883

And how did you make the animation? – Ben – 2019-08-08T06:59:37.343


I think one way to get a really basic level of intuition behind convolution is that you are sliding K filters, which you can think of as K stencils, over the input image and producing K activations, each one representing the degree of match with a particular stencil. The inverse operation of that would be to take K activations and expand them into a preimage of the convolution operation. The intuitive explanation of the inverse operation is therefore, roughly, image reconstruction given the stencils (filters) and the activations (the degree of match for each stencil): at a basic intuitive level, we want to blow up each activation by the stencil's mask and add the results up.
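A tiny NumPy sketch of this "blow up each activation by the stencil's mask and add them up" picture (the stencils and activation values are hypothetical toy numbers, purely for illustration):

```python
import numpy as np

# Two hypothetical 2x2 stencils (filters) and, for each, an activation map
# saying how strongly the stencil matched at each position (stride 1).
stencils = [np.array([[1., 0.], [0., 1.]]),
            np.array([[0., 1.], [1., 0.]])]
activations = [np.array([[2., 0.], [0., 0.]]),   # stencil 0 fired at (0, 0)
               np.array([[0., 0.], [0., 3.]])]   # stencil 1 fired at (1, 1)

# Expand the activations into a preimage: stamp each stencil, scaled by
# the activation at that location, and sum the overlaps.
recon = np.zeros((3, 3))   # preimage of a 2x2 activation map under a 2x2 stencil
for stencil, act in zip(stencils, activations):
    for i in range(act.shape[0]):
        for j in range(act.shape[1]):
            recon[i:i + 2, j:j + 2] += act[i, j] * stencil

print(recon)
# [[2. 0. 0.]
#  [0. 2. 3.]
#  [0. 3. 0.]]
```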

Another way to approach understanding deconv would be to examine the deconvolution layer implementation in Caffe, see the following relevant bits of code:

DeconvolutionLayer<Dtype>::Forward_gpu
ConvolutionLayer<Dtype>::Backward_gpu
CuDNNConvolutionLayer<Dtype>::Backward_gpu
BaseConvolutionLayer<Dtype>::backward_cpu_gemm


You can see that it's implemented in Caffe exactly as backprop for a regular forward convolutional layer (to me it was more obvious after I compared the implementation of backprop in the cuDNN conv layer vs ConvolutionLayer::Backward_gpu implemented using GEMM). So if you work through how backpropagation is done for regular convolution, you will understand what happens on a mechanical computation level. The way this computation works matches the intuition described in the first paragraph of this blurb.

However, I don't know how the learning of convolutional layers works. (I understand how simple MLPs learn with gradient descent, if that helps).

To answer the other question embedded in your first question: there are two main differences between MLP backpropagation (fully connected layer) and convolutional nets:

1) the influence of weights is localized, so first figure out how to do backprop for, say, a 3x3 filter convolved with a small 3x3 area of an input image, mapping to a single point in the result image.

2) the weights of convolutional filters are shared for spatial invariance. In practice this means that in the forward pass the same 3x3 filter with the same weights is dragged through the entire image to yield the output image (for that particular filter). What this means for backprop is that the gradients for each point in the source image are summed over the entire range that we dragged the filter during the forward pass. Note that there are also distinct gradients of the loss w.r.t. x, w and bias: dLoss/dx needs to be backpropagated further, while dLoss/dw and dLoss/db are how we update the weights. w and bias are independent inputs in the computation DAG (there are no prior inputs), so there's no need to backpropagate beyond them.

(my notation here assumes that convolution is y = x*w+b where '*' is the convolution operation)
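To make 1) and 2) concrete, here is a hedged 1D NumPy sketch (my own toy code, not Caffe's): the dLoss/dx pass scatters each upstream gradient back through the window it came from, which is exactly a transposed convolution, while dLoss/dw sums contributions over every position the shared filter visited. A numerical gradient check at the end confirms dLoss/dx.

```python
import numpy as np

def conv1d(x, w, b):
    # "valid" cross-correlation: y = x*w + b in the notation above
    n = len(x) - len(w) + 1
    return np.array([np.dot(x[i:i + len(w)], w) for i in range(n)]) + b

def conv1d_backward(x, w, dy):
    dx = np.zeros_like(x)   # dLoss/dx: scatter each dy[i] back through its window
    dw = np.zeros_like(w)   # dLoss/dw: summed over all positions (shared weights)
    for i, g in enumerate(dy):
        dx[i:i + len(w)] += g * w
        dw += g * x[i:i + len(w)]
    return dx, dw, dy.sum()

x = np.array([1., 2., 3., 4.])
w = np.array([0.5, -1.0])
dy = np.ones(3)                      # pretend dLoss/dy = 1 everywhere
dx, dw, db = conv1d_backward(x, w, dy)

# Numerical check of dLoss/dx for loss = sum(y)
eps = 1e-6
num = np.array([(conv1d(x + eps * e, w, 0.).sum()
                 - conv1d(x - eps * e, w, 0.).sum()) / (2 * eps)
                for e in np.eye(4)])
assert np.allclose(dx, num)
```

Note how the dx update line is the same scatter-add that a "deconvolution" layer uses as its forward pass.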


I think this is the best answer for this question. – kli_nlpr – 2016-12-25T15:30:27.727

I agree that this is the best answer. The top answer has pretty animations, but until I read this answer they just looked like regular convolutions with some arbitrary padding to me. Oh, how people are swayed by eye candy. – Reii Nakano – 2017-06-10T06:23:16.847

Agreed, the accepted answer didn't explain anything. This is much better. – BjornW – 2018-03-27T08:37:28.407

Thanks for your great explanation. I currently can't figure out how to do the backprop properly. Could you give me a hint on that, please? – Bastian – 2019-08-20T17:37:17.730


Step-by-step math explaining how transposed convolution does 2x upsampling with a 3x3 filter and a stride of 2:

The simplest TensorFlow snippet to validate the math:

import tensorflow as tf
import numpy as np

def test_conv2d_transpose():
    # input batch shape = (1, 2, 2, 1) -> (batch_size, height, width, channels) - 2x2x1 image in batch of 1
    x = tf.constant(np.array([[
        [[1], [2]],
        [[3], [4]]
    ]]), tf.float32)

    # shape = (3, 3, 1, 1) -> (height, width, input_channels, output_channels) - 3x3x1 filter
    f = tf.constant(np.array([
        [[[1]], [[1]], [[1]]],
        [[[1]], [[1]], [[1]]],
        [[[1]], [[1]], [[1]]]
    ]), tf.float32)

    conv = tf.nn.conv2d_transpose(x, f, output_shape=(1, 4, 4, 1), strides=[1, 2, 2, 1], padding='SAME')

    with tf.Session() as session:
        result = session.run(conv)

    assert (np.array([[
        [[1.0], [1.0],  [3.0], [2.0]],
        [[1.0], [1.0],  [3.0], [2.0]],
        [[4.0], [4.0], [10.0], [6.0]],
        [[3.0], [3.0],  [7.0], [4.0]]]]) == result).all()
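The same numbers can be reproduced without TensorFlow. A plain NumPy sketch (my own, with the SAME-padding crop convention stated as an assumption): each input value stamps the 3x3 kernel at offset (2i, 2j) on a full (n-1)*stride + k = 5x5 canvas, and since the corresponding forward SAME convolution (4x4 -> 2x2, stride 2) pads 0 at the top/left and 1 at the bottom/right, the transposed convolution crops that bottom/right padding away:

```python
import numpy as np

x = np.array([[1., 2.],
              [3., 4.]])
k = np.ones((3, 3))
stride, out_size = 2, 4

# Scatter-add: each x[i, j] stamps a scaled kernel at offset (i*stride, j*stride).
full = np.zeros(((x.shape[0] - 1) * stride + 3,
                 (x.shape[1] - 1) * stride + 3))      # 5x5 canvas
for i in range(x.shape[0]):
    for j in range(x.shape[1]):
        full[i * stride:i * stride + 3, j * stride:j * stride + 3] += x[i, j] * k

# Crop away the bottom/right padding of the corresponding SAME forward conv.
result = full[:out_size, :out_size]
print(result)
# [[ 1.  1.  3.  2.]
#  [ 1.  1.  3.  2.]
#  [ 4.  4. 10.  6.]
#  [ 3.  3.  7.  4.]]
```

This also shows where the cropping of the 5x5 intermediate down to 4x4 comes from.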


I think your calculation is wrong here. The intermediate output should be 3 + 2*2 = 7, and then for a 3x3 kernel the final output should be 7-3+1 = 5x5 – Alex – 2017-11-14T14:59:23.707

Sorry, @Alex, but I fail to understand why the intermediate output is 7. Can you please elaborate? – andriys – 2017-11-19T09:49:22.443

@andriys In the image that you've shown, why is the final result cropped? – James Bond – 2018-06-25T13:29:37.950

@JamesBond I think this is what the padding parameter of the Conv2DTranspose() function in tensorflow.keras controls. Sometimes it is desirable to have the output be strictly a multiple (double, triple, etc.) of the original size; in this case, going from an input size of 2x2 (w x h) to 4x4. – X.X – 2019-11-01T00:05:49.167

@andriys Thanks for the illustration, very informative. I was initially thinking that the upsampled input image (with interleaved rows and columns of 0) is literally "convolved" with the filter kernel. But the real operation is more like what you said: the kernel is multiplied by the input elements and then "tiled" into the output with potential overlap (if the kernel size is bigger than the stride). I think the underlying math and gradient computation would also be different from the "conventional" Conv2D operation. – X.X – 2019-11-01T00:10:01.033


The notes that accompany Stanford CS class CS231n: Convolutional Neural Networks for Visual Recognition, by Andrej Karpathy, do an excellent job of explaining convolutional neural networks.

Reading this paper should give you a rough idea about:

• Deconvolutional Networks Matthew D. Zeiler, Dilip Krishnan, Graham W. Taylor and Rob Fergus Dept. of Computer Science, Courant Institute, New York University

These slides are great for Deconvolutional Networks.

Is it possible to summarise the content of any one of those links in a short paragraph? The links might be useful for further research, but ideally a Stack Exchange answer should have enough text to address the basic question without needing to go off-site. – Neil Slater – 2015-06-20T07:01:29.390

I am sorry, but the content of these pages is too large to be summarized in a short paragraph. – Azrael – 2015-06-20T09:11:39.483

A full summary is not required, just a headline - e.g. "A deconvolutional neural network is similar to a CNN, but is trained so that features in any hidden layer can be used to reconstruct the previous layer (and by repetition across layers, eventually the input could be reconstructed from the output). This allows it to be trained unsupervised in order to learn generic high-level features in a problem domain - usually image processing" (note I am not even sure if that is correct, hence not writing my own answer). – Neil Slater – 2015-06-20T11:08:49.030

The links seem to be fairly old stuff. It would be better if you could summarize the material in a few words. @Stephen Rauch – Yossarian42 – 2019-10-25T17:19:03.853

Although the links are good, a brief summary of the model in your own words would have been better. – SmallChess – 2015-12-19T13:34:15.347


Just found a great article from the Theano website on this topic [1]:

The need for transposed convolutions generally arises from the desire to use a transformation going in the opposite direction of a normal convolution, [...] to project feature maps to a higher-dimensional space. [...] i.e., map from a 4-dimensional space to a 16-dimensional space, while keeping the connectivity pattern of the convolution.

Transposed convolutions – also called fractionally strided convolutions – work by swapping the forward and backward passes of a convolution. One way to put it is to note that the kernel defines a convolution, but whether it’s a direct convolution or a transposed convolution is determined by how the forward and backward passes are computed.

The transposed convolution operation can be thought of as the gradient of some convolution with respect to its input, which is usually how transposed convolutions are implemented in practice.

Finally note that it is always possible to implement a transposed convolution with a direct convolution. The disadvantage is that it usually involves adding many columns and rows of zeros to the input, resulting in a much less efficient implementation.

So in plain terms, a "transposed convolution" is a mathematical operation using matrices (just like convolution), but one that is more efficient than the normal convolution operation when you want to go back from the convolved values to the original (the opposite direction). This is why it is preferred over convolution in implementations that compute the opposite direction (i.e. to avoid the many unnecessary multiplications by 0 caused by the sparse matrix that results from padding the input).

Image ---> convolution ---> Result

Result ---> transposed convolution ---> "originalish Image"

Sometimes you save some values along the convolution path and reuse that information when "going back":

Result ---> transposed convolution ---> Image

That's probably the reason why it's wrongly called a "deconvolution". However, it does have something to do with the matrix transpose of the convolution (C^T), hence the more appropriate name "transposed convolution".

So it makes a lot of sense when considering computing cost. You'd pay a lot more for Amazon GPUs if you didn't use the transposed convolution.
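A small NumPy sketch of the C / C^T view (my own illustration, using a 1D "valid" convolution): building the sparse convolution matrix C makes the forward pass a matrix product, and multiplying by C^T maps back to the input shape:

```python
import numpy as np

def conv_matrix(kernel, n_in):
    # Matrix C such that C @ x equals the "valid" correlation of x with kernel.
    k = len(kernel)
    n_out = n_in - k + 1
    C = np.zeros((n_out, n_in))
    for i in range(n_out):
        C[i, i:i + k] = kernel
    return C

kernel = np.array([1., 2., 3.])
x = np.arange(4.)                      # input of length 4

C = conv_matrix(kernel, len(x))        # shape (2, 4)
y = C @ x                              # convolution: 4 values -> 2 values
x_back = C.T @ y                       # transposed convolution: 2 -> 4

print(y)        # [ 8. 14.]
print(x_back)   # [ 8. 30. 52. 42.]
```

Note that C.T @ y restores the shape of x but not its values, which is exactly why "transposed convolution" is a better name than "deconvolution": it is not an inverse.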

Read and watch the animations here carefully: http://deeplearning.net/software/theano_versions/dev/tutorial/conv_arithmetic.html#no-zero-padding-unit-strides-transposed

Some other relevant reading:

The transpose (or more generally, the Hermitian or conjugate transpose) of a filter is simply the matched filter[3]. This is found by time reversing the kernel and taking the conjugate of all the values[2].

I am also new to this and would be grateful for any feedback or corrections.


Nit picking, but the link should be: http://deeplearning.net/software/theano_versions/dev/tutorial/conv_arithmetic.html#transposed-convolution-arithmetic

– Herbert – 2017-01-19T13:05:00.597


We could use PCA for analogy.

When using conv, the forward pass is to extract the coefficients of principle components from the input image, and the backward pass (that updates the input) is to use (the gradient of) the coefficients to reconstruct a new input image, so that the new input image has PC coefficients that better match the desired coefficients.

When using deconv, the forward pass and the backward pass are reversed. The forward pass tries to reconstruct an image from PC coefficients, and the backward pass updates the PC coefficients given (the gradient of) the image.

The deconv forward pass does exactly the conv gradient computation given in this post.

That's why in the caffe implementation of deconv (refer to Andrei Pokrovsky's answer), the deconv forward pass calls backward_cpu_gemm(), and the backward pass calls forward_cpu_gemm().


# Convolutions from a DSP perspective

I'm a bit late to this but would still like to share my perspective and insights. My background is theoretical physics and digital signal processing. In particular, I studied wavelets, and convolutions are almost in my backbone ;)

The way people in the deep learning community talk about convolutions was also confusing to me. From my perspective what seems to be missing is a proper separation of concerns. I will explain the deep learning convolutions using some DSP tools.

### Disclaimer

My explanations will be a bit hand-wavy and not mathematically rigorous, in order to get the main points across.

## Definitions

Let's define a few things first. I limit my discussion to one-dimensional (the extension to more dimensions is straightforward), infinite (so we don't need to mess with boundaries) sequences $$x_n = \{x_n\}_{n=-\infty}^{\infty} = \{\dots, x_{-1}, x_{0}, x_{1}, \dots \}$$.

A pure (discrete) convolution between two sequences $$y_n$$ and $$x_n$$ is defined as

$$(y * x)_n = \sum_{k=-\infty}^{\infty} y_{n-k} x_k$$

If we write this in terms of matrix vector operations it looks like this (assuming a simple kernel $$\mathbf{q} = (q_0,q_1,q_2)$$ and vector $$\mathbf{x} = (x_0, x_1, x_2, x_3)^T$$):

$$\mathbf{q} * \mathbf{x} = \left( \begin{array}{cccc} q_1 & q_0 & 0 & 0 \\ q_2 & q_1 & q_0 & 0 \\ 0 & q_2 & q_1 & q_0 \\ 0 & 0 & q_2 & q_1 \\ \end{array} \right) \left( \begin{array}{cccc} x_0 \\ x_1 \\ x_2 \\ x_3 \end{array} \right)$$

Let's introduce the down- and up-sampling operators, $$\downarrow$$ and $$\uparrow$$, respectively. Downsampling by factor $$k \in \mathbb{N}$$ is removing all samples except every k-th one:

$$\downarrow_k\!x_n = x_{nk}$$

And upsampling by factor $$k$$ is interleaving $$k-1$$ zeros between the samples:

$$\uparrow_k\!x_n = \left \{ \begin{array}{ll} x_{n/k} & n/k \in \mathbb{Z} \\ 0 & \text{otherwise} \end{array} \right.$$

E.g. we have for $$k=3$$:

$$\downarrow_3\!\{ \dots, x_0, x_1, x_2, x_3, x_4, x_5, x_6, \dots \} = \{ \dots, x_0, x_3, x_6, \dots \}$$ $$\uparrow_3\!\{ \dots, x_0, x_1, x_2, \dots \} = \{ \dots x_0, 0, 0, x_1, 0, 0, x_2, 0, 0, \dots \}$$

or written in terms of matrix operations (here $$k=2$$):

$$\downarrow_2\!x = \left( \begin{array}{cc} x_0 \\ x_2 \end{array} \right) = \left( \begin{array}{cccc} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ \end{array} \right) \left( \begin{array}{cccc} x_0 \\ x_1 \\ x_2 \\ x_3 \end{array} \right)$$

and

$$\uparrow_2\!x = \left( \begin{array}{cccc} x_0 \\ 0 \\ x_1 \\ 0 \end{array} \right) = \left( \begin{array}{cc} 1 & 0 \\ 0 & 0 \\ 0 & 1 \\ 0 & 0 \\ \end{array} \right) \left( \begin{array}{cc} x_0 \\ x_1 \end{array} \right)$$

As one can already see, the down- and up-sample operators are mutually transposed, i.e. $$\uparrow_k = \downarrow_k^T$$.

## Deep Learning Convolutions by Parts

Let's look at the typical convolutions used in deep learning and how we write them. Given some kernel $$\mathbf{q}$$ and vector $$\mathbf{x}$$ we have the following:

• a strided convolution with stride $$k$$ is $$\downarrow_k\!(\mathbf{q} * \mathbf{x})$$,
• a dilated convolution with factor $$k$$ is $$(\uparrow_k\!\mathbf{q}) * \mathbf{x}$$,
• a transposed convolution with stride $$k$$ is $$\mathbf{q} * (\uparrow_k\!\mathbf{x})$$

Let's rearrange the transposed convolution a bit: $$\mathbf{q} * (\uparrow_k\!\mathbf{x}) \; = \; \mathbf{q} * (\downarrow_k^T\!\mathbf{x}) \; = \; (\uparrow_k\!(\mathbf{q}*)^T)^T\mathbf{x}$$

In this notation $$(\mathbf{q}*)$$ must be read as an operator, i.e. it abstracts convolving something with kernel $$\mathbf{q}$$. Or written in matrix operations (example):

\begin{align} \mathbf{q} * (\uparrow_k\!\mathbf{x}) & = \left( \begin{array}{cccc} q_1 & q_0 & 0 & 0 \\ q_2 & q_1 & q_0 & 0 \\ 0 & q_2 & q_1 & q_0 \\ 0 & 0 & q_2 & q_1 \\ \end{array} \right) \left( \begin{array}{cc} 1 & 0 \\ 0 & 0 \\ 0 & 1 \\ 0 & 0 \\ \end{array} \right) \left( \begin{array}{c} x_0\\ x_1\\ \end{array} \right) \\ & = \left( \begin{array}{cccc} q_1 & q_2 & 0 & 0 \\ q_0 & q_1 & q_2 & 0 \\ 0 & q_0 & q_1 & q_2 \\ 0 & 0 & q_0 & q_1 \\ \end{array} \right)^T \left( \begin{array}{cccc} 1 & 0 & 0 & 0\\ 0 & 0 & 1 & 0\\ \end{array} \right)^T \left( \begin{array}{c} x_0\\ x_1\\ \end{array} \right) \\ & = \left( \left( \begin{array}{cccc} 1 & 0 & 0 & 0\\ 0 & 0 & 1 & 0\\ \end{array} \right) \left( \begin{array}{cccc} q_1 & q_2 & 0 & 0 \\ q_0 & q_1 & q_2 & 0 \\ 0 & q_0 & q_1 & q_2 \\ 0 & 0 & q_0 & q_1 \\ \end{array} \right) \right)^T \left( \begin{array}{c} x_0\\ x_1\\ \end{array} \right) \\ & = (\uparrow_k\!(\mathbf{q}*)^T)^T\mathbf{x} \end{align}

As one can see, this is the transposed operation, hence the name.
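These definitions are easy to play with numerically. A hedged NumPy sketch (the convolution modes and toy values are my own choices, for illustration) of the three variants, plus a check that the up- and down-sampling operators really are transposes of each other as matrices:

```python
import numpy as np

def down(x, k):                  # keep every k-th sample
    return x[::k]

def up(x, k):                    # interleave k-1 zeros after each sample
    out = np.zeros(len(x) * k)
    out[::k] = x
    return out

q = np.array([1., 2., 3.])
x = np.arange(1., 7.)

strided    = down(np.convolve(q, x, 'same'), 2)    # strided convolution
dilated    = np.convolve(up(q, 2), x, 'same')      # dilated convolution (trailing zero is harmless)
transposed = np.convolve(q, up(x, 2), 'same')      # transposed convolution
print(len(strided), len(dilated), len(transposed))  # 3 6 12

# The up/down operators are mutually transposed as matrices:
D = np.eye(6)[::2]                                  # downsampling matrix, 6 -> 3
U = np.column_stack([up(e, 2) for e in np.eye(3)])  # upsampling matrix, 3 -> 6
assert np.allclose(U, D.T)
```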

### Connection to Nearest Neighbor Upsampling

Another common approach found in convolutional networks is upsampling with some built-in form of interpolation. Let's take upsampling by factor 2 with a simple repeat interpolation. This can be written as $$\uparrow_2\!(1\;1) * \mathbf{x}$$. If we also add a learnable kernel $$\mathbf{q}$$ to this we have $$\uparrow_2\!(1\;1) * \mathbf{q} * \mathbf{x}$$. The convolutions can be combined, e.g. for $$\mathbf{q}=(q_0\;q_1\;q_2)$$, we have $$(1\;1) * \mathbf{q} = (q_0\;\;q_0\!\!+\!q_1\;\;q_1\!\!+\!q_2\;\;q_2),$$

i.e. we can replace a repeat upsampler with factor 2 and a convolution with a kernel of size 3 by a transposed convolution with kernel size 4. This transposed convolution has the same "interpolation capacity" but would be able to learn better matching interpolations.
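A quick NumPy check of this identity (my own toy numbers): repeat-upsampling by 2 is itself a transposed convolution with kernel (1 1), and folding the interpolation into the kernel gives the claimed size-4 kernel:

```python
import numpy as np

def conv_transpose_1d(x, k, stride=2):
    # scatter-add form of a transposed convolution
    out = np.zeros((len(x) - 1) * stride + len(k))
    for i, v in enumerate(x):
        out[i * stride:i * stride + len(k)] += v * k
    return out

q = np.array([1., 2., 3.])                 # q0, q1, q2
combined = np.convolve([1., 1.], q)
print(combined)                            # [1. 3. 5. 3.] = (q0, q0+q1, q1+q2, q2)

x = np.array([5., 7.])
# Repeat-upsampling by 2 == transposed conv with kernel (1 1), stride 2:
assert np.allclose(conv_transpose_1d(x, np.array([1., 1.])), np.repeat(x, 2))
# Upsample-then-convolve == one transposed conv with the combined size-4 kernel:
assert np.allclose(conv_transpose_1d(x, combined),
                   np.convolve(q, np.repeat(x, 2)))
```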

## Conclusions and Final Remarks

I hope I could clarify some of the common convolutions found in deep learning by taking them apart into their fundamental operations.

I didn't cover pooling here. But this is just a nonlinear downsampler and can be treated within this notation as well.

Excellent answer. Taking a mathematical/symbolic perspective often clarifies things. Am I correct in thinking that the term "deconvolution" in this context clashes with existing terminology?

– user76284 – 2019-09-23T19:31:13.557

It does not really clash, it just makes no sense. Deconvolution is just a convolution with an upsampling operator. The term deconvolution sounds like it would be some form of inverse operation. Talking about an inverse here only makes sense in the context of matrix operations: it's multiplying with the inverse matrix, not the inverse operation of convolution (like division vs. multiplication). – André Bergner – 2019-09-30T19:35:46.130

Right. According to the correct mathematical terminology, convolution yields the $z$ such that $\theta \ast x = z$, whereas deconvolution yields the $z$ such that $\theta \ast z = x$. The latter has nothing to do with the so-called and incorrectly named “deconvolution layer” referred to in the OP. I think the name of the latter is being deprecated, thankfully, in favor of something like “upsampled convolution”. – user76284 – 2019-09-30T19:41:43.560

(Least-norm) deconvolution is equivalent to multiplying by the inverse of the convolution matrix (or more precisely, its pseudoinverse). That is, $\theta \ast z = x$ if $z = (\theta \ast)^+ x$. This might make a good addition to your answer, since it clarifies what real deconvolution actually corresponds to. – user76284 – 2019-09-30T19:49:07.020

In short, the so-called “deconvolution layer” of the OP is not actually doing deconvolution. It’s doing something else (what you described in your answer). – user76284 – 2019-09-30T19:53:24.767


I'm not happy with this edit, can you please revert. I was following the common terminology which is also used in many answers here and given links (e.g. https://towardsdatascience.com/types-of-convolutions-in-deep-learning-717013397f4d). That's the problem, the terminology is very blurry. The whole point is to clarify that deconvolutions are not a mathematical thing – it's just an unfortunate name. So it should be mentioned in my answer.

– André Bergner – 2019-10-08T20:08:33.173

The very article you linked to says: “Some sources use the name deconvolution, which is inappropriate because it’s not a deconvolution.” – user76284 – 2019-10-08T20:48:01.503

I think we both agree that deconvolution is the wrong term, but I would like to have it at least mentioned. I guess you are right that the way it was formulated was more confusing than helpful, but since the term is around it would be good to mention it along those lines, i.e. that this operation is also often referred to as deconvolution, which is a misleading term. – André Bergner – 2019-10-13T12:19:15.960

However, I think that the deconvolution used in deep learning is related to the one in signal processing. Both of them are just convolutions, but with some form of reciprocal kernel, and both attempt to undo a former convolution operation. My main point is that the way it is used in deep learning makes it look like a different operation, which it is not. The intent makes it a deconvolution, not the operation. – André Bergner – 2019-10-13T12:22:16.870

I was thinking of the commonly used autoencoder architecture. In that case the "deconvolution" layer is supposed to learn projecting back from the "inner spaces" to the outer ones and eventually to the original. If you would set up an autoencoder with a convolution with a given static kernel and a second convolution, the second convolution would indeed learn the deconvolution as it would undo the previous one (if the kernel is nondegenerate). So there is some shared intuition. – André Bergner – 2019-10-13T17:10:33.647

Ah, by undoing convolutions I mean a function $f$ such that $f(w_1, w_1 \ast x) \approx x$ for arbitrary $w_1$ and $x$. This is what deconvolution is. It has 2 inputs, like convolution. Your decoder learns a function $g$ such that $g(w_1 \ast x) \approx x$ for a fixed $w_1$. It has only 1 input. As an example, we might have $g = (w_2 \ast) \uparrow_k$ for some $w_2$ that is learned. This makes me wonder whether there is a closed form for $\operatorname{argmin}_{w_2} \sup_x |g(w_1 \ast x) - x|$ or $\operatorname{argmin}_{w_2} |g(w_1 \ast) - I|$ (in the operator sense) in terms of $w_1$. – user76284 – 2019-10-13T19:40:15.127


In addition to David Dao's answer: It is also possible to think the other way around. Instead of focusing on which (low resolution) input pixels are used to produce a single output pixel, you can also focus on which individual input pixels contribute to which region of output pixels.

This is done in this Distill publication, including a series of very intuitive and interactive visualizations. One advantage of thinking in this direction is that explaining checkerboard artifacts becomes easy.


I had a lot of trouble understanding what exactly happened in the paper until I came across this blog post: http://warmspringwinds.github.io/tensorflow/tf-slim/2016/11/22/upsampling-and-image-segmentation-with-tensorflow-and-tf-slim/

Here is a summary of how I understand what is happening in a 2x upsampling:

# Simple example

1. Imagine the following input image:

2. Fractionally strided convolutions work by inserting factor-1 = 2-1 = 1 zeros in between these values and then assuming stride=1 later on. Thus, you receive the following 6x6 padded image:

3. The bilinear 4x4 filter looks like this. Its values are chosen such that the weights actually used (= all weights not being multiplied by an inserted zero) sum up to 1. Its three unique values are 0.56, 0.19 and 0.06. By convention, the center of the filter is the pixel in the third row and third column.

4. Applying the 4x4 filter to the padded image (using padding='same' and stride=1) yields the following 6x6 upsampled image:

5. This kind of upsampling is performed for each channel individually (see line 59 in https://github.com/shelhamer/fcn.berkeleyvision.org/blob/master/surgery.py). At the end, the 2x upsampling is really a very simple resizing using bilinear interpolation and conventions on how to handle the borders. 16x or 32x upsampling works in much the same way, I believe.
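The steps above can be sketched in NumPy (my own code; the kernel construction follows the bilinear-filter recipe used in the FCN repo's surgery.py, and the border handling is a simplifying assumption rather than the exact convention):

```python
import numpy as np

def bilinear_kernel(size=4):
    # Bilinear filter in the style of FCN's surgery.py
    factor = (size + 1) // 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    og = np.ogrid[:size, :size]
    return ((1 - abs(og[0] - center) / factor) *
            (1 - abs(og[1] - center) / factor))

def upsample2x(img):
    k = bilinear_kernel(4)
    # Step 2: insert factor-1 = 1 zeros in between the values
    up = np.zeros((2 * img.shape[0], 2 * img.shape[1]))
    up[::2, ::2] = img
    # Step 4: 'same' convolution with stride 1, kernel center at (2, 2)
    pad = np.pad(up, 2)
    out = np.empty_like(up)
    for i in range(up.shape[0]):
        for j in range(up.shape[1]):
            out[i, j] = (pad[i:i + 4, j:j + 4] * k).sum()
    return out

print(np.unique(bilinear_kernel(4)))   # [0.0625 0.1875 0.5625], the 0.06/0.19/0.56 values
out = upsample2x(np.ones((3, 3)))
print(out[1:5, 1:5])                   # all ones: away from the borders the used weights sum to 1
```

The border rows/columns come out below 1 because the inserted trailing zeros fall outside the image, which is exactly the "conventions on how to handle the borders" caveat in step 5.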


I wrote up some simple examples to illustrate how convolution and transposed convolution are done, and how they are implemented by software libraries like PyTorch:

https://makeyourownneuralnetwork.blogspot.com/2020/02/calculating-output-size-of-convolutions.html

an example of the visual explanations:


The following paper discusses deconvolutional layers, both from the architectural and the training point of view: Deconvolutional Networks.


This does not add any value to this answer

– Martin Thoma – 2017-01-19T12:40:32.503