When should I use 3D convolution?



I am new to convolutional neural networks, and I am learning about 3D convolution. What I understand so far is that 2D convolution gives us relationships between low-level features in the X-Y dimensions, while 3D convolution helps detect low-level features and the relationships between them in all three dimensions.

Consider a CNN employing 2D convolutional layers to recognize handwritten digits. If a digit, say 5, were written in different colors, would a strictly 2D CNN perform poorly (since the colors belong to different channels in the z-dimension)?

Also, are there practical well-known neural nets that employ 3D convolution?

Shobhit Verma

Posted 2019-07-31T06:09:54.267

Reputation: 131

3D convolutions are commonly used for processing 3D images such as MRI scans. – Yashas – 2019-07-31T06:37:48.457

Are there any publications on 3D Conv architectures? – Shobhit Verma – 2019-07-31T06:53:15.803

@Shobhit given the answer by ashenoy, is there some part of your question that has not been answered yet? – mshlis – 2019-08-07T11:50:49.197



3D CNNs are used when you want to extract features in three dimensions or establish relationships across three dimensions.

Essentially, it's the same as 2D convolution, but the kernel now moves in three dimensions, which captures dependencies across all three dimensions better and changes the output dimensions after convolution.

The kernel moves in three dimensions when the kernel depth is less than the feature map depth.


On the other hand, a 2D convolution on 3D data means that the kernel traverses in two dimensions only. This happens when the kernel depth is the same as the feature map depth (the number of channels).
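To make the two cases above concrete, here is a minimal sketch using NumPy. The function name `conv3d_valid`, the array sizes, and the random values are all assumptions made for illustration; it is a naive, unoptimized loop (stride 1, no padding), not how a deep learning library implements convolution.

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Naive 'valid' 3D cross-correlation (stride 1, no padding)."""
    D, H, W = volume.shape
    kd, kh, kw = kernel.shape
    out = np.zeros((D - kd + 1, H - kh + 1, W - kw + 1))
    for z in range(out.shape[0]):          # positions along depth
        for y in range(out.shape[1]):      # positions along height
            for x in range(out.shape[2]):  # positions along width
                patch = volume[z:z + kd, y:y + kh, x:x + kw]
                out[z, y, x] = np.sum(patch * kernel)
    return out

rng = np.random.default_rng(0)
volume = rng.standard_normal((5, 6, 6))  # depth 5, height 6, width 6

# Kernel depth (3) < volume depth (5): the kernel also slides along depth.
out_3d = conv3d_valid(volume, rng.standard_normal((3, 3, 3)))
print(out_3d.shape)  # (3, 4, 4)

# Kernel depth (5) == volume depth (5): only one depth position fits,
# so the kernel effectively moves in 2D only.
out_2d = conv3d_valid(volume, rng.standard_normal((5, 3, 3)))
print(out_2d.shape)  # (1, 4, 4)
```

In the second case the output has a single slice along depth, which is what an ordinary 2D convolution over a multi-channel input produces for one filter.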


Some use cases, for better understanding: MRI scans, where the relationship between a stack of images is to be understood; and low-level feature extraction for spatio-temporal data such as video, for gesture recognition, weather forecasting, etc. (3D CNNs are used as low-level feature extractors only over multiple short intervals, as they fail to capture long-term spatio-temporal dependencies. For more on that, check out ConvLSTM or an alternate perspective here.) Most CNN models that learn from video data use a 3D CNN as a low-level feature extractor.
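As a rough illustration of the "multiple short intervals" idea, the sketch below cuts a video into short, non-overlapping clips, each of which could then be passed to a 3D CNN. The helper name `split_into_clips`, the clip length of 16 frames, and the frame size are assumptions chosen for the example, not values from any particular model.

```python
import numpy as np

def split_into_clips(video, clip_len):
    """Split a (T, H, W, C) video into non-overlapping clips of clip_len frames."""
    n_clips = video.shape[0] // clip_len
    usable = video[: n_clips * clip_len]  # drop trailing frames that do not fill a clip
    return usable.reshape(n_clips, clip_len, *video.shape[1:])

video = np.zeros((70, 32, 32, 3))  # 70 frames of 32x32 RGB (dummy data)
clips = split_into_clips(video, clip_len=16)
print(clips.shape)  # (4, 16, 32, 32, 3)
```

Each clip covers only 16 frames, so anything that spans several clips has to be modelled by whatever sits on top of the 3D feature extractor.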

In the example you mentioned regarding the number 5, 2D convolutions would probably perform better, as you're treating every channel intensity as an aggregate of the information it holds, meaning the learning would be almost the same as on a black-and-white image. Using 3D convolution here, on the other hand, would cause the network to learn relationships between the channels that do not exist in this case! (Also, a 3D convolution on an image with depth 3 would require a very uncommon kernel, especially for this use case.)
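To see why the kernel would be unusual, here is a small sketch that applies the standard output-size formula, floor((n + 2p - k) / s) + 1, to a 3-channel image. The 28x28 image size and the 3x3 spatial kernel are assumptions for illustration, and `conv_output_size` is a hypothetical helper name.

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Number of positions a kernel of size k fits along an axis of size n."""
    return (n + 2 * padding - k) // stride + 1

depth, height, width = 3, 28, 28  # RGB image treated as a depth-3 volume

shapes = {}
for kd in (1, 2, 3):  # the only kernel depths that fit in depth 3 without padding
    shapes[kd] = (
        conv_output_size(depth, kd),
        conv_output_size(height, 3),
        conv_output_size(width, 3),
    )
    print(kd, shapes[kd])
# 1 (3, 26, 26)
# 2 (2, 26, 26)
# 3 (1, 26, 26)
```

With kernel depth 3 the kernel cannot move along the channel axis at all, which is just a 2D convolution; depth 2 leaves only two positions, and depth 1 never mixes channels.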

Hope this clears up your query!


Posted 2019-07-31T06:09:54.267

Reputation: 1 194


3D convolutions should be used when you want to extract spatial features from your input along three dimensions. In computer vision, they are typically used on volumetric images, which are 3D.

Some examples are classifying 3D rendered images and medical image segmentation.


Posted 2019-07-31T06:09:54.267

Reputation: 191