3D CNNs are used when you want to extract features in three dimensions, or to establish relationships across three dimensions.
Essentially, it's the same as a 2D convolution, but the kernel now moves in three dimensions. This captures dependencies across all three dimensions better, and also changes the output dimensions after convolution.
During convolution, the kernel moves in three dimensions only if the kernel's depth is less than the feature map's depth.
On the other hand, a 2D convolution on 3D data means the kernel traverses in 2D only. This happens when the feature map's depth equals the kernel's depth (i.e., the number of channels).
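The difference in kernel movement shows up directly in the output shapes. Here's a minimal PyTorch sketch (the input sizes are made up for illustration) contrasting a 3D convolution, where the kernel is shallower than the volume, with a 2D convolution that treats depth as channels:

```python
import torch
import torch.nn as nn

# Hypothetical 3D feature map: batch 1, 1 channel, depth 16, 32x32 spatial.
x = torch.randn(1, 1, 16, 32, 32)

# Kernel depth (3) < feature-map depth (16): the kernel slides along
# depth, height, AND width, so the output keeps a depth dimension.
conv3d = nn.Conv3d(in_channels=1, out_channels=8, kernel_size=(3, 3, 3))
print(conv3d(x).shape)  # torch.Size([1, 8, 14, 30, 30])

# A 2D convolution on the same data treats depth as channels: the kernel
# depth equals the feature-map depth, so it moves in 2D only and the
# depth dimension collapses.
x2d = x.squeeze(1)  # shape (1, 16, 32, 32): depth reinterpreted as 16 channels
conv2d = nn.Conv2d(in_channels=16, out_channels=8, kernel_size=(3, 3))
print(conv2d(x2d).shape)  # torch.Size([1, 8, 30, 30])
```

Note how the 3D output retains a (shrunken) depth axis of 14, while the 2D output has no depth axis at all.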
Some use cases, for better understanding: MRI scans, where the relationship between a stack of images needs to be understood; and low-level feature extraction from spatio-temporal data such as video, for gesture recognition, weather forecasting, etc. (3D CNNs are used as low-level feature extractors only over multiple short intervals, as they fail to capture long-term spatio-temporal dependencies; for more on that, check out ConvLSTM or an alternate perspective here.)
Most CNN models that learn from video data use a 3D CNN as their low-level feature extractor.
In the example you mentioned above regarding the number 5, 2D convolutions would probably perform better, since you're treating each channel's intensity as an aggregate of the information it holds, meaning the learning would be almost the same as on a black-and-white image. Using a 3D convolution here would instead force the network to learn relationships between the channels, which don't exist in this case! (Also, a 3D convolution on an image of depth 3 would require a very uncommon kernel, especially for this use case.)
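To make the "uncommon kernel" point concrete, here's an illustrative PyTorch sketch (the 28x28 digit size is an assumption) of what treating the RGB channels of a digit image as a depth axis would look like:

```python
import torch
import torch.nn as nn

# Hypothetical RGB digit image treated as a volume: batch 1, 1 channel,
# depth 3 (one "slice" per colour channel), 28x28 spatial size.
img = torch.randn(1, 1, 3, 28, 28)

# Standard 2D approach: channels stay channels, the kernel moves in 2D.
conv2d = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3)
print(conv2d(img.squeeze(1)).shape)  # torch.Size([1, 8, 26, 26])

# 3D approach: the kernel must be shallower than 3 to actually move along
# the channel axis -- an unusual (2, 3, 3) kernel, which would try to learn
# cross-channel relationships that don't exist for a digit image.
conv3d = nn.Conv3d(in_channels=1, out_channels=8, kernel_size=(2, 3, 3))
print(conv3d(img).shape)  # torch.Size([1, 8, 2, 26, 26])
```

The leftover depth-2 axis in the 3D output carries no useful structure here, which is why the 2D convolution is the natural fit for this problem.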
Hope this clears up your query!