Wouldn't convolutional neural network models work better without flattening the input at any stage?


[image: CNN model diagram]

The above model is what really helped me understand the implementation of convolutional neural networks. Based on it, I've got a hypothesis I want to find out more about, since actually testing it would involve developing an entirely new training model if the concept hasn't already been tried elsewhere.

I've been building a machine learning project for image recognition, and I noticed that at certain stages we flatten the input after convolution and max pooling. It occurred to me that by flattening the data, we fundamentally lose positional information. If you think about how real neurons process information in clusters, the physical proximity of biological neurons seems highly significant; they aren't arranged as flat layers. By designing a neural network training model that takes neuron proximity into account when deciding how to form connections between neurons, positional information could be preserved and kept relevant, and it seems this would improve network effectiveness.

Edit, for clarification, I made an image representing the concept I'm asking about:

[image: 3x3 pixel grid illustrating neighbouring-pixel relationships]

Basically: pixels 1 and 4 are related to each other, and that relationship is important information. Yes, we can train our neural network to learn those relationships, but that's 12 unique neighbour relationships in just a 3x3 pixel grid that our training process needs to successfully teach the network to value. A model that takes the proximity of neurons into consideration, like a real brain, would maintain the importance of those relationships automatically, since neurons connect more readily to others in close proximity.
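
To make the point concrete, here is a tiny NumPy sketch (my own illustration, not from any paper) showing how flattening scrambles adjacency: pixels 1 and 4 are immediate vertical neighbours in the 3x3 grid, but after flattening they sit three positions apart, and nothing in the flat vector records that they were ever adjacent.

```python
import numpy as np

# A 3x3 "image"; label pixels 1..9 row by row, as in the figure above.
grid = np.arange(1, 10).reshape(3, 3)

# In 2D, pixels 1 and 4 are adjacent: their (row, col) coordinates
# differ by exactly one row.
r1, c1 = np.argwhere(grid == 1)[0]
r4, c4 = np.argwhere(grid == 4)[0]
print(abs(r1 - r4) + abs(c1 - c4))  # 1 -> neighbours in the grid

# After flattening, the same two pixels are 3 positions apart, and the
# flat vector itself carries no trace of the original 2D neighbourhood.
flat = grid.flatten()
print(abs(np.where(flat == 1)[0][0] - np.where(flat == 4)[0][0]))  # 3
```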

My question is: does anyone know of white papers or experiments closely related to the concept I'm describing? Why would or wouldn't this be a fundamentally better model?


Posted 2019-11-11T20:44:01.343

Reputation: 187



I have had similar thoughts about neural networks before. Convolutional layers are layers of two-dimensional nodes that effectively pass along spatial data, so why don't we use two-dimensional hidden layers to receive information from them?

I'm sure someone has used this type of implementation before; I believe the papers linked below do. Part of the point of neural networks is that the weights are trained to find the best solution, so regardless of how the spatial information is arranged, the network learns to 'focus' on (increase the weight of) the locations associated with deciding the solution.

Think of a problem where your neural network examines an image and outputs true or false. Training images are true if the center is red and one of the corners is blue, or if the center is blue and one of the corners is red. Flattening the layers or not should have basically no effect on this model. In other circumstances, like object detection or labeling outlines, yes, I believe not flattening will benefit the model. That said, flattening the data does not erase spatial relationships: each layer will still be trained to detect the spatial information that gives a correct answer. The flattened layers just won't have the benefit of neighbors once the layers are one-dimensional instead of two-dimensional.
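
One way to put rough numbers on this (a back-of-the-envelope sketch, assuming a dense layer with as many units as input pixels): a fully connected layer on the flattened 3x3 input has no weight sharing, so each of the 12 neighbour relations must be learned independently at every location, while a convolution learns one shared kernel and reuses it everywhere.

```python
h = w = 3                        # the toy 3x3 grid from the question
n_adjacent = h * (w - 1) + (h - 1) * w   # 6 horizontal + 6 vertical pairs = 12

# Dense layer on the flattened input: an independent weight per pixel per
# unit, so nothing ties a relation learned at one position to any other.
dense_weights = (h * w) * (h * w)        # 9 inputs x 9 units = 81 weights

# A 3x3 convolution kernel: 9 shared weights, applied at every position,
# so a neighbour relation learned once applies across the whole grid.
conv_weights = 3 * 3

print(n_adjacent, dense_weights, conv_weights)  # 12 81 9
```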

In a CNN with multi-class detection as the task, you could give each class its own CNN-like hidden layers that narrow to a decision node and decide whether the input matches that class. Imagine a palm-tree shape where the trunk is the shared image convolutions and each leaf at the top is a set of two-dimensional hidden layers narrowing to an output node.
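
A shape-level sketch of that palm-tree idea in plain NumPy (random weights only, no training; the `conv_valid` helper and layer sizes are my own illustrative choices): a shared convolutional trunk feeds several per-class "leaves", each a 2D hidden layer that never gets flattened and narrows to a single decision score.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_valid(x, k):
    """Minimal 2D valid convolution (no padding, stride 1)."""
    h = x.shape[0] - k.shape[0] + 1
    w = x.shape[1] - k.shape[1] + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(x[i:i + k.shape[0], j:j + k.shape[1]] * k)
    return out

image = rng.standard_normal((8, 8))
trunk = conv_valid(image, rng.standard_normal((3, 3)))  # shared trunk: 6x6

n_classes = 3
scores = []
for _ in range(n_classes):                   # one "leaf" per class
    leaf = conv_valid(trunk, rng.standard_normal((3, 3)))  # 2D hidden layer: 4x4
    scores.append(leaf.mean())               # narrows to one decision node
print(int(np.argmax(scores)))                # index of the winning class
```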

Multi-dimensional NN and Three dimensional Neural Network

I know I spoke in a lot of abstraction so if any part doesn't make sense, I'll make an edit to clarify.

Michael Hearn

Posted 2019-11-11T20:44:01.343

Reputation: 522


Read up on Fully Convolutional Networks (FCNs). There are many papers on the subject; the first was "Fully Convolutional Networks for Semantic Segmentation" by Long et al.

The idea is quite close to what you describe: preserve spatial locality in the layers. In an FCN there is no fully connected layer. Instead, there is average pooling on top of the last low-resolution, high-channel layer. The effect is as if you had several fully connected layers centered on different locations, with the end result produced by weighted voting among them.

A pleasant side effect of FCNs is that they work on any spatial image size (bigger than the receptive field); the image size is not baked into the network.
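
A minimal NumPy sketch of that head (random weights, channel count `C = 16` and `n_classes = 5` chosen arbitrarily): per-location class scores act like the "votes" described above, global average pooling combines them, and the same head accepts any spatial size.

```python
import numpy as np

rng = np.random.default_rng(0)
C, n_classes = 16, 5

# 1x1-conv-style weights mapping C channels to class scores,
# applied identically at every spatial position.
w = rng.standard_normal((C, n_classes))

def fcn_head(feature_map):
    """feature_map: (H, W, C) from the convolutional trunk, any H and W."""
    per_location = feature_map @ w           # (H, W, n_classes): a vote per position
    return per_location.mean(axis=(0, 1))    # global average pooling -> (n_classes,)

print(fcn_head(rng.standard_normal((7, 7, C))).shape)   # (5,)
print(fcn_head(rng.standard_normal((13, 9, C))).shape)  # (5,) - same head, bigger image
```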


Posted 2019-11-11T20:44:01.343

Reputation: 547

Do you know of papers where they compare them to CNN+FC networks? I was experimenting with them a while ago and found FCNs performed worse than CNN+FC layers. Only when I built the first layer of convolutions like in Inception, using filters of different sizes on the input, did I get something comparable, at the expense of computation time. It would be interesting to know more about the architecture they use. – serali – 2019-11-12T10:13:55.993

Generally an FCN works the same as or better than a CNN with fully connected layers, but the number of channels in the last layer should be comparable to the size of the fully connected layer. I don't recall specific papers though. – mirror2image – 2019-11-12T13:12:23.873