I've seen discussions about the 'overhead' of a GPU, and claims that for 'small' networks it may actually be faster to train on a CPU (or a network of CPUs) than on a GPU.
What is meant by 'small'?
For example, would a single-layer MLP with 100 hidden units be 'small'?
Does our definition of 'small' change for recurrent architectures?
Are there any other criteria that should be considered when deciding whether to train on CPU or GPU?
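To make the question concrete, here is a rough sketch of how I picture the trade-off: the GPU pays a fixed per-step overhead (kernel launches, host-to-device copies) but computes faster, so there should be a break-even amount of work per step. The throughput and overhead figures are made-up placeholders, not measurements, and the function names are hypothetical.

```python
def cpu_time(flops, cpu_rate):
    """Time for one training step on the CPU: compute only, no fixed cost."""
    return flops / cpu_rate


def gpu_time(flops, gpu_rate, overhead):
    """Time for one training step on the GPU: fixed overhead plus compute."""
    return overhead + flops / gpu_rate


def break_even_flops(cpu_rate, gpu_rate, overhead):
    """Work per step at which CPU and GPU take equally long.

    Solves flops / cpu_rate == overhead + flops / gpu_rate for flops.
    Only meaningful when gpu_rate > cpu_rate.
    """
    return overhead / (1.0 / cpu_rate - 1.0 / gpu_rate)


# Hypothetical numbers, chosen only to illustrate the shape of the trade-off.
CPU_RATE = 1e11   # sustained FLOP/s on the CPU (assumed)
GPU_RATE = 1e12   # sustained FLOP/s on the GPU (assumed)
OVERHEAD = 1e-4   # fixed GPU cost per step, in seconds (assumed)

threshold = break_even_flops(CPU_RATE, GPU_RATE, OVERHEAD)
print(f"break-even at about {threshold:.3g} FLOPs per step")

for flops in (1e6, 1e7, 1e8):
    t_cpu = cpu_time(flops, CPU_RATE)
    t_gpu = gpu_time(flops, GPU_RATE, OVERHEAD)
    faster = "CPU" if t_cpu < t_gpu else "GPU"
    print(f"{flops:.0e} FLOPs/step: CPU {t_cpu:.2e} s, GPU {t_gpu:.2e} s -> {faster}")
```

Under this toy model, 'small' would mean "less work per step than the break-even point", which depends entirely on the hardware-specific numbers plugged in.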
I just found a blog post (possibly outdated? It's from 2014):
"...Most network card[s] only work with memory that is registered with the CPU and so the GPU to GPU transfer between two nodes would be like this: GPU 1 to CPU 1 to Network Card 1 to Network Card 2 to CPU 2 to GPU 2. What this means is, if one chooses a slow network card then there might be no speedups over a single computer. Even with fast network cards, if the cluster is large, one does not even get speedups from GPUs when compared to CPUs as the GPUs just work too fast for the network cards to keep up with them.
This is the reason why many big companies like Google and Microsoft are using CPU rather than GPU clusters to train their big neural networks."
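The argument in the quote can be sketched as a simple cost model: each step costs compute time divided across the nodes, plus the time to push the gradients along the GPU 1 to CPU 1 to Network Card 1 to Network Card 2 to CPU 2 to GPU 2 path. This is only an illustration. The payload size, link speeds, per-hop latency, and compute time are assumed values, and the model ignores any overlap of communication with computation.

```python
N_HOPS = 5  # GPU1 -> CPU1 -> NIC1 -> NIC2 -> CPU2 -> GPU2


def step_time(compute_s, n_nodes, payload_bytes, link_bytes_per_s, hop_latency_s):
    """Seconds per training step under a naive data-parallel model."""
    compute = compute_s / n_nodes
    if n_nodes == 1:
        return compute  # a single machine does no network communication
    # Assume the slowest link (the network card) limits the whole path.
    comm = N_HOPS * hop_latency_s + payload_bytes / link_bytes_per_s
    return compute + comm


def speedup(compute_s, n_nodes, payload_bytes, link_bytes_per_s, hop_latency_s):
    """Speedup of n_nodes machines over one machine."""
    single = step_time(compute_s, 1, payload_bytes, link_bytes_per_s, hop_latency_s)
    multi = step_time(compute_s, n_nodes, payload_bytes, link_bytes_per_s, hop_latency_s)
    return single / multi


# Hypothetical workload: 0.1 s of compute per step, 100 MB of gradients.
COMPUTE_S = 0.1
PAYLOAD_BYTES = 1e8
HOP_LATENCY_S = 1e-5

SLOW_LINK = 1.25e8  # about 1 Gbit/s, in bytes per second (assumed)
FAST_LINK = 5e9     # about 40 Gbit/s, in bytes per second (assumed)

slow_speedup = speedup(COMPUTE_S, 4, PAYLOAD_BYTES, SLOW_LINK, HOP_LATENCY_S)
fast_speedup = speedup(COMPUTE_S, 4, PAYLOAD_BYTES, FAST_LINK, HOP_LATENCY_S)
print(f"4 nodes, slow link: {slow_speedup:.2f}x")
print(f"4 nodes, fast link: {fast_speedup:.2f}x")
```

With these assumed numbers the slow link gives a speedup below 1 (the cluster is slower than one machine), while the fast link gives a speedup above 1. The faster the compute per step, the larger the share of each step spent on the network, which matches the quote's point about GPUs outrunning the network cards.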
So at some point, according to this post, it could have been faster to use CPUs. Is this still the case?
EDIT 2: Yes, that blog post may very well be outdated because:
It now seems that GPUs within a node are connected via a PCIe bus, so GPU-to-GPU communication can happen at about 6 GiB/s (for example: https://www.youtube.com/watch?v=el1iSlP1uOs, about 35 minutes in). The speaker implies that this is faster than going from GPU1 to CPU to GPU2. If so, the network card would no longer be the bottleneck, at least for transfers between GPUs in the same node.
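For a sense of scale, here is a back-of-the-envelope calculation of how long it would take to move the parameters of a single-layer MLP with 100 hidden units at about 6 GiB/s. The input size (784) and output size (10) are my own assumptions, as is the use of 32-bit floats, and the calculation ignores per-transfer latency, which may well dominate for a payload this small.

```python
GIB = 2**30
BANDWIDTH_BYTES_PER_S = 6 * GIB  # the "about 6 GiB/s" figure
BYTES_PER_FLOAT32 = 4


def mlp_param_count(n_in, n_hidden, n_out):
    """Weights and biases of an MLP with one hidden layer."""
    hidden_layer = n_in * n_hidden + n_hidden
    output_layer = n_hidden * n_out + n_out
    return hidden_layer + output_layer


def transfer_seconds(n_bytes, bytes_per_s):
    """Bandwidth-only transfer time; latency is deliberately ignored."""
    return n_bytes / bytes_per_s


# 784 inputs and 10 outputs are assumed for illustration only.
n_params = mlp_param_count(n_in=784, n_hidden=100, n_out=10)
payload_bytes = n_params * BYTES_PER_FLOAT32
seconds = transfer_seconds(payload_bytes, BANDWIDTH_BYTES_PER_S)

print(f"{n_params} parameters, {payload_bytes} bytes, "
      f"{seconds * 1e6:.1f} microseconds")
```

Under these assumptions the payload is 79,510 parameters (about 318 kB) and the bandwidth-limited transfer takes roughly 49 microseconds, so for a network this size the raw bandwidth is unlikely to be the limiting factor.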