Let's answer the question assuming the ideal case.

Say you are training a deep learning algorithm in which a proportion `p` of the work is parallelizable and a proportion `(1-p)` is sequential. Let's assume you can perfectly divide the program into its parallelizable and sequential parts. I don't know the exact specs of the aforementioned CPUs, but let's assume for now that they are symmetric processors (the most common case), i.e. all the cores perform tasks equally and there is no master CPU coordinating the work between them.

Now, Amdahl's Law states:

`S(N) = (1 - p) + p/N`

where `p` holds the same meaning as above, `N` can be approximated as the number of cores sharing the task, and `S(N)` is the reciprocal of the speedup, i.e. the time required on `N` cores will be `T * S(N)`, where `T` is the single-core time. So for each GPU with `N` cores you can calculate the speedup you will get.
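To make this concrete, here is a minimal sketch of the formula in Python (the function names are my own, purely for illustration):

```python
def amdahl_fraction(p, n):
    """Fraction of the single-core time that remains when a proportion p
    of the work runs on n cores: S(N) = (1 - p) + p / N."""
    return (1 - p) + p / n

def speedup(p, n):
    """Speedup factor relative to a single core (reciprocal of S(N))."""
    return 1 / amdahl_fraction(p, n)

# Example: 90% parallelizable work spread across 8 cores.
print(amdahl_fraction(0.9, 8))  # remaining fraction of the runtime: 0.2125
print(speedup(0.9, 8))          # roughly 4.7x speedup
```

Note the diminishing returns: even with infinitely many cores, the speedup can never exceed `1 / (1 - p)`.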

Specifically, calculate the number of calculations your NN will make while training. Here is a quite good calculation of the same. Generally a CPU takes 6-7 clock cycles for a multiplication and 60-70 for a division. Calculate the total number of clock cycles, `a`, taken by the learning algorithm, then calculate the time a single CPU needs to complete it as `a / clock_speed`. Then, using Amdahl's Law, calculate the speedup, where `N` is the number of cores in the GPU.
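The recipe above can be sketched as follows; the per-operation cycle counts, clock speed, and operation counts are illustrative assumptions, not measured values:

```python
def total_cycles(n_multiplications, n_divisions,
                 cycles_per_mul=7, cycles_per_div=70):
    """Rough cycle count `a` for the arithmetic in one training run
    (per-operation costs are ballpark figures, not measurements)."""
    return n_multiplications * cycles_per_mul + n_divisions * cycles_per_div

def single_core_time(a, clock_speed_hz):
    """Time in seconds on a single core: a / clock_speed."""
    return a / clock_speed_hz

def parallel_time(t_single, p, n):
    """Apply Amdahl's Law: T * S(N), with S(N) = (1 - p) + p / N."""
    return t_single * ((1 - p) + p / n)

# Example: 10^9 multiplications and 10^6 divisions on a 3 GHz core,
# with 95% of the work parallelizable across 1024 GPU cores.
a = total_cycles(10**9, 10**6)
t1 = single_core_time(a, 3e9)
print(parallel_time(t1, 0.95, 1024))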

So you see, it clearly depends on the type of program you are running. Cache memory also increases the speed of computation. How? Cache memory is a small, fast temporary storage. When the program accesses your data/observations, it first checks whether the data is available in the cache, since it can be read from there fastest; if not, it checks RAM, and then your hard disk, loading the data first into RAM and then into the cache. This is an oversimplified picture of how caching works.
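The lookup order described above can be illustrated with a toy model; the storage tiers here are plain dictionaries, purely for illustration:

```python
def read(key, cache, ram, disk):
    """Toy memory hierarchy: check the cache first, then RAM, then disk.
    On a miss, the value is copied into the faster tiers on the way back,
    so the next access to the same key is a cache hit."""
    if key in cache:
        return cache[key], "cache hit"
    if key in ram:
        cache[key] = ram[key]   # promote to cache
        return cache[key], "RAM hit"
    ram[key] = disk[key]        # load from disk into RAM...
    cache[key] = ram[key]       # ...and then into cache
    return cache[key], "disk hit"

cache, ram, disk = {}, {}, {"x": 42}
print(read("x", cache, ram, disk))  # first access falls through to disk
print(read("x", cache, ram, disk))  # second access is a cache hit
```

This is why repeated passes over the same mini-batch of data are cheaper than the first pass: the data is already sitting in the faster tiers.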

Also check the number of actual physical cores; virtual (hyper-threaded) cores help little with this kind of parallelization. For example, a typical i7 has 4 physical cores.
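In Python, `os.cpu_count()` reports *logical* cores (physical cores times hardware threads per core), so on a hyper-threaded machine it can be double the physical count; a third-party library such as `psutil` is needed to get the physical number:

```python
import os

# Logical cores, which include hyper-threaded (virtual) ones.
logical = os.cpu_count()
print(f"Logical cores: {logical}")

# Physical cores require a third-party library (assuming psutil is installed):
# import psutil
# print(f"Physical cores: {psutil.cpu_count(logical=False)}")
```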

As for the part I skipped: if there is a master CPU, just replace `N` with `N-1` in the formula, since the master only performs coordination.

So now you just have to find the general class of programs you will be running and check its sequential and parallelizable parts.

Hope this helps!

**EDIT:** Recently I learnt that not every GPU supports tensorflow-gpu, the package that lets the most popular deep learning framework, TensorFlow, run on the GPU. It is specifically supported by NVIDIA GPUs, since the CUDA framework that tensorflow-gpu requires is made specifically for NVIDIA hardware, although some newer GPUs from other brands are starting to gain support. The CPU-only version of TensorFlow can be run on any system. (Anyone can clarify this matter if they have more knowledge about it.)