## CPU preferences and specifications for a multi GPU deep-learning setup

For a multi (4xTitan Xp) GPU deep learning setup what kind of CPU is preferable?

Specifically I am comparing:

• Intel Xeon E5-2620: 8 cores @ 2.1 GHz, 20 MB L3 cache
• Intel Xeon E5-1620K: 4 cores @ 3.5 GHz, 10 MB L3 cache
• Intel Xeon E5-1650K: 6 cores @ 3.6 GHz, 15 MB L3 cache
• Intel i7-6850K: 6 cores @ 3.6 GHz, 15 MB L3 cache

I wonder whether higher clock rates are more important, or whether it is better to have a larger number of cores for this use case.

Question was closed 2020-05-13T21:36:17.390

Let's answer the question assuming the ideal case.

Say you are training a deep learning algorithm in which a proportion p of the work is parallelizable and the remaining (1 − p) is sequential. Let's assume you can perfectly divide the program into its parallelizable and sequential parts. I don't know the exact specs of the aforementioned CPUs, but let's assume for now that they are symmetric multiprocessors (the most common case), i.e. all cores perform tasks equally and there is no master core coordinating the work among the others.

Now, Amdahl's law states:

S(N) = 1 / ((1 − p) + p / N)

where p holds the same meaning, N can be approximated as the number of cores sharing the task, and S(N) is the speedup factor you will get (that is, the time required drops from T to T / S(N)). So, for each CPU option with N cores you can calculate the speedup you will get.
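As a minimal sketch (the function name `amdahl_speedup` is my own), the formula above can be evaluated directly:

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Amdahl's law: speedup S(N) for a workload whose
    parallelizable fraction is p, running on n cores."""
    return 1.0 / ((1.0 - p) + p / n)

# e.g. a workload that is 90% parallelizable on 6 cores:
print(round(amdahl_speedup(0.9, 6), 2))  # 4.0
```

Note how quickly the returns diminish: even with infinitely many cores, a 90%-parallel workload can never run more than 10x faster.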

Specifically, calculate the number of arithmetic operations your NN will make while training. Here is a quite good calculation of the same. Generally a CPU takes roughly 6–7 clock cycles for a multiplication and 60–70 for a division. Compute the total number of clock cycles taken by the learning algorithm; call it a. Then the time taken by a single core is roughly a / clock_speed. Finally, use Amdahl's law to calculate the speedup, where N is the number of CPU cores.
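A rough sketch of that estimate (the cycle counts are the round numbers mentioned above, and the operation counts in the example are hypothetical, not measured from any real network):

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Amdahl's law speedup for parallel fraction p on n cores."""
    return 1.0 / ((1.0 - p) + p / n)

def estimated_runtime_seconds(num_multiplies: float, num_divisions: float,
                              clock_hz: float,
                              cycles_mul: int = 7, cycles_div: int = 70) -> float:
    """Single-core runtime estimate: total cycles / clock speed."""
    total_cycles = num_multiplies * cycles_mul + num_divisions * cycles_div
    return total_cycles / clock_hz

# Hypothetical example: 10^9 multiplications on a 3.6 GHz core,
# then the multi-core time for a 90%-parallel workload on 6 cores.
t_single = estimated_runtime_seconds(1e9, 0, 3.6e9)
t_six = t_single / amdahl_speedup(0.9, 6)
print(f"{t_single:.2f}s on 1 core, {t_six:.2f}s on 6 cores")
```

This is only a back-of-the-envelope model; real CPUs pipeline and vectorize operations, so measured times will differ.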

So you see, it clearly depends on the type of program you are running. Cache memory also increases the speed of computation. How? Cache is essentially a small, fast temporary store. When the program accesses your data/observations, it first checks whether the data is available in cache, since the CPU can fetch it faster from there; if not, it checks RAM and then the hard disk, loading the data first into RAM and then into cache. This is an oversimplified description of how caching works.

Also check the number of physical cores: virtual (hyper-threaded) cores add little for heavily parallel workloads. For example, many i7 parts have 4 physical cores.
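As a quick way to inspect this on your own machine: Python's standard library reports the logical core count (hyper-threads included); the physical count needs a third-party package such as psutil (an assumption about your environment, not required by anything above):

```python
import os

# Logical cores: includes hyper-threaded (virtual) cores.
logical = os.cpu_count()
print(f"logical cores: {logical}")

# Physical-core count is not in the stdlib; if psutil is installed,
# psutil.cpu_count(logical=False) returns it.
```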

The part I skipped: if there is a master core, just replace N with N − 1, since that core only performs coordination.

So now you just have to find the general class of programs you will be running and check its sequential and parallelizable parts.

Hope this helps!

EDIT: Recently I learnt that not all GPUs support tensorflow-gpu (i.e. the package of the most popular deep learning framework for utilising one or more GPUs). It is primarily supported on NVIDIA GPUs, since the CUDA toolkit required by tensorflow-gpu is NVIDIA-specific, although some newer GPUs from other brands are gaining support. The CPU-only version of TensorFlow can run on any system. (Anyone with more knowledge of this matter is welcome to clarify.)

That is too broad a question. It depends on what kind of operations you will do on the CPU during DL training. Generally, data reading (from disk) and pre/post-processing are done on the CPU, while the training itself runs on the GPU. There are only two things you should care about:

• A queue system for reading/pre-processing your input, so that when the GPU finishes one iteration it does not have to wait for the CPU to prepare the next input.
• Your network's complexity, from which you can get the duration of one training iteration. Your CPU must prepare the next input faster than that so it does not become a bottleneck.
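A minimal sketch of such an input queue using Python's standard library (the batch contents here are placeholder strings standing in for your own load/preprocess pipeline):

```python
import queue
import threading

def producer(q: queue.Queue, num_batches: int) -> None:
    """CPU side: read and preprocess batches ahead of the GPU."""
    for i in range(num_batches):
        batch = f"batch-{i}"   # stand-in for load + preprocess work
        q.put(batch)           # blocks once the buffer is full
    q.put(None)                # sentinel: no more data

def consumer(q: queue.Queue, results: list) -> None:
    """GPU side: pull the next ready batch instead of waiting on I/O."""
    while True:
        batch = q.get()
        if batch is None:
            break
        results.append(batch)  # stand-in for one training iteration

buf = queue.Queue(maxsize=4)   # bounded buffer = the queue system
out = []
t = threading.Thread(target=producer, args=(buf, 8))
t.start()
consumer(buf, out)
t.join()
print(len(out))  # 8
```

Real frameworks provide this for you (e.g. TensorFlow's input pipelines or PyTorch's DataLoader workers), but the bounded producer/consumer structure is the same.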

So, in that case, if you have intensive reading or pre/post-processing operations, go for fewer cores at higher clock speeds. If those operations are minimal, go for more cores. However, as explained above, your training script matters much more than your hardware for optimizing training duration.

The impact of

• the number of cores,
• cache size, and
• clock speed

on GPU performance, when comparing the CPUs you mentioned, is not project-critical for either lab research or field deployment.

More important are the optimization of algorithms for the hardware available, the video bus architecture (since data passing between CPUs and GPUs depends on upload and download bit rates), the potential use of clusters of computing nodes, and the statistical handling of the learning process: sampling methodology, data hygiene, normalization of input, and the proper mapping of activations and their hyper-parameterization. Offloading random accesses to swap space can also be a bottleneck, one unrelated to the bus communication between units.

As an architect, one must consider the big picture. Even the exact same software and hardware combination, when given example sets with widely different distributions of values, may favor different features, one requiring more cache and the other performing proportionally with the clock.

Whether multiple cores will be used effectively has much to do with how vectors are represented and what the compiler and instruction cache do or do not do in terms of optimization.

If you can find side-by-side test results, the specifics for particular test scenarios may be available to you, but be warned that your own cases may differ enough from the test suite used in the reports you find to completely invalidate the usefulness of the report.

References

> Higher [speeds] than that predicted by Amdahl's law may be achieved for the parallelizable part of the workload, if core threads exhibit strong cache affinity and the workload is strongly memory-bound. Then, we derive a tight speedup upper bound in the presence of both memory resource contention and critical section for multicore processors with single-threaded cores. This speedup upper bound indicates that with resource contention among threads, whether it is due to shared memory or critical section, a sequential term is guaranteed to emerge from the parallelizable part of the workload, fundamentally limiting the scalability of multicore processors for parallel computing, in addition to the sequential part of the workload.