Amazon EC2 instance type best suited for R data mining


I'm new to the field of machine learning; I have always used my laptop for regular statistical analysis with no performance problems. Lately, though, I started programming with caret and I find myself stuck for hours optimizing models and resampling datasets. I looked at EC2 instances, but I can't understand the difference between the different classes. I know that, generally, the ones with the highest numbers of CPUs and the most RAM perform best, but which instance type is best suited for R programming (for example, a p-type versus a c-type)? I'll then choose the right amount of memory for my applications, but I'm wondering whether one family is more suited than the others.


Posted 2019-04-03T11:12:46.967

Reputation: 125



It very much depends on the calculations you are doing, the tools you are using to implement them, and the size of the data you are working with.

Some small rules of thumb:

  • R processes generally tend to be RAM-bound, so you may want a memory-optimised instance (the r family)
  • The RStudio IDE has a profiling tool (profvis) you can use to check how your code executes and where the time is spent
  • It may be easier and cheaper to optimise your code with tools like data.table and Rcpp before scaling up the hardware
    • Model training itself is unlikely to be affected by this; it mostly speeds up your data prep. Check the profiler
  • If you are REALLY keen on getting the model training fast, look into the GPU instances; however, in the case of R using caret, I don't know whether it will utilise these assets. Do your research first, as these are the most expensive flavour of EC2. TPUs are (from my limited research) specifically optimised for TensorFlow, and are a Google Cloud offering rather than an EC2 instance type.
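To illustrate the data.table point above, here is a minimal sketch of the kind of data-prep operation where it shines, assuming the data.table package is installed; the data and column names are made up for the example:

```r
# Grouped aggregation with data.table: fast, memory-efficient syntax
# compared to base R split/apply patterns on large tables.
library(data.table)

set.seed(42)
dt <- data.table(id    = sample(1:1000, 1e6, replace = TRUE),
                 value = rnorm(1e6))

# Mean per group; data.table optimises the grouped computation internally
result <- dt[, .(avg = mean(value)), by = id]
```

Profiling this against an equivalent base-R aggregate() call on your own data will tell you whether your bottleneck is really the hardware or the code.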
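On the CPU question specifically: caret can parallelise its resampling loop across cores via a registered foreach backend, which is what makes the many-vCPU compute instances pay off. A minimal sketch, assuming the caret and doParallel packages are installed (the random-forest method and iris data are just placeholders):

```r
# Parallelise caret's cross-validation resamples across local cores.
library(caret)
library(doParallel)

# Leave one core free for the OS / RStudio
cl <- makePSOCKcluster(parallel::detectCores() - 1)
registerDoParallel(cl)

# caret picks up the registered backend automatically when
# allowParallel = TRUE (the default) in trainControl
fit <- train(Species ~ ., data = iris,
             method    = "rf",
             trControl = trainControl(method = "cv", number = 10,
                                      allowParallel = TRUE))

stopCluster(cl)
```

If your workload scales this way, a compute-optimised (c-type) instance with many vCPUs may serve you better than a GPU instance that caret cannot use.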

A plan B/alternative would be to start out in AWS SageMaker notebooks. That abstracts away all the faff of managing the EC2 instance so you can just focus on building the ML.

For extra credit on your EC2, use one of the R community AMIs put together by this lovely person:


Posted 2019-04-03T11:12:46.967

Reputation: 161