Is it possible to work with the KDD Cup data on a local system?


I'm trying to apply classification algorithms to the KDD Cup 2012 Track 2 data using R: http://www.kddcup2012.org/c/kddcup2012-track2

It does not seem possible to work with the 10 GB of training data on my local system, which has 4 GB of RAM. Has anyone managed to work with this data on a comparable local system, or is using a cluster the norm?
It would be great if anyone could give me some guidance on how to get started with a cluster, and on which type of cluster is normally used for such tasks.


Posted 2015-04-11T17:45:26.600

Reputation: 69



I think that you have, at least, the following major options for your data analysis scenario:

  1. Use big data-enabling R packages on your local system. You can find most of them via the corresponding CRAN Task View that I reference in this answer (see point #3).

  2. Use the same packages on a public cloud infrastructure, such as Amazon Web Services (AWS) EC2. If your analysis is non-critical and tolerant of potential restarts, consider using AWS Spot Instances, as their pricing can yield significant savings.

  3. Use the above-mentioned public cloud option with the standard R platform, but on more powerful instances (for example, on AWS you can opt for memory-optimized EC2 instances or general-purpose on-demand instances with more memory).

In some cases, it is possible to tune a local system (or a cloud on-demand instance) to enable R to work with big(ger) data sets. For some help in this regard, see my relevant answer.
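One way to stay within a small amount of RAM without any extra packages is to stream the file in fixed-size chunks and keep only running summaries. The following is a minimal sketch in base R, not a definitive implementation. The function name `chunked_click_rate`, the chunk size, and the file layout (tab-separated, no header, clicks in column 1 and impressions in column 2) are assumptions made for illustration and should be checked against the actual data description.

```r
# Stream a large delimited file in chunks and accumulate a running summary,
# so that only one chunk is held in memory at a time.
# Assumptions (hypothetical, verify against the data description):
#   - the file is tab-separated with no header row
#   - column 1 holds click counts, column 2 holds impression counts
chunked_click_rate <- function(path, chunk_lines = 100000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  clicks <- 0
  impressions <- 0
  repeat {
    lines <- readLines(con, n = chunk_lines)
    if (length(lines) == 0L) break  # end of file reached
    chunk <- read.table(text = lines, sep = "\t", header = FALSE,
                        quote = "", comment.char = "",
                        stringsAsFactors = FALSE)
    # Accumulate as doubles to avoid integer overflow on large totals.
    clicks <- clicks + sum(as.numeric(chunk[[1]]))
    impressions <- impressions + sum(as.numeric(chunk[[2]]))
  }
  clicks / impressions
}

# Tiny demonstration on a temporary file: 1 click over 4 impressions.
tmp <- tempfile(fileext = ".txt")
writeLines(c("0\t1\ta", "1\t2\tb", "0\t1\tc"), tmp)
rate <- chunked_click_rate(tmp, chunk_lines = 2L)
print(rate)
```

The same loop structure can feed each chunk to an incremental learner or write filtered rows to a smaller file. The key point is that memory use is bounded by `chunk_lines`, not by the file size.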

For both of the above-mentioned cloud (AWS) options, you may find it more convenient to use R-focused pre-built VM images. See my relevant answer for details. You may also find this excellent comprehensive list of big data frameworks useful.

Aleksandr Blekh

Posted 2015-04-11T17:45:26.600

Reputation: 6 438

Thanks for the answer. I have access to some local systems. Can you give me a starting point for setting up a cluster from these systems without any cloud services? It looks like AWS is used everywhere. – abhivij – 2015-04-13T10:15:11.803


@abhivij: You're welcome. Setting up a cluster is not rocket science, but it might not be trivial, depending on your requirements and current skills. You can read this blog post and this blog post as a starting point. (to be continued)

– Aleksandr Blekh – 2015-04-13T10:37:50.190


@abhivij: (cont'd) Also, you'd have to refer to the documentation of the multiprocessing R packages that you decide to use, for example this tutorial. A higher-level overview and an example of an R-based cluster can be found in this working paper. Hope this helps.

– Aleksandr Blekh – 2015-04-13T10:38:14.633
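
To illustrate the multiprocessing route discussed in the comments above, here is a minimal sketch using the `parallel` package that ships with R. It is only one possible starting point, not a recommended cluster design. The worker list and the toy task are assumptions for illustration: the sketch starts two workers on the local machine, and for several machines the host names would go in `workers` instead, which typically requires R on every host and passwordless SSH access to each of them.

```r
library(parallel)

# Hypothetical worker list: two workers on this machine. For several
# machines, replace with their host names, e.g. c("node1", "node2")
# (assumed names, not from the original thread).
workers <- c("localhost", "localhost")
cl <- makePSOCKcluster(workers)

# Placeholder task: each worker summarises one piece of the data.
pieces <- split(1:10, rep(1:2, each = 5))
partial_sums <- parLapply(cl, pieces, function(x) sum(x))

# Combine the per-worker results on the master process.
total <- Reduce(`+`, partial_sums)

stopCluster(cl)
print(total)
```

In a real analysis, each piece would be a file chunk or a subset of rows, and the function passed to `parLapply` would fit or score a model on that piece.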