Is the R language suitable for Big Data

39

17

R has many libraries aimed at data analysis (e.g. JAGS, BUGS, arules, etc.), and it is covered in popular textbooks such as J. Kruschke, "Doing Bayesian Data Analysis" and B. Lantz, "Machine Learning with R".

I've seen a guideline of 5 TB for a dataset to be considered Big Data.

My question is: Is R suitable for the amount of data typically seen in Big Data problems? Are there strategies to employ when using R with datasets of this size?

akellyirl

Posted 2014-05-14T11:15:40.907

Reputation: 328

4In addition to the answers below, a good thing to remember is that most of what you need from R regarding Big Data can be done with summary data sets that are very small compared to the raw logs. Sampling from the raw log also provides a seamless way to use R for analysis without the headache of parsing line after line of a raw log. For example, for a common modelling task at work I routinely use MapReduce to summarize 32 GB of raw logs down to 28 MB of user data for modelling.cwharland 2014-05-14T17:45:41.430

Answers

39

Actually, this is coming around. In the book "R in a Nutshell" there is even a section on using R with Hadoop for big data processing. There are some workarounds required because R does all its work in memory, so you are essentially limited by the amount of RAM you have available to you.

A mature project for R and Hadoop is RHadoop.

RHadoop has been divided into several sub-projects, rhdfs, rhbase, rmr2, plyrmr, and quickcheck (wiki).
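
A minimal sketch of what an rmr2 job looks like (this assumes a working Hadoop installation with rmr2 configured; the input values are illustrative):

```r
library(rmr2)

# Push a small vector into HDFS; on a real cluster the input
# would already live there.
small.ints <- to.dfs(1:1000)

# Square each value with MapReduce: the map function runs on the
# cluster's nodes, not in the local R session's memory.
result <- mapreduce(
  input = small.ints,
  map   = function(k, v) keyval(v, v^2)
)

head(from.dfs(result)$val)
```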

MCP_infiltrator

Posted 2014-05-14T11:15:40.907

Reputation: 966

But does using R with Hadoop overcome this limitation (having to do computations in memory)?Felipe Almeida 2014-06-09T23:07:41.553

RHadoop does overcome this limitation. The tutorial here: https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/tutorial.md spells it out clearly. You need to shift into a MapReduce mindset, but it does bring the power of R to the Hadoop environment.

Steve Kallestad 2014-06-11T06:34:50.310

2

Two newer alternatives worth mentioning are SparkR (https://databricks.com/blog/2015/06/09/announcing-sparkr-r-on-spark.html) and H2O (http://h2o.ai/product/), both of which are well suited for big data.
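
As a rough illustration, a SparkR session might look like the following (this assumes Spark 2.x with SparkR on the path; the HDFS path and column names are made up):

```r
library(SparkR)
sparkR.session(appName = "bigdata-demo")

# The DataFrame stays distributed across the cluster; only the
# aggregated result is brought back into the local R session.
df  <- read.df("hdfs:///data/events.csv", source = "csv",
               header = "true", inferSchema = "true")
agg <- summarize(groupBy(df, df$user_id), n = count(df$user_id))
head(agg)

sparkR.session.stop()
```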

wacax 2015-12-05T06:03:32.720

29

The main problem with using R for large data sets is the RAM constraint. The reason for keeping all the data in RAM is that it provides much faster access and data manipulation than storing it on disk would. If you are willing to take a hit on performance, then yes, it is quite practical to work with large datasets in R.

  • RODBC: allows connecting to an external database from R to retrieve and handle data. Only the data you pull into R needs to fit in RAM, so the overall data set can be much larger.
  • ff: allows using larger-than-RAM data sets by utilising memory-mapped pages.
  • biglm: builds generalized linear models on big data by loading the data into memory in chunks, as shown in the sketch after this list.
  • bigmemory: an R package which allows powerful and memory-efficient parallel analyses and data mining of massive data sets. It permits storing large objects (matrices etc.) in memory (in RAM) using external pointer objects to refer to them.
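
A rough sketch of the chunked approach with biglm (the file name, chunk size, and model formula are illustrative):

```r
library(biglm)

# Fit on the first chunk, then fold in the remaining chunks with
# update(); only one chunk is ever held in RAM at a time.
chunk_size <- 100000
first <- read.csv("big_file.csv", nrows = chunk_size)
fit   <- biglm(y ~ x1 + x2, data = first)

skip <- chunk_size + 1  # header line plus the rows already read
repeat {
  nxt <- tryCatch(
    read.csv("big_file.csv", skip = skip, nrows = chunk_size,
             header = FALSE, col.names = names(first)),
    error = function(e) NULL)  # read.csv errors at end of file
  if (is.null(nxt) || nrow(nxt) == 0) break
  fit  <- update(fit, nxt)
  skip <- skip + chunk_size
}
summary(fit)
```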

asheeshr

Posted 2014-05-14T11:15:40.907

Reputation: 556

1Another package is distributedR, which allows you to work with distributed files in RAM.adesantos 2014-06-25T07:03:56.810

14

Some good answers here. I would like to join the discussion by adding the following three notes:

1) The question's emphasis on the volume of data when referring to Big Data is certainly understandable and valid, especially considering that data volumes are growing faster than the exponential growth in technological capacity described by Moore's Law (http://en.wikipedia.org/wiki/Moore%27s_law).

2) Having said that, it is important to remember the other aspects of the big data concept, per Gartner's definition (emphasis mine - AB): "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization" (usually referred to as the "3Vs model"). I mention this because it encourages data scientists and other analysts to look for and use R packages that focus on aspects of big data other than volume (enabled by the richness of the enormous R ecosystem).

3) While existing answers mention some R packages related to big data, for more comprehensive coverage I'd recommend referring to the CRAN Task View "High-Performance and Parallel Computing with R" (http://cran.r-project.org/web/views/HighPerformanceComputing.html), in particular the sections "Parallel computing: Hadoop" and "Large memory and out-of-memory data".

Aleksandr Blekh

Posted 2014-05-14T11:15:40.907

Reputation: 5 573

12

R is great for "big data"! However, you need a workflow, since R is limited (with some simplification) by the amount of RAM in the operating system. The approach I take is to interact with a relational database (see the RSQLite package for creating and interacting with a SQLite database), run SQL-style queries to understand the structure of the data, and then extract particular subsets of the data for computationally intensive statistical analysis. A sketch of this workflow follows.
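
A minimal sketch of this query-then-subset workflow (the database file, table, and column names are illustrative):

```r
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), "events.sqlite")

# Explore structure cheaply with SQL; only summaries come back to R.
dbGetQuery(con, "SELECT COUNT(*) AS n FROM events")
dbGetQuery(con, "SELECT user_id, COUNT(*) AS n
                 FROM events GROUP BY user_id LIMIT 10")

# Pull just the subset needed for the heavy statistical work.
recent <- dbGetQuery(con, "SELECT * FROM events WHERE year >= 2014")
fit <- lm(outcome ~ predictor, data = recent)

dbDisconnect(con)
```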

This is just one approach, however: there are packages that allow you to interact with other databases (e.g., MonetDB) or run analyses in R with fewer memory limitations (e.g., see pbdR).

statsRus

Posted 2014-05-14T11:15:40.907

Reputation: 267

9

Considering another criterion, I think that in some cases using Python may be much superior to R for Big Data. I know how widespread R is in data science educational materials and the good data analysis libraries available for it, but sometimes it just depends on the team.

In my experience, for people already familiar with programming, using Python provides much more flexibility and a bigger productivity boost than a language like R, which is not as well designed or as powerful as Python as a programming language. As anecdotal evidence, in a data mining course at my university the best final project was written in Python, even though the other students had access to R's rich data analysis libraries. That is, sometimes the overall productivity (considering learning materials, documentation, etc.) of Python may be better than that of R, even in the absence of special-purpose data analysis libraries for Python. Also, there are some good articles explaining the fast pace of Python in data science: Python Displacing R, and Rich Scientific Data Structures in Python, which may soon close the gap with the libraries available for R.

Another important reason not to use R is that when working with real-world Big Data problems, contrary to purely academic problems, there is much need for other tools and techniques, like data parsing, cleaning, visualization, web scraping, and many others that are much easier in a general-purpose programming language. This may be why the default language used in many Hadoop courses (including Udacity's online course) is Python.

Edit:

Recently, DARPA also invested $3 million to help fund Python's data processing and visualization capabilities for big data jobs, which is clearly a sign of Python's future in Big Data. (details)

Amir Ali Akbari

Posted 2014-05-14T11:15:40.907

Reputation: 790

5This answer seems to be wholly anecdotal and hardly shows anywhere where R is weak relative to Python.stanekam 2014-06-10T20:31:38.787

3R is a pleasure to work with for data manipulation (reshape2, plyr, and now dplyr) and I don't think you can do better than ggplot2/ggvis for visualization.organic agave 2014-05-18T21:52:42.273

@pearpies As said at the beginning of my answer, I acknowledge the good libraries available for R, but as a whole, when considering all the areas needed for big data (a few of which I mentioned in the answer), R is no match for the mature and huge libraries available for Python.Amir Ali Akbari 2014-05-19T08:08:53.207

1Peter from Continuum Analytics (one of the companies on the DARPA project referenced above) is working on some very impressive open-source code for data visualization that simply does things other code cannot.blunders 2014-05-20T18:46:45.713

Oh my goodness! "As an evidence, in a data mining course in my university, the best final project was written in Python, although the others has access to R's rich data analysis library." And you want to have readers respect your analysis? wow. Could there be any other factors involved in the best project being a python project other than the language it was written in? really....Shawn Mehan 2015-12-05T18:35:17.533

6

R is great for a lot of analysis. As mentioned above, there are newer adaptations for big data, like MapR, RHadoop, and scalable versions of RStudio.

However, if your concern is libraries, keep your eye on Spark. Spark was created for big data and is MUCH faster than Hadoop alone. It has rapidly growing machine learning, SQL, streaming, and graph libraries, allowing much if not all of the analysis to be done within one framework (with multiple language APIs; I prefer Scala) without having to shuffle between languages/tools.
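
For staying in R while using Spark, a minimal sparklyr sketch might look like this (a local master and the built-in mtcars data stand in for a real cluster and dataset):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Copy a toy dataset into Spark; real data would be read from
# HDFS/S3 so it never passes through local R memory.
cars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# The regression is fitted by Spark MLlib, not by the R session.
fit <- ml_linear_regression(cars_tbl, mpg ~ wt + cyl)
summary(fit)

spark_disconnect(sc)
```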

Climbs_lika_Spyder

Posted 2014-05-14T11:15:40.907

Reputation: 350

3

As other answers have noted, R can be used along with Hadoop and other distributed computing platforms to scale it up to the "Big Data" level. However, if you're not wedded to R specifically, but are willing to use an "R-like" environment, Incanter is a project that might work well for you, as it is native to the JVM (based on Clojure) and doesn't have the "impedance mismatch" between itself and Hadoop that R has. That is to say, from Incanter you can invoke the native Java Hadoop/HDFS APIs without needing to go through a JNI bridge or anything.

mindcrime

Posted 2014-05-14T11:15:40.907

Reputation: 151

1

I am far from an expert, but my understanding of the subject tells me that R (superb for statistics) and, e.g., Python (superb at several of the things where R is lacking) complement each other quite well (as pointed out by previous posts).

Stenemo

Posted 2014-05-14T11:15:40.907

Reputation: 111