Python vs R for machine learning

119

75

I'm just starting to develop a machine learning application for academic purposes. I'm currently using R and training myself in it. However, in a lot of places, I have seen people using Python.

What are people using in academia and industry, and what is the recommendation?

user721

Posted 2014-06-12T06:04:48.243

Reputation: 159

3Well, what type of machine learning (image/video? NLP? financial? astronomy?), which classifiers, what size datasets (Mb? Gb? Tb?), what scale, what latency, on what platform (mobile/single-computer/multicore/cluster/cloud)...? What specific libraries will your application use/need, and have you checked what is available in each language? Are you just building a toy application for your personal learning or does it matter if it ever gets productized? Using open-source or proprietary? Will you be working with other people or existing apps, and what do they use/support? Web frontend/GUI? etc – smci – 2016-12-12T22:51:45.610

1One observation is that Python is more used by machine learning people working with big datasets while R is more used by traditional "statisticians", e.g. those working with psychology experiments with hundreds of data points. Though that difference might be diminishing. – xji – 2018-02-15T08:43:21.517

python all the way man! I do 4 times the things my colleagues do in one day. And you can use python for all kind of programming tasks, not only machine learning. – Francesco Pegoraro – 2018-09-24T20:57:17.513

Answers

104

Some really important differences to consider when choosing between R and Python:

  • Machine learning has two phases: model building and prediction. Typically, model building is performed as a batch process and predictions are made in real time. Model building is compute-intensive, while prediction happens in a jiffy. Therefore, the performance of an algorithm in Python or R doesn't really affect the turnaround time for the user. Python 1, R 1.
  • Production: The real difference between Python and R lies in being production ready. Python is a full-fledged programming language, and many organisations use it in their production systems. R is statistical programming software favoured by many in academia; with the rise of data science, the availability of libraries, and its being open source, industry has started using R as well. Many of these organisations have their production systems in Java, C++, C#, Python, etc., so ideally they would like to have the prediction system in the same language to reduce latency and maintenance issues. Python 2, R 1.
  • Libraries: Both languages have enormous and reliable libraries. R has over 5000 libraries catering to many domains, while Python has some incredible packages like Pandas, NumPy, SciPy, Scikit Learn, and Matplotlib. Python 3, R 2.
  • Development: Both languages are interpreted. Many say that Python is easy to learn, almost like reading English (to put it on a lighter note), while R requires more initial study. Both also have good IDEs (Spyder etc. for Python and RStudio for R). Python 4, R 2.
  • Speed: R software initially had problems with large computations (say, like nxn matrix multiplications). But, this issue is addressed with the introduction of R by Revolution Analytics. They have re-written computation intensive operations in C which is blazingly fast. Python being a high level language is relatively slow. Python 4, R 3.
  • Visualizations: In data science, we frequently plot data to showcase patterns to users. Visualisation therefore becomes an important criterion in choosing software, and R completely kills Python in this regard, thanks to Hadley Wickham's incredible ggplot2 package. R wins hands down. Python 4, R 4.
  • Dealing with Big Data: One of the constraints of R is that it stores data in system memory (RAM), so RAM capacity becomes a constraint when you are handling Big Data. Python does well, but since both R and Python have HDFS connectors, leveraging Hadoop infrastructure would give a substantial performance improvement. So, Python 5, R 5.
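
The split between batch model building and real-time prediction described in the first bullet can be sketched in Python as follows. This is only an illustration, not the answerer's code: the synthetic data, the least-squares model, and the function names `build_model` and `predict` are assumptions made for the example.

```python
import pickle

import numpy as np


def build_model(x, y):
    """Batch phase: fit y ~ x @ w + b by ordinary least squares."""
    # Append a column of ones so the last coefficient acts as the intercept.
    design = np.column_stack([x, np.ones(len(x))])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return {"weights": coef[:-1], "intercept": coef[-1]}


def predict(model, x_new):
    """Prediction phase: one matrix-vector product per request."""
    return np.asarray(x_new) @ model["weights"] + model["intercept"]


# Synthetic training data (assumption): y = 3*x0 - 2*x1 + 1, no noise.
rng = np.random.default_rng(0)
x_train = rng.normal(size=(200, 2))
y_train = 3 * x_train[:, 0] - 2 * x_train[:, 1] + 1

# Build once, offline, and serialize the result.
blob = pickle.dumps(build_model(x_train, y_train))

# Later (possibly in another process): load the model and serve predictions.
model = pickle.loads(blob)
prediction = predict(model, [[1.0, 1.0]])
print(prediction)  # close to [2.0], since 3*1 - 2*1 + 1 = 2
```

The expensive step runs once; each later prediction only needs the stored coefficients, which is why the language used for training matters less for user-facing latency.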

So, both languages are equally good. Therefore, depending upon your domain and the place you work, you have to choose the right language smartly. The technology world usually prefers using a single language. Business users (marketing analytics, retail analytics) usually go with statistical programming languages like R, since they frequently do quick prototyping and build visualisations (which is faster to do in R than in Python).

binga

Posted 2014-06-12T06:04:48.243

Reputation: 674

16R hardly beats python in visualization. I think it's rather the reverse; not only does python have ggplot (which I don't use myself, since there are more pythonic options, like seaborn), it can even do interactive visualization in the browser with packages like bokeh. – Emre – 2014-06-12T15:57:49.973

ggplot was built in 2005 and it continues to be the favorite of many researchers due to its intuitive nature and well defined grammar of visualizations. Seaborn is built on top of matplotlib and it could be adopted by many in the coming years, but ggplot still leads the charts in statistical data analysis visualizations. ggplot for python is so unpythonic in nature; the API is taken directly from the R implementation. As for interactive visualisations, both python and R have d3js and d3py to leverage them. – binga – 2014-06-12T17:12:15.527

11Also R has the ability to do interactive viz with Shiny. – stanekam – 2014-06-12T22:04:23.300

14Libraries - I do not agree at all with that. R is by far the richest tool set, and more than that it provides the information in a proper way, partly by inheriting from S, partly through one of the largest communities of reputed experts. – rapaio – 2014-06-13T06:11:10.367

I am not sure if the number of libraries for R is a purely good thing. There are too many ways to accomplish a goal, which confuses beginners and advanced students alike (ok, how can I summarize my data? summary() or describe() or...). Also, packages often have their own styles, which leads to headaches when you are working with multiple ones. Third, the naming is atrocious; a software engineer would be flogged for methods like read.csv2 or cut2... – Christian Sauer – 2014-06-13T14:10:40.397

1Kaggler here. I generally use R for data exploration, visualization, and feature engineering with data.table and ggplot2 and then pipe my data over to python for modeling using scikit-learn. – Ben – 2016-07-09T20:47:21.800

1I mostly agree with this answer. However, I would add one more point: documentation resources, where Python beats R pretty easily. Python has a large community everywhere, while the R community was pretty much limited to the ugly mailing list until recently. Also, the poor "googlability" of R keeps annoying me... – Blaszard – 2016-09-11T09:36:27.717

1One field where python completely beats R is Deep Learning. There are so many libraries with awesome implementations, like Keras, TensorFlow, Theano, torch, etc. – enterML – 2016-12-13T16:31:29.813

'Python 3, R 2' 3>2 for libraries for Python is pure LOL – Qbik – 2017-02-07T21:01:50.077

@j.a.gartner The official Apache Spark page (http://spark.apache.org/) says it supports applications in Java, Scala, Python, AND R shells... – Ryan Chase – 2017-05-15T00:21:37.040

@RyanChase that's true. Two years ago when I made the comment, R support was just starting to be added, and its usability suffered by comparison with its java/scala/python counterparts. It may have closed the gap since, as R has a large community of supporters. – j.a.gartner – 2017-05-17T22:35:41.223

For the completeness of discussion I would like to comment on the "difficulty of learning" aspect. While some might say that R is a bit more quirky and therefore more difficult to learn, recent developments such as tidyverse and magrittr have turned things around in this respect. These new packages offer extremely convenient syntax that is easy to read and understand. Many argue these days that learning R needs to occur using this new approach, and the base syntax can be acquired later, if and when necessary. – Maxim.K – 2017-11-06T11:36:02.513

@Emre Python's ggplot is very limited. R does kill Python for visualisation. You have ggvis and plotly as well as flex dashboards and R Shiny. – Seanosapien – 2017-11-23T13:51:20.697

plotly is a service; you can use its API in python too. – Emre – 2017-11-23T16:30:35.830

Predictions do not happen "in a jiffy" for KNN. – wordsforthewise – 2018-01-18T02:51:41.827
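
To illustrate why k-nearest neighbours is the exception: it has essentially no training step, so the work moves to prediction time, where a brute-force implementation compares each query against every stored training point. A minimal standard-library sketch follows; the function name `knn_predict` and the toy data are assumptions for illustration, and libraries such as scikit-learn can use tree-based indexes to reduce this cost.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Brute force: one distance computation per training point, per query.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy data (assumption): two well-separated clusters.
train_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, (0.3, 0.3)))  # a
print(knn_predict(train_points, train_labels, (4.9, 5.0)))  # b
```

Every call walks the full training set, so prediction cost grows with the amount of stored data.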

35"Speed: R software initially had problems with large computations (say, like nxn matrix multiplications). But, this issue is addressed with the introduction of R by Revolution Analytics. They have re-written computation intensive operations in C which is blazingly fast. Python being a high level language is relatively slow."

I'm not an experienced R user, but as far as I know pretty much everything with low-level implementations in R also has a similar low-level implementation in numpy/scipy/pandas/scikit-learn/whatever. Python also has numba and cython. This point should be a tie. – Danica – 2015-04-03T22:05:28.860
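
To illustrate the point about low-level implementations: NumPy's `@` operator hands matrix multiplication to compiled code, while the same computation written as a Python loop runs in the interpreter. The sketch below only checks that both give the same result; the helper `matmul_pure_python` and the matrix sizes are assumptions for the example, and no timing figures are claimed.

```python
import numpy as np


def matmul_pure_python(a, b):
    """Naive triple-loop matrix product on nested lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    result = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                result[i][j] += a[i][k] * b[k][j]
    return result


rng = np.random.default_rng(42)
a = rng.random((30, 20))
b = rng.random((20, 10))

slow = np.array(matmul_pure_python(a.tolist(), b.tolist()))
fast = a @ b  # delegated to NumPy's compiled routines

print(np.allclose(slow, fast))  # True
```

Wrapping both versions in `timeit` on your own machine is the honest way to compare their speed.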

8For your "Dealing with Big Data" comment, I would add that python is one of the 3 languages supported by apache spark, which has blazing fast speeds. Your comment about R having a C back end is true, but so does python; the scikitlearn library is very fast as well.

I think your post has nice balance, but I contend that speed is at least a tie, and scalability (i.e. handling big data) is certainly in favor of python. – j.a.gartner – 2015-05-12T16:31:21.347

In my own classroom experience, R has better data visualization – confused – 2020-06-30T04:12:34.313

26

There is nothing like "python is better" or "R is much better than x".

The only fact I know is that in industry a lot of people stick to python because that is what they learned at university. The python community is really active and has a few great frameworks for ML, data mining, etc.

But to be honest, if you get a good c programmer he can do the same as people do in python or r, if you got a good java programmer he can also do (near to) everything in java.

So just stick with the language you are comfortable with.

Johnny000

Posted 2014-06-12T06:04:48.243

Reputation: 579

6But what about the libraries? There are advanced R packages (think Random Forest or Caret) that would be utterly impractical to reimplement in a general purpose language such as C or Java – Santiago Cepas – 2014-06-13T09:35:46.040

mahout i.e. supports random forest for java – Johnny000 – 2014-06-13T10:17:04.727

ok, maybe RF wasn't a good example, but you get my meaning: there are hundreds of statistical packages in R not implemented in other platforms – Santiago Cepas – 2014-06-13T10:21:46.037

1Yeah maybe, but R doesn't deliver the performance you need for processing big sets of data, and most of the time you have really big datasets in industrial use. – Johnny000 – 2014-06-13T10:41:21.297

1I don't think that's always true @Pithikos. Given the underlying math formulas, I can usually implement them myself in VB/T-SQL faster than I can wade through the unnecessarily arcane syntax of either R or Python libraries, and in the process make the resulting code far more scalable. I'm glad these libraries exist, but there are downsides built into them; in some situations and particular projects it's better to bypass them. – SQLServerSteve – 2017-05-08T20:15:01.607

1Yeah we could write a neural network in Whitespace too but nobody does it...hmm I wonder why. – wordsforthewise – 2018-01-18T02:52:39.257

1Yes, a good programmer can do the same in C. BUT a bad programmer can do it in Python as fast as an experienced programmer can do it in C. – Pithikos – 2015-01-13T13:38:41.387

That's true @Pithikos – Johnny000 – 2015-01-13T14:38:23.670

17

Some additional thoughts.

The programming language 'per se' is only a tool. Every language was designed to make some types of constructs easier to build than others. And knowledge and mastery of a programming language is more important and effective than the features of that language compared to others.

As far as I can see there are two dimensions of this question. The first dimension is the ability to explore, build proof of concepts or models at a fast pace, eventually having at hand enough tools to study what is going on (like statistical tests, graphics, measurement tools, etc). This kind of activity is usually preferred by researchers and data scientists (I always wonder what that means, but I use this term for its loose definition). They tend to rely on well-known and verified instruments, which can be used for proofs or arguments.

The second dimension is the ability to extend, change, improve or even create tools, algorithms or models. In order to achieve that you need a proper programming language. Roughly all of them are the same. If you work for a company, then you depend a lot on the company's infrastructure and internal culture, and your choices diminish significantly. Also, when you want to implement an algorithm for production use, you have to trust the implementation. And implementing in another language which you have not mastered will not help you much.

For the first type of activity I tend to favor the R ecosystem. You have a great community, a huge set of tools, and proof that these tools work as expected. You can also consider Python and Octave (to name a few), which are reliable candidates.

For the second task, you have to think beforehand about what you really want. If you want robust production-ready tools, then C/C++, Java and C# are great candidates. I consider Python a second-class citizen in this category, together with Scala and friends. I do not want to start a flame war; it's my opinion only. But after more than 17 years as a developer, I tend to prefer a strict contract and my own knowledge over the freedom to do whatever you might think of (as happens with a lot of dynamic languages).

Personally, I want to learn as much as possible. I decided that I have to choose the hard way, which means to implement everything from scratch myself. I use R as a model and inspiration. It has great treasures in libraries and a lot of experience distilled. However, R as a programming language is a nightmare for me. So I decided to use Java, and use no additional library. That is only because of my experience, and nothing else.

If you have time, the best thing you can do is to spend some time with all these things. In this way you will earn for yourself the best answer possible, fitted for you. Dijkstra once said that tools influence the way you think, so it is advisable to know your tools before letting them shape how you think. You can read more about that in his famous paper called The Humble Programmer.

rapaio

Posted 2014-06-12T06:04:48.243

Reputation: 3 864

14

I would add to what others have said so far. There is no single answer saying that one language is better than the other.

Having said that, R has a better community for data exploration and learning. It has extensive visualization capabilities. Python, on the other hand, has become better at data handling since the introduction of pandas. Learning and development time is much shorter in Python than in R (R being a low level language).

I think it ultimately boils down to the eco-system you are in and personal preferences. For more details, you can look at this comparison here.

Kunal

Posted 2014-06-12T06:04:48.243

Reputation: 276

1R is not really that "low level" IMO. It's also a dynamic language. – xji – 2018-02-15T08:45:52.640

2"R has a better community for [...] learning" - I guess this highly depends on the type of learning. How much is going on with neural networks (arbitrary feed-forward architectures, CNNs, RNNs) in R? – Martin Thoma – 2015-07-19T14:41:19.090

11

There isn't a silver bullet language that can be used to solve each and every data related problem. The language choice depends on the context of the problem and the size of the data, and if you are working at a workplace you have to stick to what they use.

Personally I use R more often than Python due to its visualization libraries and interactive style. But if I need more performance or structured code I definitely use Python, since it has some of the best libraries, such as SciKit-Learn, numpy, scipy etc. I use both R and Python in my projects interchangeably.

So if you are starting on data science work I suggest you learn both. It's not difficult, since Python also provides a similar interface to R with Pandas.

If you have to deal with much larger datasets, you can't escape ecosystems built with Java (Hadoop, Pig, HBase etc).
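
As a rough illustration of that similarity, here are a few pandas operations next to the R idioms they resemble. The column names and values are made up for the example.

```python
import pandas as pd

# A small data frame; in R: df <- data.frame(group = ..., value = ...)
df = pd.DataFrame(
    {
        "group": ["a", "a", "b", "b"],
        "value": [1.0, 3.0, 10.0, 20.0],
    }
)

print(df.head(2))     # R: head(df, 2)
print(df.describe())  # R: summary(df)

# Row filtering; R: df[df$value > 2, ]
large = df[df["value"] > 2]

# Group-wise mean; R: aggregate(value ~ group, data = df, FUN = mean)
means = df.groupby("group")["value"].mean()
print(means)
```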

Kaushalya

Posted 2014-06-12T06:04:48.243

Reputation: 73

7

An issue all other answers fail to address is licensing.

Most of the aforementioned wonderful R libraries are GPL (e.g. ggplot2, data.table). This prevents you from distributing your software in a proprietary form.

Although many uses of those libraries do not imply distribution of the software (e.g. training models offline), the GPL may by itself deter companies from using them. At least in my experience.

In the python realm, on the other hand, most libraries have business-friendly distribution licenses, such as BSD or MIT.

In academia, licensing issues normally are non-issues.

noe

Posted 2014-06-12T06:04:48.243

Reputation: 10 494

7

In my experience, the answer depends on the project at hand. For pure research, I prefer R for two reasons: 1) broad variety of libraries and 2) much of the data science literature includes R samples.

If the project requires an interactive interface to be used by laypersons, I've found R to be too constrained. Shiny is a great start, but it's not flexible enough yet. In these cases, I'll start to look at porting my R work over to Python or js.

Rglish

Posted 2014-06-12T06:04:48.243

Reputation: 41

7

There is no "better" language. I have tried both of them and I am comfortable with Python, so I work with Python only. I am still learning, but I haven't encountered any roadblock with Python so far. The good thing about Python is that the community is very good and you can easily get a lot of help on the Internet. Other than that, I would say go with the language you like, not the one people recommend.

Pensu

Posted 2014-06-12T06:04:48.243

Reputation: 561

6

Not much to add to the provided comments. Only thing is maybe this infographic comparing R vs Python for data science purposes http://blog.datacamp.com/r-or-python-for-data-analysis/

martijn

Posted 2014-06-12T06:04:48.243

Reputation: 21

5

One of the real challenges I faced with R is that different packages are compatible with different versions. Quite a lot of R packages are not available for the latest version of R, and R quite often gives errors because a library or package was written for an older version.

Ram

Posted 2014-06-12T06:04:48.243

Reputation: 21

3I'm not sure this is a particular problem with R, or that it answers the question of how Python and R differ. – Sean Owen – 2014-10-21T14:02:29.913

4

I haven't tried R (well, a bit, but not enough to make a good comparison). However, here are some of Python's strengths:

  • Very intuitive syntax: tuple unpacking, element in a_list, for element in sequence, matrix_a @ matrix_b (for matrix multiplication of NumPy arrays; * is element-wise), ...
  • Many libraries:
    • scipy: Scientific computations; many parts of it are only wrappers for pretty fast Fortran code
    • theano > Lasagne > nolearn: Libraries for neural networks - they can be trained on GPU (nvidia, CUDA is required) without any adjustment
    • sklearn: General learning algorithms
  • Good community:
  • IPython notebooks
  • Misc:
    • 0-indexed arrays ... I made that error all the time with R.
    • Established package structures
    • Good support for testing your code
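
The syntax items from the first bullet, as a short runnable sketch. The variable names are arbitrary. Note that with NumPy arrays the matrix product is written `@` (Python 3.5+), while `*` multiplies element-wise.

```python
import numpy as np

# Tuple unpacking
point = (3, 4)
x, y = point

# Membership test
a_list = [1, 2, 3]
found = 2 in a_list

# Iterating directly over a sequence
total = 0
for element in a_list:
    total += element

# 0-based indexing: the first element is at index 0
first = a_list[0]

# Matrix multiplication vs. element-wise product on NumPy arrays
matrix_a = np.array([[1, 2], [3, 4]])
matrix_b = np.array([[0, 1], [1, 0]])
product = matrix_a @ matrix_b      # [[2, 1], [4, 3]]
elementwise = matrix_a * matrix_b  # [[0, 2], [3, 0]]

print(x, y, found, total, first)  # 3 4 True 6 1
print(product)
```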

Martin Thoma

Posted 2014-06-12T06:04:48.243

Reputation: 15 590

2

I prefer Python over R because Python is a complete programming language so I can do end to end machine learning tasks such as gather data using a HTTP server written in Python, perform advanced ML tasks and then publish the results online. This can all be done in Python. I actually found R to be harder to learn and the payoffs for learning Python are much greater because it can be used for pretty much any programming task.

Dave Julian

Posted 2014-06-12T06:04:48.243

Reputation: 1

2You can do all those 3 things very easily in R – Gaius – 2017-08-25T13:04:08.740

2

R: R is an open-source language that has traditionally been used in academia and research. Because of its open-source nature, the latest techniques get released quickly. There is a lot of documentation available on the internet, and it is a very cost-effective option.

Python: Originating as an open-source scripting language, Python has seen its usage grow over time. Today, it sports libraries (numpy, scipy and matplotlib) and functions for almost any statistical operation or model building you may want to do. Since the introduction of pandas, it has become very strong in operations on structured data.

Python Code

# Import Library
# Import other necessary libraries like pandas, numpy...
from sklearn import linear_model
# Load Train and Test datasets
# Identify feature and response variable(s); values must be numeric NumPy arrays
# (the names on the right-hand side are placeholders for your own data)
x_train = input_variables_values_training_datasets
y_train = target_variables_values_training_datasets
x_test = input_variables_values_test_datasets
# Create linear regression object
linear = linear_model.LinearRegression()
# Train the model using the training sets and check score (R^2)
linear.fit(x_train, y_train)
print('Score: \n', linear.score(x_train, y_train))
# Equation coefficient and Intercept
print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)
# Predict Output
predicted = linear.predict(x_test)
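
The Python template above leaves the datasets as placeholders. A self-contained variant, assuming scikit-learn and NumPy are installed and using synthetic data invented for the example, might look like this:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): y = 2*x0 + 0.5*x1 + 4, without noise.
rng = np.random.default_rng(0)
x_all = rng.normal(size=(120, 2))
y_all = 2 * x_all[:, 0] + 0.5 * x_all[:, 1] + 4

# Hold out the last 20 rows as a test set.
x_train, x_test = x_all[:100], x_all[100:]
y_train, y_test = y_all[:100], y_all[100:]

# Fit the model and report the same quantities as the template.
linear = LinearRegression()
linear.fit(x_train, y_train)

print("R^2 on training data:", linear.score(x_train, y_train))
print("Coefficients:", linear.coef_)
print("Intercept:", linear.intercept_)

# Predict on the held-out rows.
predicted = linear.predict(x_test)
```

Because the synthetic target is an exact linear function of the inputs, the fitted coefficients should recover 2, 0.5 and 4 up to floating-point error; real data will not be this clean.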

R Code

# Load Train and Test datasets
# Identify feature and response variable(s); values must be numeric
# (x_train and x_test are assumed to be data frames with the same column names)
x_train <- input_variables_values_training_datasets
y_train <- target_variables_values_training_datasets
x_test <- input_variables_values_test_datasets
# lm() needs a data frame, so combine features and target into one
x <- data.frame(x_train, y_train = y_train)
# Train the model using the training sets and check score
linear <- lm(y_train ~ ., data = x)
summary(linear)
# Predict Output
predicted <- predict(linear, newdata = x_test)

dileep balineni

Posted 2014-06-12T06:04:48.243

Reputation: 333

0

I do not think Python has a point-and-click GUI that turns it into something like SPSS or SAS. Playing around with those is genuinely fun.

ran8

Posted 2014-06-12T06:04:48.243

Reputation: 323

0

[image not preserved]

I got this image in a linkedin post. Whenever I get a doubt of using python or R, I look into it and it proves to be very useful.

Arun

Posted 2014-06-12T06:04:48.243

Reputation: 154

So what do you choose? – Serhii Polishchuk – 2018-11-17T21:35:26.397