Python as a statistics workbench



Lots of people use a main tool like Excel or another spreadsheet, SPSS, Stata, or R for their statistics needs. They might turn to some specific package for very special needs, but a lot of things can be done with a simple spreadsheet or a general stats package or stats programming environment.

I've always liked Python as a programming language, and for simple needs, it's easy to write a short program that calculates what I need. Matplotlib allows me to plot it.
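A minimal sketch of the kind of short program meant here, using only the standard-library `statistics` module. The sample values are made up for illustration.

```python
import statistics

# Hypothetical measurements, invented for this illustration.
sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8]

summary = {
    "n": len(sample),
    "mean": statistics.mean(sample),
    "median": statistics.median(sample),
    "stdev": statistics.stdev(sample),  # sample standard deviation (n - 1)
}

# Print one line per statistic.
for name, value in summary.items():
    print(name, value)
```

From here, the same list could be handed to Matplotlib for a histogram or box plot.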

Has anyone switched completely from, say, R to Python? R (or any other statistics package) has a lot of functionality specific to statistics, and it has data structures that allow you to think about the statistics you want to perform and less about the internal representation of your data. Python (or some other dynamic language) has the benefit of allowing me to program in a familiar, high-level language, and it lets me programmatically interact with real-world systems in which the data resides or from which I can take measurements. But I haven't found any Python package that would allow me to express things with "statistical terminology" – from simple descriptive statistics to more complicated multivariate methods.

What can you recommend if I wanted to use Python as a "statistics workbench" to replace R, SPSS, etc.?

What would I gain and lose, based on your experience?

Fabian Fagerholm

Posted 2010-08-12T10:46:45.407

Reputation: 155


FYI, there is a new python stats subreddit that is going off:

– naught101 – 2013-12-04T04:12:52.553



It's hard to ignore the wealth of statistical packages available in R/CRAN. That said, I spend a lot of time in Python land and would never dissuade anyone from having as much fun as I do. :) Here are some libraries/links you might find useful for statistical work.

  • NumPy/Scipy You probably know about these already. But let me point out the Cookbook where you can read about many statistical facilities already available and the Example List which is a great reference for functions (including data manipulation and other operations). Another handy reference is John Cook's Distributions in Scipy.

  • pandas This is a really nice library for working with statistical data -- tabular data, time series, panel data. Includes many builtin functions for data summaries, grouping/aggregation, pivoting. Also has a statistics/econometrics library.

  • larry Labeled array that plays nice with NumPy. Provides statistical functions not present in NumPy and good for data manipulation.

  • python-statlib A fairly recent effort which combined a number of scattered statistics libraries. Useful for basic and descriptive statistics if you're not using NumPy or pandas.

  • statsmodels Statistical modeling: Linear models, GLMs, among others.

  • scikits Statistical and scientific computing packages -- notably smoothing, optimization and machine learning.

  • PyMC For your Bayesian/MCMC/hierarchical modeling needs. Highly recommended.

  • PyMix Mixture models.

  • Biopython Useful for loading your biological data into python, and provides some rudimentary statistical/ machine learning tools for analysis.

If speed becomes a problem, consider Theano -- used with good success by the deep learning people.

There's plenty of other stuff out there, but this is what I find the most useful along the lines you mentioned.
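As a rough illustration of the pandas features mentioned in the list above (summaries, grouping/aggregation, pivoting), here is a small sketch. The column names and values are invented, and it assumes a reasonably recent pandas release.

```python
import pandas as pd

# Hypothetical data, invented for this illustration.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "year": [2009, 2010, 2009, 2010, 2010],
    "value": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# Descriptive summary of a numeric column.
print(df["value"].describe())

# Grouping and aggregation: mean and count per group.
by_group = df.groupby("group")["value"].agg(["mean", "count"])
print(by_group)

# Pivoting: groups as rows, years as columns, mean value in each cell.
pivot = df.pivot_table(index="group", columns="year", values="value", aggfunc="mean")
print(pivot)
```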


Reputation: 10 399

Pythonxy is nice but it can get annoying if you want to do large computations as it is only available for 32 bits. Here are unofficial binaries for installing many python packages. They can be quite useful if you decide to work under Windows. @StéphaneLaurent

– JEquihua – 2013-03-26T03:32:21.627

All answers were both helpful and useful, and would all deserve to be accepted. This one, however, does a very good job at answering the question: with Python, you have to put together lots of pieces to do what you want. These pointers will no doubt be very useful for anyone wanting to do statistics/modeling/etc. with Python. Thanks to everyone! – Fabian Fagerholm – 2010-08-17T18:42:30.127

Somebody needs to create a Kickstarter for a Python-like GUI app for doing statistics with all of these tools built in. If I have to use Stata for another minute, I might just kill someone... – Andre Terra – 2014-11-15T17:58:40.447

Is "rpy2" hidden somewhere in there? It feels essential if you want to run R from python

– Sosi – 2017-01-02T17:40:40.577

Yeah.. You can run R from python, relatively easily, natively or through other libraries. It seems the main argument for R is that most of the necessary functions are already packaged in R or available in CRAN. Python also has Spyder, Anaconda, Enthought Python, Jupyter Notebooks... and these days I would expect, with the popularity of python, that most functions available in R are probably already available in Python.

The previous answers seem to be from quite a while back. Wondering if R still is better than Python.. is it more on equal ground? – alpha_989 – 2017-10-30T22:09:46.020

Also for those of you strongly recommending R, have you tried Python's OO programming capabilities? Isn't using the OO capabilities in Python basically giving it similar capabilities as R? – alpha_989 – 2017-10-30T22:11:11.443

@ars please do you know what is the best way to use Python with Windows? – Stéphane Laurent – 2012-07-24T05:09:06.113


@StéphaneLaurent I usually install the various pieces myself, but for a quick start/install, you might consider: pythonxy.

– ars – 2012-07-24T21:21:54.770

This script installs many of the libraries cited above:

– Fr. – 2012-12-08T13:23:00.493


As a numerical platform and a substitute for MATLAB, Python reached maturity at least 2-3 years ago, and is now much better than MATLAB in many respects. I tried to switch to Python from R around that time, and failed miserably. There are just too many R packages I use on a daily basis that have no Python equivalent. The absence of ggplot2 is enough to be a showstopper, but there are many more. In addition to this, R has a better syntax for data analysis. Consider the following basic example:


results = sm.OLS(y, X).fit()


results <- lm(y ~ x1 + x2 + x3, data=A)

What do you consider more expressive? In R, you can think in terms of variables, and can easily extend a model, to, say,

lm(y ~ x1 + x2 + x3 + x2:x3, data=A)

Compared to R, Python is a low-level language for model building.
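To make the contrast concrete, here is a sketch of what assembling the design matrix by hand looks like for the interaction model above, using only NumPy. The data are simulated and the coefficient values are arbitrary choices for the illustration; `sm.OLS(y, X)` expects a matrix `X` built along these lines.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated predictors (invented for this illustration).
x1, x2, x3 = rng.normal(size=(3, n))

# Arbitrary "true" coefficients: intercept, x1, x2, x3, x2:x3.
beta_true = np.array([1.0, 2.0, -1.0, 0.5, 3.0])

# Every column, including the intercept and the x2:x3 interaction,
# has to be written out explicitly.
X = np.column_stack([np.ones(n), x1, x2, x3, x2 * x3])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Ordinary least squares via NumPy's least-squares solver.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_hat, 2))
```

In R, the same extension is the single extra term `x2:x3` in the formula.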

If I had fewer requirements for advanced statistical functions and were already coding Python on a larger project, I would consider Python as a good candidate. I would consider it also when a bare-bone approach is needed, either because of speed limitations, or because R packages don't provide an edge.

For those doing relatively advanced Statistics right now, the answer is a no-brainer, and is no. In fact, I believe Python will limit the way you think about data analysis. It will take a few years and many man-years of effort to produce the module replacements for the 100 essential R packages, and even then, Python will feel like a language on which data analysis capabilities have been bolted on. Since R has already captured the largest relative share of applied statisticians across several fields, I don't see this happening any time soon. Having said that, it's a free country, and I know people doing Statistics in APL and C.


Reputation: 3 575

I would prefer Python for text mining and other intensive coding, while for graphics and statistical purposes, honestly, I don't see myself switching to any other platform from R. But coding in Python is fun, and as it is a full-fledged programming language, I don't see any harm in learning it along with R and Matlab. – lovekesh – 2014-02-11T17:39:40.080

As an update: the first example in the statsmodels documentation is now results = smf.ols('Lottery ~ Literacy + np.log(Pop1831)', data=dat).fit(). Statsmodels is still far behind other statistical packages like R in terms of coverage, but there are more and more things you can do in python before you have to grab another language or statistical package. (GEE and Mixed will be in the next release.) – user333700 – 2014-04-25T15:26:02.850

"What's nice in Python is that all these aspects are handled orthogonally...". I disagree on a number of counts. There is significant overlap between numpy, scipy, statsmodels. R's design is much more modular and economic. Besides, most if not all of the conceptual innovations in data-oriented languages (not just formulas, but also data frames, a grammar of graphics, caret as a grammar of models, knitr, and the still-developing grammar of data of dplyr) have originated in R. The Python community seems always a step behind, and overly focused on performance. – gappy – 2014-04-25T17:04:04.070

@gappy: that's because python is a general programming language, and there are lots of other uses that require performance. It's not surprising that that's filtering in to the python stats community. Besides, there is so much call for big data analysis these days (even if that means 2Gb datasets on a laptop), that it's understandable that a stats community wants some focus on efficiency. – naught101 – 2015-02-06T03:33:44.077

+1 I just like this response because of the emphasis you put on R as a statistical language to work with data using formulas and the like. That being said, I'm expecting a great positive impact of pandas (combined with statsmodels) in the Python community. – chl – 2011-10-13T21:44:45.723


In the Python community, patsy is addressing the need for "formula", which you describe, at times improving on what R offers. What's nice in Python is that all these aspects are handled orthogonally. Pandas will take care of timeseries and dataframe/series representation. patsy for the formulas. numpy for array representation and vectorization. statsmodels wraps statistics algos. scipy for optimization and a bunch of other stuff. The result is cleaner interfaces. R, in comparison, is more mature, but is a hairball. ../..

– blais – 2012-08-13T03:08:00.820


../.. I think in the long run the forces will push in the direction of more and more Python integration and you will find it will become quite a competitor to R. Cleaning data in R is such a PIA compared to Python, and it's never a trivial part of the job. – blais

– chl – 2012-08-13T08:54:47.770


First, let me say I agree with John D Cook's answer: Python is not a Domain Specific Language like R, and accordingly, there is a lot more you'll be able to do with it further down the road. Of course, R being a DSL means that the latest algorithms published in JASA will almost certainly be in R. If you are doing mostly ad hoc work and want to experiment with the latest lasso regression technique, say, R is hard to beat. If you are doing more production analytical work, integrating with existing software and environments, and concerned about speed, extensibility and maintainability, Python will serve you much better.

Second, ars gave a great answer with good links. Here are a few more packages that I view as essential to analytical work in Python:

  • matplotlib for beautiful, publication quality graphics.
  • IPython for an enhanced, interactive Python console. Importantly, IPython provides a powerful framework for interactive, parallel computing in Python.
  • Cython for easily writing C extensions in Python. This package lets you take a chunk of computationally intensive Python code and easily convert it to a C extension. You'll then be able to load the C extension like any other Python module but the code will run very fast since it is in C.
  • PyIMSL Studio for a collection of hundreds of mathematical and statistical algorithms that are thoroughly documented and supported. You can call the exact same algorithms from Python and C, with nearly the same API and you'll get the same results. Full disclosure: I work on this product, but I also use it a lot.
  • xlrd for reading in Excel files easily.

If you want a more MATLAB-like interactive IDE/console, check out Spyder, or the PyDev plugin for Eclipse.

Josh Hemann

Reputation: 2 686

Well, can I have multi-threaded code with R? Network asynchronous I/O? Believe me, these use cases actually arise in scientific computing. R is a DSL, in my opinion. It is strong at statistics, and bad at most other things. – Gael Varoquaux – 2014-08-19T21:15:42.380

Another great answer! PyIMSL Studio sounds interesting, too bad it isn't open source. There's some overlap with NumPy/SciPy, though. In any case, I think these were good tips for anyone wanting to assemble their own Python statistics workbench! – Fabian Fagerholm – 2010-08-26T07:29:27.057

It is free as in beer (for non-commercial use), but alas, not free as in speech. – Josh Hemann – 2010-08-26T18:08:07.593


@hadley: You should probably go and remove shell scripts, PostScript, HTML+CSS3, MediaWiki templates and more from that page as well.

– naught101 – 2015-02-06T03:38:26.357

I find that, in production, if the old doesn't do the job, sometimes the new does. And it isn't like the fundamentals aren't also baked into R. – EngrStudent – 2015-04-06T18:11:34.320

R is not a DSL in the usual sense of the term. It's a full, Turing complete programming language. – hadley – 2011-10-13T02:50:12.200


@hadley: Perhaps I am using "DSL" too colloquially, but for what it is worth, the Wikipedia page on DSLs explicitly lists S+ and R as examples of DSLs and Python as general purpose language. In the same vein, SAS is Turing-complete (only if the IML macro component is used), but I would hardly call it a complete language in a practical sense. I find R invaluable in my work, but I try to code using general purpose languages as much as possible rather than trying to do everything in R (or Excel for that matter).

– Josh Hemann – 2011-10-13T16:12:02.630

I think it's unfair to include R and S in the same list as those other languages in wikipedia - there is nothing you can't do in R that you can do in python. Of course there are many things that are better suited to another programming language, but the same is true of Python. – hadley – 2011-10-13T16:36:10.167

@hadley quoting Wikipedia: "A domain-specific language (DSL) is a computer language specialized to a particular application domain. This is in contrast to a general-purpose language (GPL), which is broadly applicable across domains, and lacks specialized features for a particular domain." Hard to justify R and S+ do not fall within the scope of this definition. – None – 2016-04-01T22:53:37.407

I can't see R mentioned in the WP article on DSL. A more correct statistical example seems to be bugs/jags. I will add them. – kjetil b halvorsen – 2012-10-08T16:53:27.573

Ahh, hadley removed R and S+ from the Wikipedia page the same day we exchanged comments, October 13, 2011. So, I often hear the mantra "R was developed by and for statisticians" as its foundational strength. Apparently, now it is a general purpose language, too... – Josh Hemann – 2012-10-10T04:32:19.767


I don't think there's any argument that the range of statistical packages in CRAN and Bioconductor far exceeds anything on offer from other languages; however, that isn't the only thing to consider.

In my research, I use R when I can, but sometimes R is just too slow, for example for a large MCMC run.

Recently, I combined python and C to tackle this problem. Brief summary: fitting a large stochastic population model with ~60 parameters and inferring around 150 latent states using MCMC.

  1. Read in the data in python
  2. Construct the C data structures in python using ctypes.
  3. Using a python for loop, call the C functions that updated parameters and calculated the likelihood.

A quick calculation showed that the programme spent 95% of its time in C functions. However, I didn't have to write painful C code to read in data or construct C data structures.
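The three steps above can be sketched roughly as follows. This is not the author's code: the C math library's `log` stands in for the custom C functions, the data values are invented, and it assumes a Unix-like system where `ctypes.util.find_library("m")` can locate that library.

```python
import ctypes
import ctypes.util
import math

# Step 1: "read in" the data in Python (hard-coded, invented values here).
data = [0.5, 1.5, 2.5, 4.0]

# Step 2: construct the C data structure from Python with ctypes.
DoubleArray = ctypes.c_double * len(data)
c_data = DoubleArray(*data)

# Load the C math library as a stand-in for a custom compiled library,
# and declare the C signature: double log(double).
libm = ctypes.CDLL(ctypes.util.find_library("m"))
libm.log.restype = ctypes.c_double
libm.log.argtypes = [ctypes.c_double]

# Step 3: a Python for loop that calls into C for the numerical work.
log_sum = 0.0
for i in range(len(c_data)):
    log_sum += libm.log(c_data[i])

# Cross-check against Python's own math.log.
reference = math.fsum(math.log(x) for x in data)
print(log_sum, reference)
```

With a real model, the C function would take a pointer to the whole array (or a `ctypes.Structure`) so that each call does a substantial amount of work on the C side.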

I know there's also rpy, where python can call R functions. This can be useful, but if you're "just" doing statistics then I would use R.


Reputation: 8 664

Inserting shameless plug for Rcpp :) – Dirk Eddelbuettel – 2010-08-12T14:37:32.143

curious if you've tried PyMC and how the performance compares (relative to python/C) for your models. – ars – 2010-08-13T07:31:00.843

@ars: In the case above, each iteration (of the 10^8 iterations) involved solving 5 ODEs. This really had to be done in C. The rest of the code was fairly simple and so the C code was straightforward. My application was non-standard and so PyMC wasn't applicable - also it was ~2 years ago. – csgillespie – 2010-08-13T12:33:10.720

@cgillespie: curiosity sated. sounds interesting, thanks. :) – ars – 2010-08-13T17:46:37.937


Jeromy Anglim

Reputation: 30 906

All of these discussions have been removed :-(. Perhaps this answer should be removed too? – Jonathan – 2013-03-19T15:47:02.900

That's sad. I've updated the links to refer to wayback machine copies. – Jeromy Anglim – 2013-03-20T05:54:08.503


I haven't seen scikit-learn explicitly mentioned in the answers above. It's a Python package for machine learning. It's fairly young but growing extremely rapidly (disclaimer: I am a scikit-learn developer). Its goals are to provide standard machine learning algorithmic tools in a unified interface with a focus on speed and usability. As far as I know, you cannot find anything similar in Matlab. Its strong points are:

  • A detailed documentation, with many examples

  • High quality standard supervised learning (regression/classification) tools. Specifically:

  • The ability to perform model selection by cross-validation using multiple CPUs

  • Unsupervised learning to explore the data or do a first dimensionality reduction, that can easily be chained to supervised learning.

  • Open source, BSD licensed. If you are not in a purely academic environment (I am in what would be a national lab in the state) this matters a lot as Matlab costs are then very high, and you might be thinking of deriving products from your work.

Matlab is a great tool, but in my own work, scipy+scikit-learn is starting to give me an edge on Matlab because Python does a better job with memory due to its view mechanism (and I have big data), and because scikit-learn enables me to very easily compare different approaches.
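A small sketch of the points listed above: an unsupervised step (PCA) chained to a supervised classifier, with model selection by cross-validation spread over more than one CPU. The dataset, estimators, and parameter grid are illustrative choices, not taken from this answer, and the import paths are those of current scikit-learn releases.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)

# Chain dimensionality reduction to a supervised classifier.
pipeline = Pipeline([
    ("reduce", PCA()),
    ("classify", LogisticRegression(max_iter=1000)),
])

# Candidate settings to compare by 5-fold cross-validation.
param_grid = {
    "reduce__n_components": [2, 3],
    "classify__C": [0.1, 1.0, 10.0],
}

# n_jobs=2 asks for the cross-validation fits to run on two parallel workers.
search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=2)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Swapping in a different classifier only means changing one pipeline step, which is what makes comparing approaches cheap.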

Gael Varoquaux

Reputation: 711


One benefit of moving to Python is the possibility to do more work in one language. Python is a reasonable choice for number crunching, writing web sites, administrative scripting, etc. So if you do your statistics in Python, you wouldn't have to switch languages to do other programming tasks.

Update: On January 26, 2011 Microsoft Research announced Sho, a new Python-based environment for data analysis. I haven't had a chance to try it yet, but it sounds like an interesting possibility if you want to run Python and also interact with .NET libraries.

John D. Cook

Reputation: 2 854

I have done a lot of number crunching, one website and a few administrative scripts in R and they are working quite nicely. – mbq – 2010-08-14T19:40:50.960


Perhaps this answer is cheating, but it seems strange no one has mentioned the rpy project, which provides an interface between R and Python. You get a Pythonic API to most of R's functionality while retaining the (I would argue nicer) syntax, data processing, and in some cases speed of Python. It's unlikely that Python will ever have as many bleeding edge stats tools as R, just because R is a DSL and the stats community is more invested in R than possibly any other language.

I see this as analogous to using an ORM to leverage the advantages of SQL, while letting Python be Python and SQL be SQL.

Other useful packages specifically for data structures include:

  • pydataframe replicates a data.frame and can be used with rpy. Allows you to use R-like filtering and operations.
  • pyTables Uses the fast hdf5 data type underneath, been around for ages
  • h5py Also hdf5, but specifically aimed at interoperating with numpy
  • pandas Another project that manages data.frame like data, works with rpy, pyTables and numpy

Griffith Rees

Reputation: 171

Perhaps the rmagic extension for IPython (as pointed out by @CarlSmith) can make it easier to work with rpy2?

– Jonathan – 2013-03-19T15:53:28.420

I've always found rpy sloppy to work with. It requires long lines of code even for some simple functions, for example. – Néstor – 2012-07-24T02:48:55.847


I would like to say that, as someone who relies heavily on linear models for statistical work and loves Python for other aspects of the job, I have been highly disappointed in Python as a platform for doing anything but fairly basic statistics.

I find R has much better support from the statistical community, much better implementation of linear models, and to be frank from the statistics side of things, even with excellent distributions like Enthought, Python feels a bit like the Wild West.

And unless you're working solo, the odds of you having collaborators who use Python for statistics, at this point, are pretty slim.


Reputation: 15 705

"I find it...". This answer forgets to say what "it" is, but I assume "it" is R. – Faheem Mitha – 2015-04-05T23:15:44.073

@FaheemMitha Fixed. – Fomite – 2015-04-06T18:07:13.013

Also, I think "the odds ... is pretty slim" should be "the odds ... are pretty slim". – Faheem Mitha – 2015-04-06T18:15:43.227


There's really no need to give up R for Python anyway. If you use IPython with a full stack, you have R, Octave and Cython extensions, so you can easily and cleanly use those languages within your IPython notebooks. You also have support for passing values between them and your Python namespace. You can output your data as plots, using matplotlib, and as properly rendered mathematical expressions. There are tons of other features, and you can do all this in your browser.

IPython has come a long way :)

Carl Smith

Reputation: 143


I am a biostatistician in what is essentially an R shop (~80% of folks use R as their primary tool). Still, I spend approximately 3/4 of my time working in Python. I attribute this primarily to the fact that my work involves Bayesian and machine learning approaches to statistical modeling. Python hits much closer to the performance/productivity sweet spot than does R, at least for statistical methods that are iterative or simulation-based. If I were performing ANOVAs, regressions and statistical tests, I'm sure I would primarily use R. Most of what I need, however, is not available as a canned R package.


Reputation: 408


+1 for distinguishing what area of statistics you work in. There are areas of statistical computing (e.g. unstructured text analysis and computer vision) for which a lot of functionality exists in Python, and Python is seemingly the lingua franca in those sub-domains. I think where the Python community has to catch up is in improving the data structures and semantics around classical statistical modeling that R's design is so good at. The scikits.statsmodels developers are making a lot of progress on that front.

– Josh Hemann – 2011-10-13T17:59:24.510


What you are looking for is called Sage:

It is an excellent online interface to a well-built combination of Python tools for mathematics.


Reputation: 101

The brilliant part about Sage is that it is essentially the union of a number of great free tools for mathematics, statistics, data analysis, etc. It is more than just Python; it has access to R, maxima, GLPK, GSL, and more. – shabbychef – 2011-10-13T16:48:30.063


Rpy2 - play with R stay in Python...

Further elaboration per Gung's request:

Rpy2 documentation can be found at

From the documentation: the high-level interface in rpy2 is designed to facilitate the use of R by Python programmers. R objects are exposed as instances of Python-implemented classes, with R functions as bound methods to those objects in a number of cases. This section also contains an introduction to graphics with R: trellis (lattice) plots as well as the grammar of graphics implemented in ggplot2 let one make complex and informative plots with little code written, while the underlying grid graphics allow all possible customization.

Why I like it:

I can process my data using the flexibility of Python, turn it into a matrix using numpy or pandas, do the computation in R, and get back R objects for post-processing. I work in econometrics, and Python simply will not have the bleeding edge stats tools of R. And R is unlikely ever to be as flexible as Python. This does require you to understand R. Fortunately, it has a nice developer community.

Rpy2 itself is well supported and the gentleman supporting it frequents the SO forums. Windows installation may be a slight pain.


Reputation: 116

Welcome to the site, @pythOnometrist. I suspect this is a helpful contribution. Would you mind giving a brief summary of Rpy2, so readers can decide if it's what they're looking for? – gung – 2013-04-05T21:57:56.673


I use Python for statistical analysis and forecasting. As mentioned by others above, Numpy and Matplotlib are good workhorses. I also use ReportLab for producing PDF output.

I'm currently looking at both Resolver and Pyspread which are Excel-like spreadsheet applications which are based on Python. Resolver is a commercial product but Pyspread is still open-source. (Apologies, I'm limited to only one link)


Reputation: 296

Again some interesting tools. I knew about Numpy, Matplotlib and ReportLab, but Pyspread seems like an interesting idea. At least I would like to type Python expressions in spreadsheet cells. While it doesn't solve all possible problems, it could be good for prototyping and playing around with data. – Fabian Fagerholm – 2010-08-28T06:21:35.973

+1 Wow python spreadsheets! Hadn't heard of those yet. I always wished OpenOffice/LibreOffice would really embrace and integrate python scripting in their spreadsheet software – User – 2011-10-06T19:34:45.770


Great overview so far. I have been using Python (specifically scipy + matplotlib) as a Matlab replacement for 3 years while working at a university. I sometimes still go back because I'm familiar with specific libraries, e.g. the Matlab wavelet package is purely awesome.

I like the Enthought Python distribution. It's commercial, yet free for academic purposes and, as far as I know, completely open-source. As I'm working with a lot of students, before using Enthought it was sometimes troublesome for them to install numpy, scipy, ipython etc. Enthought provides an installer for Windows, Linux and Mac.

Two other packages worth mentioning:

  1. ipython (comes already with enthought) great advanced shell. a good intro is on showmedo

  2. nltk - the natural language toolkit great package in case you want to do some statistics /machine learning on any corpus.


Reputation: 171


This is an interesting question, with some great answers.

You might find some useful discussion in a paper that I wrote with Roseline Bilina. The final version is here: (it has since appeared, in almost this form, as "Python for Unified Research in Econometrics and Statistics", in Econometric Reviews (2012), 31(5), 558-591).

Steve Lawford

Reputation: 1


Perhaps not directly related, but R has a nice GUI environment for interactive sessions (edit: on Mac/Windows). IPython is very good but for an environment closer to Matlab's you might try Spyder or IEP. I've had better luck of late using IEP, but Spyder looks more promising.



And the IEP site includes a brief comparison of related software:


Reputation: 3 715


I found a great intro to pandas here that I suggest checking out. Pandas is an amazing toolset and provides the high level data analysis capabilities of R with the extensive libraries and production quality of Python.

This blog post gives a great intro to Pandas from the perspective of a complete beginner:


Reputation: 1

Could you please write a few words about what qualities make it "great" so that readers can determine beforehand whether viewing it would be appropriate for them? – whuber – 2013-03-16T21:47:40.850

Sorry. Just realized I attached the wrong link in my original post. – padawan – 2013-03-17T18:15:13.137


I should add a shout-out for Sho, the numerical computing environment built on IronPython. I'm using it right now for the Stanford machine learning class and it's been really helpful. It has built-in linear algebra packages and charting capabilities. Being .Net, it's easy to extend with C# or any other .Net language. As a Windows user, I've found it much easier to get started with than straight Python and NumPy.


Reputation: 101


No one has mentioned Orange before:

Data mining through visual programming or Python scripting. Components for machine learning. Add-ons for bioinformatics and text mining. Packed with features for data analytics.

I don't use it on daily basis, but it's a must-see for anyone who prefers GUI over command line interface.

Even if you prefer the latter, Orange is a good thing to be familiar with, since you can easily import pieces of Orange to your Python scripts in case you need some of its functionality.

Wojciech Walczak

Reputation: 141


A recent comparison from DataCamp provides a clear picture of R and Python.

It covers the usage of these two languages in the data analysis field. Python is generally used when the data analysis tasks need to be integrated with web apps or if statistics code needs to be incorporated into a production database. R is mainly used when the data analysis tasks require standalone computing or analysis on individual servers.

I found this blog post very useful and hope it helps others understand recent trends in both of these languages. Julia is also coming up in the area. Hope this helps!


Reputation: 1 359


Note that SPSS Statistics has an integrated Python interface (also R). So you can write Python programs that use Statistics procedures and produce either the usual nicely formatted Statistics output or return results to your program for further processing. Or you can run Python programs in the Statistics command stream. You do still have to know the Statistics command language, but you can take advantage of all the data management, presentation output etc that Statistics provides as well as the procedures.


Reputation: 1 289


For those who have to work under Windows, Anaconda really helps a lot. Installing packages under Windows was a headache. With Anaconda installed, you can set up a ready-to-use development environment with a one-liner.

For example, with

conda create -n stats_env python pip numpy scipy matplotlib pandas

all these packages will be fetched and installed automatically.


Reputation: 146


Python has a long way to go before it can be compared to R. It has significantly fewer packages than R and of lower quality. People who stick to the basics or rely only on their custom libraries could probably do their job exclusively in Python but if you're someone who needs more advanced quantitative solutions, I dare to say that nothing comes close to R out there.

It should also be noted that, to date, Python has no proper scientific Matlab-style IDE comparable to RStudio (please don't say Spyder) and you need to work out everything on the console. Generally speaking, the whole Python experience requires a good amount of "geekness" that most people lack and don't care about.

Don't get me wrong, I love Python, it's actually my favourite language which, unlike R, is a real programming language. Still, when it comes to pure data analysis I am dependent on R, which is by far the most specialised and developed solution to date. I use Python when I need to combine data analysis with software engineering, e.g. create a tool which will automate the methods that I first programmed in a dirty R script. On many occasions I use rpy2 to call R from Python because in the vast majority of cases R packages are so much better (or don't exist in Python at all). This way I try to get the best of both worlds.

I still use some Matlab for pure algorithm development since I love its mathematical-style syntax and speed.


Reputation: 1 388


I believe Python is a superior workbench in my field. I do a lot of scraping, data wrangling, large data work, network analysis, Bayesian modeling, and simulations. All of these things typically need speed and flexibility so I find Python to work better than R in these cases. Here are a few things about Python that I like (some are mentioned above, other points are not):

-Cleaner syntax; more readable code. I believe Python to be a more modern and syntactically consistent language.

-Python has Notebook, IPython, and other amazing tools for code sharing, collaboration, publishing.

-IPython's notebook enables one to use R in one's Python code so it is always possible to go back to R.

-Substantially faster without recourse to C. Using Cython, Numba, and other methods of C integration will bring your code to speeds comparable to pure C. This, as far as I am aware, cannot be achieved in R.

-Pandas, Numpy, and Scipy blow standard R out of the water. Yes, there are a few things that R can do in a single line but takes Pandas 3 or 4. In general, however, Pandas can handle larger data sets, is easier to use, and provides incredible flexibility in regard to integration with other Python packages and methods.

-Python is more stable. Try loading a 2gig dataset into RStudio.

-One neat package that doesn't seem mentioned above is PyMC3 - great general package for most of your Bayesian modeling.

-Some above mention ggplot2 and grumble about its absence from Python. If you have ever used Matlab's graphing functionality and/or matplotlib in Python, then you'll know that the latter options are generally much more capable than ggplot2.

However, perhaps R is easier to learn and I do frequently use it in cases where I am not yet too familiar with the modeling procedures. In that case, the depth of R's off-the-shelf statistical libraries is unbeatable. Ideally, I would know both well enough to be able to use either as needed.

Gene Burinsky

Reputation: 273


When you need to move things around on the command line, pythonpy is a nice tool.


Reputation: 101