Why do internet companies prefer Java/Python for data scientist jobs?

44

10

I often see job descriptions for data scientists that ask for Python/Java experience and disregard R. Below is a personal email I received from the chief data scientist of a company I applied to through LinkedIn.

X, Thanks for connecting and expressing interest. You do have good Analytics Skills. However, all our data scientists must have good programming skills in Java/Python as we are a internet/mobile organisation and everything we do is online.

While I respect the decision of the chief data scientist, I am unable to get a clear picture of which tasks Python can do that R cannot. Would anyone care to elaborate? I am actually keen to learn Python/Java, provided I get a bit more detail.

Edit: I found an interesting discussion on Quora. Why is Python a language of choice for data scientists?

Edit2: Blog from Udacity on Languages and Libraries for Machine Learning

Enthusiast

Posted 2016-08-18T05:05:45.470

Reputation: 395

8 Python is a good compromise: it provides many (non-standard) libraries for data science (pandas, scikit, ...) and many industrial processes are already coded in Python. Manu H 2016-08-18T07:59:53.153

4 "our data scientists must have good programming skills in Java/Python as we are a internet/mobile organisation and everything we do is online" is a massive non-sequitur - the conclusion does not follow from the premise. I suspect the CDS is just trying to get rid of you. Spacedman 2016-08-19T10:00:57.910

@Spacedman it may or may not be. It happened almost a year ago. As of today, if I search for jobs at his company, the listings clearly ask for Python/Java, and R is not even mentioned in the job description for the data scientist position. Enthusiast 2016-08-19T10:35:47.640

If he said "must have java/python skills as that's what our codebase is" or "that's what we all like" then it makes sense. But to say "we need Java/Python because we are an internet/mobile company" does not make sense. There are "internet/mobile companies" that only use Objective-C, or JavaScript, or Lisp, or almost any other language. Their reason for wanting Java/Python is not and cannot be because they are an internet/mobile org and everything they do is online. Spacedman 2016-08-19T14:25:38.890

5 @ManuH If by "non-standard," you mean, "not in the standard library," you're correct. But those tools get pretty widespread usage, and they're certainly staples of the language. numpy currently has over 100k questions on SO, pandas has 74k. I think you could certainly make a case that they're industry standards. (At least on the software development side. I'd hardly call myself a "data scientist.") jpmc26 2016-08-19T22:27:09.693

1 As a counterexample, all the people in my region want C# and Python or SPSS, probably because that was what the first analyst turned data scientist used at the company, and it is costly to change gears afterwards. It is the same reason you still occasionally find COBOL code when working as a developer. JGreenwell 2016-08-20T04:53:19.773

2 "Data Scientist" is not a well-defined term. A data scientist is basically somebody who can do useful things with data. They don't have to be using machine learning or statistical packages. Somebody might be using Java/Scala/Spark/whatever to manage large amounts of data and get useful insights without any machine learning. Akavall 2016-08-20T05:35:49.780

1 Hey, look at it as an opportunity. Data analytics theory is the same whether you develop your analysis in R or in Python. Just learn both! I would look for a candidate that has a curious mind and won't mind learning a new language (or two, or three)... Willem van Doesburg 2016-08-21T19:34:39.157

2 @jpmc26 Yes, that was what I meant. Now I realize that even libraries which have not yet reached industry-standard status could be mentioned (one more argument for Python). Manu H 2016-08-22T11:30:55.097

@WillemvanDoesburg Thanks, and I do realize the importance of having a degree of flexibility about learning new languages in the software industry. Since data science overlaps software engineering to some extent, this is bound to happen in data science too. The plus side is that salaries for such people tend to be a bit higher than for traditional statisticians, which is my past experience. Thanks everyone for your input. Enthusiast 2016-11-21T10:23:27.970

Answers

63

So you can integrate with the rest of the code base. It seems your company uses a mix of Java and Python. What are you going to do if a little corner of the site needs machine learning: pass the data around with a database or a cache, drop to R, and so on? Why not just do it all in the same language? It's faster, cleaner, and easier to maintain.

Know any online companies that run solely on R? Neither do I...

All that said, Java is the last language I'd do data science in.

Emre

Posted 2016-08-18T05:05:45.470

Reputation: 7 676

Thanks! So you mean that from training the model to deployment, everything will be done in Python? In that case I believe product or e-commerce companies will not be using R for machine learning. Instead they will use Python, as Python is more suitable for automation in a software company and in e-commerce. Enthusiast 2016-08-18T06:39:17.910

They probably won't care what language you do your prototyping in, but they will probably expect the finished product to be in Java or Python. They might make an exception if you really need a library that only exists in another language. Emre 2016-08-18T06:57:05.933

Can a model built in R be 'translated' to Python? Also, by prototyping do you mean actual model development in which a model is finalized, or do you mean a one-off random forest or SVM model just to see whether it works? Enthusiast 2016-08-18T07:00:15.783

Yes, since Python is a general purpose language. In the worst case, you won't find a ready-made library and you'll have to write your own. You might find that the company implements its own version of an algorithm for which libraries exist. Maybe theirs is faster, or it is tweaked to better suit their needs, etc. Prototyping is the tinkering you do before writing production code. Emre 2016-08-18T07:13:32.670

Thanks! Although I feel the Predictive Model Markup Language (PMML) can bridge the gap between prototyping in one language and deployment in another, doing it all in one language is still better. Thanks a lot for your useful suggestions and comments!! Enthusiast 2016-08-18T07:18:19.937

1 I was about to say a service-oriented architecture also helps bridge technologies. PMML is a bit enterprisey; I haven't used it, but yours is a Java shop, the mother of enterprise languages, so you never know... Emre 2016-08-18T07:21:37.230
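
To make the service-oriented idea in the comment above concrete: the model can sit behind a small HTTP endpoint, and any client (Java, Python, R) only needs to send and receive JSON. The following is a minimal sketch using only Python's standard library. The `predict` function, its weights, the `/predict` path and the JSON field names are all made up for illustration; nothing here is taken from the commenters' actual systems.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, Request, build_opener


def predict(features):
    """Stand-in model: a fixed linear score (the weights are made up)."""
    weights = [0.5, -0.25, 1.0]
    return sum(w * x for w, x in zip(weights, features))


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body, e.g. {"features": [2.0, 4.0, 1.0]}.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"score": predict(payload["features"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging for this example


def start_server():
    """Start the service on a free local port in a background thread."""
    server = HTTPServer(("127.0.0.1", 0), PredictHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def call_service(port, features):
    """What a client in any language would do: POST JSON, read JSON back."""
    request = Request(
        f"http://127.0.0.1:{port}/predict",
        data=json.dumps({"features": features}).encode(),
        headers={"Content-Type": "application/json"},
    )
    opener = build_opener(ProxyHandler({}))  # bypass any configured proxy
    with opener.open(request) as response:
        return json.loads(response.read())["score"]


if __name__ == "__main__":
    server = start_server()
    print(call_service(server.server_address[1], [2.0, 4.0, 1.0]))
    server.shutdown()
    server.server_close()
```

The point of the design is that the caller never sees what language the model is written in, so the same endpoint could just as well be backed by R. The trade-off raised elsewhere in this thread still applies: it is one more service to deploy and maintain.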

3 @Enthusiast don't forget that you can run R under Python using RPy2 (for example), so you may end up (as I did in a previous job) running models written in R through Python so that they can be presented through a web interface via Django. MD-Tech 2016-08-18T13:35:54.057

@MD-Tech I understand that a model can be built using RPy, and once done you will end up with a model in .rds format. Did you use RPy to deploy that in the production environment? Or, to ask the question properly, can you please elaborate on the architecture from model building to deployment in a bit more detail, for the benefit of all? Enthusiast 2016-08-18T14:03:20.443

2 We built the model in plain text .r files that were loaded into the R interpreter to test (and to facilitate building). Whilst this was being built and tested we built a Python Django project with a section that referenced RPy2 and created RPy2 objects. These objects were then used to load the R files in the same way as you would load them in the interpreter so that we could access the functions that wrapped the model. We could then pass data from the database to R via Python. The Python layer gave us the web frontend with Django and control over the database etc. MD-Tech 2016-08-18T14:13:18.683

1 @Enthusiast The results of the model were returned by R within RPy2 and presented in the front end in various guises, mostly graphs. MD-Tech 2016-08-18T14:14:38.543

I think you should have added this as an answer, so that people can upvote it, as it is a useful one. Nevertheless, thank you!! I couldn't understand the "model in plain text .r files that were loaded into the R interpreter to test (and to facilitate building)" part. By model, did you mean something like a linear regression or neural network, or did you mean a bar chart or some other visualization? If it is the former, can you share a link or source with an example? My understanding is that to deploy a model, it first has to be built and tested, and then exported in .rds format for productization. Enthusiast 2016-08-18T14:23:08.097

2 @Enthusiast It was a Bayesian network for finance but I can't say more than that. The model was written in straight R. Just plain text; I was editing it in Vim whenever I needed to, and it was "deployed" by loading the R code, as text, into RPy2 using source("our_code.r") on the RPy2 objects. It was done this way so that we could live edit the model. This isn't an answer to this question; it's an answer to one that hasn't been asked ;) MD-Tech 2016-08-18T14:32:27.310
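
The workflow MD-Tech describes could look roughly like the following with rpy2. This is a sketch under assumptions, not their actual code: the R function name `predict_risk` and the idea that it takes and returns a numeric vector are hypothetical, and only the file name `our_code.r` comes from the comment. It needs a working R installation plus the rpy2 package, so the import is kept inside the function.

```python
def run_r_model(r_file, func_name, values):
    """Source an R script, then call one of its functions on numeric input.

    r_file and func_name are supplied by the caller; the R function is
    assumed (hypothetically) to accept and return a numeric vector.
    """
    # Imported lazily because rpy2 requires a local R installation.
    import rpy2.robjects as ro

    # Equivalent to running source("our_code.r") in an R session.
    ro.r["source"](r_file)

    # Look up the R function that the sourced script defined.
    model_fn = ro.globalenv[func_name]

    # Convert the Python floats to an R numeric vector and call the model.
    result = model_fn(ro.FloatVector(values))

    # Convert the returned R vector back into a plain Python list.
    return list(result)
```

A Django view could then call something like `run_r_model("our_code.r", "predict_risk", [0.1, 0.4, 0.7])` and render the returned list. Because the script is re-sourced on each call, edits to the R file take effect without redeploying the Python side, which matches the "live edit" behaviour described above.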

1 I'm not sure your statement for Java is accurate any more (unless you have a source supporting it). I know people like to beat up on Java for performance, but it's not as bad as it used to be. And Hadoop has become very popular in recent years, and it's built on Java. Java has the added benefit that you can turn just about any PC into a node without modifying your code at all, so for smaller companies that don't have racks of blade servers, they can mobilize every office PC in off-hours for data mining. Shaz 2016-08-19T18:24:10.123

1 It's not about speed or the JVM -- I'm happy to use Scala or Clojure -- but I've got better things to do than write Java boilerplate. It just is not a good fit for data science; its target audience is different. Machine learning engineering maybe, but I wouldn't want to do any prototyping in Java. Emre 2016-08-19T18:44:28.917

The last language that I would try to do data science in would be assembler, and even that would be a close race with VB6, which some people I know are still trying to use for everything. Then again, I just like Python. Steve Barnes 2016-08-21T12:45:00.013

23

There may be a lot of reasons like:

  1. Workforce flexibility: Java / Python programmers can be moved to other tasks or projects easily.

  2. Candidate availability: there are plenty of Java / Python programmers. You do not want to introduce a new programming language only to find out later that there are no qualified workers or that they are just too expensive.

  3. Integration and ETL: Sometimes getting the data with the right quality is the hardest part of the project. So it is natural to use the same language as the rest of the systems.

  4. Business model definition: Most business rules and business models are already written in these languages.

  5. Just keeping things simple. It is already hard enough to be up-to-date with the technologies. A diverse base of languages can be chaotic. R for this, Ruby for that, Scala, Clojure, F#, Swift, Dart... They may need different servers and different patches, and are hell to administer. All have their own IDEs with tools and plugins (not always free). See some of Uncle Bob's points about language choice and new technologies.

So even if you have a 5% - 15% productivity advantage using R for the specific task, they may prefer a tool that just does the job even if not in the most efficient way.

borjab

Posted 2016-08-18T05:05:45.470

Reputation: 331

Although true, none of the above actually answers the question. Getting the data reduces, 99% of the time, to querying a database or reading .csv files, for which R is actually the most suitable tool on the market. Candidate availability: that there are more Java programmers than R programmers does not imply that you have to discard an R candidate if you have one. It doesn't really matter how the scientist performs their exercises as long as they deploy readable code that can be run by some servers (or any other thing the company is running). gented 2016-08-18T14:09:33.390

Of course you should not discard the candidate. The person is much more important than the tool. Their team may learn R and the candidate can learn Java/Python. But it will take time, which means money. borjab 2016-08-18T15:35:43.257

The point I certainly disagree with is that the language doesn't matter. When the only member of the team who knows R is on holiday and changes are needed, the boss won't be happy. Or just ask the team: "Oh great, we need to learn a new language just because the new one does things this way". Maybe server administration is another department, and new types of server need new analysis, procedures, etc. Maybe you need a green light from IT security to use a new language. borjab 2016-08-18T15:40:03.320

@GennaroTedesco the code written by the candidate must be maintainable by other programmers, both while working together and in the future when the original author has moved on. It's not sufficient to have a candidate who knows a tech well; it is still important to consider how easy it will be to hire another candidate who knows the tech well when you need one. Of course, a new piece of niche technology can be introduced if there's a good reason, but there needs to be a good reason to outweigh such business risks. Peteris 2016-08-19T01:01:43.747

You might have an $x productivity improvement by using R, but it's no help if they have to expend $2x of effort in changes to their workflow. Why would they do that, especially if they could hire someone else who might not cost them $2x? user1908704 2016-08-20T16:10:41.000

14

It is in general true that for pure data science and statistics exercises R offers the best and fastest (especially if using the data.table package) tools and methods, which would otherwise be heavier to implement in Python (I assume by Python we all mean Pandas, though). Most data scientists do in fact use R to build their models and run their calculations, or just to see how data behave.

Once the exercise is complete it is time to make it available to the rest of the people who have to use it (i.e. to deploy); for this purpose it is often preferred to submit the code in Python, for two main reasons:

  1. Most architectures are written in Python or are Python-friendly, therefore it is easier to implement models natively written in that language.
  2. R syntax and grammar is extremely complicated. I myself strongly favour R over anything else, but I have to admit that the syntax is not really straightforward and has a very steep learning curve.

That said, it is still true that one can easily translate R code into any other language, provided methods, libraries and packages are available (in Python most of them are, so that is no problem at all). Plenty of infrastructures and databases support underlying R code, hence portability is not really a problem, especially if one just has to submit the results of the calculations (to that extent, nobody really sees the underlying code anyway).

Java is of almost no use for pure data science itself (although Stanford University has a collection of machine learning NLP libraries written in Java, as far as I remember - but please check). The only reason it might be required is that the rest of the company uses it extensively and they do not want to replace it with something new.

gented

Posted 2016-08-18T05:05:45.470

Reputation: 239

Thanks for sharing your perspective and experience!! This is helpful. From your second-to-last paragraph, I assume you are talking about scikit-learn? Or did you mean RPy? Care to elaborate? Enthusiast 2016-08-18T14:14:27.957

1 I simply mean that whatever you are doing in R, there is most likely a similar Python package that does the same job. Pandas covers most of the things that data.table offers; scikit-learn, as you mentioned, is another example, but there are many more depending on the case at hand. gented 2016-08-18T14:18:45.293

1 Exactly what I do. Research in R, once that is finished, translate to Python to integrate into the codebase. But @Enthusiast whether you can do the same in that company depends on its culture. Most people use the programming language their boss uses. And Python isn't hard to learn. jf328 2016-08-18T15:51:58.053

1 @GennaroTedesco: "I simply mean that whatever you are doing in R, there is most likely a similar Python package that does the same job". I actually strongly disagree with this statement. The biggest advantage with R is that 90% of statisticians publish their latest and "greatest" in R, rather than Python. If these methods catch on, they may eventually make their way to Python. But that's also a plus for Python; there are lots of R stats packages that are just garbage, while I think Python stats packages are more likely to be the tried and true methods. Cliff AB 2016-08-21T17:56:50.247

"R syntax and grammar is extremely complicated. I myself strongly favour R other than anything else but have to however admit that the syntax is not really straightforward and has a very picked learning curve." Both of these seem to be opinions, but one is dressed as an objective statement and the other opposes it. I'm baffled. I also feel that Python's syntax and idioms are more complicated (OOP emphasis, for one), so I'm doubly confused by this answer.bright-star 2016-08-21T17:59:18.687

@TrevorAlexander I don't understand why the latter statement opposes the former. And yes, they are of course personal opinions. gented 2016-08-21T19:48:37.637

7

I've seen quite a few companies using the title Data Scientist for "Data Engineer" type roles. Particularly in the big data space.

If the company is using Hadoop or a distributed framework like Spark to do its analytics, then Java or Python (or probably Scala) would be the languages that make the most sense.

greenpenguin

Posted 2016-08-18T05:05:45.470

Reputation: 71

In this case I know for sure that the role was for modelling, as it asked for machine learning skills and specified a list of techniques. Enthusiast 2016-08-19T13:53:39.480

They could still be doing that inside those technologies, though, using Java/Python libraries; something like H2O or MLlib springs to mind. greenpenguin 2016-08-19T13:59:12.107

4

Java

I'd have to disagree with the other posters on the Java question. There are certain NoSQL databases (like Hadoop) for which one needs to write MapReduce jobs in Java. Now you can use Hive to achieve much the same result.

Python

The Python / R debate continues. Both are extensible languages, so potentially both could have the same ability to process data. I only know R, and my Python knowledge is quite superficial. Speaking as a small business owner, you do not want too many tools in your business, otherwise there will be a general lack of depth in them and difficulty supporting them. I think it will come down to the depth of tool knowledge in the team. If the team is focused on Python, then hiring another Python data scientist is going to make sense, as they can engage with the existing code base and historic experiment code.

Marcus D

Posted 2016-08-18T05:05:45.470

Reputation: 561

2

At least for my current team (~80 data scientists and engineers), we don't have such a preference. Half of the data scientists here use R and the other half use Python. Many can code in both. We do deploy Python and R code in production.

I don't think any of our data scientists uses Java at all. If they need to deal with big data, they can use SparkSQL or PySpark. The data engineering team uses a mix of Java/Scala/Python/Go.

If you are one of the few data people in a small company, I can understand why they require certain language skills so you can do both data science and engineering. But tbh, I think most small companies won't have data so big that Python or R can't handle it in production.

piggybox

Posted 2016-08-18T05:05:45.470

Reputation: 121

Can you elaborate on the type of business your organization does? And is it in-house ML work or for external clients? Enthusiast 2017-07-10T08:41:18.633

1 @Enthusiast Retail business. 100% for in-house ML. piggybox 2017-07-10T23:44:24.070

0

The tools in Python are just better than those in R. The R community is pretty stagnant while the Python community is evolving really quickly, especially in tools for data science.
Also, Python works far more easily with everything around it. You can easily scrape the web, connect to databases and so on. That makes prototyping really fast.
And if you have a working prototype and want to make it faster or integrate it into the company workflow, it usually gets reimplemented in Java.

R has a few neat tools and visualization but it is not that great to build new stuff in it.

sebastian

Posted 2016-08-18T05:05:45.470

Reputation: 31

4 That is completely wrong in every respect. gented 2016-08-19T12:35:12.283

0

My point of view as a general-purpose programmer with a tiny bit of R experience: R is excellent for data science, but it's geared towards people manually interpreting data. If you want to use the results for something automated, you have to interface with something else, and that something else will be hard to do in a problem-specific language like R. Can you do a web site in R? :) On the other hand, Python does have ready-made libraries for data sciency stuff and is a general-purpose programming language that doesn't get in the way of your doing anything else with it. As for Java, it's good for large programming projects with hundreds of thousands to millions of lines of code. If the data science part needs to interface with that, it may make sense to do everything in Java.

Random whine: Why do I have to sign in to each StackExchange site separately?

Torp

Posted 2016-08-18T05:05:45.470

Reputation: 21

4 R code can be easily run by almost all the tools available on the market. Java is of almost no use for data science. gented 2016-08-19T12:37:31.250

1 @GennaroTedesco Java is useful for coding in big data tools though. So it is partly useful for querying data. Enthusiast 2016-08-19T13:14:02.840