Data Science in C (or C++)



I'm an R language programmer. I'm also in the group of people who are considered Data Scientists but who come from academic disciplines other than CS.

This works out well in my role as a Data Scientist. However, by starting my career in R and having only basic knowledge of other scripting/web languages, I've felt somewhat inadequate in 2 key areas:

  1. Lack of a solid knowledge of programming theory.
  2. Lack of a competitive level of skill in faster and more widely used languages like C, C++ and Java, which could be utilized to increase the speed of the pipeline and Big Data computations as well as to create DS/data products which can be more readily developed into fast back-end scripts or standalone applications.

The solution is simple of course -- go learn about programming, which is what I've been doing by enrolling in some classes (currently C programming).

However, now that I'm starting to address problems #1 and #2 above, I'm left asking myself, "Just how viable are languages like C and C++ for Data Science?"

For instance, I can move data around very quickly and interact with users just fine, but what about advanced regression, Machine Learning, text mining and other more advanced statistical operations?

So, can C do the job? What tools are available for advanced statistics, ML, AI, and other areas of Data Science? Or must I lose most of the efficiency gained by programming in C by calling on R scripts or other languages?

The best resource I've found thus far is a library called Shark, which gives C/C++ the ability to use Support Vector Machines, linear regression (though not non-linear or other advanced regression like multinomial probit, etc.), and a short list of other great, but limited, statistical functions.


Posted 2015-03-20T14:56:23.420

Reputation: 1 019

This question appears to be primarily opinion based. Please consider rephrasing. Maybe ask what kinds of data science tools are available for C/C++, or what kinds of applications use these languages. – sheldonkreger 2015-03-20T17:30:39.807

@sheldonkreger That is what I'm asking, I'll make that more clear, thanks – Hack-R 2015-03-20T17:44:09.740

Awesome, thanks. – sheldonkreger 2015-03-20T17:46:43.700

I've used Waffles (C++) to incorporate machine learning into existing C++ engines. – Pete 2015-03-24T18:11:02.200

@Pete if you can incorporate that into an answer I'd be likely to mark it as the solution – Hack-R 2015-03-24T18:28:05.220


The MeTA toolkit is available in C++: there's a course on Coursera that uses it (it's still in week 1, so you may want to take a look). The course is called "Text Retrieval and Search Engines". – LauriK 2015-03-25T12:29:46.440

@LauriK OK, why are you guys putting the best answers in comments instead of as actual answers?? ;) Seriously, I upvoted the answers because they were great comments, but the actual answers I was looking for, like this one, are in comments! – Hack-R 2015-03-25T13:25:00.010

@Hack-R thanks :) – LauriK 2015-03-25T20:11:32.137

May I ask whether there is any good book on machine learning in C? – Frank Puk 2015-08-26T03:05:34.000

I appreciate wanting to have a wide variety of skills; however, if your company allows for teams with a variety of backgrounds, your real best bet is to bring in a rock-solid software engineer. C/C++ can be extremely efficient; however, getting the efficiency out of these languages takes a lot of experience. IMO, you're better off developing your ability as a scientist, which is a scarce skill set, and benefiting from a diverse team. – j.a.gartner 2015-11-27T21:31:51.253

@j.a.gartner I hear you, and my main company does have scaled agile scrum teams. However, (i) I do science with C/C++, and I think better so than I could with just scripting languages; (ii) I'm also a one-man entrepreneur/consultant on the side; and (iii) I don't work at a software/tech company, so I'm about as high-tech and programming-knowledgeable as we get. – Hack-R 2015-11-28T01:48:06.693



Or must I lose most of the efficiency gained by programming in C by calling on R scripts or other languages?

Do the opposite: learn C/C++ to write R extensions. Use C/C++ only for the performance critical sections of your new algorithms, use R to build your analysis, import data, make plots etc.

If you want to go beyond R, I'd recommend learning Python. There are many libraries available, such as scikit-learn for machine learning algorithms or PyBrain for building neural networks, etc. (and use pylab/matplotlib for plotting and IPython notebooks to develop your analyses). Again, C/C++ is useful for implementing time-critical algorithms as Python extensions.

Andre Holzner

Reputation: 426

Thanks, Andre. I do use PyBrain a lot; for me Python is a middle ground between R and C, but I still wanted to learn C for both speed and wider application of the code. I selected this as the solution because I had not thought of using C/C++ to write R extensions, which is a really wonderful idea that I'm absolutely going to use. Thanks!! – Hack-R 2015-03-25T13:27:40.333

I second the notion of learning Python. I work with large datasets and data scientists utilizing R to analyze those datasets. Although I learned C at a very young age, Python is the one language that is truly giving me value as a programmer and assisting these data scientists. Therefore, look to complement the team, not yourself. – Glen Swan 2015-04-09T15:13:21.893

Similarly, Python is sped up by writing in Cython (again, basically C). I have to say I have yet to use it myself. There is very little that cannot be done using existing libraries (e.g. scikit-learn and pandas in Python, which are written in Cython so you don't have to!). – seanv507 2015-04-09T21:19:11.783

Some other useful Python libraries include pandas, NumPy, SciPy, etc. Adding this in support of learning Python :) – Shagun Sodhani 2015-05-25T14:45:49.237

This is spot on. I would note that if you don't have a CS background, the chance that you write code more efficiently than the underlying functions for Python or packages for R is quite remote. I programmed in C++ for 13 years, and I still think there are aspects of memory management and performance optimization that I did not do well. Additionally, Python and R have very smart computer scientists optimizing distribution issues, so C languages will really be relegated to extreme low-latency systems. – j.a.gartner 2015-08-18T22:31:54.493

Agreed – you would want to learn C for the performance-critical sections because it is very 'close to the metal': you are in total control of the low-level memory allocations, pointer manipulation, etc., which makes it comparatively much more difficult to write and debug (so you want to do as little of it as possible), but also blazing fast, and you can even tailor your code more specifically to the exact hardware you have, squeezing every ounce of power out of it. – James Allen 2015-08-27T14:53:37.090


As Andre Holzner has said, extending R with C/C++ extensions is a very good way to take advantage of the best of both sides. You can also try the inverse: working in C++ and occasionally calling R functions with the RInside package for R. Here you can find how.

Once you're working in C++ you have many libraries, many of them built for specific problems, others more general.


Reputation: 191


I agree that the current trend is to use Python/R and to bind it to some C/C++ extensions for computationally expensive tasks.

However, if you want to stay in C/C++, you might want to have a look at Dlib:

Dlib is a general purpose cross-platform C++ library designed using contract programming and modern C++ techniques. It is open source software and licensed under the Boost Software License.

(image: a flowchart of dlib's machine learning algorithms, branching on sample size and problem type)

Franck Dernoncourt

Reputation: 2 888

I'm the dlib author. Feel free to post that image wherever you want :). Also, ">20k samples" means you have 20k vectors or whatever. How many variables are in each sample is a separate issue. – Davis King 2016-04-22T13:24:58.063

@Hack-R "Sample" is one of those overloaded terms in statistics / machine learning where sometimes it means a set of instances drawn from a population (as in "sample size", "sample mean", etc.), and sometimes it means the individual instances (as in "trained a classifier on 10K samples"). – Tim Goodman 2016-10-28T18:35:13.167

Another highly useful answer. Do you know if we are allowed to freely reproduce that image (in case I want to put it in a presentation or blog, etc.)? Also, when it says things like ">20k samples", I wonder if it really means "samples" or "observations in your sample"? – Hack-R 2015-11-11T00:11:08.853


R is one of the key tools for data scientists; whatever you do, don't stop using it.

Now, talking about C, C++ or even Java: they are good, popular languages. Whether you need them, or will need them, depends on the type of job or projects you have. From personal experience, there are so many tools out there for data scientists that you will always feel like you constantly need to be learning.

You can add Python or Matlab to the list of things to learn if you want, and keep adding. The best way to learn is to take on a work project using tools that you are not comfortable with. If I were you, I would learn Python before C. It is more widely used in the community than C. But learning C is not a waste of your time.

servais daligou

Reputation: 51

I know what you mean about the overwhelming number of tools! I tell my intern not to be distracted and to focus on just 1 or 2 things, but it's hard to take my own advice. – Hack-R 2015-11-11T00:09:10.247


In my opinion, ideally, to be a more well-rounded professional, it would be nice to know at least one programming language for the most popular programming paradigms (procedural, object-oriented, functional). Certainly, I consider R and Python as the two most popular programming languages and environments for data science and, therefore, primary data science tools.

Julia is impressive in certain aspects, and it is trying to catch up with those two and establish itself as a major data science tool. However, I don't see this happening any time soon, simply due to R's and Python's popularity, very large communities, and enormous ecosystems of existing and newly developed packages/libraries covering a very wide range of domains / fields of study.

Having said that, many packages and libraries focused on data science, ML and AI are implemented in and/or provide APIs in languages other than R or Python (for proof, see this curated list and this curated list, both of which are excellent and give a solid perspective on the variety in the field). This is especially true for performance-oriented or specialized software. For that software, I've seen projects with implementations and/or APIs mostly in Java, C and C++ (Java is especially popular in the big data segment of data science, due to its closeness to Hadoop and its ecosystem, and in the NLP segment), but other options are available, albeit to a much more limited, domain-based, extent. None of these languages is a waste of time; however, you have to prioritize mastering any or all of them against your current work situation, projects and interests. So, to answer your question about the viability of C/C++ (and Java), I would say that they are all viable, not as primary data science tools, but as secondary ones.

Answering your questions on 1) C as a potential data science tool and 2) its efficiency, I would say that: 1) while it's possible to use C for data science, I would recommend against it, because you'd have a very hard time finding corresponding libraries or, even more so, trying to implement the corresponding algorithms yourself; 2) you shouldn't worry about efficiency, as many performance-critical segments of code are already implemented in low-level languages like C, and there are options to interface popular data science languages with, say, C/C++ (for example, the Rcpp package for integrating R with C/C++). This is in addition to simpler, but often rather effective, approaches to performance, such as consistent use of vectorization in R, as well as using various parallel programming frameworks, packages and libraries. For R ecosystem examples, see the CRAN Task View "High-Performance and Parallel Computing with R".

Speaking about data science, I think that it makes quite a lot of sense to mention the importance of reproducible research approach as well as the availability of various tools, supporting this concept (for more details, please see my relevant answer). I hope that my answer is helpful.

Aleksandr Blekh

Reputation: 5 573


As a data scientist, the other languages (C++/Java) come in handy when you need to incorporate machine learning into an existing production engine.

Waffles is both a well-maintained C++ class library and command-line analysis package. It's got supervised and unsupervised learning, tons of data manipulation tools, sparse data tools, and other things such as audio processing. Since it's also a class library, you can extend it as you need. Even if you are not the one developing the C++ engine (chances are you won't be), this will allow you to prototype, test, and hand something over to the developers.

Most importantly, I believe my knowledge of C++ & Java really help me understand how Python and R work. Any language is only used properly when you understand a little about what is going on underneath. By learning the differences between languages you can learn to exploit the strengths of your main language.


For commercial applications with large data sets, Apache Spark's MLlib is important. Here you can use Scala, Java, or Python.


Reputation: 576


I would be keen to understand why you would need another language (apart from Python) if your goal is "advanced regression, Machine Learning, text mining and other more advanced statistical operations".
For that kind of thing, C is a waste of time. It's a good tool to have, but in the ~20 years since Java came out, I've rarely coded C.
If you prefer the more functional-programming side of R, learn Scala before you get into too many procedural bad habits coding with C.
Lastly, learn to use Hadley Wickham's libraries – they'll save you a lot of time doing data manipulation.

Michael Cox

Reputation: 76

Because languages like R and Python are very slow/inefficient compared to languages like C. Thus, when dealing with lots of data and computations, if you can do something in C it's faster than if you do it in R. I do love and use Hadley's packages, though! – Hack-R 2015-03-24T18:25:56.267


There are some C++ tools for statistics and data science, such as ROOT, BAT, Boost, or OpenCV.


Reputation: 21

Awesome! Thank you. I only wish they were for plain C too, but still helpful. – Hack-R 2015-11-11T00:06:44.367


Not sure whether it's been mentioned yet, but there's also Vowpal Wabbit, though it might be specific to certain kinds of problems only.

Felipe Almeida

Reputation: 186

Looks interesting. I only glanced at the link, but the types of models mentioned would be highly useful. Is it a regular C library you can use in a program, though? I'll have to investigate further. – Hack-R 2015-11-11T00:06:22.510


Take a look at Intel DAAL, which is under active development. It's highly optimized for Intel CPU architectures and supports distributed computation.


Reputation: 101


Scalable Machine Learning Solutions for Big Data:

I will add my $.02 because there is a key area that seems not to have been addressed in all of the previous posts - machine learning on big data!

For big data, scalability is key, and R is insufficient. Further, languages like Python and R are only useful for interfacing with scalable solutions, which are usually written in other languages. I make this distinction not because I want to disparage those using them, but only because it's so important for members of the data science community to understand what truly scalable machine learning solutions look like.

I do most of my work with big data on distributed-memory clusters. That is, I don't just use one 16-core machine (4 quad-core processors on a single motherboard, sharing that motherboard's memory); I use a small cluster of 64 16-core machines. The requirements for these distributed-memory clusters are very different from those of shared-memory environments, and in many cases big data machine learning requires scalable solutions within distributed-memory environments.

We also use C and C++ everywhere within a proprietary database product. All of our high-level stuff is handled in C++ and MPI, but the low-level stuff that touches the data is all longs and C-style character arrays, to keep the product very, very fast. The convenience of std::string is simply not worth the computational cost.

There are not many C++ libraries available which offer distributed, scalable machine learning capabilities; one that does is MLPACK.

However, there are other scalable solutions with APIs:

Apache Spark has a scalable machine learning library called MLlib that you can interface with.

Also, TensorFlow now has distributed TensorFlow and a C++ API.

Hope this helps!


Reputation: 3 638