The most popular data analytics language

appears to be R



This is in line with what I’ve heard, though I thought SAS was comparable in primary or secondary tool usage.

This said, it’s important to note that in this survey, we don’t see mention of Python. Working against this is that it is a small (1300-ish) self-selecting sample, and the reporting company has a stake in the results. Also of importance is that R is a package with an embedded programming language, and Python is a programming language with add-ons. This is critical when you discuss tools, sort of like comparing a programmable numerical control machine (R) to a handheld drill unit with attachments. One is built specifically for the analysis, one isn’t. So it really is no wonder that comparing Python to R for big data analytics would result in people (who know better than to do such a comparison) saying, “seriously, WTF?”

More useful to this discussion is comparing pre-packaged versus home grown tools, as well as mixed use (both pre-packaged and home-grown).

This is where the kdnuggets surveys show interesting results. You can examine the data from 2009 to 2013 easily.

One of their conclusions was that

Note that only Tableau and R showed strong growth in both 2013 and 2012.

and the data itself showed

R, up 22%, to 37.4% share, from 30.7% in 2012 (was 23.3% in 2011)

Python, by contrast, declined to 13.3% from 14.9% in 2012. Now understand that the number of (again, self-selecting, and possibly biased) respondents more than doubled between 2012 and 2013. Note that I didn’t see Java, Perl, or any of the other languages that showed up in the 2012 survey.

Indeed, a conclusion from the 2012 survey was

Among those who wrote analytics code in lower-level languages, R, SQL, Java, and Python were most popular.

Note that there were 798 confirmed participants in 2012, and 1880 in 2013. So 13.3% of 1880 represents about 250 people in 2013, while 14.9% of 798 represents about 119 people in 2012. I rounded to the nearest integer (I am not sure what a fractional person actually means). So from this, it appears that the number of Python users roughly doubled (from the survey’s perspective, given all its problems) in a year. But it appears that the market is growing faster than that, so its relative usage actually declined.
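For anyone who wants to check the arithmetic, here is a minimal sketch in Python. The respondent counts and Python shares are the ones quoted above; the variable and function names are mine, purely for illustration, and nothing here corrects for the bias in the sample.

```python
# Figures quoted above from the 2012 and 2013 polls.
# Names are illustrative only, not taken from the survey itself.
respondents = {2012: 798, 2013: 1880}
python_share = {2012: 0.149, 2013: 0.133}


def head_count(year):
    """Approximate number of respondents reporting Python use that year."""
    return round(respondents[year] * python_share[year])


users_2012 = head_count(2012)
users_2013 = head_count(2013)

# Growth in the absolute number of Python respondents.
user_growth = users_2013 / users_2012

# Growth in the overall number of respondents.
sample_growth = respondents[2013] / respondents[2012]

print(f"Python respondents: {users_2012} (2012), {users_2013} (2013)")
print(f"Python respondent growth: {user_growth:.2f}x")
print(f"Total respondent growth:  {sample_growth:.2f}x")
```

The head count grows by a bit more than 2x while the pool of respondents grows by a bit more than 2.3x, which is why the share drops even though the absolute number goes up.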

So even if we place huge error bars on the numbers to reflect the likely impacts of bias, self-selection, etc., it’s hard to draw a conclusion that Python is *replacing* R. More to the point, it’s hard to draw a conclusion that Python is replacing *anything*.

Again, this is not a bash on Python. It is a competent language for tying other things together. The language is, at its core, slow, and you need things like numpy, pandas, and other tools written in other languages to give it speed. Which in the era of big data is extraordinarily important.

But it’s pretty obvious that it’s not replacing anything. It appears to be, at best, static. At worst, declining, though you’d have to accept the surveys as faithful samples of the population as a whole, and not as biased self-selecting segments with an agenda or motive to influence the outcome in a particular direction, in order to take that view.

Also, note how the survey results have changed. They started out with tools like Excel dominating. Now look where Excel is in 2013.

Things change, and likely this time next year, we will see different things. Analysis of surveys and historical data is, by definition, introspective. We can build an extrapolating model from this, at the risk associated with all such models … they could be wrong, and their theoretical/pragmatic basis could be shaky/non-existent.

So let’s see what happens next year.


7 thoughts on “The most popular data analytics language”

  1. @DempseyBI no, this is referencing the previous set of posts here. An author had made a claim about Python displacing R, and handwaved off evidence to the contrary. So I looked into it in depth, both from the performance of the language, and from the rough measured numbers (this post). Evidence for Python displacing R is pretty close to nil. R does appear to be displacing others though.

  2. Don’t mistake the KDNuggets poll for a scientific poll. I am sure that you understand the difference between a random sample of data scientists (which this is not) and a nonrandom sample of participants in the KDNuggets competitions and community. For example, practitioners who work in industry are probably underrepresented in the survey, which biases the results.

    That said, the result of the poll that I thought was interesting is that many respondents now use multiple tools—both commercial and open source. Over the past five years, SAS, SPSS, and other vendors have introduced features that enable users of commercial software to integrate open-source tools into their analyses. For example, SAS has supported calling R since 2009. (For details, see http://blogs.sas.com/content/iml/2013/11/25/twelve-advantages-to-calling-r-from-the-sasiml-language/ ) Similarly, some R users call perl or python from R. This enables today’s analysts to be multi-lingual and leverage the strengths of several of these powerful tools.

  3. In my description, I pointed out all the obvious issues with the KDNuggets and the other polls, in that respondents are self-selecting, often motivated to express their preferences, etc. That puts very large error bars on this if you assume that there is real data lurking in there. My core assumption on this is that there is real data that may be biased by any number of issues. In fact I point out specifically that the growth numbers could be entirely wrong for Python specifically, in that the response bias function is unknown, and the market as a whole is certainly growing, but we don’t know enough to draw any major conclusions about growth rates, other than possibly, the signal in the biased noise being something akin to the massive growth rate of one versus the much smaller growth rate of the other. This is even more interesting (and possibly less biased) when you realize that many more people know the smaller percentage system than the larger “faster” growing percentage system.

    That may be one of the few “resolvable” signals in this data.

    Why I brought all of this up: Basically someone who should have known better got on a soap box and made a claim that was unsupported by evidence, and was unwilling to support it with evidence. He and his supporters used the modus operandi of language advocacy to dismiss and diminish alternative claims and, in some ways, to attack different viewpoints. I decided to look at the (obviously biased) data to see if there was signal in the noise. What I found were a number of obviously biased sources (polls, no matter how random they are, are fundamentally answered by self-selecting elements of a group, after they are “randomly” sampled, and there is a definite bias function applied there … what you have to hope for is that you have the same self-selecting bias function amongst all groups, and that your sample population actually reflects the group membership numbers appropriately, lest you get something really far off from “reality”). What I saw, even “correcting” for the bias, was likely strong enough to report on, and was directly counter to what the advocates said.

    As for your point that data scientists use more than one system/language and they interoperate … yes, exactly. There is no one right tool for this job, but you do need to exclude tools not able to handle the job. I wouldn’t likely do much data science in befunge, or javascript/node. But I wouldn’t exclude them if they were able to bring something of value to the table.

    For me the exercise was helpful in that I got to play with tools I’ve not used in years, or in a few cases, ever. Python is useful, it’s just not as useful in all situations as claimed. It’s not good on the performance side, and for large enough analytics, this matters tremendously.

    This is our bias in that we deal with end users with huge data sets, so tools that really can’t handle them well aren’t seriously considered contenders. This is where I see Python. It can handle the smaller stuff, or link together tools that can handle the bigger stuff. Perl is like that too. As are many others.

  4. @Joe You make excellent points and I’m glad you have been thinking about these issues. I, too, am frustrated when “people who should know better” make broad claims based on very noisy data. You might be interested in a related issue: using the number of Web pages that refer to a subject (or language or software…) as a proxy for popularity. I have written some thoughts on this subject: http://blogs.sas.com/content/iml/2011/08/19/estimating-popularity-based-on-google-searches-why-its-a-bad-idea/ My conclusion was that “using the number of search results as a proxy for popularity is of dubious value and is fraught with statistical perils.”

  5. @RickWicklin Excellent post! I’d love to see someone start discussing the Tiobe index and other things relating to this at some point. I do think they have a signal, but figuring out what it really means is far more complex than their simple analysis indicates. Search as a polling method is IMO more likely to yield something less self-selecting, though my thoughts are that it should be easy to argue that it will miss experts and highly skilled people, and bias things towards noobs. Not that there is anything wrong with this, but understanding your input signal is IMO *far* more than defining a scoring algorithm. It’s actually understanding what inputs you are getting.

    I’ll freely admit this is the physicist in me wanting to make sure I have a testable theory in place before I try my hand at statistics for a measurement, so I can, to some degree, assess the input quality before it ever runs through the model. Most folks seem to start off with a scoring model (as opposed to a theory) and then work from there. How well the scoring model corresponds to something real is left to the reader to determine.
