## Hive: How to calculate the Kendall coefficient of correlation of a pair of a numeric columns in the group?

4

In this wiki page there is a function corr() that calculates the Pearson coefficient of correlation, but my question is that: is there any function in Hive that enables to calculate the Kendall coefficient of correlation of a pair of a numeric columns in the group?

I have never seen such a thing in Hive. – Sean Owen – 2014-12-08T14:23:45.440

That answer does not solve the problem :) – Marcin Kosiński – 2014-12-10T09:57:34.930

Is there a way of implementing a function on my own on Hive or maybe should I use different tool - maybe should I check Apache Spark? – Marcin Kosiński – 2014-12-10T10:13:32.133

Heh, yeah that's why I only commented. What I mean is that, having sniffed around Hive a long time, I've never heard of this, so it's unlikely to exist, but I don't know for sure that this is the answer. Spark does not have it either. – Sean Owen – 2014-12-10T12:11:32.657

3

In Hive itself? Unfortunately, the answer is simply no -- as the language definition manual shows, that statistic is simply not built in. In addition to the language manual, you can get more information on statistics in development in Hive here and here.

Having said that, there are plenty of ways to calculate Kendall's W on data that's in Hive.

You could write out the data to a file or query it into R or a statistical package such as SAS, Stat, MATLAB, Excel, etc then run your calculation and, if necessary, write your results back to Hive.

In R, for instance, you could do something like this:

install.packages("RODBC")
require(RODBC)
db   <- odbcConnect("Hive_DB")
hql  <- "select * from table A"
data <- sqlQuery(db , hql)
kenw <- cor(x = data$a, y = data$b, method="kendall")
sqlSave(db, kenw, tablename = "new_table_of_kendall_coef")


or (if using Linux or Unix) then you could use RHive without needing to use an ODBC name.

Another way to go about it would be to take the functions that do exist in Hive (which you linked to) and calculate Kendall's coefficient yourself with a custom function. As to how to specifically implement that, well you'd probably want to post on Cross Validated (stats.stackexchange.com).

Importing 10^10 rows of data into R just to calculate kendall coefficient is just simply imposible and not smart. That's why I wrote a question here. I am aware on how to specify implementation of kendall coefficient, just simply check the code of cor function in R :) I think the best idea would be to implement user defined function to calculate that kendall coefficient. Do you maybe know where could I upload that function later on for public use? By the way, have you used RHive :) do you recommend any good materials for start? – Marcin Kosiński – 2015-02-28T16:03:05.993

@MarcinKosinski Instead of importing 10^10 rows, why not just sample it? As for UDF's, if you have one or develop one with other people you can put it into a JIRA ticket for addition to Hive via the links in my solution, or become a committer and commit code to the project (or give your code to a committer): https://www.apache.org/dev/committers.html

– Hack-R – 2015-02-28T20:12:19.093

What's the bias of coefficient calculated on 1000 rows-long sample when population has 10^10 observation? – Marcin Kosiński – 2015-03-01T17:08:55.447

@MarcinKosinski I'm not sure, but I pull samples of data that are anywhere from 10,000 - 30,000 rows from SQL Server and Hive on a daily basis. Sampling data in R from Hive and other databases has been a standard approach in my team at 2 different companies where I've worked. I expected that even if you had a Kendall Coefficient function in Hive it would be impractical to run it on all 10^10 rows without sampling, at least if it's something you plan to do more than once. That would be very time- and resource-intensive without buying you much. – Hack-R – 2015-03-02T02:39:41.620

1

I've found a spare.cor functions for sparse matrixes in R that calculates correlation for any dimension for sparse matrix. @Hack-R think about this bias. By the way sampling coefficients should have also something-like bootstrap confidence intervals: see this: http://stats.stackexchange.com/questions/126176/whats-the-bias-of-calculating-the-kendall-coefficient-of-correlation-on-a-sampl

And the sparse.cor function for sparse matrixes is here: http://stackoverflow.com/questions/5888287/running-cor-or-any-variant-over-a-sparse-matrix-in-r

– Marcin Kosiński – 2015-06-29T09:44:16.767