Timeline for Hive: How to calculate the Kendall coefficient of correlation of a pair of numeric columns in the group?

Current License: CC BY-SA 3.0

8 events

when	what		by	license	comment
Jun 29, 2015 at 9:44	comment	added	Marcin		I've found a sparse.cor function for sparse matrices in R that calculates correlations for a sparse matrix of any dimension. @Hack-R think about this bias. By the way, sample coefficients should also have something like bootstrap confidence intervals; see stats.stackexchange.com/questions/126176/… The sparse.cor function for sparse matrices is here: stackoverflow.com/questions/5888287/…
Mar 2, 2015 at 2:39	comment	added	Hack-R		@MarcinKosinski I'm not sure, but I pull samples of data that are anywhere from 10,000 - 30,000 rows from SQL Server and Hive on a daily basis. Sampling data in R from Hive and other databases has been a standard approach in my team at 2 different companies where I've worked. I expected that even if you had a Kendall Coefficient function in Hive it would be impractical to run it on all 10^10 rows without sampling, at least if it's something you plan to do more than once. That would be very time- and resource-intensive without buying you much.
Mar 1, 2015 at 17:08	comment	added	Marcin		What's the bias of a coefficient calculated on a 1000-row sample when the population has 10^10 observations?
Feb 28, 2015 at 20:12	comment	added	Hack-R		@MarcinKosinski Instead of importing 10^10 rows, why not just sample them? As for UDFs, if you have one or develop one with other people, you can put it into a JIRA ticket for addition to Hive via the links in my solution, or become a committer and commit code to the project (or give your code to a committer): apache.org/dev/committers.html
Feb 28, 2015 at 16:03	comment	added	Marcin		Importing 10^10 rows of data into R just to calculate the Kendall coefficient is simply impossible and not smart. That's why I wrote a question here. I am aware of how to implement the Kendall coefficient; just check the code of the `cor` function in R :) I think the best idea would be to implement a user-defined function to calculate the Kendall coefficient. Do you know where I could upload that function later for public use? By the way, have you used RHive? :) Can you recommend any good materials to start with?
Feb 28, 2015 at 15:59	vote	accept	Marcin
Feb 27, 2015 at 20:00	history	edited	Hack-R	CC BY-SA 3.0	added 121 characters in body
Feb 27, 2015 at 19:38	history	answered	Hack-R	CC BY-SA 3.0
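
The comments above weigh computing the Kendall coefficient on a sample against running it over all 10^10 rows, and mention bootstrap confidence intervals as a way to gauge the sampling error. A minimal Python sketch of that idea follows. It assumes the sample has already been pulled from Hive into two equal-length lists; the function names `kendall_tau` and `bootstrap_ci` are hypothetical and not part of Hive, R, or RHive. The pairwise loop is O(n^2), so it is only practical for samples of a few thousand rows.

```python
import random


def kendall_tau(xs, ys):
    """Kendall's tau-b for paired observations, by direct pair comparison."""
    n = len(xs)
    concordant = discordant = tied_x_only = tied_y_only = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx = xs[i] - xs[j]
            dy = ys[i] - ys[j]
            if dx == 0 and dy == 0:
                continue  # tied in both columns: contributes to neither factor
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    # tau-b denominator: sqrt(pairs not tied in x * pairs not tied in y)
    not_tied_x = concordant + discordant + tied_y_only
    not_tied_y = concordant + discordant + tied_x_only
    denom = (not_tied_x * not_tied_y) ** 0.5
    return (concordant - discordant) / denom if denom else float("nan")


def bootstrap_ci(xs, ys, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap interval for tau, resampling rows with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(kendall_tau([xs[i] for i in idx], [ys[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


if __name__ == "__main__":
    # Synthetic stand-in for a sample drawn from a much larger table.
    rng = random.Random(42)
    x = [rng.gauss(0, 1) for _ in range(300)]
    y = [xi + rng.gauss(0, 1) for xi in x]
    print("tau on sample:", kendall_tau(x, y))
    print("95% bootstrap interval:", bootstrap_ci(x, y))
```

The interval width gives a rough answer to the question about how far a sample estimate may sit from the population value. It says nothing about bias introduced by a non-random sampling scheme.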