Timeline for Hive: How to calculate the Kendall coefficient of correlation of a pair of numeric columns in the group?
Current License: CC BY-SA 3.0
8 events
| when | what | by | license | comment | |
|---|---|---|---|---|---|
| Jun 29, 2015 at 9:44 | comment | added | Marcin | I've found a sparse.cor function for sparse matrices in R that calculates correlations for a sparse matrix of any dimension. @Hack-R, think about this bias. By the way, coefficients computed on a sample should also come with something like bootstrap confidence intervals; see stats.stackexchange.com/questions/126176/… The sparse.cor function for sparse matrices is here: stackoverflow.com/questions/5888287/… | |
| Mar 2, 2015 at 2:39 | comment | added | Hack-R | @MarcinKosinski I'm not sure, but I pull samples of data that are anywhere from 10,000 - 30,000 rows from SQL Server and Hive on a daily basis. Sampling data in R from Hive and other databases has been a standard approach in my team at 2 different companies where I've worked. I expected that even if you had a Kendall Coefficient function in Hive it would be impractical to run it on all 10^10 rows without sampling, at least if it's something you plan to do more than once. That would be very time- and resource-intensive without buying you much. | |
| Mar 1, 2015 at 17:08 | comment | added | Marcin | What's the bias of a coefficient calculated on a 1000-row sample when the population has 10^10 observations? | |
| Feb 28, 2015 at 20:12 | comment | added | Hack-R | @MarcinKosinski Instead of importing 10^10 rows, why not just sample it? As for UDFs, if you have one or develop one with other people, you can put it into a JIRA ticket for addition to Hive via the links in my solution, or become a committer and commit code to the project (or give your code to a committer): apache.org/dev/committers.html | |
| Feb 28, 2015 at 16:03 | comment | added | Marcin | Importing 10^10 rows of data into R just to calculate the Kendall coefficient is simply impossible and not smart. That's why I asked the question here. I know how the Kendall coefficient is implemented; just check the code of the cor function in R :) I think the best idea would be to implement a user-defined function to calculate the Kendall coefficient. Do you know where I could upload that function later on for public use? By the way, have you used RHive? :) Can you recommend any good materials for getting started? | |
| Feb 28, 2015 at 15:59 | vote | accept | Marcin | ||
| Feb 27, 2015 at 20:00 | history | edited | Hack-R | CC BY-SA 3.0 | added 121 characters in body |
| Feb 27, 2015 at 19:38 | history | answered | Hack-R | CC BY-SA 3.0 |
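
The sampling approach debated in the comments above (estimate the Kendall coefficient on a sample and attach a bootstrap confidence interval) can be sketched roughly as follows. This is only an illustration in plain Python, not the answer's code and not anything provided by Hive or R: the function names `kendall_tau_a` and `bootstrap_ci`, the sample size, and the synthetic data are all assumptions made for the example. It uses the simple O(n^2) tau-a definition, so it is only practical on a sample, not on 10^10 rows.

```python
import random


def kendall_tau_a(xs, ys):
    """Kendall tau-a: (concordant - discordant) / (n * (n - 1) / 2).

    O(n^2) pairwise definition; tied pairs count as neither
    concordant nor discordant. (Hypothetical helper for illustration.)
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    score = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            prod = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if prod > 0:
                score += 1      # concordant pair
            elif prod < 0:
                score -= 1      # discordant pair
    return score / (n * (n - 1) / 2)


def bootstrap_ci(xs, ys, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap interval for tau-a on one sample.

    Rows are resampled with replacement, so duplicated rows become ties
    and pull each resampled tau-a slightly toward zero. (Hypothetical
    helper; the percentile method is an assumption, not the thread's.)
    """
    rng = random.Random(seed)
    n = len(xs)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(kendall_tau_a([xs[i] for i in idx], [ys[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


if __name__ == "__main__":
    # Synthetic stand-in for rows sampled from a large table (assumption).
    rng = random.Random(42)
    x = [rng.gauss(0, 1) for _ in range(200)]
    y = [xi + rng.gauss(0, 1) for xi in x]
    print("tau-a on sample:", kendall_tau_a(x, y))
    print("95% bootstrap interval:", bootstrap_ci(x, y, n_boot=100))
```

The width of the interval gives a rough sense of how much the estimate from a sample of a given size can be trusted, which is the concern raised in the comment about a 1000-row sample.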