8 events
when what by license comment
Jun 29, 2015 at 9:44 comment added Marcin I've found a sparse.cor function for sparse matrices in R that calculates the correlation for a sparse matrix of any dimension. @Hack-R, think about this bias. By the way, coefficients estimated from a sample should also come with something like bootstrap confidence intervals; see stats.stackexchange.com/questions/126176/… The sparse.cor function for sparse matrices is here: stackoverflow.com/questions/5888287/…
Mar 2, 2015 at 2:39 comment added Hack-R @MarcinKosinski I'm not sure, but I pull samples of data that are anywhere from 10,000 - 30,000 rows from SQL Server and Hive on a daily basis. Sampling data in R from Hive and other databases has been a standard approach in my team at 2 different companies where I've worked. I expected that even if you had a Kendall Coefficient function in Hive it would be impractical to run it on all 10^10 rows without sampling, at least if it's something you plan to do more than once. That would be very time- and resource-intensive without buying you much.
Mar 1, 2015 at 17:08 comment added Marcin What's the bias of a coefficient calculated on a 1000-row sample when the population has 10^10 observations?
Feb 28, 2015 at 20:12 comment added Hack-R @MarcinKosinski Instead of importing 10^10 rows, why not just sample it? As for UDFs, if you have one or develop one with other people, you can put it into a JIRA ticket for addition to Hive via the links in my solution, or become a committer and commit code to the project (or give your code to a committer): apache.org/dev/committers.html
Feb 28, 2015 at 16:03 comment added Marcin Importing 10^10 rows of data into R just to calculate the Kendall coefficient is simply impossible and not smart. That's why I wrote a question here. I know how to implement the Kendall coefficient; just check the code of the cor function in R :) I think the best idea would be to implement a user-defined function to calculate the Kendall coefficient. Do you know where I could upload that function later for public use? By the way, have you used RHive? :) Can you recommend any good materials for getting started?
Feb 28, 2015 at 15:59 vote accept Marcin
Feb 27, 2015 at 20:00 history edited Hack-R CC BY-SA 3.0
added 121 characters in body
Feb 27, 2015 at 19:38 history answered Hack-R CC BY-SA 3.0
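The sampling approach debated in the comments above (estimate the Kendall coefficient on a random sample, then attach a bootstrap confidence interval to show how uncertain that estimate is) can be sketched roughly as follows. This is only an illustration under stated assumptions: it is written in Python rather than R or a Hive UDF, the function names `kendall_tau_b` and `bootstrap_tau_ci` are hypothetical, the data is synthetic and stands in for rows that would really be sampled from Hive, and the pair counting is the naive O(n^2) version, which is only reasonable for small samples.

```python
import math
import random


def kendall_tau_b(x, y):
    """Kendall tau-b by direct O(n^2) pair counting (small samples only)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    score = 0      # concordant pairs minus discordant pairs
    untied_x = 0   # number of pairs not tied in x
    untied_y = 0   # number of pairs not tied in y
    for i in range(n - 1):
        for j in range(i + 1, n):
            sx = (x[i] > x[j]) - (x[i] < x[j])  # sign of x[i] - x[j]
            sy = (y[i] > y[j]) - (y[i] < y[j])  # sign of y[i] - y[j]
            untied_x += sx != 0
            untied_y += sy != 0
            score += sx * sy
    if untied_x == 0 or untied_y == 0:
        raise ValueError("tau-b is undefined when one variable is constant")
    return score / math.sqrt(untied_x * untied_y)


def bootstrap_tau_ci(x, y, n_boot=100, alpha=0.05, seed=0):
    """Percentile bootstrap interval for tau-b (resampling rows with replacement)."""
    rng = random.Random(seed)
    n = len(x)
    taus = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        taus.append(kendall_tau_b([x[i] for i in idx], [y[i] for i in idx]))
    taus.sort()
    lower = taus[int((alpha / 2) * n_boot)]
    upper = taus[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


if __name__ == "__main__":
    rng = random.Random(42)
    # Synthetic "population" standing in for the table in Hive (assumption).
    pop_x = [rng.gauss(0, 1) for _ in range(20000)]
    pop_y = [vx + rng.gauss(0, 1) for vx in pop_x]
    # Draw a simple random sample of rows, as suggested in the comments.
    rows = rng.sample(range(len(pop_x)), 150)
    sx = [pop_x[i] for i in rows]
    sy = [pop_y[i] for i in rows]
    print("sample tau-b:", kendall_tau_b(sx, sy))
    print("95% bootstrap CI:", bootstrap_tau_ci(sx, sy))
```

Tau-b is used here instead of tau-a because bootstrap resamples repeat rows, which creates ties. The width of the interval gives a rough sense of how much a sample-based coefficient can move around, which is the concern raised about estimating from 1000 rows out of 10^10.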