
This was inspired by Efficient online linear regression, which I found very interesting. Are there any texts or resources devoted to large-scale statistical computing, by which I mean computing with datasets too large to fit in main memory, and perhaps too varied to subsample effectively? For example, is it possible to fit mixed-effects models in an online fashion? Has anyone looked into the effects of replacing the standard 2nd-order optimization techniques for MLE with 1st-order, SGD-type techniques?
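
For concreteness, here is a minimal sketch of the kind of 1st-order update I mean: a single SGD pass maximizing the logistic-regression log-likelihood over a data stream (the function name and learning rate are illustrative, not from any particular package):

```python
import numpy as np

def sgd_logistic_mle(stream, n_features, lr=0.01):
    """One pass of SGD on the logistic log-likelihood: each example is
    seen once and discarded, so memory use is O(n_features) no matter
    how large the dataset is."""
    w = np.zeros(n_features)
    for x, y in stream:                     # x: feature array, y in {0, 1}
        p = 1.0 / (1.0 + np.exp(-(w @ x)))  # predicted P(y = 1 | x)
        w += lr * (y - p) * x               # per-example gradient ascent step
    return w
```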

  • I think the answer is "yes". Of course, there is a bit of a definitional issue here: what one person considers "large-scale" can be very different from what another does. My impression is that, e.g., many academic researchers consider the Netflix dataset "large-scale", while in many industrial settings it would be considered puny. As regards estimation techniques, with very large data computational efficiency usually trumps statistical efficiency. For example, the method of moments will in many cases perform (nearly) as well as MLE in these settings and can be much easier to compute. Commented Feb 8, 2011 at 5:25
  • You might also look up the Workshop on Algorithms for Modern Massive Data Sets (MMDS). It's young, but it draws a pretty impressive set of speakers at the interfaces of statistics, engineering, and computer science, as well as between academia and industry. Commented Feb 8, 2011 at 5:27
  • It's only a few decades since most datasets were too large to fit in main memory, and the choice of algorithms used in early statistical programs reflected that. Such programs didn't have facilities for mixed-effects models, though. Commented Feb 8, 2011 at 11:38
  • Are you able to calculate statistics for the data set, say, the sum or averages of data items? Commented Feb 25, 2011 at 8:53
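
Apropos of the last two comments: statistics like running means and moments can be computed in a single pass with O(1) memory, e.g. via Welford's update. A minimal sketch (the function below is illustrative, not from any particular library):

```python
def running_mean_var(stream):
    """Welford's single-pass update: running mean and variance of a
    stream of scalars, using O(1) memory regardless of data size."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n          # update running mean
        m2 += delta * (x - mean)   # accumulate squared deviations
    var = m2 / (n - 1) if n > 1 else float("nan")
    return mean, var
```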

1 Answer


You might look into the Vowpal Wabbit project, from John Langford at Yahoo! Research. It is an online learner that does specialized gradient descent on a few loss functions. VW has some killer features:

  • Installs on Ubuntu trivially, with "sudo apt-get install vowpal-wabbit".
  • Uses the hashing trick for seriously huge feature spaces (a rough sketch follows this list).
  • Feature-specific adaptive weights.
  • Most importantly, there is an active mailing list and community plugging away on the project.
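
To give a flavor of the hashing trick mentioned above, here is a minimal Python sketch; VW's own implementation differs in the details (it uses a different hash function and a configurable number of bits), so treat the function name, `zlib.crc32`, and the bucket count below as illustrative assumptions:

```python
import zlib
import numpy as np

def hashed_vector(tokens, n_buckets=2**18):
    """Hashing trick sketch: hash string features straight to vector
    indices, so no feature dictionary is stored and memory stays fixed
    however many distinct features the stream produces."""
    v = np.zeros(n_buckets)
    for tok in tokens:
        idx = zlib.crc32(tok.encode()) % n_buckets  # deterministic hash
        v[idx] += 1.0                               # collisions are simply tolerated
    return v
```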

The Cesa-Bianchi & Lugosi book Prediction, Learning, and Games gives a solid theoretical foundation for online learning. A heavy read, but worth it!

