wabbit
I tend to agree with what @Dan Levin has already said. Ultimately, since we want to draw useful insights from the data rather than just store it, it is the capability of learning algorithms/systems that should determine what counts as "Big data". As ML systems evolve, what is Big data today will no longer be Big data tomorrow.

One way of defining Big data could be:

  • Big data: data on which you can't build ML models in a reasonable time (1-2 hours) on a typical workstation (with, say, 4 GB of RAM)
  • Non-Big data: the complement of the above
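The definition above can be sketched as a back-of-the-envelope check. This is a hypothetical illustration, not a tool from the answer: the 2-hour and 4 GB thresholds come from the definition, while `rows_per_second` is an assumed throughput estimate you would have to measure for your own learner.

```python
# Rough sketch of the definition above: a data set is "Big data" for a
# given setup if a single training run blows the time or memory budget.
# The thresholds (2 hours, 4 GB) come from the definition in the text;
# the throughput-based estimate is a hypothetical illustration.

def is_big_data(n_rows, bytes_per_row, rows_per_second,
                ram_bytes=4 * 1024**3, budget_seconds=2 * 3600):
    """Return True if the data set is 'Big data' under the definition above."""
    # A single row must fit in RAM for out-of-core learners to work at all.
    if bytes_per_row > ram_bytes:
        return True
    # Otherwise the question is purely one of training time.
    estimated_seconds = n_rows / rows_per_second
    return estimated_seconds > budget_seconds

# 10 million rows at ~50k rows/s trains in ~200 s: Non-Big data.
print(is_big_data(10_000_000, 400, 50_000))       # False
# 10 billion rows at the same rate needs ~55 hours: Big data.
print(is_big_data(10_000_000_000, 400, 50_000))   # True
```

The point of the sketch is that the boundary is relative to your hardware and learner speed, which is exactly why the definition shifts over time.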

Assuming this definition, as long as the memory occupied by an individual row (all variables for a single data point) does not exceed the machine's RAM, we are in the Non-Big data regime.

Note: Vowpal Wabbit (by far the fastest ML system as of today) can learn on any data set as long as an individual row (data point) fits in RAM (say 4 GB). The number of rows is not a limitation because it uses SGD across multiple cores. Speaking from experience, you can train a model with 10k features and 10 million rows on a laptop in a day.
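Vowpal Wabbit itself is a command-line tool; the sketch below is not its API. It is a minimal pure-Python illustration of why the number of rows is not the limitation for SGD: each update touches exactly one row, so memory usage is bounded by the model plus a single row, regardless of how many rows stream past. The synthetic data generator and learning rate are assumptions for the demo.

```python
# Minimal sketch (NOT Vowpal Wabbit's API) of streaming SGD for logistic
# regression: the full data set never needs to exist in memory, only the
# weight vector and the current row.
import math
import random

def sgd_logistic(row_stream, n_features, lr=0.1):
    """Train logistic regression by SGD over a stream of (x, y) rows."""
    w = [0.0] * n_features              # only the model lives in memory
    for x, y in row_stream:             # rows arrive one at a time
        z = sum(wi * xi for wi, xi in zip(w, x))
        p = 1.0 / (1.0 + math.exp(-z))  # predicted probability
        g = p - y                       # gradient of log-loss w.r.t. z
        for i in range(n_features):
            w[i] -= lr * g * x[i]
    return w

def synthetic_rows(n_rows, seed=0):
    """Generate rows on the fly; the data set is never materialized."""
    rng = random.Random(seed)
    for _ in range(n_rows):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        y = 1 if x[0] + x[1] > 0 else 0     # linearly separable target
        yield x, y

w = sgd_logistic(synthetic_rows(50_000), n_features=2)
print(w[0] > 0 and w[1] > 0)   # the weights learn the positive direction
```

Because the generator yields one row at a time, scaling `n_rows` up by orders of magnitude changes only the running time, not the memory footprint, which is the property the note above relies on.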
