
In practice, how does one go about sampling from a big data set (e.g., ~50 million distinct observations) to perform ML using Python? Most non-parametric models (e.g., SVMs, ensemble models) start to strain computing resources with much smaller sets (e.g., 200 features, 200K observations).

How is this done in practice in industry?

Other questions here touch on this but don't ask it explicitly, so this is not a duplicate. Thanks in advance.


1 Answer


This is what I do in my projects:

  • Pre-process the data in the DB / data lake. The aim is to:
  • A. Form batches (this might require a new table with shuffled indices); see the sketch after this list
  • B. Create a copy with normalization and other feature-related transformations applied
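
A minimal sketch of what steps A and B can look like, assuming the data sits in a SQL store; the connection string, the `observations` table, and the `feature_1` / `label` column names are hypothetical placeholders to adapt to your own DB / data lake:

```python
import numpy as np
import pandas as pd
import sqlite3

# Hypothetical connection and table/column names -- adapt to your DB / data lake.
conn = sqlite3.connect("big_data.db")

# A. Form batches: a shuffled index so sequential reads give random batches.
n_rows = int(pd.read_sql("SELECT COUNT(*) AS n FROM observations", conn)["n"][0])
shuffled = np.random.permutation(n_rows) + 1   # SQLite rowids start at 1

# B. Normalization statistics computed once over the full table.
stats = pd.read_sql(
    "SELECT AVG(feature_1) AS mean_1, AVG(feature_1 * feature_1) AS sq_1 "
    "FROM observations", conn
)
mean_1 = stats["mean_1"][0]
std_1 = np.sqrt(stats["sq_1"][0] - mean_1 ** 2)

def read_batch(batch_idx, batch_size=10_000):
    """Read one shuffled batch and apply the precomputed normalization."""
    ids = shuffled[batch_idx * batch_size:(batch_idx + 1) * batch_size]
    placeholders = ",".join("?" * len(ids))
    batch = pd.read_sql(
        f"SELECT * FROM observations WHERE rowid IN ({placeholders})",
        conn, params=[int(i) for i in ids],
    )
    batch["feature_1"] = (batch["feature_1"] - mean_1) / std_1
    return batch

n_batches = n_rows // 10_000
```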

After this, try algorithms that support batch (incremental) learning. Neural networks support it, and so do a few other algorithms (https://sklearn.org/modules/scaling_strategies.html#incremental-learning).
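
As one sketch of incremental learning, scikit-learn's `SGDClassifier` exposes `partial_fit`, so it can consume one batch at a time; `read_batch` and `n_batches` are the hypothetical helpers from the sketch above:

```python
from sklearn.linear_model import SGDClassifier

# Incremental learning: partial_fit consumes one batch at a time, so the full
# 50M rows never need to fit in memory at once.
clf = SGDClassifier(loss="log_loss")      # logistic regression fit with SGD
classes = [0, 1]                          # every class must be declared up front

for i in range(n_batches):                # read_batch / n_batches: hypothetical
    batch = read_batch(i)                 # helpers from the sketch above
    X, y = batch.drop(columns="label"), batch["label"]
    clf.partial_fit(X, y, classes=classes)
```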

With batch learning, you can plot the loss for each batch and see whether the algorithm is actually learning.
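
For example, the same training loop can record a validation loss after every batch and plot it, which makes divergence or a bad learning rate visible early; holding out one batch, as done here, is just one simple way to build a validation set:

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss

# Hold out one batch as a fixed validation set (hypothetical helper from above).
val = read_batch(n_batches - 1)
X_val, y_val = val.drop(columns="label"), val["label"]

clf = SGDClassifier(loss="log_loss")
losses = []
for i in range(n_batches - 1):            # train on the remaining batches
    batch = read_batch(i)
    X, y = batch.drop(columns="label"), batch["label"]
    clf.partial_fit(X, y, classes=[0, 1])
    losses.append(log_loss(y_val, clf.predict_proba(X_val), labels=[0, 1]))

plt.plot(losses)                          # a flat or rising curve signals trouble
plt.xlabel("batch")
plt.ylabel("validation log loss")
plt.show()
```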

