
In practice, how does one go about sampling from a big data set (e.g., ~50 million distinct observations) to perform ML using Python? Most non-parametric models (e.g., SVMs, ensemble models) start to strain computing resources with much smaller sets (e.g., 200 features, 200K observations).

How is this done in practice in industry?

Other questions here touch on this but don't ask it explicitly, so this is not a duplicate. Thanks in advance.


1 Answer


This is what I do in my projects:

  • Pre-process the data in the DB / data lake. The aim is to:
  • A. Form batches (this might require a new table with shuffled indices); see the sketch after this list
  • B. Create a copy with normalization and other feature-related transformations applied
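
A minimal sketch of what steps A and B can look like, assuming the data sits in a SQL store; the connection string, the `observations` table, and the `feature_1` / `label` column names are hypothetical placeholders to adapt to your own DB / data lake:

```python
import numpy as np
import pandas as pd
import sqlite3

# Hypothetical connection and table/column names -- adapt to your DB / data lake.
conn = sqlite3.connect("big_data.db")

# A. Form batches: a shuffled index so sequential reads give random batches.
n_rows = int(pd.read_sql("SELECT COUNT(*) AS n FROM observations", conn)["n"][0])
shuffled = np.random.permutation(n_rows) + 1   # SQLite rowids start at 1

# B. Normalization statistics computed once over the full table.
stats = pd.read_sql(
    "SELECT AVG(feature_1) AS mean_1, AVG(feature_1 * feature_1) AS sq_1 "
    "FROM observations", conn
)
mean_1 = stats["mean_1"][0]
std_1 = np.sqrt(stats["sq_1"][0] - mean_1 ** 2)

def read_batch(batch_idx, batch_size=10_000):
    """Read one shuffled batch and apply the precomputed normalization."""
    ids = shuffled[batch_idx * batch_size:(batch_idx + 1) * batch_size]
    placeholders = ",".join("?" * len(ids))
    batch = pd.read_sql(
        f"SELECT * FROM observations WHERE rowid IN ({placeholders})",
        conn, params=[int(i) for i in ids],
    )
    batch["feature_1"] = (batch["feature_1"] - mean_1) / std_1
    return batch

n_batches = n_rows // 10_000
```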

After this, try algorithms that support batch (incremental) learning. Neural networks support it, and so do a few other algorithms (https://sklearn.org/modules/scaling_strategies.html#incremental-learning).
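
As one sketch of incremental learning, scikit-learn's `SGDClassifier` exposes `partial_fit`, so it can consume one batch at a time; `read_batch` and `n_batches` are the hypothetical helpers from the sketch above:

```python
from sklearn.linear_model import SGDClassifier

# Incremental learning: partial_fit consumes one batch at a time, so the full
# 50M rows never need to fit in memory at once.
clf = SGDClassifier(loss="log_loss")      # logistic regression fit with SGD
classes = [0, 1]                          # every class must be declared up front

for i in range(n_batches):                # read_batch / n_batches: hypothetical
    batch = read_batch(i)                 # helpers from the sketch above
    X, y = batch.drop(columns="label"), batch["label"]
    clf.partial_fit(X, y, classes=classes)
```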

With batch learning, you can plot the loss for each batch and see whether the algorithm is actually learning.
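
For example, the same training loop can record a validation loss after every batch and plot it, which makes divergence or a bad learning rate visible early; holding out one batch, as done here, is just one simple way to build a validation set:

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss

# Hold out one batch as a fixed validation set (hypothetical helper from above).
val = read_batch(n_batches - 1)
X_val, y_val = val.drop(columns="label"), val["label"]

clf = SGDClassifier(loss="log_loss")
losses = []
for i in range(n_batches - 1):            # train on the remaining batches
    batch = read_batch(i)
    X, y = batch.drop(columns="label"), batch["label"]
    clf.partial_fit(X, y, classes=[0, 1])
    losses.append(log_loss(y_val, clf.predict_proba(X_val), labels=[0, 1]))

plt.plot(losses)                          # a flat or rising curve signals trouble
plt.xlabel("batch")
plt.ylabel("validation log loss")
plt.show()
```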

