I have multiple CSV files, each with at least 200 MB of data across 12 columns. Each file can fall into one of 4 categories (labels). I am trying to determine which cluster each file belongs to. I do not have any code yet, but here is my pseudo-code.
    for file in my_list:
        read data from file
        find the cluster this file falls into, using k-means or another algorithm
        tag file with the cluster number
    end-for

The result of the clustering would look something like this:
- file-1 = cluster-1
- file-2 = cluster-2
- file-3 = cluster-1
- ...
- file-n = cluster-4
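To make the tagging step concrete, here is a minimal sketch of what I have in mind. It assumes each file has already been reduced to one fixed-length numeric vector, and it uses a plain k-means written with the standard library only. The names `kmeans` and `file_features`, and the feature values, are made up for illustration and are not from my real data.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    # Assumption: the first k points serve as initial centroids.
    # This is deterministic but not robust; real use needs better seeding.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical per-file feature vectors (illustrative values only).
file_features = {
    "file-1": [0.1, 0.2],
    "file-2": [5.0, 5.1],
    "file-3": [9.9, 0.0],
    "file-4": [0.0, 9.8],
    "file-5": [0.2, 0.1],
    "file-6": [5.1, 5.0],
}
names = list(file_features)
labels = kmeans([file_features[n] for n in names], k=4)

# Tag each file with a 1-based cluster number.
tags = {name: f"cluster-{label + 1}" for name, label in zip(names, labels)}
for name, tag in tags.items():
    print(f"{name} = {tag}")
```

The cluster numbers themselves are arbitrary; what matters is which files end up sharing a number.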
Ideally, I would stream the data in chunks, e.g. x,000 or more rows at a time. As far as I know, scikit-learn does not typically handle streamed data. Are there other libraries that can take streamed data and achieve what I am looking for?
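For the streaming part, this is roughly the kind of chunked reading I mean: a sketch using only the standard library that keeps at most `chunk_rows` rows in memory and reduces a file to one vector of column means. The function name `stream_column_means`, the header row, and the all-numeric columns are assumptions, and I am not sure column means are the right summary for my data.

```python
import csv
import io
from itertools import islice

def stream_column_means(handle, chunk_rows=1000):
    """Read a CSV in chunks of rows and return the mean of each column."""
    reader = csv.reader(handle)
    header = next(reader)  # assumption: the first row is a header
    sums = [0.0] * len(header)
    count = 0
    while True:
        # Pull at most chunk_rows rows; only this chunk is held in memory.
        chunk = list(islice(reader, chunk_rows))
        if not chunk:
            break
        for row in chunk:
            for i, value in enumerate(row):
                sums[i] += float(value)  # assumption: every column is numeric
        count += len(chunk)
    # If the file has no data rows, return the zero-filled sums unchanged.
    return [s / count for s in sums] if count else sums

# Tiny in-memory stand-in for a real file.
sample = io.StringIO("a,b\n1,10\n2,20\n3,30\n")
means = stream_column_means(sample, chunk_rows=2)
print(means)
```

For a real file, the handle would come from `open(path, newline="")` instead of `io.StringIO`.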
Many thanks for your suggestions.