Clustering multiple csv files that cannot fit in RAM

Question

I have multiple csv files, each of which has at least 200MB of data across 12 columns. Each csv file can fall into one of 4 categories or labels. I am trying to see which cluster each of these files fits into. I do not have any code yet, but here is my pseudo-code.

 for file in my_list:
     read data from file
     find the cluster in which this file falls into using k-means or other algorithms
     tag file with the cluster number
 end-for

The result of the clustering would look something like this:

file-1 = cluster-1
file-2 = cluster-2
file-3 = cluster-1
.
.
.
file-n = cluster-4

and so on

Ideally, I would stream the data, for example x,000 or more rows per stream. scikit-learn typically does not handle data that is streamed. Are there other libraries that can take streamed data and achieve what I am looking for?

Many thanks for your suggestions.

How do you know there are 4 clusters? Is it possible that one file includes data from different clusters? HINT: Decide/understand now whether your problem is classification or clustering and then follow the correct terminology, otherwise you will get confused later. As soon as you answer these questions I will write the answer for you.
– Kasra Manshaei, Commented Feb 22, 2022 at 11:29
I don't know if there are 4 or n clusters. 4 was just a random pick in my problem statement. The problem I am trying to solve is to find the clusters. I am not looking to classify my data.
– python_beginner, Commented Feb 22, 2022 at 16:36

Erwan · Accepted Answer · 2022-02-22 20:49:31Z

The approach described in the pseudo-code is flawed for what you want to achieve, because you would be running k-means on every individual file. This means that:

The instances in every file are clustered into K groups. Note that this is done based only on the instances in the current file, so even if all the instances in this file are very similar, K-means will still split them into K clusters.
The clusters 1,2,..,K obtained as a result are independent from one file to the other. For example, cluster 1 in file 1 may correspond to clusters 3 and 4 in file 2 and to no cluster at all in file 3, etc.

You're not going to obtain the kind of output that you expect in this way.

It's important to understand what a clustering algorithm does: it separates the instances of a set by comparing them with each other. Since the clustering algorithm cannot guess which instances are going to be provided later or how they compare to the current ones, feeding it the data one stream at a time does not fit a standard clustering algorithm such as K-means.

However, what is possible is to obtain a clustering model (the centroids, in the case of K-means) from some subset of the data and then apply this model to the rest of the data. The actual clustering (the step where K-means discovers the centroids) takes place only once; after that, new instances are simply assigned to the closest centroid.
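The fit-once-then-assign idea can be sketched as follows. This is only a minimal illustration under assumptions that are not in the question: it uses scikit-learn's `KMeans`, the helper names `fit_centroids` and `tag_file` are made up for the example, the data is synthetic, and a file is tagged with the cluster that most of its rows fall into.

```python
import numpy as np
from sklearn.cluster import KMeans


def fit_centroids(sample, n_clusters, seed=0):
    """Run k-means once, on a sample small enough to fit in RAM."""
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(sample)


def tag_file(model, chunks):
    """Assign every row to its closest centroid, one chunk at a time,
    and return the cluster that most rows of the file fall into."""
    votes = np.zeros(model.n_clusters, dtype=np.int64)
    for chunk in chunks:
        labels = model.predict(chunk)
        votes += np.bincount(labels, minlength=model.n_clusters)
    return int(votes.argmax())


# Synthetic stand-in for the real data (an assumption for this sketch):
# two "files" with 12 columns, each drawn around its own centre.
rng = np.random.default_rng(0)
file_a = rng.normal(loc=0.0, scale=0.5, size=(1000, 12))
file_b = rng.normal(loc=5.0, scale=0.5, size=(1000, 12))

# Fit the centroids on a small subset taken from both files...
sample = np.vstack([file_a[:100], file_b[:100]])
model = fit_centroids(sample, n_clusters=2)

# ...then pass each file through the fitted model in chunks of 200 rows.
tag_a = tag_file(model, np.array_split(file_a, 5))
tag_b = tag_file(model, np.array_split(file_b, 5))
```

With real files, the chunks could come from something like `pandas.read_csv(path, chunksize=...)`, so that only one chunk is in memory at a time. The majority vote is a design choice of this sketch; if the rows of one file spread over several clusters, the vote counts themselves are probably the more honest output.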

Note: I assume that every file contains multiple instances. In this scenario it's also strange that all the instances in one file belong to the same cluster. In this case I guess that you could use a random sample from every file to obtain the same result.

Thank you, Erwan. I agree with everything you say. I was aware that running K-means on each file will cluster that file's data. But I am unable to combine all files as I will run out of RAM. Is there a lazy-loading K-means clustering library?
– python_beginner, Commented Feb 23, 2022 at 21:57
@python_beginner I don't know any. But if you are really sure from the start that all the instances in one file belong to the same cluster, you don't need to use all of them; you can just load a subset from each file. It's a strange case, because in theory every file would then be a single instance.
– Erwan, Commented Feb 24, 2022 at 10:47
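Loading a subset from each file does not require reading the whole file into memory. Below is a minimal sketch using reservoir sampling with the standard library only; the function name `sample_rows` is hypothetical, and it assumes that every column is numeric and that the file starts with a header row.

```python
import csv
import random


def sample_rows(path, k, seed=0, has_header=True):
    """Return up to k rows drawn uniformly at random from a CSV file,
    reading it row by row (reservoir sampling)."""
    rng = random.Random(seed)
    reservoir = []
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        if has_header:
            next(reader, None)  # skip the header row
        for i, row in enumerate(reader):
            if i < k:
                # Fill the reservoir with the first k rows.
                reservoir.append(row)
            else:
                # Keep row i with probability k / (i + 1).
                j = rng.randint(0, i)  # inclusive on both ends
                if j < k:
                    reservoir[j] = row
    # Assumption: all columns are numeric.
    return [[float(value) for value in row] for row in reservoir]
```

At most `k` rows are held in memory at any time, so the samples from all files can be stacked into one dataset that fits in RAM and clustered together.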

Jamie · Accepted Answer · 2024-09-02 15:18:48Z

One solution is to reduce the size of the input data by using Principal Component Analysis (PCA), or a similar technique, to reduce its dimensionality.

You would loop through your input files one-by-one, run a PCA (for example) and save the resulting output. Then you can conduct k-means on your new reduced dataset.

This is a common technique in NLP given the large and sparse nature of input data in that field.
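A hedged sketch of how this could be done out of core, assuming scikit-learn's `IncrementalPCA` (not named in the answer) and synthetic data. One caveat: a PCA fitted separately on each file produces different axes per file, so the reduced outputs would not be comparable. The sketch therefore fits a single shared projection chunk by chunk and then applies it to every chunk. The helper name `fit_shared_pca` is made up for the example.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA


def fit_shared_pca(chunks, n_components):
    """Fit one PCA over all chunks without holding them in memory together."""
    ipca = IncrementalPCA(n_components=n_components)
    for chunk in chunks:
        ipca.partial_fit(chunk)  # each chunk needs at least n_components rows
    return ipca


# Synthetic stand-in for the 12-column files (an assumption for this sketch).
rng = np.random.default_rng(0)
data = rng.normal(size=(3000, 12))
chunks = np.array_split(data, 6)

# First pass: learn the projection. Second pass: apply it chunk by chunk.
ipca = fit_shared_pca(chunks, n_components=3)
reduced = np.vstack([ipca.transform(chunk) for chunk in chunks])
```

Note that PCA reduces the number of columns, not the number of rows, so with only 12 columns the memory saving may be modest.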
