
I have an embedding matrix of size (100000, 100). I want to compute all the pairwise cosine distances in the matrix. I've tried the sklearn.metrics.pairwise.cosine_distances function, but it crashes after running out of RAM. I also tried to do the calculation in batches, like so:

    import numpy as np
    from tqdm import tqdm
    from sklearn.metrics.pairwise import cosine_distances

    embeddings = embeddings.astype(np.float32)  # astype returns a copy, so assign it back
    distances_matrix = []
    batch_size = 1000
    df_size = len(embeddings)
    for i in tqdm(range(0, df_size, batch_size)):
        end = min(i + batch_size, df_size)
        batch = embeddings[i:end]
        batch_distances = cosine_distances(batch, embeddings)
        distances_matrix.append(batch_distances)  # keeps every batch in memory

but it also crashes after about 11 iterations.

Any suggestions on how to approach this? Thanks.

1 Answer


Assuming you are working with Hugging Face and already using the library for embeddings, you should take their recommended approach and use FAISS.
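A minimal sketch of that route, assuming `embeddings` is the (100000, 100) float32 NumPy array from the question: cosine similarity equals the inner product of L2-normalized vectors, so an IndexFlatIP over normalized rows yields cosine similarities, and 1 - similarity is the cosine distance. Searching only for the top-k neighbors per row keeps memory bounded instead of materializing the full 100000 x 100000 matrix.

    import faiss
    import numpy as np

    # Assumes `embeddings` is a (100000, 100) float32 NumPy array.
    embeddings = np.ascontiguousarray(embeddings, dtype=np.float32)
    faiss.normalize_L2(embeddings)                   # in-place L2 normalization
    index = faiss.IndexFlatIP(embeddings.shape[1])   # inner product == cosine on unit vectors
    index.add(embeddings)

    k = 100  # illustrative choice: how many neighbors to keep per row
    similarities, neighbor_ids = index.search(embeddings, k)
    distances = 1.0 - similarities                   # cosine distance = 1 - cosine similarity
    # Note: the first neighbor of each row is typically the row itself (distance 0).

The value k = 100 is purely illustrative; raise it if you need more neighbors per row, but keeping k much smaller than 100000 is what makes this fit in memory.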

If not, you could also try:

  • Making your embeddings sparse. sklearn's cosine similarity computations are much more efficient on sparse matrices.
  • Computing in batches with simpler functions such as scipy's cosine distance, writing each batch out to disk instead of keeping it in memory (see the sketch after this list).
  • effcosim, a wrapper library someone has written for efficient pairwise cosine similarity in sklearn; it might also be worth a shot.
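
If you truly need the full matrix, note that 100000 x 100000 float32 distances take roughly 40 GB, so appending the batches to an in-memory list will exhaust RAM regardless of batch size. One sketch of a way around that, keeping the batched sklearn computation from the question but writing each batch into a disk-backed np.memmap (the filename distances.f32 is just an example):

    import numpy as np
    from tqdm import tqdm
    from sklearn.metrics.pairwise import cosine_distances

    # Assumes `embeddings` is a (100000, 100) float32 NumPy array.
    n = len(embeddings)
    batch_size = 1000
    # ~40 GB file (n * n * 4 bytes); the OS pages it in and out as needed.
    distances = np.memmap("distances.f32", dtype=np.float32, mode="w+", shape=(n, n))

    for i in tqdm(range(0, n, batch_size)):
        end = min(i + batch_size, n)
        distances[i:end] = cosine_distances(embeddings[i:end], embeddings)

    distances.flush()  # ensure everything is written to disk

Each batch is discarded after it is written, so peak RAM stays around one batch (1000 x 100000 floats) rather than the whole matrix.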