
I have an embedding matrix of size (100000, 100). I want to compute all the pairwise cosine distances in the matrix. I've tried the sklearn.metrics.pairwise.cosine_distances function, but it crashes after running out of RAM. I also tried to do the calculation in batches, like so:

    import numpy as np
    from tqdm import tqdm
    from sklearn.metrics.pairwise import cosine_distances

    embeddings = embeddings.astype(np.float32)  # astype returns a copy, so assign it back
    distances_matrix = []
    batch_size = 1000
    df_size = len(embeddings)
    for i in tqdm(range(0, df_size, batch_size)):
        end = min(i + batch_size, df_size)
        batch = embeddings[i:end]
        batch_distances = cosine_distances(batch, embeddings)
        distances_matrix.append(batch_distances)  # keeps every batch in memory

but it also crashes after about 11 iterations.

Any suggestions on how to approach this? Thanks.

1 Answer


Assuming you are working with Hugging Face and already using the library for embeddings, you should take their recommended approach and use FAISS.
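A minimal sketch of that route, assuming `embeddings` is the (100000, 100) float32 NumPy array from the question: cosine similarity equals the inner product of L2-normalized vectors, so an IndexFlatIP over normalized rows yields cosine similarities, and 1 - similarity is the cosine distance. Searching only for the top-k neighbors per row keeps memory bounded instead of materializing the full 100000 x 100000 matrix.

    import faiss
    import numpy as np

    # Assumes `embeddings` is a (100000, 100) float32 NumPy array.
    embeddings = np.ascontiguousarray(embeddings, dtype=np.float32)
    faiss.normalize_L2(embeddings)                   # in-place L2 normalization
    index = faiss.IndexFlatIP(embeddings.shape[1])   # inner product == cosine on unit vectors
    index.add(embeddings)

    k = 100  # illustrative choice: how many neighbors to keep per row
    similarities, neighbor_ids = index.search(embeddings, k)
    distances = 1.0 - similarities                   # cosine distance = 1 - cosine similarity
    # Note: the first neighbor of each row is typically the row itself (distance 0).

The value k = 100 is purely illustrative; raise it if you need more neighbors per row, but keeping k much smaller than 100000 is what makes this fit in memory.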

If not, you could also try:

  • Making your embeddings sparse. sklearn's cosine similarity computations are much more efficient on sparse matrices.
  • Computing in batches with simpler functions such as scipy's cosine distance, writing each batch out to disk instead of keeping it in memory (see the sketch after this list).
  • effcosim, a wrapper library someone has written for efficient pairwise cosine similarity in sklearn; it might also be worth a shot.
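
If you truly need the full matrix, note that 100000 x 100000 float32 distances take roughly 40 GB, so appending the batches to an in-memory list will exhaust RAM regardless of batch size. One sketch of a way around that, keeping the batched sklearn computation from the question but writing each batch into a disk-backed np.memmap (the filename distances.f32 is just an example):

    import numpy as np
    from tqdm import tqdm
    from sklearn.metrics.pairwise import cosine_distances

    # Assumes `embeddings` is a (100000, 100) float32 NumPy array.
    n = len(embeddings)
    batch_size = 1000
    # ~40 GB file (n * n * 4 bytes); the OS pages it in and out as needed.
    distances = np.memmap("distances.f32", dtype=np.float32, mode="w+", shape=(n, n))

    for i in tqdm(range(0, n, batch_size)):
        end = min(i + batch_size, n)
        distances[i:end] = cosine_distances(embeddings[i:end], embeddings)

    distances.flush()  # ensure everything is written to disk

Each batch is discarded after it is written, so peak RAM stays around one batch (1000 x 100000 floats) rather than the whole matrix.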