1
$\begingroup$

I want to perform large-scale image similarities detection.

For context, I have a large database containing almost 13,000,000 flats. Every time a new flat is added to the database, I need to check whether it is a duplicate or not. Here are some more details about the problem:

  • Dataset of ~13 million flats.

  • Each flat is associated with interior images (e.g.: photos of rooms).

  • Each image is linked to a unique flat ID.

  • However, some flats are duplicates and images of the same flat appear under different unique flat IDs.

  • Duplicate flats do not necessarily share identical images: this is a near-duplicate detection task.

Technical constrains and set-up:

  • I'm using Python.

  • I have access to AWS services, but main focus here is the machine learning and image similarity approach, rather than infrastructure.

  • The solution must be optimised, given the size of the database.

  • Ideally, there should be some pre-filtering or approximate search on embeddings to avoid computing distances between the new image and every existing one: multi–stage filtering images.

Thanks a lot,

Guillaume

$\endgroup$

1 Answer 1

2
$\begingroup$

Detecting whether two images are pictures of the same flat or pictures of two different flats might be hard. Telling you how to build an entire product is likely beyond the scope of a single answer on Stack Exchange.

If you have a large labelled dataset of images, where it is labelled which flat each image came from, then you could try using contrastive learning (e.g., SimCLR) to learn an encoder/embedding, so that images from the same flat have high cosine similarity. I have no idea how well that would work; you'd have to try it to figure out.

Alternatively, you could use an existing pre-trained model for image embedding, which takes an image as input and produces an embedding vector as output, like CLIP, MetaCLIP, DINOv2, or Nomic Embed Vision. Store the embedding vectors in a vector database, e.g., FAISS, and then search for similar images using nearest-neighbor search. The problem with this approach is that two images from the same flat might not register as similar by these metrics if they were taken from a different location in the flat or from a different angle.

$\endgroup$

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.