I want to perform large-scale image similarity detection.
For context, I have a large database containing almost 13,000,000 flats. Every time a new flat is added to the database, I need to check whether it is a duplicate or not. Here are some more details about the problem:
Dataset of ~13 million flats.
Each flat is associated with interior images (e.g. photos of rooms).
Each image is linked to a unique flat ID.
However, some flats are duplicates: images of the same physical flat appear under different unique flat IDs.
Duplicate flats do not necessarily share identical images: this is a near-duplicate detection task.
Technical constraints and setup:
I'm using Python.
I have access to AWS services, but the main focus here is the machine-learning and image-similarity approach rather than the infrastructure.
The solution must be optimised, given the size of the database.
Ideally, there should be some pre-filtering or approximate search on the embeddings, so that the new image is never compared against every existing one, i.e. a multi-stage filtering of images (see the rough sketch below).
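To make the idea concrete, here is a rough sketch of the kind of two-stage pipeline I have in mind. The embedding model (clip-ViT-B-32 via sentence-transformers) and the FAISS IVF index are just placeholder choices I'm assuming for illustration, not decisions I've settled on:

```python
"""Sketch: embed every image once, index the embeddings with an approximate
nearest-neighbour (ANN) structure, and only compare a new image against its
top-k ANN candidates instead of all ~13M existing images."""

import numpy as np
import faiss
from PIL import Image
from sentence_transformers import SentenceTransformer

# Placeholder embedding model; any image encoder producing fixed-size vectors would do.
model = SentenceTransformer("clip-ViT-B-32")


def embed_images(paths: list[str]) -> np.ndarray:
    """Embed a batch of images, L2-normalised so inner product == cosine similarity."""
    images = [Image.open(p).convert("RGB") for p in paths]
    emb = model.encode(images, batch_size=64, convert_to_numpy=True).astype("float32")
    faiss.normalize_L2(emb)  # in-place normalisation
    return emb


def build_index(embeddings: np.ndarray, nlist: int = 4096) -> faiss.Index:
    """Offline step: build an IVF index over the embeddings of all existing images."""
    dim = embeddings.shape[1]
    quantizer = faiss.IndexFlatIP(dim)
    index = faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT)
    index.train(embeddings)  # IVF needs a representative training sample
    index.add(embeddings)
    return index


def candidate_flats(index, flat_ids, new_image_paths, k=20, threshold=0.9):
    """Online step: return existing flat IDs whose images are close to any
    image of a newly added flat."""
    xq = embed_images(new_image_paths)
    index.nprobe = 32                   # recall vs. speed trade-off
    scores, ids = index.search(xq, k)   # stage 1: cheap ANN pre-filter
    hits = set()
    for row_scores, row_ids in zip(scores, ids):
        for s, i in zip(row_scores, row_ids):
            if i != -1 and s >= threshold:  # stage 2: keep only strong matches
                hits.add(flat_ids[i])
    return hits
```

I imagine the image-level hits would then be aggregated per flat ID (e.g. only flag a new flat as a duplicate candidate when several of its images match the same existing flat), with the surviving candidates re-checked by a more expensive exact comparison; that is the multi-stage filtering mentioned above.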
Thanks a lot,
Guillaume