Suppose there are 10,000 JPEG and PNG images in a gallery. How can I find all images with a color palette similar to a selected image, sorted by descending similarity?
- Possible duplicate: stackoverflow.com/questions/593925/… – ChristopheD, Nov 10, 2009 at 0:08
- Yeah, but there are no good answers on that question. :-) – Frank Krueger, Nov 10, 2009 at 0:35
- There's a lot of similar discussion here: stackoverflow.com/questions/1034900/… – Paul, Nov 10, 2009 at 0:40
- Here is another contender: github.com/larytet-py/image-mathching. The code groups the matching colors and adds the percentage of the area occupied by each color group. – Larytet, Sep 16, 2019 at 13:17
2 Answers
Build a color histogram for each image. Then, when you want to match an image against the collection, simply order the list by how close each image's histogram is to the selected image's histogram.
The number of buckets will depend on how accurate you want to be. The type of data combined to make a bucket will define how you prioritize your search.
For example, if you are most interested in hue, then you can define which bucket each individual pixel of the image goes into as:
```python
def bucket_from_pixel(r, g, b):
    hue = hue_from_rgb(r, g, b)  # [0, 360)
    return int(hue * NUM_BUCKETS / 360)  # integer bucket index in [0, NUM_BUCKETS)
```

If you also want a general matcher, then you can pick the bucket based upon the full RGB value.
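The `hue_from_rgb` helper isn't part of PIL; a possible sketch uses the standard library's `colorsys` module (the scaling from `colorsys`'s [0, 1) hue to degrees is my assumption):

```python
import colorsys

def hue_from_rgb(r, g, b):
    """Hue in [0, 360) for 8-bit RGB components, via HSV."""
    h, _, _ = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return h * 360.0
```

For example, pure green (0, 255, 0) maps to a hue of 120 degrees.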
Using PIL, you can use the built-in histogram function. The "closeness" of two histograms can be calculated using any distance measure you want. For example, an L1 distance could be:
```python
hist_sel = normalize(sel.histogram())
hist = normalize(other.histogram())  # these normalized histograms should be precomputed and stored
dist = sum(abs(a - b) for a, b in zip(hist_sel, hist))
```

an L2 would be:

```python
from math import sqrt
dist = sqrt(sum((a - b) ** 2 for a, b in zip(hist_sel, hist)))
```

Normalize just forces the sum of the histogram to equal some constant value (1.0 works fine). This is important so that large images can be compared correctly to small images. If you're going to use L1 distances, then you should use an L1 measure in normalize; if L2, then L2.
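Putting the pieces together: the snippets above assume a `normalize` helper that isn't defined anywhere. A minimal end-to-end sketch (function names like `rank_by_palette` are mine, not PIL's) that returns the gallery sorted by descending similarity:

```python
from PIL import Image

def normalize(hist):
    """Scale histogram counts so they sum to 1.0, so image size doesn't matter."""
    total = float(sum(hist))
    return [count / total for count in hist]

def l1_distance(hist_a, hist_b):
    """L1 distance between two normalized histograms."""
    return sum(abs(a - b) for a, b in zip(hist_a, hist_b))

def rank_by_palette(selected_path, candidate_paths):
    """Return candidate paths sorted by descending similarity (ascending L1 distance)."""
    hist_sel = normalize(Image.open(selected_path).convert('RGB').histogram())
    scored = []
    for path in candidate_paths:
        hist = normalize(Image.open(path).convert('RGB').histogram())
        scored.append((l1_distance(hist_sel, hist), path))
    scored.sort()  # smallest distance first = most similar first
    return [path for _, path in scored]
```

In practice you would compute and store each image's normalized histogram once, then only score and sort at query time.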
Another solution is to run K-means on each image's pixels to group colors into a palette, order each palette by hue, then use cosine similarity between palettes to find the most similar image.
Here is code that will find the image in a folder most similar to a reference image:
```python
from PIL import Image
import os
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from numpy.linalg import norm
from tqdm import tqdm
from skimage.color import rgb2hsv


def extract_palette(image, num_colors):
    image_array = np.array(image.convert('RGB'))  # drop any alpha channel
    pixels = image_array.reshape((-1, 3))
    kmeans = MiniBatchKMeans(n_clusters=num_colors, n_init='auto')
    kmeans.fit(pixels)
    colors = kmeans.cluster_centers_
    return colors


def order_vector_by_hue(colors):
    hsv_colors = rgb2hsv(np.array([colors]) / 255.0)  # rgb2hsv expects values in [0, 1]
    ordered_indices = np.argsort(hsv_colors[:, :, 0])
    ordered_rgb_colors = colors[ordered_indices]
    return ordered_rgb_colors


def cosine_sim(u, v):
    return np.dot(u, v) / (norm(u) * norm(v))


if __name__ == "__main__":
    ref_image_path = '<your-path>'
    folder = '<your-image-folder>'
    files = os.listdir(folder)

    print('processing ref image')
    image = Image.open(ref_image_path)
    ref_colors = extract_palette(image, num_colors=32)
    ref_colors = order_vector_by_hue(ref_colors).reshape(-1)

    print('processing candidate images')
    selected_image_path = None
    max_similarity = -1  # value for the most dissimilar possible image
    for image_name in tqdm(files):
        image_path = os.path.join(folder, image_name)
        image = Image.open(image_path)
        colors = extract_palette(image, num_colors=32)
        colors = order_vector_by_hue(colors).reshape(-1)
        similarity = cosine_sim(ref_colors, colors)
        if similarity > max_similarity:
            max_similarity = similarity
            selected_image_path = image_path

    print(f'most similar image: {selected_image_path}')
```

Edit: There are probably ways of enhancing all of this. If I had time, I would experiment with palette compression using PCA and with the Lab color space to order (and compress?) the vectors. This solution is good enough for me right now. It could be quite slow for 10k images (K-means); my use case is only several hundred images.
Edit 2: A fast alternative to the clustering is to randomly select num_colors pixels from the image. Depending on your use case, that can be good enough.
```python
def extract_palette(image, num_colors):
    image_array = np.array(image)
    pixels = image_array.reshape((-1, 3))
    selected_colors = pixels[np.random.choice(len(pixels), num_colors, replace=False)]
    return selected_colors
```
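One caveat with the sampling version: the palette changes on every run. A sketch of the same idea using a seeded `numpy.random.default_rng` (the `seed` parameter and the `extract_palette_sampled` name are my additions) makes results reproducible:

```python
import numpy as np
from PIL import Image

def extract_palette_sampled(image, num_colors, seed=0):
    """Palette by sampling random pixels; deterministic for a fixed seed."""
    rng = np.random.default_rng(seed)
    pixels = np.array(image.convert('RGB')).reshape((-1, 3))  # drop alpha so reshape is safe
    chosen = rng.choice(len(pixels), num_colors, replace=False)
    return pixels[chosen]
```

Calling it twice with the same seed returns the identical (num_colors, 3) palette, so similarity scores stay stable between runs.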