Following "How do I calculate the MD5 checksum of a file in Python?", I wrote a script that uses MD5 to remove duplicate files in the folder dst_dir. However, for many files (.jpg and .mp4), the script fails to detect the duplicates because their MD5 hashes differ. The methods mentioned in "Python 3 same text but different md5 hashes" did not help. I suspect that the file properties attached to the image files (the "modification date" etc.) may have changed.

    import os
    import hashlib

    dst_dir = "/"
    directory = dst_dir

    # list of file md5
    md5_list = []
    md5_file_list = []

    for root, subdirectories, files in os.walk(directory):
        if ".tresorit" not in root:
            for file in files:
                file_path = os.path.abspath(os.path.join(root, file))
                print(file_path)
                # Open, close, read file and calculate MD5 on its contents
                with open(file_path, 'rb') as file_to_check:
                    # read contents of the file
                    data = file_to_check.read()
                    # pipe contents of the file through
                    md5_returned = hashlib.md5(data).hexdigest()
                if md5_returned not in md5_list:
                    md5_list.append(md5_returned)
                    md5_file_list.append(file_path)
                else:
                    # remove duplicate file
                    print(["Duplicate file", file_path, md5_returned])
                    if "-" not in file:
                        os.remove(file_path)
                        print("Duplicate file removed 01")
                    else:
                        file_list_index = md5_list.index(md5_returned)
                        if "-" not in md5_file_list[file_list_index]:
                            os.remove(md5_file_list[file_list_index])
                            del md5_list[file_list_index]
                            del md5_file_list[file_list_index]
                            print("Duplicate file removed 02")
                            md5_list.append(md5_returned)
                            md5_file_list.append(file_path)
                        else:
                            os.remove(file_path)
                            print("Duplicate file removed 03")

How can I fix the MD5 calculation in Python so that identical image files produce the same MD5 value?
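For reference, here is a minimal sketch of hashing by content and grouping paths by digest, without deleting anything. The helper names (file_md5, group_by_md5, duplicate_groups) and the chunk size are hypothetical and not part of the original script. Since the digest is computed only from the bytes returned by open(path, 'rb'), filesystem timestamps such as the modification date should not affect it; metadata embedded inside the file (EXIF tags, for example) is part of those bytes and does.

```python
import hashlib
import os
from collections import defaultdict


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file's bytes, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Reading in chunks avoids loading large .mp4 files fully into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_md5(directory, skip_marker=".tresorit"):
    """Map each digest to the list of absolute file paths that share it."""
    groups = defaultdict(list)
    for root, _dirs, files in os.walk(directory):
        if skip_marker in root:
            continue
        for name in files:
            path = os.path.abspath(os.path.join(root, name))
            groups[file_md5(path)].append(path)
    return groups


def duplicate_groups(directory):
    """Return only the groups that contain more than one path."""
    return [sorted(paths)
            for paths in group_by_md5(directory).values()
            if len(paths) > 1]
```

A dictionary keyed by digest replaces the two parallel lists, so the lookup does not depend on list indices staying in sync. Printing duplicate_groups(dst_dir) before removing anything shows which files the hash actually treats as identical.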

  • This could use some clarification. I infer that while the script analyzes images of multiple formats, only exact duplicates (literal copies of the same source file) should be detected. Is this accurate, ShoutOutAndCalculate? Or is Mark correct that you're trying to detect the same image when present in different file formats? Commented May 8, 2023 at 22:36
  • See stackoverflow.com/a/28834788/2836621 and stackoverflow.com/a/54053080/2836621 Commented May 8, 2023 at 22:39
  • @CrazyChucky It's not just file formats, it's also different bit-depths, different meta-data, different compression, different encoding... Commented May 8, 2023 at 22:42
  • @MarkSetchell I just want the code to analyze and delete what is supposed to be an exact copy, i.e. "file 1.jpg" and "file 1 copy.jpg", or "file 2.mp4" and "file 2 copy.mp4". However, the MD5 values for what are supposed to be identical files somehow turned out to be different. Commented May 9, 2023 at 2:55
  • @CrazyChucky Yes, it was supposed to be an exact copy. But one file was uploaded to the server and the other one was transferred through USB. They had the same file size etc. Commented May 9, 2023 at 3:33
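One way to check whether two files of the same size really are byte-for-byte copies is to compare them directly and locate the first differing offset. The sketch below is only a diagnostic aid under that assumption; first_difference and compare_report are hypothetical helper names, while filecmp.cmp with shallow=False is the standard-library content comparison.

```python
import filecmp
import os


def first_difference(path_a, path_b, chunk_size=1 << 20):
    """Return the offset of the first differing byte, or None if identical.

    If one file is a prefix of the other, the length of the shorter
    file is returned.
    """
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                shared = min(len(chunk_a), len(chunk_b))
                for i in range(shared):
                    if chunk_a[i] != chunk_b[i]:
                        return offset + i
                return offset + shared
            if not chunk_a:
                # Both files ended at the same point with no difference.
                return None
            offset += len(chunk_a)


def compare_report(path_a, path_b):
    """Collect sizes, a full content comparison, and the first differing offset."""
    return {
        "size_a": os.path.getsize(path_a),
        "size_b": os.path.getsize(path_b),
        "identical_bytes": filecmp.cmp(path_a, path_b, shallow=False),
        "first_difference": first_difference(path_a, path_b),
    }
```

If identical_bytes is False for a pair such as "file 1.jpg" and "file 1 copy.jpg", the files are not exact copies and differing MD5 values are expected. A first difference near the start of the file may point to rewritten header or metadata bytes, though that is only a guess without inspecting the files.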
