Following How do I calculate the MD5 checksum of a file in Python?, I wrote a script that uses MD5 to remove duplicate files in the folder dst_dir. However, for many files (.jpg and .mp4), the script fails to detect the duplicates: files that look identical get different MD5 values. I tried the methods mentioned in Python 3 same text but different md5 hashes, and they did not help. I suspect the cause might be the file properties (the "modification date", etc.) attached to the image files, which may have changed.

    import hashlib
    import os

    dst_dir = "/"
    directory = dst_dir

    # list of file md5
    md5_list = []
    md5_file_list = []

    for root, subdirectories, files in os.walk(directory):
        if ".tresorit" not in root:
            for file in files:
                file_path = os.path.abspath(os.path.join(root, file))
                print(file_path)
                # Open, close, read file and calculate MD5 on its contents
                with open(file_path, 'rb') as file_to_check:
                    # read contents of the file
                    data = file_to_check.read()
                    # pipe contents of the file through MD5
                    md5_returned = hashlib.md5(data).hexdigest()
                if md5_returned not in md5_list:
                    md5_list.append(md5_returned)
                    md5_file_list.append(file_path)
                else:
                    # remove duplicate file
                    print(["Duplicate file", file_path, md5_returned])
                    if "-" not in file:
                        os.remove(file_path)
                        print("Duplicate file removed 01")
                    else:
                        file_list_index = md5_list.index(md5_returned)
                        if "-" not in md5_file_list[file_list_index]:
                            os.remove(md5_file_list[file_list_index])
                            del md5_list[file_list_index]
                            del md5_file_list[file_list_index]
                            print("Duplicate file removed 02")
                            md5_list.append(md5_returned)
                            md5_file_list.append(file_path)
                        else:
                            os.remove(file_path)
                            print("Duplicate file removed 03")

How can I fix the MD5 calculation so that identical image files produce the same MD5 value?
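One way to test the metadata suspicion is to compare two supposedly identical files directly. The following is only a minimal sketch and not part of the script above: the helper names `file_md5` and `first_difference` are hypothetical, and the chunk size is an arbitrary choice. It hashes a file in chunks, so large .mp4 files need not be loaded into memory at once, and reports the byte offset at which two files first diverge.

```python
import hashlib
import os


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file's bytes, read in chunks.

    Hypothetical helper for illustration; chunk_size is an arbitrary choice.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def first_difference(path_a, path_b, chunk_size=1 << 20):
    """Return the byte offset of the first difference, or None if identical.

    Hypothetical helper for illustration.
    """
    offset = 0
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        while True:
            chunk_a = file_a.read(chunk_size)
            chunk_b = file_b.read(chunk_size)
            if chunk_a != chunk_b:
                for index, (byte_a, byte_b) in enumerate(zip(chunk_a, chunk_b)):
                    if byte_a != byte_b:
                        return offset + index
                # No differing byte found: one file ends before the other.
                return offset + min(len(chunk_a), len(chunk_b))
            if not chunk_a:
                # Both files reached EOF without any difference.
                return None
            offset += len(chunk_a)
```

If `first_difference` returns `None` for two files, their bytes are identical and `file_md5` returns the same digest for both, whatever their filesystem timestamps are, because the modification date kept by the filesystem is not part of the file contents. If it returns an offset, the files really do differ at the byte level, which could happen when metadata embedded inside the file (for example EXIF tags in a .jpg) has been rewritten. In that case a hash of the raw bytes is behaving as expected, and a different comparison would be needed.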