I have two fixed-width files like the ones below (the only change is the Date value starting at position 14).
sample_hash1.txt

```
GOKULKRISHNA 04/17/2018
ABCDEFGHIJKL 04/17/2018
111111111111 04/17/2018
```

sample_hash2.txt

```
GOKULKRISHNA 04/16/2018
ABCDEFGHIJKL 04/16/2018
111111111111 04/16/2018
```

Using pandas `read_fwf` I am reading each file and creating a DataFrame that excludes the Date value, loading only the first 13 characters:
```python
import pandas as pd

df1 = pd.read_fwf("sample_hash1.txt", colspecs=[(0, 13)])
df2 = pd.read_fwf("sample_hash2.txt", colspecs=[(0, 13)])
```
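One detail worth noting: by default `read_fwf` treats the first line of the file as the header, which is why `GOKULKRISHNA` shows up as the column label rather than as data. If every row should be data, `header=None` keeps the first line; a small sketch using an in-memory file (the column name `key` is one I am introducing for illustration):

```python
import io

import pandas as pd

# Simulated file contents matching the sample_hash1.txt layout above.
data = (
    "GOKULKRISHNA 04/17/2018\n"
    "ABCDEFGHIJKL 04/17/2018\n"
    "111111111111 04/17/2018\n"
)

# header=None keeps the first line as data; names supplies a column label.
df = pd.read_fwf(io.StringIO(data), colspecs=[(0, 13)], header=None, names=["key"])
print(df)
```

With this, all three rows survive into the DataFrame instead of the first one being consumed as a header.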
Both DataFrames look identical (the first row has become the header):

```
>>> df1
  GOKULKRISHNA
0 ABCDEFGHIJKL
1 111111111111
...
>>> df2
  GOKULKRISHNA
0 ABCDEFGHIJKL
1 111111111111
...
```

Now I am trying to generate a hash value for each DataFrame, but the hashes for df1 and df2 are different. I'm not sure what's wrong here; can someone throw some light on this, please? I have to identify whether there is any change in the data between the files (excluding the Date columns).
```python
>>> print(hash(df1.values.tostring()))
-3571422965125408226
>>> print(hash(df2.values.tostring()))
5039867957859242153
```

I am loading these files into a table (each full file is around 2 GB in size). Every time we receive full files from the source, and sometimes there is no change in the data (excluding the last column, Date). My idea is to reject such files: if I can generate a hash for each file and store it somewhere (in a table), then next time I can compare the new file's hash value with the stored hash. I thought this was the right approach, but I got stuck with the hash generation.
I checked the post "Most efficient property to hash for numpy array", but that is not what I am looking for.
`df1.values.tostring() == df2.values.tostring()` is False here even though the contents match: on an object-dtype array, `tostring()` returns the raw bytes of the array buffer, which are the object *pointers*, not the string data, so two DataFrames holding equal strings still serialize to different byte strings. (Also note that `df1[:-1]` removes the last *row*, not a column; to drop the last column you would use `df1.iloc[:, :-1]`.)
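A small demonstration of this pointer effect, and of `DataFrame.equals` as the direct way to compare contents (`tobytes()` is the non-deprecated spelling of `tostring()`; the strings are built at runtime so CPython does not share them the way it can with identical literals):

```python
import pandas as pd

# Two equal-content strings that are guaranteed to be distinct objects.
s1 = "".join(["ABCDEF", "GHIJKL"])
s2 = "".join(["ABCDEF", "GHIJKL"])

df1 = pd.DataFrame({"GOKULKRISHNA": [s1, "111111111111"]})
df2 = pd.DataFrame({"GOKULKRISHNA": [s2, "111111111111"]})

# Content comparison: True, the values match element-wise.
print(df1.equals(df2))                               # True

# Byte comparison of the object arrays: False, because tobytes() dumps
# the PyObject pointers, and s1 and s2 live at different addresses.
print(df1.values.tobytes() == df2.values.tobytes())  # False
```

So a byte-level serialization of an object array is never a reliable basis for a content hash; compare with `equals`, or hash a content-based representation as above.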