9

I have two csv files both consist of two columns.

The first one has the product id, and the second has the serial number.

I need to lookup, all serial numbers from the first csv, and find matches, on the second csv. The result report, will have the serial number that matched, and the corresponding product ids from each csv, in a separate column i trued to modify the below code, no luck.

How would you approach this?

import pandas as pd A=set(pd.read_csv("c1.csv", index_col=False, header=None)[0]) #reads the csv, takes only the first column and creates a set out of it. B=set(pd.read_csv("c2.csv", index_col=False, header=None)[0]) #same here print(A-B) #set A - set B gives back everything thats only in A. print(B-A) # same here, other way around. 
1
  • Can you add some sample data and desired output? Because it is a bit unclear what need exactly. Commented Feb 23, 2017 at 14:41

3 Answers 3

9

I think you need merge:

A = pd.DataFrame({'product id': [1455,5452,3775], 'serial number':[44,55,66]}) print (A) B = pd.DataFrame({'product id': [7000,2000,1000], 'serial number':[44,55,77]}) print (B) print (pd.merge(A, B, on='serial number')) product id_x serial number product id_y 0 1455 44 7000 1 5452 55 2000 
Sign up to request clarification or add additional context in comments.

2 Comments

only a small modification needed, how could one give the two filenames as input, in the above snippet, instead of hard coding the values?
@user7609711 just use a = pd.read_csv("path-to-file") or a = pd.read_excel("path-to-file") if you have an xlsx file. Though you will need openpyxl to open excel files.
4

Try this:

A = pd.read_csv("c1.csv", header=None, usecols=[0], names=['col']).drop_duplicates() B = pd.read_csv("c2.csv", header=None, usecols=[0], names=['col']).drop_duplicates() # A - B pd.merge(A, B, on='col', how='left', indicator=True).query("_merge == 'left_only'") # B - A pd.merge(A, B, on='col', how='right', indicator=True).query("_merge == 'right_only'") 

Comments

1

You can convert df into Sets , that will ignore the index while comparing the data, then use set symmetric_difference

ds1 = set([ tuple(values) for values in df1.values.tolist()]) ds2 = set([ tuple(values) for values in df2.values.tolist()]) ds1.symmetric_difference(ds2) print df1 ,'\n\n' print df2,'\n\n' print pd.DataFrame(list(ds1.difference(ds2))),'\n\n' print pd.DataFrame(list(ds2.difference(ds1))),'\n\n' 

df1

id Name score isEnrolled Comment 0 111 Jack 2.17 True He was late to class 1 112 Nick 1.11 False Graduated 2 113 Zoe 4.12 True NaN 

df2

 id Name score isEnrolled Comment 0 111 Jack 2.17 True He was late to class 1 112 Nick 1.21 False Graduated 2 113 Zoe 4.12 False On vacation 

Output

 0 1 2 3 4 0 113 Zoe 4.12 True NaN 1 112 Nick 1.11 False Graduated 0 1 2 3 4 0 113 Zoe 4.12 False On vacation 1 112 Nick 1.21 False Graduated 

Comments