Compare two csv files with python pandas

Question

I have two csv files both consist of two columns.

The first one has the product id, and the second has the serial number.

I need to lookup, all serial numbers from the first csv, and find matches, on the second csv. The result report, will have the serial number that matched, and the corresponding product ids from each csv, in a separate column i trued to modify the below code, no luck.

How would you approach this?

import pandas as pd A=set(pd.read_csv("c1.csv", index_col=False, header=None)[0]) #reads the csv, takes only the first column and creates a set out of it. B=set(pd.read_csv("c2.csv", index_col=False, header=None)[0]) #same here print(A-B) #set A - set B gives back everything thats only in A. print(B-A) # same here, other way around.

Can you add some sample data and desired output? Because it is a bit unclear what need exactly. — jezrael
– jezrael, Commented Feb 23, 2017 at 14:41

jezrael · Accepted Answer · 2017-02-23 14:32:07Z

9

I think you need merge:

A = pd.DataFrame({'product id': [1455,5452,3775], 'serial number':[44,55,66]}) print (A) B = pd.DataFrame({'product id': [7000,2000,1000], 'serial number':[44,55,77]}) print (B) print (pd.merge(A, B, on='serial number')) product id_x serial number product id_y 0 1455 44 7000 1 5452 55 2000

answered Feb 23, 2017 at 14:32

jezrael

868k103 gold badges1.4k silver badges1.3k bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

user7609771 Over a year ago

only a small modification needed, how could one give the two filenames as input, in the above snippet, instead of hard coding the values?

greenbug Over a year ago

@user7609711 just use a = pd.read_csv("path-to-file") or a = pd.read_excel("path-to-file") if you have an xlsx file. Though you will need openpyxl to open excel files.

MaxU - stand with Ukraine · Accepted Answer · 2017-02-23 14:34:03Z

Try this:

A = pd.read_csv("c1.csv", header=None, usecols=[0], names=['col']).drop_duplicates() B = pd.read_csv("c2.csv", header=None, usecols=[0], names=['col']).drop_duplicates() # A - B pd.merge(A, B, on='col', how='left', indicator=True).query("_merge == 'left_only'") # B - A pd.merge(A, B, on='col', how='right', indicator=True).query("_merge == 'right_only'")

Shijo · Accepted Answer · 2017-02-23 14:55:00Z

You can convert df into Sets , that will ignore the index while comparing the data, then use set symmetric_difference

ds1 = set([ tuple(values) for values in df1.values.tolist()]) ds2 = set([ tuple(values) for values in df2.values.tolist()]) ds1.symmetric_difference(ds2) print df1 ,'\n\n' print df2,'\n\n' print pd.DataFrame(list(ds1.difference(ds2))),'\n\n' print pd.DataFrame(list(ds2.difference(ds1))),'\n\n'

df1

id Name score isEnrolled Comment 0 111 Jack 2.17 True He was late to class 1 112 Nick 1.11 False Graduated 2 113 Zoe 4.12 True NaN

df2

 id Name score isEnrolled Comment 0 111 Jack 2.17 True He was late to class 1 112 Nick 1.21 False Graduated 2 113 Zoe 4.12 False On vacation

Output

 0 1 2 3 4 0 113 Zoe 4.12 True NaN 1 112 Nick 1.11 False Graduated 0 1 2 3 4 0 113 Zoe 4.12 False On vacation 1 112 Nick 1.21 False Graduated

Collectives™ on Stack Overflow

Compare two csv files with python pandas

3 Answers 3

2 Comments

Comments

Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

2 Comments

Comments

Comments

Linked

Related