Find only the rows that exist in one csv but not the other

Question

CSV_1.csv has the structure:

    ABC
    DEF
    GHI
    JKL
    MNO
    PQR

CSV_2.csv has the structure:

    XYZ
    DEF
    ABC

CSV_2.csv is a lot smaller than CSV_1.csv, and many of the rows in CSV_2.csv also appear in CSV_1.csv. I want to figure out whether there are rows that exist in CSV_2.csv but not in CSV_1.csv.

These files are not sorted.

The bigger CSV has close to 10 million rows; the smaller one has around 7 million rows.

How would I go about doing this? I tried Python, but comparing each row from CSV_2.csv against the 10 million rows in CSV_1.csv takes a lot of time.

Here is what I tried in Python:

    with open('old.csv', 'r') as t1, open('new.csv', 'r') as t2:
        fileone = t1.readlines()
        filetwo = t2.readlines()

    with open('update.csv', 'a') as outFile:
        for line in filetwo:
            if line not in fileone:
                outFile.write(line)

awk comes to mind. What would the exact code be for awk?

Have a look at stackoverflow.com/questions/42239179/…, it has a lot of well-researched answers. Pick the one that suits your needs.
– Inian, Commented Apr 24, 2017 at 6:07
You can go through the following questions: stackoverflow.com/questions/5268929/… stackoverflow.com/questions/11108667/…
– user2125722, Commented Apr 24, 2017 at 6:10
@Inian The one you linked does not answer my question. I have also looked at other similar questions here on SO.
– Shoumik, Commented Apr 24, 2017 at 6:12
If you have tried Python, you should share your code so that someone might help you with it.
– Stephen Rauch ♦, Commented Apr 24, 2017 at 6:14

juanpa.arrivillaga · Accepted Answer · 2017-04-24 06:25:35Z

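Yes, your approach is very inefficient: `line not in fileone` scans a list, so every lookup is O(n). The following should be much faster, using the average O(1) lookup time of sets and iterating over the lines in t2 lazily:

    with open('old.csv', 'r') as t1, open('new.csv', 'r') as t2:
        # build the lookup set once from the old file
        fileone = frozenset(t1)
        # t2 must still be open while it is streamed line by line
        with open('update.csv', 'a') as outFile:
            for line in t2:
                if line not in fileone:
                    outFile.write(line)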

Your solution gave me: fileone = frozenset(1) TypeError: 'int' object is not iterable
@Shoumik that wasn't my solution, I wrote fileone = frozenset(t1), so frozenset(**t1**) not frozenset(1)
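
One pitfall with comparing raw lines is that a final row without a trailing newline never equals the same row followed by a newline in the other file. The following is a minimal sketch of the same set-based filter with line terminators stripped before comparison. The function name `write_missing_rows` and its parameters are hypothetical, introduced only for illustration, and the sketch assumes whole-line comparison is what is wanted.

```python
def write_missing_rows(old_path, new_path, out_path):
    """Write rows of new_path that do not occur in old_path to out_path.

    Returns the number of rows written. Names are illustrative only.
    """
    # Strip line terminators so a last row without '\n' still matches.
    with open(old_path, 'r') as old_file:
        seen = frozenset(line.rstrip('\r\n') for line in old_file)

    written = 0
    with open(new_path, 'r') as new_file, open(out_path, 'w') as out_file:
        for line in new_file:  # streamed lazily, one row at a time
            row = line.rstrip('\r\n')
            if row not in seen:
                out_file.write(row + '\n')
                written += 1
    return written
```

Calling `write_missing_rows('old.csv', 'new.csv', 'update.csv')` would mirror the file names used above. Note that this sketch opens the output in write mode rather than append mode, so repeated runs do not accumulate rows.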

Stephen Rauch · Accepted Answer · 2017-04-24 06:24:01Z

To speed up the Python implementation, you should use a data structure which is fast for lookups. You should try a set:

Change:

    fileone = t1.readlines()

To:

    fileone = set(t1.readlines())

This will considerably speed up the line:

    if line not in fileone:
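
To get a feel for the difference, here is a rough timing sketch comparing membership tests on a list and on a set holding the same rows. The row contents, sizes, and variable names are made up for illustration, and the measured times will vary by machine.

```python
import timeit

# Synthetic rows standing in for the lines of a CSV file (illustrative data).
rows = [f"row-{i}\n" for i in range(20_000)]
as_list = rows
as_set = set(rows)

# Worst case for the list: the probe is the last element, so every
# lookup scans the whole list. The set uses a hash lookup instead.
probe = "row-19999\n"

list_time = timeit.timeit(lambda: probe in as_list, number=100)
set_time = timeit.timeit(lambda: probe in as_set, number=100)

print(f"list lookups: {list_time:.4f}s")
print(f"set lookups:  {set_time:.6f}s")
```

Both containers give the same answer for `probe in ...`; only the cost per lookup differs, which is what matters when the test runs once per row of the second file.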

Nilanjan · Accepted Answer · 2017-04-24 06:29:14Z

You can use pandas DataFrames. Create two data frames, one from each CSV, then do an outer merge with indicator=True and keep the rows marked right_only.

    >>> import pandas as pd
    >>> # DataFrame.from_csv was removed in pandas 1.0; read_csv replaces it.
    >>> # Assumes each file has a header row naming the column 'val'.
    >>> df1 = pd.read_csv('CSV_1.csv')
    >>> df2 = pd.read_csv('CSV_2.csv')
    >>> df1
       val
    0  ABC
    1  DEF
    2  GHI
    3  JKL
    4  MNO
    5  PQR
    >>> df2
       val
    0  XYZ
    1  DEF
    2  ABC
    >>> df = pd.merge(df1, df2, how='outer', indicator=True)
    >>> df
       val      _merge
    0  ABC        both
    1  DEF        both
    2  GHI   left_only
    3  JKL   left_only
    4  MNO   left_only
    5  PQR   left_only
    6  XYZ  right_only
    >>> uniqueRowsInCsv2 = df[df['_merge'] == 'right_only']
    >>> uniqueRowsInCsv2
       val      _merge
    6  XYZ  right_only

Abhishek Balaji R · Accepted Answer · 2017-04-24 06:50:22Z

You could load the data into sets and use the set difference operation to speed this up:

    with open('old.csv', 'r') as t1, open('new.csv', 'r') as t2:
        old_set = set(t1.readlines())
        new_set = set(t2.readlines())

    # values in new_set but not in old_set
    differences = new_set.difference(old_set)

    with open('update.csv', 'a') as outFile:
        for difference in differences:
            outFile.write(difference)
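
Holding both files in memory as sets can be costly at this scale. As an alternative sketch (not the method from this answer), only the smaller file needs to be held as a set while the larger one is streamed, discarding every row that turns up. The function name `rows_only_in_small` and its parameters are hypothetical, and like any set-based approach this drops duplicate rows and the original row order.

```python
def rows_only_in_small(small_path, big_path):
    """Return the set of rows found in small_path but not in big_path.

    Only the smaller file is held in memory; the bigger file is streamed.
    Names are illustrative only.
    """
    # Start by assuming every row of the smaller file is missing.
    with open(small_path, 'r') as small_file:
        missing = {line.rstrip('\r\n') for line in small_file}

    with open(big_path, 'r') as big_file:
        for line in big_file:
            # A row seen in the bigger file is not missing after all.
            missing.discard(line.rstrip('\r\n'))
            if not missing:
                break  # nothing left to look for, stop reading early
    return missing
```

With the file names from the question, `rows_only_in_small('CSV_2.csv', 'CSV_1.csv')` would return the rows unique to CSV_2.csv.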
