2

CSV_1.csv has the structure:

ABC DEF GHI JKL MNO PQR 

CSV_2.csv has the structure:

XYZ DEF ABC 

CSV_2.csv is a lot smaller than CSV_1.csv and a lot of the rows that exist in CSV_2.csv appears in CSV_1.csv. I want to figure out if there are rows that exist in CSV_2.csv but not in CSV_1.csv.

These files are not sorted.

The bigger csv has closer to 10 million rows, the smaller table has around 7 million rows.

How would I go about doing this? I tried python but taking each row from CSV_2.csv and comparing with 10 million rows in CSV_1.csv takes a lot of time.

Here is what I tried in python:

with open('old.csv', 'r') as t1, open('new.csv', 'r') as t2: fileone = t1.readlines() filetwo = t2.readlines() with open('update.csv', 'a') as outFile: for line in filetwo: if line not in fileone: outFile.write(line) 

awk comes to mind. What would the exact code be for awk?

8
  • Have a look at stackoverflow.com/questions/42239179/…, it has lot of well researched answers. Pick the one that suits your needs Commented Apr 24, 2017 at 6:07
  • You can go through the following questions : stackoverflow.com/questions/5268929/… stackoverflow.com/questions/11108667/… Commented Apr 24, 2017 at 6:10
  • @Inian The one you linked does not answer my question. I have also looked at other similar questions here on SO. Commented Apr 24, 2017 at 6:12
  • @user2125722 I already tried python. Very, very slow. Commented Apr 24, 2017 at 6:12
  • If you have tried python you should share, so that someone might help you with it. Commented Apr 24, 2017 at 6:14

4 Answers 4

3

Yes, your approach is very inefficient. The following should be much faster, using O(1) lookup-time of sets, and iterating over the lines in t2 lazily:

with open('old.csv', 'r') as t1, open('new.csv', 'r') as t2: fileone = frozenset(t1) with open('update.csv', 'a') as outFile: for line in t2: if line not in fileone: outFile.write(line) 
Sign up to request clarification or add additional context in comments.

2 Comments

Your solution gave me: fileone = frozenset(1) TypeError: 'int' object is not iterable
@Shoumik that wasn't my solution, I wrote fileone = frozenset(t1), so frozenset(**t1**) not frozenset(1)
0

To speed up the python implementation, you should use a data structure which is fast for lookups. You should try a set:

Change:

fileone = t1.readlines() 

To:

fileone = set(t1.readlines()) 

This will considerably speed up the line:

if line not in fileone: 

Comments

0

You can use pandas dataframe. Create 2 data frames from both the csv.

import pandas as pd df1= pd.DataFrame.from_csv('CSV_1.csv') df2= pd.DataFrame.from_csv('CSV_2.csv') >>> df1 val 0 ABC 1 DEF 2 GHI 3 JKL 4 MNO 5 PQR >>> >>> df2 val 0 XYZ 1 DEF 2 ABC >>> df = pd.merge(df1, df2, how='outer', indicator=True) >>> df val _merge 0 ABC both 1 DEF both 2 GHI left_only 3 JKL left_only 4 MNO left_only 5 PQR left_only 6 XYZ right_only >>> uniqueRowsInCsv2 = df[ df['_merge'] == 'right_only' ] >>> uniqueRowsInCsv2 val _merge 6 XYZ right_only >>> 

Comments

0

You could load the data in sets and use the set difference operation to speed up:

with open('old.csv', 'r') as t1, open('new.csv', 'r') as t2: old_set = set(t1.readlines()) new_set = set(t2.readlines()) # values in new_set but not in old_set differences = new_set.difference(old_set) with open('update.csv', 'a') as outFile: for difference in differences: outFile.write(difference) 

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.