How to compare two different CSV files with different number of columns and rows by keyword?

Question

Without using the diff option.

I have two different CSV files (I know what is the names of the columns headlines)- old.csv and new.csv. they are with a different number of rows and columns. the CSV files can include numbers, strings, chars, and special characters. For example:

old.csv

 round date first second third fourth fifth sixth 1 2 2021.04 2 45e69 10 16 4565 37 2 3 2021.04 4 15 456as df924 35 4N320 4 5 2021.03 4 43!d9 23 26 29 33

new.csv

 round date first second third fourth fifth sixth 0 1 2021.04 1 14 15 24 40 41 1 2 2021.04 2 45e69 10 16 4565 37 2 3 2021.04 4 15 456as df924 35 4N320 3 4 2021.03 10 11 20 21 24325 41 5 6 2021.03 4321 9 2#@6 28 34350 41

Line 1 and 2 from both CSV are the same.

Now, I want to print out the difference between new.csv compared to old.csv. I want to print out only the new lines in new.csv file. like that:

result.csv

 round date first second third fourth fifth sixth 0 1 2021.04 1 14 15 24 40 41 3 4 2021.03 10 11 20 21 24325 41 5 6 2021.03 4321 9 2#@6 28 34350 41

Counting the lines was for easier reading. I want the headline columns will be kept in the result.csv file.

The option I think that might help here is using keywords. For example, comparing the two CSV files by columns "first" and "fifth"- if one of the columns is not the same, print that in the result.csv file.

Is there anything you have tried so far? Please post your own efforts to solve the task so far. (How do I ask a good question?) — Felix
– Felix, Commented Jul 12, 2022 at 15:01
What is not working with option one? (Except that it is not finding lines that are in file one but not in file one.) — Felix
– Felix, Commented Jul 12, 2022 at 15:11

Felix · Accepted Answer · 2022-07-13 10:05:49Z

You seperately need to check each column you want to compare. The following code is one option how to do this.

import pandas as pd def getLinesOnlyInA(dfa, dfb, result): # iterate over all lines in file a for index1, row1 in dfa.iterrows(): aLineIsEqual = False # iterate over all lines in file b for index2, row2 in dfb.iterrows(): thisLineIsDifferent = False # for each column, check if they are different for column in columns: if row1[column] != row2[column]: thisLineIsDifferent = True # ionly continue when the fields are the same break if not thisLineIsDifferent: aLineIsEqual = True # only continue when no equal line was found break # when no equal line was found, add that line to the result if not aLineIsEqual: result.append(row1) df1 = pd.read_csv('file1.csv') df2 = pd.read_csv('file2.csv') columns = ['round', 'first', 'fifth'] # columns to be compared results = [] getLinesOnlyInA(df1, df2, results) # find all lines only existing in file 1 getLinesOnlyInA(df2, df1, results) # find all lines only existing in file 2 dfResult = pd.DataFrame(results) # cast all lines into a dataframe print(dfResult.to_string()) dfResult.to_csv('result.csv', sep=',')

Result:

 round date first second third fourth fifth sixth 2 5 2021.03 4 43!d9 23 26 29 33 0 1 2021.04 1 14 15 24 40 41 3 4 2021.03 10 11 20 21 24325 41 4 6 2021.03 4321 9 2#@6 28 34350 41

In case one file contains less columns (eg. file1 does not contain the date column), this still works and results in:

 round first second third fourth fifth sixth date 2 5 4 43!d9 23 26 29 33 NaN 0 1 1 14 15 24 40 41 2021.04 3 4 10 11 20 21 24325 41 2021.03 4 6 4321 9 2#@6 28 34350 41 2021.03

svfat · Accepted Answer · 2022-07-12 15:19:24Z

There is a library you can find useful: csv-diff

You can use it from command line:

csv-diff one.csv two.csv --key=id --json

or as a python import

from csv_diff import load_csv, compare diff = compare( load_csv(open("one.csv"), key="id"), load_csv(open("two.csv"), key="id") )

Hi @svfat, thanks for your answer, but I do not want to use the diff option.

Ashgabyte · Accepted Answer · 2022-07-13 09:30:42Z

0

Use sets instead of lists after reading two files:

fileone = set(fileone) filetwo = set(filetwo)

then in result you can get a symmetric difference - all elements of both sets except elements that are in both fileone and filetwo sets.

result = fileone.symmetric_difference(filetwo)

Attention: header row is not included in result because it is the same in both sets

Update: filetwo minus fileone:

result = filetwo.difference(fileone)

edited Jul 13, 2022 at 9:30

answered Jul 12, 2022 at 15:40

Ashgabyte

1241 silver badge5 bronze badges

3 Comments

Newborn Over a year ago

Hi, it is working well, but I have fixed my question, I want it to print filetwo-(minus)fileone- get only the new results from filetwo after reduce the same results from fileone.

Felix Over a year ago

Could you please clarify what you mean by "filetwo-(minus)fileone". Its getting pretty confusing here. Do you want to yield every line that exists in file two but not in file one? (Which is what "filetwo-(minus)fileone" suggests.) Or do you want to yield every line that exists in either of the files but not in the other? (Which is what you actual question post suggests.) Or do you want to yield something completely differen?

Newborn Over a year ago

Hi @Felix, I have edited the question to be more clear.

Collectives™ on Stack Overflow

How to compare two different CSV files with different number of columns and rows by keyword?

3 Answers 3

Comments

1 Comment

3 Comments

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

1 Comment

3 Comments

Related