2

Without using the diff option.

I have two different CSV files (I know what is the names of the columns headlines)- old.csv and new.csv. they are with a different number of rows and columns. the CSV files can include numbers, strings, chars, and special characters. For example:

old.csv

 round date first second third fourth fifth sixth 1 2 2021.04 2 45e69 10 16 4565 37 2 3 2021.04 4 15 456as df924 35 4N320 4 5 2021.03 4 43!d9 23 26 29 33 

new.csv

 round date first second third fourth fifth sixth 0 1 2021.04 1 14 15 24 40 41 1 2 2021.04 2 45e69 10 16 4565 37 2 3 2021.04 4 15 456as df924 35 4N320 3 4 2021.03 10 11 20 21 24325 41 5 6 2021.03 4321 9 2#@6 28 34350 41 

Line 1 and 2 from both CSV are the same.

Now, I want to print out the difference between new.csv compared to old.csv. I want to print out only the new lines in new.csv file. like that:

result.csv

 round date first second third fourth fifth sixth 0 1 2021.04 1 14 15 24 40 41 3 4 2021.03 10 11 20 21 24325 41 5 6 2021.03 4321 9 2#@6 28 34350 41 

Counting the lines was for easier reading. I want the headline columns will be kept in the result.csv file.

The option I think that might help here is using keywords. For example, comparing the two CSV files by columns "first" and "fifth"- if one of the columns is not the same, print that in the result.csv file.

4
  • Is there anything you have tried so far? Please post your own efforts to solve the task so far. (How do I ask a good question?) Commented Jul 12, 2022 at 15:01
  • @Felix, I have added few code examples Commented Jul 12, 2022 at 15:08
  • What is not working with option one? (Except that it is not finding lines that are in file one but not in file one.) Commented Jul 12, 2022 at 15:11
  • Yes, I'm having a tough time with that. Commented Jul 13, 2022 at 5:52

3 Answers 3

1

You seperately need to check each column you want to compare. The following code is one option how to do this.

import pandas as pd def getLinesOnlyInA(dfa, dfb, result): # iterate over all lines in file a for index1, row1 in dfa.iterrows(): aLineIsEqual = False # iterate over all lines in file b for index2, row2 in dfb.iterrows(): thisLineIsDifferent = False # for each column, check if they are different for column in columns: if row1[column] != row2[column]: thisLineIsDifferent = True # ionly continue when the fields are the same break if not thisLineIsDifferent: aLineIsEqual = True # only continue when no equal line was found break # when no equal line was found, add that line to the result if not aLineIsEqual: result.append(row1) df1 = pd.read_csv('file1.csv') df2 = pd.read_csv('file2.csv') columns = ['round', 'first', 'fifth'] # columns to be compared results = [] getLinesOnlyInA(df1, df2, results) # find all lines only existing in file 1 getLinesOnlyInA(df2, df1, results) # find all lines only existing in file 2 dfResult = pd.DataFrame(results) # cast all lines into a dataframe print(dfResult.to_string()) dfResult.to_csv('result.csv', sep=',') 

Result:

 round date first second third fourth fifth sixth 2 5 2021.03 4 43!d9 23 26 29 33 0 1 2021.04 1 14 15 24 40 41 3 4 2021.03 10 11 20 21 24325 41 4 6 2021.03 4321 9 2#@6 28 34350 41 

In case one file contains less columns (eg. file1 does not contain the date column), this still works and results in:

 round first second third fourth fifth sixth date 2 5 4 43!d9 23 26 29 33 NaN 0 1 1 14 15 24 40 41 2021.04 3 4 10 11 20 21 24325 41 2021.03 4 6 4321 9 2#@6 28 34350 41 2021.03 
Sign up to request clarification or add additional context in comments.

Comments

1

There is a library you can find useful: csv-diff

You can use it from command line:

csv-diff one.csv two.csv --key=id --json 

or as a python import

from csv_diff import load_csv, compare diff = compare( load_csv(open("one.csv"), key="id"), load_csv(open("two.csv"), key="id") ) 

1 Comment

Hi @svfat, thanks for your answer, but I do not want to use the diff option.
0

Use sets instead of lists after reading two files:

fileone = set(fileone) filetwo = set(filetwo) 

then in result you can get a symmetric difference - all elements of both sets except elements that are in both fileone and filetwo sets.

result = fileone.symmetric_difference(filetwo) 

Attention: header row is not included in result because it is the same in both sets

Update: filetwo minus fileone:

result = filetwo.difference(fileone) 

3 Comments

Hi, it is working well, but I have fixed my question, I want it to print filetwo-(minus)fileone- get only the new results from filetwo after reduce the same results from fileone.
Could you please clarify what you mean by "filetwo-(minus)fileone". Its getting pretty confusing here. Do you want to yield every line that exists in file two but not in file one? (Which is what "filetwo-(minus)fileone" suggests.) Or do you want to yield every line that exists in either of the files but not in the other? (Which is what you actual question post suggests.) Or do you want to yield something completely differen?
Hi @Felix, I have edited the question to be more clear.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.