
I have two csv files named test1.csv and test2.csv and they both have a column named 'Name'. I would like to compare each row in this Name column between both files and output the ones that don't match to a third file. I have seen some examples using pandas, but none worked for my situation. Can anyone help me get a script going for this?

Test2 will be updated to include all values from test1 plus new values not included in test1 (which are the ones I want saved to a third file).

An example of what the columns look like is:

test1.csv:

Name     Number  Status
gfd454   456     Disposed
3v4fd    521     Disposed
th678iy  678     Disposed

test2.csv:

Name     Number  Status
gfd454   456     Disposed
3v4fd    521     Disposed
th678iy  678     Disposed
vb556h   665     Disposed
  • Are your two columns of equal length? Or is there some identifier which can be used to match them? Providing some sample data would be useful. Commented Feb 5, 2020 at 15:26
  • @Dan They are not of equal length, unfortunately. The column gets updated, so more are added in each new file. I would like for it to only include the new ones in the third file. Commented Feb 5, 2020 at 15:28
  • @pythonscrub Do you assume that test2.csv will add more names and you want to find them? Commented Feb 5, 2020 at 15:35
  • @Dan Added an example of the columns with the extra value that I would like moved to a third file, since it does not match the test1 file outputs. Commented Feb 5, 2020 at 15:36
  • @balderman Yes, test2 will include all values from test1 plus new ones, which are the ones I want saved to a third file. Commented Feb 5, 2020 at 15:37

3 Answers


See below.

The idea is to read the names into a Python set and find the new names by set subtraction.

1.csv:

Name Number
A    12
B    34
C    45

2.csv:

Name Number
A    12
B    34
C    45
D    77
Z    67

The code below will print {'D', 'Z'}, which are the new names.

def read_file_to_set(file_name):
    with open(file_name) as f:
        # skip the header row, keep the first whitespace-separated field of each line
        return set(l.strip().split()[0] for x, l in enumerate(f.readlines()) if x > 0)

set_1 = read_file_to_set('1.csv')
set_2 = read_file_to_set('2.csv')
new_names = set_2 - set_1
print(new_names)

19 Comments

this returned all the values and not just the new ones. What may be the reason for this?
@pythonscrub I think you are wrong. See my extended answer.
There is more than one column in the file, so maybe this is why it returns a lot more information? Can I set it to only check the column 'Name' in both files?
@pythonscrub Please share a few 'real' lines from file #1 and a few lines from file #2.
If the separator is whitespace, the code should work as is. If it is not, you should find out what it is and add it to the split() function call. Did you test the latest code against your files?
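To sidestep the separator question and restrict the comparison to the Name column, one option is the standard library's csv.DictReader, which keys each row by the header. A small sketch, assuming comma-separated files with a Name header (throwaway sample files stand in for the real test1.csv/test2.csv here):

```python
import csv

# throwaway sample files for illustration; in practice the files already exist
with open('test1.csv', 'w', newline='') as f:
    f.write('Name,Number,Status\ngfd454,456,Disposed\n3v4fd,521,Disposed\nth678iy,678,Disposed\n')
with open('test2.csv', 'w', newline='') as f:
    f.write('Name,Number,Status\ngfd454,456,Disposed\n3v4fd,521,Disposed\n'
            'th678iy,678,Disposed\nvb556h,665,Disposed\n')

def read_names(file_name):
    # DictReader maps each row to the header, so only the 'Name' field is used
    with open(file_name, newline='') as f:
        return {row['Name'] for row in csv.DictReader(f)}

new_names = read_names('test2.csv') - read_names('test1.csv')
print(new_names)  # -> {'vb556h'}
```

Because sets ignore ordering, this also works when the new names are not appended at the end.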

This answer assumes that the data is lined up as in your example:

import pandas as pd

# "read" each file
df1 = pd.DataFrame({'Name': ['gfd454', '3v4fd', 'th678iy']})
df2 = pd.DataFrame({'Name': ['gfd454', '3v4fd', 'th678iy', 'fdvs']})

# make column names unique
df1 = df1.rename(columns={'Name': 'Name1'})
df2 = df2.rename(columns={'Name': 'Name2'})

# line them up next to each other
df = pd.concat([df1, df2], axis=1)

# get difference
diff = df[df['Name1'].isnull()]['Name2']  # or df[df['Name1'] != df['Name2']]['Name2']

# write
diff.to_csv('test3.csv')

7 Comments

The thing is that new ones will constantly be added to test2, so I can't make a hard-coded data frame.
Yes, replace the pd.DataFrame calls with pd.read_csv. I just created the dummy data so that the code runs.
This works fine if the added values are added at the end of the column. Is there any way to detect the added value even if it's somewhere random in the column?
In that case you should probably use a set() method.
Can you provide me an example of this? Sorry for the hassle; I am new to this and want to get a better understanding of it all.
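One order-independent sketch, assuming both files have a Name column: filter df2 to the rows whose Name never appears in df1, using isin. The DataFrames below are stand-ins for pd.read_csv on the two files, with the new name deliberately not at the end:

```python
import pandas as pd

# stand-ins for pd.read_csv('test1.csv') / pd.read_csv('test2.csv')
df1 = pd.DataFrame({'Name': ['gfd454', '3v4fd', 'th678iy']})
df2 = pd.DataFrame({'Name': ['vb556h', 'gfd454', 'th678iy', '3v4fd']})

# keep only rows of df2 whose Name does not occur anywhere in df1's Name column
diff = df2[~df2['Name'].isin(df1['Name'])]
print(diff)  # the single new row, vb556h
# diff.to_csv('test3.csv', index=False)
```

Unlike the positional concat comparison above, this does not care where in the column the new value appears.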

This should be straightforward - the solution assumes that the content of file2 is the same or longer, so items are only appended to file2.

import pandas as pd

df1 = pd.read_csv(r"C:\path\to\file1.csv")
df2 = pd.read_csv(r"C:\path\to\file2.csv")
# print(df1)
# print(df2)

df = pd.concat([df1, df2], axis=1)
df['X'] = df['A'] == df['B']
print(df[df.X == False])

df3 = df[df.X == False]['B']
print(df3)
df3.to_csv(r"C:\path\to\file3.csv")

If the items are in arbitrary order, you could use df.isin() as follows:

import pandas as pd

df1 = pd.read_csv(r"C:\path\to\file1.csv")
df2 = pd.read_csv(r"C:\path\to\file2.csv")

df = pd.concat([df1, df2], axis=1)
df['X'] = df['B'].isin(df['A'])
df3 = df[df.X == False]['B']
df3.to_csv(r"C:\path\to\file3.csv")

I have created the following two files for testing. file1.csv:

A
1_in_A
2_in_A
3_in_A
4_in_A

and file2.csv:

B
2_in_A
1_in_A
3_in_A
4_in_B
5_in_B

The dataframe df looks as follows:

|    | A      | B      | X     |
|---:|:-------|:-------|:------|
|  0 | 1_in_A | 2_in_A | True  |
|  1 | 2_in_A | 1_in_A | True  |
|  2 | 3_in_A | 3_in_A | True  |
|  3 | 4_in_A | 4_in_B | False |
|  4 | nan    | 5_in_B | False |

and we select only the items that are flagged as False.

2 Comments

Let's say the items are in arbitrary order. What would the code look like then?
I amended my answer to reflect your comment.
