Pandas - How to compare 2 CSV files and output changes

Question

Situation: I have two CSV files, each about 10k rows by 140 columns, that are largely identical, and I need to identify the differences. The headers are exactly the same and the rows are almost the same (about 100 of the 10k might have changed).

Example

File1.csv

ID,FirstName,LastName,Phone1,Phone2,Phone3
1,Bob,Jones,5555555555,4444444444,3333333333
2,Jim,Hill,2222222222,1111111111,0000000000

File2.csv

ID,FirstName,LastName,Phone1,Phone2,Phone3
1,Bob, Jones,5555555555,4444455444,3333333333
2,Jim, Hill,2222222222,1155111111,0005500000
3,Kim, Grant,2173659851,3214569874,3698521471

Outputfile.csv
ID,FirstName,LastName,Phone1,Phone2,Phone3
1,Bob,Jones,5555555555,4444444444,3333333333
2,Jim,Hill,2222222222,1111111111,0005500000
3,Kim, Grant,2173659851,3214569874,3698521471

I think I want the output to be File2.csv with the changes from File1.csv highlighted somehow. I'm new to Python and pandas and can't figure out where to start. I searched Google for something similar to adapt to my needs, but the scripts I found were too specific to their own situations.

If someone knows of an easier/different way, I'm all ears. I don't care how this happens as long as I don't have to check record-by-record.

Are rows compared by order, or by the ID column? Are the columns guaranteed to be the same between file1 and file2?
– Dane White, Commented Nov 7, 2018 at 1:10
Thanks for the reply! Rows are compared by the ID column and the columns will be 100% the same.
– Chad Belerique, Commented Nov 7, 2018 at 18:18
I have posted a general answer. Can you upload the files so that I can be more specific?
– seralouk, Commented Nov 7, 2018 at 22:46

Dane White · Accepted Answer · 2018-11-08 20:16:16Z

CSV is plain text and can't carry fonts or formatting, but here's a solution that prints the diff to the console using bold and colored text (note: I only tested on Mac). If you're using Python 3.7+ (where dictionaries preserve insertion order), the OrderedDict and the separate columns list shouldn't be necessary.

from collections import OrderedDict
from csv import DictReader


class Color(object):
    GREEN = '\033[92m'
    RED = '\033[91m'
    BOLD = '\033[1m'
    END = '\033[0m'


def load_csv(file):
    # Index by ID in order, and keep track of the original column order
    with open(file, 'r') as fp:
        reader = DictReader(fp, delimiter=',')
        rows = OrderedDict((r['ID'], r) for r in reader)
        return rows, reader.fieldnames


def print_row(row, cols, color, prefix):
    print(Color.BOLD + color + prefix + ','.join(row[c] for c in cols) + Color.END)


def print_diff(row1, row2, cols):
    row = []
    for col in cols:
        value1 = row1[col]
        if row2[col] != value1:
            row.append(Color.BOLD + Color.GREEN + value1 + Color.END)
        else:
            row.append(value1)
    print(','.join(row))


def diff_csv(file1, file2):
    rows1, cols = load_csv(file1)
    rows2, _ = load_csv(file2)

    for row_id, row1 in rows1.items():
        # Pop the matching ID row
        row2 = rows2.pop(row_id, None)

        # If not in file2, then it was added
        if not row2:
            print_row(row1, cols, Color.GREEN, '+')
        # In both files, print the diff
        else:
            print_diff(row1, row2, cols)

    # Anything remaining from file2 was removed in file1
    for row in rows2.values():
        print_row(row, cols, Color.RED, '-')
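The note about Python 3.7+ can be illustrated with a small sketch. It assumes a plain `dict` in place of `OrderedDict`, and it reads from an inline sample (`SAMPLE`, a hypothetical stand-in for a file on disk) so it runs on its own. The helper name `load_rows` is illustrative, not part of the answer above.

```python
import csv
import io

# Hypothetical inline sample standing in for a CSV file on disk
SAMPLE = """ID,FirstName,LastName
2,Jim,Hill
1,Bob,Jones
"""


def load_rows(fp):
    """Index rows by ID using a plain dict (insertion-ordered in Python 3.7+)."""
    reader = csv.DictReader(fp)
    rows = {r["ID"]: r for r in reader}
    return rows, reader.fieldnames


rows, cols = load_rows(io.StringIO(SAMPLE))
print(list(rows))        # IDs in the order they appear in the file
print(list(rows["2"]))   # each row's keys follow the header order
```

Because each row's keys already follow the header order, a separate column list is mostly a convenience on these Python versions rather than a requirement.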

Abhishek Patel · Accepted Answer · 2018-11-07 22:37:27Z

This can be done with Python's built-in csv module. If you also care about the order of your entries, you can use an OrderedDict to maintain the original file order.

import csv

# Read both files, skipping blank lines
with open('file1.csv', newline='') as f1, open('file2.csv', newline='') as f2:
    rows1 = [row for row in csv.reader(f1, delimiter=',') if row]
    rows2 = [row for row in csv.reader(f2, delimiter=',') if row]

# For the first file, add every row, keyed by its first field (the ID)
merged = {row[0]: row for row in rows1}

for row in rows2:
    existing = merged.get(row[0])
    if existing is None:
        # For the second file, only add rows whose ID is not already present
        merged[row[0]] = row
    else:
        # Same ID in both files: append 'c' to every field that differs
        changed_indexes = [i for i, (a, b) in enumerate(zip(existing, row)) if a != b]
        for i in changed_indexes:
            existing[i] = existing[i] + 'c'

# Write the merged rows into another CSV
with open('results.csv', 'w', newline='') as f3:
    writer = csv.writer(f3, quoting=csv.QUOTE_ALL)
    writer.writerows(merged.values())

As for bolding, there is no way to do that in CSV, as csv files contain data, not any formatting information.
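Since the CSV itself cannot carry highlighting, one alternative is to record the changes as data. The following is a minimal sketch, assuming rows are matched on the `ID` column and that both files share the same header; the extra `Changed` column, the `NEW` marker, and the function name `annotate_changes` are hypothetical choices for illustration. Inline strings stand in for the two files.

```python
import csv
import io

# Hypothetical inline samples standing in for the old and new files
OLD = """ID,FirstName,Phone1
1,Bob,5555555555
2,Jim,2222222222
"""
NEW = """ID,FirstName,Phone1
1,Bob,5555555555
2,Jim,2222255222
3,Kim,2173659851
"""


def annotate_changes(old_fp, new_fp, out_fp, key="ID"):
    """Write the new file plus a 'Changed' column naming the fields that differ."""
    old_rows = {r[key]: r for r in csv.DictReader(old_fp)}
    reader = csv.DictReader(new_fp)
    cols = reader.fieldnames
    writer = csv.DictWriter(out_fp, fieldnames=cols + ["Changed"], lineterminator="\n")
    writer.writeheader()
    for row in reader:
        old = old_rows.get(row[key])
        if old is None:
            note = "NEW"  # ID only exists in the new file
        else:
            # Semicolon-separated names of the columns whose values changed
            note = ";".join(c for c in cols if row[c] != old[c])
        writer.writerow({**row, "Changed": note})


out = io.StringIO()
annotate_changes(io.StringIO(OLD), io.StringIO(NEW), out)
print(out.getvalue())
```

With real files, the `io.StringIO` objects would be replaced by files opened with `newline=''`. An empty `Changed` value means the row is identical in both files.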

Does this compare the file row to row? Because my rows can be different so I want to spit the differences in them. Thanks for the reply!
Yes, each entry in the t1 and t2 orderedDicts will be arrays
OK. Honestly, I don't know how to find the differences. I understand I'm supposed to do some of the work, so if you could point me in the right direction I'd be happy to look. Before commenting I googled but couldn't find how to compare the arrays and return the difference. What I found was row-by-row comparison, which won't work.
OK, so I looked at what you wanted, and I'll rewrite my code to do that, with comments so you can understand it better @ChadBelerique
Thanks @abhishek, I got it working but I can't tell which rows changed. Is there an indicator or something?

seralouk · Accepted Answer · 2018-11-07 22:45:33Z

A second way:

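import numpy as np
import pandas as pd

# Assumes both files contain the same IDs in the same order
df1 = pd.read_csv('file1.csv', index_col='ID')
df2 = pd.read_csv('file2.csv', index_col='ID')

# get indices of differences
difference_locations = np.where(df1 != df2)
rows, cols = difference_locations

# label each change by row ID and column name
changed_index = pd.MultiIndex.from_arrays([df1.index[rows], df1.columns[cols]], names=['ID', 'column'])

# define reference
changed_from = df1.values[difference_locations]
changed_to = df2.values[difference_locations]
df_differences = pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed_index)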
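The element-wise comparison above needs both frames to carry the same row and column labels, which the sample files in the question do not (File2.csv has an extra row). A minimal sketch of aligning on `ID` first, assuming IDs are unique and the columns appear in the same order in both files; the function name `diff_by_id` and the inline sample data are hypothetical.

```python
import io

import pandas as pd

# Hypothetical inline samples standing in for file1.csv and file2.csv
FILE1 = """ID,FirstName,LastName,Phone1
1,Bob,Jones,5555555555
2,Jim,Hill,2222222222
"""
FILE2 = """ID,FirstName,LastName,Phone1
1,Bob,Jones,5555555555
2,Jim,Hill,2222255222
3,Kim,Grant,2173659851
"""


def load(text):
    # Read everything as strings so values compare exactly as written in the file
    return pd.read_csv(io.StringIO(text), dtype=str, keep_default_na=False).set_index("ID")


def diff_by_id(df1, df2):
    """Return (changes, added_ids, removed_ids) for two frames indexed by ID."""
    common = df1.index.intersection(df2.index)
    old, new = df1.loc[common], df2.loc[common]
    # Boolean mask of differing cells among the rows present in both files
    rows, cols = (old != new).to_numpy().nonzero()
    changes = pd.DataFrame({
        "ID": common[rows],
        "column": old.columns[cols],
        "from": old.to_numpy()[rows, cols],
        "to": new.to_numpy()[rows, cols],
    })
    added = list(df2.index.difference(df1.index))
    removed = list(df1.index.difference(df2.index))
    return changes, added, removed


changes, added, removed = diff_by_id(load(FILE1), load(FILE2))
print(changes)
print("added:", added, "removed:", removed)
```

Restricting the comparison to the shared IDs keeps the mask well defined, and rows that exist in only one file are reported separately instead of raising an error.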
