Using pandas to compare large CSV files with different numbers of columns

Question

I am new to Python programming and I am trying to join two CSV files with different numbers of columns. The aim is to find missing records and create a report with specific columns from the master file.

An example of the two CSV files, copied directly from Excel:

SAMPLE CSV 1 (combine201709.csv)

start_time                  end_time                    aitechid   hh_village  grpdetails1/farmername  grpdetails1/farmermobile
2016-11-26T14:01:47.329+03  2016-11-26T14:29:05.042+03  AI00001    2447        KahsuGebru              919115604
2016-11-26T19:34:42.159+03  2016-11-26T20:39:27.430+03  936891238  2473        Moto Aleka              914370833
2016-11-26T12:13:23.094+03  2016-11-26T14:25:19.178+03  914127382  2390        Hagos                   914039654
2016-11-30T14:31:28.223+03  2016-11-30T14:56:33.144+03  920784222

SAMPLE CSV 2 (combinedmissingrecords.csv)

farmermobile
941807851
946741296
9
920212218
915
939555303
961579437
919961811
100004123
972635273
918166831
961579437
922882638
100006273
919728710
30000739
920770648
100004727
963767487
915855665
932255143
923531603
0
931875236
918027506
8
916353266
918020303
924359729
934623027
916585963
960791618
988047183
100002632
300007241
918271897
300007238
918250712

I tried this, but was unable to get the expected output:

    import pandas as pd
    normalize = lambda x: "%.4f" % float(x)  # round
    df = pd.read_csv("/media/dmogaka/DATA/week progress/week4/combine201709.csv", index_col=(0,1), usecols=(1, 2, 3,4),
                     header=None, converters=dict.fromkeys([1,2]))
    df2 = pd.read_csv("/media/dmogaka/DATA/week progress/week4/combinedmissingrecords.csv",
                      index_col=(0,1), usecols=(0),
                      header=None, converters=dict.fromkeys([1,2]))
    result = df2.merge(df[['aitechid','grpdetails1/farmermobile','grpdetails1/farmername']],
                       left_on='farmermobile', right_on='grpdetails1/farmermobile')
    result.to_csv("/media/dmogaka/DATA/week progress/week4/output.csv", header=None)  # write as csv

Error message:

    /usr/bin/python3.5 "/media/dmogaka/DATA/Panda tut/test/test.py"
    Traceback (most recent call last):
      File "/media/dmogaka/DATA/Panda tut/test/test.py", line 7, in <module>
        header=None, converters=dict.fromkeys([1,2]))
      File "/home/dmogaka/.local/lib/python3.5/site-packages/pandas/io/parsers.py", line 655, in parser_f
        return _read(filepath_or_buffer, kwds)
      File "/home/dmogaka/.local/lib/python3.5/site-packages/pandas/io/parsers.py", line 405, in _read
        parser = TextFileReader(filepath_or_buffer, **kwds)
      File "/home/dmogaka/.local/lib/python3.5/site-packages/pandas/io/parsers.py", line 764, in __init__
        self._make_engine(self.engine)
      File "/home/dmogaka/.local/lib/python3.5/site-packages/pandas/io/parsers.py", line 985, in _make_engine
        self._engine = CParserWrapper(self.f, **self.options)
      File "/home/dmogaka/.local/lib/python3.5/site-packages/pandas/io/parsers.py", line 1605, in __init__
        self._reader = parsers.TextReader(src, **kwds)
      File "pandas/_libs/parsers.pyx", line 461, in pandas._libs.parsers.TextReader.__cinit__ (pandas/_libs/parsers.c:4968)
    TypeError: 'int' object is not iterable

    Process finished with exit code 1

Possible duplicate of Comparing two pandas dataframes for differences
– MrE, Commented Sep 16, 2017 at 20:10
@MrE, I don't think it's a duplicate. If we have a different number of columns, assert_frame_equal will always raise an AssertionError.
– MaxU - stand with Ukraine, Commented Sep 16, 2017 at 20:16
Can you post two small (3-5 rows) reproducible sample data sets and your desired resulting data set? Please read how to make good reproducible pandas examples and edit your post correspondingly.
– MaxU - stand with Ukraine, Commented Sep 16, 2017 at 20:17
Just guessing: if you want to compare dataframes, they should have the same format. So the first step is to trim / adjust the format to get comparable DFs, then compare as per the other post.
– MrE, Commented Sep 16, 2017 at 20:19
@MrE, imagine that we want to see which rows are missing in the first DF that are present in the second one...
– MaxU - stand with Ukraine, Commented Sep 16, 2017 at 20:22
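
The case raised in the last comment (rows present in the second DF but missing from the first) can be sketched with a left merge and pandas' `indicator` flag. This is only an illustration: the frame names `first` and `second` and the sample numbers are made up for the example, and the keys are assumed to be stored as strings in both frames.

```python
import pandas as pd

# Hypothetical stand-ins for the two data sets; keys kept as strings.
first = pd.DataFrame({"farmermobile": ["919115604", "914370833"]})
second = pd.DataFrame({"farmermobile": ["914370833", "941807851", "946741296"]})

# how="left" keeps every row of `second`; indicator=True adds a "_merge"
# column recording whether each row also found a match in `first`.
flagged = second.merge(
    first.drop_duplicates(), on="farmermobile", how="left", indicator=True
)

# Rows marked "left_only" exist in `second` but not in `first`.
only_in_second = flagged.loc[flagged["_merge"] == "left_only", "farmermobile"]
print(only_in_second.tolist())
```

Dropping duplicates from `first` before merging keeps a repeated key there from multiplying rows in the result.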

MaxU - stand with Ukraine · Accepted Answer · 2017-09-16 21:58:36Z

Try this:

d2.merge(d1[['aitechid','grpdetails1/farmermobile','grpdetails1/farmername']], left_on='farmermobile', right_on='grpdetails1/farmermobile')

or

d2.merge(d1[['aitechid','grpdetails1/farmermobile','grpdetails1/farmername']] \
           .rename(columns={'grpdetails1/farmermobile':'farmermobile'}))
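
The merge above can be tried on a small in-memory example. The sketch below assumes that `d1` holds the master file and `d2` the list of mobile numbers. The rows are shortened from the question's sample, and the second list is invented so that exactly one number matches. Both key columns are kept as strings here, since mixing integer and string keys can make a merge fail or match nothing.

```python
import pandas as pd

# Hypothetical stand-in for the master file (combine201709.csv).
d1 = pd.DataFrame({
    "aitechid": ["AI00001", "936891238", "914127382"],
    "grpdetails1/farmername": ["KahsuGebru", "Moto Aleka", "Hagos"],
    "grpdetails1/farmermobile": ["919115604", "914370833", "914039654"],
})

# Hypothetical stand-in for the list of mobile numbers to look up.
d2 = pd.DataFrame({"farmermobile": ["914370833", "941807851"]})

cols = ["aitechid", "grpdetails1/farmermobile", "grpdetails1/farmername"]

# The default how="inner" keeps only numbers found in both frames.
report = d2.merge(
    d1[cols], left_on="farmermobile", right_on="grpdetails1/farmermobile"
)
print(report)
```

With the default inner join, numbers that do not occur in the master file are dropped from the report. Passing `how="left"` would keep them, with empty values in the master columns.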

I have tried your code, but I keep getting the error message above. @MaxU
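
The traceback in the question is raised inside `pd.read_csv`, before any merge runs, so changing the merge call alone would not be expected to remove it. One likely cause is `usecols=(0)`: parentheses without a trailing comma do not create a tuple, so the integer `0` is passed, and an integer is not iterable. Below is a minimal sketch of reading both files by header name instead. It uses in-memory stand-ins for the two CSV files and assumes the real files are comma-separated with a header row, as the samples suggest. The stand-in contents are shortened or invented for illustration.

```python
import io
import pandas as pd

# (0) is just the integer 0; a one-element tuple needs a trailing comma.
assert isinstance((0), int)
assert isinstance((0,), tuple)

# Hypothetical stand-ins for the two files on disk.
master_csv = io.StringIO(
    "start_time,end_time,aitechid,hh_village,"
    "grpdetails1/farmername,grpdetails1/farmermobile\n"
    "2016-11-26T14:01:47.329+03,2016-11-26T14:29:05.042+03,"
    "AI00001,2447,KahsuGebru,919115604\n"
    "2016-11-26T19:34:42.159+03,2016-11-26T20:39:27.430+03,"
    "936891238,2473,Moto Aleka,914370833\n"
)
missing_csv = io.StringIO("farmermobile\n914370833\n941807851\n")

# Select columns by name with a list, and read everything as strings
# so that the merge keys have the same dtype in both frames.
df = pd.read_csv(
    master_csv,
    usecols=["aitechid", "grpdetails1/farmername", "grpdetails1/farmermobile"],
    dtype=str,
)
df2 = pd.read_csv(missing_csv, usecols=["farmermobile"], dtype=str)

print(df.columns.tolist())
print(df2.columns.tolist())
```

Reading with the default header row (rather than `header=None`) is what makes the column names used in the merge available. With `header=None` the columns would be numbered and a lookup such as `df[['aitechid', ...]]` would fail.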
