
I have multiple dataframes and I want to get the result_df by concatenating multiple dataframes. Any suggestions on this?

I am using the following code, but I am getting this error: `Error tokenizing data. C error: Expected 1 fields in line 4, saw 5`

```python
list_ = []
if os.path.exists(csvfile):
    df = pd.read_csv(csvfile, sep=',', encoding='utf-8')
    list_.append(pd.concat(df))
frame = pd.concat(list_, ignore_index=True)
```

```
df1 =
   Apple  Banana
0      1       7
1      2      10
2      4       5
3      5       1
4      7       5

df2 =
   Apple  Banana  Carrot
0      1       7       5
1      2      10       8
2      4       5       8
3      5       1       2
4      7       5       1

df3 =
   Apple  Carrot  Mango
0      1       5      2
1      2       8      3
2      4       8      7
3      5       2      1
4      7       1      5

result_df =
    Apple Banana Carrot Mango
0       1      7    n.a   n.a
1       2     10    n.a   n.a
2       4      5    n.a   n.a
3       5      1    n.a   n.a
4       7      5    n.a   n.a
5       1      7      5   n.a
6       2     10      8   n.a
7       4      5      8   n.a
8       5      1      2   n.a
9       7      5      1   n.a
10      1    n.a      5     2
11      2    n.a      8     3
12      4    n.a      8     7
13      5    n.a      2     1
14      7    n.a      1     5
```
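For reference, once the three dataframes are loaded correctly, a plain `pd.concat` already produces the desired shape, with `NaN` where a frame lacks a column. A minimal sketch rebuilding the example data from the question:

```python
import pandas as pd

# Rebuild the three example dataframes shown in the question
df1 = pd.DataFrame({'Apple': [1, 2, 4, 5, 7], 'Banana': [7, 10, 5, 1, 5]})
df2 = pd.DataFrame({'Apple': [1, 2, 4, 5, 7], 'Banana': [7, 10, 5, 1, 5],
                    'Carrot': [5, 8, 8, 2, 1]})
df3 = pd.DataFrame({'Apple': [1, 2, 4, 5, 7], 'Carrot': [5, 8, 8, 2, 1],
                    'Mango': [2, 3, 7, 1, 5]})

# Columns absent from a frame are filled with NaN; sort=False keeps
# the columns in order of first appearance
result_df = pd.concat([df1, df2, df3], ignore_index=True, sort=False)
print(result_df)
```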

1 Answer


I believe you need `error_bad_lines=False`, and the first `concat` is not necessary:

```python
list_ = []
if os.path.exists(csvfile):
    list_.append(pd.read_csv(csvfile, sep=',', encoding='utf-8',
                             error_bad_lines=False))
frame = pd.concat(list_, ignore_index=True)
```
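Note that `error_bad_lines` was deprecated in pandas 1.3 and removed in 2.0 in favor of the `on_bad_lines` parameter. An equivalent sketch for newer pandas versions (the sample file here is written inline so the snippet is self-contained):

```python
import os
import pandas as pd

# Write the sample file from the answer (hypothetical path 'a.csv')
with open('a.csv', 'w', encoding='utf-8') as f:
    f.write('Apple,Banana\n1,7\n2,10\n4,5,20\n5,1,5,7\n7,5\n')

list_ = []
if os.path.exists('a.csv'):
    # on_bad_lines='skip' replaces error_bad_lines=False (pandas >= 1.3)
    list_.append(pd.read_csv('a.csv', sep=',', encoding='utf-8',
                             on_bad_lines='skip'))
frame = pd.concat(list_, ignore_index=True)
print(frame)
```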

Another solution:

```python
import glob

files = glob.glob('files/*.csv')
list_ = [pd.read_csv(fp, encoding='utf-8', error_bad_lines=False) for fp in files]
df = pd.concat(list_, ignore_index=True)
```

EDIT: To find the problematic rows, it is possible to compare the length of each row against the length of the header:

```python
import csv

with open('a.csv') as csv_file:
    reader = csv.reader(csv_file, delimiter=',')
    len_header = len(next(reader))
    for row in reader:
        if len(row) != len_header:
            print("Length of row is: %r" % len(row))
            print(row)
```

```
Length of row is: 3
['4', '5', '20']
Length of row is: 4
['5', '1', '5', '7']
```

```python
df = pd.read_csv('a.csv', warn_bad_lines=True, error_bad_lines=False)
print(df)
```

```
  Apple  Banana
0     1       7
1     2      10
2     7       5

b'Skipping line 4: expected 2 fields, saw 3\nSkipping line 5: expected 2 fields, saw 4\n'
```
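In pandas 1.4+ the same diagnostic can be done in one pass: `on_bad_lines` accepts a callable (with the python engine) that receives each malformed row as a list of strings, so the bad rows can be collected while loading. A sketch, again writing the sample file inline:

```python
import pandas as pd

# Same sample file as in the answer (hypothetical path 'a.csv')
with open('a.csv', 'w', encoding='utf-8') as f:
    f.write('Apple,Banana\n1,7\n2,10\n4,5,20\n5,1,5,7\n7,5\n')

bad_rows = []

def collect(row):
    # Called once per malformed row; returning None drops the row
    bad_rows.append(row)
    return None

# Callable on_bad_lines requires pandas >= 1.4 and engine='python'
df = pd.read_csv('a.csv', engine='python', on_bad_lines=collect)
print(bad_rows)
print(df)
```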

a.csv:

```
Apple,Banana
1,7
2,10
4,5,20
5,1,5,7
7,5
```

7 Comments

I am afraid that error_bad_lines may remove some data from my dataframe. Is there a way to check what those bad lines are? I am loading a file which has close to 1 million lines.
Not easy, but you can try this solution.
Thanks Jezrael. It works but i have to figure out what is getting removed.
@Sun - I am working on a solution for this, give me a sec.
Thank you Jezrael. Appreciate your help.
