Parsing CSV file in pandas with commas

Question

I need to create a pandas.DataFrame from a CSV file. For that I am using pandas.read_csv(...). The problem with this file is that one or more columns contain commas within the values (I don't control the file format). I have been trying to implement the solution from this question, but I get the following error:

pandas.errors.EmptyDataError: No columns to parse from file

For some reason after implementing this solution the csv file I tried fixing is blank.

Here is the code I am using:

    # fix csv file
    with open("/Users/username/works/test.csv", 'rb') as f,\
            open("/Users/username/works/test.csv", 'wb') as g:
        writer = csv.writer(g, delimiter=',')
        for line in f:
            row = line.split(',', 4)
            writer.writerow(row)

    # Manipulate csv file
    data = pd.read_csv(os.path.expanduser\
        ("/Users/username/works/test.csv"), error_bad_lines=False)

Any ideas?

Data overview:

    Id0  Id 1  Id 2  Country  Company  Title  Email
    23   123   456   AR       name     cargador [email protected]
    24   123   456   AR       name     Executive assistant [email protected]
    25   123   456   AR       name     Asistente Administrativo [email protected]
    26   123   456   AR       name     Atención al cliente vía telefónica vía online [email protected]
    39   123   456   AR       name     Asesor de ventas [email protected]
    40   123   456   AR       name     inc. International company representative [email protected]
    41   123   456   AR       name     Vendedor de campo [email protected]
    42   123   456   AR       name     PUBLICIDAD ATENCIÓN AL CLIENTE [email protected]
    43   123   456   AR       name     Asistente de Marketing [email protected]
    44   123   456   AR       name     SOLDADOR [email protected]
    217  123   456   AR       name     Se requiere vendedores Loja Quevedo Guayas) [email protected]
    218  123   456   AR       name     Ing. Civil recién graduado Yaruquí [email protected]
    219  123   456   AR       name     ayudantes enfermeria [email protected]
    220  123   456   AR       name     Trip Leader for International Youth Exchange [email protected]
    221  123   456   AR       name     COUNTRY MANAGER / DIRECTOR COMERCIAL [email protected]
    250  123   456   AR       name     Ayudante de Pasteleria [email protected] Asesor [email protected] [email protected]

Pre-parsed CSV:

    #,Id 1,Id 2,Country,Company,Title,Email,,,,
    23,123,456,AR,name,cargador,[email protected],,,,
    24,123,456,AR,name,Executive assistant,[email protected],,,,
    25,123,456,AR,name,Asistente Administrativo,[email protected],,,,
    26,123,456,AR,name,Atención al cliente vía telefónica , vía online,[email protected],,,
    39,123,456,AR,name,Asesor de ventas,[email protected],,,,
    40,123,456,AR,name, inc.,International company representative,[email protected],,,
    41,123,456,AR,name,Vendedor de campo,[email protected],,,,
    42,123,456,AR,name,PUBLICIDAD, ATENCIÓN AL CLIENTE,[email protected],,,
    43,123,456,AR,name,Asistente de Marketing,[email protected],,,,
    44,123,456,AR,name,SOLDADOR,[email protected],,,,
    217,123,456,AR,name,Se requiere vendedores,, Loja , Quevedo, Guayas),[email protected]
    218,123,456,AR,name,Ing. Civil recién graduado, Yaruquí,[email protected],,,
    219,123,456,AR,name,ayudantes enfermeria,[email protected],,,,
    220,123,456,AR,name,Trip Leader for International Youth Exchange,[email protected],,,,
    221,123,456,AR,name,COUNTRY MANAGER / DIRECTOR COMERCIAL,[email protected],,,,
    250,123,456,AR,name,Ayudante de Pasteleria,[email protected], Asesor,[email protected],[email protected],
    251,123,456,AR,name,Ejecutiva de Ventas,[email protected],,,,

I think @ChihebNexus was asking for the pre-parsed CSV data so we can see how to properly parse it.
– elPastor, Commented May 22, 2017 at 21:42
I see. If that's the case, it looks like someone is giving you bad data to work with. Is that the point, that you're being asked to clean it up? The reason I ask is that row #217 has the email at the end of the line, instead of four columns from the right. Can you get the data any closer to the source? Typically a string that may contain commas would be surrounded by quotes: "bla,bla,bla"
– elPastor, Commented May 22, 2017 at 22:01
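
As a rough illustration of the quoting convention mentioned in the comment above, the sketch below uses only the standard csv module. The sample row and the address user@example.com are made up for the example; nothing here is taken from the real file.

```python
import csv
import io

# Hypothetical sample row: the Title value contains a comma.
row = ["42", "123", "456", "AR", "name",
       "PUBLICIDAD, ATENCIÓN AL CLIENTE", "user@example.com"]

# csv.writer defaults to QUOTE_MINIMAL, so only fields that need it get quoted.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
line = buffer.getvalue()
print(line)

# Reading the quoted line back yields the same seven fields, comma intact.
parsed = next(csv.reader(io.StringIO(line)))
print(parsed == row)
```

If the file could be exported this way at the source, pandas.read_csv (whose default quote character is the double quote) should be able to read it without any custom repair step.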

Stephen Rauch · Accepted Answer · 2017-05-23 00:55:34Z

If you can assume that any commas in the Company column are followed by spaces, and that all of the remaining errant commas are in the column just before the e-mail address, then a small parser can be written to handle the file.

Code:

    import csv
    import re

    VALID_EMAIL = re.compile(r'[^@]+@[^@]+\.[^@]+')


    def read_my_csv(file_handle):
        # build csv reader
        reader = csv.reader(file_handle)

        # get the header, and find the e-mail and title columns
        header = next(reader)
        email_column = header.index('Email')
        title_column = header.index('Title')

        # yield the header up to the e-mail column
        yield header[:email_column+1]

        # for each row, go through rebuild columns
        for row in reader:

            # for each row, put the Company column back together
            while row[title_column].startswith(' '):
                row[title_column-1] += ',' + row[title_column]
                del row[title_column]

            # for each row, put the Title column back together
            while not VALID_EMAIL.match(row[email_column]):
                row[email_column-1] += ',' + row[email_column]
                del row[email_column]

            yield row[:email_column+1]

Test Code:

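    import pandas as pd

    # newline='' is what the csv module recommends for files passed to csv.reader;
    # the old 'rU' mode was removed in Python 3.11
    with open("test.csv", newline='') as f:
        generator = read_my_csv(f)
        columns = next(generator)
        # build the frame while the file is still open (the generator reads lazily)
        df = pd.DataFrame(generator, columns=columns)
    print(df)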

Results:

          #  Id 1  Id 2  Country     Company  \
    0    23   123   456       AR        name
    1    24   123   456       AR        name
    2    25   123   456       AR        name
    3    26   123   456       AR        name
    4    39   123   456       AR        name
    5    40   123   456       AR  name, inc.
    6    41   123   456       AR        name
    7    42   123   456       AR        name
    8    43   123   456       AR        name
    9    44   123   456       AR        name
    10  217   123   456       AR        name
    11  218   123   456       AR        name
    12  219   123   456       AR        name
    13  220   123   456       AR        name
    14  221   123   456       AR        name
    15  250   123   456       AR        name
    16  251   123   456       AR        name

                                                       Title              Email
    0                                           cargador  [email protected]
    1                                Executive assistant  [email protected]
    2                           Asistente Administrativo  [email protected]
    3    Atención al cliente vía telefónica , vía online  [email protected]
    4                                   Asesor de ventas  [email protected]
    5               International company representative  [email protected]
    6                                  Vendedor de campo  [email protected]
    7                    PUBLICIDAD, ATENCIÓN AL CLIENTE  [email protected]
    8                             Asistente de Marketing  [email protected]
    9                                           SOLDADOR  [email protected]
    10  Se requiere vendedores,, Loja , Quevedo, Guayas)  [email protected]
    11               Ing. Civil recién graduado, Yaruquí  [email protected]
    12                              ayudantes enfermeria  [email protected]
    13      Trip Leader for International Youth Exchange  [email protected]
    14              COUNTRY MANAGER / DIRECTOR COMERCIAL  [email protected]
    15                            Ayudante de Pasteleria  [email protected]
    16                               Ejecutiva de Ventas  [email protected]
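
A side note on the EmptyDataError in the question: the "fix" step there opens the same path for reading and for writing, and opening a file in write mode truncates it immediately, so the loop has nothing to read and the file ends up blank. The parser above sidesteps this by never rewriting the file. If the repaired data does need to be saved back, a minimal sketch like the following may help; it writes to a temporary file and swaps it in afterwards. The helper name rewrite_in_place, the file names, and the demo contents are hypothetical and not from the question.

```python
import os
import tempfile


def rewrite_in_place(path, transform):
    """Apply transform to every line of path without truncating it first."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives next to the target so os.replace stays on one filesystem.
    with open(path, newline='', encoding='utf-8') as src, \
            tempfile.NamedTemporaryFile('w', newline='', encoding='utf-8',
                                        dir=directory, delete=False) as dst:
        for line in src:
            dst.write(transform(line))
    # Both files are closed here; swap the finished copy into place.
    os.replace(dst.name, path)


demo_dir = tempfile.mkdtemp()

# 1) What went wrong: the 'w' open empties the file before anything is read.
blank_path = os.path.join(demo_dir, "blank.csv")  # hypothetical file
with open(blank_path, 'w', newline='', encoding='utf-8') as f:
    f.write("#,Title\n1,a\n")
with open(blank_path, newline='', encoding='utf-8') as f, \
        open(blank_path, 'w', newline='', encoding='utf-8') as g:
    lines_seen = f.readlines()
print(lines_seen)  # nothing left to read

# 2) A safer pattern: write elsewhere, then replace.
demo_path = os.path.join(demo_dir, "test.csv")  # hypothetical file
with open(demo_path, 'w', newline='', encoding='utf-8') as f:
    f.write("#,Title\n1,a\n")
rewrite_in_place(demo_path, str.upper)  # str.upper stands in for a real repair
with open(demo_path, newline='', encoding='utf-8') as f:
    result = f.read()
print(result)
```

Writing to a differently named output file would work just as well; the temporary-file swap is only one option when the original name has to be kept.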

Thanks, this works! Is there any chance the script can also catch cases where the Company column gets mixed with the Title? An example of this is index 5. It should be the following: company: name inc., title: International company representative. Thanks!
@David, since the , is the field delimiter, you would need to come up with some logic that lets the code distinguish between company and title. I just added a bit to fix the company by assuming that fields with leading spaces belong to the previous column. The heuristic I proposed above is obviously simplistic. This goes to psheps's point in the comments above: data cleaning is almost always a bit of an art, as the way to do it depends heavily on what is wrong with the data.
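
One thing the heuristic takes for granted is that every row contains something that looks like an e-mail. As far as I can tell, a row without one would push the while loop past the end of the list and raise an IndexError. A hedged sketch of a guarded variant follows. It covers only the Title/Email repair (not the Company one), and the name repair_rows, the (row, ok) return shape, and the sample data are assumptions made for illustration.

```python
import csv
import io
import re

VALID_EMAIL = re.compile(r'[^@]+@[^@]+\.[^@]+')


def repair_rows(reader, email_column):
    """Yield (row, ok) pairs.

    Stray Title cells are merged until an e-mail shows up in email_column.
    If none is found, the original row is yielded with ok=False so it can
    be inspected by hand instead of raising IndexError.
    """
    for row in reader:
        fixed = list(row)  # work on a copy so the original stays intact
        while (len(fixed) > email_column
               and not VALID_EMAIL.match(fixed[email_column])):
            fixed[email_column - 1] += ',' + fixed[email_column]
            del fixed[email_column]
        if len(fixed) > email_column:
            yield fixed[:email_column + 1], True
        else:
            yield row, False


# Hypothetical sample: the second data row has no e-mail at all.
sample = io.StringIO(
    "#,Title,Email,,\n"
    "1,PUBLICIDAD, ATENCIÓN,a@example.com,,\n"
    "2,no email here,,\n"
)
reader = csv.reader(sample)
header = next(reader)
email_column = header.index('Email')

results = list(repair_rows(reader, email_column))
for row, ok in results:
    print(ok, row)
```

Rows flagged with ok=False could be logged or written to a separate file for manual cleanup, which fits the point that the right repair depends on what is wrong with the data.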
