I'm stuck with some poorly formatted CSV data that I need to read into a Pandas dataframe. I cannot change how the data is being recorded (it's coming from someplace else), so please no solutions suggesting that.

Most of the data is fine, but some rows have commas in the last column. A simplified example:

column1 is fine,column 2 is fine,column3, however, has commas in it! 

All rows should have the same number of columns (3), but this example of course breaks the CSV reader because the commas suggest there are 5 columns when in fact there are 3.

Notice that there is no quoting that would allow me to use the standard CSV reader tools to handle this problem.

What I do know, however, is that the extra comma(s) always occur in the last (rightmost) column. This means I can use a solution that boils down to:

"Always assume there are 3 columns, counting from the left, and interpret all extra commas as string content within column 3". Or, worded differently, "Interpret the first two commas as column separators, but assume any subsequent commas are just part of the string in column 3."
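As a quick illustration of that rule, here is a minimal sketch using Python's built-in `str.split` with a split limit on the sample row above. The variable names are only for illustration.

```python
# The sample row from the question: the third column contains extra commas.
line = "column1 is fine,column 2 is fine,column3, however, has commas in it!"

# A maxsplit of 2 performs at most two splits, counting from the left,
# so the result has at most three fields and later commas stay in the last one.
fields = line.split(",", 2)

for i, field in enumerate(fields, start=1):
    print(i, repr(field))
```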

I can think of plenty of kludgy ways to accomplish this, but my question is: Is there any elegant, concise way of addressing this, preferably within my call to pandas.read_csv(...)?

  • If you can't find a way, try loading the file into a list of lists using Python's file reader and then importing that into pandas. I know this is not a real solution, but I would use it as a workaround. Commented Jun 11, 2014 at 13:38
  • Can you expand on this method a bit? Commented Jun 11, 2014 at 13:58
  • What produces this? I am hitting the same thing. Somebody wrote something bad a long time ago. Commented Mar 4, 2021 at 23:24
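To expand on the list-of-lists workaround suggested in the first comment, a rough sketch might look like the following. The helper name `read_rows`, its `n_cols` parameter, and the in-memory sample data are hypothetical and only stand in for the real file.

```python
import io


def read_rows(lines, n_cols=3):
    """Split each line into at most n_cols fields, counting from the left."""
    rows = []
    for line in lines:
        line = line.rstrip("\r\n")  # drop the line ending before splitting
        if not line:
            continue  # skip blank lines
        # n_cols - 1 splits yield at most n_cols fields; extra commas
        # remain part of the last field.
        rows.append(line.split(",", n_cols - 1))
    return rows


# In-memory stand-in for the broken file (assumed layout: no header row).
sample = io.StringIO(
    "a,b,c\n"
    "column1 is fine,column 2 is fine,column3, however, has commas in it!\n"
)
rows = read_rows(sample)
print(rows)
```

With a real file, the open file object can be passed in place of `sample`, and the resulting list of lists can be handed to `pandas.DataFrame(rows, columns=[...])` with column names of your choosing. Rows with fewer than two commas would come back shorter than `n_cols`, so this assumes only the last column is ever malformed.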

1 Answer

Fix the csv, then proceed normally:

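    import csv

    # Python 3: open both files in text mode with newline='' as the csv module expects.
    with open('path/to/broken.csv', newline='') as f, \
         open('path/to/fixed.csv', 'w', newline='') as g:
        writer = csv.writer(g, delimiter=',')
        for line in f:
            # Split on the first two commas only; later commas stay in column 3.
            row = line.rstrip('\r\n').split(',', 2)
            # csv.writer quotes any field that contains a comma.
            writer.writerow(row)

    import pandas as pd

    df = pd.read_csv('path/to/fixed.csv')
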
1 Comment

This is good, but note that the line row = line.split(',', 3) should be row = line.split(',', 2). The second argument to split is the maximum number of splits to make, not the number of columns.