Pandas error reading csv with double quotes

Question

I've read all related topics - like this, this and this - but couldn't get a solution to work.

I have an input csv file like this:

ItemId,Content
i0000008,{"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
i0000010,{"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}

I've tried several different approaches but couldn't get it to work. I want to read this csv file into a Dataframe like this:

ItemId    Content
--------  -------------------------------------------------------------------------------
i0000008  {"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
i0000010  {"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}

I'm using the following code (Python 3.9):

df = pd.read_csv('test.csv', sep=',', skipinitialspace = True, quotechar = '"')

As far as I understand, the commas inside the dictionary column, including those inside quotation marks, are being treated as regular separators, so it raises the following error:

pandas.errors.ParserError: Error tokenizing data. C error: Expected 4 fields in line 3, saw 6
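To make the field counts in this message concrete, here is a minimal sketch that runs the sample rows through Python's standard csv module and counts the fields per line. It assumes the csv module's default dialect splits these lines the same way pandas does here, which is not guaranteed in general. The names sample and field_counts are illustrative only.

```python
import csv
import io

# Sample data copied from the question
sample = (
    'ItemId,Content\n'
    'i0000008,{"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}\n'
    'i0000010,{"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}\n'
)

# A quote only protects commas when it opens a field. Each Content value
# opens with "{", so every comma inside it still ends a field.
field_counts = [len(row) for row in csv.reader(io.StringIO(sample))]
print(field_counts)
```

The counts for the two data lines can be compared with the 4 and 6 in the error message: the second data line has two extra commas inside its title.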

Is it possible to produce the desired result? Thanks.

RJ Adriaansen · Accepted Answer · 2021-07-22 20:26:08Z


The problem is that the commas in the Content column are interpreted as separators. You can solve this by using pd.read_fwf, which splits each line at fixed character positions instead of at a delimiter:

df = pd.read_fwf('test.csv', colspecs=[(0, 8),(9,100)], header=0, names=['ItemId', 'Content'])

Result:

	ItemId	Content
0	i0000008	{"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
1	i0000010	{"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}

edited Jul 22, 2021 at 20:26

answered Jul 22, 2021 at 19:22


6 Comments

ThePyGuy Over a year ago

This doesn't solve the problem, and gives only the first key,value in the dictionary for each row

thatOldITGuy Over a year ago

@ThePyGuy is correct. It didn't produce the desired results, only first key,value was read into Content column.

RJ Adriaansen Over a year ago

@thatOldITGuy You're right. I have updated the answer with a much simpler solution.

MDR Over a year ago

Good solution above. After this the next step for the OP is to follow the steps to expand the JSON and the df is good.

thatOldITGuy Over a year ago

@RJAdriaansen it is a good solution, but the real CSV has a much larger (and variable length) JSON column, so I guess the (9,100) column spec would have to be guessed, right?
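On the variable-length concern in the last comment, one option that may avoid guessing the upper bound is to leave it open. The sketch below assumes that read_fwf accepts None as the end of a column spec and that ItemId is always exactly 8 characters wide. The third row is synthetic, added only to exceed 100 characters.

```python
import io
import pandas as pd

# First two data rows copied from the question; the third is synthetic
# and exists only to be longer than 100 characters.
long_row = 'i0000012,{"Title":"' + 'x' * 120 + '"}'
sample = (
    'ItemId,Content\n'
    'i0000008,{"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}\n'
    'i0000010,{"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}\n'
    + long_row + '\n'
)

# (9, None) means "from character 9 to the end of the line",
# so the Content width does not have to be guessed.
df = pd.read_fwf(
    io.StringIO(sample),
    colspecs=[(0, 8), (9, None)],
    header=0,
    names=['ItemId', 'Content'],
)
print(df['Content'].str.len())
```

This still depends on the first column having a fixed width. If ItemId can vary in length, splitting on the first comma is the safer route.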


ThePyGuy · Accepted Answer · 2021-07-22 20:03:19Z

I don't think you'll be able to read it directly with pandas, because the delimiter appears multiple times within a single value. However, if you read the file with plain Python and do some processing, you should be able to convert it to a pandas DataFrame:

import pandas as pd

def splitValues(x):
    # Split on the first comma only; everything after it belongs to Content
    index = x.find(',')
    return x[:index], x[index+1:].strip()

with open('file.csv') as data:
    # The first line holds the column names
    columns = next(data).strip().split(',')
    df = pd.DataFrame(columns=columns, data=(splitValues(row) for row in data))

OUTPUT:

     ItemId  Content
0  i0000008  {"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
1  i0000010  {"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}
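One of the comments mentions expanding the JSON as the next step. A minimal sketch of that step follows, assuming every Content value is a valid JSON object. The DataFrame is rebuilt by hand here to match the output above, and the names expanded and result are illustrative only.

```python
import json
import pandas as pd

# Rebuilt by hand to match the output shown above
df = pd.DataFrame({
    'ItemId': ['i0000008', 'i0000010'],
    'Content': [
        '{"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}',
        '{"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}',
    ],
})

# Parse each JSON string into a dict, then build one column per key
expanded = pd.DataFrame(df['Content'].map(json.loads).tolist(), index=df.index)

# Put ItemId back next to the expanded columns
result = pd.concat([df[['ItemId']], expanded], axis=1)
print(result)
```

If some rows contain malformed JSON, json.loads will raise an error, so real data may need a try/except around the parsing step.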
