1

I've read all related topics - like this, this and this - but couldn't get a solution to work.

I have an input csv file like this:

ItemId,Content i0000008,{"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"} i0000010,{"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"} 

I've tried several different approaches but couldn't get it to work. I want to read this csv file into a Dataframe like this:

ItemId Content -------- ------------------------------------------------------------------------------- i0000008 {"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"} i0000010 {"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"} 

With following code (Python 3.9):

df = pd.read_csv('test.csv', sep=',', skipinitialspace = True, quotechar = '"') 

As far as I understand, commas inside dictionary column and commas inside quotation marks are being treated as regular separators, so it raises following error:

pandas.errors.ParserError: Error tokenizing data. C error: Expected 4 fields in line 3, saw 6 

Is it possible to produce desired result? Thanks.

2 Answers 2

2

The problem is that the comma's in the Content column are interpreted as separators. You can solve this by using pd.read_fwf to manually set the number of characters on which to split:

df = pd.read_fwf('test.csv', colspecs=[(0, 8),(9,100)], header=0, names=['ItemId', 'Content']) 

Result:

ItemId Content
0 i0000008 {"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
1 i0000010 {"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}
Sign up to request clarification or add additional context in comments.

6 Comments

This doesn't solve the problem, and gives only the first key,value in the dictionary for each row
@ThePyGuy is correct. It didn't produce the desired results, only first key,value was read into Content column.
@thatOldITGuy You're right. I have updated the answer with a much simpler solution.
Good solution above. After this the next step for the OP is to follow the steps to expand the JSON and the df is good.
@RJAdriaansen it is a good solution, but the real CSV has a much larger (and variable length) JSON column, so I guess the (9,100) column spec would have to be guessed, right?
|
0

I don't think you'll be able to read it normally with pandas because it has the delimiter used multiple times for a single value; however, reading it with python and doing some processing, you should be able to convert it to pandas dataframe:

def splitValues(x): index = x.find(',') return x[:index], x[index+1:].strip() import pandas as pd data = open('file.csv') columns = next(data) columns = columns.strip().split(',') df = pd.DataFrame(columns=columns, data=(splitValues(row) for row in data)) 

OUTPUT:

 ItemId Content 0 i0000008 {"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"} 1 i0000010 {"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"} 

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.