
So I have a csv dataset that, as far as I can tell, is well formed, and I'm trying to get the pandas package to load it correctly. The header consists of 5 column names, but the final column consists of JSON objects which contain unescaped commas, e.g.

A,B,C,D,E
1,2,3,4,{K1:V1,K2:V2}

I'm loading my data with a simple training = pd.read_csv('data/training.dat')

however, pandas is clearly misinterpreting the additional commas as new unlabeled columns, and I'm getting an error like this:

CParserError: Error tokenizing data. C error: Expected 75 fields in line 3, saw 84 

I'm trying to navigate the docs, but clearly failing. Does anyone know how to correctly configure the pd.read_csv command to parse this?

I guess the alternative is I could hack together a script that flattens the JSON objects using a union of their keys as columns.
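(For what it's worth, the fallback described here, flattening each JSON object over the union of its keys, is roughly what pandas' json_normalize already does. A minimal sketch, assuming the JSON column has been extracted as strings of valid JSON; the sample values here are stand-ins:)

```python
import json

import pandas as pd

# Hypothetical sample: the last field of each row, already extracted
# as a string of valid JSON.
json_strings = ['{"K1": "V1", "K2": "V2"}', '{"K1": "V1", "K3": "V3"}']

# json_normalize flattens a list of dicts into a DataFrame whose
# columns are the union of all keys; missing keys become NaN.
flat = pd.json_normalize([json.loads(s) for s in json_strings])
print(flat.columns.tolist())  # ['K1', 'K2', 'K3']
```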

3 Answers


If it is feasible for you to replace { with "{, and } with }", the file can be read correctly by:

pd.read_csv('data/training.dat', quotechar='"', skipinitialspace=True)
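(To see the quote-wrapping trick work without editing the file on disk, here's a sketch using an in-memory buffer; the sample row is a stand-in for the asker's data:)

```python
import io

import pandas as pd

raw = "A,B,C,D,E\n1,2,3,4,{K1:V1,K2:V2}\n"

# Wrap each JSON object in double quotes so the commas inside it
# are treated as part of a single quoted field.
quoted = raw.replace("{", '"{').replace("}", '}"')

df = pd.read_csv(io.StringIO(quoted), quotechar='"', skipinitialspace=True)
print(df.shape)  # (1, 5)
```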

Edit:

Or go for a regular expression based solution:

In [205]: print pd.read_csv('a.data', sep=",(?![^{]*\})", header=None)
   0  1  2  3              4
0  A  B  C  D              E
1  1  2  3  4  {K1:V1,K2:V2}

[2 rows x 5 columns]

1 Comment

I really like that regex-based solution. I didn't realize sep could parse regex.
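(On the regex point: any sep longer than one character is treated as a regular expression, which also forces the python engine. A self-contained sketch of the lookahead-based split, with a stand-in buffer instead of a.data:)

```python
import io

import pandas as pd

raw = "A,B,C,D,E\n1,2,3,4,{K1:V1,K2:V2}\n"

# Split on commas only when they are not followed by the tail of a
# brace-delimited object (negative lookahead). Multi-character regex
# separators require the python engine.
df = pd.read_csv(io.StringIO(raw), sep=r",(?![^{]*\})", engine="python")
print(df.shape)  # (1, 5)
```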

I think it depends on what you're trying to do with the JSON. If you just want to ignore it, probably the easiest way is to set the comment char to {. (For this and the next example, I've assumed you don't have any braces in your other columns.)

pd.read_csv( 'woo.csv', comment='{' ) 
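(A quick check of the comment-character trick, with a stand-in buffer instead of woo.csv: everything from { to the end of the line is discarded, so the JSON column simply parses as NaN:)

```python
import io

import pandas as pd

raw = "A,B,C,D,E\n1,2,3,4,{K1:V1,K2:V2}\n"

# Everything from '{' to end-of-line is treated as a comment and
# dropped, so column E comes back empty (NaN).
df = pd.read_csv(io.StringIO(raw), comment="{")
print(df.shape)  # (1, 5)
```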

It is possible to extract elements from the JSON using a custom separator with read_csv, though I'm not at all convinced this is a sensible approach. Pandas will turn the separator into a column if it's a capturing group (it uses re.split internally), so I can get a column containing the JSON. Unfortunately I get a load of empty columns too because of that; hence the dropna.

I sent the JSON through loads and dumps, though obviously you'd want to do something more sensible. :)

import json

json_bit = lambda x: json.dumps(json.loads(x))
pd.read_csv('woo.csv', sep=r'(\{.*\})$|,', converters={'None.3': json_bit}).dropna(axis=1)

Sample CSV

A,B,C,D,E
1,2,3,4,{"K1":"V1","K2":"V2"}
3,2,3,4,{"K1": "V1", "k£": {"k3": "v3"}, "K2":"V2"}
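(A runnable variant of the same idea: split the JSON column off with the brace-aware separator, then decode it with json.loads. Regex separators ignore quoting, so the quotes inside the JSON survive intact for the decoder; note that nested braces, as in the second sample row above, would defeat the simple lookahead, so this sketch uses flat objects only:)

```python
import io
import json

import pandas as pd

raw = ('A,B,C,D,E\n'
       '1,2,3,4,{"K1":"V1","K2":"V2"}\n'
       '3,2,3,4,{"K1": "V1", "K2": "V2"}\n')

# Split on commas outside braces, then decode the JSON column into dicts.
df = pd.read_csv(io.StringIO(raw), sep=r",(?![^{]*\})", engine="python")
df["E"] = df["E"].map(json.loads)
```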



No need to preprocess the csv file, just use the python engine:

dataset = pd.read_csv('sample.csv', sep=',', engine='python') 

1 Comment

This doesn't seem to solve the problem. I don't see how specifying the engine type will make a column entry get interpreted as a dict.
