
So I have a csv dataset that, as far as I can tell, is well formed, and I'm trying to get the pandas package to load it correctly. The header consists of 5 column names, but the final column consists of JSON objects which contain unescaped commas, e.g.

A,B,C,D,E
1,2,3,4,{K1:V1,K2:V2}

I'm loading my data with a simple training = pd.read_csv('data/training.dat')

however, pandas is clearly misinterpreting the additional commas as new unlabeled columns, and I'm getting an error like this:

CParserError: Error tokenizing data. C error: Expected 75 fields in line 3, saw 84 

I'm trying to navigate the docs, but clearly failing. Does anyone know how to correctly configure the pd.read_csv command to parse this?

I guess the alternative is I could hack together a script that flattens the JSON objects using a union of their keys as columns.
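(For what it's worth, the fallback described here, flattening each JSON object over the union of its keys, is roughly what pandas' json_normalize already does. A minimal sketch, assuming the JSON column has been extracted as strings of valid JSON; the sample values here are stand-ins:)

```python
import json

import pandas as pd

# Hypothetical sample: the last field of each row, already extracted
# as a string of valid JSON.
json_strings = ['{"K1": "V1", "K2": "V2"}', '{"K1": "V1", "K3": "V3"}']

# json_normalize flattens a list of dicts into a DataFrame whose
# columns are the union of all keys; missing keys become NaN.
flat = pd.json_normalize([json.loads(s) for s in json_strings])
print(flat.columns.tolist())  # ['K1', 'K2', 'K3']
```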

3 Answers


If it is feasible for you to replace { with "{, and } with }", the file can be read correctly by:

pd.read_csv('data/training.dat', quotechar='"', skipinitialspace=True)
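(To see the quote-wrapping trick work without editing the file on disk, here's a sketch using an in-memory buffer; the sample row is a stand-in for the asker's data:)

```python
import io

import pandas as pd

raw = "A,B,C,D,E\n1,2,3,4,{K1:V1,K2:V2}\n"

# Wrap each JSON object in double quotes so the commas inside it
# are treated as part of a single quoted field.
quoted = raw.replace("{", '"{').replace("}", '}"')

df = pd.read_csv(io.StringIO(quoted), quotechar='"', skipinitialspace=True)
print(df.shape)  # (1, 5)
```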

Edit:

Or go for a regular expression based solution:

In [205]: print pd.read_csv('a.data', sep=",(?![^{]*\})", header=None)
   0  1  2  3              4
0  A  B  C  D              E
1  1  2  3  4  {K1:V1,K2:V2}

[2 rows x 5 columns]

1 Comment

I really like that regex-based solution. I didn't realize sep could parse regex.
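(On the regex point: any sep longer than one character is treated as a regular expression, which also forces the python engine. A self-contained sketch of the lookahead-based split, with a stand-in buffer instead of a.data:)

```python
import io

import pandas as pd

raw = "A,B,C,D,E\n1,2,3,4,{K1:V1,K2:V2}\n"

# Split on commas only when they are not followed by the tail of a
# brace-delimited object (negative lookahead). Multi-character regex
# separators require the python engine.
df = pd.read_csv(io.StringIO(raw), sep=r",(?![^{]*\})", engine="python")
print(df.shape)  # (1, 5)
```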

I think it depends on what you're trying to do with the JSON. If you just want to ignore it, probably the easiest way is to set the comment char to {. (For this and the next example, I've assumed you don't have any braces in your other columns.)

pd.read_csv( 'woo.csv', comment='{' ) 
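(A quick check of the comment-character trick, with a stand-in buffer instead of woo.csv: everything from { to the end of the line is discarded, so the JSON column simply parses as NaN:)

```python
import io

import pandas as pd

raw = "A,B,C,D,E\n1,2,3,4,{K1:V1,K2:V2}\n"

# Everything from '{' to end-of-line is treated as a comment and
# dropped, so column E comes back empty (NaN).
df = pd.read_csv(io.StringIO(raw), comment="{")
print(df.shape)  # (1, 5)
```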

It is possible to extract elements from the JSON using a custom separator with read_csv, though I'm not at all convinced this is a sensible approach. Pandas will turn the separator into a column if it's a capturing group (it uses re.split internally), so I can get a column containing the JSON. Unfortunately I get a load of empty columns too because of that; hence the dropna.

I sent the JSON through loads and dumps, though obviously you'd want to do something more sensible. :)

import json

json_bit = lambda x: json.dumps(json.loads(x))
pd.read_csv('woo.csv', sep=r'(\{.*\})$|,', converters={'None.3': json_bit}).dropna(axis=1)

Sample CSV

A,B,C,D,E
1,2,3,4,{"K1":"V1","K2":"V2"}
3,2,3,4,{"K1": "V1", "k£": {"k3": "v3"}, "K2":"V2"}
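(A runnable variant of the same idea: split the JSON column off with the brace-aware separator, then decode it with json.loads. Regex separators ignore quoting, so the quotes inside the JSON survive intact for the decoder; note that nested braces, as in the second sample row above, would defeat the simple lookahead, so this sketch uses flat objects only:)

```python
import io
import json

import pandas as pd

raw = ('A,B,C,D,E\n'
       '1,2,3,4,{"K1":"V1","K2":"V2"}\n'
       '3,2,3,4,{"K1": "V1", "K2": "V2"}\n')

# Split on commas outside braces, then decode the JSON column into dicts.
df = pd.read_csv(io.StringIO(raw), sep=r",(?![^{]*\})", engine="python")
df["E"] = df["E"].map(json.loads)
```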



No need to preprocess the csv file, just use the python engine:

dataset = pd.read_csv('sample.csv', sep=',', engine='python') 

1 Comment

This doesn't seem to solve the problem. I don't see how specifying the engine type will make a column entry get interpreted as a dict.
