Double-quoted elements in CSV can't be read with pandas

Question

I have an input file where every value is stored as a string. It is inside a csv file with each entry inside double quotes.

Example file:

    "column1","column2", "column3", "column4", "column5", "column6"
    "AM", "07", "1", "SD", "SD", "CR"
    "AM", "08", "1,2,3", "PR,SD,SD", "PR,SD,SD", "PR,SD,SD"
    "AM", "01", "2", "SD", "SD", "SD"

There are only six columns. What options do I need to pass to pandas read_csv to read this correctly?

I currently am trying:

    import pandas as pd
    df = pd.read_csv(file, quotechar='"')

but this gives me the error message:

    CParserError: Error tokenizing data. C error: Expected 6 fields in line 3, saw 14

This obviously means that it is ignoring the '"' characters and treating every comma as a field separator. However, on line 3, columns 3 through 6 should be strings that contain commas ("1,2,3", "PR,SD,SD", "PR,SD,SD", "PR,SD,SD").
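The count of 14 can be reproduced outside pandas. The sketch below uses Python's standard csv module as a stand-in (an assumption for illustration; it is not pandas' C tokenizer, although the field count it produces matches the error). In that module a quote character only opens a quoted field when it is the first character of the field. Every quote after the first column here is preceded by a space, so it is read as an ordinary character and the commas inside it split the field.

```python
import csv
import io

# Line 3 of the example file, exactly as written (note the space after each comma).
line3 = '"AM", "08", "1,2,3", "PR,SD,SD", "PR,SD,SD", "PR,SD,SD"\n'

# Default dialect: a quote only starts a quoted field when it is the first
# character of that field, so ' "1,2,3"' is split at its inner commas.
fields = next(csv.reader(io.StringIO(line3)))

print(len(fields))   # 14
print(fields[:5])    # ['AM', ' "08"', ' "1', '2', '3"']
```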

How do I get pandas.read_csv to parse this correctly?

Thanks.

If you are good with regex you can use it in the sep argument to read_csv... stackoverflow.com/questions/24091356/…
– rhaskett, Commented Oct 27, 2014 at 23:38

Jeff · Accepted Answer · 2014-10-28 12:43:59Z

This will work. It falls back to the Python parser because the separators are irregular (a comma, sometimes followed by a space). If the file only had commas, it would use the C parser and be much faster.

    In [1]: import csv

    In [2]: !cat test.csv
    "column1","column2", "column3", "column4", "column5", "column6"
    "AM", "07", "1", "SD", "SD", "CR"
    "AM", "08", "1,2,3", "PR,SD,SD", "PR,SD,SD", "PR,SD,SD"
    "AM", "01", "2", "SD", "SD", "SD"

    In [3]: pd.read_csv('test.csv',sep=',\s+',quoting=csv.QUOTE_ALL)
    pandas/io/parsers.py:637: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators; you can avoid this warning by specifying engine='python'.
      ParserWarning)
    Out[3]:
          "column1","column2"  "column3"   "column4"   "column5"   "column6"
    "AM"                 "07"        "1"        "SD"        "SD"        "CR"
    "AM"                 "08"    "1,2,3"  "PR,SD,SD"  "PR,SD,SD"  "PR,SD,SD"
    "AM"                 "01"        "2"        "SD"        "SD"        "SD"
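The shape of Out[3] (a fused "column1","column2" header and quotes kept in every value) follows from how a regex separator splits each line. The sketch below applies the same pattern with Python's re module. It is only an approximation of what the python engine does, not pandas' actual implementation, and the variable names are made up for illustration.

```python
import re

header = '"column1","column2", "column3", "column4", "column5", "column6"'
row = '"AM", "08", "1,2,3", "PR,SD,SD", "PR,SD,SD", "PR,SD,SD"'

# Same pattern as the answer's sep argument, written as a raw string.
pattern = re.compile(r',\s+')

header_fields = pattern.split(header)
row_fields = pattern.split(row)

# The first two header names have no space between them, so they stay fused.
print(len(header_fields), header_fields[0])   # 5 "column1","column2"

# Commas inside the quoted values are not followed by whitespace, so they survive.
print(len(row_fields), row_fields[2])         # 6 "1,2,3"
```

With five header names and six values per row, pandas treats the extra leading value as the index, which would explain why "AM" shows up in the index position above.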

It does not work for me. My huge CSV, which is time-consuming to run through sed, contains lines like 4366201,"Erud","Facebook,Ado-Ekiti","2018-03-22 10:38:42","UR",0,0,\N ,\N,\N,\N,\N,\N and gives ParserError: ' ' expected after '"'. I even tried pd.read_csv("users.csv", sep=",", delimiter="\n", quoting=csv.QUOTE_ALL, engine="python", quotechar='"', encoding="utf-8")
What finally worked for me was pd.read_csv("users.csv", sep=",", encoding="utf-8", names=["id", "name"...])
Note: sep=',\s*' seems to break when combined with quotechar='"', quoting=csv.QUOTE_ALL. From reading this, it seemed the two would be equivalent, but that's not what I found. Leaving this here for others.
This only works with the python engine. When you need low_memory=True, the solution won't work.

Duong Hang · Accepted Answer · 2022-11-23 02:35:48Z

This worked for me: (I used Python 3.9)

dataset = pd.read_csv('test.csv', sep=',', skipinitialspace=True)
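As a minimal sketch of that call on the question's sample data, the snippet below reads the same four lines from an in-memory string instead of test.csv. The dtype=str argument is an addition that is not in the answer; it assumes the goal is to keep every value as a string, as the question describes.

```python
import io

import pandas as pd

data = '''"column1","column2", "column3", "column4", "column5", "column6"
"AM", "07", "1", "SD", "SD", "CR"
"AM", "08", "1,2,3", "PR,SD,SD", "PR,SD,SD", "PR,SD,SD"
"AM", "01", "2", "SD", "SD", "SD"
'''

# skipinitialspace=True drops the spaces that follow each delimiter, so every
# quote is again the first character of its field and is treated as a quote.
# dtype=str is an addition not in the answer: it keeps "07" from becoming 7.
df = pd.read_csv(io.StringIO(data), sep=',', skipinitialspace=True, dtype=str)

print(df.shape)
print(df.loc[1, 'column3'])
```

Because no regex separator is involved, this should stay on the default C engine rather than falling back to the python engine.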
