37

I'm reading a basic csv file where the columns are separated by commas with these column names:

userid, username, body

However, the body column is a string which may contain commas. Obviously this causes a problem and pandas throws out an error:

CParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 8

Is there a way to tell pandas to ignore commas in a specific column or a way to go around this problem?

5 Answers 5

39

Imagine we're reading your dataframe called comma.csv:

userid, username, body 01, n1, 'string1, string2' 

One thing you can do is to specify the delimiter of the strings in the column with:

df = pd.read_csv('comma.csv', quotechar="'") 

In this case strings delimited by ' are considered as total, no matter commas inside them.

Sign up to request clarification or add additional context in comments.

9 Comments

I don't think his string has quotes, because if it did then pandas would recognize it without quotechar
@David Ok got it. Your dataframe has been provided "as is", or you can do some preprocessing on it?
@David have a look at this question, hope that can help you.
@David Why no import straight from the SQL, maybe something like this stackoverflow.com/questions/12047193/…
I tried this example, but does not work for me. Python 3, pandas 0.18.0 reads comma.csv but 01 become the index 1, and username become only 'string1 :(
|
21

Add usecols and lineterminator to your read_csv() function, which, n is the len of your columns.

In my case:

n = 5 #define yours df = pd.read_csv(file, usecols=range(n), lineterminator='\n', header=None) 

Comments

8

for me none of the above code samples worked (I was working on Netflix Prize dataset on Kaggle) but there is actually one cool feature from pandas version 1.3.0+ which an on_bad_lines parameter that let you use a callback function. Here is what I did:

def manual_separation(bad_line): right_split = bad_line[:-2] + [",".join(bad_line[-2:])] # All the "bad lines" where all coming from the same last column that was containing "," return right_split filename = "netflix_movie_titles.csv" df = pd.read_csv( filename, header=None, encoding="ISO-8859-1", names = ['Movie_Id', 'Year', 'Name'], on_bad_lines=manual_separation, engine="python", ) 

Works like a charm! Your only obligation is to use engine=python. Hope that helps!

1 Comment

This was exactly what I needed. It was easy enough to adjust which column I was having problems with, and return the rest of the columns unaffected as well. Thanks very much for this.
1

Does this help?

import csv with open("csv_with_commas.csv", newline='', encoding = 'utf8') as f: csvread = csv.reader(f) batch_data = list(csvread) print(batch_data) 

Reference:

[1] https://stackoverflow.com/a/40477760/6907424

[2] To combat "UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 157: character maps to undefined": https://stackoverflow.com/a/9233174/6907424

Comments

0

First, I didn't find anything to resolve "comma inside qoutes, issue" systematically and properly. The pandas=1.5.3 is unable to parse it correctly. Tried to specify parameters like qoutechar, quoting, escapechar, lineterminator, ...

Finally, found two workaround solutions taking advantage I know the comma could be in the last column only. Assume following content of csv

userid, username, body 1, Joe, string1 2, Jim, "string21, string22" 

If you don't mind the part after 3rd comma is lost then specify number of columns

pd.read_csv(r'c:\TEMP\to_parse.csv',usecols=range(3)) 

which yields

 userid username body 0 1 Joe string1 1 2 Jim "string21 

Second workaround is more complicated but it yields complete string with comma. The principle is to replace first 2 commas by semicolon (you must know the number of columns)

with open(path, 'r') as f: fo = io.StringIO() data = f.readlines() fo.writelines(u"" + line.replace(';', ':').replace(',', ';', 2) for line in data) fo.seek(0) df = pd.read_csv(fo, on_bad_lines='warn', sep=';') 

Perhaps it could be accomplished by regex as well.

1 Comment

Hi @karel-marik, you might want to try adding the skipinitialspace=True option to read_csv (see here). This helped me to get an example like yours to work with pandas.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.