pandas read csv with extra commas in column

Question

I'm reading a basic csv file where the columns are separated by commas with these column names:

userid, username, body

However, the body column is a string which may contain commas. Obviously this causes a problem and pandas throws out an error:

CParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 8

Is there a way to tell pandas to ignore commas in a specific column or a way to go around this problem?

Fabio Lamanna · Accepted Answer · 2015-09-23 15:34:30Z

39

Imagine we're reading your dataframe called comma.csv:

userid, username, body 01, n1, 'string1, string2'

One thing you can do is to specify the delimiter of the strings in the column with:

df = pd.read_csv('comma.csv', quotechar="'")

In this case strings delimited by ' are considered as total, no matter commas inside them.

answered Sep 23, 2015 at 15:34

Fabio Lamanna

21.7k24 gold badges95 silver badges126 bronze badges

Sign up to request clarification or add additional context in comments.

9 Comments

Leb Over a year ago

I don't think his string has quotes, because if it did then pandas would recognize it without quotechar

Fabio Lamanna Over a year ago

@David Ok got it. Your dataframe has been provided "as is", or you can do some preprocessing on it?

Fabio Lamanna Over a year ago

@David have a look at this question, hope that can help you.

Leb Over a year ago

@David Why no import straight from the SQL, maybe something like this stackoverflow.com/questions/12047193/…

ragesz Over a year ago

I tried this example, but does not work for me. Python 3, pandas 0.18.0 reads comma.csv but 01 become the index 1, and username become only 'string1 :(

|

Ilyas · Accepted Answer · 2019-08-19 08:24:47Z

Add usecols and lineterminator to your read_csv() function, which, n is the len of your columns.

In my case:

n = 5 #define yours df = pd.read_csv(file, usecols=range(n), lineterminator='\n', header=None)

Antoine Krajnc · Accepted Answer · 2022-06-21 14:03:42Z

for me none of the above code samples worked (I was working on Netflix Prize dataset on Kaggle) but there is actually one cool feature from pandas version 1.3.0+ which an on_bad_lines parameter that let you use a callback function. Here is what I did:

def manual_separation(bad_line): right_split = bad_line[:-2] + [",".join(bad_line[-2:])] # All the "bad lines" where all coming from the same last column that was containing "," return right_split filename = "netflix_movie_titles.csv" df = pd.read_csv( filename, header=None, encoding="ISO-8859-1", names = ['Movie_Id', 'Year', 'Name'], on_bad_lines=manual_separation, engine="python", )

Works like a charm! Your only obligation is to use engine=python. Hope that helps!

This was exactly what I needed. It was easy enough to adjust which column I was having problems with, and return the rest of the columns unaffected as well. Thanks very much for this.

hafiz031 · Accepted Answer · 2021-01-16 23:50:19Z

Does this help?

import csv with open("csv_with_commas.csv", newline='', encoding = 'utf8') as f: csvread = csv.reader(f) batch_data = list(csvread) print(batch_data)

Reference:

[1] https://stackoverflow.com/a/40477760/6907424

[2] To combat "UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 157: character maps to undefined": https://stackoverflow.com/a/9233174/6907424

Karel Marik · Accepted Answer · 2023-04-24 19:05:31Z

First, I didn't find anything to resolve "comma inside qoutes, issue" systematically and properly. The pandas=1.5.3 is unable to parse it correctly. Tried to specify parameters like qoutechar, quoting, escapechar, lineterminator, ...

Finally, found two workaround solutions taking advantage I know the comma could be in the last column only. Assume following content of csv

userid, username, body 1, Joe, string1 2, Jim, "string21, string22"

If you don't mind the part after 3rd comma is lost then specify number of columns

pd.read_csv(r'c:\TEMP\to_parse.csv',usecols=range(3))

which yields

 userid username body 0 1 Joe string1 1 2 Jim "string21

Second workaround is more complicated but it yields complete string with comma. The principle is to replace first 2 commas by semicolon (you must know the number of columns)

with open(path, 'r') as f: fo = io.StringIO() data = f.readlines() fo.writelines(u"" + line.replace(';', ':').replace(',', ';', 2) for line in data) fo.seek(0) df = pd.read_csv(fo, on_bad_lines='warn', sep=';')

Perhaps it could be accomplished by regex as well.

Hi @karel-marik, you might want to try adding the skipinitialspace=True option to read_csv (see here). This helped me to get an example like yours to work with pandas.

Collectives™ on Stack Overflow

pandas read csv with extra commas in column

5 Answers 5

9 Comments

Comments

1 Comment

Comments

1 Comment

Linked

Hot Network Questions

Collectives™ on Stack Overflow

5 Answers 5

9 Comments

Comments

1 Comment

Comments

1 Comment

Linked

Related