
I am trying to read a CSV file into a pandas DataFrame. One of the rows in the CSV has the following data pattern:

a,b,\\"c\\,d",e,f,g,h --> read as 8 fields currently 

instead of the regular pattern on all other rows:

a,b,c,e,f,g,h --> should be read as only 7 fields, like the rest

When I use pd.read_csv('text.csv') to read it into a DataFrame, I get the error

Error tokenizing data. C error: Expected 7 fields in line 36190, saw 8

Is there a way to read the data \"c\,d" into one column? Or what are the best practices to handle such cases in general?

Note: The letters in the rows shown above are just placeholders for the values in each line of the CSV. They are not column names.

As suggested in the answers, this is what the data looks like at the moment in the CSV file:

  AA BB    CC  DD EE FF GG HH
0  a  b  \"c\  d"  e  f  g  h
1  i  j     k   l  m  n  o
2  p  q     r   s  t  u  v

and I want to read it into the DataFrame as follows, and then get rid of the quotes and backslashes:

  AA BB      CC DD EE FF GG
0  a  b  \"c\d"  e  f  g  h
1  i  j       k  l  m  n  o
2  p  q       r  s  t  u  v
  • This is very confusing now. When you say "With the data I have, only one row has [\"c\,d"] data pattern. Rest of them have one lesser field and are like any general comma-separated data. – Namesake", you need to add these things into your post, and at least place a few lines of data in the post so the problem can be reproduced. – Commented Aug 21, 2021 at 4:36

5 Answers


Try the below; it meets your requirement.

Sample file:

$ cat test.csv
a,b,\"c\,d",e,f,g,h
i,j,k,l,m,n,o
p,q,r,s,t,u,v

Solution based on the latest change to the post:

Pandas is a tool for processing tabular data, which means that each row should contain the same number of fields/columns, and the fields in each row should be in the same order.

But your input file fails to meet the criteria that pandas needs in order to read the CSV.

In your case, pandas expects 7 fields in line 36190 but sees 8, which it cannot handle, so you need to clean your data one way or another before processing it.

What you can do is read the data into a single column first while reading the CSV, and then do some cleaning; I've explained the steps below.

Hope this gives you an idea of how to proceed; please keep in mind that you have to clean your data before you read it into pandas.

import pandas as pd

# Read your input file with read_csv, but as a single column
# (sep set to a char that does not appear in the data).
df = pd.read_csv('test.csv', sep='|', names=['col1'])

# Use a regex replace to remove the backslash chars,
# then collapse the "c,d" pattern into a single value.
df['col1'] = df['col1'].replace(r'\\', '', regex=True)
df['col1'] = df['col1'].replace(r'"c,d"', 'cd', regex=True)

# Now save the cleaned rows into a new CSV file
df.to_csv("new.csv")

# Read the new csv file again
df2 = pd.read_csv("new.csv")

# Drop the 'Unnamed: 0' index column as it is not required
df2 = df2.drop('Unnamed: 0', axis=1)

# Strip any leftover unwanted chars so every row has the same length,
# then split the single column into the real fields
df2['col1'] = df2['col1'].replace(r'(,d"|")', '', regex=True)
df2 = df2['col1'].str.split(",", expand=True).rename(
    columns={0: 'AA', 1: 'BB', 2: 'CC', 3: 'DD', 4: 'EE', 5: 'FF', 6: 'GG'})

Result:

print(df2)

  AA BB  CC DD EE FF GG
0  a  b  cd  e  f  g  h
1  i  j   k  l  m  n  o
2  p  q   r  s  t  u  v

2 Comments

With the data I have, only one row has the [\"c\,d"] data pattern. The rest of them have one field fewer and are like any general comma-separated data.
@Namesake, please check the answer; it should do what you mentioned if the data is in the condition described.

I think you can try the escapechar parameter of the pd.read_csv function. I've never had to use it before and I am not 100% sure I understand your question. Are you trying to combine columns C & D for all rows, or only when these special characters/patterns are present in the data? Here is a link to the docs; there may be some more helpful parameters to address this issue: Pandas documentation
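
A minimal sketch of that idea, assuming the backslash is the escape character in your file and using the text.csv name from the question:

import pandas as pd

# Treat the backslash as an escape character so the "\," inside the
# problem row is read as a literal comma rather than a field separator.
df = pd.read_csv('text.csv', escapechar='\\')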

If you are trying to do this for all of the rows, you may want to make a string-processing helper function that combines these values and removes the backslashes, double quotes and comma.
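
Such a helper could look roughly like this (only a sketch; the function name and the exact characters to strip are assumptions):

def combine_and_clean(c_value, d_value):
    # Join the two pieces that were split on the escaped comma and
    # drop the stray backslashes, double quotes and the comma itself.
    combined = f"{c_value},{d_value}"
    return combined.replace("\\", "").replace('"', "").replace(",", "")

You would then apply it only to the rows that actually came in with the extra field, for example after forcing the column count as in the next answer.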



Force the number of columns:

df = pd.read_csv('in.csv', header=None, names=list('12345678')) 

Outputs:

   1  2     3   4  5  6  7  8
0  a  b     c   d  e  f  g  h
1  a  b  \"c\  d"  e  f  g  h

Then clean etc. from there.
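
One way that cleanup could look, as a sketch (it assumes only the malformed rows end up with a value in the eighth column, and the column names come from the names=list('12345678') call above):

import pandas as pd

df = pd.read_csv('in.csv', header=None, names=list('12345678'))

# Rows with the eighth field filled came from the escaped-comma pattern:
# merge the two split pieces back into column '3' and strip the
# backslashes and double quotes...
bad = df['8'].notna()
df.loc[bad, '3'] = (df.loc[bad, '3'] + df.loc[bad, '4']).str.replace(r'[\\"]', '', regex=True)

# ...then shift the remaining fields one position left and drop the extra column.
df.loc[bad, ['4', '5', '6', '7']] = df.loc[bad, ['5', '6', '7', '8']].values
df = df.drop(columns='8')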



You can use

pd.read_csv('text.csv', sep=',') 



First of all, the quotes are wrongly formatted: for a cell to be considered quoted, it needs to start and end with quotation marks (or another chosen QUOTE symbol). In your example it starts with a \, so it is not considered quoted, and the comma in between is not ignored.

I don't know how this file was generated, but if it is a single such case, try fixing it manually. If it occurs in multiple locations, the file needs to be regenerated properly, or if that is not possible, it needs custom preprocessing to fix such cases. A possible preprocessing step might be replacing all \, with another unique symbol, and then replacing it back after reading.
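
A rough sketch of that preprocessing idea (the placeholder character and the file name are assumptions; any character that never appears in your real data would do):

import pandas as pd
from io import StringIO

# Hide the escaped commas behind a placeholder before parsing
# (assumes the NUL character never occurs in the real data).
with open('text.csv', encoding='utf-8') as f:
    raw = f.read()

fixed = raw.replace('\\,', '\x00')

df = pd.read_csv(StringIO(fixed))

# Put the literal commas back; regex=True makes the replacement
# apply to substrings inside each cell.
df = df.replace('\x00', ',', regex=True)

After that you can strip the remaining quotes and backslashes as you planned.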

