How to read a CSV file where rows are quoted into a dataframe

Question

I have a CSV file that looks like this,

title 1 "x,y,z,w" "1,2,3,4" title 2 "a,s,d,f,g,h,j,k,l,z,x,c,v,b,n,m" "1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7" "1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7" "1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7" "1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7" "1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7" "1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7" "1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7" title 3 x,y,z,w 1,2,3,4

I am trying to read this CSV file which has a different column structure, which I was able to do that using skiprows, skipfooter, and quoting=csv.QUOTE_NONE. I am using quoting=csv.QUOTE_NONE to interpret the double-quotes (") row with different columns, not as a single string.

import csv import pandas as pd title_1 = 0 title_2 = 3 title_3 = 12 total = 14 title_1_df = pd.read_csv("test_csv.csv", engine='python', skiprows=title_1 + 1, skipfooter =(total - title_2) + 1, quoting=csv.QUOTE_NONE) title_2_df = pd.read_csv("test_csv.csv", engine='python', skiprows=title_2 + 1, skipfooter=(total - title_2) + 1, quoting=csv.QUOTE_NONE) title_3_df = pd.read_csv("test_csv.csv", engine='python', skiprows=title_3 + 1)

I was able to read the CSV file, but the double quotes also came with it in DataFrame. Below is the output.

Is there a way to remove the double quotes while reading the CSV file? As you can see in the output of title_1_df and title_2_df double quotes are coming in the first and last column and I would like to the output as title_3_df.

Will you please paste the CSV file's contents into the question? — user17242583
– user17242583, Commented Feb 4, 2022 at 23:56

score 2 · Accepted Answer · 2022-02-05 00:44:53Z

Looking carefully through pd.read_csv's (many) options, I can't find a way of removing these quotes during the read, and, thinking about it, I'm not sure there should be one.

Quoting is done for individual values, not rows. For example, it often might be used to intentionally store the commas as cell data, rather than cell separators. By telling pandas to ignore the quoting, you're also telling it the quote character (") is just normal data.

For this case, I'd either strip the CSV file of quotes prior to reading it, or remove the quotes after reading it.

Method 1. Removing the quotes before reading the file

Two ways. The first is doing it line by line, which is more intuitive but less efficient. The other way other way is doing it all in one go (less intuitive but more efficient):

import re with open('test_csv.csv') as f: text = re.sub(r'"*([\r\n])+"*|(?:^"*|"*$)', '\\1', f.read())

Now, you can either write the processed data back to the file and read the file with pd.read_csv, or you can read the CSV from the string directly. I'll show both methods:

Writing back to the file:

with open('test_csv.csv', 'w') as f: f.write(text) header_df = pd.read_csv("test_csv.csv", ...) data_df = pd.read_csv("test_csv.csv", ...) footer_df = pd.read_csv("test_csv.csv", ...)

Reading directly from the processed string:

from io import StringIO s = StringIO(text) header_df = pd.read_csv(s, ...); s.seek(0) data_df = pd.read_csv(s, ...); s.seek(0) footer_df = pd.read_csv(s, ...); s.seek(0)

Method 2. Removing the quotes after reading the file

Use df.iloc[:, [0, -1]] to select the first and last column of a dataframe:

def remove_quotes(df): df.iloc[:, [0, -1]] = df.iloc[:, [0, -1]].astype(str).apply(lambda col: col.str.strip('"')).astype(int) df.columns = df.columns.str.strip('"') remove_quotes(header_df) remove_quotes(data_df) remove_quotes(footer_df)

Output:

>>> header_df x y z w 0 1 2 3 4 >>> data_df a s d f g h j k l z x c v b n m 0 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 1 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 2 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 3 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 4 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 5 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 6 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 >>> footer_df x y z w 0 1 2 3 4

Chris Sears · Accepted Answer · 2022-02-05 00:24:23Z

You can do it in 2 passes, first read them in as one column, then create a new DataFrame by splitting the values of the column on ','.

import pandas as pd df = pd.read_csv('quotes.csv') df2 = pd.DataFrame(df[df.columns[0]].apply(lambda x: x.split(',')).to_list(), columns=df.columns[0].split(',')) print(df2)

Output:

 a s d f g h j k l z x c v b n m 0 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 1 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 2 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 3 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 4 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 5 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 6 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7

Collectives™ on Stack Overflow

How to read a CSV file where rows are quoted into a dataframe

2 Answers 2

Method 1. Removing the quotes before reading the file

Method 2. Removing the quotes after reading the file

Comments

Comments

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Method 1. Removing the quotes before reading the file

Method 2. Removing the quotes after reading the file

Comments

Comments

Related