Python CSV: commas, single and double quotes inside columns

Question

I'm trying to write csv file using DictWriter but columns like this:

2,2',2"-(hexahydro-1,3,5-triazine-1,3,5-triyl)triethanol|1,3,5-tris(2-hydroxyethyl)hexahydro-1,3,5-triazine

breakes everything. The header is:

"#","Index no.","EC / List no.","CAS no.","Name","Page ID","Link"

And column above should be in column Name, but here what I got when I'm trying to write this row:

OrderedDict([('\ufeff "#"', '756'), ('Index no.', '613-114-00-6'), ('EC / List no.', '225-208-0'), ('CAS no.', '4719-04-4'), # most of the following should be the value to 'Name' # `PageId` should be '122039' and 'Link' should be the 'https...' text ('Name', "2,2',2-(hexahydro-1"), ('Page ID', '3'), ('Link', '5-triazine-1'), (None, ['3', '5-triyl)triethanol|1', '3', '5-tris(2-hydroxyethyl)hexahydro-1', '3', '5-triazine"', '122039', 'https://echa.europa.eu/information-on-chemicals/cl-inventory-database/-/discli/details/122039'])

I tried every possible combination of DictWriter parameters

quotechar='"', doublequote=False, delimiter=',', quoting=csv.QUOTE_ALL, skipinitialspace=True, escapechar='\\'

and nothing helped.

Minimal, Complete, and Verifiable example

old.csv

"#","Index no.","EC / List no.","CAS no.","Name","Page ID" "756","613-114-00-6","225-208-0","4719-04-4","2,2',2"-(hexahydro-1,3,5-triazine-1,3,5-triyl)triethanol|1,3,5-tris(2-hydroxyethyl)hexahydro-1,3,5-triazine","122039"

code:

import csv with open('old.csv') as f, open('new.csv', 'w') as ff: reader = csv.DictReader(f) result = csv.DictWriter(ff, fieldnames=reader.fieldnames) for line in reader: result.writerow(line)

What format do you expect as output? doublequote=False looks wrong for most CSV dialects, but what you expect and need depends on whatever is supposed to be reading this output. Do you target a particular consumer? Can you find documentation for its behavior? — tripleee
– tripleee, Commented Feb 10, 2019 at 10:50
I tried just to print reader line and it's already has a problem, so I probably need to add some parameters to DictReader. — NoGapps
– NoGapps, Commented Feb 10, 2019 at 11:02
Adding parameters will not work -your csv data is badly formatted - it does not escape " correctly. See answer. — Patrick Artner
– Patrick Artner, Commented Feb 10, 2019 at 11:03

Patrick Artner · Accepted Answer · 2019-02-10 11:01:48Z

Your old.csv is badly formatted - it does not escape " (nor doubles it) correctly:

"756","613-114-00-6","225-208-0","4719-04-4","2,2',2"-(hexahydro-1,3,5-triazine-1,3,5-triyl)triethanol|1,3,5-tris(2-hydroxyethyl)hexahydro-1,3,5-triazine","122039" ----------------------------------------------------^ here is the not escaped "

This line should look like:

"756","613-114-00-6","225-208-0","4719-04-4","2,2',2\"-(hexahydro-1,3,5-triazine-1,3,5-triyl)triethanol|1,3,5-tris(2-hydroxyethyl)hexahydro-1,3,5-triazine","122039","https://echa.europa.eu/information-on-chemicals/cl-inventory-database/-/discli/details/122039" ----------------------------------------------------^^ escaped "

Using doublequote=True would need the " inside a field to be doubled up: "tata""tata" for tata"tata - your source data does neither: doubling up nor escaping.

This works flawlessly:

from collections import OrderedDict fieldn = ["#","Index no.","EC / List no.","CAS no.","Name","Page ID","Link"] od = OrderedDict( [('#', '756'), ('Index no.', '613-114-00-6'), ('EC / List no.', '225-208-0'), ('CAS no.', '4719-04-4'), ('Name', '''2,2',2"-(hexahydro-1,3,5-triazine-1,3,5-triyl)triethanol|1,3,5-tris(2-hydroxyethyl)hexahydro-1,3,5-triazine'''), ('Page ID', '122039'), ('Link', 'https://echa.europa.eu/information-on-chemicals/cl-inventory-database/-/discli/details/122039')]) print(od) # see: Input to writer: import csv # write the ordered dict with open("file.txt", "w",newline = "") as f: writer = csv.DictWriter(f, quotechar='"', doublequote=False, delimiter=',', quoting=csv.QUOTE_ALL, skipinitialspace=True, escapechar= '\\', fieldnames=fieldn) writer.writeheader() # remove if you do not want the header in as well writer.writerow(od) # read it back in and print it with open ("file.txt") as r: reader = csv.DictReader(r, quotechar='"', doublequote=False, delimiter=',', quoting=csv.QUOTE_ALL, skipinitialspace=True, escapechar= '\\', fieldnames=fieldn) for row in reader: print(row) # see Output after reading in written stuff

Input to writer:

OrderedDict([('#', '756'), ('Index no.', '613-114-00-6'), ('EC / List no.', '225-208-0'), ('CAS no.', '4719-04-4'), ('Name', '2,2\',2"-(hexahydro-1,3,5-triazine-1,3,5-triyl)triethanol|1,3,5-tris(2-hydroxyethyl)hexahydro-1,3,5-triazine'), ('Page ID', '122039'), ('Link', 'https://echa.europa.eu/information-on-chemicals/cl-inventory-database/-/discli/details/122039')])

Output after reading in written stuff (writes header as well - hence the double output):

OrderedDict([('#', '#'), ('Index no.', 'Index no.'), ('EC / List no.', 'EC / List no.'), ('CAS no.', 'CAS no.'), ('Name', 'Name'), ('Page ID', 'Page ID'), ('Link', 'Link')]) OrderedDict([('#', '756'), ('Index no.', '613-114-00-6'), ('EC / List no.', '225-208-0'), ('CAS no.', '4719-04-4'), ('Name', '2,2\',2"-(hexahydro-1,3,5-triazine-1,3,5-triyl)triethanol|1,3,5-tris(2-hydroxyethyl)hexahydro-1,3,5-triazine'), ('Page ID', '122039'), ('Link', 'https://echa.europa.eu/information-on-chemicals/cl-inventory-database/-/discli/details/122039')])

File content:

"#","Index no.","EC / List no.","CAS no.","Name","Page ID","Link" "756","613-114-00-6","225-208-0","4719-04-4","2,2',2\"-(hexahydro-1,3,5-triazine-1,3,5-triyl)triethanol|1,3,5-tris(2-hydroxyethyl)hexahydro-1,3,5-triazine","122039","https://echa.europa.eu/information-on-chemicals/cl-inventory-database/-/discli/details/122039"

Thank you Patrick for your answer. How can I add escape character to input file? I have hundreds of csv files with hundreds of rows like this.
@NoG Go to the source and let them fix it. If you can't and the amount of columns is fixed and you know this error can only happen inside "Name" you could write a python program that somehow "fixes" this error. That would be a totally different problem / question though. You can read each file line-wise, split it and check if it has more then the allowed columns, if so you would have to use some smart code to fix the error and write a corrected file. You might simply check each "row" for its length and report the "bad" into an error file and fix them by hand if the good/bad ratio is okish.

Mark Tolonen · Accepted Answer · 2019-02-10 18:21:18Z

IF only the 5th column has double quotes in the data and the other columns are quoted correctly as shown, you could use a regular expression to capture the columns and re-write the CSV:

bad.csv

"#","Index no.","EC / List no.","CAS no.","Name","Page ID" "756","613-114-00-6","225-208-0","4719-04-4","2,2',2"-(hexahydro-1,3,5-triazine-1,3,5-triyl)triethanol|1,3,5-tris(2-hydroxyethyl)hexahydro-1,3,5-triazine","122039" "756","613-114-00-6","225-208-0","4719-04-4",""Example"","122039" "756","613-114-00-6","225-208-0","4719-04-4","Another "example" of bad formatting","122039"

test.py

import re import csv with open('bad.csv') as fin: with open('good.csv','w',newline='') as fout: writer = csv.writer(fout) for line in fin: items = re.match(r'"(.*?)","(.*?)","(.*?)","(.*?)","(.*)","(.*?)"$',line).groups() writer.writerow(items)

good.csv

#,Index no.,EC / List no.,CAS no.,Name,Page ID 756,613-114-00-6,225-208-0,4719-04-4,"2,2',2""-(hexahydro-1,3,5-triazine-1,3,5-triyl)triethanol|1,3,5-tris(2-hydroxyethyl)hexahydro-1,3,5-triazine",122039 756,613-114-00-6,225-208-0,4719-04-4,"""Example""",122039 756,613-114-00-6,225-208-0,4719-04-4,"Another ""example"" of bad formatting",122039

Collectives™ on Stack Overflow

Python CSV: commas, single and double quotes inside columns

Minimal, Complete, and Verifiable example

2 Answers 2

2 Comments

Comments

Hot Network Questions

Collectives™ on Stack Overflow

Minimal, Complete, and Verifiable example

2 Answers 2

2 Comments

Comments

Related