Remove interval double quotes in a CSV separated by comma and encapsulated by double quotes

Question

Perhaps I'm out of luck, because my double quoted comma separated CSV file has double quotes and commas within useful text.

So I want to turn this:

"record 1","name 1","text 1, text 2"
"record 2","name ""2""","text 2"
"record 3","name 3",""

into this:

"record 1","name 1","text 1, text 2"
"record 2","name 2","text 2"
"record 3","name 3",""

Notice that I removed the double quotes around the 2 in name ""2"", turning it into name 2, but I kept the empty quoted field at the end of record 3: ,""

Kusalananda · Accepted Answer · 2020-04-02 21:56:30Z


Use csvformat to convert the delimiters to tabs (csvformat -T), remove all double quotes (tr -d '"'), and then convert the delimiters back to commas while quoting every field (csvformat -t -U1):

$ csvformat -T file.csv | tr -d '"' | csvformat -t -U1
"record 1","name 1","text 1, text 2"
"record 2","name 2","text 2"
"record 3","name 3",""

csvformat is part of csvkit.


Yes, csvkit is the tool for that. Trying to parse csv with regular expressions is possible, but too complicated and error prone.

– Eduardo Trápani, Apr 2, 2020 at 22:32

That'd fail if any of the fields contained tabs. It just changes the problem from handling commas in the fields to handling tabs in the fields; it doesn't solve the problem.

– Ed Morton, Apr 3, 2020 at 14:25

@EdMorton You would easily be able to change the intermediate delimiter to any single character that is not part of the data, such as @, # or anything else. I chose tabs because csvformat has a shorthand option for it, and because the given data did not contain any tabs. To use @ instead, use -D @ (instead of -T) with the first csvformat and -d @ (instead of -t) with the second.

– Kusalananda ♦, Apr 3, 2020 at 14:27

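
Whichever intermediate delimiter is chosen, it only works if that character really is absent from the data. As a minimal sketch, assuming the file is plain text that grep can scan, a small check can be run before the pipeline. The function name delim_is_safe and the sample data are hypothetical, not part of the answer:

```shell
# delim_is_safe DELIM FILE
# Hypothetical helper: succeeds only if DELIM does not occur anywhere in FILE.
delim_is_safe() {
    # -F treats the delimiter as a fixed string, -q suppresses output.
    ! grep -F -q -- "$1" "$2"
}

# Made-up sample data in which one field contains an @.
tmp=$(mktemp)
printf '%s\n' \
    '"record 1","name 1","text 1, text 2"' \
    '"record 2","user@example","text 2"' > "$tmp"

for d in '@' '#'; do
    if delim_is_safe "$d" "$tmp"; then
        echo "'$d' does not occur in the data"
    else
        echo "'$d' occurs in the data, pick another delimiter"
    fi
done
rm -f "$tmp"
```

If the check fails for a candidate, another character can be tried before running the csvformat pipeline with -D and -d.
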
Picking any char that you hope won't be in the data is always a bit risky, especially if you don't have anything to warn you if it IS in the data. You could do something like sed 's/@/@A/g; s/=/@B/g; s/,/=/g' file.csv | stuff | sed 's/=/,/g; s/@B/=/g; s/@A/@/g' so you KNOW there are no commas (or whatever char you like) in your data while processing with stuff. See stackoverflow.com/a/35708616/1745001 for what that pair of sed commands does.

– Ed Morton, Apr 3, 2020 at 14:59

Ignore that 2-seds suggestion as it'd remove the field-separating commas too. It's a good solution to a different problem :-).

– Ed Morton, Apr 3, 2020 at 15:07
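
For reference, the round trip performed by that pair of sed commands can be sketched as follows. As the comment above notes, it replaces every comma, including the field separators, so it illustrates the escaping technique rather than a solution to this question. The function names encode and decode and the sample string are hypothetical; the sed scripts are the ones quoted in the comment:

```shell
# encode: escape @ and =, then turn every comma into =.
# After this step the text is guaranteed to contain no commas.
encode() { sed 's/@/@A/g; s/=/@B/g; s/,/=/g'; }

# decode: undo the three substitutions in reverse order.
decode() { sed 's/=/,/g; s/@B/=/g; s/@A/@/g'; }

sample='a,b@c=d,@A'

encoded=$(printf '%s\n' "$sample" | encode)
printf '%s\n' "$encoded"             # prints: a=b@Ac@Bd=@AA
printf '%s\n' "$encoded" | decode    # prints: a,b@c=d,@A
```

Because @ is escaped first and restored last, input that already contains @A or @B survives the round trip unchanged.
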


Ed Morton · Accepted Answer · 2020-04-03 15:45:21Z

This will work no matter which characters are in your input (except newlines within quoted fields but that's a whole other problem).

With GNU awk for FPAT:

$ awk -v FPAT='("[^"]*")+' -v OFS='","' '
    {
        for ( i=1; i<=NF; i++ ) {
            gsub(/"/,"",$i)
        }
        print "\"" $0 "\""
    }
' file
"record 1","name 1","text 1, text 2"
"record 2","name 2","text 2"
"record 3","name 3",""

or the equivalent with any awk:

$ awk -v OFS='","' '
    {
        orig=$0; $0=""; i=0
        while ( match(orig,/("[^"]*")+/) ) {
            $(++i) = substr(orig,RSTART,RLENGTH)
            gsub(/"/,"",$i)
            orig = substr(orig,RSTART+RLENGTH)
        }
        print "\"" $0 "\""
    }
' file
"record 1","name 1","text 1, text 2"
"record 2","name 2","text 2"
"record 3","name 3",""

See also "What's the most robust way to efficiently parse CSV using awk?".
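
To see why the regular expression ("[^"]*")+ keeps a field together even when it contains commas or doubled quotes, it can help to print what each match captures. The following is a small sketch using the same match() loop with any POSIX awk; the function name split_fields is hypothetical and the input line is adapted from the question:

```shell
# split_fields: print each quoted field found by the regex on its own
# numbered line, without modifying it.
split_fields() {
    awk '{
        rest = $0
        n = 0
        # Each match is one or more adjacent "..." chunks, so a doubled
        # quote inside a field simply extends the current match.
        while ( match(rest, /("[^"]*")+/) ) {
            printf "%d: %s\n", ++n, substr(rest, RSTART, RLENGTH)
            rest = substr(rest, RSTART + RLENGTH)
        }
    }'
}

printf '%s\n' '"record 2","name ""2""","text 1, text 2"' | split_fields
# prints:
# 1: "record 2"
# 2: "name ""2"""
# 3: "text 1, text 2"
```

The comma between fields is never inside a pair of quotes, so it ends each match, while the comma in the third field sits between quotes and is kept.
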
