
Perhaps I'm out of luck: my CSV file uses double-quoted, comma-separated fields, and the field text itself contains double quotes and commas.

So I want to turn this:

"record 1","name 1","text 1, text 2"
"record 2","name ""2""","text 2"
"record 3","name 3",""

into this:

"record 1","name 1","text 1, text 2"
"record 2","name 2","text 2"
"record 3","name 3",""

Notice that I removed the doubled double quotes around the 2 (name ""2"" became name 2), but kept the empty quoted field at the end of line #3: ,""

2 Answers

Use csvformat to convert the delimiters to tabs (csvformat -T), delete every double quote (tr -d '"'), and then convert the delimiters back to commas while quoting every field (csvformat -t -U1):

$ csvformat -T file.csv | tr -d '"' | csvformat -t -U1
"record 1","name 1","text 1, text 2"
"record 2","name 2","text 2"
"record 3","name 3",""

csvformat is part of csvkit.
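
The pipeline above relies on the intermediate delimiter never occurring in the data. As a minimal sketch, that assumption can be checked explicitly so the command refuses to run instead of silently splitting a field. The function name strip_quotes_via is hypothetical (it is not part of csvkit), and the sketch assumes csvformat's -D and -d options set the output and input delimiters respectively:

```shell
# Hypothetical wrapper: strip_quotes_via FILE [DELIM]
# DELIM is the temporary delimiter; it defaults to a tab, as in the pipeline above.
strip_quotes_via() {
    file=$1
    delim=${2:-$(printf '\t')}

    # Refuse to continue if the temporary delimiter already occurs in the data.
    if grep -qF -- "$delim" "$file"; then
        echo "strip_quotes_via: delimiter already present in $file" >&2
        return 1
    fi

    # Same three steps as above, with the delimiter made configurable.
    csvformat -D "$delim" "$file" | tr -d '"' | csvformat -d "$delim" -U1
}
```

For example, strip_quotes_via file.csv @ would use @ rather than a tab as the temporary delimiter.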

  • Yes, csvkit is the tool for that. Trying to parse csv with regular expressions is possible, but too complicated and error prone. Commented Apr 2, 2020 at 22:32
  • That'd fail if any of the fields contained tabs, it's just changing the problem from handling ,s in the fields to handling \ts in the fields, not solving the problem. Commented Apr 3, 2020 at 14:25
  • @EdMorton You would easily be able to change the intermediate delimiter to any single character that is not part of the data, such as @, # or anything else. I chose tabs because csvformat has a shorthand option for it, and because the given data did not contain any tabs. To use @ instead, use -D @ (instead of -T) with the first csvformat and -d @ (instead of -t) with the second. Commented Apr 3, 2020 at 14:27
  • Picking any char that you hope won't be in the data is always a bit risky, especially if you don't have anything to warn you if it IS in the data. You could do something like sed 's/@/@A/g; s/=/@B/g; s/,/=/g' file.csv | stuff | sed 's/=/,/g; s/@B/=/g; s/@A/@/g' so you KNOW there are no commas (or whatever char you like) in your data while processing with stuff. See stackoverflow.com/a/35708616/1745001 for what that pair of sed commands do. Commented Apr 3, 2020 at 14:59
  • Ignore that 2-seds suggestion as it'd remove the field-separating commas too. It's a good solution to a different problem :-). Commented Apr 3, 2020 at 15:07

This will work no matter which characters are in your input (except newlines within quoted fields but that's a whole other problem).

With GNU awk for FPAT:

$ awk -v FPAT='("[^"]*")+' -v OFS='","' '
{
    for ( i=1; i<=NF; i++ ) {
        gsub(/"/,"",$i)
    }
    print "\"" $0 "\""
}
' file
"record 1","name 1","text 1, text 2"
"record 2","name 2","text 2"
"record 3","name 3",""

or the equivalent with any awk:

$ awk -v OFS='","' '
{
    orig=$0; $0=""; i=0
    while ( match(orig,/("[^"]*")+/) ) {
        $(++i) = substr(orig,RSTART,RLENGTH)
        gsub(/"/,"",$i)
        orig = substr(orig,RSTART+RLENGTH)
    }
    print "\"" $0 "\""
}
' file
"record 1","name 1","text 1, text 2"
"record 2","name 2","text 2"
"record 3","name 3",""

See also whats-the-most-robust-way-to-efficiently-parse-csv-using-awk.
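
Since the awk commands above do not handle newlines inside quoted fields, it may be worth checking for them first. The following is only a sketch under one assumption: the input is otherwise well-formed CSV, so every complete record holds an even number of double quotes and a physical line with an odd count means a quoted field continues across a line break. The function name find_open_quotes is hypothetical:

```shell
# Hypothetical pre-check: find_open_quotes FILE
# Prints the numbers of physical lines that hold an odd number of double
# quotes and returns non-zero if any such line is found.
find_open_quotes() {
    awk '
        { n = gsub(/"/, "&") }         # gsub returns how many quotes the line holds
        n % 2 { print NR; bad = 1 }    # odd count: a quoted field spans a line break
        END { exit bad }
    ' "$1"
}
```

If find_open_quotes file prints nothing and succeeds, every record fits on a single line and either awk command should be safe to run on it.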
