awk when both delimiter and quotes are used for a field

Question

I have a file in the following format:

    field1|field2|field3
    field1|"field2|field2"|field3

Notice that the second row contains double quotes. The string within the double quotes belongs to field 2. How do I extract it using awk? I've been googling with no results. I also tried this, with no luck:

FS='"| "|^"|"$' '{print $2}'

See "What's the most robust way to efficiently parse CSV using awk?".
– Ed Morton, Commented Apr 7, 2022 at 0:16

iruvar · Accepted Answer · 2015-10-23 15:39:57Z


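If you have a recent version of gawk (4.0 or later), you're in luck: there's the FPAT feature, which defines fields by their content rather than by a separator. It is documented in the gawk manual under "Defining Fields by Content".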

    awk 'BEGIN { FPAT = "([^|]+)|(\"[^\"]+\")" }
    {
        print "NF = ", NF
        for (i = 1; i <= NF; i++) {
            sub(/"$/, "", $i)
            sub(/^"/, "", $i)
            printf("$%d = %s\n", i, $i)
        }
    }' file
    NF =  3
    $1 = field1
    $2 = field2
    $3 = field3
    NF =  3
    $1 = field1
    $2 = field2|field2
    $3 = field3
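If only the second field is needed, as the question asks, the same FPAT pattern can be wrapped in a small shell function. This is a minimal sketch, assuming gawk 4.0 or later is installed under the name gawk; the function name second_field is hypothetical and not part of the original answer.

```shell
# Hypothetical helper: print field 2 of pipe-delimited input,
# treating "..." as a single field even when it contains |.
# Assumes gawk >= 4.0 (FPAT is a gawk extension).
second_field() {
  gawk 'BEGIN { FPAT = "([^|]+)|(\"[^\"]+\")" }
  {
    f = $2
    gsub(/^"|"$/, "", f)   # strip the surrounding double quotes, if any
    print f
  }' "$@"
}

# Usage: reads the named files, or standard input when none are given.
printf '%s\n' 'field1|field2|field3' 'field1|"field2|field2"|field3' | second_field
```

For the two sample rows this prints field2 and then field2|field2. Working on a copy of $2 rather than modifying $i in place leaves the record itself untouched.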


You can replace + with * in the first alternative, i.e. FPAT = "([^|]*)|(\"[^\"]+\")", to handle empty fields such as ||.
– Reza Sanaie, Commented Aug 14, 2018 at 19:12

Brilliant. However, where I'm using this on comma-separated files it doesn't cope with double quotes inside the field, so I'm using FPAT = "([^,]*)|(\"([^\"]|\"\")*\")". For the above with a pipe delimiter it would be FPAT = "([^|]*)|(\"([^\"]|\"\")*\")".
– Reg Whitton, Commented Jan 10, 2020 at 14:38

So, what if I don't have FPAT available?
– musicin3d, Commented Jan 23, 2020 at 0:23

@musicin3d, in that case take a look at Sobrique's perl solution.
– iruvar, Commented Jan 23, 2020 at 1:44
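For the case raised in the comments where FPAT is not available, one alternative is to scan each line character by character in plain awk. The following is a sketch under stated assumptions, not a full CSV parser: fields never span lines, a doubled quote ("") inside a quoted field stands for one literal quote, and the names split_quoted and second_field_portable are hypothetical.

```shell
# Hypothetical helper: print field 2 of pipe-delimited input using only
# POSIX awk features (no FPAT), honouring double-quoted fields.
second_field_portable() {
  awk '
  # Split s on the single character sep into out[1..n]; return n.
  # Quoted fields may contain sep; "" inside quotes yields one quote.
  function split_quoted(s, out, sep,    n, c, i, len, inq, cur) {
    split("", out)                      # clear entries from earlier lines
    n = 0; cur = ""; inq = 0; len = length(s)
    for (i = 1; i <= len; i++) {
      c = substr(s, i, 1)
      if (inq) {
        if (c == "\"") {
          if (substr(s, i + 1, 1) == "\"") { cur = cur "\""; i++ }
          else inq = 0                  # closing quote
        } else cur = cur c
      } else if (c == "\"") inq = 1     # opening quote
      else if (c == sep) { out[++n] = cur; cur = "" }
      else cur = cur c
    }
    out[++n] = cur                      # last field on the line
    return n
  }
  { split_quoted($0, f, "|"); print f[2] }
  ' "$@"
}

# Usage: reads the named files, or standard input when none are given.
printf '%s\n' 'field1|field2|field3' 'field1|"field2|field2"|field3' | second_field_portable
```

For the two sample rows this prints field2 and then field2|field2. The loop is slower than FPAT on large files, but it also copes with empty fields such as || and with doubled quotes, the two cases mentioned in the comments above.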


Sobrique · Accepted Answer · 2015-10-23 16:32:40Z

This is something you get in CSV: if the delimiter is part of a field, the field gets quoted. That makes parsing MUCH harder, because you can't just split on the delimiter.

Fortunately, if Perl is an option, you have the Text::CSV module, which handles this case:

    #!/usr/bin/env perl
    use strict;
    use warnings;

    use Text::CSV;

    my $csv = Text::CSV -> new ( { 'sep_char' => '|' } );

    while ( my $row = $csv -> getline ( *STDIN ) ) {
        print $row -> [1], "\n";
    }

You could probably condense this into a pipeable one-liner if you prefer, something like:

    perl -MText::CSV -e 'print map { $_ -> [1] ."\n" } @{ Text::CSV -> new ( { 'sep_char' => '|' } ) -> getline_all ( *ARGV )};'

Timothy Pulliam · Accepted Answer · 2015-10-23 15:37:37Z

You may want to preprocess this data with sed so it can be parsed by awk more easily. For example:

    $ sed 's/"//g' awktest1.txt
    field1|field2|field3
    field1|field2|field2|field3
    $ sed 's/"//g' awktest1.txt > awktest2.txt
    $ awk 'BEGIN {FS = "|"} ; {print $2}' awktest2.txt
    field2
    field2

But then again, I don't know the nature of the data you are working with.

The idea is explicitly to have field2|field2 as a single field in the second line.
– klimpergeist, Commented Oct 23, 2015 at 15:58
