When I use NF together with an FPAT regex, gawk treats the comma as a field. I want to use both NF and FPAT:

1) NF – to limit the output to the actual number of fields for the record

2) FPAT – to handle an embedded comma in a quoted field like line 3:

 "Bus Driver, City/Transit",51 

3) the awk script is used for several input files that have a different number of columns for record 6 - record 6 is the column name/header for the contents of the file...

I ran two tests: the first, test 1, uses a fixed value for the number of fields; the second, test 2, uses NF.

using gawk 4.1.4

    BEGIN {
        FPAT = "(^,)|([^,]+)|(\"[^\"]+\")"
        OFS = "\t"
    }
    NR == 6 {
        for (i = 1; 6 >= i; ++i) {
        #for (i = 1; NF >= i; ++i) {
            colName[i] = $i
            print "Column Name: " colName[i]
        }
        { print "", "number of fields: " NF }
    }

Input File starting at record 6: NR == 6 {...

    Occupation,States Licensed
    Barber,51
    "Bus Driver, City/Transit",51

The output I expect/want:

    Column Name: Occupation
    Column Name: States Licensed
    number of fields: 2

test 1: for (i = 1; 6 >= i; ++i) {...

The output is what I expect/want, except, of course, for the 4 invalid columns/fields that are shown because the loop uses a fixed value of 6.

    Column Name: Occupation
    Column Name: States Licensed
    Column Name:
    Column Name:
    Column Name:
    Column Name:
    number of fields: 2

test 2: for (i = 1; NF >= i; ++i) {...

The output is NOT what I expect/want; note that the comma is reported as a field:

    Column Name: Occupation
    Column Name: ,
    Column Name: States Licensed
    number of fields: 3
  • The problem is your regex I think - try FPAT = "\"[^\"]*\"|[^\",]*" (a possibly empty sequence of non-quotes surrounded by quotes, or a possibly empty sequence of not-comma-or-quotes). Or more readably gawk -v FPAT='"[^"]*"|[^",]*' '<stuff>' Commented Mar 2, 2019 at 1:25

1 Answer

0. Congratulations.  You seem to have found a bug in gawk.

I’ve reduced this to a very small example.  (It might be possible to demonstrate the glitch with a simpler FPAT string, but I didn’t feel like spending another ten minutes on that.)  Basically, for input like foo,bar, we can get two different results.

Case A:

    NF = 2
    $1 = foo
    $2 = bar
    $3 =

and

Case B:

    NF = 3
    $1 = foo
    $2 = ,
    $3 = bar

This code produces Case B:

    BEGIN {
        FPAT = "^,|[^,]+"
    }
    {
        print "NF =", NF
        print "$1 =", $1; print "$2 =", $2; print "$3 =", $3
    }

(I removed the parentheses from FPAT, because they aren’t needed; I removed the part of the regular expression that handles quoted strings maybe containing comma(s), and I cut the code down to a bare minimum.)

Use

echo foo,bar | awk -f name_of_the_above_awk_script

But, in gawk version 4.1.1 at least, if I access $1 before I access NF, then I get Case A.  You can demonstrate this by switching the order of the print statements, or by this ridiculous kluge:

    {
        temp = $1       # We will never use this.
        print "NF =", NF
        print "$1 =", $1; print "$2 =", $2; print "$3 =", $3
    }

This is clearly a bug; there’s no way that accessing a field should change the values of other things. 

1. So we have a work-around.

Just add temp = $1 before your for loop, and I expect you’ll get the result you want (using NF).

2. The real (?) answer:

In the above, I deliberately avoided referring to either Case A or Case B as “right” or “wrong”.  Case A is the one you want, but Case B might actually be the correct result for the value of FPAT that you’re using.  It seems to be saying that you want a field to be

  • a comma at the beginning of the string, or
  • a string of one or more characters that aren’t comma, or
  • a quote, a string of one or more characters that aren’t quote, and another quote.
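
To see what each of the three alternatives matches on its own, here is a minimal sketch using the standard awk `match()` function (it does not involve FPAT itself, so it says nothing about the NF ordering glitch above). The function name `show_alternatives` and the sample strings are only illustrative assumptions:

```shell
# Hypothetical helper: test each FPAT alternative separately with match(),
# which sets RSTART and RLENGTH to the position and length of the match.
show_alternatives() {
    awk 'BEGIN {
        s = ",bar"                               # string with a leading comma
        if (match(s, /^,/))      print "alt 1: " substr(s, RSTART, RLENGTH)

        s = "foo,bar"                            # plain unquoted fields
        if (match(s, /[^,]+/))   print "alt 2: " substr(s, RSTART, RLENGTH)

        s = "\"Bus Driver, City/Transit\",51"    # quoted field with embedded comma
        if (match(s, /"[^"]+"/)) print "alt 3: " substr(s, RSTART, RLENGTH)
    }'
}

show_alternatives
```

The first alternative matches nothing but the comma itself, which is how a bare `,` can end up as a field.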

But you don’t want a comma to be a field; you just want the second and third options.  I find that setting

FPAT = "[^,]+|\"[^\"]+\"" 

will give you the correct results.

  • Interesting and such a simple solution. The regex I used was copied from the GNU AWK website/page, "The GNU Awk User’s Guide": GNU.org except that I added the (^,)| at the start for some now unknown reason (doh!?). Thank you for your time, detailed solution and well laid-out presentation - much appreciated and I learned something. I wonder if I should let GNU.org know about the error in the manual, re: the parentheses in the regex. Would it be OK if I quote your solution? Commented Mar 2, 2019 at 16:47
  • Forgot to mention that I didn't have to add temp = $1. It works fine without it. Commented Mar 2, 2019 at 16:55
  • (1) The inclusion of the parentheses in FPAT is (as far as I can tell) unnecessary, but it’s harmless; I wouldn’t call it an error.  Adding the (^,)| at the start was the error.  But, yes, you may quote me if you say my name and link to the answer. (2) Adding temp = $1 was a kludgy workaround to get the program to produce the desired results even with the wrong FPAT.  So you don’t need to use the kludgy workaround and also fix FPAT.  Sorry I wasn’t clearer about that. Commented Mar 2, 2019 at 17:19
  • Using gawk 4.2.1 under Debian Testing, I got Case B with or without temp = $1. Commented Mar 2, 2019 at 21:36
  • @TomM While you don't (yet) have enough reputation to upvote helpful answers, I suggest you accept this answer (click on the check mark next to it). It's highly unlikely that you'll get a better one. See unix.stackexchange.com/help/someone-answers BTW, I was the one who upvoted your question because it was such a good one. Welcome to Unix & Linux! Commented Mar 9, 2019 at 16:39
