edited Aug 19, 2016 at 15:23


You can take different approaches depending on whether awk treats RS as a single character (like traditional awk implementations do) or as a regular expression (like gawk or mawk do). Empty files also need special consideration, as awk tends to skip them.

`gawk`, `mawk` or other `awk` implementations where `RS` can be a regexp.

In those implementations (for mawk, beware that some OSes like Debian ship a very old version instead of the modern one maintained by @ThomasDickey), if RS contains a single character, the record separator is that character; if RS is empty, awk enters paragraph mode; otherwise, RS is treated as a regular expression.

The solution there is to use a regular expression that can't possibly be matched. Some come to mind like x^ or $x (x before the start, or after the end). However some (particularly with gawk) are more expensive than others. So far, I've found that ^$ is the most efficient one. It can only match on an empty input, but then there would be nothing to match against.

So we can do:

awk -v RS='^$' '{printf "%s: <%s>\n", FILENAME, $0}' file1 file2...

One caveat though is that it skips empty files (contrary to perl -0777 -n). That can be addressed with GNU awk by putting the code in an ENDFILE statement instead. But we also need to reset $0 in a BEGINFILE statement, as it would otherwise not be reset after processing an empty file:

gawk -v RS='^$' ' BEGINFILE{$0 = ""} ENDFILE{printf "%s: <%s>\n", FILENAME, $0}' file1 file2...

traditional `awk` implementations, POSIX `awk`

In those, RS is just one character. They have neither BEGINFILE/ENDFILE nor the RT variable, and they generally can't process the NUL character either.

You would think that RS='\0' could work then, since they can't process input that contains the NUL byte anyway, but no: traditional implementations treat RS='\0' as RS=, which is paragraph mode.
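To illustrate why ending up in paragraph mode does not slurp the file, here is a minimal sketch of what an empty RS does. The `show_paragraphs` name and the sample input are made up for illustration; only the behaviour of an empty RS comes from POSIX awk.

```shell
# Hypothetical helper: print each record awk sees when RS is empty
# (paragraph mode, where blank lines separate records).
show_paragraphs() {
  awk -v RS= '{printf "<%s>\n", $0}' "$@"
}

# Sample input: two blocks separated by one blank line.
printf 'a\nb\n\nc\n' | show_paragraphs
```

The input is split into two records (`a` and `b` together, then `c`) rather than being read as a single one.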

One solution is to use a character that is unlikely to be found in the input, like \1. In multibyte character locales, you can even use byte sequences that are very unlikely to occur because they form unassigned characters or non-characters, like $'\U10FFFE' in UTF-8 locales. That's not really foolproof though, and you still have a problem with empty files.
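As a rough sketch of the unlikely-character approach, the following uses \1 (byte 0x01) as the record separator. The `slurp_soh` name and the temporary sample file are assumptions for illustration, and the result is only correct as long as the input really does not contain that byte.

```shell
# Hypothetical helper: read each file as one record by using byte 0x01,
# assumed to be absent from the input, as the record separator.
slurp_soh() {
  awk -v RS='\1' '{printf "%s: <%s>\n", FILENAME, $0}' "$@"
}

# Usage sketch with a temporary two-line file.
tmp=$(mktemp) && printf 'foo\nbar\n' > "$tmp"
slurp_soh "$tmp"
rm -f "$tmp"
```

Since the file's final newline is not a separator here, it stays in $0 as part of the record.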

Another solution is to store the whole input in a variable and process it in the END statement. That means you can process only one file at a time though:

awk '{content = content $0 RS} END{$0 = content; printf "%s: <%s>\n", FILENAME, $0}' file

That's the equivalent of sed's:

sed '
  :1
  $!{
    N;b1
  }
  ...' file1

Another issue with that approach is that if the file doesn't end in a newline character (and isn't empty), one is still arbitrarily added to $0 at the end (with gawk, you can work around that by using RT instead of RS in the code above). One advantage is that you do have a record of the number of lines in the file in NR/FNR.
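Since the END-based approach handles one file per awk invocation, one possible way to cover several files is to loop in the shell. This is only a sketch: the `slurp_each` wrapper is a hypothetical name, and it assumes file names that awk won't mistake for options or variable assignments.

```shell
# Hypothetical wrapper: run the END-based slurp once per file, because a
# single awk invocation can only accumulate one file this way.
slurp_each() {
  for file in "$@"; do
    awk '{content = content $0 RS}
         END{printf "%s: <%s>\n", FILENAME, content}' "$file"
  done
}
```

Called as `slurp_each file1 file2`, it prints one `name: <content>` entry per file. A file lacking a final newline still gets one appended, as noted above.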

answered Aug 19, 2016 at 12:20

Stéphane Chazelas
