Stéphane Chazelas

You can take different approaches depending on whether awk treats RS as a single character (like traditional awk implementations do) or as a regular expression (like gawk or mawk do). Empty files also need special care, as awk tends to skip them.

gawk, mawk or other awk implementations where RS can be a regexp.

In those implementations (for mawk, beware that some OSes like Debian ship a very old version instead of the modern one maintained by @ThomasDickey), if RS contains a single character, the record separator is that character, or awk enters the paragraph mode when RS is empty, or treats RS as a regular expression otherwise.

The solution there is to use a regular expression that can't possibly be matched. Some come to mind like x^ or $x (x before the start, or after the end). However some (particularly with gawk) are more expensive than others. So far, I've found that ^$ is the most efficient one. It can only match on an empty input, but then there would be nothing to match against.

So we can do:

awk -v RS='^$' '{printf "%s: <%s>\n", FILENAME, $0}' file1 file2... 
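As a quick sanity check that each file really arrives as a single record, something like the sketch below can help. The `slurp_stats` name is made up for illustration, and it assumes `gawk` is installed (a recent mawk should behave the same way, going by the above).

```shell
# Hypothetical helper: for each non-empty file, print the per-file record
# number and the length of $0 when the file is read as one record.
# Assumes GNU awk; RS='^$' is a regexp that cannot match non-empty input.
slurp_stats() {
  gawk -v RS='^$' '
    { printf "%s: FNR=%d length=%d\n", FILENAME, FNR, length($0) }
  ' "$@"
}
```

Called as `slurp_stats file1 file2`, it should report `FNR=1` for every non-empty file, with `length` covering the whole contents including the trailing newline. Empty files print nothing at all, which is the caveat discussed next.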

One caveat though is that it skips empty files (contrary to perl -0777 -n). That can be addressed with GNU awk by putting the code in an ENDFILE statement instead. But we also need to reset $0 in a BEGINFILE statement, as it would otherwise still hold the previous file's contents after processing an empty file:

gawk -v RS='^$' ' BEGINFILE{$0 = ""} ENDFILE{printf "%s: <%s>\n", FILENAME, $0}' file1 file2... 
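To see why the BEGINFILE reset matters, here is a small sketch that reports the length of every file, empty ones included. `file_lengths` is a hypothetical name, and the code assumes GNU awk since BEGINFILE/ENDFILE are gawk extensions.

```shell
# Hypothetical helper: print "<file>: <number of characters>" for every file,
# including empty ones. Assumes GNU awk.
file_lengths() {
  gawk -v RS='^$' '
    BEGINFILE { $0 = "" }   # forget whatever the previous file left in $0
    ENDFILE   { printf "%s: %d\n", FILENAME, length($0) }
  ' "$@"
}
```

With a non-empty file followed by an empty one, the second line should report 0. Without the BEGINFILE rule, it would repeat the length of the first file.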

traditional awk implementations, POSIX awk

In those, RS is just one character, there is no BEGINFILE/ENDFILE and no RT variable, and they generally can't process the NUL character either.

You would think that using RS='\0' could work then, since they can't process input that contains the NUL byte anyway. But no: RS='\0' in traditional implementations is treated as RS=, which is the paragraph mode.

One solution can be to use a character that is unlikely to be found in the input like \1. In multibyte character locales, you can even make it byte-sequences that are very unlikely to occur as they form characters that are not assigned or non-characters like $'\U10FFFE' in UTF-8 locales. Not really foolproof though and you have a problem with empty files as well.
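A minimal sketch of the \1 variant, under the assumption that the input never contains the 0x01 byte (the `slurp_ctrl_a` name is only for illustration):

```shell
# Hypothetical helper: read a file as one record by using byte 0x01 as the
# record separator. Assumes the input never contains that byte; an empty
# file still yields no record, so nothing is printed for it.
slurp_ctrl_a() {
  awk -v RS='\1' '{ printf "%s: <%s>\n", FILENAME, $0 }' "$@"
}
```

Here $0 keeps the file's newline characters as-is, since newline is no longer the record separator.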

Another solution can be to store the whole input in a variable and to process that in the END statement at the end. That means you can process only one file at a time though:

awk '{content = content $0 RS} END{$0 = content; printf "%s: <%s>\n", FILENAME, $0}' file

That's the equivalent of sed's:

sed '
  :1
  $!{
    N;b1
  }
  ...' file1

Another issue with that approach is that if the file wasn't ending in a newline character (and wasn't empty), one is still arbitrarily added in $0 at the end (with gawk, you'd work around that by using RT instead of RS in the code above). One advantage is that you do have a record of the number of lines in the file in NR/FNR.
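The RT workaround could look like the sketch below. It is gawk-only, and `slurp_exact` is a made-up name. It prints the accumulated variable directly instead of assigning it back to $0.

```shell
# Hypothetical helper: accumulate the whole file without adding a newline
# that was not in the input. Assumes GNU awk, where RT holds the text that
# actually terminated the current record (empty for an unterminated last line).
slurp_exact() {
  gawk '
    { content = content $0 RT }
    END { printf "%s: <%s>\n", FILENAME, content }
  ' "$1"
}
```

For a file containing `a`, a newline, then `b` with no final newline, the text between `<` and `>` should end right after `b`.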
