AWK print all regex matches on every line

Question

I have the following text input:

lorem <a> ipsum <b> dolor <c> sit amet, consectetur <d> adipiscing elit <e>, sed do eiusmod <f> tempor incididunt ut

As seen in the text, the number of <?> occurrences is not fixed: a line can contain zero or more of them.

Using only awk, I need to output this:

<a> <b> <c> <d> <e> <f>

I tried this awk script:

awk '{
    match($0,/<[^>]+>/,a);              # fill array a with matches
    for (i in a) {
        if (match(i, /^[0-9]+$/) != 0)  # ignore non numeric indices
            print a[i]
    }
}' somefile.txt

but this only outputs the first match on every line:

<a> <d> <f>

Is there some way of doing this with match() or any other built-in function?

RavinderSingh13 · Accepted Answer · 2022-09-05 02:17:53Z

With GNU awk, you can use its built-in FPAT variable, which describes what a field looks like rather than what separates fields. Try the following awk code:

awk -v FPAT='<[^>]*>' '
NF{
  val=""
  for(i=1;i<=NF;i++){
    val=(val?val OFS:"") $i
  }
  print val
}
' Input_file

glenn jackman · Accepted Answer · 2022-09-05 02:56:26Z

Assuming there are no stray angle brackets, use either < or > as a field separator and print every second field:

awk -F'[<>]' '{for (i=2; i <= NF; i += 2) {printf "<%s> ", $i}; print ""}' data
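
The "no stray angle brackets" assumption matters in practice. As a minimal sketch, using a made-up input line that contains one standalone > before the real placeholder, the even-numbered fields shift and <x> never appears in the output:

```shell
# Hypothetical input: a standalone ">" appears before the real <x> placeholder.
out=$(printf '%s\n' 'a > b <x> c' |
  awk -F'[<>]' '{for (i=2; i <= NF; i += 2) {printf "<%s> ", $i}; print ""}')
printf '%s\n' "$out"   # prints "< b > < c> " instead of "<x>"
```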

markp-fuso · Accepted Answer · 2022-09-04 22:04:21Z

match() doesn't work the way you think it does: it only finds the first (leftmost) match in the string. To find a variable number of matches you need to match() the pattern, strip the input up to and including that match, then match() the remainder for the next occurrence, and repeat until there are no more matches in the current line; e.g.:

awk '
{ out=sep=""                                  # init variables for new line
  while (match($0,/<[^>]+>/)) {               # find 1st match
      out=out sep substr($0,RSTART,RLENGTH)   # build up output line
      $0=substr($0,RSTART+RLENGTH)            # strip off 1st match and prep for next while() check
      sep=OFS                                 # set field separator for follow-on matches
  }
  if (out) print out
}' somefile.txt

Another idea uses the split() function, e.g.:

awk '
{ n=split($0,a,/[<>]/)                        # split line on dual delimiters "<" and ">"
  out=sep=""
  for (i=2;i<=n;i=i+2) {                      # step through even numbered array entries; assumes line does not contain any standalone "<" or ">" characters !!!
      out=out sep "<" a[i] ">"                # build output line
      sep=OFS
  }
  if (out) print out
}
' somefile.txt

Both of these generate:

<a> <b> <c> <d> <e> <f>
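
The match-and-strip loop can also be wrapped in a function so that $0 is left untouched. This is only a sketch: all_matches() is a hypothetical helper name (not an awk built-in), the three input lines are made up, and the loop assumes the regex never matches the empty string.

```shell
# all_matches() is a hypothetical helper, not part of awk.
out=$(printf '%s\n' 'lorem <a> ipsum <b> dolor' 'no placeholders here' 'x <c> y' |
awk '
# Return every match of regex re in str, joined by sep.
# Assumes re cannot match the empty string (otherwise the loop would not end).
function all_matches(str, re, sep,    res, s) {
    res = s = ""
    while (match(str, re)) {
        res = res s substr(str, RSTART, RLENGTH)
        str = substr(str, RSTART + RLENGTH)   # continue after this match
        s = sep
    }
    return res
}
{
    m = all_matches($0, "<[^>]+>", OFS)
    if (m != "") print m                      # skip lines without matches
}')
printf '%s\n' "$out"
```

Because str is passed by value, the original record and its fields remain available after the call.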

Daweo · Accepted Answer · 2022-09-05 07:53:52Z

I would use GNU AWK for this task in the following way. Let file.txt content be

lorem <a> ipsum <b> dolor <c> sit amet, consectetur <d> adipiscing elit <e>, sed do eiusmod <f> tempor incididunt ut

then

awk 'BEGIN{FPAT="<[^>]*>"}{$1=$1;print}' file.txt

gives output

<a> <b> <c> <d> <e> <f>

Explanation: I tell GNU AWK that a field is < followed by zero or more (*) characters that are not (^) >, followed by >. For each line I do $1=$1 to force the record to be rebuilt, so the line becomes the found fields joined by a space (the default OFS), which I then print.

(tested in gawk 4.2.1)
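
The record rebuild triggered by $1=$1 is not specific to FPAT. A minimal sketch with default whitespace splitting, a made-up input line, and OFS set to "-" only to make the effect visible:

```shell
# Assigning to any field makes awk rebuild $0 from the fields, joined by OFS.
out=$(printf '%s\n' 'a   b     c' | awk -v OFS='-' '{ $1 = $1; print }')
printf '%s\n' "$out"   # the runs of spaces are replaced by single "-" separators
```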

Maybe streamline awk 'BEGIN { … } {$1=$1;print}' to simply awk NF=NF FPAT='<[^>]*>' while retaining all functionality of the original?

Fravadona · Accepted Answer · 2022-09-07 19:07:00Z

Here's a simple awk solution based on regexps:

awk '{ gsub(/^[^<]*|[^>]*$/,""); gsub(/>[^<]*</,"> <") } NF'

(edit: using NF instead of $0 != ""; thanks @EdMorton)

For each line:

Strip all characters from the left up to the first < (excluded), or up to the end-of-line when < isn't found.
Strip all characters from the right back to the last > (excluded), or back to the start-of-line when > isn't found.
Replace whatever is between each > and the following < with a single space character.
Print the result when it isn't empty.

example

lorem <a a> ipsum <b> dolor <c> sit amet, consectetur <d> adipiscing elit <e>, sed do eiusmod <f> tempor <g>incididunt ut<h><i> h>ell<o <j>

output

<a a> <b> <c> <d> <e> <f> <g> <h> <i> <j>
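
To see what each substitution contributes, the two gsub() calls can be traced separately. This is a sketch on a shorter, made-up input line; the "1: " and "2: " prefixes are added only for illustration.

```shell
# Print the line after each of the two gsub() calls.
out=$(printf '%s\n' 'lorem <a> ipsum <b> dolor' |
awk '{
    gsub(/^[^<]*|[^>]*$/, ""); print "1: " $0   # leading and trailing text removed
    gsub(/>[^<]*</, "> <");    print "2: " $0   # text between placeholders collapsed
}')
printf '%s\n' "$out"
```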

Remark: With exactly the same logic you can use sed:

sed 's/^[^<]*//; s/[^>]*$//; s/>[^<]*</> </g; /^$/d'

anubhava · Accepted Answer · 2022-09-05 16:01:40Z


Here is a simple gnu-awk alternative solution using patsplit:

awk '
n = patsplit($0, m, /<[^>]+>/) {
    for (i=1; i<=n; ++i)
        printf "%s", m[i] (i < n ? OFS : ORS)
}' file

<a> <b> <c> <d> <e> <f>

edited Sep 5, 2022 at 16:01

answered Sep 5, 2022 at 6:08


2 Comments

aee Over a year ago

Exactly what I was looking for. Using FPAT is a good alternative if the only thing I'm interested in is the content of <?>. But if my records had fields separated by spaces, with the bracket placeholders inside those fields (e.g. "word1 w<1>d2 word"), things can get complicated. Thanks.
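
For the case raised in this comment, where space-separated fields must be kept and placeholders sit inside them, one possible approach is to leave FS alone and run a match() loop over each field. A minimal sketch with a made-up input line; the output format (field number, then placeholder) is only an illustration:

```shell
out=$(printf '%s\n' 'word1 w<1>d2 word3 x<2>y<3>z' |
awk '{
    for (i = 1; i <= NF; i++) {
        rest = $i                                   # work on a copy of the field
        while (match(rest, /<[^>]+>/)) {
            print i, substr(rest, RSTART, RLENGTH)  # field number and placeholder
            rest = substr(rest, RSTART + RLENGTH)   # continue after this match
        }
    }
}')
printf '%s\n' "$out"
```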

RARE Kpop Manifesto Over a year ago

@anubhava: maybe change the loop condition in for (...) to just i < n; then the separator can be the constant OFS instead of having to ask the same question twice on every iteration (i <= n and i < n have identical boolean outcomes except on the last cycle of the loop, when i == n). A final print m[n] then ensures ORS is used.
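
The loop shape suggested in this comment could look roughly like the sketch below. To keep it runnable in any awk, split() stands in for gawk's patsplit() here, and the made-up input line already contains only placeholders; with gawk, the patsplit() call from the answer would take its place.

```shell
# split() is swapped in for patsplit() purely for portability of this sketch.
out=$(printf '%s\n' '<a> <b> <c>' |
awk '(n = split($0, m, " ")) {
    for (i = 1; i < n; ++i)
        printf "%s%s", m[i], OFS   # separator is always OFS inside the loop
    print m[n]                     # last element; print appends ORS
}')
printf '%s\n' "$out"
```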

RARE Kpop Manifesto · Accepted Answer · 2022-09-08 23:12:54Z

INPUT

lorem <a> ipsum <b> dolor <c> sit amet, consectetur <d> adipiscing elit <e>, sed do eiusmod <f> tempor incididunt ut

CODE

mawk -F'^[^<]+|[^>]+$' 'gsub(">[^<]*<","> <",$!(NF=NF))^_*/./' OFS=

OUTPUT

<a> <b> <c> <d> <e> <f>

The fourth bird · Accepted Answer · 2022-09-05 17:39:41Z

Another option is to use gnu awk with gensub. You can capture the angle brackets with optional surrounding spaces and match the rest.

In the replacement use group 1 surrounded with a single space.

awk '{$0 = gensub(/ *(<[^>]*>) *|[^<>]+/, " \\1 ", "g"); $1=$1}1' file

Output

<a> <b> <c> <d> <e> <f>

RARE Kpop Manifesto · Accepted Answer · 2024-04-06 06:33:29Z

If you really want to do it the patsplit() way, here's how to emulate that effect in other awks:

echo 'lorem <a> ipsum <b> dolor <c> sit amet, consectetur <d> adipiscing elit <e>, sed do eiusmod <f> tempor incididunt ut' |
awk '
BEGIN { RS = "^$" }
_ = gsub(/[<][^>]*[>]/, "\4&\5") {
    split($!_, __, /((^|\5)[^\4]*)\4|\5[^\4]*$/)
    for (_ in __) print _, __[_]
}'

1
2 <a>
3 <b>
4 <c>
5 <d>
6 <e>
7 <f>
8
