18

I have the following text input:

lorem <a> ipsum <b> dolor <c> sit amet, consectetur <d> adipiscing elit <e>, sed do eiusmod <f> tempor incididunt ut 

As seen in the text, the number of occurrences of <?> is not fixed: it can appear zero or more times on the same line.

Using only awk, I need to output this:

<a> <b> <c> <d> <e> <f> 

I tried this awk script:

awk '{
    match($0,/<[^>]+>/,a);              # fill array a with matches
    for (i in a) {
        if (match(i, /^[0-9]+$/) != 0)  # ignore non numeric indices
            print a[i]
    }
}' somefile.txt

but this only outputs the first match on every line:

<a> <d> <f> 

Is there some way of doing this with match() or any other built-in function?

0

9 Answers

16

With GNU awk you could use its built-in FPAT variable, which describes fields by what they match rather than by what separates them. You could try the following awk code:

awk -v FPAT='<[^>]*>' '
NF{
  val=""
  for(i=1;i<=NF;i++){
    val=(val?val OFS:"") $i
  }
  print val
}
' Input_file

11

Assuming there are no stray angle brackets, use either < or > as a field separator and print every second field:

awk -F'[<>]' '{for (i=2; i <= NF; i += 2) {printf "<%s> ", $i}; print ""}' data 
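
The "no stray angle brackets" assumption matters, and a small experiment shows why. The sketch below wraps the same command in a shell function (every_second_field is a hypothetical name, not part of the answer) and feeds it one well-formed line and one line containing a lone ">":

```shell
# Hypothetical wrapper around the field-separator one-liner above.
every_second_field() {
    awk -F'[<>]' '{ for (i = 2; i <= NF; i += 2) printf "<%s> ", $i; print "" }' "$@"
}

# Well-formed input: every even-numbered field is the inside of a tag.
printf '%s\n' 'lorem <a> ipsum <b> dolor' | every_second_field

# A stray ">" shifts the field numbering, so plain text is reported as tags.
printf '%s\n' 'x > y <a> z' | every_second_field
```

The first call prints "<a> <b> " as intended. The second prints "< y > < z> " and never reports <a>, because the lone ">" moves the real tag into an odd-numbered field.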

10

match() doesn't work the way you think it does: each call finds only the first (leftmost) match. To find a variable number of matches you need to match() the first occurrence, strip off everything up to and including it, then match() the remainder of the line, and repeat until there are no more matches in the current line, e.g.:

awk '
{ out=sep=""                                 # init variables for new line
  while (match($0,/<[^>]+>/)) {              # find 1st match
      out=out sep substr($0,RSTART,RLENGTH)  # build up output line
      $0=substr($0,RSTART+RLENGTH)           # strip off 1st match and prep for next while() check
      sep=OFS                                # set field separator for follow-on matches
  }
  if (out) print out
}' somefile.txt

Another idea uses the split() function, e.g.:

awk '
{ n=split($0,a,/[<>]/)                # split line on dual delimiters "<" and ">"
  out=sep=""
  for (i=2;i<=n;i=i+2) {              # step through even-numbered array entries; assumes line does not contain any standalone "<" or ">" characters !!!
      out=out sep "<" a[i] ">"        # build output line
      sep=OFS
  }
  if (out) print out
}
' somefile.txt

Both of these generate:

<a> <b> <c> <d> <e> <f> 
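
The match() loop overwrites $0 as it goes, which is fine here but awkward if the rest of the script still needs the original record. As a minimal sketch, the same loop can be moved into a user-defined function that works on a copy of the string. The names extract_tags and find_all are hypothetical, and the function uses only POSIX awk features:

```shell
# Hypothetical helper: print all <...> tags found on each input line.
extract_tags() {
    awk '
    # Return every match of re in str, joined by sep.
    # out and s are local variables; str is passed by value, so $0 is untouched.
    function find_all(str, re, sep,    out, s) {
        s = ""
        while (match(str, re)) {
            out = out s substr(str, RSTART, RLENGTH)  # append current match
            str = substr(str, RSTART + RLENGTH)       # continue after the match
            s = sep
        }
        return out
    }
    {
        tags = find_all($0, "<[^>]+>", OFS)
        if (tags != "") print tags                    # skip lines without tags
    }' "$@"
}

printf '%s\n' 'lorem <a> ipsum <b> dolor' 'no tags here' 'do eiusmod <f> tempor' | extract_tags
```

With the three sample lines this prints "<a> <b>" and "<f>", and the line without tags produces no output.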

9

I would use GNU AWK for this task in the following way. Let the content of file.txt be

lorem <a> ipsum <b> dolor <c> sit amet, consectetur <d> adipiscing elit <e>, sed do eiusmod <f> tempor incididunt ut 

then

awk 'BEGIN{FPAT="<[^>]*>"}{$1=$1;print}' file.txt 

gives output

<a> <b> <c> <d> <e> <f> 

Explanation: I tell GNU AWK that a field is < followed by zero or more (*) characters other than > ([^>]) followed by >. For each line I do $1=$1 to force the record to be rebuilt, so the line now consists of the found fields joined by spaces (the default OFS), which I then print.

(tested in gawk 4.2.1)
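
The rebuild described in the explanation becomes easier to see with a non-default OFS. The sketch below is an illustration only: tags_joined is a hypothetical helper name, it needs GNU awk (FPAT does not exist in POSIX awk or mawk), and it adds an NF guard so that lines without any tag are skipped instead of printed as empty lines.

```shell
# Requires GNU awk (gawk). tags_joined is a hypothetical helper;
# its first argument is the string to use as OFS.
tags_joined() {
    gawk -v FPAT='<[^>]*>' -v OFS="$1" 'NF { $1 = $1; print }'
}

# Only run the demonstration when gawk is installed.
if command -v gawk >/dev/null 2>&1; then
    printf '%s\n' 'lorem <a> ipsum <b> dolor <c> sit' | tags_joined ','
fi
```

With a comma as OFS the line comes out as "<a>,<b>,<c>": the text between the fields is discarded during the rebuild and only the fields, joined by OFS, remain.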

1 Comment

Maybe streamline awk 'BEGIN { … } {$1=$1;print}' to simply awk NF=NF FPAT='<[^>]*>' while retaining all functionality of the original?

9

Here's a simple awk solution based on regexps:

awk '{ gsub(/^[^<]*|[^>]*$/,""); gsub(/>[^<]*</,"> <") } NF' 

edit: using NF instead of $0 != ""; thanks @EdMorton

For each line:

  • strip all chars from the left up to the first < (excluded) or up to the end-of-line when < isn't found.
  • strip all chars from the right up to the first > (excluded) or up to the start-of-line when > isn't found.
  • replace what's between each > and < pair with a space character.
  • print the result when it isn't empty
example
lorem <a a> ipsum <b> dolor <c> sit amet, consectetur <d> adipiscing elit <e>, sed do eiusmod <f> tempor <g>incididunt ut<h><i> h>ell<o <j> 
output
<a a> <b> <c> <d> <e> <f> <g> <h> <i> <j> 
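
To watch the individual steps from the list above, the combined gsub() can be split into separate substitutions with a print after each one. This is only a tracing sketch on a made-up input line; trace_steps is a hypothetical name and the real one-liner does not print intermediate results.

```shell
# Hypothetical tracing variant of the regexp-based solution.
trace_steps() {
    awk '{
        print "input : " $0
        sub(/^[^<]*/, "");       print "step 1: " $0   # strip text before the first "<"
        sub(/[^>]*$/, "");       print "step 2: " $0   # strip text after the last ">"
        gsub(/>[^<]*</, "> <");  print "step 3: " $0   # collapse text between tags to one space
    }' "$@"
}

printf '%s\n' 'foo <a> bar <b> baz' | trace_steps
```

For the sample line the three steps leave "<a> bar <b> baz", then "<a> bar <b>", and finally "<a> <b>".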

Remark: With exactly the same logic you can use sed:

sed 's/^[^<]*//; s/[^>]*$//; s/>[^<]*</> </g; /^$/d' 

8

Here is a simple gnu-awk alternative solution using patsplit:

awk '
n = patsplit($0, m, /<[^>]+>/) {
    for (i=1; i<=n; ++i)
        printf "%s", m[i] (i < n ? OFS : ORS)
}' file

<a> <b> <c> <d> <e> <f>

2 Comments

Exactly what I was looking for. Using FPAT is a good alternative if the only thing I'm interested in is the content of <?>. But if my records had fields separated by spaces and the bracket placeholders appeared inside those fields (e.g. "word1 w<1>d2 word"), things can get complicated. Thanks.

@anubhava: maybe change the loop condition in for (...) to just i < n; then the separator can be a constant OFS instead of having to be chosen on every iteration (i <= n and i < n have the same boolean outcome except in the last cycle of the loop, when i == n). A final print m[n] then ensures ORS is used.

5

INPUT

lorem <a> ipsum <b> dolor <c> sit amet, consectetur <d> adipiscing elit <e>, sed do eiusmod <f> tempor incididunt ut 

CODE

mawk -F'^[^<]+|[^>]+$' 'gsub(">[^<]*<","> <",$!(NF=NF))^_*/./' OFS= 

OUTPUT

<a> <b> <c> <d> <e> <f> 

4

Another option is to use gnu awk with gensub. You can capture the angle brackets with optional surrounding spaces and match the rest.

In the replacement use group 1 surrounded with a single space.

awk '{$0 = gensub(/ *(<[^>]*>) *|[^<>]+/, " \\1 ", "g"); $1=$1}1' file 

Output

<a> <b> <c> <d> <e> <f> 

0

If you really want to do it the patsplit() way, here's how to emulate that effect in other awks:

echo 'lorem <a> ipsum <b> dolor <c> sit amet, consectetur <d> adipiscing elit <e>, sed do eiusmod <f> tempor incididunt ut' | 
awk '
BEGIN { RS = "^$" }
_ = gsub(/[<][^>]*[>]/, "\4&\5") {
    split($!_, __, /((^|\5)[^\4]*)\4|\5[^\4]*$/)
    for (_ in __)
        print _, __[_]
}'

1 
2 <a>
3 <b>
4 <c>
5 <d>
6 <e>
7 <f>
8 
