
If I do:

sed 's/match/replace/g' 

I know that sed will substitute replace for every occurrence of match on a line. But what if...?

echo "match <please dont match this?>" | sed 's/match/replace/g' 

...or...

echo "never match unless <the match is somehow delimited?>" | sed 's/match/replace/g' 

I know I can use test or branch loops to recurse the matches singly, but how can I skip sections of a line in a s///global match context?

1 Answer

The thing about sed is that it is greedy. It will gobble as much as possible for every case. This can be used to your advantage in a s///global replacement context. If you \(group\) *zero-or-more matches of a string, sed will globally gobble those first in every case. And so if you can reliably delimit a /match this/ |skip this| case you can do something like this:

sed 's/\([^<>]*<\)*\(match *\)*\(remove *\)*/\1/g
     s/.\{,45\}[^ ]*/&\
/g;  s/\(\n\) */\1/g
' <<INPUT
Never remove any match unless <the match \
you want to remove is somehow delimited.> \
And you can remove any match <per your match \
delimiter as many times as your match occurs \
within the match delimiters.>
INPUT

OUTPUT

Never remove any match unless <the you want to
is somehow delimited.> And you can remove any
match <per your delimiter as many times as your
occurs within the delimiters.>

The input there is a single line: the here-document delimiter is unquoted, so the shell removes each backslash-newline pair and joins the lines. sed then splits it on 45-character (give or take) boundaries and prints it. Still, as you can see, every occurrence of either match or remove outside a <...> boundary remains, whereas all those within are removed from the output.
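Stripped of the line wrapping, the removal step can be tried on its own. This is only a minimal sketch: the function name remove_inside and the sample input are made up for illustration, and it handles a single removal word.

```shell
# Hypothetical helper: delete the word "match" only inside <...> sections.
# The first group consumes everything up to and including the next "<",
# so that stretch is written back untouched through \1.
remove_inside() {
  sed 's/\([^<>]*<\)*\(match *\)*/\1/g'
}

printf '%s\n' 'match <drop match here> keep match <and match here>' |
  remove_inside
# prints: match <drop here> keep match <and here>
```

Note that text following the final > is not protected in this sketch, because no < comes after it for the first group to latch onto. The sample input therefore ends with a delimited section, as the example above does.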

This is a function of sed's greediness as it applies to a match occurring *zero-or-more times. It is this same greediness that makes replacements impossible to do in the same way, though that only requires an extra step or two to negate.

To get a clear picture of how this works, we can perform a replacement - which, by the way, is not often likely to be very useful if applied directly, as I mean to show:

printf '%s %s\n' '<321Nu0-9mber123>' \
                 'String321strinG'   \
                 '<321Nu0-9mber123>' \
                 'String321strinG' |
sed 's/\(<[^<>]*>\)*[0-9]*/\1!/g'

OUTPUT

<321Nu0-9mber123>! !S!t!r!i!n!g!s!t!r!i!n!G!
<321Nu0-9mber123>! !S!t!r!i!n!g!s!t!r!i!n!G!

So when sed matches the line on a global pattern it attempts to match that pattern as many times as it might while maintaining its characteristic greediness. A side-effect of greediness when a pattern for zero-or-more occurrences is specified and does not match a section of the line is that it still matches - it matches the null-string between the bytes on the portion of the line it failed to match.

Above you can see that the <...> string is unaffected, whereas the digits within String... have not only disappeared: sed has also inserted a bang for each character. This reflects sed's match for the null-string each time. It is for this reason that this technique is useful for globally delimiting a match replacement instead of doing one.
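The null-string matching is easier to see in isolation. As a small sketch with a made-up input, a pattern that can never consume a character of the line still matches at every position:

```shell
# x* cannot consume any character of "abc", so it matches the empty
# string before each character and once more at the end of the line.
printf '%s\n' 'abc' | sed 's/x*/-/g'
# prints: -a-b-c-
```

Every dash in the output marks one empty match, just as every bang does in the example above.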

And here's how that can work:

printf '%s\t%s\n' '<321Nu0-9mber123>' \
                  'String321strinG'   \
                  '<321Nu0-9mber123>' \
                  'String321strinG' |
sed 's/[0-9]/&\n/g;s/\(<[^<>]*>\)*\n*/\1/g;y/\n/0/'

OUTPUT

<302010Nu00-90mber102030>	String321strinG
<302010Nu00-90mber102030>	String321strinG

This appends a zero to every digit that occurs within < and > - which is a fairly simple case - but, in truth, you can use the \newline character in that way to perform global replacements for any match. The basic principle is:

  1. Do sed 's/match/&\n/g'
  2. Then do sed 's/\(match group\)*\n*/\1/g'
  3. Last do sed 's/match\n/replace/g'
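Put together, the three steps might look like the sketch below. The function name replace_inside and the cat/dog words are invented for illustration. It also assumes GNU sed, because it relies on \n in the replacement text producing a newline.

```shell
# Hypothetical helper: replace "cat" with "dog" only inside <...> sections.
# Assumes GNU sed (\n in the replacement is taken as a newline).
replace_inside() {
  sed 's/cat/&\n/g
       s/\(<[^<>]*>\)*\n*/\1/g
       s/cat\n/dog/g'
  # 1st command: mark every "cat" with a trailing newline
  # 2nd command: drop the markers everywhere except inside <...>
  # 3rd command: replace only the occurrences that are still marked
}

printf '%s\n' 'cat <cat and cat> cat' | replace_inside
# prints: cat <dog and dog> cat
```

The newline is a safe marker because sed reads input a line at a time, so the pattern space holds no newline until the script adds one. Since \1 holds only the last repetition of the group, directly adjacent sections such as <a><b> would lose all but the last one; this sketch assumes the sections are separated by other text.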

Admittedly these examples demo only flat lists - < always precedes >. Nests need consideration too. They are harder - sometimes far harder - but, well...

sed 's/\([{}]\)\([^{}]*[{}]*\1\)*/\n<&>/g
' <<\INPUT
{{{1!}{2!}{3!}}}outside!{{{4!}}{{5!}}}
INPUT

OUTPUT

<{{{1!}{2!}{>3!
<}}}>outside!
<{{{4!}}{{>5!
<}}}>

It serializes groups on newlines. It works by alternating the delimiter it matches per match group while stacking as many delimiters of the same kind as it can, twice in a row or more, and as a side-effect it winds up comparing opens to closes. That said, for the sake of simplicity, the rest of this will assume that any reader will use a similar means to prepare input and that nests are not a problem.
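The same command can be traced by hand on a smaller input. This is a sketch only: the function name serialize_nests and the sample string are made up, and GNU sed is assumed for the \n in the replacement.

```shell
# Hypothetical wrapper around the nest-serializing substitution.
# Assumes GNU sed (\n in the replacement is taken as a newline).
serialize_nests() {
  sed 's/\([{}]\)\([^{}]*[{}]*\1\)*/\n<&>/g'
}

printf '%s\n' '{a{b}c}' | serialize_nests
# prints an empty line, then:
# <{a{>b
# <}c}>
```

The first match starts at an opening brace and can only extend to further opening braces, so it stops at the second { and leaves b behind. The second match does the same for closing braces.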

Essentially the operative idea to all of this is match precedence. The first example worked by attempting to match any group of non-delimiter characters immediately preceding an open-delimiter before attempting to match the removal strings. It stands to reason that if the first group matches then when the substitution completes the entire matched group can only be replaced with itself - and this is what can make replacements difficult. Removals are more simple because when you match them you simply leave them out of the substitution and all is well.

Also sed values certain types of patterns more than others. It is important to understand that when you do this any definitely specified pattern will always carry more weight than does a *zero-or-more case. So when you use these for global patterns use only * or don't use them at all - or you might end up skipping no groups at all.
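A small comparison, using a made-up input, shows what goes wrong. The two commands differ only in whether the digits are optional:

```shell
# Zero-or-more digits: at the start of the line the <...> group is the
# longest thing the pattern can match, so it is consumed and put back.
printf '%s\n' '<12> x34y' | sed 's/\(<[^<>]*>\)*[0-9]*/\1/g'
# prints: <12> xy

# One-or-more digits: the pattern now needs a digit, and none follows
# the group, so sed moves on and matches the digits inside <...>.
printf '%s\n' '<12> x34y' | sed 's/\(<[^<>]*>\)*[0-9]\{1,\}/\1/g'
# prints: <> xy
```

In the second command no group is skipped at all, which is the failure described above.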

And that's how you do that with sed.
