1

Suppose I have an HTML input line like

<li>this is a html input line</li> 

I want to filter all such lines from a file, i.e. those that begin with <li> and end with </li>. My idea was to search for the pattern <li> in the first field and the pattern </li> in the last field using the awk command below:

awk '$1 ~ /\<li\>/ ; $NF ~ /\</li\>/ {print $0}' 

but it looks like either there is no provision to match two fields at a time or I am making a syntax mistake. Could you please help me here?

PS: I am working on a Solaris SunOS machine.

3
  • A single regex would work just fine. awk '/^ *<li>.*<\/li> *$/' - I guess needlessly escaping the wedges is your actual error. Notice also how $0 is the default for print and printing is the default in the absence of an explicit action. Commented Aug 6, 2016 at 18:28
  • Adapting your code, use && in place of ; (and escape only the /): awk '$1 ~ /<li>/ && $NF ~ /<\/li>/ {print $0}' Commented Aug 6, 2016 at 18:34
  • Good that you mentioned that you're working on Solaris, as that often adds special 'Solarisisms' to the answer you'll need. Be sure to mention that in any future Qs you post. Good luck. Commented Aug 6, 2016 at 21:57

2 Answers

3

There's a lot going wrong in your script on Solaris:

awk '$1 ~ /\<li\>/ ; $NF ~ /\</li\>/ {print $0}' 
  1. The default awk on Solaris (and so the one we have to assume you are using, since you didn't state otherwise) is old, broken awk, which must never be used. On Solaris use /usr/xpg4/bin/awk. There's also nawk, but it has fewer POSIX features (e.g. no support for character classes).
  2. \<...\> are gawk-specific word boundaries. There is no awk on Solaris that would recognize those. If you were just trying to get literal characters then there's no need to escape them as they are not regexp metacharacters.
  3. If you want to test for condition 1 and condition 2 you put && between them, not ; which is just the statement terminator in lieu of a newline.
  4. The default action given a true condition is {print $0} so you don't need to explicitly write that code.
  5. / is the awk regexp delimiter so you do need to escape that in mid-regexp.
  6. The default field separator is white space so in your posted sample input $1 and $NF will be <li>this and line</li>, not <li> and </li>.
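
Point 6 is easy to check for yourself. A minimal sketch, assuming a POSIX awk is available; the AWK variable is only a convenience for this illustration and is not part of the original question:

```shell
# AWK is a hypothetical convenience variable; on Solaris set AWK=/usr/xpg4/bin/awk
AWK=${AWK:-awk}

# Print the first and last whitespace-separated fields of the sample line
fields=$(echo '<li>this is a html input line</li>' | "$AWK" '{print $1; print $NF}')
echo "$fields"
```

The two printed fields carry the neighbouring words along with the tags, which is why a test for a field that is exactly <li> or </li> can never succeed on this input.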

So if you DID for some reason need to compare multiple fields you could do:

awk '($1 ~ /^<li>.*/) && ($NF ~ /.*<\/li>$/)' 

but this is probably what you really want:

awk '/^<li>.*<\/li>/' 

in which case you could just use grep:

grep '^<li>.*</li>' 
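
As a rough comparison of the three commands above, here is a sketch run against made-up sample lines (the sample text and the AWK variable are assumptions for illustration only). It also shows a variant with a trailing $ for the case where nothing may follow the closing tag:

```shell
# AWK is a hypothetical convenience variable; on Solaris set AWK=/usr/xpg4/bin/awk
AWK=${AWK:-awk}

# Made-up input: one clean item, one with text after </li>, one never closed
sample=$(printf '%s\n' \
  '<li>first item</li>' \
  '<li>second item</li> trailing text' \
  '<li>no closing tag')

# Field-based test: first field starts with <li>, last field ends with </li>
field_out=$(printf '%s\n' "$sample" | "$AWK" '($1 ~ /^<li>.*/) && ($NF ~ /.*<\/li>$/)')

# Whole-line tests: awk and grep with the same regexp (no end anchor)
line_out=$(printf '%s\n' "$sample" | "$AWK" '/^<li>.*<\/li>/')
grep_out=$(printf '%s\n' "$sample" | grep '^<li>.*</li>')

# Same grep with a trailing $ so the line must end right after </li>
strict_out=$(printf '%s\n' "$sample" | grep '^<li>.*</li>$')

printf 'fields:\n%s\nwhole line:\n%s\ngrep:\n%s\nanchored:\n%s\n' \
  "$field_out" "$line_out" "$grep_out" "$strict_out"
```

On this input the awk and grep whole-line forms agree with each other and also keep the line with trailing text, while the field-based form and the $-anchored grep keep only the first line. Which behaviour is wanted depends on whether text after </li> should disqualify a line.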

1 Comment

Thank you, Ed. This worked. I had not thought of the 2nd and 3rd methods in your answer as I was stuck on the first one. Thanks, everyone, for taking the time to help me.
1

Why not just use a regex to match the start and end of the line, like:

awk '/^[[:space:]]*<li>.*<\/li>[[:space:]]*$/ {print}' 

though in general, if you're trying to process HTML, you'll be better off using a tool that's really designed to handle that.
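
To see what the [[:space:]]* parts of the awk command in this answer buy you, here is a small sketch on made-up input with indented lines. It assumes an awk that supports POSIX character classes (on Solaris, /usr/xpg4/bin/awk rather than nawk); the AWK variable is a convenience for this illustration only:

```shell
# AWK is a hypothetical convenience variable; on Solaris set AWK=/usr/xpg4/bin/awk
AWK=${AWK:-awk}

# Made-up input: space-indented item, tab-indented item with a trailing tab,
# and an item that is never closed
input=$(printf '  <li>indented with spaces</li>\n\t<li>indented with a tab</li>\t\n<li>no closing tag\n')

# Whitespace-tolerant match: optional blanks before <li> and after </li>
tolerant_out=$(printf '%s\n' "$input" | "$AWK" '/^[[:space:]]*<li>.*<\/li>[[:space:]]*$/')

# Strict match for contrast: <li> must be the very first thing on the line
strict_out=$(printf '%s\n' "$input" | "$AWK" '/^<li>.*<\/li>$/')

printf 'tolerant:\n%s\nstrict:\n%s\n' "$tolerant_out" "$strict_out"
```

The tolerant pattern keeps both indented items, while the strict one matches nothing here because every closed item is preceded by whitespace.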

1 Comment

@EdMorton Thanks for the heads up, fixed to be more POSIXy
