
I'm trying to write a small bash script that:

  • fetches an HTML file from the web with wget every [x] minutes
  • uses some Linux utility to find the differences in the file between the last two updates
  • uses sed to modify the lines on which new text was detected

The problem I am running into is that the HTML file uses in-line CSS to format a table, but the actual code for the page is stored on one long line.

Effectively, I need a Linux utility that can scan through a single line of code, find every piece of text that sits between a pair of tags, and put each of those pieces on its own line. That should make scanning the text easier. Every tool I've tried works on a per-line basis, which can't do what I need because the entire page is stored on a single line.

1 Answer


You could first split the content into lines, by substituting (say) > with >\n. That will break up the document on the end of each HTML tag.
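As a rough illustration of that substitution, here is a minimal sketch. It assumes GNU sed, which turns \n in the replacement text into a real newline; other sed implementations may not. The sample markup and the variable names are made up for the example.

```shell
# Made-up single-line HTML standing in for the downloaded page.
html='<table><tr><td>alpha</td><td>beta</td></tr></table>'

# Put a newline after every ">" so that each tag ends its own line.
# Assumption: GNU sed, where \n in the replacement means a newline.
split=$(printf '%s\n' "$html" | sed 's/>/>\n/g')

printf '%s\n' "$split"
```

Each cell's text then ends up at the start of a line such as alpha</td>, which line-based tools like grep and diff can work with.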

Maybe you don't even need to do that: you can set awk's RS variable so that the record separator is ">" instead of a newline. See this page for an example of using RS: http://www.thegeekstuff.com/2010/01/8-powerful-awk-built-in-variables-fs-ofs-rs-ors-nr-nf-filename-fnr/
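A small sketch of the RS idea, under the assumption that the page contains no literal ">" outside of tags. awk removes the separator from each record, so ORS is set here to put the ">" back along with a newline. The sample markup and variable names are invented for the example.

```shell
# Made-up single-line HTML standing in for the downloaded page.
html='<table><tr><td>alpha</td><td>beta</td></tr></table>'

# RS=">" makes awk read one record per tag ending instead of one per line.
# ORS=">\n" restores the ">" that awk strips and ends each record with a newline.
# The NF pattern skips whitespace-only records, such as the file's final newline.
records=$(printf '%s\n' "$html" | awk 'BEGIN { RS = ">"; ORS = ">\n" } NF { print }')

printf '%s\n' "$records"
```

Using a single-character RS like this keeps the command within what POSIX awk specifies, so it should not depend on a particular awk implementation.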


5 Comments

I'm looking at the RS variable now. As for your first example, should I use sed to replace each "</td>" tag with "</td>\n"?
Take <a>some text</a> as an example. If you set RS to ">" you'll get two records from that one line: <a and some text</a (awk drops the ">" separators themselves). However, if your text can contain ">", it'll pickle things a little.
Taking John's advice, I tried sed -i 's/<\/tr>/<\/tr>\n/g' file.html and it did the trick! Regular expressions are confusing.
Yes, for example you could use sed to add newlines after each closing tag you're interested in. Note that most versions of sed do not make this particularly easy, so see this other answer for how to do that: stackoverflow.com/questions/6111679/insert-linefeed-in-sed
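One portable way to do this, sketched below, is to write a backslash followed by an actual line break inside the replacement, which POSIX sed accepts. The sample rows and variable names are made up for illustration.

```shell
# Made-up single-line HTML with two table rows.
html='<tr><td>a</td></tr><tr><td>b</td></tr>'

# A backslash followed by a literal line break inserts a newline in the
# replacement; this form does not rely on GNU sed's \n extension.
rows=$(printf '%s\n' "$html" | sed 's/<\/tr>/<\/tr>\
/g')

printf '%s\n' "$rows"
```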
Regarding that sed expression: you can use characters other than slash to delimit sed commands (the character right after the s sets what sed expects for the rest of the delimiters, so you can use almost anything except a backslash or a newline). So you may find this more readable: s@</tr>@</tr>\n@g.
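Tying this back to the original goal, here is a rough sketch that splits two saved snapshots into one row per line using the @ delimiter and then asks diff for the rows that are new. It assumes GNU sed for the \n in the replacement. The file names, the sample rows, and the split_rows helper are all made up, and the two local files stand in for successive wget downloads.

```shell
#!/usr/bin/env bash
# Split a one-line HTML file into one table row per line.
# "@" is the s-command delimiter, so "</tr>" needs no escaping.
# Assumption: GNU sed, where \n in the replacement means a newline.
split_rows() {
    sed 's@</tr>@</tr>\n@g' "$1"
}

# Two made-up snapshots standing in for successive downloads.
tmpdir=$(mktemp -d)
printf '%s\n' '<tr><td>alpha</td></tr><tr><td>beta</td></tr>' > "$tmpdir/old.html"
printf '%s\n' '<tr><td>alpha</td></tr><tr><td>beta</td></tr><tr><td>gamma</td></tr>' > "$tmpdir/new.html"

split_rows "$tmpdir/old.html" > "$tmpdir/old.rows"
split_rows "$tmpdir/new.html" > "$tmpdir/new.rows"

# diff prefixes lines found only in the second file with "> "; keep just those.
added=$(diff "$tmpdir/old.rows" "$tmpdir/new.rows" | sed -n 's/^> //p')

printf '%s\n' "$added"
rm -rf "$tmpdir"
```

With the rows on separate lines, diff reports only the row that was added instead of flagging the whole one-line file as changed.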
