BASH: How to extract text from URLs between START / STOP pattern [duplicate]

Question

For a list of URLs, I'd like to extract a year, e.g. 2022, which appears between tags like these:

<td class="text" style="border-right:0;"> 2022 </td>

How can I store '2022' locally without saving the web page itself?

I don't know what your question is; please clarify it first. What do you want to do: store the page, or find a pattern? You talk about extracting from a URL but show HTML. Where does the HTML come from? Is this a web page?
– Yanjan. Kaf., Commented Oct 17, 2023 at 10:01
Using bash, I'd like to fetch web pages one by one from a longer list of URLs. The bash script should extract (sed? grep? awk?) the year between the <td> ... </td> tags as shown in the OP and store the number locally.
– StOMicha, Commented Oct 17, 2023 at 10:06
It needs a concrete example or scenario; your question is too general.
– Ömer Sezer, Commented Oct 17, 2023 at 10:11
A similar example: stackoverflow.com/questions/18086468/… Imagine the START and STOP patterns as <div class="post-header"> and </div>. I need to store "Test post" locally.
– StOMicha, Commented Oct 17, 2023 at 10:31

Ömer Sezer · Accepted Answer · 2023-10-17 10:41:45Z

This is a simple code sample that illustrates the idea; adjust the sed -n expression to match your own search pattern.

index.html (the markup you mentioned in the post):

<td class="text" style="border-right:0;">2022</td>

Sample Code with Bash:

    #!/bin/bash

    urls=(
        "file:/home/<username>/index.html"
        # "URL2"
    )

    extract_year() {
        url="$1"
        html_content=$(curl -s "$url")
        year=$(echo "$html_content" | sed -n 's/.*<td class="text" style="border-right:0;">\([0-9]\{4\}\)<\/td>.*/\1/p' | head -1)

        # store the year in a file
        if [ -n "$year" ]; then
            echo "$year" >> years.txt
        else
            echo "Year not found for $url"
        fi
    }

    for url in "${urls[@]}"; do
        extract_year "$url"
    done
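
One caveat: the markup in the question has spaces around the year (`> 2022 <`), while the sample index.html here does not, so a pattern that expects the digits to touch the tags would miss the question's version. A minimal sketch of a whitespace-tolerant variant follows. The helper name `extract_year_from_html` is hypothetical, and it reads HTML on stdin so it can be tried without a network call.

```shell
#!/bin/bash
# Sketch only: whitespace-tolerant variant of the sed extraction.
# extract_year_from_html is a hypothetical helper name, not from the answer.

extract_year_from_html() {
    # Reads HTML on stdin and prints the first four-digit year found in the
    # target cell. [[:space:]]* allows optional whitespace around the digits.
    sed -n 's/.*<td class="text" style="border-right:0;">[[:space:]]*\([0-9]\{4\}\)[[:space:]]*<\/td>.*/\1/p' | head -n 1
}

# Example with the padded markup from the question:
printf '%s\n' '<td class="text" style="border-right:0;"> 2022 </td>' | extract_year_from_html
# prints 2022
```

With a live page, the function could be fed from curl, for example `curl -s "$url" | extract_year_from_html >> years.txt`. Like any line-based regex, it assumes the opening tag, the year, and the closing tag all sit on one line.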

KamilCuk · Accepted Answer · 2023-10-17 10:53:22Z

Use an XML parser to work with XML files.

xmllint --xpath 'string(//td[@class="text"][@style="border-right:0;"])' 1.html | xargs
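
If the pages are ordinary HTML rather than well-formed XML, the XML parser may reject them. A hedged sketch of the same XPath idea using xmllint's `--html` mode is shown below. The helper name `extract_year_xpath` and the example URL are assumptions for illustration; `normalize-space()` stands in for the `xargs` trimming step.

```shell
#!/bin/bash
# Sketch only: XPath extraction with xmllint's lenient HTML parser.
# extract_year_xpath is a hypothetical helper name, not from the answer.

extract_year_xpath() {
    # Reads HTML on stdin ("-"). --html selects the HTML parser, and
    # normalize-space() trims whitespace around the cell's text.
    # Parser warnings on stderr are discarded.
    xmllint --html \
        --xpath 'normalize-space(//td[@class="text"][@style="border-right:0;"])' \
        - 2>/dev/null
}

# Usage with a live page (the URL is a placeholder):
#   curl -s "https://example.com/page.html" | extract_year_xpath >> years.txt
```

The attribute predicates must match the page's markup exactly, so a page that writes the style attribute differently would need an adjusted XPath.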
