We can download the source of a page using wget or curl, but I want to extract the page's source without the tags. That is, I want to extract it as plain text.
- Possible duplicate of bash command to convert html page to a text file – Leventix, Mar 3, 2016 at 16:25
3 Answers
Using curl, wget, and an Apache Tika server running locally, you can convert HTML to plain text directly from the command line.
First, you have to download the tika-server jar from the Apache site: https://tika.apache.org/download.html
Then, run it as a local server:
$ java -jar tika-server-1.12.jar
This starts Tika listening on http://localhost:9998. Now, to parse the HTML of a webpage into plain text:
$ wget -O test.html YOUR-HTML-URL && curl -H "Accept: text/plain" -T test.html http://localhost:9998/tika
That should return the webpage text without tags.
This way you use wget to download and save the desired webpage as "test.html", then use curl to send it to the Tika server, which extracts the text. Note that the "Accept: text/plain" header is necessary because Tika can return several formats, not just plain text.
Create a Ruby script that uses Nokogiri to parse the HTML:
require 'nokogiri'
require 'open-uri'

# Fetch the page and parse it; on Ruby 3+ use URI.open, since the bare
# Kernel#open shortcut for URLs was removed.
html = Nokogiri::HTML(URI.open('https://stackoverflow.com/questions/6129357'))
text = html.at('body').inner_text
puts text
It would probably be just as simple in JavaScript or Python if you're more comfortable with those, or you could search for an html-to-text utility. I imagine it would be very difficult to do this purely in bash.
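As a sketch of the Python route this answer mentions, the standard library's html.parser can strip tags with no third-party dependency. The class and function names here are illustrative, not from the original answer, and this skips <script> and <style> bodies but does no layout-aware formatting the way Tika does:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # > 0 while inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the text content of an HTML string, tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    return "".join(parser.parts)


print(html_to_text("<html><body><p>Hello, <b>world</b>!</p></body></html>"))
```

To run it against a live page, you could pair it with urllib.request.urlopen, though a real HTML-to-text tool will handle encodings and whitespace far better.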