
We can download the source of the page using wget or curl, but I want to extract the page source without the tags; that is, extract it as plain text.


3 Answers


You can pipe to a simple sed command:

curl www.gnu.org | sed 's/<\/*[^>]*>//g' 
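To see what that sed pattern does, here is a quick local check on a literal snippet (no network needed). Note that this regex approach is a rough filter: it will not handle tags split across lines, HTML comments, or `>` characters inside attribute values.

```shell
# Strip anything of the form <...> or </...> from a sample HTML string.
echo '<p>Hello, <b>world</b>!</p>' | sed 's/<\/*[^>]*>//g'
# prints: Hello, world!
```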



Using curl, wget, and a local Apache Tika server, you can convert HTML to plain text directly from the command line.

First, you have to download the tika-server jar from the Apache site: https://tika.apache.org/download.html

Then, run it as a local server:

$ java -jar tika-server-1.12.jar 

After that, you can send documents for parsing to the following URL:

http://localhost:9998/tika

Now, to convert the HTML of a webpage into plain text:

$ wget -O test.html YOUR-HTML-URL && curl -H "Accept: text/plain" -T test.html http://localhost:9998/tika

That should return the webpage text without tags.

This way, wget downloads and saves the desired webpage to "test.html", and curl then sends it to the Tika server to extract the text. Note that the "Accept: text/plain" header is necessary because Tika can return several formats, not just plain text.



Create a Ruby script that uses Nokogiri to parse the HTML:

require 'nokogiri'
require 'open-uri'

# On Ruby 3.0+ use URI.open; the bare `open` shortcut for URLs was removed from open-uri
html = Nokogiri::HTML(URI.open('https://stackoverflow.com/questions/6129357'))
text = html.at('body').inner_text
puts text


It would probably be simple to do in JavaScript or Python if you're more comfortable with those, or you could search for an HTML-to-text utility. I imagine it would be very difficult to do this purely in bash.
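That said, a rough tag stripper can be sketched in bash alone using extglob patterns in parameter expansion. This is just an illustration, with the same limitations as the sed regex (no multi-line tags, no `>` inside attribute values):

```shell
shopt -s extglob                      # enable extended globs like *( ... )
html='<p>Hello, <b>world</b>!</p>'
# Delete every <...> sequence: '<', any run of non-'>' characters, then '>'
text="${html//<*([^>])>/}"
echo "$text"
# prints: Hello, world!
```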

See also: bash command to convert html page to a text file

2 Comments

I said 'Using Bash', not Ruby.
Good luck with using only bash :) – see my edit and the link to another post.
