How to extract text from an HTML page?

Question

For example, take the web page at this link:

https://www.architecture.com/FindAnArchitect/FAAPractices.aspx?display=50

I need the name, address and website of each firm. I have tried the following to convert the HTML to text:

    import nltk
    from urllib import urlopen

    url = "https://www.architecture.com/FindAnArchitect/FAAPractices.aspx?display=50"
    html = urlopen(url).read()
    raw = nltk.clean_html(html)
    print(raw)

But it returns the error:

    ImportError: cannot import name 'urlopen'

You are using Python 3's urllib, which is different from Python 2's urllib.
– Open AI - Opting Out, Commented Nov 6, 2015 at 12:40
Pretty sure you're going to be disappointed once you get it working: clean_html is not implemented. See this question.
– Open AI - Opting Out, Commented Nov 6, 2015 at 12:43

Community · Accepted Answer · 2017-05-23 11:53:38Z

Peter Wood has answered your problem (link).

    import urllib.request

    uf = urllib.request.urlopen(url)
    html = uf.read()
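One detail worth knowing: in Python 3, read() returns bytes rather than str, so you will usually want to decode the result before processing it as text. Here is a minimal sketch of that step. The names fetch_html and default_charset are my own for illustration, and falling back to UTF-8 when the server declares no charset is an assumption, not something the site guarantees.

```python
import urllib.request


def fetch_html(url, default_charset="utf-8"):
    """Fetch a URL with Python 3's urllib and return the body as str.

    fetch_html and default_charset are illustrative names, not part of
    any library.
    """
    with urllib.request.urlopen(url) as response:
        raw = response.read()  # bytes, not str, in Python 3
        # Use the charset declared in the Content-Type header, if any;
        # otherwise fall back to the (assumed) default.
        charset = response.headers.get_content_charset() or default_charset
    # errors="replace" avoids a UnicodeDecodeError on stray bytes.
    return raw.decode(charset, errors="replace")
```

You would call it as html = fetch_html(url) with the url from the question.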

But if you want to extract data (such as the name of the firm, its address and its website), you will need to fetch the HTML source and parse it with an HTML parser.

I'd suggest using requests to fetch the HTML source and BeautifulSoup to parse it and extract the text you need.

Here is a small snippet which will give you a head start.

    import requests
    from bs4 import BeautifulSoup

    link = "https://www.architecture.com/FindAnArchitect/FAAPractices.aspx?display=50"
    html = requests.get(link).text

    # If you do not want to use requests, you can fetch the HTML with
    # urllib instead (the snippet above). It should not cause any issue.

    soup = BeautifulSoup(html, "lxml")
    res = soup.find_all("article", {"class": "listingItem"})
    for r in res:
        print("Company Name: " + r.find("a").text)
        print("Address: " + r.find("div", {"class": "address"}).text)
        print("Website: " + r.find_all("div", {"class": "pageMeta-item"})[3].text)
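Note that find() returns None when nothing matches, and indexing with [3] raises an IndexError when a listing has fewer than four pageMeta-item blocks, so the loop above can crash on an incomplete listing. Below is a rough, more defensive sketch that you can run offline. The SAMPLE_HTML markup is made up to mimic the class names used above (the real page may differ or may have changed), and text_or_default and parse_listings are hypothetical helper names.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the class names used in the answer; the
# real page's structure may differ.
SAMPLE_HTML = """
<article class="listingItem">
  <h3><a href="/practice/1">Example Architects Ltd</a></h3>
  <div class="address">1 Example Street, London</div>
  <div class="pageMeta-item">Tel: 000</div>
  <div class="pageMeta-item">Fax: 000</div>
  <div class="pageMeta-item">info@example.com</div>
  <div class="pageMeta-item">www.example.com</div>
</article>
<article class="listingItem">
  <h3><a href="/practice/2">No Address Studio</a></h3>
</article>
"""


def text_or_default(tag, default="n/a"):
    """Return the stripped text of a tag, or a default if the tag is None."""
    return tag.get_text(strip=True) if tag is not None else default


def parse_listings(html):
    # "html.parser" ships with Python, so lxml is not required here.
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all("article", class_="listingItem"):
        metas = item.find_all("div", class_="pageMeta-item")
        # Same position as the [3] above, but guarded against short lists.
        website_tag = metas[3] if len(metas) > 3 else None
        rows.append({
            "name": text_or_default(item.find("a")),
            "address": text_or_default(item.find("div", class_="address")),
            "website": text_or_default(website_tag),
        })
    return rows


for row in parse_listings(SAMPLE_HTML):
    print(row)
```

Returning a list of dicts instead of printing inside the loop also makes it easier to write the results to a file later.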

Nice, but what if I open a random site and want to extract only the important text? I want to skip the text of the menu and the end of the page and get just the main topic directly.
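For the follow-up about skipping menus and page footers, there is no general solution that works on every site, but a crude heuristic is to ignore text inside elements that usually hold boilerplate. The sketch below uses only the standard library's html.parser. The SKIP_TAGS set and the names MainTextParser and main_text are my own assumptions, and it only helps on pages that actually use semantic tags such as nav and footer.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as boilerplate. This list is an
# assumption; adjust it for the sites you care about.
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class MainTextParser(HTMLParser):
    """Collect text that is not nested inside any tag in SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def main_text(html):
    """Return the non-boilerplate text of an HTML string, one chunk per line."""
    parser = MainTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Calling main_text(html) on a fetched page returns the remaining text. Sites that build their menus from plain div elements will need class-based or id-based rules instead.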
