
For example, take this web page:

https://www.architecture.com/FindAnArchitect/FAAPractices.aspx?display=50

I need the name of each firm, along with its address and website. I tried the following to convert the HTML to text:

    import nltk
    from urllib import urlopen

    url = "https://www.architecture.com/FindAnArchitect/FAAPractices.aspx?display=50"
    html = urlopen(url).read()
    raw = nltk.clean_html(html)
    print(raw)

But it returns the error:

    ImportError: cannot import name 'urlopen'

1 Answer


Peter Wood has already answered the import problem (link): in Python 3, urlopen lives in urllib.request.

    import urllib.request

    uf = urllib.request.urlopen(url)
    html = uf.read()
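Note that `read()` returns bytes, not text. As a minimal sketch, assuming you want a decoded string, the call can be wrapped in a small helper. The name `fetch_html`, the 10-second timeout and the UTF-8 fallback are all illustrative choices, not part of the original answer.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Fetch `url` and return the body decoded to str (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Use the charset declared in the Content-Type header, if any;
        # falling back to UTF-8 is an assumption.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Usage would look like `html = fetch_html(url)`, with `url` being the address from the question.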

But if you want to extract data (such as the name of the firm, its address and website), you will need to fetch the HTML source and parse it with an HTML parser.

I'd suggest using requests to fetch the HTML source and BeautifulSoup to parse it and extract the text you need.

Here is a small snippet that will give you a head start.

    import requests
    from bs4 import BeautifulSoup

    link = "https://www.architecture.com/FindAnArchitect/FAAPractices.aspx?display=50"
    html = requests.get(link).text

    # If you do not want to use requests, you can fetch the page with urllib
    # instead (the snippet above). It should not cause any issue.
    soup = BeautifulSoup(html, "lxml")
    res = soup.find_all("article", {"class": "listingItem"})
    for r in res:
        print("Company Name: " + r.find('a').text)
        print("Address: " + r.find("div", {'class': 'address'}).text)
        print("Website: " + r.find_all("div", {'class': 'pageMeta-item'})[3].text)
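The snippet above assumes every listing contains a link, an address block and at least four `pageMeta-item` divs; if one is missing, `find()` returns `None` or the `[3]` index raises an error. Below is a more defensive sketch of the same extraction. The sample markup is made up to mimic those class names (the real page may be structured differently), the helper names `text_or_none` and `parse_listings` are hypothetical, and the built-in `html.parser` is used only so the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the class names used above; the real page may differ.
SAMPLE_HTML = """
<article class="listingItem">
  <a href="/practice/1">Example Architects</a>
  <div class="address">1 Sample Street, London</div>
  <div class="pageMeta-item">Tel</div>
  <div class="pageMeta-item">Fax</div>
  <div class="pageMeta-item">Email</div>
  <div class="pageMeta-item">www.example.com</div>
</article>
<article class="listingItem">
  <a href="/practice/2">No Website Studio</a>
  <div class="address">2 Sample Road, Leeds</div>
</article>
"""


def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None


def parse_listings(html):
    """Collect name, address and website for each listingItem article."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all("article", class_="listingItem"):
        meta = item.find_all("div", class_="pageMeta-item")
        rows.append({
            "name": text_or_none(item.find("a")),
            "address": text_or_none(item.find("div", class_="address")),
            # Index 3 follows the snippet above; guard against shorter lists.
            "website": text_or_none(meta[3]) if len(meta) > 3 else None,
        })
    return rows


listings = parse_listings(SAMPLE_HTML)
for row in listings:
    print(row)
```

Returning a list of dicts rather than printing inside the loop makes it easier to write the results to CSV or a database later.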

1 Comment

Nice, but what if I open an arbitrary site and want to extract only the important text? For example, I want to skip the text of the menu and the end of the page and get just the main topic directly.
