Trying to get the html from a website

Question

    def main:
        with open(sourcefile, 'r', encoding='utf-8') as main_file:
            for line in main_file:
                htmlcontent = reader(line)

    def reader(line):
        with urllib.request.urlopen(line) as url_file:
            try:
                url_file.read().decode('UTF-8')
            except urllib.error.URLError as url_err:
                print('Error opening url: ', url, url_err)
            except UnicodeDecodeError as decode_err:
                print('Error decoding url: ', url, decode_err)
        return url_file

Hello everyone, I am fairly new to Python and have a question about reading the HTML of a website. I want to fetch the HTML with the urllib.request library and return it, so that I can then run regular expressions on it. The variable line takes its values from a text file that contains one URL per line, and the code iterates through them. This is my code so far, but it raises multiple errors. I know that I have to use an else clause with the try/except, but I don't know how to incorporate it.

What exactly do you want to do? There are many useful libraries available for parsing websites.
– Nils, Commented Mar 13, 2018 at 2:13
@Nils I'm trying to get the HTML code so that I can then use regex on it to find certain patterns. But first, I simply have to get the HTML from the website. I was told to use try, except, else in case of errors when going about this. Also, I intend to do this with the urllib.request library.
– newbie123123, Commented Mar 13, 2018 at 2:23

bigbounty · Accepted Answer · 2018-03-13 02:17:10Z

2

It's better to use the requests module, which makes this a one-liner. Note that the URL must include the scheme (http:// or https://):

    import requests

    html = requests.get("http://www.domain.tld").text


2 Comments

newbie123123 Over a year ago

Thank you, but I am trying to solve it using urllib.request!

Keyur Potdar Over a year ago

@newbie123123, have a look at this: stackoverflow.com/questions/2018026/…

rwx · Accepted Answer · 2018-03-13 02:13:48Z

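This saves the website content in html_content and prints it. On Python 3, urlopen lives in urllib.request, and the URL must include a scheme:

    import urllib.request

    url = "http://www.domain.tld"
    with urllib.request.urlopen(url) as seed_url:
        html_content = seed_url.read()  # bytes; call .decode('utf-8') for a str
    print(html_content)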
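
Since the question asks specifically about try, except, else with urllib.request, here is a minimal sketch of how the pieces might fit together. The function names fetch_html and read_all are my own and not from the question. It assumes the pages are UTF-8 encoded and that the source file holds one URL per line. The main differences from the code in the question: urlopen is called inside the try block (it is the call that raises URLError), the trailing newline is stripped from each line, and the decoded text is returned rather than the response object.

```python
import urllib.error
import urllib.request


def fetch_html(url):
    """Return the decoded body for url, or None if it could not be read."""
    url = url.strip()  # lines read from a file end with a newline
    try:
        # urlopen is inside the try block because it is what raises URLError
        with urllib.request.urlopen(url) as response:
            raw = response.read()
        html = raw.decode('utf-8')  # assumption: the page is UTF-8
    except urllib.error.URLError as url_err:
        print('Error opening url:', url, url_err)
    except UnicodeDecodeError as decode_err:
        print('Error decoding url:', url, decode_err)
    except ValueError as value_err:
        # raised for a URL without a scheme, e.g. "www.domain.tld"
        print('Invalid url:', url, value_err)
    else:
        # runs only when no exception was raised in the try block
        return html
    return None


def read_all(sourcefile):
    """Fetch every URL listed in sourcefile; return a dict of url -> html."""
    pages = {}
    with open(sourcefile, 'r', encoding='utf-8') as main_file:
        for line in main_file:
            url = line.strip()
            if not url:
                continue  # skip blank lines
            html = fetch_html(url)
            if html is not None:
                pages[url] = html
    return pages
```

The returned string can then be passed to re.findall or re.search. The else branch is optional here (the return could sit at the end of the try block), but it keeps the success path visibly separate from the error handling.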
