    def main:
        with open(sourcefile, 'r', encoding='utf-8') as main_file:
            for line in main_file:
                htmlcontent = reader(line)

    def reader(line):
        with urllib.request.urlopen(line) as url_file:
            try:
                url_file.read().decode('UTF-8')
            except urllib.error.URLError as url_err:
                print('Error opening url: ', url, url_err)
            except UnicodeDecodeError as decode_err:
                print('Error decoding url: ', url, decode_err)
        return url_file

Hello everyone, I am pretty new to Python and I have a question about reading the HTML code from a website. I am simply trying to return the HTML code of a page. The variable line takes in URLs from a text file that has one URL per line, and the loop iterates through them. This is my code so far, but multiple errors are popping up. I know that I have to use the else clause, but I don't know how to incorporate it. I intend to use the returned HTML as the subject for regular expressions, and I want to get the HTML using the urllib.request library.

3 Comments
  • Please include the actual errors in your question. Commented Mar 13, 2018 at 2:11
  • What exactly do you want to do? There are many useful libraries available for parsing websites. Commented Mar 13, 2018 at 2:13
  • @Nils I'm trying to get the HTML code so that I can then use regex on it to find certain patterns. But first, I simply have to get the HTML from the website. I was told to use try, except, and else in case of errors when going about this. Also, I intend to do this using the urllib.request library. Commented Mar 13, 2018 at 2:23

2 Answers

2

It's easier to use the requests module. It takes one line of code:

    import requests

    # requests needs the scheme (http:// or https://) in the URL
    html = requests.get("http://www.domain.tld").text

2 Comments

Thank you, but I am trying to solve it using urllib.request!
@newbie123123, have a look at this: stackoverflow.com/questions/2018026/…
0

This reads the website content into html_content and prints it:

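    import urllib.request  # in Python 3, urlopen lives in urllib.request

    url = "http://www.domain.tld"  # urlopen needs the scheme
    seed_url = urllib.request.urlopen(url)
    html_content = seed_url.read()  # bytes; call .decode() to get text
    seed_url.close()
    print(html_content)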
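Since the question asks for urllib.request together with try, except, and else, here is a minimal sketch of how that could fit together. It assumes Python 3 and that the pages are UTF-8 encoded. The function name fetch_html and the returned dictionary are my own choices for illustration, not something the question requires. The main point is that urlopen itself can raise, so it has to sit inside the try block, and the else clause only runs when nothing went wrong.

```python
import urllib.error
import urllib.request


def fetch_html(url):
    """Return the decoded page text for url, or None if it could not be read."""
    try:
        # urlopen can raise, so the call belongs inside the try block.
        with urllib.request.urlopen(url) as url_file:
            html = url_file.read().decode("utf-8")  # assumption: UTF-8 pages
    except UnicodeDecodeError as decode_err:
        # Checked first because UnicodeDecodeError is a subclass of ValueError.
        print("Error decoding url:", url, decode_err)
    except (urllib.error.URLError, ValueError) as url_err:
        # ValueError covers URLs without a scheme, such as "www.domain.tld".
        print("Error opening url:", url, url_err)
    else:
        # Runs only when no exception was raised above.
        return html
    return None


def main(sourcefile):
    """Fetch every URL listed in sourcefile (one per line)."""
    pages = {}
    with open(sourcefile, "r", encoding="utf-8") as main_file:
        for line in main_file:
            url = line.strip()  # drop the trailing newline
            if not url:
                continue  # skip blank lines
            html = fetch_html(url)
            if html is not None:
                pages[url] = html
    return pages
```

Calling main with the path to your text file of URLs should give you a dictionary that maps each URL to its HTML as a string, which you can then pass to re.search or re.findall. URLs that fail are reported and skipped.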
