I have a list of URLs and I want to check whether each of them is working. I want to use the Google API to search within each link, but when I run it I get the message "bad request", which seems to be because some of the links in the list do not work.

I couldn't go through all of the links, but for some of them I get this message in Google Chrome:

404. That’s an error.

The requested URL /playMsg.html was not found on this server.

Is there a way to do this? Thanks.

  • Define what it means if a link is "not working": 404? Malformed URL? Server doesn't respond? Some examples would be helpful. Commented Feb 2, 2017 at 0:31
  • @qxz so when I enter the link, I get a message "bad request" Commented Feb 2, 2017 at 0:34
  • @qxz another message is "404. That’s an error. The requested URL /playMsg.html was not found on this server." Commented Feb 2, 2017 at 0:36
  • Do it exactly like you're doing it: try to fetch the URL, and then examine the response status code. Every request to an HTTP server returns a status code. These are all documented. Start here: en.wikipedia.org/wiki/List_of_HTTP_status_codes Commented Feb 2, 2017 at 0:36
  • @BryanOakley Do you suggest import urllib2; response = urllib2.urlopen('python.org/'); html = response.read() and then find the error message? Commented Feb 2, 2017 at 0:38
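
The cases raised in these comments (a malformed URL versus an HTTP error status such as 400 or 404) can be told apart in code. Below is a minimal sketch, not anyone's posted code: it assumes Python 3, and the helper names is_well_formed and describe_status are made up for illustration. It only inspects a URL string and a status code, so it makes no network requests.

```python
from http import HTTPStatus
from urllib.parse import urlsplit


def is_well_formed(url):
    """Return True if url has an http(s) scheme and a host part."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def describe_status(code):
    """Return a short description such as '404 Not Found (client error)'."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Hypothetical fallback for codes the standard library does not know.
        phrase = "Unknown"
    if 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirect"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "other"
    return "%d %s (%s)" % (code, phrase, kind)
```

For instance, 'python.org/' has no http:// or https:// scheme, so is_well_formed rejects it, while describe_status(404) gives '404 Not Found (client error)'.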

1 Answer

This is a simplified version of the code I use in some projects.

The logic is simple:

  1. Send the url to server_response
  2. If status == 200 (the url is valid) -> return ok
  3. If status == 404, check the url up to 5 times, 10 secs apart (covers the case of a bad connection)
  4. If the status is still 404 after 5 tries -> return bad

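Note that this code only handles statuses 200 and 404. Any other status is neither counted as a try nor reported as bad, so the loop keeps requesting the URL; handle those statuses yourself or change if status == 404: to if status != 200: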

    import requests
    from time import sleep


    def server_response(url):
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
        tries = 5
        while True:
            response = requests.get(url, headers=headers, stream=True)
            status = response.status_code
            # change this to 'if status != 200:' to cover all status codes except 200
            if status == 404:
                print('\n###################################')
                print('### THERE IS CONNECTION PROBLEM ###')
                print('Response code: %d \nURI: %s' % (status, url))
                print('###################################\n')
                sleep(10)
                tries -= 1
            elif status == 200:
                return 'ok'
            if tries == 0:
                return 'bad'


    # requests needs the scheme (http:// or https://) in every url
    list_of_urls = ['http://www.site1.com', 'http://www.site2.com']
    for url in list_of_urls:
        status = server_response(url)
        if status == 'ok':
            pass  # do something
        else:
            pass  # do something
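
If every status other than 200 should count as a failure, and a dead server or a malformed URL should not raise an exception, the retry loop can be bounded. The following is a rough sketch under assumptions, not the code above: it uses Python 3's standard urllib.request instead of requests, and the names fetch_status and check_url, as well as the get_status and sleep parameters, are hypothetical.

```python
import http.client
import time
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status such as 404 or 500.
        return error.code
    except (urllib.error.URLError, http.client.HTTPException, ValueError, OSError):
        # DNS failure, refused connection, timeout, or a malformed URL.
        return None


def check_url(url, tries=5, delay=10, get_status=fetch_status, sleep=time.sleep):
    """Return 'ok' on a 200 response, otherwise 'bad' after `tries` attempts."""
    for attempt in range(1, tries + 1):
        if get_status(url) == 200:
            return "ok"
        if attempt < tries:
            # Pause between attempts, but not after the final one.
            sleep(delay)
    return "bad"
```

With this, a list comprehension such as [u for u in list_of_urls if check_url(u) == 'ok'] would keep only the URLs that answered with 200. Passing get_status and sleep as parameters is a design choice that lets the loop be tried out without real requests or real waiting.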

3 Comments

Hi, thank you for the suggestion. I've customized your code and tried running it but I get a response "status" not defined...as I understand we don't need to define it, what do you think is wrong?
@song0089, update your first post with the customized code and mark the line where the error occurs, then I can help. You need to define status because 1) you need to check the code 200/404 (in the function) and 2) you need to check whether the url is ok or bad (in the block for url in list_of_urls:). A "not defined" error can also occur because of wrong indentation.
@song0089, I would appreciate if you could accept the answer (under the 2 up\down arrows in front of my answer there is a grey check mark). Thank you.
