4

Language Ver: Python 3.6.3
IDE Ver: PyCharm 2017.2.3

I was trying to parse a weather website to print the weather for a place. While learning Python, I previously used urllib.request.urlopen(url).read(), and it worked. Now I am changing the code to use BeautifulSoup4 and the requests module. Below is my code:

from bs4 import *
import requests

url = "https://www.accuweather.com/en/in/dhenkanal/189844/weather-forecast/189844"
data = requests.get(url)
soup = BeautifulSoup(data.text, "html.parser")
print(soup.find('div', {'class': 'info'}))

But each time I try to run the code, it gives me the following error:

Traceback (most recent call last):
  File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 601, in urlopen
    chunked=chunked)
  File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 387, in _make_request
    six.raise_from(e, None)
  File "<string>", line 2, in raise_from
  File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 383, in _make_request
    httplib_response = conn.getresponse()
  File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\http\client.py", line 1331, in getresponse
    response.begin()
  File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\http\client.py", line 297, in begin
    version, status, reason = self._read_status()
  File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\http\client.py", line 258, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\socket.py", line 586, in readinto
    return self._sock.recv_into(b)
  File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\ssl.py", line 1009, in recv_into
    return self.read(nbytes, buffer)
  File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\ssl.py", line 871, in read
    return self._sslobj.read(len, buffer)
  File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\ssl.py", line 631, in read
    v = self._sslobj.read(len, buffer)
TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\adapters.py", line 440, in send
    timeout=timeout
  File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 639, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\util\retry.py", line 357, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\packages\six.py", line 685, in reraise
    raise value.with_traceback(tb)
  File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 601, in urlopen
    chunked=chunked)
  File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 387, in _make_request
    six.raise_from(e, None)
  File "<string>", line 2, in raise_from
  File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 383, in _make_request
    httplib_response = conn.getresponse()
  File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\http\client.py", line 1331, in getresponse
    response.begin()
  File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\http\client.py", line 297, in begin
    version, status, reason = self._read_status()
  File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\http\client.py", line 258, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\socket.py", line 586, in readinto
    return self._sock.recv_into(b)
  File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\ssl.py", line 1009, in recv_into
    return self.read(nbytes, buffer)
  File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\ssl.py", line 871, in read
    return self._sslobj.read(len, buffer)
  File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\ssl.py", line 631, in read
    v = self._sslobj.read(len, buffer)
urllib3.exceptions.ProtocolError: ('Connection aborted.', TimeoutError(10060, 'A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond', None, 10060, None))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "E:/Projects/Python/Practice/Practice1.py", line 5, in <module>
    data = requests.get(url)
  File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\sessions.py", line 508, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\sessions.py", line 618, in send
    r = adapter.send(request, **kwargs)
  File "C:\Users\Nrusingh\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\adapters.py", line 490, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', TimeoutError(10060, 'A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond', None, 10060, None))

Process finished with exit code 1

What is this error, and how do I correct it? And why did it work with urllib but not with requests?

3 Comments
  • Sorry for adding an external link. I don't know how to add the error log to the question, and Stack Overflow didn't let me include it. Commented Nov 19, 2017 at 8:55
  • Edited it in for you. It's a beast of an error :) Commented Nov 19, 2017 at 8:59
  • Short answer: send a header called User-Agent. Answer below :) Commented Nov 19, 2017 at 9:21

2 Answers

6

I ran your code as-is and got the same error, so I looked at how the browser sends the request. Some servers do not respond when headers they rely on for backend processing are missing from the request. It turns out this server expects a header called User-Agent, which is normally used to identify the client making the request. The amended code below works:

from bs4 import BeautifulSoup
import requests

url = "https://www.accuweather.com/en/in/dhenkanal/189844/weather-forecast/189844"
# Send a browser-like User-Agent, since the server does not respond without one.
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
data = requests.get(url, headers=headers)
soup = BeautifulSoup(data.text, "html.parser")

Now you can play with your soup! You can in fact pass more headers, such as accept, dnt, pragma, accept-language, cache-control, etc. An explanation of these HTTP headers is for another question, another time. Hope it helps :)
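
To see what each library announces itself as, the defaults can be inspected without touching the network. This is only a sketch: the helper name outgoing_headers is made up for illustration, and the idea that the server filters on the User-Agent value is an assumption based on the behaviour described above, not something the site documents.

```python
import urllib.request

import requests

URL = "https://www.accuweather.com/en/in/dhenkanal/189844/weather-forecast/189844"
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"
)

# Default User-Agent that a urllib opener attaches to its requests.
urllib_default = dict(urllib.request.build_opener().addheaders)["User-agent"]

# Default User-Agent that requests attaches when none is supplied.
requests_default = requests.utils.default_user_agent()

print(urllib_default)    # Python-urllib/<version>
print(requests_default)  # python-requests/<version>


def outgoing_headers(url, headers=None):
    """Hypothetical helper: build the request as requests would, but never send it."""
    session = requests.Session()
    prepared = session.prepare_request(requests.Request("GET", url, headers=headers))
    return dict(prepared.headers)


# Without an override the library default is used; with one, the browser string wins.
print(outgoing_headers(URL)["User-Agent"])
print(outgoing_headers(URL, {"User-Agent": BROWSER_UA})["User-Agent"])
```

The two libraries identify themselves differently by default, which may be why the same URL behaved differently with urllib and with requests; that link is a guess rather than a confirmed cause.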

2

Try setting the timeout parameter of your requests.get call:

requests.get(url, headers=headers, timeout=5) 

However, the server may be blocking your script to prevent scraping attempts. If this is the case, you can try imitating a web browser by setting appropriate headers:

{"User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 GTB7.1 (.NET CLR 3.5.30729)", "Referer": "http://example.com"} 

Your final code:

import requests

url = "https://www.accuweather.com/en/in/dhenkanal/189844/weather-forecast/189844"
headers = {"User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 GTB7.1 (.NET CLR 3.5.30729)", "Referer": "http://example.com"}
data = requests.get(url, headers=headers, timeout=5)
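
If the request can still fail, it may help to catch the exception instead of letting the full traceback end the script. The following is a minimal sketch, not part of the original answer: the function name fetch_html and the (5, 30) connect/read timeout values are assumptions chosen for illustration.

```python
import requests


def fetch_html(url, headers=None, timeout=(5, 30)):
    """Hypothetical helper: return the page text, or None if the request fails.

    timeout is a (connect, read) pair in seconds; the values are illustrative.
    """
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    except requests.exceptions.Timeout:
        print("The server did not answer within the timeout.")
        return None
    except requests.exceptions.RequestException as exc:
        # Covers ConnectionError (as in the traceback above) and other request failures.
        print("Request failed:", exc)
        return None
    return response.text
```

Calling fetch_html(url, headers=headers) with the url and headers defined above would then return either the HTML or None, so the caller can decide whether to retry or give up.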

3 Comments

I've struggled to even load the link in browser. Timeout of 120 is not enough to prevent error on GET.
I am getting the result with a timeout of 5.
The fix was to add User-Agent to the headers; their server was blocking the scraping.
