
I have a simple screen-scraping routine that fetches an HTML page with requests and parses it with BeautifulSoup, using a proxy crawling service (Scrapinghub's Crawlera):

def make_soup(self, current_url):
    soup = None
    r = requests.get(current_url, proxies=self.proxies, auth=self.proxy_auth,
                     verify='static/crawlera-ca.crt')
    if r.status_code == 200:
        soup = bs4.BeautifulSoup(r.text, "html.parser")
    if soup:
        return soup
    return False
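For context, self.proxies and self.proxy_auth point at the Crawlera proxy; roughly like this (the API key is a placeholder, and the exact setup in my code may differ slightly):

import requests
from requests.auth import HTTPProxyAuth

# Rough sketch of the attributes make_soup relies on; the API key is a placeholder.
proxies = {
    "http": "http://proxy.crawlera.com:8010",
    "https": "http://proxy.crawlera.com:8010",
}
proxy_auth = HTTPProxyAuth("<CRAWLERA_API_KEY>", "")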

When I run it on an http:// site it works properly.

When I run it on an https:// site it returns this:

Traceback (most recent call last):
  File "/home/danny/Documents/virtualenvs/AskArbyEnv/lib/python3.5/site-packages/requests/packages/urllib3/util/ssl_.py", line 295, in ssl_wrap_socket
    context.load_verify_locations(ca_certs, ca_cert_dir)
FileNotFoundError: [Errno 2] No such file or directory

Even weirder is that it works when I run it in a unit test accessing the same https:// site.

The only thing that changes between the unit test and the running code is the search terms that I append to the URL that I pass to 'make_soup'. Each resulting URL is well-formed, and I can access both of them in the browser.

This makes me think that it can't be to do with missing SSL certificates. So why does it seem to be complaining that it can't find certificate files?

1 Answer

By specifying verify='static/crawlera-ca.crt' in your call to requests.get, you are saying that every site you visit must present a certificate signed by crawlera-ca.crt. If your proxy is not rewriting requests/responses and server certificates on the fly (which it shouldn't be, but see the update below), then all your requests to https sites will fail.

In addition, if you read the error message carefully, you can see you don't even have that crawlera-ca.crt file on disk.

To resolve your issue, just remove the verify argument. That way requests will fall back to its default, the certifi bundle (for requests >= 2.4.0). For non-invasive proxies this is the right solution. Optionally, if you really need to, you can add a CA cert you trust to your local certifi store, but be very careful about which certs you add.
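For example, a minimal sketch of the fixed call (the URL is illustrative and the proxy settings are omitted for brevity; the certifi append is optional, edits the installed bundle in place, and should only be done for a CA you genuinely trust):

import certifi
import requests

# With verify removed, requests validates against its bundled certifi CA store
# (requests >= 2.4.0).
r = requests.get("https://example.com/")
print(r.status_code)

# Optional, and only for a CA you really trust: append it to the local certifi
# bundle (requires write access to the bundle file).
with open("static/crawlera-ca.crt", "rb") as ca, open(certifi.where(), "ab") as bundle:
    bundle.write(b"\n" + ca.read())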

Update. Looks like Crawlera proxy is a man-in-the-middle after all! Bad Crawlera, bad, bad, bad!

$ curl -vvv -x proxy.crawlera.com:8010 --cacert crawlera-ca.crt https://google.com/
[...snip...]
* Proxy replied OK to CONNECT request
* found 1 certificates in crawlera-ca.crt
* found 697 certificates in /etc/ssl/certs
* ALPN, offering http/1.1
* SSL connection using TLS1.2 / ECDHE_RSA_AES_256_GCM_SHA384
* server certificate verification OK
* server certificate status verification SKIPPED
* common name: google.com (matched)
* server certificate expiration date OK
* server certificate activation date OK
* certificate public key: RSA
* certificate version: #1
* subject: CN=google.com
* start date: Sat, 08 Jul 2017 13:33:53 GMT
* expire date: Tue, 06 Jul 2027 13:33:53 GMT
* issuer: C=IE,ST=Munster,L=Cork,O=ScrapingHub,OU=Leading Technology and Professional Services,CN=Crawlera CA,[email protected]
* compression: NULL

Notice that the certificate with CN=google.com is issued by O=ScrapingHub,CN=Crawlera CA.

This means Crawlera/ScrapingHub is re-encrypting each request you make to your target URL, and can read all the private and sensitive data you exchange with that site! I understand that's the only way for them to cache origin requests and save some bandwidth across all users scraping the same site, and the only way to inspect the legality of the content, but still. They should put this somewhere in their FAQ, and I'm not sure they do.
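You can reproduce the same check from Python; this is only a sketch, with placeholder Crawlera credentials and the CA file assumed to be in the working directory:

import requests

# Placeholder credentials; Crawlera takes the API key as the proxy username.
proxies = {
    "http": "http://<API_KEY>:@proxy.crawlera.com:8010",
    "https": "http://<API_KEY>:@proxy.crawlera.com:8010",
}

# With the Crawlera CA trusted, the re-signed google.com certificate validates.
r = requests.get("https://google.com/", proxies=proxies, verify="crawlera-ca.crt")
print(r.status_code)

# With only the default certifi bundle, the same request should raise an
# SSLError, because the presented certificate is issued by the Crawlera CA
# rather than a publicly trusted one.
try:
    requests.get("https://google.com/", proxies=proxies)
except requests.exceptions.SSLError as exc:
    print("verification failed as expected:", exc)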


2 Comments

Thanks for this. The file is on disk, but it turns out that the reason it suddenly stopped working was that I had moved the Python file containing make_soup to a lower-level folder. Changing to verify='../static/crawlera-ca.crt' solved the problem (a more robust path fix is sketched after these comments).
Turns out Crawlera is inspecting all your traffic, and issuing certificates for all sites you connect to. In that case, crawlera-ca.crt is necessary. But be aware that they can read any private data you exchange with third-party sites.
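A small sketch of that path fix, resolving the certificate location relative to the source file rather than the current working directory (assuming here, for illustration, that static/ sits next to the source file; adjust the relative components to match your layout):

import os

# Build the CA path from this module's location so moving the file or changing
# the working directory doesn't break certificate verification.
CA_CERT = os.path.join(os.path.dirname(os.path.abspath(__file__)),
                       "static", "crawlera-ca.crt")

# then: requests.get(url, proxies=..., auth=..., verify=CA_CERT)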
