1

I am helping somebody pull a bunch (tens of thousands) of pdf files from a website. We have the pattern for the file names but not all of the files will exist. I am assuming it is rude to ask for a file that does not exist, particularly at this scale. I am using python and in my tests of urllib2 I have found that this snippet gets me the file if it exists

s=urllib.urlretrieve('http://website/directory/filename.pdf','c:\\destination.pdf') 

If the file does not exist then I get a file that has the name I assigned but the text from their 404 page. Now I can handle this after I am done (read the files and delete all of the 404 pages) but that does not seem very nice to their server nor is it very pythonic.

I tried messing with the looking at the various functions in urllib and urlretrieve and do not see anything that tells me if the file exists.

6
  • 12
    What's rude is pulling tens of thousands of PDF files. That little extra rudeness of some of the files not existing...eh. Doesn't even matter, next to that. Commented Apr 3, 2012 at 18:59
  • Well we are going to do it when their traffic is down (weekends) and they don't have a restriction the files are there to read but for his research we need to collect a large number Commented Apr 3, 2012 at 19:01
  • It's actually very pythonic - ask for forgiveness, not permission - the pythonic (and only, given the way the web works) thing to do is catch the 404s. I'd also like to note they are not filenames, they are URLs - there is a difference, a URL does not mean there is an actual file on the server - they could be generated from a databased or whatever. Commented Apr 3, 2012 at 19:01
  • 1
    @Lattyware: In which case you're not just taking up bandwidth; you're also making the server generate this PDF. Even ruder. Commented Apr 3, 2012 at 19:02
  • 2
    @PyNEwbie They are going to send the full response no matter what, so whether or not you write it to disk is really more of an issue for your end then the server. Commented Apr 3, 2012 at 19:05

1 Answer 1

6

You can check the return code of the response. It will be 200 for existing PDFs and 404 for non-existing PDFs. You can use the requests library to make this a lot easier:

>>> import requests >>> r = requests.get('http://cdn.sstatic.net/stackoverflow/img/sprites.png') >>> r.status_code 200 >>> r = requests.get('http://cdn.sstatic.net/stackoverflow/img/sprites.xxx') >>> r.status_code 404 
Sign up to request clarification or add additional context in comments.

3 Comments

What if you can't install requests?
@BenL then cry :( the built-in libraries -- urllib/httplib -- are awful
@jterrace I got it working. I had to sift through the terrible irrelevant information on the web. urllib needed me to change the headers to pretend it's mozilla.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.