Which is best in Python: urllib2, PycURL or mechanize?

Question

Ok so I need to download some web pages using Python and did a quick investigation of my options.

Included with Python:

urllib - seems to me that I should use urllib2 instead. urllib has no cookie support, HTTP/FTP/local files only (no SSL)

urllib2 - complete HTTP/FTP client, supports most needed things like cookies, does not support all HTTP verbs (only GET and POST, no TRACE, etc.)

Full featured:

mechanize - can use/save Firefox/IE cookies, take actions like follow second link, actively maintained (0.2.5 released in March 2011)

PycURL - supports everything curl does (FTP, FTPS, HTTP, HTTPS, GOPHER, TELNET, DICT, FILE and LDAP), bad news: not updated since Sep 9, 2008 (7.19.0)

New possibilities:

urllib3 - supports connection re-using/pooling and file posting

Deprecated (a.k.a. use urllib/urllib2 instead):

httplib - HTTP/HTTPS only (no FTP)

httplib2 - HTTP/HTTPS only (no FTP)

The first thing that strikes me is that urllib/urllib2/PycURL/mechanize are all pretty mature solutions that work well. mechanize and PycURL ship with a number of Linux distributions (e.g. Fedora 13) and BSDs, so installation is typically a non-issue (so that's good).

urllib2 looks good, but I'm wondering why PycURL and mechanize both seem very popular. Is there something I am missing (i.e. if I use urllib2, will I paint myself into a corner at some point)? I'd really like some feedback on the pros/cons of these things so I can make the best choice for myself.

Edit: added note on verb support in urllib2

What does "best" mean? Best with respect to what? Fastest? Largest? Best use of Cookies? What do you need to do?
– S.Lott, Commented Mar 5, 2010 at 11:03
httplib isn't "deprecated". It is a lower level module that urllib2 is built on top of. you can use it directly, but it is easier via urllib2
– Corey Goldberg, Commented Mar 5, 2010 at 16:48
What Corey said, e.g. urllib3 is a layer on top of httplib. Also, httplib2 is not deprecated - in fact it's newer than urllib2 and fixes problems like connection reuse (same with urllib3).
– xyzzyrz, Commented Apr 21, 2011 at 1:03
There is a newer library called requests. See docs.python-requests.org/en/latest/index.html
– ustun, Commented Jun 30, 2011 at 21:11

vaibhaw · Accepted Answer · 2014-03-12 07:19:23Z

I think this talk (from PyCon 2009) has the answers you're looking for (Asheesh Laroia has lots of experience on the matter). He points out the good and the bad of most of the options on your list.

Scrape the Web: Strategies for programming websites that don't expect it (Part 1 of 3)
Scrape the Web: Strategies for programming websites that don't expect it (Part 2 of 3)
Scrape the Web: Strategies for programming websites that don't expect it (Part 3 of 3)

From the PYCON 2009 schedule:

Do you find yourself faced with websites that have data you need to extract? Would your life be simpler if you could programmatically input data into web applications, even those tuned to resist interaction by bots?

We'll discuss the basics of web scraping, and then dive into the details of different methods and where they are most applicable.

You'll leave with an understanding of when to apply different tools, and learn about a "heavy hammer" for screen scraping that I picked up at a project for the Electronic Frontier Foundation.

Attendees should bring a laptop, if possible, to try the examples we discuss and optionally take notes.

Update: Asheesh Laroia has updated his presentation for PyCon 2010

PyCon 2010: Scrape the Web: Strategies for programming websites that don't expect it

* My motto: "The website is the API."
* Choosing a parser: BeautifulSoup, lxml, HTMLParse, and html5lib.
* Extracting information, even in the face of bad HTML: Regular expressions, BeautifulSoup, SAX, and XPath.
* Automatic template reverse-engineering tools.
* Submitting to forms.
* Playing with XML-RPC
* DO NOT BECOME AN EVIL COMMENT SPAMMER.
* Countermeasures, and circumventing them:
  o IP address limits
  o Hidden form fields
  o User-agent detection
  o JavaScript
  o CAPTCHAs
* Plenty of full source code to working examples:
  o Submitting to forms for text-to-speech.
  o Downloading music from web stores.
  o Automating Firefox with Selenium RC to navigate a pure-JavaScript service.
* Q&A and workshopping
* Use your power for good, not evil.

Update 2:

PyCon US 2012 - Web scraping: Reliably and efficiently pull data from pages that don't expect it

Exciting information is trapped in web pages and behind HTML forms. In this tutorial, you'll learn how to parse those pages and when to apply advanced techniques that make scraping faster and more stable. We'll cover parallel downloading with Twisted, gevent, and others; analyzing sites behind SSL; driving JavaScript-y sites with Selenium; and evading common anti-scraping techniques.

Two or three sentences summarizing the talk's recommendations would be great, for those without the time to listen to it. :-)

Tutul · Accepted Answer · 2012-01-08 04:32:34Z

Python requests is also a good candidate for HTTP stuff. It has a nicer API, IMHO. An example HTTP request from their official documentation:

>>> r = requests.get('https://api.github.com', auth=('user', 'pass'))
>>> r.status_code
204
>>> r.headers['content-type']
'application/json'
>>> r.content
...

Ignacio Vazquez-Abrams · Accepted Answer · 2010-03-05 10:21:12Z

urllib2 is found in every Python install everywhere, so is a good base upon which to start.
PycURL is useful for people already used to using libcurl, exposes more of the low-level details of HTTP, plus it gains any fixes or improvements applied to libcurl.
mechanize is used to persistently drive a connection much like a browser would.

It's not a matter of one being better than the other, it's a matter of choosing the appropriate tool for the job.
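
To make the "good base" point concrete, here is a minimal sketch of a plain download using only the standard library. It assumes Python 3, where urllib2's functionality lives in `urllib.request`; the helper name `fetch` and the `data:` URL used to try it offline are illustrative choices, not something from this answer.

```python
import urllib.request


def fetch(url, timeout=10):
    """Download a URL and return (content_type, body_bytes).

    Hypothetical helper for illustration; urllib.request is the
    Python 3 home of what urllib2 provided in Python 2.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        content_type = response.headers.get("Content-Type")
        body = response.read()
    return content_type, body


if __name__ == "__main__":
    # A data: URL keeps the demo offline; an http:// or https:// URL
    # is passed to fetch() in exactly the same way.
    print(fetch("data:text/plain,hello"))
```

For anything beyond simple fetches (browser-like navigation, libcurl-specific options), the other libraries in the list above are probably the better fit.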

I have implemented httplib2 in my Python application. Does httplib2 support NTLM? If not, what do I have to do for NTLM authentication? Note: as far as I can tell, httplib2 does not support NTLM.
@Ayyappan urllib3 has NTLM support via the contrib submodule: urllib3/contrib/ntlmpool.py

mit · Accepted Answer · 2013-01-19 23:02:41Z

To "get some webpages", use requests!

From http://docs.python-requests.org/en/latest/ :

Python’s standard urllib2 module provides most of the HTTP capabilities you need, but the API is thoroughly broken. It was built for a different time — and a different web. It requires an enormous amount of work (even method overrides) to perform the simplest of tasks.

Things shouldn’t be this way. Not in Python.

>>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
>>> r.status_code
200
>>> r.headers['content-type']
'application/json; charset=utf8'
>>> r.encoding
'utf-8'
>>> r.text
u'{"type":"User"...'
>>> r.json()
{u'private_gists': 419, u'total_private_repos': 77, ...}

wisty · Accepted Answer · 2010-03-05 11:09:02Z

Don't worry about "last updated". HTTP hasn't changed much in the last few years ;)

urllib2 is best (as it's inbuilt), then switch to mechanize if you need cookies from Firefox. mechanize can be used as a drop-in replacement for urllib2 - they have similar methods etc. Using Firefox cookies means you can get things from sites (like say StackOverflow) using your personal login credentials. Just be responsible with your number of requests (or you'll get blocked).

PycURL is for people who need all the low level stuff in libcurl. I would try the other libraries first.

requests is also useful in storing cookies. With requests you create a new session, and then instead of requests.get() you call sessionName.get(). Cookies will then be stored in your session. For example once you've logged into a website using a session you will be able to do other http requests using your session as a logged in user.
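
The session idea is not specific to requests. As a rough standard-library sketch of the same behaviour (Python 3, where urllib2 became `urllib.request`), an opener built with `HTTPCookieProcessor` keeps cookies between calls. The tiny local server, the `/login` and `/whoami` paths, and the cookie value below are all made up so the example can run without a real site.

```python
import threading
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical site: /login sets a cookie, other paths echo it back."""

    def do_GET(self):
        self.send_response(200)
        if self.path == "/login":
            self.send_header("Set-Cookie", "sessionid=abc123; Path=/")
            body = b"logged in"
        else:
            body = (self.headers.get("Cookie") or "no cookie").encode()
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def run_demo():
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_address[1]

    # The opener plays the role of the session: it owns the cookie jar.
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({}),  # ignore any proxy environment settings
        urllib.request.HTTPCookieProcessor(jar),
    )
    try:
        opener.open(base + "/login").read()  # server sets the cookie
        return opener.open(base + "/whoami").read().decode()  # cookie sent back
    finally:
        server.shutdown()
        server.server_close()
```

Calling `run_demo()` returns the cookie string the second request carried, which shows the "logged in" state surviving across requests. With requests, the `Session` object described above does this bookkeeping for you.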

mikerobi · Accepted Answer · 2010-03-05 14:10:29Z

urllib2 only supports HTTP GET and POST. There might be workarounds, but if your app depends on other HTTP verbs, you will probably prefer a different module.

Piotr Dobrogost Over a year ago

Not true. See Python - HEAD request with urllib2

mikerobi Over a year ago

@Piotr Dobrogost. Still very true. Until you can use urllib2.urlopen to generate a HEAD request, it is unsupported. Creating a custom subclass != HEAD support. I could create an int subclass that generates HTML, but it would never make sense to say that python int can generate HTML.

Piotr Dobrogost Over a year ago

"Until you can use urllib2.urlopen to generate a HEAD request, it is unsupported." What makes you think so? "Creating a custom subclass != HEAD support." Which part of HEAD support is urllib2 missing?

mikerobi Over a year ago

@Piotr Dobrogost, I think so because it isn't supported by the API. If you can point me to an example of urllib2.urlopen generating a non-GET or non-POST request, I will delete my answer.
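
For readers following this exchange, both sides can be sketched in a few lines. This assumes Python 3, where urllib2 became `urllib.request` and `Request` gained a `method` argument in 3.3. The subclass shows the older workaround of overriding `get_method()`. The names `build_head_request` and `HeadRequest` are illustrative.

```python
import urllib.request


def build_head_request(url):
    """Python 3.3+: pass the verb directly to Request."""
    return urllib.request.Request(url, method="HEAD")


class HeadRequest(urllib.request.Request):
    """Older workaround: a Request subclass that overrides get_method()."""

    def get_method(self):
        return "HEAD"


if __name__ == "__main__":
    # Neither line touches the network; they only show which verb
    # urlopen() would send if given these objects.
    print(build_head_request("http://example.com/").get_method())
    print(HeadRequest("http://example.com/").get_method())
```

Either object can be passed to `urllib.request.urlopen()` in place of a URL string. Whether that counts as "support" is the disagreement above.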

jedi_coder · Accepted Answer · 2010-08-04 03:27:06Z

Every python library that speaks HTTP has its own advantages.

Use the one that has the minimum amount of features necessary for a particular task.

Your list is missing at least urllib3 - a cool third-party HTTP library which can reuse an HTTP connection, greatly speeding up the retrieval of multiple URLs from the same site.
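
To show what connection reuse means in practice, here is a rough sketch using only the standard library rather than urllib3's own API. It uses `http.client`, the Python 3 name for httplib. The local server and the trick of echoing the client's source port are assumptions made purely so the reuse is observable.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoPortHandler(BaseHTTPRequestHandler):
    """Hypothetical server that replies with the client's source port."""

    protocol_version = "HTTP/1.1"  # allows keep-alive connections

    def do_GET(self):
        body = str(self.client_address[1]).encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch_many(paths):
    """Fetch several paths over ONE connection; return the port seen each time."""
    server = HTTPServer(("127.0.0.1", 0), EchoPortHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1])
    try:
        ports = []
        for path in paths:
            conn.request("GET", path)
            ports.append(conn.getresponse().read().decode())
        return ports
    finally:
        conn.close()  # close first so the server's handler loop can finish
        server.shutdown()
        server.server_close()
```

If `fetch_many(["/a", "/b", "/c"])` returns the same port three times, all three requests travelled over a single TCP connection. Skipping the repeated connection setup is the saving that a pooling library such as urllib3 automates.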

Stack Exchange User · Accepted Answer · 2012-06-19 09:00:37Z

Take a look at Grab (http://grablib.org). It is a network library which provides two main interfaces: 1) Grab, for creating network requests and parsing retrieved data; 2) Spider, for creating bulk site scrapers.

Under the hood Grab uses pycurl and lxml but it is possible to use other network transports (for example, requests library). Requests transport is not well tested yet.
