cookielib and ClientCookie

Handling Cookies in Python

Note

There is a French translation of this article - cookielib et ClientCookie.

Introduction

Some websites can't be browsed without cookies enabled. They are used for storing session information or confirming a user's identity, and are sometimes used as an alternative to a scheme like basic authentication. The right Python module for fetching webpages, or other resources across the internet, is usually urllib2. It offers a simple interface for fetching resources using a variety of protocols. For a good introduction to urllib2, browse over to the urllib2 tutorial.

By default it doesn't handle cookies though. You need an additional library to do this. In Python 2.4 this is called cookielib and is part of the standard library. Prior to Python 2.4 it existed as ClientCookie, but it's not a drop-in replacement - in Python 2.4 some of the functionality of ClientCookie was moved into urllib2. It is possible to write code that works the same way in all these situations. This article illustrates how to use cookielib/ClientCookie and shows code for fetching URLs that will work unchanged on:

  • a machine with Python 2.4 (and cookielib)
  • a machine with ClientCookie
  • a machine with neither

Where either cookielib or ClientCookie is available the cookies will be saved in a file. On a machine with neither, URLs will still be fetched - but any cookies sent won't be handled or saved.

Cookies

When a website sends a page to a client, it first sends a set of headers that describe that HTTP transaction. One of those headers can contain a line of text known as a cookie. If you fetch another page from the same server [1], the cookie should be sent back to the server as one of the request headers. This allows the cookie to store information that the server can use to identify you, or the session you are engaged in. Obviously for some processes, this information is essential.
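
As a rough sketch of what this exchange looks like from Python, you can inspect the headers that urllib2 hands back - google.co.uk is just an example here, and whether you see a Set-Cookie header at all depends entirely on the site:

import urllib2

handle = urllib2.urlopen('http://www.google.co.uk/')
headers = handle.info()
# the 'Set-Cookie' header (if present) is the server asking us to store a cookie
print headers.getheader('Set-Cookie')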

This was first supported by the Netscape browser, and so the first spec was called the Netscape cookie protocol. It was described in RFC 2109, and there was then an attempt to extend this in the form of RFC 2965 - which has never been widely used. In reality, the protocol implemented by all the major browsers is still based on the Netscape protocol. By now it only bears a passing resemblance to the protocol sketched out in the original document [2].

Conditional Imports

A version of this code has been printed in the second edition of the Python Cookbook. One of the reasons it was included in the cookbook is that it illustrates an interesting programming idiom called conditional import. It's particularly important in this recipe because we need the behaviour of the underlying code to be slightly different depending on which library is available. The interface we present to the programmer is the same in all three cases though.

Pay attention to the first chunk of code, which attempts to import the cookie libraries. It has to set up different behaviour depending on which library it imports. The pattern it uses is:

library = None

try:
    import library
except ImportError:
    # library not available
    # ... set up alternate behaviour ...
else:
    # library is available
    # ... establish normal behaviour ...

We use the name of the library we are importing as a marker, by setting it to None at the start. Later in the code we can tell whether the library is available by using code like the following:

if library is None:
    # we know library is not available
    # ... provide different or reduced functionality ...
else:
    # library is available
    # ... use it normally ...
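
As a concrete illustration of the idiom, here is a small sketch using simplejson as the optional library - simplejson and the function name dump_data are chosen purely for the example:

simplejson = None
try:
    import simplejson
except ImportError:
    # simplejson isn't installed - fall back to repr for output
    def dump_data(data):
        return repr(data)
else:
    # simplejson is available - use it for nicer output
    def dump_data(data):
        return simplejson.dumps(data)

print dump_data({'name': 'value'})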

The Code

The code shown in this article can be downloaded from the Voidspace Recipebook.

In this first section we attempt to import a cookie handling library. We first try cookielib, then ClientCookie. If neither is available then we default to objects from urllib2.

import os.path
import sys
import urllib2

COOKIEFILE = 'cookies.lwp'
# the path and filename to save your cookies in

cj = None
ClientCookie = None
cookielib = None

# Let's see if cookielib is available
try:
    import cookielib
except ImportError:
    # If importing cookielib fails
    # let's try ClientCookie
    try:
        import ClientCookie
    except ImportError:
        # ClientCookie isn't available either
        urlopen = urllib2.urlopen
        Request = urllib2.Request
    else:
        # imported ClientCookie
        urlopen = ClientCookie.urlopen
        Request = ClientCookie.Request
        cj = ClientCookie.LWPCookieJar()

else:
    # importing cookielib worked
    urlopen = urllib2.urlopen
    Request = urllib2.Request
    cj = cookielib.LWPCookieJar()
    # This is a subclass of FileCookieJar
    # that has useful load and save methods

We've now imported the relevant library. Whichever library is being used, the name urlopen is bound to the right function for retrieving URLs, and the name Request is bound to the right class for creating Request objects. If we successfully managed to import a cookie handling library, then the name cj is bound to a CookieJar instance.

Installing the CookieJar

Now we need to get our CookieJar installed in the default opener for fetching URLs. This means that all calls to urlopen will have their cookies handled. The actual handling is done by an object called an HTTPCookieProcessor. If the terms opener and handler are new to you, then read either the urllib2 docs or my urllib2 tutorial.

All this is either done in ClientCookie or in urllib2, depending on which module we successfully imported.

if cj is not None:
    # we successfully imported
    # one of the two cookie handling modules

    if os.path.isfile(COOKIEFILE):
        # if we have a cookie file already saved
        # then load the cookies into the Cookie Jar
        cj.load(COOKIEFILE)

    # Now we need to get our Cookie Jar
    # installed in the opener;
    # for fetching URLs
    if cookielib is not None:
        # if we use cookielib
        # then we get the HTTPCookieProcessor
        # and install the opener in urllib2
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
        urllib2.install_opener(opener)

    else:
        # if we use ClientCookie
        # then we get the HTTPCookieProcessor
        # and install the opener in ClientCookie
        opener = ClientCookie.build_opener(ClientCookie.HTTPCookieProcessor(cj))
        ClientCookie.install_opener(opener)

If one of the cookie libraries is available, any call to urlopen will now handle cookies using the CookieJar instance we've created.
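
Incidentally, if you'd rather not install the opener globally, the opener object returned by build_opener can be used directly instead. A minimal sketch for the cookielib case (the URL is just a placeholder):

opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
response = opener.open('http://www.example.com/')
# cookies from this response end up in cj, but the default
# urllib2.urlopen is left untouched
print response.info()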

Fetching Webpages

So having done all the dirty work, we're ready to fetch our webpages. Any cookies sent will be handled. This means they will be stored in the CookieJar, returned to the server when appropriate, and expired correctly as well. Because we may want to restart the same session next time, we save the cookies when we've finished.

theurl = 'http://www.google.co.uk/search?hl=en&ie=UTF-8&q=voidspace&meta='
# an example url that sets a cookie,
# try different URLs here and see the cookie collection you can make!

txdata = None
# if we were making a POST type request,
# we could encode a dictionary of values here,
# using urllib.urlencode(somedict)

txheaders =  {'User-agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
# fake a user agent, some websites (like google) don't like automated exploration

try:
    req = Request(theurl, txdata, txheaders)
    # create a request object

    handle = urlopen(req)
    # and open it to return a handle on the url

except IOError, e:
    print 'We failed to open "%s".' % theurl
    if hasattr(e, 'code'):
        print 'We failed with error code - %s.' % e.code
    elif hasattr(e, 'reason'):
        print "The error object has the following 'reason' attribute :"
        print e.reason
        print "This usually means the server doesn't exist,",
        print "is down, or we don't have an internet connection."
    sys.exit()

else:
    print 'Here are the headers of the page :'
    print handle.info()
    # handle.read() returns the page
    # handle.geturl() returns the true url of the page fetched
    # (in case urlopen has followed any redirects, which it sometimes does)

print
if cj is None:
    print "We don't have a cookie library available - sorry."
    print "I can't show you any cookies."
else:
    print 'These are the cookies we have received so far :'
    for index, cookie in enumerate(cj):
        print index, '  :  ', cookie
    cj.save(COOKIEFILE)                     # save the cookies again
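
The txdata comment above mentions encoding a dictionary of values for a POST request. A minimal sketch of what that might look like - the URL and field names are made up purely for illustration:

import urllib

posturl = 'http://www.example.com/login'        # hypothetical form URL
postdata = urllib.urlencode({'username': 'myuser', 'password': 'secret'})
req = Request(posturl, postdata, txheaders)     # supplying data makes this a POST
handle = urlopen(req)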

If you want to adapt this code for yourself, it is worth noting the following things:

We can always tell which import was successful:

  • If we are using cookielib then cookielib is not None.
  • If we are using ClientCookie then ClientCookie is not None.
  • If we are using neither then cj is None.

Request is the name bound to the appropriate class for creating Request objects, and urlopen is the name bound to the appropriate function for opening URLs, whichever library we have used!
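
One last note for anyone adapting this code: by default the save method of an LWPCookieJar skips session cookies (those marked to be discarded when the session ends). If you want those kept between runs as well, passing ignore_discard should do it - a sketch for the cookielib case; ClientCookie's cookie jars should accept the same arguments, since cookielib grew out of ClientCookie:

cj.save(COOKIEFILE, ignore_discard=True, ignore_expires=True)
# and load them back the same way at the start of the next run
cj.load(COOKIEFILE, ignore_discard=True, ignore_expires=True)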

[1] Or the path set out in the cookie.
[2] See this paper by David Kristol and the cookie FAQ for more information on cookies than you could possibly want.
