cookielib and ClientCookie

Handling Cookies in Python

Note: There is a French translation of this article - cookielib et ClientCookie.
Introduction

Some websites can't be browsed without cookies enabled. They are used for storing session information or confirming a user's identity. They are sometimes used as an alternative to a scheme like basic authentication.

The right Python module for fetching webpages, or other resources across the internet, is usually urllib2. It offers a simple interface for fetching resources using a variety of protocols. For a good introduction to urllib2, browse over to the urllib2 tutorial.

By default urllib2 doesn't handle cookies though. You need to use an additional library to do this. In Python 2.4 this is called cookielib and is part of the standard library. Prior to Python 2.4 it existed as ClientCookie, but cookielib is not a drop-in replacement for it: in Python 2.4 some of the functionality of ClientCookie has been moved into urllib2.

It is possible to write code that will work the same in all of these situations. This article illustrates how to use cookielib/ClientCookie and shows code for fetching URIs that will work unchanged on:

- a machine with Python 2.4 or later, where cookielib is available
- a machine with an earlier Python and ClientCookie installed
- a machine with neither
Where either cookielib or ClientCookie is available, the cookies will be saved in a file. On a machine with neither, URLs will still be fetched - but any cookies sent won't be handled or saved.

Cookies

When a website sends a page to a client, it first sends a set of headers that describe that HTTP transaction. One of those headers can contain a line of text known as the cookie. If you fetch another page from the same server [1] then the cookie should be sent back to the server as one of the request headers. This allows the cookie to store information that the server can use to identify you, or the session you are engaged in. Obviously for some processes, this information is essential.

This was first supported by the Netscape browser, and so the first spec was called the Netscape cookie protocol. It was later written up, with changes, as RFC 2109, and then there was an attempt to extend this in the form of RFC 2965 - which has never been widely used. In reality, the protocol implemented by all the major browsers is still based on the Netscape protocol. By now it only bears a passing resemblance to the protocol sketched out in the original document [2].

Conditional Imports

A version of this code has been printed in the second edition of the Python Cookbook. One of the reasons it was included in the cookbook is that it illustrates an interesting programming idiom called conditional import. It's particularly important in this recipe because we need the behaviour of the underlying code to be slightly different depending on which library is available. The interface we present to the programmer is the same in all three cases though.

Pay attention to the first chunk of code, which attempts to import the cookie libraries. This has to set up different behaviour depending on which library it imports. The pattern it uses is:

    library = None
    try:
        import library
    except ImportError:
        # library not available
        setup alternate behaviour ....
    else:
        # library is available
        establish normal behaviour ....
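As a concrete, runnable illustration of the pattern above, here is a minimal sketch that applies it to cookielib alone. The helper names new_cookie_jar and cookie_support are hypothetical, introduced only for this example; they are not part of the recipe. On an interpreter that has no module called cookielib, the except branch runs and the marker name simply stays bound to None.

```python
# Marker: this name stays bound to None if the import below fails.
cookielib = None

try:
    import cookielib
except ImportError:
    # Library not available: set up reduced behaviour.
    def new_cookie_jar():
        """Hypothetical helper: no cookie library, so no jar."""
        return None
else:
    # Library is available: establish normal behaviour.
    def new_cookie_jar():
        """Hypothetical helper: return a fresh CookieJar."""
        return cookielib.CookieJar()


def cookie_support():
    """Hypothetical helper: report which branch ran, using the marker."""
    if cookielib is None:
        return 'no cookie handling'
    return 'cookies handled by cookielib'
```

Code that runs after the import block can call cookie_support(), or test the marker directly, without needing a second try/except.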
We use the name of the library we are importing as a marker, by setting it to None at the start. Later in the code we can tell if the library is available by using code like the following:

    if library is None:
        # we know library is not available
        provide different or reduced function ....
    else:
        # library is available
        .....

The Code

The code shown in this article can be downloaded from the Voidspace Recipebook.

In this first section we attempt to import a cookie handling library. We first try cookielib, then ClientCookie. If neither is available then we default to objects from urllib2.

    import os.path
    import sys          # needed for the sys.exit() call when a fetch fails
    import urllib2

    COOKIEFILE = 'cookies.lwp'
    # the path and filename to save your cookies in

    cj = None
    ClientCookie = None
    cookielib = None

    # Let's see if cookielib is available
    try:
        import cookielib
    except ImportError:
        # If importing cookielib fails
        # let's try ClientCookie
        try:
            import ClientCookie
        except ImportError:
            # ClientCookie isn't available either
            urlopen = urllib2.urlopen
            Request = urllib2.Request
        else:
            # imported ClientCookie
            urlopen = ClientCookie.urlopen
            Request = ClientCookie.Request
            cj = ClientCookie.LWPCookieJar()
    else:
        # importing cookielib worked
        urlopen = urllib2.urlopen
        Request = urllib2.Request
        cj = cookielib.LWPCookieJar()
        # This is a subclass of FileCookieJar
        # that has useful load and save methods

We've now imported the relevant library. Whichever library is being used, the name urlopen is bound to the right function for retrieving URLs. The name Request is bound to the right class for creating Request objects. If we successfully managed to import a cookie handling library then the name cj is bound to a CookieJar instance.

Installing the CookieJar

Now we need to get our CookieJar installed in the default opener for fetching URLs. This means that all calls to urlopen will have their cookies handled. The actual handling is done by an object called an HTTPCookieProcessor.
If the terms opener and handler are new to you, then read either the urllib2 docs or my urllib2 tutorial.

All this is done either in ClientCookie or in urllib2, depending on which module we successfully imported.

    if cj is not None:
        # we successfully imported
        # one of the two cookie handling modules

        if os.path.isfile(COOKIEFILE):
            # if we have a cookie file already saved
            # then load the cookies into the Cookie Jar
            cj.load(COOKIEFILE)

        # Now we need to get our Cookie Jar
        # installed in the opener;
        # for fetching URLs
        if cookielib is not None:
            # if we use cookielib
            # then we get the HTTPCookieProcessor
            # and install the opener in urllib2
            opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
            urllib2.install_opener(opener)
        else:
            # if we use ClientCookie
            # then we get the HTTPCookieProcessor
            # and install the opener in ClientCookie
            opener = ClientCookie.build_opener(ClientCookie.HTTPCookieProcessor(cj))
            ClientCookie.install_opener(opener)

If one of the cookie libraries is available, any call to urlopen will now handle cookies using the CookieJar instance we've created.

Fetching Webpages

So having done all the dirty work, we're ready to fetch our webpages. Any cookies sent will be handled. This means they will be stored in the CookieJar, returned to the server when appropriate, and expired correctly as well. Because we may want to restart the same session next time, we save the cookies when we've finished.

    theurl = 'http://www.google.co.uk/search?hl=en&ie=UTF-8&q=voidspace&meta='
    # an example url that sets a cookie,
    # try different urls here and see the cookie collection you can make !

    txdata = None
    # if we were making a POST type request,
    # we could encode a dictionary of values here,
    # using urllib.urlencode(somedict)

    txheaders = {'User-agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
    # fake a user agent, some websites (like google) don't like automated exploration

    try:
        req = Request(theurl, txdata, txheaders)
        # create a request object

        handle = urlopen(req)
        # and open it to return a handle on the url

    except IOError, e:
        print 'We failed to open "%s".' % theurl
        if hasattr(e, 'code'):
            print 'We failed with error code - %s.' % e.code
        elif hasattr(e, 'reason'):
            print "The error object has the following 'reason' attribute :"
            print e.reason
            print "This usually means the server doesn't exist,",
            print "is down, or we don't have an internet connection."
        sys.exit()

    else:
        print 'Here are the headers of the page :'
        print handle.info()
        # handle.read() returns the page
        # handle.geturl() returns the true url of the page fetched
        # (in case urlopen has followed any redirects, which it sometimes does)

    if cj is None:
        print "We don't have a cookie library available - sorry."
        print "I can't show you any cookies."
    else:
        print 'These are the cookies we have received so far :'
        for index, cookie in enumerate(cj):
            print index, '  :  ', cookie
        cj.save(COOKIEFILE)     # save the cookies again

If you want to adapt this code for yourself, it is worth noting the following things.

We can always tell which import was successful:

- If we are using cookielib then cookielib is not None.
- If we are using ClientCookie then ClientCookie is not None.
- If we are using neither then cj is None.
Request is the name bound to the appropriate class for creating Request objects, and urlopen is the name bound to the appropriate function for opening URLs, whichever library we have used.
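To see the save and load cycle in isolation, without contacting a server, the following sketch puts one hand-built cookie into an LWPCookieJar, saves it to a file, and loads it into a fresh jar. Several things here are assumptions made for illustration rather than part of the recipe: the helpers make_cookie and round_trip, the domain www.example.com, the cookie name and value, and the fallback import of http.cookiejar (the name under which the cookielib module is found on Python 3).

```python
import os
import shutil
import tempfile
import time

try:
    import cookielib                    # Python 2.4 and later 2.x releases
except ImportError:
    # Assumption: on Python 3 the same module is called http.cookiejar.
    import http.cookiejar as cookielib


def make_cookie(name, value, domain):
    """Hypothetical helper: build a persistent cookie by hand.

    Normally a server sets cookies; this stands in for that step.
    """
    return cookielib.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False,
        expires=int(time.time()) + 3600,    # one hour from now
        discard=False,                      # not session-only, so save() keeps it
        comment=None, comment_url=None, rest={})


def round_trip(cookiefile):
    """Hypothetical helper: save one cookie, then load it into a new jar."""
    first = cookielib.LWPCookieJar()
    first.set_cookie(make_cookie('session', 'abc123', 'www.example.com'))
    first.save(cookiefile)

    second = cookielib.LWPCookieJar()
    second.load(cookiefile)
    return [(c.name, c.value, c.domain) for c in second]


tmpdir = tempfile.mkdtemp()
try:
    restored = round_trip(os.path.join(tmpdir, 'cookies.lwp'))
finally:
    shutil.rmtree(tmpdir)       # remove the temporary cookie file
```

By default save() leaves out cookies marked as session-only (discard) and cookies that have already expired, which is why the hand-built cookie sets discard to False and an expiry time in the future.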
Last edited Fri Feb 15 13:42:08 2008.