Basic AuthenticationAuthentication with PythonNote There is a French translation of this article - Authentification Basique.
Contents
IntroductionThis tutorial aims to explain and illustrate what basic authentication is, and how to deal with it from Python. You can download the code from this tutorial from the Voidspace Python Recipebook. The first example, So Let's Do It, shows how to do it manually. This illustrates how authentication works. The second example, Doing it Properly, shows how to handle it automatically - with a handler. These examples make use of the Python module urllib2. This provides a simple interface to fetching pages across the internet, the urlopen function. It provides a more complex interface to specialised situations in the form of openers and handlers. These are often confusing to even intermediate level programmers. For a good introduction to using urllib2, read my urllib2 tutorial. Basic AuthenticationThere is a system for requiring a username/password before a client can visit a webpage. This is called authentication and is implemented on the server. It allows a whole set of pages (called a realm) to be protected by authentication. This scheme (or schemes) are defined by the HTTP spec, and so whilst python supports authentication it doesn't document it very well. HTTP documentation is in the form of RFCs [1] which are technical documents and so not the most readable The two normal [2] authentication schemes are basic and digest authentication. Between these two, basic is overwhelmingly the most common. As you might guess, it is also the simpler of the two. A summary of basic authentication goes like this :
The following sections covers these steps in more details. Making a RequestA client is any program that makes requests over the internet. It could be a browser - or it could be a python program. When a client asks for a web page, it is sending a request to a server. The request is made up of headers with information about the request. These are the 'http request headers'. Getting A ResponseWhen the request reaches the server it sends a response back. The request may still fail (the page may not be found for example), but the response will still contain headers from the server. These are 'http response headers'. If there is a problem then this response will include an error code that describes the problem. You will already be familiar with some of these codes - 404 : Page not found, 500 : Internal Server Error, etc. If this happens; an exception [3] will be raised by urllib2, and it will have a 'code' attribute. The code attribute is an integer that corresponds to the http error code [4]. Error 401 and realmsIf a page requires authentication then the error code is 401. Included in the response headers will be a 'WWW-authenticate' header. This tells us the authentication scheme the server is using for this page and also something called a realm. It is rarely just a single page that is protected by authentication but a section - a 'realm' of a website. The name of the realm is included in this header line. The 'WWW-Authenticate' header line looks like WWW-Authenticate: SCHEME realm="REALM". For example, if you try to access the popular website admin application cPanel your browser will be sent a header that looks like : WWW-Authenticate: Basic realm="cPanel" If the client already knows the username/password for this realm then it can encode them into the request headers and try again. If the username/password combination are correct, then the request will succeed as normal. If the client doesn't know the username/password it should ask the user. This means that if you enter a protected 'realm' the client effectively has to request each page twice. The first time it will get an error code and be told what realm it is attempting to access - the client can then get the right username/password for that realm (on that server) and repeat the request. HTTP is a 'stateless' protocol. This means that a server using basic authentication won't 'remember' you are logged in [5] and will need to be sent the right header for every protected page you attempt to access. First ExampleSuppose we attempt to fetch a webpage protected by basic authentication. : theurl = 'http://www.someserver.com/somepath/someprotectedpage.html' req = urllib2.Request(theurl) try: handle = urllib2.urlopen(req) except IOError, e: if hasattr(e, 'code'): if e.code != 401: print 'We got another error' print e.code else: print e.headers print e.headers['www-authenticate'] Note If the exception has a 'code' attribute it also has an attribute called 'headers'. This is a dictionary like object with all the headers in - but you can also print it to display all the headers. See the last line that displays the 'www-authenticate' header line which ought to be present whenever you get a 401 error. A typical output from above example looks like : WWW-Authenticate: Basic realm="cPanel" Connection: close Set-Cookie: cprelogin=no; path=/ Server: cpsrvd/9.4.2 Content-type: text/html Basic realm="cPanel" You can see the authentication scheme and the 'realm' part of the 'www-authenticate' header. Assuming you know the username and password you can then navigate around that website - whenever you get a 401 error with the same realm you can just encode the username/password into your request headers and your request should succeed. The Username/PasswordLets assume you need to access pages which are all in the same realm. Assuming you have got the username and password from the user, you can extract the realm from the header. Then whenever you get a 401 error in the same realm you know the username and password to use. So the only detail left, is knowing how to encode the username/password into the request header base64There is a very simple recipe base64 recipe over on the Activestate Python Cookbook (It's actually in the comments of that page). It shows how to encode a username/password into a request header. It goes like this :
import base64
base64string = base64.encodestring('%s:%s' % (username, password))[:-1]
req.add_header("Authorization", "Basic %s" % base64string)
Where req is our request object like in the first example. So Let's Do ItLet's wrap all this up with an example that shows accessing a page, extracting the realm, then doing the authentication. We'll use a regular expression to pull the scheme and realm out of the response header. : import urllib2 import sys import re import base64 from urlparse import urlparse theurl = 'http://www.someserver.com/somepath/somepage.html' # if you want to run this example you'll need to supply # a protected page with your username and password username = 'johnny' password = 'XXXXXX' # a very bad password req = urllib2.Request(theurl) try: handle = urllib2.urlopen(req) except IOError, e: # here we *want* to fail pass else: # If we don't fail then the page isn't protected print "This page isn't protected by authentication." sys.exit(1) if not hasattr(e, 'code') or e.code != 401: # we got an error - but not a 401 error print "This page isn't protected by authentication." print 'But we failed for another reason.' sys.exit(1) authline = e.headers['www-authenticate'] # this gets the www-authenticate line from the headers # which has the authentication scheme and realm in it authobj = re.compile( r'''(?:\s*www-authenticate\s*:)?\s*(\w*)\s+realm=['"]([^'"]+)['"]''', re.IGNORECASE) # this regular expression is used to extract scheme and realm matchobj = authobj.match(authline) if not matchobj: # if the authline isn't matched by the regular expression # then something is wrong print 'The authentication header is badly formed.' print authline sys.exit(1) scheme = matchobj.group(1) realm = matchobj.group(2) # here we've extracted the scheme # and the realm from the header if scheme.lower() != 'basic': print 'This example only works with BASIC authentication.' sys.exit(1) base64string = base64.encodestring( '%s:%s' % (username, password))[:-1] authheader = "Basic %s" % base64string req.add_header("Authorization", authheader) try: handle = urllib2.urlopen(req) except IOError, e: # here we shouldn't fail if the username/password is right print "It looks like the username or password is wrong." sys.exit(1) thepage = handle.read() When the code has run the contents of the page we've fetched is saved as a string in the variable 'thepage'. Warning If you are writing an http client of any sort that has to deal with basic authentication, don't do it this way. The next example that shows using a handler is the right way of doing it. Doing it ProperlyIn actual fact the proper way to do BASIC authentication with Python is to install an opener that uses an authentication handler. The authentication handler needs a passowrd manager - and then you're away Every time you use urlopen you are using handlers to deal with your request - whether you know it or not. The default opener has handlers for all the standard situations installed [7]. What we need to do is create an opener that has a handler that can deal with basic authentication. The right handler for our needs is called urllib2.HTTPBasicAuthHandler. As I mentioned it also needs a password manager - urllib2.HTTPPasswordMgr. Unfortunately our friend HTTPPasswordMgr has a slight problem - you must already know the realm you're fetching. Luckily it has a near cousin HTTPPasswordMgrWithDefaultRealm. Despite the keyboard busting name, it's a bit more friendly to use. If you don't know the name of the realm - then pass in None for the realm, and it will try the username and password you give it - whatever the realm. Seeing as you are going to specify a specific URL, it is likely that this will be sufficient. If you aren't convinced then you can always use HTTPPasswordMgr and extract the realm from the authentication header the first time you meet it. This example goes through the following steps :
At this point we have a choice. We can either use the open method of the opener directly. This leaves urllib2.urlopen using the default opener. Alternatively we can make our opener the default one. This means all future calles to urlopen will use this opener. As all openers have the default handlers installed as well as the ones you pass it, it shouldn't break urlopen to do this. In the example below we install it, making it the default opener : import urllib2 theurl = 'www.someserver.com/toplevelurl/somepage.htm' protocol = 'http://' username = 'johnny' password = 'XXXXXX' # a great password passman = urllib2.HTTPPasswordMgrWithDefaultRealm() # this creates a password manager passman.add_password(None, theurl, username, password) # because we have put None at the start it will always # use this username/password combination for urls # for which `theurl` is a super-url authhandler = urllib2.HTTPBasicAuthHandler(passman) # create the AuthHandler opener = urllib2.build_opener(authhandler) urllib2.install_opener(opener) # All calls to urllib2.urlopen will now use our handler # Make sure not to include the protocol in with the URL, or # HTTPPasswordMgrWithDefaultRealm will be very confused. # You must (of course) use it when fetching the page though. pagehandle = urllib2.urlopen(protocol + theurl) # authentication is now handled automatically for us Hurrah - not so bad hey Warning See that when we set the url in the password manager we don't include the protocol - the http:// part. A Word About CookiesSome websites may also use cookies alongside authentication. Luckily there is a library that will allow you to have automatic cookie management without having to think about it. This is ClientCookie. In Python 2.4 it becomes part of the python standard library as cookielib. See my article on cookielib - for an example of how to use it. Footnotes
For buying techie books, science fiction, computer hardware or the latest gadgets: visit The Voidspace Amazon Store. If you're looking for a new techie job, try the Voidspace Tech Job Board. This is part of the Hidden Network of technology and programming jobs.
Last edited Fri Feb 15 13:42:08 2008. Counter... |
|||||||||||||||
|
Blogads
Follow me on: Tech Jobs |