Python Programming, news on the Voidspace Python Projects and all things techie.

#97

Techie Articles

There are three new techie articles up at Voidspace. The first two are related to Python, the third is about junk email.

  1. Duck Typing with Python - this is a brief discussion of typing in Python, followed by a rant on duck typing.
  2. Data Persistence in ConfigObj - a discussion (and associated code) on using ConfigObj for data persistence.
  3. SpamTrap - this is an article on beating spam, and suggests what I think is an alternative approach. It's a WIP because I haven't done the code implementation, but it could work.

Posted by Fuzzyman on 2005-09-08 11:56:30


#96

Duck Typing in Python

Note

This article also appears as Duck Typing in Python. I've included it in full here so that people can comment on it.

Dynamic Typing

Caution!

A lot of what is written here has some exceptions - but it's basically true. It could be hedged around with all sorts of caveats - but as far as I can tell, none of it is actually going to mislead you.

Python is dynamically but strongly typed. The fact that the two halves of that statement fit together can confuse those who come from a statically typed language background. In Python it is perfectly legal to do this :

variable = 3
variable = 'hello'

So hasn't variable just changed type ? The answer is a resounding no. variable isn't an object at all - it's a name. In the first statement we create an integer object with the value 3 and bind the name 'variable' to it. In the second statement we create a new string object with the value 'hello', and rebind the name 'variable' to it. If there are no other names bound to the first object (or more properly no references to the object - not all references are names) then its reference count will drop to zero and it will be garbage collected.

This is all fairly basic Python stuff - but it can take the newbie a while to get his (or hopefully her) head around. We say Python is dynamically typed because we pass around references and don't check the type until the last possible minute [1]. We say it is strongly typed because objects don't change type.
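The two halves of that statement can be seen in a few lines. This is only a minimal sketch; the names first_type, second_type and result are made up for the illustration.

```python
# Dynamic typing: the name is rebound, the objects themselves never change type.
variable = 3
first_type = type(variable)       # the int object the name was bound to
variable = 'hello'
second_type = type(variable)      # a different object, of type str

# Strong typing: Python will not quietly turn an int into a str (or vice versa).
try:
    result = 3 + 'hello'
except TypeError:
    # The check only happens when the addition is actually attempted.
    result = None
```

The name happily points at objects of different types, but mixing the objects themselves in an operation neither of them supports is an error.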

Duck typing

There is another concept in this typing lark that is a feature of dynamic languages. This is duck typing. The idea is that it doesn't actually matter what type my data is - just whether or not I can do what I want with it.

For example in a statically typed language we have a concept of adding. Some types of object can be added - usually only to objects of the same type. (Although most languages will let you add an integer to a floating point number - resulting in a floating point number). Try to add different types of objects together and the compiler will tell you that you're not allowed to [2].

In Python we allow the object to define what it means to be added. The expression 3 + 3 is syntactic sugar for calling the __add__ method of the integer type. It's the same as calling int.__add__(3, 3). This means that if you define an __add__ method for one of your classes you can make all sorts of things happen when you add instances of them together [3].
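As a rough sketch of defining __add__ yourself, here is a made-up Vector class (it isn't from any library, it's just for illustration) whose instances can be added with the normal + syntax.

```python
class Vector(object):
    """Hypothetical two dimensional vector, used only to illustrate __add__."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __add__(self, other):
        # Called by Python whenever we write: vector + other
        return Vector(self.x + other.x, self.y + other.y)


# The operator form...
total = Vector(1, 2) + Vector(3, 4)

# ...and the explicit method call it is sugar for.
same = Vector.__add__(Vector(1, 2), Vector(3, 4))
```

Both total and same end up as vectors with x of 4 and y of 6.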

Much of Python syntax is sugar for underlying methods. Especially data access. Accessing members of both sequence type objects and mapping type objects is done by using the __getitem__ method of these objects.

a = [0, 1, 2, 3]
print(a[0])        # prints 0
b = {'a': 0, 'b': 1}
print(b['a'])      # prints 0

is exactly the same as :

a = [0, 1, 2, 3]
print(list.__getitem__(a, 0))        # prints 0
b = {'a': 0, 'b': 1}
print(dict.__getitem__(b, 'a'))      # prints 0

In the first example we use normal Python syntax. In the second example we do what the first example is doing under the hood. In order to set members we would use the __setitem__ method instead of __getitem__. There are lots more examples of syntactic sugar - including comparing objects and accessing attributes. Beyond this we start getting into the realms of descriptors and meta programming. I haven't yet climbed these lofty heights - but playing with properties seems to be coming alarmingly close.
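The same pattern holds for setting members, comparing objects and accessing attributes. A small sketch (the variable names are just for the example):

```python
numbers = [0, 1, 2, 3]

# Setting a member with normal syntax...
numbers[0] = 10
# ...and with the method that syntax is sugar for.
list.__setitem__(numbers, 1, 11)

# Comparison goes through __eq__ (both sides here evaluate to False).
same_comparison = (3 == 4) == int.__eq__(3, 4)

# Attribute access with dot syntax and with the getattr built-in.
same_attribute = 'abc'.upper() == getattr('abc', 'upper')()
```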

Duck typing happens because when we do a['member'] Python doesn't care what type of object a is. All it cares about is whether the call to its __getitem__ method returns anything sensible. If not - an error will be raised. Something like TypeError: unsubscriptable object.

This means you can create your own classes that have their own internal data structures - but are accessed using normal Python syntax. This is awfully convenient.

For example in my module ConfigObj [4] we read config files. These are values where each value has a name - just like a dictionary (a mapping type object). So you can do :

config = ConfigObj(filename)
value = config['member 1']
value2 = config['member 2']

and so on... no need for horrible getter and setter methods.

It also makes it easy to write things like ordered dictionaries that keep keys in insertion order and all sorts of other things.
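To give a feel for how little is needed, here is a minimal sketch of a dictionary like class that remembers insertion order. OrderedMapping is a made-up name, and this is nowhere near a full implementation (no deletion, no iteration) - it only shows the idea of hiding your own data structures behind normal syntax.

```python
class OrderedMapping(object):
    """Hypothetical dictionary like class that keeps keys in insertion order."""

    def __init__(self):
        self._values = {}     # key -> value
        self._order = []      # keys, in the order they were first set

    def __setitem__(self, key, value):
        # Only record the key the first time we see it.
        if key not in self._values:
            self._order.append(key)
        self._values[key] = value

    def __getitem__(self, key):
        return self._values[key]

    def keys(self):
        return list(self._order)


config = OrderedMapping()
config['zebra'] = 1
config['apple'] = 2
config['zebra'] = 3       # updating a value doesn't move the key
```

Code using config never needs to know about the two internal structures - it just uses config['zebra'] and config.keys().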

Problem

The principle of duck typing says that you shouldn't care what type of object you have - just whether or not you can do the required action with your object. For this reason the isinstance built-in function is frowned upon.

isinstance(object, dict) returns True if object is a dictionary - or an instance of a subclass of dict. What this usually means is that you want to perform some action only appropriate to a mapping type object.

Instead of :

if isinstance(object, dict):
    value = object[member]

it is considered more pythonic to do :

try:
    value = object[member]
except TypeError:
    # do something else
    pass

This means that anyone else using your code doesn't have to use a real dictionary or subclass (by the way - don't use the name object in your code, this is an example !) - they can use any object that implements the mapping interface.

Unfortunately in practice it's not that simple. What if member in the above example might be an integer ? Integers are immutable and hashable - so it's perfectly reasonable to use them as dictionary keys. However they are also used to index sequence type objects. If member happens to be an integer then example two could let through lists and strings as well as dictionaries.
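A quick sketch of the problem. The lookup function is hypothetical - it just wraps the try/except approach from example two so we can try it on different objects.

```python
def lookup(container, member):
    """Hypothetical helper: the try/except approach from example two."""
    try:
        return container[member]
    except TypeError:
        return None


from_dict = lookup({0: 'zero'}, 0)     # what we actually wanted
from_list = lookup(['zero'], 0)        # a list slips through as well
from_string = lookup('zero', 0)        # and so does a string
from_int = lookup(42, 0)               # only this one raises TypeError
```

With an integer key the dictionary, the list and the string all 'work', so the TypeError test tells us nothing about which kind of object we were given.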

If we want our code to treat different types of object differently then the approach in example two fails. This isn't contrived - this is exactly the situation we found ourselves in with ConfigObj. If you are setting a new member - passing in a dictionary creates a new section. This has to be handled differently to just setting a value (which could be a string, a list, a boolean, or whatever).

It gets worse if you need to tell the difference between a string and a list (which we also needed to do). They're both sequence types - so any way of accessing a list member is a valid way of indexing a string.

You can tell the difference by which methods each type provides. For example a dictionary has a keys method which lists don't have. Strings have all sorts of methods that lists don't have (e.g. lower).

Our example above could become :

if hasattr(object, 'keys'):
    value = object[member]

This is still arbitrary though. It is perfectly possible to create a dictionary like object that doesn't have the keys method. What's a guy to do if he wants to detect dictionary like objects ?
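Here is a sketch of how the hasattr test goes wrong. Minimal and looks_like_dict are made-up names for the illustration.

```python
class Minimal(object):
    """Hypothetical mapping that supports lookup but has no keys method."""

    def __init__(self, data):
        self._data = dict(data)

    def __getitem__(self, key):
        return self._data[key]


def looks_like_dict(obj):
    # The arbitrary rule from the example above: "has a keys method".
    return hasattr(obj, 'keys')


checks = {
    'dict': looks_like_dict({'a': 1}),
    'list': looks_like_dict([1, 2]),
    'string': looks_like_dict('abc'),
    'minimal': looks_like_dict(Minimal({'a': 1})),
}
```

The rule correctly rejects the list and the string, but it also rejects Minimal - even though Minimal({'a': 1})['a'] works perfectly well.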

The problem is exemplified in the operator module. This module has two functions, isMappingType and isSequenceType. The theory is that these functions will return True if an object you pass them is of the requisite type. For the built in types this works fine. However the mapping type and sequence type interfaces are so poorly defined (i.e. not defined at all) that both functions return True for any user defined class [5]. This makes them utterly useless for detecting mapping type and sequence type objects.

So the Python mapping type and sequence type 'interfaces' are so vague that we can't really use duck typing at all.

Subclassing the Built In Types

There's a good principle which says don't use the word new in names. It gets old pretty quickly.

Ever since Python 2.2 we've had new style classes. This removed the old type/class dichotomy (good word hey) and allowed us to subclass the built in types. Any object that has object as its ultimate base class is considered a new style class.

This means I can now create my own class of objects that are a subclass of dictionary.

My assertion is that (currently) the only sensible way of telling if an object will behave sensibly as a mapping type object or a sequence type object is by isinstance tests. That means that if an object inherits from dict you can assume it's safe to treat it as a dictionary like object (and so on).

There is an argument (isinstance considered harmful) that says you can't rely on a subclass to properly implement the interface of the parent class. This is true enough - Python won't stop you writing broken code.

In ConfigObj we were asked to allow the passing in of 'dictionary like objects' [6] to create new sections. This would have meant removing the isinstance tests and replacing them with something else.

The closest we could come up with was by relying on which methods an object has. This meant defining our own, arbitrary, rules about what methods strings, other sequences (lists or tuples), and dictionary like objects should have. We gave up in disgust and decided that dictionary like objects should inherit from dict and sequences should inherit from the appropriate base class.
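The isinstance approach can be sketched like this. It is not the actual ConfigObj code - Section and classify are hypothetical names, and the sketch assumes a Python where str is the string type.

```python
class Section(dict):
    """Hypothetical dict subclass standing in for a 'dictionary like object'."""


def classify(value):
    """Decide how a value should be handled, using isinstance tests only."""
    if isinstance(value, dict):
        return 'section'          # dict, or anything that inherits from it
    if isinstance(value, str):
        return 'string'           # checked before the general sequence test
    if isinstance(value, (list, tuple)):
        return 'sequence'
    return 'other'


kinds = [classify(Section(a=1)), classify('abc'), classify(['a', 'b']), classify(3)]
```

Because Section inherits from dict it is treated as a section without any extra rules, and strings are never mistaken for lists.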

The alternative is for the Python community to define what interfaces objects should have.

I would suggest something like :

  • String like objects should inherit from str
  • Sequence objects should implement __getitem__ as a minimum
  • Dictionary like objects should implement __getitem__ and keys as a minimum

This would mean that we could actually know what we're talking about when we say a dictionary like object. It also means that isMappingType and isSequenceType can finally do something useful after all these years [7].
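If rules like these were agreed, the checks could be as simple as the following sketch. The function names are made up, and excluding strings and dictionary like objects from the sequence test is my own assumption about how the three rules would fit together.

```python
def is_string_like(obj):
    # Rule one: string like objects inherit from str.
    return isinstance(obj, str)


def is_dict_like(obj):
    # Rule three: __getitem__ and keys as a minimum.
    return hasattr(obj, '__getitem__') and hasattr(obj, 'keys')


def is_sequence_like(obj):
    # Rule two: __getitem__ as a minimum. Assumption: anything already
    # claimed by the other two rules doesn't count as a plain sequence.
    return (hasattr(obj, '__getitem__')
            and not is_string_like(obj)
            and not is_dict_like(obj))
```

With these, a list counts as a sequence, a dict counts as dictionary like, and a string is only ever string like.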


Footnotes

[1] This is as good an explanation as I can come up with.
[2] These are the sort of errors that static heads are ever so proud of not being allowed to make.
[3] The path module has a good example of this. It allows you to add paths together and automatically inserts the correct separator between the parts.
[4] Written in conjunction with Nicola Larosa.
[5] Possibly only true for old style classes - I haven't bothered to check this.
[6] Which didn't inherit from dict.
[7] It's possible that the proposed 'interfaces' could address this issue.

Posted by Fuzzyman on 2005-09-08 11:50:39


#95

Offline to Online

I think we are still in the process of a paradigm shift - moving from the offline world to an 'always on' world. This means a move from individual computer as platform - to the internet as the platform for new applications. Actually it's more a kind of paradigm drift...

I think that what people want to do involves being connected. This makes sense - it is basically impossible to achieve anything that is encapsulated within a single PC [1]. For anything to have any effect beyond the mind of the individual it has to be (at the least) communicated.

In the old days we achieved this with floppy drives and printers - now we have email, and chat programs, and shared documents. But the change to online communication is still happening. People already use computers in multiple places (at home and at the office for example) and they work with many different people on the same projects. Yet most applications are still basically either offline or online.

You have some 'applications', like bulletin boards and web forums, that are wholly online. These happen on someone else's server. To a certain extent modern applications like eBay and Amazon are in this category.

You also have offline applications like the office software - word processors and spreadsheets. The data belongs to you, and so to share it with others or access it from different locations you have to transfer the data somehow - usually by email.

Email can be either. It can be a 'webmail' application - like Hotmail or Gmail. In this case all your emails sit on someone else's machine. They are insecure and you are at their mercy.

Or you can use a desktop client that accesses your email for you. In this case the email is downloaded from 'online' (the server) and becomes available offline. However if you then move to another location the data is no longer available to you.

What we need are more applications that blur this distinction between online and offline. These need to still be usable in a world of insecure and unreliable connections.

P2P is an interesting technology and it at least partly blurs this distinction. Suddenly you have access to the data you want - wherever in the world it happens to be. You want a particular piece of music - hit search, bam - it's yours. Unfortunately this is (often) illegal - and also only works for commonly shared data. Try and access the paper you were halfway through writing, and unsurprisingly you won't find it.

An intermediate application might remotely sync your desktop applications and data with an online server. You can then access it from multiple locations - and even via a web browser.

We're already used to using diverse web services to access and share our data. Flickr for pictures, del.icio.us for our bookmarks, and Gmail for emails are a few that spring to mind. Perhaps the day will come when these different applications integrate seamlessly with the desktop applications we use. When we save a private document we know we will be able to access it securely from any location. When we mark a document as public (or available to specific groups of people) then other people can access it from wherever they are working.

The rub is that making this compatible with today's generation of offline/online tools is a lot of work. Every application has its own data storage format and protocols. Imagine writing an email sync tool that has a nice online interface - but can also monitor any changes made from Outlook, your offline tool (whichever version you happen to be using). Extend that to include all the major email clients (Thunderbird, Eudora, and so on) and you've got a huge task - and that's just one aspect of our everyday activities.

One way forward is standards. If every application structured its data in open and standard ways - then this kind of interoperability would be much easier. Programs can expose their data or an API in such a way that makes it easy for the data to be shared with other 'users' (whether that user is an application or a person). I think the real solution goes deeper than that though. I think the client-server (online-offline) distinction is going to get more blurred.

Filesystems are going to become distributed and applications are going to become a lot more like P2P. They're not all going to be traditional web applications - because the browser sucks [2] and they're not all public tools. But neither will they be desktop applications because it won't matter where you access them from. Applications are going to include 'live' data sharing built in. Data will be marked public, or private - belonging to individuals or groups. Everyone with the right permission will be able to access and work with the data.

Perhaps a better model for this is the programmer's tool, the 'version control system'. Public repositories hold the latest version of an application that is being worked on. This may be hundreds (or thousands) of individual files being worked on by many different people. The system includes access permissions and tools for rolling back to previous versions or resolving conflicts - when two people change the same piece of code.

It'll be a long time before that's the dominant model though. Maybe the next 'quantum leap' in this gradual paradigm drift is to share more data online. Make it easier to access our own data through an online server. But making this work seamlessly with the current generation of offline applications sounds messy and painful.

[1] By this I mean that anything we do only accrues value once it is communicated to someone else - whether by printed document or across the internet.
[2] And why should every application be accessed through a single interface?

Posted by Fuzzyman on 2005-09-08 11:45:47


#94

Lunix Fiend

Yes, it's true - I'm slowly being converted into a Lunix fiend.

I've moved Voidspace over to a virtual server account with unixshell.com. Nicola Larosa put a lot of effort into getting lighttpd set up for me, and I got Exim working.

Amazingly this was actually one of the easiest server moves I've had - surprisingly smooth.

Unfortunately it does mean that search isn't working yet (I'm switching to a new one called Namazu anyway) and the page counters are doing funny things. I like it though.

Posted by Fuzzyman on 2005-09-08 11:28:36


#93

ConfigObj and rest2web

ConfigObj beta 4 has just been released. This fixes two moderately serious bugs. It's a pain to have to do so many beta releases - but heck, that's what beta is for.

Now that ConfigObj and pythonutils are well into their beta releases my focus has switched back to rest2web.

I'm actively working towards a new release. The version currently in SVN already features the gallery and has lots of minor fixes etc.

The next release will include :

  • the gallery
  • ordered pages within a section
  • a new 'file' keyword in the restindex

I probably won't complete the plugin system before I do a new release. It basically works - but there is a difficult encoding problem (related to the order that things are processed) that I can't be bothered to think about.

The gallery is working nicely - at last you can see the results at :

The Voidspace Gallery Pages

The plugin system could potentially be used to bolt on an alternative templating system by the way - like cheetah. If anyone wanted to do that I'd be happy to work on integrating it better.

rest2web also badly needs a tutorial. It has lots of options in the restindex and lots of values available in the templates. This makes the documentation 'dense' and seem complicated. A basic site can be achieved very simply in rest2web though.

Also a brief note about 'dynamic sites'. With some optimisation and 'content caching' combined with 'pickling' of data structures - it would be possible to use rest2web as a framework to deliver dynamic content.

Hmmm.....

Posted by Fuzzyman on 2005-09-08 11:12:11