Python Programming, news on the Voidspace Python Projects and all things techie.
A Rambling Recording on Member Lookup in Python (podcast)
I was thinking about the Python object model, in part as a result of my post on The Python Class Statement. Python is a really easy language to learn, but it also has advanced features like its protocols, descriptors and metaclasses, that make the full object model pretty complex - and that's before you start looking at the corner cases.
It would be really nice to write up a single document describing the Python object model, including all of its intricacies. That sounds too much like hard work, so instead I recorded a rambling hand-wavy description of member lookup in Python. I don't go into full blown detail, but then this is a podcast - it won't seriously mislead you and no-one is going to use it as a reference guide...
- Python Attribute Lookup Part 1 on Audioboo
- Python Attribute Lookup Part 2 on Audioboo
- Python Attribute Lookup in full mp3 (9 minutes, 8MB)
This was recorded using the Blue Fire iPhone app whilst I was wandering around outside. I chopped out about half my pauses and coughing using Audacity, so if you think the quality is rough you should have heard the first version.
Topics covered include:
- Member lookup on instances and classes
- How the interpreter looks up protocol ('magic') methods
- __getattr__ and its mysterious cousin __getattribute__
- Descriptors, bound methods, properties and friends
In the podcast I mention the new technique I have for dynamically mocking magic methods. Magic methods, when they are called for you by the interpreter, are usually looked up directly on the class. Unfortunately Python is not entirely consistent: some magic methods are still looked up on the instance first, before the class. This is gradually being fixed in Python (in 2.7 they are pretty much all fixed), but the inconsistency is a pain for mocking the magic methods.
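To make the lookup rule concrete, here is a minimal sketch (written for Python 3, where the rule is applied consistently; the Sized class is a made-up example). A special method assigned on an instance is found by explicit attribute access, but ignored when the interpreter invokes the protocol itself:

```python
class Sized(object):
    def __len__(self):
        return 3

s = Sized()

# Assigning on the instance only affects explicit attribute lookup...
s.__len__ = lambda: 42
explicit = s.__len__()   # found in the instance __dict__

# ...whereas len() makes the interpreter look on type(s), not on s
implicit = len(s)
```

This is why simply setting a magic method on an ordinary instance isn't enough to change how the object behaves.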
Mock now allows you to mock the magic methods by assigning an appropriate function, one that takes self as the first argument, to the magic method on the mock instance. By default mocks do not implement the magic methods, except the ones Mock uses itself. When you assign one, the mock dynamically grows it on just that instance - all other mock instances are unaffected. Magic methods can then be looked up on the class or the instance, either way works (and you can delete them):
>>> from mock import Mock
>>> m = Mock()
>>> m
<mock.Mock object at 0x429770>
>>> m.__repr__ = lambda self: 'A Mock Object'
>>> m
A Mock Object
>>> m.__repr__()
'A Mock Object'
>>> del m.__repr__
>>> m
<mock.Mock object at 0x429770>
You can also use Mocks for magic methods. Here's an example of mocking out the built-in open function when used as a context manager:
@patch('__builtin__.open')
def test_with_statement(self, mock_open):
    # open('filename') returns mock_open.return_value, so that is the
    # object that acts as the context manager
    manager = mock_open.return_value
    manager.__enter__ = Mock()
    manager.__exit__ = Mock()
    manager.__exit__.return_value = False
    with open('filename') as handle:
        handle.read()
    mock_open.assert_called_with('filename')
    manager.__enter__.assert_called_with()
    manager.__enter__.return_value.read.assert_called_with()
    manager.__exit__.assert_called_with(None, None, None)
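For comparison, here is a hedged sketch of the same idea in Python 3 using unittest.mock, the standard library descendant of mock, rather than the pre-release code discussed in this post. MagicMock supports the context manager protocol out of the box. The read_file function is a hypothetical piece of code under test, not something from the original example:

```python
from unittest.mock import MagicMock, patch

def read_file(name):
    # hypothetical code under test
    with open(name) as handle:
        return handle.read()

with patch('builtins.open', MagicMock()) as fake_open:
    manager = fake_open.return_value          # what open('filename') returns
    handle = manager.__enter__.return_value   # what the with statement binds
    handle.read.return_value = 'some data'
    result = read_file('filename')

# The calls were recorded, including the implicit __enter__ / __exit__
fake_open.assert_called_with('filename')
manager.__enter__.assert_called_with()
handle.read.assert_called_with()
manager.__exit__.assert_called_with(None, None, None)
```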
The version of mock with magic method support hasn't yet been released, but you can pull it from the google code SVN repo. When I have time to write docs it will be released as 0.7.0.
There's a bit of trickery involved in making this work. If you're interested in how it's done look at the implementation of __new__ and __setattr__.
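As an illustration of the kind of trick that __new__ and __setattr__ make possible, here is a minimal sketch (Python 3). It is not Mock's actual code, and GrowsMagic is a made-up name: every instance gets its own private subclass, so special methods can be put on that class without affecting any other instance.

```python
class GrowsMagic(object):
    def __new__(cls):
        # Create a throwaway subclass just for this instance
        private_class = type(cls.__name__, (cls,), {})
        return object.__new__(private_class)

    def __setattr__(self, name, value):
        if name.startswith('__') and name.endswith('__'):
            # Special methods go on the private class, which is where
            # the interpreter looks for them
            setattr(type(self), name, value)
        else:
            object.__setattr__(self, name, value)

    def __delattr__(self, name):
        if name.startswith('__') and name.endswith('__'):
            delattr(type(self), name)
        else:
            object.__delattr__(self, name)

a = GrowsMagic()
b = GrowsMagic()
a.__len__ = lambda self: 5   # takes self, as it ends up on the class
```

Here len(a) returns 5 while len(b) still raises a TypeError, and del a.__len__ removes the method again.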
Posted by Fuzzyman on 2010-01-10 20:25:22 | |
Categories: Python, Projects, Hacking Tags: mocking, testing, podcast, magic methods
Notes on the Python Class Statement
Python classes are created at runtime, usually when you execute a script, or import the module they are defined in. Class creation is done primarily with the class statement. The class statement is executed by the Python runtime to create the class. Functions and names assigned in the body of the class statement become methods and attributes of the class.
You can easily see that the code inside the body of the class is executed, and that it can contain arbitrary code, by putting a print statement inside the class body:
>>> class ClassName(object):
...     print 'hello world'
...
hello world
Any assignments that happen in the body of the class definition create class members. Class and function definitions both cause names to be assigned, so classes defined inside the body of another class statement can be accessed as class attributes and functions defined inside the body of a class become methods.
Here's a trivial example with simply assigning a value to the name X:
>>> class SomeClass(object):
...     X = 3
...
>>> SomeClass.X
3
We can combine the fact that arbitrary code is executed with the assignment rule to conditionally define class members:
>>> import sys
>>> class SomeClass(object):
...     if sys.platform == 'darwin':
...         X = 3
...     else:
...         X = 4
...
>>> SomeClass.X
3
What happens in class creation (in Python 2 - the rules change slightly in Python 3 as the metaclass mechanism is improved) is that the class body is executed and the resulting collection of names and values is passed as a dictionary (along with the class name and a tuple of the base classes) to the metaclass, which is 'called'. If the metaclass is a type - which it usually is - this means the metaclass is instantiated. The resulting class object is then assigned to the class name in the scope in which it was defined. The resulting class is an object like everything else in Python. The dictionary of members becomes the class __dict__ (this is true even if the class uses __slots__, which only affects the instances). This dictionary is protected by being wrapped in a dictproxy. Although you can fetch members directly from the dictproxy you can't directly assign or delete members; instead you have to go through the normal attribute setting / deleting mechanisms:
>>> class SomeClass(object):
...     X = 3
...
>>> SomeClass.__dict__
<dictproxy object at 0x50b7b0>
>>> SomeClass.__dict__.keys()
['__dict__', 'X', '__module__', '__weakref__', '__doc__']
>>> SomeClass.__dict__['X']
3
>>> SomeClass.__dict__['Y'] = 4
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'dictproxy' object does not support item assignment
>>> del SomeClass.__dict__['X']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'dictproxy' object does not support item deletion
>>> SomeClass.Y = 4
>>> del SomeClass.X
>>> # X has now gone from the __dict__ and Y appeared
>>> SomeClass.__dict__.keys()
['__module__', 'Y', '__dict__', '__weakref__', '__doc__']
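The class creation steps described above can be approximated by hand. This is only a sketch (written for Python 3, where the proxy type is called mappingproxy rather than dictproxy), and the body string is a made-up stand-in for a class body:

```python
# A stand-in for the body of a class statement
body = "X = 3\ndef double(self):\n    return self.X * 2\n"

# Step 1: execute the body, collecting the names it assigns
namespace = {}
exec(body, {}, namespace)

# Steps 2 and 3: call the metaclass with the name, the bases and the
# namespace, then bind the resulting class object to a name
SomeClass = type('SomeClass', (object,), namespace)

# The class __dict__ is a read-only proxy, so item assignment fails
try:
    SomeClass.__dict__['Y'] = 4
except TypeError as exc:
    assignment_error = exc

# Normal attribute setting goes through type.__setattr__ and works
SomeClass.Y = 4
```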
An interesting example of assignment creating class members is what happens when you put a list comprehension inside a class body. An implementation detail of list comprehensions is that variables used in the list comprehension 'leak' into their surrounding scope. A list comprehension in a class body creates an unexpected class member:
>>> class SomeClass(object):
...     [foo for foo in (1, 2, 3)]
...
>>> SomeClass.foo
3
The same isn't true of generator expressions where the variable doesn't leak:
>>> class AnotherClass(object):
...     list(bar for bar in (1, 2, 3))
...
>>> AnotherClass.bar
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: type object 'AnotherClass' has no attribute 'bar'
The variable leaking from list comprehensions is a side-effect and should not be relied on.
Whilst the code in the class statement is being executed it creates a temporary namespace. Code can refer to names already assigned as if they were local variables.
>>> class SomeClass(object):
...     X = 3
...     b = [a * X for a in (1, 2, 3)]
...
>>> SomeClass.b
[3, 6, 9]
A common use for this is to create aliases, where you give the same member two or more names. In this example cost is an alias to the calculate_price method:
>>> class SomeClass(object):
...     def calculate_price(self, quantity):
...         return quantity * 10.0
...     cost = calculate_price
...
>>> instance = SomeClass()
>>> instance.calculate_price(20)
200.0
>>> instance.cost(20)
200.0
It is also the standard way of creating properties before Python 2.6:
>>> class SomeClass(object):
...     _value = None
...     def get(self):
...         return self._value
...     def set(self, value):
...         self._value = value
...     value = property(get, set)
...
The value property is created using the get and set functions from the scope that forms the class members.
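From Python 2.6 onwards the same property can be written with decorators instead, which avoids leaving the get and set helpers behind as extra class members. A short sketch of the equivalent class:

```python
class SomeClass(object):
    _value = None

    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, value):
        self._value = value

instance = SomeClass()
instance.value = 5
```

Both functions are bound to the name value in the class namespace; the setter decorator returns a new property that combines them.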
Unfortunately we have a problem with generator expressions. Generator expressions create their own scope, causing names to be looked up lexically and ignoring the temporary class scope.
>>> class AnotherClass(object):
...     X = 3
...     b = list(a * X for a in (1, 2, 3))
...
Traceback (most recent call last):
  File "<stdin>", line 3, in AnotherClass
  File "<stdin>", line 3, in <genexpr>
NameError: global name 'X' is not defined
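One possible workaround, sketched here rather than taken from the post, is to hand the class-level name to the new scope explicitly. Default argument values are evaluated in the class scope, so a small lambda can carry X across. In Python 3 list comprehensions get their own scope too, so the same trick applies to them:

```python
# The failing version, wrapped so that the error can be inspected
try:
    class Broken(object):
        X = 3
        b = list(a * X for a in (1, 2, 3))
except NameError as exc:
    failure = exc

class AnotherClass(object):
    X = 3
    # X=X is evaluated in the class scope; inside the lambda X is an
    # ordinary local that the generator expression can see
    b = (lambda X=X: list(a * X for a in (1, 2, 3)))()
```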
If you're interested in how metaclasses are involved in class creation then you should read: Metaclasses in five minutes. (Hopefully readable even for non-gurus.)
An interesting reference on why the class statement in Python contains executable code is this article by Guido van Rossum, the creator of Python: How Everything Became an Executable Statement.
Posted by Fuzzyman on 2010-01-10 13:49:13 | |
Categories: Python, Hacking Tags: language, classes, objects
Fun with Unicode, Latin-1 and a C1 Control Code
Unicode is a rabbit-warren of complexity; almost fractal in nature, the more you learn about it the more complexity you discover. Anyway, all that aside you can have great fun (i.e. pain) with fairly basic situations even if you are trying to do the right thing.
This particular problem was encountered by Stephan Mitt, one of my colleagues at Comsulting. I helped him find the solution, and with a bit of digging (and some help from #python-dev) worked out why it was happening.
We receive data from customers as CSV files that need importing into a web application. The CSV files are received in latin-1 encoding and we decode and then iterate over them to process a line at a time. Unfortunately the data from the customers included some \x85 characters, which were breaking the CSV parsing.
One of the problems with the latin-1 encoding is that it uses all 256 bytes, so it is never possible to detect badly encoded data. Arbitrary binary data will always successfully decode:
>>> data = ''.join(chr(x) for x in range(256))
>>> data.decode('latin-1')
u'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f...'
If you iterate over a standard file object in Python 2 (i.e. one that reads data as bytestrings) then you iterate over it a line at a time. This splits lines on carriage returns (\x0D) and line feeds (\x0A). If you're on Windows then the sequence \x0D\x0A (CRLF) signifies a new line. If you're trying to do-the-right-thing, and decode your data to Unicode before treating it as text, then you might use code a bit like the following to read it:
import codecs
handle = codecs.open(filename, 'r', encoding='latin-1')
for line in handle:
    ...
This was the cause of our problem. When decoding using latin-1, \x85 is decoded to u'\x85', which Unicode treats as a line break. So if your source data has \x85 embedded in it, and you are splitting on lines, where the lines break will differ depending on whether you are using byte-strings or Unicode strings:
>>> d = 'foo\x85bar'
>>> d.split()
['foo\x85bar']
>>> u = d.decode('latin-1')
>>> u
u'foo\x85bar'
>>> u.split()
[u'foo', u'bar']
This could still be a pitfall in Python 3, where all strings are Unicode, particularly if you are porting an application from Python 2 to Python 3. Suddenly your data will behave differently when you treat it as Unicode. The answer is to do the split manually, specifying which character to use as a line break.
The problem isn't restricted to \x85. The Unicode spec on newlines shows us why. \x85 is referred to by the acronym NEL (Next Line), which is a C1 Control Code described as: "Equivalent to CR+LF. Used to mark end-of-line on some IBM mainframes."
In fact NEL belongs to a general class of characters known as Paragraph Separators (Category B). This category includes the characters \x1C, \x1D, \x1E, \x0D, \x0A and \x85. Splitting on lines will split on any of these characters, which may not be what you expect. It certainly wasn't what we expected.
For us the solution was simple; we just strip out any occurrence of \x85 in the binary data before decoding.
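Here is a small sketch of both remedies in Python 3 syntax, where the same behaviour shows up with str.splitlines(). The sample bytes are invented for illustration:

```python
raw = b'first\x85still first\nsecond\n'
text = raw.decode('latin-1')

# Unicode line splitting treats \x85 (NEL) as a line boundary...
unicode_lines = text.splitlines()

# ...whereas byte-string splitting only knows about \r and \n
byte_lines = raw.splitlines()

# Remedy 1: split manually on the one character you mean
manual_lines = text.split('\n')

# Remedy 2: strip \x85 from the binary data before decoding
cleaned_lines = raw.replace(b'\x85', b'').decode('latin-1').splitlines()
```

Note that the manual split leaves a trailing empty string when the data ends with a newline, which splitlines() does not.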
Note
Marius Gedminas suggests that the data is probably encoded as Windows 1252 rather than Latin-1. He is probably right.
There are some interesting notes on Unicode line breaks in this Python bug report: What is an ASCII linebreak?.
Posted by Fuzzyman on 2010-01-07 12:42:27 | |
Categories: Python, Work, Hacking Tags: Unicode, latin-1, encoding
Python Surprises
In the last few days I've run into several things I didn't know about Python. Not necessarily bad or wrong, just new to me.
>>> object.__new__(int)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object.__new__(int) is not safe, use int.__new__()
The same happens for pretty much all the built-in types. I don't think you can achieve this effect from pure-Python code, which is why it is impossible (I think) to write a real singleton in pure-Python. From any singleton instance you can always do this:
object.__new__(type(the_singleton))
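A minimal sketch of the problem, using a deliberately naive Singleton class invented for this example. Going straight to object.__new__ sidesteps whatever the class does in its own __new__:

```python
class Singleton(object):
    _instance = None

    def __new__(cls):
        # Hand back the one shared instance, creating it on first use
        if cls._instance is None:
            cls._instance = object.__new__(cls)
        return cls._instance

the_singleton = Singleton()
same_again = Singleton()

# Bypass Singleton.__new__ entirely
impostor = object.__new__(type(the_singleton))
```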
Anyway, next surprise:
>>> class Meta(type):
...     __slots__ = ['foo']
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Error when calling the metaclass bases
    nonempty __slots__ not supported for subtype of 'type'
This was annoying at the time, but caused me to find a better way to achieve what I wanted anyway. These first two show that despite the 'grand-merger' of Python 2.2 you can't treat the built-in types exactly as if they were user-defined classes.
The next one I actually ran into a while back:
>>> @EventHandler[HtmlEventArgs]
  File "<stdin>", line 1
    @EventHandler[HtmlEventArgs]
                 ^
SyntaxError: invalid syntax
This one is annoying. In IronPython EventHandler[HtmlEventArgs] would return a typed event handler for wrapping a function with. Decorator syntax would be very convenient but the only valid syntax is a name followed by optional parentheses and arguments - not any arbitrary expression.
The relevant part of the grammar is:
decorator ::= "@" dotted_name ["(" [argument_list [","]] ")"] NEWLINE
This grammar not only prevents indexing but means you can't (for example) define lambda decorators. All it would take is a grammar change and these could work, no actual code would need to be written in support. The reason that Guido didn't allow it is that he didn't want people writing code like:
@(F((foo + bar / 3 )) / [x**2 for x in frobulator])
def function():
    ...
Guido did agree that the rules could be relaxed (here is the python-ideas thread where it was discussed), but then the language moratorium came into effect.
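Under the restricted grammar the usual workaround is to bind the expression to a name first and decorate with that name. A sketch with a made-up handlers dictionary standing in for something like EventHandler (Python 3.9 and later relaxed the grammar, so there the subscript can be used directly):

```python
def make_upper(func):
    # a trivial decorator used purely for illustration
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs).upper()
    return wrapper

handlers = {'upper': make_upper}

# '@handlers["upper"]' is not a dotted name, so bind it first
shout = handlers['upper']

@shout
def greet(name):
    return 'hello ' + name
```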
The final surprise was that default object equality comparison is implemented inside the Python runtime instead of there being a default implementation in object. In fact object() instances don't even have the equality / inequality methods (__eq__ / __ne__).
>>> object().__eq__(object())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'object' object has no attribute '__eq__'
However, if you look up __eq__ on the type, as you might if you were trying to delegate up to the default implementation that doesn't exist, then something weird happens:
>>> object.__eq__(object(), object())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: expected 1 arguments, got 2
>>> object.__eq__
<method-wrapper '__eq__' of type object at 0x141fc0>
When you look up __eq__ on object (the type rather than an instance) then you get the __eq__ method of its metaclass (type) bound to object which is an instance of type. As this is a bound method it only takes one argument and calling it with two arguments causes a TypeError.
In fact there is nothing special about __eq__ here, I just didn't realise that member resolution on types would check the metaclass after checking the base classes:
>>> class Meta(type):
...     X = 3
...
>>> class Something(object):
...     __metaclass__ = Meta
...
>>> Something.X # from the metaclass
3
>>> Something.X = 4 # set on the type
>>> Meta.X
3
>>> class SomethingElse(Something): pass
...
>>> SomethingElse.X # fetched from base class not the metaclass
4
Posted by Fuzzyman on 2010-01-04 12:16:12 | |
Categories: Hacking, Python Tags: decorators, grammar, singletons
Mocking Magic Methods and Preserving Function Signatures Whilst Mocking
So, I'm most of the way through one blog entry, my tax return is due, I have a PyCon talk to write and I have a release of ConfigObj [1] just waiting for me to finish updating the docs. Naturally then I should mess around implementing new features for Mock.
These particular features were inspired by an email from Mock user Juho Vepsalainen who had a particular problem with Mock. In case you aren't familiar with it, Mock is a simple mocking library for unit testing. Mock makes creating mock objects, and patching out implementations with mocks at runtime, trivially easy.
I've spent a chunk of time today implementing a module that extends Mock to add new features. Eventually they will become part of Mock itself, but that would require a new release and tedious things like writing documentation:
Note
I've already improved the code in extendmock and merged it into the main mock module. No need for a special MagicMock class any more. You can use mock.py from subversion or wait for the release of version 0.7.
To implement a lot of functionality (mocking any class and recording how they are used), mocks are instances of the Mock class. This can be a problem for code that uses introspection to determine if something is a function or not, or introspects the function signature. If you mock a function or method it will be replaced with a callable object with the signature (*args, **kwargs). This also means that code which is called incorrectly won't raise an error, you will only catch this in your tests if you specifically check how the object is called (which you usually will because that's the point of mocking it out - but still).
A solution to all these problems is the mocksignature function. This takes a function (or method) and a mock object. It creates a wrapper function with the same signature as the function you pass in. When called this wrapper function calls the mock, so instead of directly patching a mock to replace a function or method you use the function returned by mocksignature. Code that introspects the function you are patching out will still work. Here's an example:
from mock import Mock, patch
from extendmock import mocksignature
from some_module import some_function
mock = Mock()
mock_function = mocksignature(some_function, mock)
@patch('some_module.some_function', mock_function)
def test():
    from some_module import some_function
    some_function('foo', 'bar', 'baz')
test()
mock.assert_called_with('foo', 'bar', 'baz')
To make it more convenient to use I will build support for mocksignature into the patch decorator.
You can also use mocksignature on instance methods:
from mock import Mock
from extendmock import mocksignature
class Something(object):
    def method(self, a, b):
        pass
s = Something()
mock = Mock()
mock_method = mocksignature(s.method, mock)
s.method = mock_method
s.method(3, 4)
mock.assert_called_with(3, 4)
A limitation of mocksignature is that all arguments are passed to the underlying mock by position. If there are default values they will be explicitly passed in. Keyword arguments are only collected if the function uses **kwargs. See the tests for more details. The important fact is that the function signature is unchanged:
import inspect
from extendmock import mocksignature
from mock import Mock
def f(a, b, c='foo', **kwargs):
    pass
mock = Mock()
new_function = mocksignature(f, mock)
# compare like with like: getargspec on both the original and the wrapper
assert inspect.getargspec(f) == inspect.getargspec(new_function)
The limitation on keyword arguments sounds confusing (certainly the way I expressed it above), so it's easier to demonstrate in practice with the call_args attribute:
>>> from mock import Mock
>>> from extendmock import mocksignature
>>>
>>> mock = Mock()
>>>
>>> def f(a=None): pass
...
>>> f2 = mocksignature(f, mock)
>>> f2()
<mock.Mock object at 0x441d70>
>>> mock.call_args
((None,), {})
>>> mock.assert_called_with(None)
>>>
Even though we passed no arguments in, the argument with the default value (a) is called as if None was passed in explicitly. This affects the way you use assert_called_with when using Mock and mocksignature in concert. You can still use mocksignature with functions that collect args with *args and **kwargs:
>>> from extendmock import mocksignature
>>> from mock import Mock
>>>
>>> def f(*args, **kw): pass
...
>>> mock = Mock()
>>> mock.return_value = 3
>>> f2 = mocksignature(f, mock)
>>> f2(1, 'a', None, foo='fish', bar=1.0)
3
>>> mock.call_args
((1, 'a', None), {'foo': 'fish', 'bar': 1.0})
>>>
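To give a feel for how a signature-preserving wrapper like this can behave, here is a rough sketch in Python 3 using inspect.signature and unittest.mock. It is not the extendmock implementation, and mocksignature_sketch is a made-up name. Like the behaviour described above, it forwards arguments by position and passes defaults on explicitly:

```python
import functools
import inspect
from unittest.mock import Mock

def mocksignature_sketch(func, mock):
    signature = inspect.signature(func)

    @functools.wraps(func)  # inspect.signature follows __wrapped__ to func
    def wrapper(*args, **kwargs):
        # bind() raises TypeError if the call doesn't fit func's signature
        bound = signature.bind(*args, **kwargs)
        bound.apply_defaults()  # defaults are handed on explicitly
        return mock(*bound.args, **bound.kwargs)
    return wrapper

def f(a, b, c='foo'):
    pass

mock = Mock(return_value=3)
f2 = mocksignature_sketch(f, mock)
result = f2(1, b=2)

# b was passed by keyword and c not at all, yet the mock sees positions
mock.assert_called_with(1, 2, 'foo')
```

Calling f2() with no arguments raises a TypeError before the mock is ever touched, which is the point of keeping the signature.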
Another problem with Mock is that it currently doesn't support mocking out the Python protocol methods (like __len__, __getitem__ and so on). extendmock contains a new class that adds magic support to Mock: MagicMock. Here's an example of how you use it:
from extendmock import MagicMock
mock = MagicMock()
_dict = {}
def getitem(self, name):
    return _dict[name]
def setitem(self, name, value):
    _dict[name] = value
def delitem(self, name):
    del _dict[name]
mock.__setitem__ = setitem
mock.__getitem__ = getitem
mock.__delitem__ = delitem
# the assertions below run inside a unittest.TestCase method (hence self)
self.assertRaises(KeyError, lambda: mock['foo'])
mock['foo'] = 'bar'
self.assertEquals(_dict, {'foo': 'bar'})
self.assertEquals(mock['foo'], 'bar')
del mock['foo']
self.assertEquals(_dict, {})
You mock magic methods by assigning a function (or a mock object) to the mock instance. Magic methods are looked up on the object class by the Python interpreter. MagicMock has all the magic methods implemented in a way that checks for corresponding instance variables, with sensible behaviour if the instance variable doesn't exist. However, the presence of these magic methods on the class could break some duck-typing (if it checks for the presence or absence of these methods), so I would rather have MagicMock be a separate class instead of integrating this into the Mock class. On the other hand there is no reason why I can't move MagicMock into the mock module next time I do a release.
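The approach described above, with magic methods on the class that defer to instance variables, can be sketched like this. DelegatingMagic is a made-up name and only __getitem__ is shown; it is not the extendmock code:

```python
class DelegatingMagic(object):
    def __getitem__(self, key):
        # Look for a per-instance implementation in the instance __dict__
        impl = self.__dict__.get('__getitem__')
        if impl is None:
            raise TypeError('__getitem__ has not been configured')
        return impl(self, key)   # self is passed in explicitly

configured = DelegatingMagic()
configured.__getitem__ = lambda self, key: key * 2

unconfigured = DelegatingMagic()

# The duck-typing hazard: the method is always present on the class
looks_subscriptable = hasattr(type(unconfigured), '__getitem__')
```

The last line shows why this can confuse code that checks for the presence of a protocol method: the class always has one, whether or not the instance has been configured.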
For all magic methods you mock in this way you have to include self in the function signature. I might change this at a future date, so be warned this is an experimental implementation. Also note that calls to mocked magic methods aren't recorded in method_calls and don't use object wrapping - all things that may change in the future.
One reason that some users have been requesting magic method support is for mocking context managers. Unfortunately __enter__ and __exit__ are looked up differently from the other magic methods in Python 2.5 and 2.6 (they aren't looked up on the class first but on the instance first like normal members). This makes the following technique still the correct way to mock the with statement.
Note
This is no longer true in the magic method support now in trunk. You mock __enter__ and __exit__ in exactly the same way as you do other magic methods.
You can also mock magic methods by assigning a Mock instance to the method you are mocking. For example:
>>> from mock import Mock
>>> mock = Mock()
>>> mock.__getitem__ = Mock()
>>> mock.__getitem__.return_value = 'bar'
>>> mock['foo']
'bar'
>>> mock.__getitem__.assert_called_with('foo')
Mocking the with statement:
mock = Mock()
mock.__enter__ = Mock()
mock.__exit__ = Mock()
mock.__exit__.return_value = False
with mock as m:
    self.assertEqual(m, mock.__enter__.return_value)
mock.__enter__.assert_called_with()
mock.__exit__.assert_called_with(None, None, None)
[1] Mike Driscoll has just written a very good short tutorial for ConfigObj by the way: A brief ConfigObj Tutorial.
Posted by Fuzzyman on 2010-01-03 00:35:50 | |
Categories: Python, Projects, Hacking Tags: mock, mocking, testing, magic methods
Django json support
As I mentioned in my last entry I'm now working on a Silverlight application with Django on the backend. This means that we're using Django to serve json to the Silverlight application, so whilst we're using the Django ORM, url routing and authentication we aren't using its templating.
The data model is 'unusual' but makes sense for the app. We've only implemented the first user story, which uses a subset of the data, but you can already start to see the shape of it. Here's a simplified approximation of the data from the point of view of the Django model classes:
from django.db import models
class CompanyType(models.Model):
    type = models.CharField(max_length=255)
class Company(models.Model):
    name = models.CharField(max_length=255)
    company_type = models.ForeignKey(CompanyType)
class Address(models.Model):
    street = models.CharField(max_length=255)
    city = models.CharField(max_length=255)
    postcode = models.CharField(max_length=255)
    company = models.ForeignKey(Company)
class Individual(models.Model):
    first_name = models.CharField(max_length=255)
    last_name = models.CharField(max_length=255)
    address = models.ForeignKey(Address)
The reason for this slightly non-intuitive setup is that a company may have several addresses. At every address there can be several contacts.
In our view we have a companies function that needs to return a list of all the companies. If we use the built-in json serializer then for the company_type field it just puts an id number into the json. If we wanted the actual company_type then we would have to make an additional query per company.
Additionally, for this view we want to retrieve all of the addresses associated with a company and every individual associated with each address.
There is a project called wadofstuff that includes a replacement serializer. It's very easy to use, just specify the following in settings.py:
SERIALIZATION_MODULES = {
    #'json': 'djangoserializers.json'
    'json': 'wadofstuff.django.serializers.json'
}
When we import and call the Django json serializer we can now specify relations for the serializer to follow and include in the json:
from django.core import serializers
from project.app.models import Company
from django.http import HttpResponse
def company(request):
    companies = Company.objects.all()
    # serialize() takes the format name as its first argument
    json = serializers.serialize('json', companies, relations=('company_type',))
    return HttpResponse(json, mimetype="application/json")
This doesn't solve the problem of how we include the addresses and individuals information. One option would be to generate three separate lists, include them all in the json and let the client sort them out. The wadofstuff serializer does let us specify a set of extra fields (extras). Despite what the documentation says, in practice I had to implement these as methods on the model objects that could only return a string. This means I couldn't use it to return a list of model objects like I wanted.
Maybe I'm missing something obvious, which is entirely likely as I'm new to Django, but it doesn't seem like this use case is that unusual. I'm surprised that Django has no infrastructure at all to support it.
After a bit of hunting I discovered the awesome django-piston project. We don't need an XML or YAML API, nor streaming or throttling, but it includes an awesome json serializer that I 'borrowed' and hacked around so that I could use it on its own. My final code for associating each company with the related addresses and individuals looks like this:
from project.app.models import Company
from project.modules.emitter import Emitter
from django.http import HttpResponse
def companies(request):
    companies = Company.objects.all()
    for company in companies:
        addresses = company.address_set.all()
        company.addresses = addresses
        for address in addresses:
            individuals = address.individual_set.all()
            address.individuals = individuals
    emitter = Emitter(fields=('company', 'company_type', 'address', 'addresses', 'individuals'))
    thejson = emitter.render(companies)
    return HttpResponse(thejson, mimetype="application/json")
This follows the 'company', 'company_type', and 'address' relations on model objects it serializes and also handles the addresses and individuals fields. What I get back on the Silverlight end is json representing a list of all companies. Each company has an 'addresses' field with a list of all addresses for that company and each address has an 'individuals' field. This is exactly what we need.
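To picture the nesting, here is a plain Python sketch of the general shape described above, built with the standard json module. All the values are invented, and the exact field layout that the emitter produces is an assumption, not something taken from its output:

```python
import json

# Hypothetical data: one company, one address, one individual
companies = [
    {
        'name': 'Example Ltd',
        'company_type': {'type': 'Supplier'},
        'addresses': [
            {
                'street': '1 Example Street',
                'city': 'Exampleton',
                'postcode': 'EX1 1EX',
                'individuals': [
                    {'first_name': 'Ada', 'last_name': 'Example'},
                ],
            },
        ],
    },
]

thejson = json.dumps(companies, indent=2)
```

The client can then walk from each company to its addresses and from each address to its individuals without making further requests.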
Note
In the comments Doug Napoleone suggests using select_related() rather than all() as it should be more efficient given the way we are using all the relations.
Doug also suggests setting 'related_name' in the model fields which would give me nicer names than address_set and individual_set. If I taught the serializer how to handle these names then I could move the loop from my view to the serializer; but the loop would still be there, so no efficiency gain just nicer looking code.
Posted by Fuzzyman on 2009-11-16 01:54:10 | |
Categories: Python, Work, Hacking Tags: django, json, database
The Python Object Model Revisited (data descriptors)
A few weeks ago I demonstrated the complexity of the Python object model by fetching docstrings from objects. A while after posting it I thought of a bug - or at least a way in which it could return the wrong result when looking up an attribute on an object. It will probably come as no surprise that this is due to the descriptor protocol.
Descriptors are special types of objects that have __get__ and/or __set__ and __delete__ methods, and have special behaviour when fetched, set or deleted as object attributes. They are how methods, class methods, static methods, properties and __slots__ are implemented in Python.
Descriptors that have both __get__ and __set__ are called data descriptors (properties are the canonical example), descriptors with only __get__ are non-data descriptors (methods being the canonical example). Data descriptors have interesting behaviour when they are on a class which has the same member in the instance dictionary.
Instance members are stored in the __dict__ attribute of the object. Normally if this instance dictionary has a member then fetching that member will pull it out of the dictionary. The exception is that if the class has a data-descriptor with the same name then that will be invoked instead of the object in the instance dictionary. This is easy to demonstrate:
>>> class A(object):
...     @property
...     def a(self):
...         return 'property'
...
>>> a = A()
>>> a.__dict__['a'] = 'attribute'
>>> a.a
'property'
So a data-descriptor on the class will override a member with the same name on the instance - but the 18 lines of code I wrote before for fetching docstrings from attributes will always look on the instance first.
The same is true for inherited data-descriptors:
>>> class B(A): pass
...
>>> b = B()
>>> b.__dict__['a'] = 'attribute'
>>> b.a
'property'
Non-data descriptors don't override instance attributes and data-descriptors on a base class don't override normal class attributes on a subclass.
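Both halves of that statement can be checked with a pair of hand-written descriptors. This is a small sketch with made-up class names:

```python
class NonDataDescriptor(object):
    # only __get__: a non-data descriptor (like a function)
    def __get__(self, instance, owner):
        return 'from non-data descriptor'

class DataDescriptor(object):
    # __get__ and __set__: a data descriptor (like a property)
    def __get__(self, instance, owner):
        return 'from data descriptor'
    def __set__(self, instance, value):
        raise AttributeError('read-only')

class A(object):
    x = NonDataDescriptor()
    y = DataDescriptor()

class B(A):
    y = 'plain class attribute'   # shadows the inherited data descriptor

a = A()
a.__dict__['x'] = 'instance attribute'
a.__dict__['y'] = 'instance attribute'
b = B()
```

Here a.x comes from the instance dictionary, a.y still comes from the data descriptor, and b.y is the plain attribute defined on the subclass.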
To handle this we need to check both the instance and walk the inheritance hierarchy. If we find the member we are looking for in both then we check the member from the class for a __set__ method. If the member from the class (or one of its base classes) has a __set__ member then we return that - otherwise we return the member from the instance.
Our modified full code that takes this into account has grown to 22 lines and now looks like:
import types
import inspect
def get_doc(obj, member):
    found = []
    # the instance dictionary (missing for classes that use __slots__)
    if hasattr(obj, '__dict__') and member in obj.__dict__:
        found.append(obj.__dict__[member])
    if isinstance(obj, (type, types.ClassType)):
        search_order = inspect.getmro(obj)
    else:
        search_order = inspect.getmro(obj.__class__)
    for entry in search_order:
        if member in entry.__dict__:
            # a data descriptor on the class beats the instance dictionary
            if hasattr(entry.__dict__[member], '__set__'):
                return entry.__dict__[member].__doc__
            found.append(entry.__dict__[member])
    # first match wins; None if the member wasn't found anywhere
    return found[0].__doc__ if found else None
def get_docstrings(obj):
    try:
        members = dir(obj)
    except Exception:
        members = []
    return [(member, get_doc(obj, member)) for member in members]
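As a quick check of the data-descriptor rule, here is a hedged sketch that ports get_doc to Python 3 and runs it on a class shaped like A from earlier. The port drops the types.ClassType test (Python 3 has no old style classes), returns None for members it can't find, and the docstring on the property is my addition. It also compares the result with inspect.getattr_static, which Python 3.2 and later provide for fetching members without triggering descriptors:

```python
import inspect


def get_doc(obj, member):
    found = []
    if hasattr(obj, '__dict__') and member in obj.__dict__:
        found.append(obj.__dict__[member])
    klass = obj if isinstance(obj, type) else obj.__class__
    for entry in inspect.getmro(klass):
        if member in entry.__dict__:
            # a data descriptor on the class beats the instance dictionary
            if hasattr(entry.__dict__[member], '__set__'):
                return entry.__dict__[member].__doc__
            found.append(entry.__dict__[member])
    return found[0].__doc__ if found else None


class A(object):
    @property
    def a(self):
        "docstring of the property"
        return 'property'


a = A()
a.__dict__['a'] = 'attribute'

doc = get_doc(a, 'a')
static_doc = inspect.getattr_static(a, 'a').__doc__
missing = get_doc(a, 'no_such_member')
```

Both doc and static_doc come from the property on the class rather than from the string sitting in the instance dictionary.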
Note
In practice there is another exception that we haven't handled here. Although you can override methods with instance attributes (very useful for monkey patching methods for test purposes) you can't do this with the Python protocol methods. These are the 'magic methods' whose names begin and end with double underscores. When invoked by the Python interpreter they are looked up directly on the class and not on the instance (however if you look them up directly - e.g. x.__repr__ - normal attribute lookup rules apply).
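A small sketch of that difference, written for Python 3 (new style classes in Python 2 behave the same way). The class name Thing is made up for illustration:

```python
class Thing(object):
    def __repr__(self):
        return 'from the class'


thing = Thing()
# monkey patch the magic method on the instance only
thing.__repr__ = lambda: 'from the instance'

implicit = repr(thing)        # the interpreter looks on the class
explicit = thing.__repr__()   # normal attribute lookup finds the instance member
```

implicit is 'from the class' while explicit is 'from the instance'.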
There is a corner case (that I alluded to in my previous post): classes can define __slots__ and create a dummy __dict__ member. If this member isn't a dictionary then our code will barf horribly - but really this is such an evil corner case that I'm not going to worry about it.
I have seen one use case for __slots__ in combination with a fake __dict__ member: proxying attribute access. This is a part of the werkzeug web framework - the LocalProxy class defines __dict__ as a property which returns the __dict__ member of the object it is proxying...
Posted by Fuzzyman on 2009-06-22 23:08:08
Categories: Python, Hacking Tags: descriptors, object model
Parametrized Tests and unittest
Yet another blog entry on unittest; this is the last one in my list so I'm not planning any more for a while. Something that both nose and py.test provide that unittest (the [1] Python standard library testing framework) doesn't is a built-in mechanism for writing parametrized tests. The technique that both nose and py.test use (currently anyway) is to allow your test methods to be generators that yield a series of tests. The testing framework then runs all these tests for you.
Whenever I've needed to run a series of similar tests with different input parameters I've always used a simple loop; something like:
def testSomething(self):
    for x in range(100):
        for y in range(100):
            self.assertSomethingForXandY(x, y)
The problem with this approach is that as soon as you have a failure for any of the x, y value combinations the test will stop running. In some circumstances it would be much better for all the tests to run and have all the failures reported instead of just the first.
With nose, you could write the above test like this:
def testSomething(self):
    for x in range(100):
        for y in range(100):
            yield self.assertSomethingForXandY, x, y
Nose would detect that the test is a generator and collect the functions (along with their arguments) that it yields and run them independently. The disadvantage of this approach is that you can't know up front how many tests you have (and indeed it could change every time you run the tests) and neither are they isolated from each other (they share the fixture).
Although unittest doesn't include an equivalent it is easy to achieve the same thing and there are several possible approaches. The two I've come up with, prompted by another discussion on the Testing in Python mailing list and with Brandon Craig Rhodes, are available as params.py from my unittest-ext sandbox project (where I tinker with unittest related stuff from time to time). (Note - after showing you two possible approaches I'll show you some better ways that other people have found for solving the same problem.)
The first uses a metaclass in concert with a decorator. You decorate methods with a list of dictionaries - for every dictionary in the list the method will be called with the parameters from the dictionary. (It'll be easier to understand when I show you some code using it.) The metaclass examines all decorated methods at class creation time and adds new test methods to the class.
import unittest
from types import FunctionType
class Paramaterizer(type):
    def __new__(meta, class_name, bases, attrs):
        # attrs.items() is a list copy in Python 2, so adding to attrs below is safe
        for name, item in attrs.items():
            if not isinstance(item, FunctionType):
                continue
            params = getattr(item, 'params', None)
            if params is None:
                continue
            # generate one test method per dictionary of parameters
            for index, args in enumerate(params):
                # default arguments bind the current args and name to each test
                def test(self, args=args, name=name):
                    assertMethod = getattr(self, name)
                    assertMethod(**args)
                test.__doc__ = """%s with args: %s""" % (name, args)
                test_name = 'test_%s_%s' % (name, index + 1)
                test.__name__ = test_name
                if test_name in attrs:
                    raise Exception('Test class %s already has a method called: %s' %
                                    (class_name, test_name))
                attrs[test_name] = test
        return type.__new__(meta, class_name, bases, attrs)
def with_params(params):
    def inner(func):
        func.params = params
        return func
    return inner
class TestCaseWithParams(unittest.TestCase):
    __metaclass__ = Paramaterizer
You don't need to use the metaclass directly, instead subclass TestCaseWithParams:
class Test(TestCaseWithParams):
    @with_params([dict(a=1, b=2), dict(a=3, b=3), dict(a=5, b=4)])
    def assertEqualWithParams(self, a, b):
        self.assertEqual(a, b)
    @with_params([dict(a=1, b=0), dict(a=3, b=2)])
    def assertZeroDivisionWithParams(self, a, b):
        self.assertRaises(ZeroDivisionError, lambda: a/b)
The disadvantage of this approach is that you have to know (or calculate) all of the parameters at class creation time instead of when the test runs. The advantage is that the number of tests is known ahead of running the tests - so countTestCases on the TestSuite works as normal and each failure is recorded individually.
Another approach is to use the same generator technique as nose / py.test with a decorator that runs all the tests yielded by the generator.
import sys
import unittest
def test_generator(func):
    def inner(self):
        failures = []
        errors = []
        for test, args in func(self):
            try:
                test(*args)
            except self.failureException as e:
                failures.append((test.__name__, args, e))
            except KeyboardInterrupt:
                raise
            except:
                # using sys.exc_info means we also catch string exceptions
                e = sys.exc_info()[1]
                errors.append((test.__name__, args, e))
        # nothing failed or errored, so the test as a whole passes
        if not (failures or errors):
            return
        msg = '\n'.join('%s%s: %s: %s' % (name, args, e.__class__.__name__, e) for (name, args, e) in failures + errors)
        if errors:
            raise Exception(msg)
        raise self.failureException(msg)
    return inner
class Test2(unittest.TestCase):
    @test_generator
    def testSomething(self):
        for a, b in ((1, 2), (3, 3), (5, 4)):
            yield self.assertEqual, (a, b)
        def raises():
            raise Exception('phooey')
        yield raises, ()
This is a bit less 'heavy' than using a metaclass. Decorated tests are all run to completion. If any test fails or errors then an appropriate failure is raised - with the message listing all the failures. It has the advantage of allowing tests to be created at test execution time, but the disadvantage of all failures only counting as a single failure. The total number of tests counted will only count generative tests as a single test. If you run the code above you'll see how errors are reported and it is ok (could do better - must try harder). It is also easy to use with any unittest based test framework.
Of course other people have come up with better ideas - which I may evaluate for integrating into unittest. They do still suffer from the problem of non-deterministic number of tests (breaking the countTestCases part of the unittest protocol) but this is unavoidable with this feature.
Konrad Delong posted one solution to his blog: Reporting assertions as a way for test parameterization. The code is here. He uses a decorator to collect the failures / errors and modifies TestCase.run to be aware of them. I like this technique.
Robert Collins has a different solution, which at the heart uses a similar technique but is more general and powerful. This is his testscenarios project. (Every time I try to actually find the code on a launchpad project I go round in circles for a while first. Anyway - it's here.) The project is described thusly:
testscenarios provides clean dependency injection for python unittest style tests. This can be used for interface testing (testing many implementations via a single test suite) or for classic dependency injection (provide tests with dependencies externally to the test code itself, allowing easy testing in different situations).
Instead of just individual tests it allows you to parameterize whole test cases - so you can do 'interface' testing where you swap out the backend implementation and check that all tests pass for various different backends.
The basic nose / py.test technique for generator tests is a dirty hack. They introspect the test method code objects to see which of them are generators. Holger Krekel, core developer of py.test, also thinks that they offer little real advantage over loops and is looking to replace them in py.test with a more powerful system. This uses pytest_generate_tests and he describes it in: Parametrizing Python tests, generalized.
This new system is more powerful, but it seems to make the simple cases more difficult. If Holger is right in that a generalized mechanism that only caters for the simple cases doesn't really have much advantage then this new system may indeed be a winner.
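The introspection mentioned above (working out which test methods are generators) can be sketched with the standard library. I'm not claiming this is exactly how nose or py.test do it; it only shows that the information lives on the code object, which is what inspect.isgeneratorfunction looks at. The class and method names are made up, and the code is written for Python 3:

```python
import inspect


class Example(object):
    def test_plain(self):
        assert True

    def test_generative(self):
        for x in range(3):
            yield abs, x


plain = inspect.isgeneratorfunction(Example.test_plain)
generative = inspect.isgeneratorfunction(Example.test_generative)

# the same answer, read straight from the flags on the code object
flags = Example.test_generative.__code__.co_flags
flag_set = bool(flags & inspect.CO_GENERATOR)
```

Only test_generative is reported as a generator, and nothing has to be called to find that out.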
[1] Yes Zeth, the testing framework. doctest is for testing documentation and makes an awful unit testing tool, especially for test first as practised in test driven development (TDD). Of course not everyone shares my opinions on this matter.
Posted by Fuzzyman on 2009-06-15 15:38:42
Categories: Python, Hacking Tags: testing, generators, unittest, nose, py.test
Fetching docstrings from objects: easy, right? (A painful exploration of the Python object model)
Extraordinarily simple introspection is one of the features that makes dynamic languages like Python such a joy to work with. If you have code dealing with arbitrary objects and you want to discover and present information about those objects it is marvellously simple. Alongside this Python has docstrings; a simple way to associate usage documentation with just about any object. This is invaluable when working with the interactive interpreter; you can import and instantiate classes and get documentation on live objects by calling help(instance.method).
These features, listing all members, fetching them by name and looking at the docstring, can also be used in live systems or in tools that automatically build documentation for libraries.
So just how easy is it to fetch the docstring from an arbitrary member on an arbitrary object? If you are going to handle all the corner cases of the Python type system then the answer is harder than you might think. As I've ended up writing code that does this (more or less) twice in the last week I'll walk you through the nefarious pitfalls and let's see how many lines of code we end up with.
The arbitrary goal I've set for this task is to write a function that when given an object returns a list of all members along with their docstrings. Along the way we'll learn far more about the Python object model than we wanted to know.
A naive first attempt looks like this:
def get_docstrings(obj):
    members = dir(obj)
    return [(member, getattr(obj, member).__doc__) for member in members]
dir(obj) returns a list of all the member names of obj as strings. getattr(obj, member).__doc__ fetches an individual attribute by name and gets its associated docstring.
In IronPython I can make this crash very easily:
>>> from System import ArgIterator
>>> dir(ArgIterator)
Traceback (most recent call last):
  ...
SystemError: An attempt was made to load a program with an incorrect format. (Exception from HRESULT: 0x8007000B)
Ha, bad IronPython, I can already hear you chuckling. Not so fast. Since Python 2.6 and the introduction of the __dir__ protocol method we can easily suffer the same problem:
class UnDesirable(object):
    def __dir__(self):
        raise Exception("Don't dir me bro")
>>> dir(UnDesirable())
Traceback (most recent call last):
  ...
Exception: Don't dir me bro
Ok, so we can catch this error and return an empty list.
def get_docstrings(obj):
    try:
        members = dir(obj)
    except Exception:
        members = []
    return [(member, getattr(obj, member).__doc__) for member in members]
What's next? What happens when we run this function on an instance of a class with properties?
class Proper(object):
    @property
    def something(self):
        "Here be the docstring"
>>> get_docstrings(Proper())
[... ('something', None)]
When our code called getattr(obj, 'something') it actually triggered the property instead of fetching it as an object for us to look at. Introspection that causes code execution is a bad thing (tm) - it could raise exceptions or have unpleasant side effects. More to the point we don't get the information we want.
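A small sketch makes the side effects visible. The calls list and the fragile property are my additions for illustration, and the code is written for Python 3:

```python
calls = []


class Proper(object):
    @property
    def something(self):
        "Here be the docstring"
        calls.append('property body ran')

    @property
    def fragile(self):
        "Docstring we never get to see"
        raise RuntimeError('introspection ran my code')


instance = Proper()

# getattr runs the property and hands back its return value (None here)
value = getattr(instance, 'something')

try:
    getattr(instance, 'fragile').__doc__
    outcome = 'no error'
except RuntimeError as error:
    outcome = str(error)
```

After this, calls contains one entry even though all we wanted was a docstring, and asking for the docstring of fragile raised an exception instead.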
The property descriptor for 'something' lives on the class:
>>> getattr(Proper, 'something')
<property object at 0x7e600>
>>> getattr(Proper, 'something').__doc__
'Here be the docstring'
The problem is that although this works for things like methods and properties it doesn't work for instance attributes. Even worse, in IronPython this is no better than our previous solution because static properties are very common - fetching an attribute from a class can also cause arbitrary code execution. Haha, bad IronPython again. Well, thanks to the wonders of the descriptor protocol the same thing can happen in Python:
class descriptor(object):
    def __init__(self, function):
        self.func = function
        self.__doc__ = function.__doc__
    def __get__(self, instance, owner):
        raise Exception("No one expects the descriptor protocol")
class Harumph(object):
    @descriptor
    def something(self):
        "I still haven't found what I'm looking for"
>>> getattr(Harumph, 'something')
Traceback (most recent call last):
  ...
Exception: No one expects the descriptor protocol
The reason it is so tempting to use getattr is that it neatly handles finding exactly where attributes live. Once we have decided we shouldn't use getattr we have to start understanding the way Python looks up attributes in more detail than we really wanted.
We've already seen that although instance attributes live on the instance, to safely find things like properties and methods we need to look on the class. Python objects have __dict__ attributes that act as namespaces for objects - they map the member names to the members. We could rewrite our function to use it, first checking the instance and then the class:
def get_docstrings(obj):
    try:
        members = dir(obj)
    except Exception:
        members = []
    return [(member, get_doc(obj, member)) for member in members]
def get_doc(obj, member):
    if member in obj.__dict__:
        return obj.__dict__[member].__doc__
    return type(obj).__dict__[member].__doc__
This works fine for some situations but is so full of holes it is hard to know where to start. Let's start with classes that use slots. Classes that define a __slots__ member (list of strings) don't have an instance dictionary (a memory optimisation) but have reserved slots for the instance members specified in the class. Attempting to access instance.__dict__ for objects like this will die with an attribute error.
Classes using slots will however have an entry in the class dictionary for each slot:
class Slotted(object):
    __slots__ = ['x']
>>> slotty = Slotted()
>>> slotty.__dict__
Traceback (most recent call last):
  ...
AttributeError: 'Slotted' object has no attribute '__dict__'
>>> dir(slotty)
[... 'x']
>>> slotty.x
Traceback (most recent call last):
  ...
AttributeError: x
>>> type(slotty).__dict__['x']
<member 'x' of 'Slotted' objects>
These member slots don't have a docstring but let's assume this is sufficient for our use case and handle it:
def get_docstrings(obj):
    try:
        members = dir(obj)
    except Exception:
        members = []
    return [(member, get_doc(obj, member)) for member in members]
def get_doc(obj, member):
    if hasattr(obj, '__dict__') and member in obj.__dict__:
        return obj.__dict__[member].__doc__
    return type(obj).__dict__[member].__doc__
This gets us past one hurdle. Let's try our function on the Proper class instead of an instance:
>>> get_docstrings(Proper)
Traceback (most recent call last):
  ...
KeyError: '__class__'
Slightly obscure, but it turns out that the __class__ attribute of an object (a pointer to its type) is another descriptor and is inherited from object. We don't find it in the instance dictionary or the class dictionary. In fact our code would have the same problem with any attribute inherited from a base class.
>>> object.__dict__['__class__']
<attribute '__class__' of 'object' objects>
>>> descriptor = object.__dict__['__class__']
>>> descriptor.__get__(Proper(), Proper)
<class '__main__.Proper'>
To find an arbitrary attribute you need to not only look in the instance dictionary and the class dictionary, but also walk the inheritance hierarchy if the attribute lives on any of the base classes. And what about multiple inheritance, oh god. Thankfully this isn't as difficult a problem as it sounds. Python classes define an __mro__ attribute (the Method resolution order) which even in the face of multiple inheritance returns a list of base classes you need to search - and in the right order too (the order is important in case the method is defined on a base class and overridden in a sub-class). We can rewrite our code to use this:
def get_doc(obj, member):
    if hasattr(obj, '__dict__') and member in obj.__dict__:
        return obj.__dict__[member].__doc__
    for entry in type(obj).__mro__:
        if member in entry.__dict__:
            return entry.__dict__[member].__doc__
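To see the order that __mro__ produces under multiple inheritance, here is a small sketch with a made-up diamond hierarchy (A, B, C and D are illustrative names; the code is written for Python 3):

```python
class A(object):
    def method(self):
        "defined on A"


class B(A):
    def method(self):
        "overridden on B"


class C(A):
    pass


class D(C, B):
    pass


mro_names = [klass.__name__ for klass in D.__mro__]

# walk the MRO and stop at the first class whose own dictionary has the member
first_owner = next(klass for klass in D.__mro__ if 'method' in klass.__dict__)
docstring = first_owner.__dict__['method'].__doc__
```

mro_names is ['D', 'C', 'B', 'A', 'object']. Even though C is listed first and inherits method from A, B comes before A in the search order, so the override on B is the one we find.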
Let's try it on a subclass of Proper:
class SubProper(Proper):
    pass
>>> get_docstrings(SubProper())
[... ('something', 'Here be the docstring')]
>>> get_docstrings(SubProper)
[... ('something', None)]
This works fine for instances but fails when we use it on the class object itself. It turns out that we need to handle instances differently to classes. For instances this code is perfectly correct: for entry in type(obj).__mro__. If we pass in an instance we need to check its class object and then all base classes. For class objects its type is its metaclass. This controls some of the behaviour of the class but isn't where its members are defined. We need to go straight to the base classes. This should do the trick:
def get_doc(obj, member):
    if hasattr(obj, '__dict__') and member in obj.__dict__:
        return obj.__dict__[member].__doc__
    if isinstance(obj, type):
        search_order = obj.__mro__
    else:
        search_order = type(obj).__mro__
    for entry in search_order:
        if member in entry.__dict__:
            return entry.__dict__[member].__doc__
>>> get_docstrings(SubProper)
[... ('something', 'Here be the docstring')]
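The difference between searching the metaclass and searching the class itself is easy to see by comparing the two search orders directly. This is a small sketch for Python 3, reusing the Proper and SubProper classes from above:

```python
class Proper(object):
    @property
    def something(self):
        "Here be the docstring"


class SubProper(Proper):
    pass


# the type of a class is its metaclass, whose MRO knows nothing about Proper
metaclass_mro = type(SubProper).__mro__
class_mro = SubProper.__mro__

found_via_metaclass = [k for k in metaclass_mro if 'something' in k.__dict__]
found_via_class = [k for k in class_mro if 'something' in k.__dict__]
```

metaclass_mro is (type, object), which is why the earlier version found nothing when handed a class. class_mro contains Proper, where the property actually lives.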
Hooray, let's see how it fares on a sub-class of an old style class:
class Base:
    def something(self):
        "Here be dragons"
class Sub(Base):
    pass
>>> get_docstrings(Sub)
[... ('something', None)]
Ah, old style classes aren't instances of type. Instead they're instances of types.ClassType, painfully they don't have __mro__ which was introduced in Python 2.3 for new style classes. We need a different strategy for handling old style classes - and thankfully this is already implemented for us in inspect.getmro. For new style classes this already uses __mro__, so we can rewrite our code as:
import types
import inspect
def get_doc(obj, member):
    if hasattr(obj, '__dict__') and member in obj.__dict__:
        return obj.__dict__[member].__doc__
    if isinstance(obj, (type, types.ClassType)):
        search_order = inspect.getmro(obj)
    else:
        search_order = inspect.getmro(type(obj))
    for entry in search_order:
        if member in entry.__dict__:
            return entry.__dict__[member].__doc__
>>> get_docstrings(Sub)
[... ('something', 'Here be dragons')]
>>> get_docstrings(Sub())
[... ('something', None)]
Update
The code above was my final version, but a colleague pointed out a problem (bug). It's explained below.
As you can see, we have fixed the problem for old style classes - but still have a problem with instances of old style classes. The problem is that you don't get the class of an old-style instance by calling type on it:
>>> class A:
...     pass
...
>>> a = A()
>>> type(a)
<type 'instance'>
So this line was broken for some objects: search_order = inspect.getmro(type(obj)). Instead you can access the class of any arbitrary object through its __class__ member:
>>> a.__class__
<class __main__.A at 0x01D23780>
If we make this change, then finally we have code that works for all these cases. We have the additional burden in IronPython of a few 'magic' attributes that IronPython sticks on for us (like the ReferenceEquals method) which can't be found in any of the base classes. This code defaults to returning None if it fails to find a member, so it handles these gracefully.
Our full code for finding all docstrings for all members on an arbitrary object is 18 lines of code:
import types
import inspect
def get_doc(obj, member):
    if hasattr(obj, '__dict__') and member in obj.__dict__:
        return obj.__dict__[member].__doc__
    if isinstance(obj, (type, types.ClassType)):
        search_order = inspect.getmro(obj)
    else:
        search_order = inspect.getmro(obj.__class__)
    for entry in search_order:
        if member in entry.__dict__:
            return entry.__dict__[member].__doc__
def get_docstrings(obj):
    try:
        members = dir(obj)
    except Exception:
        members = []
    return [(member, get_doc(obj, member)) for member in members]
Even all this isn't bulletproof. A class with slots can define a __getattr__ method that returns something for every attribute access. This means that hasattr(obj, '__dict__') will return True and the code that follows may die. A class or instance can override __bases__ and lie about its base classes. It may even be possible to create a metaclass that uses __slots__ and then have classes (instances of the metaclass) without a dictionary (breaking our code that accesses Klass.__dict__). Oh well, they're obscure corner cases of obscure corner cases.
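The first of those corner cases is easy to reproduce. The class name Liar is made up for illustration, and the sketch is written for Python 3:

```python
class Liar(object):
    __slots__ = ()

    def __getattr__(self, name):
        # answers for every attribute that normal lookup can't find
        return 'anything'


liar = Liar()
claims_dict = hasattr(liar, '__dict__')   # True, courtesy of __getattr__
fake_dict = liar.__dict__                 # a string, not a dictionary

try:
    if 'any' in fake_dict:                # substring test, not a key lookup
        fake_dict['any'].__doc__          # indexing a string with a string
    outcome = 'survived'
except TypeError:
    outcome = 'TypeError'
```

hasattr says there is a __dict__, the membership test happens to succeed because 'any' is a substring of 'anything', and the lookup that follows dies with a TypeError.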
If you can think of any cases I've missed, or ways of doing this more elegantly, then let me know in the comments!
Posted by Fuzzyman on 2009-05-20 21:51:45
Categories: Hacking, Python, IronPython Tags: descriptors, introspection, docstrings, method resolution order
Monkey Patching, Static Methods and the Descriptor Protocol
In the Resolver One test framework we have a fairly elaborate mock system. It emulates a .NET API from a commercial vendor which we don't have installed on our test / development machines. Most of the methods on a core part of this API are static methods.
Jonathan and I have spent a good chunk of this morning tracking down an obscure failure. It's one of those wonderful failures; some test is passing but causing other tests later on to fail. Great fun.
Some of our tests work by monkey patching parts of an API to test their interaction with other parts. This is a fairly standard testing technique in Python, but where you are patching any global state (class and module members) you need to be very careful. Even in the event of test failure you must restore the original members after the test completes. Otherwise you get the kind of errors we were seeing.
The standard way of doing this is code that looks like this:
original = module.something
module.something = something_else
try:
    do_something()
    self.assertTrue(something_happened())
finally:
    module.something = original
This gets particularly painful if you patch several things and it is easy to get wrong. To solve exactly this problem, and remove the boilerplate, we use a patch function (which is also available in my Python mock module).
It reduces the snippet above to:
def function():
    do_something()
function()
self.assertTrue(something_happened())
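For readers who haven't seen this kind of helper, here is a minimal sketch of how a patching decorator might wrap the try / finally boilerplate. This is a hypothetical illustration and not the API or implementation of the mock module: the patch signature, the stand-in module object and the seen list are all my own assumptions. Note that it fetches the original with getattr, which is the naive approach whose trouble with static methods is described next.

```python
import functools
import types


def patch(owner, name, replacement):
    """Hypothetical helper: swap owner.name for the duration of a call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            original = getattr(owner, name)
            setattr(owner, name, replacement)
            try:
                return func(*args, **kwargs)
            finally:
                # restore even if the wrapped function raises
                setattr(owner, name, original)
        return wrapper
    return decorator


# stand-in for a real module with some global state
module = types.SimpleNamespace(something='original')
seen = []


@patch(module, 'something', 'replacement')
def function():
    seen.append(module.something)
    raise RuntimeError('test blew up')


try:
    function()
except RuntimeError:
    pass
```

Inside the call module.something is 'replacement', and afterwards it is back to 'original' even though the function raised.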
So as we are doing this, it was particularly odd that we were seeing this error. Eventually we tracked it down to the patching of a static method on our mock API. Let me illustrate the problem:
>>> class Test(object):
...     @staticmethod
...     def test():
...         print 'woozer'
...
>>> Test.test()
woozer
>>> original = Test.test
>>> original
<function test at 0x70df0>
>>> Test.test = original
>>> Test.test()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unbound method test() must be called with Test instance as first argument (got nothing instead)
>>>
Fetching the static method and then setting it directly back on the class breaks it! To understand why we need to understand how attributes are fetched in Python and a bit about the descriptor protocol.
When a method is created inside a class declaration it is a normal function stored inside the class dictionary. Python functions have a __get__ method (part of the descriptor protocol) which controls how they are fetched as attributes from classes and instances. When a method is fetched from an instance what you get is a bound method - effectively the original function bound to the instance which is passed in as the first argument (self). When a method is fetched from the class itself what you actually get is an unbound method (in Python 2.X only - unbound methods have gone in Python 3). This is often used when calling up to base classes in overridden methods:
class SubClass(BaseClass):
    def method(self):
        BaseClass.method(self)
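The binding step can also be performed by hand, which shows that it really is just the function's __get__ method at work. This sketch is written for Python 3, where fetching from the class gives back the plain function instead of an unbound method. The Greeter class is made up for illustration:

```python
class Greeter(object):
    def greet(self):
        return 'hello from %s' % type(self).__name__


g = Greeter()

# the class dictionary stores a plain function
function = Greeter.__dict__['greet']

# calling __get__ with an instance is what attribute access does for us
bound = function.__get__(g, Greeter)
same_result = bound() == g.greet()

# with no instance, Python 3 hands back the function itself
from_class = function.__get__(None, Greeter)
```

bound.__self__ is the instance and bound.__func__ is the original function, exactly as for g.greet.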
Static methods are called without self as the first argument. Functions wrapped in staticmethod exhibit different behaviour when fetched from the class - the underlying function is returned directly. So in our monkeypatching scenario, when the static method is fetched we get a function. Attaching a function back to the class (when the original is restored) turns it into an unbound method next time it is fetched. It has lost its 'staticmethod'ness.
A solution is to do this instead:
>>> class Test(object):
...     @staticmethod
...     def test():
...         print 'fritch'
...
>>> Test.test()
fritch
>>> original = Test.__dict__['test']
>>> original
<staticmethod object at 0x74710>
>>> Test.test = original
>>> Test.test()
fritch
By accessing the class dictionary we get the static method descriptor which is safe to set back onto the class. There is a slight drawback to this approach. Instances of classes that use __slots__ don't have a dictionary - so we either need to fall back to the normal attribute lookup in this case or treat patching classes differently from patching instances. Normally this isn't an issue because unless it is a singleton it isn't so important to unpatch instances.
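The same trap still exists in Python 3, although it shows up in a different place. There are no unbound methods any more, so calling through the class keeps working after a naive restore, but calling through an instance breaks. A sketch, with make_test_class as a made-up helper so that both approaches start from a fresh class:

```python
def make_test_class():
    class Test(object):
        @staticmethod
        def test():
            return 'fritch'
    return Test


# Naive save and restore: attribute access unwraps the staticmethod,
# so what gets stored back is a plain function.
Naive = make_test_class()
Naive.test = Naive.test
via_class = Naive.test()
try:
    Naive().test()
    via_instance = 'still static'
except TypeError:
    via_instance = 'lost its staticmethod-ness'

# Going through the class dictionary keeps the descriptor intact.
Safe = make_test_class()
original = Safe.__dict__['test']
Safe.test = original
safe_result = Safe().test()
```

The naive version fails on the instance call because the function is now treated as a regular method and receives self, which it doesn't accept.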
Posted by Fuzzyman on 2009-05-18 14:01:01
Categories: Python, Hacking Tags: testing, monkeypatching
Managing Object Lifecycles in Ironclad
Ironclad 0.8 has just been released. Ironclad is an open source project by Resolver Systems to allow the use of Python C extensions from IronPython.
Ironclad is an implementation of the Python C API in C#, with a bit of assembly language to fool extensions into believing they are calling into Python25.dll. It also reuses as much of the C implementation of the API as possible. When extensions make API calls, Ironclad creates IronPython objects rather than CPython objects, and Ironclad also handles the mapping of extension objects into IronPython.
One of the hardest challenges for Ironclad is that Python extension modules expect to use reference counting for garbage collection, whereas the .NET framework has its own (more efficient!) garbage collector. Ironclad objects that are still being used by the C extension module mustn't be freed even if no .NET references exist to an object, and the reference count of extension objects mustn't be allowed to drop to 0 if they are still being used inside .NET.
A while ago I blogged about how Ironclad handled this. That entry is now well out of date, so William Reade (lead developer of Ironclad) has provided an update.
Hello! My name is William Reade. I'm a colleague of Michael's, and the primary developer on the Ironclad project, and some months ago I explained to Michael how I managed object lifetimes in Ironclad. He in turn wrote a detailed post about it here, and I merrily chipped in to clarify a couple of points in the comments; the result all looks perfectly respectable and technically quite neat, apart from the fact that the approach described is, in fact, dangerously stupid and wrong in at least one critical respect [1].
So, er, sorry. Let's see if we can do a little better this time. The fundamental problem that Ironclad needs to solve is: "how do we make an IronPython object from a type defined in a compiled CPython extension?"
Well, modulo a few mildly diabolical details, it's actually pretty easy to construct an IronPython type which forwards all method calls (and attribute accesses, etc) to an underlying CPython instance. So, in essence, we solve the problem by creating IronPython objects which wrap CPython objects -- which combination I will call a "bridge object", for want of a better term [2].
There's a bit of an impedance mismatch between the two systems, and the biggest problem thus far has been managing object lifetimes. CPython objects are reference-counted, and will effectively commit suicide -- completely deterministically -- as soon as their refcount hits 0, while IronPython objects are destroyed non-deterministically by the Garbage Collector: effectively, at random.
Some objects' lifetimes are easy to track: when we create a CPython proxy for an IronPython object, we store the IronPython object and the pointer to the CPython stub object in a 2-way map (thus ensuring the IronPython object will not be GCed), and give the stub a dealloc method that deletes the association (rendering the IronPython object -- or CLR object, for that matter -- once again eligible for GC), as soon as the CPython object's refcount hits 0.
However, when dealing with a bridge object, we need to manage its constituents' lifetimes very carefully. For reasons that will hopefully soon become clear, we need to depend on the GC to initiate the chain of events leading up to object destruction, so we can't let our map strongly reference the IronPython part; instead, we have to use a weak reference to it [3].
We can't stop the CPython reference-counting from working, but clearly we can't afford to have the bridge object's CPython part deallocate itself when the IronPython part is still alive: referencing freed memory is not generally considered to be industry Best Practice.
So, the IronPython object IncRefs the CPython object as soon as it's created, to ensure that the CPython object's refcount never actually hits 0 again -- and hence that the normal mechanism for CPython object destruction never gets triggered. Instead, at the point when the refcount hits 1, I can be sure that no CPython references to the bridge object still exist. Now, when the IronPython object gets garbage-collected, it can safely call the CPython dealloc method and release its unmanaged resources at the same time as the managed object dies.
And that's fine, as far as it goes: it ensures that CPython objects can't disappear from underneath their IronPython counterparts. However, that's not the only failure scenario: the CPython part of the bridge object depends upon the IronPython part as well [4], so I can't afford to let wanton garbage-collections destroy the IronPython part while the CPython part is still being kept alive by unmanaged references.
So: whenever IronPython code calls into a CPython extension, we need to translate every parameter to that function into a format comprehensible to CPython. If the parameter lacked a CPython representation beforehand, one is created with refcount 1; otherwise, the existing representation is IncReffed. When the function returns, each parameter is DecReffed (as is the return value, once its IronPython representation has been created or retrieved).
However, the IncRefs and DecRefs described above do extra checks for bridge objects whose refcounts are increasing to 2 or decreasing to 1. Every time the refcount increases to 2, the IronPython object is added to a managed set, and it's removed from it again whenever the refcount drops to 1. This ensures that, if the CPython code grabs a reference to a bridge object by IncReffing the CPython part, the refcount will not drop as far as 1 when it's DecReffed on return. Therefore, the IronPython part will stay in the strongRefs set and remain ineligible for garbage collection, even if it falls out of scope everywhere else, and so we can guarantee that it won't disappear while the CPython part still needs it.
Note
It's important to draw a distinction between the Ironclad IncRef / DecRef operations and the normal reference incrementing and decrementing done by extension modules. The former are fully under our control (of course); the latter are implemented as C macros and directly manipulate a field on objects. As we can't know when a C extension has updated the reference count on an object [5], we are subject to these contortions.
The only remaining drawback is that, once CPython has finished with the bridge object, it may become an unreferenced zombie: the strongRefs set keeps the IronPython part, which holds the last reference to the CPython part, ineligible for GC.
So... I periodically loop [6] over the CPython parts of the bridge objects, checking refcount. When I find a zombie, I just remove the head (or destroy the brain)... sorry, remove the IronPython object from the strongRefs set, and allow GC to take its natural course... and that closes all the loopholes I have thus far identified [7].
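As a rough illustration of that periodic check, again in plain Python with hypothetical names (`BridgeObject`, `sweep_zombies`), and assuming a refcount of 1 means only the IronPython part still references the CPython part:

```python
class BridgeObject:
    """Toy bridge object carrying just a refcount field."""
    def __init__(self, name, refcount):
        self.name = name
        self.refcount = refcount


def sweep_zombies(strong_refs):
    """Drop every object that CPython has finished with; return what was dropped."""
    # Collect first, so the set is not modified while it is being iterated.
    zombies = [obj for obj in strong_refs if obj.refcount == 1]
    for obj in zombies:
        strong_refs.discard(obj)
    return zombies


in_use = BridgeObject("in_use", refcount=2)   # an extension still holds it
zombie = BridgeObject("zombie", refcount=2)
strong_refs = {in_use, zombie}

# The extension releases its reference by touching the field directly, so
# nothing removes the object from strong_refs at that moment.
zombie.refcount -= 1

removed = sweep_zombies(strong_refs)
```

Once an object has left the set, nothing in this model keeps it alive beyond its ordinary references, which corresponds to letting the garbage collector take its course.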
I hope that was broadly comprehensible, and perhaps even interesting or edifying; if I've been at all unclear, please comment, and I'll answer your questions as best I can.
| [1] | Basically, when reregistering IronPython objects for finalization, I completely failed to notice the actual mechanism IronPython uses to call finalizers; this ended up causing inevitable zombie apocalypses in long-running processes. Whoopsydoodle. |
| [2] | Not precisely true: the two objects actually have no direct knowledge of each other. However, the details of precisely what happens are entirely tedious and irrelevant; generally, issues like this will be handwaved/ignored. You Have Been Warned. |
| [3] | See src/InterestingPtrMap.cs for gory details. Goggles optional. |
| [4] | Or, at least, it might depend on it, and I can't generally be sure that it doesn't. For example, the CPython type may not define a __setattr__; if, then, I were to assign to random attributes on a bridge object's IronPython part, those attributes would have no CPython representation at all, and would be lost when the IronPython part were GCed, leading to Bad Things. |
| [5] | We could if we modified the macros, but then extension modules would need recompiling - and currently Ironclad maintains binary compatibility with Python C extensions. Thanks to Adam Olsen for forcing William and me to make this clearer. |
| [6] | I know, I know |
| [7] | I suspect that I have not yet fully grasped the subtleties of bridge objects which hold unmanaged references to other bridge objects. |
Posted by Fuzzyman on 2009-01-29 13:20:12
Categories: Python, IronPython, Hacking, Work Tags: Ironclad, William Reade, Garbage Collection, Reference counting, object lifecycles