Python Programming, news on the Voidspace Python Projects and all things techie.

Django concurrency, database locking and refreshing model objects

Using expressions to make some of our model updates atomic (as discussed previously) was still useful, but it wasn't sufficient to make all of our operations safe against concurrent database modifications. This is because having fetched some values we wanted to perform operations based on them, and those values must not change whilst the operations are taking place (the end result is written back and would overwrite any other changes made in the meantime).

What we really needed to do is to lock rows in the database, where a row corresponds to a particular model instance, so that no other operation could modify it whilst a change is taking place.

Database row locking is typically done with SELECT ... FOR UPDATE, and unfortunately Django has no built-in support for this. Hopefully it will be added in Django 1.4.

Our solution to this is to manually modify an SQL query to make it FOR UPDATE. This obtains a lock on the row, and any other query against that row will block until the lock is released. This must be done inside a transaction, as the lock is released when the transaction completes (commit or rollback).

Note

Since I started writing this blog entry (interrupted by my wife inconsiderately giving birth to our beautiful daughter Irina) it has been committed. New in Django development version: the QuerySet select_for_update method.
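
As a minimal sketch of the new API (against the Django development version at the time of writing; the model and field names are the illustrative ones used below):

from django.db import transaction

@transaction.commit_on_success
def do_something(some_id):
    # the SELECT ... FOR UPDATE row lock is held until the
    # transaction commits or rolls back
    obj = SomeModel.objects.select_for_update().get(id=some_id)
    obj.field += 1
    obj.save()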

Our initial attempt created a query that would just fetch a single field from the particular model instance we wanted to lock. The query is then made into a FOR UPDATE query and executed to obtain the lock.

This query fetches the field we want to modify. Once we hold the lock we know that nothing else can modify the field whilst we are using it. We still need to fetch the field again (and update the field on our model instance) after obtaining the lock: the model object we already have has a value, but acquiring the lock may have blocked whilst another request modified that value. We always need the latest value at this point, so even if we weren't blocked we want to update after obtaining the lock. (As changes to this particular field are always guarded by the lock, in practice we would have been blocked if something else was changing it.)

Our initial iteration was only interested in one field that may have changed due to a concurrent modification, so the code that obtains the lock updates that field:

# these methods live on SomeModel; imports needed:
from django.db import connection, transaction

@transaction.commit_on_success
def do_something(self):
    self._get_lock()
    # do stuff

def _mangle_sql_for_locking(self, sql):
    # yeah, it's really this difficult
    return sql + ' FOR UPDATE'

def _get_lock(self):
    query = SomeModel.objects.filter(id=self.id).values_list('field')
    # _as_sql is a private QuerySet method - see the caveat below
    sql, params = query._as_sql(connection=connection)

    cursor = connection.cursor()
    cursor.execute(self._mangle_sql_for_locking(sql), params)

    # executing the query acquires the lock; update the field
    field = cursor.fetchone()[0]
    self.field = field

This worked fine until we had more than one field we were interested in. Our naive attempt to modify the code looked like this:

def _get_lock(self):
    values = ('field', 'other_field')
    query = SomeModel.objects.filter(id=self.id).values_list(*values)
    sql, params = query._as_sql(connection=connection)

This blows up in the _as_sql call with this exception:

Cannot use a multi-field ValuesListQuerySet as a filter value.

There was no mention of this in the docs and Google didn't yield much, but then we are calling a private method directly.

Sooo... how about a method that will refresh all fields with the latest version from the db? I've always been slightly surprised that model instances don't have this capability built-in (maybe I've just missed it?), but maybe it's a bit of an anti-pattern outside of very specific use cases.

So just in case you're tempted to go down the same rabbit holes as me, here's a refresh method that fetches the latest version of all fields and updates the model instance.

def refresh(self):
    updated = SomeModel.objects.filter(id=self.id)[0]
    fields = [f.name for f in self._meta.fields
              if f.name != 'id']
    for field in fields:
        setattr(self, field, getattr(updated, field))

This allows us to obtain the lock (still generate and execute the FOR UPDATE SQL) but discard the result as all fields are updated with another query (with the additional overhead that implies of course).

Note that there is a sane reason the SQL mangling, trivial as it may be, is in its own method. The actual code obtains the lock with a single-field query, refreshes the instance, and has an additional method and call in it:

def _mangle_sql_for_locking(self, sql):
    # yeah, it's really this difficult
    return sql + ' FOR UPDATE'

def _concurrency_poison(self):
    pass

def _get_lock(self):
    # a single-field query: multi-field values_list queries can't be
    # turned into SQL with _as_sql, as we discovered above
    query = SomeModel.objects.filter(id=self.id).values_list('field')
    sql, params = query._as_sql(connection=connection)

    cursor = connection.cursor()
    cursor.execute(self._mangle_sql_for_locking(sql), params)
    self._concurrency_poison()
    self.refresh()

The _concurrency_poison method does nothing in production, but it enables us to write tests that both prove there is a race condition and prove that it is fixed. In our tests we patch out _mangle_sql_for_locking with a function that returns the sql unmodified. We additionally patch out _concurrency_poison with a function that makes a concurrent modification.

Without the locking the "concurrent change" will be overwritten and the final value will be incorrect (the concurrent change will be lost). We test that we get the wrong final result, which proves we have a race condition.

A second test leaves _mangle_sql_for_locking unchanged, but still patches _concurrency_poison to make a concurrent change. Because this should now block (_concurrency_poison is called after the lock has been obtained) the concurrent change must be made from a separate process or thread. A typical example (used for both tests) might look something like this:

import subprocess
import time
from textwrap import dedent

from django.conf import settings
from mock import patch

ENV = {
    'PGHOST': settings.DATABASES['default']['HOST'],
    'PGUSER': settings.DATABASES['default']['USER'],
    'PGPASSWORD': settings.DATABASES['default']['PASSWORD'],
    # subprocess environment values must be strings
    'PGPORT': str(settings.DATABASES['default']['PORT']),
}

@patch('appname.models.Model._concurrency_poison')
def concurrently_modify(self, concurrency_mock):
    # Poison: modify the database in an inconvenient way at an
    # inconvenient time.
    database_name = settings.DATABASES['default']['NAME']
    proc = subprocess.Popen(['psql', database_name],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE,
        stderr=subprocess.PIPE, env=ENV)

    def poison():
        proc.stdin.write(dedent('''
            UPDATE some_table
            SET field = field - 3;
        '''))
        proc.stdin.flush()
        # give the database code a chance to execute
        time.sleep(1)

    concurrency_mock.side_effect = poison

    # call the code that obtains the lock here
    # it will automatically trigger the
    # concurrent operation

In the second test the concurrent change is blocked until the lock is released. Instead of being overwritten the concurrent change is executed after the logic in the model, so the result will be different from the first test and we can assert that the result has this different (correct) value.

This testing pattern, which I think is pretty cool, was devised by David Owen, the resident database expert on our team. He teaches me about databases whilst I encourage him that testing is useful...

Posted by Fuzzyman on 2011-05-13 11:06:41


Danger with django expression objects

I've recently been dealing with a bunch of concurrency issues in our Django app. Some of the views modify database rows (model objects), which is unsurprising, but we did have a few situations where concurrent modifications could cause changes to be overwritten (silently lost). Following blog entries will describe some of the other fun we've had with this; this entry looks at Django expression objects and one particular caveat with their use.

Given the following trivial django model definition:

from django.db import models

class Foo(models.Model):
    field = models.IntegerField()

Consider the following code:

foo = Foo.objects.get(id=some_id)
foo.field += 3
foo.save()

The problem here is that if this code is executed concurrently on the same database row by two requests then the second request could fetch the field value after the first request has fetched it, but before the change has been saved. The first request would then save its change, followed by the second request doing the same and overwriting the first.

Django provides one solution to this by allowing the change to be executed in the database, which makes it atomic. This is done with the F() object, which refers to a field instead of modifying it directly.

Here's the same example using F():

from django.db.models import F
foo = Foo.objects.get(id=some_id)
foo.field = F('field') + 3
foo.save()

foo.field (the instance member) has an expression object assigned to it and calling .save() causes the expression to be evaluated in the database.

The hidden danger here is that this expression object will be evaluated every time you call save. If you have a code path that potentially makes several changes to a model instance, and saves those changes independently, then your expression can be evaluated multiple times.

This code illustrates the problem:

>>> from django.db.models import F
>>> from foo.models import Foo
>>> f = Foo.objects.create(field=3)
>>> f.field = F('field') + 27
>>> f.field
<django.db.models.expressions.ExpressionNode object at 0x...>
>>> f.save()
>>> Foo.objects.get(id=f.id).field
30
>>> f.save()
>>> Foo.objects.get(id=f.id).field
57

This happened to us when we changed a code path sometime after the switch to using these expressions. Fortunately our tests picked up on it, but it was "fun" to debug and work out where the failure was being caused.

Note

jmv comments on this entry that for similar use cases he uses the QuerySet update method in conjunction with F() to make the updates atomic. This effectively performs the save and the update in a single call.

It doesn't work on individual model instances but on QuerySet, so the code would look like:

Foo.objects.filter(id=f.id).update(field=F('field') + 1)

This updates via the QuerySet, so if you need the model instance with the modified value you have to fetch it again after the update. As it doesn't leave an expression object on the instance it solves the problem of multiple saves.
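
As a sketch of that pattern (the re-fetch at the end is my addition, for when you need the new value on an instance):

Foo.objects.filter(id=f.id).update(field=F('field') + 1)
# update() doesn't touch the in-memory instance, so fetch it again
f = Foo.objects.get(id=f.id)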

Posted by Fuzzyman on 2011-05-06 15:58:05


A Brief Update

Over the summer I went even-longer-than-normal without blogging. In that time a lot has changed.

The wife and I had a great summer break visiting her family in Romania. This included a trip up the gloriously beautiful Ceahlau mountain. Whilst we were out there we discovered that Delia is pregnant. This was a very pleasant discovery, but as I'm sure you all know babies are the great enemies of open source - so we'll have to see what happens in the future... Fortunately I have Jesse Noller as a great inspiration for enjoying fatherhood whilst still contributing to Python.

(Greg Newman hypothesises that Michael Foord junior will look like this)

Immediately on returning from Romania I started working full time with Canonical as a Python developer on the ISD (Infrastructure Systems Development) team. Our team blog is here. We work mainly with Django, but we also use a whole range of 'standard' Python tools including pip, fabric, virtualenv, and mock. We also have some of our internal libraries open sourced like configglue and Canonical SSO.

I work from home (mostly - my team is joining the Ubuntu Developer Summit in Florida next week for a development sprint) communicating with the team over IRC and occasionally VOIP (Mumble), which is how all Canonical development happens. Having done several years commuting to London, followed by working from home for the last nine months or so, I have the taste for home working and wouldn't work any other way now. My team includes one guy from Poland, one from Germany, a couple of Englishmen, two Argentinians and at least one American. All great people, but it will be nice to actually meet them at the UDS.

My computers are all Macs (and have been for some time), but I've been working inside an Ubuntu VM. So far I've enjoyed it but am not (yet) tempted to move to Ubuntu as my main OS... The job change also means that for the first time in about five years I'm not working at all with IronPython. I'm still involved (more news on that to come soon), but purely for hobby programming. (As a side note, I wonder how long it will take for me to lose my reputation as a 'windows guy'. Years probably.)

Unsurprisingly the Canonical development workflow is all based around Launchpad and Bazaar, and I'm really impressed. In the past I've been annoyed by the web UI of Launchpad: finding the source code for a project can be awkward and it doesn't provide documentation hosting. For small projects it is still possibly overkill, but for teams of several people doing development in new branches and using merge reviews Launchpad provides a very nice workflow. As I now have some projects in Mercurial, work with Bazaar, and still have some projects in Subversion, I do get confused about which version control system I am actually using at any one time...

configglue, one of our internal projects, is interesting as it is a direct 'competitor' for ConfigObj. As it is owned by my team I will almost certainly be working on it as well as with it. configglue provides layered configuration files, a frequent feature request for ConfigObj. In ConfigObj you can do it "manually" via the merge() method, but the configglue way is nicer.
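
For comparison, a sketch of the manual ConfigObj approach (the file names are made up):

from configobj import ConfigObj

# layer the user's settings over the defaults; values in user.ini win
config = ConfigObj('defaults.ini')
config.merge(ConfigObj('user.ini'))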

configglue has schema validation (which is its major feature really). It is code based whereas ConfigObj validation is via schema files. The configglue way is more verbose but I quite like it. We use it with django-configglue which allows django project settings to be kept in ini files rather than as code.

You may also be wondering how the unittest2 plugin system I blogged about a while ago is getting on.

I recently posted an update to the Testing in Python mailing list:

After recent conversations with Holger Krekel and Robert Collins I have decided to make a fairly major internal changes to unittest2 plugins. I'm getting rid of all global state, which includes the global hooks that are currently the core of the extension machinery.

Getting rid of global state always sounds good, but it isn't without API consequences. With no global access to the hooks set, it needs to be passed down and 'made available' to the runner, loader, suites and all test cases. This is annoying, but probably still better than global state. Removing the global state makes testing the plugin machinery massively easier however... It also allows multiple different plugin configurations within the same process (which is nice but only actually useful for a small set of use cases).

This refactoring also means changing the way plugins are initialized, merely importing the plugin is no longer enough (the metaclass goes - which makes me sad but is probably also a good thing).

So the refactor is still in progress. This means plugins won't land in Python 3.2, but they will have longer to mature and stabilise in unittest2, which is no bad thing. Meanwhile I'm still working on the final release of Mock 0.7.0. There will be one more beta before final release. I also have a few Python issues for unittest itself that I'm really hoping to work on before the Python 3.2 beta which is rapidly approaching!

Posted by Fuzzyman on 2010-10-19 14:43:14


CherryPy with a serving of werkzeug

My next work project is an interesting application. We're porting a C++ Windows desktop application to IronPython and modernising it in the process. We'd also like to make it run on the Mac, which will be a relief to me as it will save me developing inside a VM. The most powerful Windows user interface framework is WPF - and this isn't available on Mac OS X, even with Mono. The Silverlight UI is a cut-down version of WPF however, so if we develop the application as an out-of-browser Silverlight application it will be trivially easy (at least compared to porting a desktop application) to move it to the web at some future point. For the app to work as an out-of-browser Silverlight application it will need a local data server, so I've been looking at the simplest way of achieving this with Python (vanilla CPython).

The fastest pure Python web server is still good old CherryPy [1], but to make the application easily portable to the web it would be nice to develop the local component as a wsgi app. Fortunately CherryPy and wsgi are good friends and play nicely together. We can create a simple wsgi application with werkzeug and serve it with CherryPy. In our case the wsgi app is really just a very thin wrapper around a custom data processing / calculation component.

I couldn't find an example of exactly how to do this, but it wasn't very hard to work out from the documentation. Anyway, here's a simple proof-of-concept example:

from cherrypy import wsgiserver

# the werkzeug wsgi app
from app import application

d = wsgiserver.WSGIPathInfoDispatcher({'/': application})
server = wsgiserver.CherryPyWSGIServer(('localhost', 8080), d)

if __name__ == '__main__':
    try:
        server.start()
    except KeyboardInterrupt:
        server.stop()

For the wsgi app the werkzeug example on the front page of the documentation works fine, but an even simpler one that doesn't require a templating engine installed (in our case we will be sending and receiving data - probably encoded using protobuf as it is more compact / faster than json and there are both CPython and Silverlight libraries for working with it) is:

from werkzeug import Request, Response, SharedDataMiddleware
from werkzeug.routing import Map, Rule
from werkzeug.exceptions import HTTPException

url_map = Map([Rule('/', endpoint='index')])

@Request.application
def application(request):
    adapter = url_map.bind_to_environ(request.environ)
    try:
        endpoint, values = adapter.match()
        response = Response(mimetype='text/html')
        response.data = u'<h2>foo</h2>'
        return response
    except HTTPException, e:
        return e

application = SharedDataMiddleware(application, {})

There are interesting possibilities and challenges ahead, like creating a Windows service that starts the local server when you open the Silverlight app, or possibly creating a WPF desktop application and a Silverlight application that share code - something that became a lot easier with Silverlight 4.

[1]The development servers that ship with frameworks like Django are typically single threaded, blocking on individual requests and not suitable for production deployment even as a local server.

Posted by Fuzzyman on 2010-05-11 00:29:26


Buttons in a Silverlight DataGrid Header

Today I was hoping to complete adding an Excel-like auto-filter to the Silverlight DataGrid, which is used for the main view throughout a big chunk of our application. Instead I spent almost the entire day just getting the first step working - putting a button in the grid header and wiring up the click event.

Generally using Silverlight from IronPython is a pleasant experience, but there are one or two things that can't be done through code and have to be done in XAML. XAML is the markup language used to describe Silverlight user interfaces; the Silverlight UI is a cut-down mostly-subset of Windows Presentation Foundation (WPF), the hardware accelerated Windows [1] user interface framework that is part of .NET 3 and above.

With WPF you can not only define the visual elements (including animations) of a UI, but also set up data-binding and bind handlers to events. Event handlers are statically bound and this is a problem for IronPython. Even though we can now use clrtype to create true .NET classes from IronPython, the XAML loader uses Type.GetType() to find the types - and this fails with generated classes.

For almost everything this isn't a problem as we can just hook up events from code, but creating custom DataGrid headers and hooking up events to controls in them is one of the places where we can't.

So the first part of the puzzle is creating a custom style in UserControl.Resources and setting up the ContentTemplate:

<Setter Property="ContentTemplate">
    <Setter.Value>
        <DataTemplate>
            <Grid Height="{TemplateBinding Height}" Width="Auto">
                <Grid.RowDefinitions>
                    <RowDefinition Height="*"/>
                </Grid.RowDefinitions>
                <StackPanel Orientation="Horizontal" Margin="2"
                  HorizontalAlignment="Stretch" VerticalAlignment="Stretch">
                    <TextBlock Text="{Binding}" HorizontalAlignment="Center"
                     VerticalAlignment="Center" Margin="0.2"/>
                    <Button Content="Foo" Margin="5,0" x:Name="filter" />
                </StackPanel>
            </Grid>
        </DataTemplate>
    </Setter.Value>
</Setter>

See how the DataTemplate contains a Grid with child elements like a TextBlock and the oh-so-important Button. The {Binding} in the TextBlock means that the standard header text is displayed alongside our button. The XAML for the DataGrid specifies the style we created:

<data:DataGrid x:Name="dataGrid" CanUserReorderColumns="False"
    ColumnHeaderStyle="{StaticResource OurColumnHeaderStyle}"
    IsReadOnly="True" AutoGenerateColumns="False" />

Ok, so far so good. When the columns are created for our grid they will use the ContentTemplate to create the column headers:

A Silverlight DataGrid with custom headers

If we'd been using C# we could have hooked up the button click event in the style XAML. So how do we do this from IronPython? This is what took most of the day to work out. The column headers aren't exposed in any straightforward way, but once the columns have been created we can 'walk the visual tree' to find the column header object, and from there we can find the button. Walking the visual tree is done with VisualTreeHelper, and a recursive helper function:

from System.Windows.Media import VisualTreeHelper

def find_children(parent, findtype):
    count = VisualTreeHelper.GetChildrenCount(parent)
    for i in range(count):
        child = VisualTreeHelper.GetChild(parent, i)
        if isinstance(child, findtype):
            yield child
        else:
            for entry in find_children(child, findtype):
                yield entry

We use it to find and hook up the buttons thusly:

from System.Windows import Visibility
from System.Windows.Controls import Button
from System.Windows.Controls.Primitives import DataGridColumnHeader

for entry in find_children(datagrid, DataGridColumnHeader):
    for button in find_children(entry, Button):
        if not button.IsEnabled:
            button.Visibility = Visibility.Collapsed
        else:
            button.Click += handler_function
        # just one button per column
        break

We make disabled buttons invisible because otherwise the grid will leave a disabled button visible in the header to the right of the populated columns.

So that was the hard part. The next problem wasn't quite so difficult; inside the event handler how do we know which column the button click is for? It turns out that getting a reference to the column inside the click event handler is easy:

from System.Windows.Controls import DataGridColumn

def handler_function(sender, event):
    column = DataGridColumn.GetColumnContainingElement(sender)
    ...

At least next time I have to do this it will be a bit easier...

The final version will use an image for the button content, but the hard part is done.

UPDATE

My colleague Stephan Mitt tells me that the only reason it was so easy to get the column from the button event is that he spent three hours previously discovering this API. Props to him for working this out and making my life easier. Stephan also wanted me to show you the end result with his nice filter images instead of my proof-of-concept 'Foo' buttons:

A Silverlight DataGrid with custom headers using image buttons

The code to set the image on the button looks like this:

from System import Uri, UriKind
from System.Windows.Controls import Image
from System.Windows.Media.Imaging import BitmapImage

def click_handler(sender, event):
    uri = Uri('images/someimage.jpg', UriKind.RelativeOrAbsolute)
    bitmap = BitmapImage(uri)
    image = Image()
    image.Source = bitmap

    # The sender is the button
    sender.Content = image

As I was integrating this with our production code I encountered another problem. Many of the grids we use are loaded into TabControl pages, and the headers aren't created until the grid is rendered (when the tab page containing the grid is selected). This means it matters when you run the code that walks the visual tree to find the buttons and set up the click handlers. A good place to do it is in response to the Loaded event, which is fired after the headers have been created.
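
A sketch of that hookup, with hook_up_header_buttons standing in for the tree-walking code shown earlier:

def on_loaded(sender, event):
    # by the time Loaded fires the column headers exist
    hook_up_header_buttons(sender)

datagrid.Loaded += on_loaded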

[1]Silverlight runs on both the Mac and Windows however, and I do almost all my development on the Mac. With Moonlight, Silverlight apps can run and be developed on Linux too.

Posted by Fuzzyman on 2010-03-17 23:49:04


Encoding json on Silverlight with System.Json

At the back end of last year I wrote a blog entry on decoding json on Silverlight. Well, the time has finally come and we're now encoding json to post back to our Django application. The code to encode json is just as simple as the code for decoding. It makes extensive use of clr.Convert(...) as encoding json relies heavily on implicit conversions:

import clr
from System.Collections.Generic import Dictionary
from System.Json import JsonValue, JsonArray, JsonObject

NoneType = type(None)

def encode_string(data):
    return encode_object(data).ToString()

def encode_object(value):
    if isinstance(value, (int, long, bool, float, str, NoneType)):
        # JsonValue has implicit conversion operators for all
        # these types
        return clr.Convert(value, JsonValue)
    elif isinstance(value, (list, tuple, set)):
        return JsonArray([encode_object(obj) for obj in value])
    elif isinstance(value, dict):
        values = Dictionary[str, JsonValue]()
        for key, val in value.items():
            values[key] = encode_object(val)
        return JsonObject(values)

    # raise an error if we encounter an unhandled type
    raise TypeError(value)

encode_string takes any data-structure of Python primitive / container types and serializes it into JSON. This json seems to be consumed fine by simplejson on the other end. As well as being simple this code is blazingly fast. The actual encoding operation is done in the .ToString() call, which all happens in C#.
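
A quick usage sketch (the data is made up):

data = {'companies': [{'name': 'Acme', 'id': 42, 'active': True}]}
json_string = encode_string(data)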

Note

We used to use simplejson inside Silverlight for our json handling. This pulls in several standard library modules and importing them had a big impact on our startup time. Switching to using System.Json was a big win for us both in terms of performance and startup time.

When you send the json over the wire it will be encoded as UTF-8, for the simple reason that this is the only encoding Silverlight ships with.

This is relevant because the reason we are shipping huge data structures over the wire is in order to export Excel files from our data-grid views. I initially tried to use xlwt inside Silverlight. Unfortunately it relies on either latin-1 or utf-16 encodings and so my efforts were doomed to failure. Fortunately it works grand on the server side, but the encoding restriction does place limits on the Python libraries that you can use in the browser (unless you are prepared to write your own encoding functions of course).

Posted by Fuzzyman on 2010-03-10 00:01:49


Fun with Unicode, Latin-1 and a C1 Control Code

Unicode is a rabbit-warren of complexity; almost fractal in nature, the more you learn about it the more complexity you discover. Anyway, all that aside, you can have great fun (i.e. pain) with fairly basic situations even if you are trying to do the right thing.

This particular problem was encountered by Stephan Mitt, one of my colleagues at Comsulting. I helped him find the solution, and with a bit of digging (and some help from #python-dev) worked out why it was happening.

We receive data from customers as CSV files that need importing into a web application. The CSV files are received in latin-1 encoding and we decode and then iterate over them to process a line at a time. Unfortunately the data from the customers included some \x85 characters, which were breaking the CSV parsing.

One of the problems with the latin-1 encoding is that it uses all 256 bytes, so it is never possible to detect badly encoded data. Arbitrary binary data will always successfully decode:

>>> data = ''.join(chr(x) for x in range(256))
>>> data.decode('latin-1')
u'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f...'

If you iterate over a standard file object in Python 2 (i.e. one that reads data as bytestrings) then you iterate over it a line at a time. This splits lines on carriage returns (\x0D) and line feeds (\x0A). If you're on Windows then the sequence \x0D\x0A (CRLF) signifies a new line. If you're trying to do-the-right-thing, and decode your data to Unicode before treating it as text, then you might use code a bit like the following to read it:

import codecs

handle = codecs.open(filename, 'r', encoding='latin-1')
for line in handle:
    ...

This was the cause of our problem. When decoding with latin-1, \x85 becomes u'\x85', which Unicode treats as a line break. So if your source data has \x85 embedded in it, and you are splitting on lines, where the lines break will differ depending on whether you are using byte-strings or Unicode strings:

>>> d = 'foo\x85bar'
>>> d.split()
['foo\x85bar']
>>> u = d.decode('latin-1')
>>> u
u'foo\x85bar'
>>> u.split()
[u'foo', u'bar']

This could still be a pitfall in Python 3, where all strings are Unicode, particularly if you are porting an application from Python 2 to Python 3. Suddenly your data will behave differently when you treat it as Unicode. The answer is to do the split manually, specifying which character to use as a line break.
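
For example, a sketch that only breaks on \n (assuming that's the line ending in the data):

import codecs

handle = codecs.open(filename, 'r', encoding='latin-1')
# split only where we actually intend a line break
lines = handle.read().split(u'\n')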

The problem isn't restricted to \x85. The Unicode spec on newlines shows us why: \x85 is referred to by the acronym NEL, a C1 control code - "NEL: Next Line. Equivalent to CR+LF. Used to mark end-of-line on some IBM mainframes."

In fact NEL belongs to a general class of characters known as Paragraph Separators (Category B). This category includes the characters \x1C, \x1D, \x1E, \x0D, \x0A and \x85. Splitting on lines will split on any of these characters, which may not be what you expect. It certainly wasn't what we expected.
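
A quick demonstration in Python 2:

>>> u'a\x1cb\x1dc\x1ed\x85e'.splitlines()
[u'a', u'b', u'c', u'd', u'e']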

For us the solution was simple; we just strip out any occurrence of \x85 in the binary data before decoding.

Note

Marius Gedminas suggests that the data is probably encoded as Windows 1252 rather than Latin-1. He is probably right.

There are some interesting notes on Unicode line breaks in this Python bug report: What is an ASCII linebreak?.

Posted by Fuzzyman on 2010-01-07 12:42:27


Decoding json on Silverlight with System.Json (plus implicit conversion operators in IronPython)

As I explained we're writing a Silverlight application that communicates with our Django server and exchanges a lot of json with it.

Unfortunately, due to what is apparently just an oversight, the codecs module is incomplete for IronPython on Silverlight. This means that recent versions of simplejson don't work.

What we've been using is an older version of simplejson, which pulls in the obsolete and huge sre module as one of its dependencies. On top of bloating our application and adding several seconds to startup whilst importing it all, the performance is not exactly blazing.

Fortunately part of the Silverlight SDK is System.Json.dll. Using this API from IronPython is pretty simple. The following code defines a loads function (load string - the same API as simplejson) that takes a string and parses it.

import clr
clr.AddReference('System.Json')
from System import Boolean
from System.Json import JsonValue, JsonType

String = clr.GetClrType(str)

def loads(inString):
    thing = JsonValue.Parse(inString)
    return handle_value(thing)

def handle_value(inValue):
    if inValue is None:
        return None
    elif inValue.JsonType == JsonType.String:
        return clr.Convert(inValue, String)
    elif inValue.JsonType == JsonType.Boolean:
        return Boolean.Parse(str(inValue))
    elif inValue.JsonType == JsonType.Number:
        return get_number(inValue.ToString())
    elif inValue.JsonType == JsonType.Object:
        return dict((pair.Key, handle_value(pair.Value)) for pair in inValue)
    elif inValue.JsonType == JsonType.Array:
        return [handle_value(value) for value in inValue]
    # Should be unreachable - but if it happens I want to know about it!
    raise TypeError(inValue)

def get_number(inString):
    try:
        return int(inString)
    except ValueError:
        return float(inString)
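
A quick usage sketch:

data = loads('{"name": "foo", "values": [1, 2.5, true, null]}')
assert data == {'name': 'foo', 'values': [1, 2.5, True, None]}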

As with my custom json emitter there is a lot it doesn't do. It handles the following types:

  • None
  • lists
  • dictionaries
  • strings
  • floats and integers
  • booleans

If you want it to handle decimals or dates you'll have to add that yourself. How it works is mostly straightforward, but there is one little piece of 'magic' in there. When you call ToString() (or str() - they do the same thing) you get the original json back. For numbers and booleans this is fine as we can easily turn them into the objects they represent. For strings this is a nuisance as we get double quoted, escaped strings. The correct way to get the value we want is to use implicit conversion. In C# this looks something like:

string value = jsonArray["key"];

Declaring the result as a string calls the appropriate JsonValue implicit conversion operator for us. IronPython doesn't have implicit conversion operators as we don't declare types, and with a dynamic type system you rarely need to cast. Up until version 2.6 it wasn't possible to call implicit operators, but in IronPython 2.6 we gained a new function in the clr module: clr.Convert:

>>> print clr.Convert.__doc__
object Convert(object o, Type toType)

    Attempts to convert the provided object to the specified type.
    Conversions that will be attempted include standard Python
    conversions as well as .NET implicit and explicit conversions.

    If the conversion cannot be performed a TypeError will be raised.

The implicit conversion to string is done with the code:

clr.Convert(jsonValue, clr.GetClrType(str))

I cover some of the other goodies new in IronPython 2.6 (like CompileSubclassTypes) in Dark Corners of IronPython.

I haven't yet written the code to do json encoding, but the JsonValue class (and friends) can be used for encoding as well as decoding, so I expect the code will pretty much be the opposite of the snippet shown above...

Posted by Fuzzyman on 2009-12-28 22:56:22


Django: Tear down and re-sync the database

Django includes the useful management command syncdb for creating the database tables and columns used by your application. If you add new tables (model classes) then re-running syncdb will add them for you. Unfortunately if you modify columns of existing tables, or add new columns, then syncdb isn't man enough for the job.

For modifying the schema of production systems migrations are the way to go. I played a bit with South for Django, which is pretty straightforward. For a system still in development, and changing rapidly, migrations are overkill. We have a script for populating the database with test data, which we update as the schema evolves. (In parallel with this we have a script that imports the original data from the legacy application we are replacing - again updating the script as our app is capable of handling more of the original schema.)

For development what we really want to do is tear down our development database and re-run syncdb. Running syncdb requires manual input to create a superuser, which we want to disable so that the whole process can be automated. I found various recipes online to do this, but mostly using an obsolete technique to disable superuser creation.

In the end I used a combination of this recipe to programmatically clear the databases (using the sql generated by sqlclear) and this recipe to disable superuser creation.

Note

The code also skips clearing the authentication table as we are using Django authentication unmodified. Comment out the line that does this if you aren't using Django authentication or want to clear it anyway.

#!/usr/bin/env python

import sys
import StringIO

import settings
from django.core.management import setup_environ, call_command
setup_environ(settings)

from django.db import connection
from django.db.models import get_apps, signals
from django.contrib.auth.management import create_superuser
from django.contrib.auth import models as auth_app


app_labels = [app.__name__.split('.')[-2] for app in get_apps()]
# Skip clearing the users table
app_labels.remove('auth')

# capture the sql that sqlclear prints to stdout
sys.stdout = buffer = StringIO.StringIO()
call_command('sqlclear', *app_labels)
sys.stdout = sys.__stdout__

# skip the wrapping BEGIN / COMMIT statements and trailing fragment
queries = buffer.getvalue().split(';')[1:-2]

cursor = connection.cursor()
for query in queries:
    cursor.execute(query.strip())

# Prevent interactive question about wanting a superuser created.
signals.post_syncdb.disconnect(
    create_superuser,
    sender=auth_app,
    dispatch_uid="django.contrib.auth.management.create_superuser")

call_command('syncdb')

It wasn't all plain sailing. We're using MySQL (God help us) and our development machines are all running Mac OS X. On Mac OS X MySQL identifiers, including table names, are case insensitive. Whilst I would object strongly to a case insensitive programming language, this actually makes working at the sql console slightly less annoying, so it isn't a problem in itself.

We define our data model using standard Django model classes:

from django.db import models

class NewTableName(models.Model):
    NewColumnName = models.CharField(max_length=255, db_column="OriginalSpaltennamen")

    class Meta:
        db_table = 'UrsprunglichenTabellennamen'

The Meta.db_table specifies the table name that will actually be used in the database. We use the original table and column names where possible as the end users will have to modify some existing tools to work with the new system and this minimizes the changes. As you can see both the original table and new table names are mixed case.

For some reason, which I never got to the bottom of, where the model classes have foreign key relationships syncdb will create these tables with all lowercase names. This could be Django, MySQL or the Python connector to MySQL (or any combination of these) and I never worked out why.

Unfortunately sqlclear will only generate sql to drop tables where the casing specified in the model exactly matches the casing in the database. I worked round it by changing all our Meta.db_table entries to be all lowercase. Not what you would call ideal but acceptable.

Now every time we update our database schema we can simply run this script. It drops all existing tables and then re-creates them with all the changes.

Note

Carl Meyer suggests using call_command('syncdb', interactive=False) instead of the signals.post_syncdb.disconnect code. It's certainly shorter but I haven't tried it yet.

In the comments Stavros Korokithakis points out that the reset admin command will reset individual apps and regenerate them. If you have several apps in a project this script is still simpler, but if you only need to reset one then you might as well just use ./manage.py reset <appname>. It takes the --noinput switch if you want to suppress the user prompts.

Posted by Fuzzyman on 2009-12-27 00:06:13


A Custom json Emitter for Django

emoticon:drive I blogged previously about the limited built-in support for JSON in Django (Python web framework), and how I got around it by first using the wadostuff json encoder and then modifying the one built in to django-piston.

Well, it turned out that json was still a bottleneck in our application. A little bit of background first. I'm working with Comsulting.de on a business application using Django on the server and Silverlight, programmed with IronPython, running in the browser. As the Silverlight application is delivered as a 'xap' file, the application and Django communicate via JSON. This means that we use the Django ORM, URL routing and views, but not the Django templating system.

As I can't program without tests we're testing in the browser with unittest, and to mitigate slow application startup we do parallel imports in the background (multithreaded, using a modified version of ipy-parallel-import) whilst the user is presented with the login screen. If the user logs in before the imports have completed they get a progress bar until they are done. The basic framework for all of this is pretty simple and I'll turn it into an article or two when I have time.

The application itself is almost the archetypal business application. The main view is a grid displaying information on about 900 different companies. Users can drill down into details on any company, generate reports, amend details, add contacts and so on. As the main view fetches information on 900 companies, our initial json payload for that view was 2.2 megabytes of json! Unsurprisingly encoding, sending and decoding that amount of information takes some time. Too much time.

There are various ways we improved performance in our main view; caching the generated json is one way and enabling the gzip middleware was another.
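
Enabling the gzip middleware is just a settings change; a sketch (it should sit near the top of the stack so it compresses the final response):

MIDDLEWARE_CLASSES = (
    'django.middleware.gzip.GZipMiddleware',
    # ... the rest of the middleware stack
)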

Note

Caching would possibly be a good subject for another blog entry. I ended up using a crude, custom in-database caching mechanism. The existing systems seem to use time based cache invalidation, whereas I want unlimited lifetime but precise control of when to invalidate cache entries programmatically. At some point we will probably need to investigate 'proper' caching machinery but our current system is sufficient at the moment.

By far the biggest improvement came from implementing a custom json emitter that only sends the information that we actually need. The default Django json serializer and the modified emitter based on django-piston both send a huge amount of redundant information. With a custom emitter that only sends specified information (yet can work with basic Python data-types and Django model objects) we were able to reduce this initial payload from 2.2 megabytes down to around 500 kilobytes.

The encoding and decoding are both much faster (and fast enough in particular) now. The code for our custom emitter is shown below. No it doesn't handle everything (dates and Decimal for example - and nor would it cope with recursive references), but that is the point. It is simple, it only knows about what we need it to know about and it only sends what we ask for. Just as important it is easy to extend as our requirements grow or change. The work is done by the construct method, whose job it is to turn the object tree we pass in into primitive objects only that simplejson can serialize for us.

from django.db.models import Model
from django.db.models.manager import Manager
from django.db.models.query import QuerySet

from django.core.serializers.json import DateTimeAwareJSONEncoder
from django.utils import simplejson
from django.utils.encoding import smart_unicode

NoneType = type(None)
MISSING = object()

class Emitter(object):

    def __init__(self, fields):
        self.fields = set(fields)

    def render(self, data):
        out = self.construct(data)
        return simplejson.dumps(out, cls=DateTimeAwareJSONEncoder, ensure_ascii=False)

    def construct(self, data):
        fields = self.fields
        def _any(thing):
            if isinstance(thing, (set, tuple, list, QuerySet)):
                return [_any(t) for t in thing]
            elif isinstance(thing, Model):
                # Handle django model objects
                return _model(thing)
            elif isinstance(thing, Manager):
                # For many-to-many relationships
                return [_any(t) for t in thing.values()]
            elif isinstance(thing, dict):
                return dict((_any(key), _any(value)) for key, value in
                            thing.iteritems() if key in fields)
            elif isinstance(thing, (basestring, int, float, long, NoneType)):
                # primitive types
                return smart_unicode(thing, strings_only=True)

            # ha!! (useful for debugging)
            raise TypeError('Asked to handle unknown type: %r, %s' % (thing, type(thing)))

        def _model(thing):
            ret = {}
            for member in fields:
                attr = getattr(thing, member, MISSING)
                if attr is MISSING:
                    continue
                ret[member] = _any(attr)
            return ret
        return _any(data)

Note

We use select_related in our Django queries which follows foreign key relationships automatically and improved performance by ensuring that the emitter itself rarely results in kicking off new queries.
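
A sketch of such a query (the relation names are made up):

# follow the foreign keys in the initial query so that rendering
# related objects doesn't kick off an extra query for each one
companies = Company.objects.select_related('address', 'contact')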

Using the emitter is simple: we just specify what information we want in the response:

@login_required
def some_view(request):
    ...
    fields = ('FirstName', 'FamilyName', 'eMail', 'PhoneNumber')
    emitter = Emitter(fields=fields)
    json = emitter.render(in_data)
    return HttpResponse(json, mimetype="application/json")

The style of the emitter is very much influenced by django-piston, but the code was all written from scratch so any bugs or failings are entirely my own...

Posted by Fuzzyman on 2009-12-24 17:19:28


Resolver One 1.7, BioPython and 3D Graphics

Although I've left Resolver Systems I still follow closely what they're up to. Resolver One, the IronPython powered spreadsheet with the programming model right at its heart, is an innovative and unique product [1] that deserves to flourish. Despite the horrific loss Resolver One seems to still be moving forward at great pace without me.

Major features in this new release, the last one in which I have been involved in the development [2], include:

  • Button-click handlers are now executed without blocking the rest of Resolver One. This means that if you accidentally write a handler that never finishes, or just one that takes longer than you want, you can cancel it while it's running and fix the problem.
  • Added long-awaited dialog for setting wrapping and alignment for cells.
  • External imported modules now shared between documents and RunWorkbook (better performance for RunWorkbook in particular).
  • Faster install time.
  • Improved responsiveness when dragging tabs and switching between tabs.
  • [Shift] while clicking the Recalc toolbar button now forces a reload of all imported modules (like Shift-F9 has always done).
  • A few minor bugfixes.

Giles Thomas and team have also produced screencasts of a couple of particularly interesting uses of Resolver One:

Using OpenGL and Resolver One for 3D visualization of stock prices:

This one uses Yahoo! Finance to download the close prices over the last two years for every stock that's currently in the Dow Jones index, then charts them in a 3D window which you can pan and zoom using the mouse. Here's a video showing it in action...

If you're interested in OpenGL Giles has a blog exploring WebGL, OpenGL in the browser: Learning WebGL.

Resolver One includes built-in support for Python C extensions like Numpy through the Resolver Systems sponsored open-source project Ironclad. Ironclad overcomes IronPython's inability to use compiled C-extensions. This video demonstrates the use of one such powerful library, BioPython, from within a Resolver One spreadsheet, to compare the molecular shapes of proteins. (3m12s)

[1]In fact Resolver One is a finalist in the 2009 IET Innovations Award.
[2]Possibly not true. I've done some work on the port to IronPython 2.6, which has mainly been spearheaded by Glenn Jones and will hopefully be 1.8. That's assuming that the IronPython team get 2.6 final released soon.

Posted by Fuzzyman on 2009-11-24 15:00:56

