The Voidspace Techie Blog

Python Programming, News on the Voidspace Python Projects and All Things Techie.
For my more personal blog, go to the Voidspace Blog. This also has links to the old Techie Blog, God rest its soul.

The Tart and his Sole Scion

The esteemed Mr Tartley has been up to his usual tricks. This time he has been playing with the Chipmunk physics engine and Pyglet to flesh out some prototypes for his 'summer-doss-out' game:

2D Physics type stuff from Mr Tartley

In true freetard style he's made the code available so that you can have fun too, but with a typically odd name....

Although he says he has only tested it with Linux, at Resolver Systems we've had it running on Windows. :-)

Hopefully in the next day or so I'll do a 'release' of Python in the Browser (an interactive Python interpreter running in an HTML textarea using Silverlight 2). I've fixed all the basic usability issues and added some 'preloaded examples' to illustrate how it might be used in tutorials / documentation. All the code is already available in the Google Code repository, but I need to write some brief docs. There are only around 200 lines of Python, 75 lines of Javascript and 25 lines of C# in the whole project, so customising it should be very easy.

Posted by Fuzzyman on 2008-06-18 22:52:02


Ironclad 0.3 Released (Use CPython Extensions from IronPython)

William Reade of Resolver Systems has just announced the release of Ironclad version 0.3:

Ironclad is an open source project that aims to allow you to seamlessly use CPython C extensions from IronPython. It basically provides a reimplementation of the CPython API in C#, and maps the C extension (unmanaged code) to creating IronPython objects (managed objects) rather than CPython objects.

Ironclad targets IronPython 2 and Python 2.5.

The project is still in its infancy (only part of the C API is complete), but we have implemented enough for most of the CPython bz2 module to be used. Functions, types and their methods (including the file types [1]) can be used from this module. Class members and properties cannot yet be used; they are the target of the next release (before we move on to getting Numpy working - the 'secret goal' of the project).

The major advancement in this release is that object lifetime is now handled basically correctly.

When you create a "managed object from an unmanaged type" (and we need a better way of saying that: when you use an object from a C extension basically), some 'unmanaged resources' (i.e. memory) are allocated.

Now, when the managed object is garbage collected this memory is correctly freed and the unmanaged destructor (tp_dealloc) for the type is called. This means that Ironclad doesn't leak [so much] memory and file descriptors owned by the unmanaged side are properly closed on garbage collection.

This is implemented with an interesting strategy. We basically maintain our own reference count for objects. When an object (a "managed object of unmanaged type") is created, we set its reference count to 1 (using a weak reference in the mapper so that the reference counting alone doesn't keep it alive on the managed side). Setting the reference count to 1 means that the destructor won't be called when the unmanaged side (the C extension) has finished with the object and is no longer holding any references to it.

The 'reference counting' for the unmanaged side is simply a small block of memory storing an integer, which corresponds to how many 'unmanaged references' there are to the object. An unmanaged reference is not exactly the same as a managed reference (a possible cause of terminology confusion). This means that we are using real reference counting (in the same way as CPython) on the unmanaged side.

When there are no more references to it on the managed side (the IronPython world), the finalizer is called which checks the reference count in our mapping. If the reference count is more than 1 then we know that the extension itself still has a reference to the object and so the finalizer resurrects the object by storing a strong reference in the mapper.

We then regularly check (every time a managed object is finalized - but we may find a better strategy) whether the reference count of any object we have promoted to a strong reference has dropped back down to 1. If it has, we demote the strong reference back to a weak reference so that the object can be garbage collected.

In this way Ironclad mixes the reference counting that the C extensions use with the garbage collection of the .NET framework. The only downside is that if our "managed object of unmanaged type" itself has any references to managed objects, then their finalizers may have already been called at the point at which we resurrect it. Hopefully this is a pathological enough case that it won't bite us for a long time to come...
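The bookkeeping described above can be modelled in a few lines of plain Python. This is only a rough sketch under stated assumptions: the names ToyMapper, store, finalize and demote_unreferenced are hypothetical (they are not Ironclad's API), the 'unmanaged' reference count is just a dictionary entry rather than a block of native memory, and finalize is called by hand to stand in for the .NET finalizer.

```python
import weakref


class ToyMapper(object):
    """Toy model of the weak/strong promotion scheme (not Ironclad's code)."""

    def __init__(self):
        self.refcounts = {}  # ptr -> simulated unmanaged reference count
        self.weak = {}       # ptr -> weak reference to the managed object
        self.strong = {}     # ptr -> managed object we are keeping alive

    def store(self, obj):
        # A new object starts with a count of 1 and is only weakly
        # referenced, so the mapper alone does not keep it alive.
        ptr = id(obj)
        self.refcounts[ptr] = 1
        self.weak[ptr] = weakref.ref(obj)
        return ptr

    def incref(self, ptr):
        self.refcounts[ptr] += 1

    def decref(self, ptr):
        self.refcounts[ptr] -= 1

    def finalize(self, obj):
        # Stand-in for the managed finalizer. Returns True if resurrected.
        ptr = id(obj)
        if self.refcounts[ptr] > 1:
            self.strong[ptr] = obj  # the extension still holds a reference
            return True
        del self.refcounts[ptr]
        del self.weak[ptr]
        return False

    def demote_unreferenced(self):
        # Drop strong references whose count has fallen back to 1.
        for ptr in [p for p in self.strong if self.refcounts[p] == 1]:
            del self.strong[ptr]


class Thing(object):
    pass


mapper = ToyMapper()
thing = Thing()
ptr = mapper.store(thing)
mapper.incref(ptr)                    # the C extension takes a reference
resurrected = mapper.finalize(thing)  # count is 2, so the object is promoted
mapper.decref(ptr)                    # the extension releases its reference
mapper.demote_unreferenced()          # back to a weak reference only
```

In this model the object is held strongly only between the finalize call and the demotion, which mirrors the window in which the real extension still owns a reference.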

Currently Ironclad is only tested on Windows with .NET, but it deliberately uses the gcc compiler and a (theoretically) platform independent approach so that it can be ported to Mono with 'minimal' effort. Obviously the source distribution comes with full tests, and pretty good documentation.

[1] Getting access to the 'real' file descriptor from a .NET stream is possible, but 'fun'.

Posted by Fuzzyman on 2008-05-19 13:59:47


Videos, Spreadsheet Quagmires and a Summer of Code

The Google Summer of Code projects are now up. As usual, the PSF has several projects. This year I don't find many of them particularly exciting (although I'm sure others will). Ones that did catch my eye:

OK, so maybe it's not such a bad list - nothing for docutils though.

A couple of Resolver One related videos have gone up.

Firstly, someone (Mick!) from Developer Day Ireland has put up a video interview I did at TechEd on developing with IronPython at Resolver Systems on YouTube:

Menno and Glenn have also put up a new Resolver One screencast. This one is on using named ranges in Resolver One, and how they can make your spreadsheets simpler:

(The voice over is done by Menno.) This is the first in a series of short screencasts that we hope to produce, highlighting useful features in Resolver One.

On the subject of Resolver One, Investment News have just published an article on us:

There has also been a rash of Python based build tools recently. This is good, because until now we have had Rake envy. The new kid on the block is Paver by Kevin Dangoor:

Paver is a Python-based build/distribution/deployment scripting tool along the lines of Make or Rake. What makes Paver unique is its integration with commonly used Python libraries. Common tasks that were easy before remain easy. More importantly, dealing with your applications specific needs and requirements is also easy.

Other alternatives include:

  • zc.buildout

    Doesn't require Zope, honestly. Works with eggs and very actively developed.

  • vellum

    Another new kid on the block, by Zed Shaw. Uses a 'subset' of Python for configuration scripts, so that your build scripts deliberately can't execute arbitrary code.

  • SCons

    This seems to be the 'heavyweight' of the bunch.

  • waf

    I don't know anything about this. :-)

  • memoize

    I don't know anything about this. :-)

  • doit

    Very new and I don't know anything about it. :-)

It seems that Python build tools are flavour of the month. (Build tools are the new web frameworks.) Perhaps the Ruby community's ability to coalesce (more or less) around standard tools and frameworks is a good thing after all...

And finally, as a reward for reading to the end, have you ever wished that you could raise an exception as an expression? (Inside a lambda for example, raise being a statement n'all.) Here is one truly evil (but kind of beautiful in its awfulness) suggestion using ctypes from Ironfroggy:

ctypes.pythonapi.PyErr_SetObject(*map(ctypes.py_object, (e.__class__, e)))
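For comparison, a much less evil way to get the same effect is to wrap the raise statement in a function, since a function call is an expression. This is a minimal sketch, not Ironfroggy's suggestion; the helper name raise_ and the checked lambda are made up for illustration.

```python
def raise_(exc):
    """Raise exc; as a function call this can appear inside an expression."""
    raise exc


# 'raise' is a statement, but a call to raise_ is an expression,
# so it can be used inside a lambda.
checked = lambda x: x if x >= 0 else raise_(ValueError("negative: %r" % x))

value = checked(3)
try:
    checked(-1)
except ValueError as error:
    message = str(error)
```

The cost is an extra frame in the traceback, which is the price of not poking at the interpreter's error indicator directly.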

Posted by Fuzzyman on 2008-04-23 00:50:58


Global Python Sprint Weekend in London

The next Python bug days are going to be a bit different. Instead of just a bug day we're going for 'Sprint Weekends'.

  • May 10th-11th (four days after 2.6a3 and 3.0a5 are released)
  • June 21st-22nd (~week before 2.6b2 and 3.0b2 are released)

The goal is to have the usual coordination via IRC (on #python-dev at irc.freenode.net), but also (hopefully) Python user groups can meet across the globe to sprint collaboratively. This is a great way for people to have fun hacking together and learning the non-esoteric art of contributing to core Python development. User groups who are interested in this can register by responding to this thread on the Python-Dev mailing list (that post also has all the details).

This is being coordinated by Trent Nelson, who suggests that Saturday be for groups to meet up in person, with Sunday geared more towards an online collaboration day via IRC, where we can take care of all the little things that got in our way of coding on Saturday - like finalising/preparing/reviewing patches, updating tracker and documentation, writing tests (you do test first, right Trent?).

Trent, Simon Brunning and I are organising a London Python sprint. There is also a meetup a few days before on May 6th. Venue of the sprint weekend (or the Saturday at any rate) still to be decided, but we'll keep you up to date. :-)

Posted by Fuzzyman on 2008-04-16 19:54:54


Piping Objects: Bringing a Powershell-alike Syntax to Python

Harry Pierson recently twittered about how he missed the Powershell syntax, for piping objects between commandlets, in Python. After exchanging emails with him, it turns out he actually prefers something like the F# syntax - which uses '|>' rather than '|' as a pipe. I decided to see how far I could get with Python (settling on '>>' as the operator), and I think that what I've come up with is quite nice.

It enables you to create commandlets and pipe objects between them using '>>' (the right shift operator). Creating new commandlets is as easy as writing a function.

As I don't actually have a use case for this (!), this 'proof of concept' implementation is pretty specific to my example use case - but as it is around 60 lines of Python it is very easy to customise. The syntax is nice and declarative, so creating a library of commandlets could be useful for working at the interactive interpreter, or it could be used for creating Domain Specific Languages.

Suppose you have a set of data that you want to pass through several filters that also transform the data, and then perform an action on each record. With commandlets you can do things like:

some_data >> filter1 >> filter2 >> action

The normal Python technique would be to use list comprehensions. With list comprehensions each record has to go through the filter twice, as transforming and filtering have to be done separately. An equivalent of the above using list comprehensions looks like:

intermediate = [filter1(x) for x in some_data if filter1(x) is not ignored]
[action(filter2(x)) for x in intermediate if filter2(x) is not ignored]

Rolling that exactly into a single list comprehension means one big-ass ugly list comprehension. :-)

As an example of the syntax it enables, I've implemented three simple commandlets that allow you to do:

listdir('.') >> notolderthan('2/3/08') >> prettyprint

I'm afraid that the example only works with IronPython (because the date handling is much nicer than CPython), but none of the rest of the code requires IronPython.

The first part of the chain shown above is a commandlet called listdir that returns a list of all the files in a directory (it delegates to os.listdir). Although it is lists that are piped between commandlets, the functions you write (which are wrapped in a Cmdlet class) only need to handle one argument at a time.

You create commandlets that take arguments (like notolderthan), in the same way you write decorators that take arguments - as a function that returns a function.

Here is the implementation of listdir and the notolderthan filter:

def f_listdir(path):
    def listdir():
        # see the end of this blog entry for the definition of Path
        return [Path(path, member) for member in os.listdir(path)]
    return listdir

def f_notolderthan(date):
    datetime = System.DateTime.Parse(date)
    def notolderthan(member):
        if member.mtime >= datetime:
            return member
        return ignored
    return notolderthan

listdir = Cmdlet(f_listdir)
notolderthan = Cmdlet(f_notolderthan)

ignored is a special sentinel value that allows commandlets to act as filters. Commandlets can also perform an action instead of piping objects out. prettyprint is an example of this:

def f_prettyprint(val):
    print val

prettyprint = Action(f_prettyprint)

You can also pass in a generator (or any iterable) to the start of the chain. Here is an example that uses a recursive generator, listing all the files in a directory and its subdirectories, on the left hand side of the chain:

def recursive_walk(path):
    for e in os.listdir(path):
        p = os.path.join(path, e)
        if os.path.isfile(p):
            yield Path(path, e)
        else:
            for entry in recursive_walk(p):
                yield entry

recursive_walk('.')  >> notolderthan('2/3/08') >> prettyprint

The Cmdlet class is a subclass of list, so a chain of commandlets returns a list (well - a Cmdlet) populated with the results of the call chain.

Here's the full implementation of Cmdlet and Action:

import os
import System

__version__ = '0.1.0'

__all__ = ['Action', 'Cmdlet', 'ignored', 'listdir', 'notolderthan', 'prettyprint']

ignored = object()


class Cmdlet(list):
    def __init__(self, function, _populated=False):
        self.function = function
        self._populated = _populated


    def __call__(self, *args, **keywargs):
        function = self.function
        if args or keywargs:
            function = self.function(*args, **keywargs)

        return Cmdlet(function)


    def __rshift__(self, other):
        if not self._populated:
            # TODO: the first function must return a list ?
            self[:] = self.function()

        new = Cmdlet(other.function, True)
        vals = [other.function(m) for m in self]
        new[:] = [v for v in vals if v is not ignored]
        return new


    def __rrshift__(self, other):
        # the left side is not a commandlet
        # so it must be an iterable
        new = Cmdlet(self.function, True)
        vals = [self.function(m) for m in other]
        new[:] = [v for v in vals if v is not ignored]
        return new


    def __repr__(self):
        return 'Cmdlet(%s)' % list.__repr__(self)


class Action(Cmdlet):
    def __rshift__(self, other):
        Cmdlet.__rshift__(self, other)
        return None

    def __repr__(self):
        return 'Action(%s)' % self.function.__name__

Nice. :-)

Most of the magic is in __rshift__ and __rrshift__, but I'm also fond of __call__ which allows you to create commandlets that take arguments.
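For anyone who wants to try the idea without IronPython, here is a stripped-down sketch of the same mechanism that has no .NET dependency. It is not the Cmdlet class above: the Stage class and the longerthan and upper stages are hypothetical names, and unlike Cmdlet a Stage can only appear on the right of '>>', so the chain has to start with an ordinary iterable.

```python
ignored = object()  # sentinel: a stage returns this to drop a record


class Stage(object):
    """Simplified stand-in for Cmdlet: only usable on the right of '>>'."""

    def __init__(self, function):
        self.function = function

    def __call__(self, *args, **kwargs):
        # Like a decorator that takes arguments: calling the stage
        # builds the real per-item function.
        return Stage(self.function(*args, **kwargs))

    def __rrshift__(self, iterable):
        # iterable >> stage: each item passes through the function once,
        # and anything mapped to the sentinel is filtered out.
        results = (self.function(item) for item in iterable)
        return [r for r in results if r is not ignored]


def f_longerthan(n):
    def longerthan(word):
        return word if len(word) > n else ignored
    return longerthan


longerthan = Stage(f_longerthan)
upper = Stage(str.upper)

result = ['a', 'pipe', 'of', 'objects'] >> longerthan(2) >> upper
print(result)  # ['PIPE', 'OBJECTS']
```

Because each step returns a plain list, which has no __rshift__ of its own, Python falls back to the __rrshift__ of the stage on the right, and that fallback is what makes the chain work.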

To run the examples you also need my homegrown Path class:

class Path(object):
    def __init__(self, path, entry):
        self.dir = path
        self.name = entry
        self.path = os.path.join(path, entry)
        # mtime is the last write time; ctime is the creation time
        self.mtime = System.IO.File.GetLastWriteTime(self.path)
        self.ctime = System.IO.File.GetCreationTime(self.path)

    def __repr__(self):
        start = 'File:'
        if os.path.isdir(self.path):
            start = 'Dir:'
        ctime = self.ctime
        mtime = self.mtime
        return "%s %s :ctime: %s :mtime: %s" % (start, self.path, ctime, mtime)

Posted by Fuzzyman on 2008-03-27 01:12:31


Raising Arbitrary Objects as Exceptions (a Hack!)

In Python you can't raise arbitrary objects as exceptions with the raise statement, for obvious reasons.

From this introduction you should be able to work out what this code does:

class MyException(Exception):
   def __new__(cls):
       return None

try:
   raise MyException
except MyException, e:
   print e

That's right, the except statement catches None as an exception. This obscure code path was discovered by Dino Viehland as he had to implement it for IronPython!

What happens is that CPython has an optimization for raising exception classes rather than instances. It only instantiates if you actually catch the exception object; for many except blocks you only need to know the type and the exception never need be instantiated. If you override __new__ then you can return any arbitrary object at instantiation (including an instance of another exception type if you want) - which will be caught as MyException.

It is not recommended that you do this in production code...

In a similar vein, can you guess what exception this statement raises:

raise UnicodeDecodeError

That's right, a TypeError...
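The reason is that raising a class instantiates it with no arguments, and UnicodeDecodeError insists on five. A small sketch of both the failure and a correct construction (the argument values here are invented for illustration, and the snippet is written for a current CPython rather than the Python 2.5 of this post):

```python
try:
    raise UnicodeDecodeError  # instantiated with no arguments
except TypeError as exc:
    caught = exc              # the constructor call failed, not the decode

# The constructor wants: encoding, object, start, end, reason.
error = UnicodeDecodeError('ascii', b'\xff', 0, 1, 'ordinal not in range(128)')
```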

Posted by Fuzzyman on 2008-03-21 23:30:11


Ironclad Overview

Ironclad is an MIT licensed Open Source project by Resolver Systems. The goal of Ironclad is to allow you to seamlessly import C extension modules (for CPython) in IronPython.

The short version of how it works is that Ironclad creates a fake Python dll, with the C-API pointing to functions mainly implemented in C#. These create and manipulate IronPython objects.

Resolver Systems has had one pair programming on the project (led by one of our core developers, William Reade) for several months. The current status of the project is that, for simple modules, module initialisation works. In the 0.1 binary release you can call functions. The version in the SVN repository allows you to use classes defined in extensions (so long as they only call the parts of the C-API that we have implemented). The file type kind-of works - you can open and close files but not yet read from or write to them. :-)

Terminology

Code that runs on the .NET VM is called 'managed code'. It is managed by the virtual machine which provides garbage collection, security features and type verification etc. You can still call into code written in C (native code), but although this runs in the same process space it isn't running inside the .NET VM - so it is called 'unmanaged code'.

Here is a longer (but still brief) overview of Ironclad that I presented at the Python and .NET open space at PyCon.

We generate a stub-dll that looks superficially like Python25.dll. Prior to importing a CPython extension we load this dll into our process space and initialise it by passing in a pair of function pointers (that point to managed code). The dll initialisation function calls these repeatedly passing in symbol names (CPython's exported symbols - the CPython API). This creates the CPython API but with these functions pointing to managed code (delegates) that actually provides the implementation (in C#).

You can now load C extension modules ('.pyd') by creating a 'PydImporter' and calling its load method with the path to the binary on the filesystem. This loads the binary into the process space and calls the module initialisation function (e.g. initbz2 or initmultiarray). This initialisation function will call the C-API that we have already initialised.

Generally this will start off by calling 'Py_InitModule4'. One of the parameters to this function is a pointer to an array of PyMethodDef structs which defines the functions exported by the module we are loading. We generate IronPython code corresponding to these functions and execute it in a new module - to expose these functions to IronPython. These functions call delegates which point to the corresponding unmanaged code (the module!). We do a similar thing for the classes, which are added by calls to 'PyModule_AddObject'.

Almost all of the hard work is done by one monster C# class that actually implements the Python C API (or the parts of it that we have got to) - the 'Python25Mapper' (which inherits from 'PythonMapper' which is autogenerated).

When we call a function from IronPython - we call the mapper's 'Store' method on both the args tuple and the kwargs dictionary - this creates pointers representing the original items, which are then passed into the delegate. Marshally magic happens and the unmanaged function is called with pointers to args (tuple) and kwargs (dict).

(Inevitably the first thing that a function does is call PyArg_ParseTuple[AndKeywords]. This was so complicated that we actually just lifted the CPython implementation rather than re-implementing in C#.)

Eventually it will return something. This something will usually have been created through the C-API (so we have control over what was created) - so the PythonMapper has a reference to it, mapping a pointer to a managed object. When this pointer is returned (from the delegate) we check for errors set on the PythonMapper, raising the exception if necessary. If not we can 'Retrieve' the managed object and return it from the function.

An interesting point is that we have to handle reference counting. When an object is stored for the first time, the mapper allocates some memory following the layout of the normal 'PyObject' and sets the refcount to 1. Subsequent calls to 'Store' for the same object will increment the ref-counter and return the same pointer. When the refcount drops to 0 (as a result of managed or unmanaged code) this memory can be deallocated because we know that the unmanaged code has no references to it.

Strings and tuples need to be handled slightly specially, because C extensions need direct access to the memory of some of their items. We do this by converting these objects on the way in (and on the way out if necessary). (Lists will need to be handled very specially at some point - but not yet!!)

In a call to a module function, once we have retrieved the result from the result pointer it is safe for us to DecRef the args and kwargs and result pointers.

Types are complicated, but fundamentally the same!

BZ2 is our test case so far. compress and decompress work. The BZ2Compressor and BZ2Decompressor types both work. BZ2File (passing file stream descriptors from managed to unmanaged code) is proving fun... (You can open and close the files, just not read or write anything yet.) Numpy next...

We haven't done much with the GIL yet!

The stub-dll actually uses assembly language to write a table of function pointers. Most of these functions are not implemented in C (except a few like PyArg_ParseTuple[AndKeywords]) but in C#. The assembly is a chunk of code with a series of labels (the function names), each followed by a jump instruction (using a pointer to a managed delegate which calls into our C#). We do this in assembly language because gcc won't let us define C functions without a calling convention (it won't let us use __declspec(naked)).

So, lots of magic and lots of fun difficulties, but it is working so far! The binaries are only suitable for Windows and we haven't yet tried this project with Mono. William deliberately picked gcc as the compiler so that it could be ported to Mono though. There is a longer, more detailed, version of this overview in the code repository.

The business value for Resolver Systems in this project is that our customers will be able to use Scipy in Resolver One, our IronPython spreadsheet.

There are two interesting potential 're-uses' of Ironclad.

  • Some of the PyPy folk are interested in looking to see how much of the project they could reuse to allow you to use C-Extensions from PyPy.
  • By rewriting parts of the top layer (or even just embedding IronPython) it could allow you to use Python C-extensions from any .NET language.

Posted by Fuzzyman on 2008-03-21 20:03:07


Silverlight 2 Articles and Interpreter in the Browser Coming Soon

Sorry, not the shortest title in the world. I've just arrived home from PyCon with a world class case of jetlag and a blogging backlog.

I've nearly completed turning my talk on Silverlight 2 into a series of articles. I aim to post them later today.

On the day of my talk, Dino Viehland (IronPython developer) gave me some updated binaries allowing me to demo a prototype 'Interactive Interpreter in the Browser'. It wasn't much work to turn the prototype into an interactive interpreter running in an HTML textarea (better looking than the one I demoed):

Interactive Interpreter in the Browser

So far it runs on Safari (and probably Firefox but currently Firefox won't connect to localhost for me to test it), but the Javascript probably needs some attention for it to work on IE.

It behaves like the Python interactive interpreter - so you can only type (and delete) on the last line after the prompt. This magic is done with a Javascript 'onkeydown' handler that calls into Silverlight with the current cursor position. It cancels text edits except on the last line after the prompt. It also detects newlines (that IE sends as '\r' would you believe) and executes the current code in the interpreter (using the Python standard library code module).
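The part played by the standard library code module can be shown without any browser at all. This is a sketch of the general technique, not the code from the Silverlight project: InteractiveConsole.push takes one line at a time and reports whether the statement is still incomplete, which is exactly what a front end needs in order to choose between a continuation prompt and a fresh one. The double function is just sample input.

```python
import code

console = code.InteractiveConsole(locals={})

# push() returns True while more input is needed to finish the statement.
needs_more = [console.push(line) for line in (
    "def double(x):",
    "    return x * 2",
    "",                      # a blank line ends the indented block
    "result = double(21)",
)]

result = console.locals["result"]
```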

It doesn't detect Ctrl-C and would need to run inside a thread to handle it anyway. I can't release this until the bugfixed IronPython binaries are released on dynamicsilverlight.net, but Dino is working on it. :-)

This will be a great way of embedding a Python interpreter (that runs in the browser) into documentation and tutorials.

Posted by Fuzzyman on 2008-03-21 15:30:44


Silly Snippets

Here are some fun snippets of Python, which may or may not do what you expect.

>>> isinstance(type, object)
True
>>> isinstance(object, type)
True

This really just illustrates Python's object orientation. Everything in Python is an object (an instance of object); this includes type, since types are first class objects. object itself is a type - so it is an instance of type.
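The circular relationship is easy to poke at directly. A few checks that spell it out (written for new-style classes, so they hold on current CPython as well):

```python
# object is a type, and type is an object, so both isinstance checks succeed.
assert type(object) is type
assert issubclass(type, object)
assert type(type) is type  # type is even an instance of itself
assert isinstance(type, object) and isinstance(object, type)

mro_of_type = type.__mro__  # type inherits directly from object
```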

>>> 3 is not False
True
>>> 3 is (not False)
False

Just illustrating that sometimes all is not as it appears with identity checks. (I assume that is and not are both tokens to the lexer but that is and is not are actually different operators to the parser.)
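The assumption about the parser can be checked with the ast module, which shows the operator the parser actually produced. This is a quick sketch for a current CPython (the ast module in its present form arrived after this post was written); comparison_ops is a made-up helper name.

```python
import ast


def comparison_ops(source):
    """Return the names of the comparison operators the parser produced."""
    node = ast.parse(source, mode='eval').body
    return [type(op).__name__ for op in node.ops]


chained = comparison_ops('3 is not False')  # one 'is not' operator
grouped = ast.parse('3 is (not False)', mode='eval').body
```

The first expression parses as a single IsNot comparison, while the second is an Is comparison whose right hand side is the unary expression 'not False'.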

class S:
    def __del__(self):
        print e.args

e = BaseException(1, S())
e.__init__("hello")   # segfault

This is an interesting one posted to Python-dev recently and actually reveals a subtle bug in the way Python uses reference counting to do garbage collection. It is pretty pathological though. When the 'decref' happens the finalizer (__del__) is called - which can sometimes find a route back to the now invalid object. There is a really interesting post on Python garbage collection over on the PyPy blog: PyPy Development: Python Finalizers Semantics, Part 1.

>>> class T(type):
...  def mro(self):
...   return []
...
>>> class C: __metaclass__ = T
...
>>> class D(object):
...  pass
...
>>> d = D()
>>> d.__class__ = C
>>> isinstance(d, object)
False

This is silliness from Michael Hudson. He says that the primary use case is for looking geeky on IRC. Actually lying to isinstance can be useful when creating proxy classes (but you can't lie to checks that use type(something)).
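Here is a rough sketch of the proxy use case. It relies on isinstance consulting the object's __class__ attribute when the real type doesn't match; the ListProxy class is a hypothetical example, not code from any library.

```python
class ListProxy(object):
    """Hypothetical proxy that reports its target's class via __class__."""

    def __init__(self, target):
        self._target = target

    def _get_class(self):
        return type(self._target)

    # isinstance falls back to __class__ when type(obj) doesn't match.
    __class__ = property(_get_class)

    def __getattr__(self, name):
        # Forward everything else to the wrapped object.
        return getattr(self._target, name)


proxy = ListProxy([1, 2, 3])
fooled = isinstance(proxy, list)  # True: isinstance consults __class__
honest = type(proxy) is list      # False: type() cannot be lied to
```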

Back to Mac stuff. SCPlugin is nothing like as good as TortoiseSVN (yet - it is a younger project), but dropping down to the command line is kind of reassuring. I've discovered that Parallels can be made to work with multiple monitors (it basically fakes a single display the combined width of your Mac desktop - but it works fine). Looking for a straightforward IRC client for the Mac - so far Colloquy seems fine other than its unhelpful error reporting. The most useful little application I've found is OpenTerminalHere. It puts an icon on finder windows that opens a terminal window with the current directory set to the directory you are viewing (kind of like 'command window here' on Windows).

Posted by Fuzzyman on 2008-02-18 20:34:52


Resolverforge: Download Modules on Demand for IronPython and Resolver One

I've implemented a client-side module that allows you to specify what modules your code 'requires'. After user-confirmation, a required module that isn't available will be downloaded and placed on the import path. This is intended for use with Resolver One, but should work with any IronPython code where Windows Forms is available (just the message box is used currently).

Modules are downloaded from whichever repository you specify, which defaults to www.resolverhacks.net/resolverforge.

Requiring the "helloworld" Module

The code to use Resolverforge looks like:

from Resolverforge import require
require('helloworld')
import helloworld

helloworld.sayhello()

You can specify your own repository, so that you can make modules available on the internet or an intranet instead of having to keep dependencies with your spreadsheets. This is an early implementation (which works of course!). Resolverforge downloads modules that are individual Python files and doesn't support versioning. It will also one day gain a website counterpart that will allow users to create projects and make them available for download. For the moment you will have to make do with the modules I've put up, which are mainly aimed at Resolver One.

Posted by Fuzzyman on 2008-02-11 00:28:36


Garbage Collection, Strings and Newlines

Today I was working with Christian on an optimisation and memory use story. Those are usually fun as they mean poking around inside the guts of Resolver and thinking about data-structures and algorithms. One of the things we looked at was an out of memory error with a very large spreadsheet. The cause was this innocent enough looking code to normalise newlines in some generated code:

text = text.replace('\r\n', '\n').replace('\n', '\r\n')

Obviously this code ensures that all newlines in the text use '\r\n' rather than '\n' (as is required for properly displaying in the Windows code editor control that is part of Resolver One). It is called on all the code sections that make up a Resolver spreadsheet. This includes the constants and formatting section, which in a large spreadsheet can be substantial.

.NET (and therefore IronPython) uses garbage collection rather than reference counting. Strings are immutable, so every step in the double replace above creates a new string - meaning that this line temporarily triples the memory usage of the string. I don't believe this would happen in CPython, because reference counting would cause the intermediate strings to be freed immediately within the operation. On the .NET framework they aren't freed until garbage collection runs, probably when the current frame of execution exits [1]. This was the cause of out of memory errors in large spreadsheets.

We replaced it with the following code using regular expressions. It uses the regex r'(^|[^\r])\n' to match 'lonely newlines'. Unfortunately re.sub won't find overlapping matches, which means it will miss multiple newlines next to each other. To get round this we pass a replacement function to re.sub instead of a replacement string:

import re
LONELY_NEWLINES = re.compile(r'(^|[^\r])\n')

def NormaliseText(text):
    # Replacement is slightly complicated by the fact that re.sub only finds
    # *non-overlapping* matches. To solve this, we add an extra \r if there's
    # another lonely \n directly after this one.
    def ReplaceLonelyNewlines(match):
        firstChar = match.group(1)
        if match.end() < len(text) and text[match.end()] == "\n":
            return firstChar + "\r\n\r"
        return firstChar + "\r\n"

    if text.count('\n') != text.count('\r\n'):
        text = LONELY_NEWLINES.sub(ReplaceLonelyNewlines, text)
    return text

It's an interesting tale, but it seems like a complex solution to a simple problem. It is also slightly slower than a simple double replace, but we can get a performance win by only invoking it if we know that there are lonely newlines in the text.

If you have an idea for a better way, then let me know. :-)

Update

Orestis (who will be joining us at Resolver in about two weeks) provided a much nicer solution using 'lookbehind' regular expression syntax.

>>> r = '(?<!\r)\n'
>>> re.sub(r, '\r\n', input_string)
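Wrapped up as a function, the lookbehind version is easy to check against the awkward case of consecutive newlines. A minimal sketch; the names LONELY_NEWLINE and normalise_newlines and the sample text are made up for illustration.

```python
import re

# A '\n' that is not immediately preceded by '\r'. The lookbehind consumes
# no characters, so consecutive lonely newlines are all matched.
LONELY_NEWLINE = re.compile(r'(?<!\r)\n')


def normalise_newlines(text):
    return LONELY_NEWLINE.sub('\r\n', text)


sample = 'a\nb\r\nc\n\nd'
normalised = normalise_newlines(sample)
```

Existing '\r\n' pairs are left alone, so running the function twice gives the same result as running it once.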

Dino has also confirmed that .NET garbage collection can run at any time (including in the middle of operations), but in our case it clearly wasn't.

[1] Before you get too excited about having another reason to use CPython rather than IronPython, consider the following fact that I wasn't aware of until recently. The disadvantage of reference counting is that cycles are hard to free. Python has a cycle detector, but it is unable to free cycles where more than one of the objects implements __del__ - since the cycle detector doesn't know what order to call the destructors in to break the cycle. This means that you shouldn't implement __del__ unless you can guarantee that your objects won't be involved in cycles (or you are happy to leak memory).

Posted by Fuzzyman on 2008-01-28 22:00:37