Python Programming, news on the Voidspace Python Projects and all things techie.
The Joys of Open Source
I'm a great believer in Open Source software. I like the collaboration and the sharing. I also like the idea of people using my code. It's a buzz to feel that all over the world, people are learning or getting benefit from my efforts.
I've also gained from using other people's open source code. Python itself is a great example, of course.
Many businesses are built on top of open source software (the major architecture of Google for example is run on Linux, they are also extensive users of Python). Almost every field of computer use has some open source involvement. It's a great way that people can work together, sometimes in very large groups, to achieve far more than they ever could on their own.
Together we can share knowledge, develop new technologies (yes software techniques are a technology) and occasionally meet interesting people.
Programming is also a great way to express creativity. Paradoxically, it can be viewed as the antithesis of creativity: it operates according to a fixed set of precise rules that govern the behaviour of computers. As Charles Babbage demonstrated (at least in theory), this can be done with entirely mechanical devices. Yet it is this fixed framework that enables creative thinking. By knowing precisely what building blocks you have at your disposal, and how they will (or at least ought to) behave, the software craftsman is free to put those building blocks together in any way his (or her) imagination can conceive of. If the cathedral, formed from unyielding rock, is a work of art and beauty, then so are the elegant and intuitive creations of the great programmers.
Not all open source projects are successful though. Success, of course, is a very nebulous term; sometimes failure is easier to spot than success! If you browse the many open source repositories to be found on the world wide web, you will soon discover the seemingly countless projects there are: certainly in the millions. The overwhelming majority of these are the creations of single programmers, possibly never used even by their progenitor. A few have a thriving community based around them. A successful project may have ten or fewer active developers, but be used by thousands of people.
If you'd like to be involved in software collaboration, then you have two choices: contribute to an existing project, or release your own code. Getting other people to join a new project, where you are the only developer, is difficult. Why should people join you, when there are so many thousands of other projects out there? Finding an existing project is a better way to get mentoring and feedback, but the barrier to entry is higher: an already mature project, by virtue of being successful, is likely to be fairly sophisticated. Projects that I've considered joining, but never quite managed to contribute to in any meaningful way, include docutils, SPE, and kupu.
So if you decide to go it alone, how do you measure the success of your project? One measure is how often your code is downloaded. I get over a hundred downloads a day for my various projects. Sounds good? Well, based on my own habits, that doesn't necessarily mean very many users. Whenever I hear of some new interesting project or code I will take a look at it on the internet. A lot of these I download, intending to look at later. Most of them I never get a chance to look at, and most of the ones I do look at I never actually use. As a wild guesstimate, I'd say that only about one in twenty of the things I download do I ever actually use at all.
Even if I do use them, unless I'm very impressed it's rare for me to contact the author. I'm not alone. Despite my website getting over a thousand visitors a day and over a hundred downloads a day, I only hear from a few people in the course of a week. Maybe the code (or program) is straightforward, so they don't need to contact me, or maybe it's just no good.
I just released a new version of Movable Python. In its first incarnation this was an open source project hosted on SourceForge. It was a good idea, and I put a lot of effort into making it useful. In the first year it was downloaded over three thousand times. In all that time I had a single donation from someone who found it useful (thanks Mickey!), as well as occasional positive feedback from other users. For the new version I decided I needed more return on my investment of time and effort. Movable Python is no longer open source, but available for a very low price from The Voidspace Shop. I've had a pleasing number of buyers (growing daily), and some good feedback. An update will follow shortly, and the Python 2.2 version will also be released. This will be useful for people who want Movable Python for compatibility testing, so I'm hopeful of a fresh batch of 'donators'. The number of people paying for it, though (unsurprisingly), is vastly less than the number who downloaded it when it was free. This is despite the fact that the new version is much improved.
As another example, take Nanagram. It's a silly little program that generates anagrams from words (or names). It was one of the first programs I wrote with a GUI, but has a really neat recursive algorithm for finding the anagrams. There is also an online version. When I finally installed awstats for my website, I was surprised to find that the online version is getting hundreds of hits. The Windows installer gets downloaded about six times a day or more, and has done for over a year. Anagram hunting is a surprisingly popular activity! Whilst Nanagram may not be the fastest one available, it's easy to use and fun. Most of the other anagram programs are commercial or shareware. In the last year I've not had a single person contact me about Nanagram in any way. Hmmm...
I'm no better than anyone else though; I've only contributed financially to a single open source project in the last year. I've given feedback and encouragement to a lot more, but by no means all the ones I've used.
I think the answer with Nanagram is that I will probably leave the source code open source, but charge for the Windows executables. In the meantime I guess my code needs to improve (in quality and relevance) before I can expect more participation.
Movable Python Logo
Python on a Stick. Nice, isn't it?
I've improved the look of the Movable Python Shop and added the logo.
Work is now in progress on the Python 2.2 version, which will also mean a slight update to the other versions.
Gadgets and Games
Although my PDA has died, my life is not bereft of gadgets. I've just bought a webcam so that Delia and I can have video conversations with our nephews and nieces (and other assorted relatives). Delia can probably use it to talk to her parents in Romania. Isn't this technology stuff grand?
I've also got my new Nokia 3230 Phone. I haven't got Python running on it yet though.
Finally (for this brief post), I've bought my first game for as long as I can remember. I have very fond memories of Doom 2 played over the LAN at college. That was over ten years ago now, which makes me feel very old. I've bought Quake 4 to try and relive some of those moments. I hope it won't eat too much into the programming time.
Doom had stacks of sheer shoot-em-up mayhem that made many of its technically superior descendants seem like pale imitations. From what I hear about Quake, I may be pleasantly surprised.
I've had a PDA disaster. Over the last few days, my XDA IIi has sputtered to a halt and finally died.
This is a real nuisance as I was halfway through a good book, and it's great for doing blog entries on the way to work. Luckily it's still under warranty, so I'll be able to get it fixed. I'll be without it for a week though. sigh
Now that I have broadband at home this is less of an issue. The phone interface on it is rubbish, but I used to use it to fetch my email.
I'd completely forgotten to try a hard reset. It looks like that might have solved my problems.
ConfigObj 4.2.0 Beta 2
I've just checked an updated version of ConfigObj into the subversion repository. This is ConfigObj 4.2.0 Beta 2, and it's in the usual place:
This now has a set of tests and I'm happy with the changes. If no bugs are found, then this will become ConfigObj 4.2.0.
The way that ConfigObj handles file like objects has changed. It no longer keeps a reference to them. This is better, but could break existing code.
Additionally, the BOM attribute is now a boolean.
I haven't yet done the documentation, but these are the changes:
Full unicode support.
You can specify an encoding and a default_encoding when you create your instance.
The encoding keyword maps to the encoding attribute. It is used to decode your config file into unicode, and also to re-encode when writing.
The default_encoding (if supplied) is used to decode any byte-strings that have got into your ConfigObj instance, before writing in the specified encoding. This overrides the system default encoding that is otherwise used.
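The idea behind default_encoding can be sketched in a few lines. This is an illustration of the rule described above, not ConfigObj's actual implementation, and the helper name to_unicode is my own invention:

```python
# Illustrative sketch (not ConfigObj's real code): byte strings that
# crept into the instance are decoded with a fallback encoding before
# the whole document is re-encoded for writing.

def to_unicode(value, default_encoding='utf_8'):
    """Decode byte strings using the fallback; pass unicode through."""
    if isinstance(value, bytes):
        return value.decode(default_encoding)
    return value

assert to_unicode(b'caf\xc3\xa9') == u'caf\xe9'
assert to_unicode(u'already unicode') == u'already unicode'
```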
UTF16 encoded files are automatically detected and decoded to unicode. This is because ConfigObj can't handle them as byte strings.
The BOM attribute is now a boolean. If a UTF8 or UTF16 BOM was detected then it is True. The default is False.
If BOM is True, then a UTF8 BOM will be written out with files that have no encoding specified, or have a utf_8 encoding.
File like Objects.
ConfigObj no longer keeps a reference to file like objects you pass it. If you create a ConfigObj instance from a file like object, the filename attribute will be None.
In addition to this, the seek method of file like objects is never called by ConfigObj. (It tests for the read method when you instantiate.) You must call seek(0) yourself first, if necessary. This means you can use file like objects which don't implement seek.
Writing to a file like object.
The write method can now receive an optional file like object as an argument. This will be written to in preference to a file specified by the filename attribute.
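The duck typing described above (test for a read method, never call seek, leave filename as None) might look something like this. The class here is hypothetical, a sketch of the pattern rather than ConfigObj's own code:

```python
from io import StringIO

class ConfigReader(object):
    """Hypothetical reader illustrating the file-like object handling."""
    def __init__(self, infile):
        if hasattr(infile, 'read'):        # file-like object
            self.filename = None
            text = infile.read()           # calling seek(0) first is the caller's job
        else:                              # treat it as a filename
            self.filename = infile
            text = open(infile).read()
        self.lines = text.splitlines()

# A StringIO stands in for any file-like object
cfg = ConfigReader(StringIO('key = value\nother = 3\n'))
assert cfg.filename is None
assert cfg.lines == ['key = value', 'other = 3']
```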
When passed a config file (by whatever method), ConfigObj will attempt to determine the line endings in use. (It chooses the first line ending character it finds, whether this be \r\n, \n, or \r.)
This is preserved as the newlines attribute.
When writing (except when outputting a list of lines), this will be used as the line endings for the file.
For new ConfigObj instances (or where no line endings are found), it defaults to None. In this case the platform native line ending (os.linesep) is used.
There are also the new Section methods added in Beta 1. They all take a single key as an argument, and return the value in the specified type. They can all raise KeyError or ValueError should the situation demand it.
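The method names themselves aren't listed here, so the following is a generic sketch of the behaviour just described: a single key argument, a typed return value, and KeyError or ValueError raised as appropriate. The names as_int and as_bool are my own for illustration:

```python
# Hypothetical sketch of the Section method behaviour described above.
class Section(dict):
    def as_int(self, key):
        # KeyError if the key is missing, ValueError if not a number
        return int(self[key])

    def as_bool(self, key):
        value = self[key].lower()
        if value in ('true', 'yes', 'on', '1'):
            return True
        if value in ('false', 'no', 'off', '0'):
            return False
        raise ValueError('not a boolean: %r' % self[key])

section = Section(port='8080', debug='on')
assert section.as_int('port') == 8080
assert section.as_bool('debug') is True
```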
Another change is that ConfigObj no longer converts the filename attribute into an absolute path, unless that is what you supply it with.
Line Endings Part III
Part III (and definitely the final part) of my tangled investigation into Python line endings in files.
Quick summary of the story so far:
- I want to read files and split them into lines.
- Sometimes an encoding will be explicitly supplied.
- Sometimes no encoding will be specified, but in order to correctly handle UTF16 files I need to decode to Unicode first.
- For UTF16 files each character is two bytes. I will only be able to recognise the line endings after decoding.
There are a few different ways I could do this:
- Use the splitlines method of the unicode string.
- Open the file in universal mode "rU". Once read has encountered a line ending, it sets the newlines attribute on the file.
- Use my code snippet to determine what line ending is in use.
- Open the file and read a few bytes. If it is UTF16, re-open with the correct reader using codecs.getreader('utf_16'). (I would still have to splitlines, but the decode would already be done).
In fact option 2 doesn't work for me. UTF16 is a multi-byte encoding, so \r\n is encoded as four bytes: '\r\x00\n\x00' (in the little endian version).
Opening the file in universal mode and reading sets the newlines attribute to ('\r', '\n').
This is because it thinks that it has seen both \r and \n line endings (each separated by a null byte), rather than \r\n endings.
In option 1, splitlines actually treats all of \r, \n, and \r\n as line endings:

>>> 'one\r two \n and three\r\n really'.splitlines()
['one', ' two ', ' and three', ' really']
This means I am definitely worrying about this too much.
It does occur to me that it would be nice to preserve the line endings and use the same ones when writing. I will use splitlines(True) which preserves the line endings, and treat the first one encountered as the definitive one for the file. Sorted.
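A minimal sketch of that plan, with splitlines(True) keeping the endings and the first one encountered taken as definitive (the function name is mine):

```python
# Split text into lines, preserving endings, and report the first
# line ending encountered as the definitive one for the file.
def split_and_find_ending(text, default='\n'):
    lines = text.splitlines(True)
    for line in lines:
        if line.endswith('\r\n'):          # check the two byte ending first
            return lines, '\r\n'
        elif line.endswith('\n'):
            return lines, '\n'
        elif line.endswith('\r'):
            return lines, '\r'
    return lines, default                  # no endings found at all

lines, ending = split_and_find_ending('one\r\ntwo\nthree')
assert ending == '\r\n'
assert lines == ['one\r\n', 'two\n', 'three']
```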
(But also that my code is better.)
Ajax in Action
I've received a review copy of the book Ajax in Action, by Dave Crane.
Because of my house move (and resulting confusion), I've only managed to get part way through the book. As soon as I'm able to finish it, I'll post a review.
A couple of pieces of Movable Python news.
A user reports getting Zope 3.2 to work with Movable Python:
"Zope on a rope! Movable Python allows me to carry Zope 3.2 and my development environment everywhere I go. It's a great product, I love it." -- Kevin Smith
It looks like a German printed computer magazine, c't (with a circulation of around four hundred thousand), is going to have a brief article on Movable Python in their next issue. Great.
Detect Line Endings, Part II
Here's the method I came up with to detect which line endings are in use in a piece of text. It counts occurrences of the three line endings, and picks the one with the largest count.
As you can see from the docstring, it attempts to do sensible(-ish) things in the event of a tie, or no line endings at all.
Comments/corrections welcomed. I know the tests aren't very useful (because they make no assertions, they won't tell you if anything breaks), but you can see what's going on:
import os
import re

rn = re.compile('\r\n')
r = re.compile('\r(?!\n)')
n = re.compile('(?<!\r)\n')

# Sequence of (regex, literal, priority) for each line ending
line_ending = [(n, '\n', 3), (rn, '\r\n', 2), (r, '\r', 1)]

def find_ending(text, default=os.linesep):
    """
    Given a piece of text, use a simple heuristic to determine the line
    ending in use.

    Returns the value assigned to default if no line endings are found.
    This defaults to ``os.linesep``, the native line ending for the
    platform.

    If there is a tie between two endings, the priority chain is
    ``'\\n', '\\r\\n', '\\r'``.
    """
    results = [(len(exp.findall(text)), priority, literal) for
               exp, literal, priority in line_ending]
    if not sum([count for count, priority, literal in results]):
        # no line endings found at all
        return default
    # the most frequent ending wins; priority breaks ties
    return max(results)[2]

if __name__ == '__main__':
    tests = [
        # sample inputs exercising each ending (illustrative values)
        'hello\ngoodbye\n',
        'hello\r\ngoodbye\r\n',
        'hello\rgoodbye\r',
        '\n\r \n\r \n\r',
        'no line endings at all',
    ]
    for entry in tests:
        print('%r -> %r' % (entry, find_ending(entry)))
A useful little recipe, if you can't leave Python to handle your line separators for you.
There are two reasons for using this:
- Saving a file with the same line endings it was created with
- Splitting files into lines after decoding into unicode
Apparently opening the file in universal mode ("rU") exposes a newlines attribute. I need to check that it works with Python 2.2 and with UTF16 encoded files. If it does, it's a bit easier than the code posted here.
After a bit of investigation - it doesn't do the correct thing with UTF16 encoded files. splitlines on the decoded string does, more or less, though. See Part III...
One of the ways that compilers of static languages optimise generated code is by inlining small functions. This removes the overhead of using the stack when calling and returning from the function. In Python the cost of creating and destroying stack frames is particularly high, so the potential benefits are even greater.
This is one of those silly ideas I waste brain bandwidth on, with no way of making it happen.
Inlining functions adds the benefit of code efficiency, whilst keeping your source clean (and maintainable) through code re-use.
Static language compilers are able to do this because they know a great deal about the types of variables passed to functions, used within them, and returned from them. Because of Python's late binding, the compiler can deduce almost nothing about the types of variables until runtime. This is the barrier Brett Cannon hit when attempting to implement Localized Type Inferencing in Python: he saw virtually no speed increase.
However there are several interesting projects out there. Not least of which is PyPy. There is also an alternative implementation of the Python virtual machine called pyvm. Both of these have custom Python compilers. (The pyvm one is called pyc.)
Either of these could do bytecode optimisation by inlining functions. A new syntax, or even a decorator, could be added to specify that a function is suitable for inlining.
Function local names would have to be suitably mangled to avoid clashes, and the function would need to be non-recursive and not use nested scoping. Other limitations may also need to apply, but with a user syntax to mark functions for inlining - caveat emptor.
I wonder if any other compiler optimisation tricks could be implemented as bytecode hacks.
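The call overhead being discussed is easy to demonstrate. A rough sketch comparing a tight loop that calls a tiny function with the same loop after inlining the body by hand (the exact speedup varies by machine and interpreter):

```python
from timeit import timeit

def add(a, b):
    return a + b

def with_call():
    total = 0
    for i in range(1000):
        total = add(total, i)      # one stack frame per iteration
    return total

def hand_inlined():
    total = 0
    for i in range(1000):
        total = total + i          # the function body pasted in place
    return total

# Both compute the same result; the inlined loop avoids the frame cost
assert with_call() == hand_inlined() == 499500
print(timeit(with_call, number=1000))
print(timeit(hand_inlined, number=1000))
```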
Detecting Line Endings
My forays with unicode still leave me in a dilemma as to how to handle (or expect) line endings for windows. For 16 bit encodings, \r\n is a four byte sequence. Should I expect to read and write these for windows? I need to maintain compatibility with windoze tools that the user might use to create the text files I read. Because I'm reading and writing in binary mode, I can't expect Python to handle this for me.
So how do I detect line endings safely and sanely? There are three possible line endings. (The native one is available as os.linesep.)
- \r\n - Windoze
- \n - Lunix type systems (Unix and Linux variants)
- \r - Mac systems
Is the following safe and sane?
text = text.decode(encoding)
ending = '\n'    # default
if '\r\n' in text:
    text = text.replace('\r\n', '\n')
    ending = '\r\n'
elif '\n' in text:
    ending = '\n'
elif '\r' in text:
    text = text.replace('\r', '\n')
    ending = '\r'
My worry is that if '\n' doesn't signify a line break on the Mac, then it may exist in the body of the text and trigger ending = '\n' prematurely. (Or vice-versa with \r on Lunix?)
A suggestion on comp.lang.python is to count occurrences of '\r\n', of '\n' without a preceding '\r', and of '\r' without a following '\n', and let the majority decide. (Thanks Sybren.)
There's an edge case where you have small files of course, but what's a guy to do?
Unicode, UTF Encodings and BOM
Over the weekend I've been adding full unicode support, along with other improvements, to ConfigObj the config file reader/writer.
This entry summarises the difference between handling UTF8 and UTF16 encoded text, in Python.
Unicode isn't as difficult as it is reputed to be. Nonetheless, it can be fiddly writing code that has to handle both unicode strings and byte-strings; even worse if you potentially have a mix.
The basic principle is that when you read a file you get a byte-string. To turn this to unicode, you need to specify the encoding and decode.
To write unicode strings to file, you have to specify the encoding you want to use and encode back into a byte-string.
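That round trip looks like this (shown here with modern Python's explicit bytes/str split; in the Python 2 of the time both sides were spelled str and unicode):

```python
# bytes from the file -> decode to unicode -> work on the text ->
# encode back to bytes for writing.

raw = b'caf\xc3\xa9 au lait'        # what read() gives you from a UTF8 file
text = raw.decode('utf_8')          # byte string -> unicode
assert text == u'caf\xe9 au lait'

out = text.encode('utf_8')          # unicode -> byte string for writing
assert out == raw                   # a clean round trip
```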
To make it more interesting, there are two common encodings (plus other less common ones) that cover the whole unicode spec. These are UTF8 and UTF16. For these you may have to handle (or at least understand) the BOM.
Unsurprisingly, UTF8 is an 8 bit encoding. It represents the ASCII characters using a single byte; other characters use between two and four bytes. Because it is ASCII compatible, it is the preferred full unicode encoding for web pages.
UTF16 is a 16 bit encoding. It uses two bytes per character (four for characters outside the Basic Multilingual Plane). In order to understand text encoded with UTF16, Python needs to know whether it was produced on a big endian machine or a little endian one: the byte order.
For this reason, UTF16 strings start with a two byte BOM (Byte Order Mark). UTF8 also has an associated BOM, but as an eight bit encoding it doesn't have a byte order, so this is better referred to as the unicode signature.
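Since the BOMs are just fixed byte sequences (exposed as constants in the codecs module), detection is a startswith check. A sketch of how the encoding could be sniffed from the first bytes of a file (the sniff helper is my own name):

```python
from codecs import BOM_UTF8, BOM_UTF16_LE, BOM_UTF16_BE

def sniff(data):
    """Guess the encoding of a byte string from its BOM, if any."""
    if data.startswith(BOM_UTF8):
        return 'utf_8'
    if data.startswith(BOM_UTF16_LE):
        return 'utf_16_le'
    if data.startswith(BOM_UTF16_BE):
        return 'utf_16_be'
    return None                      # no BOM: encoding unknown

assert sniff(BOM_UTF16_LE + 'key = value'.encode('utf_16_le')) == 'utf_16_le'
assert sniff(b'plain ascii') is None
```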
For ConfigObj, you really don't want your first key starting with a BOM: it needs to be detected and removed (but preserved for writing later). You may not want UTF8 strings automatically decoded to unicode.
UTF16 strings must be recognised and decoded. The regular expressions ConfigObj uses to parse config files will split the string on byte boundaries. Because UTF16 is a multi-byte encoding, this will truly mangle your text.
For UTF8 (which I've handled before) this is straightforward: detect and remove the BOM, then decode. Later, encode to byte strings, add the BOM, then write.
So the following code works for UTF8 (simple example):

from codecs import BOM_UTF8

text = open('test.cfg', 'rb').read()
encoding = 'utf_8'
if text.startswith(BOM_UTF8):
    text = text[len(BOM_UTF8):]
text = text.decode(encoding)

# ... do something with the decoded text ...

# Next we want to write.
# Our text is now a list called 'members'
if encoding == 'utf_8':
    text = ''
    for mem in members:
        text += mem.encode(encoding)
    text = BOM_UTF8 + text
Let's simplify even further, and see what happens if we do the last step (encoding) with UTF16. To confirm that it works, we'll decode our final string back into unicode and try to re-encode as latin-1:
from codecs import BOM_UTF16

encoding = 'utf_16'
text = ''
for mem in members:
    text += mem.encode(encoding)
text = BOM_UTF16 + text
unicode_string = text.decode(encoding)
Surprisingly, we get the following result:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
Aside from the confusing fact that the error is reported from the ascii codec, what is going on here? (In Python 2, calling encode on a byte string first decodes it using the default ascii codec, which is how the 0xff of the BOM can trigger a decode error.) Let's take a look:
>>> text = u'Some text '.encode('utf16') + u'and some more'.encode('utf16')
>>> text
'\xff\xfeS\x00o\x00m\x00e\x00 \x00t\x00e\x00x\x00t\x00 \x00\xff\xfea\x00n\x00d\x00 \x00s\x00o\x00m\x00e\x00 \x00m\x00o\x00r\x00e\x00'
>>> text.decode('utf16')
u'Some text \ufeffand some more'
See all the null bytes (\x00)? This is because UTF16 is a two byte encoding, even for ascii text. The string decodes back into unicode using the UTF16 codec, but it can't be encoded as latin-1. This is because of the extra \ufeff that has somehow got into the middle of the string.
It turns out that because UTF16 needs the BOM (for an arbitrary machine to decode later), the Python codec adds the BOM automatically. When decoding, the codec will transparently remove a BOM at the start of the string, but it leaves any others it finds in place. The BOM is a valid unicode character, but not one that can be encoded using latin-1.
So for UTF16 it's more correct to detect the BOM, but not add or remove it yourself. This is different from how Python handles UTF8: for UTF8 you must remove the BOM yourself, and if you want one when you write, you have to add it yourself as well. sigh
This means you can't use the UTF16 codec to encode string fragments.
So now I need to refactor the ConfigObj write method to leave the whole file as unicode until the final write, where I must encode in one pass. (Remembering to write in binary mode if the encoding is UTF16, so that Python doesn't insert \r anywhere in my file.) Maybe it's time to investigate the StreamWriter and StreamReader objects which will handle parts of this automatically.
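For what it's worth, a quick sketch of the StreamWriter route, here obtained via codecs.getwriter and writing to an in-memory buffer that stands in for a file opened in binary mode. The utf_16 writer emits the BOM once, on the first write, and encodes every fragment in one consistent byte order, which is exactly the single-pass behaviour wanted here:

```python
import codecs
import io

buffer = io.BytesIO()                        # stands in for open(path, 'wb')
writer = codecs.getwriter('utf_16')(buffer)
writer.write(u'first line\n')                # BOM written here, once
writer.write(u'second line\n')               # no BOM on later fragments

data = buffer.getvalue()
assert data.count(codecs.BOM_UTF16) == 1     # a single BOM, at the start
assert data.decode('utf_16') == u'first line\nsecond line\n'
```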
For those of you who wonder how I have time to make such long blog entries on a Monday morning, my journey to work is now a two hour mission involving two bus rides and a half hour wait in the bus station. At least I have my PDA for company.
Hmmm... as I don't know the encoding before reading I can't use codecs.open. I probably can use it for writing.
(Ironically, there are encoding problems evident in the article. These aren't my fault; I explained to PyZine what encoding the article body was in. Sadly PyZine seems to now be defunct. Especially sad as they never published Part II of my article on writing CGI applications with Python.)
(Meaning that any unicode character can be represented using this encoding.)
Movable Python & Digital Downloads
Over the weekend I've sold nine copies of Movable Python. All for Python 2.4.2 so far.
That amounts to about forty pence per hour (less than a buck), based on a wild guess at the amount of time I've put into it.
I don't mind: not only do I expect the userbase to grow, but I'm heavily eating my own dogfood here. This blog entry is being typed at work, in Firedrop2 running under Movable Python.
A user reports having got Zope 3.2 to run under Movable Python, so it could turn up in some interesting places.
I'm using Tradebit to handle the downloads and PayPal transactions. A basic Tradebit account is only $2 a month. It was very easy to set up, and provides a logical (if not very attractive) interface for my customers. The guy who runs it is very responsive to questions and suggestions. A happy experience so far.
I still haven't got the Python 2.2 distribution ready. I've spent most of my free time (TM) this weekend wrestling with unicode for ConfigObj. This will be the subject of another blog entry.
Unfortunately, just testing the Movpy code under Python 2.2 is no guarantee that the built distribution will behave as expected. The main issue I have to sort out is adding the import paths in the right place. This happens automatically in normal Python, and differently for py2exe 0.6.3 (used to build for Python 2.3/2.4) and 0.4.1 (used to build for Python 2.2). This means I have to do an edit/compile (well, build distribution)/test cycle, all running in a VMWare session where I have Python 2.2 installed. Give me back my normal Python!
Oh, and a last word for the search engines - it's Movable Python, not Moveable Python. (Which is what a lot of people seem to google for.)
Guido Gives In
Well, it finally happened: Guido gave in. It looks like lambda is here to stay.
I've only followed the arguments vaguely, but a brief summary would go something like this:
- The lambda keyword defines an anonymous function
- It is convenient where you need a function without keeping a named reference
- But they're hard to read
- and only a minor convenience really...
Guido was in favour of dropping it altogether. This caused a huge amount of controversy, not least because various folk over on python-dev had all sorts of uses and theoretical reasons why we ought to keep them (or something similar).
So Guido has decided that he'd rather people expended their energy on something a little more productive... Maybe this is the last word, and maybe it isn't...
This work is licensed under a Creative Commons Attribution-Share Alike 2.0 License.