Python Programming, news on the Voidspace Python Projects and all things techie.

Unicode and new style string formatting

Python 2.6 and Python 3 gain a new style of string formatting, which is apparently based on the string formatting in C#. I wasn't a big fan of the string formatting in C#, so I wasn't very excited about it moving into Python, but as is to be expected it has grown on me a bit.

The 'old-style' string formatting in Python is based on the % operator. In Python the % operator is the modulo operator, so strings have a __mod__ method that implements the string formatting:

>>> some_string = '%s: calls str. %r: calls repr.'
>>> some_string % ('foo', object())
'foo: calls str. <object object at 0x3284a0>: calls repr.'
>>> some_string.__mod__(('foo', object()))
'foo: calls str. <object object at 0x3284e8>: calls repr.'
>>>
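
To make the difference between %s and %r easier to see, here is a small sketch using a made-up class (Demo is hypothetical, not from any library) whose str and repr differ. It should behave the same on Python 2 and Python 3:

```python
class Demo(object):
    """Hypothetical class whose str() and repr() differ visibly."""

    def __str__(self):
        return 'from __str__'

    def __repr__(self):
        return 'from __repr__'


template = '%s: calls str. %r: calls repr.'

# The % operator and an explicit __mod__ call produce the same result.
via_operator = template % (Demo(), Demo())
via_method = template.__mod__((Demo(), Demo()))

print(via_operator)
```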

In Python 2.6 and 3, strings grow a new format method alongside the modulo operator:

>>> "The sum of 1 + 2 is {0}".format(1+2)
'The sum of 1 + 2 is 3'

In Python 2.7 and 3.1 you can use empty braces when you are formatting with a sequence of positional arguments. This makes basic string formatting operations as simple as the old equivalent:

>>> '{} {} {}'.format(1, 2, 3)
'1 2 3'
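
As a quick check of that claim, the sketch below compares automatic numbering with explicit indexes and with the old % form. It assumes an interpreter that supports empty braces; the variable names are just for illustration. Mixing the two numbering styles in one format string is rejected with a ValueError:

```python
args = (1, 2, 3)

auto_numbered = '{} {} {}'.format(*args)
explicit = '{0} {1} {2}'.format(*args)
old_style = '%s %s %s' % args

# Explicit indexes can repeat or reorder arguments, which empty braces cannot.
reordered = '{2} {0} {0}'.format(*args)

# Automatic and explicit numbering cannot be mixed in one format string.
try:
    '{} {1}'.format(*args)
    mixing_allowed = True
except ValueError:
    mixing_allowed = False
```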

The new style formatting was implemented by Eric Smith. He did a talk on it at PyCon 2010: Advanced String Formatting. Without taking anything away from what Eric has achieved, I kind of agree with Maciej Fijalkowski, who said: "I could make that talk in two minutes: Advanced string formatting, don't." :-)

There is unfortunately a problem with the implementation of the new string formatting (tracked in an open issue) that makes it unsuitable for use in some situations.

With the old string formatting the normal Python rules for coercion to Unicode are obeyed. If the format string is a byte-string but any of the format arguments are Unicode, then the byte-string will be implicitly decoded to Unicode and a Unicode string returned:

>>> 'foo %s baz' % (u'bar',)
u'foo bar baz'

With the new style formatting str.format(...) always returns a byte-string, so if any of the arguments are Unicode strings they will be implicitly encoded:

>>> 'foo {0} baz'.format(u'bar')
'foo bar baz'

In Python 2.X the encoding used for these implicit encodes / decodes is ascii, so non-ascii characters in a string can cause a UnicodeEncodeError or UnicodeDecodeError. As always, the best solution is to not mix Unicode and byte-strings but to keep all strings in Unicode and only perform the encode when actually needed.
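
The implicit conversions can be reproduced explicitly, which makes it easier to see where the exceptions come from. This sketch spells out the ascii encode and decode steps by hand, then does the encode once, at the point of output, with an explicitly chosen encoding (UTF-8 here is only an assumption for the example):

```python
pound = u'\u00a3'  # POUND SIGN, a non-ascii character

# What an implicit Unicode -> byte-string conversion attempts.
try:
    pound.encode('ascii')
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# What an implicit byte-string -> Unicode conversion attempts.
utf8_bytes = pound.encode('utf-8')
try:
    utf8_bytes.decode('ascii')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Keeping everything in Unicode and encoding only at the boundary avoids both.
message = u'foo {0} baz'.format(pound)
output_bytes = message.encode('utf-8')
```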

So why does this behaviour matter? Well, it particularly matters for framework authors formatting messages based on 'user' input. This is the case with unittest, which creates error messages when tests fail. The error messages internally in unittest are byte-strings and they are often mixed with user supplied messages using string formatting. We use old-style (% based) formatting, so if the user supplies byte-strings then the resulting messages will be byte-strings. If the user supplies Unicode strings then the resulting messages will be in Unicode. Because all the internal unittest messages are ascii only we can guarantee that an implicit decode to Unicode will succeed, so the user can choose the output type by varying the type of the messages they provide.

If we switched to new-style formatting then we would have to choose between Unicode and byte-strings for the result, and the user supplied input would have to be safe to decode or encode with ascii accordingly. So the old style formatting lets the user choose the string type of messages and puts no requirements on them, whereas with the new style formatting the burden of making sure the messages don't raise Unicode related exceptions falls on the user. The other option would be for us to check the type of the user messages and do the conversion of unittest messages internally. This would complicate the code for very little benefit.
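
For illustration only, here is roughly what such a type check could look like. The helper name format_failure and the message text are hypothetical, and this is not how unittest is written. It assumes the internal template is an ascii-only byte-string, and it is written for Python 3.5 or later, where nothing is converted implicitly and so the branch has to be spelled out:

```python
def format_failure(template, user_msg):
    """Combine an ascii-only byte-string template with a user message.

    Hypothetical helper: the result has the same string type as user_msg.
    """
    if isinstance(user_msg, bytes):
        # Byte-string in, byte-string out; the user data is never decoded.
        return template % (user_msg,)
    # Unicode in: decode our own template, which is known to be ascii.
    return template.decode('ascii') % (user_msg,)


TEMPLATE = b'Test failed: %s'

as_bytes = format_failure(TEMPLATE, b'expected 3')
as_unicode = format_failure(TEMPLATE, u'expected \u00a33')
```

Even in this small form, every place that builds a message would need to go through a helper like this, which is the extra complexity mentioned above.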

This example shows the difference in practice when you format a byte-string with Unicode arguments:

>>> value = u'\u00a3'
>>> 'foo bar %s' % (value,)
u'foo bar \xa3'
>>>
>>> 'foo bar {0}'.format(value)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)
>>>

If value from the examples above comes from a user call into an API, then using .format(...) makes that API more restrictive.

Note that % formatting is still in Python 3 and is not yet deprecated. I expect this will happen eventually.

Of course in Python 3, where all strings are Unicode, this particular problem disappears entirely. Another reason to herald the bright new day that Python 3 is ushering in...
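
A quick way to see this, assuming a Python 3 interpreter: both styles take the same non-ascii value and return the same str, with no encode step involved.

```python
value = '\u00a3'

# Both the template and the argument are str, so no coercion is needed.
old_style = 'foo bar %s' % (value,)
new_style = 'foo bar {0}'.format(value)

print(old_style == new_style)
```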

Posted by Fuzzyman on 2010-04-10 18:10:46