Text Encodings

Encodings with rest2web

Fun With Encodings

emoticon:dove rest2web handles text encodings in ways that are hopefully sensible and simple. Whenever you are dealing with text you should know and specify the encoding used [1]. If you disagree with me (or haven't a clue what I'm on about), read The Minimum Every Developer Must Know About Unicode, by Joel Spolsky. rest2web allows you to specify what encodings your content is (source files) and what encoding to write the output as. In order to achieve this, rest2web is almost entirely unicode internally.

If all of this is gobbledygook, then don't worry rest2web will guess what encoding your files are, and write them out using the same encoding. If you're not 'encoding aware', this is usually going to be what you want.

Summary

In a nutshell you use 'encoding' to specify the encoding of a contents, file 'template-encoding' to specify the encoding of your templates, and 'output-encoding' to specify the encoding the file should be saved with. 'output-encoding' used in an 'index file' sets the output encoding for all the files in that directory and below.

Input Encodings

As usual, the way we handle encodings is through the restindex.

The first thing we can specify is the encoding of a page of content. We do this with the encoding keyword. If this keyword is set for a file [2], it is used to read and decode the file. If no encoding is specified, rest2web will attempt to guess the encoding. It tries a few standard encodings and retrieves information from the locale [3] about the standard encoding for the system. If the page is successfully decoded then it stores the encoding used [4], otherwise an error is raised.

From this point on the page content is stores as unicode inside rest2web.

Whichever encoding was used to decode the page, it is available as the variable encoding inside the templates.

Template Encodings

In the restindex you can also specify the encoding of the template file. This is the value template-encoding. It is only used if you are also specifying a template file. If you specify a template file without specifying an encoding, our friend guess_encoding will be used to work out what it is.

This value is available as the variable template_encoding.

Output Encodings

When the page is rendered you can specify what encoding should be used. This is done with the output-encoding keyword. If you specify an 'output-encoding' in an 'index file' then that encoding will be used for all the files in that directory, and any sub-directories. That means you can specify it once, in your top level index file, and have it apply to the whole site.

The encoding you specify applies to all values passed to the template. This means the variables body, title, crumb, sectionlist, etc.

As well as the standard Python encodings, there are a couple of special values that output-encoding can have. These are unicode and none.

If you specify a value of none for 'output-encoding', then all string are re-encoded using the original encoding.

If you specify a value of unicode for 'output-encoding' then strings are not encoded, but are passed to the template as unicode strings. Your template will also be decoded to unicode. The output will be re-encoded using the 'template-encoding', but will be processed in the template as unicode.

The value for 'output-encoding' is available in your template as the variable output_encoding. Because of the special values of 'output-encdoing' there is an additional variable called final_encoding. final_encoding contains the actual encoding used to encode the strings. If the value of output_encoding was none, then it will contain the original encoding. If the value of output_encoding was unicode, then it will contain the None object.

Encodings and the Templating System

The templating system makes several variables available to you to use. These include indextree, thispage, and sections. These contain information about the current page - and other pages as well. The information contained in these data structures allows you to create simple or complex navigation aids for your website.

See the templating page for the details of how these data structures are constructed.

Because these values contain information taken from other pages you can't be certain of their original encoding. This isn't a problem though, rest2web will convert these appropriately to the final_encoding of the current page. (It also converts all target URLs to be relative to the current page).

The slight exception to this is part of the sections value. This value is a dictionary containing all the pages in the current directory, divided into sections. Each section has information about the section, as well as a list of pages in the section. Each page is itself a dictionary with various members containing information about the page.

One of these members is namespace, which is the namespace the page will be rendered in. This means that the values in the namespace are encoding with the appropriate encoding for that page. You can always tell what encoding that is by looking at final_encoding in the namespace. To be fair you're unlikely to dig that far into sections from the template - but I thought you ought to know Rolling Eyes

Problem

Note the following problem.

  • Unfortunately, the guess_encoding function can recognise cp1252 [5] as ISO-8859. In this case docutils can choke on some of the characters. We probably have to special case the cp1252 and ISO-8859 encodings - but I wonder if this problem applies to other encodings ?

Footnotes

[1]The bottom line being, that if you don't know what encoding is used - you don't have text, you have random binary data.
[2]It must be an encoding that Python recognises. See the Python Standard Encodings
[3]See the python Locale module documentation.
[4]The only exception is if the page successfully decodes using the ascii codec. In this case we store ISO8859-1 as the encoding. This is a more flexible ascii compatible encoding.
[5]The standard windows encoding.

Return to Top
Part of the rest2web Docs
Page last modified Sat Jun 25 19:11:57 2005.

SourceForge.net Logo Support This Project




Certified Open Source