Oliver Sturm's Blog - Multi-Codepage Zope and Structured Text

I was recently trying to find a solution for a problem with code pages in Zope, in conjunction with Structured Text (STX). The basic problem, which is discussed all over the web, is this: the correct code page has to be configured in the running Zope instance to make STX work correctly with characters outside US-ASCII. If this is not done, special formatting breaks, because STX can’t find the borders of words correctly.

Example: *snídaně* should really appear as an emphasized snídaně and **Frühstück** as a strong Frühstück, but instead they appear with their markers left in, as the literal *snídaně* and **Frühstück**, respectively. Or one of them works but not the other, because these two words need different code pages: snídaně needs ISO 8859-2 and Frühstück needs ISO 8859-1. The problem is that STX depends on the code page configured for the running Zope instance. As only one code page can be configured at a time, this becomes a problem when some pages need one code page and others need another. Two solutions seemed useful, but didn’t prove to really work in the end:

Not using STX. There are other variants of the Structured Text concept, which might not have the same problems. I didn’t really check on this because it wasn’t an option in my situation; too much text had already been written in STX format over the course of several years and it had been hard enough to have users understand that concept the first time.
Using Unicode, just like I’m doing in this page, to be able to mix characters from different code pages. I actually tried going this way, but it didn’t work. While the Unicode encoding would allow the characters to be shown correctly by the end users’ browsers, the STX implementation still had the same problems as before when parsing strings that used formatting.

In the end, my workaround was simple: I changed the regular expressions in the STX implementation to work with whatever characters there might be between the start and end markers. The original expressions went out of their way to make sure there would only be specific characters, and that was the core of the problem because that character set was the one that would be defined depending on the code page. For example, the line where the strong marker was searched for looked like this:

expr = re.compile(r'\*\*([%s%s%s\s]+?)\*\*' % (letters, digits, strongem_punc)).search

I used this expression instead:

expr = re.compile(r'\*\*([^\*]+?)\*\*').search

I made similar changes in the places where the em and ul markers are checked. Of course, my changes allow a wider range of characters between the markers than would originally be allowed, which may have its own problems. But the workaround seems to work fine for me, as long as no real fix is available (and there probably won’t be, because I don’t think anyone’s really still working on the old STX implementations). A patch file with all the changes I made to DocumentClass.py is available for download.
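
To see the difference between the two expressions in isolation, here is a minimal sketch in Python 3. It is not the code from DocumentClass.py: the values of letters, digits and strongem_punc are stand-ins I am assuming for illustration (plain ASCII letters simulate an instance whose configured code page does not cover the text), and strong_text is a hypothetical helper.

```python
import re
import string

# Stand-ins for the names used in the original expression (assumptions for
# this sketch). In the real module the letter set depended on the configured
# code page; plain ASCII letters simulate a mismatching configuration.
letters = string.ascii_letters
digits = string.digits
strongem_punc = r'.,;:!?()\-'  # hypothetical, simplified punctuation set

# Original style: only whitelisted characters may appear between the markers.
strict = re.compile(
    r'\*\*([%s%s%s\s]+?)\*\*' % (letters, digits, strongem_punc)
).search

# Workaround: anything except an asterisk may appear between the markers.
relaxed = re.compile(r'\*\*([^\*]+?)\*\*').search


def strong_text(search, s):
    """Return the text found between ** markers, or None if nothing matched."""
    m = search(s)
    return m.group(1) if m else None


for word in ("**Breakfast**", "**Frühstück**", "**snídaně**"):
    print(word, strong_text(strict, word), strong_text(relaxed, word))
```

With these stand-in values, the whitelist expression only recognizes the plain ASCII word, while the relaxed one recognizes all three. The price is the one mentioned above: the relaxed expression accepts any character except an asterisk between the markers, including ones the original deliberately excluded.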