Text code-page mapping
I recognize that many developers need to read and write older text strings that are not encoded as Unicode UTF-8.
R3.A100 will provide a codec for "arbitrary" text mapping. The default map will be UTF-8, but we'll also support LATIN1 (the first 256 code points of Unicode, also known as ISO-8859-1). If you need some other mapping such as Windows-1252 or ISO-8859-15 you can specify those by providing your own maps (which are simple to create).
In order to specify 8-bit code-page maps, we will need to allow additional inputs to codecs. My current design direction is to allow the codec media argument to be either a word or a block. For example:
text: decode 'text bin ; default conversion as UTF-8
text: decode [text latin1] bin
text: decode [text 16] bin ; UTF-16 BE
text: decode [text -16] bin ; UTF-16 LE
text: decode reduce ['text char-map] bin
Here, char-map refers to a string that maps each byte value to a character (Unicode code point). The map is simple to set up, but note that each byte is a zero-based index into the string: byte 0 selects the first character.
For example, latin1 is created this way:
char-map: make string! 256
repeat n 256 [append char-map to-char n - 1]
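To make the lookup concrete, here is a minimal sketch of the byte-to-char mapping, written in Python rather than REBOL so it can be run independently of A100. The names make_latin1_map and decode_with_map are hypothetical and not part of the codec. The sketch assumes only that the map is a 256-character string and that each byte selects the character at that zero-based offset.

```python
def make_latin1_map():
    # Same construction as the REBOL loop above: entry n holds code point n.
    return "".join(chr(n) for n in range(256))

def decode_with_map(data, char_map):
    # Each byte is a zero-based index into the 256-character map.
    if len(char_map) != 256:
        raise ValueError("char map must contain exactly 256 characters")
    return "".join(char_map[b] for b in data)

latin1 = make_latin1_map()
text = decode_with_map(b"caf\xe9", latin1)  # byte 0xE9 selects U+00E9
```

Because the Latin-1 map is the identity on the first 256 code points, decoding through it should agree with any standard Latin-1 decoder.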
In some cases you might load the map from a text file (which is itself encoded in UTF-8):
char-map: read/string %iso-8859-15
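As an illustration of what such a map contains, ISO-8859-15 differs from Latin-1 at only eight byte values, so its map can be derived from the Latin-1 one. The following is a sketch in Python, not REBOL. The function name make_iso_8859_15_map is hypothetical, and the last line assumes that a map file holds just the 256 characters in order, which the post does not spell out.

```python
# Byte values where ISO-8859-15 departs from ISO-8859-1,
# paired with the code point used instead.
ISO_8859_15_OVERRIDES = {
    0xA4: 0x20AC,  # euro sign
    0xA6: 0x0160,  # S with caron
    0xA8: 0x0161,  # s with caron
    0xB4: 0x017D,  # Z with caron
    0xB8: 0x017E,  # z with caron
    0xBC: 0x0152,  # OE ligature
    0xBD: 0x0153,  # oe ligature
    0xBE: 0x0178,  # Y with diaeresis
}

def make_iso_8859_15_map():
    # Start from the Latin-1 identity map, then patch the eight differences.
    chars = [chr(n) for n in range(256)]
    for byte, code_point in ISO_8859_15_OVERRIDES.items():
        chars[byte] = chr(code_point)
    return "".join(chars)

char_map = make_iso_8859_15_map()
# Assumption: a map file is simply these 256 characters encoded as UTF-8.
map_file_bytes = char_map.encode("utf-8")
```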
The alternative would be to add a refinement to decode to specify the map, but I like that less because it splits up the "spec" of the decoding. Thoughts on that?
The encoding method would reflect the same approach:
bin: encode 'text text ; default conversion as UTF-8
bin: encode [text latin1] text
bin: encode [text 16] text ; UTF-16 BE
bin: encode [text -16] text ; UTF-16 LE
bin: encode reduce ['text char-map] text
Note that char-map can be either a string or a binary here. The same string used with decode will work, but an internal binary map must be temporarily allocated for the conversion. If you're converting a lot of strings, make it a binary so that chars are mapped directly during the conversion.
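The cost difference can be sketched as follows, in Python rather than REBOL. The layout of the binary map in A100 is not specified here, so a dictionary stands in for it, and build_encode_table and encode_with_table are hypothetical names. The point is only that inverting the decode string is a separate step, which is worth doing once when many strings are converted. Raising an error for unmappable characters is also an assumption.

```python
def build_encode_table(char_map):
    # Invert the decode map: character -> byte value.
    return {ch: byte for byte, ch in enumerate(char_map)}

def encode_with_table(text, table):
    # Direct lookup per character; a KeyError signals a character
    # that the code page cannot represent.
    return bytes(table[ch] for ch in text)

latin1 = "".join(chr(n) for n in range(256))
table = build_encode_table(latin1)  # built once, reused for every string
encoded = [encode_with_table(s, table) for s in ("caf\u00e9", "na\u00efve")]
```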
Since this change will be part of A100, please post your comments soon.