R3 and Unicode

There's a long document about Unicode on the R3 Wiki, but let me summarize the main points:

R3 source code is UTF-8. This is the most popular and most backward-compatible Unicode encoding, which is why we use it.
Source code that is ASCII (0-127) is also valid UTF-8. You don't need to modify such files.
Files in other encodings that use byte values 128-255 (LATIN-1 and other codepages) will not work. You need to convert them to UTF-8. We will provide examples.
Internally, a string! is Unicode. You don't need to worry about its encoding or storage format. Just use the normal series functions like insert, append, remove, etc.
The console on Win32 should work for all characters: Latin, Chinese, Greek, Cyrillic, Hiragana, etc. If you discover a problem, please mention it.
The console on non-Win32 (Linux, BSD, OS X) does not currently support Unicode. The cause is the R3 ReadLine() line editor. If you want to help fix that problem, contact me and I'll send you the source.
This is 2009 and most editors should be able to handle UTF-8. Even Notepad in XP handles UTF-8 (as well as UTF-16 LE and BE).
Data files can be binary, ASCII, UTF-8, UTF-16LE, or UTF-16BE. If the file contains a BOM, the encoding will be detected automatically when using read/string.
We still need to provide a method for reading LATIN-1 and other "codepage" encodings as data. This will be done with an /as refinement added to the read and write functions. Then, a small script can convert any codepage to UTF-8.
Internally, REBOL is smart about Unicode. It optimizes storage for strings. For example, ASCII and LATIN-1 strings take no more space than in R2.
R3 in its default configuration supports only the lower Unicode plane (the Basic Multilingual Plane, code points 0-64K). That's nearly everything you can imagine. It is possible to compile with full 32-bit Unicode support, but that is not what we want for a default.
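
To make the encoding points above concrete, here is a minimal sketch of the byte-level behavior. It is written in Python purely for illustration; it is not R3 code and says nothing about how R3 implements these steps. The function names (decode_text, latin1_to_utf8, fits_in_bmp) are hypothetical helpers invented for this example. It assumes a file without a BOM is UTF-8, and it handles only the UTF-8 and UTF-16 BOMs mentioned above.

```python
import codecs

# BOMs for the encodings listed above. UTF-32 is deliberately not handled,
# so a UTF-32LE file would be mistaken for UTF-16LE by this sketch.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_text(data: bytes) -> str:
    """Decode bytes using a leading BOM if present, else as strict UTF-8.

    Assumption for this sketch: no BOM means UTF-8. Plain ASCII passes
    unchanged because ASCII (0-127) is a subset of UTF-8.
    """
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding)
    # Raises UnicodeDecodeError for codepage files that use bytes 128-255.
    return data.decode("utf-8")


def latin1_to_utf8(data: bytes) -> bytes:
    """Convert LATIN-1 (ISO 8859-1) bytes to UTF-8 bytes."""
    return data.decode("latin-1").encode("utf-8")


def fits_in_bmp(text: str) -> bool:
    """True if every character is in the lower plane (code points 0-0xFFFF)."""
    return all(ord(ch) <= 0xFFFF for ch in text)


if __name__ == "__main__":
    latin1_bytes = b"caf\xe9"  # "café" in LATIN-1; 0xE9 is not valid UTF-8 here
    try:
        decode_text(latin1_bytes)
    except UnicodeDecodeError:
        print("LATIN-1 input rejected as UTF-8; converting first")
    print(decode_text(latin1_to_utf8(latin1_bytes)))
    print(fits_in_bmp("Ελληνικά"), fits_in_bmp("\U0001F600"))
```

The sketch shows why an ASCII file needs no changes, why a LATIN-1 file fails until it is converted, and what "lower plane only" means for a given string.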

Adding Unicode to REBOL required a major development effort. It was non-trivial and very expensive to add. Internally, we found that in many cases adding Unicode does not make code twice as complicated; it makes it 5-10 times more complicated.

However, we've isolated nearly all this complexity from REBOL programs. For the most part, programs can be just about as clean and simple as they were in R2 (which did not have Unicode). This is a significant accomplishment.
