The following text discusses some of the "dimensions" of unicode
support. I am presenting these here to seed your thoughts, so that
you can make suggestions and help us produce a smart and balanced unicode design for REBOL 3.0.
I want to note that some aspects of unicode are quite complicated, and it is not practical for REBOL to support every possible nuance related to unicode. (Or, stated another way, I will not allow REBOL to become 10 times larger just to support all possible unicode variations.)
A lot of information has been written about unicode. For a good
place to start, browse the Unicode entry on Wikipedia. Unicode has its own special glossary of terms
(which I openly admit I do not strictly adhere to). For example,
unicode uses the term "code point" for the number that identifies a
character, which provides an abstraction layer over how the
character is actually stored.
Here is a good Unicode Tutorial that helps explain these terms.
Unicode! Datatype
A new REBOL datatype, unicode!, handles the internal storage
for unicode string series and provides the standard series
functions (make, to, find, next, last, etc.).
Literal Direct Format
REBOL will support a new syntax for unicode strings. This is the
format that appears directly within scripts and data files.
The simplest literal format is a hex (or perhaps base-64) encoded
string similar to the binary data format. The advantage is that such
a format does not cause problems when processed or transferred with
normal 8-bit character systems. The disadvantage is that it is large
and a user cannot view the actual string contents within the source
code. For example:
When loaded, both of these strings would result in a unicode! series
datatype.
It should be noted here that REBOL defines a byte-order for such
literals. REBOL uses big-endian format, so no byte-order-marker
need appear within the literal strings.
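To make the byte-order point concrete, here is a minimal sketch of the idea in Python (used purely for illustration; this is not the REBOL literal syntax, which is still to be defined, and the function names are hypothetical). It assumes 16-bit big-endian units with no byte-order-marker, as described above.

```python
def to_hex_literal(text: str) -> str:
    """Encode text as big-endian UTF-16 (no BOM) and return uppercase hex."""
    return text.encode("utf-16-be").hex().upper()


def from_hex_literal(payload: str) -> str:
    """Decode a hex payload produced by to_hex_literal back into text."""
    return bytes.fromhex(payload).decode("utf-16-be")


payload = to_hex_literal("REBOL")
print(payload)                    # 005200450042004F004C
print(from_hex_literal(payload))  # REBOL
```

Each character in this range costs four hex digits, which is the size penalty mentioned above, but the payload itself is plain 7-bit text.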
Literal Encoded Format
Another possibility would be to allow UTF-8 encoding within strings
in the source code. The advantage is that you will be able to view
the strings in the appropriate editor. The disadvantage is that the
script would contain a range of odd-looking characters when viewed
in an editor that is not UTF-8 aware.
Even if UTF-8 is not supported as a literal datatype of REBOL,
we would still support conversion to and from UTF-8 format.
I have not yet reached a decision on this issue.
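As a rough illustration of that trade-off, the following Python sketch (Python is used only as a neutral demonstration language here; the variable names are hypothetical) shows a UTF-8 string surviving a round trip, and what the same bytes look like to a tool that assumes latin-1.

```python
text = "caf\u00e9"                       # "cafe" with an acute accent on the e

utf8_bytes = text.encode("utf-8")        # b'caf\xc3\xa9': the accent takes two bytes

# A UTF-8 aware tool recovers the original string.
round_trip = utf8_bytes.decode("utf-8")

# A latin-1 tool shows one character per byte instead,
# so the two-byte accent appears as two unrelated characters.
seen_as_latin1 = utf8_bytes.decode("latin-1")

print(len(text), len(utf8_bytes), len(seen_as_latin1))  # 4 5 5
```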
Script Encoding
I am thinking about allowing support for unicoded REBOL scripts. The
load and do functions would accept scripts as ASCII and also as UTF-16.
For scripts that include a REBOL header, UTF-16 encoding could also
be detected automatically (because the header word would appear as
"0R0E0B0O0L", where 0 is a null byte).
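The detection idea could be sketched roughly as follows. This is Python for illustration only, not how load or do would be implemented; the function names are hypothetical, and it assumes big-endian UTF-16, so the null-interleaved header must start on an even byte offset.

```python
ASCII_HEADER = b"REBOL"
UTF16_BE_HEADER = "REBOL".encode("utf-16-be")  # b'\x00R\x00E\x00B\x00O\x00L'


def has_aligned(data: bytes, pattern: bytes) -> bool:
    """True if pattern occurs in data starting at an even byte offset."""
    pos = data.find(pattern)
    while pos != -1:
        if pos % 2 == 0:
            return True
        pos = data.find(pattern, pos + 1)
    return False


def guess_script_encoding(data: bytes) -> str:
    """Guess 'utf-16-be', 'ascii', or 'unknown' from the REBOL header word."""
    if has_aligned(data, UTF16_BE_HEADER):
        return "utf-16-be"
    if ASCII_HEADER in data:
        return "ascii"
    return "unknown"


script = 'REBOL [Title: "Test"]\nprint 1\n'
print(guess_script_encoding(script.encode("ascii")))      # ascii
print(guess_script_encoding(script.encode("utf-16-be")))  # utf-16-be
```

The even-offset check matters because little-endian data can contain the same byte pattern shifted by one byte.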
Supporting scripts in UTF-8 format would be more problematic,
because the REBOL header, being pure ASCII, would appear exactly the
same in UTF-8 as it does today. Also, existing
scripts that use the latin-1 encoding could cause false UTF-8
detection. More discussion is needed.
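Both problems can be seen in a small Python sketch (illustrative only; is_valid_utf8 is a hypothetical helper, and "valid UTF-8" here simply means the bytes decode without error).

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# 1. The header is pure ASCII, so it is identical in all three encodings
#    and says nothing about which one the rest of the script uses.
header = "REBOL []"
same_bytes = (
    header.encode("ascii") == header.encode("latin-1") == header.encode("utf-8")
)

# 2. Some latin-1 text happens to be valid UTF-8 (a false positive):
#    the latin-1 bytes C3 A9 are two characters, but decode as one in UTF-8.
ambiguous = b"\xc3\xa9"
false_positive = is_valid_utf8(ambiguous)

# 3. Other latin-1 text is not valid UTF-8, so detection only works sometimes.
lone_accent = "caf\u00e9".encode("latin-1")  # b'caf\xe9'
rejected = not is_valid_utf8(lone_accent)

print(same_bytes, false_positive, rejected)  # True True True
```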
Note that if we do allow unicode for scripts themselves, only
literal string! and char! datatypes will be allowed to contain the
unicoded characters. Other datatypes, such as words, will remain as
they are today. This raises some issues with regard to datatypes
like unicoded file names and email addresses, which we should
discuss in more detail.
Conversions
A few new functions will be provided to encode and decode unicode
strings into a variety of formats. For example, we will provide
functions to input and output UTF-8 and UTF-16 formats (including
byte-order-marker for endian detection, allowing UTF-16-LE and
UTF-16-BE).
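As a sketch of what endian detection involves, here is one way to do it in Python (for illustration only; decode_utf16 is a hypothetical name, and defaulting to big-endian when no marker is present is an assumption chosen to match the literal format described earlier).

```python
import codecs


def decode_utf16(data: bytes, default: str = "utf-16-be") -> str:
    """Decode UTF-16 bytes, using a leading byte-order-marker if present."""
    if data.startswith(codecs.BOM_UTF16_BE):      # bytes FE FF
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):      # bytes FF FE
        return data[2:].decode("utf-16-le")
    return data.decode(default)                   # no marker: assume the default


little = codecs.BOM_UTF16_LE + "REBOL".encode("utf-16-le")
big = codecs.BOM_UTF16_BE + "REBOL".encode("utf-16-be")
bare = "REBOL".encode("utf-16-be")

print(decode_utf16(little), decode_utf16(big), decode_utf16(bare))
# REBOL REBOL REBOL
```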
Coercions
When functions combine both unicode and string datatypes, we will
automatically provide conversion when it makes sense. For example:
insert a-unicode-string "REBOL"
will insert the "REBOL" string, converting from latin-1 to unicode
bytes.
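The latin-1 case is simple because each latin-1 byte value is also the unicode code point of the same character. A minimal Python sketch of the widening step follows (illustrative only; both function names are hypothetical stand-ins, not REBOL functions).

```python
def widen_latin1(data: bytes) -> str:
    """Map each latin-1 byte to the unicode code point with the same value."""
    return "".join(chr(byte) for byte in data)


def insert_at_head(target: str, addition: bytes) -> str:
    """Stand-in for the insert example above: widen, then insert at the head."""
    return widen_latin1(addition) + target


unicode_text = "\u65e5\u672c\u8a9e"               # three non-latin characters
result = insert_at_head(unicode_text, b"REBOL")

print(len(result))                                # 8
print(widen_latin1(b"REBOL"))                     # REBOL
```

The reverse direction is the harder one, since a unicode string may contain characters that have no latin-1 equivalent.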
Casing and Sorting
Many programmers have requested that the unicode datatype be able
to handle the upper- and lower-case conversions as we do today
with normal strings. They also want a way to sort
unicoded strings, just as we do today with other types of strings.
Also, we must allow case-insensitive searching and sorting features.
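To show where the complexity hides, here is a small Python sketch (illustrative only; contains_ignore_case is a hypothetical helper). It uses Python's built-in case folding and plain code point ordering, which is only one of several possible approaches.

```python
def contains_ignore_case(haystack: str, needle: str) -> bool:
    """Case-insensitive search using unicode case folding."""
    return needle.casefold() in haystack.casefold()


words = ["\u00e9clair", "Zebra", "apple", "\u00c9cole"]

# Case-insensitive sort: fold case first, then compare code points.
ordered = sorted(words, key=str.casefold)
# -> apple, Zebra, eclair, Ecole (the last two with an accented first letter)

# Case conversion can change the length of a string:
# the German sharp s has no single uppercase character here.
upper = "stra\u00dfe".upper()

print(upper)       # STRASSE
print(ordered[0])  # apple
```

Note that the accented words sort after "Zebra" because their first character has a higher code point than "z". Ordering that matches a particular language's expectations requires collation rules, which is exactly the kind of nuance that can make an implementation much larger.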
Ports
I think it would make sense to allow ports (e.g. network connections,
files) to operate in a number of unicode codec formats. For example,
you may want to read data directly from a UTF-16 XML file without
calling an extra conversion function.
Perhaps the best way to solve this requirement is to look at what
solutions other languages, such as Java, provide.
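As one data point, Python's standard library layers a decoding wrapper over a raw byte stream, so the caller reads text directly. The sketch below uses an in-memory stream as a stand-in for a file or network port; the XML content is made up for the example.

```python
import io

xml_text = '<?xml version="1.0" encoding="UTF-16"?><note>caf\u00e9</note>'

# Bytes as they would arrive from a file or connection, with a byte-order-marker.
raw_stream = io.BytesIO(xml_text.encode("utf-16"))

# The wrapper applies the codec as data is read; no separate conversion call.
reader = io.TextIOWrapper(raw_stream, encoding="utf-16")
decoded = reader.read()

print(decoded == xml_text)  # True
```

A port-level codec setting in REBOL could follow the same shape: the codec is chosen once when the port is opened, and reads return already-decoded series.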
Graphical Display and Input Events
In addition to being able to handle unicode as a datatype, we will
want to be able to create displays that handle unicode characters
and accept unicoded input.
For example, we've had many requests to support Chinese characters
in REBOL applications, so we need to make this possible at some
stage in REBOL 3.0 or perhaps 3.1.
Operating System Compatibility
And finally, I should mention that unicode is an important
consideration over the wide range of operating systems supported
by REBOL. Native APIs for Windows, OS X, BSD, Linux, and others
have support for unicode, and REBOL must be able to interface with
those APIs. This is true not only for making DLL calls, but for
operations as standard as file and directory access.