UTF-16 auto-detect for READ/string
R3 A92 adds UTF-16 detection and decoding to READ/string on files. This is mainly an ease-of-use enhancement. You can read Unicode files with minimal code.
For example, now you can write:
doc: read/string %that-unicode.doc ; a 16 bit Unicode file
and then process it as a normal REBOL text string.
When READ/string reads a whole file and the file begins with a Unicode byte order mark (BOM), the BOM determines the encoding used to decode the file text.
Currently, these are supported:
- UTF-8
- UTF-16BE (big endian)
- UTF-16LE (little endian)
If no BOM is found, then UTF-8 (hence also ASCII) is assumed.
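The detection rule above can be sketched outside of REBOL. The following Python snippet is only an illustration of the logic, not R3's actual implementation: the helper name decode_with_bom is hypothetical, and stripping the BOM from the result is an assumption. Note also that Python's UTF-16 decoder accepts surrogate pairs, which READ/string currently does not.

```python
import codecs

# Hypothetical helper for illustration; not part of REBOL or R3.
def decode_with_bom(data: bytes) -> str:
    """Decode bytes using a leading BOM, falling back to UTF-8."""
    boms = [
        (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
        (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
        (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    ]
    for bom, encoding in boms:
        if data.startswith(bom):
            # Assumption: drop the BOM so it does not appear in the text.
            return data[len(bom):].decode(encoding)
    # No BOM found: assume UTF-8, which also covers plain ASCII.
    return data.decode("utf-8")
```

For instance, decode_with_bom(b"\xff\xfeh\x00i\x00") takes the UTF-16LE branch, while a plain ASCII input with no BOM falls through to the UTF-8 default.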
Take note that surrogate pairs (code points beyond the 16-bit basic multilingual plane) are not currently supported. Hopefully, not many of you require those at this time.
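To make the limitation concrete, here is a small Python sketch (again an illustration, with a hypothetical function name) of how UTF-16 represents a code point. Anything within the basic multilingual plane fits in one 16-bit unit; anything beyond it is split into a high and a low surrogate, and it is that second case a decoder without surrogate support cannot handle.

```python
# Hypothetical helper for illustration; not part of REBOL or R3.
def utf16_code_units(ch):
    """Return the 16-bit UTF-16 code units for a single character."""
    cp = ord(ch)
    if cp <= 0xFFFF:
        # Inside the basic multilingual plane: one code unit.
        return [cp]
    # Beyond the BMP: split the 20-bit offset into a surrogate pair.
    cp -= 0x10000
    high = 0xD800 + (cp >> 10)    # top 10 bits
    low = 0xDC00 + (cp & 0x3FF)   # bottom 10 bits
    return [high, low]
```

For example, "A" (U+0041) yields the single unit 0x0041, whereas U+1F600 yields the pair 0xD83D, 0xDE00.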
We will need to add an /as refinement to let you specify an encoding when no BOM is present. This also gives us a way to read the common 8-bit Latin-1 encoding (as used in R2).
Similarly, WRITE will need an /as refinement to select the output encoding. Currently, WRITE only outputs UTF-8 (and therefore ASCII) for strings.