Ok, this article could easily become the size of a book, so let me try to keep it short and to the point.
Start here...
Let me start by saying:
That was easy to say, but the implications run deep:
To make a STRING! from a BINARY! requires decoding. Why? Because a
binary series could be UTF-8, UTF-16, Latin-1, or something else.
To make a BINARY! from a STRING! requires encoding. Why? Because
you will want the binary result to accurately represent the string;
meaning, it must conform to a standard text encoding like UTF-8.
To INSERT, APPEND, or CHANGE a string (or any other FORMed value) in a
BINARY! requires encoding. Why? For the same reason as the prior point.
BINARY! and STRING! are not equivalent. AS-BINARY and AS-STRING must go
away because they assume that you can reference the same data without
decoding or encoding. You must use TO-BINARY and TO-STRING as we did in
pre-V2.6 REBOL.
Encoders and decoders such as ENBASE, DEBASE, COMPRESS, DECOMPRESS,
ENCLOAK, and DECLOAK have new rules. If you ENBASE a STRING!, either
you are implying that it must be converted to BINARY! first, or the
function should throw an error because a string is not directly valid input.
Some functions work with only STRING! data or only BINARY! data, but
not both. For example, the LOWERCASE and UPPERCASE functions are not valid
for BINARY!, and AND, OR, and XOR are not valid for STRING!.
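For readers who think better in code, here is a minimal sketch of the first and last points. It is written in Python 3 rather than REBOL, on the assumption that Python's str/bytes split is a fair stand-in for STRING!/BINARY!. The variable names are illustrative only, and nothing here claims to show what R3 does internally.

```python
# Illustration in Python 3 (not REBOL): bytes only become text once an
# encoding is chosen, and the choice changes the result.
raw = bytes([0xC3, 0xA9])

as_utf8 = raw.decode("utf-8")        # one character, U+00E9
as_latin1 = raw.decode("latin-1")    # two characters, U+00C3 U+00A9
as_utf16 = raw.decode("utf-16-le")   # one character, U+A9C3

print(len(as_utf8), len(as_latin1), len(as_utf16))   # 1 2 1

# Bitwise operators are defined on byte values, not on characters.
flipped = bytes(b ^ 0xFF for b in raw)

try:
    "ab" ^ "cd"                      # XOR on text is rejected
    text_xor_allowed = True
except TypeError:
    text_xor_allowed = False
```

The same two bytes yield three different strings, which is why a conversion from binary to string cannot be a simple relabeling of the data.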
Wow. How's that for introducing the wonderful benefits of
Unicode? Yes, you can love to hate it or hate to love it. It just
depends on where you live. But, that's the reality of the modern world
of computing. Sorry.
Well, actually, all of this was bound to happen eventually. In the past,
we've had the luxury of being a bit sloppy in our coding practices. We
could throw binary and text around like they were different sides of the
same coin. Now we must buckle down if we want the rest of the world to
enlist in our REBOL forces. A lot of people live in Asia. A lot.
An important rule...
So, what's really truly going on here? Well, you know me, I like to
summarize down to a nice little rule:
What do I mean?
Here's a quick test of your understanding:
Q: What does this line do?
bin: to binary! "hello"
A: If you said that it converts the internal representation of the
string "hello" into a standard binary encoded representation such as
UTF-8, then you got it right. (What if you wanted it encoded into
something different like UTF-16 or Latin-1? You must specify a function
refinement for that.)
Q: What does it not do?
A: It does not give you the internal representation of the "hello" string (anymore).
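The same question can be asked in Python 3, used here only as an analogy and not as R3 code. Encoding is an explicit step, and the bytes you get depend on the encoding you name. The encoding argument below plays roughly the role of the function refinement mentioned above.

```python
# Python 3 analogy: turning text into bytes always goes through an encoding.
text = "hello"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")
latin1 = text.encode("latin-1")

print(utf8.hex())     # 68656c6c6f
print(utf16.hex())    # 680065006c006c006f00
print(latin1.hex())   # 68656c6c6f

# Once a non-ASCII character appears, UTF-8 and Latin-1 stop agreeing.
accented = "h\u00e9llo"
utf8_len = len(accented.encode("utf-8"))
latin1_len = len(accented.encode("latin-1"))
print(utf8_len, latin1_len)   # 6 5
```

For pure ASCII text, UTF-8 and Latin-1 happen to produce identical bytes, which is exactly why the difference was easy to ignore in the past.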
Q: What does this line do?
str: to string! #{68656C6C6F}
A: If you said that it converts a standard binary encoded
representation such as UTF-8 to an internally represented string, then
you got it right.
Q: What does it not do?
A: It does not consider those bytes to be the internal representation of the string. They are an encoding of it.
Note: the fact that the binary literal looks like ANSI or Latin-1 here
does not mean it is. In fact, the default is UTF-8.
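To see why the note matters, here is a small Python 3 sketch (an analogy, not REBOL). The helper name is_valid_utf8 is made up for this example. Bytes in the ASCII range decode identically either way, but a single high byte that is perfectly good Latin-1 is not valid UTF-8.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if the bytes are well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

ascii_only = bytes.fromhex("68656C6C6F")   # the literal from the example
high_byte = bytes.fromhex("68E96C6C6F")    # 0xE9 is e-acute in Latin-1

print(is_valid_utf8(ascii_only))   # True
print(is_valid_utf8(high_byte))    # False
print(high_byte.decode("latin-1") == "h\u00e9llo")   # True
```

So a decoder that defaults to UTF-8 has to either reject the second literal or be told explicitly that the data is Latin-1.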
A mistake...
A couple of years ago, I added the AS-BINARY and AS-STRING functions to
REBOL. When I did it, I knew it was treading on an important rule of
computer science: you cannot go around directly aliasing datatypes like
that, because if either of the datatypes changes representation (such as
to support Unicode) then you've got a problem. And, of course, the
problem is made much worse by the fact that different CPUs store the
data differently: in big-endian or little-endian order.
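The byte-order problem is easy to demonstrate. The sketch below is Python 3, not REBOL, and it assumes for the sake of illustration that a character is held in memory as a 16-bit unit. That assumption is mine and says nothing about how R3 actually stores strings.

```python
import struct
import sys

code_unit = ord("A")                      # 0x0041

little = struct.pack("<H", code_unit)     # bytes 41 00
big = struct.pack(">H", code_unit)        # bytes 00 41
native = struct.pack("=H", code_unit)     # whatever this machine uses

print(sys.byteorder, native.hex())
```

An alias onto the raw memory would hand back the native bytes, so the same script would see different binary data on different machines. An explicit encoding step pins the byte order down.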
To continue...
If I write:
insert bin "example"
What does that do?
Since we do not know how the string is internally represented, we must
either auto-encode the string into binary, or throw an error.
Which does R3 actually do? Currently, I assume it should encode the
string and insert it, but that's not final. Give me some feedback.
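Both options can be sketched in Python 3 (again as an analogy, not as a statement of what R3 does). Python itself takes the strict route and refuses to mix text with bytes. The insert_text helper below is hypothetical and shows what the auto-encode route could look like, with UTF-8 assumed as the default.

```python
def insert_text(buf: bytearray, index: int, text: str,
                encoding: str = "utf-8") -> None:
    """Hypothetical helper: encode text, then splice the bytes in at index."""
    buf[index:index] = text.encode(encoding)

buf = bytearray(b"\x01\x02")

# Option 1: throw an error when text meets binary (Python's own behavior).
try:
    buf += "example"
    strict_error = False
except TypeError:
    strict_error = True

# Option 2: auto-encode the string on the way in.
insert_text(buf, 0, "example")
print(buf)   # bytearray(b'example\x01\x02')
```

The strict route surprises nobody but makes simple scripts wordier. The auto-encode route is friendlier but quietly bakes in a default encoding.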
Now let me give you something a bit more complex. What happens if I
write:
data: enbase "this is an example line"
Well, ENBASE is a binary base encoder (which defaults to a BASE-64
encoding). Since it encodes binary, that implies that the string needs
to be converted to binary; therefore, the string needs to be encoded,
either automatically or explicitly. ENBASE is a "double encoder".
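Here is why the first of the two encodings cannot be skipped, shown in Python 3 with its standard base64 module standing in for ENBASE (an assumption for illustration only). The Base64 text you get depends on how the string was turned into bytes first.

```python
import base64

text = "\u00e9"   # a single non-ASCII character (e-acute)

# Encoding 1: text -> bytes.  Encoding 2: bytes -> Base64 text.
via_utf8 = base64.b64encode(text.encode("utf-8")).decode("ascii")
via_latin1 = base64.b64encode(text.encode("latin-1")).decode("ascii")

print(via_utf8, via_latin1)   # w6k= 6Q==
```

One character, two different Base64 results, so whoever decodes the data needs to know which text encoding was used underneath.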
Are you with me? Guess what? There's more! ENBASE returns a STRING!
because it is designed for inserting base encoded data into things like
email or web CGI text. So, if I write:
out: make binary! 1000
append out enbase "this is an example line"
then the STRING! output of ENBASE must be encoded into the BINARY!.
So, there's a triple encoding going on here.
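Spelled out step by step in Python 3 (a sketch by analogy, with base64 standing in for ENBASE and a bytearray standing in for the BINARY! buffer), the three encodings look like this. Python does not preallocate the buffer the way make binary! 1000 does, so that detail is omitted.

```python
import base64

line = "this is an example line"

step1 = line.encode("utf-8")                      # 1: text -> bytes
step2 = base64.b64encode(step1).decode("ascii")   # 2: bytes -> Base64 text
out = bytearray()
out += step2.encode("utf-8")                      # 3: Base64 text -> bytes

# Undo all three steps to confirm nothing was lost along the way.
round_trip = base64.b64decode(bytes(out)).decode("utf-8")
print(round_trip == line)   # True
```

Python makes you write each step by hand. The point of the R3 design discussed here is that the language could do the first and third steps for you.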
Fortunately, R3 is quite smart and efficient internally about how all of
this is done. (All of this work is the main reason you've not seen me
around much chatting online.) In theory, the above line should evaluate
about as fast as it did in R2, and perhaps even a bit faster. I have yet
to measure it.
In summary...
I should note that there are advantages and disadvantages to
these new rules. For users who just want to write scripts and not worry
about it, R3 does a lot of the hard work. However, for those who want to
fiddle around with the bits in the bytes, it may be more difficult to
make things work out. For that, we'll need to develop some smart and
well-defined methods. Yep, those of you who do will need to buckle down. It's not your grandpa's BINARY anymore.
Got comments on any of this? Please post them right away.