It's not your grandpa's BINARY anymore

Ok, this article could easily become the size of a book, so let me try to keep it short and to the point.

Start here...

Let me start by saying:

In R3, a BINARY! is not a STRING!, and a STRING! is not a BINARY!

That was easy to say, but the implications run deep:

To make a STRING! from a BINARY! requires decoding. Why? Because a binary series could be UTF-8, UTF-16, Latin-1, or something else.
To make a BINARY! from a STRING! requires encoding. Why? Because you will want the binary result to accurately represent the string; meaning, it must conform to a standard text encoding like UTF-8.
To INSERT, APPEND, or CHANGE a string (or any other FORMed value) in a BINARY! requires encoding. Why? For the same reason as the prior point.
BINARY! and STRING! are not equivalent. AS-BINARY and AS-STRING must go away because they assume that you can reference the same data without decoding or encoding. You must use TO-BINARY and TO-STRING as we did in pre-V2.6 REBOL.
Encoders and decoders such as ENBASE, DEBASE, COMPRESS, DECOMPRESS, ENCLOAK, and DECLOAK have new rules. If you ENBASE a STRING!, either the string is implicitly converted to BINARY! first, or the function should throw an error because a string is not directly valid input.
Various functions may only work with STRING data or BINARY data, but not both. For example, LOWER-CASE and UPPER-CASE functions are not valid for BINARY, and AND, OR, XOR are not valid for STRING.
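
To make the decoding and encoding points concrete, here is a minimal sketch in Python rather than REBOL, since the R3 behavior described here is still being settled. It is only an analogy: Python's bytes stands in for BINARY! and str for STRING!, and the sample bytes are my own choice. It shows that the same two bytes are a different string depending on the encoding you assume.

```python
# Python analogy, not REBOL: bytes ~ BINARY!, str ~ STRING!.
raw = bytes([0xC3, 0xA9])  # two bytes; by themselves they are not text

# Decoding: the result depends entirely on the encoding you assume.
as_utf8 = raw.decode("utf-8")      # one character, U+00E9
as_latin1 = raw.decode("latin-1")  # two characters, U+00C3 then U+00A9

# Encoding: the same one-character string gives different bytes.
utf8_bytes = "\u00e9".encode("utf-8")      # bytes C3 A9
latin1_bytes = "\u00e9".encode("latin-1")  # byte E9

print(len(as_utf8), len(as_latin1))          # 1 2
print(utf8_bytes.hex(), latin1_bytes.hex())  # c3a9 e9
```

Python also refuses to treat the two types as interchangeable: case conversion of text and bitwise operations on bytes are defined on different types, which is the same split described in the last point above.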

Wow. How's that for introducing the wonderful benefits of Unicode? Yes, you can love to hate it or hate to love it. It just depends on where you live. But that's the reality of the modern world of computing. Sorry.

Well, actually, all of this was bound to happen eventually. In the past, we've had the luxury of being a bit sloppy in our coding practices. We could throw binary and text around like they were different sides of the same coin. Now we must buckle down if we want the rest of the world to enlist in our REBOL forces. A lot of people live in Asia. A lot.

An important rule...

So, what's really truly going on here? Well, you know me, I like to summarize down to a nice little rule:

Rule:

In high level languages it is dangerous to make assumptions about low-level internal data representation.

What do I mean?

Here's a quick test of your understanding:

Q: What does this line do?

bin: to binary! "hello"

A: If you said that it converts the internal representation of the string "hello" into a standard binary encoded representation such as UTF-8, then you got it right. (What if you wanted it encoded into something different like UTF-16 or Latin-1? You must specify a function refinement for that.)

Q: What does it not do?

A: It does not give you the internal representation of the "hello" string (anymore).

Q: What does this line do?

str: to string! #{68656C6C6F}

A: If you said that it converts a standard binary encoded representation such as UTF-8 to an internally represented string, then you got it right.

Q: What does it not do?

A: It does not consider those bytes to be the internal representation of the string. They are an encoding of it.

Note: just because the binary literal looks like ANSI or Latin-1 here does not mean it is. In fact, the default is UTF-8.
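
The quiz above can be replayed as a small Python sketch. Again this is an analogy and not R3 itself: the encoding argument plays the role of the function refinement mentioned earlier, and the variable names are just illustrative. It also shows why the literal alone cannot tell you the encoding: for plain ASCII text, UTF-8 and Latin-1 produce identical bytes.

```python
# Python analogy for: bin: to binary! "hello" / str: to string! #{68656C6C6F}
text = "hello"

bin_utf8 = text.encode("utf-8")       # default-style encoding
bin_utf16 = text.encode("utf-16-be")  # explicit choice, like a refinement

print(bin_utf8.hex().upper())   # 68656C6C6F
print(bin_utf16.hex().upper())  # 00680065006C006C006F

# Decoding the literal from the quiz gives the string back.
literal = bytes.fromhex("68656C6C6F")
back = literal.decode("utf-8")
print(back)  # hello

# For ASCII-only text the bytes look the same under UTF-8 and Latin-1,
# so the bytes do not reveal which encoding was intended.
same_bytes = literal.decode("latin-1") == back
print(same_bytes)  # True
```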

A mistake...

A couple of years ago, I added the AS-BINARY and AS-STRING functions to REBOL. When I did it, I knew it was treading on an important rule of computer science: you cannot go around directly aliasing datatypes like that, because if either of the datatypes changes representation (such as to support Unicode) then you've got a problem. And, of course, the problem is made much worse by the fact that different CPUs store the data differently: in big or little endian.
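
The endian problem is easy to see in a short Python sketch. It is an illustration under an assumption: Python does not expose its internal string layout, so the two UTF-16 byte orders stand in here for how a 16-bit character might sit in memory on different CPUs.

```python
# Python analogy: one character, two byte orders.
ch = "A"  # U+0041

little = ch.encode("utf-16-le")  # bytes 41 00
big = ch.encode("utf-16-be")     # bytes 00 41
print(little.hex(), big.hex())   # 4100 0041

# Aliasing goes wrong when the reader assumes the other byte order:
# the little-endian bytes read as big-endian are a different character.
misread = little.decode("utf-16-be")
print(hex(ord(misread)))  # 0x4100
```

Code that aliases the raw bytes works only as long as both sides agree on the layout, which is exactly the assumption the rule above warns against.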

To continue...

If I write:

insert bin "example"

What does that do?

Since we do not know how the string is internally represented, we must either auto-encode the string into binary, or throw an error.

Which does R3 actually do? Currently, I assume it should encode the string and insert it, but that's not final. Give me some feedback.
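
For comparison, here is how the two options look in Python, which took the "throw an error" route for the direct case. This is a sketch of the design choice, not of what R3 does; insert_text is a hypothetical helper written only for this example, and UTF-8 as its default is my assumption.

```python
# Python analogy for: insert bin "example"
def insert_text(buf: bytearray, index: int, text: str,
                encoding: str = "utf-8") -> None:
    """Hypothetical helper: encode the text, then splice the bytes in."""
    buf[index:index] = text.encode(encoding)

bin_ = bytearray(b"\x00\xff")

# Option 1: refuse to mix text and binary (what Python itself does).
try:
    bin_[0:0] = "example"
    rejected = False
except TypeError:
    rejected = True

# Option 2: auto-encode on the way in (what the helper does).
insert_text(bin_, 0, "example")

print(rejected)     # True
print(bytes(bin_))  # b'example\x00\xff'
```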

Now let me give you something a bit more complex. What happens if I write:

data: enbase "this is an example line"

Well, ENBASE is a binary base encoder (which defaults to a BASE-64 encoding). Since it encodes binary, that implies that the string needs to be converted to binary; therefore, the string needs to be encoded, either automatically or explicitly. ENBASE is a "double encoder".

Are you with me? Guess what? There's more! ENBASE returns a STRING! because it is designed for inserting base encoded data into things like email or web CGI text. So, if I write:

out: make binary! 1000
append out enbase "this is an example line"

then the STRING! output of ENBASE must be encoded into the BINARY!. So, there's a triple encoding going on here.
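
The three steps can be spelled out one at a time in Python. Treat it as an analogy with stated assumptions: UTF-8 is assumed for both text-to-binary steps, and because Python's base64.b64encode returns bytes, the decode to text is added only to mimic ENBASE returning a STRING!.

```python
import base64

# Python analogy for: append out enbase "this is an example line"
line = "this is an example line"

# Encoding 1: string -> binary, so there is something to base-encode.
as_binary = line.encode("utf-8")

# Encoding 2: binary -> base-64 text (the decode mimics ENBASE's STRING! result).
as_base64_text = base64.b64encode(as_binary).decode("ascii")

# Encoding 3: base-64 text -> binary again, so it can go into the buffer.
out = bytearray()
out += as_base64_text.encode("utf-8")

# Reversing all three steps recovers the original line.
recovered = base64.b64decode(bytes(out)).decode("utf-8")
print(recovered == line)  # True
```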

Fortunately, R3 is quite smart and efficient internally about how all of this is done. (All of this work is the main reason you've not seen me around much chatting online.) In theory, the above line should evaluate about as fast as it did in R2, and perhaps even a bit faster. I have yet to measure it.

In summary...

I should note that there are advantages and disadvantages to these new rules. For users who just want to write scripts and not worry about it, R3 does a lot of the hard work. However, for those who want to fiddle around with the bits in the bytes, it may be more difficult to make things work out. For that, we'll need to develop some smart and well-defined methods. Yep, those of you who do will need to buckle down. It's not your grandpa's BINARY anymore.

Got some comments on any of this? Please post them right away.
