Pruning down READ and WRITE

As you know, read and write functions in R3 will default to binary, rather than string. This is necessary because:

Strings are encoded as binary, for example UTF-8. So in order to decode them, we must know how they are encoded. If the file has no BOM (byte order mark), its encoding is unknown and must be provided to the function itself.
Running in binary mode will not accidentally corrupt a file. This is the ancient FTP transfer binary/text problem. If you transferred an image using FTP text mode, the file could be damaged if line terminators were found.
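Both reasons can be seen in a few lines. The sketch below is Python rather than REBOL, and the helper text_mode_transfer and the sample bytes are made up for illustration: the same bytes decode to different strings depending on the assumed encoding, and a text-mode style newline rewrite changes binary data.

```python
# Reason 1: the same bytes mean different text under different encodings.
raw = "é".encode("utf-8")            # two bytes: 0xC3 0xA9
as_utf8 = raw.decode("utf-8")        # the intended single character
as_latin1 = raw.decode("latin-1")    # wrong guess: two unrelated characters

# Reason 2: newline translation applied to binary data alters the bytes.
def text_mode_transfer(data: bytes) -> bytes:
    """Mimic a text-mode transfer that rewrites LF as CRLF (hypothetical helper)."""
    return data.replace(b"\n", b"\r\n")

# Made-up binary payload that happens to contain two 0x0A bytes.
image_bytes = bytes([0xFF, 0xD8, 0x0A, 0x00, 0x0A, 0xFF, 0xD9])
damaged = text_mode_transfer(image_bytes)
```

Here damaged is two bytes longer than image_bytes, which is enough to break most image formats.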

A few days ago, I proposed new read and write functions that add an /as refinement. This refinement allows you to specify the encoding:

data: read/as file 'utf-8
write/as file data 'utf-16le

This is useful because the encoding can be specified as part of the function call. However, it also makes this approach the standard method for reading and writing all encodings, even those for image files, etc. For example:

image: read/as file 'jpeg

Our plan was to add an intermediate layer in read and write to allow for codecs (encoding and decoding). They would be stream oriented to allow for partial transfers, and also "fragments" (when not enough data has been received to finish a well-aligned encoding or decoding process.)
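To make the "fragments" problem concrete, here is a small sketch using Python's standard incremental decoder. It is only an analogue of what a stream-oriented codec layer would have to do, not the R3 design; the chunk boundaries are chosen on purpose to split one character.

```python
import codecs

# An incremental decoder keeps undecoded trailing bytes between calls.
decoder = codecs.getincrementaldecoder("utf-8")()

# The euro sign is three bytes in UTF-8 (E2 82 AC); split it across two chunks.
chunks = [b"price: \xe2\x82", b"\xac5"]

# The first call holds back the incomplete fragment instead of failing.
pieces = [decoder.decode(chunk) for chunk in chunks]

# final=True flushes the decoder and would raise if a fragment were left over.
text = "".join(pieces) + decoder.decode(b"", final=True)
```

The first chunk yields only "price: ", and the held-back fragment is completed when the second chunk arrives.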

Of course, all of that makes read and write more complicated.

The question is: do we want to do that?

Another factor is that the old R2 lower level I/O functions read-io and write-io have been eliminated from R3. The read and write functions have that capability now.

So, we can say that it all boils down to: are read and write lower-level or higher-level functions?

After some consideration, I think they should be lower-level functions.

This would mean that they should be as fast and efficient as possible. This also implies that they should have as few refinements as possible (because, as in any language, the more function arguments there are, even if optional or local, the more overhead the function call has, because those slots must be allocated in the function frame.)

OK, if we do that, a function like read can be defined as:

read file /part size /skip len

That's really pruned down compared to R2. We could prune it even more by adding a seek function to eliminate the /skip refinement.
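As a rough analogue of that pruned-down interface, the Python sketch below reads raw bytes with an optional length and starting offset. The name read_bytes and its parameters are hypothetical stand-ins for read, /part and /skip; the seek call plays the role of the separate seek function mentioned above.

```python
import os
import tempfile

def read_bytes(path, part=None, skip=0):
    """Hypothetical analogue of `read file /part size /skip len`: always binary."""
    with open(path, "rb") as f:
        f.seek(skip)                      # the /skip refinement, or a separate seek
        return f.read() if part is None else f.read(part)

# Demonstrate on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.bin")
    with open(path, "wb") as f:
        f.write(b"0123456789")
    whole = read_bytes(path)              # entire file
    middle = read_bytes(path, part=4, skip=3)
```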

So, what about the primary REBOL rule of keeping things simple?

This is important, and the solution would be to provide higher level mezzanine functions that provide the necessary encoding.

For example, we could have:

str: read-text file
write-text file str

This read-text function could be smart in many ways. For example, it could examine the BOM of the file to determine the encoding. It would also make the line termination corrections.
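A minimal sketch of that behavior, written in Python and working on bytes that have already been read: decode_text and the BOMS table are hypothetical names, only three BOMs are checked (the UTF-32 ones are left out because the UTF-32 LE mark begins with the same bytes as the UTF-16 LE mark), and UTF-8 is assumed when no BOM is found.

```python
import codecs

# BOM byte sequences and the encoding each one implies.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def decode_text(data: bytes, default: str = "utf-8") -> str:
    """Pick the encoding from the BOM if present, then normalize line endings."""
    for bom, name in BOMS:
        if data.startswith(bom):
            text = data[len(bom):].decode(name)   # drop the BOM itself
            break
    else:
        text = data.decode(default)               # no BOM: fall back (assumption)
    # Convert CRLF and lone CR to LF.
    return text.replace("\r\n", "\n").replace("\r", "\n")

sample = codecs.BOM_UTF16_LE + "a\r\nb".encode("utf-16-le")
decoded = decode_text(sample)
```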

Note that because it is a mezzanine function, users can easily improve it over time.

We would also provide an /as refinement:

str: read-text/as file 'utf-16le

and, even:

write-text/bom/as file str 'utf-16le

Here the /bom refinement indicates that we want the BOM inserted at the head of the file data.
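The writing side could look something like the following Python sketch, which only builds the bytes to be written. The name encode_text, the bom flag, and the BOM_FOR table are hypothetical and simply mirror the /as and /bom refinements.

```python
import codecs

# BOM to prepend for each supported encoding name.
BOM_FOR = {
    "utf-8": codecs.BOM_UTF8,
    "utf-16-le": codecs.BOM_UTF16_LE,
    "utf-16-be": codecs.BOM_UTF16_BE,
}

def encode_text(text: str, encoding: str = "utf-8", bom: bool = False) -> bytes:
    """Encode text and optionally put the matching BOM at the head of the data."""
    body = text.encode(encoding)
    return BOM_FOR[encoding] + body if bom else body

with_bom = encode_text("A", "utf-16-le", bom=True)
without_bom = encode_text("A")
```

The result would then be handed to the low-level binary write unchanged.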

So, there you go. Let's do a quick survey of REBOLers and get some comments.
