Unicode strings - an opportunity?
by Colin Adams (modified: 2007 Dec 16)
It seems to me that STRING_GENERAL isn't worth having - just take a look at some of the implementations, and you will see they are marked for 8-bit only.
I think it would be better to take the opportunity to abandon read-write strings altogether (confining STRING_8 to "legacy") and make STRING_32 unconnected with STRING_8.
Also, substring would then not need to take a copy, and so could be much faster. Indeed, we could consider wasting the initial slot, and so eliminating the cost of translating from 1-based addressing to 0-based addressing (one wasted slot would not be very significant when every character consumes 4 bytes - I am assuming UTF-32 for the implementation of STRING_32).
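To make the "wasted initial slot" idea concrete, here is a minimal sketch in Python rather than Eiffel. The class name `PaddedString32` and its features are hypothetical and are not part of any Eiffel library. It assumes one fixed-width storage cell per code point, as in UTF-32, and leaves cell 0 unused so that a 1-based index can address the storage directly.

```python
from array import array


class PaddedString32:
    """Hypothetical read-only string with an unused storage cell at index 0."""

    def __init__(self, text):
        # "L" is an unsigned integer type of at least 4 bytes, so it can
        # hold any Unicode code point. Cell 0 is the deliberately wasted slot.
        self._area = array("L", [0])
        self._area.extend(ord(ch) for ch in text)
        self.count = len(text)

    def item(self, i):
        """Character at 1-based position i."""
        if not 1 <= i <= self.count:
            raise IndexError(i)
        # The 1-based index is used as-is: no `i - 1` translation is needed.
        return chr(self._area[i])


s = PaddedString32("héllo")
print(s.item(1), s.item(2), s.item(5))
```

The cost is one extra cell per string. Whether saving a subtraction per access is measurable in practice is not something this sketch can show.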
STRING_32 read-only
I think that's a great idea, Colin.
But what about the fact that this would make STRING_32 an oddity in Eiffel's type system? All other reference types are writeable. Would Eiffel need a readonly keyword?
Also, just to be sure that I understand your point about substring being more efficient, do you mean that each new substring object would be implemented by indexing into the area of the old string? Cunning!
Not all other reference types are writable. Only those that provide mutating features are.
Yes, you understand my point about substring correctly, but describing it as cunning is a bit much.
Colin Adams
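The substring scheme confirmed above can be sketched as follows. This is an illustration in Python, not Eiffel, and `SharedString32`, `make_from`, `item` and `substring` are hypothetical names. It assumes the string is read-only, so that a substring can safely be a view (shared area, offset, count) instead of a copy.

```python
class SharedString32:
    """Hypothetical read-only string: a view onto an immutable area."""

    def __init__(self, area, lower=0, count=None):
        self._area = area      # tuple of code points, never modified
        self._lower = lower    # 0-based offset of the first character
        self.count = len(area) - lower if count is None else count

    @classmethod
    def make_from(cls, text):
        return cls(tuple(ord(ch) for ch in text))

    def item(self, i):
        """Character at 1-based position i."""
        if not 1 <= i <= self.count:
            raise IndexError(i)
        return chr(self._area[self._lower + i - 1])

    def substring(self, start, end):
        """Characters start..end (1-based, inclusive), sharing the area."""
        if end < start:
            return SharedString32(self._area, self._lower, 0)
        if start < 1 or end > self.count:
            raise IndexError((start, end))
        # Constant time: only the offset and the count change.
        return SharedString32(self._area, self._lower + start - 1,
                              end - start + 1)

    def __str__(self):
        cells = self._area[self._lower:self._lower + self.count]
        return "".join(chr(c) for c in cells)


s = SharedString32.make_from("unicode strings")
sub = s.substring(9, 15)
print(sub, sub._area is s._area)
```

One trade-off to keep in mind: a small substring keeps the whole original area alive for as long as the substring is referenced.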
As I think I've mentioned quite a few times, STRING_GENERAL will indeed not be worth having once there is no more legacy code using STRING_8. It was only created to offer a smooth transition path for those who were using STRING_8 (hence the many restrictions requiring the string to contain only characters that can fit into a STRING_8 instance). So it is just a matter of time until it becomes obsolete.
For substring, there is nothing that you cannot do with today's implementation. If we add a boolean flag to say whether a STRING object has changed or not, then we can easily implement substring the way you describe it. It could be called `aliased_substring'.
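One possible reading of the boolean-flag idea is copy-on-write: the substring shares the area, and a string that might be sharing takes a private copy just before its first modification. The sketch below is in Python, not Eiffel. The class `MutableString`, the flag `_shared` and the feature `put` are assumptions made for illustration, and the actual flag proposed above may be intended to work differently.

```python
class MutableString:
    """Hypothetical read-write string with an aliasing substring."""

    def __init__(self, text):
        self._area = [ord(ch) for ch in text]
        self._lower = 0        # 0-based offset of the first character
        self.count = len(self._area)
        self._shared = False   # True while another string may see _area

    def aliased_substring(self, start, end):
        """Characters start..end (1-based, inclusive), without copying."""
        if end >= start and (start < 1 or end > self.count):
            raise IndexError((start, end))
        sub = MutableString("")
        sub._area = self._area
        sub._lower = self._lower + start - 1
        sub.count = max(end - start + 1, 0)
        # Both strings now have to copy before their next write.
        sub._shared = True
        self._shared = True
        return sub

    def put(self, ch, i):
        """Replace the character at 1-based position i."""
        if not 1 <= i <= self.count:
            raise IndexError(i)
        if self._shared:
            # First write since sharing began: take a private copy.
            self._area = self._area[self._lower:self._lower + self.count]
            self._lower = 0
            self._shared = False
        self._area[self._lower + i - 1] = ord(ch)

    def __str__(self):
        cells = self._area[self._lower:self._lower + self.count]
        return "".join(chr(c) for c in cells)


s = MutableString("hello world")
sub = s.aliased_substring(7, 11)
s.put("J", 1)
print(s, sub)
```

The flag is conservative: a string may copy once more than strictly necessary after its partner has already copied, but neither string ever observes the other's changes.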
For the starting index being 0 instead of 1, we can easily do it, but it would break some existing code using `area' directly instead of `to_c'. However, I'm not sure it makes sense, as most of the operations in class STRING already use 0-based indexing for efficiency. So here you would only optimize client code, and the drawback is that indexing from 0 is always messy, especially when everything else starts at 1.
Read-only strings
Ignoring for the moment Colin's suggested optimisations, I think the most interesting suggestion he made was to make STRING_32 read-only. I'm convinced that read-write strings are a maintenance problem in many ways, not the least of which is the complexity they add to the interface of the STRING class. I agree with Colin: Eiffel should grab this opportunity to abolish read-write strings. I'd like to see what the interface of STRING_32 looks like without any commands!
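As a rough idea of what a command-free interface might look like, here is a small sketch in Python rather than Eiffel. The class `ReadOnlyString32` and the features `appended` and `replaced_item` are hypothetical names chosen for illustration. Every feature is a query, and anything that would have been a command returns a new string instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadOnlyString32:
    """Hypothetical string with queries only; instances never change."""

    code_points: tuple

    @classmethod
    def make_from(cls, text):
        return cls(tuple(ord(ch) for ch in text))

    @property
    def count(self):
        return len(self.code_points)

    def item(self, i):
        """Character at 1-based position i."""
        if not 1 <= i <= self.count:
            raise IndexError(i)
        return chr(self.code_points[i - 1])

    def appended(self, other):
        """New string made of this one followed by other."""
        return ReadOnlyString32(self.code_points + other.code_points)

    def replaced_item(self, ch, i):
        """New string equal to this one except at 1-based position i."""
        if not 1 <= i <= self.count:
            raise IndexError(i)
        cells = list(self.code_points)
        cells[i - 1] = ord(ch)
        return ReadOnlyString32(tuple(cells))

    def __str__(self):
        return "".join(chr(c) for c in self.code_points)


a = ReadOnlyString32.make_from("abc")
b = a.appended(ReadOnlyString32.make_from("d"))
print(a, b, a.replaced_item("x", 2))
```

Because instances never change, they can be shared freely between clients and used as hash-table keys without defensive copies.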