UTF-8 Unicode in Eiffel for .NET

by Peter Gummer (modified: 2017 Mar 06)

We have an Eiffel for .NET dll which is called by a VB.NET application. We need to internationalize this application. "That should be easy", I thought, "because Eiffel's STRING is basically just a wrapper for the .NET String." So we fed our Eiffel dll some UTF-8 data, which it dutifully manipulated; but the VB.NET client application received garbage. Instead of displaying beautiful Farsi characters, it spat out wingding-dingbat-like droppings. What was wrong?

The Eiffel STRING class is not natively Unicode, but it can encode UTF-8 as a sequence of bytes.
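To make that byte-oriented behaviour concrete, here is a minimal sketch (not taken from the original project) that stores the two UTF-8 bytes of a single Farsi letter, U+0633, in a STRING_8. It then compares widening each byte to its own character with decoding both bytes as one code point. The class name UTF8_WIDENING_DEMO is hypothetical, and the feature names used (to_character_8, natural_32_code, as_string_32) are those of current EiffelBase; they may differ in EiffelStudio 5.7.

```eiffel
class
	UTF8_WIDENING_DEMO

create
	make

feature {NONE} -- Initialization

	make
			-- Compare byte-by-byte widening with UTF-8 decoding
			-- for the bytes D8 B3, which encode U+0633.
		local
			bytes: STRING_8
			b1, b2, decoded: NATURAL_32
		do
			create bytes.make (2)
			bytes.append_character ((0xD8).to_character_8)
			bytes.append_character ((0xB3).to_character_8)

				-- Widening keeps one character per byte (U+00D8, U+00B3),
				-- so the count stays at 2 instead of dropping to 1.
			print ("Widened count: " + bytes.as_string_32.count.out + "%N")

				-- Decoding a two-byte sequence: 5 payload bits from the
				-- lead byte, 6 payload bits from the continuation byte.
			b1 := bytes.item (1).natural_32_code
			b2 := bytes.item (2).natural_32_code
			decoded := ((b1 & 0x1F) |<< 6) | (b2 & 0x3F)
			print ("Decoded code point: " + decoded.out + "%N")
		end

end
```

If the .NET String is built one character per byte in the same way, that would plausibly account for the garbage seen on the VB side, although the post itself does not confirm the cause.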

It seems that STRING is heavily byte-oriented, even in .NET. I guess this is just something the Eiffel Software team hasn't got around to doing properly yet in EiffelStudio 5.7. How could I convert this sequence of bytes, supplied by STRING, into a true Unicode String? And could I do this without having to examine and modify all of our Eiffel classes, on the one side, and our VB classes, on the other side?

The answer lay in SYSTEM_STRING_FACTORY. This is a helper class for STRING; it converts to and from the .NET String. By modifying this class, I could transform those UTF-8 bytes into meaningful characters, with the help of the Gobo class UC_UTF8_STRING. The first step was to copy SYSTEM_STRING_FACTORY from the base.kernel.dotnet cluster to an override cluster of my own making.

Then I edited my override version of the class. The original version of the from_string_to_system_string function converts the bytes in the Eiffel STRING l_str8 to the .NET String Result with this one line of code:

create Result.make (l_str8.area.native_array, 0, a_str.count)

Now I really don't understand why this doesn't work. The native_array is declared as NATIVE_ARRAY [CHARACTER], and CHARACTER is simply a .NET Character as far as I can tell, which is a Unicode character. So why doesn't creating a Unicode string from an array of Unicode characters produce true Unicode text, rather than a mash of wingdings? I don't know. To fix it, however, I replaced the above line with this:

create utf8.make_from_utf8 (l_str8)

Note that utf8 is declared as UC_UTF8_STRING. Then I looped through utf8, copying each Unicode character to l_str via a .NET StringBuilder (using almost exactly the same code that the original version of SYSTEM_STRING_FACTORY uses to copy 32-bit strings, by the way). The full source for my version of SYSTEM_STRING_FACTORY is attached.
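The attached class is the authoritative version; the following is only a rough sketch of the loop described above, under some loudly stated assumptions. It uses STRING_32 as a stand-in for the .NET StringBuilder so that it does not depend on .NET feature names, and it assumes that Gobo's UC_UTF8_STRING offers `count' and `item_code' alongside the `make_from_utf8' used in the post. The class UTF8_TO_STRING_32_DEMO and its feature `decoded' are hypothetical names.

```eiffel
class
	UTF8_TO_STRING_32_DEMO

feature -- Conversion

	decoded (a_utf8: STRING_8): STRING_32
			-- Characters of `a_utf8', with its bytes read as UTF-8.
			-- Hypothetical helper: STRING_32 stands in for the
			-- .NET StringBuilder used in the real override.
			-- Assumes `a_utf8' holds valid UTF-8.
		local
			utf8: UC_UTF8_STRING
			i, nb: INTEGER
		do
			create utf8.make_from_utf8 (a_utf8)
				-- `count' is the number of characters, not bytes.
			nb := utf8.count
			create Result.make (nb)
			from
				i := 1
			until
				i > nb
			loop
					-- Copy one Unicode code point per iteration.
				Result.append_code (utf8.item_code (i).to_natural_32)
				i := i + 1
			end
		end

end
```

On .NET, a code point above U+FFFF would need a surrogate pair when appended to a StringBuilder; this sketch sidesteps that by targeting STRING_32.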

This single change was sufficient for allowing VB client classes to display the strings that our Eiffel libraries create. No change was required to any VB code. Each of the hundreds of places in our VB code that called STRING.to_cil now automatically did the conversion with the help of my version of SYSTEM_STRING_FACTORY.from_string_to_system_string, and so our application displayed proper Farsi text.

But this wasn't enough to handle converting the other way. Passing VB strings to our Eiffel libraries still didn't work: I had to fix the creation routine STRING.make_from_cil. This was achieved by modifying the SYSTEM_STRING_FACTORY.read_system_string_into command, which converts the .NET String a_str to the Eiffel STRING l_str8 with this line of code:

a_str.copy_to (0, l_str8.area.native_array, 0, a_str.length)

Once again, I'm not sure why this doesn't work; but it doesn't. To fix it, I replaced the above line with two loops:

from
    i := 0
    nb := a_str.length
    create utf8.make_empty
until
    i = nb
loop
    utf8.append_character (a_str.chars (i))
    i := i + 1
end

from
    i := 1
    nb := utf8.byte_count
    l_str8 ?= a_result
    l_str8.wipe_out
until
    i > nb
loop
    l_str8.append_character (utf8.byte_item (i))
    i := i + 1
end
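The second loop copies whatever bytes UC_UTF8_STRING produced into the STRING. To show what those bytes look like for one character, here is a Gobo-free sketch of UTF-8 encoding. The class UTF8_ENCODE_DEMO and the helper `append_utf8' are hypothetical, not part of EiffelBase or Gobo. The sketch covers only code points up to U+FFFF (what a single .NET Char can hold), does not treat surrogates specially, and uses current EiffelBase feature names.

```eiffel
class
	UTF8_ENCODE_DEMO

create
	make

feature {NONE} -- Initialization

	make
			-- Encode U+0633 and print how many bytes it occupies.
		local
			bytes: STRING_8
		do
			create bytes.make_empty
			append_utf8 (0x0633, bytes)
			print ("Byte count: " + bytes.count.out + "%N")
		end

feature -- Encoding

	append_utf8 (a_code: NATURAL_32; a_bytes: STRING_8)
			-- Append the UTF-8 bytes for `a_code' to `a_bytes'.
			-- Hypothetical helper for illustration only.
		require
			fits_in_one_utf16_unit: a_code <= 0xFFFF
		do
			if a_code <= 0x7F then
					-- ASCII: one byte, unchanged.
				a_bytes.append_code (a_code)
			elseif a_code <= 0x7FF then
					-- Two bytes: 110xxxxx 10xxxxxx.
				a_bytes.append_code ((a_code |>> 6) | 0xC0)
				a_bytes.append_code ((a_code & 0x3F) | 0x80)
			else
					-- Three bytes: 1110xxxx 10xxxxxx 10xxxxxx.
				a_bytes.append_code ((a_code |>> 12) | 0xE0)
				a_bytes.append_code (((a_code |>> 6) & 0x3F) | 0x80)
				a_bytes.append_code ((a_code & 0x3F) | 0x80)
			end
		end

end
```

For U+0633 this takes the two-byte branch and yields the bytes D8 B3, so one .NET character becomes two Eiffel CHARACTERs in the resulting STRING.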

So now make_from_cil and to_cil handle UTF-8 properly. In each case, they copy the contents of the input string to an instance of UC_UTF8_STRING, which is then copied to the result. This implementation is no doubt inefficient, but it does seem to be working ok. It requires minimal change on the Eiffel side (only one class, SYSTEM_STRING_FACTORY, has been overridden); and it's completely transparent, as far as I can tell, on the VB side.

Before arriving at this approach, I had several false starts. I tried doing all of the conversion on the VB side, but that required modifying hundreds of lines of code. I tried mapping the Eiffel STRING class to STRING_32, but that didn't even compile (and it probably wouldn't have worked anyway); and I tried making VB work directly with a .NET-enabled override of UC_UTF8_STRING, but that blew up at run time with type-cast errors in STRING.is_equal (CAT-calls, I think).

So all in all, I'm happy that it seems to be working; but I'm unhappy that it took a lot of work to figure out how to do it. I'm looking forward to the day when Unicode in Eiffel is as easy as it is in C# and VB.

Attached is my override of SYSTEM_STRING_FACTORY.

Note (May 17, 2007): I've written a follow-up to this at UTF-8 in .NET, revisited, including a new override of SYSTEM_STRING_FACTORY.

Comments

Manu (17 years ago 16/3/2007)
Limitations

As far as I can tell, this implies that all your Eiffel strings are UTF-8, as otherwise it might not work for characters that are above 128. But if you get your data from UTF-8, wouldn't it be better to generate STRING_32 instead when reading the data? Once done, the STRING_32 would convert nicely with .NET System.String.
Peter Gummer (17 years ago 17/3/2007)
Limitations - Yes, UTF-8

Yes, the assumption here is that the strings are all UTF-8. (I tried to make that clear, especially in the title). This assumption is sound for our purposes.

For this reason, the official EiffelStudio version of SYSTEM_STRING_FACTORY probably should not adopt my "fix". Other good reasons for EiffelStudio to come up with a better fix than this are that my implementation is inefficient, and it would create a dependency of the base library on the Gobo library. I don't mind my own project having a dependency on Gobo - the project already uses Gobo - but this is not ok in general.

I agree with your idea of generating STRING_32 when reading the data. I was thinking along those lines when I attempted (unsuccessfully) to map STRING to STRING_32. I modified my project's config to use base as a cluster rather than a library; then I copied all of the mappings from base.ecf to my config, editing STRING to map it to STRING_32 rather than STRING_8. But I quickly abandoned that route, because it wouldn't even compile. Some line in a library (something like true_string: STRING is "True") couldn't convert a STRING_8 to a STRING_32. I could have left the mapping alone, I suppose, and done a global search and replace of STRING with STRING_32 in our code; but that is very invasive, and I'd be surprised if it worked given that the libraries we call would still be producing STRING_8 objects.

I really don't like this STRING_8 / STRING_32 idea. I programmed in C# for two years, developing an internationalised application, and I was barely conscious of the fact that my strings and characters were Unicode. It just worked. I acknowledge that Eiffel is contending with a backward-compatibility problem here, but I would be much happier if I could just flip a switch in the config file so that all of my STRING objects instantly became Unicode.
- Manu (17 years ago 17/3/2007)
  Flip of a coin
  
  That will soon be possible, once all our legacy code that only handles STRING_8 has been adapted to work with STRING_32 as well.
- Colin Adams (17 years ago 17/3/2007)
  Multiple string types
  
  I don't think there was a backwards-compatibility problem - at least, not one that needed the STRING_8/STRING_32 separation as a solution. A much simpler fix of adding a query would have done the trick.
  - Manu (17 years ago 17/3/2007)
    Solution?
    
    The issue is that legacy code wrapping C interfaces required 8-bit strings, and having Unicode strings would have broken those APIs. Therefore the separation was and is still needed.
    
    What do you mean by a query?
    - Colin Adams (17 years ago 17/3/2007)
      My solution
      
      Well, two queries actually.

      maximum_code: INTEGER is
              -- Maximum value of `code' permitted by `character_set'
          do
              -- 255 for ISO-8859-x, 1114111 for Unicode
              -- compiler can optimize this as a builtin query
              -- in the case that only 1 character set is used in
              -- a system. Otherwise it would be an attribute
          ensure
              positive_code: Result > 0
          end

      and

      character_set: STRING is
              -- Name of character set used in `Current'
          do
              -- same considerations as for `maximum_code'
          ensure
              result_not_empty: Result /= Void and then not Result.is_empty
              result_is_ascii: -- whatever
          end
      
      With these two queries, all incompatibilities can be coped with (including your c-string stuff - the compiler can include transcoding when necessary).
      
      Note that the latter query was also needed for ETL2, for supporting multiple or alternate encodings, such as ISO-8859-2.
      - Manu (17 years ago 18/3/2007)
        Real issue
        
        What I meant is that it is not an issue of encoding or character set, but an issue with the memory representation of Eiffel strings. Indeed, the existing legacy C wrappers use this representation directly, so changing STRING to support Unicode would break those programs. This is one of the reasons why EiffelBase introduced C_STRING, so that it works regardless of the memory representation of Eiffel strings.
- Thomas Beale (17 years ago 18/3/2007)
  STRING_8/STRING_32
  
  I would also really like to know why we should have STRING_8 etc. It makes no sense to me at all. Currently the compiler is doing tricks to convert all STRINGs in our code to STRING_8, but certain things like checking generated_type.is_equal("STRING") break; anywhere where dynamic_type() from INTERNAL is used with STRING objects might or might not work. And I don't see why issues of UTF encoding (not the same as Unicode per se) should be exposed at a developer-visible level.
  
  What we need is to be able to say at the beginning of an application set_unicode_encoding_utf8 or set_unicode_encoding_utf16 and everything just works. The default should be whichever makes sense in your linguistic culture (UTF-8 in all European languages).
  
  - thomas
  - Colin Adams (17 years ago 18/3/2007)
    Unicode encoding
    
    There are two Unicode encodings to be considered - the Unicode Encoding Form, and the Unicode Encoding Scheme.
    
    The former is one of UTF-8, UTF-16, and UTF-32. This is what is used internally within the program. It is a classic time/space trade-off. ISE uses UTF-32, which spends memory to save computing time. Either UTF-16 or UTF-8 would be slower, but UTF-16 is rarely significantly slower. I don't see any linguistic cultural differences affecting the issue.
    
    The Unicode Encoding Schemes are byte serializations of the encoding forms. The full list is UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE. The natural default tends to map to your computer hardware + O/S, except there is a disk-space consideration here: The UTF-32* use more disk space. Which is the most economical DOES depend upon the linguistic culture - in Europe UTF-8 is cheapest, in East Asia, UTF-16* are cheapest. I don't know what {STRING_32}.out produces with ISE 5.7.
    
    So there are two possible sets of set_unicode_encoding features.
    
    Colin Adams
    - Manu (17 years ago 19/3/2007)
      The form doesn't matter, since it is most likely hidden from the user's point of view (the user manipulates a sequence of characters and nothing more). Nevertheless, it might be better to keep it 32-bit since that is faster.
      
      Regarding the encoding scheme, it cannot be set at the application level, since different libraries might choose different encodings, or you might need to read a different encoding. So it has to be configurable, and this should be outside the STRING class.
      - Colin Adams (17 years ago 19/3/2007)
        Serializing
        
        I agree that the encoding form doesn't matter, and that the compiler should be free to choose whichever it likes (and UTF-32 would be my choice too).
        
        But for serializing, STRING_GENERAL should have the following routines (bodies omitted):
        
        to_utf8: !STRING_8 is
                -- Serialization of `Current' as bytes of UTF-8 representation.
            do
            ensure
                not_shorter: Result.count >= count
            end

        to_utf_16_be: !STRING_8 is
                -- Serialization of `Current' as bytes of UTF-16BE representation.
            do
            ensure
                not_shorter: Result.count >= 2 * count
            end

        to_utf_16_le: !STRING_8 is
                -- Serialization of `Current' as bytes of UTF-16LE representation.
            do
            ensure
                not_shorter: Result.count >= 2 * count
            end

        to_utf_32_be: !STRING_8 is
                -- Serialization of `Current' as bytes of UTF-32BE representation.
            do
            ensure
                four_times_longer: Result.count = 4 * count
            end

        to_utf_32_le: !STRING_8 is
                -- Serialization of `Current' as bytes of UTF-32LE representation.
            do
            ensure
                four_times_longer: Result.count = 4 * count
            end

        The question remains what {STRING_32}.out should produce. Perhaps it should be platform-specific (and may be finer grained than just Windows v. POSIX - different Windows configurations may have different natural defaults - I'm not sure about this).
        
        Colin Adams
        
        Manu (17 years ago 20/3/2007)
        I believe it does not matter whether or not those routines are in STRING_GENERAL. It might be better to have them outside: you may want to serialize the data into something other than a string, and keeping the routines outside reduces code duplication.
        
        For {STRING_32}.out, I don't think it is a major issue. The default implementation of `out' is compiler defined at the moment to be STRING. In the future we may want to change this to be STRING_32, but for the time being, a truncated version of the STRING_32 representation is fine with me, since `out' has different semantics depending on the Eiffel class. In my opinion, using `out' for encoding would be really wrong.