Mixing Unicode and Latin-1 class texts

by Colin Adams (modified: 2007 Mar 30)

Since ECMA allows class texts to be written as sequences of either CHARACTER_8 or CHARACTER_32 (which, although not yet properly specified, we can assume means Latin-1 or Unicode), the question arises of to what extent the two can be mixed.

It is clear that fully unrestricted mixing is not possible. For instance, if a class written in Unicode has a routine named 了, then this routine cannot be called from a class written in Latin-1 (unless it is passed as an agent).

I would suggest that a suitable rule is that all classes within a cluster must either be all CHARACTER_8 or all CHARACTER_32. Furthermore, no class in a CHARACTER_8 cluster may depend upon a class from a CHARACTER_32 cluster.
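The proposed rule can be sketched as a small checker. This is only an illustration in Python, not part of any Eiffel tool: the `Cluster` record, the `check_clusters` function, and the cluster names in the usage note are all hypothetical. For brevity it also assumes dependencies are recorded per cluster rather than per class.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Cluster:
    """Hypothetical description of a cluster (not a real ACE/ECF structure)."""
    name: str
    class_widths: Dict[str, int]                         # class name -> 8 or 32
    depends_on: Set[str] = field(default_factory=set)    # names of other clusters


def check_clusters(clusters: List[Cluster]) -> List[str]:
    """Return a list of violations of the proposed rule (empty if none)."""
    problems: List[str] = []
    widths: Dict[str, Optional[int]] = {}

    # Part 1: every cluster must be all CHARACTER_8 or all CHARACTER_32.
    for c in clusters:
        found = set(c.class_widths.values())
        if len(found) > 1:
            problems.append(f"{c.name}: mixes CHARACTER_8 and CHARACTER_32 classes")
        else:
            widths[c.name] = next(iter(found), None)

    # Part 2: a CHARACTER_8 cluster must not depend on a CHARACTER_32 cluster.
    for c in clusters:
        if widths.get(c.name) != 8:
            continue
        for dep in sorted(c.depends_on):
            if widths.get(dep) == 32:
                problems.append(
                    f"{c.name}: CHARACTER_8 cluster depends on CHARACTER_32 cluster {dep}"
                )
    return problems
```

For example, `check_clusters([Cluster("kernel", {"A": 8}, {"i18n"}), Cluster("i18n", {"B": 32})])` reports one violation, because the CHARACTER_8 cluster `kernel` depends on the CHARACTER_32 cluster `i18n`. The reverse dependency would be accepted.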

This rule suggests a requirement for the ACE/XACE/ECF file formats to be able to specify the character size used for writing class texts within a cluster (or perhaps a library too).

  • Colin Adams (17 years ago 31/3/2007)


    See also Heuristics for detecting class text encoding. Colin Adams

  • Manu (17 years ago 31/3/2007)

    Is this relevant?

    The source code will be a plain text file (i.e. a sequence of character codes between 0 and 255), UTF-8, or some other Unicode encoding. Once you have the encoding, the semantics are properly defined.

    Of course if one library author is using Unicode characters beyond 255, the user of that library will be forced to use a Unicode encoding for his source code, but is this relevant to the project specification? I don't think so.

    • Colin Adams (17 years ago 2/4/2007)


      The main benefit for homogeneous clusters is simpler heuristics - there is no possibility of confusing Latin-1 with UTF-8.

      If you can't eliminate the possibility of one or the other, I know of no way to disambiguate them in general.

      Of course, there are lots of cases where it is easier to see which of the two is meant. But in other cases, not.

      So, start from the case where the file is pure ASCII. Did the author intend it to be treated as Latin-1 or UTF-8?

      In this case, it doesn't matter (the only possible difference is the type of manifest string constants, but these are defined to be of type STRING).

      But all we have to do is to mutate one character in a string literal, and immediately (if we choose the mutation carefully), the case becomes undecidable.

      Colin Adams
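The ambiguity described in this thread can be illustrated with a short Python sketch. The helper names `decodings` and `is_ambiguous` and the sample string literals are invented for illustration. The sketch only assumes that the two candidate encodings are Latin-1 and UTF-8.

```python
def decodings(raw: bytes) -> dict:
    """Return every candidate reading of raw, keyed by encoding name."""
    # Latin-1 never fails: each of the 256 byte values maps to a character.
    result = {"latin-1": raw.decode("latin-1")}
    try:
        result["utf-8"] = raw.decode("utf-8")
    except UnicodeDecodeError:
        pass  # not well-formed UTF-8, so only the Latin-1 reading remains
    return result


def is_ambiguous(raw: bytes) -> bool:
    """True if raw is valid under both encodings but reads differently."""
    found = decodings(raw)
    return "utf-8" in found and found["utf-8"] != found["latin-1"]


pure_ascii = b'"cafe"'            # same text under both encodings
mutated = b'"caf\xc3\xa9"'        # "café" in UTF-8, "cafÃ©" in Latin-1
latin1_only = b'"caf\xe9"'        # "café" in Latin-1, malformed as UTF-8

print(is_ambiguous(pure_ascii))   # False
print(is_ambiguous(mutated))      # True
print(is_ambiguous(latin1_only))  # False
```

The middle case is the problematic one: the bytes are well-formed under both encodings, and nothing in the file itself says which reading the author intended.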