Introducing class ZSTRING

by Finnian Reilly (modified: 2015 Nov 17)

Contents

Introduction
The Algorithm
Asian Languages
Test Suite
Extra Routines
Code EL_ZSTRING (alias ZSTRING)
Code EL_UNENCODED_CHARACTERS
Code EL_ZCODEC
Code STRING_BENCHMARK

Introduction

I am currently working on new kind of string class: ZSTRING. Z-strings are essentially a compressed STRING_32 offering similar performance characteristics but with a much smaller memory footprint. Large strings composed mainly of characters found in the Latin-15 character-set use up to 75% less memory. Some preliminary benchmarks can be found here: http://www.eiffel-loop.com/benchmarks/zstring.html

The first benchmark set shows the best case scenario and the second the worst case. The letters A, B, C, D are codes for the test strings shown in a table at the bottom of the benchmark page. Each benchmark is the average of 100 runs with a full garbage collect at the end of each.

Z-strings have a precursor called EL_ASTRING (alias ASTRING) which I wrote about some years ago in this article.

However Z-strings use a very different kind of compression algorithm that removes one of the fundamental limitations of A-strings which is they were limited to a maximum of 26 unicode overflows per string. By overflow is meant characters which are not mappable to latin-15. With Z-string you can include an unlimited amount of Unicode overflow. The Z-string code is also a lot simpler.

The Algorithm

Z-string works by using a hybrid of Latin and Unicode character sets. This means that for languages which work well with 8 bit character-sets, the strings are stored as 8 bit character sequences. Z-strings are mainly initialized from Unicode or UTF-8 strings. Any character which is not translatable with the current system codec is put into an overflow array unencoded, which is composed of a sequence of substrings. Each substring is a start_index followed by an end_index followed by the unicode substring characters. By default unencoded is assigned the once singleton: Empty_unencoded.

unencoded: SPECIAL [NATURAL] You can choose any Latin or Windows codec to be your system codec depending on the selected language locale of the user machine. Latin-15 is fine for most Western-European languages.

Asian Languages

However looking at the benchmarks it becomes obvious that Z-strings will inefficient for handling Asian text which is mostly composed of double byte characters. The solution I envisage for this is to have a second Z-string implemenation that is available as an ECF compile option. The Asian implemenation of ZSTRING will inherit from a class EL_ZSTRING_16 which operate on double byte character sequences. Any odd characters requiring more than 2 bytes can be stored in the unencoded overflow array. So even with Asian languages, I would expect to see a substantially smaller memory footprint compared to STRING_32.

Test Suite

I have been able to reuse an extensive set of tests developed for EL_ASTRING that prove equivalent results to STRING_32 routines.

Extra Routines

There are also a number of fancy routines in EL_ASTRING that are not found in the STRING_32. Later I will add these to ZSTRING. The code will be the same. The most useful of these is an operator which every Python programmer will miss if he switches to Eiffel.

template_demo local template: ASTRING; count: INTEGER do count := 2 template := "The value of $S is $S" log_or_io.put_string (template #$ ["count", count]) log_or_io.put_new_line end This is equivalent to the Python code def template_demo (): count = 2 template = "The value of %s is %s" print template % ("count", count) Program output: The value of count is 2

There is also a useful routine translate, which is similar to the Python String translate routine. It is used to either substitute or delete selected characters from a string.

There are about 10 other useful routines not found in STRING_32

Code EL_ZSTRING (alias ZSTRING)

EL_ZSTRING inherits from class EL_ZSTRING_8 which contains routines copied from STRING_8. EL_ZSTRING wouldn't work if I just inherited from STRING_8 due to various contract violations that are not possible to override. Since the class will be used a lot, it will be conveniently aliased as ZSTRING in the ECF.

Note that this is work in progress and there are still many routines to be implemented. class EL_ZSTRING inherit STRING_GENERAL rename append as append_unicode_general, append_substring as append_substring_general, code as raw_code, prepend as prepend_string_general, prepend_substring as prepend_substring_general, same_caseless_characters as same_caseless_characters_general, starts_with as starts_with_general, ends_with as ends_with_general, is_case_insensitive_equal as is_case_insensitive_equal_general undefine has, index_of, last_index_of, occurrences, out redefine -- Access hash_code, out, as_string_32, to_string_32, -- Element change append_unicode_general, copy, prepend_string_general, -- Comparison is_equal, same_characters, -- Conversion split select has end EL_ZSTRING_8 rename append as append_8, ends_with as ends_with_8, fill_character as fill_character_8, has as has_8, hash_code as hash_code_8, item as item_8, index_of as index_of_8, keep_head as keep_head_8, keep_tail as keep_tail_8, linear_representation as linear_representation_8, make as make_8, occurrences as occurrences_8, same_characters as same_characters_8, same_string as same_string_8, share as share_8, string as string_8, substring as substring_8, String_searcher as String_8_searcher, wipe_out as wipe_out_8 export {EL_ZSTRING, EL_ZSTRING_SEARCHER} area_lower undefine copy, is_equal redefine changeable_comparison_criterion, make_from_string end EL_UNENCODED_CHARACTERS rename append as append_unencoded, grow as unencoded_grow, hash_code as unencoded_hash_code, has as unencoded_has, make as make_unencoded, make_from_other as make_unencoded_from_other, overlaps as overlaps_unencoded, put_code as put_unencoded_code, shifted as shifted_unencoded, substring as unencoded_substring, same_string as same_unencoded_string export {EL_ZSTRING} all {EL_ZSTRING_SEARCHER, EL_ZSTRING} unencoded_code, set_unencoded {STRING_HANDLER} unencoded, has_unencoded undefine is_equal, copy, out redefine set_unencoded end READABLE_INDEXABLE [CHARACTER_32] undefine out, copy, is_equal end INDEXABLE [CHARACTER_32, INTEGER] undefine copy, is_equal, out redefine prune_all, changeable_comparison_criterion select linear_representation end RESIZABLE [CHARACTER_32] undefine copy, is_equal, out redefine changeable_comparison_criterion end EL_SHARED_ZCODEC export {NONE} all undefine is_equal, copy, out end EL_SHARED_ONCE_STRINGS export {NONE} all undefine is_equal, copy, out end EL_MODULE_UTF export {NONE} all undefine is_equal, copy, out end create make, make_empty, make_from_string, make_from_unicode, make_from_latin_1, make_from_utf_8, make_shared, make_from_other, make_filled, make_from_components, make_from_latin1_c -- make_from_string_view convert make_from_unicode ({STRING_32}), make_from_latin_1 ({STRING_8}), -- make_from_string_view ({EL_STRING_VIEW}), to_unicode: {STRING_32} feature {NONE} -- Initialization make (n: INTEGER) -- Allocate space for at least `n' characters. do make_8 (n) make_unencoded end make_filled (uc: CHARACTER_32; n: INTEGER) -- Create string of length `n' filled with `c'. require valid_count: n >= 0 do make (n) fill_character (uc) ensure count_set: count = n area_allocated: capacity >= n filled: occurrences (uc) = count end make_from_components (s: STRING; a_unencoded: like unencoded) do make_filled ('%U', s.count) area.copy_data (s.area, 0, 0, s.count) unencoded := a_unencoded end make_from_latin1_c (latin1_ptr: POINTER) do make_from_latin_1 (create {STRING}.make_from_c (latin1_ptr)) end make_from_other (other: EL_ZSTRING) do area := other.area.twin count := other.count make_unencoded_from_other (other) end feature {NONE} -- Initialization make_from_string (s: STRING) -- initialize with string that has the same encoding as codec do Precursor (s) make_unencoded end make_from_unicode, make_from_latin_1 (a_unicode: READABLE_STRING_GENERAL) do make_filled ('%U', a_unicode.count) encode (a_unicode, 0) end make_from_utf_8 (utf_8: READABLE_STRING_8) local l_unicode: STRING_32 do l_unicode := empty_once_string_32 UTF.utf_8_string_8_into_string_32 (utf_8, l_unicode) make_from_unicode (l_unicode) end -- make_from_string_view (matched_text: EL_STRING_VIEW) -- do -- make_from_other (matched_text.to_string) -- end make_shared (other: like Current) do share (other) end feature -- Access fuzzy_index (other: READABLE_STRING_GENERAL; start: INTEGER; fuzz: INTEGER): INTEGER -- <Precursor> do -- Result := string_searcher.fuzzy_index (Current, other, start, count, fuzz) end hash_code: INTEGER -- Hash code value do Result := internal_hash_code if Result = 0 then Result := unencoded_hash_code (hash_code_8) internal_hash_code := Result end end index_of (uc: CHARACTER_32; start_index: INTEGER): INTEGER local c: CHARACTER; uc_code: NATURAL; i, l_count: INTEGER; l_area: like area do c := codec.encoded_character (uc.natural_32_code) if c = Unencoded_character then uc_code := uc.natural_32_code l_area := area; l_count := count from i := start_index until Result > 0 or else i > l_count loop if l_area [i - 1] = c and then uc_code = unencoded_code (i) then Result := i end i := i + 1 end else Result := index_of_8 (c, start_index) end ensure then valid_result: Result = 0 or (start_index <= Result and Result <= count) zero_if_absent: (Result = 0) = not substring (start_index, count).has (uc) found_if_present: substring (start_index, count).has (uc) implies item (Result) = uc none_before: substring (start_index, count).has (uc) implies not substring (start_index, Result - 1).has (uc) end item alias "[]", at alias "@" (i: INTEGER): CHARACTER_32 assign put -- Unicode character at position `i' local c: CHARACTER do c := item_8 (i) if c = Unencoded_character then Result := unencoded_code (i).to_character_32 else Result := codec.as_unicode (c) end end last_index_of (uc: CHARACTER_32; start_index_from_end: INTEGER): INTEGER local c: CHARACTER; uc_code: NATURAL; i, l_count: INTEGER; l_area: like area do c := codec.encoded_character (uc.natural_32_code) if c = Unencoded_character then uc_code := uc.natural_32_code l_area := area; l_count := count from i := start_index_from_end until Result > 0 or else i = 0 loop if l_area [i - 1] = c and then uc_code = unencoded_code (i) then Result := i end i := i - 1 end else Result := last_index_of (c, start_index_from_end) end end raw_code (i: INTEGER): NATURAL_32 -- code which can be latin or unicode depending on presence of unencoded characters local c: CHARACTER do c := area [i-1] if c = Unencoded_character then Result := unencoded_code (i) else Result := c.natural_32_code end end share (other: like Current) do share_8 (other) unencoded := other.unencoded end substring_index (other: READABLE_STRING_GENERAL; start_index: INTEGER): INTEGER do Result := String_searcher.substring_index (Current, adapted_general (other), start_index, count) end substring_index_in_bounds (other: READABLE_STRING_GENERAL; start_pos, end_pos: INTEGER): INTEGER -- <Precursor> do -- Result := string_searcher.substring_index (Current, other, start_pos, end_pos) end unicode (i: INTEGER): NATURAL_32 local c: CHARACTER do c := area [i - 1] if c = Unencoded_character then Result := unencoded_code (i) else Result := codec.as_unicode (c).natural_32_code end end unicode_item (i: INTEGER): CHARACTER_32 do Result := unicode (i).to_character_32 end feature -- Measurement index_set: INTEGER_INTERVAL -- Range of acceptable indexes do create Result.make (1, count) ensure then index_set_not_void: Result /= Void index_set_count: Result.count = count end occurrences (uc: CHARACTER_32): INTEGER local c: CHARACTER; i, nb: INTEGER; l_area: like area; uc_code: NATURAL do c := codec.encoded_character (uc.natural_32_code) if c = Unencoded_character then uc_code := uc.natural_32_code l_area := area; nb := count from i := 0 until i = nb loop if l_area [i] = c and then uc_code = unencoded_code (i + 1) then Result := Result + 1 end i := i + 1 end else Result := occurrences_8 (c) end end feature -- Status query changeable_comparison_criterion: BOOLEAN = False ends_with (str: READABLE_STRING_GENERAL): BOOLEAN local i, offset, l_count: INTEGER do if attached {like Current} str as zstring then Result := ends_with_8 (zstring) if Result and then zstring.has_unencoded then -- match unencoded l_count := zstring.count; offset := count - l_count from i := 1 until i = l_count or else not Result loop Result := Result and zstring.unencoded_code (i) = unencoded_code (i + offset) i := i + 1 end end else Result := ends_with (adapted_general (str)) end end extendible: BOOLEAN = True -- May new items be added? (Answer: yes.) has (uc: CHARACTER_32): BOOLEAN -- Does string include `uc'? local c: CHARACTER do c := codec.encoded_character (uc.natural_32_code) if c = Unencoded_character then Result := unencoded_has (uc.natural_32_code) else Result := has_8 (c) end end is_string_8: BOOLEAN = False -- <Precursor> is_string_32: BOOLEAN = True -- <Precursor> is_substring_whitespace (start_index, end_index: INTEGER): BOOLEAN do end is_valid_as_string_8: BOOLEAN do Result := not has_unencoded end prunable: BOOLEAN -- May items be removed? (Answer: yes.) do Result := True end valid_code (code: NATURAL_32): BOOLEAN -- Is `code' a valid code for a CHARACTER_32? do Result := True end feature -- Element change append (s: like Current) local old_count: INTEGER do if s.has_unencoded then old_count := count append_8 (s) append_unencoded (s.shifted_unencoded (old_count)) else append_8 (s) end end append_character, extend (uc: CHARACTER_32) do append_unicode (uc.natural_32_code) end append_string (s: READABLE_STRING_GENERAL) do if attached {like Current} s as zstring then append (zstring) else append_unicode_general (s) end end append_unicode (c: like unicode) -- Append `c' at end. -- It would be nice to make this routine over ride 'append_code' but unfortunately -- the post condition links it to 'code' and for performance reasons it is undesirable to have -- code return unicode. local l_count: INTEGER do l_count := count + 1 if l_count > capacity then resize (l_count) end set_count (l_count) put_unicode (c, l_count) ensure then item_inserted: unicode (count) = c new_count: count = old count + 1 stable_before: elks_checking implies substring (1, count - 1) ~ (old twin) end append_unicode_general (s: READABLE_STRING_GENERAL) local old_count: INTEGER do old_count := count grow (old_count + s.count) set_count (old_count + s.count) encode (s, old_count) reset_hash ensure then valid_unencoded_count: unencoded_count = occurrences_8 (Unencoded_character) unicode_agrees: to_unicode ~ old (to_unicode + s) end fill_character (uc: CHARACTER_32) local c: CHARACTER do c := codec.encoded_character (uc.natural_32_code) if c = Unencoded_character then else fill_character_8 (c) end end left_adjust do end prepend_string_general (s: READABLE_STRING_GENERAL) do end put (c: CHARACTER_32; i: INTEGER) -- Replace character at position `i' by `c'. do -- area.put (c, i - 1) internal_hash_code := 0 ensure then stable_count: count = old count stable_before_i: elks_checking implies substring (1, i - 1) ~ (old substring (1, i - 1)) stable_after_i: elks_checking implies substring (i + 1, count) ~ (old substring (i + 1, count)) end put_code (code: NATURAL_32; i: INTEGER) -- Replace character at position `i' by character of code `v'. do end put_unicode (a_code: NATURAL_32; i: INTEGER) -- put unicode at i th position require -- from STRING_GENERAL valid_index: valid_index (i) local c: CHARACTER; l_area: like area do l_area := area c := codec.encoded_character (a_code) l_area [i - 1] := c if c = Unencoded_character then put_unencoded_code (a_code, i) end internal_hash_code := 0 ensure inserted: unicode (i) = a_code stable_count: count = old count stable_before_i: Elks_checking implies substring (1, i - 1) ~ (old substring (1, i - 1)) stable_after_i: Elks_checking implies substring (i + 1, count) ~ (old substring (i + 1, count)) end right_adjust do end feature {EL_ZSTRING} -- Element change set_unencoded (a_unencoded: like unencoded) do unencoded := a_unencoded ensure then valid_unencoded: is_unencoded_valid end feature -- Removal keep_head (n: INTEGER) -- Remove all characters except for the first `n'; -- do nothing if `n' >= `count'. local old_count: INTEGER do old_count := count keep_head_8 (n) if n < old_count and then has_unencoded then set_unencoded (unencoded_substring (1, n).unencoded) end end keep_tail (n: INTEGER) -- Remove all characters except for the last `n'; -- do nothing if `n' >= `count'. local old_count: INTEGER do old_count := count keep_tail_8 (n) if n < old_count and then has_unencoded then set_unencoded (unencoded_substring (old_count - n + 1, old_count).unencoded) end end prune (uc: CHARACTER_32) do end prune_all (uc: CHARACTER_32) do end remove (i: INTEGER) do end remove_tail (n: INTEGER) -- Remove last `n' characters; -- if `n' > `count', remove all. require n_non_negative: n >= 0 local l_count: INTEGER do l_count := count if n > l_count then count := 0 internal_hash_code := 0 else keep_head (l_count - n) end ensure removed: elks_checking implies Current ~ (old substring (1, count - n.min (count))) end wipe_out do wipe_out_8 unencoded := Empty_unencoded end feature -- Conversion as_lower: like Current -- New object with all letters in lower case. do Result := twin Result.to_lower end as_upper: like Current -- New object with all letters in upper case do Result := twin Result.to_upper end as_string_32, to_string_32, to_unicode: STRING_32 -- UCS-4 do create Result.make_filled ('%/000/', count) codec.decode (count, area, Result.area) if has_unencoded then write (Result) end end linear_representation: LINEAR [CHARACTER_32] do end plus alias "+" (s: READABLE_STRING_GENERAL): like Current -- <Precursor> do Result := new_string (count + s.count) Result.append (Current) Result.append_unicode_general (s) end split (a_separator: CHARACTER_32): LIST [like Current] -- Split on `a_separator'. local l_list: ARRAYED_LIST [like Current]; part: like Current; i, j, l_count: INTEGER separator: CHARACTER; is_unencoded_separator: BOOLEAN do separator := codec.encoded_character (a_separator.natural_32_code) is_unencoded_separator := separator = Unencoded_character l_count := count -- Worse case allocation: every character is a separator if is_unencoded_separator then create l_list.make (occurrences (a_separator) + 1) else create l_list.make (occurrences_8 (separator) + 1) end if l_count > 0 then from i := 1 until i > l_count loop if is_unencoded_separator then j := index_of (a_separator, i) else j := index_of_8 (separator, i) end if j = 0 then -- No separator was found, we will -- simply create a list with a copy of -- Current in it. j := l_count + 1 end part := substring (i, j - 1) l_list.extend (part) i := j + 1 end if j = l_count then check last_character_is_a_separator: item (j) = a_separator end -- A separator was found at the end of the string l_list.extend (new_string (0)) end else -- Extend empty string, since Current is empty. l_list.extend (new_string (0)) end Result := l_list check l_list.count = occurrences (a_separator) + 1 end end to_upper do end to_lower do end feature -- Duplication copy (other: like Current) do end substring (start_index, end_index: INTEGER): like Current -- Copy of substring containing all characters at indices -- between `start_index' and `end_index' do -- Cannot use Precursor because post-conditions will become invalid if (1 <= start_index) and (start_index <= end_index) and (end_index <= count) then Result := new_string (end_index - start_index + 1) Result.area.copy_data (area, start_index - 1, 0, end_index - start_index + 1) Result.set_count (end_index - start_index + 1) else Result := new_string (0) end if has_unencoded then Result.set_unencoded (unencoded_substring (start_index, end_index).unencoded) end ensure then valid_unencoded_count: Result.unencoded_count = Result.occurrences_8 (Unencoded_character) end feature -- Comparison is_equal (other: like Current): BOOLEAN local l_count: INTEGER; l_hash, l_other_hash: like internal_hash_code do if other = Current then Result := True else l_count := count if l_count = other.count then -- Let's compare the content if and only if the hash_code are the same or not yet computed. l_hash := internal_hash_code l_other_hash := other.internal_hash_code if l_hash = 0 or else l_other_hash = 0 or else l_hash = l_other_hash then Result := same_string (other) end end end end is_less alias "<" (other: like Current): BOOLEAN -- Is string lexicographically lower than `other'? local other_count, l_count: INTEGER do if other /= Current then other_count := other.count l_count := count if other_count = l_count then Result := order_comparison (other_count, other) > 0 else if l_count < other_count then Result := order_comparison (l_count, other) >= 0 else Result := order_comparison (other_count, other) > 0 end end end end same_characters (other: READABLE_STRING_GENERAL; start_pos, end_pos, index_pos: INTEGER): BOOLEAN local l_has_unencoded, other_has_unencoded: BOOLEAN do if attached {like Current} other as z_other then if has_unencoded or else z_other.has_unencoded then Result := same_unencoded_string (z_other) and then same_characters_8 (z_other, start_pos, end_pos, index_pos) else Result := same_characters_8 (z_other, start_pos, end_pos, index_pos) end else Result := Precursor (other, start_pos, end_pos, index_pos) end end feature {EL_ZSTRING} -- Contract Support is_unencoded_valid: BOOLEAN local i, j, lower, upper, l_count, sum_count, array_count: INTEGER l_unencoded: like unencoded; l_area: like area do l_area := area; l_unencoded := unencoded; array_count := l_unencoded.count Result := True from i := 0 until not Result or else i = array_count loop lower := l_unencoded.item (i).to_integer_32; upper := l_unencoded.item (i + 1).to_integer_32 l_count := upper - lower + 1 from j := lower until not Result or else j > upper loop Result := Result and l_area [j - 1] = Unencoded_character j := j + 1 end sum_count := sum_count + l_count i := i + l_count + 2 end Result := Result and occurrences_8 (Unencoded_character) = sum_count end feature {STRING_HANDLER} -- Implementation frozen set_count (number: INTEGER) -- Set `count' to `number' of characters. do count := number reset_hash end feature {NONE} -- Implementation adapted_general (str: READABLE_STRING_GENERAL): like Current do if attached {like Current} str as zstr then Result := zstr else create Result.make_from_unicode (str) end end encode (a_unicode: READABLE_STRING_GENERAL; area_offset: INTEGER) require valid_area_offset: a_unicode.count > 0 implies area.valid_index (a_unicode.count + area_offset - 1) local l_unencoded: like Once_extendible_unencoded do l_unencoded := Once_extendible_unencoded; l_unencoded.wipe_out codec.encode (a_unicode, area, area_offset, l_unencoded) if l_unencoded.unencoded.count > 0 then if has_unencoded then append_unencoded (l_unencoded) else unencoded := l_unencoded.unencoded_copy end end end new_string (n: INTEGER): like Current -- New instance of current with space for at least `n' characters. do create Result.make (n) end order_comparison (n: INTEGER; other: like Current): INTEGER -- Compare `n' characters from `area' starting at `area_lower' with -- `n' characters from and `other' starting at `other.area_lower'. -- 0 if equal, < 0 if `Current' < `other', > 0 if `Current' > `other' require other_not_void: other /= Void n_non_negative: n >= 0 n_valid: n <= (area.upper - other.area_lower + 1) and n <= (other.area.upper - area_lower + 1) local i, j, nb, index_other, index: INTEGER; l_code, other_code: NATURAL; c, other_c: CHARACTER other_area, l_area: like area do l_area := area; other_area := other.area; index := area_lower; index_other := other.area_lower from i := index_other; nb := i + n; j := index until i = nb loop c := l_area [i]; other_c := other_area [i] if c = Unencoded_character then -- Do Unicode comparison l_code := unencoded_code (i + 1) if other_c = Unencoded_character then other_code := other.unencoded_code (i + 1) else other_code := codec.as_unicode (other_c).natural_32_code end else if other_c = Unencoded_character then -- Do Unicode comparison l_code := codec.as_unicode (c).natural_32_code other_code := other.unencoded_code (i + 1) else l_code := c.natural_32_code other_code := other_c.natural_32_code end end if l_code /= other_code then if l_code < other_code then Result := (other_code - l_code).to_integer_32 else Result := -(l_code - other_code).to_integer_32 end i := nb - 1 -- Jump out of loop end i := i + 1; j := j + 1 end end reset_hash do internal_hash_code := 0 end feature {NONE} -- Constants Codec: EL_ZCODEC once Result := system_codec end Once_like_current: EL_ZSTRING once create Result.make (0) end Once_extendible_unencoded: EL_EXTENDABLE_UNENCODED_CHARACTERS do create Result.make end String_searcher: EL_ZSTRING_SEARCHER once create Result.make end Unencoded_character: CHARACTER = '%/026/' end

Code EL_UNENCODED_CHARACTERS

note description: "[ Representation of consecutive substrings in a STRING_32 string that could not be encoded using a latin character set. The substring are held in the array unecoded: SPECIAL [CHARACTER_32] Each substring is prececded by two 32 bit characters representing the lower and upper index. ]" author: "Finnian Reilly" class EL_UNENCODED_CHARACTERS create make, make_from_other feature {NONE} -- Initialization make do unencoded := Empty_unencoded end make_from_other (other: EL_UNENCODED_CHARACTERS) do if other.has_unencoded then unencoded := other.unencoded.twin else make end end feature -- Access hash_code (seed: INTEGER): INTEGER -- Hash code value local i, offset, lower, upper, count, array_count: INTEGER; l_unencoded: like unencoded do Result := seed l_unencoded := unencoded; array_count := l_unencoded.count from i := 0 until i = array_count loop lower := l_unencoded.item (i).to_integer_32; upper := l_unencoded.item (i + 1).to_integer_32 count := upper - lower + 1 from i := i + 2; offset := 0 until offset = count loop -- The magic number `8388593' below is the greatest prime lower than -- 2^23 so that this magic number shifted to the left does not exceed 2^31. Result := ((Result \\ 8388593) |<< 8) + l_unencoded.item (i + offset).to_integer_32 offset := offset + 1 end i := i + count end end first_lower: INTEGER local l_unencoded: like unencoded do l_unencoded := unencoded if l_unencoded.count > 0 then Result := l_unencoded.item (0).to_integer_32 end end last_upper: INTEGER -- count when substrings are expanded to original source string local i, lower, upper, array_count: INTEGER; l_unencoded: like unencoded do l_unencoded := unencoded; array_count := l_unencoded.count from i := 0 until i = array_count loop lower := l_unencoded.item (i).to_integer_32; upper := l_unencoded.item (i + 1).to_integer_32 i := i + upper - lower + 3 end Result := upper end unencoded_count: INTEGER -- count when substrings are expanded to original source string local i, lower, upper, count, array_count: INTEGER; l_unencoded: like unencoded do l_unencoded := unencoded; array_count := l_unencoded.count from i := 0 until i = array_count loop lower := l_unencoded.item (i).to_integer_32; upper := l_unencoded.item (i + 1).to_integer_32 count := upper - lower + 1 Result := Result + count i := i + count + 2 end end unencoded_code (index: INTEGER): NATURAL local i, lower, upper, array_count: INTEGER; l_unencoded: like unencoded do l_unencoded := unencoded; array_count := l_unencoded.count from i := 0 until Result > 0 or else i = array_count loop lower := l_unencoded.item (i).to_integer_32; upper := l_unencoded.item (i + 1).to_integer_32 i := i + 2 if lower <= index and then index <= upper then Result := l_unencoded [i + index - lower] end i := i + upper - lower + 1 end end intervals: ARRAYED_LIST [INTEGER_INTERVAL] local i, lower, upper, count, array_count: INTEGER; l_unencoded: like unencoded do create Result.make (3) l_unencoded := unencoded; array_count := l_unencoded.count from i := 0 until i = array_count loop lower := l_unencoded.item (i).to_integer_32; upper := l_unencoded.item (i + 1).to_integer_32 count := upper - lower + 1 Result.extend (lower |..| upper) i := i + count + 2 end end feature -- Access attributes unencoded: SPECIAL [NATURAL] feature -- Status query has (unicode: NATURAL): BOOLEAN local i, j, lower, upper, count, array_count: INTEGER; l_unencoded: like unencoded do l_unencoded := unencoded; array_count := l_unencoded.count from i := 0 until Result or else i = array_count loop lower := l_unencoded.item (i).to_integer_32; upper := l_unencoded.item (i + 1).to_integer_32 count := upper - lower + 1 from j := 1 until Result or else j > count loop if l_unencoded [i + 1 + j] = unicode then Result := True end j := j + 1 end i := i + count + 2 end end has_unencoded: BOOLEAN do Result := unencoded.count > 0 end overlaps (start_index, end_index: INTEGER): BOOLEAN do Result := not (end_index < first_lower) and then not (start_index > last_upper) end feature -- Comparison same_string (other: EL_UNENCODED_CHARACTERS): BOOLEAN local l_unencoded: like unencoded do l_unencoded := unencoded if l_unencoded.count = other.unencoded.count then Result := l_unencoded.same_items (other.unencoded, 0, 0, l_unencoded.count) end end feature -- Element change append (other: EL_UNENCODED_CHARACTERS) require other_has_unencoded: other.has_unencoded local i, lower, upper, count, array_count: INTEGER; l_unencoded, other_unencoded: like unencoded do if has_unencoded then other_unencoded := other.unencoded l_unencoded := unencoded; array_count := l_unencoded.count from i := 0 until i = array_count loop lower := l_unencoded.item (i).to_integer_32; upper := l_unencoded.item (i + 1).to_integer_32 count := upper - lower + 1 i := i + count + 2 end if upper + 1 = other.first_lower then -- merge intervals l_unencoded := resized (l_unencoded, array_count + other_unencoded.count - 2) l_unencoded.copy_data (other_unencoded, 2, i, other_unencoded.count - 2) l_unencoded.put (other_unencoded [1], i - count - 1) else l_unencoded := resized (l_unencoded, array_count + other_unencoded.count) l_unencoded.copy_data (other_unencoded, 0, i, other_unencoded.count) end else unencoded := other.unencoded end ensure valid_count: unencoded_count = old unencoded_count + other.unencoded_count end append_interval (a_unencoded: like unencoded; source_index, lower, upper: INTEGER) local old_count, count: INTEGER; l_unencoded: like unencoded do l_unencoded := unencoded; old_count := unencoded.count; count := upper - lower + 1 l_unencoded := resized (l_unencoded, old_count + count + 2) l_unencoded.put (lower.to_natural_32, old_count) l_unencoded.put (upper.to_natural_32, old_count + 1) l_unencoded.copy_data (a_unencoded, source_index, old_count + 2, count) ensure count_increased_by_count: unencoded_count = old unencoded_count + upper - lower + 1 end put_code (code: NATURAL; index: INTEGER) local i, previous_i, previous_upper, lower, upper, count: INTEGER; found: BOOLEAN l_unencoded: like unencoded do l_unencoded := unencoded from i := 0 until found or i = l_unencoded.count loop lower := l_unencoded.item (i).to_integer_32; upper := l_unencoded.item (i + 1).to_integer_32 count := upper - lower + 1 if lower <= index and then index <= upper then -- update interval l_unencoded.put (code, i + 2 + index - lower) found := True elseif upper + 1 = index then -- extend interval right count := count + 1 l_unencoded := extended (i + count + 1, 0, 0, code) l_unencoded.put (index.to_natural_32, i + 1) found := True elseif lower - 1 = index then -- extend interval left count := count + 1 lower := lower - 1 l_unencoded := extended (i + 2, 0, 0, code) l_unencoded.put (index.to_natural_32, i) found := True elseif previous_upper < index and then index < lower then -- insert a new interval l_unencoded := extended (previous_i, index, index, code) i := i + 3 found := True end if previous_upper > 0 and then previous_upper + 1 = lower then -- Merge interval with previous l_unencoded.overlapping_move (i + 2, i, l_unencoded.count - i - 2) l_unencoded.remove_tail (2) i := previous_i; upper := previous_upper + count lower := l_unencoded.item (i).to_integer_32 l_unencoded.put (upper.to_natural_32, i + 1) count := upper - lower + 1 end previous_i := i; previous_upper := upper i := i + count + 2 end if not found then -- append a new interval l_unencoded := extended (l_unencoded.count, index, index, code) end end shift (n: INTEGER) -- shift intervals right by n characters. n < 0 shifts to the left. local i, lower, upper, count, array_count: INTEGER; l_unencoded: like unencoded do if n /= 0 then l_unencoded := unencoded; array_count := l_unencoded.count from i := 0 until i = array_count loop lower := l_unencoded.item (i).to_integer_32; upper := l_unencoded.item (i + 1).to_integer_32 count := upper - lower + 1 l_unencoded.put ((lower + n).to_natural_32, i) l_unencoded.put ((upper + n).to_natural_32, i + 1) i := i + count + 2 end end end set_unencoded (a_unencoded: like unencoded) do unencoded := a_unencoded end feature -- Basic operations write (expanded_str: STRING_32) -- write substrings into expanded string 'str' require string_big_enough: last_upper <= expanded_str.count local i, j, lower, upper, count, array_count: INTEGER; l_unencoded: like unencoded do l_unencoded := unencoded; array_count := l_unencoded.count from i := 0 until i = array_count loop lower := l_unencoded.item (i).to_integer_32; upper := l_unencoded.item (i + 1).to_integer_32 count := upper - lower + 1 from j := 1 until j > count loop expanded_str.put_code (l_unencoded.item (i + j + 1), lower + j - 1) j := j + 1 end i := i + count + 2 end end feature -- Duplication substring (start_index, end_index: INTEGER): EL_UNENCODED_CHARACTERS local i, lower, upper, count, array_count: INTEGER l_unencoded: like unencoded do create Result.make l_unencoded := unencoded; array_count := l_unencoded.count from i := 0 until i = array_count loop lower := l_unencoded.item (i).to_integer_32; upper := l_unencoded.item (i + 1).to_integer_32 count := upper - lower + 1 if lower <= start_index and then end_index <= upper then -- Append full interval Result.append_interval (l_unencoded, i + 2 + (start_index - lower), start_index, end_index) elseif start_index <= lower and then upper <= end_index then -- Append full interval Result.append_interval (l_unencoded, i + 2, lower, upper) elseif lower <= end_index and then end_index <= upper then -- Append left section Result.append_interval (l_unencoded, i + 2, lower, end_index) elseif lower <= start_index and then start_index <= upper then -- Append right section Result.append_interval (l_unencoded, i + 2 + (start_index - lower), start_index, upper) end i := i + count + 2 end Result.shift (-(start_index - 1)) end shifted (n: INTEGER): EL_UNENCODED_CHARACTERS do create Result.make_from_other (Current) Result.shift (n) end feature {NONE} -- Implementation resized (old_unencoded: like unencoded; new_count: INTEGER): like unencoded local i, delta: INTEGER do Result := old_unencoded if new_count > Result.capacity then grow (new_count) Result := unencoded end delta := new_count - Result.count from i := 1 until i > delta loop Result.extend (0) i := i + 1 end end grow (new_count: INTEGER) local l_unencoded: like unencoded do l_unencoded := unencoded if new_count > l_unencoded.capacity then create unencoded.make_empty (new_count + l_unencoded.capacity // 2) unencoded.insert_data (l_unencoded, 0, 0, l_unencoded.count) end ensure big_enough: new_count <= unencoded.capacity end extended (destination_index, lower, upper: INTEGER; code: NATURAL): like unencoded local insert: like unencoded do Result := unencoded; insert := Unencoded_insert; insert.wipe_out if lower > 0 then insert.extend (lower.to_natural_32) end if upper > 0 then insert.extend (upper.to_natural_32) end insert.extend (code) if Result.count + insert.count > Result.capacity then grow (unencoded.count + insert.count) Result := unencoded end Result.insert_data (insert, 0, destination_index, insert.count) end feature {NONE} -- Constants Empty_unencoded: SPECIAL [NATURAL] once create Result.make_empty (0) end Unencoded_insert: SPECIAL [NATURAL] once create Result.make_empty (3) end Minimum_capacity: INTEGER = 3 end

Code EL_ZCODEC

deferred class EL_ZCODEC inherit EL_MODULE_UTF STRING_HANDLER feature {NONE} -- Initialization make do create latin_characters.make_filled ('%U', 1) unicode_table := create_unicode_table initialize_latin_sets end feature -- Access id: INTEGER deferred end name: STRING do create Result.make_from_string (type) Result.append_character ('-') Result.append_integer (id) end type: STRING deferred end feature -- Basic operations decode (a_count: INTEGER; latin_chars_in: SPECIAL [CHARACTER]; unicode_chars_out: SPECIAL [CHARACTER_32]) -- Replace Ctrl characters used as place holders for foreign characters with original unicode characters. -- If 'a_decode' is true encode output as unicode -- Result is count of unencodeable Unicode characters require enough_latin_chars_in: latin_chars_in.count > a_count unicode_chars_out_big_enough: unicode_chars_out.count > a_count local i, code: INTEGER; c: CHARACTER; l_unicodes: like unicode_table do l_unicodes := unicode_table from i := 0 until i = a_count loop c := latin_chars_in [i]; code := c.code if c /= Unencoded_character then unicode_chars_out [i] := l_unicodes.item (code) end i := i + 1 end end encode ( unicode_in: READABLE_STRING_GENERAL; latin_chars_out: SPECIAL [CHARACTER]; out_offset: INTEGER; unencoded_characters: EL_EXTENDABLE_UNENCODED_CHARACTERS ) -- encode unicode characters as latin -- Set unencodeable characters as the Substitute character (26) and record location in unencoded_intervals require latin_chars_out_big_enough: latin_chars_out.count >= unicode_in.count valid_offset_and_count: unicode_in.count > 0 implies latin_chars_out.valid_index (unicode_in.count + out_offset - 1) local i, count, unicode: INTEGER; uc: CHARACTER_32; c: CHARACTER; l_unicodes: like unicode_table do l_unicodes := unicode_table; count := unicode_in.count from i := 1 until i > count loop uc := unicode_in [i]; unicode := uc.code if unicode <= 255 and then l_unicodes [unicode] = uc then latin_chars_out [i + out_offset - 1] := uc.to_character_8 else c := latin_character (uc, unicode) if c.code = 0 then latin_chars_out [i + out_offset - 1] := Unencoded_character unencoded_characters.extend (uc, i + out_offset) else latin_chars_out [i + out_offset - 1] := c end end i := i + 1 end end feature -- Conversion as_unicode (c: CHARACTER): CHARACTER_32 -- lookup for character which there is not a place holder do Result := unicode_table [c.code] end encoded_character (unicode: NATURAL_32): CHARACTER local uc: CHARACTER_32; index: INTEGER do uc := unicode.to_character_32; index := unicode.to_integer_32 if unicode <= 255 and then unicode_table [index] = uc then Result := uc.to_character_8 else Result := latin_character (uc, index) if Result.code = 0 then Result := Unencoded_character end end end latin_character (uc: CHARACTER_32; unicode: INTEGER): CHARACTER -- unicode to latin translation -- Returns '%U' if translation is the same as ISO-8859-1 or else not in current set deferred ensure valid_latin: Result /= '%U' implies unicode_table [Result.code] = uc end feature {EL_ZSTRING} -- Implementation create_unicode_table: SPECIAL [CHARACTER_32] -- map latin to unicode deferred end initialize_latin_sets deferred end latin_set_from_array (array: ARRAY [INTEGER]): SPECIAL [CHARACTER] do create Result.make_empty (array.count) across array as c loop Result.extend (c.item.to_character_8) end end single_byte_unicode_chars: SPECIAL [CHARACTER_32] local i: INTEGER do create Result.make_filled ('%U', 256) from i := 0 until i > 255 loop Result [i] := i.to_character_32 i := i + 1 end end feature {NONE} -- Internal attributes latin_characters: SPECIAL [CHARACTER] unicode_table: like create_unicode_table -- map latin to unicode feature {NONE} -- Constants Unencoded_character: CHARACTER = '%/026/' end

Code STRING_BENCHMARK

deferred class STRING_BENCHMARK inherit MODULE_HEXAGRAM EL_MODULE_EIFFEL MEMORY feature {NONE} -- Initialization make (a_number_of_runs: like number_of_runs) do number_of_runs := a_number_of_runs create string_list.make (64) create tests.make (23) end feature -- Access tests: ARRAYED_LIST [ TUPLE [routines: STRING_32; input_format: STRING; output_count: INTEGER; average_time: DOUBLE; storage_size: INTEGER] ] feature -- Factory new_string (unicode: READABLE_STRING_GENERAL): STRING_GENERAL deferred end new_string_with_count (n: INTEGER): STRING_GENERAL deferred end feature -- Basic operations do_test ( routines: STRING_32; input_format: STRING; is_string_32_input: BOOLEAN procedure: PROCEDURE [STRING_BENCHMARK, TUPLE [like input_strings]] ) require valid_input_format: across input_format as c all c.item.is_alpha implies c.item.is_upper end local timer: EL_EXECUTION_TIMER; i, l_storage_bytes, output_count: INTEGER; args: TUPLE [like input_strings] l_input_format: STRING do create args args.put (input_strings (input_format, is_string_32_input), 1) full_collect create timer.make from i := 1 until i > number_of_runs loop procedure.call (args) if i + 1 > number_of_runs then across string_list as string loop l_storage_bytes := l_storage_bytes + storage_bytes (string.item) end end output_count := string_list.count string_list.wipe_out; full_collect i := i + 1 end timer.stop if input_format.has ('$') then create l_input_format.make (input_format.count + 2) l_input_format.append_character ('"') l_input_format.append (input_format) l_input_format.append_character ('"') else create l_input_format.make (input_format.count + 2) l_input_format.append ("<< ") l_input_format.append (input_format) l_input_format.append (" >>") end tests.extend ([routines, l_input_format, output_count, timer.elapsed_millisecs / number_of_runs, l_storage_bytes]) end feature -- Benchmark tests join_all (a_input_strings: like input_strings) local target: like new_string_with_count do target := new_string_with_count (joined_count (Hexagram.Chinese_characters, False)) across a_input_strings as strings loop append (target, strings.item.first) end string_list.extend (target) end join (a_input_strings: like input_strings) local target: like new_string string_general: READABLE_STRING_GENERAL do across a_input_strings as array loop target := new_string_with_count (joined_count (array.item, True)) across array.item as string loop if string.cursor_index = 1 then append (target, space) end append (target, string.item) end string_list.extend (target) end end find_word (word: like new_string; a_input_strings: like input_strings) local index: INTEGER do across a_input_strings as strings loop if attached {like new_string} strings.item.first as string then index := string.substring_index (word, 1) end end end sort (a_input_strings: like input_strings) local sortable: EL_SORTABLE_ARRAYED_LIST [like new_string] do create sortable.make (a_input_strings.count) across a_input_strings as strings loop if attached {like new_string} strings.item.first as string then sortable.extend (string) end end sortable.sort string_list := sortable end split (a_input_strings: like input_strings) do across a_input_strings as strings loop if attached {like new_string} strings.item.first as string then across string.split (' ') as word loop string_list.extend (word.item) end end end end feature {NONE} -- Implementation append (target: like new_string; s: READABLE_STRING_GENERAL) deferred end to_string_32 (string: like new_string): STRING_32 deferred end input_strings (format: STRING; is_string_32_input: BOOLEAN): ARRAYED_LIST [ARRAYED_LIST [READABLE_STRING_GENERAL]] local columns: like input_columns; parts_32, parts: ARRAYED_LIST [STRING_GENERAL] i: INTEGER do columns := input_columns (format) create Result.make (64) across Hexagram.string_arrays as array loop create parts_32.make (columns.count) from i := 1 until i > columns.upper loop parts_32.extend (array.item [columns [i]]) i := i + 1 end if format.has ('$') then create parts.make (1) parts.extend (new_string_general (joined_count (parts_32, True), is_string_32_input)) across parts_32 as part loop if part.cursor_index > 1 then parts.last.append (space) end if is_string_32_input then parts.last.append (part.item) else append (parts.last, part.item) end end else create parts.make (parts_32.count) across parts_32 as part_32 loop if is_string_32_input then parts.extend (part_32.item) else parts.extend (new_string (part_32.item)) end end end Result.extend (parts) end end new_string_general (n: INTEGER; is_string_32_input: BOOLEAN): STRING_GENERAL do if is_string_32_input then create {STRING_32} Result.make (n) else Result := new_string_with_count (n) end end input_columns (format: STRING): ARRAY [INTEGER] local c: CHARACTER; l_array: ARRAYED_LIST [CHARACTER]; i: INTEGER do create l_array.make (4) if format.has ('$') then across format.split (' ') as str loop l_array.extend (str.item [str.item.count]) end else across format.split (',') as str loop l_array.extend (str.item [1]) end end create Result.make (1, l_array.count) from i := 1 until i > l_array.count loop c := l_array [i] Result [i] := c.natural_32_code.to_integer_32 - {ASCII}.upper_a + 1 i := i + 1 end end joined_count (parts: LIST [READABLE_STRING_GENERAL]; with_separators: BOOLEAN): INTEGER do if with_separators then Result := parts.count - 1 end across parts as part loop Result := Result + part.item.count end end storage_bytes (s: like new_string): INTEGER deferred end feature {NONE} -- Internal attributes number_of_runs: INTEGER string_list: ARRAYED_LIST [like new_string] feature {NONE} -- Constants Space: STRING = " " end

Comments

Finnian Reilly (10 years ago 15/11/2015)
Bug in blogging software

For Eiffel Software: please note that due a strange bug in this blogging software I had to change a line of Eiffel code so the characters "[code]" do not appear in it. They cause the rest of the article to become garbled.