blogIntroducing class ZSTRING

finnianr's picture
Contents

Introduction

I am currently working on new kind of string class: ZSTRING. Z-strings are essentially a compressed STRING_32 offering similar performance characteristics but with a much smaller memory footprint. Large strings composed mainly of characters found in the Latin-15 character-set use up to 75% less memory. Some preliminary benchmarks can be found here: http://www.eiffel-loop.com/benchmarks/zstring.html

The first benchmark set shows the best case scenario and the second the worst case. The letters A, B, C, D are codes for the test strings shown in a table at the bottom of the benchmark page. Each benchmark is the average of 100 runs with a full garbage collect at the end of each.

Z-strings have a precursor called EL_ASTRING (alias ASTRING) which I wrote about some years ago in this article.

However Z-strings use a very different kind of compression algorithm that removes one of the fundamental limitations of A-strings which is they were limited to a maximum of 26 unicode overflows per string. By overflow is meant characters which are not mappable to latin-15. With Z-string you can include an unlimited amount of Unicode overflow. The Z-string code is also a lot simpler.

The Algorithm

Z-string works by using a hybrid of Latin and Unicode character sets. This means that for languages which work well with 8 bit character-sets, the strings are stored as 8 bit character sequences. Z-strings are mainly initialized from Unicode or UTF-8 strings. Any character which is not translatable with the current system codec is put into an overflow array unencoded, which is composed of a sequence of substrings. Each substring is a start_index followed by an end_index followed by the unicode substring characters. By default unencoded is assigned the once singleton: Empty_unencoded.

  You can choose any Latin or Windows codec to be your system codec depending on the selected language locale of the user machine. Latin-15 is fine for most Western-European languages.

Asian Languages

However looking at the benchmarks it becomes obvious that Z-strings will inefficient for handling Asian text which is mostly composed of double byte characters. The solution I envisage for this is to have a second Z-string implemenation that is available as an ECF compile option. The Asian implemenation of ZSTRING will inherit from a class EL_ZSTRING_16 which operate on double byte character sequences. Any odd characters requiring more than 2 bytes can be stored in the unencoded overflow array. So even with Asian languages, I would expect to see a substantially smaller memory footprint compared to STRING_32.

Test Suite

I have been able to reuse an extensive set of tests developed for EL_ASTRING that prove equivalent results to STRING_32 routines.

Extra Routines

There are also a number of fancy routines in EL_ASTRING that are not found in the STRING_32. Later I will add these to ZSTRING. The code will be the same. The most useful of these is an operator which every Python programmer will miss if he switches to Eiffel.

  This is equivalent to the Python code   Program output:  

There is also a useful routine translate, which is similar to the Python String translate routine. It is used to either substitute or delete selected characters from a string.

There are about 10 other useful routines not found in STRING_32

Code EL_ZSTRING (alias ZSTRING)

EL_ZSTRING inherits from class EL_ZSTRING_8 which contains routines copied from STRING_8. EL_ZSTRING wouldn't work if I just inherited from STRING_8 due to various contract violations that are not possible to override. Since the class will be used a lot, it will be conveniently aliased as ZSTRING in the ECF.

Note that this is work in progress and there are still many routines to be implemented.  

Code EL_UNENCODED_CHARACTERS

 

Code EL_ZCODEC

 

Code STRING_BENCHMARK

 

Comments

Bug in blogging software

finnianr's picture

For Eiffel Software: please note that due a strange bug in this blogging software I had to change a line of Eiffel code so the characters "[code]" do not appear in it. They cause the rest of the article to become garbled.

Syndicate content