blogZSTRING released in Eiffel-Loop 1.3.1

finnianr's picture
Contents

Preface

Since 2012 I have been interested in the idea of creating a memory efficient alternative to STRING_32 that does not sacrifice performance. In 2013 I published this article which describes and benchmarks the class EL_ASTRING (aka ASTRING). While ASTRING peformed well against STRING_32 and the Gobo UC_UTF8_STRING in terms of both memory consumption and performance, it had one fundamental problem which I felt would prevent it from ever gaining wide acceptance. This was that each ASTRING instance was limited to a maximum of 26 unicode characters that could be used to augment the base latin character set.

In the fall of 2015 I came up with a better algorithm which did not suffer the limitations of ASTRING. Although the new design was somewhat less memory efficient for storing characters wider than 8-bits, it performed well overall. I decided to call it ZSTRING. In November 2015 I announced work in progress on ZSTRING in this article.

I am happy to announce that this project is now completed and since January I have migrated all my projects to use ZSTRING. ZSTRING is now available to download as part of the latest 1.3.1 release of Eiffel-Loop.

Finnian Reilly

What is ZSTRING?

ZSTRING is a memory efficient alternative to the standard STRING_32 class. If your application processes text that is mainly written in a language that is encodeable by a latin character set, you can expect up to a 75% saving in memory. There is a small trade off in performance, but not by much as the benchmarks show. Actually there is not always a trade off. Some ZSTRING string operations are slightly faster than their STRING_32 equivalent.

How is ZSTRING different from STRING_8?

With STRING_8 you are limited to characters from the ISO-8859-1 character set. A ZSTRING instance can be initialized from any STRING_32 string and you are guaranteed to always get back the same STRING_32 string with a call to the function {ZSTRING}.as_string_32. You can also convert it to UTF-8 with a call to {ZSTRING}.to_utf_8.

How ZSTRING Works

ZSTRING uses a combination of two different character arrays to represent the text. The first array is defined as:

 

When initialized from a STRING_32, area holds every character that is encodeable using a system defined latin character-set (ISO-8859-15 by default). Any unicode character that is not encodeable as latin, is denoted by the control code 26 'SUB' character. SUB stands for substitution.

The unencoded array

The second array is defined as:

 

This array is used to store any characters that are not encodeable by the latin codec. The array is structured as a sequence of substrings. The first two cells of each substring record the indices defining the substring interval. As an example to illustrate, consider the string:

 

It contains a total of 27 characters. 24 of these are encodeable as Latin-15 and 3 are not. The three that are not encodeable are stored in the unencoded_area array as shown in the following diagram:

The olive colored cells store the interval indices of the 2 substrings "ō" and "中孚".

The Default Codec

By default, ZSTRING uses the codec class EL_ISO_8859_15_ZCODEC to convert unicode strings. However you can override this choice by calling the routine:

 

For a desktop application you could set the system codec according to the user locale. The following class offers a convenient way to create codec instances from an integer arguement:

 

The ZSTRING Test Suite

ZSTRING has a comprehensive test suite that verifies the results of all string operations against STRING_32.

See ZSTRING_TEST_SET also TEST_STRING_CONSTANTS

ZSTRING for Asian text

If your application is mainly processing Asian text, it is obvious that the current implementation of ZSTRING will incur a significant penalty in terms of both memory and runtime performance. The solution to this problem is to use a different source compatible implementation of ZSTRING adapted for Asian text that can be set with an ECF option.

A 2-byte Solution

The solution is in inherent in the fact that most Asian characters will fit into 2 bytes. By changing the storage arrays to the following definition it should be possible to create an efficient ZSTRING for Asian text

  Now the criteria for whether the character is stored in the unencoded_area is if it does not fit into 2-bytes. As it is unlikely that you will have many consecutive runs of 3 or 4-byte characters, the current scheme of using 2 cells for interval onsets and offsets makes no sense. Instead we should assume that over-sized characters occur in isolated instances and we only need to store the index. However some research is required to verify these assumptions.

Ideally area should be defined as CHARACTER_16, but as this is not part of standard Eiffel, NATURAL_16 can be used as a workaround.

Perhaps in the future some Asian coder will be interested to take on the challenge of implementing this.

A 4-byte Solution

Of course an easier option is to create a source compatible version of ZSTRING that uses the same implementation as STRING_32.

Benchmark Notes

You can find a set of benchmarks at the following URL comparing ZSTRING and STRING_32 in terms of memory consumption and runtime performance.

Benchmark Source Code links on Github

ZSTRING_BENCHMARK_APP

Pure Latin-15 encoding

STRING_BENCHMARK

ZSTRING_BENCHMARK

STRING_32_BENCHMARK

Mixed Latin-15 and Unicode encoding

MIXED_ENCODING_STRING_BENCHMARK

MIXED_ENCODING_ZSTRING_BENCHMARK

MIXED_ENCODING_STRING_32_BENCHMARK

The Test Data

The test data used in these benchmarks is shown at the bottom of the benchmark page. The data is composed of 64 rows and 4 columns referenced as A, B, C, D. The Input column indicates which combination of columns have been selected for each test. For example   means that each input string is a string from column A joined with a string from column D. Each runtime performance test is repeated for each of the 64 rows.

The columns have different storage characteristics as follows:

Columns A and D consist solely of characters that can be encoded as Latin-15. Column B is composed of characters, some of which will need to be stored in the unencoded_area array. Column C is composed of characters, all of which will need to be stored in the unencoded_area array.

Memory Consumption

Table 1 shows memory consumption under optimal conditions, where all the text data can be encoded as Latin-1.

Table 2 shows memory consumption under sub-optimal conditions, where some proportion of the text must be stored in the unencoded_area array.

Runtime Performance

Runtime performance is benchmarked using the 25 most common string operations. The performance figures are the average of 1000 runs for each benchmark test operation. The results are sorted in order of descending ZSTRING performance rank. Table 3 shows runtime performance under optimal conditions, where all the text data can be encoded as Latin-1. Table 4 shows performance under sub-optimal conditions, where some proportion of the text must be stored in the unencoded_area array. Table 4 has more input permutations than table 3.

I Ching Hexagram Test Strings

Table 5 at the bottom of the page shows the data used for the benchmarks. The data are names and descriptions of the 64 I Ching hexagrams divided into 4 columns as follows: A. Hexagram number B. Chinese names as pinyin phonetic spelling C. Chinese names as chinese characters D. English description

ZSTRING argument options

Most ZSTRING routines have two alternate forms as follows:

  1. taking arguments of type EL_READABLE_ZSTRING
  2. taking arguments of type READABLE_STRING_GENERAL

For example:  

Extra ZSTRING routines

ZSTRING has a number of extra routines not found in the standard Eiffel string classes.

 

This routine is inspired by the Python string formatting or interpolation operator '%'. It is not quite as sophisticated as the Python operator as it cannot do numeric formatting, but nevertheless is very useful. It means any ZSTRING can be used as an interpolation template with the addition of '%S' markers (AKA '#').

 

This routine is inspired by the Python method string.translate(s, table[, deletechars]). ZSTRING has a number of different variations of this routine one of which is able to do character deletion.

 

This returns a list of substring indices as an arrayed list.

 

The class EL_SEQUENTIAL_INTERVALS is a high performance alternative to using ARRAYED_LIST [INTEGER_INTERVAL] both in terms of memory consumption and runtime performance.

 

Returns an escaped copy of current string

 

Unescapes the current string

Source Code Links

class EL_ZSTRING aka ZSTRING

class EL_READABLE_ZSTRING

class EL_UNENCODED_CHARACTERS

Comments

Are you aware of Gobo's

berend's picture

Are you aware of Gobo's UC_STRING? That's basically UTF-8 (there is also a UTF-16 variant), and that has been a very stable solution for me for many years.

gobo class UC_UTF8_STRING

finnianr's picture

hi Berend

I published this article in 2013 which benchmarks UC_UTF8_STRING against STRING_32 and ASTRING (a precursor to ZSTRING). These benchmarks show that UC_UTF8_STRING is extremely slow for some commonly used operations. I don't recall now the UTF-16 variant or perhaps it's something new.

UC_UTF8_STRING Benchmark Screenshot String BenchmarksString Benchmarks

UC_UTF8_STRING is broken

finnianr's picture

Besides being very slow, UC_UTF8_STRING is also broken, as this test shows. (I am using the version distributed with EiffelStudio 15.01.9)

 

UC_UTF8_STRING fixed

ericb's picture

Thank you for reporting this problem. It's now fixed in the Git repository of Gobo in Github.

Syndicate content