Lair Of The Multimedia Guru

2006-07-08

UTF-8

UTF-8 is the most common encoding to store unicode, the following table summarizes how it works

number of bytes byte1 byte2 byte3 byte4 range of unicode symbols
1 0XXX XXXX 0-127
2 110X XXXX 10XX XXXX 128-2047
3 1110 XXXX 10XX XXXX 10XX XXXX 2048-65535
4 1111 0XXX 10XX XXXX 10XX XXXX 10XX XXXX 65536-2097151

Features

  • 0-127 ASCII is stored as is in UTF-8
  • bytewise substring matching works as is, or in other words one UTF-8 symbol will never match anything but one and the same UTF-8 symbol, not maybe just the last byte of a symbol
  • UTF-8 is easy to distinguish from a random byte sequence
  • when starting at a random byte, resynchronization is trivial and always happens at the next symbol

S1D-7

The biggest problem of UTF-8 is that its not very compact, it needs more bytes then other (non-unicode) encodings, can we do better without loosing its advantages?, well apparently yes, its sad because youd expect the standard comittees to do better …
Why “S1D-7”, well 1 “stop” bit to determine if the byte is the last one and 7 data bits for each byte

number of bytes byte1 byte2 byte3 byte4 range of unicode symbols
1 0XXX XXXX 0-127
2 1XXX XXXX 0XXX XXXX 128-16383
3 1XXX XXXX 1XXX XXXX 0XXX XXXX 16384-2097151

Compactness differences

  • In all cases is S1D-7 able to store as many or more symbols per number of bytes
  • where UTF-8 needs up to 4 bytes to store all symbols, S1D-7 never needs more then 3
  • Many languages which need 3 bytes per symbol in UTF-8 can be encoded with 2 bytes per symbol, just look at the Basic Multilingual Plane at wikipedia, all the pages which are in the range of 0x07FF to 0x3FFF need 3 bytes per symbol in UTF-8 and 2 bytes per symbol in S1D-7

Features

Well, the argumentation at the wikipedia page to justify UTF-8 bloatedness is that its “advantages outweigh this concern”, well so lets see if our simpler encoding which requires fewer bytes per symbol and is much more compact looses any features

0-127 ASCII is stored as is in UTF-8

same, no change here

Substring matching

in UTF-8 we can simply use bytewise substring matching while in S1D-7 we need to write a slightly differnt substring matching function, though both are very simple, see below

int match_utf8_or_ascii(uint8_t *heap, uint8_t *needle){
    int i,j;

    for(i=0; heap[i]; i++){
        for(j=0; needle[j]; j++){
            if(heap[i+j] != needle[j])
                break;
        }
        if(!needle[j])
            return i;
    }
    return -1;
}

int match_S1D8(uint8_t *heap, uint8_t *needle){
    int i,j;

    for(i=0; heap[i]; i++){
        for(j=0; needle[j]; j++){
            if(heap[i+j] != needle[j])
                break;
        }
        if(!needle[j] && !(i && heap[i-1]>127))
            return i;
    }
    return -1;
}
Distinguishing S1D-7 from random bytes

Due to the lower overhead there is less to check and it consequently needs more text to detect reliably, but the check is very simple, just check if there are any consecutive 3 bytes which are each larger then 127 and you know its not S1D-7, to distingissh it from UTF-8 just run your UTF-8 detection over it, its very unlikely that S1D-7 will be parsed correctly by a UTF-8 parser
not to mention, the encoding should be specified somehow anyway and not guessed by such encoding analysis

resynchronization

is even more trivial then in UTF-8, in UTF-8 you search for a byte <128 or a byte > 192 when you find it you know its the first byte of a symbol
in S1D-7 you search for a byte < 128 when you find it you know that the next byte is the first byte of a symbol

parsing

Parsing S1D-7 is much easier then UTF-8, see:

read_utf8_symbol(uin8_t **b){
    int val= *((*b)++);
    int ones=0;

    while(val&(128>>ones))
        ones++;

    if(ones==1 || ones>4)
        return -1;
    val&= 127>>ones;
    while(--ones > 0){
        int tmp= *((*b)++) - 128;
         if(tmp>>6)
            return -1;
        val= (val<<6) + tmp;
    }
    return val;
}

read_s1d7_symbol(uin8_t **b){
    int i, val=0;

    for(i=0; i<3; i++){
        int tmp= *((*b)++) - 128;
        val= (val<<7) + tmp;
        if(tmp < 0)
            return val + 128;
    }
    return -1;
}
zero byte occurances (new 2006-07-21)

This issue has been raised by a comment …
While UTF-8 gurantees that a zero byte in the encoding will occur only in a zero symbol, S1D-7 does not gurantee that, for example the value 256 is encoded as 0x82 0x00, UTF-16 also has this problem
zero bytes are problematic if the encoded strings are passed through code which has been written for one byte per symbol zero terminated strings, of course you should never pass something through code which is designed for a different encoding but if for whatever odd reason you still do that then one possible solution is to simply change the value before encoding and after decoding to avoid zero bytes, this is actually very easy:

int avoid_zero(int x){
    x+= x<<7;
    return (x + (x>>14))>>7;
}
int reverse_avoid_zero(int x){
    return x - (x>>7);
}

these transforms have almost no effect on the space efficiency of S1D-7

occurance of specific bytes (new 2006-07-21)

This issue has been raised by a comment …
While UTF-8 gurantees that no ascii bytes below 128 will occur in the encoding of symbols larger then 128, S1D-7 does not gurantee that, UTF-16 also has this “problem”
again this “problem” is limited to the case of passing encoded strings through code which expects some other encoding, and thats something you should never ever do, not only will the code not be able to make sense of at least some symbols its likely going to lead to bugs and exploits, if strings have to be passed through code which expects another encoding then you should convert your strings to that encoding and escape the problematic symbols somehow
S1D-7 can also easily be changed to avoid specific low <128 bytes in its encoded representation, allthough thats just for demonstration and shouldnt be done in practice as its very bad to pass strings though code which expects another encoding, if your filesystem for example expects filenames in ascii without specific letters then thats what the filename should be encoded with, and letters which cannot be represented should be encode with some sort of escape sequence

int avoid_some_bytes(int x){
    if(x>127) x= (x<<1) ^ table1[x&63];
    return x;
}
int reverse_avoid_some_bytes(int x){
    if(x>127) x= (x ^ table2[x&127])>>1;
    return x;
}

with these transforms any 64 low ascii bytes can be avoided at the cost of 1 bit per symbol, that makes S1D-7 somewhat worse then before but still much better then UTF-8 in space efficiency

Filed under: Uncategorized — Michael @ 12:40

Powered by WordPress