UTF-8
UTF-8 is the most common encoding used to store Unicode; the following table summarizes how it works:
number of bytes | byte1 | byte2 | byte3 | byte4 | range of unicode symbols
---|---|---|---|---|---
1 | 0XXX XXXX | | | | 0-127
2 | 110X XXXX | 10XX XXXX | | | 128-2047
3 | 1110 XXXX | 10XX XXXX | 10XX XXXX | | 2048-65535
4 | 1111 0XXX | 10XX XXXX | 10XX XXXX | 10XX XXXX | 65536-2097151
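To make the table concrete, here is a minimal encoder sketch (my illustration, not part of the original spec text; no error handling) which produces exactly the byte patterns above:

```c
#include <stdint.h>

/* minimal sketch: encode one symbol (0 to 0x1FFFFF) as UTF-8 per the
 * table above; returns the number of bytes written to out */
int encode_utf8(unsigned val, uint8_t *out){
    if(val < 0x80){                       /* 0XXX XXXX */
        out[0]= val;
        return 1;
    }else if(val < 0x800){                /* 110X XXXX 10XX XXXX */
        out[0]= 0xC0 | (val>>6);
        out[1]= 0x80 | (val & 63);
        return 2;
    }else if(val < 0x10000){              /* 1110 XXXX 10XX XXXX 10XX XXXX */
        out[0]= 0xE0 | (val>>12);
        out[1]= 0x80 | ((val>>6) & 63);
        out[2]= 0x80 | (val & 63);
        return 3;
    }else{                                /* 1111 0XXX plus 3 continuation bytes */
        out[0]= 0xF0 | (val>>18);
        out[1]= 0x80 | ((val>>12) & 63);
        out[2]= 0x80 | ((val>>6) & 63);
        out[3]= 0x80 | (val & 63);
        return 4;
    }
}
```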
Features
- 0-127 ASCII is stored as is in UTF-8
- bytewise substring matching works as is; in other words, one UTF-8 symbol will never match anything but the same complete UTF-8 symbol, never just the last byte of a symbol
- UTF-8 is easy to distinguish from a random byte sequence
- when starting at a random byte, resynchronization is trivial and always happens at the next symbol
S1D-7
The biggest problem of UTF-8 is that it's not very compact; it needs more bytes than other (non-Unicode) encodings. Can we do better without losing its advantages? Well, apparently yes. It's sad, because you'd expect the standards committees to do better …
Why “S1D-7”? Well, 1 “stop” bit to determine if the byte is the last one, and 7 data bits in each byte.
number of bytes | byte1 | byte2 | byte3 | range of unicode symbols
---|---|---|---|---
1 | 0XXX XXXX | | | 0-127
2 | 1XXX XXXX | 0XXX XXXX | | 128-16383
3 | 1XXX XXXX | 1XXX XXXX | 0XXX XXXX | 16384-2097151
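Again as an illustration (a minimal sketch of my own, assuming values up to 0x1FFFFF and no error handling), an encoder for this table:

```c
#include <stdint.h>

/* minimal sketch: encode one symbol (0 to 0x1FFFFF) as S1D-7;
 * every byte holds 7 data bits, and a cleared top bit (the "stop"
 * bit being 0) marks the last byte of the symbol */
int encode_s1d7(unsigned val, uint8_t *out){
    int n= val > 0x3FFF ? 3 : val > 0x7F ? 2 : 1;
    for(int i=0; i<n; i++){
        uint8_t digit= (val >> 7*(n-1-i)) & 127;
        out[i]= i == n-1 ? digit : digit | 128;   /* set stop bit on non-final bytes */
    }
    return n;
}
```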
Compactness differences
- In all cases S1D-7 can store as many or more symbols in a given number of bytes
- where UTF-8 needs up to 4 bytes to store all symbols, S1D-7 never needs more than 3
- many languages which need 3 bytes per symbol in UTF-8 can be encoded with 2 bytes per symbol; just look at the Basic Multilingual Plane on Wikipedia: all the pages in the range 0x0800 to 0x3FFF need 3 bytes per symbol in UTF-8 but only 2 bytes per symbol in S1D-7
Features
Well, the argument on the Wikipedia page to justify UTF-8's bloatedness is that its “advantages outweigh this concern”. So let's see whether our simpler encoding, which requires fewer bytes per symbol and is much more compact, loses any features.
0-127 ASCII is stored as is in UTF-8
same, no change here
Substring matching
in UTF-8 we can simply use bytewise substring matching, while in S1D-7 we need a slightly different substring matching function; both are very simple, see below
```c
#include <stdint.h>

/* naive bytewise substring search; works as-is for ASCII and UTF-8 */
int match_utf8_or_ascii(uint8_t *heap, uint8_t *needle){
    int i, j;
    for(i=0; heap[i]; i++){
        for(j=0; needle[j]; j++){
            if(heap[i+j] != needle[j])
                break;
        }
        if(!needle[j])
            return i;
    }
    return -1;
}

/* the same search for S1D-7; the only extra check is that the match
 * must not start in the middle of a symbol: the preceding byte, if
 * any, must be < 128 (i.e. the last byte of the previous symbol) */
int match_s1d7(uint8_t *heap, uint8_t *needle){
    int i, j;
    for(i=0; heap[i]; i++){
        for(j=0; needle[j]; j++){
            if(heap[i+j] != needle[j])
                break;
        }
        if(!needle[j] && !(i && heap[i-1]>127))
            return i;
    }
    return -1;
}
```
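To illustrate why the extra boundary check in match_s1d7() matters, consider this hypothetical case: the S1D-7 symbol 321 is encoded as 0x82 0x41, and its last byte happens to equal ASCII 'A' (assuming the two functions above are in scope):

```c
#include <stdint.h>
#include <stdio.h>

int main(void){
    uint8_t heap[]  = {0x82, 0x41, 0};   /* the single S1D-7 symbol 321 */
    uint8_t needle[]= {0x41, 0};         /* the single symbol 'A' */
    printf("%d\n", match_utf8_or_ascii(heap, needle)); /* 1: a false hit inside symbol 321 */
    printf("%d\n", match_s1d7(heap, needle));          /* -1: heap[0] > 127, so offset 1 is mid-symbol */
    return 0;
}
```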
Distinguishing S1D-7 from random bytes
Due to the lower overhead there is less to check, so it needs more text to be detected reliably, but the check is very simple: if there are any 3 consecutive bytes which are each larger than 127, you know it's not S1D-7. To distinguish it from UTF-8, just run your UTF-8 detection over it; it's very unlikely that S1D-7 will be parsed correctly by a UTF-8 parser.
not to mention, the encoding should be specified somehow anyway and not guessed by such encoding analysis
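For what it's worth, the check described above fits in a few lines (a sketch of my own; passing it of course doesn't prove the data is S1D-7):

```c
#include <stdint.h>

/* valid S1D-7 never contains 3 consecutive bytes that are all > 127,
 * since a symbol has at most 2 non-final bytes */
int could_be_s1d7(const uint8_t *buf, int len){
    int run= 0;
    for(int i=0; i<len; i++){
        run= buf[i] > 127 ? run+1 : 0;
        if(run >= 3)
            return 0;
    }
    return 1;
}
```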
resynchronization
is even more trivial than in UTF-8: in UTF-8 you search for a byte < 128 or a byte > 191; when you find it, you know it's the first byte of a symbol
in S1D-7 you search for a byte < 128; when you find it, you know that the next byte is the first byte of a symbol
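As a sketch (my illustration, assuming the buffer ends with a complete symbol), the two resynchronization loops look like this:

```c
#include <stdint.h>

/* UTF-8: skip continuation bytes (128..191); the byte we stop at
 * is the first byte of a symbol */
const uint8_t *resync_utf8(const uint8_t *p){
    while(*p >= 128 && *p < 192)
        p++;
    return p;
}

/* S1D-7: skip to the end of the current symbol (first byte < 128);
 * the byte after it starts the next symbol */
const uint8_t *resync_s1d7(const uint8_t *p){
    while(*p > 127)
        p++;
    return p + 1;
}
```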
parsing
Parsing S1D-7 is much easier than UTF-8, see:
```c
#include <stdint.h>

/* decode one UTF-8 symbol and advance *b past it; returns -1 on error */
int read_utf8_symbol(uint8_t **b){
    int val= *((*b)++);
    int ones=0;
    while(val&(128>>ones))
        ones++;                 /* count leading 1 bits of the first byte */
    if(ones==1 || ones>4)
        return -1;              /* lone continuation byte or overlong lead */
    val&= 127>>ones;
    while(--ones > 0){
        int tmp= *((*b)++) - 128;
        if(tmp>>6)
            return -1;          /* not a 10XX XXXX continuation byte */
        val= (val<<6) + tmp;
    }
    return val;
}

/* decode one S1D-7 symbol and advance *b past it; returns -1 on error */
int read_s1d7_symbol(uint8_t **b){
    int i, val=0;
    for(i=0; i<3; i++){
        int tmp= *((*b)++) - 128;
        val= (val<<7) + tmp;
        if(tmp < 0)             /* stop bit cleared: this was the last byte */
            return val + 128;
    }
    return -1;                  /* more than 3 bytes: invalid */
}
```
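A quick usage example (assuming the functions above are in scope), decoding the sequence 0x82 0x00, which is the S1D-7 encoding of 256 mentioned below:

```c
#include <stdint.h>
#include <stdio.h>

int main(void){
    uint8_t buf[]= {0x82, 0x00};            /* the S1D-7 encoding of the value 256 */
    uint8_t *p= buf;
    printf("%d\n", read_s1d7_symbol(&p));   /* prints 256; p has advanced by 2 bytes */
    return 0;
}
```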
zero byte occurrences (new 2006-07-21)
This issue has been raised by a comment …
While UTF-8 guarantees that a zero byte in the encoding will occur only in a zero symbol, S1D-7 does not guarantee that; for example, the value 256 is encoded as 0x82 0x00. UTF-16 also has this problem
zero bytes are problematic if the encoded strings are passed through code which has been written for one-byte-per-symbol zero-terminated strings. Of course you should never pass something through code which is designed for a different encoding, but if for whatever odd reason you still do, then one possible solution is to simply change the value before encoding and after decoding to avoid zero bytes. This is actually very easy:
```c
/* remap values so that no encoded S1D-7 byte is ever zero: the
 * forward transform skips all nonzero multiples of 128, so the
 * final (low) 7-bit digit of a nonzero value is never 0 */
int avoid_zero(int x){
    x+= x<<7;
    return (x + (x>>14))>>7;
}

int reverse_avoid_zero(int x){
    return x - (x>>7);
}
```
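For example (a hypothetical round trip, assuming the functions above are in scope), the problematic value 256 becomes 258, which encodes as 0x82 0x02 with no zero byte:

```c
#include <stdio.h>

int main(void){
    int y= avoid_zero(256);     /* 258: the multiples 128 and 256 are skipped */
    /* the S1D-7 encoding of 258 is 0x82 0x02: no zero byte */
    printf("%d -> %d -> %d\n", 256, y, reverse_avoid_zero(y));  /* 256 -> 258 -> 256 */
    return 0;
}
```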
these transforms have almost no effect on the space efficiency of S1D-7
occurrence of specific bytes (new 2006-07-21)
This issue has been raised by a comment …
While UTF-8 guarantees that no ASCII bytes below 128 will occur in the encoding of symbols larger than 128, S1D-7 does not guarantee that; UTF-16 also has this “problem”
again this “problem” is limited to the case of passing encoded strings through code which expects some other encoding, and that's something you should never ever do; not only will the code be unable to make sense of at least some symbols, it's likely to lead to bugs and exploits. If strings have to be passed through code which expects another encoding, then you should convert your strings to that encoding and escape the problematic symbols somehow
S1D-7 can also easily be changed to avoid specific bytes below 128 in its encoded representation, although that's just for demonstration and shouldn't be done in practice, as it's very bad to pass strings through code which expects another encoding. If your filesystem, for example, expects filenames in ASCII without specific letters, then that's what the filenames should be encoded with, and letters which cannot be represented should be encoded with some sort of escape sequence
```c
/* table1 and table2 are precomputed lookup tables chosen so that the
 * two transforms invert each other; their construction is left
 * unspecified here. Values above 127 cost one extra bit. */
int avoid_some_bytes(int x){
    if(x>127)
        x= (x<<1) ^ table1[x&63];
    return x;
}

int reverse_avoid_some_bytes(int x){
    if(x>127)
        x= (x ^ table2[x&127])>>1;
    return x;
}
```
with these transforms, any 64 low ASCII bytes can be avoided at the cost of 1 bit per symbol; that makes S1D-7 somewhat worse than before, but still much better than UTF-8 in space efficiency
In your S1D-7 scheme, the last byte of a multi-byte symbol can be mistaken for a valid single-byte symbol, and the last two bytes of a three-byte symbol can be mistaken for a valid two-byte symbol. This can’t happen in UTF-8, and is an advantage when resynchronizing.
Comment by Stern — 2006-07-08 @ 13:34
As a clarification to my previous comment, that means you’ll either skip one valid symbol or process one garbage symbol – in your post you prefer the former. It also means you have to special-case the start of a stream.
Comment by Stern — 2006-07-08 @ 14:01
What you say is true, but I can't see a practical advantage in that 1-byte difference in resynchronization, at least compared to the 50% increase in space UTF-8 needs for many Asian characters
resynchronization only happens if you start at a random byte position; now if it's random, then losing 1 symbol in a few cases doesn't seem like a big issue to me …
furthermore, if the previous data is available, then you can decode S1D-7 in the backward direction to recover the symbol
it's just a while(p>start && p[-1]>127) p--;
Comment by Michael — 2006-07-08 @ 14:27
I agree that the scenarios where this would be a problem are rather contrived (repositioned to the start of a binary stream without the ability to rewind, the stream starts with an S1D-7-encoded identifier), but I think that’s the mentality required for being a standards committee member!
Comment by Stern — 2006-07-10 @ 06:32
UTF-8 is a variable-length code and so has all the advantages and weaknesses of one. And if you compare the standard alternatives (UTF-16 and UTF-32), it may really be the best of them.
And for S1D-7: it is used for encoding delta times in the MIDI format and can be treated as an Elias Gamma radix-7 code (got that from one of Peter Fenwick's reports).
Comment by Kostya — 2006-07-10 @ 09:00
Stern wrote:
> I agree that the scenarios where this would be a problem are rather
> contrived (repositioned to the start of a binary stream without the
> ability to rewind, the stream starts with an S1D-7-encoded identifier)
additionally, the fact that the position is exactly the start would have to be unknown; otherwise the first symbol could be decoded without problems
> but I think that’s the mentality required for being a standards committee member!
the mentality that a design should make contrived and rare things easy at the expense of very common things? yes, I agree standards committees design things based on that, but I don't agree that they should
Kostya wrote:
> it. And if you compare standard alternatives (UTF-16 and UTF-32) – it may
> be really the best of them.
I fully agree, UTF-16 and -32 are even worse :)
> And for S1D-7: it is used in encoding delta times in MIDI format and can
> be treated as Elias Gamma radix-7 code (got that from one of Peter
> Fenwick’s reports).
I knew it must have been used and named already by someone (and I'm also pretty sure UTF-8 has not been used by anyone before Unicode, at least not by any sane person …)
Comment by Michael — 2006-07-10 @ 09:46
Michael wrote:
> additionally the fact that the position is exactly the start would have to be
> unknown otherwise the first symbol could be decoded without problems
Yep, but I think that is actually a realistic situation (eg. something akin to an MP3 stream with embedded tags).
I had a look at Rob Pike's comments on the origins of UTF-8 (linked through the Wikipedia entry) and the ability to synchronize without consuming one character was actually one of the design criteria of the format. Unfortunately that document doesn't state any rationale for that criterion, as I assume Kernighan and Pike had some real reason for including it.
Comment by Stern — 2006-07-10 @ 10:40
utf-8/s1d-7 tags embedded in some format are not a problem, as you must know where they start; otherwise you will end up with random trash no matter what encoding is used
Comment by Michael — 2006-07-10 @ 11:10
I think that one of the main advantages of utf8 is that you cannot feed it a random binary stream or some other encoding without breaking some of its many obscure rules.
Of course this complicates all programs (detection and handling of invalid strings), but it also allows autodetection of text file encoding.
Comment by iive — 2006-07-13 @ 14:30
Your encoding doesn’t meet the 2 most important criteria:
– being compatible with existing software
– being file system safe
It’s not compatible with old software because a string encoded with S1D-7 may contain zero bytes as the last byte of a multibyte code in the middle of the string (e.g. 256 (A with macron) is coded as 0x82 0x00 with S1D-7). Therefore any program source using the standard string handling functions (like strlen) is broken.
It’s not file system safe because there are some byte codes not allowed in a file name, like binary 0 and the slash ‘/’. (In Windows NTFS there are several other illegal codes, too, like the backslash ‘\’, colon ‘:’, and a lot more.) In your encoding these ASCII codes could easily emerge as the last byte of a multibyte sequence. Encodings that are not file system safe weren’t even considered at the design of the UTF codes.
And a note to comment #7 (and #6): It’s not Kernighan and Pike, but Thompson and Pike, who designed UTF8. And they are sane… :-)
Comment by István — 2006-07-21 @ 15:36
> Your encoding doesn’t meet the 2 most important criteria:
> – being compatible with existing software
if software depends on a specific encoding, then it has to be provided with that encoding, nothing else; UTF-8 will not work with software designed for ASCII only either
> – being file system safe
again, if you take one encoding and store it somewhere pretending it is a different encoding, then you are asking for trouble; this doesn't work with UTF-8 either. Just take one filename in UTF-8 and one in an 8-bit charset, now tell me which is which and display them; your OS must be aware of the encoding used or it will not work
>
> It’s not compatible with old software because a string encoded with S1D-7 may
> contain zero bytes as the last byte of a multibyte code in the middle of the
> string (e.g. 256 (A with macron) is coded as 0x82 0x00 with S1D-7). Therefore any
> program source using the standard string handling functions (like strlen) is
> broken.
to make you happy I've updated my blog entry to describe how to avoid specific bytes with S1D-7 with very low complexity, and with space efficiency still much better than utf-8
[…]
Comment by Michael — 2006-07-21 @ 19:10
But if we filter out all ASCII characters in multibyte sequences, we end up with something like:
number of bytes | byte1 | byte2 | byte3 | byte4 | range of unicode symbols
---|---|---|---|---|---
1 | 0XXX XXXX | | | | 0-127
2 | 11XX XXXX | 10XX XXXX | | | 128-4222
3 | 11XX XXXX | 11XX XXXX | 10XX XXXX | | 4223-266365
4 | 11XX XXXX | 11XX XXXX | 11XX XXXX | 10XX XXXX | 266366-16781437
It is still better than utf-8, but we lose lots of space efficiency…
Comment by mat — 2006-07-23 @ 18:04
avoid_zero is great! I like it! I like this blog for such neat ideas! Anyway it’s worth noting that some (at this moment unassigned) Unicode characters would be encoded as 4 bytes, so someone implementing it needs to be aware of that extra byte.
For mat’s comment: I think this encoding is closer to S1D-7+avoids:
7 data bits: 0xxx xxxx
13 data bits: 1xxx xxxx 01xx xxxx
20 data bits: 1xxx xxxx 1xxx xxxx 01xx xxxx
27 data bits: 1xxx xxxx 1xxx xxxx 1xxx xxxx 01xx xxxx
(21 bits used for Unicode only…)
It could be called S2D6.5 :-)
This is smaller than UTF-8, larger than S1D7, doesn’t have embedded zeroes and is file system safe in Unix (not in Windows)
Michael wrote:
> this doesnt work with UTF-8 either, just take one filename with UTF-8
> and one with a 8bit charset, now tell me which is which and display them,
> your OS must be aware of the encoding used or it will not work
UTF-8 was designed for the filesystem, to replace 8 bit charsets at the OS level. (The original name was FSS-UTF, meaning File System Safe UCS Transformation
Format, not UTF-8.) According to this idea *all* the filenames in the filesystem are utf8, no need to know which is what. I think most of the unixes do it like that, while Windows stores utf16 internally and has different notions (ansi name, oem name, etc.) externally. Unfortunately it doesn’t have utf8 support at the filesystem level.
Comment by István — 2006-07-24 @ 16:16
I hate to flame, ;-) but as someone who's spent a lot of time on Unicode and UTF-8 implementation and making it efficient and non-bloated, I'd like to step in. IMO your proposed SID-7 design is the worst of UTF-8 and UTF-16 combined. It contains embedded 0 bytes and it's variable-width and thus not character-addressable. The only way it's better than either of them is saving space, and it does so at extreme expense.
One thing Michael warned against is passing data to systems in the wrong encoding, but the systems (the C library and POSIX) in question were not designed for any specific encoding, just for the C definition of a string (and in some cases, such as printf/scanf/atoi/strtol/etc., of the “portable character set” which is a fairly restrictive subset of ASCII). For example, the traditional UNIX filesystem does not use ASCII but rather raw byte sequences with special meaning attributed to just two bytes, 0 and ‘/’.
Microsoft had their day of trying to destroy C, UNIX, and the Internet standards with UCS-2. The idea was that by adopting their 16bit “WCHAR”, which was fundamentally incompatible with C/UNIX/Internet strings and text, they could render these systems severely obsolete and replace everything with their own proprietary system. Thankfully Ken Thompson and friends stepped in to save the day, and quite frankly the idea for UTF-8 was brilliant in the context of the problem they wanted to solve: supporting all the world’s languages on top of the HUGE existing framework of Internet standards and the C language that simply can’t be thrown away and replaced without giving up the game to a huge behemoth like Microsoft or Apple.
The greatest advantage of UTF-8 is that traditional, super-efficient byte-oriented implementations of system operations are still possible. In many cases, this extends to the application level: except for pretty formatting of output on a GUI or character cell device, it’s almost always possible to ignore the idea of encoding and just work with byte strings. String length (size in memory, which is often all that matters) is still searching for a 0 byte. Path resolution is still searching for ‘/’ bytes. Printf is still searching for a ‘%’ byte followed by appropriate single-byte characters. This property is what essentially every other multibyte encoding failed to deliver, and why they were all poorly supported by applications. Why should printf have to go through a disgusting slow “decoding” process on the string passed to it? All this does is punish users who want multilingual support by making their systems slow and bloated.
If we could start the world over without any existing systems or protocols (and without any existing people ;), maybe there would be better choices than UTF-8. But SID-7 isn’t one of them. Most likely if we were starting from scratch with today’s excessive technology, the choice would be to define a byte as 32 bits or even 64 bits and have that be the basic unit of addressing, even though it would be very inefficient and stupid for multimedia and the like… :(
By the way, the inefficiency of UTF-8 is largely due to the inefficiency of Unicode and its wasteful catering to legacy Western encodings. If all the precombined Latin characters, IPA, spacing diacritic marks, precombined Greek, etc. were thrown out or relegated to high code points, then most or all of the world’s non-ideographic characters would fit into the two-byte range of UTF-8, and we wouldn’t be having this discussion. :)
P.S. Don’t get me wrong, I _am_ pissed off about the inefficiency, especially since Tibetan got stuck in the 3-byte range and it needs more characters-per-word than about any other language except German… :P But sadly nothing can be done at this point. It’s been hard enough getting users of legacy codepages to switch to UTF-8 already.
Comment by Rich — 2006-08-01 @ 23:46
I already said way too much, but I found a quote from the original Plan9 paper on UTF-8 which is particularly apt:
“To adopt Unicode, we would have had to convert all text going into and out of Plan 9 between ASCII and Unicode, which cannot be done. Within a single program, in command of all its input and output, it is possible to define characters as 16-bit quantities; in the context of a networked system with hundreds of applications on diverse machines by different manufacturers, it is impossible.”
Here Thompson and Pike are referring to UCS-2, but the same applies to any encoding of Unicode (such as SID-7) which is not a superset of ASCII or not even C-string-compatible.
Also, if anyone is still looking for rationale for any of the requirements UTF-8 was designed to satisfy, I’ll be happy to provide them. I can assure you they’re all necessary for everyday practical purposes.
Comment by Rich — 2006-08-02 @ 00:18
> I hate to flame, ;-)
ROTFLMAO :)
[…]
> It contains embedded 0 bytes and it’s variable-width and thus not
> character-addressable. The only way it's better than either of them is saving space,
> and it does so at extreme expense.
what's wrong with my suggested method of avoiding 0 bytes?
> One thing Michael warned against is passing data to systems in the wrong encoding,
> but the systems (the C library and POSIX) in question were not designed for any
> specific encoding, just for the C definition of a string (and in some cases, such
> as printf/scanf/atoi/strtol/etc., of the “portable character set” which is a
> fairly restrictive subset of ASCII). For example, the traditional UNIX filesystem
> does not use ASCII but rather raw byte sequences with special meaning attributed
> to just two bytes, 0 and ‘/’.
in comment #12 mat “proposed” another encoding which is more space efficient than UTF-8 and has no low ASCII bytes in the encodings of high symbols; that encoding could be improved by 1 bit per symbol if guaranteed resynchronization isn't needed
aren't these better than UTF-8 in your opinion? just curious …
Comment by Michael — 2006-08-02 @ 00:37
btw, even if SID-7 doesn't fit too well into POSIX and C, it's IMO still much better than the others for storage on disk and transmission over the net due to the lower overhead; when things are read into memory they can be converted to UTF-8/16/32 if desired …
Comment by Michael — 2006-08-02 @ 01:00
> arent these better then UTF-8 in your oppinion? just curious …
No, it’s highly detrimental/unfair to non-English languages since strstr() and similar will find false positives where one character is a “subcharacter” of another for all non-ASCII characters. You may think this is stupid and programs should just be rewritten, but why should a program like “fgrep” have to care about encoding at all? If the encoding is good, it doesn’t have to. I’ve spent a lot of time looking at these issues, and UTF-8 is almost(*) the optimal encoding that has all the properties which are needed for sanity.
(*) Curious about the one “bug” in UTF-8? It’s the fact that there are overlong sequences you have to check for and reject, e.g. 0xc0 0x80 emulating 0x00. If 2-byte characters started from a base of 0x80, 3-byte characters started from a base of 0x880, etc. then this problem would not arise and no special logic would be necessary (the offsets come for free as a natural consequence of the most efficient decoding algorithm :) and unlike checking for errors they don’t involve any branching..though you can also cheat and avoid extra branching with fancy bit-twiddling). This also buys you slightly more space in each byte size, although it’s fairly inconsequential except for a decent amount of additional 3-byte space. Note that this improved approach was included in Unicode, albeit too late, when UTF-16 was introduced.
Comment by Rich — 2006-08-02 @ 07:34
> btw, even if SID-7 doesn't fit too well into POSIX
> and C, it's IMO still much better than the others
> for storage on disk and transmission over the net
> due to the lower overhead; when things are read
> into memory they can be converted to UTF-8/16/32
> if desired …
It’s not. Files on disk need to be readable by standard programs, not specialized programs for reading SID-7 compressed text. If you want to save space or bandwidth, use UTF-8 plus gzip. This was discussed 1000 times already when UTF-8 was first introduced and all the evidence showed that using standard compression algorithms on UTF-8 gave roughly equivalent results to using them on more “efficient” forms of Unicode encoding.
The benefit of using gzip (or any other standard non-text-specific compression algorithm) is that you move the complexity to another layer. You could do this with SID-7, converting back and forth to UTF-8, but it won't save nearly as much space/bw, it will not be binary-clean (what happens if your UTF-8 file has invalid sequences in it and you try to convert it to SID-7 and back?), and it will be more headache for the user (why do you even need to know if a file is text or binary to transmit it? sounds like we're back in the dark ages of Windows and CRLF and ascii mode ftp...).
Comment by Rich — 2006-08-02 @ 07:40
>> aren't these better than UTF-8 in your opinion? just curious …
> No, it’s highly detrimental/unfair to non-English languages since strstr() and
> similar will find false positives where one character is a “subcharacter” of another
> for all non-ASCII characters. You may think this is stupid and programs should just
> be rewritten, but why should a program like “fgrep” have to care about encoding at
> all? If the encoding is good, it doesn’t have to. I’ve spent a lot of time looking
> at these issues, and UTF-8 is almost(*) the optimal encoding that has all the
> properties which are needed for sanity.
strstr() and others won't work due to precomposed vs. combining character differences anyway, UTF-8 or not
if you now say that all UTF-8 should be converted so symbols have a unique encoding, then you can just as well convert from a more space-efficient encoding to UTF-8
Comment by Michael — 2006-08-02 @ 08:36
> This was discussed 1000 times already when UTF-8 was first introduced and all the
> evidence showed that using standard compression algorithms on UTF-8 gave roughly
> equivalent results to using them on more “efficient” forms of Unicode encoding.
using a more space-efficient encoding instead of UTF-8 before gzip means that gzip has to compress significantly less data (= it's faster); by the same reasoning, compressed video normally uses subsampled chroma, as it's less data to mess with and therefore faster. Of course a video codec could just discard the high-frequency components or perform coarser quantization on non-subsampled video, and that probably would give better quality per bitrate than YV12 as it's more flexible
furthermore, not all files on disk and all network transmissions are compressed; actually, quite few files on my disk are compressed with *zip, so the space efficiency of the encoding isn't irrelevant IMO
and about detecting whether a file is UTF-8 or SID-7: just add a header to SID-7 which cannot occur in valid UTF-8, so all the standard tools could then easily read it
it's really just a few lines of code to convert between SID-7 and UTF-8 …
Comment by Michael — 2006-08-02 @ 08:59
> strstr() and others wont work due to precomposed
> vs. Combining characters differences anyway,
> UTF-8 or not
Unicode consortium says applications should treat canonical equivalences like this as the same, but everyone else (including serious standards bodies, e.g. W3C and the group that specified characters valid in identifiers for languages including C) agrees that’s nonsense. They’re different characters. Just like the micro symbol and Greek letter mu are different characters. You should grep for what you want. :)
Admittedly it would have been better if Unicode had omitted precombined nonsense but they chose the pragmatic approach of trying to make Europeans happy without software having to implement proper combining support.
> and about detecting if a file is UTF-8 or SID-7
> just add a header to SID-7 which cannot occure in
> valid UTF-8 so all the standard tools could then
> easily read it
Not an option. Headers are not preserved under operations on text files, like grep, sort, cat, … This is why windows BOM nonsense is universally rejected on unix.
All your proposals put a huge amount of complexity into _every_ _single_ program in order to save a few bytes of space. Space is cheap; complexity is not. It's like if we decided to omit timestamps in NUT to save 0.005% overhead(*) because you can reconstruct them from decoding at the codec level. Surely you _can_, but programs should not have to do this since it makes them overly complex and codec-dependent. Similarly programs should not have to speak a special encoding (SID-7) to process text. The encoding should be at a level where they don't have to care about it (UTF-8) and all byte string operations just work on text as generic bytes.
(*) Yes the percentage cost is higher with UTF-8 than with NUT per-frame timestamps, but on the other hand the total space cost is much lower seeing as multimedia files are many orders of magnitude bigger than text.
Comment by Rich — 2006-08-02 @ 09:41
I thought I’d offer a positive comment now. Although I would never use S1D-7 for data interchange, I am planning to use a vlc form similar to this for packing characters into the buffer in my terminal emulator “uuterm” and thought I might share an example where encoding like this is useful for internal representation. In a multilingual terminal, each character cell may store more than one character due to the need for combining marks, etc. Traditional implementations assume combining characters will be rare and use malloc and linked lists to chain arbitrary length combinations. In practice only a few characters (4 is probably sufficient, 5 surely is) are needed for any combination, but storing them as fixed-size UCS sequences would still be prohibitive in space.
My approach is to use NUT vlc (also known as S1D-7) to store variable-size characters in a small fixed-size buffer per cell. Lead character is a vlc unsigned UCS code point and subsequent combining marks are signed vlc, coded relative to the previous character. This allows nearly all reasonable combining sequences of up to 5 combining marks to fit in 8 bytes, while also making short unreasonable sequences (with just 1 or 2 combining chars) possible. And of course, like you said, no character in UCS takes more than 3 bytes since 3*7=21. :)
Glad something good came of this topic!
Comment by Rich — 2006-09-07 @ 00:14
Just one other feature of UTF-8 that your proposed encoding is missing:
With UTF-8, it is possible to determine exactly how many bytes a character will be after reading just the first byte. This could presumably be useful for anything that counts characters or has to seek by character through a (known to be well-formed) UTF-8 file.
Comment by Calvin Walton — 2007-09-15 @ 02:40
> Just one other feature of UTF-8 that your proposed encoding is missing:
> With UTF-8, it is possible to determine exactly how many bytes a character
> will be after reading just the first byte.
no, you cannot; you have to fully decode the UTF-8 symbols to identify whether they are combining characters and possibly surrogate pairs
and searching for the next byte < 128 is not really more complex than counting the leading 1 bits and dealing with the special case of no such bits, which UTF-8 needs; though yes, in UTF-8 you don't need to access more than the first byte
Comment by Michael — 2007-09-15 @ 21:10
I don't particularly care for utf-8 myself. Aside from the fact that it uses the unviewable C1 control characters, it makes Cyrillic text unnecessarily bloated, and sometimes I might want to write in Cyrillic. The most efficient solution for pages entirely written in one language is to have more national “legacy” encodings, preferably additions to the ISO-8859 family. ISO-8859 allows up to 94 national characters, a comfortable margin for even the largest alphabetic scripts, Armenian and Georgian, so you can have each character written with only one Latin1 character. For Asian scripts, like Chinese or Korean, you'll of course need two Latin1s per character, but that's still better than utf-8, where you'll need three Latin1s per character.
And when you consider that you can indicate on a webpage what encoding a browser should use, utf-8 only really makes sense when you're putting several languages on one page, and the majority of pages on the web will be written in only one language. Escape sequences can take up the slack for extranational characters, which are often few and far between on any given page. The way I see it, though, utf-8 has become a sort of sacred cow these days, chosen in the name of the “globalization” religion over the “nationalist” legacy encodings, spoken as if nationalism was some sort of bad thing.
Comment by Hank — 2007-11-17 @ 02:36