Re: Byte size of characters when encoding

From: mikeb (mailbox.google_at_nospam.mailnull.com)
Date: 07/09/04


Date: Fri, 09 Jul 2004 16:06:01 -0700

Vladimir wrote:
>>>Method UnicodeEncoding.GetMaxByteCount(charCount) returns charCount * 2.
>>>Method UTF8Encoding.GetMaxByteCount(charCount) returns charCount * 4.
>>>
>>>But why that?
>>
>>Strings in .NET are already Unicode encoded. So if you encode the
>>string to an array of bytes, you get 2 bytes per character.
>>
>>However, for UTF8 encoding a single Unicode character can be encoded
>>using up to 4 bytes in the worst case. charCount*4 is just a worst case
>>scenario if the string happened to contain only characters that required
>>4 byte encoding.
>
>
> Do you want to say that two instances of struct Char in UTF-8 can occupy 8
> bytes?
>

It turns out that while a character can take up to 4 bytes when encoded
as UTF-8, in the Framework a single struct Char can always be encoded in
at most 3 bytes, so two Chars need at most 6 bytes, not 8. That's because
a struct Char holds a 16-bit value, and any 16-bit value can be encoded
in 3 or fewer UTF-8 bytes.
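
As a rough illustration of the 3-byte limit, the sketch below asks the
UTF-8 encoder how many bytes it needs for Chars at the edges of the 1-,
2- and 3-byte ranges. The class and method names (Utf8Width, ByteCount)
are made up for this example; only Encoding.UTF8.GetByteCount comes from
the Framework.

```csharp
using System;
using System.Text;

public static class Utf8Width
{
    // Hypothetical helper: number of UTF-8 bytes needed for one Char.
    public static int ByteCount(char c)
    {
        return Encoding.UTF8.GetByteCount(new char[] { c });
    }

    public static void Main()
    {
        // Values at the boundaries of the 1-, 2- and 3-byte UTF-8 ranges.
        char[] samples = { 'A', '\u007F', '\u0080', '\u07FF', '\u0800', '\uFFFF' };
        foreach (char c in samples)
        {
            // Prints 1, 1, 2, 2, 3, 3 bytes respectively.
            Console.WriteLine("U+{0:X4} -> {1} byte(s)", (int)c, ByteCount(c));
        }
    }
}
```

Nothing in the 16-bit range goes past 3 bytes; the 4-byte form only
shows up once a code point needs more than 16 bits.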

A 4-byte UTF-8 encoding is only needed for Unicode code points above
U+FFFF, which require 'surrogates': a pair of 16-bit values that together
represent the character. A surrogate pair cannot be held in a single
struct Char, but a string can contain one as two consecutive Chars.
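
To make the surrogate case concrete, here is a small sketch using
U+1D11E (MUSICAL SYMBOL G CLEF) as an arbitrary example of a code point
above U+FFFF. The class and field names (SurrogateDemo, Clef) are just
for illustration.

```csharp
using System;
using System.Text;

public static class SurrogateDemo
{
    // One code point above U+FFFF, stored in the string as a surrogate pair.
    public static readonly string Clef = "\U0001D11E";

    public static void Main()
    {
        // The string holds two Chars for this single character.
        Console.WriteLine(Clef.Length);                       // 2
        Console.WriteLine(char.IsSurrogate(Clef[0]));         // True
        Console.WriteLine(char.IsSurrogate(Clef[1]));         // True

        // Encoded together, the pair becomes one 4-byte UTF-8 sequence.
        Console.WriteLine(Encoding.UTF8.GetByteCount(Clef));  // 4
    }
}
```

So two Chars that form a surrogate pair end up as 4 UTF-8 bytes between
them, which is less than the 6 bytes two independent 3-byte Chars need.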

Anyway, here's what can happen using struct Char:

     char c1 = '\uFFFF';
     char c2 = '\u1000';

     // GetBytes is an instance method; Encoding is in System.Text
     byte [] utf8bytes = Encoding.UTF8.GetBytes( new char [] { c1, c2 });

If you dump the byte array, you'll see that each Char was encoded into 3
UTF8 bytes.
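
One possible way to do that dump is sketched below as a complete
program. The DumpDemo class and its Dump helper are hypothetical names;
BitConverter.ToString is used only because it gives a quick hex listing.
The GetMaxByteCount value is printed rather than assumed, since the
exact figure may differ between Framework versions.

```csharp
using System;
using System.Text;

public static class DumpDemo
{
    // Hypothetical helper: hex dump of the UTF-8 encoding of some Chars.
    public static string Dump(char[] chars)
    {
        byte[] utf8bytes = Encoding.UTF8.GetBytes(chars);
        return BitConverter.ToString(utf8bytes);
    }

    public static void Main()
    {
        char c1 = '\uFFFF';
        char c2 = '\u1000';

        // Each Char becomes 3 bytes: EF-BF-BF for c1, E1-80-80 for c2.
        Console.WriteLine(Dump(new char[] { c1, c2 }));

        // The worst-case estimate is an upper bound, not the actual size.
        Console.WriteLine(Encoding.UTF8.GetMaxByteCount(2));
    }
}
```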

Jon Skeet has written an excellent article on this type of issue:

     http://www.yoda.arachsys.com/csharp/unicode.html

-- 
mikeb

