Re: Byte size of characters when encoding

From: Vladimir (xozar_at_tut.by)
Date: 07/10/04

  • Next message: David Levine: "Re: Debug Build works perfectly; Release Build fails silently!"
    Date: Sat, 10 Jul 2004 22:53:29 +0300
    
    

    > >>>Method UnicodeEncoding.GetMaxByteCount(charCount) returns charCount * 2.
    > >>>Method UTF8Encoding.GetMaxByteCount(charCount) returns charCount * 4.
    > >>>
    > >>>But why that?
    > >>
    > >>Strings in .NET are already Unicode encoded. So if you encode the
    > >>string to an array of bytes, you get 2 bytes per character.
    > >>
    > >>However, for UTF8 encoding a single Unicode character can be encoded
    > >>using up to 4 bytes in the worst case. charCount*4 is just a worst case
    > >>scenario if the string happened to contain only characters that required
    > >>4 byte encoding.
    > >
    > >
    > > Do you want to say that two instances of struct Char in UTF-8 can occupy
    > > 8 bytes?
    > >
    >
    > It turns out that while a UTF8 character can take up to 4 bytes to be
    > encoded, for the Framework, a struct Char can always be encoded in at
    > most 3 bytes. That's because the struct char holds a 16-bit Unicode
    > value, and that can always be encoded in 3 or fewer bytes.
    >
    > A 4-byte UTF8 encoding is only needed for Unicode code points that
    > require 'surrogates' - or a pair of 16-bit values to represent the
    > character. Surrogates cannot be represented in a single struct Char -
    > but I believe they are supported in strings.
    >
    > Anyway, here's what can happen using struct Char:
    >
    > char c1 = '\uFFFF';
    > char c2 = '\u1000';
    >
    > byte[] utf8bytes = Encoding.UTF8.GetBytes(new char[] { c1, c2 });
    >
    > If you dump the byte array, you'll see that each Char was encoded into 3
    > UTF8 bytes.
    >

    This is driving me crazy.
    I don't understand.

    Method UnicodeEncoding.GetMaxByteCount(charCount) returns charCount * 2.
    Method UTF8Encoding.GetMaxByteCount(charCount) returns charCount * 4.

    If charCount means 32-bit Unicode characters:
    UnicodeEncoding.GetMaxByteCount(charCount) must return charCount * 4.
    UTF8Encoding.GetMaxByteCount(charCount) must return charCount * 4.

    If charCount means 16-bit Unicode characters (Char structure):
    UnicodeEncoding.GetMaxByteCount(charCount) must return charCount * 2.
    UTF8Encoding.GetMaxByteCount(charCount) must return charCount * 3.

    Suppose we have a string of length 5 (the length of a string is the
    number of Char struct instances it holds).
    UTF8Encoding.GetMaxByteCount(stringInstance.Length) should then return 15.
    But it doesn't.
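
    A minimal sketch of how to compare the worst-case bound with the exact
    size, assuming the standard Encoding.UTF8 and Encoding.Unicode instances.
    The class and helper names are made up for illustration. The value that
    GetMaxByteCount returns depends on the Framework version, so the sketch
    only prints it and makes no claim about the number.

```csharp
using System;
using System.Text;

// Hypothetical demo class, not part of the Framework.
public static class MaxByteCountDemo
{
    // Worst-case bound the encoding reports for charCount Char values.
    public static int MaxBytes(Encoding encoding, int charCount)
    {
        return encoding.GetMaxByteCount(charCount);
    }

    // Exact number of bytes this particular string needs.
    public static int ActualBytes(Encoding encoding, string text)
    {
        return encoding.GetByteCount(text);
    }

    public static void Main()
    {
        // Five Char values, each at or above 0x0800 (3 UTF-8 bytes apiece).
        string text = "\u1000\u1001\u1002\u1003\u1004";

        Console.WriteLine("Length:       " + text.Length);
        Console.WriteLine("UTF-8 max:    " + MaxBytes(Encoding.UTF8, text.Length));
        Console.WriteLine("UTF-8 actual: " + ActualBytes(Encoding.UTF8, text));    // 15
        Console.WriteLine("UTF-16 max:   " + MaxBytes(Encoding.Unicode, text.Length));
        Console.WriteLine("UTF-16 actual:" + ActualBytes(Encoding.Unicode, text)); // 10
    }
}
```

    The bound is only meant for sizing a buffer up front; GetByteCount gives
    the exact figure for a concrete string.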

    Another question.
    Maybe each surrogate pair in a string (two 16-bit characters) occupies
    only 4 bytes in UTF-8?
    Yes or no?

    Look:

    /*
    UTF-16 encodes each 16-bit character as 2 bytes. It doesn't affect the
    characters at all, and no compression occurs; its performance is
    excellent. UTF-16 encoding is also referred to as Unicode encoding.

    UTF-8 encodes some characters as 1 byte, some characters as 2 bytes, some
    characters as 3 bytes, and some characters as 4 bytes. Characters with a
    value below 0x0080 are compressed to 1 byte, which works very well for
    characters used in the United States. Characters between 0x0080 and
    0x07FF are converted to 2 bytes, which works well for European and Middle
    Eastern languages. Characters of 0x0800 and above are converted to
    3 bytes, which works well for East Asian languages. Finally, surrogate
    character pairs are written out as 4 bytes. UTF-8 is an extremely popular
    encoding, but it's less useful than UTF-16 if you encode many characters
    with values of 0x0800 or above.
    */

    Does that mean a surrogate pair of UTF-16 characters can't occupy more
    than 4 bytes in UTF-8?

    Wait a minute.
    It seems I'm starting to understand something.

    UTF-16 characters below 0x0800 occupy 1 or 2 bytes in UTF-8
    (in UTF-16 they always occupy 2 bytes).
    UTF-16 characters at 0x0800 and above occupy 3 bytes in UTF-8
    (in UTF-16 they always occupy 2 bytes).
    A UTF-16 surrogate pair occupies 4 bytes in UTF-8
    (in UTF-16 it always occupies 4 bytes).

    Right?
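
    The ranges above can be checked with a short sketch like the following,
    which asks Encoding.UTF8 for the byte count of one sample from each
    range. The class name and the sample characters are my own choices for
    illustration.

```csharp
using System;
using System.Text;

// Hypothetical demo class, not part of the Framework.
public static class Utf8RangeDemo
{
    // Number of UTF-8 bytes needed for the given string.
    public static int Utf8Bytes(string text)
    {
        return Encoding.UTF8.GetByteCount(text);
    }

    public static void Main()
    {
        Console.WriteLine(Utf8Bytes("A"));            // U+0041, below 0x0080: 1
        Console.WriteLine(Utf8Bytes("\u00E9"));       // U+00E9, 0x0080..0x07FF: 2
        Console.WriteLine(Utf8Bytes("\u0800"));       // U+0800, 0x0800 and above: 3
        Console.WriteLine(Utf8Bytes("\uD834\uDD1E")); // U+1D11E as a surrogate pair: 4
    }
}
```

    Note that the last sample has Length 2, so the 4 bytes work out to
    2 bytes per Char, which is less than the 3 bytes per Char of the
    previous sample.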

    But then I think UTF8Encoding.GetMaxByteCount(charCount) must
    return charCount * 3.
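
    To test that reasoning, here is a rough sketch that counts UTF-8 bytes by
    walking the Char values of a string by hand, following the ranges quoted
    above. The class and method names are invented for this example. It
    treats an unpaired surrogate as 3 bytes, which is an assumption on my
    part about how such input should be counted.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the Framework.
public static class Utf8Bound
{
    // Counts UTF-8 bytes from the UTF-16 code units of the string.
    public static int CountUtf8Bytes(string text)
    {
        int bytes = 0;
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (c < 0x0080)
            {
                bytes += 1;
            }
            else if (c < 0x0800)
            {
                bytes += 2;
            }
            else if (char.IsHighSurrogate(c) && i + 1 < text.Length
                     && char.IsLowSurrogate(text[i + 1]))
            {
                bytes += 4; // one pair = two Char values = 4 bytes
                i++;        // skip the low surrogate
            }
            else
            {
                bytes += 3; // assumption: unpaired surrogates also count as 3
            }
        }
        return bytes;
    }

    public static void Main()
    {
        string text = "A\u00E9\u1000\uD834\uDD1E"; // 1 + 2 + 3 + 4 bytes
        Console.WriteLine(CountUtf8Bytes(text));              // 10
        Console.WriteLine(Encoding.UTF8.GetByteCount(text));  // 10
        Console.WriteLine(text.Length * 3);                   // 15
    }
}
```

    No branch adds more than 3 bytes per Char consumed, so for well-formed
    strings the total never exceeds Length * 3.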

