Re: Byte size of characters when encoding
From: Vladimir (xozar_at_tut.by)
Date: 07/10/04
- Previous message: Daniel Billingsley: "WinForms databinding not quite working"
- In reply to: mikeb: "Re: Byte size of characters when encoding"
- Next in thread: Jon Skeet [C# MVP]: "Re: Byte size of characters when encoding"
- Reply: Jon Skeet [C# MVP]: "Re: Byte size of characters when encoding"
- Messages sorted by: [ date ] [ thread ]
Date: Sat, 10 Jul 2004 22:53:29 +0300
> >>>Method UnicodeEncoding.GetMaxByteCount(charCount) returns charCount * 2.
> >>>Method UTF8Encoding.GetMaxByteCount(charCount) returns charCount * 4.
> >>>
> >>>But why that?
> >>
> >>Strings in .NET are already Unicode encoded. So if you encode the
> >>string to an array of bytes, you get 2 bytes per character.
> >>
> >>However, for UTF8 encoding a single Unicode character can be encoded
> >>using up to 4 bytes in the worst case. charCount*4 is just a worst case
> >>scenario if the string happened to contain only characters that required
> >>4 byte encoding.
> >
> >
> > Do you want to say that two instances of struct Char in UTF-8 can occupy
> > 8 bytes?
> >
>
> It turns out that while a UTF8 character can take up to 4 bytes to be
> encoded, for the Framework, a struct Char can always be encoded in at
> most 3 bytes. That's because the struct char holds a 16-bit Unicode
> value, and that can always be encoded in 3 or fewer bytes.
>
> A 4-byte UTF8 encoding is only needed for Unicode code points that
> require 'surrogates' - or a pair of 16-bit values to represent the
> character. Surrogates cannot be represented in a single struct Char -
> but I believe they are supported in strings.
>
> Anyway, here's what can happen using struct Char:
>
> char c1 = '\uFFFF';
> char c2 = '\u1000';
>
> byte [] utf8bytes = Encoding.UTF8.GetBytes( new char [] { c1, c2 });
>
> If you dump the byte array, you'll see that each Char was encoded into 3
> UTF8 bytes.
>
It's making me crazy.
I don't understand.
Method UnicodeEncoding.GetMaxByteCount(charCount) returns charCount * 2.
Method UTF8Encoding.GetMaxByteCount(charCount) returns charCount * 4.
If charCount means 32-bit Unicode characters, then:
UnicodeEncoding.GetMaxByteCount(charCount) should return charCount * 4.
UTF8Encoding.GetMaxByteCount(charCount) should return charCount * 4.
If charCount means 16-bit Unicode characters (the Char structure), then:
UnicodeEncoding.GetMaxByteCount(charCount) should return charCount * 2.
UTF8Encoding.GetMaxByteCount(charCount) should return charCount * 3.
Suppose we have a string of length 5 (the length of a string means the count
of instances of struct Char).
UTF8Encoding.GetMaxByteCount(stringInstance.Length) should return 15.
But it does not.
Also, maybe each surrogate pair in a string (two 16-bit characters) occupies
only 4 bytes in UTF-8?
Yes or no?
Look:
/*
UTF-16 encodes each 16-bit character as 2 bytes. It doesn't affect the
characters at all, and no compression occurs; its performance is excellent.
UTF-16 encoding is also referred to as Unicode encoding.
UTF-8 encodes some characters as 1 byte, some characters as 2 bytes, some
characters as 3 bytes, and some characters as 4 bytes. Characters with a
value below 0x0080 are compressed to 1 byte, which works very well for
characters used in the United States. Characters between 0x0080 and 0x07FF
are converted to 2 bytes, which works well for European and Middle Eastern
languages. Characters of 0x0800 and above are converted to 3 bytes, which
works well for East Asian languages. Finally, surrogate character pairs are
written out as 4 bytes. UTF-8 is an extremely popular encoding, but it's
less useful than UTF-16 if you encode many characters with values of 0x0800
or above.
*/
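The ranges in the quoted passage can be checked directly. What follows is only a minimal sketch: the class name Utf8RangeDemo, the helper Utf8Bytes and the four sample characters are my own choices for illustration, not something from the book. It uses Encoding.UTF8.GetByteCount and Encoding.Unicode.GetByteCount to count the bytes for one character from each range.

```csharp
using System;
using System.Text;

static class Utf8RangeDemo
{
    // Number of UTF-8 bytes needed to encode the given string.
    public static int Utf8Bytes(string s)
    {
        return Encoding.UTF8.GetByteCount(s);
    }

    static void Main()
    {
        // One sample per range from the quoted passage (samples are illustrative).
        string[] samples =
        {
            "A",            // U+0041, below 0x0080
            "\u00E9",       // U+00E9, between 0x0080 and 0x07FF
            "\u4E2D",       // U+4E2D, 0x0800 and above
            "\uD834\uDD1E"  // U+1D11E, stored in the string as a surrogate pair
        };

        foreach (string s in samples)
        {
            // s.Length counts Char instances (16-bit units), not code points.
            Console.WriteLine("chars: {0}, UTF-16 bytes: {1}, UTF-8 bytes: {2}",
                s.Length, Encoding.Unicode.GetByteCount(s), Utf8Bytes(s));
        }
        // Prints:
        // chars: 1, UTF-16 bytes: 2, UTF-8 bytes: 1
        // chars: 1, UTF-16 bytes: 2, UTF-8 bytes: 2
        // chars: 1, UTF-16 bytes: 2, UTF-8 bytes: 3
        // chars: 2, UTF-16 bytes: 4, UTF-8 bytes: 4
    }
}
```

The last line is the interesting one: the surrogate pair counts as two Chars in the string, yet the pair as a whole takes 4 bytes in UTF-8, not 4 bytes per Char.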
Does it mean that each pair of characters in UTF-16 can't occupy more
than 4 bytes in UTF-8?
Wait a minute.
It seems that I understand something.
Characters below 0x0800 occupy at most 2 bytes in UTF-8
(in UTF-16 they always occupy 2 bytes).
Characters of 0x0800 and above occupy 3 bytes in UTF-8
(in UTF-16 they always occupy 2 bytes).
A surrogate character pair occupies 4 bytes in UTF-8 (in UTF-16 it
always occupies 4 bytes).
Right?
But then I think UTF8Encoding.GetMaxByteCount(charCount) should
return charCount * 3.
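To test the charCount * 3 reasoning, here is a rough sketch that brute-forces the worst case. The class name Utf8WorstCase and both method names are hypothetical, made up for this post. It measures the largest UTF-8 size of any single non-surrogate Char and of any valid surrogate pair, then prints what GetMaxByteCount reports for 5 chars. That last value is printed rather than assumed, since it may differ between framework versions.

```csharp
using System;
using System.Text;

static class Utf8WorstCase
{
    // Largest UTF-8 byte count produced by any single non-surrogate Char.
    public static int MaxBytesPerSingleChar()
    {
        int max = 0;
        char[] one = new char[1];
        for (int c = 0; c <= 0xFFFF; c++)
        {
            // Surrogate values are only meaningful as part of a pair; skip them here.
            if (c >= 0xD800 && c <= 0xDFFF) continue;
            one[0] = (char)c;
            max = Math.Max(max, Encoding.UTF8.GetByteCount(one));
        }
        return max;
    }

    // Largest UTF-8 byte count produced by any valid surrogate pair (two Chars).
    public static int MaxBytesPerSurrogatePair()
    {
        int max = 0;
        char[] pair = new char[2];
        for (int hi = 0xD800; hi <= 0xDBFF; hi++)
        {
            pair[0] = (char)hi;
            for (int lo = 0xDC00; lo <= 0xDFFF; lo++)
            {
                pair[1] = (char)lo;
                max = Math.Max(max, Encoding.UTF8.GetByteCount(pair));
            }
        }
        return max;
    }

    static void Main()
    {
        Console.WriteLine(MaxBytesPerSingleChar());     // 3
        Console.WriteLine(MaxBytesPerSurrogatePair());  // 4, i.e. 2 bytes per Char
        // What the framework itself reports for 5 chars (version dependent).
        Console.WriteLine(Encoding.UTF8.GetMaxByteCount(5));
    }
}
```

If that is right, the worst case per Char is a 3-byte character, because a surrogate pair spends 4 bytes on two Chars. So charCount * 3 would already be enough for any well-formed string.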