Re: Byte size of characters when encoding
From: mikeb (mailbox.google_at_nospam.mailnull.com)
Date: 07/09/04
- Next message: Shri Borde [MS]: "Re: Assembly.GetExecutingAssembly().GetReferencedAssemblies()"
- Previous message: Christopher Kimbell: "Re: What Windows versions work with VB.NET 2003 executable?"
- In reply to: Vladimir: "Re: Byte size of characters when encoding"
- Next in thread: Vladimir: "Re: Byte size of characters when encoding"
- Reply: Vladimir: "Re: Byte size of characters when encoding"
- Messages sorted by: [ date ] [ thread ]
Date: Fri, 09 Jul 2004 16:06:01 -0700
Vladimir wrote:
>>>Method UnicodeEncoding.GetMaxByteCount(charCount) returns charCount * 2.
>>>Method UTF8Encoding.GetMaxByteCount(charCount) returns charCount * 4.
>>>
>>>But why that?
>>
>>Strings in .NET are already Unicode encoded. So if you encode the
>>string to an array of bytes, you get 2 bytes per character.
>>
>>However, for UTF8 encoding a single Unicode character can be encoded
>>using up to 4 bytes in the worst case. charCount*4 is just a worst case
>>scenario if the string happened to contain only characters that required
>>4 byte encoding.
>
>
> Do you want to say that two instances of struct Char in UTF-8 can occupy 8
> bytes?
>
It turns out that while a UTF-8 encoded character can take up to 4 bytes,
in the Framework a single struct Char can always be encoded in at most 3
bytes. That's because a struct Char holds a 16-bit Unicode value, and any
value up to U+FFFF can always be encoded in 3 or fewer UTF-8 bytes. So two
instances of struct Char take at most 6 bytes, not 8.
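A quick way to check the 3-byte limit is to run every non-surrogate Char value through the encoder and keep the largest byte count. This is only a sketch; the class and method names (CharUtf8Width, MaxBytesPerChar) are made up for illustration.

```csharp
using System;
using System.Text;

static class CharUtf8Width
{
    // Returns the largest UTF-8 byte count produced by any single
    // non-surrogate Char (hypothetical helper, not a Framework API).
    public static int MaxBytesPerChar()
    {
        int max = 0;
        char[] one = new char[1];
        for (int code = 0; code <= 0xFFFF; code++)
        {
            char c = (char)code;
            // Skip surrogate code units: on their own they are not characters.
            if (char.IsSurrogate(c)) continue;
            one[0] = c;
            max = Math.Max(max, Encoding.UTF8.GetByteCount(one));
        }
        return max;
    }

    static void Main()
    {
        Console.WriteLine(MaxBytesPerChar()); // prints 3
    }
}
```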
A 4-byte UTF-8 encoding is only needed for Unicode code points above
U+FFFF, which require 'surrogates' - a pair of 16-bit values - to
represent the character. A surrogate pair cannot be represented in a
single struct Char, but a string can hold one as two consecutive Chars.
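As an illustration of the surrogate case, the sketch below puts one code point above U+FFFF into a string and compares its length in Chars with its length in UTF-8 bytes. The choice of U+1D11E (MUSICAL SYMBOL G CLEF) and the names SurrogateDemo and Utf8Length are mine, picked only for the example.

```csharp
using System;
using System.Text;

static class SurrogateDemo
{
    // U+1D11E lies above U+FFFF, so a string stores it as the
    // surrogate pair D834 DD1E (two Chars).
    public const string Clef = "\uD834\uDD1E";

    // Hypothetical helper: number of UTF-8 bytes needed for a string.
    public static int Utf8Length(string s)
    {
        return Encoding.UTF8.GetByteCount(s);
    }

    static void Main()
    {
        Console.WriteLine(Clef.Length);                             // 2
        Console.WriteLine(char.IsSurrogatePair(Clef[0], Clef[1]));  // True
        Console.WriteLine(Utf8Length(Clef));                        // 4
    }
}
```

So the two Chars of a surrogate pair together produce 4 UTF-8 bytes, which is less than the 6 bytes that two ordinary 3-byte Chars would need.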
Anyway, here's what can happen using struct Char:
char c1 = '\uFFFF';
char c2 = '\u1000';
// GetBytes is an instance method, so use the shared Encoding.UTF8 instance (System.Text)
byte[] utf8bytes = Encoding.UTF8.GetBytes(new char[] { c1, c2 });
If you dump the byte array, you'll see that each Char was encoded into 3
UTF8 bytes.
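One possible way to dump the array is BitConverter.ToString, which prints the bytes as hyphen-separated hex. The wrapper names here (DumpUtf8, Hex) are just for the example.

```csharp
using System;
using System.Text;

static class DumpUtf8
{
    // Hypothetical helper: encode the Chars as UTF-8 and format the
    // resulting bytes as hyphen-separated hex.
    public static string Hex(char[] chars)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(chars);
        return BitConverter.ToString(bytes);
    }

    static void Main()
    {
        // U+FFFF encodes as EF BF BF and U+1000 as E1 80 80.
        Console.WriteLine(Hex(new char[] { '\uFFFF', '\u1000' })); // EF-BF-BF-E1-80-80
    }
}
```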
Jon Skeet has written an excellent article on this type of issue:
http://www.yoda.arachsys.com/csharp/unicode.html
-- mikeb