Re: Byte size of characters when encoding

From: mikeb (mailbox.google_at_nospam.mailnull.com)
Date: 07/09/04


Date: Fri, 09 Jul 2004 14:36:54 -0700

Vladimir wrote:

> Method UnicodeEncoding.GetMaxByteCount(charCount) returns charCount * 2.
> Method UTF8Encoding.GetMaxByteCount(charCount) returns charCount * 4.
>
> But why that?

Strings in .NET are already Unicode (UTF-16) encoded, with each Char
holding one 16-bit code unit. So if you encode the string to an array
of bytes with UnicodeEncoding, you get 2 bytes per Char.
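
A small sketch of that point, written in Python rather than C# purely
for illustration. It assumes that counting UTF-16 code units is a fair
stand-in for counting .NET Chars; the helper name utf16_units is made up
for this example.

```python
def utf16_units(text: str) -> int:
    """Count UTF-16 code units (the analogue of a .NET string's Length)."""
    # "utf-16-le" writes no byte-order mark, so bytes // 2 is the unit count.
    return len(text.encode("utf-16-le")) // 2

# A, e-acute, euro sign, musical G clef (outside the Basic Multilingual Plane)
samples = ["A", "\u00e9", "\u20ac", "\U0001d11e"]
for s in samples:
    units = utf16_units(s)
    utf16_bytes = len(s.encode("utf-16-le"))
    # Every code unit is 16 bits wide, so the byte count is always units * 2.
    print(f"U+{ord(s):04X}: {units} code unit(s), {utf16_bytes} UTF-16 bytes")
```

The last sample needs two code units (a surrogate pair), yet the ratio
of bytes to code units is still exactly 2.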

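However, with UTF-8 encoding a single Unicode character (code point) can
take up to 4 bytes. charCount*4 is simply a worst-case bound: it would
be reached only if every character in the string required a 4-byte
encoding. Strictly, such a character occupies two Chars (a surrogate
pair) in a .NET string, so the bound is generous rather than tight.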
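
To make the UTF-8 sizes concrete, here is a rough sketch in Python (not
C#, and not how the framework computes its bound). It assumes UTF-16
code units correspond to .NET Chars; the helper name utf8_bytes_per_char
is hypothetical.

```python
def utf8_bytes_per_char(text: str) -> float:
    """UTF-8 bytes divided by UTF-16 code units (the analogue of .NET Chars)."""
    utf8_len = len(text.encode("utf-8"))
    char_count = len(text.encode("utf-16-le")) // 2  # no BOM with "-le"
    return utf8_len / char_count

# A, e-acute, euro sign, musical G clef (outside the Basic Multilingual Plane)
samples = ["A", "\u00e9", "\u20ac", "\U0001d11e"]
for s in samples:
    print(f"U+{ord(s):04X}: {len(s.encode('utf-8'))} UTF-8 byte(s), "
          f"{utf8_bytes_per_char(s):.1f} per UTF-16 code unit")
```

The 4-byte case only arises for a code point that takes two UTF-16 code
units, so for well-formed text the per-unit ratio in this sketch tops
out at 3.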

>
> Look:
>
> /*
> Each Unicode character in a string is defined by a Unicode scalar value,
> also called ...
>
> An index is the position of a Char, not a Unicode character, in a String. An
> index is a zero-based, nonnegative number starting from the first position
> in the string, which is index position zero. Consecutive index values might
> not correspond to consecutive Unicode characters because a Unicode character
> might be encoded as more than one Char. To work with each Unicode character
> instead of each Char, use the System.Globalization.StringInfo class.
> */
>
> With UTF-8 encoding one instance of struct Char can only occupy 1/2, 1, 1
> 1/2, 2 bytes?
> Isn't it?
> Therefore UTF8Encoding.GetMaxByteCount(charCount) must return charCount *
> 2.
> Because charCount means the count of instances of struct Char.
> Or not? Maybe it means the count of Unicode characters?
> If not, then UnicodeEncoding.GetMaxByteCount(charCount) must return
> charCount * 4.
>
> These methods do not fit each other.
>
>

-- 
mikeb

