Re: Acquiring UTF-8 string length



Coder Guy wrote:
Well as in the example at
http://en.wikipedia.org/wiki/Multi-byte_character_set, this UTF-8 string
actually has three different lengths:

// I{heart}NY
char str[] = { 0x49, 0xE2, 0x99, 0xA5, 0x4E, 0x59, 0x00 };

count of bytes = 7, obtained by sizeof()

Right.

count of code points = 6, obtained by strlen()

Wrong. strlen() returns the number of chars (bytes) before the first NUL,
which is six here. The number of codepoints is four, plus the terminating NUL.
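
To make the difference concrete, here is a minimal sketch that counts codepoints by skipping UTF-8 continuation bytes (those of the form 10xxxxxx). The name utf8_codepoints is made up for this post, and the function assumes the input is already valid UTF-8; it does no error checking.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count the codepoints in a NUL-terminated UTF-8
   string. Every byte that is not a continuation byte (10xxxxxx) starts
   a new codepoint. Assumes the input is valid UTF-8. */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For the I{heart}NY array above, sizeof gives 7, strlen() gives 6 and this function gives 4.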

count of characters = 4, no API as far as I can tell

Four characters plus a terminating NUL. Note that there are codepoints that
don't resolve to a character and characters that resolve to more than one
codepoint (a base letter followed by a combining accent, for example).
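
As an illustration of the codepoint/character split, here is a rough sketch of a single-codepoint UTF-8 decoder. The name utf8_decode and its interface are invented for this post. It assumes a NUL-terminated buffer and only checks the lead byte and the continuation bytes; it does not reject overlong encodings or surrogates, which a real decoder should.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode the codepoint starting at s into *cp.
   Returns the number of bytes consumed, or 0 if the sequence is
   malformed or truncated. Assumes s is NUL-terminated. */
size_t utf8_decode(const char *s, uint32_t *cp)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t len, i;

    assert(s != NULL && cp != NULL);

    if (p[0] < 0x80)                { *cp = p[0];        return 1; }
    else if ((p[0] & 0xE0) == 0xC0) { *cp = p[0] & 0x1F; len = 2; }
    else if ((p[0] & 0xF0) == 0xE0) { *cp = p[0] & 0x0F; len = 3; }
    else if ((p[0] & 0xF8) == 0xF0) { *cp = p[0] & 0x07; len = 4; }
    else return 0;  /* stray continuation byte or invalid lead byte */

    for (i = 1; i < len; ++i) {
        if ((p[i] & 0xC0) != 0x80)
            return 0;  /* truncated sequence; a NUL also ends up here */
        *cp = (*cp << 6) | (p[i] & 0x3F);
    }
    return len;
}
```

Fed the bytes 0x65 0xCC 0x81, it yields U+0065 and then U+0301: two codepoints that a reader sees as the single character "é". The precomposed form 0xC3 0xA9 decodes to the single codepoint U+00E9.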

How would I get the number of characters in this string? Or how would
I go about reversing the characters in this string? Do I have to really
implement my own UTF-8 decoder/encoder?

There are libraries out there that help you handle all the various facets of
Unicode. You might want to take a look at ICU, for example.
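
If reversing by codepoint is good enough, you can get away without a library. The following is only a sketch under that assumption: utf8_reverse and reverse_bytes are names made up here, the input must be valid UTF-8, and combining marks will end up attached to the wrong base letter. Reversing by user-perceived character needs proper text segmentation, which is what libraries such as ICU provide through their break iterators.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Reverse the bytes in the half-open range [first, last). */
static void reverse_bytes(char *first, char *last)
{
    while (first < last) {
        char tmp;
        --last;
        tmp = *first;
        *first = *last;
        *last = tmp;
        ++first;
    }
}

/* Hypothetical helper: reverse a NUL-terminated UTF-8 string in place,
   codepoint by codepoint. Each multi-byte sequence is reversed first,
   so that reversing the whole buffer afterwards restores its byte
   order. Assumes the input is valid UTF-8. */
void utf8_reverse(char *s)
{
    char *p = s;

    assert(s != NULL);
    while (*p != '\0') {
        char *q = p + 1;
        /* Skip the continuation bytes (10xxxxxx) of this sequence. */
        while (((unsigned char)*q & 0xC0) == 0x80)
            ++q;
        reverse_bytes(p, q);
        p = q;
    }
    reverse_bytes(s, p);
}
```

Applied to I{heart}NY this gives YN{heart}I, with the three bytes of the heart still in their original order.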

Another example might be that I am reading a file and it specifies a
code page at the top... aren't there any APIs which will help me work
per character rather than per code point?

Not that I am aware of.

[snipped fullquote with signature]

Stop this misbehaviour please, it is considered impolite on the Usenet.

Uli
