Re: Acquiring UTF-8 string length



Well, as in the example at
http://en.wikipedia.org/wiki/Multi-byte_character_set, this UTF-8 string
actually has three different lengths:

// I{heart}NY
char str[] = { 0x49, 0xE2, 0x99, 0xA5, 0x4E, 0x59, 0x00 };

count of bytes = 7 (including the terminating NUL), obtained by sizeof()
count of code units = 6 (bytes, excluding the NUL), obtained by strlen()
count of characters (code points) = 4, no API as far as I can tell

How would I get the number of characters in this string? Or how would I go
about reversing the characters in this string? Do I really have to implement
my own UTF-8 decoder/encoder?
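
For what it's worth, neither task seems to need a full decoder, assuming the
input is valid UTF-8 and "character" means code point. Every byte of the form
10xxxxxx is a continuation byte, so counting the other bytes counts the code
points. A minimal sketch follows; the function names are made up for
illustration and are not a Win32 or CRT API, and no validation is done:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A UTF-8 continuation byte has the bit pattern 10xxxxxx. */
static int utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Count code points by counting every byte that is not a continuation
   byte. Assumes s is valid, NUL-terminated UTF-8. */
size_t utf8_count_code_points(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (!utf8_is_continuation((unsigned char)*s))
            ++n;
    }
    return n;
}

/* Reverse the bytes in the half-open range [lo, hi). */
static void reverse_bytes(char *lo, char *hi)
{
    while (hi - lo > 1) {
        char tmp = *lo;
        --hi;
        *lo = *hi;
        *hi = tmp;
        ++lo;
    }
}

/* Reverse a string in place, code point by code point. Step 1 reverses
   the bytes inside each multi-byte sequence; step 2 reverses the whole
   buffer, which puts the bytes of each sequence back in order. */
void utf8_reverse(char *s)
{
    char *p = s;
    assert(s != NULL);
    while (*p != '\0') {
        char *start = p++;
        while (utf8_is_continuation((unsigned char)*p))
            ++p;
        reverse_bytes(start, p);
    }
    reverse_bytes(s, p);
}
```

With the I{heart}NY bytes above, utf8_count_code_points() gives 4 and
utf8_reverse() leaves the bytes 59 4E E2 99 A5 49. Note that this works per
code point only: a base letter followed by a combining mark would end up with
the mark in front of the wrong letter.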

Another example might be that I am reading a file and it specifies a code
page at the top... aren't there any APIs which will help me work
per-character rather than per-byte?


"Igor Tandetnik" wrote:

"Coder Guy" <CoderGuy@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote in message
news:7DDDB5D3-38E3-496F-8D93-17DC5EDE11AE@xxxxxxxxxxxxx
Isn't there a Win32 function for acquiring the string length of a
UTF-8 string (or any given code page)? Functions like lstrlen and
StringCchLength only support ANSI strings, and not strings with
variable sized UTF-8 code-point characters.

What do you mean by "support"? Functions like strlen will give you the
size _in bytes_ of any string, regardless of codepage, variable size or
otherwise. They don't really care where character boundaries lie, they
just count bytes. If this is not what you want, what exactly do you
want?
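
To illustrate the point about counting bytes, here is a small sketch of the
one job where the byte count is exactly the right length: copying a string.
The helper name copy_multibyte is hypothetical, not a library function.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Copy a NUL-terminated multibyte string (UTF-8 or any other code page).
   strlen() returns the number of bytes before the terminator, which is
   what the allocation needs; character boundaries do not matter here.
   Returns NULL on allocation failure; the caller frees the result. */
char *copy_multibyte(const char *s)
{
    size_t nbytes;
    char *copy;
    assert(s != NULL);
    nbytes = strlen(s);
    copy = malloc(nbytes + 1);           /* +1 for the terminating NUL */
    if (copy != NULL)
        memcpy(copy, s, nbytes + 1);
    return copy;
}
```

The copy is byte-for-byte identical whether the input holds one-byte or
three-byte characters, so nothing is truncated or misread.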

Another question is that MultiByteToWideChar maps into UTF-16. If
the input string is of a character set which has a character
code-point requiring more than one UINT16 then I'd imagine it would
cause all those strlen functions which don't consider the rules of the
UTF-16 character set to report a string length larger than it should.

Well, what length do you believe they "should" return? For memory
allocation purposes, you'd definitely want the length in bytes or in
wchar_t's. For display purposes, you'd probably want a string consisting
of character A and a combining diacritic to be considered of length 1,
even though it contains two Unicode characters.
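
The different answers can be sketched for UTF-16 as well. The code below
assumes a NUL-terminated buffer of 16-bit units and uses uint16_t rather than
wchar_t so that it does not depend on wchar_t being 16 bits wide; both
function names are hypothetical. The first counts 16-bit units, the second
counts code points by treating a surrogate pair as one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count 16-bit code units before the terminating 0. */
size_t utf16_count_units(const uint16_t *s)
{
    size_t n = 0;
    assert(s != NULL);
    while (s[n] != 0)
        ++n;
    return n;
}

/* Count code points. A high surrogate (0xD800..0xDBFF) followed by a low
   surrogate (0xDC00..0xDFFF) encodes one code point in two units.
   Unpaired surrogates are counted as one code point each; this sketch
   does not validate its input. */
size_t utf16_count_code_points(const uint16_t *s)
{
    size_t n = 0;
    assert(s != NULL);
    while (*s != 0) {
        if (*s >= 0xD800 && *s <= 0xDBFF &&
            s[1] >= 0xDC00 && s[1] <= 0xDFFF)
            s += 2;                      /* surrogate pair */
        else
            s += 1;                      /* single unit */
        ++n;
    }
    return n;
}
```

For U+1D11E (units D834 DD1E) the first function gives 2 and the second 1.
For A followed by the combining acute accent U+0301, both give 2, even though
a display would show a single letter, so neither function answers the
"length for display purposes" question.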

There is no single "right" definition of string length. What's your
definition? Suppose your wish is granted and you have a function that
returns one. What do you plan to use it for?

Taking a look at strsafe.h, StringCchLength is one of those functions
that would return an incorrect length in this instance.

So am I to assume there are no Win32 APIs to safely manipulate Unicode
strings?

How is StringCchLength or strlen unsafe? Can you show a fragment of code
whose safety you are worried about?
--
With best wishes,
Igor Tandetnik

With sufficient thrust, pigs fly just fine. However, this is not
necessarily a good idea. It is hard to be sure where they are going to
land, and it could be dangerous sitting under them as they fly
overhead. -- RFC 1925


