Re: Acquiring UTF-8 string length
- From: Coder Guy <CoderGuy@xxxxxxxxxxxxxxxxxxxxxxxxx>
- Date: Sun, 1 Apr 2007 21:26:01 -0700
Well, as in the example at
http://en.wikipedia.org/wiki/Multi-byte_character_set, this UTF-8 string
actually has three different lengths:
// I{heart}NY
char str[] = { 0x49, 0xE2, 0x99, 0xA5, 0x4E, 0x59, 0x00 };
count of bytes = 7, obtained by sizeof()
count of code points = 6, obtained by strlen()
count of characters = 4, no API as far as I can tell
How would I get the number of characters in this string? Or how would I go
about reversing the characters in this string? Do I really have to implement
my own UTF-8 decoder/encoder?
Another example might be that I am reading a file and it specifies a code
page at the top... aren't there any APIs which will help me work
per character rather than per code point?
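For reference, here is a minimal sketch in portable C of the two operations asked about above. It assumes the input is valid, NUL-terminated UTF-8. The names utf8_count_code_points and utf8_reverse are made up for illustration; they are not Win32 or CRT functions. In Unicode terminology, the 6 that strlen() reports is a count of code units (bytes), and the 4 is a count of code points. The sketch counts code points by skipping continuation bytes, which have the form 10xxxxxx.

```c
#include <stddef.h>
#include <string.h>

/* Count code points: every byte that is NOT a continuation byte
   (10xxxxxx) starts a new code point. Assumes valid UTF-8. */
size_t utf8_count_code_points(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

/* Reverse the bytes in the half-open range [first, last). */
void reverse_bytes(char *first, char *last)
{
    while (last - first > 1) {
        char tmp;
        --last;
        tmp = *first;
        *first = *last;
        *last = tmp;
        ++first;
    }
}

/* Reverse a UTF-8 string in place, code point by code point.
   Step 1 reverses all bytes; step 2 re-reverses each multi-byte
   sequence so its lead byte comes first again. Assumes valid UTF-8. */
void utf8_reverse(char *s)
{
    size_t len = strlen(s);
    char *p = s;
    char *end = s + len;

    reverse_bytes(s, end);

    while (p < end) {
        char *q = p;
        /* After step 1, continuation bytes precede their lead byte. */
        while (q < end && ((unsigned char)*q & 0xC0) == 0x80)
            ++q;
        if (q < end)
            ++q; /* include the lead byte itself */
        reverse_bytes(p, q);
        p = q;
    }
}
```

With the I{heart}NY bytes above, utf8_count_code_points returns 4 and utf8_reverse leaves the bytes 59 4E E2 99 A5 49 (YN{heart}I). This works per code point only: a base letter followed by a combining mark would still be split apart, so it is not a general "reverse what the user sees" routine.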
"Igor Tandetnik" wrote:
"Coder Guy" <CoderGuy@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote in message.
news:7DDDB5D3-38E3-496F-8D93-17DC5EDE11AE@xxxxxxxxxxxxx
Isn't there a Win32 function for acquiring the string length of a
UTF-8 string (or any given code page)? Functions like lstrlen and
StringCchLength only support ANSI strings, and not strings with
variable sized UTF-8 code-point characters.
What do you mean by "support"? Functions like strlen will give you the
size _in bytes_ of any string, regardless of codepage, variable size or
otherwise. They don't really care where character boundaries lie, they
just count bytes. If this is not what you want, what exactly do you
want?
Another question is that MultiByteToWideChar maps to UTF-16. If
the input string is in a character set which has a character whose
code point requires more than one UINT16, then I'd imagine it would
cause all those strlen functions which don't consider the rules of
UTF-16 to report a string length larger than it should be.
Well, what length do you believe they "should" return? For memory
allocation purposes, you'd definitely want the length in bytes or in
wchar_t's. For display purposes, you'd probably want a string consisting
of character A and a combining diacritic to be considered of length 1,
even though it contains two Unicode characters.
There is no single "right" definition of string length. What's your
definition? Suppose your wish is granted and you have a function that
returns one. What do you plan to use it for?
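To make the competing definitions concrete, here is a small sketch in portable C, assuming well-formed, zero-terminated UTF-16 stored in uint16_t units. The names utf16_count_code_units and utf16_count_code_points are hypothetical helpers, not Win32 APIs. The first counts 16-bit units, which is the kind of length a wcslen-style function reports when wchar_t is 16 bits wide. The second counts code points by skipping the trailing half of each surrogate pair.

```c
#include <stddef.h>
#include <stdint.h>

/* Length in 16-bit code units, excluding the terminating zero.
   This is the figure needed for buffer sizing. */
size_t utf16_count_code_units(const uint16_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        ++n;
    return n;
}

/* Length in code points: count every unit except low (trailing)
   surrogates, 0xDC00..0xDFFF. Assumes well-formed UTF-16. */
size_t utf16_count_code_points(const uint16_t *s)
{
    size_t n = 0;
    for (; *s != 0; ++s) {
        if (*s < 0xDC00 || *s > 0xDFFF)
            ++n;
    }
    return n;
}
```

For the units 0x0041 0xD834 0xDD1E (the letter A followed by U+1D11E, which needs a surrogate pair), the first function returns 3 and the second returns 2. For 0x0041 0x0301 (A followed by a combining acute accent), both return 2, even though a reader sees a single accented letter. Neither count matches the display-oriented definition, which is why the intended use decides which length is "right".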
Taking a look at strsafe.h, StringCchLength is one of those functions
that would return an incorrect length in this instance.
So am I to assume there are no Win32 APIs to safely manipulate Unicode
strings?
How is StringCchLength or strlen unsafe? Can you show a fragment of code
whose safety you are worried about?
--
With best wishes,
Igor Tandetnik
With sufficient thrust, pigs fly just fine. However, this is not
necessarily a good idea. It is hard to be sure where they are going to
land, and it could be dangerous sitting under them as they fly
overhead. -- RFC 1925