Re: Multilingual support in Len method
- From: "mayayana" <mayaXXyana1a@xxxxxxxxxxxxxxxx>
- Date: Fri, 13 Oct 2006 13:11:10 GMT
Hopefully someone more knowledgeable will
give you the best answer, but just in case, these
notes might help:
This page explains the details of decoding UTF-8.
It sounds like you might be able to write some kind
of routine that picks out bytes above 127 and interprets
them as parts of multibyte characters:
http://64.233.187.104/search?q=cache:sQeEQqN_31EJ:www.cl.cam.ac.uk/~mgk25/unicode.html+utf&hl=en&gl=us&ct=clnk&cd=3
------------------------
# The first byte of a multibyte sequence that represents a non-ASCII
character is always in the range 0xC0 to 0xFD and it indicates how many
bytes follow for this character. All further bytes in a multibyte sequence
are in the range 0x80 to 0xBF. This allows easy resynchronization and makes
the encoding stateless and robust against missing bytes.
# All possible 2^31 UCS codes can be encoded.
# UTF-8 encoded characters may theoretically be up to six bytes long,
however 16-bit BMP characters are only up to three bytes long.
# The sorting order of Bigendian UCS-4 byte strings is preserved.
# The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.
---------------
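The byte ranges quoted above suggest one way to count characters rather than bytes: count every byte that is not a continuation byte (0x80 to 0xBF). Below is a minimal sketch of that idea in Python, not VBScript, so it would need porting. The function name utf8_char_count is made up for illustration, and it assumes the input is well-formed UTF-8.

```python
def utf8_char_count(data: bytes) -> int:
    """Count characters in UTF-8 data by skipping continuation bytes.

    Hypothetical helper for illustration; assumes well-formed UTF-8.
    """
    count = 0
    for b in data:
        # Continuation bytes are 0x80..0xBF (bit pattern 10xxxxxx).
        # Any other byte is ASCII or the first byte of a multibyte
        # sequence, so it starts a new character.
        if not 0x80 <= b <= 0xBF:
            count += 1
    return count


# Four Hebrew letters (shin, lamed, vav, final mem), two bytes each.
hebrew = "\u05e9\u05dc\u05d5\u05dd".encode("utf-8")
english = b"shalom"

print(len(hebrew), utf8_char_count(hebrew))    # 8 4
print(len(english), utf8_char_count(english))  # 6 6
```

For plain ASCII the two counts agree, which would match the behavior reported for English strings.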
Michael Kaplan is an MS MVP who has always been
the main foreign language expert for VB and used to
write programming articles along those lines in the
VB Programmer's Journal. It might be worth trying
to reach him:
http://blogs.msdn.com/michkap/
http://www.trigeminal.com/
VB is different from VBScript in that it has more options
to access a string as Unicode or ASCII, as character
or byte. But it is similar, and it causes similar problems
in the way that it generally shows you an ASCII string
while storing a Unicode string internally.
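To make the mismatch concrete, here is a rough Python sketch (again not VBScript). It assumes the symptom comes from UTF-8 bytes being treated as one character per byte, which is only a guess about the setup in question. It also shows one way to cut a string on a character boundary rather than in the middle of a multibyte sequence. The name truncate_utf8 is made up for illustration.

```python
def truncate_utf8(data: bytes, max_chars: int) -> bytes:
    """Return the prefix of UTF-8 data holding at most max_chars characters.

    Hypothetical helper for illustration; assumes well-formed UTF-8.
    """
    seen = 0
    for i, b in enumerate(data):
        # A byte outside 0x80..0xBF starts a new character.
        if not 0x80 <= b <= 0xBF:
            if seen == max_chars:
                return data[:i]  # cut just before the next character
            seen += 1
    return data


raw = "\u05e9\u05dc\u05d5\u05dd".encode("utf-8")  # four Hebrew letters

# Reading the bytes as a single-byte encoding gives one "character"
# per byte, so the length equals the UTF-8 byte count.
misread = raw.decode("latin-1")
print(len(misread))  # 8

# Cutting after two characters keeps four bytes and still decodes cleanly.
cut = truncate_utf8(raw, 2)
print(len(cut), cut.decode("utf-8") == "\u05e9\u05dc")  # 4 True
```

Cutting a fixed number of bytes instead could split a two-byte Hebrew letter in half, which may be why the Hebrew strings end up cut differently from the English ones.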
I am writing a website that should support Hebrew & English (as well as
other languages).
I need to know the simple character count (not byte count!) of a
string.
When I am using the Len(str) method (with string variable as input), I
receive the following:
- English strings return the exact number of characters.
- Hebrew strings return the UTF-8 byte count.
I would like to know the number of characters in the string in both
cases, in order to cut long strings, but actually Hebrew strings are
cut differently than English strings.
I am aware of the fact that all strings are UNICODE, but I am not
interested in their byte count - only character count.
Thanks,
Gabi
- References:
- Multilingual support in Len method
- From: Gabriella
- Multilingual support in Len method