Re: Multilingual support in Len method



Hopefully someone more knowledgeable will
give you the best answer, but just in case, these
notes might help:

This page explains details of decoding UTF-8.
It sounds like you might be able to write some kind
of routine that filters characters beyond 127 for
interpretation:

http://64.233.187.104/search?q=cache:sQeEQqN_31EJ:www.cl.cam.ac.uk/~mgk25/unicode.html+utf&hl=en&gl=us&ct=clnk&cd=3

------------------------
# The first byte of a multibyte sequence that represents a non-ASCII
character is always in the range 0xC0 to 0xFD and it indicates how many
bytes follow for this character. All further bytes in a multibyte sequence
are in the range 0x80 to 0xBF. This allows easy resynchronization and makes
the encoding stateless and robust against missing bytes.
# All possible 2^31 UCS codes can be encoded.
# UTF-8 encoded characters may theoretically be up to six bytes long,
however 16-bit BMP characters are only up to three bytes long.
# The sorting order of Bigendian UCS-4 byte strings is preserved.
# The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.
---------------
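
As a rough illustration of the first point in that excerpt, here is a
minimal sketch in Python (not VBScript; it is only meant to show the
logic, and the function name count_utf8_chars is made up for this
example). It assumes the input is well-formed UTF-8 and counts every
byte that is not a continuation byte (0x80 to 0xBF), since each such
byte starts a new character:

```python
def count_utf8_chars(data: bytes) -> int:
    """Count characters in UTF-8 data by skipping continuation bytes.

    Assumes well-formed UTF-8; no validation is attempted.
    """
    count = 0
    for byte in data:
        # Bytes 0x80..0xBF only continue a multibyte sequence; any other
        # byte (plain ASCII or a lead byte) starts a new character.
        if not 0x80 <= byte <= 0xBF:
            count += 1
    return count


# Four Hebrew letters, written as escapes; each takes two bytes in UTF-8.
hebrew = "\u05e9\u05dc\u05d5\u05dd".encode("utf-8")
print(len(hebrew))               # 8 (bytes)
print(count_utf8_chars(hebrew))  # 4 (characters)
```

The same test on each byte value should be doable in VBScript if you
can get at the individual bytes, but I have not tried that.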

Michael Kaplan is an MS MVP who has always been
the main foreign language expert for VB and used to
write programming articles along those lines in the
VB Programmer's Journal. It might be worth trying
to reach him:

http://blogs.msdn.com/michkap/
http://www.trigeminal.com/

VB is different from VBScript in that it has more options
to access a string as Unicode or ASCII, as characters
or bytes. But it is similar, and it causes similar problems,
in that it generally shows you an ASCII string
while storing a Unicode string internally.
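
Since the question mentions cutting long strings: if the string really
is reaching you as raw UTF-8 bytes, a safe cut point is always just
before a byte outside the range 0x80 to 0xBF. Here is a minimal sketch
of that idea in Python (again not VBScript, only to show the logic).
The function name truncate_utf8 is made up for this example, and it
assumes well-formed UTF-8 input:

```python
def truncate_utf8(data: bytes, max_chars: int) -> bytes:
    """Keep at most max_chars characters without splitting a sequence.

    Assumes well-formed UTF-8; no validation is attempted.
    """
    seen = 0
    for index, byte in enumerate(data):
        # A byte outside 0x80..0xBF marks the start of a new character.
        if not 0x80 <= byte <= 0xBF:
            if seen == max_chars:
                # Cut just before the first character over the limit.
                return data[:index]
            seen += 1
    return data  # the string was already short enough


hebrew = "\u05e9\u05dc\u05d5\u05dd".encode("utf-8")  # 4 letters, 8 bytes
print(truncate_utf8(hebrew, 2).decode("utf-8") == "\u05e9\u05dc")  # True
print(truncate_utf8(b"hello", 3))                                  # b'hel'
```

Cutting this way treats Hebrew and English strings alike, because the
limit is counted in characters rather than bytes.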

I am writing a website that should support Hebrew & English (as well as
other languages).
I need to know the simple character count (not byte count!) of a
string.
When I am using the Len(str) method (with string variable as input), I
receive the following:
- English strings return the exact number of characters.
- Hebrew strings return the UTF-8 byte count.
I would like to know the number of characters in the string in both
cases, in order to cut long strings, but actually Hebrew strings are
cut differently than English strings.

I am aware of the fact that all strings are UNICODE, but I am not
interested in their byte count - only character count.

Thanks,
Gabi


