Re: Determining if a string is Unicode

From: Tony Proctor (tony_proctor_at_aimtechnology_NoMoreSPAM_.com)
Date: 12/02/04

    Date: Thu, 2 Dec 2004 09:43:50 -0000
    
    

    There seems to be a lot of confusion in the replies to this, Jerry.

    There's nothing magic about Unicode, though. As VB and Windows use it
    (UTF-16), it's simply a text encoding where each character occupies 2 bytes
    (or, for the rarer characters outside the Basic Multilingual Plane, a pair
    of 2-byte units), as opposed to a Single-Byte Character Set (SBCS) or
    Multi-Byte Character Set (MBCS). In an MBCS, each character is represented
    by a variable number of bytes, 1:n. In a Double-Byte Character Set (DBCS),
    e.g. Shift-JIS, this is either 1 or 2 bytes.

    In an executing VB program, String data is always stored in BSTR format, in
    Unicode. However, VB doesn't actually know for sure that they contain valid
    Unicode characters. You could load up a string with rubbish, and VB will try
    and treat it as Unicode characters. Characters are only converted to/from a
    different character set by explicit things such as calling StrConv, calling
    a Declare Function, using native file I/O, or explicitly calling the Win32
    character conversion APIs.
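
    As a rough illustration of the storage and conversion points above, here
    is a minimal VB6 sketch. The procedure name ShowStringStorage is made up
    for this example; it compares the character count, the byte count and the
    result of StrConv for a short ASCII string.

```vb
' Sketch only: paste into a standard module and run from the Immediate window.
' ShowStringStorage is a hypothetical name used for illustration.
Public Sub ShowStringStorage()
    Dim sText As String
    Dim abWide() As Byte
    Dim abAnsi() As Byte

    sText = "INF"

    ' A VB String holds 2 bytes per character.
    Debug.Print Len(sText)            ' 3 characters
    Debug.Print LenB(sText)           ' 6 bytes

    ' Assigning a String to a Byte array copies the raw Unicode bytes,
    ' with no character-set conversion.
    abWide = sText
    Debug.Print UBound(abWide) + 1    ' 6

    ' StrConv with vbFromUnicode converts to the system ANSI code page.
    abAnsi = StrConv(sText, vbFromUnicode)
    Debug.Print UBound(abAnsi) + 1    ' 3

    ' Converting the ANSI bytes back gives the original string again.
    Debug.Print (StrConv(abAnsi, vbUnicode) = sText)    ' True
End Sub
```

    Applying vbFromUnicode to data that is already ANSI is one of the
    "rubbish in" cases: VB has no way of knowing, and simply converts whatever
    16-bit values it finds.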

    Now, back to your case...

    In general, it's impossible to look at a sequence of bytes and say for
    certain what character set the original text was encoded in. Even when the
    choice is only between Unicode and an ANSI character set, there's no
    bullet-proof method, but NULL bytes give a strong hint. If the file
    contains just text (i.e. no binary data) in an ANSI character set then you
    would not expect to see any NULL bytes. Hence, if you examined the bytes
    and found a NULL then there's a fairly good chance that it's not in an ANSI
    character set. The reverse doesn't hold: if Unicode text contained, say,
    Chinese or Japanese characters then you won't get many NULL bytes in there,
    so it just looks like random bytes. If your text was predominantly Latin
    (i.e. no accented or Far-Eastern characters) then you could substantiate
    the guess by performing a statistical scan: in Unicode, the NULL bytes
    would all be in even byte positions or all in odd byte positions, depending
    on whether the Unicode was stored in big-endian or little-endian format.
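
    The scan described above might be sketched as follows. This is a
    heuristic under stated assumptions, not a definitive test. The names
    GuessEncoding and GuessFileEncoding are hypothetical, and the byte-order
    mark check is an addition that the discussion above does not mention. The
    file is read into a Byte array rather than a String, so that native file
    I/O does not convert the data first.

```vb
' Hypothetical helper: guesses the encoding of raw file bytes.
' Returns "UTF-16LE", "UTF-16BE", "ANSI" (no NULL bytes found) or "Unknown".
' abData must already be dimensioned by the caller.
Public Function GuessEncoding(abData() As Byte) As String
    Dim i As Long
    Dim nBytes As Long
    Dim nEvenNulls As Long    ' NULLs at offsets 0, 2, 4, ...
    Dim nOddNulls As Long     ' NULLs at offsets 1, 3, 5, ...

    GuessEncoding = "Unknown"
    nBytes = UBound(abData) - LBound(abData) + 1
    If nBytes < 2 Then Exit Function

    ' Assumption beyond the post: honour a UTF-16 byte-order mark if present.
    If abData(LBound(abData)) = &HFF And abData(LBound(abData) + 1) = &HFE Then
        GuessEncoding = "UTF-16LE"
        Exit Function
    End If
    If abData(LBound(abData)) = &HFE And abData(LBound(abData) + 1) = &HFF Then
        GuessEncoding = "UTF-16BE"
        Exit Function
    End If

    ' Count NULL bytes by position.
    For i = 0 To nBytes - 1
        If abData(LBound(abData) + i) = 0 Then
            If (i And 1) = 0 Then
                nEvenNulls = nEvenNulls + 1
            Else
                nOddNulls = nOddNulls + 1
            End If
        End If
    Next i

    If nEvenNulls = 0 And nOddNulls = 0 Then
        GuessEncoding = "ANSI"        ' plain text with no NULLs at all
    ElseIf nEvenNulls = 0 Then
        GuessEncoding = "UTF-16LE"    ' zero high byte follows the low byte
    ElseIf nOddNulls = 0 Then
        GuessEncoding = "UTF-16BE"    ' zero high byte precedes the low byte
    End If
    ' NULLs in both positions: binary or mixed data, left as "Unknown".
End Function

' Usage sketch: read the file as raw bytes, then guess.
Public Function GuessFileEncoding(sFileName As String) As String
    Dim ab() As Byte
    Dim f As Integer

    GuessFileEncoding = "Unknown"
    f = FreeFile
    Open sFileName For Binary Access Read As f
    If LOF(f) > 0 Then
        ReDim ab(0 To LOF(f) - 1)
        Get f, , ab
        GuessFileEncoding = GuessEncoding(ab)
    End If
    Close f
End Function
```

    Unicode text made up mostly of Chinese or Japanese characters, with no
    byte-order mark, can still come back as "ANSI" or "Unknown" here, which is
    exactly the limitation described above.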

                Tony Proctor

    "Jerry West" <jw@comcast.net> wrote in message
    news:10qs4q2blgnv7db@news.supernews.com...
    > I have a strange issue I can't seem to get a handle on. I'm reading in an
    > INF file like so:
    >
    > Public Function mP_GetFileText(sFileName As String, _
    >         Optional bNoLock As Boolean = True, _
    >         Optional nStart As Long = 1) As String
    >
    >     Dim sText As String
    >     Dim i As Integer
    >
    >     i% = FreeFile
    >
    >     If gFSO.FileExists(sFileName$) Then
    >
    >         If bNoLock Then
    >             Open sFileName$ For Binary Access Read Lock Write As i%
    >         Else
    >             Open sFileName$ For Binary Access Read As i%
    >         End If
    >
    >         sText$ = String$(LOF(i%), 0)
    >         Get i%, nStart&, sText$
    >
    >         Close i%
    >
    >         mP_GetFileText = sText$
    >
    >     End If
    >
    > End Function
    >
    > If the INF file is read from an NT-based system it appears to be in
    > Unicode. Viewing the string in the watch window has every other char as a
    > Null. If I then perform this operation on it, it appears as a "normal"
    > (non-Unicode) string:
    >
    > sString$ = StrConv(sString, vbFromUnicode)
    >
    > Now, if I read in the INF file from a 9x based computer the string does
    > not appear to be in Unicode. Further, if I perform the StrConv operation
    > on this string it changes the entire string to all question marks.
    >
    > This left me with attempting to determine whether or not the INF file read
    > is in Unicode or not. First I tried examining the VarType() value. However
    > for either type of string returned the value was always an 8 (8 = string).
    > Clearly both types of strings are NOT the same so this seems odd to me. I
    > clearly cannot perform the StrConv function on the already "normal" string
    > w/o changing it to all question marks. I then thought I'd try to read the
    > files in using a different method like so:
    >
    > mP_GetFileText = gFSO.OpenTextFile(sFileName$, ForReading,
    > TristateFalse).ReadAll
    >
    > This also failed. No matter what Tristate value I would use the string
    > returned was always Null chars IF the INF file being read was on a remote
    > 9x or NT system. It would work OK when reading local files. Finally, I
    > tried creating a function that would return True if the string was
    > Unicode like so:
    >
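    > Dim l As Long
    > Dim sa() As Byte
    >
    > On Error GoTo ErrHandler
    >
    > ' Copy the raw bytes of the string (2 per character) into a Byte array
    > sa = sString$
    >
    > If (UBound(sa) > -1) Then
    >
    >     ' Look at the high byte of each character; stop at the first non-zero
    >     For l& = 1 To UBound(sa) Step 2
    >         If (sa(l&) <> 0) Then Exit For
    >     Next l&
    >
    >     ' True if the loop stopped early, including on the last character
    >     mP_IsStringNonEnglishUnicode = (l& <= UBound(sa))
    >
    > End If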
    >
    > This also would not properly detect that the "normal" string was
    > non-Unicode (if that is what it is).
    >
    > Has anyone else seen this type of issue before? Is there a way to detect
    > the difference between these two string types? I'm not even certain that
    > the string read from the NT based system is in Unicode at this point. Does
    > anyone have any comments to share on this situation?
    >
    > Thanks!
    >
    > JW
    >
    >
    >
    >

