Re: Determining if a string is Unicode
From: Tony Proctor (tony_proctor_at_aimtechnology_NoMoreSPAM_.com)
Date: 12/02/04
- Previous message: Mike D Sutton: "Re: Array Dimensioned or not"
- In reply to: Jerry West: "Determining if a string is Unicode"
- Next in thread: Bonj: "Re: Determining if a string is Unicode"
- Reply: Bonj: "Re: Determining if a string is Unicode"
Date: Thu, 2 Dec 2004 09:43:50 -0000
There seems to be a lot of confusion in the replies to this, Jerry.
There's nothing magic about Unicode. As the term is used in Windows and VB,
it's simply a text encoding (UTF-16) where each character normally occupies
2 bytes, as opposed to a Single-Byte Character Set (SBCS) or Multi-Byte
Character Set (MBCS). In an MBCS, each character is represented by a
variable number of bytes, 1:n. In a Double-Byte Character Set (DBCS), e.g.
Shift-JIS, this is either 1 or 2 bytes.
In an executing VB program, String data is always stored in BSTR format, in
Unicode. However, VB doesn't actually know for sure that a String contains
valid Unicode characters. You could load up a string with rubbish, and VB
would try to treat it as Unicode characters. Characters are only converted
to/from a different character set by explicit operations such as calling
StrConv, calling a function declared with Declare, using native file I/O, or
explicitly calling the Win32 character conversion APIs.
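To make the storage point concrete, here is a minimal VB6 sketch that dumps the bytes behind a String before and after StrConv. The names ShowStringBytes and DumpHex are made up for illustration, and the ANSI result assumes a Western system code page.

```vb
Option Explicit

' Paste into a standard module and run ShowStringBytes from the
' Immediate window. All names here are illustrative only.
Public Sub ShowStringBytes()
    Dim sText As String
    Dim abWide() As Byte
    Dim abAnsi() As Byte

    sText = "AB"

    ' Assigning a String to a Byte array copies the raw BSTR contents:
    ' two bytes per character, low byte first.
    abWide = sText
    Debug.Print "Unicode bytes: "; DumpHex(abWide)   ' 41 00 42 00

    ' vbFromUnicode converts to the system ANSI code page, giving
    ' one byte per character for plain Latin text.
    abAnsi = StrConv(sText, vbFromUnicode)
    Debug.Print "ANSI bytes:    "; DumpHex(abAnsi)   ' 41 42
End Sub

' Hypothetical helper: formats a byte array as space-separated hex.
Private Function DumpHex(ab() As Byte) As String
    Dim i As Long
    For i = LBound(ab) To UBound(ab)
        DumpHex = DumpHex & Right$("0" & Hex$(ab(i)), 2) & " "
    Next i
End Function
```

The alternating 00 bytes in the first dump are the same pattern that shows up as "every other char is a Null" when Unicode file data is read byte-for-byte into a String.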
Now, back to your case...
In general, it's impossible to look at a sequence of bytes and say what
character set the original text was encoded in. Even with Unicode versus an
ANSI character set, there's still no bullet-proof method. If Unicode text
contains, say, Chinese or Japanese characters then you won't get many NULL
bytes in there, so it just looks like random bytes. If the file contains
just text (i.e. no binary data) in an ANSI character set then you would not
expect to see any NULL bytes. Hence, if you examined the bytes and
found a NULL then there's a fairly good chance that it's not in an ANSI
character set. If your text is predominantly Latin (i.e. no accented
characters, or Far-Eastern characters) then you could substantiate this by
performing a statistical scan. If it is Unicode then the NULL bytes would
all be in even byte positions or all in odd byte positions, depending on
whether the Unicode was stored in big-endian or little-endian format.
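The statistical scan described above could be sketched roughly as follows in VB6. This is a heuristic under stated assumptions, not a definitive detector: the function name GuessFileEncoding and its return strings are invented for illustration, the file is assumed to exist and to contain text only, and the result is only meaningful for predominantly Latin text.

```vb
Option Explicit

' Heuristic only: counts NULL bytes at even and odd offsets.
' GuessFileEncoding and its return values are hypothetical names.
Public Function GuessFileEncoding(ByVal sFileName As String) As String
    Dim iFile As Integer
    Dim ab() As Byte
    Dim i As Long, lEven As Long, lOdd As Long

    iFile = FreeFile
    Open sFileName For Binary Access Read As #iFile
    If LOF(iFile) = 0 Then
        Close #iFile
        GuessFileEncoding = "Empty"
        Exit Function
    End If

    ' Reading into a Byte array (not a String) keeps the bytes raw;
    ' no ANSI-to-Unicode conversion is applied.
    ReDim ab(0 To LOF(iFile) - 1)
    Get #iFile, , ab
    Close #iFile

    For i = 0 To UBound(ab)
        If ab(i) = 0 Then
            If (i And 1) = 0 Then lEven = lEven + 1 Else lOdd = lOdd + 1
        End If
    Next i

    If lEven = 0 And lOdd = 0 Then
        GuessFileEncoding = "ANSI (no NULL bytes)"
    ElseIf lEven = 0 Then
        ' Latin text, low byte first: the NULL is the second byte of each pair.
        GuessFileEncoding = "Unicode, little-endian"
    ElseIf lOdd = 0 Then
        ' Latin text, high byte first: the NULL is the first byte of each pair.
        GuessFileEncoding = "Unicode, big-endian"
    Else
        GuessFileEncoding = "Unknown (NULLs at both even and odd offsets)"
    End If
End Function
```

A call such as Debug.Print GuessFileEncoding("C:\test.inf") (the path is a placeholder) would then tell the caller whether StrConv is likely to be needed. A single stray NULL byte is enough to flip the answer, so a more tolerant version might compare the two counts against the file length rather than requiring one of them to be exactly zero.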
Tony Proctor
"Jerry West" <jw@comcast.net> wrote in message
news:10qs4q2blgnv7db@news.supernews.com...
> I have a strange issue I can't seem to get a handle on. I'm reading in an
> INF file like so:
>
> Public Function mP_GetFileText(sFileName As String, _
>     Optional bNoLock As Boolean = True, Optional nStart As Long = 1) As String
>
> Dim sText As String
>
> Dim i As Integer
>
> i% = FreeFile
>
> If gFSO.FileExists(sFileName$) Then
>
> If bNoLock Then Open sFileName$ For Binary Access Read Lock Write As i% _
>     Else Open sFileName$ For Binary Access Read As i%
>
> sText$ = String$(LOF(i%), 0)
> Get i%, nStart&, sText$
>
> Close i%
>
> mP_GetFileText = sText$
>
> End If
>
> End Function
>
> If the INF file is read from a NT based system it appears to be in Unicode.
> Viewing the string in the watch window has every other char as a Null. If I
> then perform this operation on it, it then appears as a "normal"
> (non-Unicode) string:
>
> sString$ = StrConv(sString, vbFromUnicode)
>
> Now, if I read in the INF file from a 9x based computer the string does not
> appear to be in Unicode. Further, if I perform the StrConv operation on this
> string it changes the entire string to all question marks.
>
> This left me with attempting to determine whether or not the INF file read
> is in Unicode or not. First I tried examining the VarType() value. However
> for either type of string returned the value was always an 8 (8 = string).
> Clearly both types of strings are NOT the same so this seems odd to me. I
> clearly cannot perform the StrConv function on the already "normal" string
> w/o changing it to all question marks. I then thought I'd try to read the
> files in using a different method like so:
>
> mP_GetFileText = gFSO.OpenTextFile(sFileName$, ForReading, _
>     TristateFalse).ReadAll
>
> This also failed. No matter what Tristate value I would use the string
> returned was always Null chars IF the INF file being read was on a remote 9x
> or NT system. It would work OK when reading local files. Finally, I tried
> creating a function that would return True if the string was Unicode like
> so:
>
> Dim l As Long
>
> Dim sa() As Byte
>
> On Error GoTo ErrHandler
>
> sa = sString$
>
> If (UBound(sa) > -1) Then
>
> For l& = 1 To UBound(sa) Step 2
>
> If (sa(l&) <> 0) Then Exit For
>
> Next l&
>
> mP_IsStringNonEnglishUnicode = (l& < UBound(sa))
>
> End If
>
> This also would not properly detect that the "normal" string was non-Unicode
> (if that is what it is).
>
> Has anyone else seen this type of issue before? Is there a way to detect the
> difference between these two string types? I'm not even certain that the
> string read from the NT based system is in Unicode at this point. Does
> anyone have any comments to share on this situation?
>
> Thanks!
>
> JW
>
>
>
>