Re: Opening a text file that may be ASCII *or* Unicode



Hi Andrew,

[top post]

Michael Harris posted something very similar back in 2001. AFAIK, this is
the only way to determine status prior to opening.

Here's MH's post from 2001

http://groups-beta.google.com/group/microsoft.public.scripting.vbscript/browse_frm/thread/628f93f8430000a5/66d14306ff6c925c?q=unicode+255+254+group:microsoft.public.scripting.*+author:Michael+author:Harris&rnum=1&hl=en#66d14306ff6c925c

There is a fly in the ointment, however, even in MH's post. There are at
least five different Unicode BOMs that signal how the file is to be
interpreted --

UTF-8: EF BB BF
UTF-16, Big-Endian: FE FF
UTF-16, Little-Endian: FF FE
UTF-32, Big-Endian: 00 00 FE FF
UTF-32, Little-Endian: FF FE 00 00

Your technique catches the two most common for Western users. Once you've
opened your file, however, it takes neither much time nor much additional
code to read the first four bytes and test for all of these. I have a WSC
routine that's called by a file open-for-reading method to do that. (A
small suggestion for your posted code: either error-trap or get the file
size first, to ensure that the file contains the number of bytes you're
reading. It could well be an empty ASCII file -- no bytes.)
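
A minimal sketch of that four-byte check follows. It is not the WSC routine
mentioned above. The function name DetectBom, its return labels, and the
sample path are made up for illustration. It assumes a Western single-byte
ANSI code page, so that Asc() returns the raw byte value for each character
read in ASCII mode, and it checks the file size first so that Read never
runs past the end of a short or empty file.

```vbscript
Option Explicit

Const ForReading = 1
Const TriStateFalse_ASCII = 0, TriStateTrue_Unicode = -1

' Hypothetical helper: returns a label for the BOM at the start of the
' file, or "" if none is found (or the file is too short to hold one).
Function DetectBom(fso, strFileName)
    Dim lngSize, oFile, strHead, b(3), i

    DetectBom = ""
    lngSize = fso.GetFile(strFileName).Size
    If lngSize < 2 Then Exit Function   ' empty or 1-byte file: no BOM

    ' Open as ASCII so each byte comes back as one character,
    ' and read no more than four bytes.
    Set oFile = fso.OpenTextFile(strFileName, ForReading, _
        False, TriStateFalse_ASCII)
    If lngSize > 4 Then lngSize = 4
    strHead = oFile.Read(lngSize)
    oFile.Close

    ' Positions past the end of a short file stay at -1,
    ' so they can never match a BOM byte.
    For i = 0 To 3
        b(i) = -1
        If i < Len(strHead) Then b(i) = Asc(Mid(strHead, i + 1, 1))
    Next

    ' Test the 4-byte BOMs first: UTF-32 LE (FF FE 00 00) starts with
    ' the same two bytes as UTF-16 LE (FF FE).
    If b(0) = 0 And b(1) = 0 And b(2) = 254 And b(3) = 255 Then
        DetectBom = "UTF-32BE"
    ElseIf b(0) = 255 And b(1) = 254 And b(2) = 0 And b(3) = 0 Then
        DetectBom = "UTF-32LE"
    ElseIf b(0) = 239 And b(1) = 187 And b(2) = 191 Then
        DetectBom = "UTF-8"
    ElseIf b(0) = 254 And b(1) = 255 Then
        DetectBom = "UTF-16BE"
    ElseIf b(0) = 255 And b(1) = 254 Then
        DetectBom = "UTF-16LE"
    End If
End Function

' Usage: pick the open mode from the detected BOM.
Dim fso, strFileName, oTextFile
Set fso = CreateObject("Scripting.FileSystemObject")
strFileName = "C:\temp\sample.txt"      ' hypothetical path

Select Case DetectBom(fso, strFileName)
    Case "UTF-16LE"
        Set oTextFile = fso.OpenTextFile(strFileName, ForReading, _
            False, TriStateTrue_Unicode)
    Case ""
        Set oTextFile = fso.OpenTextFile(strFileName, ForReading, _
            False, TriStateFalse_ASCII)
    Case Else
        WScript.Echo "BOM found, but not one this sketch opens: " & strFileName
End Select
```

The Case Else branch reflects an assumption that the FSO Unicode mode only
handles the FF FE (UTF-16, Little-Endian) flavor; the other encodings would
need a different reader. One edge case the sketch ignores: a UTF-16 LE file
whose first character after the BOM is U+0000 would be reported as UTF-32 LE.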

You might want to take a look at these --

UTF & BOM
http://www.unicode.org/faq/utf_bom.html
(for the BOM table, scroll down to the Byte Order Mark heading)

Joel Spolsky, The Absolute Minimum Every Software Developer Absolutely,
Positively Must Know About Unicode and Character Sets (No Excuses!)
http://www.joelonsoftware.com/articles/Unicode.html
(thanks to mayayana for this one)

Regards,
Joe Earnest


"Andrew Aronoff" <NOSPAM_WRONG.ADDRESS@xxxxxxxxx> wrote in message
news:pigab19o4s8tqc5h9h3ibtb01eofdiohtf@xxxxxxxxxx
> Since I can't find any documentation about TriStateUseDefault, I
> decided to open the file in ASCII; read the first two characters;
> close the file; compare those characters to 255 & 254; if true, open
> in Unicode, otherwise open in ASCII.
>
>
> Const ForReading = 1
> Const TriStateFalse_ASCII = 0, TriStateTrue_Unicode = -1
>
> 'strFileName points to a text file in ASCII or Unicode
> Set oTextFile = Fso.OpenTextFile (strFileName, ForReading, _
> False,TriStateFalse_ASCII)
>
> 'read 1st 2 chrs, find Asc chr code
> intAsc1Chr = Asc(oTextFile.Read(1))
> intAsc2Chr = Asc(oTextFile.Read(1))
>
> oTextFile.Close
>
> If intAsc1Chr = 255 And intAsc2Chr = 254 Then
>
> 'open the file in Unicode
> Set oTextFile = Fso.OpenTextFile (strFileName,ForReading, _
> False,TriStateTrue_Unicode)
>
> Else
>
> 'open the file in ASCII
> Set oTextFile = Fso.OpenTextFile (strFileName,ForReading, _
> False,TriStateFalse_ASCII)
>
> End If
>
>
> It's not elegant, but it seems to work.
>
> regards, Andy
> --
> **********
>
> Please send e-mail to: usenet (dot) post (at) aaronoff (dot) com
>
> To identify everything that starts up with Windows, download
> "Silent Runners.vbs" at www.silentrunners.org
>
> **********

