Re: Unicode strings and byte arrays

YYZ wrote:
> In doing more investigation over lunch, it seems that my text editor
> was helping me out a bit, but because I didn't know it, it was
> confusing me. A "bad" file looks like this in hex view:
>
> FF FE 69 00 66 00 20 00 65 00 78 00 69 00 73 00
> 74 00 73 00 20 00 28 00 73 00 65 00 6C 00 65 00

I remember seeing that signature, now that you post it! That's how Notepad stored
Unicode. No idea how "universal" it is. Oughta be, huh? <g>

I just tried saving "as Unicode" with TextPad, and that didn't add the signature,
fwiw. Resaving with Notepad added the sig. TextPad could still open it, too, btw.
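
For anyone wanting to test for that signature by hand, here's a minimal sketch (in Python rather than VB, just to show the idea). The name sniff_bom is made up for illustration, and it only looks for the UTF-8 and UTF-16 marks; a file saved without a signature, like my TextPad one, will come back as "none found" even if it really is Unicode.

```python
import codecs

# Byte-order marks to look for, mapped to the encoding they imply.
# Only UTF-8 and UTF-16 are checked in this sketch.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),   # FF FE, the signature quoted above
    (codecs.BOM_UTF16_BE, "utf-16-be"),   # FE FF
)

def sniff_bom(data):
    """Return (encoding, bom_length), or (None, 0) if no signature is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

# The first 16 bytes of the "bad" file quoted above.
sample = bytes.fromhex("FF FE 69 00 66 00 20 00 65 00 78 00 69 00 73 00")
encoding, skip = sniff_bom(sample)
text = sample[skip:].decode(encoding)   # text after the signature
```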

> all the even columns of hex codes. EXCEPT that return characters (0D
> 0A) aren't separated with 00 between them, which really messes

Oh that's just *really* weird! Here, each one takes two bytes, just like everything
else! I guess every editor is using a different standard. Lovely, huh?
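
If the files really do contain a bare 0D 0A in the middle of otherwise two-byte text, one possible repair is sketched below (Python again, not VB). It leans on two assumptions I can't verify from here: that the bare pair always sits on a two-byte boundary, and that the text never contains a genuine U+0A0D character. The function name and the test buffer are invented for illustration.

```python
def decode_mangled_utf16le(data):
    """Decode UTF-16 LE bytes in which CR/LF was written as a bare 0D 0A."""
    # Drop the FF FE signature if present.
    if data[:2] == b"\xff\xfe":
        data = data[2:]
    # A bare 0D 0A pair on a two-byte boundary decodes as the single
    # code unit U+0A0D, so map that back to a real CR/LF.
    return data.decode("utf-16-le").replace("\u0a0d", "\r\n")

# Hypothetical buffer: signature, "if", bare CR/LF, "x".
mangled = bytes.fromhex("FF FE 69 00 66 00 0D 0A 78 00")
fixed = decode_mangled_utf16le(mangled)
```

A file with properly widened line ends (0D 00 0A 00) passes through unchanged, so the same routine should be safe to run on both kinds.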

>>> Does the IsTextUnicode API call do any good? I've seen it but never
>>> needed to try it.
>>>
>> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_81np.asp
>>
>> There ya go! I *thought* I'd run across that at some point, but sure as heck
>> couldn't find it in MSDN myself. Gotta love the description:
>>
>> "The IsTextUnicode function determines whether a buffer is likely to contain a
>> form of Unicode text. The function uses various statistical and deterministic
>> methods to make its determination, ..."
>>
>> So, not even Microsoft knows of a good way to really tell. <g>
>
> No kidding!

Looks like your best bet, though.

> I did find out that I can use this:
> IsTextUnicode(btByte(0), lLen, IS_TEXT_UNICODE_SIGNATURE)
> and the retval will be 0 for a pure ASCII file, and <> 0 for one of my
> messed up files -- that Unicode signature evidently is added to the
> beginning of all Unicode files as 0xFEFF -- assuming the app that
> saved it like that plays by the rules. So far it works fine. Now I
> just have to write the function to copy selected elements of the byte
> array.

If it weren't for the odd Cr/Lf pairs, I'd say you're on your way. Would it be
possible to get these goobs to use normal ANSI when they save? <g>
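
In case they won't, here's a rough sketch of the conversion itself, in Python rather than VB. It assumes the data is little-endian UTF-16 with an optional FF FE signature and that "ANSI" means code page 1252; the function name is made up. For plain ASCII content, the result is the same as copying every other byte of the array after the signature, which is the shortcut you described.

```python
def utf16le_to_ansi(data, codepage="cp1252"):
    """Convert UTF-16 LE bytes (optional FF FE signature) to single-byte text."""
    if data[:2] == b"\xff\xfe":
        data = data[2:]
    # Characters the code page can't represent come out as "?".
    return data.decode("utf-16-le").encode(codepage, errors="replace")

# The first 16 bytes of the "bad" file quoted earlier in the thread.
sample = bytes.fromhex("FF FE 69 00 66 00 20 00 65 00 78 00 69 00 73 00")
ansi = utf16le_to_ansi(sample)

# The byte-copy shortcut: skip the signature, keep every other byte.
every_other = sample[2::2]
```

The shortcut only holds while every high byte is 00; anything outside that range needs the real decode step.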

Have fun... Karl
--
Working Without a .NET?
http://classicvb.org/petition

