Re: How to convert from UTF-8 or ASCII to UTF-16 and back.


"Tom Serface" <tom.nospam@xxxxxxxxxxxxx> wrote in message
news:umaX3e0sHHA.4476@xxxxxxxxxxxxxxxxxxxxxxx
Hi Jeff,

Are you asking how to do this or offering up a solution? I looked at your
.cpp file and I can't testify to whether or not it works (I'll assume it
does), but why not just use the following?

If you are using ATL/MFC you may find these macros handy:

http://msdn2.microsoft.com/en-us/library/87zae4a3(VS.80).aspx

Otherwise take a look at MultiByteToWideChar() and WideCharToMultiByte()
functions.
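
As a rough sketch of that second suggestion, the two Win32 calls can be wrapped like this. The helper names Utf8ToWide and WideToUtf8 are made up for illustration, the code is Windows-only (hence the _WIN32 guard), and it assumes the input really is UTF-8 rather than text in the ANSI code page:

```cpp
#include <string>

#ifdef _WIN32
#include <windows.h>

// Hypothetical helper: UTF-8 bytes -> UTF-16 (wchar_t is 16 bits on Windows).
std::wstring Utf8ToWide(const std::string& utf8) {
    if (utf8.empty()) return std::wstring();
    const int inLen = static_cast<int>(utf8.size());
    // First call: no output buffer, so the API returns the length needed.
    int outLen = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                     utf8.data(), inLen, nullptr, 0);
    if (outLen <= 0) return std::wstring();   // invalid UTF-8 or other failure
    std::wstring wide(static_cast<size_t>(outLen), L'\0');
    // Second call: perform the conversion into the sized buffer.
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), inLen, &wide[0], outLen);
    return wide;
}

// Hypothetical helper: UTF-16 -> UTF-8 bytes.
std::string WideToUtf8(const std::wstring& wide) {
    if (wide.empty()) return std::string();
    const int inLen = static_cast<int>(wide.size());
    // For CP_UTF8 the last two arguments must be null.
    int outLen = WideCharToMultiByte(CP_UTF8, 0, wide.data(), inLen,
                                     nullptr, 0, nullptr, nullptr);
    if (outLen <= 0) return std::string();
    std::string utf8(static_cast<size_t>(outLen), '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide.data(), inLen,
                        &utf8[0], outLen, nullptr, nullptr);
    return utf8;
}
#endif
```

Each helper calls the API twice: once to learn the required length, then again to convert. Passing an explicit length instead of -1 means the result carries no extra terminating null.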

I didn't click on the .EXE link (wouldn't do that in a newsgroup), but
like I said, I'll assume it works.

BTW, the "magic" number you're referring to is called a BOM (Byte Order
Mark) and you'll find it at the start of most Unicode (UTF-16) files and
some UTF-8 files. It does make it easier to figure out the file type.
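
A minimal sketch of that check, assuming the first few bytes of the file have already been read into a std::string. The Bom enum and the DetectBom name are hypothetical, chosen only for this example:

```cpp
#include <cstddef>
#include <string>

enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Report which byte order mark, if any, the buffer starts with.
// Note: FF FE is also the start of a UTF-32 little-endian BOM (FF FE 00 00);
// this sketch does not try to tell the two apart.
Bom DetectBom(const std::string& bytes) {
    auto b = [&](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };
    if (bytes.size() >= 3 && b(0) == 0xEF && b(1) == 0xBB && b(2) == 0xBF)
        return Bom::Utf8;
    if (bytes.size() >= 2 && b(0) == 0xFF && b(1) == 0xFE)
        return Bom::Utf16LE;   // U+FEFF stored low byte first
    if (bytes.size() >= 2 && b(0) == 0xFE && b(1) == 0xFF)
        return Bom::Utf16BE;   // U+FEFF stored high byte first
    return Bom::None;
}
```

A file with no BOM is not necessarily ASCII; plenty of UTF-8 files are written without one, so Bom::None only means "no marker found".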

Thanks for your post. The code was an interesting read.

Tom

"Jeff.Relf" <Jeff_Relf@xxxxxxxxx> wrote in message
news:Jeff_Relf_2007_Jun_20__6_1_A0@xxxxxxxxxxxx
Hi Tom_Serface Mr. Z.K. and David Lowndes,

This is my line-wrapper for .HTM files and the like:

www.Cotse.NET/users/jeffrelf/Wrap_HTML.EXE
www.Cotse.NET/users/jeffrelf/Wrap_HTML.CPP ( VC++ 8 )
www.Cotse.NET/users/jeffrelf/Wrap_HTML.VCProj

Pass Wrap_HTML.EXE the file you want to wrap.
( e.g. run " Wrap_HTML index.HTM " )

It's a simple example of how to convert from UTF-8 or ASCII to UTF-16
and then back to the original encoding ( UTF-8 or ASCII ).
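
For readers who want to see the round trip without any platform API, here is a portable sketch. It is not the code from Wrap_HTML.CPP; the function names are invented for this example, malformed input is replaced with U+FFFD as a design choice, and overlong UTF-8 sequences are not rejected:

```cpp
#include <cstddef>
#include <string>

// Decode UTF-8 into UTF-16 code units. Malformed sequences become U+FFFD.
std::u16string Utf8ToUtf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i++]);
        char32_t cp = 0;
        int extra = 0;                        // continuation bytes expected
        if (lead < 0x80)                { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out += char16_t(0xFFFD); continue; }

        bool ok = true;
        for (; extra > 0; --extra, ++i) {
            if (i >= in.size()) { ok = false; break; }
            unsigned char c = static_cast<unsigned char>(in[i]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            out += char16_t(0xFFFD);
        } else if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {                              // needs a surrogate pair
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }
    return out;
}

// Encode UTF-16 code units as UTF-8. Unpaired surrogates become U+FFFD.
std::string Utf16ToUtf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;                              // consumed the low surrogate
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Pure ASCII input passes through unchanged in both directions, since every ASCII byte is already a valid one-byte UTF-8 sequence.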

UTF-16 files begin like this:
" const wchar_t Magic_UTF_16 = 0xFeFF ; ";
UTF-8 like this:
" const unsigned char Magic_UTF_8[] = { 0xeF, 0xbb, 0xbF }; ".

Basically, Unicode on Windows is just wchar_t ( a 16-bit unsigned short
holding UTF-16 ) instead of char ( i.e. an 8-bit byte, __int8,
of which ASCII uses only 7 bits ).

Intel is little-endian ( low byte first ),
so a memory dump of the " space glyph " ( ASCII 32, 20 hex )
stored as a wchar_t shows " 20 00 " ( hex ).
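
To make the byte order concrete, here is a small sketch that writes one 16-bit code unit low byte first, which matches the dump described above. ToUtf16LE is an illustrative name, not something taken from Wrap_HTML.CPP:

```cpp
#include <array>
#include <cstdint>

// Serialize one UTF-16 code unit as little-endian bytes (low byte first),
// the layout an x86 machine uses for a 16-bit value in memory.
std::array<std::uint8_t, 2> ToUtf16LE(char16_t unit) {
    return {{ static_cast<std::uint8_t>(unit & 0xFF),
              static_cast<std::uint8_t>(unit >> 8) }};
}
```

By the same rule the UTF-16 magic number 0xFEFF lands on disk as the bytes FF FE when written by a little-endian machine.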

Some UTF-16 code units never appear on their own in valid text
( unpaired surrogates ), allowing custom control codes like this
( used to color-code differences between 2 files ):

const wchar_t
Ch_Default = 0xD801 , Ch_Hi = 0xD802 , Ch_Klld = 0xD803
, Ch_Born = 0xD804 , Ch_Klld_Swapd = 0xD805
, Ch_Born_Swapd = 0xD806 ;
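
One caution, with a sketch: the values 0xD801 through 0xD806 lie in the UTF-16 high-surrogate range ( 0xD800 to 0xDBFF ), so they are only safe as private markers if they are stripped before output and are never followed directly by a low surrogate. The helpers below, whose names are illustrative, tell a lone marker apart from a genuine surrogate pair:

```cpp
#include <cstddef>
#include <string>

// High (lead) surrogates occupy 0xD800..0xDBFF, low (trail) 0xDC00..0xDFFF.
constexpr bool IsHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
constexpr bool IsLowSurrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// True if the code unit at index i is a surrogate without its partner,
// i.e. something that cannot be part of well-formed UTF-16 text.
bool IsUnpairedSurrogate(const std::u16string& s, std::size_t i) {
    if (i >= s.size()) return false;
    if (IsHighSurrogate(s[i]))
        return i + 1 >= s.size() || !IsLowSurrogate(s[i + 1]);
    if (IsLowSurrogate(s[i]))
        return i == 0 || !IsHighSurrogate(s[i - 1]);
    return false;
}
```

A marker such as 0xD801 followed by an ordinary letter is reported as unpaired, while the two units of a character outside the Basic Multilingual Plane are not.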

For more on that, search for " Dif.CPP " at my website:
" www.Cotse.NET/users/jeffrelf ".



Don't you love it when you look at the "example code" (having looked up
"MultiByteToWideChar" and followed the example code link to "Looking Up a
User's Full Name") and find:

MultiByteToWideChar( CP_ACP, 0, UserName,
strlen(UserName)+1, wszUserName,
sizeof(wszUserName)/sizeof(wszUserName[0]) );
MultiByteTOWideChar( CP_ACP, 0, Domain,
strlen(Domain)+1, wszDomain,
sizeof(wszDomain)/sizeof(wszDomain[0]) );

Was it ever compiled, let alone tested?



