Re: Writing Japanese or Chinese strings in a text file



There's lots of potential for problems here, Olivier. I thought the file was
generated directly by Excel. Cut-and-pasting from a web page sounds a bit
"heroic", but if you are sure that the data is then correctly stored in the
DBCS for Chinese (say) then at least we have a good starting point.

When you load the data into Notepad, I assume you see the correct Chinese
characters on the screen. You didn't say this explicitly in your reply. If
so then I would probably save it explicitly as UTF-8 to ensure it's never
ambiguous later, i.e. select an Encoding of "UTF-8" in the 'Save As' dialog.
This writes a 3-byte signature (EF BB BF, the UTF-8 form of the Unicode
byte order mark) at the start of the file that flags the data as UTF-8.
Whenever Notepad reloads it, it sees this sequence and treats the data
accordingly.
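
To check whether a saved file really starts with that signature, something
like the following rough sketch could be used. The function name HasUTF8BOM
is made up for illustration, and the code assumes the file already exists.

```vb
' Hypothetical helper: returns True if the file starts with EF BB BF.
' Assumes sPath points at an existing file.
Public Function HasUTF8BOM(ByVal sPath As String) As Boolean
    Dim f As Integer
    Dim b(0 To 2) As Byte

    f = FreeFile
    Open sPath For Binary Access Read As #f
    If LOF(f) >= 3 Then
        Get #f, , b     ' reads exactly 3 raw bytes, no ANSI conversion
        HasUTF8BOM = (b(0) = &HEF And b(1) = &HBB And b(2) = &HBF)
    End If
    Close #f
End Function
```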

Now the VB side: VB uses Unicode internally, for 'String' data in memory.
However, file I/O converts to/from the current ANSI character set -- which
is why it's necessary to read other data in binary mode instead (see below).
Also, the VB controls normally use the current ANSI character set.
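
A small sketch of what that ANSI conversion does to Far Eastern text,
assuming StrConv with vbFromUnicode goes through the same system ANSI code
page as Print # (the Sub name is made up for illustration):

```vb
' Illustration only: convert two Chinese characters to the system ANSI
' code page and show the resulting bytes in the Immediate window.
Public Sub ShowAnsiLoss()
    Dim s As String
    Dim b() As Byte
    Dim i As Long

    s = ChrW$(&H4E2D) & ChrW$(&H6587)   ' U+4E2D U+6587
    b = StrConv(s, vbFromUnicode)       ' Unicode -> current ANSI code page

    For i = LBound(b) To UBound(b)
        Debug.Print Hex$(b(i)); " ";
    Next i
    Debug.Print
End Sub
```

On a Western European system I would expect this to print 3F 3F, i.e. two
"?" characters, because the code page has no slot for those characters. On
a Simplified Chinese system it should print four DBCS bytes instead.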

The code you're using to generate a UTF-8 file is not correct since it puts
UTF-8 encoded data back into a String (remember, VB Strings are Unicode, not
UTF-8). The following code reads a UTF-8 data file properly into VB, and
then writes it out in the current ANSI character set:
http://groups.google.ie/group/microsoft.public.vb.general.discussion/msg/f3c3fd8182563e?hl=en
However, is this what you really want? Do you want to manipulate the data
with VB?
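
If the aim is only to get a VB String out to disk as UTF-8, a minimal sketch
along these lines keeps the encoded data in a Byte array and never puts it
back into a String. This is not the code from the link above; the name
WriteUTF8File is made up for illustration, and the Declare is one common way
of declaring the API so that it accepts pointers.

```vb
' Goes in the declarations section of a standard module.
Private Declare Function WideCharToMultiByte Lib "kernel32" ( _
    ByVal CodePage As Long, ByVal dwFlags As Long, _
    ByVal lpWideCharStr As Long, ByVal cchWideChar As Long, _
    ByVal lpMultiByteStr As Long, ByVal cbMultiByte As Long, _
    ByVal lpDefaultChar As Long, ByVal lpUsedDefaultChar As Long) As Long

Private Const CP_UTF8 As Long = 65001

' Hypothetical helper: writes sText to sPath as UTF-8 with a leading BOM.
Public Sub WriteUTF8File(ByVal sPath As String, ByVal sText As String)
    Dim b() As Byte
    Dim nBytes As Long
    Dim f As Integer

    If Len(sText) = 0 Then Exit Sub

    ' First call only asks how many bytes the UTF-8 form needs.
    nBytes = WideCharToMultiByte(CP_UTF8, 0, StrPtr(sText), Len(sText), _
                                 0, 0, 0, 0)
    If nBytes = 0 Then Exit Sub

    ' Three extra bytes at the front hold the UTF-8 signature.
    ReDim b(0 To nBytes + 2)
    b(0) = &HEF: b(1) = &HBB: b(2) = &HBF
    WideCharToMultiByte CP_UTF8, 0, StrPtr(sText), Len(sText), _
                        VarPtr(b(3)), nBytes, 0, 0

    ' Binary mode does not truncate, so remove any existing file first.
    If Len(Dir$(sPath)) > 0 Then Kill sPath

    f = FreeFile
    Open sPath For Binary Access Write As #f
    Put #f, , b         ' raw bytes, no ANSI conversion
    Close #f
End Sub
```

It might then be called as WriteUTF8File "C:\temp\out.txt", myCell.Value
(the path is just an example).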

Tony Proctor

<olivier.letang@xxxxxxx> wrote in message
news:1122987546.633191.236080@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
> Thanks for your reply Tony.
>
> > "Boo K.M" is right, but I suspect you need a whole lot more information
> > before you can do that. For instance, what locale are you running in? If
> > not a Far Eastern locale then your current ANSI code page will not be
> > appropriate, and so you'll get "?" when you try to display any Far
> > Eastern data.
>
>
> Well, I am using a French computer running Windows XP, but I ticked an
> option in the regional settings so that Far Eastern characters display
> correctly. So they look right in the Excel file.
>
> >
> > Also, what character set is the Excel file stored in? If you're in a
> > different locale when reading the data then you may have misread the
> > character codes (i.e. what's stored in memory no longer represents the
> > original characters). If the source data is in a Far Eastern DBCS (e.g.
> > Shift JIS), or UTF-8, then it would be better to read it in binary mode,
> > into a Byte array, and then handle the translation explicitly in your
> > code.
>
>
> OK. The Excel file is one of mine: I copied and pasted from a Chinese
> web page (to be exact, I used VB to put the string from a textarea in a
> Chinese web page into the value of a cell in my own Excel file). And the
> characters look fine on my screen in the Excel file.
> Then I tried something like this:
>
> Open myFileName For Output As #myFileNumber
> Print #myFileNumber, myCell.Value
> Close #myFileNumber
>
> but the file produced just contains "?????".
> I think there is a conversion in the Print instruction. I think that VB
> (I am using VB5) converts the string to Unicode automatically
> (but I suppose that the cell value is a DBCS string).
> I did actually try something like:
>
> Dim myString() As Byte
>
> myString = myCell.Value
> (...)
> Print #myFileNumber, myString
> (...)
>
> but it does not work.
> Since my post, I found this source code using the WideCharToMultiByte
> API :
>
> Function UTF8Encode(ByVal wText As String) As String
>     Dim vNeeded As Long
>     Dim vSize As Long
>     vSize = Len(wText)
>     vNeeded = WideCharToMultiByte(CP_UTF8, 0, StrPtr(wText), vSize, _
>                                   "", 0, 0, 0)
>     UTF8Encode = String(vNeeded, 0)
>     WideCharToMultiByte CP_UTF8, 0, StrPtr(wText), vSize, UTF8Encode, _
>                         vNeeded, 0, 0
> End Function
>
> I will try it soon. Do you think it should work?
>
> I suppose that my trouble is due to a mix-up between ANSI, DBCS,
> Unicode and UTF-8. I suppose that my Excel cell is in DBCS, and that VB
> deals with Unicode strings.
> If I type Chinese characters into Notepad manually, I have to save in
> Unicode format to keep these characters.
> I thought it was good for me that VB converts strings into Unicode
> automatically, but it seems that it is not so simple!
> That is the reason why I now think that I have to convert my string
> into UTF-8 as Boo K.M. said.
> Am I right?
>
> Thanks for your help
> Olivier
>

