How can I make Excel save Unicode CSV data correctly?



Excel 2007's Unicode for CSV (comma separated values) support seems sketchy
at best:
- Reads ANSI CSV OK.
- Reads Unicode (UTF-16, UCS-2) CSV OK.
- Regularly garbles UTF-8 CSV when reading (depends on the included
characters).
- Regularly garbles Unicode big-endian CSV when reading (depends on the
included characters).
- Writes ANSI CSV OK.
- Can't write any form of non-ASCII Unicode CSV data OK.

When saving into CSV format it saves in ANSI CSV and breaks any non-ANSI
characters by converting them into ? (question mark) characters.

To replicate:
1. Open notepad and type in the following line of text:
Degrees C (℃),foo
You can type the special ℃ symbol by pressing-and-holding the Alt key while
typing 2103 on the numeric keypad, then release the Alt.
2. File / Save as... into a file named "sample-unicode.csv" and select the
"Unicode" encoding.
3. Open the "sample-unicode.csv" file in Excel. The data is still displayed
correctly.
4. Press F12 to get the "Save as..." window, save into a file named
"sample-excel.csv" in CSV format.

Excel incorrectly writes the following data, omitting any kind of
UTF-8/Unicode/Unicode-BE BOM:
0000: 44 65 67 72 65 65 73 20 43 20 28 3F 29 2C 66 6F Degrees C (?),fo
0010: 6F 0D 0A o..

Correct Unicode UTF-8 output should look like the following:
0000: EF BB BF 44 65 67 72 65 65 73 20 43 20 28 E2 84 ...Degrees C (..
0010: 83 29 2C 66 6F 6F 0D 0A .),foo..
(Note the EF BB BF byte order mark)

Correct Unicode UTF-16/UCS-2 output should look like the following:
0000: FF FE 44 00 65 00 67 00 72 00 65 00 65 00 73 00 ..D.e.g.r.e.e.s.
0010: 20 00 43 00 20 00 28 00 03 21 29 00 2C 00 66 00 .C. .(..!).,.f.
0020: 6F 00 6F 00 0D 00 0A 00 o.o.....
(Note the FF FE byte order mark)

Interestingly, saving from Excel to "Unicode text" format does save into
UTF-16/UCS-2 correctly:
0000: FF FE 44 00 65 00 67 00 72 00 65 00 65 00 73 00
0010: 20 00 43 00 20 00 28 00 03 21 29 00 09 00 66 00
0020: 6F 00 6F 00 0D 00 0A 00

But it converts the comma field separators (\x002c) into tabs (\x0009).

How can I make Excel behave correctly and save into some kind of Unicode CSV
format?

I'm using Microsoft Office Excel 2007 (12.0.6300.5000) SP1 MSO
(12.0.6213.1000).

.



Relevant Pages

  • Re: extract selective info from a flat text file
    ... there is no need for Excel at all. ... CSV is a simply a TEXT file with commas separating the row of text into fields. ... Most any scripting utility will allow you to read in each line from the source file, and then write the desired string subset to a destination file named as desired. ... they would be individual characters not numbers and so not divisible by 100. ...
    (microsoft.public.office.misc)
  • Re: Excel text export limit - 1024 per line (not cell), workaround
    ... I am saving it as CSV - specifically what I am doing is taking a large ... today I had one field that had 690 characters in it. ... exporting the data and the version of excel you're using. ...
    (microsoft.public.excel.misc)
  • Re: CSV file conversion...
    ... > reading exactly 1024 characters. ... Has anyone worked with csv files ... I think your problem is a file format one, ... can have Excel export tab delimited files as well. ...
    (comp.lang.cpp)
  • Exporting unicode characters to csv in excel 2008
    ... Exporting to csv in excel 2008 results in unicode characters not being ...
    (microsoft.public.mac.office.excel)
  • Re: csv reader
    ... In fact I'm reading a csv file saved from openoffice oocalc using ... UTF-8 encoding. ... convert the input data from bytes to unicode, using the encoding of those bytes ...
    (comp.lang.python)