Re: C# and encodings

1)

a) With "Encoding.Default" you retrieve the system’s default code page.
But if Windows has numerous code pages, then what exactly would the
default page be, meaning where (or in what apps) does Windows use
this default page over other code pages?

This is controlled by the system locale (or "Language for non-Unicode
applications"). A user can change it, but the change affects the whole
system and all users, and requires a reboot.
See http://www.mihai-nita.net/article.php?artID=20050611a



b) Can code page support Unicode coded character set, but may use
different encoding than Unicode does ( Unicode set uses three
encodings - UTF-8, UTF-16 and UTF-32 )?

No. Unicode is one code page (with several encodings).
If a code page "supports Unicode" then it is Unicode.
But (most) other code pages are a subset of Unicode
(not necessarily a contiguous subset).
Anyway, text in any code page can also be represented as Unicode,
but not the other way around.
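
The subset relationship can be checked directly. The sketch below is in Python rather than C#, purely for brevity. It assumes Python's built-in cp1252 codec is a fair stand-in for Windows code page 1252, and the sample bytes and the helper name fits_in_code_page are made up for illustration.

```python
# Every byte that cp1252 defines maps to some Unicode code point,
# so code page text can always be decoded into Unicode.
legacy_bytes = b"caf\xe9 \x80 5"      # 0xE9 is e-acute, 0x80 is the euro sign
text = legacy_bytes.decode("cp1252")
round_trip = text.encode("cp1252")    # converting back is lossless here

# The subset is not contiguous: byte 0x80 lands far away, at U+20AC.
euro_code_point = ord(text[5])


def fits_in_code_page(s, code_page):
    """Return True if every character of s can be encoded in code_page."""
    try:
        s.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False


# Three kanji (U+65E5 U+672C U+8A9E) have no slot in cp1252.
print(round_trip == legacy_bytes)
print(hex(euro_code_point))
print(fits_in_code_page("\u65e5\u672c\u8a9e", "cp1252"))
```

Python raises an error in its default strict mode. Other runtimes may instead substitute a replacement character such as "?", which is the same loss in a quieter form.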


c)

* Are there also 8-bit code pages which use Unicode character
encoding, and thus have only 255 code points matched to characters?

No. A code page does not use Unicode encodings.

* Can these code pages also use UTF-16 or UTF-32 encoding?
No. All UTF encodings are Unicode only
(as the name says: UTF = *Unicode* Transformation Format)


* Are there also code pages that support more than 255, but less than
2^16 code points?
Yes. Code pages designed for Japanese (cp932, or Shift-JIS), Chinese
Traditional (cp950, or Big-5), Chinese Simplified (cp936, or GB2312),
Korean (cp949). These are the only multi-byte code pages that can also
be system code pages, but there are other code pages with more than
255 characters.
You should consider these "legacy" and not use them, except for interchange
with old applications and files.
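
As a rough illustration of what "more than 255 code points" means in practice, the following Python sketch (Python chosen only for brevity) encodes a few characters with the cp932 codec and counts the bytes. The helper name byte_lengths and the sample characters are assumptions made for this example.

```python
def byte_lengths(text, code_page):
    """Return the number of bytes each character of text needs in code_page."""
    return [len(ch.encode(code_page)) for ch in text]


# Latin "A", katakana A (U+30A2) and the kanji U+65E5.
samples = "A\u30a2\u65e5"

for ch, size in zip(samples, byte_lengths(samples, "cp932")):
    print(f"U+{ord(ch):04X} takes {size} byte(s) in cp932")
```

ASCII characters stay at one byte while the Japanese characters take two, which is how such a code page gets past the 8-bit limit without being a Unicode encoding.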


2)
As far as I understand, the above text suggests that preamble should
be added by default, but I’d say that’s not true?!

Based on the doc, it should add the preamble.
Did you test it and find that it is not true?


3) I noticed there are only four classes derived from Encoding class
(ASCIIEncoding, UTF8Encoding, UnicodeEncoding and UTF7Encoding). What
if you want to use some other, non-Unicode encoding?

You use the Encoding class: call Encoding.GetEncoding with a numeric
code page identifier (for example 1252):
http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx


4)

a)
From MSDN site:
“StreamWriter Constructor (Stream, Encoding)

If you specify something other than Encoding.Default, the byte order
mark (BOM) is written to the file.”

But BOM should only be added when using one of Unicode encodings, thus
why would BOM be added if you specify non-Unicode encoding?

Sounds like the doc is not correct. A BOM is written only when the
encoding has a preamble, and the non-Unicode code pages have none.


b) “Since the Unicode byte order mark character is not found in any
....
Well, since at least some (ANSI) code pages do have glyphs for
characters at code points FF and FE, I assume above text implies that
apps ( using non-unicode code pages ) reading such a file would
understand that FE FF sequence represents BOM and thus should ignore
it?
Nope. The BOM is not the two code points FF and FE.
It is the single code point U+FEFF; FF FE (or FE FF) are just the bytes
of its UTF-16 encoding. And when you convert to a code page, there is
no equivalent for it.
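
The difference between the code point and its bytes can be seen in a short Python sketch (Python used only as a neutral illustration; cp1252 stands in for "some ANSI code page" and is an assumption of this example).

```python
BOM = "\ufeff"   # one code point, U+FEFF

# The same single code point, serialised by different Unicode encodings.
bom_bytes = {
    name: BOM.encode(name)
    for name in ("utf-8", "utf-16-le", "utf-16-be")
}
for name, raw in bom_bytes.items():
    print(name, raw.hex())

# A legacy code page has no slot for U+FEFF, so a strict conversion fails.
try:
    BOM.encode("cp1252")
    representable = True
except UnicodeEncodeError:
    representable = False
print("representable in cp1252:", representable)

# Read the other way, the bytes FF FE in cp1252 are two unrelated letters
# (U+00FF and U+00FE), not a byte order mark.
as_cp1252 = b"\xff\xfe".decode("cp1252")
print([hex(ord(ch)) for ch in as_cp1252])
```

So an application that reads the file with a code page sees two ordinary characters; nothing in the code page itself tells it to skip them.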



5)
a) “Internally, the .NET Framework stores text as Unicode UTF-16.”

I assume that the above quote is only referring to String objects and
char variables using UTF-16 encoding, or is there some other text
which is also stored as UTF-16?

All text is UTF-16. If you can think of text other than String and char
(I cannot), it is also UTF-16.
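
One practical consequence of UTF-16 storage is that a character outside the Basic Multilingual Plane occupies two 16-bit units (a surrogate pair). Python strings are not UTF-16 internally, so this sketch encodes explicitly to show the units a UTF-16 runtime would hold; the helper name utf16_code_units is hypothetical.

```python
def utf16_code_units(s):
    """Split s into the 16-bit code units of its UTF-16 form."""
    raw = s.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


bmp_char = "\u00e9"          # inside the BMP: one code unit
astral_char = "\U0001F600"   # outside the BMP: two code units

print([hex(u) for u in utf16_code_units(bmp_char)])
print([hex(u) for u in utf16_code_units(astral_char)])
```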



b) Ignoring the fact that FE FF sequence identifies the type of
encoding, does U+FEFF also represent a character ( outside the context
of encoding )?

You are confusing things. FEFF is a character (BOM) at the beginning
of the file. In the middle, it means ZERO WIDTH NO-BREAK SPACE.
It was used as a marker, then kind of transformed into a de facto
standard.
A bit like #! at the beginning of a Unix script meaning
"this file is a script and the thing after #! is the path to the
interpreter for this script"
It is a convention. It does not mean there are no scripts without
#! or that the existence of #! guarantees 100% that the file is a script.
Just that it is "very-very likely" a script.
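
The "convention, not guarantee" point can be sketched as a small BOM sniffer. This is a minimal Python illustration, not how any particular framework does it; the function name sniff_bom and the signature table are assumptions of this example, built on the BOM constants from Python's codecs module.

```python
import codecs

# Longest signatures first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return (encoding_name, bom_length) if data starts with a BOM, else None."""
    for signature, name in _SIGNATURES:
        if data.startswith(signature):
            return name, len(signature)
    return None


print(sniff_bom(b"\xef\xbb\xbfhello"))
print(sniff_bom(b"hello"))
```

A None result does not mean the data is not Unicode, and a match is only a strong hint: a cp1252 file that happens to start with the bytes FF FE would be misdetected as UTF-16.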



a) But is it only the data in the packet’s header that uses this byte
order, while application data is sent just as it is, without reversing
its byte order ( assuming this data is sent over the network by PC1 )?

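The packet headers are always in the same byte order (network byte
order, which is big-endian). It is part of the specs.
The application data is sent as it is; its byte order is whatever the
application decided to write.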
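
A small Python sketch of the distinction, using the standard struct module. The port number and the payload are arbitrary values picked for illustration.

```python
import struct

port = 8080   # 0x1F90, an arbitrary header field value

# Header fields are serialised in network byte order (big-endian).
network_order = struct.pack("!H", port)

# The same value as a little-endian machine (for example x86) holds it in memory.
little_endian = struct.pack("<H", port)

# Application data is just bytes: here the sender chose UTF-16 little-endian,
# and nothing in the network stack will reorder it.
payload = "A\u00e9".encode("utf-16-le")

print(network_order.hex(), little_endian.hex(), payload.hex())
```

Because the payload is not reordered, the receiver has to know (or detect, for instance from a BOM) which byte order the sender used.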


Advice: these are things on which many people have opinions,
but relatively few know what they are talking about (including me).
Whenever something is not clear, don't ask on newsgroups, go to the
official sources:
- http://unicode.org/faq/utf_bom.html
- http://www.unicode.org/glossary/


--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email