Re: C# and encodings
- From: "Mihai N." <nmihai_year_2000@xxxxxxxxx>
- Date: Tue, 03 Feb 2009 23:14:40 -0800
1)
a) With "Encoding.Default" you retrieve system’s default code page.
But if windows has numerous code pages, then what exactly would
default page be, meaning where ( or in what apps ) does windows use
this default page over other code pages?
This is controlled by the system locale (or "Language for non-Unicode
applications"). A user can change it, but the change affects the whole
system, all users, and requires a reboot.
See http://www.mihai-nita.net/article.php?artID=20050611a
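A quick way to see what the default resolves to on a given machine is to print it. This is a minimal sketch; the class and method names are made up for illustration. Note that on the .NET Framework on Windows, Encoding.Default follows the system ANSI code page, while on .NET Core and later it is always UTF-8, so the result depends on the runtime.

```csharp
using System;
using System.Text;

public static class DefaultCodePageDemo
{
    // Describes the encoding that Encoding.Default resolves to on this machine.
    public static string Describe()
    {
        Encoding def = Encoding.Default;
        return def.CodePage + " (" + def.WebName + ")";
    }

    public static void Main()
    {
        // The value printed depends on the runtime and the system locale.
        Console.WriteLine("Encoding.Default: " + Describe());
    }
}
```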
b) Can code page support Unicode coded character set, but may use
different encoding than Unicode does ( Unicode set uses three
encodings - UTF-8, UTF-16 and UTF-32 )?
No. Unicode is one code page (with several encodings).
If a code page "supports Unicode" then it is Unicode.
But (most) other code pages are a subset of Unicode
(not necessarily a contiguous subset).
Anyway, text in any code page can also be represented as Unicode.
But not the other way around.
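The one-way nature of that conversion can be seen with a round trip through a small code page. The sketch below assumes ISO-8859-1 as the example code page and asks for an explicit "?" replacement for anything that does not fit; the class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class RoundTripDemo
{
    // Encodes text to ISO-8859-1 and decodes it back.
    // Characters with no equivalent in the code page come back as '?'.
    public static string ThroughLatin1(string text)
    {
        Encoding latin1 = Encoding.GetEncoding(
            "iso-8859-1",
            new EncoderReplacementFallback("?"),
            new DecoderReplacementFallback("?"));
        byte[] bytes = latin1.GetBytes(text);
        return latin1.GetString(bytes);
    }

    public static void Main()
    {
        // U+00E9 exists in ISO-8859-1, so this text survives the round trip.
        Console.WriteLine(ThroughLatin1("caf\u00E9") == "caf\u00E9"); // True
        // These two CJK characters do not exist there, so they are lost.
        Console.WriteLine(ThroughLatin1("\u65E5\u672C")); // ??
    }
}
```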
c)
* Are there also 8-bit code pages which use Unicode character
encoding, and thus have only 255 code points matched to characters?
No. A code page does not use Unicode encodings.
* Can these code pages also use UTF-16 or UTF-32 encoding?
All UTF encodings are Unicode only
(as the name says: UTF = *Unicode* Transformation Format)
* Are there also code pages that support more than 255, but less than
2^16 code points?
Yes. Code pages designed for Japanese (cp932, or Shift-JIS), Chinese
Traditional (cp950, or Big-5), Chinese Simplified (cp936, or GB2312),
Korean (cp949). These are the only ones that can also be system code
pages. But there are other code pages with more than 255 characters.
You should consider these "legacy" and not use them, except for interchange
with old applications and files.
2)
As far as I understand, the above text suggests that preamble should
be added by default, but I’d say that’s not true?!
Based on the doc, it should add the preamble.
Did you test and it is not true?
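Testing this takes only a few lines. The documentation passage being quoted is not shown here, so the sketch below only prints what GetPreamble() returns for a few built-in encodings; the helper name is invented for illustration.

```csharp
using System;
using System.Text;

public static class PreambleDemo
{
    // Returns the preamble (BOM) bytes of an encoding as hex, or "(none)".
    public static string PreambleHex(Encoding encoding)
    {
        byte[] preamble = encoding.GetPreamble();
        return preamble.Length == 0 ? "(none)" : BitConverter.ToString(preamble);
    }

    public static void Main()
    {
        Console.WriteLine(PreambleHex(Encoding.UTF8));           // EF-BB-BF
        Console.WriteLine(PreambleHex(new UTF8Encoding(false))); // (none)
        Console.WriteLine(PreambleHex(Encoding.Unicode));        // FF-FE
        Console.WriteLine(PreambleHex(Encoding.ASCII));          // (none)
    }
}
```

Whether a particular writer actually emits those bytes is a separate question that has to be checked against the writer itself.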
3) I noticed there are only four classes derived from Encoding class
( ASCIIEncoding, UTF8Encoding, UnicodeEncoding and UTF7Encoding ). What
if you want to use some other, non-unicode encoding?
You use the Encoding class with a numeric code page identifier:
http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx
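A minimal sketch of that lookup, using code page 28591 (ISO-8859-1) as the example; the class and method names are hypothetical. One assumption to be aware of: on the .NET Framework the Windows code pages (1252, 932 and so on) are available directly, but on .NET Core and later most of them need Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) to be called first. 28591 is available on both without that step.

```csharp
using System;
using System.Text;

public static class CodePageLookupDemo
{
    // Looks up an encoding by its numeric code page identifier.
    public static string NameOf(int codePage)
    {
        return Encoding.GetEncoding(codePage).WebName;
    }

    // Encodes text with the given code page and returns the bytes as hex.
    public static string EncodeHex(int codePage, string text)
    {
        return BitConverter.ToString(Encoding.GetEncoding(codePage).GetBytes(text));
    }

    public static void Main()
    {
        Console.WriteLine(NameOf(28591));              // iso-8859-1
        Console.WriteLine(EncodeHex(28591, "\u00E9")); // E9
    }
}
```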
4)
a)
From MSDN site:
“StreamWriter Constructor (Stream, Encoding)
If you specify something other than Encoding.Default, the byte order
mark (BOM) is written to the file.”
But BOM should only be added when using one of Unicode encodings, thus
why would BOM be added if you specify non-Unicode encoding?
Sounds like the doc is not correct.
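This one is also easy to check by writing into a MemoryStream and looking at the bytes. The sketch below is only an illustration (the class and method names are made up); it compares a non-Unicode encoding (ASCII) with UTF-8.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamWriterBomDemo
{
    // Writes text through a StreamWriter and returns every byte that reached the stream.
    public static byte[] WriteWith(Encoding encoding, string text)
    {
        MemoryStream stream = new MemoryStream();
        using (StreamWriter writer = new StreamWriter(stream, encoding))
        {
            writer.Write(text);
        }
        // ToArray still works after the writer has closed the stream.
        return stream.ToArray();
    }

    public static void Main()
    {
        // ASCII has an empty preamble, so only the letter itself is written.
        Console.WriteLine(BitConverter.ToString(WriteWith(Encoding.ASCII, "A"))); // 41
        // Encoding.UTF8 has a preamble, so the BOM comes first.
        Console.WriteLine(BitConverter.ToString(WriteWith(Encoding.UTF8, "A")));  // EF-BB-BF-41
    }
}
```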
b) “Since the Unicode byte order mark character is not found in any....
Well, since at least some (ANSI) code pages do have glyphs for
characters at code points FF and FE, I assume above text implies that
apps ( using non-unicode code pages ) reading such a file would
understand that FE FF sequence represents BOM and thus should ignore
it?
Nope. The BOM is not FF FE code points.
The code point is FEFF. And when you convert to a code page, there is
no equivalent for it.
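To make the difference between the code point and its bytes concrete, here is a small sketch. It encodes the single character U+FEFF with several Unicode encodings, then tries a code page (ISO-8859-1 is just the example chosen here) with a strict fallback so that an unmappable character raises an exception. The names are hypothetical.

```csharp
using System;
using System.Text;

public static class BomBytesDemo
{
    // Encodes the single character U+FEFF and returns the resulting bytes as hex.
    public static string Hex(Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes("\uFEFF"));
    }

    // Returns true if U+FEFF can be represented in ISO-8859-1.
    public static bool FitsInLatin1()
    {
        Encoding strict = Encoding.GetEncoding(
            "iso-8859-1",
            new EncoderExceptionFallback(),
            new DecoderExceptionFallback());
        try
        {
            strict.GetBytes("\uFEFF");
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        Console.WriteLine(Hex(Encoding.Unicode));          // FF-FE
        Console.WriteLine(Hex(Encoding.BigEndianUnicode)); // FE-FF
        Console.WriteLine(Hex(Encoding.UTF8));             // EF-BB-BF
        Console.WriteLine(FitsInLatin1());                 // False
    }
}
```

The same code point gives different byte sequences depending on the encoding, and none at all in the code page.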
5)
a) “Internally, the .NET Framework stores text as Unicode UTF-16.”
I assume that the above quote is only referring to String objects and
char variables using UTF-16 encoding, or is there some other text
which is also stored as UTF-16?
All text is UTF-16. If you can think of text other than String and char,
(I cannot), it is also UTF-16.
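A small sketch of what UTF-16 storage means in practice: a char is one 16-bit code unit, so a character outside the Basic Multilingual Plane takes two of them in a string. U+1F600 is just an arbitrary example, and the class and method names are made up.

```csharp
using System;
using System.Text;

public static class Utf16Demo
{
    // Number of UTF-16 code units (char values) in the string.
    public static int CodeUnits(string text)
    {
        return text.Length;
    }

    public static void Main()
    {
        string beyondBmp = "\U0001F600"; // one code point outside the BMP

        Console.WriteLine(sizeof(char));                          // 2
        Console.WriteLine(CodeUnits("A"));                        // 1
        Console.WriteLine(CodeUnits(beyondBmp));                  // 2
        Console.WriteLine(char.IsSurrogatePair(beyondBmp, 0));    // True
        Console.WriteLine(Encoding.Unicode.GetByteCount("A"));    // 2
    }
}
```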
b) Ignoring the fact that FE FF sequence identifies the type of
encoding, does U+FEFF also represent a character ( outside the context
of encoding )?
You are confusing things. FEFF is a character (BOM) at the beginning
of the file. In the middle, it means ZERO WIDTH NO-BREAK SPACE.
It was used as a marker, then kind of transformed into a de facto
standard.
A bit like #! at the beginning of a Unix script meaning
"this file is a script and the thing after #! is the path to the
interpreter for this script"
It is a convention. It does not mean there are no scripts without
#! or that the existence of #! guarantees 100% that the file is a script.
Just that it is "very-very likely" a script.
a) But does only data in the packet’s header uses this byte order,
while application data is sent just as it is, without reversing its
byte order ( assuming this data is sent over the network by PC1 )?
The packet headers are always in the same byte order.
It is part of the specs.
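For application data, the byte order is whatever the application protocol defines. As a sketch only, assuming a protocol that chooses network (big-endian) order for a 32-bit integer, the conversion could look like this; the class and method names are hypothetical.

```csharp
using System;
using System.Net;

public static class NetworkOrderDemo
{
    // Returns the four bytes of a 32-bit integer in network (big-endian) order.
    public static byte[] ToNetworkOrder(int value)
    {
        // HostToNetworkOrder swaps the bytes on little-endian hosts
        // and leaves the value unchanged on big-endian hosts.
        int networkValue = IPAddress.HostToNetworkOrder(value);
        // GetBytes uses host order, so the result is big-endian either way.
        return BitConverter.GetBytes(networkValue);
    }

    public static void Main()
    {
        Console.WriteLine(BitConverter.ToString(ToNetworkOrder(0x01020304))); // 01-02-03-04
    }
}
```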
Advice: these are things on which many people have opinions,
but relatively few know what they are talking about (including me).
Whenever something is not clear, don't ask on newsgroups; go to the
official sources:
- http://unicode.org/faq/utf_bom.html
- http://www.unicode.org/glossary/
--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email