Re: C# and encodings
- From: Göran Andersson <guffa@xxxxxxxxx>
- Date: Wed, 04 Feb 2009 03:07:16 +0100
beginwithl@xxxxxxxxx wrote:
b) Can code page support Unicode coded character set, but may use
different encoding than Unicode does
It doesn't really work that way. Strings are Unicode, and they can be encoded into a byte stream using either an encoding that supports the full Unicode character set or an encoding that supports only the subset that a code page represents.
( Unicode set uses three
encodings - UTF-8, UTF-16 and UTF-32 )?
Four, there is UTF-7 also.
* Are there also 8-bit code pages which use Unicode character
encoding, and thus have only 255 code points matched to characters?
No. If they only have 256 code points they don't use a Unicode encoding; they map each character to a single byte value.
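
As a rough illustration of the difference, here is a small sketch. It is written in Python rather than C# only because the byte values are easy to show there; the sample characters and the helper name fits_code_page are my own choices, not anything from the thread. It encodes the same text with a single-byte code page (Windows-1252) and with two Unicode encodings, and shows that the code page only covers a subset of Unicode:

```python
# Encode the same text with a single-byte code page and with Unicode encodings.
text = "\u00e9\u20ac"  # e-acute (U+00E9) and euro sign (U+20AC), both in Windows-1252

cp1252_bytes = text.encode("cp1252")    # one byte per character
utf8_bytes = text.encode("utf-8")       # 2 bytes + 3 bytes
utf16_bytes = text.encode("utf-16-le")  # one 16-bit unit per character here


def fits_code_page(s, code_page="cp1252"):
    """Return True if every character of s exists in the given code page."""
    try:
        s.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False


print(cp1252_bytes.hex(), utf8_bytes.hex(), utf16_bytes.hex())
# Two kanji (U+65E5, U+672C) are outside the Windows-1252 subset.
print(fits_code_page(text), fits_code_page("\u65e5\u672c"))
```

Note that the euro sign is the byte 0x80 in Windows-1252 although its Unicode code point is U+20AC, so the byte value in a code page is not the Unicode code point.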
* Can these code pages also use UTF-16 or UTF-32 encoding?
No, they already have an encoding; they can't use another encoding as well.
* Are there also code pages that support more than 255, but less than
2^16 code points?
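There are double-byte character sets, like the ones used for Chinese, Japanese and Korean, that cover more than 256 characters. A character is encoded either as a single byte or as a lead byte followed by a trail byte, so the data is still handled as a sequence of separate bytes, not as a single 16-bit entity like in UTF-16.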
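
A minimal sketch of that behaviour, again in Python and using Shift_JIS purely as one example of a double-byte code page (the choice of code page and characters is mine):

```python
# Shift_JIS: ASCII characters take one byte, kana/kanji take a lead + trail byte.
ascii_char = "A".encode("shift_jis")      # single byte
kana_char = "\u3042".encode("shift_jis")  # hiragana A (U+3042): two bytes
mixed = "A\u3042".encode("shift_jis")     # variable-width byte sequence

print(len(ascii_char), len(kana_char), mixed.hex())
```

The encoded length depends on which characters the text contains, which is the main practical difference from a single-byte code page.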
2)
From MSDN site:
“StreamWriter defaults to using an instance of UTF8Encoding unless
specified otherwise. This instance of UTF8Encoding is constructed such
that the Encoding.GetPreamble method returns the Unicode byte order
mark written in UTF-8. The preamble of the encoding is added to a
stream when you are not appending to an existing stream. This means
any text file you create with StreamWriter will have three byte order
marks at its beginning."
As far as I understand, the above text suggests that preamble should
be added by default, but I’d say that’s not true?!
Why would you say that? I think that it actually is true. Most applications that read text files support Unicode, so you will never notice the byte order mark.
3) I noticed there are only four classes derived from Encoding class
( ASCIIEncoding, UTF8Encoding, UnicodeEncoding and UTF7Encoding ).What
if you want to use some other, non-unicode encoding?
There are other classes that handle multiple encodings, like SBCSCodePageEncoding, DBCSCodePageEncoding, ISO2022Encoding, EUCJPEncoding, GB18030Encoding and ISCIIEncoding. They are marked as internal and can only be created through the factory methods in the Encoding class (for example Encoding.GetEncoding), so you don't read about them in the documentation.
4)
a)
From MSDN site:
“StreamWriter Constructor (Stream, Encoding)
If you specify something other than Encoding.Default, the byte order
mark (BOM) is written to the file.”
But BOM should only be added when using one of Unicode encodings, thus
why would BOM be added if you specify non-Unicode encoding?
Here the documentation is not correct. I tried some, and the BOM is written for UTF-8, UTF-16 and UTF-32, but not for UTF-7, ASCII, ISO-8859-1 or Windows-1252.
5)
a) “Internally, the .NET Framework stores text as Unicode UTF-16.”
I assume that the above quote is only referring to String objects and
char variables using UTF-16 encoding, or is there some other text
which is also stored as UTF-16?
Well, the Char structure is the only type that handles characters; all other types that handle text (String, StringBuilder, etc.) use the Char structure.
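
One consequence worth spelling out: a Char holds a single 16-bit UTF-16 code unit, so a character outside the Basic Multilingual Plane takes two of them (a surrogate pair). The sketch below counts UTF-16 code units in Python; the helper name utf16_code_units and the sample characters are only for illustration:

```python
def utf16_code_units(s):
    """Number of 16-bit UTF-16 code units needed to store s."""
    return len(s.encode("utf-16-le")) // 2


bmp = "\u00e9"         # inside the Basic Multilingual Plane: one code unit
astral = "\U0001F600"  # outside it: needs a surrogate pair (two code units)

print(utf16_code_units(bmp), utf16_code_units(astral))
print(astral.encode("utf-16-le").hex())  # D83D DE00, stored little-endian
```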
b) Ignoring the fact that FE FF sequence identifies the type of
encoding, does U+FEFF also represent a character ( outside the context
of encoding )?
Yes, in the Unicode character set it's the code for a zero width no-break space character. This means that if you, for example, concatenate two Unicode encoded text files so that you end up with a BOM in the middle of the text, it will still be invisible.
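
To make this concrete, the following Python sketch (an illustration under my own choice of encodings and sample data, not .NET code) shows that the byte order marks are simply U+FEFF encoded in each Unicode encoding, that non-Unicode encodings cannot represent that character at all, and what happens when two files with a BOM are concatenated:

```python
import codecs
import unicodedata

bom_char = "\ufeff"
print(unicodedata.name(bom_char))

# The BOM byte sequences are just U+FEFF in the respective encoding.
for name in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le"):
    print(name, bom_char.encode(name).hex())


def can_encode(ch, encoding):
    """Return True if the encoding has a byte sequence for ch."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Single-byte, non-Unicode encodings have no way to express U+FEFF.
print([can_encode(bom_char, e) for e in ("ascii", "latin-1", "cp1252")])

# Two UTF-8 "files" with a BOM, concatenated: only the first BOM is stripped
# by the utf-8-sig decoder; the second one stays in the text as U+FEFF.
joined = (codecs.BOM_UTF8 + b"ab") * 2
decoded = joined.decode("utf-8-sig")
print(repr(decoded))
```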
6)
Say app1 ( running on PC1 ) and app2 ( running on PC2 ) communicate
via network using TCP/IP protocol. PC1 uses little endian-order, while
PC2 uses big-endian order. Now, I know we send information over TCP/IP
( and networks in general ) using big-endian order, but:
a) But does only data in the packet’s header uses this byte order,
while application data is sent just as it is, without reversing its
byte order ( assuming this data is sent over the network by PC1 )?
Yes. The network layer can not change the endianness of the data, as it doesn't know what kind of data it represents. It treats everything as bytes.
b) If so, then if PC1 sends some .exe file to PC2, then how will PC2
know whether it came from little endian-machine and thus should
reverse bytes before trying to load this .exe file?
File formats that contain little-endian or big-endian data are well defined, and they don't change endianness just because the system natively uses a different endianness.
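
In other words, the reader follows the byte order that the format specifies, not the byte order of the machine. A small Python sketch of the idea, where the 32-bit field and the helper name read_u32_le are hypothetical and not taken from any real file format:

```python
import struct
import sys

value = 0x12345678
little = struct.pack("<I", value)  # how a little-endian format stores the field
big = struct.pack(">I", value)     # how a big-endian format stores the field


def read_u32_le(data):
    """Read a 32-bit field that the (hypothetical) format defines as little-endian."""
    return int.from_bytes(data[:4], "little")


# The result is the same on any host, whatever sys.byteorder reports.
print(sys.byteorder, little.hex(), big.hex())
print(hex(read_u32_le(little)))
```

Reading the bytes with the wrong assumed order does not fail; it just yields a different number, which is why the format has to fix the order up front.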
--
Göran Andersson
_____
http://www.guffa.com