Re: C# and encodings



hi


On Feb 4, 4:41 am, "Peter Duniho" <NpOeStPe...@xxxxxxxxxxxxxxxx>
wrote:
On Tue, 03 Feb 2009 15:08:59 -0800, <beginwi...@xxxxxxxxx> wrote:
1)

a) With "Encoding.Default" you retrieve system’s default code page.
But if windows has numerous code pages, then what exactly would
default page be, meaning where ( or in what apps ) does windows use
this default page over other code pages?

Windows only has one current code page at a time.


Where exactly is that code page used, since as far as I know, apps
running on top of Windows can use whatever code page they choose?




c)

* Are there also 8-bit code pages which use Unicode character
encoding, and thus have only 255 code points matched to characters?

That makes no sense. Unicode can't be represented in only 8-bits, so
there's no such thing as an 8-bit code page that uses Unicode character
encoding.


But Notepad supports Unicode and yet it only recognizes 255 characters,
thus it only has 255 code points – couldn’t we then say that Notepad
uses 8-bit code page?




* Can these code pages also use UTF-16 or UTF-32 encoding?

UTF-16 and UTF-32 are also Unicode. See above.


I realize that




2)
From MSDN site:
“StreamWriter defaults to using an instance of UTF8Encoding unless
specified otherwise. This instance of UTF8Encoding is constructed such
that the Encoding.GetPreamble method returns the Unicode byte order
mark written in UTF-8. The preamble of the encoding is added to a
stream when you are not appending to an existing stream. This means
any text file you create with StreamWriter will have three byte order
marks at its beginning."

As far as I understand, the above text suggests that preamble should
be added by default, but I’d say that’s not true?!

You are welcome to say that.


Well, I created a file and wrote some text ( via StreamWriter ) and
then checked via Hex editor if there was also BOM present and it
wasn’t, so uhm ...
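
The hex-editor check described above can be reproduced in a few lines. This is only a sketch in Python, not .NET: it assumes Python's "utf-8" codec (no signature) and "utf-8-sig" codec (signature written) as stand-ins for a UTF-8 encoder constructed without and with a preamble. The helper names are made up for illustration. The UTF-8 form of the byte order mark is the single three-byte sequence EF BB BF.

```python
import codecs
import os
import tempfile

def starts_with_utf8_bom(path):
    """Return True if the file begins with the UTF-8 encoded BOM (EF BB BF)."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8

def write_and_check(encoding):
    """Write a short text file with the given codec and report whether a BOM was emitted."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, "w", encoding=encoding) as f:
            f.write("hello")
        return starts_with_utf8_bom(path)
    finally:
        os.remove(path)

plain = write_and_check("utf-8")         # encoder that does not emit a signature
with_sig = write_and_check("utf-8-sig")  # encoder that emits the signature first
```

Whether a BOM appears depends on how the encoder was configured, which would be consistent with finding no BOM in a file written through a writer whose default encoder does not emit one.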


4)

a)
From MSDN site:
“StreamWriter Constructor (Stream, Encoding)

If you specify something other than Encoding.Default, the byte order
mark (BOM) is written to the file.”

But BOM should only be added when using one of Unicode encodings, thus
why would BOM be added if you specify non-Unicode encoding?

Three possibilities:

-- the documentation is wrong
-- the implementation is wrong
-- your assumption about when the BOM should be added is wrong

I don't have enough first-hand knowledge to choose among those three at
the moment.


Prob the last option :(


b) “Since the Unicode byte order mark character is not found in any
code page, it disappears if data is converted to ANSI. Unlike other
Unicode characters, it is not replaced by a default character when it
is converted. If a byte order mark is found in the middle of a file,
it is not interpreted as a Unicode character and has no effect on text
output.”

Well, since at least some (ANSI) code pages do have glyphs for
characters at code points FF and FE, I assume above text implies that
apps ( using non-unicode code pages ) reading such a file would
understand that FE FF sequence represents BOM and thus should ignore
it?

I believe the statement refers to converting from Unicode to an ANSI code
page, not the other way. The point is not whether 0xff or 0xfe can be
found on their own. The point is that the Unicode "character" 0xfeff is
not representable in any ANSI code page, and is treated specially by
stripping it from input rather than replacing it with the "default
character".

In other words, it is up to app ( using non-Unicode code page )
reading such a file to realize that FE FF sequence should be ignored?!

If you use Encoding to convert to an ANSI code page, it will be ignored
automatically.


While other characters not representable in the ANSI code page will be
replaced with the default character by the Encoding class?
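
A rough illustration of the difference between "replace with a default character" and "strip the BOM", sketched in Python rather than with the .NET Encoding class. Windows-1252 is assumed as the ANSI code page, and unlike the behaviour described above, Python does not drop U+FEFF on its own, so the sketch removes it explicitly before converting.

```python
text = "\ufeffcaf\u00e9 \u20ac \u4e2d"   # BOM, "café", euro sign, one CJK character

# U+FEFF has no slot in Windows-1252, so a strict conversion fails outright.
try:
    "\ufeff".encode("cp1252")
    bom_is_representable = True
except UnicodeEncodeError:
    bom_is_representable = False

# Strip the BOM by hand, then let unrepresentable characters fall back to "?".
without_bom = text.lstrip("\ufeff")
ansi_bytes = without_bom.encode("cp1252", errors="replace")
```

Here the accented letter and the euro sign map to single bytes in Windows-1252, while the CJK character becomes the replacement character "?".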


5)
b) Ignoring the fact that FE FF sequence identifies the type of
encoding, does U+FEFF also represent a character ( outside the context
of encoding )?

It's the Unicode BOM character. It's an actual Unicode character.

But has no other, non computer related meaning?
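
For what it is worth, the Unicode character database can be queried for this code point. A small Python sketch, using only the standard unicodedata module:

```python
import unicodedata

bom = "\ufeff"
name = unicodedata.name(bom)          # official Unicode character name
category = unicodedata.category(bom)  # "Cf" is the "format" category (invisible control-like characters)
```

The name shows the character's other role: besides serving as the byte order mark, U+FEFF is the zero width no-break space.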


3)
I noticed there are only four classes derived from the Encoding class
( ASCIIEncoding, UTF8Encoding, UnicodeEncoding and UTF7Encoding ).
What if you want to use some other, non-Unicode encoding?

You use them instead. See Encoding.GetEncoding().

a) From MSDN:
"The GetEncoding method relies on the underlying platform to support
most code pages; however, the .NET Framework natively supports some
encodings."

So if .NET doesn’t support a particular encoding, it checks whether the
underlying OS supports that encoding, and if it does, it “borrows” the
OS’s code page and instructions on how to encode it?

b) From MSDN:
"For a list of code pages, see the Encoding class topic. Or use the
GetEncodings method to get a list of all encodings."

I assume GetEncodings also lists code pages which .Net may not support
out of the box?







On Feb 4, 4:36 am, Arne Vajhøj <a...@xxxxxxxxxx> wrote:
beginwi...@xxxxxxxxx wrote:
1)
c)

* Are there also 8-bit code pages which use Unicode character
encoding, and thus have only 255 code points matched to characters?

That is what all the regular code pages do.


* My question was a bit off ... what I meant to ask was whether there are
8-bit code pages that define only 255 code points, where all of those
255 code points are assigned to the same characters as the first 255
code points in the Unicode coded character set?

* In any case, if the answer is yes, are then any of such code pages
encoded with either UTF-16 or UTF-8 encoding?
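
One way to test both questions is to compare a few codecs directly. The Python sketch below assumes ISO-8859-1 (Latin-1) as the candidate 8-bit code page and Windows-1252 as a contrasting one. Note that a single byte gives 256 values (0 to 255), not 255.

```python
# Latin-1: every byte value 0..255 decodes to the Unicode code point with the same number.
latin1_matches_unicode = all(
    bytes([i]).decode("latin-1") == chr(i) for i in range(256)
)

# Windows-1252 differs in the 0x80..0x9F range; byte 0x80 is the euro sign there.
cp1252_0x80 = bytes([0x80]).decode("cp1252")

# The same character under an 8-bit code page versus two Unicode encoding forms.
e_acute = "\u00e9"
as_latin1 = e_acute.encode("latin-1")      # one byte
as_utf8 = e_acute.encode("utf-8")          # two bytes
as_utf16le = e_acute.encode("utf-16-le")   # one 16-bit code unit
```

So an 8-bit code page can share its character assignments with the start of Unicode, but its byte encoding is its own: the bytes it produces are not UTF-8 or UTF-16, even when the character numbers coincide.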





5)
a) “Internally, the .NET Framework stores text as Unicode UTF-16.”

I assume that the above quote is only referring to String objects and
char variables using UTF-16 encoding,

UTF-8 and ANSI are external formats.


I’m not sure I understand what you meant by that

Internally all string and char values use 16-bit Unicode.
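
A consequence of 16-bit code units is that characters outside the Basic Multilingual Plane need two units (a surrogate pair). Python strings are not stored this way, so the sketch below only counts the units a UTF-16 encoder produces. The helper name is made up for illustration.

```python
def utf16_code_units(text):
    """Count the 16-bit code units needed to represent text in UTF-16."""
    return len(text.encode("utf-16-le")) // 2

units_for_a = utf16_code_units("A")               # U+0041, inside the BMP
units_for_clef = utf16_code_units("\U0001D11E")   # U+1D11E, outside the BMP
```

In a UTF-16 based string type, the second character would therefore occupy two char-sized slots.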

b) Ignoring the fact that FE FF sequence identifies the type of
encoding, does U+FEFF also represent a character ( outside the context
of encoding )?

I believe the 16 bit exist as a unicode code point but not as
UTF-8.


* But since UTF-8 can use up to 6 octets, wouldn’t that suggest that
it defines code point FEFF? Or is that code point skipped? For
what reason?
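
A quick check suggests the code point is not skipped: U+FEFF can be encoded in UTF-8 like any other code point, and it comes out as three bytes. A Python sketch:

```python
bom = "\ufeff"

utf8 = bom.encode("utf-8")           # three-byte UTF-8 form
utf16_be = bom.encode("utf-16-be")   # big-endian 16-bit form
utf16_le = bom.encode("utf-16-le")   # little-endian 16-bit form
round_trip = utf8.decode("utf-8")    # the plain "utf-8" codec keeps the character

# The highest Unicode code point, U+10FFFF, needs four bytes in UTF-8.
longest = "\U0010FFFF".encode("utf-8")
```

The byte pair FE FF (or FF FE) is what the character looks like in UTF-16. In UTF-8 the same code point is EF BB BF. As currently defined in RFC 3629, UTF-8 sequences are at most four octets long. The five- and six-octet forms belonged to an older definition.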





thank you guys