Re: C# and encodings



hi


On Feb 4, 7:07 am, Göran Andersson <gu...@xxxxxxxxx> wrote:
beginwi...@xxxxxxxxx wrote:
b) Can code page support Unicode coded character set, but may use
different encoding than Unicode does

It doesn't really work that way. Strings are Unicode, and they can be
encoded into a binary stream using an encoding that either supports the
full Unicode character set or an encoding that supports the subset that
a codepage represents.
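A minimal sketch of the point above, written in Python rather than C# only to keep it short. The sample string and the helper name `can_encode` are made up for illustration. It encodes the same Unicode string with two full-Unicode encodings and with one code page (Windows-1252), then tries a character that the code page does not contain.

```python
# The string itself is Unicode; the encoding only decides how it becomes bytes.
text = "na\u00efve caf\u00e9 \u20ac5"  # "naive cafe" with accents, then a euro sign and "5"

utf8_bytes = text.encode("utf-8")        # full Unicode, 1 to 4 bytes per character
utf16_bytes = text.encode("utf-16-le")   # full Unicode, 2 or 4 bytes per character
cp1252_bytes = text.encode("cp1252")     # Windows-1252 code page, 1 byte per character

print(len(text), len(utf8_bytes), len(utf16_bytes), len(cp1252_bytes))

# All three decode back to the same string, because every character
# in this sample happens to exist in Windows-1252.
assert utf8_bytes.decode("utf-8") == text
assert utf16_bytes.decode("utf-16-le") == text
assert cp1252_bytes.decode("cp1252") == text


def can_encode(s, encoding):
    """Hypothetical helper: True if every character of s exists in the encoding."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# GREEK CAPITAL LETTER OMEGA (U+03A9) is outside the Windows-1252 subset.
print(can_encode("\u03a9", "utf-8"))   # a full-Unicode encoding can represent it
print(can_encode("\u03a9", "cp1252"))  # the code page cannot
```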


* So if a code page supports only a subset of the Unicode character set,
what do we call it then? A Unicode-compliant code page, or something else?

* If I create a code page with 30000 code points that map to the same
characters as those in the Unicode coded character set, and I encode
this code page using UTF-16, it still won’t be considered Unicode?

* But that doesn’t explain why Notepad claims to support Unicode and
yet only has 383 code points defined.

And if you say we use the UTF-16/UTF-8 encodings only for the Unicode coded
character set (that is, for sets that support the full Unicode character
set), then why does Notepad also use UTF-16 encoding? After all, its
code page supports only a subset of the Unicode character set.





* Are there also 8-bit code pages which use Unicode character
encoding, and thus have only 255 code points matched to characters?

No, if they only have 256 code points they don't use Unicode encoding,
they map each character to a byte value.
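To illustrate "map each character to a byte value", here is a small Python sketch (not C#). The chosen character and code pages are just examples. It shows one character encoded by a single-byte code page and by two UTF encodings, and then the same byte value decoded under two different code pages.

```python
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, code point U+00E9

# A single-byte code page stores the character as exactly one byte.
as_cp1252 = ch.encode("cp1252")
# The UTF encodings store the same code point differently.
as_utf8 = ch.encode("utf-8")
as_utf16 = ch.encode("utf-16-le")

print(as_cp1252.hex(), as_utf8.hex(), as_utf16.hex())

# The byte value alone does not identify a character: the code page does.
raw = bytes([0xE9])
in_cp1252 = raw.decode("cp1252")  # Western European code page
in_cp1251 = raw.decode("cp1251")  # Cyrillic code page

print(hex(ord(in_cp1252)), hex(ord(in_cp1251)))
```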


But doesn’t that depend on the implementation? If I wanted to create
a code page that is a subset consisting of the first 600 Unicode code
points, then why couldn’t I use UTF-16 encoding (for whatever reason,
perhaps because I expect the files my app reads to contain only the
first 600 Unicode characters)?








On Feb 4, 7:17 am, Arne Vajhøj <a...@xxxxxxxxxx> wrote:
beginwi...@xxxxxxxxx wrote:
On Feb 4, 4:41 am, "Peter Duniho" <NpOeStPe...@xxxxxxxxxxxxxxxx>
wrote:
On Tue, 03 Feb 2009 15:08:59 -0800, <beginwi...@xxxxxxxxx> wrote:
c)
* Are there also 8-bit code pages which use Unicode character
encoding, and thus have only 255 code points matched to characters?
That makes no sense. Unicode can't be represented in only 8-bits, so
there's no such thing as an 8-bit code page that uses Unicode character
encoding.

But Notepad supports Unicode and yet it only recognizes 255 characters,
thus it only has 255 code points. Couldn’t we then say that Notepad
uses an 8-bit code page?

Notepad supports UTF-8.


I checked it, and as far as I can tell it only recognizes the first 383 or
so characters. That would suggest it only has around 380 code points
defined, and yet it still uses the UTF-16 or UTF-8 encodings.



a) From MSDN:
"The GetEncoding method relies on the underlying platform to support
most code pages; however, the .NET Framework natively supports some
encodings."

So if .NET doesn’t support a particular encoding, it checks whether the
underlying OS supports that encoding, and if it does, it “borrows”
the OS’s code page and instructions on how to encode it?

No.


Then what is meant by “relies on the underlying platform to support
most code pages”?


On Feb 4, 7:23 am, Arne Vajhøj <a...@xxxxxxxxxx> wrote:
beginwi...@xxxxxxxxx wrote:

On Feb 4, 4:36 am, Arne Vajhøj <a...@xxxxxxxxxx> wrote:

beginwi...@xxxxxxxxx wrote:
* My question was a bit off ... what I meant to ask was whether there are
8-bit code pages that only have 255 code points defined, where all of
those 255 code points are assigned to the same characters as the first 255
code points in the Unicode coded character set?

No.

* In any case, if the answer is yes, are then any of such code pages
encoded with either UTF-16 or UTF-8 encoding?

Encoding X is never encoded with Encoding Y.

Isn’t the term code page used for any coded character set? You seem to
imply that the term code page automatically suggests some non-Unicode
encoding (BTW, I realize that the term Unicode means a coded character
set and not an encoding).











On Feb 4, 12:14 pm, "Mihai N." <nmihai_year_2...@xxxxxxxxx> wrote:
1)

b) Can code page support Unicode coded character set, but may use
different encoding than Unicode does ( Unicode set uses three
encodings - UTF-8, UTF-16 and UTF-32 )?

No. Unicode is one code page (with several encodings).
If a code page "supports Unicode" then it is Unicode.

By “code page supports Unicode” you mean that it has as many code
points (and thus characters that map to those code points) defined
as the full Unicode character set?


But (most) other code pages are a subset of Unicode
(not necessarily a contiguous subset).

* By subset you mean that the code points these code pages do have
defined map to the same characters as the equivalent Unicode code points?

* And yet even though these code pages are subsets of the Unicode
character set, we still don’t call them a Unicode coded character set?
Then what do we call them? Unicode-compliant code pages?




c)

* Are there also 8-bit code pages which use Unicode character
encoding, and thus have only 255 code points matched to characters?

No. A code page does not use Unicode encodings.

* Can these code pages also use UTF-16 or UTF-32 encoding?

All UTF encodings are Unicode only
(as the name says: UTF = *Unicode* Transformation Format)


So code pages that are a subset of Unicode don’t use Unicode
encodings? But I thought that was a design choice. If someone
created a code page CP1 that is a subset of the Unicode character set,
they could also decide to use UTF-16, so that apps that understood CP1
would also be able to (partially) understand Unicode files. What
reasoning would prevent a programmer from using UTF to encode CP1?

..

2)
As far as I understand, the above text suggests that preamble should
be added by default, but I’d say that’s not true?!

Based on the doc, it should add the preamble.
Did you test and it is not true?

I checked it with a hex editor and there was no BOM written.
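The same check can be done in code instead of a hex editor. This is a rough Python sketch (not C#); the function name `detect_bom` and the table `BOMS` are hypothetical, and the BOM byte sequences come from Python's standard `codecs` module. It only inspects the leading bytes and says nothing about which API wrote them.

```python
import codecs

# UTF-32 must be tested before UTF-16: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


without_bom = "hi".encode("utf-8")    # plain UTF-8, no preamble
with_bom = "hi".encode("utf-8-sig")   # UTF-8 with the EF BB BF preamble

print(detect_bom(without_bom))
print(detect_bom(with_bom))
```

To test a real file, read its first four bytes in binary mode and pass them to `detect_bom`.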





b) Ignoring the fact that FE FF sequence identifies the type of
encoding, does U+FEFF also represent a character ( outside the context
of encoding )?

You are confusing things. FEFF is a character (BOM) at the beginning
of the file. In the middle, it means ZERO WIDTH NO-BREAK SPACE.
It was used as a marker, then kind of transformed into a de facto
standard.
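A short Python sketch (not C#) of the two roles of U+FEFF described above. It relies only on the standard `unicodedata` and `codecs` modules; the sample data is made up. The code point has a character name, its encoded forms are the familiar BOM byte sequences, and a BOM-aware decoder strips it only at the start of the data.

```python
import codecs
import unicodedata

bom_char = "\ufeff"

# U+FEFF is an assigned character with an official name.
print(unicodedata.name(bom_char))

# Encoding that one character yields the byte order marks seen in files.
assert bom_char.encode("utf-8") == codecs.BOM_UTF8          # EF BB BF
assert bom_char.encode("utf-16-be") == codecs.BOM_UTF16_BE  # FE FF
assert bom_char.encode("utf-16-le") == codecs.BOM_UTF16_LE  # FF FE

# Sample data: a leading BOM, then "a", another U+FEFF, then "b".
data = codecs.BOM_UTF8 + "a\ufeffb".encode("utf-8")

# "utf-8-sig" treats only the leading U+FEFF as a signature and drops it;
# the one in the middle stays in the decoded text.
decoded = data.decode("utf-8-sig")
print(len(decoded), [hex(ord(c)) for c in decoded])
```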

I guess I’m a bit confused about what the definition of a character is.
‘\n’ and ‘\r’ are considered characters, but to me they are no more
character-like than ZERO WIDTH NO-BREAK SPACE, and yet the latter
is not considered a character.


thank you all


