Re: How many bytes per Italian character?
- From: Joseph M. Newcomer <newcomer@xxxxxxxxxxxx>
- Date: Mon, 09 Apr 2007 21:41:10 -0400
See below...
On Tue, 10 Apr 2007 09:52:13 +0900, "Norman Diamond" <ndiamond@xxxxxxxxxxxxxxxx> wrote:
"Joseph M. Newcomer" <newcomer@xxxxxxxxxxxx> wrote in message
news:efjk13p1rhh2ospibfvfjt43e1i3u3q0cl@xxxxxxxxxx
On Mon, 9 Apr 2007 16:34:40 +0900, "Norman Diamond"
<ndiamond@xxxxxxxxxxxxxxxx> wrote:
In a general case where I don't know the expected length of the string,
yes I know how to ask Windows CE how big a buffer I'm going to need.
Notice that the above is not the question.
In this specific case I knew how long a string I was expecting, UTF-16 and
MSDN gave me some misleading inkling of how many bytes it should take, and
I was expecting to do a minimal amount of validation. Any nonzero return
value and any length other than 10 bytes could have been treated as an
error. Instead I have to do extra garbage in order to be compatible with
broken Windows.
Well, the question is misleading, since it says "per Italian character",
and characters in Italian are the same size as characters in English, or
French, or Urdu.
Maybe the question was overspecified. I was afraid that if I just asked
"How many bytes per character" then the tangents would go off into surrogate
pairs (which aren't supported in CE to the best of my knowledge but might be
someday and anyway wouldn't help this issue) instead of tangents that go off
into implementing pure virtual functions ^_^ Also a few months ago I found
the Japanese version of CE barfing all over Japanese characters when an
idiot programmer tries to use the %S format in StringCchPrintf.
That's because %S is carefully specified to be completely useless. In a Unicode app it
means "format as an 8-bit string" and in an ANSI app it means "format as a Unicode string".
So what you're saying is that "an incorrectly-written program fails", for which there is
little sympathy.
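
As a rough illustration of avoiding the %S ambiguity, the sketch below uses the standard C specifier %ls, which always means "wide string" regardless of how the program is built. The helper name format_wide and the "name=" prefix are made up for this example. Microsoft's CRT documentation describes %ls and %hs as explicit-width alternatives to %s and %S; whether StringCchPrintf on a particular CE build honors them is an assumption that would need checking. The sketch is portable C and only handles characters the current locale can convert.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: formats a wide string into a narrow buffer using the
   standard %ls specifier, whose meaning does not depend on build settings.
   Returns the number of chars written (excluding the NUL), or -1 if the
   conversion fails or the output would not fit in out_size bytes. */
int format_wide(char *out, size_t out_size, const wchar_t *text)
{
    assert(out != NULL && text != NULL);
    int n = snprintf(out, out_size, "name=%ls", text);
    if (n < 0 || (size_t)n >= out_size)
        return -1;              /* conversion error or truncation */
    return n;
}
```

The point of the design is that the width of the argument is stated in the format string itself, so the same call reads the same way in Unicode and ANSI builds.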
****
Also I
found Visual Studio 2005 barfing on an English character, the pound sign.
So I wanted to make it clear that this was a simpler question.
Define "barf". What is going wrong, and what is the manifestation of the error?
****
*****
As far as I know, the Registry calls will not cause buffer overruns; I've
not seen any problem in regular Windows,
IRRELEVANT. We've already seen enough evidence that regular Windows and
Windows CE are not bug-for-bug compatible. Now I'm going to have to go off
on one of those tangents again. StringCchPrintf has been observed to work
in regular Windows, even with the %S format code. Regular Windows has
sometimes figured out that five wchar_t's occupy ten bytes.
So have you detected that WinCE will cause a buffer overrun?
*****
*****
so do you have any evidence that CE is doing a buffer overrun?
No. Do you have any evidence that it doesn't? The only evidence we have at
the moment is that CE thinks five wchar_t's occupy twenty bytes. It seems
to me that extreme caution is warranted.
Hmm. It allocates MORE space than required? How is this bad? Perhaps it is allowing for
the fact that the characters might require surrogates? You are saying that a conservative
approach to buffer allocation is bad? So it suggests that it might need more space when
it goes to read the data (note that it hasn't necessarily read the data yet, and we don't
actually know what representation is used in the Registry). So it could be storing UTF-32
and think it needs 20 bytes, and when it downconverts to UTF-16, lo! it only needed 10!
But it can't tell this without actually processing the data. You seem to be suggesting
that because it estimates high, it must be a bug. There's nothing requiring that it
estimate precisely. Just estimate enough to not have an overflow.
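
One way to act on this, sketched under assumptions: treat the reported size as an upper bound and let the NUL terminator decide how many bytes the string really uses. The function name utf16_bytes_used is hypothetical, and uint16_t stands in for a 16-bit wchar_t so the sketch also compiles on platforms where wchar_t is wider. It does not call RegQueryValueEx; it only shows the bounded scan that could be applied to whatever buffer the call fills in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the number of bytes actually used by the
   NUL-terminated UTF-16 string in buf, terminator included, scanning no
   further than reported_bytes. Returns 0 if no terminator is found inside
   the reported bound, which the caller can treat as an error. */
size_t utf16_bytes_used(const uint16_t *buf, size_t reported_bytes)
{
    assert(buf != NULL);
    size_t units = reported_bytes / sizeof(uint16_t);
    for (size_t i = 0; i < units; ++i) {
        if (buf[i] == 0)
            return (i + 1) * sizeof(uint16_t);
    }
    return 0;                   /* unterminated within the reported size */
}
```

With a made-up 20-byte buffer holding four characters and a terminator, this returns 10, so a high estimate costs some unused space but does not change the validated length. The zero return also covers values that were stored without a terminator.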
*****
****
This seems to be a lot of concern over a fine point;
Fine. Look at the subject line. I posted a finely worded question to
begin with. Fine, neither you nor I nor anyone else knows the answer, but
why bother shifting the question domain to a question which is really
irrelevant to this thread?
It may be finely-worded, but it wasn't relevant to the question you wanted
to ask.
Huh???????????
As I pointed out, Italian characters are the same width as characters in
other languages,
Only you and I think that. We don't know yet if CE agrees, do we?
So you're suggesting that CE stores Italian characters in a different format than English
characters? This seems a bit odd, given they almost certainly have the same source code.
*****
*****
The real question, had it been correctly worded, would have been something
about Registry lengths being perhaps erroneous.
OK. One other person pointed out something along that line too. I'm
wondering how to figure out whether a length returned by RegQueryValueEx
should be believed or not. I don't see any way to check whether the result
is buggy or not. But I didn't think of that possibility before asking the
question. Sorry for my shortsightedness.
It might, as I pointed out, be a conservative estimate based on an issue of possible
maximum length. It is not an error to tell you a string might be longer than it actually
turns out to be, because it is NUL-terminated and therefore the length is what it turns
out to be.
****
****
Because the answer to your finely-worded question is "1 in ANSI apps, 2
in Unicode apps" and that's the end of the discussion.
Only you and I think that, and for the most part we only think half of that
in CE. There are no ANSI apps in CE, there are only a few APIs that are
ANSI-only (such as gethostname) and a few applications that default to
interpreting file contents as ANSI (such as Pocket Word when a text file
doesn't start with a BOM, unless that's changed in recent years). Anyway,
we think that way, but we don't know if CE agrees, do we? CE said five
characters occupy twenty bytes.
So the answer is simpler: "2 bytes in CE, independent of locale"
This is a DIFFERENT question than "How many bytes does RegQueryValueEx suggest should be
allocated to hold a string?" It makes a lot of sense to let it estimate high, since the
NUL terminator actually defines the length of a REG_SZ value (and other _SZ values), and
for performance reasons it might not compute the actual number of bytes required.
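
A small sketch of the "2 bytes, independent of locale" point, assuming UTF-16 text with no surrogate pairs: an Italian word with an accented letter takes the same number of bytes as an English word of the same length. The function name utf16_byte_len and the sample words are made up for illustration; uint_least16_t is used because C defines char16_t, the element type of u"..." literals, as that type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bytes occupied by a NUL-terminated UTF-16 string,
   terminator excluded. Each code unit counts as two bytes; no surrogate
   handling is attempted. */
size_t utf16_byte_len(const uint_least16_t *s)
{
    assert(s != NULL);
    size_t units = 0;
    while (s[units] != 0)
        ++units;
    return units * 2;           /* two bytes per UTF-16 code unit */
}
```

For example, utf16_byte_len(u"perch\u00e9") and utf16_byte_len(u"street") both count six code units, so the accented character costs nothing extra.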
joe
****
Joseph M. Newcomer [MVP]
email: newcomer@xxxxxxxxxxxx
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm