Re: MFC Interview Tests




Note that in the days of this example, we were still using only 8-bit character strings.

Surrogate pairs pose a huge number of problems, and by the time we start seriously moving
to Win64, we should be thinking about wchar_t being 32 bits (it already is in some
compilers). The problem with surrogate pairs is they give you all the jobs of MBCS all
over again!
joe

On Mon, 23 Mar 2009 17:44:42 +0100, "Giovanni Dicanio"
<giovanniDOTdicanio@xxxxxxxxxxxxxxxxx> wrote:


"Joseph M. Newcomer" <newcomer@xxxxxxxxxxxx> wrote in message
news:946cs49312ru1k4e3cv03kb8gibv14b3cl@xxxxxxxxxx
The string-reverse question is trivial
strrev()
works fine. End of discussion.

I think the Unicode version of strrev has problems...

It seems to me that it fails to properly handle Unicode strings that contain
surrogate pairs.

Please see the simple demo project here, and screenshots in the zip file:

http://www.geocities.com/giovanni.dicanio/vc/TestStringRev.zip

For example, this Unicode character, encoded in UTF-16 as the surrogate pair
0xD840 0xDC01 (code point U+20001), is not properly handled by the reverse
process.
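
To make the arithmetic behind that pair concrete, here is a minimal sketch of
how a high/low surrogate pair maps to a single code point. The function name
combine_surrogates is hypothetical (it is not part of the CRT or MFC), and the
sketch assumes the caller has already checked that the two code units really
are a high surrogate followed by a low surrogate.

```cpp
// Hypothetical helper, not a library function: combine a UTF-16 surrogate
// pair into the code point it encodes. Assumes `high` is in 0xD800..0xDBFF
// and `low` is in 0xDC00..0xDFFF; no validation is done here.
constexpr char32_t combine_surrogates(char16_t high, char16_t low) {
    return 0x10000
         + ((static_cast<char32_t>(high) - 0xD800) << 10)
         + (static_cast<char32_t>(low) - 0xDC00);
}

// The pair from the example above: 0xD840 0xDC01 is code point U+20001.
static_assert(combine_surrogates(0xD840, 0xDC01) == 0x20001,
              "0xD840 0xDC01 should decode to U+20001");
```

The two 16-bit code units only mean something together, which is why moving
them independently during a reversal destroys the character.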

If I had to implement strrev(), I would consider a couple of pointers: a
'left' pointer pointing to the beginning of the string, and a 'right'
pointer pointing to the end of the string.
At each loop iteration the characters pointed to by the left and right
pointers are swapped.
The left pointer is moved to the right (++), and the right pointer is moved
to the left (--).
The loop continues while left pointer <= right pointer.
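
The two-pointer loop described above can be sketched as follows. This is only
an illustration of the idea, not the CRT's actual strrev() source; the name
reverse_units is hypothetical. It works on individual code units, so it is
deliberately naive about surrogate pairs.

```cpp
#include <cstddef>
#include <utility>

// Hypothetical helper: reverse `len` code units in place with a left and a
// right pointer. Ch may be char, wchar_t, char16_t, and so on. Each code
// unit is treated as one character, which is the assumption under test.
template <typename Ch>
void reverse_units(Ch* s, std::size_t len) {
    if (len == 0) return;          // nothing to do, and avoids s + len - 1
    Ch* left = s;                  // first code unit
    Ch* right = s + len - 1;       // last code unit
    while (left < right) {         // stop when the pointers meet or cross
        std::swap(*left, *right);
        ++left;
        --right;
    }
}
```

Called on the four code units 'A', 0xD840, 0xDC01, 'B', this leaves the buffer
as 'B', 0xDC01, 0xD840, 'A': the low surrogate now precedes the high one, so
the sequence is no longer valid UTF-16.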

I think that this algorithm works in case of ASCII strings, when each
character is stored in one 'char'.
I think that this algorithm also works in case of Unicode UTF-16 strings
without surrogate pairs, when each character is stored in one 'WCHAR'.
But this algorithm fails in case of surrogate pairs... maybe UTF-32 should
be used to make this algorithm work fine in Unicode?
(In fact, I think that in UTF-32 there is no concept of surrogate pair, and
all characters are stored in 32-bit DWORDs...).
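
One possible way to keep UTF-16 and still get a correct result is a second
pass that repairs the pairs after the naive reversal. The sketch below is an
assumption about how one might do it, not what any existing strrev() variant
does; reverse_utf16 and the two is_*_surrogate helpers are hypothetical names.
It assumes well-formed UTF-16 input, and it only addresses surrogate pairs:
combining character sequences would still be reordered.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>

inline bool is_high_surrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
inline bool is_low_surrogate(char16_t c)  { return c >= 0xDC00 && c <= 0xDFFF; }

// Hypothetical helper: reverse a UTF-16 string by code point rather than by
// code unit. Assumes the input contains no unpaired surrogates.
void reverse_utf16(std::u16string& s) {
    // Pass 1: plain code-unit reversal, same effect as the two-pointer loop.
    std::reverse(s.begin(), s.end());

    // Pass 2: every original (high, low) pair now reads (low, high);
    // swap each such pair back into its valid order.
    for (std::size_t i = 0; i + 1 < s.size(); ++i) {
        if (is_low_surrogate(s[i]) && is_high_surrogate(s[i + 1])) {
            std::swap(s[i], s[i + 1]);
            ++i;  // skip the second half of the pair just repaired
        }
    }
}
```

With the input 'A', 0xD840, 0xDC01, 'B' this yields 'B', 0xD840, 0xDC01, 'A',
so the character U+20001 survives the reversal intact.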

Giovanni
Joseph M. Newcomer [MVP]
email: newcomer@xxxxxxxxxxxx
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm