Re: Submitting Arabic Language characters to ISAPI Extension dll


From: David Wang [Msft] (someone_at_online.microsoft.com)
Date: 06/24/04


Date: Thu, 24 Jun 2004 00:05:20 -0700

I suggest that you do the following:
1. Inside the ISAPI, you get the querystring:
    a. %-decode the querystring into actual characters (i.e. %-decode turns
the six byte string %D8%B4 into a 2 byte string 0xD8 0xB4)
    b. call MultiByteToWideChar() with UTF8 as the code page (65001) to turn
this 2 byte UTF8 string into Unicode character(s)
    c. store Unicode in your database
2. When your ISAPI needs to send a response:
    a. Data it retrieves from the database is Unicode
    b. call WideCharToMultiByte() with a code page that matches what the
client accepts (its Accept-Charset, or failing that the code page implied by
its Accept-Language). Obviously, if the client's language does not match the
language of the data, you get gibberish -- but if the language matches, you
get DBCS. Or if the client accepts UTF8, you can transform to that character
set as well.
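
Steps 1a and 1b can be sketched roughly as below. This is a minimal,
portable illustration and not ISAPI code: the function names percent_decode
and utf8_to_code_points are made up for this example, and the hand-written
UTF-8 decoder only stands in for what MultiByteToWideChar() with CP_UTF8
(65001) would do on Windows. It assumes the field value has already been
split out of the querystring (split on & and = before decoding, not after).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Returns the value of a hex digit, or -1 if c is not one.
static int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Step 1a: turn "%D8%B4" (six bytes) into the two raw bytes 0xD8 0xB4.
// '+' is treated as a space, as in form-encoded querystrings.
// A '%' that is not followed by two hex digits is copied through unchanged.
std::string percent_decode(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '%' && i + 2 < in.size()) {
            int hi = hex_value(in[i + 1]);
            int lo = hex_value(in[i + 2]);
            if (hi >= 0 && lo >= 0) {
                out.push_back(static_cast<char>(hi * 16 + lo));
                i += 2;
                continue;
            }
        }
        out.push_back(in[i] == '+' ? ' ' : in[i]);
    }
    return out;
}

// Step 1b (stand-in): decode UTF-8 bytes into Unicode code points.
// Simplified: checks lead and continuation bytes, but does not reject
// overlong forms or surrogates the way a production decoder should.
std::u32string utf8_to_code_points(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes expected
        if (b0 < 0x80)                { cp = b0;        extra = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (i + extra >= bytes.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("invalid continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

With these, utf8_to_code_points(percent_decode("%D8%B4")) yields a single
code point, U+0634. Note that this is not the same number as 0xD8B4: the two
bytes are a UTF-8 sequence, not a 16-bit character value.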

Key thing is to take input as UTF8 and store it as Unicode -- so that when
it is time to send output, you have many conversion options from Unicode
into either UTF8, CP_ACP, or any other encoding.
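
As an illustration of the output side, here is a rough sketch of going from
stored Unicode code points back to UTF-8 bytes. The name code_points_to_utf8
is hypothetical, and this is only a portable stand-in: on Windows the usual
route would be WideCharToMultiByte() with CP_UTF8, or with another code page
such as CP_ACP if that is what the client needs.

```cpp
#include <stdexcept>
#include <string>

// Encode Unicode code points as UTF-8 bytes.
// Rejects values that are not valid Unicode scalar values.
std::string code_points_to_utf8(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) {
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("not a Unicode scalar value");
        if (cp < 0x80) {
            // 1 byte: 0xxxxxxx
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            // 2 bytes: 110xxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

For the character in this thread, encoding U+0634 this way gives back the
two bytes 0xD8 0xB4, i.e. the same bytes the browser %-encoded.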

If your database does not accept Unicode, then you should store the
querystring as %-encoded into the database, and on sending a response,
%-decode it into actual characters and send it. This completely depends on
the encoding sent by the original client and whether it matches an encoding
and font-set on the retrieving client.

As for why it doesn't display properly on the client -- you need to make
sure the client has the necessary fonts, the HTTP response tells the browser
the right encoding, and that the entity body that you sent has the
characters encoded in the right byte sequence to match the encoding claimed
by the HTTP response headers.
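
To make the "headers must match the body" point concrete, here is a small
sketch that builds response headers for a body that is known to be UTF-8.
The helper name build_headers is made up for this example, and text/html is
an assumed content type. In a real ISAPI extension the resulting string
would be handed to ServerSupportFunction() (for example with
HSE_REQ_SEND_RESPONSE_HEADER_EX); that call is not shown here.

```cpp
#include <string>

// Build headers that declare the body as UTF-8.
// Only correct if utf8_body really contains UTF-8 bytes: the charset in
// Content-Type is a claim the body has to live up to.
std::string build_headers(const std::string& utf8_body) {
    return "Content-Type: text/html; charset=utf-8\r\n"
           "Content-Length: " + std::to_string(utf8_body.size()) + "\r\n"
           "\r\n";
}
```

Content-Length counts bytes, not characters, so the single Arabic character
from this thread (bytes 0xD8 0xB4) contributes 2 to the length.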

Right now, I think that you are misinterpreting the character encoding
of %D8%B4 as UTF8. It is not possible for you to "use UTF-8 encoding on my
html form" -- it is an option in IE to send characters as either DBCS or
UTF8 -- the HTML form itself cannot control content encoding. Thus, I think
that you're confusing the browser by telling it that the content is encoded
in one code page while sending it %-decoded characters in another code page.
Figure out and match the code page. If you can't do this, I really cannot
help further. You need to understand what you are doing.
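
One rough way to start figuring out the code page is to check whether the
%-decoded bytes are at least structurally valid UTF-8. The sketch below is a
heuristic only, and the name looks_like_utf8 is made up for this example.
Passing the check does not prove the sender used UTF-8, but failing it means
the bytes are in some other encoding. For instance, a lone 0xD4 byte (which
is, as far as I know, how Windows-1256 encodes the same Arabic letter) fails
the check, while 0xD8 0xB4 passes.

```cpp
#include <cstddef>
#include <string>

// Returns true if bytes form a structurally valid UTF-8 sequence.
// Heuristic: it checks lead-byte ranges and continuation bytes only.
bool looks_like_utf8(const std::string& bytes) {
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        std::size_t extra;  // number of continuation bytes expected
        if (b0 < 0x80)                     extra = 0;
        else if (b0 >= 0xC2 && b0 <= 0xDF) extra = 1;
        else if (b0 >= 0xE0 && b0 <= 0xEF) extra = 2;
        else if (b0 >= 0xF0 && b0 <= 0xF4) extra = 3;
        else return false;                 // not a valid lead byte
        if (i + extra >= bytes.size()) return false;  // sequence cut short
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return false;     // not 10xxxxxx
        }
        i += extra + 1;
    }
    return true;
}
```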

-- 
//David
IIS
This posting is provided "AS IS" with no warranties, and confers no rights.
//
"Basit Saleem" <BasitSaleem@discussions.microsoft.com> wrote in message
news:3207361B-68FE-4863-BB5F-9569081EA36E@microsoft.com...
Hi David,
Thnx for ur input and urgent response.
I had the feeling that i will have to do the decoding myself in ISAPI
extension. Just wanted to confirm this from you.
Can you help me a bit further please. I am using UTF-8 encoding on my html
form. i need a hint on how to do that decoding you referred to in your
previous post. Right now for a single character i am doing something like
that
 if(strToClean.Find("%D8%B4")!=-1)
 {
  TCHAR ch = 0xD8B4;
  strToClean = ch;
 }
i can write an algo which can do this sort of conversion for any string,
rather than one character i am currently trying to decode. but it doesn't
seem to work as it displays a "?" whereas character i submitted was "ش".
"ش" has utf-8 code of %D8%B4.
thnx a lot for ur contribution.
waiting earnestly for ur reply
cheers
"David Wang [Msft]" wrote:
> The problem here is with the code in your ISAPI Extension and has nothing to
> do with Arabic, IIS, or ISAPI Extension API itself.
>
> The ISAPI received data that is %-encoded, and apparently it just returns
> that data as-is.  If you want it to instead return %-decoded data such that
> the browser will subsequently display the right characters, then you need to
> write your ISAPI Extension to do that decoding.  There is no way to
> "automagically" do it for you correctly -- you have to know what you are
> doing.
>
> You have two choices with character %-encoding -- use UTF8, or use DBCS. I
> suggest you %-encode the characters in UTF8 so that you can use the exact
> same code inside the ISAPI Extension to provide properly decoded data
> regardless of input language charset.  Otherwise, you'll need to save the
> charset of the %-encoding in the database and use it on %-decoding.
>
> -- 
> //David
> IIS
> This posting is provided "AS IS" with no warranties, and confers no rights.
> //
> "Basit Saleem" <BasitSaleem@discussions.microsoft.com> wrote in message
> news:9B33B06E-A7A3-498C-8F44-41251C6428DF@microsoft.com...
> We have got a problem here while submitting some data in arabic language
> from an html form. When we submit form using GET method, in querystring
> arabic characters are passed as combination of hexadecimal numbers. We are
> passing this data to ISAPI which saves it in database and later displays the
> results. What happens is that ISAPI save these numbers in a string and later
> displays these numbers in textbox instead of printing the arabic language
> character that the user entered.
>
>
> http://wpdv2-2k/WORK+/Ex/WPEDHtml.dll?R1=1&Submit=SubmitTask&proc=MiscTest&incident=0&TaskId=0618160ffbf2efabe105a631fc0f94&Index=3&method=&DCName=&EditBox1=&EditBox3=%D8%B4
>
> for example the data in Editbox is displayed as %D8%B4 whereas user entered
> "ش". Later when we display it will be displayed as %D8%B4 instead of "ش"
>
>
>

