Re: What all could be the security issues in Unicode Database

From: Bart Duncan [MSFT] (bartd_at_online.microsoft.com)
Date: 12/15/04


Date: Wed, 15 Dec 2004 21:57:43 GMT

If you have a large database that primarily contains English string data,
you should definitely consider whether the increase in db size will be a
problem before converting to nvarchar. If your db size grows by, say, 50%,
you should expect some operations (CHECKDB, DBREINDEX, etc) to take longer.
 However, in most cases you should not expect a well-optimized OLTP query
to be any slower.

Regarding UTF-8, it definitely has some strengths, but it also has its own
set of downsides. Some examples:

 - UTF-8 is only more efficient than UCS-2 from a storage perspective for
7-bit ASCII data (A-z). Any "non-English" characters encoded using UTF-8
will be at least as large as the equivalent UCS-2/UTF-16 encoding. And for
some languages like Japanese or Chinese, UTF-8 actually requires 3 bytes
for a typical character and up to 5 in the worst case, meaning that UTF-8
is worse from a storage perspective in some situations.

 - Parsing or string manipulation with UTF-8 can be considerably more
complicated and expensive. Consider an operation like SUBSTRING(column, 1,
100). With UCS-2 data you have fixed-width code units to work with,
meaning that an app can jump directly to the 100th character because it is
always at a fixed byte offset. UTF-8, on the other hand, has
variable-width characters. This means you would have to visit each byte in
the string to dynamically determine where each character boundary falls,
and continue working through the string byte-by-byte in this way until you
have found the 100th character.

 - To reiterate what Mike said, NT internally uses UTF-16 Unicode for its
strings. The same applies to COM, ODBC, OLEDB, Visual Basic, C#, and even
Java. The standard C runtime functions expect any Unicode data you hand
them to be encoded using a 16-bit code unit; they don't know how to
properly handle UTF-8. If you were to store your data as UTF-8, in many
cases you'd have to manually convert it back to UTF-16 before handing it
off to the app layer.

Bart
------------
Bart Duncan
Microsoft SQL Server Support

Please reply to the newsgroup only - thanks.
This posting is provided "AS IS" with no warranties, and confers no rights.

--------------------
| From: "AnthonyThomas" <Anthony.Thomas@CommerceBank.com>
| References: <eWBoW9N4EHA.3388@TK2MSFTNGP15.phx.gbl>
<9A2BA7C8-38E5-48CA-9521-7694146A5D8A@microsoft.com>
<OnwJcuR4EHA.1564@TK2MSFTNGP09.phx.gbl>
<#K9dgYV4EHA.1392@tk2msftngp13.phx.gbl>
| Subject: Re: What all could be the security issues in Unicode Database
| Date: Mon, 13 Dec 2004 23:25:48 -0600
| Lines: 100
| X-Priority: 3
| X-MSMail-Priority: Normal
| X-Newsreader: Microsoft Outlook Express 6.00.2800.1437
| X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1441
| Message-ID: <efUnC2Z4EHA.4008@TK2MSFTNGP15.phx.gbl>
| Newsgroups: microsoft.public.sqlserver.server
| NNTP-Posting-Host: cpe-69-76-153-75.kc.rr.com 69.76.153.75
| Path:
cpmsftngxa10.phx.gbl!TK2MSFTNGXA01.phx.gbl!TK2MSFTNGP08.phx.gbl!TK2MSFTNGP15
phx.gbl
| Xref: cpmsftngxa10.phx.gbl microsoft.public.sqlserver.server:370730
| X-Tomcat-NG: microsoft.public.sqlserver.server
|
| Conversion or not, storage is another matter. Convert it if you must, but
| even that is not a standard. Take a look at UTF-8. SQL Server does not
| support this. So, there is the doubling of the storage.
|
| Again, a presentation or, at least, an application issue.
|
| Sincerely,
|
|
| Anthony Thomas
|
|
| --
|
| "Mike Epprecht (SQL MVP)" <mike@epprecht.net> wrote in message
| news:%23K9dgYV4EHA.1392@tk2msftngp13.phx.gbl...
| Hi
|
| Don't forget, all NT based OS's use UNICODE internally, so the data gets
| converted to UNICODE and is used in the OS as UNICODE.
| If an application is not UNICODE, the OS converts it for it. .NET
framework
| is UNICODE.
| If you like it or not, the conversion happens automatically.
|
| --
| --------------------------------
| Mike Epprecht, Microsoft SQL Server MVP
| Zurich, Switzerland
|
| IM: mike@epprecht.net
|
| MVP Program: http://www.microsoft.com/mvp
|
| Blog: http://www.msmvps.com/epprecht/
|
| "AnthonyThomas" <Anthony.Thomas@CommerceBank.com> wrote in message
| news:OnwJcuR4EHA.1564@TK2MSFTNGP09.phx.gbl...
| > Not to mention that UNICODE is horribly inefficient. You do not need to
| use
| > it unless you know for fact that you will be using extended code pages
and
| > character sets, like Kana Fonts, and such.
| >
| > The reason is that UNICODE stores ALL character information in a 2-byte
| > storage assignement. For any character that would have fall in the
normal
| > ASCII range, UNICODE stores a 0x00 in the High-Ordered byte and the
normal
| > ASCII code in the Low-Ordered byte. You'd be paying to store a lot of
| > zero's just for the off-chance that someone might be needing to store an
| > extended character.
| >
| > That is not to say that UNICODE does not have a place. Names (people,
| > places, things) are good candidates for NVARCHAR data types, especially
on
| > Websites where you do not know a priori who may be using the site, but
for
| > all of the internal data elements, you, as the designer, should know
| exactly
| > what charater types are going to be allowed. You could even swith
| > collations and code pages for individual table attributes if you knew
| which
| > character sets would be used for them. All of this without the expense
of
| > UNICODE.
| >
| > It is up to you, but when your superiors start asking why the database
is
| so
| > big and expensive, you can tell them you jumped on the UNICODE Hype
| > bandwagon and its just they price they have to pay to be "cutting edge."
| >
| > Sincerely,
| >
| >
| > Anthony Thomas
| >
| > --
| >
| > "Mike Epprecht (SQL MVP)" <mike@epprecht.net> wrote in message
| > news:9A2BA7C8-38E5-48CA-9521-7694146A5D8A@microsoft.com...
| > Hi
| >
| > Normal security issues apply, like SQL Injection, bad permissioning etc.
| > UNICODE does not bring any specific issues with it from a security
| > perspective.
| >
| > Regards
| > Mike
| >
| > "Sierra" wrote:
| >
| > > Hi
| > >
| > > Can u guys give me some inputs as to what all could be the security
| > concerns
| > > in an Unicode Database
| > >
| > > Regards
| > > Sierra
| > >
| > >
| > >
| >
| >
|
|
|



Relevant Pages

  • Re: Unicode Support
    ... > Not knowing much about UTF-8 (my Unicode knowledge extends as far as ... > literal strings of this form as long as the character code for quote ... > can never appear in a MBCS (multibyte character sequence). ... then XP Notepad directly understands UNICODE and you can ...
    (alt.lang.asm)
  • Re: Unicode Delphi Win32 - which approach
    ... I like the backwards compatibility aspects of UTF-8 vs UTF-16. ... The first 256 Unicode characters map to the ANSI character set. ... entire stream> but calling an API 100 times in a loop I can imagine. ... and explicitly contextualise every string. ...
    (borland.public.delphi.non-technical)
  • Re: Defacto standard string library
    ... Is there a defacto standard string library ... Unicode, encoded in UTF8 format, except that a zero byte is ... Standard C string functions will be fine with this ... result, it cannot be encoded using a single byte per character, unless ...
    (comp.lang.c)
  • Re: Determining if a string is Unicode
    ... there's nothing magic about Unicode. ... where each character occupies 2 bytes, as opposed to a Single-Byte Character ... You could load up a string with rubbish, ... > INF file like so: ...
    (microsoft.public.vb.general.discussion)
  • Re: Attention: European C/C++/C#/Java Programmers-Call for Input
    ... Simply make a straight decision now - you will use UTF-8. ... character format) much like UTF-8 which itself ... I would have little more than UNICODE left. ... generator is assembly language. ...
    (comp.arch.embedded)