Re: Performance ideas - comparing first N chars of 2 TEXT columns

From: Aaron W. West (tallpeak_at_hotmail.NO.SPAM)
Date: 07/14/04


Date: Tue, 13 Jul 2004 21:04:11 -0700

You should never count on CHECKSUM or BINARY_CHECKSUM to return a unique
hash. You always need to combine the test with a comparison against the base
column. That said, I'm also quite disappointed in the poor-quality of the
CHECKSUM hash. It is no where near as good a hash as CRC32. The algorithm is
apparently, rotate the sum left 4 bits and XOR the next character's value.

For example:
print
cast(checksum(char(1)+char(2)+char(3)+char(4)+char(5)+char(6)+char(7)+char(8
)) as binary(4))
0x12345678

print
cast(checksum(char(1)+char(2)+char(3)+char(4)+char(5)+char(6)+char(7)+char(8
)+char(9)) as binary(4))
0x23456788

(0x12345678 rotated left 4 places XOR 9 = 0x23456788)

A consequence of this algorithm is that if a string is a multiple of 8
characters, repeated, the checksum will always be 0.

print cast(checksum(' Testing Testing') as binary(4))
0x00000000

print cast(checksum('1234567812345678') as binary(4))
0x00000000

Anyway, going back to my first statement, don't count on CHECKSUM to return
different values for different passed strings. It should do so over 99% of
the time with most data, but that's about all. You should generally get a
performance boost when comparing large string values that are usually
different, and hence usually have different checksum values, but also need
to compare the base values.

EG:

...
WHERE BINARY_CHECKSUM(col1) <> BINARY_CHECKSUM(col2)
    AND col1 <> col2

(replace col1 and col2 with substring functions as desired)

-aaron

--------------------------

"Mischa Sandberg" <mischa_sandberg@telus.net> wrote in message
news:Ft1Jc.43492$Rf.12473@edtnps84...
You might want to check on how many duplicates you're getting
using CHECKSUM() as a hash. I got some nasty surprises from
using a checksum where there were repeated substrings ---
different substrings in different rows, but repeated in the same places.

CHECKSUM() is a CRC32, which means it wasn't intended to be
a hash function --- it was intended to detect differences of short runs of
bits.

The solution for my case, translated to your data, was something like:

    CHECKSUM(LEFT(xml,400), REVERSE(RIGHT(xml,400)))

"Stuart" <nonnb@webmail.co.za> wrote in message
news:66f26287.0407130413.5f70fa4f@posting.google.com...
> Hi
>
> Would appreciate ideas on how to optimise a query which needs to
> compare 2 text fields in a table (obviously these cols are not
> indexed).
>
> As SQL doesn't allow TEXT column comparison, am currently doing a
> compare on the first 800 chars of to eachother, like so:
>
> if not exists (select ... where (some other criteria) AND
> (convert(convert(varchar(800), stq.xml) = convert(varchar(800),
> i.xml)))
>
> As you can imagine, the performance is lousy. Changing the length of
> the text compared does make a big difference - e.g. about 10 x quicker
> if only compare first 100 bytes etc, however at higher the risk of
> getting the comparison 'wrong'.
>
> Both text fields actually contain XML, although the XML schemas used
> can be totally different. The XML does stretch beyond 8192 chars, so
> cannot use VARCHAR or CHAR columns. It isn't absolutely critical to
> get the match 100% right (the query is used in an instead of trigger
> to eliminate duplicate tasks), but do want > 90% accuracy.
> Unfortunately, the XML only gets 'different' after about 200 chars
> (standard envelopes are applied to the XML).
>
> Ideally would have liked to do a HASH / Checksum type of match e.g.
> CHECKSUM(stq.XML) = CHECKSUM(i.xml). Or compare the lengths first then
> only the chars? or the last few 100 chars? etc
>
> Thanks in Advance
>
> Stuart



Relevant Pages

  • Re: Performance ideas - comparing first N chars of 2 TEXT columns
    ... using CHECKSUM() as a hash. ... > Both text fields actually contain XML, ... The XML does stretch beyond 8192 chars, ...
    (microsoft.public.sqlserver.programming)
  • Re: Performance ideas - comparing first N chars of 2 TEXT columns
    ... You should also look at checksum. ... > Would appreciate ideas on how to optimise a query which needs to> compare 2 text fields in a table. ... > Both text fields actually contain XML, although the XML schemas used> can be totally different. ... The XML does stretch beyond 8192 chars, so> cannot use VARCHAR or CHAR columns. ...
    (microsoft.public.sqlserver.programming)
  • Re: About Hsiehs hash: initial value? 64 bit?
    ... when implementing Paul Hsieh's hash for incremental updates? ... Well if you just want to use it as a checksum, ... CRCs of course, have a much longer history as checksums. ...
    (comp.programming)
  • Re: Really fast checksum?
    ... Using JDBC, I select records from a table, roll through them, calculate a checksum on the text of all of the fields, and then check it against a stored checksum to see if the record has changed. ... In the past I've used CRC32, but that only operates on bytes and I'd rather not convert all the strings I'll get from JDBC into bytes. ... No hash function, of any kind, is capable of giving you that guarantee. ...
    (comp.lang.java.programmer)
  • Re: Byte to byte compare, duplicate file finder/killer
    ... cryptographically strong hash, simply because the CRC ... By using a secure hash instead of CRC, the actual byte-to-byte compare ... checksum (files with identical CRCs or checksums are trivial to ...
    (comp.programming)