Re: CHECKSUM() question



Eris S,

Let's assume for a moment that you use CHECKSUM and that there are hash collisions. If you stage the answer in a temporary table, you can use that to join to your source table, then filter out the few collisions. E.g.

INSERT INTO #Temp
SELECT RowID FROM HashTable
WHERE NameHash = CHECKSUM('Jones, Callaway')
AND AddressHash = CHECKSUM('123 Main Street')

SELECT *
FROM #Temp t JOIN Source s
ON t.RowID = s.RowID
-- Filter out the nonmatching rows
WHERE Name = 'Jones, Callaway'
AND Address = '123 Main Street'

This is very much like the example used in the BOL topic on CHECKSUM.
http://msdn.microsoft.com/en-us/library/ms189788.aspx

Remember absolutely no hash that reduces the size of data for searching can be guaranteed to avoid a collision. However, from a private discussion, Steve Kass (who knows way more about numbers than I do) pointed out that HASHBYTES with MD5 (16 bytes long) for 300 million rows probably has a lower chance of collision than the the possibility that some bit will get randomly changed by some other influence. Of course, then your keys are bigger.

FWIW,
RLF




"Eric S" <Ericxyz@xxxxxxxxxxx> wrote in message news:Ox0kWny0JHA.3408@xxxxxxxxxxxxxxxxxxxxxxx
Hi Eric Isaacs,

Thanks for your reply.

Basically I am creating a hash tables in a separate database, and speed is crucial, since that database will be used for web based where users will perform a search based on First, Last, Address, State , Zip etc... After I find the required records in the hash tables then I join the temp table into the source database where I grab the full records. I hope that clarifies a bit more.

Thanks,

Eric

"Eric Isaacs" <eisaacs@xxxxxxxxx> wrote in message news:3662b581-9f59-4584-911a-512333796984@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
You're also going to want to take into account the speed. Depending
on how you store the hashbyte value (as a calculated field or an
indexed view) I think you'll find that checking the value for
uniqueness will happen much faster if the value is stored and
indexed. If it's a calculated field and you add an index to it, it
will be stored twice. If you add it as an indexed view on the
hashbytes, I think it will store the hashbyte value only once. If you
add a unique constraint to the last and first names, it will store the
names twice (once in the table and once in the index.)

Diskspace is cheap, but getting this wrong could be expensive and
slow. You didn't mention speed as a criteria, but speed could be
exactly what you give up by finding a solution that doesn't index the
two names properly.

-Eric Isaacs



.



Relevant Pages

  • Re: Display Changed Records
    ... No hash system is 100% reliable (including HashBytes) and my comment was not meant to criticize you, but just to give a warning. ... CHECKSUM is less reliable for checking changes than many people think it is. ... declare @value1 int ...
    (microsoft.public.sqlserver.programming)
  • Re: CHECKSUM() question
    ... to join to your source table, then filter out the few collisions. ... This is very much like the example used in the BOL topic on CHECKSUM. ... Steve Kass once pointed out that HASHBYTES with MD5 for 300 million rows probably has a lower chance of collision than the the possibility that some bit will get randomly changed by some other influence. ... Links for SQL Server Books Online: ...
    (microsoft.public.sqlserver.programming)
  • Re: CHECKSUM() question
    ... Thanks for the replies but the Address table will not be part of the hash ... to join to your source table, then filter out the few collisions. ... This is very much like the example used in the BOL topic on CHECKSUM. ... Links for SQL Server Books Online: ...
    (microsoft.public.sqlserver.programming)
  • Re: CHECKSUM() question
    ... Keep in mind that doing checksum or hashbytes on firstname + lastname ... 'EricaPole') then 'True' ... Maybe hash the ...
    (microsoft.public.sqlserver.programming)
  • Getting hashbytes from XML
    ... How do you get hash bytes from XML? ... just want to get the hashbytes from pure XML or stream as is. ... Not only hash but also checksum will do because the computing occurs in safe ...
    (microsoft.public.sqlserver.xml)

Loading