Re: CHECKSUM() question
- From: "Russell Fields" <russellfields@xxxxxxxxxx>
- Date: Tue, 12 May 2009 14:01:48 -0400
Eris S,
Let's assume for a moment that you use CHECKSUM and that there are hash collisions. If you stage the answer in a temporary table, you can use that to join to your source table, then filter out the few collisions. E.g.
INSERT INTO #Temp
SELECT RowID FROM HashTable
WHERE NameHash = CHECKSUM('Jones, Callaway')
AND AddressHash = CHECKSUM('123 Main Street')
SELECT *
FROM #Temp t JOIN Source s
ON t.RowID = s.RowID
-- Filter out the nonmatching rows
WHERE Name = 'Jones, Callaway'
AND Address = '123 Main Street'
This is very much like the example used in the BOL topic on CHECKSUM.
http://msdn.microsoft.com/en-us/library/ms189788.aspx
Remember absolutely no hash that reduces the size of data for searching can be guaranteed to avoid a collision. However, from a private discussion, Steve Kass (who knows way more about numbers than I do) pointed out that HASHBYTES with MD5 (16 bytes long) for 300 million rows probably has a lower chance of collision than the the possibility that some bit will get randomly changed by some other influence. Of course, then your keys are bigger.
FWIW,
RLF
"Eric S" <Ericxyz@xxxxxxxxxxx> wrote in message news:Ox0kWny0JHA.3408@xxxxxxxxxxxxxxxxxxxxxxx
Hi Eric Isaacs,
Thanks for your reply.
Basically I am creating a hash tables in a separate database, and speed is crucial, since that database will be used for web based where users will perform a search based on First, Last, Address, State , Zip etc... After I find the required records in the hash tables then I join the temp table into the source database where I grab the full records. I hope that clarifies a bit more.
Thanks,
Eric
"Eric Isaacs" <eisaacs@xxxxxxxxx> wrote in message news:3662b581-9f59-4584-911a-512333796984@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxYou're also going to want to take into account the speed. Depending
on how you store the hashbyte value (as a calculated field or an
indexed view) I think you'll find that checking the value for
uniqueness will happen much faster if the value is stored and
indexed. If it's a calculated field and you add an index to it, it
will be stored twice. If you add it as an indexed view on the
hashbytes, I think it will store the hashbyte value only once. If you
add a unique constraint to the last and first names, it will store the
names twice (once in the table and once in the index.)
Diskspace is cheap, but getting this wrong could be expensive and
slow. You didn't mention speed as a criteria, but speed could be
exactly what you give up by finding a solution that doesn't index the
two names properly.
-Eric Isaacs
.
- References:
- CHECKSUM() question
- From: Eric S
- Re: CHECKSUM() question
- From: Erland Sommarskog
- Re: CHECKSUM() question
- From: Eric S
- Re: CHECKSUM() question
- From: Eric Isaacs
- Re: CHECKSUM() question
- From: Eric Isaacs
- Re: CHECKSUM() question
- From: Eric S
- CHECKSUM() question
- Prev by Date: Re: Am I getting the MERGE syntax right?
- Next by Date: Re: Am I getting the MERGE syntax right?
- Previous by thread: Re: CHECKSUM() question
- Next by thread: Re: CHECKSUM() question
- Index(es):
Relevant Pages
|
Loading