Re: GetHashCode for Objects that Compare Based on Value (Not reference equality)



Jon,

You've proven my point. Every time you list an example, you're
demonstrating that by knowing your data, you can come up with a better
hash than a generic implementation.

The likelihood of producing collisions using my method is sufficiently
low in classes that contain the following data sets:

1. stock ticker, price
2. patient name, age, ssn, collection of descriptive flags
3. date/time, event description, event data

Further, for those sets of data, an XOR hash produces a sufficiently
random distribution for hashes to work just fine. Your algorithm works
just as well in these cases, but mine does well also. If you disagree
with that, we need to start to develop a metric for how "good" a hash
has to be and what your benchmark is. Then we're departing from the
realm of developing a good generic algorithm; thus specific examples
haven't gotten us far.

Saying that XOR produces collisions is not very useful. XOR works on
the basis of collisions, so the task is not to prove that you could
have collisions, but that the collisions inside two data structures
that contain this data would lead to low orthogonality if the data was
highly orthogonal.

Your argument that you can generalize about the English language
doesn't apply because your assumptions have no bearing on the
assumptions that make up this case. It's about as logical as me saying,
"I'm right because the Cardinals won the World Series in 2006," then
going on to prove that the Cardinals did in fact win.

You've made statements about how you have some ability to characterize
all "real world" data. All you've done to prove it is list single data
types that aren't themselves randomly distributed, although some that
are used as commonly as those which you didn't mention are. Correlating
two or more of those changes the orthogonality of your random
variables, be they of Gaussian, random, lognormal or any other
distribution. A few cases, or two dozen cases for that matter, don't
prove a point.

Your hash works well in some scenarios and my hash works well in
general scenarios. That's not to say that yours doesn't work well in
general scenarios -- it appears as if it would -- or that there are
some in which mine doesn't work well -- we know there are. You have,
however, failed to prove that mine doesn't work well in the "real
world."

I'm not going to get into a petty argument over whose book is better.
You're certainly not the only person ever to notice an error somewhere,
and you're even more certainly not an authority on whether MSDN has
more errors per unit volume than any particular book in existence,
except perhaps any that you may have written. Either prove that the
specific passage I quoted is incorrect or drop the "MSDN is the root of
all evil" argument.



Stephan



Jon wrote:
ssamuel <ssamuel@xxxxxxxxx> wrote:
I highly respect you for taking the time to write sample code, but I
think you're chasing a proof by example. Given your assumptions, you
are completely correct. The fact that you have to justify your
assumptions as "in real life," which is a highly subjective assumption,
your view being one with which I cannot agree, proves my point: if you
know your data well, you can always come up with a better hashing
algorithm.

My proof is more abstract than yours:

It's not really more abstract. You're effectively saying that a good
default position is to treat all data as if it were randomly and evenly
distributed. I'm saying that's not a good default position because so
much of the time in the real world we see repeated values, or values
which tend to lie within relatively small ranges.

1. It is too hard to collect all the data used as member variables in
every class in every industry through the history, present and future
of all C# applications.

2. Therefore, one cannot characterize 'all data' or even 'real life
data.'

That's false logic - or at least, it's effectively claiming that
because we can't know *everything* it's not worth using what we *do*
know as a reasonable representative sample.

Here's a counterexample: I can't collect all the data used in every
English language text ever written, but if I were to look at all the
English language books I happen to have in my book cases, I can
reasonably quickly come to the conclusion that 'e' and 's' are used
more often than 'x' and 'z'. Similarly I don't think it's unreasonable
to suggest that the number 8 comes up more often as a value for real-
life data stored in ints than 8327646.

(For the record, your posit that an array of evenly distributed
integers between any x and -x is common would need lots more empirical
data before I begin to believe it; in the last three industries I've
worked in, such data is uncommon.

a) There's no need for the range to be -x to x.
b) There's no need for the distribution to be even

In fact, there's no need for the values even to be in the same range -
try the code I gave with different ranges for the three variables:
you'll never get out of the number of bits which can be different. With
my algorithm you quickly get a broader range.

Do you really think it's particularly uncommon for real-world data to
not end up using the full range of the datatype's possible values?

I claim that your other posit, that a
well-designed class should contain two string members that contain the
same value, is very poor.)

I didn't say that well-designed classes "should" contain two string
members that contain the same value - I claimed it often happened.
Think "shipping address" and "invoice address" - as references to
Address objects, if you will (it really doesn't matter whether they're
strings or not - the point is that repeating a value effectively
cancels out the first value's hashcode when using XOR).

3. Uncharacterizable data should be considered randomly distributed, by
the definition of random distribution, and thus should data in member
variables of a given class.

I think that's as much a guess as any other distribution - but I'd
consider it far worse than my guess that numeric values, particularly
integers, are often relatively small, usually positive. This is
reasonable, as many real-world values have the same attributes.
Consider:

1) Age
2) Day of month
3) Month of year
4) Year
5) House number
6) Price (in cents)
7) Length of DVD or CD in minutes
8) Dimension (eg in centimeters)
9) Number of messages in my inbox
10) Number of replies to a blog post

Many of these will end up being represented using an int in simple
classes. Do you really believe they have a uniform random distribution
over the full range of ints?

4. Given two random distributions, A and B, A XOR B is random.

5. It follows that given a class with random member variable values,
XOR-ing member variable values will give a random hash.

If life really consisted of uniformly distributed random values, XOR
would be fine. However, I contest that the real world isn't like that.

QED, unless you have specific characterization of your data, XOR is
efficient. Seeing that it's the method recommended in the docs -- I
won't even touch your comment about their incorrectness -- I'll follow
Ockham's Razor and stick with it.

Why won't you touch my comment about the docs not being correct? I've
submitted many corrections to MSDN before, and there are *huge* numbers
of bad code examples. I'm sure I could easily find some if you doubt
me...

Now, as to it being good solely by being recommended in documentation -
"my" algorithm is recommended in a book which is widely recognised as a
great text. Okay, so it's about Java, but I don't believe that there's
any significant difference between .NET and Java that would affect the
advice about hashcodes.

Mind you, I'm not trying to say your algorithm is incorrect. I'm simply
stating that one should implement the simplest efficient algorithm
until optimization, when one can gain insight into the true
distribution of values in ones real data set.

I don't think there's any significant difference in simplicity, to be
honest. I *do* believe you'll get more collisions in the real world
using XOR than using multiply-and-add.

--
Jon Skeet - <skeet@xxxxxxxxx>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too

.



Relevant Pages

  • Re: PEP 376
    ... Hash: SHA1 ... algorithm without changing the PEP or the installer code. ... If someone wants to modify a file of a distribution he can recreate ...
    (comp.lang.python)
  • Re: Big Push for Guest Worker Program
    ... so that I wouldn't every have to do taxes again. ... Since my big distribution comes in November, ... I remember my car license plate ID by an algorithm rather ...
    (soc.retirement)
  • Re: SHA-1 vs. triple-DES for password encryption?
    ... be better to use a standard algorithm rather than a home-grown one. ... SHA-1 and 3DES have been reviewed for some time. ... This is where a hash comes in nicely. ... Longer passwords and hashes aid in making the hash much harder to work with. ...
    (SecProg)
  • Re: sort unique
    ... given that a hash table is not ... IMO if the vendor's algorithm does something "obvious", ... function to eliminate keys that hash to the same bucket per some ... strings of random lengths, and two strings are ...
    (comp.lang.lisp)
  • Re: out of memory
    ... read only the smaller file into a hash. ... the smaller file will fit into RAM. ... Depending upon the sorting algorithm this would be Ologor ... put your relevant data into a database and use ...
    (comp.lang.perl.misc)

Loading