Re: GetHashCode for Objects that Compare Based on Value (Not reference equality)
- From: Jon Skeet [C# MVP] <skeet@xxxxxxxxx>
- Date: Tue, 14 Nov 2006 23:56:51 -0000
ssamuel <ssamuel@xxxxxxxxx> wrote:
I highly respect you for taking the time to write sample code, but I
think you're chasing a proof by example. Given your assumptions, you
are completely correct. The fact that you have to justify your
assumptions as "in real life," which is a highly subjective assumption,
your view being one with which I cannot agree, proves my point: if you
know your data well, you can always come up with a better hashing
algorithm.
My proof is more abstract than yours:
It's not really more abstract. You're effectively saying that a good
default position is to treat all data as if it were randomly and evenly
distributed. I'm saying that's not a good default position because so
much of the time in the real world we see repeated values, or values
which tend to lie within relatively small ranges.
1. It is too hard to collect all the data used as member variables in
every class in every industry through the history, present and future
of all C# applications.
2. Therefore, one cannot characterize 'all data' or even 'real life
data.'
That's false logic - or at least, it's effectively claiming that
because we can't know *everything* it's not worth using what we *do*
know as a reasonable representative sample.
Here's a counterexample: I can't collect all the data used in every
English language text ever written, but if I were to look at all the
English language books I happen to have in my book cases, I can
reasonably quickly come to the conclusion that 'e' and 's' are used
more often than 'x' and 'z'. Similarly I don't think it's unreasonable
to suggest that the number 8 comes up more often as a value for real-
life data stored in ints than 8327646.
(For the record, your posit that an array of evenly distributed
integers between any x and -x is common would need lots more empirical
data before I begin to believe it; in the last three industries I've
worked in, such data is uncommon.
a) There's no need for the range to be -x to x.
b) There's no need for the distribution to be even
In fact, there's no need for the values even to be in the same range -
try the code I gave with different ranges for the three variables:
you'll never get out of the number of bits which can be different. With
my algorithm you quickly get a broader range.
Do you really think it's particularly uncommon for real-world data to
not end up using the full range of the datatype's possible values?
I claim that your other posit, that a
well-designed class should contain two string members that contain the
same value, is very poor.)
I didn't say that well-designed classes "should" contain two string
members that contain the same value - I claimed it often happened.
Think "shipping address" and "invoice address" - as references to
Address objects, if you will (it really doesn't matter whether they're
strings or not - the point is that repeating a value effectively
cancels out the first value's hashcode when using XOR).
3. Uncharacterizable data should be considered randomly distributed, by
the definition of random distribution, and thus should data in member
variables of a given class.
I think that's as much a guess as any other distribution - but I'd
consider it far worse than my guess that numeric values, particularly
integers, are often relatively small, usually positive. This is
reasonable, as many real-world values have the same attributes.
Consider:
1) Age
2) Day of month
3) Month of year
4) Year
5) House number
6) Price (in cents)
7) Length of DVD or CD in minutes
8) Dimension (eg in centimeters)
9) Number of messages in my inbox
10) Number of replies to a blog post
Many of these will end up being represented using an int in simple
classes. Do you really believe they have a uniform random distribution
over the full range of ints?
4. Given two random distributions, A and B, A XOR B is random.
5. It follows that given a class with random member variable values,
XOR-ing member variable values will give a random hash.
If life really consisted of uniformly distributed random values, XOR
would be fine. However, I contest that the real world isn't like that.
QED, unless you have specific characterization of your data, XOR is
efficient. Seeing that it's the method recommended in the docs -- I
won't even touch your comment about their incorrectness -- I'll follow
Ockham's Razor and stick with it.
Why won't you touch my comment about the docs not being correct? I've
submitted many corrections to MSDN before, and there are *huge* numbers
of bad code examples. I'm sure I could easily find some if you doubt
me...
Now, as to it being good solely by being recommended in documentation -
"my" algorithm is recommended in a book which is widely recognised as a
great text. Okay, so it's about Java, but I don't believe that there's
any significant difference between .NET and Java that would affect the
advice about hashcodes.
Mind you, I'm not trying to say your algorithm is incorrect. I'm simply
stating that one should implement the simplest efficient algorithm
until optimization, when one can gain insight into the true
distribution of values in ones real data set.
I don't think there's any significant difference in simplicity, to be
honest. I *do* believe you'll get more collisions in the real world
using XOR than using multiply-and-add.
--
Jon Skeet - <skeet@xxxxxxxxx>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
.
- Follow-Ups:
- References:
- GetHashCode for Objects that Compare Based on Value (Not reference equality)
- From: Matt
- Re: GetHashCode for Objects that Compare Based on Value (Not reference equality)
- From: ssamuel
- Re: GetHashCode for Objects that Compare Based on Value (Not reference equality)
- From: Jon Skeet [C# MVP]
- Re: GetHashCode for Objects that Compare Based on Value (Not reference equality)
- From: ssamuel
- Re: GetHashCode for Objects that Compare Based on Value (Not reference equality)
- From: Jon Skeet [C# MVP]
- Re: GetHashCode for Objects that Compare Based on Value (Not reference equality)
- From: ssamuel
- GetHashCode for Objects that Compare Based on Value (Not reference equality)
- Prev by Date: Re: GetHashCode for Objects that Compare Based on Value (Not reference equality)
- Next by Date: Re: GetHashCode for Objects that Compare Based on Value (Not reference equality)
- Previous by thread: Re: GetHashCode for Objects that Compare Based on Value (Not reference equality)
- Next by thread: Re: GetHashCode for Objects that Compare Based on Value (Not reference equality)
- Index(es):
Relevant Pages
|