Re: problem using identity column as primary key

From: Joe Celko (joe.celko_at_northface.edu)
Date: 02/24/04


Date: Tue, 24 Feb 2004 08:55:32 -0800


>> I am thinking of creating an identity column to use it as primary key
in my table. <<

**by definition** a key is a subset of the attributes of an entity. It
is a LOGICAL construct. The proprietary, non-relational IDENTITY
property is a PHYSICAL state of a particular release of a particular
product on of a particular machine at of a particular time. It cannot
have anything whatsoever to with your data model (you did build a data
model, build a data dictionary and research your industry weeks before
you started coding, didn't you?)

It is a handle to the row. This means that you are faking a sequential
file's positional record number, so you can reference the physical
storage location? Sure, I'd do this if I want to lose all the
advantages of an abstract data model, SQL set oriented programming,
carry extra data and destroy the portability of code!

More and more programmers who have absolutely no database training are
being told to design a database. They are using GUIDs, IDENTITY, ROWID
and other proprietary auto-numbering "features" in SQL products to
imitate either a record number (sequential file system mindset) or OID
(OO mindset) since they don't know anything else.

Experienced database designers tend toward intelligent keys they find in
industry standard codes, such as UPC, VIN, GTIN, ISBN, etc. They know
that they need to verify the data against the reality they are modeling.
A trusted external source is a good thing to have.

The IDENTITY column is a holdover from the early programming languages
which were very close to the hardware. For example, the fields (not
columns; big difference) in a COBOL or FORTRAN program were assumed to
be physically located in main storage in the order they were declared in
the program. The languages have constructs using that model -- logical
and physical implementations are practically one! The data has meaning
BECAUSE of the program reading it (i.e. the same bits could be a
character in one program and be an integer in another)

The early SQLs were based on existing file systems. The data was kept
in physically contiguous disk pages, in physically contiguous rows, made
up of physically contiguous columns. In short, just like a deck of
punch cards or a magnetic tape. Most programmer still carry that mental
model, which is why I keep doing that rant about file vs. table, row vs.
record and column vs. field.

But physically contiguous storage is only one way of building a
relational database and it is not the best one. The basic idea of a
relational database is that user is not supposed to know *how* or
*where* things are stored at all, much less write code that depends on
the particular physical representation in a particular release of a
particular product on particular hardware at a particular time.

One of the biggest errors is the IDENTITY column (actually property, not
a column at all) in the Sybase/SQL Server family. People actually
program with this "feature" and even use it as the primary key for the
table! Now, let's go into painful details as to why this thing is bad.

The first practical consideration is that IDENTITY is proprietary and
non-portable, so you know that you will have maintenance problems when
you change releases or port your system to other products. Newbies
actually think they will never port code! Perhaps they only work for
companies that are failing and will be gone. Perhaps their code is such
crap nobody else want their application.

But let's look at the logical problems. First try to create a table
with two columns and try to make them both IDENTITY. If you cannot
declare more than one column to be of a certain data type, then that
thing is not a datatype at all, by definition. It is a property which
belongs to the PHYSICAL table, not the LOGICAL data in the table.

Next, create a table with one column and make it an IDENTITY. Now try
to insert, update and delete different numbers from it. If you cannot
insert, update and delete rows from a table, then it is not a table by
definition.

Finally create a simple table with one IDENTITY and a few other columns.
Use a few statements like

INSERT INTO Foobar (a, b, c) VALUES ('a1', 'b1', 'c1');
INSERT INTO Foobar (a, b, c) VALUES ('a2', 'b2', 'c2');
INSERT INTO Foobar (a, b, c) VALUES ('a3', 'b3', 'c3');

To put a few rows into the table and notice that the IDENTITY
sequentially numbered them in the order they were presented. If you
delete a row, the gap in the sequence is not filled in and the sequence
continues from the highest number that has ever been used in that column
in that particular table. This is how we did record numbers in
pre-allocated sequential files in the 1950's, by the way. A utility
program would then "pack" or "compress" the records that were flagged as
deleted or unused to move the empty space to the physical end of the
physical file.

But now use a statement with a query expression in it, like this:

INSERT INTO Foobar (a, b, c)
SELECT x, y, z
  FROM Floob;

Since a query result is a table, and a table is a set which has no
ordering, what should the IDENTITY numbers be? The entire, whole,
completed set is presented to Foobar all at once, not a row at a time.
There are (n!) ways to number (n) rows, so which one do you pick? The
answer has been to use whatever the *physical* order of the result set
happened to be. That non-relational phrase "physical order" again!

But it is actually worse than that. If the same query is executed
again, but with new statistics or after an index has been dropped or
added, the new execution plan could bring the result set back in a
different physical order.

Can you explain from a logical model why the same rows in the second
query get different IDENTITY numbers? In the relational model, they
should be treated the same if all the values of all the attributes are
identical.

Using IDENTITY as a primary key is a sign that there is no data model,
only an imitation of a sequential file system. Since this "magic,
all-purpose, one-size-fits-all" pseudo-identifier exists only as a
result of the physical state of a particular piece of hardware at a
particular time as read by the current release of a particular database
product, how do you verify that an entity has such a number in the
reality you are modeling?

You will see newbies who design tables like this:

CREATE Drivers
(driver_id IDENTITY (1,1) NOT NULL PRIMARY KEY,
 ssn CHAR(9) NOT NULL REFERENCES Personnel(ssn),
 vin CHAR(17) NOT NULL REFERENCES Motorpool(vin));

Now input data and submit the same row a thousand times, a million
times. Your data integrity is trashed. The natural key was this:

CREATE Drivers
(ssn CHAR(9) NOT NULL REFERENCES Personnel(ssn),
 vin CHAR(17) NOT NULL REFERENCES Motorpool(vin),
 PRIMARY KEY (ssn, vin));

To demonstrate, here is a typical idiot newbie schema -- you will them
all over the news groups. I call them "idiots" because they always name
the IDENTITY property column "id" in EVERY table. They don't understand
basic data modeling -- one and only name for an attribute. About half
the time they don't use any DRI, but let's show it.

CREATE TABLE MotorPool
(id IDENTITY (1,1) NOT NULL PRIMARY KEY,
 ssn CHAR(9) NOT NULL REFERENCES Personnel(id),
 vin CHAR(17) NOT NULL REFERENCES Vehicle(id));

CREATE TABLE Personnel
(id IDENTITY (1,1) NOT NULL PRIMARY KEY,
 ssn CHAR(9) NOT NULL UNIQUE,
 ..);

CREATE TABLE Vehicles
(id IDENTITY (1,1) NOT NULL PRIMARY KEY,
 vin CHAR(17) NOT NULL UNIQUE,
.);

Now change the natural key in Personnel:

UPDATE Personnel
   SET ssn = '666666666'
 WHERE ssn = '000000000';

Nothing happened in Motorpool, did it? You can do the same thing with a
VIN.

Now you are REALLY thinking about relations and keys instead of 1950's
sequential record numbering. Adding an IDENTITY column to either of
these tables as a candidate key would be dangerously redundant; one
query uses the IDENTITY and another uses the real key, and like a man
with two watches, you are never sure what time it is.

Finally, an appeal to authority, with a quote from Dr. Codd: "..Database
users may cause the system to generate or delete a surrogate, but they
have no control over its value, nor is its value ever displayed to them
.."(Dr. Codd in ACM TODS, pp 409-410) and Codd, E. (1979), Extending
the database relational model to capture more meaning. ACM Transactions
on Database Systems, 4(4). pp. 397-434.

This means that a surrogate ought to act like an index; created by the
user, managed by the system and NEVER seen by a user. That means never
used in queries, DRI or anything else that a user does.

Codd also wrote the following:

"There are three difficulties in employing user-controlled keys as
permanent surrogates for entities.

(1) The actual values of user-controlled keys are determined by users
and must therefore be subject to change by them (e.g. if two companies
merge, the two employee databases might be combined with the result that
some or all of the serial numbers might be changed.).

(2) Two relations may have user-controlled keys defined on distinct
domains (e.g. one uses social security, while the other uses employee
serial numbers) and yet the entities denoted are the same.

(3) It may be necessary to carry information about an entity either
before it has been assigned a user-controlled key value or after it has
ceased to have one (e.g. and applicant for a job and a retiree).

These difficulties have the important consequence that an equi-join on
common key values may not yield the same result as a join on common
entities. A solution - proposed in part [4] and more fully in [14] - is
to introduce entity domains which contain system-assigned surrogates.
Database users may cause the system to generate or delete a surrogate,
but they have no control over its value, nor is its value ever displayed
to them....." (Codd in ACM TODS, pp 409-410).

References

Codd, E. (1979), Extending the database relational model to capture more
meaning. ACM Transactions on Database Systems, 4(4). pp. 397-434

>> But the problem is that once I delete some records [sic], the SQL
Server would not reuse the IDs of the deleted records [sic]. So there
will be holes in the sequence.<<

You need to read a book on SQL and RDBMS. Rows are not records; fields
are not columns; tables are not files; there is no sequential access or
ordering in an RDBMS, so "first", "next" and "last" are totally
meaningless.

You are confusing the parking spaces with the automobiles. Automobiles
have VIN numbers, not parking space numbers!!

>> I know many people use Identity columns to generate unique id. <<

And that is a major reason so many database project produce bad data,
fail out right and why I could easily bill between $1000-2000 per day
when I did clean up work or SQL training. Better than half of my clean
up jobs used IDENTITY, GUIDs or other autonumbering schemes that could
not be verified.

>> How do they deal with the holes in the identity columns? <<

Your mental model is data in a sequential file. You have no concept
whasoever of what an RDBMS is. This question is what we used to do with
magnetic tape files in the early 1960's.

You marked a record (not a row) as deleted on the tape as you processed
it; this was usually a bit or byte flag at the start of each record. We
also used record numbers, where the zero or negative values meant that
the record was deleted.

Record number were handy because you did relative reads from the current
position. For example, an order might consists of an order header
record (the term "Header record" is still in use today! Look at some of
the postings here) followed by detail records. The header would have a
count of the number of details. So I could program my application to
jump to the next order header with a little math. I would finally run
the tape thru a program that closed up the gaps and re-numbered the
records as it sorted them.

Look at the CURSOR reads in Standard SQL today; the whole cursor model
is based on mag tape file systems. The reason for this is that all of
the X3J Programming Language Standards have a sequential file model in
them.

Every DB programmer ought to have to take a course in file systems and
have to write a polyphase merge sort once in their education. That
teach them why we stopped using file systems and maybe, just maybe, they
will quit tryng to imitate 1950's technology in SQL.

--CELKO--
 ===========================
 Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, datatypes, etc. in your
schema are.

*** Sent via Developersdex http://www.developersdex.com ***
Don't just participate in USENET...get rewarded for it!



Relevant Pages

  • Re: Table Design Question
    ... > requires more than two probes, no matter how large the database. ... > acceptable (in the relational model) to have an Identity attribute to ... the gap in the sequence is not filled in and the sequence ... > vin CHARNOT NULL REFERENCES Motorpool); ...
    (microsoft.public.sqlserver.programming)
  • Re: table structure question
    ... no matter how large the database. ... acceptable (in the relational model) to have an Identity attribute to ... the gap in the sequence is not filled in and the sequence ... vin CHARNOT NULL REFERENCES Motorpool); ...
    (microsoft.public.sqlserver.programming)
  • Re: why use 4nf?
    ... no matter how large the database. ... acceptable (in the relational model) to have an Identity attribute to ... the gap in the sequence is not filled in and the sequence ... vin CHARNOT NULL REFERENCES Motorpool); ...
    (microsoft.public.sqlserver.programming)
  • Re: Primary Key Dilemma
    ... no matter how large the database. ... acceptable (in the relational model) to have an IDENTITY attribute to ... the gap in the sequence is not filled in and the sequence ... vin CHARNOT NULL REFERENCES Motorpool); ...
    (microsoft.public.sqlserver.programming)
  • Re: How should I generate a primary key?
    ... wrong with using IDENTITY or SEQUENCE? ... I asked in a separate thread "Is there a database independent way of ... value and primary key values should not be changeable for any reason. ... I have normally called this simply reference data. ...
    (comp.databases)