Re: do all primary keys use autonumber



I ran the test you suggested. Perhaps I missed something (details below), but I wasn't impressed with the results I got. Please let me know if I did miss something here.

peregenem@xxxxxxxxxxxx wrote:
Vincent Johns wrote:

How does knowing the "physical order of the disk" (assuming you can
determine what that is) help you do ordinary database operations?

Physical order of the disk (a.k.a the cluster index) for a table is determined by its PRIMARY KEY designation.

Here's a quick demo of why for Jet you must choose a PRIMARY KEY
carefully.

I assume you have a 'sequence' table i.e. a table of integers e.g.

CREATE TABLE [Sequence] (seq INTEGER)
;
INSERT INTO [Sequence] VALUES (1)
;
INSERT INTO [Sequence] VALUES (2)
;
INSERT INTO [Sequence] VALUES (3)
;
etc etc.

In reality, my data was created in Excel and imported <g>.

You didn't say how many records you created. What I did was to put 100,000 records into [Sequence], with values of 1 to 100,000.


Now create two test tables:

CREATE TABLE Test1 (
key_col INTEGER IDENTITY(1, 1) NOT NULL,
data_col INTEGER NOT NULL,
PRIMARY KEY (key_col)
)
;
CREATE TABLE Test2 (
key_col INTEGER IDENTITY(1, 1) NOT NULL,
data_col INTEGER NOT NULL,
PRIMARY KEY (data_col, key_col)
)
;

I set [Test1].[key_col] and [Test2].[key_col] to be incremental Autonumber (long integer) fields, and [Test1].[data_col] and [Test2].[data_col] to be Number (long integer) fields. For [Test1], I set the primary key field to be [key_col], and for [Test2] I set the primary key to include both fields.


Note the difference in PK designation between them.

Now load some data using the 'sequence table':

INSERT INTO Test1 (data_col)
SELECT DT1.data_col FROM (
SELECT S1.seq MOD 100 AS data_col
FROM [Sequence] AS S1
WHERE S1.seq <= 50000
UNION ALL
SELECT S1.seq MOD 100 AS data_col
FROM [Sequence] AS S1
WHERE S1.seq <= 50000
UNION ALL
SELECT S1.seq MOD 100 AS data_col
FROM [Sequence] AS S1
WHERE S1.seq <= 50000
UNION ALL
SELECT S1.seq MOD 100 AS data_col
FROM [Sequence] AS S1
WHERE S1.seq <= 50000
UNION ALL
SELECT S1.seq MOD 100 AS data_col
FROM [Sequence] AS S1
WHERE S1.seq <= 50000
) AS DT1
;
INSERT INTO Test1 (data_col)
SELECT DT1... (as before)
;

For the second version, I assume you really meant

  INSERT INTO Test2 (data_col)

.. Your SQL didn't work on my Access 2000, so what I actually did was to define the following Queries:

[Q_ShowSeq] SQL:
  SELECT S1.seq MOD 100 AS data_col
  FROM Sequence AS S1
  WHERE S1.seq<=50000;

[Q_UnionAll] SQL:
  SELECT * FROM Q_ShowSeq
  UNION ALL
  SELECT * FROM Q_ShowSeq
  UNION ALL
  SELECT * FROM Q_ShowSeq
  UNION ALL
  SELECT * FROM Q_ShowSeq
  UNION ALL
  SELECT * FROM Q_ShowSeq;

[Q_PopulateTest1] SQL:
  INSERT INTO Test1 ( data_col )
  SELECT DT1.data_col
  FROM [SELECT * FROM Q_UnionAll
  ]. AS DT1;

and similarly for [Q_PopulateTest2].

This put 250,000 records into each of [Test1] and [Test2].

At this point, compact the file. This has the effect of physically
rebuilding the tables based on their PRIMARY KEY designations.

To test the relative performance of the PRIMARY KEY choices, run a
query which uses a BETWEEN construct because this favours the physical
order:

SELECT key_col, data_col
FROM Test1
WHERE data_col BETWEEN 10 AND 15
OR data_col BETWEEN 30 AND 35
OR data_col BETWEEN 50 AND 55
OR data_col BETWEEN 70 AND 75
OR data_col BETWEEN 90 AND 95
;
SELECT key_col, data_col
FROM Test2
WHERE data_col BETWEEN 10 AND 15
OR data_col BETWEEN 30 AND 35
OR data_col BETWEEN 50 AND 55
OR data_col BETWEEN 70 AND 75
OR data_col BETWEEN 90 AND 95

In my test timings, table with PRIMARY KEY (key_col) is about 10 times
slower than the table with PRIMARY KEY (data_col, key_col).

My version of your Queries (essentially identical to yours) was

[Q_SelectTest1] SQL:
  SELECT [key_col], [data_col]
  FROM Test1
  WHERE data_col Between 10 And 15
  Or data_col Between 30 And 35
  Or data_col Between 50 And 55
  Or data_col Between 70 And 75
  Or data_col Between 90 And 95
  ;

and similarly for [Q_SelectTest2].

In my test timings, after compacting the database, both versions completed essentially instantly (for example, 1.23E-08) as measured by the Time function (system time, = days since midnight), so both were apparently completed in a few milliseconds or less.

But looking at your description, even assuming I had included enough records to be able to notice a difference, it appears that you forgot to INDEX the [Test1].[data_col] field! Declaring the primary key to include only [Test1].[key_col], omitting the field on which you were filtering the records, would also omit putting an index onto the other field. I imagine that that's why the Query took longer when you ran it.

This is exactly what I was referring to earlier, specifically the problem of FORGETTING to index a field that one needs to use for sorting or filtering a recordset.

Choosing an index that fails to match the physical record order on disk pales in comparison to this type of omission.

Bear in mind that putting lots of indices onto a Table may not improve performance. Indices improve performance when searching and sorting, but they reduce performance with updating (since each index must be updated when a record is changed).

In summary, although it's perhaps useful to know that a primary key gets slightly special treatment from Access, the improvement in performance that you might experience is likely to be pretty marginal. Far more important, IMHO, is organizing the information (via primary keys; indices; helpfully named Tables, fields, and Queries; properly human-engineered Forms; etc.) in your database so that it will do the job that you need done and will be easy to maintain. Worrying too much about internal implementation or "efficiency" can be counterproductive, if it distracts you from the imporant parts of the design. (Worse than any of this, of course, is failing to make the database model the real-world system it's supposed to model, but that's another story.)

  -- Vincent Johns <vjohns@xxxxxxxxxxxxxxxxxx>
  Please feel free to quote anything I say here.
.



Relevant Pages

  • Re: How should I generate a primary key?
    ... wrong with using IDENTITY or SEQUENCE? ... I asked in a separate thread "Is there a database independent way of ... value and primary key values should not be changeable for any reason. ... I have normally called this simply reference data. ...
    (comp.databases)
  • Re: problem using identity column as primary key
    ... >> I am thinking of creating an identity column to use it as primary key ... More and more programmers who have absolutely no database training are ... the gap in the sequence is not filled in and the sequence ... vin CHARNOT NULL REFERENCES Motorpool); ...
    (microsoft.public.sqlserver.programming)
  • Re: How should I generate a primary key?
    ... database engines and their extensions over time. ... Please go back and read Codd. ... The primary key cannot have any meaning to it unless its creation is always ... be committed before the next person could get the next sequence. ...
    (comp.databases)
  • Re: How should I generate a primary key?
    ... wrong with using IDENTITY or SEQUENCE? ... To avoid complications in any relational database, ... value and primary key values should not be changeable for any reason. ... obtain their value outside of the transaction. ...
    (comp.databases)
  • Re: Doubling the order
    ... Other than the orderId, there is nothing to say that two orders are the ... > 1.PRODUCTID) - primary Key ... >> relational database and it is not the best one. ... the gap in the sequence is not filled in and the sequence ...
    (microsoft.public.sqlserver.programming)