Re: Managing duplicates when INSERTing.

From: Tom Ellison (tellison_at_jcdoyle.com)
Date: 09/27/04


Date: Mon, 27 Sep 2004 11:24:18 -0500

See inline, below.

Tom Ellison
Microsoft Access MVP
Ellison Enterprises - Your One Stop IT Experts

On Tue, 28 Sep 2004 01:51:19 +1000, "GPO" <cre@inine> wrote:

>Thanks Tom for your very quick response.
>
>Yes, I've toyed with precisely this before, and agree that it does work for
>moderate-sized chunks of data. I have a couple of observations though:
>1. Are you suggesting that I will end up with a "raw" table AND a "final"
>table, or a "raw" table and a "final"query?

I would definitely plan to have a "final" table.

>2. If just raw, could I get around the speed issues of nested queries by
>having a "void" flag in each row of the "raw" table set to yes for each
>unwanted row? If so, is it possible to write an update query to do this
>(noting point 3)?
>3. One problem I've come across with the INNER JOIN between the query and
>the raw table, is the "this recordset is not updateable" message when
>attempting to delete or update from what would be the raw table. Am I right
>in thinking that this only occurs when tables/queries don't have PKs
>defined?

Usually, and I believe specifically in this case, that can be overcome
with the "Allow Inconsistent Updates" option. However, I don't see
why you would ever mess with the "raw" table except to empty it for
the next round of imports.

>4. Because of the volume of data, I fear I may approach the 2gb limit of the
>db (one reason why I was hoping to stop the data at the point of insertion -
>especially given that 75% of the "raw" data is unwanted).

Not an insignificant consideration, but making it much more difficult
to write isn't a good alternative, either. I suggest here that you
could put the "raw" table in another database completely. The
drawback to doing so is that you can't establish relationships between
this and other tables, and that would be of no concern anyway.

>5. Interestingly I did a similar exercise using ADO to loop through every
>row of a correctly ordered recordset, and compare each row with the previous
>row, flagged the unwanted rows for deletion and processed a delete query.
>Maybe I had my SQL badly set up, but ADO (or AD-slow as it is known) was
>impressively faster than the SQL. Go figure.

That seems backwards to me, as well. The query should be the much
preferred method, for performance and for other reasons.

>6. Although it is a business rule that there are no duplicates within a
>single text file, it does happen to about 3 in every 100,000 records.
>Because we can point to a business rule that says "There can only be one"
>(the highlander rule), we just arbitrarily pick one, and ignore the other.
>This means though, importing the data into a "raw" table with a surrogate
>key (autonumber), or no key at all (which brings you back to point three)...

You can import without a PK, eliminate duplicates, then establish the
PK before proceeding.
>
>Another approach I was thinking about was maybe a bit of script that worked
>directly with the text files before importing them. It would have to:
>1. Open a new text file for writing
>2. Loop through the data files in descending filename order:
>3. Within each data file, loop through each line and if the text string
>preceding the second TAB (it's tab delimited with key data in the first two
>fields) is not in the new file, in the equivalent location, copy that line
>to the new file (concatenated to metadata info like originating filename).
>4. Import just the desired rows, from the new text file.

This does not sound so good to me.

>
>Has anybody dealt with it this way before? Is it slow? or maybe even
>SLOOOWWW?
>
>Thanks again
>
>GPO
>
>
>"Tom Ellison" <tellison@jcdoyle.com> wrote in message
>news:a59gl0davkod50cd39imnn3lcn45oi8pi9@4ax.com...
>> Dear GPO:
>>
>> I recommend using a totally query based solution.
>>
>> It is my policy to always have a "raw data" table into which every
>> proposed row is inserted first. In this table, I always have a column
>> "source" which identifies from where the data came. For your
>> situation, this would be the name of the text file, on which you could
>> sort descending so the desired row shows up first from each set,
>> sorting first by those columns that make up your unique key.
>>
>> Assuming that you never have a duplicate key from any one text file,
>> you can proceed to write a "totals query" grouped by all your key
>> values giving the MAX() value of the Source column. By making an
>> INNER JOIN to your table of all the source rows, you could then
>> eliminate all duplicates, again assuming you do not have any
>> duplicates from within the same source text file.
>>
>> Does that sound like a way to do it?
>>
>> Tom Ellison
>> Microsoft Access MVP
>> Ellison Enterprises - Your One Stop IT Experts
>>
>>
>> On Mon, 27 Sep 2004 23:56:51 +1000, "GPO" <cre@inine> wrote:
>>
>>>My instructions are: Use ADO and Access 2000 - and avoid docmd.
>>>
>>>I have several thousand text files (aka extracts) of several thousand rows
>>>each that I want to insert into a table. Most of the rows in the extracts
>>>are unwanted duplicates in the sense that my definition of a primary key
>>>in
>>>the table, can appear several times over the course of several extracts.
>>>Will probably have to sift through about 6 million rows of text to arrive
>>>at
>>>a final table of about 1.4 million "unique" rows.
>>>
>>>The extracts are named in such a way that if I load them in descending
>>>order
>>>of file name, the most recent records (the ones I want) would load first.
>>>I
>>>want to set things up such that subsequent records simply don't load if
>>>they
>>>are duplicating the keys of what is already loaded. I thought I had seen
>>>this functionality somewhere in Access (?transfer text), where a paste
>>>errors table is created and the non duplicated records load successfully.
>>>Instead I get the "The changes you requested to the table were not
>>>successful because they would create duplicate values..." message (and,
>>>well
>>>thanks but I know that).
>>>
>>>For example
>>>Extract name: "A1.txt"
>>>KeyPart1 Key Part2 Field1 Field2 ...
>>>.
>>>.
>>>.
>>>1234 456 ABC DEF ...
>>>.
>>>.
>>>.
>>>
>>>Extract name: "A2.txt"
>>>KeyPart1 Key Part2 Field1 Field2 ...
>>>.
>>>.
>>>.
>>>1234 456 ABE DEG ...
>>>.
>>>.
>>>.
>>>
>>>In the above example the row in A2.txt would be loaded first, and is the
>>>desired row.
>>>
>>>Given the volume of the data, I obviously want to be able to load it with
>>>the fewest passes through the data possible. Is there a way to load only
>>>the
>>>non duplicated records in each extract (hopefully in a single pass without
>>>nested subqueries)?
>>>
>>>Here is a hacked up aircode version of the import function called in the
>>>process of looping through all the extracts (apols for any wrapping):
>>>Function ImportExtract( _
>>> strExtractName As String, _
>>> strLocalStore As String, _
>>> strDestTable As String) As Boolean
>>>
>>>Dim cmd As ADODB.Command
>>>Dim strTextFileNameModified As String
>>>
>>> On Error GoTo ErrorHandler
>>>
>>> '1. Modify the file name by substituting the dot (.) in file name for
>>> a
>>>"#".
>>>
>>> strTextFileNameModified = Replace(strExtractName, ".", "#", 1, 1,
>>>vbTextCompare)
>>> '3. Create command object
>>> Set cmd = New ADODB.Command
>>> With cmd
>>> .ActiveConnection = CurrentProject.Connection
>>> .CommandType = adCmdText
>>> '4. Use SQL as command text.
>>> 'SQL references text file directly (needs schema.ini).
>>> .CommandText = "INSERT INTO " & strDestTable & " ( KeyPart1, Key
>>>Part2 [etc] ) " & _
>>> "SELECT KeyPart1, Key Part2 [etc] " & _
>>> "FROM " & strTextFileNameModified & _
>>> " IN " & """" & """" & " [Text; DATABASE=" & strLocalStore & ";]"
>>> '5. Run the SQL
>>> .Execute
>>>
>>> End With
>>> ImportExtract = True
>>>
>>>ExitHandler:
>>> Set cmd = Nothing
>>> Exit Function
>>>ErrorHandler:
>>> ImportExtract = False
>>> Goto ExitHandler
>>>End Function
>>>
>>>Many thanks, and sorry for the length of the post.
>>>
>>
>



Relevant Pages

  • RE: finding non-duplicated records
    ... Actually what also worked was "Find unmatched query ... "Klatuu" wrote: ... Dave Hargis, Microsoft Access MVP ... use the field name or names that would determine if there are duplicates. ...
    (microsoft.public.access.queries)
  • Re: Subtracting date and time values
    ... computation with the next Date Opened to produce MTBF (Mean Time ... I can provide you with the query to do this. ... >>Tom Ellison ... >>Microsoft Access MVP ...
    (microsoft.public.access.queries)
  • Re: Query Help.. Creating a running Total Field?????
    ... Here's an approximate query: ... ORDER BY ClientCode,, Autonumber ... >> Tom Ellison ... >> Microsoft Access MVP ...
    (microsoft.public.access.queries)
  • Re: How to count record
    ... "Tom Ellison" wrote: ... I built a database with your 2 tables, entered the data, and tested. ... The result of running this query are: ... Microsoft Access MVP (watch out! ...
    (microsoft.public.access.queries)
  • Re: line numbers
    ... then the next time you run this query, the row that was 3 before is ... you cannot use these numbers as a long-term reference ... >>Tom Ellison ... >>Microsoft Access MVP ...
    (microsoft.public.access.queries)

Loading