Re: Sorting algorithms was: Factorial

From: Ulrich Korndoerfer (ulrich_wants_nospam_at_prosource.de)
Date: 05/27/04


Date: Thu, 27 May 2004 21:00:32 +0200

Hi,

Bob O`Bob wrote:

> Can you folks make some specific suggestions for a specific real-world need?
>
> I repeatedly need to sort tab-separated-value files of 600megabytes or so,
> sometimes up to a GB. And I expect this file size to creep up, perhaps
> approaching 2GB as soon as a year from now. Though actually the ability to
> sort even larger files might lead to an overall improvement in the efficiency
> of the whole system it's a part of. Line lengths vary, though they're rarely
> under 90 bytes, or over 1200.
>
> There are usually nine columns (eight tabs/line) but I would hope for a
> generalized solution where columns are adaptable. I usually need the sort to
> ignore the first column (i.e., everything up to the first tab), then most of the
> time sort by the others, in order. Actually just columns 2,3,4, then 5 would be
> sufficient, since column 5 holds a unique identifier by design.
>
> Occasionally other column orders are needed for different formats of records,
> usually smaller file sizes, but commonality of the sort routines would more than
> make up for these smaller one-off sort runs being slightly suboptimal.
>
> Any suggestions? Maybe even sample code?

Missing which your values kind is, I assume they are character
sequences, which mostly differ in their beginning few bytes.

Sorting in memory is much faster than sorting on disk. Nowadays 256 Megs
of RAM or more are quite common. So I would build in memory an index
array from the file to sort, holding the position of the individual
records and the first few characters (the "short value") of the column
to sort, then sort this index array (it really does not matter which
kind of sort one takes, as long it is sufficiently fast).

Then the presorted result can be written back to a file using the
unsorted files record positions in the index array. During writing, one
would have to identify (simple enough to do) groups of records which
have equal short values and do a full sort on them (using now a newly
build index array for those records, which holds the complete records).
As long as your values are sufficiently different, this would work fine.

Some speed ups could be:

- Do a transformation on the short values (kind of hashing) resulting
eg. in a long value, whose values are usable for sorting. Simplest would
be to stuff the first 4 character bytes into a long value.

- The most important speed gain can be achieved on the writing back
process. If one walks up the sorted index array, one can write the
records in sequence to the output file, but must read the records at
random from the input file. The first thing would be to hold the input
file and the output file on different hard disks to minimize disk head
positioning. The second thing would be to cache the records to write
(read a reasonable amount of records in, concatenate them in memory and
write them out with one put command). The third thing would be to sort
the next chunk of records to read on their file positions and read them
then beginning from the lowest file position. This point (reading the
records) could have some better optimization, for now my elementary
suggestion should suffice.

- A further speed up could be obtained by not using the traditional VB
file io but rather using memory mapping. For this to work fast the data
should be reorganized. Tab separated data are not really fast to read
and parse. Best would be using a UDT with members of fixed size. Then
the data on disk can be mapped directly to a VB array (using some
tricks) and accessed using the VB standard array element access.

-- 
Ulrich Korndoerfer
VB tips, helpers, solutions -> http://www.proSource.de/Downloads/


Relevant Pages

  • Re: Hackish file -> string trick
    ... Unless you've got oodles of RAM, a lot of the data will just go ... straight bak to the disk via the swapfile, possibly running out of room on ... trying to load a 1 gig file into memory for random ... Coding for this sort of thing is easy, ...
    (comp.lang.pascal.delphi.misc)
  • Re: Alpha mystery! Only on Sundays??
    ... Sunday morning and initiated a heavy random access on the disk. ... simple to fix and funny in a wierd sort of way. ... programmer who had written and set it up put an entry in the ECL ... for all the memory the machine had before beginning the sort. ...
    (comp.os.vms)
  • Re: Laptop
    ... memory and on to disk? ... A sort of 'RMSave'? ... the correct parameter. ...
    (comp.sys.acorn.hardware)
  • Re: Question about the clc string lib
    ... Having the whole thing in memory (if it ... > fits) is a *lot* more efficient than trying to sort it on disk. ... anything about the relative efficiency of memory access vs. disk ...
    (comp.lang.c)
  • Re: non-recursive quicksort for 2-D arrays?
    ... I only need to sort on one column and the sort column will hold long integer numbers. ... I can see that the best approach is to make an index array first, which is a 2-D 2-column array ... It is not as fast as the QuickSort algorithm ...
    (microsoft.public.vb.general.discussion)

Loading