Re: Sorting algorithms was: Factorial
From: Ulrich Korndoerfer (ulrich_wants_nospam_at_prosource.de)
Date: 05/27/04
- Next message: Unforgiven: "Re: Icons increase EXE size...NO!!!"
- Previous message: William Ryan eMVP: "Re: Need help in finding some means of being able to program with speech recognition"
- In reply to: Bob O`Bob: "Sorting algorithms was: Factorial"
- Next in thread: Russ Holsclaw: "Re: Factorial"
- Messages sorted by: [ date ] [ thread ]
Date: Thu, 27 May 2004 21:00:32 +0200
Hi,
Bob O`Bob wrote:
> Can you folks make some specific suggestions for a specific real-world need?
>
> I repeatedly need to sort tab-separated-value files of 600megabytes or so,
> sometimes up to a GB. And I expect this file size to creep up, perhaps
> approaching 2GB as soon as a year from now. Though actually the ability to
> sort even larger files might lead to an overall improvement in the efficiency
> of the whole system it's a part of. Line lengths vary, though they're rarely
> under 90 bytes, or over 1200.
>
> There are usually nine columns (eight tabs/line) but I would hope for a
> generalized solution where columns are adaptable. I usually need the sort to
> ignore the first column (i.e., everything up to the first tab), then most of the
> time sort by the others, in order. Actually just columns 2,3,4, then 5 would be
> sufficient, since column 5 holds a unique identifier by design.
>
> Occasionally other column orders are needed for different formats of records,
> usually smaller file sizes, but commonality of the sort routines would more than
> make up for these smaller one-off sort runs being slightly suboptimal.
>
> Any suggestions? Maybe even sample code?
Missing which your values kind is, I assume they are character
sequences, which mostly differ in their beginning few bytes.
Sorting in memory is much faster than sorting on disk. Nowadays 256 Megs
of RAM or more are quite common. So I would build in memory an index
array from the file to sort, holding the position of the individual
records and the first few characters (the "short value") of the column
to sort, then sort this index array (it really does not matter which
kind of sort one takes, as long it is sufficiently fast).
Then the presorted result can be written back to a file using the
unsorted files record positions in the index array. During writing, one
would have to identify (simple enough to do) groups of records which
have equal short values and do a full sort on them (using now a newly
build index array for those records, which holds the complete records).
As long as your values are sufficiently different, this would work fine.
Some speed ups could be:
- Do a transformation on the short values (kind of hashing) resulting
eg. in a long value, whose values are usable for sorting. Simplest would
be to stuff the first 4 character bytes into a long value.
- The most important speed gain can be achieved on the writing back
process. If one walks up the sorted index array, one can write the
records in sequence to the output file, but must read the records at
random from the input file. The first thing would be to hold the input
file and the output file on different hard disks to minimize disk head
positioning. The second thing would be to cache the records to write
(read a reasonable amount of records in, concatenate them in memory and
write them out with one put command). The third thing would be to sort
the next chunk of records to read on their file positions and read them
then beginning from the lowest file position. This point (reading the
records) could have some better optimization, for now my elementary
suggestion should suffice.
- A further speed up could be obtained by not using the traditional VB
file io but rather using memory mapping. For this to work fast the data
should be reorganized. Tab separated data are not really fast to read
and parse. Best would be using a UDT with members of fixed size. Then
the data on disk can be mapped directly to a VB array (using some
tricks) and accessed using the VB standard array element access.
-- Ulrich Korndoerfer VB tips, helpers, solutions -> http://www.proSource.de/Downloads/
- Next message: Unforgiven: "Re: Icons increase EXE size...NO!!!"
- Previous message: William Ryan eMVP: "Re: Need help in finding some means of being able to program with speech recognition"
- In reply to: Bob O`Bob: "Sorting algorithms was: Factorial"
- Next in thread: Russ Holsclaw: "Re: Factorial"
- Messages sorted by: [ date ] [ thread ]
Relevant Pages
|