Re: Sorting algorithms was: Factorial

From: Schmidt (sss_at_online.de)
Date: 05/27/04


Date: Thu, 27 May 2004 15:22:32 +0200


"Bob O`Bob" <filterbob@yahoogroups.com> schrieb im Newsbeitrag
news:40B56490.65FB@yahoogroups.com...

> I repeatedly need to sort tab-separated-value files of 600megabytes or so,
> sometimes up to a GB. And I expect this file size to creep up, perhaps
> approaching 2GB as soon as a year from now. Though actually the ability
to
> sort even larger files might lead to an overall improvement in the
efficiency
> of the whole system it's a part of. Line lengths vary, though they're
rarely
> under 90 bytes, or over 1200.
>
> There are usually nine columns (eight tabs/line) but I would hope for a
> generalized solution where columns are adaptable. I usually need the sort
to
> ignore the first column (i.e., everything up to the first tab), then most
of the
> time sort by the others, in order. Actually just columns 2,3,4, then 5
would be
> sufficient, since column 5 holds a unique identifier by design.
>
> Occasionally other column orders are needed for different formats of
records,
> usually smaller file sizes, but commonality of the sort routines would
more than
> make up for these smaller one-off sort runs being slightly suboptimal.
>
> Any suggestions?
First thought, to get a copy of the Src-File with sorted values, you cannot
be faster than the HD Read/Write-Performance.
Modern Disks have a Copy-Performance of ca. 8-12 MByte/sec on itself and
30-50MByte/sec from Disk to (another) Disk.
So a sorted Copy of a 1GByte-File needs at least 100 seconds with the
DstFile on the same disk, respectively 25 seconds with DstFile on another
disk.
For such big files a mergesort is IMO the best choice, wich (if the
implementation is ok) shouldn't generate more than 15% Overhead to the pure
Copy-Process.
Assuming an average of 200Bytes per record, the recordcount of an
1GByte-File would reach ca. 5 Mio records.
5 Mio * 32 Bytes Len of a Fixed Type for Sort- and Compare-Information would
give 2 MergeArrays, wich would need ca. 300 MB RAM.
To implement a mergesort without using such large amounts of memory, two
temporary files (Tmp1 and Tmp2) should be used instead of the merge-arrays.
The algo should scan the SrcFile first (record for record) and write a
smaller FixedLenght-SearchRecord into Tmp1 for every original SrcRecord.
During this first run Tmp1 should be filled in chunks with presorted 256'er
record-groups (sorted with some simpler sort-algo).
Then the merging should be done -> (256'er in Tmp1) into (512'er in Tmp2)
then (512'er in Tmp2) into (1024'er in Tmp1) and so on - until all 5Mio
Records will be sorted.
Finally the DstFile can be constructed from the information in the last
sorted mergefile (every fixedlength Merge-Record is pointing to the
appropriate record in the original SrcFile).

>Maybe even sample code?
Maybe I find time (the next days), to suit this code here...
http://groups.google.de/groups?q=vb+sss+mergesort+Code
to your specific needs.

Olaf



Relevant Pages

  • Re: Calculus XOR Probability
    ... I don''t know what you mean by "a line of some sort, ... A disk contradicts a line? ... No, but if you say "you have an elephant in your backyard", and I say ... A disk is not a sequence of points. ...
    (sci.math)
  • Re: "Depreciated" Parameters In SPFILE
    ... the SORT_AREA_SIZE did affect the number of sorts to disk, ... switched from a one pass sort to an optimal sort. ... Total IO sort cost: 4931 Total CPU sort cost: 992293034 ... total temp space used = estimated amount of temporary space needed ...
    (comp.databases.oracle.server)
  • Re: Directory sorted in chronological order
    ... SH> The ST is an optimisation technique and as such is never neccesary. ... SH> results in a stat system call and system calls are slow, ... SH> worse it may require waiting dor a disk read which is even slower ... SH> for the sort, which is *much* better. ...
    (comp.lang.perl.misc)
  • Re: Directory sorted in chronological order
    ... # time perl testtime1.pl ... > SH> worse it may require waiting dor a disk read which is even slower ... To make matters even worse stat calls need to access ... The sort comparison function is ...
    (comp.lang.perl.misc)