Re: Sorting algorithms was: Factorial
From: Larry Serflaten (serflaten_at_usinternet.com)
Date: 05/27/04
- Next message: Larry Serflaten: "Re: simple question"
- Previous message: Andy: "Re: simple question"
- In reply to: Bob O`Bob: "Sorting algorithms was: Factorial"
- Next in thread: Ulrich Korndoerfer: "Re: Sorting algorithms was: Factorial"
- Messages sorted by: [ date ] [ thread ]
Date: Thu, 27 May 2004 11:05:07 -0500
"Bob O`Bob" <filterbob@yahoogroups.com> wrote
> Can you folks make some specific suggestions for a specific real-world need?
>
> I repeatedly need to sort tab-separated-value files of 600megabytes or so,
> sometimes up to a GB. And I expect this file size to creep up, perhaps
> approaching 2GB as soon as a year from now. Though actually the ability to
> sort even larger files might lead to an overall improvement in the efficiency
> of the whole system it's a part of. Line lengths vary, though they're rarely
> under 90 bytes, or over 1200.
What advantage is there to keeping such large amounts of data, in one
file? The data is on the disk regaurdless of whether its in one file, or in
several. I mean if you already know you expect the data to grow to such
proportions, why plan to use only one file?
It just seems strange that the idea of working on one file at a time has propogated
from the beginnings of computing, until this time. Its about time we start making
the applications smarter, especially when they deal with such large amounts of data!
I also think it would be best to keep the data sorted as you add to it. If you
broke up the data into like 32 smaller files you could the put first line of each
file in an index file that would be used when adding later data. When new
data comes in, look in the index to find out which file it would fall into and
put the data there. To speed up the process, you don't need to sort for every
addition, simply append to the file and mark that file as being out of sorted
order so you know you need to sort just that section when an ordered list is
needed. Then when several are marked out of order, or they grow disproportionate
in size, you could schedule a re-sort that would break them up into sorted files
of more equal size, again rebuilding the index file from the first lines of all the
other files.
I know that does nothing for many different (sorted) views on the data, but if you
have common views that cover 80-90% of the need, why not use secondary
indexing and manage them right along with the data? I guess what it boils down
to, is to build a component that manages the files, that you can call on to add,
edit, delete, and view the different records in some sorted order.
I wonder if anyone else has picture of a database engine in mind? ;-)
LFS
- Next message: Larry Serflaten: "Re: simple question"
- Previous message: Andy: "Re: simple question"
- In reply to: Bob O`Bob: "Sorting algorithms was: Factorial"
- Next in thread: Ulrich Korndoerfer: "Re: Sorting algorithms was: Factorial"
- Messages sorted by: [ date ] [ thread ]
Relevant Pages
|