Re: find the largest 1000 values



Thanks Tom,


Good point.


regards,
George

"Tom Widmer [VC++ MVP]" wrote:

George wrote:
Hi Tom,


In your description below, there is one point which is missing which matters
to performance.

When you met with an element which is larger than the smallest element in
the 1000 element set, you need to replace the smallest one with the new
element, right? But then, in order to proceed with the loop and logics, you
need to find the new smallest element (previous 2nd smallest element), to
find this, you need to iterate the 1000 set again which takes another O(n)
time, right?

Firstly, I suggested keeping the largest 1000 elements in a set or heap,
which both have logarithmic push/pop time. Secondly, any deterministic
algorithm you apply n times on a fixed number of elements (such as 1000)
is by definition O(n), since runtime only increases linearly in n. If
you change the algorithm to talk about the largest m elements, rather
than the largest 1000 elements, then we could say it is:

O(n log m) worst case (produced when the n datapoints are in sorted
ascending order)

o(n) best case (it will be very close to this if n is much bigger than m
and if the 1000 highest numbers are fairly uniformly distributed
throughout this large sample, which sounds like your situation)

Tom

Please feel free to comment or correct me.


regards,
George

"Tom Widmer [VC++ MVP]" wrote:

George wrote:
Hello everyone,


I have a large data set of integers (about several Million), all of them are
positive integers from experiment. They are stored on a file on local
machine. I want to select the largest 1000 values then make analysis based on
the 1000 values.

Could anyone suggest some efficient solutions? I think sorting the several M
data in memory is not feasible.
A linear time algorithm is trivial. Just iterate over the data,
comparing to the smallest of your existing list of the 1000 biggest
items, discarding the new element or discarding the smallest old one in
each case.

Keep track of the smallest of your 1000 items either with a std::heap or
std::set - since it has fixed size, all operations on it are constant time.

Tom


.



Relevant Pages

  • Re: find the largest 1000 values
    ... Secondly, any deterministic algorithm you apply n times on a fixed number of elements is by definition O, since runtime only increases linearly in n. ... I have a large data set of integers, all of them are positive integers from experiment. ... Just iterate over the data, comparing to the smallest of your existing list of the 1000 biggest items, discarding the new element or discarding the smallest old one in each case. ...
    (microsoft.public.vc.language)
  • Re: barrel shifter
    ... does anybody know where to obtain the algorithm of how to make a synthesis of a constant barrel shifter, or knows a paper that deals with this issue? ... Regards, ... Why bother getting the algorithm? ... think the optimization are drawn by the low level optimizations of the ...
    (comp.lang.verilog)
  • Re: security 1.1 service pack security update error
    ... > advice/procedure did work well for me (except, discarding the ... > Best regards, ... > Pete V. ...
    (microsoft.public.windowsupdate)
  • Complexity of Sparse Cholesky Factorization
    ... Is it justified to claim that a sparse cholesky factorization of such a ... If yes with what specific algorithm? ... Can you give a pointer to a particular paper? ... Best regards, ...
    (comp.theory)
  • Complexity of Sparse Cholesky Factorization
    ... Is it justified to claim that a sparse cholesky factorization of such a ... If yes with what specific algorithm? ... Can you give a pointer to a particular paper? ... Best regards, ...
    (sci.math.research)

Quantcast