Re: One more question on TSTs
From: Bonj (Bonj_at_discussions.microsoft.com)
Date: 10/07/04
- Next message: M.C: "How to reproduce"
- Previous message: Bredal Jensen: "Adding a background image to a to a button."
- In reply to: Alexander Nickolov: "Re: One more question on TSTs"
- Messages sorted by: [ date ] [ thread ]
Date: Thu, 7 Oct 2004 00:25:03 -0700
Thanks very much, to all, for your help, I'm confident I can have a bash at
it now!
"Alexander Nickolov" wrote:
> Sibling is as same as in English - your brother, sister, and in this
> case a letter of an alternative word in the set. E.g. if AB and
> AC are words in your language, B and C are siblings when
> treated as children of A. Same with intermediate letters: for
> ABC and ADE, B and D are again siblings.
>
> I guess you grasped the concept of offset instead of a pointer
> at the end of your post, so I won't go into further details (which
> is a relief considering there aren't many...)
>
> What you produce is C data, not code. With pointers it would be
> a long list of:
>
> const struct blah c_NodeN {
> "data",
> &c_NodeK, /* child (can be NULL instead) */
> &c_NodeL /* next sibling (can be NULL instead) */
> };
>
> Note this is the least efficient representation, I'm just showing the
> concept.
>
> With flat layout it would be much simpler:
>
> const BYTE data[] = {
> // the linearized tree goes here
> };
>
> --
> =====================================
> Alexander Nickolov
> Microsoft MVP [VC], MCSD
> email: agnickolov@mvps.org
> MVP VC FAQ: http://www.mvps.org/vcfaq
> =====================================
>
> "Bonj" <Bonj@discussions.microsoft.com> wrote in message
> news:6340BD35-3513-4CFE-9DD4-0012BE6484E8@microsoft.com...
> >> A hint: if you want efficient storage, you have to think as a C
> >> programmer (not C++). You can do it either with pointers
> >> to the first child and to the next sibling (NULL for child means
> >> no children, NULL for sibling means no more siblings),
> >
> > What in this specific instance are you referring to as a sibling?
> > For instance, "sp_hel" has as one of its children, "sp_help". "sp_help"
> > has
> > "sp_helpi" (leading to "sp_helpindex") as one of its children, but what
> > would
> > be a sibling? Can you give me an example, 'cos I can't picture it. Or is
> > it
> > just an 'extra' child, i.e. after the first one?
> >
> >> or you
> >> can use flat portable layout using offsets.
> >
> > Can you explain in any way how I would store and retrieve this?
> >
> >> Flat layout may allow
> >> you to optimize the size for small trees
> >
> > Without worrying about optimization - I can understand how I would
> > generate
> > this tree of structs, when each one has a pointer to another one (or more
> > than one) in order to represent children. But I seem to have a mental
> > block
> > figuring out how to have a flat layout when the structs contain pointers.
> > You
> > see, a pointer is simply a long integer, that's pointing to a different
> > memory location. But that memory location is only valid for the life of
> > the
> > current process (at most). So, the memory needed to store the tree of
> > structs
> > may not necessarily be contigious.
> > What is the usual way a C programmer would get round this, in order to
> > persist the tree of structures as a 'block'?
> >
> >> , since you won't need
> >> 32 bits for offsets
> >
> > I don't know what you mean by an offset...... I think I would probably
> > understand it if you told me how to store the tree as a 'flat' set of
> > structures, in a contigious memory space?
> >
> >> (with even greater impact on the upcoming
> >> 64-bit platforms!). Then depending on the approach, you either
> >> generate a bunch of struct instances with pointers within them,
> >> or you generate a single byte chunk (can be comprised of
> >> WORDs/DWORDs instead if you align your data with the
> >> offsets) with the entire tree. In either case you'd use some
> >> code to generate your source from the original tree format.
> >
> >
> > You mean using an algorithm to actually achieve a level of indirection
> > with
> > respect to compiling code, i.e. write an algorithm to produce C++ code?
> > This
> > is what I originally tried, but when a word was a subset of another word,
> > like "sp_help" is of "sp_helpindex", it completely ignored "sp_help".
> > Probably just the way I wrote it... but is this what you would recommend I
> > do? If so, any tips on how I might do it better?
> >
> >>
> >> Note: I'd advise the no-sibling pointer/offset optimization. E.g.
> >> you have all siblings in a continuous chunk of memory and
> >> use some tag (an invalid input char as an entry after the last
> >> sibling for example, or a bit flag in each node) to determine the
> >> end of the chunk. This is natural for flat memory representation.
> >>
> >> Another advantage of the flat representation is you can load
> >> and save the data to disk files for an alternative storage method.
> >> Then you simply map a view of the file in your process and
> >> use it as embedded data.
> >>
> >> Another optimization, only possible in flat layout, is there you
> >> can use variable size entries. If a node doesn't have children,
> >> you don't need the offset after all. If you use the optimization
> >> suggested to compress consecutive children with no siblings
> >> in a single node, the data (the substring) can be varibale length.
> >> Otherwise you'd either use a pointer for the string, or have all
> >> nodes allocate memory sufficient for the largest string.
> >>
> >> A word on offsets: you can use either absolute offsets from
> >> the beginning of the whole structure, or relative offsets from
> >> the beginning of the current node. The latter are more efficient
> >> since they may allow you greater trees with narrower offsets.
> >> Furthermore, even though it's natural to generate the tree in
> >> left to right order, you can randomize the internal representation
> >> since you use offsets to locate each chunk anyway. This may
> >> allow you even greater trees when using narrower relative
> >> offsets.
> >
> > oh... right ! I think I almost understand what you're going on about... is
> > it that you simply force all the structs to be next to each other and
> > instead
> > of storing actual pointers, you just store the distance from the current
> > one,
> > hence 'offset' ?
> >
> >>
> >> The top of the icing, useful mostly for larger trees where memory
> >> is more important than speed, is that you can use progressive
> >> Huffman encoding on each chunk to potentially minimize the
> >> size even further.
> >>
> >> Of course, the logic to traverse the tree will go into a C++
> >> class, you only need C for the data itself.
> >>
> >> As you may notice, I've done these and other embedded data
> >> structures quite a bit in my past... :)
> >>
> >> --
> >> =====================================
> >> Alexander Nickolov
> >> Microsoft MVP [VC], MCSD
> >> email: agnickolov@mvps.org
> >> MVP VC FAQ: http://www.mvps.org/vcfaq
> >> =====================================
> >>
> >> "Bonj" <Bonj@discussions.microsoft.com> wrote in message
> >> news:CD95D366-14CE-4A8F-91E1-74ECBFF42DD2@microsoft.com...
> >> > Thanks all, for the great logic and suggestions.
> >> > But can anyone shed any light on how the data storage could be
> >> > hard-coded
> >> > such that it didn't have to dynamically load the data? Or if it does,
> >> > such
> >> > that the time is minimal.
> >> > My only thoughts on how this could be done at the moment are an
> >> > algorithm-generating algorithm, which spits out c++ code, which I'm
> >> > sure
> >> > isn't the best way to do it.
> >> >
> >> > Cheers
> >> >
> >>
> >>
> >>
>
>
>
- Next message: M.C: "How to reproduce"
- Previous message: Bredal Jensen: "Adding a background image to a to a button."
- In reply to: Alexander Nickolov: "Re: One more question on TSTs"
- Messages sorted by: [ date ] [ thread ]
Relevant Pages
|