Re: Best way to store postfix data?



See below....
On Fri, 29 Jun 2007 10:27:02 -0700, Scoots <bssalmons@xxxxxxxxxxxx> wrote:

At first this may seem like a 'duh' question, but it's not.

I'm working on a custom scripting language, and I'm trying to minimize
both my memory usage and memory fragmentation.
****
Why do you believe this is a problem that needs to be solved?

For example, if your code splits into items like lines or modules, memory fragmentation
should not be an issue. How many bytes of code will 2.1MLOC generate? 10M? 20M? These
are "small" amounts of memory in a 2GB address space.
*****
This is important
since I'm using a preprocessor and an execution engine, so I have to
be able to store the program internally. Our cripple case is a 2.1
million line file that was generated legitimately, we have to be able
to handle it. Now, my current approach is to hold instructions, if,
for, etc in one memory map (byte array), and the condition and
assignment logic in another, as we do not know how much memory we
initially need, and growing it if they were in the same map would
be... erm, ugly.
*****
If you don't know how big something is, and it can expand, a byte array is usually the
worst possible choice of representation. By requiring contiguous bytes for massive
amounts of memory, fragmentation DOES become an issue, so the correct solution is to not
require contiguous bytes of data. Use small pieces, and fragmentation becomes a
non-problem.

I find the notion that you split the byte codes up into two disjoint arrays a bit odd, to
say the least. But if what you have is some mechanism with rule-based predicates (and
massive numbers of rules) then there is no reason to put all the rule code into a single
gigantic array; each rule would use its own space, and this would be easy to allocate and
deallocate, and would definitely reduce fragmentation, and in addition, give massively
better performance because you would not end up in the exponential-copy performance
problem.
*****

Now, I'm planning on doing postfix notation for condition and
assignment logic for a few reasons. The main point is, this is a
custom scripting language and it has it's own variables. I test
whether the variables are valid (from external sources) in the
preprocessor, and I'd like to store the reference information of these
variables if I can.
*****
This does not seem to be a serious problem. Why do you think this is generating a
problem? All you have is a symbol table, and the byte codes would have symbol table
references, which could be pointers to the symbol blocks.
*****

The simplest way I can think of storing all data is as a null
terminated char array. That would minimize my memory cost, but I
would have to reparse my variable names and I can't trust that this is
an ASCII only application. Wide char just seems like it starts taking
more memory than it is worth just to have to reparse.
*****
If the data is a string, then yes, the data would be a character array. Each value would
be one character array.

Why is the cost of wide characters going to matter in the slightest?

I think you are attacking the problem from the wrong direction. I've built a lot of
interpreters in my day, and not one of them would store the entire symbol table and its
values as a single string. One variable, one C++ object, no issues of contiguous storage
arise at all.
*****

/So, what I'd like to do is store a post-fix assignment statement with
it's own variables references into a byte array. How should I store
this information? As a array with it's own struct type? wide char?
what's your opinion?
*****
Parse a variable name. Create a symbol table entry representing it. This is a mapping
from the name to the value, e.g.,

class Value {
public:
CString name;
};

class IntegerValue : public Value {
public:
int value;
};

class BooleanVale : public Value {
public:
BOOL value;
};

etc.

class Stack {
CArray<Value, Value&> stack;
public:
void Push(Value v) { stack.Add(v); }
Value Pop() { Value v = stack[stack.GetSize() - 1];
stack.SetSize(stack.GetSize() - 1);
return v; }
};

Now there are LOTS of ways to write the implementation of a stack a lot more efficiently
than the above code, but notice that these are abstractions.

A = B + C * D;
A B C D * + =

which would be represented as

push A
push B
push C
push D
mul
add
assign

#define push 1
#define mul 2
#define add 3
#define assign 4

[1] [aaaaaaaa] [1] [bbbbbbbb] [1] [cccccccc] [1] [dddddddd] [2] [3] [4]

where [aaaaaaaa] is the address of the symbol table entry for A, bbbbbbbb for B, etc.

In fact, one representation of the code is actually to encode the addresses of the
functions instead:

[&push] [aaaaaaaa] [&push] [bbbbbbbb] [&push][ccccccccc][&mul][&add][&assign][&end]

where the basic interpreter loop is

typedef BOOL (*INTERPRETERFUNC)(INTERPRETERFUNC * & eip)

INTERPRETERFUNC myEIP = bytecodestream;
while(Execute(myEIP));

BOOL push(INTERPRETERFUNC * & eip)
{
eip++;
stack.Push(*eip);
eip++;
return TRUE;
}

BOOL mul(INTERPRETERFUNC * & eip)
{
eip++;
int op1;
int op2;
stack.Pop(op1);
stack.Pop(op2);
op2 *= op1;
stack.Push(Value(op2));
return TRUE;
}

BOOL end(INTERPRETERFNC * & eip)
{
return FALSE;
}

where the stack has functions like
class Stack {
public:
void stack.Pop(int & result);
void stack.Pop(double & result);
...etc...

void Stack::Pop(int & result)
{
if(top-of-stack-is-not-integer)
throw InvalidValueType;
result = top-of-stack.Value;
}

and so on. Trying to keep all of this as byte arrays is going to cause you all kinds of
problems best avoided. (I built my first interpreter of this type in 1966, then another in
1967, and the most recent one about two months ago, with a couple dozen in between).
joe
******


If you have any questions, feel free to ask. I haven't started coding
this, so I'd rather get it right the first time.
Joseph M. Newcomer [MVP]
email: newcomer@xxxxxxxxxxxx
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
.



Relevant Pages

  • Re: Fast string operations
    ... Looping: I thought looping over arrays in managed code was "slow" ... array handling and such. ... The problem with TrimHelper is that it always returns a new string instance. ... The customer perceives this as a memory leak. ...
    (microsoft.public.dotnet.languages.csharp)
  • Re: High Memory Consumption of Classes and Arrays
    ... Only the array itself has overhead. ... memory as a reference type. ... > least consume 40 bytes of memory. ...
    (microsoft.public.dotnet.framework.performance)
  • Re: Fast linked list
    ... > amounts of memory rather than huge chunks of it. ... random insertions into a vector/dynamic array are not as slow ... to cause a cache miss. ... some hard numbers on speed differences between lists and arrays. ...
    (microsoft.public.vc.mfc)
  • Re: Fast linked list
    ... > amounts of memory rather than huge chunks of it. ... random insertions into a vector/dynamic array are not as slow ... to cause a cache miss. ... some hard numbers on speed differences between lists and arrays. ...
    (microsoft.public.vc.language)
  • Re: not enough storage... error using GetRows
    ... > array only contained ~150,000 rows. ... It took 19 minutes for GetRows to return (db ... and the prog had consumed 200MB of memory. ... The first one allocates some 180MB, the next two only allocate about 47MB ...
    (microsoft.public.vb.database.ado)

Loading