2006-09-13

More OO for the Compiler

I've been looking at the tokenizer and lexer of SpeedCrunch laterly - much because there seems to be a problem with how it handles functions. After some thought I've come to the conclusion that the lexer part really could be made more object orientated.

My thought is to have each language element, such as an adding operation, contant, function, etc represented by a class. Each class has a evaluate member that takes a current environment (that is, a list of available variables and functions). They also have a common base class, lets call it LangElem, and a static factory function.

Lets try to implement a minimalistic language that can do multiplication and addition where multiplication has preceedence. Since the addition has the lowest priority, we start with it:

AddElem :: MulElem ['+' MulElem]*

This is some sort of hand made, non-standard, BNF encoding. All it says is that an addition consist of at least one multiplication element. If the token after the multiplication is a '+', then another multiplication element is added to the list. This goes on until no more '+' tokens are found.

The multiplication element looks like this:

MulElem :: ValueElem ['*' ValueElem]*

It works just like the AddElem, just that it takes value elements and '*' tokens. So, how does the value element look?

ValueElem :: constant | identfier | '(' AddElem ')'

Each value element can either be a constant token (remember, we run the tokenizer first), an identifier token or an addition element enclosed in parenthesises.

This all looks nice, but how do we implement it? Before we look at the code, remember, this is pseudo-code.

class AddElem : public LangElem
{
public:
static LangElem *factory( TokenStack stack )
{
LangElem *temp = MulElem::factory( stack );
if( stack.top() != '+' )
return temp; // When no addition takes place, skip the AddElem, and go directly for the MulElem
List<langelem*> tempList;

do
{
stack.pop(); // Take the '+' from the token stack
temp = MulElem::factory( stack );
tempList << temp;
} while (stack.top() == '+')

return new AddElem( tempList );
}

ResultType evaluate( Context context )
{
// Simply add all sub-results together and return the sum
resultType res = 0;
foreach( LangElem *e, mulElems )
res += e->evaluate( context );

return res;
}

private:
AddElem( List<langelem*> tempList ) : mulElems(tempList) {}

List<langelem*> mulElems;
};

The MulElem looks about the same, and the ValueElem tries to match identifiers, constants and parenthesises. There are two issues with this approach: first, "compilation errors" needs to be handled by exceptions and the ResultType must be able to carry an error message or exceptions will have to be employed there as well.

The benefits are many - specially with some implementation tricks. First, the context keeps a list of variable values, where the latest one is the valid one. A function can choose to accept a variable name instead of a value as a parameter.

This means that the function plot( variable, range, expression ) is possible. Variable is the value to iterate over, range is just a type that can be matched by a RangeElem (can looks like [start:step:end]) and the expression is what is to be plotted. For example plot( t, 0:0.1:2*pi, sin(t) ).

When the plot function evaluates it returns a standard answer - NoValue or something. It takes t and adds it on the top of the variable list (since the first hit is valid, this means that t is local to the plotted expression, it is removed (popped) from the list after the plot so everything is kept intact. The expression can then be evaluated any number of times with a different context each time - that is, with a different t value.

My though is to implement this new compiler, along with a value type that is picked at compile time (so you can trade performance and precision freely). Perhaps a cousin of evaluate() could be compile() that uses libjit for just-in-time compilation. This new approach makes it possible to keep many of the existing interfaces and at the same time build a SpeedCrunch script language. It also makes it easier to modify and extend the system in the future as the changes can be kept inside a single class.