APaGeD grammars

An APaGeD grammar file consists of rule sets that basically look like function declarations in D, except that they don't have a return type:

NonTerminal(out char[] someParameters)
{
    <rule alternatives...>
}

Each such block contains all rules for a given non-terminal symbol of your grammar. Each non-terminal has a set of parameters that make up the inherited and synthetic attributes for the semantic phase. We'll see that in a while in more detail.

Each rule is a sequence of non-terminal, terminal symbols or special keywords that represent groups of terminals, just like in a BNF-like grammar description. The non-terminal symbols have to be declared somewhere as shown above. Terminal symbols are either strings in double-quotes or regular expressions.

A()
{
    "qwer" "asdf";
    "1234" "asdf";
}

To separate rules, the end of a rule has to be marked with a semicolon. The example above therefore specifies two alternative rules with two terminal symbols each.

The keyword epsilon specifies, that the non-terminal may also match the empty word, meaning that it doesn't have to match anything at all to succeed. epsilon is only allowed as the only symbol in a rule. That is, it is illegal to have a rule like this:

A()
{
    epsilon "asdf";
}

Furthermore, only one epsilon-Rule is allowed per non-terminal and it has to be the last rule.

Regular expressions are marked by wrapping a string in brackets and preceding it with the keyword regexp like this:

regexp("[a-zA-Z][a-zA-Z0-9_]*")

Strings in single quotes are WYSIWYG strings. In a WYSIWYG string, only the single quote itself has to be escaped with a backslash:

regexp('"[^"\n\r]*"')

To simplify access to a regular expression's match, like epsilon rules, they are only allowed as the only symbol in a rule. If you want a regular expression to be part of a larger sequence of symbols, simply wrap the regular expression in another non-terminal and use that one instead:

A()
{
    NT1 B "asdf" NT2;
}

B()
{
    regexp("[a-zA-Z][a-zA-Z0-9_]*");
}

Terminal classes can be defined with the [] operator, similar to character classes in regular expressions. For example,

[ "{" "}" ]
matches one of the two terminals.
[^ "{" "}" ]
matches any of the terminals that appear in the grammar, except the two curly braces. Like regexp, the [] operator has to be the only symbol in a rule.

Symbol sequences can certainly be distributed over multiple lines:

A()
{
    NT1 B
    "asdf" NT2;
}

APaGeD allows accessing the parser's lookahead in the grammar. Matching a non-terminal can be restricted to a set of first terminals. That is, the non-terminal will only be matched, if the first terminal that it expands to is one of those specified in a terminal class preceded by >:

A()
{
    "asdf" More;
    "qwer" More;
}

B()
{
    >["asdf"] A;
}

B will only match the A's first rule.

Here is a simple example for a grammar that specifies arithmetic expression on integers that consider operator precedence (* binds stronger than + and -).

Expr()
{
    MulExpr "+" Expr;
    MulExpr "-" Expr;
    MulExpr;
}

MulExpr()
{
    Atom "*" MulExpr;
    Atom;
}

Atom()
{
    regexp("-?[0-9]+");
    "(" Expr ")";
}

You can already compile that grammar with APaGeD. The first non-terminal in a grammar file will be treated as the start symbol for the grammar. APaGeD will create a D source file that contains a global function

bool parse(
    string          filename,
    ref string      input,
    out SyntaxTree* root,
    bool            detailed = false,
    bool            recover  = false
)

parse returns true if parsing was successful. That means that the grammar matched all of the input.