An APaGeD grammar file consists of rule sets that basically look like function declarations in D, except that they don't have a return type:
NonTerminal(out char[] someParameters)
{
<rule alternatives...>
}
Each such block contains all rules for a given non-terminal symbol of your grammar. Each non-terminal has a set of parameters that make up the inherited and synthetic attributes for the semantic phase. We'll see that in a while in more detail.
Each rule is a sequence of non-terminal, terminal symbols or special keywords that represent groups of terminals, just like in a BNF-like grammar description. The non-terminal symbols have to be declared somewhere as shown above. Terminal symbols are either strings in double-quotes or regular expressions.
A()
{
"qwer" "asdf";
"1234" "asdf";
}
To separate rules, the end of a rule has to be marked with a semicolon. The example above therefore specifies two alternative rules with two terminal symbols each.
The keyword epsilon specifies, that the non-terminal may also match the empty word, meaning that it doesn't have to match anything at all to succeed.
epsilon is only allowed as the only symbol in a rule. That is, it is illegal to have a rule like this:
A()
{
epsilon "asdf";
}
Furthermore, only one epsilon-Rule is allowed per non-terminal and it has to be the last rule.
Regular expressions are marked by wrapping a string in brackets and preceding it with the keyword regexp like this:
regexp("[a-zA-Z][a-zA-Z0-9_]*")
Strings in single quotes are WYSIWYG strings. In a WYSIWYG string, only the single quote itself has to be escaped with a backslash:
regexp('"[^"\n\r]*"')
To simplify access to a regular expression's match, like epsilon rules, they are only allowed as the only symbol in a rule. If you want a regular expression to be part of a larger sequence of symbols, simply wrap the regular expression in another non-terminal and use that one instead:
A()
{
NT1 B "asdf" NT2;
}
B()
{
regexp("[a-zA-Z][a-zA-Z0-9_]*");
}
Terminal classes can be defined with the [] operator, similar to character classes in regular expressions.
For example,
[ "{" "}" ]
matches one of the two terminals.
[^ "{" "}" ]
matches any of the terminals that appear in the grammar, except the two curly braces.
Like regexp, the [] operator has to be the only symbol in a rule.
Symbol sequences can certainly be distributed over multiple lines:
A()
{
NT1 B
"asdf" NT2;
}
APaGeD allows accessing the parser's lookahead in the grammar. Matching a non-terminal can be restricted to a set of first terminals. That is, the non-terminal will only be matched, if the first terminal that it expands to is one of those specified in a terminal class preceded by >:
A()
{
"asdf" More;
"qwer" More;
}
B()
{
>["asdf"] A;
}
B will only match the A's first rule.
Here is a simple example for a grammar that specifies arithmetic expression on integers that consider operator precedence (* binds stronger than + and -).
Expr()
{
MulExpr "+" Expr;
MulExpr "-" Expr;
MulExpr;
}
MulExpr()
{
Atom "*" MulExpr;
Atom;
}
Atom()
{
regexp("-?[0-9]+");
"(" Expr ")";
}
You can already compile that grammar with APaGeD. The first non-terminal in a grammar file will be treated as the start symbol for the grammar. APaGeD will create a D source file that contains a global function
bool parse(
string filename,
ref string input,
out SyntaxTree* root,
bool detailed = false,
bool recover = false
)
filename is only used in error messages and can therefore also be null.
input obviously is the input that will be parsed.
root is a pointer to an instance of the SyntaxTree structure that will recieve the root node of the resulting abstract syntax tree.
detailed toggles whether error messages will contain detailed status output or just the filename, line- and columng number and error message.
recover enables error recovery.
parse returns true if parsing was successful. That means that the grammar matched all of the input.