Whitespace

APaGeD can automatically remove whitespace from the input before it reads the next lexeme, so that you don't have to specify whitespace symbols all over your grammar. You can still specify them explicitly if your grammar does not allow whitespace to appear in arbitrary places[*].
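To make the idea concrete, here is a minimal sketch in Python of a lexer that discards whitespace before it reads each lexeme. It only illustrates the behaviour described above and is not APaGeD's implementation. The names WHITESPACE, LEXEME, next_lexeme and lexemes, as well as the toy lexeme pattern, are assumptions made up for this example.

```python
import re

# Hypothetical stand-ins for illustration; neither pattern comes from APaGeD.
WHITESPACE = re.compile(r"[\n\r\t ]+")
LEXEME = re.compile(r"[A-Za-z_]\w*|\d+|[-+*/()=;]")


def next_lexeme(text, pos):
    """Skip whitespace at pos, then return (lexeme, new_pos).

    Returns (None, pos) once the end of the input is reached.
    """
    ws = WHITESPACE.match(text, pos)
    if ws:
        pos = ws.end()  # whitespace is dropped before the lexeme is read
    if pos >= len(text):
        return None, pos
    m = LEXEME.match(text, pos)
    if m is None:
        raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
    return m.group(), m.end()


def lexemes(text):
    """Collect all lexemes of text; whitespace never reaches the caller."""
    out, pos = [], 0
    while True:
        lexeme, pos = next_lexeme(text, pos)
        if lexeme is None:
            return out
        out.append(lexeme)
```

Because the skipping happens inside next_lexeme, the consumer of the lexeme stream (the main grammar, in APaGeD's case) never has to mention whitespace at all.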

To define what whitespace is, a secondary grammar is used. Since whitespace does not need semantics[*], the specification of the whitespace grammar is separated from the main grammar by simply omitting the parameters and semantic code blocks in a non-terminal declaration. A non-terminal of the whitespace grammar therefore looks like this:

Whitespace
{
    regexp("[\\n\\r\\t ]+");
}

The above example is already enough to match simple whitespace, and in this case a single regular expression suffices. Whitespace that includes nestable comments, however, cannot be described by a regular expression alone and needs a context-free grammar. Here is an example that matches D style whitespace, including nested comments:

Whitespace
{
    Whitespace WhitespaceFlat;
    WhitespaceFlat;
}

WhitespaceFlat
{
    regexp("[\\n\\r\\t ]+");
    regexp("//[^\\n]*");
    regexp("/\\*([^\\*]|\\*>/)*\\*/");
    "/+" WhitespaceNesteds "+/";
}

WhitespaceNesteds
{
    WhitespaceNesteds WhitespaceNested;
    WhitespaceNested;
}

WhitespaceNested
{
    WhitespaceFlat;
    regexp("[^#/\\+\\*\\n\\r\\t ]+");
    "+";
    "*";
    "/";
}

Note that the third regexp in WhitespaceFlat (the one that matches /* ... */ comments) uses the special lookahead operator >. See the section on regular expressions for details.
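The following Python sketch shows, under simplifying assumptions, what the grammar above has to recognize and why nesting needs more than a regular expression. It is not generated by APaGeD and does not mirror the grammar rule for rule: it tracks the nesting of /+ ... +/ with a depth counter instead of recursive rules, and it does not treat line or block comments specially inside a nested comment. The names FLAT, skip_nested and skip_whitespace are hypothetical.

```python
import re

# Flat whitespace: blanks, // line comments, /* block comments */.
# Python's non-greedy ".*?" is used here instead of APaGeD's lookahead operator.
FLAT = re.compile(r"[\n\r\t ]+|//[^\n]*|/\*.*?\*/", re.DOTALL)


def skip_nested(text, pos):
    """text[pos:] starts with '/+'; return the offset just past the matching '+/'."""
    depth = 0
    i = pos
    while i < len(text):
        pair = text[i:i + 2]
        if pair == "/+":      # a comment opens (possibly inside another one)
            depth += 1
            i += 2
        elif pair == "+/":    # the innermost open comment closes
            depth -= 1
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    raise SyntaxError("unterminated /+ comment")


def skip_whitespace(text, pos=0):
    """Return the offset of the first character that is not whitespace or comment."""
    while pos < len(text):
        if text.startswith("/+", pos):
            pos = skip_nested(text, pos)
            continue
        m = FLAT.match(text, pos)
        if m is None:
            break
        pos = m.end()
    return pos
```

The depth counter plays the role that the recursion through WhitespaceNesteds plays in the grammar: a finite regular expression cannot remember how many /+ are still open.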

The separation of the lexemes in WhitespaceNested is necessary due to the first-longest-match behaviour of the lexical analyzer.
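One plausible reading of this remark can be sketched in Python. The sketch assumes that "first-longest-match" means the longest match wins at each position and that the earliest listed pattern wins on ties. The tokenize function and the SEPARATED and MERGED pattern lists are hypothetical and only loosely modelled on the lexemes of the grammar above.

```python
import re


def tokenize(text, patterns):
    """Split text into lexemes.

    At each offset the longest match wins; on ties the earliest pattern wins.
    """
    compiled = [re.compile(p) for p in patterns]
    out, pos = [], 0
    while pos < len(text):
        best = ""
        for rx in compiled:
            m = rx.match(text, pos)
            # Strict ">" keeps the earlier pattern when two matches tie.
            if m and len(m.group()) > len(best):
                best = m.group()
        if not best:
            raise SyntaxError(f"no lexeme matches at offset {pos}")
        out.append(best)
        pos += len(best)
    return out


# Comment text excludes "+", "*" and "/", which are separate one-character lexemes.
SEPARATED = [r"/\+", r"\+/", r"[^#/+*\n\r\t ]+", r"\+", r"\*", r"/", r"[\n\r\t ]+"]

# Comment text is allowed to contain "+", "*" and "/".
MERGED = [r"/\+", r"\+/", r"[^\n\r\t ]+", r"[\n\r\t ]+"]
```

With SEPARATED, the input a+b+/ is split into a, +, b and the closing delimiter +/, because the two-character +/ is longer than the one-character +. With MERGED, the comment-text pattern matches the whole of a+b+/ as a single lexeme, so the closing delimiter is swallowed and the comment could never be terminated.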