The order in which lexemes appear in the grammar is the order used for the first-longest-match decision. In a programming language, keyword lexemes therefore have to appear before the identifier lexeme. Otherwise, the lexical analyzer will always match the identifier and never a keyword.
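The effect of the order can be illustrated with a small Python sketch. It is not the code the generator emits; it assumes that "first-longest-match" means the longest match wins and that ties go to the lexeme listed first. The function name next_token and the lexeme names are hypothetical.

```python
import re

def next_token(lexemes, text, pos=0):
    """Return (name, matched_text, end_pos) for the best lexeme at pos, or None.

    Assumption: the longest match wins, and among equally long matches
    the lexeme that appears first in the list wins.
    """
    best = None
    for name, pattern in lexemes:
        m = re.compile(pattern).match(text, pos)
        # Strict '>' keeps the earlier lexeme when two matches are equally long.
        if m and m.end() > pos and (best is None or m.end() > best[2]):
            best = (name, m.group(0), m.end())
    return best

# Keyword listed before the identifier: "if" is recognized as a keyword.
keyword_first = [("KEYWORD_IF", r"if"), ("IDENTIFIER", r"[A-Za-z_]\w*")]
# Identifier listed first: the keyword can never win the tie.
identifier_first = list(reversed(keyword_first))

print(next_token(keyword_first, "if x"))     # ('KEYWORD_IF', 'if', 2)
print(next_token(keyword_first, "iffy"))     # ('IDENTIFIER', 'iffy', 4)
print(next_token(identifier_first, "if x"))  # ('IDENTIFIER', 'if', 2)
```

A longer identifier such as "iffy" still wins over the keyword because its match is longer; the order only matters when the matches have the same length.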
This ordering requirement may dictate the order of the non-terminals, which may not be desirable. To work around this, lexemes can optionally be specified in a lexeme block. The lexemes in a lexeme block are listed before any lexeme collected implicitly from the non-terminal declarations, and their order within the block is preserved.
A lexeme block starts with the keyword APDLexemes, followed by a list of lexemes enclosed in curly braces.
APDLexemes
{
regexp("[,\\.\\*/\\+\\-%$~\\|&~'\\\\\"0-9<\\>\\?:]")
"asdf"
}
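The resulting priority order can be pictured with the following Python sketch. It only models the rule described above (block lexemes first, implicitly collected lexemes afterwards); the function name lexeme_order is hypothetical, and dropping duplicates so that a lexeme keeps its earliest position is an assumption, not documented behavior.

```python
def lexeme_order(block_lexemes, collected_lexemes):
    """Combine lexemes into one priority list.

    Lexemes from the lexeme block come first, in block order, followed by
    the lexemes collected from the non-terminal declarations.
    Assumption: a lexeme that occurs twice keeps its earliest position.
    """
    ordered = []
    for lexeme in list(block_lexemes) + list(collected_lexemes):
        if lexeme not in ordered:
            ordered.append(lexeme)
    return ordered

# "if" is forced ahead of the identifier pattern by the lexeme block,
# regardless of where the non-terminals that use them are declared.
block = ['"if"', '"else"']
collected = ['regexp("[a-z]+")', '"if"', '";"']
print(lexeme_order(block, collected))
```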
Additionally, lexemes that do not appear anywhere else in the grammar may be specified in a lexeme block. This may be useful in combination with the [] operator.
Sometimes a context-free grammar is not enough to parse the lexical structure. One example is HEREDOC strings, which allow custom delimiters.
That is, MyDelim...MyDelim represents the string ..., and MyDelim can be chosen freely. Such a delimited string is a context-sensitive construct. Parsing it requires an attributed grammar. Using an attributed grammar for the whole lexical analysis would waste performance, though, especially since usually only a few lexemes require context-sensitive parsing. Instead, a lexeme in a lexeme block can be followed by a block of custom lexer code:
APDLexemes
{
'q"'
{
// lexer code
}
}
The custom lexer code is inserted into the parser function. Consult the generated source code and the custom lexer code in the SEATD grammar for details.
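To give an idea of what such custom lexer code has to do, here is a Python sketch of scanning a delimited string. It is not the code from the SEATD grammar. It assumes the form q"DELIM, a newline, the string body, a newline, and DELIM" as the closing marker; the function name scan_delimited_string is hypothetical.

```python
import re

# Assumption: the delimiter is an identifier directly followed by a newline.
_DELIM = re.compile(r"[A-Za-z_][A-Za-z0-9_]*\n")

def scan_delimited_string(text, pos=0):
    """Scan a delimited string that starts at pos.

    Returns (body, end_pos), or None if no complete delimited string
    starts at pos.
    """
    prefix = 'q"'
    if not text.startswith(prefix, pos):
        return None
    # The delimiter is chosen freely by the author, so it is read from the input.
    m = _DELIM.match(text, pos + len(prefix))
    if m is None:
        return None
    delim = m.group(0)[:-1]
    body_start = m.end()
    # The closing marker depends on the delimiter read above. This dependency
    # is the context-sensitive part that a fixed pattern cannot express.
    closing = "\n" + delim + '"'
    close_at = text.find(closing, body_start - 1)
    if close_at == -1:
        return None
    body = text[body_start:close_at] if close_at >= body_start else ""
    return body, close_at + len(closing)

source = 'q"EOS\nhello\nworld\nEOS" rest'
body, end = scan_delimited_string(source)
print(repr(body))         # 'hello\nworld'
print(repr(source[end:])) # ' rest'
```

Only the short prefix needs to be recognized by the regular lexer; everything after it is handled by the hand-written scan, so the rest of the lexical analysis stays as fast as before.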