W3cubDocs

/Elisp

Living With a Weak Parser

The parsing technique used by SMIE does not allow tokens to behave differently in different contexts. For most programming languages, this manifests itself by precedence conflicts when converting the BNF grammar.

Sometimes, those conflicts can be worked around by expressing the grammar slightly differently. For example, for Modula-2 it might seem natural to have a BNF grammar that looks like this:

  ...
  (inst ("IF" exp "THEN" insts "ELSE" insts "END")
        ("CASE" exp "OF" cases "END")
        ...)
  (cases (cases "|" cases)
         (caselabel ":" insts)
         ("ELSE" insts))
  ...

But this will create conflicts for "ELSE": on the one hand, the IF rule implies (among many other things) that "ELSE" = "END"; but on the other hand, since "ELSE" appears within cases, which appears left of "END", we also have "ELSE" > "END". We can solve the conflict either by using:

  ...
  (inst ("IF" exp "THEN" insts "ELSE" insts "END")
        ("CASE" exp "OF" cases "END")
        ("CASE" exp "OF" cases "ELSE" insts "END")
        ...)
  (cases (cases "|" cases) (caselabel ":" insts))
  ...

  ...
  (inst ("IF" exp "THEN" else "END")
        ("CASE" exp "OF" cases "END")
        ...)
  (else (insts "ELSE" insts))
  (cases (cases "|" cases) (caselabel ":" insts) (else))
  ...

Reworking the grammar to try and solve conflicts has its downsides, tho, because SMIE assumes that the grammar reflects the logical structure of the code, so it is preferable to keep the BNF closer to the intended abstract syntax tree.

Other times, after careful consideration you may conclude that those conflicts are not serious and simply resolve them via the resolvers argument of smie-bnf->prec2. Usually this is because the grammar is simply ambiguous: the conflict does not affect the set of programs described by the grammar, but only the way those programs are parsed. This is typically the case for separators and associative infix operators, where you want to add a resolver like '((assoc "|")). Another case where this can happen is for the classic dangling else problem, where you will use '((assoc "else" "then")). It can also happen for cases where the conflict is real and cannot really be resolved, but it is unlikely to pose a problem in practice.

Finally, in many cases some conflicts will remain despite all efforts to restructure the grammar. Do not despair: while the parser cannot be made more clever, you can make the lexer as smart as you want. So, the solution is then to look at the tokens involved in the conflict and to split one of those tokens into 2 (or more) different tokens. E.g., if the grammar needs to distinguish between two incompatible uses of the token "begin", make the lexer return different tokens (say "begin-fun" and "begin-plain") depending on which kind of "begin" it finds. This pushes the work of distinguishing the different cases to the lexer, which will thus have to look at the surrounding text to find ad-hoc clues.