The grammar notation

Grammars can be written in a textual notation:

A : "0" A "0"
A : "1" A "1"
A : B
B : "0"
B : "1"
B :

This grammar contains two nonterminals, A and B, each with three productions (the last production of B has an empty right-hand side). Terminals can be written as quoted strings, such as "0" (see also regular expressions below). The first nonterminal in a grammar is the start nonterminal. Terminals and nonterminals on the right-hand side of a production are called entities.
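
To make the meaning of the productions concrete, the following is a minimal Python sketch of a recognizer written by hand for this one grammar. It is not part of the tool, and the function names matches_A and matches_B are made up for illustration. Each branch mirrors one production.

```python
def matches_B(s: str) -> bool:
    # B : "0"  |  "1"  |  (empty right-hand side)
    return s in ("0", "1", "")


def matches_A(s: str) -> bool:
    # A : "0" A "0"   and   A : "1" A "1"
    # The first and last characters must be the same terminal,
    # and the part in between must again be derivable from A.
    if len(s) >= 2 and s[0] == s[-1] and s[0] in "01":
        if matches_A(s[1:-1]):
            return True
    # A : B
    return matches_B(s)


if __name__ == "__main__":
    for candidate in ["", "1", "010", "0110", "01"]:
        print(repr(candidate), matches_A(candidate))
```

Under this reading, A derives exactly the palindromes over "0" and "1", including the empty string.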

Multiple productions of the same nonterminal can be written in a shorter form:

A : "0" A "0"
  | "1" A "1"
  | B
B : "0"
  | "1"
  |

Terminals can also be defined by regular expressions:

NUMERAL = [0-9]+

...
N : <NUMERAL>

The regular expressions use the RegExp notation from dk.brics.automaton, except that characters can also be escaped with the \uXXXX and \n notations (representing UTF-16 code units and special symbols, as in Java). EOF is a predefined expression that matches the empty string, but only at end-of-file.
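
As a rough illustration only, the NUMERAL expression can be tried out with Python's re module. This is a stand-in: Python's regular expression syntax is not the dk.brics.automaton notation, although the two happen to agree for this simple pattern. The sketch assumes that a regular expression entity must match its whole substring, hence the use of fullmatch. The helper name is_numeral is made up.

```python
import re

# Stand-in for:  NUMERAL = [0-9]+
NUMERAL = re.compile(r"[0-9]+")


def is_numeral(s: str) -> bool:
    # fullmatch: the whole string must be in the language, not just a prefix
    return NUMERAL.fullmatch(s) is not None


# \uXXXX names a single UTF-16 code unit; Python string literals
# happen to use the same escape, so these denote the same character.
assert "\u0041" == "A"

if __name__ == "__main__":
    for candidate in ["2024", "", "12a"]:
        print(repr(candidate), is_numeral(candidate))
```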

Note that these regular expressions do not define tokens: the formalism is scannerless. However, a regular expression can be declared MAX^*, which means that it only matches maximal substrings:

TEXT = [a-z]* (MAX)
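
The effect of MAX can be sketched in Python under one possible reading of "maximal", namely that a match starting at a given position may not be extended to the right by further input characters. That reading is an assumption made here, not something the notation description states. Python's re module again stands in for the real regular expression engine, and the function names are made up.

```python
import re

# Stand-in for:  TEXT = [a-z]* (MAX)
TEXT = re.compile(r"[a-z]*")


def match_ends(pattern, text, start):
    # Every end position such that text[start:end] is in the language.
    # Without MAX, each of these would be a possible match.
    return [end for end in range(start, len(text) + 1)
            if pattern.fullmatch(text, start, end)]


def maximal_end(pattern, text, start):
    # With MAX (as read here), only the longest of those matches remains.
    ends = match_ends(pattern, text, start)
    return ends[-1] if ends else None


if __name__ == "__main__":
    print(match_ends(TEXT, "abc1", 0))   # [0, 1, 2, 3]
    print(maximal_end(TEXT, "abc1", 0))  # 3
```

In a scannerless formalism this matters because, without MAX, the input "abc" could be split into several adjacent TEXT matches in more than one way.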

Comments can be written as in Java:

// this is a one line comment

/*
  this is a multi line comment
*/

Productions can be labeled:

A[zeros] : "0" A "0"
 [ones]  | "1" A "1"
 [done]  | B
B[zero]    : "0"
 [one]     | "1"
 [epsilon] |

These labels are used in syntax trees and in ambiguity analysis reports. If omitted, the productions are automatically labeled #1, #2, etc. for each nonterminal. By default, the ambiguity analyzer skips vertical ambiguity checks for pairs of productions that lack explicit labels, unless the grammar contains no labeled productions at all.
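
The automatic labeling can be pictured with a small Python sketch for a grammar in which no production is labeled. The function name auto_label and the list-of-pairs representation are made up for illustration. How the numbering behaves when only some productions of a nonterminal carry explicit labels is not covered here.

```python
from collections import defaultdict


def auto_label(nonterminals):
    # nonterminals: the left-hand side of each production, in grammar order.
    # Numbering restarts at #1 for each nonterminal.
    counters = defaultdict(int)
    labeled = []
    for name in nonterminals:
        counters[name] += 1
        labeled.append((name, f"#{counters[name]}"))
    return labeled


if __name__ == "__main__":
    # The first example grammar: three productions each for A and B.
    for name, label in auto_label(["A", "A", "A", "B", "B", "B"]):
        print(f"{name}[{label}]")
```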

Nonterminal entities and regular expression entities can similarly be labeled:

A : "0" A[a] "0"
  | "1" A[a] "1"
  | B[b]
B : "0"
  | "1"
  |

Entities that are not labeled are called ignorable and are omitted from the syntax trees. (String entities are always ignorable.) However, dummy labels are assumed for all entities if the grammar contains no entity labels at all.

As an experimental feature, two entities within the same production are equality^* entities if their labels are the same:

X : Y[q] Y[q]

The parse trees of such equality entities must unparse to identical strings.
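
Since two identical strings have the same length, a string derived from X must consist of two equal halves, each derivable from Y. A minimal Python sketch of that check follows. The definition of Y as [a-z]+ is purely hypothetical (the example above does not define Y), and the function name matches_X is made up.

```python
import re

# Hypothetical definition of Y, chosen only for this illustration.
Y = re.compile(r"[a-z]+")


def matches_X(s: str) -> bool:
    # X : Y[q] Y[q]
    half, remainder = divmod(len(s), 2)
    if remainder:
        return False  # two identical strings always give an even total length
    left, right = s[:half], s[half:]
    # Both entities must yield the same string, and that string must be in L(Y).
    return left == right and Y.fullmatch(left) is not None


if __name__ == "__main__":
    for candidate in ["abab", "abba", "aba", ""]:
        print(repr(candidate), matches_X(candidate))
```

Note that "abba" is rejected even though both "ab" and "ba" are in the language of Y, because the two halves differ.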

Productions can be prioritized using the > marker:

A : A1
  | A2
 >| A3
  | A4

In this case, the first two productions have higher priority than the latter two.

Productions can be declared unordered^* using the & marker:

A :& B C D

which means the same as

A : B C D | B D C | C B D | C D B | D B C | D C B
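
The expansion is simply the list of all permutations of the entities on the right-hand side. The following Python sketch reproduces it with itertools.permutations; the function name expand_unordered is made up, and the sketch says nothing about how the tool implements the & marker internally.

```python
from itertools import permutations


def expand_unordered(nonterminal, entities):
    # One alternative per ordering of the entities.
    alternatives = [" ".join(order) for order in permutations(entities)]
    return f"{nonterminal} : " + " | ".join(alternatives)


if __name__ == "__main__":
    print(expand_unordered("A", ["B", "C", "D"]))
    # A : B C D | B D C | C B D | C D B | D B C | D C B
```

A production with n entities expands to n! alternatives, so the marker saves considerable writing even for small n.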

Ignorable nonterminal entities and regular expression entities can have example strings, which are used in unparsing:

IF = [iI][fF]

stm : <IF>["if"] exp stm

The example string must be in the language of the entity. If an example string is not provided for such an entity, the unparser picks a representative string from the language of the entity.
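
The requirement that an example string belongs to the language of its entity can be checked mechanically. A minimal Python sketch for the IF expression follows, with Python's re module standing in for the dk.brics.automaton notation and a made-up helper name is_valid_example.

```python
import re

# Stand-in for:  IF = [iI][fF]
IF = re.compile(r"[iI][fF]")


def is_valid_example(pattern, example: str) -> bool:
    # The example must be matched in full by the entity's expression.
    return pattern.fullmatch(example) is not None


if __name__ == "__main__":
    for example in ["if", "IF", "iff", "while"]:
        print(repr(example), is_valid_example(IF, example))
```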

See also the "grammar for grammars" and the example grammars.

* The ambiguity analyzer currently does not support unordered productions, equality entities, or MAX regexps.
