Simple PEG (Parsing expression grammar) matching. Uses no memorization, but uses superoperators and symbol inlining to improve performance. Note: Matching performance is hopefully competitive with optimized regular expression engines.
A PEG (Parsing expression grammar) is a simple deterministic grammar, that can be directly used for parsing. The current implementation has been designed as a more powerful replacement for regular expressions. UTF-8 is supported.
The notation used for a PEG is similar to that of EBNF:
notation | meaning |
---|---|
A / ... / Z |
Ordered choice: Apply expressions A, ..., Z, in this order, to the text ahead, until one of them succeeds and possibly consumes some text. Indicate success if one of expressions succeeded. Otherwise do not consume any text and indicate failure. |
A ... Z |
Sequence: Apply expressions A, ..., Z, in this order, to consume consecutive portions of the text ahead, as long as they succeed. Indicate success if all succeeded. Otherwise do not consume any text and indicate failure. The sequence's precedence is higher than that of ordered choice: A B / C means (A B) / Z and not A (B / Z) . |
(E) |
Grouping: Parenthesis can be used to change operator priority. |
{E} |
Capture: Apply expression E and store the substring that matched E into a capture that can be accessed after the matching process. |
$i |
Back reference to the i``th capture. ``i counts from 1. |
$ |
Anchor: Matches at the end of the input. No character is consumed. Same as !. . |
^ |
Anchor: Matches at the start of the input. No character is consumed. |
&E |
And predicate: Indicate success if expression E matches the text ahead; otherwise indicate failure. Do not consume any text. |
!E |
Not predicate: Indicate failure if expression E matches the text ahead; otherwise indicate success. Do not consume any text. |
E+ |
One or more: Apply expression E repeatedly to match the text ahead, as long as it succeeds. Consume the matched text (if any) and indicate success if there was at least one match. Otherwise indicate failure. |
E* |
Zero or more: Apply expression E repeatedly to match the text ahead, as long as it succeeds. Consume the matched text (if any). Always indicate success. |
E? |
Zero or one: If expression E matches the text ahead, consume it. Always indicate success. |
[s] |
Character class: If the character ahead appears in the string s, consume it and indicate success. Otherwise indicate failure. |
[a-b] |
Character range: If the character ahead is one from the range a through b, consume it and indicate success. Otherwise indicate failure. |
's' |
String: If the text ahead is the string s, consume it and indicate success. Otherwise indicate failure. |
i's' |
String match ignoring case. |
y's' |
String match ignoring style. |
v's' |
Verbatim string match: Use this to override a global \i or \y modifier. |
i$j |
String match ignoring case for back reference. |
y$j |
String match ignoring style for back reference. |
v$j |
Verbatim string match for back reference. |
. |
Any character: If there is a character ahead, consume it and indicate success. Otherwise (that is, at the end of input) indicate failure. |
_ |
Any Unicode character: If there is an UTF-8 character ahead, consume it and indicate success. Otherwise indicate failure. |
@E |
Search: Shorthand for (!E .)* E . (Search loop for the pattern E.) |
{@} E |
Captured Search: Shorthand for {(!E .)*} E . (Search loop for the pattern E.) Everything until and exluding E is captured. |
@@ E |
Same as {@} E . |
A <- E |
Rule: Bind the expression E to the nonterminal symbol A. Left recursive rules are not possible and crash the matching engine. |
\identifier |
Built-in macro for a longer expression. |
\ddd |
Character with decimal code ddd. |
\" , etc |
Literal " , etc. |
macro | meaning |
---|---|
\d |
any decimal digit: [0-9]
|
\D |
any character that is not a decimal digit: [^0-9]
|
\s |
any whitespace character: [ \9-\13]
|
\S |
any character that is not a whitespace character: [^ \9-\13]
|
\w |
any "word" character: [a-zA-Z0-9_]
|
\W |
any "non-word" character: [^a-zA-Z0-9_]
|
\a |
same as [a-zA-Z]
|
\A |
same as [^a-zA-Z]
|
\n |
any newline combination: \10 / \13\10 / \13
|
\i |
ignore case for matching; use this at the start of the PEG |
\y |
ignore style for matching; use this at the start of the PEG |
\skip pat |
skip pattern pat before trying to match other tokens; this is useful for whitespace skipping, for example: \skip(\s*) {\ident} ':' {\ident} matches key value pairs ignoring whitespace around the ':' . |
\ident |
a standard ASCII identifier: [a-zA-Z_][a-zA-Z_0-9]*
|
\letter |
any Unicode letter |
\upper |
any Unicode uppercase letter |
\lower |
any Unicode lowercase letter |
\title |
any Unicode title letter |
\white |
any Unicode whitespace character |
A backslash followed by a letter is a built-in macro, otherwise it is used for ordinary escaping:
notation | meaning |
---|---|
\\ |
a single backslash |
\* |
same as '*'
|
\t |
not a tabulator, but an (unknown) built-in |
The PEG parser implements this grammar (written in PEG syntax):
# Example grammar of PEG in PEG syntax. # Comments start with '#'. # First symbol is the start symbol. grammar <- rule* / expr identifier <- [A-Za-z][A-Za-z0-9_]* charsetchar <- "\\" . / [^\]] charset <- "[" "^"? (charsetchar ("-" charsetchar)?)+ "]" stringlit <- identifier? ("\"" ("\\" . / [^"])* "\"" / "'" ("\\" . / [^'])* "'") builtin <- "\\" identifier / [^\13\10] comment <- '#' @ \n ig <- (\s / comment)* # things to ignore rule <- identifier \s* "<-" expr ig identNoArrow <- identifier !(\s* "<-") prefixOpr <- ig '&' / ig '!' / ig '@' / ig '{@}' / ig '@@' literal <- ig identifier? '$' [0-9]+ / '$' / '^' / ig identNoArrow / ig charset / ig stringlit / ig builtin / ig '.' / ig '_' / (ig "(" expr ig ")") postfixOpr <- ig '?' / ig '*' / ig '+' primary <- prefixOpr* (literal postfixOpr*) # Concatenation has higher priority than choice: # ``a b / c`` means ``(a b) / c`` seqExpr <- primary+ expr <- seqExpr (ig "/" expr)*
Note: As a special syntactic extension if the whole PEG is only a single expression, identifiers are not interpreted as non-terminals, but are interpreted as verbatim string:
abc =~ peg"abc" # is true
So it is not necessary to write peg" 'abc' "
in the above example.
Check if s matches Nim's "while" keyword:
s =~ peg" y'while'"
Exchange (key, val)-pairs:
"key: val; key2: val2".replacef(peg"{\ident} \s* ':' \s* {\ident}", "$2: $1")
Determine the #include
'ed files of a C file:
for line in lines("myfile.c"): if line =~ peg"""s <- ws '#include' ws '"' {[^"]+} '"' ws comment <- '/*' @ '*/' / '//' .* ws <- (comment / \s+)* """: echo matches[0]
As a regular expression \[.*\]
matches the longest possible text between '['
and ']'
. As a PEG it never matches anything, because a PEG is deterministic: .*
consumes the rest of the input, so \]
never matches. As a PEG this needs to be written as: \[ ( !\] . )* \]
(or \[ @ \]
).
Note that the regular expression does not behave as intended either: in the example *
should not be greedy, so \[.*?\]
should be used instead.
There are two ways to construct a PEG in Nim code:
PegKind = enum pkEmpty, pkAny, ## any character (.) pkAnyRune, ## any Unicode character (_) pkNewLine, ## CR-LF, LF, CR pkLetter, ## Unicode letter pkLower, ## Unicode lower case letter pkUpper, ## Unicode upper case letter pkTitle, ## Unicode title character pkWhitespace, ## Unicode whitespace character pkTerminal, pkTerminalIgnoreCase, pkTerminalIgnoreStyle, pkChar, ## single character to match pkCharChoice, pkNonTerminal, pkSequence, ## a b c ... --> Internal DSL: peg(a, b, c) pkOrderedChoice, ## a / b / ... --> Internal DSL: a / b or /[a, b, c] pkGreedyRep, ## a* --> Internal DSL: *a ## a+ --> (a a*) pkGreedyRepChar, ## x* where x is a single character (superop) pkGreedyRepSet, ## [set]* (superop) pkGreedyAny, ## .* or _* (superop) pkOption, ## a? --> Internal DSL: ?a pkAndPredicate, ## &a --> Internal DSL: &a pkNotPredicate, ## !a --> Internal DSL: !a pkCapture, ## {a} --> Internal DSL: capture(a) pkBackRef, ## $i --> Internal DSL: backref(i) pkBackRefIgnoreCase, pkBackRefIgnoreStyle, pkSearch, ## @a --> Internal DSL: !*a pkCapturedSearch, ## {@} a --> Internal DSL: !*\a pkRule, ## a <- b pkList, ## a, b pkStartAnchor ## ^ --> Internal DSL: startAnchor()
NonTerminalFlag = enum ntDeclared, ntUsed
Peg {...}{.shallow.} = object case kind: PegKind of pkEmpty .. pkWhitespace: nil of pkTerminal, pkTerminalIgnoreCase, pkTerminalIgnoreStyle: term: string of pkChar, pkGreedyRepChar: ch: char of pkCharChoice, pkGreedyRepSet: charChoice: ref set[char] of pkNonTerminal: nt: NonTerminal of pkBackRef .. pkBackRefIgnoreStyle: index: range[0 .. MaxSubpatterns] else: sons: seq[Peg]
NonTerminal = ref NonTerminalObj
Captures = object matches: array[0 .. 20 - 1, tuple[first, last: int]] ml: int origStart: int
EInvalidPeg = object of ValueError
MaxSubpatterns = 20
proc kind(p: Peg): PegKind {...}{.raises: [], tags: [].}
proc term(p: Peg): string {...}{.raises: [], tags: [].}
proc ch(p: Peg): char {...}{.raises: [], tags: [].}
proc charChoice(p: Peg): ref set[char] {...}{.raises: [], tags: [].}
proc nt(p: Peg): NonTerminal {...}{.raises: [], tags: [].}
proc index(p: Peg): range[0 .. MaxSubpatterns] {...}{.raises: [], tags: [].}
proc name(nt: NonTerminal): string {...}{.raises: [], tags: [].}
proc line(nt: NonTerminal): int {...}{.raises: [], tags: [].}
proc col(nt: NonTerminal): int {...}{.raises: [], tags: [].}
proc flags(nt: NonTerminal): set[NonTerminalFlag] {...}{.raises: [], tags: [].}
proc rule(nt: NonTerminal): Peg {...}{.raises: [], tags: [].}
proc term(t: string): Peg {...}{.nosideEffect, gcsafe, extern: "npegs$1Str", raises: [], tags: [].}
proc termIgnoreCase(t: string): Peg {...}{.nosideEffect, gcsafe, extern: "npegs$1", raises: [], tags: [].}
proc termIgnoreStyle(t: string): Peg {...}{.nosideEffect, gcsafe, extern: "npegs$1", raises: [], tags: [].}
proc term(t: char): Peg {...}{.nosideEffect, gcsafe, extern: "npegs$1Char", raises: [], tags: [].}
proc charSet(s: set[char]): Peg {...}{.nosideEffect, gcsafe, extern: "npegs$1", raises: [], tags: [].}
proc `/`(a: varargs[Peg]): Peg {...}{.nosideEffect, gcsafe, extern: "npegsOrderedChoice", raises: [], tags: [].}
proc sequence(a: varargs[Peg]): Peg {...}{.nosideEffect, gcsafe, extern: "npegs$1", raises: [], tags: [].}
proc `?`(a: Peg): Peg {...}{.nosideEffect, gcsafe, extern: "npegsOptional", raises: [], tags: [].}
proc `*`(a: Peg): Peg {...}{.nosideEffect, gcsafe, extern: "npegsGreedyRep", raises: [], tags: [].}
proc `!*`(a: Peg): Peg {...}{.nosideEffect, gcsafe, extern: "npegsSearch", raises: [], tags: [].}
proc `!*\`(a: Peg): Peg {...}{.noSideEffect, gcsafe, extern: "npgegsCapturedSearch", raises: [], tags: [].}
proc `+`(a: Peg): Peg {...}{.nosideEffect, gcsafe, extern: "npegsGreedyPosRep", raises: [], tags: [].}
proc `&`(a: Peg): Peg {...}{.nosideEffect, gcsafe, extern: "npegsAndPredicate", raises: [], tags: [].}
proc `!`(a: Peg): Peg {...}{.nosideEffect, gcsafe, extern: "npegsNotPredicate", raises: [], tags: [].}
proc any(): Peg {...}{.inline, raises: [], tags: [].}
.
) proc anyRune(): Peg {...}{.inline, raises: [], tags: [].}
_
) proc newLine(): Peg {...}{.inline, raises: [], tags: [].}
\n
) proc unicodeLetter(): Peg {...}{.inline, raises: [], tags: [].}
\letter
which matches any Unicode letter. proc unicodeLower(): Peg {...}{.inline, raises: [], tags: [].}
\lower
which matches any Unicode lowercase letter. proc unicodeUpper(): Peg {...}{.inline, raises: [], tags: [].}
\upper
which matches any Unicode uppercase letter. proc unicodeTitle(): Peg {...}{.inline, raises: [], tags: [].}
\title
which matches any Unicode title letter. proc unicodeWhitespace(): Peg {...}{.inline, raises: [], tags: [].}
\white
which matches any Unicode whitespace character. proc startAnchor(): Peg {...}{.inline, raises: [], tags: [].}
^
which matches the start of the input. proc endAnchor(): Peg {...}{.inline, raises: [], tags: [].}
$
which matches the end of the input. proc capture(a: Peg): Peg {...}{.nosideEffect, gcsafe, extern: "npegsCapture", raises: [], tags: [].}
proc backref(index: range[1 .. MaxSubpatterns]): Peg {...}{.nosideEffect, gcsafe, extern: "npegs$1", raises: [], tags: [].}
proc backrefIgnoreCase(index: range[1 .. MaxSubpatterns]): Peg {...}{.nosideEffect, gcsafe, extern: "npegs$1", raises: [], tags: [].}
proc backrefIgnoreStyle(index: range[1 .. MaxSubpatterns]): Peg {...}{.nosideEffect, gcsafe, extern: "npegs$1", raises: [], tags: [].}
proc nonterminal(n: NonTerminal): Peg {...}{.nosideEffect, gcsafe, extern: "npegs$1", raises: [], tags: [].}
proc newNonTerminal(name: string; line, column: int): NonTerminal {...}{.nosideEffect, gcsafe, extern: "npegs$1", raises: [], tags: [].}
proc `$`(r: Peg): string {...}{.nosideEffect, gcsafe, extern: "npegsToString", raises: [], tags: [].}
proc bounds(c: Captures; i: range[0 .. 20 - 1]): tuple[first, last: int] {...}{.raises: [], tags: [].}
[first..last]
of the i'th capture. proc rawMatch(s: string; p: Peg; start: int; c: var Captures): int {...}{.noSideEffect, gcsafe, extern: "npegs$1", raises: [], tags: [].}
proc matchLen(s: string; pattern: Peg; matches: var openArray[string]; start = 0): int {...}{. nosideEffect, gcsafe, extern: "npegs$1Capture", raises: [], tags: [].}
match
, but it returns the length of the match, if there is no match, -1 is returned. Note that a match length of zero can happen. It's possible that a suffix of s remains that does not belong to the match. proc matchLen(s: string; pattern: Peg; start = 0): int {...}{.nosideEffect, gcsafe, extern: "npegs$1", raises: [], tags: [].}
match
, but it returns the length of the match, if there is no match, -1 is returned. Note that a match length of zero can happen. It's possible that a suffix of s remains that does not belong to the match. proc match(s: string; pattern: Peg; matches: var openArray[string]; start = 0): bool {...}{. nosideEffect, gcsafe, extern: "npegs$1Capture", raises: [], tags: [].}
true
if s[start..]
matches the pattern
and the captured substrings in the array matches
. If it does not match, nothing is written into matches
and false
is returned. proc match(s: string; pattern: Peg; start = 0): bool {...}{.nosideEffect, gcsafe, extern: "npegs$1", raises: [], tags: [].}
true
if s
matches the pattern
beginning from start
. proc find(s: string; pattern: Peg; matches: var openArray[string]; start = 0): int {...}{. nosideEffect, gcsafe, extern: "npegs$1Capture", raises: [], tags: [].}
pattern
in s
and the captured substrings in the array matches
. If it does not match, nothing is written into matches
and -1 is returned. proc findBounds(s: string; pattern: Peg; matches: var openArray[string]; start = 0): tuple[ first, last: int] {...}{.nosideEffect, gcsafe, extern: "npegs$1Capture", raises: [], tags: [].}
pattern
in s
and the captured substrings in the array matches
. If it does not match, nothing is written into matches
and (-1,0) is returned. proc find(s: string; pattern: Peg; start = 0): int {...}{.nosideEffect, gcsafe, extern: "npegs$1", raises: [], tags: [].}
pattern
in s
. If it does not match, -1 is returned. proc findAll(s: string; pattern: Peg; start = 0): seq[string] {...}{.nosideEffect, gcsafe, extern: "npegs$1", raises: [], tags: [].}
proc contains(s: string; pattern: Peg; start = 0): bool {...}{.nosideEffect, gcsafe, extern: "npegs$1", raises: [], tags: [].}
find(s, pattern, start) >= 0
proc contains(s: string; pattern: Peg; matches: var openArray[string]; start = 0): bool {...}{. nosideEffect, gcsafe, extern: "npegs$1Capture", raises: [], tags: [].}
find(s, pattern, matches, start) >= 0
proc startsWith(s: string; prefix: Peg; start = 0): bool {...}{.nosideEffect, gcsafe, extern: "npegs$1", raises: [], tags: [].}
proc endsWith(s: string; suffix: Peg; start = 0): bool {...}{.nosideEffect, gcsafe, extern: "npegs$1", raises: [], tags: [].}
proc replacef(s: string; sub: Peg; by: string): string {...}{.nosideEffect, gcsafe, extern: "npegs$1", raises: [ValueError], tags: [].}
$i
and $#
(see strutils.`%`). Examples:"var1=key; var2=key2".replacef(peg"{\ident}'='{\ident}", "$1<-$2$2")
Results in:
"var1<-keykey; val2<-key2key2"
proc replace(s: string; sub: Peg; by = ""): string {...}{.nosideEffect, gcsafe, extern: "npegs$1", raises: [], tags: [].}
proc parallelReplace(s: string; subs: varargs[tuple[pattern: Peg, repl: string]]): string {...}{. nosideEffect, gcsafe, extern: "npegs$1", raises: [ValueError], tags: [].}
proc replace(s: string; sub: Peg; cb: proc (match: int; cnt: int; caps: openArray[string]): string): string {...}{. gcsafe, extern: "npegs$1cb", raises: [], tags: [].}
proc handleMatches*(m: int, n: int, c: openArray[string]): string = result = "" if m > 0: result.add ", " result.add case n: of 2: c[0].toLower & ": '" & c[1] & "'" of 1: c[0].toLower & ": ''" else: "" let s = "Var1=key1;var2=Key2; VAR3" echo s.replace(peg"{\ident}('='{\ident})* ';'* \s*", handleMatches)
Results in:
"var1: 'key1', var2: 'Key2', var3: ''"
proc transformFile(infile, outfile: string; subs: varargs[tuple[pattern: Peg, repl: string]]) {...}{.gcsafe, extern: "npegs$1", raises: [IOError, ValueError], tags: [ReadIOEffect, WriteIOEffect].}
reads in the file infile, performs a parallel replacement (calls parallelReplace) and writes back to outfile. Raises EIO
if an error occurs. This is supposed to be used for quick scripting.
Note: this proc does not exist while using the JS backend.
proc split(s: string; sep: Peg): seq[string] {...}{.nosideEffect, gcsafe, extern: "npegs$1", raises: [], tags: [].}
proc parsePeg(pattern: string; filename = "pattern"; line = 1; col = 0): Peg {...}{. raises: [ValueError, EInvalidPeg, Exception], tags: [RootEffect].}
proc peg(pattern: string): Peg {...}{.raises: [ValueError, EInvalidPeg, Exception], tags: [RootEffect].}
peg"{\ident} \s* '=' \s* {.*}"
proc escapePeg(s: string): string {...}{.raises: [], tags: [].}
proc handleMatches(m: int; n: int; c: openArray[string]): string {...}{.raises: [], tags: [].}
y < x
. iterator items(p: Peg): Peg {...}{.inline, raises: [], tags: [].}
iterator pairs(p: Peg): (int, Peg) {...}{.inline, raises: [], tags: [].}
iterator findAll(s: string; pattern: Peg; start = 0): string {...}{.raises: [], tags: [].}
iterator split(s: string; sep: Peg): string {...}{.raises: [], tags: [].}
Splits the string s into substrings.
Substrings are separated by the PEG sep. Examples:
for word in split("00232this02939is39an22example111", peg"\d+"): writeLine(stdout, word)
Results in:
"this" "is" "an" "example"
template letters(): Peg
charset({'A'..'Z', 'a'..'z'})
template digits(): Peg
charset({'0'..'9'})
template whitespace(): Peg
charset({' ', '\9'..'\13'})
template identChars(): Peg
charset({'a'..'z', 'A'..'Z', '0'..'9', '_'})
template identStartChars(): Peg
charset({'A'..'Z', 'a'..'z', '_'})
template ident(): Peg
[a-zA-Z_][a-zA-z_0-9]*
; standard identifier template natural(): Peg
\d+
template eventParser(pegAst, handlers: untyped): (proc (s: string): int)
import strutils, pegs let pegAst = """ Expr <- Sum Sum <- Product (('+' / '-')Product)* Product <- Value (('*' / '/')Value)* Value <- [0-9]+ / '(' Expr ')' """.peg txt = "(5+3)/2-7*22" var pStack: seq[string] = @[] valStack: seq[float] = @[] opStack = "" let parseArithExpr = pegAst.eventParser: pkNonTerminal: enter: pStack.add p.nt.name leave: pStack.setLen pStack.high if length > 0: let matchStr = s.substr(start, start+length-1) case p.nt.name of "Value": try: valStack.add matchStr.parseFloat echo valStack except ValueError: discard of "Sum", "Product": try: let val = matchStr.parseFloat except ValueError: if valStack.len > 1 and opStack.len > 0: valStack[^2] = case opStack[^1] of '+': valStack[^2] + valStack[^1] of '-': valStack[^2] - valStack[^1] of '*': valStack[^2] * valStack[^1] else: valStack[^2] / valStack[^1] valStack.setLen valStack.high echo valStack opStack.setLen opStack.high echo opStack pkChar: leave: if length == 1 and "Value" != pStack[^1]: let matchChar = s[start] opStack.add matchChar echo opStack let pLen = parseArithExpr(txt)
The handlers parameter consists of code blocks for PegKinds, which define the grammar elements of interest. Each block can contain handler code to be executed when the parser enters and leaves text matching the grammar element. An enter handler can access the specific PEG AST node being matched as p, the entire parsed string as s and the position of the matched text segment in s as start. A leave handler can access p, s, start and also the length of the matched text segment as length. For an unsuccessful match, the enter and leave handlers will be executed, with length set to -1.
Symbols declared in an enter handler can be made visible in the corresponding leave handler by annotating them with an inject pragma.
template `=~`(s: string; pattern: Peg): bool
match
with an implicit declared matches
array that can be used in the scope of the =~
call:if line =~ peg"\s* {\w+} \s* '=' \s* {\w+}": # matches a key=value pair: echo("Key: ", matches[0]) echo("Value: ", matches[1]) elif line =~ peg"\s*{'#'.*}": # matches a comment # note that the implicit ``matches`` array is different from the # ``matches`` array of the first branch echo("comment: ", matches[0]) else: echo("syntax error")
© 2006–2018 Andreas Rumpf
Licensed under the MIT License.
https://nim-lang.org/docs/pegs.html