Library
Module
Module type
Parameter
Class
Class type
PaComb is a parsing libraries that compiles grammars with semantic actions to combintators that can be used for parsing. Languages are defined with the Pacomb.Grammar
module or preferably through our PPX extension. The library offers scanner less parsing, but the Pacomb.Lex
module provides a notion of terminals and the Pacomb.Blank
module allows to define function ignoring spaces and comments. This and other modules like Pacomb.Keywords
and Pacomb.Word_list
makes the scanner less approach very easy to use.
Pacomb is available via opam and from github
The main advantage of PaComb is to offer many features in one tool:
Pacomb.Lex.give_up
Pacomb.Grammar.layout
and the fact that we use scanner less parsing.Pacomb.Grammar.print_grammar
.Importantly, performances of PaComb are very good: it is often less than two times slower than grammars generated by ocamlyacc. However, using specific Pacomb features like dependant sequences or cache will result in much slower grammar ... that you can not in general write with ocamlyacc anyway.
Pacomb.Grammar
The main module to build grammarsPacomb.Lex
lexing and Pacomb.Lex.give_up
Pacomb.Blank
blanks and layout recordPacomb.Pos
positions in file and handling of exceptionsPacomb.Keywords
Some function to parse and reserve keywordsPacomb.Word_list
An efficient extensible dictionnary to parse wordsPacomb.Regexp
Regexp implementation above Pacomb.Lex
Pacomb.Input
build input buffer from string, channel, ...Pacomb.Charset
an efficient representation of character setsDefining languages using the Pacomb.Grammar
module directly is cumbersome. For that reason, PaComb provides a BNF-like PPX syntax extension. An example with arithmetic expressions is given below. It can be compiled with command ocamlfind ocamlopt -package pacomb,pacomb.ppx -o calc -linkpkg calc.ml
(if it is written to a file calc.ml
). This example an many others are available in the examples
folder of the source distribution of opam.
type p = Atom | Prod | Sum
let%parser rec expr p =
Atom < Prod < Sum
; (p=Atom) (x::FLOAT) => x
; (p=Atom) '(' (e::expr Sum) ')' => e
; (p=Prod) (x::expr Prod) '*' (y::expr Atom) => x *. y
; (p=Prod) (x::expr Prod) '/' (y::expr Atom) => x /. y
; (p=Sum ) (x::expr Sum ) '+' (y::expr Prod) => x +. y
; (p=Sum ) (x::expr Sum ) '-' (y::expr Prod) => x -. y
Let us explain a bit this example, it is a BNF using priorities from type p
. The first rule Atom < Prod < Sum
says that expressions at Atom
level are included in expression at Prod
level which are themselves included in expression at Sum
level.
Action are given after the infix symbol =>
. Rules are separated by semicolumn and here, they all start with a guard indicated at which priority level the rule applies. The terminals in the grammar are (x::FLOAT)
, which is a predefined grammar for floating point numbers, binding the variable x
for use in action code. and '('
, ')'
, '*'
, ... which parse a single character (string are allowed too). The non terminals are (x::expr Prod)
, (y::expr Atom)
, ... which recursively call the same grammar at the given priority level. Parentheses are mandatory as we use OCaml priority for _ :: _
.
To use the above parser, you need to define a toplevel rule, say which characteres to ignore and call the parser on a file or string. For instance the following will work and handle parse error without exiting the program:
(* blanks, i.e. characteres to be ignored *)
let blank = Blank.from_charset (Charset.singleton ' ')
let _ =
try
while true do
let f () =
Printf.printf "=> %!"; (* prompt *)
Grammar.parse_string (expr Sum) blank (input_line stdin)
in
(* [Pos] module provides a function to handle exception with
an optional argument to call for error (default is to exit with
code 1 *)
Pos.handle_exception ~error:(fun _ -> ()) f ()
done
with
End_of_file -> ()
The main function above is Grammar.parse_string
which takes the grammar to be parsed, the definition of blanks (sequence of characteres to be ignored) and what to parse (here a string
).
The BNF above works using the ppx extensions [%%parser ...]
for structures and [%parser ...]
for expression. These are also accessible by suffixing keywords with %parser
as in the above example. These ppx extensions extends ocaml expressions with a syntax for grammars of type 'a Grammar.t
and modifies the behaviour of let-bindings especially recursive ones to use declare_grammar
, set_grammar
and grammar_family
. Recall that due to the limitation of ppx, we use a sub-syntax of OCaml expressions for grammars. It is therefore not a good idea to use "=>" as an infix inside [%parser ...]
. Moreover, as we are reusing ocaml syntax, if you fall outside of the BNF below, the expression is unchanged and compiled by OCaml. This means that a syntax error in the description of the BNF results in general in a (strange) type error at compilation.
We give below the BNF grammar for the extension, together with a sketch of its semantics.
expr ::= ... Any Ocaml expression
grammar ::= rule itself
| grammar ; rule Grammar.alt
rule ::= qitems => expr A rule with its action
| expr < ... < expr priority order see below
| ERROR("m") report an error if parsing fails here
| ERROR(ls) report errors if parsing fails here
qitems ::= () Grammar.empty
| non_empty_qitems itself
| guard non_empty_qitems conditional rule
guard ::= expr bool_op expr
| not expr
| expr =| epat pattern matching
bool_op ::= = | < | > | <= | >= | <> | == | != | && | || boolean operators
non_empty_qitems ::= qitem itself
| non_empty_qitems qitems Grammar.seq
qitem ::= item itself
| (epat :: item) give a name if used in the action
| ((epat,epat) >: item) as above, but for dseq
| (lazy (epat,epat) >: item) as above, but for lazy dseq
| (epat <: item) dseq with no real deps
item ::= CHAR Grammar.term(Lex.any ())
| CHAR(c) or 'c' Grammar.term(Lex.char c)
| CHARSET(s) Grammar.term(Lex.charset s)
| STRING(s) or "s" Grammar.term(Lex.string s)
| UTF8 Grammar.term(Lex.any_utf8 ())
| UTF8(s) Grammar.term(Lex.utf8 s)
| GRAPHEME Grammar.term(Lex.any_grapheme ())
| GRAPHEME(s) Grammar.term(Lex.grapheme s)
| EOF Grammar.term(Lex.eof ())
| RE(expr) Grammar.term(Lex.regexp (Regexp.from_string expr))
| NAT Grammar.term(Lex.nat ())
| INT Grammar.term(Lex.int ())
| FLOAT Grammar.term(Lex.float ())
| STRING_LIT Grammar.term(Lex.string_lit ())
| CHAR_LIT Grammar.term(Lex.char_lit ())
| ~? expr Grammar.option expr
| ~? [expr] expr Grammar.option_default expr expr
| ~* expr Grammar.star expr
| ~* [expr] expr Grammar.star_sep expr expr
| ~+ expr Grammar.plus expr
| ~+ [expr] expr Grammar.plus_sep expr expr
| other expr itself
epat ::= lid
| __ encoding of _
| (lazy epat)
| (epat : coretype)
| epat = lid encoding of [pat as lid]
| (epat, ..., epat)
| uid(epat)
| M.epat
epat
correspond to an encoding of patterns in expressions. Beware that _
is invalid, use __
instead.
Guard using (e =| epat) allows for rule garder by pattern matching. This has also been tested with GADT (see examples/calc_ext2.ml).
Action code needs parenthesis or begin ... end
if it uses if .. then
, let expressions, pattern matching or sequences.
Anything which does not correspond to this grammar will be unchanged in the ocaml code (like the type definition in the example above). A mutually recursive definition can also mix the definition of grammars (parametric of not) with the definition of normal ocaml values. This means you could put the whole file inside %%parser ...
.
Beware that inside the scope of the extension, you can use the syntax for grammars everywhere. This allows for some nesting as in:
type p = Atom | Prod | Sum
let%parser rec
expr p = Atom < Prod < Sum
; (p=Atom) (x::FLOAT) => x
; (p=Atom) '(' (x::expr Sum) ')' => x
; (p=Prod) (x::expr Prod) => ( '*' (y::expr Atom) => x*.y
; '/' (y::expr Atom) => x/.y)
; (p=Sum ) (x::expr Sum ) => ( '+' (y::expr Prod) => x+.y
; '-' (y::expr Prod) => x-.y)
Remark: the left factorisation in this example is useless: in will be automatically performed by Pacomb
.
Here is the meaning of let bindings for grammars through the ppx extension:
Grammar.declare_grammar + Grammar.set_grammar
(if no parameter)Grammar.grammar_family + setting the grammar
if parameters are given. multiple parameters and using label are supported through curryfication by the ppx extension.For recursive grammar with exactly one parameter, a rule p_1 < p_2 < ... < p_n
will automatically add rules to include the grammar parametrized by p_i
in the grammar parametrized by p_(i+1)
. This was used by the calculator example above.
let%parser
accepts the following attribute:
[@cache]
to cache the grammar (that is call Grammar.cache
on the grammar). see Cache and merge for ambiguous grammars.[@merge f]
to apply f
if two parsetrees are possible for the same input. This corresponds to Grammar.cache ~merge:f
. see Cache and merge for ambiguous grammars.[@layout blank]
or [@layout blank ~config:expr]
to apply Grammar.layout and change the blank characters for that grammar. see Pacomb.Blank
.[@print_param f]
to specify a printing function for the parameters of a recursive grammar. The feature is used in examples/calc_prio.ml
to print the grammar when provinding the -help
option.Action are evaluated as soon as the rule is reduced. This may be a problem for grammar whose prefix are ambiguous, even if the grammar is not really ambiguous. This is notably true for right recursion (which as usual should be avoided). If nothing is done, right recursion will be quadratic in time, while linear in space and time is exptected (linear space is the reason to prefer left recursion). To delay evaluation of action, you should use lazy action. For instance, the right recursive grammar for sexpr below:
let%parser rec sexp =
(x::RE id) => Idt x
; '(' (l::sexps) ')' => Lst l
and sexps =
() => []
; (e::sexp) (l::sexps) => (e::l)
Does not work efficiently because in an expression like "a b c ..." all the lists [Idt "a"]
, [Idt "a"; Idt "b"]
, etc are constructed. It should be rewritten using lazy
:
let%parser rec sexp =
(x::RE id) => Idt x
; '(' (lazy l::sexps) ')' => Lst l
and sexps =
() => lazy []
; (e::sexp) (lazy l::sexps) => lazy (e::l)
or left recursion:
let%parser rec sexp =
(x::RE id) => Idt x
; '(' (l::sexps) ')' => Lst (List.rev l)
and sexps =
() => []
; (l::sexps) (e::sexp) => e::l
Remark: there are some important optimisation for lazy
so use the lazy
keyword as above as much as possible and do not use Lazy.from_fun
or Lazy.force
if you can avoid it.
A similar problem arises if you want to evaluate an action immediatly after a newline has been parsed. As Pacomb
is trying to read the blank characteres after the newline, the evaluation of the action is retarted.
Remark: right recursion for sequence not using any semantics is optimized and works in O(1) memory. For instance, the following to parse a list of command with dependant grammar (hence right recursion is mandatory) works:
let%parser rec cmds env =
() => ()
; (top_expr env) '\n' (cmds env) => ()
; ((env,()) >: new_rule env) '\n' (cmds env) => ()
; ((env,()) >: rem_rule env) '\n' (cmds env) => ()
It works for one main reason: we do not use any semantics for any item in the rule. The first member of the pair for dependant grammar is used soon enough and does not need a stack.
For instance, if you do not want to use input_line
as above for your calculator, you may want to write:
let%parser rec exprs = () => ()
; exprs (x::expr Sum) '\n' => print_float x
This will not work and printing will not occur just after reading the newline. A solution is to test for newline without reading it to trigger printing:
let nl _ b i _ _ = let (c,_,_) = Input.read b i in c = '\n'
let%parser rec top_expr =
(t::Grammar.test_after nl expr) => Printf.printf "%g\n=> %!" t
let%parser rec exprs = () => () ; exprs top_expr '\n' => ()
Variable binding in the left part of a rule are not available before the final action code. Therefore, it can not be used for selecting grammar rule. For instance, (p::prio) (x::g p) => x
will report p
as unbounded. To solve this, you can use dependent sequences, using (x,y)>:item
will allow x
(but not y
) to be used both in the action and the rule. The separation with dependent (the par that can be used in the rest of the rume) and non dependent part (used only in the action) is crucial for efficiency. Indeed, dependent grammar are compiled to combinators at runtime and memoised and you don't want "noise" in the memoisation. This means you should really use values in a set as small as possible in dependent parts. Here is an example of grammar using this to implement an extensible calculator allowing to add extra infix operators with a given priority (see tests/calc_ext.ml
):
let%parser rec expr prio =
((pe1,e1)>:expr prio) ((pop,b)>:op pe1 prio) ((__,e2)::expr pop)
=> (pop, b e1 e2)
; (x::FLOAT) => (0.0,x)
; '(' (e::expr max_prio) ')' => (0.0,e)
where op pe1 prio
parse binary operator with priorities between pe1
and prio
. The action returns both the expression and its priority level.
It is important to know that dependant or self extensible grammars remain efficient. The main loss comes from the fact that with dependant grammars, combinator are composed at tun time are OCaml's inliner can not optimise them.
In principle, you can write a grammar bnf
to parse your favorite syntax for BNF and use a rule (g,__)>:bnf (x::parse_my_bnf g) => x
to parse any BNF! The function parse_my_bnf
should call function in the Pacomb.Grammar
module to build the usable grammar and as dependent grammar are memoised, each bnf will be only compiled once. This really gives the power to write self extensible language easily. This is illustrated in the example tests/calc_ext2.ml
which accept the following input:
2
(4)
rule "suc" 1 : Str "S" Exp 1 => Cst 1 Op2 "+"
rule "pre" 1 : Str "P" Exp 1 => Cst 1 Op2 "-"
SPPS6
rule "pow" 2 : Exp <2 Str "^" Exp 2 => Op2 "^"
rule "mul" 3 : Exp 3 Str "*" Exp <3 => Op2 "*"
rule "div" 3 : Exp 3 Str "/" Exp <3 => Op2 "/"
rule "add" 4 : Exp 4 Str "+" Exp <4 => Op2 "+"
rule "sub" 4 : Exp 4 Str "-" Exp <4 => Op2 "-"
3*3
2+2
(2+3)*5/2-1
SPS6 * 3 + S1
remove rule "suc"
remove rule "pre"
2^2^2
2^(3^2)
Note: one can fairly well control when action are evaluated and therefore the parsing may depends upon global references that are modified by your parser itsel. However, purely functional code should be prefered, even for extensible grammars, because if for instance you have unwanted ambiguities, you will have a behavior hard to predict and a bug hard to diagnose.
Dependant sequences are also useful to prevent construction of infinite grammar. For instance in this code from examples/calc_ext2.ml
:
let%parser rec rule : type a. a ty -> (env -> a Grammar.t) Grammar.t
= fun t ->
"Exp" (prio<:FLOAT) (r::rule (Arr(Flt,t))) =>
(fun env -> (x::expr env (get_prio prio env)) (f::r env) => f x)
; "Str" (s<:STRING_LIT) (r::rule t) =>
(fun env -> (STR s) (x::r env) => x)
; "=>" (a::action t) => (fun _ -> () => a)
The construction for the grammar for rule
would loop in the "Exp"
rule if we were not using <:
. epat<:item
is equivalent to ((),epat)>:((x::item) =>
((),x))
. This represent a grammar depending from ()
, i.e. a constant grammar. The only effect if to delay the construction of the grammar right of <:
until it is used for the first time.
Pacomb can manage ambiguous grammars, but this will result in general in very poor performances and issue a warning: "Parsing ambiguity, use cache with merge"
on stderr
. You can use Pacomb.Grammar.parse_all_buffer
to remove the warning, but you should preferably use cache and merge.
Pacomb.Grammar.cache
can be apply to a grammar to produce a cached grammar. This guaranty that this grammar will at most be called once for each position in the parsed buffer. This will not solve ambiguity, but may improve performance. For instance, there are some non ambiguous grammars that Pacomb can not left factorise (left factorisation is not performed to the left of a dependent sequence). The annotation @cache
on let%parser
will automatically wrap the grammar with a call to Pacomb.Grammar.cache
.
When your grammar is ambiguous, you can give an optional merge function to Pacomb.Grammar.cache
. This function will be called to merge all results conrresponding to exactly the same portion of the input buffer with the cached grammar.
Here is a toy example (see example/catalan.ml):
let%parser [@merge (+.)] rec bin_seq =
() => 1.0
; (t1::bin_seq) 'a' (t2::bin_seq) => t1 *. t2
This grammar will parse any sequence of 'a'
of length n and return the number of binary tree with n nodes (i.e. catalan number)! It does run in polynomial time (O(N^3) in fact, which is not the best that can be done).
In general, using cache and merge, you should be able to reach O(N^3) time complexity and O(N^2) space complexity for any BNF grammar. This seems confirmed by our benchmarks.
In action code (expression right of =>
), a lid_lpos
or lid_rpos
will denote respectively the left and right position of the item named lid
. a lid_pos
will group both lid_lpos
and lid_rpos
in a record of type Pos.t *
Pos.t
. If the item is matched by a tuple and you want to use its position you must use pat = lid
syntax to give a name to the whole item. The variables _lpos
, _rpos
and _pos
corresponds to the position of the input parser by the whole rule.
Here is an example parsing sexpr with positions:
type sexp = { p: Pos.t * Pos.t; e : sexp' }
and sexp' =
| Idt of string
| Lst of sexp list
let id = "[a-zA-Z_][a-zA-Z_0-9]*[']*"
let%parser rec sexp
= ; (x::RE id) => { p = _pos; e = Idt x }
; '(' (l::sexps) ')' => { p = _pos; e = Lst (List.rev l) })
and sexps = () => []
; (l::sexps) (e::sexp) => e::l
The type Pacomb.Pos.spos
is a very light type to represent position. It contains a record of type Pacomb.Input.infos
that is the same for the whole input and the position in bytes in the input of type Pacomb.Input.byte_pos
. If you need line numbers and column number, the Pacomb.Pos
module contains the necessary function to rescan the input and compute it (with a cache to avoid always rescanning from beginning).
This is important because in general positions are needed only in error messages and computing line and column number is costly, especially for utf8.
When parsing a stream that is not a regular file, the whole buffer is kept to allow rescanning. If your stream is very large, this is not reasonnable. Two solutions: restart parsing at regular interval, or use the ~rescan:false optionnal parameters and use only position in bytes.
Another problem with rescanning: if you use Grammar.parse_channel
or Grammar.parse_fd
, you can not get line and column number after closing the channel or file descriptor. However, if you use Grammar.parse_string
or Grammar.parse_file
you can. In this case, It is even possible to marshal (with closures) the type Pos.t
. With Grammar.parse_file
position will remain correct as long as the file is unchanged.
The ERROR(m)
syntax in the ppx extension allows to add well chosen error messages in your code. Here is a way to use it on the sexp
grammar above:
let%parser rec sexp
= ERROR(["id";"("])
; (x::RE id) => { p = _pos; e = Idt x }
; '(' (l::sexps) => (ERROR(")") | ')'
=> { p = _pos; e = Lst (List.rev l) })
and sexps = () => []
; (l::sexps) (e::sexp) => e::l
Parsing "a b (c + ..."
with the above grammar will give
Parse error: Line 1, character 7.
expecting: ) id (
Banchmarks seems to show that adding error cost more or less 15% on the above example.
Pacomb must eliminate left recursion in grammars in order to use combinators that would loop otherwise. However, left recursion is not supported if it traverses A Pacomb.Grammar.layout
constructor to change blanks (probably possible to solve this, but probably not worth it).
Ambiguous grammar will be analysed in polynomial time if and only if you use a cache for all ambiguous non terminals of your grammar. Pacomb does perform some left factorisation, but it is not complete. Using cache is also a solution to that problem. Pacomb has function to print the grammar to see if left factorisation was performed.
Note: left recursion do not need and is not eliminated if the grammar uses a cache. However, this solution to use cache in general is too slow for non ambiguous grammars so we do not impose a cache to all left recursive grammars.
The ppx extension is not too bad but still suffers from the fact that it uses a sub-language of OCaml to describe grammars. For instance
let%parser g =((_,x)::g) => x]
is not legal because _
cannot be used in an Ocaml expression. Though the following works:
let%parser g = ((__,x)::g) => x
The syntax
(((x,y) = z) :: g) => (x,y,z_pos)
is not very nice as we use =
to replace the as
keyword and we also need a lot of parentheses.
A solution would be to rewrite OCaml's parser with pacomb and extend the language with real grammar extension!