Parser¶
To run the parser we call parse
. This requires a grammar and the output
from the lexer (which is a list of tokens).
parse : (act : Grammar tok c ty) > (xs : List tok) >
Either (ParseError tok) (ty, List tok)
If successful this returns ‘Right’ with a pair of
 the parse result.
 the unparsed tokens (the remaining input).
otherwise it returns ‘Left’ with the error message.
So we need to define the grammar for our parser, this is done using the following ‘Grammar’ data structure. This is a combinator structure, similar in principle to the recogniser combinator for the lexer, which was discussed on the previous page.
As with the Recogniser the Grammar
type is dependent on a boolean ‘consumes’
value which allows us to ensure that complicated Grammar
structures will always
consume input.
 Description of a language's grammar. The `tok` parameter is the type
 of tokens, and the `consumes` flag is True if the language is guaranteed
 to be nonempty  that is, successfully parsing the language is guaranteed
 to consume some input.
public export
data Grammar : (tok : Type) > (consumes : Bool) > Type > Type where
Empty : (val : ty) > Grammar tok False ty
Terminal : String > (tok > Maybe a) > Grammar tok True a
NextIs : String > (tok > Bool) > Grammar tok False tok
EOF : Grammar tok False ()
Fail : Bool > String > Grammar tok c ty
Commit : Grammar tok False ()
MustWork : Grammar tok c a > Grammar tok c a
SeqEat : Grammar tok True a > Inf (a > Grammar tok c2 b) >
Grammar tok True b
SeqEmpty : {c1, c2 : Bool} >
Grammar tok c1 a > (a > Grammar tok c2 b) >
Grammar tok (c1  c2) b
Alt : {c1, c2 : Bool} >
Grammar tok c1 ty > Grammar tok c2 ty >
Grammar tok (c1 && c2) ty
So an example of a grammer type may look something like this:
Grammar (TokenData ExpressionToken) True Integer
.
This is a complicated type name and a given parser will need to use it a lot.
So to reduce the amount of typing we can use the following type synonym (similar
to what is done in Idris 2
: https://github.com/edwinb/Idris2/blob/master/src/Parser/Support.idr
)
public export
Rule : Type > Type
Rule ty = Grammar (TokenData ExpressionToken) True ty
Parser Example¶
On the previous page we implemented a lexer to ‘lex’ a very simple expression, on this page we will go on to add a parser for this running example.
The syntax we are aiming at is something like this (expressed in Backus–Naur form (BNF)):  <expr> ::= <integer literal>
 <expr>'+'<expr>
 <expr>''<expr>
 <expr>'*'<expr>
 '('<expr>')'
<integer literal> ::= <digit><integer literal><digit>

To start, here is a grammar to parse an integer literal (that is, a sequence of numeric characters).  export
intLiteral : Rule Integer
intLiteral
= terminal (\x => case tok x of
Number i => Just i
_ => Nothing)

In order to try this out, here is a temporary function, this calls parse which takes two parameters:
 The grammar (in this case intLiteral)
 The token list from the lexer.
test1 : String > Either (ParseError (TokenData ExpressionToken))
(Integer, List (TokenData ExpressionToken))
test1 s = parse intLiteral (fst (lex expressionTokens s))
As required, if we pass it a string which is a number literal then it will return the number in the ‘Right’ option.
*parserEx> test1 "123"
Right (123, []) : Either (ParseError (TokenData ExpressionToken))
(Integer, List (TokenData ExpressionToken))
If we pass it a string which is not a number literal then it will return an error message.
*parserEx> test1 "a"
Left (Error "End of input"
[]) : Either (ParseError (TokenData ExpressionToken))
(Integer, List (TokenData ExpressionToken))
If we pass it a number followed by something else, then it will still be successful, this is because we are not specifically checking for endoffile.
*parserEx> test1 "123a"
Right (123, []) : Either (ParseError (TokenData ExpressionToken))
(Integer, List (TokenData ExpressionToken))
*parserEx>
The intLiteral function above uses the terminal function to
construct the grammar 
 Succeeds if running the predicate on the
 next token returns Just x, returning x.
 Otherwise fails.
export
terminal : (tok > Maybe a) > Grammar tok True a
terminal = Terminal

This is defined here : https://github.com/idrislang/Idrisdev/blob/master/libs/contrib/Text/Parser/Core.idr Idris 2 uses a slightly different version which stores an error message like “Expected integer literal” which can be output if the rule fails : https://github.com/edwinb/Idris2/blob/master/src/Text/Parser/Core.idr
The ‘terminal’ function is also used to construct the other elements of the grammar that we require, for instance, opening parenthesis:  openParen : Rule Integer
openParen = terminal (\x => case tok x of
OParen => Just 0
_ => Nothing)

Integer value is not really relevant for parenthesis so 0
(zero) is used as
a default value.
As before, this can be tested with a function like this:
test2 : String > Either (ParseError (TokenData ExpressionToken))
(Integer, List (TokenData ExpressionToken))
test2 s = parse openParen (fst (lex expressionTokens s))
We can see below that it correctly parses an open parenthesis and gives an error for anything else:
*parserEx> test2 "("
Right (0, []) : Either (ParseError (TokenData ExpressionToken))
(Integer, List (TokenData ExpressionToken))
*parserEx> test2 "123"
Left (Error "Unrecognised token"
[MkToken 0
0
(Number 123)]) : Either (ParseError (TokenData ExpressionToken))
(Integer,
List (TokenData ExpressionToken))
Now we have two Grammars we can try combining them. The following test looks
for openParen
followed by intLiteral
, the two Grammars are combined using
<*>
. The map const
part uses the integer value from the first.
The following test is looking for (
followed by a number:
test3 : String > Either (ParseError (TokenData ExpressionToken))
(Integer, List (TokenData ExpressionToken))
test3 s = parse (map const openParen <*> intLiteral) (fst (lex expressionTokens s))
We can see below that (
followed by a number is successfully parsed but, as
required, other token lists are not:
*parserEx> test3 "(123"
Right (0, []) : Either (ParseError (TokenData ExpressionToken))
(Integer, List (TokenData ExpressionToken))
*parserEx> test3 "(("
Left (Error "Unrecognised token"
[MkToken 0
(case fspan (\ARG => not (intToBool (prim__eqChar ARG '\n')))
"(" of
(incol, "") => c + cast (length incol)
(incol, b) => cast (length incol))
OParen]) : Either (ParseError (TokenData ExpressionToken))
(Integer, List (TokenData ExpressionToken))
*parserEx> test3 "123"
Left (Error "Unrecognised token"
[MkToken 0
0
(Number 123)]) : Either (ParseError (TokenData ExpressionToken))
(Integer,
List (TokenData ExpressionToken))
*parserEx> test3 "123("
Left (Error "Unrecognised token"
[MkToken 0 0 (Number 123),
MkToken 0
(case fspan (\ARG => not (intToBool (prim__eqChar ARG '\n')))
"321" of
(incol, "") => c + cast (length incol)
(incol, b) => cast (length incol))
OParen]) : Either (ParseError (TokenData ExpressionToken))
(Integer, List (TokenData ExpressionToken))
The closing parenthesis is constructed in the same way.  closeParen : Rule Integer
closeParen = terminal (\x => case tok x of
CParen => Just 0
_ => Nothing)

Now we can generate a Grammar for an expression inside parenthesis like this.  paren : Rule Integer > Rule Integer
paren exp = openParen *> exp <* closeParen

The use of *>
and <*
instead of <*>
is an easy way to use the integer
value from the inner expression.
So lets try to use this to define a grammar which recognises either:
 An integer literal
 An integer literal inside parenthesis
 An integer literal inside parenthesis inside parenthesis
 … and so on.
This requires a recursively defined structure like this:  partial
expr : Rule Integer
expr = intLiteral <> (paren expr)

This is a valid grammar because every time it is called it is guaranteed to
consume a token. However, as an Idris structure, it is problematic due to
the recursion. Defining it as partial allows it to compile but at runtime
we are likely to get a crash with an (unhelpful) error message like
killed by signal 11
.
So a better method is to use do
notation as described in the following
section.
Monadic Combinator¶
In addition to <>
and <*>
there is a >>=
combinator, which is
defined like this:
 Sequence two grammars. If either consumes some input, the sequence is
 guaranteed to consume some input. If the first one consumes input, the
 second is allowed to be recursive (because it means some input has been
 consumed and therefore the input is smaller)
export %inline
(>>=) : {c1 : Bool} >
Grammar tok c1 a >
inf c1 (a > Grammar tok c2 b) >
Grammar tok (c1  c2) b
(>>=) {c1 = False} = SeqEmpty
(>>=) {c1 = True} = SeqEat
This allows us to use do notation for the previous parenthesis example: 
expr = intLiteral <> do
openParen
r < expr
closeParen
pure r

This can be more flexible than using the <*>
operator. Also it is defined
using Inf
so we can implement recursively defined grammars as above.
Implementing the Arithmetic Operators¶
Now for the operations, in this case: + ,  and * .
The syntax we require for these infix operators
would require recursive grammars like this: 
expr = expr (op "+") expr

As already explained, do
notation can allow us to use recursive
definitions but not necessarily leftrecursion like this.
In order to work up to this gradually lets start with prefix operators (sometimes known as Polish notation) then modify later for infix operators.
So prefix operators would have this sort of form:  expr = (add (op "+") expr expr)

where op
is defined like this:
 Matches if this is an operator token and string matches, that is,
 it is the required type of operator.
op : String > Rule Integer
op s = terminal (\x => case tok x of
(Operator s1) => if s==s1 then Just 0 else Nothing
_ => Nothing)
and Where:

addInt : Integer > Integer > Integer
addInt a b = a+b
export
add : Grammar tok c1 Integer >
Grammar tok c2 Integer >
Grammar tok c3 Integer >
Grammar tok ((c1  c2)  c3) Integer
add x y z = map addInt (x *> y) <*> z

The resulting integer will be the sum of the two operands.
The other operators are defined in a similar way:  subInt : Integer > Integer > Integer
subInt a b = ab
export
sub : Grammar tok c1 Integer >
Grammar tok c2 Integer >
Grammar tok c3 Integer >
Grammar tok ((c1  c2)  c3) Integer
sub x y z = map subInt (x *> y) <*> z
multInt : Integer > Integer > Integer
multInt a b = a*b
export
mult : Grammar tok c1 Integer >
Grammar tok c2 Integer >
Grammar tok c3 Integer >
Grammar tok ((c1  c2)  c3) Integer
mult x y z = map multInt (x *> y) <*> z

So the top level Grammar
can now be defined as follows. Note that this is
partial as it is a potentially infinite structure and so not total.
partial
expr : Rule Integer
expr = (add (op "+") expr expr)
<> (sub (op "") expr expr)
<> (mult (op "*") expr expr)
<> intLiteral <> (paren expr)
To make the testing easier we can use this function:  partial
test : String > Either (ParseError (TokenData ExpressionToken))
(Integer, List (TokenData ExpressionToken))
test s = parse expr (fst (lex expressionTokens s))

First test a valid (prefix) expression:  *parserEx> test "+1*6(4)"
Right (25,
[]) : Either (ParseError (TokenData ExpressionToken))
(Integer, List (TokenData ExpressionToken))

Then an invalid syntax:
*parserEx> test "))"
Left (Error "Unrecognised token"
[MkToken 0 0 CParen,
MkToken 0
(case fspan (\ARG =>
not (intToBool (prim__eqChar ARG
'\n')))
")" of
(incol, "") => c + cast (length incol)
(incol, b) => cast (length incol))
CParen]) : Either (ParseError (TokenData ExpressionToken))
(Integer,
List (TokenData ExpressionToken))
However if we try something that is invalid, but starts with a valid token,
then it will return Right
(to indicate success)
*parserEx> test "1))"
Right (1,
[MkToken 0
(case fspan (\ARG =>
not (intToBool (prim__eqChar ARG '\n')))
"1" of
(incol, "") => c + cast (length incol)
(incol, b) => cast (length incol))
CParen,
MkToken 0
(case fspan (\ARG =>
not (intToBool (prim__eqChar ARG '\n')))
")" of
(incol, "") => c + cast (length incol)
(incol, b) => cast (length incol))
CParen]) : Either (ParseError (TokenData ExpressionToken))
(Integer,
List (TokenData ExpressionToken))
Infix Notation¶
So far we have implemented a prefix notation for operators (like this:
+ expr expr
) but the aim is to implemented an infix notation (like this:
expr + expr
). To do this we must be able to deal with potentially
infinite data structures (see Codata Types here Types and Functions).
First alter the grammar to have infix operations:
addInt : Integer > Integer > Integer
addInt a b = a+b
export
add : Grammar tok c1 Integer >
Grammar tok c2 Integer >
Grammar tok c3 Integer >
Grammar tok ((c1  c2)  c3) Integer
add x y z = map addInt (x <* y) <*> z
subInt : Integer > Integer > Integer
subInt a b = ab
export
sub : Grammar tok c1 Integer >
Grammar tok c2 Integer >
Grammar tok c3 Integer >
Grammar tok ((c1  c2)  c3) Integer
sub x y z = map subInt (x <* y) <*> z
multInt : Integer > Integer > Integer
multInt a b = a*b
export
mult : Grammar tok c1 Integer >
Grammar tok c2 Integer >
Grammar tok c3 Integer >
Grammar tok ((c1  c2)  c3) Integer
mult x y z = map multInt (x <* y) <*> z
partial
expr : Rule Integer
expr = (add expr (op "+") expr)
<> (sub expr (op "") expr)
<> (mult expr (op "*") expr)
<> intLiteral <> (paren expr)
However, if this was run, the code would not terminate.
Implement Left Recursion Elimination¶
Left factoring, like this, is a general problem.
If we have a rule like this:
A > (x<*>y) <> (x<*>z)
If x<*>y
fails but x<*>z
would succeed a problem is that, x<*>y
has
already consumed x
, so now x<*>z
will fail.
so we could write code to backtrack. That is try
x<*>y
without consuming
so that, if the first token succeeds but the following tokens fail, then the
first tokens would not be consumed.
Another approach is left factoring:
Left Factoring¶
Replace the rule with two rules (that is we add a nonterminal symbol) so for example, instead of:
A > (x<*>y) <> (x<*>z)
we add an extra rule to give:
A > x<*>N
N > y <> z
That is we convert a general contextfree grammar to a LL(1) grammar. Although not every contextfree grammar can be converted to a LL(1) grammar.
This still does not solve the infinite recursion issue and there is another
problem: the precedence of the operators +
, 
and *
is not explicit.
To resolve this we can alter the example to have a BNF like this:
Where braces { ... } mean the contents can occur 1 or more times. 
'expression' ::= ('+''') 'term' {('+''') 'term'}
'term' ::= 'factor' {'*' 'factor'}
'factor' ::=
number
 '(' 'expression' ')'

expr : Rule Integer
factor : Rule Integer
factor = intLiteral <> do
openParen
r < expr
closeParen
pure r
term : Rule Integer
term = map multInt factor <*> (
(op "*")
*> factor)
<> factor
expr = map addInt term <*> (
(op "+")
*> term)
<> map subInt term <*> (
(op "")
*> term)
<> term
calc : String > Either (ParseError (TokenData ExpressionToken))
(Integer, List (TokenData ExpressionToken))
calc s = parse expr (fst (lex expressionTokens s))
lft : (ParseError (TokenData ExpressionToken)) > IO ()
lft (Error s lst) = putStrLn ("error:"++s)
rht : (Integer, List (TokenData ExpressionToken)) > IO ()
rht i = putStrLn ("right " ++ (show i))
main : IO ()
main = do
putStr "alg>"
x < getLine
either lft rht (calc x)  eliminator for Either
As the following tests show, this now handles infix operators and precedence.
For example ‘*’ is an infix operator:  *algebraREPL> :exec
alg>2*3
right (6, [])

Check that atomic number literals work:  *algebraREPL> :exec
alg>2
right (2, [])

Check that ‘*’ has a higher precedence than ‘+’.  *algebraREPL> :exec
alg>2+3*4
right (14, [])

Also ‘*’ has a higher precedence than ‘+’ when the order is reversed.  *algebraREPL> :exec
alg>3*4+2
right (14, [])

Also precedence can be overridden by parenthesis.  *algebraREPL> :exec
alg>(2+3)*4
right (20, [])

This is still not correct because it does not correctly parse multiple sums or terms
like 1+2+3
or 1*2*3
.