Parsing left-recursion by recursive descent
25 Mar 2024 (haskell, functional-programming, parsing)
I’ve recently been building a parser for a programming language I’m designing. I’ll likely be making future posts about the project as I make more progress, but suffice it to say, the language aims to bridge the divide between the functional and procedural paradigms as discussed in my post A Tale of Two Paradigms. I’m using a Haskell library called Megaparsec, which follows the parser combinator pattern to help with building recursive descent parsers.
A recurring problem
This library is great to work with, and building parsers this way is very intuitive, but I found myself repeatedly running into the same problem: trying to parse left-recursive parts of the grammar led to an infinite recursive loop. It took me a while to properly understand what was happening, but the problem can be formulated as follows.
Any time a rule contains itself as its first term, the parser will fall into infinite recursion. For example, take the following parser function:
expression :: Parser Expression
= Expression <$> expression <*> term expression
Expression
represents a type defined as part of the
language’s abstract syntax tree (AST), and term
would be
some other parser function. The expression
function first
attempts to parse an expression
, then attempts to parse a
term
, then passes the results of these two components to
the data constructor for the Expression
type. Abstractly,
this is a valid rule. The problem is that the first thing the function
does is call itself, with no base case to break it out of this loop.
The problem becomes clearer if we express it in a more procedural style:
function expression() {
expression();
term();
}
The function immediately calls itself, and as such the program will run forever (or until the call stack overflows).
This problem appears frequently when parsing expressions that can be
chained. For example, in my language there are structs (data structures
with values assigned to predefined fields), and accessing struct fields
is achieved using a dot followed by the name of the field. For example,
given a struct person
that contains a field
name
, we would access the field using the expression
person.name
. This simple case would be straightforward to
parse. First we would define a type StructAccess
to
represent this kind of expression.
data StructAccess = StructAccess Text Text
This type takes 2 arguments of type Text
. The first
represents the name of the struct being accessed, and the second
represents the field being selected. The parser for this kind of
expression would look like this:
structAccess :: Parser StructAccess
= StructAccess <$> identifier <*> (dot *> identifier) structAccess
And for the expression person.name
, it would yield an
AST that looks like something like this:
StructAccess "person" "name"
However, we ideally want to be able to chain this operation. For
example, let’s extend the definition of a person to also include an
address
field, which is itself a struct:
type Address = {
: String,
street: String,
town: String
postCode
}
type Person = {
: String,
name: Address
address }
To access a person’s postCode
the expression should be
person.address.postCode
. In this expression, we are
actually first accessing the address field on the person struct using
person.address
, and this yields another struct; then we
access the postCode field on this address struct by appending the
previous expression with .postCode
. Thus, we arrive at the
full expression person.address.postCode
.
To achieve this, we need to change the StructAccess
type
to a general Expr
type.
data Expr = StructAccess Expr Text
| Identifier Text
This new Expr
type is an enum which can represent either
a struct access operation, or an identifier (a string that references a
struct by name). The specific StructAccess
case takes
another Expr
as its first argument, meaning the left-hand
side of the dot can be either a struct referenced by name, or another
struct access operation. That is, we can chain the operation.
The AST for this expression would now look something like this:
StructAccess
StructAccess (Identifier "person") "address")
("postCode"
To facilitate this we’d need to extend the structAccess
parser like so:
identifierExpr :: Parser Expr
= Identifier <$> identifier
identifierExpr
structAccess :: Parser Expr
= StructAccess <$> (try structAccess <|> identifierExpr) <*> (dot *> identifier) structAccess
The parser now attempts to parse a structAccess
first
(in this case, the sub-expression person.address
), and only
accept an identifier
if this fails. However, this first
attempt will never fail, because it will get caught in an infinite
recursion. We could use parenthesis to solve this (the expression for
accessing the postCode
would become
(person.address).postCode
) but this isn’t ideal.
I was stumped on this problem for a while before arriving at a solution I’m quite happy with.
Flatten the grammar
As long as we are using a recursive descent parser, there is simply no way to parse a left-recursive grammar such as this. I realised that to overcome the problem, I would have to change the structure of the AST to something that eliminates the left-recursion. I defined a new type and parser like this:
data FlatStructAccess = FlatStructAccess Text [Text]
structAccess :: Parser FlatStructAccess
= do
structAccess <- identifier
structName <- some (dot *> identifier)
fields return $ FlatStructAccess structName fields
This parser first looks for an identifier, then looks for one or more
instances of field selection using the dot operator; it then passes
these results to the data constructor for FlatStructAccess
.
For the expression person.address.postCode
, the resulting
AST would look like this:
FlatStructAccess "person" ["address", "postCode"]
By eliminating left-recursion, this function is able to successfully
parse the expression. However, this wasn’t the AST I wanted to end up
with. I wanted each StructAccess
to select a single field
from a struct—chaining the operation would result in a nested AST as
outlined earlier. I would have to restructure the AST I had, represented
by FlatStructAccess
, into the one I wanted, and I had a
feeling that would be tedious to figure out. On the contrary, it turns
out that Haskell’s standard foldl
function would do it in
one line.
foldl
has the signature:
foldl :: (a -> b -> a) -> a -> [b] -> a
For its first argument, it accepts a function that takes an
a
and a b
and returns another a
.
It’s second argument is an a
which will be fed into this
function, partially applying it, and its third argument is a list of
b
, the first of which will be passed into the partially
applied function to result in a new a
. This new
a
is then fed back into the function, partially applying it
again, and this partially applied function is applied to the next
b
in the list, and so on until the full list has been
iterated over, culminating in a final result of type a
.
A simple use case for foldl
would be to calculate the
sum of a list of numbers. In this case the specific signature would
be:
foldl :: (Int -> Int -> Int) -> Int -> [Int] -> Int
And it would be applied like foldl (+) 0 [1,2,3]
. The
result of this expression would be 6
.
For our use case, Expr
will be substituted for
a
and Text
will be substituted for
b
.
foldl :: (Expr -> Text -> Expr) -> Expr -> [Text] -> Expr
Remember that data constructors are just functions, and the
StructAccess
constructor takes 2 arguments of type
Expr
and Text
respectively.
StructAccess
is a data constructor for the
Expr
type, so that means the type signature for
StructAccess
is Expr -> Text -> Expr
—the
same as the first argument expected by foldl
.
For the expression person.address.postCode
,
foldl
will take an initial Expr
(namely
Identifier "person"
) and use it to partially apply the
StructAccess
constructor. It will then take the first item
in the list of fields
("address"
) and pass it
to the partially applied function, resulting in
StructAccess (Identifier "person") "address"
. This
Expr
will then be used to partially apply
StructAccess
again, and then the final field
("postCode"
) will be passed to the partially applied
function, resulting in the final result:
StructAccess (StructAccess (Identifier "person") "address") "postCode"
.
The final version of the structAccess
parser looks like
this:
structAccess :: Parser Expr
= do
structAccess <- identifier
structName <- some (dot *> identifier)
fields return $ foldl StructAccess (Identifier structName) fields
By bootstrapping the StructAccess
constructor with our
structName
and then folding it across our
fields
, we can easily transform our flat AST into the
recursive one we want.
Megaparsec actually used to provide a function called
chainl1
(documented here),
which is an abstraction for this approach. This has since been
deprecated and replaced with a helper module called Text.Megaparsec.Expr.
I found the approach of this helper module more difficult to wrap my
head around, but I may try to rework my solution to use this abstraction
in the future.
Intermediate representations
My central takeaway from this experience is that the AST your parser targets needn’t be the one you keep. In cases like this, it makes sense to parse to an intermediate representation before reworking the result into the structure you actually want. The purpose of the parser is to transform a string of characters into a meaningful data structure. To whatever extent you can parse directly to the desired AST, you might as well do it, but you can otherwise get to the target AST in multiple steps.