Hadrian Hughes

Parsing left-recursion by recursive descent

I’ve recently been building a parser for a programming language I’m designing. I’ll likely be making future posts about the project as I make more progress, but suffice it to say, the language aims to bridge the divide between the functional and procedural paradigms as discussed in my post A Tale of Two Paradigms. I’m using a Haskell library called Megaparsec, which follows the parser combinator pattern to help with building recursive descent parsers.

A recurring problem

This library is great to work with, and building parsers this way is very intuitive, but I found myself repeatedly running into the same problem: trying to parse left-recursive parts of the grammar led to an infinite recursive loop. It took me a while to properly understand what was happening, but the problem can be formulated as follows.

Any time a rule contains itself as its first term, the parser will fall into infinite recursion. For example, take the following parser function:

expression :: Parser Expression
expression = Expression <$> expression <*> term

Expression represents a type defined as part of the language’s abstract syntax tree (AST), and term would be some other parser function. The expression function first attempts to parse an expression, then attempts to parse a term, then passes the results of these two components to the data constructor for the Expression type. Abstractly, this is a valid rule. The problem is that the first thing the function does is call itself, with no base case to break it out of this loop.

The problem becomes clearer if we express it in a more procedural style:

function expression() {
    expression();
    term();
}

The function immediately calls itself, and as such the program will run forever (or until the call stack overflows).

This problem appears frequently when parsing expressions that can be chained. For example, in my language there are structs (data structures with values assigned to predefined fields), and accessing struct fields is achieved using a dot followed by the name of the field. For example, given a struct person that contains a field name, we would access the field using the expression person.name. This simple case would be straightforward to parse. First we would define a type StructAccess to represent this kind of expression.

data StructAccess = StructAccess Text Text

This type takes 2 arguments of type Text. The first represents the name of the struct being accessed, and the second represents the field being selected. The parser for this kind of expression would look like this:

structAccess :: Parser StructAccess
structAccess = StructAccess <$> identifier <*> (dot *> identifier)

And for the expression person.name, it would yield an AST that looks like something like this:

StructAccess "person" "name"

However, we ideally want to be able to chain this operation. For example, let’s extend the definition of a person to also include an address field, which is itself a struct:

type Address = {
    street: String,
    town: String,
    postCode: String
}

type Person = {
    name: String,
    address: Address
}

To access a person’s postCode the expression should be person.address.postCode. In this expression, we are actually first accessing the address field on the person struct using person.address, and this yields another struct; then we access the postCode field on this address struct by appending the previous expression with .postCode. Thus, we arrive at the full expression person.address.postCode.

To achieve this, we need to change the StructAccess type to a general Expr type.

data Expr = StructAccess Expr Text
          | Identifier        Text

This new Expr type is an enum which can represent either a struct access operation, or an identifier (a string that references a struct by name). The specific StructAccess case takes another Expr as its first argument, meaning the left-hand side of the dot can be either a struct referenced by name, or another struct access operation. That is, we can chain the operation.

The AST for this expression would now look something like this:

StructAccess
    (StructAccess (Identifier "person") "address")
    "postCode"

To facilitate this we’d need to extend the structAccess parser like so:

identifierExpr :: Parser Expr
identifierExpr = Identifier <$> identifier

structAccess :: Parser Expr
structAccess = StructAccess <$> (try structAccess <|> identifierExpr) <*> (dot *> identifier)

The parser now attempts to parse a structAccess first (in this case, the sub-expression person.address), and only accept an identifier if this fails. However, this first attempt will never fail, because it will get caught in an infinite recursion. We could use parenthesis to solve this (the expression for accessing the postCode would become (person.address).postCode) but this isn’t ideal.

I was stumped on this problem for a while before arriving at a solution I’m quite happy with.

Flatten the grammar

As long as we are using a recursive descent parser, there is simply no way to parse a left-recursive grammar such as this. I realised that to overcome the problem, I would have to change the structure of the AST to something that eliminates the left-recursion. I defined a new type and parser like this:

data FlatStructAccess = FlatStructAccess Text [Text]

structAccess :: Parser FlatStructAccess
structAccess = do
    structName <- identifier
    fields     <- some (dot *> identifier)
    return $ FlatStructAccess structName fields

This parser first looks for an identifier, then looks for one or more instances of field selection using the dot operator; it then passes these results to the data constructor for FlatStructAccess. For the expression person.address.postCode, the resulting AST would look like this:

FlatStructAccess "person" ["address", "postCode"]

By eliminating left-recursion, this function is able to successfully parse the expression. However, this wasn’t the AST I wanted to end up with. I wanted each StructAccess to select a single field from a struct—chaining the operation would result in a nested AST as outlined earlier. I would have to restructure the AST I had, represented by FlatStructAccess, into the one I wanted, and I had a feeling that would be tedious to figure out. On the contrary, it turns out that Haskell’s standard foldl function would do it in one line.

foldl has the signature:

foldl :: (a -> b -> a) -> a -> [b] -> a

For its first argument, it accepts a function that takes an a and a b and returns another a. It’s second argument is an a which will be fed into this function, partially applying it, and its third argument is a list of b, the first of which will be passed into the partially applied function to result in a new a. This new a is then fed back into the function, partially applying it again, and this partially applied function is applied to the next b in the list, and so on until the full list has been iterated over, culminating in a final result of type a.

A simple use case for foldl would be to calculate the sum of a list of numbers. In this case the specific signature would be:

foldl :: (Int -> Int -> Int) -> Int -> [Int] -> Int

And it would be applied like foldl (+) 0 [1,2,3]. The result of this expression would be 6.

For our use case, Expr will be substituted for a and Text will be substituted for b.

foldl :: (Expr -> Text -> Expr) -> Expr -> [Text] -> Expr

Remember that data constructors are just functions, and the StructAccess constructor takes 2 arguments of type Expr and Text respectively. StructAccess is a data constructor for the Expr type, so that means the type signature for StructAccess is Expr -> Text -> Expr—the same as the first argument expected by foldl.

For the expression person.address.postCode, foldl will take an initial Expr (namely Identifier "person") and use it to partially apply the StructAccess constructor. It will then take the first item in the list of fields ("address") and pass it to the partially applied function, resulting in StructAccess (Identifier "person") "address". This Expr will then be used to partially apply StructAccess again, and then the final field ("postCode") will be passed to the partially applied function, resulting in the final result: StructAccess (StructAccess (Identifier "person") "address") "postCode".

The final version of the structAccess parser looks like this:

structAccess :: Parser Expr
structAccess = do
    structName <- identifier
    fields     <- some (dot *> identifier)
    return $ foldl StructAccess (Identifier structName) fields

By bootstrapping the StructAccess constructor with our structName and then folding it across our fields, we can easily transform our flat AST into the recursive one we want.

Megaparsec actually used to provide a function called chainl1 (documented here), which is an abstraction for this approach. This has since been deprecated and replaced with a helper module called Text.Megaparsec.Expr. I found the approach of this helper module more difficult to wrap my head around, but I may try to rework my solution to use this abstraction in the future.

Intermediate representations

My central takeaway from this experience is that the AST your parser targets needn’t be the one you keep. In cases like this, it makes sense to parse to an intermediate representation before reworking the result into the structure you actually want. The purpose of the parser is to transform a string of characters into a meaningful data structure. To whatever extent you can parse directly to the desired AST, you might as well do it, but you can otherwise get to the target AST in multiple steps.