Context-Free Language Parser

This book shows you how to use the #[derive(Parser)] proc-macro to generate a Parser trait implementation for parsing context-free languages.

The book is split into three parts:

  1. General introduction (sections 1 and 2)
  2. Simple usage (section 3)
  3. Expanded usage (section 4)

Getting started

Welcome to the cflp book. It will teach you how to use cflp, and it includes some benchmarks (more on those later).


To start, make sure you've added the library to your Cargo.toml file

cflp = "1.0"

Overview

cflp allows you to derive a parser trait with a derive macro and helper attributes. The macro can derive a context-free language (hence the name of the crate), but the trait can be used more generally if required.

When discussing the trait, we have two types to consider:

I: Input type

C: Comparison type

The parser takes an iterator over input tokens which are of type I and compares them to some instances of the type C. These may be the same type in simple instances, but for more complex uses will most likely be different types.
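As a sketch of the I/C split, consider hypothetical types (these are not part of cflp) where input tokens carry a position but only their kind is compared:

```rust
// Hypothetical types illustrating the I/C split: input tokens carry a
// position, but only their kind is compared against the rule.
#[derive(Debug, Clone, PartialEq)]
enum TokKind { Plus, Num }

#[derive(Debug, Clone)]
struct SpannedTok { kind: TokKind, pos: usize }

// Satisfies the I: PartialEq<C> bound with I = &SpannedTok and C = TokKind.
impl PartialEq<TokKind> for &SpannedTok {
    fn eq(&self, other: &TokKind) -> bool {
        self.kind == *other
    }
}
```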

Required impls

pub trait Parser<I, C, R = Self> where
    I: PartialEq<C>,
    C: PartialEq<C>
{
    fn parse<T>(src: &mut T) -> Result<R, Error<I, C>> where
        T: Iterator<Item=I> + Clone,
        Self: Sized;
}

When deriving the Parser trait, you will need to ensure that your input type I implements PartialEq<C> where C is your comparison type (this will make sense the more you read). You will also need to ensure that C: PartialEq<C>.

IMPORTANT: If the input and comparison types are the same, deriving PartialEq on the input type will not satisfy the requirement I: PartialEq<C>. Since the input comes from an iterator, I will be a reference, so when the input and comparison types are the same, I = &C. In this case, you will need an impl of PartialEq<C> for &C. You can use the PartialEqRef derive macro to derive a simple impl for this:

impl PartialEq<C> for &C {
    fn eq(&self, other: &C) -> bool {
        **self == *other
    }
}

Non-macro library

In the non-macro part of the cflp library, there are two structs (Error and NodeWrapper) and two traits (Parser and NodeData).

Traits

Parser

pub trait Parser<I, C: PartialEq<C>, R = Self> {
    fn parse<T>(src: &mut T) -> Result<R, Error<I, C>> where
            T: Iterator<Item=I> + Clone,
            Self: Sized {
        <Self as Parser<I, C, R>>::parse_with_recursion(src, false)
    }
    
    fn parse_with_recursion<T>(src: &mut T, recurse: bool) -> Result<R, Error<I, C>> where
            T: Iterator<Item=I> + Clone,
            Self: Sized;
}

The Parser trait is used to parse a token stream (of type T) into either a Self instance, or fail to do so and return an error.

You should only ever call Parser::parse. Parser::parse_with_recursion is where all the actual derived code lives; without it, left-recursive rules would recurse forever and overflow the stack. parse_with_recursion exists to prevent zero-consumption branching.

NodeData

pub trait NodeData<T: Sized> {
	fn start(&self) -> T;
	fn end(&self) -> T;
}

The NodeData trait is used to obtain positional data about a node. This is intended for tokens used in the Parser trait, but can be applied to parsed nodes as well.

Structs

Error

pub struct Error<F, E> {
	pub found: Option<F>,
	pub expected: E
}

The Error struct is returned from Parser::parse when parsing fails. The expected field is for the element expected, and the found field is the element found. This is an Option because the iterator could return a None.
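A minimal stand-in mirroring the Error shape above shows how found: None distinguishes running out of input from finding the wrong element (the describe helper is ours, not part of cflp):

```rust
// Stand-in mirroring cflp's Error shape (copied from the definition above).
#[derive(Debug, PartialEq)]
pub struct Error<F, E> {
    pub found: Option<F>,
    pub expected: E
}

// Hypothetical helper turning an error into a message; found: None means
// the iterator ran out of input.
fn describe(e: &Error<char, char>) -> String {
    match e.found {
        Some(f) => format!("expected {:?}, found {:?}", e.expected, f),
        None => format!("expected {:?}, found end of input", e.expected),
    }
}
```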

Here we'll introduce simple rule syntax. This will include matching types, optional sections, Kleene closures, and positive closures.

We'll be using the following enum as our example token:

pub enum Token {
    OB, CB,     // Opening and closing parentheses
    Value(u8),  // A literal value
    Other(char) // Some character
}

Matching something

If you want to match a token, add the token you want to match to the rule. Simple as that. For example if we wanted to match Token::OB in our rule, we'd have (RuleName; Token::OB).
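What matching a single token does can be sketched by hand (this is an illustration of the behaviour, not cflp's actual generated code; the Token enum is redeclared so the sketch is self-contained):

```rust
#[derive(Debug, Clone, PartialEq)]
enum Token { OB, CB, Value(u8), Other(char) }

// Try to consume one expected token from the stream; on failure, report
// what was found instead (None meaning the input ran out), mirroring the
// Error::found field.
fn match_token<'t, T>(src: &mut T, expected: &Token) -> Result<(), Option<Token>>
where
    T: Iterator<Item = &'t Token>,
{
    match src.next() {
        Some(t) if t == expected => Ok(()),
        Some(t) => Err(Some(t.clone())),
        None => Err(None),
    }
}
```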

Groups

A group is a set of rules inside some parentheses. These can be complex nested groups with lots of modifiers applied to each inner group, or just contain a single value such as (Token::Value(2)).

Optional sections

An optional section is one that can optionally be included in the input. The parser is a maximal-munch parser and will always try to match these optional sections.

To indicate an optional section, add a ? to the end of a group, for example (Token::Other('a'))?.

Kleene and positive closures

A Kleene closure is one that will match the inner section 0 or more times, and a positive closure will match the inner section at least once. These are indicated in a similar way to optional sections but with a * for a Kleene closure and + for a positive closure.
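By way of illustration (a hand-written sketch of the assumed behaviour, not cflp's generated code), a Kleene closure keeps a clone of the iterator so it can rewind after the first failed match; a positive closure is the same loop plus an error if nothing was matched:

```rust
#[derive(Debug, Clone, PartialEq)]
enum Token { A, B }

// Match Token::A zero or more times, leaving `src` just past the last match.
fn kleene_a<'t, T>(src: &mut T) -> Vec<Token>
where
    T: Iterator<Item = &'t Token> + Clone,
{
    let mut out = Vec::new();
    loop {
        let backup = src.clone();          // remember the position
        match src.next() {
            Some(Token::A) => out.push(Token::A),
            _ => {
                *src = backup;             // rewind past the failed match
                break;
            }
        }
    }
    out
}
```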


Zero-consumption branching

Consider the following:

#[derive(PartialEq, PartialEqRef)]
enum Token {
    Add,
    Num(i64),
    Str(String)
}

#[derive(Parser)]
#[parser(Token, Token)]
enum Expr {
    #[parser([[@Expr]], Token::Add, [[@Expr]])]
    Add {
        left: Box<Self>,
        right: Box<Self>
    },
    #[parser([@ExprLit])]
    Literal(ExprLit)
}

#[derive(Parser)]
#[parser(Token, Token)]
enum ExprLit {
    #[parser([Token::Num(t)])]
    Number(i64),
    #[parser([Token::Str(t)])]
    String(String)
}

This can lead to zero-consumption branching if we call Expr::parse, so called because the parser branches without consuming any elements from src. Expr::parse would first try to parse an Expr, which would in turn try to parse an Expr, and so on, resulting in a stack overflow. To prevent this, the derived code instead calls Parser::parse_with_recursion(src, false), which stops this call from immediately trying to parse itself again.

This should not prevent calling Parser::parse for Self once something has been consumed.

If this pattern is in your code, you will need to ensure that C: Default. For an enum this can be done by re-using one of the variants, since the Default value only needs to exist to prevent a stack overflow. See example 3 for a demonstration.
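To make the mechanism concrete, here is a hand-written sketch of the guarded behaviour (names and structure are illustrative; the derived code will differ): when the flag is false, the left-recursive branch is skipped, so the first call must consume a literal before any recursion happens.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Token { Add, Num(i64) }

#[derive(Debug, PartialEq)]
enum Expr {
    Add(Box<Expr>, Box<Expr>),
    Literal(i64),
}

// `recurse == false` disables the branch that starts by parsing Expr itself.
fn parse_expr<'t, T>(src: &mut T, recurse: bool) -> Option<Expr>
where
    T: Iterator<Item = &'t Token> + Clone,
{
    if recurse {
        let backup = src.clone();
        // Left-recursive branch: Expr Add Expr. The left side is parsed with
        // recursion disabled, so it cannot loop without consuming input.
        if let Some(left) = parse_expr(src, false) {
            if src.next() == Some(&Token::Add) {
                if let Some(right) = parse_expr(src, true) {
                    return Some(Expr::Add(Box::new(left), Box::new(right)));
                }
            }
        }
        *src = backup; // rewind after the failed branch
    }
    // Non-recursive branch: a literal.
    match src.next() {
        Some(Token::Num(n)) => Some(Expr::Literal(*n)),
        _ => None,
    }
}
```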

Derive macro

When using the derive macro to derive the Parser trait, you need to specify how the parser should be implemented. This is done using helper attributes (#[parser(...)]).

Metadata

Metadata is just three arguments (two types and one expression) that are required to know how to derive the Parser impl. They're comma-separated (no trailing comma) and given in the following order:

  1. Input token type I
    This is the type in the iterator (behind a reference). For example if our input token type was MyToken, the iterator would be T: Iterator<Item=&MyToken>
  2. Comparison token type C
    This is the type being compared against. If your input token type contains token metadata (such as token position), you will probably want to compare against the underlying token type. Using the example from before, if MyToken had a field type_: MyTokType, where MyTokType is an enum of token kinds, the comparison token type would probably be MyTokType

IMPORTANT: If your input and comparison types are different, then &I: Into<C> must be satisfied
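Following the MyToken/MyTokType example above (these types are hypothetical, not part of cflp), the bound can be satisfied with a From impl, since From gives you Into for free:

```rust
#[derive(Debug, Clone, PartialEq)]
enum MyTokType { Plus, Num }

#[derive(Debug, Clone)]
struct MyToken {
    type_: MyTokType,
    pos: usize,
}

// Satisfies &I: Into<C> with I = MyToken and C = MyTokType.
impl From<&MyToken> for MyTokType {
    fn from(t: &MyToken) -> Self {
        t.type_.clone()
    }
}
```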

Rule syntax

Rules are defined in two kinds of part: values and groups. Values are singular expressions, and groups are parenthesised collections of values.

Values

Consider the following rule from the first example:

#[derive(Parser)]
#[parser(
    Token, Token;
    ([@Value])+, [@Sep], ([@Value])*, (@Sep)?)
]
pub struct Base {
    first: Vec<Value>,
    sep: Sep,
    last: Vec<Value>
}

The first line inside the helper attribute (Token, Token;) is the metadata. The line after it defines the rule for the parser impl. We'll consider this rule in segments:

  1. ([@Value])+
    Firstly, @Value is a call to another rule, Value in this case. This will try to parse a Value type. Placing square brackets around this means that we save the match. The parentheses make it a group, and the + makes the group a positive closure, meaning we match at least one Value
  2. [@Sep]
    This matches a Sep type and saves the result
  3. ([@Value])*
    This is the same as ([@Value])+ but as a Kleene closure, meaning we match zero or more Value types
  4. (@Sep)?
    @Sep will match a Sep type. The ? after the group means that it's optional, so it can either be matched or not

When saving values, you might need to box a value to ensure that the type has a known size at compile time. If this is required, you can use two square brackets to indicate as such. For example, [[@Sep]] would match a Sep type and box the result (Box<Sep>)

Groups

A group is a set of values in parentheses. It can optionally be followed by a ?, *, or + for an optional group, Kleene closure, or positive closure:

  • Optional means the parser will try to match the group. If it can't, it resets the input iterator back to where it was before the group and skips it. This becomes an Option<T> where T is the type matched by the inside of the group
  • Kleene closure means it matches any number (including zero) of the group. This becomes a Vec<T> where T is the type matched by the inside of the group
  • Positive closure means it matches at least one of the group. This has the same type as a Kleene closure

Groups can contain comma-separated values. For example, you could have (TokType::Int(1), TokType::Add, TokType::Int(2))? as a group. Groups can be nested.

Enums and unwrapping matches

Until now, all the examples have been on structs. Enums are slightly different: the enum needs the metadata in a parser attribute on the enum itself, with the rule for each variant in a parser attribute on that variant.

For example, we might have a simple enum to turn token literals into AST node literals

#[derive(Parser)]
#[parser(Token, Token)]
pub enum Value {
    #[parser([Token::Digit(t)])]
    Int(usize),
    #[parser([Token::Ident(t)])]
    Ident(String)
}

You'll notice that the matches on the variants don't contain literal values, but identifiers. This is because matches that aren't calls to other types are parsed as patterns, like match statement arms. The identifiers bind parts of the token, just as they would in a match arm, instead of requiring a match on the whole token.

When picking identifiers, make sure they're all unique. It's also a good idea to avoid identifiers with a double underscore prefix, such as __my_var, since double-underscore-prefixed variables are used in the derived implementation and a clash could break the generated code.

Node data

When parsing, it's often useful to keep track of the positions of nodes, for error reporting for example. This can be achieved through the NodeData trait and the NodeWrapper struct.

The NodeData trait has two functions, NodeData::start and NodeData::end, to get the start and end data of a node.

The NodeWrapper struct contains a node (NodeWrapper::node), and start and end data about that node (NodeWrapper::start, NodeWrapper::end).
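As a minimal stand-in (shapes taken from the descriptions above; the NodeData impl for NodeWrapper is an assumption for illustration, not confirmed crate code):

```rust
// Positional-data trait, as described above.
trait NodeData<T: Sized> {
    fn start(&self) -> T;
    fn end(&self) -> T;
}

// Wraps a parsed node together with its start and end positions.
struct NodeWrapper<N, T> {
    node: N,
    start: T,
    end: T,
}

// Assumed impl: a wrapper can report its own bounds.
impl<N, T: Copy> NodeData<T> for NodeWrapper<N, T> {
    fn start(&self) -> T { self.start }
    fn end(&self) -> T { self.end }
}
```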

Example

#[derive(Parser)]
#[parser(Token, TokType, |t| t.t, usize; ([@Add])*)]
pub struct Root(pub Vec<NodeWrapper<Add, usize>>);

In this example, you'll notice that there is an additional argument in the metadata. This indicates that any matches should be wrapped in a NodeWrapper struct with positional data. The type provided is the type returned by the NodeData trait functions [1].

You'll also notice that the field of the struct is not a Vec<Add>, but Vec<NodeWrapper<Add, usize>>.

Use for one, use for all

If you derive one Parser impl that's wrapped, all Parser impls that can be called from it must also be wrapped.

For example, you could not have the following

#[derive(Parser)]
#[parser(Token, TokType, |t| t.t, usize; ([@Add])*)]
pub struct Root(pub Vec<NodeWrapper<Expr, usize>>);

#[derive(Parser)]
#[parser(Token, TokType, |t| t.t)]
pub enum Expr {
    /* ... */
}

This would cause an error because the Root Parser impl would expect that the Expr Parser impl would also be wrapped so it can access the positional data of an Expr match.

NodeData impl requirement

If you use wrapped matches, you need to ensure that the input type of your parser (Token in the example above) implements NodeData, so that Parser impls that match tokens directly (rather than calling other types) can access positional data.


[1] This could most likely be included as a generic in the Parser impl, but is required to ensure the correct NodeData impl is used in the case that there are multiple NodeData impls.

Benchmarks

Because cflp is for parsing a context-free language from a token stream into an AST, we have two primary aims:

  1. Correctness
    The derived Parser impls must be correct. This is a simple pass/fail aim
  2. Speed
    The derived Parser impls must be as fast as they can be made

To compare the speed of the derived impls, we use criterion v0.4.0 and two benchmark groups: a simple and a complex benchmark.

Simple benchmark

(Results table: Benchmark | Handwritten | Derived)

The simple benchmark doesn't use wrapping and uses fairly simple rules. This is used to check that the derived impl is fast in a base case. It is also useful as a base benchmark to compare with other benchmarks to remove some variability from test-to-test.

Complex benchmark

(Results table: Benchmark | Handwritten | Derived)

The complex benchmark uses wrapping (NodeWrapper) with repeating sections and boxed matches. This is used to compare a handwritten impl against a derived one. Ideally, the derived impl would be as fast as the handwritten one, but this is not yet the case as of writing.

Comparing results

The results are used to compare the efficiency of the derived impls relative to the handwritten ones. Benchmarks are not intended to be used to compare absolute times.

As a note, all of the files generated by criterion are available under the criterion path of this book (https://fck-language.github.io/cflp/bench)