Context-Free Language Parser
This book will show you how to use the `#[derive(Parser)]` proc-macro to generate a `Parser` trait implementation, allowing you to parse context-free languages.
The book is split into three parts:
Getting started
Welcome to the cflp book. This book teaches you how to use cflp, and includes some benchmarks (more on those later).
To start, make sure you've added the library to your Cargo.toml file:

```toml
cflp = "1.0"
```
Overview
cflp allows you to derive a parser trait with a derive macro and helper attributes. The macro can derive a context-free language (hence the name of the crate), but the trait can be used more generally if required.
When discussing the trait, we have two types to consider:
- `I`: the input type
- `C`: the comparison type

The parser takes an iterator over input tokens of type `I` and compares them to instances of type `C`. These may be the same type in simple cases, but for more complex uses they will most likely be different types.
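As a hypothetical illustration (these types are not part of cflp), an input token might carry positional metadata while the comparison type is just the token kind:

```rust
// Hypothetical input/comparison pair: `Tok` is the input type with
// positional metadata, `TokKind` is the comparison type.
#[derive(Clone, Debug)]
pub struct Tok {
    pub kind: TokKind,
    pub span: (usize, usize),
}

#[derive(Clone, Debug, PartialEq)]
pub enum TokKind {
    Plus,
    Num(u8),
}

// With `I = &Tok` and `C = TokKind`, the `I: PartialEq<C>` requirement
// means comparing a token reference against an expected token kind.
impl PartialEq<TokKind> for &Tok {
    fn eq(&self, other: &TokKind) -> bool {
        self.kind == *other
    }
}
```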
Required impls
```rust
pub trait Parser<I, C, R = Self> where
    I: PartialEq<C>,
    C: PartialEq<C>
{
    fn parse<T>(src: &mut T) -> Result<R, Error<I, C>> where
        T: Iterator<Item = I> + Clone,
        Self: Sized;
}
```
When deriving the `Parser` trait, you will need to ensure that your input type `I` implements `PartialEq<C>`, where `C` is your comparison type (this will make more sense the more you read). You will also need to ensure that `C: PartialEq<C>`.
IMPORTANT:
If the input and comparison types are given as the same thing, deriving `PartialEq` on the input type will not cover the requirement `I: PartialEq<C>`. Since the input comes from an iterator, `I` will be a reference, so if the input and comparison types are the same, `I = &C`. In this case, you will need an `impl PartialEq<C> for &C`. You can use the `PartialEqRef` derive macro to derive a simple impl for this:

```rust
impl PartialEq<C> for &C {
    fn eq(&self, other: &C) -> bool {
        // Delegate to C: PartialEq<C> to avoid recursing into this impl
        (**self).eq(other)
    }
}
```
Non-macro library
The non-macro part of the cflp library consists of two structs (`Error` and `NodeWrapper`) and two traits (`Parser` and `NodeData`).
Traits
Parser
```rust
pub trait Parser<I, C: PartialEq<C>, R = Self> {
    fn parse<T>(src: &mut T) -> Result<R, Error<I, C>> where
        T: Iterator<Item = I> + Clone,
        Self: Sized
    {
        <Self as Parser<I, C, R>>::parse_with_recursion(src, false)
    }
    fn parse_with_recursion<T>(src: &mut T, recurse: bool) -> Result<R, Error<I, C>> where
        T: Iterator<Item = I> + Clone,
        Self: Sized;
}
```
The `Parser` trait is used to parse a token stream (an iterator of type `T`) into either a `Self` instance, or fail to do so and return an error.
You should only ever call `Parser::parse`. `Parser::parse_with_recursion` is where all the actual derived code goes. Without `parse_with_recursion`, we can end up with stack overflows; `parse_with_recursion` prevents zero-consumption branching.
NodeData
```rust
pub trait NodeData<T: Sized> {
    fn start(&self) -> T;
    fn end(&self) -> T;
}
```
The `NodeData` trait is used to obtain positional data about a node. It is intended for tokens used in the `Parser` trait, but can be applied to parsed nodes as well.
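For illustration, here is the trait as defined above implemented by hand for a hypothetical token type carrying byte offsets (the token type is an assumption for this sketch, not part of cflp):

```rust
// The NodeData trait as defined above.
pub trait NodeData<T: Sized> {
    fn start(&self) -> T;
    fn end(&self) -> T;
}

// Hypothetical token type holding byte offsets.
pub struct Tok {
    pub start: usize,
    pub end: usize,
}

// Positional data is just the stored offsets.
impl NodeData<usize> for Tok {
    fn start(&self) -> usize { self.start }
    fn end(&self) -> usize { self.end }
}
```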
Structs
Error
```rust
pub struct Error<F, E> {
    pub found: Option<F>,
    pub expected: E
}
```
The `Error` struct is returned from `Parser::parse` when parsing fails. The `expected` field holds the element that was expected, and the `found` field holds the element that was found. `found` is an `Option` because the iterator may have returned `None`.
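To make the pieces concrete, here is a self-contained sketch: the `Error` struct and `Parser` trait reproduced as defined above, plus a hand-written impl for a hypothetical `Digit` node (hand-written here only so the example runs without the derive macro):

```rust
pub struct Error<F, E> {
    pub found: Option<F>,
    pub expected: E,
}

pub trait Parser<I, C: PartialEq<C>, R = Self> {
    fn parse<T>(src: &mut T) -> Result<R, Error<I, C>>
    where
        T: Iterator<Item = I> + Clone,
        Self: Sized,
    {
        <Self as Parser<I, C, R>>::parse_with_recursion(src, false)
    }
    fn parse_with_recursion<T>(src: &mut T, recurse: bool) -> Result<R, Error<I, C>>
    where
        T: Iterator<Item = I> + Clone,
        Self: Sized;
}

// Hypothetical node: a single ASCII digit parsed from a byte iterator.
pub struct Digit(pub u8);

impl Parser<u8, u8> for Digit {
    fn parse_with_recursion<T>(src: &mut T, _recurse: bool) -> Result<Self, Error<u8, u8>>
    where
        T: Iterator<Item = u8> + Clone,
    {
        match src.next() {
            Some(b) if b.is_ascii_digit() => Ok(Digit(b - b'0')),
            // On failure, report what was found (possibly None) and expected
            found => Err(Error { found, expected: b'0' }),
        }
    }
}
```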
Here we'll introduce simple rule syntax. This will include matching types, optional sections, Kleene closures, and positive closures.
We'll be using the following enum as our example token:

```rust
pub enum Token {
    OB, CB,     // Opening and closing parentheses
    Value(u8),  // A literal value
    Other(char) // Some other character
}
```
Matching something
If you want to match a token, add the token you want to match to the rule. Simple as that. For example, if we wanted to match `Token::OB` in our rule, we'd have `(RuleName; Token::OB)`.
Groups
A group is a set of rules inside parentheses. These can be complex nested groups with lots of modifiers applied to each inner group, or just contain a single value such as `(Token::Value(2))`.
Optional sections
An optional section is one that can optionally be matched in the input. The parser is a maximal-munch parser and will always try to match these optional sections.
To indicate an optional section, add a `?` to the end of a group, for example `(Token::Other('a'))?`.
Kleene and positive closures
A Kleene closure will match the inner section zero or more times, and a positive closure will match the inner section at least once. These are indicated in a similar way to optional sections, but with a `*` for a Kleene closure and a `+` for a positive closure.
Zero-consumption branching
Consider the following:
```rust
#[derive(PartialEq, PartialEqRef)]
enum Token {
    Add,
    Num(i64),
    Str(String)
}

#[derive(Parser)]
#[parser(Token, Token)]
enum Expr {
    #[parser([[@Expr]], Token::Add, [[@Expr]])]
    Add {
        left: Box<Self>,
        right: Box<Self>
    },
    #[parser([@ExprLit])]
    Literal(ExprLit)
}

#[derive(Parser)]
#[parser(Token, Token)]
enum ExprLit {
    #[parser([Token::Num(t)])]
    Number(i64),
    #[parser([Token::Str(t)])]
    String(String)
}
```
Calling `Expr::parse` here can lead to zero-consumption branching, so called because it consumes zero elements from `src` and branches: to parse an `Expr`, it first tries to parse an `Expr`, which first tries to parse an `Expr`, and so on, resulting in a stack overflow. To prevent this, we instead call `Parser::parse_with_recursion(src, false)`, which stops this call from trying to parse itself again.
This does not prevent calling `Parser::parse` for `Self` once something has been consumed.
If this pattern appears in your code, you will need to ensure that `C: Default`. For an enum this can be done by re-using one of the variants, since the value is only needed to prevent a stack overflow. See example 3 for an example of this.
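Continuing the example above, satisfying `Default` for `Token` might look like this (which variant you re-use is arbitrary, since the value only exists to break the recursion; the `PartialEqRef` derive is omitted here so the sketch runs standalone):

```rust
#[derive(PartialEq)]
enum Token {
    Add,
    Num(i64),
    Str(String),
}

// Re-use an existing variant as the default; the value is only a
// placeholder returned to prevent a stack overflow.
impl Default for Token {
    fn default() -> Self {
        Token::Add
    }
}
```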
Derive macro
When using the derive macro to derive the `Parser` trait, you need to specify how the parser should be implemented. This is done using helper attributes (`#[parser(...)]`).
Metadata
Metadata is just three arguments (two types and one expression) that are required to know how to derive the `Parser` impl. They're comma separated (no trailing comma) and in the following order:
- Input token type `I`: This is the type in the iterator (behind a reference). For example, if our input token type was `MyToken`, the iterator would be `T: Iterator<Item=&MyToken>`
- Comparison token type `C`: This is the type being compared with. If your input token type contains token metadata (such as token position), then you will probably want to compare the underlying token type. Using the example from before, if we had `MyToken::type_: MyTokType` where `MyTokType` is an enum of token types, we would probably have our comparison token type as `MyTokType`
- Conversion expression: An expression mapping the input token to the comparison token, such as the closure `|t| t.t` in the later examples. When the input and comparison types are the same, this can be omitted

IMPORTANT: If your input and comparison types are different, then `&I: Into<C>` must be satisfied
Rule syntax
Rules are defined in two parts: values and groups. Values are singular expressions, and a group is a set of values.
Values
Consider the following rule from the first example:
```rust
#[derive(Parser)]
#[parser(
    Token, Token;
    ([@Value])+, [@Sep], ([@Value])*, (@Sep)?
)]
pub struct Base {
    first: Vec<Value>,
    sep: Sep,
    last: Vec<Value>
}
```
The first line in the helper attribute (line 3) is the metadata. The line after defines the rule for the parser impl. We'll consider this in segments:
- `([@Value])+`: Firstly, `@Value` is a call to another rule, `Value` in this case. This will try to parse a `Value` type. Placing square brackets around this means that we save the match. The parentheses make it a group, and the `+` makes the group into a positive closure, meaning we match at least one `Value`
- `[@Sep]`: This saves a `Sep` type that's matched from the input
- `([@Value])*`: This is the same as `([@Value])+` but for a Kleene closure, meaning we match zero or more `Value` types
- `(@Sep)?`: `@Sep` will match a `Sep` type. The `?` after the group means that this is optional, so it can either be matched or not
When saving values, you might need to box a value to ensure that the type has a known size at compile time. If this is required, you can use double square brackets to indicate as such. For example, `[[@Sep]]` would match a `Sep` type and box the result (`Box<Sep>`).
Groups
A group is a set of values in parentheses. It can optionally be followed by a `?`, `*`, or `+` for an optional group, Kleene closure, or positive closure respectively:
- Optional means the parser will try to match the group. If it can't, it resets the input iterator back to where it was before the group and skips it. This becomes an `Option<T>` where `T` is the type matched by the inside of the group
- Kleene closure means it matches any number (including zero) of the group. This becomes a `Vec<T>` where `T` is the type matched by the inside of the group
- Positive closure means it matches at least one of the group. This has the same type as a Kleene closure
Groups can have comma separated values. For example, you could have `(TokType::Int(1), TokType::Add, TokType::Int(2))?` as a group. Groups can also be nested.
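As a sketch of how the three modifiers map onto field types, consider this hypothetical struct (assuming some type `Inner` with its own derived `Parser` impl; the names here are illustrative, not from the crate):

```rust
#[derive(Parser)]
#[parser(Token, Token; ([@Inner])?, ([@Inner])*, ([@Inner])+)]
pub struct Demo {
    maybe: Option<Inner>,    // `?` gives Option<T>
    any: Vec<Inner>,         // `*` gives Vec<T>
    at_least_one: Vec<Inner> // `+` gives Vec<T>
}
```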
Enums and unwrapping matches
Until now, all the examples have been on structs. Enums are slightly different: the enum needs the metadata in a `#[parser(...)]` attribute on the enum itself, with the rule for each variant in a `#[parser(...)]` attribute on that variant.
For example, we might have a simple enum to turn token literals into AST node literals:

```rust
#[derive(Parser)]
#[parser(Token, Token)]
pub enum Value {
    #[parser([Token::Digit(t)])]
    Int(usize),
    #[parser([Token::Ident(t)])]
    Ident(String)
}
```
You'll notice that the matches on the variants don't contain literal values, but identifiers. This is because matches that aren't calls to other types are parsed as patterns, like in a match statement arm. The identifiers are bound just like in a match arm and their values are used, instead of matching the whole token.
When picking identifiers, make sure they're all unique. It's also a good idea to avoid identifiers with a double underscore prefix such as `__my_var`, since double underscore prefixed variables are used in the derived implementation and collisions could break things.
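As a rough hand-written equivalent (hypothetical, not the exact derived code), the variant rules above behave like match arms that bind `t` and use it:

```rust
// Hypothetical token and node types mirroring the example above.
enum Token {
    Digit(usize),
    Ident(String),
}

enum Value {
    Int(usize),
    Ident(String),
}

// Roughly what matching `[Token::Digit(t)]` / `[Token::Ident(t)]` does:
// the pattern binds `t`, and the bound value becomes the variant's field.
fn parse_value(tok: Token) -> Option<Value> {
    match tok {
        Token::Digit(t) => Some(Value::Int(t)),
        Token::Ident(t) => Some(Value::Ident(t)),
    }
}
```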
Node data
When parsing, it's often useful to keep track of the positions of nodes, for error handling for example. This can be achieved through the use of the `NodeData` trait and the `NodeWrapper` struct.
The `NodeData` trait has two functions, `NodeData::start` and `NodeData::end`, to get the start and end data of a node.
The `NodeWrapper` struct contains a node (`NodeWrapper::node`) and start and end data about that node (`NodeWrapper::start`, `NodeWrapper::end`).
Example
```rust
#[derive(Parser)]
#[parser(Token, TokType, |t| t.t, usize; ([@Add])*)]
pub struct Root(pub Vec<NodeWrapper<Add, usize>>);
```
In this example, you'll notice that there is an additional argument in the metadata. This indicates that any matches should be wrapped in a `NodeWrapper` struct with positional data. The type provided is the type returned from the `NodeData` trait functions.
You'll also notice that the inner field of the struct is not a `Vec<Add>`, but a `Vec<NodeWrapper<Add, usize>>`.
Use for one, use for all
If you derive one `Parser` impl that's wrapped, all `Parser` impls that can be called from it must also be wrapped.
For example, you could not have the following:

```rust
#[derive(Parser)]
#[parser(Token, TokType, |t| t.t, usize; ([@Add])*)]
pub struct Root(pub Vec<NodeWrapper<Expr, usize>>);

#[derive(Parser)]
#[parser(Token, TokType, |t| t.t)]
pub enum Expr {
    /* ... */
}
```
This would cause an error because the `Root` `Parser` impl expects the `Expr` `Parser` impl to also be wrapped, so that it can access the positional data of an `Expr` match.
NodeData impl requirement
If you use wrapped matches, the input type of your parser (`Token` in the example above) needs to implement `NodeData` so that `Parser` impls that match tokens, rather than calls to other types, can access positional data.
This could most likely be included as a generic in the `Parser` impl, but is required to ensure the correct `NodeData` impl is used in case there are multiple `NodeData` impls.
Benchmarks
Because cflp is for parsing a context-free language from a token stream into an AST, we have two primary aims:
- Correctness: The derived `Parser` impls must be correct. This is a simple pass/fail aim
- Speed: The derived `Parser` impls must be as fast as they can be made
To compare the speed of the derived impls, we use criterion v0.4.0 and two benchmark groups: a simple and a complex benchmark.
Simple benchmark
The simple benchmark doesn't use wrapping and uses fairly simple rules. This is used to check that the derived impl is fast in a base case. It is also useful as a baseline to compare with other benchmarks, removing some run-to-run variability.
Complex benchmark
The complex benchmark uses wrapping (`NodeWrapper`) with repeating sections and boxed matches. This is used to compare a handwritten impl against a derived one. Ideally, the derived impl would be as good as the handwritten one, but this is not the case as of writing.
Comparing results
The results are used to compare the efficiency of the derived impls relative to the handwritten ones. Benchmarks are not intended to be used to compare absolute times.
As a note, all of the files generated by criterion are available under the criterion path of this book (https://fck-language.github.io/cflp/bench)