Using Nom – a parser combinator library
I wanted to create a parser for Apertium Stream. In 2014, I used Whittle in Ruby. If this were 2001, I would have used Lex/Yacc. But this is 2021, and I wanted to write the parser in Rust. I looked for something similar to Lex/Yacc and found Rust-Peg, whose documentation links to Nom. My first impression was that Nom's examples are easy to read. Besides, its documentation claims that Nom is fast.
The Apertium Stream format is quite complex, and I didn't know exactly how to use Nom, so I started with an easy case. My simplified Apertium stream is a list of lexical units. A lexical unit looks like this:
^surface_form$
Btw, I didn't test the source code in this post. If you want a runnable example, please check https://github.com/veer66/reinars.
I created a function to match a lexical unit first. It looks like this:
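use nom::{bytes::complete::{is_not, tag}, sequence::delimited, IResult};

fn parse_lexical_unit(input: &str) -> IResult<&str, &str> {
    // Match "^", then one or more characters that are neither '^' nor '$', then "$"
    let mut parse = delimited(tag("^"), is_not("^$"), tag("$"));
    parse(input)
}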
Running parse_lexical_unit("^cat$") returns Ok(("", "cat")).
I then improved it by returning a LexicalUnit struct instead of a &str.
#[derive(Debug)]
struct LexicalUnit {
    surface_form: String,
}

fn parse_lexical_unit(input: &str) -> IResult<&str, LexicalUnit> {
    let mut parse = delimited(tag("^"), is_not("^$"), tag("$"));
    parse(input).map(|(i, o)| (i, LexicalUnit { surface_form: String::from(o) }))
}
delimited matches ^ at the beginning and $ at the end. I wanted to capture everything in between that is neither ^ nor $, so I used is_not("^$"). Could it be more straightforward?
Now, when I run parse_lexical_unit("^cat$"), I get Ok(("", LexicalUnit { surface_form: "cat" })) instead. 😃
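To make it clearer what delimited, tag, and is_not do together, here is a minimal sketch that re-creates them with the standard library only. This is not Nom's implementation: the ParseResult alias and the simplified Option-based signatures are assumptions made for illustration, and Nom's real combinators return IResult with error details.

```rust
// Hand-rolled stand-ins for the combinators used in this post.
// Illustrative only: Nom's real versions return IResult, not Option.

/// Some((remaining input, output)) on success, None on failure.
type ParseResult<'a, O> = Option<(&'a str, O)>;

/// Matches the literal `expected` at the start of the input.
fn tag<'a>(expected: &'static str) -> impl Fn(&'a str) -> ParseResult<'a, &'a str> {
    move |input: &'a str| {
        let rest = input.strip_prefix(expected)?;
        Some((rest, &input[..expected.len()]))
    }
}

/// Matches the longest non-empty prefix that contains none of the `forbidden` characters.
fn is_not<'a>(forbidden: &'static str) -> impl Fn(&'a str) -> ParseResult<'a, &'a str> {
    move |input: &'a str| {
        let end = input
            .find(|c: char| forbidden.contains(c))
            .unwrap_or(input.len());
        if end == 0 {
            None
        } else {
            Some((&input[end..], &input[..end]))
        }
    }
}

/// Runs `open`, `inner`, `close` in sequence and keeps only the output of `inner`.
fn delimited<'a, A, B, C>(
    open: impl Fn(&'a str) -> ParseResult<'a, A>,
    inner: impl Fn(&'a str) -> ParseResult<'a, B>,
    close: impl Fn(&'a str) -> ParseResult<'a, C>,
) -> impl Fn(&'a str) -> ParseResult<'a, B> {
    move |input: &'a str| {
        let (rest, _) = open(input)?;
        let (rest, value) = inner(rest)?;
        let (rest, _) = close(rest)?;
        Some((rest, value))
    }
}

#[derive(Debug, PartialEq)]
struct LexicalUnit {
    surface_form: String,
}

fn parse_lexical_unit(input: &str) -> ParseResult<'_, LexicalUnit> {
    let parse = delimited(tag("^"), is_not("^$"), tag("$"));
    parse(input).map(|(rest, form)| (rest, LexicalUnit { surface_form: form.to_string() }))
}
```

Each combinator is just a function that returns another function, which is why they compose so easily. In this sketch, an empty unit such as "^$" is rejected, because is_not needs at least one character.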
Then I created a function for parsing the simplified stream.
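use nom::{character::complete::space1, multi::separated_list0};

fn parse_stream(input: &str) -> IResult<&str, Vec<LexicalUnit>> {
    // Lexical units separated by one or more spaces or tabs
    let mut parse = separated_list0(space1, parse_lexical_unit);
    parse(input)
}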
In the parse_stream function, I pass parse_lexical_unit, which I created before, to separated_list0. separated_list0 captures a list, in this case a list of lexical units parsed by parse_lexical_unit, whose items are separated by space1, which matches one or more spaces or tabs.
Running parse_stream("^I$ ^eat$ ^rice$"), I get:
Ok(("", [LexicalUnit { surface_form: "I" },
LexicalUnit { surface_form: "eat" },
LexicalUnit { surface_form: "rice" }]))
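The way separated_list0 consumes input can also be sketched with the standard library only. The code below is a simplified stand-in, not Nom's code: ParseResult, the Option-based signatures, and the lexical_unit helper (which returns a plain String to keep the example short) are assumptions for illustration.

```rust
/// Some((remaining input, output)) on success, None on failure.
type ParseResult<'a, O> = Option<(&'a str, O)>;

/// One or more spaces or tabs.
fn space1(input: &str) -> ParseResult<'_, &str> {
    let end = input
        .find(|c: char| c != ' ' && c != '\t')
        .unwrap_or(input.len());
    if end == 0 {
        None
    } else {
        Some((&input[end..], &input[..end]))
    }
}

/// Simplified lexical unit: '^', one or more characters other than '^' and '$', then '$'.
fn lexical_unit(input: &str) -> ParseResult<'_, String> {
    let body = input.strip_prefix('^')?;
    let end = body.find(|c: char| c == '^' || c == '$')?;
    if end == 0 || !body[end..].starts_with('$') {
        return None;
    }
    Some((&body[end + 1..], body[..end].to_string()))
}

/// Parses zero or more `item`s separated by `sep`.
fn separated_list0<'a, S, O>(
    sep: impl Fn(&'a str) -> ParseResult<'a, S>,
    item: impl Fn(&'a str) -> ParseResult<'a, O>,
) -> impl Fn(&'a str) -> ParseResult<'a, Vec<O>> {
    move |input: &'a str| {
        let mut items = Vec::new();
        // An empty list is a valid result for the "0" variant.
        let mut rest = match item(input) {
            Some((rest, first)) => {
                items.push(first);
                rest
            }
            None => return Some((input, items)),
        };
        // Continue only while a separator is followed by another item;
        // a trailing separator is left unconsumed.
        while let Some((after_sep, _)) = sep(rest) {
            match item(after_sep) {
                Some((after_item, value)) => {
                    items.push(value);
                    rest = after_item;
                }
                None => break,
            }
        }
        Some((rest, items))
    }
}

fn parse_stream(input: &str) -> ParseResult<'_, Vec<String>> {
    separated_list0(space1, lexical_unit)(input)
}
```

In this sketch, parsing stops quietly at the first thing that is not a separator followed by a lexical unit, and the unparsed remainder is returned to the caller. A caller that needs the whole input consumed has to check that the remainder is empty.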
I think this is enough for showing examples. The rest of the parser is a combination of alt, escaped_transform, tuple, etc. Having done all this, I feel that Nom is easier than Lex/Yacc, or even Whittle, at least for this task.
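As a rough idea of the job escaped_transform does, here is a standard-library sketch that reads a surface form in which reserved characters are escaped with a backslash. It assumes that a backslash escapes the next character and that only ^ and $ end the form; the escaped_form function is hypothetical and is not taken from Nom or from reinars.

```rust
/// Reads text up to the first unescaped '^' or '$', replacing each "\x" with "x".
/// Returns (remaining input, unescaped text). Returns None if nothing was read
/// or if the input ends right after a backslash.
fn escaped_form(input: &str) -> Option<(&str, String)> {
    let mut out = String::new();
    let mut chars = input.char_indices();
    while let Some((i, c)) = chars.next() {
        match c {
            // Take the character after the backslash literally.
            '\\' => out.push(chars.next()?.1),
            // An unescaped delimiter ends the form and stays in the remaining input.
            '^' | '$' => {
                return if out.is_empty() {
                    None
                } else {
                    Some((&input[i..], out))
                };
            }
            _ => out.push(c),
        }
    }
    if out.is_empty() {
        None
    } else {
        Some(("", out))
    }
}
```

The function returns an owned String rather than a &str because the unescaped text can differ from the input, which is also why the surface form ends up as a String in the struct.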