Using Nom – a parser combinator library

I wanted to create a parser for Apertium Stream. In 2014, I used Whittle in Ruby. If this year were 2001, I would use Lex/Yacc. Anyway, this year is 2021. I wanted to create this parser in Rust. I tried to find what is similar to Lex/Yacc. I found Rust-Peg. I found a link to Nom from Rust-Peg's document. My first impression was Nom example is easy to read. At least, its document claimed Nom is fast.

Apertium Stream format is quite complex, and I didn't know exactly how to use Nom. So I started from an easy case. My simplified Apertium stream is a list of lexical units. A lexical unit looks like this:

^surface_form$

Btw, I didn't test my source code on this post. If you want a runnable example, please check https://github.com/veer66/reinars.

I created a function to match a lexical unit first. It looks like this:

fn parse_lexical_unit(input: &str) -> IResult<&str, &str> {
    let mut parse = delimited(tag("^"), is_not("^$"), tag("$"));
    parse(input)
}

By running parselexicalunit(“^cat$”), it returns Ok((“”, “cat”)).

I hopefully improve by returning a Lexical Unit struct instead of &str.

#[derive(Debug)]
struct LexicalUnit {
    surface_form: String
}

fn parse_lexical_unit(input: &str) -> IResult<&str, LexicalUnit> {
    let mut parse = delimited(tag("^"), is_not("^$"), tag("$"));
    parse(input).map(|(i,o)| (i, LexicalUnit { surface_form: String::from(o) }))
}

“delimited” helps me to match ^ at the beginning and $ at the end. I wanted to capture whatever, which is not ^ or $. So I use is_not(“^$”). Can it be more straightforward?

When I ran parselexicalunit(“^cat$”), I get Ok((“”, LexicalUnit { surface_form: “cat” })) instead. 😃

Then I created a function for parsing the simplified stream.

fn parse_stream(input: &str) -> IResult<&str, Vec<LexicalUnit>> {
    let mut parse = separated_list0(space1, parse_lexical_unit);
    parse(input)
}

In the parsestream function, I use parselexicalunit, which I created before, in separatedlist0. separatedlist0 is for capturing the list, which in this case, the list is the list of lexical units parsed by parselexical_unit; and space1, which is one or more spaces, separate the list.

By running parse_stream(“^I$ ^eat$ ^rice$”), I get:

Ok(("", [LexicalUnit { surface_form: "I" }, 
             LexicalUnit { surface_form: "eat" }, 
             LexicalUnit { surface_form: "rice" }]))

I think this is enough for showing examples. The rest of the parser is the combination of alt, escaped_transform tuple, etc. By doing all these, I feel that this is easier than using Lex/Yacc or even Whittle at least for this task.