How to parse until a range of tags #1712

frenetisch-applaudierend · 2023-11-26T11:52:54Z

I would like to parse arbitrary text with embedded sequences which are delimited by different tags into their parts. E.g.

Test <#embedded sequence 1#> and (*embedded sequence 2*)

should be parsed to Text("Test ") Embedded1("embedded sequence 1") Text(" and ") Embedded2("embedded sequence 2"). Ideally, all strings in the tokens should be borrowed from the input string.

The embedded sequences are straightforward, but I am unable to specify the parser for the Text tokens. Is it possible to take_until any one of several tags is encountered?

coalooball · 2023-12-11T07:56:24Z

Hello @frenetisch-applaudierend.
I think the fifth chapter of The Nom Guide (Nominomicon) may be able to address your question.

frenetisch-applaudierend · 2023-12-11T13:47:19Z

Hi @coalooball

Thanks for the link, I hadn't seen that one before!

However, I don't think it applies to my use case, since the parsers mentioned there only allow a predicate on single characters. I would need a predicate built from different parsers (i.e. take_until(tag("<#").or(tag("(*")))), but this does not seem to be handled (or I did not see it).
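
For illustration, the behaviour being asked for here can be sketched with the standard library alone. The helper name `take_until_any` is hypothetical and is not part of nom; this is only a minimal sketch of "take text until the earliest of several tags", assuming `&str` input:

```rust
/// Hypothetical helper (not a nom function): splits `input` at the earliest
/// occurrence of any tag in `tags`.
/// Returns `(text_before_tag, rest_starting_at_tag)`. If no tag occurs,
/// the whole input is returned as text and the rest is empty.
fn take_until_any<'a>(input: &'a str, tags: &[&str]) -> (&'a str, &'a str) {
    // Byte offset of the first match of any tag, or the input length if none.
    let cut = tags
        .iter()
        .filter_map(|tag| input.find(*tag))
        .min()
        .unwrap_or(input.len());
    // `find` returns offsets on char boundaries, so `split_at` cannot panic here.
    input.split_at(cut)
}
```

Both returned slices borrow from `input`, so nothing is copied. For example, calling it with the tags `["<#", "(*"]` on the sample input would split just before `<#`.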

coalooball · 2023-12-13T05:42:22Z

Hello again!
take_until really doesn't work that way: it accepts a single fixed tag rather than a parser, much like a terminal node in BNF. I suppose you could use terminated to extract the arbitrary text instead.
Here is my approach, which is a bit cumbersome; I don't know whether there is a more concise one:

use nom::{
    branch::alt,
    bytes::complete::{tag, take_till, take_while1},
    character::{is_alphanumeric, is_space},
    sequence::{delimited, terminated},
    IResult,
};

// True for the inner delimiter bytes `*` and `#`.
fn is_delimiter(s: u8) -> bool {
    s == b'*' || s == b'#'
}

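// Matches `<` or `(`, then `#` or `*`, the body up to the next `#` or `*`,
// and the closing pair; returns the body. Note that the opening and closing
// delimiters are not checked against each other.
fn embedded_sequence(s: &[u8]) -> IResult<&[u8], &[u8]> {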
    delimited(
        alt((tag(b"<"), tag(b"("))),
        delimited(
            alt((tag(b"#"), tag(b"*"))),
            take_till(is_delimiter),
            alt((tag(b"#"), tag(b"*"))),
        ),
        alt((tag(b">"), tag(b")"))),
    )(s)
}

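// Takes a run of alphanumeric/space text that is immediately followed by an
// embedded sequence; returns the text and discards the sequence.
fn parse(s: &[u8]) -> IResult<&[u8], &[u8]> {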
    terminated(
        take_while1(|x| is_alphanumeric(x) || is_space(x)),
        embedded_sequence,
    )(s)
}

fn main() {}

#[test]
fn test_embedded_sequence() {
    assert_eq!(
        embedded_sequence(b"<#embedded sequence 1#>111").unwrap(),
        (b"111".as_ref(), b"embedded sequence 1".as_ref())
    );
    assert_eq!(
        parse(b"Test <#embedded sequence 1#> and (*embedded sequence 2*)").unwrap(),
        (b" and (*embedded sequence 2*)".as_ref(), b"Test ".as_ref())
    )
}
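
For comparison, the full tokenization asked about in the original post can be sketched without nom, using only the standard library. This is not the approach above and not a nom API: the `Token` enum and the `tokenize` function are hypothetical names, and the sketch assumes `&str` input and that an unclosed sequence should be reported as a failure (`None`):

```rust
/// Hypothetical token type; every variant borrows from the input string.
#[derive(Debug, PartialEq)]
enum Token<'a> {
    Text(&'a str),
    Embedded1(&'a str), // contents of <# ... #>
    Embedded2(&'a str), // contents of (* ... *)
}

/// Splits `input` into tokens. Returns `None` if an embedded sequence
/// is opened but never closed.
fn tokenize(input: &str) -> Option<Vec<Token<'_>>> {
    let mut tokens = Vec::new();
    let mut rest = input;
    while !rest.is_empty() {
        if let Some(after) = rest.strip_prefix("<#") {
            let end = after.find("#>")?;
            tokens.push(Token::Embedded1(&after[..end]));
            rest = &after[end + 2..];
        } else if let Some(after) = rest.strip_prefix("(*") {
            let end = after.find("*)")?;
            tokens.push(Token::Embedded2(&after[..end]));
            rest = &after[end + 2..];
        } else {
            // Plain text runs up to the earliest opening tag, or to the end.
            // `rest` does not start with a tag here, so `cut` is at least 1.
            let cut = ["<#", "(*"]
                .iter()
                .filter_map(|tag| rest.find(*tag))
                .min()
                .unwrap_or(rest.len());
            let (text, tail) = rest.split_at(cut);
            tokens.push(Token::Text(text));
            rest = tail;
        }
    }
    Some(tokens)
}
```

With this sketch, the " and " between the two embedded sequences comes out as its own Text token, and no string data is copied. The same "earliest opening tag" search could presumably be wrapped in a custom nom parser function, but that is left out here.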