
RDF Parser considered harmful (or at least the iteration part is really ugly) #98

Open
phillord opened this issue May 31, 2024 · 2 comments · Fixed by #99 · May be fixed by #102
Labels
API This concerns horned as a library enhancement New feature or request refactor

Comments

@phillord
Owner

The RDF parser makes multiple passes through both the bnode and the simple triples, and each time
it does so the code is repeated -- something like this:

for (k, v) in std::mem::take(&mut self.bnode) {
    match v.as_slice() {
        [[_, Term::RDF(VRDF::First), val],
         [_, Term::RDF(VRDF::Rest), Term::BNode(bnode_id)],
         // Some sequences have a Type List, some do not
         ..
        ] => {
            let some_seq = self.bnode_seq.remove(bnode_id);
            if let Some(mut seq) = some_seq {
                seq.push(val.clone());
                self.bnode_seq.insert(k.clone(), seq);
                extended = true;
            } else {
                self.bnode.insert(k, v);
            }
        }
        _ => {
            self.bnode.insert(k, v);
        }
    };
}

I think we can replace this whole thing with a partition_map (or a
map followed by a partition, which would avoid bringing in itertools as a dependency).
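The "map followed by a partition" route can be done with std alone. A minimal sketch, assuming nothing about horned-owl's types: map each item into a `Result` (Ok = handled, Err = leftover), then partition on `Result::is_ok`; the `split` function and the even/odd rule here are purely illustrative stand-ins.

```rust
// Sketch only: emulating partition_map with std, via map-into-Result
// followed by partition on is_ok. Evens stand in for "parsed" items,
// odds for leftover triples.
fn split(items: Vec<i32>) -> (Vec<i32>, Vec<i32>) {
    let (ok, err): (Vec<_>, Vec<_>) = items
        .into_iter()
        .map(|n| if n % 2 == 0 { Ok(n) } else { Err(n) })
        .partition(Result::is_ok);
    (
        ok.into_iter().map(Result::unwrap).collect(),
        err.into_iter().map(Result::unwrap_err).collect(),
    )
}

fn main() {
    println!("{:?}", split((1..=6).collect()));
}
```

The cost relative to itertools' partition_map is the intermediate `Result` wrapping and the second pass to unwrap; the benefit is no extra dependency.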

Partition map (from itertools) looks roughly like this:

partition_map(impl Iterator<Item = A>, impl FnMut(A) -> Either<L, R>) -> (collection of L, collection of R)

L would be the thing we produce (Atom, Axiom, whatever),
R would be Result<A>.

L would then be added to the output,
R would be added back to whatever field of self it came from in the first place.

We would then have:

let (axiom, triples) = triples.partition_map(
   |t|

   match &t {
      [blah, blah, blah] => {
          // some_ok! would need to return Left or Right(Err), which should be okay
          some_ok!(Left(Axiom(blah, blah, blah)))
      }
      _ => {
          // Can we take the t here?
          Ok(Right(&t))
      }
   }
)

self.axioms.extend(axiom);
// We would have to clone the triple here because we only have a reference
self.triples.extend(triples.map(|e| e?.clone()));

// or, if we want to capture errors rather than crap out (which is something
// we can't do at the moment):
let (triples, errors) = self.triples.partition(is_okay);
self.triples.extend(triples);
self.errors.extend(errors);

which is rather neater than what we have at the moment
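To make the shape concrete: the repeated loops above all follow a drain/partition/re-insert pattern, which std's `partition` can express in one pass. A hedged sketch, with a made-up `one_pass` function and a "positive value" test standing in for the real triple pattern match:

```rust
use std::collections::HashMap;

// Sketch of the drain/partition/re-insert shape: take the map, split the
// entries into ones consumable now and ones to keep for a later pass.
// The predicate here is a placeholder for the real slice-pattern match.
fn one_pass(pending: &mut HashMap<String, i32>, consumed: &mut Vec<i32>) {
    let (ready, keep): (Vec<_>, Vec<_>) = std::mem::take(pending)
        .into_iter()
        .partition(|(_, v)| *v > 0);
    consumed.extend(ready.into_iter().map(|(_, v)| v));
    pending.extend(keep); // leftovers go back for the next pass
}

fn main() {
    let mut pending: HashMap<String, i32> =
        [("a".into(), 1), ("b".into(), -2), ("c".into(), 3)]
            .into_iter()
            .collect();
    let mut consumed = Vec::new();
    one_pass(&mut pending, &mut consumed);
    consumed.sort();
    println!("{:?} {}", consumed, pending.len());
}
```

The side-effecting inserts into `self.bnode_seq` would still have to happen somewhere, so the mapping closure is unlikely to be pure in practice.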

@filippodebortoli
Collaborator

To get myself acquainted with the rdf::reader module, I am currently refactoring a few bits that could be replaced by e.g. deriving traits or using methods already implemented in vocab (see #97 and #99).

I agree with your observation.
I am also wondering if the code could be refactored to exploit the fact that the triple parser is streaming: keep a HashMap of triple sets indexed by subject, and deserialize whenever a set contains, e.g., all the triples necessary to instantiate a declaration.
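A minimal sketch of that bucketing idea, with an invented `stream_buckets` function and a single `rdf:type owl:Class` triple standing in for a real "complete declaration" test (none of this is horned-owl code):

```rust
use std::collections::HashMap;

type Triple = (String, String, String);

// Sketch: bucket streamed triples by subject, and flush a bucket as soon
// as it is "complete". Completeness here is faked as "contains an
// rdf:type owl:Class triple"; a real parser would match a full pattern.
fn stream_buckets(stream: Vec<Triple>) -> (Vec<Triple>, HashMap<String, Vec<Triple>>) {
    let mut by_subject: HashMap<String, Vec<Triple>> = HashMap::new();
    let mut declarations = Vec::new();
    for (s, p, o) in stream {
        let key = s.clone();
        by_subject.entry(key.clone()).or_default().push((s, p, o));
        let complete = by_subject[&key]
            .iter()
            .any(|(_, p, o)| p == "rdf:type" && o == "owl:Class");
        if complete {
            declarations.extend(by_subject.remove(&key).unwrap());
        }
    }
    (declarations, by_subject)
}

fn main() {
    let stream: Vec<Triple> = vec![
        ("s1".into(), "rdf:type".into(), "owl:Class".into()),
        ("s2".into(), "rdfs:label".into(), "B".into()),
    ];
    let (decls, pending) = stream_buckets(stream);
    println!("{} {}", decls.len(), pending.len());
}
```

The open question, raised below, is how often a bucket can actually be judged complete before the whole stream has been read.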

@phillord
Owner Author

phillord commented Jun 4, 2024

https://github.com/phillord/horned-owl/tree/feature/rdf-with-partition-cleanup

I've started looking at this. There are definitely some cleanups that I can do. The idea of partition_map may or may not work; so far it makes some things simpler, but a lot of loops modify other data structures by side-effect. I will rewrite a few more functions so I'm clear whether it makes sense or not.

The problem with streaming is that there are many places where you have to parse the whole stream first in order to interpret later triples. This is, for example, why I parse all declarations (in the entire import closure) first: many of the axiom patterns depend on knowing, for example, the kind of a property. So I suspect that streaming RDF is not going to bring a lot of benefits.
