
Feature: Append new submission data by chunks to the triple store #122

Merged

Conversation

@syphax-bouazzouni commented Jun 30, 2022

This is an optimization PR.
Currently, in the parsing process, after the RDF generation step, we do a "delete and append" to the triple store:

      # Drop the submission graph, then upload the whole triples file to the
      # triple store in a single request.
      def delete_and_append(triples_file_path, logger, mime_type = nil)
        Goo.sparql_data_client.delete_graph(self.id)
        Goo.sparql_data_client.put_triples(self.id, triples_file_path, mime_type)
        logger.info("Triples #{triples_file_path} appended in #{self.id.to_ntriples}")
        logger.flush
      end

In the append-triples step, we transform the XRDF into Turtle in a temporary file, then do a single POST request to the triple store with the Turtle file as the request body.

The issue is that with a big file (>= 1 GB), as in our use case here (ontoportal-lirmm/ontologies_linked_data#15), submitting the whole file content in a single HTTP request is not efficient.

This PR changes the function append_triples_no_bnodes to do the append in chunks of 500,000 lines (triples) per request; a sketch of the idea is shown below.
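
A minimal sketch of the chunking idea (not the exact code of this PR), assuming the graph contains no blank nodes so every line of the Turtle file is a self-contained triple; append_chunk is a hypothetical stand-in for the Goo call that sends one chunk of Turtle to the submission graph:

      CHUNK_SIZE = 500_000

      # Stream the Turtle file line by line and upload it in slices of
      # CHUNK_SIZE lines, so no request body ever holds the whole file.
      def append_triples_in_chunks(graph_id, turtle_file_path, mime_type)
        chunk = []
        File.foreach(turtle_file_path) do |line| # reads one line at a time
          chunk << line
          next if chunk.size < CHUNK_SIZE
          append_chunk(graph_id, chunk.join, mime_type) # hypothetical upload of one chunk
          chunk.clear
        end
        append_chunk(graph_id, chunk.join, mime_type) unless chunk.empty?
      end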

With the use case of TAXREF-LD:

  • Size: 870.3 MB
  • Parsed file size: 1.71 GB
  • Turtle version appended to the triple store: 2.1 GB

Before the change, we had:

Mar  1 12:25:38 agroportal 4store[28359]: httpd.c:598 starting add to http://data.bioontology.org/ontologies/TAXREF-LD/submissions/2 (2179291409 bytes)
Mar  1 12:25:38 agroportal 4s-httpd: 4store[28359]: httpd.c:598 starting add to http://data.bioontology.org/ontologies/TAXREF-LD/submissions/2 (2179291409 bytes)
Mar  1 12:25:43 agroportal 4store[28359]: import.c:167 Fatal error: out of dynamic memory in turtle_lexer__scan_bytes() at 1
Mar  1 12:25:43 agroportal 4s-httpd: 4store[28359]: import.c:167 Fatal error: out of dynamic memory in turtle_lexer__scan_bytes() at 1
Mar  1 12:25:43 agroportal 4store[12682]: httpd.c:1979 child 28359 terminated by signal 11
Mar  1 12:25:43 agroportal 4s-httpd: 4store[12682]: httpd.c:1979 child 28359 terminated by signal 11

After the change, it worked, and we got the following benchmark:

  • Objects freed: 572924847
  • Time: 734.6 seconds
  • Memory usage: 618.36 MB (before, memory usage was dependent on, and equal to, the size of the appended Turtle version of the file; now it will never exceed 700 MB)

Reference: https://tjay.dev/howto-working-efficiently-with-large-files-in-ruby/

@jonquet commented Jul 1, 2022

CC: @alexskr, with whom I discussed this problem (loading huge files into 4store) last April.
The proposed solution seems like good practice for groups (like us) that run the Appliance and therefore host 4store on the same machine.

@alexskr (Member) commented Sep 14, 2022

The current non-chunked RDF upload approach appropriately handles situations where the triple store rejects the generated RDF due to malformed data, such as mismatched types that owlapi doesn't catch (see ncbo/bioportal-project#253).
The whole operation fails, so the triple store doesn't end up with a partially loaded graph. With chunked uploads, however, one of the chunks could fail partway through and leave an incomplete graph in the triple store. Do you have any mitigation mechanisms in place for this kind of problem?

@syphax-bouazzouni (Author) commented
Hi @alexskr,

I think that if one of the chunks fails, a RestClient::BadRequest will be raised and stop the process (as in ncbo/bioportal-project#253).

And when we reprocess the submission, it will delete the remaining partial graph and create a new empty one before appending the chunks again from the start; see the sketch below.
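
For illustration, a hedged sketch of that recovery path (not the actual code of this PR), reusing Goo.sparql_data_client.delete_graph from the snippet above and the hypothetical append_triples_in_chunks helper from the earlier sketch; RestClient::BadRequest comes from the rest-client gem:

      begin
        Goo.sparql_data_client.delete_graph(graph_id) # always start from an empty graph
        append_triples_in_chunks(graph_id, turtle_file_path, mime_type)
      rescue RestClient::BadRequest
        # A failed chunk aborts the whole append; the next processing run deletes
        # the partial graph again and re-appends every chunk from scratch.
        raise
      end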

@syphax-bouazzouni changed the title from "Append new submission data by chunks to the triple store" to "Feature: Append new submission data by chunks to the triple store" on Jan 19, 2023
@alexskr changed the base branch from master to develop on February 10, 2024, 01:08
@alexskr merged commit 5caeb0d into ncbo:develop on Mar 14, 2024