
Feature: Append new submission data by chunks to the triple store #122

Merged

Conversation

@syphax-bouazzouni commented Jun 30, 2022

This is an optimization PR.
Currently, in the parsing process, after the RDF generation step, we do a "delete and append" to the triple store:

      # Drop the submission graph, then upload the whole triples file to the
      # triple store in a single request.
      def delete_and_append(triples_file_path, logger, mime_type = nil)
        Goo.sparql_data_client.delete_graph(self.id)
        Goo.sparql_data_client.put_triples(self.id, triples_file_path, mime_type)
        logger.info("Triples #{triples_file_path} appended in #{self.id.to_ntriples}")
        logger.flush
      end

In the append-triples step, we transform the XRDF into Turtle in a temporary file, then do a single POST request to the triple store with the Turtle file as the request body.

The issue is that with a big file (>= 1 GB), as in our use case here (ontoportal-lirmm/ontologies_linked_data#15), submitting the whole file content in a single HTTP request is not efficient.

This PR changes the function append_triples_no_bnodes to do the append in chunks of 500,000 lines (triples) per request; a sketch of the idea is shown below.
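
A minimal sketch of the chunking idea (not the exact code of this PR), assuming the graph contains no blank nodes so every line of the Turtle file is a self-contained triple; append_chunk is a hypothetical stand-in for the Goo call that sends one chunk of Turtle to the submission graph:

      CHUNK_SIZE = 500_000

      # Stream the Turtle file line by line and upload it in slices of
      # CHUNK_SIZE lines, so no request body ever holds the whole file.
      def append_triples_in_chunks(graph_id, turtle_file_path, mime_type)
        chunk = []
        File.foreach(turtle_file_path) do |line| # reads one line at a time
          chunk << line
          next if chunk.size < CHUNK_SIZE
          append_chunk(graph_id, chunk.join, mime_type) # hypothetical upload of one chunk
          chunk.clear
        end
        append_chunk(graph_id, chunk.join, mime_type) unless chunk.empty?
      end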

With the use case of TAXREF-LD:

  • Size: 870.3 MB
  • Parsed file size: 1.71 GB
  • Turtle version appended to the triple store: 2.1 GB

Before the change, we had:

Mar  1 12:25:38 agroportal 4store[28359]: httpd.c:598 starting add to http://data.bioontology.org/ontologies/TAXREF-LD/submissions/2 (2179291409 bytes)
Mar  1 12:25:38 agroportal 4s-httpd: 4store[28359]: httpd.c:598 starting add to http://data.bioontology.org/ontologies/TAXREF-LD/submissions/2 (2179291409 bytes)
Mar  1 12:25:43 agroportal 4store[28359]: import.c:167 Fatal error: out of dynamic memory in turtle_lexer__scan_bytes() at 1
Mar  1 12:25:43 agroportal 4s-httpd: 4store[28359]: import.c:167 Fatal error: out of dynamic memory in turtle_lexer__scan_bytes() at 1
Mar  1 12:25:43 agroportal 4store[12682]: httpd.c:1979 child 28359 terminated by signal 11
Mar  1 12:25:43 agroportal 4s-httpd: 4store[12682]: httpd.c:1979 child 28359 terminated by signal 11

After the change, it worked, and we got the following benchmark:

  • Objects freed: 572924847
  • Time: 734.6 seconds
  • Memory usage: 618.36 MB (before, memory usage was dependent on, and equal to, the size of the appended Turtle version of the file; now it will never exceed 700 MB)

Reference: https://tjay.dev/howto-working-efficiently-with-large-files-in-ruby/

@jonquet commented Jul 1, 2022

CC: @alexskr, with whom I discussed this problem (loading huge files into 4store) last April.
The proposed solution seems like good practice for groups (like us) that run the Appliance and therefore host 4store on the same machine.

@alexskr (Member) commented Sep 14, 2022

The current non-chunked RDF upload approach appropriately handles situations where the triple store rejects the generated RDF due to malformed data, such as mismatched types that owlapi doesn't catch (see ncbo/bioportal-project#253).
The whole operation fails, so the triple store doesn't end up with a partially loaded graph. With chunked uploads, however, one of the chunks could fail partway through and leave an incomplete graph in the triple store. Do you have any mitigation mechanisms in place for this kind of problem?

@syphax-bouazzouni (Author) commented
Hi @alexskr,

I think that if one of the chunks fails, a RestClient::BadRequest will be raised and stop the process (as in ncbo/bioportal-project#253).

And when we reprocess the submission, it will delete the remaining partial graph and create a new empty one before appending the chunks again from the start; see the sketch below.
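
For illustration, a hedged sketch of that recovery path (not the actual code of this PR), reusing Goo.sparql_data_client.delete_graph from the snippet above and the hypothetical append_triples_in_chunks helper from the earlier sketch; RestClient::BadRequest comes from the rest-client gem:

      begin
        Goo.sparql_data_client.delete_graph(graph_id) # always start from an empty graph
        append_triples_in_chunks(graph_id, turtle_file_path, mime_type)
      rescue RestClient::BadRequest
        # A failed chunk aborts the whole append; the next processing run deletes
        # the partial graph again and re-appends every chunk from scratch.
        raise
      end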

@syphax-bouazzouni changed the title from "Append new submission data by chunks to the triple store" to "Feature: Append new submission data by chunks to the triple store" on Jan 19, 2023
@alexskr changed the base branch from master to develop on February 10, 2024, 01:08
@alexskr merged commit 5caeb0d into ncbo:develop on Mar 14, 2024