This repository has been archived by the owner on Feb 5, 2020. It is now read-only.

UTF-8 is not supported? #26

Closed
migalkin opened this issue Dec 4, 2016 · 13 comments

@migalkin

migalkin commented Dec 4, 2016

I have the FedBench query CD4:

SELECT ?actor ?news WHERE {
  ?film purl:title 'Tarzan' .
  ?film linkedMDB:actor ?actor .
  ?actor owl:sameAs ?x.
  ?y owl:sameAs ?x .
  ?y nytimes:topicPage ?news }

which has been rewritten to execute the following triple pattern against the LinkedMDB endpoint on an LDF server:

SELECT ?actor ?x WHERE { ?actor <http://www.w3.org/2002/07/owl#sameAs> ?x} LIMIT 100000 OFFSET 0

The Client throws the error:

WARNING TriplePatternIterator Unexpected "<http://dbpedia.org/resource/Espen_Skj\\u00C3\\u00B8nberg>," on line 47.
events.js:160
      throw er; // Unhandled 'error' event
      ^

Error: Unexpected "<http://dbpedia.org/resource/Espen_Skj\\u00C3\\u00B8nberg>," on line 47.
    at N3Lexer._syntaxError (/ldf_rest/node_modules/n3/lib/N3Lexer.js:358:12)
    at reportSyntaxError (/ldf_rest/node_modules/n3/lib/N3Lexer.js:325:54)
    at N3Lexer._tokenizeToEnd (/ldf_rest/node_modules/n3/lib/N3Lexer.js:311:18)
    at TrigFragmentIterator._parseData (/ldf_rest/node_modules/n3/lib/N3Lexer.js:393:16)
    at TrigFragmentIterator.TurtleFragmentIterator._transform (/ldf_rest/node_modules/ldf-client/lib/triple-pattern-fragments/TurtleFragmentIterator.js:47:8)
    at Immediate.readAndTransform (/ldf_rest/node_modules/asynciterator/asynciterator.js:959:12)
    at runCallback (timers.js:643:20)
    at tryOnImmediate (timers.js:610:5)
    at processImmediate [as _immediateCallback] (timers.js:582:5)

Does this mean that the LDF Client does not support UTF-8?

@RubenVerborgh
Member

Hi @migalkin, no, it means that the dataset was wrongly encoded. Note that

<http://dbpedia.org/resource/Espen_Skj\\u00C3\\u00B8nberg>

is an invalid URI in Turtle syntax; it should be

<http://dbpedia.org/resource/Espen_Skj\u00C3\u00B8nberg>

My guess is that on the server side, you have used an HDT file to serve LinkedMDB? And that this HDT file was generated with rdf2hdt in.nt out.hdt rather than rdf2hdt -f turtle in.nt out.hdt? The -f turtle option is necessary, because the N-Triples parser is broken.
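
To check whether a dump or a server response contains IRIs with such doubled escapes, a scan along the following lines may help. This is only a sketch and not part of the LDF client or HDT tooling; the function name `find_double_escaped` and the `example.org` subjects are made up for illustration.

```python
import re

# An IRI in angle brackets that contains two literal backslashes
# followed by a u/U hex escape, e.g. <...Skj\\u00C3...>.
DOUBLE_ESCAPE = re.compile(
    r"<[^<>\s]*\\\\(?:u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8})[^<>\s]*>"
)


def find_double_escaped(lines):
    """Return (line_number, iri) pairs for IRIs with a doubled backslash escape."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for match in DOUBLE_ESCAPE.finditer(line):
            hits.append((number, match.group(0)))
    return hits


# Hypothetical sample: line 1 uses the doubled escape, line 2 the single one.
sample = [
    r"<http://example.org/a> <http://www.w3.org/2002/07/owl#sameAs> "
    r"<http://dbpedia.org/resource/Espen_Skj\\u00C3\\u00B8nberg> .",
    r"<http://example.org/b> <http://www.w3.org/2002/07/owl#sameAs> "
    r"<http://dbpedia.org/resource/Espen_Skj\u00C3\u00B8nberg> .",
]

for number, iri in find_double_escaped(sample):
    print(f"line {number}: {iri}")
```

Only the first sample line is reported, since the second uses the single-backslash form that the Turtle grammar accepts.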

RubenVerborgh self-assigned this Dec 4, 2016
@migalkin
Author

migalkin commented Dec 4, 2016

Thank you @RubenVerborgh
I used the -f turtle option and now the query works fine [and the HDT file is 20 times smaller =) ]

@RubenVerborgh
Member

Excellent 😄

@RubenVerborgh
Member

Do double-check that all the triples you want are in there, though: hdtInfo out.hdt should report the correct total number of triples. When SERD encounters an error, the conversion process stops (usually with an error message, but unfortunately sometimes without one).

@migalkin
Author

migalkin commented Dec 4, 2016

@RubenVerborgh actually you are right: the broken NT parser produced an HDT file with all the triples from the LinkedMDB dump, but
rdf2hdt -f turtle linkedmdb.nt linkedmdb.hdt
results in only 160142 triples.

Here is what I ran:

rdf2hdt -f turtle linkedmdb-latest-dump.nt linkedmdb-latest-dump.hdt            
RDF format: turtle
invalid IRI character `?' (escape %8B7E)essed.: 0 % / 0 %                      
invalid IRI character `?'00 K triples processed.: 0 % / 0 %                      
invalid IRI character `@' (escape %8B7E)ed.: 0 % / 50 %                      
invalid IRI character `@'K triples processed.: 0 % / 50 %                      
HDT Successfully generated.                                           
Total processing time: Clock(1 sec 836 ms 366 us)  User(1 sec 762 ms 380 us)  System(71 ms 897 us)

Then running hdtInfo:

<file://linkedmdb-latest-dump.nt> <http://rdfs.org/ns/void#triples> "160142" .
<file://linkedmdb-latest-dump.nt> <http://rdfs.org/ns/void#properties> "8" .
<file://linkedmdb-latest-dump.nt> <http://rdfs.org/ns/void#distinctSubjects> "149209" .
<file://linkedmdb-latest-dump.nt> <http://rdfs.org/ns/void#distinctObjects> "52182" .

The original linkedmdb dump has:

 wc -l linkedmdb-latest-dump.nt 
6148121 linkedmdb-latest-dump.nt

The problem is that the HDT parser doesn't produce any error and reports that the file has been created successfully.
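
The comparison between the dump and the hdtInfo output can be automated roughly as follows. This is a sketch under two assumptions: the dump holds one triple per line (as N-Triples does), and the hdtInfo output contains a void#triples statement in the form shown above. The function names are hypothetical. Duplicate triples in the dump may also lower the HDT count, so a small difference does not necessarily indicate a parse failure.

```python
import re

# Matches the void:triples statement as printed in the hdtInfo output above.
VOID_TRIPLES = re.compile(r'<http://rdfs\.org/ns/void#triples>\s+"(\d+)"')


def count_nt_statements(lines):
    """Count lines that are neither blank nor comments (one triple per line)."""
    return sum(
        1 for line in lines
        if line.strip() and not line.lstrip().startswith("#")
    )


def void_triples(hdtinfo_text):
    """Return the void:triples value from hdtInfo-style text, or None if absent."""
    match = VOID_TRIPLES.search(hdtinfo_text)
    return int(match.group(1)) if match else None


def missing_triples(nt_lines, hdtinfo_text):
    """Return how many statements of the dump are not accounted for in the HDT file."""
    reported = void_triples(hdtinfo_text)
    if reported is None:
        raise ValueError("no void:triples statement found in hdtInfo output")
    return count_nt_statements(nt_lines) - reported
```

For a real dump, `nt_lines` could be an open file handle, for example `open("linkedmdb-latest-dump.nt", encoding="utf-8", errors="replace")`, and `hdtinfo_text` the captured output of hdtInfo. With the figures in this comment (6148121 lines versus 160142 triples), close to six million statements are unaccounted for.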

@RubenVerborgh
Member

The problem is that the HDT parser doesn't produce any error and reports that the file has been created successfully.

Yes, I just fixed that in rdfhdt/hdt-cpp@d3b02a9

The solution is to ensure that the input file is valid, by passing it through a tool such as SERD first.
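
As a rough pre-check before running a real validator, the characters of an IRI can be tested against the IRIREF rule of the Turtle and N-Triples grammars, which forbids control characters, spaces, and `<>"{}|^` as well as backtick and backslash, apart from `\uXXXX` and `\UXXXXXXXX` escapes. The sketch below is not SERD and does not replace it: it ignores what the escapes decode to and does not check that the IRI is absolute. The function name is hypothetical.

```python
import re

# Escapes that IRIREF allows: \uXXXX and \UXXXXXXXX.
UCHAR = re.compile(r"\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8}")
# Characters that IRIREF forbids outside of such escapes.
FORBIDDEN = re.compile(r'[\x00-\x20<>"{}|^`\\]')


def invalid_iri_chars(iri_body):
    """Return the forbidden characters found in an IRI (text between < and >)."""
    # Remove well-formed escapes first; any backslash left over is invalid.
    remainder = UCHAR.sub("", iri_body)
    return FORBIDDEN.findall(remainder)


# The single-backslash form passes, the doubled form leaves stray backslashes.
print(invalid_iri_chars(r"http://dbpedia.org/resource/Espen_Skj\u00C3\u00B8nberg"))
print(invalid_iri_chars(r"http://dbpedia.org/resource/Espen_Skj\\u00C3\\u00B8nberg"))
```

An empty list means that no forbidden character was found; a full validation of the dump still needs a proper parser.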

@migalkin
Author

migalkin commented Dec 4, 2016

@RubenVerborgh I used the regexps we found earlier to clean the entire LinkedMDB dump while retaining all the triples, so neither SERD nor the HDT parser throws an error any more, and the parsing went fine.
However, when I attach the new HDT file to the server, I get an error during setup:
This software cannot open this version of HDT File
I used the new version of the HDT C++ library you updated today.
Is this a server issue?

@RubenVerborgh
Member

Not a server issue, but possibly an outdated HDT-Node version. Can you post your HDT file somewhere so I can check?

@RubenVerborgh
Member

Never mind, I found a testcase myself. On it.

@migalkin
Author

migalkin commented Dec 4, 2016

In case you need it: https://drive.google.com/file/d/0B3uXlknE4eJrZ19hem03M1VKVkk/view?usp=sharing

@RubenVerborgh
Member

@migalkin I found the bug and proposed a fix: rdfhdt/hdt-cpp#43

Summary: you built your HDT file using the latest master, which writes an (in my opinion) incorrect version number into the HDT file. The stable branch does not have this problem.

@RubenVerborgh
Member

@migalkin This bug is now fixed; the latest version of hdt-cpp now generates compatible HDT files again.

@migalkin
Author

migalkin commented Dec 9, 2016

@RubenVerborgh great, thanks for the update
