UTF-8 is not supported? #26

migalkin · 2016-12-04T14:40:14Z

I have the FedBench query CD4:

SELECT ?actor ?news WHERE {
  ?film purl:title 'Tarzan' .
  ?film linkedMDB:actor ?actor .
  ?actor owl:sameAs ?x.
  ?y owl:sameAs ?x .
  ?y nytimes:topicPage ?news }

which has been rewritten to execute the following triple pattern against the LinkedMDB endpoint on an LDF server:

SELECT ?actor ?x WHERE { ?actor <http://www.w3.org/2002/07/owl#sameAs> ?x} LIMIT 100000 OFFSET 0

The Client throws the error:

WARNING TriplePatternIterator Unexpected "<http://dbpedia.org/resource/Espen_Skj\\u00C3\\u00B8nberg>," on line 47.
      events.js:160
     throw er; // Unhandled 'error' event
     ^

 Error: Unexpected "<http://dbpedia.org/resource/Espen_Skj\\u00C3\\u00B8nberg>," on line 47.
     at N3Lexer._syntaxError (/ldf_rest/node_modules/n3/lib/N3Lexer.js:358:12)
     at reportSyntaxError (/ldf_rest/node_modules/n3/lib/N3Lexer.js:325:54)
     at N3Lexer._tokenizeToEnd (/ldf_rest/node_modules/n3/lib/N3Lexer.js:311:18)
    at TrigFragmentIterator._parseData (/ldf_rest/node_modules/n3/lib/N3Lexer.js:393:16)
    at TrigFragmentIterator.TurtleFragmentIterator._transform (/ldf_rest/node_modules/ldf-client/lib/triple-pattern-fragments/TurtleFragmentIterator.js:47:8)
     at Immediate.readAndTransform (/ldf_rest/node_modules/asynciterator/asynciterator.js:959:12)
     at runCallback (timers.js:643:20)
     at tryOnImmediate (timers.js:610:5)
     at processImmediate [as _immediateCallback] (timers.js:582:5)

Does it mean that the LDF Client does not support UTF-8?

RubenVerborgh · 2016-12-04T14:44:50Z

Hi @migalkin, no, it means that the dataset was wrongly encoded. Note that

<http://dbpedia.org/resource/Espen_Skj\\u00C3\\u00B8nberg>

is an invalid URI in Turtle syntax; it should be

<http://dbpedia.org/resource/Espen_Skj\u00C3\u00B8nberg>

My guess is that on the server side, you have used an HDT file to serve LinkedMDB? And that this HDT file was generated with rdf2hdt in.nt out.hdt rather than rdf2hdt -f turtle in.nt out.hdt? The -f turtle option is necessary, because the N-Triples parser is broken.
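As a rough illustration of why the single-backslash form is the meaningful one, here is a minimal Python sketch that decodes the `\uXXXX` escapes of that IRI. The final repair step (re-reading the decoded characters as Latin-1 bytes and then as UTF-8) is an assumption about how the dump got mis-encoded, not something confirmed in this thread; `decode_uescapes` is a hypothetical helper name.

```python
import re

def decode_uescapes(iri: str) -> str:
    """Replace each \\uXXXX escape with the character it denotes."""
    return re.sub(
        r"\\u([0-9A-Fa-f]{4})",
        lambda m: chr(int(m.group(1), 16)),
        iri,
    )

# The single-backslash form from the comment above (raw string, so the
# backslashes are kept literally).
escaped = r"http://dbpedia.org/resource/Espen_Skj\u00C3\u00B8nberg"

# Decoding yields U+00C3 and U+00B8, i.e. two separate characters.
decoded = decode_uescapes(escaped)

# Assumption: those two code points are really the UTF-8 bytes C3 B8 that
# were read as Latin-1 somewhere upstream. Reversing that gives one character.
repaired = decoded.encode("latin-1").decode("utf-8")

print(decoded)
print(repaired)
```

If the assumption holds, the two escapes collapse into the single character `ø`, which suggests the problem lies in how the dataset was encoded rather than in UTF-8 support on the client.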

migalkin · 2016-12-04T14:54:48Z

Thank you @RubenVerborgh
I used the -f turtle option and now the query works fine [and the HDT file is 20 times smaller =) ]

RubenVerborgh · 2016-12-04T15:37:30Z

Excellent 😄

RubenVerborgh · 2016-12-04T15:38:28Z

=> Do double-check whether all the triples you want are in there, though (i.e., hdtInfo out.hdt should show the correct total number of triples). When SERD encounters an error, the conversion process stops (most of the time with an error message, but unfortunately sometimes without one).

migalkin · 2016-12-04T16:17:55Z

@RubenVerborgh actually you are right: the run with the broken N-Triples parser produced an HDT file with all the triples from the LinkedMDB dump, but
rdf2hdt -f turtle linkedmdb.nt linkedmdb.hdt
results in only 160142 triples.

Here is what I ran:

rdf2hdt -f turtle linkedmdb-latest-dump.nt linkedmdb-latest-dump.hdt            
RDF format: turtle
invalid IRI character `?' (escape %8B7E)essed.: 0 % / 0 %                      
invalid IRI character `?'00 K triples processed.: 0 % / 0 %                      
invalid IRI character `@' (escape %8B7E)ed.: 0 % / 50 %                      
invalid IRI character `@'K triples processed.: 0 % / 50 %                      
HDT Successfully generated.                                           
Total processing time: Clock(1 sec 836 ms 366 us)  User(1 sec 762 ms 380 us)  System(71 ms 897 us)

Then running hdtInfo:

<file://linkedmdb-latest-dump.nt> <http://rdfs.org/ns/void#triples> "160142" .
<file://linkedmdb-latest-dump.nt> <http://rdfs.org/ns/void#properties> "8" .
<file://linkedmdb-latest-dump.nt> <http://rdfs.org/ns/void#distinctSubjects> "149209" .
<file://linkedmdb-latest-dump.nt> <http://rdfs.org/ns/void#distinctObjects> "52182"

The original linkedmdb dump has:

 wc -l linkedmdb-latest-dump.nt 
6148121 linkedmdb-latest-dump.nt

The problem is that HDT parser doesn't produce any error and writes that the file has been created successfully.
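The comparison above (statement count in the dump versus the count reported by hdtInfo) can be sketched in Python as follows. This is only a sanity check under stated assumptions: it treats every non-blank, non-comment line of the N-Triples file as one triple, and it parses hdtInfo output that has already been captured as text. The function names are hypothetical.

```python
import re

def count_ntriples_statements(lines):
    """Count non-blank, non-comment lines (one triple per line in N-Triples)."""
    return sum(
        1 for line in lines
        if line.strip() and not line.lstrip().startswith("#")
    )

def triples_reported(hdt_info_text):
    """Extract the void#triples value from captured hdtInfo output, if present."""
    match = re.search(r'<http://rdfs\.org/ns/void#triples>\s+"(\d+)"', hdt_info_text)
    return int(match.group(1)) if match else None

def coverage(expected, actual):
    """Fraction of the expected triples that ended up in the HDT file."""
    return actual / expected

# Figures taken from this thread.
hdt_info = '<file://linkedmdb-latest-dump.nt> <http://rdfs.org/ns/void#triples> "160142" .'
ratio = coverage(6148121, triples_reported(hdt_info))
print(f"{ratio:.2%} of the dump made it into the HDT file")
```

With the numbers reported here, fewer than 3% of the lines in the dump ended up in the HDT file, even though the tool printed a success message.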

RubenVerborgh · 2016-12-04T16:21:32Z

> The problem is that HDT parser doesn't produce any error and writes that the file has been created successfully.

Yes, I just fixed that in rdfhdt/hdt-cpp@d3b02a9

The solution is to ensure that the input file is valid, by passing it through a tool such as SERD first.
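Before running a full validator, a quick pre-check can locate lines worth inspecting. The Python sketch below flags IRIs containing characters that the Turtle/N-Triples `IRIREF` grammar rule forbids; it is a rough heuristic and not a substitute for SERD. The function name and the `example.org` IRIs are hypothetical.

```python
import re

# Quoted literals are blanked out first so that '<' or '>' inside them
# is not mistaken for an IRI delimiter.
LITERAL = re.compile(r'"(?:[^"\\]|\\.)*"')
CANDIDATE = re.compile(r'<([^>]*)>')

# IRIREF body: no control characters or space (U+0000 to U+0020), none of
# < > " { } | ^ ` \ , except a backslash that starts a \uXXXX or \UXXXXXXXX escape.
VALID_IRI_BODY = re.compile(
    r'(?:[^\x00-\x20<>"{}|^`\\]|\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8})*'
)

def suspicious_iris(lines):
    """Yield (line_number, iri_body) for each IRI that breaks the IRIREF rule."""
    for number, line in enumerate(lines, start=1):
        without_literals = LITERAL.sub('""', line)
        for match in CANDIDATE.finditer(without_literals):
            if not VALID_IRI_BODY.fullmatch(match.group(1)):
                yield number, match.group(1)

good = r'<http://example.org/s> <http://example.org/p> <http://dbpedia.org/resource/Espen_Skj\u00C3\u00B8nberg> .'
bad = r'<http://example.org/s> <http://example.org/p> <http://dbpedia.org/resource/Espen_Skj\\u00C3\\u00B8nberg> .'

for number, iri in suspicious_iris([good, bad]):
    print(f"line {number}: {iri}")
```

Only the double-backslash line is reported, which matches the kind of IRI the client rejected at the start of this thread.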

migalkin · 2016-12-04T21:26:59Z

@RubenVerborgh I used the regexps we found before to clean the entire LinkedMDB dump while retaining all the triples, so neither SERD nor the HDT parser throws an error and the parsing went fine.
However, when I attach the new HDT file to the server, I get an error during setup:
This software cannot open this version of HDT File
I used the new version of the HDT C++ library you updated today.
Server issue?

RubenVerborgh · 2016-12-04T21:36:11Z

Not a server issue, but possibly an outdated HDT-Node version. Can you post your HDT file somewhere so I can check?

RubenVerborgh · 2016-12-04T21:38:04Z

Never mind, I found a testcase myself. On it.

migalkin · 2016-12-04T21:41:15Z

In case you need it: https://drive.google.com/file/d/0B3uXlknE4eJrZ19hem03M1VKVkk/view?usp=sharing

RubenVerborgh · 2016-12-05T22:26:44Z

@migalkin I found the bug and proposed a fix: rdfhdt/hdt-cpp#43

Summary: you built your HDT file using the latest master, which writes an (in my opinion) incorrect version number into the HDT file. The stable branch does not have this problem.

RubenVerborgh · 2016-12-09T09:10:51Z

@migalkin This bug is now fixed; the latest version of hdt-cpp now generates compatible HDT files again.

migalkin · 2016-12-09T12:56:46Z

@RubenVerborgh great, thanks for the update

RubenVerborgh self-assigned this Dec 4, 2016

RubenVerborgh added the question label Dec 4, 2016

RubenVerborgh closed this as completed Dec 4, 2016

RubenVerborgh mentioned this issue Dec 4, 2016

Remove built-in N-Triples parser/serializer rdfhdt/hdt-cpp#31

Closed

This was referenced Dec 4, 2016

Add a version number to the .hdt.index files rdfhdt/hdt-cpp#7

Closed

HDT format identifier uses only HDT version. rdfhdt/hdt-cpp#43

Merged

Implement proper branch management rdfhdt/hdt-cpp#44

Closed
