Adding information about the source format and generated metadata #66

mickael-menu · 2020-08-06T13:05:25Z

Sometimes we generate a RWPM from third-party publication formats (e.g. EPUB, CBZ). It could be valuable for reading apps to know which source format a RWPM originated from.

For example:

{
  "metadata": {
    "sourceType": "application/epub+zip"
  }
}

Generating a RWPM from third-party formats also means that sometimes we need to generate some metadata, which might not be accurate. For example, a title is mandatory with RWPM, but doesn't exist for CBZ and some PDFs. It could be useful for reading apps to know if the metadata are not completely accurate. Two possibilities:

adding a global generated: true property when any property is generated
adding a list of generated properties, e.g. generated: ["metadata.title"]

Personally, I think that without the information about which properties are generated, this is not very actionable for reading apps.
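To illustrate why the list form is more actionable, here is a minimal Python sketch of how a reading app might query either shape. It assumes the hypothetical generated property sits under metadata and uses dotted paths such as "metadata.title"; neither detail is settled in this proposal, and is_generated is an illustrative helper name.

```python
from typing import Optional


def is_generated(manifest: dict, path: str) -> Optional[bool]:
    """Tell whether `path` was generated, or None when the manifest can't say.

    Assumes a hypothetical `generated` property under `metadata`, holding
    either a boolean or a list of dotted property paths.
    """
    flag = manifest.get("metadata", {}).get("generated")
    if isinstance(flag, list):
        # List form: we know exactly which properties were generated.
        return path in flag
    if flag is True:
        # Global form: something was generated, but we can't tell what.
        return None
    return False
```

With the global boolean, the app can only answer "maybe" for any given property, whereas the list form gives a definite answer per property.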


HadrienGardeur · 2020-08-12T09:07:33Z

I think that this could be covered by a source element containing two pieces of information:

media type
file name

{
  "metadata": {
    "source": {
      "type": "application/epub+zip",
      "filename":  "title.epub"
    }
  }
}

Once we know the media type and filename, I'm not convinced that we would need to also provide a value indicating that metadata were generated.

mickael-menu · 2020-08-12T09:17:33Z

Another useful thing we can do with the source.type information is to inject Readium CSS only when the type is application/epub+zip. This would allow streaming an EPUB on the web while still supporting the CSS overrides and pagination.
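A rough sketch of that check in Python, assuming the metadata.source.type shape proposed above (not part of the RWPM schema at this point). The function name should_inject_readium_css is hypothetical.

```python
EPUB_MEDIA_TYPE = "application/epub+zip"


def should_inject_readium_css(manifest: dict) -> bool:
    """Return True only when the manifest says it was generated from an EPUB.

    Assumes the proposed (hypothetical) `metadata.source.type` property.
    """
    source = manifest.get("metadata", {}).get("source")
    if not isinstance(source, dict):
        # Missing, or some other value stored under `source`.
        return False
    return source.get("type") == EPUB_MEDIA_TYPE
```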

Once we know the media type and filename, I'm not convinced that we would need to also provide a value indicating that metadata were generated.

That's right, the only useful thing I could think of is to know that the title was generated. But if we can check that the title is the filename, then it's actionable.

llemeurfr · 2020-08-12T13:13:16Z

I don't see how a filename can give useful info to the application. Here you're proposing the filename because some source formats will not contain a title, so the title will be generated from the filename. Is it because the only mandatory meta in RWPM is the title?

Also, knowing the source media-type seems sufficient to know which metadata have been generated (e.g. the title in case the media type is cbz).

HadrienGardeur · 2020-08-12T13:40:48Z

In the case of a CBZ or a PDF, the filename might be the only info available.

If you know the filename, you can compare it to the title generated by the streamer and do something based on that comparison.
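As a sketch of that comparison (Python, assuming the proposed metadata.source.filename property; title_matches_filename is a hypothetical helper), a client could treat the title as probably generated when it equals the filename, with or without its extension:

```python
from pathlib import PurePosixPath


def title_matches_filename(manifest: dict) -> bool:
    """Return True when the title looks like it was derived from the filename.

    Assumes the proposed (hypothetical) `metadata.source.filename` property.
    Only plain string titles are compared; localized titles are ignored.
    """
    metadata = manifest.get("metadata", {})
    title = metadata.get("title")
    source = metadata.get("source")
    if not isinstance(title, str) or not isinstance(source, dict):
        return False
    filename = source.get("filename")
    if not filename:
        return False
    # Accept both "name.pdf" and "name" as filename-derived titles.
    return title in {filename, PurePosixPath(filename).stem}
```

This is only a heuristic: a publication whose real title happens to equal its filename would also match.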

llemeurfr · 2020-08-12T13:46:45Z

I see ... but the title is still mandatory, so it means the RWPM will contain:

{
  "metadata": {
    "@type": "http://schema.org/Book",
    "conformsTo": "https://readium.org/webpub-manifest/profiles/pdf",
    "title":  "name.pdf",
    "source": {
      "type": "application/pdf",
      "filename":  "name.pdf"
    }
  }
}

The client part "viewer" just wants to display a title. Who cares about the original filename at this point?

mickael-menu · 2020-08-12T14:03:18Z

Also, knowing the source media-type seems sufficient to know which metadata have been generated (e.g. the title in case the media type is cbz).

In both CBZ and PDF, sometimes we have a real title (for CBZ, it is the single root folder inside the archive, if there's one), so we can't be sure the filename was used. Also file extensions are not mandatory, so checking for .pdf doesn't work either.
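For reference, the CBZ rule mentioned above (title taken from the single root folder, if there is one) could be sketched like this in Python with the standard zipfile module. This is an illustration of the idea rather than the actual streamer code, and cbz_root_folder_title is a hypothetical name.

```python
import zipfile
from typing import Optional


def cbz_root_folder_title(archive: zipfile.ZipFile) -> Optional[str]:
    """Return the name of the single root folder, or None if there isn't one."""
    roots = set()
    for name in archive.namelist():
        first, sep, _rest = name.partition("/")
        if not sep:
            # A file sits directly at the archive root: no single root folder.
            return None
        roots.add(first)
    return roots.pop() if len(roots) == 1 else None
```

When this returns None, the parser has to fall back to something else such as the filename, which is the case a reading app cannot detect from the media type alone.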

danielweck · 2020-08-12T16:42:31Z

What about overloading dc:source though? Isn't that problematic?

https://www.w3.org/publishing/epub3/epub-packages.html

https://www.dublincore.org/specifications/dublin-core/dcmi-terms/elements11/source/

<dc:source id="src-id">urn:isbn:9780375704024</dc:source> 
<meta refines="#src-id" property="identifier-type" scheme="onix:codelist5">15</meta> 
<meta refines="#src-id" property="source-of">pagination</meta>

danielweck · 2020-08-12T16:58:09Z

According to the parsing specification, dc:source should be preserved "as is" when parsing (under additionalProperties), there is no production of unprefixed source property:
https://github.com/readium/architecture/blob/master/streamer/parser/metadata.md

danielweck · 2020-08-12T17:08:43Z

Ah, here is an older related issue about dc:source:
#14

danielweck · 2020-08-12T17:11:32Z

Note that the R2 TypeScript implementation currently preserves dc:source as source instead of placing it in additionalProperties / otherMetadata, which is not the correct behaviour, because source is not defined in the RWPM's JSON schema ( https://github.com/readium/webpub-manifest/blob/master/schema/metadata.schema.json ), or the JSON-LD context ( https://readium.org/webpub-manifest/context.jsonld ), or the EPUB parsing doc ( https://github.com/readium/architecture/blob/master/streamer/parser/metadata.md ).

https://github.com/IDPF/epub3-samples/blob/master/30/childrens-literature/EPUB/package.opf#L29

https://idpf.github.io/epub3-samples/30/samples.html

=>

http://readium2.herokuapp.com/pub/L2FwcC9taXNjL2VwdWJzL2NoaWxkcmVucy1saXRlcmF0dXJlLmVwdWI%3D/manifest.json/show/all

mickael-menu · 2020-08-12T19:31:08Z

currently preserves dc:source as source instead of placing it in additionalProperties / otherMetadata

Just to clear up any ambiguity: on mobile, otherMetadata is an implementation detail of the in-memory model, used to store the additional properties. In the generated RWPM, any additional metadata would sit directly under metadata. For example, this OPF:

<metadata xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:dcterms="http://purl.org/dc/terms/"
      xmlns:a11y="http://www.idpf.org/epub/vocab/package/a11y/#">
    <dc:title>Alice's Adventures in Wonderland</dc:title> 
    <dc:rights>Public Domain</dc:rights> 
    <meta property="a11y:certifiedBy">EDRLab</meta>
</metadata>

produces the RWPM (after resolving the full URI from the XML namespaces of other metadata):

{
    "metadata": {
        "title": "Alice's Adventures in Wonderland",
        "http://purl.org/dc/terms/rights": "Public Domain",
        "http://www.idpf.org/epub/vocab/package/a11y/#certifiedBy": "EDRLab"
    }
}

And with the in-memory model:

publication.metadata.title
publication.metadata["http://purl.org/dc/terms/rights"] // (internally uses `otherMetadata`)

Note that we have a special case with the dc: prefix, which is actually aliased to dcterms:.

// The dc URI is expanded as dcterms
// See https://www.dublincore.org/specifications/dublin-core/dcmi-terms/
// > While these distinctions are significant for creators of RDF applications, most
// > users can safely treat the fifteen parallel properties as equivalent. The most
// > useful properties and classes of DCMI Metadata Terms have now been published as
// > ISO 15836-2:2019 [ISO 15836-2:2019]. While the /elements/1.1/ namespace will be
// > supported indefinitely, DCMI gently encourages use of the /terms/ namespace.

I'm not sure any of this is documented in the EPUB parsing guide, as metadata extensions were not really supported at the time.
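Since it isn't documented, here is a small Python sketch of the expansion described above: a prefixed property is resolved to a full URI from the declared XML namespaces, and the Dublin Core /elements/1.1/ namespace is aliased to /terms/. The function name expand_property and the pass-through behaviour for unknown prefixes are assumptions for illustration, not the mobile implementation.

```python
from typing import Mapping

DC_ELEMENTS = "http://purl.org/dc/elements/1.1/"
DC_TERMS = "http://purl.org/dc/terms/"


def expand_property(qname: str, namespaces: Mapping[str, str]) -> str:
    """Expand `prefix:local` to a full URI, aliasing dc elements to dcterms."""
    prefix, sep, local = qname.partition(":")
    if not sep or prefix not in namespaces:
        # No known prefix: leave the name untouched (assumed behaviour).
        return qname
    uri = namespaces[prefix]
    if uri == DC_ELEMENTS:
        uri = DC_TERMS
    return uri + local
```

With the namespaces declared in the OPF example, dc:rights expands to http://purl.org/dc/terms/rights and a11y:certifiedBy to http://www.idpf.org/epub/vocab/package/a11y/#certifiedBy, matching the RWPM keys shown earlier.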

mickael-menu · 2020-08-19T09:48:14Z

How about sourceFile to circumvent the dc:source issue?

{
  "metadata": {
    "sourceFile": {
      "type": "application/epub+zip",
      "name":  "title.epub"
    }
  }
}

danielweck · 2020-08-19T10:03:29Z

...just thinking aloud regarding the term sourceFile: if the publication "asset" (e.g. EPUB zip archive) is acquired via HTTP Content-Disposition: attachment; filename="book.epub" with header Content-Type = application/epub+zip ... then sourceFile makes sense, but what about other HTTP fetch types whereby the notion of "file" is not so clear? (e.g. HTTP GET request on URL https://domain.com/books/1)
That being said, the name field of the sourceFile object seems appropriate, as this is clearly about "filename".
Alternatively, to avoid using terms that have other meanings / uses (such as "source", "origin", "resource", etc.), what about

{
  "metadata": {
    "originalAsset": {
      "type": "application/epub+zip",
      "filename":  "title.epub"
    }
  }
}

mickael-menu · 2020-08-19T10:27:19Z

Good point, then I would still use name since it is still useful outside the context of a file.

For example, when fetching a CBZ from https://comics.com/watchmen without a Content-Disposition header, the parser could use the last path component of the URI to generate the title, and we would have:

{
  "metadata": {
    "originalAsset": {
      "type": "application/vnd.comicbook+zip",
      "name":  "watchmen"
    }
  }
}
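A possible way to derive that name, sketched in Python with the standard library only: prefer the filename from a Content-Disposition header when one is present, otherwise fall back to the last path component of the URL. The function name asset_name and the order of precedence are assumptions for illustration.

```python
from email.message import Message
from typing import Optional
from urllib.parse import unquote, urlsplit


def asset_name(url: str, content_disposition: Optional[str] = None) -> Optional[str]:
    """Pick a name for the original asset from the HTTP response or its URL."""
    if content_disposition:
        # Reuse the stdlib header parser to extract the filename parameter.
        header = Message()
        header["Content-Disposition"] = content_disposition
        filename = header.get_filename()
        if filename:
            return filename
    # Fall back to the last path component of the URL, if any.
    path = urlsplit(url).path.rstrip("/")
    last = unquote(path.rsplit("/", 1)[-1])
    return last or None
```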

danielweck mentioned this issue Aug 12, 2020

JSON Schema - metadata.source? #14

Open

danielweck mentioned this issue Aug 12, 2020

dc:source and dc:rights not in RWPM JSON Schema / JSON-LD context, not in EPUB parsing rules readium/r2-shared-js#26

Open
