Parsing Epub subjects #111

Closed
qnga opened this issue Jan 30, 2020 · 3 comments · Fixed by #125

qnga commented Jan 30, 2020

The EPUB Parsing Guide is currently too simplistic about EPUB subject parsing. It says:

  1. if there is a single <dc:subject> element, check whether its keywords are separated by commas or semicolons;
    1. if they are not, the string is the value;
    2. if they are, split the string to build an array;
  2. if there is more than one <dc:subject> element, build an array using their values.

Both the Readium manifest and EPUB support sortAs and localized strings in subjects, so the algorithm should be modified. I suggest the following:

  1. if there is a single <dc:subject> element without any sorting information or translation (a possibly non-compliant case), check whether its keywords are separated by commas or semicolons;
    1. if they are not, the string is the value;
    2. if they are, split the string to build an array;
  2. otherwise, assume conformance to the EPUB spec and build an array using the element values.
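
The suggested steps could be sketched roughly as follows. This is only an illustration, not the Readium implementation: the `Subject` class, its fields, and `build_subjects` are hypothetical names, and the sketch assumes the `<dc:subject>` elements and their refines have already been read from the package document.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Subject:
    """Hypothetical container for one parsed <dc:subject> element."""
    name: str
    # Sorting key taken from a refine, if the publication provides one.
    sort_as: Optional[str] = None
    # Localized names keyed by language tag, if any translation is provided.
    translations: Dict[str, str] = field(default_factory=dict)


def build_subjects(parsed: List[Subject]) -> List[Subject]:
    """Apply the suggested rule to the subjects found in one publication."""
    if len(parsed) == 1:
        only = parsed[0]
        # Step 1: a lone subject with no sorting information or translation
        # may be a non-compliant keyword list, so look for separators.
        if only.sort_as is None and not only.translations:
            parts = [p.strip() for p in re.split(r"[,;]", only.name)]
            parts = [p for p in parts if p]
            if len(parts) > 1:
                return [Subject(name=p) for p in parts]
    # Step 2: otherwise assume conformance and keep the values untouched.
    return parsed
```

With this sketch, a lone `Subject("Fiction, Fantasy; Dragons")` is split into three subjects, while the same string carrying a `sort_as` value is kept as a single subject.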

@mickael-menu
Member

Yes, this sounds like a better strategy.

The current Swift implementation discards any sort-as or alternate-script refines whenever there is a single subject and it contains commas.
https://github.com/readium/r2-streamer-swift/blob/630d8953cbad98fb229233caa9d98b2ce16474b3/r2-streamer-swift/Parser/EPUB/EPUBMetadataParser.swift#L206-L243

But it's safe to assume that if these refines are present, then the commas will be part of the subject name itself and not a separator.
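
To make that assumption concrete, here is a minimal sketch of how one might detect whether a subject carries such refines before deciding to split it. It is not a port of the Swift code linked above. The function name and the sample package document are made up for illustration, and the sketch assumes that sort-as corresponds to the EPUB 3 `file-as` property and that refines point at the subject through `refines="#id"`. It does not look at `xml:lang`.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}
# Assumption: these are the refining properties relevant to the discussion.
REFINING_PROPERTIES = {"file-as", "alternate-script"}

# Made-up package document, for illustration only.
SAMPLE_OPF = """
<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:subject id="s1">Regional and national history, Europe</dc:subject>
    <meta refines="#s1" property="file-as">History, Europe</meta>
  </metadata>
</package>
"""


def subjects_with_refines(opf_xml: str) -> List[Tuple[str, bool]]:
    """Return (subject text, has_refines) pairs in document order."""
    root = ET.fromstring(opf_xml)
    metadata = root.find("opf:metadata", NS)
    if metadata is None:
        return []
    # Collect the ids targeted by a sort-as or alternate-script refine.
    refined_ids = set()
    for meta in metadata.findall("opf:meta", NS):
        target = meta.get("refines", "")
        if target.startswith("#") and meta.get("property") in REFINING_PROPERTIES:
            refined_ids.add(target[1:])
    result = []
    for subject in metadata.findall("dc:subject", NS):
        name = (subject.text or "").strip()
        result.append((name, subject.get("id") in refined_ids))
    return result
```

For the sample above, the single subject is reported as refined, so its comma would be treated as part of the name rather than as a separator.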

@mickael-menu
Member

@JayPanoz Do you have an opinion on this subject?

@JayPanoz
Contributor

Not really an opinion, perhaps more of a note.

Given how extensive the task of handling refines was when this doc was first created, it has been a strong candidate for improvement through issues, PRs, etc. So it makes perfect sense to me that the language is fine-tuned, and I also like the idea of offloading/referring to EPUB spec conformance whenever possible.

Then maybe the handling of refines in this doc should be an issue in and of itself, given that it is currently partial?
