Proposal to create a controlled vocabulary for content description #246

Open
baskaufs opened this issue Sep 18, 2022 · 6 comments

Comments

@baskaufs

baskaufs commented Sep 18, 2022

This proposal is intended to help resolve several other open issues: #230 (Problems with Iptc4xmpExt:CVterm), #244 (Outdated Link in Notes for Iptc4xmpExt:CVterm), #200 (Term needed to designate when an image or ROI is a label), and #166 (View for "in habitat").

Background

During the development process of the subjectPart controlled vocabulary, the community was interested in providing a way to indicate that a particular image or region of interest within an image contained something that was not part of the organism. One submitted use case was to filter out images of labels. Another situation was to indicate that the photograph was of the habitat of the organism (either the organism wasn't in the habitat photo, or the organism was in the photo, but the entire organism was not the primary subject of the image -- the habitat was). Photos of scale bars and color scales would be similar to the label use case, and an audio analog would be to indicate that a region of interest in the recording was a voice introduction.

The Task Group decided that these terms were out of scope for a vocabulary about subject parts, since they didn't actually involve any part of the organism. However, it recommended that the Audubon Core Maintenance Group consider other alternative methods to make it possible to filter for these features.

Proposal

Create a Content Description controlled vocabulary. The concept IRIs would be used as values for Iptc4xmpExt:CVterm (defined as "A term to describe the content of the image by a value from a Controlled Vocabulary") and controlled value strings would be used as values for a newly minted term: ac:CVtermLiteral (defined below).

In the event that there were multiple values for Iptc4xmpExt:CVterm, the normal rules for serializing lists in the serialization being used (e.g. RDF Turtle, JSON-LD) would apply. If there were multiple values for ac:CVtermLiteral, they MAY be concatenated and pipe separated if the user chose not to implement one of the methods described in Section 3.2 (Tabular serializations) of the Audubon Core Structure guide.
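As a rough illustration of the pipe-separated fallback for ac:CVtermLiteral, the sketch below joins and splits values using the space, vertical bar, space delimiter given in the ac:CVtermLiteral usage text. The function names are hypothetical and are not part of Audubon Core; rejecting values that themselves contain a vertical bar is an assumption made here to keep the round trip unambiguous.

```python
DELIMITER = " | "  # space, vertical bar, space


def join_literals(values):
    """Concatenate controlled value strings into one delimited cell."""
    cleaned = [v.strip() for v in values if v.strip()]
    for value in cleaned:
        # Assumption: a value containing "|" could not be split back reliably.
        if "|" in value:
            raise ValueError(f"value contains the delimiter character: {value!r}")
    return DELIMITER.join(cleaned)


def split_literals(cell):
    """Split a delimited cell back into individual controlled value strings."""
    return [part.strip() for part in cell.split("|") if part.strip()]


print(join_literals(["label", "scaleBar"]))
print(split_literals("label | scaleBar"))
```

Splitting on the bare vertical bar and then trimming whitespace is a deliberately tolerant choice, so that cells written without the surrounding spaces are still read.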

The revised definitions remove recommended vocabularies from the term metadata. If Audubon Core wishes to recommend vocabularies, it should do so via informative documents that do not require going through the standards process to change them. The URLs for many of the currently recommended vocabularies are broken (see Issue #244 for example) and others aren't actively maintained. The current recommendations are: the NASA Global Change Master Directory (GCMD; http://gcmd.nasa.gov/), Subject Categories defined in Key to Nature (K2N; http://www.keytonature.eu/wiki/Subject_Category), the BioComplexity Thesaurus (https://www2.usgs.gov/core_science_systems/csas/biocomplexity_thesaurus/), the Description Type GBIF Vocabulary (http://rs.gbif.org/vocabulary/gbif/description_type.xml), the TDWG Species Profile Model (http://rs.tdwg.org/ontology/voc/SPMInfoItems.rdf), the Plinian Core (https://github.com/tdwg/PlinianCore/wiki), the European Environmental Agency GEneral Multilingual Environmental Thesaurus (GEMET; http://www.eionet.europa.eu/gemet), and the Long Term Ecological Research Network Controlled Vocabulary (LTER; http://vocab.lternet.edu/).

The changes proposed here simplify use of the terms. Either a globally unique IRI is used as a value for Iptc4xmpExt:CVterm (as IPTC expects), a controlled value string from the new AC Content Description vocabulary (initial values defined below) is used as a value for ac:CVtermLiteral, or a controlled value string from another vocabulary is used as a value for ac:CVtermLiteral and that vocabulary is identified using ac:subjectCategoryVocabulary.

Usage examples
Examples showing how to use the revised terms and controlled vocabulary can be viewed at the Content Description Examples document.

Term revisions
New text is in italics. Removed text is in strikethrough. The definition of Iptc4xmpExt:CVterm remains unchanged from that given by the IPTC.

Term name: Iptc4xmpExt:CVterm
Label: Subject Category
IRI: http://iptc.org/std/Iptc4xmpExt/2008-02-29/CVterm
Required: No
Repeatable: Yes
Definition: A term to describe the content of the image by a value from a Controlled Vocabulary.
Usage: Values MUST be IRIs from a controlled vocabulary of subjects to support broad classification of media items. IRIs Terms from the Audubon Core Content Description controlled vocabulary are preferred, but other various controlled vocabularies may be used as long as the term is uniquely identified by an IRI. This term MAY be used to describe the content of a region of interest within an image. AC-recommended vocabularies are preferred and MAY be unqualified literals (not a full URI). For terms from other vocabularies either a precise URI SHOULD be used, or, as long as all unqualified terms in all vocabularies are unique, metadata SHOULD provide the source vocabularies using the Subject Category Vocabulary term. The value SHOULD be a string, whose text can also be in the form of a URL. These guidelines on value format are less restrictive than is specified by the IPTC guidelines.
Notes: Recommended sets include: the NASA Global Change Master Directory (GCMD; http://gcmd.nasa.gov/), Subject Categories defined in Key to Nature (K2N; http://www.keytonature.eu/wiki/Subject_Category), the BioComplexity Thesaurus; https://www2.usgs.gov/core_science_systems/csas/biocomplexity_thesaurus/, the Description Type GBIF Vocabulary; http://rs.gbif.org/vocabulary/gbif/description_type.xml, the TDWG Species Profile Model; http://rs.tdwg.org/ontology/voc/SPMInfoItems.rdf, the Plinian Core; https://github.com/tdwg/PlinianCore/wiki, the European Environmental Agency GEneral Multilingual Environmental Thesaurus (GEMET; http://www.eionet.europa.eu/gemet), and the Long Term Ecological Research Network Controlled Vocabulary (LTER; http://vocab.lternet.edu/). The vocabulary may include major taxonomic groups (such as "vertebrates" or "fungi") or ecosystem terms ("savannah", "temperate rain forest", "forest fires", "aquatic vertebrates"). Other formal classifications (published in print or online) such as habitat, fuel, invasive species, agroproductivity, fisheries, migratory species etc. are also suitable.
Type: http://www.w3.org/1999/02/22-rdf-syntax-ns#Property

Term Name: ac:subjectCategoryVocabulary
Label: Subject Category Vocabulary
IRI: http://rs.tdwg.org/ac/terms/subjectCategoryVocabulary
Required: No
Repeatable: Yes No
Definition: Any controlled vocabulary or formal classification from which values for ac:CVterm terms in the Subject Category have been drawn.
Usage: The value SHOULD be a stable URL for the vocabulary if one is available.
Notes: If controlled string values for ac:CVterm are taken from the Audubon Core Subject Category controlled vocabulary, it is not necessary to provide a value for this property. If pipe separated strings are used to provide multiple values for ac:CVterm, this term MUST NOT be repeated. It MAY be repeated if data structuring allows particular ac:CVterm string values to be associated with particular values for this term. The AC recommended vocabularies do not need to be cited here. There is no required linkage between individual Subject Category terms and the vocabulary; the mechanism is intended to support discovery of the normative URI for a term, but not guarantee it.

New terms

Term name: ac:CVtermLiteral
Label: Subject Category (literal)
IRI: http://rs.tdwg.org/ac/terms/CVtermLiteral
Required: No
Repeatable: Yes
Definition: A term to describe the content of an image or a region of interest within an image using a controlled value string.
Usage: Values SHOULD be selected from the Audubon Core Content Description controlled vocabulary or a vocabulary that can be identified using ac:subjectCategoryVocabulary. If a value is from the Audubon Core Content Description controlled vocabulary, it is not necessary to provide a value for ac:subjectCategoryVocabulary. Multiple values MAY be provided and separated by space vertical bar space ( | ), however they MUST be from a single vocabulary. It is best practice to use Iptc4xmpExt:CVterm instead of ac:CVtermLiteral whenever practical.
Type: http://www.w3.org/1999/02/22-rdf-syntax-ns#Property

The following terms would be included in the initial Content Description controlled vocabulary (with others potentially added in the future). The type of all terms in the vocabulary is http://www.w3.org/2004/02/skos/core#Concept .

Label: Label
Definition: physical text providing metadata about the focal resource
Note: The focal resource MAY be any kind of specimen.
Controlled string: label

Label: Context
Definition: a depicted feature that is the context in which the focal resource was located
Usage: If the media item includes a depiction of the focal resource, the focal resource SHOULD NOT be the main feature of the media item.
Examples: the habitat in which an organism was found, the strata from which a fossil or mineral sample was removed
Controlled string: context

Label: Scale Bar
Definition: a linear graphic used to associate size with a dimension of the media item
Example: a ruler
Controlled string: scaleBar

Label: Color Bar
Definition: chromatic graphic used to calibrate the color profile of an image with the actual color of the object
Controlled string: colorBar

Label: Spoken Description
Definition: audio media that contains a voice description of the content of the media item
Controlled string: spokenDescription

Label: Organism Part
Definition: all or part of an organism
Usage: The media item primarily depicts some part of an organism. If the organism is not the main feature that is depicted, context SHOULD be used instead.
Note: The organism MAY be living or dead, and MAY be preserved. If this value is used, the terms ac:subjectPart/ac:subjectPartLiteral and ac:subjectOrientation/ac:subjectOrientationLiteral SHOULD be used to provide more detailed information about what precisely is depicted and how the depicted part is oriented.
Controlled string: organismPart
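One of the motivating use cases was filtering out images of labels. A minimal sketch of that kind of filter over flat records follows, assuming each record is a dictionary keyed by term name and that multiple ac:CVtermLiteral values are pipe separated. The record layout, the "id" key, and the function names are assumptions for illustration only.

```python
# Controlled value strings from the initial vocabulary proposed above.
CONTENT_DESCRIPTION_STRINGS = {
    "label", "context", "scaleBar", "colorBar",
    "spokenDescription", "organismPart",
}


def content_strings(record):
    """Return the set of ac:CVtermLiteral values found in a flat record."""
    cell = record.get("ac:CVtermLiteral", "")
    return {part.strip() for part in cell.split("|") if part.strip()}


def exclude_content(records, unwanted):
    """Keep only records that carry none of the unwanted controlled strings."""
    unknown = set(unwanted) - CONTENT_DESCRIPTION_STRINGS
    if unknown:
        raise ValueError(f"not in the proposed vocabulary: {sorted(unknown)}")
    return [r for r in records if not (content_strings(r) & set(unwanted))]


records = [
    {"id": "a", "ac:CVtermLiteral": "label"},
    {"id": "b", "ac:CVtermLiteral": "organismPart"},
    {"id": "c", "ac:CVtermLiteral": "organismPart | scaleBar"},
    {"id": "d"},  # no content description provided
]
print([r["id"] for r in exclude_content(records, {"label"})])
```

Records with no ac:CVtermLiteral value are kept here, since the absence of a value says nothing about what the media item depicts.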

Ping @ben-norton with respect to lists as values.
Refer to tdwg/camtrap-dp#191 (comment)

@danstowell
Contributor

May I suggest changing spokenIntroduction to spokenDescription? "intro" implies it comes before something, but the description may come at the end or elsewhere.

@edwbaker
Member

@danstowell Yes, makes sense to me.

@baskaufs
Author

baskaufs commented Dec 4, 2022

Changed proposal to use spokenDescription as suggested by @danstowell.

Capitalized human-readable labels to conform to existing practice.

@baskaufs
Author

baskaufs commented Dec 4, 2022

I have completed revisions of this proposal to implement the decision made at the 2022-11-10 Audubon Core Maintenance Group working session (action item D on the agenda): that we should mint a new term, ac:CVtermLiteral, for literal values rather than recommending ac:tag as in the original proposal, in order to follow the existing AC patterns for literal and IRI-valued terms.

Note that as a part of this proposal I have included a general cleanup of how Iptc4xmpExt:CVterm is to be used. Originally there was a kludgy system where one could use either an IRI value or a string from one of the AC recommended vocabularies. The term ac:subjectCategoryVocabulary was then used to provide clues about what vocabulary was intended when a string value was provided.

Since we now have the ability to indicate whether the term value should be an IRI or a controlled value string, the proposal allows for three straightforward options for indicating the subject category:

  1. Provide a full IRI for Iptc4xmpExt:CVterm. No other information is required as the values would be globally unique and self describing (if the IRIs dereference).
  2. Use a controlled value string from the new controlled vocabulary we are creating. In that case, no other information is required and it isn't necessary to provide a value for ac:subjectCategoryVocabulary; the AC vocabulary would be assumed.
  3. Use a controlled value string from any other controlled vocabulary and indicate what that vocabulary is by providing a value for ac:subjectCategoryVocabulary.

There is a bit of complexity introduced by allowing multiple values to be provided for ac:CVtermLiteral using space vertical bar space ( | ) separation. In simple text based systems, there wouldn't be a way to associate a particular ac:CVtermLiteral value with a particular ac:subjectCategoryVocabulary value. So one must either limit the delimited strings to a single vocabulary, or use some structured data system that allows the controlled values to be associated with their vocabularies. No such restriction exists for IRI values -- one reason why they are preferred.
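The three options and the single-vocabulary restriction could be checked along the following lines. This is only a sketch for flat records: the dictionary layout, the function name, the returned labels, and the idea that a repeated ac:subjectCategoryVocabulary would itself appear pipe separated are all assumptions, not part of the proposal.

```python
# Controlled value strings from the initial AC Content Description vocabulary.
AC_STRINGS = {
    "label", "context", "scaleBar", "colorBar",
    "spokenDescription", "organismPart",
}


def classify_subject_category(record):
    """Report which of the three options a flat record appears to follow."""
    if record.get("Iptc4xmpExt:CVterm", "").strip():
        return "iri"  # option 1: globally unique IRI, nothing else needed

    cell = record.get("ac:CVtermLiteral", "")
    literals = [part.strip() for part in cell.split("|") if part.strip()]
    if not literals:
        return "none"

    vocabulary = record.get("ac:subjectCategoryVocabulary", "").strip()
    if vocabulary:
        # Assumption: a repeated vocabulary term shows up as a delimited cell.
        if "|" in vocabulary and len(literals) > 1:
            raise ValueError(
                "pipe-separated literals must come from a single vocabulary"
            )
        return "other-vocabulary"  # option 3

    if all(value in AC_STRINGS for value in literals):
        return "ac-vocabulary"  # option 2: AC vocabulary is assumed
    return "unidentified"  # literal given, but its vocabulary is unknown


print(classify_subject_category({"ac:CVtermLiteral": "label | scaleBar"}))
```

A record that supplies both an IRI and a literal is reported as "iri" here, reflecting the stated preference for IRI values; a real validator might report both.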

@baskaufs
Author

baskaufs commented Dec 4, 2022

Updated the proposal to make it clear that the terms may be used with regions of interest within an image in addition to an entire image.

@edwbaker edwbaker added next meeting agenda Issues to be discussed at next MG meeting future-meeting and removed next meeting agenda Issues to be discussed at next MG meeting future-meeting labels Feb 21, 2024
@edwbaker edwbaker added public comment and removed next meeting agenda Issues to be discussed at next MG meeting labels May 15, 2024
@baskaufs
Author

Created usage examples as requested at AC MG meeting. Edited proposal to link to examples.
