Entity reconciliation between schemas and ontologies #72

Open
agnescameron opened this issue Jul 21, 2021 · 6 comments

Comments

@agnescameron

agnescameron commented Jul 21, 2021

This came up in this month's call, and I wanted to give a full explanation of the use case I was describing (as it's 'a bit meta'), which can be shaped into more of a feature request / mailing list object through discussion. I originally brought this up in relation to the discussion of type hierarchies in #68 -- my impression is that the main distinction between these cases is that here, the entities being resolved are themselves types, as specified by an ontology.

What we're trying to achieve:

Taking a range of datasets, produced by different people working in a similar context (in this case, innovation data) and reconciling the dataset schemas against a common ontology. This could be based on just the string information of the column headers or, ideally, on a combination of the column header and the data type, or the relationships between different columns within the schema.
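As a rough illustration of the string-only variant, here is a minimal sketch that ranks ontology type names against a column header by plain string similarity. The type names are the ones mentioned in this thread; the flat `ONTOLOGY_TERMS` list, the `normalize` and `rank_terms` helpers, and the use of Python's `difflib` are assumptions made for the example, not part of any reconciliation service.

```python
import re
from difflib import SequenceMatcher

# Type names taken from this thread; the flat list is a stand-in for a real ontology.
ONTOLOGY_TERMS = [
    "PatentPublicationIdentification",
    "PublicationLanguageCode",
    "PatentDocumentKindCode",
    "PublicationDate",
]


def normalize(label: str) -> str:
    """Split CamelCase, turn separators into spaces, and lowercase."""
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", label)
    return re.sub(r"[^A-Za-z0-9]+", " ", spaced).strip().lower()


def rank_terms(header: str, terms=ONTOLOGY_TERMS, limit: int = 3):
    """Return up to `limit` (term, score) pairs, most similar first."""
    h = normalize(header)
    scored = [(t, SequenceMatcher(None, h, normalize(t)).ratio()) for t in terms]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


if __name__ == "__main__":
    for header in ["publication_date", "kind code", "lang"]:
        print(header, "->", rank_terms(header))
```

Header strings alone will often be ambiguous, which is why the data type and the relationships between columns would be useful additional signals.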

The goal of the work we're doing is to build a graph of relationships between datasets, allowing merging/querying operations across a range of diverse data sources. There's also a version of this where the entities within the dataset also get reconciled (which looks a lot more like the traditional reconciliation API), but it would be interesting to know what's possible with an ontology alone.
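To make the graph idea concrete, here is a small sketch under stated assumptions: once columns have been annotated with ontology types, any two datasets sharing a type get an edge, which marks a candidate join. The dataset names, column names, and the `shared_type_edges` helper are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical annotations: (dataset, column) -> ontology type it was reconciled to.
ANNOTATIONS = {
    ("dataset_a", "pub_id"): "PatentPublicationIdentification",
    ("dataset_b", "patent_number"): "PatentPublicationIdentification",
    ("dataset_b", "pub_date"): "PublicationDate",
    ("dataset_c", "date_published"): "PublicationDate",
}


def shared_type_edges(annotations):
    """Link every pair of datasets that have a column reconciled to the same type."""
    by_type = defaultdict(list)
    for (dataset, column), onto_type in annotations.items():
        by_type[onto_type].append((dataset, column))

    edges = []
    for onto_type, cols in by_type.items():
        for (ds1, col1), (ds2, col2) in combinations(sorted(cols), 2):
            if ds1 != ds2:  # skip two columns of the same dataset
                edges.append((ds1, col1, ds2, col2, onto_type))
    return edges


if __name__ == "__main__":
    for edge in shared_type_edges(ANNOTATIONS):
        print(edge)
```

An edge only says that two columns claim the same type; whether the values can actually be joined still depends on the identifier format each dataset uses.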

For example: I know 3 different researchers, all of whom use patent identifiers in their datasets in a different format. The WIPO standards ontology specifies patent identifier formatting as part of a hierarchical ontology.

If I wanted to specify what identification scheme was being used in each instance, I could: each PatentPublicationIdentification has a PatentPublicationIdentificationType, which is composed of a sequence of up to 5 different objects including PublicationLanguageCode, PatentDocumentKindCode and PublicationDate. 2 of the 5 are optional, and many of these also have further possible type specifications (e.g. PublicationLanguageCode can have a different ExtendedISOLanguageCodeType depending on when the identifier was specified).

While it's possible to go through this process manually (either by going through the WIPO schema, or using guides to patent ID construction), crosswalking column types like this can be a real pain, especially for newer researchers not versed in the foibles of different notation.

It's possible that the entity reconciliation API is not the place for this problem, but it would be interesting to know what would work well -- so many ontologies get specified but then under-used when it comes to actually linking published schemas to their corresponding types. Are there existing workflows that anyone's familiar with for producing this kind of metadata (I'll add them to the census if so)?

@thadguidry
Contributor

thadguidry commented Jul 21, 2021

This is a general problem. The software for mapping, XML, RDF, etc. used to be less efficient and has evolved to be much better now in 2021, but the right choice depends on the actual use cases, on whether you want to involve the Semantic Web at all, and on whether you need to publish or republish, as is often the case. For instance, a lot of schemas, such as http://www.wipo.int/standards/XMLSchema/ST96/, are not actually vocabularies in the traditional sense, but real schemas for a particular niche set of domains, where no attempt to map to Linked Open Vocabularies or otherwise was part of the effort. (Closed World vs. Open World)

I think your immediate mapping needs from Schema <-> Schema or even Any <-> Many might be best accomplished with a tool and server that are used quite a bit for that need: Altova MapForce / Server / XMLSpy

XML

As a history lesson of how far we have come, this page goes over a broad set of tools and software, some no longer used or available: https://www.w3.org/wiki/XML_Schema_software

RDF

Gosh, there are so many over the decades depending on the needs, but practically, mapping existing DBs to RDF was very common in Academia and Enterprise. Here's the dated 2009 state of the art: https://www.w3.org/wiki/Rdb2RdfXG/StateOfTheArt

Linked Open Data

Nowadays, many of the maps are just directly embedded into Wikidata itself through the various SKOS-related properties, as I've done with Schema.org and other ontologies I have loosely mapped into it. One example: https://www.wikidata.org/wiki/Q26907166 Yes, manually. But a general Excel or LibreOffice "lookup" function or OpenRefine cross() can go a long way to map things "cheaply"...but as I stated, you often need all the power that good tools like Altova's give you, which then let you publish or upload those maps to share them with the world.
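For the "cheap" lookup style of mapping, a minimal sketch might look like the following. The relation names are SKOS mapping properties and the targets are Schema.org properties, but the specific pairings in `CROSSWALK` are assumptions made for illustration, not published mappings, and `build_index` and `lookup` are hypothetical helpers.

```python
# Illustrative crosswalk rows: (source term, SKOS mapping relation, target term).
# The pairings are assumptions for this example, not published mappings.
CROSSWALK = [
    ("PublicationDate", "skos:closeMatch", "schema:datePublished"),
    ("PublicationLanguageCode", "skos:closeMatch", "schema:inLanguage"),
]


def build_index(rows):
    """Index crosswalk rows by source term, like a spreadsheet lookup table."""
    index = {}
    for source, relation, target in rows:
        index.setdefault(source, []).append((relation, target))
    return index


def lookup(index, term):
    """Return all (relation, target) pairs for a term, or an empty list."""
    return index.get(term, [])


if __name__ == "__main__":
    index = build_index(CROSSWALK)
    print(lookup(index, "PublicationDate"))
    print(lookup(index, "PatentDocumentKindCode"))  # unmapped term
```

An exact-key lookup like this only covers terms someone has already mapped by hand; it does nothing for the unmapped ones.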

Semantic Web

Browsing and developing an ontology is totally different than the needs of mapping or linking ontologies.
I've lost touch with a lot of the open source world's effort for mapping or linking ontologies because to me there were no standards, with everyone doing their own thing at different levels, and Wikidata evolved into being a common place for doing that mapping to share with the world more easily. Still, here's an older page from 2010 with an updated link: https://www.w3.org/wiki/SemanticWebTools

@thadguidry
Contributor

thadguidry commented Jul 21, 2021

@agnescameron Someone just mentioned to me (offlist) that you would likely be better served by asking folks within the W3C DXWG where interoperability and mapping are exactly their focus: https://www.w3.org/2017/dxwg/wiki/Main_Page

@agnescameron
Author

agnescameron commented Jul 21, 2021

@thadguidry thanks for this! I hadn't encountered DXWG before, but they seem really ideal for this instance. the XML schema software history is also great.

@fsteeg
Member

fsteeg commented Aug 17, 2021

@agnescameron: I don't think I fully understand your use case yet, but did I mention Cocoda (https://coli-conc.gbv.de/cocoda/) in the meeting? I have not worked with it myself, but it uses the reconciliation API (acting as a client instead of OpenRefine) to create mappings between different ontologies.
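For reference, a client of the reconciliation API sends batches of queries. The sketch below builds such a batch for a list of column headers and encodes it as a form body, assuming the form-encoded `queries` parameter described in the 0.1/0.2 drafts of the reconciliation spec. It makes no network call, the helper names are hypothetical, and nothing here reflects how Cocoda itself is implemented.

```python
import json
from urllib.parse import parse_qs, urlencode


def build_query_batch(headers, type_id=None, limit=3):
    """Build a reconciliation query batch with one query per column header."""
    batch = {}
    for i, header in enumerate(headers):
        query = {"query": header, "limit": limit}
        if type_id is not None:
            query["type"] = type_id  # optional: restrict candidates to one type
        batch[f"q{i}"] = query
    return batch


def encode_form_body(batch):
    """Encode the batch as a form body carrying a single `queries` parameter."""
    return urlencode({"queries": json.dumps(batch)})


def decode_form_body(body):
    """Inverse of encode_form_body, handy for checking what would be sent."""
    return json.loads(parse_qs(body)["queries"][0])


if __name__ == "__main__":
    batch = build_query_batch(["pub_date", "kind_code"])
    body = encode_form_body(batch)
    print(body)
    print(decode_form_body(body) == batch)
```

The body would be POSTed to whichever reconciliation endpoint exposes the ontology's types as entities; that endpoint is not specified in this thread.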

@agnescameron
Author

@fsteeg Cocoda seems really well-suited to this use-case; will test it out and report back. thanks!

@thadguidry
Contributor

thadguidry commented Aug 17, 2021 via email
