Expose matching features for client-side scoring #38
Conversation
Is there any additional information that could make these scores more usable to a human making a judgment about reconciliation candidates (i.e. not only for training a classifier)? For example, a URL for an explanation of the feature/score? I would potentially use something like this to expose the separate label, temporal and spatial scores for candidates returned by the PeriodO reconciler, but the most likely use of them would be to give the person using the reconciler a UI for sorting, comparing and selecting candidates based on the scores.
Great point. It would totally make sense to add more metadata to these features: we want them to be as easy to understand as possible for users. It is true that we need to cater for manual workflows. I am not sure where to add documentation URLs (and other useful metadata) for features: it feels a bit wasteful to include them in each reconciliation candidate, as this would take up a lot of bandwidth. Any idea how to expose that?
A dedicated endpoint for describing the features, perhaps?
I was imagining that for the Wikidata reconciliation service, I would generate features for each property supplied in the query: they would be indexed by the property id. For instance, if the user supplies a given property in their query, I would return a feature whose id is that property id. So the domain of feature ids would be open in my case (the set of valid property ids is infinite since I accept property paths too), making it impossible to list them all in the manifest.
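For illustration, a candidate returned for a query supplying two properties might then carry one feature per property id. This is only a sketch: the property ids P569 and P19 stand for hypothetical user-supplied properties, and the feature shape follows the id/value form of this proposal:
{
  "id": "Q42",
  "name": "Douglas Adams",
  "score": 71.3,
  "match": false,
  "features": [
    { "id": "P569", "value": 0.92 },
    { "id": "P19", "value": 0.35 }
  ]
}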
Perhaps a dedicated endpoint would do, yes. Another option, slightly less wasteful than including feature metadata in reconciliation candidates, would be to include it at the root of reconciliation responses (where metadata for a given feature would be given only once even if the feature appears in multiple reconciliation candidates). But a dedicated endpoint could be cleaner, I think.
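To make the root-of-response option concrete, metadata for each feature id could be stated once alongside the candidates. A sketch only: the feature_metadata key and its fields are hypothetical names, not part of the proposal:
{
  "result": [
    {
      "id": "1117582299",
      "score": 85.71888,
      "features": [ { "id": "Name_tfidf", "value": 378.239 } ]
    }
  ],
  "feature_metadata": [
    {
      "id": "Name_tfidf",
      "name": "TF-IDF similarity of names",
      "documentation": "https://example.com/features/Name_tfidf"
    }
  ]
}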
This looks overall good to me.
One approach for linking to documentation is to make OpenRefine payloads JSON-LD by adding a @context, for example:
{
  "@context": {
    "name": "https://schema.org/name",
    "score": "https://example.org/score",
    "Name_tfidf": "https://example.org/Name_tfidf",
    "Pagerank": "https://example.org/Pagerank"
  },
  "id": "1117582299",
  "name": "Urbaniak, Hans-Eberhard",
  "score": 85.71888,
  "features": [
    {
      "@type": "Name_tfidf",
      "value": 378.239
    },
    {
      "@type": "Pagerank",
      "value": -3.1209
    }
  ],
  "match": true,
  "type": [
    {
      "id": "AuthorityResource",
      "name": "Normdatenressource"
    },
    {
      "id": "DifferentiatedPerson",
      "name": "Individualisierte Person"
    }
  ]
}
Could this @context be shared across responses rather than repeated in each one? Generally, formatting responses with JSON-LD seems like a sensible move. Should it be done in other places in the API to make it more uniform? I would just like to keep an eye on the size of the payloads: it would be good not to add too much overhead.
Re. removing redundancy and limiting the size of the payload when adding a @context: the context could be published at a stable URL and referenced from each response, for example:
{
  "@context": "https://openrefine.org/context.jsonld",
  "id": "1117582299",
  "name": "Urbaniak, Hans-Eberhard",
  "score": 85.71888,
  "features": [
    {
      "@type": "Name_tfidf",
      "value": 378.239
    },
    {
      "@type": "Pagerank",
      "value": -3.1209
    }
  ],
  "match": true,
  "type": [
    {
      "id": "AuthorityResource",
      "name": "Normdatenressource"
    },
    {
      "id": "DifferentiatedPerson",
      "name": "Individualisierte Person"
    }
  ]
}
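The referenced context document would then contain the mappings that were previously inlined. A minimal sketch of what https://openrefine.org/context.jsonld could hold, reusing the placeholder IRIs from the inline example above (the URL and IRIs are illustrative, not an existing resource):
{
  "@context": {
    "name": "https://schema.org/name",
    "score": "https://example.org/score",
    "Name_tfidf": "https://example.org/Name_tfidf",
    "Pagerank": "https://example.org/Pagerank"
  }
}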
Ok, so I guess instead of inlining the whole context, we would reference it by URL. If we want to save space and go down the JSON-LD route, perhaps we could use blank nodes to refer from each feature to other features defined elsewhere in the payload. But I am not sure this is a good idea. More generally, if we start doing things the JSON-LD way, doesn't that vaguely imply that we should accept queries or responses which have a different JSON structure but are equivalent as far as their JSON-LD semantics are concerned? So forcing implementations to rely on some JSON-LD library to handle that somehow? I would be interested to know if there are specs which use JSON-LD and JSON Schema simultaneously.
I apparently misunderstood what is modeled here. I thought we were talking about a controlled list of features that is the same for all reconciliation services. If the list differs from service to service, I have my doubts whether a JSON-LD context is the best solution.
I don't understand what you mean.
I think it is unproblematic, and even the most sensible solution, to specify the API (by a JSON Schema or otherwise) in a way that only allows for one JSON-LD expression of the semantics. This would basically mean defining a JSON API with an added JSON-LD context that enables handling the data as RDF (or providing documentation for the properties and types). For example, IIIF does it like that, see e.g. https://iiif.io/api/annex/notes/jsonld/#semantic-versioning. (A normative JSON Schema for IIIF doesn't exist, though.)
The proposal in this pull request looks good to me.
Regarding additional documentation for the infinite feature sets, I also think an additional endpoint might be the best solution, maybe like the view template (for simple links to documentation):
"feature_view": {
"url": "https://example.com/api/feature/{{id}}"
}
Or like the preview service (so clients could embed HTML snippets with details on the features in their UI):
"feature_preview": {
"url": "https://example.com/api/feature/{{id}}/preview",
"height": 200,
"width": 350
}
Yes, I would not want to specify the list of allowed features in the specs.
Ok, IIIF's stance on this makes sense. In that case I guess it would be safer not to rely on JSON-LD technology to handle indirections: JSON-LD annotations could be added to the existing JSON payloads to document them, but they could also be safely ignored. (So: forget about the blank node idea, it is a bad one.)
After many months I have added support for the feature_view template discussed above.
Since the discussion has settled, I propose to merge this soon. We can have follow-up PRs if other issues arise.
This is a fairly simple proposal to expose matching features in reconciliation candidates, for #31.
The idea behind this proposal is that clients would then be able to construct datasets of reconciliation candidates and their features, annotate which candidates are correct, and train some classifier to predict correctness based on the features (and potentially other features computed locally). This relies on the assumption that services return a given feature (designated by its id) in many different candidates.
Note that there is no requirement that the global score for a candidate be derived from the individual features in a particular way: services are free to expose features which are not actually used to compute the global score. This seems useful since it lets services expose features which could potentially be useful in some scenarios but are not used for the main score, to keep its computation simple.
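For instance, a client could accumulate one training record per manually judged candidate. A sketch, where the record layout is entirely up to the client and the field names here are hypothetical:
{
  "candidate_id": "1117582299",
  "features": {
    "Name_tfidf": 378.239,
    "Pagerank": -3.1209
  },
  "correct": true
}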
The global score is still kept for backwards compatibility and because it gives a useful baseline for clients to build on (or potentially to use as a feature itself, if it isn't already derived from the exposed features).
This JSON syntax is a bit heavy if services want to return hundreds of features for each candidate, but I wanted to avoid using a simple dictionary, as that can make implementation in clients harder (#33).
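For comparison, here is the array form proposed here next to the dictionary form it avoids (feature names are taken from the examples in this thread; the dictionary form is shown only as the rejected alternative):
"features": [
  { "id": "Name_tfidf", "value": 378.239 },
  { "id": "Pagerank", "value": -3.1209 }
]
rather than:
"features": {
  "Name_tfidf": 378.239,
  "Pagerank": -3.1209
}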
It's obviously quite hard to evaluate what could go wrong without building the corresponding functionality in clients (for instance OpenRefine)… Do you think such an API would work for your use cases?