
Expose matching features for client-side scoring #38

Merged (3 commits), Jul 26, 2020

Conversation

wetneb (Member) commented Feb 10, 2020

This is a fairly simple proposal to expose matching features in reconciliation candidates, for #31.

The idea behind this proposal is that clients would then be able to construct datasets of reconciliation candidates and their features, annotate which candidates are correct, and train some classifier to predict correctness based on the features (and potentially other features computed locally). This relies on the assumption that services return a given feature (designated by its id) in many different candidates.
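The client-side workflow described above could be sketched as follows. Everything here is hypothetical illustration, not part of the proposal: the candidate payloads, the feature ids ("tfidf", "pagerank"), and the tiny dependency-free logistic-regression trainer are all stand-ins for whatever a real client would use.

```python
import math

# Hypothetical reconciliation candidates, annotated by the user with whether
# each one is a correct match. Feature ids are illustrative only.
candidates = [
    {"features": [{"id": "tfidf", "value": 3.0}, {"id": "pagerank", "value": 2.0}], "match": True},
    {"features": [{"id": "tfidf", "value": 2.5}, {"id": "pagerank", "value": 1.8}], "match": True},
    {"features": [{"id": "tfidf", "value": 0.2}, {"id": "pagerank", "value": 0.1}], "match": False},
    {"features": [{"id": "tfidf", "value": 0.5}, {"id": "pagerank", "value": 0.3}], "match": False},
]

FEATURE_IDS = ["tfidf", "pagerank"]

def vectorize(candidate):
    """Turn the feature list into a fixed-order vector, defaulting to 0.0
    when a service omits a feature for this candidate."""
    by_id = {f["id"]: f["value"] for f in candidate["features"]}
    return [by_id.get(fid, 0.0) for fid in FEATURE_IDS]

def train(data, epochs=200, lr=0.1):
    """Minimal logistic regression via gradient descent (stdlib only)."""
    w = [0.0] * len(FEATURE_IDS)
    b = 0.0
    for _ in range(epochs):
        for c in data:
            x = vectorize(c)
            y = 1.0 if c["match"] else 0.0
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict(w, b, candidate):
    """Predicted probability that this candidate is a correct match."""
    z = sum(wi * xi for wi, xi in zip(w, vectorize(candidate))) + b
    return 1.0 / (1.0 + math.exp(-z))

w, b = train(candidates)
print(predict(w, b, {"features": [{"id": "tfidf", "value": 2.8}, {"id": "pagerank", "value": 1.9}]}))
```

A real client would likely use an off-the-shelf library instead, and could mix in locally computed features alongside the ones returned by the service.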

Note that there is no requirement that the global score for a candidate be derived from the individual features in any particular way: services are free to expose features which are not actually used to compute the global score. This seems useful, since it lets services expose features which could be helpful in some scenarios but are left out of the main score to keep its computation simple.

The global score is kept for backwards compatibility and because it gives clients a useful baseline to build on (or even to use as a feature itself, if it is not already derived from the exposed features).

This JSON syntax is a bit heavy if services want to return hundreds of features for each candidate, but I wanted to avoid a plain dictionary like the one below, since its dynamic keys can make implementation in clients harder (#33):

{
   "tfidf": 3498,
   "pagerank": 34.3,
   "type_match": -11.3
}
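The array-of-objects alternative proposed instead would presumably look along these lines (a sketch: the feature ids and values are illustrative):

```json
{
  "features": [
    { "id": "tfidf", "value": 3498 },
    { "id": "pagerank", "value": 34.3 },
    { "id": "type_match", "value": -11.3 }
  ]
}
```

With fixed keys ("id", "value") in each entry, clients can validate and map the structure without handling arbitrary dynamic keys.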

It's obviously quite hard to evaluate what could go wrong without building the corresponding functionality in clients (for instance OpenRefine)… Do you think such an API would work for your use cases?

@wetneb changed the title from "Introduce matching features, for #31" to "Expose matching features for client-side scoring" on Feb 10, 2020
rybesh commented Feb 10, 2020

Is there any additional information that could make these scores more usable to a human making a judgment about reconciliation candidates (i.e. not only for training a classifier)? For example, a URL for an explanation of the feature/score?

I would potentially use something like this to expose the separate label, temporal and spatial scores for candidates returned by the PeriodO reconciler, but the most likely use of them would be to give the person using the reconciler a UI for sorting, comparing and selecting candidates based on the scores.

wetneb (Member, Author) commented Feb 10, 2020

Great point. It would totally make sense to add more metadata to these features - we want them to be as easy to understand as possible for users. It is true that we need to cater for manual workflows.

I am not sure where to add documentation URLs (and other useful metadata) to features: it feels a bit wasteful to add them in each reconciliation candidate as this will take up a lot of bandwidth. Any idea how to expose that?

rybesh commented Feb 10, 2020

A featureIds dict in the service manifest? A new kind of suggest service?

wetneb (Member, Author) commented Feb 11, 2020

A featureIds dict in the service manifest?

I was imagining that for the Wikidata reconciliation service, I would generate features for each property supplied in the query: they would be indexed by the property id. For instance, if the user supplies {"pid":"P1234","v":"foo bar"} then I would like to return features "P1234_likelihood", "P1234_fuzzymatch" and the like.

So the domain of feature ids would be open in my case (the set of valid property ids is infinite since I accept property paths too), making it impossible to list them all in the manifest.
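For instance, a candidate returned for that query might carry features like the following (a sketch only: the entity id, name, score and feature values are made up, and the feature ids follow the parametrized naming scheme described above):

```json
{
  "id": "Q12345",
  "name": "Some candidate",
  "score": 87.2,
  "features": [
    { "id": "P1234_likelihood", "value": 0.93 },
    { "id": "P1234_fuzzymatch", "value": 0.71 }
  ]
}
```

Since the feature ids depend on the properties supplied in each query, no static list in the manifest could enumerate them.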

A new kind of suggest service?

Perhaps a dedicated endpoint would do, yes.

Another option, slightly less wasteful than including feature metadata in reconciliation candidates, would be to include it at the root of reconciliation responses (where metadata for a given feature would be given only once even if the feature appears in multiple reconciliation candidates). But a dedicated endpoint could be cleaner, I think.

acka47 (Member) commented Feb 11, 2020

This looks overall good to me.

I am not sure where to add documentation URLs (and other useful metadata) to features

One approach for linking to documentation is to make OpenRefine payloads JSON-LD by adding a @context. This might require adjusting a lot of other things, but I want to bring it up here anyway, as now is the right time to think about it. In this case, you could provide something like the following, where the documentation would be available behind the URI for each property (here e.g. https://example.org/score) and type (here e.g. https://example.org/Name_tfidf):

{
    "@context":{
        "name":"https://schema.org/name",
        "score":"https://example.org/score",
        "Name_tfidf":"https://example.org/Name_tfidf",
        "Pagerank":"https://example.org/Pagerank"
    },
    "id":"1117582299",
    "name":"Urbaniak, Hans-Eberhard",
    "score":85.71888,
    "features":[
        {
            "@type":"Name_tfidf",
            "value":378.239
        },
        {
            "@type":"Pagerank",
            "value":-3.1209
        }
    ],
    "match":true,
    "type":[
        {
            "id":"AuthorityResource",
            "name":"Normdatenressource"
        },
        {
            "id":"DifferentiatedPerson",
            "name":"Individualisierte Person"
        }
    ]
}

wetneb (Member, Author) commented Feb 11, 2020

Could this @context live higher up in the JSON tree: at the level of the reconciliation result batch, instead of being added to every candidate? If so that would potentially remove some redundancy (and would be more or less what I propose above but in a more principled format).

Generally, formatting responses with JSON-LD seems like a sensible move. Should it be done at other places in the API to make it more uniform? I would just like to keep an eye on the size of the payloads: it would be good not to add too much overhead.

acka47 (Member) commented Feb 11, 2020

Re. removing redundancy and limiting the size of the payload when adding a @context: we would have to add the context to each candidate, but we could also reference an external context (see spec) and host the context file itself somewhere else, so that only one key-value pair would be added to each response, e.g.:

{
    "@context": "https://openrefine.org/context.jsonld",
    "id":"1117582299",
    "name":"Urbaniak, Hans-Eberhard",
    "score":85.71888,
    "features":[
        {
            "@type":"Name_tfidf",
            "value":378.239
        },
        {
            "@type":"Pagerank",
            "value":-3.1209
        }
    ],
    "match":true,
    "type":[
        {
            "id":"AuthorityResource",
            "name":"Normdatenressource"
        },
        {
            "id":"DifferentiatedPerson",
            "name":"Individualisierte Person"
        }
    ]
}

wetneb (Member, Author) commented Feb 11, 2020

Ok, so I guess instead of https://openrefine.org/context.jsonld, services would use a URL of their own, which would describe their own features. But then, we are back to the problem of infinite feature sets. If your features can be parametrized (as they would be for Wikidata), then you cannot list them all in this context document. So you would need to automatically generate context documents which contain the metadata of the feature ids used in a particular candidate.

If we want to save space and go down the JSON-LD route, perhaps we could use blank nodes to refer from each feature to other features defined elsewhere in the payload. But I am not sure this is a good idea.

More generally, if we start doing things the JSON-LD way, doesn't that vaguely imply that we should accept queries or responses which have a different JSON structure but are equivalent as far as their JSON-LD semantics are concerned, forcing implementations to rely on some JSON-LD library to handle that? I would be interested to know if there are specs which use JSON-LD and JSON Schema simultaneously.

acka47 (Member) commented Feb 12, 2020

Ok, so I guess instead of https://openrefine.org/context.jsonld, services would use a URL of their own, which would describe their own features. But then, we are back to the problem of infinite feature sets. If your features can be parametrized (as they would be for Wikidata), then you cannot list them all in this context document. So you would need to automatically generate context documents which contain the metadata of the feature ids used in a particular candidate.

I apparently misunderstood what is modeled here. I thought we were talking about a controlled list of features that is the same for all reconciliation services. If the list differs from service to service, I have my doubts whether a JSON-LD context is the best solution.

If we want to save space and go down the JSON-LD route, perhaps we could use blank nodes to refer from each feature to other features defined elsewhere in the payload. But I am not sure this is a good idea.

I don't understand what you mean.

More generally, if we start doing things the JSON-LD way, doesn't that vaguely imply that we should accept queries or responses which have a different JSON structure but are equivalent as far as their JSON-LD semantics are concerned? So forcing implementations to rely on some JSON-LD library to handle that somehow? I would be interested to know if there are specs which use JSON-LD and JSON schema simultaneously.

I think it is unproblematic, and even the most sensible solution, to specify the API (by a JSON Schema or otherwise) in a way that only allows for one JSON-LD expression of the semantics. This would basically mean defining a JSON API with an added JSON-LD context that enables handling the data as RDF (or providing documentation for the properties and types). For example, IIIF does it like that, see e.g. https://iiif.io/api/annex/notes/jsonld/#semantic-versioning. (A normative JSON schema for IIIF doesn't exist, though.)

fsteeg (Member) left a comment

The proposal in this pull request looks good to me.

Regarding additional documentation for the infinite feature sets, I also think an additional endpoint might be the best solution, maybe like the view template (for simple links to documentation):

"feature_view": {
  "url": "https://example.com/api/feature/{{id}}"
}

Or like the preview service (so clients could embed HTML snippets with details on the features in their UI):

"feature_preview": {
  "url": "https://example.com/api/feature/{{id}}/preview",
  "height": 200,
  "width": 350
}
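Clients would presumably resolve such a template by substituting the feature id into the {{id}} placeholder, as with the existing view template. A minimal sketch, assuming a manifest shaped like the fragment above (the URL and feature id are hypothetical):

```python
from urllib.parse import quote

# Hypothetical manifest fragment following the view-template convention.
manifest = {
    "feature_view": {"url": "https://example.com/api/feature/{{id}}"}
}

def feature_view_url(manifest, feature_id):
    """Expand the {{id}} placeholder with the URL-escaped feature id."""
    template = manifest["feature_view"]["url"]
    return template.replace("{{id}}", quote(feature_id, safe=""))

print(feature_view_url(manifest, "P1234_likelihood"))
# https://example.com/api/feature/P1234_likelihood
```

Escaping the id keeps parametrized feature ids (which may contain slashes or other reserved characters) from breaking the resulting URL.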

wetneb (Member, Author) commented Feb 12, 2020

If the list differs from service to service, I have my doubts whether a JSON-LD context is the best solution.

Yes, I would not want to specify the list of allowed features in the specs.

I think it is no problem and even the most sensible solution to specify the API (by a JSON Schema or otherwise) in a way that only allows for one JSON-LD expression of the semantics. This would basically mean to define a JSON API with an added JSON-LD context that enables handling the data as RDF (or providing documentation for the properties and types). For example, IIIF is doing it like that, see e.g. https://iiif.io/api/annex/notes/jsonld/#semantic-versioning. (A normative JSON schema for IIIF doesn't exist, though.)

Ok, IIIF's stance on this makes sense. In that case I guess it would be safer not to rely on JSON-LD's technology to handle indirections: JSON-LD annotations could be added to the existing JSON payloads to document them, but these could also be safely ignored.

(so: forget about the blank node idea, it is a bad one)

@wetneb force-pushed the issue-31-scoring-features branch from d19821d to a4d7ea9 on May 5, 2020 at 16:42
wetneb (Member, Author) commented May 5, 2020

After many months I have added support for the feature_view suggested by @fsteeg. I prefer this over a preview-style service because I think it is not ideal to build a spec based on iframes in 2020… It's cleaner to just associate a URI with a feature, and let clients decide what they do with it.

@wetneb wetneb linked an issue Jul 21, 2020 that may be closed by this pull request
wetneb (Member, Author) commented Jul 21, 2020

Since the discussion has settled, I propose to merge this soon. We can have follow-up PRs if other issues arise.

@wetneb wetneb merged commit b07fe4f into master Jul 26, 2020
@wetneb wetneb deleted the issue-31-scoring-features branch July 26, 2020 09:29
Merging this pull request may close: Give users more control over candidate scoring

5 participants