
Expose matching features for client-side scoring #38

Merged (3 commits), Jul 26, 2020

Conversation

wetneb (Member) commented Feb 10, 2020

This is a fairly simple proposal to expose matching features in reconciliation candidates, for #31.

The idea behind this proposal is that clients would then be able to construct datasets of reconciliation candidates and their features, annotate which candidates are correct, and train some classifier to predict correctness based on the features (and potentially other features computed locally). This relies on the assumption that services return a given feature (designated by its id) in many different candidates.
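The client-side workflow described above could be sketched as follows. Everything here is hypothetical illustration, not part of the proposal: the candidate payloads, the feature ids ("tfidf", "pagerank"), and the tiny dependency-free logistic-regression trainer are all stand-ins for whatever a real client would use.

```python
import math

# Hypothetical reconciliation candidates, annotated by the user with whether
# each one is a correct match. Feature ids are illustrative only.
candidates = [
    {"features": [{"id": "tfidf", "value": 3.0}, {"id": "pagerank", "value": 2.0}], "match": True},
    {"features": [{"id": "tfidf", "value": 2.5}, {"id": "pagerank", "value": 1.8}], "match": True},
    {"features": [{"id": "tfidf", "value": 0.2}, {"id": "pagerank", "value": 0.1}], "match": False},
    {"features": [{"id": "tfidf", "value": 0.5}, {"id": "pagerank", "value": 0.3}], "match": False},
]

FEATURE_IDS = ["tfidf", "pagerank"]

def vectorize(candidate):
    """Turn the feature list into a fixed-order vector, defaulting to 0.0
    when a service omits a feature for this candidate."""
    by_id = {f["id"]: f["value"] for f in candidate["features"]}
    return [by_id.get(fid, 0.0) for fid in FEATURE_IDS]

def train(data, epochs=200, lr=0.1):
    """Minimal logistic regression via gradient descent (stdlib only)."""
    w = [0.0] * len(FEATURE_IDS)
    b = 0.0
    for _ in range(epochs):
        for c in data:
            x = vectorize(c)
            y = 1.0 if c["match"] else 0.0
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict(w, b, candidate):
    """Predicted probability that this candidate is a correct match."""
    z = sum(wi * xi for wi, xi in zip(w, vectorize(candidate))) + b
    return 1.0 / (1.0 + math.exp(-z))

w, b = train(candidates)
print(predict(w, b, {"features": [{"id": "tfidf", "value": 2.8}, {"id": "pagerank", "value": 1.9}]}))
```

A real client would likely use an off-the-shelf library instead, and could mix in locally computed features alongside the ones returned by the service.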

Note that there is no requirement that the global score for a candidate be derived from the individual features in any particular way: services are free to expose features which are not actually used to compute the global score. This seems useful, since it lets services expose features which could be helpful in some scenarios but are left out of the main score to keep its computation simple.

The global score is kept for backwards compatibility and because it gives clients a useful baseline to build on (or even to use as a feature itself, if it is not already derived from the exposed features).

This JSON syntax is a bit heavy if services want to return hundreds of features for each candidate, but I wanted to avoid a plain dictionary like the one below, since its dynamic keys can make implementation in clients harder (#33):

{
   "tfidf": 3498,
   "pagerank": 34.3,
   "type_match": -11.3
}
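The array-of-objects alternative proposed instead would presumably look along these lines (a sketch: the feature ids and values are illustrative):

```json
{
  "features": [
    { "id": "tfidf", "value": 3498 },
    { "id": "pagerank", "value": 34.3 },
    { "id": "type_match", "value": -11.3 }
  ]
}
```

With fixed keys ("id", "value") in each entry, clients can validate and map the structure without handling arbitrary dynamic keys.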

It's obviously quite hard to evaluate what could go wrong without building the corresponding functionality in clients (for instance OpenRefine)… Do you think such an API would work for your use cases?

@wetneb changed the title from "Introduce matching features, for #31" to "Expose matching features for client-side scoring" on Feb 10, 2020
rybesh commented Feb 10, 2020

Is there any additional information that could make these scores more usable to a human making a judgment about reconciliation candidates (i.e. not only for training a classifier)? For example, a URL for an explanation of the feature/score?

I would potentially use something like this to expose the separate label, temporal and spatial scores for candidates returned by the PeriodO reconciler, but the most likely use of them would be to give the person using the reconciler a UI for sorting, comparing and selecting candidates based on the scores.

wetneb (Member, Author) commented Feb 10, 2020

Great point. It would totally make sense to add more metadata to these features - we want them to be as easy to understand as possible for users. It is true that we need to cater for manual workflows.

I am not sure where to add documentation URLs (and other useful metadata) to features: it feels a bit wasteful to add them in each reconciliation candidate as this will take up a lot of bandwidth. Any idea how to expose that?

rybesh commented Feb 10, 2020

A featureIds dict in the service manifest? A new kind of suggest service?

wetneb (Member, Author) commented Feb 11, 2020

A featureIds dict in the service manifest?

I was imagining that for the Wikidata reconciliation service, I would generate features for each property supplied in the query: they would be indexed by the property id. For instance, if the user supplies {"pid":"P1234","v":"foo bar"} then I would like to return features "P1234_likelihood", "P1234_fuzzymatch" and the like.

So the domain of feature ids would be open in my case (the set of valid property ids is infinite since I accept property paths too), making it impossible to list them all in the manifest.
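For instance, a candidate returned for that query might carry features like the following (a sketch only: the entity id, name, score and feature values are made up, and the feature ids follow the parametrized naming scheme described above):

```json
{
  "id": "Q12345",
  "name": "Some candidate",
  "score": 87.2,
  "features": [
    { "id": "P1234_likelihood", "value": 0.93 },
    { "id": "P1234_fuzzymatch", "value": 0.71 }
  ]
}
```

Since the feature ids depend on the properties supplied in each query, no static list in the manifest could enumerate them.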

A new kind of suggest service?

Perhaps a dedicated endpoint would do, yes.

Another option, slightly less wasteful than including feature metadata in reconciliation candidates, would be to include it at the root of reconciliation responses (where metadata for a given feature would be given only once even if the feature appears in multiple reconciliation candidates). But a dedicated endpoint could be cleaner, I think.

acka47 (Member) commented Feb 11, 2020

This looks overall good to me.

I am not sure where to add documentation URLs (and other useful metadata) to features

One approach for linking to documentation is to make OpenRefine payloads JSON-LD by adding a @context. This might require adjusting a lot of other things, but I want to bring it up here anyway, as now is the right time to think about it. In this case, you could provide something like the following, where the documentation would be available behind the URI for each property (here e.g. https://example.org/score) and type (here e.g. https://example.org/Name_tfidf):

{
    "@context":{
        "name":"https://schema.org/name",
        "score":"https://example.org/score",
        "Name_tfidf":"https://example.org/Name_tfidf",
        "Pagerank":"https://example.org/Pagerank"
    },
    "id":"1117582299",
    "name":"Urbaniak, Hans-Eberhard",
    "score":85.71888,
    "features":[
        {
            "@type":"Name_tfidf",
            "value":378.239
        },
        {
            "@type":"Pagerank",
            "value":-3.1209
        }
    ],
    "match":true,
    "type":[
        {
            "id":"AuthorityResource",
            "name":"Normdatenressource"
        },
        {
            "id":"DifferentiatedPerson",
            "name":"Individualisierte Person"
        }
    ]
}

wetneb (Member, Author) commented Feb 11, 2020

Could this @context live higher up in the JSON tree: at the level of the reconciliation result batch, instead of being added to every candidate? If so that would potentially remove some redundancy (and would be more or less what I propose above but in a more principled format).

Generally, formatting responses with JSON-LD seems like a sensible move. Should it be done at other places in the API to make it more uniform? I would just like to keep an eye on the size of the payloads: it would be good not to add too much overhead.

acka47 (Member) commented Feb 11, 2020

Re. removing redundancy and limiting the size of the payload when adding a @context: we would have to add the context to each candidate, but we could also reference an external context (see spec) and host the context file itself somewhere else, so that only one key-value pair would be added to each response, e.g.:

{
    "@context": "https://openrefine.org/context.jsonld",
    "id":"1117582299",
    "name":"Urbaniak, Hans-Eberhard",
    "score":85.71888,
    "features":[
        {
            "@type":"Name_tfidf",
            "value":378.239
        },
        {
            "@type":"Pagerank",
            "value":-3.1209
        }
    ],
    "match":true,
    "type":[
        {
            "id":"AuthorityResource",
            "name":"Normdatenressource"
        },
        {
            "id":"DifferentiatedPerson",
            "name":"Individualisierte Person"
        }
    ]
}

wetneb (Member, Author) commented Feb 11, 2020

Ok, so I guess instead of https://openrefine.org/context.jsonld, services would use a URL of their own, which would describe their own features. But then, we are back to the problem of infinite feature sets. If your features can be parametrized (as they would be for Wikidata), then you cannot list them all in this context document. So you would need to automatically generate context documents which contain the metadata of the feature ids used in a particular candidate.

If we want to save space and go down the JSON-LD route, perhaps we could use blank nodes to refer from each feature to other features defined elsewhere in the payload. But I am not sure this is a good idea.

More generally, if we start doing things the JSON-LD way, doesn't that vaguely imply that we should accept queries or responses which have a different JSON structure but are equivalent as far as their JSON-LD semantics are concerned, forcing implementations to rely on some JSON-LD library to handle that? I would be interested to know if there are specs which use JSON-LD and JSON Schema simultaneously.

acka47 (Member) commented Feb 12, 2020

Ok, so I guess instead of https://openrefine.org/context.jsonld, services would use a URL of their own, which would describe their own features. But then, we are back to the problem of infinite feature sets. If your features can be parametrized (as they would be for Wikidata), then you cannot list them all in this context document. So you would need to automatically generate context documents which contain the metadata of the feature ids used in a particular candidate.

I apparently misunderstood what is modeled here. I thought we were talking about a controlled list of features that is the same for all reconciliation services. If the list differs from service to service, I have my doubts whether a JSON-LD context is the best solution.

If we want to save space and go down the JSON-LD route, perhaps we could use blank nodes to refer from each feature to other features defined elsewhere in the payload. But I am not sure this is a good idea.

I don't understand what you mean.

More generally, if we start doing things the JSON-LD way, doesn't that vaguely imply that we should accept queries or responses which have a different JSON structure but are equivalent as far as their JSON-LD semantics are concerned? So forcing implementations to rely on some JSON-LD library to handle that somehow? I would be interested to know if there are specs which use JSON-LD and JSON schema simultaneously.

I think it is unproblematic, and even the most sensible solution, to specify the API (by a JSON Schema or otherwise) in a way that only allows for one JSON-LD expression of the semantics. This would basically mean defining a JSON API with an added JSON-LD context that enables handling the data as RDF (or providing documentation for the properties and types). For example, IIIF does it like that, see e.g. https://iiif.io/api/annex/notes/jsonld/#semantic-versioning. (A normative JSON schema for IIIF doesn't exist, though.)

fsteeg (Member) left a comment

The proposal in this pull request looks good to me.

Regarding additional documentation for the infinite feature sets, I also think an additional endpoint might be the best solution, maybe like the view template (for simple links to documentation):

"feature_view": {
  "url": "https://example.com/api/feature/{{id}}"
}

Or like the preview service (so clients could embed HTML snippets with details on the features in their UI):

"feature_preview": {
  "url": "https://example.com/api/feature/{{id}}/preview",
  "height": 200,
  "width": 350
}
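Clients would presumably resolve such a template by substituting the feature id into the {{id}} placeholder, as with the existing view template. A minimal sketch, assuming a manifest shaped like the fragment above (the URL and feature id are hypothetical):

```python
from urllib.parse import quote

# Hypothetical manifest fragment following the view-template convention.
manifest = {
    "feature_view": {"url": "https://example.com/api/feature/{{id}}"}
}

def feature_view_url(manifest, feature_id):
    """Expand the {{id}} placeholder with the URL-escaped feature id."""
    template = manifest["feature_view"]["url"]
    return template.replace("{{id}}", quote(feature_id, safe=""))

print(feature_view_url(manifest, "P1234_likelihood"))
# https://example.com/api/feature/P1234_likelihood
```

Escaping the id keeps parametrized feature ids (which may contain slashes or other reserved characters) from breaking the resulting URL.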

wetneb (Member, Author) commented Feb 12, 2020

If the list differs from service to service, I have my doubts whether a JSON-LD context is the best solution.

Yes, I would not want to specify the list of allowed features in the specs.

I think it is no problem and even the most sensible solution to specify the API (by a JSON Schema or otherwise) in a way that only allows for one JSON-LD expression of the semantics. This would basically mean to define a JSON API with an added JSON-LD context that enables handling the data as RDF (or providing documentation for the properties and types). For example, IIIF is doing it like that, see e.g. https://iiif.io/api/annex/notes/jsonld/#semantic-versioning. (A normative JSON schema for IIIF doesn't exist, though.)

Ok, IIIF's stance on this makes sense. In that case I guess it would be safer not to rely on JSON-LD's technology to handle indirections: JSON-LD annotations could be added to the existing JSON payloads to document them, but these could also be safely ignored.

(so: forget about the blank node idea, it is a bad one)

@wetneb force-pushed the issue-31-scoring-features branch from d19821d to a4d7ea9 on May 5, 2020 at 16:42
wetneb (Member, Author) commented May 5, 2020

After many months I have added support for the feature_view suggested by @fsteeg. I prefer this over a preview-style service because I think it is not ideal to build a spec based on iframes in 2020… It's cleaner to just associate a URI with a feature, and let clients decide what they do with it.

@wetneb wetneb linked an issue Jul 21, 2020 that may be closed by this pull request
wetneb (Member, Author) commented Jul 21, 2020

Since the discussion has settled, I propose to merge this soon. We can have follow-up PRs if other issues arise.

@wetneb wetneb merged commit b07fe4f into master Jul 26, 2020
@wetneb wetneb deleted the issue-31-scoring-features branch July 26, 2020 09:29
Merging this pull request may close: Give users more control over candidate scoring

5 participants