feedback for ML tuning weights #30
I talked just yesterday to Ontotext's CTO @vassilmomtchev about such a feature.
This definitely needs to be opt-in and not something mandated by the API/protocol.
Completely agree. Opt-in, but it would be good to have a pattern to reuse.
For generic reconciliation services like Wikidata, I would probably not rely on this feature if it existed in the specs, because of the diversity of the queries, use cases and matching criteria. If I could collect final decision matches from users, I am not sure this would make a good dataset to influence the scoring mechanism in the service. Because of the diversity of data shapes and matching criteria each user has, taking into account the decisions of user A in our scoring might not help user B at all. Also, updating the scoring mechanism server-side makes reconciliation workflows less reproducible.

Since this must be opt-in, the question is also how to incentivize users to send this feedback. What benefit do they get out of it? The vague promise that their decisions might be used down the line to change the scoring mechanism? How will they know if/when they can count on that happening? But that's just a comment about my own use cases - no particular opposition to adding it to the specs if it can be useful for other services.
After today's call and @workergnome's feedback, I actually feel like part of what @wetneb was describing about feedback with classifier routines and ML almost deserves ANOTHER API for ML workflows. The risk of separating the APIs (one for humans, one for machines) is what, however? (playing devil's advocate here)
@thadguidry this issue is different from what we have been discussing today. At least I was talking about #31. |
I had been thinking about implementing something like this. Glad to see others agree, less glad to see it isn't quite there yet.

While I agree that it should be optional, I don't really get the significance it is given in this discussion. I can't come up with a realistic scenario where any private data is compromised by such a feedback mechanism but not by the initial query. Assuming services see actual value in this data, clients should make opting in explicit but easy: not hiding it in a configuration file but, for example, specifically asking the user to opt in or out, and optionally remembering that decision.

@wetneb's concerns come, I believe, from a slightly too simple idea of how something like this would work: the system would not save individual matching decisions in their entirety and then reproduce them when exactly the same query is made again. It's probably better to think of this as changing the matching algorithm rather than changing any entity data. It would allow, for example, learning the relative importance of different property matches, or creating far more realistic algorithms for scoring differences: the month and day of a birth date being switched probably happens more often than other errors, and should therefore incur less of a penalty. Names are already scored using Soundex or similar, but that doesn't capture the similarity of "William" and "Bob". Middle names are highly relevant in some cultures (George H. W. vs. George W.), while in others everyone has them but they are commonly dropped in all but formal communications. The universe of such nuances is pretty much endless, and certainly greater than anyone's ability to manually catalog it. And while individual data shapes may differ between users, any halfway decent system is bound to improve matching drastically, if only because what we currently have is really just voodoo.

Having that data also allows measuring scoring performance. That's a huge benefit even without any attempt at machine learning. The scoring and matching could evolve as they have until now, by reasoning our way to what we believe to be better and implementing it in code. But after any change, we could then re-run past data, measure any improvements, and also detect pathological cases we just introduced.
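The kind of scoring nuance described above can be encoded directly in a hand-written scorer. A toy sketch (the function name, weights and thresholds are all invented for illustration, not from any spec or implementation):

```python
from datetime import date

def date_score(a: date, b: date) -> float:
    """Toy similarity score for two birth dates.

    A swapped day and month is a common data-entry error, so it is
    penalised less than an arbitrary mismatch. All weights are
    illustrative placeholders.
    """
    if a == b:
        return 1.0
    # Same year, but day and month transposed: likely a swap error.
    if (a.year, a.month, a.day) == (b.year, b.day, b.month):
        return 0.8
    # Same year, otherwise different: weak signal.
    if a.year == b.year:
        return 0.4
    return 0.0

# 4 March vs 3 April in the same year looks like a day/month swap.
print(date_score(date(1950, 3, 4), date(1950, 4, 3)))  # 0.8
```

A learned scorer would discover such penalties from match decisions instead of having them hard-coded.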
Then I was not clear enough: I would absolutely not reproduce learned replies when the exact same query is made again; that would not make sense. The goal is indeed to improve the scoring mechanism in general. But as a user, I would not like to use a service whose scoring mechanism changes unpredictably depending on the matching decisions submitted by other users. This means that the same reconciliation workflow on the same data could give different results if run again a few days later.

It might make sense for some services to learn the relative importance of properties over all the queries they get, but not for services like Wikidata, where this relative importance really depends on the dataset you have at hand. Say I am matching a dataset of sportspeople with names, sport practiced and nationality. In my dataset, names only contain initials for the given names, so perhaps the matching will be better with more weight on the sport and nationality. But perhaps the day after, you want to match another dataset, which also happens to have names, sport practiced and nationality. In your dataset, however, the names are completely spelled out, even with middle names, and the nationality column is less reliable because it is sometimes mixed up with the country the person plays for as a sportsperson. So in your case you probably want a higher weight on the name and less reliance on the nationality. If the service learns your weights, and I come back to match more of my dataset, I will get less precise matching and will have to push the service's settings back to fit my needs.

This is why I would like to make it possible to learn weights client-side, by letting services expose more granular features (#38). That being said, I am not opposed to making it possible to submit decisions as proposed here: if you have an idea of what this should look like, why not submit a PR?
@wetneb You give an excellent example of two matching scenarios that put different importance on the same features. Let's call this a "feature scenario". I think the client needs to expose these preferences to the server in a declarative way. A naive proposal:
But if the client can't affect server-side processing, what good is that? The client can only afford to get a limited number of candidates per row. They could use the exposed feature weights to order those in a different way, but if the best candidate is not in that limited selection, they're stuck. We need to find a way to expose to the server the feature preferences embodied in the reordering. I can't believe we'd be the first to face this fundamental problem: how can ML clients and servers interact?
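One way such a declarative "feature scenario" might look in a query payload (a purely hypothetical sketch: the "weights" field is invented and not part of the reconciliation API spec; the type and property IDs are only illustrative):

```python
# Hypothetical query extension: the client declares the relative
# importance of each feature so the server can use it when ranking.
query_with_weights = {
    "query": "J. Smith",
    "type": "Q2066131",  # illustrative type id for a sportsperson
    "properties": [
        {"pid": "P641", "v": "tennis"},   # sport
        {"pid": "P27", "v": "Germany"},   # nationality
    ],
    # Invented field: client-declared feature weights, summing to 1.
    "weights": {"name": 0.2, "P641": 0.5, "P27": 0.3},
}

# A server supporting this extension could validate the weights:
assert abs(sum(query_with_weights["weights"].values()) - 1.0) < 1e-9
```

This would let two users with the sportspeople datasets described earlier send different weight profiles against the same service without the service having to learn a single global profile.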
I would personally find it useful to be able to locally re-score the reconciliation results returned by the service. This re-scoring could either be done using a scoring function I came up with manually, or be learned from data (train a classifier on a few samples of annotated data). Let's make this super concrete to make sure we are on the same page on this. I reconcile a database of films to Wikidata using the following information: title, director, producer and filming date. Let me know if you still struggle to see the interest in this, happy to expand on any step!
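The client-side re-scoring idea can be sketched with a simple linear scoring function. This assumes the service exposes per-feature scores for each candidate (the proposal in #38); the feature names, weights and candidate data below are all made up for illustration:

```python
def rescore(candidates, weights):
    """Re-rank candidates client-side with a hand-tuned linear score.

    `candidates` is a list of dicts mapping feature names to per-feature
    scores in [0, 1], as a service might expose them; `weights` is the
    user's own importance profile. Both are illustrative, not a spec.
    """
    return sorted(
        candidates,
        key=lambda c: sum(w * c[f] for f, w in weights.items()),
        reverse=True,
    )

# Two candidates for a film row, with hypothetical per-feature scores:
candidates = [
    {"id": "Q1", "title": 0.9, "director": 0.2, "year": 1.0},
    {"id": "Q2", "title": 0.7, "director": 1.0, "year": 1.0},
]

# A user who trusts the director column heavily re-ranks locally:
ranked = rescore(candidates, {"title": 0.2, "director": 0.6, "year": 0.2})
print([c["id"] for c in ranked])  # ['Q2', 'Q1']
```

Replacing the hand-tuned weights with ones fitted on a few annotated rows (e.g. with logistic regression) is the "learned from data" variant of the same idea.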
How do you know most matches will be "a bit further" and not "quite further"? People typically ask for top-3 or top-5 matches. If the correct match is not within this short list, there's nothing you can do client-side. Another (smaller) concern: the option "auto-match top candidate" is very important on large sheets. In your scenario you'd need multiple recon steps on the client side:
AFAIK the last two items can't be done with OpenRefine at present.
Of course. This only works when the default scoring mechanism of the service is good enough to surface the correct candidates somewhere in the list.
This can already be done in OpenRefine; there is a button for it in the "Reconcile" -> "Actions" menu.
@VladimirAlexiev Hi Vladimir, were you thinking of exposing something like a PIT (point in time) or a search_after, to allow reconciliation providers to show their clients that there are further pages of hits, regardless of client preferences for limiting the size of the results returned? That way they would still get some kind of indicator that "there are more pages of reconcile results available for this recon entity". The scoring of each page or listing could reflect that "we are not done yet here" and show reduced weighting for scores, because both sides know that a full picture hasn't surfaced yet. Something like how Elasticsearch does it:
https://www.elastic.co/guide/en/elasticsearch/reference/current/paginate-search-results.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/filter-search-results.html
@wetneb and @thadguidry If I have 10k rows, I'd like to get 3-5 matches per row, not a number of matches that would require pagination.
@VladimirAlexiev Sure, understood, but it's always about context. For example, if you were reconciling Lexemes, there are entities (words) in the English language that have over 400 senses. One of those words is "set". Imagine I have that word "set" in my client and want to reconcile it against one of those 400 senses. For manual reconciling, I need to see all 400 senses in order to choose the right one. For ML tuning, of course it would need additional info besides just the Lexeme, but even with ML tuning hints to reduce the search scope, it might result in 50 plausible matches for a sense of "set". I'm not an expert on ML tuning parameters; I'm simply laying out the context in one example. Many clients will have differing needs. Large numbers of possible matches come up a lot in biology and linguistics.
This issue has been quiet for a while, but I hope there's still interest in it. I've been thinking that one thing the recon API might benefit from is some notion of a session. You'd establish a session ID with the recon API, probably after getting the manifest but before doing your first reconciliation query. Then, in your reconciliation queries, you could include the session ID, and if the server wants to track any state for you, it can.

The use case I was thinking of is a variation on the feedback for weights: perhaps I've got some pre-labeled examples that I'd like to send to the recon API that might help it fine-tune its weights for my upcoming queries. I would rather not include that batch of examples in my actual queries, partly to keep the amount of data I have to send lower, and partly because the server-side API might need a bit of time to fine-tune its model for my session based on the examples I just sent, so I don't want to send my queries right away. (Ideally, I could poll the server to find out when my session is fully trained and ready to go.) This is basically the same thing as sending feedback on which of the candidates the human actually chose. The recon API is really close to being a great API for some human-in-the-loop workflows, but I think it needs something more like a session to really be usable as a loop.

As an added bonus, I think a session would give the server some implementation flexibility to manage privacy concerns, to better prepare for the API to be more asynchronous and to recover from crashes on the client side, and maybe to be smarter about any caching policies the server might want to manage, especially if the service is load-balancing across different backend servers. It certainly could be an optional extension: servers could decide not to support it, and clients would just treat the service as the best-effort API that exists today.

But overall, I think an explicit notion of a session in the API would help with the implementation of the ideas in this issue. (The API might still want to go a bit farther and make the actual reconciliation query POSTs include a request ID/handle, so individual batches from each POST can be async/recoverable, but even then there's value in having an overarching session; adding request IDs to the batches could be phase 2.)
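The server side of such a session extension could be sketched as a small state store. Everything here is hypothetical (class, method and field names are invented; this is not part of the current reconciliation spec):

```python
import uuid

class SessionStore:
    """Minimal sketch of server-side state for a hypothetical
    session extension to the reconciliation API."""

    def __init__(self):
        self._sessions = {}

    def create(self):
        """Issue a new session ID, as a client would request up front."""
        sid = str(uuid.uuid4())
        self._sessions[sid] = {"examples": [], "trained": False}
        return sid

    def add_examples(self, sid, examples):
        """Accept pre-labeled examples; a real server would kick off
        asynchronous retraining of the session's model here."""
        self._sessions[sid]["examples"].extend(examples)
        self._sessions[sid]["trained"] = False

    def is_ready(self, sid):
        """What a client would poll before sending its real queries."""
        return self._sessions[sid]["trained"]

store = SessionStore()
sid = store.create()
store.add_examples(
    sid,
    # "match" holds a placeholder QID; all values are illustrative.
    [{"university_name": "Washington", "city": "Seattle", "match": "Q12345"}],
)
```

A session ID would then be echoed in each reconciliation query, letting the server associate queries with the fine-tuned model.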
I put together an example server that implements some of my session idea. The writeup was a bit long for a comment, so I posted it as a message on the mailing list: https://lists.w3.org/Archives/Public/public-reconciliation/2021Jun/0000.html (About half of it is not about ML feedback, but about using dedupe, a state-of-the-art open-source entity matching library that trains a matcher from a handful of examples, as the matching engine.)

The workflow isn't quite what Antonin described, where you'd give feedback on the results of individual batches. The approach I took is necessary for other use cases, but can sort of subsume the workflow Antonin described, and it would be easy to add support for Antonin's workflow as well. One thing to keep in mind is that this adds a bunch of new state at the server: my approach adds a session that you add positive match examples into, but if you want to give feedback on the results of individual queries, you'd also need to keep track of the individual query IDs from the reconciliation query batch (e.g. the 'q0', 'q1', 'q2' etc. from the examples). In fact, you'd probably need to say that the batch itself needs an ID, or at least that the 'q0', 'q1', etc. in each batch somehow need to be made unique. Then you have to figure out how long you want to keep that around: can a user give feedback on a match the reconciliation API suggested a year ago?

The reconciliation API/protocol is really close to what you need for a good human-in-the-loop entity matching system, and a few more batch IDs would help make the protocol more async-friendly, which would help OpenRefine's UX when reconciling or running the data extension API.
@epaulson Does your server and session take into account "disambiguating data", such as extra columns of data or properties that a human used in their own decision loop to make their match? And is this "disambiguating choice data" uploaded back to the server via the API, or is only the chosen reconciliation candidate uploaded? Classification systems will need all of it, I think, if there is any reasonable expectation of getting something useful. See the use case I linked above: wetneb/openrefine-wikibase#117
@thadguidry For additional 'disambiguating data' while searching for a match, I didn't do anything special protocol-wise; the existing 'properties' option in a reconciliation query seemed great. If you're matching on a column for 'university name' and the row is 'Washington', adding a property for 'city' lets you figure out which one is in 'Seattle' and which in 'Lexington'. But that was of course already included in the existing API specs, and OpenRefine has a nice UI for it.

For training purposes, you would upload those additional columns as part of the set of labeled examples. For example, if your unlabeled dataset is two columns, <university_name, city>, and you want to reconcile/predict the QID of the actual university, the protocol extension I envisioned meant that you uploaded a small dataset with three columns, <university_name, city, QID>, and the system trained on that for every future unlabeled example you queried for in your session. The dedupe Python library that my test server uses has support for including multiple columns while training and making match decisions, so it was easy to include in my test server. Dedupe might be right for some future API implementors, but I think most will want to write their own matching algorithm.

For reporting back which choice was made for a set of specific rows (i.e. a human turned a bunch of unlabeled/unmatched rows into labeled examples), I decided not to special-case that. It didn't seem compelling to have a different endpoint for it, since after a human reconciled a set of unmatched data into a set of matched data, they could just treat that newly human-matched data as more training examples and upload it to the existing endpoint as additional training data. One thing this means is that I did NOT require the client to save the candidates that were returned for each query.

If the server wants to use those non-matched candidates in its training, I assumed the server would have to reconstruct what it would have sent to the client for that choice. For example, if the training example is <uname1, city1, QID1234>, I assumed the server would know that for <uname1, city1> it would have suggested <QID1234, QID1235, QID1236> as its original match candidates, and could train itself as appropriate. You could certainly make the argument that the protocol for training responses should send back something like <input: {uname1, city1}, Match: {QID1234}, NonMatch: {QID1235, QID1236}>. Including example non-matches is really important for most entity matching systems, so including the set of suggested candidates is absolutely something to think about, but again that could just be part of the regular training endpoint and might not have to be called out as a separate API. (If you haven't given Dedupe.io a try, or watched a training video for it, you really should; they've got a nice UX for this.)

In my server I just made up a new format for what to send to the server as training examples, but the idea in wetneb/openrefine-wikibase#117 is to use the operation history file of OpenRefine, which is an appealing idea. There's a lot of non-relevant info included in that file, but the OpenRefine operation history might be a nice reuse of existing code. The other difference between what I'm suggesting and what #117 is suggesting is that I scoped everything to a "session", which for privacy reasons I figured would belong to a single user rather than a larger collaborative effort, but I think you could make it work for an opt-in collaboration. It also simplifies the server, because it eliminates some questions of privacy on the server. (Which I think will also help with scaling when that eventually matters.)
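A training record carrying the chosen match along with the rejected candidates, as in the <input, Match, NonMatch> shape mentioned above, could be built like this (a sketch; the field names are invented, not a spec):

```python
def training_record(row, chosen, candidates):
    """Build a hypothetical training payload: the input row, the
    candidate the human chose, and the rejected candidates as
    explicit non-matches. Field names are illustrative only."""
    return {
        "input": row,
        "match": chosen,
        "non_match": [c for c in candidates if c != chosen],
    }

# Using the <uname1, city1, QID...> example from the discussion above:
rec = training_record(
    {"university_name": "Washington", "city": "Seattle"},
    "QID1234",
    ["QID1234", "QID1235", "QID1236"],
)
print(rec["non_match"])  # ['QID1235', 'QID1236']
```

Keeping the rejected candidates in the record is what gives the server the negative examples most entity matching systems need, without requiring it to reconstruct its own earlier suggestions.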
@epaulson Thanks for the feedback, Erik. Yes, I've played with the dedupe library before, and thoroughly enjoyed reading Mikhail's dissertation (much later than when it was published). Back in the day, a bit after Gridworks (later OpenRefine) came into being, Stefano's clustering notes somehow pointed me to it, as I recall.
@epaulson You might also be interested in https://arrow.apache.org/docs/format/Flight.html
Sending reconciliation decisions to a service
Reconciliation services are currently unaware of which of their proposed candidates was picked by the user (if any). In the January CG call we discussed that there could potentially be a method in the API to do something along these lines. A client would send back chosen matches using a dedicated API method. They would likely need to refer to the original query in some way (or provide it again). This would probably be an opt-in feature for most clients for privacy reasons.
If the service provider wants to rely on this data to tune the weights of their scoring mechanism (for instance), they probably want to rely on user authentication (#26) to be able to attribute decisions to particular users.
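A dedicated method along these lines might carry a payload like the following sketch. The field names, and the convention of echoing the original query keys ('q0', 'q1', ...) so the service can associate each decision with its query, are assumptions for illustration; no such method exists in the spec:

```python
# Hypothetical "submit decisions" payload: one entry per original query
# key, each carrying the chosen entity (or None if no candidate was
# picked) and a repeat of the original query for context.
decisions_payload = {
    "q0": {"match": "Q30", "query": {"query": "United States"}},
    "q1": {"match": None, "query": {"query": "Atlantis"}},  # nothing picked
}

# The service could then extract only the confirmed matches:
matched = [k for k, v in decisions_payload.items() if v["match"] is not None]
print(matched)  # ['q0']
```

Whether to repeat the query or refer to it by an ID is exactly the open design question discussed in this thread; repeating it keeps the method stateless at the cost of payload size.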