[KYC Match] Scoring #85

Closed
ToshiWakayama-KDDI opened this issue May 22, 2024 · 26 comments · Fixed by #104
Labels
enhancement New feature or request

Comments

@ToshiWakayama-KDDI
Collaborator

ToshiWakayama-KDDI commented May 22, 2024

Problem description

To consider a scoring feature for KYC Match.
(Spin off from Issue #65, item No.1, as per Action Item #13.03)

@ToshiWakayama-KDDI ToshiWakayama-KDDI added the enhancement New feature or request label May 22, 2024
@KevScarr
Collaborator

KevScarr commented May 22, 2024

Hi @ToshiWakayama-KDDI
Linking out to a thread / good discussion around the concepts for 'score': [#46] .

I would summarise and propose the following, where 'attribute' stands for any field in the existing KYC specification:-

  • When a response is "attributeMatch: 'false'" we include an extra response field "attributeScore: 70".
  • Example:
    • when "familyNameAtBirthMatch: 'false'" is returned, a new response field of "familyNameAtBirthScore: 70" is included
  • Rules:
    • Numeric attributes are not scored, e.g. birthdate (distance scores wouldn't make sense)
    • The response "attributeMatch" must be 'false'
    • The Score value is a whole number (%): 0 to 100 (0 = no match, 100 = exact match)
    • For consistency: recommend using the Jaro-Winkler distance algorithm, as used by other operators that are live today (applied after normalisation).
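
As a rough illustration of the rules above, the following Python sketch computes a 0 to 100 score from the Jaro-Winkler similarity. The function names and the `normalise` step (whitespace collapsing plus case folding) are assumptions made for this example; the thread does not define the normalisation, and the prefix scale of 0.1 with a four-character prefix limit is the commonly used default rather than something agreed here.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in the range 0.0 to 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def normalise(value: str) -> str:
    """Hypothetical normalisation: collapse whitespace and fold case."""
    return " ".join(value.split()).casefold()


def attribute_score(provided: str, stored: str) -> int:
    """Whole-number percentage: 0 = no match, 100 = exact match."""
    return round(100 * jaro_winkler(normalise(provided), normalise(stored)))


print(attribute_score("Martha", "Marhta"))
print(attribute_score("Smith", "  smith "))
```

Under this sketch, a value that differs only by case or spacing still scores 100, which is why the order "normalise first, then score" matters for the "true equals exact match" rule.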

@HuubAppelboom
Collaborator

Hi @KevScarr
Why not also provide the Score value when "attributeMatch" is "true" but there is a small difference (probably a spelling mistake on either side)? Or do you propose to return "true" only when the Score is 100%?

@KevScarr
Collaborator

@HuubAppelboom I would suggest that 'true' equates to an exact match, i.e. a score of 100. For close matches, i.e. when a score is returned, let the consuming service judge whether the match is close enough to proceed (their use cases will drive their error tolerance).

@GillesInnov35
Collaborator

hi @HuubAppelboom , @KevScarr, I understand that an optional score result might be added alongside the boolean attribute (True/False/Not-available), which is mandatory if the attribute is provided in the request.
In this case, I wonder whether the boolean attribute is still useful.
At Orange the response contains only a score match result; the consumer has to decide.
Gilles

@KevScarr
Collaborator

@GillesInnov35 @HuubAppelboom Fair point; I was purely thinking about a customer of the service migrating from the previous version to this one, where backward compatibility would be important. I'd say the score is only provided when a boolean 'false' is returned; outside of that condition it offers little value.
For Orange: do you still respond with a not-available indicator? And can you share which algorithm you're using (JW?)

@GillesInnov35
Collaborator

Yes, sure Kevin, backward compatibility will be an important point, but as KYC Match version 1.0.0 has not been published, I wonder whether it is really a problem. Maybe it is.
To answer your question:

  • The matching algorithm implemented by the french MNOs is based on the Jaro–Winkler distance
  • The score is a value between 0 and 100, the higher the score, the more similar the strings, the value 100 means an exact match and the value 0 means there is no similarity.
  • The score « -1 » is a special value, it indicates that the requested value was not found by the MNO.
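
A score-only response like the one described above leaves the decision to the consumer. The sketch below shows one way a consumer might interpret it; the function name, the returned labels and the threshold of 85 are illustrative assumptions, not values used by any operator in this thread.

```python
NOT_FOUND = -1  # special value: the MNO did not find the requested attribute


def interpret_score(score: int, threshold: int = 85) -> str:
    """Map a score-only result to a consumer-side decision.

    The threshold is a hypothetical consumer choice; each use case
    would pick its own error tolerance.
    """
    if score == NOT_FOUND:
        return "not_available"
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    return "match" if score >= threshold else "no_match"


for value in (100, 70, -1):
    print(value, interpret_score(value))
```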

Thanks a lot for your active contribution
Regards
Gilles

@KevScarr
Collaborator

Makes sense. So you would return a '-1' when the attribute wasn't available for checking, hence no requirement to have the boolean field in your current response.

If no MNO has implemented the current version then it's a fair shout to move towards a score only approach.

@HuubAppelboom
Collaborator

@KevScarr @GillesInnov35 We may need to think of an approach that can be extended further. For example, I think it may be a good idea to indicate whether the data is unverified or has been verified by the MNO. That way we can reach a larger market by also including unverified attributes, and the CSP can then decide whether to use that attribute or not.

@ToshiWakayama-KDDI
Collaborator Author

Hi @GillesInnov35 , @HuubAppelboom , @KevScarr , all,

Thank you for your prompt comments/discussion, which I did not expect actually.

I should have informed you that there is a KYC Match scoring enhancement proposal in the API Backlog WG. Once we have received the proposal, we should continue our scoring discussion taking it into account. We should wait for it, but I don't think it will take long.

I will update the status.

Best regards,
Toshi

@ToshiWakayama-KDDI
Collaborator Author

Hi @GillesInnov35 , @HuubAppelboom , @KevScarr, all,

Our implementation is based on v0.1.0, and actually we do not need the scoring feature, so we would insist that the KYC Match API should work without scoring. As I understand it, that is the original OGW scope, and it is also important for an OGW global API. In addition, as we all know, we have already put our efforts into v0.1.0, so I believe we should keep our initial design and preserve backward compatibility as much as possible.

Thanks,
Toshi

@HuubAppelboom
Collaborator

As a suggestion for how to add the score and other information to the API response while maintaining backward compatibility and leaving room for expansion, we could add an extra string (when applicable) to the response for attributes where a score is relevant.

For example, attributeMatch will keep the values "true", "false" and "not_available" (as today),
and we add an extra field "attributeMatchInfo" that contains items like "score=89 unverified" to signal that the Jaro-Winkler score is 89 but that the source data has not been verified by the MNO. Additional metadata can be added in future.

So for example you will get:

givenNameMatch : false
givenNameMatchInfo : score=95 verified
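
To make the string format above concrete, here is a small Python sketch of how a consumer might parse such an "attributeMatchInfo" value. The parser and its output keys are assumptions for illustration; the comment above only shows the two example strings, so unknown tokens are simply ignored to leave room for future metadata.

```python
def parse_match_info(info: str) -> dict:
    """Parse a string such as 'score=95 verified' into its parts.

    Unknown tokens are ignored so that metadata added later does not
    break older consumers.
    """
    result = {"score": None, "verified": None}
    for token in info.split():
        if token.startswith("score="):
            result["score"] = int(token[len("score="):])
        elif token == "verified":
            result["verified"] = True
        elif token == "unverified":
            result["verified"] = False
    return result


print(parse_match_info("score=95 verified"))
print(parse_match_info("score=89 unverified"))
```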

@GillesInnov35
Collaborator

hi @ToshiWakayama-KDDI, all, thanks for your comment.
I had a look at the API Backlog issue/PR opened by @jgarciahospital on the API Enhancement Proposal for KYC-Match Scoring. It is in line with our current discussion on how to add match score information, so it is interesting.
I'm afraid it will be difficult to offer backward compatibility if we have to replace a simple attribute with an object structure after version 0.1.0.
This is just my point of view to be discussed.
For example:
[image not preserved]

BR
Gilles

@claraserranosolsona

Hi all,

As mentioned in last week's meeting:

  1. Telefonica has implemented v0.1.0, therefore we would need backwards compatibility in v0.2.0

  2. This would be in line with the proposal of maintaining current true/false/not_available response and in the case of false, adding a score. For example:

• Keep current attributes-> "attributeMatch": true/false/not_available
• If false, add additional parameters -> "attributeScore": X%

From the technical perspective, this should keep backwards compatibility: in OAS3 the keyword “additionalProperties” indicates whether an object (our response in this case) may carry additional properties that are not documented. Its default value is true, so in CAMARA we assume it is true, and the customer should be ready to receive additional properties. It would be worth checking this.

  3. However, the proposal of changing a simple attribute to an object structure would not be an option for backwards compatibility, therefore not possible for us

  4. Ok to proceed with the following rules proposed for the score:

• Numeric attributes are not checked: ie birthdate
• The response "attributeMatch" must be 'false'
• The Score value is a whole number (%): 0 to 100 (0 = no match, 100 = exact match)
• Using Jaro-Winkler distance algorithm (after normalisation has been applied).
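
The backwards-compatibility argument in point 2 can be sketched in Python as follows. The response body and the list of known fields are hypothetical; "familyNameAtBirthScore" follows the "attributeScore" naming used earlier in this thread, which had not been finalised at this point.

```python
import json

# Hypothetical v0.2.0 response body: the Match fields exist in v0.1.0,
# familyNameAtBirthScore is the proposed additional property.
RESPONSE_BODY = """
{
  "givenNameMatch": "true",
  "familyNameAtBirthMatch": "false",
  "familyNameAtBirthScore": 70
}
"""

# Fields a v0.1.0 consumer knows about (illustrative subset).
KNOWN_V010_FIELDS = {"givenNameMatch", "familyNameAtBirthMatch"}


def read_as_v010(body: str) -> dict:
    """Keep the fields the old consumer understands and ignore the rest."""
    payload = json.loads(body)
    return {name: value for name, value in payload.items()
            if name in KNOWN_V010_FIELDS}


def read_as_v020(body: str) -> dict:
    """A newer consumer additionally picks up '<attribute>Score' fields."""
    payload = json.loads(body)
    return {name: value for name, value in payload.items()
            if name in KNOWN_V010_FIELDS or name.endswith("Score")}


print(read_as_v010(RESPONSE_BODY))
print(read_as_v020(RESPONSE_BODY))
```

This only holds if existing consumers really do ignore unknown properties; a consumer that validates responses strictly against the v0.1.0 schema would still need checking, as noted above.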

Regards,
Clara

@GillesInnov35
Collaborator

hi all, thanks Clara for this detailed summary.
If we must address backward compatibility because v0.1.0 is already deployed, I agree with you that we should add new optional score attributes.
Do you think we have time to work out a design based on the OAS3 specification that avoids a long list of attributes?
BR
Gilles

@KevScarr
Collaborator

KevScarr commented Jun 12, 2024

Building on Issue #96 / we should follow the same design convention (define once, use many):-

ScoreMatchResult:
    type: integer
    description: Attribute comparison score as a percentage for string comparisons
    example: 85
    minimum: 0
    maximum: 100	
    
KYC_MatchResponse:
    type: object
    properties:
 
    idDocumentMatch:
        $ref: '#/components/schemas/MatchResult'
 
    nameMatch:
        $ref: '#/components/schemas/MatchResult'
        $ref: '#/components/schemas/ScoreMatchResult'
 
    givenNameMatch:
        $ref: '#/components/schemas/MatchResult'
        $ref: '#/components/schemas/ScoreMatchResult'

ScoreMatchResult to appear for all attribute fields, excluding the following fields as they are numeric/enum/ID based:-

  • idDocumentMatch
  • streetNumberMatch
  • birthdateMatch
  • genderMatch

When a field is numeric only in a particular country, as per the above summary, the score wouldn't be returned.

@KevScarr
Collaborator

I've taken the attributes from the current version of the specification and, following the rules above, given an initial view of which attributes can fully support a 'score' concept. It would be good to reach a common view across as many countries as possible; it will then make updating the yaml spec straightforward.

| Attribute                 | Optional Score Available | Comment                                            |
|---------------------------|--------------------------|----------------------------------------------------|
| idDocumentMatch           | No                       | It’s an ID number.                                 |
| nameMatch                 | YES                      |                                                    |
| givenNameMatch            | YES                      |                                                    |
| familyNameMatch           | YES                      |                                                    |
| nameKanaHankakuMatch      | ???                      | Are these fields in next release?                  |
| nameKanaZenkakuMatch      | ???                      | Are these fields in next release?                  |
| middleNamesMatch          | YES                      |                                                    |
| familyNameAtBirthMatch    | YES                      |                                                    |
| addressMatch              | YES                      |                                                    |
| streetNameMatch           | YES                      |                                                    |
| streetNumberMatch         | YES                      | Is this houseName in some countries / assumption yes |
| postalCodeMatch           | No                       | Being out by one letter can be a different place.  |
| regionMatch               | YES                      |                                                    |
| localityMatch             | YES                      |                                                    |
| countryMatch              | YES                      |                                                    |
| houseNumberExtensionMatch | No                       | It’s numeric, not relevant.                        |
| birthdateMatch            | No                       | It’s numeric, not relevant.                        |
| emailMatch                | YES                      |                                                    |
| genderMatch               | No                       | It’s an enum type.                                 |

Some fields will be all numeric in some countries and a mixture in others.
The table above captures which match attributes in the “KYC_MatchResponse” can support a ScoreMatch.
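
For implementers, the table above could be captured as a simple lookup. This is only a sketch of the initial view given here: the constant and function names are made up for the example, and the two nameKana attributes are kept separate because their status was still an open question at this point.

```python
# Attributes marked YES in the table above.
SCORE_SUPPORTED = frozenset({
    "nameMatch", "givenNameMatch", "familyNameMatch", "middleNamesMatch",
    "familyNameAtBirthMatch", "addressMatch", "streetNameMatch",
    "streetNumberMatch", "regionMatch", "localityMatch", "countryMatch",
    "emailMatch",
})

# Attributes marked ??? in the table above (undecided at this point).
SCORE_UNDECIDED = frozenset({"nameKanaHankakuMatch", "nameKanaZenkakuMatch"})


def supports_score(attribute: str) -> bool:
    """True only for attributes the table explicitly marks as YES."""
    return attribute in SCORE_SUPPORTED


print(supports_score("givenNameMatch"))
print(supports_score("postalCodeMatch"))
```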

@ToshiWakayama-KDDI Should the nameKana*Match attributes also have scores in this next version of the specification (ie will these attributes remain here or be in an extension)?

@fernandopradocabrillo
Collaborator

> Building on Issue #96 / we should follow the same design convention (define once, use many):-
>
> ScoreMatchResult:
>     type: integer
>     description: Attribute comparison score as a percentage for string comparisons
>     example: 85
>     minimum: 0
>     maximum: 100
>
> KYC_MatchResponse:
>     type: object
>     properties:
>
>     idDocumentMatch:
>         $ref: '#/components/schemas/MatchResult'
>
>     nameMatch:
>         $ref: '#/components/schemas/MatchResult'
>         $ref: '#/components/schemas/ScoreMatchResult'
>
>     givenNameMatch:
>         $ref: '#/components/schemas/MatchResult'
>         $ref: '#/components/schemas/ScoreMatchResult'

Hi @KevScarr
I agree with the proposal of creating a common schema for the response objects, but I don't fully understand what the final result is here. As far as I know, in OAS3 we cannot use two $ref objects at the same level.

From TEF, our proposal is mainly focused on not losing backward compatibility, as we are already integrated with clients, so the design could be simpler:

     idDocumentMatch:
         $ref: '#/components/schemas/MatchResult'
     idDocumentScoreMatch:
         $ref: '#/components/schemas/ScoreMatchResult'

We can document that the ScoreMatch properties will only be returned if the related property is false
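
A provider-side sketch of that rule in Python might look like the following. The helper name is hypothetical, the "<attribute>ScoreMatch" naming simply follows the idDocumentScoreMatch pattern shown above, and the three match values are the ones mentioned earlier in this thread.

```python
from typing import Optional

MATCH_VALUES = {"true", "false", "not_available"}


def add_match_result(response: dict, attribute: str, match: str,
                     score: Optional[int] = None) -> dict:
    """Set '<attribute>Match' and, only when it is 'false', '<attribute>ScoreMatch'."""
    if match not in MATCH_VALUES:
        raise ValueError(f"unexpected match value: {match!r}")
    response[f"{attribute}Match"] = match
    if match == "false" and score is not None:
        if not 0 <= score <= 100:
            raise ValueError(f"score out of range: {score}")
        response[f"{attribute}ScoreMatch"] = score
    return response


response = {}
add_match_result(response, "givenName", "false", 87)
add_match_result(response, "familyName", "true", 100)  # score is dropped
print(response)
```

Because the score is a sibling property rather than a change to the existing one, a v0.1.0 consumer that reads only the Match properties sees the same response shape as before.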

@GillesInnov35
Collaborator

hi @fernandopradocabrillo, I think that it works well with the allOf keyword.

allOf:
        - $ref: '#/components/schemas/MatchResult'
        - $ref: '#/components/schemas/ScoreMatchResult'

to be confirmed I suppose
BR
Gilles

@GillesInnov35
Collaborator

hi @fernandopradocabrillo, you're right. My proposal below can't be applied.

allOf:
        - $ref: '#/components/schemas/MatchResult'
        - $ref: '#/components/schemas/ScoreMatchResult'

I agree with yours, given that backward compatibility is expected.
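
One way to see why the allOf combination cannot work: allOf requires a single value to satisfy every listed schema at once. The Python sketch below mimics that check by hand, assuming MatchResult is a string enum with the values "true", "false" and "not_available" (the values mentioned earlier in this thread; the actual schema definition is not shown here) and ScoreMatchResult is the integer 0 to 100 schema proposed above.

```python
def is_match_result(value) -> bool:
    """Assumed MatchResult schema: a string enum."""
    return isinstance(value, str) and value in {"true", "false", "not_available"}


def is_score_match_result(value) -> bool:
    """ScoreMatchResult as proposed: an integer from 0 to 100."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    return (isinstance(value, int) and not isinstance(value, bool)
            and 0 <= value <= 100)


def satisfies_all_of(value) -> bool:
    """allOf semantics: the value must be valid against every subschema."""
    return is_match_result(value) and is_score_match_result(value)


CANDIDATES = ["true", "false", "not_available", 0, 85, 100]
print([value for value in CANDIDATES if satisfies_all_of(value)])
```

No candidate passes, since a value cannot be both a string and an integer, which is consistent with carrying the score in a separate sibling property.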
Gilles

@ToshiWakayama-KDDI
Collaborator Author

Hi @KevScarr , all,

> @ToshiWakayama-KDDI Should the nameKana*Match attributes also have scores in this next version of the specification (ie will these attributes remain here or be in an extension)?

Thank you for asking me about this. We would prefer to have scores for the nameKanaHankakuMatch and the nameKanaZenkakuMatch attributes in this next version.

Sorry for the late reply, as I needed to discuss this internally.

BR
Toshi

@ToshiWakayama-KDDI
Collaborator Author

Hi @KevScarr , @fernandopradocabrillo , @GillesInnov35 , @claraserranosolsona , all

I have a question, for my own clarification, about the way scoring is done.

It seems that the Jaro-Winkler distance algorithm will be used for scoring string-type attributes (after normalisation has been applied); however, I think it should be up to each operator to choose how to calculate the score.

The reason is that, even though the Jaro-Winkler distance algorithm could be used as the common way in Europe, it is unclear whether it can be used for other languages, or, if it can, whether it is best suited for them. That is my concern, and we ourselves are not sure about using the Jaro-Winkler distance algorithm for the Japanese language.

So, is it OK for each operator to choose how to calculate the score, or are there any other thoughts?

Thanks,
Toshi
KDDI

@GillesInnov35
Collaborator

hi @ToshiWakayama-KDDI , all, I don't really know whether this algorithm works for all languages, but it should (to be confirmed).
I think we should validate a unique algo to have the same specifications and the same rules for all KYC Match API providers and avoid specific implementation.

BR
Gilles

@ToshiWakayama-KDDI
Collaborator Author

Hi @gilles, Thanks for your comments.

"I think we should validate an unique algo to have the same specifications and the same rules for all KYC Match API providers and avoid specific implementation."

I agree with this sentence; however, as the Jaro-Winkler algorithm has not been proven effective for languages other than European ones, it would be better not to specify it as a mandatory algorithm. If specific algorithms are needed in the KYC Match API spec, Jaro-Winkler could, for example, be the recommendation for European languages, while the algorithm for other languages would be TBD.

Would this be a possible way forward?

BR
Toshi

@claraserranosolsona

Hi @ToshiWakayama-KDDI ,

As discussed in last week's meeting, in order to have a score that is as standard as possible, would it be OK to proceed with the Jaro-Winkler algorithm, indicating the following?

"Unless otherwise captured in the specification, score will use the JaroWinkler distance algorithm for all countries."

So far JaroWinkler has proven to be the most effective algorithm for comparing two strings, but if at some point another algorithm works better for a specific language, this would give the option to change it.

Many thanks,
Clara

@ToshiWakayama-KDDI
Collaborator Author

> Hi @ToshiWakayama-KDDI ,
>
> As discussed in last week's meeting, in order to have a score that is as standard as possible, would it be OK to proceed with the Jaro-Winkler algorithm, indicating the following?
>
> "Unless otherwise captured in the specification, score will use the JaroWinkler distance algorithm for all countries."
>
> So far JaroWinkler has proven to be the most effective algorithm for comparing two strings, but if at some point another algorithm works better for a specific language, this would give the option to change it.
>
> Many thanks, Clara

Hi @claraserranosolsona ,

Thanks for reminding me. Sorry for the delay, due to my sickness (Covid-19 still exists) and so on. I think I can reply by tomorrow.

Thank you for your understanding.

Regards,
Toshi

@ToshiWakayama-KDDI
Collaborator Author

Hi @claraserranosolsona ,

It seems the Jaro-Winkler algorithm itself can be used for the Japanese language; however, KDDI does not provide a Match Scoring function at all at present, so I am afraid we are not sure whether values calculated by the Jaro-Winkler algorithm are meaningful for the KYC Match service.

If you want to use the Jaro-Winkler algorithm as the common algorithm for Match Scoring, that is fine with us, provided the proposed sentence "Unless otherwise captured in the specification, score will use the JaroWinkler distance algorithm for all countries" is added to the API description. When KDDI implements the Match Scoring function, we could add something to the description if we have a problem with the Jaro-Winkler algorithm.

Just to reiterate our thoughts: we understand that in Europe the Jaro-Winkler algorithm has been used and has proven effective, so there should be no problem there, but we think that this algorithm should not be a barrier for operators in other language areas to implement this API, and that this API should be suitable for common global use.

Many thanks,
Toshi
KDDI
