Splitting linker matching process onto another Kafka stream #201

Open
wants to merge 6 commits into
base: dev

@walisc walisc commented Mar 14, 2024

Ticket: N/A
Other Related Tickets: N/A


Description of changes

This change optimizes the JeMPI linking process by splitting the matching logic out into a separate stream, which only starts once linking has completed.

In detail
When linking, the linker first determines whether it can link by checking that the required (linking) fields are present and set in the interaction. These required fields are configured in the reference-config under rules.link. If they are set, it proceeds to link using the configured deterministic / probabilistic rules.
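
The required-field check described above could look roughly like the following. This is a minimal sketch, not the linker's actual code: `LinkFieldGate`, `canLink`, and the representation of an interaction as a `Map<String, String>` are assumptions made for illustration, and a field is treated as "set" when it is neither missing nor blank.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of JeMPI): decides whether an interaction
// carries every field that the link rules require.
final class LinkFieldGate {
    private LinkFieldGate() {}

    /** Returns true only if every required field is present and non-blank. */
    static boolean canLink(Map<String, String> interaction, List<String> requiredFields) {
        for (String field : requiredFields) {
            String value = interaction.get(field);
            if (value == null || value.isBlank()) {
                return false; // a missing or empty field means linking cannot proceed
            }
        }
        return true;
    }
}
```

An interaction for which `canLink` returns false is the kind that previously fell through to matching during the linking pass.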

Currently (before this change), if the required fields were not set, it would try to match using the remaining fields (provided you have configured matching rules).

This slowed down linking, as matching usually relies on fuzzy searching (but not always, see the note below).

This change pushes the interactions that need matching onto a new stream. Once linking is done for an upload, the linker then processes the matching stream.
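
To illustrate the ordering this gives, here is a small in-memory sketch of the two-phase flow. It is only a stand-in: a plain queue plays the role of the new Kafka stream, and `TwoPhaseLinker`, `processUpload`, and the `events` list are hypothetical names introduced for this example, not classes from the PR.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// In-memory stand-in for the two-stream flow; all names are hypothetical.
final class TwoPhaseLinker {
    private final List<String> requiredLinkFields;
    // Plays the role of the new matching stream (a Kafka topic in the real system).
    private final Queue<Map<String, String>> matchingQueue = new ArrayDeque<>();
    // Records the order of operations so the two phases can be inspected.
    final List<String> events = new ArrayList<>();

    TwoPhaseLinker(List<String> requiredLinkFields) {
        this.requiredLinkFields = requiredLinkFields;
    }

    void processUpload(String tag, List<Map<String, String>> interactions) {
        // Phase 1: link what can be linked, defer everything else.
        events.add("start linking " + tag);
        for (Map<String, String> interaction : interactions) {
            if (hasRequiredFields(interaction)) {
                events.add("link");
            } else {
                matchingQueue.add(interaction);
            }
        }
        events.add("end linking " + tag);

        // Phase 2: only now drain the deferred interactions and match them.
        events.add("start matching " + tag);
        while (!matchingQueue.isEmpty()) {
            matchingQueue.poll();
            events.add("match");
        }
        events.add("end matching " + tag);
    }

    private boolean hasRequiredFields(Map<String, String> interaction) {
        for (String field : requiredLinkFields) {
            String value = interaction.get(field);
            if (value == null || value.isBlank()) {
                return false;
            }
        }
        return true;
    }
}
```

The point of the design is that the potentially slow fuzzy matching never interleaves with linking: every "match" event lands after the "end linking" event for the same upload.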

Note:

Although in the ideal case this would speed up JeMPI, the configuration and rule set are the greater determinant. For instance, if your linking rules are probabilistic and use fuzzy searching, there will be no performance improvement. This change assumes you have configured your reference-config so that your link rules are mainly deterministic (or non-complex probabilistic) and your match rules are more probabilistic/fuzzy.

How to test

  • Run the configurator using this sample config
    config-reference.json

    • This config will set the following
      • PROBABILISTIC_DO_LINKING in JeMPI_Apps/JeMPI_Linker/src/main/java/org/jembi/jempi/linker/backend/CustomLinkerProbabilistic.java to false
      • DETERMINISTIC_DO_MATCHING in JeMPI_Apps/JeMPI_Linker/src/main/java/org/jembi/jempi/linker/backend/CustomLinkerProbabilistic.java to true
  • Upload/add this sample CSV (which has some missing fields)
    sample2Streams.csv

  • In your logs for the linker you should see something like

[INFO ] 2024-03-14 04:19:03.445 SPInteractions:158 - KafkaStreams started
...
[INFO ] 2024-03-14 04:19:18.308 SPInteractions:68 - SPInteractions Stream Processor -> Starting linking for tag 'sample2Streams2'
[DEBUG] 2024-03-14 04:19:18.666 LinkerDWH:209 - 2024/03/14 04:19:16:0000002 : 0
[DEBUG] 2024-03-14 04:19:18.908 LinkerDWH:209 - 2024/03/14 04:19:16:0000003 : 0
.....
[INFO ] 2024-03-14 04:19:20.112 SPInteractions:71 - SPInteractions Stream Processor -> Ended linking for tag 'sample2Streams2'
[INFO ] 2024-03-14 04:19:20.116 SPInteractions:142 - SPInteractions Stream Processor -> Starting matching for tag 'sample2Streams2'
[DEBUG] 2024-03-14 04:19:23.189 LinkerDWH:131 - Match Candidates 0 
[INFO ] 2024-03-14 04:19:23.190 LinkerDWH:138 - MATCH NOTIFICATION NO CANDIDATE
{"givenName":"","familyName":"","gender":"female","dob":"19791014","city":"maceo","phoneNumber":"","nationalId":""}
[DEBUG] 2024-03-14 04:19:23.201 LinkerDWH:131 - Match Candidates 0 
[INFO ] 2024-03-14 04:19:23.202 LinkerDWH:138 - MATCH NOTIFICATION NO CANDIDATE
{"givenName":"","familyName":"","gender":"female","dob":"19561006","city":"castlerea","phoneNumber":"","nationalId":""}
[DEBUG] 2024-03-14 04:19:23.214 LinkerDWH:131 - Match Candidates 0 
[INFO ] 2024-03-14 04:19:23.214 LinkerDWH:138 - MATCH NOTIFICATION NO CANDIDATE
{"givenName":"","familyName":"","gender":"female","dob":"19980504","city":"bolikhan","phoneNumber":"","nationalId":""}
[INFO ] 2024-03-14 04:19:23.215 SPInteractions:125 - SPInteractions Stream Processor -> Ended matching for tag 'sample2Streams2'
[INFO ] 2024-03-14 04:19:23.215 SPInteractions:167 - SPInteractions Stream Processor -> Closing matching processor
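
If you want to check the ordering automatically rather than by eye, something like the following could be run over the captured log lines. It is a rough sketch that relies only on the message wording visible in the sample above; `LinkerLogCheck` and `phasesInOrder` are hypothetical names, and it assumes a single upload per tag.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical log checker: verifies that, for one tag, matching starts
// only after linking has ended.
final class LinkerLogCheck {
    private static final Pattern PHASE =
        Pattern.compile("(Starting|Ended) (linking|matching) for tag '([^']*)'");

    private static final String[] EXPECTED = {
        "Starting linking", "Ended linking", "Starting matching", "Ended matching"
    };

    private LinkerLogCheck() {}

    static boolean phasesInOrder(List<String> logLines, String tag) {
        int next = 0;
        for (String line : logLines) {
            Matcher m = PHASE.matcher(line);
            if (!m.find() || !m.group(3).equals(tag)) {
                continue; // not a phase marker for this tag
            }
            String phase = m.group(1) + " " + m.group(2);
            if (next >= EXPECTED.length || !phase.equals(EXPECTED[next])) {
                return false; // phase marker out of order or repeated
            }
            next++;
        }
        return next == EXPECTED.length; // all four markers seen, in order
    }
}
```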
