Splitting linker matching process onto another Kafka stream #201
Ticket: N/A
Other Related Tickets: N/A
Description of changes
This change optimizes the JeMPI linking process by splitting the matching logic out into a separate stream, which only starts once linking has completed.
In detail
When linking, the linker first determines whether it can link by checking that the required (linking) fields are present and set in the interaction. These required fields are configured in the reference-config under rules.link. If they are set, it proceeds to link based on the configured deterministic/probabilistic rules.
Currently (before this change), if the required fields were not set, the linker would try to match using the remaining fields (provided that matching rules are configured).
During linking, this would slow down the process, as matching usually relies on fuzzy searching (but not always, see the note below).
This change pushes the interactions that need matching onto a new stream. Once linking is done for an upload, the matching stream is then processed.
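The routing described above can be sketched roughly as follows. This is a minimal, in-memory illustration and not the code in this PR: the class name, the `REQUIRED_LINK_FIELDS` list, the field names, and the queue standing in for the Kafka matching stream are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Queue;

public class TwoStreamSketch {
    // Hypothetical stand-in for the fields configured under rules.link.
    static final List<String> REQUIRED_LINK_FIELDS = List.of("national_id");

    // In-memory stand-in for the separate matching stream (assumption: a real
    // implementation would publish to a Kafka topic instead).
    static final Queue<Map<String, String>> matchingStream = new ArrayDeque<>();

    // Records the order in which interactions were handled, for illustration.
    static final List<String> processingOrder = new ArrayList<>();

    // An interaction can be linked only if every required field is present and set.
    static boolean canLink(Map<String, String> interaction) {
        for (String field : REQUIRED_LINK_FIELDS) {
            String value = interaction.get(field);
            if (value == null || value.isBlank()) {
                return false;
            }
        }
        return true;
    }

    static void processUpload(List<Map<String, String>> interactions) {
        // Phase 1: link what can be linked, defer everything else.
        for (Map<String, String> interaction : interactions) {
            if (canLink(interaction)) {
                processingOrder.add("link:" + interaction.get("id"));
            } else {
                matchingStream.add(interaction);
            }
        }
        // Phase 2: only once linking for the upload is done, drain the matching stream.
        while (!matchingStream.isEmpty()) {
            processingOrder.add("match:" + matchingStream.poll().get("id"));
        }
    }

    public static void main(String[] args) {
        processUpload(List.of(
                Map.of("id", "1", "national_id", "A1"),
                Map.of("id", "2", "given_name", "Jane"), // required field missing
                Map.of("id", "3", "national_id", "B2")));
        System.out.println(processingOrder); // prints [link:1, link:3, match:2]
    }
}
```

Interaction 2 lacks the required field, so it is deferred and handled only after interactions 1 and 3 have been linked, which keeps the slower matching step off the linking path.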
Note:
Although in the ideal case this would speed up JeMPI, the configuration and rule set are the greater determinant. For instance, if your linking rules are probabilistic and use fuzzy searching, there will be no performance improvement. This change assumes you have configured your reference-config in such a way that your link rules are mainly deterministic (or non-complex probabilistic) and your match rules are more probabilistic/fuzzy.
How to test
Run the configurator using this sample config: config-reference.json
Set PROBABILISTIC_DO_LINKING in JeMPI_Apps/JeMPI_Linker/src/main/java/org/jembi/jempi/linker/backend/CustomLinkerProbabilistic.java to false
Set DETERMINISTIC_DO_MATCHING in JeMPI_Apps/JeMPI_Linker/src/main/java/org/jembi/jempi/linker/backend/CustomLinkerProbabilistic.java to true
Upload/add this sample CSV (which has some missing fields): sample2Streams.csv
In your logs for the linker you should see something like