Overall process

Deduplication is the task of finding records in a data set (MPI) that refer to the same Person. To guarantee data quality, a fraud detection task should also be performed. If different persons share the same document number, tax_id, etc., such cases are not treated as duplicates; they fall within the scope of the fraud detection task.

...

  • data cleaning and preparation 
  • blocking and finding possible pairs
  • calculating variables for each pair
  • probability linkage using the model
  • finding the master person in each pair and deactivating the person(s) that are not masters
  • fetching the declaration(s) that belong to the deactivated person(s) and terminating those declarations
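
The cleaning and blocking steps above can be sketched as follows. This is only an illustration under assumptions: the `normalize` and `candidate_pairs` helpers, the `phone` field, and the sample records are hypothetical, and the real blocking keys used for the MPI are not stated on this page.

```python
from collections import defaultdict
from itertools import combinations


def normalize(value):
    """Basic cleaning: trim, collapse inner whitespace, lowercase."""
    return " ".join(str(value or "").split()).lower()


def candidate_pairs(persons, block_keys):
    """Yield each pair of person ids that share at least one blocking key value.

    Only records inside the same block are compared, which avoids
    the full n*(n-1)/2 comparison over the whole data set.
    """
    seen = set()
    for key in block_keys:
        blocks = defaultdict(list)
        for person in persons:
            value = normalize(person.get(key))
            if value:  # skip empty values so they do not form one giant block
                blocks[value].append(person["id"])
        for ids in blocks.values():
            for a, b in combinations(sorted(ids), 2):
                if (a, b) not in seen:
                    seen.add((a, b))
                    yield (a, b)


# Hypothetical sample records (field names are assumptions)
persons = [
    {"id": "p1", "tax_id": "123", "phone": "555-01"},
    {"id": "p2", "tax_id": "123 ", "phone": "555-02"},
    {"id": "p3", "tax_id": "999", "phone": "555-02"},
]
pairs = list(candidate_pairs(persons, ["tax_id", "phone"]))
```

Each resulting pair would then go to the variable calculation step; a pair found through several blocking keys is emitted only once.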

...

The WOE/IV framework is based on the following relationship:

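  • $\log\frac{P(Y=1 \mid X_j)}{P(Y=0 \mid X_j)} = \underbrace{\log\frac{P(Y=1)}{P(Y=0)}}_{\text{sample log-odds}} + \underbrace{\log\frac{f(X_j \mid Y=1)}{f(X_j \mid Y=0)}}_{\text{WOE}}$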
  • WOE = ln (% of non-merge/ % of merge)
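
As an illustration of the WOE definition above, the following minimal sketch computes WOE per bin of one pair variable from labelled pairs. The `woe_table` function, the additive `smoothing` term (used here to avoid division by zero for empty bins), and the sample data are assumptions, not part of the described implementation.

```python
import math
from collections import Counter


def woe_table(values, labels, smoothing=0.5):
    """Return {bin: WOE}, where WOE = ln(% of non-merge / % of merge).

    values: bin of the variable for each labelled pair
    labels: 1 = merge, 0 = non-merge (assumed encoding for this sketch)
    """
    merge = Counter(v for v, y in zip(values, labels) if y == 1)
    non_merge = Counter(v for v, y in zip(values, labels) if y == 0)
    bins = sorted(set(values))
    # Smoothed totals keep the per-bin shares summing to 1
    total_merge = sum(merge.values()) + smoothing * len(bins)
    total_non_merge = sum(non_merge.values()) + smoothing * len(bins)
    table = {}
    for b in bins:
        pct_non_merge = (non_merge[b] + smoothing) / total_non_merge
        pct_merge = (merge[b] + smoothing) / total_merge
        table[b] = math.log(pct_non_merge / pct_merge)
    return table


# Hypothetical variable: does the tax_id of the two persons match?
values = ["match", "match", "match", "differ", "differ", "differ"]
labels = [1, 1, 0, 0, 0, 1]
woe = woe_table(values, labels)
```

With this definition, a bin dominated by merged pairs gets a negative WOE and a bin dominated by non-merged pairs gets a positive WOE.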

After decoding each variable, we can apply the model and calculate the probability.
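
One way to turn WOE values into a probability is to add them to the sample log-odds and apply the logistic function. The sketch below assumes that Y = 1 denotes the non-merge class (so that it agrees with WOE = ln(% of non-merge / % of merge)) and that the variables are treated as conditionally independent, so their WOE values simply add up. The actual model used here may differ, for example a logistic regression with fitted coefficients; `merge_probability` is a hypothetical name.

```python
import math


def merge_probability(sample_log_odds_non_merge, woe_values):
    """Estimate the merge probability for one pair.

    sample_log_odds_non_merge: ln(P(non-merge) / P(merge)) in the training sample
    woe_values: WOE of each variable for this pair (assumed additive)
    """
    log_odds_non_merge = sample_log_odds_non_merge + sum(woe_values)
    p_non_merge = 1.0 / (1.0 + math.exp(-log_odds_non_merge))
    return 1.0 - p_non_merge


# Hypothetical pair: 90% of sample pairs are non-merge, and two of the
# three variables point strongly towards a merge (negative WOE).
score = merge_probability(math.log(0.9 / 0.1), [-1.5, -2.0, 0.3])
```

The result is a value between 0 and 1 and can be stored as the pair score.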


Probability linkage using model

...

| column | value | description |
|---|---|---|
| id | UUID | record unique ID |
| person_id | UUID | the person which will be merged |
| master_person_id | UUID | the person who will stay active |
| status | NEW, MANUAL, MERGED | |
| inserted_at | DATETIME = now() | |
| updated_at | DATETIME = now() | |
| config | {"person_id" "candidate_id" variables} | |
| details | | |
| score | value from 0 to 1 | |

...

Once the number of decisions exceeds decision_amount and final_decision is MERGE, the OPS Kafka consumer should find the declaration in status 'VERIFIED', change its status to 'TERMINATED', and change persons.id.status to INACTIVE.

Auto merge overview

Once a decision has been made to merge a pair of persons, do the following:

  1. Validate persons
    1. check if person or master_person exists in DB
    2. check persons `updated_at`: if (merge_candidate.person.inserted_at or merge_candidate.master_person.inserted_at) < mpi.person.updated_at, set status STALE on merge_candidates
  2. Deactivate person
    1. search for a declaration by person_id; if one exists
      1. create event to kafka
      2. set status DECLARATION_READY_DEACTIVATE to merge_candidates
    2. change person_id status to inactive
    3. add info to merged_pairs with person_id and master_person_id
    4. set status MERGED to merge_candidates
    5. add person status change info to event_manager
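
The steps above can be sketched with plain in-memory structures. This is an assumption-heavy illustration and not the service code: `auto_merge`, the dict and list stand-ins for the DB, Kafka and event_manager, the event payloads, and the use of DECLINED for a missing person are all hypothetical. The staleness check is interpreted as "the candidate row was inserted before either person was last updated".

```python
from datetime import datetime


def auto_merge(candidate, persons, declarations, kafka_events, merged_pairs, event_log):
    """Process one merge candidate and return its final status."""
    person_id = candidate["person_id"]
    master_id = candidate["master_person_id"]
    person = persons.get(person_id)
    master = persons.get(master_id)

    # 1.1 Both persons must exist (DECLINED here is an assumption)
    if person is None or master is None:
        candidate["status"] = "DECLINED"
        return candidate["status"]

    # 1.2 A person changed after the candidate was created: pair is stale
    if candidate["inserted_at"] < max(person["updated_at"], master["updated_at"]):
        candidate["status"] = "STALE"
        return candidate["status"]

    # 2.1 Ask for declaration deactivation if the person has declarations
    if declarations.get(person_id):
        kafka_events.append({"type": "deactivate_declaration", "person_id": person_id})
        candidate["status"] = "DECLARATION_READY_DEACTIVATE"

    # 2.2 - 2.5 Deactivate the person and record the merge
    person["status"] = "INACTIVE"
    merged_pairs.append((person_id, master_id))
    candidate["status"] = "MERGED"
    event_log.append({"person_id": person_id, "new_status": "INACTIVE"})
    return candidate["status"]


# Hypothetical data
persons = {
    "p1": {"status": "ACTIVE", "updated_at": datetime(2020, 1, 1)},
    "p2": {"status": "ACTIVE", "updated_at": datetime(2020, 1, 2)},
}
declarations = {"p1": ["d1"]}
kafka_events, merged_pairs, event_log = [], [], []
candidate = {"person_id": "p1", "master_person_id": "p2",
             "inserted_at": datetime(2020, 2, 1), "status": "NEW"}
result = auto_merge(candidate, persons, declarations, kafka_events, merged_pairs, event_log)
```

In a real service each step would be a transactional DB update or a published message; the sketch only shows the order of the checks and status changes.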

Merge_candidates state diagram

| Status | Description |
|---|---|
| NEW | New pair of merge candidates |
| MERGED | Pair that was merged by either the manual merge process or the auto merge process |
| STALE | Pair is not merged because one of the persons is no longer current |
| IN_PROCESS | |
| DECLARATION_READY_DEACTIVATE | The event is sent to Kafka to deactivate the declaration of the person |
| DECLINED | Something went wrong; the pair is not merged |