Overall process
Deduplication is the task of finding records in a data set (MPI) that refer to the same Person. To guarantee data quality, a separate fraud-detection task is also needed. When different persons share the same document number, tax_id, etc., such cases are not treated as duplicates; they fall within the scope of the fraud-detection task.
...
- data cleaning and preparation
- blocking and finding possible pairs
- calculating variables for each pair
- probability linkage using the model
- finding the master person in each pair and deactivating the person(s) that are not masters
- fetching the declaration(s) that belong to the deactivated person(s) and terminating those declarations
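
The blocking step in the list above can be illustrated with a minimal sketch. It assumes records are plain dictionaries and that a blocking key is a single field; the function name `candidate_pairs` and the sample field names are hypothetical and are not taken from the actual pipeline.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, block_key):
    """Group records by a blocking key and return all pairs inside each block."""
    blocks = defaultdict(list)
    for rec in records:
        key = block_key(rec)
        if key is not None:  # records without a key value are not compared
            blocks[key].append(rec["id"])
    pairs = set()
    for ids in blocks.values():
        # Sorting makes each pair appear once, in a stable order.
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs


# Hypothetical sample data; field names are illustrative only.
persons = [
    {"id": "p1", "tax_id": "111", "last_name": "Ivanov"},
    {"id": "p2", "tax_id": "111", "last_name": "Ivanov"},
    {"id": "p3", "tax_id": "222", "last_name": "Petrov"},
    {"id": "p4", "tax_id": None, "last_name": "Petrov"},
]

pairs_by_tax_id = candidate_pairs(persons, lambda r: r["tax_id"])
pairs_by_last_name = candidate_pairs(persons, lambda r: r["last_name"])
```

Only records that share a key value are compared, which keeps the number of pairs far below the full cross product. A pair produced by blocking is only a candidate; whether it is a duplicate is decided by the later scoring steps.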
...
The WOE/IV framework is based on the following relationship:
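- $\log \dfrac{P(Y=1 \mid X_j)}{P(Y=0 \mid X_j)} = \underbrace{\log \dfrac{P(Y=1)}{P(Y=0)}}_{\text{sample log-odds}} + \underbrace{\log \dfrac{f(X_j \mid Y=1)}{f(X_j \mid Y=0)}}_{\text{WOE}}$
- $\text{WOE} = \ln\left(\dfrac{\%\ \text{of non-merge}}{\%\ \text{of merge}}\right)$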
After decoding each variable, we can apply the model and calculate the probability.
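
As a rough illustration of the WOE formula above, the sketch below computes WOE per category of one variable and then applies a logistic model to WOE values. It assumes labels are coded as 1 = merge and 0 = non-merge, and it adds a smoothing count `eps` to avoid division by zero; the function names, the smoothing, and any coefficients passed in are assumptions for illustration, not the production model.

```python
import math
from collections import Counter


def woe_table(values, labels, eps=0.5):
    """WOE per category: ln(share of non-merge / share of merge).

    labels: 1 = merge, 0 = non-merge (assumed coding).
    eps: smoothing count added to every cell (assumption, avoids log of zero).
    """
    merge = Counter(v for v, y in zip(values, labels) if y == 1)
    non_merge = Counter(v for v, y in zip(values, labels) if y == 0)
    cats = set(merge) | set(non_merge)
    n_merge = sum(merge.values()) + eps * len(cats)
    n_non = sum(non_merge.values()) + eps * len(cats)
    return {
        c: math.log(((non_merge[c] + eps) / n_non) / ((merge[c] + eps) / n_merge))
        for c in cats
    }


def linkage_probability(woe_values, coefs, intercept):
    """Logistic model on WOE-coded variables; coefficients are hypothetical."""
    z = intercept + sum(c * w for c, w in zip(coefs, woe_values))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical training sample for one pair variable.
values = ["match", "match", "match", "differ", "differ", "differ"]
labels = [1, 1, 0, 0, 0, 1]
woe = woe_table(values, labels)
```

With the formula as written in this document, a category that is more frequent among merged pairs gets a negative WOE. Which class the resulting probability refers to depends on how the model was trained, so the sign of the coefficients has to be checked against that convention.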
Probability linkage using model
...
column | value | description |
---|---|---|
id | UUID | record unique ID |
person_id | UUID | the person who will be merged |
master_person_id | UUID | the person who will stay active |
status | NEW, MANUAL, MERGED | |
inserted_at | DATETIME = now() | |
updated_at | DATETIME = now() | |
config | {"person_id" | |
details | | |
score | value from 0 to 1 | |
...
Status | Description |
---|---|
NEW | New pair of merge candidates |
MERGED | Pair that was merged by either the manual or the automatic merge process |
STALE | Pair is not merged because one of the person records has been updated after the de-duplication score was calculated. |
DECLARATION_READY_DEACTIVATE | The event is sent to Kafka to deactivate the declarations of the person |
DECLINED | Pair is not merged. The cause is not the person itself but entities related to the persons, for example declarations. |
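
The STALE rule from the table above could be checked along these lines. This is only a sketch: it assumes the time of score calculation and the last update time of each person are available as timestamps, and the names `MergeStatus` and `resolve_status` are hypothetical. The enum lists only the statuses described in the table above.

```python
from datetime import datetime
from enum import Enum


class MergeStatus(Enum):
    NEW = "NEW"
    MERGED = "MERGED"
    STALE = "STALE"
    DECLARATION_READY_DEACTIVATE = "DECLARATION_READY_DEACTIVATE"
    DECLINED = "DECLINED"


def resolve_status(scored_at, person_updated_at, master_updated_at):
    """Return STALE if either person changed after scoring, otherwise NEW."""
    if max(person_updated_at, master_updated_at) > scored_at:
        return MergeStatus.STALE
    return MergeStatus.NEW


# Hypothetical timestamps for two candidate pairs scored at the same moment.
scored = datetime(2024, 1, 10, 12, 0)
fresh = resolve_status(scored, datetime(2024, 1, 5), datetime(2024, 1, 9))
stale = resolve_status(scored, datetime(2024, 1, 11), datetime(2024, 1, 9))
```

Running such a check just before the merge keeps a pair from being merged on the basis of a score that no longer reflects the current person data.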