Overall process
Deduplication is the task of finding records in a data set (MPI) that refer to the same person. To guarantee data quality, a fraud detection task should also be carried out. Cases where different persons share the same document number, tax_id, etc. are not treated as duplicates; they fall within the scope of the fraud detection task.
...
- data cleaning and preparation
- blocking and finding possible pairs
- calculating variables for each pair
- probability linkage using the model
- finding the master person in each pair and deactivating the person(s) who are not masters
- fetching the declaration(s) that belong to the deactivated person(s) and terminating those declarations
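
The blocking step in the list above can be illustrated with a minimal sketch. It assumes that records are plain dictionaries and that `tax_id` is used as the blocking key; the function name `candidate_pairs`, the record layout, and the sample data are hypothetical and only show the idea of comparing records inside the same block instead of comparing every record with every other one.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, blocking_key):
    """Group records by a blocking key and return all within-block id pairs."""
    blocks = defaultdict(list)
    for rec in records:
        key = blocking_key(rec)
        if key is not None:  # records without a key value are left out of blocking
            blocks[key].append(rec["id"])

    pairs = set()
    for ids in blocks.values():
        # Every two records that share a block form one possible pair.
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs


# Hypothetical sample records; field names are illustrative only.
records = [
    {"id": 1, "tax_id": "123", "last_name": "Shevchenko"},
    {"id": 2, "tax_id": "123", "last_name": "Shevchenko"},
    {"id": 3, "tax_id": "456", "last_name": "Kovalenko"},
    {"id": 4, "tax_id": None, "last_name": "Kovalenko"},
]
pairs = candidate_pairs(records, lambda r: r["tax_id"])
```

In practice several blocking keys would probably be combined, so that a typo in one field does not hide a true duplicate.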
...
The WOE/IV framework is based on the following relationship:
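- $\log\frac{P(Y=1 \mid X_j)}{P(Y=0 \mid X_j)} = \underbrace{\log\frac{P(Y=1)}{P(Y=0)}}_{\text{sample log-odds}} + \underbrace{\log\frac{f(X_j \mid Y=1)}{f(X_j \mid Y=0)}}_{\text{WOE}}$
- $\mathrm{WOE} = \ln\left(\frac{\%\ \text{of non-merge}}{\%\ \text{of merge}}\right)$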
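
As an illustration of the second formula, the following sketch computes WOE per bin of a single variable as ln(% of non-merge / % of merge). It assumes that label `1` means "merge" and `0` means "non-merge", and it adds a small `smoothing` constant so that empty bins do not cause a division by zero; the function name, the smoothing, and the sample data are assumptions and are not taken from the original model.

```python
import math


def woe_by_bin(bins, targets, smoothing=0.5):
    """Return {bin: WOE}, where WOE = ln(share of non-merge / share of merge)."""
    merge, non_merge = {}, {}
    for b, y in zip(bins, targets):
        counts = merge if y == 1 else non_merge  # assumption: 1 = merge
        counts[b] = counts.get(b, 0) + 1

    all_bins = set(merge) | set(non_merge)
    # Smoothed totals, so that the shares over all bins still sum to 1.
    total_merge = sum(merge.values()) + smoothing * len(all_bins)
    total_non_merge = sum(non_merge.values()) + smoothing * len(all_bins)

    woe = {}
    for b in all_bins:
        share_non_merge = (non_merge.get(b, 0) + smoothing) / total_non_merge
        share_merge = (merge.get(b, 0) + smoothing) / total_merge
        woe[b] = math.log(share_non_merge / share_merge)
    return woe


# Hypothetical variable: whether the first names of a pair match.
bins = ["match", "match", "match", "mismatch", "mismatch", "mismatch"]
targets = [1, 1, 0, 0, 0, 1]
woe = woe_by_bin(bins, targets)
```

With this sign convention, a bin that is dominated by merge pairs gets a negative WOE and a bin dominated by non-merge pairs gets a positive one.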
After decoding each variable, we can apply the model and calculate the probability.
Probability linkage using model
...
After that, a threshold should be applied to define which probability is sufficient to call the pair a duplicate.
Field | Description |
---|---|
Score | Probability that pair of persons should be merged |
Sum of target | Quantity of pairs which should be merged |
Qty | Total quantity of pairs |
Hit_rate | Ratio of 'Quantity of pairs which should be merged' to 'Total quantity of pairs' |
Merge_acu_% | Accumulated ratio of all records that are marked as merge |
Qty_acu_% | Accumulated sample distribution by score |
Accuracy_rate | Percentage of errors on data sample |
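
The fields above could be derived from a labelled sample roughly as in the sketch below. It assumes equal-width score bands on [0, 1], accumulation from the highest band downwards, and label `1` for pairs that should be merged; the function name and these choices are assumptions, and `Accuracy_rate` is left out because its exact definition is not given here.

```python
def score_band_report(scores, targets, n_bands=10):
    """Summarise labelled pairs per score band, from the highest band down."""
    bands = {}
    for s, y in zip(scores, targets):
        idx = min(int(s * n_bands), n_bands - 1)  # a score of 1.0 goes to the top band
        qty, merged = bands.get(idx, (0, 0))
        bands[idx] = (qty + 1, merged + y)

    total_qty = len(scores)
    total_merge = sum(targets)
    rows, qty_acc, merge_acc = [], 0, 0
    for idx in sorted(bands, reverse=True):
        qty, merged = bands[idx]
        qty_acc += qty
        merge_acc += merged
        rows.append({
            "Score": idx / n_bands,  # lower bound of the band
            "Sum of target": merged,
            "Qty": qty,
            "Hit_rate": merged / qty,
            "Merge_acu_%": 100 * merge_acc / total_merge if total_merge else 0.0,
            "Qty_acu_%": 100 * qty_acc / total_qty,
        })
    return rows


# Hypothetical labelled sample: model scores and manual merge decisions.
scores = [0.95, 0.92, 0.75, 0.35, 0.15]
targets = [1, 1, 1, 0, 0]
rows = score_band_report(scores, targets)
```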
Based on the test sample, the suggested cut-offs are as follows:
- score >= 0.9 - merge (the minimum score at which a pair should be saved to `merge candidates` and auto-merged)
- score between 0.7 and 0.9 - manual merge (the minimum score at which a pair should be merged manually)
- score < 0.7 - do not merge
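
The cut-offs above can be expressed as a small decision function. This is only a sketch: the names `decide`, `AUTO_MERGE_THRESHOLD` and `MANUAL_MERGE_THRESHOLD` are hypothetical, and a score of exactly 0.7 is assumed to fall into the manual range, since only scores below 0.7 are listed as "do not merge".

```python
AUTO_MERGE_THRESHOLD = 0.9    # suggested cut-off for automatic merge
MANUAL_MERGE_THRESHOLD = 0.7  # suggested cut-off for manual merge


def decide(score):
    """Map a pair's score to one of the three suggested actions."""
    if score >= AUTO_MERGE_THRESHOLD:
        return "merge"
    if score >= MANUAL_MERGE_THRESHOLD:
        return "manual merge"
    return "do not merge"


# Hypothetical scores for three pairs.
decisions = [decide(s) for s in (0.95, 0.8, 0.4)]
```

Keeping the thresholds as named constants makes it easier to adjust them if another test sample suggests different cut-offs.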
...