Overall process
Deduplication is the task of finding records in a data set (MPI) that refer to the same person. To guarantee data quality, a separate fraud detection task is also required. Cases where different persons share the same document number, tax_id, etc. are not treated as duplicates; they fall within the scope of the fraud detection task.
...
- data cleaning and preparation
- blocking and finding possible pairs
- calculating variables for each pair
- probability linkage using model
- finding the master person in each pair and deactivating the person(s) who are not masters
- fetching the declaration(s) that belong to the deactivated person(s) and terminating those declarations
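
The blocking step above can be sketched as follows. This is a minimal illustration, not the production logic: the source does not say which blocking keys are used, so the `birth_date` key, the `id` field, and the `candidate_pairs` helper are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(persons, blocking_key):
    """Group persons by a blocking key and return all pairs within each group."""
    blocks = defaultdict(list)
    for person in persons:
        blocks[blocking_key(person)].append(person["id"])

    pairs = set()
    for ids in blocks.values():
        # Only records that share a block are compared with each other.
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs


# Hypothetical records; field names are assumptions.
persons = [
    {"id": 1, "birth_date": "1990-01-01"},
    {"id": 2, "birth_date": "1990-01-01"},
    {"id": 3, "birth_date": "1985-07-15"},
]
pairs = candidate_pairs(persons, blocking_key=lambda p: p["birth_date"])
```

Blocking keeps the number of compared pairs manageable: only records within the same block are paired, instead of every record with every other record.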
...
The following variables are included in the model:
Variable | Description |
---|---|
d_first_name | Levenshtein distance(first_name1, first_name2) |
d_last_name | Levenshtein distance(last_name1, last_name2) |
d_second_name | Levenshtein distance(second_name1, second_name2) |
d_documents | least(Levenshtein distance(document1, document2)) |
docs_same_number | least(same/not) document number |
birth_settlement_substr | least(position(birth_settlement_1 in birth_settlement_2) and position(birth_settlement_2 in birth_settlement_1)) |
d_tax_id | Levenshtein distance(tax_id1, tax_id2) |
authentication_methods | same/not authentication OTP number flag |
residence_settlement_flag | same/not residence settlement flag |
registration_settlement_flag | same/not registration settlement flag |
gender_flag | same/not gender |
twins_flag | distance between last names <= 2, same birth_date, distance between document numbers between 1 and 2 |
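
As an illustration, a few of the variables in the table could be computed for one candidate pair as sketched below. The record layout (dictionaries with `first_name`, `last_name`, `gender`, `birth_date`, and a non-empty `documents` list) and the `pair_variables` helper are assumptions for the example; the real pipeline may compute these differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def pair_variables(p1: dict, p2: dict) -> dict:
    """Compute a subset of the model variables for one candidate pair."""
    # d_documents: smallest distance over all document-number combinations.
    # Assumes each person has at least one document.
    d_documents = min(levenshtein(d1, d2)
                      for d1 in p1["documents"] for d2 in p2["documents"])
    d_last_name = levenshtein(p1["last_name"], p2["last_name"])
    return {
        "d_first_name": levenshtein(p1["first_name"], p2["first_name"]),
        "d_last_name": d_last_name,
        "d_documents": d_documents,
        "gender_flag": p1["gender"] == p2["gender"],
        "twins_flag": (d_last_name <= 2
                       and p1["birth_date"] == p2["birth_date"]
                       and 1 <= d_documents <= 2),
    }


# Hypothetical pair of records that looks like twins rather than duplicates.
person_1 = {"first_name": "Olena", "last_name": "Petrenko", "gender": "FEMALE",
            "birth_date": "1990-01-01", "documents": ["AA123456"]}
person_2 = {"first_name": "Olha", "last_name": "Petrenko", "gender": "FEMALE",
            "birth_date": "1990-01-01", "documents": ["AA123457"]}
variables = pair_variables(person_1, person_2)
```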
Each categorical variable must be converted to a continuous one using WOE (weight of evidence), which was calculated on the training data sample. WOE describes the relationship between a predictive variable and a binary target variable.
The WOE/IV framework is based on the following relationship:
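- $\log\frac{P(Y=1 \mid X_j)}{P(Y=0 \mid X_j)} = \underbrace{\log\frac{P(Y=1)}{P(Y=0)}}_{\text{sample log-odds}} + \underbrace{\log\frac{f(X_j \mid Y=1)}{f(X_j \mid Y=0)}}_{\text{WOE}}$
- $\mathrm{WOE} = \ln\left(\frac{\%\ \text{of non-merge}}{\%\ \text{of merge}}\right)$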
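
The WOE definition above can be sketched in code for a single categorical variable. This is a minimal example under stated assumptions: the `woe_table` helper and the toy training sample are hypothetical, and the sketch assumes every category occurs in both the merge and non-merge classes (otherwise a ratio would be zero or undefined and some smoothing would be needed).

```python
import math
from collections import Counter


def woe_table(values, is_merge):
    """WOE per category: ln(% of non-merge / % of merge)."""
    merge_counts = Counter(v for v, m in zip(values, is_merge) if m)
    non_merge_counts = Counter(v for v, m in zip(values, is_merge) if not m)
    total_merge = sum(merge_counts.values())
    total_non_merge = sum(non_merge_counts.values())

    table = {}
    for category in set(values):
        pct_non_merge = non_merge_counts[category] / total_non_merge
        pct_merge = merge_counts[category] / total_merge
        table[category] = math.log(pct_non_merge / pct_merge)
    return table


# Hypothetical training sample for gender_flag (True = same gender).
gender_flag = [True, True, True, False, True, False, False, False]
is_merge = [True, True, True, False, False, False, False, True]

table = woe_table(gender_flag, is_merge)
# Replace each category with its WOE value to get a continuous variable.
encoded = [table[v] for v in gender_flag]
```

In this toy sample, same-gender pairs make up 3/4 of the merges and 1/4 of the non-merges, so their WOE is ln(1/3); different-gender pairs get ln(3).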
After encoding each variable with WOE, we can apply the model and calculate the probability that the pair refers to the same person.
Probability linkage using model
...