Overall process

Deduplication is the task of finding records in a data set (MPI) that refer to the same person. To guarantee data quality, a separate fraud-detection task is also needed. When different persons share the same document number, tax_id, etc., such records are not treated as duplicates; they fall within the scope of the fraud-detection task.

...

  • data cleaning and preparation
  • blocking and finding possible pairs
  • calculating variables for each pair
  • probabilistic linkage using a model
  • finding the master person in each pair and deactivating the person(s) that are not masters
  • fetching the declaration(s) that belong to the deactivated person(s) and terminating those declarations

...

  • for first_name, second_name, last_name and birth_settlement, replace: [ --'] with '', 'є' with 'е', 'и' with 'і'
  • for birth_certificate, replace: [ /%#№ _-] with '', [iі!IІ] with 1
  • for other documents, replace: [ /%#№ _-] with ''
  • for birth_settlement - ([сc][ \.,])|([сc]ело[\.,]*)|([сc]мт[\.,]*)|([сc]елище [мm][іi][сc]ького типу)|([сc]елище[\.,]*)|([мm][іi][сc][tт][оo][\.,]*)|([мm][\.,]*)
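
The character replacements above can be sketched in Python as follows. This is only an illustration, not the production implementation: the function names are hypothetical, the class [ --'] is assumed to mean space, hyphen and apostrophe, and the birth_settlement prefix pattern is left out because the rule does not state what the match is replaced with.

```python
import re

# Assumption: "[ --']" in the rules means space, hyphen and apostrophe.
NAME_STRIP = re.compile(r"[ \-']")
# Separators removed from every document number: space / % # № _ -
DOC_STRIP = re.compile(r"[ /%#№_\-]")
# Characters in birth certificates that are commonly typed instead of the digit 1.
CERT_ONE = re.compile(r"[iі!IІ]")
# Letter substitutions applied to names and birth_settlement.
NAME_TRANSLATE = str.maketrans({"є": "е", "и": "і"})


def clean_name(value: str) -> str:
    """Clean first_name, second_name, last_name or birth_settlement."""
    return NAME_STRIP.sub("", value).translate(NAME_TRANSLATE)


def clean_document(value: str) -> str:
    """Clean any document number other than a birth certificate."""
    return DOC_STRIP.sub("", value)


def clean_birth_certificate(value: str) -> str:
    """Clean a birth certificate number: drop separators, then map look-alikes to 1."""
    return CERT_ONE.sub("1", DOC_STRIP.sub("", value))
```

For example, clean_birth_certificate("I-БК №123456") gives "1БК123456", so two spellings of the same certificate compare as equal in the later blocking step.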


Blocking and finding possible pairs

...

The WOE/IV framework is based on the following relationship:

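  • $\log \frac{P(Y=1 \mid X_j)}{P(Y=0 \mid X_j)} = \underbrace{\log \frac{P(Y=1)}{P(Y=0)}}_{\text{sample log-odds}} + \underbrace{\log \frac{f(X_j \mid Y=1)}{f(X_j \mid Y=0)}}_{\text{WOE}}$
  • WOE = ln(% of non-merge / % of merge)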
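
As a rough illustration of the second formula, the sketch below computes WOE per bin of one variable from labelled pairs. The function name, the bin labels and the input layout are assumptions made for this example; it also assumes every bin contains at least one merge and one non-merge pair, since otherwise the ratio is undefined.

```python
import math
from collections import Counter


def woe_by_bin(bins, merged):
    """Return {bin: WOE} where WOE = ln(% of non-merge / % of merge).

    bins   -- bin label of the variable for each pair
    merged -- True if the pair is a confirmed merge, False otherwise
    """
    merge_counts = Counter(b for b, m in zip(bins, merged) if m)
    non_merge_counts = Counter(b for b, m in zip(bins, merged) if not m)
    total_merge = sum(merge_counts.values())
    total_non_merge = sum(non_merge_counts.values())

    woe = {}
    for b in set(bins):
        # Share of all non-merge (merge) pairs that fall into this bin.
        pct_non_merge = non_merge_counts[b] / total_non_merge
        pct_merge = merge_counts[b] / total_merge
        woe[b] = math.log(pct_non_merge / pct_merge)
    return woe


# Hypothetical toy data: six pairs, one variable with two bins.
example = woe_by_bin(
    ["same", "same", "same", "diff", "diff", "diff"],
    [True, True, False, True, False, False],
)
```

In this toy data the "same" bin holds 1/3 of the non-merge pairs and 2/3 of the merge pairs, so its WOE is ln(0.5); the "diff" bin gets ln(2).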

After decoding each variable, we can apply the model and calculate the probability.


Probability linkage using model

...