Overall process

Deduplication is the task of finding records in a data set (MPI) that refer to the same person. Guaranteeing data quality also requires a separate fraud-detection task. Cases where different persons share the same document number, tax_id, etc. are not treated as duplicates; they fall within the scope of the fraud-detection task.

...

  • data cleaning and preparation
  • blocking and finding possible pairs
  • calculating variables for each pair
  • probability linkage using the model
  • finding the master person in each pair and deactivating the person(s) that are not masters
  • fetching the declaration(s) that belong to the deactivated person(s) and terminating them
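
The blocking step in the list above can be illustrated with a minimal sketch. The blocking key used here (birth date plus the first letter of the last name) and the field names `birth_date` and `last_name` are assumptions made for illustration only; the actual blocking rules are not given in this section.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical blocking key: birth date plus first letter of the last name.
    return (record["birth_date"], record["last_name"][:1].lower())


def candidate_pairs(records):
    """Group records by blocking key and return id pairs within each block."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record["id"])
    pairs = []
    for ids in blocks.values():
        # Only records that share a block are compared with each other.
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)


# Illustrative records, not real data.
records = [
    {"id": "a", "last_name": "Shevchenko", "birth_date": "1990-01-01"},
    {"id": "b", "last_name": "Shevchenco", "birth_date": "1990-01-01"},
    {"id": "c", "last_name": "Kovalenko", "birth_date": "1990-01-01"},
    {"id": "d", "last_name": "Sydorenko", "birth_date": "1985-05-05"},
]
pairs = candidate_pairs(records)
```

Only the pairs produced here would go on to variable calculation, which keeps the number of comparisons far below the full cross product of the data set.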

...

The WOE/IV framework is based on the following relationship:

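  • \log \frac{P(Y=1 \mid X_j)}{P(Y=0 \mid X_j)} = \underbrace{\log \frac{P(Y=1)}{P(Y=0)}}_{\text{sample log-odds}} + \underbrace{\log \frac{f(X_j \mid Y=1)}{f(X_j \mid Y=0)}}_{\text{WOE}}
  • WOE = ln (% of non-merge / % of merge)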
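
As a rough illustration of the second formula, the sketch below computes WOE per bin of a single variable from labelled pairs. The function name `woe_by_bin`, the bin labels, and the sample data are hypothetical; bins with no merges or no non-merges are skipped here because their WOE is not finite, which is a simplification rather than a documented rule.

```python
import math
from collections import Counter


def woe_by_bin(bins, is_merge):
    """Return {bin: ln(% of non-merge / % of merge)} for one variable."""
    merge_counts = Counter(b for b, m in zip(bins, is_merge) if m)
    non_merge_counts = Counter(b for b, m in zip(bins, is_merge) if not m)
    total_merge = sum(merge_counts.values())
    total_non_merge = sum(non_merge_counts.values())
    if total_merge == 0 or total_non_merge == 0:
        raise ValueError("both merge and non-merge examples are required")
    woe = {}
    for b in set(bins):
        pct_merge = merge_counts[b] / total_merge
        pct_non_merge = non_merge_counts[b] / total_non_merge
        # A bin missing one of the classes has no finite WOE; skip it.
        if pct_merge == 0 or pct_non_merge == 0:
            continue
        woe[b] = math.log(pct_non_merge / pct_merge)
    return woe


# Illustrative data: 4 merged pairs followed by 4 non-merged pairs.
bins = ["match", "match", "match", "differ", "match", "differ", "differ", "differ"]
is_merge = [True, True, True, True, False, False, False, False]
woe = woe_by_bin(bins, is_merge)
```

With this sign convention, a bin that is more common among merged pairs gets a negative WOE and a bin that is more common among non-merged pairs gets a positive one.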

After each variable is converted to its WOE value, we can apply the model and calculate the probability.


Probability linkage using model

...

For now, master_person_id is determined by updated_at. In other words, the most recent record stays active and the others are merged into it.
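
A minimal sketch of this "latest record wins" rule follows. The function name `choose_master` and the record layout are assumptions; the timestamp field is passed as a parameter so the sketch does not depend on which column is actually used.

```python
from datetime import datetime


def choose_master(persons, timestamp_field="updated_at"):
    """Return (master_person_id, ids_to_merge) for a group of duplicates."""
    ordered = sorted(persons, key=lambda p: p[timestamp_field])
    master = ordered[-1]  # the most recent record stays active
    merged = [p["id"] for p in ordered[:-1]]  # all others are merged into it
    return master["id"], merged


# Illustrative records, not real data.
persons = [
    {"id": "p1", "updated_at": datetime(2019, 1, 10)},
    {"id": "p2", "updated_at": datetime(2019, 3, 5)},
]
master_id, merged_ids = choose_master(persons)
```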

...

column           | value                                    | description
id               | UUID                                     | record unique ID
person_id        | UUID                                     | the person which will be merged
master_person_id | UUID                                     | the person who will stay active
status           | NEW, MANUAL, MERGED                      |
inserted_at      | DATETIME = now()                         |
updated_at       | DATETIME = now()                         |
config           | {"person_id", "candidate_id", variables} |
details          |                                          |
score            |                                          | value from 0 to 1


Manual merge overview

For each person_id in merge_candidates with status 'NEW' and score >= 0.9, find the declaration in status 'VERIFIED', change its status to 'TERMINATED', and change persons.id.status to INACTIVE.

The OPS Kafka consumer should be used for declaration termination. Declaration termination and person deactivation should be done at the same time.
Then change merge_candidates.status from `NEW` to `MERGED`.

Fetch records in status 'NEW' with min_manul_score (0.8) < score < max_manul_score (0.9) from merge_candidates and write them into the manual_merge_candidates table:

column           | value            | description
id               | UUID             | record unique ID
person_id        | UUID             | the person which will be merged
master_person_id | UUID             | the person who will stay active
status           | NEW              | NEW, PROCESSED
assignee         | UUID             | user who currently reviews the request; null for a new request
inserted_at      | DATETIME = now() |
updated_at       | DATETIME = now() |
final_decision   | null             | null, SPLIT, MERGE, POSTPONE
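
The score thresholds described in this section can be summarised in a small routing sketch. The constants mirror the values 0.8 and 0.9 quoted above; the function name `route_candidate` and the returned labels (`AUTO_MERGE`, `MANUAL_REVIEW`, `NO_ACTION`, `SKIP`) are hypothetical and are not statuses defined by the system.

```python
MIN_MANUAL_SCORE = 0.8  # min_manul_score from the text
MAX_MANUAL_SCORE = 0.9  # max_manul_score from the text


def route_candidate(status, score):
    """Decide what happens to one merge_candidates record."""
    if status != "NEW":
        return "SKIP"  # only NEW records are processed
    if score >= MAX_MANUAL_SCORE:
        return "AUTO_MERGE"  # terminate declaration, deactivate person
    if MIN_MANUAL_SCORE < score < MAX_MANUAL_SCORE:
        return "MANUAL_REVIEW"  # copy into manual_merge_candidates
    return "NO_ACTION"  # at or below the lower threshold
```

Note that the bounds are strict on the manual range, so in this sketch a score of exactly 0.8 is not sent for manual review, while a score of exactly 0.9 is merged automatically.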


Once the number of decisions exceeds decision_amount and final_decision is MERGE, the OPS Kafka consumer should find the declaration in status 'VERIFIED', change its status to 'TERMINATED', and change persons.id.status to INACTIVE.