Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Updated by https://e-health-ua.atlassian.net/wiki/spaces/EH/pages/17526030505/Deduplication+process

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

Table of Contents

Overall process

...

Deduplication is the task of finding records in finding records in a data set (MPI) that refer to the same Person. In order to guarantee data quality also should be make task of fraud defining. In case there are different persons with same document number/tax_id etc such cases won't be duplicated and it's a scope of fraud defining task.

In order to deduplicate persons next steps must be done:

  • data cleaning and preparation 

  • blocking and finding possible pairs

  • calculating of variables for each pair

  • probability linkage using model

  • finding master persons in a pair

...

  • and deactivation of person(s) which are not masters

  • fetch declaration(s) which belongs to deactivated person(s) and declaration termination 

Data cleaning and preparation

For fields which are used for blocking or calculating variables next regular expressions must be applied:

  • for first_name, second_name and last_name, birth_settlement  change: [ --'] to '',  'є' to 'е', 'и' to 'і'

  • for birth_certificate change: [ /%#№ _-]  to '',  [iі!IІ] to 1

  • for other documents: [ /%#№ _-]  to ''

  • for birth_settlement - ([сc][ \.,])|([сc]ело[\.,]*)|([сc]мт[\.,]*)|([сc]елище [мm][іi][сc]ького типу)|([сc]елище[\.,]*)|([мm][іi][сc][tт][оo][\.,]*)|([мm][\.,]*)

Blocking and finding possible pairs

Duplicate records almost always share something in common. If we define groups of data that share something and only compare the records in that group, or block, then we can reduce the number of comparisons we will make. In other words it's we should apply smart comparison.

Predicate blocks

We need to define in such way to have much less pairs to compare and still and still have confidence that will compare records that truly are duplicates.

Current blocks is matching either of

  • tax_id

  • documents.number 

  • authentication number 

  • residence settlement + first_name

...

  • residence settlement + last_name

Building index for blocks

In order to make process of pairing more efficient 

  • build index for each field that take part in blocks

  • add boolean flag field checked and build index on it

  • one by one take persons with flag "checked" is null and take ids of persons with same characteristic (tax_id, document and all block fields)

Variables calculation calculation for each pair

After building model on train dataset there is a knowledge which variables are correlated with target and have high IV. 

There is no linear dependency on variable and target. For example, if name has difference in 0 symbols the probability that this is same persons is high. In case the difference is 1 symbols  the probability decreases, 2 symbols 2 symbols - dramatically decreases, there is no linear dependency. So, we should to bin continuous variables in categorical, base on hit rate. Such approach also helps to define linear, non-linear dependency, handle missing values and predict power of missing values.

Variables that included into model are next:

Variable

Descirption

d_first_name

levenshtein distance(first_name1, first_name2)

d_last_name

levenshtein distance(last_name1, last_name2)

d_second_name

levenshtein distance(second_name1, second_name2)

d_documents

min(levenshtein distance(document1, document2)) for any types of documents

docs_same_number

min(same/not) number

birth_settlement_substr

min(position(birth_settlement_1 in birth_settlementt_2) and position(birth_settlement_2 in birth_settlementt_1)

d_tax_id

levenshtein distance(tax_id1, tax_id2)

authentication_methods

same/not authentification OTP number flag

residence_settlement_flag

same/not residence settlement flag

registration_settlement_flag

same/not registration settlement flag

gender_flag

same/not gender

twins_flag

distance last_name <=2, same birth_date, distance in document numbers between 1 and 2

Each categorical categorical variable must be convert to continuous using WOE (which was calculated for train data sample). WOE describes the relationship between a predictive variable and a binary target variable.

The WOE/IV framework is based on the following relationship:

...

  • logP(Y=1|

...

  • Xj)P(Y=0|

...

  • Xj=

...

  •  logP(Y=1)P(Y=0) (sample log-odds+log(

...

  • Xj|Y=1)f(

...

  • Xj|Y=0) (WOE)

  • WOE = ln (% of non-merge/

...

  •  % of merge)

After decoding each variable we can apply model and calculate probability

Probability linkage using model

...

Logistic regression itself is a function:

 Image Modified

where  Image Modified is the intercept from the linear regression equation and Image Modified is the regression coefficient multiplied by some value of the predictor.

...

After that the threshold should be used to define which probability if satisfying to call the pair a duplicate.

...

Field

Description

Score

Probability that pair of persons should be merged

Sum of target

Quantity of pairs which should be merged

Qty

Total quantity of pairs

Hit_rate

Ratio of 'Quantity of pairs which should be merged' to 'Total quantity of pairs'

Merge_acu_%

Accumulated ratio of all records which marked as merge

Qty_acu_%

Accumulated sample distribution by score

Accuracy_rate

Percantage of errors on data sample

From test sample suggestions for cut off are next:

  • score >=0.9 - merge (as minimum score that should be saved to `merge candidates` and auto merged)

  • score between 0.7 and 0.9 - manual merge  (as minimum score that should be merged manually)

  • score < 0.7 - do not merge

Define master record and terminate declaration

...

Make a record into merge_candidates:

column

value

description

id

UUID

record unique ID

person_id

UUID

the person which will be merged

master_person_id

UUID

the person who will stay active

status

NEW, MANUAL, MERGED

inserted_at

DATETIME = now()

updated_at

DATETIME = now()

config

{"person_id"


"candidate_id"


variables}

details

score

value from 0 to 1

Manual merge overview

For person_id in status 'NEW'  and score>=max_manul_score (0.9) from merge_candidates find declaration in status 'VERIFIED' and change status to 'TERMINATED' and change persons.id.status to INACTIVE

OPS kafka consumer should be used for the declaration termination. Declaration termination and person deactivation should be done at the same time.
Change merge_candidates.status from `NEW` to `MERGED`

Fetch record in status 'NEW'  and (0.8) min_manul_score < score < max_manul_score (0.9) from merge_candidates and write them into table manual_merge_candidates:

column

value

description

id

UUID

record unique ID

merge_candidate_id

UUID

pair identifier 

person_id

UUID

the person which will be merged

master_person_id

UUID

the person who will stay active

status

NEW

NEW, PROCESSED

status_reason

null, text

assignee 

UUID

user, who currently reviews request, for new request it's null

inserted_at

DATETIME = now()

updated_at

DATETIME = now()

decision

null, text

null, SPLIT, MERGE, POSTPONE

After that change merge_candidates.status from `NEW` to `MANUAL`

After decision reach more than decision_amount and final_decision is MERGE OPS kafka consumer should find declaration in status 'VERIFIED' and change status to 'TERMINATED'
and change persons.id.status to INACTIVEINACTIVE and 
Deactivate user:

  1. search user by person_id, if exists (Mithril.users DB)

    1. set is_active = false

    2. set updated_at = now()

  2. call expire_user_tokens function

Auto merge overview

After there was a decision to merge pair of persons do

  1. Validate persons

    1. check if person or master_person exists in DB

    2. check persons `updated_at`, if (merge_candidate.person.inserted_at or merge_candidate.measter_person.inserted_at) < mpi.person.updated_at set status STALE to merge_candidates

  2. Deactivate person

    1. search declaration for person_id, if exist

      1. create event to kafka

      2. set status DECLARATION_READY_DEACTIVATE to merge_candidates

    2. change person_id status to inactive

    3. add info to merged_pairs with person_id and master_person_id

    4. set status MERGED to merge_candidates

    5. add person status change info to event_manager

  3. Deactivate user

    1. search user by person_id, if exists (Mithril.users DB)

      1. set user.is_active = false

      2. set updated_at = now()

    2. call expire_user_tokens function

Merge_candidates state diagram

Status

Description

NEW

New pair of merge candidates

MERGED

Pair that was merged after either manual merge process or auto merge process

STALE

Pair is not merged as one of person records has been updated after the de-duplication score has been calculated.

DECLARATION_READY_DEACTIVATE

The event is sent to kafka to deactivate declaration of the person

DECLINED

Pair is not merged. Not because of person, but because of related to persons entities. For example, declarations.

IN_PROCESS

The event is sent to kafka to deactivate the person, which is not master person