ЕСОЗ - публічна документація


Overall process

Deduplication is the task of finding records in a data set (MPI) that refer to the same person. To guarantee data quality, a separate fraud-detection task is also required: if different persons share the same document number, tax_id, etc., such cases are not treated as duplicates; they fall within the scope of the fraud-detection task.

In order to deduplicate persons, the following steps must be performed:

  • data cleaning and preparation
  • blocking and finding possible pairs
  • calculating variables for each pair
  • probability linkage using the model
  • finding the master person in each pair and deactivating the person(s) that are not masters
  • fetching the declaration(s) that belong to the deactivated person(s) and terminating them


Data cleaning and preparation

For the fields that are used for blocking or for variable calculation, the following replacements must be applied (a cleaning sketch is given after the list):

  • for first_name, second_name, last_name and birth_settlement change: [ --'] to '', 'є' to 'е', 'и' to 'і'
  • for birth_certificate change: [ /%#№ _-] to '', [iі!IІ] to 1
  • for other documents change: [ /%#№ _-] to ''
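A minimal sketch of these cleaning rules in Python; the function names are illustrative, and the character classes below are one possible reading of the rules above:

```python
import re

def clean_name(value):
    """For first_name, second_name, last_name and birth_settlement."""
    value = re.sub(r"[ \-']", "", value)               # drop spaces, dashes and apostrophes
    return value.replace("є", "е").replace("и", "і")   # unify commonly confused Cyrillic letters

def clean_birth_certificate(value):
    value = re.sub(r"[ /%#№_\-]", "", value)           # drop separators and punctuation
    return re.sub(r"[iі!IІ]", "1", value)              # symbols easily mistaken for the digit 1

def clean_document(value):
    """For all other document numbers."""
    return re.sub(r"[ /%#№_\-]", "", value)
```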


Blocking and finding possible pairs

Duplicate records almost always have something in common. If we define groups of records that share a characteristic and compare only the records within such a group, or block, we can greatly reduce the number of comparisons. In other words, we apply smart comparison instead of comparing every record with every other one.

Predicate blocks

Blocks need to be defined in such a way that we have far fewer pairs to compare, while still being confident that records which truly are duplicates end up in the same block.

The current blocks match on any of the following (a pairing sketch is given after the list):

  • tax_id
  • documents.number 
  • authentication number 
  • settlement + first_name
  • settlement + last_name
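A rough sketch of how these blocks can be used to generate candidate pairs, written in plain Python over already-cleaned person records; the field and key names are assumptions based on the list above:

```python
from collections import defaultdict
from itertools import combinations

def blocking_keys(person):
    """Return the blocking keys for one person record."""
    keys = []
    if person.get("tax_id"):
        keys.append(("tax_id", person["tax_id"]))
    for doc in person.get("documents", []):
        keys.append(("document_number", doc["number"]))
    if person.get("authentication_number"):
        keys.append(("authentication_number", person["authentication_number"]))
    if person.get("settlement"):
        keys.append(("settlement_first_name", (person["settlement"], person["first_name"])))
        keys.append(("settlement_last_name", (person["settlement"], person["last_name"])))
    return keys

def candidate_pairs(persons):
    """Group persons by blocking key and emit pairs only within each block."""
    blocks = defaultdict(set)
    for person in persons:
        for key in blocking_keys(person):
            blocks[key].add(person["id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```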

Building index for blocks

To make the pairing process more efficient:

  • build an index for each field that takes part in the blocks
  • add a boolean flag field checked and build an index on it
  • take persons with the checked flag equal to null one by one and collect the ids of persons with the same characteristics (tax_id, document and all other block fields), as sketched below
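A hedged sketch of that loop; `repo` is a hypothetical data-access object standing in for the MPI database, and blocking_keys is the helper from the previous sketch. The indexed checked flag is used to pick up unprocessed persons:

```python
def pair_unchecked_persons(repo):
    """Walk through persons with checked IS NULL and collect candidate ids per block field."""
    while True:
        person = repo.next_unchecked()            # uses the index on the checked flag
        if person is None:
            break
        for key in blocking_keys(person):         # tax_id, document number, auth number, settlement + name
            for other_id in repo.ids_with_same_key(key, exclude=person["id"]):
                repo.save_candidate_pair(person["id"], other_id)
        repo.mark_checked(person["id"])           # set checked = true for this person
```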

Variables calculation for each pair

After building the model on the train data set, we know which variables are correlated with the target and have a high Information Value (IV).

The dependency between a variable and the target is not linear. For example, if the names differ in 0 symbols, the probability that the records describe the same person is high; with a difference of 1 symbol the probability decreases, and with 2 symbols it decreases dramatically. Therefore continuous variables should be binned into categorical ones, based on hit rate. This approach also helps to capture both linear and non-linear dependencies, handle missing values and estimate the predictive power of missing values.

The variables included in the model are the following (a calculation sketch is given after the table):

Variable | Description
d_first_name | levenshtein_distance(first_name1, first_name2)
d_last_name | levenshtein_distance(last_name1, last_name2)
d_second_name | levenshtein_distance(second_name1, second_name2)
d_documents | least(levenshtein_distance(document1, document2)) over all document pairs
docs_same_number | flag: same/not same document number
birth_settlement_substr | least(position(birth_settlement_1 in birth_settlement_2), position(birth_settlement_2 in birth_settlement_1))
d_tax_id | levenshtein_distance(tax_id1, tax_id2)
authentication_methods | flag: same/not same authentication OTP number
residence_settlement_flag | flag: same/not same residence settlement
registration_settlement_flag | flag: same/not same registration settlement
gender_flag | flag: same/not same gender
twins_flag | last_name distance <= 2, same birth_date, distance between document numbers between 1 and 2
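A sketch of the raw (pre-binning) variable calculation for one candidate pair. The layout of a person record and the flag encodings are assumptions, the small Levenshtein helper is included only to keep the example self-contained, and the twins_flag is omitted for brevity:

```python
def levenshtein(a, b):
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def pair_variables(p1, p2):
    """Raw variables for one candidate pair; names follow the table above."""
    docs1 = [d["number"] for d in p1.get("documents", [])]
    docs2 = [d["number"] for d in p2.get("documents", [])]
    return {
        "d_first_name": levenshtein(p1["first_name"], p2["first_name"]),
        "d_last_name": levenshtein(p1["last_name"], p2["last_name"]),
        "d_second_name": levenshtein(p1.get("second_name") or "", p2.get("second_name") or ""),
        "d_documents": min((levenshtein(a, b) for a in docs1 for b in docs2), default=None),
        "docs_same_number": bool(set(docs1) & set(docs2)),
        "birth_settlement_substr": (p1["birth_settlement"] in p2["birth_settlement"]
                                    or p2["birth_settlement"] in p1["birth_settlement"]),
        "d_tax_id": levenshtein(p1.get("tax_id") or "", p2.get("tax_id") or ""),
        "authentication_methods": p1.get("auth_otp_number") == p2.get("auth_otp_number"),
        "residence_settlement_flag": p1["residence_settlement"] == p2["residence_settlement"],
        "registration_settlement_flag": p1["registration_settlement"] == p2["registration_settlement"],
        "gender_flag": p1["gender"] == p2["gender"],
    }
```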

Each categorical variable must be converted to a continuous one using WOE (Weight of Evidence), which was calculated on the train data sample. WOE describes the relationship between a predictive variable and a binary target variable.

The WOE/IV framework is based on the following relationship:

  • log( P(Y=1|Xj) / P(Y=0|Xj) ) = log( P(Y=1) / P(Y=0) ) (sample log-odds) + log( f(Xj|Y=1) / f(Xj|Y=0) ) (WOE)
  • WOE = ln (% of non-merge/ % of merge)
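For illustration, WOE per bin can be computed from train-sample counts like this; the bin labels and counts below are made up:

```python
import math

def woe_per_bin(bins):
    """bins: {bin_label: (non_merge_count, merge_count)} taken from the train sample."""
    total_non_merge = sum(n for n, _ in bins.values())
    total_merge = sum(m for _, m in bins.values())
    return {label: math.log((n / total_non_merge) / (m / total_merge))
            for label, (n, m) in bins.items()}

# Example: binned d_first_name (0, 1, 2+ differing symbols); counts are illustrative only.
woe_d_first_name = woe_per_bin({"0": (50, 900), "1": (150, 80), "2+": (800, 20)})
```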

After encoding each variable with its WOE value, we can apply the model and calculate the probability.


Probability linkage using model

For now, logistic regression has been chosen as the predictive method.

Logistic regression itself is the function:

P = 1 / (1 + e^-(b0 + b1*x1 + ... + bk*xk))

where b0 is the intercept from the linear regression equation and each bi*xi is a regression coefficient multiplied by the value of the corresponding predictor.
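A minimal sketch of scoring one pair with a fitted model; the intercept and coefficients are assumed to come from the training step, and the names are illustrative:

```python
import math

def merge_probability(woe_values, intercept, coefficients):
    """woe_values: WOE-encoded variables for one pair, e.g. {"d_first_name": -1.2, ...}."""
    z = intercept + sum(coefficients[name] * value for name, value in woe_values.items())
    return 1.0 / (1.0 + math.exp(-z))
```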

As a result, the model produces the probability of the merge/not-merge event for each input pair.

After that, a threshold is used to decide which probability is high enough to call the pair a duplicate.

 

Based on the test sample, the suggested cut-offs are as follows (a decision sketch is given after the list):

  • score >= 0.9 - merge (the minimum score at which a pair is saved to `merge candidates`)
  • score between 0.7 and 0.9 - manual merge (the minimum score at which a pair may be merged)
  • score < 0.7 - do not merge
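A small sketch of mapping a pair's score to an action using these cut-offs:

```python
def decision(score, auto_merge=0.9, manual_merge=0.7):
    """Map the model score to an action using the suggested cut-offs."""
    if score >= auto_merge:
        return "merge"
    if score >= manual_merge:
        return "manual_merge"
    return "do_not_merge"
```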


Define master record and terminate declaration

For now, master_person_id is determined by inserted_at. In other words, the most recently inserted record stays active and the other records are merged into it.
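A hedged sketch of choosing the master record in a pair by inserted_at; the record structure is assumed:

```python
def split_master(person_a, person_b):
    """The most recently inserted person stays active; the other one is deactivated."""
    master, duplicate = sorted([person_a, person_b],
                               key=lambda p: p["inserted_at"], reverse=True)
    return master, duplicate
```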

For each person_id in status 'NEW' from merge_candidates, find the declaration in status 'VERIFIED' and change its status to 'TERMINATED'.

The OPS Kafka consumer should be used for the declaration termination:

  • declaration termination and person deactivation should be done at the same time.