ЕСОЗ - public documentation

Deduplication process (OUTDATED)

Overall process

Deduplication is the task of finding records in a data set (MPI) that refer to the same Person.

Deduplication also includes the following sub-tasks:

  • Determine the master record of a duplicate pair
  • Terminate declarations that refer to the merged record
  • Deactivate the oldest record
  • Link merged records

Configuration

Process configuration

| Parameter | Description | Value |
|---|---|---|
| DEDUPLICATION_SCHEDULE | (Cron format) How often to perform deduplication | `* 1 * * *` |
| DEDUPLICATION_DEPTH | (days, integer >= 0) Search for candidates among records created within the last DEDUPLICATION_DEPTH days. If DEDUPLICATION_DEPTH == 0, analyze all records | 1 |
| DEDUPLICATION_SCORE | (number) A record pair with a score greater than DEDUPLICATION_SCORE must be merged | 0.8 |

Person attributes weight

Weight value format - array of numbers, [Match, Non-match]

| Parameter | Description | Value |
|---|---|---|
| DEDUPLICATION_TAX_ID | (array of numbers) Person.tax_id | [0.1, -0.1] |
| DEDUPLICATION_FIRST_NAME | (array of numbers) Person.first_name | [0.1, -0.1] |
| DEDUPLICATION_LAST_NAME | (array of numbers) Person.last_name | [0.1, -0.1] |
| DEDUPLICATION_SECOND_NAME | (array of numbers) Person.second_name | [0.1, -0.1] |
| DEDUPLICATION_BIRTH_DATE | (array of numbers) Person.birth_date | [0.1, -0.1] |
| DEDUPLICATION_DOCUMENT | (array of numbers) Person.documents | [0.1, -0.1] |
| DEDUPLICATION_NATIONAL_ID | (array of numbers) Person.national_id | [0.1, -0.1] |
| DEDUPLICATION_PHONE_NUMBER | (array of numbers) Person.phones | [0.1, -0.1] |
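A minimal sketch of how these parameters might be read. It assumes they are supplied as environment variables and that each weight is encoded as a JSON array; the source states neither. The `DedupConfig` type and the `load_config` helper are hypothetical names used for illustration only. The fallback values are the ones listed in the tables above.

```python
import json
import os
from dataclasses import dataclass

# Attribute weight parameters; each holds [match, non_match].
WEIGHT_PARAMS = [
    "DEDUPLICATION_TAX_ID",
    "DEDUPLICATION_FIRST_NAME",
    "DEDUPLICATION_LAST_NAME",
    "DEDUPLICATION_SECOND_NAME",
    "DEDUPLICATION_BIRTH_DATE",
    "DEDUPLICATION_DOCUMENT",
    "DEDUPLICATION_NATIONAL_ID",
    "DEDUPLICATION_PHONE_NUMBER",
]


@dataclass
class DedupConfig:
    schedule: str   # cron expression
    depth: int      # days; 0 means "analyze all records"
    score: float    # merge threshold
    weights: dict   # parameter name -> (match_weight, non_match_weight)


def load_config(env=None):
    """Read deduplication settings, falling back to the documented values."""
    env = os.environ if env is None else env  # assumption: env-var based config
    depth = int(env.get("DEDUPLICATION_DEPTH", "1"))
    if depth < 0:
        raise ValueError("DEDUPLICATION_DEPTH must be >= 0")
    weights = {}
    for name in WEIGHT_PARAMS:
        match, non_match = json.loads(env.get(name, "[0.1, -0.1]"))
        weights[name] = (float(match), float(non_match))
    return DedupConfig(
        schedule=env.get("DEDUPLICATION_SCHEDULE", "* 1 * * *"),
        depth=depth,
        score=float(env.get("DEDUPLICATION_SCORE", "0.8")),
        weights=weights,
    )
```

Passing a plain dictionary instead of `os.environ` (for example `load_config({"DEDUPLICATION_DEPTH": "0"})`) makes the helper easy to exercise in tests.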


Specification

Start Deduplication process on DEDUPLICATION_SCHEDULE

Get candidates list

  1. Read the DEDUPLICATION_DEPTH parameter value. If DEDUPLICATION_DEPTH == 0, analyze the whole period
  2. Fetch records from MPI.person created within the last DEDUPLICATION_DEPTH days
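The candidate selection can be sketched as follows. This is an in-memory illustration, not the actual MPI query. It assumes the depth window is counted back from the current time and that each person record carries a creation timestamp; the `inserted_at` field and both function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def candidate_cutoff(depth_days, now=None):
    """Return the earliest creation time to include, or None for no limit."""
    if depth_days < 0:
        raise ValueError("DEDUPLICATION_DEPTH must be >= 0")
    if depth_days == 0:
        return None  # depth 0: analyze the whole period
    now = now or datetime.now(timezone.utc)
    return now - timedelta(days=depth_days)


def select_candidates(persons, depth_days, now=None):
    """Filter person records (dicts) down to those created inside the window."""
    cutoff = candidate_cutoff(depth_days, now)
    if cutoff is None:
        return list(persons)
    # `inserted_at` is an assumed field name for the creation timestamp.
    return [p for p in persons if p["inserted_at"] >= cutoff]
```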

Get record from list

Get Data Set

  1. Fetch records (Data Set) from MPI.person to perform matching
  2. For each record in the Data Set, start the matching procedure

Match records


The matching procedure is performed on a pair of records (candidate, Data Set.record) and compares their corresponding attributes using exact match.

  1. Read the comparison attributes from configuration (see Person attributes weight)
  2. Compare each attribute of the pair of records
Example:


| Record | tax_id | first_name | last_name | second_name | birth_date | document | national_id | phone_number |
|---|---|---|---|---|---|---|---|---|
| Record1 | 3087232628 | Петро | Бондар | Миколайович | 12.06.1993 | ВВ123456 | РП-765123 | [+380501234567, +380507654321] |
| Record2 | 3087232628 | Педро | Бондар | Миколайович | 13.06.1993 | ВВ654321 | РП-765123 | [+380501234567] |

Match result:

| Attribute | Result |
|---|---|
| tax_id | Match |
| first_name | Non-match |
| last_name | Match |
| second_name | Match |
| birth_date | Non-match |
| document | Non-match |
| national_id | Match |
| phone_number | Match |
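The comparison in the example above can be sketched in a few lines. Scalar attributes are compared for exact equality. The example reports phone_number as a match although the two phone lists differ, so this sketch treats list-valued attributes as matching when they share at least one value; that rule is inferred from the example and is not stated in the source. The function names and the dictionary layout of a record are assumptions.

```python
ATTRIBUTES = [
    "tax_id", "first_name", "last_name", "second_name",
    "birth_date", "document", "national_id", "phone_number",
]


def compare_attribute(left, right):
    """Return True when two attribute values match."""
    if isinstance(left, (list, tuple, set)) and isinstance(right, (list, tuple, set)):
        # Assumption: multi-valued attributes match on any shared value.
        return bool(set(left) & set(right))
    return left == right  # exact match for scalar values


def match_records(candidate, record, attributes=ATTRIBUTES):
    """Compare a pair of records attribute by attribute."""
    return {
        name: compare_attribute(candidate.get(name), record.get(name))
        for name in attributes
    }


record1 = {
    "tax_id": "3087232628", "first_name": "Петро", "last_name": "Бондар",
    "second_name": "Миколайович", "birth_date": "12.06.1993",
    "document": "ВВ123456", "national_id": "РП-765123",
    "phone_number": ["+380501234567", "+380507654321"],
}
record2 = {
    "tax_id": "3087232628", "first_name": "Педро", "last_name": "Бондар",
    "second_name": "Миколайович", "birth_date": "13.06.1993",
    "document": "ВВ654321", "national_id": "РП-765123",
    "phone_number": ["+380501234567"],
}

match_result = match_records(record1, record2)
```

Here `match_result` maps each attribute name to `True` (Match) or `False` (Non-match).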

Score result

  1. Score the match result using the Person attributes weight and calculate the total

Example:

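| Attribute | Result | Score |
|---|---|---|
| tax_id | Match | 0.1 |
| first_name | Non-match | -0.1 |
| last_name | Match | 0.1 |
| second_name | Match | 0.1 |
| birth_date | Non-match | -0.1 |
| document | Non-match | -0.1 |
| national_id | Match | 0.1 |
| phone_number | Match | 0.1 |
| Total: | | 0.2 |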
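The scoring example above can be reproduced with a short sketch. Each attribute contributes its Match weight or its Non-match weight, and the contributions are summed. The `score_match` helper and the dictionary shapes are hypothetical; the weights are the documented `[0.1, -0.1]` values.

```python
ATTRIBUTE_NAMES = [
    "tax_id", "first_name", "last_name", "second_name",
    "birth_date", "document", "national_id", "phone_number",
]

# attribute -> (match_weight, non_match_weight), as in the configuration
WEIGHTS = {name: (0.1, -0.1) for name in ATTRIBUTE_NAMES}


def score_match(match_result, weights):
    """Sum the weight of every attribute according to its match result."""
    total = 0.0
    for name, matched in match_result.items():
        match_weight, non_match_weight = weights[name]
        total += match_weight if matched else non_match_weight
    # Round to suppress floating-point noise in the sum.
    return round(total, 6)


example_result = {
    "tax_id": True, "first_name": False, "last_name": True,
    "second_name": True, "birth_date": False, "document": False,
    "national_id": True, "phone_number": True,
}

total = score_match(example_result, WEIGHTS)
print(total)  # 0.2
```

Five matches and three non-matches give 5 × 0.1 + 3 × (-0.1) = 0.2, which is below the documented DEDUPLICATION_SCORE of 0.8.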

Save duplicated record to DB

If the total score is greater than DEDUPLICATION_SCORE, save the duplicated records to the DB
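A minimal sketch of the threshold check, assuming scored pairs are available as `(person_id, candidate_id, score)` tuples. The output field names are hypothetical, and the initial status `'NEW'` is an assumption based on the merge step, which reads records with that status. The comparison is strict: a pair scoring exactly DEDUPLICATION_SCORE is not saved.

```python
def select_duplicates(scored_pairs, threshold):
    """Keep pairs whose total score is strictly greater than the threshold."""
    return [
        {
            "person_id": person_id,        # hypothetical field names
            "candidate_id": candidate_id,
            "score": score,
            "status": "NEW",               # assumed initial status
        }
        for person_id, candidate_id, score in scored_pairs
        if score > threshold
    ]
```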

Merge records (IL)

Get unmerged records

Get all records with status == 'NEW'

Apiary spec

Terminate all person declarations

For each record terminate its declarations

Apiary spec

Merge candidate

Update the record status to 'MERGED' and mark the merged record as inactive

Apiary spec 
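The merge steps above can be illustrated with an in-memory sketch. It does not call the actual API described in the Apiary specs. All field names (`merged_person_id`, `is_active`, the declaration `status` values) are assumptions made for illustration.

```python
def merge_pending(pairs, persons, declarations):
    """Process every pair still in status 'NEW'; return the pairs merged."""
    merged = []
    for pair in pairs:
        if pair["status"] != "NEW":
            continue  # only unmerged records are processed
        merged_id = pair["merged_person_id"]
        # Terminate all declarations that refer to the merged person.
        for declaration in declarations:
            if declaration["person_id"] == merged_id and declaration["status"] == "active":
                declaration["status"] = "terminated"
        # Deactivate the merged person and close the pair.
        persons[merged_id]["is_active"] = False
        pair["status"] = "MERGED"
        merged.append(pair)
    return merged
```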
