Overall process

Deduplication is the task of finding records in a data set (MPI) that refer to the same Person.

Deduplication also includes the following sub-tasks:

...

Configuration

Process configuration

| Parameter | Description | Value |
|---|---|---|
| DEDUPLICATION_SCHEDULE | (Cron format) How often to perform deduplication | `* 1 * * *` |
| DEDUPLICATION_DEPTH | (days, integer >= 0) Search candidates among records created in the last DEDUPLICATION_DEPTH days. If DEDUPLICATION_DEPTH == 0, analyze all records | 1 |
| DEDUPLICATION_SCORE | (number) A record pair with a score greater than DEDUPLICATION_SCORE must be merged | 0.8 |
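
The parameters above could be read and validated roughly as follows. This is a minimal sketch that assumes the values arrive as strings in an environment-style mapping; the source does not say where configuration is stored. `ProcessConfig`, `load_process_config`, and `creation_cutoff` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Mapping, Optional


@dataclass(frozen=True)
class ProcessConfig:
    schedule: str    # cron expression, passed through unparsed
    depth_days: int  # 0 means "analyze all records"
    score: float     # threshold a pair's total score must exceed


def load_process_config(env: Mapping[str, str]) -> ProcessConfig:
    """Build a ProcessConfig from a string mapping; defaults mirror the table."""
    depth = int(env.get("DEDUPLICATION_DEPTH", "1"))
    if depth < 0:
        raise ValueError("DEDUPLICATION_DEPTH must be an integer >= 0")
    return ProcessConfig(
        schedule=env.get("DEDUPLICATION_SCHEDULE", "* 1 * * *"),
        depth_days=depth,
        score=float(env.get("DEDUPLICATION_SCORE", "0.8")),
    )


def creation_cutoff(config: ProcessConfig, now: datetime) -> Optional[datetime]:
    """Earliest creation time to consider, or None when all records are analyzed."""
    if config.depth_days == 0:
        return None
    return now - timedelta(days=config.depth_days)
```

With the default depth of 1, `creation_cutoff` returns a timestamp one day before `now`; with a depth of 0 it returns `None`, which a caller would treat as "no creation-date filter".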

Person attributes weight

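Weight value format: an array of two numbers, [Match, Non-match]. The first number is added to the score when the attribute matches, the second when it does not.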

| Parameter | Description | Value |
|---|---|---|
| DEDUPLICATION_TAX_ID | (array of numbers) Person.tax_id | [0.1, -0.1] |
| DEDUPLICATION_FIRST_NAME | (array of numbers) Person.first_name | [0.1, -0.1] |
| DEDUPLICATION_LAST_NAME | (array of numbers) Person.last_name | [0.1, -0.1] |
| DEDUPLICATION_SECOND_NAME | (array of numbers) Person.second_name | [0.1, -0.1] |
| DEDUPLICATION_BIRTH_DATE | (array of numbers) Person.birth_date | [0.1, -0.1] |
| DEDUPLICATION_DOCUMENT | (array of numbers) Person.documents | [0.1, -0.1] |
| DEDUPLICATION_NATIONAL_ID | (array of numbers) Person.national_id | [0.1, -0.1] |
| DEDUPLICATION_PHONE_NUMBER | (array of numbers) Person.phones | [0.1, -0.1] |
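
One possible way to load these weights is sketched here, assuming each value is stored as a JSON array string such as `[0.1, -0.1]` (the storage format is an assumption, not stated in the source). `WEIGHT_PARAMETERS`, `parse_weight`, and `load_weights` are hypothetical names; the parameter-to-attribute mapping follows the table above.

```python
import json
from typing import Dict, Mapping, Tuple

# Parameter name -> Person attribute, as listed in the table above.
WEIGHT_PARAMETERS = {
    "DEDUPLICATION_TAX_ID": "tax_id",
    "DEDUPLICATION_FIRST_NAME": "first_name",
    "DEDUPLICATION_LAST_NAME": "last_name",
    "DEDUPLICATION_SECOND_NAME": "second_name",
    "DEDUPLICATION_BIRTH_DATE": "birth_date",
    "DEDUPLICATION_DOCUMENT": "documents",
    "DEDUPLICATION_NATIONAL_ID": "national_id",
    "DEDUPLICATION_PHONE_NUMBER": "phones",
}

DEFAULT_WEIGHT = "[0.1, -0.1]"


def parse_weight(raw: str) -> Tuple[float, float]:
    """Parse a '[Match, Non-match]' JSON array into a pair of floats."""
    value = json.loads(raw)
    if not isinstance(value, list) or len(value) != 2:
        raise ValueError(f"expected [Match, Non-match], got {raw!r}")
    match_weight, non_match_weight = value
    return float(match_weight), float(non_match_weight)


def load_weights(env: Mapping[str, str]) -> Dict[str, Tuple[float, float]]:
    """Return {person_attribute: (match_weight, non_match_weight)}."""
    return {
        attribute: parse_weight(env.get(parameter, DEFAULT_WEIGHT))
        for parameter, attribute in WEIGHT_PARAMETERS.items()
    }
```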


Specification

Start the Deduplication process according to DEDUPLICATION_SCHEDULE.

...

  1. Read comparison attributes from configuration (see Person attribute weight)
  2. Compare each attribute of a pair of records

Example:


|  | tax_id | first_name | last_name | second_name | birth_date | document | national_id | phone_number |
|---|---|---|---|---|---|---|---|---|
| Record 1 | 3087232628 | Петро | Бондар | Миколайович | 12.06.1993 | ВВ123456 | РП-765123 | [+380501234567, +380507654321] |
| Record 2 | 3087232628 | Педро | Бондар | Миколайович | 13.06.1993 | ВВ654321 | РП-765123 | [+380501234567] |

Match result:

| Attribute | Result |
|---|---|
| tax_id | Match |
| first_name | Non-match |
| last_name | Match |
| second_name | Match |
| birth_date | Non-match |
| document | Non-match |
| national_id | Match |
| phone_number | Match |
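
The comparison step can be sketched as follows. The source does not define the comparison rule, so this sketch assumes exact equality for single values and, for list values, a match when the two lists share at least one element (inferred from the example, where `phone_number` is a Match although the lists differ). The real process may normalize values or use fuzzy matching. `attribute_matches` and `compare_records` are hypothetical names; the keys follow the example table columns.

```python
from typing import Any, Dict


def attribute_matches(left: Any, right: Any) -> bool:
    """Scalars match on equality; lists match when they share a value."""
    if isinstance(left, list) or isinstance(right, list):
        left_values = set(left if isinstance(left, list) else [left])
        right_values = set(right if isinstance(right, list) else [right])
        return bool(left_values & right_values)
    return left == right


def compare_records(a: Dict[str, Any], b: Dict[str, Any]) -> Dict[str, str]:
    """Return {attribute: 'Match' | 'Non-match'} for every attribute in a."""
    return {
        attr: "Match" if attribute_matches(a[attr], b[attr]) else "Non-match"
        for attr in a
    }


record_1 = {
    "tax_id": "3087232628",
    "first_name": "Петро",
    "last_name": "Бондар",
    "second_name": "Миколайович",
    "birth_date": "12.06.1993",
    "document": "ВВ123456",
    "national_id": "РП-765123",
    "phone_number": ["+380501234567", "+380507654321"],
}
record_2 = {
    "tax_id": "3087232628",
    "first_name": "Педро",
    "last_name": "Бондар",
    "second_name": "Миколайович",
    "birth_date": "13.06.1993",
    "document": "ВВ654321",
    "national_id": "РП-765123",
    "phone_number": ["+380501234567"],
}

match_result = compare_records(record_1, record_2)
```

Under these assumptions, `match_result` reproduces the Match result table above.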

Score result

  1. Score each match result using the Person attribute weight and calculate the total

Example:

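| Attribute | Result | Score |
|---|---|---|
| tax_id | Match | 0.1 |
| first_name | Non-match | -0.1 |
| last_name | Match | 0.1 |
| second_name | Match | 0.1 |
| birth_date | Non-match | -0.1 |
| document | Non-match | -0.1 |
| national_id | Match | 0.1 |
| phone_number | Match | 0.1 |
| Total: |  | 0.2 |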
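
The total in the example is 5 × 0.1 + 3 × (-0.1) = 0.2. A minimal sketch of that calculation, using the default weights from the configuration section, is given here; `score_pair` is a hypothetical name and the keys follow the example table.

```python
from typing import Dict, Tuple

ATTRIBUTES = (
    "tax_id", "first_name", "last_name", "second_name",
    "birth_date", "document", "national_id", "phone_number",
)

# Default [Match, Non-match] weights from the configuration table.
WEIGHTS: Dict[str, Tuple[float, float]] = {attr: (0.1, -0.1) for attr in ATTRIBUTES}


def score_pair(
    match_result: Dict[str, str],
    weights: Dict[str, Tuple[float, float]],
) -> float:
    """Sum the Match or Non-match weight of every compared attribute."""
    total = 0.0
    for attr, result in match_result.items():
        match_weight, non_match_weight = weights[attr]
        total += match_weight if result == "Match" else non_match_weight
    return round(total, 10)  # trim floating-point noise


match_result = {
    "tax_id": "Match",
    "first_name": "Non-match",
    "last_name": "Match",
    "second_name": "Match",
    "birth_date": "Non-match",
    "document": "Non-match",
    "national_id": "Match",
    "phone_number": "Match",
}

total = score_pair(match_result, WEIGHTS)
print(total)  # 0.2
```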

Save duplicated record to DB

If the total score is greater than DEDUPLICATION_SCORE, save the pair of duplicate records to the DB.
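
The selection rule can be sketched as a strict greater-than filter over scored pairs. `ScoredPair` and `pairs_to_save` are hypothetical names, and the record identifiers in the usage lines are made up for illustration only.

```python
from typing import Iterable, List, Tuple

ScoredPair = Tuple[str, str, float]  # (record id, record id, total score)


def pairs_to_save(scored_pairs: Iterable[ScoredPair], threshold: float) -> List[ScoredPair]:
    """Keep only pairs whose total score is strictly greater than the threshold."""
    return [pair for pair in scored_pairs if pair[2] > threshold]


# Hypothetical scored pairs; the 0.2 score is the example pair from this page.
scored = [("record-1", "record-2", 0.2), ("record-3", "record-4", 0.9)]
selected = pairs_to_save(scored, threshold=0.8)
```

With DEDUPLICATION_SCORE set to 0.8, the example pair scoring 0.2 is not saved. Because the comparison is strict, a pair scoring exactly 0.8 is not saved either, which is also the highest total that eight attributes can reach at the listed default match weight of 0.1.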

Merge records (IL)


Get unmerged records

Get all records with status == 'NEW'

Apiary spec

Terminate all person declarations

For each record, terminate its declarations.

Merge MPI

Update record status

Set the record status to 'MERGED' and mark the merged record as inactive.
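
The merge steps above can be illustrated with an in-memory sketch. Everything here is an assumption made for illustration: the real process goes through the services described in the Apiary specs, and the data model (`MergeCandidate` with a duplicate `person_id` and a surviving `master_person_id`, the `"active"` and `"terminated"` declaration statuses, the `is_active` flag) is hypothetical. The "Merge MPI" step is not detailed in the source and is omitted.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Declaration:
    person_id: str
    status: str = "active"  # hypothetical status value


@dataclass
class Person:
    id: str
    is_active: bool = True


@dataclass
class MergeCandidate:
    person_id: str         # assumed: the duplicate that becomes inactive
    master_person_id: str  # assumed: the record that survives the merge
    status: str = "NEW"


def merge_new_candidates(
    candidates: List[MergeCandidate],
    persons: Dict[str, Person],
    declarations: List[Declaration],
) -> int:
    """Process candidates with status 'NEW'; return how many were merged."""
    merged = 0
    for candidate in candidates:
        if candidate.status != "NEW":
            continue  # already processed
        # Terminate all declarations of the duplicate person.
        for declaration in declarations:
            if declaration.person_id == candidate.person_id:
                declaration.status = "terminated"
        # Mark the merged (duplicate) person inactive, then close the candidate.
        persons[candidate.person_id].is_active = False
        candidate.status = "MERGED"
        merged += 1
    return merged
```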

Apiary spec