Overall process

Deduplication is the task of finding records in a data set (MPI) that refer to the same Person.

Deduplication also includes the following sub-tasks:

  • Determine master record of duplicated pair
  • Terminate declarations that refer to the merged record
  • Deactivate oldest record
  • Link merged records

Configuration

Process configuration

| Parameter | Description | Value |
|---|---|---|
| DEDUPLICATION_SCHEDULE | (Cron format) How often to perform deduplication | `* 1 * * *` |
| DEDUPLICATION_DEPTH | (days, integer >= 0) Search for candidates among records created in DEDUPLICATION_DEPTH days. If DEDUPLICATION_DEPTH == 0, analyze all records | 1 |
| DEDUPLICATION_SCORE | (number) A record pair with a score greater than DEDUPLICATION_SCORE must be merged | 0.8 |

Person attributes weight

Weight value format - array of numbers, [Match, Non-match]

| Parameter | Description | Value |
|---|---|---|
| DEDUPLICATION_TAX_ID | (array of numbers) Person.tax_id | [0.1, -0.1] |
| DEDUPLICATION_FIRST_NAME | (array of numbers) Person.first_name | [0.1, -0.1] |
| DEDUPLICATION_LAST_NAME | (array of numbers) Person.last_name | [0.1, -0.1] |
| DEDUPLICATION_SECOND_NAME | (array of numbers) Person.second_name | [0.1, -0.1] |
| DEDUPLICATION_BIRTH_DATE | (array of numbers) Person.birth_date | [0.1, -0.1] |
| DEDUPLICATION_DOCUMENT | (array of numbers) Person.documents | [0.1, -0.1] |
| DEDUPLICATION_NATIONAL_ID | (array of numbers) Person.national_id | [0.1, -0.1] |
| DEDUPLICATION_PHONE_NUMBER | (array of numbers) Person.phones | [0.1, -0.1] |


Specification

The deduplication process starts on DEDUPLICATION_SCHEDULE.

Get candidates list

  1. Read the DEDUPLICATION_DEPTH parameter value. If DEDUPLICATION_DEPTH == 0, analyze the whole period
  2. Fetch records from MPI.person created within the last DEDUPLICATION_DEPTH days
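
A minimal sketch of the candidate selection follows. It assumes that each person record carries a creation timestamp and that the depth means "within the last DEDUPLICATION_DEPTH days". The field name `inserted_at`, the function names, and the in-memory list are hypothetical; the source does not specify the MPI.person schema or the query used.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def candidate_cutoff(depth_days: int, now: datetime) -> Optional[datetime]:
    """Return the earliest creation time to include, or None for no limit."""
    if depth_days < 0:
        raise ValueError("DEDUPLICATION_DEPTH must be >= 0")
    if depth_days == 0:
        return None  # DEDUPLICATION_DEPTH == 0: analyze the whole period
    return now - timedelta(days=depth_days)


def select_candidates(records: list, depth_days: int, now: datetime) -> list:
    """Keep records created at or after the cutoff (hypothetical field: inserted_at)."""
    cutoff = candidate_cutoff(depth_days, now)
    return [r for r in records if cutoff is None or r["inserted_at"] >= cutoff]
```

In a real deployment the filter would more likely be a `WHERE` clause on the person table than a Python list comprehension; the sketch only shows the cutoff logic.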

Get record from list

Get Data Set

  1. Fetch records (Data Set) from MPI.person to perform matching
  2. For each record in the Data Set, start the matching procedure

Match records

The matching procedure is performed on a pair of records (candidate, Data Set.record) and compares their corresponding attributes with each other using exact match

  1. Read comparison attributes from configuration (see Person attributes weight)
  2. Compare each attribute of a pair of records

...

| Attribute | Result |
|---|---|
| tax_id | Match |
| first_name | Non-match |
| last_name | Match |
| second_name | Match |
| birth_date | Non-match |
| document | Non-match |
| national_id | Match |
| phone_number | Match |
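
The comparison step could be sketched as below. This is an illustration under assumptions, not the actual implementation: records are modelled as plain dictionaries keyed by the attribute names used in the table, and the function name `compare_records` is hypothetical. The source does not say how missing values are handled; this sketch treats two missing values as equal.

```python
# Attribute names as they appear in the comparison table.
ATTRIBUTES = [
    "tax_id", "first_name", "last_name", "second_name",
    "birth_date", "document", "national_id", "phone_number",
]


def compare_records(candidate: dict, record: dict) -> dict:
    """Exact-match comparison: map each attribute to True (Match) or False (Non-match)."""
    # Assumption: an attribute absent on both sides compares as equal (None == None).
    return {attr: candidate.get(attr) == record.get(attr) for attr in ATTRIBUTES}
```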

Score result

  1. Score match result using Person attribute weight and calculate total

Example:

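| Attribute | Result | Score |
|---|---|---|
| tax_id | Match | 0.1 |
| first_name | Non-match | -0.1 |
| last_name | Match | 0.1 |
| second_name | Match | 0.1 |
| birth_date | Non-match | -0.1 |
| document | Non-match | -0.1 |
| national_id | Match | 0.1 |
| phone_number | Match | 0.1 |
| Total: | | 0.2 |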
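
The scoring step can be sketched as follows, using the sample [Match, Non-match] weights from the configuration and the match results from the example. The names `WEIGHTS` and `score_pair` are hypothetical, and rounding the sum is a choice of this sketch to avoid floating-point noise, not something the source prescribes.

```python
# Sample weights per attribute: (weight on Match, weight on Non-match).
WEIGHTS = {
    "tax_id": (0.1, -0.1),
    "first_name": (0.1, -0.1),
    "last_name": (0.1, -0.1),
    "second_name": (0.1, -0.1),
    "birth_date": (0.1, -0.1),
    "document": (0.1, -0.1),
    "national_id": (0.1, -0.1),
    "phone_number": (0.1, -0.1),
}


def score_pair(results: dict) -> float:
    """Sum the Match or Non-match weight of every compared attribute."""
    total = sum(
        WEIGHTS[attr][0] if matched else WEIGHTS[attr][1]
        for attr, matched in results.items()
    )
    return round(total, 10)  # rounding is an assumption of this sketch


# Match results from the example: five matches, three non-matches.
example_results = {
    "tax_id": True, "first_name": False, "last_name": True, "second_name": True,
    "birth_date": False, "document": False, "national_id": True, "phone_number": True,
}
total = score_pair(example_results)  # 5 * 0.1 + 3 * (-0.1) = 0.2
```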

Save duplicated record to DB

If the total score is greater than DEDUPLICATION_SCORE, save the duplicated records to the DB
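
A minimal sketch of the threshold check, assuming the sample DEDUPLICATION_SCORE of 0.8 and a strict "greater than" comparison as worded above. The function name is hypothetical, and the persistence step itself is omitted because the source does not describe the DB schema.

```python
DEDUPLICATION_SCORE = 0.8  # sample value from the process configuration


def should_save_as_duplicate(total_score: float,
                             threshold: float = DEDUPLICATION_SCORE) -> bool:
    """Return True when the pair's total score strictly exceeds the threshold."""
    # Strict comparison: a total equal to the threshold does not qualify.
    return total_score > threshold
```

For the example above, a total of 0.2 does not exceed 0.8, so that pair would not be saved. Note that with the sample weights (eight attributes at 0.1 each) the largest possible total is 0.8, which also does not pass a strict comparison against the sample threshold.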

Merge records (IL)

Get unmerged records

Get all records with status == 'NEW'

Apiary spec

Terminate all person declarations

For each record, terminate its declarations

Merge MPI

...

Apiary spec

Merge candidate

Update the record status to 'MERGED' and set the merged record to inactive

...