Overall process
Deduplication is the task of finding records in a data set (MPI) that refer to the same Person.
Deduplication also includes the following sub-tasks:
- Determine master record of duplicated pair
- Terminate declarations that refer to the merged record
- Deactivate oldest record
- Link merged records
Configuration
Process configuration
Parameter | Description | Value |
---|---|---|
DEDUPLICATION_SCHEDULE | (Cron Format) How often to perform deduplication | `* 1 * * *` |
DEDUPLICATION_DEPTH | (days, integer >=0) Search for candidates among records created in the last DEDUPLICATION_DEPTH days | 1 |
DEDUPLICATION_SCORE | (number) A record pair with a score greater than DEDUPLICATION_SCORE must be merged | 0.8 |
Person attributes weight
Weight value format - array of numbers, [Match, Non-match]
Parameter | Description | Value |
---|---|---|
DEDUPLICATION_TAX_ID | (array of numbers) Person.tax_id | [0.1, -0.1] |
DEDUPLICATION_FIRST_NAME | (array of numbers) Person.first_name | [0.1, -0.1] |
DEDUPLICATION_LAST_NAME | (array of numbers) Person.last_name | [0.1, -0.1] |
DEDUPLICATION_SECOND_NAME | (array of numbers) Person.second_name | [0.1, -0.1] |
DEDUPLICATION_BIRTH_DATE | (array of numbers) Person.birth_date | [0.1, -0.1] |
DEDUPLICATION_DOCUMENT | (array of numbers) Person.documents | [0.1, -0.1] |
DEDUPLICATION_NATIONAL_ID | (array of numbers) Person.national_id | [0.1, -0.1] |
DEDUPLICATION_PHONE_NUMBER | (array of numbers) Person.phones | [0.1, -0.1] |
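
As an illustration of how these parameters might be read, here is a minimal sketch. It assumes the parameters arrive as a mapping of strings (for example environment variables) and that weight arrays are JSON-encoded; neither is stated in this specification, and `load_config` is a hypothetical helper. The defaults are copied from the tables above.

```python
import json
import os

# Attribute weight parameters listed in the table above.
WEIGHT_PARAMS = [
    "DEDUPLICATION_TAX_ID",
    "DEDUPLICATION_FIRST_NAME",
    "DEDUPLICATION_LAST_NAME",
    "DEDUPLICATION_SECOND_NAME",
    "DEDUPLICATION_BIRTH_DATE",
    "DEDUPLICATION_DOCUMENT",
    "DEDUPLICATION_NATIONAL_ID",
    "DEDUPLICATION_PHONE_NUMBER",
]


def load_config(env=None):
    """Parse deduplication parameters from a string mapping (assumed source)."""
    env = os.environ if env is None else env

    depth = int(env.get("DEDUPLICATION_DEPTH", "1"))
    if depth < 0:
        raise ValueError("DEDUPLICATION_DEPTH must be an integer >= 0")

    weights = {}
    for name in WEIGHT_PARAMS:
        # Each weight is a two-element array: [Match, Non-match].
        match, non_match = json.loads(env.get(name, "[0.1, -0.1]"))
        weights[name] = (float(match), float(non_match))

    return {
        "schedule": env.get("DEDUPLICATION_SCHEDULE", "* 1 * * *"),
        "depth": depth,
        "score": float(env.get("DEDUPLICATION_SCORE", "0.8")),
        "weights": weights,
    }
```

Passing an explicit mapping instead of relying on `os.environ` keeps the parser easy to exercise in tests.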
Specification
Get candidates list
- Read the DEDUPLICATION_DEPTH parameter value. If DEDUPLICATION_DEPTH == 0, analyze the whole period
- Fetch records from MPI.person created within the last DEDUPLICATION_DEPTH days
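
The candidate window can be sketched as follows. This is an in-memory illustration only: the real selection would be a query against MPI.person, and the `inserted_at` field name is an assumption, not taken from this specification.

```python
from datetime import datetime, timedelta, timezone


def candidate_cutoff(depth_days, now=None):
    """Return the earliest creation time for candidates, or None for the whole period."""
    if depth_days < 0:
        raise ValueError("DEDUPLICATION_DEPTH must be >= 0")
    if depth_days == 0:
        return None  # 0 means: analyze the whole period
    now = now or datetime.now(timezone.utc)
    return now - timedelta(days=depth_days)


def select_candidates(records, depth_days, now=None):
    """Keep records created within the last depth_days days.

    `inserted_at` is a hypothetical name for the creation timestamp.
    """
    cutoff = candidate_cutoff(depth_days, now)
    if cutoff is None:
        return list(records)
    return [r for r in records if r["inserted_at"] >= cutoff]
```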
Get record from list
Get Data Set
- Fetch records (Data Set) from MPI.person to perform matching
- For each record in the Data Set, init the matching procedure
Match records
The matching procedure is performed on a pair of records (candidate, Data Set.record) and compares the corresponding attributes with each other using exact match
- Read comparison attributes from configuration (see Person attributes weight)
- Compare each attribute of a pair of records
Example:
Record | tax_id | first_name | last_name | second_name | birth_date | document | national_id | phone_number |
---|---|---|---|---|---|---|---|---|
Record1 | 3087232628 | Петро | Бондар | Миколайович | 12.06.1993 | ВВ123456 | РП-765123 | [+380501234567, +380507654321] |
Record2 | 3087232628 | Педро | Бондар | Миколайович | 13.06.1993 | ВВ654321 | РП-765123 | [+380501234567] |
Match result:
Attribute | Result |
---|---|
tax_id | Match |
first_name | Non-match |
last_name | Match |
second_name | Match |
birth_date | Non-match |
document | Non-match |
national_id | Match |
phone_number | Match |
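
The comparison above can be sketched in a few lines. One assumption is labeled in the code: a list-valued attribute (such as phones) is treated as a match when the two records share at least one value. This is an interpretation that reproduces the `phone_number` row of the example, not a rule stated in this specification. Missing or empty values are not handled here.

```python
ATTRIBUTES = [
    "tax_id", "first_name", "last_name", "second_name",
    "birth_date", "document", "national_id", "phone_number",
]


def _as_set(value):
    # Scalars become one-element sets so both kinds compare the same way.
    if isinstance(value, (list, tuple, set)):
        return set(value)
    return {value}


def match_records(candidate, record, attributes=ATTRIBUTES):
    """Exact-match comparison of each attribute of a record pair.

    Assumption: list values match if they share at least one element.
    """
    return {
        attr: "Match" if _as_set(candidate[attr]) & _as_set(record[attr]) else "Non-match"
        for attr in attributes
    }


# The two records from the example table.
record1 = {
    "tax_id": "3087232628", "first_name": "Петро", "last_name": "Бондар",
    "second_name": "Миколайович", "birth_date": "12.06.1993",
    "document": "ВВ123456", "national_id": "РП-765123",
    "phone_number": ["+380501234567", "+380507654321"],
}
record2 = {
    "tax_id": "3087232628", "first_name": "Педро", "last_name": "Бондар",
    "second_name": "Миколайович", "birth_date": "13.06.1993",
    "document": "ВВ654321", "national_id": "РП-765123",
    "phone_number": ["+380501234567"],
}
result = match_records(record1, record2)
```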
Score result
- Score the match result using Person attributes weight and calculate the total
Example:
Attribute | Result | Score |
---|---|---|
tax_id | Match | 0.1 |
first_name | Non-match | -0.1 |
last_name | Match | 0.1 |
second_name | Match | 0.1 |
birth_date | Non-match | -0.1 |
document | Non-match | -0.1 |
national_id | Match | 0.1 |
phone_number | Match | 0.1 |
Total | | 0.2 |
Save duplicated record to DB
If the total score is greater than DEDUPLICATION_SCORE, save the duplicated records to the DB
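
The scoring and threshold steps above can be sketched as follows, using the default weights and the match result from the example. The function names `score_pair` and `is_duplicate` are hypothetical.

```python
# Default weights from the configuration table: (Match, Non-match).
WEIGHTS = {
    attr: (0.1, -0.1)
    for attr in [
        "tax_id", "first_name", "last_name", "second_name",
        "birth_date", "document", "national_id", "phone_number",
    ]
}

# Match result from the example above.
MATCH_RESULT = {
    "tax_id": "Match",
    "first_name": "Non-match",
    "last_name": "Match",
    "second_name": "Match",
    "birth_date": "Non-match",
    "document": "Non-match",
    "national_id": "Match",
    "phone_number": "Match",
}


def score_pair(match_result, weights):
    """Return per-attribute scores and their total."""
    scores = {}
    for attr, result in match_result.items():
        match_weight, non_match_weight = weights[attr]
        scores[attr] = match_weight if result == "Match" else non_match_weight
    return scores, sum(scores.values())


def is_duplicate(total, threshold=0.8):
    """A pair is saved as duplicated only if the total exceeds DEDUPLICATION_SCORE."""
    return total > threshold


scores, total = score_pair(MATCH_RESULT, WEIGHTS)
save_pair = is_duplicate(total)  # False for the example: about 0.2 is below 0.8
```

With the default values, eight matching attributes give a total of at most 0.8, which is not strictly greater than the default DEDUPLICATION_SCORE of 0.8, so the weights or the threshold presumably need tuning before any pair qualifies.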
Merge records (IL)
Get unmerged records
Get all records with status == 'NEW'
Terminate all person declarations
For each record, terminate its declarations
Merge candidate
Update the record status: set status to 'MERGED' and mark the merged record as inactive
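
The merge steps can be illustrated with an in-memory sketch. Everything about the data shapes is an assumption: the `person_id`, `is_active` and declaration `status` fields, and the `'active'` and `'terminated'` values, are hypothetical names chosen for the example. Only the `'NEW'` and `'MERGED'` statuses come from this specification.

```python
def merge_records(merge_candidates, persons, declarations):
    """Process unmerged duplicate records and return the ids that were merged.

    merge_candidates: list of dicts with 'person_id' and 'status' (hypothetical shape)
    persons: dict person_id -> dict with 'is_active' (hypothetical shape)
    declarations: list of dicts with 'person_id' and 'status' (hypothetical shape)
    """
    merged_ids = []
    for candidate in merge_candidates:
        # Step 1: only unmerged records are processed.
        if candidate["status"] != "NEW":
            continue
        person_id = candidate["person_id"]

        # Step 2: terminate all declarations of the record being merged.
        for declaration in declarations:
            if declaration["person_id"] == person_id and declaration["status"] == "active":
                declaration["status"] = "terminated"

        # Step 3: mark the candidate as merged and deactivate the person record.
        candidate["status"] = "MERGED"
        persons[person_id]["is_active"] = False
        merged_ids.append(person_id)
    return merged_ids
```

A production version would presumably run each candidate's three steps in a single transaction so that a failure cannot leave declarations terminated for a record that is still active.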