HTAC Pipeline · Step 3 of 7
Cross-Site Deduplication
Patients who receive care at multiple health systems generate identical PPRL tokens at each site. This step collapses those duplicate records into a single entry in the DeduplicatedRoster — one row per unique person across the entire federated network.
Why this step?
Without deduplication, a patient seen at both Allina and M Health Fairview would be counted twice in every prevalence calculation — artificially inflating the denominator and distorting rates. In many multi-site deployments, a large share of patients appear at more than one health system.
How canonical site is chosen
The canonical site is the health system that recorded the patient's most recent VisitOccurrence, because that site is the most likely to hold the most complete and up-to-date record. When no visits exist, the first site encountered is used as the canonical site.
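The selection rule above can be sketched in a few lines of Python. This is an illustrative sketch, not the deployed routine: `SiteRecord` and `choose_canonical_site` are hypothetical names, and the tie-breaking behaviour (the earliest-encountered site wins when two sites share the same latest visit date) is an assumption the text does not specify.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class SiteRecord:
    """One site's view of a patient token (hypothetical shape)."""
    site: str
    last_visit: Optional[date]  # None when the site has no VisitOccurrence


def choose_canonical_site(records: List[SiteRecord]) -> str:
    """Return the site with the most recent visit, else the first site seen."""
    if not records:
        raise ValueError("at least one site record is required")
    visited = [r for r in records if r.last_visit is not None]
    if not visited:
        # No visits exist: fall back to the first site encountered.
        return records[0].site
    # max() keeps the first maximal element, so ties follow encounter order
    # (an assumption; the source does not define tie-breaking).
    return max(visited, key=lambda r: r.last_visit).site
```

For example, a patient last seen at Allina in 2023 and at M Health Fairview in 2024 would get M Health Fairview as the canonical site.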
Idempotency
The deduplication routine keys on canonical_token. Re-running the job refreshes existing roster rows instead of inserting duplicates, so it can be repeated safely, including after new tokens arrive.
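One common way to get this behaviour is an upsert keyed on canonical_token. The sketch below uses an in-memory SQLite table purely for illustration; the table name `deduplicated_roster`, the helper `upsert_roster_row`, and the choice of SQLite are assumptions, and the actual storage layer may differ. It needs SQLite 3.24 or later for `ON CONFLICT ... DO UPDATE`.

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE deduplicated_roster (
        canonical_token TEXT PRIMARY KEY,
        canonical_site  TEXT NOT NULL,
        site_count      INTEGER NOT NULL,
        roster_version  TEXT NOT NULL
    )
    """
)


def upsert_roster_row(conn, token, site, site_count, version):
    """Insert a roster row, or refresh the existing row for the same token."""
    conn.execute(
        """
        INSERT INTO deduplicated_roster
            (canonical_token, canonical_site, site_count, roster_version)
        VALUES (?, ?, ?, ?)
        ON CONFLICT(canonical_token) DO UPDATE SET
            canonical_site = excluded.canonical_site,
            site_count     = excluded.site_count,
            roster_version = excluded.roster_version
        """,
        (token, site, site_count, version.isoformat()),
    )


token = "a" * 64  # placeholder value, not a real token
upsert_roster_row(conn, token, "Allina", 1, date(2024, 1, 1))
# A later run sees the same token at a second site and refreshes the row.
upsert_roster_row(conn, token, "M Health Fairview", 2, date(2024, 2, 1))
rows = conn.execute("SELECT * FROM deduplicated_roster").fetchall()
```

After both calls the table still holds a single row for the token, carrying the values from the second run.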
How Token Matching Works
The routine groups hash tokens that share the same value, counts contributing sites, selects the canonical site from the most recent visit record, and writes one DeduplicatedRoster row per unique token. Technical specifications ship with the deployment for independent verification.
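The four steps in that description (group, count sites, pick the canonical site, emit one row) might look like the following. This is a minimal sketch under stated assumptions: `TokenRecord` and `deduplicate` are hypothetical names, the input is assumed to be one record per token per site, and the tokens in the usage note are placeholders rather than real hashes.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, List, NamedTuple, Optional


class TokenRecord(NamedTuple):
    """A token as reported by one site (hypothetical input shape)."""
    token: str
    site: str
    last_visit: Optional[date]  # None when the site has no visits


def deduplicate(records: List[TokenRecord], run_date: date) -> List[dict]:
    """Collapse per-site token records into one roster row per token."""
    # 1. Group records that share the same token value.
    groups: Dict[str, List[TokenRecord]] = defaultdict(list)
    for rec in records:
        groups[rec.token].append(rec)

    roster = []
    for token, recs in groups.items():
        # 2. Count the distinct contributing sites.
        sites = {r.site for r in recs}
        # 3. Canonical site: most recent visit, else first site encountered.
        visited = [r for r in recs if r.last_visit is not None]
        if visited:
            canonical = max(visited, key=lambda r: r.last_visit).site
        else:
            canonical = recs[0].site
        # 4. One roster row per unique token.
        roster.append({
            "canonical_token": token,
            "canonical_site": canonical,
            "site_count": len(sites),
            "roster_version": run_date,
        })
    return roster
```

Passing two records for a placeholder token `"tok-A"` (one from Allina, one from M Health Fairview) and one record for `"tok-B"` yields two roster rows, the first with a site_count of 2.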
Output: DeduplicatedRoster Record
One row per unique patient across the network. Enrichment flag fields (Medicaid, housing, etc.) are populated in Step 4 and remain at defaults here.
| Field | Type | Description |
|---|---|---|
| canonical_token | str(64) | The HMAC-SHA256 token: the unique identifier for this patient across the network |
| canonical_site | FK → HealthSystem | Site with the most recent VisitOccurrence for this patient |
| site_count | int | Number of distinct health systems where this patient appears |
| roster_version | date | Date this row was last generated (today's date at run time) |
| medicaid_flag & friends | bool | Enrichment flags, all False until Step 4 runs |
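As a rough illustration of the record shape, the fields above could be modelled as a Python dataclass. The class name `DeduplicatedRosterRow` is hypothetical, the foreign key is represented as a plain site name, and only `medicaid_flag` is shown because the other enrichment flag names are not listed here. The length check assumes the token is stored as a hex-encoded HMAC-SHA256 digest, which is always 64 characters.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DeduplicatedRosterRow:
    """Illustrative model of one DeduplicatedRoster record."""
    canonical_token: str   # assumed to be a 64-character hex digest
    canonical_site: str    # stands in for the FK to HealthSystem
    site_count: int
    roster_version: date
    medicaid_flag: bool = False  # default until Step 4 populates it

    def __post_init__(self):
        if len(self.canonical_token) != 64:
            raise ValueError("canonical_token must be 64 characters")
        if self.site_count < 1:
            raise ValueError("site_count must be at least 1")


row = DeduplicatedRosterRow(
    canonical_token="0" * 64,  # placeholder value, not a real token
    canonical_site="Allina",
    site_count=1,
    roster_version=date(2024, 6, 1),
)
```

A freshly built row leaves `medicaid_flag` at False, matching the default state described for this step.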
Simulation Roster Statistics
Site-count distribution
| Sites per patient | Patient count |
|---|---|
| 1 | 680 |
| 2 | 320 |