HTAC Pipeline · Step 3 of 7

Cross-Site Deduplication

Patients who receive care at multiple health systems generate identical PPRL (privacy-preserving record linkage) tokens at each site. This step collapses those duplicate records into a single entry in the DeduplicatedRoster: one row per unique person across the entire federated network.

Why this step?

Without deduplication, a patient seen at both Allina and M Health Fairview would be counted twice in every prevalence calculation, artificially inflating the denominator and distorting rates. The overlap can be substantial: in the simulation run summarized under Simulation Roster Statistics, 320 of 1,000 unique patients (32%) appear at two health systems.

How canonical site is chosen

The canonical site is the health system that recorded the patient's most recent VisitOccurrence, because that site is the one most likely to hold the most complete and up-to-date record. When the patient has no recorded visits, the first site encountered is used as the canonical site.
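
The selection rule can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline's actual code: the function name `choose_canonical_site`, the input shapes (site names in encounter order, plus `(site, visit_date)` pairs), and the tie-breaking behavior for two visits on the same date are all assumptions, since the page does not specify them.

```python
from datetime import date
from typing import List, Optional, Tuple


def choose_canonical_site(
    sites_in_order: List[str],
    visits: List[Tuple[str, date]],
) -> Optional[str]:
    """Pick the site with the most recent visit; fall back to the first site seen.

    Hypothetical helper: the name and input shapes are assumptions.
    """
    if visits:
        # max() by visit date; on a tie, the earliest entry in the list wins.
        latest_site, _ = max(visits, key=lambda v: v[1])
        return latest_site
    # No visits recorded: use the first site encountered, if there is one.
    return sites_in_order[0] if sites_in_order else None


# Illustrative dates only.
visits = [("ALLINA", date(2025, 3, 1)), ("MHFAIRVIEW", date(2025, 11, 20))]
with_visits = choose_canonical_site(["ALLINA", "MHFAIRVIEW"], visits)
without_visits = choose_canonical_site(["ALLINA", "MHFAIRVIEW"], [])
```

With the sample visits, `with_visits` is `"MHFAIRVIEW"`; with no visits at all, `without_visits` falls back to `"ALLINA"`, the first site in the list.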

Idempotency

The deduplication routine keys on canonical_token. Re-running with the same inputs refreshes existing roster rows instead of inserting duplicates, so the job can be repeated safely after new tokens arrive.
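
One common way to get this behavior is an upsert keyed on the token. The sketch below uses SQLite's `INSERT ... ON CONFLICT ... DO UPDATE` through Python's standard `sqlite3` module. The table name `deduplicated_roster`, the helper `upsert_roster_row`, the text column types, and the sample values are assumptions for illustration; the page does not say which database or write path the pipeline uses.

```python
import sqlite3


def upsert_roster_row(conn, canonical_token, canonical_site, site_count, roster_version):
    """Insert a roster row, or refresh the existing row for the same token."""
    conn.execute(
        """
        INSERT INTO deduplicated_roster
            (canonical_token, canonical_site, site_count, roster_version)
        VALUES (?, ?, ?, ?)
        ON CONFLICT(canonical_token) DO UPDATE SET
            canonical_site = excluded.canonical_site,
            site_count     = excluded.site_count,
            roster_version = excluded.roster_version
        """,
        (canonical_token, canonical_site, site_count, roster_version),
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE deduplicated_roster (
        canonical_token TEXT PRIMARY KEY,  -- the idempotency key
        canonical_site  TEXT,
        site_count      INTEGER,
        roster_version  TEXT
    )
    """
)

# Writing the same row twice leaves exactly one row for the token.
for _ in range(2):
    upsert_roster_row(conn, "token-a", "ALLINA", 2, "2026-05-14")
row_count = conn.execute("SELECT COUNT(*) FROM deduplicated_roster").fetchone()[0]
```

Because the primary key is `canonical_token`, a second run with changed inputs updates the existing row in place instead of adding a new one.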

How Token Matching Works

[Diagram: ALLINA (token: 3a7f8c2d…) + MHFAIRVIEW (token: 3a7f8c2d…) → same token = same patient; canonical_site = most recent visit → DeduplicatedRoster: 1 row, site_count = 2]

The routine groups hash tokens that share the same value, counts contributing sites, selects the canonical site from the most recent visit record, and writes one DeduplicatedRoster row per unique token. Technical specifications ship with the deployment for independent verification.
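
The four operations named above (group, count sites, select canonical site, write one row) might look like the following. This is a sketch under stated assumptions, not the shipped routine: `TokenRecord`, `build_roster`, the in-memory list input, and the placeholder token strings are all hypothetical.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, List, NamedTuple, Optional


class TokenRecord(NamedTuple):
    """One site-level token record (hypothetical shape)."""
    token: str
    site: str
    last_visit: Optional[date]  # None when the site holds no visit for this token


def build_roster(records: List[TokenRecord], run_date: date) -> List[dict]:
    # 1. Group site-level records that share the same token value.
    by_token: Dict[str, List[TokenRecord]] = defaultdict(list)
    for rec in records:
        by_token[rec.token].append(rec)

    roster = []
    for token, group in by_token.items():
        # 2. Count distinct contributing sites, keeping encounter order.
        sites = list(dict.fromkeys(r.site for r in group))
        # 3. Canonical site: most recent visit, else the first site encountered.
        visited = [r for r in group if r.last_visit is not None]
        if visited:
            canonical = max(visited, key=lambda r: r.last_visit).site
        else:
            canonical = sites[0]
        # 4. Emit one roster row per unique token.
        roster.append({
            "canonical_token": token,
            "canonical_site": canonical,
            "site_count": len(sites),
            "roster_version": run_date,
        })
    return roster


records = [
    TokenRecord("tok-1", "ALLINA", date(2025, 3, 1)),
    TokenRecord("tok-1", "MHFAIRVIEW", date(2025, 11, 20)),
    TokenRecord("tok-2", "ALLINA", None),
]
roster = build_roster(records, run_date=date(2026, 5, 14))
```

Three site-level records collapse to two roster rows here: `tok-1` with `site_count` 2 and canonical site `MHFAIRVIEW`, and `tok-2` with `site_count` 1 and canonical site `ALLINA`.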

Output: DeduplicatedRoster Record

One row per unique patient across the network. Enrichment flag fields (Medicaid, housing, etc.) are populated in Step 4 and remain at defaults here.

| Field | Type | Description |
|---|---|---|
| canonical_token | str(64) | The HMAC-SHA256 token; unique identifier for this patient across the network |
| canonical_site | FK → HealthSystem | Site with the most recent VisitOccurrence for this patient |
| site_count | int | Number of distinct health systems where this patient appears |
| roster_version | date | Date this row was last generated (today's date at run time) |
| medicaid_flag & friends | bool | Enrichment flags; all False until Step 4 runs |
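
For readers who prefer code to tables, the record can be pictured as a small Python dataclass. This is only an illustrative model: the class name `DeduplicatedRosterRow` is hypothetical, the foreign key is represented by a plain site name, the exact-length check assumes the token is stored as a 64-character hex digest, and only `medicaid_flag` is shown because the other flag names are not listed on this page.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DeduplicatedRosterRow:
    canonical_token: str         # assumed: 64-character hex HMAC-SHA256 digest
    canonical_site: str          # stands in for the FK to HealthSystem
    site_count: int              # distinct health systems for this patient
    roster_version: date         # date the row was last generated
    medicaid_flag: bool = False  # enrichment flag; stays False until Step 4

    def __post_init__(self) -> None:
        # Basic sanity checks; these rules are assumptions, not documented ones.
        if len(self.canonical_token) != 64:
            raise ValueError("canonical_token must be 64 characters")
        if self.site_count < 1:
            raise ValueError("site_count must be at least 1")


# Placeholder token of the right length, not a real HMAC value.
row = DeduplicatedRosterRow("a" * 64, "ALLINA", 2, date(2026, 5, 14))
```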

Simulation Roster Statistics

Unique patients in roster: 1,000
Patients at 2+ sites: 320
Single-site patients: 680
Max sites for one patient: 2
Avg sites per patient: 1.32

Site-count distribution

| Sites per patient | Patient count |
|---|---|
| 1 | 680 |
| 2 | 320 |
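
The summary statistics follow directly from this distribution, which the short check below reproduces. The variable names are illustrative; the counts are the ones reported on this page.

```python
# Site-count distribution from the simulation run: {sites per patient: patient count}.
distribution = {1: 680, 2: 320}

unique_patients = sum(distribution.values())                      # 680 + 320
site_level_records = sum(k * n for k, n in distribution.items())  # 1*680 + 2*320
avg_sites = site_level_records / unique_patients
multi_site = sum(n for k, n in distribution.items() if k >= 2)
max_sites = max(distribution)

print(unique_patients, site_level_records, round(avg_sites, 2), multi_site, max_sites)
# prints: 1000 1320 1.32 320 2
```

The 1,320 site-level records are what a count would report without deduplication, 32% more than the 1,000 unique patients in the roster.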
Latest pipeline demonstration (completed May 14, 2026 20:05). Counts on this page reflect synthetic federated data from that run.