HTAC · Live demonstration

Federated Pipeline Demonstration

A live run of the HTAC data pipeline using synthetic patient records distributed across the 11 Minnesota Electronic Health Record Consortium (MNEHRC) health systems. The seven steps execute in sequence: clinical data loading, privacy-preserving tokenization, cross-site deduplication, administrative data enrichment, federated cohort queries, stratification and suppression, and publication of results.

For program and policy staff: watch what happens to patient data at each step and why.

For technical implementers: the step detail panels show the architectural decisions behind each phase.

Running again will clear the previous demonstration data and replace it with a new synthetic run.

All records are synthetic. No real patient data is used in this demonstration. Health system names reflect actual MNEHRC consortium members; patient data is algorithmically generated for demonstration purposes only.

Pipeline steps

1 Clinical Data in OMOP CDM ✓ Complete · 4.68s

Total patient records loaded

1320 across 11 sites

Clinical records standardized across 11 health systems

Each of the 11 participating health systems maintains its own OMOP CDM database locally. Query scripts distributed by the coordinating center run against each site's local database and return only aggregate counts — no patient records leave any site. This demonstration loaded 1320 synthetic patient records distributed across sites in proportion to each system's actual patient volume.

Technical detail

Records loaded into OMOP CDM v5.4 tables: Person, VisitOccurrence, ConditionOccurrence, DrugExposure, Measurement. Vocabularies: SNOMED CT (conditions), RxNorm (drugs), LOINC (measurements). Each site's OMOP instance uses Epic-sourced concept mappings with site-specific normalization applied prior to loading.

Health system	visits	persons	conditions	measurements	drug_exposures
Allina Health	364	176	94	62	94
HealthPartners	341	170	101	59	101
M Health Fairview	356	169	95	56	95
Mayo Clinic	316	150	89	57	89
Essentia Health	258	135	68	44	68
Hennepin Healthcare	165	89	55	37	55
Sanford Health	208	106	59	43	59
CentraCare	148	86	41	26	41
Children's Minnesota	77	77	8	0	8
North Memorial Health	101	72	26	20	26
Minneapolis VA Health Care System	140	90	45	26	45

2 PPRL Token Generation ✓ Complete · 4.34s

Patients appearing at 2+ sites

320 (24.2% of total)

Patient identities converted to one-way cryptographic tokens

Before any cross-site matching can occur, each health system independently hashes six PII fields — first name, last name, date of birth, sex, phone number, and ZIP code — into a 64-character HMAC-SHA256 token using a shared salt known only to consortium members. The original PII fields are immediately discarded. Only the token is transmitted to the coordinating center. 320 tokens were generated by more than one site — these represent patients who received care at multiple health systems and would be double-counted without deduplication.

Technical detail

Algorithm: HMAC-SHA256(key=shared_salt, msg=normalized_preimage). Normalization: lowercase folding for name fields, YYYYMMDD for date of birth, digits-only for phone, 5-character ZIP. Validated at 97% precision and 75% recall in multi-site deployments (OneFlorida). Tokens are consortium-specific — a token from this network cannot be linked to tokens from a different network without the shared salt.
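The tokenization described above can be sketched in a few lines of Python. This is a minimal illustration, not the consortium's implementation: the function names, the field order, the `|` separator, and the pass-through handling of the sex field are assumptions, since the text specifies only the normalization rules for names, date of birth, phone, and ZIP.

```python
import hashlib
import hmac
import re


def normalize_preimage(first, last, dob, sex, phone, zip_code):
    """Build the preimage from six PII fields.

    Field order and the '|' separator are assumptions for illustration.
    """
    parts = [
        first.strip().lower(),       # lowercase folding for name fields
        last.strip().lower(),
        re.sub(r"\D", "", dob),      # assumes year-first input, e.g. 1980-04-07 -> 19800407
        sex.strip(),                 # normalization of sex is not specified in the text
        re.sub(r"\D", "", phone),    # digits only
        zip_code.strip()[:5],        # 5-character ZIP
    ]
    return "|".join(parts)


def make_token(shared_salt: bytes, first, last, dob, sex, phone, zip_code) -> str:
    """Return a 64-character hex HMAC-SHA256 token keyed by the shared salt."""
    preimage = normalize_preimage(first, last, dob, sex, phone, zip_code)
    return hmac.new(shared_salt, preimage.encode("utf-8"), hashlib.sha256).hexdigest()
```

Because normalization runs before hashing, two sites that record the same patient with different formatting (for example `(612) 555-0100` versus `612.555.0100`) still produce the same token, which is what makes exact-match deduplication possible downstream.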

Health system	tokens_generated
Allina Health	176
HealthPartners	170
M Health Fairview	169
Mayo Clinic	150
Essentia Health	135
Hennepin Healthcare	89
Sanford Health	106
CentraCare	86
Children's Minnesota	77
North Memorial Health	72
Minneapolis VA Health Care System	90

3 Deduplication ✓ Complete · 9.67s

Unique individuals in statewide roster

1000 (down from 1320 raw records)

Statewide patient roster deduplicated — 320 duplicate records resolved

The coordinating center receives tokens from all 11 sites. Because the same patient produces the same token at every site where they were seen, duplicate records are identified by matching tokens. The 1320 raw patient records reduce to 1000 unique individuals after deduplication. Without this step, the 320 duplicate records would inflate prevalence denominators and distort the resulting rates.

Technical detail

Deduplication algorithm: exact token match. A patient roster row is created for each unique token, with site attribution assigned to the site holding the most recent visit data. The 24.2% overlap rate observed in this demonstration is consistent with the published MNEHRC figure of 75% of patients having records at more than one health system — scaled to the demonstration's patient count.
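As a rough sketch of the exact-match rule above, the roster can be built with a single pass over the submitted tokens. The tuple layout `(token, site, last_visit)` and the function name are hypothetical; the text does not describe the actual record format.

```python
from datetime import date


def deduplicate(records):
    """Collapse (token, site, last_visit) tuples into one roster row per token.

    Site attribution goes to the site holding the most recent visit.
    Returns {token: attributed_site}.
    """
    latest = {}
    for token, site, last_visit in records:
        current = latest.get(token)
        if current is None or last_visit > current[1]:
            latest[token] = (site, last_visit)
    return {token: site for token, (site, _) in latest.items()}


# Illustrative input: one patient seen at two sites, one seen at a single site.
demo_records = [
    ("tok-a", "Allina Health", date(2023, 2, 1)),
    ("tok-a", "Mayo Clinic", date(2023, 9, 15)),
    ("tok-b", "CentraCare", date(2023, 5, 3)),
]
demo_roster = deduplicate(demo_records)
```

Here three raw records reduce to two roster rows, with `tok-a` attributed to the site with the later visit.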

4 Administrative Data Enrichment ✓ Complete · 2.95s

Patients with social risk flags

2 homeless · 22 incarcerated · 219 Medicaid

Roster enriched with housing, justice, and insurance context

Authorized data stewards — the Minnesota Homeless Management Information System, the Minnesota Department of Corrections, and the Minnesota Department of Human Services Medicaid enrollment file — provide periodic extracts that are linked to the patient roster by token. This enrichment adds social risk context without pulling clinical records into a central warehouse. The result: 2 patients flagged with recent homelessness experience, 22 with recent incarceration involvement, and 219 with current Medicaid enrollment — dimensions that are essential for equity-focused prevalence analysis and that no clinical data source alone can provide.

Technical detail

Linkage method: exact token match against each administrative data source. Administrative extracts are provided by MDH under data use agreements that specify permitted uses, transfer protocols, and retention limits. Flag rates in this demonstration reflect published MNEHRC population proportions from the 2023 HTAC cohort.
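The token-based linkage above amounts to a set-membership test per administrative source. The sketch below assumes each extract arrives as a plain collection of tokens; the function name and flag keys are hypothetical.

```python
def enrich(roster_tokens, homeless_tokens, incarcerated_tokens, medicaid_tokens):
    """Attach three boolean social risk flags to each roster token by exact match."""
    sources = {
        "homeless": set(homeless_tokens),
        "incarcerated": set(incarcerated_tokens),
        "medicaid": set(medicaid_tokens),
    }
    return {
        token: {flag: token in tokens for flag, tokens in sources.items()}
        for token in roster_tokens
    }


def count_flags(enriched):
    """Tally how many roster entries carry each flag."""
    totals = {"homeless": 0, "incarcerated": 0, "medicaid": 0}
    for flags in enriched.values():
        for flag, present in flags.items():
            totals[flag] += int(present)
    return totals
```

Only tokens and flags are involved; no clinical fields need to travel with the administrative extracts for this step to work.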

5 Federated Cohort Queries ✓ Complete · 5.0s

Conditions queried

5 conditions · 11 sites · 55 raw cells

Condition cohort queries executed across all 11 sites

Standardized analytic packages — R scripts distributed to each site — execute against each site's local OMOP database using published OMOP concept ID codesets. Only aggregate counts return to the coordinating center. In this demonstration, 5 conditions were queried across 11 sites, producing 55 raw estimate cells before stratification and suppression.

Technical detail

Condition definitions use OMOP concept IDs from the SNOMED CT, ICD-10-CM, RxNorm, and LOINC vocabularies, consistent with CMS Chronic Conditions Warehouse codesets and HTAC condition definitions. Queries search ConditionOccurrence, DrugExposure, and Measurement domains during a 12-month study period. Denominators use persons with at least one VisitOccurrence during the study window — consistent with MNEHRC/HTAC methodology.
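To make the aggregate-only pattern concrete, here is a simplified site-local query. The text states that the distributed packages are R scripts; Python with SQLite is swapped in here purely for illustration. The table and column names follow OMOP CDM conventions, the function name is hypothetical, and only the condition domain is searched (the drug and measurement domains are omitted for brevity).

```python
import sqlite3


def site_counts(conn, concept_ids, start, end):
    """Return only aggregate counts for one condition at one site.

    Denominator: persons with at least one visit in the study window.
    Numerator: those persons who also have a matching condition record in the window.
    Dates are ISO strings (YYYY-MM-DD), which compare correctly as text.
    """
    denominator = conn.execute(
        "SELECT COUNT(DISTINCT person_id) FROM visit_occurrence "
        "WHERE visit_start_date BETWEEN ? AND ?",
        (start, end),
    ).fetchone()[0]

    placeholders = ",".join("?" for _ in concept_ids)
    numerator = conn.execute(
        "SELECT COUNT(DISTINCT c.person_id) FROM condition_occurrence c "
        "JOIN visit_occurrence v ON v.person_id = c.person_id "
        f"WHERE c.condition_concept_id IN ({placeholders}) "
        "AND c.condition_start_date BETWEEN ? AND ? "
        "AND v.visit_start_date BETWEEN ? AND ?",
        (*concept_ids, start, end, start, end),
    ).fetchone()[0]

    # No person-level rows are returned, only two integers.
    return {"numerator": numerator, "denominator": denominator}
```

The return value is the only thing that would leave the site in this sketch, which mirrors the rule that patient records stay local.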

6 Stratification and Suppression ✓ Complete · 14.68s

Suppression rate

82.7% of cells suppressed (695 of 840)

Stratification complete — 695 of 840 cells suppressed to protect small populations

Estimates are stratified across 5 demographic and social dimensions at 1 geographic level in this live demonstration (full county, ZIP, and census-tract grids in production). Any stratum containing fewer than 11 individuals is suppressed — numerator, denominator, and rate are all cleared before results are stored or published. This threshold matches the MNEHRC Master Data Use Agreement and is consistent with CDC and CMS suppression standards. In full production runs, suppression is most common at fine geographic levels among small demographic groups — the same policy applies here on a reduced geography set for responsiveness.

Technical detail

Suppression threshold: n < 11 (equivalent to n ≤ 10 for integer counts; consistent with the CMS standard). Secondary suppression applied: when a numerator is suppressed, the denominator is also cleared to prevent back-calculation of the suppressed count. Prevalence rate formula: (numerator / denominator) × 10,000, reported per 10,000 active patients. This browser demonstration evaluates state-level cells only so that it completes quickly; production adds county, ZIP, and census tract dimensions per the DUA.
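The suppression rule and rate formula above can be expressed as one small function. This is a sketch under stated assumptions: the function name, the dictionary layout, and the rounding to one decimal place are illustrative, and the check is applied to the numerator (a denominator below 11 necessarily implies a numerator below 11).

```python
SUPPRESSION_THRESHOLD = 11  # cells with fewer than 11 individuals are suppressed


def finalize_cell(numerator: int, denominator: int) -> dict:
    """Apply small-cell suppression, then compute the rate per 10,000 active patients."""
    if numerator < SUPPRESSION_THRESHOLD:
        # Clear all numeric fields so the suppressed count cannot be back-calculated.
        return {"numerator": None, "denominator": None, "rate": None, "suppressed": True}
    rate = round(numerator / denominator * 10_000, 1)  # rounding is an assumption
    return {
        "numerator": numerator,
        "denominator": denominator,
        "rate": rate,
        "suppressed": False,
    }
```

For example, a cell with 10 cases among 500 patients is cleared entirely, while a cell with 11 cases among 500 patients is published with a rate of 220.0 per 10,000.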

7 Results Published ✓ Complete · 1.01s

Published estimates

145 estimates · 5 conditions · 11 sites

Pipeline complete — 145 stratified prevalence estimates ready for review

The coordinating center now holds 145 publishable prevalence estimates and 695 suppressed cells for the 5 conditions queried. These estimates are accessible through the operations console to authorized analysts, and can be released as governed file extracts or integrated into reporting dashboards subject to the publication process defined in the consortium's data use agreements. The pipeline ran end-to-end in 42 seconds — from 11 contributing sites to a governed, suppressed, stratified statewide dataset.

Technical detail

Output format: PrevalenceEstimate rows indexed by condition × health_system × geo_level × geo_value × stratifier × stratifier_value. Each publishable row includes numerator, denominator, and rate (per 10,000 active patients). Suppressed rows retain the stratifier dimensions but clear numeric fields. This output format is consistent with HTAC public-use data file structure and is compatible with downstream reporting in Power BI, Tableau, or web-native dashboards.
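One way to picture the row structure described above is as a small record type. The field names come from the index listed in the text; the Python types, the defaults, and the `suppressed` helper are assumptions for illustration, and the sample values are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PrevalenceEstimate:
    # Index dimensions, retained even when the row is suppressed.
    condition: str
    health_system: str
    geo_level: str
    geo_value: str
    stratifier: str
    stratifier_value: str
    # Numeric fields, cleared (None) on suppressed rows.
    numerator: Optional[int] = None
    denominator: Optional[int] = None
    rate: Optional[float] = None  # per 10,000 active patients

    @property
    def suppressed(self) -> bool:
        return self.numerator is None


# Hypothetical rows: one publishable, one suppressed.
published_row = PrevalenceEstimate(
    "example_condition", "Allina Health", "state", "MN", "sex", "F",
    numerator=25, denominator=90, rate=2777.8,
)
suppressed_row = PrevalenceEstimate(
    "example_condition", "CentraCare", "state", "MN", "sex", "F",
)
```

Keeping the stratifier dimensions on suppressed rows lets a dashboard show that a cell exists and was withheld, rather than silently omitting it.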

Run summary

Raw patient records

1,320

Unique after deduplication

1,000

Cross-site duplicates resolved

320

Homelessness flag

2

Incarceration flag

22

Medicaid flag

219

Estimates generated

840

Estimates suppressed

695 (82.7%)

This demonstration ran end-to-end in 42 seconds across 11 simulated health systems. In a production deployment, the same pipeline would run on real OMOP CDM instances at each participating health system, with network latency replacing the simulated processing delays. The output format — suppressed, stratified PrevalenceEstimate rows — is identical to the format used in the MNEHRC Health Trends Across Communities production system.
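The headline figures in this run can be reconciled with a few lines of arithmetic. The sketch below uses only numbers reported on this page; the variable names are illustrative.

```python
# Per-step durations (seconds) as reported for steps 1 through 7.
step_seconds = [4.68, 4.34, 9.67, 2.95, 5.0, 14.68, 1.01]

raw_records, unique_patients = 1320, 1000
cells_generated, cells_suppressed = 840, 695

duplicates = raw_records - unique_patients                            # 320
overlap_pct = round(100 * duplicates / raw_records, 1)                # 24.2
published = cells_generated - cells_suppressed                        # 145
suppression_pct = round(100 * cells_suppressed / cells_generated, 1)  # 82.7
total_seconds = round(sum(step_seconds), 2)                           # 42.33

print(duplicates, overlap_pct, published, suppression_pct, total_seconds)
```

The step durations sum to 42.33 seconds, which matches the reported end-to-end time of 42 seconds after rounding.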
