HTAC · Live demonstration
Federated Pipeline Demonstration
A live run of the HTAC (Health Trends Across Communities) data pipeline using synthetic patient records distributed across the 11 health systems of the Minnesota Electronic Health Record Consortium (MNEHRC). The steps execute in sequence: clinical data loading, privacy-preserving tokenization, cross-site deduplication, administrative data enrichment, federated cohort queries, stratification and suppression, and publication of results.
For program and policy staff: watch what happens to patient data at each step and why.
For technical implementers: the step detail panels show the architectural decisions behind each phase.
Running again will clear the previous demonstration data and replace it with a new synthetic run.
All records are synthetic. No real patient data is used in this demonstration. Health system names reflect actual MNEHRC consortium members; patient data is algorithmically generated for demonstration purposes only.
Pipeline steps
Total patient records loaded
1320 across 11 sites
Clinical records standardized across 11 health systems
Each of the 11 participating health systems maintains its own OMOP CDM database locally. Query scripts distributed by the coordinating center run against each site's local database and return only aggregate counts — no patient records leave any site. This demonstration loaded 1320 synthetic patient records distributed across sites in proportion to each system's actual patient volume.
Technical detail
Records loaded into OMOP CDM v5.4 tables: Person, VisitOccurrence, ConditionOccurrence, DrugExposure, Measurement. Vocabularies: SNOMED CT (conditions), RxNorm (drugs), LOINC (measurements). Each site's OMOP instance uses Epic-sourced concept mappings with site-specific normalization applied prior to loading.
| Health system | visits | persons | conditions | measurements | observations | drug_exposures |
|---|---|---|---|---|---|---|
| Allina Health | 364 | 176 | 94 | 62 | 0 | 94 |
| HealthPartners | 341 | 170 | 101 | 59 | 0 | 101 |
| M Health Fairview | 356 | 169 | 95 | 56 | 0 | 95 |
| Mayo Clinic | 316 | 150 | 89 | 57 | 0 | 89 |
| Essentia Health | 258 | 135 | 68 | 44 | 0 | 68 |
| Hennepin Healthcare | 165 | 89 | 55 | 37 | 0 | 55 |
| Sanford Health | 208 | 106 | 59 | 43 | 0 | 59 |
| CentraCare | 148 | 86 | 41 | 26 | 0 | 41 |
| Children's Minnesota | 77 | 77 | 8 | 0 | 0 | 8 |
| North Memorial Health | 101 | 72 | 26 | 20 | 0 | 26 |
| Minneapolis VA Health Care System | 140 | 90 | 45 | 26 | 0 | 45 |
Patients appearing at 2+ sites
320 (24.2% of total)
Patient identities converted to one-way cryptographic tokens
Before any cross-site matching can occur, each health system independently hashes six PII fields — first name, last name, date of birth, sex, phone number, and ZIP code — into a 64-character HMAC-SHA256 token using a shared salt known only to consortium members. The original PII fields are immediately discarded. Only the token is transmitted to the coordinating center. 320 tokens were generated by more than one site — these represent patients who received care at multiple health systems and would be double-counted without deduplication.
Technical detail
Algorithm: HMAC-SHA256(key=shared_salt, msg=normalized_preimage). Normalization: lowercase folding for name fields, YYYYMMDD for date of birth, digits-only for phone, 5-character ZIP. Validated at 97% precision and 75% recall in multi-site deployments (OneFlorida). Tokens are consortium-specific — a token from this network cannot be linked to tokens from a different network without the shared salt.
| Health system | tokens_generated |
|---|---|
| Allina Health | 176 |
| HealthPartners | 170 |
| M Health Fairview | 169 |
| Mayo Clinic | 150 |
| Essentia Health | 135 |
| Hennepin Healthcare | 89 |
| Sanford Health | 106 |
| CentraCare | 86 |
| Children's Minnesota | 77 |
| North Memorial Health | 72 |
| Minneapolis VA Health Care System | 90 |
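The tokenization step described above can be sketched in a few lines of Python. This is a minimal illustration and not the consortium's implementation: the `|` field delimiter, the lowercase folding of the sex field, the function names, and `DEMO_SALT` are all assumptions introduced here. Only the HMAC-SHA256 construction and the name, date, phone, and ZIP normalization rules come from the technical detail.

```python
import hashlib
import hmac
import re
from datetime import date


def normalize_preimage(first: str, last: str, dob: date, sex: str,
                       phone: str, zip_code: str) -> str:
    """Join the six normalized PII fields into one preimage string."""
    parts = [
        first.strip().lower(),        # lowercase folding for name fields
        last.strip().lower(),
        dob.strftime("%Y%m%d"),       # YYYYMMDD for date of birth
        sex.strip().lower(),          # assumption: sex is folded like names
        re.sub(r"\D", "", phone),     # digits only for phone
        zip_code.strip()[:5],         # first 5 characters of the ZIP
    ]
    return "|".join(parts)            # assumption: "|" as field delimiter


def make_token(shared_salt: bytes, **pii) -> str:
    """Return the 64-character hex HMAC-SHA256 token for one patient."""
    preimage = normalize_preimage(**pii).encode("utf-8")
    return hmac.new(shared_salt, preimage, hashlib.sha256).hexdigest()


DEMO_SALT = b"demo-only-salt"  # hypothetical; a real salt would be kept secret

# The same synthetic patient as two sites might have recorded them.
token_site_a = make_token(DEMO_SALT, first=" Maria", last="Lopez",
                          dob=date(1980, 3, 15), sex="F",
                          phone="(612) 555-0100", zip_code="55401-1234")
token_site_b = make_token(DEMO_SALT, first="maria", last="LOPEZ ",
                          dob=date(1980, 3, 15), sex="f",
                          phone="612-555-0100", zip_code="55401")
```

Because both sites normalize before hashing, formatting differences in the source records do not change the token, while a different salt yields an unrelated token.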
Unique individuals in statewide roster
1000 (down from 1320 raw records)
Statewide patient roster deduplicated — 320 duplicate records resolved
The coordinating center receives tokens from all 11 sites. Because the same patient produces the same token at every site where they were seen, duplicate records are identified by matching tokens. The 1320 raw patient records reduce to 1000 unique individuals after deduplication. Without this step, prevalence denominators would be inflated by 320 duplicate records, which would understate true prevalence rates.
Technical detail
Deduplication algorithm: exact token match. A patient roster row is created for each unique token, with site attribution assigned to the site holding the most recent visit data. The 24.2% overlap rate observed in this demonstration (320 of 1320 raw records) is scaled down from the published MNEHRC figure of 75% of patients having records at more than one health system, to suit the demonstration's patient count.
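As a rough sketch of exact-token deduplication with most-recent-visit attribution, the roster can be built in a single pass over the site submissions. The record layout, the function name, and the tie-breaking rule (the first submission wins when visit dates are equal) are assumptions made for illustration.

```python
from datetime import date


def deduplicate(site_records):
    """Build one roster row per unique token.

    site_records is an iterable of (token, site, last_visit_date) tuples.
    Each token is attributed to the site holding its most recent visit.
    """
    roster = {}
    for token, site, last_visit in site_records:
        current = roster.get(token)
        # Strict ">" keeps the first submission on ties (an assumption).
        if current is None or last_visit > current["last_visit"]:
            roster[token] = {"site": site, "last_visit": last_visit}
    return roster


records = [
    ("tok_a", "Allina Health", date(2024, 1, 10)),
    ("tok_a", "Mayo Clinic", date(2024, 6, 2)),
    ("tok_b", "HealthPartners", date(2024, 3, 5)),
]
roster = deduplicate(records)
duplicates_resolved = len(records) - len(roster)
```

Applied to the demonstration run, the same subtraction gives 1320 raw records minus 1000 roster rows, or 320 duplicates resolved.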
Patients with social risk flags
2 homeless · 22 incarcerated · 219 Medicaid
Roster enriched with housing, justice, and insurance context
Authorized data stewards — the Minnesota Homeless Management Information System, the Minnesota Department of Corrections, and the Minnesota Department of Human Services Medicaid enrollment file — provide periodic extracts that are linked to the patient roster by token. This enrichment adds social risk context without pulling clinical records into a central warehouse. The result: 2 patients flagged with recent homelessness experience, 22 with recent incarceration involvement, and 219 with current Medicaid enrollment — dimensions that are essential for equity-focused prevalence analysis and that no clinical data source alone can provide.
Technical detail
Linkage method: exact token match against each administrative data source. Administrative extracts are provided by MDH under data use agreements that specify permitted uses, transfer protocols, and retention limits. Flag rates in this demonstration reflect published MNEHRC population proportions from the 2023 HTAC cohort.
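The token-based linkage could look roughly like the following, assuming each steward's extract has already been reduced to a set of tokens. The flag names, function names, and sample tokens are hypothetical.

```python
def enrich_roster(roster_tokens, steward_extracts):
    """Attach one boolean flag per administrative source to each roster token.

    steward_extracts maps a flag name to the set of tokens in that extract.
    Tokens that appear in an extract but not in the roster are ignored.
    """
    enriched = {}
    for token in roster_tokens:
        enriched[token] = {
            flag: token in tokens for flag, tokens in steward_extracts.items()
        }
    return enriched


def flag_counts(enriched):
    """Count how many roster patients carry each flag."""
    counts = {}
    for flags in enriched.values():
        for flag, is_set in flags.items():
            counts[flag] = counts.get(flag, 0) + int(is_set)
    return counts


roster_tokens = ["tok_a", "tok_b", "tok_c"]
steward_extracts = {
    "homeless": {"tok_a"},
    "incarcerated": {"tok_z"},        # tok_z is not on the roster
    "medicaid": {"tok_a", "tok_c"},
}
enriched = enrich_roster(roster_tokens, steward_extracts)
counts = flag_counts(enriched)
```

Only membership in each extract is recorded on the roster, so no administrative record content needs to be copied alongside the flag.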
Conditions queried
5 conditions · 11 sites · 55 raw cells
Condition cohort queries executed across all 11 sites
Standardized analytic packages — R scripts distributed to each site — execute against each site's local OMOP database using published OMOP concept ID codesets. Only aggregate counts return to the coordinating center. In this demonstration, 5 conditions were queried across 11 sites, producing 55 raw estimate cells before stratification and suppression.
Technical detail
Condition definitions use OMOP concept IDs from the SNOMED CT, ICD-10-CM, RxNorm, and LOINC vocabularies, consistent with CMS Chronic Conditions Warehouse codesets and HTAC condition definitions. Queries search ConditionOccurrence, DrugExposure, and Measurement domains during a 12-month study period. Denominators use persons with at least one VisitOccurrence during the study window — consistent with MNEHRC/HTAC methodology.
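The production analytic packages are R scripts; purely to illustrate the shape of a site-local query, here is a small Python and SQLite sketch. It uses a cut-down schema with only the OMOP columns it needs, searches the condition domain only (the real definitions also search drug and measurement data), and uses placeholder concept IDs that are not real OMOP codes. The function name is an assumption.

```python
import sqlite3

EXAMPLE_CONCEPT_IDS = [1001, 1002]  # placeholders, not real OMOP concept IDs


def site_aggregate(conn, concept_ids, start, end):
    """Run locally at one site and return aggregate counts only."""
    # Denominator: persons with at least one visit in the study window.
    denominator = conn.execute(
        "SELECT COUNT(DISTINCT person_id) FROM visit_occurrence "
        "WHERE visit_start_date BETWEEN ? AND ?",
        (start, end),
    ).fetchone()[0]
    # Numerator: denominator persons with a qualifying condition in the window.
    marks = ",".join("?" for _ in concept_ids)
    numerator = conn.execute(
        "SELECT COUNT(DISTINCT c.person_id) FROM condition_occurrence c "
        "WHERE c.condition_concept_id IN (" + marks + ") "
        "AND c.condition_start_date BETWEEN ? AND ? "
        "AND c.person_id IN (SELECT person_id FROM visit_occurrence "
        "WHERE visit_start_date BETWEEN ? AND ?)",
        (*concept_ids, start, end, start, end),
    ).fetchone()[0]
    return {"numerator": numerator, "denominator": denominator}


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE visit_occurrence (person_id INTEGER, visit_start_date TEXT);
CREATE TABLE condition_occurrence (
    person_id INTEGER, condition_concept_id INTEGER, condition_start_date TEXT);
""")
conn.executemany(
    "INSERT INTO visit_occurrence VALUES (?, ?)",
    [(1, "2024-02-01"), (2, "2024-05-10"), (3, "2024-09-30"), (4, "2022-01-15")],
)
conn.executemany(
    "INSERT INTO condition_occurrence VALUES (?, ?, ?)",
    [(1, 1001, "2024-02-01"), (2, 9999, "2024-05-10"), (4, 1001, "2022-01-15")],
)
result = site_aggregate(conn, EXAMPLE_CONCEPT_IDS, "2024-01-01", "2024-12-31")
```

The function returns two integers and nothing else, which mirrors the rule that only aggregate counts leave a site.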
Suppression rate
82.7% of cells suppressed (695 of 840)
Stratification complete — 695 of 840 cells suppressed to protect small populations
Estimates are stratified across 5 demographic and social dimensions at 1 geographic level in this live demonstration (production adds full county, ZIP, and census-tract grids). Any stratum containing fewer than 11 individuals is suppressed: numerator, denominator, and rate are all cleared before results are stored or published. This threshold matches the MNEHRC Master Data Use Agreement and is consistent with CDC and CMS suppression standards. In full production runs, suppression is most common at fine geographic levels among small demographic groups; this demonstration applies the same policy to a reduced set of geographies so that it stays responsive.
Technical detail
Suppression threshold: n < 11 (for integer counts, equivalent to n ≤ 10; consistent with the CMS standard). Secondary suppression applied: when a numerator is suppressed, the denominator is also cleared to prevent back-calculation of the suppressed count. Prevalence rate formula: rate per 10,000 active patients = (numerator / denominator) × 10,000. This browser demonstration evaluates state-level cells only so the demo completes quickly; production adds county, ZIP, and census tract dimensions per the DUA.
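The suppression rule and rate formula can be expressed as one small function. This is a sketch under the assumption that the count tested against the threshold is the cell's numerator; the function and constant names are illustrative.

```python
SUPPRESSION_THRESHOLD = 11   # cells with n < 11 are suppressed
RATE_MULTIPLIER = 10_000     # rates are expressed per 10,000 active patients


def finalize_cell(numerator: int, denominator: int) -> dict:
    """Apply small-cell suppression, then compute the prevalence rate."""
    if numerator < SUPPRESSION_THRESHOLD:
        # Clear the denominator and rate too, so the suppressed count
        # cannot be back-calculated from the remaining fields.
        return {"numerator": None, "denominator": None,
                "rate": None, "suppressed": True}
    rate = round(numerator / denominator * RATE_MULTIPLIER, 1)
    return {"numerator": numerator, "denominator": denominator,
            "rate": rate, "suppressed": False}


small_cell = finalize_cell(10, 400)   # below the threshold, so cleared
edge_cell = finalize_cell(11, 400)    # at the threshold, so published

# Suppression rate for this run: 695 of 840 cells.
suppression_pct = round(695 / 840 * 100, 1)
```

A count of exactly 11 is the smallest publishable cell, which is why the threshold is written as n < 11.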
Published estimates
145 estimates · 5 conditions · 11 sites
Pipeline complete — 145 stratified prevalence estimates ready for review
The coordinating center now holds 145 publishable prevalence estimates and 695 suppressed cells for the 5 conditions queried. These estimates are accessible through the operations console to authorized analysts, and can be released as governed file extracts or integrated into reporting dashboards subject to the publication process defined in the consortium's data use agreements. The pipeline ran end-to-end in 42 seconds — from 11 contributing sites to a governed, suppressed, stratified statewide dataset.
Technical detail
Output format: PrevalenceEstimate rows indexed by condition × health_system × geo_level × geo_value × stratifier × stratifier_value. Each publishable row includes numerator, denominator, and rate (per 10,000 active patients). Suppressed rows retain the stratifier dimensions but clear numeric fields. This output format is consistent with HTAC public-use data file structure and is compatible with downstream reporting in Power BI, Tableau, or web-native dashboards.
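One way to picture the output rows is as a dataclass whose numeric fields are optional, so a suppressed row keeps its dimensions but serializes with empty numeric columns. The field names follow the index and value names listed above; the sample values (`example_condition`, `age_group`, `65+`) and the CSV serialization are assumptions for illustration.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import Optional


@dataclass
class PrevalenceEstimate:
    condition: str
    health_system: str
    geo_level: str
    geo_value: str
    stratifier: str
    stratifier_value: str
    numerator: Optional[int] = None      # None when the row is suppressed
    denominator: Optional[int] = None
    rate: Optional[float] = None         # per 10,000 active patients


def to_csv(rows) -> str:
    """Serialize rows to CSV; None values become empty fields."""
    buffer = io.StringIO()
    names = [f.name for f in fields(PrevalenceEstimate)]
    writer = csv.DictWriter(buffer, fieldnames=names)
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buffer.getvalue()


rows = [
    PrevalenceEstimate("example_condition", "Allina Health", "state", "MN",
                       "age_group", "65+", 25, 400, 625.0),
    PrevalenceEstimate("example_condition", "Mayo Clinic", "state", "MN",
                       "age_group", "65+"),  # suppressed: numeric fields cleared
]
csv_lines = to_csv(rows).splitlines()
```

Keeping suppressed rows in the file, rather than dropping them, lets a downstream dashboard show that a stratum exists without revealing its counts.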
Run summary
Raw patient records
1,320
Unique after deduplication
1,000
Cross-site duplicates resolved
320
Homelessness flag
2
Incarceration flag
22
Medicaid flag
219
Estimates generated
840
Estimates suppressed
695 (82.7%)
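The run summary figures can be cross-checked against each other with simple arithmetic. The snippet below only restates numbers reported on this page; the variable names are illustrative.

```python
# Persons per health system, copied from the clinical data loading table.
persons_by_site = {
    "Allina Health": 176, "HealthPartners": 170, "M Health Fairview": 169,
    "Mayo Clinic": 150, "Essentia Health": 135, "Hennepin Healthcare": 89,
    "Sanford Health": 106, "CentraCare": 86, "Children's Minnesota": 77,
    "North Memorial Health": 72, "Minneapolis VA Health Care System": 90,
}

raw_records = sum(persons_by_site.values())
unique_individuals = 1000
cells_generated = 840
cells_suppressed = 695

duplicates = raw_records - unique_individuals
duplicate_pct = round(duplicates / raw_records * 100, 1)
published = cells_generated - cells_suppressed
suppressed_pct = round(cells_suppressed / cells_generated * 100, 1)

print(raw_records, duplicates, duplicate_pct, published, suppressed_pct)
```

The per-site counts sum to the reported raw total, and the published and suppressed cells together account for all 840 generated estimates.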
This demonstration ran end-to-end in 42 seconds across 11 simulated health systems. In a production deployment, the same pipeline would run on real OMOP CDM instances at each participating health system, with network latency replacing the simulated processing delays. The output format — suppressed, stratified PrevalenceEstimate rows — is identical to the format used in the MNEHRC Health Trends Across Communities production system.