HTAC Pipeline · Step 1 of 7

Clinical Data in OMOP CDM

Pipeline steps: 1 Clinical Data · 2 PPRL Tokens · 3 Deduplication · 4 Enrichment · 5 Cohort Queries · 6 Suppression · 7 Results

The pipeline begins at each health system's local EHR database, where patient data is standardized into the OMOP Common Data Model — the shared vocabulary used by PCORnet, N3C, and over 300 research networks worldwide.

What This Pipeline Demonstrates

This walkthrough steps through a working implementation of the federated population health data architecture used by statewide EHR consortia — including the Minnesota Electronic Health Record Consortium. Each step reflects real architectural decisions that practitioners building similar systems encounter: how patient data stays at contributing sites, how cross-site deduplication works without sharing identifiable information, how administrative data from HMIS and corrections systems gets linked, how condition cohorts are defined using OMOP concept IDs, and how small-cell suppression protects individual privacy while preserving analytic utility.

This is not a slide deck or a diagram. The pipeline is connected to a live Django application with real data models, real suppression logic, and a real OMOP CDM schema. The record counts on this page come from a synthetic dataset rather than real patient records, but the infrastructure that processes them is real.

If you are evaluating whether this kind of technical capability is relevant to your organization's work, the services page describes how this translates into consulting engagements.


Why this step?

Participating organizations in a federated network often run different EHR platforms (Epic, Cerner, Allscripts, and others). Without a common format, a query that finds "diabetes" at one system misses it at another. OMOP CDM creates a single shared vocabulary so the same analysis script produces comparable results at every site.
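As a minimal sketch of why the shared vocabulary matters, the snippet below maps site-specific source codes to a single standard concept ID before counting. The `SOURCE_TO_STANDARD` dictionary, the `to_standard_concept` and `count_concept` helpers, and the sample records are all hypothetical; a real site would resolve codes through the OMOP vocabulary tables rather than a hand-written dictionary, and the mappings shown should be checked in Athena before use.

```python
# Hypothetical source-to-standard map, for illustration only.
# 201826 is used here as the standard concept for type 2 diabetes mellitus.
SOURCE_TO_STANDARD = {
    ("ICD10CM", "E11.9"): 201826,
    ("ICD9CM", "250.00"): 201826,
    ("SNOMED", "44054006"): 201826,
}


def to_standard_concept(vocabulary, code):
    """Return the standard concept ID, or 0 when no mapping is known."""
    return SOURCE_TO_STANDARD.get((vocabulary, code), 0)


def count_concept(records, concept_id):
    """Count records whose source code maps to the given standard concept."""
    return sum(
        1 for vocabulary, code in records
        if to_standard_concept(vocabulary, code) == concept_id
    )


# Two sites record the same diagnosis with different source codes.
site_a = [("ICD10CM", "E11.9"), ("ICD10CM", "I10")]
site_b = [("SNOMED", "44054006")]

# The same concept-based query finds the diagnosis at both sites.
print(count_concept(site_a, 201826), count_concept(site_b, 201826))
```

A query written against the raw source codes would need a separate code list per site; a query written against the standard concept ID does not.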

How it works

  • Each site maintains its own OMOP database locally — no patient data leaves the site.
  • The coordinating team distributes standardized analytic packages that run against each local OMOP database.
  • Scripts return only aggregate counts — for example, the number of persons meeting a condition definition at a given site during the study period.
  • Aggregates are transmitted to the Data Coordinating Center (DCC) for final analysis.
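The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the HTAC implementation: each site is simulated with an in-memory SQLite database, the table holds only three columns, and `run_local_count` and `make_demo_site` are hypothetical names. The point is that the only value leaving each site is an integer.

```python
import sqlite3


def run_local_count(conn, concept_ids, start, end):
    """Count distinct persons with a qualifying condition in the study period."""
    placeholders = ",".join("?" for _ in concept_ids)
    sql = (
        "SELECT COUNT(DISTINCT person_id) FROM condition_occurrence "
        f"WHERE condition_concept_id IN ({placeholders}) "
        "AND condition_start_date BETWEEN ? AND ?"
    )
    (count,) = conn.execute(sql, [*concept_ids, start, end]).fetchone()
    return count


def make_demo_site(rows):
    """Build a throwaway local database standing in for one site's OMOP store."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE condition_occurrence ("
        "person_id INTEGER, condition_concept_id INTEGER, "
        "condition_start_date TEXT)"
    )
    conn.executemany("INSERT INTO condition_occurrence VALUES (?, ?, ?)", rows)
    return conn


# Synthetic rows: (person_id, condition_concept_id, ISO start date).
sites = {
    "site_a": make_demo_site([
        (1, 201826, "2025-03-01"),
        (1, 201826, "2025-06-01"),  # same person, counted once
        (2, 316866, "2025-04-10"),  # different concept, not counted
    ]),
    "site_b": make_demo_site([
        (7, 201826, "2025-01-15"),
        (8, 201826, "2023-12-31"),  # outside the study period
    ]),
}

# Only these per-site integers would be transmitted to the DCC.
aggregates = {
    name: run_local_count(conn, [201826], "2025-01-01", "2025-12-31")
    for name, conn in sites.items()
}
print(aggregates)
```

Dates are stored as ISO 8601 strings so that SQLite's text comparison orders them correctly; a production OMOP database would use a native date type.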

What is never stored

No patient names, dates of birth, phone numbers, or Social Security numbers are persisted in the HTAC database. These PII fields exist only transiently during PPRL token generation (Step 2) and are immediately discarded — only the one-way cryptographic hash is retained.
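To illustrate the "hash and discard" idea, here is a minimal sketch that assumes a keyed HMAC-SHA-256 over normalized identifiers. The actual token scheme is described in Step 2 and may differ; the `normalize` and `make_token` functions, the field choice, and the demo secret are all assumptions made for this example.

```python
import hashlib
import hmac
import unicodedata


def normalize(value):
    """Strip accents, punctuation, and whitespace, then uppercase."""
    decomposed = unicodedata.normalize("NFKD", value)
    return "".join(ch for ch in decomposed if ch.isalnum()).upper()


def make_token(first_name, last_name, date_of_birth, secret):
    """Return a one-way keyed hash of the normalized identifiers."""
    message = "|".join(
        [normalize(first_name), normalize(last_name), normalize(date_of_birth)]
    ).encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


secret = b"demo-only-shared-secret"  # hypothetical; never hard-code a real key

# The same person entered slightly differently at two sites.
token_a = make_token("María", "O'Brien", "1980-02-29", secret)
token_b = make_token(" maria ", "OBRIEN", "19800229", secret)

# Only the tokens would be kept; the PII arguments go out of scope here.
print(token_a == token_b)
```

Normalizing before hashing matters because a cryptographic hash changes completely on any input difference, so formatting variations must be removed first for the same person to produce the same token.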

Data Fields Captured Per Domain

This implementation stores clinical data in six OMOP CDM tables. Apart from Person, each row is one clinical event linked to a Person and a HealthSystem. Concept IDs can be looked up in the OHDSI Athena vocabulary browser.

Person

1,320 rows
Field | Type | Purpose
person_source_value | str | Site-internal patient ID; never shared across sites
gender_concept_id | int | Sex at birth, OMOP standard (8507 Male, 8532 Female)
year_of_birth | int | Birth year; used to compute age group stratifier
race_concept_id | int | OMOP standard race concept
ethnicity_concept_id | int | OMOP standard ethnicity (38003563 Hispanic or Latino)
preferred_language | str | Primary language; used as a stratification dimension
county_fips | str(5) | 5-digit county FIPS code for county-level prevalence
zip_code | str(5) | ZIP code for ZIP-level geographic estimates
census_tract | str(11) | 11-char FIPS census tract for finest geographic resolution
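Because only year_of_birth is stored, the age group stratifier has to be derived from it. The sketch below shows one way this could work; the `Person` dataclass is a cut-down stand-in for the real Django model, and the `AGE_BANDS` cut points and `age_group` helper are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical age bands (inclusive bounds); the real stratifier may differ.
AGE_BANDS = [(0, 17, "0-17"), (18, 44, "18-44"), (45, 64, "45-64")]


@dataclass(frozen=True)
class Person:
    """Subset of the Person fields listed above."""
    person_source_value: str
    gender_concept_id: int
    year_of_birth: int
    county_fips: str


def age_group(person, reference_year):
    """Map a birth year to an age band relative to the reference year."""
    age = reference_year - person.year_of_birth
    if age < 0:
        raise ValueError("year_of_birth is after the reference year")
    for low, high, label in AGE_BANDS:
        if low <= age <= high:
            return label
    return "65+"


example = Person("A-001", 8532, 1990, "27053")
print(age_group(example, reference_year=2025))
```

With a birth year and no birth date, the computed age can be off by up to one year, which only matters for persons near a band boundary.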

VisitOccurrence

2,474 rows
Field | Type | Purpose
visit_concept_id | int | Encounter type (9201 Inpatient, 9202 Outpatient)
visit_start_date | date | Encounter start; defines the active-patient denominator in Step 6
visit_end_date | date | Encounter end; null for same-day visits
visit_type_concept_id | int | How encounter was recorded (44818518 EHR encounter)
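A rough sketch of how visit_start_date could define an active-patient denominator follows. The exact rule belongs to Step 6; this example assumes "at least one visit starting within the study period", and the `active_patient_denominator` helper and the sample visits are hypothetical.

```python
from datetime import date

# Synthetic visits: (person_id, visit_start_date).
visits = [
    (1, date(2025, 1, 10)),
    (1, date(2025, 7, 2)),    # second visit, same person
    (2, date(2024, 11, 30)),  # before the study period
    (3, date(2025, 12, 31)),  # last day of the period, included
]


def active_patient_denominator(visit_rows, start, end):
    """Count distinct persons with a visit starting in [start, end]."""
    return len({person_id for person_id, started in visit_rows
                if start <= started <= end})


denominator = active_patient_denominator(
    visits, date(2025, 1, 1), date(2025, 12, 31)
)
print(denominator)
```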

ConditionOccurrence

681 rows
Field | Type | Purpose
condition_concept_id | int | OMOP standard diagnosis concept (SNOMED CT / ICD-10-CM)
condition_start_date | date | Onset or first-recorded date; used for study-period filtering
condition_end_date | date | Resolution date; null for chronic conditions
condition_type_concept_id | int | Provenance (32817 EHR encounter diagnosis)

DrugExposure

681 rows
Field | Type | Purpose
drug_concept_id | int | OMOP standard drug concept (RxNorm: Metformin, Buprenorphine, etc.)
drug_exposure_start_date | date | Prescription or dispensing start date
drug_exposure_end_date | date | End of supply; derived from days_supply when available
drug_type_concept_id (required) | int | Provenance (32817 EHR encounter, 38000180 Prescription written)
quantity | decimal | Units dispensed (tablets, mL, etc.)
days_supply | int | Days of medication in the dispense (30 / 60 / 90 day fills)
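The derivation of drug_exposure_end_date from days_supply can be sketched as below. This assumes the start date counts as the first day of supply, so a 30-day fill ends 29 days after it starts; whether this implementation uses that inclusive convention is an assumption, and `derive_end_date` is a hypothetical helper.

```python
from datetime import date, timedelta
from typing import Optional


def derive_end_date(start: date, days_supply: Optional[int]) -> Optional[date]:
    """Return the last covered day, or None when days_supply is unusable."""
    if days_supply is None or days_supply <= 0:
        return None
    # The start date is day 1 of the supply, hence the minus one.
    return start + timedelta(days=days_supply - 1)


print(derive_end_date(date(2025, 1, 1), 30))
print(derive_end_date(date(2025, 1, 1), None))
```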

Measurement

430 rows
Field | Type | Purpose
measurement_concept_id | int | OMOP standard lab / vital concept (LOINC: HbA1c, BMI, Creatinine, etc.)
measurement_date | date | Date of lab draw or vital sign collection
measurement_type_concept_id (required) | int | Provenance (44818702 Lab result, 44818701 Vital sign)
value_as_number | decimal | Numeric result (e.g., 7.4 for HbA1c %, 142 for systolic BP mmHg)
value_as_concept_id | int | Coded result for qualitative measurements (Positive / Negative)
unit_concept_id | int | Unit of measure (UCUM vocabulary: mg/dL, mmHg, kg/m²)

Observation

0 rows
Field | Type | Purpose
observation_concept_id | int | OMOP standard observation concept (social history, functional status, etc.)
observation_date | date | Date the observation was recorded
observation_type_concept_id (required) | int | Provenance (38000280 Observation recorded from EHR)
value_as_string | str | Text value (e.g., tobacco use status)
value_as_number (required) | decimal | Numeric value when observation has a quantitative component
value_as_concept_id | int | Coded value (e.g., current smoker concept)
unit_concept_id (required) | int | Unit of measure when value_as_number is populated

Fields marked "required" are mandatory in OMOP CDM v5.4 and were added to this implementation to match the standard. Vocabularies supported: SNOMED CT, ICD-10-CM, RxNorm, LOINC, NDC, CPT-4, HCPCS, ICD-10-PCS.

Simulation Record Counts by Site

Live counts from the current synthetic dataset. In production, only aggregate counts like these — never patient records — travel to the DCC.

Health System | Persons | Visits | Conditions | Drug Exposures | Measurements | Observations
Allina Health | 176 | 364 | 94 | 94 | 62 | 0
CentraCare | 86 | 148 | 41 | 41 | 26 | 0
Children's Regional Health | 77 | 77 | 8 | 8 | 0 | 0
Essentia Health | 135 | 258 | 68 | 68 | 44 | 0
HealthPartners | 170 | 341 | 101 | 101 | 59 | 0
Hennepin Healthcare | 89 | 165 | 55 | 55 | 37 | 0
M Health Fairview | 169 | 356 | 95 | 95 | 56 | 0
Mayo Clinic | 150 | 316 | 89 | 89 | 57 | 0
Minneapolis VA | 90 | 140 | 45 | 45 | 26 | 0
North Memorial Health | 72 | 101 | 26 | 26 | 20 | 0
Sanford Health | 106 | 208 | 59 | 59 | 43 | 0
Total (11 sites) | 1,320 | 2,474 | 681 | 681 | 430 | 0

Observations are zero because the 22 HTAC condition codesets currently use the Condition, Drug, and Measurement domains. The Observation domain would capture social determinants (housing status, tobacco use) when added to future codeset expansions.
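As a small illustration of what the DCC does with per-site aggregates, the sketch below sums the site counts from the table above into network totals. The `SITE_COUNTS` structure and column names are hypothetical; the numbers are the synthetic counts shown on this page.

```python
COLUMNS = (
    "persons", "visits", "conditions",
    "drug_exposures", "measurements", "observations",
)

# Per-site aggregates as they would arrive at the DCC (synthetic data).
SITE_COUNTS = {
    "Allina Health": (176, 364, 94, 94, 62, 0),
    "CentraCare": (86, 148, 41, 41, 26, 0),
    "Children's Regional Health": (77, 77, 8, 8, 0, 0),
    "Essentia Health": (135, 258, 68, 68, 44, 0),
    "HealthPartners": (170, 341, 101, 101, 59, 0),
    "Hennepin Healthcare": (89, 165, 55, 55, 37, 0),
    "M Health Fairview": (169, 356, 95, 95, 56, 0),
    "Mayo Clinic": (150, 316, 89, 89, 57, 0),
    "Minneapolis VA": (90, 140, 45, 45, 26, 0),
    "North Memorial Health": (72, 101, 26, 26, 20, 0),
    "Sanford Health": (106, 208, 59, 59, 43, 0),
}

# Sum each column across sites; no patient-level rows are involved.
totals = {
    column: sum(row[index] for row in SITE_COUNTS.values())
    for index, column in enumerate(COLUMNS)
}
print(totals)
```

Summing person counts across sites treats each site's patients as distinct; removing people seen at more than one site is the job of the deduplication step later in the pipeline.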

Latest pipeline demonstration completed May 14, 2026 20:05. Counts on this page reflect synthetic federated data from that run.