HTAC Pipeline · Step 1 of 7
Clinical Data in OMOP CDM
The pipeline begins at each health system's local EHR database, where patient data is standardized into the OMOP Common Data Model — the shared vocabulary used by PCORnet, N3C, and over 300 research networks worldwide.
What This Pipeline Demonstrates
This walkthrough steps through a working implementation of the federated population health data architecture used by statewide EHR consortia — including the Minnesota Electronic Health Record Consortium. Each step reflects real architectural decisions that practitioners building similar systems encounter: how patient data stays at contributing sites, how cross-site deduplication works without sharing identifiable information, how administrative data from HMIS and corrections systems gets linked, how condition cohorts are defined using OMOP concept IDs, and how small-cell suppression protects individual privacy while preserving analytic utility.
This is not a slide deck or a diagram. The pipeline is connected to a live Django application with real data models, real suppression logic, and a real OMOP CDM schema. Where a record count is zero, no synthetic data has been loaded for that table in this public instance, but the infrastructure that would process it is in place.
If you are evaluating whether this kind of technical capability is relevant to your organization's work, the services page describes how this translates into consulting engagements.
Why this step?
Participating organizations in a federated network often run different EHR platforms (Epic, Cerner, Allscripts, and others). Without a common format, a query that finds "diabetes" at one system misses it at another. OMOP CDM creates a single shared vocabulary so the same analysis script produces comparable results at every site.
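A minimal sketch of the idea, assuming a toy lookup table: different source codes for the same diagnosis resolve to one OMOP standard concept ID, so a single query matches at every site. The `SOURCE_TO_STANDARD` dictionary, the `LOCAL_SITE_B` vocabulary, and the function name are hypothetical; a real ETL would read these mappings from the OMOP vocabulary tables rather than hard-coding them.

```python
# Toy source-to-standard map (illustrative only). In a real OMOP ETL these
# mappings come from the vocabulary tables, not from a hard-coded dict.
SOURCE_TO_STANDARD = {
    ("ICD10CM", "E11.9"): 201826,      # Type 2 diabetes mellitus
    ("ICD9CM", "250.00"): 201826,      # older coding of the same diagnosis
    ("LOCAL_SITE_B", "DM2"): 201826,   # hypothetical site-specific code
}

UNMAPPED = 0  # OMOP convention: concept_id 0 means "no matching concept"


def to_standard_concept(vocabulary: str, code: str) -> int:
    """Return the standard concept ID for a source code, or 0 if unmapped."""
    return SOURCE_TO_STANDARD.get((vocabulary, code), UNMAPPED)


# Three differently coded records all resolve to the same concept.
records = [("ICD10CM", "E11.9"), ("ICD9CM", "250.00"), ("LOCAL_SITE_B", "DM2")]
standard_ids = {to_standard_concept(vocab, code) for vocab, code in records}
print(standard_ids)
```

Once every site stores the standard concept ID, the analysis script filters on that ID alone and does not need to know which source vocabulary each site used.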
How it works
- Each site maintains its own OMOP database locally — no patient data leaves the site.
- The coordinating team distributes standardized analytic packages that run against each local OMOP database.
- Scripts return only aggregate counts — for example, the number of persons meeting a condition definition at a given site during the study period.
- Aggregates are transmitted to the Data Coordinating Center (DCC) for final analysis.
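The steps above can be sketched as a small site-local query. This is an illustration under stated assumptions, not the pipeline's actual analytic package: it uses an in-memory SQLite database in place of a site's OMOP database, a pared-down `condition_occurrence` table, and a one-concept codeset chosen for the demo.

```python
import sqlite3

# Hypothetical one-concept codeset (201826 = Type 2 diabetes mellitus).
DIABETES_CONCEPTS = (201826,)


def count_persons_with_condition(conn, concept_ids, start, end):
    """Count distinct persons with a qualifying condition in the study period."""
    placeholders = ",".join("?" for _ in concept_ids)
    sql = f"""
        SELECT COUNT(DISTINCT person_id)
        FROM condition_occurrence
        WHERE condition_concept_id IN ({placeholders})
          AND condition_start_date BETWEEN ? AND ?
    """
    (n_persons,) = conn.execute(sql, (*concept_ids, start, end)).fetchone()
    return n_persons


def build_demo_site():
    """Create a tiny stand-in for one site's local OMOP database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE condition_occurrence ("
        "person_id INTEGER, condition_concept_id INTEGER, "
        "condition_start_date TEXT)"
    )
    conn.executemany(
        "INSERT INTO condition_occurrence VALUES (?, ?, ?)",
        [
            (1, 201826, "2023-03-01"),
            (1, 201826, "2023-09-15"),  # same person, second record
            (2, 201826, "2021-05-20"),  # outside the study period
            (3, 316866, "2023-06-10"),  # different condition
        ],
    )
    return conn


# Only this aggregate leaves the site; no row-level data is returned.
site_result = {
    "site": "demo_site",
    "n_persons": count_persons_with_condition(
        build_demo_site(), DIABETES_CONCEPTS, "2023-01-01", "2023-12-31"
    ),
}
print(site_result)
```

The payload sent onward contains a site label and a count, which is the property the list above describes: the query runs where the data lives, and only the aggregate travels.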
What is never stored
No patient names, dates of birth, phone numbers, or Social Security numbers are persisted in the HTAC database. These PII fields exist only transiently during PPRL token generation (Step 2) and are immediately discarded — only the one-way cryptographic hash is retained.
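To illustrate the "hash and discard" pattern, here is a sketch that assumes a keyed HMAC-SHA256 over normalized name and date-of-birth fields. The actual PPRL scheme is described in Step 2 and may differ; the function names, the field choice, and the key handling here are assumptions made for the example.

```python
import hashlib
import hmac
import unicodedata


def normalize(value: str) -> str:
    """Lowercase, strip accents, and drop non-alphanumeric characters."""
    decomposed = unicodedata.normalize("NFKD", value)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return "".join(ch for ch in ascii_only.lower() if ch.isalnum())


def make_token(first_name: str, last_name: str, dob_iso: str, key: bytes) -> str:
    """Return a one-way keyed hash of the identifying fields.

    The PII arguments exist only for the duration of this call; the caller
    stores the returned hex digest and nothing else.
    """
    message = "|".join(normalize(v) for v in (first_name, last_name, dob_iso))
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical shared secret; a real deployment would manage this key securely.
demo_key = b"example-shared-secret"

token_a = make_token("José", "O'Brien", "1980-02-29", demo_key)
token_b = make_token(" jose ", "OBRIEN", "1980-02-29", demo_key)
print(token_a == token_b)
```

Normalizing before hashing lets trivially different spellings of the same person produce the same token, while the keyed hash means the token cannot be recomputed without the secret.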
Data Fields Captured Per Domain
This implementation captures six OMOP CDM tables. Each row in a clinical event table is one event linked to a Person and a HealthSystem. Concept IDs can be looked up in the OHDSI Athena vocabulary browser.
Person
1,320 rows

| Field | Type | Purpose |
|---|---|---|
| person_source_value | str | Site-internal patient ID — never shared across sites |
| gender_concept_id | int | Sex at birth — OMOP standard (8507 Male, 8532 Female) |
| year_of_birth | int | Birth year; used to compute age group stratifier |
| race_concept_id | int | OMOP standard race concept |
| ethnicity_concept_id | int | OMOP standard ethnicity (38003563 Hispanic or Latino) |
| preferred_language | str | Primary language — used as a stratification dimension |
| county_fips | str(5) | 5-digit county FIPS code for county-level prevalence |
| zip_code | str(5) | ZIP code for ZIP-level geographic estimates |
| census_tract | str(11) | 11-char FIPS census tract for finest geographic resolution |
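
Because only `year_of_birth` is stored, the age group stratifier is necessarily approximate. A minimal sketch of how it might be derived follows; the age bands and the function name are hypothetical, since this step does not list the pipeline's actual bands.

```python
from datetime import date

# Hypothetical age bands as (low, high, label); high=None means open-ended.
AGE_BANDS = [
    (0, 17, "0-17"),
    (18, 44, "18-44"),
    (45, 64, "45-64"),
    (65, None, "65+"),
]


def age_group(year_of_birth: int, as_of: date) -> str:
    """Map a birth year to an age band as of a reference date.

    Age is computed from years only, so it can be off by one for people
    whose birthday falls after the reference date.
    """
    age = as_of.year - year_of_birth
    if age < 0:
        raise ValueError("year_of_birth is after the reference date")
    for low, high, label in AGE_BANDS:
        if age >= low and (high is None or age <= high):
            return label
    raise ValueError(f"no age band covers age {age}")


reference = date(2024, 6, 30)
print(age_group(1980, reference), age_group(1959, reference))
```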
VisitOccurrence
2,474 rows

| Field | Type | Purpose |
|---|---|---|
| visit_concept_id | int | Encounter type (9201 Inpatient, 9202 Outpatient) |
| visit_start_date | date | Encounter start — defines the active-patient denominator in Step 6 |
| visit_end_date | date | Encounter end; null for same-day visits |
| visit_type_concept_id | int | How encounter was recorded (44818518 EHR encounter) |
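
The role of `visit_start_date` in the denominator can be sketched as follows. This assumes "active patient" means a person with at least one visit starting inside the study period; the precise definition belongs to Step 6 and may differ, and the function name is illustrative.

```python
from datetime import date


def active_patient_denominator(visits, period_start, period_end):
    """Count distinct persons with at least one visit starting in the period.

    `visits` is an iterable of (person_id, visit_start_date) pairs.
    """
    active = {
        person_id
        for person_id, start in visits
        if period_start <= start <= period_end
    }
    return len(active)


demo_visits = [
    (1, date(2023, 1, 5)),
    (1, date(2023, 8, 2)),    # same person, counted once
    (2, date(2022, 12, 31)),  # one day before the period starts
    (3, date(2023, 12, 31)),  # last day of the period, included
]
denominator = active_patient_denominator(
    demo_visits, date(2023, 1, 1), date(2023, 12, 31)
)
print(denominator)
```

Counting persons rather than visits keeps frequent visitors from inflating the denominator.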
ConditionOccurrence
681 rows

| Field | Type | Purpose |
|---|---|---|
| condition_concept_id | int | OMOP standard diagnosis concept (SNOMED CT / ICD-10-CM) |
| condition_start_date | date | Onset or first-recorded date — used for study-period filtering |
| condition_end_date | date | Resolution date; null for chronic conditions |
| condition_type_concept_id | int | Provenance (32817 EHR encounter diagnosis) |
DrugExposure
681 rows

| Field | Type | Purpose |
|---|---|---|
| drug_concept_id | int | OMOP standard drug concept (RxNorm — Metformin, Buprenorphine, etc.) |
| drug_exposure_start_date | date | Prescription or dispensing start date |
| drug_exposure_end_date | date | End of supply; derived from days_supply when available |
| drug_type_concept_id (required) | int | Provenance (32817 EHR encounter, 38000180 Prescription written) |
| quantity | decimal | Units dispensed (tablets, mL, etc.) |
| days_supply | int | Days of medication in the dispense (30 / 60 / 90 day fills) |
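
A sketch of deriving `drug_exposure_end_date` from `days_supply`, assuming the start date counts as day 1 of the supply (so a 30-day fill starting January 1 ends January 30). That convention and the fallback to the start date when `days_supply` is missing are assumptions for illustration, not confirmed behavior of this implementation.

```python
from datetime import date, timedelta
from typing import Optional


def derive_end_date(start: date, days_supply: Optional[int]) -> date:
    """Derive an exposure end date from the start date and days of supply.

    Assumes the start date is day 1 of the supply. When days_supply is
    missing or not positive, falls back to a single-day exposure.
    """
    if days_supply is not None and days_supply > 0:
        return start + timedelta(days=days_supply - 1)
    return start


print(derive_end_date(date(2024, 1, 1), 30))
print(derive_end_date(date(2024, 1, 1), None))
```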
Measurement
430 rows

| Field | Type | Purpose |
|---|---|---|
| measurement_concept_id | int | OMOP standard lab / vital concept (LOINC — HbA1c, BMI, Creatinine, etc.) |
| measurement_date | date | Date of lab draw or vital sign collection |
| measurement_type_concept_id (required) | int | Provenance (44818702 Lab result, 44818701 Vital sign) |
| value_as_number | decimal | Numeric result (e.g., 7.4 for HbA1c %, 142 for systolic BP mmHg) |
| value_as_concept_id | int | Coded result for qualitative measurements (Positive / Negative) |
| unit_concept_id | int | Unit of measure (UCUM vocabulary — mg/dL, mmHg, kg/m²) |
Observation
0 rows

| Field | Type | Purpose |
|---|---|---|
| observation_concept_id | int | OMOP standard observation concept (social history, functional status, etc.) |
| observation_date | date | Date the observation was recorded |
| observation_type_concept_id (required) | int | Provenance (38000280 Observation recorded from EHR) |
| value_as_string | str | Text value (e.g., tobacco use status) |
| value_as_number (required) | decimal | Numeric value when observation has a quantitative component |
| value_as_concept_id | int | Coded value (e.g., current smoker concept) |
| unit_concept_id (required) | int | Unit of measure when value_as_number is populated |
Fields marked "required" are mandatory in OMOP CDM v5.4 and were added to this implementation to match the standard. Supported vocabularies: SNOMED CT, ICD-10-CM, RxNorm, LOINC, NDC, CPT-4, HCPCS, ICD-10-PCS.
Simulation Record Counts by Site
Live counts from the current synthetic dataset. In production, only aggregate counts like these — never patient records — travel to the DCC.
| Health System | Persons | Visits | Conditions | Drug Exposures | Measurements | Observations |
|---|---|---|---|---|---|---|
| Allina Health | 176 | 364 | 94 | 94 | 62 | 0 |
| CentraCare | 86 | 148 | 41 | 41 | 26 | 0 |
| Children's Regional Health | 77 | 77 | 8 | 8 | 0 | 0 |
| Essentia Health | 135 | 258 | 68 | 68 | 44 | 0 |
| HealthPartners | 170 | 341 | 101 | 101 | 59 | 0 |
| Hennepin Healthcare | 89 | 165 | 55 | 55 | 37 | 0 |
| M Health Fairview | 169 | 356 | 95 | 95 | 56 | 0 |
| Mayo Clinic | 150 | 316 | 89 | 89 | 57 | 0 |
| Minneapolis VA | 90 | 140 | 45 | 45 | 26 | 0 |
| North Memorial Health | 72 | 101 | 26 | 26 | 20 | 0 |
| Sanford Health | 106 | 208 | 59 | 59 | 43 | 0 |
| Total (11 sites) | 1,320 | 2,474 | 681 | 681 | 430 | 0 |
Observations are zero because the 22 HTAC condition codesets currently use only the Condition, Drug, and Measurement domains. The Observation domain would capture social determinants (housing status, tobacco use) once future codeset expansions include them.
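
The per-site rows in the table above can be combined into the totals row with a few lines of code, which is roughly what a DCC-side combine step might look like. The function name and data layout are illustrative rather than the pipeline's actual code; the numbers are copied from the table.

```python
DOMAINS = (
    "persons", "visits", "conditions",
    "drug_exposures", "measurements", "observations",
)

# Aggregate counts per site, in the column order of the table.
SITE_COUNTS = {
    "Allina Health": (176, 364, 94, 94, 62, 0),
    "CentraCare": (86, 148, 41, 41, 26, 0),
    "Children's Regional Health": (77, 77, 8, 8, 0, 0),
    "Essentia Health": (135, 258, 68, 68, 44, 0),
    "HealthPartners": (170, 341, 101, 101, 59, 0),
    "Hennepin Healthcare": (89, 165, 55, 55, 37, 0),
    "M Health Fairview": (169, 356, 95, 95, 56, 0),
    "Mayo Clinic": (150, 316, 89, 89, 57, 0),
    "Minneapolis VA": (90, 140, 45, 45, 26, 0),
    "North Memorial Health": (72, 101, 26, 26, 20, 0),
    "Sanford Health": (106, 208, 59, 59, 43, 0),
}


def combine_site_counts(site_counts):
    """Sum each domain's count across sites; inputs are aggregates only."""
    columns = zip(*site_counts.values())
    return {domain: sum(column) for domain, column in zip(DOMAINS, columns)}


totals = combine_site_counts(SITE_COUNTS)
print(len(SITE_COUNTS), totals)
```

The combine step never sees a patient row: its only inputs are the six integers each site reports.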