HTAC Pipeline · Step 1 of 7

Clinical Data in OMOP CDM

Pipeline steps: Clinical Data · PPRL Tokens · Deduplication · Enrichment · Cohort Queries · Suppression · Results

The pipeline begins at each health system's local EHR database, where patient data is standardized into the OMOP Common Data Model (CDM), the shared data model and vocabulary maintained by the OHDSI community and used by N3C and over 300 research networks worldwide.

What This Pipeline Demonstrates

This walkthrough steps through a working implementation of the federated population health data architecture used by statewide EHR consortia — including the Minnesota Electronic Health Record Consortium. Each step reflects real architectural decisions that practitioners building similar systems encounter: how patient data stays at contributing sites, how cross-site deduplication works without sharing identifiable information, how administrative data from HMIS and corrections systems gets linked, how condition cohorts are defined using OMOP concept IDs, and how small-cell suppression protects individual privacy while preserving analytic utility.

This is not a slide deck or a diagram. The pipeline is connected to a live Django application with real data models, real suppression logic, and a real OMOP CDM schema. The record counts on this page come from a synthetic dataset, not real patient records, but the infrastructure that would process real data is here.

If you are evaluating whether this kind of technical capability is relevant to your organization's work, the services page describes how this translates into consulting engagements.

Why this step?

Participating organizations in a federated network often run different EHR platforms (Epic, Cerner, Allscripts, and others). Without a common format, a query that finds "diabetes" at one system misses it at another. OMOP CDM creates a single shared vocabulary so the same analysis script produces comparable results at every site.
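
As a minimal sketch of why a shared vocabulary matters, the example below maps two different local codings of the same diagnosis to one standard concept ID. The `SOURCE_TO_STANDARD` table and the `to_standard_concept` helper are hypothetical stand-ins for illustration; a real ETL would look these mappings up in the OMOP vocabulary tables rather than a hand-written dictionary.

```python
# Hypothetical, hand-written stand-in for the OMOP source-to-standard mapping.
# A real ETL would read these rows from the OMOP vocabulary tables.
SOURCE_TO_STANDARD = {
    ("ICD10CM", "E11.9"): 201826,    # Type 2 diabetes mellitus
    ("SNOMED", "44054006"): 201826,  # Type 2 diabetes mellitus
}


def to_standard_concept(vocabulary_id: str, source_code: str) -> int:
    """Return the standard concept ID, or 0 (OMOP's 'no matching concept') if unmapped."""
    return SOURCE_TO_STANDARD.get((vocabulary_id, source_code), 0)


# Two sites record the same diagnosis with different source codes...
site_a = to_standard_concept("ICD10CM", "E11.9")
site_b = to_standard_concept("SNOMED", "44054006")

# ...but both resolve to the same standard concept, so one query finds both.
print(site_a == site_b)
```

Once both sites store the standard concept ID, a single cohort definition written against that ID behaves the same way at each site.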

How it works

Each site maintains its own OMOP database locally — no patient data leaves the site.
The coordinating team distributes standardized analytic packages that run against each local OMOP database.
Scripts return only aggregate counts — for example, the number of persons meeting a condition definition at a given site during the study period.
Aggregates are transmitted to the Data Coordinating Center (DCC) for final analysis.
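
The steps above can be sketched as a site-local function that queries the local database and returns a single aggregate. This is an illustration under stated assumptions, not the HTAC analytic package: the in-memory SQLite database, the `count_persons_with_condition` helper, the `person_id` column, and the `site_result` payload shape are all hypothetical.

```python
import sqlite3


def count_persons_with_condition(conn, concept_ids, start, end):
    """Return one aggregate: distinct persons with a qualifying condition in the period."""
    placeholders = ",".join("?" for _ in concept_ids)
    sql = (
        "SELECT COUNT(DISTINCT person_id) FROM condition_occurrence "
        f"WHERE condition_concept_id IN ({placeholders}) "
        "AND condition_start_date BETWEEN ? AND ?"
    )
    (count,) = conn.execute(sql, [*concept_ids, start, end]).fetchone()
    return count


# Tiny in-memory stand-in for one site's local OMOP database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE condition_occurrence ("
    "person_id INTEGER, condition_concept_id INTEGER, condition_start_date TEXT)"
)
conn.executemany(
    "INSERT INTO condition_occurrence VALUES (?, ?, ?)",
    [
        (1, 201826, "2025-03-01"),
        (1, 201826, "2025-06-15"),  # same person, second record
        (2, 201826, "2025-09-30"),
        (3, 201826, "2023-01-10"),  # outside the study period
        (4, 316866, "2025-05-05"),  # a different condition concept
    ],
)

# Only this aggregate would leave the site; no row-level data is included.
site_result = {
    "site": "example-site",
    "persons_with_condition": count_persons_with_condition(
        conn, [201826], "2025-01-01", "2025-12-31"
    ),
}
print(site_result)
```

Dates are stored as ISO 8601 strings here so that SQLite's text comparison orders them correctly; a production database would use a native date type.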

What is never stored

No patient names, dates of birth, phone numbers, or Social Security numbers are persisted in the HTAC database. These PII fields exist only transiently during PPRL token generation (Step 2) and are immediately discarded — only the one-way cryptographic hash is retained.
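
The idea of keeping only a one-way hash can be sketched as follows. The source states only that a one-way cryptographic hash is retained; the keyed HMAC-SHA256 construction, the normalization rule, the `make_token` name, and the secret value below are assumptions made for illustration, and the actual PPRL scheme described in Step 2 may differ.

```python
import hashlib
import hmac


def make_token(first_name: str, last_name: str, dob: str, secret: bytes) -> str:
    """Derive a one-way token from normalized PII; the PII itself is not returned."""
    normalized = "|".join(
        part.strip().lower() for part in (first_name, last_name, dob)
    )
    return hmac.new(secret, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical secret; in practice this would come from secure key management.
secret = b"hypothetical-shared-secret"

# The same person entered with different spacing and casing yields the same token.
token_a = make_token("Ana", "Lopez", "1980-02-29", secret)
token_b = make_token(" ana ", "LOPEZ", "1980-02-29", secret)
print(token_a == token_b)
```

Only the returned token would be persisted. The name and date of birth exist as function arguments for the duration of the call and are not written anywhere.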

Data Fields Captured Per Domain

This implementation uses six OMOP CDM tables. Each Person row is one patient linked to a HealthSystem, and each row in the other tables is one clinical event linked to a Person. Concept IDs can be looked up in the OHDSI Athena vocabulary browser.

Person

1,320 rows

Field	Type	Purpose
person_source_value	str	Site-internal patient ID — never shared across sites
gender_concept_id	int	Sex at birth — OMOP standard (8507 Male, 8532 Female)
year_of_birth	int	Birth year; used to compute age group stratifier
race_concept_id	int	OMOP standard race concept
ethnicity_concept_id	int	OMOP standard ethnicity (38003563 Hispanic or Latino)
preferred_language	str	Primary language — used as a stratification dimension
county_fips	str(5)	5-digit county FIPS code for county-level prevalence
zip_code	str(5)	ZIP code for ZIP-level geographic estimates
census_tract	str(11)	11-char FIPS census tract for finest geographic resolution
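
As a sketch of how `year_of_birth` could feed an age-group stratifier, the function below buckets an approximate age. The `age_group` name, the bin boundaries, and the use of a reference year are assumptions for illustration; the source does not state which age bands HTAC uses.

```python
def age_group(year_of_birth: int, reference_year: int) -> str:
    """Bucket an approximate age (reference year minus birth year) into a stratum."""
    age = reference_year - year_of_birth
    if age < 0:
        raise ValueError("year_of_birth is after the reference year")
    # Hypothetical bands; substitute the bands used by the actual study.
    if age < 18:
        return "0-17"
    if age < 45:
        return "18-44"
    if age < 65:
        return "45-64"
    return "65+"


print(age_group(1961, 2025))
```

Because only the birth year is stored, the computed age can be off by up to one year, which is usually acceptable for coarse strata.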

VisitOccurrence

2,474 rows

Field	Type	Purpose
visit_concept_id	int	Encounter type (9201 Inpatient, 9202 Outpatient)
visit_start_date	date	Encounter start — defines the active-patient denominator in Step 6
visit_end_date	date	Encounter end; null for same-day visits
visit_type_concept_id	int	How encounter was recorded (44818518 EHR encounter)
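
The role of `visit_start_date` in defining an active-patient denominator can be sketched as a count of distinct persons with at least one visit in the study period. The exact rule used in Step 6 is not given here, so the inclusive date window and the `active_patients` helper are assumptions.

```python
from datetime import date


def active_patients(visits, period_start, period_end):
    """Count distinct persons with at least one visit starting in the period."""
    return len({
        person_id
        for person_id, visit_start_date in visits
        if period_start <= visit_start_date <= period_end
    })


# (person_id, visit_start_date) pairs standing in for VisitOccurrence rows.
visits = [
    (1, date(2025, 1, 15)),
    (1, date(2025, 8, 2)),    # same person, counted once
    (2, date(2024, 12, 31)),  # before the period
    (3, date(2025, 12, 31)),  # last day of the period, included
]
denominator = active_patients(visits, date(2025, 1, 1), date(2025, 12, 31))
print(denominator)
```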

ConditionOccurrence

681 rows

Field	Type	Purpose
condition_concept_id	int	OMOP standard diagnosis concept (SNOMED CT; ICD-10-CM source codes are mapped to it)
condition_start_date	date	Onset or first-recorded date — used for study-period filtering
condition_end_date	date	Resolution date; null for chronic conditions
condition_type_concept_id	int	Provenance (32817 EHR encounter diagnosis)

DrugExposure

681 rows

Field	Type	Purpose
drug_concept_id	int	OMOP standard drug concept (RxNorm — Metformin, Buprenorphine, etc.)
drug_exposure_start_date	date	Prescription or dispensing start date
drug_exposure_end_date	date	End of supply; derived from days_supply when available
drug_type_concept_id (required)	int	Provenance (32817 EHR encounter, 38000180 Prescription written)
quantity	decimal	Units dispensed (tablets, mL, etc.)
days_supply	int	Days of medication in the dispense (30 / 60 / 90 day fills)
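
Deriving `drug_exposure_end_date` from `days_supply` can be sketched as simple date arithmetic. The version below counts the start date as day 1, a common OMOP convention; whether this implementation follows the same rule is an assumption, and `derive_end_date` is a hypothetical helper.

```python
from datetime import date, timedelta
from typing import Optional


def derive_end_date(start: date, days_supply: Optional[int]) -> Optional[date]:
    """Return the last covered day, or None when days_supply is missing or invalid."""
    if days_supply is None or days_supply < 1:
        return None
    # The start date counts as the first day of supply, hence the minus one.
    return start + timedelta(days=days_supply - 1)


print(derive_end_date(date(2025, 1, 1), 30))
```

A 30-day fill starting on January 1 therefore ends on January 30, and a missing `days_supply` leaves the end date unset rather than guessed.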

Measurement

430 rows

Field	Type	Purpose
measurement_concept_id	int	OMOP standard lab / vital concept (LOINC — HbA1c, BMI, Creatinine, etc.)
measurement_date	date	Date of lab draw or vital sign collection
measurement_type_concept_id (required)	int	Provenance (44818702 Lab result, 44818701 Vital sign)
value_as_number	decimal	Numeric result (e.g., 7.4 for HbA1c %, 142 for systolic BP mmHg)
value_as_concept_id	int	Coded result for qualitative measurements (Positive / Negative)
unit_concept_id	int	Unit of measure (UCUM vocabulary — mg/dL, mmHg, kg/m²)

Observation

0 rows

Field	Type	Purpose
observation_concept_id	int	OMOP standard observation concept (social history, functional status, etc.)
observation_date	date	Date the observation was recorded
observation_type_concept_id (required)	int	Provenance (38000280 Observation recorded from EHR)
value_as_string	str	Text value (e.g., tobacco use status)
value_as_number (required)	decimal	Numeric value when observation has a quantitative component
value_as_concept_id	int	Coded value (e.g., current smoker concept)
unit_concept_id (required)	int	Unit of measure when value_as_number is populated

Fields marked "required" are mandatory in OMOP CDM v5.4 and were added to this implementation to match the standard. Supported vocabularies: SNOMED CT, ICD-10-CM, RxNorm, LOINC, NDC, CPT-4, HCPCS, ICD-10-PCS.

Simulation Record Counts by Site

Live counts from the current synthetic dataset. In production, only aggregate counts like these — never patient records — travel to the DCC.

Health System	Persons	Visits	Conditions	Drug Exposures	Measurements
Allina Health	176	364	94	94	62
CentraCare	86	148	41	41	26
Children's Regional Health	77	77	8	8	0
Essentia Health	135	258	68	68	44
HealthPartners	170	341	101	101	59
Hennepin Healthcare	89	165	55	55	37
M Health Fairview	169	356	95	95	56
Mayo Clinic	150	316	89	89	57
Minneapolis VA	90	140	45	45	26
North Memorial Health	72	101	26	26	20
Sanford Health	106	208	59	59	43
Total (11 sites)	1,320	2,474	681	681	430
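
The totals row can be reproduced by summing the per-site aggregates, which is the only kind of data the DCC would receive. This is a minimal sketch using the counts from the table above; the `combine` helper and the tuple layout are illustrative choices, not the HTAC implementation.

```python
FIELDS = ("persons", "visits", "conditions", "drug_exposures", "measurements")

# Per-site aggregates copied from the table above.
SITE_COUNTS = {
    "Allina Health": (176, 364, 94, 94, 62),
    "CentraCare": (86, 148, 41, 41, 26),
    "Children's Regional Health": (77, 77, 8, 8, 0),
    "Essentia Health": (135, 258, 68, 68, 44),
    "HealthPartners": (170, 341, 101, 101, 59),
    "Hennepin Healthcare": (89, 165, 55, 55, 37),
    "M Health Fairview": (169, 356, 95, 95, 56),
    "Mayo Clinic": (150, 316, 89, 89, 57),
    "Minneapolis VA": (90, 140, 45, 45, 26),
    "North Memorial Health": (72, 101, 26, 26, 20),
    "Sanford Health": (106, 208, 59, 59, 43),
}


def combine(site_counts):
    """Sum each column across sites and label the result with FIELDS."""
    columns = zip(*site_counts.values())
    return dict(zip(FIELDS, (sum(column) for column in columns)))


totals = combine(SITE_COUNTS)
print(len(SITE_COUNTS), totals)
```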

Observation counts are zero because the 22 HTAC condition codesets currently draw only on the Condition, Drug, and Measurement domains. A future codeset expansion could use the Observation domain to capture social determinants such as housing status and tobacco use.

The latest pipeline demonstration completed on May 14, 2026 at 20:05. Counts on this page reflect synthetic federated data from that run.
