Healthcare Data Governance

From One Radiology Site to Five: De-identification and Data-Sharing on Databricks without Breaking HIPAA

Mid-market radiology groups must share real-world imaging for AI research without exposing PHI. This article outlines a governed, agentic approach on Databricks—combining DICOM de-identification, pixel redaction, Unity Catalog policies, approvals, watermarking, and revocation—to scale from one to five sites without breaking HIPAA. It also provides a 30/60/90-day plan, governance controls, ROI metrics, and common pitfalls to avoid.

• 8 min read

1. Problem / Context

Mid-market radiology groups face a paradox: imaging AI vendors need real-world studies to validate models, but exporting DICOM studies can expose protected health information (PHI) in headers or even burned into pixels. With mixed PACS vendors and lean IT teams, manual exports and ad‑hoc scripts don’t scale—and the compliance risk is untenable. Our scenario: a HIPAA-regulated radiology group expanded from one to five sites, each with slightly different PACS configurations, and needed a reliable way to de-identify and share research datasets on Databricks without triggering HIPAA violations or endless review cycles.

2. Key Definitions & Concepts

  • DICOM de-identification: Removing or transforming PHI present in DICOM headers and image pixels while preserving clinical utility (e.g., modality, study description, acquisition parameters).
  • PHI in pixels: Burned-in annotations (names, MRNs, dates) that require content-aware detection, often via OCR or computer vision to mask/redact.
  • Agentic AI: A governed set of software agents that plan, execute, and coordinate tasks—such as extracting studies, detecting PHI, applying de-identification policies, routing approvals, and managing revocations—while maintaining full auditability.
  • Databricks Lakehouse: The unified platform used to orchestrate data pipelines, enforce governance via Unity Catalog, and securely share data products to external partners.
  • Approval and watermarking: Human-in-the-loop checkpoints for sensitive exports and traceable watermarks so each shared asset is linked to a purpose, dataset, and recipient.

3. Why This Matters for Mid-Market Regulated Firms

Radiology groups in the $50M–$300M range operate with thin compliance and engineering benches. They must demonstrate zero PHI leakage, provide audit trails on every export, and still move quickly enough to keep vendor partnerships and research timelines on track. Naive RPA or manual click-through exports fail in this environment because they are not content-aware, cannot enforce standardized approvals, and leave little evidence for auditors. A governed, agentic approach on Databricks lets teams move fast with guardrails: standardized templates per site, Unity Catalog policies, and automated logging that satisfies HIPAA scrutiny.

Kriv AI, as a governed AI and agentic automation partner, helps mid-market teams establish this backbone—so data moves only when policies, approvals, and controls say it can, and every step is recorded.

4. Practical Implementation Steps / Roadmap

  1. Ingest and normalize
    • Use a secure DICOM router or gateway to land studies in cloud storage partitioned by site and modality.
    • Create a bronze/silver structure on Databricks: bronze for raw DICOM (with restricted access), silver for de-identified outputs. Register datasets in Unity Catalog from day one.
  2. Agentic orchestration
    • An ingestion agent watches for new studies and triggers a de-identification plan based on site template.
    • A PHI-detection agent inspects headers against a governed rule set and performs pixel-level checks using OCR/vision models for burned-in text.
  3. Policy-driven de-identification
    • Apply per-field transformations (remove, hash, pseudonymize) to DICOM tags; preserve research-critical attributes (SOP Class, Study/Series descriptions) per policy.
    • For pixels, mask bounding boxes containing detected PHI; log confidence scores and before/after artifacts for audit.
  4. Human-in-the-loop approvals
    • Route flagged studies (low confidence or unusual tags) to a compliance queue. Approvers review sampled thumbnails and tag diffs within Databricks notebooks or a simple app connected to the lakehouse.
  5. Watermark and export
    • Watermark output images/metadata with dataset ID, partner, purpose, and expiration. Store lineage in Unity Catalog tables.
    • Share via controlled mechanisms (e.g., Delta Sharing or signed exports) with time-bounded entitlements.
  6. Revocation and recall
    • Maintain a revocation list: if a study must be withdrawn, the agent updates entitlements, triggers recall notifications, and logs actions for audit.
  7. Site-by-site templates
    • Start with one site’s PACS profile. Encode quirks (custom tags, modality defaults) into a template; replicate and tweak for additional sites. Agents reference the correct template automatically.
  8. Continuous auditability
    • Every step writes to an immutable audit log with user, time, policy version, and outcome—critical for HIPAA and internal review.
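The policy-driven de-identification in step 3 can be sketched as a small policy engine. This is a minimal illustration, not a production de-identifier: the `POLICY` table, tag names, and `SITE_SALT` below are hypothetical, and a real pipeline would operate on DICOM (group, element) tags via a library such as pydicom, with the policy loaded from a versioned, governed store.

```python
import hashlib

# Hypothetical, versioned policy: tag -> action. A real policy would key on
# DICOM (group, element) tags and carry a policy version for audit linkage.
POLICY = {
    "PatientName":      "remove",
    "PatientID":        "pseudonymize",
    "StudyDate":        "hash",
    "Modality":         "keep",   # research-critical attribute, preserved per policy
    "StudyDescription": "keep",
}
SITE_SALT = "site-01-demo-salt"  # per-site salt (illustrative only)

def deidentify(header: dict, policy: dict = POLICY) -> dict:
    """Apply per-field transformations to a DICOM-like header dict."""
    out = {}
    for tag, value in header.items():
        # Default-deny: tags not named in the policy are dropped entirely.
        action = policy.get(tag, "remove")
        if action == "keep":
            out[tag] = value
        elif action == "hash":
            out[tag] = hashlib.sha256(f"{SITE_SALT}:{value}".encode()).hexdigest()[:16]
        elif action == "pseudonymize":
            digest = hashlib.sha256(f"{SITE_SALT}:{tag}:{value}".encode()).hexdigest()[:8]
            out[tag] = f"ANON-{digest}"
        # action == "remove": tag is omitted
    return out

header = {"PatientName": "DOE^JANE", "PatientID": "MRN12345",
          "StudyDate": "20240115", "Modality": "MR",
          "StudyDescription": "BRAIN W/O CONTRAST", "PrivateTagX": "leaky"}
clean = deidentify(header)
```

The default-deny stance matters with mixed PACS vendors: any private or unexpected tag is stripped unless a site template explicitly whitelists it.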

[IMAGE SLOT: agentic AI workflow diagram showing DICOM ingestion, PHI detection (headers and pixels), human-in-the-loop approval, watermarking, Unity Catalog governance, and secure data sharing]

5. Governance, Compliance & Risk Controls Needed

  • Access control via Unity Catalog: Least-privilege groups for raw (bronze) vs de-identified (silver) data. Only agents and approvers can touch raw.
  • Approved policy library: Versioned de-identification policies with change control and rollback. Link each export to the policy version used.
  • Audit trails: Immutable logs capturing PHI detections, redaction decisions, approvals, watermarks, and sharing events.
  • Vendor isolation: Separate workspaces or catalogs for external sharing; entitlements tied to named, purpose-limited shares with expiration.
  • Model and tool validation: Validate OCR/vision components on representative modalities (CT, MR, US, MG) with documented sensitivity/specificity for PHI detection.
  • Incident playbooks: Clear steps for suspected leakage—freeze shares, invoke revocation list, reprocess affected studies, and notify stakeholders.
  • Change management: Any template or policy update is peer-reviewed and recorded before agents use it in production.
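The immutable audit-trail control above can be made tamper-evident with a hash chain, where each record commits to its predecessor. The sketch below shows the idea with illustrative field names; in production these records would live in append-only Unity Catalog tables with Delta lineage rather than in memory.

```python
import hashlib
import json
from datetime import datetime, timezone

class AuditLog:
    """Append-only log; each entry's hash covers the previous entry's hash,
    so any retroactive edit breaks the chain on verification."""

    def __init__(self):
        self.entries = []

    def append(self, actor: str, action: str, policy_version: str, outcome: str):
        prev_hash = self.entries[-1]["hash"] if self.entries else "GENESIS"
        record = {
            "ts": datetime.now(timezone.utc).isoformat(),
            "actor": actor,
            "action": action,
            "policy_version": policy_version,  # links the export to the policy used
            "outcome": outcome,
            "prev_hash": prev_hash,
        }
        record["hash"] = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode()).hexdigest()
        self.entries.append(record)

    def verify(self) -> bool:
        """Recompute every hash and check the chain end to end."""
        prev = "GENESIS"
        for e in self.entries:
            if e["prev_hash"] != prev:
                return False
            body = {k: v for k, v in e.items() if k != "hash"}
            if hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest() != e["hash"]:
                return False
            prev = e["hash"]
        return True

log = AuditLog()
log.append("deid-agent", "redact_pixels", "policy-v3", "masked 2 regions")
log.append("approver@example.org", "approve_export", "policy-v3", "approved")
```

Capturing the policy version in every record is what lets an auditor reconstruct exactly which rules governed a given export.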

Kriv AI often supports clients by hardening these controls—data readiness, MLOps hygiene, and governance guardrails—so that scaling from one site to five remains auditable and consistent.

[IMAGE SLOT: governance and compliance control map highlighting Unity Catalog access tiers, audit logs, approval checkpoints, and revocation workflows]

6. ROI & Metrics

Mid-market leaders need measurable gains, not slideware. In this case, outcomes were tracked from the first pilot through multi-site scale:

  • Dataset preparation time: Reduced from multiple weeks to roughly 3 days for a typical research cohort.
  • PHI incidents: Reduced to zero across shared datasets, backed by audit evidence.
  • Analyst time: 60% reduction through automation of repetitive export, tagging, and QA tasks.

How to measure and communicate this value:

  • Cycle time reduction: Measure mean time from dataset request to partner-ready export across cohorts; target a >50% reduction.
  • Error rate: Track PHI detection misses found in approval sampling; drive toward zero through model tuning and policy refinement.
  • Labor savings: Calculate analyst hours avoided per 1,000 studies processed; multiply by loaded labor cost to show monthly savings.
  • Throughput: Studies processed per day per site before/after automation.
  • Compliance assurance: Number of complete audit records per export and time to assemble evidence for audits.

These metrics translate directly into payback: for example, if automation saves 120 analyst hours per month across five sites, at a $70/hour loaded cost, that’s ~$8,400/month in labor alone—before considering reduced project delays and compliance risk avoidance.
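The payback arithmetic above is easy to template for your own numbers; the inputs below are this article's illustrative figures, not benchmarks.

```python
# Illustrative inputs from the example above -- substitute your own measurements.
hours_saved_per_month = 120   # analyst hours avoided across five sites
loaded_hourly_cost = 70.0     # USD, fully loaded labor cost

monthly_labor_savings = hours_saved_per_month * loaded_hourly_cost
annual_labor_savings = monthly_labor_savings * 12

print(f"${monthly_labor_savings:,.0f}/month, "
      f"${annual_labor_savings:,.0f}/year in labor alone")
```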

[IMAGE SLOT: ROI dashboard visualizing cycle-time reduction, analyst hours saved, zero PHI incidents, and throughput across five sites]

7. Common Pitfalls & How to Avoid Them

  • Naive RPA exports: Screen-scraping or manual PACS downloads miss burned-in PHI and lack auditability. Use content-aware agents with policy controls and approvals.
  • One-off scripts: Hard-coded tag lists break when sites add modalities or custom fields. Externalize de-identification rules in versioned policies.
  • Ignoring pixels: Header-only de-identification is insufficient. Validate OCR/vision on your modalities and set confidence thresholds that trigger review.
  • No revocation path: Partners will occasionally need studies withdrawn. Maintain a revocation list and build recall into the sharing mechanism.
  • Pilot graveyard: Compliance halts scale when governance isn’t baked in. Use Unity Catalog policies, automated audit logs, and incremental templates to unlock approvals.
  • Vendor lock-in: Favor standards-based sharing (e.g., Delta Sharing) and export formats with watermarks and lineage so you can switch tools without rework.
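Two of the pitfalls above (header-only de-identification and missing review thresholds) come down to one routing decision: high-confidence pixel detections are auto-masked, low-confidence ones go to the compliance queue. A minimal sketch, where the detection tuples and threshold are illustrative; real detections would come from a validated OCR/vision model and real pixels from the decoded DICOM frame:

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """A candidate burned-in PHI region from an OCR/vision model (illustrative)."""
    x: int
    y: int
    w: int
    h: int
    text: str
    confidence: float

REVIEW_THRESHOLD = 0.90  # below this, a human approver must decide

def route(detections):
    """Split detections into auto-mask vs human-review buckets."""
    auto_mask = [d for d in detections if d.confidence >= REVIEW_THRESHOLD]
    needs_review = [d for d in detections if d.confidence < REVIEW_THRESHOLD]
    return auto_mask, needs_review

def mask(pixels, regions, fill=0):
    """Zero out bounding boxes in a row-major 2D pixel list."""
    for d in regions:
        for row in range(d.y, d.y + d.h):
            for col in range(d.x, d.x + d.w):
                pixels[row][col] = fill
    return pixels

dets = [Detection(0, 0, 3, 1, "DOE^JANE", 0.98),
        Detection(2, 2, 2, 1, "01/15/24", 0.71)]
auto, review = route(dets)          # 1 auto-masked, 1 routed to review
img = [[255] * 5 for _ in range(4)]
img = mask(img, auto)
```

Nothing below the threshold is silently dropped: the low-confidence detection stays unmasked only until an approver rules on it, and both the score and the decision land in the audit log.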

8. 30/60/90-Day Start Plan

First 30 Days

  • Discovery and scoping: Inventory PACS vendors, modalities, data volumes, and current export practices per site.
  • Governance boundaries: Define approved research purposes, approver roles, and data sharing policies in Unity Catalog.
  • Technical baseline: Stand up Databricks workspace(s), configure secure storage, and register initial bronze/silver catalogs.
  • Policy seeds: Draft de-identification policies for headers and pixels; outline sampling and approval thresholds.

Days 31–60

  • Pilot workflows: Implement agentic pipeline on a single site with 1–2 modalities. Turn on PHI detection, pixel redaction, and approval routing.
  • Security controls: Enforce least-privilege access; lock raw data to agent service principals; validate audit logging.
  • Evaluation: Track cycle time, PHI detection performance, and approval outcomes; tune policies and OCR/vision thresholds.
  • Watermarking and sharing: Enable time-bounded, purpose-limited shares to a single vendor; test revocation and recall.

Days 61–90

  • Scale templates: Generalize the site template; extend to two more sites with minimal changes.
  • Monitoring and metrics: Stand up ROI dashboards for cycle time, analyst hours saved, and incident counts; automate weekly reviews.
  • Stakeholder alignment: Present results to compliance, radiology leadership, and vendor partners; formalize go/no-go for full five-site rollout.
  • Hardening: Implement change management for policies, add disaster recovery drills, and document incident playbooks.

9. Industry-Specific Considerations

  • Mixed PACS vendors: Expect variations in private tags and naming. Maintain a per-site tag dictionary and map into the policy engine.
  • Modality nuances: Mammography (MQSA) and ultrasound often have burned-in text; prioritize pixel redaction QA for these.
  • IRB and BAAs: Ensure research approvals and business associate agreements explicitly cover de-identification process, sharing method, and revocation rights.
  • Expert determination vs. Safe Harbor: For certain research uses, consider expert determination to preserve more utility while staying compliant.
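The per-site tag dictionary described above is typically a thin overlay on a shared base policy, which is what makes the one-to-five-site scaling tractable. A sketch with hypothetical site IDs and tags:

```python
# Shared base de-identification policy (tag -> action), illustrative.
BASE_POLICY = {
    "PatientName": "remove",
    "PatientID":   "pseudonymize",
    "Modality":    "keep",
}

# Per-site overrides capture PACS quirks: private tags, extra fields to keep.
SITE_OVERRIDES = {
    "site-02": {"(0x0009,0x1001)": "remove"},   # hypothetical vendor private tag
    "site-04": {"SeriesDescription": "keep"},
}

def policy_for(site: str) -> dict:
    """Merge a site's overrides onto the base policy; unknown sites get the base."""
    merged = dict(BASE_POLICY)
    merged.update(SITE_OVERRIDES.get(site, {}))
    return merged
```

Because each site contributes only a delta, a policy change reviewed once in the base propagates to all five sites without per-site rework.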

10. Conclusion / Next Steps

Scaling from one radiology site to five is achievable when governance and content-aware automation come first. Agentic AI on Databricks orchestrates de-identification, detects PHI in headers and pixels, enforces approvals, watermarks exports, and manages revocations—so your team can accelerate research partnerships without compromising HIPAA.

If you’re exploring governed Agentic AI for your mid-market organization, Kriv AI can serve as your operational and governance backbone. With a focus on data readiness, MLOps, and workflow orchestration, Kriv AI helps regulated radiology groups turn ad-hoc pilots into production systems that deliver measurable, compliant ROI.

Explore our related services: Agentic AI & Automation · AI Governance & Compliance