CRDC Insights

Updates from the Cancer Research Data Commons:
Empowering the Scientific Community to Make New Discoveries

Final Report: Medical Imaging De-Identification (MIDI) Project

June 18, 2025

The NIH/NCI Medical Imaging De-Identification (MIDI) Project was established to assess tools and processes for de-identifying radiology images in DICOM format that are shared for research purposes. Protected Health Information (PHI) and Personally Identifiable Information (PII) can be embedded in these images as pixel data, included in metadata headers, or attached as descriptive information. Given the need to ensure patient privacy, efforts have been made to develop and evaluate various imaging de-identification strategies that address the high volume of imaging data potentially available to the research community.

The Data Ecosystem Branch (DEB) of NCI’s Center for Bioinformatics & Information Technology (CBIIT) coordinated this de-identification work. Over the last several years, the work has involved assessing various automated and semi-automated approaches through a series of workshops and MIDI challenges. 

The DEB team oversaw the development of a fully automated pipeline for de-identifying DICOM medical images. The pipeline was trained, validated, and tested on an extensive dataset of representative medical images that had been modified to include synthetic PHI/PII. This dataset was used over multiple iterations to improve the pipeline. A final test of the pipeline was conducted using a large dataset with actual PHI/PII. In parallel, an open challenge was conducted using the synthetic PHI/PII dataset. Both the MIDI pipeline and several challenge submissions demonstrated success in image de-identification. The synthetic dataset, which includes synthetic DICOM images, answer keys, mapping files, and a validation script with accompanying documentation, is publicly available on The Cancer Imaging Archive (TCIA) for further use by the research community.
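To illustrate how an answer key can be used to score a de-identification tool, here is a minimal sketch in Python. It is not the validation script distributed with the dataset. The function name check_header, the three action labels ("remove", "replace", "retain"), and the dictionary layout are all assumptions made for this example, and header values are modeled as plain strings rather than read from real DICOM files.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical answer-key entry: (expected action, expected value or None).
AnswerKey = Dict[str, Tuple[str, Optional[str]]]


def check_header(deidentified: Dict[str, str], answer_key: AnswerKey) -> List[str]:
    """Return the attribute names whose de-identified value disagrees with the key."""
    failures = []
    for attribute, (action, expected) in answer_key.items():
        actual = deidentified.get(attribute)
        if action == "remove":
            # The attribute must be absent or empty after de-identification.
            if actual not in (None, ""):
                failures.append(attribute)
        elif action in ("replace", "retain"):
            # The attribute must hold exactly the value the key expects.
            if actual != expected:
                failures.append(attribute)
        else:
            raise ValueError(f"unknown action {action!r} for {attribute}")
    return failures


# Synthetic example values, invented for illustration only.
answer_key = {
    "PatientName": ("remove", None),
    "PatientID": ("replace", "ANON-001"),
    "Modality": ("retain", "CT"),
}
tool_output = {"PatientName": "DOE^JANE", "PatientID": "ANON-001", "Modality": "CT"}

print(check_header(tool_output, answer_key))  # ['PatientName']
```

In this sketch a single leftover patient name is enough to flag the image, consistent with the goal that no PHI/PII remains after de-identification.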

As noted by Granger Sutton, PhD, MIDI project co-coordinator: "Medical image de-identification is challenging even when limited to a standard format such as DICOM and a subset of modalities as encapsulated in the datasets evaluated for the MIDI project. The results are promising, but at this time, some human review is still necessary to ensure full confidence in patient privacy."

The final report about the MIDI project was released in late 2024. Some takeaways include:

  • Across all iterations of the MIDI pipeline and all challenge submissions, no automated method achieved a 100% success rate in fully removing PHI/PII.
  • In a production setting, the MIDI pipeline would require a “human-in-the-loop” to reach the desired threshold of 100% accuracy, that is, to ensure that every image has been de-identified. Even with a human-in-the-loop, the MIDI pipeline developed through this work would still reduce the workload of image de-identification compared with fully manual review.

In the future, the MIDI project team recommends:

  • Providing researchers access to the open-source MIDI pipeline for use within their own environments.
  • Providing the research community with an NIH-supported tool based on the MIDI pipeline for automated de-identification as a standalone cloud-enabled service. This standalone service would require additional steps by the DEB team, including obtaining an NIH-provided Authority to Operate (ATO) and documenting processes for human-in-the-loop review.

The full report is available online.

The recording of the 2024 MIDI-B Challenge Workshop features the participating teams discussing their results and lessons learned.

The MIDI Benchmark (developed through the work described above) is open for testing de-identification algorithms from April 1 to June 30, 2025.