De-identifying protected health information (PHI)

A number of clients have been asking me about protected health information (PHI) solutions, so I thought I’d put out a general call for help from my esteemed readers. What I’m looking for is a general-purpose data de-identification library (preferably open source) that I could use in both OSS and commercial systems. If it's a commercial product that costs money, I’d still love to hear about it.

The idea is to find PHI automatically in any arbitrary data packet (HL7, e-mail, database, etc.) and then be able to flag it, do a one-way hash, tokenize it, add it to a dictionary, and so on. There are many uses for this kind of software, from automatically scanning outgoing emails to protecting sensitive data within databases so that data can be aggregated and shared.
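To make the idea concrete, here's a minimal sketch in Python of the operations I have in mind: find, flag, hash, and tokenize with a dictionary. Everything in it is my own illustration, not any existing library's API. The three regex patterns, the `Tokenizer` class, and the `find_phi`, `one_way_hash`, and `deidentify` names are all hypothetical, and the patterns come nowhere near complete PHI coverage (names, dates, addresses, and record numbers are not handled at all).

```python
import hashlib
import hmac
import re

# Illustrative patterns only (assumption): real PHI detection needs far more.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def find_phi(text):
    """Flag candidate PHI: return (label, start, end, value), sorted by position."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.start(), match.end(), match.group()))
    return sorted(hits, key=lambda hit: hit[1])


def one_way_hash(value, key):
    """Keyed one-way hash (HMAC-SHA256); the key must be kept secret."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


class Tokenizer:
    """Maps each distinct value to a stable token and remembers the mapping."""

    def __init__(self):
        self.dictionary = {}  # original value -> token

    def token_for(self, label, value):
        if value not in self.dictionary:
            self.dictionary[value] = f"[{label}_{len(self.dictionary) + 1}]"
        return self.dictionary[value]


def deidentify(text, tokenizer):
    """Replace every flagged span in text with its token."""
    pieces = []
    position = 0
    for label, start, end, value in find_phi(text):
        if start < position:
            continue  # skip a hit that overlaps one already replaced
        pieces.append(text[position:start])
        pieces.append(tokenizer.token_for(label, value))
        position = end
    pieces.append(text[position:])
    return "".join(pieces)


tokenizer = Tokenizer()
message = "Patient SSN 123-45-6789, phone 555-123-4567, email jane.doe@example.com"
print(deidentify(message, tokenizer))
print(one_way_hash("123-45-6789", b"demo-secret-key"))
```

The tokenizer's dictionary is itself sensitive, since it maps tokens back to the original values, so it would need the same protection as the source data. The keyed hash is there for cases where records must be linked across data sets without anyone being able to recover the original value.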

A thought leader in the de-identification and data privacy space is Dr. Latanya Sweeney who teaches at CMU. She has some really nice stuff, though I don’t think it’s open source or available online.

One commercial firm that does this kind of work is De-ID. NIH’s National Cancer Institute uses them for its projects, too, so it’s probably pretty good.

Please drop me some comments here if you know of other researchers who might have some stuff available that they can share. While finished products would be great, even research projects are welcome.


Shahid N. Shah

Shahid Shah is an internationally recognized enterprise software guru who specializes in digital health with an emphasis on e-health, EHR/EMR, big data, IoT, data interoperability, med device connectivity, and bioinformatics.