A number of clients have been asking me about protected health information (PHI) solutions, so I thought I’d put out a general call for help to my esteemed readers. What I’m looking for is a general-purpose data de-identification library (preferably open source) that I could use in both OSS and commercial systems. Even if it costs money, I’d love to hear about it.
The idea is to find PHI automatically in any arbitrary data packet (HL7, e-mail, database, etc.) and then be able to flag it, run it through a one-way hash, tokenize it, add it to a dictionary, and so on. There are many uses for this kind of software, from automatically scanning outgoing emails to protecting sensitive data within databases so that the data can be aggregated and shared.
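To make that pipeline concrete, here is a minimal sketch in Python of the flag, hash, tokenize, and dictionary steps. It is only an illustration under stated assumptions: the three regular expressions, the function names (`find_phi`, `one_way_hash`, `tokenize`), and the token format are all hypothetical and not taken from any existing library. Real PHI detection needs far more than pattern matching (names, dates, addresses, free-text context), which is exactly why I'm looking for a proper library.

```python
import hashlib
import hmac
import re

# Illustrative patterns only (assumption): US-style SSNs, phone numbers, e-mail addresses.
PHI_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def find_phi(text):
    """Flag candidate PHI as (kind, start, end, value) tuples, ordered by position."""
    hits = []
    for kind, pattern in PHI_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((kind, match.start(), match.end(), match.group()))
    return sorted(hits, key=lambda hit: hit[1])


def one_way_hash(value, secret):
    """Keyed one-way hash (HMAC-SHA256), so tokens cannot be guessed without the secret."""
    return hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()


def tokenize(text, secret, dictionary):
    """Replace each flagged value with a token and record token -> value in `dictionary`."""
    pieces = []
    last = 0
    for kind, start, end, value in find_phi(text):
        if start < last:
            continue  # skip a hit that overlaps one already replaced
        token = f"[{kind}:{one_way_hash(value, secret)[:12]}]"
        dictionary[token] = value
        pieces.append(text[last:start])
        pieces.append(token)
        last = end
    pieces.append(text[last:])
    return "".join(pieces)


if __name__ == "__main__":
    lookup = {}
    message = "Patient SSN 123-45-6789, call 555-123-4567 or mail jane.doe@example.com."
    print(tokenize(message, b"demo-secret", lookup))
    print(lookup)
```

Because the hash is keyed and deterministic, the same value always maps to the same token, so de-identified records can still be joined and aggregated. The dictionary is the only route back to the original values, so in this sketch it would have to be stored separately and locked down, or dropped entirely if re-identification is never needed.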
A thought leader in the de-identification and data privacy space is Dr. Latanya Sweeney, who teaches at CMU. She has some really nice stuff, though I don’t think it’s open source or available online.
Please drop me a comment here if you know of other researchers who have work available that they can share. While finished products would be great, even research projects are welcome.