De-identifying protected health information (PHI)

A number of clients have been asking me about protected health information (PHI) solutions, so I thought I’d put out a general call for help from my esteemed readers. What I’m looking for is a general-purpose data de-identification library (preferably open source) that I could use in both OSS and commercial systems. If it costs money, I’d still love to hear about it.

The idea is to find PHI automatically in any arbitrary data packet (HL7, e-mail, database, etc.) and then be able to flag it, do a one-way hash, tokenize it, add it to a dictionary, and so on. There are many uses for this kind of software, from automatically scanning outgoing e-mails to protecting sensitive data within databases so that the data can be aggregated and shared.
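To make that wish list a bit more concrete, here is a minimal sketch of the find, flag, hash and tokenize steps in Python. It is not taken from any existing library. The patterns, the function names (`find_phi`, `one_way_token`, `deidentify`) and the token format are all made up for illustration, and the handful of regular expressions covers only a few easy identifier types; names and addresses in free text are a much harder problem.

```python
import hashlib
import hmac
import re

# Hypothetical patterns for a few easy identifier types. A real scanner would
# need many more (names, addresses, record numbers) and far better precision.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}


def find_phi(text):
    """Flag candidate PHI: return (start, end, kind, value) sorted by position."""
    hits = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), match.end(), kind, match.group()))
    return sorted(hits)


def one_way_token(value, kind, key):
    """Keyed one-way hash (HMAC-SHA256), truncated to a short readable token."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"[{kind}:{digest[:12]}]"


def deidentify(text, key, dictionary=None):
    """Replace each flagged span with its token and record token -> original."""
    dictionary = {} if dictionary is None else dictionary
    pieces = []
    cursor = 0
    for start, end, kind, value in find_phi(text):
        if start < cursor:  # skip a match that overlaps one already replaced
            continue
        token = one_way_token(value, kind, key)
        dictionary[token] = value
        pieces.append(text[cursor:start])
        pieces.append(token)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), dictionary


# Example with made-up data and a made-up key.
note = "Pt seen 03/14/1962, SSN 123-45-6789, call 412-555-0100."
clean, lookup = deidentify(note, key=b"demo-key")
```

The same value always maps to the same token under a given key, so de-identified records can still be joined and aggregated. Whoever holds the dictionary can reverse the mapping; throwing the dictionary away leaves only the one-way hash.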

A thought leader in the de-identification and data privacy space is Dr. Latanya Sweeney who teaches at CMU. She has some really nice stuff, though I don’t think it’s open source or available online.

One commercial firm that does this kind of work is De-ID. NIH’s National Cancer Institute uses its software for their projects, too, so it’s probably pretty good.

Please drop me some comments here if you know of other researchers who might have something available that they can share. While finished products would be great, even research projects are welcome.

7 thoughts on “De-identifying protected health information (PHI)”

  1. David, yes, this is a great example. It’s not exactly the library that I’m looking for but it’s certainly another place we can look to see if IBM has something that they might license out of that product.

    In the PHI world, matching is always the first problem; replacing the matches with valid de-identified data (either one-way or two-way) is the second.

  2. Sure — PHI is considered to be any patient/personal identifying information like names, zip codes, dates, etc. The entire list is described at

    So, the “matching problem” is one of looking through arbitrary unstructured (like e-mails) or structured (like HL7, databases, etc.) text to see where any of the 18 types of identifying information shows up.

    While the problem is in the text matching domain, it’s non-trivial in unstructured text but much easier in structured text.

  3. The problem is more difficult than just watching for the 18 identifiers. For example, if you have the year of birth and a three-digit zip code in an area of just over 20,000 persons, a rare disease could identify a person. Also, things like the date a person moved into the area, correlated with other commercial databases, might identify a person. A visit date might correlate with a parking ticket.

  4. I’m the Chief Medical Officer for De-ID Data Corp. Our software was developed at UPMC as their IRB HIPAA compliance tool and is now being used by the NCI as the HIPAA compliance tool for multi-hospital data sharing networks. We are committed to access to our tool and have special pricing for RHIOs, public health access, data sharing networks, clinical research and IRBs. Our software will de-identify PHI (or a subset) in any electronic text: transcribed discharge summaries, op or path notes, or text fields in EMRs and other records. We do not scrub text, but provide proxies (Dr. M, Dr. N, Dr. O) and offsets (Age 30s) to add power to data files; we also allow you to add names to our dictionary to deal with local names and acronyms. DE-ID is a powerful tool to facilitate research and data access, and (as an ex-public health physician) I can say we will work with you to provide technical assistance to make your research workflow successful.

  5. Everyone, thanks for all your comments and help. It’s been quite useful. I’ll talk to my clients about what you’ve mentioned and see where it leads.

    If there’s anyone else out there that can help, my hat is still in hand asking for any assistance you can provide 🙂
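The risk raised in comment 3, where a combination of harmless-looking fields singles out one person, can be checked with a simple group count. The sketch below is only an illustration: the function name `small_cells`, the field names and the threshold `k` are assumptions, not part of any tool mentioned in this thread.

```python
from collections import Counter


def small_cells(records, quasi_identifiers, k=5):
    """Return each combination of quasi-identifier values shared by fewer than k records."""
    counts = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    return {combo: n for combo, n in counts.items() if n < k}


# Made-up records: birth year and 3-digit zip alone look safe here,
# but adding the diagnosis isolates a single person.
records = [
    {"birth_year": 1962, "zip3": "152", "dx": "flu"},
    {"birth_year": 1962, "zip3": "152", "dx": "flu"},
    {"birth_year": 1962, "zip3": "152", "dx": "flu"},
    {"birth_year": 1962, "zip3": "152", "dx": "rare disease"},
]
risky = small_cells(records, ["birth_year", "zip3", "dx"], k=2)
```

Any combination that comes back from a check like this is a candidate for generalizing (a wider age band, a shorter zip prefix) or suppressing before the data is shared.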
