How to identify spreadsheets and databases with personally identifiable and protected health information (PII and PHI)

The nice folks from IBM’s developerWorks group asked me to write an intermediate-level set of instructions (with a little code) for how technical teams can identify and find databases and spreadsheets that might contain personally identifiable information (PII) and protected health information (PHI).

The article is now available on IBM’s developerWorks. Here’s the abstract:

Identity theft and medical fraud are growing problems. They are so big the U.S. government is spending billions of dollars securing its own computer systems and has written thousands of pages of new regulations that you must follow to help protect your customer and employee data. To comply with new regulations and properly secure data, you will need to find personally identifiable information (PII) and protected health information (PHI) in your databases and documents. Both PHI and PII are conceptually easy to understand but very difficult to track in the thousands of relational data stores, files, and spreadsheets that make up a typical organization’s IT environment. This article describes some methods to automatically identify and inventory PII, PHI, and other sensitive data with databases and spreadsheets using Java™ technology and the Apache Ant build tool.
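To give a flavor of the column-name side of this kind of inventory, here is a minimal sketch in Java. It is not the code from the developerWorks article; the class name `PiiColumnNameScanner`, the labels, and the patterns are all hypothetical choices made for illustration, and substring heuristics like these will produce false positives (for example, `dob` also matches `adobe_license`).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Illustrative sketch: flags column names that hint at PII/PHI.
// The class name, labels, and patterns are assumptions, not taken from the article.
class PiiColumnNameScanner {

    // Hypothetical rule set: label -> pattern searched for inside the column name.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("SSN", Pattern.compile("ssn|social_?sec", Pattern.CASE_INSENSITIVE));
        RULES.put("NAME", Pattern.compile("(first|last|full)_?name", Pattern.CASE_INSENSITIVE));
        RULES.put("BIRTH_DATE", Pattern.compile("dob|birth_?date|date_?of_?birth", Pattern.CASE_INSENSITIVE));
        RULES.put("MEDICAL", Pattern.compile("diagnosis|mrn|medical_?record", Pattern.CASE_INSENSITIVE));
    }

    // Returns the labels of every rule whose pattern occurs in the column name.
    static List<String> classify(String columnName) {
        List<String> labels = new ArrayList<>();
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            if (rule.getValue().matcher(columnName).find()) {
                labels.add(rule.getKey());
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        // Sample column names; a real inventory would read these from database metadata.
        String[] columns = {"patient_ssn", "DOB", "first_name", "order_total"};
        for (String column : columns) {
            System.out.println(column + " -> " + classify(column));
        }
    }
}
```

In a real scan the column names would typically come from JDBC metadata (`java.sql.DatabaseMetaData.getColumns` exposes a `COLUMN_NAME` field for each column) rather than a hard-coded array.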

Don’t forget that there are open source and commercial scanning tools that start with similar functionality but add more features. When you look for third-party tools, consider automated discovery (the tools automatically find databases and record sources with PII/PHI), configurable templates (you add your own rules), broad coverage (all files, databases, and network transfers are covered), content scanning, and auditing.
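As a rough picture of what content scanning with a configurable rule can look like, the sketch below samples the values of one column and reports what fraction match a pattern. It does not reflect how any particular tool works; the class `PiiContentScanner`, the `SSN_LIKE` pattern, and the threshold parameter are hypothetical, and a format match alone does not prove a value is a real Social Security number.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative sketch: rule-based content scanning over sampled column values.
// Class name, pattern, and threshold are assumptions made for this example.
class PiiContentScanner {

    // Hypothetical rule: values shaped like a U.S. Social Security number (NNN-NN-NNNN).
    static final Pattern SSN_LIKE = Pattern.compile("\\d{3}-\\d{2}-\\d{4}");

    // Fraction of sampled values (nulls count as non-matches) that fully match the rule.
    static double matchRatio(List<String> values, Pattern rule) {
        if (values.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        for (String value : values) {
            if (value != null && rule.matcher(value.trim()).matches()) {
                hits++;
            }
        }
        return (double) hits / values.size();
    }

    // Flags the column when at least `threshold` of the sampled values match.
    static boolean looksSensitive(List<String> values, Pattern rule, double threshold) {
        return matchRatio(values, rule) >= threshold;
    }

    public static void main(String[] args) {
        // Made-up sample values standing in for rows read from a spreadsheet or table.
        List<String> sample = Arrays.asList("123-45-6789", "987-65-4321", "n/a", null);
        System.out.println("ratio = " + matchRatio(sample, SSN_LIKE));
        System.out.println("flagged = " + looksSensitive(sample, SSN_LIKE, 0.5));
    }
}
```

Passing the pattern and threshold in as parameters is one simple way to approximate the "configurable templates" idea: new rules can be added without touching the scanning loop.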

Please check out my developerWorks article and comment either directly on the IBM site or drop me some notes here about what you think about it.

Author

Shahid N. Shah

Shahid Shah is an internationally recognized enterprise software guru who specializes in digital health with an emphasis on e-health, EHR/EMR, big data, IoT, data interoperability, med device connectivity, and bioinformatics.