A recent article in Scientific American, How Data Brokers Make Money Off Your Medical Records, by Adam Tanner, February 1, 2016, describes how the purchase and sale of health care data has become a multi-billion dollar industry.
The dominant player in the medical-data-trading industry is IMS Health, which recorded $2.6 billion in revenue in 2014.... At press time, IMS was a $9-billion company. Competitors include Symphony Health Solutions and smaller rivals in various countries.
IMS gathers data from pharmacies, insurance companies, and state and federal health departments, then sells it mainly to drug companies. "Three quarters of all retail pharmacies in the U.S. send some portion of their electronic records to IMS." Another article by Adam Tanner, This Little-Known Firm Is Getting Rich Off Your Medical Data, Fortune Magazine, February 9, 2016, includes further details on IMS data collection.
One particular type of data, records of the drugs individual doctors prescribe, which are sold to drug companies and used in turn to tailor sales pitches, became the subject of a U.S. Supreme Court case, Sorrell v. IMS Health Inc. (2011). Vermont had passed a law in 2007 prohibiting the commercial use of such records without the prescribing doctor's consent. The case was decided in IMS's favor on First Amendment free-speech grounds.
Apart from making money selling information to other businesses, IMS also shares some data with academic and other researchers for free or at a discount. The company has published a long list of medical articles that relied on its longitudinal data.
Tanner posits that a loss of trust in the confidentiality of medical information could adversely affect the entire health care system, with patients unwilling to describe their conditions to their doctors or to even seek treatment for them. He suggests that a first step to restoring patients' trust would be to give individuals the right to opt out of sharing their data for commercial use. Unfortunately, this would not solve the problem of supposedly anonymous data that may be publicly available.
“It is getting easier and easier to identify people from anonymized data,” says Chesley Richards, director of the Office of Public Health Scientific Services at the Centers for Disease Control and Prevention. “You may not be identifiable from a particular data set that an entity has collected, but if you are a broker that is assembling a number of sets and looking for ways to link those data, that's where, potentially, the risk becomes greater for identification.”
HIPAA regulations actually do address the issue of combining "de-identified" data with other information to identify specific individuals. That particular concern about health data reminded me of an article I read last year, Designing Statistical Privacy for Your Data, Ashwin Machanavajjhala and Daniel Kifer, Communications of the ACM, Vol. 58 No. 3 (March 2015), pages 58-67, DOI 10.1145/2660766. You have to subscribe to CACM to get the full text of the article, but an explanation of many of the technical details can also be found in the Wikipedia entry on differential privacy.
Machanavajjhala and Kifer's goal is to present best practices for "sanitizing" data that include sensitive information before releasing it to the public. They present two generalized "privacy definitions" that specify the characteristics of sanitized data. ε-differential privacy, described in the Wikipedia article, involves adding random "noise" to every record in the data, thereby masking the influence of any single record. The magnitude of the added noise acts as a privacy factor-- increasing it makes it harder to identify individuals, but it also detracts from the accuracy of the data that is released.
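The Wikipedia entry the article points to describes the standard Laplace mechanism, which in its simplest form adds noise to the answer of an aggregate query rather than to each record. As a rough sketch only (the helper names, example data, and choice of ε below are all invented for illustration, not taken from the article):

```python
import math
import random

def laplace_noise(scale: float) -> float:
    # Sample from Laplace(0, scale) by inverse transform sampling.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(records, predicate, epsilon: float) -> float:
    # A counting query has sensitivity 1 (adding or removing a single
    # record changes the count by at most 1), so Laplace noise with
    # scale 1/epsilon yields epsilon-differential privacy.
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# Invented example data: diseases in some admissions list.
diseases = ["Flu", "Cancer", "Flu", "None", "Stroke", "Flu"]

# Smaller epsilon means larger noise: more privacy, less accuracy.
noisy_flu_count = dp_count(diseases, lambda d: d == "Flu", epsilon=0.5)
```

The true flu count here is 3; what gets released is 3 plus Laplace noise of scale 1/ε = 2. Any single release masks the contribution of any one record, which is exactly the accuracy-versus-privacy trade-off the authors describe.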
The second general type of privacy definition discussed by Machanavajjhala and Kifer is k-anonymity. I've adapted one of the figures from their article to explain k-anonymity and relate it to the HIPAA regulations. Consider the following fictitious hospital admission data:
Hospital A admissions (raw data)

| Zip Code | Age | Disease |
| --- | --- | --- |
| 13016 | 26 | Cancer |
| 90210 | 60 | Cancer |
| 13007 | 29 | Flu |
| 90201 | 65 | Flu |
| 90210 | 67 | Flu |
| 13007 | 26 | None |
| 13041 | 25 | Stroke |
| 90257 | 63 | Stroke |
Suppose Alice knows her neighbor Bob was admitted to Hospital A, and further that she knows Bob's age and zip code. Obviously, she could infer from the data above what disease Bob had.
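Alice's inference amounts to a one-line lookup. A sketch in Python, with the records transcribed from the raw Hospital A table above (the helper name and Bob's particulars are my own choices for illustration):

```python
# Each record: (zip code, age, disease), transcribed from the
# raw Hospital A admissions table.
admissions = [
    ("13016", 26, "Cancer"), ("90210", 60, "Cancer"),
    ("13007", 29, "Flu"),    ("90201", 65, "Flu"),
    ("90210", 67, "Flu"),    ("13007", 26, "None"),
    ("13041", 25, "Stroke"), ("90257", 63, "Stroke"),
]

def diseases_matching(records, zip_code, age):
    # Every disease consistent with a known zip code and exact age.
    return {d for (z, a, d) in records if z == zip_code and a == age}

# Suppose Alice knows Bob is 26 and lives in zip code 13016:
diseases_matching(admissions, "13016", 26)  # -> {"Cancer"}
```

Because the (zip code, age) pair is unique in the raw data, the candidate set collapses to a single disease.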
HIPAA regulations specify two methods for de-identifying data to exempt it from privacy rules. The first requires specific statistical analysis and documentation; the second is a "safe harbor" based on a list of 18 identifiers-- name, birth date, Social Security number, etc. If these identifiers are removed, and the covered entity has no actual knowledge that individuals could be identified from the remaining data, the safe harbor requirement is met. Zip codes are one of the 18 identifiers, but the first three digits may be retained for large metropolitan areas. Similarly, while complete birth dates are not allowed, year of birth can be included. Thus, if Hospital A had no actual knowledge that the data could be used to identify individuals, the following "sanitized" version of the admissions data above would be exempt from HIPAA privacy protection:
Hospital A admissions (4-anonymized data)

| Zip Code | Age | Disease |
| --- | --- | --- |
| 130** | 25–30 | None |
| 130** | 25–30 | Stroke |
| 130** | 25–30 | Flu |
| 130** | 25–30 | Cancer |
| 902** | 60–70 | Flu |
| 902** | 60–70 | Stroke |
| 902** | 60–70 | Flu |
| 902** | 60–70 | Cancer |
Even if she knows Bob's zip code and age, Alice can no longer infer Bob's disease. She will find at least 4 different records for whatever combination of these identifiers she looks for, hence the characterization "4-anonymous". Consider, however, what happens if Hospital B, unknown to Hospital A, releases the following de-identified admissions data:
Hospital B (3-anonymized data)

| Zip Code | Age | Disease |
| --- | --- | --- |
| 130** | < 40 | Cold |
| 130** | < 40 | Stroke |
| 130** | < 40 | Rash |
| 148** | ≥ 40 | Cancer |
| 148** | ≥ 40 | Flu |
| 148** | ≥ 40 | Cancer |
Looking at either anonymized table individually, privacy is maintained. But if Alice knew Bob was treated at both hospitals for the same condition, combining the two tables would produce a privacy breach: only one disease is consistent with Bob's zip code and age in both releases.
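To make the breach concrete, here is a sketch of the linkage attack in Python, with the generalized records transcribed from the two anonymized tables above (the variable names and the age-bucket strings are my own encoding):

```python
# Generalized records: (zip prefix, age bucket, disease),
# transcribed from the Hospital A 4-anonymized table.
hospital_a = [
    ("130**", "25-30", "None"),   ("130**", "25-30", "Stroke"),
    ("130**", "25-30", "Flu"),    ("130**", "25-30", "Cancer"),
    ("902**", "60-70", "Flu"),    ("902**", "60-70", "Stroke"),
    ("902**", "60-70", "Flu"),    ("902**", "60-70", "Cancer"),
]
# Transcribed from the Hospital B 3-anonymized table.
hospital_b = [
    ("130**", "<40", "Cold"),  ("130**", "<40", "Stroke"),
    ("130**", "<40", "Rash"),  ("148**", ">=40", "Cancer"),
    ("148**", ">=40", "Flu"),  ("148**", ">=40", "Cancer"),
]

# Bob: zip 13016, age 26, so bucket "25-30" in A and "<40" in B.
candidates_a = {d for (z, a, d) in hospital_a
                if z == "130**" and a == "25-30"}
candidates_b = {d for (z, a, d) in hospital_b
                if z == "130**" and a == "<40"}

# If Bob was treated at both hospitals for the same condition,
# it must lie in the intersection of the two candidate sets.
candidates_a & candidates_b  # -> {"Stroke"}
```

Each release is k-anonymous on its own (Bob hides among four records in one table and three in the other), yet the intersection of his candidate diseases across the two releases is a single disease, so Alice learns Bob had a stroke.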
Adam Tanner's point about patients' trust is well taken. In addition to giving patients greater control over their health data, a revision of HIPAA privacy protections might also be in order.