DATE: Wednesday, June 18, 2008
TIME: 4:00 pm
PLACE: CBY-A707 (PLS NOTE ROOM CHANGE)
TITLE: Protecting Privacy Using k-Anonymity
PRESENTER: Fida Dankar
University of Ottawa
ABSTRACT:

Today we live in a world where our personal information is continuously captured in a multitude of electronic databases. Details about our health, financial status and buying habits are stored in databases managed by public and private organizations. Since these databases contain information about millions of people, they can provide valuable research, epidemiologic and business insight. For example, analysis of a database of purchases at a large retailer will show the merchandise most in demand. Examining a drug store chain’s prescriptions can indicate where a flu outbreak is occurring. To extract or maximize the value contained in these databases, data custodians must provide outside organizations access to their data. In order to protect the privacy of the people whose data is being analyzed, a data custodian will “de-identify” information before releasing it to a third party. De-identification is meant to ensure that data cannot be traced to the person to whom it pertains. Although this might seem like a simple matter of masking a person’s identifiers (name, address), de-identification has proven more difficult than that and is an active area of scientific research.

One popular approach in this area is referred to as k-anonymity. With k-anonymity, an original data set containing personal health information is transformed so that it is difficult for an intruder to determine the identity of the individuals in that data set. However, there have been no evaluations of the actual re-identification probability of k-anonymized data sets.
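
As a rough illustration of the idea (a generic sketch, not the procedure evaluated in the talk): a data set is commonly called k-anonymous when every combination of quasi-identifier values is shared by at least k records. The column names, the sample records and the helper functions below are all hypothetical, and the risk figure assumes an intruder who can do no better than pick uniformly among the records that match on the quasi-identifiers.

```python
from collections import Counter

def equivalence_class_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every equivalence class contains at least k records."""
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    return all(n >= k for n in sizes.values())

def max_reidentification_risk(records, quasi_identifiers):
    """Worst-case chance of picking the right record within the smallest class
    (assumption: uniform choice among matching records)."""
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    return 1 / min(sizes.values())

# Hypothetical, already-generalized records (age ranges, postal code prefixes).
records = [
    {"age_range": "30-39", "postal_prefix": "K1N", "diagnosis": "flu"},
    {"age_range": "30-39", "postal_prefix": "K1N", "diagnosis": "asthma"},
    {"age_range": "30-39", "postal_prefix": "K1N", "diagnosis": "flu"},
    {"age_range": "40-49", "postal_prefix": "K2P", "diagnosis": "diabetes"},
    {"age_range": "40-49", "postal_prefix": "K2P", "diagnosis": "flu"},
]
QUASI_IDENTIFIERS = ["age_range", "postal_prefix"]

if __name__ == "__main__":
    print(is_k_anonymous(records, QUASI_IDENTIFIERS, 2))
    print(is_k_anonymous(records, QUASI_IDENTIFIERS, 3))
    print(max_reidentification_risk(records, QUASI_IDENTIFIERS))
```

In this toy data the smallest group has two records, so the set satisfies k = 2 but not k = 3. Coarser generalization would raise k at the cost of more information loss, which is the trade-off the talk addresses.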

In this talk, we make explicit the two re-identification scenarios that k-anonymity protects against, and show that the actual probability of re-identification with k-anonymity is much lower than the threshold risk (the intended risk target) for one of these scenarios, resulting in excessive information loss. To address that problem, we evaluate three different modifications to k-anonymity and identify one that ensures that the actual risk is close to the threshold risk and that also reduces information loss considerably.

The talk will conclude with guidelines for deciding when to use the baseline versus the modified k-anonymity procedure. Following these guidelines will ensure that re-identification risk is controlled with minimal information loss when using k-anonymity.