Identifying anomalies in data is central to the advancement of science, national security, and finance. However, privacy concerns restrict our ability to analyze data. Can we lift these restrictions and accurately identify anomalies without hurting the privacy of those who contribute their data? We address this question for the most practically relevant case, where a record is considered anomalous relative to other records.
We make four contributions. First, we introduce the notion of sensitive privacy, which conceptualizes what it means to privately identify anomalies. Sensitive privacy generalizes the important concept of differential privacy and is amenable to analysis. Importantly, sensitive privacy admits algorithmic constructions that provide strong and practically meaningful privacy and utility guarantees. Second, we show that differential privacy is inherently incapable of accurately and privately identifying anomalies; in this sense, our generalization is necessary. Third, we provide a general compiler that takes as input a differentially private mechanism (which has bad utility for anomaly identification) and transforms it into a sensitively private one. This compiler, which is mostly of theoretical importance, is shown to output a mechanism whose utility greatly improves over the utility of the input mechanism. Fourth, we propose mechanisms for a specific and popular definition of anomaly ((β,r)-anomaly) that (i) are guaranteed to be sensitively private, (ii) have good utility on almost every database of fixed size, and (iii) are empirically shown to have an overwhelmingly accurate performance.
This is a joint work with Periklis Papakonstantinou and Jaideep Vaidya
Hafiz is a Ph.D. candidate at Rutgers University. His research primarily focuses on the problem of analyzing data in a privacy-preserving manner with an overarching goal of investigating data privacy in the context of data mining and machine learning. His broad research interests include data privacy, secure computation, machine learning and data analytics, distributed computation, and the problems at the intersections of these fields.