Two Rutgers University faculty, Dr. Rebecca Wright and Dr. Anand Sarwate, are leading a research effort to determine if a theoretical data-analysis model called Differentially Private Anomaly Detection can detect anomalies in big data sets while at the same time preserving the private information of ordinary citizens. Such anomalies can be a tip-off to online intrusions (such as a hack) or other threats.
Sarwate and Wright are applying their research to three data-analysis techniques: active learning, group testing, and community detection. What follows is a description of each of those techniques.
Active learning (a type of “machine learning” in this context) refers to the process of sequentially and adaptively exploring a data set while doing statistical inference. An active learning algorithm selects the next data point to investigate based on the points it has investigated in the past and how certain it is about the remaining points. In the context of privacy, the learning algorithm has to balance two competing objectives: selecting points that it thinks are anomalous (exploitation) and selecting points that will improve its overall ability to distinguish between anomalous and non-anomalous points (exploration). This exploration/exploitation tradeoff is common in machine learning problems with constraints. The privacy constraint provides an additional twist: the algorithm has to balance privacy concerns when selecting individuals to investigate.
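The exploration/exploitation loop can be illustrated with a minimal sketch. This is a hypothetical epsilon-greedy selector over precomputed anomaly scores, not the authors' algorithm, and it omits the privacy mechanism entirely; with probability `eps` it explores a random unqueried point, otherwise it exploits by querying the point it currently believes is most anomalous.

```python
import random

def active_anomaly_search(scores, budget, eps=0.3, seed=0):
    """Toy epsilon-greedy query loop (illustrative only).

    scores: dict mapping each data point to its current anomaly estimate.
    budget: number of points the learner may investigate.
    eps:    probability of exploring rather than exploiting.
    """
    rng = random.Random(seed)
    unqueried = list(scores)
    queried = []
    for _ in range(budget):
        if not unqueried:
            break
        if rng.random() < eps:
            # Exploration: a random unqueried point, to improve the model.
            pick = rng.choice(unqueried)
        else:
            # Exploitation: the point currently believed most anomalous.
            pick = max(unqueried, key=lambda p: scores[p])
        unqueried.remove(pick)
        queried.append(pick)
    return queried
```

Setting `eps=0` recovers pure exploitation (always chase the highest score); `eps=1` is pure exploration. A privacy-aware version would additionally account for how much each query reveals about the individuals involved.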
Group testing is a method that was first proposed in 1943 to more efficiently screen potential soldiers drafted for World War II. Syphilis tests were expensive and screening every draftee would have been inefficient, given how rare the condition was. Instead, blood samples from groups of soldiers were pooled together. If the pooled test was negative, the whole group could be cleared; if not, the individuals in that group could be tested separately. Group testing is a cost-efficient way of screening and also provides privacy benefits because individuals are not tested in the initial rounds.
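The two-stage procedure described above (often called Dorfman screening) can be sketched as follows; the function names and the deterministic `is_positive` test oracle are illustrative assumptions, not part of the research described here.

```python
def group_test(samples, group_size, is_positive):
    """Two-stage pooled screening sketch.

    Pools samples into groups of `group_size`, runs one test per pool,
    and retests individuals only in pools that come back positive.
    Returns (flagged individuals, total number of tests used).
    """
    flagged = []
    tests = 0
    for i in range(0, len(samples), group_size):
        group = samples[i:i + group_size]
        tests += 1  # one pooled test covers the whole group
        if any(is_positive(s) for s in group):
            # Positive pool: fall back to individual testing.
            for s in group:
                tests += 1
                if is_positive(s):
                    flagged.append(s)
    return flagged, tests
```

With 12 samples, pools of 4, and a single positive, this uses 7 tests (3 pooled + 4 individual) instead of 12, and the 8 people in clean pools are never tested individually.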
Sarwate and Wright’s work adds the mathematical guarantee of differential privacy to put the privacy benefits on firm theoretical footing. Their project has brought the concept of group testing to the attention of various agencies that are considering applications ranging from cargo screening to passenger anomaly detection.
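To give a sense of what a differential privacy guarantee involves, here is the standard Laplace mechanism for releasing a count, a textbook construction and not the authors' specific method: since adding or removing one person changes a count by at most 1, adding Laplace noise with scale 1/ε makes the released value ε-differentially private.

```python
import math
import random

def laplace_count(true_count, epsilon, seed=None):
    """Release a counting query with epsilon-differential privacy.

    A count has sensitivity 1 (one person changes it by at most 1),
    so Laplace noise with scale 1/epsilon suffices. Noise is drawn by
    inverse-CDF sampling from a uniform variate on (-0.5, 0.5).
    """
    rng = random.Random(seed)
    u = rng.random() - 0.5
    scale = 1.0 / epsilon
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise
```

Smaller ε means stronger privacy but noisier answers; the research challenge is extending this kind of guarantee to adaptive procedures like pooled screening rather than a single count.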
Community detection refers to identifying tightly-knit groups of individuals connected by networks of social or communication ties. In particular, this community may be “hidden” in the sense that detecting membership (an anomaly) requires significant investigation. Given that we know about the network structure and some individuals in the community, how can we more efficiently detect the other members of that community without violating the privacy of non-anomalous individuals?
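One simple (non-private) way to grow a community from known members is greedy seed expansion: repeatedly admit any node whose neighbors are mostly already inside. This sketch is a generic illustration of the problem setup, not the authors' technique, and the `threshold` parameter is a hypothetical tuning knob; a privacy-preserving version would also have to limit what the expansion reveals about the non-members it inspects.

```python
def expand_community(adjacency, seeds, threshold=0.6):
    """Greedy seed expansion over an adjacency-list graph.

    adjacency: dict mapping each node to a list of its neighbors.
    seeds:     known community members to start from.
    threshold: minimum fraction of a node's neighbors that must
               already be in the community before it is admitted.
    """
    community = set(seeds)
    changed = True
    while changed:
        changed = False
        for node, neighbors in adjacency.items():
            if node in community or not neighbors:
                continue
            inside = sum(1 for n in neighbors if n in community)
            if inside / len(neighbors) >= threshold:
                community.add(node)
                changed = True
    return community
```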