When the analysis of individuals' personal information has value to an institution, but it compromises privacy, should individuals be compensated? We describe the foundations of a market in which those seeking access to data must pay for it and individuals are compensated for the loss of privacy they may suffer.
The interests of individuals and institutions with respect to personal data are often at odds. Personal data has great value to institutions: they eagerly collect it and monetize it by using it to model customer behavior, personalize services, target advertisements, or by selling the data directly. Yet the inappropriate disclosure of personal data poses a risk to individuals. They may suffer a range of harms including elevated prices for goods or services, discrimination, or exclusion from employment opportunities.3
A rich literature on privacy-preserving data analysis4,6,11 has tried to devise technical means for negotiating these competing interests. The goal is to derive accurate aggregate information from data collected from a group of individuals while at the same time protecting each member's personal information. But this approach necessarily imposes restrictions on the use of data. A seminal result from this line of work is that any mechanism providing reasonable privacy must strictly limit the number of query answers that can be accurately released.5 Nevertheless, recent research into differential privacy,7 a formal model of privacy in which an individual's privacy loss is rigorously measured and bounded, has shown that, for some applications, accurate aggregate analysis need not entail significant disclosure about individuals. Practical adoption of these techniques is slowly increasing: they have been used in a U.S. Census product16 and for application monitoring by Google9 and Apple.13
But there remain settings where strictly limiting privacy loss degrades the utility of data to the point that its intended use becomes impossible. We therefore pursue an alternative approach which allows a non-negligible degree of privacy loss if that loss is compensated in accordance with users' preferences. Compensating privacy loss is an improvement over the narrower view that mandates negligible privacy loss because it empowers individuals to control their data through financial means and permits more accurate data analysis if end-users are willing to pay for it.
Considering personal information as a resource, one that is valuable but also exchangeable, is not new. Twenty years ago, Laudon proposed that personal information be bought and sold in a national market18 and there is a mature literature on economic aspects of privacy.1 And in today's online services, one could argue that individuals are compensated indirectly for contributing their personal data. Many internet companies acquire personal data by offering a (purportedly) free service, attracting users who provide their data, and then monetizing the personal data by selling it, or by selling information derived from it, to third parties.
Even so, a technical foundation for a market for personal information is lacking, particularly one that is consistent with recent advances in the formal modeling of privacy. We address this by proposing a formal framework for assigning prices to queries in order to compensate data owners for their loss of privacy. Our framework borrows from, and extends, key principles from both differential privacy7,8 and data markets.17,21
There are three types of actors in our setting: individuals, or data owners, contribute their personal data; a buyer submits an aggregate query over many owners' data; and a market maker, trusted to answer queries on behalf of owners, charges the buyer and compensates the owners.
Our framework makes three important connections:
1.1. Perturbation and price
In response to a buyer's query, the market maker computes the true query answer, adds random noise, and returns a perturbed result. Under differential privacy, perturbation is always necessary. In our setting, query answers can be sold unperturbed, but the price would be high because each data owner contributing to an aggregate query needs to be compensated. By adding perturbation to the query answer, the price can be lowered: the more perturbation, the lower the price. When issuing the query, the buyer specifies the degree of accuracy for which he is willing to pay. Unperturbed query answers are very expensive, but at the other extreme, query answers are almost free if the noise added is the same as required by differential privacy with conservative privacy parameters. The relationship between the accuracy of a query result and its cost depends on the query and the preferences of contributing data owners. Formalizing this relationship is one of the goals of this article.
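As a concrete illustration, one simple pricing rule with this shape is π(v) = c/v, where v is the variance the buyer will tolerate and c is a constant standing in for the aggregate valuation of the contributing owners. Both the functional form and the constant below are hypothetical, not the framework's actual pricing formula:

```python
def price(variance, c=50.0):
    """Hypothetical pricing rule pi(v) = c / v: the more perturbation
    (variance) the buyer tolerates, the less the answer costs. The
    constant c stands in for the owners' aggregate valuation."""
    return c / variance

print(price(1.0))     # accurate answer:  50.0
print(price(10.0))    # noisy answer:      5.0
print(price(1000.0))  # nearly free:       0.05
```

A convenient property of this particular form is that averaging k purchases at variance k·v costs exactly as much as one purchase at variance v, so the averaging attack described in the next section yields no discount.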
1.2. Arbitrage and perturbation
Arbitrage allows a buyer to obtain the answer to a query more cheaply than its advertised price by deriving the answer from a less expensive alternative set of queries. Arbitrage is possible because of inconsistency in a set of priced queries. As a simple example, suppose that a given query is sold with two options for perturbation, measured by variance: $5 for a variance of 10 and $200 for a variance of 1. A savvy buyer seeking a variance of 1 would never pay $200. Instead, he would purchase the first query 10 times, receive 10 noisy answers, and compute their average. Since noise is added independently, the variance of the resulting average is 1, and the total cost is only $50. The pricing of queries should avoid arbitrage opportunities. While this has been considered before for data markets,2,17,21 it has not been studied for perturbed query answers. Formalizing arbitrage for noisy queries is a second goal of this article.
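The averaging attack can be simulated directly. The Gaussian noise below is a stand-in perturbation mechanism, chosen only because independent noisy answers average cleanly; the true answer and prices mirror the example above:

```python
import random
import statistics

rng = random.Random(0)
TRUE_ANSWER = 1000.0

def buy(variance):
    # One purchase of the query at the stated variance
    # ($5 at variance 10, $200 at variance 1 in the example above).
    return TRUE_ANSWER + rng.gauss(0.0, variance ** 0.5)

# Arbitrage: ten $5 purchases at variance 10, averaged, give an
# estimator with variance 10/10 = 1 -- the accuracy of the $200
# option for only $50. We repeat the attack to verify empirically.
estimates = [statistics.mean(buy(10.0) for _ in range(10))
             for _ in range(20000)]
print(round(statistics.variance(estimates), 1))  # approximately 1.0
```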
1.3. Privacy-loss and payments
Given a randomized mechanism for answering a query q, a common measure of privacy loss to an individual is defined by differential privacy7: it is the maximum ratio between the probability of returning some fixed output with and without that individual's data. Differential privacy imposes a bound of e^ε on this quantity, where ε is a small constant, presumed acceptable to all individuals. Our framework contrasts with this in several ways. First, the privacy loss is not bounded, but depends on the buyer's request. If the buyer asks for a query with low variance, then the privacy loss to individuals will be high. These data owners must be compensated for their privacy loss by the buyer's payment. In addition, we allow each data owner to value their privacy loss separately, by demanding greater or lesser payments. Formalizing the relationship between privacy loss and payments to data owners is a third goal of this article.
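For a unit-sensitivity count query answered with Laplace noise of scale b, a standard differentially private mechanism, the privacy loss works out to ε = 1/b, and each owner's compensation can then scale with ε at a rate of her own choosing. The per-unit dollar rates below are hypothetical, used only to show unequal valuations:

```python
import math

def laplace_pdf(x, mu, b):
    # Density of the Laplace distribution centered at mu with scale b.
    return math.exp(-abs(x - mu) / b) / (2.0 * b)

b = 2.0  # noise scale; neighboring true answers differ by 1 (10 vs. 11)
xs = [i * 0.01 for i in range(-500, 2500)]
eps = max(abs(math.log(laplace_pdf(x, 10.0, b) / laplace_pdf(x, 11.0, b)))
          for x in xs)
print(round(eps, 3))  # 0.5, i.e. epsilon = 1/b

# Owners value privacy differently: dollars demanded per unit of loss.
rates = {"alice": 2.0, "bob": 10.0}
micropayments = {name: round(rate * eps, 3) for name, rate in rates.items()}
print(micropayments)  # {'alice': 1.0, 'bob': 5.0}
```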
In our framework, the burden of the market maker is not to enforce a strict limit on the privacy loss of individuals. Instead, they must ensure that prices are set such that, whatever disclosure is obtained by the buyer, all contributing individuals are properly compensated. In particular, if a sequence of queries can indeed reveal the private data for most individuals, its price must approach the total cost of the entire database.
In this section we describe the basic architecture of the private data pricing framework, illustrated in Figure 1.
Figure 1: The pricing framework has 3 components: (A) Pricing and purchase: the buyer asks a query Q = (q, v) and must pay its price, π(Q); (B) Privacy loss: by answering Q, the market maker leaks some information about the private data of the data owners to the buyer; (C) Compensation: the market maker must compensate each data owner for their privacy loss with micro-payments; μi(Q) is the total of the micro-payments for all users in bucket i. The pricing framework is balanced if the price π(Q) is sufficient to cover all micro-payments μi and the micro-payments μi compensate the owners for their privacy loss εi.
2.1. The main actors
The main actors in our proposed marketplace are data owners, query buyers, and a market maker negotiating between the two.
The market maker. The market maker is trusted by the buyer and by each of the data owners. He collects data from the owners and sells it in the form of queries. When a buyer decides to purchase a query, the market maker collects payment, computes the answer to the query, adds noise as appropriate, returns the result to the buyer, and finally distributes individual payments to the data owners. The market maker may retain a fraction of the price as profit.
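The market maker's loop might be sketched as follows. The pricing rule, the equal revenue split among owners, and the retained margin are all illustrative placeholders, not the framework's actual formulas:

```python
import random

class MarketMaker:
    """Sketch of the trusted intermediary: collects owner data, sells
    perturbed aggregate queries, and redistributes the revenue."""

    def __init__(self, data, rate=0.5, margin=0.1):
        self.data = dict(data)  # owner name -> private numeric value
        self.rate = rate        # per-owner price coefficient (hypothetical)
        self.margin = margin    # fraction of revenue retained as profit
        self.rng = random.Random(0)

    def price_of(self, variance):
        # Every contributing owner must be compensated, so the price
        # grows with the number of owners and shrinks with variance.
        return len(self.data) * self.rate / variance

    def sell_sum(self, variance):
        price = self.price_of(variance)
        answer = sum(self.data.values()) + self.rng.gauss(0.0, variance ** 0.5)
        pool = price * (1.0 - self.margin)  # remainder is the maker's profit
        payments = {owner: pool / len(self.data) for owner in self.data}
        return answer, price, payments

mm = MarketMaker({"alice": 3.0, "bob": 5.0, "carol": 2.0})
answer, price, payments = mm.sell_sum(variance=10.0)
print(price)                            # 0.15
print(sum(payments.values()) <= price)  # True: the price covers all payments
```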
The owner and her data. Each owner contributes a single tuple conforming to a relational schema R(