On a New Scheme on Privacy Preserving Association
Rule Mining
Nan Zhang, Shengquan Wang, and Wei Zhao
Department of Computer Science, Texas A&M University
College Station, TX 77843, USA
{nzhang, swang, zhao}@cs.tamu.edu
Abstract. We address the privacy preserving association rule mining problem in
a system with one data miner and multiple data providers, each of which holds one transaction. The literature has tacitly assumed that randomization is the only effective
approach to preserving privacy in such circumstances. We challenge this assumption by introducing a scheme based on algebraic techniques. Compared to previous
approaches, our new scheme can identify association rules more accurately while
disclosing less private information. Furthermore, our new scheme can be readily
integrated as a middleware with existing systems.
1 Introduction
In this paper, we address issues related to production of accurate data mining results,
while preserving the private information in the data being mined. We will focus on
association rule mining. Since Agrawal, Imielinski, and Swami addressed this problem in [1], association rule mining has been an active research area due to its wide
applications and the challenges it presents. Many algorithms have been proposed and
analyzed [2–4]. However, few of them have addressed the issue of privacy protection.
Borrowing terms from e-business, we can classify privacy preserving association
rule mining systems into two classes: B2B (Business to Business) and B2C (Business to
Customer), respectively. In the first category (B2B), transactions are distributed across
several sites (businesses) [5, 6]. Each of them holds a private database that contains
numerous transactions. The sites collaborate with each other to identify association
rules spanning multiple databases. Since usually only a few sites are involved in a system (e.g., less than 10), the problem here can be modelled as a variation of secure
multi-party computation [7]. In the second category (B2C), a system consists of one
data miner (business) and multiple data providers (customers) [8,9]. Each data provider
holds only one transaction. Association rule mining is performed by the data miner on
the aggregated transactions provided by the data providers. An on-line survey is a typical example of this type of system: it can be modelled as one data miner (i.e., the
survey collector and analyzer) and millions of data providers (i.e., the survey respondents).
Privacy is of particular concern in this type of system; in fact, there has been wide media
coverage of the public debate on protecting privacy in on-line surveys [10]. Both B2B
and B2C have wide applications. Nevertheless, in this paper, we will focus on studying
B2C systems.
Several studies have been carried out on privacy preserving association rule mining
in B2C systems. Most of them have tacitly assumed that randomization is an effective approach to preserving privacy. We challenge this assumption by introducing a
new scheme that integrates algebraic techniques with random noise perturbation. Our
new method has the following important features that distinguish it from previous approaches:
– Our scheme can identify association rules more accurately while disclosing less private information. Our simulation data show that, at the same accuracy level, our
scheme discloses about five times less private transaction information than previous
approaches.
– Our solution is easy to implement and flexible. Our privacy preserving mechanism
does not need a support recovery component, and thus is transparent to the data
mining process. It can be readily integrated as a middleware with existing systems.
– We allow explicit negotiation between data providers and the data miner in terms
of tradeoff between accuracy and privacy. Instead of obeying the rules set by the
data miner, a data provider may choose its own level of privacy. This feature should
help the data miner to collaborate with both hard-core privacy protectionists and
persons comfortable with a small probability of privacy divulgence.
The rest of this paper is organized as follows: In Section 2, we present our models,
review previous approaches, and introduce our new scheme. The communication protocol of our system and related components are discussed in Section 3. A performance
evaluation of our system is provided in Section 4. Implementation and overhead are
discussed in Section 5, followed by a final remark in Section 6.
2 Approaches
In this section, we will first introduce our models of data, transactions, and data miners.
Based on these models, we review the randomization approach – a method that has
been widely used in privacy preserving data mining. We then point out the problems
associated with the randomization approach, which motivate us to design a new privacy
preserving method based on algebraic techniques.
2.1 Model of Data and Transactions
Let I be a set of n items: I = {a1, …, an}. Assume that the dataset consists of m
transactions t1, …, tm, where each transaction ti is represented by a subset of I. Thus,
we may represent the dataset by an m × n matrix T = [a1, …, an] = [t1, …, tm]′.¹
Let ⟨T⟩ij denote the element of T with indices i and j. Correspondingly, for a vector v,
its ith element is denoted ⟨v⟩i. An example of matrix T is shown in Table 1. The
elements of the matrix indicate whether an item appears in a transaction. For example,
¹ We denote the transpose of matrix T as T′. The transpose of T is obtained by replacing all
elements ⟨T⟩ij with ⟨T⟩ji.
Table 1. Transaction Matrix

          a1    a2    ···    an
   t1      0     1    ···     0
   ...    ···   ···   ···    ···
   tm      1     0    ···     1
suppose the first transaction contains items a2 and a47. Then the first row of the matrix
has ⟨T⟩1,2 = ⟨T⟩1,47 = 1 and all other elements equal to 0.
An itemset B ⊆ I is a k-itemset if B contains k items (i.e., |B| = k). The support of
B is defined as

    supp(B) = |{t ∈ T : B ⊆ t}| / m        (1)
where t ∈ T means t is a transaction in the dataset represented by T. A k-itemset
B is frequent if supp(B) ≥ min supp_k, where min supp_k is a predefined minimum support threshold. The set of frequent k-itemsets is denoted by Lk. Technically
speaking, the main task of association rule mining is to identify frequent itemsets; we
will focus on this issue.
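To make Eq. (1) concrete, the following Python sketch computes supports over a small 0/1 transaction matrix. The matrix and the itemsets queried are invented for illustration, not taken from any dataset in the paper.

```python
import numpy as np

# Invented toy 0/1 transaction matrix T (m = 4 transactions, n = 5 items).
T = np.array([
    [1, 1, 0, 0, 1],
    [1, 1, 0, 1, 0],
    [0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
])

def supp(T, itemset):
    """supp(B) = |{t in T : B subset of t}| / m, as in Eq. (1)."""
    cols = sorted(itemset)
    m = T.shape[0]
    # A transaction contains B iff every one of B's columns is 1 in that row.
    return int(np.sum(T[:, cols].min(axis=1) == 1)) / m

# Itemset {a1, a2} (columns 0 and 1) occurs in 3 of the 4 transactions.
print(supp(T, {0, 1}))  # 0.75
```

With a threshold of, say, min supp = 0.5, the itemset {a1, a2} would be counted as frequent in this toy dataset.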
2.2 Model of Data Miners
There are two classes of data miners in our system. The first is legal data miners:
miners that always act legally, performing regular data mining tasks and never
intentionally breaching the privacy of the data. On the other hand, illegal data miners
purposely attempt to discover private information in the data being mined. Illegal data miners come in
many forms. In this paper, we focus on a particular sub-class of illegal miners. That is, in
our system, illegal data miners are honest but curious [11]: they follow proper protocol
(i.e., they are honest), but they may keep track of all intermediate communications and
received transactions to perform some analysis (i.e., they are curious) to discover private
information.
Even though it is a relaxation of Byzantine behavior, this kind of honest but
curious (nevertheless illegal) behavior is the most common and has been widely adopted as
an adversary model in the literature. This is because, in reality, a workable system must
benefit both the data miner and the data providers. For example, an online bookstore (the
data miner) may use the association rules of purchase records to make recommendations
to its customers (data providers). The data miner, as a long-term agent, requires large
numbers of data providers to collaborate with. In other words, even an illegal data miner
desires to build a reputation for trustworthiness. Thus, honest but curious behavior is an
appropriate choice for many illegal data miners.
2.3 Randomization Approach
To prevent privacy breaches by illegal data miners, countermeasures must
be implemented in data mining systems. Randomization has been the most common
such countermeasure. We briefly review this method below.
Fig. 1. Randomization Approach
We consider the entire mining process to be an iterative one. In each stage, the
data miner obtains a perturbed transaction from a different data provider. With the randomization approach, each data provider employs a randomization operator R(·) and
applies it to one transaction t which the data provider holds. Figure 1 depicts this kind
of system.
Upon receiving transactions from the data providers, the legal data miner must first
perform an operation called support recovery, which is intended to filter out the noise injected into the data by randomization, and then carry out the data mining tasks. At
the same time, an illegal (honest but curious) data miner may run a privacy recovery algorithm in an attempt to discover private data from the data supplied by the data
providers.
Clearly, the system should be measured by its capability to support the
legal miner in discovering accurate association rules, while preventing the illegal miner
from discovering private data.
2.4 Problems of Randomization Approach
Researchers have discovered some problems with the randomization approach. For example, as pointed out in [8], when randomization is implemented by the so-called cut-and-paste method, it is difficult, if not impossible, for a transaction containing 10 or more items to provide effective information for association rule mining while at the same
time preserving privacy. Furthermore, large itemsets have exceedingly high variances
on their recovered support values. Similar problems exist with other randomization
methods (e.g., the MASK system [9]), as they all use (and only use) random variables to
distort the original transactions.
Now, we explore the reasons behind these problems with the randomization approach.
– First, we note that previous randomization approaches are transaction-invariant. In
other words, the same perturbation algorithm is applied to all data providers. Thus,
transactions of a large size (e.g., |t| > 10) are doomed to fail at privacy protection because of the large number of real items divulged to the data miner. The solution
proposed in [8] simply ignores all transactions with a size larger than 10. However, a
real dataset may contain about 5% such transactions. Even if the average transaction
size is relatively small, this solution still prevents many frequent itemsets (e.g., of
size 4 or more) from being discovered.
– Second, previous approaches are item-invariant. All items in the original transaction t have the same probability of being included in the perturbed transaction
R(t), and no specific operation is performed to preserve the correlation between different items. Thus, many real items in the perturbed transactions may never appear
in any frequent itemset. In other words, the divulgence of these items does not
contribute to the mining of association rules.
Note that invariance of transactions and items is inherent in the randomization approach. This is because in this kind of system, the communication is one-way: from data
providers to the data miner. As such, a data provider cannot obtain any specific guidance
on the perturbation of its transaction from the (legal) data miner. Consequently, lack of
communication between data providers prevents a data provider from learning the correlation between different items. Thus, a data provider has no choice but to employ a
transaction-invariant and item-invariant mechanism.
This observation motivates us to develop a new approach that allows two-way communication between the data miner and data provider. We describe the new approach in
the next subsection.
2.5 Our New Scheme
Fig. 2. Our New Scheme
Figure 2 shows the infrastructure of our scheme. The (legal) data miner S contains
two components: DM (data mining process) and PG (perturbation guidance). When a
data provider Ci initializes a communication session, PG first dispatches a reference
Vk to Ci . Based on the received Vk , the data perturbation component of Ci transforms
the transaction t to a perturbed one R(t) and transmits R(t) to PG. PG then updates Vk
based on the recently received R(t) and forwards R(t) to the data mining process DM.
Unlike the cut-and-paste mechanism [8], our new scheme does not require the data
miner to employ a support recovery algorithm. Thus, instead of demanding an Apriori-based
association rule mining algorithm as the cut-and-paste mechanism does, our data
perturbation algorithm does not assume any particular association rule mining algorithm.
Hence, our scheme can be easily appended to any existing system as a component, leaving
the original system settings unchanged.
The key here is to properly design Vk so that correct guidance can be provided to
the data providers on how to distort their data transactions. In our system, we let Vk be
an algebraic quantity derived from T. As we will see, with this kind of Vk, our system
can effectively maintain the accuracy of data mining while significantly reducing the leakage
of private information.
3 Communication Protocol and Related Components
In this section, we present the communication protocol and the associated components of our system. Recall that in our system there is two-way communication
between the data providers and the data miner. As we will see, while only a little overhead
is involved, this two-way communication substantially improves the performance of
privacy preserving association rule mining.
3.1 The Communication Protocol
We now describe the communication protocol used between the data providers and the
data miner. On the side of the data miner, two concurrent threads perform the
following operations iteratively after initializing Vk:
Thread of registering data providers:

R1. Negotiate on the truncation level k with a data provider;
R2. Wait for a ready message from a data provider;
R3. Upon receiving the ready message from a data provider,
    – Register the data provider;
    – Send the data provider the current Vk;
R4. Go to Step R1;

Thread of receiving data transactions:

T1. Wait for a (perturbed) data transaction R(t) from a data provider;
T2. Upon receiving the data transaction from a registered data provider,
    – Update Vk based on the newly received perturbed data transaction;
    – Deregister the data provider;
T3. Go to Step T1;
A data provider performs the following operations once its data transaction
t is ready:
P1. Send the data miner a ready message indicating that this provider is ready to contribute to the mining process.
P2. Wait for a message that contains Vk from the data miner.
P3. Upon receiving the message from the data miner, compute R(t) based on t and Vk .
P4. Transmit R(t) to the data miner.
3.2 Related Components
It is clear from the above description of the communication protocol that the key components of our system are (a) the method of computing Vk; (b) the algorithm for the perturbation function R(·); and (c) the negotiation on the truncation level. We discuss these
components in the following.
Computation of Vk Recall that Vk carries information from the data miner to the data
providers on how to distort a data transaction in order to preserve privacy. In our system,
Vk is an estimate of the eigenvectors of A = T′T. The justification that Vk provides
accurate mining results while protecting privacy is presented in Appendix A.
As we consider the dynamic case where data transactions are fed to the data miner
on the fly, the miner keeps a copy of all received transactions and updates it
whenever a new transaction arrives. Assume that the initial set of received transactions T* is empty² and that every time a new (distorted) data transaction R(t) is
received, T* is updated by appending R(t) at the bottom of T*. Thus, T* is the matrix
of perturbed transactions. We derive Vk from T*.
In particular, the computation of Vk is done in the following steps. Using singular
value decomposition (SVD) [12], we can decompose A* = T*′T* as

    A* = T*′T* = V* Σ* V*′        (2)

where Σ* = diag(s1², …, sn²) is a diagonal matrix with s1² ≥ … ≥ sn², and V* is an n × n
unitary matrix composed of the eigenvectors of A*.
Vk is composed of the first k vectors of V* (i.e., the eigenvectors corresponding to the
k largest eigenvalues of A*). In other words, if V* = [v1, …, vn], then

    Vk = [v1, …, vk]        (3)

Thus, we call Vk the k-truncation of V*. Several incremental algorithms have been
proposed to update Vk when a new (distorted) data transaction is received by the data
miner [13, 14]. The computing cost of updating Vk is addressed in Section 5.
Note that k is a given integer less than or equal to n. As we will see in Section 4,
k can play a critical role in balancing accuracy and privacy. We will also show that
by using Vk in conjunction with R(·), to be discussed next, we can achieve desired
accuracy and privacy.
² T* may also be composed of some transactions provided by privacy-careless data providers.
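As an illustration (not the incremental algorithm of [13, 14] that a production system would use), the k-truncation Vk of Eq. (3) can be obtained from the accumulated matrix T* with an off-the-shelf SVD; the matrix T* below is invented.

```python
import numpy as np

def k_truncation(T_star, k):
    """Vk = [v1, ..., vk]: eigenvectors of A* = T*'T* for the k largest
    eigenvalues, i.e. the top-k right singular vectors of T*."""
    # np.linalg.svd returns singular values (and hence rows of Vt)
    # already sorted in descending order.
    _, _, Vt = np.linalg.svd(np.asarray(T_star, dtype=float),
                             full_matrices=False)
    return Vt[:k].T  # an n x k matrix

# Invented perturbed-transaction matrix T* (not the paper's data).
T_star = np.array([[1, 1, 0, 0],
                   [1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [1, 1, 0, 0]])
Vk = k_truncation(T_star, 2)
print(Vk.shape)  # (4, 2)
```

Because the rows of Vt are orthonormal, the columns of the returned Vk are orthonormal as well, which is what makes Vk Vk′ a projection in the perturbation step.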
Perturbation Function R(·) Recall that once a data provider receives the perturbation
guidance Vk from the data miner, the provider applies a perturbation function R(·)
to its data transaction t. The result is a distorted transaction that is then transmitted to
the data miner. The computation of R(t) is defined as follows. First, for the given Vk,
the data transaction t is transformed by t̃ = tVkVk′. Note that the elements of t̃ may not
be integers, so Algorithm Mapping is applied to t̃ in order to integerize it; in the algorithm,
ρt is a pre-defined parameter. Finally, to enhance the privacy preserving capability, we
insert additional noise into R(t) using Algorithm Random-Noise Perturbation. An example
of Algorithm Mapping and Algorithm Random-Noise Perturbation is provided in Appendix B.
Algorithm Mapping
for every element ⟨t̃⟩i in t̃ do
    if ⟨t̃⟩i ≥ 1 − ρt then
        ⟨R(t)⟩i = 1
    else
        ⟨R(t)⟩i = 0
    end if
end for

Algorithm Random-Noise Perturbation
for every item ai ∉ t do
    choose a real number j uniformly at random on [0, 1]
    if j ≥ 1 − ρm then
        ⟨R(t)⟩i = 1
    end if
end for
This completes the computation of R(t), which is now ready to be transmitted to the
data miner.
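Putting the pieces together, the perturbation R(·) (projection onto Vk, followed by Algorithm Mapping and Algorithm Random-Noise Perturbation) might be sketched as follows. The choice of Vk, the transaction, and the parameter values are illustrative assumptions only.

```python
import numpy as np

def perturb(t, Vk, rho_t=0.4, rho_m=0.0, rng=None):
    """Sketch of R(t): project t onto span(Vk), integerize with
    threshold rho_t (Algorithm Mapping), then flip items absent
    from t to 1 with probability rho_m (Random-Noise Perturbation)."""
    rng = rng or np.random.default_rng()
    t = np.asarray(t, dtype=float)
    t_tilde = t @ Vk @ Vk.T                     # t~ = t Vk Vk'
    R = (t_tilde >= 1.0 - rho_t).astype(int)    # Algorithm Mapping
    # Random noise: only items a_i not in t are candidates for flipping.
    flip = (rng.random(t.shape) >= 1.0 - rho_m) & (t == 0)
    R[flip] = 1
    return R

# Illustrative Vk spanning the first two coordinate axes: the projection
# keeps components 0 and 1 and zeroes out the rest.
Vk = np.eye(4)[:, :2]
print(perturb([1, 1, 0, 1], Vk, rho_t=0.4, rho_m=0.0))  # [1 1 0 0]
```

With rho_m = 0 no noise is ever injected (a uniform draw on [0, 1) is never ≥ 1), matching the noise-free setting used in the comparison of Section 4.3.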
Negotiation As stated in the communication protocol, a data provider may negotiate
with the data miner about the truncation level k of the eigenvectors. In order to retain
enough information after truncation, a textbook heuristic is to make the sum of
the retained eigenvalues (i.e., the eigenvalues corresponding to the retained eigenvectors) of
A* = T*′T* larger than 85% of the grand total [12, 15]. Usually k is large at the beginning
and soon decreases to a very small value (e.g., less than 1% of n). Thus, for most data
providers, negotiation is only needed in the early stages of the process.
Algorithm Negotiation
1: Based on the SVD of A* (A* = T*′T* = V*Σ*V*′), the data miner calculates
   S = ⟨Σ*⟩₁₁² + ··· + ⟨Σ*⟩ₙₙ²;
2: Find the smallest k ∈ [1, n] such that Σᵢ₌₁ᵏ ⟨Σ*⟩ᵢᵢ² ≥ 85% · S;
3: The data miner dispatches k to active data providers (a data provider becomes
   active when it shows the intention of providing data and remains active until the
   transformed transaction is shipped to the data miner);
4: if a data provider Ci receives k > Kt, where Kt is a threshold set by Ci, then
5:    Ci remains active;
6: else
7:    Ci sends a ready message to the data miner;
8: end if
9: Goto 1;
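The miner's side of Steps 1 and 2 (choosing the smallest k that retains 85% of the eigenvalue mass) can be sketched as follows. Here we read the retained quantities as the eigenvalues sᵢ² of A*, i.e., the squared singular values of T*; the example matrices are invented.

```python
import numpy as np

def negotiate_k(T_star, energy=0.85):
    """Smallest k in [1, n] whose retained eigenvalues of A* = T*'T*
    sum to at least `energy` of the grand total S (Steps 1-2)."""
    s = np.linalg.svd(np.asarray(T_star, dtype=float), compute_uv=False)
    ev = s ** 2                        # eigenvalues of A*, descending
    cum = np.cumsum(ev) / ev.sum()     # retained fraction for each k
    # First index whose cumulative fraction reaches `energy`, plus one.
    return int(np.searchsorted(cum, energy) + 1)

# Invented example: one dominant direction, so k = 1 already suffices.
print(negotiate_k(np.diag([3.0, 1.0])))  # 1
```

For a matrix with two equally strong directions (e.g., diag(2, 2)), the same routine returns k = 2, illustrating why k shrinks as a few directions come to dominate the accumulated T*.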
We have described our system – the communication protocol and its key components. We now discuss the accuracy and privacy metrics of our system.
4 Analysis on Accuracy and Privacy
In this section, we propose metrics of accuracy and privacy, together with an analysis of
the tradeoff between them. We derive an upper bound on the degree of accuracy of
the mining results (frequent itemsets). An analytical formula for evaluating the privacy
metric is also provided.
4.1 Accuracy Metric
We use the error in the support of frequent itemsets to measure the degree of accuracy of
our system. This is because the general objective of association rule mining is to identify all
frequent itemsets with support larger than a threshold min supp. There are two kinds
of errors: false drops, which are undiscovered frequent itemsets, and false positives,
which are itemsets wrongly identified as frequent. Formally, given an itemset Ij, let
the support of Ij in the original transactions T and in the perturbed transactions R(T) be
supp(Ij) and supp′(Ij), respectively. Recall that the set of frequent h-itemsets in T is
Lh. With these notations, we can define the two errors as follows:
Definition 1. For a given itemset size h, the error on false drops, ρ1, and the error on
false positives, ρ2, are defined as

    ρ1 = max_{Ij ∈ Lh} (supp(Ij) − supp′(Ij)),        (4)

    ρ2 = max_{Ij ∉ Lh} (supp′(Ij) − supp(Ij)).        (5)
We define the degree of accuracy as the maximum of ρ1 and ρ2 over all itemset sizes.
Definition 2. The degree of accuracy in a privacy preserving association rule mining
system is defined as

    γ = max_{h ≥ 1} max(ρ1, ρ2).        (6)
With this definition, we can derive an upper bound on the degree of accuracy.
Theorem 1.

    γ ≤ 2.618 · σ²ₖ₊₁ / m,        (7)

where σᵢ² is the ith eigenvalue of A = T′T.
The proof is given in Appendix C.
This bound is fairly small when m is sufficiently large, which is usually the case
in practice. In fact, our method tends to enlarge the support of highly supported itemsets
and reduce the support of weakly supported itemsets. Thus, the effective error that may
result in false positives or false drops is much smaller than the upper bound, as we will
see from the simulation results later.
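Given empirical supports measured on the original and perturbed data, the quantities of Definitions 1 and 2 can be evaluated directly, as in the sketch below. The itemsets and support values are fabricated for illustration, and for simplicity the maximum is taken over all itemsets at once rather than per size h.

```python
def degree_of_accuracy(frequent, supp_orig, supp_pert):
    """max(rho1, rho2): rho1 bounds false drops (frequent itemsets
    whose support shrank), rho2 bounds false positives (infrequent
    itemsets whose support grew), per Definitions 1 and 2."""
    rho1 = max((supp_orig[I] - supp_pert[I]
                for I in supp_orig if I in frequent), default=0.0)
    rho2 = max((supp_pert[I] - supp_orig[I]
                for I in supp_orig if I not in frequent), default=0.0)
    return max(rho1, rho2)

# Fabricated supports for two 1-itemsets; a1 is frequent, a2 is not.
supp_orig = {('a1',): 0.60, ('a2',): 0.10}
supp_pert = {('a1',): 0.55, ('a2',): 0.12}
print(round(degree_of_accuracy({('a1',)}, supp_orig, supp_pert), 4))  # 0.05
```

In this toy case the false-drop error (0.05 for a1) dominates the false-positive error (0.02 for a2), so γ = 0.05.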
4.2 Privacy Metric
In our system, the data miner cannot deduce the original t from t̃ = tVkVk′ because
VkVk′ is a singular matrix with det(VkVk′) = 0 (i.e., it has no inverse).
Since t → t̃ → R(t), t cannot be deduced from R(t) deterministically. To measure the
probability that an item in t can be identified from R(t), we need a privacy metric.
A privacy metric, the privacy breach, was proposed in [8]. It is defined by the posterior
probability Pr{ai ∈ t | t′} that an item can be recovered from the perturbed transaction t′. Unfortunately, this metric is unsuitable for our system settings, especially for Internet applications. Consider a person taking an online survey about the commodities he or she
purchased in the last month. A privacy breach of 50% (which is achieved in [8]) does
not prevent privacy divulgence effectively. For instance, for a company that advertises
via spam mail, a 50% probability of success (correct identification of a
person who purchased similar commodities in the last month) is certainly worth a try,
because a wrong estimate (a spam mail sent to the wrong person) costs little.
We propose a privacy metric that measures the number of “unwanted” items (i.e.,
items that do not contribute to association rule mining) divulged to the data miner. For an item
ai that does not appear in any frequent itemset (i.e., ai ∉ ⋃k Lk), the divulgence of ai
(i.e., ai ∈ R(t)) does not contribute to the mining of association rules. According to the
survey results in [10], people have a strong desire to filter out such “unwanted” information (i.e.,
information not effective in data mining) before divulging private data in exchange for
data mining results.
We evaluate the level of privacy by the probability that an “unwanted” item is
included in the transformed transaction. Formally, the level of privacy is defined as
follows:
Definition 3. Given a transaction t, an item ai ∈ t appears in a frequent itemset in t
if there exists a frequent itemset Ij such that ai ∈ Ij ⊆ t. Otherwise, we say that ai is
infrequent in t. We define the level of privacy as

    δ = Pr{ai ∈ R(t) | ai is infrequent in t}        (8)
[Figure 3: scatter plot of the support of each of the 13,641 observed 2-itemsets in the
original transactions T (x-axis, ×10⁻⁴) versus their support in the perturbed transactions
R(T) (y-axis, ×10⁻⁴), with k = 6 out of 497 and ρt = 0.4; a cluster of 233,368
2-itemsets lies at near-zero perturbed support.]
Fig. 3. Comparison of 2-itemsets Supports between Original and Perturbed Transactions
Fig. 3 shows a simulation result on all 2-itemsets. The x-axis is the support of an itemset
in the original transactions; the y-axis is its support in the perturbed transactions.
The figure is intended to show how effectively our system blocks unwanted items
from being divulged. If a system preserved privacy perfectly, we would have y equal to
zero whenever x is less than min supp2. The data in Fig. 3 show that almost all 2-itemsets
with support less than 0.2% (i.e., 233,368 unwanted 2-itemsets) have been blocked.
Thus, privacy has been successfully protected. Meanwhile, the supports of frequent
2-itemsets are exaggerated, which should help the data miner distinguish frequent itemsets
from the additional noise.
Formally, we can derive an upper bound on the level of privacy.
Theorem 2. The level of privacy in our system is bounded by

    δ ≤ 1 − √( (σ²ₖ₊₁ + ··· + σₙ²) / (σ₁² + ··· + σₙ²) ),        (9)

where σᵢ² is the ith eigenvalue of A = T′T.
The proof is given in Appendix C.
By Theorems 1 and 2, we can observe a tradeoff between accuracy and privacy.
Note that the σᵢ are sorted in descending order. Thus, a larger k results in more “unwanted”
items being divulged; at the same time, the degree of accuracy (whose upper bound is
proportional to σ²ₖ₊₁) decreases.
4.3 Simulation Results on Real Datasets
We present a comparison between our approach and the cut-and-paste randomization
operator, using simulation results obtained on a real dataset, BMS Webview 1 [16].
The dataset contains several months of web click-stream data from the e-commerce
website of a leg-care company. It has 59,602 transactions and 497 distinct items.
We randomly choose 10,871 transactions from the dataset as our test bed. The
maximum transaction size is 181 and the average transaction size is 2.90. There are 325
transactions (2.74%) with size 10 or more. If we set min supp = 0.2%, there are 798
frequent itemsets, including 259 1-itemsets, 350 2-itemsets, 150 3-itemsets, 37
4-itemsets, and two 5-itemsets.
As a compromise between privacy and accuracy, the cutoff parameter Km of the
cut-and-paste randomization operator is set to 7, and the truncation level k of our approach is
set to 6. Since both our approach and the cut-and-paste operator use the same method to
add random noise, we compare the results before noise is added; thus we set ρm = 0
for both our approach and the cut-and-paste randomization operator.
The solid line in Figure 4 shows how the degree of accuracy (max{ρ1, ρ2})
of our approach varies with the parameter ρt. The dotted line shows the degree of accuracy
when the cut-and-paste randomization operator is employed. We can see that our approach
reaches a better accuracy level than the cut-and-paste operator. A recommendation
[Figures 4 and 5: the degree of accuracy (%) and the degree of privacy, respectively,
plotted against ρt ∈ [0.1, 1], comparing our perturbation algorithm (solid line) with the
previous cut-and-paste approach (dotted line).]
Fig. 4. Accuracy vs ρt
Fig. 5. Privacy vs ρt
made from the figure is that ρt ∈ (0.7, 0.8) is suitable for hard-core privacy protectionists, while ρt ∈ (0.2, 0.3) is recommended for persons who care more about the accuracy of association
rules than about privacy protection.
The relationship between the level of privacy and ρt under the same settings is presented
in Figure 5. The dotted line shows the level of privacy of the cut-and-paste randomization
operator. We can see that the privacy level of our approach is much higher than
that of the cut-and-paste operator when ρt > 0.1. Thus, our approach is better on both
privacy and accuracy when 0.1 ≤ ρt ≤ 1.
5 Implementation
A prototype of a privacy preserving association rule mining system using our new
scheme has been implemented on web browsers and servers for online surveys. Visitors
taking surveys are considered to be data providers; the data perturbation algorithm is
implemented as custom code in the web browser. The web server is considered to be the
data miner; a custom plug-in on the web server implements the PG (perturbation
guidance) part of the data miner. All custom code consists of component-based plug-ins that
can easily be installed into existing systems. The components required for building the
system are shown in Figure 6.
Fig. 6. System Implementation
The overhead of our implementation is substantially smaller than that of previous approaches
in the context of online surveys. The time-consuming part of the cut-and-paste mechanism is support recovery, which has to be done while mining association rules. The
support recovery algorithm needs the partial support of all candidate items for each
transaction size, which results in significant overhead on the mining process.
In our system, the only overhead (possibly) incurred on the data miner is updating
the perturbation guidance Vk, which is an approximation of the first k right eigenvectors
of A* = T*′T*. Many SVD updating algorithms have been proposed, including SVD-updating,
folding-in, and recomputing the SVD [13, 14]. Since T* is usually a sparse
matrix, the complexity of updating the SVD can be considerably reduced, to O(n). Moreover,
this overhead is not on the critical time path of the mining process: it occurs during
data collection rather than during data mining. Note that the transferred perturbation
guidance Vk has length kn. Since k is always a small number (e.g., k ≤ 10), the
communication overhead incurred by the two-way communication is not significant.
6 Final Remarks
In this paper, we propose a new scheme on privacy preserving mining of association
rules. In comparison with previous approaches, we introduce a two-way communication
mechanism between the data miner and data providers with little overhead. In particular,
we let the data miner send a perturbation guidance to the data providers. Using this
intelligence, the data providers distort the data transactions to be transmitted to the
miner. As a result, our scheme identifies association rules more precisely than previous
approaches and at the same time reaches a higher level of privacy.
Our work is preliminary and many extensions can be made. For example, we are
currently investigating how to apply a similar algebraic approach to privacy preserving
classification and clustering problems. The method of singular value decomposition
has been broadly adopted to many knowledge discovery areas including latent semantic
indexing, information retrieval and noise reduction in digital signal processing. As we
have shown, singular value decomposition can be an effective means of dealing with
privacy preserving data mining problems as well.
References
1. R. Agrawal, T. Imielinski, and A. Swami, “Mining association rules between sets of items
in large databases,” in Proc. ACM SIGMOD Int. Conf. on Management of Data, 1993, pp.
207–216.
2. R. Agrawal and R. Srikant, “Fast algorithms for mining association rules in large databases,”
in Proc. Int. Conf. on Very Large Data Bases, 1994, pp. 487–499.
3. J. S. Park, M.-S. Chen, and P. S. Yu, “An effective hash-based algorithm for mining association rules,” in Proc. ACM SIGMOD Int. Conf. on Management of Data, 1995, pp. 175–186.
4. M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman, “Computing
Iceberg queries efficiently,” in Proc. Int. Conf. on Very Large Data Bases, 1998, pp. 299–
310.
5. J. Vaidya and C. Clifton, “Privacy preserving association rule mining in vertically partitioned
data,” in Proc. ACM SIGKDD Int. Conf. on Knowledge discovery and data mining, 2002, pp.
639–644.
6. M. Kantarcioglu and C. Clifton, “Privacy-preserving distributed mining of association rules
on horizontally partitioned data,” in Proc. ACM SIGMOD Workshop on Research Issues on
Data Mining and Knowledge Discovery, 2002, pp. 24–31.
7. Y. Lindell and B. Pinkas, “Privacy preserving data mining,” Advances in Cryptology, vol.
1880, pp. 36–54, 2000.
8. A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke, “Privacy preserving mining of association rules,” in Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining,
2002, pp. 217–228.
9. S. J. Rizvi and J. R. Haritsa, “Maintaining data privacy in association rule mining,” in Proc.
Int. Conf. on Very Large Data Bases, 2002, pp. 682–693.
10. J. Hagel and M. Singer, Net Worth. Harvard Business School Press, 1999.
11. O. Goldreich, Secure Multi-Party Computation. Working Draft, 2002.
12. G. H. Golub and C. F. Van Loan, Matrix Computations. Baltimore, Maryland: Johns Hopkins
University Press, 1996.
13. J. R. Bunch and C. P. Nielsen, “Updating the singular value decomposition,” Numerische
Mathematik, vol. 31, pp. 111–129, 1978.
14. M. Gu and S. C. Eisenstat, “A stable and fast algorithm for updating the singular value
decomposition,” Yale University, Tech. Rep. YALEU/DCS/RR-966, 1993.
15. I. T. Jolliffe, Principle Component Analysis. Springer Verlag, 1986.
16. Z. Zheng, R. Kohavi, and L. Mason, “Real world performance of association rule algorithms,” in Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 2001,
pp. 401–406.
A Appendix: Justification of Vk
The main part of a privacy preserving association rule mining system is the data perturbation mechanism employed by the data providers. In current techniques, a randomization operator is used to perturb the original transactions. As we described in Section 2,
an item-invariant randomization operator leaves the correlation between different items
unconsidered. We exploit this item correlation to improve the accuracy of mining results.
Specifically, our data perturbation algorithm preserves the support of 2-itemsets.
The intuition behind our approach can be stated as follows. The PG component of the data
miner maintains an estimate of the support of all 2-itemsets. Based on this estimate, PG tells the data providers which itemsets are more likely to be frequent.
We know that no superset of an infrequent itemset is frequent (the anti-monotone property). A data
provider may therefore safely remove all items that do not appear in any frequent 2-itemset, because
such items cannot appear in any frequent itemset of size larger than 2 either.
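The pruning step described above can be sketched as follows. The helper name `prune_items`, the item labels, and the estimated frequent pairs are hypothetical illustrations, not the paper's actual algorithm.

```python
def prune_items(transaction, frequent_pairs):
    """Keep only the items that appear in at least one (estimated)
    frequent 2-itemset; by the anti-monotone property, the removed
    items cannot occur in any frequent itemset of size 2 or more."""
    keep = set()
    for pair in frequent_pairs:
        keep.update(pair)
    return [item for item in transaction if item in keep]

# hypothetical estimate sent by the miner's PG component
frequent_pairs = [frozenset({'a1', 'a4'}), frozenset({'a4', 'a7'})]
t = ['a1', 'a2', 'a4', 'a7', 'a9']
print(prune_items(t, frequent_pairs))  # ['a1', 'a4', 'a7']
```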
Readers may ask why we choose to preserve the support of 2-itemsets rather than 3-itemsets or 1-itemsets. If we chose to preserve the support of
3-itemsets, we could not recover frequent 2-itemsets from the perturbed transactions. Meanwhile, as we see in Section 4, our data perturbation algorithm successfully preserves the support of 1-itemsets. Besides, the large number of 3-itemsets (O(n³)) would also result in a much larger overhead.
If we chose to preserve the support of 1-itemsets, we could not remove many items
from the original transactions. Roughly speaking, if {ai, aj} and {aj, ak} are frequent, there is
a fairly large probability that {ai, aj, ak} is also frequent. However, if both {ai}
and {aj} are frequent, the probability that {ai, aj} is frequent is much smaller. Thus,
we would have to set a very small threshold on the minimum support of frequent 1-itemsets
(min_supp₁) to guarantee that all frequent 2-itemsets are preserved. This prevents
many items from being removed from the original transactions.
Now we will show that our data perturbation algorithm successfully preserves the
support of 2-itemsets. Consider A = T′T:³

\[
A = T'T = \begin{bmatrix} a_1 \cdot a_1 & \cdots & a_1 \cdot a_n \\ \vdots & a_i \cdot a_j & \vdots \\ a_n \cdot a_1 & \cdots & a_n \cdot a_n \end{bmatrix} \tag{10}
\]

where a_i · a_j is the dot product of a_i and a_j:

\[
a_i \cdot a_j = \sum_{h=1}^{m} \langle a_i \rangle_h \langle a_j \rangle_h \tag{11}
\]

Note that

\[
\frac{a_i \cdot a_j}{m} = \text{support of } \{a_i, a_j\}. \tag{12}
\]

³ Here we consider the case when Vk is the k-truncation of the accurate eigenvectors of T′T. In reality, the value of Vk has to be estimated from the current copy of T*. Fortunately, the value of Vk converges well to its accurate value, and the convergence is fairly fast in most cases.
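The relationship (10)–(12) between A = T′T and itemset supports can be checked numerically; the toy transaction matrix below is our own example, not data from the paper.

```python
import numpy as np

# toy transaction matrix: m = 4 transactions (rows) over n = 3 items (columns)
T = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1],
              [1, 0, 0]], dtype=float)
m = T.shape[0]

A = T.T @ T       # A = T'T; entry (i, j) is the dot product a_i . a_j, eq. (11)
support = A / m   # eq. (12): A[i, j] / m = support of the 2-itemset {a_i, a_j}

# items 1 and 2 co-occur in 2 of the 4 transactions
print(support[0, 1])  # 0.5
# the diagonal gives 1-itemset supports: item 1 occurs in 3 of 4 transactions
print(support[0, 0])  # 0.75
```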
Thus, we may improve the accuracy of the support of 2-itemsets through a better approximation
of A. We use the truncated eigenvectors of A to perturb each private transaction t into t̃.
Given the transaction matrix T, define the perturbed transaction matrix T̃ = T Vk Vk′:

\[
T = [t_1, t_2, \ldots, t_m]' \tag{13}
\]
\[
V_k = [v_1, v_2, \ldots, v_k] \tag{14}
\]
\[
\tilde T = T V_k V_k' = [t_1 V_k V_k', t_2 V_k V_k', \ldots, t_m V_k V_k']. \tag{15}
\]

With some algebraic manipulation we have

\[
\tilde A = \tilde T' \tilde T = V_k V_k' T' T V_k V_k' \tag{16}
\]
\[
= V_k V_k' V (\Sigma' \Sigma) V' V_k V_k' \tag{17}
\]
\[
= V_k \Sigma_k' \Sigma_k V_k' \tag{18}
\]
\[
= \sigma_1^2 v_1 v_1' + \cdots + \sigma_k^2 v_k v_k' \tag{19}
\]
Thus Ã is the optimal rank-k approximation of A [12]. In other words, we preserve the
support of 2-itemsets while truncating the eigenvectors of A.
Our data perturbation algorithm also preserves the privacy of data providers. The
data miner cannot recover the original t from t̃ because Vk Vk′ is a singular matrix with
det(Vk Vk′) = 0, and therefore has no inverse.
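A small numpy sketch of this perturbation, using our own toy data: it checks that Ã = T̃′T̃ equals the rank-k approximation σ₁²v₁v₁′ + · · · + σk²vkvk′ of eq. (19), and that the projection Vk Vk′ is rank-deficient, so the miner cannot invert it to recover t.

```python
import numpy as np

# toy transaction matrix: 5 transactions over n = 4 items (our own example)
T = np.array([[1, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 1, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1]], dtype=float)
k = 2

# right singular vectors of T (eigenvectors of A = T'T)
_, s, Vt = np.linalg.svd(T, full_matrices=False)
Vk = Vt[:k].T                      # n x k

T_tilde = T @ Vk @ Vk.T            # perturbed transactions, eq. (15)
A_tilde = T_tilde.T @ T_tilde      # eq. (16)

# A_tilde equals the optimal rank-k approximation of A = T'T, eq. (19)
A_rank_k = sum(s[i] ** 2 * np.outer(Vt[i], Vt[i]) for i in range(k))
print(np.allclose(A_tilde, A_rank_k))  # True

# Vk Vk' has rank k < n, hence det = 0 and no inverse exists
print(np.linalg.matrix_rank(Vk @ Vk.T))  # 2
```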
B Appendix: Transaction Hashing
Besides the truncation of eigenvectors, we need a function that maps the values of t̃ ∈ [0, 1]
to the integer values R(t) ∈ {0, 1}. Although the truncation of eigenvectors has already
inserted some "noise" items into the transactions, we still need to add more "noise" items.
These tasks are done by a hash function. The hashed values should be either 0 or 1
and should "look" randomly generated. Here we use a constructive example to illustrate the
algorithm involved in this process.
Example 1. A data provider Ci holds a transaction t = [1, 1, 0, 1, 1, 1, 0, 0]. After the
transformation algorithm, we have t̃ = [0.31, 0.03, 0.05, 0.53, 0.09, 0.13, 0.13, 0.08].
Given ρt ∈ [0, 1], the steps of mapping t̃ to integer values are as follows:
1. As stated in Algorithm Mapping, round t̃ off to integers. ρt is the threshold that a
transformed value must reach for its item to be placed into R(t). Assume ρt = 0.3. We then
transform t̃ to R(t) = [1, 0, 0, 1, 0, 0, 0, 0] = {a1, a4};
2. After that, as stated in Algorithm Random-Noise Perturbation, some items are inserted into R(t) as noise. ρm is the probability of a false item being placed into
R(t). Now R(t) may become R(t) = [1, 0, 0, 1, 0, 0, 1, 0] = {a1, a4, a7}.
As shown in the above example, there are two parameters, ρt and ρm, in the hash
function. Both give the data providers control over how much privacy is preserved. In addition to
the truncated eigenvector transformation (which can also be personalized in the negotiation process), a data provider may assign individual values of ρt and ρm according to its
privacy sensitivity.
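A minimal sketch of the two steps of Example 1. The text does not specify whether the threshold comparison is strict, so the use of ≥ here (consistent with 0.31 being kept at ρt = 0.3), the function names, and the fixed seed are our own assumptions.

```python
import random

def mapping(t_tilde, rho_t):
    """Algorithm Mapping (sketch): round the transformed values to {0, 1}
    using threshold rho_t. The >= comparison is an assumption."""
    return [1 if v >= rho_t else 0 for v in t_tilde]

def random_noise(r, rho_m, rng):
    """Algorithm Random-Noise Perturbation (sketch): turn each 0 into a
    false ("noise") item with probability rho_m; existing items are kept."""
    return [1 if (b == 0 and rng.random() < rho_m) else b for b in r]

# values from Example 1
t_tilde = [0.31, 0.03, 0.05, 0.53, 0.09, 0.13, 0.13, 0.08]
r = mapping(t_tilde, rho_t=0.3)
print(r)  # [1, 0, 0, 1, 0, 0, 0, 0] -> {a1, a4}

rng = random.Random(7)  # seeded only to make the sketch repeatable
print(random_noise(r, rho_m=0.1, rng=rng))  # r plus zero or more noise items
```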
C Appendix: Proof of Theorem 1
Proof. In our system, we only need to consider the accuracy metric on itemsets of
size 1 and 2. Since no superset of an infrequent itemset is frequent (the anti-monotone property),
we have already preserved frequent itemsets of size 3 or more by retaining the support of 2-itemsets. For example, assume that {a1, a2, a3} is a frequent 3-itemset. Since
{a1, a2}, {a2, a3}, and {a1, a3} are all frequent itemsets, a transaction that contains
{a1, a2, a3} as a subset has a fairly high probability of including these three items in the
perturbed transaction. Thus, we only consider the accuracy metric on itemsets with 1 or
2 items.
First, we consider the error introduced by the truncation of eigenvectors. As shown
in (19) in Appendix A, Ã = T̃′T̃ is a rank-k approximation of A = T′T. By the theory of SVD, we have

\[
\max_{i,j} |\langle \tilde A - A \rangle_{ij}| \le \|\tilde A - A\|_2 = \sigma_{k+1}^2 \tag{20}
\]

For 1-itemsets, let ε^T_i be the error of the support of itemset {a_i} introduced by the truncation; then ε^T_i = |⟨Ã − A⟩_ii|/m ≤ σ²_{k+1}/m. For 2-itemsets, let ε^T_{ij} be the error of the support of itemset {a_i, a_j} introduced by the truncation; then ε^T_{ij} = |⟨Ã − A⟩_ij|/m ≤ σ²_{k+1}/m.
Second, we consider the error introduced by Algorithm Mapping, summarized in Table 2.

Table 2. Mapping Function

⟨t1⟩_i · ⟨t2⟩_j   ⟨t̃1⟩_i · ⟨t̃2⟩_j   ⟨R(t1)⟩_i · ⟨R(t2)⟩_j
0 · 0             ≥ (1 − ρt)²        1 · 1
0 · 1             ≥ (1 − ρt)²        1 · 1
1 · 1             ≤ (1 − ρt)         1 · 0
1 · 1             ≤ (1 − ρt)²        0 · 0

As we may see from Table 2, the error introduced by Algorithm Mapping satisfies

\[
\epsilon^M_{ij} \le \max\left\{ \frac{1 - (1 - \rho_t)^2}{(1 - \rho_t)^2}, \frac{1 - \rho_t}{\rho_t} \right\} \epsilon^T_{ij} \tag{21}
\]

The bound reaches its optimal (smallest) value at ρt = (3 − √5)/2, where ε^M_{ij} ≤ 1.618 σ²_{k+1}/m.
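The optimal threshold can be verified numerically: the sketch below evaluates the max{·} factor of (21) and confirms that ρt = (3 − √5)/2 makes it equal to 1.618 (the golden ratio) and minimizes it over a coarse grid. The function name is our own.

```python
import math

def bound_factor(rho_t):
    """The max{...} factor in eq. (21) that multiplies epsilon^T_ij."""
    x = 1.0 - rho_t
    return max((1.0 - x * x) / (x * x), x / rho_t)

rho_opt = (3.0 - math.sqrt(5.0)) / 2.0   # about 0.382
golden = (1.0 + math.sqrt(5.0)) / 2.0    # about 1.618

# at rho_opt the two branches of the max coincide at the golden ratio
print(abs(bound_factor(rho_opt) - golden) < 1e-12)  # True

# a coarse grid search confirms rho_opt minimizes the factor:
# one branch increases in rho_t, the other decreases, so the
# minimum of the max sits where they cross
grid = [i / 1000.0 for i in range(1, 1000)]
best = min(grid, key=bound_factor)
print(abs(best - rho_opt) < 1e-3)  # True
```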
For any i ≠ j, the error introduced by the additional random noise is ε^N_{ij} ≈ ρ²_m, which can
be safely neglected. Thus, for any 2-itemset {a_i, a_j}, the error of its support introduced
by the whole perturbation process satisfies

\[
\max\{\rho_1, \rho_2\} \le \epsilon_{ij} = \epsilon^T_{ij} + \epsilon^M_{ij} \le 2.618\, \frac{\sigma_{k+1}^2}{m} \tag{22}
\]
This bound is fairly small when m is large. The error of the support of a 1-itemset can also
be bounded by the above value following the same steps. Actually, since SVD tends to
enlarge the support of high-support itemsets and reduce the support of low-support
itemsets, the effective error that may result in false positives or false drops is much
smaller than the upper bound. We may see this from the simulation results in Section 4.3.
D Appendix: Proof of Theorem 2
Proof. Here we will derive an analytical formula for δ. Consider the F-norm (Frobenius
norm) of T, where

\[
\|T\|_F \equiv \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} |\langle T \rangle_{ij}|^2}.
\]

By the theory of SVD, given all singular values σ_i of T, we have ‖T‖_F = √(σ₁² + · · · + σ_n²). Since T̃ = T Vk Vk′ is the
closest rank-k approximation of T, the F-norm of T − T̃ satisfies

\[
\frac{\|T - \tilde T\|_F}{\|T\|_F} = \sqrt{\frac{\sigma_{k+1}^2 + \cdots + \sigma_n^2}{\sigma_1^2 + \cdots + \sigma_n^2}} \tag{23}
\]

Although the rounding-off error of T̃ is hard to bound, an overpessimistic estimation can be given as

\[
\|T - R(T)\|_F \ge \|T - \tilde T\|_F. \tag{24}
\]
Note that kT − R(T )kF is equal to the sum of the hamming distance between all corresponding transactions t and R(t), i.e.,
X
kT − R(T )kF =
#{items at which t and R(t) differ}
(25)
t∈T
The number of false positives (ai ∈ t̃, ai 6∈ t) is much less than the number of items
removed from T . Thus we may estimate the number of removed items by kT −R(T )kF .
Thus the number of “unwanted” items divulged to the data miner can be estimated by
s
2
σk+1
+ · · · + σn2
kR(T )kF
kT k − kT − R(T )kF
P
(26)
δ≈ P
≈
≤1−
2
σ1 + · · · + σn2
t∈T |t|
t∈T |t|
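The F-norm identity and eq. (23) can be checked numerically; the toy matrix below is our own example.

```python
import numpy as np

# toy transaction matrix: 5 transactions over n = 4 items (our own example)
T = np.array([[1, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 1, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1]], dtype=float)
k = 2

_, s, Vt = np.linalg.svd(T, full_matrices=False)
Vk = Vt[:k].T
T_tilde = T @ Vk @ Vk.T  # closest rank-k approximation of T

# ||T||_F = sqrt(sigma_1^2 + ... + sigma_n^2)
print(np.isclose(np.linalg.norm(T, 'fro'), np.sqrt(np.sum(s ** 2))))  # True

# eq. (23): relative F-norm error of the rank-k truncation
lhs = np.linalg.norm(T - T_tilde, 'fro') / np.linalg.norm(T, 'fro')
rhs = np.sqrt(np.sum(s[k:] ** 2) / np.sum(s ** 2))
print(np.isclose(lhs, rhs))  # True
```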