Survey							
                            
		                
		                * Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Secure Data Outsourcing Outline     Motivation Background Research issues Summary Motivation  Cost of maintaining/mining large data  4-5 times of the cost of data acquisition  DBAs are paid well   More and more data service providers  Low cost – cloud computing  Maintain one database for one user  multiple users  Examples:     Alentus.com Datapipe.com Discountasp.net …  Concerns about data security and privacy  Untrusted service provider Un-trusted service provider  Lazy: incentives to perform less  Curious: incentives to acquire information  Malicious:  Denial of service  Incorrect results  Possibly compromised Challenges  Data confidentiality  Data need to be encrypted (?)  Utility of protected data?  Query utility  Mining utility  Access pattern privacy  Integrity  Data integrity  Query integrity    Correct Complete Fresh Why is it hard for query services?  Arbitrary expressivity  SQL statements  Often, restricted for certain type of query for simplicity (e.g. range query, knn query)  Cost  Communication  Computation (server side vs client side) Why it is hard for mining services?  Many data mining models  Different utilities to preserve  No one-size-for-all solutions Data confidentiality  Bucketization method (crypto-index)  Order preserving encryption  Perturbations Bucketization method  Hacigumus (SIGMOD02)  Main steps  Partition sensitive attributes  Order preserving: supports comparison  Random: query rewriting becomes hard  Build index on the partitions  Rewrite queries to target partitions  ‘john doe’  105  Select * from T’ where name=105  Execute queries and return results  Prune/post-process results on client  Trade off between confidentiality and overhead  Larger partition  increased privacy  increased overheads Order preserving encryption  Agrawal2004, Boldyreva2009  The set of data is securely transformed so that the order is preserved but the distribution and domain are changed  Benefits: indexing/searching on OPE encrypted data  Weakness: once the original distribution is known, OPE is broken  Not attribute-wise order preserving  Order preserving encryption (OPE, Agrawal et al 2004) is not resilient to distribution-based attacks OPE Original Xi distribution is known Bucket based Estimation Transformed Xi’ distribution Data perturbation  Definition 1. randomly change the original data 2. the attacker cannot effectively recover the original data 3. the desired properties are preserved  Techniques  Single dimension: noise addition  Multidimensional  Geometric perturbation  Random projection  RASP random space perturbation Noise addition  Y = X+ R  X: original data column, R: random noise (distribution published), Y: published data  Applications in data mining  Reconstructing column distribution  Rakesh Agrawal SIGMOD 2000  Applied to privacy-preserving decision tree, naïve bayes classifier  Attacks  Spectral filtering (Kargupta ICDM 2004)  PCA reconstruction (Huang  Multiplicative perturbations  Geometric data perturbation for outsourced data mining  Random Projection  RASP perturbation for query services (range query, kNN query). Perturbation-based framework Mining service Geometric data perturbation  Y=RX+T+D  R: secret rotation matrix (preserve Euclidean distances)  T: secret random translation matrix, D: secret random noise matrix  Distances are approximately preserved (D)  Resilient to most attacks to rotation perturbation  Applications  Outsourced privacy preserving data mining, applicable for many classification and clustering algorithms  Attacks  Population based attacks (when covariance matrix is revealed) Random Projection  Y=AX+D  A: random projection, e.g., entries from N(0,1)  Distances are approximately preserved  Applications  Many classification and clustering algorithms  Worse accuracy than geometric perturbation  Good for sparse high-dimensional data (text data), i.e., sketch methods (A is randomly generated for EACH record)  Attacks  Possibly more resilient than other two perturbation methods  But utility (distance) is not well preserved RASP perturbation k-dimensional numeric data, n records, represented as a k x n matrix, x: a record (1) Extend x to k+2 dimensions - (K+1) th dimension is always 1 – homogeneous dimension (K+2) th dimension v is a real random number drawn from (2) Encryption - A is a (k+2)x(k+2) invertible real value matrix, with at least two non-zero values for each row and the last column of A has all non-zero values - A is shared by all records  Properties  Not an OPE  Preserves convexity of the dataset  Convex dataset in Rk  another convex dataset in Rk+2.  Good for range query  Each range query in Rk  hyperplane based query  range query in Rk+2 . RASP properties  Convexity preserving   Queried range (hypercube) is convex RASP transforms the range to another convex (polyhedron) half space: wTx<=a wTx=a The intersection of convex sets is also convex. illustration of convexity preserving Original space Encrypted space Secure query transformation  A naïve solution  Based on the convexity preserving property Problems: (1) A-1 can be probed (2) is . . If a is known, the whole dimension i is breached. Secure query transformation  Enhanced solution  Xk+2 is always positive  (Xi-a)  0  (Xi-a)Xk+2  0  Correspondingly, in the encrypted space yTy  0, Problems addressed: (1) A-1 cannot be derived from  (2) (Xi-a)Xk+2  0 contains the random component Xk+2 that protects the condition (Xi-a)  0 Efficient two-stage query processing  illustrated Stage2: Filter out the junk records Stage1: Querying this bounding box Original space Transformed space A multidimensional tree index is been built on the encrypted data (in the transformed space) in the server. Stage 1: The client calculates the large bounding box; The server uses the index to find the results. Stage 2: filter the initial results with the conditions yTiy  0 for 1…2k Note: the two-stage strategy works, if the output of stage 1 is significantly smaller than the original database and can be fit into the memory. Otherwise, use linear scan with stage 2 filtering. RASP-based data mining  Preserving range query  linear classifier  Use the boosting framework to get strong classifiers (PerturBoost, in ICDM 2013) Access pattern privacy  On database queries  Problem is the same as PIR  Attackers may use the access pattern to breach data confidentiality  Each of previous approaches should handle this problem! PIR is impractical  Solutions based on private Information retrieval (PIR)  PIR is still impractical For Bucktization approach  Based on the architecture of Hacigumus (SIGMOD02)  Hore VLDB04  For range query  Privacy concern: reveal the distribution of value in each bucket  “Diffusion”: split buckets and combine parts of different buckets  Trade off: now the server needs to return more noisy results  larger size For OPE  Use queries to find out the distributions, then break the encryption For RASP  Secure query transformation  Attacks to transformed queries Oblivious RAM  Access pattern: read/write data items  Setting:  Client has a small secure memory  Server has large insecure storage, semihonest  Data items are encrypted  Client cannot hide the accessed locations  An active area Existing Approaches Dummy Block Real Block Real Block Dummy Block Real Block Dummy Block Dummy Block Real Block  Inside a level  Some real blocks  Useful data  Some dummy blocks  Random data  Randomly permuted  Only the client knows the permutation Existing Approaches  Reading  Read a block from each level  One real block.  Remaining are dummy blocks dummy real dummy dummy dummy dummy Server Client Existing Approaches Server (before) Client Server (after)  Writing  Shuffle consecutively filled levels.  Write into next unfilled level.  Clear the source levels shuffle blocks Continuous Shuffling … To write:  The Problem with Existing Approaches  Integrity guarantee  Merkle hash tree H(H(x1)+H(x2)) , + is string concatenation Can be stored with tree like structure : index, xml  Hash chains Query correctness with merkle by Devanbu et. al. Using merkle tree Example: 5<=q<=10 LUB(q) = 4 GLB(q) = 11  Operations:  Selections, projections, equijoins, set ops  Issues  Works only on data with verification objects  Query expressiveness  Expensive  Related work  Pang et. al (ICDE04, SIGMOD05), using ElGamal function  Sion VLDB05: challenge token  F.Li SIGMOD06: freshness Secure keyword search  Simple information retrieval  For a keyword, find the documents containing the keyword  What if the documents are encrypted word by word  and if the keyword is also encrypted Secure keyword search  Song 2000 •Seed is random, different for each Wi •Key idea: Li and Ri are selfverifiable •Advantage of XOR How to set K?  Setting of ki  Ki = Fk’(Wi), k’ is secret  User publishes W and k = Fk’(W)  Server checks CiW  whether <Li, Fk(Li)> == CiW It reveals nothing if Ci is not the ciphertext for W. And Li is random for different Wi – server cannot find any information from Li. Hidden search  In previous schemes, W is revealed  Weakness: each search will have to release k for W  Easy to collect information  Solution: encrypt Wi with an private key, then xor with <Li, Fk(Li)> Recent developments  Reza 2006  “Searchable symmetric encryption: improved definitions and efficient constructions”  Completely solved this problem, with a solution indistinguishability under chosen ciphertext attack (IND-CCA) Trusted hardware Possible benefits Discussion  Data confidentiality/access pattern  Restrict cryptographic definition (keyword search) or  Relaxed definition (perturbation, bucketization, OPE, etc.)  It is very difficult to formulate and prove the security of non-traditional approaches  Do we need to reformulate the security model? and how?