IBM Research

Securing Electronic Health Records without Impeding the Flow of Information

Rakesh Agrawal*
Microsoft Search Labs, Mountain View, CA
rakesha@microsoft.com

Christopher Johnson
IBM Almaden Research Center, San Jose, CA
johnsocm@us.ibm.com

* Based on work done while author was at IBM Almaden

IMIA Conference – Security in Health Information Systems | April 29, 2006

Based on joint work with
Roberto Bayardo, Alvin Cheung, Alexandre Evfimievski, Tyrone Grandison, Jerry Kiernan, Kristen Lefevre, Ramakrishnan Srikant, Yirong Xu

Thesis
Technology alone cannot solve the complex problem of securely managing health information; at the same time, policy and law need to be informed of what is technically feasible and in what timeframe.
By advancing technology, we can:
– change the mix of legislation, societal norms, market forces, and technology comprising the solution; and
– improve the overall quality of the solution.

Outline
Illustrate thesis with technology examples based on Hippocratic database work
Recommendations for
– Policy designers and legislators
– Solution developers
– Scientists and researchers

Hippocratic Database Technologies
GOAL: Create a new generation of information systems that protect the privacy, security, and ownership of data while not impeding the flow of information.
• Active Enforcement: Data item level enforcement of disclosure policies and patient preferences
• Privacy-Preserving Data Mining: Preserves privacy at individual level, allowing accurate data mining models at aggregate level
• Compliance Auditing: Determine whether data has been accessed in violation of specified policies
• Optimal k-anonymization: De-identifies records in a way that maintains truthful data but is not prone to data linkage attacks
• Sovereign Information Integration: Selective, minimal sharing across autonomous data sources, without trusted third party

Active Enforcement
[Figure: Active Enforcement workflow. Stages: Policy Creation, Patient Preferences & Data Collection, Application Data Retrieval.]

• Provides cell-level disclosure control.

  #  Name    Age  Phone
  1  Adam    25   111-1111
  3  Bob     -    333-3333
  4  Daniel  40   -

• Application modification not required.
• Database agnostic; does not require changes to the database engine.

• Privacy Policy: Organizations define a set of policies describing who may access data (users or roles), for what purposes the data may be accessed (purposes), and to whom the data may be disclosed (recipients).
• Consent: Data subjects are given control, through opt-in and opt-out choices, over who may see their data and under what circumstances.
• Disclosure Control: Database enforces privacy policies and data subject consent choices with respect to all data access.
  – Active Enforcement system intercepts and rewrites incoming queries to comply with policies, subject choices, and context.
  – Rewritten queries benefit from all of the optimizations and performance enhancements provided by the underlying engine (e.g. parallelism).
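The query rewriting described above can be illustrated with a minimal sketch. It assumes a hypothetical `patients` table and a hypothetical `patient_choices` table holding one opt-in flag per row and per maskable column; the schema, the helper `rewrite_select`, and the hidden values (patient 2, Bob's age, Daniel's phone) are invented for illustration and are not the authors' implementation. Column names are assumed to come from a trusted list.

```python
import sqlite3

# Hypothetical schema and data; only the visible result mirrors the slide's table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patients (id INTEGER PRIMARY KEY, name TEXT, age INTEGER, phone TEXT);
CREATE TABLE patient_choices (id INTEGER PRIMARY KEY, row_choice INTEGER,
                              age_choice INTEGER, phone_choice INTEGER);
INSERT INTO patients VALUES (1, 'Adam', 25, '111-1111'), (2, 'Carol', 31, '222-2222'),
                            (3, 'Bob', 52, '333-3333'), (4, 'Daniel', 40, '444-4444');
INSERT INTO patient_choices VALUES (1, 1, 1, 1), (2, 0, 0, 0), (3, 1, 0, 1), (4, 1, 1, 0);
""")

# Columns that carry a per-cell opt-in flag (an assumption of this sketch).
CHOICE_COLUMNS = {"age": "age_choice", "phone": "phone_choice"}

def rewrite_select(columns):
    """Build a SELECT that nulls out cells and drops rows the patient has not opted in to."""
    parts = []
    for col in columns:
        choice = CHOICE_COLUMNS.get(col)
        if choice is None:
            parts.append(f"p.{col}")
        else:
            parts.append(
                f"CASE WHEN EXISTS (SELECT 1 FROM patient_choices c "
                f"WHERE c.id = p.id AND c.{choice} = 1) THEN p.{col} ELSE NULL END AS {col}"
            )
    return (
        f"SELECT {', '.join(parts)} FROM patients p "
        "WHERE EXISTS (SELECT 1 FROM patient_choices c "
        "WHERE c.id = p.id AND c.row_choice = 1) ORDER BY p.id"
    )

rows = conn.execute(rewrite_select(["id", "name", "age", "phone"])).fetchall()
for row in rows:
    print(row)
```

Because the masking is expressed as ordinary SQL, the database engine can plan and optimize the rewritten query like any other, which is the property the slide emphasizes.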
[Figure: Active Enforcement architecture. Labels: Installation, Policy Parser, Negotiation, Patient Preferences & Policy Matching, Installed Policy, Patient Records, DATABASE, Enforcement, JDBC/ODBC Driver.]
VLDB 02, WWW 03, VLDB 04

Query Modification Example
(Disclose Name only of Patients who have opted-in)

Original query:

    SELECT Name
    FROM Patients
    WHERE Age < 20

Modified query:

    SELECT CASE WHEN EXISTS
             (SELECT Name_Choice
              FROM Patient_Choices
              WHERE Patients.Patient# = Patient_Choices.Patient#
                AND Patient_Choices.Name_Choice = 1)
           THEN Name ELSE null END
    FROM Patients
    WHERE Age < 20
      AND EXISTS
             (SELECT Patient#_Choice
              FROM Patient_Choices
              WHERE Patients.Patient# = Patient_Choices.Patient#
                AND Patient_Choices.Patient#_Choice = 1)

[Figure: Elapsed Time (seconds) vs. Choice Selectivity (%). Legend: Unmodified, Modified External Multiple, Modified Internal.]
• Measured performance of a query selecting all records from a 5 million-record table
• Compared performance of original and modified queries for varied choice selectivity
• Not surprisingly, performance is actually better for modified queries when we use privacy enforcement as an additional selection condition
  – Able to use indexes on choice values
• Shows the importance of database-level privacy enforcement for performance

Audit Scenario
• Jane has not been feeling well and decides to consult her doctor.
• The doctor uncovers that Jane's blood sugar level is high and suspects diabetes.
• Sometime later, Jane receives promotional literature from a pharmaceutical company, proposing over the counter diabetes tests.
• Jane complains to the Department of Health and Human Services, saying that she had opted out of the doctor sharing her medical information with pharmaceutical companies for marketing purposes.
• The doctor must now review disclosures of Jane's information in order to understand the circumstances of the disclosure, and take appropriate action.

Audit Expression
Who has accessed Jane's disease information?

    audit T.disease
    from Customer C, Treatment T
    where C.cid=T.pcid and C.name = 'Jane'

Problem Statement
Given
– A log of queries
– An audit expression specifying sensitive data
NOT Given
– Log of data accesses
Precisely and efficiently identify
– Those queries that accessed the data specified by the audit expression in the past

Compliance Auditing
• Audits whether particular data has been disclosed in violation of the specified policies.
• Audit expression specifies what potential data disclosures need monitoring.
• Identifies logged queries that accessed the specified data.
• Auditors can analyze the circumstances of violations.
• Make necessary corrections to procedures, policies, security.

[Figure: Compliance auditing architecture. Queries arrive with purpose and recipient, and the database layer generates an audit record for each query in the Query Audit Log. Database triggers track updates, inserts, and deletes to the data tables and record them in the Backlog. An audit query returns the IDs of log queries having accessed data specified by the audit query.]

Query Audit Log
  ID  Timestamp  Query     User        Purpose    Recipient
  1   2004-02…   Select …  B. Jones    Marketing  PharmaCo.
  2   2004-02…   Select …  S. Roberts  Treatment  S. Roberts
VLDB 04

Overhead on Updates
[Figure: Time (minutes) vs. # of versions per tuple. Legend: Composite, Simple, No Index, No Triggers.]
• 7x if all tuples are updated
• 3x if a single tuple is updated
• Negligible by using Recovery Log to build Backlog tables

Audit Query Execution Time
[Figure: Audit Query Execution Time.]

Privacy Preserving Data Mining
[Figure: Patient values (labels: Kevin's LDL, Kevin's weight, Julie's LDL; records 126 | 210 | ... and 128 | 130 | ...) pass through Randomizers, producing randomized records (161 | 165 | ... where 161 = 126+35, and 129 | 190 | ...). The distribution of LDL and the distribution of weight are reconstructed and passed to Data Mining Algorithms, which produce a Data Mining Model.]
• Preserves privacy at the individual patient level, but allows accurate data mining models to be constructed at the aggregate level.
• Adds random noise to individual values to protect patient privacy.
• EM algorithm estimates original distribution of values given randomized values + randomization function.
• Algorithms for building classification models and discovering association rules on top of privacy-preserved data with only small loss of accuracy.
[Figure: Two charts comparing Original, Randomized, and Reconstructed data; one is plotted against Randomization Level.]
Sigmod00, KDD02, Sigmod05

Optimal k-Anonymization
Goal: De-identify patient data such that it retains its integrity, but is resistant to data linkage attacks.
Motivation: Naïve de-identification methods are prone to data linkage attacks, which combine subject data with publicly available information to re-identify represented individuals.
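To make the data linkage attack concrete, here is a small sketch under invented assumptions: a "de-identified" release that drops names but keeps ZIP code, birth year, and sex, and a public list containing the same three attributes. All records, names, and the `link` helper are fictional and exist only to illustrate the idea.

```python
from collections import defaultdict

# Fictional release: names removed, quasi-identifiers kept.
released = [
    {"zip": "95120", "birth_year": 1961, "sex": "F", "diagnosis": "Diabetes"},
    {"zip": "95120", "birth_year": 1975, "sex": "M", "diagnosis": "Asthma"},
    {"zip": "95123", "birth_year": 1975, "sex": "M", "diagnosis": "Influenza"},
]

# Fictional public list (e.g. a directory) with the same quasi-identifiers.
public = [
    {"name": "Ann Lee", "zip": "95120", "birth_year": 1961, "sex": "F"},
    {"name": "John Roe", "zip": "95120", "birth_year": 1975, "sex": "M"},
    {"name": "Sam Poe", "zip": "95123", "birth_year": 1975, "sex": "M"},
    {"name": "Tom Moe", "zip": "95123", "birth_year": 1975, "sex": "M"},
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")

def link(released_rows, public_rows, keys=QUASI_IDENTIFIERS):
    """Return (name, diagnosis) pairs for released rows that match exactly one public row."""
    index = defaultdict(list)
    for person in public_rows:
        index[tuple(person[k] for k in keys)].append(person["name"])
    reidentified = []
    for row in released_rows:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # a unique match re-identifies the patient
            reidentified.append((names[0], row["diagnosis"]))
    return reidentified

reidentified = link(released, public)
print(reidentified)
```

The third released record matches two people in the public list, so it cannot be attributed to either with certainty; making every record share its quasi-identifiers with other records is what blocks this kind of unique match.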
Process of k-Anonymization
• Data Suppression - Involves deleting particular cell values or entire tuples.
• Value Generalization - Entails replacing specific values, such as a telephone number, with more general ones, such as the area code alone.

Samarati and Sweeney k-Anonymity* Method
– A k-anonymized data set has the property that each record is indistinguishable from at least k-1 other records within the data set.

Advantages of Optimal k-anonymization
• Truthful - Unlike other disclosure protection techniques that use data scrambling, swapping, or adding noise, all information within a k-anonymized dataset is truthful.
• Secure - More secure than other de-identification methods, which may inadvertently reveal confidential information.

Optimal k-Anonymization
– We have developed a k-anonymization algorithm that finds optimal k-anonymizations under two representative cost measures and variations of k.

Original data:
  Name   Address               City   Age  Diagnosis
  Eric   7, rue du Mont Dore   Paris  26   Influenza
  Paul   13, rue des Canettes  Paris  42   Hypertens.
  Marc   48, rue du Four       Paris  47   Diabetes
  Henri  21, rue du Mont Dore  Paris  28   Asthma

Anonymized data (k=2, on name, address, age):
  Name  Address       City   Age    Diagnosis
  *     17th Arrond.  Paris  20-29  Influenza
  *     6th Arrond.   Paris  40-49  Hypertens.
  *     6th Arrond.   Paris  40-49  Diabetes
  *     17th Arrond.  Paris  20-29  Asthma

* P. Samarati and L. Sweeney. "Generalizing Data to Provide Anonymity when Disclosing Information." In Proc. of the 17th ACM SIGMOD-SIGACT-SIGART Symposium on the Principles of Database Systems, 188, 1998.
ICDE05

Sovereign Information Integration
Separate databases due to statutory, competitive, or security reasons.
Minimal Necessary Sharing: Selective, minimal sharing on a need-to-know basis.
Example: Among those patients who took a particular drug, how many with a specified DNA sequence had an adverse reaction?
Researchers must not learn anything beyond counts.
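One way to see how two parties could learn only a count is a toy sketch based on commutative encryption, where encrypting with two keys gives the same result in either order. This is an illustration under stated assumptions, not the published protocol: the modulus `P`, the helpers `to_group`, `new_key`, and `encrypt`, and the small example sets are all chosen for this sketch, both parties are simulated in one process, and the parameters are not suitable for real use.

```python
import hashlib
import math
import secrets

P = 2**127 - 1  # a Mersenne prime; toy-sized, not a vetted cryptographic parameter

def to_group(value: str) -> int:
    """Hash a value to an integer in [1, P-1]."""
    digest = hashlib.sha256(value.encode()).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1

def new_key() -> int:
    """Pick a secret exponent coprime to P-1 so that exponentiation is one-to-one."""
    while True:
        key = secrets.randbelow(P - 3) + 2
        if math.gcd(key, P - 1) == 1:
            return key

def encrypt(x: int, key: int) -> int:
    # (x^a)^b == (x^b)^a mod P, so the order of encryption does not matter.
    return pow(x, key, P)

def private_intersection_count(r_values, s_values):
    key_r, key_s = new_key(), new_key()  # in practice each party keeps its own key secret
    r_once = [encrypt(to_group(v), key_r) for v in r_values]  # sent from R to S
    s_once = [encrypt(to_group(v), key_s) for v in s_values]  # sent from S to R
    r_twice = {encrypt(c, key_s) for c in r_once}  # S adds its layer to R's values
    s_twice = {encrypt(c, key_r) for c in s_once}  # R adds its layer to S's values
    # Doubly encrypted values coincide exactly for common elements.
    return len(r_twice & s_twice)

count = private_intersection_count(["a", "u", "v", "x"], ["b", "u", "v", "y"])
print(count)
```

In a real exchange the doubly encrypted lists would also be reordered before being returned, so that a party cannot tell which of its own values matched.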
• Algorithms for computing joins and join counts while revealing minimal additional information.

[Figure: Two parties hold R = {a, u, v, x} and S = {b, u, v, y}. Labels: Medical Research Inst., DNA Sequences, Drug Reactions.]
– R ⋈ S = {u, v}: R must not know that S has b and y; S must not know that R has a and x.
– Count (R ⋈ S): R and S do not learn anything except that the result is 2.
Sigmod 03, DIVO 04

Recommendations
Policy Makers & Legislators
– Continuous technology monitoring and understanding to inform policies and laws (current and new)
– Invest in research
Solution Developers (Technologists)
– Design-in ethical considerations (e.g. respect for privacy, safeguard against misuse); they can't be afterthoughts
– Engage in dialog with policy makers and legislators to educate them on performance implications of the policies/laws

Recommendations for Researchers
Asking questions is easy: it's answering them that's hard.

Policy Specification
• How to determine if the policy specification accurately captures the intent of the policy maker? (The person specifying the policy is usually not a computer scientist.)
• How to help the patient understand the policy and the implications of his or her choices?
• How to design a policy language that reconciles the goals of understandability and efficient computation?

Sticky Policies
• Healthcare organizations should be assured that original policy controls will be enforced over data after transfer to other entities.
• Transferees of patient data should be capable of applying source disclosure policies to any information in their databases.
• Database should enforce source and enterprise policies and resolve any conflicts among policies.
[Figure: Hospital 1 and Hospital 2. Labels: Patient Records DB, patient data, Patient data + policy annotations, policies, Data compliant with source and enterprise policies.]

Data Pointillism
Source 1:
  Name   Phone
  Bob    394-1015
  Alice  396-1012
  Alice  396-1112

Source 2:
  Phone     Address   City
  396-1012  Maple St  Chatham
  394-1015  -         Madison
  396-1112  Maple St  Madison

Source 3:
  Patient  Policy#
  Alice    AAA1035
  Bob      AAA1035
  Alice    UHG1035

Pointillist:
  Bob    394-1015  Maple St  Madison  AAA1035
  Alice  396-1012  Maple St  Chatham  UHG1035
  Alice  396-1112  Maple St  Madison  AAA1035

• > 14B records with Choicepoint
• Data from > 22,000 sources in RDC's GRID
• > 550 companies compiling databases of private information
• Accuracy? Limits?
• How to allow someone to verify data?
• Identifying and correcting errors?
• Usage control?

Massively Distributed Data Management
• What if patient data is stored on personal devices?
• Pervasive monitoring devices will also collect patient data.
• How to protect the security of these devices?
• Enable selective sharing of information stored on devices?
• Distributed backup in the network to prevent data loss?
[Images: 512MB SanDisk Cruzer, $47.99; Transcend 40GB Portable Hard Disk USB, 95mm x 71.5mm x 15mm, $189.]

Data Life Cycle Management
Healthcare organizations must define data retention policies based on legal requirements and patient specifications:
– HIPAA: 6 years (21 years for pediatric care).
– Medicare: 5 to 7 years
– AHA & AHIMA: at least 10 years
• Data compression vs. encryption
• How to remove expired data and forget persistent data?
• How to establish truthfulness of data?
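As a small illustration of turning a retention policy into something a system can check, the sketch below computes when a record's retention period ends. The retention figures are copied from the slide and used only as sample configuration, not as legal guidance; the rule names, the choice of the upper Medicare bound, and the "longest applicable period wins" rule are assumptions of this sketch.

```python
from datetime import date

# Retention periods in years, taken from the slide as sample configuration only.
RETENTION_YEARS = {
    "HIPAA": 6,
    "HIPAA_pediatric": 21,
    "Medicare": 7,      # slide says 5 to 7 years; upper bound assumed here
    "AHA_AHIMA": 10,    # slide says at least 10 years
}

def add_years(d: date, years: int) -> date:
    """Shift a date by whole years, mapping Feb 29 to Feb 28 in non-leap years."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)

def retention_expiry(created: date, rules) -> date:
    """Assumed policy: the longest applicable retention period wins."""
    years = max(RETENTION_YEARS[rule] for rule in rules)
    return add_years(created, years)

def is_expired(created: date, rules, today: date) -> bool:
    return today >= retention_expiry(created, rules)

expiry = retention_expiry(date(2006, 4, 29), ["HIPAA", "AHA_AHIMA"])
print(expiry)
```

Deciding that a record has expired is the easy part; the slide's open questions about actually removing and forgetting the data are not addressed by a check like this.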
Interoperability
Sovereign health information systems must be able to communicate with one another, using standard data formats and clinical vocabularies.
Examples of current efforts include:
– HL7 messaging standards
– SNOMED-CT vocabularies
– CDA and CCR document standards
Much work remains to be done to make systems interoperable.
Mass collaboration might be useful in defining clinical vocabularies and taxonomies.

Concluding Remarks
Hippocratic Database technologies protect the security of electronic health records and patient privacy without impeding the flow of information.
We need not sacrifice security or privacy to gain value from EHRs for diagnosis, treatment, and research.
We must focus on:
– Deriving value from bits we know how to manage.
– Demonstrating what could not be done before.

Thank you!
Papers: rakesh.agrawal-family.com
Collaborations: rakesh.agrawal@microsoft.com, johnsocm@us.ibm.com

References

Active Enforcement
R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. "Hippocratic Databases." Proc. of the 28th Int'l Conf. on Very Large Databases (VLDB 2002), Hong Kong, August 2002.
K. Lefevre, R. Agrawal, V. Ercegovac, R. Ramakrishnan, Y. Xu, D. DeWitt. "Limiting Disclosure in Hippocratic Databases." Proc. of the 30th Int'l Conf. on Very Large Databases (VLDB 2004), Toronto, Canada, August 2004.

Compliance Auditing
R. Agrawal, R. Bayardo, C. Faloutsos, J. Kiernan, R. Rantzau, R. Srikant. "Auditing Compliance with a Hippocratic Database." Proc. of the 30th Int'l Conf. on Very Large Databases (VLDB 2004), Toronto, Canada, August 2004.

Privacy-Preserving Data Mining
R. Agrawal, R. Srikant. "Privacy-Preserving Data Mining." Proc. of the ACM SIGMOD Conference on Management of Data, Dallas, May 2000.
A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke. "Privacy Preserving Mining of Association Rules." Proc. of the 8th ACM SIGKDD Int'l Conference on Knowledge Discovery in Databases and Data Mining, Edmonton, Canada, July 2002.

Optimal k-Anonymization
R. J. Bayardo, R. Agrawal. "Data Privacy Through Optimal k-Anonymization." To appear in Proc. of the 21st Int'l Conf. on Data Engineering (ICDE 2005), Tokyo, Japan, April 2005.

Sovereign Information Integration
R. Agrawal, A. Evfimievski, R. Srikant. "Information Sharing Across Private Databases." Proc. of the ACM SIGMOD Conference on Management of Data, San Diego, California, June 2003.
R. Agrawal, D. Asonov, R. Srikant. "Enabling Sovereign Information Sharing Using Web Services." Proc. of the ACM SIGMOD Conference on Management of Data, Paris, France, June 2004.