Data and Applications Security Developments and Directions
Dr. Bhavani Thuraisingham
The University of Texas at Dallas
Lecture #17: Data Mining, Security and Privacy
March 15, 2006

Objective of the Unit
- This unit provides an overview of data mining for security (national security and information security) and then discusses privacy

Why Do We Need Intrusion Detection Systems?
- Due to the proliferation of high-speed Internet access, more and more organizations are becoming vulnerable to potential cyber attacks, such as network intrusions
- The sophistication and severity of cyber attacks have also increased recently (e.g., Code-Red I & II, Nimda, and more recently the SQL Slammer worm on Jan. 25)
- Security mechanisms always have inevitable vulnerabilities
- Current firewalls are not sufficient to ensure security in computer networks
[Figure: Incidents reported to the Computer Emergency Response Team/Coordination Center (CERT/CC) per year, 1990 to 2002; vertical axis 0 to 90000]
[Figure: The geographic spread of the Sapphire/Slammer worm 30 minutes after release. Source: www.caida.org]

Data Mining for Intrusion Detection
- Increased interest in data mining based intrusion detection
  - Attacks for which it is difficult to build signatures
  - Unforeseen/unknown/emerging attacks
  - Distributed/coordinated attacks
- Data mining approaches for intrusion detection
  - Misuse detection
    - Building predictive models from labeled data sets (instances are labeled as “normal” or “intrusive”) to identify known intrusions
    - High accuracy in detecting many kinds of known attacks
    - Cannot detect unknown and emerging attacks
  - Anomaly detection
    - Detect novel attacks as deviations from “normal” behavior
    - Potential high false alarm rate: previously unseen (yet legitimate) system behaviors may also be recognized as anomalies

Outline: Data Mining for Security (National and Cyber)
- Data Mining for Intrusion Detection
- General discussions on data mining for counter-terrorism
- Data mining for non real-time threats and real-time threats
- Data mining for cyber terrorism and bioterrorism
- Discussions of some techniques
- Directions and challenges

Data Mining for Counter-terrorism

Data Mining for Counterterrorism
- Data Mining for Non real-time Threats: gather data, build terrorist profiles, mine data, prune results
- Data Mining for Real-time Threats: gather data in real-time, build real-time models, mine data, report results

Data Mining Needs for Counterterrorism: Non-real-time Data Mining
- Gather data from multiple sources
  - Information on terrorist attacks: who, what, where, when, how
  - Personal and business data: place of birth, ethnic origin, religion, education, work history, finances, criminal record, relatives, friends and associates, travel history, . . .
  - Unstructured data: newspaper articles, video clips, speeches, emails, phone records, . . .
- Integrate the data, build warehouses and federations
- Develop profiles of terrorists, activities/threats
- Mine the data to extract patterns of potential terrorists and predict future activities and targets
- Find the “needle in the haystack” - suspicious needles?
- Data integrity is important
- Techniques have to SCALE

Data Mining for Non Real-time Threats
[Diagram; boxes as labeled on the slide:]
- Data sources with information about terrorists and terrorist activities
- Integrate data sources
- Clean/modify data sources
- Build profiles of terrorists and activities
- Mine the data
- Examine results/prune results
- Report final results

Data Mining Needs for Counterterrorism: Real-time Data Mining
- Nature of data
  - Data arriving from sensors and other devices
    - Continuous data streams
  - Breaking news, video releases, satellite images
  - Some critical data may also reside in caches
- Rapidly sift through the data and set aside unwanted data for later use and analysis (non-real-time data mining)
- Data mining techniques need to meet timing constraints
- Quality of service (QoS) tradeoffs among timeliness, precision and accuracy
- Presentation of results, visualization, real-time alerts and triggers

Data Mining for Real-time Threats
[Diagram; boxes as labeled on the slide:]
- Data sources with information about terrorists and terrorist activities
- Integrate data sources in real-time
- Rapidly sift through data and discard irrelevant data
- Build real-time models
- Mine the data
- Examine results in real-time
- Report final results

Data Mining Needs for Counterterrorism: Cybersecurity
- Determine nature of threats and vulnerabilities
  - e.g., emails, trojan horses and viruses
- Classify and group the threats
- Profiles of potential cyberterrorist groups and their capabilities
- Data mining for intrusion detection
  - Real-time/near-real-time data mining
  - Limit the damage before it spreads
- Data mining for preventing future attacks
  - Forensics

Data Mining Needs for Counterterrorism: Protection from Bioterrorism
- Determine nature of threats
  - Biological weapons and agents, chemical weapons and agents
- Classify and group the threats
- Identify the types of substances used
- Prevention and detection mechanisms
  - Intelligence gathering, detecting symptoms, biosensors
- Determine actions to be taken to avoid fatal and dangerous situations
- Need data management engineers, data miners, computational scientists, mathematical biologists, epidemiologists to work together
  - Model the spread of diseases, detection and prevention

Some common threads
- Identify the threats
- Group/classify the threats
- Gather data; develop profiles of terrorists
- Data mining for preventing/detecting/managing terrorist attacks

Data Mining Outcomes and Techniques for Counter-terrorism

Data Mining Outcomes and Techniques
- Classification: Build profiles of terrorists and classify terrorists
- Association: John and James often seen together after an attack
- Link Analysis: Follow chain from A to B to C to D
- Clustering: Divide population; people from country X of a certain religion; people from country Y interested in airplanes
- Anomaly Detection: John registers at flight school, but does not care about takeoff or landing

Web Usage Mining for Counter-terrorism
- Determine the Web usage of suspected terrorists
- Mine web usage and give advice to analyst about the actions to take
- Mine terrorist web sites and determine behavior

Are general data/web mining techniques sufficient?
- Does one size fit all?
  - Non real-time, real-time, cyber, bio?
- What are the major differences?
  - e.g., develop models ahead of time for real-time data mining?
  - What happens in a very dynamic environment?
- Data mining tasks/outcomes
  - Classification, clustering, associations, link analysis, anomaly detection, prediction, . . . ?
- Data mining techniques
  - Which techniques are good for which problems?

Some other data mining applications for National Security
- Insider threat analysis
  - Detecting potential threats from employees of a corporation or agencies
    - E.g., espionage
- Preventing/detecting money laundering, drug trafficking, tax violations
- Protecting children from inappropriate content on the Internet
  - National Academy of Science Panel 2000-2001; Chair: Richard Thornburgh (former U.S. Attorney General)
- Protecting infrastructures, national databases, . . .

Example Success Story - COPLINK
- COPLINK developed at University of Arizona
  - Research transferred to an operational system currently in use by law enforcement agencies
- What does COPLINK do?
  - Provides an integrated system for law enforcement; integrates law enforcement databases
  - If a crime occurs in one state, this information is linked to similar cases in other states
- It has been stated that the sniper shooting case may have been solved earlier if COPLINK had been operational at that time

Where are we now?
- We have some tools for
  - building data warehouses from structured data
  - integrating structured heterogeneous databases
  - mining structured data
  - forming some links and associations
  - information retrieval tools
  - image processing and analysis
  - pattern recognition
  - video information processing
  - visualizing data
  - managing metadata
  - intrusion detection and forensics

What are our challenges?
- Do the tools scale for large heterogeneous databases and petabyte sized databases?
- Building models in real-time; need training data
- Extracting metadata from unstructured data
- Mining unstructured data
- Extracting useful patterns from knowledge-directed data mining
- Rapidly forming links and associations; get the big picture for real-time data mining
- Detecting/preventing cyber attacks
- Mining the web
- Evaluating data mining algorithms
- Conducting risk analysis / economic impact
- Building testbeds

Form a Work Agenda
- Immediate action (0 - 1 year)
  - We’ve got to know what our current capabilities are
  - Do the commercial tools scale? Do they work only on special data and limited cases? Do they deliver what they promise?
  - Need an unbiased objective study with demonstrations
- At the same time, work on the big picture
  - What do we want? What are our end results for the foreseeable future? What are the criteria for success? How do we evaluate the data mining algorithms? What testbeds do we build?
- Near-term (1 - 3 years)
  - Leverage current efforts
  - Fill the gaps in a goal-directed way; technology transfer
- Long-term (3 - 5 years and beyond)
  - 5-year R&D plan for data mining for counterterrorism

IN SUMMARY:
- Data mining is very useful to solve security problems
  - Data mining tools could be used to examine audit data and flag abnormal behavior
  - Much recent work in intrusion detection (unit #18)
    - e.g., neural networks to detect abnormal patterns
  - Tools are being examined to determine abnormal patterns for national security
    - Classification techniques, link analysis
  - Fraud detection
    - Credit cards, calling cards, identity theft etc.
- BUT CONCERNS FOR PRIVACY

Data and Applications Security Developments and Directions
Dr. Bhavani Thuraisingham
The University of Texas at Dallas
Privacy
March 29, 2005

Outline
- Data Mining and Privacy - Review
- Some Aspects of Privacy
- Revisiting Privacy Preserving Data Mining
- Platform for Privacy Preferences
- Challenges and Discussion

Some Privacy concerns
- Medical and Healthcare
  - Employers, marketers, or others knowing of private medical concerns
- Security
  - Allowing access to individual’s travel and spending data
  - Allowing access to web surfing behavior
- Marketing, Sales, and Finance
  - Allowing access to individual’s purchases

Data Mining as a Threat to Privacy
- Data mining gives us “facts” that are not obvious to human analysts of the data
- Can general trends across individuals be determined without revealing information about individuals?
- Possible threats: Combine collections of data and infer information that is private
  - Disease information from prescription data
  - Military action from pizza delivery to the Pentagon
- Need to protect the associations and correlations between the data that are sensitive or private

Some Privacy Problems and Potential Solutions
- Problem: Privacy violations that result due to data mining
  - Potential solution: Privacy-preserving data mining
- Problem: Privacy violations that result due to the inference problem
  - Inference is the process of deducing sensitive information from the legitimate responses received to user queries
  - Potential solution: Privacy Constraint Processing
- Problem: Privacy violations due to un-encrypted data
  - Potential solution: Encryption at different levels
- Problem: Privacy violation due to poor system design
  - Potential solution: Develop methodology for designing privacy-enhanced systems

Some Directions: Privacy Preserving Data Mining
- Prevent useful results from mining
  - Introduce “cover stories” to give “false” results
  - Only make a sample of data available so that an adversary is unable to come up with useful rules and predictive functions
- Randomization
  - Introduce random values into the data and/or results
  - Challenge is to introduce random values without significantly affecting the data mining results
  - Give range of values for results instead of exact values
- Secure Multi-party Computation
  - Each party knows its own inputs; encryption techniques used to compute final results

Privacy Preserving Data Mining: Agrawal and Srikant (IBM)
- Value Distortion
  - Introduce a value Xi + r instead of Xi where r is a random value drawn from some distribution
    - Uniform, Gaussian
- Quantifying privacy
  - Introduce a measure based on how closely the original values of a modified attribute can be estimated
- Challenge is to develop appropriate models
  - Develop training set based on perturbed data
- Evolved from inference problem in statistical databases

Privacy Constraint Processing
- Privacy constraints processing
  - Based on prior research in security constraint processing
  - Simple constraint: An attribute of a document is private
  - Content-based constraint: If a document contains information about X, then it is private
  - Association-based constraint: Two or more documents taken together are private; individually each document is public
  - Release constraint: After X is released, Y becomes private
- Augment a database system with a privacy controller for constraint processing

Architecture for Privacy Constraint Processing
[Diagram; components as labeled on the slide:]
- User Interface Manager
- Constraint Manager, holding the Privacy Constraints
- Query Processor: constraints during query and release operations
- Update Processor: constraints during update operation
- Database Design Tool: constraints during database design operation
- DBMS
- Database

Semantic Model for Privacy Control
[Diagram: semantic network for Patient John with nodes Cancer, Influenza, John’s address and England, and links “Has disease”, “address” and “Travels frequently”. Dark lines/boxes contain private information.]

Data Mining and Privacy: Friends or Foes?
- They are neither friends nor foes
- Need advances in both data mining and privacy
- Need to design flexible systems
  - For some applications one may have to focus entirely on “pure” data mining while for some others there may be a need for “privacy-preserving” data mining
  - Need flexible data mining techniques that can adapt to the changing environments
- Technologists, legal specialists, social scientists, policy makers and privacy advocates MUST work together

Platform for Privacy Preferences (P3P): What is it?
- P3P is an emerging industry standard that enables web sites to express their privacy practices in a standard format
- The format of the policies can be automatically retrieved and understood by user agents
- It is a product of W3C, the World Wide Web Consortium: www.w3c.org
- Main difference between privacy and security
  - User is informed of the privacy policies
  - User is not informed of the security policies

Platform for Privacy Preferences (P3P): Key Points
- When a user enters a web site, the privacy policies of the web site are conveyed to the user
- If the privacy policies are different from user preferences, the user is notified
- User can then decide how to proceed

Platform for Privacy Preferences (P3P): Organizations
- Several major corporations are working on P3P standards including:
  - Microsoft, IBM, HP, NEC, Nokia, NCR
- Web sites have also implemented P3P
- Semantic web group has adopted P3P

Platform for Privacy Preferences (P3P): Specifications
- Initial version of P3P used RDF to specify policies
- Recent version has migrated to XML
- P3P policies use XML with namespaces for encoding policies
- Example: Catalog shopping
  - Your name will not be given to a third party but your purchases will be given to a third party

    <POLICIES xmlns="http://www.w3.org/2002/01/P3Pv1">
      <POLICY name = - - -
      </POLICY>
    </POLICIES>

Platform for Privacy Preferences (P3P): Specifications (Concluded)
- P3P has its own statements and data types expressed in XML
- P3P schemas utilize XML schemas
- XML is a prerequisite to understanding P3P
- P3P specification released in January 2005 uses catalog shopping example to explain concepts
- P3P is an international standard and is an ongoing project

P3P and Legal Issues
- P3P does not replace laws
- P3P works together with the law
- What happens if the web sites do not honor their P3P policies?
  - Then appropriate legal actions will have to be taken
- XML is the technology to specify P3P policies
- Policy experts will have to specify the policies
- Technologists will have to develop the specifications
- Legal experts will have to take actions if the policies are violated

Challenges and Discussion
- Technology alone is not sufficient for privacy
- We need technologists, policy experts, legal experts and social scientists to work on privacy
- Some well known people have said “Forget about privacy”
- Should we pursue working on privacy?
  - Interesting research problems
  - Interdisciplinary research
  - Something is better than nothing
  - Try to prevent privacy violations
  - If violations occur then prosecute
- Discussion?
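
Supplementary sketch: Value Distortion
The value distortion idea from the privacy preserving data mining slides (release Xi + r instead of Xi, with r drawn from a uniform or Gaussian distribution) can be illustrated with a few lines of Python. This is only a minimal sketch under stated assumptions, not the method as published by Agrawal and Srikant: the function name `distort`, the synthetic `ages` data, the spread of 15 and the fixed seed are all hypothetical choices made for illustration. It shows the trade-off the slides mention: individual values are masked, while an aggregate such as the mean is barely affected.

```python
import random
import statistics


def distort(values, spread, rng, distribution="uniform"):
    """Return a copy of values with independent random noise r added to each.

    uniform:  r is drawn from [-spread, +spread]
    gaussian: r is drawn with mean 0 and standard deviation spread
    (Hypothetical helper for illustration only.)
    """
    if distribution == "uniform":
        return [x + rng.uniform(-spread, spread) for x in values]
    if distribution == "gaussian":
        return [x + rng.gauss(0.0, spread) for x in values]
    raise ValueError(f"unknown distribution: {distribution}")


# Synthetic "sensitive" attribute: 10,000 ages between 20 and 70 (assumed data).
rng = random.Random(0)
ages = [rng.randint(20, 70) for _ in range(10_000)]

# Release Xi + r instead of Xi.
perturbed = distort(ages, spread=15.0, rng=rng)

# Individual records can move by up to 15 years in either direction ...
max_shift = max(abs(a - p) for a, p in zip(ages, perturbed))

# ... while the aggregate stays close, because the noise has mean zero.
mean_gap = abs(statistics.mean(ages) - statistics.mean(perturbed))

print(f"largest individual shift: {max_shift:.2f}")
print(f"gap between original and perturbed mean: {mean_gap:.3f}")
```

A wider spread hides individual values better but makes patterns harder to recover, which is the challenge the randomization slide points out.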