CSE 300
Data Mining & Cyberinfrastructures in Biomedical Informatics
Ryan McGivern
CSE5095
May 1, 2011

Main Concepts
- Data Mining
- Knowledge Discovery
- Cyberinfrastructures
- Collaborative Research

Nature of Biomedical Data
- Health care is more than numbers and readings
  - Can’t replace the subjective sense of disease severity that a physician has in moments
- Capture data in a way that best captures observation
  - Data representation
  - Precision

Review
- Medical datum
  - Any single observation of a patient
- Knowledge
  - Derived through formal/informal analysis of data
- Information
  - Combine knowledge with data for new information
- Heuristics and research models
- BMI Data-Knowledge Spectrum
  - What information constitutes the substance of medicine

Nature of Biomedical Data
- Knowledge at one level of abstraction might be considered data at another
- Medical database is a collection of individual patient observations
- EHR is in some sense simply a database
- Using historical patient data from the EHR system can facilitate the deduction of new knowledge related to health care strategies

- Humans can intuitively decompose information from a unitary view of data
- But nothing is intuitive to computational systems
- Example
  - Clinical setting
    - BP of 120/80 may suffice to indicate a normal reading
  - Analytical setting
    - Systolic BP = 120 mm Hg
    - Diastolic BP = 80 mm Hg

- Data mining in health is mainly related to Clinical Research Support
- Clinical Data Repositories (CDRs)
  - New knowledge learned through
    aggregated info from a large number of patients
  - Can be facilitated by EHRs
- Unfortunately
  - CDRs generally limited to admin data sources
  - Rarely store patient charts

- CDRs support Clinical Research Studies
- Retrospective studies
  - Investigate a hypothesis that was not a subject of the study at the time the data were collected
- Prospective studies
  - Clinical hypothesis known in advance
  - Research protocol designed to collect future data

- Knowledge base
  - Facts
  - Heuristics
  - Complex models
  - Semantic linking
  - Conduct case based problem solving
- Medical data is intrinsically heterogeneous
  - Illusory to conceive ‘complete medical dataset’
  - Data selective based on treatment

Data Mining in BMI
- Data mining
  - Knowledge discovery technique
  - Sophisticated statistical methods
  - Identify trend patterns hidden amongst the sheer size of the dataset
- Data warehouse
  - Multiple heterogeneous data sources
  - Organized under a unified schema
  - Single site
  - Facilitate management and decision making

- CDR is essentially a data warehouse
- Architecture consists of four tiers
  - External data sources
    - Operational databases, flat files, etc.
  - Data storage layer
    - Unified schema, metadata, data marts
  - OLAP layer
    - Data mining engine
  - Presentation layer
    - GUI
    - Usually web-based

Figure: Clinical Data Repository

- Data integration mechanism
  - Extraction
  - Transformation
  - Refresh
  - Scrubbing
- Data marts
  - Subsets of data tailored to a user group
  - Cache resultant datasets

- Data integration
  - Heterogeneous data under a unified schema
- Ontologies
  - Link primary data expressions to structured vocabularies
  - Data now available to search and algorithmic processing at different levels of abstraction
- Clinical domain
  - Notorious for overwhelming presence of natural language text
  - Natural language processing

- Data integration
  - Cancer Biomedical Informatics Grid (caBIG)
    - Seeks to integrate all cancer research data
    - Standardize the way by which data is acquired, formatted, processed, and stored
      - Whole data ‘life cycle’
    - Translational research
  - No common architecture among vocabularies
    - Therefore difficult to consolidate terms into a single system

- Communication
  - HL7
    - Communication standard for exchange of all information relevant to health care
    - Focuses on meta-level of data integration within a clinical setting

- Online Analytical Processing (OLAP) layer
  - Formats aggregated data in a multidimensional way
  - Evaluated and visualized at presentation layer
  - User specifies summary technique
- Data cube
  - Roll-up and drill-down operations
  - Control abstraction level for each data dimension
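
The roll-up and drill-down operations on a data cube can be illustrated with a minimal sketch. It assumes a tiny, invented fact table of encounter counts with three dimensions (year, quarter, department); the data, the dimension names, and the `aggregate` helper are hypothetical and are not taken from any CDR product. Rolling up drops a dimension and sums over it, while drilling down restores it.

```python
from collections import defaultdict

# Hypothetical fact table: (year, quarter, department, encounter_count).
# The values are invented purely for illustration.
FACTS = [
    (2010, "Q1", "cardiology", 120),
    (2010, "Q2", "cardiology", 135),
    (2010, "Q1", "oncology", 80),
    (2010, "Q2", "oncology", 95),
    (2011, "Q1", "cardiology", 140),
    (2011, "Q1", "oncology", 90),
]

DIMENSIONS = ("year", "quarter", "department")


def aggregate(facts, keep):
    """Sum the measure over every dimension that is not listed in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[i] for i in positions)
        totals[key] += row[-1]  # the last column is the measure
    return dict(totals)


# Drill-down view: finest abstraction level on every dimension.
by_quarter = aggregate(FACTS, ("year", "quarter", "department"))

# Roll-up: collapse quarters into years.
by_year = aggregate(FACTS, ("year", "department"))

# Roll-up again: collapse departments as well.
by_year_only = aggregate(FACTS, ("year",))

print(by_year)
print(by_year_only)
```

Each call picks the abstraction level per dimension, which is the control the OLAP layer exposes to the user; a real warehouse would precompute or cache these aggregates rather than rescan the fact table.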

- Data mining techniques
  - Descriptive methods
    - Mine for relationships among attribute types with as few variables as possible
  - Predictive methods
    - Iterate through attributes and classify data into predefined classes
    - Identify similar classes
  - Other related methods
    - Neural Networks
    - Machine Learning
  - Each provides a way of recognizing data patterns

- UWV & VCU (2006)
- Data mining research
- 667,00 digital records
- Duke University (1997)
- Perinatal outcomes
- 45,922 patient records
- Out-patient & in-patient
- De-identified
- HealthMiner® (IBM)
- CliniMiner®: association analysis
- THOTH: predictive analysis
- 215,626 encounters
- 3,898,887 lab results
- 217,453 procedures
- 3,016,313 physical findings
- SQL queries: average time 3 minutes, longest time 12 minutes
- 4 million records

- Challenges in mining biomedical data
  - Non-hypothesis driven approaches
    - Combinatorial explosion
    - Degree of non-reducibility
      - Minimize with sophisticated heuristics
  - High dimensionality
    - Sparse complex relationships
      - Spread thinly across many dimensions
  - Hypotheses
    - Limit inherent bias in traditional clinical data analysis

- Challenges in warehousing biomedical data
  - IT infrastructure for CDRs
    - Established for clinical trials but separated from EHR systems
  - Data integration
    - Map clinical terminologies to clinical research standards
  - Pseudonymization
    - De-identification is a ‘must’ when EHR leaves the realm of primary health care

- General road blocks
  - Data sharing
    - Researchers are protective of their data
  - Language/vocabulary
    changes
    - Due to required detail
    - Bedside vs. laboratory
  - Transdisciplinary research leads to competing standards

- Advantages of mining biomedical data
  - New health management strategies
    - Relationships among patient observations
    - Understanding of disease progression
  - Undetected drug events
    - Prevalence through larger sample populations
  - Clinical trial cohort selection
    - Identify patient types that will best prove a given hypothesis

Cyberinfrastructures in BMI
- Motivations
  - Computer systems are now more than essential to research
  - Development of complex modeling tools
    - But generally only available to a handful of clinical researchers
  - Integration of data from different disciplines
    - Can require specialized training in mathematics, statistics, and software
  - Ideally want to provide a layer of abstraction that can make this integration transparent to the researcher

- Mission
  - Develop a geographically distributed virtual research community that facilitates
    - Data sharing
      - Data warehousing
    - Computational resource sharing
      - Distributed grid computing
    - Collaboration
      - Research management
      - Research protocol sharing

- Components of a cyberinfrastructure
  - Data infrastructure
    - Series of interconnected repositories
  - Computational infrastructure
    - Registered resource sharing
  - Communication infrastructure
    - Communication amongst architectures
  - Human infrastructure
    - Facilitate communication and collaboration between registered researchers

- Data infrastructure
  - Network of databases
  - Facilitates remote storage, integration, and retrieval of data
  - Databases browsed by web-based front-ends
  - Can be extended to cater to
    - Automatic acquisition
    - Direct submission
  - Allows for pulling of data into local repositories
    - For private or semi-private analyses

- Computational infrastructure
  - Shared access to hardware and software
  - Intensive computation needed for sophisticated analyses
    - e.g. image analysis software
  - Essentially a computing grid
    - Systems separated geographically but clustered over the web
    - Provides a virtual consolidated supercomputing node
    - If a system is idle locally, it is raised as a resource for outsiders

- Communication infrastructure
  - At the low level
    - Require connectivity and acceptable bandwidth between
      - Repositories
      - Computational resources
      - Researcher
  - At the high level
    - Responsible for maintaining syntactic and semantic harmony throughout data

- Communication infrastructure continued
  - Syntax and semantics
    - Suppose analysis involves data from different repositories
    - Syntactic connectivity established through a common format for data organization
    - Semantic connectivity maintains data interoperability by ensuring concepts captured by the data share a common terminology
      - Usually implemented using an ontology

- Human infrastructure
  - Ultimately, must facilitate the sociology of science
  - Everyone curates communal data sets
  - Encourage the sharing of
    - Protocols
    - Analysis algorithms
    - Data sets
  - Similar to CICATS at UConn
    - Research toolkit

- Human infrastructure continued
  - Ideally researcher should be able to design experiment at a high level
    - Describe datasets, relationships, etc.
    - Generally high level description language
      - Workflow language
  - Infrastructure then manages data retrieval, analysis, and transformation
  - Constructs an environment where researchers can get an in-depth result from a high level description

- There are many existing cyberinfrastructures
  - Don’t necessarily implement all components
- Most common form is an online database
  - GenBank
  - EMBL
    - European Molecular Biology Lab
  - UniProt
    - Protein database
  - PDB
    - Protein data bank

- Online databases continued
  - But, these lack the components to facilitate
    - Collaboration
    - Interdisciplinary research
  - Use centralized resources and are generally managed by the owning research group
  - Data centric
    - Most of the computational architecture is dedicated solely to data access

- Community Annotation Hubs
  - Open up a centralized database to direct contribution from the research community
  - SDSU Gene Wiki
    - For the community annotation of gene function
  - BMI Wikis have been recognized as some of the most sophisticated document repositories
    - Despite being a relatively recent umbrella discipline
  - Still not a complete research environment
    - Could be ‘plugged in’ to the human infrastructure of a complete cyberinfrastructure

- Data sharing
  - Still difficult to share data on disparate information classes
    - Even if they are related through a subset of attribute types
  - Further
    difficulty of interconnecting similar repositories written by different research groups
    - Differing technologies
    - Differing data representations
  - Recurring difficulty in integrating data
    - Medical data is inherently heterogeneous
      - Massive amount of data types involved
      - Data is captured differently, because it’s used differently

- Data sharing challenge
  - As Dr. Kevin Sullivan said in the discussion
    - It is difficult for an institution to share their data
    - It can be difficult to argue a business case to do so
      - Institutions may not want people evaluating their treatments or incorrect treatments
      - Research groups get a sense of proprietary ownership over their data
      - Some institutions feel it is not theirs to share
      - Others are skeptical as to how the community would react to their health care provider exposing information to outsiders

- One interoperability solution is web services
  - Provide common technology for heterogeneous data and services to interoperate
  - Common implementations consist of
    - Web Services Description Language (WSDL)
      - Describes capabilities of services
    - Simple Object Access Protocol (SOAP)
  - Researchers never use services directly
    - But rely on the analysis and visualization engines that run on top of these

- Globus
  - Open source libraries
  - Industry heavyweight in web services for many domains
  - Provides mechanisms for
    - Announcing the availability of a computer resource
    - Discovering the resource
    - Invoking the resource
  - Used by BIRN and caBIG
- BioMOBY
  - Similar to Globus but relatively lightweight
  - Used by PlaNet Consortium

- Ontologies
  - Web services
    allow heterogeneous data and services to exchange
    - But this does not enforce data semantics
  - Ontologies are used to ensure an unambiguous standard for data
  - Most BMI cyberinfrastructures specify ontologies using OWL

- Biomedical Informatics Research Network (BIRN)
  - Developed a robust software installation & deployment system to implement a BIRN endpoint
    - Host data and contribute computational resources
    - Access shared datasets through web portal
    - Analysis and visualization tools
    - Publish datasets through BIRN data repository
    - Roughly $20k for a BIRN rack
  - Technologies
    - Globus: grid management
    - BIRNLex: ontology

- Cancer Biomedical Informatics Grid
  - Launched 2003
  - Mission
    - Provide a common information platform to support the diverse clinical and basic research of the US National Cancer Institute
      - 87 cancer institutes at the time
  - Highly heterogeneous datasets

- Future of BMI cyberinfrastructures
  - Use of cyberinfrastructure is growing rapidly
    - Grid computing is increasingly efficient
  - Current weaknesses related to cross-discipline collaboration
    - Each implements an internally consistent grid, but isolated from each other
    - We need integration and communication among disciplines to investigate further relationships
  - Data interoperability may be resolved by semantic web
    - Current research in cyberinfrastructures is related to using the semantic web concept

- Semantic web in cyberinfrastructures
  - Web services make strong distinction between data and data operations
    - User identifies service to invoke
    - Formats the input data
    - Invokes the service
    - Unpacks and interprets the results
  - Semantic web is a technology tolerant of diverse data models
    - No data transformation services
    - Just pieces of information and relationships between them
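
The idea of "pieces of information and relationships between them" can be sketched as a set of subject-predicate-object triples. The snippet below is only an illustration under stated assumptions: the identifiers (`patient:17`, `obs:1`), the predicate names, and the `match` and `observations` helpers are all hypothetical, and a real system would use an RDF store and an ontology such as one written in OWL. It reuses the 120/80 blood pressure reading from the earlier example.

```python
# Each fact is a (subject, predicate, object) triple.
# All identifiers and predicate names here are invented for illustration.
TRIPLES = {
    ("patient:17", "hasObservation", "obs:1"),
    ("obs:1", "hasType", "SystolicBP"),
    ("obs:1", "hasValueMmHg", 120),
    ("patient:17", "hasObservation", "obs:2"),
    ("obs:2", "hasType", "DiastolicBP"),
    ("obs:2", "hasValueMmHg", 80),
}


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        (
            t
            for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        ),
        key=str,  # stable order even though objects mix strings and ints
    )


def observations(triples, patient):
    """Follow relationships from a patient to typed observation values."""
    result = {}
    for _, _, obs in match(triples, s=patient, p="hasObservation"):
        kind = match(triples, s=obs, p="hasType")[0][2]
        value = match(triples, s=obs, p="hasValueMmHg")[0][2]
        result[kind] = value
    return result


bp = observations(TRIPLES, "patient:17")
print(bp)
```

No format conversion step appears between storing and querying: new kinds of facts are added as more triples, and queries simply follow relationships.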