Taming EHR Data Using Semantic Similarity to Reduce Dimensionality

Jim Weatherall, PhD
Head, Advanced Analytics Centre, AstraZeneca
Visiting Lecturer, School of Computer Science, University of Manchester

14th World Congress on Medical & Health Informatics, August 2013, Copenhagen

On behalf of the authors:
Leila Kalankesh, School of Computer Science, UoM
James Weatherall, AstraZeneca
Thamer Ba-Dhfari, School of Computer Science, UoM
Iain Buchan, Institute of Population Health, UoM
Andy Brass, School of Computer Science, UoM

J.Weatherall | August 2013 | Biometrics & Information Sciences | GMD

Introduction
Problems with mining healthcare data

• Large collections not easily visualised or interpreted
• Research not primary purpose for collection
• 100s of 1000s of codes
• 10s of 1000s of dimensions

Read Code   Rubric
C10F.       Type II Diabetes Mellitus
1372.       Trivial smoker < 1 cig/day
bd3j.       Prescription of “Atenolol 25mg tablets”
G20.        Essential hypertension
2469.       Measurement of Diastolic Blood Pressure
246A.       Assessment of Diastolic Blood Pressure

Data
The Salford Integrated Record (SIR)

• Population ~220,000
• Integrated primary and secondary care information
• Individual Read Code entries captured in primary care information systems
• Codes for diagnosis
• Codes for procedures
• All clinical transactions in primary care and some in secondary care
• Data extract for this analysis based on:
  • GP data in date range 2003-2009
  • Containing 136M Read code entries
  • Selected 24K patients with chronic conditions
  • Containing 443K Read code entries

Methods
Semantic Similarity

How alike are the meanings of two terms?
• Measure ontological distance?
• Measure depth? Or not?
(Figure from Sanchez, J.Biomed.Inform, 2011)

Methods
Semantic Similarity: which method?

An ontology of methods!
[Figure: taxonomy of semantic similarity methods. Node labels: Semantic Similarity Method, Ontological, Corpus-based, Combined, Node-based, Edge-based, Hybrid, Frequency, Context, Proximity]

Semantic similarity calculation
The Resnik measure

1. P(c) = \frac{\sum_{c' \in \mathrm{codes}(c)} \mathrm{count}(c')}{N}
   Term probability, based on frequency, including descendants and annotations

2. IC(c) = -\log P(c)
   Log transformation, gives “Information Content”

3. \mathrm{sim}_{Res}(c_1, c_2) = IC(c_{MICA})
   IC of “Most Informative Common Ancestor” gives similarity measure

P. Resnik, “Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language”, J Artif Intell Res, 1999

Analysis Plan
Stepwise approach to dimensionality reduction

1. Map patient records from diagnosis space into a similarity space
2. Map patient records into a low-dimensional vector space via PCA
3. Project patient records onto low-dimensional vector space and cluster patients by similarity

Analysis: Step 1
Mapping from diagnosis space to similarity space

      p1            p2            …    pn
p1    sim(p1,p1)    sim(p1,p2)    …    sim(p1,pn)
p2    sim(p2,p1)    sim(p2,p2)    …    sim(p2,pn)
…     …             …             …    …
pn    sim(pn,p1)    sim(pn,p2)    …    sim(pn,pn)

“The Similarity Matrix”
pi = patient i
sim(pi,pj) = similarity score between patients i and j

Analysis: Steps 2 + 3
PCA on the similarity matrix, visualisation & clustering

Natural co-morbidity: Diabetes is a risk factor for angina due to its accelerating effect on atherosclerosis

Discussion & Conclusion
Review & Outlook

• Patients with similar diagnosis codes are grouped together
• Therefore, the semantic similarity technique works, to some degree
• Therefore, this is a viable route to dimensionality reduction in complex healthcare data sets
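
The three steps of the Resnik measure given earlier (term probability, Information Content, IC of the Most Informative Common Ancestor) can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for illustration: the taxonomy in `PARENT`, the code names, and the counts in `COUNT` are hypothetical toy data, not Read Codes and not figures from the Salford Integrated Record.

```python
import math

# Hypothetical toy taxonomy (child -> parent). NOT the real Read Code hierarchy.
PARENT = {
    "diabetes_t1": "diabetes",
    "diabetes_t2": "diabetes",
    "diabetes": "endocrine",
    "hypertension": "circulatory",
    "angina": "circulatory",
    "endocrine": "root",
    "circulatory": "root",
}

# Made-up number of records annotated directly with each code.
COUNT = {
    "diabetes_t1": 10,
    "diabetes_t2": 30,
    "diabetes": 5,
    "hypertension": 40,
    "angina": 15,
}


def ancestors(code):
    """Return the code itself followed by all of its ancestors up to the root."""
    chain = [code]
    while code in PARENT:
        code = PARENT[code]
        chain.append(code)
    return chain


def probabilities():
    """Step 1: P(c) = (count of c plus all its descendants) / N."""
    total = sum(COUNT.values())
    cumulative = {}
    for code, n in COUNT.items():
        # Each annotation also counts towards every ancestor of its code.
        for node in ancestors(code):
            cumulative[node] = cumulative.get(node, 0) + n
    return {node: n / total for node, n in cumulative.items()}


def information_content(p):
    """Step 2: IC(c) = -log P(c)."""
    return {node: -math.log(prob) for node, prob in p.items()}


def resnik(c1, c2, ic):
    """Step 3: IC of the most informative common ancestor of c1 and c2."""
    common = set(ancestors(c1)) & set(ancestors(c2))
    return max(ic[node] for node in common)


ic = information_content(probabilities())
print(resnik("diabetes_t1", "diabetes_t2", ic))  # share the specific ancestor "diabetes"
print(resnik("diabetes_t2", "angina", ic))       # share only the root
```

In this toy setup, two codes that meet only at the root get a similarity of zero, because the root has probability 1 and therefore no information content.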
• Transferability of method?
• Population level characterisation?
• New biomedical hypotheses?
• Exploring comorbidity and cotreatment effects?
• New data mining paradigms?
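
Steps 2 and 3 of the analysis plan (PCA on the similarity matrix, then clustering in the low-dimensional space) could look roughly like the sketch below. It is not the authors' implementation: the 6x6 similarity matrix is made-up toy data, the function names `pca_project` and `two_means` are hypothetical, and the simple two-cluster k-means stands in for whatever clustering method was actually used, which the slides do not specify.

```python
import numpy as np

# Made-up symmetric patient similarity matrix: patients 0-2 resemble each
# other, patients 3-5 resemble each other, and the two groups are dissimilar.
sim = np.array([
    [1.00, 0.90, 0.80, 0.10, 0.10, 0.10],
    [0.90, 1.00, 0.85, 0.10, 0.10, 0.10],
    [0.80, 0.85, 1.00, 0.10, 0.10, 0.10],
    [0.10, 0.10, 0.10, 1.00, 0.70, 0.75],
    [0.10, 0.10, 0.10, 0.70, 1.00, 0.90],
    [0.10, 0.10, 0.10, 0.75, 0.90, 1.00],
])


def pca_project(matrix, n_components=2):
    """Treat each row as a patient's feature vector and project it onto
    the top principal components, computed via SVD of the centred data."""
    centered = matrix - matrix.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def two_means(points, n_iter=20):
    """Very small k-means with k=2 and a deterministic start: the first
    point and the point farthest from it. Assumes neither cluster ever
    becomes empty, which holds for this toy data."""
    far = np.argmax(np.linalg.norm(points - points[0], axis=1))
    centers = np.stack([points[0], points[far]])
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        centers = np.stack([points[labels == k].mean(axis=0) for k in range(2)])
    return labels


coords = pca_project(sim)   # one 2-D point per patient
labels = two_means(coords)  # cluster index per patient
print(coords.shape, labels.tolist())
```

Using rows of the similarity matrix as feature vectors is one common way to apply PCA to a similarity matrix; other choices (for example, eigendecomposition of a double-centred matrix) are equally plausible readings of the slide.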