Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Graph and Topological Structure Mining on Scientific Articles Fan Wang, Ruoming Jin, Gagan Agrawal and Helen Piontkivska The Ohio State University The Kent State University Presenter: Fan Wang The Ohio State University Outline • Introduction • Topological Structure Mining • Data Preprocessing and Graph Representations • Experiment Results and Pattern Analysis • Conclusion Introduction • Huge number of genes in literature • Associated with targeted disease or functionality • Finding interaction among genes manually – Time consuming – Error Prone Introduction CCL5 CCR5 CCL3 CCL5 T CCL4 CCL4 CCL3 T CD4 • Well-known relationship among chemokine ligands • Mining these relations from literature documents • Mining frequent patterns from graph datasets – Convenient representation – Lots of research in subgraph mining Introduction • Our Goal – Find commonly occurring interactions – Represent them visually • Capture the co-occurrence of scientific terms • Graph representation of scientific document • Mining frequent topological structures Outline • Introduction • Topological Structure Mining • Data Preprocessing and Graph Representations • Experiment Results and Pattern Analysis • Conclusion Topological Structure Mining • Disadvantages of subgraph mining – Exact matching – Missing potential patterns • Focusing on the topological relationship • Incorporating approximate matching Topological Structure Mining G Y X G is a subgraph of Y X is a (0,3) topological structure of Y Topological Structure Mining • Definition – Given a collection of graphs, two parameters l and h, and a threshold θ. A (l,h)-topological structure whose support is greater than or equal to θis called a frequent topological structure. • Given a set of graphs, in our KDD05 paper, an algorithm TSMiner finding frequent topological structures is implemented Our Work • Using topological structure mining • Challenges – How to create graphs? – What are the keywords? – How to insert edges into graphs? Outline • Introduction • Topological Structure Mining • Data Preprocessing and Graph Representations • Experiment Results and Pattern Analysis • Conclusion Data Preprocessing and Graph Representation • One graph for each document • Nodes are keywords of interest • Edges inserted based on occurrence of the keywords • Run topological structure mining algorithm Data Preprocessing • Four dictionaries of keywords – Short Dictionary • 321 genes expressed between prostate epithelial and stromal cells – Long Dictionary • 2600 human genes found in supperarray’s DNA microarray experiment – Confusion Dictionary • Gene names easily confused with ordinary words – GO Dictionary • GO terms (molecular function, biological process and cellular component) Graph Representations • Edge Construction Methods – Sentence-based Method • Two keywords in one sentence – Mutual Information Method • The mutual information of two keywords greater than a threshold – Sliding Window Method • Two keywords located within a sliding window with a predefined size Outline • Introduction • Topological Structure Mining • Data Preprocessing and Graph Representations • Experiment Results and Pattern Analysis • Conclusion Experiment Results • Focusing on articles containing at least one of the 5 genes – CCL5, TF, IGF1, MYLK, IGFBP3 • Generating graph for each article • Finding frequent topological structures Three Edge Construction Methods Number of Patterns Found, MYLK, sup=20% Number of Patterns Found, MYLK, sup=15% 1000 800 500 Sentence MI Window Total Number of Patterns Total Number of Patterns 1200 600 400 200 0 400 Sentence MI 300 Window 200 100 0 (0,0) (1,1) L and H values (2,2) (0,0) (1,1) L and H values (2,2) Three Edge Construction Methods Number of Patterns Found, IGFBP3, sup=20% Number of Patterns Found, IGFBP3, sup=15% 350 300 Sentence MI 250 Window 350 Total Number of Patterns Total Number of Patterns 400 200 150 100 50 300 250 200 Sentence MI Window 150 100 50 0 0 (0,0) (1,1) L and H values (2,2) (0,0) (1,1) L and H values (2,2) Three Edge Construction Methods Results • Sliding window method wins – Largest number of frequent patterns – Best scalability • Topological structure mining giving us more frequent patterns • Large number doesn’t mean high biological significance Pattern Analysis CCL4 CCL3 T CCR5 CCL3 T CCL4 PSMB6 CCL3 • CCBP2 CCL5 CCL5 CCL4 • ONLY be found by CCL5 CCL5 T CCL4 • T CCL3 CXCR4 PSMB6 Underlying Graph 1 PSMB6 Underlying Graph 2 topological structure mining ONLY be found by sliding window method Restoring nodes revealing interesting patterns Outline • Introduction • Topological Structure Mining • Data Preprocessing and Graph Representations • Experiment Results and Pattern Analysis • Conclusion Conclusion • Sliding window method is the best – The most number of frequent patterns – The highest quality of frequent patterns • Topological structures found corresponding well to known relationships • Topological mining being a very valuable tool for biological researchers Three Edge Construction Methods • Interestingness of Edges – Counting the number of distinct edges – Computing the average interestingness of edges for all patterns found by using each edge construction method