Information Extraction from Literature
Yue Lu
BeeSpace Seminar, Oct 24, 2007

Outline
- Overview of BeeSpace v4
- Entity Recognition
- Relation Extraction

Overview
- BeeSpace V4
  - deeper semantic base than the current v3 system
  - entities and relations vs. mutual information
- Four levels
  - Level 1: Entity Recognition
  - Level 2: Entity Association Mining
  - Level 3: Relation Extraction
  - Level 4: Inference and Hypothesis Generation

Overview
- Level 1: Entity Recognition (detailed later)
- Level 2: Entity Association Mining
  - Suppose entities are properly tagged
  - Utilize the co-occurrence patterns of entities to extract semantics
  - e.g. a bee biologist may want to know which genes are important for foraging behavior
  - Similar to the TREC Genomics 2007 task

TREC Genomics 2007
- e.g. "Which [PATHWAYS] are possibly involved in the disease ADPKD?"
- currently only retrieval techniques
  - Gene synonym expansion
  - Conjunctive query interpretation
  - User relevance feedback
- tagged entities definitely would help

Overview
- Level 3: Relation Extraction
  - Goal is to extract the relations between entities
  - Generally requires entities to be properly tagged first
  - Detailed later
- Level 4: Inference and Hypothesis Generation
  - Inference on knowledge base
  - Graph mining

Outline
- Overview of BeeSpace v4
- Entity Recognition
- Relation Extraction

Entity Recognition
- Gene
- Example: Although <GENE>mxp</GENE> and <GENE>Pb</GENE> display very similar expression patterns, <GENE>pb</GENE> null embryos develop normally

Entity Recognition
- Anatomy
- Example: In normal embryos, mxp is expressed in the <ANATOMY>maxillary</ANATOMY> and <ANATOMY>labial</ANATOMY> segments, whereas ectopic expression is observed in some GOF variants.

Entity Recognition
- Biological process
- Example: Amongst these are the Bicoid, the Nanos, and the terminal class gene products, some of which are oncoproteins involved in signal transduction for <BIOLOGICAL PROCESS>the formation of terminal structures in the embryo</BIOLOGICAL PROCESS>.
Entity Recognition
- Pathways
- Example: Several signal transduction pathways have been described in Drosophila, and this review explores the potential of oncogene studies using one of those pathways - <PATHWAY>the terminal class signal transduction pathway</PATHWAY> - to better understand the cellular mechanisms of protooncogenes that mediate cellular responses in vertebrates including humans

Entity Recognition
- Protein family
- Example: While non-arthropod orthologs have been found for many Drosophila eye developmental genes, this has not been the case for the glass (gl) gene, which encodes a <PROTEIN FAMILY>zinc finger transcription factor</PROTEIN FAMILY> required for photoreceptor cell specification, differentiation, and survival.

Entity Recognition
- CRE (cis-regulatory elements)
- Example: A synthetic, 23-bp <CRE>ecdysterone regulatory element (EcRE)</CRE>, derived from the upstream region of the Drosophila melanogaster hsp27 gene, was inserted adjacent to the herpes simplex virus thymidine kinase promoter fused to a bacterial gene for chloramphenicol acetyltransferase (CAT).

Entity Recognition
- Phenotype
- Definition: a set of observable physical characteristics of an individual organism
- Example: Fog, dumpy

Entity Recognition
- Class 1: Small Variation (Dictionary/Ontology)
  - Organism, Anatomy, Biological Process, Pathway, Protein Family
- Class 2: Medium Variation
  - Gene, cis-Regulatory Element
- Class 3: Large Variation
  - Phenotype, Behavior

Entity Recognition
- Generally can be defined as a classification problem
- Boils down to feature definition
  - Class 1: matching a word in the Dictionary/Ontology
  - Class 2: prefix/suffix of the word, POS tags, …
  - Class 3: ?

Entity Recognition
- First focus on Class 1
  - Relatively simple
  - Class 2 and Class 3 need training examples
  - Useful in entity association mining
  - Useful in facilitating extraction of many interesting relations
- Related work: Textpresso

Textpresso
- Input: full text C. elegans literature
- Output: tagged XML format
- Defined a Textpresso ontology
  - First category is biological entities
  - manually curated a lexicon of names
- Implemented by PERL regular expressions
- We could reuse some of the regular expressions

Entity Recognition
- Resources:
  - Organism, Anatomy: Entrez gene table, Textpresso, BeeSpace DB, FlyBase
  - Biological Process, Cellular Component, Molecular Function: Textpresso
  - Pathway: KEGG
  - Protein Family: PDB, NCBI

Outline
- Overview of BeeSpace v4
- Entity Recognition
- Relation Extraction

Relation Extraction
- Expression Location
  - the expression of a gene in some location (tissues, body parts)
- Homology/Orthology
  - one gene is homologous to another gene

Relation Extraction
- Biological process
  - one gene has some role in a biological process
- Genetic/Physical/Regulatory Interaction
  - one gene interacts with another gene in a certain fashion (3 types of relations)
  - a simple case: Protein-Protein Interaction (PPI)

Relation Extraction
- Generally can be defined as a classification problem, which requires training data
- Domain adaptation?
- an example of PPI

PPI
- Problem Definition:
  - Gene/protein names are already tagged
  - A known list of interaction words (133 words)
  - classify each tuple (p1, p2, interWord) in one single sentence

PPI
- Methods
  - Learning algorithm: Maximum Entropy
  - Context features
    - e.g. lexical forms, POS tags …
    - Less dependent on domain
  - “Extracting protein-protein interactions using simple contextual features training data”, BioNLP Workshop on HLT-NAACL 06

PPI
- Training/Testing data:
  - BioCreative
  - 1000 hand labeled sentences, 3964 tuples
  - 5-fold cross validation
- Performance
  - avgpr = 47.14624
  - avgre = 43.97337
  - avgf1 = 45.35523

PPI
- Training data:
  - BioCreative
  - 1000 hand labeled sentences, 3964 tuples
- Testing data (different domain)
  - Bee collection
- Performance (judged by Moushumi)
  - Total number of tuples extracted as PPI instances: 92
  - Precision: 63%

PPI Misclassification examples
- Type 1: No interaction
- Sentence: Pretreatment of platelet suspension with phospholipase A2 from N. naja atra or A. mellifera venom (50 µg/ml) inhibited platelet aggregation induced by sodium arachidonate or collagen, but not induced by thrombin or ionophore A-23187.
- False: (collagen, thrombin, induced)
- True: relation between protein and platelet aggregation; no PPI

PPI Misclassification examples
- Type 2: Incorrect interaction word
- Sentence: IgG antibody was able to inhibit binding of IgE antibody in the PLA radioallergosorbent test (RAST) from 10-40% at a molar excess of 10- to 1000-fold.
- False: (IgG antibody, IgE antibody, binding)
- True: (IgG antibody, IgE antibody, inhibit)

PPI Misclassification examples
- Type 3: Incorrect protein involved
- Sentence: AChE exhibits a butyrylcholinesterase (BuChE) activity that represents about 14% of AChE activity.
- False: (AChE, AChE, exhibits)
- True: (AChE, BuChE, exhibits)

PPI
- Possible Improvement
  - syntactic patterns: “Optimizing syntax-patterns for discovering protein-protein interactions”, In Proc ACM Symposium on Applied Computing, SAC, Bioinformatics Track
  - parse tree
  - dependency parsing …

The End
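
Illustration for the Entity Recognition slides: Class 1 entities are described as those that can be recognized by matching a word against a dictionary or ontology, and Textpresso is noted as doing this with Perl regular expressions. The sketch below shows the same idea in Python rather than Perl. The lexicon contents and the names `LEXICON`, `build_pattern`, and `tag_entities` are assumptions made for illustration; they are not taken from BeeSpace or Textpresso.

```python
import re

# Hypothetical toy lexicon (assumption). A real system would load terms
# from a curated resource instead of hard-coding them.
LEXICON = {
    "ANATOMY": ["maxillary", "labial"],
    "PATHWAY": ["terminal class signal transduction pathway"],
}


def build_pattern(lexicon):
    """Map each lowercased term to its entity type and compile one regex.

    Terms are sorted longest first so that a multi-word term wins over
    any shorter term it contains.
    """
    term_to_type = {}
    for entity_type, terms in lexicon.items():
        for term in terms:
            term_to_type[term.lower()] = entity_type
    ordered = sorted(term_to_type, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    pattern = re.compile(r"\b(" + alternation + r")\b", re.IGNORECASE)
    return pattern, term_to_type


def tag_entities(text, lexicon=LEXICON):
    """Wrap every lexicon match in <TYPE>...</TYPE> tags."""
    pattern, term_to_type = build_pattern(lexicon)
    if not term_to_type:
        return text  # nothing to match against

    def wrap(match):
        entity_type = term_to_type[match.group(1).lower()]
        return f"<{entity_type}>{match.group(1)}</{entity_type}>"

    return pattern.sub(wrap, text)


sentence = "In normal embryos, mxp is expressed in the maxillary and labial segments."
print(tag_entities(sentence))
# In normal embryos, mxp is expressed in the <ANATOMY>maxillary</ANATOMY> and <ANATOMY>labial</ANATOMY> segments.
```

Exact matching of this kind only covers the small-variation classes. The gene name `mxp` in the sentence is left untagged, which is consistent with the slides placing genes in Class 2, where features and training examples are needed.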
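
Illustration for the PPI problem definition: each tuple (p1, p2, interWord) within a single sentence is a candidate to be classified. The sketch below only enumerates those candidates; it does not implement the Maximum Entropy classifier. The four interaction words are an assumed stand-in for the 133-word list, which the slides do not reproduce, and the function name, whitespace tokenization, and shortened sentence are likewise assumptions.

```python
from itertools import combinations

# Assumed stand-in for the 133-word interaction list (not given in the slides).
INTERACTION_WORDS = {"inhibit", "binding", "exhibits", "induced"}


def candidate_tuples(tokens, protein_indices, interaction_words=INTERACTION_WORDS):
    """Enumerate (p1, p2, interWord) candidates for one tokenized sentence.

    tokens          -- list of word strings for a single sentence
    protein_indices -- positions in `tokens` already tagged as proteins
    """
    found_words = [tok for tok in tokens if tok.lower() in interaction_words]
    candidates = []
    # Every unordered pair of protein mentions, in sentence order.
    for i, j in combinations(sorted(protein_indices), 2):
        for word in found_words:
            candidates.append((tokens[i], tokens[j], word))
    return candidates


# Shortened form of the Type 2 misclassification sentence.
tokens = "IgG was able to inhibit binding of IgE".split()
proteins = [0, 7]  # positions of "IgG" and "IgE"

for candidate in candidate_tuples(tokens, proteins):
    print(candidate)
# ('IgG', 'IgE', 'inhibit')
# ('IgG', 'IgE', 'binding')
```

Both candidates would be passed to the classifier, which has to accept the `inhibit` tuple and reject the `binding` one. This is the distinction that the Type 2 misclassification example gets wrong.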