Customizing Gene Taggers
for BeeSpace
Jing Jiang
jiang4@uiuc.edu
March 9, 2005
Entity Recognition in BeeSpace
• Types of entities we are interested in:
– Genes
– Sequences
– Proteins
– Organisms
– Behaviors
– …
• Currently, we focus on genes
Mar 9, 05
BeeSpace
2
Input and Output
• Input: free text (w/ simple XML tags)
– <?xml version="1.0" encoding="UTF-8"?><Document
id="1">…We have cloned and sequenced a cDNA encoding
Apis mellifera ultraspiracle (AMUSP) and examined its
responses to JH. …</Document>
• Output: tagged text (XML format)
– <?xml version="1.0" encoding="UTF-8"?><Document id="1">
…<Sent><NP>We</NP> have <VP>cloned</VP> and
<VP>sequenced</VP> <NP>a cDNA encoding <Gene>Apis
mellifera ultraspiracle</Gene>
(<Gene>AMUSP</Gene>)</NP> and <VP>examined</VP> <NP>its
responses to JH</NP>.</Sent>…</Document>
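The input-to-output step can be sketched in a few lines. This is a toy illustration only: it wraps mentions from a tiny hand-made lexicon in <Gene> tags, whereas the actual taggers discussed below learn gene names statistically rather than matching a fixed list.

```python
import re

# Hypothetical toy lexicon standing in for a trained tagger.
GENE_NAMES = ["Apis mellifera ultraspiracle", "AMUSP"]

def tag_genes(text):
    """Wrap each known gene mention in <Gene>...</Gene> tags."""
    # Longest names first, so multi-word mentions win over substrings.
    for name in sorted(GENE_NAMES, key=len, reverse=True):
        text = re.sub(re.escape(name), f"<Gene>{name}</Gene>", text)
    return text

doc = "We have cloned and sequenced a cDNA encoding Apis mellifera ultraspiracle (AMUSP)."
print(tag_genes(doc))
```

A real system must also handle names it has never seen, which is exactly why dictionary matching alone is insufficient (see the challenges on the next slide).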
Challenges
• No complete gene dictionary
• Many variations:
– Acronyms: hyperpolarization-activated ion
channel (Amih)
– Synonyms: octopamine receptor (oa1, oar,
amoa1)
– Common English words: at (arctops), by (3R-B)
• Different genes or gene and protein may
share the same name/symbol
Automatic Gene Recognition:
Characteristics of Gene Names
• Capitalization (especially acronyms)
• Numbers (gene families)
• Punctuation: -, /, :, etc.
• Context:
– Local: surrounding words such as “gene”,
“encoding”, “regulation”, “expressed”, etc.
– Global: same noun phrase occurs several times
in the same article
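The surface characteristics above translate directly into boolean features per token. A minimal sketch (feature names are my own, chosen for illustration):

```python
def orthographic_features(token):
    """Surface features of the kind listed above: capitalization,
    digits (gene families), and punctuation such as - / :."""
    return {
        "init_cap": token[:1].isupper(),
        "all_caps": token.isupper() and len(token) > 1,
        "has_digit": any(c.isdigit() for c in token),
        "has_punct": any(c in "-/:" for c in token),
    }

print(orthographic_features("AMUSP"))  # acronym: all caps
print(orthographic_features("oa1"))    # gene-family name: contains a digit
print(orthographic_features("3R-B"))   # digits plus a hyphen
```

Context features (surrounding words like "gene" or "encoding", and repeated noun phrases in the same article) would be computed over the sentence and document, not the token alone.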
Existing Tools
• KeX (Fukuda)
– Based on hand-crafted rules
– Recognizes proteins and other entities
– Relies on manual effort; not easy to modify
• ABNER & YAGI (Settles)
– Based on conditional random fields (CRFs) to
learn the “rules”
– ABNER identifies and classifies different
entities including proteins, DNAs, RNAs, cells
– YAGI recognizes genes and gene products
– Not trainable by the end user
Existing Tools (cont.)
• LingPipe (Alias-i, Inc.)
– Uses a generative statistical model based on
word trigrams and tag bigrams
– Can be trained
– Has two trained models
• Others
– NLProt (SVM)
– AbGene (rule-based)
– GeneTaggerCRF (CRFs)
Comparison of Existing Tools
• Performance on a few manually annotated,
public data sets (protein names):
– GENIA (2000 abstracts on “human & blood cell
& transcription factor”)
– Yapex (99 abstracts on “protein binding &
interaction & molecular”)
– UTexas (750 abstracts on “human”)
• Performance on a honeybee sample data
set:
– BIOSIS search for "Apis mellifera gene"
Comparison of Existing Tools
(cont.)
            GENIA         Yapex         UTexas
KeX         P: 0.3644     P: 0.3451     P: 0.1775
            R: 0.4191     R: 0.3931     R: 0.3445
            F1: 0.3898    F1: 0.3675    F1: 0.2343
ABNER       P: 0.7876     P: 0.4351     P: 0.3916
            R: 0.7485     R: 0.4441     R: 0.4314
            F1: 0.7675    F1: 0.4396    F1: 0.4105
LingPipe    P: 0.9298     P: 0.4168     P: 0.3633
            R: 0.7388     R: 0.4619     R: 0.3918
            F1: 0.8234    F1: 0.4382    F1: 0.3770
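The F1 figures are the harmonic mean of precision and recall; a quick spot-check confirms the table's cells are internally consistent:

```python
def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Spot-check a few cells of the table above (GENIA column).
assert abs(f1(0.3644, 0.4191) - 0.3898) < 5e-4   # KeX
assert abs(f1(0.7876, 0.7485) - 0.7675) < 5e-4   # ABNER
assert abs(f1(0.9298, 0.7388) - 0.8234) < 5e-4   # LingPipe
```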
Comparison of Existing Tools
(cont.)
• KeX on honeybee data
– False positives: company name, country name, etc.
– Does not differentiate between genes, proteins, and
other chemicals
• YAGI on honeybee data
– False negatives: occurrences of the same gene name are
not all tagged
– Errors in entity typing and boundary detection
• LingPipe on honeybee data
– Similar to YAGI
Lessons Learned
• Machine learning methods outperform hand-crafted rule-based systems
• Machine learning methods suffer from over-fitting
• Existing tools need to be customized for
BeeSpace
– LingPipe is a good choice
• There is still room for better feature selection
– E.g., global context
Customization
• Train LingPipe on a better training data set
– Use fly (Drosophila) genes
– F1 increased from 0.2207 to 0.7226 on held-out fly data
– Tested on honeybee data: results
• Some gene names are learned (Record 13)
• Some false positives are removed (proteins, RNAs)
• Some false positives are introduced
– The noisy training data can be further cleaned
• E.g., exclude common English words
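One simple cleaning step, as suggested above, is a stop-list filter that drops candidate gene names which are ordinary English words (recall the ambiguous names "at" and "by" from the Challenges slide). A minimal sketch with a hypothetical stop list:

```python
# Hypothetical stop list; a real one would be much larger.
COMMON_WORDS = {"at", "by", "to", "in", "and", "the"}

def clean_candidates(candidates):
    """Drop candidate gene names that are common English words."""
    return [c for c in candidates if c.lower() not in COMMON_WORDS]

print(clean_candidates(["Amih", "at", "oa1", "by"]))  # → ['Amih', 'oa1']
```

The trade-off is that genuinely ambiguous gene names (like the fly gene "at") are filtered out too, so this trades some recall for precision.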
Customization (cont.)
• Exploit more features such as global
context
– Occurrences of the same word/phrase should
be tagged all positive or all negative
• Differentiate between domain-independent
features and domain-specific features
– E.g., prefix “Am” is domain-specific for Apis
mellifera
– Features can be weighted based on their
contribution across domains
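The global-context idea above can be enforced as a post-processing step: take a majority vote over all occurrences of the same surface form within an article and retag the minority. A sketch (tokens and labels are illustrative):

```python
from collections import Counter

def enforce_global_consistency(tokens, labels):
    """Majority vote per surface form: tag all occurrences of a
    word/phrase in one article the same way."""
    votes = {}
    for tok, lab in zip(tokens, labels):
        votes.setdefault(tok, Counter())[lab] += 1
    return [votes[tok].most_common(1)[0][0] for tok in tokens]

# "AmOA1" was tagged gene twice and non-gene once; the vote fixes it.
tokens = ["AmOA1", "binds", "AmOA1", "AmOA1"]
labels = ["gene", "non-gene", "non-gene", "gene"]
print(enforce_global_consistency(tokens, labels))
```

In a learned model the same constraint would instead enter as a global feature, but the post-hoc vote conveys the intuition.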
Maximum Entropy Model
for Gene Tagging
• Given an observation (a token or a noun phrase),
together with its context, denoted as x
• Predict y ∈ {gene, non-gene}
• Maximum entropy model:
P(y|x) = K exp(Σᵢ λᵢ fᵢ(x, y)), where K = 1/Z(x) normalizes over y
• Typical fᵢ:
– y = gene & candidate phrase starts with a capital letter
– y = gene & candidate phrase contains digits
• Estimate λᵢ with training data
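A minimal sketch of the model's prediction step, assuming the weights λᵢ have already been estimated (the feature names and weight values here are illustrative, not learned):

```python
import math

def maxent_prob(features, weights):
    """P(y|x) ∝ exp(Σ_i λ_i f_i(x, y)) over y ∈ {gene, non-gene}.
    `features` lists the feature names active for this observation x;
    `weights` maps (feature, label) pairs to λ values."""
    scores = {}
    for y in ("gene", "non-gene"):
        s = sum(weights.get((name, y), 0.0) for name in features)
        scores[y] = math.exp(s)
    z = sum(scores.values())  # the normalizer Z(x); K = 1/Z(x)
    return {y: s / z for y, s in scores.items()}

# Illustrative weights: capitalization and digits favor the gene label.
weights = {("init_cap", "gene"): 1.2, ("has_digit", "gene"): 0.8}
p = maxent_prob(["init_cap", "has_digit"], weights)
print(p["gene"] > p["non-gene"])  # True
```

Training (estimating the λᵢ) would maximize the conditional likelihood of the annotated data, e.g. by iterative scaling or gradient methods.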
Plan: Customization with Feature
Adaptation
• λᵢ: trained on large set of data in domain A
(e.g., human or fly)
• μᵢ: trained on small set of data in domain B
(e.g., bee)
• λᵢ′ = αᵢ·λᵢ + (1 − αᵢ)·μᵢ: used for domain B
• αᵢ: based on how useful fᵢ is across
different domains
– Large αᵢ if fᵢ is domain-independent
– Small αᵢ if fᵢ is domain-specific
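The per-feature interpolation is a one-liner once the two weight sets and the mixing coefficients exist; the values below are illustrative, not trained:

```python
def adapt_weights(lam_a, lam_b, alpha):
    """Interpolate per feature i: combined = α_i·(domain-A weight)
    + (1 − α_i)·(domain-B weight). α_i should be near 1 for
    domain-independent features, near 0 for domain-specific ones."""
    return {i: alpha[i] * lam_a[i] + (1 - alpha[i]) * lam_b[i]
            for i in lam_a}

lam_a = {"init_cap": 1.2, "prefix_Am": 0.1}   # large domain-A set (fly)
lam_b = {"init_cap": 1.0, "prefix_Am": 0.9}   # small domain-B set (bee)
alpha = {"init_cap": 0.9, "prefix_Am": 0.2}   # high = domain-independent
print(adapt_weights(lam_a, lam_b, alpha))
```

Capitalization, being domain-independent, stays close to its well-estimated fly value, while the bee-specific "Am" prefix feature leans on the bee data.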
Issues to Discuss
• Definition of gene names:
– Gene families? (e.g., cb1 gene family)
– Entities with a gene name? (e.g., Ks-1
transcripts)
• Difference between genes and proteins?
– E.g., “CREB (cAMP response element binding
protein)” and “AmCREB”?
• How to evaluate the performance on
honeybee data?
The End
• Questions?
• Thank You!