* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download No Slide Title
Therapeutic gene modulation wikipedia , lookup
Genome evolution wikipedia , lookup
Designer baby wikipedia , lookup
Microevolution wikipedia , lookup
Protein moonlighting wikipedia , lookup
Site-specific recombinase technology wikipedia , lookup
History of genetic engineering wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
PathoLogic Pathway Predictor
SRI International
Bioinformatics
Inference of Metabolic Pathways
Annotated Genomic
Sequence
Pathway/Genome
Database
Gene Products
Pathways
Genes/ORFs
DNA Sequences
Multi-organism Pathway
Database (MetaCyc)
Pathways
Reactions
PathoLogic
Software
Integrates genome and
pathway data to identify
putative metabolic
networks
Compounds
Gene Products
Genes
Reactions
Genomic Map
Compounds
PathoLogic Functionality
Initialize
SRI International
Bioinformatics
schema for new PGDB
Transform existing genome to PGDB form
Infer metabolic pathways and store in PGDB
Infer operons and store in PGDB
Assemble Overview diagram
Assist user with manual tasks
Assign enzymes to reactions they catalyze
Identify false-positive pathway predictions
Build protein complexes from monomers
Infer transport reactions
SRI International
Bioinformatics
PathoLogic Input/Output
Inputs:
File listing genetic elements
http://bioinformatics.ai.sri.com/ptools/genetic-elements.dat
Files containing DNA sequence for each genetic element
Files containing annotation for each genetic element
MetaCyc database
Output:
Pathway/genome database for the subject organism
Reports that summarize:
Evidence contained in the input genome for the presence of reference
pathways
Reactions missing from inferred pathways
PathoLogic Analysis Phases
SRI International
Bioinformatics
Trial parsing of input data files [few days]
Initialize schema of new PGDB [3 min]
Create DB objects for replicons, genes, proteins [5 min]
Assign enzymes to reactions they catalyze
ferrochelatase
[10 min / 1 week]
glutamate 1-semialdehyde 2,1-aminomutase
porphobilinogen deaminase
A
E1
B
E2
C
D
E
G
F
PathoLogic Analysis Phases
SRI International
Bioinformatics
From
assigned reactions, infer what pathways are
present
[5 min / few days]
Define
metabolic overview diagram
Define
protein complexes
[30 min]
[few days]
genetic-elements.dat
ID
TEST-CHROM-1
NAME Chromosome 1
TYPE :CHRSM
CIRCULAR?
N
ANNOT-FILE
chrom1.pf
SEQ-FILE
chrom1.fsa
//
ID
TEST-CHROM-2
NAME Chromosome 2
CIRCULAR?
N
ANNOT-FILE
/mydata/chrom2.gbk
SEQ-FILE
/mydata/chrom2.fna
//
SRI International
Bioinformatics
SRI International
Bioinformatics
File Naming Conventions
One
pair of sequence and annotation files for
each genetic element
Sequence
files: FASTA format
suffix fsa or fna
Annotation
file:
Genbank format: suffix .gbk
PathoLogic format: suffix .pf
SRI International
Bioinformatics
Typical Problems Using Genbank
Files With PathoLogic
Wrong
qualifier names used: read PathoLogic
documentation!
Extraneous
Check
information in a given qualifier
results of trial parse carefully
GenBank File Format
Accepted feature types:
CDS, tRNA, rRNA, misc_RNA
Accepted qualifiers:
/locus_tag
/gene
/product
/EC_number
/product_comment
/gene_comment
/alt_name
/pseudo
SRI International
Bioinformatics
Unique ID [recm]
Gene name [req]
[req]
[recm]
[opt]
[opt]
Synonyms [opt]
Gene is a pseudogene [opt]
For multifunctional proteins, put each function in a separate
/product line
PathoLogic File Format
SRI International
Bioinformatics
Each record starts with line containing an ID attribute
Tab delimited
Each record ends with a line containing //
One attribute-value pair is allowed per line
Use multiple FUNCTION lines for multifunctional proteins
Lines starting with ‘;’ are comment lines
Valid attributes are:
ID, NAME, SYNONYM
STARTBASE, ENDBASE, GENE-COMMENT
FUNCTION, PRODUCT-TYPE, EC, FUNCTION-COMMENT
DBLINK
INTRON
PathoLogic File Format
SRI International
Bioinformatics
ID
TP0734
NAME
deoD
STARTBASE
799084
ENDBASE
799785
FUNCTION
purine nucleoside phosphorylase
DBLINK
PID:g3323039
PRODUCT-TYPE
P
GENE-COMMENT
similar to GP:1638807 percent identity:
57.51; identified by sequence similarity; putative
//
ID
TP0735
NAME
gltA
STARTBASE
799867
ENDBASE
801423
FUNCTION
glutamate synthase
DBLINK
PID:g3323040
PRODUCT-TYPE
P
SRI International
Bioinformatics
Before you start:
What to do when an error occurs
Navigator errors are automatically trapped –
debugging information is saved to error.tmp file.
All other errors (including most PathoLogic
errors) will cause software to drop into the Lisp
debugger
Unix: error message will show up in the original terminal
window from which you started Pathway Tools.
Windows: Error message will show up in the Lisp console.
The Lisp console usually starts out iconified – its icon is a
blue bust of Franz Liszt
2 goals when an error occurs:
Try to continue working
Obtain enough information for a bug report to send to
pathway-tools support team.
Most
The Lisp Debugger
SRI International
Bioinformatics
Sample error (details and number of restart actions differ
for each case)
Error: Received signal number 2 (Keyboard interrupt)
Restart actions (select using :continue):
0: continue computation
1: Return to command level
2: Pathway Tools version 10.0 top level
3: Exit Pathway Tools version 10.0
[1c] EC(2):
To generate debugging information (stack backtrace):
:zoom :count :all
To continue from error, find a restart that takes you to the
top level – in this case, number 2
:cont 2
To exit Pathway Tools:
:exit
How to report an error
Determine
SRI International
Bioinformatics
if problem is reproducible, and how to
reproduce it (make sure you have all the latest
patches installed)
Send email to ptools-support@ai.sri.com
containing:
Pathway Tools version number and platform
Description of exactly what you were doing (which command
you invoked, what you typed, etc.) or instructions for how to
reproduce the problem
error.tmp file, if one was generated
If software breaks into the lisp debugger, the complete error
message and stack backtrace (obtained using the command
:zoom :count :all, as described on previous slide)
SRI International
Bioinformatics
Using the PPP GUI to Create a
Pathway/Genome Database
Input
Project Information
Organism -> Create New
SRI International
Bioinformatics
Input Project Information
Next Steps
Trial
Parse
Build -> Trial Parse
Fix any errors in input files
Build pathway/genome database
Build -> Automated Build
SRI International
Bioinformatics
SRI International
Bioinformatics
PathoLogic Parser Output
SRI International
Bioinformatics
Assign Enzymes to Reactions
5.1.3.2
Gene
product
MetaCyc
UDP-glucose-4epimerase
Match
yes
no
Probable enzyme
-ase
no
yes
Not a metabolic
enzyme
Assign
UDP-D-glucose UDP-galactose
Manually
search
no
Can’t Assign
yes
Assign
Enzyme Name Matcher
Matches
SRI International
Bioinformatics
on full enzyme name
Match is case-insensitive and removes the
punctuation characters “ -_(){}',:”
Also matches after removal of prefixes and
suffixes such as:
“Putative”, “Hypothetical”, etc
alpha|beta|…|catalytic|inducible chain|subunit|component
Parenthetical gene name
Enzyme Name Matcher
For
SRI International
Bioinformatics
names that do not match, software identifies
probable metabolic enzymes as those
Containing “ase”
Not containing keywords such as
“sensor kinase”
“topoisomerase”
“protein kinase”
“peptidase”
Etc
Research
unknown enzymes
MetaCyc, Swiss-Prot, PubMed
SRI International
Bioinformatics
Enzyme Name to Reaction Mapping
See also file PTools Tutorial/PathoLogic Reports/name-matching-report.txt
SRI International
Bioinformatics
Manual Polishing
Refine -> Assign Probable Enzymes
Do this first
Refine -> Rescore Pathways
Redo after assigning enzymes
Refine -> Create Protein Complexes
Can be done at any time
Refine -> Assign Modified Proteins
Can be done at any time
Refine -> Transport Identification Parser Can be done at any time
Refine -> Pathway Hole Filler
Refine -> Predict Transcription Units
Refine -> Update Overview Do this last, and repeat after any material
changes to PGDB
Assign Probable Enzymes
SRI International
Bioinformatics
SRI International
Bioinformatics
How to find reactions for probable
enzymes
First,
verify that enzyme name describes a
specific, metabolic function
Search for fragment of name in MetaCyc – you
may be able to find a match that PathoLogic
missed
Look up protein in SwissProt or other DBs
Search for gene name in PGDB for related
organism (bear in mind that gene names are not
reliable indicators of function, so check carefully)
Search for function name in PubMed
Other…
Manual Polishing
Refine -> Assign Probable Enzymes
Refine -> Rescore Pathways
Refine -> Create Protein Complexes
Refine -> Assign Modified Proteins
Refine -> Transport Identification Parser
Refine -> Pathway Hole Filler
Refine -> Predict Transcription Units
Refine -> Run Consistency Checker
Refine -> Update Overview
SRI International
Bioinformatics
Automated Pathway Inference
All
SRI International
Bioinformatics
pathways in MetaCyc for which there is at
least one enzyme identified in the target organism
are considered for possible inclusion.
Algorithm errs on side of inclusivity – easier to
manually delete a pathway from an organism than
to find a pathway that should have been predicted
but wasn’t.
SRI International
Bioinformatics
Considerations taken into account when
deciding whether or not a pathway should
be inferred:
Is there a unique enzyme – an enzyme not involved in any
other pathway?
Does the organism fall in the expected taxonomic domain of
the pathway?
Is this pathway part of a variant set, and, if so, is there more
evidence for some other variant?
If there is no unique enzyme:
Is there evidence for more than one enzyme?
If a biosynthetic pathway, is there evidence for final reaction(s)?
If a degradation pathway, is there evidence for initial reaction(s)?
If an energy metabolism pathway, is there evidence for more than half the
reactions?
SRI International
Bioinformatics
Assigning Evidence Scores to
Predicted Pathways
X|Y|Z
denotes score for P in O
where:
X = total number of reactions in P
Y = enzymes catalyzing number of reactions for which there is
evidence in O
Z = number of Y reactions that are used in other pathways in O
Manual Pruning of Pathways
SRI International
Bioinformatics
Use pathway evidence report
Coloring scheme aids in assessing pathway evidence
Phase I: Prune extra variant pathways
Rescore pathways, re-generate pathway evidence report
Phase II: Prune pathways unlikely to be present
No/few unique enzymes
Most pathway steps present because they are used in another pathway
Pathway very unlikely to be present in this organism
Nonspecific enzyme name assigned to a pathway step
Caveats
Cannot
predict pathways not present in MetaCyc
Evidence
Since
SRI International
Bioinformatics
for short pathways is hard to interpret
many reactions occur in multiple pathways,
some false positives
Output from PPP
Pathway/genome
SRI International
Bioinformatics
database
Summary
pages
Pathway evidence page
Click “Summary of Organisms”, then click organism name, then click
“Pathway Evidence”, then click “Save Pathway Report”
Missing enzymes report
Directory
etc.
tree containing sequence files, reports,
SRI International
Bioinformatics
Resulting Directory Structure
ROOT/ptools-local/pgdbs/user/ORGIDcyc/VERSION/
input
reports
ORGIDbase.ocelot
data
name-matching-report.txt
trial-parse-report.txt
kb
organism.dat
organism-init.dat
genetic-elements.dat
annotation files
sequence files
overview.graph
released -> VERSION
Manual Polishing
Refine -> Assign Probable Enzymes
Refine -> Rescore Pathways
Refine -> Create Protein Complexes
Refine -> Assign Modified Proteins
Refine -> Transport Identification Parser
Refine -> Pathway Hole Filler
Refine -> Predict Transcription Units
Refine -> Run Consistency Checker
Refine -> Update Overview
SRI International
Bioinformatics
SRI International
Bioinformatics
Creating Protein Complexes
SRI International
Bioinformatics
Complex Subunits Stoichiometries
Manual Polishing
Refine -> Assign Probable Enzymes
Refine -> Re-run Name Matcher
Refine -> Create Protein Complexes
Refine -> Assign Modified Proteins
Refine -> Transport Identification Parser
Refine -> Pathway Hole Filler
Refine -> Predict Transcription Units
Refine -> Run Consistency Checker
Refine -> Update Overview
SRI International
Bioinformatics
SRI International
Bioinformatics
Proteins as Reaction Substrates
Manual polishing
Refine -> Assign Probable Enzymes
Refine -> Rescore Pathways
Refine -> Create Protein Complexes
Refine -> Assign Modified Proteins
Refine -> Transport Identification Parser
Refine -> Pathway Hole Filler
Refine -> Predict Transcription Units
Refine -> Run Consistency Checker
Refine -> Update Overview
SRI International
Bioinformatics
Nomenclature
SRI International
Bioinformatics
• WO pair = pair of genes within an operon
• TUB pair = pair of genes at a transcription unit boundary
(delineate operons)
SRI International
Bioinformatics
Operation of the operon predictor
For each contiguous gene pair, predict whether gene pairs
are within the same operon or at a transcription unit
boundary
Use pairwise predictions to identify potential operons
AB = TUB pair
BC = WO pair
CD = WO pair
DE = TUB pair
A
operon = BCD
B
C
D
E
Operon predictor
SRI International
Bioinformatics
Predicts operon gene pairs based on:
intergenic distance between genes
genes in the same functional class
Typically used for operon prediction
We use method from Salgado et al, PNAS (2000) as a
starting point.
Uses E. coli experimentally verified data as a training set.
Compute log likelihood of two genes being WO or TUB pair based
on intergenic distance.
Operon predictor
SRI International
Bioinformatics
Additional features easily computed from a PGDB
1.
both genes products enzymes in the same metabolic
pathway
2.
both gene products monomers in the same protein
complex
3.
one gene product transports a substrate for a metabolic
pathway in which the other gene product is involved as an
enzyme
4.
a gene upstream or downstream from the gene pair (and
within the same directon) is related to either one of the
genes in the pair as per features 1, 2 and 3 above.