Using the Gene Ontology for Expression Analysis

Gene Annotation
& Gene Ontology
May 24, 2016
Gene lists from RNAseq analysis
What do you do with a list of 100s of genes that
contain only the following information?
• Gene name or symbol
• Ratio between groups (UP or DOWN)
• One or more database IDs (accession numbers)
How do you figure out the role of the genes in the
model you are studying?
Gene annotation
Process of assigning descriptions to a transcript or
gene product. Includes:
– Official gene symbol & name
– Protein features: domains, functional elements such as
nuclear localization signals
– Predicted molecular function, biological process and
cellular location
– Experimentally derived information on function, process
and cellular location
– References
– ....
Who does the gene annotation?
• Refseq & Gene databases
– NCBI staff
• Ensembl databases
– http://useast.ensembl.org
– EMBL-EBI & the Wellcome Trust Sanger Institute
• Uniprot
– Staff at European Bioinformatics Institute (EBI), Swiss
Institute of Bioinformatics (SIB) and the Protein
Information Resource (PIR)
• Yeast DB, FlyBase, Mouse Genome Informatics
(MGI) & other organism specific databases
Gene record for BEST1
Ensembl Gene record for BEST1
Uniprot record for BEST1
Gene, Ensembl or Uniprot?
• What information are you looking for?
• Comfort level with the interface
• All have a little to LOTS of information
• Use as a starting point
Dealing with gene lists
• How can you efficiently categorize the genes in
some biologically meaningful way?
• Batch download data from Gene or Uniprot and do
a lot of reading?
• PubMed?
• One approach is to use meta-data in the form of
terms assigned to each gene that describe its
molecular function, participation in a biological
process and its location in a cellular component
Gene Ontology
• Set of standard biological phrases (terms) which
are applied to genes/proteins:
– protein kinase
– apoptosis
– Membrane
• Attempt to standardize the representation of
genes and gene product attributes across species
and databases
• Maintained by Gene Ontology consortium
– http://geneontology.org/
– Individual groups contribute taxon-specific terms
Cellular Component
Where a gene product acts
Mitochondria
Cellular Component
Cellular components of a
virus differ from those of a cell
Cellular Component
Enzyme complexes in the component
ontology refer to places, not activities.
Molecular Function
Activities or “jobs” of a gene product
glucose-6-phosphate isomerase activity
Molecular Function
insulin binding
insulin receptor activity
Molecular Function
• A gene product may have several functions
• Sets of functions make up a biological process.
Biological Process
a commonly recognized series of events
cell division
Biological Process
transcription
Biological Process
regulation of gluconeogenesis
Biological Process
limb development
Why use gene ontology?
• Allows biologists to make queries across large
numbers of genes without researching each one
individually
• Can find all the PI3 kinases in a given genome or
find all proteins involved in oxidative stress
response without prior knowledge of every gene
MAPK14
• GO biological process:
– 3'-UTR-mediated mRNA stabilization
– DNA damage checkpoint
– Ras protein signal transduction
• GO molecular function:
– ATP binding
– MAP kinase activity
– MAP kinase kinase activity
• GO cellular component
– Cytoplasm
– Extracellular exosome
– Nucleoplasm
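The MAPK14 annotations above can be held in a simple lookup table, which is what makes queries such as "find every gene with MAP kinase activity" possible. A minimal Python sketch using only the terms listed on this slide; the dictionary layout and the function name genes_with_term are illustrative assumptions, not a GO file format or API:

```python
# Toy annotation table: gene symbol -> GO aspect -> set of term names.
# Only the MAPK14 terms shown on the slide are included.
annotations = {
    "MAPK14": {
        "biological_process": {
            "3'-UTR-mediated mRNA stabilization",
            "DNA damage checkpoint",
            "Ras protein signal transduction",
        },
        "molecular_function": {
            "ATP binding",
            "MAP kinase activity",
            "MAP kinase kinase activity",
        },
        "cellular_component": {
            "cytoplasm",
            "extracellular exosome",
            "nucleoplasm",
        },
    },
}


def genes_with_term(table, term):
    """Return the sorted gene symbols annotated with `term` in any aspect."""
    return sorted(
        gene
        for gene, aspects in table.items()
        if any(term in terms for terms in aspects.values())
    )


print(genes_with_term(annotations, "MAP kinase activity"))
```

With a whole genome loaded into the same structure, the one query would return every annotated kinase without reading each gene record.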
Gene Ontology for analysis
• Generally biological process terms are more useful
for putting gene lists into a context
• There are more GO terms assigned to process than
to function or component
• Fewest terms assigned to component
• Function in the absence of any process
information can imply a biological role
– e.g. you are looking for transcription factors responsible
for some response
Ontology Structure
• Terms are linked by two relationships
– is-a
– part-of 

[Figure: graph of the terms cell, membrane, mitochondrial membrane,
chloroplast and chloroplast membrane, connected by is-a and part-of
relationships]
GO structure
• GO isn’t just a flat list of
biological terms
• Terms are related within a
hierarchy
– DNA binding is a type of
nucleic acid binding (is_a)
– Nucleic acid binding is a
type of binding (is_a)
GO structure
A single gene associated with a particular term is
automatically annotated to all of the parent terms
GO structure
• This means genes can
be grouped according
to user-defined levels
• Allows broad overview
of gene set or genome
• You can use the level
of granularity that
makes most sense
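The propagation of a gene's annotation up to all parent terms can be sketched in a few lines of Python. The hierarchy below uses only the is_a examples from the slides (DNA binding, nucleic acid binding, binding); the function names and the gene annotated to "DNA binding" are hypothetical, chosen for illustration:

```python
# Toy is_a hierarchy from the slides: child term -> list of parent terms.
parents = {
    "DNA binding": ["nucleic acid binding"],
    "nucleic acid binding": ["binding"],
    "binding": [],
}


def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    found = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, []))
    return found


def propagate(direct_terms, parents):
    """Expand a gene's direct annotations to include all parent terms."""
    expanded = set(direct_terms)
    for term in direct_terms:
        expanded |= ancestors(term, parents)
    return expanded


# Hypothetical gene annotated directly to "DNA binding" only.
print(sorted(propagate({"DNA binding"}, parents)))
```

Because the expanded set contains both specific and general terms, genes can then be grouped at whichever level of granularity suits the analysis.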
GO terms
Each concept has:
• a name
• an ID number
• a definition
term: transcription initiation
id: GO:0006352
definition:
Processes involved in the assembly
of the RNA polymerase complex at
the promoter region of a DNA
template resulting in the
subsequent synthesis of RNA from
that promoter.
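As a rough illustration, the three parts of a term map naturally onto a small record type. The class name GOTerm and its field names are assumptions made for this sketch; the only rule taken from GO itself is that an identifier is "GO:" followed by seven digits:

```python
import re
from dataclasses import dataclass

# GO identifiers are the prefix "GO:" followed by seven digits.
GO_ID_PATTERN = re.compile(r"^GO:\d{7}$")


@dataclass(frozen=True)
class GOTerm:
    go_id: str
    name: str
    definition: str

    def __post_init__(self):
        # Reject malformed identifiers early.
        if not GO_ID_PATTERN.match(self.go_id):
            raise ValueError(f"not a valid GO identifier: {self.go_id!r}")


term = GOTerm(
    go_id="GO:0006352",
    name="transcription initiation",
    definition=(
        "Processes involved in the assembly of the RNA polymerase complex "
        "at the promoter region of a DNA template resulting in the "
        "subsequent synthesis of RNA from that promoter."
    ),
)
print(term.go_id, term.name)
```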
GO terms assigned to MAPK14
Types of evidence codes
• Experimental
• Computational
• Other evidence codes
Manual annotation
In this study, we report the isolation and molecular characterization
of the B. napus PERK1 cDNA, that is predicted to encode a novel
receptor-like kinase. We have shown that like other plant RLKs, the
kinase domain of PERK1 has serine/threonine kinase activity. In
addition, the location of a PERK1-GFP fusion protein to the plasma
membrane supports the prediction that PERK1 is an integral
membrane protein…these kinases have been implicated in early
stages of wound response…
• Molecular function: serine/threonine kinase activity
• Cellular component: integral membrane protein (plasma membrane)
• Biological process: wound response
Electronic Annotation
• Annotation derived without human validation
– mappings file e.g. interpro2go, ec2go.
– Blast search ‘hits’
• Lower ‘quality’ than manual codes
• Used in non-model organisms
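Because electronic annotations carry the evidence code IEA (Inferred from Electronic Annotation), they can be separated from experimentally supported ones before analysis. A minimal sketch, assuming annotations are available as (gene, term, evidence code) tuples; the genes and rows below are made up for illustration, while the evidence codes themselves are standard GO codes:

```python
# Experimental evidence codes used by GO (high-throughput codes omitted).
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}
ELECTRONIC_CODE = "IEA"  # Inferred from Electronic Annotation

# Hypothetical annotations: (gene, GO term name, evidence code).
annotations = [
    ("geneA", "protein kinase activity", "IDA"),
    ("geneA", "membrane", "IEA"),
    ("geneB", "apoptosis", "IMP"),
    ("geneB", "ATP binding", "IEA"),
]


def drop_electronic(rows):
    """Keep every annotation that was not assigned electronically."""
    return [row for row in rows if row[2] != ELECTRONIC_CODE]


def experimental_only(rows):
    """Keep only annotations backed by an experimental evidence code."""
    return [row for row in rows if row[2] in EXPERIMENTAL_CODES]


for gene, term, code in experimental_only(annotations):
    print(gene, term, code)
```

For a non-model organism, filtering this strictly may leave very few annotations, so whether to keep IEA rows is a judgment call.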
GO & analysis of gene lists
• www.geneontology.org
– Maintains the databases of GO terms, serves as a clearing
house for terms as they are assigned in new organisms
• Tools for exploring gene lists using GO:
– WebGestalt, gProfiler, Onto-Express, and GSEA to name
a few
– DAVID is a suite of tools for gene enrichment analysis
that also includes GO.
– We’ll use both DAVID and WebGestalt to explore our
gene list
Gene Ontology tools
• input a gene list
• shows which GO categories have most genes
associated with them or are “enriched”
• provides a statistical measure to determine
whether enrichment is significant
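The first step these tools perform, tallying how many genes in the list carry each GO term, can be sketched as follows. The gene names and their term assignments are invented toy data, and count_terms is a hypothetical helper rather than part of DAVID or WebGestalt:

```python
from collections import Counter

# Hypothetical gene list with made-up GO process terms per gene.
gene_terms = {
    "gene1": {"mitosis", "cell proliferation"},
    "gene2": {"mitosis"},
    "gene3": {"apoptosis"},
    "gene4": {"mitosis", "apoptosis"},
}


def count_terms(gene_terms):
    """Count how many genes in the list carry each GO term."""
    counts = Counter()
    for terms in gene_terms.values():
        counts.update(terms)
    return counts


for term, n in count_terms(gene_terms).most_common():
    print(f"{term}: {n} of {len(gene_terms)} genes")
```

Raw counts alone do not show enrichment; they still have to be compared with how common each term is among all genes measured.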
Using GO in practice
• statistical measure
– how likely it is that your differentially regulated genes fall into
that category by chance
[Figure: bar chart of the number of differentially regulated genes
in each GO process (mitosis, apoptosis, positive control of glucose
transport, cell proliferation), for a microarray of 1000 genes and an
experiment with 100 genes differentially regulated]
mitosis – 80/100
apoptosis – 40/100
cell proliferation – 30/100
glucose transport – 20/100
Using GO in practice
• However, when you look at the distribution of all
genes on the microarray:
Process             Genes on array   # genes expected (out of 100)   # genes observed
Mitosis             800/1000         80                              80
Apoptosis           400/1000         40                              40
Cell proliferation  100/1000         10                              30
Glucose transport   50/1000          5                               20
• Proportions analysis
– Chi-squared or Fisher’s exact test
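The expected counts and the proportions test can be reproduced from the numbers in the example (1000 genes on the array, 100 differentially regulated). The sketch below computes a one-sided Fisher's exact test directly from the hypergeometric distribution using only the Python standard library; the function names are illustrative, and real tools may use a different test or apply multiple-testing correction:

```python
from math import comb

ARRAY_SIZE = 1000  # genes on the microarray
LIST_SIZE = 100    # differentially regulated genes

# process -> (genes on the array with the term, genes observed in the list)
table = {
    "mitosis": (800, 80),
    "apoptosis": (400, 40),
    "cell proliferation": (100, 30),
    "glucose transport": (50, 20),
}


def expected_count(on_array):
    """Genes expected in the list if list membership were random."""
    return on_array * LIST_SIZE / ARRAY_SIZE


def enrichment_p_value(on_array, observed):
    """One-sided Fisher's exact test: P(X >= observed), X hypergeometric."""
    total = comb(ARRAY_SIZE, LIST_SIZE)
    upper = min(on_array, LIST_SIZE)
    tail = sum(
        comb(on_array, k) * comb(ARRAY_SIZE - on_array, LIST_SIZE - k)
        for k in range(observed, upper + 1)
    )
    return tail / total


for process, (on_array, observed) in table.items():
    p = enrichment_p_value(on_array, observed)
    print(f"{process}: expected {expected_count(on_array):.0f}, "
          f"observed {observed}, p = {p:.3g}")
```

Mitosis and apoptosis are observed exactly as often as expected, so their p-values are large, while cell proliferation and glucose transport are over-represented and give very small p-values.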
Other sources of annotation
• Uniprot (Swiss-Prot) keywords
• Protein domain databases
– PFAM, Panther, PDB, PROSITE, etc.
• GeneDB summaries from NCBI
• Protein-protein interactions databases
• Pathway databases
– KEGG, BioCarta, BBID, Reactome
DAVID incorporates annotation from all of these and
clusters the redundant terms
Today in computer lab
• Tutorial on using DAVID
• Tutorial on using WebGestalt
• Analysis of gene lists using DAVID and at least one
other GO term enrichment tool