Mining Gene Regulatory Networks and
Microarray Data: The MinePath Approach
Alexandros Kanterakis1, Dimitris Kafetzopoulos2, Vassilis Moustakis1 and George Potamias1,*
“… Imagine that for selected cancer patients, biopsies are taken before, during and after treatment … and the analyses stored
promptly in an accessible fashion. These biopsy samples are subjected to gene-expression and proteomic analysis, and these
molecular data are also stored accessibly … imagine that one can drill down into clinical and other (genomic/genetic)
databases in an intelligent search in hours rather than months. One end-point might be the rapid identification of
individualized molecular profiles correlated with sensitivity or resistance to therapy …” [1].
Abstract. Two of the most prominent concepts in
bioinformatics are Gene Regulatory Networks (GRNs) and DNA
Microarrays (MAs). GRNs model the interfering relations among
gene products and other cell components during the regulation of
the cell phenotype and function. MAs measure the simultaneous
expression profile of thousands of genes. In this paper we present
a methodology that integrates both bio-sources. It aims to discover
the functional parts of GRNs by identifying consistencies with
respective MA gene-expression data. GRNs are decomposed into
all possible functional paths. The induced, appropriately formatted
paths are stored in a reference GRN-Paths repository. Target
paths from the repository are then matched, using a novel gene-expression/path matching formula, against given microarray
gene-expression samples. The most consistent paths, i.e., paths
with adequately high matching scores, are identified. These paths
uncover and present potential underlying gene-regulatory
mechanisms that govern the gene-expression status of the
specified samples. This discovery guides the finer classification of
samples as well as the re-classification of diseases, providing the
most prominent molecular evidence for it. The methodology is
implemented in the MinePath system. Preliminary experimental
results indicate the suitability, efficiency and reliability of the
approach, as well as its potential for individualised molecular
medicine.
1 INTRODUCTION
With the human genome sequence completed, we are now entering
the post-genomic era. The main focus of genomic research is
shifting from sequencing to using genome sequences in order
to understand how genomes function. In this context, a new and
demanding need is arising, namely the linkage between the clinical
and the ‘genomic’ worlds. The opportunities for clinical genetics to
become a mainstream component of clinical medicine are now
apparent. This move to the clinic appears to be inevitable [2].
A future healthcare-delivery scenario is cited above, a scenario
that emerges from the envisioned and rising genomic medicine
era. The scenario addresses the vision of individualized (or
personalized) healthcare needs and benefits. At the same time it
points to the necessary investments in technological
advances towards its realization. The vision
requires common standards of data storage at each level of
investigation, new frameworks for data, information and
knowledge integration, and new tools to analyze and mine clinico-genomic data at all levels (gene, protein, molecular pathway,
tissue, individual and population).
Current post-genomics bioinformatics research seeks
methods that not only combine the information from dispersed and
heterogeneous data sources but distil the knowledge and provide a
systematic, genome-scale view of biology [3]. The advantage of
this approach is that it can identify emergent properties of the
underlying molecular system as a ‘whole’ – an endeavour of
limited success if targeted genes, reactions or even molecular
pathways are studied in isolation [4]. Individuals show different
phenotypes for the same disease – they respond differently to drugs
and sometimes the effects are unpredictable. Many of the genes
examined in early clinico-genomic studies were linked to single-gene traits, but future advances involve the elucidation of multi-gene determinants of drug response. Differences in the individuals’
background DNA code and, mainly, differences in the underlying
gene regulation mechanisms alter the expression or function of
proteins being targeted by drugs, and contribute significantly to
variation in the responses of individuals. The challenge is to
accelerate our understanding of the molecular mechanisms of these
variations and to produce targeted individualized therapies.
Two of the most promising and advanced technologies and
notions of contemporary bioinformatics research are Microarrays
(MA) and Gene Regulatory Networks (GRNs). Combining
knowledge from both sources will accelerate our
understanding of the molecular mechanisms of genome variations
and will guide the discovery of targeted individualized therapies.
Faced with such a challenge we devised and present an
integrated methodology that ‘amalgamates’ knowledge and data
from both GRNs and MA gene-expression sources. A preliminary
realisation of the methodology is done in a system called
MinePath. MinePath aims to uncover potential gene-regulatory
‘fingerprints’ and mechanisms that govern the genomic profiles of
diseases.
In the next section we present the background of MA technology
and gene-expression data mining, as well as the potential and
limitations of current GRN analysis. We then present the
intrinsic problems of MAs and GRNs when studied in
isolation, and state the need for combining both knowledge and
data sources. In the third section we present the MinePath
methodology in detail, together with the underlying
techniques and formulas. In section four we present preliminary
experimental results and discuss the application of MinePath to
a well-known and widely used microarray gene-expression
study. In the last section we summarise our contribution and
point to on-going research and development work.
2 MAs AND GRNs
2.1 Microarrays and gene-expression data mining
With the recent advances in MA technology [5], [6], the potential
for molecular diagnostic and prognostic tools seems to be becoming a
reality. In recent years, microarray chips have been devised and
manufactured in order to measure the expression profile of
thousands of genes. In this context a number of pioneering studies
have been conducted that profile the expression level of genes for
various types of cancer such as leukemia, breast cancer and other
tumours [7], [8]. The aim is to add molecular characteristics to the
classification of diseases so that diagnostic procedures are
enhanced and prognostic predictions are improved. These studies
demonstrate that gene-expression profiling has great potential in
identifying and predicting various targets and prognostic factors of
diseases.
* Corresponding author, potamias@ics.forth.gr
1 Institute of Computer Science, FORTH, kantale@ics.forth.gr
2 Institute of Molecular Biology & Biotechnology, FORTH, kafetzo@imbb.forth.gr
By measuring transcription levels of genes in an organism under
various conditions, in different tissue samples, we can build up
gene expression profiles, which characterize the dynamic
functioning of each gene in the genome. The microarray data are
represented in a matrix with rows representing genes, columns
representing samples (e.g. various tissues, developmental stages
and treatments), and each cell containing a number characterizing
the expression level of the particular gene in the particular sample (the gene expression matrix).
Gene-expression data analysis depends on Gene Expression
Data Mining (GEDM) technology, and the involved data analysis
is based on two approaches: (a) hypothesis testing - to investigate
the induction or perturbation of a biological process that leads to
predicted results, and (b) knowledge discovery - to detect
underlying hidden-regularities in biological data. For the latter,
one of the major challenges is gene-selection [9], [10]. Possible
prognostic genes for disease outcome, including response to
treatment and disease recurrence are then selected to compose the
molecular signature of (or, gene-markers for) the targeted disease.
2.2 GRNs: Potential and Limitations
Gene regulatory networks (GRNs) are network structures that
depict the interaction of DNA segments during the transcription of
genes into mRNA. The prominent and vital role of GRNs in the
study of various biological processes is a major topic in
contemporary biology research, where numerous thorough studies
have been made [11], [12]. From a computational point of view,
GRNs can be conceived as analogue biochemical computers that
regulate the level of expression of target genes [13]. Each network
has inputs, usually proteins or transcription factors that initiate the
network function, and outputs, usually a certain gene. The
network itself acts as a mechanism that determines cellular
behaviour, where the nodes are genes and the edges are functions that
represent the molecular reactions between the nodes. These
functions can be perceived as Boolean functions, where nodes have
only two possible states (“on” and “off”), and the whole network is
represented as a simple directed graph [14]. The notion of a
GRN is itself an abstraction of the underlying chemical
dynamics of the cell, so the expectation of high reliability in
terms of modelling is limited. It is indicative that most of the
relations in known and established GRNs have been derived from
laborious and extensive laboratory experiments and careful study
of the existing biochemical literature. Thus GRNs are far from
complete.
Current efforts focus on the reconstruction of GRNs by
exploring gene-expression data. Specifically, Babu in [15]
identified that network topologies extracted from gene co-expression events could discover motifs and regulatory hubs that
can characterize entire cellular states and further guide
pharmaceutical research. The identification of underlying subgraph
isomorphisms is an NP-complete problem [16], so large network
comparisons rely on heuristics that measure various attributes of
the network such as the degree distribution, clustering coefficient and
shortest path lengths [17]. Other heuristics focus on local
properties such as network motifs [18], [19] and small over-represented graphs [20]. Inference of gene networks from gene expression also draws on various analytical methods
such as Boolean Networks [21], Bayesian Networks [22],
differential equations and steady-state models [23], [24], [25],
statistical and probabilistic [26], [27], [28], and data mining
methods [29], [30]. Finally, the integration of these methods with
visualization techniques and software tools plays an important role
in the field [31].
Very few methods of gene regulatory inference are considered
superior to the others, mainly because of the intrinsically noisy
nature of the data, ‘the curse of dimensionality’, and the
unknown ‘true’ underlying networks. The study of the function,
structure and evolution of GRNs in combination with microarray
gene-expression profiles and data is essential for contemporary
biology research. Researchers have uncovered a
multitude of biological facts, such as protein properties and
genome sequences. But this alone is not sufficient to interpret
biological systems and understand their robustness, which is one of
the fundamental properties of living systems at different levels
[32]. This is mainly because cells, tissues, organs, organisms or any
other biological systems defined by evolution are essentially
complex physicochemical systems. They consist of numerous
dynamic networks of biochemical reactions and signalling
interactions between active cellular components. This cellular
complexity has made it difficult to build a complete understanding
of how cellular machinery achieves a specific purpose [33]. To
circumvent this complexity, microarrays, biological knowledge and
biological networks can be combined in order to document and
support the detected and predicted interactions [34]. The advances
and tools that each discipline carries can be integrated in a holistic
and generic perspective so that the chaotic complexity of biological
networks can be traced.
2.3 The need to combine MAs and GRNs
On the one hand, MA experiments involve more variables (genes) than
samples (patients). This fact leads to results with poor biological
significance. To remedy this there is an open debate on whether we
should concentrate on gathering more data or on building new
algorithms. Simon et al. [35] published a very strict criticism of
common pitfalls in microarray data mining, while the authors of [36]
commented on the bias in the gene selection procedure.
Moreover, due to limitations in DNA microarray technology,
higher differential expressions of a gene do not necessarily reflect a
greater likelihood of the gene being related to a disease (e.g.,
cancer) and therefore, focusing only on the candidate genes with
the highest differential expressions might not be the optimal
procedure [37].
Another significant aspect is the noisy content of the experiment.
Appropriate statistical analysis of noisy data is very important in
order to obtain meaningful biological information [38], [39].
Evidence of this is given by the fact that different methods produce
gene lists (i.e., gene-markers and molecular signatures) that are
strikingly different [40]. As a result, and because of the immature
state of microarray technology, the reproducibility of microarray
experiments and of the accompanying statistical prediction models is
quite low except when protocols are uniformly and strictly
followed [41], [42]. In the light of the previous observations, and in
order to overcome the stated limitations, we have to view MA-based
gene-expression profiles as just one instance of biological
information, strongly connected to, rather than isolated from, other
sources of related biological knowledge (e.g., GRNs).
On the other hand, even if the extraction of GRNs from
gene-expression data seems quite strong, the resulting GRN compositions
remain largely unsupported from a biological point of view. As we
have already mentioned, the networks already identified and composed
through laborious and extensive laboratory work, despite
potential errors, misfits and gaps, possess sufficient biological and
scientific evidence. Currently, GRNs are treated and utilized as
“flat” (graph-like) structures that indicate whether or not any subset
of the involved genes is over- or down-regulated in a microarray
experiment. Although GRNs provide extra documentation for the
functionality of genes, current views lack the potential of
exploiting the underlying inner dynamics and correlations provided
by the network.
3 THE MinePath METHODOLOGY
Existing GRN databases provide us with widely utilized networks
of proven molecular validity. The best known are networks that
describe important cellular processes such as cell-cycle, apoptosis,
signaling, and regulation of important growth factors. Online
public repositories contain a variety of information that includes
not only the network per se but also links to the respective node (gene)
and edge (regulation) meta-data and annotation. Currently
MinePath utilizes the KEGG pathways repository3. KEGG
provides a standardized format representation operationalised by its
own markup description language, KGML4.
The gene regulatory relations we consider are restricted to what
might be observed in a microarray experiment: a change in the
expression of a regulator gene modulates the expression of a target
gene mainly via protein-DNA interactions. In other words, there
are genes that causally regulate other genes. A change in the
expression of these genes might change dramatically the behaviour
of the whole network. The identification and prediction of such
changes is a challenging task in bioinformatics. Moreover, we have
to identify real, true networks and use them as scaffolds [43] for
methods that infer gene regulatory networks from gene
expression data [29]. This approach can aid several areas of
biology research such as genomic medicine [44], microarray data
mining [45] and phylogenetic analysis [46].
3.1 Path decomposition and interpretation
The MinePath methodology relies on a novel approach for GRN
processing that takes into account all possible interpretations of the
network. The different GRN interpretations correspond to the
different functional paths that can be followed during the
regulation of a target gene.
Figure 1. Path decomposition. Top: A target part of the KEGG cell-cycle
GRN; Bottom: The five decomposed paths for the targeted part, i.e., all
possible functional routes taken by the network regulation
machinery.
3 KEGG: Kyoto Encyclopedia of Genes and Genomes; http://www.genome.jp/kegg/
4 KGML (KEGG Markup Language); http://www.genome.jp/kegg/xml/
Different GRNs are downloaded from the KEGG repository.
With an XML parser we obtain all the internal network semantics
(see next sub-section). In a subsequent step, all possible
functional network paths are extracted as exemplified in Figure 1
above. Each functional path is annotated with the possible valid
values according to Kauffman’s principles, which follow a binary
setting: each gene in a functional path can be either ‘ON’ or ‘OFF’.
According to Kauffman [14], the following functional gene-regulatory semantics apply: (a) the network is a directed graph with
genes (inputs and outputs) being the graph nodes and the edges
between them representing the causal (regulatory) links;
(b) each node can be in one of two states: ‘on’ or ‘off’;
(c) for a gene, ‘on’ corresponds to the gene being expressed (i.e.,
the respective substance being present); and (d) time is viewed as
proceeding in discrete steps - at each step, the new state of a node
is a Boolean function of the prior states of the nodes with arrows
pointing towards it.
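The parsing and decomposition steps described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not MinePath's implementation: the KGML fragment is a hypothetical, heavily trimmed toy pathway (the genes A to D are invented), the helper names `parse_kgml` and `decompose` are ours, and a 'functional path' is assumed here to be a simple route from a node without incoming edges to a node without outgoing edges.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed KGML fragment; real KEGG files carry many
# more entries and attributes. Genes A-D are invented for illustration.
KGML = """
<pathway name="toy" org="hsa">
  <entry id="1" name="hsa:A" type="gene"/>
  <entry id="2" name="hsa:B" type="gene"/>
  <entry id="3" name="hsa:C" type="gene"/>
  <entry id="4" name="hsa:D" type="gene"/>
  <relation entry1="1" entry2="2" type="PPrel">
    <subtype name="activation" value="--&gt;"/>
  </relation>
  <relation entry1="2" entry2="3" type="PPrel">
    <subtype name="inhibition" value="--|"/>
  </relation>
  <relation entry1="2" entry2="4" type="PPrel">
    <subtype name="activation" value="--&gt;"/>
  </relation>
</pathway>
"""

def parse_kgml(text):
    """Return a list of (source gene, target gene, relation kind) edges."""
    root = ET.fromstring(text.strip())
    names = {e.get("id"): e.get("name") for e in root.iter("entry")}
    edges = []
    for rel in root.iter("relation"):
        sub = rel.find("subtype")
        kind = sub.get("name") if sub is not None else "unknown"
        edges.append((names[rel.get("entry1")], names[rel.get("entry2")], kind))
    return edges

def decompose(edges):
    """Enumerate every simple path from a source node to a sink node.

    Assumption: this source-to-sink reading is one possible interpretation
    of 'all possible functional paths'.
    """
    succ, targets = {}, set()
    for src, dst, kind in edges:
        succ.setdefault(src, []).append((dst, kind))
        targets.add(dst)
    roots = [n for n in succ if n not in targets]
    paths = []

    def walk(node, path, seen):
        nxt = [(d, k) for d, k in succ.get(node, []) if d not in seen]
        if not nxt:                      # sink reached: the path is complete
            paths.append(path)
            return
        for d, k in nxt:
            walk(d, path + [(node, k, d)], seen | {d})

    for r in roots:
        walk(r, [], {r})
    return paths

edges = parse_kgml(KGML)
paths = decompose(edges)
for p in paths:
    print(" ; ".join(f"{s} -{k}-> {t}" for s, k, t in p))
```

On the toy pathway this yields two paths, one ending in the inhibited gene C and one ending in the activated gene D.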
KEGG encompasses and models a variety of regulation links
(edges). Figure 2 shows these links accompanied with their
underlying notation and semantics.
Figure 2. Different functional relationships between genes (or groups of
genes)
Since the regulation edge connecting two genes defines
explicitly the possible values of each gene, we can set all possible
state-values that a gene may take in a path. Thus, each extracted
path contains not only the relevant sub-graph but the state-values
(‘ON’ or ‘OFF’) of each gene as well. The only requirement
concerns the assumption that, for a path to be functional, the path
should be ‘active’ during the GRN regulation process. In other
words we assume that all genes in a path are functionally active.
For example, assume the functional path A → B (‘→’ is an
activation/expression regulatory relation). If gene A is in an ‘OFF’
state then gene B is not allowed to be in an ‘ON’ state; B could
become ‘ON’ if and only if it is activated/expressed by another
gene in a different functional path (e.g., C → B). If we had allowed
non-functional genes to have arbitrary values then the significant
paths would be more likely to be ‘noisy’ rather than of biological
importance.
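The A → B example can be read as a Boolean update rule in the sense of Kauffman's semantics. The sketch below is an illustration only: the specific rule (a regulated gene is ‘ON’ at the next step if at least one activator is ‘ON’ and no inhibitor is ‘ON’) is our assumption and is not taken from the paper, and the function name `step` is hypothetical.

```python
def step(state, activators, inhibitors):
    """One synchronous update of a small Boolean network.

    Assumed rule (illustrative, not from the paper): a regulated gene is ON
    at the next step if at least one of its activators is ON (or it has no
    activators) and none of its inhibitors is ON. Genes without regulators
    keep their current state.
    """
    new = {}
    for gene, value in state.items():
        acts = activators.get(gene, [])
        inhs = inhibitors.get(gene, [])
        if not acts and not inhs:
            new[gene] = value
        else:
            activated = any(state[a] for a in acts) if acts else True
            inhibited = any(state[i] for i in inhs)
            new[gene] = activated and not inhibited
    return new

# B is activated by A (path A -> B) and by C (path C -> B).
activators = {"B": ["A", "C"]}
inhibitors = {}

# A is OFF and C is OFF: B cannot be ON.
s1 = step({"A": False, "B": False, "C": False}, activators, inhibitors)
# A is OFF but C is ON: B becomes ON through the other path.
s2 = step({"A": False, "B": False, "C": True}, activators, inhibitors)
print(s1["B"], s2["B"])
```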
The extracted and annotated paths are stored in a database that
acts as a repository for future reference. Through this repository we
can query for paths that are part of a targeted GRN, contain specific
genes, or involve a specific regulatory relation. Moreover, the
stored paths can be combined and analyzed in view of specific
microarray experiments and respective gene-expression sample
profiles. Furthermore, as the database repository contains and
retrieves functional paths from a variety of different GRNs (e.g.,
cell-cycle, apoptosis etc), we may combine knowledge (i.e., the
functional paths) from different molecular pathways and networks
– a major need for molecular biology and a big challenge for
contemporary bioinformatics research.
3.2 Combining gene-expression profiles and paths
The next step is to locate microarray experiments and respective
gene-expression data where we expect the targeted GRN to play an
important role. For example the cell-cycle and apoptosis GRNs
play an important role in cancer studies and respective experiments
dealing with tumor progression.
With a gene-expression/functional-path matching operation, the
valid and most prominent GRN functional paths are identified.
These paths uncover and present potential underlying gene-regulatory mechanisms that govern the gene-expression status of
the samples under investigation. Such a discovery may guide the
finer classification of samples as well as the re-classification of
diseases, providing the most prominent molecular evidence for it.
3.2.1 Discretization of MA Gene-Expression Values
In order to combine and match GRN paths with microarray data the
respective gene-expression values should be transformed into two
(binary) states - “on” and “off”. Microarray gene-expression
discretization is a popular method to indicate clearly the
expressed (up-regulated/“on”) and non-expressed (down-regulated/“off”) genes. Currently MinePath encompasses an
information-theoretic process for the binary transformation of
gene-expression values. The process presupposes the
categorization of the input microarray samples into two classes.
Figure 3, below, presents indicative gene expression values for
gene M77142, taken from the well known and widely exploited
leukaemia gene expression study [7]. Gene expression numeric
values are correlated with binary equivalents, which are denoted
either as h (for high expression) or l (for low expression). The table
includes expressions across 14 samples.
Figure 3. Example of binary transformation of a Gene’s Expression Profile.
Expression values of gene ‘M77142’ from the leukemia domain. Top row
denotes sample classification (“ALL” or “AML”). Second row lists the
original numerical gene expression values and third row lists binary
equivalents. (‘h’ denotes high and ‘l’ denotes low). Binary equivalents were
derived using MineGene’s binary transformation procedure.
Consider a vector of numbers in descending order, V = <n1, n2,
…, nS>, such that ni ≥ ni+1 ∀i, 1 ≤ i < S, where S denotes the number of
samples. Each ni is associated with a class. Classes are binary. For
simplicity we denote classes as P (positive) and N (negative); for
instance, in leukemia P may denote “ALL” and N may denote
“AML” leukemia types, respectively.
We seek a point estimate µ to split the interval [nS, n1] into two parts.
Estimate µ should discriminate between classes P and N in the best
possible way; µ is used to split the elements of V into high (h) and
low (l) values. Binary transformation of V into h and l value
intervals proceeds via two distinct steps.
Step 1: Calculation of midpoint values, µi, across V elements,
i.e. µi = (ni + ni+1)/2, ∀i, 1 ≤ i < S, and formulation of the midpoint values
vector: Μ = <µ1, µ2, ..., µS-1>, µi ≥ µi+1 ∀i.
Step 2: Assessment of point estimate µ. With respect to each
element of Μ we assess the information gain, IG(V,µi), using
expression (1), namely:
$$ IG(V, \mu_i) = IE(V) - IE(V, \mu_i) \tag{1} $$
We use IE to represent information entropy. IE(V) corresponds
to the entropy of the gene expression profile and IE(V,µi)
corresponds to the entropy of the gene expression profile using µi
as a split point for V elements. IE(V) is calculated using an
information theoretic model (also utilised in decision-tree induction
techniques [47]), namely:
$$ IE(V) = -\frac{|S_P|}{|S|}\log_2\frac{|S_P|}{|S|} - \frac{|S_N|}{|S|}\log_2\frac{|S_N|}{|S|} \tag{2} $$
where SP is the set of samples that belong to class P, SN is the
set of class N samples, and |S|, |SP| and |SN| denote set
cardinalities. When V is split into two intervals using µi, one interval
corresponds to h values and the other corresponds to l values.
Consequently:
$$ IE(V,\mu_i) = -\frac{|S(h)|}{|S|}\left[\frac{|S_P(h)|}{|S(h)|}\log_2\frac{|S_P(h)|}{|S(h)|} + \frac{|S_N(h)|}{|S(h)|}\log_2\frac{|S_N(h)|}{|S(h)|}\right] - \frac{|S(l)|}{|S|}\left[\frac{|S_P(l)|}{|S(l)|}\log_2\frac{|S_P(l)|}{|S(l)|} + \frac{|S_N(l)|}{|S(l)|}\log_2\frac{|S_N(l)|}{|S(l)|}\right] \tag{3} $$
where SP(h) ⊆ SP is the subset of samples that belong to class
P and correspond to h values. The sets SP(l), SN(h) and SN(l) are
similarly defined, and |S(.)| denotes set cardinality. The estimate µi
which maximizes IG(V,µi) is the value of µ. This point is selected
to transform gene expression values to binary equivalent low or
high values (results are demonstrated in the third row of Figure 3).
Binary transformation proceeds independently across genes. When
all gene expression values are transformed the original gene
expression matrix is replaced by a matrix that includes the binary
equivalent values.
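The two steps and expressions (1) to (3) can be sketched in a few lines. This is a minimal sketch rather than MinePath's code: the function names are ours, the expression values are invented for illustration (they are not the M77142 values of Figure 3), and values equal to µ are assigned to the low interval by assumption.

```python
from math import log2

def entropy(labels):
    """Class entropy IE of a list of class labels (e.g. 'P' or 'N')."""
    total = len(labels)
    if total == 0:
        return 0.0
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / total
        result -= p * log2(p)
    return result

def best_split(values, labels):
    """Return (mu, information gain) for the midpoint that maximises IG."""
    pairs = sorted(zip(values, labels), reverse=True)    # descending vector V
    v = [x for x, _ in pairs]
    base = entropy([c for _, c in pairs])                # IE(V), expression (2)
    best_mu, best_gain = None, -1.0
    for i in range(len(v) - 1):
        mu = (v[i] + v[i + 1]) / 2                       # Step 1: midpoint
        high = [c for x, c in pairs if x > mu]
        low = [c for x, c in pairs if x <= mu]           # assumption: ties go low
        # IE(V, mu), expression (3): size-weighted entropy of both intervals
        split = (len(high) * entropy(high) + len(low) * entropy(low)) / len(v)
        gain = base - split                              # expression (1)
        if gain > best_gain:
            best_mu, best_gain = mu, gain
    return best_mu, best_gain

def binarize(values, mu):
    """Map each expression value to 'h' (above mu) or 'l' (at or below mu)."""
    return ["h" if x > mu else "l" for x in values]

# Invented expression profile of one gene over six samples.
values = [900, 850, 700, 300, 250, 100]
labels = ["P", "P", "P", "N", "N", "N"]
mu, gain = best_split(values, labels)
binary = binarize(values, mu)
print(mu, gain, binary)
```

In this toy profile the midpoint between 700 and 300 separates the two classes perfectly, so it is selected as µ.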
In [48] a discretization approach similar to ours is introduced and
employed in order to improve the use of continuous attributes
during decision tree induction. The reported results favour the
utilization of a ‘local’ discretization process, in contrast to a
‘global’ one (to be utilized as a data pre-processing step). The
process is employed in the context of the C4.5/Rel8 decision tree
induction algorithm [48]; Fayyad and Irani [49] and Li and Wong
[50] employ metrics similar to IG(V,µi). However, none of these approaches
incorporates an explicit parameter µ to force a binary split,
which may result in uncontrolled numbers of discretization
intervals.
3.3 Matching GRN paths with MA data
The samples of a binary transformed (discretized) gene-expression
matrix could then be matched against targeted molecular pathways
and respective GRN functional paths (retrieved from the described
repository).
GRN and MA gene-expression data matching aims to
differentiate GRN paths and identify the most prominent functional
paths for the given samples. In other words, the quest is for the
paths that exhibit high matching scores for one class and low
matching scores for the other. This is a paradigm shift from mining
for genes with differential expression, to mining for subparts of
GRN with differential function. The algorithm for differential path
identification is inherently simple (see Figure 4).
Figure 4. Samples S1, S2, S3 belong to the '+' class and samples S4, S5
belong to the '-' class. The first path (IL-1R to TRADD) satisfies samples
S1, S2, S3, S5. The second path (IL-1R to TRADD to FLIP) satisfies samples S1,
S2, S3. The third path satisfies all samples and the fourth path does not satisfy
any sample. The green arrow indicates that the second path yields the
maximum differential power and that it contains a potential functional
differentiation, since it is consistent only with samples that belong to the ‘+’
class. (The two edge symbols in the figure denote activation and inhibition.)
For each path we compute the number of samples of each disease class
that are consistent with it. Suppose that there are S1 and S2
samples belonging to the first and second class, respectively. Assume
that path Pi is consistent with Si;1 and Si;2 samples from the first and
second class, respectively. Formula 4,
$$ \frac{S_{i;1}}{S_1} - \frac{S_{i;2}}{S_2} \tag{4} $$
computes the differential power of the specific path with respect to
the two clinical classes. Ranking of paths according to formula 4
provides the most differentiating and prominent GRN functional
paths for the respective disease classes. These paths present
evidential molecular mechanisms that govern the disease itself, its
type, and its state as well as its phenotypic profiles (e.g., response to
drugs). The formula can be enriched so that longer consistent paths
acquire stronger power. It can also be relaxed so that ‘consistent’ is
a continuous indicator rather than a Boolean value. Finally we may
introduce ‘unknown’ values for missing and erroneous gene
expression values.
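The matching and ranking step can be illustrated with a short sketch. It assumes the strict Boolean notion of consistency (a sample satisfies a path only if every gene on the path shows the expected state) and uses invented binary states. The gene names are borrowed from Figure 4, but the sample values and the expected ‘on’ states are hypothetical, chosen only so that the consistency pattern of the first two paths in the caption is reproduced.

```python
def consistent(path, sample):
    """True if every gene on the path shows the state the path expects."""
    return all(sample.get(gene) == state for gene, state in path.items())

def differential_power(path, class1, class2):
    """Formula (4): consistent fraction in class 1 minus that in class 2."""
    s1 = sum(consistent(path, s) for s in class1)
    s2 = sum(consistent(path, s) for s in class2)
    return s1 / len(class1) - s2 / len(class2)

# Hypothetical paths: gene -> expected discretized state.
paths = {
    "path1": {"IL-1R": "on", "TRADD": "on"},
    "path2": {"IL-1R": "on", "TRADD": "on", "FLIP": "on"},
}

# Invented discretized samples: S1-S3 in the '+' class, S4-S5 in the '-' class.
plus = [
    {"IL-1R": "on", "TRADD": "on", "FLIP": "on"},    # S1
    {"IL-1R": "on", "TRADD": "on", "FLIP": "on"},    # S2
    {"IL-1R": "on", "TRADD": "on", "FLIP": "on"},    # S3
]
minus = [
    {"IL-1R": "off", "TRADD": "on", "FLIP": "on"},   # S4
    {"IL-1R": "on", "TRADD": "on", "FLIP": "off"},   # S5
]

scores = {name: differential_power(p, plus, minus) for name, p in paths.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

With these values the first path scores 3/3 − 1/2 = 0.5 and the second 3/3 − 0/2 = 1.0, so the second path ranks first, in line with the Figure 4 caption.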
4 EXPERIMENTS AND RESULTS
In this section we present preliminary results on running the
MinePath combined GRN and MA mining methodology on a real-world cancer-related microarray study. The study concerns the
distinction between two well-known leukemia classes, namely
Acute Myeloid Leukemia (AML) vs. Acute Lymphoblastic
Leukemia (ALL). Although the distinction between AML and ALL
has been well established, no single test is currently sufficient to
establish the diagnosis. Rather, current clinical practice involves an
experienced hematopathologist's interpretation of the tumor's
morphology, histochemistry, immunophenotyping, and cytogenetic
analysis, each performed in a separate, highly specialized
laboratory. Distinguishing ALL from AML is critical for successful
treatment. Although remissions can be achieved using ALL therapy
for AML (and vice versa), current rates are markedly diminished,
and unwarranted toxicities are encountered. The specific leukemia
study is one of the first microarray clinical studies performed,
with results published as early as 1999 [7]. The respective
microarray gene-expression data consist of 38 bone marrow
samples - 27 ALL and 11 AML, obtained from acute leukemia
patients at the time of diagnosis.
In the aforementioned study publication, the researchers were
able to identify a molecular signature of 50 differentially expressed
genes whose expression profiles were capable of adequately
distinguishing between the two leukemia types. But (even in the early
microarray days), there was no evidential reporting on the
underlying gene-regulatory mechanisms governing the expression
status of the identified molecular signature.
Figure 5. The AML highly differentiating ‘Apoptosis’ paths
In an effort to identify such mechanisms we applied the
MinePath methodology on this leukemia microarray dataset,
targeting the “Apoptosis” regulatory network5. A total of 248
consistent paths were identified, each one matching and being
consistent with different samples. Moreover, each of these paths
exhibits a different differential power with respect to the two types.
Based on an appropriate ranking, and setting a high cut-off, we were
able to identify four highly leukemia-type differentiating paths: 2
paths for AML and 2 paths for ALL (see Figures 5 and 6).
AML-paths: In the upper part of Figure 5 the two identified paths
are shown in both KEGG and HUGO6 gene nomenclature; red and
blue colouring is used to denote the “on” and “off” states of genes,
respectively. The bottom part of the figure shows the regulatory
sub-networks that correspond to these paths and are triggered and/or
‘salienced’. Each of these sub-networks presents a different molecular
and gene-regulatory mechanism. Following the AML-1 path we observe that
the sub-network guiding to ‘Survival of genes’ is ‘salienced’ (small
shaded area) because of the activation of gene 4792 (following the
respective activated path), which disassociates the NFkB factor (gene);
the disassociation is marked with its own symbol in the figure.
Moreover, the same route guides to the ‘Degradation’ of the whole
apoptotic pathway. Following the AML-2 functional path, a whole
sub-network guiding to ‘Apoptosis’ (large shaded area) is salienced.
ALL-paths: In the upper part of Figure 6 the two identified paths
are shown in both KEGG and HUGO7 gene nomenclature (again, red
and blue colouring is used to denote the “on” and “off” states of
genes, respectively). The bottom part of the figure shows the
discovered underlying ALL-paths regulatory mechanisms. Following
both ALL-1,-2 paths we observe that the sub-network guiding to
‘Apoptosis’ is also ‘salienced’ (shaded area) because of the
expression of genes 840 and 836 (after their activation from gene
842), and the subsequent activation of gene 1676, which in turn
disassociates gene 1677 that (indirectly) guides to ‘Apoptosis’.
Figure 6. The ALL highly differentiating ‘Apoptosis’ paths
From the above findings it should be evident that there are two
different molecular regulatory mechanisms that differentiate between
the AML and ALL leukemia types. Even if both guide to the same
results (Degradation and inhibition of Apoptosis) they follow quite
different paths. The discovered and identified differentiated paths
may be of high value for deciding treatment plans and potential
therapeutic targets in drug design processes. Of course, a more
complete picture could be achieved when more regulatory networks
take part in the analysis and more differentiating paths are identified,
e.g., cell-cycle, p53 (tumour suppressor gene) signalling pathways
etc. Moreover, all the identified leukemia functional paths share a
number of common genes (notice the rectangular area labelled with
‘ALL’ in the shaded area of Figure 5). Even if the shaded sub-networks
are ‘salienced’, the involved genes may be in different
states, with some of them having the same state in both leukemia
types. So, with a standard gene-selection approach these genes could
not be highly ranked and selected as potential gene-markers. It is the
power of their regulation and not the genes themselves that makes
the difference! This is a realization of the already mentioned
paradigm shift: from mining for genes with differential expression,
to mining for subparts of GRN with differential function.
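A toy example may make this point concrete. In the sketch below, the genes "A" and "B", the two classes and all binary values are invented for illustration. Each gene taken alone separates the classes only moderately, while the joint pattern along a hypothetical path A -> B separates them completely.

```python
# Invented binary profiles (1 = "on", 0 = "off") for two hypothetical classes.
class_x = [{"A": 1, "B": 1}, {"A": 1, "B": 1}, {"A": 1, "B": 1}, {"A": 1, "B": 1}]
class_y = [{"A": 1, "B": 0}, {"A": 0, "B": 1}, {"A": 1, "B": 0}, {"A": 0, "B": 1}]


def fraction(samples, pattern):
    """Fraction of samples whose states agree with every (gene, state) in pattern."""
    hits = sum(all(s[g] == v for g, v in pattern.items()) for s in samples)
    return hits / len(samples)


def gap(pattern):
    """Difference between the two classes in the fraction of matching samples."""
    return fraction(class_x, pattern) - fraction(class_y, pattern)


gene_a_gap = gap({"A": 1})          # gene A considered on its own
gene_b_gap = gap({"B": 1})          # gene B considered on its own
path_gap = gap({"A": 1, "B": 1})    # both genes "on" along the assumed path
```

Here each single-gene gap is 0.5, whereas the path-level gap is 1.0, so a gene-by-gene ranking would score both genes well below the path they jointly form.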
5 ‘Apoptosis’ (KEGG); http://www.genome.jp/kegg/pathway/hsa/hsa04210.html
6, 7 HUman Genome Organization (HUGO); http://www.hugo-international.org/
5 CONCLUSIONS
We have presented an integrated methodology for the combined
mining of both GRNs and MA gene-expression profiles. At the heart
of the methodology is the decomposition of GRNs into all possible
functional paths, and the matching of these paths with samples’
gene-expression profiles. An initial implementation of the whole
methodology is provided in a system called MinePath. The
methodology was applied to a well-known gene-expression study
(differentiation between AML and ALL leukemias), where we were
able to identify two distinct ‘Apoptotic’ paths and the underlying
molecular mechanisms that differentiate between the two leukemia
types. The results support the suitability, efficiency and reliability of
the approach, as well as its potential for individualised molecular
medicine.
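The decomposition of a GRN into paths can be sketched as follows. This is only an illustration under stated assumptions: the network is modelled as a directed graph whose edges carry a relation label, a path is taken to be a simple route from a gene with no regulator to a gene with no unvisited target, and the genes A to D and their relations are invented. The decomposition actually used in MinePath may differ, for example by also including sub-paths.

```python
from typing import Dict, List, Tuple

# Hypothetical GRN encoding: regulator -> list of (target, relation).
GRN = Dict[str, List[Tuple[str, str]]]
Step = Tuple[str, str, str]  # (regulator, relation, target)


def decompose(grn: GRN) -> List[List[Step]]:
    """Enumerate all simple paths that start at a gene without regulators
    and end where no unvisited target remains (depth-first search).
    A network made only of cycles has no such start gene and yields no paths."""
    targets = {t for edges in grn.values() for t, _ in edges}
    sources = sorted(gene for gene in grn if gene not in targets)
    paths: List[List[Step]] = []

    def walk(gene: str, visited: frozenset, trail: List[Step]) -> None:
        nexts = [(t, rel) for t, rel in grn.get(gene, []) if t not in visited]
        if not nexts:
            if trail:
                paths.append(list(trail))
            return
        for target, relation in nexts:
            trail.append((gene, relation, target))
            walk(target, visited | {target}, trail)
            trail.pop()

    for source in sources:
        walk(source, frozenset({source}), [])
    return paths


# Invented toy network: A activates B and inhibits C; B and C both activate D.
toy_grn: GRN = {
    "A": [("B", "activation"), ("C", "inhibition")],
    "B": [("D", "activation")],
    "C": [("D", "activation")],
}

toy_paths = decompose(toy_grn)
```

Each enumerated path keeps its relation labels, so a later matching step can check every regulator/target pair against a sample's discretized expression states.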
Among others, our on-going and immediate research includes: (a)
further experimentation with various real-world microarray studies
and different GRN targets (accompanied by the evaluation of
results from molecular biology collaborators); (b) extension of
path-decomposition to multiple GRNs; (c) elaboration of more
sophisticated path/gene-expression sample matching formulas and
operations; (d) incorporation of different gene nomenclatures in
order to cope with microarray experiments from different platforms
and encodings; and (e) porting of the whole methodology to a
Web-Services and workflow environment.
REFERENCES
[1] Nature, ‘Making data dreams come true’ (editorial), Nature,
428(6980), 239 (2004).
[2] J. Bell, ‘Predicting disease using genomics’, Nature, 429, 453-456
(2004).
[3] T. Ideker, T. Galitski and L. Hood, ‘A new approach to decoding
life: systems biology’, Annu Rev Genomics Hum Genet, 2, 343-372
(2001).
[4] F.S. Collins, E.D. Green, A.E. Guttmacher and M.S. Guyer, ‘A
Vision for the Future of Genomics Research’, Nature, 422(6934),
835-847 (2003).
[5] H.F. Friend, ‘How DNA microarrays and expression profiling will
affect clinical practice’, Br Med J., 319, 1-2 (1999).
[6] D.E. Bassett, M.B. Eisen and M.S. Boguski, ‘Gene expression
informatics: it’s all in your mine’, Nature Genetics, 21(Supplement
1), 51-55 (1999).
[7] T.R. Golub et al., ‘Molecular classification of cancer: class discovery
and class prediction by gene expression monitoring’, Science, 286,
531-537 (1999).
[8] L.J. van 't Veer et al., ‘Gene Expression Profiling Predicts Clinical
Outcome of Breast Cancer’, Nature, 415, 530-536 (2002).
[9] M.E. Troyanskaya, M.E. Garber, P.O. Brown, D. Botstein and R.B.
Altman, ‘Nonparametric methods for identifying differentially
expressed genes in microarray data’, Bioinformatics, 18(11),
1454-1461 (2002).
[10] G. Potamias, L. Koumakis and V. Moustakis, ‘Gene Selection via
Discretized Gene-Expression Profiles and Greedy
Feature-Elimination’, LNAI, 3025, 256-266 (2004).
[11] J.M. Bower and H. Bolouri, Computational Modeling of Genetic and
Biochemical Networks, Computational Molecular Biology Series,
MIT Press, 2001.
[12] B. Alberts, A. Johnson, J. Lewis, M. Raff, K. Roberts and P. Walter,
Molecular Biology of the Cell, Garland Science, New York, 2002.
[13] Arkin and J. Ross, ‘Computational functions in biochemical reaction
networks’, Biophys J., 67(2), 560-578 (1994).
[14] S.A. Kauffman, The Origins of Order: Self-Organization and
Selection in Evolution, Oxford Univ. Press, New York, 1993.
[15] N.M. Babu, N.M. Luscombe, L. Aravind, M. Gerstein and S.A.
Teichmann, ‘Structure and evolution of transcriptional regulatory
networks’, Curr. Opin. Struct. Biol., 14, 283-291 (2004).
[16] S. Cook, ‘The complexity of theorem-proving procedures’, Procs 3rd
Ann. ACM Symp. on Theory of Computing, 151-158, 1971.
[17] M.E.J. Newman, ‘The structure and function of complex networks’,
SIAM Review, 45(2), 167-256 (2003).
[18] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii and U.
Alon, ‘Network motifs: Simple building blocks of complex
networks’, Science, 298(5594), 824-827 (2002).
[19] S. Tavazoie, J.D. Hughes, M.J. Campbell, R.J. Cho and G.M.
Church, ‘Systematic determination of genetic network architecture’,
Nature Genetics, 22, 281-285 (1999).
[20] R. Milo, S. Itzkovitz, N. Kashtan, R. Levitt, S. Shen-Orr, I.
Ayzenshtat, M. Sheffer and U. Alon, ‘Superfamilies of evolved and
designed networks’, Science, 303(5663), 1538-1542 (2004).
[21] T. Akutsu, S. Miyano, and S. Kuhara, ‘Identification of genetic
networks from a small number of gene expression patterns under the
Boolean network model’, Pac Symp Biocomput., 17-28 (1999).
[22] S. Imoto, T. Goto and S. Miyano, ‘Estimation of genetic networks
and functional structures between genes by using Bayesian networks
and nonparametric regression’, Pac Symp Biocomput., 175-186
(2002).
[23] S. Kimura et al., ‘Inference of S-system models of genetic networks
using a cooperative co-evolutionary algorithm’, Bioinformatics,
21(7), 1154-1163 (2005).
[24] D. di Bernardo et al., ‘Chemogenomic profiling on a genome-wide
scale using reverse-engineered gene networks’, Nat Biotech, 23(3),
377 (2005).
[25] K.C. Chen, T.Y. Wang, H.H. Tseng, C.Y.F. Huang and C.Y. Kao, ‘A
stochastic differential equation model for quantifying transcriptional
regulatory network in Saccharomyces cerevisiae’, Bioinformatics,
21(12), 2883-2890 (2005).
[26] J.M. Stuart, E. Segal, D. Koller and S.K. Kim, ‘A gene coexpression
network for global discovery of conserved genetic modules’,
Science, 302, 249-255 (2003).
[27] E. Segal et al., ‘Module networks: identifying regulatory modules
and their condition-specific regulators from gene expression data’,
Nat Genet, 34(2), 166 (2003).
[28] J.J. Faith et al., ‘Large-Scale Mapping and Validation of Escherichia
coli Transcriptional Regulation from a Compendium of Expression
Profiles’, PLoS Biology, 5(1):e8 (2007).
[29] Guanrao, P. Larsen, E. Almasri and Y. Dai, ‘Rank-based edge
reconstruction for scale-free genetic regulatory networks’, BMC
Bioinformatics, 9, 75 (2008).
[30] T. Daisuke and P. Horton, ‘Inference of scale-free networks from
gene expression time series’, Journal of Bioinformatics and
Computational Biology, 4(2), 503-514 (2006).
[31] T. Milenkovic, J. Lai and N. Przulj, ‘GraphCrunch: a tool for large
network analyses’, BMC Bioinformatics, 9:70 (2008).
[32] H. Kitano, ‘Robustness from top to bottom’, Nat. Genet., 38, 133
(2006).
[33] H. Kitano, ‘Systems biology: a brief overview’, Science, 295(5560),
1662-1664 (2002).
[34] K. Kwoh and P. Y. Ng, ‘Network analysis approach for biology’,
Cell. Mol. Life Sci., 64, 1739-1751 (2007).
[35] R. Simon, M. D. Radmacher, K. Dobbin and L. M. McShane,
‘Pitfalls in the Use of DNA Microarray Data for Diagnostic
Classification’, Journal of the National Cancer Institute, 95(1),
14-18 (2003).
[36] Ambroise and G. J. McLachlan, ‘Selection bias in gene extraction on
the basis of microarray gene-expression data’, PNAS, 99(10),
6562-6566 (2002).
[37] S. Draghici, S. Sellamuthu and P. Khatri, ‘Babel's tower revisited: a
universal resource for cross-referencing across annotation databases’,
Bioinformatics, 22(23), 2934-2939 (2006).
[38] D.K. Slonim, ‘From pattern to pathways: gene expression data
analysis comes of age’, Nature Genetics, 32, 502-508 (2002).
[39] J. Quackenbush, ‘Computational Analysis of Microarray Data’,
Nature Reviews Genetics, 2, 418-427 (2001).
[40] W. Pan, ‘A comparative review of statistical methods for discovering
differentially expressed genes in replicated microarray experiments’,
Bioinformatics, 18(4), 546-554 (2002).
[41] MTRC: Members of the Toxicogenomics Research Consortium,
‘Standardizing global gene expression analysis between laboratories
and across platforms’, Nature Methods, 2, 351-356 (2005).
[42] Robert et al. ‘Robust interlaboratory reproducibility of a gene
expression signature measurement consistent with the needs of a new
generation of diagnostic tools’, BMC Genomics, 8:148 (2007).
[43] T. Ideker and D. Lauffenburger, ‘Building with a scaffold: emerging
strategies for high- to low-level cellular modeling’, Trends in
Biotechnology, 21(6), 255-262 (2003).
[44] M.A. Hoffman, ‘The genome-enabled electronic medical record’,
Journal of Biomedical Informatics, 40(1), 44-46 (2007).
[45] P. Jares, ‘DNA Microarray Applications in Functional Genomics’,
Ultrastructural Pathology, 30, 209-219, (2006).
[46] R. Jothi, T. M Przytycka and L. Aravind, ‘Discovering functional
linkages and uncharacterized cellular pathways using phylogenetic
profile comparisons: a comprehensive assessment’, BMC
Bioinformatics, 8:173 (2007).
[47] J.R. Quinlan, ‘Induction of decision trees’, Machine Learning, 1,
81-106 (1986).
[48] J.R. Quinlan, ‘Improved Use of Continuous Attributes in C4.5’,
Journal of Artificial Intelligence Research, 4, 77-90 (1996).
[49] U. Fayyad and K. Irani, ‘Multi-interval discretization of
continuous-valued attributes for classification learning’, Procs 13th International
Joint Conference of Artificial Intelligence. Morgan Kaufmann, San
Francisco, CA, 1022-1029, 1993.
[50] J. Li and L. Wong, ‘Identifying good diagnostic gene groups from
gene expression profiles using the concept of emerging patterns’,
Bioinformatics, 18:725-734 (2002).