Master Project-Learning Database
Informatique
Yuanjian Wang Zufferey
Report of Master Project
Learning Database
Professor: Spaccapietra Stefano
Assistant: Fabio Porto
Student: Yuanjian Wang Zufferey
LBD
Page 1
Contents

1. Introduction
2. Related Works
3. Concept Model
   3.1 Project Architecture
   3.2 UML Definition
       3.2.1 Biological Neuron
       3.2.2 Computational Model
       3.2.3 Simulation Model
   3.3 Definition of XML Schema
       3.3.1 LearningDatabase
       3.3.2 BioNeuron
       3.3.3 NeuroModel
       3.3.4 Hypotheses
       3.3.5 Simulations
       3.3.6 Constraints
4. Learning Algorithm (Why)
5. Implementation
   5.1 Architecture of the Oracle Implementation
       5.1.2 Computational Model
       5.1.3 Biological Question
       5.1.4 Simulation
       5.1.5 Definition of Views
       5.1.6 Competitive Learning
6. Queries
7. Performance Analysis
8. Conclusion
9. Reference
1. Introduction
I proposed this project to Prof. Stefano Spaccapietra and Dr. Fabio Porto because I believe that computer science should serve as a tool for one or more application domains, such as chemistry, biology, or finance. As a Master student who wants to specialize in bio-computing, I have taken the corresponding courses, including basic molecular biology and biological computation, and the machine learning courses interested me the most. At the same time, my interest in database technology has never decreased. Combining the two techniques to provide useful services for the biological domain became the first idea for my Master project. In discussions with Dr. Fabio Porto, the need for a well-designed database schema for the Blue Brain Project was suggested. On further consideration, we proposed applying a machine learning algorithm inside the database to automatically cluster unstructured high-dimensional data, such as experiment and simulation results. This gave us the opportunity to design and realize an application system that combines the biological and computational aspects of neuronal science.
With the fast-growing body of knowledge in neuroscience, it has become more and more urgent to find a well-formed storage and retrieval tool to save and share this knowledge among neuroscientists, so that we can take advantage of new discoveries and share the information quickly and reliably.
The basic requirement comes from neuroscience: we need an extensible database schema that can store and retrieve neural information, both biological and mathematical. One group of users is composed of the biologists who work on neuron experiments and try to find the biological functions of neurons. The second group is formed by the mathematicians, physicists, and computer scientists who create and manage the computational models that describe the electrical activity of neuron cells. The advanced requirement is how to make the two groups of scientists understand each other, and furthermore refer to each other's work or aggregate it. As the two groups do not interpret the same biological or simulated data in the same way, we have to find a method to translate or map corresponding terms that have different representations but the same meaning. Biologists are interested in the biological characteristics of neurons, for example the neural network, the type of cell, and the characteristics of a neuron's ion channels, receptors, or transmitters. Physicists, however, are more interested in the potential change on the membrane and abstract each neuron as a unit. A computational model usually simulates part of the biological data, and a biologist may tune his experiment with cheap computer simulations before the real experiment on neural cells. Conversely, a physicist wants to find and test the biological interpretation after he has built a computational model. How do they find the possible hidden similarity between biological data and simulated data?
This project proposes a solution: build two basic database schemas, one for each group, to store the common knowledge of biology and computation, and then build a bridge between the biologists and the physicists or mathematicians. The most common method is to draw curves from the recorded data and to see how well two curves fit each other. But how do we query a mass of data in the database to find such similarity, and in a reliable way?
Here we propose a neural learning database solution. Given the biological neural data and the simulation results, our database can learn the similarity between the two types of data. We supply answers to queries such as:
 Does any computational model have simulation results similar to the given biological experiment results? If yes, show the most similar results and the model in a comparable way (for example, curves and graphs).
 Do any biological results fit the given computational simulation result well? If yes, show the most similar results in a comparable way (for example, curves).
The neural learning database is implemented with a learning ability. It is able to cluster different data series based on the data content itself and certain predefined criteria. At the same time, it can be reinforced by external positive or negative feedback from users. For example, a proposed query result may be strongly confirmed by the user, which augments the correlation of the results; similarly, in the negative case the correlation is reduced. As the size of the learned data grows, the clustering precision improves.
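The idea behind such similarity queries can be sketched as a nearest-neighbour search over equally sampled result curves. The following Python sketch is our own illustration; the function names and the toy data are invented for the example:

```python
import math

def euclidean(a, b):
    """Distance between two equally sampled series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def most_similar(query, stored):
    """Return the stored series (name, samples) closest to the query."""
    return min(stored, key=lambda item: euclidean(query, item[1]))

# Toy database of simulation result curves, sampled at the same points.
simulations = [
    ("tonic", [0.0, 1.0, 0.0, 1.0]),
    ("phasic", [0.0, 1.0, 0.0, 0.0]),
]
name, _ = most_similar([0.1, 0.9, 0.1, 0.9], simulations)
```

In the real system the distance would be computed inside the database over the stored measurement series, and user feedback would then raise or lower the learned correlation of the returned result.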
2. Related works
ModelDB [11] is an example of a common knowledge database that includes both the biological and the computational aspects of neuronal science. It supplies computational models classified by neuronal composition. The advantage of ModelDB is that we can find computational models in various formats, such as NEURON, Java, MatLab, C++, etc. The inconvenience is that we cannot easily compare the different formats directly. The definitions of the computational models are not easy to read (equations, definitions of variables, parameters, etc.), the connection between biological structure and model application cannot be found, and we cannot search for similar simulation results across different computational models.
Traditional database management systems aim to classify data by certain predefined criteria. We usually need to know the data structure well in advance, and we have to carefully define the structures and the detailed procedures that abstract the data for storage and classification.
Semantic web applications use well-defined conceptual schemas to supply annotations (knowledge markup technologies) that can be recognized by semantic analysis tools, so that web contents can be classified based on those annotations. Built on very carefully designed schemas, ontology technology can mine valuable, high-quality ontological resources. Obviously, multimedia data without well-defined annotations cannot be mined by an ontology.
Recently, learning machines have been widely used in all kinds of domains, such as image recognition, language processing, syntactic pattern recognition, search engines, medical diagnosis, bioinformatics, and cheminformatics. Machine learning and clustering in information retrieval systems can be applied to categorize content-based results or to rank them more meaningfully.
As a broad subfield of artificial intelligence, machine learning is concerned with the design and development of algorithms and techniques that allow computers to "learn". The major focus of machine learning research is to extract information from data automatically, by computational and statistical methods. Machine learning is therefore closely related not only to data mining and statistics, but also to theoretical computer science. Generally, there are three kinds of algorithms: supervised learning, unsupervised learning, and reinforcement learning.
Supervised learning aims to solve the classification problem. It learns the behavior of a function that maps a vector [X1, X2, …, XN] into one of several classes, by fitting several input-output examples of the function. The problem with applying supervised learning is that we have to know the possible classes in advance and find typical examples for training.
An unsupervised learning agent models a set of inputs: classes and typical examples are not available. The most common form is clustering, which is sometimes not probabilistic. The number of clusters adapts to the problem size, and the user can control the degree of similarity between members of the same cluster by means of a user-defined constant called the vigilance parameter.
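A minimal sketch of such vigilance-controlled clustering (our own one-dimensional illustration; the function name and threshold semantics are assumptions, not a specific published algorithm):

```python
def cluster_with_vigilance(points, vigilance):
    """Greedy one-pass clustering: a point joins the nearest existing
    cluster centre if it lies within the vigilance radius; otherwise
    it starts a new cluster. The number of clusters is not fixed."""
    centres = []
    labels = []
    for p in points:
        if centres:
            d, i = min((abs(p - c), i) for i, c in enumerate(centres))
            if d <= vigilance:
                labels.append(i)
                continue
        centres.append(p)             # open a new cluster at this point
        labels.append(len(centres) - 1)
    return centres, labels
```

A small vigilance yields many tight clusters; a large vigilance merges everything into a few loose ones, which is exactly the degree of control the vigilance parameter gives the user.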
Reinforcement learning is concerned with how an agent ought to take actions in an environment so as to maximize some notion of long-term reward. It is studied in the domain of real-time systems.
Learning algorithms, especially the supervised and unsupervised ones, give a database the possibility to learn in advance (supervised) or to learn progressively (unsupervised). As in this project we cannot get suitable examples to train the database with a supervised algorithm, we have to look at the unsupervised algorithms.
In the following sections, we first propose an object-oriented schema in UML and XML representations (Section 3). In Section 4 we describe the common clustering algorithms. Then we introduce the detailed design (implementation) in Oracle in Section 5. In Section 6 we describe the queries the system supports. The performance of the competitive learning (unsupervised) implementation is given in Section 7. We conclude the project in Section 8.
3. Concept Model
The fundamental work of this project is to design a well-formed data structure to store the necessary information. It not only supplies the storage base but also formalizes some workflows. The storage base is composed of two parts: neuronal biological information and computational model information. The workflows concern the procedure of constructing a computational model and the simulation procedure.
3.1 Project Architecture
Figure - Project architecture: Biological Neurons and Computational Models are defined in an XML or UML object-oriented layer over a relational Knowledge Database and Learning Database. The system supports (1) Biological Queries, (2) Computational Queries, (3) Experiment or Simulation Queries, and (4) Data Cluster Queries, while (5) the Learning Process clusters the biological experiments and computational simulations.
In Figure- we can see the architecture of this project. Generally speaking, the project is defined at two levels: the UML or XML object-oriented class definitions and the relational database definition.
In the UML or XML object-oriented definition, we take the users' point of view and specify two groups of users: biologists in neuroscience, and scientists working in mathematics, physics, and computer science. Thus two types of information are modeled, as Biological Neurons and Computational Models. The biologists can retrieve the biological information about neurons (1. Biological Queries). Similarly, the other group of scientists can query the computational model information about neurons (2. Computational Queries). Derived information, such as the biological experiments and the computational simulations, is also available. From the derived information, biologists or other scientists can not only search the experiments or simulations (3. Experiment or Simulation Queries), but also run similarity queries, that is, find the cluster information for a given experiment or simulation result (4. Data Cluster Queries).
In the relational database definition, we implement the object-oriented definition of UML or XML in a relational way. We distinguish two kinds of implementation: one is the Knowledge Database, which stores the biological neural information (the neuron information and the experiment information) and the computational neural information (the computational models and the simulation information); the other is the Learning Database, which learns from the simulation and experiment results to find the similarity between them and form the clusters (5. Learning Process). The biologists or other scientists can then query for similar experiment or simulation results.
In the following subsections we introduce the concept model through the two common representations of the object-oriented model, UML and XML. In this step of the work, some of the terminology refers to ModelDB and some to the Wikipedia definitions.
3.2 UML Definition
3.2.1 Biological Neuron
Firstly, we have to understand what a neural cell is. Neurons [3] are electrically excitable cells in the nervous system that process and transmit information. Neurons are the core components of the brain, the spinal cord in vertebrates, the ventral nerve cord in invertebrates, and the peripheral nerves. Neurons are typically composed of a soma, or cell body, a dendritic tree, and an axon (Figure-). The majority of vertebrate neurons receive input on the cell body and dendritic tree, and transmit output via the axon. Neurons communicate via chemical and electrical synapses, in a process known as synaptic transmission. The fundamental process that triggers synaptic transmission is the action potential, a propagating electrical signal that is generated by exploiting the electrically excitable membrane of the neuron. Together, the electrical properties of the ion channels and receptors on the membrane of a neuron decide the electrical properties of the neuron (Figure-). Neurotransmitters are chemicals that are used to relay, amplify, and modulate signals between a neuron and another cell.
Figure-1: A typical neural cell
Figure-2: Receptors and ion channels
A biological neuron [2] can be composed of multiple compartments: the soma, the axon (an axon can be composed of hillock, stem, and terminal sub-compartments), and the dendrites (a dendrite can be composed of proximal, middle, and distal sub-compartments). Additional compartments could be added in the future. Each compartment has some properties on its membrane, such as the neurotransmitter receptor type (for example: ionotropic receptor, metabotropic receptor), the ion channels (for example: Na+, K+, Ca2+), or the transmitter type (for example: acetylcholine, the biogenic amines, the amino acid transmitters, etc.). For each property of the compartment we may be interested in the states of some measurement, such as the membrane potential, the membrane capacitance, the conduction velocity of the axon, etc.
1 Figure from http://www.cs.nott.ac.uk/
2 Figure from http://www.neuropsychopathologie.fr
For each neuron, we are interested in where it is (organ), which category it belongs to (classification), what its functions are, and other molecular information, such as the coding gene, microscope images for visualization, and the experiments that probe its dynamic characteristics.
Figure-
A concrete instance of a neuron is shown in Figure-. Depending on the classification criterion, a neuron can belong to different neuron classes. An organ is covered by numerous neurons. Each neuron may be composed of different compartments that carry special properties.
Figure-
For example, a pyramidal cell [5] (or pyramidal neuron, or projection neuron) is a multipolar neuron (Nueron_Classification) located in the hippocampus and cerebral cortex (OrganInstance). These cells have a triangularly shaped soma, or cell body, a single apical dendrite extending towards the pial surface, multiple basal dendrites, and a single axon (Compartment). The K+ channels (ElectricalProperty) on the dendrites of pyramidal cells are often studied.
3.2.2 Computational Model
Let us look at an example of a computational model (taken from ModelDB [11]): the Simple Model of Spiking Neurons [6], which combines the biologically plausible Hodgkin-Huxley-type dynamics and the computational efficiency of integrate-and-fire neurons. The two equations of this model are:

  v' = 0.04v^2 + 5v + 140 - u + I
  u' = a(bv - u)

with the auxiliary after-spike resetting:

  if v >= 30 mV, then v <- c and u <- u + d.

There are three variables:
  v: the membrane potential of the neuron;
  u: the membrane recovery variable, which accounts for the activation of K+ ionic currents and the inactivation of Na+ ionic currents;
  I: the delivered synaptic currents or injected DC currents.
And four parameters are defined:
  a: the time scale of the recovery variable u;
  b: the sensitivity of the recovery variable u to the membrane potential v;
  c: the after-spike reset value of the membrane potential, caused by the fast high-threshold K+ conductances;
  d: the after-spike reset of the recovery variable, caused by the slow high-threshold Na+ and K+ conductances.
We can easily define a computational model by its equations and the variables and parameters included in those equations (Figure-). For each variable and equation, some biological explanation may be supplied (as in our example). This model reads I (ReadInterface) and produces v (WriteInterface) at each time step. It is a model based on the Hodgkin-Huxley-type dynamics and the integrate-and-fire hypotheses. The references of this model can be found in the paper [6]. There we can also find information about the biological question that it addresses: the spiking and bursting behavior of known types of cortical neurons.
Figure-
3.2.3 Simulation Model
Figure-
Simulations are realizations of computational models in some programming language, such as MatLab or NEURON. The same realization of a computational model with different simulation conditions (different start conditions, parameter settings, and stop conditions) will produce different results. A simulation can include more than one simulation element. A simulation element can be a neural cell, a compartment of a neural cell, or an electrical property, bound to a computational model. These elements can be connected to form a neural network or a detailed neural cell. In each simulation, we may discover different behaviors.
For example, we can bind a neural cell to the computational model described in 3.2.2 (Simple Model of Spiking Neurons). With different initial conditions and parameter settings, we get different simulation results (Figure-):
Figure- (A) a=0.02, b=0.2, c=-65, d=6, I=14: tonic spiking
        (B) a=0.02, b=0.25, c=-65, d=6, I=0.5: phasic spiking
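For a concrete feeling of how such a simulation behaves, the model can be integrated with a simple forward-Euler scheme. The sketch below is our own illustration (the function name, the time step, and the initial values v = -65, u = bv are our assumptions, following common practice for this model), using the tonic-spiking parameter set (A):

```python
def izhikevich(a, b, c, d, I, t_max=1000.0, dt=0.25):
    """Forward-Euler simulation of the Simple Model of Spiking Neurons;
    returns the membrane-potential trace and the spike times."""
    v, u = -65.0, b * -65.0          # assumed resting-state initialisation
    trace, spikes = [], []
    for k in range(int(t_max / dt)):
        v += dt * (0.04 * v * v + 5 * v + 140 - u + I)
        u += dt * a * (b * v - u)
        if v >= 30.0:                # auxiliary after-spike resetting
            trace.append(30.0)
            spikes.append(k * dt)
            v, u = c, u + d
        else:
            trace.append(v)
    return trace, spikes

# Parameter set (A) from the figure: tonic spiking
trace, spike_times = izhikevich(a=0.02, b=0.2, c=-65.0, d=6.0, I=14.0)
```

Rerunning with parameter set (B) and a small current would illustrate phasic spiking: the different parameter settings alone change the observed behavior, which is exactly what the Simulation element records.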
3.3 Definition of XML Schema
Now we introduce the concept model in its XML representation (XML schema), from the top definition down to the bottom.
3.3.1 LearningDatabase
The top of the schema is the LearningDatabase element (Figure-). It is composed of the BioNeurons, NeuronModels, Hypotheses, Simulations, and References elements.
Figure-
BioNeurons, the biological neurons, is a collection of BioNeuron elements. NeuronModels is a collection of the neural computational models. Hypotheses is a collection of hypotheses proposed by different authors and referred to in computational models. Simulations is a collection of simulations, each of which simulates one computational model or a combination of several. The References element is the collection of references that may be cited in the computational models. We will explain the Constraints in Section 3.3.6.
3.3.2 BioNeuron
BioNeuron (Figure-) is an element that represents an individual neuron cell. It is identified by its ID, name, canonical form, and the organism it is situated in. To describe its molecular information, we can use Interoperation in the form of Gene_Chromosome, Microscopy_Data, or Experimental_Data. An external resource location, such as an image file, can be saved for each sub-element of Interoperation (the Resource attribute records this location).
A neuron cell is composed of different compartments such as somas, axons, or dendrites. Certain electronic properties (for example channels, receptors, or transmitters) can be bound to the membrane of a compartment.
For each neural cell, more than one experiment may be executed, and in each experiment a measurement of the electronic properties may be taken at each time step (named states) during the whole execution time. We record the unit, name, and value of the measurement. It is important to have this information so that the clustering algorithm can later cluster the homologous data. More generally, we can save the location of a result file in the value element of one states element, instead of saving each value in its own element. Additional information or detailed descriptions can also be saved.
Figure –
3.3.3 NeuroModel
Figure - (a), (b), (c)
Each variable (Figure- (b)) or parameter (Figure- (c)) has a unique identity (ID) within one computational model, a name, a unit, and the symbol (for example α, β, x, y, etc.) used to represent it in the equations.
Figure -
An equation (Figure-), with a unique ID within one computational model, is composed of variables and parameters. The mathematical expression of the equation is saved. Each equation may describe a functionality of one region of an organism, of one cell, of one compartment, or of an electronic property. In one computational model, we may define more than one group of equations that have similar behavior.
For retrieval convenience, we are interested in the characteristics of a computational model, such as the biological question it replies to (Figure- (a): BiologicalQuestion), the references it refers to (Figure- (b)), the hypotheses it is based on, and the keywords it mentions. A reference can be a published paper (PaperReference), a theory from a scientist (TheoryFromPerson), or a book.
(a) (b)
Figure -
To make a biological question precise, we can describe the research area it belongs to, the kind of topics it talks about, or a reference it uses. This computational model may give an important contribution to replying to the question. The specialties (Feature) of this computational model in replying to the biological question, and the conclusions we can draw from it, are important for other people to refer to.
To formalize the terminology of research areas (Figure- (a)) and topics (Figure- (b)), we referred to the Wikipedia classification of neuroscience.
(a) (b)
Figure -
For a single neuron, the compartments may behave differently, so that different computational models may apply to different compartments or to the electronic properties of a compartment. To communicate between different models, we can define read or write variables that read input from, or write values out to, the variables of other models (Figure-).
Figure -
In a neural network, the connections between different neural cells with different computational models behave in the same way (Figure-).
For each computational model, we may have resource files such as program code (MatLab, Java, NEURON, C++, etc.), reference files, and other necessary files, which can be treated as additional files zipped into one file. The location of this zip file is given by the 'resource' attribute of the AdditionalFiles element.
3.3.4 Hypotheses
Figure –
Hypotheses (Figure-) is a collection of Hypothesis elements. A hypothesis has to be proposed by someone and has a corresponding statement. It may have some relationship with a region of an organism, with some cell, or with a compartment of a neural cell. It has a unique identity (ID) at the LearningDatabase scale.
3.3.5 Simulations
Figure –
The Simulations element is the collection of simulations. A Simulation (Figure-) can be configured with a SimulationEnvironment (for example MatLab, C++, or Java), and the corresponding program code is saved in SimulationResource (location information).
A simulation may include more than one computational model at a time (Figure-). For example, given two computational models that define separately the electronic properties of the Na+ channel and the K+ channel, a simulation may include one neural cell with three compartments: soma, axon, and dendrite. The computational models of the Na+ channel and the K+ channel can then be applied, or bound, to the membrane of each of the three compartments, but with different initial conditions (InitialConditions), stop conditions (StopConditions), and parameter settings (Assignments) for their variables. The Connections element (Figure-) records the application of such a binding. As a result, the simulation results (Figure-) of the same computational model bound to different compartments will be different.
Figure – (needs to be modified)
Figure –
Figure –
The representation of the simulation results can be graphs (curves) or tables, in which each axis or column holds the values of one variable during one simulation. Being limited to 3D graphs, the Graph element may have X, Y, and Z axes. Each axis saves its variable (referring to the ID of the variable in the computational model) and all of its values, as a list or as an external resource such as a txt or xml file. The Tables element is the collection of all the TableData elements, each of which saves one variable and its values for each step of the simulation, as a list or an external file.
Figure –
If we discover some abnormal phenomena or new observations in a simulation, we can record them in a Discovery element (Figure-). The cell type and the region of the organism can serve as retrieval information. The observed results are recorded in Observation elements, with the detailed compartment, electronic property, description, and measurement.
Figure –
Connections serves two functions. In a single-neural-cell simulation with multiple computational models bound to different compartments, it can record the connection between a compartment or property (ConnectTo element) and a computational model (From element), the connection between compartments, or the connection between a compartment and a property. In a neural network, it can serve as the connection between different neural cells.
3.3.6 Constraints
We list the constraints, including the key definitions in Table- and the key reference definitions in Table-:

Key             Scale             Selection                                        Field
NeuronID        Learningdatabase  ./BioNeurons/BioNeuron                           ID
NeuronModelID   Learningdatabase  ./NeuronModels/NeuronModel                       ID
PropertyID      Learningdatabase  ./Bioneurons/Properties/Property                 @ID
CompartmentID   Learningdatabase  ./Bioneurons/Compartments/Compartment            @ID
HypothesisID    Learningdatabase  ./Hypotheses/Hypothesis                          ID
SimulationID    Learningdatabase  ./Simulations/Simulation                         @ID
ReferenceID     Learningdatabase  ./References/Reference                           @ID
ParameterID     NeuronModel       Parameter                                        ID
VariableID      NeuronModel       Variable                                         ID
EquationID      NeuronModel       Equation                                         ID

Table–

KeyRef          refer          selector                                            field
HypothesisRef   HypothesisID   NeuronModels/NeuronModel/BiologicalQuestion         Hypothesis
NeuronModelRef  NeuronModelID  Simulations/Simulation/NeuronModels/NeuronModel     @ID
CompartmentRef  CompartmentID  NeuronModels/NeuroModel/* | Simulations/*
                               | BioNeurons/BioNeuron/Compartments                 Compartment
PropertyRef     PropertyID     NeuronModels/NeuroModel/* | Simulatioins/*
                               | BioNeurons/Compartments/*
                               | BioNeurons/BioNeuron/Compartments/Compartment     Property

Table–
4. Learning algorithm (why)
In this project we are trying to simulate a function of the brain by using a database. The brain not only serves as a database that stores information, but also has the capacity to learn from the information it has met. A learning database, as we name it, is firstly a database that stores information for neuroscience; secondly, it learns from the information it has stored.
People learn in different ways under different conditions. The case where we have a teacher to give the right answer is the easiest, and we usually name it supervised learning. In the cases where we have to find the answers by ourselves, if we make a mistake or get the right answer we may receive a negative or positive compensation; this is the case of reinforcement learning (for example, learning to ride a bicycle). Without such feedback, we can only rely on some intuition. Problems such as classification and clustering belong to unsupervised learning: we can only try to put similar things together and guess how many classes or clusters may exist.
How do we define the similarity for a class or a cluster? We have to give a formal definition, such as the distance between the center of the cluster and the object that we want to classify. But what is the center of the cluster? How can we find the center, and what is the procedure or algorithm to find it? Such questions have to be answered before we can really learn something.
Firstly, let us look at a simple example in the following figure (Figure-):
Figure –
In this case we can easily identify by eye the 4 clusters into which the data can be divided. But how can the computer distinguish them? What is the criterion by which the computer puts one point into one of the clusters?
A more difficult example, as in our case: we want to identify whether the experiment or simulation results in our database can be classified into any of the curves shown in Figure- (the data samples for learning are created in MatLab by defining different combinations of elementary functions: exponentials, logarithms, and trigonometric functions). Worse, we do not know in advance which kinds of typical curves we could have. How can we cluster the similar curves together?
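For illustration, similar sample curves can be generated from elementary functions. The following Python sketch is our own; the particular combinations are invented for the example and do not reproduce the original MatLab data:

```python
import math

def sample_curve(f, n=100, x_max=10.0):
    """Sample a function at n evenly spaced points on [0, x_max]."""
    return [f(i * x_max / (n - 1)) for i in range(n)]

# A few curve families built from elementary functions (our choices).
families = {
    "decay":       lambda x: math.exp(-0.5 * x),
    "growth":      lambda x: math.log(1.0 + x),
    "oscillation": lambda x: math.sin(2.0 * x),
    "damped":      lambda x: math.exp(-0.3 * x) * math.cos(3.0 * x),
}
samples = {name: sample_curve(f) for name, f in families.items()}
```

Each generated series plays the role of one experiment or simulation result; the clustering algorithm then has to group the series by shape without knowing the families in advance.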
Figure –
In this project, the learning algorithm works as a clustering machine whose function is to find the cluster of each experiment or simulation result and to supply the possibility of retrieving the similar results when given a sample.
There are many unsupervised learning algorithms for the clustering problem, such as k-means competitive learning [7], Kohonen competitive learning, Fuzzy C-Means competitive learning, and hierarchical clustering algorithms.
 K-means CL (MacQueen, 1967): one of the simplest unsupervised learning algorithms that
solve the well-known clustering problem. The standard k-means algorithm calculates the
distance between the input vector and each centre vector (prototype) (Figure –).
Figure –
The distance is usually defined as the Euclidean norm:
d_j = \| x - w_j \| = \sqrt{ \sum_{i=1}^{D} (x_i - w_{ji})^2 }, \quad j = 1, \ldots, N \qquad (4)
where x is the input vector, w_j is the prototype vector, D is the vectors' dimension, and
N is the number of prototypes. The prototype with the minimum distance is named the
winner:
c = \arg\min_{1 \le j \le N} \| x - w_j \| \qquad (5)
The winner prototype is updated by a reduction of the learning rate towards the input:
w_c(t+1) = w_c(t) + \eta(t) (x - w_c(t)) \qquad (6)
with \eta(t) = 1 / n_c, where n_c is the number of inputs assigned so far to the winner's
cluster. This reduction of the learning rate makes each prototype vector the mean of all cases
assigned to its cluster and guarantees convergence of the algorithm to an optimum value of
the error function:
E = \sum_{j=1}^{N} \sum_{x \in C_j} \| x - w_j \|^2 \qquad (7)
where x \in C_j is an input vector classified in cluster j and w_j is the prototype of cluster j.
The algorithm is as follows (Figure –):
1. Start: initialize the prototypes.
2. Competition to find the winners for all inputs.
3. Update the winner prototypes.
4. Competition to find the winners for all inputs; if the convergence criterion is satisfied,
end, otherwise go back to step 3.
Figure –
The convergence criterion is defined by the percentage of changed winners over all inputs.
The classical k-means algorithm has the “dead units” problem: some prototypes may never
win the competition and therefore are never updated, so these “dead units” cannot really
represent prototypes. Furthermore, we need to know the exact number of clusters k before
performing the data clustering; otherwise the clustering performance will be poor. The
resulting clusters depend on the initial random assignments: the algorithm minimizes
intra-cluster variance, but does not ensure that the result is a global minimum of the
variance.
The time consumption is O(N^2), where N is the total number of inputs.
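As an illustration only (the project itself implements this in PL/SQL, section 5), the k-means competitive learning loop above, with MacQueen-style learning-rate reduction and the changed-winner convergence criterion, can be sketched in Python; the helper name kmeans_cl is ours, not the project's:

```python
import random

def kmeans_cl(inputs, k, max_iter=100, tol=0.0):
    """Batch k-means competitive learning.
    inputs: list of equal-length vectors; k: number of prototypes.
    Converges when the fraction of inputs whose winner changed <= tol."""
    protos = [list(v) for v in random.sample(inputs, k)]  # init from inputs
    counts = [1] * k                  # cases per cluster (the seed counts as one)
    winners = [None] * len(inputs)
    for _ in range(max_iter):
        changed = 0
        for idx, x in enumerate(inputs):
            # competition: nearest prototype in the Euclidean norm (eqs. 4-5)
            c = min(range(k), key=lambda j: sum((xi - wi) ** 2
                                                for xi, wi in zip(x, protos[j])))
            if winners[idx] != c:
                changed += 1
                winners[idx] = c
            # winner update with reduced learning rate eta = 1/n_c (eq. 6)
            counts[c] += 1
            eta = 1.0 / counts[c]
            protos[c] = [wi + eta * (xi - wi) for xi, wi in zip(x, protos[c])]
        if changed / len(inputs) <= tol:  # percentage of changed winners
            break
    return protos, winners
```

With a fixed seed, random.sample gives a reproducible prototype initialization; nothing here prevents dead units, which is exactly the limitation discussed above.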
 Kohonen Competitive Learning (Kohonen, 1995/1997; Hecht-Nielsen, 1990): one of the
“Kohonen networks”, the Vector Quantization competitive network, can be viewed as an
unsupervised algorithm that is closely related to k-means cluster analysis. The prototype
vector is moved a certain proportion of the distance between it and the training case, the
proportion being specified by the learning rate, that is:
w_c(t+1) = w_c(t) + \eta (x - w_c(t)) \qquad (8)
Kohonen’s learning law with a fixed learning rate does not converge. As is well known
from stochastic approximation theory, convergence requires the sum of the infinite
sequence of learning rates to be infinite, while the sum of the squared learning rates must be
finite (Kohonen, 1995, p. 34). In this case, the learning rate has to be reduced in a
suitable manner; these requirements are satisfied by MacQueen’s k-means algorithm.
The prototypes are randomly initialized from the input vector values. The algorithm is
defined as follows (Figure –):
1. Start: initialize the prototypes.
2. Competition to find the winner for one input.
3. Update the winner prototype.
4. If the whole set of inputs has not yet been processed, go back to step 2; otherwise end.
Figure –
The main advantages of this algorithm are its simplicity and speed, which allow it to run
on large datasets. But as with the k-means algorithm, the clustering results of Kohonen
Competitive Learning depend on the initialization of the prototypes and may produce
“dead units”. The time consumption is O(N·M), where N is the total number of inputs and
M is the total number of clusters.
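A single pass of this online update (eq. 8, fixed learning rate) might look as follows; again an illustrative Python sketch, not the project code:

```python
def kohonen_pass(inputs, protos, eta=0.1):
    """One pass of Kohonen competitive learning: for each input, move the
    winning prototype a fixed proportion eta of its distance to the input."""
    for x in inputs:
        c = min(range(len(protos)),
                key=lambda j: sum((xi - wi) ** 2 for xi, wi in zip(x, protos[j])))
        protos[c] = [wi + eta * (xi - wi) for xi, wi in zip(x, protos[c])]
    return protos
```

Repeated passes with a suitably decreasing eta recover MacQueen's convergent scheme, which is why the two algorithms combine well in this project.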
 Fuzzy c-means (FCM): a method of clustering which allows one piece of data to belong
to two or more clusters. This method (developed by Dunn in 1973 and improved by
Bezdek in 1981) is frequently used in pattern recognition.
In fuzzy clustering, each point has a degree of belonging to the clusters, as in fuzzy logic,
rather than belonging completely to just one cluster. Thus, points on the edge of a cluster
may belong to it to a lesser degree than points in the center of the cluster. For each point
x we have a coefficient u_k(x) giving the degree of its membership in the k-th cluster. Usually,
the sum of those coefficients is defined to be 1:
\sum_k u_k(x) = 1
The fuzzy c-means algorithm is very similar to the k-means algorithm:
1. Choose a number of clusters.
2. Assign coefficients randomly to each point for being in the clusters.
3. Repeat until the algorithm has converged (that is, the coefficients’ change
between two iterations is no more than ε, the given sensitivity threshold):
a. Compute the centroid for each cluster.
b. For each point, compute its coefficients of being in the clusters.
The algorithm minimizes intra-cluster variance as well, but has the same problems as
k-means: the minimum is a local minimum, and the results depend on the initial choice of
weights. The Expectation-Maximization algorithm is a more statistically formalized
method which includes some of these ideas, such as partial membership in classes. It has better
convergence properties and is in general preferred to fuzzy c-means.
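The steps above can be sketched as follows (illustrative Python for scalar data; the function name and the 1e-12 guard against zero distances are our additions):

```python
import random

def fuzzy_c_means(xs, c, m=2.0, eps=1e-4, max_iter=100):
    """Fuzzy c-means on a list of scalar data points xs (vectors generalize
    in the obvious way). Returns the centroids and memberships u[k][i]."""
    n = len(xs)
    # step 2: random initial coefficients, normalized so each point sums to 1
    u = [[random.random() for _ in range(n)] for _ in range(c)]
    for i in range(n):
        s = sum(u[k][i] for k in range(c))
        for k in range(c):
            u[k][i] /= s
    for _ in range(max_iter):
        # step 3a: centroid of each cluster, weighted by u^m
        centers = [sum((u[k][i] ** m) * xs[i] for i in range(n)) /
                   sum(u[k][i] ** m for i in range(n)) for k in range(c)]
        # step 3b: new coefficients from relative distances to the centroids
        new_u = [[0.0] * n for _ in range(c)]
        for i in range(n):
            d = [abs(xs[i] - ck) + 1e-12 for ck in centers]  # avoid div by 0
            for k in range(c):
                new_u[k][i] = 1.0 / sum((d[k] / d[j]) ** (2.0 / (m - 1.0))
                                        for j in range(c))
        change = max(abs(new_u[k][i] - u[k][i])
                     for k in range(c) for i in range(n))
        u = new_u
        if change <= eps:  # coefficients moved less than the threshold ε
            break
    return centers, u
```

The membership update guarantees by construction that each point's coefficients still sum to 1 after every iteration.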
 Hierarchical Clustering Algorithms: given a set of N items to be clustered and an N×N
distance (or similarity) matrix, the basic process of hierarchical clustering (defined by S.C.
Johnson in 1967) is this:
1. Start by assigning each item to a cluster, so that if you have N items, you now
have N clusters, each containing just one item. Let the distances (similarities)
between the clusters be the same as the distances (similarities) between the items
they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single
cluster, so that you now have one cluster less.
3. Compute the distances (similarities) between the new cluster and each of the old
clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
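Johnson's procedure can be sketched with a single-linkage cluster distance (an assumption here — the text does not fix the linkage rule):

```python
def hierarchical(dist):
    """Agglomerative clustering from an N*N distance matrix (steps 1-3 above).
    Single linkage: cluster distance = min item distance. Returns the merge log."""
    clusters = [[i] for i in range(len(dist))]   # step 1: one item per cluster
    merges = []
    while len(clusters) > 1:
        # step 2: find the closest pair of clusters
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a] + clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]  # merge; step 3 is implicit,
        del clusters[b]                          # distances recomputed each round
    return merges
```

The merge log is the dendrogram: reading it top-down at any distance threshold yields a flat clustering.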
Why did we choose the competitive learning algorithm?
We chose the Kohonen competitive learning algorithm for the first pass (initialization) and the
K-means competitive learning algorithm afterwards for three principal reasons:
 First, the simplicity of the competitive learning algorithms is a big advantage.
 Second, hierarchical clustering methods are unsuitable for isolating spherical or poorly
separated clusters.
 Last, the K-means competitive learning algorithm can take a long time to converge to a
solution, depending on the appropriateness of the reallocation criteria to the structure of
the data; however, the structure of our data is unknown.
To reduce the number of final “dead units” and improve the performance of the initialization of
the prototypes, we have adopted some strategies, discussed in section 5
(Implementation).
5. Implementation
This project is implemented in Oracle using the PL/SQL language.
5.1 Architecture of the Oracle Implementation
Figure – (architecture: Biological Neuron Tables, Biological Experiment Tables, Computational
Model Tables, Computational Simulation Tables; 1. Abstract Data: Trigger Definition; Queries’
Views Definition; Learning Database Tables; 2. Learning Procedures and Function Definition)
The architecture of the Oracle implementation is shown in Figure –. There are three basic groups:
knowledge about the biological neuron and the computational model, the experiments and
simulations, and the learning database. To supply the corresponding views as defined in the
Object-Oriented definition, we defined the queries’ views. To automatically abstract the
experiment and simulation results, we defined triggers on the corresponding tables to import
external data into the learning database tables. Learning procedures and functions are defined to
execute the competitive learning algorithm to find clusters.
In the following subsections, we introduce the corresponding definitions for each part.
5.1.1 Biological Neuron
Figure –
There are 9 tables that save the biological neuron information (Figure-):
 tb_organism: The table that saves the information about the biological organisms. It
includes the name, description, and identity (id) information.
 tb_neuroncell: Biological neural cell information is saved in this table. It includes the name,
canonical form (canonicalform), cell identity (cell_id), situated organism identity (id) and
description of cell type (classification of neural cell: cell_type). The canonical form
describes a standard way of presenting that cell. The interoperation of one neural cell is
saved in table tb_interoperation. tb_compartmentcomposition saves its composition of
compartments.
 tb_compartment: It saves the possible compartments of one neural cell, for example the
somas, axons, or dendrites. A compartment is defined by its identity (id), name, and type. Its
properties are saved in tb_compartmentcomposition.
 tb_property: General properties, such as the electrical properties of ion channels or of
receptors on a compartment’s membrane, are described in tb_property. Each property has its
identity (id), type, name, and description information.
 tb_compartmentcomposition: It builds the connection between compartments and
properties of one neural cell. A compartment of one neural cell may have more than one
property definition. A general property may be included in more than one compartment
of a neural cell.
 tb_experiment: A biological experiment is recorded and identified by experimentid. The
executor, description, and execution date (experimentdate) are saved.
 tb_experiment_result: For each experiment, we may record more than one measurement.
A measurement may be taken on a property of a compartment of one cell, so we save
the experiment identity (tb_experiment_experimentid), cell identity (cellid), compartment
identity (compartmentid), and property identity (propertyid). For each measurement we
record its name, unit, and size (datasize). All values of one measurement are found in an
external file (datalocation).
 tb_interoperation: The microscopy data and gene chromosome information of one neural cell
can be saved in external files; we can find them by the file location of each
(microscopydata, genechromosome).
 tb_observation: The table that saves the observations recorded in one experiment.
This table can also contain observations of a simulation of computational models. The
source_type decides whether it is an observation of an experiment or of a simulation;
sim_or_exp_id saves the simulationid or experimentid. Each observation is unique by its id,
and its description is saved.
5.1.2 Computational Model
Figure –
There are 9 tables that save the computational model information (Figure-):
 tb_neuronmodel: It saves the basic information about the identity (modelid) and name of
the model, its description, author, and the time when it was built. If there are external files
about the model, they can be found via the additionalfile field, which saves the file location.
The definitions of equations, variables, and parameters are saved in the tables tb_equation,
tb_parameter, and tb_variable. More information for retrieval, such as keywords, the
concerned biological questions, and the references it has used, is saved in the tables
tb_keywords, tb_biologicalquestion, and tb_refered.
 tb_variable: A variable may bind with some biological information such as the neural cell
(cellid), compartment of neural cell (compartmentid), property of compartment
(propertyid). Or it may signify some region of organism or some other biological
LBD
Page 27
Master Project-Learning Database







5.1.3
Informatique
Yuanjian Wang Zufferey
information (divers). A variable is represented by its symbol and name. It may have a unit. It
belongs to a computational model (modelid) and is identified by its id in the model.
 tb_parameter: A parameter may bind with some biological information such as the neural
cell (cellid), a compartment of the neural cell (compartmentid), or a property of the
compartment (propertyid); or it may signify some region of the organism or some other
biological information (divers). A parameter is represented by its symbol and name. It may
have a unit. It belongs to a computational model (modelid) and is identified by its id in the model.
 tb_equation: An equation is an expression of variables and parameters. Apart from its
biological meaning, it is identified by its id in a model and has an expression. Each
equation may belong to a different equation group.
 tb_variable_member: It saves, for one equation, the membership of variables.
 tb_parameter_member: It saves, for one equation, the membership of parameters.
 tb_refered: A model may refer to a reference, which can be a paper or a book.
 tb_keywords: The keywords that are defined in one model are saved in this table. Each
keywordid signifies a unique keyword; it can be used by more than one model.
 tb_biologicalquestion: A computational model may reply to a biological question in some
research area and talk about certain topics. The details of this table are introduced in the next
section.
5.1.3 Biological question
Figure –
There are 8 tables to save the biological question information (Figure-):
 tb_biologicalquestion: A computational model may reply to the questions that scientists
have asked. For retrieval, we may need to know which research area or topics the
question is about. For each question we supply a unique identity. To reply to the
biological question, we may rely on known hypotheses (tb_hypothesis) or references
(tb_refered). The conclusions for the biological question are saved in the table
tb_conclusion, and the biological contributions of the computational model in replying to the
biological question are listed in the table tb_contribution.
 tb_hypothesis: A hypothesis is made by one person (author); it can be shared by more
than one biological question. If it is a clearly defined hypothesis on a compartment of a
neural cell (cellid, compartmentid), on the neural cell itself (cellid), or about a region of the
organism, we can record the corresponding information in this table. The statement is the
content of the hypothesis. It has one unique identity (id).
 tb_based_hypothesis: This table records, for each biological question, the hypotheses it is
based on.
 tb_reference: This table records all kinds of possible references that can be used in this
database. The type defines whether it is a book, a paper, or another published article. The
description holds detailed information about the reference itself, for example the author,
publication date, publisher, etc. It has the unique identity referenceid.
 tb_refered: It records, for a computational model (tb_neuronmodel_modelid) and a
biological question (questionid), the identity (referenceid) of the referred reference.
 tb_contribution: A contribution is the special devotion of the computational model
when it tries to reply to one biological question (questionid). A question may have more than
one contribution; we can index the content by importance or other characteristics
(indexofcontribution).
 tb_conclusion: The conclusions replying to one biological question (questionid) are saved in
this table. It has a similar structure to tb_contribution; we can index the multiple
conclusion contents (indexofconclusion, content).
5.1.4 Simulation
Figure –
There are 7 tables that will be used to save the information about simulation (Figure-):
 tb_simulation: A simulation may be based on one computational model or on a
combination of multiple computational models. Each simulation is identified by its id. The
simulationResource field saves the external files’ location (we suppose all the files are in one
zip). The SimulationEnvironment describes which kind of tool simulates the
computational models (such as MatLab, Java, or NEURON). The simulated time is stored.
 tb_modelsimulation: Each computational model (modelid) that is simulated in one
simulation (id) may be applied to a concrete biological environment. For example,
given two computational models M1 and M2, we can apply M1 to the K+ channel
(propertyid) on the membrane of the soma (compartmentid) of one neural cell (cellid), and M2
to the Na+ channel (propertyid) on the membrane of the axon (compartmentid) of the
same neural cell. In the description we can describe the connection between the two
compartments of the cell. Each computational model (M1 and M2) may have
different parameter settings (tb_assignment). The start condition
(tb_startcondition) and the stop condition (tb_stopcondition) may differ for
each model. For each model, we can record the variables’ values during the simulation
period (tb_resulttable), and we can construct graphs based on these values
(tb_resultgraph).
 tb_initialcondition: For each simulation (simulationid), the initial value for each variable
(variableid) of its computational model element (modelid) is stored in this table.
 tb_stopcondition: For each simulation (simulationid), the stop value for each variable
(variableid) of its computational model element (modelid) is stored in this table.
 tb_assignment: For each simulation (simulationid), the assigned value for each parameter
(parameterid) of its computational model element (modelid) is stored in this table.
 tb_resulttable: The values of one variable (variableid) of a model (modelid) that have been
recorded during a simulation (simulationid). The total number of values (resultsize) and the
external file (datalocation) that saves these values as a column of data are stored.
 tb_resultgraph: A 3-D graph produced from the variables’ simulation
results can be saved in this table. Xvariableid, yvariableid, and zvariableid describe the x, y, z
axis data sources (the corresponding values are found in tb_resulttable). The produced graph
can be saved in the graphsource location.
 tb_observation: As described in section 5.1.1.
5.1.5 Definition of Views
 v_organism_cell: View of the biological cells in one organism. The sources are
tb_neuroncell and tb_organism.
 v_neural_cell: View of a biological cell: a detailed neural cell with its compartments and
properties information. The sources are tb_neuroncell, tb_compartment,
tb_property, and tb_compartmentcomposition.
 v_neural_cell_experiment: View of the experiments of a neural cell: shows the
experiments that have been done on the neural cells. The sources are tb_neuroncell,
tb_experiment, and tb_experiment_result.
 v_experiment_observation: View of the observations of an experiment. The sources are
tb_experiment and tb_observation.
 v_computational_model: View of a computational model: includes the basic model
description and its equation definitions. The sources are tb_neuronmodel and tb_equation.
 v_equation_variables: View of the equations’ variables: detailed equation descriptions with
the definition of variables. The sources are tb_equation, tb_variable, and
tb_variable_member.
 v_equation_parameters: View of the equations’ parameters: detailed equation descriptions
with the definition of parameters. The sources are tb_parameter_member, tb_parameter,
and tb_equation.
 v_model_reference: View of the references referred to by a computational model. The
sources are tb_neuronmodel, tb_refered, and tb_reference.
 v_model_keywords: View of the keywords used in each computational model. The
sources are tb_neuronmodel, tb_keywords, and tb_keyword.
 v_model_hypothesis: View of the hypotheses of a biological question. The sources are
tb_neuronmodel, tb_biologicalquestion, tb_based_hypothesis, and tb_hypothesis.
 v_model_conclusion: View of the conclusions of a biological question. The sources are
tb_neuronmodel, tb_biologicalquestion, and tb_conclusion.
 v_model_contribution: View of the contributions of a neural model. The sources are
tb_neuronmodel, tb_biologicalquestion, and tb_contribution.
 v_model_question_reference: View of the references of a biological question. The
sources are tb_neuronmodel, tb_biologicalquestion, tb_refered, and tb_reference.
 v_simulation_models: View of the detailed composition of a simulation: defines the models
involved in one simulation. The sources are tb_simulation, tb_modelsimulation, and
tb_neuronmodel.
 v_simulation_observation: View of the observations of a simulation. The sources are
tb_simulation and tb_observation.
 v_simulation_startcondition: View of the start conditions of one simulation. The sources are
tb_modelsimulation, tb_simulation, tb_initialcondition, and tb_variable.
 v_simulation_stopcondition: View of the stop conditions of one simulation. The sources are
tb_neuronmodel, tb_modelsimulation, tb_simulation, tb_stopcondition, and tb_variable.
 v_simulation_assignment: View of the parameters’ settings in one simulation. The sources
are tb_modelsimulation, tb_simulation, tb_assignment, and tb_parameter.
 v_simulation_result_list: View of the recorded results of one simulation. The sources are
tb_modelsimulation, tb_variable, tb_simulation, and tb_resulttable.
 v_experiment_cluster: View of the clustered samples that come from experiments. The
sources are tb_sample, v_neural_cell_experiment, and tb_cluster.
 v_neural_cell_experiment: View of the clustered samples that come from simulations. The
sources are tb_sample, tb_cluster, and v_simulation_result_list.
5.1.6 Competitive Learning
There are 4 tables used to save the information needed to apply the competitive learning
algorithm (Figure –).
 tb_sample: This table is the summary of all the experiments’ and simulations’
results. It is the interface between the learning database and the basic information, including
the biological neuron information and the computational information. Each data item (a
one-dimensional array that saves the values of one measurement in an experiment or of one
variable in a simulation) can originate from a simulation or an experiment
(sim_or_exp_id). If it comes from an experiment, it may include the cellid, compartmentid,
and propertyid information abstracted from the experiment source; if it comes from a
simulation, it may include the modelid and variableid information. Each data item has unit
and size information. Datalocation stores the external file from which we can read the values.
Once we have read all the values from the external file, data_fromindex and data_toindex
save the values’ locations in tb_cluster_sample_values. Each value of one data item is saved
in the table tb_cluster_sample_values, and the data index (dataindex) is the unique identity
that serves as index and primary key.
 tb_cluster_sample_values: The values of one data item read from the external files are saved
in this table. The order of the values in the original file is kept. Each value has a data index
(dataindex) that is unique per value. All the values are saved in one vertical column; for
usage, we need to transform each range of values of a data item into a horizontal table.
 tb_cluster: This table records the clustering result for each data item. Each data item belongs
to only one cluster per cluster level. Each cluster is identified by its clusterid and clusterlayer.
The first cluster level is by default defined by the classification by unit and data size; within
each cluster with the same unit and data size, we apply the competitive learning algorithm to
find the corresponding clusters.
 tb_prototype: Each cluster has one prototype. A prototype is identified by its clusterid,
layer (cluster level), and prototypeid. Its vector values are saved in the table
tb_cluster_sample_values; data_fromindex and data_toindex save the values’ locations
(data_fromindex <= value.dataindex <= data_toindex). Each prototype represents one
group of data with the same unit and size (prototypesize). For quality analysis, we can
write the prototype values out to an external file at datalocation.
Figure –
5.2 Constraints
The constraints on the tables are defined as in Table –:

Table Name | Primary Key | Foreign Key
tb_property | id |
tb_simulation | id |
tb_organism | id |
tb_reference | referenceid |
tb_initialcondition | simulationid, modelid, variableid |
tb_stopcondition | modelid, simulationid, variableid |
tb_resultgraph | simulationid, modelid |
tb_compartment | id |
tb_variable | id, modelid |
tb_variable_member | | (id, modelid) REFERENCES tb_variable (id, modelid); (tb_equation_id, modelid) REFERENCES tb_equation (id, modelid)
tb_neuronmodel | modelid |
tb_neuroncell | cell_id | (id) REFERENCES tb_organism (id)
tb_resulttable | modelid, variableid, simulationid |
tb_experiment_result | compartmentid, propertyID, cellid, tb_experiment_experimentid | (tb_experiment_experimentid) REFERENCES tb_experiment (experimentid)
tb_assignment | modelid, simulationid, parameterid |
tb_observation | observationid |
tb_biologicalquestion | questionid |
tb_experiment | experimentid |
tb_parameter | id, modelid |
tb_parameter_member | | (id, modelid) REFERENCES tb_parameter (id, modelid); (tb_equation_id, modelid) REFERENCES tb_equation (id, modelid)
tb_equation | | (id) REFERENCES tb_neuronmodel (modelid)
tb_interoperation | | (cell_id) REFERENCES tb_neuroncell (cell_id)
tb_cluster | dataindex, clusterid, clusterlayer |
tb_sample | dataindex |
tb_prototype | clusterid, prototypeid, layer |
tb_cluster_sample_values | dataindex |
tb_temp_values | dataindex |
tb_discovery | | (observationid) REFERENCES tb_observation (observationid); (tb_bio_questionid) REFERENCES tb_biologicalquestion (questionid)
tb_contribution | | (tb_bio_questionid) REFERENCES tb_biologicalquestion (questionid)
tb_refered | | (tb_bio_questionid) REFERENCES tb_biologicalquestion (questionid); (tb_neuronmodel_modelid) REFERENCES tb_neuronmodel (modelid)
tb_keywords | | (id) REFERENCES tb_neuronmodel (modelid)
tb_modelsimulation | | (id) REFERENCES tb_simulation (id); (modelid) REFERENCES tb_neuronmodel (modelid)
tb_hypothesis | | (tb_bio_questionid) REFERENCES tb_biologicalquestion (questionid)
tb_conclusion | | (tb_bio_questionid) REFERENCES tb_biologicalquestion (questionid)
tb_compartmentcomposition | | (property_id) REFERENCES tb_property (id); (compartment_id) REFERENCES tb_compartment (id); (cell_id) REFERENCES tb_neuroncell (cell_id)

Table –
5.3 Triggers
Strategy to keep the integrity between the base tables and the cluster structure (Table –): triggers are
defined on the tables tb_experiment_result and tb_resulttable to keep the integrity between the
data sources and the cluster sources.

Trigger Name | Table | Description
t_experiment_delete | tb_experiment_result | After deleting one experiment data item, we delete the corresponding data in the tables tb_cluster, tb_sample, and tb_cluster_sample_values.
t_experiment_result_insert | tb_experiment_result | After inserting a new experiment data item, we insert the data saved in the data file into the tables tb_cluster, tb_sample, and tb_cluster_sample_values. If clusters for the corresponding data size and unit are defined in the learning database, we calculate the cluster for the new result.
t_experimentresult_update | tb_experiment_result | After modifying an experiment data item, we modify the tables tb_cluster, tb_sample, and tb_cluster_sample_values to have the same data content. At the same time, if clusters for the corresponding data size and unit are defined in the learning database, we recalculate the corresponding cluster for the modified result.
t_resulttable_delete | tb_resulttable | After deleting one simulation result data item, we delete the corresponding data in the tables tb_cluster, tb_sample, and tb_cluster_sample_values.
t_resulttable_insert | tb_resulttable | After inserting a new simulation result data item, we insert the data saved in the data file into the tables tb_cluster, tb_sample, and tb_cluster_sample_values. If clusters for the corresponding data size and unit are defined in the learning database, we calculate the cluster for the new result.
t_resulttable_after_update | tb_resulttable | After modifying a simulation result data item, we modify the tables tb_cluster, tb_sample, and tb_cluster_sample_values to have the same data content. At the same time, if the clusters are defined in the learning database, we recalculate the corresponding cluster for the modified result.

Table –
5.4 Competitive Learning Algorithm Implementation
The competitive learning algorithm is applied to each homologous data group, i.e., to data having the
same unit and the same data size. For example, a simulation result whose variable has the unit ‘mV’ and
a recorded value array of length 5000 is homologous with an experiment result of the
membrane potential on the soma of a neural cell that has the unit ‘mV’ and 5000 measured
values.
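The first-level grouping into homologous groups can be illustrated outside the database (the sample records below are hypothetical):

```python
from collections import defaultdict

# Hypothetical sample records: (dataindex, unit, datasize)
samples = [
    (1, "mV", 5000),   # simulation result, membrane potential
    (2, "mV", 5000),   # experiment result on a soma: homologous with sample 1
    (3, "mV", 2000),
    (4, "nA", 5000),
]

groups = defaultdict(list)
for dataindex, unit, size in samples:
    groups[(unit, size)].append(dataindex)  # competitive learning runs per group

print(dict(groups))
# → {('mV', 5000): [1, 2], ('mV', 2000): [3], ('nA', 5000): [4]}
```

In the database, this grouping corresponds to the default first cluster layer of tb_cluster.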
We suppose all the simulation results and experiment results are saved in external files that can
be accessed by the user in the authorized directory in Oracle.
As mentioned in section 5.1.6, four tables are involved in the competitive learning procedure:
tb_sample, tb_cluster_sample_values, tb_cluster, and tb_prototype.
Each insertion of a simulation result (tb_resulttable) or experiment result (tb_experiment_result)
triggers the corresponding data-import procedure, which reads the external file
into the tb_sample and tb_cluster_sample_values tables. Update or delete operations
on these two result tables trigger the update or delete procedures on the corresponding tables
tb_sample, tb_cluster_sample_values, and tb_cluster.
Once we have enough data samples to cluster, we can start the clustering procedure by initializing
the prototypes and executing the competitive learning algorithm.
To get a better initialization of the prototypes at the starting point of the competitive learning
procedure, extra effort is needed to check the randomly chosen prototypes and reject choices in
which too many overlapping prototypes exist. The overlap can be represented by the correlation
between two prototypes: a prototype pair with a high correlation (for example, more than 0.9) may
mean that the two prototypes represent the same cluster. We check the correlations between
prototypes to find the highly correlated ones, and a criterion decides how many
correlated prototypes are accepted by the initialization. If no choice satisfies the wanted
correlation-number criterion, we have to augment the number of tolerable correlated prototypes;
a maximum ratio for this criterion has to be defined, or the search will loop infinitely.
First, to avoid exhaustive tries, we have to limit the search for better choices to a reasonable
number of iterations.
Second, since we cannot know the cluster number in advance and missing clusters is not what
we want, we have to define a rather high ratio of prototypes with respect to the total size of the
sample data. The result may be more dead units, but it guarantees that we have not missed the
poorly separated clusters.
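The correlation-based initialization check can be sketched as follows (illustrative Python; the 0.9 threshold comes from the text, while the retry limit and the relaxation step are assumptions):

```python
import math
import random

def pearson(a, b):
    """Pearson correlation between two equal-length value arrays."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def pick_prototypes(samples, k, threshold=0.9, max_pairs=0, max_tries=50):
    """Randomly pick k prototypes, rejecting choices with too many highly
    correlated pairs; relax max_pairs rather than loop infinitely."""
    while True:
        for _ in range(max_tries):
            protos = random.sample(samples, k)
            high = sum(1 for i in range(k) for j in range(i + 1, k)
                       if abs(pearson(protos[i], protos[j])) > threshold)
            if high <= max_pairs:
                return protos
        max_pairs += 1  # tolerate one more correlated pair, then retry
```

Bounding max_tries per relaxation step is the "reasonable iteration number" mentioned above.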
Once the initialization finds satisfactory prototypes, competitive learning begins; once all the
samples are clustered, the initialization of the competitive learning algorithm finishes.
When a new sample is inserted into the database, it is clustered by the competitive learning
algorithm, and the learning procedure continues.
To get better accuracy, we suggest that when the database size has grown considerably, all the
prototypes be reinitialized with a bigger ratio and the competitive learning be executed again for
all the samples.
The flow in Oracle to execute corresponding sql files is shown in Figure-:
1. Create the user's name and directory.
2. Create the tables: createtable_ra.sql
3. Insert the samples' data (createdatasample in initializecluster.sql).
4. Import the external data from files (initializesample).
5. Initialize the prototypes (by unit and data size): initializeallprototype.sql
6. Execute the competitive learning algorithm (by unit and data size): execALLCompetiveLearning.sql
Figure –
6. Queries
We decided to give users the freedom to choose the queries they want. The 2007 semester project "EasyQueries" of the Laboratory of Databases (author: Ariane Pasquier) supplies a good tool to exploit all the possible ways of querying the database. As a summary, we can list some examples that are common for users:
A. General queries:
1. Biological questions:
 Given the neuron name or ID, list out its compartments with their electrical
properties, classifications, locations, and interpretations.
 Given the neuron name or ID, list out the experiment results.
 Given the type of receptor/neurotransmitter/channel, list out the neurons
that have that type of receptor/neurotransmitter/channel.
 Etc.
2. Computational questions:
 Given a neuron name or ID, list out its computational models.
 Given the name of a theory, list out the computational models based on this
theory.
 Given the research subject, list out the possibly relevant models.
 Given the neuron name, list out the biological experimental data for data
fitting.
 Given the model name, list out the simulation results (table, graphs, condition
and parameters).
 Given the properties of compartments, find the computational models that
are based on them.
 Etc.
B. Advanced queries:
 Given the simulation resulting data (pre-saved table or graph), find the closest
experiment results or simulations.
 Given an external file that stores a simulation or experiment result, find the
closest experiment results or simulations.
 Given a cluster id, find its prototype.
 Given a simulation result or experiment result, find its cluster.
 Etc.
EasyQueries supplies an interface in which the user can easily define his queries (Figure-). All the tables and views in the database that correspond to the user's role can be chosen as the query target. The tool was originally written to query tables in a Derby database; we slightly modified it so that it can query the Oracle database. There are two query options: assisted queries with QBE and handwritten queries.
Figure –
QBE (Query-By-Example) is a language for querying relational data through a graphic representation of the data. The user can use QBE keywords to retrieve, update, delete, and insert data. The graphic representation is shown in Figure-. It follows the DB2 Query Management Facility (QMF) keyword definitions [8]: the user can choose the queried table, select the fields of the table (keywords: P.: projection, UNQ.: distinct, etc.), and easily define conditions (for example <=, >=, <>) using the supplied keywords. For a detailed explanation of this language, we refer the reader to [8]. Multiple tables can be joined using the complex link. In our case, we have already created the necessary views so that users can reach almost all the visible information; an operation on a single view is sufficient to query the necessary information.
Figure –
In Figure- we show an example of a query against the view v_organism_cell: we select all the fields (P. on v_organism_cell) and choose the records whose organism_name is like 'brain'. The resulting SQL statement is previewed; once we click on 'SEND QUERY', we obtain the result records below.
Figure –
Advanced users who are familiar with SQL can also define queries manually by writing the SQL statements directly.
7. Performance analysis
First, we look at the time consumption of the competitive learning step. As mentioned before, the complexity of competitive learning is O(M*N), where M is the number of clusters and N is the total number of samples. We fixed the number of clusters and executed competitive learning on sample sets of different sizes. The time consumption is shown in Figure-; we can conclude that it increases linearly with the number of samples.
Figure –
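The linear growth follows directly from the structure of the algorithm: one pass touches every (prototype, sample) pair exactly once. A minimal Python sketch of such a batch pass (the fixed learning rate eta is an assumption; the project's version is in PL/SQL):

```python
import numpy as np

def competitive_learning_pass(prototypes, samples, eta=0.1):
    """One pass over all samples: for each of the N samples, compute
    the distance to all M prototypes, hence O(M*N) total work."""
    assignments = []
    for x in samples:
        d = np.linalg.norm(prototypes - x, axis=1)   # M distance computations
        w = int(np.argmin(d))                        # winning prototype
        prototypes[w] += eta * (x - prototypes[w])   # move winner toward sample
        assignments.append(w)
    return assignments
```

Fixing M and growing N, the inner distance computation runs M times per sample, which matches the linear timing curve observed.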
Second, we check the quality of the final prototypes by two measurements: the number of dead units and the number of correlated pairs of prototypes.
Figure – (a), (b)
In Figure- the dead units and the number of final correlated prototype pairs are shown for a configuration of 1000 samples of 500 dimensions each. We can see that, with the number of samples and dimensions fixed, increasing the number of clusters leads to more dead units (in our case, clusters with fewer than 3 samples are defined as dead units) and more correlated pairs of prototypes (in our case, a pair of prototypes with a correlation above 0.8 is defined as correlated). Correlated prototypes may represent the same cluster, and dead units show that some prototypes never found samples belonging to the cluster they represent.
Figure –
We also studied various dimensions with the same numbers of samples and clusters (Figure-). The number of correlated prototype pairs did not change much (between 0 and 3 pairs), but the time consumption differed greatly (from 113 to 1171 seconds) (Figure-). We therefore consider reducing the dimensionality before competitive learning an acceptable solution [12].
The quality of the prototypes is the most important measurement of the success of the algorithm, as we believe that the more samples the database has learned, the more successfully it can cluster a sample correctly. To verify the correctness of the clustering result, we used labeled handwritten digit images, each image representing one digit from 0 to 9 (the handwritten digit data used to test the clustering precision come from the course "Unsupervised and reinforcement learning in neural networks" of Professor Wulfram Gerstner). These images have a fixed dimension (784); we set the number of clusters proportional to the total number of samples (10%) and executed the learning process on sample sets of different sizes. We define the precision of clustering as follows:
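Assuming, as is common for labeled data, that each cluster is labeled by its dominant digit, the precision is the fraction of samples whose label matches the dominant label of their cluster. A minimal Python sketch of this (assumed) definition:

```python
from collections import Counter

def clustering_precision(assignments, labels):
    """Fraction of samples whose label equals the majority label of the
    cluster they were assigned to (assumed definition: each cluster is
    labeled by its dominant digit)."""
    by_cluster = {}
    for c, y in zip(assignments, labels):
        by_cluster.setdefault(c, Counter())[y] += 1
    # In each cluster, the samples carrying the majority label count as correct.
    correct = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    return correct / len(labels)
```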
The clustering precision is shown in Figure-. We can easily see that with more samples we learn more precisely. This confirms the proposal that, to improve the quality of the clusters, the prototypes should be reinitialized and competitive learning executed again once the number of samples has grown by some scale; in this example the scale is 10. The precision improved from 57.5% (200 samples, 20 clusters) to 82.5% (2000 samples, 200 clusters).
Figure –
In the case of 200 samples with 20 clusters, we could not find all the possible prototypes for each digit, but in the case of 2000 samples and 200 clusters, different prototypes for each digit were found. To show the progression, we reshaped the prototypes from 784 dimensions to 28*28 matrices and drew the digits. The 20 prototypes of the 200 samples are shown in Figure-, and 20 of the 200 prototypes found from the 2000 samples are displayed in Figure-. We can observe in the two figures that in the first case we found only 1 prototype for digit 1, but at least 4 prototypes in the second case; similarly, we found only 2 prototypes for digit 0 in the first case but at least 4 prototypes in the second. Furthermore, in the first case there are 2 prototypes for which we cannot tell which digit they represent (in the second line, the third and the fifth prototype), whereas in the second case each digit is easily distinguished.
Figure –
Figure –
Figure –
To further analyze the performance, we compared the time consumption of the competitive learning process between Oracle and Matlab. We fixed the dimension at 500 and took 1000 samples. By choosing different numbers of clusters, we obtained the time consumption shown in Figure- for Oracle and Matlab. Surprisingly, Matlab is much faster in every case: during this test, the maximum time for Matlab to execute the competitive learning algorithm is only 3.5 seconds, against 4065 seconds for Oracle. At the same time, this gives great hope that we can improve the performance by using an external tool such as Matlab to execute the learning algorithm, with Oracle serving as the storage and retrieval tool.
We tried one possibility, the matrix package of Oracle, but for lack of usage examples we could not finish it in the short time available.
8. Conclusion
Based on the requirements for storing biological neural information and computational model information, we designed a database that not only stores biological and computational neural information systematically, but can also learn from high-dimensional data series or graph information (transformed into high-dimensional vectors) to find cluster information. The project is implemented in Oracle 10g using the PL/SQL language. Queries are possible through EasyQueries [10]. The performance of the project was studied using data series produced by combinations of elementary functions and handwritten digit images.
Here we want to point out some problems with the Oracle implementation:
 The greatest inconvenience we met is mathematical calculation: vector computations cannot be executed easily in Oracle, and the performance of such procedures in Oracle is extremely poor.
 The I/O operations to load external files into Oracle are not efficient.
 In the analysis of time consumption, Oracle is not efficient at executing the competitive learning algorithm; Matlab is obviously much better.
As future work, we may consider implementing the competitive learning algorithm in external specialized components such as Matlab or C++, with Oracle serving as a storage and retrieval tool.
We have supplied a well-formed and easy-to-query knowledge base, but we lack a good visual tool for entering the knowledge. As the biological and computational information is decomposed in as much detail as possible, the user may have to input an enormous amount of information. For example, for each computational model the equations, parameters and variables have to be entered separately if no intelligent tool abstracts the parameters and variables from the equations; and for each parameter or variable, without the help of a dictionary from which the user can easily choose the meaning, it is heavy work to input every description or biological explanation manually. In future work, such an intelligent editor should be supplied to the user. Furthermore, some existing databases may already contain part of the information; in that case we may need to develop a tool to import the corresponding information automatically.
As mentioned in section 7, reducing the dimensionality has no great effect on the precision but greatly reduces the time consumption. In future work, we may consider computing the principal components instead of computing over all the dimensions.
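Such a projection onto the principal components could be sketched as follows (a Python sketch using the SVD; the number of retained components k is an assumption, and the project would need this rewritten for its own environment):

```python
import numpy as np

def reduce_dimension(samples, k):
    """Project samples onto their first k principal components,
    so competitive learning runs on k dimensions instead of all."""
    centered = samples - samples.mean(axis=0)
    # SVD of the centered data: the rows of vt are the principal directions,
    # ordered by decreasing explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T
```

Distances computed on the k-dimensional projections approximate the original distances well when the discarded singular values are small, which is why the clustering precision barely changes.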
In this project, we chose the simple one-pass Kohonen competitive learning algorithm, which is easily adapted to a database application. In future work, however, any learning algorithm could be added at the application level by defining a standard interface to retrieve data samples and return the results.
9. Acknowledgement
I thank Professor Stefano Spaccapietra for accepting my proposal for this project, and I deeply appreciate the help of Dr. Fabio Porto.
I also thank my family for their non-stop support.
References
1. http://en.wikipedia.org/wiki/Data_clustering
2. http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/index.html
3. Neuron. (2008, May 18). In Wikipedia, The Free Encyclopedia. Retrieved 08:16, May 26,
2008, from http://en.wikipedia.org/w/index.php?title=Neuron&oldid=213235579
4. Membrane potential. (2008, May 21). In Wikipedia, The Free Encyclopedia. Retrieved
08:14, May 26, 2008, from
http://en.wikipedia.org/w/index.php?title=Membrane_potential&oldid=214033480
5. Pyramidal cell. (2008, May 8). In Wikipedia, The Free Encyclopedia. Retrieved 13:46,
May 26, 2008, from
http://en.wikipedia.org/w/index.php?title=Pyramidal_cell&oldid=211030760
6. E. M. Izhikevich: Simple Model of Spiking Neurons. IEEE Transactions on Neural
Networks, Vol. 14, No. 6, November 2003, pp. 1569-1572.
7. T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, A. Y. Wu
(Almaden Research Center, San Jose, CA): An Efficient k-Means Clustering Algorithm:
Analysis and Implementation. IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. 24, No. 7, July 2002.
8. IBM Corporation, August, 2005
http://publib.boulder.ibm.com/infocenter/dzichelp/v2r2/index.jsp?topic=/com.ibm.q
mf.doc.using/dsqk2mst365.htm
9. IBM Corporation, August, 2005
http://publib.boulder.ibm.com/infocenter/dzichelp/v2r2/index.jsp?topic=/com.ibm.q
mf.doc.using/dsqk2mst339.htm
10. Ariane Pasquier: User Manual of EasyQueries. April 2007.
11. Hines ML, Morse T, Migliore M, Carnevale NT, Shepherd GM. ModelDB: A Database to
Support Computational Neuroscience. J Comput Neurosci. 2004 Jul-Aug;17(1):7-11.
12. Heng Tao Shen, Xiaofang Zhou, Aoying Zhou: An adaptive and dynamic dimensionality
reduction method for high-dimensional indexing. The VLDB Journal (2007) 16(2): 219-234.