Report of Master Project: Learning Database
Professor: Stefano Spaccapietra
Assistant: Fabio Porto
Student: Yuanjian Wang Zufferey

Contents
1. Introduction
2. Related Works
3. Concept Model
3.1 Project Architecture
3.2 UML Definition
3.2.1 Biological Neuron
3.2.2 Computational Model
3.2.3 Simulation Model
3.3 Definition of XML Schema
3.3.1 LearningDatabase
3.3.2 BioNeuron
3.3.3 NeuroModel
3.3.4 Hypotheses
3.3.5 Simulations
3.3.6 Constraints
4. Learning Algorithm (Why)
5. Implementation
5.1 Architecture of the Oracle Implementation
5.1.2 Computational Model
5.1.3 Biological Question
5.1.4 Simulation
5.1.5 Definition of Views
5.1.6 Competitive Learning
6. Queries
7. Performance Analysis
8. Conclusion
9. References

1. Introduction

The reason I proposed this project to Prof. Stefano Spaccapietra and Dr. Fabio Porto is my belief that computer science should act as a tool serving one or more application domains, such as chemistry, biology, or finance. As a Master student who wants to specialize in Bio-Computing, I have taken the corresponding courses, including basic molecular biology and biological computation, and the machine learning courses gave me the most enthusiasm. At the same time, my interest in database technology has never decreased. Combining the two techniques to provide useful services to the biological domain became the first idea for my Master project. In my contact with Dr.
Fabio Porto, he suggested the need for a well-designed database schema for the Blue Brain Project. On further consideration, we proposed applying machine learning algorithms inside the database to automatically cluster unstructured, high-dimensional data such as experiment and simulation results. This gave me the opportunity to design and realize an application system that combines the biological and computational aspects of neuronal science.

With the fast-growing body of knowledge in neuron science, it is becoming more and more urgent to find a well-formed storage and retrieval tool to save and share this knowledge among neuroscientists, so that we can take advantage of new discoveries and share information quickly and reliably. The basic requirement comes from neuron science: we need an extensible database schema that can save and retrieve neural information, both biological and mathematical. One group of users is composed of the biologists who work on neuron experiments and try to find the biological functions of neurons. The second group is formed by the mathematicians, physicists, and computer scientists who create and manage the computational models that describe the electrical activity of neuron cells. The advanced requirement is how to make the two groups of scientists understand each other, and furthermore refer to each other's work or aggregate it together. As the two groups do not interpret the same biological or simulation data in the same way, we have to find a method to translate, or map, corresponding terms that have different representations but the same meaning. Biologists are interested in the biological characteristics of neurons, for example the neural network, the type of cells, and the characters of ion channels, receptors, or transmitters.
The physicists, on the other hand, are more interested in the potential change on the membrane and abstract each neuron as a unit. A computational model usually simulates part of the biological data, and a biologist may tune his experiment with a cheap computer simulation before the real experiment on neural cells. Conversely, a physicist wants to find and test the biological interpretation after he has built a computational model. How do they find the possible hidden similarity between biological data and simulated data? This project proposes a solution: build two basic database schemas, one per group, to store the common knowledge of biology and computation, and then build a bridge between biologists and physicists or mathematicians. The most common method is to draw a curve from the recorded data and see how well two curves fit each other. But how do we query a mass of data in the database to find such similarity, and in a reliable way?

Here we propose a neural learning database solution. Given the biological neural data and the results of simulation, our database can learn the similarity between the two types of data. We answer queries such as: Is there any computational model with simulation results similar to the given biological experiment results? If yes, show the most similar results and model in a comparable way (for example, curves and graphs). Are there any biological results fitting the given computational simulation result well? If yes, show the most similar results in a comparable way (for example, a curve). The neural learning database will be implemented with learning ability. It is able to cluster different data series based on the data content itself and certain predefined criteria. At the same time, it can be reinforced by external positive or negative feedback from users.
For example, a proposed result of a query may be strongly confirmed by a user, which augments the correlation of the results; similarly, in the negative case the correlation is reduced. As the size of the learned data grows, the clustering precision can be improved.

2. Related Works

To supply a common knowledge database that covers both the biological and the computational aspects of neuronal science, ModelDB [11] has given an example. It supplies computational models classified by neuronal composition. The advantage of ModelDB is that we can find computational models in various formats, such as NEURON, Java, Matlab, C++, etc. The inconvenience is that we cannot easily compare different formats directly. The definition of a computational model is not easy to read (equations, definitions of variables, parameters, etc.). The connection between biological structure and model application cannot be found. Likewise, we cannot find similar simulation results across different computational models.

Traditional database management systems aim to classify data by certain predefined criteria. We usually need to know the data structure well in advance and have to carefully define the structures and the detailed procedures that abstract data for storage and classification. Semantic web applications use a well-defined conceptual schema to supply annotations (knowledge markup technologies) recognized by semantic analysis tools, so that web contents can be classified based on the annotations. Built on a carefully designed schema, ontology technology can mine valuable, high-quality ontological resources. Obviously, multimedia data without well-defined annotations cannot be mined by an ontology.
Recently, learning machines have been widely used in all kinds of domains, such as image recognition, language processing, syntactic pattern recognition, search engines, medical diagnosis, bioinformatics, and cheminformatics. Machine learning and clustering in information retrieval systems can be applied to categorize content-based results or rank them more meaningfully. As a broad subfield of artificial intelligence, machine learning is concerned with the design and development of algorithms and techniques that allow computers to "learn". The major focus of machine learning research is to extract information from data automatically, by computational and statistical methods. So machine learning is closely related not only to data mining and statistics, but also to theoretical computer science. Generally, there are three kinds of algorithms: supervised learning, unsupervised learning, and reinforcement learning.

Supervised learning aims to solve the classification problem. It learns the behavior of a function that maps a vector [X1, X2, ..., XN] into one of several classes by fitting it to several input-output examples of the function. The problem with applying supervised learning is that we have to know the possible classes in advance and find typical examples for training. An unsupervised learning agent models a set of inputs: classes and typical examples are not available. The common form is clustering, which is sometimes not probabilistic. The number of clusters adapts to the problem size, and the user can control the degree of similarity between members of the same cluster by means of a user-defined constant called the vigilance parameter. Reinforcement learning is concerned with how an agent ought to take actions in an environment so as to maximize some notion of long-term reward. It is studied in the domain of real-time systems.
Learning algorithms, especially the supervised and unsupervised ones, give the database the possibility to learn in advance (supervised) or to learn progressively (unsupervised). As in this project we cannot get representative examples to train the database with a supervised algorithm, we have to look at the unsupervised algorithms.

In the following, we first propose an Object-Oriented schema in UML and XML representations (Section 3). In Section 4 we describe the common clustering algorithms. Then we introduce the detailed design (implementation) in Oracle in Section 5. In Section 6 we describe the queries the system supports. The performance of the competitive learning (unsupervised) implementation is given in Section 7. We conclude the project in Section 8.

3. Concept Model

The fundamental work of this project is to design a well-formed data structure to store the necessary information. It not only supplies the storage base, but also formalizes some workflows. The storage base is composed of two parts: neuronal biological information and computational model information. The workflows concern the procedure of constructing a computational model and the simulation procedure.

3.1 Project Architecture

[Figure: project architecture. Biological Neurons and Computational Models are defined in an XML or UML Object-Oriented layer on top of a relational Knowledge Database; Biological Experiments and Computational Simulations feed a relational Learning Database. Query paths: 1. Biological Queries, 2. Computational Queries, 3. Experiment or Simulation Queries, 4. Data Cluster Queries, 5. Learning Process.]

In the figure we can see the architecture of this project. Generally speaking, the project is defined on two levels: the UML or XML Object-Oriented class definitions and the Relational Database definition.
In the UML or XML Object-Oriented definition, we take the user's point of view and specify two groups of users: biologists working on neuron science, and scientists working on mathematics, physics, and computer science. Thus two types of information are modeled: Biological Neurons and Computational Models. The biologists can retrieve the biological information about neurons (1. Biological Queries). Similarly, the other group of scientists can query the computational model information about neurons (2. Computational Queries). Derived information, such as biological experiments and computational simulations, is also available. From the derived information, biologists or other scientists can not only search the experiments or simulations (3. Experiment or Simulation Queries), but also run similarity queries, that is, find the cluster information given an experiment result or a simulation result (4. Data Cluster Queries).

In the Relational Database definition, we implement the Object-Oriented UML or XML definition in the relational way. We distinguish two kinds of implementation: one is the Knowledge Database, which stores the biological neural information (neuron information plus experiment information) and the computational neural information (computational models plus simulation information); the other is the Learning Database, which learns from the simulation and experiment results to find the similarity between them and form clusters (5. Learning Process). Biologists or other scientists can then query for similar experiment or simulation results.

In the following subsections we introduce the concept model using the two common Object-Oriented representations, UML and XML. In this step of the work, some of the terminology is taken from ModelDB and some from Wikipedia definitions.
3.2 UML Definition

3.2.1 Biological Neuron

Firstly, we have to understand what a neural cell is. Neurons [3] are electrically excitable cells in the nervous system that process and transmit information. Neurons are the core components of the brain, the spinal cord in vertebrates, the ventral nerve cord in invertebrates, and the peripheral nerves. Neurons are typically composed of a soma, or cell body, a dendritic tree, and an axon (Figure). The majority of vertebrate neurons receive input on the cell body and dendritic tree, and transmit output via the axon. Neurons communicate via chemical and electrical synapses, in a process known as synaptic transmission. The fundamental process that triggers synaptic transmission is the action potential, a propagating electrical signal that is generated by exploiting the electrically excitable membrane of the neuron. Together, the electrical properties of the ion channels and receptors on the membrane of a neuron determine the electrical properties of the neuron (Figure). Neurotransmitters are chemicals that are used to relay, amplify, and modulate signals between a neuron and another cell.

[Figure 1: a typical neural cell (from http://www.cs.nott.ac.uk/). Figure 2: receptors and ion channels (from http://www.neuropsychopathologie.fr).]

A biological neuron [2] can be composed of multiple compartments: soma, axon (an axon can be composed of hillock, stem, and terminal sub-compartments), and dendrites (a dendrite can be composed of proximal, middle, and distal sub-compartments). Additional compartments could be added in the future. Each compartment has some properties on its membrane, such as the neurotransmitter receptor type (for example, ionotropic or metabotropic receptors), ion channels (for example, Na+, K+, Ca2+), or transmitter type (for example, acetylcholine, the biogenic amines, the amino acid transmitters, etc.).
For each property or compartment we may be interested in the states of some measurements, such as the membrane potential, the membrane capacitance, the conduction velocity of the axon, etc. For each neuron, we are interested in where it is (organ), which category it belongs to (classification), what its functions are, and other molecular information, such as the coding gene, microscope images for visualization, and the experiments that probe its dynamic characteristics (Figure).

A concrete instance of a neuron is shown in the figure. Depending on the classification criterion, a neuron can belong to different neuron classes. An organ is covered by numerous neurons. Each neuron may be composed of different compartments that carry special properties. For example, a pyramidal cell [5] (or pyramidal neuron, or projection neuron) is a multipolar neuron (Neuron_Classification) located in the hippocampus and cerebral cortex (OrganInstance). These cells have a triangularly shaped soma, or cell body, a single apical dendrite extending towards the pial surface, multiple basal dendrites, and a single axon (Compartment). K+ channels (ElectricalProperty) on the dendrites of pyramidal cells are often studied.

3.2.2 Computational Model

Let us look at an example of a computational model (taken from ModelDB [11]): the Simple Model of Spiking Neurons [6], which combines the biological Hodgkin-Huxley-type dynamics and the computational integrate-and-fire neurons. The two equations of this model are:

v' = 0.04v^2 + 5v + 140 - u + I
u' = a(bv - u)

with the auxiliary after-spike resetting: if v >= 30 mV, then v <- c and u <- u + d.

There are three variables:
v: the membrane potential of the neuron;
u: the membrane recovery variable, which accounts for the activation and inactivation of ionic currents;
I: delivered synaptic currents or injected dc-currents.

And four parameters are defined:
a: the time scale of the recovery variable u;
b: the sensitivity of the recovery variable u;
c: the after-spike reset value of the membrane potential, caused by the fast high-threshold conductances;
d: the after-spike reset of the recovery variable, caused by slow high-threshold conductances.

We can easily define a computational model by its equations and by the variables and parameters included in the equations (Figure). For each variable and equation, some biological explanation may be supplied (as in our example). The model reads from (ReadInterface) and produces results to (WriteInterface) in each time step. It is a model based on the Hodgkin-Huxley-type dynamics and integrate-and-fire hypotheses. The references of this model can be found in the paper [6], together with the biological question it addresses: the spiking and bursting behavior of known types of cortical neurons.

3.2.3 Simulation Model

Simulations are realizations of computational models in some programming language or environment, such as Matlab or NEURON. The same realization of a computational model, but with different simulation conditions (different start conditions, parameter settings, and stop conditions), will produce different results. A simulation can include more than one simulation element. A simulation element can be a neural cell, a compartment of a neural cell, or an electrical property bound with a computational model, and these elements can be connected to form a neural network or a detailed neural cell. In each simulation, we may discover different behaviors. For example, we can bind a neural cell with the computational model described in 3.2.2 (Simple Model of Spiking Neurons).
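The dynamics of this model can be sketched with a simple forward-Euler integration in Python. This is a minimal illustration, not the original NEURON/Matlab code distributed with the model; the 0.5 ms step and the 30 mV spike threshold follow the published description, everything else is a plain numerical sketch:

```python
def izhikevich(a, b, c, d, I, t_max=1000.0, dt=0.5):
    """Forward-Euler integration of the simple spiking-neuron model:
        v' = 0.04 v^2 + 5 v + 140 - u + I
        u' = a (b v - u)
    with after-spike resetting: if v >= 30 mV then v <- c, u <- u + d.
    Returns the membrane-potential trace and the spike times (ms)."""
    v, u = c, b * c                      # start at the reset point
    trace, spikes = [], []
    for step in range(int(t_max / dt)):
        v += dt * (0.04 * v * v + 5 * v + 140 - u + I)
        u += dt * a * (b * v - u)
        if v >= 30:                      # spike: record time, apply reset
            spikes.append(step * dt)
            v, u = c, u + d
        trace.append(v)
    return trace, spikes

# Tonic spiking, parameter set a=0.02, b=0.2, c=-65, d=6, I=14
trace, spikes = izhikevich(a=0.02, b=0.2, c=-65, d=6, I=14)
```

With a constant input current the neuron fires regularly; changing only the parameters and the input reproduces the qualitatively different behaviors discussed below.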
With different initial conditions and parameter settings, we get different simulation results, for example (Figure): (A) a=0.02, b=0.2, c=-65, d=6, I=14: tonic spiking; (B) a=0.02, b=0.25, c=-65, d=6, I=0.5: phasic spiking.

3.3 Definition of XML Schema

Now we introduce the concept model in its XML representation (XML Schema), from the top definition to the bottom.

3.3.1 LearningDatabase

The top of the schema is the LearningDatabase element (Figure). It is composed of the BioNeurons, NeuronModels, Hypotheses, Simulations, and References elements. BioNeurons, the biological neurons, is a collection of BioNeuron elements. NeuronModels is a collection of neural computational models. Hypotheses is a collection of hypotheses proposed by different authors and referred to in computational models. Simulations is a collection of simulations, each of which simulates one computational model or a combination of several. The References element is the collection of references that may be cited in the computational models. We explain the Constraints in Section 3.3.6.

3.3.2 BioNeuron

BioNeuron (Figure) is the element that acts as an individual neuron cell. It is identified by its ID, name, canonical form, and the organism where it is situated. To describe its molecular information, we can use Interoperation in the form of Gene_Chromosome, Microscopy_Data, or Experimental_Data. An external resource location, such as an image file, can be saved for each sub-element of Interoperation (the Resource attribute records this location). A neuron cell is composed of different compartments such as somas, axons, or dendrites. Certain electrical properties (for example, channels, receptors, or transmitters) can be bound to the membrane of a compartment.
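A minimal BioNeuron instance reflecting this description could be built with Python's standard `xml.etree.ElementTree`, as sketched below. The element and attribute names follow the schema description above, but the exact spellings and all concrete values here are illustrative placeholders, not the schema definition itself:

```python
import xml.etree.ElementTree as ET

# A minimal BioNeuron: identity attributes, one compartment,
# and one electrical property (an ion channel) on its membrane.
neuron = ET.Element("BioNeuron", ID="n1", Name="pyramidal cell", Organism="rat")
compartments = ET.SubElement(neuron, "Compartments")
soma = ET.SubElement(compartments, "Compartment", ID="c1", Type="soma")
prop = ET.SubElement(soma, "Property", ID="p1", Kind="IonChannel")
prop.text = "K+"

xml_text = ET.tostring(neuron, encoding="unicode")
```

Serializing the element tree this way gives exactly the kind of nested instance document that the LearningDatabase schema is meant to validate.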
For each neural cell, more than one experiment may be executed, and in each experiment, measurements of the electrical properties may be taken at each time step (named states) during the whole execution time. We record the unit, name, and value of each measurement. It is important to have such information so that the later clustering algorithm can cluster homologous data. More generally, we can save the location of a result file in the value element of a states element instead of saving each value in an element. Additional information or detailed descriptions can also be saved.

3.3.3 NeuroModel

Each variable (Figure (b)) or parameter (Figure (c)) has a unique identity (ID) within one computational model, a name, a unit, and the symbol (for example, α, β, x, y, etc.) used to represent it in equations. An equation (Figure), with a unique ID within one computational model, is composed of variables and parameters. The mathematical expression of the equation is saved. Each equation may describe the functionality of one region of an organism, of one cell, of one compartment, or of an electrical property. In one computational model, we may define more than one group of equations that have similar behavior. For retrieval convenience, we are interested in the characteristics of a computational model, such as the biological question it answers (Figure (a): BiologicalQuestion), the references it cites (Figure (b)), the hypotheses it is based on, and the keywords it mentions. A reference can be a published paper (PaperReference), a theory from a scientist (TheoryFromPerson), or a book. To make a biological question precise, we can describe the research area it belongs to, the topics it covers, or a reference it uses.
In answering the question, the computational model may make an important contribution. The specialties (Feature) of the model in answering this biological question, and the conclusions we can draw from it, are important for other people to refer to. To formalize the terminology of research areas (Figure (a)) and topics (Figure (b)), we refer to the Wikipedia classification of neuron science.

For a single neuron, the compartments may behave differently, so that different computational models may apply to different compartments or to the electrical properties of a compartment. To communicate between different models, we can define reads or writes variables that read input from, or write values out to, the variables of other models (Figure). In a neural network, the connections between different neural cells with different computational models behave in the same way (Figure). For each computational model, we may have resource files such as program code (Matlab, Java, NEURON, C++, etc.), reference files, and other necessary files, which can be zipped into one additional file. The location of this zip file is given by the 'resource' attribute of the AdditionalFiles element.

3.3.4 Hypotheses

Hypotheses (Figure) is a collection of Hypothesis elements. A hypothesis has to be proposed by someone and has a corresponding statement. It may have a relationship with a region of an organism, with a cell, or with a compartment of a neural cell. It has a unique identity (ID) at the LearningDatabase scale.

3.3.5 Simulations

The Simulations element is the collection of simulations. A Simulation (Figure) can be configured in a SimulationEnvironment (for example: Matlab, C++, Java), with the corresponding program code saved in SimulationResource (location information).
A simulation may include more than one computational model at a time (Figure). For example, given two computational models that separately define the electrical properties of the Na+ channel and the K+ channel, in a simulation that includes one neural cell with three compartments (soma, axon, and dendrite), the computational models of the Na+ and K+ channels can be applied, or bound, to the membrane of each of the three compartments, but with their variables under different initial conditions (InitialConditions), stop conditions (StopConditions), and parameter settings (Assignments). The Connections element (Figure) records the application of such bindings. As a result, the simulation results (Figure) of the same computational model bound to different compartments will differ.

The representation of simulation results can be graphs (curves) or tables in which each axis or column saves the values of one variable during one simulation. Being limited to 3D graphs, the Graph element may have X, Y, and Z axes. Each axis saves a variable (referring to the ID of the variable in the computational model) and its complete list of values, either inline or as an external resource such as a txt or xml file. The Tables element is the collection of TableData elements, each of which saves one variable and its values for each step of the simulation, as a list or an external file.

If we discover some abnormal phenomena or new observations in a simulation, we can record them in the Discovery element (Figure). The cell type and the region of the organism can serve as retrieval information. The observed results are recorded in Observation elements with the detailed compartment, electrical property, description, and measurement.
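Putting the pieces of this section together, a Simulation instance could be sketched as below with `ElementTree`. The element names (SimulationEnvironment, SimulationResource, InitialConditions, StopConditions, Graph) come from the description above; their nesting, attribute names, and values are illustrative guesses, not the schema's exact shape:

```python
import xml.etree.ElementTree as ET

# Sketch of one Simulation: environment, code location, run
# conditions, and a 2D result graph referencing model variables.
sim = ET.Element("Simulation", ID="s1")
ET.SubElement(sim, "SimulationEnvironment").text = "Matlab"
ET.SubElement(sim, "SimulationResource", resource="models/s1.zip")
init = ET.SubElement(sim, "InitialConditions")
ET.SubElement(init, "Assignment", variable="v", value="-65")
ET.SubElement(sim, "StopConditions", time="1000")
graph = ET.SubElement(sim, "Graph")
ET.SubElement(graph, "X", variable="t", resource="results/s1_t.txt")
ET.SubElement(graph, "Y", variable="v", resource="results/s1_v.txt")

sim_xml = ET.tostring(sim, encoding="unicode")
```

Keeping the bulk values in external resource files, as the axes do here, matches the schema's option of storing a file location instead of every value inline.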
Connections serve two functions. In a single-neural-cell simulation with multiple computational models bound to different compartments, a connection can record the link between a compartment or property (ConnectTo element) and a computational model (From element), the link between compartments, or the link between a compartment and a property. In a neural network, it can record the connection between different neural cells.

3.3.6 Constraints

We list the constraints, including the key definitions (Table) and the key reference definitions (Table):

Key            | Scale            | Selection                              | Field
NeuronID       | LearningDatabase | ./BioNeurons/BioNeuron                 | ID
NeuronModelID  | LearningDatabase | ./NeuronModels/NeuronModel             | ID
PropertyID     | LearningDatabase | ./BioNeurons/Properties/Property       | @ID
CompartmentID  | LearningDatabase | ./BioNeurons/Compartments/Compartment  | @ID
HypothesisID   | LearningDatabase | ./Hypotheses/Hypothesis                | ID
SimulationID   | LearningDatabase | ./Simulations/Simulation               | @ID
ReferenceID    | LearningDatabase | ./References/Reference                 | @ID
ParameterID    | NeuronModel      | Parameter                              | ID
VariableID     | NeuronModel      | Variable                               | ID
EquationID     | NeuronModel      | Equation                               | ID

KeyRef         | Refers to     | Selector                                                                                       | Field
HypothesisRef  | HypothesisID  | NeuronModels/NeuronModel/BiologicalQuestion                                                    | Hypothesis
NeuronModelRef | NeuronModelID | Simulations/Simulation/NeuronModels/NeuronModel                                                | @ID
CompartmentRef | CompartmentID | NeuronModels/NeuroModel/*|Simulations/*|BioNeurons/BioNeuron/Compartments                      | Compartment
PropertyRef    | PropertyID    | NeuronModels/NeuroModel/*|Simulations/*|BioNeurons/Compartments/*|BioNeurons/BioNeuron/Compartments/Compartment | Property

4. Learning Algorithm (Why)

In this project we try to simulate a function of the brain by using the database.
The brain not only serves as a database that stores information, but also has the capacity to learn from the information it has met. A learning database, as we name it, is firstly a database that stores information for neuron science, and secondly it learns from the information it has stored. People learn in different ways under different conditions. The case where we have a teacher to give the right answer is the easiest, and we usually name it supervised learning. In the cases where we have to find the answers by ourselves, if we receive a negative or positive compensation for wrong or right answers, it is the case of reinforcement learning (for example, learning to ride a bicycle); but without such feedback, we can only rely on some intuition. Problems such as classification and clustering belong to unsupervised learning: we can only try to put similar things together and guess how many classes or clusters may exist.

How do we define the similarity for a class or a cluster? We have to give a formal definition, such as the distance between the center of the cluster and the object we want to classify. But what is the center of a cluster? How can we find the center, and what is the procedure or algorithm to find it? Such questions have to be answered before we can really learn something.

Firstly, let us look at a simple example (Figure). In this case we easily identify by eye the 4 clusters into which the data can be divided. But how can the computer distinguish them? What is the criterion by which the computer puts one point into one of the clusters? A more difficult example, as in our case: we want to identify whether the experiment or simulation results in our database can be classified into any of the curves shown in the figure (the data samples for learning are created in Matlab by defining different combinations of elementary functions: exponentials, logarithms, and trigonometric functions).
And worse, we do not know in advance which kinds of typical curves we could have. How can we cluster the similar curves together?

In this project, the learning algorithm works as a clustering machine whose function is to find the cluster of each experiment or simulation result and to supply the possibility of retrieving similar results when given a sample. There are many unsupervised learning algorithms for the clustering problem, such as k-means Competitive Learning [7], Kohonen Competitive Learning, Fuzzy C-Means Competitive Learning, and hierarchical clustering algorithms.

K-means CL (MacQueen, 1967) is one of the simplest unsupervised learning algorithms that solves the well-known clustering problem. The standard k-means algorithm calculates the distance between the input vector and the centre vectors (prototypes) (Figure). The distance is usually defined as the Euclidean norm:

d(x, w_j) = \|x - w_j\| = \sqrt{\sum_{i=1}^{D} (x_i - w_{ji})^2}, \quad j = 1, \dots, N    (4)

where x is the input vector, w_j is the prototype vector, D is the vectors' dimension, and N is the number of prototypes. The prototype with minimum distance is named the winner:

w_{winner} = \arg\min_{w_j} d(x, w_j)    (5)

The winner prototype is updated, with a reducing learning rate \eta, towards the input:

w_{winner} \leftarrow w_{winner} + \eta (x - w_{winner})    (6)

This reduction of the learning rate makes each prototype vector the mean of all cases assigned to its cluster and guarantees convergence of the algorithm to an optimum value of the error function:

E = \sum_{k} \sum_{x_i \in C_k} \|x_i - w_k\|^2    (7)

where x_i is an input vector classified in cluster C_k and w_k is the prototype of cluster C_k. The algorithm is as follows (Figure): (1) initialize the prototypes; (2) run the competition to find the winners for all inputs; (3) if the convergence criterion is satisfied, stop; otherwise update the winner prototypes and repeat from step (2).
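The equations above condense into a short pure-Python sketch of batch k-means: Euclidean distance, winner selection, and moving each prototype to the mean of its assigned inputs, which is what the reducing learning rate achieves. The two-blob test data and the optional `init` argument are invented for illustration:

```python
import math
import random

def kmeans(inputs, k, max_iter=100, init=None):
    """Batch k-means: assign every input to its nearest prototype
    (the 'winner'), then move each prototype to the mean of the
    inputs assigned to it, until the assignment stops changing."""
    prototypes = [list(p) for p in
                  (init if init is not None else random.sample(inputs, k))]
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for x in inputs:
            # Eq. (4)-(5): Euclidean distance; winner = nearest prototype
            winner = min(range(k), key=lambda j: math.dist(x, prototypes[j]))
            clusters[winner].append(x)
        new_prototypes = [
            [sum(comp) / len(pts) for comp in zip(*pts)] if pts else prototypes[j]
            for j, pts in enumerate(clusters)
        ]
        if new_prototypes == prototypes:   # winners no longer change
            break
        prototypes = new_prototypes
    return prototypes, clusters
```

Note the `if pts else prototypes[j]` guard: it is exactly the "dead units" problem discussed below — a prototype that wins nothing is simply left where it is.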
no yes Update the winner prototype Competition to find the winners for all inputs End Figure – LBD The convergence criterion is defined by the percentage of the number of changed winners for all inputs. The classical k-means algorithm has the “dead units” problem. That is, some prototypes may never win the competition, so it may never be updated. The result is the “dead units” can’t really represent prototypes. Furthermore, we need to know the exact number of cluster k, before performing the data clustering. Otherwise, it will lead to poor clustering performance. The resulting clusters depend on the initial random assignments. It minimizes intra-cluster variance, but does not ensure that the result has a global minimum of variance. The time consummation is O(N2). N is the total size of inputs. Kohonen Competitive Learning (Kohonen, 1995/1997; Hecht-Nielsen 1990): one of the “Kohonen network”, the Vector Quantization-competitve networks can be viewed as unsupervised algorithm that is closely related to k-means cluster analysis. The prototype Page 22 Master Project-Learning Database Informatique Yuanjian Wang Zufferey vector is moved a certain proportion of the distance between it and the training case, the proportion being specified by the learning rate, that is: (8) Kohonen’s learning law with a fixed learning rate does not converge. As is well known from stochastic approximation theory, convergence requires the sum of the infinite sequence of learning rates to be infinite, while the sum of squared learning rates must be finite (Koheonen, 1995, p.34). In this case, the learning rate has to be reduced in a suitable manner. These requirements are satisfied by MacQueen’s k-means algorithm. The prototypes are randomly initialized from the input vector values. The algorithm is defined as following (Figure-): Start Initialize the prototypes Competition to find the winner for one input Update the winner prototype no Whole set of inputs ? 
yes End Figure – LBD The main advantages of this algorithm are its simplicity and speed which allows it to run on large datasets. But as in K-means algorithm, the clustering results of Kohonen Competitive Learning depend on the initialization of the prototypes and may produce the “dead units”. The level of time consummation is O(N M). N is the total size of inputs and M is the total cluster number. Fuzzy c-means: (FCM) is a method of clustering which allows one piece of data to belong to two or more clusters. This method (developed by Dunn in 1973 and improved by Bezdek in 1981) is frequently used in pattern recognition. In fuzzy clustering, each point has a degree of belonging to clusters, as in fuzzy logic, rather than belonging completely to just one cluster. Thus, points on the edge of a cluster, may be in the cluster to a lesser degree than points in the center of cluster. For each point x we have a coefficient giving the degree of being in the kth cluster uk(x). Usually, the sum of those coefficients is defined to be 1: The fuzzy c-means algorithm is very similar to the k-means algorithm: 1. Choose a number of clusters. 2. Assign randomly to each point coefficients for being in the clusters. Page 23 Master Project-Learning Database Informatique Yuanjian Wang Zufferey 3. Repeat until the algorithm has converged (that is, the coefficients' change between two iterations is no more than ε, the given sensitivity threshold) : Compute the centroid for each cluster. 4. For each point, compute its coefficients of being in the clusters. The algorithm minimizes intra-cluster variance as well, but has the same problems as kmeans, the minimum is a local minimum, and the results depend on the initial choice of weights. The Expectation-maximization algorithm is a more statistically formalized method which includes some of these ideas: partial membership in classes. It has better convergence properties and is in general preferred to fuzzy-c-means. 
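The k-means competitive learning loop described above (competition for all inputs, update of the winner prototypes to the cluster means, repeat until the winners stop changing) can be sketched in Python. This is an illustrative sketch, not the project's PL/SQL procedures; the sample points and starting prototypes are hypothetical:

```python
import math

def kmeans_cl(samples, prototypes, max_iter=100):
    """Batch k-means competitive learning: each input is assigned to the
    winner (nearest prototype), then each prototype is moved to the mean
    of its assigned inputs, until no winner changes."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    protos = [list(p) for p in prototypes]
    winners = [None] * len(samples)
    for _ in range(max_iter):
        new_winners = [min(range(len(protos)), key=lambda i: dist(s, protos[i]))
                       for s in samples]
        if new_winners == winners:      # convergence: no winner changed
            break
        winners = new_winners
        for i in range(len(protos)):    # move each prototype to the mean
            members = [s for s, w in zip(samples, winners) if w == i]
            if members:                  # a prototype with no members stays put: a "dead unit"
                protos[i] = [sum(col) / len(members) for col in zip(*members)]
    return protos, winners

samples = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
protos, winners = kmeans_cl(samples, [(0.0, 0.1), (4.0, 4.0)])
print(winners)  # [0, 0, 1, 1]
```

The per-cluster mean update is the limit of the learning-rate update (6) with eta = 1/n_c, which is why the batch loop converges to the same prototypes.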
Hierarchical Clustering Algorithms: given a set of N items to be clustered and an N*N distance (or similarity) matrix, the basic process of hierarchical clustering (defined by S.C. Johnson in 1967) is this:
1. Start by assigning each item to its own cluster, so that with N items you have N clusters, each containing just one item. Let the distances (similarities) between the clusters be the same as the distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one cluster less.
3. Compute the distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.

Why did we choose the competitive learning algorithms? We chose the Kohonen competitive learning algorithm for the first pass (initialization) and the k-means competitive learning algorithm afterwards for three principal reasons. First, the simplicity of the competitive learning algorithms is a big advantage. Secondly, hierarchical clustering methods are unsuitable for isolating spherical or poorly separated clusters. Lastly, the k-means competitive learning algorithm alone can take a long time to converge to a solution, depending on how appropriate the reallocation criterion is to the structure of the data; in our case, however, the structure of the data is unknown. To reduce the number of final "dead units" and to improve the initialization of the prototypes, we have taken some strategies into account; we discuss them in section 5 (Implementation).

5. Implementation

This project is implemented in Oracle using the PL/SQL language.
5.1 Architecture of the Oracle Implementation

Figure – (architecture: biological neuron and biological experiment tables; computational model and computational simulation tables; learning database tables; the triggers that abstract the data; the queries' view definitions; and the learning procedures and functions)

The architecture of the Oracle implementation is shown in Figure –. There are three basic groups: the knowledge about the biological neurons and the computational models, the experiments and simulations, and the learning database. To supply the corresponding views as defined in the object-oriented definition, we defined the queries' views. To automatically abstract the experiment and simulation results, we defined triggers on the corresponding tables that import external data into the learning database tables. Learning procedures and functions are defined to execute the competitive learning algorithm that finds the clusters. In the following subsections, we introduce the corresponding definitions for each part.

5.1.1 Biological Neuron

There are 9 tables that save the biological neuron information (Figure –):

tb_organism: saves the information about the biological organisms. It includes the name, description and identity (id).

tb_neuroncell: saves the biological neural cell information. It includes the name, canonical form (canonicalform), cell identity (cell_id), the identity of the organism the cell is situated in (id) and the classification of the neural cell (cell_type). The canonical form describes a standard way of presenting that cell. The interoperation of one neural cell is saved in the table tb_interoperation, and its composition of compartments in tb_compartmentcomposition.

tb_compartment: saves the possible compartments of one neural cell, for example the somas, axons or dendrites. It is defined by its identity (id), name and type.
Its properties are saved in tb_compartmentcomposition.

tb_property: the general properties, such as the electrical properties of ion channels or the receptors on the membrane of a compartment, are described in tb_property. It has an identity (id), type, name and description.

tb_compartmentcomposition: builds the connection between the compartments and the properties of one neural cell. A compartment of one neural cell may have more than one property definition, and a general property may be included in more than one compartment of a neural cell.

tb_experiment: a biological experiment is recorded and identified by experimentid. The executer, description and execution date (experimentdate) are saved.

tb_experiment_result: for each experiment we may record more than one measurement. A measurement may be taken on a property of a compartment of one cell, so we save the experiment identity (tb_experiment_experimentid), cell identity (cellid), compartment identity (compartmentid) and property identity (propertyid). For each measurement we record its name, unit and size (datasize). All values of one measurement are found in an external file (datalocation).

tb_interoperation: the microscopy data and gene/chromosome information of one neural cell can be saved in external files, which are found through the file location of each piece of information (microscopydata, genechromosome).

tb_observation: saves the observations that we record in one experiment. This table can also contain the observations of a simulation of computational models. The source_type decides whether it is an observation of an experiment or of a simulation, and sim_or_exp_id saves the corresponding simulationid or experimentid. Each observation is unique by its id, and its description is saved.
5.1.2 Computational Model

There are 9 tables that save the computational model information (Figure –):

tb_neuronmodel: saves the basic information about the model: its identity (modelid), name, description, author and the time when it was built. If there are additional external files about the model, they can be found through the additionalfile field, which saves the file location. The definitions of the equations, variables and parameters are saved in the tables tb_equation, tb_parameter and tb_variable. More information for retrieval, such as keywords, the biological questions concerned and the references it has used, is saved in the tables tb_keywords, tb_biologicalquestion and tb_refered.

tb_variable: a variable may be bound to some biological information, such as a neural cell (cellid), a compartment of a neural cell (compartmentid) or a property of a compartment (propertyid); or it may signify some region of an organism or some other biological information (divers). A variable is represented by its symbol and name, and it may have a unit. It belongs to a computational model (modelid) and is identified by its id within the model.

tb_parameter: like a variable, a parameter may be bound to some biological information, such as a neural cell (cellid), a compartment of a neural cell (compartmentid) or a property of a compartment (propertyid); or it may signify some region of an organism or some other biological information (divers). A parameter is represented by its symbol and name, and it may have a unit. It belongs to a computational model (modelid) and is identified by its id within the model.

tb_equation: an equation is an expression of variables and parameters. Apart from its biological meaning, it is identified by its id within a model and has an expression. Each equation may belong to a different equation group.

tb_variable_member: saves, for each equation, the membership of its variables.

tb_parameter_member: saves, for each equation, the membership of its parameters.
tb_refered: a model may refer to a reference, which can be a paper or a book.

tb_keywords: the keywords that have been defined for a model are saved in this table. Each keywordid signifies a unique keyword, which can be used by more than one model.

tb_biologicalquestion: a computational model may reply to a biological question in some research area and talk about certain topics. The detail of this table is introduced in the next section.

5.1.3 Biological question

There are 8 tables that save the biological question information (Figure –):

tb_biologicalquestion: a computational model may reply to the questions that scientists have asked. For retrieval, we may need to know which research area or topics the question is about, so we supply a unique identity for each question. To reply to a biological question, we may rely on known hypotheses (tb_hypothesis) or references (tb_refered). The conclusions for the biological question are saved in the table tb_conclusion, and the biological contributions of the computational model in replying to the question are listed in the table tb_contribution.

tb_hypothesis: a hypothesis is made by one person (author) and can be shared by more than one biological question. If it is a clearly defined hypothesis on a compartment of a neural cell (cellid, compartmentid), on a neural cell itself (cellid), or about a region of an organism, we can record the corresponding information in this table. The statement is the content of the hypothesis, and it has one unique identity (id).

tb_based_hypothesis: records, for each biological question, the hypotheses it is based on.

tb_reference: records all kinds of references that can be used in this database. The type defines whether it is a book, a paper or another published article. Description is the detailed information about the reference itself, for example the author, publication date, publisher, etc.
It has the unique identity referenceid.

tb_refered: records, for a computational model (tb_neuronmodel_modelid) and a biological question (questionid), the identity (referenceid) of the referred reference.

tb_contribution: a contribution is the special devotion of the computational model when it tries to reply to one biological question (questionid). There may be more than one contribution, and the contents can be indexed by importance or other characteristics (indexofcontribution).

tb_conclusion: the conclusions replying to one biological question (questionid) are saved in this table, with a structure similar to that of tb_contribution; the multiple conclusion contents can be indexed (indexofconclusion, content).

5.1.4 Simulation

There are 7 tables used to save the information about simulations (Figure –):

tb_simulation: a simulation may be based on one computational model or on a combination of multiple computational models. Each simulation is identified by its id. The simulationresource field saves the external files' location (we suppose all the files are in one zip), and simulationenvironment describes which kind of tool is used to simulate the computational models (such as Matlab, Java or NEURON). The simulated time is also stored.

tb_modelsimulation: each computational model (modelid) that is simulated in one simulation (id) may be applied to a concrete biological environment. For example, given two computational models M1 and M2, we can apply M1 to the K+ channel (propertyid) on the membrane of the soma (compartmentid) of one neural cell (cellid), and M2 to the Na+ channel (propertyid) on the membrane of the axon (compartmentid) of the same neural cell. In the description we can describe the connection between the two compartments of the cell. Each computational model (M1 and M2) may have its own parameter settings (tb_assignment).
The simulation's start condition (tb_initialcondition) and stop condition (tb_stopcondition) may differ for each model. For each model, we can record the variables' values during the simulation period (tb_resulttable) and construct graphs based on these values (tb_resultgraph).

tb_initialcondition: for each simulation (simulationid), the initial value of each variable (variableid) of each of its computational models (modelid) is stored in this table.

tb_stopcondition: for each simulation (simulationid), the stop value of each variable (variableid) of each of its computational models (modelid) is stored in this table.

tb_assignment: for each simulation (simulationid), the assigned value of each parameter (parameterid) of each of its computational models (modelid) is stored in this table.

tb_resulttable: the values of one variable (variableid) of a model (modelid) recorded during a simulation (simulationid). The total number of values (resultsize) and the external file (datalocation) that saves these values as a column of data are stored.

tb_resultgraph: a 3-D graph produced from the variables' simulation results can be saved in this table. Xvariableid, yvariableid and zvariableid describe the data sources of the x, y and z axes (the corresponding values can be found in tb_resulttable). The produced graph is saved at the graphsource location.

tb_observation: as described in section 5.1.1.

5.1.5 Definition of Views

v_organism_cell: view of the biological cells of one organism. The sources of the view are tb_neuroncell and tb_organism.

v_neural_cell: view of a biological cell: a detailed neural cell with its compartments and properties. The sources are tb_neuroncell, tb_compartment, tb_property and tb_compartmentcomposition.
v_neural_cell_experiment: view of the experiments on neural cells: shows the experiments that have been done on the neural cells. The sources are tb_neuroncell, tb_experiment and tb_experiment_result.

v_experiment_observation: view of the observations of an experiment. The sources are tb_experiment and tb_observation.

v_computational_model: view of a computational model: a detailed computational model, including the basic model description and its equation definitions. The sources are tb_neuronmodel and tb_equation.

v_equation_variables: view of the equations' variables: detailed equation descriptions with the definition of the variables. The sources are tb_equation, tb_variable and tb_variable_member.

v_equation_parameters: view of the equations' parameters: detailed equation descriptions with the definition of the parameters. The sources are tb_parameter_member, tb_parameter and tb_equation.

v_model_reference: view of the references referred to by a computational model. The sources are tb_neuronmodel, tb_refered and tb_reference.

v_model_keywords: view of the keywords used in each computational model. The sources are tb_neuronmodel, tb_keywords and tb_keyword.

v_model_hypothesis: view of the hypotheses of the biological questions. The sources are tb_neuronmodel, tb_biologicalquestion, tb_based_hypothesis and tb_hypothesis.

v_model_conclusion: view of the conclusions of the biological questions. The sources are tb_neuronmodel, tb_biologicalquestion and tb_conclusion.

v_model_contribution: view of the contributions of a neural model. The sources are tb_neuronmodel, tb_biologicalquestion and tb_contribution.

v_model_question_reference: view of the references of the biological questions. The sources are tb_neuronmodel, tb_biologicalquestion, tb_refered and tb_reference.

v_simulation_models: view of the detailed composition of a simulation: it defines the models involved in one simulation.
The sources are tb_simulation, tb_modelsimulation and tb_neuronmodel.

v_simulation_observation: view of the observations of a simulation. The sources are tb_simulation and tb_observation.

v_simulation_startcondition: view of the start conditions of one simulation. The sources are tb_modelsimulation, tb_simulation, tb_initialcondition and tb_variable.

v_simulation_stopcondition: view of the stop conditions of one simulation. The sources are tb_neuronmodel, tb_modelsimulation, tb_simulation, tb_stopcondition and tb_variable.

v_simulation_assignment: view of the parameter settings of one simulation. The sources are tb_modelsimulation, tb_simulation, tb_assignment and tb_parameter.

v_simulation_result_list: view of the recorded results of one simulation. The sources are tb_modelsimulation, tb_variable, tb_simulation and tb_resulttable.

v_experiment_cluster: view of the clustered samples that come from experiments. The sources are tb_sample, v_neural_cell_experiment and tb_cluster.

v_neural_cell_experiment: view of the clustered samples that come from simulations. The sources are tb_sample, tb_cluster and v_simulation_result_list.

5.1.6 Competitive Learning

There are 4 tables used to save the information needed to apply the competitive learning algorithm (Figure –).

tb_sample: this table is the summary of all the experiments' and simulations' results. It is the interface between the learning database and the basic information, including the biological neuron information and the computational information. Each data item (a data item is a one-dimensional array that saves the values of one measurement in an experiment or of one variable in one simulation) can originate from a simulation or an experiment (sim_or_exp_id).
If it comes from an experiment, it may include the cellid, compartmentid and propertyid information abstracted from the experiment source; if it comes from a simulation, it may include the modelid and variableid information. Each data item has unit and size information. Datalocation stores the external file from which the values can be read. Once all the values have been read from the external file, data_fromindex and data_toindex save the values' location in tb_cluster_sample_values. Each value of one data item is saved in the table tb_cluster_sample_values, and the data index (dataindex) is the unique identity that serves as the index and primary key.

tb_cluster_sample_values: the values of one data item, read from the external files, are saved in this table. The order of the values in the original file is kept, and each value has a data index (dataindex) that is unique per value. All the values are saved in one vertical column; in use, we need to transform each range of values belonging to one data item into a horizontal table.

tb_cluster: records the clustering result for each data item. At each cluster level, each data item belongs to only one cluster, and each cluster is identified by its clusterid and clusterlayer. By default, the first level of the cluster layer is the clustering defined by the unit and data size; within each cluster with the same unit and data size, we apply the competitive learning algorithm to find the corresponding clusters.

tb_prototype: each cluster has one prototype. A prototype is identified by its clusterid, layer (cluster level) and prototypeid. Its vector values are saved in the table tb_cluster_sample_values, where data_fromindex and data_toindex save the values' location (data_fromindex <= value.dataindex <= data_toindex). Each prototype represents one group of data items with the same unit and size (prototypesize). For quality analysis, the prototype values can be written out to an external file at datalocation.
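The vertical-to-horizontal transformation described for tb_cluster_sample_values can be illustrated with a small Python sketch. The dictionaries below are hypothetical in-memory stand-ins for the tables, not the PL/SQL code:

```python
# Hypothetical stand-in for tb_cluster_sample_values (dataindex -> value)
# and for tb_sample (data item -> (data_fromindex, data_toindex) range).
values = {1: 0.1, 2: 0.4, 3: 0.9, 4: -0.2, 5: 0.0, 6: 0.3}
samples = {"exp_7": (1, 3), "sim_2": (4, 6)}

def sample_vector(sample_id):
    """Rebuild the horizontal value vector of one data item from the
    vertical column, keeping the original file order (by dataindex)."""
    lo, hi = samples[sample_id]
    return [values[i] for i in range(lo, hi + 1)]

print(sample_vector("exp_7"))  # [0.1, 0.4, 0.9]
```

The same index-range convention (data_fromindex <= dataindex <= data_toindex) is used to locate the prototype vectors of tb_prototype.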
Figure –

5.2 Constraints

The constraints on the tables are defined as follows (Table –):

tb_property. Primary key: id.
tb_simulation. Primary key: id.
tb_organism. Primary key: id.
tb_reference. Primary key: referenceid.
tb_initialcondition. Primary key: (simulationid, modelid, variableid).
tb_stopcondition. Primary key: (modelid, simulationid, variableid).
tb_resultgraph. Primary key: (simulationid, modelid).
tb_compartment. Primary key: id.
tb_variable. Primary key: (id, modelid).
tb_variable_member. Foreign keys: (id, modelid) REFERENCES tb_variable (id, modelid); (tb_equation_id, modelid) REFERENCES tb_equation (id, modelid).
tb_neuronmodel. Primary key: modelid.
tb_neuroncell. Primary key: cell_id. Foreign key: (id) REFERENCES tb_organism (id).
tb_resulttable. Primary key: (modelid, variableid, simulationid).
tb_experiment_result. Primary key: (compartmentid, propertyid, cellid, tb_experiment_experimentid). Foreign key: (tb_experiment_experimentid) REFERENCES tb_experiment (experimentid).
tb_assignment. Primary key: (modelid, simulationid, parameterid).
tb_observation. Primary key: observationid.
tb_biologicalquestion. Primary key: questionid.
tb_experiment. Primary key: experimentid.
tb_parameter. Primary key: (id, modelid).
tb_parameter_member. Primary key: (id, modelid). Foreign keys: (tb_equation_id, modelid) REFERENCES tb_equation (id, modelid); (id, modelid) REFERENCES tb_parameter (id, modelid).
tb_cluster. Primary key: (dataindex, clusterid, clusterlayer).
tb_sample. Primary key: dataindex.
tb_prototype. Primary key: (clusterid, prototypeid, layer).
tb_cluster_sample_values. Primary key: dataindex.
tb_temp_values. Primary key: dataindex.
tb_equation. Foreign key: (id) REFERENCES tb_neuronmodel (modelid).
tb_interoperation. Foreign key: (cell_id) REFERENCES tb_neuroncell (cell_id).
tb_discovery. Foreign keys: (observationid) REFERENCES tb_observation (observationid); (tb_bio_questionid) REFERENCES tb_biologicalquestion (questionid).
tb_contribution. Foreign key: (tb_bio_questionid) REFERENCES tb_biologicalquestion (questionid).
tb_refered. Foreign keys: (tb_bio_questionid) REFERENCES tb_biologicalquestion (questionid); (tb_neuronmodel_modelid) REFERENCES tb_neuronmodel (modelid).
tb_keywords. Foreign key: (id) REFERENCES tb_neuronmodel (modelid).
tb_modelsimulation. Foreign keys: (id) REFERENCES tb_simulation (id); (modelid) REFERENCES tb_neuronmodel (modelid).
tb_hypothesis. Foreign key: (tb_bio_questionid) REFERENCES tb_biologicalquestion (questionid).
tb_conclusion. Foreign key: (tb_bio_questionid) REFERENCES tb_biologicalquestion (questionid).
tb_compartmentcomposition. Foreign keys: (property_id) REFERENCES tb_property (id); (compartment_id) REFERENCES tb_compartment (id); (cell_id) REFERENCES tb_neuroncell (cell_id).

Table –

5.3 Triggers

Strategy to keep the integrity between the base tables and the cluster structure (Table –): triggers are defined on the tables tb_experiment_result and tb_resulttable to keep the data source and the cluster source consistent.

t_experiment_delete (on tb_experiment_result): after deleting one experiment result, we delete the corresponding data in the tables tb_cluster, tb_sample and tb_cluster_sample_values.
t_experiment_result_insert (on tb_experiment_result): after inserting a new experiment result, we insert the data saved in its data file into the tables tb_cluster, tb_sample and tb_cluster_sample_values. If clusters for the corresponding data size and unit are already defined in the learning database, we calculate the cluster of the new result.
t_experimentresult_update (on tb_experiment_result): after an experiment result has been modified, we modify the tables tb_cluster, tb_sample and tb_cluster_sample_values to hold the same data content. At the same time, if clusters for the corresponding data size and unit are defined in the learning database, we recalculate the cluster of the modified result.
t_resulttable_delete (on tb_resulttable): after deleting one simulation result, we delete the corresponding data in the tables tb_cluster, tb_sample and tb_cluster_sample_values.
t_resulttable_insert (on tb_resulttable): after inserting a new simulation result, we insert the data saved in its data file into the tables tb_cluster, tb_sample and tb_cluster_sample_values. If clusters for the corresponding data size and unit are defined in the learning database, we calculate the cluster of the new result.
t_resulttable_after_update (on tb_resulttable): after a simulation result has been modified, we modify the tables tb_cluster, tb_sample and tb_cluster_sample_values to hold the same data content. At the same time, if the clusters are defined in the learning database, we recalculate the corresponding cluster of the modified result.

Table –

5.4 Competitive learning algorithm implementation

The competitive learning algorithm is applied to each homologous data group, that is, to results that share the same unit and data size. For example, a simulation result whose variable has the unit 'mV' and a recorded value array of length 5000 is homologous with an experiment result measuring the membrane potential on the soma of a neural cell in 'mV' with 5000 measured values. We suppose that all the simulation and experiment results are saved in external files that can be accessed by the user in the authorized Oracle directory. As mentioned in section 5.1.6, four tables are involved in the competitive learning procedure: tb_sample, tb_cluster_sample_values, tb_cluster and tb_prototype. Each insertion of a simulation result (tb_resulttable) or experiment result (tb_experiment_result) triggers the corresponding import procedure, which reads the file at the external location into the tb_sample and tb_cluster_sample_values tables. Update or delete operations on these two source tables trigger the corresponding update or delete procedures on tb_sample, tb_cluster_sample_values and tb_cluster. Once we have enough data samples to cluster, we can start the clustering procedure by initializing the prototypes and executing the competitive learning algorithm.
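The grouping of homologous results by unit and data size can be illustrated with a small Python sketch. The sample metadata below is hypothetical; the actual grouping is done in PL/SQL over tb_sample:

```python
from collections import defaultdict

# Hypothetical (sample id, unit, data size) tuples, as read from tb_sample.
samples = [
    ("exp_1", "mV", 5000),
    ("sim_4", "mV", 5000),
    ("exp_2", "nA", 5000),
    ("sim_9", "mV", 2000),
]

# Homologous group: same unit AND same data size. Competitive learning
# is run separately inside each group.
groups = defaultdict(list)
for sample_id, unit, size in samples:
    groups[(unit, size)].append(sample_id)

print(dict(groups))
# {('mV', 5000): ['exp_1', 'sim_4'], ('nA', 5000): ['exp_2'], ('mV', 2000): ['sim_9']}
```

Only exp_1 and sim_4 end up in the same group, even though exp_2 has the same length and sim_9 the same unit.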
To get a better initialization of the prototypes at the starting point of the competitive learning procedure, extra effort is needed to check the randomly chosen prototypes and to reject choices in which too many prototypes overlap. The overlap can be measured by the correlation between two prototypes: a pair of prototypes with a high correlation (for example, more than 0.9) may end up representing the same cluster. We therefore check the correlations between the prototypes to find the highly correlated pairs, and a criterion defines how many correlated pairs the initialization accepts. If no choice satisfies the criterion, there may be no such choice at all, and we then have to increase the tolerated number of correlated pairs; a maximum for this criterion has to be defined, otherwise the search would loop infinitely. First, to avoid exhaustive tries, we limit the search for better choices to a reasonable number of iterations. Secondly, since we cannot know the number of clusters in advance and missing clusters is not what we want, we define a rather high ratio of prototypes with respect to the total number of data samples. This may lead to more dead units, but it guarantees that poorly separated clusters are not missed. Once the initialization has found satisfactory prototypes, competitive learning begins; once all the samples are clustered, the initialization pass of the competitive learning algorithm is finished.

When a new sample is inserted into the database, it is clustered by the competitive learning algorithm, and so the learning procedure continues. To obtain better accuracy, we suggest that once the database has grown by a considerable amount, all the prototypes are reinitialized with a bigger ratio and the competitive learning is executed again on all the samples.
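The correlation-based rejection of initial prototypes can be sketched as follows. This is an illustrative Python sketch of the strategy; the function names and limits are hypothetical, and the actual implementation is the project's PL/SQL procedures:

```python
import random

def correlation(a, b):
    """Pearson correlation between two equal-length vectors."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va ** 0.5 * vb ** 0.5)

def init_prototypes(samples, n_proto, max_corr=0.9, max_pairs=0, tries=100):
    """Pick n_proto samples as initial prototypes; retry while more than
    `max_pairs` prototype pairs are correlated above `max_corr`.
    If no acceptable choice is found within `tries`, relax `max_pairs`
    (this is the growing tolerance criterion described in the text)."""
    while True:
        for _ in range(tries):
            protos = random.sample(samples, n_proto)
            pairs = sum(
                1
                for i in range(n_proto)
                for j in range(i + 1, n_proto)
                if abs(correlation(protos[i], protos[j])) > max_corr
            )
            if pairs <= max_pairs:
                return protos
        max_pairs += 1  # tolerate one more correlated pair and retry

protos = init_prototypes([[1.0, 2.0, 3.0], [3.0, 1.0, 2.0], [2.0, 3.0, 1.0]], 2)
print(len(protos))  # 2
```

Bounding both the number of tries and the tolerance keeps the search finite, as the text requires.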
The flow of executing the corresponding SQL files in Oracle is shown in Figure –:
1. Create the user's name and directory.
2. Create the tables: createtable_ra.sql.
3. Insert the samples' data (createdatasample of initializecluster.sql).
4. Import the external data from the files (initializesample).
5. Initialize the prototypes (by unit and data size): initializeallprototype.sql.
6. Execute the competitive learning algorithm (by unit and data size): execALLCompetiveLearning.sql.

Figure –

6. Queries

We decided to give the user the freedom to choose the queries he wants. The 2007 semester project "EasyQueries" of the Laboratory of Databases (author: Ariane Pasquier) supplied a good tool to exploit all the possibilities of querying the database. As a summary, we can list some examples that are common for users:

A. General queries:
1. Biological questions:
Given the neuron name or ID, list its compartments with their electrical properties, classifications, locations and interpretations.
Given the neuron name or ID, list the experiment results.
Given the type of receptor/neurotransmitter/channel, list the neurons that have such a receptor/neurotransmitter/channel.
Etc.
2. Computational questions:
Given a neuron name or ID, list its computational models.
Given the name of a theory, list the computational models based on this theory.
Given the research subject, list the possibly relevant models.
Given the neuron name, list the biological experimental data for data fitting.
Given the model name, list the simulation results (tables, graphs, conditions and parameters).
Given the properties of compartments, find the computational models based on them.
Etc.
B. Advanced queries:
Given simulation result data (a pre-saved table or graph), find the closest experiment results or simulations.
- Given an external file that stores a simulation or experiment result, find the closest experiment results or simulations.
- Given a cluster ID, find its prototype.
- Given a simulation or experiment result, find its cluster.
- Etc.

EasyQueries supplies an interface in which the user can easily define his queries (Figure-). All the tables and views in the database corresponding to the user's role can be chosen as the query object. It was originally written to query tables in a Derby database; we slightly modified it so that it can query the Oracle database. There are two query options: assisted query with QBE and handwritten queries.

Figure –

QBE (Query-By-Example) is a language for querying relational data through a graphical representation of the data. The user can use QBE keywords to retrieve, update, delete, and insert data. The graphical representation is shown in Figure-. Following the DB2 Query Management Facility (QMF) keyword definitions [8], the user can choose the queried table and the fields in the table (keywords: P.: projection, UNQ: distinct, etc.) and can easily define conditions (for example <=, >=, <>) using the supplied keywords. For a detailed explanation of this language, we refer the reader to [8]. Multiple tables can be joined using the complex link. In our case, we have already created the necessary views for users to query almost all the visible information, so an operation on a single view is sufficient to obtain the necessary information.

Figure –

In Figure- we show an example query that retrieves the view v_organism_cell: we select all the fields (P. on v_organism_cell) and choose the records whose organism_name is like 'brain'. The resulting SQL statement is previewed. Once we click on 'SEND QUERY', the result records appear below.
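To make the previewed statement concrete, here is a small self-contained illustration. The view name v_organism_cell and the column organism_name come from the example above; the second column and the sample rows are invented for the mock, and SQLite stands in for Oracle purely for demonstration.

```python
import sqlite3

# Mock the v_organism_cell view as an in-memory table (columns other
# than organism_name, and all rows, are hypothetical test data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE v_organism_cell (organism_name TEXT, cell_name TEXT)")
conn.executemany(
    "INSERT INTO v_organism_cell VALUES (?, ?)",
    [("rat brain", "pyramidal cell"),
     ("human brain", "Purkinje cell"),
     ("squid axon", "giant axon")],
)

# The QBE form (P. on every field, condition "like 'brain'") translates
# to a statement of this shape:
query = "SELECT * FROM v_organism_cell WHERE organism_name LIKE '%brain%'"
rows = conn.execute(query).fetchall()
print(rows)  # the two 'brain' records
```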
Figure –

Advanced users who are familiar with SQL can define queries manually by writing the SQL statements themselves.

7. Performance analysis

Firstly, we look at the time consumption of the competitive learning step. As mentioned before, the complexity of competitive learning is O(M*N), where M is the number of clusters and N is the total number of samples. We fixed the number of clusters and executed competitive learning on different sample sizes. The time consumption is shown in Figure-; we can conclude that it increases linearly with the number of samples.

Figure –

Secondly, we check the quality of the final prototypes by two measurements: the number of dead units and the number of correlated pairs of prototypes.

(a) (b) Figure –

Figure- shows the dead units and the number of final correlated prototype pairs for a configuration of 1000 samples with 500 dimensions each. With the number of samples and the dimension fixed, increasing the number of clusters leads to more dead units (in our case, clusters with fewer than 3 samples are defined as dead units) and to more correlated pairs of prototypes (in our case, a pair of prototypes with a correlation above 0.8 is defined as correlated). Correlated prototypes may represent the same cluster, and dead units show that some prototypes never found samples belonging to the cluster they represent.

Figure –

We studied various dimensions with the same number of samples and clusters (Figure-). The number of correlated prototype pairs did not change much (between 0 and 3 pairs), but the time consumption differed greatly (from 113 to 1171 seconds) (Figure-).
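The O(M*N) cost can be seen in a minimal sketch of the one-pass winner-take-all competitive learning loop: for each of the N samples, the winner is searched among the M prototypes and then moved toward the sample. This is illustrative Python, not the project's PL/SQL; the squared Euclidean distance and the learning rate eta are assumptions for the sketch, not necessarily the exact update rule of Section 4.

```python
def competitive_learning(samples, prototypes, eta=0.1):
    """One pass of winner-take-all competitive learning.
    For each of the N samples (outer loop) the nearest of the M
    prototypes is found (inner loop), giving the O(M*N) cost; the
    winner is then moved toward the sample by the rate eta."""
    assignments = []
    for x in samples:                                  # N iterations
        # squared Euclidean distance to every prototype: M iterations
        dists = [sum((xi - pi) ** 2 for xi, pi in zip(x, p))
                 for p in prototypes]
        w = dists.index(min(dists))                    # winner
        prototypes[w] = [pi + eta * (xi - pi)
                         for xi, pi in zip(x, prototypes[w])]
        assignments.append(w)
    return assignments
```

Each distance computation itself scans all d dimensions, which is why the dimension study above shows such large differences in running time.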
We therefore consider reducing the dimension an acceptable solution for competitive learning [12]. The prototypes' quality is the most important measurement of the success of the algorithm: we believe that the more samples the database has learned, the more correctly it can cluster a sample. To verify the correctness of the clustering results, we used labeled handwritten digit images, each representing one number from 0 to 9 (the handwritten digit data used to test the clustering precision come from the course "Unsupervised and reinforcement learning in neural networks" of Professor Wulfram Gerstner). These images have a fixed dimension (784); we set the number of clusters proportional to the total number of samples (10%) and executed the learning process on different sample sizes. We define the clustering precision as the fraction of samples whose cluster's dominant digit label matches their own label. The resulting precision is shown in Figure-. We can easily see that with more samples, the clustering becomes more precise. This confirms the suggestion that, to improve the quality of the clusters, the prototypes should be reinitialized and competitive learning executed again once the number of samples has grown by some scale; in this example, the scale is 10. The precision improved from 57.5% (200 samples, 20 clusters) to 82.5% (2000 samples, 200 clusters).

Figure –

In the case of 200 samples with 20 clusters, we could not find prototypes for every digit, but in the case of 2000 samples and 200 clusters, different prototypes were found for each digit. To show the progression, we reshaped the prototypes from 784 dimensions to 28*28 matrices and drew the digits. The 20 prototypes of the 200 samples are shown in Figure-, and 20 of the 200 prototypes found from the 2000 samples are displayed in Figure-.
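The precision measure quoted above (57.5% rising to 82.5%) can be computed by majority vote over the labels. This sketch assumes, as an interpretation of the report's measure, that each cluster is identified with its dominant digit label; the function name is our own.

```python
from collections import Counter

def clustering_precision(labels, assignments):
    """Precision of a clustering against known labels: each cluster is
    identified with its dominant (majority) label, and precision is the
    fraction of samples whose cluster's dominant label matches their own."""
    by_cluster = {}
    for label, cluster in zip(labels, assignments):
        by_cluster.setdefault(cluster, []).append(label)
    dominant = {c: Counter(ls).most_common(1)[0][0]
                for c, ls in by_cluster.items()}
    correct = sum(1 for label, cluster in zip(labels, assignments)
                  if dominant[cluster] == label)
    return correct / len(labels)
```

Dead units (clusters with no samples) simply do not appear in the assignments and so do not affect the score.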
In the two figures we can observe that in the first case only 1 prototype of digit 1 was found, against at least 4 in the second case; similarly, only 2 prototypes of digit 0 were found in the first case against at least 4 in the second. Furthermore, in the first case there are 2 prototypes for which we cannot tell which digit they represent (in the second line, the third and the fifth prototype), whereas in the second case each digit can be distinguished easily.

Figure – Figure –

Figure –

To analyze the performance further, we compared the time consumption of the competitive learning process between Oracle and Matlab. We fixed the dimension at 500 and took 1000 samples. By choosing different numbers of clusters, we obtained the time consumption shown in Figure- for Oracle and Matlab. Surprisingly, Matlab is much faster in every case: the maximum time for Matlab to execute the competitive learning algorithm during this test was only 3.5 seconds, against 4065 seconds for Oracle. At the same time, this gives great hope that the performance can be improved by using external tools such as Matlab to execute the learning algorithm, with Oracle serving as the storage and retrieval tool. We tried one possibility using the matrix package of Oracle, but for lack of examples of its use we could not finish it in the short time available.

8. Conclusion

Based on the requirements for storing biological neural information and computational model information, we designed a database that not only stores biological and computational neural information systematically, but also learns from high-dimensional data series or graph information (transformed into high-dimensional vectors) to find cluster information. The project is implemented in Oracle 10g using the PL/SQL language.
Queries are possible using EasyQueries [10]. The performance of the project was studied using data series produced by combinations of elementary functions and using handwritten digit images. Here we want to point out some problems with the implementation in Oracle:

- The greatest inconvenience we met is mathematical calculation: vector calculations cannot be executed easily in Oracle, and the performance of such procedures is extremely heavy.
- The I/O operations for loading external files into Oracle are not efficient.
- In the analysis of time consumption, Oracle was not efficient at executing the competitive learning algorithm; Matlab was obviously much better. In future work, we may consider implementing the competitive learning algorithm with external specialized components such as Matlab or C++, with Oracle serving as a storage and retrieval tool.

We have supplied a well-formed, easy-to-query knowledge base, but we lack a good visual tool for entering the knowledge. As the biological and computational information is decomposed in as much detail as possible, the user may have to input an enormous amount of information. For example, for each computational model the equations, parameters and variables must be entered separately unless an intelligent tool can extract the parameters and variables from the equations; and for each parameter or variable, without a dictionary from which the user can easily choose the meaning, entering all the descriptions or biological explanations manually is heavy work. In future work, such an intelligent editor should be supplied to the user. Furthermore, some existing databases may already contain part of this information; in that case, we may need to develop a tool to import the corresponding information automatically. As mentioned in Section 7, reducing the dimension has no great effect on the precision but greatly reduces the time consumption.
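The benefit of working in fewer dimensions can be illustrated with a simple random projection, which approximately preserves distances. This is not the adaptive method of [12], just a generic sketch of reducing the dimension before the clustering loop; the function name and the Gaussian scaling are our own choices.

```python
import random

def random_projection(samples, k, seed=0):
    """Project d-dimensional samples down to k dimensions with a random
    Gaussian matrix (entries scaled by 1/sqrt(k) so squared norms stay
    comparable).  The clustering loop then costs O(M*N*k) per pass
    instead of O(M*N*d)."""
    rng = random.Random(seed)
    d = len(samples[0])
    R = [[rng.gauss(0, 1 / k ** 0.5) for _ in range(d)] for _ in range(k)]
    return [[sum(r_i * x_i for r_i, x_i in zip(row, x)) for row in R]
            for x in samples]
```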
In future work we may think about calculating the principal components instead of calculating over all the dimensions. In this project we chose the simple one-pass Kohonen competitive learning algorithm, which can easily be adapted to the database application; but in future work, any learning algorithm can be added at the application level by defining a standard interface to retrieve data samples and return the results.

9. Acknowledgement

I thank Professor Stefano Spaccapietra for accepting my proposal for this project, and I deeply appreciate the help of Dr. Fabio Porto. I also thank my family for their non-stop support.

Reference

1. http://en.wikipedia.org/wiki/Data_clustering
2. http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/index.html
3. Neuron. (2008, May 18). In Wikipedia, The Free Encyclopedia. Retrieved 08:16, May 26, 2008, from http://en.wikipedia.org/w/index.php?title=Neuron&oldid=213235579
4. Membrane potential. (2008, May 21). In Wikipedia, The Free Encyclopedia. Retrieved 08:14, May 26, 2008, from http://en.wikipedia.org/w/index.php?title=Membrane_potential&oldid=214033480
5. Pyramidal cell. (2008, May 8). In Wikipedia, The Free Encyclopedia. Retrieved 13:46, May 26, 2008, from http://en.wikipedia.org/w/index.php?title=Pyramidal_cell&oldid=211030760
6. E. M. Izhikevich: Simple Model of Spiking Neurons. IEEE Transactions on Neural Networks, Vol. 14, No. 6, November 2003, pp. 1569-1572.
7. T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, A. Y. Wu: An Efficient k-Means Clustering Algorithm: Analysis and Implementation. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 7, July 2002.
8.
IBM Corporation, August 2005. http://publib.boulder.ibm.com/infocenter/dzichelp/v2r2/index.jsp?topic=/com.ibm.qmf.doc.using/dsqk2mst365.htm
9. IBM Corporation, August 2005. http://publib.boulder.ibm.com/infocenter/dzichelp/v2r2/index.jsp?topic=/com.ibm.qmf.doc.using/dsqk2mst339.htm
10. Ariane Pasquier: User Manual of EasyQueries. April 2007.
11. M. L. Hines, T. Morse, M. Migliore, N. T. Carnevale, G. M. Shepherd: ModelDB: A Database to Support Computational Neuroscience. J Comput Neurosci. 2004 Jul-Aug;17(1):7-11.
12. Heng Tao Shen, Xiaofang Zhou, Aoying Zhou: An adaptive and dynamic dimensionality reduction method for high-dimensional indexing. The VLDB Journal (2007) 16(2): 219-234.