Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 3, No. 1, May 2011 ISSN (Online): 1694-0814 www.IJCSI.org Efficient Spatial Data mining using Integrated Genetic Algorithm and ACO Mr.K.Sankar1 and Dr. V.Vankatachalam2 1 Assistant Professor(Senior), Department of Master of Computer Applications, KSR College of Engineering, Tiruchengode 2 Principal, The KAVERY Engineering College, Mecheri, Salem Abstract Spatial data plays a key role in numerous applications such as network traffic, distributed security applications such as banking, retailing, etc., The spatial data is essential mine, useful for decision making and the knowledge discovery of interesting facts from large amounts of data. Many private institutions, organizations collect the number of congestion on the network while packets of data are sent, the flow of data and the mobility of the same. In addition other databases provide the additional information about the client who has sent the data, the server who has to receive the data, total number of clients on the network, etc. These data contain a mine of useful information for the network traffic risk analysis. Initially study was conducted to identify and predict the number of nodes in the system; the nodes can either be a client or a server. It used a decision tree that studies from the traffic risk in a network. However, this method is only based on tabular data and does not exploit geo routing location. Using the data, combined to trend data relating to the network, the traffic flow, demand, load, etc., this work aims at deducing relevant risk models to help in network traffic safety task. The existing work provided a pragmatic approach to multi-layer geodata mining. The process behind was to prepare input data by joining each layer table using a given spatial criterion, then applying a standard method to build a decision tree. The existing work did not consider multi-relational data mining domain. The quality of a decision tree depends, on the quality of the initial data which are incomplete, incorrect or non relevant data inevitably leads to erroneous results. The proposed model develops an ant colony algorithm integrated with GA for the discovery of spatial trend patterns found in a network traffic risk analysis database. The proposed ant colony based spatial data mining algorithm applies the emergent intelligent behavior of ant colonies. The experimental results on a network traffic (trend layer) spatial database show that our method has higher efficiency in performance of the discovery process compared to other existing approaches using non-intelligent decision tree heuristics. Keywords: Spatial data mining, Network Traffic, ACO, GA 1. Introduction Data mining is the process of extracting patterns from large data sets by combining methods from statistics and artificial intelligence with database management. Given an informational system, data mining is seen to be used as an important tool to transform data into business intelligence process. It is currently used in wide range of areas such as marketing, surveillance, fraud 97 IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 3, No. 1, May 2011 ISSN (Online): 1694-0814 www.IJCSI.org traffic risk analysis. The proposed work detection, and scientific discovery. aims at spatial feature of the traffic load Automatic data processing is the result and demand requirements and their of the increase in size and complexity of interaction with the geo routing the data set. This has been used in other environment. In previous work, the areas of computer science as neural system has implemented some spatial networks, support vector machines, data mining methods such as genetic algorithms and decision trees. A generalization and characterization. The primary reason for using data mining is proposal of this work uses intelligent ant to assist in the analysis of collections of agent to evaluate the search space of the observations of network user behavior. network traffic risk analysis along with usage of genetic algorithm for risk Spatial data mining try to find pattern. patterns in geographic data. Most commonly used in retail, it has grown 2. Literature Review out of the field of data mining, which initially focused on finding patterns in Spatial data mining fulfills real network traffic analysis, security threats needs of many geomantic applications. It over a period of time, textual and allows taking advantage of the growing numerical electronic information. It is availability of geographically referenced considered to be more complicated data and their potential richness. This challenge than traditional mining includes the spatial analysis of risk such because of the difficulties associated as epidemic risk or network traffic with analyzing objects with concrete accident risk in the router. This work existences in space and time. Spatial deals with the method of decision tree patterns may be discovered using for spatial data classification. This techniques like classification, method differs from conventional association, and clustering and outlier decision trees by taking account implicit detection. New techniques are needed spatial relationships in addition to other for SDM due to spatial auto-correlation, object attributes. Ref [2, 3] aims at importance of non-point data types, taking account of the spatial feature of continuity of space, regional knowledge the packets transmissions and their and separation between spatial and noninteraction with the geographical spatial subspace. The explosive growth environment. of spatial data and widespread use of spatial databases emphasize the need for How are spatial data handled in the automated discovery of spatial usual data mining systems? Although knowledge. Our focus of this work is on many data-mining applications deal at the methods of spatial data mining, i.e., least implicitly with spatial data they discovery of interesting knowledge from essentially ignore the spatial dimension spatial data of network traffic patterns. of the data, treating them as non-spatial. Spatial data are related to traffic data This has ramifications both for the objects that occupy space. analysis of data and for their visualization. First, one of the basic tasks The institutions concern the of exploratory data analysis is to present routing network studies the application the salient features of a data set in a of data mining techniques for network 98 IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 3, No. 1, May 2011 ISSN (Online): 1694-0814 www.IJCSI.org there might be little pixel coherence. format understandable to humans. It is Therefore, cartogram-based distortion is well known that visualization in primarily a preprocessing step. geographical space is much easier to understand than visualization in abstract In [8] the author proposes an space. Secondly, results of a data mining Improved Ant Colony Optimization analysis may be suboptimal or even be (IACO) and Hybrid Particle Swarm distorted if unique features of spatial Optimization (HPSO) method for data, such as spatial autocorrelation SCOC. In the process of doing so, the ([7]), are ignored. In sum, convergence system first use IACO to obtain the of GIS and data mining in an Internet shortest obstructed distance, which is an enabled spatial data mining system is a effective method for arbitrary shape logical progression for spatial data obstacles, and then the system develop a analysis technology. Related work in this novel HPKSCOC based on HPSO and direction has been done by Koperski and K-Medoids to cluster spatial data with Han, Ester et al. [4, 9]. obstacles, which can not only give attention to higher local constringency Rather than aggregate data, speed and stronger global optimum Gridfit [1] avoids overlap in the 2D search, but also get down to the display by repositioning pixels locally. obstacles constraints. Spatial clustering In areas with high overlap, however, the is an important research topic in Spatial repositioning depends on the ordering of Data Mining (SDM). Many methods the points in the database, which might have been proposed in the literature, but be arbitrary. Gridfit places the first data few of them have taken into account item found in the database at its correct constraints that may be present in the position, and moves subsequent data or constraints on the clustering. overlapping data points to nearby free These constraints have significant positions, making their placement influence on the results of the clustering quasirandom. Cartograms [5] are another process of large spatial data. In this common technique dealing with project, the system discuss the problem advanced map distortion. Cartogram of spatial clustering with obstacles techniques let data analysts trade shape constraints and propose a novel spatial against area and preserve the map’s clustering method based on Genetic topology to improve map visualization Algorithms (GAs) and KMedoids, called by scaling polygonal elements according GKSCOC, which aims to cluster spatial to an external parameter. Thus, in data with obstacles constraints.[9] cartogram techniques, the rescaling of map regions is independent of a local distribution of the data points. A 3. Genetic and ACO Based Spatial cartogram-based map distortion provides Data Mining Model much better results, but solves neither the overlap nor the pixel coherence Before data mining algorithms problems. Even if the cartogram can be used, a target data set must be provides a perfect map distortion (in collected. As data mining only uncover many cases, achieving a perfect patterns already present in the data, the distortion is impossible), many data target dataset must be large enough to points might be at the same location, and contain these patterns. A common source 99 IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 3, No. 1, May 2011 ISSN (Online): 1694-0814 www.IJCSI.org substantially for relational (attribute) for data is a data mart or data warehouse. data management and for topological Pre-process is essential to analyze the (feature) data management Geographic multivariate datasets before clustering or data repositories increasingly include ill data mining. The target set is then structured data such as imagery and geo cleaned. Cleaning removes the referenced multimedia. The strength of observations with noise and missing network GIS is in providing a rich data data. The clean data are reduced into infrastructure for combining disparate feature vectors, one vector per data in meaningful ways by using spatial observation. A feature vector is a proximity. summarized version of the raw data observation. This might be turned into a The next logical step to take feature vector by locating the eyes and Network GIS analysis beyond mouth in the image. The feature vectors demographic reporting to true market are divided into two sets, the "training intelligence is to incorporate the ability set" and the "test set". The training set is to analyze and condense a large number used to "train" the data mining of variables into a single forecast or algorithm(s), while the test set is used to score. This is the strength of predictive verify the accuracy of any patterns found data mining technology and the reason why there is such a true relationship The proposed spatial data mining between Network GIS & data mining. model uses ACO integrated with GA for Depending upon the specific application, network risk pattern storage. The Network GIS can combine historical proposed ant colony based spatial data customer or retail store sales data with mining algorithm applies the emergent syndicated demographic, business, intelligent behavior of ant colonies. The network traffic, and market research proposed system handle the huge search data. This dataset is then ideal for space encountered in the discovery of building predictive models to score new spatial data knowledge. It applies an locations or customers for sales effective greedy heuristic combined with potential, cross-selling, targeted the trail intensity being laid by ants marketing, customer churn, and other using a spatial path. GA uses searching similar applications. Geospatial data population to produce a new generation repositories tend to be very large. population. The proposed system Moreover, existing GIS datasets are develops an ant colony algorithm for the often splintered into feature and attribute discovery of spatial trends in a GIS components that are conventionally network traffic risk analysis database. archived in hybrid data management Intelligent ant agents are used to systems. Algorithmic requirement differ evaluate valuable and comprehensive substantially for relational (attribute) spatial patterns. data management and for topological 3.1. Geo-Spatial Data Mining (feature) data management. Data volume was a primary 3.2 Ant Colony Optimization factor in the transition at many federal agencies from delivering public domain Ant colony Optimization data via physical mechanisms algorithm (ACO), a probabilistic Algorithmic requirements differ 100 IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 3, No. 1, May 2011 ISSN (Online): 1694-0814 www.IJCSI.org optimal solution. If there were no technique is deployed for evaluating evaporation at all, the paths chosen by spatial data inference from network the first ants would tend to be traffic patterns which find load and excessively attractive to the following demand at various instances. In the ones. In that case, the exploration of the natural world, ants (initially) wander randomly, and upon finding food return solution space would be constrained. to their colony while laying down pheromone trails. If other ants find such Thus, when one ant finds a good a path, they are likely not to keep (i.e., short) path from the colony to a traveling at random, but to instead food source, other ants are more likely to follow the trail, returning and reinforcing follow that path, and positive feedback it if they eventually find food. eventually leads all the ants following a single path. The idea of the ant colony Ant Colony Optimization (ACO) algorithm is to mimic this behavior with is a paradigm for designing meta"simulated ants" walking around the heuristic algorithms for combinatorial graph representing the problem to solve. optimization problems. Meta-heuristic algorithms are algorithms which, in 3.3 Genetic Algorithm order to escape from local optima, drive some basic heuristic, either a The proposed algorithm of constructive heuristic starting from a spatial clustering based on GAs is null solution and adding elements to described in the following procedure. build a good complete one, or a local Divide an individual risk pattern of the search heuristic starting from a complete network traffic generating objects solution and iteratively modifying some (chromosome) into n part and each part of its elements in order to achieve a is corresponding to the classification of a better one. The metaheuristic part datum element. The optimization permits the low level heuristic to obtain criterion is defined by a Euclidean solutions better than those it could have distance among the data frequently, and achieved alone, even if iterated. The the initial number of packets that has to characteristic of ACO algorithms is their be sent is produced at random. Its explicit use of elements of previous genetic operators are similar to standard solutions GA’s. This method can find the global optimum solution and not influenced by Over time, however, the an outlier, but it only fits for the pheromone trail starts to evaporate, thus situation of small network traffic risk reducing its attractive strength. The more pattern data sets and classification time it takes for an ant to travel down the number. path and back again, the more time the pheromones have to evaporate. A short 4. Experimental Evaluation path, by comparison, gets marched over faster, and thus the pheromone density ACO with GA integration SPDM remains high as it is laid on the path as model is proposed to be tested in the fast as it can evaporate. Pheromone framework of network traffic risk evaporation has also the advantage of analysis. The analysis is done on a avoiding the convergence to a locally spatial database provided in the 101 IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 3, No. 1, May 2011 ISSN (Online): 1694-0814 www.IJCSI.org accidents and networks. Visualization framework of an industrial collaboration. and interaction helps to understand the It contains data on the number of packets dependencies within and between data to be sent and others on the number of sets. Visualization supports formulating nodes that is ready to be served in the hypotheses and answering questions network. The objective is to construct a about correlations between certain predictive model. The system model variables and collision. Explorative looks correspondences between the visualization may reveal new variables packet and the other trend layers as the relevant to the model and relevance of number of nodes, time taken for the already used variables. It is highly packet to reach at the other end etc. It required to analyze the correlations applies classification by decision tree combine spatial data analysis methods while integrating the number of packets with visualization. Risk model to be transmitted via spatial character development is an interactive and and their interaction with the explorative process. geographical environment. The experimental evaluation is made on a geographical network traffic (trend 4.2 ACO on SPDM layer) spatial database to depict higher ACO has been recently used in efficiency in performance of the some data mining tasks, e.g. discovery process. It proves that better classification rule discovery. quality of trend patterns discovered Considering the challenges faced in the compared to other existing approaches problem of spatial trend detection, ACO using non-intelligent decision tree suggest efficient properties in these heuristics. Reliable data constitute the aspects. Ant agents search for the trend key to success of a decision tree. An starting from their own start point in a efficient parallel and near global completely distributed manner. This optimum search for network traffic risk guides the search process to infer to a patterns are evaluated using genetic better subspace potentially containing algorithm. It combines the concept of more and better trend patterns. Finally survival of the fittest with a structured some measures of attractiveness can be interchange. GAs imitates natural defined for selecting a feasible spatial selection of the biological evolution. object from the neighborhood graph. Improvements in the identification of ACO on SPDM Effectively guide the high or low risk areas can assist the trend detection process of an ant ACO emergency preparedness planning and has been recently used in some data resource evaluation. mining tasks, e.g., classification rule discovery. Considering the challenges 4.1 Spatial Data Mining on Network faced in the problem of spatial trend Traffic Risk Patterns detection, ACO suggest efficient properties in these aspects. Ant agents Spatial data mining on network search for the trend starting from their traffic risk pattern focuses on the human own start point in a completely vulnerability in built environments. It distributed manner. Finally some considers issues like differences between measures of attractiveness can be common and rare collision, commuting of people, and relations between 102 IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 3, No. 1, May 2011 ISSN (Online): 1694-0814 www.IJCSI.org a domain specialist in traffic risk defined for selecting a feasible spatial analysis. The advantage of proposed object from the neighborhood graph. technique allows the end-user to evaluate the results without any 4.3 Spatial Clustering GA assistance by an analyst or statistician. Gas automatically achieve and Genetic algorithms are an accumulate the knowledge about the efficient parallel and near global search space of the ACO. GA adaptively optimum search method based on nature controls the traffic risk pattern search genetic and selection. GA combines the process to approach a global optimal concept of survival of the fittest with a solution. Perform well in highly structured interchange. Gas imitates constrained traffic risk pattern, where the natural selection of the biological number of “good” solutions is very small evolution. It uses searching population relative to the size of the search space. (set) to produce a new generation population. GAs automatically achieve The current application results and accumulate the knowledge about the show a use case of spatial decision trees. search space. GA adaptively controls the The contribution of this approach to search process to approach a global spatial classification lies in its simplicity optimal solution. GA performs well in and its efficiency. It makes it possible to highly constrained problems, where the classify objects according to spatial number of “good” solutions is very small information (using the distance). It relative to the size of the search space. allows adapting any decision tree GAs provides better solution in a shorter algorithm or tool for a spatial modeling time, including complex problems to problem. Furthermore, this method solve by traditional methods. considers the structure of geo-data in multiple trends (patterns) which is 5. Result and Discussions characteristic of geographical databases. The proposed results provide The graph below indicates the number of spatial decision trees for network traffic trends found and paths examined using risk patterns with optimized route SPDM Decision Tree and SPDM-ACOstructure with the ant agents. The GA models for traffic risk pattern proposed model classifies objects analysis. according to spatial information (using the ant agent and the distance pheromone). Spatial classification 6. Conclusion The Spatial data mining system provided by the proposed scheme is of ACO with GA have shown that simple and efficient. It allows adapting network traffic risk patterns are to different decision tree algorithm for discovered efficiently and recorded in the spatial modeling of network traffic the genetic property for avoiding the risk patterns. It uses the structure of geocollision risk in highly dense spatial data in multiple trend layers which is regions. The proposal of our system characteristic of geographical databases. analyzes existing methods for spatial Finally, the quality of this analysis is data mining and mentioned their improved by enriching the spatial strengths and weaknesses. The variety of database by multiple geographical yet unexplored topics and problems trends, and by a close collaboration with 103 IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 3, No. 1, May 2011 ISSN (Online): 1694-0814 www.IJCSI.org [3] Anselin, L. 1995. Local Indicators of makes knowledge discovery in spatial Spatial Association: LISA. Geographical databases an attractive and challenging Analysis 27(2):93-115. research field. This work gives an [4] Ester, M., Frommelt, A., Kriegel, efficient approach to multi-layer geoH.P, Sander, J., “Spatial Data Mining: data mining. The main idea is to prepare Database Primitives, Algorithms and input data by joining each layer table Efficient DBMS Support”, in Data using a given spatial criterion, then applying a standard method to build' a Minining and Knowledge Discovery, an International Journal, 1999 decision tree. The most advantage is to [5] D.A. Keim, S.C. North, and C. demonstrate the feasibility and the Panse, “Cartodraw: A Fast Algorithm for interest of integrating neighborhood Generating Contiguous Cartograms,” properties when analyzing spatial objects. Our future work will focus on IEEE Trans. Visualization and Computer Graphics (TVCG), vol. 10, adapting recent work in multi-relational no. 1, 2004, pp. 95-110. data mining domain, in particular on the [6] Haining, R. Spatial data analysis in extension of the spatial decision trees the social and environmental sciences, based on neural network. Another Cambridge Univ. Press, 1991 extension will concern automatic [7] Hawkins, D. 1980. Identification of filtering of spatial relationships. The Outliers. Chapman and Hall. [Jain & system will study of its functional Dubes1988] Jain, A., and Dubes, R. behavior and its performances for 1988. Algorithms for Clustering Data. concrete cases, which has never been Prentice Hall. done before. Finally, the quality of this [8] Jhung, Y., and Swain, P. H. 1996. analysis could be improved by enriching Bayesian Contextual Classification the spatial database by other Based on Modified M-Estimates and geographical trends, and by a close Markov Random Fields. IEEE collaboration with a domain specialist in Transaction on Pattern Analysis and traffic risk analysis. Indeed, the quality Machine Intelligence 34(1):67-75. of a decision tree depends, on the whole, [9] Koperski, K., Han, J. “GeoMiner: A of the quality of the initial data. System Prototype for Spatial Mining”, Proceedings ACMSIGMOD, Arizona, REFERENCES 1997. [1] . D.A. Keim and A. Herrmann, “The K.Sankar is a Research Scholar at the Gridfit Algorithm: An Efficient and Anna University Coimbatore. He is now Effective Approach to Visualizing Large working as a Assistant Professor(Sr) at Amounts of Spatial Data,” Proc. IEEE KSR College of Engineering, Visualization Conf., IEEE CS Press, Tiruchengode. His Research interests 1998, pp. 181-188. are in the field of Data Mining and [2] Anselin, L. 1988. Spatial Optimization Techniques. Econometrics: Methods and Models. Dordrecht, Netherlands: Kluwer. Anselin, L. 1994. Exploratory Spatial Dr.V.Vankatachalam is a principal of Data Analysis and Geographic The Kavery Engineering College. He Information Systems. In Painho, M., ed., received his B.E in Electronics and New Tools for Spatial Analysis, 45-54. 104 IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 3, No. 1, May 2011 ISSN (Online): 1694-0814 www.IJCSI.org Communication at Coimbatore Institute of Technology Coimbatore. He obtained his M.S degree in Software systems from Birla Institute of Technology Pilani. He did his M.Tech in Computer Science at Regional Engineering College (REC) Trichy. He obtained his Ph.D degree in Computer Science and Engineering from Anna University Chennai. He has published 3 papers in International Journal and 20 papers in International & National conferences. 105