A Survey of Spatial Data Mining Approaches: Algorithms and Architecture

Arvind Sharma1, R K Gupta2
1 Department of Computer Science, MPCT, Gwalior-474011, INDIA
2 Department of Computer Science, MITS, Gwalior, INDIA
1 arvinddevansh@rediffmail.com, 2 rkg_iiitm@rediffmail.com

Abstract
Knowledge discovery in spatial data mining is a rapidly growing field whose development is driven by advanced research as well as by urgent practical, social and environmental needs. Important and sophisticated application areas include the design of road maps for regions, states or countries, cloud cover analysis, traffic control and GPS, all based on recorded data collected from satellites or local cameras. In this paper, we provide an overview of common knowledge discovery algorithms in SPDM. We propose a feature classification scheme based on clustering and classification for 3D databases. A comparative study of the algorithms is also presented.

Keywords: 3D databases, SPDM

1. INTRODUCTION

1.1 Overview and motivation
A collection of data, usually referred to as a database, contains information relevant to an entity such as an organization or enterprise. The primary goal of a database system is to provide a way to store and retrieve database information that is both convenient and efficient. An effective method introduced for this purpose is data mining. Data mining is usually defined as searching, analyzing and sifting through large amounts of data to find relationships, patterns, or any significant statistical correlation. More formally, data mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
The term 'data mining' refers to finding relevant and useful information in databases. Data mining and knowledge discovery in databases is a new interdisciplinary field, merging ideas from statistics, machine learning, databases and parallel computing. Basically, there are two techniques for managing databases in the process of data mining: spatial data mining and temporal data mining. Spatial Data Mining (SPDM) is the process of discovering interesting, useful, nontrivial patterns, information or knowledge from large spatial datasets. Here the term spatial covers all data sets related to space or geographical regions. The number and size of spatial databases, e.g. for geo-marketing, traffic control, environmental studies, medical diagnosis and weather prediction, are growing rapidly, which results in an increasing need for spatial data mining. These applications require storing huge amounts of data and suitable approaches for obtaining useful results. A variety of SPDM algorithms are discussed in this paper, along with some suggestions. Automated tools with intelligent algorithms can analyze the raw data and present the extracted high-level information to the analyst or decision maker, rather than having the analyst find it for himself or herself. In this paper, we present a survey and study of different spatial data mining algorithms that have been implemented in this area [7]. Various research groups have estimated that the market for SPDM will grow from $2000 million to $3000 million by 2012.

The aim of this study is:
1) A review of the data mining and SPDM process.
2) A study of existing knowledge discovery and spatial data mining algorithms.
3) The working architecture of spatial data mining.

Data mining is the process of extracting meaningful patterns from data, and it is becoming an increasingly important tool for transforming this data into information.
2. KNOWLEDGE DISCOVERY AND SPATIAL DATA MINING
This section covers the knowledge discovery and spatial data mining process for feature extraction and classification of the spatial and non-spatial attributes of databases.

2.1 The KDD process
The terms KDD and data mining are sometimes used interchangeably, but this is not accurate: data mining is one stage in the whole KDD process. A simple definition of KDD is as follows: knowledge discovery in databases is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. Hence, data mining is just one step in the overall KDD process. The detailed steps are given below:
i. Developing an understanding of the application domain and the goals of the data mining process.
ii. Selection of the target data set.
iii. Integrating and checking the data set.
iv. Data cleaning, preprocessing, and transformation.
v. Hypothesis building and software selection.
vi. Identification, selection and development of a suitable algorithm.
vii. Result interpretation and visualization.
viii. Result testing, verification, and refinement.
ix. Result application.

2.2 Database sources and issues
With the help of advanced technology and tools, spatial data can be collected in huge amounts from different sources for various applications, ranging from remote sensing and satellite telemetry systems to computer cartography, medical diagnosis, weather analysis and prediction, and all kinds of environmental planning. Various national and international agencies also provide spatial data in different dimensions. The most common data sources are satellite images, medical images, protein structures of the human body, and any data that can be represented in the form of cuboids, polygons, cylinders, etc. Various sites are also available for the collection of GIS data [10][11][9]. Google Earth, Visible Earth (NASA), the JSC Digital Image Collection (NASA) and the Global Land Cover Facility are also available for the collection of spatial data.
Spatial data basically include geographic data, such as maps and associated information, and computer-aided design data, such as integrated circuit designs or building designs. It has been observed that 2D databases are not very efficient at storing, indexing and querying data on the basis of spatial location. Additionally, for 2D databases we cannot use standard index structures, such as B-trees or hash indices, to answer such queries efficiently. It is therefore recommended to work with higher-dimensional data.

2.3 Spatial Data Mining Tasks
Extracting patterns from spatial data requires various methods and techniques by which meaningful patterns can be collected from different samples. It should also be noted that several methods with different goals may be applied successively to achieve a desired result. Some of the SPDM tasks are listed here:
Data processing- Analysts or users may select, filter, aggregate, sample, clean and/or transform data into a more understandable form. Unwanted and useless portions may be cut from the existing data to improve its productivity and applicability.
Prediction- Prediction means estimating outcomes in advance on the basis of the previous history or patterns of data items. Values of specific attributes of the data items may also be calculated accurately with iterative methods and different samples of data.
Regression- Given a set of data items, regression identifies the dependency of some attribute values on the values of other attributes in the same item and applies these dependencies to other data items or records.
Classification- Given a set of predefined categorical classes, determine to which of these classes a specific data item belongs. For example, in a weather prediction system, satellite images are classified into different classes on the basis of common properties and patterns.
Clustering- Given a set of data items, group items that are similar.
For example, given a set of satellite images, identify subgroups of objects or patterns (colored, non-colored, size, shape) and their behavior.
Link Analysis- Given a set of data items, identify relationships between attributes and items, such as the presence of one pattern implying the presence of another pattern.
Model visualization- Visualization plays a very important role in understanding and demonstrating the desired task properly. Visualization techniques range from simple scatter plots and histogram plots over parallel coordinates to 3D views of data items.

2.4 Data Mining methods
There are many methods by which more information can be extracted from data. According to application and ease of understanding, these methods can be grouped as shown in figure 1. The grouping may differ from author to author.

3. ALGORITHMS AND METHODS INCLUDED IN THIS PAPER
Some of the well-known algorithms are discussed here.

1. A density based algorithm for discovering clusters in large spatial databases with noise.
The task considered in this paper is class identification, i.e. the grouping of the objects of a database into meaningful subclasses. The algorithm (DBSCAN) requires one input parameter and supports the user in determining an appropriate value for it. It discovers clusters of arbitrary shape. DBSCAN is implemented on the basis of the R*-tree. All experiments were run on HP 735/100 workstations using synthetic data and the database of the SEQUOIA 2000 benchmark.
Positive aspects:
i. Faster
ii. Efficient
iii. Applicable to large databases
iv. Applicable to clusters of arbitrary shape
v. Extendable from point objects to polygons
Points missed:
i. High-dimensional data are not considered.
ii. It deals only with static rather than moving obstacles.

2. Algorithms for characterization and trend detection in spatial databases.
This work observes that, for spatial characterization, the class membership of a database object is determined not only by its non-spatial attributes but also by the attributes of the objects in its neighborhood. The neighborhood relationship is the central point of discussion in this paper. With the help of different databases, various local and global trends have been detected.

3. Density connected sets and their application for trend detection in spatial databases.
In this paper, the concept of density-connected sets and a generalized form of DBSCAN are introduced. The concept of trend detection is explained with clear examples. A systematic change of one or several non-spatial attributes in 2D or 3D space is described successfully. Certain predictions are explained on the basis of repeated trends in the databases. In some cases, however, the algorithm is not able to give clear-cut relations between different datasets.

4. Extended algorithm for spatial characterization and discrimination rules.
In this paper, a new spatial data mining algorithm for both characterization and discrimination rules has been proposed. A characterization rule is an assertion which characterizes a concept satisfied by all or a majority of the examples in the class undergoing learning (called the target class). A discrimination rule is an assertion which discriminates a concept of the class being learned from other classes (called contrasting classes). This is very important and useful in medical science for the diagnosis of diseases. The proposed algorithm is also well suited to the identification of weather patterns. The algorithm finds the characteristics of some spatial objects as well as the characteristics that discriminate those objects from other, contrasting spatial objects.
Positive aspects of this algorithm are:
i. It extracts not only the properties of target and contrast objects but also the properties of their neighbors, as these affect the characteristics of all objects.
ii. It shows a successful implementation of a general framework for SPDM.
iii. It is well suited to medical and weather applications.
Negative or future aspects:
Sometimes the concept of relative frequency in the target region does not match the actual database or work with it at a satisfactory level.

4. SOFTWARE ORGANIZATION AND ARCHITECTURES
Software organization and architecture for SPDM are shown in fig. 2 and 3. The organization of the SPDM software tool is shown in fig. 2. Here different sites are shown at different locations, each with its own algorithms and environment, in an integrated manner. Normally the organization and architecture should be designed in a distributed manner so that raw data and intermediate results can be shared under the coordination of a central GUI. Each distributed site has its own local data, SPDM software package, and file transfer and remote connection software. Each user i (i = 1 to n) applies some learning algorithm to one or more local spatial databases DBi to produce a local classifier Ci. All local classifiers can then be sent to a central repository, where they are combined into a new global classifier (GC) using the majority or weighted majority principle. The classifier GC is then sent back to all individual users as a possible means of improving their local classifiers. Designing such software and organizing its data and working modules efficiently is also an important research subject nowadays.

5.0 CONCLUSIONS AND FUTURE RESEARCH
In this paper, the definitions of data mining and spatial data mining have been explained, and it is clear that spatial data mining is one step at the core of the knowledge discovery process, dealing with the extraction of patterns and relationships from large amounts of data. Spatial data mining is an extension of data mining and is still an emerging field of research. There is a great need for an intelligent and reliable software tool to support an interactive knowledge discovery process in large centralized or distributed spatial databases. A list of unexplored and incomplete issues is given here for future discussion and implementation:

Merging of different techniques. Currently available tools deploy either a single technique or a limited set of techniques to carry out data collection and analysis. As already discussed under issues and methods, there is no single best technique for all kinds of data analysis. The problem becomes more complex when different techniques are combined to obtain better results.

Compatibility with higher-dimensional data. With time and the arrival of new applications and technologies, it has become necessary to work with data of three or more dimensions and to produce more satisfactory results. Designing such applications is a challenging job.

Application of methods and algorithms to changing, i.e. dynamic, data. Sometimes it becomes necessary to apply methods and techniques to dynamic data, i.e. data that arrive regularly at defined intervals, so that the results must change according to the new data. Such changing data may make previously discovered patterns invalid and cause new ones to hold instead. There is clearly a need for incremental or adaptive methods that are able to update working models.

Non-standard data types. Today's requirement is to process all kinds of data, such as audio, video, image, temporal, spatial and other data types. These data types contain special patterns which cannot be handled well by standard analysis methods. Therefore, these applications require special methods and algorithms.
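The combination step described in Section 4, in which the local classifiers Ci are merged into a global classifier GC by the majority or weighted majority principle, could be sketched as follows. This is a minimal illustration in Python rather than the authors' implementation: the function name make_global_classifier, the toy classifiers and the weights are all hypothetical, and each local classifier is assumed to be a function that maps a data item to a class label.

```python
from collections import defaultdict
from typing import Callable, Hashable, List, Optional, Sequence

# Assumption: a local classifier C_i is any callable mapping an item to a label.
Classifier = Callable[[Sequence[float]], Hashable]


def make_global_classifier(
    local_classifiers: List[Classifier],
    weights: Optional[List[float]] = None,
) -> Classifier:
    """Combine local classifiers into a global one by (weighted) majority vote.

    With weights=None every classifier gets weight 1.0 (plain majority).
    Ties are broken in favour of the label that received a vote first.
    """
    if not local_classifiers:
        raise ValueError("at least one local classifier is required")
    if weights is None:
        weights = [1.0] * len(local_classifiers)
    if len(weights) != len(local_classifiers):
        raise ValueError("need exactly one weight per local classifier")

    def global_classifier(item: Sequence[float]) -> Hashable:
        votes = defaultdict(float)
        for classify, weight in zip(local_classifiers, weights):
            votes[classify(item)] += weight  # add this site's weighted vote
        return max(votes, key=votes.get)     # label with the largest total

    return global_classifier


# Hypothetical local classifiers from three sites, labelling a 2D point.
def c1(p): return "urban" if p[0] > 0 else "rural"
def c2(p): return "urban" if p[1] > 0 else "rural"
def c3(p): return "rural"

gc_majority = make_global_classifier([c1, c2, c3])
gc_weighted = make_global_classifier([c1, c2, c3], weights=[1.0, 1.0, 3.0])

point = (1.0, 2.0)
label_majority = gc_majority(point)  # two of three sites vote "urban"
label_weighted = gc_weighted(point)  # site 3 outweighs the other two
```

With equal weights the sketch reduces to plain majority voting. The weights might, for instance, reflect how reliable each site's classifier is on held-out data, which is an assumption the text does not specify.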
[Figure 1: Data mining methods (verification and discovery; description and prediction; regression and classification; neural networks, Bayesian networks, decision trees, association rules, information-theoretic networks)]

[Figure 2: The organization of the distributed SPDM S/W (Site 1 to Site N, each with an SPDM S/W package, connected to a central GUI, the data/knowledge base and the user)]

[Figure 3: Internal architecture of working of SPDM S/W (SPDM GUI; model integration, modeling, data partitioning, data processing, data inspection, data generation and manipulation; spatial data, non-spatial data, preprocessed data, data knowledge, data partitions; user)]

6.0 References
[1] Martin Ester, Hans-Peter Kriegel, Jorg Sander, "Algorithms and Applications for Spatial Data Mining", in Geographic Data Mining and Knowledge Discovery, Research Monographs in GIS, Taylor and Francis, 2001.
[2] Michael Goebel, Le Gruenwald, "A Survey of Data Mining and Knowledge Discovery Software Tools", SIGKDD Explorations, June 1999, Volume 1, Issue 1, pp. 20-33.
[3] Martin Ester, Hans-Peter Kriegel, Jorg Sander, Xiaowei Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", 2nd International Conference on KDD, Portland, Oregon, pp. 226-231.
[4] Martin Ester, Hans-Peter Kriegel, Jorg Sander, Alexander Frommelt, "Algorithms for Characterization and Trend Detection in Spatial Databases", 4th International Conference on KDD, New York City, pp. 44-50.
[5] Martin Ester, Hans-Peter Kriegel, Jorg Sander, "Spatial Data Mining: A Database Approach", Proc. 5th Int. Symp. on Large Spatial Databases, pp. 47-66, Berlin, Springer.
[6] Martin Ester, Hans-Peter Kriegel, Jorg Sander, Xiaowei Xu, "Density-Connected Sets and their Application for Trend Detection in Spatial Databases", 3rd Int. Conf. on KDD (KDD-97).
[7] Gueting R.H., "An Introduction to Spatial Database Systems", Special Issue on Spatial Database Systems of the VLDB Journal, Vol. 3, No. 4, October 1994.
[8] Md. Rashidul Hasan, Md. Zakir Hossain, Fahim Md. Chaudahry, Md. Hasan, "Extended Algorithm for Spatial Characterization and Discrimination Rules", Proceedings of the 11th International Conf. on ICCIT 2008, Bangladesh.
[9] Aleksandar Lazarevic, Tim Fiez, "A Software System for Spatial Data Analysis and Modeling".
[10] Sample spatial datasets (http://www.apress.com)
[11] IBM Corporation. Intelligent Miner (http://www.software.ibm.com)