Interoperating with GIS and Statistical Environment for an Interactive Spatial Data Mining

D. Josselin, Researcher
Laboratoire THEMA, CNRS, Besançon, France
didier.josselin@univ-fcomte.fr
http://thema.univ-fcomte.fr

Introduction

For a long time, spatial analysts, whatever their application or research field, have sought the underlying processes that govern their environment. Using statistical methods validated by mathematicians, they have derived laws and constructed models and theories. Thanks to a significant increase in computer speed and functionality, they can now explore many more facets of a problem, more hypotheses and more data. Their goal was, and still is, to understand the processes of societal development in order to manage, forecast and plan. One of their paradoxical aims is to identify regular, standard behaviours while also focusing on the outliers from which innovation may emerge. A related challenge is to extract from data the relevant, useful, necessary and sufficient information to investigate an issue and to bring to light pertinent complex geographical entities. This is the quest of data mining and, more precisely when GIS is involved, of spatial data mining (RIG, 2000).

Spatial data mining (i.e. finding spatial and statistical patterns, shapes, discontinuities, relations, rules, etc.) can be conducted in two main ways. When the problem requires a large amount of data and must be considered as a whole system (it cannot be divided into several subproblems), the specialist can apply automatic algorithms and procedures to solve, or at least simplify, the problem. In other cases, the specialist may adopt an exploratory approach that stays closer to the statistical individuals to be analysed, because of their variety. Standing at the interface, a GIS can serve as a central receptacle both for the extraction of general rules (by statistical learning of clusters or trends) and for the examination of data (by querying individuals within databases).
In a context where government policy and territorial planning call for highly reliable decision support, GIS remains a very important research and application domain, able to provide tools, methods and concepts for this quest. From a practical point of view, data mining involves several aspects. Among the many facets of geographical analysis research, computer science and spatial analysis are prominent; the interdisciplinary CNRS research group CASSINI (RIG, 1998) has been promoting them in France for the last ten years. This group will be succeeded in 2000 by a new body, possibly a European one, whose presumed axes are shown in Figure 1. The group has developed, and will continue to develop, close partnerships with European laboratories (via research programmes such as ESPRIT, TELEMATIC or IST) and with the international research community (via participation in the organisation of several workshops: EGIS94, SIG-GIS Europe in 1992, ESF-GISDATA in 1995 and 1999, UDMS in 1992, ACM-GIS in 2000, STDBM99, SSD97, STDML99, etc.). It will also work closely with the future French technological network for geographical information (RRTIG), which includes most of the private firms and public departments and offices concerned with GI.

Figure 1 - The five presumed axes of the future interdisciplinary French (and possibly European) research group (including about 50 laboratories working on GIS) that will follow the CASSINI research group.

Thus, one part of our research group is made up of computer scientists. Computer science brings innovations to geographical databases, such as programming and query languages, data modelling, interoperability processes, information exchange, etc. In both GIS hardware and software, progress has been significant (RIG, 1998).
At the same time, spatial analysts, who may be researchers or development practitioners in thematic fields, propose their own geographical data models to further a good understanding of our rural and urban environment by providing dedicated or general decision support tools, such as:
- Geographical information systems (real world description), including data, their modelling, and their use within a particular software package,
- Analysis tools to investigate data and build models of them (abstract world description).

We think these models and tools are becoming more efficient and closer to real practice and data, thanks to progress in computer science and because they are closely tied to their end users (more generally, what we call "the social demand"). One research project developed in CASSINI, dealing with spatial data mining and the coupling of GIS with a statistical environment, is shared between the two axes “interacting” and “spatial analysis”, and especially concerns the relation “multi-scalar dynamic geographical objects” of the future research group project (Figure 1). We first set out our methodological grounds, after discussing GIS and statistical environments with regard to our objectives. We then propose two ways to reach these objectives: an interactive environment for spatial data mining within the XlispStat software, and a dynamic link between two existing software packages (XlispStat and ArcView). Finally, we apply these two tools to the analysis of geographical agricultural flows between French communes.

Methodological positioning

Spatial statistics

Theoretical and technical developments in statistical and data analysis tools are numerous. This research field regularly transfers its scientific results to decision support software such as GIS. Many methods come from multidimensional statistical data analysis (Sanders, 1989, Lebart, 1995).
More recent developments concern knowledge representation and extraction (such as induction trees) and artificial intelligence methods such as neural networks (Thiria, 1997) or genetic algorithms (Goldberg, 1989). These methods can be applied to geographical problems and are beginning to greatly enrich spatial analysis methodologies. Moreover, more specialised geostatistical methods (spatial autocorrelation, variograms, etc.) are now commonly used in the analysis of spatial and temporal processes. For all these methods, the data are arranged in a single table crossing Individuals and Variables. Generally, the user of statistics assumes that:
- The information to extract is preferably the trend rather than the singular behaviour of individuals around it,
- During information processing, the “cord” linking the extracted model to the full set of individuals, with all their associated statistical and graphic representations, is not necessary and can be cut for good,
- The validation of the method employed and its theoretical domain of validity are sufficient to justify its use over the whole studied territory; in other words, the chosen method is general enough to embrace all the different cases encountered,
- The problem is correctly defined by a table (Individuals x Variables), even if different geographical entities and their attached attributes are involved, with all the incoherence and redundancy problems that may occur,
- Analysis of the output results alone is enough to bring out the pertinent information.

When looking for a trend in a homogeneous set of geographical entities, these assumptions are not really troublesome, and they simplify information by providing a synthesis of the studied phenomenon. Nevertheless, they may become a serious hindrance to explanatory analysis if we have to identify complex relations among individuals or groups (Openshaw, 1984) across scales (Piron, 1993).

Geographical Information Systems

At the opposite end from this approach stand Geographical Information Systems.
Much more oriented towards data managers, the research and development field of geographical databases has yielded significant results, notably in data modelling and query languages (Cheylan et al., 1997, Laurini and Milleret-Raffort, 1993). It provides access to the attributes of individuals (by assigning a unique key attribute, for example) via the relations they maintain (multiple relations between objects or entity classes). It also provides meta-models, very useful for thematic experts and computer scientists, concerning either data (a data dictionary, for example (Spery and Libourel, 1998)) or modelling tools (notably CASE tools). Research results on dynamic links (such as triggers) between objects or tables represent real progress from the user's point of view (Josselin, 2000). However, as soon as we try to produce high-level expertise, some difficulties are encountered because, in general:
- It is not the vocation of a GIS to implement a complete statistical toolbox; when statistical functionalities exist, they are rather poor or rare, the powerful ones generally being provided through separate, specific commands, since very powerful statistical software is available,
- Primacy is given to the overlay and combination of objects or coverages, a process which tends to orient the user's approach towards a vertical and geometric view of the database and of spatial investigation (the objects and their processes are considered, but not explicitly the relations, which remain at the level of a structural description),
- Likewise, tools for in-depth work on relations and associations between objects are generally poor, leaving the user facing the combinatorics of a high-dimensional database, which he can store, query and visualise, but not easily abstract (Salgé, 1996).

Thus, spatial statistics and GIS appear strongly complementary. Their advantages, used together, may help the user or expert who supports policy and decision makers to deepen spatial analysis.
Exploratory Spatial Data Analysis (ESDA)

ESDA (Exploratory Spatial Data Analysis), the spatial branch of EDA (Exploratory Data Analysis) initiated by J. Tukey (Tukey, 1977, Hoaglin, 1983), is growing in the USA (Behrens, 1997) and in Europe. It usefully complements global statistical methods by giving users real command of their data via interactive and graphic tools, and by transforming some spatial statistical indicators into local ones (for example LISA, Local Indicators of Spatial Association, Anselin, 1995). All these methods illustrate, at least in part, the concept of interactive spatial data mining. Although it is difficult to give an exhaustive view of ESDA because of its variety, we can focus on a few of its fundamental principles that are useful for spatial analysis:
- Robust methods are favoured (Floch et al., 1998), such as quantiles, resistant line, lowess, median polish, projection pursuit, running medians on series, etc.,
- Two individuals are not interchangeable: the model must explicitly take into account their local characteristics, and its outcomes must be known at any time during the analysis process (Fotheringham, 1997),
- Deviations from the trend (residuals) are as interesting as the trend itself,
- Multiple dynamically linked graphical representations (including the map) improve diagnostics (Haslett et al., 1991),
- The expert must be able to intervene, where possible, in the process itself, in order to avoid « black boxes » and to participate actively in validation (through a qualitative approach complementary to mathematical validation),
- Semantic and geometric information must be processed and captured simultaneously, especially functional relations (e.g. a farmer owns some parcels), spatial relations (e.g. the distance separating two parcels is two km) and topological relations (e.g. two communes are contiguous).

The dynamic link between objects is also a fundamental aspect, one that will bring together expert and computer learning processes.
This functionality can operate between objects of the same class or of different classes (maps, statistical distributions). We believe it can noticeably improve the overall quality of analyses, through a kind of systemic approach to information.

Improving interactive and graphic spatial data mining

Two main ways to build interactive spatial data mining tools

We have already discussed the advantages and restrictions of using GIS or statistical tools for spatial analysis. A user who develops an exploratory approach within a GIS may run into a lack of functionalities and into the need to write and execute specific queries (slow and sometimes tedious). A user who relies on ESDA tools or graphical statistical environments may suffer from their weakness in mapping and cartographic semiology. This has led a few authors to propose improvements to different software packages. Two main ways are available to develop GIS with exploratory statistical functionalities:
- Improving existing tools, i.e. integrating very powerful spatial statistical functionalities into GIS or mapping software, or implementing GIS functionalities in a statistical environment,
- Linking two existing software packages and benefiting from both.

Three kinds of software provide tools for geographical data management and exploratory analysis; we give a few examples below:
- Mapping software: this is the case of MacMap and Cartes&Données, which integrate interactive multidimensional statistical representations within a very user-friendly graphical environment,
- GIS: one example is the SpaceStat module, linked to ArcView and providing several ESDA methods (Anselin and Bao, 1997),
- Statistical software: a few free developments exist on this side, some of them implemented within the XlispStat software (Livemap by Brunsdon, 1998, ARPEGE' by Josselin, 2000).

The other way is interoperability, which is growing thanks to improved information exchange between software packages via networks. The idea is to take the best of each package and to associate them through a dynamic link.
These links are often procedures that transfer data streams from one package to the other. They are rarely interactive, i.e. operating in real time. However, a few commercial developments exist (e.g. SpatialStat, implemented within the S+ statistical environment). We can also mention a few tools developed within XlispStat (a module linking SmallWorld and S+ or XlispStat by Albouazzaoui, 1994) and LAVSTAT, a dynamic link between ArcView and XlispStat (Josselin, 1999).

ARPEGE': a tool to Analyse Robustly in Practice and Explore Geographical Environment

The ARPEGE' software (RIG, 2000) is developed within the XlispStat statistical environment. It supports exploratory data analysis thanks to:
- The numerous statistical and graphic functions of XlispStat,
- Dynamic linking between different object classes, each associated with an appropriate geometry and background map,
- Statistical representations describing the individuals of each object class,
- The possibility of selecting particular objects on screen, according to either their statistical characteristics or their spatial distribution, and of focusing on them,
- The permanent update, by triggers, of the selected individuals in every linked window,
- The possibility for the user to validate or invalidate hypotheses on crossed sub-populations (and on sub-populations derived from these sub-populations, etc.) drawn from the initial populations of objects,
- A tool called the “visioner”, which manages the whole set of objects, playing a role equivalent to that of a CASE tool and making it possible to know at any time the structure of the database (by identifying the statistical graphics, the elementary object classes and their relations); the visioner implements general m-to-n and inheritance relations between object classes.
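The trigger-based update of selected individuals across linked windows can be illustrated with a minimal sketch. It is written in Python rather than XlispStat, and the names `LinkedSelection`, `View` and `brush` are hypothetical; it is not the ARPEGE' implementation. It only shows the general idea that a selection made in one view is propagated to every other linked view.

```python
from typing import Callable, Iterable, List, Set


class LinkedSelection:
    """Shared set of selected individual ids (hypothetical illustration).

    Every registered listener is called whenever the selection changes,
    which plays the role of the triggers described in the text.
    """

    def __init__(self) -> None:
        self._selected: Set[int] = set()
        self._listeners: List[Callable[[Set[int]], None]] = []

    def subscribe(self, listener: Callable[[Set[int]], None]) -> None:
        self._listeners.append(listener)

    def select(self, ids: Iterable[int]) -> None:
        self._selected = set(ids)
        for listener in self._listeners:
            # Each view receives its own copy of the selection.
            listener(set(self._selected))


class View:
    """A window (map, histogram, scatterplot...) linked to the selection."""

    def __init__(self, name: str, selection: LinkedSelection) -> None:
        self.name = name
        self.highlighted: Set[int] = set()
        self._selection = selection
        selection.subscribe(self._on_change)

    def _on_change(self, ids: Set[int]) -> None:
        # A real view would redraw here; the sketch only records the ids.
        self.highlighted = ids

    def brush(self, ids: Iterable[int]) -> None:
        # A selection made in this view is pushed to all linked views.
        self._selection.select(ids)


selection = LinkedSelection()
map_view = View("map", selection)
histogram = View("histogram", selection)
scatterplot = View("scatterplot", selection)

histogram.brush([3, 7, 12])
print(sorted(map_view.highlighted))  # prints [3, 7, 12]
```

Brushing in any one of the three views gives the same result in the two others, which is the behaviour the list above attributes to linked windows.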
LAVSTAT: a dynamic Link between ArcView and xlispSTAT

Beyond classical GIS functionalities, a few characteristics of ArcView proved valuable for developing LAVSTAT:
- Because we have to deal with different geographical objects and attribute tables, we needed a dynamic link between entities inside the GIS; the «link» function can play this role, provided we take care of the coherence of the links,
- The SQL connection makes it possible to retrieve data from, or point to, other information sources and models, such as text editors, databases (Access), etc.,
- The Dynamic Data Exchange (DDE) library provided by Windows is supported by ArcView and is the exchange platform for information between XlispStat and ArcView.

XlispStat (Tierney, 1990) is a statistical programming environment that runs on Unix, Windows or MacOS and is built on the Common LISP language. It hosts many dedicated applications, such as R-Code (Cook and Weisberg, 1994) and ViSta (developed by F. Young). Completely open, this software offers a substantial set of statistical functions and graphic objects (thanks to its object orientation), with natural dynamic linking between them.

Figure 2 - LAVSTAT, principles (diagram labels: Services, DDE Server, XlispStat, ArcView)

The LAVSTAT principle is simple (Figure 2): a software package offers a server to other packages via DDE, under a specific name and with several services. Clients can then connect to the package via this server, ask it for data, modify them or make it execute procedures. Moreover, the server permanently watches for data changes: clients may supply the name of a function to be called whenever a relevant change occurs. It is thus possible to share evolving data.

Technically, the Avenue language provided in ArcView offers a client connection class (DDEclient) with five associated functions: «make» (to create a connection object), «close» (to close it), «execute» (to execute a function), «poke» (to modify a data item in the server application) and «request» (which returns the value of a variable).
ArcView implements a server, which can compile and execute an Avenue script. For its part, XlispStat offers three basic commands: «dde-connect» (to establish a connection), «dde-client-transaction» (to send a message to the application; messages may be «execute», «poke» or «request») and «dde-disconnect» (to close the connection). Using these functions, we developed a global function called «Avlink», which manages the dynamic link. It installs a script in ArcView that is executed at each table update and returns the selected individuals to XlispStat. Moreover, a selection change event in ArcView triggers the corresponding dynamic modification of the XlispStat graphics. Avlink also manages the same process in the other direction (from XlispStat to ArcView).

Application to flow analysis

We now show how LAVSTAT and ARPEGE' can be used to analyse agricultural geographical flows.

Definitions

Geographical flow analysis is a research field widely explored by geographers and spatial analysts (Putmann and Shung, 1989, Bolot, 1999, RGE, 1999). More generally, flows correspond to any exchange of information between two entities. These geographical entities may be towns, with their traffic; countries, with their commercial trade; or administrative entities (such as French communes), with their commuting or their agricultural land used by outside farmers. In this paper, we consider the example of agricultural inter-communal flows. The geographical objects taken into account are French communes (a start commune and a target commune) linked by flows, which represent farmed areas. If a flow stays within one commune, it is considered internal: farmers use land in their own commune (the one corresponding to their administrative location). If it links two different communes, it is external: more precisely, outgoing for the start commune and incoming for the target one. In this last case, farming takes place outside the commune boundary.
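The definitions above can be sketched as a small classification routine. This is an illustrative Python sketch under stated assumptions, not the authors' code: the `Flow` record, the `summarise` helper, the commune names and the areas are all hypothetical. The start commune is read as the farmers' administrative location and the target commune as the place where the farmed parcels lie.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class Flow(NamedTuple):
    start: str      # commune where the farmers are administratively located
    target: str     # commune where the farmed parcels lie
    area_ha: float  # farmed area carried by the flow (hypothetical unit)


def summarise(flows: Iterable[Flow]) -> Dict[str, Dict[str, float]]:
    """Total internal, outgoing and incoming farmed area per commune."""
    totals = defaultdict(
        lambda: {"internal": 0.0, "outgoing": 0.0, "incoming": 0.0}
    )
    for f in flows:
        if f.start == f.target:
            # Farmers use land in their own commune.
            totals[f.start]["internal"] += f.area_ha
        else:
            # External flow: outgoing for the start, incoming for the target.
            totals[f.start]["outgoing"] += f.area_ha
            totals[f.target]["incoming"] += f.area_ha
    return {commune: dict(t) for commune, t in totals.items()}


# Made-up example data, for illustration only.
flows = [
    Flow("A", "A", 120.0),
    Flow("A", "B", 30.0),
    Flow("B", "A", 10.0),
    Flow("C", "B", 5.0),
]
summary = summarise(flows)
print(summary["B"])
# prints {'internal': 0.0, 'outgoing': 10.0, 'incoming': 35.0}
```

Keeping one record per (start, target) pair mirrors the flows table described in the text, where each flow is itself an individual that can be selected and examined.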
Analysing flows with LAVSTAT

The following example shows how LAVSTAT can help the user explore his data. ArcView displays a map of the communes of the Jura department and two entity tables: flows and communes. In XlispStat, three statistical representations are associated with the flows: the agricultural areas concerned, the number of farmers aggregated for each flow, and a scatterplot matrix crossing three flow attributes, including areas under cereals. The question asked might be: is there any statistical dependency between the farmers involved in specific flows (in terms of quantity, range, spatial distribution, etc.) and the commune where the parcels concerned are located? A graphic selection of the flows associated with numerous farmers reveals two facts (Figure 3):
- These flows also correspond to large farmed areas (which validates the data, because the higher the number of farmers, the larger the flows). The scatterplot matrix shows that these flows do not necessarily correspond to cereal farming,
- The direct link to the map reveals clusters of communes, especially in the north of the department; this means there are local disparities in terms of agricultural spread.

Figure 3 - Agricultural flows spatial analysis in Jura (France): geostatistical exploration with LAVSTAT, a dynamic link between ArcView and XlispStat.

Analysing flows with ARPEGE'

Another way to analyse inter-communal agricultural flows is to use ARPEGE' (Figure 4). In the example below, three classes of geographical objects are present: communes, commune aggregates (called "little agricultural regions") and flows linking communes. The user can explore his data interactively in different ways, by selecting:
- The graphical labels of any objects, and examining the resulting selection and highlighting in the other graphical or mapping windows,
- Individuals in different statistical distributions, and viewing their geographical location at the same time.
This permits clear identification of statistical relations between geographical objects belonging to different classes and opens up an interactive spatial data mining process.

Figure 4 - Using ARPEGE' to analyse relations between "little agricultural regions", inter-communal agricultural flows and communes in the Isère department (France).

Conclusion

As these application examples show, computer speed and efficiency allow the management of large amounts of data and of their relations, regardless of their type. This is potentially powerful for spatial data mining that the user keeps under control. Nevertheless, a few restrictions and precautions should be noted:
- Above a certain volume of information, common computers cannot preserve interactivity because of increasing delays; the user may need to split the analysis into several phases and sub-problems,
- It is not unusual for users to get so carried away within their data that they become unable to extract any pertinent information; the user must have a real data exploration strategy and prepare beforehand a deductive approach with formalised hypotheses,
- The mere fact of having complete statistical information at hand can sometimes frighten or discourage users who would prefer synthetic statistical indicators to help them (or indeed to partly replace them?) in making their own expert decisions.

Despite these limitations, we believe this type of spatial data mining software has a promising future for decision support in territorial policy.

References

Albouazzaoui A., Agrégation et modélisation objet dans les S.I.G., Les Journées de la recherche CASSINI, Lyon, 13-14 octobre 1994, GDR CASSINI, pp. 187-196, 1994.
Anselin L., Bao S., Exploratory spatial data analysis linking SpaceStat and ArcView, in Recent developments in spatial analysis, Ed. M. Fischer and A. Getis, Berlin, Springer-Verlag, 1997.
Anselin L., Local indicators of spatial association, LISA, Geographical Analysis, pp. 93-115, 1995.
Behrens J. T., Principles and Procedures of Exploratory Data Analysis, American Psychological Association, vol. 2, n° 2, pp. 131-160, 1997.
Bolot J. et al., Construction and evaluation of spatial partitions to describe geographical flows, in The International Symposium on Spatial Data Quality, Hong Kong, July 1999.
Brunsdon C., Exploratory spatial data analysis and local indicators of spatial association with XlispStat, The Statistician, n° 47, Part 3, pp. 471-484, 1998.
Cheylan et al., Conception des systèmes d'information sur l'environnement, Collection Géomatique, Hermès, 1997.
Cook R.D., Weisberg S., An Introduction to Regression Graphics, New York, Wiley, 1994.
Floch J.M. et al., Exploratory Data Analysis, Cours de 3ème année d'ENSAE, 1998.
Fotheringham S., Trends in quantitative methods: stressing the local, Progress in Human Geography, 21, 1, pp. 88-96, 1997.
Goldberg D.E., Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley, Reading, Mass., 1989.
Haslett J. et al., Dynamic graphics for exploring spatial data with application to locating global and local anomalies, The American Statistician, August 1991, vol. 45, n° 3, pp. 235-242, 1991.
Hoaglin D. et al., Understanding robust and exploratory data analysis, Wiley Series in probability and mathematical statistics, 1983.
Josselin D. et al., Lien dynamique entre ArcView et Xlisp-Stat (LAVSTAT) : un environnement interactif d'analyse spatiale, Actes de la Conférence Française des Utilisateurs ESRI 1999, 29-30 septembre 1999, CD-ROM, 1999.
Josselin D., Un peu d'O2 pour la géographie (O2 : Orienté Objet) ?, Colloque Géopoint, 28 et 29 mai 1998, Avignon, published in 2000.
Josselin D., A la recherche d'objets géographiques composites, n° spécial Data Mining Spatial, Revue Internationale de Géomatique (Ed. K. Zeitouni), 2000.
Laurini R., Milleret-Raffort F., Les bases de données en géomatique, Hermès, 340 pages, 1993.
Lebart et al., Statistique exploratoire multidimensionnelle, Dunod, Paris, 1995.
Openshaw S., The modifiable areal unit problem, in Concepts and Techniques in Modern Geography, n° 38, Norwich, UK, Geobooks, 1984.
Piron M., Changer d'échelle : une méthode pour l'analyse des systèmes d'échelle, L'espace Géographique, n° 2, pp. 147-165, 1993.
Putmann S.H., Shung S.H., Effects of spatial system design on spatial interaction models, the spatial system definition problem, Environment and Planning, 1989.
RIG1998, Les nouveaux usages de l'information géographique, Actes des Journées Cassini 1998, Vol. 8, n° 1-2, coordonnateurs T. Libourel, S. Motet, Hermès, 227 p.
RIG2000, n° spécial Data Mining Spatial, Revue Internationale de Géomatique (Ed. K. Zeitouni), 2000.
RGE1999, Les flux dans l'espace géographique, N° Spécial de la Revue de Géographie de l'Est, Nancy (Ed. Josselin D.), Tome XXXIX, n° 4, décembre 1999.
Salgé L. et al., Les Systèmes d'Information Géographiques, Que sais-je ?, PUF, 1996.
Sanders L., L'analyse statistique des données en Géographie, Alidade, GIP RECLUS, 1989.
Spery L., Libourel T., Vers une structuration des métadonnées, Revue Internationale de Géomatique, vol. 8, n° 1-2, pp. 59-74, 1998.
Thiria S. et al., Statistiques et méthodes neuronales, 2ème cycle Ecoles d'ingénieurs, Dunod, Paris, 1997.
Tierney L., Lisp-Stat, an object oriented environment for statistical computing and dynamic graphics, Wiley-Interscience Publication, John Wiley and Sons, New York, 1990.
Tukey J.W., Exploratory Data Analysis, Addison-Wesley, 1977.