Chapter XV
Basic Principles of Data Mining

Karl-Ernst Erich Biebler, Ernst-Moritz-Arndt-University, Germany
Bernd Paul Jäger, Ernst-Moritz-Arndt-University, Germany
Michael Wodney, Ernst-Moritz-Arndt-University, Germany

Copyright © 2009, IGI Global. Distributing in print or electronic forms without written permission of IGI Global is prohibited.

Abstract

This chapter summarizes the data types, mathematical structures, and associated methods of data mining. Topological, order-theoretical, algebraic, and probability-theoretical mathematical structures are introduced. The n-dimensional Euclidean space, the model used most often for data, is defined. It is explained briefly why the treatment of higher-dimensional random variables and related data is problematic. Since topological concepts are less well known than statistical concepts, many examples of metrics are given. Related classification concepts are defined and explained, and possibilities for assessing their quality are discussed. One example each is given for topological cluster analysis and topological discriminant analysis.

Introduction

Data mining is, up to a point, a self-guided data-evaluation process that is influenced by the accompanying activity of the user. Compared with data analysis, it describes a process of data evaluation that is defined in advance. Most of the time, data mining refers to exploratory procedures: hypotheses connected with the examined data are sought. Nothing needs to be presupposed about how the data were collected. Inferential procedures pursue a different aim: a given hypothesis is to be checked against data. In that case, however, the data must be collected according to certain principles.

As a rule, if statistical procedures are used, it must be possible to regard the data as samples. More exact definitions of the concepts of information and hypothesis are not considered here.
Contributions to the methods of data mining come from different fields, for example computer science, logic, learning theory, and artificial intelligence, as well as from application areas such as medical informatics and financial analysis. Basic concepts of data mining are explained in the following. The concepts used belong to different areas of mathematics. They are defined and illustrated with examples.

Data of different types must be distinguished, and the mathematical methods of data evaluation have to be designed accordingly. The mathematical structures are of basic importance: they correspond to the respective data types, and the interpretation of results must refer to them.

If the pairwise distances between the objects of a data set can be calculated, then so-called topological methods of data mining can be designed. Statistical methods of data mining are based on observations of random variables. It is mostly presupposed that the data are a sample. If this is not the case, statistical methods can be considered only in exceptional cases. Deciding whether data are a sample of a random variable is not a trivial problem. Therefore, we also point to non-statistical methods of data mining.

Methods of data mining are mathematical procedures. Their variety is exceptionally broad. We therefore confine ourselves to some classification methods and different possibilities of their treatment. The reader is thus able, in principle, to recognize the connection between data type, observation strategy, structure of the data, and data-mining method. This is essential for any interpretation of results. Transformations of the original data can influence the results of data mining. It is therefore recommended always to refer to the original data.

Data types

Observations of objects are reported as data. These observations can be obtained as measurements, numbers, or verbal descriptions, for example. Sometimes they concern a single property, often several.
More complicated facts concerning the objects, such as relations, can also be included. It is therefore necessary to distinguish data types. The data types relevant for data analysis are described in the following. Data types are also known from programming languages; these are not treated here.

A set X in the set-theoretical sense consists of elements xi, X = {xi, i ∈ I}. The index set I may be finite or infinite. Accordingly, one distinguishes finite and infinite sets. The sets {x1, x1, x1, x2} and {x1, x2} are the same in the set-theoretical sense. This means that all elements of a set are different.

Data sets are collections of elements of a set. The data sets {x1, x1, x1, x2} and {x1, x2} have to be distinguished: the same element of a set can appear repeatedly in a data set.

String data are signs or character strings (e.g., letters, words, abstract words).

Numerical data are numbers (e.g., 3, 324, 2.1482). Dates are not regarded as numerical data; they form a type of their own.

Categorical data are collections of elements of a set X; e.g., {red, red, red, green, green} is collected from X = {red, green, blue}. Categorical data can be string data or numerical data.

Ordinal data are data that can be ordered. Numbers can be ordered by size. The words of a language are string data and can be ordered as in a dictionary.

Metric data are collections of elements of an interval X of real numbers; e.g., {2.001, 13.2, 1.008, 200.23} may have been collected from X = [0; 225].
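The distinctions drawn in the Data types section can be made concrete with a short sketch. It is an illustration only, not the authors' notation: the variable names are hypothetical, and Python's `collections.Counter` is used here as a stand-in for a data set in which the same element may occur repeatedly. The example values are the ones given in the text.

```python
from collections import Counter

# Set-theoretical view: repeated elements collapse, so the two sets are equal.
set_a = {"x1", "x1", "x1", "x2"}
set_b = {"x1", "x2"}
sets_equal = set_a == set_b

# Data-set view: multiplicity matters, so the two data sets differ.
# Counter is an assumed stand-in for a collection with repetitions.
data_a = Counter(["x1", "x1", "x1", "x2"])
data_b = Counter(["x1", "x2"])
data_sets_equal = data_a == data_b

# Categorical data: every observation is an element of the set X.
X = {"red", "green", "blue"}
colours = ["red", "red", "red", "green", "green"]
is_categorical = all(c in X for c in colours)
colour_counts = Counter(colours)

# Ordinal data: numbers are ordered by size, words in dictionary-like order.
# Note: sorted() compares strings by code point, which only approximates
# a real dictionary ordering for lowercase ASCII words.
numbers_sorted = sorted([324, 3, 2.1482])
words_sorted = sorted(["mining", "data", "cluster"])

# Metric data: observations lie in an interval of real numbers, here [0; 225].
LOW, HIGH = 0.0, 225.0
measurements = [2.001, 13.2, 1.008, 200.23]
in_interval = all(LOW <= m <= HIGH for m in measurements)

print(sets_equal, data_sets_equal, is_categorical, in_interval)
print(colour_counts, numbers_sorted, words_sorted)
```

The point of the first two comparisons is that the same literal, {x1, x1, x1, x2}, yields different objects depending on whether it is read as a set or as a data set. Which reading applies has to be decided before a method is chosen.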