Data Mining
Chapter 2 – Input: Concepts, Instances, and Attributes
Kirk Scott

• Hopefully the idea of instances and attributes is clear
• Assuming there is something in the data to be mined, either this is the concept, or the concept is inherent in it
• Earlier, data mining was defined as finding a structural representation
• Essentially the same idea is now expressed as finding a concept description

Concept Description
• The concept description needs to be:
• Intelligible – it can be understood, discussed, disputed
• Operational – it can be applied to actual examples

2.1 What's a Concept?

Reiteration of Types of Discovery
• Classification – prediction
• Clustering – outliers
• Association
• Each of these is a concept
• Successful accomplishment of one of these for a data set is a concept description

Recall Examples
• Weather, contact lenses, iris, labor contracts
• All were essentially classification problems
• In general, the assumption is that classes are mutually exclusive
• In complicated problems, data sets may be classified in multiple ways
• This means individual instances can be "multilabeled"

Supervised Learning
• Classification learning is supervised
• There is a training set
• A structural representation is derived by examining a set of instances where the classification is known
• How to test this?
• Apply the results to another data set with known classifications

Association Rules
• In any given data set there can be many association rules
• The total may approach n(n – 1)/2 for n attributes
• The book doesn't use the terms support and confidence, but it discusses these concepts
• These terms will be introduced here

Support for Association Rules
• Let an association rule X ⇒ y be given, where X = (x1, x2, …, xi), in a data set with m instances
• The support for X ⇒ y is the number of instances in which the combination of x values, X, occurs, divided by m
• In other words, the association rule may be interesting if it occurs frequently enough

Confidence for Association Rules
• Confidence here is based on the statistical use of the term
• The confidence for X ⇒ y is the number of instances in which the relationship holds true, divided by the number of occurrences of X overall
• The book describes this idea as accuracy
• In other words, the association is more interesting the more likely it is that X actually determines y
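As an illustration of the support and confidence definitions above, here is a minimal sketch in Python. The toy data set, the attribute names, and the rule itself are invented for the example; they are not taken from the book.

```python
# Minimal sketch of support and confidence for a single rule X => y,
# using the definitions above: support = count(X) / m,
# confidence = count(X and y) / count(X).
# The toy data and the rule are invented for illustration.

instances = [
    {"outlook": "sunny",    "windy": False, "play": "no"},
    {"outlook": "sunny",    "windy": True,  "play": "no"},
    {"outlook": "overcast", "windy": False, "play": "yes"},
    {"outlook": "rainy",    "windy": False, "play": "yes"},
    {"outlook": "sunny",    "windy": False, "play": "yes"},
]

def support_and_confidence(instances, X, y):
    """X and y are dicts mapping attribute name -> required value."""
    def matches(inst, cond):
        return all(inst.get(a) == v for a, v in cond.items())
    m = len(instances)
    count_x = sum(matches(inst, X) for inst in instances)
    count_xy = sum(matches(inst, {**X, **y}) for inst in instances)
    support = count_x / m
    confidence = count_xy / count_x if count_x else 0.0
    return support, confidence

# Rule: outlook = sunny  =>  play = no
print(support_and_confidence(instances, {"outlook": "sunny"}, {"play": "no"}))
# (0.6, 0.666...): sunny occurs in 3 of 5 instances; play = no in 2 of those 3
```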
Clustering
• We haven't gotten to the details yet, but this is an interesting data mining problem
• Given a data set without predefined classes, is it possible to determine classes that the instances fall into?
• Having determined the classes, can you then classify future instances into them?
• Outliers are instances that you can definitely say do not fall into any of the classes

Numerical Prediction
• This is a variation on classification
• Given n attribute values, determine the (n + 1)st attribute value
• Recall the CPU performance problem
• It would be a simple matter to dream up sample data where the weather data predicted how long you would play rather than a simple yes or no
• (The book does so)

2.2 What's in an Example?

• The authors are trying to present some important ideas
• In case their presentation isn't clear, I present it here in a slightly different way
• The basic premise goes back to this question:
• What form does a data set have to be in in order to apply data mining techniques to it?

Data Sets Should Be Tabular
• The simple answer, based on the examples presented so far:
• The data has to be in tabular form, instances with attributes
• The remainder of the discussion will revolve around questions related to normalization in db

Not All Data Is Naturally Tabular
• Some data is not most naturally represented in tabular form
• Consider OO db's, where the natural representation is tree-like
• How should such a representation be converted to a tabular form that is amenable to data mining?

Correctly Normalized Data May Fall into Multiple Tables
• You might also have data which naturally falls into >1 table
• Or, you might have data (god forbid) that has been normalized into >1 table
• How do you make it conform to the single-table model (instances with attributes) for data mining?

• Tree-like data and multi-table data may be related questions
• It would not be surprising to find that a conversion of a tree to a table resulted in >1 table

Denormalization
• The situation goes against the grain of correct database design
• The classification, association, and clustering you intend to do may cross db entity boundaries
• The fact that you want to do mining on a single tabular representation of the data means you have to denormalize

• In short, you combine multiple tables back into one table
• The end result is the monstrosity that is railed against in normalization theory:
• The monolithic, one-table db

The Book's Family Examples
• Family relationships are typically viewed in tree-like form
• The book considers a family tree and the relationship "is a sister of"
• The factors for inferring sisterhood:
• Two people, one of them female
• The same parents (or at least one common parent) for both people

Two People in the Same Table
• Suppose you want to do this in tabular form
• You end up with the two people who might be in a sisterhood relationship in the same table
• These occurrences of people are matched with a classification, yes or no

• Recall that according to normalization, a truly one-to-one relationship can be stored in a single table
• Pairings of all people would result in lots of instances/rows where the classification was simply no
• This isn't too convenient

• In theory, you might restrict your attention only to those rows where the classification was yes
• This restriction is known as the "closed world assumption" in data mining
• Unfortunately, it is hardly ever the case that you have a problem where this kind of simplifying assumption applies
• You have to deal with all cases

Two People with Attributes in the Same Table
• Suppose the two people are only listed by name in the table, without parent information
• The classification might be correct, but this is of no help
• There are no attributes from which to infer sisterhood
• The table has to include attributes about the two people, namely parent information

The Connection with Normalization
• There is a problem with denormalized data mining which is completely analogous to the normalization problem
• Suppose you have two people in the same instance (the same row) with their attributes
• By definition, you will have stray dependencies
• The person identifiers determine the attribute values

• So far we've considered classification
• However, what would happen if you mined for associations?
• The algorithm would find the perfectly true, but already known, associations between the pk identifiers of the people and their attribute fields
• This is not helpful
• It's a waste of effort
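As a minimal sketch of the denormalization just described, the following Python code flattens a small table of people (with gender and parent attributes) into a single table of person pairs, each labeled with the "is a sister of" classification. The names, field names, and helper function are invented for illustration; this is not the book's actual table.

```python
# Minimal sketch: denormalize person/parent data into one table of pairs
# for the "is a sister of" classification. All names and fields are
# invented for illustration.

people = {
    "Ann":  {"gender": "female", "parent1": "Pat", "parent2": "Sam"},
    "Beth": {"gender": "female", "parent1": "Pat", "parent2": "Sam"},
    "Carl": {"gender": "male",   "parent1": "Pat", "parent2": "Sam"},
    "Dina": {"gender": "female", "parent1": "Lee", "parent2": "Kim"},
}

def is_sister_of(a, b):
    """b is a sister of a: b is female and the two share at least one parent."""
    pa, pb = people[a], people[b]
    shared = {pa["parent1"], pa["parent2"]} & {pb["parent1"], pb["parent2"]}
    return pb["gender"] == "female" and bool(shared)

# One row per ordered pair, with both people's attributes plus the class;
# note how many rows simply get the classification "no"
rows = []
for a in people:
    for b in people:
        if a == b:
            continue
        rows.append({
            "person1": a, **{"p1_" + k: v for k, v in people[a].items()},
            "person2": b, **{"p2_" + k: v for k, v in people[b].items()},
            "sister_of": "yes" if is_sister_of(a, b) else "no",
        })

for r in rows[:3]:
    print(r)
```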
Recursive Relationships
• Recall the monarch and product-assembly examples from db
• These give tables in recursive relationships with themselves or with other tables
• In terms of the book's example, how do you deal with parenthood when there is a potentially unlimited sequence of ancestors?

• In general, the answer is that you would need recursive rules
• Mining recursive rules is a step beyond classification, association, etc.
• The good news is that this topic will not be covered further
• It's simply of interest to know that such problems can arise

One-to-Many Relationships
• A denormalized table might be the result of joining two tables in a pk-fk relationship
• If the classification is on the "one" side of the relationship, then you have multiple instances in the table which are not independent
• In data mining this is called a multi-instance situation

• The multiple instances belonging to one classification together actually form one example of the concept under consideration in such a problem
• Data mining algorithms have been developed to handle cases like these
• They will be presented with the other algorithms later

Summary of 2.2
• The fundamental practical idea here is that data sets have to be manipulated into a form that's suitable for mining
• This is the input side of data mining
• The reality is that denormalized tables may be required
• Data mining can facetiously be referred to as file mining, since the required form does not necessarily agree with db theory

• The situation can be restated in this way:
• Assemble the query results first; then mine them
• This leads to an open question:
• Would it be possible to develop a data mining system that could encompass >1 table, crawling through the pk-fk relationships like a query, finding associations?

2.3 What's in an Attribute?

• This subsection falls into two parts:
• 1. Some ideas that go back to db design and normalization questions
• 2. Some ideas having to do with data type

Design and Normalization
• You could include different kinds (subtypes) of entities in the same table
• To make this work you would have to include all of the fields of all of the kinds of entities
• The fields that didn't apply to a particular instance would be null
• The book uses transportation vehicles as an example: ships and trucks

• You could also have fields in a table that depend on each other (ack)
• The book gives married T/F and spouse's name as examples
• Again, you can handle this with null values

Data Types
• The simplest distinction is numeric vs. categorical
• Some synonyms for categorical: symbolic, nominal, enumerated, discrete
• There are also two-valued attributes, known as Boolean or dichotomous

Spectrum of Data Types
• 1. Nominal = unordered, unmeasurable named categories
• Example: sunny, overcast, rainy
• 2. Ordinal = named categories that can be put into a logical order but which have no intrinsic numeric value and no defined distance between them (they support < and >)
• Example: hot, mild, cool
• 3. Interval = numeric values where the distance between them makes sense (they support subtraction) but other operations do not
• Example: time expressed in years
• 4. Ratio = numeric values where all arithmetic operations make sense
• These are real or continuous (or possibly integer) values on a scale with a natural zero point
• Example: physical distance

• In principle, data mining has to handle all possible types of data
• In practice, applied systems typically support some useful subset of the type distinctions given above
• You adapt your data to the types provided

2.4 Preparing the Input

• In practice, preparing the data can take more time and effort than doing the mining
• Data needs to be in the format required by whatever mining software you're using
• In Weka, this is ARFF = attribute-relation file format

• Real data tends to be low in quality
• Think data integrity and completeness
• "Cleaning" the data before mining it pays off

Weka (the Bird)
• From Wikipedia: The weka or woodhen (Gallirallus australis) is a flightless bird species of the rail family. It is endemic to New Zealand, where four subspecies are recognized. Weka are sturdy brown birds, about the size of a chicken. As omnivores, they feed mainly on invertebrates and fruit. Weka usually lay eggs between August and January; both sexes help to incubate.
• Behaviour (excerpt): Where the weka is relatively common, their furtive curiosity leads them to search around houses and camps for food scraps, or anything unfamiliar and transportable.

Gathering the Data Together
• In a large organization, different departments may manage their own data
• Global-level data mining will require integration of data from multiple databases
• If you're lucky, the organization has already created a unified archive, a data warehouse
• Interesting mining may also require integrating external data into the data set

Aggregation
• It may be necessary to aggregate data in order to mine it successfully
• You may have data on parameters of interest spread through multiple instances
• To be useful for solving the problem, it may be necessary to add the values of data points together, for example

• The type of aggregation is important
• Remember the aggregation operators in db: COUNT, SUM, AVERAGE, etc.
• The level of aggregation is important
• Remember GROUP BY in db
• Do you aggregate all instances, or is it useful to do it by subsets of some sort?
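As a minimal sketch of this kind of GROUP BY-style aggregation before mining, the following Python code rolls invented transaction records up to one row per customer, computing COUNT, SUM, and AVERAGE. The data and field names are made up for illustration.

```python
# Minimal sketch of GROUP BY-style aggregation before mining:
# roll individual transactions up to one instance per customer.
# The data and field names are invented for illustration.
from collections import defaultdict

transactions = [
    {"customer": "C1", "amount": 20.0},
    {"customer": "C1", "amount": 35.0},
    {"customer": "C2", "amount": 10.0},
    {"customer": "C2", "amount": 15.0},
    {"customer": "C2", "amount": 5.0},
]

groups = defaultdict(list)
for t in transactions:
    groups[t["customer"]].append(t["amount"])

# One aggregated instance per customer: COUNT, SUM, AVERAGE
aggregated = [
    {"customer": c, "count": len(a), "total": sum(a), "average": sum(a) / len(a)}
    for c, a in groups.items()
]

for row in aggregated:
    print(row)
```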
ARFF (Format)
• ARFF is the regular (plain-text) version of the data format for Weka
• XRFF is the XML version
• In ARFF:
• % marks a comment
• @ marks file descriptor information: the relation, the attributes, and the data

• For a categorical attribute, the set of values is given in braces
• Categorical values containing spaces have to be put in quotation marks
• Numeric attributes are simply identified as numeric

• Instances in the file are given line by line
• They are separated by new lines
• Attribute values within an instance are separated by commas
• Missing values (nulls) are indicated with a question mark

• In an ARFF file, a classification attribute, if there is one, is treated no differently from any other attribute
• The format is equally suited to classification, association, or cluster mining
• Figure 2.2 in the book shows the weather data set in ARFF
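To illustrate these rules, here is an abbreviated ARFF sketch in the style of the weather data. The value sets for humidity and windy and the data rows are reconstructed from the standard example rather than copied from the book's Figure 2.2, so treat it as illustrative only.

```
% Illustrative ARFF sketch in the style of the weather data
% (abbreviated; value sets and rows reconstructed, not copied
% from the book's Figure 2.2)
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny, hot, high, FALSE, no
overcast, hot, high, FALSE, yes
rainy, mild, ?, TRUE, yes
```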
Weka Has Three Additional Attribute Types
• String = the moral equivalent of VARCHAR in db
• Date = the equivalent of DATE in db
• Relational = stay tuned; this will require some explanation

Relational-Valued Attributes
• The book gives an example which is OK, but it's not necessarily presented in the clearest way possible
• My plan is to first give a bunch of explanatory background
• Then explain the book's example in a slightly different order than it does

Relational Background
• Recall that multivalued (multi-instance) problems can be viewed as mining the result of a 1-m join
• In preparing a data set for mining, this is what a relational-valued attribute is:
• It is an attribute that can contain or consist of multiple instances of the same kind of set of values, where these sets belong together for some reason

• In a 1-m, pk-fk join, the multiple sets are the rows of the many-side table which belong together because they share the same fk value
• In case this general overview isn't clear, the idea can be illustrated with mothers and children

Mothers and Children
• Suppose you ran this query:
• SELECT * FROM Mother, Child WHERE Mother.motherid = Child.motherid ORDER BY Mother.motherid
• Children of the same mother would be grouped together in the result

• In data mining, it is possible that you would want to elicit information about children in general
• You might also want to elicit information that generally holds for children of the same mother
• Conceptually, you would be mining information about siblinghood

• This is where relational-valued attributes come in
• From a relational point of view, the representation is wrong
• First normal form says you have flat files with no repeating groups
• But for data mining purposes, in ARFF format, you want the repeating groups

Explaining the Book's Example
• The book adapts the weather/play-a-game data into a multivalued example
• The new twist is this: games extend over two days, not just one
• Each day is still a single instance
• But for each game, there are two of these instances which belong together

• The instances in the rows of the table representing game information will be multivalued
• Each game will contain two days' worth of weather data
• These two days' worth of data are the contents of one relational-valued attribute in the overall data set

• Note that in general, relational (multivalued) attributes are not limited to 2 sets of values
• This is just an artifact of the book's example, where games last exactly two days

• The book also uses terminology which could be clearer
• In their new weather table they name the relational attribute "bag"
• Suffice it to say that it would have been clearer if they had named the attribute game_days or something like that

• There are three major attributes in the book's table:
• bag_ID (an id for each set of days belonging to a game)
• bag (the multivalued relational attribute containing the days belonging to games, grouped by game)
• play (the classification, play yes or no)

• The bag attribute has 4 (familiar) attributes describing the multivalued instances (of day):
• outlook
• temperature
• humidity
• windy

• In the body of the ARFF table, the multivalued entries are structured in this way:
• The data for the multiple days that belong together for a single bag_ID is enclosed in quotation marks
• Within the quotation marks, the individual sets of day data are separated by "\n", the newline character

• The book's ARFF table is shown in Figure 2.3
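Here is an abbreviated sketch of what such a relational-valued ("bag") attribute looks like in ARFF, in the spirit of the book's two-day game example. The declarations and data values are reconstructed for illustration, not copied from Figure 2.3.

```
% Illustrative sketch of a relational-valued ("bag") attribute,
% in the spirit of the two-day game example; values are invented
@relation weather-games

@attribute bag_ID {1, 2}
@attribute bag relational
  @attribute outlook {sunny, overcast, rainy}
  @attribute temperature {hot, mild, cool}
  @attribute humidity {high, normal}
  @attribute windy {TRUE, FALSE}
@end bag
@attribute play {yes, no}

@data
1, "sunny,hot,high,FALSE\novercast,hot,high,TRUE", no
2, "rainy,mild,normal,FALSE\nrainy,cool,normal,FALSE", yes
```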
Sparse Data
• Some data sets are sparse
• In this context the book means 0's for numerical values, not nulls
• Rather than listing everything, a row can be expressed economically by showing only the values that are present

• In ARFF, the attribute values for a sparse row are:
• identified by attribute number, starting with 0
• followed by the value
• separated by commas
• enclosed in braces
• E.g.: {1 X, 6 Y, 10 "Class A"}
• This doesn't work for nulls; you still have to include ?'s

Attribute Types
• The bottom line is that ARFF has only two fundamental types: nominal and numeric
• String attributes are effectively nominal
• Date attributes are effectively numeric
• (Recall the discussions of stuff like this in db)
• The rest of this subsection has to do with numeric types in particular

Numerics as Ordinals
• The important point is this:
• Different data mining algorithms treat plain numeric values differently
• One algorithm may treat numerics as ordinals, where only ordering applies, generating rules based on <, =, > comparisons

Numerics as Ratio Values
• Another algorithm may treat numerics as ratio values
• Recall that all arithmetic operations are defined in this case
• The algorithm may normalize ratio values

Normalization
• Normalization means putting values into a range, most commonly the range 0 to 1
• A simple approach for positive values: divide any given data value by the maximum present
• Another simple approach for positive values: subtract the minimum from the data value and divide by (max – min)

Standardization
• Values can also be statistically standardized
• Each data point is converted using this formula:
• x_standardized = (x – μ) / σ
• This puts the values into a distribution where the mean is 0 and the standard deviation is 1

Distance as an Example of Ratio Values
• Consider the calculation of distance in n-dimensional space, 2-space for example
• Calculating the square root of the sum of the squares of the differences of the coordinates involves arithmetic operators other than subtraction
• Normalization is implicated in a situation like this

• Given some (x, y) space, suppose x is in the range 0 to 10 and y is in the range 0 to 100
• Do you normalize both x and y before calculating distances, or not?
• Another way of stating this is: do x and y make corresponding contributions to the measure of distance between two data points, or not?
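Here is a minimal sketch of the min-max normalization, standardization, and Euclidean distance calculations just described, using a small invented two-attribute data set; the numbers are made up for illustration.

```python
# Minimal sketch of min-max normalization, standardization, and
# Euclidean distance on a small invented two-attribute data set.
import math
import statistics

xs = [1.0, 4.0, 7.0, 10.0]      # attribute x, roughly in the range 0..10
ys = [5.0, 40.0, 70.0, 100.0]   # attribute y, roughly in the range 0..100

def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Without normalization, y dominates the distance because of its larger range
print(euclidean((xs[0], ys[0]), (xs[1], ys[1])))

# After min-max normalization, x and y contribute on comparable scales
nx, ny = min_max(xs), min_max(ys)
print(euclidean((nx[0], ny[0]), (nx[1], ny[1])))

print(standardize(xs))  # values now have mean 0 and standard deviation 1
```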
Nominal Attributes and Distance
• This is a crude measure of distance for nominal attributes:
• If two instances have the same value for an attribute, the distance between them, measured on that attribute, is 0
• If two instances have different values for that attribute, the distance between them is 1

• There are cases where nominal attributes can be reverse-engineered back to numerics
• One example from the book: zip codes and geographic location coordinates
• Recall that zip codes came up in db in a similar way, as determinants of geographic locations

Nominal vs. Numeric
• Just as in db the assertion is made that an id "number" field should be TEXT:
• In data mining there may be attributes containing numeric digits which are really just nominal fields and should be mined as such

• Finally, some algorithms support nominals but not ordinals
• In the contact lens data, young < pre-presbyopic < presbyopic
• If this ordinal relationship is not recognized, a complete and correct set of rules can still be mined
• However, a complete and correct set of rules about 1/3 as large can be mined in a system that recognizes the relationship

Missing Values
• This is essentially a discussion of nulls
• The only new element consists of two questions:
• Can you infer anything from the absence of values?
• Would it be possible to meaningfully code why values are absent and mine something from this?

Inaccurate Values
• This is essentially a discussion of data integrity
• Both data miners and regular db users have to cope with faulty data one way or another
• The authors say this is especially important when mining
• It's especially important to the data miner if the data miner ascribes more significance to an attribute than a regular user does

The End