Database Management Systems
Data Mining: Data Compression
Jerry Post, Copyright © 2003

The Need for Data Compression
- Noisy data: combine data to smooth it (bins).
- Too many values in a dimension: combine data to get a smaller number of sets.
- Hierarchies: roll data up into natural hierarchies, or create additional hierarchies.
- Data compression: large text, images, time series (wavelet compression).

Bins
- For numerical data, create bins: equal-depth partitions, equal-width partitions, and bin mean/smoothing. How many bins to use is the modeler's choice. (A short code sketch follows this group of slides.)
- Example data: 3, 9, 17, 27, 32, 32, 37, 48, 55
- Equal-depth partitions: Bin 1: 3, 9, 17; Bin 2: 27, 32, 32; Bin 3: 37, 48, 55.
- Equal-width partitions: Bin 1: 3–20.3 (3, 9, 17); Bin 2: 20.3–37.6 (27, 32, 32, 37); Bin 3: 37.6–55 (48, 55).
- Smoothing by bin means (equal-depth): Bin 1: 9.7, 9.7, 9.7; Bin 2: 30.3, 30.3, 30.3; Bin 3: 46.7, 46.7, 46.7.

What is Cluster Analysis?
- The process of placing items into groups, where items within a group are similar to each other and dissimilar to items in other groups.
- Similar to classification analysis, except that in classification the group characteristics are known in advance (e.g., borrowers who successfully repaid loans).

Clustering
1. Find groups statistically.
2. Eliminate outliers.

Standardization
- If there are multiple attributes with continuous variables, you should standardize them.
- Z-scores are a common approach, but using the standard squared deviation places too much weight on outliers, so the mean absolute deviation is used instead (see the standardization sketch below):
  m_f = (1/n)(x_1f + x_2f + ... + x_nf)
  s_f = (1/n)(|x_1f - m_f| + |x_2f - m_f| + ... + |x_nf - m_f|)
  z_if = (x_if - m_f) / s_f

Distance Measure
- A simple distance measure is the L1 norm:
  d(i,j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_in - x_jn|
- A more common measure is the Euclidean or L2 norm:
  d(i,j) = sqrt((x_i1 - x_j1)^2 + (x_i2 - x_j2)^2 + ... + (x_in - x_jn)^2)

General Distance Measure
- In general (Minkowski) form, with the L-infinity (max) norm as the limiting case (see the distance sketch below):
  d(i,j) = (|x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_in - x_jn|^q)^(1/q)
  d_max(i,j) = max over k = 1..n of |x_ik - x_jk|

Hierarchies
- Dates: Year – Quarter – Month – Day, or Year – Quarter – Week – Day.
- Business hierarchies: Division/Product; Function (Marketing, HRM, Production, ...); Region.
- Region hierarchies: World – Continent – Nation – State – City.

Clustering: Principal Components
- Find the primary factors that define the data (vectors Y1, Y2).
- Statistical packages can compute them quickly.
- Map the raw data (x1, x2 points) onto the two vectors.
- Use the vectors instead of the raw data.

Factor Analysis: Latent Variables
- Some data is not observable but can be measured through indicator variables.
- Classic example: human intelligence, which can be evaluated through a variety of tests (IQ test, SAT, ACT).

Exploratory Factor Analysis
- Survey (marketing) with many items (questions).
- Are the items related? Can the data be described by a small number of factors?
- Diagram: Q1, Q2, Q3 load on Factor 1; Q4, Q5, Q6 load on Factor 2.

Regression to Reduce Values
- Estimate regression coefficients.
- If the fit is good, use the coefficients to pick categories.

Wavelet Compression
- Find patterns (wavelets) in the data.
- Standard compression methods (GIF).
- Reduces the data to a small number of wavelet patterns.
- Particularly useful for complex data.
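The following is a minimal Python sketch (not from the original slides) of equal-depth and equal-width binning and smoothing by bin means, applied to the example data from the Bins slide; the variable names and the choice of k = 3 are illustrative.

```python
# Minimal sketch of equal-depth binning, smoothing by bin means, and
# equal-width binning, using the example data from the Bins slide.
data = sorted([3, 9, 17, 27, 32, 32, 37, 48, 55])
k = 3  # number of bins; choosing k is the modeler's decision

# Equal-depth partitions: each bin gets (roughly) the same number of values.
depth = len(data) // k
bins = [data[i * depth:(i + 1) * depth] for i in range(k)]
print(bins)      # [[3, 9, 17], [27, 32, 32], [37, 48, 55]]

# Smoothing by bin means: replace every value with its bin's mean.
smoothed = [[round(sum(b) / len(b), 1)] * len(b) for b in bins]
print(smoothed)  # [[9.7, 9.7, 9.7], [30.3, 30.3, 30.3], [46.7, 46.7, 46.7]]

# Equal-width partitions: split the value range into k intervals of equal width.
lo, hi = min(data), max(data)
width = (hi - lo) / k
widths = [[x for x in data
           if lo + i * width <= x < lo + (i + 1) * width or (i == k - 1 and x == hi)]
          for i in range(k)]
print(widths)    # [[3, 9, 17], [27, 32, 32, 37], [48, 55]]
```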
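Next, a small sketch of the standardization formulas above, assuming plain Python lists; `standardize` is a hypothetical helper name, and the key point is that s_f is the mean absolute deviation rather than the standard deviation.

```python
# Z-scores computed with the mean absolute deviation s_f instead of the
# standard deviation, so outliers carry less weight.
def standardize(values):
    n = len(values)
    m_f = sum(values) / n                          # mean m_f
    s_f = sum(abs(x - m_f) for x in values) / n    # mean absolute deviation s_f
    return [(x - m_f) / s_f for x in values]       # z_if = (x_if - m_f) / s_f

print(standardize([3, 9, 17, 27, 32, 32, 37, 48, 55]))
```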
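One way to sketch the general distance measure: a single `minkowski` function (an illustrative name) that yields the L1 norm for q = 1 and the Euclidean norm for q = 2, with the max (L-infinity) norm computed separately.

```python
# General (Minkowski) distance: q = 1 gives L1, q = 2 gives Euclidean.
def minkowski(x, y, q=2):
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

# Limiting case as q grows: the max norm.
def max_norm(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

i, j = (1.0, 2.0, 3.0), (4.0, 0.0, 3.0)
print(minkowski(i, j, q=1))   # 5.0 (L1: 3 + 2 + 0)
print(minkowski(i, j, q=2))   # ~3.606 (Euclidean)
print(max_norm(i, j))         # 3.0
```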
Clustering Technologies
- K-means: the user specifies the number of clusters (k) and the system minimizes the intracluster distance (usually L2: the sum of squared errors). (See the k-means sketch below.)
- K-medoids: selects a representative center point.
- Hierarchical (BIRCH).
- Non-spherical (CURE).
- Density-based (DBSCAN, DENCLUE [best of group]).
- Grid-based (STING, CLIQUE).
Source: Jiawei Han and Micheline Kamber, Data Mining, 2001, Chapter 8.

Classification
- Classify an outcome based on attributes. Similar to prediction and attribute valuation, but generally splits attributes into groups.
- Example decision tree, "Will customers buy a new item?": Age <= 25 splits on Student (Yes -> Yes, No -> No); Age 26–50 predicts Yes; Age > 50 splits on Income (Low -> No, High -> Yes).

Decision Tree
- The modeler specifies the dependent variable (outcome), the attribute variables, and the sample/training data.
- The system selects the nodes and the split criteria that best fit the model, possibly using the information-gain criterion.
- The results are useful in a data mining/OLAP/cube browser: the tree nodes become important sides of the cube, and the classification levels specify hierarchies.

Bayesian Classifier
- Goal: predict the probability that an item belongs to a class. (See the classifier sketch below.)
- Begin with the naïve assumption: with m classes, the a priori probability is P(Ci) = 1/m.
- Bayes' theorem: the probability that sample X belongs in class Ci is
  P(Ci|X) = P(X|Ci) P(Ci) / P(X)
- Assume the attributes are independent, so you can multiply the probabilities, and estimate the attribute probabilities from the sample:
  P(X|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci)
- For categorical data, P(xk|Ci) = sik / si, where sik is the number of training samples of class Ci having the value xk for attribute Ak, and si is the number of training samples in Ci. For continuous variables, use a Gaussian (normal) distribution.

Neural Networks
- Diagram: sensory input cells feed a hidden layer through weighted connections (example input weights 3, -2, 7, 4), and the hidden layer feeds the output cells; only some of the connections are shown, and the network can work with an incomplete pattern/missing inputs.

Neural Networks: Back Propagation
- Start with a training sample with known results.
- Define the network: the number of layers (you must have one hidden layer) and the number of cells in each layer.
- Normalize the data between 0.0 and 1.0.
- Initialize the weights (small random numbers).
- Propagate forward and compute the error.
- Update the weights and biases working backwards.
- Each unit j takes inputs X0, X1, ..., Xn with weights w0j, w1j, ..., wnj, forms a weighted sum plus a bias, and applies the logistic function to produce its output:
  I_j = sum over i of (w_ij × O_i) + θ_j
  O_j = 1 / (1 + e^(-I_j))

NN: Error Calculations
- For an output unit: Err_j = O_j (1 - O_j)(T_j - O_j), where T_j is the true output and O_j (1 - O_j) is the derivative of the logistic function.
- For the hidden layer: Err_j = O_j (1 - O_j) × sum over k of (Err_k × w_jk)
- Weight and bias updates, where (l) is the learning rate (see the backpropagation sketch below):
  w_ij = w_ij + (l) Err_j O_i
  θ_j = θ_j + (l) Err_j
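As a rough illustration of the first entry in the clustering list above, here is a bare-bones k-means loop in plain Python; it is a sketch under simple assumptions (fixed iteration count, random initial centers), not the algorithm as implemented in any particular package.

```python
# Minimal k-means: the user supplies k; the loop assigns points to the nearest
# center and recomputes centers, reducing the intracluster sum of squared errors.
import random

def kmeans(points, k, iterations=100, seed=0):
    random.seed(seed)
    centers = random.sample(points, k)            # pick k initial centers
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                          # assign each point to its nearest center
            idx = min(range(k),
                      key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[idx].append(p)
        # recompute each center as the mean of its cluster (keep old center if empty)
        centers = [tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

pts = [(1, 1), (1.5, 2), (1, 0.6), (8, 8), (9, 11), (8, 9)]
centers, clusters = kmeans(pts, k=2)
print(centers)
```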
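A compact sketch of the naïve Bayesian classifier for categorical attributes, following the slide's estimate P(xk|Ci) = sik/si; the tiny training set and the function names `train`/`classify` are made up for illustration (note that a zero count zeroes out the product, which real implementations usually smooth).

```python
# Naive Bayes for categorical attributes: pick the class Ci that maximizes
# P(Ci) * product over k of P(xk|Ci), with P(xk|Ci) estimated as sik / si.
from collections import Counter, defaultdict

def train(samples):
    """samples: list of (attribute_tuple, class_label)."""
    class_counts = Counter(label for _, label in samples)
    value_counts = defaultdict(Counter)   # (class, attribute index) -> value counts
    for attrs, label in samples:
        for k, value in enumerate(attrs):
            value_counts[(label, k)][value] += 1
    return class_counts, value_counts, len(samples)

def classify(x, class_counts, value_counts, n):
    best, best_p = None, -1.0
    for ci, si in class_counts.items():
        p = si / n                                    # P(Ci)
        for k, value in enumerate(x):
            p *= value_counts[(ci, k)][value] / si    # P(xk|Ci) = sik / si
        if p > best_p:
            best, best_p = ci, p
    return best

training = [(("<=25", "student"), "buy"), (("<=25", "no"), "no"),
            (("26-50", "no"), "buy"), ((">50", "no"), "no")]
model = train(training)
print(classify(("<=25", "student"), *model))   # -> "buy"
```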
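Finally, a one-pass numerical sketch of the backpropagation formulas above for a tiny 2-input, 1-hidden-unit, 1-output network; all weights, biases, inputs, and the learning rate are made-up numbers.

```python
# One forward/backward pass using the slide formulas: logistic units,
# Err_j = O_j(1-O_j)(T_j-O_j) at the output,
# Err_j = O_j(1-O_j) * sum_k Err_k * w_jk in the hidden layer,
# and updates w_ij += l * Err_j * O_i, theta_j += l * Err_j.
import math

def logistic(i):
    return 1.0 / (1.0 + math.exp(-i))

x, target, l = [0.4, 0.9], 1.0, 0.5        # inputs, true output, learning rate
w_xh, theta_h = [0.1, -0.2], 0.05          # input -> hidden weights and bias
w_ho, theta_o = 0.3, -0.1                  # hidden -> output weight and bias

# forward pass
i_h = sum(w * xi for w, xi in zip(w_xh, x)) + theta_h
o_h = logistic(i_h)
i_o = w_ho * o_h + theta_o
o_o = logistic(i_o)

# backward pass: error terms
err_o = o_o * (1 - o_o) * (target - o_o)   # output unit
err_h = o_h * (1 - o_h) * err_o * w_ho     # hidden unit

# weight and bias updates
w_ho += l * err_o * o_h
theta_o += l * err_o
w_xh = [w + l * err_h * xi for w, xi in zip(w_xh, x)]
theta_h += l * err_h

print(round(o_o, 4), round(err_o, 4))
```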