
CIS4930 Introduction to Data Mining
Getting to Know Your Data
Peixiang Zhao
Tallahassee, Florida, 2016

Data
• Collection of data objects and their attributes
• A data object represents an entity
  – Examples:
    • Sales database: customers, store items, sales
    • Medical database: patients, treatments
    • University database: students, professors, courses
  – Also called records, examples, instances, points, objects, tuples
• Data objects are described by attributes
  – Properties or characteristics of data objects
  – Also called variables, fields, characteristics, features

Example
Each row is an object; each column is an attribute.

Tid | Refund | Marital Status | Taxable Income | Cheat
----|--------|----------------|----------------|------
1   | Yes    | Single         | 125K           | No
2   | No     | Married        | 100K           | No
3   | No     | Single         | 70K            | No
4   | Yes    | Married        | 120K           | No
5   | No     | Divorced       | 95K            | Yes
6   | No     | Married        | 60K            | No
7   | Yes    | Divorced       | 220K           | No
8   | No     | Single         | 85K            | Yes
9   | No     | Married        | 75K            | No
10  | No     | Single         | 90K            | Yes

Data Types
• Text
  – Each textual document is a collection of words
• Transactional data
  – Each transaction involves a set of items
• Graph
  – Vertices and edges
• Sequential data
  – An ordered sequence, e.g., a DNA sequence with A, T, C, G
• Spatial-temporal data
  – Time and location are implicit attributes
• Multimedia data
  – Audio, video, …

Data Matrix
• Data can often be represented or abstracted as an n × d data matrix with n rows and d columns
  – Rows: a.k.a., instances, examples, records, transactions, objects, points, feature-vectors, etc.
    Given as a d-tuple $x_i = (x_{i1}, x_{i2}, \ldots, x_{id})$
  – Columns: a.k.a., attributes, properties, features, dimensions, variables, fields, etc.
    Given as an n-tuple $X_j = (x_{1j}, x_{2j}, \ldots, x_{nj})$

Types of Attributes
• Nominal: categories, states or “names of things”
  – Special case: Binary
  – Examples: eye color, race, gender, zip codes
• Ordinal: values have a meaningful order but the magnitude between successive values is unknown
  – Examples: rankings (e.g., taste of potato chips on a scale from 1 to 10), grades, height in {tall, medium, short}
• Interval: measured on a scale of equal-sized units
  – Examples: calendar dates, temperatures in Celsius or Fahrenheit
• Ratio: both differences and ratios are meaningful
  – Examples: temperature in Kelvin (10 K is twice as high as 5 K), length, time, counts

Types of Attributes

Attribute Type   | Description | Operations | Examples
-----------------|-------------|------------|---------
Nominal / Binary | The values are just different names that provide only enough information to distinguish one object from another (equality). | =, ≠ | zip codes, employee ID numbers, eye color, gender
Ordinal          | The values provide enough information to order objects (equality and inequality). | <, > | pain level, rating, grades, street numbers
Interval         | The differences between values are meaningful, i.e., a unit of measurement exists. | +, - | calendar dates, temperature in Celsius or Fahrenheit
Ratio            | Both differences and ratios are meaningful. | *, / | temperature in Kelvin, monetary quantities, counts, age, mass, length

Discrete and Continuous Attributes
• Discrete Attribute
  – Has only a finite or countably infinite set of values
  – Examples: zip codes, counts, or the set of words in a collection of documents
  – Often represented as integer variables
  – Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
  – Has real numbers as attribute values
  – Examples: temperature, height, or weight
  – Practically, real values can only be measured and represented using a finite number of digits
  – Continuous attributes are typically represented as floating-point variables

Data: Algebraic and Geometric View
• For a numeric data matrix D, each row or point is a d-dimensional column vector

Data: Probabilistic View
• A random variable X is a function $X : \mathcal{O} \to \mathbb{R}$, where $\mathcal{O}$ is the set of all possible outcomes of the experiment, also called the sample space
  – If X is discrete, the probability mass function of X is defined as $f(x) = P(X = x)$
  – f must obey the basic rules of probability:
    • $f(x) \ge 0$
    • $\sum_x f(x) = 1$
  – Intuitively, for a discrete variable X, the probability is concentrated or massed at only discrete values in the range of X, and is zero for all other values

Data: Probabilistic View
• If X is continuous, the probability density function f of X is defined by $P(X \in [a, b]) = \int_a^b f(x)\,dx$
  – f must obey the basic rules of probability:
    • $f(x) \ge 0$
    • $\int_{-\infty}^{\infty} f(x)\,dx = 1$
  – Note that $P(X = v) = 0$ for all $v \in \mathbb{R}$, since there are infinitely many possible values in the sample space.
    The probability mass is spread so thinly over the range of values that it can be measured only over intervals $[a, b] \subset \mathbb{R}$, rather than at specific points

Probability Distributions
• Bernoulli distribution
  – An attribute A following the Bernoulli distribution with parameter $p \in [0, 1]$ has two values T and F, such that $P(A = T) = p$ and $P(A = F) = 1 - p$
• Binomial distribution
  – An attribute A following the binomial distribution with parameters n and p counts the number k of T values in n independent Bernoulli trials, each with probability p for T:
    $f(k) = P(A = k) = \binom{n}{k} p^k (1 - p)^{n-k}$
• Gaussian (Normal) distribution

Basic Statistical Description
• Motivation
  – To better understand the data: central tendency, variation and spread
• Data dispersion characteristics
  – median, max, min, quantiles, outliers, variance, etc.
  – Numerical dimensions correspond to sorted intervals
    • Boxplot or quantile analysis on sorted intervals
• Dispersion analysis on computed measures
  – Folding measures into numerical dimensions

Measuring the Central Tendency
• Mean (expected value)
  – The arithmetic average of the values: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
  – Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
  – Provides a one-number summary of the location or central tendency of the distribution
  – Not robust, because a single large value can skew the average
• Median (2nd quartile)
  – The “middle most” value: arrange all data points from lowest to highest and pick the middle one
    • Middle value if there is an odd number of values, otherwise the average of the middle two values
  – Robust, as it is not affected very much by extreme values
• Mode: value that occurs most frequently in the data
  – Not necessarily unique

Measuring the Central Tendency
Comparison of common measures of central tendency for the values { 1, 2, 2, 3, 4, 7, 9 }

Type   | Description | Example | Result
-------|-------------|---------|-------
Mean   | Sum of values of a data set divided by number of values | (1+2+2+3+4+7+9) / 7 | 4
Median | Middle value separating the greater and lesser halves of a data set | 1, 2, 2, 3, 4, 7, 9 | 3
Mode   | Most frequent value in a data set | 1, 2, 2, 3, 4, 7, 9 | 2

Measuring the Dispersion of Data
• Quartiles, outliers
  – Quartiles: Q1 (25th percentile), Q3 (75th percentile)
    • Q1: the middle number between the smallest and the median of the data set
    • Q3: the middle number between the median and the highest of the data set
  – Inter-quartile range: IQR = Q3 - Q1
  – Five number summary: min, Q1, median, Q3, max
  – Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
• Variance and standard deviation (sample: s, population: σ)
  – Sample variance:
    $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]$
  – Population variance:
    $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2$
  – Standard deviation s (or σ) is the square root of variance $s^2$ (or $\sigma^2$)

Measuring the Dispersion of Data
[Figure: boxplot shown alongside the density of a normal $N(0, 1\sigma^2)$ population]

Boxplot
• Data is represented with a box
• The ends of the box are at the first and third quartiles
  – The height of the box is IQR
• The median is marked by a line within the box
• Whiskers: two lines outside the box extending toward the minimum and maximum
  – Max length = 1.5 × IQR
• Outliers: points beyond a specified outlier threshold, plotted individually

Histogram
• A graph display of tabulated frequencies, shown as bars
  – Shows what proportion of cases fall into each of several categories
  – The categories are usually specified as non-overlapping intervals of some variable. The categories (bars) must be adjacent
[Figure: histogram with a frequency axis from 0 to 40 and a value axis from 10000 to 90000]

Histograms Often Tell More than Boxplots
• Two histograms may have the same boxplot representation
  – The same values for: min, Q1, median, Q3, max
• But they have rather different data distributions

Quantile-Quantile (Q-Q) Plot
• Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
  – View: is there a shift in going from one distribution to another?
• Example shows unit price of items sold at Branch 1 vs. Branch 2 for each quantile.
  Unit prices of items sold at Branch 1 tend to be lower than those at Branch 2

Scatter Plot
• Provides a first look at bivariate data to see clusters of points, outliers, etc.
  – Each pair of values is treated as a pair of coordinates and plotted as points in the plane

Scatterplot Matrix
• Matrix of scatterplots of the k-dimensional data
  – A total of $(k^2 - k)/2$ distinct scatterplots, one per pair of dimensions

Similarity and Dissimilarity
• Similarity
  – Numerical measure of how alike two data objects are
  – Value is higher when objects are more alike
  – Often falls in the range [0, 1]
• Dissimilarity (e.g., distance)
  – Numerical measure of how different two data objects are
  – Lower when objects are more alike
  – Minimum dissimilarity is often 0
  – Upper limit varies
• Proximity refers to a similarity or dissimilarity

Proximity Measure for Nominal Attributes
• Method 1: Simple matching
  – For objects i and j, with m the number of matches and p the total number of variables:
    $d(i, j) = \frac{p - m}{p}$
• Method 2: Use a large number of binary attributes
  – Create a new binary attribute for each of the M nominal states
    • A color attribute with values of red, yellow, blue, green, etc.
    • Create a series of new attributes red?, yellow?, blue?, green?, …

Proximity Measure for Binary Attributes
• A contingency table for binary data, where q, r, s, t count the attributes with each combination of values and p = q + r + s + t

             | Object j = 1 | Object j = 0 | sum
-------------|--------------|--------------|------
Object i = 1 | q            | r            | q + r
Object i = 0 | s            | t            | s + t
sum          | q + s        | r + t        | p

• Distance measure for symmetric binary variables: $d(i, j) = \frac{r + s}{q + r + s + t}$
• Distance measure for asymmetric binary variables: $d(i, j) = \frac{r + s}{q + r + s}$
• Jaccard coefficient (similarity measure for asymmetric binary variables): $sim_{Jaccard}(i, j) = \frac{q}{q + r + s}$

Example

Name | Gender | Fever | Cough | Test-1 | Test-2 | Test-3 | Test-4
-----|--------|-------|-------|--------|--------|--------|-------
Jack | M      | Y     | N     | P      | N      | N      | N
Mary | F      | Y     | N     | P      | N      | P      | N
Jim  | M      | Y     | P     | N      | N      | N      | N

• Compute the distance between different individuals based on asymmetric binary attributes
  – Gender is a symmetric attribute; the remaining attributes are asymmetric binary
  – Let the values Y and P be 1, and the value N be 0

    $d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$
    $d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$
    $d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$

Distance on Numeric Data
• Minkowski Distance
    $d(i, j) = \left( \sum_{f=1}^{p} |x_{if} - x_{jf}|^h \right)^{1/h}$
  – where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
• Properties
  – Positive definiteness: d(i, j) > 0 if i ≠ j, and d(i, i) = 0
  – Symmetry: d(i, j) = d(j, i)
  – Triangle inequality: d(i, j) ≤ d(i, k) + d(k, j)
• A distance that satisfies these properties is a metric

Special Cases of Minkowski Distance
• h = 1: Manhattan distance (city block, $L_1$ norm)
  – E.g., the Hamming distance: the number of bits that are different between two binary vectors
    $d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$
• h = 2: Euclidean distance ($L_2$ norm)
    $d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$
• h → ∞: “supremum” distance ($L_\infty$ norm)
  – This is the maximum difference between any component (attribute) of the vectors:
    $d(i, j) = \max_{f} |x_{if} - x_{jf}|$

Example

point | attribute 1 | attribute 2
------|-------------|------------
x1    | 1           | 2
x2    | 3           | 5
x3    | 2           | 0
x4    | 4           | 5

Manhattan ($L_1$)

L1 | x1 | x2 | x3 | x4
---|----|----|----|---
x1 | 0  |    |    |
x2 | 5  | 0  |    |
x3 | 3  | 6  | 0  |
x4 | 6  | 1  | 7  | 0

Euclidean ($L_2$)

L2 | x1   | x2  | x3   | x4
---|------|-----|------|---
x1 | 0    |     |      |
x2 | 3.61 | 0   |      |
x3 | 2.24 | 5.1 | 0    |
x4 | 4.24 | 1   | 5.39 | 0

Supremum ($L_\infty$)

L∞ | x1 | x2 | x3 | x4
---|----|----|----|---
x1 | 0  |    |    |
x2 | 3  | 0  |    |
x3 | 2  | 5  | 0  |
x4 | 3  | 1  | 5  | 0

Distance on Ordinal Variables
• An ordinal variable can be discrete or continuous
  – Order is important, e.g., rank
• Can be treated like interval-scaled variables
  – Replace $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$
  – Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
    $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
  – Compute the dissimilarity using methods for interval-scaled variables

Cosine Similarity
• A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as keywords) or phrase in the document
  – Applications: information retrieval, biological taxonomy, gene feature mapping
• If $d_1$ and $d_2$ are two vectors (e.g., term-frequency vectors), then
    $\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$
  where $\cdot$ indicates the vector dot product and $\|d\|$ is the length of vector d

Example
• Find the similarity between documents 1 and 2
    $d_1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)$
    $d_2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)$
    $d_1 \cdot d_2 = 5 \cdot 3 + 0 \cdot 0 + 3 \cdot 2 + 0 \cdot 0 + 2 \cdot 1 + 0 \cdot 1 + 0 \cdot 0 + 2 \cdot 1 + 0 \cdot 0 + 0 \cdot 1 = 25$
    $\|d_1\| = (5^2 + 0^2 + 3^2 + 0^2 + 2^2 + 0^2 + 0^2 + 2^2 + 0^2 + 0^2)^{0.5} = 42^{0.5} = 6.481$
    $\|d_2\| = (3^2 + 0^2 + 2^2 + 0^2 + 1^2 + 1^2 + 0^2 + 1^2 + 0^2 + 1^2)^{0.5} = 17^{0.5} = 4.12$
  So, $\cos(d_1, d_2) = 0.94$

Cosine Similarity
• Cosine similarity measures orientation, not magnitude. It can be seen as a comparison between documents in a normalized space: it considers the angle between the document vectors rather than the magnitude of each document's word counts
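
The cosine similarity example for documents 1 and 2 can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function name `cosine_similarity` is a hypothetical helper chosen for illustration, not something defined in the slides.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors.

    Hypothetical helper for illustration: dot(u, v) / (||u|| * ||v||).
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Term-frequency vectors from the slide example.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

similarity = cosine_similarity(d1, d2)
print(round(similarity, 2))  # 0.94

# Doubling every count changes the magnitude but not the orientation.
scaled = cosine_similarity(d1, [2 * x for x in d1])
```

Scaling a vector leaves the result unchanged (`scaled` is 1 up to rounding), which is the sense in which the measure compares orientation rather than magnitude.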
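
The Manhattan, Euclidean and supremum distances in the four-point example under Distance on Numeric Data can be reproduced with one function. This is a sketch under stated assumptions: `minkowski` is a hypothetical helper name, and the supremum case is requested by passing `math.inf` as the order h, which is a convention of this sketch rather than of the slides.

```python
import math


def minkowski(a, b, h):
    """Minkowski distance of order h between two equal-length points.

    Hypothetical helper: h=1 is Manhattan, h=2 is Euclidean, and
    h=math.inf is treated as the supremum (maximum-difference) distance.
    """
    if len(a) != len(b):
        raise ValueError("points must have the same dimension")
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if h == math.inf:
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1 / h)


# The four 2-dimensional points from the slide example.
points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}

manhattan = minkowski(points["x1"], points["x2"], 1)
euclidean = minkowski(points["x1"], points["x2"], 2)
supremum = minkowski(points["x1"], points["x2"], math.inf)

print(manhattan, round(euclidean, 2), supremum)  # 5.0 3.61 3
```

Looping over all pairs of keys in `points` would fill in the three lower-triangular distance tables.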
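
The central tendency and dispersion measures can be tried on the values { 1, 2, 2, 3, 4, 7, 9 } with Python's standard `statistics` module. In this sketch `five_number_summary` is a hypothetical helper, and it assumes the "median of each half" reading of Q1 and Q3 with the overall median excluded from both halves when the count is odd; other quartile conventions give slightly different values.

```python
import statistics

data = [1, 2, 2, 3, 4, 7, 9]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max).

    Hypothetical helper. Assumption: Q1 and Q3 are the medians of the
    lower and upper halves, excluding the overall median when the
    number of values is odd. Needs at least two values.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (
        ordered[0],
        statistics.median(lower),
        statistics.median(ordered),
        statistics.median(upper),
        ordered[-1],
    )


summary = five_number_summary(data)
q1, q3 = summary[1], summary[3]
iqr = q3 - q1

# Flag values more than 1.5 * IQR below Q1 or above Q3.
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]

print(statistics.mean(data), statistics.median(data), statistics.mode(data))
print(summary, iqr, outliers)
print(statistics.variance(data), statistics.pvariance(data))  # sample, population
```

`statistics.variance` divides by n - 1 (the sample formula for s²) and `statistics.pvariance` divides by N (the population formula for σ²).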
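
The Jack, Mary and Jim distances in the binary-attribute example can also be recomputed. This sketch assumes the usual asymmetric binary dissimilarity (r + s) / (q + r + s), where q counts attributes that are 1 for both objects, r those that are 1 only for the first, and s those that are 1 only for the second; joint absences are ignored. The function name and the string encoding of each record are illustrative choices, not part of the slides.

```python
def asymmetric_binary_distance(a, b):
    """Dissimilarity (r + s) / (q + r + s) for two 0/1 vectors.

    Hypothetical helper; 0/0 matches are ignored, which is what makes
    the measure suitable for asymmetric binary attributes.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    if q + r + s == 0:
        raise ValueError("undefined when neither object has a positive value")
    return (r + s) / (q + r + s)


def encode(record):
    """Map Y and P to 1 and N to 0, as in the slide example."""
    return [0 if value == "N" else 1 for value in record]


# Fever, Cough, Test-1, Test-2, Test-3, Test-4 (Gender is symmetric, so omitted).
jack = encode("YNPNNN")
mary = encode("YNPNPN")
jim = encode("YPNNNN")

d_jack_mary = asymmetric_binary_distance(jack, mary)
d_jack_jim = asymmetric_binary_distance(jack, jim)
d_jim_mary = asymmetric_binary_distance(jim, mary)

print(round(d_jack_mary, 2), round(d_jack_jim, 2), round(d_jim_mary, 2))
# 0.33 0.67 0.75
```

The smallest distance is between Jack and Mary, so under this measure they are the most similar pair.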
