CIS4930
Introduction to Data Mining
Getting to Know Your Data
Peixiang Zhao
Tallahassee, Florida, 2016
Data
• Collection of data objects and their attributes
• A data object represents an entity
– Examples:
• Sales database: customers, store items, sales
• Medical database: patients, treatments
• University database: students, professors, courses
– Also called records, examples, instances, points, objects, tuples
• Data objects are described by attributes
– Properties or characteristics of data objects
– Also called variables, fields, characteristics, features
Example
(Columns are attributes; rows are objects.)

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Data Types
• Text
– Each textual document is a collection of words
• Transactional data
– Each transaction involves a set of items
• Graph
– Vertices and edges
• Sequential data
– An ordered sequence, e.g., a DNA sequence with A, T, C, G
• Spatial-temporal data
– Time and location are implicit attributes
• Multimedia data
– Audio, video, …
Data Matrix
• Data can often be represented or abstracted as an n × d
data matrix with n rows and d columns
– Rows: a.k.a. instances, examples, records, transactions, objects,
points, feature-vectors, etc. Each row is given as a d-tuple
$x_i = (x_{i1}, x_{i2}, \ldots, x_{id})$
– Columns: a.k.a. attributes, properties, features, dimensions,
variables, fields, etc. Each column is given as an n-tuple
$X_j = (x_{1j}, x_{2j}, \ldots, x_{nj})$
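The row and column views of a data matrix can be sketched in a few lines. This is only an illustration: the three rows are taken from the Refund / Marital Status / Taxable Income example, and storing income as an integer number of thousands is an assumption made here.

```python
# Rows are data objects (d-tuples); columns are attributes (n-tuples).
# Income is stored in thousands, which is an encoding chosen for this sketch.
D = [
    ("Yes", "Single", 125),
    ("No", "Married", 100),
    ("No", "Single", 70),
]

n = len(D)       # number of rows (objects)
d = len(D[0])    # number of columns (attributes)

x_2 = D[1]                 # second row: x_2 = (x_21, x_22, x_23)
columns = list(zip(*D))    # transpose: one n-tuple per attribute
X_3 = columns[2]           # third column: X_3 = (x_13, x_23, x_33)

print(n, d)
print(x_2)
print(X_3)
```

Transposing with `zip(*D)` is enough for small examples; a numeric library would normally be used for large numeric matrices.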
Types of Attributes
• Nominal: categories, states or “names of things”
– Special case: Binary
– Examples: eye color, race, gender, zip codes
• Ordinal: values have a meaningful order but magnitude
between successive values is unknown
– Examples: rankings (e.g., taste of potato chips on a scale from 1 to 10), grades, height in {tall, medium, short}
• Interval: on a scale of equal-sized units
– Examples: calendar dates, temperatures in Celsius or Fahrenheit
• Ratio: has an inherent zero point, so ratios of values are meaningful
– Examples: temperature in Kelvin (10 K is twice as high as 5 K),
length, time, counts
Types of Attributes
• Nominal / Binary
– Description: The values are just different names that provide only enough
information to distinguish one object from another (equality: =, ≠)
– Examples: zip codes, employee ID numbers, eye color, gender
• Ordinal
– Description: The values provide enough information to order objects
(equality and inequality: <, >)
– Examples: pain level, rating, grades, street numbers
• Interval
– Description: The differences between values are meaningful, i.e., a unit of
measurement exists (+, -)
– Examples: calendar dates, temperature in Celsius or Fahrenheit
• Ratio
– Description: Both differences and ratios are meaningful (*, /)
– Examples: temperature in Kelvin, monetary quantities, counts, age, mass,
length
Discrete and Continuous Attributes
• Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection of
documents
– Often represented as integer variables
– Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight
– Practically, real values can only be measured and represented using
a finite number of digits
– Continuous attributes are typically represented as floating-point
variables
Data: Algebraic and Geometric View
• For numeric data matrix D, each row or point is a d-dimensional column vector
Data: Probabilistic View
• A random variable X is a function X : O → R, where O is
the set of all possible outcomes of the experiment, also
called the sample space
– If X is discrete, the probability mass function of X is defined as
$f(x) = P(X = x)$
– f must obey the basic rules of probability:
• $f(x) \ge 0$
• $\sum_x f(x) = 1$
– Intuitively, for a discrete variable X, the probability is concentrated
or massed at only discrete values in the range of X, and is zero for
all other values
Data: Probabilistic View
• If X is continuous, the probability density function f of X
is defined by
$P(X \in [a, b]) = \int_a^b f(x)\,dx$
– f must obey the basic rules of probability:
• $f(x) \ge 0$
• $\int_{-\infty}^{\infty} f(x)\,dx = 1$
– Note that P(X = v) = 0 for all v ∈ R since there are infinitely
many possible values in the sample space. The probability mass is spread
so thinly over the range of values that it can be measured only
over intervals [a, b] ⊂ R, rather than at specific points
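The two density rules and the "zero probability at a point" remark can be checked numerically. The sketch below assumes a standard normal density and a simple trapezoidal rule; both choices, and the function names, are illustrative rather than taken from the slides.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a Gaussian with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def integrate(f, a, b, steps=100_000):
    """Approximate the integral of f over [a, b] with the trapezoidal rule."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h

# Total mass: [-10, 10] stands in for (-inf, inf) because the tails are negligible.
total_mass = integrate(normal_pdf, -10.0, 10.0)
# Probability of an interval: P(X in [-1, 1]).
p_interval = integrate(normal_pdf, -1.0, 1.0)
# A zero-width interval carries no probability: P(X = 0.5) = 0.
p_point = integrate(normal_pdf, 0.5, 0.5)

print(round(total_mass, 6), round(p_interval, 4), p_point)
```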
Probability Distributions
• Bernoulli distribution
– An attribute A following the Bernoulli distribution with parameter p ∈ [0, 1]
has two values T and F, such that P(A=T) = p, and P(A=F) = 1-p
• Binomial distribution
– An attribute A following the Binomial distribution with parameters n and p
counts the number k of T values in n independent Bernoulli trials with
probability p for T
$f(k) = P(A = k) = \binom{n}{k} p^k (1 - p)^{n-k}$
• Gaussian (Normal) distribution
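As a quick check of the Binomial probability mass function, the sketch below evaluates it for n = 10 and p = 0.3 (values chosen only for illustration) and confirms that the probabilities sum to 1. With n = 1 it reduces to the Bernoulli case.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(A = k): probability of exactly k T values in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.3   # illustrative parameters, not from the slides
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# A valid PMF is non-negative and sums to 1.
total = sum(pmf)

# n = 1 gives the Bernoulli distribution: [P(A=F), P(A=T)] = [1 - p, p].
bernoulli = [binomial_pmf(k, 1, p) for k in (0, 1)]

print(round(total, 12), [round(v, 3) for v in bernoulli])
```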
Basic Statistical Description
• Motivation
– To better understand the data: central tendency, variation and
spread
• Data dispersion characteristics
– median, max, min, quantiles, outliers, variance, etc.
• Numerical dimensions correspond to sorted intervals
– Boxplot or quantile analysis on sorted intervals
• Dispersion analysis on computed measures
– Folding measures into numerical dimensions
Measuring the Central Tendency
• Mean (expected value)
– The arithmetic average of the values:
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
– Weighted arithmetic mean:
$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
– Provides a one-number summary of the location or central tendency for the
distribution
– Not robust because a single large value can skew the average
• Median (2nd quartile)
– The “middle most” value: arranging all data points from lowest value to
highest value and picking the middle one
– Middle value if odd number of values, or average of the middle two values
otherwise
– Robust as it is not affected very much by extreme values
• Mode: Value that occurs most frequently in the data
– Not necessarily unique
Measuring the Central Tendency
Comparison of common central stats of values { 1, 2, 2, 3, 4, 7, 9 }

Type    Description                                                     Example                  Result
Mean    Sum of values of a data set divided by number of values         (1+2+2+3+4+7+9) / 7      4
Median  Middle value separating the greater and lesser halves           1, 2, 2, 3, 4, 7, 9      3
        of a data set
Mode    Most frequent value in a data set                               1, 2, 2, 3, 4, 7, 9      2
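The comparison above can be reproduced with Python's standard `statistics` module. The `weighted_mean` helper and the extra value 100 are additions made here for illustration; the latter shows why the mean is described as not robust.

```python
import statistics

values = [1, 2, 2, 3, 4, 7, 9]

mean = statistics.mean(values)         # arithmetic average
median = statistics.median(values)     # middle value of the sorted data
modes = statistics.multimode(values)   # all most-frequent values (may be several)

def weighted_mean(xs, ws):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)

# With equal weights the weighted mean equals the ordinary mean.
equal_weight_mean = weighted_mean(values, [1] * len(values))

# A single extreme value moves the mean far more than the median.
with_outlier = values + [100]
mean_out = statistics.mean(with_outlier)
median_out = statistics.median(with_outlier)

print(mean, median, modes)
print(mean_out, median_out)
```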
Measuring the Dispersion of Data
• Quartiles, outliers
– Quartiles: Q1 (25th percentile), Q3 (75th percentile)
• Q1: the middle number between the smallest and the median of the data set
• Q3: the middle number between the median and the highest of the data set
– Inter-quartile range: IQR = Q3 – Q1
– Five number summary: min, Q1, median, Q3, max
– Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
• Variance and standard deviation (sample: s, population: σ)
– Sample variance:
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]$
– Population variance:
$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2$
– Standard deviation s (or σ) is the square root of variance s² (or σ²)
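The dispersion measures can be sketched as follows. Quartile conventions differ between textbooks and libraries; this sketch assumes the "median of each half" reading of the Q1/Q3 definitions, and the data set with the extreme value 30 is hypothetical.

```python
import statistics

def quartiles(data):
    """Return (Q1, median, Q3) using the median of the lower and upper halves."""
    xs = sorted(data)
    half = len(xs) // 2
    lower = xs[:half]                  # values below the median position
    upper = xs[half + len(xs) % 2:]    # values above the median position
    return statistics.median(lower), statistics.median(xs), statistics.median(upper)

data = [1, 2, 2, 3, 4, 7, 9, 30]       # hypothetical; 30 is deliberately extreme
q1, med, q3 = quartiles(data)
iqr = q3 - q1
five_number = (min(data), q1, med, q3, max(data))

# Usual outlier rule: more than 1.5 * IQR below Q1 or above Q3.
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]

# Variance, computed from the definition and from the sum-of-squares shortcut.
xs = [1, 2, 2, 3, 4, 7, 9]
n = len(xs)
mean = sum(xs) / n
s2 = sum((x - mean) ** 2 for x in xs) / (n - 1)                       # sample
s2_shortcut = (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)
sigma2 = sum((x - mean) ** 2 for x in xs) / n                         # population
sigma2_shortcut = sum(x * x for x in xs) / n - mean ** 2

print(five_number, iqr, outliers)
print(round(s2, 4), round(sigma2, 4), round(s2 ** 0.5, 4))
```

The shortcut forms need only one pass over the data, but they can lose precision when the mean is large relative to the spread.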
Measuring the Dispersion of Data
[Figure: boxplot shown alongside a normal distribution N(0, σ²)]
Boxplot
• Data is represented with a box
• The ends of the box are at the first and third quartiles
– The height of the box is IQR
• The median is marked by a line within the box
• Whiskers: two lines outside the box extended to Minimum and
Maximum
– Max length = 1.5*IQR
• Outliers: points beyond a specified outlier threshold, plotted
individually
Histogram
• A graph display of tabulated frequencies, shown as bars
– Shows what proportion of cases fall into each of several
categories
– The categories are usually specified as non-overlapping intervals
of some variable. The categories (bars) must be adjacent
[Figure: example histogram; horizontal axis from 10000 to 90000, vertical axis (frequency) from 0 to 40]
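Tabulating frequencies over adjacent, non-overlapping intervals can be sketched as below. The `prices` values and the bin settings are hypothetical and do not reproduce the data behind the slide's figure.

```python
def histogram(values, low, high, bins):
    """Count values in `bins` equal-width, adjacent intervals covering [low, high)."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        if low <= v < high:
            counts[int((v - low) // width)] += 1
    return counts

# Hypothetical data, for illustration only.
prices = [12000, 15000, 31000, 33000, 38000, 52000, 55000, 71000, 95000]
counts = histogram(prices, 0, 100000, 5)   # five intervals of width 20000

# Text rendering: one bar per interval.
for i, c in enumerate(counts):
    start = i * 20000
    print(f"[{start:>6}, {start + 20000:>6}) {'#' * c}")
```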
Histograms Often Tell More than Boxplots
• Two histograms may have the same boxplot
representation
– The same values for: min, Q1, median, Q3, max
• But they have rather different data distributions
Quantile-Quantile (Q-Q) Plot
• Graphs the quantiles of one univariate distribution
against the corresponding quantiles of another
– View: is there a shift in going from one distribution to another?
• Example shows unit price of items sold at Branch 1 vs. Branch 2
for each quantile. Unit prices of items sold at Branch 1 tend to be
lower than those at Branch 2
Scatter Plot
• Provides a first look at bivariate data to see clusters of
points, outliers, etc.
– Each pair of values is treated as a pair of coordinates and plotted
as points in the plane
Scatterplot Matrix
• Matrix of scatterplots of the k-dimension data
– A total of k(k − 1)/2 distinct scatterplots, one for each pair of dimensions
Similarity and Dissimilarity
• Similarity
– Numerical measure of how alike two data objects are
– Value is higher when objects are more alike
– Often falls in the range [0,1]
• Dissimilarity (e.g., distance)
– Numerical measure of how different two data objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
• Proximity refers to a similarity or dissimilarity
Proximity Measure for Nominal Attributes
• Method 1: Simple matching
– For objects i and j, m: # of matches, p: total # of variables
$d(i, j) = \frac{p - m}{p}$
• Method 2: Use a large number of binary attributes
– creating a new binary attribute for each of the M nominal states
• A color attribute with values of red, yellow, blue, green, etc.
• Create a series of new attributes red?, yellow?, blue?, green? …
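Both methods can be sketched briefly. The two example objects and the helper names are hypothetical; only the color states come from the slide.

```python
def simple_matching_distance(obj_i, obj_j):
    """d(i, j) = (p - m) / p, with m matching attributes out of p."""
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p

def one_hot(value, states):
    """Method 2: one binary attribute per nominal state."""
    return [1 if value == s else 0 for s in states]

# Hypothetical objects described by three nominal attributes.
a = ("red", "single", "FL")
b = ("blue", "single", "FL")
d = simple_matching_distance(a, b)   # one mismatch out of three attributes

colors = ["red", "yellow", "blue", "green"]
blue_bits = one_hot("blue", colors)

print(round(d, 2), blue_bits)
```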
Proximity Measure for Binary Attributes
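• A contingency table for binary data, where q, r, s and t count the
attributes taking each combination of values for objects i and j

                  Object j
                  1       0       sum
Object i    1     q       r       q + r
            0     s       t       s + t
            sum   q + s   r + t   p

• Distance measure for symmetric binary variables
$d(i, j) = \frac{r + s}{q + r + s + t}$
• Distance measure for asymmetric binary variables
$d(i, j) = \frac{r + s}{q + r + s}$
• Jaccard coefficient (similarity measure for asymmetric binary
variables)
$sim_{Jaccard}(i, j) = \frac{q}{q + r + s}$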
Example
Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N

• Compute the distance between different individuals based on
asymmetric binary attributes
– Gender is a symmetric attribute; the remaining attributes are asymmetric binary
– Let the values Y and P be 1, and the value N be 0
$d(jack, mary) = \frac{0 + 1}{2 + 0 + 1} = 0.33$
$d(jack, jim) = \frac{1 + 1}{1 + 1 + 1} = 0.67$
$d(jim, mary) = \frac{1 + 2}{1 + 1 + 2} = 0.75$
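The three distances can be recomputed with a short sketch. It assumes the usual counts for a pair of objects: q attributes where both are 1, r where only the first is 1, and s where only the second is 1. Attributes where both are 0 are ignored, which is what makes the measure asymmetric. The variable and function names are illustrative.

```python
# Asymmetric binary attributes only; Gender is symmetric and is left out.
ATTRS = ["Fever", "Cough", "Test-1", "Test-2", "Test-3", "Test-4"]
RAW = {
    "Jack": ["Y", "N", "P", "N", "N", "N"],
    "Mary": ["Y", "N", "P", "N", "P", "N"],
    "Jim":  ["Y", "P", "N", "N", "N", "N"],
}

# Encode Y and P as 1, N as 0.
encoded = {name: [0 if v == "N" else 1 for v in vals] for name, vals in RAW.items()}

def asymmetric_binary_distance(a, b):
    """(r + s) / (q + r + s); undefined if the objects share no 1s and no mismatches."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    return (r + s) / (q + r + s)

d_jack_mary = asymmetric_binary_distance(encoded["Jack"], encoded["Mary"])
d_jack_jim = asymmetric_binary_distance(encoded["Jack"], encoded["Jim"])
d_jim_mary = asymmetric_binary_distance(encoded["Jim"], encoded["Mary"])

print(round(d_jack_mary, 2), round(d_jack_jim, 2), round(d_jim_mary, 2))
```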
Distance on Numeric Data
• Minkowski Distance
$d(i, j) = \left( |x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h \right)^{1/h}$
– where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two
p-dimensional data objects, and h is the order (the distance so
defined is also called the L-h norm)
• Properties
– Positive definiteness: d(i, j) > 0 if i ≠ j, and d(i, i) = 0
– Symmetry: d(i, j) = d(j, i)
– Triangle Inequality: d(i, j) ≤ d(i, k) + d(k, j)
• A distance that satisfies these properties is a metric
Special Cases of Minkowski Distance
• h = 1: Manhattan distance (city block, L1 norm)
– E.g., the Hamming distance: the number of bits that are different
between two binary vectors
$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$
• h = 2: Euclidean distance (L2 norm)
$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$
• h → ∞: “supremum” distance (L∞ norm)
– This is the maximum difference between any component (attribute) of
the vectors
$d(i, j) = \max_{f} |x_{if} - x_{jf}|$
Example
point  attribute 1  attribute 2
x1     1            2
x2     3            5
x3     2            0
x4     4            5

Manhattan (L1)
L1   x1    x2    x3    x4
x1   0
x2   5     0
x3   3     6     0
x4   6     1     7     0

Euclidean (L2)
L2   x1    x2    x3    x4
x1   0
x2   3.61  0
x3   2.24  5.1   0
x4   4.24  1     5.39  0

Supremum (L∞)
L∞   x1    x2    x3    x4
x1   0
x2   3     0
x3   2     5     0
x4   3     1     5     0
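The three distance tables can be recomputed from the four points with one function. This is a minimal sketch; the function names and the dictionary layout are choices made here, and `h = math.inf` is used to select the supremum case.

```python
import math

def minkowski(a, b, h):
    """Minkowski distance of order h; h = math.inf gives the supremum distance."""
    if math.isinf(h):
        return max(abs(x - y) for x, y in zip(a, b))
    return sum(abs(x - y) ** h for x, y in zip(a, b)) ** (1.0 / h)

points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}

def distance_matrix(h):
    """All pairwise distances, rounded to two decimals as in the tables."""
    return {
        (u, v): round(minkowski(points[u], points[v], h), 2)
        for u in points
        for v in points
    }

manhattan = distance_matrix(1)
euclidean = distance_matrix(2)
supremum = distance_matrix(math.inf)

for name, table in (("L1", manhattan), ("L2", euclidean), ("Lmax", supremum)):
    print(name, [table[("x1", v)] for v in points])
```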
Distance on Ordinal Variables
• An ordinal variable can be discrete or continuous
– Order is important, e.g., rank
• Can be treated like interval-scaled
– replace $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$
– map the range of each variable onto [0, 1] by replacing the i-th object
in the f-th variable by
$z_{if} = \frac{r_{if} - 1}{M_f - 1}$
– compute the dissimilarity using methods for interval-scaled
variables
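The rank normalization can be sketched as follows, using the ordinal height scale {short, medium, tall} as an example. The function name and the three sample values are hypothetical, and the sketch assumes the variable has at least two states.

```python
def normalize_ranks(values, ordered_states):
    """Map each ordinal value to z = (r - 1) / (M - 1), where r is its rank in 1..M."""
    M = len(ordered_states)   # assumes M >= 2
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    return [(rank[v] - 1) / (M - 1) for v in values]

# Hypothetical observations of one ordinal variable.
heights = ["short", "tall", "medium"]
z = normalize_ranks(heights, ["short", "medium", "tall"])

# The normalized values can now be compared like interval-scaled data.
d_short_tall = abs(z[0] - z[1])
d_short_medium = abs(z[0] - z[2])

print(z, d_short_tall, d_short_medium)
```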
Cosine Similarity
• A document can be represented by thousands of attributes, each
recording the frequency of a particular word (such as keywords) or
phrase in the document
– Applications: information retrieval, biologic taxonomy, gene feature mapping
• If d1 and d2 are two vectors (e.g., term-frequency vectors), then
cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||)
where · indicates vector dot product, ||d||: the length of vector d
Example
• Find the similarity between documents 1 and 2
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1 · d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*0+2*1+0*0+0*1 = 25
||d1|| = (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)^0.5 = (17)^0.5 = 4.12
So,
cos(d1, d2) = 25 / (6.481 * 4.12) = 0.94
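The worked example can be reproduced with a short sketch. The function name is illustrative, and the final check with a doubled copy of d1 is an addition that shows the measure ignores vector length.

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (||u|| * ||v||); undefined for an all-zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

sim = cosine_similarity(d1, d2)

# Doubling every term count changes the length but not the direction.
doubled = tuple(2 * x for x in d1)
sim_scaled = cosine_similarity(d1, doubled)

print(round(sim, 2), round(sim_scaled, 2))
```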
Cosine Similarity
• Cosine similarity measures orientation, not magnitude. It
can be seen as a comparison between documents in a normalized
space: what matters is not the magnitude of each document's
word counts but the angle between the document vectors