Data Mining – Day 1
Fabiano Dalpiaz
Department of Information and
Communication Technology
University of Trento - Italy
http://www.dit.unitn.it/~dalpiaz
Database e Business Intelligence
Academic Year 2007-2008
Acknowledgements
This presentation is partially based on the slides for the book:
Data Mining: Concepts and Techniques, 2nd ed.
Jiawei Han and Micheline Kamber
© P. Giorgini, F. Dalpiaz
Two-day outline
Data Mining and KDD
Why Data Mining
Applications of Data Mining
Data Preprocessing
Data Mining techniques
Visualization of the results
Summary
Data Mining and KDD
Looking for knowledge
The Explosive Growth of Data
  The World Wide Web
  Business: e-commerce, transactions, stocks, …
  Science: Remote sensing, bioinformatics, scientific simulation
  Society and everyone: news, digital cameras, YouTube, forums, blogs, Google & Co
We are drowning in data, but starving for knowledge!
Avoid data tombs
“Necessity is the mother of invention”: data mining, the automated analysis of massive data sets.
What is Data Mining?
Data mining (knowledge discovery from data)
  Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
Alternative names
  Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Are simple search engines data mining? Are queries data mining? Are expert systems data mining?
Knowledge Discovery (KDD) Process
[Figure: the KDD process, from bottom to top: Data sources → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Data Mining and Business Intelligence
[Figure: pyramid of layers, listed from top to bottom. The potential to support business decisions increases toward the top; the quantity of data increases toward the bottom.]
  Decision Making
  Data Presentation: Visualization Techniques
  Data Mining: Information Discovery
  Data Exploration: Statistical Summary, Querying, and Reporting
  Data Preprocessing/Integration, Data Warehouses
  Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Roles shown beside the pyramid, from top to bottom: End User, Business Analyst, Data Analyst, DBA
Data Mining: confluence of multiple disciplines
[Figure: Data Mining at the center, surrounded by the disciplines it draws on]
  Database Technology
  Statistics
  Machine Learning
  Pattern Recognition
  Algorithms
  Visualization
  Other Disciplines
Why Data Mining?
Why is Data Mining so complex? A matter of data dimensions
Tremendous amount of data
  Walmart – Customer buying patterns – a data warehouse 7.5 Terabytes large in 1995
  VISA – Detecting credit card interoperability issues – 6800 payment transactions per second
High-dimensionality of data
  Many dimensions to be combined together
  Data cube example: time, location, product → sales
High complexity of data
  Time-series data, temporal data, sequence data
  Structured data, graphs, social networks and multi-linked data
  Spatial, spatiotemporal, multimedia, text and Web data
What does Data Mining provide me with? (1)
Multidimensional concept description: Characterization and discrimination
  Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
  Characterization describes things in the same class, discrimination describes how to separate different classes
Frequent patterns, association, correlation vs. causality
  Wine → Spaghetti [0.3% of all basket cases, 75% of cases when tomato sauce is bought]
  Is this correlation or not?
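The two percentages attached to a rule such as Wine → Spaghetti are typically read as support and confidence. The following is a minimal sketch under that assumption: support is the share of all baskets containing both sides of the rule, confidence is the share of baskets containing the left-hand side that also contain the right-hand side. The baskets and the function name `rule_stats` are invented for illustration and do not come from the slides.

```python
def rule_stats(baskets, lhs, rhs):
    """Return (support, confidence) of the association rule lhs -> rhs."""
    lhs, rhs = set(lhs), set(rhs)
    # Baskets that contain every item of the left-hand side.
    n_lhs = sum(1 for b in baskets if lhs <= set(b))
    # Baskets that contain every item of both sides.
    n_both = sum(1 for b in baskets if (lhs | rhs) <= set(b))
    support = n_both / len(baskets)
    confidence = n_both / n_lhs if n_lhs else 0.0
    return support, confidence


# Hypothetical market baskets.
baskets = [
    {"wine", "spaghetti", "tomato sauce"},
    {"wine", "spaghetti"},
    {"wine", "bread"},
    {"milk", "bread"},
]
support, confidence = rule_stats(baskets, {"wine"}, {"spaghetti"})
print(f"support={support:.2f} confidence={confidence:.2f}")
```

A high confidence only says the items co-occur; it does not say that buying one causes the purchase of the other, which is the point of the question on the slide.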
What does Data Mining provide me with? (2)
Classification and prediction
  Construct models (functions) that describe and distinguish classes or concepts for future prediction
  E.g., classify countries based on climate, or classify cars based on gas mileage
  Predict some unknown or missing numerical values
Cluster analysis
  Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns
  Maximizing intra-class similarity & minimizing interclass similarity
What does Data Mining provide me with? (3)
Outlier analysis
  Outlier: Data object that does not comply with the general behavior of the data
  Fraud detection is the main application area
  Noise or exception?
Trend and evolution analysis
  Trend and deviation: e.g., regression analysis
  Sequential pattern mining: e.g., digital camera → large SD memory
  Periodicity analysis
  Similarity-based analysis
Applications of Data Mining
Market Analysis and Management
Data sources:
  credit card transactions, loyalty cards, smart cards, discount coupons, ...
Target marketing
  Find clusters of “model” customers who share the same characteristics:
    • Geographics (lives in Rome, lives in Trentino)
    • Demographics (married, aged 21-35, at least one child, family income more than 40.000€/year)
    • Psychographics (likes new products, consistently uses the Web)
    • Behaviors (searches for information on the Internet, always defends her decisions)
  Determine customer purchasing patterns over time
Applications of Data Mining
Market Analysis and Management
Cross-market analysis
  Find associations between product sales, and predict based on such associations
  Compare the sales in the US and in Italy, find associations in old products and predict whether new ones will succeed
Customer profiling
  What types of customers buy what products
  Customers aged 20-30 with income > 20K€ will buy product A
Customer requirement analysis
  Identify the best products for different groups of customers
  Predict what factors will attract new customers
Applications of Data Mining
Corporate Analysis
Finance Planning and Asset Evaluation
  Cash flow prediction and analysis
  Cross-sectional and time-series analysis (financial ratio, trend analysis)
Resource Planning
  summarize and compare the resources and spending
Competition
  monitor competitors and market directions
  group customers into classes and a class-based pricing procedure
  set pricing strategy in a highly competitive market
Other examples?
What’s next?
Data Preprocessing
  Why is it needed?
  Data cleaning
  Data integration and transformation
  Data reduction
  Discretization and concept hierarchy
Data Mining techniques
  Frequent patterns, association rules
  Classification and prediction
  Cluster Analysis
  Are you sleeping?
Visualization of the results
Summary
Data Preprocessing
Why Data Preprocessing?
Data in the real world is dirty
  incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    • e.g., occupation=“ ”, birthdate=“31/12/2099”
  noisy: containing errors or outliers
    • e.g., Salary=“-10”
  inconsistent: containing discrepancies in codes or names
    • e.g., Age=“42” Birthday=“03/07/1997” (we are in 2007!!)
    • e.g., Was rating “1,2,3”, now rating “A, B, C”
    • e.g., discrepancy between duplicate records. In one copy of the data customer A has to pay 200.000€, in the second copy of the data A does not have to pay anything.
Why is data dirty?
Incomplete data may come from
  “Not applicable” data value when collected
  Different considerations between the time when the data was collected and when it is analyzed
  Human/hardware/software problems
Noisy data (incorrect values) may come from
  Faulty data collection instruments
  Human or computer error at data entry
  Errors in data transmission
Inconsistent data may come from
  Different data sources
  Functional dependency violation (e.g., modify some linked data)
Why Is Data Preprocessing
Important?
Data Preprocessing
1. Data cleaning – missing values
“Data cleaning is one of the three biggest problems in data warehousing” (Ralph Kimball)
Fill in missing values
  Name=“John”, Occupation=“Lawyer”, Age=“28”, Salary=“”
  Ignore the record (is it always feasible?)
  Manually filling missing attributes
  Automatically insert a constant
  Automatically insert the mean value (relative to the record class)
  Most probable value: make some inference!
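As an illustration of the "mean value relative to the record class" strategy, here is a minimal sketch. It assumes that `Occupation` plays the role of the class and that a missing salary is stored as `None`; the extra records, their salaries, and the function name `fill_with_class_mean` are hypothetical.

```python
from statistics import mean


def fill_with_class_mean(records, attr, class_attr):
    """Return copies of the records where a missing `attr` is replaced
    by the mean of `attr` over records of the same class."""
    known = {}
    for r in records:
        if r[attr] is not None:
            known.setdefault(r[class_attr], []).append(r[attr])
    class_means = {c: mean(values) for c, values in known.items()}

    filled = []
    for r in records:
        r = dict(r)  # copy, so the input records stay untouched
        if r[attr] is None and r[class_attr] in class_means:
            r[attr] = class_means[r[class_attr]]
        filled.append(r)
    return filled


# Hypothetical records; only John's row mirrors the slide.
records = [
    {"Name": "John", "Occupation": "Lawyer", "Age": 28, "Salary": None},
    {"Name": "Anna", "Occupation": "Lawyer", "Age": 35, "Salary": 60000},
    {"Name": "Marc", "Occupation": "Lawyer", "Age": 41, "Salary": 80000},
    {"Name": "Lisa", "Occupation": "Teacher", "Age": 30, "Salary": 30000},
]
filled = fill_with_class_mean(records, "Salary", "Occupation")
print(filled[0]["Salary"])
```

A record whose class has no known values is left unfilled, which is where the other strategies on the slide (constant, manual entry, inference) would take over.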
Data Preprocessing
1. Data cleaning – binning
Handle noisy data
  Binning, clustering, regression (no details here)
Binning
  1. Sort data by price (€): 4, 8, 9, 15, 21, 21, 24, 25, 26
  2. Partition into equal-frequency (equi-depth) bins:
     Bin 1: 4, 8, 9
     Bin 2: 15, 21, 21
     Bin 3: 24, 25, 26
  3. Smoothing by bin means:
     Bin 1: 7, 7, 7
     Bin 2: 19, 19, 19
     Bin 3: 25, 25, 25
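The three binning steps can be sketched in a few lines. This is only an illustration: the function name `smooth_by_bin_means` is made up, and the sketch assumes the number of values is divisible by the number of bins, as in the price example.

```python
def smooth_by_bin_means(values, n_bins):
    """Equal-frequency binning followed by smoothing by bin means."""
    if len(values) % n_bins != 0:
        raise ValueError("this sketch assumes equally sized bins")
    data = sorted(values)                      # step 1: sort
    size = len(data) // n_bins
    bins = [data[i * size:(i + 1) * size]      # step 2: equal-frequency bins
            for i in range(n_bins)]
    smoothed = [[sum(b) / len(b)] * len(b)     # step 3: replace by the bin mean
                for b in bins]
    return bins, smoothed


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26]
bins, smoothed = smooth_by_bin_means(prices, 3)
for b, s in zip(bins, smoothed):
    print(b, "->", s)
```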
Data Preprocessing
1. Data cleaning – clustering
[Figure: clustering example in which outlying points are labeled as noise]
Data Preprocessing
2. Integration and transformation
Data Integration combines data from multiple sources into a coherent store
  [Figure: sources D1, D2, D3 merged into a single store D1,2,3]
Schema integration
  Integrate metadata from different sources
  A.cust-id ≡ B.cust-number
Entity identification problem:
  Identify real world entities from multiple data sources, e.g., Bill Clinton = William Clinton
Detecting and resolving data value conflicts
  For the same real world entity, attribute values from different sources are different (e.g., cm vs. inch)
Data Preprocessing
2. Integration and transformation
Data integration can lead to redundant attributes
  Same object (A.house = B.residence)
  Derived attributes (A.annualIncome = B.salary + C.rentalIncome)
Redundant attributes can be discovered via correlation analysis
  A mathematical method detecting the correlation between two attributes
  Correlation coefficient (Pearson’s product moment coefficient): the closer its absolute value is to 1, the stronger the linear correlation between attributes
  χ² (chi-square) test
  No details on these methods here
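Although the slides skip the details, the Pearson coefficient is short enough to sketch. The example below uses the standard definition (covariance divided by the product of the standard deviations) and invented salary and rental income figures to show how a derived attribute such as annual income correlates almost perfectly with one of its components. The function name `pearson` and all numbers are assumptions made for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)  # undefined if either attribute is constant


# Hypothetical values, in thousands of euros.
salary = [20, 30, 40, 50]
rental_income = [2, 1, 3, 2]
annual_income = [s + r for s, r in zip(salary, rental_income)]

r = pearson(annual_income, salary)
print(round(r, 3))
```

A coefficient this close to 1 suggests that one of the two attributes adds little information; a value near -1 would indicate an equally strong but inverse relationship.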
Data Preprocessing
2. Integration and transformation
Aggregation:
  Sum the sales of different branches (in different data sources) to compute the company sales
Generalization:
  concept hierarchy climbing
  From integer attribute age to classes of age (children, adult, old)
Normalization: scaled to fall within a small, specified range
  Change the range from (-∞, +∞) to [-1, +1]
  {-13, -6, -3, 10, 100} → {-0.13, -0.06, -0.03, 0.1, 1}
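The slide does not name the normalization method. One simple method that reproduces the example is to divide every value by the largest absolute value in the set; the sketch below assumes that choice, and the function name `scale_by_max_abs` is hypothetical.

```python
def scale_by_max_abs(values):
    """Scale values into [-1, +1] by dividing by the largest absolute value."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return [0.0 for _ in values]  # all zeros: nothing to scale
    return [v / max_abs for v in values]


scaled = scale_by_max_abs([-13, -6, -3, 10, 100])
print(scaled)
```

Other common choices, such as min-max or z-score normalization, would map the same input to different values.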
Data Preprocessing
3. Data reduction
Data reduction
  Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
  Different reduction types (dimensions, numerosity, discretization)
Dimensionality: Attribute subset selection
  Example with a decision tree (left branches True, right False)
  Initial attribute set: {A1, A2, A3, A4, A5, A6}
  [Figure: decision tree with root A4?; its two children are A1? and A6?; each of them leads to the leaves Class 1 and Class 2]
  Reduced attribute set: {A1, A4, A6}
Data Preprocessing
3. Data reduction
Dimensionality: Principal Components Analysis
  Given N data vectors in n dimensions, find k ≤ n orthogonal vectors (principal components) that can be best used to represent data
  Works for numeric data only
  Used when the number of dimensions is large
Numerosity: Clustering
  Partition data set into clusters based on similarity, and store cluster representation (e.g., centroid and diameter) only
  [Figure: two panels, one labeled "2 clusters", the other "Sparse data leads to many clusters – not effective"]
Data Preprocessing
3. Data reduction
Numerosity: Sampling
  obtaining a small sample s to represent the whole data set N
  Problem: How to select a representative sampling set
  Random sampling is not enough – representative samples should be preserved
  Stratified sampling: Approximate the percentage of each class (or subpopulation of interest) in the overall database
  [Figure: "Random sampling" and "Stratified sampling" panels; one region is annotated "No samples from here"]
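A minimal sketch of stratified sampling follows, assuming each record carries a class label and that every class should contribute at least one record. The data set, the 10% fraction, and the function name `stratified_sample` are made up for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Draw roughly `fraction` of each class, keeping at least one record per class."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    strata = defaultdict(list)
    for r in records:
        strata[key(r)].append(r)

    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical data set: 90 records of one class, 10 of a much rarer one.
records = [("young", i) for i in range(90)] + [("senior", i) for i in range(10)]
sample = stratified_sample(records, key=lambda r: r[0], fraction=0.1)
print(len(sample), sum(1 for r in sample if r[0] == "senior"))
```

With plain random sampling of 10 records, the rare class would be missed entirely in a noticeable share of draws; stratifying guarantees it is represented in proportion.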
Data Preprocessing
4. Discretization - concept hierarchy
Three types of attributes
  Nominal: values from an unordered set (color, profession)
  Ordinal: values from an ordered set (military or academic rank)
  Continuous: numbers (integer or real numbers)
Discretization
  Divide the range of a continuous attribute into intervals
  Reduces data size and its complexity
  Some data mining algorithms do not support continuous types, and in those cases discretization is mandatory
  Some useful methods:
    Binning, clustering (already presented)
    Entropy-based discretization (no details here)
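To make "divide the range of a continuous attribute into intervals" concrete, here is a minimal sketch using equal-width intervals. This is one simple variant chosen for illustration (the binning slide uses equal-frequency bins instead); the ages and the function name `equal_width_bins` are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Map each continuous value to the index (0 .. n_bins-1) of its equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]  # all values identical: a single interval
    # The maximum value would fall into interval n_bins, so clamp it into the last one.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [3, 12, 25, 38, 47, 61, 75, 90]
labels = equal_width_bins(ages, 3)
print(labels)
```

The interval indices can then be renamed to nominal labels, which is how an integer attribute ends up usable by algorithms that do not support continuous types.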
Data Preprocessing
4. Discretization - concept hierarchy

Concept hierarchy generation


For categorical data
Specification of an ordering between attributes (schema level)
• street < city < state < country

Specification of a hierarchy of values (data level)
• {Urbana, Champaign, Chicago} < Illinois

Automatic generation using the number of distinct values
• For the set of attributes: {street, city, state, country}
• IF: |street| = 600.000, |city|=3.000, |state|=300, |country|=15
• THEN: street < city < state < country
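The automatic generation rule can be sketched as a sort: the attribute with the most distinct values is placed at the lowest level of the hierarchy. The counts are the ones from the example above; the function name `order_by_distinct_values` is hypothetical.

```python
def order_by_distinct_values(distinct_counts):
    """Order attributes from the lowest hierarchy level (most distinct values)
    to the highest level (fewest distinct values)."""
    return sorted(distinct_counts, key=distinct_counts.get, reverse=True)


counts = {"country": 15, "street": 600_000, "state": 300, "city": 3_000}
hierarchy = order_by_distinct_values(counts)
print(" < ".join(hierarchy))
```

The heuristic is only as good as its assumption: an attribute with few distinct values is not always a higher-level concept, so the generated order may need manual review.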
Day 1 Summary
Data Mining and KDD
Why Data Mining
Applications of Data Mining
Data Preprocessing
  Data Cleaning
  Data Integration and Transformation
  Data Reduction
  Discretization and concept hierarchy
Tomorrow?
  Data Mining techniques
  Results visualization
  Summary
Questions?