Part 4
New Database Applications and Technologies
From the mid-to-late 1960s through the end of the 1990s, database technology developed at a speed, and spread to a range of applications, that few other technologies have matched.
Driving forces behind this development:
Evolution of the computing environment
Combination with other related technologies
The push of application demands
...
Directions of Database Technology Development and Research
Parallel databases
Distributed databases
Mobile and embedded databases
Object-oriented databases, object-relational databases
Deductive databases, knowledge bases, active databases
Multimedia databases
Spatial databases
Data warehousing and OLAP, data mining
Information integration
Database technology in the Web environment
...
Chapter 11: Knowledge Discovery in Databases
Slides are basically from:
Data Mining:
Concepts and Techniques
— Slides for Textbook —
Jiawei Han and Micheline Kamber
Simon Fraser University, Canada
http://www.cs.sfu.ca
Motivation: Why knowledge discovery in databases?
Data Warehouse and OLAP
Data mining
Motivation: “Necessity is the Mother
of Invention”
Data explosion problem
Automated data collection tools and mature database
technology lead to tremendous amounts of data
stored in databases, data warehouses and other
information repositories
We are drowning in data, but starving for knowledge!
Solution: Data warehousing and data mining
Data warehousing and on-line analytical processing
Extraction of interesting knowledge (rules, regularities,
patterns, constraints) from data in large databases
Evolution of Database Technology
1960s:
Data collection, database creation, IMS and network
DBMS
1970s:
Relational data model, relational DBMS
implementation
1980s:
RDBMS, advanced data models (extended-relational,
OO, deductive, etc.) and application-oriented DBMS
(spatial, scientific, engineering, etc.)
1990s—2000s:
Data mining and data warehousing, multimedia
databases, and Web databases
Motivation: Why Knowledge discovery in databases?
Data Warehouse and OLAP
What is a data warehouse?
A multi-dimensional data model
Data warehouse architecture
Data warehouse application
Data mining
What Is a Data Warehouse?
Defined in many different ways, but not rigorously.
A decision support database that is maintained
separately from the organization’s operational
database
Support information processing by providing a solid
platform of consolidated, historical data for analysis.
“A data warehouse is a subject-oriented, integrated,
time-variant, and nonvolatile collection of data in support
of management’s decision-making process.”—W. H.
Inmon
Data warehousing:
The process of constructing and using data
warehouses
Data Warehouse—Subject-Oriented
Organized around major subjects, such as customer,
product, sales.
Focusing on the modeling and analysis of data for
decision makers, not on daily operations or transaction
processing.
Provide a simple and concise view around particular
subject issues by excluding data that are not useful in
the decision support process.
Data Warehouse—Integrated
Constructed by integrating multiple, heterogeneous
data sources
relational databases, flat files, on-line transaction
records
Data cleaning and data integration techniques are
applied.
Ensure consistency in naming conventions, encoding
structures, attribute measures, etc. among different
data sources
E.g., Hotel price: currency, tax, breakfast covered, etc.
When data is moved to the warehouse, it is
converted.
Data Warehouse—Time Variant
The time horizon for the data warehouse is significantly
longer than that of operational systems.
Operational database: current value data.
Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)
Every key structure in the data warehouse
Contains an element of time, explicitly or implicitly
But the key of operational data may or may not
contain “time element”.
Data Warehouse—Non-Volatile
A physically separate store of data transformed from the
operational environment.
Operational update of data does not occur in the data
warehouse environment.
Does not require transaction processing, recovery,
and concurrency control mechanisms
Requires only two operations in data accessing:
initial loading of data and access of data.
Data Warehouse vs. Heterogeneous
DBMS
Traditional heterogeneous DB integration:
Build wrappers/mediators on top of heterogeneous
databases
Query-driven approach:
When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set.
Complex information filtering; competition for resources.
Data warehouse: update-driven, high performance.
Data Warehouse vs. Operational DBMS
OLTP (on-line transaction processing)
Major task of traditional relational DBMS
Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.
OLAP (on-line analytical processing)
Major task of data warehouse system
Data analysis and decision making
Distinct features (OLTP vs. OLAP):
User and system orientation: customer vs. market
Data contents: current, detailed vs. historical, consolidated
Database design: ER + application vs. star + subject
View: current, local vs. evolutionary, integrated
Access patterns: update vs. read-only but complex queries
OLTP vs. OLAP

Feature            | OLTP                                               | OLAP
users              | clerk, IT professional                             | knowledge worker
function           | day-to-day operations                              | decision support
DB design          | application-oriented                               | subject-oriented
data               | current, up-to-date, detailed, flat relational, isolated | historical, summarized, multidimensional, integrated, consolidated
usage              | repetitive                                         | ad-hoc
access             | read/write, index/hash on prim. key                | lots of scans
unit of work       | short, simple transaction                          | complex query
# records accessed | tens                                               | millions
# users            | thousands                                          | hundreds
DB size            | 100MB-GB                                           | 100GB-TB
metric             | transaction throughput                             | query throughput, response
Why Separate Data Warehouse?
High performance for both systems
DBMS— tuned for OLTP: access methods, indexing,
concurrency control, recovery
Warehouse—tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation.
Different functions and different data:
missing data: Decision support requires historical data
which operational DBs do not typically maintain
data consolidation: DS requires consolidation
(aggregation, summarization) of data from
heterogeneous sources
data quality: different sources typically use inconsistent
data representations, codes and formats which have to
be reconciled
Motivation: Why Knowledge discovery in databases?
Data Warehouse and OLAP
What is a data warehouse?
A multi-dimensional data model
Data warehouse architecture
Data warehouse application
Data mining
Multidimensional Data Model
A data warehouse is based on a multidimensional data model which views data in the form of a data cube
Measure attributes: attributes that measure some value and can be aggregated upon, such as dollars_sold
Dimension attributes: attributes on which measure attributes and summaries of measure attributes are viewed, such as item (item_name, brand, type) or time (day, week, month, quarter, year)
Data cube: a multidimensional view of data, such as sales
A concept hierarchy: dimension (location)
[Figure: a tree with levels, from top to bottom, all > region > country > city > office; e.g., all → {Europe, North_America}; Europe → {Germany, Spain, ...}, Germany → {Frankfurt, ...}; North_America → {Canada, Mexico, ...}, Canada → {Vancouver, Toronto, ...}, with offices such as L. Chan and M. Young at the leaves]
Multidimensional Data
Sales volume as a function of product, month, and region
Dimensions: Product, Location, Time
Hierarchical summarization paths:
Product:  Industry → Category → Product
Location: Region → Country → City → Office
Time:     Year → Quarter → Month or Week → Day
A Sample Data Cube
[Figure: a 3-D cube of sales with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), and Country (U.S.A, Canada, Mexico, sum); e.g., the cell (TV, sum over Date, U.S.A) holds the total annual sales of TVs in the U.S.A.]
Cuboids and Data Cubes
In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
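As a side note not on the original slide, the size of this lattice has a simple closed form (a standard result; L_i below denotes the number of levels in the concept hierarchy of dimension i, not counting the virtual top level all):

```latex
% Total number of cuboids in an n-dimensional data cube:
% 2^n when dimensions have no concept hierarchies; in general,
T = \prod_{i=1}^{n} (L_i + 1)
```

For the 3-D cube on the next slide (product, date, country, no hierarchies) this gives 2^3 = 8 cuboids: one apex, three 1-D, three 2-D, and one base cuboid.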
Cuboids Corresponding to the Cube
0-D (apex) cuboid: all
1-D cuboids: product; date; country
2-D cuboids: product, date; product, country; date, country
3-D (base) cuboid: product, date, country
More Examples of Cuboids and Data Cubes
0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time, item; time, location; time, supplier; item, location; item, supplier; location, supplier
3-D cuboids: time, item, location; time, location, supplier; time, item, supplier; item, location, supplier
4-D (base) cuboid: time, item, location, supplier
Conceptual Modeling of Data Warehouses
Modeling data warehouses: dimensions & measures
Star schema: a fact table in the middle connected to a set of dimension tables
Snowflake schema: a refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
Fact constellations: multiple fact tables share dimension tables; viewed as a collection of stars, and therefore also called a galaxy schema
Example of Star Schema
time dimension:     time_key, day, day_of_the_week, month, quarter, year
item dimension:     item_key, item_name, brand, type, supplier_type
branch dimension:   branch_key, branch_name, branch_type
location dimension: location_key, street, city, province_or_state, country
Sales Fact Table:   time_key, item_key, branch_key, location_key (foreign keys into the dimensions)
Measures:           units_sold, dollars_sold, avg_sales
Example of Snowflake Schema
time dimension:     time_key, day, day_of_the_week, month, quarter, year
item dimension:     item_key, item_name, brand, type, supplier_key
supplier dimension: supplier_key, supplier_type
branch dimension:   branch_key, branch_name, branch_type
location dimension: location_key, street, city_key
city dimension:     city_key, city, province_or_state, country
Sales Fact Table:   time_key, item_key, branch_key, location_key (foreign keys)
Measures:           units_sold, dollars_sold, avg_sales
Example of Fact Constellation
time dimension:      time_key, day, day_of_the_week, month, quarter, year
item dimension:      item_key, item_name, brand, type, supplier_type
branch dimension:    branch_key, branch_name, branch_type
location dimension:  location_key, street, city, province_or_state, country
shipper dimension:   shipper_key, shipper_name, location_key, shipper_type
Sales Fact Table:    time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped
(the two fact tables share the time, item, and location dimension tables)
Typical OLAP Operations
Roll up (drill-up): summarize data
by climbing up a hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up
from higher-level summary to lower-level summary or detailed data, or introducing new dimensions
Slice and dice:
project and select
Pivot (rotate):
reorient the cube, visualization, 3D to series of 2D planes
Other operations
drill across: involving (across) more than one fact table
drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
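A minimal sketch of roll-up, slice, and pivot in code, assuming pandas is available; the toy sales table and all column names are illustrative, not from the slides.

```python
# A toy sales "cube" as a flat pandas table; every name here is illustrative.
import pandas as pd

sales = pd.DataFrame({
    "product":      ["TV", "TV", "PC", "PC", "VCR", "VCR"],
    "country":      ["U.S.A", "Canada", "U.S.A", "Mexico", "Canada", "U.S.A"],
    "quarter":      ["1Qtr", "1Qtr", "2Qtr", "2Qtr", "3Qtr", "4Qtr"],
    "dollars_sold": [120, 80, 200, 90, 60, 70],
})

# Roll up by dimension reduction: drop the country dimension and
# aggregate dollars_sold over (product, quarter) only.
rollup = sales.groupby(["product", "quarter"])["dollars_sold"].sum()

# Slice: select a single value on one dimension (quarter = "2Qtr").
slice_2qtr = sales[sales["quarter"] == "2Qtr"]

# Pivot (rotate): reorient so products are rows and quarters are columns.
pivot = sales.pivot_table(index="product", columns="quarter",
                          values="dollars_sold", aggfunc="sum")

print(rollup, slice_2qtr, pivot, sep="\n\n")
```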
Motivation: Why Knowledge discovery in databases?
Data Warehouse and OLAP
What is a data warehouse?
A multi-dimensional data model
Data warehouse architecture
Data warehouse application
Data mining
The Process of Data Warehouse Design
Top-down, bottom-up approaches, or a combination of both
Top-down: starts with overall design and planning (mature)
Bottom-up: starts with experiments and prototypes (rapid)
From a software engineering point of view
Waterfall: structured and systematic analysis at each step before proceeding to the next
Spiral: rapid generation of increasingly functional systems, with short turnaround time
Multi-Tiered Architecture
[Figure: Data Sources (operational DBs and other sources) feed an Extract/Transform/Load/Refresh layer with a Monitor & Integrator and Metadata; this populates the Data Storage tier (Data Warehouse and Data Marts), which the OLAP Engine tier (OLAP Server) serves to Front-End Tools for analysis, query, reports, and data mining]
Three Data Warehouse Models
Enterprise warehouse
collects all of the information about subjects spanning the entire organization
Data mart
a subset of corporate-wide data that is of value to specific groups of users; its scope is confined to specific, selected groups, such as a marketing data mart
Independent vs. dependent (fed directly from the warehouse) data marts
Virtual warehouse
a set of views over operational databases
only some of the possible summary views may be materialized
Data Warehouse Development: A Recommended Approach
[Figure: define a high-level corporate data model first; build distributed Data Marts, refining the model; consolidate into an Enterprise Data Warehouse, refining the model again; the end point is a Multi-Tier Data Warehouse]
OLAP Server Architectures
Relational OLAP (ROLAP)
Uses a relational or extended-relational DBMS to store and manage warehouse data, plus OLAP middleware to support the missing pieces
Includes optimization of the DBMS back end, implementation of aggregation navigation logic, and additional tools and services
Greater scalability
Multidimensional OLAP (MOLAP)
Array-based multidimensional storage engine (sparse matrix techniques)
Fast indexing to pre-computed summarized data
Hybrid OLAP (HOLAP)
User flexibility, e.g., low level: relational, high level: array
Specialized SQL servers
Specialized support for SQL queries over star/snowflake schemas
Motivation: Why Knowledge discovery in databases?
Data Warehouse and OLAP
What is a data warehouse?
A multi-dimensional data model
Data warehouse architecture
Data warehouse application
Data mining
Data Warehouse Usage
Three kinds of data warehouse applications
Information processing
supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs
Analytical processing
multidimensional analysis of data warehouse data
supports basic OLAP operations: slice-dice, drilling, pivoting
Data mining
knowledge discovery from hidden patterns
supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools
Differences among the three tasks
From On-Line Analytical Processing to On-Line Analytical Mining
Why online analytical mining?
High quality of data in data warehouses
DW contains integrated, consistent, cleaned data
Available information processing structure surrounding data warehouses
ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools
OLAP-based exploratory data analysis
mining with drilling, dicing, pivoting, etc.
On-line selection of data mining functions
integration and swapping of multiple mining functions, algorithms, and tasks
Architecture of OLAM
An OLAM Architecture
[Figure: four layers. Layer 4, User Interface: user GUI API, where mining queries enter and mining results return. Layer 3, OLAP/OLAM: the OLAM Engine and OLAP Engine sitting on a Data Cube API. Layer 2, MDDB: the multidimensional database plus Meta Data. Layer 1, Data Repository: Databases and the Data Warehouse, populated through data cleaning, data integration, and filtering & integration over a Database API]
Motivation: Why Knowledge discovery in databases?
Data Warehouse and OLAP
Data mining
Introduction to data mining
Characterization
Association rule mining
Classification
Cluster Analysis
What Is Data Mining?
Data mining (knowledge discovery in databases):
Extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful)
information or patterns from data in large
databases
Alternative names and their “inside stories”:
Data mining: a misnomer?
Knowledge discovery(mining) in databases
(KDD), knowledge extraction, data/pattern
analysis, data archaeology, data dredging,
information harvesting, business intelligence, etc.
Why Data Mining? — Potential Applications
Database analysis and decision support
Market analysis and management
target marketing, customer relation management, market basket analysis, cross selling, market segmentation
Risk analysis and management
forecasting, customer retention, improved underwriting, quality control, competitive analysis
Fraud detection and management
Other Applications
Text mining (news group, email, documents) and Web analysis.
Intelligent query answering
Data Mining: A KDD Process
Data mining: the core of the knowledge discovery process.
[Figure: Databases → data cleaning and data integration → Data Warehouse → selection → task-relevant data → Data Mining → pattern evaluation → knowledge]
Architecture of a Typical Data Mining System
[Figure, top to bottom: graphical user interface → pattern evaluation → data mining engine, coupled to a knowledge-base → database or data warehouse server → data cleaning, data integration, and filtering over the underlying Databases and Data Warehouse]
Data Mining: On What Kind of Data?
Relational databases
Data warehouses
Transactional databases
Advanced DB and information repositories
Object-oriented and object-relational databases
Spatial databases
Time-series data and temporal data
Text databases and multimedia databases
Heterogeneous and legacy databases
WWW
Data Mining Functionalities (1)
Concept description: characterization and discrimination
Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
Association
contains(x, “computer”) ⇒ contains(x, “software”) [support = 1%, confidence = 75%]
age(X, “20..29”) ^ income(X, “20..29K”) ⇒ buys(X, “PC”) [2%, 60%]
Multi-dimensional vs. single-dimensional association
Data Mining Functionalities (2)
Classification and Prediction
Finding models (functions) that describe and distinguish classes
or concepts for future prediction
E.g., classify countries based on climate, or classify cars based
on gas mileage
Presentation: decision-tree, classification rule, neural network
Prediction: Predict some unknown or missing numerical values
Cluster analysis
Class label is unknown: Group data to form new classes, e.g.,
cluster houses to find distribution patterns
Clustering based on the principle: maximizing the intra-class
similarity and minimizing the interclass similarity
Data Mining Functionalities (3)
Outlier analysis
Outlier: a data object that does not comply with the general
behavior of the data
It can be considered as noise or exception but is quite useful in
fraud detection, rare events analysis
Trend and evolution analysis
Trend and deviation: regression analysis
Sequential pattern mining, periodicity analysis
Similarity-based analysis
Other pattern-directed or statistical analyses
Are All the “Discovered” Patterns
Interesting?
A data mining system/query may generate thousands of patterns,
not all of them are interesting.
Suggested approach: Human-centered, query-based, focused
mining
Interestingness measures: A pattern is interesting if it is easily
understood by humans, valid on new or test data with some degree
of certainty, potentially useful, novel, or validates some hypothesis
that a user seeks to confirm
Objective vs. subjective interestingness measures:
Objective: based on statistics and structures of patterns, e.g.,
support, confidence, etc.
Subjective: based on user’s belief in the data, e.g.,
unexpectedness, novelty, actionability, etc.
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the intersection of Database Technology, Machine Learning, Statistics, Information Science, Visualization, and Other Disciplines]
Motivation: Why Knowledge discovery in databases?
Data Warehouse and OLAP
Data mining
Introduction to data mining
Characterization
Association rule mining
Classification
Cluster Analysis
Data Generalization and Summarization-Based Characterization
Data generalization
A process that abstracts a large set of task-relevant data in a database from low conceptual levels to higher ones.
[Figure: data at conceptual levels 1 through 5, with generalization moving from lower to higher levels]
Approaches:
Data cube approach (OLAP approach)
Attribute-oriented induction approach
Characterization: Data Cube Approach (without using AO-Induction)
Perform computations and store results in data cubes
Strengths
An efficient implementation of data generalization
Computation of various kinds of measures, e.g., count(), sum(), average(), max()
Generalization and specialization can be performed on a data cube by roll-up and drill-down
Limitations
Handles only dimensions of simple nonnumeric data and measures of simple aggregated numeric values
Lack of intelligent analysis: cannot tell which dimensions should be used and what levels the generalization should reach
Attribute-Oriented Induction
Proposed in 1989 (KDD ’89 workshop)
Not confined to categorical data nor particular measures
How is it done?
Collect the task-relevant data (initial relation) using a relational database query
Perform generalization by attribute removal or attribute generalization
Apply aggregation by merging identical, generalized tuples and accumulating their respective counts
Interactive presentation with users
Basic Principles of Attribute-Oriented Induction
Data focusing: task-relevant data, including dimensions, and the result is the initial relation
Attribute removal: remove attribute A if there is a large set of distinct values for A but (1) there is no generalization operator on A, or (2) A’s higher-level concepts are expressed in terms of other attributes
Attribute generalization: if there is a large set of distinct values for A, and there exists a set of generalization operators on A, then select an operator and generalize A
Attribute-threshold control: typically 2-8, specified or default
Generalized relation threshold control: controls the final relation/rule size (see example)
Class Characterization: An Example

Initial relation:
Name           | Gender | Major   | Birth-Place           | Birth_date | Residence                | Phone #  | GPA
Jim Woodman    | M      | CS      | Vancouver, BC, Canada | 8-12-76    | 3511 Main St., Richmond  | 687-4598 | 3.67
Scott Lachance | M      | CS      | Montreal, Que, Canada | 28-7-75    | 345 1st Ave., Richmond   | 253-9106 | 3.70
Laura Lee      | F      | Physics | Seattle, WA, USA      | 25-8-70    | 125 Austin Ave., Burnaby | 420-5232 | 3.83
…              | …      | …       | …                     | …          | …                        | …        | …

Generalization applied per attribute: Name removed; Gender retained; Major generalized to {Sci, Eng, Bus}; Birth-Place generalized via City to Country, giving Birth_region {Canada, Foreign}; Birth_date generalized to Age_range; Residence generalized to City; Phone # removed; GPA generalized to grades {Excl, VG, ...}.

Prime generalized relation:
Gender | Major   | Birth_region | Age_range | Residence | GPA       | Count
M      | Science | Canada       | 20-25     | Richmond  | Very-good | 16
F      | Science | Foreign      | 25-30     | Burnaby   | Excellent | 22
…      | …       | …            | …         | …         | …         | …

Crosstab of count by Gender and Birth_Region:
Gender | Canada | Foreign | Total
M      | 16     | 14      | 30
F      | 10     | 22      | 32
Total  | 26     | 36      | 62
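A minimal sketch of the generalization steps on the initial relation above, keeping only a few attributes for brevity; the concept hierarchies (major_to_field, birth_region, gpa_level) are illustrative hand-written stand-ins, not from the slides.

```python
# Initial relation from the slide (subset of attributes).
from collections import Counter

initial_relation = [
    # (name, gender, major, birth_place, gpa)
    ("Jim Woodman",    "M", "CS",      "Vancouver, BC, Canada", 3.67),
    ("Scott Lachance", "M", "CS",      "Montreal, Que, Canada", 3.70),
    ("Laura Lee",      "F", "Physics", "Seattle, WA, USA",      3.83),
]

# Hand-written concept hierarchies / generalization operators (illustrative).
major_to_field = {"CS": "Science", "Physics": "Science"}   # Major -> Sci/Eng/Bus

def birth_region(place):        # Birth-Place -> City -> Country -> region
    return "Canada" if place.endswith("Canada") else "Foreign"

def gpa_level(gpa):             # numeric GPA -> {Very-good, Excellent, ...}
    return "Excellent" if gpa >= 3.75 else "Very-good"

# Attribute removal (name dropped) plus attribute generalization, then
# merge identical generalized tuples and accumulate their counts.
prime_relation = Counter(
    (gender, major_to_field[major], birth_region(place), gpa_level(gpa))
    for (_, gender, major, place, gpa) in initial_relation
)
for tup, count in prime_relation.items():
    print(tup, "count =", count)
```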
Presentation of Generalized Results
Generalized relation:
relations where some or all attributes are generalized, with counts or other aggregation values accumulated
Cross tabulation:
mapping results into cross-tabulation form (similar to contingency tables)
Visualization techniques:
pie charts, bar charts, curves, cubes, and other visual forms
Quantitative characteristic rules:
mapping generalized results into characteristic rules with associated quantitative information, e.g.,
grad(x) ^ male(x) ⇒ birth_region(x) = “Canada” [t: 53%] ∨ birth_region(x) = “foreign” [t: 47%]
Presentation—Generalized Relation
Presentation—Crosstab
Motivation: Why Knowledge discovery in databases?
Data Warehouse and OLAP
Data mining
Introduction to data mining
Characterization
Association rule mining
Classification
Cluster Analysis
What Is Association Rule Mining?
Association rule mining:
Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories
Applications:
Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
Examples:
Rule form: “antecedent ⇒ consequent [support, confidence]”
buys(x, “diapers”) ⇒ buys(x, “beers”) [0.5%, 60%]
major(x, “CS”) ^ takes(x, “DB”) ⇒ grade(x, “A”) [1%, 75%]
Association Rules
Given
A database of customer transactions
Each transaction is a list of items (purchases by a customer in a visit)
Find all rules that correlate the presence of one set of items with another set of items
Example: 98% of people who purchase tires and auto accessories also get automotive services done
Any number of items in the consequent/antecedent of a rule
Possible to specify constraints on rules (e.g., find only rules involving Home Laundry Appliances)
Characteristics
Number of items: Hundreds of thousands
Number of transactions: Millions
Data set sizes: High Gigabytes
Discovering all rules rather than verifying if a rule
holds
Performance
Scalability
Association Rules -- Problem Statement
I = {i1, i2, …, im}: a set of literals, called items.
Transaction T: a set of items such that T ⊆ I.
Database D: a set of transactions.
A transaction T contains X, a set of some items, if X ⊆ T.
An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅.
Association Rules -- Confidence and Support
Confidence of rule X ⇒ Y: the percentage of transactions in D containing X that also contain Y.
c = (number of transactions containing both X and Y) / (number of transactions containing X)
Support of rule X ⇒ Y: the percentage of transactions in D that contain both X and Y.
s = (number of transactions containing both X and Y) / (total number of transactions in D)
Goal: discover all rules that have minimum user-specified confidence and support.
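A minimal sketch of these two formulas in code; the four-transaction database is illustrative.

```python
# Transactions as Python sets; the database is illustrative.
D = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"diapers", "bread", "beer"},
]

def support(itemset, D):
    # Fraction of transactions that contain every item in `itemset`.
    return sum(itemset <= t for t in D) / len(D)

def confidence(X, Y, D):
    # Among transactions containing X, the fraction also containing Y.
    return sum((X | Y) <= t for t in D) / sum(X <= t for t in D)

X, Y = {"diapers"}, {"beer"}
print(f"support(X => Y)    = {support(X | Y, D):.0%}")    # 3/4 = 75%
print(f"confidence(X => Y) = {confidence(X, Y, D):.0%}")  # 3/3 = 100%
```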
Mining Association Rules
Find the frequent itemsets: the sets of items that have minimum support
A subset of a frequent itemset must also be a frequent itemset; i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
Use the frequent itemsets to generate the desired rules
Mining Association Rules -- Example
Transaction ID | Items Bought
2000           | A, B, C
1000           | A, C
4000           | A, D
5000           | B, E, F
Min. support 50%, min. confidence 50%:
Frequent Itemset | Support
{A}              | 75%
{B}              | 50%
{C}              | 50%
{A, C}           | 50%
For rule A ⇒ C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%
The Apriori Algorithm — Example
Database D (min. support count = 2):
TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5
Scan D for counts of candidate 1-itemsets, C1: {1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3
Prune candidates below min. support, giving L1: {1}: 2, {2}: 3, {3}: 3, {5}: 3
Generate candidate 2-itemsets, C2: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D: {1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2
Prune, giving L2: {1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2
Generate candidate 3-itemsets, C3: {2 3 5}
Scan D, giving L3: {2 3 5}: 2
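A minimal sketch of the level-wise Apriori search on this database (a sketch only, without the algorithm's usual optimizations); it reproduces L1 through L3 above.

```python
# The database from the example; min support count = 2 of 4 transactions.
from itertools import combinations

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
MIN_SUP = 2

def frequent(candidates):
    # Scan D once, count each candidate, and prune below minimum support.
    counts = {c: sum(c <= t for t in D) for c in candidates}
    return {c: n for c, n in counts.items() if n >= MIN_SUP}

L = frequent({frozenset([i]) for t in D for i in t})   # L1
k, levels = 1, [L]
while L:
    items = set().union(*L)
    # Join step + prune step: keep (k+1)-candidates all of whose
    # k-subsets are frequent (the Apriori property).
    Ck = {frozenset(c) for c in combinations(sorted(items), k + 1)
          if all(frozenset(s) in L for s in combinations(c, k))}
    L = frequent(Ck)
    if L:
        levels.append(L)
    k += 1

for level in levels:
    print({tuple(sorted(c)): n for c, n in level.items()})
# Prints L1 = {1:2, 2:3, 3:3, 5:3}, L2 = {13, 23, 25, 35}, L3 = {(2,3,5): 2}
```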
Using frequent itemsets to generate rules
Association rules can be generated as follows:
For each frequent itemset l, generate all nonempty proper subsets of l.
For each nonempty subset s of l, output the rule “s ⇒ (l − s)” if support_count(l) / support_count(s) >= min_conf, where min_conf is the minimum confidence threshold.
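Continuing the sketch above (reusing its `levels` dictionary), a minimal rule-generation pass with min_conf = 50%:

```python
# Reuses `levels` from the Apriori sketch above; min confidence = 50%.
from itertools import combinations

MIN_CONF = 0.5
support_count = {c: n for level in levels for c, n in level.items()}

for l, n in support_count.items():
    if len(l) < 2:
        continue
    for r in range(1, len(l)):            # every nonempty proper subset s of l
        for subset in combinations(sorted(l), r):
            s = frozenset(subset)
            conf = n / support_count[s]   # support_count(l) / support_count(s)
            if conf >= MIN_CONF:
                print(f"{set(s)} => {set(l - s)}  [conf = {conf:.0%}]")
```

Every nonempty subset of a frequent itemset is itself frequent, so the lookup of support_count[s] always succeeds.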
Summary
Association rule mining
probably the most significant contribution from the
database community in KDD
A large number of papers have been published
Many interesting issues have been explored
An interesting research direction
Association analysis in other types of data: spatial
data, multimedia data, time series data, etc.
Motivation: Why Knowledge discovery in databases?
Data Warehouse and OLAP
Data mining
Introduction to data mining
Characterization
Association rule mining
Classification
Cluster Analysis
Classification
Classification:
predicts categorical class labels
classifies data (constructs a model) based on the
training set and the values (class labels) in a
classifying attribute and uses it in classifying new data
Typical Applications
credit approval
target marketing
medical diagnosis
treatment effectiveness analysis
Classification—A Two-Step Process
Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class,
as determined by the class label attribute
The set of tuples used for model construction: training set
The model is represented as classification rules, decision trees,
or mathematical formulae
Model usage: for classifying future or unknown objects
Estimate accuracy of the model
The known label of test sample is compared with the
classified result from the model
Accuracy rate is the percentage of test set samples that are
correctly classified by the model
Test set is independent of training set, otherwise over-fitting
will occur
Classification Process (1): Model Construction
Training Data:
NAME | RANK           | YEARS | TENURED
Mike | Assistant Prof | 3     | no
Mary | Assistant Prof | 7     | yes
Bill | Professor      | 2     | yes
Jim  | Associate Prof | 7     | yes
Dave | Assistant Prof | 6     | no
Anne | Associate Prof | 3     | no
The classification algorithms produce the classifier (model):
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Classification Process (2): Use the Model in Prediction
Testing Data:
NAME    | RANK           | YEARS | TENURED
Tom     | Assistant Prof | 2     | no
Merlisa | Associate Prof | 7     | no
George  | Professor      | 5     | yes
Joseph  | Assistant Prof | 7     | yes
Unseen data: (Jeff, Professor, 4) → Tenured?
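A minimal sketch of this model-usage step, applying the rule learned in step 1 to the testing data above and measuring accuracy:

```python
# Testing data from the slide above: (name, rank, years, true label).
test_set = [
    ("Tom",     "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George",  "Professor",      5, "yes"),
    ("Joseph",  "Assistant Prof", 7, "yes"),
]

def classify(rank, years):
    # The classifier learned in step 1: IF rank = 'professor' OR years > 6.
    return "yes" if rank == "Professor" or years > 6 else "no"

correct = sum(classify(rank, years) == label
              for _, rank, years, label in test_set)
print(f"accuracy = {correct}/{len(test_set)} = {correct / len(test_set):.0%}")
# Merlisa (Associate Prof, 7 years) is predicted 'yes' but labeled 'no' -> 75%.

print("Jeff ->", classify("Professor", 4))   # the unseen sample: 'yes'
```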
Supervised vs. Unsupervised
Learning
Supervised learning (classification)
Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of training data are unknown
Given a set of measurements, observations, etc. with
the aim of establishing the existence of classes or
clusters in the data
Classification by Decision Tree
Induction
Decision tree
A flow-chart-like tree structure
Internal node denotes a test on an attribute
Branch represents an outcome of the test
Leaf nodes represent class labels or class distribution
Decision tree generation consists of two phases
Tree construction
At start, all the training examples are at the root
Partition examples recursively based on selected attributes
Tree pruning
Identify and remove branches that reflect noise or outliers
Use of decision tree: Classifying an unknown sample
Test the attribute values of the sample against the decision tree
Training Dataset
This follows an example from Quinlan’s ID3:
age    | income | student | credit_rating | buys_computer
<=30   | high   | no      | fair          | no
<=30   | high   | no      | excellent     | no
31..40 | high   | no      | fair          | yes
>40    | medium | no      | fair          | yes
>40    | low    | yes     | fair          | yes
>40    | low    | yes     | excellent     | no
31..40 | low    | yes     | excellent     | yes
<=30   | medium | no      | fair          | no
<=30   | low    | yes     | fair          | yes
>40    | medium | yes     | fair          | yes
<=30   | medium | yes     | excellent     | yes
31..40 | medium | no      | excellent     | yes
31..40 | high   | yes     | fair          | yes
>40    | medium | no      | excellent     | no
Output: A Decision Tree for “buys_computer”
age?
  <=30   → student?
             no  → no
             yes → yes
  31..40 → yes
  >40    → credit_rating?
             excellent → no
             fair      → yes
Another Example
An example from Quinlan:
Outlook  | Temperature | Humidity | Windy | Class
sunny    | hot         | high     | false | N
sunny    | hot         | high     | true  | N
overcast | hot         | high     | false | P
rain     | mild        | high     | false | P
rain     | cool        | normal   | false | P
rain     | cool        | normal   | true  | N
overcast | cool        | normal   | true  | P
sunny    | mild        | high     | false | N
sunny    | cool        | normal   | false | P
rain     | mild        | normal   | false | P
sunny    | mild        | normal   | true  | P
overcast | mild        | high     | true  | P
overcast | hot         | normal   | false | P
rain     | mild        | high     | true  | N
A Sample Decision Tree
Outlook?
  sunny    → Humidity?  (high → N, normal → P)
  overcast → P
  rain     → Windy?     (true → N, false → P)
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive divide-and-conquer
manner
At start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are
discretized in advance)
Examples are partitioned recursively based on selected attributes
Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
Conditions for stopping partitioning
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
There are no samples left
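A minimal sketch of the information-gain heuristic on the buys_computer training set shown earlier; entropy is computed in bits (log base 2).

```python
# The buys_computer training data; last field of each tuple is the class.
from collections import Counter
from math import log2

data = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]
ATTRS = ["age", "income", "student", "credit_rating"]

def entropy(rows):
    # Expected information of the class distribution in `rows`, in bits.
    n = len(rows)
    return -sum((c / n) * log2(c / n)
                for c in Counter(r[-1] for r in rows).values())

def gain(rows, i):
    # Information gain of splitting `rows` on attribute index i.
    n, parts = len(rows), Counter(r[i] for r in rows)
    remainder = sum((cnt / n) * entropy([r for r in rows if r[i] == v])
                    for v, cnt in parts.items())
    return entropy(rows) - remainder

for i, a in enumerate(ATTRS):
    print(f"Gain({a}) = {gain(data, i):.3f}")
# age has the highest gain (~0.246), so it becomes the root test of the tree.
```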
Summary
Classification is an extensively studied problem (mainly in
statistics, machine learning & neural networks)
Classification is probably one of the most widely used
data mining techniques with a lot of extensions
Scalability is still an important issue for database
applications: thus combining classification with database
techniques should be a promising topic
Research directions: classification of non-relational data,
e.g., text, spatial, multimedia, etc.
Motivation: Why Knowledge discovery in databases?
Data Warehouse and OLAP
Data mining
Introduction to data mining
Characterization
Association rule mining
Classification
Cluster Analysis
Examples of Clustering Applications
Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
Land use: Identification of areas of similar land use in an
earth observation database
Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
City-planning: Identifying groups of houses according to
their house type, value, and geographical location
Earthquake studies: observed earthquake epicenters should be clustered along continent faults
What Is Good Clustering?
A good clustering method will produce high quality
clusters with
high intra-class similarity
low inter-class similarity
The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation.
The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns.
Major Clustering Approaches
Partitioning algorithms: Construct various partitions and
then evaluate them by some criterion
Hierarchy algorithms: Create a hierarchical decomposition
of the set of data (or objects) using some criterion
Density-based: based on connectivity and density functions
Grid-based: based on a multiple-level granularity structure
Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the given model
The K-Means Clustering Method
Example
[Figure: four scatter plots of points on a 10×10 grid showing k-means iterations: pick k initial centers, assign each point to the nearest center, recompute each center as the mean of its cluster, and repeat until assignments stabilize]
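A minimal numpy sketch of the loop these plots depict; the random 2-D points and k = 2 are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(40, 2))   # toy 2-D points like the plots above
k = 2
centers = X[rng.choice(len(X), k, replace=False)]  # k initial centers

for _ in range(100):
    # Assignment step: label each point with the index of its nearest center.
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    labels = dists.argmin(axis=1)
    # Update step: move each center to the mean of its assigned points
    # (keep the old center if a cluster happens to become empty).
    new_centers = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                            else centers[j] for j in range(k)])
    if np.allclose(new_centers, centers):  # converged: assignments are stable
        break
    centers = new_centers

print("final cluster centers:\n", centers)
```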
Hierarchical Clustering
Use distance matrix as clustering criteria. This method
does not require the number of clusters k as an input,
but needs a termination condition
[Figure: agglomerative clustering (AGNES) proceeds left to right, step 0 to step 4: a, b, c, d, e → ab, c, de → ab, cde → abcde; divisive clustering (DIANA) runs the same sequence in reverse, step 4 down to step 0]
AGNES (Agglomerative Nesting)
Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical analysis packages, e.g., Splus
Use the Single-Link method and the dissimilarity matrix.
Merge nodes that have the least dissimilarity
Go on in a non-descending fashion
Eventually all nodes belong to the same cluster
[Figure: three scatter plots on a 10×10 grid showing clusters being merged step by step]
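A minimal sketch of single-link agglomerative clustering in the AGNES style, assuming scipy is available; the two synthetic blobs are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(2, 0.5, (10, 2)),   # two loose blobs, as in the plots
               rng.normal(7, 0.5, (10, 2))])

# Single-link: repeatedly merge the pair of clusters with least dissimilarity.
Z = linkage(X, method="single")

# Cut the resulting merge tree into 2 flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```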
DIANA (Divisive Analysis)
Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical analysis packages, e.g., Splus
Inverse order of AGNES
Eventually each node forms a cluster on its own
[Figure: three scatter plots on a 10×10 grid showing a single cluster being split step by step]