Related Concepts Outline
Goal: Examine some areas which are related to
data mining.
Database/OLTP Systems
Fuzzy Sets and Logic
Information Retrieval(Web Search Engines)
Dimensional Modeling
Data Warehousing
OLAP/DSS
Statistics
Machine Learning
Pattern Matching
Ming-Yen Lin, IECS, FCU
DB & OLTP Systems
On-Line Transaction Processing
Schema
(ID,Name,Address,Salary,JobNo)
Data Model
Entity-Relationship
Relational
Transaction
Query:
SELECT Name
FROM T
WHERE Salary > 100000
[Fig. 2.1]
DM: Only imprecise queries
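The precise query above can be tried out with a minimal sketch like the following. It assumes Python's built-in sqlite3 module and an in-memory table T with the schema from the slide; the three rows are made up for illustration.

```python
import sqlite3

# In-memory table T with the slide's schema; the rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE T (ID INTEGER, Name TEXT, Address TEXT, Salary REAL, JobNo INTEGER)"
)
conn.executemany(
    "INSERT INTO T VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Ann", "Dallas", 120000, 10),
        (2, "Bob", "Houston", 95000, 20),
        (3, "Cho", "Chicago", 150000, 10),
    ],
)

# The precise query from the slide: each row either satisfies it or does not.
high_earners = [
    name
    for (name,) in conn.execute(
        "SELECT Name FROM T WHERE Salary > 100000 ORDER BY Name"
    )
]
print(high_earners)
conn.close()
```

The result is an exact set of rows, which is the contrast the slide draws with the imprecise queries of data mining.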
Fuzzy Sets and Logic
Fuzzy Set: Set membership function is a real valued
function with output in the range [0,1].
f(x): Probability x is in F.
1-f(x): Probability x is not in F.
EX:
T = {x | x is a person and x is tall}
Let f(x) be the probability that x is tall
Here f is the membership function
{x | x ∈ R and x.salary > 100,000} vs. {x | x ∈ R and x is tall}
DM: Prediction and classification are fuzzy.
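A possible membership function for "tall" is sketched below, next to the crisp salary test. The 160 cm and 190 cm thresholds and the linear ramp between them are assumptions chosen only for illustration; the slide does not specify a shape for f.

```python
def tall_membership(height_cm: float) -> float:
    """Membership of a person of this height in the fuzzy set 'tall', in [0, 1]."""
    low, high = 160.0, 190.0  # hypothetical thresholds, not from the slide
    if height_cm <= low:
        return 0.0
    if height_cm >= high:
        return 1.0
    # Linear ramp between the two thresholds (an assumed shape).
    return (height_cm - low) / (high - low)


def earns_over_100k(salary: float) -> bool:
    """Crisp set membership: the answer is only True or False."""
    return salary > 100_000


print(tall_membership(175.0))    # halfway up the ramp
print(earns_over_100k(120_000))  # crisp: in the set or not
```

The salary set gives a yes/no answer, while the "tall" set gives a value anywhere in [0, 1].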
Fuzzy Sets & Fuzzy Logic
Fuzzy logic: reasoning with uncertainty; multiple valued logic
retrieve data with imprecise/missing values
mem(¬x) = 1 − mem(x)
mem(x ∧ y) = min(mem(x), mem(y))
mem(x ∨ y) = max(mem(x), mem(y))
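The three membership rules above (negation, conjunction, disjunction) translate directly into code. This is a minimal sketch; the function names and the membership values 0.75 and 0.25 are made up for illustration.

```python
def fuzzy_not(m: float) -> float:
    """Negation: 1 minus the membership value."""
    return 1.0 - m


def fuzzy_and(a: float, b: float) -> float:
    """Conjunction: the minimum of the two membership values."""
    return min(a, b)


def fuzzy_or(a: float, b: float) -> float:
    """Disjunction: the maximum of the two membership values."""
    return max(a, b)


tall, rich = 0.75, 0.25  # hypothetical membership values for one person
print(fuzzy_not(tall))        # not tall
print(fuzzy_and(tall, rich))  # tall and rich
print(fuzzy_or(tall, rich))   # tall or rich
```

With memberships restricted to 0 and 1, these rules reduce to ordinary Boolean NOT, AND, and OR.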
Classification/Prediction is Fuzzy
[Figure: Reject and Accept regions by loan amount; panels: Simple, Fuzzy (with a grey area between Reject and Accept)]
Information Retrieval
Information Retrieval (IR): retrieving desired information
from textual data.
Library Science
Digital Libraries
Web Search Engines
Traditionally keyword based
Sample query:
Find all documents about “data mining”.
DM: Similarity measures;
Mine text/Web data.
Information Retrieval (cont’d)
Similarity: measure of how close a query is to a document.
Documents which are “close enough” are retrieved.
sim(q,Di); sim(Di, Dj)
Metrics:
Precision = |Relevant and Retrieved| / |Retrieved|
Recall = |Relevant and Retrieved| / |Relevant|
Inverse Document Frequency:
IDFk = log(n/|documents containing k|) + 1
Concept hierarchy [Fig. 2.7]
Replace ‘tiger’ with ‘CAT’
May be a Directed Acyclic Graph
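The precision, recall, and IDF formulas on this slide can be checked with a short sketch. Documents are modeled here as sets of IDs or of terms, and the logarithm is taken as natural; both are assumptions, since the slide does not fix a representation or a log base.

```python
import math


def precision(relevant: set, retrieved: set) -> float:
    """|Relevant and Retrieved| / |Retrieved|."""
    return len(relevant & retrieved) / len(retrieved) if retrieved else 0.0


def recall(relevant: set, retrieved: set) -> float:
    """|Relevant and Retrieved| / |Relevant|."""
    return len(relevant & retrieved) / len(relevant) if relevant else 0.0


def idf(term: str, documents: list) -> float:
    """log(n / |documents containing term|) + 1, using the natural log (assumed)."""
    containing = sum(1 for doc in documents if term in doc)
    if containing == 0:
        raise ValueError(f"no document contains {term!r}")
    return math.log(len(documents) / containing) + 1


# Hypothetical query result: 3 documents retrieved, 2 of them relevant.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d5"}
print(precision(relevant, retrieved), recall(relevant, retrieved))

# Hypothetical corpus: each document is a set of keywords.
docs = [{"data", "mining"}, {"data", "base"}, {"web", "mining"}, {"data"}]
print(idf("data", docs), idf("web", docs))
```

A term that appears in every document gets an IDF of exactly 1, and rarer terms score higher.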
IR Query Result Measures and Classification
Used to calculate precision/recall
[Figure panels: IR, Classification]
Decision Support Systems
Improve decision making by providing
specific information needed by management
Executive information systems
Executive Support Systems
as a suite of tools, assist in the overall DSS
process
Dimensional Modeling
a different way to view and interrogate data in DB
View data in a hierarchical manner more as
business executives might
Useful in decision support systems and mining
Dimension: collection of logically related attributes;
axis for modeling data.
Facts: data stored
Ex: Dimensions – products, locations, date
Facts – quantity, unit price
DM: May view data as dimensional.
Relational View of Data
ProdID  LocID       Date    Quantity  UnitPrice
123     Dallas      022900  5         25
123     Houston     020100  10        20
150     Dallas      031500  1         100
150     Dallas      031500  5         95
150     Fort Worth  021000  5         80
150     Chicago     012000  20        75
200     Seattle     030100  5         50
300     Rochester   021500  200       5
500     Bradenton   022000  15        20
500     Chicago     012000  10        25
Dimensional Modeling Queries
Roll Up: more general dimension
Drill Down: more specific dimension
Dimension (Aggregation) Hierarchy
SQL uses aggregation
Multidimensional schemas
star schema
snowflake schema
fact constellation schema
Multidimensional indexing
bitmap index, join index
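Roll up and drill down through SQL aggregation can be sketched with Python's built-in sqlite3 module. The rows are modeled on the Relational View of Data slide; the State column and the city-to-state hierarchy are hypothetical additions, included so there is a more general level to roll up to.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (ProdID INTEGER, City TEXT, State TEXT, Quantity INTEGER)"
)
# State is an assumed extra level of the location dimension (city < state).
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (123, "Dallas", "Texas", 5),
        (123, "Houston", "Texas", 10),
        (150, "Dallas", "Texas", 1),
        (150, "Dallas", "Texas", 5),
        (150, "Fort Worth", "Texas", 5),
        (150, "Chicago", "Illinois", 20),
    ],
)

# Drill down: the more specific level of the dimension (city).
by_city = conn.execute(
    "SELECT City, SUM(Quantity) FROM sales GROUP BY City ORDER BY City"
).fetchall()

# Roll up: the more general level of the same dimension (state).
by_state = conn.execute(
    "SELECT State, SUM(Quantity) FROM sales GROUP BY State ORDER BY State"
).fetchall()

print(by_city)
print(by_state)
conn.close()
```

Both queries use the same SUM aggregate; only the GROUP BY level changes, which is the sense in which SQL expresses roll up and drill down.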
Cube view of Data
Aggregation Hierarchies
order relationship
second < minute
aggregate sum
additive
Star Schema
[Figure: star schema with Sales (facts) at the center, linked to the dimensions Day, Product, Division, and Location]
aggregate facts for efficiency
Example of Star Schema
Sales Fact Table (time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales)
Measures: units_sold, dollars_sold, avg_sales
Dimension tables:
time (time_key, day, day_of_the_week, month, quarter, year)
item (item_key, item_name, brand, type, supplier_type)
branch (branch_key, branch_name, branch_type)
location (location_key, street, city, province_or_street, country)
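A query against a star schema typically joins the fact table to the dimension tables it needs and aggregates the measures. The sketch below uses Python's built-in sqlite3 module with a cut-down version of the schema above: only the item and location dimensions, with fewer columns. The table name sales and all rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
    CREATE TABLE location (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
    -- Fact table: foreign keys to the dimensions plus the measures.
    CREATE TABLE sales (
        item_key INTEGER REFERENCES item(item_key),
        location_key INTEGER REFERENCES location(location_key),
        units_sold INTEGER,
        dollars_sold REAL
    );
    INSERT INTO item VALUES (1, 'laptop', 'Acme'), (2, 'phone', 'Acme'), (3, 'tablet', 'Zeta');
    INSERT INTO location VALUES (10, 'Vancouver', 'Canada'), (20, 'Frankfurt', 'Germany');
    INSERT INTO sales VALUES
        (1, 10, 2, 2000.0), (2, 10, 5, 1500.0), (3, 20, 1, 400.0), (1, 20, 1, 1000.0);
    """
)

# Star join: the fact table joined to two dimensions, measures summed per group.
rows = conn.execute(
    """
    SELECT i.brand, l.country, SUM(s.units_sold), SUM(s.dollars_sold)
    FROM sales AS s
    JOIN item AS i ON i.item_key = s.item_key
    JOIN location AS l ON l.location_key = s.location_key
    GROUP BY i.brand, l.country
    ORDER BY i.brand, l.country
    """
).fetchall()
for row in rows:
    print(row)
conn.close()
```

Each dimension is reached by a single join from the fact table, which is what gives the schema its star shape.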
Options to implement star schema
(a) flattened: store data for each dimension in
exactly one table; roll up: by SQL aggregate
(b) normalized: a table exists for each level in each
dimension; each table has one tuple for every
occurrence at the level
(c) expanded: same number of dimension tables as
normalized; the lowest-level table is the same as in flattened
(d) levelized: has one dimension table, as does the
flattened, but aggregations have already been
performed.
[Fig. 2.12]
Example of Snowflake Schema
Sales Fact Table (time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales)
Measures: units_sold, dollars_sold, avg_sales
Dimension tables:
time (time_key, day, day_of_the_week, month, quarter, year)
item (item_key, item_name, brand, type, supplier_key)
supplier (supplier_key, supplier_type)
branch (branch_key, branch_name, branch_type)
location (location_key, street, city_key)
city (city_key, city, province_or_street, country)
Example of Fact Constellation
Also called a galaxy schema
Sales Fact Table (time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales)
Shipping Fact Table (time_key, item_key, shipper_key, from_location, to_location, dollars_cost, units_shipped)
Measures: units_sold, dollars_sold, avg_sales; dollars_cost, units_shipped
Dimension tables:
time (time_key, day, day_of_the_week, month, quarter, year)
item (item_key, item_name, brand, type, supplier_type)
branch (branch_key, branch_name, branch_type)
location (location_key, street, city, province_or_street, country)
shipper (shipper_key, shipper_name, location_key, shipper_type)
Data Warehousing
“Subject-oriented, integrated, time-variant,
nonvolatile” William Inmon
Operational Data: Data used in day to day needs of
company.
Informational Data: Supports other functions such
as planning and forecasting.
Data mining tools often access data warehouses
rather than operational data.
DM: May access data in warehouse.
What is Data Warehouse?
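Definition
A decision-support database that is set up separately from, and maintained
independently of, the company's operational databases
Supports information processing by providing a solid platform of
consolidated, historical data for analysis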
“A data warehouse is a subject-oriented, integrated,
time-variant, and nonvolatile collection of data in
support of management’s decision-making
process.”—W. H. Inmon
Data warehousing
The process of constructing and using data warehouses
D. W.—Subject-Oriented
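Organized around major subjects, such as customer, product,
sales
Focuses on the modeling and analysis of data for decision makers,
not on daily operations or transaction processing
Provides a simple and concise view around particular subjects by
excluding data that are not useful in the decision support process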
Data Warehouse—Integrated
Constructed by integrating multiple, heterogeneous data sources
relational databases
flat files
on-line transaction records
Data cleaning and data integration techniques are applied
Ensure consistency among the different data sources in:
naming conventions
encoding structures
attribute measures
Ex: Hotel price: currency, tax, breakfast covered, etc.
Data has already been converted by the time it is "moved" to the warehouse
Data Warehouse—Time Variant
The time horizon of a data warehouse is significantly longer than that of operational systems
Operational database: current value data.
Data warehouse data: provide information from a historical
perspective (e.g., past 5-10 years)
Every key structure in the data warehouse
contains an element of time, explicitly or implicitly
operational data: may or may not contain a “time element”
Data Warehouse—Non-Volatile
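A physically separate store of data transformed from
the operational environment
Operational updates of data do not occur in the data warehouse
Does not require transaction processing, recovery, or
concurrency control mechanisms
Requires only two operations:
initial loading of data
access of data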
Data Warehousing
traditional db: operational data
data warehouse: informational data
‘what if’ questions -> warehouse + query
e.g., analyze trends from historical data
basic components
data migration
warehouse
access tool
Transformation in Data Warehousing
Transformation [Fig. 2.14]
remove unwanted data
convert heterogeneous source into one common format
merge snapshots to create historical view
summarize data at levels
add derived data
handle missing/erroneous data
also called data scrubbing/data staging
Improve performance of data warehouse
applications
Summarization
Denormalization (speed up join!)
Partitioning
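Several of the transformation steps above (converting heterogeneous sources to one format, handling missing data, adding derived data, summarizing) can be sketched in plain Python. The two source layouts, the field names, and the policy of dropping records with a missing price are all assumptions made for illustration.

```python
from collections import defaultdict

# Two hypothetical operational sources with different naming and units.
source_a = [
    {"prod": 123, "qty": 5, "unit_price": 25.0},
    {"prod": 150, "qty": 1, "unit_price": None},  # missing value
]
source_b = [{"ProductID": "123", "Quantity": "10", "PriceCents": "2000"}]


def from_a(rec: dict) -> dict:
    """Convert a source A record to the common format."""
    return {"prod_id": rec["prod"], "quantity": rec["qty"], "unit_price": rec["unit_price"]}


def from_b(rec: dict) -> dict:
    """Convert a source B record: strings to numbers, cents to dollars."""
    return {
        "prod_id": int(rec["ProductID"]),
        "quantity": int(rec["Quantity"]),
        "unit_price": int(rec["PriceCents"]) / 100,
    }


def transform(records: list) -> list:
    """Drop records with a missing price and add the derived field 'total'."""
    cleaned = []
    for rec in records:
        if rec["unit_price"] is None:
            continue  # one simple policy for missing data (assumed)
        cleaned.append({**rec, "total": rec["quantity"] * rec["unit_price"]})
    return cleaned


def summarize(records: list) -> dict:
    """Summarize at the product level: total sales per product."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["prod_id"]] += rec["total"]
    return dict(totals)


unified = [from_a(r) for r in source_a] + [from_b(r) for r in source_b]
warehouse_rows = transform(unified)
summary = summarize(warehouse_rows)
print(summary)
```

Storing the summary alongside the detailed rows is one form of the summarization listed above for improving performance.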
Operational vs. Informational
              Operational Data    Data Warehouse
Application   OLTP                OLAP
Use           Precise Queries     Ad Hoc
Temporal      Snapshot            Historical
Modification  Dynamic             Static
Orientation   Application         Business
Data          Operational Values  Integrated
Size          Gigabits            Terabits
Level         Detailed            Summarized
Access        Often               Less Often
Response      Few Seconds         Minutes
Data Schema   Relational          Star/Snowflake
OLAP
Online Analytic Processing (OLAP):
provides more complex queries than OLTP.
OnLine Transaction Processing (OLTP):
traditional database/transaction processing.
Dimensional data; cube view
Visualization of operations:
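Slice: select a single value on one dimension to examine a sub-cube.
Dice: select values on two or more dimensions to examine a smaller sub-cube.
Pivot: rotate cube to look at another dimension.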
Roll Up/Drill Down
DM: May use OLAP queries.
A Concept Hierarchy
Dimension (location)
Levels: all > region > country > city > office
all: all
region: Europe, ..., North_America
country: Germany, ..., Spain (Europe); Canada, ..., Mexico (North_America)
city: Frankfurt, ... (Germany); Vancouver, ..., Toronto (Canada)
office: L. Chan, ..., M. Wind
Used for multi-level abstraction (for interactive mining)
Typical OLAP Operations
Roll up (drill-up): summarize data
by climbing up hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up
from higher level summary to lower level summary or detailed
data, or introducing new dimensions
Slice and dice: select a portion
project and select
Pivot (rotate):
reorient the cube, visualization, 3D to series of 2D planes.
Other operations
drill across: involving (across) more than one fact table
drill through: through the bottom level of the cube to its backend relational tables (using SQL)
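The operations above can be illustrated on a tiny cube held in a Python dictionary. This is only a sketch: the cell layout (city, quarter, item) -> units sold, the city-to-country mapping, and all values are made up, and a real OLAP server would not store a cube this way.

```python
from collections import defaultdict

# Hypothetical cube: (city, quarter, item) -> units sold.
cube = {
    ("Vancouver", "Q1", "phone"): 10,
    ("Vancouver", "Q2", "phone"): 15,
    ("Toronto", "Q1", "phone"): 20,
    ("Toronto", "Q1", "laptop"): 5,
    ("Frankfurt", "Q1", "laptop"): 8,
}
country_of = {"Vancouver": "Canada", "Toronto": "Canada", "Frankfurt": "Germany"}


def slice_quarter(cube: dict, quarter: str) -> dict:
    """Slice: fix the time dimension to one value, leaving a 2-D table."""
    return {(city, item): units for (city, q, item), units in cube.items() if q == quarter}


def dice(cube: dict, cities: set, quarters: set, items: set) -> dict:
    """Dice: select on several dimensions at once, giving a smaller cube."""
    return {
        key: units
        for key, units in cube.items()
        if key[0] in cities and key[1] in quarters and key[2] in items
    }


def roll_up_to_country(cube: dict) -> dict:
    """Roll up by climbing the location hierarchy from city to country."""
    totals = defaultdict(int)
    for (city, quarter, item), units in cube.items():
        totals[(country_of[city], quarter, item)] += units
    return dict(totals)


def roll_up_drop_item(cube: dict) -> dict:
    """Roll up by dimension reduction: aggregate away the item dimension."""
    totals = defaultdict(int)
    for (city, quarter, _item), units in cube.items():
        totals[(city, quarter)] += units
    return dict(totals)


def pivot(table: dict) -> dict:
    """Pivot a 2-D table by swapping its two axes."""
    return {(b, a): value for (a, b), value in table.items()}


print(slice_quarter(cube, "Q1"))
print(roll_up_to_country(cube))
```

Drill down is the reverse direction: going from the country-level result back to the city-level cells of the original cube.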
Cube Operations
dice (location = X AND time = Y AND item = Z)
roll-up (city to location)
drill-down (quarter to month)
slice (time = Q1)
pivot
OLAP Operations
[Figure: roll up / drill down; cube panels: Single Cell, Multiple Cells, Slice, Dice]
OLAP tools: ROLAP (relational) or MOLAP (multidimensional)
ROLAP: a ROLAP server (middleware) creates a multidimensional (MD) view for users
MOLAP: a specialized DBMS and software directly support MD data
or a hybrid tool
Web Search Engines
Can be viewed as query systems, like IR systems
query: keyword, boolean, weighted, …
Conventional search engines suffer from:
Abundance
Limited coverage
Limited query
Limited customization
Web Mining
content/structure/usage
Web search => content mining
Statistics
Simple descriptive models
Statistical inference: generalizing a model
created from a sample of the data to the
entire dataset.
Exploratory Data Analysis:
Data can actually drive the creation of the
model
Opposite of traditional statistical view.
Data mining targeted to business user
DM: Many data mining methods come
from statistical techniques.
Machine Learning
Machine Learning: area of AI that examines how to
write programs that can learn.
Often used in classification and prediction
Supervised Learning: learns by example.
Unsupervised Learning: learns without knowledge
of correct answers.
Machine learning often deals with small static
datasets.
[table 2.3]
DM: Uses many machine learning
techniques.
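"Learns by example" can be illustrated with a very small supervised learner. The sketch below uses a 1-nearest-neighbor rule, chosen here only because it is short; the loan data (loan amount and income, both in thousands) and the accept/reject labels are made up.

```python
def nearest_neighbor(train: list, query: tuple) -> str:
    """Return the label of the training example closest to the query."""

    def squared_distance(a: tuple, b: tuple) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _, label = min(train, key=lambda example: squared_distance(example[0], query))
    return label


# Labeled examples: ((loan amount, income), decision). The labels are the
# "correct answers" that supervised learning has and unsupervised learning lacks.
train = [
    ((10, 80), "accept"),
    ((20, 90), "accept"),
    ((80, 30), "reject"),
    ((90, 20), "reject"),
]

print(nearest_neighbor(train, (15, 85)))
print(nearest_neighbor(train, (85, 25)))
```

An unsupervised method would receive only the (loan amount, income) pairs, without the labels, and would have to find the two groups on its own.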
Pattern Matching (Recognition)
Pattern Matching: finds occurrences of a
predefined pattern in the data.
Applications include speech recognition,
information retrieval, time series analysis.
DM: Type of classification.
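Finding occurrences of a predefined pattern can be sketched with Python's re module. The example encodes a time series as a string of U (up) and D (down) moves, which is an assumed encoding for illustration; the helper name find_occurrences is also hypothetical.

```python
import re


def find_occurrences(pattern: str, data: str) -> list:
    """Start offsets of every occurrence of pattern in data, overlaps included."""
    # A zero-width lookahead lets overlapping occurrences be reported too.
    return [m.start() for m in re.finditer(f"(?={re.escape(pattern)})", data)]


# Hypothetical time series encoded as up (U) / down (D) moves.
moves = "UUDUUDDUUD"
positions = find_occurrences("UUD", moves)
print(positions)
```

Here the pattern is given in advance and matched exactly; recognition tasks such as speech usually need approximate matching instead.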
DM vs. Related Topics
Area     Query     Data              Results  Output
DB/OLTP  Precise   Database          Precise  DB Objects or Aggregation
IR       Precise   Documents         Vague    Documents
OLAP     Analysis  Multidimensional  Precise  DB Objects or Aggregation
DM       Vague     Preprocessed      Vague    KDD Objects