Download 9/12 - Computer and Information Science

Data mining: Hmm, what is it?
The extraction of implicit, previously unknown, and potentially useful information from large bodies of data, often accumulated for other purposes.
Mining: large volumes of low-grade data are sifted through to find something interesting.
Data warehousing
Examples
Discussions
September 9, 2003
UMASSD, CIS, Iren Valova
Data mining vs. database query
• Database query languages are used when finding and reporting information that is stored in the database as factual entries. A query provides data with no analysis.
• Data mining is used when looking for patterns and associations in the database or data warehouse that are not explicitly present or coded in the database. Analysis is needed to unveil hidden patterns and relationships.
As if it is not clear already …
Now that we are in the course, perhaps she will really tell us what data mining is all about?!
DB can answer:
• Get a list of all department store customers who used a credit card to buy a gas grill.
• Get a list of all patients who have had at least one heart attack and whose cholesterol < 20.
• Get a list of all credit card holders who used their credit card to purchase more than $300 in groceries during January.
DM can answer:
• Develop a general profile for credit card customers who take advantage of promotions offered with their credit card billing.
• Determine which patients are likely to come back to work after a heart attack.
• Differentiate individuals who are poor credit risks from those who are likely to make their loan payments on time.
As usual, a picture is worth millions.

Patient ID  S. Throat  Fever  Sw. Glands  Congest.  Headache  Diagnosis
R, U        C, I       C, I   C, I        C, I      C, I      C, O
1           YES        YES    YES         YES       YES       ST
2           NO         NO     NO          YES       YES       AL
3           YES        YES    NO          YES       NO        COLD
4           YES        NO     YES         NO        NO        ST
5           NO         YES    NO          YES       NO        COLD
6           NO         NO     NO          YES       NO        AL
7           NO         NO     YES         NO        NO        ST
8           YES        NO     NO          YES       YES       AL
9           NO         YES    NO          YES       YES       COLD
10          YES        YES    NO          YES       YES       COLD

Database query: List all patients with Diagnosis of Cold, or Diagnosis of Allergy and Swollen Glands.
Database query: List all patients with Swollen Glands and Diagnosis of ST.
One more picture for good measure

Income Range  Magazine Promo  Watch Promo  Life Ins Promo  Credit Card  Sex     Age
C, I          C, I            C, I         C, O            C, I         C, I    R, I
40-50,000     Yes             No           No              No           Male    45
30-40,000     Yes             Yes          Yes             No           Female  40
40-50,000     No              No           No              No           Male    42
30-40,000     Yes             Yes          Yes             Yes          Male    43
50-60,000     Yes             No           Yes             No           Female  38
20-30,000     No              No           No              No           Female  55
30-40,000     Yes             No           Yes             Yes          Male    35
20-30,000     No              Yes          No              No           Male    27
30-40,000     Yes             No           No              No           Male    43
30-40,000     Yes             Yes          Yes             No           Female  41
40-50,000     No              Yes          Yes             No           Female  43
20-30,000     No              Yes          Yes             No           Male    29
50-60,000     Yes             Yes          Yes             No           Female  39
40-50,000     No              Yes          No              No           Male    55
20-30,000     No              No           Yes             Yes          Female  19

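To make the query-versus-mining distinction concrete, here is a minimal Python sketch over the ten patient records from the first table. The patient numbering (rows 1 to 10 in table order) and both function names are assumptions made for illustration, and the "mining" step is only a toy: it looks for single attribute values that always co-occur with one diagnosis, which is far simpler than a real data mining algorithm.

```python
from collections import defaultdict

ATTRS = ("S. Throat", "Fever", "Sw. Glands", "Congest.", "Headache")

# Patient ID -> (sore throat, fever, swollen glands, congestion, headache, diagnosis).
# IDs are assumed to follow the row order of the table.
PATIENTS = {
    1: ("YES", "YES", "YES", "YES", "YES", "ST"),
    2: ("NO", "NO", "NO", "YES", "YES", "AL"),
    3: ("YES", "YES", "NO", "YES", "NO", "COLD"),
    4: ("YES", "NO", "YES", "NO", "NO", "ST"),
    5: ("NO", "YES", "NO", "YES", "NO", "COLD"),
    6: ("NO", "NO", "NO", "YES", "NO", "AL"),
    7: ("NO", "NO", "YES", "NO", "NO", "ST"),
    8: ("YES", "NO", "NO", "YES", "YES", "AL"),
    9: ("NO", "YES", "NO", "YES", "YES", "COLD"),
    10: ("YES", "YES", "NO", "YES", "YES", "COLD"),
}


def query_swollen_glands_and_st(patients):
    """Database-style query: report stored facts that match a condition."""
    return [pid for pid, row in patients.items()
            if row[2] == "YES" and row[5] == "ST"]


def find_pure_rules(patients):
    """Toy 'mining' step: attribute values that always go with one diagnosis.

    Returns {(attribute, value): (diagnosis, number_of_supporting_patients)}.
    """
    seen = defaultdict(list)
    for row in patients.values():
        for name, value in zip(ATTRS, row):
            seen[(name, value)].append(row[5])
    return {key: (diags[0], len(diags))
            for key, diags in seen.items() if len(set(diags)) == 1}


if __name__ == "__main__":
    print(query_swollen_glands_and_st(PATIENTS))
    for (name, value), (diag, support) in find_pure_rules(PATIENTS).items():
        print(f"IF {name} = {value} THEN Diagnosis = {diag} ({support} patients)")
```

The query only returns rows that are already stored. The second function reports a relationship (for instance, swollen glands going together with ST) that is not coded anywhere in the table and only appears after analysis.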
It becomes claer… I meant “clear”?
• Data warehousing
Well, then we move on with Data Warehouses.
Suppose that AllElectronics is a successful international company, with branches around the world. Each branch has its own set of databases. The president of AllElectronics has asked you to provide an analysis of the company's sales per item type per branch for the third quarter. This is a difficult task, particularly since the relevant data are spread out over several databases, physically located at numerous sites.
If AllElectronics had a data warehouse, this task would be easy. A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site. Data warehouses are constructed via a process of data cleaning, data transformation, data integration, data loading, and periodic data refreshing.
Data warehousing
In order to facilitate decision making, the data in a data warehouse are organized around major subjects, such as customer, item, supplier, and activity. The data are stored to provide information from a historical perspective (such as from the past 5-10 years) and are typically summarized. For example, rather than storing the details of each sales transaction, the data warehouse may store a summary of the transactions per item type for each store or, summarized to a higher level, for each sales region.
A data warehouse is usually modeled by a multidimensional database structure, where each dimension corresponds to an attribute or a set of attributes in the schema, and each cell stores the value of some aggregate measure, such as count or sales-amount.
Multi-dimensional Data Cube
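The idea that a warehouse stores summaries rather than individual transactions, with each cell holding an aggregate measure, can be sketched in a few lines of Python. The transactions, store names, and the `summarize` helper below are hypothetical and exist only to illustrate the point.

```python
from collections import defaultdict

# Hypothetical sales transactions: (item_type, store, region, amount).
TRANSACTIONS = [
    ("TV", "store_1", "east", 400.0),
    ("TV", "store_1", "east", 350.0),
    ("PC", "store_1", "east", 900.0),
    ("TV", "store_2", "west", 420.0),
    ("PC", "store_2", "west", 1100.0),
]


def summarize(transactions, key_indexes):
    """Sum the amount (last field) for each combination of the chosen dimensions.

    Each key of the result plays the role of one cell of a multidimensional
    structure; its value is the aggregate measure stored in that cell.
    """
    cells = defaultdict(float)
    for record in transactions:
        key = tuple(record[i] for i in key_indexes)
        cells[key] += record[-1]
    return dict(cells)


# Summary per item type for each store.
per_item_store = summarize(TRANSACTIONS, (0, 1))
# The same data summarized to a higher level: per item type for each region.
per_item_region = summarize(TRANSACTIONS, (0, 2))

if __name__ == "__main__":
    print(per_item_store)
    print(per_item_region)
```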
Data warehouse vs. Data Mart
• Data warehouse: collects information about subjects that span an entire organization.
• Data mart: a department-level subset of a data warehouse.
OLAP
By providing multidimensional data views and the precomputation of summarized data, data warehouse systems are well suited for On-Line Analytical Processing, or OLAP. OLAP operations make use of background knowledge regarding the domain of the data being studied in order to allow the presentation of data at different levels of abstraction. Such operations accommodate different user viewpoints.
[Figures: drill-down on time data; roll-up on address]
DW and OLAP for DM
Data warehousing provides architectures and tools for business executives to systematically organize, understand, and use their data to make strategic decisions.
Loosely speaking, a data warehouse refers to a database that is maintained separately from an organization's operational databases. Data warehouse systems allow for the integration of a variety of application systems. They support information processing by providing a solid platform of consolidated historical data for analysis.
In the last several years, many firms have spent millions of dollars in building enterprise-wide data warehouses. Many people feel that with competition mounting in every industry, data warehousing is the latest must-have marketing weapon: a way to keep customers by learning more about their needs.
DW and major characteristics
• A data warehouse is organized around major subjects.
• It is constructed by integrating multiple heterogeneous sources.
• Every key structure contains an element of time.
• It is a physically separate store of data transformed from the application data found in the operational environment.
Data Warehouse vs. Operational DBMS
• OLTP (on-line transaction processing)
  • Major task of traditional relational DBMS
  • Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
• OLAP (on-line analytical processing)
  • Major task of data warehouse system
  • Data analysis and decision making
• Distinct features (OLTP vs. OLAP):
  • User and system orientation: customer vs. market
  • Data contents: current, detailed vs. historical, consolidated
  • Database design: ER + application vs. star + subject
  • View: current, local vs. evolutionary, integrated
  • Access patterns: update vs. read-only but complex queries
OLTP vs. OLAP

                    OLTP                                                      OLAP
users               clerk, IT professional                                    knowledge worker
function            day to day operations                                     decision support
DB design           application-oriented                                      subject-oriented
data                current, up-to-date; detailed, flat relational; isolated  historical; summarized, multidimensional; integrated, consolidated
usage               repetitive                                                ad-hoc
access              read/write; index/hash on prim. key                       lots of scans
unit of work        short, simple transaction                                 complex query
# records accessed  tens                                                      millions
# users             thousands                                                 hundreds
DB size             100MB-GB                                                  100GB-TB
metric              transaction throughput                                    query throughput, response

Why Separate Data Warehouse?
• High performance for both systems
  • DBMS: tuned for OLTP (access methods, indexing, concurrency control, recovery)
  • Warehouse: tuned for OLAP (complex OLAP queries, multidimensional view, consolidation)
• Different functions and different data:
  • missing data: Decision support requires historical data which operational DBs do not typically maintain
  • data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources
  • data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled
From Tables and Spreadsheets to Data Cubes
• A data warehouse is based on a multidimensional data model which views data in the form of a data cube
• A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
  • Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year)
  • Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
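The relationship between a fact table and its dimension tables can be sketched with plain Python dictionaries. The attribute names (item_name, brand, type, day, week, month, quarter, year, dollars_sold) come from the slide; the surrogate keys, the sample rows, and the `dollars_by` helper are assumptions made for illustration only.

```python
from collections import defaultdict

# Dimension tables keyed by a surrogate key (sample values are hypothetical).
ITEM = {
    1: {"item_name": "TV-A", "brand": "Acme", "type": "TV"},
    2: {"item_name": "PC-B", "brand": "Bolt", "type": "PC"},
}
TIME = {
    10: {"day": 5, "week": 1, "month": 1, "quarter": "Q1", "year": 2003},
    11: {"day": 9, "week": 28, "month": 7, "quarter": "Q3", "year": 2003},
}

# Fact table: one key per related dimension table plus the measure dollars_sold.
SALES_FACT = [
    {"item_key": 1, "time_key": 10, "dollars_sold": 400.0},
    {"item_key": 1, "time_key": 11, "dollars_sold": 350.0},
    {"item_key": 2, "time_key": 11, "dollars_sold": 900.0},
]


def dollars_by(item_attr, time_attr):
    """View dollars_sold along one item attribute and one time attribute."""
    totals = defaultdict(float)
    for fact in SALES_FACT:
        # Follow the keys stored in the fact row into each dimension table.
        key = (ITEM[fact["item_key"]][item_attr],
               TIME[fact["time_key"]][time_attr])
        totals[key] += fact["dollars_sold"]
    return dict(totals)


if __name__ == "__main__":
    print(dollars_by("type", "quarter"))
    print(dollars_by("brand", "year"))
```

Because the fact rows carry only keys and measures, the same facts can be viewed along any attribute of any dimension without changing the fact table.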
Cube formation
Cube: A Lattice of Cuboids
0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time, item, location, supplier
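The lattice above is simply every subset of the four dimensions, grouped by size. A short sketch, assuming nothing beyond the dimension names on the slide (the function name `cuboids` is made up for this example):

```python
from itertools import combinations


def cuboids(dimensions):
    """Return {k: [all k-D cuboids]} for every subset of the given dimensions."""
    return {k: list(combinations(dimensions, k))
            for k in range(len(dimensions) + 1)}


lattice = cuboids(("time", "item", "location", "supplier"))

if __name__ == "__main__":
    for k, group in lattice.items():
        names = [",".join(c) if c else "all" for c in group]
        print(f"{k}-D cuboids ({len(group)}): {'; '.join(names)}")
    print("total cuboids:", sum(len(group) for group in lattice.values()))
```

With n dimensions and no concept hierarchies there are 2^n cuboids, so the four dimensions here give 16.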
Conceptual Modeling of Data Warehouses
• Modeling data warehouses: dimensions & measures
  • Star schema: A fact table in the middle connected to a set of dimension tables
  • Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
  • Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
Star schema example
Snowflake schema example
Fact constellation schema
A Data Mining Query Language: DMQL
• Cube Definition (Fact Table)
    define cube <cube_name> [<dimension_list>]: <measure_list>
• Dimension Definition (Dimension Table)
    define dimension <dimension_name> as (<attribute_or_subdimension_list>)
• Special Case (Shared Dimension Tables)
  • First time as “cube definition”
  • define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>
Defining a Star Schema in DMQL
    define cube sales_star [time, item, branch, location]:
        dollars_sold = sum(sales_in_dollars),
        avg_sales = avg(sales_in_dollars),
        units_sold = count(*)
    define dimension time as (time_key, day, day_of_week, month, quarter, year)
    define dimension item as (item_key, item_name, brand, type, supplier_type)
    define dimension branch as (branch_key, branch_name, branch_type)
    define dimension location as (location_key, street, city, province_or_state, country)
Defining a Snowflake Schema in DMQL
    define cube sales_snowflake [time, item, branch, location]:
        dollars_sold = sum(sales_in_dollars),
        avg_sales = avg(sales_in_dollars),
        units_sold = count(*)
    define dimension time as (time_key, day, day_of_week, month, quarter, year)
    define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))
    define dimension branch as (branch_key, branch_name, branch_type)
    define dimension location as (location_key, street, city(city_key, province_or_state, country))
Defining a Fact Constellation in DMQL
    define cube sales [time, item, branch, location]:
        dollars_sold = sum(sales_in_dollars),
        avg_sales = avg(sales_in_dollars),
        units_sold = count(*)
    define dimension time as (time_key, day, day_of_week, month, quarter, year)
    define dimension item as (item_key, item_name, brand, type, supplier_type)
    define dimension branch as (branch_key, branch_name, branch_type)
    define dimension location as (location_key, street, city, province_or_state, country)
    define cube shipping [time, item, shipper, from_location, to_location]:
        dollar_cost = sum(cost_in_dollars),
        unit_shipped = count(*)
    define dimension time as time in cube sales
    define dimension item as item in cube sales
    define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
    define dimension from_location as location in cube sales
    define dimension to_location as location in cube sales
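DMQL is a teaching language, so one way to experiment with the sales_star definition is to translate it by hand into relational tables. The sketch below does that with Python's built-in sqlite3 module. The table names (dim_time, dim_item, dim_branch, dim_location, sales_fact), the column types, and the sample rows are assumptions; only the attribute names and the three measures come from the DMQL definition above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, day INTEGER,
                       day_of_week TEXT, month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                       type TEXT, supplier_type TEXT);
CREATE TABLE dim_branch (branch_key INTEGER PRIMARY KEY, branch_name TEXT,
                         branch_type TEXT);
CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                           province_or_state TEXT, country TEXT);
-- Fact table: one key per dimension plus the raw value behind the measures.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES dim_time(time_key),
    item_key INTEGER REFERENCES dim_item(item_key),
    branch_key INTEGER REFERENCES dim_branch(branch_key),
    location_key INTEGER REFERENCES dim_location(location_key),
    sales_in_dollars REAL);
""")

# Hypothetical sample rows.
conn.executemany("INSERT INTO dim_time VALUES (?, ?, ?, ?, ?, ?)",
                 [(1, 9, "Tue", 9, "Q3", 2003), (2, 3, "Fri", 10, "Q4", 2003)])
conn.executemany("INSERT INTO dim_item VALUES (?, ?, ?, ?, ?)",
                 [(1, "TV-A", "Acme", "TV", "wholesale"),
                  (2, "PC-B", "Bolt", "PC", "retail")])
conn.execute("INSERT INTO dim_branch VALUES (1, 'B1', 'mall')")
conn.execute("INSERT INTO dim_location VALUES "
             "(1, '1 Main St', 'Vancouver', 'BC', 'Canada')")
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)",
                 [(1, 1, 1, 1, 400.0), (1, 1, 1, 1, 350.0), (2, 2, 1, 1, 900.0)])

# The three measures of the cube, grouped by item type and quarter.
rows = conn.execute("""
    SELECT i.type, t.quarter,
           SUM(f.sales_in_dollars) AS dollars_sold,
           AVG(f.sales_in_dollars) AS avg_sales,
           COUNT(*)                AS units_sold
    FROM sales_fact f
    JOIN dim_item i ON f.item_key = i.item_key
    JOIN dim_time t ON f.time_key = t.time_key
    GROUP BY i.type, t.quarter
    ORDER BY i.type, t.quarter
""").fetchall()

if __name__ == "__main__":
    for row in rows:
        print(row)
```

A snowflake version would move supplier_type and the city attributes into their own tables, as the sub-dimensions supplier(...) and city(...) in the DMQL definition suggest.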
Measures: Three Categories
• distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning.
  • E.g., count(), sum(), min(), max().
• algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function.
  • E.g., avg(), min_N(), standard_deviation().
• holistic: if there is no constant bound on the storage size needed to describe a subaggregate.
  • E.g., median(), mode(), rank().
A Concept Hierarchy: Dimension (location)
all: all
region: Europe, ..., North_America
country: Germany, ..., Spain (Europe); Canada, ..., Mexico (North_America)
city: Frankfurt, ... (Germany); Vancouver, ..., Toronto (Canada)
office: L. Chan, ..., M. Wind
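The three measure categories can be checked on a tiny example. The sketch below splits a small list of numbers into partitions (the values are made up) and compares what can be recovered from per-partition sub-aggregates with the result computed over all the data.

```python
from statistics import median

# Hypothetical data stored in three partitions.
partitions = [[3, 1, 4], [1, 5], [9, 2, 6, 5]]
all_values = [v for part in partitions for v in part]

# Distributive: the sum of the per-partition sums equals the sum over all data.
total = sum(sum(part) for part in partitions)

# Algebraic: avg() is computed from two distributive sub-aggregates,
# sum() and count(), so M = 2 values per partition are enough.
sub_aggregates = [(sum(part), len(part)) for part in partitions]
average = sum(s for s, _ in sub_aggregates) / sum(n for _, n in sub_aggregates)

# Holistic: a bounded summary per partition is not enough for median().
# The median of the partition medians is generally not the true median.
median_of_medians = median(median(part) for part in partitions)
true_median = median(all_values)

if __name__ == "__main__":
    print("sum from partitions:", total, "| sum over all data:", sum(all_values))
    print("avg from (sum, count) pairs:", average)
    print("median of medians:", median_of_medians, "| true median:", true_median)
```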
View of Warehouses and Hierarchies
Specification of hierarchies
• Schema hierarchy: day < {month < quarter; week} < year
• Set_grouping hierarchy: {1..10} < inexpensive
Multidimensional Data
• Sales volume as a function of product, month, and region
• Dimensions: Product, Location, Time
• Hierarchical summarization paths (top to bottom):
  • Product: Industry, Category, Product
  • Location: Region, Country, City, Office
  • Time: Year, Quarter, Month or Week, Day
[Figure: cube with axes Product, Region, and Month]
A Sample Data Cube
[Figure: data cube with axes Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), and Country (U.S.A, Canada, Mexico, sum); one highlighted cell is the total annual sales of TV in U.S.A.]
Cuboids Corresponding to the Cube
0-D (apex) cuboid: all
1-D cuboids: product; date; country
2-D cuboids: product,date; product,country; date,country
3-D (base) cuboid: product, date, country
Browsing a Data Cube
• Visualization
• OLAP capabilities
• Interactive manipulation
Typical OLAP Operations
• Roll up (drill-up): summarize data
  • by climbing up hierarchy or by dimension reduction
• Drill down (roll down): reverse of roll-up
  • from higher level summary to lower level summary or detailed data, or introducing new dimensions
• Slice and dice:
  • project and select
• Pivot (rotate):
  • reorient the cube, visualization, 3D to series of 2D planes
• Other operations
  • drill across: involving (across) more than one fact table
  • drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
A Star-Net Query Model
[Figure: star-net with radial lines Customer Orders, Shipping Method, Time, Product, Location, Organization, Promotion, and Customer; labeled points include CONTRACTS, ORDER, AIR-EXPRESS, TRUCK, ANNUALLY, QTRLY, DAILY, PRODUCT LINE, PRODUCT ITEM, PRODUCT GROUP, CITY, COUNTRY, REGION, DISTRICT, DIVISION, SALES PERSON]
Each circle is called a footprint.
Data Warehouse Design Process
• Top-down, bottom-up approaches or a combination of both
  • Top-down: Starts with overall design and planning (mature)
  • Bottom-up: Starts with experiments and prototypes (rapid)
• From software engineering point of view
  • Waterfall: structured and systematic analysis at each step before proceeding to the next
  • Spiral: rapid generation of increasingly functional systems, short turnaround time
• Typical data warehouse design process
  • Choose a business process to model, e.g., orders, invoices, etc.
  • Choose the grain (atomic level of data) of the business process
  • Choose the dimensions that will apply to each fact table record
  • Choose the measure that will populate each fact table record
Multi-Tiered Architecture
[Figure: Data Sources (operational DBs, other sources) feed Extract, Transform, Load, Refresh through a Monitor & Integrator with Metadata into Data Storage (Data Warehouse, Data Marts), which serves the OLAP Engine (OLAP Server), which in turn feeds the Front-End Tools (Analysis, Query, Reports, Data mining)]
Three Data Warehouse Models
• Enterprise warehouse
  • collects all of the information about subjects spanning the entire organization
• Data Mart
  • a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart
  • Independent vs. dependent (directly from warehouse) data mart
• Virtual warehouse
  • A set of views over operational databases
  • Only some of the possible summary views may be materialized
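The typical OLAP operations listed earlier (roll-up, slice, dice, pivot) can be tried out on a small in-memory cube. This is a minimal sketch, not how an OLAP server is implemented: the cell values are invented, the function names are made up for this example, and the country-to-region mapping follows the location concept hierarchy shown on the slides.

```python
from collections import defaultdict

DIMS = ("product", "quarter", "country")

# Hypothetical base cuboid: (product, quarter, country) -> sales.
CUBE = {
    ("TV", "1Qtr", "Canada"): 40,
    ("TV", "2Qtr", "Canada"): 50,
    ("TV", "1Qtr", "Mexico"): 30,
    ("PC", "1Qtr", "Canada"): 200,
    ("PC", "2Qtr", "Germany"): 80,
    ("VCR", "2Qtr", "Mexico"): 10,
}

# One level of the location hierarchy: country -> region.
REGION = {"Canada": "North_America", "Mexico": "North_America", "Germany": "Europe"}


def roll_up_dimension_reduction(cube, drop):
    """Roll up by dimension reduction: aggregate one dimension away."""
    i = DIMS.index(drop)
    out = defaultdict(int)
    for key, value in cube.items():
        out[key[:i] + key[i + 1:]] += value
    return dict(out)


def roll_up_hierarchy(cube, dim, mapping):
    """Roll up by climbing a hierarchy: replace each value by its parent."""
    i = DIMS.index(dim)
    out = defaultdict(int)
    for key, value in cube.items():
        out[key[:i] + (mapping[key[i]],) + key[i + 1:]] += value
    return dict(out)


def slice_(cube, dim, value):
    """Slice: select a single value on one dimension."""
    i = DIMS.index(dim)
    return {k: v for k, v in cube.items() if k[i] == value}


def dice(cube, allowed):
    """Dice: select on several dimensions; allowed maps dimension -> set of values."""
    return {k: v for k, v in cube.items()
            if all(k[DIMS.index(d)] in vals for d, vals in allowed.items())}


def pivot(cube, order):
    """Pivot: reorient the cube by reordering the dimensions of every cell key."""
    idx = [DIMS.index(d) for d in order]
    return {tuple(k[i] for i in idx): v for k, v in cube.items()}


if __name__ == "__main__":
    print(roll_up_dimension_reduction(CUBE, "country"))
    print(roll_up_hierarchy(CUBE, "country", REGION))
    print(slice_(CUBE, "quarter", "1Qtr"))
    print(dice(CUBE, {"product": {"TV", "PC"}, "country": {"Canada"}}))
    print(pivot(CUBE, ("country", "product", "quarter")))
```

Drill-down is the reverse of roll-up, so it cannot be computed from the summarized result alone; it needs the lower-level cuboid (here, CUBE itself) to still be available.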
Data warehouse usage
Business executives in almost every industry use the data collected, integrated, preprocessed, and stored in data warehouses and data marts to perform data analysis and make strategic decisions. In many firms, data warehouses are used as an integral part of a plan-execute-assess "closed-loop" feedback system for enterprise management. Data warehouses are used extensively in banking and financial services, consumer goods and retail distribution sectors, and controlled manufacturing, such as demand-based production.
Typically, the longer a data warehouse has been in use, the more it will have evolved. This evolution takes place throughout a number of phases. Initially, the data warehouse is mainly used for generating reports and answering predefined queries. Progressively, it is used to analyze summarized and detailed data, where the results are presented in the form of reports and charts. Later, the data warehouse is used for strategic purposes, performing multidimensional analysis and sophisticated slice-and-dice operations. Finally, the data warehouse may be employed for knowledge discovery and strategic decision making using data mining tools.
DW and DM applications
• DW: Information processing supports querying, basic statistical analysis, and reporting. Analytical processing supports basic OLAP operations, including slice-and-dice, drill-down, roll-up, and pivoting. Operates on historical data.
• DM: knowledge discovery by finding hidden patterns and associations, constructing analytical models, performing classification and prediction.
So, now DW? What about DM?
"How does data mining relate to information processing and on-line analytical processing?"
Information processing, based on queries, can find useful information. However, answers to such queries reflect the information directly stored in databases or computable by aggregate functions. They do not reflect sophisticated patterns or regularities buried in the database. Therefore, information processing is not data mining.
"Do OLAP systems perform data mining? Are OLAP systems actually data mining systems?"
The functionalities of OLAP and data mining can be viewed as disjoint: OLAP is a data summarization/aggregation tool that helps simplify data analysis, while data mining allows the automated discovery of implicit patterns and interesting knowledge hidden in large amounts of data.
Say what?
• OLAP functions are essentially for user-directed data summary and comparison (by drilling, pivoting, slicing, dicing, and other operations).
• Data mining covers a much broader spectrum than simple OLAP operations because it not only performs data summary and comparison, but also performs association, classification, prediction, clustering, time-series analysis, and other data analysis tasks.
One last attempt: DM rules!
Data mining can help business managers find and reach more suitable customers, as well as gain critical business insights that may help to drive market share and raise profits. In addition, data mining can help managers understand customer group characteristics and develop optimal pricing strategies accordingly, correct item bundling based not on intuition but on actual item groups derived from customer purchase patterns, reduce promotional spending, and at the same time increase the overall net effectiveness of promotions.
An OLAM Architecture
[Figure: four-layer OLAM architecture. Layer 4, User Interface: User GUI API (mining query in, mining result out). Layer 3, OLAP/OLAM: OLAM Engine and OLAP Engine over the Data Cube API. Layer 2, MDDB: MDDB and Meta Data, over Filtering & Integration and the Database API. Layer 1, Data Repository: Databases and Data Warehouse, with data cleaning, data integration, and filtering.]
Thank you !!!