Chapter 1: Introduction to
Data Mining, Warehousing,
and Visualization
Modern Data Warehousing, Mining, and
Visualization: Core Concepts
by George M. Marakas
Spring 2011
Modern Data Warehousing, Mining & Visualization, 2003, George Marakas
Exercise #5
Due: Mar 24
Points: 20 points
Pratt & Adamski: Premiere Products [PP] and Henry Books [HB] Databases
Assignments must have cover sheet, table of contents, index tabs. Use 3-hole
punch notebook (1/2” or smaller). Put your name on the spine of the notebook.
Use a tab for PP and a tab for HB.
Use ACCESS, PP, and HB databases.
Redesign both PP and HB databases as they would be for a data warehouse as
described in Adamson & Venables [Chapters 1 & 2] and Marakas [Chapters 1 &
2]. Use the Star diagram as the basis for their design. Be sure to include a
meaningful Time dimension table.
Turn-in printouts of the REVISED relationship diagrams, i.e., the Star
Diagrams, for both databases.
On a separate page(s), clearly identify for each database: Fact tables,
dimension tables, primary keys, foreign keys, alternate keys, etc. Use
relational notation from Pratt & Adamski (SEE CHAP 9, ALSO).
Indicate the normal form [1NF, 2NF, 3NF, etc.] of each table.
NOTE: Use the ORIGINAL copy of the Premiere Products and Henry Books
databases for this assignment.
Homework #5 Scoresheet
Objectives
What is the purpose and motivation for developing a
Data Warehouse (DW)?
Position of DW within IT infrastructure
Relationship between DW and business data mart
What can a DW do?
Foundations for Data Mining
Steps in a typical Data mining project
What is a “Correlation”?
Correlation: KEY CONCEPT
History of Data Visualization vis-à-vis DW
1-1: The Modern Data Warehouse



A data warehouse is a copy of transaction data
specifically structured for querying, analysis and
reporting.
Note that the data warehouse contains a copy of the
transactions. These are not updated or changed later
by the transaction system.
Also note that this data is specially structured, and may
have been transformed when it was placed in the
warehouse.
1-2: Data Warehouse Roles
and Structures
The DW has the following primary functions:
• It is a direct reflection of the business rules of the
enterprise.
• It is the collection point for strategic information.
• It is the historical store of strategic information.
• It is the source of information later delivered to data
marts.
• It is the source of stable data regardless of how the
business processes may change.
Elements of a DW
Extract
Transform
Store/Load
[ETS or ETL]
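The three elements above can be sketched as a tiny pipeline. This is only a minimal illustration, not the textbook's method: the `orders` source table, the `sales_fact` warehouse table, and the cleaning rules (trim and upper-case the region, skip rows with no amount) are all assumptions made up for the example, and both databases are in-memory SQLite.

```python
import sqlite3


def extract(source_conn):
    """Extract: read raw rows from the (hypothetical) operational table."""
    return source_conn.execute(
        "SELECT order_id, order_date, region, amount FROM orders"
    ).fetchall()


def transform(rows):
    """Transform: standardize region names and skip rows with no amount."""
    cleaned = []
    for order_id, order_date, region, amount in rows:
        if amount is None:
            continue  # assumed rule: incomplete rows are not loaded
        cleaned.append(
            (order_id, order_date, region.strip().upper(), round(float(amount), 2))
        )
    return cleaned


def load(warehouse_conn, rows):
    """Load/Store: append the cleaned copy to the warehouse table."""
    warehouse_conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_fact ("
        "order_id INTEGER PRIMARY KEY, order_date TEXT, region TEXT, amount REAL)"
    )
    warehouse_conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", rows)
    warehouse_conn.commit()


# Hypothetical operational source with messy data.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE orders (order_id INTEGER, order_date TEXT, region TEXT, amount TEXT)"
)
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "2011-01-15", " east ", "19.99"),
        (2, "2011-01-16", "West", None),
        (3, "2011-02-01", "north", "5.5"),
    ],
)

warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(source)))
loaded = warehouse.execute(
    "SELECT order_id, region, amount FROM sales_fact ORDER BY order_id"
).fetchall()
print(loaded)
```

Keeping the three steps as separate functions mirrors the slide: the warehouse receives a transformed copy, and the operational table is never modified.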
Position of the Data Warehouse Within
the Organization – Figure 1-2
Data Marts




A data mart is a smaller, more focused data warehouse.
It reflects the business rules of a specific business unit.
The data mart does not need to cleanse its data
because that was done when it went into the
warehouse.
It is a set of tables for direct access by users.
These tables are designed for aggregation.
Data Marts and the Data
Warehouse – Figure 1-6
Legacy systems feed data to the warehouse. The warehouse
feeds specialized information to departments and data
marts, and vice versa.
[Figure 1-6 labels: Legacy Systems; Operational Data Store (shown four times); Organizational Data Warehouse; Finance, Sales, Marketing, and Accounting Data Marts]
The Data Mart is
More Specialized – Figure 1-7
The data mart serves the needs of one business unit, not the
organization.

Organizational Data Warehouse        Data Marts (Finance, Sales, Marketing, Accounting)
Corporate                            Departmentalized
Highly granular data                 Summarized, aggregated data
Normalized design                    Star join design
Robust historical data               Limited historical data
Large data volume                    Limited data volume
Data Model driven data               Requirements driven data
Versatile                            Focused on departmental needs
General purpose DBMS technologies    Multi-dimensional DBMS technologies
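To make "star join design" concrete, here is a small sketch of one fact table joined to two dimension tables, including a Time dimension. All table names, column names, and rows (`time_dim`, `product_dim`, `sales_fact`) are hypothetical and are not taken from the Premiere Products or Henry Books databases; it uses in-memory SQLite rather than ACCESS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables: one row per calendar date / per product.
    CREATE TABLE time_dim (
        time_key  INTEGER PRIMARY KEY,
        full_date TEXT, month INTEGER, quarter INTEGER, year INTEGER
    );
    CREATE TABLE product_dim (
        product_key  INTEGER PRIMARY KEY,
        product_name TEXT, category TEXT
    );
    -- Fact table: numeric measures plus a foreign key to each dimension.
    CREATE TABLE sales_fact (
        time_key    INTEGER REFERENCES time_dim(time_key),
        product_key INTEGER REFERENCES product_dim(product_key),
        units_sold  INTEGER,
        revenue     REAL
    );
    """
)
conn.executemany(
    "INSERT INTO time_dim VALUES (?, ?, ?, ?, ?)",
    [
        (1, "2011-01-15", 1, 1, 2011),
        (2, "2011-02-10", 2, 1, 2011),
        (3, "2011-04-05", 4, 2, 2011),
    ],
)
conn.executemany(
    "INSERT INTO product_dim VALUES (?, ?, ?)",
    [(10, "Digital camera", "Cameras"), (20, "35mm film", "Film")],
)
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
    [(1, 10, 2, 400.0), (2, 10, 1, 200.0), (2, 20, 5, 25.0), (3, 20, 4, 20.0)],
)

# Star join: the fact table joins directly to each dimension, then aggregates.
revenue_by_quarter = conn.execute(
    """
    SELECT t.year, t.quarter, p.category, SUM(f.revenue)
    FROM sales_fact AS f
    JOIN time_dim AS t ON f.time_key = t.time_key
    JOIN product_dim AS p ON f.product_key = p.product_key
    GROUP BY t.year, t.quarter, p.category
    ORDER BY t.year, t.quarter, p.category
    """
).fetchall()
print(revenue_by_quarter)
```

Because quarter and year are stored in the Time dimension, the query can summarize by period without any date arithmetic, which is the kind of aggregation a data mart is designed for.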
1-3: What Can a Data
Warehouse Do?
Some of the benefits of a DW are:
• Fast information delivery
• Data integration from across and even outside the
organization
• Future vision from historical trends
• Additional tools for looking at data in new ways
• Freedom from IS department resource limitations (you
don’t need programmers, but rather data analysts to
use a data warehouse)
• Customer Relationship Management [CRM]
• Customer Service Relationships [CRS]
• Mining or Auditing for accounting irregularities [Fraud]
Data Mining Example
Service Quality vs. Training
Courtesy: MicroStrategy (2005)
Examples of Common DW Applications
Table 1-1
Sales Analysis
• Determine real-time product sales to make vital pricing and distribution decisions.
• Analyze historical product sales to determine success or failure attributes.
• Evaluate successful products and determine key success factors.
• Use corporate data to understand the margin as well as the revenue implications of a decision.
• Rapidly identify preferred customer segments based on revenue and margin.
• Quickly isolate past preferred customers who no longer buy.
• Identify daily what product is in the manufacturing and distribution pipeline.
• Instantly determine which salespeople are performing, on both a revenue and margin basis, and which are behind.
Financial Analysis
• Compare actuals to budgets on an annual, monthly and month-to-date basis.
• Review past cash flow trends and forecast future needs.
• Identify and analyze key expense generators.
• Instantly generate a current set of key financial ratios and indicators.
• Receive near-real-time, interactive financial statements.
Human Resource Analysis
• Evaluate trends in benefit program use.
• Identify the wage and benefits costs to determine company-wide variation.
• Review compliance levels for EEOC and other regulated activities.
Other Areas
• Warehouses have also been applied to areas such as: logistics, inventory, purchasing, detailed transaction analysis and load balancing.
Table 1-2
Comparison of Typical DW Costs and Benefits
Costs
• Hardware, software, development personnel and consultant costs.
• Operational costs like ongoing systems maintenance.
Benefits
Added Revenue
• Will the new (business objective) process generate new customers (what is the
estimated value?)
• Will the new (business objective) process increase the buying propensity of
existing customers (by how much?)
• Is the new process necessary to ensure that the competition doesn't offer a
demanded service that you can't match?
Reduced costs
• What costs of current systems will be eliminated?
• Is the new process intended to make some operation more efficient? If so, how
and what is the dollar value?
1-4: The Cost of DW




Expenditures can be categorized as one-time initial
costs or as recurring, ongoing costs.
The initial costs can further be identified as for hardware
or software.
Expenditures can also be categorized as capital costs
(associated with acquisition of the warehouse) or as
operational costs (associated with running and
maintaining the warehouse).
Cost of a Data Warehouse:
• Rule of Thumb: $1 million per 1 Terabyte of data
Courtesy Walmart Corporation.
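As a quick worked example of the rule of thumb, the sketch below multiplies a planned data volume by the quoted figure. The function name and the sample volume of 2.5 TB are assumptions for illustration only; the result is a rough order-of-magnitude estimate, not a budget.

```python
COST_PER_TERABYTE_USD = 1_000_000  # rule of thumb quoted on the slide


def estimate_dw_cost(terabytes, cost_per_tb=COST_PER_TERABYTE_USD):
    """Rough warehouse cost estimate: data volume times cost per terabyte."""
    if terabytes < 0:
        raise ValueError("terabytes must be non-negative")
    return terabytes * cost_per_tb


# Hypothetical 2.5 TB warehouse.
print(estimate_dw_cost(2.5))  # 2500000.0
```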
Expenditures Associated with Building a DW
Table 1-3
One-Time Costs
Capital
  Hardware: Disk, CPU, Network, Terminal analysis
  Software: DBMS, Terminal analysis, Middleware, Log utility, Processing, Metadata infrastructure
Operational
  Integration/transformation processing specification
  Metadata infrastructure population
  System of record definition
  Data dictionary language definition
  Network transfer definition
  CASE/Repository interface
  Initial data warehouse population
  Data model definition
  Database design definition

Recurring Costs
Capital
  Hardware maintenance
  Software maintenance
  Terminal analysis
  Middleware
Operational
  Ongoing refreshment
  Integration transformation
  Data model maintenance
  Record identification maintenance
  Metadata infrastructure maintenance
  Archival of data
  Data aging within the DW
1-5: Data Mining:
Farmers and Explorers


Every corporation has two types of DW users.
• Farmers [Traditional Statistical Hypothesis
testing] know what they want before they set out to
find it. They submit small queries and retrieve small
nuggets of information.
• Explorers [Data Mining] are quite unpredictable.
They often submit large queries. Sometimes they
find nothing, sometimes they find priceless “golden”
nuggets.
Cost justification for the DW is usually done on the basis
of the results obtained by farmers since explorers are
unpredictable.
1-6: Foundations of Data
Mining




Data mining is the process of using raw data to infer
important business relationships.
Despite a consensus on the value of data mining, a
great deal of confusion exists about what it is.
It is a collection of powerful techniques intended for
analyzing large datasets.
There is no single data mining approach, but rather a
set of techniques that can be used in combination with
each other.
1-6 & -7: The Foundations of
Data Mining




Data mining has roots in practice dating back over 30
years using standard statistics [e.g., bio-statistics;
BIOMED software and mainframe computers (1960’s)]
In the early 1960s, data mining was called statistical
analysis, and the pioneers were statistical software
companies such as SAS and SPSS.
By the 1980s, the traditional techniques had been
augmented by new methods such as fuzzy logic, heuristics
and neural networks.
Also, DSS tools came into popular use in the 1980’s with
tools such as Lotus 1-2-3 & EXCEL
Data Mining – A General Approach
Although all data mining endeavors are unique,
they possess a common set of process steps:
1. Infrastructure preparation – choice of hardware
platform, the database system and one or
more mining tools
2. Exploration – looking at summary data,
sampling and applying intuition [Data
visualization useful here]
3. Analysis – each discovered pattern is
analyzed for significance and trends
A General Approach
(continued)
4. Interpretation – Once patterns have been
discovered and analyzed, the next step is to
interpret them. Considerations include
business cycles, seasonality and the
population the pattern applies to.
5. Exploitation – this is both a business and a
technical activity. One way to exploit a
pattern is to use it for prediction. Others are to
package, price or advertise the product in a
different way.
1.8: The Approach to Data
Exploration and Data Mining
The basis for all data mining activities is CORRELATION.
[Figure: three panels relating A and B, labeled A Perfect Correlation, A Strong Correlation, and A Weak Correlation]
The Spectrum of
Correlation
1 / (-1): Perfect (Inverse) Correlation
.5 / (-.5): Moderate Correlation
0: No Correlation
• In general, a correlation coefficient is a
number between 0 and ±1 that shows the strength
of a relationship.
• Some types of correlation are signed (±) to also
show the direction of the relationship.
• Even a weak correlation can be interesting,
however, if it shows a trend over time.
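One common signed coefficient of this kind is Pearson's r. The slides do not name a specific coefficient, so choosing Pearson here is an assumption, and the sample series are made up. The sketch shows how the same function yields +1, -1, and a weak value in between.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one series is constant")
    return cov / math.sqrt(var_x * var_y)


a = [1, 2, 3, 4, 5]
perfect = pearson_r(a, [2, 4, 6, 8, 10])   # B rises exactly with A
inverse = pearson_r(a, [10, 8, 6, 4, 2])   # B falls exactly as A rises
weak = pearson_r(a, [3, 1, 4, 1, 5])       # B only loosely follows A
print(perfect, inverse, round(weak, 2))
```

The sign carries the direction of the relationship and the magnitude carries its strength, matching the spectrum above.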
Perfect positive or negative
correlations
+1
-1
Methods to Determine
Correlation
The method used depends on the type of elements being correlated.
• Data element vs. data element
e.g., Sales of Digital Cameras vs. Sales of 35mm film
• Data element vs. unit of time
e.g., Sales of Digital Cameras vs. Months; TIME SERIES
• Data element vs. data element groups
e.g., Sales of Digital Cameras vs. Public or Private Schools
• Data element vs. geography
e.g., Sales of Digital Cameras vs. Region
• Data element vs. external trends
e.g., Sales of Digital Cameras vs. Tax cuts
• Data element vs. demographics
e.g., Sales of Digital Cameras vs. Age
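For the "data element vs. unit of time" case, one simple way to quantify a trend is the least-squares slope of the series against its month index. This is a sketch under assumptions: the slides do not prescribe a method, and the monthly sales figures below are invented for illustration.

```python
def trend_slope(values):
    """Least-squares slope of values against their positions 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least 2 observations")
    mean_t = (n - 1) / 2          # mean of the time index 0..n-1
    mean_v = sum(values) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in enumerate(values))
    den = sum((t - mean_t) ** 2 for t in range(n))
    return num / den


# Hypothetical digital camera unit sales for six consecutive months.
monthly_camera_sales = [100, 110, 125, 130, 150, 160]
slope = trend_slope(monthly_camera_sales)
print(round(slope, 2))  # average change in units sold per month
```

A positive slope indicates sales rising over the period; the same function applied to a second series (for example film sales) allows the two trends to be compared.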
The Data Warehouse and
Data Mining
• Data mining does not require the use of a data
warehouse (DW); however, DWs are designed
with data mining in mind.
• The data in the DW is integrated and stable
(non-volatile).
• Data changes continuously in an operational
database.
• If multiple analyses are run in sequence, the
data need to be held constant (as in a DW).
Volumes of Data – The
Biggest Challenge
• The largest challenge a “data miner” may face is
the sheer volume of data (number of rows vs.
the number of bytes) in the warehouse.
• It is quite important, then, that summary data
also be available to get the analysis started.
• A major problem is that this sheer volume may
mask the important relationships the analyst is
interested in.
• The ability to overcome the volume and
visualize the data becomes quite important.
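The role of summary data can be illustrated with a small roll-up: many detail rows are reduced to one total and one count per group, which is a far smaller starting point for analysis. The row layout (region, quarter, amount) and the sample values are assumptions for this sketch.

```python
from collections import defaultdict


def summarize(rows):
    """Roll detail rows (region, quarter, amount) up to a total and a count per group."""
    summary = defaultdict(lambda: {"total": 0.0, "count": 0})
    for region, quarter, amount in rows:
        cell = summary[(region, quarter)]
        cell["total"] += amount
        cell["count"] += 1
    return dict(summary)


# Hypothetical detail rows; a real warehouse would hold millions of these.
detail_rows = [
    ("East", "Q1", 120.0),
    ("East", "Q1", 80.0),
    ("West", "Q1", 50.0),
    ("East", "Q2", 200.0),
]
summary = summarize(detail_rows)
for key in sorted(summary):
    print(key, summary[key])
```

An analyst can scan or chart the summary first, then drill back into the detail rows only for the groups that look interesting.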
RFID Technology
http://www.pbs.org/newshour/bb/science/july-dec06/rfid_08-17.html
1.9: Foundations of Data
Visualization [DV]
• One of the earliest known examples of data
visualization was in London during the 1854
cholera epidemic. A map (next slide) helped
to identify the source of the disease.
• Modern visualization techniques grew from the
twin technologies of computer graphics and
high performance computing in the 1970s
and 1980s.
Dr. John Snow used a map to show the source of cholera was
a water pump, thus proving the disease was water borne.
[Map label: Broad Street Pump]
DV: Opportunity and Timing
• Alternative input devices (light pen, sketch pad
and mouse) began to appear in the 1960s.
• In the 1970s, flight simulators became much
more realistic when graphics replaced film.
• In the same decade, special effects computers
became entrenched in the entertainment
industry.
• In the 1980s, visualization grew more dynamic
with applications like the animation of weather
patterns.
One of today’s more useful types of visualization is in simulators
(both in games and in practice). This is the only way most of us
will ever fly a Boeing 747 [Note: Instrument panel or Dashboard].
Data Visualization – Sales by Region
Typical Spreadsheet Graphic
[Chart: sales by region (East, West, North) for the 1st through 4th quarters; vertical axis from 0 to 90]
Data Visualization – Total Precipitation
DV & DM:
Future Success Drivers



In the 1990s, rapid advances in chip technology, both
at the CPU and the graphics processor, put data
visualization everywhere (Moore’s Law).
On-going reduced costs of computing.
• Each new generation has a 10X-100X performance-cost improvement.
• Approximately every 18 months [Moore’s Law].
Web-based Ecommerce
• Business to Consumer Commerce [B to C; and C:C]
• Generates billions and even trillions of characters per
reporting period
The End