Analytical data bases
Data cubes
Database lectures for mathematics students
Zbigniew Jurkiewicz, Institute of Informatics UW
May 22, 2016
Decision support systems
By time horizon, the decisions made in an organization can be
divided into three categories:
operational decisions, within the scope of days or weeks;
tactical decisions, whose effects range from a few months
to one year;
strategic decisions, which affect the organization's
development for the next few years.
It has been observed that when moving from operational
decisions towards the strategic ones, the procedures used
become less and less algorithmic and formalized.
Decision support systems
Initially, businesses used computer systems mostly for
operational data processing, in applications such as sales
order management, invoicing, or warehouse inventory.
Gradually computers came to be used for less routine
activities; such applications are called Decision Support
Systems (DSS).
They are also known under other popular names:
BIS/BIT (Business Intelligence System/Technology),
EIS (Executive Information System).
In addition to “mechanical” data processing they also
provide various mechanisms for deducing new information
from the facts contained in the database.
This led to a division of “database” applications into
operational (transactional) and analytical.
Requirements for decision support systems
Information should usually be presented in a summarized
form.
There is no standard access path: the methods of selecting
and formatting the information to be presented are highly
varied and change dynamically.
The selected information must be combined with other
computational resources (spreadsheets, specialized
statistical packages).
Analytical data processing
Commonly known as On-Line Analytical Processing,
abbreviated usually to OLAP.
Typical applications: trend analysis, looking for patterns of
behavior, looking for anomalies.
Used interactively, so efficiency is very important,
especially time efficiency.
If a user observes that some queries (e.g. those based on 5
or more joins) are executed very slowly, she will try to avoid
them.
It is assumed that the answers for 90% of queries should
be available within 10 seconds.
Analytical data bases
Also called On-Line Analytical Processing (OLAP)
databases.
Growing in importance.
From personal computers to large client-server
configurations.
Many buzzwords: roll-up and drill-down, drill-through,
MOLAP, pivoting.
Main issues
What is an analytical database?
Models and operations
Implementing an analytical database
Development trends
Matrix reports
Data analysis was first supported by matrix reports.
Matrix reports look like spreadsheets.
They are often concerned with finances or management.
A sales system should, for example, contain a report about
customers and their buying patterns, broken down by region
of the country.
However, instead of analysing buying patterns for each
product, we divide products into categories.
The report would then have product categories as columns,
regions as rows, and each cell would show the
number of items sold in that category in that region.
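The matrix report described above can be sketched in a few lines of Python. This is only an illustration: the region names, category names and quantities are invented, and `matrix_report` is a hypothetical helper, not part of any reporting tool.

```python
from collections import Counter

# Hypothetical sales records: (region, product category, items sold).
sales = [
    ("North", "Beverages", 120),
    ("North", "Snacks", 80),
    ("South", "Beverages", 95),
    ("South", "Snacks", 60),
    ("North", "Beverages", 30),
]

def matrix_report(records):
    """Return (regions, categories, cells); cells maps (region, category) to items sold."""
    cells = Counter()
    for region, category, items in records:
        cells[(region, category)] += items
    regions = sorted({r for r, _, _ in records})
    categories = sorted({c for _, c, _ in records})
    return regions, categories, cells

regions, categories, cells = matrix_report(sales)

# Regions become rows, product categories become columns.
print("Region".ljust(10) + "".join(c.rjust(12) for c in categories))
for r in regions:
    print(r.ljust(10) + "".join(str(cells[(r, c)]).rjust(12) for c in categories))
```

Each cell is a sum over all matching records, which is exactly the pre-aggregation that later slides assign to the analytical database.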
Data Mart
Small data warehouse, sometimes called thematic
database
Covers only some areas (themes) of the enterprise, e.g.
marketing: customers, products, sales
Model adapted to the needs of a department.
Usually the information is initially preaggregated
Elimination of unnecessary details
Some critical level of details selected.
Tools for querying and analysis
Query builders
Report generators
comparisons: growth, decrease
trends,
graphs
Spreadsheets
WWW interface
Data mining
Other operations
Functions over time
e.g. averages on different periods
Computed attributes
e.g. profit = sales * rate
Textual queries, e.g.
find all documents containing words A and B
order documents by frequency of occurrence of words X, Y
and Z
Data models and operators
Data models
relation
star and snowflake
cube: extension of spreadsheet idea (multidimensional
tables, dimensions indexed by database values)
Operators
slice & dice
roll-up, drill down
pivoting
other
Multidimensional data model
Multidimensional databases are popular mainly because of
their analytical data model, which takes the form of a
multidimensional cube containing:
facts (also called measures), e.g. the number of cars sold;
dimensions, e.g. months, regions of sale.
Dimensions
Dimensions usually form hierarchies, e.g. for time
dimension the hierarchy will be year-quarter-month-day.
Hierarchies enable the interactive change of detail level
(granularity) of the information presented.
In more complex models the hierarchies can branch, e.g.
division into weeks is incompatible with division into
months.
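A short Python sketch can make the branching concrete. It assumes ISO week numbering, which is one possible definition of "week"; the function name `time_levels` is hypothetical. The two dates chosen fall in the same week but in different months, so weeks cannot sit below months in a single chain.

```python
from datetime import date

def time_levels(d):
    """Map a day to its coordinates on two branches of the time hierarchy."""
    quarter = (d.month - 1) // 3 + 1
    iso_year, iso_week, _ = d.isocalendar()
    return {
        "year": d.year,
        "quarter": (d.year, quarter),
        "month": (d.year, d.month),
        "week": (iso_year, iso_week),   # separate branch: day -> week
    }

monday = time_levels(date(2016, 5, 30))
sunday = time_levels(date(2016, 6, 5))

print(monday["week"] == sunday["week"])    # same week
print(monday["month"] == sunday["month"])  # different months
```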
Time
Time dimension needs a special treatment
It is hidden — there is no separate table for time.
Time is unique as a dimension because it is sequential in
character.
We might ask to see the sales for May or the sales for the
first three months of 2007.
But we would rarely ask to see the sales for the first five
goods (even assuming they are ordered by name).
Method of aggregation for time depends on the meaning of
the measure.
If a company sold 10 computers in January, 15 computers
in February, and 10 computers in March, then typical query
would ask for total number (i.e. sum) sold for the first
quarter.
On the other hand, if a company employed 10 people
in January, 7 in February, and 10 again in March, then we
would usually ask about the average headcount for the quarter.
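The two examples above can be written down as a small Python sketch. The idea it illustrates is that the aggregation rule belongs to the measure; the dictionary `AGGREGATORS` and the function `quarterly` are hypothetical names used only here.

```python
from statistics import mean

# Monthly values taken from the two examples in the text.
computers_sold = {"Jan": 10, "Feb": 15, "Mar": 10}   # a flow: units sold during the month
employees = {"Jan": 10, "Feb": 7, "Mar": 10}          # a level: headcount in the month

# The aggregation rule is attached to the measure, not to the time dimension.
AGGREGATORS = {"computers_sold": sum, "employees": mean}

def quarterly(measure_name, monthly_values):
    """Aggregate three monthly values into one quarterly value."""
    return AGGREGATORS[measure_name](list(monthly_values.values()))

print(quarterly("computers_sold", computers_sold))  # 35
print(quarterly("employees", employees))            # 9
```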
Database
The data is usually taken from a data warehouse (real or
virtual).
Directly storing information for all facts and all levels of
detail in the database could be very costly in terms of space,
so:
Only data for the most frequently used levels of the
hierarchies is stored.
Other data is computed from the stored data on the fly when
needed.
When aggregating measures it is important to take into
account various rules of aggregation, e.g.
Sales amount is usually summed.
Temperature or price will rather be averaged.
The analytical database stores as a rule only aggregated
data.
To see the detail data (drill-through) it is necessary to fetch
it from the data warehouse or the operational database.
Because this takes a lot of time, such a need should not
arise too often.
Operations on data
Cutting and projecting onto the cross-section surface (slice
and dice).
Change of detail level (drill-down and roll-up).
Turning (pivot): changes the dimensions visible in the
“image”.
Approaches to building the OLAP database
1. ROLAP = “Relational OLAP”: we adapt a relational
DBMS to a star or snowflake schema.
2. MOLAP = “Multidimensional OLAP”: we use a specialized
DBMS based on the “datacube” model.
Star schema
Star schema is a typical method of data organization in
relational database for OLAP.
It is composed of:
Fact table: a large set of facts, such as information about
sales amounts.
Dimension tables: smaller, static information about the
objects that the facts deal with.
Generalization: snowflake model.
Hierarchies of tables for particular dimensions: dimension
table normalization.
Example star schema
We want the OLAP database to hold information about
beer sales: pub, beer name, drinker who bought it,
day, hour, and price.
We take the following relation as our fact table:
Sales(pub,beer,drinker,day,hour,price)
Example, cont.
Dimension tables contain information about pubs, beers
and drinkers:
Pubs(pub, address, licence)
Beers(beer, prod)
Drinkers(drinker, address, phone)
Dimension attributes and dependent attributes
Two kinds of attributes exist in the fact table:
Dimension attributes: the keys of the dimension tables.
Dependent attributes: the values associated with particular
combinations of dimension attribute values.
Example: dependent attribute
price is a dependent attribute in the example relation Sales.
Its value is determined by the combination of dimension
attributes: pub, beer, drinker and time (the combination of
day and hour).
ROLAP optimization techniques
Bitmap indexes: for each value of the index key in a
dimension table (e.g. for each beer in the Beers table) we
create a bit vector showing which tuples in a fact table
contain this value.
Materialized views: the OLAP database (or even the data
warehouse) stores precomputed answers for some useful
queries (perspectives).
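As an illustration of the bitmap index idea, here is a minimal Python sketch that uses an integer as a bit vector. It shows the principle only and is not how any particular DBMS stores its indexes; the fact rows and the helper names `build_bitmap_index` and `matching_rows` are invented for the example.

```python
# Fact table rows reduced to (pub, beer, price); the values are made up.
sales = [
    ("Pod Żaglem", "Bud", 5.0),
    ("Pod Żaglem", "Okocim", 4.5),
    ("Joe's Pub", "Bud", 5.5),
    ("Joe's Pub", "Bud", 5.5),
    ("Nut-House", "Okocim", 4.0),
]

def build_bitmap_index(rows, column):
    """One bit vector per distinct value; bit i is set if row i holds that value."""
    index = {}
    for i, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << i)
    return index

def matching_rows(bitmap):
    """Decode a bit vector back into row numbers."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

pub_index = build_bitmap_index(sales, 0)
beer_index = build_bitmap_index(sales, 1)

# A conjunction of two conditions becomes a bitwise AND of two vectors.
hits = pub_index["Joe's Pub"] & beer_index["Bud"]
print(matching_rows(hits))
```

The fact table is touched only for the rows whose bits survive the AND, which is why bitmap indexes suit queries that filter on several dimensions at once.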
Typical OLAP query
OLAP query often starts with “star join”: the natural join of
the fact table with all or most dimension tables.
Example:
SELECT *
FROM Sales, Pubs, Beers, Drinkers
WHERE Sales.pub = Pubs.pub
AND Sales.beer = Beers.beer
AND Sales.drinker = Drinkers.drinker;
Typical OLAP query
Starts with a star join.
Selects interesting tuples using data from dimension
tables.
Groups on one or more dimensions.
Aggregates some attributes of the result.
Example OLAP query
For each pub in Poznań show the total sale of each beer
produced by Anheuser-Busch brewery.
Filter: address = “Poznań” and prod = “Anheuser-Busch”.
Grouping: by pub and beer.
Aggregation: Sum over price.
Example: SQL
SELECT pub, beer, SUM(price)
FROM Sales NATURAL JOIN Pubs
NATURAL JOIN Beers
WHERE address = 'Poznań'
AND prod = 'Anheuser-Busch'
GROUP BY pub, beer;
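To try the query, one option is an in-memory SQLite database driven from Python. The sketch below assumes SQLite (the lecture does not name a DBMS), uses the column name address from the Pubs table, and fills the tables with a few invented rows; the licence value and the producer of Okocim are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Pubs(pub TEXT PRIMARY KEY, address TEXT, licence TEXT);
CREATE TABLE Beers(beer TEXT PRIMARY KEY, prod TEXT);
CREATE TABLE Sales(pub TEXT, beer TEXT, drinker TEXT, day TEXT, hour INTEGER, price REAL);

-- Invented sample data.
INSERT INTO Pubs VALUES ('Pod Żaglem', 'Poznań', 'full'), ('Joe''s Pub', 'Warszawa', 'full');
INSERT INTO Beers VALUES ('Bud', 'Anheuser-Busch'), ('Okocim', 'Okocim');
INSERT INTO Sales VALUES
  ('Pod Żaglem', 'Bud',    'Jim',  '2016-05-20', 20, 5.0),
  ('Pod Żaglem', 'Bud',    'Mary', '2016-05-20', 21, 5.0),
  ('Pod Żaglem', 'Okocim', 'Bob',  '2016-05-21', 19, 4.0),
  ('Joe''s Pub', 'Bud',    'Bob',  '2016-05-21', 22, 6.0);
""")

# Star join, filter on dimension attributes, group, aggregate.
query = """
SELECT pub, beer, SUM(price)
FROM Sales NATURAL JOIN Pubs NATURAL JOIN Beers
WHERE address = 'Poznań' AND prod = 'Anheuser-Busch'
GROUP BY pub, beer;
"""
rows = conn.execute(query).fetchall()
print(rows)
```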
Materialized views
Executing our query directly on the table Sales and the
dimension tables may take far longer than we can accept.
If we created a materialized view containing the
appropriate information, we could give the answer much
faster.
Example: materialized view
Which view could help us?
Basic requirements:
1. Must join at least Sales, Pubs and Beers.
2. Must group at least by pub and beer.
3. Does not need to select pubs in Poznań nor beers from
Anheuser-Busch.
4. Does not need to omit columns address and prod.
Example
Here is a useful view:
CREATE VIEW PuBeS(pub, address, beer,
prod, sale) AS
SELECT pub, address, beer, prod,
SUM(price) AS sale
FROM Sales NATURAL JOIN Pubs
NATURAL JOIN Beers
GROUP BY pub, address, beer, prod;
Because pub → address and beer → prod, grouping by address and
prod is redundant, but it is necessary because address and prod
occur in the SELECT clause.
Example — finale
The reformulated query (now it uses the materialized view
PuBeS):
SELECT pub, beer, sale
FROM PuBeS
WHERE address = 'Poznań'
AND prod = 'Anheuser-Busch';
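The following sketch checks that the reformulated query returns the same answer as the direct one. It again assumes SQLite driven from Python. SQLite has no materialized views, so the view is emulated here with CREATE TABLE ... AS, which computes the aggregate once and stores it; the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Pubs(pub TEXT PRIMARY KEY, address TEXT, licence TEXT);
CREATE TABLE Beers(beer TEXT PRIMARY KEY, prod TEXT);
CREATE TABLE Sales(pub TEXT, beer TEXT, drinker TEXT, day TEXT, hour INTEGER, price REAL);
INSERT INTO Pubs VALUES ('Pod Żaglem', 'Poznań', 'full');
INSERT INTO Beers VALUES ('Bud', 'Anheuser-Busch');
INSERT INTO Sales VALUES
  ('Pod Żaglem', 'Bud', 'Jim',  '2016-05-20', 20, 5.0),
  ('Pod Żaglem', 'Bud', 'Mary', '2016-05-20', 21, 5.0);

-- Stand-in for a materialized view: the aggregate is computed once and stored.
CREATE TABLE PuBeS AS
  SELECT pub, address, beer, prod, SUM(price) AS sale
  FROM Sales NATURAL JOIN Pubs NATURAL JOIN Beers
  GROUP BY pub, address, beer, prod;
""")

# The original query against the star schema.
direct = conn.execute("""
  SELECT pub, beer, SUM(price)
  FROM Sales NATURAL JOIN Pubs NATURAL JOIN Beers
  WHERE address = 'Poznań' AND prod = 'Anheuser-Busch'
  GROUP BY pub, beer
""").fetchall()

# The reformulated query against the stored aggregate.
from_view = conn.execute("""
  SELECT pub, beer, sale FROM PuBeS
  WHERE address = 'Poznań' AND prod = 'Anheuser-Busch'
""").fetchall()

print(direct == from_view)
```

A table created this way is not refreshed when Sales changes, so it has to be rebuilt or updated separately; that is the updating cost of materialization.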
Materialization aspects
Type and frequency of queries
Computing time for queries
Storage costs
Updating costs
MOLAP and datacubes
The (keys of) dimension tables become the dimensions of
the hypercube.
Example: for data from the Sales table we have 4 dimensions:
pub, beer, drinker and time.
Dependent attributes (e.g. price) are located in the points
(cells) of the hypercube.
Visualization — hypercubes
Borders
Often a cube should also contain aggregations (usually
SUM or AVG) along the hyperedges of the cube.
Borders contain one-dimensional, two-dimensional,
... aggregations.
Example: borders
Our 4-dimensional hypercube Sales may contain sums of
price for each pub, each beer, each drinker (hmm...
sensitive personal data) and each time unit (probably days).
It could also contain sums of price for all pub-beer pairs,
pub-drinker-day triples, ...
Structure of the cube
We extend each dimension with one additional value *.
A cell with one or more coordinates equal to * contains
the aggregate over all values of the dimensions marked with *.
Example: Sales('Pod Żaglem', 'Bud', *, *) contains the total
price of the Bud beer drunk in the
pub “Pod Żaglem” by all drinkers at any time.
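The * convention can be sketched in Python as follows. This naive version materializes every aggregate cell, which a real MOLAP engine would not necessarily do; the fact rows are invented and `build_cube` is a hypothetical helper.

```python
from collections import defaultdict
from itertools import product

# Fact rows: ((pub, beer, drinker, day), price); the data is invented.
facts = [
    (("Pod Żaglem", "Bud", "Jim", "Mon"), 5.0),
    (("Pod Żaglem", "Bud", "Mary", "Tue"), 5.0),
    (("Pod Żaglem", "Okocim", "Jim", "Mon"), 4.0),
    (("Joe's Pub", "Bud", "Bob", "Mon"), 6.0),
]

def build_cube(rows):
    """Sum the measure into every cell obtained by replacing coordinates with '*'."""
    cube = defaultdict(float)
    for coords, value in rows:
        # Each coordinate either keeps its value or is generalized to '*'.
        for cell in product(*[(c, "*") for c in coords]):
            cube[cell] += value
    return cube

sales = build_cube(facts)
print(sales[("Pod Żaglem", "Bud", "*", "*")])   # 10.0
print(sales[("*", "*", "*", "*")])              # 20.0
```

With four dimensions every fact contributes to 2^4 = 16 cells, which shows why storing all borders quickly becomes expensive.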
Drill-down
Drill-down = “deaggregation”: decompose an
aggregate into its components.
Example: after finding that “Pod Żaglem” sells little Okocim
beer, one may try to break these sales down into particular
kinds of Okocim.
Roll-up
Roll-up = additional aggregation on one or more
dimensions.
Example: given a table showing how much Okocim
beer is drunk by each drinker in each pub, we roll it up into a
table giving the total amount of Okocim beer drunk by each
drinker.
Roll-up and drill-down
Anheuser-Busch sales by drinker and pub:
        Joe’s Pub   Nut-House   Blue Chalk
Jim        45          50          38
Bob        33          36          31
Mary       30          42          40
Rolling up by pubs (A-B sales per drinker):
Jim    133
Bob    100
Mary   112
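The roll-up can be reproduced with a small Python sketch. The per-pub figures are taken from the drinker/pub table; `roll_up` is a hypothetical helper that sums away every dimension not listed in `keep`.

```python
# Anheuser-Busch sales per (drinker, pub), from the drinker/pub table.
by_drinker_pub = {
    ("Jim", "Joe's Pub"): 45, ("Jim", "Nut-House"): 50, ("Jim", "Blue Chalk"): 38,
    ("Bob", "Joe's Pub"): 33, ("Bob", "Nut-House"): 36, ("Bob", "Blue Chalk"): 31,
    ("Mary", "Joe's Pub"): 30, ("Mary", "Nut-House"): 42, ("Mary", "Blue Chalk"): 40,
}

def roll_up(cells, keep):
    """Aggregate away every dimension whose position is not listed in `keep`."""
    result = {}
    for coords, value in cells.items():
        key = tuple(coords[i] for i in keep)
        result[key] = result.get(key, 0) + value
    return result

by_drinker = roll_up(by_drinker_pub, keep=[0])   # drop the pub dimension
by_pub = roll_up(by_drinker_pub, keep=[1])       # drop the drinker dimension
print(by_drinker)
print(by_pub)
```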
Roll-up and drill-down
Drilling down by beers (A-B beers per drinker):
        Bud   M’lob   Bud Light
Jim      40     45       48
Bob      29     31       40
Mary     40     37       35
Materialized views for datacubes
Useful materialized views for datacubes should aggregate
by one or more dimensions.
A dimension need not be aggregated away completely; it can
instead be grouped by some attribute from its dimension table.
Example
A materialized view for our Sales hypercube could:
1. Aggregate totally by drinker.
2. Not aggregate at all by beer.
3. Aggregate by time using weeks.
4. Aggregate pubs by town.
Indexes
Traditional techniques
B-trees, hash tables, R-trees, grids, ...
Specific
inverted lists
bitmap indexes
join indexes
Using inverted lists
Query:
Find people with age = 20 and name = “Fred”
List for age = 20: r4, r18, r34, r35
List for name = “Fred”: r18, r52
The answer is obtained as intersection: r18
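A minimal Python sketch of the same lookup follows. The row identifiers match the lists above; the other names and the age 31 are invented to fill in the records, and `build_inverted_lists` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical records keyed by row id; only r18 satisfies both conditions.
people = {
    "r4": {"name": "Anna", "age": 20},
    "r18": {"name": "Fred", "age": 20},
    "r34": {"name": "Jan", "age": 20},
    "r35": {"name": "Ewa", "age": 20},
    "r52": {"name": "Fred", "age": 31},
}

def build_inverted_lists(records, attribute):
    """Map each value of `attribute` to the set of row ids holding it."""
    lists = defaultdict(set)
    for rid, record in records.items():
        lists[record[attribute]].add(rid)
    return lists

by_age = build_inverted_lists(people, "age")
by_name = build_inverted_lists(people, "name")

# The conjunction of the two conditions is the intersection of the two lists.
answer = by_age[20] & by_name["Fred"]
print(answer)
```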
MDX
Multidimensional Expressions (MDX): a query language for
MOLAP, initially part of OLE DB (Microsoft, 1997).
Later used by Microsoft OLAP Services 7.0 and Microsoft
Analysis Services.
XML for Analysis uses MDX as its query language.
Supported by Applix, Oracle, SAS, SAP, Panorama
Software, Cognos, Hyperion Solutions and others.
In 2001 the XMLA Council (www.xmla.org) published the
standard for XML for Analysis, with the query language
mdXML (MDX enclosed in the XML <Statement> tag).
Example query in MDX
SELECT
{ [Measures].[Sales in shops] } ON COLUMNS,
{ [Date].[2002], [Date].[2003] } ON ROWS
FROM Sales
WHERE ( [Shop].[Europe].[Poland] )
The SELECT clause determines the “axes” of the query:
Sales in shops from the Measures dimension, and 2002
plus 2003 from the Date dimension.
The FROM clause indicates that the data source is the
hypercube Sales.
The WHERE clause defines the “cross-section” as the
element Poland of the dimension Shop.
Trends
Oracle: Essbase (after taking over Hyperion), BI Server.
IBM: Cognos 8 BI (together with ‘PowerPlay Studio’),
database TM1 (Applix).
Microsoft: database Panorama (included in SQL Server
7), two analysis tools (Maximal and ProClarity), integration
with Excel, SharePoint and Visio. Planned in-memory tool
Gemini.
Data Mining
Automatic search for “interesting” patterns and trends in
data.
The term data mining is mostly used to describe the
summarization of large data sets in a useful way.
It shows regularities, often written as rules.
It uses inductive methods, as opposed to deductive databases
(such as Datalog).
Consequence: the results are never universally
guaranteed; they may be an effect of the momentary
contents of the database.
Examples
Grouping all WWW pages according to subject.
Preventing credit card fraud: finding characteristic properties of
illegal credit card transactions.
Searching for associations, e.g. finding goods often bought
together.
Finding similar sequences of behavior, e.g. shares with
similar oscillations of quotations.
Characteristics
Basically, a nontrivial automatic extraction of unknown and
potentially useful information contained implicitly in the
database.
Based on searching for patterns in data, without constructing
hypotheses beforehand.
This differs from the classical statistical approach, where
an analyst builds hypotheses and tries to verify them on a
sample from the database.
More trouble arises in situations when patterns are discovered
in a recursive decision process.
Information in the database is often noisy and incomplete,
so some statistical knowledge is necessary anyway.
Technology
Generally artificial intelligence, machine learning, neural
networks, association rules, rough sets.
Classification and forecasting: building a classifier for
categories given in advance.
Cluster analysis: categories are defined during the analysis.
Pattern recognition and searching.
Decision trees.
Clustering: issues
Partitioning data into automatically generated categories
Do we have the expected number of groups?
How to find the “best” groups?
Are groups semantically meaningful?
e.g. “yuppies”
Market-basket analysis
Market baskets = sets of goods which are bought together
by a customer during one visit to the shop.
Summary of market-baskets: frequent sets of items — sets
of goods often found together.
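As a rough illustration, frequent pairs of goods can be found by plain counting. The sketch below is a naive enumeration of all pairs, not an optimized algorithm such as Apriori; the baskets, the threshold `min_support` and the function name `frequent_pairs` are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

# Invented baskets: one set of goods per shop visit.
baskets = [
    {"beer", "crisps", "bread"},
    {"beer", "crisps"},
    {"bread", "milk"},
    {"beer", "crisps", "milk"},
]

def frequent_pairs(baskets, min_support):
    """Return the pairs of items that occur together in at least `min_support` baskets."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives each pair one canonical order, e.g. ("beer", "crisps").
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

print(frequent_pairs(baskets, min_support=2))
```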
Tools
Weka: New Zealand
Rses and Rses-lib: MIMUW.
SAS