Hierarchical Clustering for OLAP: the CUBE File Approach
NIKOS KARAYANNIDIS AND TIMOS SELLIS
Institute of Communication and Computer Systems and
School of Electrical and Computer Engineering,
National Technical University of Athens,
Zographou 15773, Athens, Hellas
Phone: +30-210-772-1601
Fax: +30-210-772-1442
{nikos,timos}@dblab.ece.ntua.gr
Abstract. This paper deals with the problem of physical clustering of multidimensional data that
are organized in hierarchies on disk in a hierarchy-preserving manner. This is called hierarchical
clustering. A typical case, where hierarchical clustering is necessary for reducing I/Os during
query evaluation, is the most detailed data of an OLAP cube. The presence of hierarchies in the
multidimensional space results in an enormous search space for this problem. We propose a
representation of the data space that results in a chunk-tree representation of the cube. The model
is adaptive to the cube’s extensive sparseness and provides efficient access to subsets of data based
on hierarchy value combinations. Based on this representation of the search space we formulate
the problem as a chunk-to-bucket allocation problem, which is a packing problem as opposed to
the linear ordering approach followed in the literature.
We propose a metric to evaluate the quality of hierarchical clustering achieved (i.e., evaluate the
solutions to the problem) and formulate the problem as an optimization problem. We prove its NP-Hardness and provide an effective solution based on a linear-time greedy algorithm. The solution
of this problem leads to the construction of the CUBE File data structure. We analyze in-depth all
steps of the construction and provide solutions for interesting sub-problems arising, such as the
formation of bucket regions, the storage of large data chunks and the caching of the upper nodes
(root-directory) in main memory.
Finally, we provide an extensive experimental evaluation of the CUBE File’s adaptability to the
data space sparseness as well as to an increasing number of data points. The main result is that the
CUBE File is highly adaptive to even the most sparse data spaces and for realistic cases of data
point cardinalities provides hierarchical clustering of high quality and significant space savings.
Keywords: Hierarchical Clustering, OLAP, CUBE File, Data Cube, Physical
Data Clustering
1 Introduction
Efficient processing of ad hoc OLAP queries is a very difficult task considering, on the one hand, the native complexity of typical OLAP queries, which potentially combine huge amounts of data, and, on the other, the fact that no a-priori knowledge of the queries exists and thus no pre-computation of results or other
query-specific tuning can be exploited. The only way to evaluate these queries is
to access directly the most detailed data in an efficient way. It is exactly this need
to access detailed data based on hierarchy criteria that calls for the hierarchical
clustering of data. This paper discusses the physical clustering of OLAP cube
data points on disk in a hierarchy-preserving manner, where hierarchies are
defined along dimensions (hierarchical clustering).
The problem addressed is set out as follows: We are given a large Fact Table (FT)
containing only grain-level (most detailed) data. We assume that this is part of a star schema in a dimensional data warehouse. Therefore, data points (i.e., tuples in
the FT) are organized by a set of N dimensions. We further assume that each
dimension is organized in a hierarchy. Typically the data distribution is extremely
skewed. In particular, the OLAP cube is extremely sparse and data tend to appear
in arbitrary clusters of data along some dimension. These clusters correspond to
specific combinations of the hierarchy values for which there exist actual data
(e.g., sales for a specific product Category in a specific geographic Region for a
specific Period of time). The problem is on the one hand to store the fact table
data in a hierarchy-preserving manner so as to reduce I/Os during the evaluation
of ad hoc queries containing restrictions and/or groupings on the dimension hierarchies, and on the other, to enable navigation in the multilevel-multidimensional data space by providing direct access (i.e., indexing) to subsets of data via hierarchical restrictions. The latter implies that index nodes must also be hierarchically clustered if we are aiming at a reduced I/O cost.
Some of the most interesting proposals [20, 21, 36] in the literature for cube data
structures deal with the computation and storage of the data cube operator [9].
These methods omit a significant aspect in OLAP, which is that usually
dimensions are not flat but are organized in hierarchies of different aggregation
levels (e.g., store, city, area, country is such a hierarchy for a Location
dimension). The most popular approach for organizing the most detailed data of a
cube is the so-called star schema. In this case the cube data are stored in a
relational table, called the fact table. Furthermore, various indexing schemes have
been developed [3, 25, 26, 15], in order to speed up the evaluation of the join of
the central (and usually very large) fact table with the surrounding dimension
tables (also known as a star join). However, even when elaborate indexes are
used, due to the arbitrary ordering of the fact table tuples, there might be as many I/Os as there are tuples retrieved from the fact table.
We propose the CUBE File data structure as an effective solution to the
hierarchical clustering problem set above. The CUBE File multidimensional data
structure ([18]) clusters data into buckets (i.e., disk pages) with respect to the
dimension hierarchies aiming at the hierarchical clustering of the data. Buckets
may include both intermediate (index) nodes (directory chunks), as well as leaf
(data) nodes (data chunks). The primary goal of a CUBE File is to cluster in the
same bucket a “family” of data (i.e., data corresponding to all hierarchy-value
combinations for all dimensions) so as to reduce the bucket accesses during query
evaluation.
Experimental results in [18] have shown that the CUBE File outperforms the UB-tree/MHC [22] - which is another effective method for hierarchically clustering the cube - resulting in 7-9 times fewer I/Os on average for all workloads tested. This simply means that the CUBE File achieves a higher degree of hierarchical clustering of the data. More interestingly, in [15] it was shown that the UB-tree/MHC technique outperformed the traditional bitmap index-based star-join by a factor of 20 to 40, which simply proves that hierarchical clustering is the most decisive factor for a file organization for OLAP cube data, in order to reduce I/O cost.
To tackle this problem we first model the cube data space as a hierarchy of
chunks. This model - called the chunk-tree representation of a cube - copes
effectively with the vast data sparseness by truncating empty areas. Moreover, it
provides a multiple resolution view of the data space where one can zoom-in or
zoom-out to specific areas navigating along the dimension hierarchies. The CUBE
File is built by allocating the nodes of the chunk-tree into buckets in a hierarchy-preserving manner. This way we depart from the common approach for solving the hierarchical clustering problem, which is to find a total ordering of the data points (linear clustering), and treat it instead as a packing problem, namely a chunk-to-bucket packing problem.
In order to solve the chunk-to-bucket packing problem, we need to be able to
evaluate the hierarchical clustering achieved (i.e., evaluate the solutions to this
problem). Thus, inspired by the chunk-tree representation of the cube, we define a
hierarchical clustering quality metric, called the hierarchical clustering factor.
We use this metric to evaluate the quality of the chunk to bucket allocation.
Moreover, we exploit it in order to formulate the CUBE File construction problem
as an optimization problem, which we call the chunk-to-bucket allocation
problem. We formally define this problem and prove that it is NP-Hard. Then, we
propose a heuristic algorithm as a solution that requires a single pass over the
input fact table and linear time in the number of chunks.
In the course of solving this problem several interesting sub-problems arise. We
define the sub-problem of chunk-region formation, which deals with the
clustering of chunk-trees hanging from the same parent-node in order to increase
further the overall hierarchical clustering. We propose two algorithms as a
solution, one of which is driven by workload patterns. Next, we deal with the sub-problem of storing large data chunks (i.e., chunks that don't fit in a single bucket),
as well as with the sub-problem of storing the so-called root directory of the
CUBE File (i.e., the upper nodes of the data structure).
Finally, we study the CUBE File’s effective adaptation to several cube data spaces
by presenting a set of experimental measurements that we have conducted.
All in all, the contributions of this paper are outlined as follows:
- We provide an analytic solution to the problem of hierarchically clustering an OLAP cube. The solution leads to the construction of the CUBE File data structure.
- We model the multilevel-multidimensional data space of the cube as a chunk-tree. This representation of the data space adapts perfectly to the extensive data sparseness and provides a multi-resolution view of the data w.r.t. the hierarchies. Moreover, if viewed as an index, it provides direct access to cube data via hierarchical restrictions, which results in significant speedups of typical ad hoc OLAP queries.
- We transform the hierarchical clustering problem from a linear clustering problem into a chunk-to-bucket allocation (i.e., packing) problem, which we formally define and prove to be NP-Hard.
- We introduce a hierarchical clustering quality metric for evaluating the hierarchical clustering achieved (i.e., evaluating the solution to the problem in question). We provide an efficient solution to this problem as well as to all sub-problems that stem from it, such as the storage of large data chunks or the formation of bucket regions.
- We provide an experimental evaluation which leads to the following basic results:
  - The CUBE File adapts perfectly to even the most extremely sparse data spaces, yielding significant space savings. Furthermore, the hierarchical clustering achieved by the CUBE File is almost unaffected by the extensive cube sparseness.
  - The CUBE File is scalable for any realistic number of input data points. In addition, the hierarchical clustering achieved remains of high quality when the number of input data points increases.
  - The root-directory can be cached in main memory, providing a single I/O cost for the evaluation of point queries.
The rest of this paper is organized as follows. Section 2 discusses related work
and positions the CUBE File in the space of cube storage structures. Section 3
proposes the chunk-tree representation of the cube as an effective representation
of the search space. Section 4 introduces a quality metric for the evaluation of
hierarchical clustering. Section 5 formally defines the problem of hierarchical
clustering, proves its NP-Hardness and then delves into the nuts and bolts of
building the CUBE File. Section 6 presents our extensive experimental evaluation
and Section 7 recapitulates and emphasizes the main conclusions drawn.
2 Related Work
2.1 Linear Clustering of Multidimensional Data
The linear clustering problem for multidimensional data is defined as the problem
of finding a linear ordering of records indexed on multiple attributes, to be stored
in consecutive disk blocks, such that the I/O cost for the evaluation of queries is
minimized. The clustering of multidimensional data has been studied in terms of
finding a mapping of the multidimensional space to a one-dimensional space. This
approach has been explored mainly in two directions: (a) in order to exploit
traditional one-dimensional indexing techniques to a multidimensional index
space - a typical example is the UB-tree [2], which exploits a z-ordering of multidimensional data [27], so that these can be stored in a one-dimensional B-tree index [1] - and (b) for ordering buckets containing records that have been
indexed on multiple attributes, to minimize the disk access effort. For example, a
grid file [23] exploits a multidimensional grid in order to provide a mapping
between grid cells and disk blocks. One could find a linear ordering of these cells
- and therefore an ordering of the underlying buckets - such that the evaluation of a query entails more sequential bucket reads than random bucket accesses. To this
end, space-filling curves (see [33] for a survey) have been used extensively. For
example, Jagadish in [13] provides a linear clustering method based on the Hilbert
curve that outperforms previously proposed mappings. Note however that all
linear clustering methods are inferior to a simple scan in high dimensional spaces.
This is due to the notorious dimensionality curse [41], which states that clustering
in such spaces becomes meaningless due to lack of useful distance metrics.
In the presence of dimension hierarchies the multidimensional clustering problem
becomes combinatorially explosive. Jagadish in [14] tries to solve the problem of
finding an optimal linear clustering of records of a fact table on disk, given a
specific workload in the form of a probability distribution over query classes. The
authors propose a subclass of clustering methods called lattice paths, which are
paths on the lattice defined by the hierarchy level combinations of the dimensions.
The HPP chunk-to-bucket allocation problem (in section 3.2 we provide a formal
definition of HPP restrictions and queries) is a different problem for the following
reasons:
1. It tries to find an optimal way (in terms of reduced I/O cost during query
evaluation) to pack the data into buckets, rather than order the data
linearly. The problem of finding an optimal linear ordering of the buckets,
for a specific workload, so as to reduce random bucket reads, is an
orthogonal problem and therefore, the methods proposed in [14] could be
used additionally.
2. Apart from the data, it also deals with the intermediate node entries (i.e., directory chunk entries), which provides clustering at the whole-index level and not only at the index-leaf level. In other words, index data are also clustered along with the "real" data.
Since we know that there is no linear clustering of records that will permit all queries over a multidimensional space to be answered efficiently [14], we strongly advocate that linear clustering of buckets (inter-bucket clustering) must be exploited in conjunction with an efficient allocation of records into buckets (intra-bucket clustering).
Furthermore, in [22], a path-based encoding of dimension data, similar to our
encoding scheme, is exploited in order to achieve linear clustering of
multidimensional data with hierarchies, through a z-ordering [27]. The authors use
the UB-tree [2] as an index on top of the linearly clustered records. This technique
has the advantage of transforming typical star-join [25] queries to
multidimensional range queries, which are computed more efficiently due to the
underlying multidimensional index.
However, this technique suffers from the inherent deficiencies of the z space-filling curve, which is not the best space-filling curve according to [13, 7]. On the
other hand, it is very easy to compute and thus straightforward to implement the
technique even for high dimensionalities. A typical example of such deficiency is
that in the z-curve there is a dispersion of certain data points, which are close in
the multidimensional space but are not close in the linear order and the opposite,
i.e., distant data points are clustered in the linear space. The latter also results in an inefficient evaluation of multiple disjoint query regions, due to the repetitive retrieval of the same pages for many queries. Finally, the benefits of z-based linear clustering start to disappear quite soon as dimensionality increases, practically even when the number of dimensions exceeds 4-5.
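To make the z-ordering discussed above concrete, the following sketch (illustrative Python, an assumption of ours and not code from [22] or [27]) interleaves the bits of a point's coordinates to produce its z-value; points are then linearly ordered by this value, and the dispersion effect mentioned above shows up as spatially close points whose z-values lie far apart.

    def z_value(coords, bits=8):
        # Interleave the bits of the coordinates (z-order / Morton order).
        # coords: one non-negative integer per dimension; bits: bits per coordinate.
        z = 0
        for bit in range(bits):
            for dim, c in enumerate(coords):
                z |= ((c >> bit) & 1) << (bit * len(coords) + dim)
        return z

    # (3, 4) and (4, 4) are neighbors in the 2-d space, yet their z-values differ a lot:
    print(z_value([3, 4]), z_value([4, 4]))   # 37 and 48 with this bit layout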
2.2 Grid-File-Based Organizations
The CUBE File organization was initially inspired by the grid file organization
[23], which can be viewed as the multidimensional counterpart of extendible
hashing [6]. The grid file superimposes a d-dimensional orthogonal grid on the
multidimensional space. Given that the grid is not necessarily regular, the
resulting cells may be of different shapes and sizes. A grid directory associates
one or more of these cells with data buckets, which are stored in one disk page
each. Each cell is associated with one bucket, but a bucket may contain several
adjacent cells, therefore bucket-regions may be formed.
To ensure that data items are always found with no more than two disk accesses
for exact match queries, the grid itself is kept in main memory represented by d
one-dimensional arrays called scales. The grid file is intended for dynamic
insert/delete operations, therefore it supports operations for splitting and merging
directory cells. A well-known problem of the grid file is that it suffers from a
superlinear growth of the directory even for data that are uniformly distributed
[31]. One basic reason for this is that splitting is not a local operation and thus can
lead to superlinear directory growth. Moreover, depending on the implementation of the grid directory, merging may require a complete directory scan [12].
Hinrichs in [12] attempts to overcome the shortcomings of the grid file by
introducing a 2-level grid-directory. In this scheme, the grid directory is now
stored on disk and a scaled-down version of it (called root directory) is kept in
main memory to ensure the two-disk access principle still holds. Furthermore, he
discusses efficient implementations of the split, merge and neighborhood
operations. In a similar manner, Whang extends the idea of a 2-level directory to a
multilevel directory, introducing the multilevel grid file [43], achieving a linear
directory growth in the number of records. More grid-file-based organizations exist; a comprehensive survey of these and of multidimensional access
methods in general can be found in [8].
An obvious distinction of the CUBE File organization from the above
multidimensional access methods is that it has been designed to fulfill completely
different requirements; namely those of an OLAP environment and not of a
transaction oriented one. A CUBE File is designed for an initial bulk-loading and
then a read-only operation mode, in contrast, to the dynamic insert/delete/update
workload of a grid file. Moreover, a CUBE File aims at speeding up queries on
multidimensional data with hierarchies and exploits hierarchical clustering to this
end. Furthermore, since the dimension domain in OLAP is known a-priori, the
directory does not have to grow dynamically. In addition, changes to the directory
are rare, since dimension data do not change very often (compared to the rate of
change for the cube data), and also deletions are seldom, therefore split and merge
operations are not needed as much. Nevertheless, it is more important to adapt well
to the native sparseness of a cube data space and to efficiently support incremental
updating, so as to minimize the updating window and cube query-down time,
which are critical factors in business intelligence applications nowadays.
2.3 Primary Organizations for Cube Storage
The set of reported methods in the literature for primary organizations for the
storage of cubes is quite confined. We believe that this is basically due to two
reasons: First of all, the generally held view is that a "cube" is a set of pre-computed aggregated results and thus the main focus has been to devise efficient
ways to compute these results [11], as well as to choose, which ones to compute
for a specific workload (view selection/maintenance problem [10, 32, 37]).
Kotidis et al. in [19] proposed a storage organization based on packed R-trees for
storing these aggregated results. We believe that this is a one-sided view of the
problem since it disregards the fact that very often, especially for ad hoc queries,
there will be a need for drilling down to the most detailed data in order to compute
a result from scratch. Ad hoc queries represent the essence of OLAP, and in
contrast to report queries, are not known a-priori and thus cannot really benefit
from pre-computation. The only way to process them efficiently is to enable fast
retrieval of the base data. This calls for an effective primary storage organization
for the most detailed data (grain-level) of the cube. This argument is of course
based on the fact that a full pre-computation of all possible aggregates is
prohibitive due to the consequent size explosion, especially for sparse cubes [24].
The second reason that makes people reluctant to work on new primary
organizations for cubes is their adherence to relational systems. Although this
seems justified, one could pinpoint that a relational table (e.g., a fact table of a
star schema [4]) is a logical entity and thus should be separated from the physical
method chosen for implementing it. Therefore, apart from a paged record file, one can also use a B*-tree or even a multi-dimensional data structure as a primary
organization for a fact table. In fact, there are not many commercial RDBMS ([39]
is one that we know of) that exploit a multidimensional data structure as a primary
organization for fact tables. All in all, the integration of a new data structure in a
full-blown commercial system is a strenuous task with high cost and high risk and
thus usually the proposed solutions are reluctant to depart from the existing
technology (see also [30] for a detailed description of the issues in this
integration).
Fig. 1 positions the CUBE File organization in the space of primary organizations
proposed for storing a cube (i.e., only the base data and not aggregates). The
columns of this table describe the alternative data structures that have been
proposed as a primary organization, while the rows classify the proposed methods
according to the achieved data clustering. At the top-left cell lies the conventional
star schema [4], where a paged record file is used for storing the fact table. This
organization guarantees no particular ordering among the stored data and thus
additional secondary indexes are built around it in order to support efficient access
to the data.
Fig. 1. The space of proposed primary organizations for cube storage. [The figure is a table whose columns are the alternative primary organizations (relation, chunk-based MD-array, and multidimensional data structures: grid-file based, UB-tree based, other) and whose rows are the clustering achieved (no clustering, multidimensional clustering, hierarchical clustering). The star schema [4] occupies the no-clustering/relation cell, [28] and [35] provide multidimensional clustering, while the chunk-based organization of [5], the z-order based [22] and the CUBE File [18] provide hierarchical clustering.]
[28] assumes a typical relation (i.e., a paged record file) as the primary
organization of a cube (i.e., fact table). However, unique combinations of
dimension values are used in order to form blocks of records, which correspond to
consecutive disk pages. These blocks can be considered as chunks. The database
administrator must choose only one hierarchy-level from each dimension to
participate in the clustering scheme. In this sense, the method provides
multidimensional clustering and not hierarchical (multidimensional) clustering.
In [35] a chunk-based method for storing large multidimensional arrays is
proposed. No hierarchies are assumed on the dimensions and data are clustered
according to the most frequent range queries of a particular workload. In [5] the benefits of hierarchical clustering in speeding up queries were observed as a side effect of using a chunk-based file organization over a relation (i.e., a paged file of records) for query caching, with the chunk as the caching unit. Hierarchical clustering
was achieved through appropriate “hierarchical” encoding of the dimension data.
Markl et al. in [22] also impose a hierarchical encoding on the dimension data and assign a path-based surrogate key to each dimension tuple, called the compound surrogate key. They exploit the UB-tree multidimensional index [2] as
the primary organization of the cube. Hierarchical clustering is achieved by taking
the z-order [27] of the cube data points by interleaving the bits of the
corresponding compound surrogates. [5], [22] and the CUBE File [18], all exploit
hierarchical clustering of the cube data and the last two use multidimensional
structures as the primary organization. This has among others the significant
benefit of transforming a star-join [25] into a multidimensional range query that is
evaluated very efficiently over these data structures.
3 The Chunk-Tree Representation of the Cube
Clearly our goal is to define a multidimensional file organization that natively
supports hierarchies. There is indeed a plethora of data structures for
multidimensional data [8], but to the best of our knowledge, none of these
explicitly supports hierarchies. Hierarchies complicate things, basically because,
in their presence, the data space "explodes" (see footnote 1). Moreover, since we are primarily
aiming at speeding up queries including restrictions on the hierarchies, we need a
data structure that can efficiently lead us to the corresponding data subset based
on these restrictions. A key observation at this point is that all restrictions on the
hierarchies intuitively define a subcube or a cube-slice.
To this end, we exploit the intuitive representation of a cube as a
multidimensional array and apply a chunking method in order to create subcubes,
i.e., the so-called chunks. Our method of chunking is based on the dimension
hierarchies’ structure and thus we call it hierarchical chunking. In the following
sections we present a dimension-data encoding scheme that assigns hierarchy-enabled unique identifiers to each data point in a dimension. Then, we present our
hierarchical chunking method. Finally, we propose a tree structure for
representing the hierarchy of the resultant chunks and thus modeling the cube data
space.
1 Assuming N dimension hierarchies modelled as K-level m-way trees, the number of possible value combinations is exponential in the number of dimensions, with the number of levels K in the exponent, i.e., O(m^(K·N)).
3.1 Dimension Encoding and Hierarchical Chunking
In order to apply hierarchical chunking, we first assign a surrogate key to each
dimension hierarchy value. This key uniquely identifies each value within the
hierarchy. More specifically, we order the values in each hierarchy level so that
sibling values occupy consecutive positions and perform a mapping to the domain
of positive integers. The resulting values are depicted in Fig. 2 for an example of a
dimension hierarchy. The simple integers appearing under each value in each
level are called order-codes. In order to identify a value in the hierarchy, we form
the path of order-codes from the root-value to the value in question. This path is
called a hierarchical surrogate key, or simply h-surrogate. For example, the h-surrogate for the value "Rhodes" is 0.0.1.2. H-surrogates convey hierarchical (i.e.,
semantic) information for each cube data point, which can be greatly exploited for
the efficient processing of star-queries [15, 29, 40].
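As an illustration of this encoding, the following Python sketch (hypothetical code of ours; the disambiguated names such as "North-GR" are assumptions, since the figure simply labels both regions "North" and "South") stores, for each hierarchy value, its parent and its order-code, and derives the h-surrogate as the path of order-codes from the root; for "Rhodes" it yields 0.0.1.2, as stated above.

    # Hypothetical encoding of the LOCATION hierarchy of Fig. 2:
    # value -> (parent value, order-code within its hierarchy level),
    # with sibling values numbered consecutively as described in the text.
    location = {
        "Europe":   (None,       0),
        "Greece":   ("Europe",   0), "U.K.":     ("Europe",   1),
        "North-GR": ("Greece",   0), "South-GR": ("Greece",   1),
        "North-UK": ("U.K.",     2), "South-UK": ("U.K.",     3),
        "Salonica": ("North-GR", 0), "Athens":   ("South-GR", 1),
        "Rhodes":   ("South-GR", 2), "Glasgow":  ("North-UK", 3),
        "London":   ("South-UK", 4), "Cardiff":  ("South-UK", 5),
    }

    def h_surrogate(value, dim=location):
        # Return the hierarchical surrogate key (path of order-codes) of a value.
        path = []
        while value is not None:
            parent, code = dim[value]
            path.append(code)
            value = parent
        return ".".join(str(c) for c in reversed(path))

    print(h_surrogate("Rhodes"))   # -> 0.0.1.2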
Fig. 2. Example of hierarchical surrogate keys assigned to an example hierarchy. [The LOCATION hierarchy Continent / Country / Region / City (grain level): Europe (0), its countries Greece (0.0) and U.K. (order-code 1), their regions North and South, and the cities Salonica, Athens, Rhodes (0.0.1.2), Glasgow, London and Cardiff, each annotated with its order-code.]
The basic incentive behind hierarchical chunking is to partition the data space by
forming a hierarchy of chunks that is based on the dimensions' hierarchies. This
has the beneficial effect of pruning all empty areas. Remember that in a cube data
space empty areas are typically defined on specific combinations of hierarchy
values (e.g., since we didn’t sell the X product Category on Region Y for T
periods of time, an empty region is formed). Moreover, it provides us with a
multi-resolution view of the data space where one can zoom-in and zoom-out
navigating along the dimension hierarchies.
We model the cube as a large multidimensional array, which consists only of the
most detailed data. Initially, we partition the cube in a very few chunks
corresponding to the most aggregated levels of the dimensions' hierarchies. Then
we recursively partition each chunk as we drill-down to the hierarchies of all
dimensions in parallel. We define a measure in order to distinguish each recursion
step, the chunking depth D. We will illustrate hierarchical chunking with an
example. The dimensions of our example cube are depicted in Fig. 3 and
correspond to a 2-dimensional cube hosting sales data for a fictitious company.
The two dimensions are namely LOCATION and PRODUCT. In the figure we can
see the members for each level of these dimensions (each appearing with its
member-code).
Fig. 3. Dimensions of our example cube along with two hierarchy instantiations.
In order to apply our method, we need to have hierarchies of equal length. For this
reason, we insert pseudo-levels P into the shorter hierarchies until they reach the
length of the longest one. This "padding" is done after the level that is just above
the grain level. In our example, the PRODUCT dimension has only three levels
and needs one pseudo-level in order to reach the length of the LOCATION
dimension. This is depicted next, where we have also noted the order-code range
at each level:
LOCATION:[0-2].[0-4].[0-10].[0-18]
PRODUCT:[0-1].[0-2].P.[0-5]
The result of hierarchical chunking on our example cube is depicted in Fig. 4(a).
Chunking begins at chunking depth D = 0 and proceeds in a top-down fashion. To
define a chunk, we define discrete ranges of grain-level (i.e., most-detailed) values
on each dimension, denoted in the figure as [a..b], where a and b are grain-level
order-codes. Each such range is defined as the set of values with the same parent
(value) in the corresponding parent level. These parent levels form the set of pivot
levels PVT, which guides the chunking process at each step. Therefore initially,
PVT = {LOCATION: Continent, PRODUCT: Category}. For example, if we take
value 0 of pivot level Continent of the LOCATION dimension, then the
corresponding range at the grain level is Cities [0..5].
Fig. 4. (a) The cube from our running example hierarchically chunked. (b) The whole sub-tree, up to the data chunks, under the depth-0 chunk shaded in (a).
The definition of such a range for each dimension defines a chunk. For example,
the chunk defined from the 0, 0 values of the pivot levels Continent and Category
respectively, consists of the following grain data (LOCATION:0.[0-1].[0-3].[0-5],
PRODUCT:0.[0-1].P.[0-3]). The '[]' notation denotes a range of members. This
chunk appears shaded in Fig. 4(a) at D = 0. Ultimately at D = 0 we have a chunk
for each possible combination between the members of the pivot levels, that is a total of 2 × 3 = 6 chunks in this example (the cardinalities of the ranges [0-1] and [0-2]). Thus the total number of chunks
created at each depth D equals the product of the cardinalities of the pivot levels.
Next we proceed at D = 1, with PVT = {LOCATION: Country, PRODUCT: Type}
and recursively chunk each chunk of depth D = 0. This time we define ranges
within the previously defined ranges. For example, on the range corresponding to
Continent value 0 that we created before, we define discrete ranges corresponding
to each country of this continent (i.e., to each value of the Country level, which
has parent 0). In Fig. 4(a), at D = 1, shaded boxes correspond to all the chunks
resulting from the chunking of the chunk mentioned in the previous paragraph.
Similarly, we proceed with the chunking by descending all dimension hierarchies in parallel, and at each depth D we create new chunks within the existing ones.
The procedure ends when the next levels to include as pivot levels are the grain
levels. Then we do not need to perform any further chunking, because the chunks
that would be produced from such a chunking would be the cells of the cube
themselves. In this case, we have reached the maximum chunking depth DMAX. In
our example, chunking stops at D = 2 and the maximum depth is D = 3. Notice the
shaded chunks in Fig. 4(a) depicting chunks belonging in the same chunk
hierarchy.
The rationale for inserting the pseudo levels just above the grain level lies in the fact that we wish to apply chunking as early as possible and for all possible dimensions. Since the chunking proceeds in a top-to-bottom fashion, this "eager chunking"
has the advantage of reducing very early the chunk size and also provides faster
access to the underlying data, because it increases the fan-out of the intermediate
nodes. If at a particular depth one (or more) pivot level is a pseudo level, then this
level does not take part in the chunking (in our example this occurs at D = 2 for
the PRODUCT dimension). This means that we don't define any new ranges
within the previously defined range for the specific dimension(s) but instead we
keep the old one with no further chunking. Therefore, since pseudo levels restrict chunking in the dimensions to which they are applied, we must insert them at the lowest
possible level. Consequently, since there is no chunking below the grain level (a
data cell cannot be further partitioned), the pseudo level insertion occurs just
above the grain level.
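The following sketch (a simplified Python illustration under assumed data structures, not the actual CUBE File construction code) captures the top-down recursion just described: at each depth the pivot-level members of all dimensions are combined, each combination defines a chunk, and the recursion stops when the children of the pivot levels are the grain values. A pseudo-level is modelled here as a single pass-through member, so the corresponding dimension is not partitioned further at that depth.

    from itertools import product

    # Dimension hierarchies as nested dicts: order-code -> sub-tree of its children;
    # {} marks the level just above the grain, where chunking stops.  The single
    # member "P" models a pseudo-level (no partitioning along that dimension).
    LOCATION = {0: {0: {0: {}, 1: {}},              # Europe -> Greece -> North, South
                    1: {2: {}, 3: {}}}}             # Europe -> U.K.   -> North, South
    PRODUCT  = {0: {0: {"P": {}}, 1: {"P": {}}}}    # one Category -> two Types -> pseudo-level

    def chunks(dim_trees, parent_id="", depth=0):
        """Yield (chunk-id, chunking depth D) for every chunk below the root chunk."""
        for combo in product(*(t.keys() for t in dim_trees)):      # pivot-level members
            d_domain = "|".join(map(str, combo))                   # '|' is an assumed separator
            cid = f"{parent_id}.{d_domain}" if parent_id else d_domain
            yield cid, depth
            children = [t[code] for t, code in zip(dim_trees, combo)]
            if all(children):                                      # grain not reached: keep chunking
                yield from chunks(children, cid, depth + 1)

    for cid, d in chunks([LOCATION, PRODUCT]):
        print(d, cid)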
3.2 The Chunk-Tree Representation
We use the intermediate-depth chunks as directory chunks that guide us to the chunks at depth DMAX, which contain the data and are thus called data chunks. This leads to a chunk-tree representation of the hierarchically chunked cube and hence of the cube
data space. It is depicted in Fig. 4(b) for our example cube. In Fig. 4(b), we have
expanded the chunk-sub-tree corresponding to the family of chunks that has been
shaded in Fig. 4(a). Pseudo levels are marked with “P” and the corresponding
directory chunks have reduced dimensionality (i.e., one dimensional in this case).
We interleave the h-surrogates of the pivot level values that define a chunk and
form a chunk-id. This is a unique identifier for a chunk within a CUBE File.
Moreover, this identifier includes the whole path in the chunk hierarchy of a
chunk. In Fig. 4(b), we note the corresponding chunk-id above each chunk. The
root chunk does not have a chunk-id because it represents the whole cube and
chunk-ids essentially denote sub-cubes. The part of a chunk-id that is contained between consecutive dots and corresponds to a specific depth D is called a D-domain.
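For instance, a chunk-id can be obtained by interleaving the order-code paths of the pivot-level values dimension by dimension, one D-domain per depth; the helper below is an assumed Python sketch (with '|' as an assumed intra-domain separator), not the CUBE File's actual format.

    def chunk_id(h_surrogates, sep="|"):
        # Interleave the h-surrogates of the pivot-level values that define a chunk:
        # one D-domain per depth, dimension order-codes joined by `sep`,
        # D-domains joined by dots.
        paths = [hs.split(".") for hs in h_surrogates]
        return ".".join(sep.join(level) for level in zip(*paths))

    print(chunk_id(["0.0.1", "0.1.P"]))   # -> 0|0.0|1.1|P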
The chunk-tree representation can be regarded as a method to model the
multilevel-multidimensional data space of an OLAP cube. We discuss next the
major benefits from this modeling:
Direct access to cube data through hierarchical restrictions: One of the main
advantages of the chunk-tree representation of a cube is that it explicitly supports
hierarchies. This means that any cube data subset defined through restrictions on
the dimension hierarchies can be accessed directly. This is achieved by simply
accessing the qualifying cells at each depth and following the intermediate chunk
pointers to the appropriate data. Note that the vast majority of OLAP queries
contain an equality restriction on a number of hierarchical attributes and more
commonly on hierarchical attributes that form a complete path in the hierarchy.
This is reasonable since the core of analysis is conducted along the hierarchies.
We call such restrictions hierarchical prefix path (HPP) restrictions and
provide the corresponding definition next:
Definition 1 (Hierarchical Prefix Path Restriction): We define a hierarchical
prefix path restriction (HPP restriction) on a hierarchy H of a dimension D, to be
a set of equality restrictions linked by conjunctions on H’s levels that form a path
in H, which always includes the topmost (most aggregated) level of H.
For example, if we consider the dimension LOCATION of our example cube and a
DATE dimension with a 3-level hierarchy (Year/Month/Day), then the query
“show me sales for country A (in continent C) in region B for each month of
1999” contains two whole-path restrictions, one for the dimension LOCATION
and one for DATE: (a) LOCATION.continent = ‘C’ AND LOCATION.country =
‘A’ AND LOCATION.region = ‘B’, and (b) DATE.year = 1999.
Consequently, we can now define the class of HPP queries:
Definition 2 (Hierarchical Prefix Path Query): We call a query Q on a cube C a
hierarchical prefix path query (HPP query), if and only if all the restrictions
imposed by Q on the dimensions of C are HPP restrictions, which are linked
together by conjunctions.
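Under the h-surrogate encoding of Section 3.1, an HPP restriction on one dimension simply fixes a prefix of the order-code path, which is what makes such queries cheap to route through the chunk-tree. The check below is an illustrative sketch of ours (using the LOCATION codes of Fig. 2), not part of the CUBE File implementation.

    def satisfies_hpp(h_surrogate, hpp_prefix):
        # An HPP restriction (a path of equality restrictions starting at the
        # topmost level) holds iff its order-code path is a prefix of the
        # data point's h-surrogate on that dimension.
        target = hpp_prefix.split(".")
        return h_surrogate.split(".")[:len(target)] == target

    # LOCATION.continent = 'Europe' AND LOCATION.country = 'Greece'  ->  prefix "0.0"
    print(satisfies_hpp("0.0.1.2", "0.0"))   # Rhodes:  True
    print(satisfies_hpp("0.1.3.5", "0.0"))   # Cardiff: False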
Adaptation to cube’s native sparseness: The cube data space is extremely sparse
[34]. In other words, the ratio of the number of real data points to the product of
the dimension grain–level cardinalities is a very small number. Values for this
ratio in the range of 10^-12 to 10^-5 are more than typical (especially for cubes with more than 3 dimensions). It is therefore imperative that a primary organization
for the cube adapts well to this sparseness, allocating space conservatively.
Ideally, the allocated space must be comparable to the size of the existing data
points. The chunk-tree representation adapts perfectly to the cube data space. The
reason is that the empty regions of a cube are not arbitrarily formed. On the
contrary, specific combinations of dimension hierarchy values form them. For
instance, in our running example, if no music products are sold in Greece, then a
large empty region is formed. Consequently, the empty regions in the cube data
space translate naturally to one or more empty chunk sub-trees in the chunk-tree
representation. Therefore, empty sub-trees can be discarded altogether and the space allocation corresponds to the real data points only.
Multi-resolution view of the data space: The chunk-tree represents the whole cube
data space (however with most of the empty areas pruned). Similarly, each sub-tree represents a sub-space. Moreover, at a specific chunking depth we "view" all
the data points organized in “hierarchical families” (i.e., chunk-trees) according to
the combinations of hierarchy values for the corresponding hierarchy levels. By
descending to a higher depth node we “view” the data of the corresponding
subspace organized in hierarchical families of a more detailed level and so on.
This multi-resolution feature will be exploited later in order to achieve better hierarchical clustering of the data by favoring the storage in a bucket of lower-depth chunk-trees over higher-depth ones.
Storage efficiency: A chunk is physically represented by a multidimensional
array. This enables offset-based access, rather than search-based access, which speeds up the cell access mechanism considerably. Moreover, it gives us the
opportunity to exploit chunk-ids in a very effective way. A chunk-id essentially
consists of interleaved coordinate values. Therefore, we can use a chunk-id in
order to calculate the appropriate offset of a cell in a chunk but we do not have to
store the chunk-id along with each cell. Indeed, a search-based mechanism (like
the one used by conventional B-tree indexes, or the UB-tree [2]) requires that the
dimension values (or the corresponding h-surrogates), which form the search-key,
must also be stored within each cell (i.e., tuple) of the cube. In the CUBE File
only the measure values of the cube are stored in each cell. Hence notable space
savings are achieved. In addition, further compression of chunks can be easily
achieved, without affecting the offset-based accessing (see [17] for the details).
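The sketch below illustrates the kind of offset computation this enables; a plain row-major layout is assumed here purely for illustration, while the actual chunk layout and compression scheme are described in [17].

    def cell_offset(coords, dims):
        # Row-major offset of a cell inside a chunk's multidimensional array.
        # coords: the cell's local order-codes (one per dimension),
        # dims:   the chunk's extent along each dimension.
        offset = 0
        for c, d in zip(coords, dims):
            offset = offset * d + c
        return offset

    # In a 2 x 3 data chunk, the cell with local order-codes (1, 2) is at offset 5,
    # so only its measure values need to be stored there, not its coordinates.
    print(cell_offset((1, 2), (2, 3)))   # -> 5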
Parallel Processing Enabling: Chunk-trees (at various depths) can be exploited
naturally for the logical fragmentation of the cube data, in order to enable the
parallel processing of queries, as well as the construction and maintenance (i.e.,
bulk loading and batch updating) of the CUBE File. Chunk-trees are essentially
disjoint fragments of the data that carry all the hierarchy semantics of the data.
This makes the CUBE File data structure an excellent candidate for advanced
fragmentation methods ([38]) used in parallel data warehouse DBMSs.
Efficient Maintenance Operations: Any data structure aimed at accommodating data warehouse data must be efficient in typical data warehousing maintenance
operations. The logical data partitioning provided by the chunk-tree representation
enables fast bulk loading (rollin of data), data purging (rollout of data, i.e., bulk
deletions from the cube), as well as the incremental updating of the cube (i.e.,
when the input data with the latest changes arrive from the data sources, only
local reorganizations are required and not a complete CUBE File rebuild). The
key idea is that new data to be inserted in the CUBE file correspond to a set of
chunk-trees that need to be "hung" at various depths of the structure. The
insertion of each such chunk-tree requires only a local reorganization without
affecting the rest of the structure. In addition, as noted previously, these chunk-tree insertions can be performed in parallel as long as they correspond to disjoint
subspaces of the cube. Finally, it is very easy to rollout the oldest month’s data
and rollin the current month’s (we call this data purging), since these data
correspond to separate chunk-trees and only a minimum reorganization is
required. The interested reader can find more information regarding other aspects
of the CUBE File not covered in this paper (e.g., the updating and maintenance
operations), as well as information for a prototype implementation of a CUBE
File based DBMS in [16].
4 Evaluating Hierarchical Clustering
Any physical organization of data must determine how the latter are distributed in
disk pages. A CUBE File physically organizes its data by allocating the chunks of
the chunk-tree into a set of buckets, the bucket being the I/O transfer unit in our case. First, let us try to understand the objectives of such an allocation.
As already stated the primary goal is to achieve a high degree of hierarchical
clustering. This statement, although clear, could still be interpreted in several
different ways. What are the elements that can guarantee that a specific
hierarchical clustering scheme is “good”? We attempt to list some next:
1. Efficient evaluation of queries containing restrictions on the dimension
hierarchies.
2. Minimization of the size of the data.
3. High space utilization.
The most important goal of hierarchical clustering is to improve response time of
queries containing hierarchical restrictions. Therefore, the first element calls for a
minimal I/O cost (i.e., bucket reads) for the evaluation of such restrictions. The
second element deals with the ability to minimize the size of the data to be stored
(e.g., by adapting to the extensive sparseness of the cube data space - i.e., not
storing null data- as well as storing only the minimum necessary data, e.g., in an
offset-based access structure we don’t need to store the dimension values along
with the facts). Of course, the storage overhead must be also minimized in terms
of the number of allocated buckets. Naturally, the best way to keep this number
low is to utilize the available space as much as possible. Therefore the third
element implies that the allocation must adapt well to the data distribution, e.g.,
more buckets must be allocated to more densely populated areas and fewer
buckets for more sparse ones. Also, buckets must be filled almost to capacity (i.e.,
imposing a high bucket occupancy threshold). The last two elements together guarantee an overall minimum storage cost.
In the following, we propose a metric for evaluating the hierarchical clustering
quality of an allocation of chunks into buckets. Then in the next section we use
this metric to formally define the chunk-to-bucket allocation problem as an
optimization problem.
4.1 A Hierarchical Clustering Quality Metric
We advocate that hierarchical clustering is the most important goal for a file
organization for OLAP cubes. However, the space of possible combinations of
dimension hierarchy values is huge (doubly exponential - see footnote 1 on page
2). To this end, we exploit the chunk-tree representation, resulting from the
hierarchical chunking of a cube, and deal with the problem of hierarchical
clustering, as a problem of allocating chunks of the chunk-tree into disk buckets.
Thus, we are not searching for a linear clustering (i.e., for a total ordering of the
chunked-cube cells), but rather we are interested in the packing of chunks into
buckets according to the criteria of good hierarchical clustering posed above.
The intuitive explanation for the utilization of the chunk-tree for achieving hierarchical clustering lies in the fact that the chunk-tree is built based solely on
the hierarchies’ structure and content and not on some storage criteria (e.g., each
node corresponding to a disk page, etc.); as a result, it embodies all possible
combinations of hierarchical values. For example, the sub-tree hanging from the
root-chunk in Fig. 4(b), at the leaf level contains all the sales figures corresponding to the continent "Europe" (order code 0) and to the product category "Books" (order code 0) and any possible combinations of the children
members of the two. Therefore, each sub-tree in the chunk-tree corresponds to a
“hierarchical family” of values and thus reduces the search space significantly. In
the following we will regard the bucket as the storage unit. In this section, we define a metric for evaluating the degree of hierarchical clustering of different storage schemes in a quantitative way.
Clearly, a hierarchical clustering strategy that respects the quality element of
efficient evaluation of queries with HPP restrictions that we have posed above,
must ensure that the access of the sub-trees hanging under a specific chunk is done with a minimal number of bucket reads. Intuitively, one can say that if we
could store whole sub-trees in each bucket (instead of single chunks), then this would result in better hierarchical clustering, since all the restrictions on the
specific sub-tree, as well as on any of its descendant sub-trees, would be evaluated
with a single bucket I/O. For example, if we store the sub-tree hanging from the
root-chunk in Fig. 4(b) into a single bucket, we can answer all queries containing
hierarchical restrictions on the combination “Books” and “Europe” and on any
children-values of these two, with just a single disk I/O.
Therefore, each sub-tree in this chunk-tree corresponds to a “hierarchical family”
of values. Moreover, the smaller the chunking depth of this sub-tree, the more value combinations it embodies. Intuitively, we can say that the hierarchical
clustering achieved could be assessed by the degree of storing low-depth whole
chunk sub-trees into each storage unit. Next, we exploit this intuitive criterion to
define the hierarchical clustering degree of a bucket (HCDB). We begin with a
number of auxiliary definitions:
Definition 3 (Bucket-Region): Assume a hierarchically chunked cube represented
by a chunk-tree CT of a maximum chunking depth DMAX. A group of chunk-trees
of the same depth having a common parent node, which are stored in the same
bucket, comprises a bucket-region.
Definition 4 (Region contribution of a tree stored in a bucket – cr): Assume a
hierarchically chunked cube represented by a chunk-tree CT of a maximum
chunking depth DMAX. We define as the region contribution cr of a tree t of depth
d that is stored in a bucket B, to be the total number of trees in the bucket-region
that this tree belongs to divided by the total number of trees of the same depth in the whole chunk-tree CT. This is then multiplied by a bucket-region
proximity factor rP, which expresses the proximity of the trees of a bucket-region
in the multidimensional space.
cr = ( treeNum(d, B) / treeNum(d, CT) ) × rP

where
treeNum(d, B): total number of sub-trees in B of depth d,
treeNum(d, CT): total number of sub-trees in CT of depth d, and
rP: bucket-region proximity (0 < rP ≤ 1).
The region contribution of a tree stored in a bucket essentially denotes the
percentage of trees at a specific depth that a bucket-region covers. Therefore, the
greater this percentage is, the greater the hierarchical clustering achieved by the
corresponding bucket, since more combinations of the hierarchy members will be
clustered in the same bucket. To keep this contribution high we need large bucket-regions of low-depth trees, because at low depths the total number of CT sub-trees is small. Notice also that the region contribution includes a bucket-region proximity factor rP, which expresses the spatial proximity of the trees of a bucket-region in the multidimensional space. The larger this factor becomes, the closer the
trees of a bucket region are and thus the larger their individual region contributions become. We will see in more detail the effects of this factor and its
definition (Definition 10) in a following subsection, where we will discuss the
formation of the bucket-regions.
Definition 5 (Depth contribution of a tree stored in a bucket – cd): Assume a
hierarchically chunked cube represented by a chunk-tree CT of a maximum
chunking depth DMAX. We define as the depth contribution cd of a tree t of depth d
that is stored in a bucket B, to be the ratio of d to DMAX.
cd = d / DMAX
The depth contribution of a tree stored in a bucket expresses the proportion between the depth of the tree and the maximum chunking depth. The smaller this ratio becomes (i.e., the lower the depth of the tree), the greater the hierarchical clustering achieved by the corresponding bucket. Intuitively, the depth contribution expresses the percentage of the number of nodes in the path from the root-chunk to the bucket in question, and thus the smaller it is, the smaller the I/O cost to access this bucket. Alternatively, we could substitute the depth value in the numerator of the depth contribution with the number of buckets in the path from the root-chunk to the bucket in question (with the latter included).
Next, we provide the definition for the hierarchical clustering degree of a bucket:
Definition 6 (Hierarchical Clustering Degree of a Bucket – HCDB): Assume a
hierarchically chunked cube represented by a chunk-tree CT of a maximum
chunking depth DMAX. For a bucket B containing T whole sub-trees {t1, t2 … tT} of
chunking depths {d1, d2 … dT} respectively, where none of these sub-trees is a
sub-tree of another, we define the Hierarchical Clustering Degree HCDB of bucket B to be the ratio of the sum of the region contributions of each tree ti (1 ≤ i ≤ T) included in B to the sum of the depth contributions of each tree ti (1 ≤ i ≤ T), multiplied by the bucket occupancy OB, where 0 < OB ≤ 1:
HCDB = ( Σi=1..T cri / Σi=1..T cdi ) × OB = ( T·cr / T·cd ) × OB = ( cr / cd ) × OB    (1)
where cri is the region contribution of tree ti and cdi is the depth contribution of tree ti (1 ≤ i ≤ T). (Note that since bucket-regions have been defined as consisting of equi-depth trees, all trees of a bucket have the same region contribution as well as the same depth contribution.)
In this definition, we have assumed that the chunking depth di of a chunk-tree ti is
equal to the chunking depth of the root-chunk of this tree. Of course we assume
that a normalization of the depth values has taken place, so that the depth of the chunk-tree CT is 1 instead of 0, in order to avoid having zero depths in the denominator of equation (1). Furthermore, data chunks are considered as chunk-trees with a depth equal to the maximum chunking depth of the cube. Note that
directory chunks stored in a bucket not as part of a sub-tree but isolated, have a
zero region contribution; therefore, buckets that contain only such directory
chunks have a zero degree of hierarchical clustering.
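Putting Definitions 4-6 together, the degree of a bucket can be computed as in the following sketch (illustrative Python with assumed toy inputs; it hard-codes the simplification noted above, namely that all trees of a bucket form one equi-depth bucket-region).

    def hcd_b(num_trees, depth, tree_num_ct, d_max, occupancy, region_proximity=1.0):
        # Hierarchical clustering degree of a bucket, equation (1).
        # num_trees:    T, whole sub-trees stored in the bucket (one bucket-region)
        # depth:        their common (normalized) chunking depth, 1..d_max
        # tree_num_ct:  treeNum(depth, CT), sub-trees of that depth in the whole chunk-tree
        c_r = (num_trees / tree_num_ct) * region_proximity    # region contribution (Def. 4)
        c_d = depth / d_max                                    # depth contribution  (Def. 5)
        return (num_trees * c_r) / (num_trees * c_d) * occupancy   # = (c_r / c_d) * O_B

    # a bucket holding 4 of the 24 depth-2 sub-trees of a cube with DMAX = 5, 90% full:
    print(hcd_b(num_trees=4, depth=2, tree_num_ct=24, d_max=5, occupancy=0.9))   # ~0.375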
From equation (1), we can see that the more sub-trees, instead of single chunks,
are included in a bucket the greater the hierarchical clustering degree of the bucket
becomes, because more HPP restrictions can be evaluated solely with this bucket.
Also, the higher these trees are (i.e., the smaller their chunking depth is), the greater the hierarchical clustering degree of the bucket becomes, since more combinations of
hierarchical attributes are “covered” by this bucket. Moreover, the more trees of
the same depth and hanging under the same parent node, we have stored in a
bucket, the greater becomes the hierarchical clustering degree of the bucket, since
we include more combinations of the same path in the hierarchy.
All in all, the HCDB metric favors the following storage choices for a bucket:
- Whole trees instead of single chunks or other data partitions.
- Smaller-depth trees instead of greater-depth ones.
- Tree regions instead of single trees.
- Regions with a few low-depth trees instead of ones with more trees of greater depth.
- Regions with trees of the same depth that are close in the multidimensional space instead of dispersed trees.
- Buckets with a high occupancy.
We prove the following theorem regarding the maximum value of the hierarchical
clustering degree of a bucket:
Theorem 1 (Theorem of maximum hierarchical clustering degree of a bucket):
Assume a hierarchically chunked cube represented by a chunk-tree CT of a
maximum chunking depth DMAX, which has been allocated to a set of buckets.
Then, for any such bucket B holds that:
HCDB ≤ DMAX
Proof:
From the definition of the region contribution of a tree appearing in Definition 4,
we can easily deduce that:
cri ≤ 1    (I)
This means that the following holds:
Σi=1..T cri ≤ T    (II)
In (II) T stands for the number of trees stored in B. Similarly, from the definition
of the depth contribution of a tree appearing in Definition 5, we can easily deduce
that:
cdi ≥ 1 / DMAX    (III)
since the smallest possible depth value is 1. This means that the following holds:
Σi=1..T cdi ≥ T / DMAX    (IV)
From (II), (IV), equation (1) and assuming that B is filled to its capacity (i.e., OB
equals 1) the theorem is proved.
It is easy to see that the maximum degree of hierarchical clustering of a bucket B
is achieved only in the ideal case, where we store the chunk-tree CT that
represents the whole cube in B and CT fits exactly in B (see footnote 2). In this case, all of our
primary goals for good hierarchical clustering, posed at the beginning of this section, such as the efficient evaluation of HPP queries, the low storage cost and
the high space utilization are achieved. This is because all possible HPP
2 Indeed, a bucket with HCDB = DMAX would mean that the depth contribution of each tree in this bucket should be equal to 1/DMAX (according to inequality (III)); however, this is only possible for the whole chunk-tree CT, since only this tree has a depth equal to 1.
restrictions can be evaluated with a single bucket read (one I/O operation) and the
achieved space utilization is maximal (full bucket) with a minimal storage cost
(just one bucket). Moreover, it is now clear that the hierarchical clustering degree
of a bucket signifies to what extent the chunk-tree representing the cube has been
“packed” into the specific bucket and this is measured in terms of the chunking
depth of the tree.
By trying to create buckets with a high HCDB we can guarantee that our allocation
respects these elements of good hierarchical clustering. Furthermore, it is now
straightforward to define a metric for evaluating the overall hierarchical clustering
achieved by a chunk to bucket allocation strategy:
Definition 7 (Hierarchical Clustering Factor of a Physical Organization for a
Cube – fHC): For a physical organization that stores the data of a cube into a set of
NB buckets, we define the hierarchical clustering factor fHC as the percentage of hierarchical clustering achieved by this storage organization, as it results from the hierarchical clustering degrees of the individual buckets divided by the total number of buckets, and we write:
fHC = ( ΣB=1..NB HCDB ) / ( NB · DMAX )    (2)
Note that NB is the total number of buckets used in order to store the cube; however, only the buckets that contain at least one whole chunk-tree have a non-zero HCDB value. Therefore, allocations that spend more buckets on storing sub-trees have a higher hierarchical clustering factor than others that favor, e.g., single-directory-chunk allocations. From equation (2), it is clear that even if we have two different allocations of a cube that result in the same total HCDB over the individual buckets, the one that occupies the smaller number of buckets will have the greater fHC, rewarding in this way the allocations that use the available space more conservatively.
Another way of viewing the fHC is as the average HCDB of all the buckets divided by the maximum chunking depth. It is now clear that it expresses the extent to which the chunk-tree representing the whole cube has been "packed" into the set of the NB buckets, and thus 0 ≤ fHC ≤ 1. It follows directly from Theorem 1 that this factor is maximized (i.e., equals 1) if and only if we store the whole cube (i.e., the chunk-tree CT) into a single bucket, which corresponds to perfect hierarchical clustering for a cube.
In the next section we exploit the hierarchical clustering factor fHC in order to define the chunk-to-bucket allocation problem as an optimization problem. Furthermore, we exploit the hierarchical clustering degree of a bucket, HCDB, as an evaluation criterion in a greedy strategy that we propose for solving this problem, in order to decide how close we are to an optimal solution.
3 The Chunk-to-Bucket Allocation Problem
In this section we formally define the chunk-to-bucket allocation problem as an
optimization problem. We prove that it is NP-Hard and provide a heuristic
algorithm as a solution. In the course of solving this problem several interesting
sub-problems arise. We tackle each one in a separate subsection.
3.1 Problem Definition
The chunk-to-bucket allocation problem is defined as follows:
Definition 8 (The HPP Chunk-to-Bucket Allocation Problem): For a cube C,
represented by a chunk-tree CT with a maximum chunking depth of DMAX, find an
allocation of the chunks of CT into a set of fixed-size buckets that corresponds to
a maximum hierarchical clustering factor fHC.
We assume the following: the storage cost of any chunk-tree t equals cost(t), the number of sub-trees per depth d in CT equals treeNum(d), and the size of a bucket equals SB. Finally, we are given a bucket of special size SROOT, called the root-bucket BR, which consists of λ consecutive simple buckets, where SROOT = λ·SB and λ ≥ 1. Essentially, BR represents the set of buckets that contain no whole sub-trees and thus have a zero HCDB.
The solution S for this problem consists of a set of K buckets, S = {B1, B2, …, BK}, such that each bucket contains at least one sub-tree of CT, and a root-bucket BR that contains the remaining part of CT (the part with no whole sub-trees). S must result in a maximum value of the fHC factor for the given bucket size SB. Since the HCDB values of the buckets of the root-bucket BR equal zero (recall that they contain no whole sub-trees), following from equation (2), fHC can be expressed as:

fHC = ( Σ_{i=1}^{K} HCDBi ) / ( (K + λ) · DMAX )    (3)
From equation (3), it is clear that the more buckets we allocate to the root-bucket (i.e., the greater λ becomes), the lower the degree of hierarchical clustering achieved by our allocation. Alternatively, if we consider caching the whole root-bucket in main memory (see the following discussion), then we could assume that λ does not affect hierarchical clustering (since it does not introduce more bucket I/Os from the root-chunk to a simple bucket) and λ could be zeroed.
(a) fHC = 0.01 (14%)   (b) fHC = 0.03 (42%)   (c) fHC = 0.05 (69%)   (d) fHC = 0.07 (100%)
Fig. 5. The hierarchical clustering factor fHC of the same chunk-tree for 4 different chunk-to-bucket allocations.
In Fig. 5, we depict four different chunk-to-bucket allocations for the same chunk-tree. The maximum chunking depth is DMAX = 5, although in the figure we can only see the nodes up to depth D = 3 (i.e., the triangles correspond to sub-trees of 3 levels). The numbers inside each node represent the storage cost of the corresponding
sub-tree, e.g., the whole chunk-tree has a cost of 65 units. Assume a bucket size of SB = 30 units.
Fig. 6. The individual calculations of the example in Fig. 5 (region contribution cr, depth contribution cd, bucket occupancy OB, HCDB per bucket, total number of buckets K, number of buckets of the root-bucket, fHC and fHC/fHCmax).
Below each allocation we depict the calculated fHC and, beside it, the percentage with respect to the best fHC that can be achieved for this bucket size (i.e., fHC/fHCmax × 100%). The chunk-to-bucket allocation that yields the maximum fHC can be identified easily by exhaustive search in this simple case. Observe how the fHC improves gradually as we move from Fig. 5 (a) to (d).
In Fig. 5 (a) we have failed to create any bucket regions at depth D = 2. Thus each bucket stores a single sub-tree of depth 3. Note also that the occupancy of most buckets is quite low. In Fig. 5 (b) the hierarchical clustering improves, since some bucket regions have been formed: buckets B1, B3 and B4 each store two sub-trees of depth 3. In Fig. 5 (c) the total number of buckets decreases by one, since a large bucket region of four sub-trees has been formed in bucket B3. Finally, in Fig. 5 (d) we have managed to store in bucket B3 a higher-level (i.e., lower-depth) sub-tree (i.e., a sub-tree of depth 2). This increases the achieved hierarchical clustering even more, compared to the previous case (Fig. 5 (c)), because the root node is included in the same bucket as the four sub-trees. In addition, the bucket occupancy of B3 is increased.
It is now clear from this simple example that the hierarchical clustering factor fHC rewards allocations that manage to store lower-depth sub-trees in buckets, that store regions of sub-trees instead of single sub-trees, and that create highly occupied buckets. The individual calculations of this example can be seen in Fig. 6.
All in all, it is obvious that we now have the optimization problem of finding a chunk-to-bucket allocation such that fHC is maximized. This problem is NP-Hard, which results from the following theorem.
Theorem 2 (Complexity of the HPP chunk-to-bucket allocation problem): The
HPP Chunk-to-Bucket allocation problem is NP-Hard.
Proof
Assume a typical bin packing problem [42] where we are given N items with weights wi, i = 1, …, N, respectively, and a bin size B such that wi ≤ B for all i = 1, …, N. The problem is to find a packing of the items in the fewest possible bins. Assume that we create N chunks of depth d and dimensionality D, so that chunk c1 has a storage cost of w1, chunk c2 has a storage cost of w2, and so on. Also assume that N−1 of these chunks are under the same parent chunk (e.g., the Nth chunk). This way we have created a two-level chunk-tree where the root lies at depth d = 0 and the leaves at depth d = 1. Also assume that a bin and a bucket are equivalent terms. Now we have reduced in polynomial time the bin packing problem to an HPP chunk-to-bucket allocation problem, which is to find an allocation of the chunks into buckets of size B such that the achieved hierarchical clustering factor fHC is maximized.
Since all the chunk-trees (i.e., single chunks in our case) are of the same depth, the depth contribution cdi (1 ≤ i ≤ N), defined in equation (1), is the same for all chunk-trees. Therefore, in order to maximize the degree of hierarchical clustering HCDB of each individual bucket (and thus also increase the hierarchical clustering factor fHC), we have to maximize the region contribution cri (1 ≤ i ≤ N) of each chunk-tree (equation (1)). This occurs when, on the one hand, we pack into each bucket as many trees as possible and, on the other hand (due to the region proximity factor rP), when the trees of each region are as close as possible in the multidimensional space. Finally, according to the fHC definition, the
number of buckets used must be the smallest possible. If we assume that the
chunk dimensions have no inherent ordering then there is no notion of spatial
proximity within the trees of the same region and the region proximity factor
equals 1 for all possible regions (see also related discussion in the following
subsection).
In this case, the only way to maximize the HCDB of each bucket, and consequently the overall fHC, is to minimize the empty space within each bucket (i.e., maximize the bucket occupancy in equation (1)) and to use as few buckets as possible by packing the largest number of trees into each bucket. These are exactly the goals of the original bin packing problem and thus a solution to the bin packing problem is also a solution to the HPP chunk-to-bucket allocation problem and vice versa.
Since bin packing can be reduced in polynomial time to the HPP chunk-to-bucket problem, any problem in NP can be reduced in polynomial time to the HPP chunk-to-bucket problem. Furthermore, in the general case (where we have chunk-trees of varying depths and the dimensions have inherent orderings) it is not easy to find a polynomial-time verifier for a solution to the HPP chunk-to-bucket problem, since the maximum fHC that can be achieved is not known in advance (as opposed to the bin packing problem, where the minimum number of bins can be estimated by a simple division of the total weight of the items by the size of a bin). Thus the problem is NP-Hard.
We proceed next by providing a greedy algorithm based on heuristics for solving
the HPP chunk-to-bucket allocation problem in linear time. The algorithm utilizes
the hierarchical clustering degree of a bucket as a criterion in order to evaluate at
each step how close we are to an optimal solution. In particular, it traverses the
chunk-tree in a top-down depth-first manner, adopting the greedy approach that if
at each step we create a bucket with a maximum value of HCDB, then overall the
acquired hierarchical clustering factor will be maximal. Intuitively, by trying to
pack the available buckets with low-depth trees (i.e., the tallest trees) first (thus
the top-to-bottom traversal) we can ensure that we have not missed the chance to
create the best HCDB buckets possible.
In Fig. 7, we present the GreedyPutChunksIntoBuckets algorithm, which receives as input the root R of a chunk-tree CT and the fixed size SB of a bucket. The output of this algorithm is a set of buckets, each containing at least one whole chunk-tree, a directory chunk entry pointing at the root chunk R, and the root-bucket BR.
In each step the algorithm tries “greedily” to make an allocation decision that will
maximize the HCDB of the current bucket. For example, in lines 2 to 7 of Fig. 7,
the algorithm tries to store the whole input tree in a single bucket thus aiming at a
maximum degree of hierarchical clustering for the corresponding bucket. If this
fails, then it allocates the root R to the root-bucket and tries to achieve a
maximum HCDB by allocating the sub-trees at the next depth, i.e., the children of
R (lines: 9-26).
Fig. 7. A greedy algorithm (GreedyPutChunksIntoBuckets) for the HPP chunk-to-bucket allocation problem.
This is essentially achieved by including all direct children sub-trees with a size less than (or equal to) the size of a bucket (SB) in a list of candidate trees for inclusion into bucket-regions (lines: 14-16). Then the FormBuckRegions routine is called upon this list and tries to include the corresponding trees in a minimum set of buckets, by forming bucket-regions to be stored in each bucket, so that each one achieves the maximum possible HCDB (lines: 19-22). We will come back to this routine and discuss how it solves this problem in the next sub-section. Finally, for the children sub-trees of root R with a storage cost greater than the size of a bucket, we recursively solve the corresponding HPP chunk-to-bucket allocation sub-problem for each one of them (lines: 23-26). This of course corresponds to a depth-first traversal of the input chunk-tree.
It is also very important that no space is allocated for empty sub-trees
(lines: 11-13); only a special entry is inserted in the parent node to denote a
NULL sub-tree. Therefore, the allocation performed by the greedy algorithm
adapts perfectly to the data distribution, coping effectively with the native
sparseness of the cube.
The recursive calls might eventually lead us all the way down to a data chunk (at depth DMAX). Indeed, if the GreedyPutChunksIntoBuckets algorithm is called upon a root R which is a data chunk, then this means that we have come upon a data chunk with a size greater than the bucket size. This is called a large data chunk and a more detailed discussion on how to handle such chunks will follow in a later sub-section. For now, it is enough to say that in order to resolve the problem of storing such a chunk we extend the chunking further (with a technique called artificial chunking) in order to transform the large data chunk into a 2-level chunk-tree. Then, we solve the HPP chunk-to-bucket sub-problem for this sub-tree (lines: 30-35). The termination of the algorithm is guaranteed by the fact that each recursive call deals with a sub-problem on a chunk-tree smaller in size than that of the parent problem. Thus, the size of the input chunk-tree is continuously reduced.
$"
!
$!
!!
#
"
!
$
Fig. 8 A chunk-tree to be allocated to buckets by the greedy algorithm.
Assuming an input file consisting of the cube's data points along with their corresponding chunk-ids (or equivalently the corresponding h-surrogate keys per dimension), we need a single pass over this file to create the chunk-tree representation of the cube. Then the above greedy algorithm requires only time linear in the number of input chunks (i.e., the chunks of the chunk-tree) to perform the allocation of chunks to buckets, since each node is visited exactly once and in the worst case all nodes are visited.
Assume the chunk-tree of DMAX = 5 of Fig. 8. The numbers inside each node represent the storage cost of the corresponding sub-tree, e.g., the whole chunk-tree has a cost of 65 units. For a bucket size SB = 30 units the greedy algorithm yields a hierarchical clustering factor fHC = 0.72. The corresponding allocation is depicted in Fig. 9.
$"
$!
#
!!
"
!
"
!
#
!
$
Fig. 9. The chunk-to-bucket allocation for SB = 30.
The solution comprises three buckets B1, B2 and B3, depicted as rectangles in the figure. The bucket with the highest clustering degree (HCDB) is B3, because it includes the lowest-depth tree. The chunks not included in a rectangle will be stored in the root-bucket. In this case, the root-bucket consists of only a single bucket (i.e., λ = 1 and K = 3, see equation (3)), since this suffices for storing the corresponding two chunks.
3.2 Formation of Bucket-Regions
We have seen that in each step of the greedy algorithm for solving the HPP chunk-to-bucket allocation problem (corresponding to an input chunk-tree with a root node at a specific chunking depth), we try to store all the sibling trees hanging from this root into a set of buckets, forming in this way groups of trees, called bucket-regions, to be stored in each bucket. The formation of bucket-regions is essentially a special case of the HPP chunk-to-bucket allocation problem and can be described as follows:
Definition 9 (The bucket-region formation problem): We are given a set of N chunk-trees T1, T2, …, TN of the same chunking depth d. Each tree Ti (1 ≤ i ≤ N) has a size cost(Ti) ≤ SB, where SB is the bucket size. The problem is to store these trees into a set of buckets, so that the hierarchical clustering factor fHC of this allocation is maximized.
Since all the trees are of the same depth, the depth contribution cdi (1 ≤ i ≤ N), defined in equation (1), is the same for all trees. Therefore, in order to maximize the degree of hierarchical clustering HCDB of each individual bucket (and thus also increase the hierarchical clustering factor fHC), we have to maximize the region contribution cri (1 ≤ i ≤ N) of each tree (equation (1)). This occurs when, on the one hand, we create bucket-regions with as many trees as possible and, on the other hand (due to the region proximity factor rP), when the trees of each region are as close as possible in the multidimensional space. Finally, according to the fHC definition, the number of buckets used must be the smallest possible.
Summarizing, in the bucket-region formation problem we seek a set of buckets to store the input trees, so that the following three criteria are fulfilled:
1. The bucket-regions (i.e., each bucket) contain as many trees as possible.
2. The total number of buckets is minimum.
3. The trees of a region are as close in the multidimensional space as
possible.
One could observe that if we focused only on the first two criteria, then the
bucket-region formation problem would be transformed to a typical bin-packing
problem, which is a well-known NP-complete problem [42]. So intuitively the
bucket-region formation problem can be viewed as a bin-packing problem, where
items packed in the same bin must be neighbors in the multidimensional space.
The space proximity of the trees of a region is meaningful only when we have dimension domains with inherent orderings. A typical example is the TIME dimension. For instance, we might have trees corresponding to the months of the same year (which guarantees hierarchical proximity) but we would also like consecutive months to be in the same region (space proximity). This is because these dimensions are the best candidates for expressing range predicates (e.g., months from FEB99 to AUG99). Otherwise, when there is no such inherent ordering (e.g., a chunk might point to trees corresponding to products of the same category along the PRODUCT dimension), space proximity is not important and therefore all regions with the same number of trees are of equal value. In this case the corresponding predicates are typically set-inclusion predicates (e.g., products IN {"Literature", "Philosophy", "Science"}) and not range predicates, so hierarchical proximity alone suffices to ensure a low I/O cost. To measure the space proximity of the trees in a bucket-region we use the region proximity rP, which we define as follows:
Definition 10 (Region Proximity rP): We define the region proximity rP of a bucket-region R, defined in a multidimensional space S where all dimensions of S have an inherent ordering, as the relative distance of the average Euclidean distance between all trees of the region R from the longest distance in S:

rP = (distMAX − distAVG) / distMAX
In the case where no dimension of the cube has an inherent ordering, we assume that the average distance for any region is zero and thus the region proximity rP equals one. For example, in Fig. 10 we depict two different bucket-regions R1 and R2. The surrounding chunk represents the sub-cube corresponding to the months of a specific year and the types of a specific product category and defines a Euclidean space S. Each point in this figure corresponds to the root of a chunk-tree. Since only the TIME dimension, among the two, includes an inherent ordering of its values, the data space, as far as the region proximity is concerned, is specified by TIME only (a 1-dimensional metric space). The largest distance in S equals 11 and is the distance between the leftmost and the rightmost trees. The average distance for region R1 equals 2, while for region R2 it equals 5. By a simple substitution of the corresponding values in Definition 10, we find that the region proximity for R1 equals 0.8, while for R2 it equals 0.5. This is because the trees of the latter are more dispersed along the TIME dimension. Therefore region R1 exhibits better space proximity than R2.
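For illustration, the following Python sketch computes the region proximity of Definition 10, assuming that the roots of the trees are given as coordinate tuples over the dimensions that possess an inherent ordering; the point sets below are hypothetical positions chosen so that the average distances match the ones quoted above (2 and 5, with a longest distance of 11).

    from itertools import combinations
    from math import dist            # Euclidean distance (Python 3.8+)

    def region_proximity(points, max_dist):
        """Definition 10: rP = (distMAX - distAVG) / distMAX, where distAVG is the
        average pairwise distance between the trees of the region and distMAX the
        longest distance in the surrounding space S."""
        pairs = list(combinations(points, 2))
        if not pairs or max_dist <= 0:   # no ordered dimension / single tree: rP = 1
            return 1.0
        avg = sum(dist(a, b) for a, b in pairs) / len(pairs)
        return (max_dist - avg) / max_dist

    # Hypothetical 1-dimensional positions along TIME; longest distance in S = 11.
    r1 = [(0,), (2,)]                # average distance 2 -> rP close to 0.8
    r2 = [(0,), (5,)]                # average distance 5 -> rP close to 0.5
    print(round(region_proximity(r1, 11), 2), round(region_proximity(r2, 11), 2))  # 0.82 0.55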
In order to tackle the region formation problem we propose an algorithm called FormBuckRegions. This algorithm is a variation of an approximation algorithm called best-fit [42] for solving the bin-packing problem. Best-fit is a greedy algorithm that does not always find the optimal solution; however, it runs in polynomial time (it can also be implemented to run in O(N log N), N being the number of trees in the input) and provides solutions that are within a certain bound of the optimal one. In fact, the best-fit solution in the worst case is never more than roughly 1.7 times worse than the optimal solution [42]. Moreover, our algorithm exploits a space-filling curve [33] in order to visit the trees in a space-proximity preserving way. We describe it next:
Fig. 10. The region proximity for two bucket-regions: rP1 > rP2.
FormBuckRegions
Traverse the input set of trees along a space-filling curve SFC on the data space
defined by the parent chunk. Each time you process a tree, insert it in the bucket
that will yield the maximum HCDB value, among the allocated buckets, after the
insertion. On a tie, choose one randomly. If no bucket can accommodate the
current tree, then allocate a new bucket and insert the tree in it.
Note that there is no linearization of multidimensional data points that preserves
space proximity 100% [8, 13]. In the case where no dimension has an inherent
ordering the space filling curve might be a simple row-wise traversal (Fig. 11). In
this figure, we also depict the corresponding bucket-regions that are formed.
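The following Python sketch conveys the spirit of FormBuckRegions under simplifying assumptions: trees are visited in a linearized order (here a simple row-wise key, as in Fig. 11) and each tree is placed greedily into the allocated bucket that it fills best, with a new bucket opened only when nothing fits. Since equation (1) is not reproduced here, the resulting bucket occupancy is used as a stand-in for the HCDB criterion of the actual algorithm, and all names are ours.

    def form_buck_regions(trees, bucket_size, order_key):
        """Best-fit style packing of sibling chunk-trees into bucket-regions.
        `trees` is a list of (coords, cost) pairs and `order_key` linearizes the
        coordinates (row-wise here, or a space-filling curve in general)."""
        buckets = []                                  # each bucket holds a list of trees
        for coords, cost in sorted(trees, key=lambda t: order_key(t[0])):
            best, best_fill = None, -1.0
            for b in buckets:                         # choose the bucket that ends up fullest
                used = sum(c for _, c in b)
                fill = (used + cost) / bucket_size
                if used + cost <= bucket_size and fill > best_fill:
                    best, best_fill = b, fill
            if best is None:                          # nothing fits: open a new bucket
                buckets.append([(coords, cost)])
            else:
                best.append((coords, cost))
        return buckets

    # Row-wise traversal over a 2-dimensional parent chunk with 4 columns.
    row_wise = lambda coords: coords[0] * 4 + coords[1]
    trees = [((0, 0), 12), ((0, 1), 10), ((1, 0), 9), ((1, 1), 15), ((0, 3), 7)]
    for i, region in enumerate(form_buck_regions(trees, 30, row_wise)):
        print("bucket", i, region)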
We believe that a formation of bucket-regions that provides an efficient clustering of chunk-trees must be based on some query patterns. In the following we show an example of such a query-pattern driven formation of bucket-regions. A hierarchy level of a dimension can basically take part in an OLAP query in two ways: (a) as a means of restriction (e.g., year = 2000), or (b) as a grouping attribute (e.g., "show me sales grouped by month"). In the former, we ask for values on a hyper-plane of the cube perpendicular to the Time dimension at the restriction point, while in the latter we ask for values on hyper-planes that are parallel to the Time dimension. In other words, if we know for a dimension level that it is going to be used by queries more often as a restriction attribute, then we should try to create regions perpendicular to this dimension. Similarly, if we know that a level is going to be used more often as a grouping attribute, then we should opt for regions that are parallel to this dimension. Unfortunately, things are not so simple, because if, for example, we have two "restriction levels" from two different dimensions, then the requirement for regions perpendicular to the corresponding dimensions is contradictory.
Fig. 11. A row-wise traversal of the input trees.
Fig. 12. Bucket-region formation based on query patterns.
In Fig. 12, we depict a bucket-region formation that is driven by the table appearing in the figure. In this table we note, for each dimension level corresponding to a chunking depth of our example cube in Fig. 3, whether it should be characterized as a restriction level or as a grouping level. For instance, a user might know that 80% of the queries referencing a certain level will apply a restriction on it and only 20% will use it as a grouping attribute; thus this level will be characterized as a restriction level. Furthermore, in the column labeled "importance order", we order the different levels of the same depth according to their importance in the expected query load. For example, we might know that one level will appear much more often in queries than another level of the same depth, and so on.
In Fig. 12, we also depict a representative chunk for each chunking depth (of
course for the topmost levels there is only one chunk, the root chunk), in order to
show the formation of the regions according to the table. The algorithm in Fig. 13
describes how we can produce the bucket-regions for all depths, when we have as
input a table similar to the one appearing in Fig. 12.
Fig. 13. A bucket-region formation algorithm driven by query patterns.
In Fig. 12, for the chunks corresponding to the levels shown, we also depict the column-major traversal method corresponding to the second part of the algorithm. Note also that the term "fully-sized region" means a region that has a size greater than the bucket occupancy threshold, i.e., it utilizes the available bucket space well. Finally, whenever we are at a depth where a pseudo level exists for a dimension (e.g., D = 2 in our example), no regions are created for the pseudo level, of course. Also, note that bucket-region formation for chunks at the maximum chunking depth (as is the chunk at depth 3 in Fig. 12) is only required in the case where the chunking is extended beyond the data-chunk level. This is the case of large data chunks, which is the topic of the next sub-section.
3.3 Storage of Large Data Chunks
In this sub-section, we will discuss the case where the
GreedyPutChunksIntoBuckets algorithm (Fig. 7) is called with input a chunk-tree
that corresponds to a single data chunk. This, as we have already explained, would
be the result of a number of recursive calls to the GreedyPutChunksIntoBuckets
algorithm that led us to descend the chunk hierarchy and to end up at a leaf node.
Typically, this leaf node is large enough so as not to fit in a single bucket,
otherwise the recursive call upon this node would not have occurred in the first
place (Fig. 7).
Fig. 14. Example of a large data chunk.
The main idea for tackling this problem is to continue the chunking process further, even though we have fully used the existing dimension hierarchies, by imposing a normal grid. We call this chunking artificial chunking, in contrast to the hierarchical chunking presented in the previous section. This process transforms the initial large data chunk into a 2-level chunk-tree of size less than or equal to that of the original data chunk. Then, we solve the HPP chunk-to-bucket allocation sub-problem for this chunk-tree and therefore we once again call the GreedyPutChunksIntoBuckets routine upon this tree.
In Fig. 14, we depict an example of such a large data chunk. It consists of two dimensions A and B. We assume that the maximum chunking depth is DMAX = K. Therefore, K will be the depth of this chunk. Parallel to the dimensions, we depict the order codes of the dimension values of this chunk that correspond to the most detailed level of each dimension. Also, we denote their parent value on each dimension, i.e., the pivot-level values that created this chunk. Notice that the suffix of the chunk-id of this chunk consists of the concatenated order codes of the two pivot-level values.
In order to extend the chunking further, we need to insert a new level between the
most detailed members of each dimension and their parent. However, this level
must be inserted “locally”, only for this specific chunk and not for all the grain
level values of a dimension. We want to avoid inserting another pseudo level in
the whole level hierarchy of the dimension, because this would trigger the
enlargement of all dimension hierarchies and would result in a lot of useless
chunks. Therefore, it is essential that this new level remains local. To this end, we
introduce the notion of the local depth d of a chunk to characterize the artificial
chunking, similar to the global chunking depth D (introduced in the previous
section) characterizing the hierarchical chunking.
Definition 11 (Local Depth d): The local depth d, where d ≥ −1, of a chunk Ch denotes the chunking depth of Ch pertaining to artificial chunking. A local depth d = −1 denotes that no artificial chunking has been imposed on Ch. A value of d = 0 corresponds to the root of a chunk-tree created by artificial chunking and is always a directory chunk. The value of d increases by one for each artificial chunking level.
Note that the global chunking depth D, while descending levels created by
artificial chunking, remains constant and equal to the maximum global chunking
depth of the cube (in general, to the current global depth value); only the local
depth increases.
Let us assume a bucket size SB that can accommodate a maximum of Mr directory chunk entries, or a maximum of Me data chunk entries. In order to chunk a large data chunk Ch of N dimensions by artificial chunking, we define a grid on it consisting of mgi (1 ≤ i ≤ N) members per dimension, such that:

∏_{i=1}^{N} mgi ≤ Mr

This grid will correspond to a new directory chunk, pointing at the new chunks created from the artificial chunking of the original large data chunk Ch, and due to the aforementioned constraint it is guaranteed that it will fit in a bucket. If we assume a normal grid, then for all i: 1 ≤ i ≤ N, it holds that mgi = ⌊Mr^(1/N)⌋.
In particular, if ni (1 ≤ i ≤ N) is the number of members of the original chunk Ch along dimension i, then a new level consisting of mgi members will be inserted as a "parent" level. In other words, a number of ci children (out of the ni) will be assigned to each of the mgi members, where ci = ⌈ni / mgi⌉, as long as ni / mgi ≥ 1. If 0 < ni / mgi < 1, then the corresponding new level will act as a pseudo level, i.e., no chunking will take place along this dimension. If all new levels correspond to pseudo levels, i.e., ni < mgi for all i: 1 ≤ i ≤ N, then we take mgi = maximum(ni).
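As a concrete illustration of the grid arithmetic above, the following Python sketch computes the normal-grid sizes mgi and the children counts ci for a large data chunk; it covers only this arithmetic (including the pseudo-level case), not the actual splitting of the chunk, and the function name is ours.

    from math import ceil, floor

    def normal_grid(cardinalities, max_dir_entries):
        """Given the member counts n_i of a large data chunk and the maximum number
        of directory entries M_r that fit in a bucket, return one (m_gi, c_i) pair
        per dimension for a normal artificial-chunking grid."""
        n = len(cardinalities)
        m = floor(max_dir_entries ** (1.0 / n))       # m_gi = floor of the N-th root of M_r
        if all(card < m for card in cardinalities):   # every new level would be a pseudo level
            m = max(cardinalities)
        grid = []
        for card in cardinalities:
            if card < m:                              # pseudo level: no chunking on this dimension
                grid.append((card, 1))
            else:
                grid.append((m, ceil(card / m)))      # c_i = ceil(n_i / m_gi)
        return grid

    # The running example (Fig. 14/15): an 8 x 6 data chunk and M_r = 10 entries.
    print(normal_grid([8, 6], 10))                    # [(3, 3), (3, 2)]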
We will describe the above process with an example. Let us assume a bucket that can accommodate a maximum of Mr = 10 directory chunk entries or a maximum of Me = 5 data chunk entries. In this case the data chunk of Fig. 14 is a large data chunk, since it cannot be stored in a single bucket. Therefore, we define a grid with mg1, mg2 members along dimensions A and B respectively. If the grid is normal, then mg1 = mg2 = ⌊√10⌋ = 3. Thus, we create a directory chunk which consists of 3×3 = 9 cells (i.e., directory chunk entries); this is depicted in Fig. 15.
Fig. 15. The example large data chunk artificially chunked.
In Fig. 15, we can also see the new values of each dimension and the corresponding parent-child relationships between the original values and the newly inserted ones. In this case, each new value will have at most c1 = ⌈8/3⌉ = 3 children for dimension A and c2 = ⌈6/3⌉ = 2 children for dimension B, respectively.
The created directory chunk will have a global depth D = K and a local depth d = 0. Around it, we depict all the data chunks (partitions of the original data chunk) that correspond to each directory entry. Each such data chunk will have a global depth D = K and a local depth d = 1. The chunk-ids of the new data chunks include one more domain as a suffix, corresponding to the new chunking depth to which they belong. Notice that new empty chunks might arise from the artificial chunking process; see, for example, the rightmost chunk at the top of Fig. 15. Since no space will be allocated for such empty chunks, it is obvious that artificial chunking might lead to a reduction of the size of the original data chunk, especially for sparse data chunks. This important characteristic is stated in the following theorem, which shows that in the worst case the extra size overhead of the resulting 2-level tree will be equal to the size of a single bucket. However, since cubes are sparse, chunks will also be sparse and therefore in practice the size of the tree will almost always be smaller than that of the original chunk.
Theorem 3 (Size upper bound for an artificially chunked large data chunk): For any large data chunk Ch of size SCh, the two-level chunk-tree CT resulting from the application of the artificial chunking process on Ch will have a size SCT such that:

SCT ≤ SCh + SB

where SB is the bucket size.
Proof
Assume a large data chunk Ch which is 100% full. Then the application of artificial chunking produces no empty chunks. Moreover, from the definition of chunking we know that if we connect these chunks back together we will get Ch. Consequently, the total size of these chunks is equal to SCh. Now, the root chunk of the new tree CT will have (by definition) at most Mr entries, so as to fit in a single bucket. Therefore the extra size overhead caused by the root is at most SB. From this we infer that SCT ≤ SCh + SB. Naturally, if this holds for the largest possible Ch, it will certainly also hold for all other possible Ch's that are not 100% full and thus may result in empty chunks after the artificial chunking.
As soon as we create the 2-level chunk-tree, we have to solve the corresponding HPP chunk-to-bucket allocation sub-problem for this tree; i.e., we recursively call the GreedyPutChunksIntoBuckets algorithm with input the root node of the new tree. The algorithm will then try to store the whole chunk-tree in a bucket (which is possible because, as explained above, artificial chunking reduces the size of the original chunk for sparse data chunks), or create the appropriate bucket-regions and store the root node in the root-bucket (see Fig. 7). It will also mark the empty directory entries. In Fig. 15, we can see the formed region assuming that the maximum number of data entries in a bucket is Me = 5.
Finally, if there still exists a large data chunk that cannot fit by itself in a whole bucket, then we repeat the whole procedure and thus create some new data chunks at local depth d = 2. This procedure may continue until we finally store all parts of the original large data chunk.
Fig. 16. An example of a root directory.
3.4 Caching the Root Directory
In the previous sub-sections we formally defined the HPP chunk-to-bucket allocation problem. From this definition we have seen that the root-bucket BR essentially represents the entire set of buckets that have a zero degree of hierarchical clustering HCDB and, therefore, contribute nothing to the hierarchical clustering achieved by a specific chunk-to-bucket allocation. Moreover, due to the factor λ in equation (3) (λ was defined as the number of fixed-size buckets in BR), it is clear that the larger the root-bucket becomes, the worse the hierarchical clustering achieved. In this subsection, we present a method for improving the hierarchical clustering contribution of the root-bucket by reducing the λ factor, with the use of a main-memory cache area, and also by increasing the HCDB of the buckets in BR.
In Fig. 16, we depict an example of a set of directory nodes that will be stored in
the root-bucket. These are all directory chunks and are rooted all the way up to the
root chunk of the whole cube. These chunks are of different global depths D and
local depths d and form an unbalanced chunk-tree that we call the root directory.
Definition 12 (The Root Directory RD): The root directory RD of a hierarchically chunked cube C, represented by a chunk-tree CT, is an unbalanced chunk-tree with the following properties:
1. The root of RD is the root node of CT.
2. For the set SR of the nodes of RD it holds that SR ⊆ SCT, where SCT is the set of the nodes of CT.
3. All the nodes of RD are directory chunks.
4. The leaves of the root directory contain entries that point to chunks stored in a different bucket than their own.
5. RD is an empty tree iff the root node of CT is stored in the same bucket as its children nodes.
In Fig. 16, the empty cells correspond to sub-trees that have been allocated to some bucket, either on their own or together with other sub-trees (i.e., forming a bucket-region). We have omitted these links from the figure in order to avoid cluttering the picture. Also note the symbol "X" for cells pointing to an empty sub-tree. Beneath the dotted line we can see directory chunks that have resulted from the artificial chunking process described in the previous subsection.
The basic idea of the method that we describe next is based on the simple observation that if we impose hierarchical clustering on the root directory, as if it were a chunk-tree in its own right, the evaluation of HPP queries would improve, because every HPP query needs at some point to access a node of the root directory. Moreover, since the root directory always contains the root chunk of the whole chunk-tree as well as certain higher-level (i.e., lower-depth) directory chunk nodes, we could assume that these nodes are permanently resident in main memory during a query session on a cube. The latter is of course a common practice for all index structures in databases.
The algorithm that we propose for the storage of the root directory is called StoreRootDir. It assumes that directory entries in the root directory pointing to already allocated sub-trees (the empty cells in Fig. 16) are treated as pointers to empty trees, in the sense that their storage cost is not taken into account for the storage of the root directory. The algorithm receives as input the root directory RD, a cache area of size SM and a root-bucket BR of size SROOT = λ·SB (where λ ≥ 1 and SROOT ≥ SM), and produces a list of allocated buckets for the root directory; the details of the algorithm are shown in Fig. 17.
Fig. 17. Recursive algorithm (StoreRootDir) for storing the root directory.
We begin from the root and visit all nodes of RD in a breadth-first manner (lines 1-5). Each node we visit, we store in the root-bucket BR, until we find a node that can no longer be accommodated. Then, for each of the remaining unallocated chunk sub-trees of RD we solve the corresponding HPP chunk-to-bucket sub-problem (lines 10-13). For the storage of the new root directories that might result from these sub-problems, we use the StoreRootDir algorithm again, but with a zero cache area size this time (lines 15-18).
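A minimal Python sketch of this breadth-first caching step, under the assumption that each root-directory node carries its own storage cost and that the recursive HPP sub-problems are handled by a caller-supplied function; the interface is ours and merely illustrates the control flow.

    from collections import deque
    from dataclasses import dataclass, field

    @dataclass
    class DirNode:
        cost: int
        children: list = field(default_factory=list)

    def store_root_dir(root, cache_size, solve_subproblem):
        """Breadth-first caching step of StoreRootDir: fill the root-bucket (to be
        cached) with as many upper nodes as fit in `cache_size`; every sub-tree left
        over is handed to `solve_subproblem`, which stands in for the recursive HPP
        chunk-to-bucket allocation."""
        cached, used = [], 0
        frontier = deque([root])
        while frontier:
            node = frontier.popleft()
            if used + node.cost <= cache_size:     # the node still fits in the cache area
                cached.append(node)
                used += node.cost
                frontier.extend(node.children)
            else:                                  # first node that does not fit: stop caching
                solve_subproblem(node)
                while frontier:                    # ... and solve a sub-problem for the rest
                    solve_subproblem(frontier.popleft())
        return cached

    # Toy usage: a small root directory and a cache area of 60 units.
    rd = DirNode(20, [DirNode(15, [DirNode(12), DirNode(18)]),
                      DirNode(10, [DirNode(25)])])
    spilled = []
    cached = store_root_dir(rd, cache_size=60, solve_subproblem=spilled.append)
    print(len(cached), len(spilled))               # 4 cached nodes, 2 spilled sub-trees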
From the above description we can see that the proposed algorithm uses the root-bucket only for storing the higher-level nodes that will be loaded in the cache. Therefore, the I/O overhead due to the root-bucket during the evaluation of an HPP query is zeroed. Furthermore, the chunk-to-bucket allocation solution of a cube is now augmented with an extra set of buckets resulting from the solutions to the new sub-problems from within StoreRootDir. The hierarchical clustering degree HCDB of these buckets is calculated based on the input chunk-tree of the specific sub-problem and not on the chunk-tree representing the whole cube. In the case where the former is an unbalanced tree, the maximum chunking depth DMAX is calculated from the longest path from the root to a leaf.
Notice that for each such sub-problem a new root directory might arise. (In fact, the only case of an empty root directory is when the whole chunk sub-tree, upon which GreedyPutChunksIntoBuckets is called, fits in a single bucket.) Therefore, we solve each of these sub-problems by recursively using StoreRootDir, but this time with no available cache area. This makes StoreRootDir recursively invoke the GreedyPutChunksIntoBuckets algorithm, until all chunks of a sub-tree are allocated to a bucket. Recall from the previous sub-sections that the termination of the GreedyPutChunksIntoBuckets algorithm is guaranteed by the fact that each recursive call deals with a sub-problem on a chunk-tree smaller in size than that of the parent problem. Thus, the size of the input chunk-tree continuously reduces. Consequently, this also guarantees the termination of StoreRootDir.
$"
$!
#
!!
"
!
"
!
#
!
$
Fig. 18. Resulting allocation of the running example cube for a bucket size SB = 30 and a cache
area equal to a single bucket.
Note that the root directory is a very small fragment of the overall cube data
space. Thus, it is realistic to assume that in most cases we can store the whole root
directory in the root-bucket and load it entirely in the cache during querying. In
this case, we can evaluate any point HPP query with a single bucket I/O.
In the following we provide an upper bound for the size of the root directory. In order to compute this upper bound, we use the full chunk-tree resulting from the hierarchical chunking of a cube. A guaranteed upper bound for the size of the root directory could be the size of all the possible directory chunks of this tree. However, the root directory of the CUBE File is a significantly smaller version of the whole directory tree for the following reasons: (a) it does not contain all directory chunk nodes, only the ones that were not stored in a bucket along with their descendants, (b) space is not allocated for empty sub-trees, and (c) chunks are stored in a compressed form, not wasting space for empty entries.
Lemma 1: For any cube C consisting of N dimensions, where each dimension has a hierarchy represented by a complete K-level m-way tree, the size of the root directory in terms of the number of directory entries is O(m^(N(K−2))).
Proof
Since the root directory is always (by its definition) smaller in size than the tree containing all the possible directory chunks (called the directory tree), we can write: the size of the root directory is O(size of the directory tree). The size of the directory tree can easily be computed by the following series, which adds up the number of all possible directory entries:

Size of directory tree = 1 + m^N + m^(2N) + … + m^((K−2)N) = O(m^(N(K−2)))
Next, we provide a theorem that proves an upper bound for the ratio between the size of the root directory and that of the full most-detailed data space of a cube.
Theorem 4 (Upper bound of the size ratio between the root directory and the cube's data space): For any cube C consisting of N dimensions, where each dimension has a hierarchy represented by a complete K-level m-way tree, the ratio of the root directory size to the full size of C's detailed data space (i.e., the Cartesian product of the cardinalities of the most detailed levels of all dimensions) is O(1/m^N).
Proof
From the above lemma we have that the size of the root directory is O(m^(N(K−2))). Similarly, we can prove that the size of C's most detailed data space is O(m^(N(K−1))). Therefore:

(root directory size) / (cube most detailed data space size) = O(m^(N(K−2)) / m^(N(K−1))) = O(1/m^N)
Theorem 4 proves that as the dimensionality increases, the ratio of the root directory size to the full cube size at the most detailed level decreases exponentially. Therefore, as N increases, the root directory size very quickly becomes negligible compared to the cube's data space.
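For a feel of how quickly this ratio shrinks, the following Python sketch evaluates the directory-tree series of Lemma 1 and the resulting ratio for a few hypothetical (m, N, K) configurations; the numbers are illustrative only.

    def directory_tree_entries(m, n, k):
        """Size of the full directory tree in directory entries:
        1 + m^N + m^(2N) + ... + m^((K-2)N), an upper bound for the root directory."""
        return sum(m ** (i * n) for i in range(k - 1))

    def size_ratio(m, n, k):
        """Ratio of the directory-tree size to the most detailed data space m^((K-1)N),
        which Theorem 4 bounds by O(1/m^N)."""
        return directory_tree_entries(m, n, k) / m ** ((k - 1) * n)

    for n in (2, 4, 6):               # hypothetical cubes with 3-way, 4-level hierarchies
        print(n, round(size_ratio(3, n, 4), 6), round(1 / 3 ** n, 6))
    # The ratio tracks 1/m^N and shrinks exponentially as the dimensionality N grows.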
$"
!
$!
"
.
!!
"
$#
#
#
!
$
!
/
)
$
Fig. 19. Resulting allocation of the running example cube for a bucket size SB = 10 and a cache
area equal to a single bucket.
If we go back to the allocated cube in Fig. 9 and assume a cache area of size equal to a single bucket, then the StoreRootDir algorithm will store the whole root directory in the root-bucket. In other words, the root directory can be fully accommodated in the cache area, and therefore from equation (3), for K = 3 and λ = 0 (since the root-bucket will be loaded into memory, the λ factor is zeroed), we get an improved hierarchical clustering factor fHC = 0.96. The new allocation is depicted in Fig. 18. Notice that any point query can now be answered with a single bucket I/O.
If for the cube of our running example we assume a bucket size of SB = 10, then the chunk-to-bucket allocation resulting from GreedyPutChunksIntoBuckets and the subsequent call to StoreRootDir is depicted in Fig. 19. In this case, we have once more assumed a cache area equal to a single bucket. In the figure, we can see the upper nodes allocated to the cache area (i.e., stored in the root-bucket) in a breadth-first way. The buckets B1 to B5 have resulted from the initial call to GreedyPutChunksIntoBuckets. Buckets B6 and B7 store the remaining nodes of the root directory that could not be accommodated in the cache area and are a result of the call to the StoreRootDir algorithm. Finally, in Fig. 20, we present the corresponding allocation for a zero cache area.
$"
/
!
$!
.
!!
"
$#
#
#
"
!
$
!
/
)
$
Fig. 20. Resulting allocation of the running example cube for a bucket size SB = 10 and a zero
cache area.
This concludes the presentation of the data structures and algorithms used to construct the CUBE File. We next present detailed experimental evaluation results.
4 Experimental Evaluation
We have conducted an extensive set of experiments over our CUBE File implementation. The large set of experiments covers both the structural and the query evaluation aspects of the data structure. In addition, we wanted to compare the CUBE File with the UB-tree/MHC (which to our knowledge is the only multidimensional structure that achieves hierarchical clustering with the use of h-surrogates), both in terms of structural behavior and query evaluation time. The latter comparison yielded 7-9 times fewer I/Os on average, in favor of the CUBE File, for all workloads tested, and the former showed a 2-3 times lower storage cost for almost all data spaces, again in favor of the CUBE File, hence providing evidence that the CUBE File achieves a higher degree of hierarchical clustering of the data. These results appear in [18]. Note that the same comparison, but between the UB-tree/MHC and a bitmap-index-based star schema, has shown a query evaluation speedup of 20 to 40 times on average (depending on the use or not of the pre-grouping transformation optimization [40]); see [15] for more details.
The query performance measurements in [18] were based on HPP queries (see Definition 2) that resulted in a single or multiple disjoint query boxes (i.e., hyper-rectangles) at the grain level of the cube data space. Both hot and cold cache query evaluations were examined. In CUBE File parlance, this translates to a cached or a non-cached root-bucket respectively.
Our query load consisted of various query classes with respect to the cube selectivity (i.e., how many data points were returned in the result set). The CUBE File performed several times fewer I/Os than the UB-tree in all query classes, for both hot and cold cache experiments, exhibiting superior hierarchical clustering. For large-selectivity queries (i.e., many data points in the result set), where the hierarchical restrictions were posed on higher hierarchy levels, the CUBE File needed 3 times fewer I/Os than the UB-tree. Interestingly, for small-selectivity queries, where the restrictions were posed on more detailed hierarchy levels, the difference in I/Os increased impressively (in favour of the CUBE File), reaching a factor larger than 10 for all relevant query classes, and up to 37 for certain query classes.
Note that the most decisive factor for any HPP query in order to run fast (i.e., with
few I/Os) is to achieve hierarchical clustering at all levels of the dimension
hierarchies. This is more obvious in small selectivity queries, where one has to
achieve hierarchical clustering even at the most detailed levels of the hierarchy.
For queries with small cube selectivities the UB-tree performance was worse and the hierarchical clustering effect was reduced. This is due to the way data are clustered into z-regions (i.e., disk pages) along the z-curve [2]. In contrast, the hierarchical
chunking applied in the CUBE File, creates groups of data (i.e., chunks) that
belong in the same “hierarchical family” even for the most detailed levels. This, in
combination with the chunk-to-bucket allocation, which guarantees that
hierarchical families will be physically stored together, results in better
hierarchical clustering of the cube even for the most detailed levels of the
hierarchies.
In this paper, we want to present further experimental results that show the
adaptation of the CUBE file structure to data spaces of varying characteristics
such as cube sparseness and number of total data points (i.e., scalability tests).
Dimension   #Levels   Grain Level Cardinality
D1          4         2000
D2          5         3125
D3          7         6912
D4          3         500
D5          9         8748
D6          2         36
D7          10        7776
D8          8         6561
D9          6         4096
Fig. 21. Dimension hierarchy configuration for the experimental data sets.
We have used synthetic data sets that were produced with an OLAP data generator that we have developed. Our aim was to create data sets with a realistic number of dimensions and hierarchy levels. In Fig. 21, we present the hierarchy configuration for each dimension used in the experimental data sets. The shortest hierarchy consists of 2 levels, while the longest consists of 10 levels. We tried to make each data set consist of a good mixture of hierarchy lengths. In order to evaluate the adaptation to sparse data spaces, we created cubes that were very sparse. Therefore the number of input tuples was kept between a small and a moderate level. To simulate the cube data distribution, for each cube we created ten hyper-rectangular regions as data point containers. These regions are defined randomly at the most detailed level of the cube and not by combinations of hierarchy values (although this would be more realistic), in order not to particularly favor the CUBE File due to the hierarchical chunking. We then filled each region with uniformly spread data points and tried to maintain the same number of data points in each region.
                               SPARSE                          SCALE
#Dimensions                    Varying                         5
#Tuples                        100,000                         Varying
#Facts                         1                               1
Maximum chunking depth         Depends on longest hierarchy    8
Bucket size (bytes)            8K                              8K
Bucket occupancy threshold     80%                             80%
Fig. 22. Data set configuration for the two series of experiments.
We have divided our experiments into two sets, depending on the characteristic for which we wanted to analyze the CUBE File's behavior: (a) data space sparseness (SPARSE), and (b) input data point scalability (SCALE). Fig. 22 shows the data set configuration for each series of experiments.
4.1 Adaptation to Data Space Sparseness
We increase the dimensionality of the cube while maintaining the number of data points constant (approximately 100K tuples); in this way we essentially increase the cube sparseness. The cube sparseness is measured as the ratio of the actual number of cube data points to the product of the cardinalities of the dimension grain levels.
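For concreteness, the sparseness measure can be computed as in the following small Python sketch, using the grain-level cardinalities of Fig. 21; the function name is ours.

    from math import prod

    def cube_sparseness(num_data_points, grain_cardinalities):
        """Sparseness = actual data points / size of the full grain-level data space."""
        return num_data_points / prod(grain_cardinalities)

    # 9-dimensional configuration of Fig. 21 with 100,000 input tuples.
    grains = [2000, 3125, 6912, 500, 8748, 36, 7776, 6561, 4096]
    print(cube_sparseness(100_000, grains))        # roughly 7e-26: an extremely sparse cube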
The primary hypotheses that we aimed to prove experimentally were the
following:
1. The CUBE File adapts perfectly to the extensive sparseness of the data
space and thus its size does not increase as the cube sparseness increases.
2. Hierarchical clustering achieved by the CUBE File is almost unaffected by
the extensive cube sparseness.
3. The root-bucket size remains low compared to the CUBE File size and thus it is feasible to cache it in main memory in realistic cases.
Additionally, we have drawn other interesting conclusions regarding the
structure’s behavior as sparseness increases.
In Fig. 23, we observe the data space size “exploding” exponentially as the
number of dimensions increases. We can see that the data space size is many
orders of magnitude larger than the CUBE File size.
Fig. 23. CUBE File size (in logarithmic scale) for increasing dimensionality.
In addition, the CUBE File size is smaller than that of the input file, which contains the input tuples (i.e., fact values accompanied by their corresponding chunk-ids, or equivalently h-surrogates) to be loaded into the CUBE File. This is depicted more clearly in Fig. 25. There, we can see that the total CUBE File size is smaller than that of the input data file, although the former maintains a whole tree structure of intermediate directory nodes, essentially because the CUBE File does not allocate space for empty sub-trees and does not store the coordinates along with the measure values.
In the graph, we can see that the CUBE File size exceeds that of the input data file only after the dimensionality exceeds eight dimensions. The real cause in this case is the cube sparseness, which is magnified by the dimensionality increase. In our case, for nine dimensions and 100,000 input data points, the sparseness has reached a value of 7.08×10^−26, which is an extreme case.
This clearly shows that the CUBE File:
1. Adapts to the large sparseness of the cube, allocating space comparable to the actual number of data points and not to all possible cells.
2. Achieves a compression of the input data since it does not store the data
point coordinates (i.e., the h-surrogate keys/chunk-ids) but only the
measure values.
The last point is depicted more clearly in Fig. 24, where we present the compression achieved by the CUBE File organization as the cube sparseness increases. This compression is calculated as the ratio of the CUBE File size to the data space size (or the input file size), subtracted from 1. With respect to the data space size, the compression is always 100% for all depicted sparseness values. This is reasonable, since the CUBE File size is always many orders of magnitude smaller than the data space size. In addition, with respect to the input file, the compression remains high (above 50%) even for cubes with sparseness values down to 10^−20. This shows that for all practical cases of cubes the compression achieved is significant.
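The compression metric of Fig. 24 reduces to the following small Python sketch; the sizes in the example are hypothetical.

    def compression(cube_file_size, reference_size):
        """1 - CUBE File size / reference size (the data space or the input file)."""
        return 1.0 - cube_file_size / reference_size

    # Hypothetical sizes: a 40 MB CUBE File built from a 100 MB input file.
    print(compression(40e6, 100e6))                # 0.6, i.e., 60% compression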
Fig. 24. Compression achieved by the CUBE File as the cube sparseness increases.
It is worth noting that, for the measurements presented in this report, the CUBE File implementation does not impose any compression on the intermediate nodes (i.e., the directory chunks). Only the data chunks are compressed by means of a bitmap representing the cell offsets (called the compression bitmap), which, however, is itself stored uncompressed. This was a deliberate choice, in order to evaluate the compression achieved merely by the "pruning ability" of our chunk-to-bucket allocation scheme, according to which no space is allocated for empty chunk-trees. Finally, another factor that reduces the achieved compression is that in our current implementation we also store the chunk-id of each chunk. This is due to a "defensive" design choice made in the early stages of the implementation, but it is not necessary for accessing the chunks, since chunk-ids are not used for associative search when accessing the CUBE File. Therefore, the following could improve the compression ratio even further:
1. Compression of directory chunks
2. Removal of chunk-ids from chunks
3. Compression of bitmaps (e.g., with run-length encoding)
Fig. 25. Several sizes for increasing cube sparseness (via increase in dimensionality).
In addition, in Fig. 25 we depict the root-bucket size and the chunk-tree size. The
root-bucket grows in a similar way to the CUBE File; however, its size is always
one or two orders of magnitude smaller. We will return to the root-bucket shortly.
The chunk-tree size refers to the chunk-tree representation of the cube, i.e., it is the
sum of the sizes of all the chunks comprising the chunk-tree. Interestingly, we
observe that as dimensionality increases (i.e., cube sparseness increases), the size
of the chunk-tree exceeds that of the CUBE File. This seems rather strange, since
one would expect the CUBE File size, which includes the storage overhead of the
buckets, to be greater. The explanation lies in the existence of large data chunks.
The chunk-tree representation may include large data chunks, which in the
chunk-to-bucket allocation process will be artificially chunked. However, in sparse
data spaces these large data chunks are also very sparse, and most of their size cost
is due to the compression bitmap. When such a sparse chunk is artificially chunked,
its size is significantly reduced due to the pruning ability of the allocation
algorithm. Therefore, in sparse cube data spaces, artificial chunking provides
substantial compression as a side effect. Fig. 26 also verifies the existence of
many large data chunks in highly sparse data spaces.
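A back-of-the-envelope example illustrates the effect (the numbers are hypothetical): consider a large data chunk of 1,000,000 cells containing only 200 data points, whose size is dominated by a one-bit-per-cell compression bitmap. If artificial chunking splits it into 1,000 sub-chunks of 1,000 cells each and the allocation algorithm prunes the all-empty ones, at most 200 non-empty sub-chunks remain, i.e., at most 200,000 bitmap bits instead of 1,000,000, a reduction of at least 80% at the modest cost of one additional directory chunk.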
In Fig. 26, we depict the chunk distribution as dimensionality increases. Note that
the number of chunks depicted is the number of "real" chunks that will eventually
be stored in buckets, and not the number of "possible" chunks deriving from the
hierarchical chunking process. One interesting result that can be drawn from this
graph is that an increase in dimensionality does not necessarily mean an increase
in the total number of chunks. In fact, we observe this metric decreasing as
dimensionality increases, reaching a minimum when the dimensionality becomes 7.
One would expect the opposite, since the number of chunks at each depth
generated by hierarchical chunking equals the product of the dimension
cardinalities at the corresponding levels. The explanation here lies again in the
pruning ability of our method. This shows that although the number of "possible"
chunks increases, the number of "real" chunks might decrease for certain data
distributions and hierarchy configurations. Again, this provides evidence that the
CUBE File adapts well to the sparseness of the data space.
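As a hypothetical illustration, if at some chunking depth the dimension levels have cardinalities 10, 12, 8, 6 and 5, hierarchical chunking defines 10 × 12 × 8 × 6 × 5 = 28,800 possible chunks at that depth; the CUBE File, however, materializes only those whose sub-trees contain at least one data point, which in a very sparse cube can be a tiny fraction of this product.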
Fig. 26. The distribution of chunks for increasing cube sparseness (via increase in dimensionality).
Another interesting result is that quite soon (from dimensionality 5 and above) the
total number of directory chunks exceeds the total number of data chunks. This
leads us to the conclusion that compression of the directory chunks (which, as
mentioned above, has not been implemented in our current version) is indeed
meaningful and might provide significant savings.
Finally, we observe an increase in the number of large data chunks. This is an
implementation effect and not a characteristic of the data structure. As we have
already noted, the current chunk implementation leaves the compression bitmap
uncompressed. As the space becomes sparser, these large data chunks are
essentially almost empty data chunks with a very large compression bitmap that is
almost entirely filled with 0s. Of course, a more efficient storage of this bitmap
(even with a simple run-length encoding scheme) would eliminate this effect and
these data chunks would not appear as "large". The existence of many large data
chunks in high dimensionalities also explains the fact that the number of
root-directory chunks (i.e., the chunks that will be stored mainly in the root-bucket,
but also in simple buckets if the root-bucket overflows) exceeds the total number of
directory chunks. This is because the total number of directory chunks appearing
in the graph does not include the directory chunks arising from the artificial
chunking of large data chunks, which are not created initially by the hierarchical
chunking process but dynamically, during the chunk-to-bucket allocation phase.
In Fig. 27, we depict the relative size of the root-bucket with respect to the total
CUBE File size. In particular, we can see the ratio of the root-bucket size to the
total CUBE File size for continuously increasing values of the cube sparseness. It
is clear that even for extremely sparse cubes, with sparseness values down to
10^-18, the total root-bucket size remains less than 20% of the total CUBE File
size. For all realistic cases this ratio is below 5%. Once more, the remarks made
above regarding the compression hold for this case too, i.e., in our experiments no
compression has been applied to the root-bucket chunks, other than the pruning
of empty regions.
Fig. 27. Relative growth of the size of the root-bucket as the cube sparseness becomes greater.
Finally, we have measured the hierarchical clustering achieved for increasing cube
sparseness. In Fig. 28, we depict fHC values that have been normalized to the range
[0,1]. We can observe in this figure that the fHC values vary by only about 70%
from one end of the curve to the other, while the cube sparseness varies by 20
orders of magnitude. Thus the hierarchical clustering factor is essentially not
affected by the increase in cube sparseness, and the CUBE File manages to
maintain a high quality of hierarchical clustering even for extremely sparse data
spaces.
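(The normalized values in Fig. 28, and later in Fig. 33, are assumed here to be obtained by dividing each measured factor by the best value achieved, so a plotted value of 0.7 denotes 70% of that best value.)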
Fig. 28. The hierarchical clustering factor fHC as cube sparseness increases.
We recapitulate the main conclusions drawn regarding the CUBE File's behavior
under conditions of increasing sparseness:
1. It adapts to the extensive sparseness of the cube, allocating space in proportion
to the actual number of data points rather than to all possible cells.
2. Moreover, it achieves more than 50% compression of the input data for all
realistic cases.
3. In sparse cube data spaces, artificial chunking provides substantial
compression as a side effect due to the existence of many large data
chunks.
4. An increase in dimensionality does not necessarily mean an increase in the
total number of chunks for the CUBE File. The number of "possible" chunks
indeed increases, but the CUBE File stores only those that are non-empty.
5. Compression of directory chunks in data spaces of large dimensionality is
likely to yield significant storage savings.
6. The root-bucket size remains below 20% of the total CUBE File size even for
extremely sparse cubes. For more realistic cases of sparseness this ratio is below
5%. Thus caching the root-bucket (or at least a significant part of it) in main
memory is indeed feasible.
7. The hierarchical clustering factor is essentially not affected by the cube
sparseness increase and the CUBE File manages to maintain a high quality
of hierarchical clustering even for extremely sparse data spaces.
Scalability in the number of data points
This series of experiments aimed at evaluating the scalability of the CUBE File.
To this end we increased the number of input tuples, while maintaining a fixed set
of 5 dimensions (D1 to D5 in Fig. 21). However, we kept the maximum number of
tuples at a moderate level (1 million rows), in order to maintain the extensive
sparseness of the cube, which is the more realistic setting. The primary hypotheses
that we aimed to prove with this set of experiments were the following:
1. The CUBE File is scalable (its size remains smaller than that of the input file
as the number of input data points increases).
2. The hierarchical clustering achieved remains of high quality when the number
of input data points increases.
3. The root-bucket size remains low compared to the CUBE File size, and thus it
can feasibly be cached in main memory in realistic cases.
The first and the third hypotheses can be confirmed directly from Fig. 29. In this
figure, we can see that the CUBE File size remains smaller than that of the input
file for all data sets. We can also see the difference between the CUBE File size
and that of the root-bucket becoming larger. Thus, as the tuple cardinality
increases, the root-bucket becomes a continually smaller fraction of the CUBE
File. Finally, we can see that the chunk-tree size is very close to the CUBE File
size, which demonstrates the high space utilization achieved by the CUBE File.
More interestingly, in Fig. 30, we depict the compression achieved by the CUBE
File as the number of cube data points increases. With respect to the data space
size the compression is constantly 100%. With respect to the input data file the
compression becomes high (around 70%) very quickly and maintains this rate for
all tuple cardinalities. In fact, it appears that the compression reaches a maximum
value and then remains almost constant; that is, both sizes grow at the same rate.
This is clear evidence that the CUBE File utilizes space efficiently: it saves a
significant portion of storage by discarding the dimension foreign keys of each
tuple (i.e., the chunk-ids or h-surrogates) and then retains this size difference by
growing in proportion to the number of input tuples.
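The magnitude of this saving can be gauged with purely hypothetical figures: if each input tuple carried, say, five 4-byte dimension keys plus one 8-byte measure (28 bytes in total), then dropping the keys would leave only the 8-byte measure per data point, eliminating roughly 70% of the per-tuple payload, which is consistent with the compression level observed against the input file.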
Fig. 29. CUBE File size as the number of cube data points increases.
Fig. 30. The compression achieved by the CUBE File as the number of cube data points increases.
Fig. 31 depicts the decrease of the ratio of the root-bucket size to the CUBE File
size as the number of input tuples increases. It shows that for realistic tuple
cardinalities the root-bucket size becomes negligible compared to the CUBE File
size. Therefore, for realistic cube sizes (> 1000K tuples) the root-bucket size is
below 5% of the CUBE File size and it could be cached in main memory. Finally,
we observe that the ratio decreases super-linearly with the number of input tuples,
which further supports this statement.
In Fig. 32, we depict the distribution of buckets by content as the number of input
tuples increases. Observe that as the space gradually becomes more dense and
more data points fill up the empty regions, more chunk sub-trees are created and
thus the number of bucket-region buckets increases rapidly. This is a very
welcome result, since the more bucket regions are formed, the better the
hierarchical clustering of the chunk-to-bucket allocation becomes.
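To make the link between bucket regions and clustering quality more concrete, the sketch below shows one simple way in which sibling chunk sub-trees could be packed greedily into fixed-size buckets so that hierarchically related data stay physically close. It is only an illustration under assumed values (the bucket capacity and the example sub-tree sizes are made up) and is not the exact allocation algorithm of the CUBE File.

```python
BUCKET_SIZE = 8192  # assumed bucket (disk page) capacity in bytes

def pack_subtrees(subtree_sizes):
    """Greedily pack sibling chunk sub-trees into buckets.

    Sub-trees placed together in one bucket form a 'bucket region',
    so hierarchically related data end up physically close on disk.
    Returns a list of buckets, each holding sub-tree indices.
    """
    buckets, current, used = [], [], 0
    # Visiting siblings in order keeps hierarchically close sub-trees together.
    for i, size in enumerate(subtree_sizes):
        if size > BUCKET_SIZE:
            # An oversized sub-tree would be handled separately
            # (e.g., via artificial chunking); it is skipped in this sketch.
            continue
        if used + size > BUCKET_SIZE and current:
            buckets.append(current)
            current, used = [], 0
        current.append(i)
        used += size
    if current:
        buckets.append(current)
    return buckets

# Hypothetical sizes (in bytes) of the sub-trees hanging off one directory chunk.
print(pack_subtrees([3000, 2500, 4000, 1200, 9000, 500]))
# -> [[0, 1], [2, 3, 5]]  (sub-tree 4 exceeds the bucket size and is skipped)
```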
Fig. 31. The ratio of the root-bucket size to the CUBE File size for increasing tuple cardinality.
Fig. 32. The distribution of buckets as tuple cardinality increases.
This last point is further exhibited in Fig. 33, where we depict the normalized
values of the hierarchical clustering factor for each data set. We can clearly see
that the hierarchical clustering quality remains high for all data sets. In particular,
the experiments show that the hierarchical clustering factor remains approximately
0.7 (i.e., 70% of the best value achieved) even when the tuple cardinality is
increased by 3 orders of magnitude. This essentially confirms the second
hypothesis that we posed at the beginning of this sub-section.
Fig. 33. The hierarchical clustering factor fHC as tuple cardinality increases.
Conclusions
In this paper, we tried to solve the problem of devising physical clustering
schemes for multidimensional data that are organized in hierarchies. A typical
case of such data is the OLAP cube. The problem of clustering on disk the most
detailed data of a cube so as to reduce the I/Os during the evaluation of
hierarchy-selective queries is difficult due to the enormous search space of possible
solutions. Instead of following the typical approach of finding a linear ordering of
the data points, we introduced a representation of the search space (i.e., a model)
based on a hierarchical chunking method, which results in a chunk-tree
representation of the cube. We then treated the problem as a packing problem, in
particular the packing of chunks into buckets.
The chunk-tree representation is a very effective model of the cube data space,
because it prunes all empty areas (i.e., chunk-trees) and adapts perfectly to the
usual extensive sparseness of the cube. Moreover, by traversing the chunk-tree
nodes we can very efficiently access subsets of the data space that are based on
hierarchy value combinations. This makes the chunk-tree an excellent index for
queries with hierarchical restrictions.
In order to be able to evaluate the solutions to the proposed problem, we defined a
quality metric, namely the hierarchical clustering factor fHC of a cube.
Furthermore, we formally defined the problem as an optimization problem and
proved that it is NP-Hard via a reduction from the bin packing problem. As a
solution, we proposed an effective greedy algorithm that requires a single pass over
the input fact table and linear time in the number of chunks. Moreover, we
analyzed and provided solutions for a number of sub-problems, such as the
formation of bucket regions, the storage of large data chunks and the storage of
the root-directory. The whole solution leads to the construction of the CUBE File
data structure.
We presented an extensive set of experiments analyzing the structural behavior of
the CUBE File in terms of increasing sparseness and data point scalability. Our
experimental results have confirmed our principal hypotheses that:
1. The CUBE File adapts perfectly to even the most extremely sparse data
spaces yielding significant space savings. Furthermore, the hierarchical
clustering achieved by the CUBE File is almost unaffected by the
extensive cube sparseness.
2. The CUBE File is scalable: its size remained consistently about 70% smaller
than that of the input tuple-based file for all input data point cardinalities tested.
In addition, the hierarchical clustering achieved remains of high quality when the
number of input data points increases.
3. The root-bucket size remained low (below 5% of the total CUBE File size) for
all realistic cases of sparseness and data point cardinality, and thus caching it in
main memory is a feasible proposal. This enables single-I/O evaluation of point
queries and dramatically reduces I/Os for all types of hierarchy-selective
queries [18].
All in all, the CUBE File is an effective data structure for physically organizing
and indexing the most detailed data of an OLAP cube. One area where such a
structure could be used successfully is as an alternative to bitmap-index-based
processing of star-join queries; to this end, an efficient processing framework has
been proposed in [15]. However, it could also be used as an effective index for any
data that are accessed through multidimensional queries with hierarchical
restrictions.
An interesting enhancement to the CUBE File would be to incorporate more
workload-specific knowledge in its chunk-to-bucket allocation algorithm. For
example, the allocation of more frequently accessed sub-trees to the same bucket
could be rewarded with a higher HCDB value. We are also investigating the
use of the hierarchical clustering factor for making decisions during the
construction of other common storage organizations (e.g., partitioned heap files,
B-trees, etc.) in order to achieve hierarchical clustering of the data. The interested
reader can find more information regarding other aspects of the CUBE File not
covered in this paper (e.g., the updating and maintenance operations), as well as
information on a prototype implementation of a CUBE File-based DBMS, in [16].
Acknowledgements
We would like to thank our colleagues Yannis Kouvaras and Yannis Roussos
from the Knowledge and Database Systems Laboratory at the National Technical
University of Athens for their fruitful comments and their support in the
implementation of the CUBE File and the completion of the experimental
evaluation. We would also like to thank Aris Tsois for his detailed review of and
comments on the first draft. This work has been partially funded by the European
Union's Information Society Technologies Programme (IST) under project
EDITH (IST-1999-20722).
References
1. Bayer R, McCreight E (1972) Organization and Maintenance of Large Ordered Indexes. Acta Informatica 1: 173-189, 1972.
2. Bayer R (1997) The Universal B-Tree for Multi-dimensional Indexing: General Concepts. WWCA 1997.
3. Chan CY, Ioannidis Y (1998) Bitmap Index Design and Evaluation. SIGMOD 1998.
4. Chaudhuri S, Dayal U (1997) An Overview of Data Warehousing and OLAP Technology. SIGMOD Record 26(1): 65-74 (1997)
5. Deshpande PM, Ramasamy K, Shukla A, Naughton J (1998) Caching Multidimensional Queries Using Chunks, in: Proc. ACM SIGMOD Int. Conf. on Management of Data (1998) 259-270.
6. Fagin R, Nievergelt J, Pippenger N, Strong HR (1979) Extendible Hashing - A Fast Access Method for Dynamic Files. TODS 4(3): 315-344 (1979)
7. Faloutsos C, Rong Y (1991) DOT: A Spatial Access Method Using Fractals. ICDE 1991: 152-159
8. Gaede V, Günther O (1998) Multidimensional Access Methods. ACM Computing Surveys 30(2): 170-231 (1998)
9. Gray J, Bosworth A, Layman A, Pirahesh H (1996) Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Total. ICDE 1996.
10. Gupta A, Mumick IS (1995) Maintenance of Materialized Views: Problems, Techniques, and Applications. Data Engineering Bulletin 18(2): 3-18 (1995)
11. Harinarayan V, Rajaraman A, Ullman JD (1996) Implementing Data Cubes Efficiently, in: Proc. ACM SIGMOD Int. Conf. on Management of Data (1996) 205-227.
12. Hinrichs K (1985) Implementation of the Grid File: Design Concepts and Experience. BIT 25(4): 569-592 (1985)
13. Jagadish HV (1990) Linear Clustering of Objects with Multiple Attributes. SIGMOD Conference 1990: 332-342
14. Jagadish HV, Lakshmanan LVS, Srivastava D (1999) Snakes and Sandwiches: Optimal Clustering Strategies for a Data Warehouse. SIGMOD Conference 1999: 37-48
15. Karayannidis N et al (2002) Processing Star-Queries on Hierarchically-Clustered Fact-Tables. VLDB 2002.
16. Karayannidis N (2003) Storage Structures, Query Processing and Implementation of On-Line Analytical Processing Systems. Ph.D. Thesis, National Technical University of Athens, 2003. Available at: http://www.dblab.ece.ntua.gr/~nikos/thesis/PhD_thesis_en.pdf.
17. Karayannidis N, Sellis T (2003) SISYPHUS: The Implementation of a Chunk-Based Storage Manager for OLAP Data Cubes. Data and Knowledge Engineering 45(2): 155-188, May 2003.
18. Karayannidis N, Sellis T, Kouvaras Y (2004) CUBE File: A File Structure for Hierarchically Clustered OLAP Cubes. 9th International Conference on Extending Database Technology (EDBT 2004), Heraklion, Crete, Greece, March 14-18, 2004: 621-638.
19. Kotidis Y, Roussopoulos N (1998) An Alternative Storage Organization for ROLAP Aggregate Views Based on Cubetrees, in: Proc. ACM SIGMOD Int. Conf. on Management of Data (1998): 249-258.
20. Lakshmanan LVS, Pei J, Han J (2002) Quotient Cube: How to Summarize the Semantics of a Data Cube. VLDB 2002.
21. Lakshmanan LVS, Pei J, Zhao Y (2003) QC-Trees: An Efficient Summary Structure for Semantic OLAP. SIGMOD 2003.
22. Markl V, Ramsak F, Bayer R (1999) Improving OLAP Performance by Multidimensional Hierarchical Clustering. IDEAS 1999.
23. Nievergelt J, Hinterberger H, Sevcik KC (1984) The Grid File: An Adaptable, Symmetric Multikey File Structure. TODS 9(1) (1984) 38-71
24. OLAP Report (1999) Database Explosion. Available at: http://www.olapreport.com/DatabaseExplosion.htm.
25. O'Neil PE, Graefe G (1995) Multi-Table Joins Through Bitmapped Join Indices. SIGMOD Record 24(3): 8-11 (1995).
26. O'Neil PE, Quass D (1997) Improved Query Performance with Variant Indexes. SIGMOD 1997.
27. Orenstein JA, Merrett TH (1984) A Class of Data Structures for Associative Searching. PODS 1984: 181-190
28. Padmanabhan S, Bhattacharjee B, Malkemus T, Cranston L, Huras M (2003) Multi-Dimensional Clustering: A New Data Layout Scheme in DB2. SIGMOD Conference 2003: 637-641
29. Pieringer R et al (2003) Combining Hierarchy Encoding and Pre-Grouping: Intelligent Grouping in Star Join Processing. ICDE 2003.
30. Ramsak F, Markl V, Fenk R, Zirkel M, Elhardt K, Bayer R (2000) Integrating the UB-Tree into a Database System Kernel. VLDB 2000: 263-272.
31. Régnier M (1985) Analysis of Grid File Algorithms. BIT 25(2): 335-357 (1985)
32. Roussopoulos N (1998) Materialized Views and Data Warehouses. SIGMOD Record 27(1): 21-26 (1998)
33. Sagan H (1994) Space-Filling Curves. Springer Verlag, 1994.
34. Sarawagi S (1997) Indexing OLAP Data. Data Engineering Bulletin 20(1): 36-43 (1997).
35. Sarawagi S, Stonebraker M (1994) Efficient Organization of Large Multidimensional Arrays, in: Proc. of the 11th Int. Conf. on Data Eng. (1994) 326-336.
36. Sismanis Y, Deligiannakis A, Roussopoulos N, Kotidis Y (2002) Dwarf: Shrinking the PetaCube. SIGMOD 2002.
37. Srivastava D, Dar S, Jagadish HV, Levy AY (1996) Answering Queries with Aggregation Using Views. VLDB Conference 1996: 318-329
38. Stöhr T, Märtens H, Rahm E (2000) Multi-Dimensional Database Allocation for Parallel Data Warehouses. VLDB 2000: 273-284
39. The TransBase HyperCube® relational database system (2005), available at: http://www.transaction.de.
40. Tsois A, Sellis T (2003) The Generalized Pre-Grouping Transformation: Aggregate-Query Optimization in the Presence of Dependencies. VLDB 2003.
41. Weber R, Schek H-J, Blott S (1998) A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces. VLDB 1998: 194-205
42. Weiss MA (1995) Data Structures and Algorithm Analysis. The Benjamin/Cummings Publishing Company Inc., 1995, pp. 351-359.
43. Whang K-Y, Krishnamurthy R (1991) The Multilevel Grid File - A Dynamic Hierarchical Multidimensional File Structure. DASFAA 1991: 449-459