NIKOS KARAYANNIDIS AND TIMOS SELLIS
Institute of Communication and Computer Systems and School of Electrical and Computer Engineering, National Technical University of Athens, Zographou 15773, Athens, Hellas
Phone: +30-210-772-1601, Fax: +30-210-772-1442, {nikos,timos}@dblab.ece.ntua.gr

Abstract. This paper deals with the problem of physical clustering of multidimensional data that are organized in hierarchies on disk in a hierarchy-preserving manner. This is called hierarchical clustering. A typical case where hierarchical clustering is necessary for reducing I/Os during query evaluation is the most detailed data of an OLAP cube. The presence of hierarchies in the multidimensional space results in an enormous search space for this problem. We propose a representation of the data space that results in a chunk-tree representation of the cube. The model is adaptive to the cube's extensive sparseness and provides efficient access to subsets of data based on hierarchy value combinations. Based on this representation of the search space we formulate the problem as a chunk-to-bucket allocation problem, which is a packing problem, as opposed to the linear ordering approach followed in the literature. We propose a metric to evaluate the quality of the hierarchical clustering achieved (i.e., to evaluate the solutions to the problem) and formulate the problem as an optimization problem. We prove its NP-Hardness and provide an effective solution based on a linear-time greedy algorithm. The solution of this problem leads to the construction of the CUBE File data structure. We analyze in depth all steps of the construction and provide solutions for the interesting sub-problems that arise, such as the formation of bucket regions, the storage of large data chunks and the caching of the upper nodes (root directory) in main memory. Finally, we provide an extensive experimental evaluation of the CUBE File's adaptability to the data space sparseness as well as to an increasing number of data points. The main result is that the CUBE File is highly adaptive to even the most sparse data spaces and, for realistic cases of data point cardinalities, provides hierarchical clustering of high quality and significant space savings.

Keywords: Hierarchical Clustering, OLAP, CUBE File, Data Cube, Physical Data Clustering

1 Introduction

Efficient processing of ad hoc OLAP queries is a very difficult task considering, on the one hand, the native complexity of typical OLAP queries, which potentially combine huge amounts of data, and on the other, the fact that no a priori knowledge of queries exists and thus no pre-computation of results or other query-specific tuning can be exploited. The only way to evaluate these queries is to access the most detailed data directly in an efficient way. It is exactly this need to access detailed data based on hierarchy criteria that calls for the hierarchical clustering of data. This paper discusses the physical clustering of OLAP cube data points on disk in a hierarchy-preserving manner, where hierarchies are defined along dimensions (hierarchical clustering). The problem addressed is set out as follows: We are given a large Fact Table (FT) containing only grain-level (most detailed) data. We assume that this is part of a star schema in a dimensional Data Warehouse. Therefore, data points (i.e., tuples in the FT) are organized by a set of N dimensions. We further assume that each dimension is organized in a hierarchy. Typically the data distribution is extremely skewed.
In particular, the OLAP cube is extremely sparse and data tend to appear in arbitrary clusters of data along some dimension. These clusters correspond to specific combinations of the hierarchy values for which there exist actual data (e.g., sales for a specific product Category in a specific geographic Region for a specific Period of time). The problem is on the one hand to store the fact table data in a hierarchy-preserving manner so as to reduce I/Os during the evaluation of ad hoc queries containing restrictions and /or groupings on the dimension hierarchies, and on the other, to enable navigation in the multilevelmultidimensional data space by providing direct access (i.e., indexing) to subsets of data via hierarchical restrictions. The later implies that index nodes must be also hierarchically clustered if we are aiming at a reduced I/O cost. Some of the most interesting proposals [20, 21, 36] in the literature for cube data structures deal with the computation and storage of the data cube operator [9]. These methods omit a significant aspect in OLAP, which is that usually dimensions are not flat but are organized in hierarchies of different aggregation levels (e.g., store, city, area, country is such a hierarchy for a Location dimension). The most popular approach for organizing the most detailed data of a cube is the so-called star schema. In this case the cube data are stored in a 2 relational table, called the fact table. Furthermore, various indexing schemes have been developed [3, 25, 26, 15], in order to speed up the evaluation of the join of the central (and usually very large) fact table with the surrounding dimension tables (also known as a star join). However, even when elaborate indexes are used, due to the arbitrary ordering of the fact table tuples, there might be as many I/Os as are the tuples resulting from the fact table. We propose the CUBE File data structure as an effective solution to the hierarchical clustering problem set above. The CUBE File multidimensional data structure ([18]) clusters data into buckets (i.e., disk pages) with respect to the dimension hierarchies aiming at the hierarchical clustering of the data. Buckets may include both intermediate (index) nodes (directory chunks), as well as leaf (data) nodes (data chunks). The primary goal of a CUBE File is to cluster in the same bucket a “family” of data (i.e., data corresponding to all hierarchy-value combinations for all dimensions) so as to reduce the bucket accesses during query evaluation. Experimental results in [18] have shown that the CUBE File outperforms the UBtree/MHC [22] - which is another effective method for hierarchically clustering the cube - resulting in 7-9 times less I/Os on average for all workloads tested. This simply means that the CUBE File achieves a higher degree of hierarchical clustering of the data. More interestingly, in [15] it was shown that the UBtree/MHC technique outperformed the traditional bitmap index-based star-join by a factor of 20 to 40, which simply proves that hierarchical clustering is the most determinant factor for a file organization for OLAP cube data, in order to reduce I/O cost. To tackle this problem we first model the cube data space as a hierarchy of chunks. This model - called the chunk-tree representation of a cube - copes effectively with the vast data sparseness by truncating empty areas. 
Moreover, it provides a multiple resolution view of the data space where one can zoom-in or zoom-out to specific areas navigating along the dimension hierarchies. The CUBE File is built by allocating the nodes of the chunk-tree into buckets in a hierarchypreserving manner. This way we depart from the common approach for solving the hierarchical clustering problem, which is to find a total ordering of the data points (linear clustering), and cope with it as a packing problem, namely a chunkto-bucket packing problem. 3 In order to solve the chunk-to-bucket packing problem, we need to be able to evaluate the hierarchical clustering achieved (i.e., evaluate the solutions to this problem). Thus, inspired by the chunk-tree representation of the cube, we define a hierarchical clustering quality metric, called the hierarchical clustering factor. We use this metric to evaluate the quality of the chunk to bucket allocation. Moreover, we exploit it in order to formulate the CUBE File construction problem as an optimization problem, which we call the chunk-to-bucket allocation problem. We formally define this problem and prove that it is NP-Hard. Then, we propose a heuristic algorithm as a solution that requires a single pass over the input fact table and linear time in the number of chunks. In the course of solving this problem several interesting sub-problems arise. We define the sub-problem of chunk-region formation, which deals with the clustering of chunk-trees hanging from the same parent-node in order to increase further the overall hierarchical clustering. We propose two algorithms as a solution, one of which is driven by workload patterns. Next, we deal with the subproblem of storing large data chunks (i.e., chunks that don’t fit in a single bucket), as well as with the sub-problem of storing the so-called root directory of the CUBE File (i.e., the upper nodes of the data structure). Finally, we study the CUBE File’s effective adaptation to several cube data spaces by presenting a set of experimental measurements that we have conducted. All in all, the contributions of this paper are outlined as follows: We provide an analytic solution to the problem of hierarchical clustering an OLAP cube. The solution leads to the construction of the CUBE File data structure. We model the multilevel-multidimensional data space of the cube as a chunk-tree. This representation of the data space adapts perfectly to the extensive data sparseness and provides a multi-resolution view of the data w.r.t. the hierarchies. Moreover, if viewed as an index, it provides direct access to cube data via hierarchical restrictions, which results in significant speedups of typical ad hoc OLAP queries. We transform the hierarchical clustering problem from a linear clustering problem into a chunk-to-bucket allocation (i.e., packing) problem, which we formally define and prove that it is NP-Hard. 4 We introduce a hierarchical clustering quality metric for evaluating the hierarchical clustering achieved (i.e., evaluating the solution to the problem in question). We provide an efficient solution to this problem as well as to all sub-problems that stem from it, such as the storage of large data chunks or the formation of bucket regions. We provide an experimental evaluation which leads to the following basic results: The CUBE File adapts perfectly to even the most extremely sparse data spaces yielding significant space savings. 
Furthermore, the hierarchical clustering achieved by the CUBE File is almost unaffected by the extensive cube sparseness. The CUBE File is scalable for any realistic number of input data points. In addition, the hierarchical clustering achieved remains of high quality when the number of input data points increases. The root-directory can be cached in main memory, providing a single-I/O cost for the evaluation of point queries.

The rest of this paper is organized as follows. Section 2 discusses related work and positions the CUBE File in the space of cube storage structures. Section 3 proposes the chunk-tree representation of the cube as an effective representation of the search space. Section 4 introduces a quality metric for the evaluation of hierarchical clustering. Section 5 formally defines the problem of hierarchical clustering, proves its NP-Hardness and then delves into the nuts and bolts of building the CUBE File. Section 6 presents our extensive experimental evaluation and Section 7 recapitulates and emphasizes the main conclusions drawn.

2 Related Work

2.1 Linear clustering of multidimensional data

The linear clustering problem for multidimensional data is defined as the problem of finding a linear ordering of records indexed on multiple attributes, to be stored in consecutive disk blocks, such that the I/O cost for the evaluation of queries is minimized. The clustering of multidimensional data has been studied in terms of finding a mapping of the multidimensional space to a one-dimensional space. This approach has been explored mainly in two directions: (a) in order to apply traditional one-dimensional indexing techniques to a multidimensional index space - a typical example is the UB-tree [2], which exploits a z-ordering of multidimensional data [27], so that these can be stored in a one-dimensional B-tree index [1] - and (b) for ordering buckets containing records that have been indexed on multiple attributes, to minimize the disk access effort. For example, a grid file [23] exploits a multidimensional grid in order to provide a mapping between grid cells and disk blocks. One could find a linear ordering of these cells - and therefore an ordering of the underlying buckets - such that the evaluation of a query entails more sequential bucket reads than random bucket accesses. To this end, space-filling curves (see [33] for a survey) have been used extensively. For example, Jagadish in [13] provides a linear clustering method based on the Hilbert curve that outperforms previously proposed mappings. Note however that all linear clustering methods are inferior to a simple scan in high-dimensional spaces. This is due to the notorious dimensionality curse [41], which states that clustering in such spaces becomes meaningless due to the lack of useful distance metrics.

In the presence of dimension hierarchies the multidimensional clustering problem becomes combinatorially explosive. The authors of [14] try to solve the problem of finding an optimal linear clustering of the records of a fact table on disk, given a specific workload in the form of a probability distribution over query classes. They propose a subclass of clustering methods called lattice paths, which are paths on the lattice defined by the hierarchy level combinations of the dimensions. The HPP chunk-to-bucket allocation problem (in section 3.2 we provide a formal definition of HPP restrictions and queries) is a different problem for the following reasons:
1. It tries to find an optimal way (in terms of reduced I/O cost during query evaluation) to pack the data into buckets, rather than to order the data linearly. The problem of finding an optimal linear ordering of the buckets, for a specific workload, so as to reduce random bucket reads, is an orthogonal problem and therefore the methods proposed in [14] could be used additionally.

2. It deals, apart from the data, also with the intermediate node entries (i.e., directory chunk entries), which provides clustering at the whole-index level and not only at the index-leaf level. In other words, index data are also clustered along with the "real" data.

Since we know that there is no linear clustering of records that will permit all queries over a multidimensional space to be answered efficiently [14], we strongly advocate that linear clustering of buckets (inter-bucket clustering) must be exploited in conjunction with an efficient allocation of records into buckets (intra-bucket clustering).

Furthermore, in [22], a path-based encoding of dimension data, similar to our encoding scheme, is exploited in order to achieve linear clustering of multidimensional data with hierarchies, through a z-ordering [27]. The authors use the UB-tree [2] as an index on top of the linearly clustered records. This technique has the advantage of transforming typical star-join [25] queries into multidimensional range queries, which are computed more efficiently due to the underlying multidimensional index. However, this technique suffers from the inherent deficiencies of the z space-filling curve, which is not the best space-filling curve according to [13, 7]. On the other hand, it is very easy to compute and thus it is straightforward to implement the technique even for high dimensionalities. A typical example of such a deficiency is that in the z-curve certain data points that are close in the multidimensional space are dispersed and end up far apart in the linear order, and the opposite, i.e., distant data points are clustered together in the linear order. The latter also results in an inefficient evaluation of multiple disjoint query regions, due to the repetitive retrieval of the same pages for many queries. Finally, the benefits of z-based linear clustering start to disappear quite soon as dimensionality increases, practically as soon as dimensionality exceeds 4-5 dimensions.

2.2 Multidimensional access methods

The CUBE File organization was initially inspired by the grid file organization [23], which can be viewed as the multidimensional counterpart of extendible hashing [6]. The grid file superimposes a d-dimensional orthogonal grid on the multidimensional space. Given that the grid is not necessarily regular, the resulting cells may be of different shapes and sizes. A grid directory associates one or more of these cells with data buckets, which are stored in one disk page each. Each cell is associated with one bucket, but a bucket may contain several adjacent cells; therefore bucket-regions may be formed. To ensure that data items are always found with no more than two disk accesses for exact match queries, the grid itself is kept in main memory, represented by d one-dimensional arrays called scales. The grid file is intended for dynamic insert/delete operations, therefore it supports operations for splitting and merging directory cells. A well-known problem of the grid file is that it suffers from a superlinear growth of the directory even for data that are uniformly distributed [31].
One basic reason for this is that splitting is not a local operation and thus can lead to superlinear directory growth. Moreover, depending on the implementation of the grid directory, merging may require a complete directory scan [12]. Hinrichs in [12] attempts to overcome the shortcomings of the grid file by introducing a 2-level grid directory. In this scheme, the grid directory is stored on disk and a scaled-down version of it (called the root directory) is kept in main memory to ensure that the two-disk-access principle still holds. Furthermore, he discusses efficient implementations of the split, merge and neighborhood operations. In a similar manner, Whang extends the idea of a 2-level directory to a multilevel directory, introducing the multilevel grid file [43] and achieving a directory growth that is linear in the number of records. More grid-file-based organizations exist; a comprehensive survey of these, and of multidimensional access methods in general, can be found in [8].

An obvious distinction of the CUBE File organization from the above multidimensional access methods is that it has been designed to fulfill completely different requirements, namely those of an OLAP environment and not of a transaction-oriented one. A CUBE File is designed for an initial bulk loading and then a read-only operation mode, in contrast to the dynamic insert/delete/update workload of a grid file. Moreover, a CUBE File aims at speeding up queries on multidimensional data with hierarchies and exploits hierarchical clustering to this end. Furthermore, since the dimension domain in OLAP is known a priori, the directory does not have to grow dynamically. In addition, changes to the directory are rare, since dimension data do not change very often (compared to the rate of change of the cube data), and deletions are seldom; therefore split and merge operations are not needed as much. Nevertheless, it is more important to adapt well to the native sparseness of a cube data space and to efficiently support incremental updating, so as to minimize the updating window and the cube query-down time, which are critical factors in business intelligence applications nowadays.

2.3 Primary organizations for cube storage

The set of methods reported in the literature for primary organizations for the storage of cubes is quite confined. We believe that this is basically due to two reasons. First of all, the generally held view is that a "cube" is a set of precomputed aggregated results and thus the main focus has been to devise efficient ways to compute these results [11], as well as to choose which ones to compute for a specific workload (the view selection/maintenance problem [10, 32, 37]). Kotidis et al. in [19] proposed a storage organization based on packed R-trees for storing these aggregated results. We believe that this is a one-sided view of the problem since it disregards the fact that very often, especially for ad hoc queries, there will be a need for drilling down to the most detailed data in order to compute a result from scratch. Ad hoc queries represent the essence of OLAP and, in contrast to report queries, are not known a priori and thus cannot really benefit from pre-computation. The only way to process them efficiently is to enable fast retrieval of the base data. This calls for an effective primary storage organization for the most detailed data (grain level) of the cube.
This argument is of course based on the fact that a full pre-computation of all possible aggregates is prohibitive due to the consequent size explosion, especially for sparse cubes [24]. The second reason that makes people reluctant to work on new primary organizations for cubes is their adherence to relational systems. Although this seems justified, one could point out that a relational table (e.g., a fact table of a star schema [4]) is a logical entity and thus should be separated from the physical method chosen for implementing it. Therefore, one can use, apart from a paged record file, also a B*-tree or even a multidimensional data structure as a primary organization for a fact table. In fact, there are not many commercial RDBMSs ([39] is one that we know of) that exploit a multidimensional data structure as a primary organization for fact tables. All in all, the integration of a new data structure in a full-blown commercial system is a strenuous task with high cost and high risk, and thus the proposed solutions are usually reluctant to depart from the existing technology (see also [30] for a detailed description of the issues in this integration).

Fig. 1 positions the CUBE File organization in the space of primary organizations proposed for storing a cube (i.e., only the base data and not aggregates). The columns of this table describe the alternative data structures that have been proposed as a primary organization, while the rows classify the proposed methods according to the achieved data clustering. At the top-left cell lies the conventional star schema [4], where a paged record file is used for storing the fact table. This organization guarantees no particular ordering among the stored data and thus additional secondary indexes are built around it in order to support efficient access to the data.

[Fig. 1. The space of proposed primary organizations for cube storage: the columns list the candidate primary organizations (relation, chunk-based MD-array, and multidimensional data structures: grid-file-based, UB-tree-based, other), while the rows list the clustering achieved (no clustering, clustering, hierarchical clustering); the star schema, [28], [35], the chunk-based [5], the z-order-based [22] and the CUBE File [18] are classified in this space.]

[28] assumes a typical relation (i.e., a paged record file) as the primary organization of a cube (i.e., fact table). However, unique combinations of dimension values are used in order to form blocks of records, which correspond to consecutive disk pages. These blocks can be considered as chunks. The database administrator must choose only one hierarchy level from each dimension to participate in the clustering scheme. In this sense, the method provides multidimensional clustering and not hierarchical (multidimensional) clustering. In [35] a chunk-based method for storing large multidimensional arrays is proposed. No hierarchies are assumed on the dimensions and data are clustered according to the most frequent range queries of a particular workload. In [5] the benefits of hierarchical clustering in speeding up queries were observed as a side effect of using a chunk-based file organization over a relation (i.e., a paged file of records) for query caching, with the chunk as the caching unit. Hierarchical clustering was achieved through an appropriate "hierarchical" encoding of the dimension data. Markl et al. in [22] also impose a hierarchical encoding on the dimension data and assign a path-based surrogate key, called the compound surrogate key, to each dimension tuple. They exploit the UB-tree multidimensional index [2] as the primary organization of the cube.
Hierarchical clustering is achieved by taking the z-order [27] of the cube data points, interleaving the bits of the corresponding compound surrogates. [5], [22] and the CUBE File [18] all exploit hierarchical clustering of the cube data, and the last two use multidimensional structures as the primary organization. This has, among others, the significant benefit of transforming a star-join [25] into a multidimensional range query, which is evaluated very efficiently over these data structures.

3 Modeling the cube data space

Clearly our goal is to define a multidimensional file organization that natively supports hierarchies. There is indeed a plethora of data structures for multidimensional data [8], but to the best of our knowledge, none of these explicitly supports hierarchies. Hierarchies complicate things, basically because, in their presence, the data space "explodes" (see footnote 1). Moreover, since we are primarily aiming at speeding up queries that include restrictions on the hierarchies, we need a data structure that can efficiently lead us to the corresponding data subset based on these restrictions. A key observation at this point is that all restrictions on the hierarchies intuitively define a subcube or a cube-slice. To this end, we exploit the intuitive representation of a cube as a multidimensional array and apply a chunking method in order to create subcubes, i.e., the so-called chunks. Our method of chunking is based on the structure of the dimension hierarchies and thus we call it hierarchical chunking. In the following sections we present a dimension-data encoding scheme that assigns hierarchy-enabled unique identifiers to each data point in a dimension. Then, we present our hierarchical chunking method. Finally, we propose a tree structure for representing the hierarchy of the resultant chunks and thus modeling the cube data space.

Footnote 1: Assuming N dimension hierarchies modelled as K-level m-way trees, the number of possible value combinations is K-times exponential in the number of dimensions, i.e., O(m^(KN)).

3.1 Hierarchical chunking

In order to apply hierarchical chunking, we first assign a surrogate key to each dimension hierarchy value. This key uniquely identifies each value within the hierarchy. More specifically, we order the values in each hierarchy level so that sibling values occupy consecutive positions and perform a mapping to the domain of positive integers. The resulting values are depicted in Fig. 2 for an example of a dimension hierarchy. The simple integers appearing under each value in each level are called order-codes. In order to identify a value in the hierarchy, we form the path of order-codes from the root value to the value in question. This path is called a hierarchical surrogate key, or simply h-surrogate. For example, the h-surrogate for the value "Rhodes" is 0.0.1.2. H-surrogates convey hierarchical (i.e., semantic) information for each cube data point, which can be greatly exploited for the efficient processing of star-queries [15, 29, 40].

[Fig. 2. Example of hierarchical surrogate keys assigned to an example LOCATION hierarchy with levels Continent, Country, Region and City (grain level); e.g., Europe (0), Greece (0.0), South (0.0.1), Rhodes (0.0.1.2), and cities Salonica (0), Athens (1), Rhodes (2), Glasgow (3), London (4), Cardiff (5).]
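As an illustration of the encoding just described, the following is a minimal sketch (Python, hypothetical names) of how order-codes and h-surrogates could be assigned for the LOCATION example of Fig. 2: within each level, siblings receive consecutive integer order-codes, and a member's h-surrogate extends its parent's h-surrogate with the member's own order-code. It is a sketch under these assumptions, not the paper's actual implementation.

```python
# Sketch of the h-surrogate encoding: order-codes are assigned per level
# (siblings consecutive), and an h-surrogate is the dot-separated path of
# order-codes from the top level down to the member.

def assign_h_surrogates(root):
    """root: (name, [children]); returns {member_name: h_surrogate} (unique names assumed)."""
    h_surrogates = {}
    current_level = [(root, "")]              # (node, parent h-surrogate)
    while current_level:
        next_level = []
        order_code = 0                        # order-codes restart per level
        for (_, children), parent_hsk in current_level:
            for child in children:
                child_name, _ = child
                hsk = str(order_code) if parent_hsk == "" else parent_hsk + "." + str(order_code)
                h_surrogates[child_name] = hsk
                next_level.append((child, hsk))
                order_code += 1
        current_level = next_level
    return h_surrogates

# Part of the LOCATION hierarchy of Fig. 2 (Europe branch only; the exact
# region/city grouping is illustrative).
location = ("ALL", [
    ("Europe", [
        ("Greece", [
            ("North (GR)", [("Salonica", []), ("Athens", [])]),
            ("South (GR)", [("Rhodes", [])]),
        ]),
        ("U.K.", [
            ("North (UK)", [("Glasgow", [])]),
            ("South (UK)", [("London", []), ("Cardiff", [])]),
        ]),
    ]),
])

print(assign_h_surrogates(location)["Rhodes"])   # prints "0.0.1.2", as in the text
```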
The basic incentive behind hierarchical chunking is to partition the data space by forming a hierarchy of chunks that is based on the dimensions' hierarchies. This has the beneficial effect of pruning all empty areas. Remember that in a cube data space empty areas are typically defined by specific combinations of hierarchy values (e.g., since we didn't sell the X product Category in Region Y for T periods of time, an empty region is formed). Moreover, it provides us with a multi-resolution view of the data space, where one can zoom in and zoom out, navigating along the dimension hierarchies.

We model the cube as a large multidimensional array, which consists only of the most detailed data. Initially, we partition the cube into a few chunks corresponding to the most aggregated levels of the dimensions' hierarchies. Then we recursively partition each chunk as we drill down the hierarchies of all dimensions in parallel. We define a measure in order to distinguish each recursion step, the chunking depth D.

We will illustrate hierarchical chunking with an example. The dimensions of our example cube are depicted in Fig. 3 and correspond to a 2-dimensional cube hosting sales data for a fictitious company. The two dimensions are LOCATION and PRODUCT. In the figure we can see the members for each level of these dimensions (each appearing with its member-code).

[Fig. 3. Dimensions of our example cube along with two hierarchy instantiations.]

In order to apply our method, we need to have hierarchies of equal length. For this reason, we insert pseudo-levels P into the shorter hierarchies until they reach the length of the longest one. This "padding" is done after the level that is just above the grain level. In our example, the PRODUCT dimension has only three levels and needs one pseudo-level in order to reach the length of the LOCATION dimension. This is depicted next, where we have also noted the order-code range at each level:

LOCATION: [0-2].[0-4].[0-10].[0-18]
PRODUCT: [0-1].[0-2].P.[0-5]

The result of hierarchical chunking on our example cube is depicted in Fig. 4(a). Chunking begins at chunking depth D = 0 and proceeds in a top-down fashion. To define a chunk, we define discrete ranges of grain-level (i.e., most detailed) values on each dimension, denoted in the figure as [a..b], where a and b are grain-level order-codes. Each such range is defined as the set of values with the same parent (value) in the corresponding parent level. These parent levels form the set of pivot levels PVT, which guides the chunking process at each step. Therefore, initially PVT = {LOCATION: Continent, PRODUCT: Category}. For example, if we take value 0 of pivot level Continent of the LOCATION dimension, then the corresponding range at the grain level is Cities [0..5].

[Fig. 4. (a) The cube from our running example hierarchically chunked. (b) The whole sub-tree, up to the data chunks, under the chunk shaded in (a).]

The definition of such a range for each dimension defines a chunk. For example, the chunk defined by the 0, 0 values of the pivot levels Continent and Category respectively consists of the following grain data: (LOCATION: 0.[0-1].[0-3].[0-5], PRODUCT: 0.[0-1].P.[0-3]). The '[]' notation denotes a range of members. This chunk appears shaded in Fig. 4(a) at D = 0. Ultimately, at D = 0 we have a chunk for each possible combination between the members of the pivot levels, that is a total of [0-1]×[0-2] = 6 chunks in this example. Thus the total number of chunks created at each depth D equals the product of the cardinalities of the pivot levels. Next we proceed at D = 1, with PVT = {LOCATION: Country, PRODUCT: Type}, and recursively chunk each chunk of depth D = 0.
This time we define ranges within the previously defined ranges. For example, on the range corresponding to Continent value 0 that we created before, we define discrete ranges corresponding to each country of this continent (i.e., to each value of the Country level that has parent 0). In Fig. 4(a), at D = 1, shaded boxes correspond to all the chunks resulting from the chunking of the chunk mentioned in the previous paragraph. Similarly, we proceed with the chunking by descending all dimension hierarchies in parallel, and at each depth D we create new chunks within the existing ones. The procedure ends when the next levels to include as pivot levels are the grain levels. Then we do not need to perform any further chunking, because the chunks that would be produced from such a chunking would be the cells of the cube themselves. In this case, we have reached the maximum chunking depth DMAX. In our example, chunking stops at D = 2 and the maximum depth is D = 3. Notice the shaded chunks in Fig. 4(a) depicting chunks belonging to the same chunk hierarchy.

The rationale for inserting the pseudo-levels just above the grain level is that we wish to apply chunking as soon as possible and for all possible dimensions. Since the chunking proceeds in a top-to-bottom fashion, this "eager chunking" has the advantage of reducing the chunk size very early and also provides faster access to the underlying data, because it increases the fan-out of the intermediate nodes. If at a particular depth one (or more) pivot level is a pseudo-level, then this level does not take part in the chunking (in our example this occurs at D = 2 for the PRODUCT dimension). This means that we don't define any new ranges within the previously defined range for the specific dimension(s), but instead keep the old one with no further chunking. Therefore, since pseudo-levels restrict chunking in the dimensions to which they are applied, we must insert them at the lowest possible level. Consequently, since there is no chunking below the grain level (a data cell cannot be further partitioned), the pseudo-level insertion occurs just above the grain level.

3.2 The chunk-tree representation of the cube

We use the intermediate-depth chunks as directory chunks that guide us to the chunks of depth DMAX, which contain the data and are thus called data chunks. This leads to a chunk-tree representation of the hierarchically chunked cube and hence of the cube data space. It is depicted in Fig. 4(b) for our example cube. In Fig. 4(b), we have expanded the chunk sub-tree corresponding to the family of chunks that has been shaded in Fig. 4(a). Pseudo-levels are marked with "P" and the corresponding directory chunks have reduced dimensionality (i.e., they are one-dimensional in this case).

We interleave the h-surrogates of the pivot-level values that define a chunk and form a chunk-id. This is a unique identifier for a chunk within a CUBE File. Moreover, this identifier includes the whole path of a chunk in the chunk hierarchy. In Fig. 4(b), we note the corresponding chunk-id above each chunk. The root chunk does not have a chunk-id because it represents the whole cube, and chunk-ids essentially denote sub-cubes. The part of a chunk-id that is contained between consecutive dots and corresponds to a specific depth D is called a D-domain. The chunk-tree representation can be regarded as a method to model the multilevel-multidimensional data space of an OLAP cube.
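To make the chunk-id construction concrete, here is a small sketch of interleaving per-dimension h-surrogates into dot-separated D-domains. The '|' separator inside each D-domain and the example order-code values are assumptions made for illustration; they are not notation prescribed by the text.

```python
# Sketch: interleave dot-separated h-surrogates (one per dimension) into a
# chunk-id whose dot-separated parts are the D-domains; inside a D-domain the
# order-codes of the dimensions are joined with '|' (assumed separator).

def chunk_id(h_surrogates):
    """h_surrogates: one h-surrogate per dimension, e.g. ["0.1", "0.0"]."""
    split = [hsk.split(".") for hsk in h_surrogates]
    depth = len(split[0])
    assert all(len(parts) == depth for parts in split), "equal-length hierarchies assumed"
    d_domains = ["|".join(parts[d] for parts in split) for d in range(depth)]
    return ".".join(d_domains)

# Hypothetical depth-1 chunk under Continent 0 / Category 0, restricted to
# Country 1 and Type 0: LOCATION h-surrogate "0.1", PRODUCT h-surrogate "0.0".
print(chunk_id(["0.1", "0.0"]))   # -> "0|0.1|0"
```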
We discuss next the major benefits form this modeling: Direct access to cube data through hierarchical restrictions: One of the main advantages of the chunk-tree representation of a cube is that it explicitly supports hierarchies. This means that any cube data subset defined through restrictions on the dimension hierarchies can be accessed directly. This is achieved by simply accessing the qualifying cells at each depth and following the intermediate chunk pointers to the appropriate data. Note that the vast majority of OLAP queries contain an equality restriction on a number of hierarchical attributes and more commonly on hierarchical attributes that form a complete path in the hierarchy. This is reasonable since the core of analysis is conducted along the hierarchies. We call this kind of restrictions hierarchical prefix path (HPP) restrictions and provide the corresponding definition next: Definition 1 (Hierarchical Prefix Path Restriction): We define a hierarchical prefix path restriction (HPP restriction) on a hierarchy H of a dimension D, to be a set of equality restrictions linked by conjunctions on H’s levels that form a path in H, which always includes the topmost (most aggregated) level of H. For example, if we consider the dimension LOCATION of our example cube and a DATE dimension with a 3-level hierarchy (Year/Month/Day), then the query “show me sales for country A (in continent C) in region B for each month of 1999” contains two whole-path restrictions, one for the dimension LOCATION and one for DATE: (a) LOCATION.continent = ‘C’ AND LOCATION.country = ‘A’ AND LOCATION.region = ‘B’, and (b) DATE.year = 1999. Consequently, we can now define the class of HPP queries: Definition 2 (Hierarchical Prefix Path Query): We call a query Q on a cube C a hierarchical prefix path query (HPP query), if and only if all the restrictions 16 imposed by Q on the dimensions of C are HPP restrictions, which are linked together by conjunctions. Adaptation to cube’s native sparseness: The cube data space is extremely sparse [34]. In other words, the ratio of the number of real data points to the product of the dimension grain–level cardinalities is a very small number. Values for this ratio in the range of 10-12 to 10-5 are more than typical (especially for cubes with more than 3 dimensions). It is therefore, imperative that a primary organization for the cube adapts well to this sparseness, allocating space conservatively. Ideally, the allocated space must be comparable to the size of the existing data points. The chunk-tree representation adapts perfectly to the cube data space. The reason is that the empty regions of a cube are not arbitrarily formed. On the contrary, specific combinations of dimension hierarchy values form them. For instance, in our running example, if no music products are sold in Greece, then a large empty region is formed. Consequently, the empty regions in the cube data space translate naturally to one or more empty chunk sub-trees in the chunk-tree representation. Therefore, empty sub-trees can be discarded altogether and the space allocation corresponds to the real data points and only. Multi-resolution view of the data space: The chunk-tree represents the whole cube data space (however with most of the empty areas pruned). Similarly, each subtree represents a sub-space. 
Moreover, at a specific chunking depth we “view” all the data points organized in “hierarchical families” (i.e., chunk-trees) according to the combinations of hierarchy values for the corresponding hierarchy levels. By descending to a higher depth node we “view” the data of the corresponding subspace organized in hierarchical families of a more detailed level and so on. This multi-resolution feature will be exploited later in order to achieve a better hierarchical clustering of the data by promoting the storage of lower depth chunktrees in a bucket than that of higher depth ones. Storage efficiency: A chunk is physically represented by a multidimensional array. This enables an offset-based access, rather than a search-based one, which speedups the cell access mechanism considerably. Moreover, it gives us the opportunity to exploit chunk-ids in a very effective way. A chunk-id essentially consists of interleaved coordinate values. Therefore, we can use a chunk-id in order to calculate the appropriate offset of a cell in a chunk but we do not have to store the chunk-id along with each cell. Indeed, a search-based mechanism (like 17 the one used by conventional B-tree indexes, or the UB-tree [2]) requires that the dimension values (or the corresponding h-surrogates), which form the search-key, must be also stored within each cell (i.e., tuple) of the cube. In the CUBE File only the measure values of the cube are stored in each cell. Hence notable space savings are achieved. In addition, further compression of chunks can be easily achieved, without affecting the offset-based accessing (see [17] for the details). Parallel Processing Enabling: Chunk-trees (at various depths) can be exploited naturally for the logical fragmentation of the cube data, in order to enable the parallel processing of queries, as well as the construction and maintenance (i.e., bulk loading and batch updating) of the CUBE File. Chunk-trees are essentially disjoint fragments of the data that carry all the hierarchy semantics of the data. This makes the CUBE File data structure as an excellent candidate for advanced fragmentation methods ([38]) used in parallel data warehouse DBMSs. Efficient Maintenance Operations: Any data structure aimed to accommodate data warehouse data must be efficient in typical data warehousing maintenance operations. The logical data partitioning provided by the chunk-tree representation enables fast bulk loading (rollin of data), data purging (rollout of data, i.e., bulk deletions from the cube), as well as the incremental updating of the cube (i.e., when the input data with the latest changes arrive from the data sources, only local reorganizations are required and not a complete CUBE File rebuild). The key idea is that new data to be inserted in the CUBE file correspond to a set of chunk-trees that need to be “hanged” at various depths of the structure. The insertion of each such chunk-tree requires only a local reorganization without affecting the rest of the structure. In addition, as noted previously, these chunktree insertions can be performed in parallel as long as they correspond to disjoint subspaces of the cube. Finally, it is very easy to rollout the oldest month’s data and rollin the current month’s (we call this data purging), since these data correspond to separate chunk-trees and only a minimum reorganization is required. 
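To illustrate the offset-based access mentioned under the storage-efficiency benefit above, the following sketch computes a cell's position inside a chunk from its chunk-local order-codes, assuming a dense row-major array layout; the layout and names are assumptions for illustration (the actual chunk representation and its compression are described in [17]).

```python
# Sketch: offset-based cell access within a chunk stored as a dense
# multidimensional array (row-major layout assumed). Only measure values are
# stored per cell; coordinates are never stored, they are computed away.

def cell_offset(coords, dim_sizes):
    """coords: chunk-local order-codes per dimension; dim_sizes: chunk extent per dimension."""
    offset = 0
    for c, size in zip(coords, dim_sizes):
        assert 0 <= c < size
        offset = offset * size + c
    return offset

# A hypothetical 2 x 3 chunk (e.g., 2 cities x 3 products): cell (1, 2) -> offset 5.
measures = [10.0, 7.5, 0.0, 3.2, 0.0, 12.1]   # one measure value per cell
print(measures[cell_offset((1, 2), (2, 3))])  # -> 12.1
```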
The interested reader can find more information regarding other aspects of the CUBE File not covered in this paper (e.g., the updating and maintenance operations), as well as information on a prototype implementation of a CUBE File based DBMS, in [16].

4 A quality metric for hierarchical clustering

Any physical organization of data must determine how the latter are distributed in disk pages. A CUBE File physically organizes its data by allocating the chunks of the chunk-tree into a set of buckets; the bucket is the I/O transfer unit counterpart in our case. First, let's try to understand what the objectives of such an allocation are. As already stated, the primary goal is to achieve a high degree of hierarchical clustering. This statement, although clear, could still be interpreted in several different ways. What are the elements that can guarantee that a specific hierarchical clustering scheme is "good"? We attempt to list some next:

1. Efficient evaluation of queries containing restrictions on the dimension hierarchies.
2. Minimization of the size of the data.
3. High space utilization.

The most important goal of hierarchical clustering is to improve the response time of queries containing hierarchical restrictions. Therefore, the first element calls for a minimal I/O cost (i.e., bucket reads) for the evaluation of such restrictions. The second element deals with the ability to minimize the size of the data to be stored (e.g., by adapting to the extensive sparseness of the cube data space, i.e., not storing null data, as well as by storing only the minimum necessary data; e.g., in an offset-based access structure we don't need to store the dimension values along with the facts). Of course, the storage overhead must also be minimized in terms of the number of allocated buckets. Naturally, the best way to keep this number low is to utilize the available space as much as possible. Therefore the third element implies that the allocation must adapt well to the data distribution, e.g., more buckets must be allocated to more densely populated areas and fewer buckets to more sparse ones. Also, buckets must be filled almost to capacity (i.e., imposing a high bucket occupancy threshold). The last two elements together guarantee an overall minimum storage cost.

In the following, we propose a metric for evaluating the hierarchical clustering quality of an allocation of chunks into buckets. Then, in the next section, we use this metric to formally define the chunk-to-bucket allocation problem as an optimization problem.

4.1 The hierarchical clustering degree of a bucket

We advocate that hierarchical clustering is the most important goal for a file organization for OLAP cubes. However, the space of possible combinations of dimension hierarchy values is huge (doubly exponential; see footnote 1 in Section 3). To this end, we exploit the chunk-tree representation resulting from the hierarchical chunking of a cube and deal with the problem of hierarchical clustering as a problem of allocating chunks of the chunk-tree into disk buckets. Thus, we are not searching for a linear clustering (i.e., for a total ordering of the chunked-cube cells), but rather we are interested in the packing of chunks into buckets according to the criteria of good hierarchical clustering posed above.
The intuitive explanation for the utilization of the chunk-tree for achieving hierarchical clustering lies in the fact that the chunk-tree is built based solely on the hierarchies' structure and content and not on some storage criterion (e.g., each node corresponding to a disk page, etc.); as a result, it embodies all possible combinations of hierarchical values. For example, the sub-tree hanging from the root-chunk in Fig. 4(b) contains at the leaf level all the sales figures corresponding to the continent "Europe" (order code 0) and to the product category "Books" (order code 0), and any possible combinations of the children members of the two. Therefore, each sub-tree in the chunk-tree corresponds to a "hierarchical family" of values and thus reduces the search space significantly. In the following we will regard the bucket as the storage unit. In this section, we define a metric for evaluating the degree of hierarchical clustering of different storage schemes in a quantitative way.

Clearly, a hierarchical clustering strategy that respects the quality element of efficient evaluation of queries with HPP restrictions posed above must ensure that the sub-trees hanging under a specific chunk can be accessed with a minimal number of bucket reads. Intuitively, one can say that if we could store whole sub-trees in each bucket (instead of single chunks), then this would result in a better hierarchical clustering, since all the restrictions on the specific sub-tree, as well as on any of its descendant sub-trees, would be evaluated with a single bucket I/O. For example, if we store the sub-tree hanging from the root-chunk in Fig. 4(b) into a single bucket, we can answer all queries containing hierarchical restrictions on the combination "Books" and "Europe", and on any children values of these two, with just a single disk I/O. Therefore, each sub-tree in this chunk-tree corresponds to a "hierarchical family" of values. Moreover, the smaller the chunking depth of this sub-tree is, the more value combinations it embodies. Intuitively, we can say that the hierarchical clustering achieved can be assessed by the degree to which low-depth whole chunk sub-trees are stored in each storage unit. Next, we exploit this intuitive criterion to define the hierarchical clustering degree of a bucket (HCDB). We begin with a number of auxiliary definitions:

Definition 3 (Bucket-Region): Assume a hierarchically chunked cube represented by a chunk-tree CT of a maximum chunking depth DMAX. A group of chunk-trees of the same depth having a common parent node, which are stored in the same bucket, comprises a bucket-region.

Definition 4 (Region contribution of a tree stored in a bucket - cr): Assume a hierarchically chunked cube represented by a chunk-tree CT of a maximum chunking depth DMAX. We define the region contribution cr of a tree t of depth d that is stored in a bucket B to be the total number of trees in the bucket-region that this tree belongs to, divided by the total number of trees of the same depth in the chunk-tree CT. This is then multiplied by a bucket-region proximity factor rP, which expresses the proximity of the trees of a bucket-region in the multidimensional space:

cr = (treeNum(d, B) / treeNum(d, CT)) · rP

where treeNum(d, B) is the total number of sub-trees of depth d in B, treeNum(d, CT) is the total number of sub-trees of depth d in CT, and rP is the bucket-region proximity (0 < rP ≤ 1).
The region contribution of a tree stored in a bucket essentially denotes the percentage of trees at a specific depth that a bucket-region covers. Therefore, the greater this percentage is, the greater the hierarchical clustering achieved by the corresponding bucket, since more combinations of the hierarchy members will be clustered in the same bucket. To keep this contribution high we need large bucket-regions of low-depth trees, because at low depths the total number of CT sub-trees is small. Notice also that the region contribution includes a bucket-region proximity factor rP, which expresses the spatial proximity of the trees of a bucket-region in the multidimensional space. The larger this factor becomes, the closer the trees of a bucket-region are and thus the larger their individual region contributions become. We will see the effects of this factor and its definition (Definition 10) in more detail in a following subsection, where we discuss the formation of the bucket-regions.

Definition 5 (Depth contribution of a tree stored in a bucket - cd): Assume a hierarchically chunked cube represented by a chunk-tree CT of a maximum chunking depth DMAX. We define the depth contribution cd of a tree t of depth d that is stored in a bucket B to be the ratio of d to DMAX:

cd = d / DMAX

The depth contribution of a tree stored in a bucket expresses the proportion between the depth of the tree and the maximum chunking depth. The smaller this ratio becomes (i.e., the lower the depth of the tree), the greater the hierarchical clustering achieved by the corresponding bucket becomes. Intuitively, the depth contribution expresses the percentage of the number of nodes in the path from the root-chunk to the bucket in question, and thus the smaller it is, the smaller the I/O cost to access this bucket. Alternatively, we could substitute the depth value in the numerator of the depth contribution with the number of buckets in the path from the root-chunk to the bucket in question (with the latter included).

Next, we provide the definition of the hierarchical clustering degree of a bucket:

Definition 6 (Hierarchical Clustering Degree of a Bucket - HCDB): Assume a hierarchically chunked cube represented by a chunk-tree CT of a maximum chunking depth DMAX. For a bucket B containing T whole sub-trees {t1, t2 … tT} of chunking depths {d1, d2 … dT} respectively, where none of these sub-trees is a sub-tree of another, we define the Hierarchical Clustering Degree HCDB of bucket B to be the ratio of the sum of the region contributions of the trees ti (1 ≤ i ≤ T) included in B to the sum of their depth contributions, multiplied by the bucket occupancy OB, where 0 ≤ OB ≤ 1:

HCDB = (Σi=1..T cri / Σi=1..T cdi) · OB = ((T · cr) / (T · cd)) · OB = (cr / cd) · OB    (1)

where cri is the region contribution of tree ti and cdi is the depth contribution of tree ti (1 ≤ i ≤ T). (Note that since bucket-regions have been defined as consisting of equi-depth trees, all trees of a bucket have the same region contribution as well as the same depth contribution.)

In this definition, we have assumed that the chunking depth di of a chunk-tree ti is equal to the chunking depth of the root-chunk of this tree. Of course, we assume that a normalization of the depth values has taken place, so that the depth of the chunk-tree CT is 1 instead of 0, in order to avoid having zero depths in the denominator of equation (1). Furthermore, data chunks are considered as chunk-trees with a depth equal to the maximum chunking depth of the cube.
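To make Definitions 4-6 concrete, here is a small sketch that computes cr, cd and HCDB for a bucket under the stated assumptions (one equi-depth bucket-region per bucket, depths normalized so the root tree has depth 1); the numbers used are hypothetical.

```python
# Sketch of Definitions 4-6: region contribution, depth contribution and the
# hierarchical clustering degree HCD_B of a bucket whose trees form a single
# bucket-region (equi-depth, same parent), as assumed in the text.

def region_contribution(trees_in_region, trees_in_CT_at_depth, r_p=1.0):
    # c_r = (treeNum(d, B) / treeNum(d, CT)) * r_P, with 0 < r_P <= 1
    return (trees_in_region / trees_in_CT_at_depth) * r_p

def depth_contribution(d, d_max):
    # c_d = d / D_MAX, with normalized depths starting at 1
    return d / d_max

def hcd_bucket(trees_in_region, trees_in_CT_at_depth, d, d_max, occupancy, r_p=1.0):
    # Equation (1): HCD_B = (c_r / c_d) * O_B, since all trees share c_r and c_d
    c_r = region_contribution(trees_in_region, trees_in_CT_at_depth, r_p)
    c_d = depth_contribution(d, d_max)
    return (c_r / c_d) * occupancy

# Hypothetical bucket: a region of 2 sub-trees of depth 3, out of 12 such
# sub-trees in CT, with D_MAX = 5 and the bucket filled to 80%.
print(hcd_bucket(2, 12, 3, 5, 0.8))   # -> (2/12) / (3/5) * 0.8 ≈ 0.222
```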
Note that directory chunks stored in a bucket not as part of a sub-tree but in isolation have a zero region contribution; therefore, buckets that contain only such directory chunks have a zero degree of hierarchical clustering. From equation (1), we can see that the more sub-trees, instead of single chunks, are included in a bucket, the greater the hierarchical clustering degree of the bucket becomes, because more HPP restrictions can be evaluated solely with this bucket. Also, the higher these trees are (i.e., the smaller their chunking depth is), the greater the hierarchical clustering degree of the bucket becomes, since more combinations of hierarchical attributes are "covered" by this bucket. Moreover, the more trees of the same depth hanging under the same parent node we have stored in a bucket, the greater the hierarchical clustering degree of the bucket becomes, since we include more combinations of the same path in the hierarchy. All in all, the HCDB metric favors the following storage choices for a bucket:

- Whole trees instead of single chunks or other data partitions.
- Smaller-depth trees instead of greater-depth ones.
- Tree regions instead of single trees.
- Regions with a few low-depth trees instead of ones with more trees of greater depth.
- Regions with trees of the same depth that are close in the multidimensional space instead of dispersed trees.
- Buckets with a high occupancy.

We prove the following theorem regarding the maximum value of the hierarchical clustering degree of a bucket:

Theorem 1 (Theorem of maximum hierarchical clustering degree of a bucket): Assume a hierarchically chunked cube represented by a chunk-tree CT of a maximum chunking depth DMAX, which has been allocated to a set of buckets. Then, for any such bucket B it holds that:

HCDB ≤ DMAX

Proof: From the definition of the region contribution of a tree appearing in Definition 4, we can easily deduce that:

cri ≤ 1    (I)

This means that the following holds:

Σi=1..T cri ≤ T    (II)

In (II), T stands for the number of trees stored in B. Similarly, from the definition of the depth contribution of a tree appearing in Definition 5, we can easily deduce that:

cdi ≥ 1/DMAX    (III)

since the smallest possible depth value is 1. This means that the following holds:

Σi=1..T cdi ≥ T/DMAX    (IV)

From (II), (IV), equation (1) and assuming that B is filled to its capacity (i.e., OB equals 1), the theorem is proved.

It is easy to see that the maximum degree of hierarchical clustering of a bucket B is achieved only in the ideal case where we store the chunk-tree CT that represents the whole cube in B and CT fits exactly in B (see footnote 2). In this case, all of our primary goals for a good hierarchical clustering, posed in the beginning of this section, such as the efficient evaluation of HPP queries, the low storage cost and the high space utilization, are achieved. This is because all possible HPP restrictions can be evaluated with a single bucket read (one I/O operation) and the achieved space utilization is maximal (full bucket) with a minimal storage cost (just one bucket).

Footnote 2: Indeed, a bucket with HCDB = DMAX would mean that the depth contribution of each tree in this bucket should be equal to 1/DMAX (according to inequality (III)); however, this is only possible for the whole chunk-tree CT, since only this tree has a depth equal to 1.
Moreover, it is now clear that the hierarchical clustering degree of a bucket signifies to what extent the chunk-tree representing the cube has been "packed" into the specific bucket, and this is measured in terms of the chunking depth of the tree. By trying to create buckets with a high HCDB we can guarantee that our allocation respects these elements of good hierarchical clustering. Furthermore, it is now straightforward to define a metric for evaluating the overall hierarchical clustering achieved by a chunk-to-bucket allocation strategy:

Definition 7 (Hierarchical Clustering Factor of a Physical Organization for a Cube - fHC): For a physical organization that stores the data of a cube into a set of NB buckets, we define the hierarchical clustering factor fHC as the percentage of hierarchical clustering achieved by this storage organization, as this results from the hierarchical clustering degree of each individual bucket divided by the total number of buckets, and we write:

fHC = (Σi=1..NB HCDBi) / (NB · DMAX)    (2)

Note that NB is the total number of buckets used in order to store the cube; however, only the buckets that contain at least one whole chunk-tree have a non-zero HCDB value. Therefore, allocations that spend more buckets for storing sub-trees have a higher hierarchical clustering factor than others, which favor, e.g., single directory chunk allocations. From equation (2), it is clear that even if we have two different allocations of a cube that result in the same total HCDB over the individual buckets, the one that occupies the smaller number of buckets will have the greater fHC, thus rewarding the allocations that use the available space more conservatively. Another way of viewing the fHC is as the average HCDB over all the buckets divided by the maximum chunking depth. It is now clear that it expresses the percentage of the extent to which the chunk-tree representing the whole cube has been "packed" into the set of the NB buckets, and thus 0 ≤ fHC ≤ 1. It follows directly from Theorem 1 that this factor is maximized (i.e., equals 1) if and only if we store the whole cube (i.e., the chunk-tree CT) into a single bucket, which corresponds to a perfect hierarchical clustering for a cube.

In the next section we exploit the hierarchical clustering factor fHC in order to define the chunk-to-bucket allocation problem as an optimization problem. Furthermore, we exploit the hierarchical clustering degree of a bucket HCDB as an evaluation criterion in a greedy strategy that we propose for solving this problem, in order to decide how close we are to an optimal solution.

5 The HPP chunk-to-bucket allocation problem

In this section we formally define the chunk-to-bucket allocation problem as an optimization problem. We prove that it is NP-Hard and provide a heuristic algorithm as a solution. In the course of solving this problem several interesting sub-problems arise. We tackle each one in a separate subsection.

5.1 Problem formulation

The chunk-to-bucket allocation problem is defined as follows:

Definition 8 (The HPP Chunk-to-Bucket Allocation Problem): For a cube C, represented by a chunk-tree CT with a maximum chunking depth of DMAX, find an allocation of the chunks of CT into a set of fixed-size buckets that corresponds to a maximum hierarchical clustering factor fHC. We assume the following: the storage cost of any chunk-tree t equals cost(t), the number of sub-trees per depth d in CT equals treeNum(d), and the size of a bucket equals SB. Finally, we are given a bucket of special size SROOT, consisting of λ consecutive simple buckets, called the root-bucket and denoted BR, where SROOT = λ·SB, λ ≥ 1. Essentially, BR represents the set of buckets that contain no whole sub-trees and thus have a zero HCDB.

The solution S for this problem consists of a set of K buckets, S = {B1, B2 … BK}, such that each bucket contains at least one sub-tree of CT, and a root-bucket BR that contains all the remaining part of CT (the part with no whole sub-trees). S must result in a maximum value of the fHC factor for the given bucket size SB. Since the HCDB values of the buckets of the root-bucket BR equal zero (recall that they contain no whole sub-trees), following from equation (2), fHC can be expressed as:

fHC = (Σi=1..K HCDBi) / ((K + λ) · DMAX)    (3)
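Before discussing equation (3) further, here is a small sketch of equations (2) and (3) under the assumptions of Definitions 7 and 8 (only buckets holding at least one whole sub-tree contribute a non-zero HCDB; the root-bucket consists of λ simple buckets); the numbers are hypothetical.

```python
# Sketch of equations (2)/(3): the hierarchical clustering factor of an
# allocation, given the HCD_B of each simple bucket that stores whole sub-trees
# and the number lam of simple buckets making up the root-bucket B_R.

def f_hc(hcd_values, d_max, lam=0):
    # Equation (3): f_HC = sum(HCD_Bi) / ((K + lam) * D_MAX).
    # With lam = 0 (e.g., root-bucket cached in main memory) this reduces to
    # equation (2) with N_B = K.
    k = len(hcd_values)
    return sum(hcd_values) / ((k + lam) * d_max)

# Hypothetical allocation: 3 buckets with the HCD_B values below, D_MAX = 5,
# and a root-bucket made up of 1 simple bucket.
print(f_hc([0.22, 0.35, 0.50], d_max=5, lam=1))   # -> 1.07 / 20 ≈ 0.0535
```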
Finally, we are given a bucket of special size SROOT, consisting of λ consecutive simple buckets, called the root-bucket and denoted BR, where SROOT = λ·SB, λ ≥ 1. Essentially, BR represents the set of buckets that contain no whole sub-trees and thus have a zero HCDB. The solution S for this problem consists of a set of K buckets, S = {B1, B2, …, BK}, such that each bucket contains at least one sub-tree of CT, and a root-bucket BR that contains all the remaining part of CT (the part with no whole sub-trees). S must result in a maximum value for the fHC factor for the given bucket size SB. Since the HCDB values of the buckets of the root-bucket BR equal zero (recall that they contain no whole sub-trees), following from equation (2), fHC can be expressed as:

$f_{HC} = \frac{\sum_{i=1}^{K} HCD_{B_i}}{(K + \lambda) \cdot D_{MAX}}$   (3)

From equation (3), it is clear that the more buckets we allocate for the root-bucket (i.e., the greater λ becomes), the lower the degree of hierarchical clustering achieved by our allocation will be. Alternatively, if we consider caching the whole root-bucket in main memory (see the following discussion), then we could assume that λ does not affect hierarchical clustering (since it does not introduce more bucket I/Os from the root chunk to a simple bucket) and λ could be zeroed.

Fig. 5 The hierarchical clustering factor fHC of the same chunk-tree for 4 different chunk-to-bucket allocations: (a) fHC = 0.01 (14%), (b) fHC = 0.03 (42%), (c) fHC = 0.05 (69%), (d) fHC = 0.07 (100%).

In Fig. 5, we depict four different chunk-to-bucket allocations for the same chunk-tree. The maximum chunking depth is DMAX = 5, although in the figure we can see the nodes only up to depth D = 3 (i.e., the triangles correspond to sub-trees of 3 levels). The numbers inside each node represent the storage cost for the corresponding sub-tree, e.g., the whole chunk-tree has a cost of 65 units. Assume a bucket size of SB = 30 units.

Fig. 6 The individual calculations of the example in Fig. 5 (region contribution cr, depth contribution cd, bucket occupancy OB and HCDB per bucket, total number of buckets K, root-bucket size, bucket size SB, chunking depth DMAX, and the resulting fHC). Below each figure we depict the calculated fHC and beside it we note the percentage with respect to the best fHC that can be achieved for this bucket size (i.e., fHC/fHCmax × 100%).

The chunk-to-bucket allocation that yields the maximum fHC can be identified easily by exhaustive search in this simple case. Observe how fHC improves gradually as we move from Fig. 5 (a) to (d). In Fig. 5 (a) we have failed to create any bucket regions at depth D = 2; thus each bucket stores a single sub-tree of depth 3. Note also that the occupancy of most buckets is quite low. In Fig. 5 (b) the hierarchical clustering improves, since some bucket regions have been formed: buckets B1, B3 and B4 each store two sub-trees of depth 3. In Fig. 5 (c) the total number of buckets decreases by one, since a large bucket region of four sub-trees has been formed in bucket B3.
Finally, in Fig. 5 (d) we have managed to store in bucket B3 a higher-level (i.e., lower-depth) sub-tree (i.e., a sub-tree of depth 2). This increases the hierarchical clustering achieved even more, compared to the previous case (Fig. 5 (c)), because the root node is included in the same bucket as the four sub-trees. In addition, the bucket occupancy of B3 is increased. It is clear now from this simple example that the hierarchical clustering factor fHC rewards the allocations that manage to store lower-depth sub-trees in buckets, that store regions of sub-trees instead of single sub-trees, and that create highly occupied buckets. The individual calculations of this example can be seen in Fig. 6.

All in all, it is obvious that we now have the optimization problem of finding a chunk-to-bucket allocation such that fHC is maximized. This problem is NP-Hard, as the following theorem shows.

Theorem 2 (Complexity of the HPP chunk-to-bucket allocation problem): The HPP Chunk-to-Bucket allocation problem is NP-Hard.

Proof: Assume a typical bin packing problem [42], where we are given N items with weights wi, i = 1, …, N, respectively, and a bin size B such that wi ≤ B for all i = 1, …, N. The problem is to find a packing of the items in the fewest possible bins. Assume that we create N chunks of depth d and dimensionality D, so that chunk c1 has a storage cost of w1, chunk c2 has a storage cost of w2, and so on. Also assume that N-1 of these chunks are under the same parent chunk (e.g., the N-th chunk). This way we have created a two-level chunk-tree where the root lies at depth d = 0 and the leaves at depth d = 1. Also assume that a bin and a bucket are equivalent terms. Now we have reduced in polynomial time the bin packing problem to an HPP chunk-to-bucket allocation problem, which is to find an allocation of the chunks into buckets of size B such that the achieved hierarchical clustering factor fHC is maximized. Since all the chunk-trees (i.e., single chunks in our case) are of the same depth, the depth contribution cdi (1 ≤ i ≤ N), defined in equation (1), is the same for all chunk-trees. Therefore, in order to maximize the degree of hierarchical clustering HCDB for each individual bucket (and thus also increase the hierarchical clustering factor fHC), we have to maximize the region contribution cri (1 ≤ i ≤ N) of each chunk-tree (equation (1)). This occurs when we pack into each bucket as many trees as possible on the one hand and, due to the region proximity factor rP, when the trees of each region are as close as possible in the multidimensional space on the other. Finally, according to the fHC definition, the number of buckets used must be the smallest possible. If we assume that the chunk dimensions have no inherent ordering, then there is no notion of spatial proximity within the trees of the same region and the region proximity factor equals 1 for all possible regions (see also the related discussion in the following subsection). In this case, the only way to maximize the HCDB of each bucket, and consequently the overall fHC, is to minimize the empty space within each bucket (i.e., maximize the bucket occupancy in equation (1)) and use as few buckets as possible by packing the largest number of trees in each bucket. These are exactly the goals of the original bin packing problem and thus a solution to the bin packing problem is also a solution to the HPP chunk-to-bucket allocation problem and vice versa.
Since the bin packing problem is NP-hard [42] and can be reduced in polynomial time to the HPP Chunk-to-Bucket problem, any problem in NP can be reduced in polynomial time to the HPP Chunk-to-Bucket problem. Furthermore, in the general case (where we have chunk-trees of varying depths and the dimensions have inherent orderings) it is not easy to find a polynomial-time verifier for a solution to the HPP chunk-to-bucket problem, since the maximum fHC that can be achieved is not known (as it is in the bin packing problem, where the minimum number of bins can be computed with a simple division of the total weight of the items by the size of a bin). Thus the problem is NP-Hard.

We proceed next by providing a greedy algorithm based on heuristics for solving the HPP chunk-to-bucket allocation problem in linear time. The algorithm utilizes the hierarchical clustering degree of a bucket as a criterion in order to evaluate at each step how close we are to an optimal solution. In particular, it traverses the chunk-tree in a top-down, depth-first manner, adopting the greedy approach that if at each step we create a bucket with a maximum value of HCDB, then the overall acquired hierarchical clustering factor will be maximal. Intuitively, by trying to pack the available buckets with low-depth trees (i.e., the tallest trees) first (hence the top-to-bottom traversal), we can ensure that we have not missed the chance to create the best HCDB buckets possible.

In Fig. 7, we present the GreedyPutChunksIntoBuckets algorithm, which receives as input the root R of a chunk-tree CT and the fixed size SB of a bucket. The output of this algorithm is a set of buckets, each containing at least one whole chunk-tree, a directory chunk entry pointing at the root chunk R, and the root-bucket BR. In each step the algorithm tries "greedily" to make an allocation decision that will maximize the HCDB of the current bucket. For example, in lines 2 to 7 of Fig. 7, the algorithm tries to store the whole input tree in a single bucket, thus aiming at a maximum degree of hierarchical clustering for the corresponding bucket. If this fails, then it allocates the root R to the root-bucket and tries to achieve a maximum HCDB by allocating the sub-trees at the next depth, i.e., the children of R (lines 9-26).

Fig. 7. A greedy algorithm for the HPP chunk-to-bucket allocation problem.

This is essentially achieved by including all direct children sub-trees with size less than (or equal to) the size of a bucket (SB) in a list of candidate trees for inclusion into bucket-regions (lines 14-16). Then the FormBuckRegions routine is called upon this list and tries to include the corresponding trees in a minimum set of buckets, by forming the bucket-regions to be stored in each bucket, so that each one achieves the maximum possible HCDB (lines 19-22). We will come back to this routine and discuss how it solves this problem in the next sub-section. Finally, for the children sub-trees of root R with a size cost greater than the size of a bucket, we recursively solve the corresponding HPP chunk-to-bucket allocation sub-problem for each one of them (lines 23-26).
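The following minimal Python sketch restates the control flow just described; it is an illustration based only on the prose above, not the pseudocode of Fig. 7. The ChunkTree/Bucket structures and the placeholder region-formation routine are assumptions made for readability; the handling of large data chunks (artificial chunking, Section 3.3) is omitted.

```python
from dataclasses import dataclass, field

@dataclass
class ChunkTree:                      # hypothetical structure: one chunk plus its sub-trees
    size: int                         # storage cost of this chunk alone
    children: list = field(default_factory=list)   # None entries denote empty sub-trees
    def cost(self):                   # cost(t): total storage cost of the sub-tree
        return self.size + sum(c.cost() for c in self.children if c is not None)

@dataclass
class Bucket:                         # a fixed-size bucket holding whole sub-trees
    trees: list

def form_buck_regions(trees, bucket_size):
    """Placeholder region formation (plain first-fit); the best-fit variant of
    Section 3.2 additionally maximizes HCD_B and follows a space-filling curve."""
    buckets = []
    for t in sorted(trees, key=lambda t: -t.cost()):
        for b in buckets:
            if sum(x.cost() for x in b.trees) + t.cost() <= bucket_size:
                b.trees.append(t); break
        else:
            buckets.append(Bucket([t]))
    return buckets

def greedy_put_chunks_into_buckets(root, bucket_size, buckets, root_bucket):
    # Lines 2-7: try to store the whole input tree in a single bucket.
    if root.cost() <= bucket_size:
        buckets.append(Bucket([root]))
        return
    root_bucket.append(root)          # otherwise the root chunk goes to the root-bucket B_R
    small, large = [], []
    for child in root.children:
        if child is None:             # empty sub-tree: no space allocated, only a NULL entry
            continue
        (small if child.cost() <= bucket_size else large).append(child)
    buckets.extend(form_buck_regions(small, bucket_size))   # lines 14-22
    for child in large:                                      # lines 23-26: recurse
        greedy_put_chunks_into_buckets(child, bucket_size, buckets, root_bucket)
```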
This of course corresponds to a depth-first traversal of the input chunk-tree. Also very important is the fact that no space is allocated for empty sub-trees (lines 11-13); only a special entry is inserted in the parent node to denote a NULL sub-tree. Therefore, the allocation performed by the greedy algorithm adapts perfectly to the data distribution, coping effectively with the native sparseness of the cube.

The recursive calls might eventually lead us all the way down to a data chunk (at depth DMAX). Indeed, if GreedyPutChunksIntoBuckets is called upon a root R which is a data chunk, then this means that we have come upon a data chunk with a size greater than the bucket size. This is called a large data chunk and a more detailed discussion on how to handle such chunks follows in a later subsection. For now it is enough to say that, in order to resolve the problem of storing such a chunk, we extend the chunking further (with a technique called artificial chunking) in order to transform the large data chunk into a 2-level chunk-tree. Then, we solve the HPP chunk-to-bucket sub-problem for this sub-tree (lines 30-35). The termination of the algorithm is guaranteed by the fact that each recursive call deals with a sub-problem of a smaller chunk-tree than the parent problem. Thus, the size of the input chunk-tree is continuously reduced.

Fig. 8 A chunk-tree to be allocated to buckets by the greedy algorithm.

Assuming an input file consisting of the cube's data points along with their corresponding chunk-ids (or, equivalently, the corresponding h-surrogate key per dimension), we need a single pass over this file to create the chunk-tree representation of the cube. Then the above greedy algorithm requires only linear time in the number of input chunks (i.e., the chunks of the chunk-tree) to perform the allocation of chunks to buckets, since each node is visited exactly once and in the worst case all nodes are visited.

Assume the chunk-tree of DMAX = 5 of Fig. 8. The numbers inside each node represent the storage cost for the corresponding sub-tree, e.g., the whole chunk-tree has a cost of 65 units. For a bucket size SB = 30 units the greedy algorithm yields a hierarchical clustering factor fHC = 0.072. The corresponding allocation is depicted in Fig. 9.

Fig. 9. The chunk-to-bucket allocation for SB = 30.

The solution comprises three buckets, B1, B2 and B3, depicted as rectangles in the figure. The bucket with the highest clustering degree (HCDB) is B3, because it includes the lowest-depth tree. The chunks not included in a rectangle will be stored in the root-bucket. In this case, the root-bucket consists of only a single bucket (i.e., λ = 1 and K = 3, see equation (3)), since this suffices for storing the corresponding two chunks.

3.2 The formation of bucket-regions

We have seen that in each step of the greedy algorithm for solving the HPP chunk-to-bucket allocation problem (corresponding to an input chunk-tree with a root node at a specific chunking depth), we try to store all the sibling trees hanging from this root into a set of buckets, forming in this way groups of trees, to be stored in each bucket, that we call bucket-regions. The formation of bucket-regions is essentially a special case of the HPP chunk-to-bucket allocation problem and can be described as follows:

Definition 9 (The bucket-region formation problem): We are given a set of N chunk-trees T1, T2, …, TN, of the same chunking depth d. Each tree Ti (1 ≤ i ≤ N) has a size cost(Ti) ≤ SB, where SB is the bucket size.
The problem is to store these trees into a set of buckets so that the hierarchical clustering factor fHC of this allocation is maximized.

Since all the trees are of the same depth, the depth contribution cdi (1 ≤ i ≤ N), defined in equation (1), is the same for all trees. Therefore, in order to maximize the degree of hierarchical clustering HCDB for each individual bucket (and thus also increase the hierarchical clustering factor fHC), we have to maximize the region contribution cri (1 ≤ i ≤ N) of each tree (equation (1)). This occurs when we create bucket-regions with as many trees as possible on the one hand and, due to the region proximity factor rP, when the trees of each region are as close as possible in the multidimensional space on the other. Finally, according to the fHC definition, the number of buckets used must be the smallest possible. Summarizing, in the bucket-region formation problem we seek a set of buckets to store the input trees so that the following three criteria are fulfilled:
1. The bucket-regions (i.e., each bucket) contain as many trees as possible.
2. The total number of buckets is minimum.
3. The trees of a region are as close in the multidimensional space as possible.

One could observe that if we focused only on the first two criteria, then the bucket-region formation problem would be transformed into a typical bin-packing problem, which is a well-known NP-complete problem [42]. So, intuitively, the bucket-region formation problem can be viewed as a bin-packing problem where items packed in the same bin must be neighbors in the multidimensional space.

The space proximity of the trees of a region is meaningful only when we have dimension domains with inherent orderings. A typical example is the TIME dimension. For example, we might have trees corresponding to the months of the same year (which guarantees hierarchical proximity), but we would also like consecutive months to be in the same region (space proximity). This is because such dimensions are the best candidates for expressing range predicates (e.g., months from FEB99 to AUG99). Otherwise, when there is no such inherent ordering, e.g., when a chunk points to trees corresponding to products of the same category along the PRODUCT dimension, space proximity is not important and therefore all regions with the same number of trees are of equal value. In this case the corresponding predicates are typically set-inclusion predicates (e.g., products IN {"Literature", "Philosophy", "Science"}) and not range predicates, so hierarchical proximity alone suffices to ensure a low I/O cost.

To measure the space proximity of the trees in a bucket-region we use the region proximity rP, which we define as follows:

Definition 10 (Region Proximity rP): We define the region proximity rP of a bucket-region R, defined in a multidimensional space S where all dimensions of S have an inherent ordering, as the relative distance of the average Euclidean distance between all trees of the region R from the longest distance in S:

$r_P = \frac{dist_{MAX} - dist_{AVG}}{dist_{MAX}}$

In the case where no dimension of the cube has an inherent ordering, we assume that the average distance for any region is zero and thus the region proximity rP equals one.

For example, in Fig. 10 we depict two different bucket-regions R1 and R2. The surrounding chunk represents the sub-cube corresponding to the months of a specific year and the types of a specific product category, and defines a Euclidean space S.
Each point in this figure corresponds to the root of a chunk-tree. Since only the TIME dimension, among the two, has an inherent ordering of its values, the data space, as far as the region proximity is concerned, is specified by TIME only (a 1-dimensional metric space). The largest distance in S equals 11 and is the distance between the leftmost and the rightmost trees. The average distance for region R1 equals 2, while for region R2 it equals 5. By a simple substitution of the corresponding values in Definition 10, we find that the region proximity for R1 equals 0.8, while for R2 it equals 0.5. This is because the trees of the latter are more dispersed along the TIME dimension. Therefore, region R1 exhibits a better space proximity than R2.

In order to tackle the region formation problem we propose an algorithm called FormBuckRegions. This algorithm is a variation of an approximation algorithm called best-fit [42] for solving the bin-packing problem. Best-fit is a greedy algorithm that does not always find the optimal solution; however, it runs in polynomial time (it can also be implemented to run in O(NlogN), N being the number of trees in the input) and provides solutions that are within a certain bound of the optimal. In fact, the best-fit solution in the worst case is never more than roughly 1.7 times worse than the optimal solution [42]. Moreover, our algorithm exploits a space-filling curve [33] in order to visit the trees in a space-proximity preserving way. We describe it next (a short sketch in code is also given at the end of this discussion):

Fig. 10. The region proximity for two bucket-regions: rP1 > rP2.

FormBuckRegions: Traverse the input set of trees along a space-filling curve SFC on the data space defined by the parent chunk. Each time a tree is processed, insert it into the bucket that will yield the maximum HCDB value, among the already allocated buckets, after the insertion. On a tie, choose one randomly. If no bucket can accommodate the current tree, then allocate a new bucket and insert the tree into it.

Note that there is no linearization of multidimensional data points that preserves space proximity 100% [8, 13]. In the case where no dimension has an inherent ordering, the space-filling curve might be a simple row-wise traversal (Fig. 11). In this figure, we also depict the corresponding bucket-regions that are formed.

We believe that a formation of bucket-regions that will provide an efficient clustering of chunk-trees must be based on some query patterns. In the following we show an example of such a query-pattern driven formation of bucket-regions. A hierarchy level of a dimension can basically take part in an OLAP query in two ways: (a) as a means of restriction (e.g., year = 2000), or (b) as a grouping attribute (e.g., "show me sales grouped by month"). In the former, we ask for values on a hyper-plane of the cube perpendicular to the TIME dimension at the restriction point, while in the latter we ask for values on hyper-planes that are parallel to the TIME dimension. In other words, if we know for a dimension level that it is going to be used by the queries more often as a restriction attribute, then we should try to create regions perpendicular to this dimension. Similarly, if we know that a level is going to be used more often as a grouping attribute, then we should opt for regions that are parallel to this dimension.
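As promised above, here is a minimal Python sketch of the FormBuckRegions insertion policy together with the region proximity of Definition 10. The row-wise ordering used as the space-filling curve, the "gain" used to pick the target bucket (occupancy times region proximity, a stand-in for the full HCDB, which is legitimate here because all sibling trees share the same depth contribution) and the data structures are simplifying assumptions for illustration only.

```python
import math
from itertools import combinations

def region_proximity(points, dist_max):
    """Definition 10: r_P = (dist_MAX - dist_AVG) / dist_MAX for the trees of one region."""
    if len(points) < 2 or dist_max == 0:
        return 1.0
    pairs = list(combinations(points, 2))
    dist_avg = sum(math.dist(a, b) for a, b in pairs) / len(pairs)
    return (dist_max - dist_avg) / dist_max

def form_buck_regions(trees, bucket_size, dist_max):
    """Best-fit style formation of bucket-regions (FormBuckRegions, re-sketched).

    trees    : list of (coords, cost) pairs, already ordered along a space-filling
               curve (a row-wise traversal is the simplest such ordering)
    dist_max : the longest distance in the parent chunk's data space
    """
    buckets = []                      # each bucket: {'trees': [...], 'used': int}
    for coords, cost in trees:
        best, best_gain = None, -1.0
        for b in buckets:
            if b['used'] + cost > bucket_size:
                continue
            pts = [c for c, _ in b['trees']] + [coords]
            gain = ((b['used'] + cost) / bucket_size) * region_proximity(pts, dist_max)
            if gain > best_gain:
                best, best_gain = b, gain
        if best is None:              # no bucket can accommodate the tree: open a new one
            buckets.append({'trees': [(coords, cost)], 'used': cost})
        else:
            best['trees'].append((coords, cost))
            best['used'] += cost
    return buckets

# Example: four sibling trees on a 1-dimensional TIME axis (months 0..11), bucket of 30 units
siblings = [((1,), 12), ((2,), 10), ((7,), 14), ((8,), 9)]
print(len(form_buck_regions(siblings, bucket_size=30, dist_max=11)))   # -> 2 bucket-regions
```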
Returning to the query-pattern driven formation of regions: unfortunately, things are not so simple, because if, for example, we have two "restriction levels" from two different dimensions, then the requirement for regions perpendicular to the corresponding dimensions is contradictory.

Fig. 11. A row-wise traversal of the input trees.

Fig. 12. Bucket-region formation based on query patterns.

In Fig. 12, we depict a bucket-region formation that is driven by the table appearing in the figure. In this table we note, for each dimension level corresponding to a chunking depth of our example cube in Fig. 3, whether it should be characterized as a restriction level or as a grouping level. For instance, a user might know that 80% of the queries referencing a specific level will apply a restriction on it and only 20% will use it as a grouping attribute; thus this level will be characterized as a restriction level. Furthermore, in the column labeled "importance order", we order the different levels of the same depth according to their importance in the expected query load. For example, we might know that one level will appear in queries much more often than another, and so on. In Fig. 12, we also depict a representative chunk for each chunking depth (of course, for the topmost levels there is only one chunk, the root chunk), in order to show the formation of the regions according to the table. The algorithm in Fig. 13 describes how we can produce the bucket-regions for all depths, when we have as input a table similar to the one appearing in Fig. 12.

Fig. 13. A bucket-region formation algorithm that is driven by query patterns.

In Fig. 12, for some of the depicted chunks we also show the column-major traversal method corresponding to the second part of the algorithm. Note also that the term "fully-sized region" means a region that has a size greater than the bucket occupancy threshold, i.e., it utilizes the available bucket space well. Finally, whenever we are at a depth where a pseudo level exists for a dimension (e.g., D = 2 in our example), no regions are created for the pseudo level, of course. Also, note that bucket-region formation for chunks at the maximum chunking depth (as is the chunk at depth 3 in Fig. 12) is only required in the case where the chunking is extended beyond the data-chunk level. This is the case of large data chunks, which is the topic of the next sub-section.

3.3 Storing large data chunks

In this sub-section, we discuss the case where the GreedyPutChunksIntoBuckets algorithm (Fig. 7) is called with input a chunk-tree that corresponds to a single data chunk. This, as we have already explained, would be the result of a number of recursive calls to the GreedyPutChunksIntoBuckets algorithm that led us to descend the chunk hierarchy and end up at a leaf node.
Typically, this leaf node is large enough not to fit in a single bucket; otherwise the recursive call upon this node would not have occurred in the first place (Fig. 7).

Fig. 14. Example of a large data chunk.

The main idea for tackling this problem is to continue the chunking process further, although we have fully used the existing dimension hierarchies, by imposing a normal grid. We call this chunking artificial chunking, in contrast to the hierarchical chunking presented in the previous section. This process transforms the initial large data chunk into a 2-level chunk-tree of size less than or equal to the original data chunk. Then, we solve the HPP chunk-to-bucket allocation sub-problem for this chunk-tree and therefore we once again call the GreedyPutChunksIntoBuckets routine upon this tree.

In Fig. 14, we depict an example of such a large data chunk. It consists of two dimensions A and B. We assume that the maximum chunking depth is DMAX = K. Therefore, K will be the depth of this chunk. Parallel to the dimensions, we depict the order codes of the dimension values of this chunk that correspond to the most detailed level of each dimension. Also, we denote their parent value on each dimension, i.e., the pivot-level values that created this chunk. Notice that the suffix of the chunk-id of this chunk consists of the concatenated order codes of the two pivot-level values.

In order to extend the chunking further, we need to insert a new level between the most detailed members of each dimension and their parent. However, this level must be inserted "locally", only for this specific chunk and not for all the grain-level values of a dimension. We want to avoid inserting another pseudo level in the whole level hierarchy of the dimension, because this would trigger the enlargement of all dimension hierarchies and would result in a lot of useless chunks. Therefore, it is essential that this new level remains local. To this end, we introduce the notion of the local depth d of a chunk to characterize the artificial chunking, similar to the global chunking depth D (introduced in the previous section) characterizing the hierarchical chunking.

Definition 11 (Local Depth d): The local depth d, where d ≥ -1, of a chunk Ch denotes the chunking depth of Ch pertaining to artificial chunking. A local depth d = -1 denotes that no artificial chunking has been imposed on Ch. A value of d = 0 corresponds to the root of a chunk-tree created by artificial chunking and is always a directory chunk. The value of d increases by one for each artificial chunking level. Note that the global chunking depth D, while descending levels created by artificial chunking, remains constant and equal to the maximum global chunking depth of the cube (in general, to the current global depth value); only the local depth increases.

Let us assume a bucket size SB that can accommodate a maximum of Mr directory chunk entries, or a maximum of Me data chunk entries. In order to chunk a large data chunk Ch of N dimensions by artificial chunking, we define a grid on it consisting of mgi (1 ≤ i ≤ N) members per dimension, such that

$\prod_{i=1}^{N} m_{g_i} \leq M_r$

This grid will correspond to a new directory chunk, pointing at the new chunks created from the artificial chunking of the original large data chunk Ch, and due to the aforementioned constraint it is guaranteed that it will fit in a bucket.
If we assume a normal grid, then for all i : 1 ≤ i ≤ N it holds that

$m_{g_i} = \left\lfloor \sqrt[N]{M_r} \right\rfloor$

In particular, if ni (1 ≤ i ≤ N) corresponds to the number of members of the original chunk Ch along dimension i, then a new level consisting of mgi members will be inserted as a "parent" level. In other words, a number of ci children (out of the ni) will be assigned to each of the mgi members, where ci = ⌈ni/mgi⌉, as long as ni/mgi ≥ 1. If 0 < ni/mgi < 1, then the corresponding new level will act as a pseudo level, i.e., no chunking will take place along this dimension. If all new levels correspond to pseudo levels, i.e., ni < mgi for all i : 1 ≤ i ≤ N, then we take mgi = maximum(ni).

We will describe the above process with an example. Let us assume a bucket that can accommodate a maximum of Mr = 10 directory chunk entries or a maximum of Me = 5 data chunk entries. In this case the data chunk of Fig. 14 is a large data chunk, since it cannot be stored in a single bucket. Therefore, we define a grid with mg1, mg2 members along dimensions A and B respectively. If the grid is normal, then mg1 = mg2 = ⌊√10⌋ = 3. Thus, we create a directory chunk that consists of 3×3 = 9 cells (i.e., directory chunk entries); this is depicted in Fig. 15.

Fig. 15. The example large data chunk artificially chunked.

In Fig. 15, we can also see the new values of each dimension and the corresponding parent-child relationships between the original values and the newly inserted ones. In this case, each new value will have at most c1 = ⌈8/3⌉ = 3 children for dimension A and c2 = 2 children for dimension B, respectively. The created directory chunk will have a global depth D = K and a local depth d = 0. Around it, we depict all the data chunks (partitions of the original data chunk) that correspond to each directory entry. Each such data chunk will have a global depth D = K and a local depth d = 1. The chunk-ids of the new data chunks include one more domain as a suffix, corresponding to the new chunking depth to which they belong.

Notice that new empty chunks might arise from the artificial chunking process; for example, see the rightmost chunk at the top of Fig. 15. Since no space will be allocated for such empty chunks, it is obvious that artificial chunking might lead to a reduction of the size of the original data chunk, especially for sparse data chunks. This important characteristic is stated in the following theorem, which shows that in the worst case the extra size overhead of the resultant 2-level tree will be equal to the size of a single bucket. However, since cubes are sparse, chunks will also be sparse and therefore, in practice, the size of the tree will almost always be smaller than that of the original chunk.

Theorem 3 (Size upper bound for an artificially chunked large data chunk): For any large data chunk Ch of size SCh, the two-level chunk-tree CT resulting from the application of the artificial chunking process on Ch will have a size SCT such that:

$S_{CT} \leq S_{Ch} + S_B$

where SB is the bucket size.

Proof: Assume a large data chunk Ch which is 100% full. Then, from the application of artificial chunking, no empty chunks will be produced. Moreover, from the definition of chunking we know that if we connect these chunks back together we will get Ch.
Consequently, the total size of these chunks is equal to SCh. Now, the root chunk of the new tree CT will have (by definition) at most Mr entries, so as to fit in a single bucket. Therefore, the extra size overhead caused by the root is at most SB. From this we infer that SCT ≤ SCh + SB. Naturally, if this holds for the largest possible Ch, it will certainly also hold for all other possible Ch's that are not 100% full and thus may result in empty chunks after the artificial chunking.

As soon as we create the 2-level chunk-tree, we have to solve the corresponding HPP chunk-to-bucket allocation sub-problem for this tree; i.e., we recursively call the GreedyPutChunksIntoBuckets algorithm with input the root node of the new tree. The algorithm will then try to store the whole chunk-tree in a bucket (which is possible because, as explained above, artificial chunking reduces the size of the original chunk for sparse data chunks), or create the appropriate bucket-regions and store the root node in the root-bucket (see Fig. 7). It will also mark the empty directory entries. In Fig. 15, we can see the formed region assuming that the maximum number of data entries in a bucket is Me = 5. Finally, if there still exists a large data chunk that cannot fit by itself in a whole bucket, then we repeat the whole procedure and thus create some new data chunks at local depth d = 2. This procedure may continue until we finally store all parts of the original large data chunk.

Fig. 16. An example of a root directory.

3.4 Storing the root directory

In the previous sub-sections we formally defined the HPP chunk-to-bucket allocation problem. From this definition we have seen that the root-bucket BR essentially represents the entire set of buckets that have a zero degree of hierarchical clustering HCDB and, therefore, no contribution to the hierarchical clustering achieved by a specific chunk-to-bucket allocation. Moreover, due to the factor λ in equation (3) (λ was defined as the number of fixed-size buckets in BR), it is clear that the larger the root-bucket becomes, the worse the hierarchical clustering achieved. In this subsection, we present a method for improving the hierarchical clustering contribution of the root-bucket by reducing the λ factor, with the use of a main-memory cache area, and also by increasing the HCDB of the buckets in BR.

In Fig. 16, we depict an example of a set of directory nodes that will be stored in the root-bucket. These are all directory chunks and are rooted all the way up to the root chunk of the whole cube. These chunks are of different global depths D and local depths d, and they form an unbalanced chunk-tree that we call the root directory.

Definition 12 (The Root Directory RD): The root directory RD of a hierarchically chunked cube C, represented by a chunk-tree CT, is an unbalanced chunk-tree with the following properties:
1. The root of RD is the root node of CT.
2. For the set SR of the nodes of RD it holds that $S_R \subseteq S_{CT}$, where SCT is the set of the nodes of CT.
3. All the nodes of RD are directory chunks.
4. The leaves of the root directory contain entries that point to chunks stored in a different bucket than their own.
5. RD is an empty tree iff the root node of CT is stored in the same bucket as its children nodes.

In Fig. 16, the empty cells correspond to sub-trees that have been allocated to some bucket, either on their own or together with other sub-trees (i.e., forming a bucket-region).
We have omitted these links from the figure in order to avoid cluttering the picture. Also note the symbol "X" for cells pointing to an empty sub-tree. Beneath the dotted line we can see directory chunks that have resulted from the artificial chunking process described in the previous subsection.

The basic idea of the method that we describe next is based on the simple observation that if we impose hierarchical clustering on the root directory, as if it were a chunk-tree in its own right, the evaluation of HPP queries would be improved, because every HPP query needs at some point to access a node of the root directory. Moreover, since the root directory always contains the root chunk of the whole chunk-tree, as well as certain higher-level (i.e., lower-depth) directory chunk nodes, we could assume that these nodes are permanently resident in main memory during a query session on a cube. The latter is of course common practice for all index structures in databases.

The algorithm that we propose for the storage of the root directory is called StoreRootDir. It assumes that directory entries in the root directory pointing to already allocated sub-trees (the empty cells in Fig. 16) are treated as pointers to empty trees, in the sense that their storage cost is not taken into account for the storage of the root directory. The algorithm receives as input the root directory RD, a cache area of size SM and a root-bucket BR of size SROOT = λ·SB, λ ≥ 1, sized so as to fit in the cache area (SROOT ≤ SM), and produces a list of allocated buckets for the root directory; the details of the algorithm are shown in Fig. 17.

Fig. 17. Recursive algorithm for storing the root directory.

We begin from the root and visit all nodes of RD in a breadth-first manner (lines 1-5). Each node we visit is stored in the root-bucket BR, until we find a node that can no longer be accommodated. Then, for each of the remaining unallocated chunk sub-trees of RD we solve the corresponding HPP chunk-to-bucket sub-problem (lines 10-13). For the storage of the new root directories that might result from these sub-problems, we use the StoreRootDir algorithm again, but with a zero cache-area size this time (lines 15-18).

From the above description we can see that the proposed algorithm uses the root-bucket only for storing the higher-level nodes that will be loaded in the cache. Therefore, the I/O overhead due to the root-bucket during the evaluation of an HPP query is zeroed. Furthermore, the chunk-to-bucket allocation solution of a cube is now augmented with an extra set of buckets resulting from the solutions to the new sub-problems within StoreRootDir. The hierarchical clustering degree HCDB of these buckets is calculated based on the input chunk-tree of the specific sub-problem and not on the chunk-tree representing the whole cube. In the case where the former is an unbalanced tree, the maximum chunking depth DMAX is calculated from the longest path from the root to a leaf. Notice that for each such sub-problem a new root directory might arise. (In fact, the only case of an empty root directory is when the whole chunk sub-tree, upon which GreedyPutChunksIntoBuckets is called, fits in a single bucket.)
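Purely as an illustration of the steps just described (and not of the exact algorithm of Fig. 17), a compact Python re-sketch of StoreRootDir could look as follows; the Node type, the stop-at-first-failure rule and the sub-problem callback are assumptions of this sketch.

```python
from collections import deque

class Node:
    """A directory chunk of the root directory RD (hypothetical minimal structure)."""
    def __init__(self, size, children=()):
        self.size, self.children = size, list(children)

def store_root_dir(rd_root, cache_size, bucket_size, solve_subproblem):
    """StoreRootDir, re-sketched from the prose description.

    solve_subproblem(subtree, bucket_size) stands for the HPP chunk-to-bucket
    sub-problem solver (e.g. the greedy routine sketched earlier); it returns the
    new root directory produced by that sub-problem, or None if there is none.
    """
    root_bucket, used = [], 0
    queue, leftovers = deque([rd_root]), []
    # Lines 1-5: breadth-first visit, storing nodes in B_R until one no longer fits.
    while queue:
        node = queue.popleft()
        if used + node.size > cache_size:
            leftovers.append(node)        # first node that does not fit: stop filling B_R
            leftovers.extend(queue)
            break
        root_bucket.append(node)
        used += node.size
        queue.extend(node.children)
    # Lines 10-18: one HPP sub-problem per leftover sub-tree; any new root directory
    # is stored by a recursive call with a zero cache area.
    for sub in leftovers:
        new_rd = solve_subproblem(sub, bucket_size)
        if new_rd is not None:
            store_root_dir(new_rd, 0, bucket_size, solve_subproblem)
    return root_bucket, leftovers

# Tiny usage example with a 3-node root directory and a cache of 8 units:
rd = Node(4, [Node(3), Node(6, [Node(2)])])
cached, rest = store_root_dir(rd, cache_size=8, bucket_size=8,
                              solve_subproblem=lambda sub, sb: None)
print(len(cached), len(rest))   # -> 2 nodes cached, 1 sub-tree handled separately
```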
As noted, we solve each of these sub-problems by recursively using StoreRootDir, but this time with no available cache area. This makes StoreRootDir recursively invoke the GreedyPutChunksIntoBuckets algorithm, until all chunks of a sub-tree are allocated to a bucket. Recall from the previous sub-sections that the termination of the GreedyPutChunksIntoBuckets algorithm is guaranteed by the fact that each recursive call deals with a sub-problem of a smaller chunk-tree than the parent problem. Thus, the size of the input chunk-tree is continuously reduced. Consequently, this also guarantees the termination of StoreRootDir.

Fig. 18. Resulting allocation of the running example cube for a bucket size SB = 30 and a cache area equal to a single bucket.

Note that the root directory is a very small fragment of the overall cube data space. Thus, it is realistic to assume that in most cases we can store the whole root directory in the root-bucket and load it entirely in the cache during querying. In this case, we can evaluate any point HPP query with a single bucket I/O.

In the following we provide an upper bound for the size of the root directory. In order to compute this upper bound, we use the full chunk-tree resulting from the hierarchical chunking of a cube. A guaranteed upper bound for the size of the root directory could be the size of all the possible directory chunks of this tree. However, the root directory of the CUBE File is a significantly smaller version of the whole directory tree for the following reasons: (a) it does not contain all directory chunk nodes, only the ones that were not stored in a bucket along with their descendants, (b) space is not allocated for empty sub-trees, and (c) chunks are stored in a compressed form, not wasting space for empty entries.

Lemma 1: For any cube C consisting of N dimensions, where each dimension has a hierarchy represented by a complete K-level m-way tree, the size of the root directory in terms of the number of directory entries is $O(m^{N(K-2)})$.

Proof: Since the root directory is always (by its definition) smaller in size than the tree containing all the possible directory chunks (called the directory tree), we can write: the size of the root directory is O(size of the directory tree). The size of the directory tree can easily be computed by the following series, which adds the numbers of all possible directory entries per depth:

Size of directory tree $= 1 + m^N + m^{2N} + \dots + m^{(K-2)N} = O(m^{N(K-2)})$

Next we provide a theorem that proves an upper bound for the ratio between the size of the root directory and that of the full most-detailed data space of a cube.

Theorem 4 (Upper bound of the size ratio between the root directory and the cube's data space): For any cube C consisting of N dimensions, where each dimension has a hierarchy represented by a complete K-level m-way tree, the ratio of the root directory size to the full size of C's detailed data space (i.e., the Cartesian product of the cardinalities of the most detailed levels of all dimensions) is $O(\frac{1}{m^N})$.

Proof: From the above lemma we have that the size of the root directory is $O(m^{N(K-2)})$. Similarly, the size of C's most detailed data space is $m^{N(K-1)}$. Therefore,

$\frac{\text{root directory size}}{\text{cube most detailed data space size}} = O\!\left(\frac{m^{N(K-2)}}{m^{N(K-1)}}\right) = O\!\left(\frac{1}{m^N}\right)$

Theorem 4 proves that, as dimensionality increases, the ratio of the root directory size to the full cube size at the most detailed level decreases exponentially.
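A quick numeric check of the two bounds above (the hierarchy parameters m = 3, K = 4 are arbitrary example values, not taken from the experiments):

```python
def directory_tree_entries(m, n_dims, k_levels):
    """Number of possible directory entries: 1 + m^N + ... + m^{(K-2)N} (series of Lemma 1)."""
    return sum(m ** (j * n_dims) for j in range(k_levels - 1))

def grain_cells(m, n_dims, k_levels):
    """Size of the most detailed data space: m^{(K-1)N} cells."""
    return m ** ((k_levels - 1) * n_dims)

for n in (2, 4, 6):
    ratio = directory_tree_entries(3, n, 4) / grain_cells(3, n, 4)
    print(n, ratio, 1 / 3 ** n)   # the ratio tracks O(1/m^N) as N grows
```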
Therefore, as N increases, the root directory size very quickly becomes negligible compared to the cube's data space.

Fig. 19. Resulting allocation of the running example cube for a bucket size SB = 10 and a cache area equal to a single bucket.

If we go back to the allocated cube in Fig. 9 and assume a cache area of size equal to a single bucket, then the StoreRootDir algorithm will store the whole root directory in the root-bucket. In other words, the root directory can be fully accommodated in the cache area and therefore, from equation (3), for K = 3 and λ = 0 (since the root-bucket will be loaded into memory, the λ factor is zeroed), we get an improved hierarchical clustering factor fHC = 9.6%. The new allocation is depicted in Fig. 18. Notice that any point query can now be answered with a single bucket I/O.

If for the cube of our running example we assume a bucket size of SB = 10, then the chunk-to-bucket allocation resulting from GreedyPutChunksIntoBuckets and the subsequent call to StoreRootDir is depicted in Fig. 19. In this case, we have once more assumed a cache area equal to a single bucket. In the figure, we can see the upper nodes allocated to the cache area (i.e., stored in the root-bucket) in a breadth-first way. The buckets B1 to B5 have resulted from the initial call to GreedyPutChunksIntoBuckets. Buckets B6 and B7 store the remaining nodes of the root directory that could not be accommodated in the cache area and are a result of the call to the StoreRootDir algorithm. Finally, in Fig. 20, we present the corresponding allocation for a zero cache area.

Fig. 20. Resulting allocation of the running example cube for a bucket size SB = 10 and a zero cache area.

This concludes the presentation of the data structures and algorithms used to construct the CUBE File. We move next to present detailed experimental evaluation results.

4 Experimental evaluation

We have conducted an extensive set of experiments over our CUBE File implementation. The large set of experiments covers both the structural and the query evaluation aspects of the data structure. In addition, we wanted to compare the CUBE File with the UB-tree/MHC (which, to our knowledge, is the only multidimensional structure that achieves hierarchical clustering with the use of h-surrogates), both in terms of structural behavior and query evaluation time. The latter comparison yielded 7-9 times fewer I/Os on average, in favor of the CUBE File, for all workloads tested, and the former showed a 2-3 times lower storage cost for almost all data spaces, again in favor of the CUBE File, hence providing evidence that the CUBE File achieves a higher degree of hierarchical clustering of the data. These results appear in [18]. Note that the same comparison, but between the UB-tree/MHC and a bitmap-index-based star schema, has shown a query evaluation speedup of 20 to 40 times on average (depending on the use or not of the pre-grouping transformation optimization [40]) (see [15] for more details).

The query performance measurements in [18] were based on HPP queries (see Definition 2) that resulted in a single or multiple disjoint query boxes (i.e., hyper-rectangles) at the grain level of the cube data space. Both hot and cold cache query evaluations were examined. In CUBE File parlance, this translates to a cached or non-cached root-bucket, respectively.
Our query-load consisted of various query classes with respect to the cube selectivity (i.e., how many data points were returned in the result set). The CUBE File performed several times fewer I/Os than the UB-tree in all query classes, for both hot and cold cache experiments, exhibiting a superior hierarchical clustering. For large-selectivity queries (i.e., many data points in the result set), where the hierarchical restrictions were posed on higher hierarchy levels, the CUBE File needed 3 times fewer I/Os than the UB-tree. Interestingly, for small-selectivity queries, where the restrictions were posed on more detailed hierarchy levels, the difference in I/Os increased impressively (in favor of the CUBE File), reaching a factor larger than 10 for all relevant query classes, and up to 37 in certain query classes.

Note that the most decisive factor for any HPP query to run fast (i.e., with few I/Os) is to achieve hierarchical clustering at all levels of the dimension hierarchies. This is more obvious in small-selectivity queries, where one has to achieve hierarchical clustering even at the most detailed levels of the hierarchy. For queries with small cube selectivities the UB-tree performance was worse and the hierarchical clustering effect was reduced. This is due to the way data are clustered into z-regions (i.e., disk pages) along the z-curve [2]. In contrast, the hierarchical chunking applied in the CUBE File creates groups of data (i.e., chunks) that belong to the same "hierarchical family" even for the most detailed levels. This, in combination with the chunk-to-bucket allocation, which guarantees that hierarchical families will be physically stored together, results in a better hierarchical clustering of the cube even for the most detailed levels of the hierarchies.

In this paper, we want to present further experimental results that show the adaptation of the CUBE File structure to data spaces of varying characteristics, such as cube sparseness and total number of data points (i.e., scalability tests).

Dimension   #Levels   Grain-level cardinality
D1          4         2000
D2          5         3125
D3          7         6912
D4          3         500
D5          9         8748
D6          2         36
D7          10        7776
D8          8         6561
D9          6         4096

Fig. 21. Dimension hierarchy configuration for the experimental data sets.

We have used synthetic data sets that were produced with an OLAP data generator that we have developed. Our aim was to create data sets with a realistic number of dimensions and hierarchy levels. In Fig. 21, we present the hierarchy configuration for each dimension used in the experimental data sets. The shortest hierarchy consists of 2 levels, while the longest consists of 10 levels. We tried to make each data set consist of a good mixture of hierarchy lengths. In order to evaluate the adaptation to sparse data spaces, we created cubes that were very sparse; therefore, the number of input tuples was kept at a small to moderate level. To simulate the cube data distribution, for each cube we created ten hyper-rectangular regions as data-point containers. These regions are defined randomly at the most detailed level of the cube and not by combinations of hierarchy values (although this would be more realistic), in order not to particularly favor the CUBE File due to hierarchical chunking. We then filled each region with uniformly spread data points and tried to maintain the same number of data points in each region.
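A minimal sketch of such a generator follows; the grain-level cardinalities come from Fig. 21, while the region width, the seeding and the output format are arbitrary assumptions made only for illustration.

```python
import random

def generate_cube_points(grain_cards, n_tuples, n_regions=10, max_width=50, seed=0):
    """Generate n_tuples synthetic data points spread uniformly over n_regions
    randomly placed hyper-rectangular regions of the grain-level data space."""
    rng = random.Random(seed)
    regions = []
    for _ in range(n_regions):                       # one random hyper-rectangle per region
        box = []
        for card in grain_cards:
            width = rng.randint(1, min(max_width, card))
            lo = rng.randint(0, card - width)
            box.append((lo, lo + width - 1))
        regions.append(box)
    points = []
    for i in range(n_tuples):                        # same number of points per region
        box = regions[i % n_regions]
        coords = tuple(rng.randint(lo, hi) for lo, hi in box)
        points.append(coords + (rng.random(),))      # last field: the measure value
    return points

# e.g. a 5-dimensional cube over D1..D5 of Fig. 21 with 100,000 tuples
sample = generate_cube_points([2000, 3125, 6912, 500, 8748], 100_000)
print(len(sample), sample[0])
```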
Parameter                     SPARSE                            SCALE
#Dimensions                   Varying                           5
#Tuples                       100,000                           Varying
#Facts                        1                                 1
Maximum chunking depth        Depends on the longest hierarchy  8
Bucket size (bytes)           8K                                8K
Bucket occupancy threshold    80%                               80%

Fig. 22. Data set configuration for the two series of experiments.

We have distinguished our experiments into two sets, depending on the characteristic for which we wanted to analyze the CUBE File's behavior: (a) data space sparseness (SPARSE), and (b) input data point scalability (SCALE). Fig. 22 shows the data set configuration for each series of experiments.

4.1 Adaptation to data space sparseness

We increase the dimensionality of the cube while maintaining the number of data points constant (approximately 100K tuples); in this way we essentially increase the cube sparseness. The cube sparseness is measured as the ratio of the actual cube data points to the product of the cardinalities of the dimension grain levels. The primary hypotheses that we aimed to prove experimentally were the following:
1. The CUBE File adapts perfectly to the extensive sparseness of the data space and thus its size does not increase as the cube sparseness increases.
2. The hierarchical clustering achieved by the CUBE File is almost unaffected by the extensive cube sparseness.
3. The root-bucket size remains low compared to the CUBE File size and thus it is feasible to cache it in main memory for realistic cases.

Additionally, we have drawn other interesting conclusions regarding the structure's behavior as sparseness increases. In Fig. 23, we observe the data space size "exploding" exponentially as the number of dimensions increases. We can see that the data space size is many orders of magnitude larger than the CUBE File size.

Fig. 23. CUBE File size (in logarithmic scale) for increasing dimensionality.

In addition, the CUBE File size is smaller than the input file containing the input tuples (i.e., fact values accompanied by their corresponding chunk-ids, or equivalently h-surrogates) to be loaded into the CUBE File. This is depicted more clearly in Fig. 25. There, we can see that the total CUBE File size is smaller than that of the input data file, although the former maintains a whole tree structure of intermediate directory nodes; this is essentially because the CUBE File does not allocate space for empty sub-trees and does not store the coordinates along with the measure values. In the graph, we can see that the CUBE File size exceeds the input data file only after the dimensionality exceeds eight dimensions. The real cause in this case is the cube sparseness, which is magnified by the dimensionality increase. In our case, for nine dimensions and 100,000 input data points, the sparseness has reached a value of 7.08 × 10^-26, which is an extreme case. This clearly shows that the CUBE File:
1. Adapts to the large sparseness of the cube, allocating space comparable to the actual number of data points and not to all possible cells.
2. Achieves a compression of the input data, since it does not store the data point coordinates (i.e., the h-surrogate keys/chunk-ids) but only the measure values.

The last point is depicted more clearly in Fig. 24, where we present the compression achieved by the CUBE File organization as the cube sparseness increases. This compression is calculated as the ratio of the CUBE File size to the data space size (or the input file size), which is then subtracted from 1.
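In code, the reported compression figure is simply the following (the sizes in the example call are hypothetical, chosen only to show the formula):

```python
def compression(cube_file_size, reference_size):
    """Compression w.r.t. a reference size (data space size or input file size):
    1 - CUBE File size / reference size, expressed as a percentage."""
    return 100.0 * (1.0 - cube_file_size / reference_size)

# e.g. a 3 MB CUBE File built from a 9 MB input file -> ~66.7% compression
print(round(compression(3_000_000, 9_000_000), 1))
```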
With respect to the data space size, the compression is always 100% for all depicted sparseness values. This is reasonable, since the CUBE File size is always many orders of magnitude smaller than the data space size. In addition, with respect to the input file, the compression remains high (above 50%) even for cubes with sparseness values down to 10^-20. This shows that for all practical cases of cubes the compression achieved is significant.

Fig. 24. Compression achieved by the CUBE File as the cube sparseness increases.

It is worth noting that for the measurements presented here the CUBE File implementation does not impose any compression on the intermediate nodes (i.e., the directory chunks). Only the data chunks are compressed, by means of a bitmap representing the cell offsets (called the compression bitmap), which, however, is itself stored uncompressed. This was a deliberate choice, in order to evaluate the compression achieved merely by the "pruning ability" of our chunk-to-bucket allocation scheme, according to which no space is allocated for empty chunk-trees. Finally, another factor that reduces the achieved compression is that in our current implementation we also store the chunk-id of each chunk. This is due to a "defensive" design choice made in the early stages of the implementation, but it is not necessary for accessing the chunks, since chunk-ids are not used for associative search when accessing the CUBE File. Therefore, the following could improve the achieved compression ratio even further:
1. Compression of directory chunks.
2. Removal of chunk-ids from chunks.
3. Compression of bitmaps (e.g., with run-length encoding).

Fig. 25. Several sizes for increasing cube sparseness (via increase in dimensionality).

In addition, in Fig. 25 we depict the root-bucket size and the chunk-tree size. The root-bucket grows in a similar way to the CUBE File; however, its size is always one or two orders of magnitude smaller. We will return to the root-bucket shortly. The chunk-tree size denotes the size of the chunk-tree representation of the cube, i.e., the sum of the sizes of all the chunks comprising the chunk-tree. Interestingly, we observe that as dimensionality increases (i.e., cube sparseness increases) the size of the chunk-tree exceeds that of the CUBE File. This seems rather strange, since one would expect the CUBE File size, which includes the storage overhead of the buckets, to be greater. The explanation lies in the existence of large data chunks. The chunk-tree representation may include large data chunks, which in the chunk-to-bucket allocation process will be artificially chunked. However, in sparse data spaces these large data chunks are also very sparse and most of their size cost is due to the compression bitmap. When such a sparse chunk is artificially chunked, its size is significantly reduced due to the pruning ability of the allocation algorithm. Therefore, in sparse cube data spaces, artificial chunking provides substantial compression as a side effect. Fig. 26 also verifies the existence of many large data chunks in highly sparse data spaces. In Fig. 26, we depict the chunk distribution as dimensionality increases.
Note that the number of chunks depicted is the number of "real" chunks that will eventually be stored in buckets and not the number of "possible" chunks deriving from the hierarchical chunking process. One interesting result that can be drawn from this graph is that an increase in dimensionality does not necessarily mean an increase in the total number of chunks. In fact, we observe this metric decreasing as dimensionality increases, reaching a minimum point when the dimensionality becomes 7. One would expect the opposite, since the number of chunks at each depth generated by hierarchical chunking equals the product of the dimension cardinalities at the corresponding levels. The explanation here lies again in the pruning ability of our method. This shows that although the number of "possible" chunks increases, the number of "real" chunks might decrease for certain data distributions and hierarchy configurations. Again, this provides evidence that the CUBE File adapts well to the sparseness of the data space.

Fig. 26. The distribution of chunks for increasing cube sparseness (via increase in dimensionality).

Another interesting result is that very soon (from dimensionality 5 and above) the total number of directory chunks exceeds the total number of data chunks. This leads us to the conclusion that a compression of the directory chunks (which, as we have mentioned above, has not been implemented in our current version) is indeed meaningful and might provide significant compression. Finally, we observe an increase in the number of large data chunks. This is an implementation effect and not a data structure characteristic. As we have already noted, the current chunk implementation leaves the compression bitmap uncompressed. As the space becomes sparser, these large data chunks are essentially almost empty data chunks with a very large compression bitmap, which is almost filled with 0s. Of course, a more efficient storage of this bitmap (even with a simple run-length encoding scheme) would eliminate this effect and these data chunks would not appear as "large". The existence of many large data chunks in high dimensionalities also explains the fact that the number of root-directory chunks (i.e., the chunks that will be stored mainly in the root-bucket, but also in simple buckets if the root-bucket overflows) exceeds the total number of directory chunks. This is because the total number of directory chunks appearing in the graph does not include the directory chunks arising from the artificial chunking of large data chunks, which were not created initially by the hierarchical chunking process but dynamically during the chunk-to-bucket allocation phase.

In Fig. 27, we depict the relative size of the root-bucket with respect to the total CUBE File size. In particular, we can see the ratio of the root-bucket size to the total CUBE File size for continuously increasing values of the cube sparseness. It is clear that even for extremely sparse cubes, with sparseness values down to 10^-18, the total root-bucket size remains less than 20% of the total CUBE File size. For all realistic cases this ratio is below 5%. Once more, the remarks mentioned above regarding compression hold for this case too, i.e., in our experiments no compression has been imposed on the root-bucket chunks, other than the pruning of empty regions.
Fig. 27. Relative growth of the size of the root-bucket as the cube sparseness becomes greater.

Finally, we have measured the achieved hierarchical clustering for increasing cube sparseness. In Fig. 28, we depict fHC values that have been normalized to the range [0,1]. We can observe in this figure that the fHC values vary by only about 70% from one end of the curve to the other, while the cube sparseness varies by 20 orders of magnitude. Thus the hierarchical clustering factor is essentially not affected by the increase in cube sparseness, and the CUBE File manages to maintain a high quality of hierarchical clustering even for extremely sparse data spaces.

Fig. 28. The hierarchical clustering factor fHC as cube sparseness increases.

We recapitulate the main conclusions drawn regarding the CUBE File's behavior under increasing sparseness:
1. It adapts to the large sparseness of the cube, allocating space comparable to the actual number of data points and not to all possible cells.
2. Moreover, it achieves more than 50% compression of the input data for all realistic cases.
3. In sparse cube data spaces, artificial chunking provides substantial compression as a side effect, due to the existence of many large data chunks.
4. An increase in dimensionality does not necessarily mean an increase in the total number of chunks for the CUBE File. The "possible" chunks indeed increase, but the CUBE File stores only those that are non-empty.
5. Compression of directory chunks in data spaces of large dimensionality is likely to yield significant storage savings.
6. The root-bucket size remains less than 20% of the total CUBE File size even for extremely sparse cubes. For more realistic cases of sparseness it is below 5%. Thus caching the root-bucket (or at least a significant part of it) in main memory is indeed feasible.
7. The hierarchical clustering factor is essentially not affected by the increase in cube sparseness, and the CUBE File manages to maintain a high quality of hierarchical clustering even for extremely sparse data spaces.

Scalability with an increasing number of data points

This series of experiments aimed at evaluating the scalability of the CUBE File. To this end, we increased the number of input tuples while maintaining a fixed set of 5 dimensions (D1 to D5 in Fig. 21). However, we kept the maximum number of tuples to a moderate level (1 million rows), in order to maintain the large sparseness of the cube, which is more realistic (a rough quantification is given after the list of hypotheses below). The primary hypotheses that we aimed to prove with this set of experiments were the following:
1. The CUBE File is scalable (its size remains lower than that of the input file as the number of input data points increases).
2. The hierarchical clustering achieved remains of high quality as the number of input data points increases.
3. The root-bucket size remains low compared to the CUBE File size, and thus it is feasible to cache it in main memory for realistic cases.
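To put the "large sparseness" of this setting into perspective, the following back-of-the-envelope calculation may help; it assumes that sparseness denotes the fraction of non-empty grain-level cells, and it uses hypothetical grain cardinalities (the actual cardinalities are those of the dimensions D1-D5 in Fig. 21):

\[
\text{sparseness} \;=\; \frac{\#\,\text{non-empty cells}}{\prod_{i=1}^{5} |D_i|}
\;\approx\; \frac{10^{6}}{(10^{4})^{5}} \;=\; \frac{10^{6}}{10^{20}} \;=\; 10^{-14}.
\]

In other words, even at the maximum input size of one million tuples, only a vanishingly small fraction of the possible grain-level cells is populated, which is exactly the regime that the above hypotheses refer to.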
The first and the third hypotheses can be confirmed directly from Fig. 29. In this figure, we can see the CUBE File size remaining smaller than the input file for all data sets. We can also see the difference between the CUBE File size and that of the root-bucket becoming larger. Thus, as tuple cardinality increases, the root-bucket becomes a continually smaller fraction of the CUBE File. Finally, we can see the chunk-tree size being very close to the CUBE File size, which demonstrates the high space utilization achieved by the CUBE File. More interestingly, in Fig. 30 we depict the compression achieved by the CUBE File as the number of cube data points increases. With respect to the data space size, the compression is effectively 100%. With respect to the input data file, the compression becomes high (around 70%) very soon and maintains this high rate for all tuple cardinalities. In fact, the compression appears to reach a maximum value and then remain almost constant; thus both sizes increase at the same rate. This is clear evidence that the CUBE File utilizes space efficiently. It saves a significant portion of storage by discarding the dimension foreign keys of each tuple (i.e., the chunk-ids or h-surrogates) and then retains this size difference by growing proportionally to the number of input tuples.

Fig. 29. CUBE File size as the number of cube data points increases.

Fig. 30. The compression achieved by the CUBE File as the number of cube data points increases.

Fig. 31 depicts the decrease of the ratio of the root-bucket size to the CUBE File size as the number of input tuples increases. It shows that for realistic tuple cardinalities the root-bucket size becomes negligible compared to the CUBE File size. Therefore, for realistic cube sizes (> 1000K tuples) the root-bucket size is below 5% of the CUBE File size and it could be cached in main memory. Finally, we observe a super-linear decrease of the ratio with the number of input tuples, which further confirms our previous statement.

In Fig. 32, we depict the distribution of buckets with different content as the number of input tuples increases. Observe that as the space gradually becomes more dense and more data points fill up the empty regions, more chunk-subtrees are created and thus the number of bucket-region buckets increases rapidly. This is a very welcome result, since the more bucket-regions are formed, the better the hierarchical clustering of the chunk-to-bucket allocation becomes.

Fig. 31. The ratio of the root-bucket size to the CUBE File size for increasing tuple cardinality.

Fig. 32. The distribution of buckets as tuple cardinality increases.

This last point is further exhibited in Fig. 33, where we depict the normalized values of the hierarchical clustering factor for each data set. We can clearly see that the hierarchical clustering quality remains high for all data sets. In particular, the experiments show that the hierarchical clustering factor remains approximately 0.7 (i.e., 70% of the best value achieved) even when the tuple cardinality is increased by 3 orders of magnitude. This essentially proves the second hypothesis that we posed at the beginning of this sub-section.

Fig. 33. The hierarchical clustering factor fHC as tuple cardinality increases.
Conclusions and future work

In this paper, we addressed the problem of devising physical clustering schemes for multidimensional data that are organized in hierarchies. A typical case of such data is the OLAP cube. The problem of clustering the most detailed data of a cube on disk, so as to reduce the I/Os during the evaluation of hierarchy-selective queries, is difficult due to the enormous search space of possible solutions. Instead of following the typical approach of finding a linear ordering of the data points, we introduced a representation of the search space (i.e., a model) based on a hierarchical chunking method that results in a chunk-tree representation of the cube. We then treated the problem as a packing problem, in particular the packing of chunks into buckets.

The chunk-tree representation is a very effective model of the cube data space, because it prunes all empty areas (i.e., chunk-trees) and adapts perfectly to the usual extensive sparseness of the cube. Moreover, by traversing the chunk-tree nodes we can very efficiently access subsets of the data space based on hierarchy value combinations. This makes the chunk-tree an excellent index for queries with hierarchical restrictions. In order to be able to evaluate the solutions to the proposed problem, we defined a quality metric, namely the hierarchical clustering factor fHC of a cube. Furthermore, we formally defined the problem as an optimization problem and proved that it is NP-Hard via a reduction from the bin packing problem. We proposed as a solution an effective greedy algorithm that requires a single pass over the input fact table and linear time in the number of chunks. Moreover, we analyzed and provided solutions for a number of sub-problems, such as the formation of bucket regions, the storage of large data chunks and the storage of the root-directory. The whole solution leads to the construction of the CUBE File data structure.

We presented an extensive set of experiments analyzing the structural behavior of the CUBE File in terms of increasing sparseness and data point scalability. Our experimental results have confirmed our principal hypotheses that:
1. The CUBE File adapts perfectly to even the most extremely sparse data spaces, yielding significant space savings. Furthermore, the hierarchical clustering achieved by the CUBE File is almost unaffected by the extensive cube sparseness.
2. The CUBE File is scalable (its size remained constantly about 70% smaller than that of the input tuple-based file for all input data point cardinalities tested). In addition, the hierarchical clustering achieved remains of high quality as the number of input data points increases.
3. The root-bucket size remained low (below 5%) compared to the total CUBE File size for all realistic cases of sparseness and data point cardinality, and thus caching it in main memory is feasible. This not only results in a single-I/O evaluation of point queries but also dramatically reduces the I/Os for all types of hierarchy-selective queries [18].

All in all, the CUBE File is an effective data structure for physically organizing and indexing the most detailed data of an OLAP cube. One area where such a structure could be used successfully is as an alternative to bitmap-index-based processing of star-join queries; an efficient processing framework to this end has been proposed in [15]. More generally, it can be used as an effective index for any data that are accessed through multidimensional queries with hierarchical restrictions.
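As a rough illustration of the greedy, single-pass packing idea summarized above, the sketch below packs a chunk-tree into fixed-size buckets: a whole subtree goes into one bucket whenever it fits, otherwise the algorithm keeps the corresponding directory chunk in the root-directory and descends into its non-empty children. The node layout, size accounting and names are assumptions made for exposition; the sketch deliberately omits parts of the actual algorithm, such as bucket-region formation and the artificial chunking of large data chunks.

# Illustrative sketch only: greedy, top-down packing of a chunk-tree into
# fixed-size buckets. Empty subtrees never appear (they are pruned), and a
# subtree is stored whole in one bucket whenever it fits. Large data chunks
# that exceed a bucket on their own would require artificial chunking, which
# is not modelled here.
class ChunkNode:
    def __init__(self, size, children=None):
        self.size = size                 # bytes occupied by this chunk alone
        self.children = children or []   # non-empty sub-chunk-trees only

def subtree_size(node):
    # Total size of a chunk-tree: the chunk itself plus all its descendants.
    return node.size + sum(subtree_size(c) for c in node.children)

def pack(node, bucket_size, buckets, root_directory):
    total = subtree_size(node)
    if total <= bucket_size:
        buckets.append(total)            # whole subtree fits in one bucket
        return
    root_directory.append(node.size)     # directory chunk stays in the root-directory
    for child in node.children:
        pack(child, bucket_size, buckets, root_directory)

# Hypothetical chunk-tree and bucket size, for illustration only.
tree = ChunkNode(64, [ChunkNode(32, [ChunkNode(500), ChunkNode(700)]),
                      ChunkNode(16, [ChunkNode(100)])])
buckets, root_dir = [], []
pack(tree, bucket_size=1024, buckets=buckets, root_directory=root_dir)
print(buckets)    # [500, 700, 116] -- one whole subtree per bucket
print(root_dir)   # [64, 32]        -- directory chunks kept in the root-directory

Packing whole subtrees into the same bucket keeps hierarchically related data physically together, which is the essence of the hierarchical clustering that the CUBE File aims for.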
An interesting enhancement to the CUBE File would be to incorporate more workload-specific knowledge in its chunk-to-bucket allocation algorithm. For example, the allocation of more frequently accessed sub-trees to the same bucket could be rewarded with a higher HCDB value. We are also investigating the use of the hierarchical clustering factor for making decisions during the construction of other common storage organizations (e.g., partitioned heap files, B-trees, etc.) in order to achieve hierarchical clustering of the data. The interested reader can find more information regarding other aspects of the CUBE File not covered in this paper (e.g., the updating and maintenance operations), as well as information on a prototype implementation of a CUBE File-based DBMS, in [16].

Acknowledgements

We would like to thank our colleagues Yannis Kouvaras and Yannis Roussos from the Knowledge and Database Systems Laboratory at the N.T.U. Athens for their fruitful comments and their support in the implementation of the CUBE File and the completion of the experimental evaluation. We would also like to thank Aris Tsois for his detailed review of and comments on the first draft. This work has been partially funded by the European Union's Information Society Technologies Programme (IST) under project EDITH (IST-1999-20722).

References

1. Bayer R, McCreight E (1972) Organization and Maintenance of Large Ordered Indexes. Acta Informatica 1: 173-189 (1972).
2. Bayer R (1997) The Universal B-Tree for Multi-dimensional Indexing: General Concepts. WWCA 1997.
3. Chan CY, Ioannidis Y (1998) Bitmap Index Design and Evaluation. SIGMOD 1998.
4. Chaudhuri S, Dayal U (1997) An Overview of Data Warehousing and OLAP Technology. SIGMOD Record 26(1): 65-74 (1997).
5. Deshpande PM, Ramasamy K, Shukla A, Naughton J (1998) Caching Multidimensional Queries Using Chunks. In: Proc. ACM SIGMOD Int. Conf. on Management of Data (1998): 259-270.
6. Fagin R, Nievergelt J, Pippenger N, Strong HR (1979) Extendible Hashing - A Fast Access Method for Dynamic Files. TODS 4(3): 315-344 (1979).
7. Faloutsos C, Rong Y (1991) DOT: A Spatial Access Method Using Fractals. ICDE 1991: 152-159.
8. Gaede V, Günther O (1998) Multidimensional Access Methods. ACM Computing Surveys 30(2): 170-231 (1998).
9. Gray J, Bosworth A, Layman A, Pirahesh H (1996) Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Total. ICDE 1996.
10. Gupta A, Mumick IS (1995) Maintenance of Materialized Views: Problems, Techniques, and Applications. Data Engineering Bulletin 18(2): 3-18 (1995).
11. Harinarayan V, Rajaraman A, Ullman JD (1996) Implementing Data Cubes Efficiently. In: Proc. ACM SIGMOD Int. Conf. on Management of Data (1996): 205-227.
12. Hinrichs K (1985) Implementation of the Grid File: Design Concepts and Experience. BIT 25(4): 569-592 (1985).
13. Jagadish HV (1990) Linear Clustering of Objects with Multiple Attributes. SIGMOD Conference 1990: 332-342.
14. Jagadish HV, Lakshmanan LVS, Srivastava D (1999) Snakes and Sandwiches: Optimal Clustering Strategies for a Data Warehouse. SIGMOD Conference 1999: 37-48.
15. Karayannidis N et al (2002) Processing Star-Queries on Hierarchically-Clustered Fact-Tables. VLDB 2002.
16. Karayannidis N (2003) Storage Structures, Query Processing and Implementation of On-Line Analytical Processing Systems. Ph.D. Thesis, National Technical University of Athens, 2003. Available at: http://www.dblab.ece.ntua.gr/~nikos/thesis/PhD_thesis_en.pdf.
17. Karayannidis N, Sellis T (2003) SISYPHUS: The Implementation of a Chunk-Based Storage Manager for OLAP Data Cubes. Data and Knowledge Engineering 45(2): 155-188 (2003).
18. Karayannidis N, Sellis T, Kouvaras Y (2004) CUBE File: A File Structure for Hierarchically Clustered OLAP Cubes. EDBT 2004, Heraklion, Crete, Greece, March 14-18, 2004: 621-638.
19. Kotidis Y, Roussopoulos N (1998) An Alternative Storage Organization for ROLAP Aggregate Views Based on Cubetrees. In: Proc. ACM SIGMOD Int. Conf. on Management of Data (1998): 249-258.
20. Lakshmanan LVS, Pei J, Han J (2002) Quotient Cube: How to Summarize the Semantics of a Data Cube. VLDB 2002.
21. Lakshmanan LVS, Pei J, Zhao Y (2003) QC-Trees: An Efficient Summary Structure for Semantic OLAP. SIGMOD 2003.
22. Markl V, Ramsak F, Bayer R (1999) Improving OLAP Performance by Multidimensional Hierarchical Clustering. IDEAS 1999.
23. Nievergelt J, Hinterberger H, Sevcik KC (1984) The Grid File: An Adaptable, Symmetric Multikey File Structure. TODS 9(1): 38-71 (1984).
24. OLAP Report (1999) Database Explosion. Available at: http://www.olapreport.com/DatabaseExplosion.htm.
25. O'Neil PE, Graefe G (1995) Multi-Table Joins Through Bitmapped Join Indices. SIGMOD Record 24(3): 8-11 (1995).
26. O'Neil PE, Quass D (1997) Improved Query Performance with Variant Indexes. SIGMOD 1997.
27. Orenstein JA, Merrett TH (1984) A Class of Data Structures for Associative Searching. PODS 1984: 181-190.
28. Padmanabhan S, Bhattacharjee B, Malkemus T, Cranston L, Huras M (2003) Multi-Dimensional Clustering: A New Data Layout Scheme in DB2. SIGMOD Conference 2003: 637-641.
29. Pieringer R et al (2003) Combining Hierarchy Encoding and Pre-Grouping: Intelligent Grouping in Star Join Processing. ICDE 2003.
30. Ramsak F, Markl V, Fenk R, Zirkel M, Elhardt K, Bayer R (2000) Integrating the UB-Tree into a Database System Kernel. VLDB 2000: 263-272.
31. Régnier M (1985) Analysis of Grid File Algorithms. BIT 25(2): 335-357 (1985).
32. Roussopoulos N (1998) Materialized Views and Data Warehouses. SIGMOD Record 27(1): 21-26 (1998).
33. Sagan H (1994) Space-Filling Curves. Springer-Verlag, 1994.
34. Sarawagi S (1997) Indexing OLAP Data. Data Engineering Bulletin 20(1): 36-43 (1997).
35. Sarawagi S, Stonebraker M (1994) Efficient Organization of Large Multidimensional Arrays. In: Proc. of the 11th Int. Conf. on Data Engineering (1994): 326-336.
36. Sismanis Y, Deligiannakis A, Roussopoulos N, Kotidis Y (2002) Dwarf: Shrinking the PetaCube. SIGMOD 2002.
37. Srivastava D, Dar S, Jagadish HV, Levy AY (1996) Answering Queries with Aggregation Using Views. VLDB Conference 1996: 318-329.
38. Stöhr T, Märtens H, Rahm E (2000) Multi-Dimensional Database Allocation for Parallel Data Warehouses. VLDB 2000: 273-284.
39. The TransBase HyperCube® relational database system (2005). Available at: http://www.transaction.de.
40. Tsois A, Sellis T (2003) The Generalized Pre-Grouping Transformation: Aggregate-Query Optimization in the Presence of Dependencies. VLDB 2003.
41. Weber R, Schek H-J, Blott S (1998) A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces. VLDB 1998: 194-205.
42. Weiss MA (1995) Data Structures and Algorithm Analysis. The Benjamin/Cummings Publishing Company Inc., 1995, pp. 351-359.
43. Whang K-Y, Krishnamurthy R (1991) The Multilevel Grid File - A Dynamic Hierarchical Multidimensional File Structure. DASFAA 1991: 449-459.