Improving the Accuracy of Continuous Aggregates & Mining Queries Under Load Shedding
Yan-Nei Law* and Carlo Zaniolo
Computer Science Dept., UCLA
*Bioinformatics Institute, Singapore

Continuous Aggregate Queries on Data Streams: Sampling & Load Shedding
* Only random samples are available for computing aggregate queries, because of:
  - limitations of remote sensors or transmission lines, or
  - load-shedding policies implemented when overloads occur.
* When overloads occur (e.g., due to a burst of arrivals) we can:
  1. drop queries altogether, or
  2. sample the input, which is much preferable.
* Key objective: achieve answer accuracy with sparse samples, for complex aggregates on windows.
* Can we improve answer accuracy with minimal overhead?

General Architecture
(diagram: data streams S1 ... Si ... Sn pass through load shedders into a query network of aggregate operators ∑; samples flow through more complex query graphs in the same way)
* Basic idea: optimize the sampling rates of the load shedders for accurate answers.

Previous Work [BDM04]
* Find an error bound for each aggregate query.
* Determine sampling rates that minimize query inaccuracy within the limits imposed by resource constraints.
* Works only for SUM and COUNT; no error model is provided.

A New Approach
* Exploit the correlation between answers at different points in time; example: sensor data [VAA04, AVA04].
* Objective: adjust the current answer using the past answers, so that:
  - low sampling rate → current answer less accurate → more dependent on history;
  - high sampling rate → current answer more accurate → less dependent on history.
* We propose a Bayesian quality-enhancement module that achieves this objective automatically and reduces the uncertainty of the approximate answers.
* A larger class of queries is considered: SUM, COUNT, AVG, quantiles.

Our Model
* The observed answer Ã is computed from random samples of the complete stream, taken with sampling rate P.
* We propose a Bayesian method that obtains an improved answer by combining:
  - the observed answer Ã,
  - the error model, and
  - the history of the answer.
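To make the model concrete, here is a minimal Python sketch of how an observed answer Ã arises under load shedding. The function name and the 1/P (Horvitz-Thompson) scaling are my own illustration, assuming Bernoulli sampling at rate P; the slides do not prescribe an implementation.

```python
import random

def observed_sum(window, p, rng):
    """Observed answer for SUM under load shedding: keep each tuple
    with probability p (Bernoulli sampling) and scale the sampled
    sum by 1/p, which makes the estimate unbiased."""
    return sum(x for x in window if rng.random() < p) / p

window = list(range(100))                             # true SUM = 4950
exact = observed_sum(window, 1.0, random.Random(0))   # p = 1: nothing shed
noisy = observed_sum(window, 0.2, random.Random(0))   # sparse sample: noisy Ã
print(exact)   # 4950.0
print(noisy)   # unbiased but noisy estimate near 4950
```

The lower P is, the noisier Ã becomes; the error model quantifies this variance, and the Bayesian module compensates by leaning on the answer history.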
(diagram: the load shedders feed the query network; each approximate answer Ã, together with the history of past answers, enters a Quality Enhancement Module that outputs the improved answer)

Error Model of the Aggregate Answers
* Ã: approximate answer obtained from a random sample taken with sampling rate P; A: exact answer; N: number of tuples in the window.
* Key result: p(Ã | A) ~ N(A, σ²), where the variance σ² depends on the aggregate:
  - SUM: σ² = ((1 − P)/P) · N(σ_a² + μ_a²), where μ_a and σ_a² are the mean and variance of the values;
  - COUNT: σ² = ((1 − P)/P) · N;
  - AVG: σ² = ((1 − P)/(PN)) · σ_a²;
  - p-th quantile [B86]: σ² = p(1 − p) / (PN · (f(F⁻¹(p)))²), where F is the cumulative distribution and f = F′ is the density function.

Use of the Error Model
* Derive accuracy estimates for a larger class of queries, for optimizing the load-shedding policy. Idea: minimize the variance of each query.
* Enhance the quality of the query answer on the basis of statistical information derived from the past.

Learning the Prior Distribution from the Past
* Statistical information on the answers can be:
  - spatial, e.g. readings from the neighbors;
  - temporal, e.g. the past answers {x_i}.
* Model the distribution of the answer by a normal distribution: by MLE, pdf ~ N(μ_s, σ_s²), where μ_s = Σ x_i / n and σ_s² = Σ (x_i − μ_s)² / n; only μ_s and σ_s need to be stored.
* Requires only a minimal amount of computation time.
* Assumes that there is no concept change.

Combining the Prior with the Observation
* Posterior: A | Ã ~ N(μ, τ²), with
  mean μ = (σ_s² Ã + σ² μ_s) / (σ_s² + σ²) and variance τ² = σ_s² σ² / (σ_s² + σ²).
* Reduced uncertainty: small σ_s gives τ << σ.
* Compromise between prior and observed answer:
  - large σ → less accurate Ã → more dependent on μ_s;
  - small σ → more accurate Ã → less dependent on μ_s;
  - an uncertain prior (i.e., large σ_s) will not have much effect on the improved answer.

Generalizing to Mining Functions: K-means Is a Generalized AVG Query
(figures: relative error for the first mean and for the second mean)

Quantiles (dataset with concept drifts)
Average relative error for every quantile:

  p-th quantile | Approximate | Posterior
  20%           | 3.30        | 2.67
  40%           | 1.59        | 1.56
  60%           | 0.30        | 0.20
  80%           | 0.21        | 0.13

Changing Distribution
* Corrections are effective for distributions besides normal ones.
* Changing distributions (a.k.a.
concept changes) can be easily detected; we used a two-sample test.
* When a change is detected, the old prior is dropped and a new prior is constructed.

Minimal Overheads
* Computation costs are introduced by:
  - calculating the posterior distribution, and
  - detecting changes.
* Time (in ms) for each query:

  Query    | Approximate | Posterior
  SUM      | 200         | 200
  Quantile | 1230        | 1240
  K-means  | 960         | 990

Summary
* Proposed a Bayesian quality-enhancement method for approximate aggregates in the presence of sampling.
* Our method works for ordered statistics and data-mining functions as well as for traditional aggregates, and also handles concept changes in the data streams.
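The normal-normal Bayesian update at the heart of the quality-enhancement module can be sketched in a few lines of Python. This is an illustrative sketch with hypothetical names, not the authors' code; it combines the observed answer (with variance from the error model) with the prior learned from past answers.

```python
def enhanced_answer(a_obs, var_obs, prior_mean, prior_var):
    """Bayesian quality enhancement for one aggregate answer:
    combine the observed answer a_obs (variance var_obs, from the
    error model) with the prior N(prior_mean, prior_var) learned
    from history.  Returns the improved answer and its variance."""
    w = prior_var / (prior_var + var_obs)        # weight on the observation
    post_mean = w * a_obs + (1.0 - w) * prior_mean
    post_var = prior_var * var_obs / (prior_var + var_obs)
    return post_mean, post_var

# High sampling rate -> small var_obs -> answer tracks the observation:
print(enhanced_answer(110.0, 1.0, 100.0, 100.0))    # ≈ (109.9, 0.99)
# Low sampling rate -> large var_obs -> answer leans on the history:
print(enhanced_answer(110.0, 100.0, 100.0, 1.0))    # ≈ (100.1, 0.99)
```

Note that the posterior variance σ_s²σ²/(σ_s² + σ²) is smaller than both σ_s² and σ², which is exactly the "reduced uncertainty" claim.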
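The slides say only that a two-sample test was used to detect concept changes, without naming it. The sketch below uses a Welch-style z-test on two windows of recent answers as one plausible instantiation; the threshold 1.96 and all names are my own assumptions.

```python
import math
import statistics

def concept_change(old, new, z_crit=1.96):
    """Two-sample (Welch-style) z-test on two windows of past
    answers.  Returns True when the means differ significantly,
    signalling that the old prior should be dropped and a new
    prior constructed from recent answers."""
    m1, m2 = statistics.fmean(old), statistics.fmean(new)
    v1, v2 = statistics.variance(old), statistics.variance(new)
    se = math.sqrt(v1 / len(old) + v2 / len(new))
    return abs(m1 - m2) / se > z_crit

stable = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 10.1, 10.2]
shifted = [x + 3.0 for x in stable]          # distribution jumped by +3
print(concept_change(stable, stable))        # False: same regime, keep prior
print(concept_change(stable, shifted))       # True: drop old prior
```

Running the test per window keeps the overhead small, consistent with the timing table above, since only the two windows' means and variances are needed.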