Download Slides - UCLA Computer Science

Continuous Query Languages for DSMS
CS240B Notes by Carlo Zaniolo

CQLs for DSMS
• Most DSMS projects use SQL for continuous queries, for good reasons:
  • Many applications span data streams and DB tables
  • A CQL based on SQL will be easier to learn & use
  • Moreover: the fewer the differences the better!
• But DBMS were designed for persistent data and transient queries, not for persistent queries on transient data
• Adaptation of SQL and its enabling technology presents many research challenges
• Lack of expressive power, even worse now since only nonblocking operators are allowed.

Continuous Query Graph: many components, arbitrary DAGs
[Figure: example query graphs connecting Source nodes (Source, Source1, Source2) through operators (σ, ∑1, ∑2, ∪, O1, O2, O3) to Sink nodes]

Relational Algebra Operators
Stored data:
• Selection, Projection
• Union
• Join (including ×) on tables
• Set Difference
• Aggregates:
  • Traditional blocking aggregates
  • OLAP functions on windows or unlimited preceding
Data Streams:
• Selection, Projection: same
• Union by sort-merging on timestamps
• Join of stream with table
• Window joins on streams (timestamps merged into 1 column)
• No stream difference (blocking; difference of stream with table is OK)
• Aggregates:
  • No blocking aggregate
  • OLAP functions on windows or unlimited preceding
  • Slides and tumbles

Bolts and Nuts
    create stream bids(bid#, item, offer, Time)
    create stream mybids as
        (select bid#, offer, Time from bids where item=bolt
         union
         select bid#, offer, Time from bids where item=nut)
Result same as:
    select bid#, offer, Time from bids where item=bolt or item=nut

Joins
We could create a stream called interesting bids by, say, joining bids with the ‘interesting_items’ table.
We next find the bolt bids for which there was a nut bid offered in the last 5 minutes for the same price.
    create stream selfjoinbids as
        (select S1.bid#, S1.offer, S2.bid#, S2.Time
         from bids as S1, bids as S2 [window of 5 minutes]
         where S1.item=bolt and S2.item=nut and S1.offer=S2.offer)
The window condition implies that S1.Time >= S2.Time and S2.Time >= S1.Time - 5 minutes.
Windows on both streams are used very often.

Processing Union and Joins
Special techniques are needed to process unions and joins on data streams.
The main problem is slow response while waiting to synchronize multiple data streams, i.e., idle waiting.
This will be discussed later, after we discuss UDAs, which solve the expressive power problem and are needed for more complex queries such as mining queries.

Relational Algebra Operators
Stored data:
• Selection, Projection
• Union
• Join (including ×) on tables
• Set Difference
• Aggregates:
  • Traditional blocking aggregates
  • OLAP functions on windows or unlimited preceding
Data Streams:
• Selection, Projection: same
• Union by sort-merging on timestamps
• Join of stream with table
• Window joins on streams (timestamps merged into 1 column)
• No stream difference (blocking; difference of stream with table is OK)
• Aggregates:
  • No blocking aggregate
  • OLAP functions on windows or unlimited preceding
  • Slides and tumbles
  • Including UDAs

User-Defined Aggregates: Max Power via Min SQL Extensions
• Windows (logical, physical, slides, tumbles, …): flexible synopses that solve the blocking problem for aggregates
  • Other DSMS only support these constructs on built-in aggregates
• User-Defined Aggregates (UDAs): the key to power and extensibility
• ESL is the first to support the complete integration of these two, and thus can support data mining, XML, and sequences, which are not supported by other DSMS
• One framework for aggregates and windows, whether they are built-in or user-defined, and independent of the language used to define them.
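
To make the window condition of the Joins slide concrete, here is a minimal Python sketch of the bolt/nut self-join. It is not ESL and not how Stream Mill evaluates the query: the `Bid` record, the function name, the use of seconds for `Time`, and the assumption that tuples arrive in timestamp order are all illustrative choices.

```python
from collections import deque
from typing import Iterable, Iterator, NamedTuple


class Bid(NamedTuple):
    bid_no: int
    item: str
    offer: float
    time: float  # seconds; assumed non-decreasing in arrival order


def bolt_nut_window_join(bids: Iterable[Bid], window: float = 300.0) -> Iterator[tuple]:
    """Yield (bolt bid#, offer, nut bid#, nut time) for every bolt bid that has
    a nut bid with the same offer in the preceding `window` seconds."""
    recent_nuts: deque = deque()  # nut bids still inside the window, oldest first
    for s1 in bids:
        # Enforce S2.Time >= S1.Time - window by expiring older nut bids.
        while recent_nuts and recent_nuts[0].time < s1.time - window:
            recent_nuts.popleft()
        if s1.item == "bolt":
            # Arrival order already guarantees S1.Time >= S2.Time.
            for s2 in recent_nuts:
                if s2.offer == s1.offer:
                    yield (s1.bid_no, s1.offer, s2.bid_no, s2.time)
        elif s1.item == "nut":
            recent_nuts.append(s1)


stream = [
    Bid(1, "nut", 5.0, 0),
    Bid(2, "bolt", 5.0, 100),   # matches nut bid 1 (100 s earlier)
    Bid(3, "bolt", 5.0, 400),   # nut bid 1 is now older than 5 minutes
    Bid(4, "nut", 7.0, 410),
    Bid(5, "bolt", 7.0, 700),   # matches nut bid 4 (290 s earlier)
]
matches = list(bolt_nut_window_join(stream))
```

Only the nut side needs a buffer here because the window is declared on S2; a window on both streams would keep a second buffer for bolt bids.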
Defining Traditional Aggregates
• Specification consists of 3 blocks of code
  • Written in an external PL (as DBMS and other DSMS do), or
  • In SQL itself (SQL becomes Turing complete!)
• INITIALIZE: executed upon the arrival of the first tuple
• ITERATE: executed upon the arrival of each subsequent tuple (an incremental computation suitable for streams)
• TERMINATE: executed after the end of the relation/stream has been reached
• Invocation: SELECT myavg(start_price) FROM OpenAuction

The UDA AVG in SQL
    AGGREGATE avg(Next Int) : Real
    {   TABLE state(tsum Int, cnt Int);
        INITIALIZE : { INSERT INTO state VALUES (Next, 1); }
        ITERATE : { UPDATE state SET tsum=tsum+Next, cnt=cnt+1; }
        TERMINATE : { INSERT INTO RETURN SELECT tsum/cnt FROM state; }
    }
• “INSERT INTO RETURN” in TERMINATE implies a blocking UDA

NonBlocking UDA: AVG of last 200 Values
    AGGREGATE myavg(Next Int) : Real
    {   TABLE state(tsum Int, cnt Int);
        INITIALIZE : { INSERT INTO state VALUES (Next, 1); }
        ITERATE : {
            UPDATE state SET tsum=tsum+Next, cnt=cnt+1;
            INSERT INTO RETURN SELECT tsum/cnt FROM state WHERE cnt % 200 = 0;
            UPDATE state SET tsum=Next, cnt=1 WHERE cnt % 200 = 1;
        }
        TERMINATE : { }
    }
An empty TERMINATE denotes a non-blocking UDA.

UDAs in ESL
• In ESL, user-defined aggregates (UDAs) can be defined directly in SQL, rather than in a PL
• Native extensibility in SQL via UDAs (which can also be defined in a PL for better performance)
• No impedance mismatch
• Access to DB tables from UDAs
• Data independence and optimization
• Good ease of use and performance
• Turing completeness & nb-completeness.

Data Intensive Applications & UDAs
• Complex applications can be expressed concisely, with good performance
• ATLAS: a single-user DBMS developed at UCLA
  • Support for SQL with UDAs
  • On top of Berkeley-DB record manager
• Data Mining Algorithms in ATLAS
  • Decision Tree Classifiers: 18 lines of code
  • APriori: 40 lines of code
  • Modest overhead: <50% w.r.t. procedural UDA
• Data Stream Applications in ESL
  • Data stream mining, approximate aggregates, sketches, histograms, …

SQL:2003 OLAP Functions
Aggregates on Windows
    CREATE STREAM ClosedAuction (   /* auction closings */
        itemID,                     /* id of the item in this auction */
        buyerID,                    /* buyer of this item */
        Final_price real,           /* final price of the item */
        Current_time)
    order by … source …
For each seller, show the average selling price over the last 10 items sold (physical window):
    CREATE STREAM LastTenAvg AS
        SELECT sellerID, AVG(price) OVER(PARTITION BY sellerID ROWS 9 PRECEDING), Current_time
        FROM ClosedPrice;

Optimizing Window AVG in ESL
• For each expired tuple, decrease the count by one and the sum by the expired value (works for logical & physical windows)
    WINDOW AGGREGATE avg(Next Real) : Real
    {   TABLE state(tsum Real, cnt Int);
        TABLE inwindow(wnext Real);
        INITIALIZE : { INSERT INTO state VALUES (Next, 1); }
        ITERATE : {
            UPDATE state SET tsum=tsum+Next, cnt=cnt+1;
            INSERT INTO RETURN SELECT tsum/cnt FROM state;
        }
        EXPIRE : {  /* if there are expired tuples, take the oldest */
            UPDATE state SET cnt=cnt-1,
                tsum = tsum - (SELECT wnext FROM inwindow WHERE oldest(inwindow));
        }
    }

MAX
• System maintains inwindow
• Remove dominated (less & older) values
• The oldest is always the max.
    WINDOW AGGREGATE max(Next Real) : Real
    {   TABLE inwindow(wnext Real);
        INITIALIZE : { etc. }   /* system adds new tuples to inwindow */
        ITERATE : {
            DELETE FROM inwindow WHERE wnext < Next;
            INSERT INTO RETURN SELECT wnext FROM inwindow WHERE oldest(inwindow);
        }
        EXPIRE : { }            /* expired tuples removed automatically */
    }

For Each Aggregate, Two Versions
• The traditional base aggregate, with TERMINATE
• The window aggregate, with inwindow and EXPIRE
These definitions take care of both logical and physical windows.
But there are more complications: slides and tumbles.
Slides and Tumbles
Every two minutes, show the average selling price over the last 10 minutes (logical window):
    CREATE STREAM LastTenAvg AS
        SELECT sellerID, AVG(price) OVER(RANGE 10 MINUTE PRECEDING SLIDE 2 MINUTE), Current_time
        FROM ClosedPrice;
Here the window is W=10 and the slide is S=2.
Tumble: when S ≥ W

SLIDEs
[Figure: windows divided into slides/panes, each pane reduced to a summary tuple; labels: Summary Tuples, slide/pane, window]
• The slide construct divides a window into panes; results are only returned at the end of each pane
• Algebraic properties make slides conducive to optimization
  • Combine summaries into the desired aggregation
  • E.g.: MAX(1, 2, 3, 4) = MAX(MAX(1,2), MAX(3,4)) = 4
  • I.e., for MAX, we can perform MAX on subsets of numbers as local summaries, then combine them to get the true MAX
• Used for built-in aggregates in SQL:2003, but what constructs should be used to integrate these concepts into a language for user-defined aggregates?

Slides & Tumbles: Examples
• Tumble: where the SLIDE size is equal to or larger than the window size
  • E.g., once every 50 tuples, compute and return the average over the last 10 tuples
  • Easy to optimize: skip the first 40 tuples of every 50 tuples, and compute the blocking base version of the aggregate on the last 10
• Slide: where the slide size is smaller than the window size
  • E.g., once every 10 tuples, compute and return the average over the last 50 tuples
  • Naïve implementation, not optimized:
    • Perform incremental maintenance on every incoming tuple
    • Ignore RETURN statements for most incoming tuples
    • Only invoke RETURN once every 10 tuples

Pane-Based SLIDE Optimization
• Two-level cascading aggregates using two existing aggregates
• Perform sub-aggregation inside each pane using the base aggregate
  • No need for incremental maintenance here
  • Computed with a blocking aggregate once for each pane
• Combine the summary tuples using the window aggregate that returns on every incoming tuple (non-blocking)
  • With incremental maintenance here
• At any time, only the last unfinished pane needs to store data tuples; all finished panes are reduced to one reusable summary tuple
[Figure: Agg1 (base) applied within each pane, Agg2 (window) applied over the resulting summary tuples]

Pane-Based SLIDE Optimization
ClosedAuction (itemID, buyerID, Final_price, Current_time)
Computing the MAX on a window of 50 tuples with a slide size of 10 tuples:
    CREATE STREAM temp AS
        (SELECT itemID, max(sale_price) OVER(PARTITION BY itemID ROWS 49 PRECEDING SLIDE 10)
         FROM Auction);
This is computed as the cascade of
1. A tumble of 10 rows (returning the max of those 10 rows),
2. Followed by a max on a window of 5 rows.

Pane-Based SLIDE Optimization
MAX with a window size of 50 tuples and a slide size of 10 tuples
1. First create a stream of summary tuples using the base aggregate:
    CREATE STREAM temp AS
        (SELECT itemID, max(sale_price) OVER(PARTITION BY itemID ROWS 9 PRECEDING SLIDE 10) AS msp
         FROM Auction);
   This is computed as a tumble using the base version of the UDA.
2. Then apply the window version of the aggregate on the five (4+1=5) tuples produced in 1:
    SELECT itemID, window_max(msp) OVER(PARTITION BY itemID ROWS 4 PRECEDING)
    FROM temp;

Checkpoint
{Logical|Physical} x {tumble|slide|unlimited_preceding}
• Six different types of calls, supported by two definitions
• Either SQL or procedural languages can be used in the definition.
• This simple approach can be used to implement very complex aggregations (e.g., ensemble classifiers)
• Applies uniformly to logical/physical windows defined in SQL or in an external language

Window UDAs vs. Base UDAs
• Base UDAs: called as traditional SQL-2 aggregates, with optional GROUP BY
• Window UDAs: called with the SQL:2003 OVER clause
  • Optional PARTITION BY clause
  • Logical or physical windows
  • Optional SLIDE clauses in ESL
• Clear semantics and optimization rules unify:
  • UDAs: SQL or PL-defined, algebraic or not, …
  • Windows (logical & physical), slides, tumbles, etc.
• System vs. user roles in optimization are clearly defined.

Window UDAs: Physical Optimization
• The Stream Mill System provides efficient support for:
  • Management of new & expiring tuples in the buffer
  • Main memory & intelligent paging to disk
  • Events caused by tuple expiration
• Users can access the buffer as the table called inwindow

Conclusion
• Language Technology:
  • ESL is a very powerful language for data stream and DB applications
  • Simple semantics and unified syntax conforming to SQL:2003 standards
  • Strong case for the DB-oriented approach to data streams
• System Technology:
  • Some performance-oriented techniques are well developed, e.g., buffer management for windows
  • For others, work is still in progress; stay tuned for the latest news
• Stream Mill is up and running: http://wis.cs.ucla.edu/stream-mill
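
As a worked illustration of the pane-based SLIDE optimization from the slides above, the following Python sketch cascades a per-pane MAX with a MAX over the most recent pane summaries. It is a simplified stand-in, not the Stream Mill implementation: the function name and parameters are hypothetical, there is no PARTITION BY, and the window size is assumed to be a multiple of the slide size, as in the 50/10 example.

```python
from collections import deque
from typing import Iterable, Iterator, Optional


def pane_based_max(values: Iterable[float], window: int = 50, slide: int = 10) -> Iterator[float]:
    """Every `slide` tuples, yield the MAX over the last `window` tuples."""
    panes_per_window = window // slide
    # One summary tuple per finished pane; the oldest expires automatically.
    summaries: deque = deque(maxlen=panes_per_window)
    pane_max: Optional[float] = None  # state of the base aggregate for the open pane
    in_pane = 0
    for v in values:
        pane_max = v if pane_max is None else max(pane_max, v)
        in_pane += 1
        if in_pane == slide:
            # Pane finished: emit its summary (the base aggregate's TERMINATE).
            summaries.append(pane_max)
            # Window aggregate over the few summaries, not over `window` raw tuples.
            yield max(summaries)
            pane_max, in_pane = None, 0


data = [3, 1, 2, 9, 0, 0, 1, 1, 1, 2, 2, 2]
results = list(pane_based_max(data, window=6, slide=3))
```

With a window of 6 and a slide of 3, each result combines at most two pane summaries, mirroring the "max on a window of 5 rows" step of the 50/10 example.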
