* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
Download Slides - UCLA Computer Science
Survey
Document related concepts
Transcript
Continuous Query Languages for DSMS
CS240B Notes
by
Carlo Zaniolo
1
CQLs for DSMS
Most of DSMS projects use SQL for continuous
queries—for good reasons, since
Many applications span data streams and DB tables
A CQL based on SQL will be easier to learn & use
Moreover: the fewer the differences the better!
But DSMS were designed for persistent data and
transient queries---not for persistent queries on
transient data
Adaptation of SQL and its enabling technology
presents many research challenges
Lack of expressive power—even worse now since
only nonblocking operators are allowed.
3
Continuous Query Graph:
many components—arbitrary DAGs
Source
σ
∑1
Sink
∑2
O2
Sink
O3
Sink
O1
Source
Source1
U
Source2
Source1
Sink
σ
∑1
Sink
∑2
Sink
U
Source2
σ
4
Relational Algebra Operators
Stored data
Selection, Projection
Union
Data Streams
... same
Union by Sort-Merging on
Join (including X) on tables
Join of Stream with table
Window joins on streams
(timestamps merged into 1 column)
Set Difference
Aggregates:
Traditional Blocking aggregates
OLAP functions on windows or
unlimited preceding
timestamps
No stream difference (blocking—diff of
stream with table OK).
Aggregates:
No blocking aggregate
OLAP functions on windows or
unlimited preceding
Slides, and tumbles.
5
Bolts and Nuts
create stream
bids(bid#, item, offer, Time)
create stream mybids as (select bid#, offer, Time
from bids
where item=bolt
union
select bid#, offer, Time
from bids
where item=nut)
Result same as: select bid#, offer, Time where
item= bolt or item=nut
6
Joins
We could create a stream called interesting bids by say joining
bids with the ‘interesting_items’ table.
We next find the bolt bids for which there was a nut bid offered
in the last 5 minutes for the same price.
create stream selfjoinbids as
(select S1.bid#, S1.offer, S2.bid#, S2.Time
from bids as S1, bids as S2 [window of 5 minutes]
where S1.item=bolt and S2.item=nut and
S1.offer=S2.offer)
The window condition implies that
S1.Time >= S2.Time and S2.Time >= S1.Time-5 minutes.
Windows on both streams are used very often.
7
Processing Union and Joins
Special techniques are needed to process unions and joins on
data streams.
The main problem are slow response while waiting to sync
multiple data streams---i.e., idle waiting
This will be discussed later—after we discuss UDAs that solve
the expressive power problem---as needed for more
complex queries, such as mining queries.
8
Relational Algebra Operators
Stored data
Data Streams
Selection, Projection
Union
... same
Union by Sort-Merging on timestamps
Join of Stream with table
Window joins on streams (timestamps
Join (including X) on tables
Set Difference
Aggregates:
merged into 1 column)
No stream difference (blocking—diff of
stream with table OK).
Aggregates:
Traditional Blocking aggregates
No blocking aggregate
OLAP functions on windows or
unlimited preceding
OLAP functions on windows or
unlimited preceding
Slides, and tumbles.
Including UDAs
9
User-Defined Aggregates:
Max Power via Min SQL Extensions
Windows (logical, physical, slides, tumbles,…):
flexible synopses that solve the blocking problem for
aggregates
DSMS only support these constructs on built-in aggregates
ESL is the first to support the complete integration of these
two
User Defined Aggregates (UDAs) —the key to power
and extensibility, and
And thus can support data mining,
XML,
sequences not supported by other DSMS
One framework for aggregates and windows, whether
they are built-ins or user-defined, and
independent on the language used to
define them.
10
Defining Traditional Aggregates
Specification consists of 3 blocks of code--- Written in an
external PL (as DBMS and other DSMS do), or
In SQL itself (SQL becomesTuring Complete!)
INITIALIZE
Executed upon the arrival of the first tuple
ITERATE
Executed upon the arrival of each subsequent tuples (an incremental
computation suitable for streams)
TERMINATE
Executed after the end of the relation/stream has been reached
Invocation:
SELECT myavg(start_price) FROM OpenAuction
11
The UDA AVG in SQL
AGGREGATE avg(Next Int) : Real
{ TABLE state(tsum Int, cnt Int);
INITIALIZE : {
INSERT INTO state VALUES (Next, 1);
}
ITERATE : {
UPDATE state
SET tsum=tsum+Next, cnt=cnt+1;
}
TERMINATE : {
INSERT INTO RETURN
SELECT tsum/cnt FROM state;
}
}
“INSERT INTO RETURN” in TERMINATE a blocking UDA
12
NonBlocking UDA: AVG of last 200 Values
AGGREGATE myavg(Next Int) : Real
{TABLE state(tsum Int, cnt Int);
INITIALIZE : {
INSERT INTO state VALUES (Next, 1);
}
ITERATE : {
UPDATE state SET tsum=tsum+Next, cnt=cnt+1;
INSERT INTO RETURN
SELECT tsum/cnt FROM state
WHERE cnt %200 =0;
UPDATE state SET tsum=Next, cnt=1
WHERE cnt %200 =1
}
TERMINATE : { }
}
Empty
TERMINATE Denotes a non-blocking UDA
13
UDAs in ESL
In ESL user-defined Aggregates (UDAs)
can be defined directly in SQL, rather
than in a PL
Native extensibility in SQL via UDAs (which can
also be defined in a PL for better performance)
No impedance mismatch
Access to DB tables from UDAs
Data Independence and optimization
Good ease of use and performance
Turing completeness & nb-completeness.
14
Data Intensive Applications & UDAs
Complex Applications can expressed concisely, with
good performance
ATLAS: a single-user DBMS developed at UCLA.
Support for SQL with UDAs
On top of Berkeley-DB record manager.
Data Mining Algorithms in ATLAS
Decision Tree Classifiers: 18 lines of codes
APriori: 40 lines of codes
Modest overhead: <50% w.r.t procedural UDA
Data Stream Applications in ESL
Data Stream Mining, approximate aggregates, sketches,
histograms, …
15
SQL:2003 OLAP Functions
Aggregates on Windows
CREATE STREAM ClosedAuction (/*auction closings */
itemID, /*id of the item in this auction.*/
Auctions
buyerID /*buyer of this item.*/)
Final price real /*final price of the item */,
Current_time) order by … source …
For
each seller, show the average selling price over the
last 10 items sold (physical window)
CREATE STREAM LastTenAvg
SELECT sellerID,
AVG(price) OVER(PARTITION BY sellerID
ROWS 9 PRECEDING),
Current_time
FROM ClosedPrice;
16
Optimizing Window AVG in ESL
•For each expired tuple decrease the count by one and the sum
by the expired value—works for logical & physical windows
WINDOW AGGREGATE avg(Next Real) : Real
{
TABLE state(tsum Int, cnt Real);
TABLE inwindow(wnext Real);
INITIALIZE : {
INSERT INTO state VALUES (Next, 1)}
ITERATE : {
UPDATE state SET tsum=tsum+Next, cnt=cnt+1;
INSERT INTO RETURN
SELECT tsum/cnt FROM state}
EXPIRE: { /*if there are expired tuples, take the oldest */
UPDATE state
SET cnt= cnt-1,
tsum = tsum – (select wnext
FROM inwindow
WHERE oldest(inwindow)) }
}
17
MAX
System maintains inwindow
Remove dominated (less & older) values
The oldest is always the max.
WINDOW AGGREGATE max (Next Real) : Real
{
TABLE inwindow(wnext real);
INITIALIZE : { etc.} /*system adds new tuples to inwindow*/
ITERATE : { DELETE FROM inwindow
WHERE wnext <Next;
INSERT INTO RETURN
SELECT wnext FROM inwindow
WHERE oldest(inwindow) }
EXPIRE: { } /*expired tuples removed automatically*/
}
18
For Each Aggregate two versions
The traditional Base aggregate with
terminate
The Window aggregate with inwindow and
expire.
These definitions will take care of both
logical and physical windows.
But there are more complications: slides and
tumbles.
19
Slides and Tumbles
Every two minutes, show the average selling price over the last
10 minutes (logical window)
CREATE STREAM LastTenAvg
SELECT sellerID,
max(price) OVER(RANGE 10 MINUTE PRECEDING
SLIDE 2 MINUTE),
Current_time
FROM ClosedPrice;
Here the window is W=10 and the slide is S=2.
Tumble: When S ≥ W
20
SLIDEs
Summary Tuples
slide/pane
window
window
The slide constructs divides a window into panes, results only
returned at the end of each pane
Algebraic Properties make slide is conducive to optimization.
Combine summaries into the desired aggregation
E.g.: MAX(1, 2, 3, 4)= MAX(MAX(1,2), MAX(3,4)) = 4
I.e., for MAX, we can perform MAX on subsets of numbers as
local summaries, then combine them together to get the true
MAX
Used for built-in aggregates in SQL 2003: but what
constructs should be used to integrate these concepts
into a language for user-defined aggregates?
21
Slides &Tumbles--Examples
Tumble – where the SLIDE size is equal or larger than
the window size
E.g. Once every 50 tuples, compute and return average over the
last 10 tuples
Easy to optimize
Skip the first 40 tuples of every 50 tuples, and compute the
blocking base version of the aggregate on the last 10
Slide – where slide size is smaller than the window size
E.g. Once every 10 tuples, compute and return average over the
last 50 tuples
Naïve implementation--not optimized
Perform incremental maintenance on every incoming tuple
Ignore RETURN statements for most incoming tuples
Only invoke RETURN once every 10 tuples
22
Pane-Based SLIDE Optimization
Two-level cascading aggregates using two existing aggregates
Perform sub-aggregation inside each pane using the base aggregate
No need for incremental maintenance here
Computed with a blocking aggregate once for each pane
Combine the summary tuples using the window aggregate that returns
on every incoming tuple (non-blocking)
With incremental maintenance here
At any time, only the last un-finished pane needs to store data tuples
all finished panes are reduced to one reusable summary tuple
Agg1 (base)
window
Agg2 (window)
window
23
Pane-based SLIDE optimization
ClosedAuction (itemID, buyerID, Final_price, Current_time)
Computing the MAX on window of 50 tuples & slide size of 10 tuples
CREATE STREAM temp AS
(SELECT itemID, max(sale_price) OVER(PARTITION BY itemID
ROWS 49 PRECEDING SLIDE 10)
FROM Auction);
This is computed as the cascade of
1.A tumble of 10 rows (returning the max of those 10 rows),
2.Followed by a max on a window of 5 rows.
24
Pane-based SLIDE optimization
SUM with window size of 50 tuples, and slide size of 10 tuples
1. First create a stream of summary tuples using base aggregate
CREATE STREAM temp AS (
SELECT itemID, max(sale_price) OVER(PARTITION BY itemID
ROWS 9 PRECEDING SLIDE 10)
AS msp
FROM Auction);
This is computed as a tumble using the base version of the UDA
2. Then apply the window version of the aggregate on the five
(4+1=5) tuples produced in 1.
SELECT itemID, window_max(msp)
OVER(PARTITION BY itemID ROWS 4 PRECEDING)
FROM temp;
25
Checkpoint
{Logical|Physical}x{tumble|slide unlimited_preceding}
Six different types of calls, supported by two
definitions
Both SQL or procedural languages can be used in the
definition.
This simple approach can be used to implement very
complex aggregations (e.g. ensemble classifiers)
Applies uniformly to logical/physical windows defined in
SQL or in an external language
26
Window UDAs vs. Base UDAs
Base UDAs:
called as traditional SQL-2 aggregates, with
optional GROUP BY
Window UDAs:
called with SQL:2003 OVER clause optional
PARTITION BY clause
logical or physical windows
Optional SLIDE clauses in ESL ca be
Clear semantics and optimization rules unify:
UDAs—SQL or PL-defined, algebraic or not …
window (logical & physical), slice, tumbles, etc.
System vs. user roles in optimization clearly
defined.
27
Window UDAs: Physical Optimization
The Stream Mill System provides efficient
support for:
Management of new & expiring tuples in buffer
Main memory & intelligent paging into disk
Events caused by tuple expiration
Users can access the buffer as the table called inwindow
28
Conclusion
Language Technology:
ESL a very powerful language for data stream and DB
applications
Simple semantics and unified syntax conforming to
SQL:2003 standards
Strong case for the DB-oriented approach to data
streams
System Technology:
Some performance-oriented techniques well-developed—
e.g., buffer management for windows
For others: work is still in progress—stay tuned for
latest news
Stream Mill is up and running: http://wis.cs.ucla.edu/stream-mill
29
*********
The End
THANK YOU !
*****
30
References
[1]ATLaS user manual. http://wis.cs.ucla.edu/atlas.
[2]SQL/LPP: A Time Series Extension of SQL Based on Limited Patience Patterns, volume 1677 of
Lecture Notes in Computer Science. Springer, 1999.
[4]A. Arasu, S. Babu, and J. Widom. An abstract semantics and concrete language for continuous queries
over streams and relations. Technical report, Stanford University, 2002.
[5]B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream
systems. In PODS, 2002.
[9]D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul,
and S. Zdonik. Monitoring streams - a new class of data management applications. In VLDB, Hong
Kong, China, 2002.
[10]J. Celko. SQL for Smarties, chapter Advanced SQL Programming. Morgan Kaufmann, 1995.
[11]S. Chandrasekaran and M. Franklin. Streaming queries over streaming data. In VLDB, 2002.
[12]J. Chen, D. J. DeWitt, F. Tian, and Y. Wang. NiagaraCQ: A scalable continuous query system for
internet databases. In SIGMOD, pages 379-390, May 2000.
[13]C. Cranor, Y. Gao, T. Johnson, V. Shkapenyuk, and O. Spatscheck. Gigascope: A stream database
for network applications. In SIGMOD Conference, pages 647-651. ACM Press, 2003.
[14]Lukasz Golab and M. Tamer Özsu. Issues in data stream management. ACM SIGMOD Record,
32(2):5-14, 2003.
[15]J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online aggregation. In SIGMOD, 1997.
[16] Yijian Bai, Hetal Thakkar, Chang Luo, Haixun Wang, Carlo Zaniolo: A Data Stream Language and
System Designed for Power and Extensibility. Proc. of the ACM 15th Conference on Information and
Knowledge Management (CIKM'06), 2006
[17] Yijian Bai, Hetal Thakkar, Haixun Wang and Carlo Zaniolo: Optimizing Timestamp Management in
Data Stream Management Systems. ICDE 2007.
31
References (Cont.)
[18] Yan-Nei Law, Haixun Wang, Carlo Zaniolo: Query Languages and Data Models for Database
Sequences and Data Streams. VLDB 2004: 492-503
[19] Sam Madden, Mehul A. Shah, Joseph M. Hellerstein, and Vijayshankar Raman. Continuously adaptive
continuous queries over streams. In SIGMOD, pages 49-61, 2002.
[20]R. Motwani, J. Widom, A. Arasu, B. Babcock, M. Datar S. Babu, G. Manku, C. Olston, J.
Rosenstein, and R. Varma. Query processing, approximation, and resource management in a data
stream management system. In First CIDR 2003 Conference, Asilomar, CA, 2003.
[21]R. Ramakrishnan, D. Donjerkovic, A. Ranganathan, K. Beyer, and M. Krishnaprasad. SRQL: Sorted
relational query language, 1998.
[23]Reza Sadri, Carlo Zaniolo, and Amir M. Zarkesh andJafar Adibi. A sequential pattern query language
for supporting instant data minining for e-services. In VLDB, pages 653-656, 2001.
[24]Reza Sadri, Carlo Zaniolo, Amir Zarkesh, and Jafar Adibi. Optimization of sequence queries in
database systems. In PODS, Santa Barbara, CA, May 2001.
[25]P. Seshadri. Predator: A resource for database research. SIGMOD Record, 27(1):16-20, 1998.
[26]P. Seshadri, M. Livny, and R. Ramakrishnan. SEQ: A model for sequence databases. In ICDE, pages
232-239, Taipei, Taiwan, March 1995.
[27]Praveen Seshadri, Miron Livny, and Raghu Ramakrishnan. Sequence query processing. In ACM
SIGMOD 1994, pages 430-441. ACM Press, 1994.
[28]M. Sullivan. Tribeca: A stream database manager for network traffic analysis. In VLDB, 1996.
[29]D. Terry, D. Goldberg, D. Nichols, and B. Oki. Continuous queries over append-only databases. In
SIGMOD, pages 321-330, 6 1992.
[30]Peter A. Tucker, David Maier, Tim Sheard, and Leonidas Fegaras. Exploiting punctuation semantics in
continuous data streams. IEEE Trans. Knowl. Data Eng, 15(3):555-568, 2003.
[31]Haixun Wang and Carlo Zaniolo. ATLaS: a native extension of SQL for data minining. In Proceedings
of Third SIAM Int. Conference on Data MIning, pages 130-141, 2003.
32