* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Slides - SEAS - University of Pennsylvania
		                    
		                    
								Survey							
                            
		                
		                
                            
                            
								Document related concepts							
                        
                        
                    
						
						
							Transcript						
					
					Introduction Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems January 10, 2005 Welcome to CIS 650, Database and Information Systems! Instructor: Zachary Ives, zives@cis  576 Levine Hall North  Office hours: Tuesday, 2:30-3:30PM (before colloquium) Home page: www.seas.upenn.edu/~zives/cis650/ Texts and readings:  Hellerstein and Stonebraker: Readings in Database Systems, 4th ed.  (Should be available soon)  Supplementary papers (will be linked via schedule) 2 Course Format and Grading Very discussion-oriented; about one topic area per week or two  Readings in the text & other research papers – summaries/commentary on papers (20%) “Midterm report” (25%)  You’ll take one of the topics we’ve discussed and write a summary and synthesis paper  Graded for organization, clarity, grammar, etc. as well as content Project (50%) -- may choose to work in teams:     Implementation Experimentation / validation Project report (should be in the style of a research paper) Brief (~15-minute) presentation Participation, discussion, intangibles (5%) At the end, you should be equipped to do research in this field, or to take ideas from databases and apply them to your field 3 So What Is This Course About? Not how to build an Oracle-driven Web site… … nor even how to build Oracle… 4 What Is Unique about Data Management?  It’s been said that databases and data management focus on scalability to huge volumes of data  What is it that makes this possible – and what makes the work interesting if NOT at huge scale?  Why are data management techniques useful in situations where scale isn’t the bottleneck? 5 The Key Principle: Data Independence  Most methods of programming don’t separate the logical and physical representations of data  The data structures, access methods, etc. are all given via interfaces!  The relational data model was the first model for data that is independent of its data structures and implementation 6 What Is Data Independence?  Codd points out that previous methods had:  Order dependence  Index dependence  Access path dependence  Still true in today’s Java/C#: what is the drawback?  What might you be able to do in removing those? 7 The Relational Data Model More than just tables!  True relations: sets of tuples  The only data representation a user/programmer “sees”  Explicit encoding of everything in values Additional integrity constraints  Key constraints, functional dependencies, … General and universal means of encoding everything!  (Semantics are pushed to queries) A secondary concept: views  Define virtual, derived relations that are always “live”  A way of encapsulating, abstracting data 8 Constraints and Normalization  Fundamental idea: we don’t want to build semantics into the data model, but we want to be able to encode certain constraints  Functional dependencies, key constraints, foreign-key constraints, multivalued dependencies, join dependencies, etc.  Allows limited data validation, plus opportunities for optimization  The theory of normalization (see CSE 330, CIS 550) makes use of known constraints  Idea: eliminate redundancy, in order to maintain consistency in the presence of updates  (Note that there’s no reason for normalization of data in views!)  Ergo, XML??? 9 Relational Completeness (Plus Extensions): Declarativity What is special about relational query languages that makes them amenable to scalability?  Limited expressiveness – particularly when we consider conjunctive queries (even with recursion)     Guaranteed polytime execution in size of data Can reason about containment, invert them, etc. “Magic sets” (What about XQuery’s Turing-completeness???)  Equivalence between relational calculus and algebra  Calculus  fully declarative, basis of query languages  Algebra  imperative but polytime, basis of runtime systems  Predictability of operations  cost models  Ability to supplement data with auxiliary structures for performance 10 Concurrency and Reliability (Generally requires full control)  Another key element of databases – ACID properties  Atomicity, Consistency, Isolation, Durability  Transaction : an atomic sequence of database actions (read/write) on data items (e.g. calendar entry)  Recoverability via a log: keeping track of all actions carried out by the database  How do distributed systems, Web services, serviceoriented architectures, and the like affect these properties? 11 Other Data Models  Concepts from the relational data model have been adapted to form object-oriented data models (with classes and subclasses), XML models, etc.  But doesn’t this result in some loss of logical-physical independence?  GMAP and answering queries using views? 12 What Is a Data Management System?  Of course, there are traditional databases  The focus of most work in the past 25 years  “Tight loops” due to locally controlled data  Indexing, transactions, concurrency, recovery, optimization  But… 13 80% of the World’s Data is Not in Databases! Examples:  Scientific data (large images, complex programs that analyze the data)  Personal data  WWW and email (some of it is stored in something resembling a DBMS)  Network traffic logs  Sensor data  Are there benefits to declarative techniques and data independence in tackling these issues?  XML is a great way to make this data available  Also need to deal with data we don’t control and can’t guarantee consistency over 14 An Example of Data Management with Heterogeneity: Data Integration “Mediated Schema” XML A layer above heterogeneous sources, to combine them under a unified logical abstraction  Some of these are databases over which we have no control  Some must be accessed in special ways  Data integration system translates queries over mediated schema to the languages of the sources; converts answers to mediated schema 15 Other Interesting Points Data streams and sensor data How do we process infinite amounts of data? Peer-to-peer architectures What’s the best way of finding data here? Personal information management Can we use integration-style concepts and a bit of AI to manage associations between our data? Web search What’s the back-end behind Google? Semantic Web How do we semantically interrelate data to build a better Web? 16 Layers of a Typical Data Management System API/GUI (Simplification!) Query Optimizer Stats Physical plan Exec. Engine Catalog Schemas Data/etc Logging, recovery Requests Access Methods Data/etc Requests Buffer Mgr Pages Pages Physical retrieval Data Red = logical Blue = physical Requests Source 17 Query Answering in a Data Management System  Based on declarative query languages  Based on restricted first-order logic expressions over relations  Not procedural – defines constraints on the output  Converted into a query plan that exploits properties; run over the data by the query optimizer and query execution engine  Data may be local or remote  Data may be heterogeneous or homogeneous  Data sources may have different interfaces, access methods, etc.  Most common query languages:  SQL (based on tuple relational calculus)  Datalog (based on domain relational calculus, plus fixpoint)  XQuery (functional language; has an XML calculus core) 18 Processing the Query Web Server / UI / etc Hash STUDENT Optimizer Takes by cid Execution Engine Merge COURSE by cid Storage Subsystem SELECT * FROM STUDENT, Takes, COURSE WHERE STUDENT.sid = Takes.sID AND Takes.cID = cid 19 DBMSs in the Real World  Big, mature relational databases  IBM, Oracle, Microsoft  “Middleware” above these  SAP, PeopleSoft, dozens of special-purpose apps  “Application servers”  Integration and warehousing systems  Current trends:  Web services; XML everywhere  Smarter, self-tuning systems  Stream systems 20 Our Agenda this Semester  Reading the canonical papers in the data management literature  Some are very systems-y  Some are very experimental  Some are highly algorithmic, complexity-oriented  Gaining an understanding of the principles of building systems to handle declarative queries over large volumes of data 21 For Next Time  Skim Codd if you haven’t already  Read the overview papers of the two first database systems:  Astrahan et al., pp. 117 Wong et al. (skip Section 2; focus on pp. 200-)  Write a summary of your assigned paper and email it to me at zives@cis  Key question: how well did this system mesh with Codd’s relational model? (You may need to skim through other aspects of your assigned paper to help answer that question) 22
 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
                                             
                                             
                                             
                                             
                                             
                                             
                                             
                                             
                                             
                                             
                                            