Compiler Concepts for Database Systems
CSE 4100
Prof. Steven A. Demurjian
Computer Science & Engineering Department
The University of Connecticut
371 Fairfield Way, Unit 2155
Storrs, CT 06269-3155
steve@engr.uconn.edu
http://www.engr.uconn.edu/~steve
(860) 486 - 4818
CH10.1
Overview
Motivation and Background
Database System Architecture
Exploring its Capabilities
Focusing on Compiler-Related Concepts
Compile Time Issues in Database Systems
The SQL Query Language
Optimization Issues in Database Systems
Typing
Runtime Issues in Database Systems
Transaction Processing
Execution for Complex Joins
CH10.2
Database System Architecture
What are the Various Components?
How do they Relate to Compilers?
CH10.3
How Does it Compare to Java Environment?
CH10.4
Database Concepts - Summary
Schema vs. Data
Database: Structured Collection of Data Describing
Objects of the Universe of Discourse Being Modeled.
A Database Consists of Schema and Data
Schema: Describes the Intension (Type) of Objects
Data: Describes the Extension (Instances) of Objects
What is Schema w.r.t. Compilers? What is Data?
CH10.5
What is a DBMS?
A Database Management System (DBMS) is the
Generalized Tool that Facilitates the Management of and
Access to the Database
Main Functions:
Defining a Database: Specifying Data Types,
Structures, and Constraints
Constructing a Database: the Process of Storing the
Data Itself on Some Storage Medium
Manipulating a Database: Function for Querying
Specific Data in the Database and Updating the
Database
What are the Analogies of Each of the Main Functions
w.r.t. Programming Languages and Compilers?
CH10.6
What is a DBMS?
Additional Functions:
Interaction with File Manager
So that Details Related to Data Storage and Access are
Removed From Application Programs
Integrity Enforcement
Guarantee Correctness, Validity, Consistency
Security Enforcement
Prevent Data From Illegal Uses
Concurrency Control
Control the Interference Between Concurrent Programs
Recovery from Failure
Query Processing and Optimization
Again – What are Relevant Compiler Concepts?
CH10.7
DBMS Architecture
DBMS Languages
Data Definition Language (DDL)
Data Manipulation Language (DML)
From Embedded Queries or DB Commands Within a
Program
“Stand-alone” Query Language
Host Language:
DML Specification (e.g., SQL) is Embedded in a
“Host” Programming Language (e.g., Java, C++)
DBMS Interfaces
Menu-Based Interface
Graphical Interface
Forms-Based Interface
Interface for DBA (DB Administrator)
CH10.8
DBMS Architecture
Main DBMS Modules
DDL Compiler
DML Compiler
Ad-hoc (Interactive) Query Compiler
Run-time Database Processor
Stored Data Manager
Concurrency/Back-Up/Recovery Subsystem
DBMS Utility Modules
Loading Routines
Backup Utility
System Catalog/data Dictionary
CH10.9
Components of a DBMS
CH10.10
ANSI/SPARC - Three Schema Architecture
External Data Schema (Users’ view)
Conceptual Data Schema (Logical Schema)
Internal Data Schema (Physical Schema)
What are the Programming Language Analogies?
CH10.11
Conceptual Schema
Describes the Meaning of Data in the Universe of
Discourse
Emphasizes General, Conceptually Relevant,
and Often Time-Invariant Structural Aspects of the
Universe of Discourse
Excludes the Physical Organization and Access
Aspects of the Data
This could be a UML Design that Realizes a Set of
Classes (no data) or Java Class Declarations (APIs)
CH10.12
Conceptual Schema
Another Example – A Programming Language Level
Definition
CH10.13
External Schema
Describes Parts of the Information in the Conceptual
Schema in a form Convenient to a Particular User
Group’s View
Derived from the Conceptual Schema
What is the View of the Outside World in OO?
Akin to Public Interface
CH10.14
External Schema
Another Example
CH10.15
Internal Schema
Describes How the Information Described in the
Conceptual Schema is Physically Represented in a
Database to Provide the Overall Best Performance
CH10.16
Internal Schema
Another Example
This Corresponds to Data Typing and Layout in
Compilers from Runtime Environment!
CH10.17
Unified Example of Three Schemas
CH10.18
Database Access Process
What Does This Access Process Resemble?
Akin to Runtime Execution Environment!
A More Complex Activation Process!
CH10.19
Metadata vs. Data
Recall Introspection and Reflection in Java where you
Can “Look” into the Class Definitions Themselves!
CH10.20
Data Independence
Ability of Application Programs to Remain Unaffected
by Changes in Irrelevant Parts of the
Conceptual Data Representation, Data Storage
Structure, and Data Access Methods
Invisibility (Transparency) of the Details of Entire
Database Organization, Storage Structure and Access
Strategy to the Users
Recall Software Engineering Concepts:
Abstraction: the Details of an Application's
Components Can Be Hidden, Providing a Broad
Perspective on the Design
Representation Independence: Changes Can Be
Made to the Implementation that have No Impact
on the Interface and Its Users
Realized in Today’s Modern PLs!
CH10.21
What are System Components?
How are these Similar to Compiler/PL Concepts?
CH10.22
Relational Model
Relational Model of Data Based on the Concept of a
Relation
Relation - a Mathematical Concept Based on Sets
Strength of the Relational Approach to Data
Management Comes From the Formal Foundation
Provided by the Theory of Relations
RELATION: A Table of Values
A Relation May Be Thought of as a Set of Rows
A Relation May Alternately be Thought of as a Set
of Columns
Each Row of the Relation May Be Given an
Identifier
Each Column Typically is Called by its Column
Name or Column Header or Attribute Name
CH10.23
Relational Tables - Rows/Columns/Tuples
CH10.24
Relational Database Definition
CREATE TABLE Student:
Name(CHAR(30)), SSN(CHAR(9)), Gpa(FLOAT(2))
CREATE TABLE Faculty:
Name(CHAR(30)), SSN(CHAR(9)), Ophone(CHAR(7))
CREATE TABLE Courses:
Course#(CHAR(6)), Title(CHAR(20)), Descrip(CHAR(100)),
PCourse#(CHAR(6))
CREATE TABLE Formats:
Section#(INTEGER(3)), Quarter(CHAR(10)), Campus(CHAR(15))
CREATE TABLE TakeorTeach:
SSN(CHAR(9)), Course#(CHAR(6)), Section#(INTEGER(3))
CREATE TABLE COfferings:
Course#(CHAR(6)), Section#(INTEGER(3))
Student(Name*, SSN, Gpa)
Faculty(Name*, SSN, Ophone)
Courses(Course#*, Title, Descrip, PCourse#*)
Formats(Section#*, Quarter, Campus)
TakeorTeach(SSN, Course#, Section#)
COfferings(Course#, Section#)
CH10.25
Relational Views
Two Views Derived From Prior Tables
Student Transcript View
Course Prerequisite View
CH10.26
SQL: Tuple Relational Calculus-Based
SQL is a Partial Example of a Tuple Relational
Language
Simple Queries are all Declarative
More Complex Queries are both Declarative and
Procedural (e.g., joins, nested queries)
Find the names of employees working on the CAD/CAM
project
SELECT EMP.ENAME
FROM EMP, WORKS, PROJ
WHERE (EMP.ENO = WORKS.ENO)
AND (WORKS.PNO = PROJ.PNO)
AND (PROJ.PNAME = 'CAD/CAM')
SQL Defines a Programming Language and Associated
Semantics for Usage and Processing
CH10.27
SQL Components
Data Definition Language (DDL)
For External and Conceptual Schemas
Views - DDL for External Schemas
Data Manipulation Language (DML)
Interactive DML Against External and Conceptual
Schemas
Embedded DML in Host PLs (EQL, JDBC, etc.)
Note: Separation of Definition (DDL) from Usage
(DML) – Is there Something Similar in PLs?
Others
Integrity (Allowable Values/Referential)
Transaction Control (Long-Duration and Batch)
Authorization (Who can Do What When)
CH10.28
SQL DDL and DML
Data Definition Language (DDL) - Declarations
Defining the Relational Schema - Relations,
Attributes, Domains - The Meta-Data
CREATE TABLE Student:
Name(CHAR(30)),SSN(CHAR(9)),GPA(FLOAT(2))
CREATE TABLE Courses:
Course#(CHAR(6)), Title(CHAR(20)),
Descrip(CHAR(100)), PCourse#(CHAR(6))
Data Manipulation Language (DML) - Code
Defining the Queries Against the Schema
SELECT Name, SSN
From Student
Where GPA > 3.00
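The DDL/DML split above can be tried end to end with a small sketch. It uses Python's built-in `sqlite3` module as a stand-in DBMS (an assumption; the slides do not name one), the function name `ddl_then_dml` is hypothetical, and the two sample rows are invented for illustration.

```python
import sqlite3

def ddl_then_dml():
    conn = sqlite3.connect(":memory:")
    # DDL: declares the schema (the meta-data); no rows exist yet.
    conn.execute("CREATE TABLE Student (Name CHAR(30), SSN CHAR(9), GPA FLOAT)")
    # Hypothetical sample rows, invented for this sketch.
    conn.executemany(
        "INSERT INTO Student VALUES (?, ?, ?)",
        [("Ann", "111111111", 3.5), ("Bob", "222222222", 2.9)],
    )
    # DML: a query written against the declared schema.
    rows = conn.execute(
        "SELECT Name, SSN FROM Student WHERE GPA > 3.00"
    ).fetchall()
    conn.close()
    return rows

print(ddl_then_dml())
```

The declaration plays the role of a type definition, and the query plays the role of code that uses that type.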
CH10.29
Data Definition Language - DDL
A Pre-Defined set of Primitive Types
Numeric
Character-string
Bit-string
Additional Types
Defining Domains
Defining Schema
Defining Tables
Defining Views
Note: Each DBMS May have their Own DBMS
Specific Data Types - Is this Good or Bad?
What is this Similar to re. Different C++ Compilers?
These are Akin to PL Data Types!
CH10.30
DDL - Primitive Types
Numeric
INTEGER (or INT), SMALLINT
REAL, DOUBLE PRECISION
FLOAT(N) Floating Point with at Least N Digits
DECIMAL(P,D) (DEC(P,D) or NUMERIC(P,D))
have P Total Digits with D to Right of Decimal
Note that INTs and REALs are Machine Dependent
(Based on Hardware/OS Platform)
Again – this is Similar to PLs/Compilers and Code
Generation – Data Layout
CH10.31
DDL - Primitive Types
Character-String
CHAR(N) or CHARACTER(N) - Fixed
VARCHAR(N), CHAR VARYING(N), or
CHARACTER VARYING(N)
Variable with at Most N Characters
Bit-Strings
BIT(N) Fixed
VARBIT(N) or BIT VARYING(N)
Variable with at Most N Bits
CH10.32
DDL - Primitive Types
These Specialized Primitive Types are Used to:
Simplify Modeling Process
Include “Popular” Types
Reduce Composite Attributes/Programming
DATE : YYYY-MM-DD
TIME: HH:MM:SS
TIME(I): HH:MM:SS.F...F - I Fractional Seconds Digits
TIME WITH TIME ZONE: HH:MM:SS±HH:MM
TIMESTAMP:
YYYY-MM-DD HH:MM:SS.F...F{±HH:MM}
PLs also have Specialized Types!
Problem: Different Database Systems Sometimes
Implement these Types very Differently
This Impacts Portability!
CH10.33
What is a SQL Schema?
A Schema in SQL is the Major Meta-Data Construct
Supports the Definition of:
Relation - Table with Name
Attributes - Columns and their Types
Identification - Primary Key
Constraints - Referential Integrity (FK)
Two Part Definition
CREATE Schema - Named Database or
Conceptually Related Tables
CREATE Table - Individual Tables of the Schema
CH10.34
DDL-Create/Drop a Schema
Creating a Schema:
CREATE SCHEMA MY_COMPANY AUTHORIZATION
Demurjian;
Schema MY_COMPANY has Been Created and is
Owned by the User “Demurjian”
Tables can now be Created and Added to Schema
Dropping a Schema:
DROP SCHEMA MY_COMPANY RESTRICT;
DROP SCHEMA MY_COMPANY CASCADE;
Restrict:
Drop Operation Fails If Schema is Not Empty
Cascade:
Drop Operation Removes Everything in the Schema
CH10.35
DDL - Create Tables
CREATE TABLE EMPLOYEE
( FNAME VARCHAR(15) NOT NULL ,
  MINIT CHAR ,
  LNAME VARCHAR(15) NOT NULL ,
  SSN CHAR(9) NOT NULL ,
  BDATE DATE ,
  ADDRESS VARCHAR(30) ,
  SEX CHAR ,
  SALARY DECIMAL(10,2) ,
  SUPERSSN CHAR(9) ,
  DNO INT NOT NULL ,
  PRIMARY KEY (SSN) ,
  FOREIGN KEY (SUPERSSN) REFERENCES EMPLOYEE(SSN) ,
  FOREIGN KEY (DNO) REFERENCES DEPARTMENT(DNUMBER) ) ;
CH10.36
DDL - Create Tables (continued)
CREATE TABLE DEPARTMENT
( DNAME VARCHAR(15) NOT NULL ,
DNUMBER INT NOT NULL ,
MGRSSN CHAR(9) NOT NULL ,
MGRSTARTDATE DATE ,
PRIMARY KEY (DNUMBER) ,
UNIQUE (DNAME) ,
FOREIGN KEY (MGRSSN)
REFERENCES EMPLOYEE(SSN) ) ;
CREATE TABLE DEPT_LOCATIONS
(DNUMBER INT NOT NULL ,
DLOCATION VARCHAR(15) NOT NULL ,
PRIMARY KEY (DNUMBER, DLOCATION) ,
FOREIGN KEY (DNUMBER)
REFERENCES DEPARTMENT(DNUMBER) ) ;
CH10.37
DDL - Create Tables (continued)
CREATE TABLE PROJECT
(PNAME VARCHAR(15) NOT NULL ,
PNUMBER INT NOT NULL ,
PLOCATION VARCHAR(15) ,
DNUM INT NOT NULL ,
PRIMARY KEY (PNUMBER) , UNIQUE (PNAME) ,
FOREIGN KEY (DNUM)
REFERENCES DEPARTMENT(DNUMBER) ) ;
CREATE TABLE WORKS_ON
(ESSN CHAR(9) NOT NULL , PNO INT NOT NULL ,
HOURS DECIMAL(3,1) NOT NULL ,
PRIMARY KEY (ESSN, PNO) ,
FOREIGN KEY (ESSN)
REFERENCES EMPLOYEE(SSN) ,
FOREIGN KEY (PNO)
REFERENCES PROJECT(PNUMBER) ) ;
CH10.38
DDL - Create Tables with Constraints
CREATE TABLE EMPLOYEE
(...,
DNO INT NOT NULL DEFAULT 1,
CONSTRAINT EMPPK
PRIMARY KEY (SSN) ,
CONSTRAINT EMPSUPERFK
FOREIGN KEY (SUPERSSN) REFERENCES EMPLOYEE(SSN)
ON DELETE SET NULL ON UPDATE CASCADE ,
CONSTRAINT EMPDEPTFK
FOREIGN KEY (DNO) REFERENCES DEPARTMENT(DNUMBER)
ON DELETE SET DEFAULT ON UPDATE CASCADE );
CH10.39
DDL - Create Tables with Constraints
CREATE TABLE DEPARTMENT
(...,
MGRSSN CHAR(9) NOT NULL
DEFAULT '888665555' ,
...,
CONSTRAINT DEPTPK
PRIMARY KEY (DNUMBER) ,
CONSTRAINT DEPTSK
UNIQUE (DNAME),
CONSTRAINT DEPTMGRFK
FOREIGN KEY (MGRSSN)
REFERENCES EMPLOYEE(SSN)
ON DELETE SET DEFAULT
ON UPDATE CASCADE );
Is there an Equivalent to Keys and Constraints in PLs?
What Does Java Have Internally?
Constraints Facilitate Type Checking at Data Level!
CH10.40
Data Manipulation Language - DML
SQL has the SELECT Statement for Retrieving Info.
from a Database (Not Relational Algebra Select)
SQL vs. Formal Relational Model
SQL Allows a Table (Relation) to have Two or
More Identical Tuples in All Their Attribute Values
Hence, an SQL Table is a Multi-set (Sometimes
Called a Bag) of Tuples; it is Not a Set of Tuples
SQL Relations Can Be Constrained to Sets by
PRIMARY KEY or UNIQUE Attributes
Using the DISTINCT Option in a Query
Implied Processing and Procedural Semantics
SQL Queries have Specific Semantics
These Semantics Dictate Processing
Includes Code Generation, Optimization, etc.
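The bag-versus-set distinction above can be observed directly. The following is a minimal sketch using Python's built-in `sqlite3` module as a stand-in DBMS; the table `WORKS`, its rows, and the function name `bag_vs_set` are assumptions made for illustration.

```python
import sqlite3

def bag_vs_set():
    conn = sqlite3.connect(":memory:")
    # No PRIMARY KEY or UNIQUE constraint, so duplicate tuples are allowed.
    conn.execute("CREATE TABLE WORKS (ENO INT, PNO INT)")
    conn.executemany(
        "INSERT INTO WORKS VALUES (?, ?)", [(1, 10), (1, 10), (2, 10)]
    )
    # Plain SELECT returns the multi-set, duplicates included.
    bag = conn.execute(
        "SELECT ENO, PNO FROM WORKS ORDER BY ENO"
    ).fetchall()
    # DISTINCT collapses the result to a set of tuples.
    as_set = conn.execute(
        "SELECT DISTINCT ENO, PNO FROM WORKS ORDER BY ENO"
    ).fetchall()
    conn.close()
    return bag, as_set

print(bag_vs_set())
```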
CH10.41
Interactive DML - Main Components
Select-from-where Statement Contains:
Select Clause - Chosen Attributes/Columns
From Clause - Involved Tables
Where Clause - Constrain Tuple Values
Tuple Variables - Distinguish Among Same Names
in Different Tables
String Matching - Detailed Matching Including
Exact
Starts With
Near
Ordering of Rows - Sorting Tuple Results
CH10.42
Recall Prior Schema
CH10.43
…and Corresponding DB Tables
Which Represent Tuples/Instances of Each Relation
CH10.44
…and Corresponding DB Tables
CH10.45
Simple SQL Queries
Query 0: Retrieve the Birthdate and Address of the
Employee whose Name is 'John B. Smith'.
SELECT BDATE, ADDRESS
FROM EMPLOYEE
WHERE FNAME='John' AND MINIT='B'
AND LNAME='Smith'
Which Row(s) are Selected?
Note: While All of these Next Queries are from
Chapter 8, Some are From “Earlier” Edition
CH10.46
Simple SQL Queries
Query 1: Retrieve Name and Address of all Employees
who work for the 'Research' Department
SELECT FNAME, MINIT, LNAME, ADDRESS, DNAME
FROM EMPLOYEE, DEPARTMENT
WHERE DNAME='Research' AND DNUMBER=DNO
What Action is Being Performed? Join! Cartesian
Product!
CH10.47
Simple SQL Queries - Result
Theta Join on DNO=DNUMBER
CH10.48
Simple SQL Queries
Query 2: For Every Project in 'Stafford', list the Project
Number, the Controlling Dept. Number, and the Dept.
Manager's Last Name, Address, and Birthdate
SELECT PNUMBER, DNUM, LNAME, BDATE,ADDRESS
FROM PROJECT, DEPARTMENT, EMPLOYEE
WHERE DNUM=DNUMBER AND MGRSSN=SSN AND
PLOCATION='Stafford'
In Q2, there are Two Join Conditions:
The Join Condition DNUM=DNUMBER Relates a
Project to its Controlling Department
The Join Condition MGRSSN=SSN Relates the
Controlling Department to the Employee who
Manages that Department
CH10.49
Query Results
SELECT PNUMBER, DNUM, LNAME, BDATE,ADDRESS
FROM PROJECT, DEPARTMENT, EMPLOYEE
WHERE DNUM=DNUMBER AND MGRSSN=SSN AND
PLOCATION='Stafford'
CH10.50
Qualification of Attributes
In SQL, the Same Name for Two (or More) Attributes
is Allowed if Attributes are in Different Relations
In Those Cases, Query Must Qualify by Prefixing the
Relation Name to the Attribute Name
EMPLOYEE.LNAME, DEPARTMENT.DNAME
Aliases: When Queries Must Refer to the Same
Relation Twice
Alias is Akin to a Variable – Reference in PL!
In These Situations, it is Considered that there are
Two Different Copies of the Same Relation
Let’s See Examples of Both Concepts
CH10.51
Attribute Qualification
Query 8: For Each Employee, Retrieve the Employee's
Name, and Name of his or her Immediate Supervisor
SELECT E.FNAME, E.LNAME, S.FNAME, S.LNAME
FROM EMPLOYEE E, EMPLOYEE S
WHERE E.SUPERSSN=S.SSN
E and S are aliases for the EMPLOYEE relation
E Represents Employees in the Role of Supervisees
S Represents Employees in the Role of Supervisor
Another Form of Query 8 is:
SELECT E.FNAME, E.LNAME, S.FNAME, S.LNAME
FROM EMPLOYEE AS E, EMPLOYEE AS S
WHERE E.SUPERSSN=S.SSN
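A self-join with two aliases can be run as a quick check. This is a sketch only: it uses Python's built-in `sqlite3` module as a stand-in DBMS, a cut-down `EMPLOYEE` table with just the four columns the query needs, and two sample rows chosen for illustration; the function name `supervisors` is hypothetical.

```python
import sqlite3

def supervisors():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE EMPLOYEE (FNAME TEXT, LNAME TEXT, SSN TEXT, SUPERSSN TEXT)"
    )
    conn.executemany(
        "INSERT INTO EMPLOYEE VALUES (?, ?, ?, ?)",
        [
            ("James", "Borg", "888665555", None),            # no supervisor
            ("Franklin", "Wong", "333445555", "888665555"),  # supervised by Borg
        ],
    )
    # E and S behave like two variables ranging over the same relation.
    rows = conn.execute(
        "SELECT E.FNAME, E.LNAME, S.FNAME, S.LNAME "
        "FROM EMPLOYEE AS E, EMPLOYEE AS S "
        "WHERE E.SUPERSSN = S.SSN"
    ).fetchall()
    conn.close()
    return rows

print(supervisors())
```

The employee without a supervisor does not appear, because no `S` tuple satisfies the join condition for a NULL `SUPERSSN`.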
CH10.52
Query Results
SELECT E.FNAME, E.LNAME, S.FNAME, S.LNAME
FROM EMPLOYEE AS E, EMPLOYEE AS S
WHERE E.SUPERSSN=S.SSN
CH10.53
Nested Queries
SQL SELECT Nested Query is Specified within
WHERE-clause of another Query (the Outer Query)
Query 1A: Retrieve the Name and Address of all
Employees who Work for the 'Research' Department
SELECT FNAME, LNAME, ADDRESS
FROM EMPLOYEE
WHERE DNO IN
(SELECT DNUMBER
FROM DEPARTMENT
WHERE DNAME='Research' )
Note: This Reformulates Earlier Query 1
The End Result is Essentially:
Outer and Inner For/While Loops!
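The "outer and inner loops" reading of Query 1A can be written out literally. The sketch below is a naive evaluation over in-memory lists, not how a real optimizer would necessarily execute the query; the sample tuples and the function name `query_1a` are assumptions for illustration.

```python
# Hypothetical in-memory tuples standing in for the two relations.
DEPARTMENT = [
    {"DNAME": "Research", "DNUMBER": 5},
    {"DNAME": "Administration", "DNUMBER": 4},
]
EMPLOYEE = [
    {"FNAME": "John", "LNAME": "Smith", "ADDRESS": "Houston, TX", "DNO": 5},
    {"FNAME": "Alicia", "LNAME": "Zelaya", "ADDRESS": "Spring, TX", "DNO": 4},
]

def query_1a(employees, departments):
    result = []
    for e in employees:            # outer query: scan EMPLOYEE
        for d in departments:      # inner query: scan DEPARTMENT
            if d["DNAME"] == "Research" and e["DNO"] == d["DNUMBER"]:
                result.append((e["FNAME"], e["LNAME"], e["ADDRESS"]))
                break              # IN needs only one match per employee
    return result

print(query_1a(EMPLOYEE, DEPARTMENT))
```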
CH10.54
How Does Nested Query Work?
The Nested Query Selects Number of 'Research' Dept.
The Outer Query Selects an EMPLOYEE Tuple If Its
DNO Value Is in the Result of the Nested Query
IN represents Set Inclusion of Result Set
We Can Have Several Levels of Nested Queries
SELECT FNAME, LNAME, ADDRESS
FROM EMPLOYEE
WHERE DNO IN
(SELECT DNUMBER
FROM DEPARTMENT
WHERE DNAME='Research' )
CH10.55
NULLS in SQL Queries
SQL Allows Queries that Check if a value is NULL
(Missing or Undefined or not Applicable)
SQL uses IS NULL or IS NOT NULL to Test for NULLs since it
Considers each NULL value Distinct from other NULL
Values, so Equality Comparison is not Appropriate
Query 18: Retrieve the names of all employees who do
not have supervisors.
SELECT FNAME, LNAME
FROM EMPLOYEE
WHERE SUPERSSN IS NULL
Why Would Such a Capability be Useful?
Downloading/Crossloading a Database
Promoting an Attribute to PK/FK
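Why equality comparison is not appropriate for NULLs can be seen by running both forms side by side. This is a minimal sketch using Python's built-in `sqlite3` module as a stand-in DBMS; the reduced two-column `EMPLOYEE` table, its rows, and the function name `null_comparisons` are assumptions for illustration.

```python
import sqlite3

def null_comparisons():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE EMPLOYEE (LNAME TEXT, SUPERSSN TEXT)")
    conn.executemany(
        "INSERT INTO EMPLOYEE VALUES (?, ?)",
        [("Borg", None), ("Wong", "888665555")],
    )
    # '= NULL' never evaluates to true, so no row qualifies.
    with_equals = conn.execute(
        "SELECT LNAME FROM EMPLOYEE WHERE SUPERSSN = NULL"
    ).fetchall()
    # 'IS NULL' is the test that actually finds missing values.
    with_is = conn.execute(
        "SELECT LNAME FROM EMPLOYEE WHERE SUPERSSN IS NULL"
    ).fetchall()
    conn.close()
    return with_equals, with_is

print(null_comparisons())
```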
CH10.56
Aggregate Functions in SQL Queries
Query 19: Find Maximum Salary, Minimum Salary,
and Average Salary among all Employees
SELECT MAX(SALARY), MIN(SALARY), AVG(SALARY)
FROM EMPLOYEE
Query 20: Find maximum and Minimum Salaries
among 'Research' Department Employees
SELECT MAX(SALARY), MIN(SALARY)
FROM EMPLOYEE, DEPARTMENT
WHERE DNAME='Research' AND DNUMBER=DNO
What Does Query 22 Do?
SELECT COUNT(*)
FROM EMPLOYEE, DEPARTMENT
WHERE DNAME='Research' AND DNUMBER=DNO
CH10.57
Grouping in SQL Queries
Query 24: For Each Department, Retrieve the DNO,
Number of Employees, and Their Average Salary
SELECT DNO, COUNT (*), AVG (SALARY)
FROM EMPLOYEE
GROUP BY DNO
EMPLOYEE tuples are Divided into Groups; each
group has the Same Value for Grouping Attribute DNO
COUNT and AVG functions are applied to each Group
of Tuples Separately
SELECT-clause Includes only the Grouping Attribute
and the Functions to be Applied on each Tuple Group
Are there PL Equivalents to these Data Oriented
Actions? Yes – in Specific APIs but Not PL Itself!
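As one illustration of the "in specific APIs" answer, the grouping in Query 24 can be mimicked with Python's standard library. This is a sketch, assuming employees are given as `(DNO, SALARY)` pairs; the function name `count_and_avg_by_dno` and the sample rows are hypothetical.

```python
from collections import defaultdict

def count_and_avg_by_dno(employees):
    """Mimic SELECT DNO, COUNT(*), AVG(SALARY) FROM EMPLOYEE GROUP BY DNO."""
    groups = defaultdict(list)
    for dno, salary in employees:
        groups[dno].append(salary)   # partition tuples by the grouping attribute
    # Apply COUNT and AVG to each group separately.
    return {dno: (len(s), sum(s) / len(s)) for dno, s in groups.items()}

rows = [(5, 30000), (5, 40000), (4, 25000)]
print(count_and_avg_by_dno(rows))
```

The language itself has no GROUP BY construct; the partitioning step and the per-group functions are spelled out by hand.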
CH10.58
Results of Query 24:
SELECT DNO, COUNT (*), AVG (SALARY)
FROM EMPLOYEE
GROUP BY DNO
CH10.59
INSERT SQL Queries
Add one or more Tuples to a Relation, with Attribute
values Listed in the Order Specified in the CREATE TABLE Command
Update 1:
INSERT INTO EMPLOYEE
VALUES ('Richard','K','Marini', '653298653',
'30-DEC-52', '98 Oak Forest,Katy,TX', 'M',
37000,'987654321', 4 )
Another Form of Update 1:
INSERT INTO EMPLOYEE (FNAME, LNAME, SSN)
VALUES ('Richard','Marini','653298653')
All PK and FK Values must be Provided
Nulls are Allowed
DDL Constraints are Enforced
Another form of “Type Checking” at Instance Level
This is Akin to Dynamic Type Checking!
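The "dynamic type checking" analogy can be made concrete by letting the DBMS reject bad inserts at run time. The sketch below uses Python's built-in `sqlite3` module as a stand-in DBMS and a cut-down `EMPLOYEE` table with three columns; the second and third rows and the function name `try_inserts` are invented for illustration.

```python
import sqlite3

def try_inserts():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE EMPLOYEE ("
        " FNAME VARCHAR(15) NOT NULL,"
        " LNAME VARCHAR(15) NOT NULL,"
        " SSN CHAR(9) NOT NULL,"
        " PRIMARY KEY (SSN))"
    )
    attempts = [
        ("Richard", "Marini", "653298653"),  # satisfies every constraint
        ("Robert", "Hatcher", "653298653"),  # duplicate primary key
        ("Ann", None, "980760540"),          # NULL in a NOT NULL column
    ]
    outcomes = []
    for row in attempts:
        try:
            conn.execute(
                "INSERT INTO EMPLOYEE (FNAME, LNAME, SSN) VALUES (?, ?, ?)", row
            )
            outcomes.append("accepted")
        except sqlite3.IntegrityError:
            # The DDL constraint is checked against this particular tuple.
            outcomes.append("rejected")
    conn.close()
    return outcomes

print(try_inserts())
```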
CH10.60
DELETE SQL Queries
Sample Deletes Include
DELETE FROM EMPLOYEE
WHERE LNAME='Brown'
DELETE FROM EMPLOYEE
WHERE SSN='123456789'
DELETE FROM EMPLOYEE
WHERE DNO IN
(SELECT DNUMBER
FROM DEPARTMENT
WHERE DNAME='Research')
DELETE FROM EMPLOYEE
No. of Tuples Deleted Dependent on WHERE Clause
Referential Integrity (Type Checking!) is Enforced
During DELETE
CH10.61
UPDATE SQL Queries
Give all Employees in the 'Research' Dept. a 10% raise
UPDATE EMPLOYEE
SET SALARY = SALARY * 1.1
WHERE DNO IN
(SELECT DNUMBER
FROM DEPARTMENT
WHERE DNAME='Research')
Modified SALARY Value Depends on the Original
SALARY Value in each Tuple
SALARY = SALARY *1.1 - Use PL Interpretation
CH10.62
Query Processing and Optimization
What are the Processing Issues for DBs?
Database Applications of Today and Tomorrow
Require High Volumes of Information!
Increase of Information Still Requires High
Performance!
Throughput and Response Time
Where's the Bottleneck in DBS?
CPU ??
Main Memory Size/Speed ??
Virtual Memory Limitations ??
Communications Bus ??
I/O Channel ??
How Does this Relate to Compilers/PLs?
CH10.63
90-10 Rule for Database Processing
Load (Transactions per Second) vs.
Performance (Response Time of Transactions)
Processing of Large Amounts of Raw Data
Addressed in Secondary Storage
Staged to Main Memory
Identifying Relevant Data
Large Amounts of Raw Data Discarded
Focus on Data Most Likely to Contain Answers
Possible Loss of CPU and Main Memory Cycles
This is Double Jeopardy!
Load of DBS Must be Reduced
Performance of DBS Degrades
CH10.64
90-10 Rule for Conventional DBS
[Figure: layers labeled Application Programs, Operating System, Database Functions, On-Line I/O, and Disk I/O, annotated "Only 10% of Raw Data is Relevant" and "Only 10% of Relevant Data has Answers"]
Note: Naive Approach to Database Searching Often Occurs
(Little or No Indexing in Practice!)
CH10.65
Query Optimization Goal
Limit Costly Join Operation by Reducing Data to be
Scanned or that Participates in the Join
While Improving Selection and Projection can Help,
the Main Objective is Join
In Worst Case - Cartesian Product
Can Improve by Introducing Indices on the Join
Attributes (R.B and S.C) to Limit “Product”
Can Further Improve by Sorting on the Join
Attributes (R.B and S.C)
This Reduces Block Accesses by Limiting the Number
of Blocks that Must be Examined in a Join
If B’s Values Range from 0 to 100 and C from 50 to 150,
only need to Compare from 50 to 100
Focus is on Reducing Costly Ops – Same as PL
Optimization to Replace * with +
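The benefit of sorting on the join attributes can be sketched with a sort-merge equijoin. This is an illustrative in-memory version, assuming each relation is reduced to a list of its join values (R.B and S.C); the function name `merge_join` and the sample values are hypothetical.

```python
def merge_join(r, s):
    """Equijoin R.B = S.C over two lists of join values, via sort-merge."""
    r, s = sorted(r), sorted(s)
    i = j = 0
    out = []
    while i < len(r) and j < len(s):
        if r[i] < s[j]:
            i += 1          # r[i] is below everything left in S: skip it
        elif r[i] > s[j]:
            j += 1          # s[j] is below everything left in R: skip it
        else:
            # Pair r[i] with the whole run of equal values in S.
            k = j
            while k < len(s) and s[k] == r[i]:
                out.append((r[i], s[k]))
                k += 1
            i += 1
    return out

# B ranges over 0..100 and C over 50..150; only 50..100 can ever match.
print(merge_join([0, 50, 75, 100], [150, 75, 50, 50]))
```

Because both inputs are sorted, values outside the overlapping range are stepped over with a single comparison each instead of being compared against every tuple of the other relation.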
CH10.66
Query Processing
Internal Data Structure
Memory Hierarchy
Main Memory + Secondary Memory
Information Must be Staged from Secondary to Primary
Memory for Database Operation
Sequential Search
Brute force Approach
Direct Access (Indexed Search)
Hash, Inverted Index file, Binary Search Tree, B-tree,
B+-tree
Improves Selection by Focusing on Subset of Tuples
that are Involved in the Answer and Equijoin by Not
Having to Compare All Blocks in Two Relations
CH10.67
Algorithms for Database Query Operators
Largely Fall into Three Classes: Sorting-Based
Methods, Hash-Based Methods, Index-Based Methods
Such Algorithms are Divided into Three Degrees of
Difficulty and Cost (Limiting Factor is Size of Data)
One Pass Algorithms
Where Data is Only Read Once From Disk
Two-pass Algorithms
Data is Read from Disk, Processed in Some Way,
Written Back to Disk, Read Again for Processing, etc.
Multi-pass Algorithms
Where 3 or More Passes Are Required, i.e., Recursive
Generalization of the Two-pass Algorithms
Akin to Multiple Pass Compilers at Data Level
CH10.68
Database Join and Sort are External
Suppose that your DBS has 1,000 1K Blocks of
Memory Available for Performing Operations (e.g.,
Select, Project, Join, Union, Aggregation, etc.)
Suppose Sort R by R.B
R Contains 5000 Blocks
In order to Perform a Sort/Merge - You Must Use an
External Algorithm since all 5000 Blocks Cannot Fit
Into Memory at the Same Time
Suppose Join R (500 Blocks) and S (800 Blocks)
Again - their Total Exceeds Memory - Hence you
Must Take an Approach that Compares One Block
of R with All Blocks of S, etc. (Slides 22,23)
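The external sort described above can be sketched as a two-pass algorithm. This is a simplified in-memory simulation, assuming each block is a Python list of keys and that sorted runs are kept in lists where a real DBMS would write them back to disk; the function name `external_sort` is hypothetical.

```python
import heapq

def external_sort(blocks, memory_blocks):
    """Sort a list of blocks while holding at most memory_blocks at a time.

    Returns the sorted keys and the number of sorted runs produced.
    """
    # Pass 1: read memory_blocks blocks at a time and sort them into a run.
    runs = []
    for start in range(0, len(blocks), memory_blocks):
        chunk = blocks[start:start + memory_blocks]
        runs.append(sorted(key for block in chunk for key in block))
    # Pass 2: merge the sorted runs; heapq.merge consumes them lazily.
    return list(heapq.merge(*runs)), len(runs)

# 5000 one-key blocks with room for 1000 in memory gives 5 sorted runs.
keys, run_count = external_sort([[k] for k in range(5000, 0, -1)], 1000)
print(run_count, keys[:3])
```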
[Figure: memory blocks numbered 1, 2, 3, ..., 1000]
CH10.69
Database Join and Sort are External
What’s True about Today’s DBMS Like Oracle?
Oracle Recommends 2 Gigabytes of Primary Memory
That 2 Gigabytes Must be Shared by:
Operating System
Other Applications Running on “Same” Server
(Web Server, etc.)
Database Management Software
Even if there was 1.5 Gigabytes Available, Modern
DBs can Exceed that size Very Easily
Moreover,
Cartesian Product Could Exceed Available Mem.
Join Could Require External Approach Since All
Tables Involved in Join Can’t fit in 1.5 Gigabytes
External Sorting/Block Oriented Processing is Norm
CH10.70
The System Catalog
Store the Meta Information that Describes Each
Database, Including a Description of
Conceptual Database Schema (Logical Data
Model)
Relations, Attributes, Keys, Indexes, Views
Internal Schema
External Schema
Store Information Needed by Specific DBMS Modules
Query Optimization Module
Security and Authorization
CH10.71
Example of Catalog Information
CH10.72
Relational DBMS Catalog
All Metadata Stored as Relations
Example of Metadata Tables are:
CH10.73
Uses of System Catalog
DDL Compilers:
Correct Definition of
Relations and Attributes
DML (Query) Compiler:
DML Parser
SELECT EMP.ENAME
FROM EMP, WORKS, PROJ
WHERE (EMP.ENO = WORKS.ENO)
AND (WORKS.PNO = PROJ.PNO)
AND (PROJ.PNAME = 'CAD/CAM')
Guided by the Description of DML Syntax and the
Schema Information in the Catalog, Generates a Query
Tree after Parsing
Optimizer
Generates an Access Path that is Relatively Optimal for
Executing a Query/ DML Command, by Accessing the
Database Structure Information (Schemas), and
Mapping High-level SQL Queries Into Low-level File
Access Commands
CH10.74
Revisit Typical Database Processing
[Figure: block diagram of database processing. Boxes and their contents are listed below; arrow labels include User Transaction, Parsed and Optimized User Trans., Lock Request, Response, I/O Request, Results, Errors, and Response to User]
Pre-Processing
- Parser/Lexical
- Optimizer/Views
High-Level Processing
- Enqueue Trans.
- Request Locks
- Release Locks
- Dequeue Trans.
Low-Level Processing
- Enqueue Trans.
- Request Locks
- Issue I/Os
- Process Returned Data
- Integrity Checks
- Security Checks
- Logging for Recovery
- Release Locks
- Dequeue Trans.
Post-Processing
- Collection of Results
- Aggregation Operations
- Security Checks
Concurrency Control
Disk I/O
Recovery
CH10.75
Typical Database Processing
Pre-Processing
Actions Taken Upon Receipt of a Query from User
SQL Query via Query Tool or JDBC Call
“Compilation” of DB Query
Check Syntax, Semantics, Optimize, Develop Run-Time Strategy (Similar to PL Compilation)
Query is Translated to DB Transaction
A Transaction Contains Multiple DB Operations
Transaction has Explicit Order of Operations
Database Transaction Must Succeed or Fail
There is no Intermediate State
Completely Executed and Committed or
Aborts at any Point and Undone
New State or Previous State of DB
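The all-or-nothing behavior of a transaction can be demonstrated with a short sketch. It uses Python's built-in `sqlite3` module as a stand-in DBMS; the `ACCOUNT` table, the balances, and the function names `transfer` and `demo` are assumptions made purely for illustration and do not come from the slides.

```python
import sqlite3

def transfer(conn, amount):
    """Move amount from account 1 to account 2 as one transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE ACCOUNT SET BALANCE = BALANCE + ? WHERE ID = 2", (amount,)
            )
            conn.execute(
                "UPDATE ACCOUNT SET BALANCE = BALANCE - ? WHERE ID = 1", (amount,)
            )
    except sqlite3.IntegrityError:
        pass  # the CHECK constraint failed, so both updates were undone

def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ACCOUNT (ID INT PRIMARY KEY, BALANCE INT CHECK (BALANCE >= 0))"
    )
    conn.executemany("INSERT INTO ACCOUNT VALUES (?, ?)", [(1, 100), (2, 0)])
    conn.commit()
    transfer(conn, 500)  # would overdraw account 1, so the transaction aborts
    after_abort = conn.execute("SELECT BALANCE FROM ACCOUNT ORDER BY ID").fetchall()
    transfer(conn, 30)   # both operations succeed, so the transaction commits
    after_commit = conn.execute("SELECT BALANCE FROM ACCOUNT ORDER BY ID").fetchall()
    conn.close()
    return after_abort, after_commit

print(demo())
```

After the aborted transfer the database is back in its previous state, even though the first of the two updates had already been applied when the second one failed.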
CH10.76
Typical Database Processing
High-Level Processing
Enqueue Transaction from Pre-Processing
Transaction Must Wait for “Earlier” Transactions
Remember - Shared DB State!
Request Locks from Concurrency Control
All Locks Before Proceeding vs. Locks as Needed
Avoid Deadlock and Livelock
Release Locks
As Use of Data Completes to Increase Availability
What Happens if Failure of Later Step in Transaction
Dequeue Transaction
Completes Transaction Processing
Return “Result” to Post-Processing
CH10.77
Typical Database Processing
Low-Level Processing
Enqueue Transaction - Do Actual DB Operations
Request Locks - Lower Granularity Level
Issue I/Os - Based on Operations to Access
“Correct” and “Relevant” DB Records
Process Returned Data - Aggregation, Sorting
Integrity Checks: Do I/D/U Satisfy Constraints?
Security Checks: Is DB R/I/D/U Allowed?
Logging for Recovery - Commit the Transaction
Release Locks - Available to Others
Dequeue Transaction - Return Results to High-Level Processing
Note: The Multiple Operations of Each DB
Transaction All Must be Successful
CH10.78
Typical Database Processing
Post Processing
Collection of Results
May be Passed Portions of Results as they Complete
For Example, Sorted Blocks of Data that are then
Merged in a Final Step
Aggregation Operations
May be Passed Aggregate Intermediate Results
Sum for Different Departments to be Totaled
Security Checks
Last Step Filtering to Ensure Only Allowed Data is
Returned
May Execute Query but Only see Aggregate Result
Send Results to User
CH10.79
Typical Database Processing
Concurrency Control
Control Access to Information
Data and Metadata
Prevent Simultaneous Updates
Ensure Database Always Correct and Consistent
Serial Schedule vs. Serializable Transaction
Two Types
Pessimistic - Locking-Based - Assume Collisions Will
Occur - e.g., Peoplesoft Course Registration
Optimistic - Time-Based - Fix Problems After the Fact - e.g., ATM Machines
CC Manages Locks at Different Granularity Levels
(Table, Attribute, View, Tuple, Metadata, etc.)
CH10.80
Typical Database Processing
Disk I/O
Performs the Actual Disk I/O for Read/Writes
Block Oriented Activity
Maintain Queue of All I/O Requests
Ordering is Critical
Related to Concurrency Control and Consistency
Single DB Transactions can have Multiple DB
Operations
Disk I/Os for Different Operations at Different
Times
High and Low Level Processing will Determine
What Operations Needed When
Disk I/O - Relatively “Dumb”
CH10.81
Typical Database Processing
Recovery
Tightly Tied to DB Transaction Concept
Transactions Must be:
Atomic - Happens or Doesn’t
Durable - Once Committed, Results Survive Failure
Consistent - Follows Protocol/Correct DB State
When Failure Occurs, Can we:
Recover to a Correct “Earlier” State
Reconcile all “Active” Transactions that were Executing
at Failure Time
Involves Logging of Database Actions
Objective: High Availability and Reliability
CH10.82
Query Optimization
Not Really Optimizing, but Planning to Avoid Bad
Execution Strategies
Models
Heuristics-Based
Apply Transformation Rules According to a General
Strategy
Focus on Relational Algebra that Underlies Each Query
Improve the “Order” of Relational Operations
Cost-Based
Minimize a Cost Function
I/O Cost + CPU Cost
Subject to a Set of Constraints
CH10.83
Query Processing Methodology
[Figure: query processing pipeline; the EXTERNAL SCHEMA, LOGICAL SCHEMA, and INTERNAL SCHEMA are shown alongside the stages]
High-level Calculus-based Query
-> Query Preprocessing
-> Algebraic Query (a tree structure)
-> Query Optimization
-> Execution Schedule (file access plan)
CH10.84
Refute Incorrect Queries
Example:
E(ENAME, ENO), P(JNO,JNAME), W(ENO,JNO,DUR)
SELECT ENAME, JNAME
FROM E, P, W
WHERE DUR > 27 AND DUR < 25
Incorrect
Disjoint Components are Useless
Multiple Relations, Missing Joins, may not be
incorrect, but may indicate Cartesian product
Contradictory
Qualification can not be Satisfied by any Tuple
DUR > 27 AND DUR < 25
CH10.85
Simplification
Why Simplify?
The Simpler the Query, the Less Work there is and
the Better the Performance
How? Use transformation rules
Elimination of Redundancy
Idempotency Rules
Application of Transitivity
Use of Integrity Rules
Example
x > a and x > b
DUR > 27 AND DUR > 25
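Both the redundancy rule (x > a and x > b) and the contradiction check (DUR > 27 AND DUR < 25) reduce to comparing bounds. The following is a minimal sketch, assuming every predicate is a strict comparison of one attribute against a constant and that the attribute is real-valued; the function name `simplify_range` is hypothetical.

```python
def simplify_range(lower_bounds, upper_bounds):
    """Reduce a conjunction of x > a and x < b predicates on one attribute.

    Returns (low, high) for the single equivalent range low < x < high,
    or None when no value can satisfy every predicate.
    """
    # Only the tightest bound on each side matters; the rest are redundant.
    low = max(lower_bounds) if lower_bounds else float("-inf")
    high = min(upper_bounds) if upper_bounds else float("inf")
    if low >= high:
        return None  # contradictory qualification
    return low, high

print(simplify_range([27, 25], []))   # DUR > 27 AND DUR > 25
print(simplify_range([27], [25]))     # DUR > 27 AND DUR < 25
```

For an integer-valued attribute a tighter test would be needed, since a range such as 25 < x < 26 is also empty; this sketch does not cover that case.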
CH10.86
Restructuring
Convert Relational Calculus to Relational Algebra
Make use of Query Trees
Example
Find the names of employees other than J. Doe who worked
on the CAD/CAM project for either 1 or 2 years.
SELECT ENAME
FROM E, W, P
WHERE E.ENO=W.ENO
AND W.JNO=P.JNO
AND E.ENAME<>'J. Doe'
AND P.JNAME='CAD/CAM'
AND (W.DUR=12 OR W.DUR=24)
[Figure: query tree, from root to leaves]
Project: ENAME
Select: (DUR=12 OR DUR=24) AND JNAME='CAD/CAM' AND ENAME<>'J. Doe'
Join on JNO: P with (Join on ENO: W with E)
CH10.87
Query Optimization Objectives
Improving Performance
Arriving at a Query Plan of Execution
Analyzing the Relational Algebra Query
Replace Costly Operations
Do Selections and Projections Early
Optimization Heuristics for the Relational Algebra
Performing Selection and Projection Before Join
Combining Several Selections Over a Single
Relation Into One Selection
Find Common Subexpressions
Algebraic Rewriting/transformation Rules
General Transformation Rules for Relational Algebra
(Equivalence-preserving Algebraic Rewriting Rules)
CH10.88
Query Optimization: An Example
Why is it important?
SELECT ENAME
FROM E, W
WHERE E.ENO = W.ENO
AND W.RESP = "Manager"
Strategy 1
$\pi_{ENAME}(\sigma_{RESP=\text{"Manager"} \wedge E.ENO=W.ENO}(E \times W))$
Strategy 2
$\pi_{ENAME}(E \bowtie_{ENO} \sigma_{RESP=\text{"Manager"}}(W))$
CH10.89
Cost of Alternatives
Assume :
card(E) = 4,000; card(W)=10,000
10% of tuples in W satisfy RESP="Manager"
(selection generates 1,000 tuples)
Execution time Proportional to the Sum of the
Cardinalities of the Temporary Relations
Searching is Done by Sequential Scanning
Strategy 1
Cartesian prod. = 40,000,000
Search over all = 40,000,000
Total = 80,000,000
Strategy 2
Selection over W = 10,000
Join (4,000 * 1,000) = 4,000,000
Total = 4,010,000
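The arithmetic for the two strategies can be reproduced with a short Python sketch. The variable names are illustrative; the cardinalities, the 10% selectivity, and the cost model (scanned tuples plus tuples produced by the product or join) are the ones assumed on this slide.

```python
card_E = 4_000
card_W = 10_000
selected_W = card_W * 10 // 100        # 10% of W satisfies RESP = "Manager"

# Strategy 1: build the full Cartesian product, then scan all of it.
cartesian = card_E * card_W
search = cartesian
strategy1 = cartesian + search

# Strategy 2: scan W once for the selection, then join E with the survivors.
selection_scan = card_W
join = card_E * selected_W
strategy2 = selection_scan + join

print(strategy1, strategy2)
```

Under these assumptions Strategy 2 costs roughly one twentieth of Strategy 1, which is the point of performing the selection before the join.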
CH10.90
General Query Optimization Strategy
Perform Selections Early
Yields Smaller Intermediate Results
Direct Impact on Subsequent Join/Cartesian Prod.
Combine Selections with a Prior Cartesian Product into
a Theta or Equi Join
Join is a Cheaper Operation
Combine (Cascade) Selections and Projections
$\pi_{A \cap B}(\pi_{B}(R)) \equiv \pi_{A \cap B}(R)$
$\sigma_{p_1}(\sigma_{p_2}(R)) \equiv \sigma_{p_1 \wedge p_2}(R)$
This Results in One Pass Instead of Two over Table
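A small sketch of why cascaded selections can be merged: both forms return the same tuples, but the combined predicate needs a single scan. The sample rows of W(ENO, JNO, DUR) and the predicates `p1` and `p2` are made up for illustration.

```python
# Hypothetical sample rows of W(ENO, JNO, DUR).
W = [(1, 10, 12), (2, 10, 24), (3, 10, 36), (2, 20, 12)]

def p1(t):
    return t[2] > 12          # DUR > 12

def p2(t):
    return t[1] == 10         # JNO = 10

def select(pred, relation):
    """One sequential pass over `relation`, keeping tuples that satisfy `pred`."""
    return [t for t in relation if pred(t)]

two_passes = select(p1, select(p2, W))                 # scans W, then the intermediate result
one_pass = select(lambda t: p1(t) and p2(t), W)        # scans W once

print(two_passes == one_pass)
```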
CH10.91
General Query Optimization Strategy
Identify Common Subexpressions
Compute Once and Store
use Stored Version for Subsequent Times
Often Useful When Views are Employed
Preprocess Data via Sorts and Indexes
Speeds up Searches and Joins by Limiting Scope
Evaluate and Assess Different Options
For Cartesian Product, Use Smaller Relation for
Comparison
Use System Catalog (Meta-data) to Effect Order in
Query Execution Plan
CH10.92
Relational Algebra Transformations
Cascade of Selection
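$\sigma_{p_1 \wedge p_2 \wedge \dots \wedge p_n}(R) \equiv \sigma_{p_1}(\sigma_{p_2}(\dots(\sigma_{p_n}(R))\dots))$
Commutativity of Selection
$\sigma_{p_1}(\sigma_{p_2}(R)) \equiv \sigma_{p_2}(\sigma_{p_1}(R))$
$\sigma_{p_1 \vee p_2}(R) \equiv \sigma_{p_1}(R) \cup \sigma_{p_2}(R)$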
Cascade of Projection
$\pi_{A_1}(\pi_{A_2}(\dots(\pi_{A_n}(R))\dots)) \equiv \pi_{A_1}(R)$ if $A_1 \subseteq A_2 \subseteq \dots \subseteq A_n$
Commuting Selection with Projection
$\pi_{A_1,A_2,\dots,A_n}(\sigma_{p}(R)) \equiv \sigma_{p}(\pi_{A_1,A_2,\dots,A_n}(R))$ (when p references only $A_1,\dots,A_n$)
CH10.93
Relational Algebra Transformations
Commutativity of Theta Join and Cartesian Product
$R \bowtie_{A} S \equiv S \bowtie_{A} R$
$R \times S \equiv S \times R$
Commuting Selection with Theta Join (Cartesian)
$\sigma_{p(A)}(R \times S) \equiv (\sigma_{p(A)}(R)) \times S$
A defined on R only
$\sigma_{p(A) \wedge p(B)}(R \times S) \equiv (\sigma_{p(A)}(R)) \times (\sigma_{p(B)}(S))$
(A defined on R, B defined on S)
Also Holds for Theta Join
Commuting Projection with Theta Join (Cartesian)
$\pi_{C}(R \times S) \equiv \pi_{A}(R) \times \pi_{B}(S)$ where $A \cup B = C$
A are the Attributes of C from R and B are the Attributes of C from S
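The rule for commuting selection with Cartesian product can be checked on a toy instance. In this sketch the relations R(A, X) and S(B) and the predicate on A are hypothetical, chosen only to show that both sides produce the same pairs while the pushed-down form builds fewer of them.

```python
from itertools import product

R = [(1, "a"), (2, "b"), (3, "c")]     # hypothetical R(A, X)
S = [(10,), (20,)]                     # hypothetical S(B)

def p_A(r):
    return r[0] >= 2                   # predicate over attribute A of R only

# Left-hand side: build all of R x S, then filter.
late = [(r, s) for r, s in product(R, S) if p_A(r)]

# Right-hand side: filter R first, then build the product.
early = list(product([r for r in R if p_A(r)], S))

print(late == early, len(R) * len(S), len(early))
```

The full product has 6 pairs, while the pushed-down version only ever builds the 4 pairs that survive.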
CH10.94
Relational Algebra Transformations
Commutativity of Set Operations
$R \cup S \equiv S \cup R$
$R \cap S \equiv S \cap R$
Associativity of Set Operations
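$(R \times S) \times T \equiv R \times (S \times T)$
$(R \bowtie S) \bowtie T \equiv R \bowtie (S \bowtie T)$
$(R \cup S) \cup T \equiv R \cup (S \cup T)$
$(R \cap S) \cap T \equiv R \cap (S \cap T)$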
Commuting Select with Set Operations
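$\sigma_{p(A_i)}(R \cup T) \equiv \sigma_{p(A_i)}(R) \cup \sigma_{p(A_i)}(T)$
where $A_i$ is defined on both R and T
$\sigma_{p(A_i)}(R \cap T) \equiv \sigma_{p(A_i)}(R) \cap \sigma_{p(A_i)}(T)$
where $A_i$ is defined on both R and T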
CH10.95
Relational Algebra Transformations
11. Commuting Projection with Union
$\pi_{C}(R \bowtie_{q(A_j,B_k)} S) \equiv \pi_{A'}(R) \bowtie_{q(A_j,B_k)} \pi_{B'}(S)$ (with $A_j \in A'$ and $B_k \in B'$)
$\pi_{C}(R \cup S) \equiv \pi_{A'}(R) \cup \pi_{B'}(S)$
where R[A] and S[B]
$C = A' \cup B'$ where $A' \subseteq A$, $B' \subseteq B$
12. Converting Selection/Cartesian Into Theta Join
$\sigma_{C}(R \times S) \equiv R \bowtie_{C} S$
CH10.96
Heuristic Optimization: Example
Canonical query tree at the end of
query preprocessing phase
E(ENAME, ENO)
P(JNO,JNAME)
W(ENO,JNO,DUR)
Query tree (root first):
$\pi_{ENAME}$
  $\sigma_{(DUR=12 \vee DUR=24) \wedge JNAME=\text{"CAD/CAM"} \wedge ENAME \neq \text{"J. Doe"}}$
    $\bowtie_{JNO}$
      P
      $\bowtie_{ENO}$
        W
        E
CH10.97
Heuristic Optimization– Example
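Use cascading of selections rule to decompose selections
Query tree (root first):
$\pi_{ENAME}$
  $\sigma_{DUR=12 \vee DUR=24}$
    $\sigma_{JNAME=\text{"CAD/CAM"}}$
      $\sigma_{ENAME \neq \text{"J. Doe"}}$
        $\bowtie_{JNO}$
          P
          $\bowtie_{ENO}$
            W
            E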
CH10.98
Heuristic Optimization– Example
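Push selection down using commutativity of selection over join
Query tree (root first):
$\pi_{ENAME}$
  $\sigma_{DUR=12 \vee DUR=24}$
    $\sigma_{JNAME=\text{"CAD/CAM"}}$
      $\bowtie_{JNO}$
        P
        $\bowtie_{ENO}$
          W
          $\sigma_{ENAME \neq \text{"J. Doe"}}$
            E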
CH10.99
Heuristic Optimization–Example
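Push selection down using commutativity of selection over join
Query tree (root first):
$\pi_{ENAME}$
  $\sigma_{DUR=12 \vee DUR=24}$
    $\bowtie_{JNO}$
      $\sigma_{JNAME=\text{"CAD/CAM"}}$
        P
      $\bowtie_{ENO}$
        W
        $\sigma_{ENAME \neq \text{"J. Doe"}}$
          E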
CH10.100
Heuristic Optimization–Example
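Push selection down
Query tree (root first):
$\pi_{ENAME}$
  $\bowtie_{JNO}$
    $\sigma_{JNAME=\text{"CAD/CAM"}}$
      P
    $\bowtie_{ENO}$
      $\sigma_{DUR=12 \vee DUR=24}$
        W
      $\sigma_{ENAME \neq \text{"J. Doe"}}$
        E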
CH10.101
Heuristic Optimization–Example
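Do early projection
Query tree (root first):
$\pi_{ENAME}$
  $\bowtie_{JNO}$
    $\pi_{JNO}$
      $\sigma_{JNAME=\text{"CAD/CAM"}}$
        P
    $\pi_{JNO,ENAME}$
      $\bowtie_{ENO}$
        $\pi_{JNO,ENO}$
          $\sigma_{DUR=12 \vee DUR=24}$
            W
        $\pi_{ENO,ENAME}$
          $\sigma_{ENAME \neq \text{"J. Doe"}}$
            E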
CH10.102
Heuristic Optimization–Example
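Identify subtrees that can be implemented in one algorithm
Query tree (root first):
$\pi_{ENAME}$
  $\bowtie_{JNO}$
    $\pi_{JNO}$
      $\sigma_{JNAME=\text{"CAD/CAM"}}$
        P
    $\pi_{JNO,ENAME}$
      $\bowtie_{ENO}$
        $\pi_{JNO,ENO}$
          $\sigma_{DUR=12 \vee DUR=24}$
            W
        $\pi_{ENO,ENAME}$
          $\sigma_{ENAME \neq \text{"J. Doe"}}$
            E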
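To make the effect of the heuristic rewriting concrete, the sketch below evaluates the CAD/CAM query twice on a tiny, made-up database: once in canonical order (join first, select last) and once with the selections pushed down to the base relations. The sample tuples and the function names `canonical` and `optimized` are assumptions for illustration; plain nested loops stand in for whatever join algorithm a real system would use.

```python
# Hypothetical sample data.
E = [("J. Doe", 1), ("M. Smith", 2), ("A. Lee", 3)]          # E(ENAME, ENO)
P = [(10, "CAD/CAM"), (20, "Database")]                      # P(JNO, JNAME)
W = [(1, 10, 12), (2, 10, 24), (3, 10, 36), (2, 20, 12)]     # W(ENO, JNO, DUR)

def canonical():
    """Join everything first, then select, then project."""
    w_e = [(w, e) for w in W for e in E if w[0] == e[1]]
    joined = [(p, w, e) for p in P for (w, e) in w_e if p[0] == w[1]]
    selected = [
        (p, w, e) for (p, w, e) in joined
        if w[2] in (12, 24) and p[1] == "CAD/CAM" and e[0] != "J. Doe"
    ]
    names = sorted({e[0] for (_, _, e) in selected})
    return names, len(w_e) + len(joined)      # result, intermediate join tuples

def optimized():
    """Apply each selection to its base relation before joining."""
    e_sel = [e for e in E if e[0] != "J. Doe"]
    w_sel = [w for w in W if w[2] in (12, 24)]
    p_sel = [p for p in P if p[1] == "CAD/CAM"]
    w_e = [(w, e) for w in w_sel for e in e_sel if w[0] == e[1]]
    joined = [(p, w, e) for p in p_sel for (w, e) in w_e if p[0] == w[1]]
    names = sorted({e[0] for (_, _, e) in joined})
    return names, len(w_e) + len(joined)

print(canonical())
print(optimized())
```

Both plans return the same names; the optimized plan simply materializes fewer intermediate join tuples on this data.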
CH10.103
Heuristic Optimization: A Second Example
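What is the Final Step?
Combine Select and Cartesian Product
Result: Equijoins!
[Figure: query tree over Books, Borrower, and Loans with a projection on Title at the root. Node labels as extracted: conditions Borrower.Card_No = Loans.Card_No and Books.LC_No = Loans.LC_No; two Cartesian products (X); projections on Loans.LC_No; on Books.LC_No, Title; on Loans.LC_No, Borr.Card_No; and on Loans.Card_No; a selection comparing Date with 1/1/88.]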
CH10.104
Cost-Based Optimization
Reduce Defined Cost of Executing Queries
What is Involved in the Cost of Executing a Query?
Access Cost to Secondary Storage
Search for Data Block (Index)
Read/Write Index and Data Blocks
Storage Cost
Index and Data Blocks
Intermediate Files
Computation Cost
Query Planning - Optimization Effort
Record Search, Sort, Merge
Actual Transaction/Query Operations
Communications Cost
Transfer of Results to the User
CH10.105
Complexity of Relational Operations
Assuming Relations of Cardinality n
Sequential Scan of Data in each Relation
Complexity of Each Operation is Indicated
Avoid Cartesian Product at All Costs!
Operation: Complexity
Select, Project (w/o duplicate elimination): O(n)
Project (with duplicate elimination), Group: O(n log n)
Join, Division, Set Operators: O(n log n)
Cartesian Product: O(n^2)
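The O(n log n) figure for join is consistent with a sort-based implementation. The following is a minimal sort-merge equijoin sketch, checked against the naive product-then-filter approach. The function name, the key-function parameters, and the sample relations are hypothetical, and the bound only holds when the join result itself stays small, since runs of duplicate keys multiply the output.

```python
def sort_merge_join(R, S, key_r, key_s):
    """Equijoin of R and S on key_r(r) == key_s(s) using sort, then merge."""
    R = sorted(R, key=key_r)
    S = sorted(S, key=key_s)
    out = []
    i = j = 0
    while i < len(R) and j < len(S):
        kr, ks = key_r(R[i]), key_s(S[j])
        if kr < ks:
            i += 1
        elif kr > ks:
            j += 1
        else:
            # Find the run of S tuples sharing this key.
            j_end = j
            while j_end < len(S) and key_s(S[j_end]) == kr:
                j_end += 1
            # Pair every R tuple with this key against that run.
            while i < len(R) and key_r(R[i]) == kr:
                out.extend((R[i], s) for s in S[j:j_end])
                i += 1
            j = j_end
    return out

R = [(1, "a"), (2, "b"), (2, "c")]
S = [(2, "x"), (2, "y"), (3, "z")]
first = lambda t: t[0]

merged = sort_merge_join(R, S, first, first)
# Naive baseline: examine all |R| * |S| pairs.
nested = [(r, s) for r in R for s in S if r[0] == s[0]]
print(sorted(merged) == sorted(nested))
```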
CH10.106
Cost-Based Optimization
To Understand Cost-Based Operations, we Must Focus
on Implementation Strategy of:
Select
Project
Join
For Select and Project - There is a Fixed Cost that we
Must Live With
For Join
Implementation Strategy
Different Join Strategies
Objective:
Minimize the Number of Blocks Involved
Note that Cost-Based and Relational Algebra Heuristic
Optimization Can Complement One Another
CH10.107
Optimization Summary
Most Systems Implement Only a Few Strategies
The Number of Strategies that are Considered by Any
Query Optimizer is Limited
Some Systems Reduce the Number of Strategies by
Making a Heuristic Guess of Strategy for Each Query
The Optimizer Considers Every Possible Strategy,
but Terminates as Soon as it Determines the Cost is
Greater than the Pre-chosen Strategy
Thus Only a Few Competing Strategies Require
Full Analysis of the Cost
The Overhead of Query Optimization is Reduced
Remember - Trade off in Optimization Time
For PL - Optimization is Pre-Execution (Compile)
For DB - Optimization is Part of Execution (Run)
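The idea of abandoning a candidate strategy as soon as its cost exceeds the pre-chosen one can be sketched as a simple pruning loop. Everything here is an assumption for illustration: each strategy is modeled as a list of per-step costs, `choose_plan` is a made-up name, and the sketch also tightens the bound whenever a cheaper strategy is found, which goes slightly beyond what the slide states.

```python
def choose_plan(plans, initial_guess):
    """plans: dict of strategy name -> list of step costs.

    Returns (best name, best cost, number of strategies costed in full).
    """
    best_name = initial_guess
    best_cost = sum(plans[initial_guess])
    fully_costed = 1
    for name, steps in plans.items():
        if name == initial_guess:
            continue
        running = 0
        for cost in steps:
            running += cost
            if running >= best_cost:
                break                      # already no better: stop costing this strategy
        else:
            # Loop finished without a break, so this strategy is cheaper.
            best_name, best_cost = name, running
            fully_costed += 1
    return best_name, best_cost, fully_costed

# Step costs taken from the E/W "Manager" example earlier in the deck.
plans = {
    "strategy2": [10_000, 4_000_000],
    "strategy1": [40_000_000, 40_000_000],
}
print(choose_plan(plans, "strategy2"))
```

With a good initial guess, the expensive strategy is dropped after its first step, so only one strategy is costed in full.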
CH10.108