Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Exam1
COSC 6340 (Data Management)
March 22, 2001
Your Name:
Your SSN:
I agree that my grades are posted using the last 4 digits of my ssn
………………….(signature, if you like us to post your grades)
Problem 1 [17]:
Problem 2 [8]:
Problem 3 [7]:
Problem 4 [12]:
Problem 5 [23]
Problem 6: [5]
:
Grade:
The exam is “open books” and you have 75 minutes to complete the exam.
1) Relational Database Design [17]
Consider the following relation R(A,B,C,D,E) with the following functional
dependencies is given ( -> denotes a functional dependency):
(1) A -> B
(2) C -> B
(3) B -> E
(4) E -> D
a) Assume we decompose R into R1(A,B, C) and R2(B,D,E). Does this decomposition
have the lossless join property --- is it possible to reconstruct R from R1 and R2 using a
natural join? Give reasons for your answer! [5]
Yes. Apply the lossless decomposition test on page 435 of the textbook:
R1 ņ R2 = {B}
For R2, the FDs are: B -> E and E -> D. By applying transitivity rule, we get B > D. Because B -> E and B-> D, B->BDE (Union), i.e. B is the candidate key of
relation R2. So R1 ņ R2 -> R2. The decomposition is lossless.
You can also use attribute closure to explain the answer.
b) What is (are) the candidate key(s) of R? [2]
AC
c) Is R in BCNF? If not, which functional dependencies are bad (violate BCNF)? [2]
No. All FDs are bad.
d) Transform the relational schema into a relational schema that is in BCNF and does
not have any lost dependencies; if this is not possible decompose R into a schema that is
in BCNF and has the fewest number of lost functional dependencies. [8]
One possible solution:
R1 (ABCDE)
E -> D
R2 (ABCE)
R3 (DE)
B -> E
R4 (ABC)
R5 (BE)
A -> B
R6 (AC)
R7 (AB)
lost FD C -> B
2) Multi-valued Dependencies [8]
Assume the following relation R(A,B,C,D) is given and the following multivalued
dependencies hold: A ->-> B and A ->-> C
Assume the relation R contains the following two tuples
R(A B C D)
(12 34)…
(15 67)…
What other tuples must R contain so that A ->-> B and A ->-> C hold for R (said
differently, given and example of a relational relation that contains the two tuples and
does not violate the two multi-valued dependencies)?
Apply the formula on page 446 of the textbook, the tuples that must be
included due to the two multi-valued dependency are:
(1 2 6 7)
(1 5 3 4)
(1 2 6 4)
(1 5 3 7)
(1 2 3 7) second round
(1 5 6 4) second round
3) Query Optimization [7]
a) What are the goals and objectives of query optimization? [4]
See textbook!!
b) Why are statistics gathered from the database important for query optimization?
[3]
To predict cost of operations to predict the size of intermediate relations; better
prediction model result in more accurate evaluations of query plans.
4) B+-Trees [12]
a) Compare B+-trees with static hashing. What are the main advantages of B+-trees
if compared with static (bucket hashing techniques). What are the disadvantages?
[4].
Advantages of B+-tree: Sorted data structure; Self-organizing; Efficient for
range search
Disadvantage: May require 1 or 2 more I/Os for equality search than hashing
b) Assume that the following B+-tree with p=5 and k=3 is given. Furthermore,
assume that the keys 1, 21, 22, 23, 39 are deleted in the indicated order. Show
how the tree looks like after each deletion. [8]
21
2
1
2
5
3
23
4
5
20
21
22
23
39
40
40
50
44
50
60
63
One possible solution:
Delete 1:
21
2
3
3
5
4
5
23
20
21
22
23
39
40
40
50
44
50
40
50
44
50
60
63
Delete 21:
23
3 20
2
Delete 22:
3
4
5
20
22
23
39
40
60
63
23
3 5
2
3
4
5
20
23
40
50
39
40
44
50
44
50
60
63
Delete 23:
3
2
3
4
5
20
20
39
40
50
40
Delete 39:
3
2
3
4
5
5
20
5) Physical Database Design [23]
40
40
50
44
50
60
63
60
63
Assume two relations R1(A, B, C) and R2(A, D, E); R1 and R2 are both stored as an
unordered file and contains 1000000 (1 million) tuples. Attributes A, B, C, D, and E need
4 byte of storage each, and blocks have a size of 4096 Byte. A is the primary key of both
R1 and R2 and R1[A]=R2[A]. Moreover, we assume that static hashing is used to
implement index structures, and that index pointers require 4 byte of storage;
furthermore, you can assume that pages of index blocks are 80% full and do not contain
any overflow pages. Moreover, the database system only supports the block nested loops
join (only 3 blocks of buffer are available) and the index nested loops join. What index
structures would you create to speed up the following 3 queries?
Q1: Select B, E
from R1, R2
where R1.A=R2.A
and D=12;
returns 4 answers
Q2: Select B
from R1, R2
where R1.A=R2.A
and C=12;
returns 100000 answers
Q3: Select sum(A)
from R1;
returns one answer
Describe which index structure you would create (justify your design!), and compute the
cost for executing Q1, Q2, and Q3 for your chosen design (Hint: look for unusual
solutions!). Also give the query evaluation plan you assume the database system would
use to implement query Q1.
Q1:
Because 'Q1 returns 4 answers', 'R1[A] = R2[A]' and 'A is the primary key of both
relations' => there are exactly 4 tuples in R2 satisfy D = 12. Establish hash index
on R2.D and use it to find the 4 tuples in R2 satisfying D = 12 without writing out
the result. Meanwhile establish hash index on R1.A and use it to find the four
tuples satisfying R1.A = R2.A on the fly, and write out B, E.
Cost:
To find out the four tuples in R2 with D = 12:
1 (index block of R2.D) + 4 (data blocks) = 5
For each tuple with D = 12, find out the tuple in R1 satisfying R1.A = R2.A:
1 (index block of R1.A) + 1 (data block) = 2
Total cost (without considering writing out the final result):
5 + 4 * 2 = 13 I/Os
Q2:
# of file blocks of each relation = 1000000 * (4 * 3) / 4096 ≈ 3000
Because Q2 returns 100000 answers, index on R1.C will not help. Scan R1 to
retrieve tuples with C = 12 (3000 block access). Meanwhile use hash index on
R2.A do index nested loop join on the fly.
Cost:
3000 + 100000 * (1 + 1) = 203000 I/Os
Q3:
Because only the values of R1.A is needed, and we have already established the
index on R1.A for Q1, an index only search is enough to get the result.
Cost: (equals the number of index blocks)
1000000 * (4 + 4) / (4096 * 80%) ≈ 2442 I/Os
Evaluation Plan for Q1: (reference page 372 of the textbook)
B, E
(on-the-fly)
|X| A=A
(use hash
index; do not
write result
to temp)
D=12
(index nested loop join)
R1
R2
6) Data Warehousing, OLAP, and KDD [5]
Explain the increased popularity of Data Warehousing, OLAP, and data mining
techniques in the commercial area!
Reasons:
deals with data explosion problem (e.g. from scanner, earth satellites,…);
automated tools are necessary to make sense of data, because of
limitations of human resources.
provide high level summary of data that facilitates data analysis, data
mining, and data visualization
support intelligent decision making through aggregated summaries of low
level (production) data
provide information for management
facilitate cooperate report generation