Download Midterm Exam with Solutions

Exam1 COSC 6340 (Data Management)
March 22, 2001

Your Name:
Your SSN:

I agree that my grades are posted using the last 4 digits of my SSN
………………….(signature, if you would like us to post your grades)

Problem 1 [17]:
Problem 2 [8]:
Problem 3 [7]:
Problem 4 [12]:
Problem 5 [23]:
Problem 6 [5]:
Grade:

The exam is “open books” and you have 75 minutes to complete the exam.

1) Relational Database Design [17]

Consider the relation R(A,B,C,D,E) with the following functional dependencies (-> denotes a functional dependency):
(1) A -> B
(2) C -> B
(3) B -> E
(4) E -> D

a) Assume we decompose R into R1(A,B,C) and R2(B,D,E). Does this decomposition have the lossless join property, that is, is it possible to reconstruct R from R1 and R2 using a natural join? Give reasons for your answer! [5]

Yes. Apply the lossless decomposition test on page 435 of the textbook: R1 ∩ R2 = {B}. For R2, the FDs are B -> E and E -> D. By the transitivity rule, we get B -> D. Because B -> E and B -> D, B -> BDE (union), i.e. B is the candidate key of relation R2. So R1 ∩ R2 -> R2, and the decomposition is lossless. You can also use attribute closure to explain the answer.

b) What is (are) the candidate key(s) of R? [2]

AC

c) Is R in BCNF? If not, which functional dependencies are bad (violate BCNF)? [2]

No. All FDs are bad: none of the left-hand sides (A, C, B, E) is a superkey of R.

d) Transform the relational schema into a relational schema that is in BCNF and does not have any lost dependencies; if this is not possible, decompose R into a schema that is in BCNF and has the fewest lost functional dependencies.
[8]

One possible solution:

R1 (ABCDE)
  split on E -> D: R2 (ABCE) and R3 (DE)
R2 (ABCE)
  split on B -> E: R4 (ABC) and R5 (BE)
R4 (ABC)
  split on A -> B: R6 (AC) and R7 (AB)

Resulting BCNF schema: R3 (DE), R5 (BE), R6 (AC), R7 (AB)
Lost FD: C -> B

2) Multi-valued Dependencies [8]

Assume the following relation R(A,B,C,D) is given and the following multivalued dependencies hold: A ->-> B and A ->-> C.

Assume the relation R contains the following two tuples:

R(A B C D)
 (1 2 3 4)
 (1 5 6 7)
 …

What other tuples must R contain so that A ->-> B and A ->-> C hold for R (said differently, give an example of a relation that contains the two tuples and does not violate the two multi-valued dependencies)?

Apply the formula on page 446 of the textbook. The tuples that must be included due to the two multi-valued dependencies are:

(1 2 6 7)
(1 5 3 4)
(1 2 6 4)
(1 5 3 7)
(1 2 3 7) second round
(1 5 6 4) second round

3) Query Optimization [7]

a) What are the goals and objectives of query optimization? [4]

See textbook!!

b) Why are statistics gathered from the database important for query optimization? [3]

To predict the cost of operations and the size of intermediate relations; better prediction models result in more accurate evaluation of query plans.

4) B+-Trees [12]

a) Compare B+-trees with static hashing. What are the main advantages of B+-trees compared with static (bucket) hashing techniques? What are the disadvantages? [4]

Advantages of B+-trees: sorted data structure; self-organizing; efficient for range search.
Disadvantage: may require 1 or 2 more I/Os for an equality search than hashing.

b) Assume that the following B+-tree with p=5 and k=3 is given. Furthermore, assume that the keys 1, 21, 22, 23, 39 are deleted in the indicated order. Show what the tree looks like after each deletion.
[8]

[Figure: initial B+-tree; the node layout did not survive text extraction]

One possible solution:

Delete 1: [Figure: tree after deleting 1; node layout not recoverable]
Delete 21: [Figure: tree after deleting 21; node layout not recoverable]
Delete 22: [Figure: tree after deleting 22; node layout not recoverable]
Delete 23: [Figure: tree after deleting 23; node layout not recoverable]
Delete 39: [Figure: tree after deleting 39; node layout not recoverable]

5) Physical Database Design [23]

Assume two relations R1(A, B, C) and R2(A, D, E); R1 and R2 are each stored as an unordered file and each contains 1000000 (1 million) tuples. Attributes A, B, C, D, and E need 4 bytes of storage each, and blocks have a size of 4096 bytes. A is the primary key of both R1 and R2, and R1[A]=R2[A]. Moreover, we assume that static hashing is used to implement index structures and that index pointers require 4 bytes of storage; furthermore, you can assume that index blocks are 80% full and that there are no overflow pages. Moreover, the database system only supports the block nested loops join (only 3 blocks of buffer are available) and the index nested loops join. What index structures would you create to speed up the following 3 queries?

Q1: Select B, E from R1, R2 where R1.A=R2.A and D=12; returns 4 answers
Q2: Select B from R1, R2 where R1.A=R2.A and C=12; returns 100000 answers
Q3: Select sum(A) from R1; returns one answer

Describe which index structures you would create (justify your design!), and compute the cost of executing Q1, Q2, and Q3 for your chosen design (Hint: look for unusual solutions!). Also give the query evaluation plan you assume the database system would use to implement query Q1.

Q1: Because 'Q1 returns 4 answers', 'R1[A] = R2[A]' and 'A is the primary key of both relations', there are exactly 4 tuples in R2 that satisfy D = 12. Establish a hash index on R2.D and use it to find the 4 tuples in R2 satisfying D = 12 without writing out the result.
Meanwhile establish a hash index on R1.A and use it to find the four tuples satisfying R1.A = R2.A on the fly, and write out B, E.

Cost:
To find the four tuples in R2 with D = 12: 1 (index block of R2.D) + 4 (data blocks) = 5
For each tuple with D = 12, find the tuple in R1 satisfying R1.A = R2.A: 1 (index block of R1.A) + 1 (data block) = 2
Total cost (without considering writing out the final result): 5 + 4 * 2 = 13 I/Os

Q2: # of file blocks of each relation = 1000000 * (4 * 3) / 4096 ≈ 2930, rounded to about 3000.
Because Q2 returns 100000 answers, an index on R1.C will not help. Scan R1 to retrieve the tuples with C = 12 (3000 block accesses). Meanwhile use a hash index on R2.A to do an index nested loops join on the fly.
Cost: 3000 + 100000 * (1 + 1) = 203000 I/Os

Q3: Because only the values of R1.A are needed, and we have already established the index on R1.A for Q1, an index-only scan is enough to get the result.
Cost (equals the number of index blocks): 1000000 * (4 + 4) / (4096 * 80%) ≈ 2442 I/Os

Evaluation Plan for Q1 (reference page 372 of the textbook), written top-down with children indented:

π B, E (on-the-fly)
  |X| A=A (index nested loops join)
    σ D=12 (use hash index; do not write result to temp)
      R2
    R1

6) Data Warehousing, OLAP, and KDD [5]

Explain the increased popularity of Data Warehousing, OLAP, and data mining techniques in the commercial area!

Reasons:
- deal with the data explosion problem (e.g. from scanners, earth satellites,…); automated tools are necessary to make sense of data, because of limitations of human resources
- provide high-level summaries of data that facilitate data analysis, data mining, and data visualization
- support intelligent decision making through aggregated summaries of low-level (production) data
- provide information for management
- facilitate corporate report generation
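
Cross-check for Problem 1: the solution to 1a notes that attribute closure can also be used to justify the answer. The following is a minimal Python sketch of that check, not part of the original exam. The helper names closure and candidate_keys are hypothetical, chosen only for illustration, and the brute-force key search is only reasonable for a relation this small.

```python
from itertools import combinations


def closure(attrs, fds):
    """Return the closure of a set of attributes under the given FDs.

    fds is a list of (lhs, rhs) pairs; each side is an iterable of
    single-letter attribute names (illustrative representation).
    """
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Fire an FD when its left side is covered and it adds something new.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def candidate_keys(attrs, fds):
    """Brute-force search for minimal attribute sets whose closure is attrs."""
    keys = []
    for size in range(1, len(attrs) + 1):
        for combo in combinations(sorted(attrs), size):
            candidate = set(combo)
            # Skip supersets of keys already found (they are not minimal).
            if any(key <= candidate for key in keys):
                continue
            if closure(candidate, fds) == attrs:
                keys.append(candidate)
    return keys


# FDs (1)-(4) from Problem 1.
FDS = [("A", "B"), ("C", "B"), ("B", "E"), ("E", "D")]
R = set("ABCDE")
R1, R2 = set("ABC"), set("BDE")

# 1a) Lossless join test: (R1 ∩ R2)+ must contain R1 or R2.
common_closure = closure(R1 & R2, FDS)
lossless = R1 <= common_closure or R2 <= common_closure

# 1b) Candidate keys of R.
keys = candidate_keys(R, FDS)

# 1c) An FD violates BCNF when its left side is not a superkey of R.
bad_fds = [(lhs, rhs) for lhs, rhs in FDS if closure(lhs, FDS) != R]

print("lossless:", lossless)
print("candidate keys:", keys)
print("FDs violating BCNF:", bad_fds)
```

Under these assumptions the sketch agrees with the written solutions: the closure of {B} covers R2, the only candidate key is AC, and all four FDs violate BCNF.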
