Download cse4701chap26 - University of Connecticut

Chapter 26, 6e - 24, 5e
Distributed Databases
CSE 4701
Prof. Steven A. Demurjian, Sr.
Computer Science & Engineering Department
The University of Connecticut
191 Auditorium Road, Box U-155
Storrs, CT 06269-3155
steve@engr.uconn.edu
http://www.engr.uconn.edu/~steve
(860) 486 - 4818

Portions of these slides are used with the permission of Dr. Ling Liu, Associate Professor, College of Computing, Georgia Tech. Remaining slides represent new material.
Chaps26.1

Classical and Distributed Architectures
• Classic/Centralized DBMS Dominated the Commercial Market from 1970s Forward
• Problems of this Approach
  - Difficult to Scale w.r.t. Performance Gains
  - If DB Overloaded, Replace with a Faster Computer
  - This can Only Go So Far - Disk Bottlenecks
• Distributed DBMS have Evolved to Address a Number of Issues
  - Improved Performance
  - Putting Data “Near” Location where it is Needed
  - Replication of Data for Fault Tolerance
  - Vertical and Horizontal Partitioning of DB Tuples
Chaps26.2

Common Features of Centralized DBMS
• Data Independence
  - High-Level Representation via Conceptual and External Schemas
  - Physical Representation (Internal Schema) Hidden
• Program Independence
  - Multiple Applications can Share Data
  - Views/External Schema Support this Capability
• Reduction of Program/Data Redundancy
  - Single, Unique, Conceptual Schema
  - Shared Database
  - Almost No Data Redundancy
  - Controlled Data Access Reduces Inconsistencies
  - Programs Execute with Consistent Results
Chaps26.3

Common Features of Centralized DBMS
• Promote Sharing: Automatically Provided via Concurrency Control (CC)
  - No Longer Programmatic Issue
  - Most DBMS Offer Locking for Key Shared Data
  - Oracle Allows Locks on Data Item (Attributes)
  - For Example, Controlling Access to Shared Identifier
• Coherent and Central DB Administration
• Semantic DB Integrity via the Automatic Enforcement of Data Consistency via Integrity Constraints/Rules
• Data Resiliency
  - Physical Integrity of Data in the Presence of Faults and Errors
  - Supported by DB Recovery
• Data Security: Control Access for Authorized Users Against Sensitive Data
Chaps26.4

Shared Nothing Architecture
• In this Architecture, Each DBMS Operates Autonomously
• There is No Sharing
  - Three Separate DBMSs on Three Different Computers
• Applications/Clients Must Know About the External Schemas of all Three DBMSs for
  - Database Retrieval
  - Client Processing
• Complicates Client
  - Different DBMS Platforms (Oracle, Sybase, Informix, ..)
  - Different Access Modes (Query, Embedded, ODBC)
  - Difficult for SWE to Code
Chaps26.5

Difficulty in Access – Manage Multiple APIs
• Each Platform has a Different API
  - API1, API2, ..., APIn
  - An App Programmer Must Utilize All Three APIs, which Could Differ by PL - C++, C, Java, REST, etc.
  - Any Interactions Across 3 DBs Must be Programmatically Handled without DB Capabilities
[Figure labels: API1, API2, APIn]
Chaps26.6

NW Architecture with Centralized DB
• High-Speed NWs/WANs Spawned Centralized DB Accessible Worldwide
  - Clients at Any Site can Access Repository
  - Data May be “Far” Away - Increased Access Time
  - In Practice, Each Remote Site Needs only Portion of the Data in DB1 and/or DB2
  - Inefficient, no Replication w.r.t. Failure
Chaps26.7

Fully Distributed Architecture
• The Five Sites (Chicago, SF, LA, NY, Atlanta) each have a “Portion” of the Database - it is Distributed
• Replication is Possible for Fault Tolerance
• Queries at one Site May Need to Access Data at Another Site (e.g., for a Join)
• Increased Transaction Processing Complexity
Chaps26.8

Distributed Database Concepts
• A transaction can be executed by multiple networked computers in a unified manner.
• A distributed database (DDB) processes a unit of execution (a transaction) in a distributed manner.
• A distributed database (DDB) can be defined as
  - A collection of multiple logically related databases distributed over a computer network
• A distributed database management system (DDBMS) can be defined as
  - A software system that manages a distributed database while making the distribution transparent to the user
Chaps26.9

Goals of DDBMS
• Support User Distribution Across Multiple Sites
  - Remote Access by Users Regardless of Location
  - Distribution and Replication of Database Content
• Provide Location Transparency
  - Users Manipulate their Own Data
  - Non-Local Sites “Appear” Local to Any User
• Provide Transaction Control Akin to Centralized Case
  - Transaction Control Hides Distribution
  - CC and Serializability - Must be Extended
• Minimize Communications Cost
  - Optimize Use of Network - a Critical Issue
  - Distribute DB Design Supported by Partitioning (Fragmentation) and Replication
Chaps26.10

Goals of DDBMS
• Improve Response Time for DB Access
  - Use a More Sophisticated Load Control for Transaction Processing
  - However, Synchronization Across Sites May Introduce Additional Overhead
• System Availability
  - Site Independence in the Presence of Site Failure
  - Subset of Database is Always Available
  - Replication can Keep All Data Available, Even When Multiple Sites Fail
• Modularity
  - Incremental Growth with the Addition of Sites
  - Dedicate Sites to Specific Tasks
Chaps26.11

Advantages of DDBMS
• There are Four Major Advantages
1. Transparency
  - Distribution/NW Transparency
    - User Doesn’t Know about NW Configuration (Location Transparency)
    - User can Find Object at any Site (Naming Transparency)
  - Replication Transparency (see next slide)
    - User Doesn’t Know Location of Data
    - Replicas are Transparently Accessible
  - Fragmentation Transparency
    - Horizontal Fragmentation (Distribute by Row)
    - Vertical Fragmentation (Distribute by Column)
Chaps26.12

Data Distribution and Replication
Chaps26.13

Other Advantages of DDBMS
2. Increased Reliability and Availability
  - Reliability - System Always Running
  - Availability - Data Always Present
  - Achieved via Replication and Distribution
  - Ability to Make Single Query for Entire DDBMS
3. Improved Performance
  - Sites Able to Utilize Data that is Local for Majority of Queries
4. Easier Expansion
  - Improve Performance of Site by
    - Upgrading Processor of Computer
    - Adding Additional Disks
    - Splitting a Site into Two or More Sites
  - Expansion over Time as Business Grows
Chaps26.14

Challenges of DDBMS
• Tracking Data - Meta Data More Complex
  - Must Track Distribution (Where is the Data)
  - V & H Fragmentation (How is Data Split)
  - Replication (Multiple Copies for Consistency)
• Distributed Query Processing
  - Optimization, Accessibility, etc., More Complex
  - Block Analysis of Data Size Must also Now Consider the NW Transmitting Time
• Distributed Transaction Processing
  - TP Potentially Spans Multiple Sites
  - Submit Query to Multiple Sites
  - Collect and Collate Results
• Distributed Concurrency Control Across Nodes
Chaps26.15

Challenges of DDBMS
• Replicated Data Management
  - TP Must Choose the Replica to Access
  - Updates Must Modify All Replica Copies
• Distributed Database Recovery
  - Recovery of Individual Sites
  - Recovery Across DDBMS
• Security
  - Local and Remote Authorization
  - During TP, be Able to Verify Remote Privileges
• Distributed Directory Management
  - Meta-Data on Database - Local and Remote
  - Must Maintain Replicas of this - Every Site Tracks the Meta-Data for All Sites
Chaps26.16

A Complete Schema with Keys ...
• Keys Allow us to Establish Links Between Relations
Chaps26.17

…and Corresponding DB Tables
• which Represent Tuples/Instances of Each Relation
[Figure: DB tables for each relation]
Chaps26.18

…with Remaining DB Tables
Chaps26.19

What is Fragmentation?
• Fragmentation Divides a DB Across Multiple Sites
• Two Types of Fragmentation
  - Horizontal Fragmentation
    - Given a Relation R with n Total Tuples, Spread Entire Tuples Across Multiple Sites
    - Each Site has a Subset of the n Tuples
    - Essentially Fragmentation is a Selection
  - Vertical Fragmentation
    - Given a Relation R with m Attributes and n Total Tuples, Spread the Columns Across Multiple Sites
    - Essentially Fragmentation is a Projection
    - Not Generally Utilized in Practice
  - In Both Cases, Sites can Overlap for Replication
Chaps26.20

Horizontal Fragmentation
• A horizontal fragment is a subset of a relation that contains the tuples satisfying a selection condition.
• Consider the Employee relation with the condition DNO = 5.
• All tuples satisfying this condition form a subset that is a horizontal fragment of the Employee relation.
• A selection condition may be composed of several conditions connected by AND or OR.
• Derived horizontal fragmentation:
  - The partitioning of a primary relation is applied to other secondary relations that are related to it by foreign keys.
Chaps26.21

Horizontal Fragmentation
• Site 2 Tracks All Information Related to Dept. 5
Chaps26.22

Horizontal Fragmentation
• Site 3 Tracks All Information Related to Dept. 4
• Note that an Employee Could be Listed in Both Cases, if s/he Works on a Project for Both Departments
Chaps26.23

Refined Horizontal Fragmentation
• Further Fragment from Site 2 based on Dept. that Employee Works in
• Notice that G1 + G2 + G3 is the Same as WORKS_ON5
• There is no Overlap
Chaps26.24

Refined Horizontal Fragmentation
• Further Fragment from Site 3 based on Dept. that Employee Works in
• Notice that G4 + G5 + G6 is the Same as WORKS_ON4
• Note Some Fragments can be Empty
Chaps26.25

Vertical Fragmentation
• Subset of a relation created via a subset of columns.
  - A vertical fragment of a relation will contain values of selected columns.
  - There is no selection condition used in vertical fragmentation.
  - A strict vertical slice/partition
• Consider the Employee relation.
  - A vertical fragment can be created by keeping the values of Name, Bdate, Sex, and Address.
• Since there is no selection condition for creating a vertical fragment
  - Each fragment must include the primary key attribute of the parent relation Employee.
  - In this way, all vertical fragments of a relation are connected.
Chaps26.26

Vertical Fragmentation Example
• Partition the Employee Table as Below
• Notice Each Vertical Fragment Needs Key Column
[Figure: Employee table partitioned into the vertical fragments EmpDemo and EmpSupvrDept]
Chaps26.27

Homogeneous DDBMS
• Homogeneous
  - Identical Software (w.r.t. Database)
  - One DB Product (e.g., Oracle, DB2, Sybase) is Distributed and Available at All Sites
  - Uniformity w.r.t. Administration, Maintenance, Client Access, Users, Security, etc.
  - Interaction by Programmatic Clients is Consistent (e.g., JDBC or ODBC or REST API …)
Chaps26.28

Non-Federated Heterogeneous DDBMS
• Non-Federated Heterogeneous
  - Different Software (w.r.t. Database)
  - Multiple DB Products (e.g., Oracle at One Site, MySQL at another, Sybase, Informix, etc.)
  - Replicated Administration (e.g., Users Need Accounts on Multiple Systems)
  - Varied Programmatic Access - SWEs Must Know All Platforms/Client Software Complicated
  - Very Close to Shared Nothing Architecture
Chaps26.29

Federated DDBMS
• Federated
  - Multiple DBMS Platforms Overlaid with a Global Schema View
  - Single External Schema Combines Schemas from all Sites
• Multiple Data Models
  - Relational in one Component DBS
  - Object-Oriented in another DBS
  - Hierarchical in a 3rd DBS
Chaps26.30

Federated DBMS Issues
• Differences in Data Models
  - Reconcile Relational vs. Object-Oriented Models
  - Each Different Model has Different Capabilities
  - These Differences Must be Addressed in Order to Present a Federated Schema
• Differences in Constraints
  - Referential Integrity Constraints in Different DBSs
  - Different Constraints on “Similar” Data
  - Federated Schema Must Deal with these Conflicts
• Differences in Query Languages
  - SQL-89, SQL-92, SQL2, SQL3
  - Specific Types in Different DBMS (Oracle Blobs)
• Differences in Key Processing & Timestamping
Chaps26.31

Heterogeneous Distributed Database Systems
• Federated: Each site may run a different database system, but data access is managed through a single conceptual schema.
  - The degree of local autonomy is minimal.
  - Each site must adhere to a centralized access policy.
  - There may be a global schema.
• Multi-database: There is no single conceptual global schema.
  - For data access, a schema is constructed dynamically as needed by the application software.
[Figure: five sites (Site 1 to Site 5) connected by a communications network, running Relational, Object Oriented, Hierarchical, and Network DBMSs on Unix, Linux, and Window platforms]
Chaps26.32

Query Processing in Distributed Databases
• Issues
  - Cost of transferring data (files and results) over the network.
  - This cost is usually high, so some optimization is necessary.
  - Example relations: Employee at site 1 and Department at site 2
    – Employee at site 1. 10,000 rows. Row size = 100 bytes. Table size = $10^6$ bytes.
      Attributes: Fname, Minit, Lname, SSN, Bdate, Address, Sex, Salary, Superssn, Dno
    – Department at site 2. 100 rows. Row size = 35 bytes. Table size = 3,500 bytes.
      Attributes: Dname, Dnumber, Mgrssn, Mgrstartdate
  - Q: For each employee, retrieve the employee name and the name of the department where the employee works.
  - Q: $\pi_{Fname,Lname,Dname}(Employee \bowtie_{Dno = Dnumber} Department)$
Chaps26.33

Query Processing in Distributed Databases
• Result
  - The result of this query will have 10,000 tuples, assuming that every employee is related to a department.
  - Suppose each result tuple is 40 bytes long.
  - The query is submitted at site 3 and the result is sent to this site.
  - Problem: Employee and Department relations are not present at site 3.
Chaps26.34

Query Processing in Distributed Databases
• Strategies:
1. Transfer Employee and Department to site 3.
  - Total transfer bytes = 1,000,000 + 3,500 = 1,003,500 bytes.
2. Transfer Employee to site 2, execute the join at site 2, and send the result to site 3.
  - Query result size = 40 * 10,000 = 400,000 bytes. Total transfer size = 400,000 + 1,000,000 = 1,400,000 bytes.
3. Transfer Department relation to site 1, execute the join at site 1, and send the result to site 3.
  - Total bytes transferred = 400,000 + 3,500 = 403,500 bytes.
• Optimization criteria: minimizing data transfer.
  - Preferred approach: strategy 3.
Chaps26.35

Query Processing in Distributed Databases
• Consider the query
  - Q’: For each department, retrieve the department name and the name of the department manager
• Relational Algebra expression:
  - $\pi_{Fname,Lname,Dname}(Employee \bowtie_{Mgrssn = SSN} Department)$
Chaps26.36

Query Processing in Distributed Databases
• The result of this query has 100 tuples, assuming that every department has a manager. The execution strategies are:
1. Transfer Employee and Department to the result site and perform the join at site 3.
  - Total bytes transferred = 1,000,000 + 3,500 = 1,003,500 bytes.
2. Transfer Employee to site 2, execute the join at site 2, and send the result to site 3.
  - Query result size = 40 * 100 = 4,000 bytes. Total transfer size = 4,000 + 1,000,000 = 1,004,000 bytes.
3. Transfer Department relation to site 1, execute the join at site 1, and send the result to site 3.
  - Total transfer size = 4,000 + 3,500 = 7,500 bytes.
• Preferred strategy: Choose strategy 3.
Chaps26.37

Query Processing in Distributed Databases
• Now suppose the result site is 2. Possible strategies:
1. Transfer Employee relation to site 2, execute the query, and present the result to the user at site 2.
  - Total transfer size = 1,000,000 bytes for both queries Q and Q’.
2. Transfer Department relation to site 1, execute the join at site 1, and send the result back to site 2.
  - Total transfer size for Q = 400,000 + 3,500 = 403,500 bytes and for Q’ = 4,000 + 3,500 = 7,500 bytes.
Chaps26.38

DDBS Concurrency Control and Recovery
• Distributed Databases encounter a number of concurrency control and recovery problems which are not present in centralized databases, including:
  - Dealing with multiple copies of data items
    - How are they All Updated if Needed?
  - Failure of individual sites
    - How are Queries Restarted or Rerouted?
  - Communication link failure
    - Network Failure
  - Distributed commit
    - How to Know All Updates Done at all Sites?
  - Distributed deadlock
    - How to Detect and Recover?
Chaps26.39

The Next Big Challenge
• Interoperability
  - Heterogeneous Distributed Databases
  - Heterogeneous Distributed Systems
  - Autonomous Applications
• Scalability
  - Rapid and Continuous Growth
  - Amount of Data
  - Variety of Data Types
  - Dealing with personally identifiable information (PII) and personal health information (PHI)
  - Emergence of Fitness and Health Monitoring Apps
  - Google Fit and Apple HealthKit
  - New Apple ResearchKit for Medical Research
Chaps26.40

Interoperability: A Classic View
[Figure labels: Simple Federation, Multiple Nested Federation, Local Schema, Federated Integration, Federation, FDB Global Schema, FDB 1, FDB3, FDB Global Schema 4]
Chaps26.41

Java Client with Wrapper to Legacy Application
• Interactions Between Java Client and Legacy Appl. via C and RPC
• C is the Medium of Info. Exchange
• Java Client with C++/C Wrapper
[Figure labels: Java Client, Java Application Code, WRAPPER, JAVA LAYER, Mapping Classes, NATIVE LAYER, Native Functions (C++), RPC Client Stubs (C), Network, Legacy Application]
Chaps26.42

COTS and Legacy Appls. to Java Clients
• Java is Medium of Info. Exchange - C/C++ Appls with Java Wrappers
[Figure labels: COTS Application, Legacy Application, Java Application Code, JAVA NETWORK WRAPPER, NATIVE LAYER, Native Functions that Map to COTS Appl, Native Functions that Map to Legacy Appl, JAVA LAYER, Mapping Classes, Network, Java Client]
Chaps26.43

Java Client to Legacy App via RDBS
[Figure labels: Legacy Application, Extract and Generate Data, Transform and Store Data, Relational Database System (RDS), Transformed Legacy Data, Updated Data, Java Client]
Chaps26.44

Database Interoperability in the Internet
• Technology
  - Web/HTTP, JDBC/ODBC, CORBA (ORBs + IIOP), XML, SOAP, REST API, WSDL
• Architecture
  - Information Broker
  - Mediator-Based Systems
  - Agent-Based Systems
Chaps26.45

JDBC
• JDBC API Provides DB Access Protocols for Open, Query, Close, etc.
• Different Drivers for Different DB Platforms
[Figure labels: Java Application, JDBC API, Driver Manager, Oracle Driver, Access Driver, Sybase Driver]
Chaps26.46

Connecting a DB to the Web
• Web Servers are Stateless
• DB Interactions Tend to be Stateful
• Invoking a CGI Script on Each DB Interaction is Very Expensive, Mainly Due to the Cost of DB Open
[Figure labels: Browser, Internet, Web Server, CGI Script Invocation or JDBC Invocation, DBMS]
Chaps26.47

Connecting More Efficiently
• To Avoid Cost of Opening Database, One can Use Helper Processes that Always Keep Database Open and Outlive Web Connection
• Newly Invoked CGI Scripts Connect to a Preexisting Helper Process
• System is Still Stateless
[Figure labels: Browser, Internet, Web Server, CGI Script or JDBC Invocation, Helper Processes, DBMS]
Chaps26.48

DB-Internet Architecture
[Figure labels: WWW Client (Netscape), WWW Client (Info. Explore), WWW Client (HotJava), Internet, HTTP Server, DBWeb Dispatcher, DBWeb Gateway]
Chaps26.49

EJB Architecture
Chaps26.50

Technology Push
• Computer/Communication Technology (Almost Free)
  - Plenty of Affordable CPU, Memory, Disk, Network Bandwidth
  - Next Generation Internet: Gigabit Now
  - Wireless: Ubiquitous, High Bandwidth
• Information Growth
  - Massively Parallel Generation of Information on the Internet and from New Generation of Sensors
  - Disk Capacity on the Order of Peta-bytes
• Small, Handy Devices to Access Information
• The focus is to make information available to users, in the right form, at the right time, in the appropriate place.
Chaps26.51

Research Challenges
• Ubiquitous/Pervasive
  - Many computers and information appliances everywhere, networked together
• Inherent Complexity:
  - Coping with Latency (Sometimes Unpredictable)
  - Failure Detection and Recovery (Partial Failure)
  - Concurrency, Load Balancing, Availability, Scale
  - Service Partitioning
  - Ordering of Distributed Events
• “Accidental” Complexity:
  - Heterogeneity: Beyond the Local Case: Platform, Protocol, Plus All Local Heterogeneity in Spades.
  - Autonomy: Change and Evolve Autonomously
  - Tool Deficiencies: Language Support (Sockets, RPC), Debugging, Etc.
Chaps26.52

Infosphere
• Problem: too many sources, too much information
[Figure labels: Internet: Information Jungle; Infopipes; Sensors; Digital Earth; Personalized Filtering & Info. Delivery; Clean, Reliable, Timely Information, Anywhere]
Chaps26.53

Current State-of-Art – Has Mobile Changed This?
[Figure labels: Thin Client, Web Server, Mainframe Database Server]
Chaps26.54

Infosphere Scenario – Where Does Mobile Fit?
[Figure labels: Infotaps & Fat Clients, Sensors, Variety of Servers, Many sources, Database Server]
Chaps26.55

Heterogeneity and Autonomy
• Heterogeneity:
  - How Much can we Really Integrate?
  - Syntactic Integration
    - Different Formats and Models
    - XML/JSON/RDF/OWL/SQL Query Languages
  - Semantic Interoperability
    - Basic Research on Ontology, Etc.
    - DoD Maps (Grid, True, and Magnetic North)
• Autonomy
  - No Central DBA on the Net
  - Independent Evolution of Schema and Content
  - Interoperation is Voluntary
  - Interface Technology
    - DCOM: Microsoft Standard
    - CORBA, Etc...
Chaps26.56

Security and Data Quality
• Security
  - System Security in the Broad Sense
  - Attacks: Penetrations, Denial of Service
  - System (and Information) Survivability
    - Security Fault Tolerance
    - Replication for Performance, Availability, and Survivability
• Data Quality
  - Web Data Quality Problems
    - Local Updates with Global Effects
    - Unchecked Redundancy (Mutual Copying)
    - Registration of Unchecked Information
    - Spam on the Rise
Chaps26.57

Legacy Data Challenge
• Legacy Applications and Data
  - Definition: Important and Difficult to Replace
  - Typically, Mainframe Mission Critical Code
  - Most are OLTP and Database Applications
• Evolution of Legacy Databases
  - Client-server Architectures
  - Wrappers
  - Expensive and Gradual in Any Case
Chaps26.58

Potential Value Added/Jumping on Bandwagon
• Sophisticated Query Capability
  - Combining SQL with Keyword Queries
• Consistent Updates
  - Atomic Transactions and Beyond
• But Everything has to be in a Database!
  - Only If we Stick with Classic DB Assumptions
• Relaxing DB Assumptions
  - Interoperable Query Processing
  - Extended Transaction Updates
• Commodities DB Software
  - A Little Help is Still Good If it is Cheap
  - Internet Facilitates Software Distribution
  - Databases as Middleware
Chaps26.59

Concluding Remarks
• Goals of Distributed DBS
  - Support User Distribution Across Multiple Sites
  - Provide Location Transparency
  - Provide Transaction Control Akin to Centralized Case
  - Minimize Communications Cost
• Advantages of Distributed DBS
  - Transparency
  - Increased Reliability and Availability
  - Improved Performance
  - Easier Expansion
Chaps26.60
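
The transfer-cost comparisons on the query processing slides (queries Q and Q’ with result site 3) can be reproduced with a few lines of arithmetic. The following is a minimal sketch and not part of the original slides: the function names `transfer_costs` and `cheapest` and the strategy labels are hypothetical, and the only inputs are the sizes stated on the slides (Employee 1,000,000 bytes, Department 3,500 bytes, 40-byte result tuples).

```python
# Sizes taken from the slides; all names below are illustrative only.
EMPLOYEE_BYTES = 10_000 * 100      # 10,000 rows x 100 bytes
DEPARTMENT_BYTES = 100 * 35        # 100 rows x 35 bytes
RESULT_TUPLE_BYTES = 40


def transfer_costs(result_tuples):
    """Bytes shipped by each strategy when the result site is site 3."""
    result_bytes = result_tuples * RESULT_TUPLE_BYTES
    return {
        # 1. Ship both relations to site 3 and join there.
        "ship_both_to_site3": EMPLOYEE_BYTES + DEPARTMENT_BYTES,
        # 2. Ship Employee to site 2, join there, ship the result to site 3.
        "join_at_site2": EMPLOYEE_BYTES + result_bytes,
        # 3. Ship Department to site 1, join there, ship the result to site 3.
        "join_at_site1": DEPARTMENT_BYTES + result_bytes,
    }


def cheapest(costs):
    """Name of the strategy that moves the fewest bytes."""
    return min(costs, key=costs.get)


costs_q = transfer_costs(result_tuples=10_000)     # query Q
costs_q_prime = transfer_costs(result_tuples=100)  # query Q'

for label, costs in (("Q", costs_q), ("Q'", costs_q_prime)):
    for name, size in costs.items():
        print(f"{label} {name}: {size:,} bytes")
    print(f"{label} cheapest: {cheapest(costs)}")
```

Under these assumptions the sketch picks the third strategy for both queries, matching the preferred strategy on the slides. A real optimizer would also weigh local processing cost, which this sketch ignores.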

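The fragmentation slides describe horizontal fragmentation as a selection and vertical fragmentation as a projection that always keeps the primary key. A minimal sketch in plain Python follows; the sample rows, the helper names (`horizontal_fragment`, `vertical_fragment`, `rejoin`), and the exact column split between EmpDemo and EmpSupvrDept are assumptions made for illustration, not taken from the slides.

```python
# Toy Employee rows; the values are made up for illustration.
EMPLOYEE = [
    {"SSN": "111", "Name": "Ann", "Bdate": "1970-01-01", "Superssn": "333", "Dno": 5},
    {"SSN": "222", "Name": "Bob", "Bdate": "1965-06-15", "Superssn": "333", "Dno": 4},
    {"SSN": "333", "Name": "Cy", "Bdate": "1958-03-09", "Superssn": None, "Dno": 5},
]


def horizontal_fragment(rows, predicate):
    """Selection: keep whole tuples that satisfy the condition."""
    return [row for row in rows if predicate(row)]


def vertical_fragment(rows, columns, key="SSN"):
    """Projection: keep the chosen columns and always include the key."""
    keep = [key] + [c for c in columns if c != key]
    return [{c: row[c] for c in keep} for row in rows]


def rejoin(left, right, key="SSN"):
    """Rebuild full tuples by joining two vertical fragments on the key."""
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]} for row in left]


# Horizontal fragments by department (DNO = 5 and DNO = 4).
emp_d5 = horizontal_fragment(EMPLOYEE, lambda r: r["Dno"] == 5)
emp_d4 = horizontal_fragment(EMPLOYEE, lambda r: r["Dno"] == 4)

# Vertical fragments; both carry SSN so they stay connected.
emp_demo = vertical_fragment(EMPLOYEE, ["Name", "Bdate"])
emp_supvr_dept = vertical_fragment(EMPLOYEE, ["Superssn", "Dno"])

# Joining the vertical fragments on SSN recovers the original tuples.
rebuilt = rejoin(emp_demo, emp_supvr_dept)
```

In this sketch the union of the two horizontal fragments covers every Employee tuple exactly once, and the join of the two vertical fragments on SSN rebuilds the original rows, which is why each vertical fragment has to carry the key column.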