* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Unity_lawrence
Extensible Storage Engine wikipedia , lookup
Microsoft SQL Server wikipedia , lookup
Concurrency control wikipedia , lookup
Ingres (database) wikipedia , lookup
Open Database Connectivity wikipedia , lookup
Functional Database Model wikipedia , lookup
Entity–attribute–value model wikipedia , lookup
Relational algebra wikipedia , lookup
Microsoft Jet Database Engine wikipedia , lookup
Clusterpoint wikipedia , lookup
Versant Object Database wikipedia , lookup
Automatic Integration of
Relational Database Systems
Ramon Lawrence
University of Manitoba
umlawren@cs.umanitoba.ca
Outline
 Introduction,
Motivation, and Background
 Our integration approach
 The integration architecture
 Standard
 Example
dictionary, X-Specs, query processor
integration
 Northwind,
 Querying
Southstorm databases
the integrated databases
 Generating
SQL queries from semantic queries
 Unity
implementation
 Contributions, Conclusions, and Future Work
Page 2
Database Terminology
 Database
system - is a database and a system to manage
the data.
 Schema - is a description of the data organization and
format in a database.
 Schema integration - is the process of combining local
schemas into a global, integrated view by resolving
conflicts present between the schemas.
 Data integration - is the process of combining data at the
entity-level. It requires resolving representational
conflicts and determining equivalent keys.
 Multidatabase system (MDBS) - is a collection of
autonomous, local databases participating in a global
database system to share data.
Page 3
What is Integration?
 Two
levels of integration:
 Schema
integration - the description of the data
 Data integration - the individual data instances
 Integration
problems include:
 Different
data models and conflicts within a model
 Incompatible concept representations
 Different user or view perspectives
 Naming conflicts (homonym, synonym)
 Integration
handles the different mechanisms for
storing data (structural conflicts), for referencing
data (naming conflicts), and for attributing
meaning to the data (semantic conflicts).
Page 4
Why is Integration Required?
 There
are many integration environments:
 Operational
systems within an organization
 System integration during company merger
 Data warehouses, Intranets, and the WWW
 Users
require information from many data
sources which often do not work together.
 Companies require a global view of their entire
operations which may be present in numerous
operational databases for different departments
and distributed geographically.
 E-commerce demands integration of web
databases with production systems.
Page 5
What is the Current Solution?
 Manual
Integration Algorithms:
 Allow
designer to detect and resolve conflicts
 Manipulate information using semantic models
 Knowledge
 Cyc
 Global
bases/Artificial Intelligence:
knowledge base and Carnot project
Dictionaries and Lexical Semantics:
 Wordnet,
Clio, Summary schemas model
 Concept hierarchies (Castano)
Page 6
What is the Current Solution? (2)
 SQL
and multidatabase query languages:
 SQL,
MSQL, IDL, DIRECT, SchemaSQL
 Requires user to understand DB structure & semantics
 Wrapper
and mediator systems:
 Information
Manifold, TSIMMIS, Infomaster
 Use query languages or description logics
 Focus on query rewriting and reformulation
 Industrial
standards:
 XML,
BizTalk, E-commerce portals
 Apply to limited domains/industries
 Require standard structures and database changes
Page 7
Previous Work Summary
 Current
techniques for database integration have
some of these problems:
 Require
integrator to understand all databases
 Integration process is manual
 Do not hide system complexity from the user
 Force changes on the existing database systems
 Construct global view manually
 Suffer from query imprecision (query containment)
Page 8
Our Approach
 Our
approach combines standardization and
query mapping algorithms.
 The major idea is that schema conflicts can be
resolved if we:
 Eliminate
all naming conflicts
 Define a language capable of determining schema
equivalence and performing transformations
 Naming
conflicts are eliminated by accepting a
standard term dictionary.
 Not
a knowledge base or set of mediated views
 Leverages semantic information in English words
Page 9
Integration Architecture
Client
Client
Multidatabase Layer
• user’s view of integration
2) X-Spec Editor
Integrated Context View
X-Spec
Editor
Standard
Dictionary
Architecture Components:
1) Integrated Context View
• stores schema & metadata
• uses XML
Integration
Algorithm
3) Standard Dictionary
• terms to express semantics
4) Integration Algorithm
Query Processor and ODBC Manager
5) Query Processor
Subtransactions
X-Spec
X-Spec
Database
Database
Local Transactions
• combines X-Specs into
integrated context view
• accepts query on view
• determines data source
mappings and joins
• executes queries and
formats results
Architecture Components
 The
architecture consists of four components:
A
standard dictionary (SD) to capture data semantics
 SD terms are used to build semantic names describing
semantics of schema elements.
 X-Specs
for storing data semantics
 Database metadata and semantic names stored using XML
 Integration
Algorithm
 Matches concepts in different databases by semantic names.
 Produces an integrated view of all database concepts.
 Query
Processor
 Allows the user to formulate queries on the view.
 Translates from semantic names in integrated view to SQL
queries and integrates and formats results.
Involves determining correct field and table mappings and
discovery of join conditions and join paths
Page 11
Integration Processes
 The
integration architecture consists of three
separate processes:
 Capture
process: independently extracts database
schema information and metadata into a XML
document called a X-Spec.
 Integration
process: combines X-Specs into a
structurally-neutral hierarchy of database concepts
called an integrated context view.
 Query
process: allows the user to formulate queries on
the integrated view that are mapped by the query
processor to structural queries (SQL) and the results
are integrated and formatted.
Page 12
Integration Architecture:
The Capture Process
Relational
Schema
Automatic
Extraction
Specification
Editor
Standard
Dictionary
 Capture
X-Spec
DBA Lookup
of terms
process involves:
 Automatically
extracting the schema information and
metadata using a specification editor
 Assigning semantic names to each schema element
(tables and fields) to capture their semantics
Page 13
Architecture Components:
The Standard Dictionary
A
standard dictionary (SD) provides standardized
terms to capture data semantics.
 Hierarchy
of terms related by IS-A or Has-A links
 Contains base set of common database concepts, but
new concepts can be added
A
SD term is a single, unambiguous semantic
definition.
 Several
SD entries for a single English word are
required if the word has multiple definitions.
 The
top-level dictionary terms are those
proposed by Sowa.
Page 14
Architecture Components:
Dictionary vs. Knowledge Base
 The
standard dictionary differs from a knowledge
base such as Cyc because:
 Not
intended to be a general English dictionary or
contain knowledge facts about the world
 Dictionary is evolved as new terms are required
 Not all English words are used
 Dictionary
provides the systems with no “knowledge”
 Since no facts are stored, system cannot deduce new facts
 Dictionary terms are just semantic place holders, integrators
determine the semantics of the database not the system
 Simplified
organization
 Dictionary is organized as a tree for efficiency and simplicity
in determining related concepts
 Re-use
of terms
 Terms are re-used in semantic names
Page 15
Architecture Components:
Using the Standard Dictionary
 SD
terms are used to build semantic names
describing semantics of schema elements.
 Semantic names have the form:
 semantic
name := [CT_Type] | [CT_Type] CN
 CT_Type := CT | CT {; CT} | CT {,CT}
 CT := context term, CN := concept name
 each CT and CN is a single term from the SD
 Semantic
names are included in specifications
describing a database.
Page 16
Northwind & Southstorm
Integration Example
Northwind Database Schema
Tables
Categories
Customers
Employees
OrderDetails
Order
Products
Shippers
Suppliers
Fields
CategoryID, CategoryName
CustomerID, CompanyName
EmployeeID, LastName, FirstName
OrderID, ProductID, UnitPrice, Quantity
OrderID, CustomerID, EmployeeID, OrderDate,
Shipvia
ProductID, ProductName, SupplierID, CategoryID
ShipperID, CompanyName
SupplierID, CompanyName
Page 17
Northwind & Southstorm
Integration Example (2)
Southstorm Database Schema
Tables
Fields
Orders_tb Order_num, Cust_name, Emp_name, Item1_id, Item1_qty,
Item1_price, Item2_id, Item2_qty, Item2_price
Page 18
Integration Example (3)
Northwind Semantic Name Mappings
Type
T
F
F
T
F
F
T
F
F
F
T
F
F
F
F
Semantic Name
[Category]
[Category] Id
[Category] Name
[Customer]
[Customer] Id
[Customer] Name
[Employee]
[Employee] Id
[Employee] Last Name
[Employee] First Name
[Order;Product]
[Order] Id
[Order;Product] Id
[Order;Product] Price
[Order;Product]
Quantity
System Name
Categories
CategoryID
CategoryName
Customers
CustomerID
CompanyName
Employees
EmployeeID
LastName
FirstName
OrderDetails
OrderID
ProductID
UnitPrice
Quantity
Type
T
F
F
F
F
F
T
F
F
F
F
T
F
F
T
F
F
Semantic Name
[Order]
[Order] Id
[Order;Customer] Id
[Order;Employee] Id
[Order] Date
[Order;Shipper] Id
[Product]
[Product] Id
[Product] Name
[Product;Supplier] Id
[Product;Category] Id
[Shipper]
[Shipper] Id
[Shipper] Name
[Supplier]
System Name
Orders
OrderID
CustomerID
EmployeeID
OrderDate
Shipvia
Products
ProductID
ProductName
SupplierID
CategoryID
Shippers
ShipperID
ShipperName
Suppliers
[Supplier] Id
[Supplier] Name
SupplierID
SupplierName
Page 19
Northwind & Southstorm
Integration Example (4)
Southstorm Semantic Name Mappings
Type
Table
Field
Field
Table
Field
Field
Table
Field
Field
Field
Semantic Name
[Order]
[Order] Id
[Order;Customer] Name
[Order;Employee] Name
[Order;Product] Id
[Order;Product] Quantity
[Order;Product] Price
[Order;Product] Id
[Order;Product] Quantity
[Order;Product] Price
System Name
Orders_tb
Order_num
Cust_name
Emp_name
Item1_id
Item1_qty
Item1_price
Item2_id
Item2_qty
Item2_price
Page 20
What is a semantic name?
A
semantic name is a universal, semantic
identifier in a domain.
 Similar
to a field name in the Universal Relation.
 Semantics are guaranteed unique by construction.
 System has mechanism for comparing semantics
across domains even though it does not understand
them. (Exploiting semantics in English words.)
 Important
 context
definitions:
- a semantic name is a context if it maps to a table
 concept - a semantic name is a concept if it maps to a field
 context closure - of semantic name Si denoted Si* is the
set of semantic names produced by taking ordered subsets
of the terms of Si = {T1, T2 , … TN} starting with T1.
Page 21
Architecture Components:
X-Specs
 Database
metadata and semantic names are
combined into specifications called X-Specs:
 Stored
and transmitted using XML
 Contains information on a relational schema
 Organized into database, table, and field levels
 Stores semantic names to describe and integrate
schema elements
Page 22
Southstorm X-Spec
<?xml version="1.0" ?>
<Schema name = "Southstorm_xspec.xml” xmlns="urn:schemas-microsoft-com:xml-data"
xmlns:dt="urn:schemas-microsoft-com:datatypes">
<ElementType name="[Order]" sys_name = "Orders_tb" sys_type="Table">
<element type = "[Order] Id" sys_name = "Order_num" sys_type = "Field"/>
<element type = "[Order] Total Amount" sys_name = "Order_total" sys_type = "Field"/>
<element type = "[Order;Customer] Name" sys_name = "Cust_name" sys_type = "Field"/>
<element type = "[Order;Customer;Address] Address Line 1" sys_name="Cust_address"
sys_type="Field"/>
<element type = "[Order;Customer;Address] City" sys_name = "Cust_city" sys_type = "Field"/>
<element type = "[Order;Customer;Address] Postal Code" sys_name="Cust_pc" sys_type="Field"/>
<element type = "[Order;Customer;Address] Country" sys_name="Cust_country" sys_type="Field"/>
<element type = "[Order;Product] Id" sys_name = "Item1_id" sys_type = "Field"/>
<element type = "[Order;Product] Quantity" sys_name = "Item1_quantity" sys_type = "Field"/>
<element type = "[Order;Product] Price" sys_name = "Item1_price" sys_type = "Field"/>
<element type = "[Order;Product] Id" sys_name = "Item2_id" sys_type = "Field"/>
<element type = "[Order;Product] Quantity" sys_name = "Item2_quantity" sys_type = "Field"/>
<element type = "[Order;Product] Price" sys_name = "Item2_price" sys_type = "Field"/>
</ElementType>
</Schema>
Page 23
Architecture Components:
Integrating X-Specs
 Each
database to be integrated is described
using a X-Spec.
 Identical concepts in different databases are
identified by similar semantic names.
 Concepts with identical (or hierarchially related)
semantic names are combined regardless of
their physical representation in the individual
databases.
Page 24
Integration Architecture:
The Integration Process
 Integration
process involves:
 Automatically
identifying identical concepts by
matching semantic names
 Constructing a global view of database concepts
consisting of a hierarchy of concept terms
 Resolving structural differences during query
generation and submission (e.g. a concept may be
represented as a table in one database and a field
(attribute) in another)
Page 25
Integration Product:
The Integrated Context View
 The
product of the integration is a structurallyneutral hierarchy of concepts called an integrated
context view.
 Define a context view (CV) as follows:
 If
a semantic name Si is in CV, then for any Sj in Si*, Sj is
also in CV.
 For each semantic name Si in CV, there exists a set of zero
or more mappings Mi that associate a schema element Ej
with Si.
 A semantic name Si can only occur once in the CV.
A
context view (CV) is a valid Universal Relation.
 Each
field is assigned a semantic name which uniquely
identifies its semantic connotation.
Page 26
Northwind & Southstorm
Integration Example
Integrated Context View
Integrated View
Term
V (view root)
- [Category]
- Id
- Name
- [Customer]
- Id
- Name
- [Employee]
- Id
- [Name]
- First Name
- Last Name
- [Product]
- Id
- Name
- [Supplier]
- Id
- [Category]
- Id
Data Source Mappings
(not visible to user)
Integrated View
Term
Data Source Mappings
(not visible to user)
N/A
NW.Categories
NW.Categories.CategoryID
NW.Categories.CategoryName
NW.Customers
NW.Customers.CustomerID
NW.Customers.CompanyName
NW.Employees
NW.Employees.EmployeeID
V (view root) (cont.)
- [Order]
-Id
- [Customer]
- Id
- Name
- [Employee]
- Id
- Name
- [Product]
- Id
- Price
- Quantity
- [Shipper]
- Id
- Name
- [Supplier]
- Id
- Name
N/A
NW.Orders, SS.Orders_tb
NW.[Orders,OrderDetails].OrderID, SS.Orders_tb.Order_num
NW.Employees.FirstName
NW.Employees.LastName
NW.Products
NW.Products.PrdouctID
NW.Products.ProductName
NW.Products.SupplierID
NW.Products.CategoryID
NW.Orders.CustomerID
SS.Orders_tb.Cust_name
NW.Orders.EmployeeID
SS.Orders_tb.Emp_name
NW.OrderDetails
NW.OrderDetails.ProductID, SS.Orders_tb.Item[1,2]_id
NW.OrderDetails.UnitPrice, SS.Orders_tb.Item[1,2]_price
NW.OrderDetails.Quantity, SS.Orders_tb.Item[1,2]_qty
NW.Shippers
NW.Shippers.ShipperID
NW.Shippers.ShipperName
NW.Suppliers
NW.Suppliers.SupplierID
NW.Suppliers.SupplierName
Page 27
Architecture Components:
The Query Processor
 The
query processor:
 Allows
the user to formulate queries on the view.
 Translates from semantic names in the context view to
structural queries (SQL) on databases.
 Involves determining correct field and table mappings and
discovery of join conditions and join paths
 Retrieves
query results and formats them for display to
the user.
 Client-side
query processing:
 Perform
joins between databases using common keys.
 Data value formatting and transformation
Page 28
The Query Processor:
Determining field/table mappings
 For
each database (D) in the context view
 For
each semantic name (S) in query
 If S has only one semantic name mapping in D Then
Add field mapping to query and its parent table
 Else If S has multiple mappings but all in one table Then
Add each field mapping to query and the parent table
 Else S has multiple mappings in more than one table Then
If any field mapping has a table already in query take that one
Else take field mapping with best semantic name match
Else take first mapping found
 End If
 Next
 Next
Page 29
The Query Processor:
Constructing Join Graphs
 Given
a set of fields (F) and tables (T) to access,
joins are applied to connect the tables.
 A join graph is an undirected graph where:
 Each
node Ni is a table in the database.
 There is a link from node Ni to node Nj if there is a join
between the two tables.
A
join path is a sequence of joins connecting two
nodes in the graph.
 A join tree is a set of joins connecting two or
more nodes.
 A join matrix M stores the shortest join paths
between any two nodes (tables).
Page 30
The Query Processor:
Join Graph for Northwind
1
Suppliers
N
N
1
Categories
Products
1
Products
N
OrderDetails
1
Shippers
N
N
1
N
Orders
N
Products
1
Employees
Products
1
Customers
Page 31
The Query Processor:
Join Discovery Results
 Join
discovery in a database with a connected,
acyclic join graph and a join matrix M:
 There
exists only one join tree for any set of tables .
 The joins required to connect a table set T is found by
taking any Ti of T and unioning the join paths in
M[Ni,N1], M[Ni,N2], ... M[Ni,Nn] where N1,N2,..Nn are the
nodes corresponding to the set of tables T.
 For
a cyclic join graph:
 There
may exist more than one join tree for a set of
tables and each tree may have different semantics.
 Can allow the user to uniquely determine join tree by
graphically displaying join conditions to the user as
they browse the context view.
Page 32
Advanced Query Processing
 Advanced
query processor features include:
 global
keys and joins - a mechanism for specifying
when a field stores a global key such as a social
security number.
 result normalization - a procedure for normalizing
query results returned from each individual database.
(e.g. Southstorm)
 data integration - transforming data representational
conflicts at the global level.
 For example, “M” and “F” may represent “Male” and
“Female” in one database, and another may represent these
concepts using “0” and “1”.
Page 33
Northwind & Southstorm
Query Examples
 Example
1: Retrieve all order ids ([Order] Id) and
customers ([Customer] Name):
 SS:
SELECT Order_num, Cust_name FROM Orders_tb
 NW: SELECT OrderID, CompanyName FROM Orders,
Customers WHERE Orders.CustomerID =
Customers.CustomerID
 Example
2: Retrieve all ordered products
([Order;Product] Id) and their order ids.
 SS:
SELECT Order_num, Item1_id, Item2_id FROM
Orders_tb
 NW: SELECT OrderID, ProductID FROM OrderDetails
 Note: In NW, selects from two different order id mappings.
In SS, result normalization is required.
Page 34
Integration Example:
Discussion
 Important
points:
 System
table and field names are not presented to the
user who queries based on semantic names.
 Database structure is not shown to the user.
 Field and table mappings are automatically
determined based on X-Spec information.
 Join conditions are inserted as needed when available
to join tables.
 Different physical representations for the same
concept are combined.
 Hierarchically related concepts are combined based
on their IS-A relationship in the standard dictionary.
Page 35
Unity Overview
 Unity
is a software package that implements the
integration architecture with a GUI.
 Developed using Microsoft Visual C++ 6 and
Microsoft Foundation Classes (MFC).
 Unity allows the user to:
 Construct
and modify standard dictionaries
 Build X-Specs to describe data sources
 Integrate X-Specs into an integrated view
 Transparently query integrated systems using ODBC
and automatically generate SQL transactions
 Unity
is available for demonstration and
distribution.
Page 36
Architecture Discussion
 The
architecture automatically integrates
relational schemas into a multidatabase.
 Desirable properties:
 Individual
mappings - information sources integrated
one-at-a-time and independently
 Integrated view constructed for query transparency user queries system by semantics instead of structure
 Handles schema conflicts - including semantic,
structural, and naming conflicts
 Automated integration - integrated view constructed
efficiently and automatically
 No wrapper or mediator software is required
 Transparent querying - users issue semantic queries
which are translated to SQL by the query processor Page 41
Contributions
 Architecture
contributions:
 Has
an unique application of a standard dictionary
which is not a knowledge base
 Separates the capture and integration processes
 Allows transparent querying without structure
 Provides algorithms for dynamically extracting
database data (creating relevant views)
 Algorithms for mediation of global level conflicts
(global keys, normalization, etc.)
 Arguably simpler method for capturing data semantics
than using description logic
 An implementation, Unity, which demonstrates the
practical benefits of the architecture
Page 42
Conclusions
 Automatic
database integration is possible by
using a standard term dictionary and defining
semantic names for schema elements.
 Integration of data sources has applications to
the WWW and construction of data warehouses.
 Users are able to transparently query integrated
systems by concept instead of structure.
Page 43
Future Work
 The
integration architecture is evolving with
standards on XML and captures metadata
information in XML documents.
 We are constantly refining Unity.
 Develop
an integration component for a web browser
 The
query processor is being extended to
resolve more complex queries and conflicts.
 Test the system in large industrial projects.
 Allow distributed updates and global updates on
all databases.
Page 44
References
 Publications:
 Unity
- A Database Integration Tool, R. Lawrence and
K. Barker, TRLabs Emerging Technology Bulletin,
January 2000.
 Multidatabase Querying by Context, R. Lawrence and
K. Barker, DataSem2000, pages 127-136, Oct. 2000.
 Integrating Relational Database Schemas using a
Standardized Dictionary, To appear in SAC’2001 - ACM
Symposium on Applied Computing, March, 2001.
 Sponsors:
 NSERC,
 Further
TRLabs
Information:
 http://www.cs.umanitoba.ca/~umlawren/
Page 45
					 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
                                             
                                             
                                             
                                             
                                             
                                             
                                             
                                             
                                             
                                             
                                            