Minor Thesis
A scalable schema matching
framework for relational databases
Student: Ahmed Saimon Adam
ID: 110022478
Award: MSc (Computer & Information Science)
Date: 17th September 2010
Supervisor: Dr. Jixue Liu
Field of thesis
• Schema matching
• Relational database integration
INTRODUCTION
• What is a database schema?
▫ Structure of a database that describes how its
concepts, their relationships and constraints are
arranged
• What is Schema matching?
▫ process of identifying semantic correspondences
between elements of database schemas
Schema matching applications
▫ Critical task in any data sharing process
▫ Data warehousing
 Consolidation of multiple transaction processing databases
▫ Database integration processes
 E.g. two companies merge and integrate their employee, inventory
and financial databases
▫ Cooperation between government agencies and
various institutions
 E.g. police/transport department, immigration and universities
Importance of the research
• Currently done manually or semi-automatically
• Doing manually: tedious, error-prone, costly
• No fully automatic system available
 require user interaction
• Also needed for semantic query processing, the mobile web, e-commerce
and collaboration in enterprises
• Demand for more scalable, accurate, efficient
schema matching technology increasing
Research objectives
• Propose a framework that
▫ adopts a scalable architecture
▫ Offers a library of schema matching algorithms that
exploit various information for better accuracy
▫ is independent of any specific application domain
Methodology
• Build a framework by adopting a composite
architecture
• Create a library of matchers at different levels
• Build a prototype and perform empirical evaluation
on it to test accuracy, scalability and efficiency
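The composite idea above can be sketched as a library of independent matchers that are all run over the same attribute pairs. This is a minimal sketch in Python rather than the prototype's language; the names `Matcher`, `exact_name`, `shared_prefix`, `run_matchers` and `LIBRARY` are hypothetical and do not come from the thesis.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: a matcher maps two attribute names to a similarity in [0, 1].
Matcher = Callable[[str, str], float]
Scores = Dict[Tuple[str, str], float]


def exact_name(a: str, b: str) -> float:
    """Hypothetical matcher: 1.0 when the names are equal ignoring case."""
    return 1.0 if a.lower() == b.lower() else 0.0


def shared_prefix(a: str, b: str) -> float:
    """Hypothetical matcher: shared leading characters over the longer length."""
    a, b = a.lower(), b.lower()
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return shared / max(len(a), len(b), 1)


def run_matchers(source: List[str], target: List[str],
                 library: Dict[str, Matcher]) -> Dict[str, Scores]:
    """Run every matcher independently over every attribute pair."""
    return {
        name: {(s, t): matcher(s, t) for s in source for t in target}
        for name, matcher in library.items()
    }


# Adding a matcher only means adding one entry to the library.
LIBRARY: Dict[str, Matcher] = {"exact": exact_name, "prefix": shared_prefix}
scores = run_matchers(["studentId", "studentName"], ["StudentID", "name"], LIBRARY)
```

Because each matcher only sees a pair of names and returns a number, none of them depends on another, which is what lets new algorithms be added without touching the existing ones.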
Schema Matching Architecture
• Input
▫ Represented in SQL DDL format
…..
CREATE TABLE StudentDB.Student(
studentId INT,
studentName VARCHAR(100),
studentPhone VARCHAR(50),
PRIMARY KEY (studentId) );
…..
Schema Matching Architecture
• Input
▫ Currently supports versions after Oracle9 and SQL
Server 2000
 Uses a data type conversion table when the schemas come from different DBMSs
▫ Input processor extracts schema information
 Eg: element names, data types, keys
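A rough idea of what such an input processor might do, written in Python for illustration. It handles only a single, simple CREATE TABLE statement like the Student example; the function names and the contents of `TYPE_MAP` are assumptions, not the conversion table used in the thesis.

```python
import re

# Hypothetical conversion table: vendor-specific type name -> common type name.
TYPE_MAP = {"VARCHAR2": "VARCHAR", "NVARCHAR": "VARCHAR",
            "NUMBER": "NUMERIC", "INTEGER": "INT"}

DDL = """
CREATE TABLE StudentDB.Student(
studentId INT,
studentName VARCHAR(100),
studentPhone VARCHAR(50),
PRIMARY KEY (studentId) );
"""


def split_top_level(body):
    """Split on commas that are not nested inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in body:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        parts.append(tail)
    return parts


def parse_create_table(ddl):
    """Extract table name, columns (name, type, length) and primary key."""
    m = re.search(r"CREATE\s+TABLE\s+([\w.]+)\s*\((.*)\)\s*;", ddl, re.I | re.S)
    if m is None:
        raise ValueError("no CREATE TABLE statement found")
    table = {"name": m.group(1), "columns": [], "primary_key": []}
    for item in split_top_level(m.group(2)):
        pk = re.match(r"PRIMARY\s+KEY\s*\((.*)\)", item, re.I)
        if pk:
            table["primary_key"] = [c.strip() for c in pk.group(1).split(",")]
            continue
        col = re.match(r"(\w+)\s+(\w+)\s*(?:\((\d+)\))?", item)
        if col:
            name, dtype, length = col.groups()
            dtype = TYPE_MAP.get(dtype.upper(), dtype.upper())
            table["columns"].append(
                {"name": name, "type": dtype,
                 "length": int(length) if length else None})
    return table


table = parse_create_table(DDL)
```

A real input processor would need a proper SQL parser to cope with constraints, foreign keys and vendor-specific syntax; the regular expressions here only cover the shape shown on the slide.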
Schema Matching Architecture
• Process (schema matching)
▫ Implements multiple matching algorithms (matchers)
• Schema level
▫ Element names similarity algorithms
 Prefix, Suffix, n-gram
 Tech = Technology (prefix matching)
 Phone = telephone (suffix matching)
 Context → Con, ont, nte, tex, ext (n-gram)
▫ Structural similarities
 Data type, Field length etc.
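The three name matchers can be sketched in a few lines of Python. The slides do not say how the n-gram score is computed, so the Dice coefficient used below is an assumption, as are the function names.

```python
def prefix_similarity(a, b):
    """1.0 when the shorter name is a prefix of the longer one, else 0.0."""
    short, long_ = sorted((a.lower(), b.lower()), key=len)
    return 1.0 if short and long_.startswith(short) else 0.0


def suffix_similarity(a, b):
    """1.0 when the shorter name is a suffix of the longer one, else 0.0."""
    short, long_ = sorted((a.lower(), b.lower()), key=len)
    return 1.0 if short and long_.endswith(short) else 0.0


def ngrams(name, n=3):
    """Set of lower-cased character n-grams of a name."""
    name = name.lower()
    return {name[i:i + n] for i in range(len(name) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Dice coefficient over the two n-gram sets (an assumed scoring rule)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


print(prefix_similarity("Tech", "Technology"))
print(suffix_similarity("Phone", "telephone"))
print(sorted(ngrams("Context")))
print(ngram_similarity("phone", "telephone"))
```

The prefix and suffix matchers are all-or-nothing here, while the n-gram matcher gives partial credit: "phone" and "telephone" share three of their trigrams even though neither prefix test would fire.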
Schema Matching Architecture
• Instance Level
▫ Statistical data
 Statistical data obtained: eg. Range, % alphanumeric
characters, statistical properties (eg: mean, std.dev),
distinct values etc.
▫ Discovering complex correspondences
 Mining actual values
 Match different data types (gender : M,F = 1,2)
 Ambiguity issues: Jaguar (car or animal)?
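As an illustration of the instance-level ideas, the Python sketch below profiles a column with a few of the statistics listed above and then guesses a value mapping such as M,F = 1,2 by pairing values in order of frequency. The frequency-rank heuristic and the names `profile` and `map_codes_by_frequency` are assumptions made for this example, not the method used in the thesis.

```python
from collections import Counter
from statistics import mean, pstdev


def profile(values):
    """Summarise a column's values with a few simple statistics."""
    texts = [str(v) for v in values]
    numbers = []
    for text in texts:
        try:
            numbers.append(float(text))
        except ValueError:
            pass  # non-numeric value, ignored for the numeric statistics
    chars = "".join(texts)
    return {
        "distinct": len(set(texts)),
        "alnum_ratio": sum(c.isalnum() for c in chars) / len(chars) if chars else 0.0,
        "mean": mean(numbers) if numbers else None,
        "std_dev": pstdev(numbers) if numbers else None,
        "range": (min(numbers), max(numbers)) if numbers else None,
    }


def map_codes_by_frequency(col_a, col_b):
    """Pair the values of two columns by frequency rank.

    Returns None when the columns have different numbers of distinct values.
    """
    rank_a = [value for value, _ in Counter(col_a).most_common()]
    rank_b = [value for value, _ in Counter(col_b).most_common()]
    if len(rank_a) != len(rank_b):
        return None
    return dict(zip(rank_a, rank_b))


stats = profile(["1", "2", "3"])
mapping = map_codes_by_frequency(["M", "M", "M", "F", "F"], [1, 1, 1, 2, 2])
```

A mapping found this way is only a candidate: two columns with similar frequency distributions need not mean the same thing, which is the same kind of ambiguity as the Jaguar example.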
Schema Matching Architecture
• Output
▫ Similarity score between attributes obtained in each
matching algorithm
 All scores normalized between 0 and 1
▫ Match results in similarity cube
 Attribute level, table level, schema level similarities can
be generated
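One way to picture the similarity cube is as one score matrix per matcher, stacked on top of each other. The Python sketch below builds such a cube and collapses it to attribute-level and table-level scores. Plain averaging is an assumption, since the slides do not state how the scores are combined, and the two matchers are hypothetical stand-ins.

```python
def same_name(a, b):
    """Hypothetical matcher: names equal ignoring case."""
    return 1.0 if a.lower() == b.lower() else 0.0


def same_first_letter(a, b):
    """Hypothetical matcher: first letters equal ignoring case."""
    return 1.0 if a[:1].lower() == b[:1].lower() else 0.0


def build_cube(source_attrs, target_attrs, matchers):
    """cube[k][i][j] = score of matcher k for source attr i and target attr j."""
    return [
        [[min(1.0, max(0.0, matcher(s, t))) for t in target_attrs]
         for s in source_attrs]
        for matcher in matchers
    ]


def attribute_scores(cube):
    """Average the matcher layers into one attribute-by-attribute matrix."""
    layers = len(cube)
    rows, cols = len(cube[0]), len(cube[0][0])
    return [[sum(layer[i][j] for layer in cube) / layers for j in range(cols)]
            for i in range(rows)]


def table_score(matrix):
    """Average, over source attributes, of each one's best target match."""
    return sum(max(row) for row in matrix) / len(matrix)


cube = build_cube(["id", "name"], ["ID", "nickname"], [same_name, same_first_letter])
matrix = attribute_scores(cube)
score = table_score(matrix)
```

Repeating the same averaging over all table pairs would give a schema-level figure in the same way.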
Methodology
• Schema matching prototype in C# .NET
Experimental Evaluation
• Accuracy
▫ Tested on 2 small schemas of 10 tables each, with 2-10
attributes per table: 55-60% true matching
▫ Tested on a schema with 140 tables and 1360
attributes: 20-40% true matching
▫ Checked results against manually derived results
▫ Accuracy degrades as schema size increases
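Checking matcher output against a manually derived result could look like the Python sketch below. The slides do not say which measure "true matching" refers to, so the sketch reports both precision and recall; the attribute pairs are invented for the example and are not data from the thesis.

```python
def evaluate(found, expected):
    """Compare proposed correspondences with a manually derived set."""
    found, expected = set(found), set(expected)
    true_matches = found & expected
    return {
        # share of proposed matches that are correct
        "precision": len(true_matches) / len(found) if found else 0.0,
        # share of the manual matches that were found
        "recall": len(true_matches) / len(expected) if expected else 0.0,
    }


# Hypothetical manual result and matcher output.
expected = {
    ("Student.studentId", "Pupil.id"),
    ("Student.studentName", "Pupil.name"),
    ("Student.studentPhone", "Pupil.tel"),
}
found = {
    ("Student.studentId", "Pupil.id"),
    ("Student.studentName", "Pupil.name"),
    ("Student.studentPhone", "Pupil.name"),
    ("Student.studentId", "Pupil.tel"),
}
result = evaluate(found, expected)
```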
Experimental Evaluation
Efficiency
• Drastic fall in efficiency as schema size increases
[Figure: bar chart comparing Matcher 1 to Matcher 5 on the Small Schema and the Large Schema; vertical axis from 0 to 20000, units not given]
Conclusion
• A basic framework for schema matching is proposed
• Matching functions performed independently for
higher scalability so that additional algorithms can
be integrated easily
• Needs improvement in efficiency by deploying hybrid
matching algorithms
• Requires various different algorithms to assess
similarities from different views and increase
accuracy
END