Minor Thesis
A scalable schema matching framework for relational databases
Student: Ahmed Saimon Adam
ID: 110022478
Award: MSc (Computer & Information Science)
Date: 17th September 2010
Supervisor: Dr. Jixue Liu

Field of thesis
• Schema matching
• Relational database integration

INTRODUCTION
• What is a database schema?
  ▫ The structure of a database, describing how its concepts, their relationships and constraints are arranged
• What is schema matching?
  ▫ The process of identifying semantic correspondences between elements of database schemas

Schema matching applications
  ▫ Critical task in any data sharing process
  ▫ Data warehousing
      Consolidation of multiple transaction processing databases
  ▫ Database integration processes
      Eg: two companies merge and integrate their employee, inventory and financial databases
  ▫ Cooperation between government agencies and various institutions
      Eg: police and transport departments, immigration and universities

Importance of the research
• Currently done manually and semi-automatically
• Doing it manually is tedious, error-prone and costly
• No fully automatic system is available; existing systems require user interaction
• Relevant to semantic query processing, the mobile web and e-commerce collaboration in enterprises
• Demand for more scalable, accurate and efficient schema matching technology is increasing

Research objectives
• Propose a framework that
  ▫ adopts a scalable architecture
  ▫ offers a library of schema matching algorithms that exploit various kinds of information for better accuracy
  ▫ is independent of any specific application domain

Methodology
• Build a framework by adopting a composite architecture
• Create a library of matchers at different levels
• Build a prototype and evaluate it empirically for accuracy, scalability and efficiency

Schema Matching Architecture
• Input
  ▫ Represented in SQL DDL format
      …..
      CREATE TABLE StudentDB.Student(
          studentId INT,
          studentName VARCHAR(100),
          studentPhone VARCHAR(50),
          PRIMARY KEY (studentId)
      );
      …..
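The DDL input above could be read by an input processor roughly as follows. This is a minimal Python sketch, not the thesis prototype: the names `split_top_level` and `extract_schema` are hypothetical, the DDL is written with a comma before the PRIMARY KEY clause as SQL requires, and only simple column definitions with a table-level PRIMARY KEY are handled.

```python
import re

# Example input, taken from the slide (comma added before PRIMARY KEY).
DDL = """
CREATE TABLE StudentDB.Student(
    studentId INT,
    studentName VARCHAR(100),
    studentPhone VARCHAR(50),
    PRIMARY KEY (studentId)
);
"""


def split_top_level(body):
    """Split a column list on commas that are not inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in body:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        parts.append(tail)
    return parts


def extract_schema(ddl):
    """Hypothetical input processor: element names, data types and keys."""
    schema = {}
    create = re.compile(
        r"CREATE\s+TABLE\s+([\w.]+)\s*\((.*?)\)\s*;",
        re.IGNORECASE | re.DOTALL,
    )
    for table, body in create.findall(ddl):
        columns, primary_key = {}, []
        for part in split_top_level(body):
            pk = re.match(r"PRIMARY\s+KEY\s*\((.*?)\)", part, re.IGNORECASE)
            if pk:
                primary_key = [c.strip() for c in pk.group(1).split(",")]
                continue
            # Column definition: a name followed by a type such as VARCHAR(100).
            col = re.match(r"(\w+)\s+(\w+(?:\s*\([^)]*\))?)", part)
            if col:
                columns[col.group(1)] = col.group(2)
        schema[table] = {"columns": columns, "primary_key": primary_key}
    return schema


schema = extract_schema(DDL)
print(schema["StudentDB.Student"]["primary_key"])  # ['studentId']
```

A real input processor would also need to handle inline constraints, foreign keys and vendor-specific types, which this sketch leaves out.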
Schema Matching Architecture
• Input
  ▫ Currently supports versions after Oracle9 and SQL Server 2000
      Uses a data type conversion table if the schemas come from different DBMSs
  ▫ Input processor extracts schema information
      Eg: element names, data types, keys

Schema Matching Architecture
• Process (schema matching)
  ▫ Implements multiple matching algorithms (matchers)
• Schema level
  ▫ Element name similarity algorithms
      Prefix, suffix, n-gram
      Tech = Technology (prefix matching)
      Phone = telephone (suffix matching)
      Context: Con, ont, nte, tex, ext (n-gram)
  ▫ Structural similarities
      Data type, field length etc.

Schema Matching Architecture
• Instance level
  ▫ Statistical data
      Statistical data obtained: eg. range, % alphanumeric characters, statistical properties (eg: mean, std. dev.), distinct values etc.
  ▫ Discovering complex correspondences
      Mining actual values
      Match different data types (gender: M, F = 1, 2)
      Ambiguity issues: Jaguar (car or animal)?

Schema Matching Architecture
• Output
  ▫ Similarity score between attributes obtained from each matching algorithm
      All scores normalized to between 0 and 1
  ▫ Match results stored in a similarity cube
      Attribute-level, table-level and schema-level similarities can be generated

Methodology
• Schema matching prototype in C# .NET

Experimental Evaluation
• Accuracy
  ▫ Tested on 2 small schemas of 10 tables each, with 2-10 attributes per table
  ▫ Checked results against a manually derived result
  ▫ 55-60% true matching
  ▫ Tested on a schema with 140 tables and 1360 attributes
      20-40% true matching
  ▫ Accuracy degrades as schema size increases

Experimental Evaluation
Efficiency
• Drastic fall in efficiency as schema size increases
[Figure: bar chart comparing Small Schema and Large Schema for Matcher 1 to Matcher 5; vertical axis from 0 to 20000, unit not given]

Conclusion
• A basic framework for schema matching is proposed
• Matching functions are performed independently for higher scalability, so that additional algorithms can be integrated easily
• Efficiency needs improvement by deploying hybrid matching algorithms
• Requires a variety of algorithms that assess similarity from different views to increase accuracy

END
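Supplementary sketch: the schema-level name matchers (prefix, suffix, n-gram) and the similarity cube from the Schema Matching Architecture slides could look roughly like this in Python. The slides do not give the scoring formulas, so the ones here are assumptions: prefix and suffix scores are the shared length divided by the shorter name's length, the n-gram score is the Dice coefficient over trigram sets, and the combined score is a plain mean over matchers. All function names are hypothetical, and this is not the C# .NET prototype.

```python
def prefix_sim(a, b):
    """Assumed scoring: shared prefix length / length of the shorter name."""
    a, b = a.lower(), b.lower()
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return shared / shorter


def suffix_sim(a, b):
    """Suffix matching is prefix matching on the reversed names."""
    return prefix_sim(a[::-1], b[::-1])


def ngrams(name, n=3):
    """Set of character n-grams, e.g. 'Context' -> con, ont, nte, tex, ext."""
    s = name.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_sim(a, b, n=3):
    """Assumed scoring: Dice coefficient over the two n-gram sets."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


# Each matcher runs independently and returns a score in [0, 1].
MATCHERS = {"prefix": prefix_sim, "suffix": suffix_sim, "ngram": ngram_sim}


def similarity_cube(source_attrs, target_attrs, matchers=MATCHERS):
    """Cube keyed by (matcher, source attribute, target attribute)."""
    return {
        (name, s, t): fn(s, t)
        for name, fn in matchers.items()
        for s in source_attrs
        for t in target_attrs
    }


def combine(cube):
    """Collapse the matcher axis by averaging (an assumed aggregation)."""
    scores = {}
    for (_, s, t), score in cube.items():
        scores.setdefault((s, t), []).append(score)
    return {pair: sum(v) / len(v) for pair, v in scores.items()}


cube = similarity_cube(["Phone"], ["telephone", "Tech"])
combined = combine(cube)
best = max(combined, key=combined.get)
print(best)  # ('Phone', 'telephone')
```

Because every matcher only fills its own slice of the cube, adding another algorithm means adding one entry to `MATCHERS`; table-level and schema-level scores could be derived by aggregating the attribute-level entries further.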