Minor Thesis
A scalable schema matching framework for relational databases

Student: Ahmed Saimon Adam
ID: 110022478
Award: MSc (Computer & Information Science)
Date: 17th September 2010
Supervisor: Dr. Jixue Liu

Field of thesis
• Schema matching
• Relational database integration

INTRODUCTION
• What is a database schema?
  ▫ The structure of a database, describing how its concepts, their relationships and constraints are arranged
• What is schema matching?
  ▫ The process of identifying semantic correspondences between elements of database schemas

Schema matching applications
▫ A critical task in any data sharing process
▫ Data warehousing
  ▪ Consolidation of multiple transaction processing databases
▫ Database integration processes
  ▪ E.g. two companies merge and integrate their employee, inventory and financial databases
▫ Cooperation between government agencies and various institutions
  ▪ E.g. police and transport departments, immigration and universities

Importance of the research
• Schema matching is currently done manually or semi-automatically
• Manual matching is tedious, error-prone and costly
• No fully automatic system is available
  ▪ Existing systems require user interaction
• Semantic query processing, mobile web, e-commerce collaboration in enterprises
• Demand for more scalable, accurate and efficient schema matching technology is increasing

Research objectives
• Propose a framework that
  ▫ adopts a scalable architecture
  ▫ offers a library of schema matching algorithms that exploit various kinds of information for better accuracy
  ▫ is independent of any specific application domain

Methodology
• Build a framework by adopting a composite architecture
• Create a library of matchers at different levels
• Build a prototype and evaluate it empirically for accuracy, scalability and efficiency

Schema Matching Architecture
• Input
  ▫ Represented in SQL DDL format

    …..
    CREATE TABLE StudentDB.Student(
        studentId INT,
        studentName VARCHAR(100),
        studentPhone VARCHAR(50),
        PRIMARY KEY (studentId)
    );
    …..
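As an illustration of how a DDL statement like the one above could be reduced to element names, data types and keys, here is a minimal Python sketch. It is not the thesis prototype (which the slides say was written in C# .NET); the function names `split_top_level` and `extract_schema` and the output layout are assumptions made for this example, and the regular expressions only cover simple `CREATE TABLE` statements of the form shown on the slide. A comma is placed before `PRIMARY KEY`, as standard SQL requires.

```python
import re

DDL = """
CREATE TABLE StudentDB.Student(
    studentId INT,
    studentName VARCHAR(100),
    studentPhone VARCHAR(50),
    PRIMARY KEY (studentId)
);
"""

# Hypothetical patterns for this sketch; they handle only simple statements.
CREATE_TABLE = re.compile(r"CREATE\s+TABLE\s+([\w.]+)\s*\((.*?)\)\s*;", re.I | re.S)
PRIMARY_KEY = re.compile(r"PRIMARY\s+KEY\s*\((.*?)\)", re.I)
COLUMN = re.compile(r"(\w+)\s+(\w+)(?:\s*\((\d+)\))?")


def split_top_level(body):
    """Split a column list on commas that are not inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in body:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        parts.append(tail)
    return parts


def extract_schema(ddl):
    """Return {table: {"columns": [(name, type, length)], "primary_key": [names]}}."""
    tables = {}
    for table, body in CREATE_TABLE.findall(ddl):
        columns, primary_key = [], []
        for part in split_top_level(body):
            pk = PRIMARY_KEY.match(part)
            if pk:
                primary_key = [c.strip() for c in pk.group(1).split(",")]
                continue
            col = COLUMN.match(part)
            if col:
                name, dtype, length = col.groups()
                columns.append((name, dtype.upper(), int(length) if length else None))
        tables[table] = {"columns": columns, "primary_key": primary_key}
    return tables


schema = extract_schema(DDL)
```

The extracted tuples carry the information the schema-level matchers need: the name for string comparison, and the data type and field length for structural comparison.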
Schema Matching Architecture
• Input
  ▫ Currently supports versions after Oracle9 and SQL Server 2000
    ▪ Uses a data type conversion table when the schemas come from different DBMSs
  ▫ Input processor extracts schema information
    ▪ E.g. element names, data types, keys

Schema Matching Architecture
• Process (schema matching)
  ▫ Implements multiple matching algorithms (matchers)
• Schema level
  ▫ Element name similarity algorithms
    ▪ Prefix, suffix, n-gram
    ▪ Tech = Technology (prefix matching)
    ▪ Phone = telephone (suffix matching)
    ▪ Context → Con, ont, nte, tex, ext (n-gram)
  ▫ Structural similarities
    ▪ Data type, field length, etc.

Schema Matching Architecture
• Instance level
  ▫ Statistical data
    ▪ E.g. range, % of alphanumeric characters, statistical properties (e.g. mean, standard deviation), distinct values
  ▫ Discovering complex correspondences
    ▪ Mining actual values
    ▪ Matching different data types (gender: M, F = 1, 2)
    ▪ Ambiguity issues: Jaguar (car or animal)?

Schema Matching Architecture
• Output
  ▫ Similarity score between attributes obtained from each matching algorithm
    ▪ All scores normalized to the range 0 to 1
  ▫ Match results stored in a similarity cube
    ▪ Attribute-level, table-level and schema-level similarities can be generated

Methodology
• Schema matching prototype in C# .NET

Experimental Evaluation
• Accuracy
  ▫ Results checked against a manually derived result
  ▫ Tested on 2 small schemas of 10 tables, each with 2-10 attributes
    ▪ 55-60% true matching
  ▫ Tested on a schema with 140 tables and 1360 attributes
    ▪ 20-40% true matching
  ▫ Accuracy degrades as schema size increases

Experimental Evaluation
• Efficiency
  ▫ Drastic fall in efficiency as schema size increases
  [Chart: Small Schema vs. Large Schema for Matcher 1 to Matcher 5; vertical axis from 0 to 20000, unit not given]

Conclusion
• A basic framework for schema matching is proposed
• Matching functions are performed independently for higher scalability, so that additional algorithms can be integrated easily
• Efficiency needs improvement, which could come from deploying hybrid matching algorithms
• Various different algorithms are required to assess similarities from different views and increase accuracy

END
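The element name matchers listed under the schema level (prefix, suffix, n-gram) and the similarity cube from the output stage can be sketched as follows. The slides do not give the scoring formulas, so everything here is an assumption chosen for illustration: prefix and suffix scores are the shared length divided by the shorter name's length, the n-gram score is the Dice coefficient over trigram sets, and the cube is a nested dictionary indexed by matcher, source attribute and target attribute. Python is used for brevity, although the prototype was written in C# .NET.

```python
def common_prefix_len(a, b):
    """Number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_score(a, b):
    """Shared prefix length over the shorter name's length (assumed formula)."""
    a, b = a.lower(), b.lower()
    if not a or not b:
        return 0.0
    return common_prefix_len(a, b) / min(len(a), len(b))


def suffix_score(a, b):
    """Shared suffix length over the shorter name's length (assumed formula)."""
    a, b = a.lower(), b.lower()
    if not a or not b:
        return 0.0
    return common_prefix_len(a[::-1], b[::-1]) / min(len(a), len(b))


def ngrams(name, n=3):
    """All contiguous substrings of length n, lower-cased."""
    s = name.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def ngram_score(a, b, n=3):
    """Dice coefficient over the two n-gram sets (assumed formula)."""
    ga, gb = set(ngrams(a, n)), set(ngrams(b, n))
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


# Each matcher runs independently; adding one means adding an entry here.
MATCHERS = {"prefix": prefix_score, "suffix": suffix_score, "ngram": ngram_score}


def similarity_cube(source_attrs, target_attrs):
    """cube[matcher][source][target] -> score in the range 0 to 1."""
    return {
        name: {s: {t: fn(s, t) for t in target_attrs} for s in source_attrs}
        for name, fn in MATCHERS.items()
    }


cube = similarity_cube(["Tech", "Phone"], ["Technology", "telephone"])
```

With these definitions, `ngrams("Context")` yields the five trigrams from the slide, and `cube["prefix"]["Tech"]["Technology"]` and `cube["suffix"]["Phone"]["telephone"]` both score 1.0. Dividing by the shorter name is a deliberately simple choice: it also gives a full score to very short names that happen to be a prefix of a longer one, so a real matcher would likely need a minimum length or a different normalization.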
