Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Schema Matching Algorithms Phil Bernstein CSE 590sw February 2003 Acknowledgments Many of these slides are from presentations by – Erhard Rahm, Hong Hai Do, and Sergey Melnik (Univ. of Leipzig) 2 Mapping Schemas Given two schemas, return an expression that translates instances of one schema into instances of the other (i.e., performs data translation). Applications – – – – Web site integration Catalog integration Schema evolution Data translation – – – – Reverse engineering Data warehouse loading XML message translation Ontology integration 3 Partitioning the Problem Schema matching (aka mapping discovery) – Given two schemas, return a set of correspondences that specify pairs of related terms Semantic Mapping (aka query discovery) – Given correspondences between two schemas, return an expression that translates instances of one schema into instances of the other (i.e., performs data translation). 4 The Schema Matching Problem Types of schemas: Database schemas, XML schemas, ontologies, …, Input: Phils’ mods are in underscored Times Roman Two (or more) schemas, S1 and S2 Possibly data instances for S1 and S2 Background knowledge – thesauri, validated matches, constraints (keys, data types), standard schemas, ontologies, NLP, etc. Output: A mapping between S1 and S2 Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik 5 Generic Match Implementation Tool 1 (Portal schemas) Tool 2 (E-Business schemas) Tool 3 (Data Warehousing schemas) Tool 4 (Database Design) Schema import/ export Global libraries (dictionaries, schemas …) Generic Match Implementation Internal Schema Representation Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik 6 Match Example 1 Yahoo! Shopping Electronics and Photography Camcorders DV Computers and Software Digital Cameras PDAs and Handhelds Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik Epinions.com Electronics Video Camcorders Remote Controls Home Computers & Internet Computer Hardware PDAs 7 Match Example 2 CREATE TABLE PurchaseOrder.Customer ( custNo INT, custName VARCHAR(50), custStreet VARCHAR(50), custCity VARCHAR(50), custZip VARCHAR(10), PRIMARY KEY (custNo) ) ; CREATE TABLE PurchaseOrder.ShipTo ( poNo INT, custNo INT REFERENCES PO1.Customer, shipToStreet VARCHAR(50), shipToCity VARCHAR(50), shipToZip VARCHAR(10), PRIMARY KEY (poNo) ) ; <xsd:schema xmlns:xsd="http://www.w3.org/XMLSchema"> <xsd:complexType name=“POrder" > <xsd:sequence> <xsd:element name=“DeliverTo" type="Address"/> <xsd:element name=“BillTo" type="Address"/> </xsd:sequence> </xsd:complexType> <xsd:complexType name="Address" > <xsd:sequence> <xsd:element name=“Street" type="xsd:string"/> <xsd:element name=“City" type="xsd:string"/> <xsd:element name=“Zip" type="xsd:decimal"/> </xsd:sequence> </xsd:complexType> </xsd:schema> a) A relational schema and an XML schema POrder PurchaseOrder Customer custName custStreet custCity Legends: ShipTo custZip Node Containment link shipToStreet DeliverTo Address shipToCity shipToZip BillTo City b) Their corresponding graph representation Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik Street Zip 8 Tool Example (Biztalk Mapper) Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik 9 Current Situation Finding mappings is now the bottleneck! Will only get worse largely done by hand labor intensive, tedious & error prone data sharing & XML become pervasive proliferation of DTDs and XML schemas translation of legacy data reconciling ontologies on semantic web Need semi-automatic approaches to scale up! Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik 10 Why Matching is Difficult Aims to identify same real-world entity using names, structures, types, data values, etc. Schemas represent same entity differently different names => same entity (synonyms): client & user => customer same names => different entities (homonyms): bug => insect or software error Schema & data never fully capture semantics! not adequately documented, not sufficiently expressive data values suffer from synonyms and homonyms too Intended semantics is typically subjective! Cannot be fully automated. Often hard for humans. Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik 11 Desiderata for Match Solution Low degree of manual work Accuracy, efficiency, ease of use Extensibility Support for Reuse Exploit additional match techniques Exploit additional background knowledge Exploit previous manual or automatically generated matchings Generic approach Different schema languages Different application areas Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik 12 Special Situations Match schema S to an incremental modification of S – Can ignore homonyms and possibly synonyms – Little if any reshaping of the structure – Instances probably don’t help Lightweight integration for the semantic web vs E-commerce or data warehouse loading. – The former can’t afford much human review – The latter needs “perfect” mappings and hence 13 Automatic Match Approaches Individual approaches Schema-based Element Structure Instance-based Element Linguistic Constraintbased Constraintbased Linguistic • Names • Descriptions • Types • Keys • Parents • Children • Leaves • IR (word frequencies, key terms) Reuse-oriented Element Structure • Dictionaries • Thesauri • Previous match results Constraintbased • Value pattern and ranges Combining approaches: hybrid vs. composite Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik 14 Match Quality Measures Comparison of automatically with manually (i.e. real) derived match correspondences Real matches A Suggested matches B C D Quality measures: Precision SimilarityFlooding [ICDE02]: Overall 1 B BC Recall A: False Negatives B: True Positives C: False Positives D: True Negatives B A B AC B C 1 Recall* 2 Precision A B A B Overall : post-match effort to add missed (A) and to remove false matches (C): negative Overall no gain Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik 15 UW & MSR Contributions LSD [SIGMOD ] – Learning algorithm based on structures and instances – AnHai Doan, Pedro Domingos, Alon Halvey Cupid [VLDB 02] – Structure matching – Jayant Madhavan, Phil Bernstein Glue [WWW 03] – Taxonomy matching. Uses relaxation labeling. – An Hai Doan, Jayant Madhavan, Pedro Domingos, Alon Halevy Mapping Knowledge Base (MKB) [submitted] 16 Where’s the Research Action? There’s always room for new techniques – Compare distribution of data values between elements of two schemas – Create a global schema by clustering elements from different sample schemas Re-using mappings Combining techniques Better user interfaces 17