Download slides - IIT Bombay
Transcript
Integration and representation of unstructured text in relational databases
Sunita Sarawagi, IIT Bombay

Database vs. unstructured data
• Citeseer/Google Scholar: structured records from publishers; publications from homepages
• Company database (products with features): product reviews on the web; customer emails
• HR database (resumes: skills, experience, references): text resume in an email
• Personal databases (bibtex, address book): extract bibtex entries when I download a paper; enter missing contacts via web search

Running example: "R. Fagin and J. Helpern, Belief, awareness, reasoning. In AI 1988 [10] also see ..."
[Figure: an imprecise database of top-level entities: Articles (Id, Title, Year, Journal), Journals (Id, Name, Canonical: 10 ACM TODS, 17 AI, 16 ACM Trans. Databases), Writes (Article, Author), and Authors (Id, Name, Canonical: 11 M Y Vardi, 2 J. Ullman, 3 Ron Fagin, 4 Jeffrey Ullman). Probabilistic variants link to the canonical entries.]
[Figure: extraction turns the citation into fields (Author: R. Fagin, Author: J. Helpern, Title: Belief, ..., reasoning, Journal: AI, Year: 1988); integration then matches them with existing linked entities while respecting all constraints.]

Outline
• Statistical models for integration
  - Extraction while fully exploiting the existing database
  - Integrate extracted entities; resolve whether an entity is already in the database
• Performance challenges
  - Entity match, entity pattern, link/relationship constraints
  - Efficient graphical-model inference algorithms
  - Indexing support
• Representing the uncertainty of integration in the DB
  - Imprecise databases and queries

Extraction using chain CRFs
t:  1    2      3    4    5        6       7          8
x:  R. | Fagin | and | J. | Helpern | Belief | Awareness | Reasoning
y:  Author | Author | Other | Author | Author | Title | Title | Title

Flexible overlapping features:
• identity of the word
• ends in "-ski"
• is capitalized
• is part of a noun phrase
• is under node X in WordNet
• is in bold font
• is indented
• next two words are "and Associates"
• previous label is "Other"
It is difficult to effectively combine features from labeled unstructured data with a structured DB.

CRFs for segmentation
Label segments (l, u) instead of single positions:
• l=1, u=2: "R. Fagin" is Author
• l=u=3: "and" is Other
• l=4, u=5: "J. Helpern" is Author
• l=6, u=8: "Belief Awareness Reasoning" is Title
Features now describe the whole segment from l to u (for example, the similarity of "R. Fagin" to the author column in the database), not just the single word "Fagin".

Features from the database
• Similarity to a dictionary entry (Jaro-Winkler, TF-IDF)
• Similarity to a pattern-level dictionary: a regex-based pattern index over database entities
• Entity classifier: a multi-class regression model that gives the likelihood of a segment being a particular entity type; its features are all the standard entity-level extraction features

Segmentation models
• Input: sequence x = x1, x2, ..., xn and label set Y
• Output: segmentation S = s1, s2, ..., sp with sj = (start position, end position, label) = (tj, uj, yj)
• Score F(x, s): transition potentials (a segment starting at i has label y and the previous label is y') plus segment potentials (a segment starting at i', ending at i, with label y; all positions from i' to i get the same label)
• Probability of a segmentation: proportional to exp(F(x, s))
• Inference is O(nL²): most likely segmentation, and marginals around segments

[Figure (recap): extraction of the running citation and its integration, matching extracted entities with existing linked entities while respecting all constraints.]

Combined extraction + integration
"CACM 2000, R. Fagin and J. Helpern, Belief, awareness, reasoning in AI"
• Only extraction: Author: R. Fagin; Author: J. Helpern; Title: Belief, ..., reasoning in AI; Journal: CACM; Year: 2000
• Combined extraction + integration: Author: R. Fagin; Author: J. Helpern; Title: Belief, ..., reasoning; Journal: AI; Year: 2000
Matching against the stored article (Id 7, "Belief, awareness, reasoning", 1988) exposes a year mismatch!

Combined extraction + matching
• Convert the predicted label into a pair y = (a, r), where r is the id of the matching entity; r = 0 means none-of-the-above, i.e. a new entry
• Example: "CACM." gets (Journal, 0); "2000" gets (Year, 7); "Fagin" gets (Author, 3); "Belief Awareness Reasoning In AI" gets (Title, 7)
• Constraints exist on the ids that can be assigned to two segments

Constrained models
Two kinds of constraints between arbitrary segments:
• a foreign-key constraint across their canonical ids
• a cardinality constraint
Training: ignore the constraints, or use max-margin methods that require only MAP estimates.
Application: formulating inference as a constrained integer program is expensive; instead, use a general A-star search to find the most likely constrained assignment.

Effect of the database on extraction performance (Mansuri and Sarawagi, ICDE 2006)

              field        L     L+DB   %change
PersonalBib   author       75.7  79.5     4.9
              journal      33.9  50.3    48.6
              title        61.0  70.3    15.1
Address       city_name    72.4  76.7     6.0
              state_name   13.9  33.2   138.5
              zipcode      91.6  94.3     3.0

L = only labeled structured data; L + DB = adds similarity to database entities and other DB features.

[Chart: F1 effect of various features at 5% and 10% training data, comparing only_L (no DB) against ablations (-L_edge, -L_context, -L_entity) and additions (+db_link, +db_regex, +db_classifier, +db_similarity, +cardinality); F1 ranges roughly 55 to 85.]

Full integration performance (Mansuri and Sarawagi, ICDE 2006)

              field        L     L+DB   %change
PersonalBib   author       70.8  74.0     4.5
              journal      29.6  45.5    53.6
              title        51.6  65.0    25.9
Address       city_name    70.1  74.6     6.4
              state_name    9.0  28.3   213.8
              pincode      87.8  90.7     3.3

L = conventional extraction + matching; L + DB = the technology presented here. Much higher accuracies are possible with more training data.

Outline (recap): performance challenges: entity match, entity patterns, link/relationship constraints; efficient graphical-model inference algorithms; indexing support.

Inference in segmentation models
"R. Fagin and J. Helpern, Belief, awareness, reasoning, In AI 1998"
Surface features are cheap; database lookup features against many large tables (e.g. an Authors table with M Y Vardi, J. Ullman, Ron Fagin, Claire Cardie, J. Gherke, and so on) are expensive!
1. Can we batch lookups to do better than individual top-k queries?
2. Can we find the top segmentation without top-k matches for all segments?
Efficient search for the top-k most similar entities is built on an inverted index.

Top-k similarity search
• Q: a query segment; E: an entry in the database D; goal: the k highest-scoring entries E in D under the similarity score
• Step 1, fetch/merge: maintain bounds on normalized idf values over tidlist subsets (tidlists are pointers to DB tuples, kept on disk)
• Step 2, point queries: cache upper and lower score bounds per tuple id
• Output: candidate matches with upper and lower bounds on their dictionary match scores

Best segmentation with inexact, bounded features (Chandel, Nagesh and Sarawagi, ICDE 2006)
• Normal Viterbi: a forward pass over data positions, maintaining at each position the best segmentation ending there
• Modified: best-first search with selective feature refinement; suffix upper/lower bounds come from a backward Viterbi pass with bounded features
[Figure: search states s(0,0), s(1,1), s(1,2), s(1,3), s(3,3), s(3,4), s(3,5), s(4,4), s(5,5) expanding toward the end state.]

Performance results (Chandel, Nagesh and Sarawagi, ICDE 2006)
[Chart: DBLP authors and titles, 100 citations.]

Inference in segmentation models (continued)
Are surface features cheap? Not quite!
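The cost gap between chain CRFs and segmentation models shows up directly in the inference loop. Below is a minimal semi-CRF Viterbi sketch; the function names and the toy scoring interface are hypothetical simplifications of the talk's transition and segment potentials.

```python
def semicrf_viterbi(x, labels, max_len, seg_score, trans_score):
    """Most likely segmentation of token sequence x under a semi-CRF.

    seg_score(x, l, u, y): score of labeling tokens x[l..u] as entity type y.
    trans_score(y_prev, y): score of a y_prev segment followed by a y segment.
    The extra inner loop over candidate segment starts l makes this a factor
    max_len slower than chain-CRF Viterbi, which scores one token at a time.
    """
    n = len(x)
    # best[i][y]: (score, backpointer) of the best segmentation of x[0..i]
    # whose last segment has label y; backpointer = (prev_end, prev_label).
    best = [{} for _ in range(n)]
    for i in range(n):
        for y in labels:
            cands = []
            for l in range(max(0, i - max_len + 1), i + 1):
                s = seg_score(x, l, i, y)
                if l == 0:
                    cands.append((s, None))
                else:
                    for yp in labels:
                        cands.append((best[l - 1][yp][0] + trans_score(yp, y) + s,
                                      (l - 1, yp)))
            best[i][y] = max(cands, key=lambda c: c[0])
    # Backtrack from the best final label to recover (start, end, label) segments.
    y = max(labels, key=lambda lab: best[n - 1][lab][0])
    segs, i = [], n - 1
    while True:
        _, bp = best[i][y]
        if bp is None:
            segs.append((0, i, y))
            break
        prev_end, yp = bp
        segs.append((prev_end + 1, i, y))
        i, y = prev_end, yp
    return list(reversed(segs))
```

With database-lookup similarities inside seg_score, every (l, i, y) triple can trigger an expensive top-k query, which is exactly the cost the bounded, best-first refinement described above tries to avoid.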
Semi-CRFs are 3 to 8 times slower than chain CRFs.

Key insight (Sarawagi, ICML 2006):
• Applications have a mix of token-level and segment-level features
• Many features apply to several overlapping segments
• Compactly represent the overlap through new forms of potentials
• Redesign the inference algorithms to work on the compact features
• The cost becomes independent of the number of segments a feature applies to

Compact potentials: four kinds of potentials.

Running time and accuracy
[Charts: F1 accuracy on Address and Cora for Sequence-BCEU and Segment, and running time (seconds) against training fraction (10% to 70%) for Sequence-BCEU, SegmentOpt, and Segment.]

Outline (recap): representing the uncertainty of integration in the DB: imprecise databases and queries.

Probabilistic querying systems
• Integration systems, while improving, cannot be perfect, particularly for domains like the web
• User supervision of each integration result is impossible
• So: create uncertainty-aware storage and querying engines
Two enablers:
• probabilistic database querying engines over generic uncertainty models
• conditional graphical models, which produce well-calibrated probabilities

Probabilities in CRFs are well calibrated
[Charts: on Cora citations and Cora headers, the probability assigned to a segmentation tracks the probability that it is correct, close to the ideal diagonal; e.g. segmentations given probability 0.5 are correct 50% of the time.]

Uncertainty in integration systems
[Figure: unstructured text feeds a model (trained with additional training data) that emits entities with probabilities p1, p2, ..., pk into a probabilistic database system. Open questions: other, more compact models? What to do when the output is very uncertain?]
Example: the extraction "IEEE Intl. Conf. On data mining" has probability 0.8 and "Conf. On data mining" probability 0.2; queries such as "select the conference name of article RJ03" or "find the most cited author" (D Johnson, 16000 citations, probability 0.6; J Ullman, 13000, probability 0.4) must account for this.

Segmentation-per-row model (rows: uncertain; columns: exact)

HNO  | AREA        | CITY        | PINCODE | PROB
52   | Bandra West | Bombay      | 400 062 | 0.1
52-A | Bandra      | West Bombay | 400 062 | 0.2
52-A | Bandra West | Bombay      | 400 062 | 0.5
52   | Bandra      | West Bombay | 400 062 | 0.2

Exact but impractical: there can be too many segmentations!

One-row model (row: exact; columns: independent, uncertain)
Each column is a multinomial distribution.

HNO        | AREA              | CITY              | PINCODE
52 (0.3)   | Bandra West (0.6) | Bombay (0.6)      | 400 062 (1.0)
52-A (0.7) | Bandra (0.4)      | West Bombay (0.4) |

e.g. P(52-A, Bandra West, Bombay, 400 062) = 0.7 x 0.6 x 0.6 x 1.0 = 0.252
A simple model with a closed-form solution, but a poor approximation.

Multi-row model (rows: uncertain; columns: independent, uncertain)
A segmentation is generated by a "mixture" of rows.

HNO                      | AREA              | CITY              | PINCODE       | Prob
52 (0.167), 52-A (0.833) | Bandra West (1.0) | Bombay (1.0)      | 400 062 (1.0) | 0.6
52 (0.5), 52-A (0.5)     | Bandra (1.0)      | West Bombay (1.0) | 400 062 (1.0) | 0.4

An excellent storage/accuracy tradeoff, but populating the probabilities is challenging. (Gupta and Sarawagi, VLDB 2006)

Populating a multi-row model
• Challenge: learn the parameters of a mixture model that approximates the semi-CRF without enumerating instances from the model
• Solution: find disjoint partitions of the string and operate directly on the marginal probability vectors (efficiently computable for semi-CRFs); each partition becomes a row

Experiments: the need for multi-row
• KL divergence is very high at m = 1: the one-row model is clearly inadequate
• Even a two-row model is sufficient in many cases

What next in data integration?
Lots remains to be done in building large-scale, viable data integration systems:
• Online collective inference: we cannot freeze the database, and we cannot batch too many inferences; we need theoretically sound, practical alternatives to exact, batch inference
• Queries and mining over imprecise databases
• Models of imprecision for the results of deduplication

Thank you.
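To make the one-row and multi-row storage models concrete, here is a small sketch (the helper names are mine, not from the talk) that scores a segmentation under the one-row model, which multiplies independent per-column probabilities, and under the multi-row mixture, using the address example's numbers:

```python
def one_row_prob(columns, segmentation):
    """One-row model: columns are independent multinomials.
    columns: {col: {value: prob}}; segmentation: {col: value}."""
    p = 1.0
    for col, val in segmentation.items():
        p *= columns[col].get(val, 0.0)
    return p

def multi_row_prob(rows, segmentation):
    """Multi-row model: a mixture of one-row models.
    rows: list of (weight, columns) pairs; the weights sum to 1."""
    return sum(w * one_row_prob(cols, segmentation) for w, cols in rows)

# One-row approximation from the slides.
one_row = {
    "HNO": {"52": 0.3, "52-A": 0.7},
    "AREA": {"Bandra West": 0.6, "Bandra": 0.4},
    "CITY": {"Bombay": 0.6, "West Bombay": 0.4},
    "PINCODE": {"400 062": 1.0},
}
# Two-row mixture from the slides.
rows = [
    (0.6, {"HNO": {"52": 0.167, "52-A": 0.833}, "AREA": {"Bandra West": 1.0},
           "CITY": {"Bombay": 1.0}, "PINCODE": {"400 062": 1.0}}),
    (0.4, {"HNO": {"52": 0.5, "52-A": 0.5}, "AREA": {"Bandra": 1.0},
           "CITY": {"West Bombay": 1.0}, "PINCODE": {"400 062": 1.0}}),
]
seg = {"HNO": "52-A", "AREA": "Bandra West", "CITY": "Bombay", "PINCODE": "400 062"}
one_row_prob(one_row, seg)   # 0.7 * 0.6 * 0.6 * 1.0 = 0.252
multi_row_prob(rows, seg)    # 0.6 * 0.833 = 0.4998
```

The multi-row value of about 0.5 is far closer to the exact segmentation probability than the one-row 0.252, which is the storage/accuracy tradeoff the slides describe.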
Summary
Data integration with statistical models: an exciting research direction and a useful problem. Four take-home messages:
• Segmentation models (semi-CRFs) provide a more elegant way to exploit entity features and build integrated models (NIPS 2004, ICDE 2006a)
• A-star search is adequate for link and cardinality constraints (ICDE 2006a)
• A recipe for combining two top-k searches so that expensive DB lookup features are refined gradually (ICDE 2006b)
• An efficient segmentation model with a succinct representation of overlapping features, plus message passing over partial potentials (NIPS 2005 workshop)
Software: http://crf.sourceforge.net

[Backup slides]

Outline
• Problem statement and goals
• Models for data integration
  - Information extraction: state of the art; overview of conditional random fields; our extensions to incorporate a database of entity names
  - Entity matching
  - Combined model for extraction and matching
  - Extending to multi-relational data

Entity resolution
• Variants: J. Ullmann, Jefry Ulman, Prof. J. Ullman, J Smith, Mike Stonebraker, M, Stonebraker
• Authors table: Jeffrey Ullman, Jeffrey Smith, Michael Stonebraker, Pedro Domingos. Does "Domingos, P." match?
• Labeled data: record pairs with labels 0 (red edges) and 1 (black edges)
• Input features: various kinds of similarity functions between attributes: edit distance, Soundex, and n-grams on text attributes; Jaccard, Jaro-Winkler, subset match
• Classifier: any binary classifier; a CRF, for extensibility

CRFs for predicting matches
• Given a record pair (x1, x2), predict y = 1 (match) or y = 0 (non-match)
• Efficiency during training: filter, and only include pairs that satisfy conditions such as sharing at least one common n-gram

Link constraints in multi-relational data
Any pair of segments in the previous output needs to satisfy two conditions:
• a foreign-key constraint across their canonical ids
• a cardinality constraint
Our solution: constrained Viterbi, a branch-and-bound search that retains with each best path the labels along the path and backtracks when constraints are violated.

The final picture: available clues
• Entity column names in the database: similarity-based features (e.g. TF-IDF similarity with stored entities); handled by the semi-CRF
• Surface patterns, regular expressions: e.g. the pattern "X. [X.] Xx*" signals an author name
• Commonly occurring words: the part after "In" is a journal name; handled by a normal CRF
• Labeled data: order of attributes (e.g. "Journal", "IEEE", journal name; title before journal name) and ordering of words
• Canonical links: handled by the compound label
• Schema-level constraints: cardinality of attributes, and links between entities (what entity is allowed to go with what); handled by constrained Viterbi

Summary
• Exploiting existing large databases to bridge to unstructured data is an exciting research problem with many applications
• Conditional graphical models combine all available clues for extraction and matching in one simple framework
• Probabilistic: robust to noise, with soft predictions
• Ongoing work: probabilistic output for imprecise query processing

Adding structure to unstructured data
• Extensive research in the web, NLP, machine learning, data mining, and database communities
• Most current research ignores existing structured databases: the database is just a store at the last step of data integration
Our goal
• Extend statistical models to exploit a database of entities and relationships
• Make the models persistent, part of the database: stored, indexed, evolving, and improving along with the data
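The pair-feature and candidate-filter steps of the entity-resolution pipeline can be sketched as follows. This is a minimal illustration: the function names, the 3-gram choice, and the single "name" attribute are my assumptions; a real system would add edit distance, Soundex, Jaro-Winkler, and subset-match features, and train a binary classifier or CRF on the resulting vectors.

```python
def ngrams(s, n=3):
    """Character n-grams of a string, lowercased."""
    s = s.lower()
    return {s[i:i + n] for i in range(max(1, len(s) - n + 1))}

def jaccard(a, b):
    """Jaccard similarity of two sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def pair_features(r1, r2):
    """One similarity value per shared attribute; this vector would be the
    input to any binary match/non-match classifier."""
    return [jaccard(ngrams(r1[k]), ngrams(r2[k])) for k in r1]

def candidate_pairs(records):
    """Training-time filter from the slides: keep only pairs that share at
    least one common n-gram, instead of scoring all O(N^2) pairs."""
    keep = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if any(ngrams(records[i][k]) & ngrams(records[j][k])
                   for k in records[i]):
                keep.append((i, j))
    return keep
```

On records named "Jeffrey Ullman", "J. Ullmann", and "Pedro Domingos", only the Ullman/Ullmann pair survives the common-n-gram filter, and its Jaccard 3-gram similarity is the kind of soft evidence the CRF match predictor consumes.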