Data Mining with Unstructured Data
A Study and Implementation of Industry Product(s)
Samrat Sen

5/22/2017 UB - CS 711, Data Mining with Unstructured Data

Goals
• Issues in Text Mining with Unstructured Data
• Analysis of Data Mining products
• Study of a Real Life Classification Problem
• Strategy for solving the problem

Issues in Text Mining
• Different from KDD and DM techniques in structured databases
Problems:
1. Concerned with predefined fields
2. Based on learning from an attribute-value database (example on the next slide)

Issues in Text Mining

Potential Customer Table
Person   Age  Sex  Income  Customer
Ann S    32   F    10,000  yes
Jane G   53   F    20,000  no
Sri S    35   M    65,000  yes
Egor     25   M    10,000  yes

Married-to Table
Husband  Wife
Egor     Ann S
Sri H    Jane

Induced Rules
If Married(Person, Spouse) and Income(Person) >= 25,000
Then Potential-Customer(Spouse)
If Married(Person, Spouse) and Potential-Customer(Person)
Then Potential-Customer(Spouse)

Issues in Text Mining
• Algorithm techniques like Association Extraction from indexed data and Prototypical Document Extraction from full text
• Industry-standard data mining tools cannot be used directly, e.g. a usual process has to have the Text Transformer, Text Analyzer and Summary Generator

Issues in Text Mining
• The input and output interfaces and the file formats may cost time and money.
• Exhaustive domains have to be set up for classification.
• Costs and benefits have to be weighed before model selection:
  1. Gain from positive prediction
  2. Loss from an incorrect positive prediction (false positive)
  3. Benefit from a correct negative prediction
  4. Cost of incorrect negative prediction (false negative)
  5. Cost of project time (a better product/algorithm may come up)

Data Mining Products/Tools
• DARWIN – from Oracle
• Intelligent Data Miner – from IBM
• Intermedia Text with Oracle Database, with context query feature (theme-based document retrieval)
FOR MORE INFO...
http://www.oracle.com/ip/analyze/warehouse/datamining/
http://www-4.ibm.com/software/data/iminer/

Data Mining Products/Tools
• New specification being proposed by SUN for a Data Mining API *
• SQLServer 2000 – data mining and English query writing features
• Verity Knowledge Organizer
FOR MORE INFO...
* http://java.sun.com/aboutJava/communityprocess/jsr/jsr_073_dmapi.html#3
Additional Text Mining sites:
1. http://textmining.krdl.org.sg/resourves.html
2. www.intext.de/TEXTANAE.htm
3. www.cs.uku.fi/~kuikka/systems.html

DARWIN
Functions
1. Prediction (from known values)
2. Classification (into categories)
3. Forecasting (future predictions)
Approach
1. Plan
2. Prepare Dataset
3. Build and Use models

DARWIN
• The problem is defined in terms of data fields and data records
• The fields are classified as follows:
  - Categorical and Ordered Fields
  - Predictive Fields
  - Target Fields
• A DARWIN dataset file has to be created containing all the records in the problem domain (using a descriptor file)

DARWIN - Models
• Tree model – based on the classification and regression tree algorithm
• Net model – a feed-forward multilayer neural network
• Match model – memory-based reasoning model, using a K-nearest neighbor algorithm

DARWIN – Tree Model
[Diagram: Training Data; Create Tree; Test/Evaluate Tree (information on error rates of pruned sub-trees); I/P Prediction Dataset; Predict with Tree (using the selected sub-tree); Merged I/P & O/P prediction dataset; Analyze Results]

DARWIN – Net Model
[Diagram: Neural Network Model; Training Dataset; Create Net; Train Net (information on error rates of pruned sub-trees); Trained Neural Network; I/P Prediction Dataset; Prediction Dataset; Merged I/P & O/P prediction dataset; Analyze Results]

DARWIN – Match Model
[Diagram: Training Data; Create Match Model; Optimize match weights; I/P Prediction Dataset; Predict with Match; Merged I/P & O/P prediction dataset; Analyze Results]

DARWIN – Analyzing
• Evaluate: evaluates the performance of a given model on a given dataset, when working on known data for test or evaluation purposes.
• Summarize Data: provides a statistical summary of the values taken by the data in the specified fields of a dataset
• Frequency Count: provides information on the frequency with which particular data values appear in a dataset

DARWIN – Analyzing
• Performance Matrix: can be used to compare simple fields or simple functions of fields
• Sensitivity: provides a model showing the relative importance of attributes used in building a model

DARWIN – Code Generation
• Darwin can generate C, C++ and Java code for a Tree or Net model so that a prediction function can be called from an application program
• Java code can also be generated to embed a model in a Web applet
FOR MORE INFO...
http://technet.oracle.com/docs/products/datamining/doc_index.htm

DARWIN
For more info
http://technet.oracle.com/software/products/intermedia/software_index.html
1. Oracle Data Mining Data sheet
2. Oracle Data Mining Solutions
http://www.oracle.com/ip/analyze/warehouse/datamining/
http://www.oracle.com/oramag/oracle/98-Jan/fast.html
1. Managing Unstructured Data with Oracle8
http://technet.oracle.com/products/datamining/
1. Product manuals

DARWIN – Oracle Personalization
• Real-Time Recommendations
• New offering available with Oracle9i
[Screenshot: "Hello! We have recommendations for you."]

Oracle – Intermedia Text
• Ranking technique called theme proving is used
• Documents grouped into categories and subcategories
• Integrated with the Oracle8 database
• Absolutely no training or tuning required

Oracle – Intermedia Text
Lexical Knowledge Base
- 200,000 concepts from very broad domains
- 2000 major categories
- Concepts mapped into one or more words/phrases in canonical form
- Each of these has alternate inflectional variations, acronyms and synonyms stored
- Total vocabulary of 450,000 terms
- Each entry has other parameters like parts of speech

Oracle – Intermedia Text
Theme Extraction
- Themes are assigned initial ranks based on the structure of the document and the frequency of the theme
- All the ancestor themes are also included in the result
- Theme proving is done before final ranking
Queries
Direct match, phrase search ('contains'), case-sensitive query, misspellings and fuzzy match, inflections ('about'), compound queries, Boolean operators, natural language query

Oracle – Intermedia Text
• Oracle at TREC 8 (Eighth Text Retrieval Conference, http://otn.oracle.com/products/intermedia/htdocs/imt_trec8pap.htm)
  Recall at 1000                     71.57% (3384/4728)
  Average precision                  41.30%
  Initial precision (at recall 0.0)  92.79%
  Final precision (at recall 1.0)    7.91%

Intermedia Text-Model
[Figure]

Interface Options
[Figure]

Language Selection
• Java for robot
• PL/SQL for data retrieval

Code Execution
[Figure]

Overview of the System
[Diagram: Customer Browser; Listening at port 80; Server process; Intermedia Text; Client Browser; Web Server; Tag stripper; Oracle 8i; JDBC]

Intermedia Text
Steps for Building an application
• Load the documents
• Index the document
• Issue Queries
• Present the documents that satisfy the query

Loading Methods
- Insert statements
- SQL Loader
- Ctxsrv: a server daemon process which builds the index at regular intervals
- Ctxload utility, used for:
  - Thesaurus import/export
  - Text loading
  - Document updating/exporting

Create and Populate a Simple Table
  CREATE TABLE quick (
    quick_id NUMBER CONSTRAINT quick_pk PRIMARY KEY,
    text     VARCHAR2(80)
  );
  INSERT INTO quick VALUES ( 1, 'The cat sat on the mat' );
  INSERT INTO quick VALUES ( 2, 'The fox jumped over the dog' );
  INSERT INTO quick VALUES ( 3, 'The dog barked like a dog' );
  COMMIT;

Run a Text Query
  SELECT text FROM quick
  WHERE CONTAINS ( text, 'sat on the mat' ) > 0;

  DRG-10599: column is not indexed
• You must have a Text index on a column before you can do a "contains" query on it

Create the Text Index
  CREATE INDEX quick_text ON quick ( text )
  INDEXTYPE IS CTXSYS.CONTEXT;
• CTXSYS is the system user for interMedia Text
• The INDEXTYPE keyword is a feature of the Extensible Indexing Framework

Run a Text Query
  SELECT text FROM quick
  WHERE CONTAINS ( text, 'sat on the mat' ) > 0;

  TEXT
  -----------------------
  The cat sat on the mat
• You should regard the CONTAINS function as boolean in meaning
• It is implemented as a number since SQL does not have a boolean datatype
• The only sensible way to use it is with > 0

Run a Text Query
  SELECT SCORE(42) s, text FROM quick
  WHERE CONTAINS ( text, 'dog', 42 ) >= 0 /* just for teaching purposes! */
  ORDER BY s;

  S  TEXT
  -- ---------------------------
  7  The dog barked like a dog
  4  The fox jumped over the dog
• The better the match, the higher the score
• The value can be used in ORDER BY but has no absolute significance
• The score is zero when the query is not matched

Intermedia Text - Indexing Pipeline
[Diagram: Database; Column data; Datastore; Doc Data; Filter; Filtered Doc text; Sectioner; Section Offsets; Plain text; Lexer; Tokens; Engine; Index Data]
• First step is creating an index
• Datastore: reads the data out of the table (for the URL datastore, performs a 'GET')

Intermedia Text - Indexing Pipeline
• Filter: the data is transformed to some text type; this is needed as some of the formats may be binary, as when storing doc, pdf and HTML types
• Sectioner: converts to plain text, removes tags and invisible info
• Lexer: splits the text into discrete tokens
• Engine: takes the tokens from the lexer, the offsets from the sectioner and a list of stoplist words to build an index

Intermedia Text - Indexing Pipeline
Example of index creation
Statements
  INSERT INTO docs VALUES (1, 'first document');
  INSERT INTO docs VALUES (2, 'second document');
Produces an index
  DOCUMENT -> doc 1 position 2, doc 2 position 2
  FIRST    -> doc 1 position 1
  SECOND   -> doc 2 position 1

Testing procedure
• Document set from newsgroups
  - 122 documents from a text mining site
  - Loaded using insert statements
  - File datastore used
• Documents (HTML) from browsing
  - 20 documents
  - Loaded from server process
  - URL datastore used

Newsgroup Results
1. Religion, Atheism – 15 on bible, islam, religious beliefs
2. Comp-os-ms-windows-misc – 17 about operating systems, protocols, installation
3. Comp.graphics – 27 on hardware and software for computer graphics
4. Ice Hockey – 18
5. Computer hardware – 12 on installation of different peripheral devices
6. Mideast.politics – 14 on political development in the Mideast
7. Science.space – 19 on various space programs, devices, theories

Newsgroup Results
Group                        Retrieved  Wrong  Not Retrieved  Recall  Precision
Science and technology       120        16     1              99%     78%
Computer Hardware Industry   12         0      5              71%     100%
Government                   103        26     8              90%     74%
Politics                     17         3      0              100%    82%
Military                     5          1      0              80%     80%
Social Environment           48         2      14             77%     96%
Religion                     22         3      2              90%     86%
Islam                        4          0      0              100%    100%
Leisure recreation           22         4      5              78%     82%
Sports                       21         1      0              90%     90%
Hockey                       18         0      0              100%    100%

Recall    = (# of correct positive predictions) / (# of positive examples)
Precision = (# of correct positive predictions) / (# of positive predictions)

Query Syntax: Binary Operators
Operator  Symbol  Example
AND       &       cat & dog
OR        |       cat | dog
EQUIV     =       cat = dog
MINUS     -       cat - dog
NOT       ~       cat ~ dog
ACCUM     ,       cat , dog

Semantics: Binary Operators
• The semantics of all the binary operators is defined in terms of SCORE
• However, the score for even the simplest query expression, a single word, is calculated by a subtle rule:
  - the score is higher for a document where the query word occurs more frequently than for one where it occurs less frequently
  - but when "word1" occurs N times in document D, its score is lower than when "word2" occurs N times in document D if "word1" occurs more often in the whole document set than "word2"
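The two properties of the scoring rule above can be checked with a small sketch. It assumes a raw score of the form f (1 + log(N/n)), where f is the number of occurrences in the document, N the size of the document set and n the number of documents containing the term. The function name `raw_score`, the base-10 logarithm and the sample counts are assumptions made for illustration; this is not interMedia Text's actual implementation.

```python
import math


def raw_score(f: int, n_docs: int, n_containing: int) -> float:
    """Inverse-frequency style raw score: f * (1 + log10(N / n)).

    f            -- occurrences of the term in the document
    n_docs       -- N, total number of documents in the set
    n_containing -- n, number of documents that contain the term
    The base-10 logarithm is an assumption of this sketch.
    """
    if f < 0 or n_containing < 1 or n_containing > n_docs:
        raise ValueError("need f >= 0 and 1 <= n_containing <= n_docs")
    return f * (1 + math.log10(n_docs / n_containing))


# Same term, same document set: more occurrences give a higher score.
few = raw_score(f=2, n_docs=100, n_containing=10)
many = raw_score(f=5, n_docs=100, n_containing=10)

# Same occurrence count in one document, but "word1" appears in more
# documents of the set than "word2", so "word1" scores lower.
word1 = raw_score(f=3, n_docs=100, n_containing=50)
word2 = raw_score(f=3, n_docs=100, n_containing=5)

print(few < many, word1 < word2)
```

The raw value only matters for ordering documents; any rescaling to a fixed integer range is left out of this sketch.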
The Salton Algorithm
• interMedia Text uses an algorithm which is similar to the Salton Algorithm, widely used in Text Retrieval products
• The score for a word is proportional to...
    f (1 + log(N/n))
  ...where
  - f is the frequency of the search term in the document
  - N is the total number of documents
  - n is the number of documents which contain the search term
• The score is converted into an integer in the range 0 - 100.

The Salton Algorithm
Assumption
Inverse frequency scoring assumes that frequently occurring terms in a document set are noise terms, and so these terms are scored lower. For a document to score high, the query term must occur frequently in the document but infrequently in the document set as a whole.

The Salton Algorithm
This table assumes that only one document in the set contains the query term.
# of Documents in Document Set   Occurrences of Term in Document Needed to Score 100
1                                34
5                                20
10                               17
50                               13
100                              12
500                              10
1,000                            9
10,000                           7
100,000                          5
1,000,000                        4

Summary of operators
• Binary operators... & | = - ~ ,
• Built-in expansion... ? $ !
• Thesaurus... BT, BTG, BTP, BTI, NT, NTG, NTP, NTI, PT, RT, SYN, TR, TRSYN, TT

Summary of operators
• Stored query expression... SQE
• Grouping and escaping... () {} \
• Special... NEAR WITHIN ABOUT

Application Details - Customer Profile Analyzer
• The HTTP server (for user web page caching) is started
• The Oracle web server is also started

Log In Screen - Customer & User
• The login screen is used both by the customer and the users
• The Oracle web server takes care of the secure connections, while for the HTTP server the user id is common for the session; no user can invoke a document from the server without a user id

Customer Interface – Http Server
• The user uses the interface provided by the custom HTTP server

Main User Screen
• The user can choose the type of data to be analyzed. Two types of data exist:
  1. Newsgroups
  2. User-browsed URLs

Selection of Category and options
• The user chooses the category and other options, like:
  - Generating theme
  - Generating gist
  - Generating marked-up text
  - Date range

Results Page – Gist Generation
• This page can be used for drilling down to the actual document, which opens up in the browser (generated by the filter option)
• Theme and gist can be generated from this screen

Search Screen
• The search screen has advanced options like fuzzy search, about search etc.
• A chain of expressions can be used along with conjunctions (like 'not', 'or', 'and' etc.) for joining the statements

Conclusion
• New estimation methods are trying to find more meaning from text.
• Industry has great text mining products and is constantly improving technology.
• Unstructured Data Mining – a long way to go.
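The recall and precision figures on the Newsgroup Results slides can be recomputed from the three count columns. The sketch below assumes one reading of those columns that the slides do not state explicitly: "Retrieved" is the number of positive predictions, "Wrong" is the number of false positives among them, and "Not Retrieved" is the number of relevant documents that were missed. The function name `recall_precision` is hypothetical.

```python
from typing import Tuple


def recall_precision(retrieved: int, wrong: int, not_retrieved: int) -> Tuple[float, float]:
    """Return (recall, precision) from the three count columns.

    Assumed reading of the columns (not stated on the slides):
      retrieved     -- positive predictions
      wrong         -- false positives among them
      not_retrieved -- relevant documents that were missed
    """
    correct = retrieved - wrong            # correct positive predictions
    positives = correct + not_retrieved    # all positive examples
    recall = correct / positives if positives else 0.0
    precision = correct / retrieved if retrieved else 0.0
    return recall, precision


# "Government" row: Retrieved 103, Wrong 26, Not Retrieved 8.
recall, precision = recall_precision(retrieved=103, wrong=26, not_retrieved=8)
print(f"recall={recall:.1%} precision={precision:.1%}")
# prints "recall=90.6% precision=74.8%"
```

Under this reading the Government row gives 77/85 and 77/103, which agrees with the 90% and 74% shown in the table if the percentages there were truncated rather than rounded.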
