Data Mining with Unstructured Data
A Study and Implementation of Industry Product(s)
Samrat Sen
Goals
• Issues in Text Mining with Unstructured Data
• Analysis of Data Mining products
• Study of a Real Life Classification Problem
• Strategy for solving the problem

5/22/2017
UB - CS 711, Data Mining with
Unstructured Data
2
Issues in Text Mining
• Different from KDD and DM techniques in structured databases
Problems with those techniques:
1. Concerned with predefined fields
2. Based on learning from an attribute-value database, e.g. the tables and induced rules that follow
Issues in Text Mining
Potential Customer Table
Person   Age  Sex  Income   Customer
Ann S    32   F    10,000   yes
Jane G   53   F    20,000   no
Sri S    35   M    65,000   yes
Egor     25   M    10,000   yes

Married to Table
Husband  Wife
Egor     Ann S
Sri H    Jane

Induced Rules
If Married(Person, Spouse) and Income(Person) >= 25,000
Then Potential-Customer(Spouse)
If Married(Person, Spouse) and Potential-Customer(Person)
Then Potential-Customer(Spouse)
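
The two induced rules can be read as a small forward-chaining procedure. The Python sketch below is only an illustration under stated assumptions: it treats "Sri H"/"Sri S" and "Jane"/"Jane G" as the same people, it assumes Married holds in both directions, and the function name `potential_customers` is hypothetical. It does not try to reproduce the Customer column of the table.

```python
# Facts transcribed from the slide tables. Treating "Sri H"/"Sri S" and
# "Jane"/"Jane G" as the same people is an assumption for this illustration.
INCOME = {"Ann S": 10_000, "Jane G": 20_000, "Sri S": 65_000, "Egor": 10_000}
MARRIED = [("Egor", "Ann S"), ("Sri S", "Jane G")]  # (husband, wife)


def potential_customers(income, married, threshold=25_000):
    """Apply the two induced rules until no new fact can be derived."""
    # Assumption: Married(Person, Spouse) holds in both directions.
    pairs = set(married) | {(wife, husband) for husband, wife in married}
    # Rule 1: an income at or above the threshold marks the spouse.
    potential = {spouse for person, spouse in pairs if income[person] >= threshold}
    # Rule 2: a potential customer's spouse is also a potential customer.
    changed = True
    while changed:
        changed = False
        for person, spouse in pairs:
            if person in potential and spouse not in potential:
                potential.add(spouse)
                changed = True
    return potential


if __name__ == "__main__":
    print(sorted(potential_customers(INCOME, MARRIED)))
```

The point of the example is that the rules refer to a relation between two tables (who is married to whom), which a single attribute-value table cannot express.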
Issues in Text Mining
• Algorithmic techniques include Association Extraction from indexed data and Prototypical Document Extraction from full text
• Industry-standard data mining tools cannot be used directly
  e.g. a typical process needs a Text Transformer, a Text Analyzer and a Summary Generator
Issues in Text Mining
• The input and output interfaces and the file formats may cost time and money.
• Exhaustive domains have to be set up for classification.
• Costs and benefits have to be weighed before model selection:
1. Gain from positive prediction
2. Loss from an incorrect positive prediction (false positive)
3. Benefit from a correct negative prediction
4. Cost of incorrect negative prediction (false negative)
5. Cost of project time (a better product/algorithm may come
up)
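
The five items above can be combined into a single net figure per candidate model. The following Python sketch shows one possible way to do that; the class `Payoffs`, the function `net_benefit` and every number in it are hypothetical and serve only to illustrate the trade-off.

```python
from dataclasses import dataclass


@dataclass
class Payoffs:
    """Per-case payoffs for the four prediction outcomes (hypothetical values)."""
    gain_true_positive: float
    loss_false_positive: float
    benefit_true_negative: float
    cost_false_negative: float


def net_benefit(tp, fp, tn, fn, payoffs, project_cost=0.0):
    """Net value of a model given its confusion-matrix counts and project cost."""
    return (
        tp * payoffs.gain_true_positive
        - fp * payoffs.loss_false_positive
        + tn * payoffs.benefit_true_negative
        - fn * payoffs.cost_false_negative
        - project_cost
    )


if __name__ == "__main__":
    payoffs = Payoffs(50.0, 10.0, 1.0, 20.0)
    # Two hypothetical candidate models evaluated on the same 1,000 cases.
    model_a = net_benefit(80, 40, 860, 20, payoffs, project_cost=500)
    model_b = net_benefit(60, 10, 890, 40, payoffs, project_cost=200)
    print(model_a, model_b)
```

With these made-up figures the model with more false positives still comes out ahead, because a missed positive is assumed to be twice as costly as a wrong one.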
Data Mining Products/Tools
• DARWIN – from Oracle
• Intelligent Data Miner – from IBM
• Intermedia Text with Oracle Database, with context query feature (theme based document retrieval)
FOR MORE INFO...
http://www.oracle.com/ip/analyze/warehouse/datamining/
http://www-4.ibm.com/software/data/iminer/
Data Mining Products/Tools
• New specification being proposed by Sun for a Data Mining API *
• SQL Server 2000 – data mining and English Query writing features
• Verity Knowledge Organizer
FOR MORE INFO...
* http://java.sun.com/aboutJava/communityprocess/jsr/jsr_073_dmapi.html#3
Additional Text Mining sites:
1. http://textmining.krdl.org.sg/resourves.html
2. www.intext.de/TEXTANAE.htm
3. www.cs.uku.fi/~kuikka/systems.html
DARWIN
Functions
1. Prediction (from known values)
2. Classification (into categories)
3. Forecasting (future predictions)
Approach
1. Plan
2. Prepare Dataset
3. Build and Use models
DARWIN
• The problem is defined in terms of data fields and data records
• The fields are classified as follows:
  - Categorical and Ordered Fields
  - Predictive Fields
  - Target Fields
• A DARWIN dataset file has to be created containing all the records in the problem domain (using a descriptor file)
DARWIN - Models
• Tree model – based on the classification and regression tree algorithm
• Net model – a feed-forward multilayer neural network
• Match model – memory-based reasoning model, using a k-nearest neighbor algorithm
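
The Match model is described as memory-based reasoning with a k-nearest neighbor algorithm. The Python sketch below shows the general technique only; it is not DARWIN's implementation, and the function `knn_predict`, its `weights` parameter and the sample records are hypothetical.

```python
import math
from collections import Counter


def knn_predict(training, query, k=3, weights=None):
    """Classify `query` by majority vote among its k nearest training records.

    training: list of (feature_vector, label) pairs with numeric features
    weights:  optional per-field weights (hypothetical; defaults to 1.0 each)
    """
    weights = weights or [1.0] * len(query)

    def distance(a, b):
        # Weighted Euclidean distance between two numeric records.
        return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))

    nearest = sorted(training, key=lambda rec: distance(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    training = [
        ((1, 1), "no"), ((1, 2), "no"), ((2, 1), "no"),
        ((8, 8), "yes"), ((9, 8), "yes"), ((8, 9), "yes"),
    ]
    print(knn_predict(training, (2, 2)))
    print(knn_predict(training, (8.5, 8.5)))
```

No model is built in advance here: the training records themselves are the "memory", and all the work happens at prediction time.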
DARWIN – Tree Model
1. Create Tree (input: Training Data)
2. Test/Evaluate Tree (information on error rates of pruned sub-trees)
3. Predict with Tree, using the selected sub-tree (input: I/P Prediction Dataset; output: merged I/P & O/P prediction dataset)
4. Analyze Results
DARWIN – Net Model
1. Create Net (produces the Neural Network Model)
2. Train Net (input: Training Dataset; produces the Trained Neural Network)
   (Information on error rates of pruned sub-trees)
3. Apply the Trained Neural Network to the I/P Prediction Dataset (output: merged I/P & O/P prediction dataset)
4. Analyze Results
DARWIN – Match Model
1. Create Match Model (input: Training Data)
2. Optimize match weights
3. Predict with Match (input: I/P Prediction Dataset; output: merged I/P & O/P prediction dataset)
4. Analyze Results
DARWIN – Analyzing
Evaluate
Evaluates the performance of a given model on a given dataset, when working on known data for test or evaluation purposes.
Summarize Data
Provides a statistical summary of the values taken by the data in the specified fields of a dataset.
Frequency Count
Provides information on the frequency with which particular data values appear in a dataset.
DARWIN – Analyzing
Performance Matrix
Can be used to compare simple fields or simple functions of
fields
Sensitivity
Provides a model showing the relative importance of
attributes used in building a model
DARWIN – Code Generation
• Darwin can generate C, C++ or Java code for a Tree or Net model so that a prediction function can be called from an application program
• Java code can also be generated to embed a model in a Web applet
FOR MORE INFO...
http://technet.oracle.com/docs/products/datamining/doc_index.htm
DARWIN
For more info
http://technet.oracle.com/software/products/intermedia/software_index.html
1. Oracle Data Mining Data sheet
2. Oracle Data Mining Solutions
http://www.oracle.com/ip/analyze/warehouse/datamining/
http://www.oracle.com/oramag/oracle/98-Jan/fast.html
1. Managing Unstructured Data with Oracle8
http://technet.oracle.com/products/datamining/
1. Product manuals
DARWIN
Oracle Personalization
Hello! We have recommendations for you.
Real-Time Recommendations
New Offering Available with Oracle9i
Oracle – Intermedia Text
• A ranking technique called theme proving is used
• Documents are grouped into categories and subcategories
• Integrated with the Oracle8 database
• No training or tuning required
Oracle – Intermedia Text
Lexical Knowledge Base
- 200,000 concepts from very broad domains
- 2000 major categories
- Concepts mapped into one or more words/phrases in canonical form
- Each of these has alternate inflectional variations, acronyms and synonyms stored
- Total vocabulary of 450,000 terms
- Each entry has other parameters, such as part of speech
Oracle – Intermedia Text
Theme Extraction
- Themes are assigned initial ranks based on the structure of the document and the frequency of the theme
- All the ancestor themes are also included in the result
- Theme proving is done before final ranking
Queries
Direct match, phrase search ('contains'), case-sensitive query, misspellings and fuzzy match, inflections ('about'), compound queries, Boolean operators, natural language query
Oracle – Intermedia Text
Oracle at TREC 8
(Eighth Text Retrieval Conference, http://otn.oracle.com/products/intermedia/htdocs/imt_trec8pap.htm)
Recall at 1000                       71.57% (3384/4728)
Average Precision                    41.30%
Initial precision (at recall 0.0)    92.79%
Final precision (at recall 1.0)      07.91%
Intermedia Text-Model
Interface Options
Language Selection
• Java for robot
• PL/SQL for data retrieval
Code Execution
Overview of the System
[Diagram: Customer Browser and Client Browser; Web Server; Server process listening at port 80; Tag stripper; JDBC; Oracle 8i; Intermedia Text]
Intermedia Text
Steps for Building an application
• Load the documents
• Index the documents
• Issue queries
• Present the documents that satisfy the query
Loading Methods
– Insert Statements
– SQL Loader
– Ctxsrv: a server daemon process which builds the index at regular intervals
– Ctxload Utility, used for:
    Thesaurus Import/Export
    Text Loading
    Document Updating/Exporting
Create and Populate a Simple Table
CREATE TABLE quick (
  quick_id  NUMBER CONSTRAINT quick_pk PRIMARY KEY,
  text      VARCHAR2(80) );
INSERT INTO quick
VALUES ( 1, 'The cat sat on the mat' );
INSERT INTO quick
VALUES ( 2, 'The fox jumped over the dog' );
INSERT INTO quick
VALUES ( 3, 'The dog barked like a dog' );
COMMIT;
Run a Text Query
SELECT text FROM quick
WHERE CONTAINS ( text,
'sat on the mat' ) > 0;
DRG-10599: column is not indexed
• You must have a Text index on a column before you can do a "contains" query on it
Create the Text Index
CREATE INDEX quick_text
  ON quick ( text )
  INDEXTYPE IS CTXSYS.CONTEXT;
• CTXSYS is the system user for interMedia Text
• The INDEXTYPE keyword is a feature of the Extensible Indexing Framework
Run a Text Query
SELECT text FROM quick
WHERE CONTAINS ( text,
'sat on the mat' ) > 0;
TEXT
----------------------
The cat sat on the mat
• You should regard the CONTAINS function as boolean in meaning
• It is implemented as a number since SQL does not have a boolean datatype
• The only sensible way to use it is with > 0
Run a Text Query
SELECT SCORE(42) s, text FROM quick
WHERE CONTAINS ( text, 'dog', 42 )
>= 0 /* just for teaching purposes! */
ORDER BY s DESC;
 S TEXT
-- ---------------------------
 7 The dog barked like a dog
 4 The fox jumped over the dog
• The better the match, the higher the score
• The value can be used in ORDER BY but has no absolute significance
• The score is zero when the query is not matched
Intermedia Text - Indexing Pipeline
[Diagram: Database (column data) → Datastore (doc data) → Filter (filtered doc text) → Sectioner (plain text, section offsets) → Lexer (tokens) → Engine (index data) → Database]
• First step is creating an index
• Datastore: reads the data out of the table (for a URL datastore, performs a 'GET')
Intermedia Text - Indexing Pipeline
• Filter: The data is transformed to some text type; this is needed because some of the stored formats (doc, pdf, HTML) may be binary.
• Sectioner: Converts to plain text, removes tags and invisible info.
• Lexer: Splits the text into discrete tokens.
• Engine: Takes the tokens from the lexer, the offsets from the sectioner and a list of stop words (the stoplist) to build an index.
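
The Sectioner and Lexer stages described above can be approximated in a few lines of Python. This is a minimal sketch using only the standard library, not interMedia Text's own code; the names `TagStripper`, `section` and `lex` are hypothetical, and upper-casing the tokens is an assumption.

```python
import re
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the visible text of an HTML document."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def section(html):
    """Sectioner stand-in: strip tags and invisible content, return plain text."""
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()
    return " ".join(stripper.parts)


def lex(text):
    """Lexer stand-in: split plain text into upper-cased alphanumeric tokens."""
    return [token.upper() for token in re.findall(r"[A-Za-z0-9]+", text)]


if __name__ == "__main__":
    page = "<p>The <b>cat</b> sat on the mat</p><script>x=1</script>"
    print(lex(section(page)))
```

Stop words are deliberately left in at this stage, since the slide assigns the stoplist to the Engine rather than to the Lexer.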
Intermedia Text - Indexing Pipeline
Example of index creation
Statements
• INSERT INTO docs VALUES (1, 'first document');
• INSERT INTO docs VALUES (2, 'second document');
Produces an index
DOCUMENT → doc 1 position 2, doc 2 position 2
FIRST    → doc 1 position 1
SECOND   → doc 2 position 1
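
The index shown above is an inverted index with word positions. A minimal Python sketch of how such a structure could be built is given below; the function `build_index` and its `stoplist` parameter are hypothetical and do not reflect how the interMedia Text engine stores its index.

```python
import re
from collections import defaultdict


def build_index(docs, stoplist=frozenset()):
    """Map each token to a list of (doc_id, position) pairs.

    docs:     iterable of (doc_id, text) pairs
    stoplist: upper-case tokens to leave out of the index
    Positions are 1-based and counted before stop words are dropped.
    """
    index = defaultdict(list)
    for doc_id, text in docs:
        for position, word in enumerate(re.findall(r"\w+", text), start=1):
            token = word.upper()
            if token not in stoplist:
                index[token].append((doc_id, position))
    return dict(index)


if __name__ == "__main__":
    docs = [(1, "first document"), (2, "second document")]
    for token, postings in sorted(build_index(docs).items()):
        print(token, postings)
```

Keeping the position alongside the document id is what makes phrase queries such as 'sat on the mat' answerable from the index alone.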
Testing procedure
• Document set from newsgroups
  – 122 documents from a text mining site
  – Loaded using insert statements
  – File datastore used
• Documents (HTML) from browsing
  – 20 documents
  – Loaded from server process
  – URL datastore used
Newsgroup Results
1. Religion, Atheism – 15: on bible, islam, religious beliefs
2. Comp-os-ms-windows-misc – 17: about operating sys, protocols, installation
3. Comp.graphics – 27: on hardware and software for computer graphics
4. Ice Hockey – 18
5. Computer hardware – 12: on installation of different peripheral devices
6. Mideast.politics – 14: on political development in mideast
7. Science.space – 19: on various space programs, devices, theories
Newsgroup Results
Group                        Retrieved  Wrong  Not Retrieved  Recall  Precision
Science and technology       120        16     1              99%     78%
Computer Hardware Industry   12         0      5              71%     100%
Government                   103        26     8              90%     74%
Politics                     17         3      0              100%    82%
Military                     5          1      0              80%     80%
Social Environment           48         2      14             77%     96%
Religion                     22         3      2              90%     86%
Islam                        4          0      0              100%    100%
Leisure recreation           22         4      5              78%     82%
Sports                       21         1      0              90%     90%
Hockey                       18         0      0              100%    100%
Recall = (# of correct positive predictions) / (# of positive examples)
Precision = (# of correct positive predictions) / (# of positive predictions)
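
The recall and precision definitions can be applied directly to the Retrieved / Wrong / Not Retrieved counts. The Python sketch below rests on an assumed reading of those columns: "Retrieved" counts every document returned for a group, "Wrong" the incorrect ones among them, and "Not Retrieved" the relevant documents that were missed. The function name `recall_precision` is hypothetical.

```python
def recall_precision(retrieved, wrong, not_retrieved):
    """Return (recall, precision) as fractions between 0 and 1.

    Assumed column meanings: `retrieved` includes the `wrong` documents,
    and `not_retrieved` counts relevant documents that were missed.
    """
    correct = retrieved - wrong
    positives = correct + not_retrieved           # all relevant documents
    recall = correct / positives if positives else 0.0
    precision = correct / retrieved if retrieved else 0.0
    return recall, precision


if __name__ == "__main__":
    # Counts reported for the Government group: 103 retrieved, 26 wrong, 8 missed.
    recall, precision = recall_precision(103, 26, 8)
    print(f"recall={recall:.1%} precision={precision:.1%}")
```

For the Government counts this gives roughly 90.6% recall and 74.8% precision, in line with the 90% and 74% reported for that group.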
Query Syntax: Binary Operators
AND     &    cat & dog
OR      |    cat | dog
EQUIV   =    cat = dog
MINUS   -    cat - dog
NOT     ~    cat ~ dog
ACCUM   ,    cat , dog
Semantics: Binary Operators
• The semantics of all the binary operators are defined in terms of SCORE
• However, the score for even the simplest query expression (a single word) is calculated by a subtle rule:
  – the score is higher for a document where the query word occurs more frequently than for one where it occurs less frequently
  – but when "word1" occurs N times in document D, its score is lower than when "word2" occurs N times in document D if "word1" occurs more often in the whole document set than "word2"
The Salton Algorithm
• interMedia Text uses an algorithm which is similar to the Salton Algorithm, widely used in Text Retrieval products
• The score for a word is proportional to...
  f ( 1 + log ( N/n ) )
  ...where
  – f is the frequency of the search term in the document
  – N is the total number of documents
  – and n is the number of documents which contain the search term
• The score is converted into an integer in the range 0 - 100.
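
The proportionality above can be tried out numerically. The following Python sketch is illustrative only: the slide does not state the base of the logarithm or the constant of proportionality, so base 10 and the `scale` parameter are assumptions, and the function names are hypothetical.

```python
import math


def salton_weight(f, N, n):
    """Raw term weight f * (1 + log(N / n)).

    f: occurrences of the term in the document
    N: total number of documents
    n: number of documents containing the term
    The base-10 logarithm is an assumption; the slide does not name the base.
    """
    if f == 0 or n == 0:
        return 0.0
    return f * (1 + math.log10(N / n))


def to_score(weight, scale=1.0):
    """Scale a raw weight and clamp it to an integer in the range 0..100.

    `scale` stands in for the unstated constant of proportionality.
    """
    return max(0, min(100, int(weight * scale)))


if __name__ == "__main__":
    # The same term frequency scores higher when the term is rarer in the set.
    rare = salton_weight(3, N=1000, n=10)
    common = salton_weight(3, N=1000, n=500)
    print(rare, common, to_score(rare), to_score(common))
```

The comparison shows the inverse-frequency effect: with f fixed, the weight grows as the term appears in fewer documents of the set.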
The Salton Algorithm
Assumption
Inverse frequency scoring assumes that frequently occurring terms in a document set are noise terms, and so these terms are scored lower. For a document to score high, the query term must occur frequently in the document but infrequently in the document set as a whole.
The Salton Algorithm
This table assumes that only one document in the set contains the query term.
# of Documents in Document Set    Occurrences of Term in Document Needed to Score 100
1                                 34
5                                 20
10                                17
50                                13
100                               12
500                               10
1,000                             9
10,000                            7
100,000                           5
1,000,000                         4
Summary of operators
• Binary operators: & | = - ~ ,
• Built-in expansion: ? $ !
• Thesaurus: BT, BTG, BTP, BTI, NT, NTG, NTP, NTI, PT, RT, SYN, TR, TRSYN, TT
Summary of operators
• Stored query expression: SQE
• Grouping and escaping: () {} \
• Special: NEAR, WITHIN, ABOUT
Application Details - Customer Profile Analyzer
The http server (for user web page caching) is started. The Oracle web server is also started.
Log In Screen - Customer & User
The log in screen is used both by the customer and by the users. The Oracle web server takes care of the secure connections, while for the http server the user id is common for the session: no user can invoke a document from the server without a user id.
Customer Interface – Http Server
The user uses the interface provided by the custom http server.
Main User Screen
The user can choose the type of data to be analyzed. Two types of data exist:
1. Newsgroups
2. User-browsed URLs
Selection of Category and Options
The user chooses a category and other options, like:
• Generating theme
• Generating gist
• Generating marked-up text
• Date range
Results Page – Gist Generation
This page can be used for drilling down to the actual document, which opens up in the browser (generated by the filter option). Theme and gist can also be generated from this screen.
Search Screen
The search screen has advanced options like fuzzy search, about search etc. A chain of expressions can be used along with conjunctions (like 'not', 'or', 'and' etc.) for joining the statements.
Conclusion
• New estimation methods are trying to find more meaning in text.
• Industry has strong text mining products and is constantly improving the technology.
• Unstructured Data Mining – a long way to go.