COMP527: Data Mining
Text Mining: Challenges, Basics
M. Sulaiman Khan (mskhan@liv.ac.uk)
Dept. of Computer Science, University of Liverpool
March 24, 2009
Course Outline
Introduction to the Course
Introduction to Data Mining
Introduction to Text Mining
General Data Mining Issues
Data Warehousing
Classification: Challenges, Basics
Classification: Rules
Classification: Trees
Classification: Trees 2
Classification: Bayes
Classification: Neural Networks
Classification: SVM
Classification: Evaluation
Classification: Evaluation 2
Regression, Prediction
Input Preprocessing
Attribute Selection
Association Rule Mining
ARM: A Priori and Data Structures
ARM: Improvements
ARM: Advanced Techniques
Clustering: Challenges, Basics
Clustering: Improvements
Clustering: Advanced Algorithms
Hybrid Approaches
Graph Mining, Web Mining
Text Mining: Challenges, Basics
Text Mining: Text-as-Data
Text Mining: Text-as-Language
Revision for Exam
Today's Topics
Text Representation
What is a word?
Dimensionality Reduction
Text Mining vs Data Mining on Text
Representation of Documents
Basic goal: data mining on documents. Each document must be an instance.
First problem: what are the attributes of a document?
Easy attributes: format, length in bytes, potentially some metadata extractable from headers or file properties (author, date, etc.)
Harder attributes: how to usefully represent the text?
The basic idea is to treat each word as an attribute, either boolean (is present / is not present) or numeric (number of times the word occurs).
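A minimal sketch of this representation (my own Python illustration, not code from the course):

    from collections import Counter

    def bag_of_words(text, boolean=False):
        """Map a document to {word: value}: presence (1) or occurrence count."""
        counts = Counter(text.lower().split())
        return {w: 1 for w in counts} if boolean else dict(counts)

    doc = "the cat sat on the mat"
    print(bag_of_words(doc))                # {'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}
    print(bag_of_words(doc, boolean=True))  # {'the': 1, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}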
Second problem: we will have a LOT of false/0 attribute values in our instances.
The result is a very sparse matrix, requiring a LOT of storage space:
1,000,000 documents x 500,000 different words = 500,000,000,000 entries
~= 60 gigabytes if we store each entry as a single bit
~= 950 gigabytes if we store each entry as a short integer (2 bytes)
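As a quick sanity check on those figures (my own arithmetic, converting via GiB):

    entries = 1000000 * 500000     # 5 x 10^11 matrix cells
    print(entries / 8 / 2**30)     # one bit per entry: ~58 GiB, the "~60 gigabytes"
    print(entries * 2 / 2**30)     # two bytes per entry: ~931 GiB, the "~950 gigabytes"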
Google's dictionary has 5 million words, times 18 billion web pages...
(Process that, WEKA!)
Store only the true values: (1, 4, 100, 212, 13948)
Or true values with frequency: (1:3, 4:1, 100:1, 212:3, 13948:4)
For compressed storage, we can store the differences (gaps) between consecutive indices:
Attribs in order: 1, 2, 4, 5, 7, 10, 15, 18, ... 2348651, ...
Gaps: 1, 1, 2, 1, 2, 3, 5, 3, ... 6, ...
With frequency (gap and count interleaved): 1,4, 1,3, 2,5, 1,3, 2,6, 3,3, ...
The gaps are small, so each value can always be stored in a short integer.
Regular compression algorithms will be efficient on this sequence.
Reordering attributes by frequency rather than alphabetically keeps the gaps small.
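A minimal sketch of the gap encoding (my own illustration; the function names are made up):

    def gap_encode(indices):
        """Turn sorted attribute indices into gaps from the previous index."""
        prev, gaps = 0, []
        for i in indices:
            gaps.append(i - prev)
            prev = i
        return gaps

    def gap_decode(gaps):
        """Invert gap_encode with a running total."""
        total, out = 0, []
        for g in gaps:
            total += g
            out.append(total)
        return out

    print(gap_encode([1, 2, 4, 5, 7, 10, 15, 18]))  # [1, 1, 2, 1, 2, 3, 5, 3]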
Input?
That's nice... but WEKA needs ARFF!
Problems with toolkits: won't accept a sparse input format.
Classification algorithms:
Rules: Possible, but unlikely
Trees: Less likely than unlikely
Bayes: Fine, especially Multinomial Bayes
Bayesian Networks: Maybe... but too many possible networks
SVM: Fine
NN: Not fine! Tooooo many nodes
Perceptron/Winnow: See NN, but more feasible as there is no hidden layer
kNN: Very slow without supporting data structures, due to the number of comparisons
Overall problem for text classification:
Accurate models over such high-dimensional data are impossible for humans to understand (eg SVM, Multinomial Naive Bayes).
Association Rule Mining: fine for the presence of a word, but how to represent word frequency? Classification Association Rule Mining is a possibly good solution for understandability?
Clustering: very high dimensionality is a problem for many algorithms, especially those making lots of comparisons (eg partitioning algorithms).
Document Types
First problem: we need to be able to extract data from the file, with very different processing for different file types, eg:
XML, HTML, RSS, Word, Open Document, PDF, RTF, LaTeX, ...
May want to treat different parts of the document separately, eg:
title vs authors vs abstract vs text vs references
Want to normalise texts into semantic areas across different formats. Eg an abstract in a PDF is just several lines of text, but in ODF it is an XML element, and in LaTeX it is marked up as an abstract environment (\begin{abstract} ... \end{abstract}).
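One way to organise this, sketched with made-up stub extractors (real ones would use format-specific parsing libraries):

    # Hypothetical stubs: each maps a file to the same normalised semantic areas.
    def extract_pdf(path):   return {"title": "...", "abstract": "...", "body": "..."}
    def extract_html(path):  return {"title": "...", "abstract": "...", "body": "..."}
    def extract_latex(path): return {"title": "...", "abstract": "...", "body": "..."}

    EXTRACTORS = {".pdf": extract_pdf, ".html": extract_html, ".tex": extract_latex}

    def extract(path):
        """Dispatch on file extension, returning the same fields for any format."""
        suffix = path[path.rfind("."):].lower()
        return EXTRACTORS[suffix](path)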
Term Extraction
Requirement: Extract words from text.
What is a 'word'?
- Obvious 'words' (eg consecutive non-space characters)
- Numbers (1,000 55.6 $10 10^12 64.000 vs 64.0)
- Hyphenation (book case vs book-case vs bookcase), but also ranges: 32-45 or "New York-New Jersey"
- URIs: http://www.liv.ac.uk/ and more complicated ones
- Punctuation (Rob's vs 'Robs' vs Robs' vs Robs)
- Dates as a single token?
- Non-alphanumeric characters: AT&T, Yahoo!
- ...
- The period character is problematic: end of sentence, end of abbreviation, internal to acronyms (but not always present), internal to numbers (with two different meanings), dotted quad notation (eg 138.253.81.72)
- Emoticons: :( :) >:( =>
- Need extra processing for diacritics? eg: é ë ç etc.
- Might want to use phrases as attributes, eg 'with respect to', 'data mining', but it is complicated to determine appropriate phrases.
- Expand abbreviations? Expand acronyms?
- Expand ranges? (1999-2007 means all the years, not just the end points)
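As a rough illustration of these trade-offs (my own sketch, not the course's tokeniser), a regular-expression tokeniser can at least keep URLs, dotted quads, decimal numbers and a few emoticons as single tokens:

    import re

    # Order matters: earlier alternatives win. Deliberately simplistic.
    TOKEN = re.compile(r"""
        https?://\S+                # URLs
      | \d{1,3}(?:\.\d{1,3}){3}     # dotted quads, eg 138.253.81.72
      | [:>=][()D]                  # a few emoticons
      | \w+(?:[-'.]\w+)*            # words, plus book-case, Rob's, 55.6, U.S.A
      """, re.VERBOSE)

    def tokenise(text):
        return TOKEN.findall(text)

    print(tokenise("Rob's page http://www.liv.ac.uk/ is at 138.253.81.72 :)"))
    # ["Rob's", 'page', 'http://www.liv.ac.uk/', 'is', 'at', '138.253.81.72', ':)']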
Dimensionality Reduction
Requirement: Reduce the number of words (dimensionality reduction).
Many words are useless for distinguishing a document. We don't want to store non-useful words...
a, an, the, these, those, them, they...
of, with, to, in, towards, on...
while, however, because, also, who, when, where...
These come from a long list of words to ignore, called 'stopwords'.
BUT... "The Who": a band, or two stopwords?
Part-of-speech filtering is more accurate, but more expensive.
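A minimal stopword filter (the list here is abbreviated; real stopword lists run to hundreds of entries):

    STOPWORDS = {"a", "an", "the", "these", "those", "them", "they",
                 "of", "with", "to", "in", "towards", "on",
                 "while", "however", "because", "also", "who", "when", "where"}

    def remove_stopwords(tokens):
        """Drop stopwords; note that this deletes 'The Who' entirely."""
        return [t for t in tokens if t.lower() not in STOPWORDS]

    print(remove_stopwords("The Who played in Liverpool".split()))
    # ['played', 'Liverpool'] - the band name is lost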
Requirement: Normalise terms, for consistency and dimensionality reduction.
- Normally want to ignore case, eg 'computer' and 'Computer' should be the same attribute. But acronyms are different: ram vs RAM, us vs US.
- Normally want to use word stems, eg 'computer' and 'computers' should be the same attribute. The Porter algorithm relies on suffix stripping, but note that 'ram' could be a noun or a verb... Also, stems can be meaningless: "datum mine"
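For illustration, NLTK's implementation of the Porter stemmer (NLTK is my choice of toolkit, not something the slides prescribe) shows both the merging and the meaningless stems:

    from nltk.stem import PorterStemmer  # pip install nltk

    stemmer = PorterStemmer()
    for word in ["computer", "computers", "computing", "mining", "data"]:
        print(word, "->", stemmer.stem(word))
    # 'computer', 'computers' and 'computing' all become 'comput': good for
    # merging attributes, but 'comput' is not itself a word.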
Can use simple statistics to reduce dimensionality:
If a word appears evenly across all classes, then it doesn't distinguish any particular class; it is not useful for classification and can be ignored, eg 'the'.
Equally, a word that appears in only one document will be perfectly discriminating, but is also probably over-fitting.
Words that appear in most documents (regardless of class distribution) are also unlikely to be useful.
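A sketch of the simplest such statistic, document frequency, with made-up thresholds:

    def df_filter(docs, min_df=2, max_df_ratio=0.5):
        """Keep words occurring in at least min_df docs, but in at most
        max_df_ratio of all docs. docs is a list of token lists."""
        df = {}
        for doc in docs:
            for word in set(doc):
                df[word] = df.get(word, 0) + 1
        limit = max_df_ratio * len(docs)
        return {w for w, n in df.items() if min_df <= n <= limit}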
Text Mining vs Data Mining
Data Mining: discover hidden models that describe the data.
Text Mining: discover hidden facts within bodies of text.
Completely different approaches:
DM tries to generalise all of the data into a single model, without getting caught up in over-fitting.
TM tries to understand the details, and to cross-reference between individual instances.
Text Mining uses Natural Language Processing techniques to 'understand' the data. It tries to understand the semantics of the text (the information), rather than treating it as a big bag of sequences of characters.
Major processes:
- Part of Speech Tagging
- Phrase Chunking
- Deep Parsing
- Named Entity Recognition
- Information Extraction
Part of Speech Tagging:
Tag each word with its part of speech (noun, verb, adjective, etc.)
This is itself a classification problem, but is essential to understanding the text, especially the verbs.
Phrase Chunking:
Discover sequences of words that constitute phrases, eg noun phrases, verb phrases, prepositional phrases.
Also essential, in order to discover clauses rather than working with individual words.
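A quick taste of both steps with NLTK (my choice of toolkit; the tokeniser and tagger models need a one-off download):

    import nltk
    # One-off: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

    tokens = nltk.word_tokenize("The dog bites the man")
    print(nltk.pos_tag(tokens))
    # [('The', 'DT'), ('dog', 'NN'), ('bites', 'VBZ'), ('the', 'DT'), ('man', 'NN')]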
Deep Parsing:
Discover the structure of the clauses and the participants of the verbs, etc. Eg dog bites man, not man bites dog.
Essential as the first step where the semantics are really used.
Named Entity Recognition:
Discover 'entities' within the text and tag matching mentions with the same identifier. Eg Magnesium and Mg are the same; Bush, President Bush, G.W. Bush, Dubya and the President are all the same.
Essential for the correlation of entities.
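NLTK's built-in chunker gives a taste of NER. Note it only finds and labels entity mentions; linking 'Bush' and 'the President' to one identifier needs further coreference work:

    import nltk
    # One-off: nltk.download() of 'punkt', 'averaged_perceptron_tagger',
    # 'maxent_ne_chunker' and 'words'

    sentence = "George Bush visited Liverpool in March."
    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
    print(tree)  # typically marks 'George Bush' as PERSON, 'Liverpool' as GPE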
Information Extraction:
With all of the previous information, find all of the information about each entity from all of its occurrences within all clauses. Remove duplicates and find correlations. Look for interesting correlations, perhaps according to some set of rules for what is interesting.
In practice this is an impossibly large task given a reasonable body of text, and the interestingness of 'new' facts is often very low.
DM is crucial for TM, eg correct classification of parts of speech. But TM processes are also important for accurate dimensionality reduction when doing DM on texts. Eg:
Every word: an average of 100 attributes per vector, 85.7% accuracy over 10 classes with SVM.
Same data, with linguistic stems and filtered to nouns, verbs and adjectives: an average of 64 attributes per vector, 87.2% accuracy.
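The shape of such an experiment can be sketched with scikit-learn (my choice of library; the toy corpus below will not reproduce the figures above):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    docs = ["the dog bites the man", "the man bites the dog",
            "stock prices fell sharply", "markets rallied on the news"]
    labels = ["animals", "animals", "finance", "finance"]

    # Swap in a stemming, POS-filtering analyzer to compare against every-word input.
    model = make_pipeline(CountVectorizer(), LinearSVC())
    model.fit(docs, labels)
    print(model.predict(["the dog and the man"]))  # ['animals']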
Further Reading
Baeza-Yates, Modern Information Retrieval
Weiss, Chapters 2 and 4
Berry, Survey of Text Mining, Chapter 5
(He gets around, doesn't he?!)
Konchady
Witten, Managing Gigabytes