Data Informatics
Seon Ho Kim, Ph.D.
seonkim@usc.edu

Data Mining

What Is Data Mining?
• Data mining
– Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
• Alternative names
– Knowledge Discovery (mining) in Databases (KDD), knowledge discovery from data, knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Data Mining: What’s in a Name?
• Information Harvesting
• Knowledge Mining
• Data Mining
• Knowledge Discovery in Databases
• Data Pattern Processing
• Data Dredging
• Data Archaeology
• Database Mining
• Siftware
• Knowledge Extraction
The process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of stored data, using pattern recognition technologies and statistical and mathematical techniques

Integration of Multiple Technologies
[Diagram: Artificial Intelligence, Machine Learning, Database Management, Statistics, Visualization, and Algorithms surrounding Data Mining]

Terms
• Machine Learning is the study, design, and development of algorithms that give computers the capability to learn without being explicitly programmed (Arthur Samuel's definition).
• Data Mining can be defined as the process that, starting from apparently unstructured data, tries to extract knowledge and/or unknown interesting patterns. Machine learning algorithms are used during this process.
• Data Analysis, Data Mining, Machine Learning, and Mathematical Modeling are tools: means towards an end.
• Analytics, Business Intelligence, Econometrics, and Artificial Intelligence are application areas: domains that use the tools above (and others) to produce results within their subject. Among them, Analytics is probably a more generic term (i.e., not domain-specific).
• Statistics is a branch of Mathematics providing theoretical and practical support to the above tools.
• Data Science is a catch-all term for using all of those tools to provide answers in all of those areas (and in others), especially when dealing with Big Data, which is nothing more than a label for doing any of the above when the datasets are huge.

Knowledge Discovery in Databases: Process
[Diagram: Data → Selection → Target Data → Preprocessing → Preprocessed Data → Data Mining → Patterns → Interpretation/Evaluation → Knowledge]
adapted from:
U. Fayyad, et al. (1995), “From Knowledge Discovery to Data Mining: An Overview,” Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press

Multi-Dimensional View of Data Mining
• Data to be mined
– Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW
• Knowledge to be mined
– Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
– Multiple/integrated functions and mining at multiple levels
• Techniques utilized
– Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
• Applications adapted
– Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, Web mining, etc.

Ingredients of an Effective KDD Process
“In order to discover anything, you must be looking for something.” Laws of Serendipity
[Diagram labels: Visualization and Human Computer Interaction; Plan for Learning; Generate and Test Hypotheses; Goals for Learning; Determine Knowledge Relevancy; Discover Knowledge; Knowledge Base; Discovery Algorithms; Evolve Knowledge/Data; Database(s); Background Knowledge]

Data Mining: History of the Field
• Knowledge Discovery in Databases workshops started ’89
– Now a conference under the auspices of ACM SIGKDD
– IEEE conference series started 2001
• Key founders/technology contributors:
– Usama Fayyad, JPL (then Microsoft, then his own company, Digimine, now Yahoo! Research labs)
– Gregory Piatetsky-Shapiro (then GTE, now his own data mining consulting company, Knowledge Stream Partners)
– Rakesh Agrawal (IBM Research)
• The term “data mining” has been around since at least 1983 in the statistics community

Why Data Mining?
Potential Applications
• Data analysis and decision support
– Market analysis and management
• Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
– Risk analysis and management
• Forecasting, customer retention, improved underwriting, quality control, competitive analysis
– Fraud detection and detection of unusual patterns (outliers)
• Other Applications
– Text mining (newsgroup, email, documents) and Web mining
– Stream data mining
– DNA and bio-data analysis

Market Analysis and Management
• Where does the data come from?
– Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
• Target marketing
– Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
– Determine customer purchasing patterns over time
• Cross-market analysis
– Associations/correlations between product sales, and prediction based on such associations
• Customer profiling
– What types of customers buy what products (clustering or classification)
• Customer requirement analysis
– identifying the best products for different customers
– predicting what factors will attract new customers
• Provision of summary information
– multidimensional summary reports
– statistical summary information (data central tendency and variation)

Corporate Analysis & Risk Management
• Finance planning and asset evaluation
– cash flow analysis and prediction
– contingent claim analysis to evaluate assets
– cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
• Resource planning
– summarize and compare the resources and spending
• Competition
– monitor competitors and market directions
– group customers into classes and apply a class-based pricing procedure
– set pricing strategy in a highly competitive market

Fraud Detection & Mining Unusual Patterns
• Approaches: Clustering & model construction for frauds, outlier analysis
• Applications: Health care, retail, credit card service, telecomm.
– Auto insurance: ring of collisions
– Money laundering: suspicious monetary transactions
– Medical insurance
• Professional patients, ring of doctors, and ring of references
• Unnecessary or correlated screening tests
– Telecommunications: phone-call fraud
• Phone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm
– Retail industry
• Analysts estimate that 38% of retail shrink is due to dishonest employees
– Anti-terrorism

Other Applications
• Sports
– IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat
• Astronomy
– JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
• Internet Web Surf-Aid
– IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.

Example: Use in retailing
• Goal: Improved business efficiency
– Improve marketing (advertise to the most likely buyers)
– Inventory reduction (stock only needed quantities)
• Information source: Historical business data
– Example: Supermarket sales records
Date/Time/Register  Fish  Turkey  Cranberries  Wine  ...
12/6 13:15 2        N     Y       Y            N     ...
12/6 13:16 3        Y     N       N            Y     ...
– Size ranges from 50k records (research studies) to terabytes (years of data from chains)
– Data is already being warehoused
• Sample question: what products are generally purchased together?
• The answers are in the data, if only we could see them

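As a rough illustration of the sample question above, the sketch below counts how often each pair of items appears in the same basket. The baskets are invented for the example (the slide shows only two sample rows), and `pair_counts` is a hypothetical helper name, not something from the lecture.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets, one set of items per register transaction.
baskets = [
    {"turkey", "cranberries"},
    {"fish", "wine"},
    {"turkey", "cranberries", "wine"},
    {"fish", "wine", "bread"},
    {"bread"},
]


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives every pair one canonical (alphabetical) order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


counts = pair_counts(baskets)
top_pairs = counts.most_common(2)  # the two most frequent pairs
```

Counting every pair is only workable for small item catalogs; the number of candidate pairs grows quadratically with the number of distinct items.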
Data Mining applied to Aviation Safety Records (Eric Bloedorn)
• Many groups record data regarding aviation safety, including the National Transportation Safety Board (NTSB) and the Federal Aviation Administration (FAA)
• Integrating data from different sources as well as mining for patterns from a mix of both structured fields and free text is a difficult task
• The goal of our initial analysis is to determine how data mining can be used to improve airline safety by finding patterns that predict safety problems

Aircraft Accident Report
• This data mining effort is an extension of the FAA Office of System Safety’s Flight Crew Accident and Incident Human Factors Project
• In this previous approach two database-specific human error models were developed based on general research into human factors
– FAA’s Pilot Deviation database (PDS)
– NTSB’s accident and incident database
• These error models check for certain values in specific fields
• Result
– Classification of some accidents caused by human mistakes and slips.

Problem
• Current model cannot classify a large number of records
• A large percentage of cases are labeled ‘unclassified’ by current model
– ~58,000 in the NTSB database (90% of the events identified as involving people)
– ~5,400 in the PDS database (93% of the events)
• Approximately 80,000 NTSB events are currently labeled ‘unknown’
• Classification into meaningful human error classes is low because the explicit fields and values required for the models to fire are not being used
• Models must be adjusted to better describe data

Data Mining Approach
• Use information from text fields to supplement current structured fields by extracting features from text in accident reports
• Build a human-error classifier directly from data
– Use expert to provide class labels for events of interest such as ‘slips’, ‘mistakes’ and ‘other’
– Use data-mining tools to build comprehensible rules describing each of these classes

Example Rule
• Sample decision rule using current model features and text features
– If (person_code_1b = 5150, 4105, 5100, 4100) and ((crew-subject-of-intentional-verb = true) or (modifier_code_1b = 3114)) then mistake
• “If pilot or copilot is involved and either the narrative, or the modifier code for 1b describes the crew as intentionally performing some action then this is a mistake”

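The sample rule can be written as a small predicate. This is a minimal sketch and not the project's actual code: the function name is made up, the hyphenated text feature is renamed with underscores so it is a valid identifier, and returning "unclassified" when the rule does not fire is an assumption borrowed from the label mentioned on the Problem slide.

```python
# Person codes listed in the sample rule (pilot or copilot, per the slide's paraphrase).
PILOT_OR_COPILOT_CODES = {5150, 4105, 5100, 4100}

# Modifier code named in the sample rule.
INTENTIONAL_ACTION_MODIFIER = 3114


def classify_event(person_code_1b, crew_subject_of_intentional_verb, modifier_code_1b):
    """Apply the single sample rule; the fallback label is an assumption."""
    crew_involved = person_code_1b in PILOT_OR_COPILOT_CODES
    intentional = (
        crew_subject_of_intentional_verb
        or modifier_code_1b == INTENTIONAL_ACTION_MODIFIER
    )
    if crew_involved and intentional:
        return "mistake"
    return "unclassified"
```

For example, `classify_event(5150, True, 0)` fires the rule through the text feature, while `classify_event(4100, False, 3114)` fires it through the structured modifier code.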
Data Mining Ideas: Logistics
• Delivery delays
– Debatable what data mining will do here; best match would be related to “quality analysis”: given lots of data about deliveries, try to find common threads in “problem” deliveries
• Predicting item needs
– Seasonal
• Looking for cycles, related to similarity search in time series data
• Look for similar cycles between products, even if not repeated
– Event-related
• Sequential association between event and product order (probably weak)

What Can Data Mining Do?
• Cluster
• Classify
– Categorical, Regression
• Summarize
– Summary statistics, Summary rules
• Link Analysis / Model Dependencies
– Association rules
• Sequence analysis
– Time-series analysis, Sequential associations
• Detect Deviations

Clustering
• Find groups of similar data items
• Statistical techniques require some definition of “distance” (e.g. between travel profiles) while conceptual techniques use background concepts and logical descriptions
Uses:
• Demographic analysis
Technologies:
• Self-Organizing Maps
• Probability Densities
• Conceptual Clustering
• “Group people with similar travel profiles”
– George, Patricia
– Jeff, Evelyn, Chris
– Rob
[Figure: Clusters]

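To make the role of a “distance” concrete, here is a minimal sketch that groups people whose travel profiles lie close together. It is a simple threshold-based (single-linkage style) grouping, not one of the technologies listed on the slide. The two profile features (trips per year, average trip length in days) and all numbers are invented so that the result matches the groups shown on the slide.

```python
import math

# Hypothetical travel profiles: (trips per year, average trip length in days).
profiles = {
    "George": (2, 14),
    "Patricia": (3, 12),
    "Jeff": (20, 2),
    "Evelyn": (22, 3),
    "Chris": (19, 2),
    "Rob": (10, 30),
}


def euclidean(a, b):
    """Straight-line distance between two feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def threshold_clusters(points, max_distance):
    """Put two names in one cluster if a chain of neighbors, each within
    max_distance of the next, connects them."""
    unassigned = set(points)
    clusters = []
    while unassigned:
        seed = min(unassigned)  # deterministic choice of starting point
        unassigned.remove(seed)
        cluster = {seed}
        frontier = [seed]
        while frontier:
            current = frontier.pop()
            near = {
                name
                for name in unassigned
                if euclidean(points[current], points[name]) <= max_distance
            }
            unassigned -= near
            cluster |= near
            frontier.extend(near)
        clusters.append(sorted(cluster))
    return sorted(clusters)


groups = threshold_clusters(profiles, max_distance=5.0)
```

The threshold of 5.0 is arbitrary; a different choice of distance function or threshold can change the groups, which is why the definition of distance matters so much for statistical clustering.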
Classification
• Find ways to separate data items into pre-defined groups
– We know X and Y belong together, find other things in same group
• Requires “training data”: Data items where group is known
Uses:
• Profiling
Technologies:
• Generate decision trees (results are human understandable)
• Neural Nets
• “Route documents to most likely interested parties”
– English or non-English?
– Domestic or Foreign?
[Diagram labels: Training Data; tool produces; classifier; Groups]

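The idea that a tool turns training data into a classifier can be sketched with a decision stump, i.e. a decision tree with a single split. The feature here (the fraction of a document's words found in an English word list) and the training values are assumptions made up for the example; `train_stump` and `predict` are hypothetical names.

```python
# Hypothetical training data: (fraction of words found in an English word list, label).
training_data = [
    (0.90, "english"),
    (0.80, "english"),
    (0.85, "english"),
    (0.20, "non-english"),
    (0.30, "non-english"),
    (0.10, "non-english"),
]


def train_stump(examples):
    """Learn (threshold, label_at_or_below, label_above) with the fewest
    training errors, trying midpoints between distinct feature values."""
    labels = sorted({label for _, label in examples})
    values = sorted({value for value, _ in examples})
    if len(values) < 2:
        raise ValueError("need at least two distinct feature values")
    best = None
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        for below in labels:
            for above in labels:
                errors = sum(
                    (below if value <= threshold else above) != label
                    for value, label in examples
                )
                if best is None or errors < best[0]:
                    best = (errors, threshold, below, above)
    return best[1:]


def predict(stump, value):
    """Classify a new item with a learned stump."""
    threshold, below, above = stump
    return below if value <= threshold else above


stump = train_stump(training_data)
```

The learned rule stays human-readable ("if the fraction is at most the threshold, then non-English"), which is the property the slide attributes to decision trees.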
Association Rules
• Identify dependencies in the data:
– X makes Y likely
• Indicate significance of each dependency
• Use example:
– Targeted marketing
• “Find groups of items commonly purchased together”
– People who purchase fish are extraordinarily likely to purchase wine
– People who purchase Turkey are extraordinarily likely to purchase cranberries
Date/Time/Register  Fish  Turkey  Cranberries  Wine
12/6 13:15 2        N     Y       Y            Y
12/6 13:16 3        Y     N       N            Y
…                   …     …       …            …

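One common way to quantify “X makes Y likely” and the significance of a dependency is with support, confidence, and lift. The slide does not name these measures, so treat this as a sketch of one standard choice. The transactions below are invented for illustration.

```python
# Hypothetical transactions; each set is one basket.
transactions = [
    {"fish", "wine"},
    {"fish", "wine", "bread"},
    {"turkey", "cranberries"},
    {"turkey", "cranberries", "wine"},
    {"bread", "milk"},
    {"milk"},
    {"fish"},
    {"bread"},
]


def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)


def confidence(transactions, lhs, rhs):
    """Estimated P(rhs | lhs) for the rule lhs -> rhs."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)


def lift(transactions, lhs, rhs):
    """How much more often rhs appears with lhs than it does overall."""
    return confidence(transactions, lhs, rhs) / support(transactions, rhs)


fish_wine_confidence = confidence(transactions, {"fish"}, {"wine"})
turkey_cranberries_lift = lift(transactions, {"turkey"}, {"cranberries"})
```

A lift well above 1 suggests the items co-occur more often than their individual frequencies would predict; a lift near 1 suggests the rule carries little information even when its confidence is high.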
Sequential Associations
• Find event sequences that are unusually likely
• Requires “training” event list, known “interesting” events
• Must be robust in the face of additional “noise” events
Uses:
• Failure analysis and prediction
Technologies:
• Dynamic programming (Dynamic time warping)
• “Custom” algorithms
• “Find common sequences of warnings/faults within 10 minute periods”
– Warn 2 on Switch C preceded by Fault 21 on Switch B
– Fault 17 on any switch preceded by Warn 2 on any switch
Time   Switch  Event
21:10  B       Fault 21
21:11  A       Warn 2
21:13  C       Warn 2
21:20  A       Fault 17

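A naive version of the “within 10 minute periods” query can be sketched by counting ordered pairs of events that fall inside a time window. This is plain pair counting over the slide's four-row event table, not dynamic time warping, and it ignores the switch when forming pairs (matching the "on any switch" pattern); the helper names are hypothetical.

```python
from collections import Counter

# Event log from the slide: (time, switch, event).
events = [
    ("21:10", "B", "Fault 21"),
    ("21:11", "A", "Warn 2"),
    ("21:13", "C", "Warn 2"),
    ("21:20", "A", "Fault 17"),
]


def to_minutes(hhmm):
    """Convert 'HH:MM' to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def ordered_pairs_within(events, window_minutes=10):
    """Count (earlier event, later event) pairs at most window_minutes apart."""
    timeline = sorted((to_minutes(t), switch, name) for t, switch, name in events)
    counts = Counter()
    for i, (start, _, first) in enumerate(timeline):
        for later, _, second in timeline[i + 1:]:
            if later - start > window_minutes:
                break  # timeline is sorted, so no later event can qualify
            counts[(first, second)] += 1
    return counts


pairs = ordered_pairs_within(events)
```

On a real log, the counts would then be compared against how often each pair is expected by chance, so that frequent but uninformative “noise” events do not dominate.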
Deviation Detection
• Find unexpected values, outliers
• “Find unusual occurrences in IBM stock prices”
Uses:
• Failure analysis
• Anomaly discovery for analysis
Technologies:
• clustering/classification methods
• Statistical techniques
• visualization
Sample date  Event            Occurrences
58/07/04     Market closed    317 times
59/01/06     2.5% dividend    2 times
59/04/04     50% stock split  7 times
73/10/09     not traded       1 time
Date      Close   Volume  Spread
58/07/02  369.50  314.08  .022561
58/07/03  369.25  313.87  .022561
58/07/04  Market Closed
58/07/07  370.00  314.50  .022561

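One of the simplest statistical techniques for finding unexpected values is a z-score test on day-over-day changes. The sketch below assumes a short, made-up price series (only the first three values echo the slide's table) in which a sudden halving of the price, such as a stock split would cause, stands out. The threshold of 2.0 is an arbitrary choice for the example.

```python
import statistics

# Hypothetical closing prices; the drop at index 6 mimics a 50% stock split.
prices = [369.50, 369.25, 370.00, 370.50, 369.75, 370.25, 185.00, 185.50, 185.25]


def unusual_changes(prices, z_threshold=2.0):
    """Return indexes whose relative change from the previous price lies
    more than z_threshold standard deviations from the mean change."""
    changes = [later / earlier - 1.0 for earlier, later in zip(prices, prices[1:])]
    mean = statistics.mean(changes)
    spread = statistics.pstdev(changes)
    if spread == 0:
        return []  # all changes identical, nothing deviates
    return [
        i + 1  # change i is the move from prices[i] to prices[i + 1]
        for i, change in enumerate(changes)
        if abs(change - mean) / spread > z_threshold
    ]


flagged = unusual_changes(prices)
```

Because a large outlier inflates the mean and standard deviation it is measured against, this test becomes less sensitive as outliers accumulate; more robust statistics such as the median are often preferred in practice.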
Data Mining Complications
• Volume of Data
– Clever algorithms needed for reasonable performance
• Interest measures
– How do we ensure algorithms select “interesting” results?
• “Knowledge Discovery Process” skill required
– How to select tool, prepare data?
• Data Quality
– How do we interpret results in light of low quality data?
• Data Source Heterogeneity
– How do we combine data from multiple sources?

Major Issues in Data Mining
• Mining methodology
– Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
– Performance: efficiency, effectiveness, and scalability
– Pattern evaluation: the interestingness problem
– Incorporation of background knowledge
– Handling noise and incomplete data
– Parallel, distributed and incremental mining methods
– Integration of the discovered knowledge with existing knowledge: knowledge fusion
• User interaction
– Data mining query languages and ad-hoc mining
– Expression and visualization of data mining results
– Interactive mining of knowledge at multiple levels of abstraction
• Applications and social impacts
– Domain-specific data mining & invisible data mining
– Protection of data security, integrity, and privacy

Steps of a KDD Process
• Learning the application domain
– relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing (may take 60% of effort!)
• Data reduction and transformation
– Find useful features, dimensionality/variable reduction, invariant representation.
• Choosing functions of data mining
– summarization, classification, regression, association, clustering.
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge

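The data-handling steps above can be pictured as a chain of small functions. The sketch below is a toy illustration only: the records, field names, age bands, and the minimum-count filter are all invented, and each function stands in for what would be a substantial step in a real project.

```python
from collections import Counter

# Hypothetical raw records; one has a missing age.
raw = [
    {"customer": "a", "age": 34, "item": "wine", "store": 1},
    {"customer": "b", "age": None, "item": "fish", "store": 1},
    {"customer": "c", "age": 61, "item": "wine", "store": 2},
    {"customer": "d", "age": 29, "item": "wine", "store": 2},
    {"customer": "e", "age": 67, "item": "fish", "store": 1},
]


def select(records, fields):
    """Data selection: keep only the fields relevant to the goal."""
    return [{f: r[f] for f in fields} for r in records]


def clean(records):
    """Data cleaning: drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def transform(records):
    """Data reduction: replace exact age with a coarse age band."""
    return [
        {"age_band": "under_50" if r["age"] < 50 else "50_plus", "item": r["item"]}
        for r in records
    ]


def mine(records):
    """Data mining: count each (age band, item) combination."""
    return Counter((r["age_band"], r["item"]) for r in records)


def evaluate(patterns, min_count):
    """Pattern evaluation: keep only patterns seen at least min_count times."""
    return {p: c for p, c in patterns.items() if c >= min_count}


patterns = evaluate(mine(transform(clean(select(raw, ["age", "item"])))), min_count=2)
```

Keeping each step as its own function makes it easy to revisit an earlier stage, which matters because the process is rarely run once straight through.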
Data Mining and Business Intelligence
[Diagram: layers with increasing potential to support business decisions toward the top. Layers, top to bottom:]
• Making Decisions
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: Statistical Analysis, Querying and Reporting
• Data Warehouses / Data Marts: OLAP, MDA
• Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles, top to bottom: End User, Business Analyst, Data Analyst, DBA

Architecture: Typical Data Mining System
[Diagram labels, top to bottom: Graphical user interface; Pattern evaluation; Data mining engine; Database or data warehouse server; Data cleaning & data integration, Filtering; Databases, Data Warehouse. Side component: Knowledge-base]
CS590D

State of Commercial/Research Practice
• Increasing use of data mining systems in financial community, marketing sectors, retailing
• Still have major problems with large, dynamic sets of data (need better integration with the databases)
– COTS data mining packages perform specialized learning on small subset of data
• Most research emphasizes machine learning; little emphasis on database side (especially text)
• People achieving results are not likely to share knowledge

Related Techniques: OLAP
On-Line Analytical Processing
• On-Line Analytical Processing tools provide the ability to pose statistical and summary queries interactively (traditional On-Line Transaction Processing (OLTP) databases may take minutes or even hours to answer these queries)
• Advantages relative to data mining
– Can obtain a wider variety of results
– Generally faster to obtain results
• Disadvantages relative to data mining
– User must “ask the right question”
– Generally used to determine high-level statistical summaries, rather than specific relationships among instances

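A summary query of the kind OLAP tools answer can be sketched as a roll-up: totals of a measure grouped by one or more dimensions. The sales rows, dimension names, and the `roll_up` helper are assumptions for illustration; real OLAP systems precompute and index such aggregates rather than scanning rows.

```python
from collections import defaultdict

# Hypothetical fact rows with three dimensions and one measure.
sales = [
    {"region": "West", "product": "wine", "quarter": "Q1", "amount": 100.0},
    {"region": "West", "product": "fish", "quarter": "Q1", "amount": 50.0},
    {"region": "West", "product": "wine", "quarter": "Q2", "amount": 75.0},
    {"region": "East", "product": "wine", "quarter": "Q1", "amount": 25.0},
    {"region": "East", "product": "fish", "quarter": "Q2", "amount": 150.0},
]


def roll_up(rows, dimensions, measure="amount"):
    """Total the measure for each combination of the given dimensions."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)


by_region = roll_up(sales, ["region"])                     # coarse summary
by_region_product = roll_up(sales, ["region", "product"])  # drill down
q1_only = [row for row in sales if row["quarter"] == "Q1"]  # slice on quarter
q1_by_product = roll_up(q1_only, ["product"])
```

Each query answers exactly the question asked, which illustrates the disadvantage noted above: the user has to know which dimensions to group by before anything interesting appears.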
Integration of Data Mining and Data Warehousing
• Data mining systems, DBMS, Data warehouse systems coupling
– No coupling, loose-coupling, semi-tight-coupling, tight-coupling
• On-line analytical mining data
– integration of mining and OLAP technologies
• Interactive mining multi-level knowledge
– Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc.
• Integration of multiple mining functions
– Characterized classification, first clustering and then association