Data Informatics
Seon Ho Kim, Ph.D.
seonkim@usc.edu

Data Mining

What Is Data Mining?
• Data mining
  – Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
• Alternative names
  – Knowledge Discovery (mining) in Databases (KDD), knowledge discovery from data, knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Data Mining: What’s in a Name?
• Information Harvesting, Knowledge Mining, Data Mining, Knowledge Discovery in Databases, Data Pattern Processing, Data Dredging, Data Archaeology, Database Mining, Siftware, Knowledge Extraction
• The process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of stored data, using pattern recognition technologies and statistical and mathematical techniques

Integration of Multiple Technologies
[Figure: Artificial Intelligence, Machine Learning, Database Management, Statistics, Visualization, and Algorithms all feed into Data Mining]

Terms
• Machine Learning concerns the study, design and development of algorithms that give computers the capability to learn without being explicitly programmed (Arthur Samuel's definition).
• Data Mining can be defined as the process that, starting from apparently unstructured data, tries to extract knowledge and/or unknown interesting patterns. Machine learning algorithms are used during this process.
• Data Analysis, Data Mining, Machine Learning and Mathematical Modeling are tools: means towards an end.
• Analytics, Business Intelligence, Econometrics and Artificial Intelligence are application areas: domains that use the tools above (and others) to produce results within their subject. Among them, Analytics is probably the most generic term (i.e. not domain-specific).
• Statistics is a branch of Mathematics providing theoretical and practical support to the above tools.
• Data Science is a catch-all term for using all of those tools to provide answers in all of those areas (and in others), especially when dealing with Big Data, which is nothing more than a label for doing any of the above when the datasets are huge.
Knowledge Discovery in Databases: Process
[Figure: Data → Selection → Target Data → Preprocessing → Preprocessed Data → Data Mining → Patterns → Interpretation/Evaluation → Knowledge]
Adapted from: U. Fayyad, et al. (1995), “From Knowledge Discovery to Data Mining: An Overview,” Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press

Multi-Dimensional View of Data Mining
• Data to be mined
  – Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW
• Knowledge to be mined
  – Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
  – Multiple/integrated functions and mining at multiple levels
• Techniques utilized
  – Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
• Applications adapted
  – Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, Web mining, etc.

Ingredients of an Effective KDD Process
“In order to discover anything, you must be looking for something.” Laws of Serendipity
[Figure labels: Visualization and Human Computer Interaction; Plan for Learning; Generate and Test Hypotheses; Goals for Learning; Determine Knowledge Relevancy; Discover Knowledge; Knowledge Base; Discovery Algorithms; Evolve Knowledge/Data; Database(s); Background Knowledge]

Data Mining: History of the Field
• Knowledge Discovery in Databases workshops started ‘89
  – Now a conference under the auspices of ACM SIGKDD
  – IEEE conference series started 2001
• Key founders/technology contributors:
  – Usama Fayyad, JPL (then Microsoft, then his own company, Digimine, now Yahoo! Research labs)
  – Gregory Piatetsky-Shapiro (then GTE, now his own data mining consulting company, KnowledgeStream Partners)
  – Rakesh Agrawal (IBM Research)
• The term “data mining” has been around since at least 1983, in the statistics community

Why Data Mining?
Potential Applications
• Data analysis and decision support
  – Market analysis and management
    • Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
  – Risk analysis and management
    • Forecasting, customer retention, improved underwriting, quality control, competitive analysis
  – Fraud detection and detection of unusual patterns (outliers)
• Other Applications
  – Text mining (newsgroup, email, documents) and Web mining
  – Stream data mining
  – DNA and bio-data analysis

Market Analysis and Management
• Where does the data come from?
  – Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
• Target marketing
  – Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
  – Determine customer purchasing patterns over time
• Cross-market analysis
  – Associations/co-relations between product sales, & prediction based on such association
• Customer profiling
  – What types of customers buy what products (clustering or classification)
• Customer requirement analysis
  – Identify the best products for different customers
  – Predict what factors will attract new customers
• Provision of summary information
  – Multidimensional summary reports
  – Statistical summary information (data central tendency and variation)

Corporate Analysis & Risk Management
• Finance planning and asset evaluation
  – Cash flow analysis and prediction
  – Contingent claim analysis to evaluate assets
  – Cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
• Resource planning
  – Summarize and compare the resources and spending
• Competition
  – Monitor competitors and market directions
  – Group customers into classes and a class-based pricing procedure
  – Set pricing strategy in a highly competitive market

Fraud Detection & Mining Unusual Patterns
• Approaches: Clustering & model construction for frauds, outlier analysis
• Applications: Health care, retail, credit card service, telecomm.
  – Auto insurance: ring of collisions
  – Money laundering: suspicious monetary transactions
  – Medical insurance
    • Professional patients, ring of doctors, and ring of references
    • Unnecessary or correlated screening tests
  – Telecommunications: phone-call fraud
    • Phone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm
  – Retail industry
    • Analysts estimate that 38% of retail shrink is due to dishonest employees
  – Anti-terrorism

Other Applications
• Sports
  – IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat
• Astronomy
  – JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
• Internet Web Surf-Aid
  – IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.

Example: Use in retailing
• Goal: Improved business efficiency
  – Improve marketing (advertise to the most likely buyers)
  – Inventory reduction (stock only needed quantities)
• Information source: Historical business data
  – Example: Supermarket sales records

      Date/Time/Register   Fish   Turkey   Cranberries   Wine
      12/6 13:15 2         N      Y        Y             N
      12/6 13:16 3         Y      N        N             Y
      ...                  ...    ...      ...           ...

  – Size ranges from 50k records (research studies) to terabytes (years of data from chains)
  – Data is already being warehoused
• Sample question: what products are generally purchased together?
• The answers are in the data, if only we could see them

Data Mining applied to Aviation Safety Records (Eric Bloedorn)
• Many groups record data regarding aviation safety, including the National Transportation Safety Board (NTSB) and the Federal Aviation Administration (FAA)
• Integrating data from different sources as well as mining for patterns from a mix of both structured fields and free text is a difficult task
• The goal of our initial analysis is to determine how data mining can be used to improve airline safety by finding patterns that predict safety problems

Aircraft Accident Report
• This data mining effort is an extension of the FAA Office of System Safety’s Flight Crew Accident and Incident Human Factors Project
• In this previous approach two database-specific human error models were developed based on general research into human factors
  – FAA’s Pilot Deviation database (PDS)
  – NTSB’s accident and incident database
• These error models check for certain values in specific fields
• Result
  – Classification of some accidents caused by human mistakes and slips.

Problem
• Current model cannot classify a large number of records
• A large percentage of cases are labeled ‘unclassified’ by current model
  – ~58,000 in the NTSB database (90% of the events identified as involving people)
  – ~5,400 in the PDS database (93% of the events)
• Approximately 80,000 NTSB events are currently labeled ‘unknown’
• Classification into meaningful human error classes is low because the explicit fields and values required for the models to fire are not being used
• Models must be adjusted to better describe data

Data Mining Approach
• Use information from text fields to supplement current structured fields by extracting features from text in accident reports
• Build a human-error classifier directly from data
  – Use expert to provide class labels for events of interest such as ‘slips’, ‘mistakes’ and ‘other’
  – Use data-mining tools to build comprehensible rules describing each of these classes

Example Rule
• Sample decision rule using current model features and text features
  – If (person_code_1b = 5150, 4105, 5100, 4100) and ((crew-subject-of-intentional-verb = true) or (modifier_code_1b = 3114)) then mistake
“If pilot or copilot is involved and either the narrative, or the modifier code for 1b describes the crew as intentionally performing some action then this is a mistake”

Data Mining Ideas: Logistics
• Delivery delays
  – Debatable what data mining will do here; best match would be related to “quality analysis”: given lots of data about deliveries, try to find common threads in “problem” deliveries
• Predicting item needs
  – Seasonal
    • Looking for cycles, related to similarity search in time series data
    • Look for similar cycles between products, even if not repeated
  – Event-related
    • Sequential association between event and product order (probably weak)

What Can Data Mining Do?
• Cluster
• Classify
  – Categorical, Regression
• Summarize
  – Summary statistics, Summary rules
• Link Analysis / Model Dependencies
  – Association rules
• Sequence analysis
  – Time-series analysis, Sequential associations
• Detect Deviations

Clustering
• Find groups of similar data items
• Statistical techniques require some definition of “distance” (e.g. between travel profiles) while conceptual techniques use background concepts and logical descriptions
Uses:
• Demographic analysis
Technologies:
• Self-Organizing Maps
• Probability Densities
• Conceptual Clustering
Example:
• “Group people with similar travel profiles”; resulting clusters:
  – George, Patricia
  – Jeff, Evelyn, Chris
  – Rob

Classification
• Find ways to separate data items into pre-defined groups
  – We know X and Y belong together, find other things in same group
• Requires “training data”: Data items where group is known
Uses:
• Profiling
Technologies:
• Generate decision trees (results are human understandable)
• Neural Nets
Example:
• “Route documents to most likely interested parties”
  – English or non-English?
  – Domestic or Foreign?
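The sample decision rule from the aviation-safety example is the kind of comprehensible classifier these slides describe. A minimal sketch of that one rule as a Python predicate follows. The code values are copied from the slide; the dictionary layout, the underscore spelling of the text-feature key, and the 'unclassified' fallback label are assumptions made for illustration, not part of the original models.

```python
# Codes copied from the slide's sample rule. The slide's paraphrase says they
# denote pilot or copilot involvement; no official code book is assumed here.
CREW_PERSON_CODES = {5150, 4105, 5100, 4100}
INTENTIONAL_MODIFIER_CODE = 3114


def classify_event(record: dict) -> str:
    """Return 'mistake' when the sample rule fires, otherwise 'unclassified'.

    `record` is a hypothetical flat dictionary holding the structured fields
    plus one boolean feature extracted from the narrative text.
    """
    crew_involved = record.get("person_code_1b") in CREW_PERSON_CODES
    intentional_in_text = bool(record.get("crew_subject_of_intentional_verb", False))
    intentional_in_code = record.get("modifier_code_1b") == INTENTIONAL_MODIFIER_CODE
    if crew_involved and (intentional_in_text or intentional_in_code):
        return "mistake"
    return "unclassified"


# Made-up events, for illustration only.
events = [
    {"person_code_1b": 5150, "crew_subject_of_intentional_verb": True},
    {"person_code_1b": 4100, "modifier_code_1b": 3114},
    {"person_code_1b": 9999, "modifier_code_1b": 3114},
    {"person_code_1b": 5100},
]
labels = [classify_event(event) for event in events]
print(labels)
```

A data-mining tool would learn many such rules from expert-labeled events; the value of this form is that each rule can be read and checked by a domain expert.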
[Figure: a tool takes Training Data and produces a classifier, which assigns items to Groups]

Association Rules
• Identify dependencies in the data:
  – X makes Y likely
• Indicate significance of each dependency
• Use Example:
  – Targeted marketing
• “Find groups of items commonly purchased together”
  – People who purchase fish are extraordinarily likely to purchase wine
  – People who purchase Turkey are extraordinarily likely to purchase cranberries

      Date/Time/Register   Fish   Turkey   Cranberries   Wine
      12/6 13:15 2         N      Y        Y             Y
      12/6 13:16 3         Y      N        N             Y
      …                           …        …             …

Sequential Associations
• Find event sequences that are unusually likely
• Requires “training” event list, known “interesting” events
• Must be robust in the face of additional “noise” events
Uses:
• Failure analysis and prediction
Technologies:
• Dynamic programming (Dynamic time warping)
• “Custom” algorithms
Example:
• “Find common sequences of warnings/faults within 10 minute periods”
  – Warn 2 on Switch C preceded by Fault 21 on Switch B
  – Fault 17 on any switch preceded by Warn 2 on any switch

      Time    Switch   Event
      21:10   B        Fault 21
      21:11   A        Warn 2
      21:13   C        Warn 2
      21:20   A        Fault 17

Deviation Detection
• Find unexpected values, outliers
Uses:
• Failure analysis
• Anomaly discovery for analysis
Technologies:
• Clustering/classification methods
• Statistical techniques
• Visualization
Example:
• “Find unusual occurrences in IBM stock prices”

      Date       Close           Volume   Spread
      58/07/02   369.50          314.08   .022561
      58/07/03   369.25          313.87   .022561
      58/07/04   Market Closed
      58/07/07   370.00          314.50   .022561

      Sample date   Event             Occurrences
      58/07/04      Market closed     317 times
      59/01/06      2.5% dividend     2 times
      59/04/04      50% stock split   7 times
      73/10/09      not traded        1 time

Data Mining Complications
• Volume of Data
  – Clever algorithms needed for reasonable performance
• Interest measures
  – How do we ensure algorithms select “interesting” results?
• “Knowledge Discovery Process” skill required
  – How to select tool, prepare data?
• Data Quality
  – How do we interpret results in light of low quality data?
• Data Source Heterogeneity
  – How do we combine data from multiple sources?
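The association-rules idea described earlier ("X makes Y likely", with a significance attached to each dependency) can be made concrete with three standard measures: support, confidence, and lift. The sketch below is a brute-force illustration over item pairs, not a production algorithm such as Apriori. The baskets and the two thresholds are hypothetical values chosen for the example; they are not the supermarket data from the slides.

```python
from itertools import combinations

# Hypothetical market baskets, one set of items per transaction.
transactions = [
    {"fish", "wine"},
    {"turkey", "cranberries"},
    {"fish", "wine", "bread"},
    {"turkey", "cranberries", "wine"},
    {"bread", "milk"},
    {"fish", "wine", "milk"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def pair_rules(baskets, min_support=0.3, min_confidence=0.8):
    """Return rules (x, y, support, confidence, lift) meaning 'x makes y likely'."""
    items = sorted(set().union(*baskets))
    rules = []
    for a, b in combinations(items, 2):
        pair_support = support({a, b}, baskets)
        if pair_support < min_support:
            continue  # too rare to count as "commonly purchased together"
        for x, y in ((a, b), (b, a)):
            confidence = pair_support / support({x}, baskets)  # estimate of P(y | x)
            lift = confidence / support({y}, baskets)  # above 1: y is likelier given x
            if confidence >= min_confidence:
                rules.append((x, y, pair_support, confidence, lift))
    return rules


for x, y, sup, conf, lift in pair_rules(transactions):
    print(f"{x} -> {y}: support={sup:.2f} confidence={conf:.2f} lift={lift:.2f}")
```

Lift is one possible answer to the "interest measures" question on the complications slide: a rule with high confidence but lift near 1 says little, because the consequent is common regardless of the antecedent.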
Major Issues in Data Mining
• Mining methodology
  – Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
  – Performance: efficiency, effectiveness, and scalability
  – Pattern evaluation: the interestingness problem
  – Incorporation of background knowledge
  – Handling noise and incomplete data
  – Parallel, distributed and incremental mining methods
  – Integration of the discovered knowledge with existing knowledge: knowledge fusion
• User interaction
  – Data mining query languages and ad-hoc mining
  – Expression and visualization of data mining results
  – Interactive mining of knowledge at multiple levels of abstraction
• Applications and social impacts
  – Domain-specific data mining & invisible data mining
  – Protection of data security, integrity, and privacy

Steps of a KDD Process
• Learning the application domain
  – Relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing (may take 60% of effort!)
• Data reduction and transformation
  – Find useful features, dimensionality/variable reduction, invariant representation.
• Choosing functions of data mining
  – Summarization, classification, regression, association, clustering.
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
  – Visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge

Data Mining and Business Intelligence
[Figure: pyramid with increasing potential to support business decisions toward the top. Layers, top to bottom: Making Decisions; Data Presentation (Visualization Techniques); Data Mining (Information Discovery); Data Exploration (Statistical Analysis, Querying and Reporting); Data Warehouses / Data Marts (OLAP, MDA); Data Sources (Paper, Files, Information Providers, Database Systems, OLTP). Roles alongside, top to bottom: End User, Business Analyst, Data Analyst, DBA]

Architecture: Typical Data Mining System
[Figure labels: Graphical user interface; Pattern evaluation; Data mining engine; Knowledge-base; Database or data warehouse server; Data cleaning & data integration; Filtering; Databases; Data Warehouse. Slide footer: CS590D]

State of Commercial/Research Practice
• Increasing use of data mining systems in financial community, marketing sectors, retailing
• Still have major problems with large, dynamic sets of data (need better integration with the databases)
  – COTS data mining packages perform specialized learning on small subset of data
• Most research emphasizes machine learning; little emphasis on database side (especially text)
• People achieving results are not likely to share knowledge

Related Techniques: OLAP
On-Line Analytical Processing
• On-Line Analytical Processing tools provide the ability to pose statistical and summary queries interactively (traditional On-Line Transaction Processing (OLTP) databases may take minutes or even hours to answer these queries)
• Advantages relative to data mining
  – Can obtain a wider variety of results
  – Generally faster to obtain results
• Disadvantages relative to data mining
  – User must “ask the right question”
  – Generally used to determine high-level statistical summaries, rather than specific relationships among instances

Integration of Data Mining and Data Warehousing
• Data mining systems, DBMS, Data warehouse systems coupling
  – No coupling, loose-coupling, semi-tight-coupling, tight-coupling
• On-line analytical mining data
  – Integration of mining and OLAP technologies
• Interactive mining multi-level knowledge
  – Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc.
• Integration of multiple mining functions
  – Characterized classification, first clustering and then association
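The OLAP operations named in the closing slides (rolling up, drilling down, slicing) amount to re-aggregating a fact table at different levels of abstraction. The sketch below shows this with plain Python. The fact table, the dimension names, and the helper functions are all hypothetical; a real OLAP server would precompute and index such aggregates instead of scanning rows.

```python
from collections import defaultdict

# Hypothetical fact table: (region, store, product, units_sold).
DIMENSIONS = ("region", "store", "product")
sales = [
    ("West", "LA-1", "fish", 10),
    ("West", "LA-1", "wine", 7),
    ("West", "SF-1", "fish", 4),
    ("East", "NY-1", "wine", 12),
    ("East", "NY-1", "turkey", 5),
]


def aggregate(rows, group_by):
    """Sum units_sold over the dimensions listed in `group_by`."""
    positions = [DIMENSIONS.index(name) for name in group_by]
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[p] for p in positions)
        totals[key] += row[-1]
    return dict(totals)


def slice_rows(rows, dimension, value):
    """Keep only the rows where `dimension` equals `value` (an OLAP slice)."""
    position = DIMENSIONS.index(dimension)
    return [row for row in rows if row[position] == value]


by_store = aggregate(sales, ("region", "store"))  # finer level (drill-down)
by_region = aggregate(sales, ("region",))  # coarser level (roll-up)
west_by_product = aggregate(slice_rows(sales, "region", "West"), ("product",))

print(by_store)
print(by_region)
print(west_by_product)
```

In this picture the user still has to "ask the right question" by choosing the dimensions; mining at multiple levels of abstraction means running a mining function over each of these aggregated views.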