Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Data Mining By: William Delmore, Justin Robinson Overview ● ● ● ● ● ● ● What is Data Mining Data, Information, Knowledge How Does Data Mining Work Knowledge Discovery in Database (KDD) Data Mining Techniques Data Mining Tools Why is Data Mining useful in business? 2 What Is Data Mining? Data Mining ● is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. ● overall goal is to extract information from a data set and transform it into knowledge 4 Data Mining ● refers to the application of algorithms for extracting patterns from data without the additional steps of the KDD process. ● Data mining is considered a misnomer, because it’s the extraction of patterns and knowledge from large amounts of data not the extraction (mining) of data itself. ● is the analysis step of the “KDD” process. 5 Architecture of a Typical Data Mining System 6 Data, Information, and Knowledge What is Data ● Data are any facts, numbers, or text that can be processed by a computer. Today, organizations are accumulating vast and growing amounts of data in different formats and different databases. ● Data encompasses: 1. operational or transactional data such as, sales, cost, inventory, payroll, and accounting 2. non-operational data, such as industry sales, forecast data, and macro economic data 3. metadata - data about the data itself, such as logical database design or data dictionary definitions ● inputs collected such as transactions on a credit card, clicks of a mouse, views of a video, or anything that can be recorded can be used as data. 8 Data explosion Problem ● Advance data collection tools and database technology lead to tremendous amounts of data stored in databases ● We are drowning in data, but starving for knowledge ● Solution: Data warehousing and data mining ● Data warehousing and on-line analytical processing ● Extraction of interesting knowledge using Data Mining 9 Data Warehousing ● ● ● ● ● ● Data Warehousing provides Enterprise with a memory Data Mining provides the Enterprise with intelligence On-Line Application Processing (OLAP): Few, but complex queries --- may run for hours. Queries do not depend on having an absolutely up-to-date database. Example: Analysts at Wal-Mart look for items with increasing sales in some region 10 What is Information ● The patterns ● associations ● relationships among all this data can provide information. ● For example, analysis of retail point of sale transaction data can yield information on which products are selling and when. 11 What is Knowledge ● Information can be converted into knowledge about historical patterns and future trends. ● For example, summary information on retail supermarket sales can be analyzed in light of promotional efforts to provide knowledge of consumer buying behavior. Thus, a manufacturer or retailer could determine which items are most susceptible to promotional efforts. 12 How Does Data Mining Work How Does Data Mining Work ● While large-scale information technology has been evolving separate transaction and analytical systems, data mining provides the link between the two. ● Data mining software analyzes relationships and patterns in stored transaction data based on open-ended user queries. 14 Data Mining Consist of Five Major Elements 1. Extract, transform, and load transaction data onto the data warehouse system. 2. Store and manage the data in a multidimensional database system. 3. Provide data access to business analysts and information technology professionals. 4. Analyze the data by application software. 5. Present the data in a useful format, such as a graph or table. 15 Knowledge Discovery in Database (KDD) Knowledge Discovery in Database Knowledge Discovery in Database (KDD) ● refers to the overall process of discovering useful knowledge from data. ● It involves the evaluation and possibly interpretation of the patterns to make the decision of what qualifies as knowledge. ● It also includes the choice of encoding schemes, preprocessing, sampling, and projections of the data prior to the data mining step. 17 Selection Creating a target data set: ● selecting a data set, ● or focusing on a subset of variables, ● or data samples, on which discovery is to be performed. ● Process of refining data into target data 18 Pre-processing Data cleaning and preprocessing: • Removal of noise or outliers • Collecting necessary information to model or account for noise • Strategies for handling missing data fields • Accounting for time sequence information and known changes • Gets to preprocessed data 19 Transformation Normalizes the data Find correlated variables Eliminates redundancies. Data reduction and projection. Finding useful features to represent the data depending on the goal of the task. Using dimensionality reduction or transformation methods to reduce the effective number of variables under consideration or to find invariant representations for the data. ● Gets to transformed data. ● ● ● ● ● ● 20 Data Mining 1. 2. 3. Choosing the data mining task Deciding whether the goal of the KDD process is classification, regression, clustering, etc. Choosing the data mining algorithms Selecting method(s) to be used for searching for patterns in the data. Deciding which models and parameters may be appropriate. Matching a particular data mining method with the overall criteria of the KDD process. Data mining. Searching for patterns of interest in a particular representational form or a set of such representations as classification rules or trees, regression, clustering, and so forth. 21 Data Mining Tasks The goals of prediction and description are achieved by using the following primary data mining tasks: 1. Classification is learning a function that maps (classifies) a data item into one of several predefined classes. 2. Regression is learning a function which maps a data item to a real-valued prediction variable. 3. Clustering is a common descriptive task where one seeks to identify a finite set of categories or clusters to describe the data. ● Closely related to clustering is the task of probability density estimation which consists of techniques for estimating, from data, the joint multivariate probability density function of all of the variables/fields in the database. 22 Data Mining Tasks 4. Summarization involves methods for finding a compact description for a subset of data. 5. Dependency Modeling consists of finding a model which describes significant dependencies between variables. ● Dependency models exist at two levels: i. The structural level of the model specifies (often graphically) which variables are locally dependent on each other, and ii. The quantitative level of the model specifies the strengths of the dependencies using some numerical scale. 6. Change and Deviation Detection focuses on discovering the most significant changes in the data from previously measured or normative values. 23 Interpretation/Evaluation ● looks at patterns discovered ● tries to see if redundancies or irrelevant patterns exist ● Consolidating discovered knowledge. 24 KDD Compared to Data Mining Knowledge Discovery in Database: ● ● ● refers to the overall process of discovering useful knowledge from data. It involves the evaluation and possibly interpretation of the patterns to make the decision of what qualifies as knowledge. It also includes the choice of encoding schemes, preprocessing, sampling, and projections of the data prior to the data mining step. Data mining: ● refers to the application of algorithms for extracting patterns from data without the additional steps of the KDD process. 25 Definitions Related to the KDD Process ● ● Knowledge discovery in databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. Interestingness is an overall measure of pattern value, combining validity, novelty, usefulness, and simplicity. 26 Data Mining Techniques Data Mining Techniques Techniques ● Association rules ● Classification algorithms ● Decision trees ● Regression algorithms ● Neural networks 28 Association Rules ● Association rules- if/then statements that help uncover relationships between seemingly unrelated data in a relational database ● The most common application of this kind of algorithm is market basket analysis. 29 Market Basket Analysis • Market Basket Analysis is a modelling technique based upon the theory that if you buy a certain group of items, you are more (or less) likely to buy another group of items. • In retailing, most purchases are bought on impulse. Market basket analysis gives clues as to what a customer might have bought if the idea had occurred to them. 30 Association Rules-If/Then Statements • "If a customer buys a dozen eggs, he is 80% likely to also purchase milk." • The probability that a customer will buy eggs (i.e. that the antecedent is true) is referred to as the support for the rule. The conditional probability that a customer will also purchase milk is referred to as the confidence. 31 Classification Algorithms • Classification algorithms encompasses a large group of algorithms used for machine learning. • Machine learning explores the study and construction of algorithms that can learn from and make predictions on data. • Instead of using static programming instructions the algorithm builds a simple model with sample inputs to make data driven decisions. 32 Classification examples ● Quadratic Classifier-used to separate measurements of two or more classes of objects or events by a quadric surface ● Naive Bayes classifier- a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features. It is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable ● Nearest Neighbor Search (proximity search)-is the optimization problem of finding the point in a given set that is closest (or most similar) to a given point 33 Decision Trees ● Decision trees Two types: 1. Classification tree-analysis is when the predicted outcome is the class to which the data belongs 2. Regression trees-Regression tree analysis is when the predicted outcome can be considered a real number 34 Regression Algorithms ● Regression algorithms predict one or more continuous numeric variables such as profit or loss, based on other attributes in the dataset. ● Regression focuses on the relationship between a dependent variable and one or more independent variables (or 'predictors'). More specifically, regression analysis helps one understand how the typical value of the dependent variable (or 'criterion variable') changes when any one of the independent variables is varied 35 Neural Networks ● Neural Network-a computer system modeled on the human brain and nervous system. ● These systems are self-learning and trained, rather than explicitly programmed, and excel in areas where the solution or feature detection is difficult to express in a traditional computer program. 36 Neural Networks ● Neural networks typically consist of multiple layers or a cube design, and the signal path traverses from front to back. ● Newer designs of a neural network are more free flowing to access information faster ● Dynamic neural networks are the most advanced- in that they dynamically can, based on rules, form new connections and even new neural units while disabling others. 37 Cluster Analysis • Cluster analysis( Clustering) - used as a predictor of missing data • An example of Cluster Analysis can be found in market analysis when referring to surveys. The algorithm groups (clusters) consumers by what they buy or when they shop. • Cluster analysis segments data based on specified parameters by the user. 38 What are Data Mining Techniques Used for ● Classification and Prediction ○ Example - Focused Hiring ○ predict one or more discrete variables, based on the other attributes in the dataset. ● Cluster Analysis ○ Example - Market Segmentation ● Outlier Analysis ○ Example - Fraud Detection ● Association Analysis ○ Example - Market Basket Analysis ● Evolution Analysis ○ Example - Forecasting stock market index using Time Series Analysis 39 Data Mining Tools Data Mining Tools Free Data Mining Software: ● Weka - an open-source software for data mining ● RapidMiner- an open source system for data and text mining ● KNIME - an open source data integration, processing, analysis, and exploration platform Proprietary Data Mining Software: ● IBM SPSS Modeler ● Microsoft Analysis Services ● Oracle Data Mining ● SAS Enterprise Miner 41 How Data Mining is Used in Business How Data Mining is Used in Business ● It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. ● Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. 43 How Data Mining is Used in Business ● Data Mining is primarily used today by companies with a strong consumer focus, to “drill down” into their transactional data. ● data mining helps determine pricing, customer preferences and product positioning, impact on sales, customer satisfaction and corporate profits. ● Businesses can use this information to influence their business decisions, follow and predict trends, and improve efficiency. 44 How Data Mining is Used in Business Basket Analysis ● Sometimes called “affinity analysis,” this looks at the items that a customer bought, which could help brick-and-mortar stores improve their layouts or online companies like Amazon recommend related products. The “basket” refers to what shoppers use when they are shopping. ● It’s based on the assumption that you can predict future customer behavior by past performance, including purchases and preferences. And it’s not just grocery stores that can use this data. Here are a few ways it can be applied in various industries: 45 How Data Mining is Used in Business ● Evaluating use of credit cards. (especially important for online ecommerce). Typically, professionals mine credit card data to find patterns that might suggest fraud, but the data is also used to tailor cards around a variation of credit limits, terms and interest rates…and even collect debt. ● Evaluating patterns of telephone use. For instance, you could discover customers who adopt all of the latest services and features your phone company offers…suggesting they will need something new to stick around…and then offer them an incentive to stay another year. ● Identifying fraud insurance claims. Through the mining of historical information, insurance companies can spot claims with a high percentage of recovering money lost through fraud and develop rules to help them flag future fraudulent claims. ● And all the products don’t have to be purchased at the same time. Most customer analytic tools can observe purchases over time, thus helping you spot trends or opportunities that you can test for future promotions. 46 How Data Mining is Used in Business Sales Forecasting ● This looks at when customers bought, and tries to predict when they will buy again. You could use this type of analysis to determine a strategy of planned obsolescence or figure out complimentary products to sell. ● This also looks at the number of customers in your market and predicts how many will actually buy. For example, imagine if you have a coffee shop in Seattle. Here are questions you might ask: ● How many people/households/businesses within a mile of your store will buy your coffee? ● How many competitors are in that mile? ● How many people/households/businesses in 5 miles? ● How many competitors in those 5 miles? 47 How Data Mining is Used in Business Database Marketing ● By examining customer purchasing patterns and looking at the demographics and psychographics of customers to build profiles, you can create products that will sell themselves. ● Of course for a marketer to get any value out of a database, it must continue to grow and evolve. You feed database information from sales, surveys, subscriptions and questionnaires. And then you target customers based upon this intelligence. 48 How Data Mining is Used in Business Examples ● Purchase records stored via a club card that you offer via incentives like 5% off purchases or accumulation of points. ● Contests you run to collect additional information about where people live. ● Email newsletter you use to update customers weekly, but also to send out surveys in which you collect additional information concerning new products and promotions. ● Twitter account that doubles as a flash promotion tool and customer service hub where you listen to the good and bad of what your followers are saying…and then respond. ● As you collect this data, start to look for opportunities like best days to run a discount promotion. Ask yourself: Who are your local customers and how you can turn these customers in advocates for your store? 49 How Data Mining is Used in Business Merchandise Planning: ● This is helpful for offline or online companies. For the offline, a company looking to grow by adding stores can evaluate the amount of merchandise they will need by looking at the exact layout of a current store. ● For an online business, merchandise planning can help you determine stocking options and inventory warehousing. 50 Example ● Inventory getting old – Merchandising planning can be as easy as updating a PDF white paper to be current or stocking up-to-date accessories for products. ● Selecting product – Mining your database will help you determine which products customers want, which should include intelligence on your competitors merchandise. ● Balancing your stock – Database mining can also help you determine the right amount of stock…not too much or too little…throughout the year and buying seasons. ● Pricing – Database mining can also help you determine the best price for your products as you uncover customer sensitivity. 51 How Data Mining is Used in Business Card Marketing ● If your business involves issuing credit cards, you can collect the information from usage, identify customer segments and then based on information on these segments build programs that improve retention, boost acquisition, target products to develop and design prices. ● A great example of this occurred when the UN decided to issue a Visa credit card to people who traveled overseas frequently. The agency marketers segmented their database into wealthy travelers—30,000 people in high-income households. ● The agency marketers used direct mail for their appeals and generated a 3% response. That may sound small, but it actually exceeded industry standards. Large financial institutions typically see 0.5% response rates. That’s how effective databases can be when marketing cards. How Data Mining is Used in Business Call Detail Record Analysis ● If your company depends upon telecommunications, then you can mine that incoming data to see use patterns, build customer profiles from these patterns and then construct a tiered pricing structure to maximize profit. Or you could build promotions that reflect your data. 53 Example ● A China mobile operator with about 600,000 customers wanted to analyze their data to create offerings to fend off competition. The first thing the project team behind collecting and analyzing the data did was create an index to describe caller behavior. That index then clustered the callers into 15 segments based on elements like this: ● ● ● ● ● ● ● ● Minutes Of Usage per user on the average Local call percentage Long distance call percentage IP call percentage Roam percentage Idle period local call percentage Idle period long distance call percentage Idle period roam call percentage 54 How it helped ● From that data the marketing department then created strategies directed at each segment, namely improving customer satisfaction ● delivering quality SMS service for another group ● encouraging another group to use more minutes. 55 Customer Loyalty ● In a world where price wars occur, you will get customers jumping ship every time a competitor offers lower prices. You can use data mining to help minimize this churn, especially with social media. ● Spigit uses different data mining techniques from your social media audience to help you acquire and retain more customers. Their programs include: ○ Employee innovation – A tool used to ask employees for their ideas on how to improve customer engagement, product development and future growth. Who says data mining is always customer-centric? ○ Facebook – Through a technique called “customer cluster” Spigit uses data from your audience on Facebook to generate ideas for improving your brand, satisfying more customers and increasing loyalty. 56 How Data Mining is Used in Business ● FaceOff – This app acts like a place where people can create possible ideas on which to vote. For example, someone might suggest “create the in-flight social network” v. “make a clear view floor so you can see what’s below you when you fly.” Then people are shown these ideas and vote. Naturally this allows a company to find ideas that are coming from customers…and then being voted on by people who might be interested in the final idea. 57 How Data Mining is Used in Business Market Segmentation ● One of the best uses of data mining is to segment your customers. And it’s pretty simple. From your data you can break down your market into meaningful segments like age, income, occupation or gender. And this works whether you are running email marketing campaigns or SEO strategies. ● Segmentation can also help you understand your competition. This insight alone will help you identify that the usual suspects are not the only ones targeting the same customer money as you are. ● This is especially important, because when I ask most clients who their competitors are they give me a list of people. I then hand them back a bigger list. Most businesses need to expand their circle of competitors out two or three times if they plan on competing effectively. Data mining will help you do that. 58 Product Production ● Data mining is also perfect for creating custom products designed for market segments. In fact, you can predict which features users may want…although truly innovative products are not created from giving customers what they want. 59 Example ● Rather truly innovative products are created when you look at the data from your customers and spot holes customers are demanding be filled. When it comes to creating that product, these are the elements that must be baked into the product. ● ● ● ● ● ● ● ● Fulfill an obvious need Offer something utterly unique Set to enter the market with a unique name Attractive design Serves a broad market Can be sold in generations Create an impulse-purchase price Cost to make is low enough to make a profit 60 Warranties ● Finally, database mining will allow you to predict how many people will actually cash in on the warranty you’ve set up. This is also true for guarantees. ● For example, I tested the pulling power that a guarantee has for improving sales on a test I ran on QuickSprout. But before I ran the test I had to analyze data to see how many people would actually return the product I was selling. I looked at previous data on these two sets: ○ Net sales ○ Settlements made within the defined guarantee 61 Example continued ● I gathered those two numbers over several different sales sets to predict how many people would cash in on the guarantee…and then adjusted the guarantee amount so as not to lose money when people returned product. ● These calculations are typically more robust for large corporations, but for smaller shops you don’t need anything more complicated than that. 62 Why It Matters ● The more data you collect from customers…the more value you can deliver to them. And the more value you can deliver to them…the more revenue you can generate. ● Data mining is what will help you do that. So, if you are sitting on loads of customer data and not doing anything with it…I want to encourage you to make a plan to start diving into it this week. Do it yourself or hire someone else…whatever it takes. 63 Questions? Thank You