Data Mining
By: William Delmore, Justin Robinson
Overview
● What Is Data Mining?
● Data, Information, Knowledge
● How Does Data Mining Work?
● Knowledge Discovery in Databases (KDD)
● Data Mining Techniques
● Data Mining Tools
● Why Is Data Mining Useful in Business?
What Is Data Mining?
Data Mining
● is the computational process of discovering patterns in large data sets,
involving methods at the intersection of artificial intelligence, machine
learning, statistics, and database systems.
● Its overall goal is to extract information from a data set and transform it into
knowledge.
Data Mining
● refers to the application of algorithms for extracting
patterns from data without the additional steps of the KDD
process.
● "Data mining" is considered a misnomer, because the goal is the
extraction of patterns and knowledge from large amounts
of data, not the extraction (mining) of the data itself.
● is the analysis step of the “KDD” process.
Architecture of a Typical Data Mining System
Data, Information, and Knowledge
What Is Data?
● Data are any facts, numbers, or text that can be processed by a computer. Today,
organizations are accumulating vast and growing amounts of data in different formats and
different databases.
● Data encompass:
1. operational or transactional data, such as sales, cost, inventory, payroll, and accounting
2. non-operational data, such as industry sales, forecast data, and macroeconomic data
3. metadata: data about the data itself, such as logical database design or data dictionary
definitions
● Any input that can be recorded, such as transactions on a credit card, clicks of a mouse, or
views of a video, can be used as data.
The Data Explosion Problem
● Advanced data collection tools and database technology lead to tremendous amounts
of data stored in databases.
● We are drowning in data, but starving for knowledge.
● Solution: data warehousing and data mining
○ Data warehousing and on-line analytical processing
○ Extraction of interesting knowledge using data mining
Data Warehousing
● Data warehousing provides the enterprise with a memory.
● Data mining provides the enterprise with intelligence.
● On-Line Analytical Processing (OLAP):
○ Few, but complex, queries that may run for hours.
○ Queries do not depend on having an absolutely up-to-date database.
○ Example: Analysts at Wal-Mart look for items with increasing sales in some region.
What is Information
● The patterns, associations, and relationships among all this data can provide
information.
● For example, analysis of retail point-of-sale
transaction data can yield information on which
products are selling and when.
What is Knowledge
● Information can be converted into knowledge about historical patterns and
future trends.
● For example, summary information on retail supermarket sales can be
analyzed in light of promotional efforts to provide knowledge of consumer
buying behavior. Thus, a manufacturer or retailer could determine which
items are most susceptible to promotional efforts.
How Does Data Mining Work?
● While large-scale information technology has been evolving separate transaction and
analytical systems, data mining provides the link between the two.
● Data mining software analyzes relationships and patterns in stored transaction data based on
open-ended user queries.
Data Mining Consists of Five Major Elements
1. Extract, transform, and load transaction data onto the data
warehouse system.
2. Store and manage the data in a multidimensional database system.
3. Provide data access to business analysts and information technology
professionals.
4. Analyze the data by application software.
5. Present the data in a useful format, such as a graph or table.
Knowledge Discovery in Databases (KDD)
● refers to the overall process of discovering useful knowledge from data.
● It involves the evaluation and possibly interpretation of the patterns to make the decision of
what qualifies as knowledge.
● It also includes the choice of encoding schemes, preprocessing, sampling, and projections
of the data prior to the data mining step.
Selection
Creating a target data set:
● selecting a data set,
● or focusing on a subset of variables,
● or data samples, on which discovery is to be performed.
● Output: the raw data refined into target data.
Pre-processing
Data cleaning and preprocessing:
• Removal of noise or outliers
• Collecting necessary information to model or account for noise
• Strategies for handling missing data fields
• Accounting for time sequence information and known changes
• Output: preprocessed data
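The cleaning steps above can be sketched in a few lines. This is a minimal illustration, not the presenters' method: the function name `preprocess`, the sample `daily_sales` figures, and the plausible-range bounds are all assumptions made for the example. Out-of-range readings are treated as noise, and every missing field is filled with the median of the valid readings, which is one simple strategy among many.

```python
from statistics import median

def preprocess(values, lower, upper):
    """Mark implausible readings as missing, then fill gaps with the median."""
    # Treat readings outside the plausible range [lower, upper] as noise.
    marked = [v if v is not None and lower <= v <= upper else None for v in values]
    # Fill every missing field with the median of the remaining valid readings.
    fill = median(v for v in marked if v is not None)
    return [fill if v is None else v for v in marked]

# Hypothetical daily sales: one missing field (None) and one obvious outlier.
daily_sales = [120, 135, None, 128, 9999, 131]
cleaned = preprocess(daily_sales, lower=0, upper=1000)
print(cleaned)
```

The median is used instead of the mean because it is less sensitive to any noise that slips past the range check.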
Transformation
● Normalizes the data.
● Finds correlated variables.
● Eliminates redundancies.
● Data reduction and projection.
● Finding useful features to represent the data depending on the goal of the task.
● Using dimensionality reduction or transformation methods to reduce the effective number
of variables under consideration or to find invariant representations for the data.
● Output: transformed data.
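As a rough sketch of two transformation steps named on this slide, the example below rescales a variable to the range 0 to 1 (min-max normalization) and uses the Pearson correlation to spot a redundant variable. The column names, the sample values, and the 0.99 cutoff are assumptions chosen for illustration only.

```python
from math import sqrt

def min_max(column):
    """Rescale a numeric column so its smallest value is 0 and its largest is 1."""
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical columns: the same price stored twice in different units, plus sales.
price_usd = [10.0, 20.0, 30.0, 40.0]
price_cents = [1000.0, 2000.0, 3000.0, 4000.0]
units_sold = [7.0, 3.0, 8.0, 2.0]

normalized = min_max(price_usd)
# Two almost perfectly correlated columns carry the same information,
# so one of them can be dropped (0.99 is an arbitrary cutoff for this sketch).
redundant = abs(pearson(price_usd, price_cents)) > 0.99
print(normalized, redundant)
```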
Data Mining
1. Choosing the data mining task
● Deciding whether the goal of the KDD process is classification, regression, clustering, etc.
2. Choosing the data mining algorithms
● Selecting method(s) to be used for searching for patterns in the data.
● Deciding which models and parameters may be appropriate.
● Matching a particular data mining method with the overall criteria of the KDD process.
3. Data mining
● Searching for patterns of interest in a particular representational form or a set of such representations, such as
classification rules or trees, regression, clustering, and so forth.
Data Mining Tasks
The goals of prediction and description are achieved by using the following primary data mining tasks:
1. Classification is learning a function that maps (classifies) a data item into one of several
predefined classes.
2. Regression is learning a function which maps a data item to a real-valued prediction variable.
3. Clustering is a common descriptive task where one seeks to identify a finite set of categories or
clusters to describe the data.
● Closely related to clustering is the task of probability density estimation which consists of
techniques for estimating, from data, the joint multivariate probability density function of
all of the variables/fields in the database.
Data Mining Tasks
4. Summarization involves methods for finding a compact description for a subset of data.
5. Dependency Modeling consists of finding a model which describes significant
dependencies between variables.
● Dependency models exist at two levels:
i. The structural level of the model specifies (often graphically) which variables are locally
dependent on each other, and
ii. The quantitative level of the model specifies the strengths of the dependencies using
some numerical scale.
6. Change and Deviation Detection focuses on discovering the most significant changes in
the data from previously measured or normative values.
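Change and deviation detection (task 6) can be illustrated with a very simple rule: compare new measurements against previously measured values and flag those that lie far from the baseline. The sketch below is one possible approach, not the only one; the function name `deviations`, the three-standard-deviation threshold, and the sample numbers are assumptions.

```python
from statistics import mean, stdev

def deviations(baseline, new_values, threshold=3.0):
    """Return the new values that lie more than `threshold` standard
    deviations away from the mean of the baseline measurements."""
    mu, sigma = mean(baseline), stdev(baseline)
    return [v for v in new_values if abs(v - mu) > threshold * sigma]

# Hypothetical previously measured values and a batch of new readings.
baseline = [100, 102, 98, 101, 99]
flagged = deviations(baseline, [100, 97, 140])
print(flagged)
```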
Interpretation/Evaluation
● Examines the patterns discovered.
● Checks whether redundant or irrelevant patterns exist.
● Consolidates the discovered knowledge.
KDD Compared to Data Mining
Knowledge Discovery in Databases:
● refers to the overall process of discovering useful knowledge from data.
● It involves the evaluation and possibly interpretation of the patterns to make the
decision of what qualifies as knowledge.
● It also includes the choice of encoding schemes, preprocessing, sampling, and
projections of the data prior to the data mining step.
Data mining:
● refers to the application of algorithms for extracting patterns from data without the
additional steps of the KDD process.
Definitions Related to the KDD Process
● Knowledge discovery in databases is the non-trivial process of identifying valid, novel, potentially useful,
and ultimately understandable patterns in data.
● Interestingness is an overall measure of pattern value, combining validity, novelty, usefulness, and
simplicity.
Data Mining Techniques
● Association rules
● Classification algorithms
● Decision trees
● Regression algorithms
● Neural networks
Association Rules
● Association rules are if/then statements that help uncover
relationships between seemingly unrelated data in a relational
database.
● The most common application of this kind of algorithm is market
basket analysis.
Market Basket Analysis
• Market Basket Analysis is a modelling technique based upon the
theory that if you buy a certain group of items, you are more (or less)
likely to buy another group of items.
• In retailing, most purchases are made on impulse. Market basket
analysis gives clues as to what a customer might have bought if the
idea had occurred to them.
Association Rules: If/Then Statements
• "If a customer buys a dozen eggs, he is 80% likely to also purchase
milk."
• The fraction of all transactions that contain both eggs and milk is
referred to as the support for the rule. The conditional probability
that a customer who buys eggs (the antecedent) will also purchase milk
(the consequent) is referred to as the confidence; in this rule it is 80%.
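The eggs-and-milk rule can be checked on a toy data set. The sketch below uses the common convention that the support of a rule is the fraction of transactions containing every item in the rule, and that confidence is the support of the whole rule divided by the support of the antecedent. The ten baskets are invented for illustration.

```python
# Ten hypothetical shopping baskets, each stored as a set of items.
transactions = [
    {"eggs", "milk", "bread"},
    {"eggs", "milk"},
    {"eggs", "milk", "butter"},
    {"eggs", "milk"},
    {"eggs", "bread"},
    {"milk"},
    {"bread", "butter"},
    {"bread"},
    {"butter"},
    {"milk", "bread"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

rule_support = support({"eggs", "milk"})          # eggs and milk together
rule_confidence = confidence({"eggs"}, {"milk"})  # milk, given eggs
print(rule_support, rule_confidence)
```

In this toy data eggs appear in five of ten baskets and milk accompanies them in four of those five, so the rule "if eggs then milk" has a confidence of 80%.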
Classification Algorithms
• Classification algorithms encompass a large group of algorithms
used for machine learning.
• Machine learning explores the study and construction of algorithms
that can learn from and make predictions on data.
• Instead of following static program instructions, the algorithm builds a
model from sample inputs to make data-driven decisions.
Classification examples
● Quadratic classifier: used to separate measurements of two or more classes of
objects or events by a quadric surface.
● Naive Bayes classifier: a family of simple probabilistic classifiers based on applying
Bayes' theorem with strong (naive) independence assumptions between the features. It is
not a single algorithm for training such classifiers, but a family of algorithms based on a
common principle: all naive Bayes classifiers assume that the value of a particular feature
is independent of the value of any other feature, given the class variable.
● Nearest neighbor search (proximity search): the optimization problem of finding
the point in a given set that is closest (or most similar) to a given point.
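Nearest neighbor search turns into a classifier when each stored point carries a class label: a new item receives the label of its closest stored point. The following is a minimal sketch under that assumption, using Euclidean distance; the function name, the sample points, and the "low"/"high" labels are hypothetical.

```python
from math import dist

def nearest_neighbor(query, labeled_points):
    """Return the label of the stored point closest to `query`."""
    _, label = min(labeled_points, key=lambda pair: dist(query, pair[0]))
    return label

# Hypothetical customers described by two numeric features, with known classes.
training = [
    ((1.0, 1.0), "low"),
    ((1.2, 0.8), "low"),
    ((5.0, 5.2), "high"),
    ((4.8, 5.1), "high"),
]

print(nearest_neighbor((4.5, 4.9), training))
print(nearest_neighbor((0.9, 1.1), training))
```

This brute-force version compares the query with every stored point, which is fine for small data; large data sets usually need an index structure instead.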
Decision Trees
● Decision trees come in two types:
1. Classification tree: the predicted outcome is the
class to which the data belongs.
2. Regression tree: the predicted
outcome can be considered a real number.
Regression Algorithms
● Regression algorithms predict one or more continuous numeric
variables such as profit or loss, based on other attributes in the dataset.
● Regression focuses on the relationship between a dependent variable
and one or more independent variables (or 'predictors'). More
specifically, regression analysis helps one understand how the typical
value of the dependent variable (or 'criterion variable') changes when
any one of the independent variables is varied while the others are held fixed.
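For the simplest case, one independent variable and one dependent variable, the relationship can be estimated with ordinary least squares. The sketch below fits a straight line; the variable names and the sample figures are assumptions made for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept

# Hypothetical data: advertising spend (independent) and profit (dependent).
ad_spend = [1.0, 2.0, 3.0, 4.0]
profit = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(ad_spend, profit)
predicted = slope * 5.0 + intercept  # predicted profit for a spend of 5.0
print(slope, intercept, predicted)
```

The slope is the estimated change in the dependent variable for a one-unit change in the independent variable.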
Neural Networks
● Neural network: a computer system modeled on
the human brain and nervous system.
● These systems are self-learning and trained, rather
than explicitly programmed, and excel in areas
where the solution or feature detection is difficult
to express in a traditional computer program.
Neural Networks
● Neural networks typically consist of multiple layers or a cube design,
and the signal path traverses from front to back.
● Newer neural network designs are more free-flowing, so they can access
information faster.
● Dynamic neural networks are the most advanced, in that they
can dynamically, based on rules, form new connections and even new
neural units while disabling others.
Cluster Analysis
• Cluster analysis (clustering) groups records so that members of a group are more similar to
each other than to members of other groups; the resulting groups can also be used as a predictor of missing data.
• An example of cluster analysis can be found in market analysis of
survey data: the algorithm groups (clusters) consumers by what
they buy or when they shop.
• Cluster analysis segments data based on parameters specified by the
user.
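One widely used clustering method is k-means, shown here as a sketch on a single numeric variable. It is offered as an example of the idea, not as the method the slides have in mind. The user-specified parameters are the starting cluster centers (and therefore the number of clusters); the spending figures are invented.

```python
def k_means_1d(values, centers, iterations=10):
    """Group numbers around `centers`, moving each center to its group's mean."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            # Assign each value to the closest current center.
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical weekly spend per customer; two starting centers chosen by the user.
weekly_spend = [12, 15, 14, 80, 85, 90]
centers, clusters = k_means_1d(weekly_spend, centers=[0.0, 100.0])
print(centers, clusters)
```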
What are Data Mining Techniques Used for
● Classification and Prediction
○ Predicts one or more discrete variables, based on the other attributes in the dataset.
○ Example - Focused Hiring
● Cluster Analysis
○ Example - Market Segmentation
● Outlier Analysis
○ Example - Fraud Detection
● Association Analysis
○ Example - Market Basket Analysis
● Evolution Analysis
○ Example - Forecasting stock market index using Time Series Analysis
Data Mining Tools
Free Data Mining Software:
● Weka - open-source software for data mining
● RapidMiner - an open-source system for data and text mining
● KNIME - an open-source data integration, processing, analysis, and
exploration platform
Proprietary Data Mining Software:
● IBM SPSS Modeler
● Microsoft Analysis Services
● Oracle Data Mining
● SAS Enterprise Miner
How Data Mining is Used in Business
● It allows users to analyze data from many different
dimensions or angles, categorize it, and summarize
the relationships identified.
● Technically, data mining is the process of finding
correlations or patterns among dozens of fields in
large relational databases.
● Data Mining is primarily used today by companies with a strong
consumer focus, to “drill down” into their transactional data.
● Data mining helps determine how pricing, customer preferences, and
product positioning affect sales, customer satisfaction, and
corporate profits.
● Businesses can use this information to influence their business
decisions, follow and predict trends, and improve efficiency.
Basket Analysis
● Sometimes called “affinity analysis,” this looks at the items that a customer bought, which could
help brick-and-mortar stores improve their layouts or online companies like Amazon recommend
related products. The “basket” refers to what shoppers use when they are shopping.
● It’s based on the assumption that you can predict future customer behavior by past performance,
including purchases and preferences. And it’s not just grocery stores that can use this data. Here
are a few ways it can be applied in various industries:
● Evaluating use of credit cards (especially important for online e-commerce). Typically,
professionals mine credit card data to find patterns that might suggest fraud, but the data is
also used to tailor cards around varying credit limits, terms, and interest rates…and even to
collect debt.
● Evaluating patterns of telephone use. For instance, you could discover customers who
adopt all of the latest services and features your phone company offers…suggesting they will
need something new to stick around…and then offer them an incentive to stay another year.
● Identifying fraudulent insurance claims. Through the mining of historical information, insurance
companies can spot claims with a high likelihood of recovering money lost through fraud and
develop rules to help them flag future fraudulent claims.
● The products don’t all have to be purchased at the same time. Most customer analytic
tools can observe purchases over time, thus helping you spot trends or opportunities that you
can test for future promotions.
Sales Forecasting
● This looks at when customers bought, and tries to predict when they will buy again. You could use this type of
analysis to determine a strategy of planned obsolescence or figure out complementary products to sell.
● This also looks at the number of customers in your market and predicts how many will actually buy. For
example, imagine you have a coffee shop in Seattle. Here are questions you might ask:
○ How many people/households/businesses within a mile of your store will buy your coffee?
○ How many competitors are in that mile?
○ How many people/households/businesses within 5 miles?
○ How many competitors in those 5 miles?
Database Marketing
● By examining customer purchasing patterns and looking at the demographics and
psychographics of customers to build profiles, you can create products that will sell
themselves.
● Of course for a marketer to get any value out of a database, it must continue to grow and
evolve. You feed database information from sales, surveys, subscriptions and questionnaires.
And then you target customers based upon this intelligence.
Examples
● Purchase records stored via a club card that you offer via incentives like 5% off purchases or accumulation of
points.
● Contests you run to collect additional information about where people live.
● Email newsletter you use to update customers weekly, but also to send out surveys in which you collect
additional information concerning new products and promotions.
● Twitter account that doubles as a flash promotion tool and customer service hub where you listen to the good
and bad of what your followers are saying…and then respond.
● As you collect this data, start to look for opportunities like the best days to run a discount promotion. Ask yourself:
Who are your local customers, and how can you turn these customers into advocates for your store?
Merchandise Planning:
● This is helpful for offline or online companies. For the offline, a company looking to grow by
adding stores can evaluate the amount of merchandise they will need by looking at the exact
layout of a current store.
● For an online business, merchandise planning can help you determine stocking options and
inventory warehousing.
Example
● Inventory getting old – Merchandise planning can be as easy as updating a PDF white paper to be
current or stocking up-to-date accessories for products.
● Selecting product – Mining your database will help you determine which products customers want,
which should include intelligence on your competitors’ merchandise.
● Balancing your stock – Database mining can also help you determine the right amount of stock…not too
much or too little…throughout the year and buying seasons.
● Pricing – Database mining can also help you determine the best price for your products as you uncover
customer sensitivity.
Card Marketing
● If your business involves issuing credit cards, you can collect usage information, identify customer
segments, and then use what you learn about these segments to build programs that improve retention, boost
acquisition, decide which products to develop, and set prices.
● A great example of this occurred when the UN decided to issue a Visa credit card to people who traveled
overseas frequently. The agency marketers segmented their database down to wealthy travelers: 30,000 people
in high-income households.
● The agency marketers used direct mail for their appeals and generated a 3% response. That may sound
small, but it actually exceeded industry standards. Large financial institutions typically see 0.5% response
rates. That’s how effective databases can be when marketing cards.
Call Detail Record Analysis
● If your company depends upon telecommunications, then you can
mine that incoming data to see use patterns, build customer profiles
from these patterns and then construct a tiered pricing structure to
maximize profit. Or you could build promotions that reflect your data.
Example
● A Chinese mobile operator with about 600,000 customers wanted to analyze its data to create offerings to
fend off competition. The first thing the project team collecting and analyzing the data did was
create an index to describe caller behavior. That index was then used to cluster the callers into 15 segments based
on elements like these:
○ Average minutes of usage per user
○ Local call percentage
○ Long distance call percentage
○ IP call percentage
○ Roam percentage
○ Idle period local call percentage
○ Idle period long distance call percentage
○ Idle period roam call percentage
How it helped
● From that data, the marketing department then created strategies
directed at each segment, namely:
○ improving customer satisfaction
○ delivering quality SMS service for another group
○ encouraging another group to use more minutes.
Customer Loyalty
● In a world where price wars occur, you will get customers jumping ship every time a
competitor offers lower prices. You can use data mining to help minimize this churn,
especially with social media.
● Spigit uses different data mining techniques on your social media audience to help you
acquire and retain more customers. Their programs include:
○ Employee innovation – A tool used to ask employees for their ideas on how to
improve customer engagement, product development and future growth. Who says
data mining is always customer-centric?
○ Facebook – Through a technique called “customer cluster” Spigit uses data from
your audience on Facebook to generate ideas for improving your brand, satisfying
more customers and increasing loyalty.
○ FaceOff – An app where people can propose ideas and vote on them.
For example, someone might suggest “create the in-flight social network” vs. “make a clear view
floor so you can see what’s below you when you fly.” Then people are shown these ideas and
vote. This allows a company to find ideas that come from customers…and are then
voted on by people who might be interested in the final idea.
Market Segmentation
● One of the best uses of data mining is to segment your customers. And it’s pretty simple. From your data you
can break down your market into meaningful segments like age, income, occupation or gender. And this
works whether you are running email marketing campaigns or SEO strategies.
● Segmentation can also help you understand your competition. This insight alone will help you identify that
the usual suspects are not the only ones targeting the same customer money as you are.
● This is especially important, because when I ask most clients who their competitors are they give me a list of
people. I then hand them back a bigger list. Most businesses need to expand their circle of competitors out two
or three times if they plan on competing effectively. Data mining will help you do that.
Product Production
● Data mining is also perfect for creating custom products designed
for market segments. In fact, you can predict which features users
may want…although truly innovative products are not created from
giving customers what they want.
Example
● Rather, truly innovative products are created when you look at the data from your customers
and spot holes customers are demanding be filled. When it comes to creating that product,
these are the elements that must be baked into the product:
○ Fulfills an obvious need
○ Offers something utterly unique
○ Enters the market with a unique name
○ Has an attractive design
○ Serves a broad market
○ Can be sold in generations
○ Has an impulse-purchase price
○ Costs little enough to make that it turns a profit
Warranties
● Finally, database mining will allow you to predict how many people will actually cash in on
the warranty you’ve set up. This is also true for guarantees.
● For example, I tested the pulling power that a guarantee has for improving sales on a test I
ran on QuickSprout. But before I ran the test I had to analyze data to see how many people
would actually return the product I was selling. I looked at previous data on these two sets:
○ Net sales
○ Settlements made within the defined guarantee
Example continued
● I gathered those two numbers over several different sales sets to
predict how many people would cash in on the guarantee…and then
adjusted the guarantee amount so as not to lose money when
people returned product.
● These calculations are typically more robust for large corporations,
but for smaller shops you don’t need anything more complicated
than that.
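The calculation described above can be sketched as a simple ratio: total settlements paid under the guarantee divided by total net sales, then applied to a planned sale. All figures below are invented for illustration, and the variable names are assumptions; the original test data is not given in the slides.

```python
# Hypothetical past sales sets: (net sales, settlements paid under the guarantee).
sales_sets = [
    (10_000.0, 400.0),
    (8_000.0, 200.0),
    (12_000.0, 300.0),
]

total_sales = sum(net for net, _ in sales_sets)
total_settlements = sum(settled for _, settled in sales_sets)

# Share of net sales that was paid back under the guarantee.
refund_rate = total_settlements / total_sales

# Expected payout if the next promotion brings in this much net sales (assumed figure).
planned_sales = 5_000.0
expected_refunds = refund_rate * planned_sales
print(refund_rate, expected_refunds)
```

If the expected payout is larger than the margin can absorb, the guarantee terms can be adjusted before the promotion runs.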
Why It Matters
● The more data you collect from customers…the more value you can
deliver to them. And the more value you can deliver to them…the more
revenue you can generate.
● Data mining is what will help you do that. So, if you are sitting on loads
of customer data and not doing anything with it…I want to encourage
you to make a plan to start diving into it this week. Do it yourself or hire
someone else…whatever it takes.
Questions?
Thank You