Journal of Emerging Trends in Engineering and Applied Sciences (JETEAS) 2 (4): 601-605
© Scholarlink Research Institute Journals, 2011 (ISSN: 2141-7016)
jeteas.scholarlinkresearch.org
Computational Intelligence in Data Mining and Prospects in
Telecommunication Industry
Isinkaye O. Folasade
Department of Computer Science and Information Technology
University of Science and Technology
Ifaki-Ekiti, Ekiti State,
___________________________________________________________________________
Abstract
The development of mobile phone networks, video and internet technologies has created enormous pressure on
the telecommunication industry. The industry generates and stores huge amounts of data which need intelligent tools to
analyze. Data mining techniques are powerful mechanisms with features and abilities suitable for
analyzing large amounts of data, because they allow the selection, exploration and modeling of large
volumes of data to uncover previously unknown data patterns for business advantage. Computational
intelligence in data mining provides complementary reasoning and searching methods to solve complex, real-world
problems. The aim of this paper is to explore different data mining tools and applications and how they can be
used to detect telecommunication fraud and faults and to improve marketing effectiveness. It also describes how data
mining can be used to uncover useful information embedded within large datasets.
__________________________________________________________________________________________
Keywords: computational intelligence techniques, data mining, telecommunication industry, fraud detection,
fuzzy-logic
__________________________________________________________________________________________
INTRODUCTION
Data mining refers to the process of extracting or mining knowledge from large amounts of data
(Jiawei and Micheline, 2000), or the application of specific algorithms for extracting patterns
from data (Fayyad et al., 1996). It has the ability to learn from past successes and failures and
to predict what will happen in the future. Data mining is therefore useful in any field where
there are large quantities of data from which to extract meaningful patterns and rules (Berry and
Linoff, 2000). Data mining has been used by statisticians, data analysts, management information
systems and in the telecommunication industry.

Data mining in telecommunication is an important application because telecommunication routinely
generates and stores a tremendous amount of high quality data. The quantity of data is so large
that manual analysis of the data is practically impossible. The advent of data mining technology
promised a solution to this problem, and that is the main reason the telecommunication industry
was an early adopter of data mining technology.

Telecommunication is a service-oriented business; hence data mining can be viewed as an extension
of the use of expert systems (Liebwitz, 1998), which were mainly designed to address the complexity
associated with maintaining a huge network infrastructure and the need to maximize network
reliability while minimizing labour cost. Also, within huge amounts of data usually lies hidden
knowledge of strategic importance which unaided human ability cannot analyze without powerful
tools (Hills et al., 2006). Hence, a deep understanding of the knowledge hidden in
telecommunication data is vital to the industry’s competitive position and organizational decision
making. This paper discusses how data mining can be used to discover and extract useful patterns
from large databases.

The quantity of data in the world today is growing, especially in the telecommunication
industry. These data include call detail data, network data and customer data (Chang, 2009). The
need to handle such large volumes of data has paved the way for the development of computational
techniques for extracting knowledge from large amounts of data. In the telecommunication industry,
call detail data is useful for marketing and fraud detection applications. All telecommunication
companies maintain data about the phone calls that traverse their networks in the form of call
detail records. These call detail records are kept on-line for many months; hence billions of call
detail records are usually available for data mining (Cortes and Pregibon, 2001).

According to Thearling (1999), data mining is the extraction of hidden predictive information from
large databases, and it is a suitable technology with great potential to help companies focus on
the most important information in large databases. The SAS Institute (2000) defines data mining as
the process of selecting, exploring and modeling large amounts of data to uncover previously
unknown data patterns for business advantage. Hence, data mining has to do with applying data
analysis and discovery algorithms to enumerate patterns over data for prediction and
description (Ranjan, 2007; Costea, 2006). Some data
mining techniques allow telecommunication companies
to mine historical data for the purpose of predicting
when a customer is likely to churn. These techniques
use billing data, call detail data, subscription
information and customer information. The companies
can then take action, if desired, based on the induced model.
One such model uses a neural network to
estimate the probability of cancellation at a given time
in the future (Mani et al., 1999). Also, the problem of
direct marketing has been recognized by Wang et al.
(2005) as a classification problem, and the
importance of cost-sensitive classification in the direct-marketing
and customer retention domain has also been noted (Domingos,
1999).
There are several applications of data mining
techniques in telecommunication. Liebwitz (1998) lists
identifying telecommunication
patterns, predicting which customers are likely to
default on payments, improving service quality and
resource utilization, and facilitating multi-dimensional
data analysis to improve the
understanding of customer behavior.
Fayyad et al. (1996) explain that Fuzzy Logic
[FL], Probabilistic Reasoning [PR], Neural Networks
[NNs], and Evolutionary Algorithms [EAs] are the
main components of computational intelligence, which
provide complementary reasoning and searching
methods to solve complex real-world problems. The
principal constituent methodologies in computational
intelligence are complementary rather than
competitive (Abraham, 1998).
Different Types of Telecommunication Data
Data mining has attracted world-wide attention in
recent years because huge amounts of electronic data
are available and there is a pressing need to turn such
data into useful knowledge (Mani et al., 1999). The
first step in the data mining process is to clearly
understand the data so that appropriate applications
can be developed. Telecommunication data
include call detail data, network data and
customer data.
Call Detail Data: Whenever a call is made on a
telecommunication network, descriptive information
about the call is saved
as a call detail record. The call detail record contains
enough information to describe the significant
features of the call. Such a record
will usually consist of the originating and terminating
phone numbers, the date and time of the call and the
duration of the call. The call detail records of each
customer must be summarized into a single record that
describes the customer’s calling behavior before useful
knowledge can be extracted. This helps in
generating customer profiles which can be mined for
marketing purposes.
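The summarization step described above can be sketched as follows. This is only an illustration under stated assumptions: the record fields mirror the ones listed in the text, while the `CallRecord` type, the chosen profile features and the 08:00 to 18:00 daytime window are hypothetical choices, not taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass
class CallRecord:
    origin: str          # originating phone number
    destination: str     # terminating phone number
    start: datetime      # date and time of the call
    duration_s: int      # duration of the call in seconds


def summarize_customer(records):
    """Collapse each caller's call detail records into one profile row."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec.origin].append(rec)

    profiles = {}
    for number, calls in grouped.items():
        total = sum(c.duration_s for c in calls)
        # Assumed definition of "daytime": calls starting between 08:00 and 18:00.
        daytime = sum(1 for c in calls if 8 <= c.start.hour < 18)
        profiles[number] = {
            "calls": len(calls),
            "avg_duration_s": total / len(calls),
            "distinct_destinations": len({c.destination for c in calls}),
            "daytime_share": daytime / len(calls),
        }
    return profiles


# Hypothetical sample data
records = [
    CallRecord("0801", "0802", datetime(2011, 3, 1, 9, 30), 120),
    CallRecord("0801", "0803", datetime(2011, 3, 1, 20, 5), 60),
    CallRecord("0802", "0801", datetime(2011, 3, 2, 11, 0), 300),
]
profiles = summarize_customer(records)
```

Each value in `profiles` is one summary record per customer, which is the form the mining techniques discussed later would take as input.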
Network Data: Telecommunication networks consist
of complex configurations of equipment made up
of many interconnected components. Each
component is capable of generating error and status
messages, which leads to a huge quantity of network
data. The data is normally stored and analyzed in order
to support network management functions such as
fault detection and isolation. Data mining technology
helps to perform these functions by automatically
extracting knowledge from network data.
Customer Data: Telecommunication companies
maintain very large databases of information about
their many customers. The
information consists of names and addresses as
well as service plan, contract information, credit score,
family income and payment history. These data are
often used alongside other data; for example, customer
data can be used in conjunction with call detail data to identify
phone fraud.
DATA MINING TECHNIQUES
Data mining techniques have different features that
make them suitable for analyzing large quantities of
data, and these techniques are the result of a long
process of research and product development. Data
mining tools are capable of analyzing massive
databases and delivering answers to questions very
quickly. OLAP (online analytical processing) is one
of the analytical tools that focus on providing multi-dimensional data analysis. Such tools are based on
verification, where the system is limited to verifying
the user’s hypotheses (Zadeh, 1998), and they are mainly
used for simplifying and supporting interactive data
analysis. The main goal of data mining, by contrast, is to discover
new patterns in data for the purposes of prediction and
description. Data mining tools also check the statistical significance
of the predicted patterns and give relevant reports on
them. Although the boundaries between prediction and
description are not sharp, the distinction is vital in
understanding the overall discovery goal, which is
achieved through the following data mining
techniques.
Clustering: This involves identifying a finite set of
clusters to describe a set of data items (Abraham,
1998); it can also be described as a method by which
similar records are grouped together. The clusters
may have a richer representation, such as
hierarchical or overlapping clusters. Using clustering,
telecommunication customer data can be grouped
based on customer name, customer address, service
plan, credit score and payment history.
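As an illustration of grouping similar customer records, the sketch below uses k-means, which is one common clustering algorithm and not necessarily the one the cited work has in mind. The two numeric features, the sample values and the initial centroids are all assumptions made for the example.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    centroids = [tuple(c) for c in centroids]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its members; keep empty ones in place.
        centroids = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else c
            for members, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical features per customer: (credit score / 100, late payments last year)
customers = [(7.2, 0), (6.9, 1), (7.5, 0), (4.1, 5), (3.8, 6), (4.4, 4)]
centroids, clusters = kmeans(customers, centroids=[(7.0, 0.0), (4.0, 5.0)])
```

With these sample values the customers separate into a high-score group and a low-score group. In practice the features would need to be scaled to comparable ranges before distances are computed.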
Regression: This takes a numerical dataset and
develops a mathematical formula that fits the data.
That is, given a set of data, the regression technique predicts
the value of an attribute for new data based
on the dependency of that attribute on another. This
technique is best suited to continuous numeric data;
categorical data, where order is not important, are
better handled by classification.
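A minimal sketch of fitting a formula to numeric data is ordinary least squares with a single predictor. The variable names and the sample figures (monthly call minutes against the monthly bill) are hypothetical and serve only to show the mechanics.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x with one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: monthly call minutes vs. monthly bill
minutes = [100, 200, 300, 400]
bill = [15.0, 25.0, 35.0, 45.0]
intercept, slope = fit_line(minutes, bill)
predicted_bill = intercept + slope * 250   # estimate for a new customer
```

The fitted `intercept` and `slope` are the "mathematical formula", and `predicted_bill` shows how it is applied to a new data item.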
Summarization: This technique involves finding a
compact description for a subset of data.
Summarization is often applied to interactive
exploratory data analysis and automatic report
generation.
Classification: This is used to predict group
membership for data instances, that is, to
determine which of a predefined set of classes a
data item belongs to. The classification technique is very
efficient when working with categorical data or a
mixture of continuous numeric and categorical data. It
is also capable of processing a wider variety of data,
with output that is much easier to interpret. For example, a
telecommunication company whose customers’ credit
history is known can classify each customer record as
Good, Medium or Poor.
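The Good, Medium or Poor example can be sketched with a k-nearest-neighbour classifier. This is one simple classification method chosen for illustration; the paper does not prescribe it. The two features and the labelled examples are invented for the sketch.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """Label `query` by majority vote among its k nearest training examples."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical features: (late payments last year, months as a customer)
labelled = [
    ((0, 48), "Good"), ((1, 36), "Good"), ((0, 60), "Good"),
    ((3, 24), "Medium"), ((4, 18), "Medium"), ((3, 30), "Medium"),
    ((8, 6), "Poor"), ((9, 4), "Poor"), ((7, 8), "Poor"),
]
risky_class = knn_classify(labelled, (8, 5))
reliable_class = knn_classify(labelled, (0, 50))
```

A customer with many late payments and a short history lands in the Poor class, and a long-standing customer with none lands in Good.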
Dependency modeling: This technique finds a model
that describes significant dependencies between
variables. A dependency model can be either a
structural level model, which specifies which variables
are locally dependent on each other, or a quantitative
level model, which specifies the strengths of the
dependencies using some numeric scale.
Change and Deviation Detection: This focuses on
discovering the most significant changes in data from
previously measured or normative values.
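One simple way to flag a significant change from a previously measured norm is a z-score test, sketched below. The threshold of three standard deviations and the sample call counts are assumptions for illustration only.

```python
from statistics import mean, stdev


def deviations(history, current, threshold=3.0):
    """Return (value, z-score) for readings far from the historical norm."""
    mu = mean(history)
    sigma = stdev(history)
    flagged = []
    for value in current:
        z = (value - mu) / sigma
        if abs(z) >= threshold:
            flagged.append((value, round(z, 2)))
    return flagged


# Hypothetical daily call counts for one subscriber
baseline = [20, 22, 19, 21, 23, 20, 22]
this_week = [21, 24, 95]
alerts = deviations(baseline, this_week)
```

Only the reading that lies far outside the historical spread is reported; ordinary day-to-day variation is ignored.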
DATA MINING ALGORITHM COMPONENTS
A data mining algorithm is a specific
procedure that implements the general techniques
discussed above. Choosing one means deciding on an actual
algorithm for searching for patterns in a dataset, which
entails selecting an appropriate model and parameters
and matching a particular algorithm to the
knowledge discovery goal. There are three major
components in any data mining algorithm:
model representation, model evaluation and model
search.
Model Representation: This is the language used to
represent the patterns that can be discovered. The
representation must not be too limited, otherwise no
number of examples will produce an accurate model
of the data. Conversely, if the representation is too powerful,
it increases the danger of overfitting the training data,
which results in reduced prediction accuracy on unseen
data. Fuzzy logic is an appropriate computational
intelligence technique that could be employed
here. Using Fuzzy logic involves the
identification of a classifier system, that is, the design of a model
that is able to predict the class to which a particular pattern
belongs. The classic approach to
this problem is based on Bayes’ rule. Another aspect of model
representation is interpretability in Fuzzy systems: Fuzzy systems
can be evaluated according to their
performance or accuracy, but there is also a need for a way of assessing their
interpretability, simplicity or user friendliness.
Model Evaluation: Earlier algorithms focused on
either accuracy or interpretability, but recent
algorithms try to combine these two features. A model
evaluation criterion is therefore a quantitative statement
of how well a specific pattern meets the goals of the
knowledge discovery process. In model evaluation,
different rules may be applied to remove
redundancy, which usually appears as overlapping
Fuzzy sets (Setnes, 1998). Similarity-driven rule base
simplification is a method that uses a similarity measure
to quantify the redundancy among Fuzzy sets in the
rule base. This method is very useful in reducing the
number of Fuzzy sets in the model, making it simple
and yet robust. Multi-objective function for genetic
algorithm (GA) based identification is another rule-based method; it improves the classification
capability by applying a GA optimization method
whose cost function is based on the model accuracy
measured in terms of the number of misclassifications (Setnes,
1999). Finally, orthogonal transforms for reducing
the number of rules are also important. They evaluate the
output contribution of the rules to obtain an importance
ordering. For modeling purposes, orthogonal least
squares (OLS) is a well-suited tool (Yen and
Wang, 2001).
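To make the idea of a similarity measure between Fuzzy sets concrete, the sketch below computes a set-theoretic (Jaccard-style) similarity on a discretized domain, using min for intersection and max for union. Whether this matches the exact measure in the cited work is an assumption; the triangular membership functions and their parameters are hypothetical.

```python
def triangular(a, b, c):
    """Membership function of a triangular Fuzzy set with feet a, c and peak b."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        if x <= b:
            return (x - a) / (b - a)
        return (c - x) / (c - b)
    return mu


def jaccard_similarity(mu_a, mu_b, domain):
    """|A intersect B| / |A union B| on a discretized domain (min / max operators)."""
    inter = sum(min(mu_a(x), mu_b(x)) for x in domain)
    union = sum(max(mu_a(x), mu_b(x)) for x in domain)
    return inter / union if union else 0.0


domain = [i / 10 for i in range(0, 101)]          # 0.0, 0.1, ..., 10.0
low = triangular(0, 2, 4)
low_duplicate = triangular(0.2, 2.1, 4.1)          # nearly the same set as `low`
high = triangular(6, 8, 10)

redundant = jaccard_similarity(low, low_duplicate, domain)   # close to 1
distinct = jaccard_similarity(low, high, domain)             # 0.0, the sets do not overlap
```

A pair of sets whose similarity exceeds a chosen threshold would be a candidate for merging into a single set, which is how the rule base shrinks.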
Search Method: This is made up of parameter search and
model search. Once the model representation and
evaluation criteria are fixed, the problem reduces to an
optimization task of finding the models and
parameters that optimize the evaluation criteria. Model
search occurs as a loop over the parameter search
method. Computational intelligence tools for the
initialization step of the identification procedure
include Fuzzy logic (FL), Probabilistic reasoning
(PR), Neural networks (NNs), and Genetic algorithms
(GAs). Computational intelligence based search
methods for the identification of Fuzzy
classifiers allow the fixed membership functions
used to partition the feature space to give way to
functions that are derived from the data and better explain the
data patterns. The automatic determination of Fuzzy
classification rules from data can be approached with any
of the following techniques: the neuro-fuzzy method, genetic
algorithm based rule selection, and Fuzzy
clustering in combination with GA optimization.
Data mining algorithm tools that could be used with
these techniques include the Microsoft Decision Trees
Algorithm, the Microsoft Time Series Algorithm,
the Microsoft Clustering Algorithm and the Microsoft Sequence
Clustering Algorithm. They help to build
a data mining model.
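The statement under Search Method that model search runs as a loop over parameter search can be sketched as two nested loops. The candidate model families, the parameter grid and the squared-error criterion below are assumptions chosen to keep the example small; a real system would use a proper optimizer rather than an exhaustive grid.

```python
from itertools import product


def sse(predict, data):
    """Evaluation criterion: sum of squared errors (lower is better)."""
    return sum((predict(x) - y) ** 2 for x, y in data)


def search(data, model_families, grid):
    """Outer loop over model structures, inner loop over their parameters."""
    best = None
    for name, make_model, n_params in model_families:       # model search
        for params in product(grid, repeat=n_params):        # parameter search
            score = sse(make_model(*params), data)
            if best is None or score < best[0]:
                best = (score, name, params)
    return best


# Hypothetical candidate representations: a constant and a straight line
model_families = [
    ("constant", lambda c: (lambda x: c), 1),
    ("linear", lambda a, b: (lambda x: a + b * x), 2),
]
grid = [0, 0.5, 1, 1.5, 2]
data = [(0, 1), (1, 2), (2, 3)]
best_score, best_model, best_params = search(data, model_families, grid)
```

Here the representation (`model_families`) and the evaluation criterion (`sse`) are fixed first, and the search then only has to find the structure and parameters that optimize the criterion.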
To create a model, an algorithm first analyzes a
dataset and looks for specific patterns and trends (in
this case in call detail data, customer data and network
data). The algorithm then uses the results of this analysis to
define the parameters of the mining model. Finally,
these parameters are applied across the
entire dataset to extract actionable patterns and
detailed statistics.
Data Mining Prospects in Telecommunication
Industry
The prospects of data mining techniques in
telecommunication are many. They help in predicting
which customers are likely to default on payment,
identifying telecommunication patterns, catching
fraudulent activities, improving service quality and
resource utilization and facilitating multi-dimensional
data analysis to improve understanding of customer
behavior (Berry and Linoff, 2004). Information gained
from data mining techniques can be used for
applications ranging from market analysis, fraud
detection and customer retention to production control
and science exploration (Han and Kamber, 2001). Data
mining helps to reduce company inefficiencies by
making good predictions about business outcomes.
Fraud Detection: In order to identify patterns of
fraud, a data mining application analyzes large amounts
of cellular call data (Fawcett and Provost, 1997), and these patterns
are used to generate monitors. Each monitor watches a
customer’s behavior with respect to one pattern of
fraud. The outputs of the monitors are then fed into a neural network
that determines when there is sufficient evidence of
fraud to raise an alert. Data mining also assists in
detecting fraud by identifying and storing the phone
numbers called when a phone is known to be used
fraudulently. Hence, data mining can be used generally
to protect telecommunication operators from revenue losses due to
fraud (Estevez et al., 2006) or customer insolvency.
Marketing and Customer Profiling:
Telecommunication is one of the most data-intensive
industries. Telecommunication companies maintain huge amounts of
information about their customers. As such, they are
leaders in the use of data mining to identify and
retain customers and to maximize the profit obtained from
each customer. Data mining also helps in generating
customer profiles from call detail records and then
mining these profiles for marketing purposes. This
method is now used to identify whether a phone line is
used for voice or fax and to classify a phone line as
belonging to a business or a residential customer. A
variety of data mining methods are now being used to
model customer lifetime value for telecommunication
customers (Rosset et al., 2003) because it is much
more expensive to acquire new telecommunication
customers than to retain existing ones. In order to
improve customer relationships and to combat the high cost
of churn, increasingly sophisticated data mining
techniques are now being employed to analyze why
customers churn and which customers are most likely
to churn in the near future. This kind of information
can be used by marketing departments to better target
recruitment campaigns and, through active monitoring of the
customer call base, to highlight customers who may, by a
signature in their usage pattern, be thinking of
switching to another network provider.
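Monitoring a customer's usage signature for signs of churn can be sketched as an exponentially smoothed profile that each new period is compared against. The smoothing weight of 0.2, the 50% drop threshold, the two usage features and the sample weeks are all assumptions made for this illustration, not values from the paper or its references.

```python
def update_signature(signature, observation, weight=0.2):
    """Blend a new period's usage into the running signature (exponential smoothing)."""
    return {key: (1 - weight) * signature[key] + weight * observation[key]
            for key in signature}


def relative_drop(signature, observation, key):
    """Fractional fall of the latest observation below the signature value."""
    baseline = signature[key]
    return (baseline - observation[key]) / baseline if baseline else 0.0


# Hypothetical weekly usage for one customer: minutes and distinct contacts called
signature = {"minutes": 300.0, "contacts": 25.0}
weeks = [
    {"minutes": 310.0, "contacts": 24.0},
    {"minutes": 290.0, "contacts": 26.0},
    {"minutes": 90.0, "contacts": 8.0},     # sharp fall in activity
]

flags = []
for week in weeks:
    # Compare against the signature first, then fold the week into it.
    flags.append(relative_drop(signature, week, "minutes") > 0.5)
    signature = update_signature(signature, week)
```

Only the third week is flagged. A flagged customer would be a candidate for a retention offer rather than proof of an intention to leave.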
Network Fault Prediction and Isolation: Data
mining applications have been developed in order to
identify and predict network faults. The
Telecommunication Alarm Sequence Analyzer (TASA)
is one of the data mining tools that help in fault
identification by automatically discovering recurrent
patterns of alarms within the network data. The
patterns discovered by the tool are used to construct a
rule-based alarm correlation system. TASA is also
capable of finding episode rules that depend on
temporal relationships between the alarms. Another
technique used in predicting telecommunication switch
failures is the genetic algorithm (GA), which is used
for mining historical alarm logs (Weiss and Hirsh,
1998) to search for predictive sequential and temporal
patterns.
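The notion of recurrent alarm patterns with a temporal relationship can be illustrated by counting how often one alarm type is followed by another within a time window. This is a deliberately simplified sketch and not the TASA algorithm itself; the alarm names, the 30-second window and the minimum count are hypothetical.

```python
from collections import Counter


def frequent_alarm_pairs(alarms, window_s, min_count):
    """Count ordered pairs (A then B) where B follows A within `window_s` seconds."""
    alarms = sorted(alarms)                      # (timestamp_s, alarm_type)
    counts = Counter()
    for i, (t_a, type_a) in enumerate(alarms):
        for t_b, type_b in alarms[i + 1:]:
            if t_b - t_a > window_s:
                break                            # later alarms are even further away
            if type_a != type_b:
                counts[(type_a, type_b)] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Hypothetical alarm log: (seconds since start, alarm type)
log = [
    (0, "LINK_DOWN"), (5, "HIGH_BER"), (40, "FAN_FAIL"),
    (100, "LINK_DOWN"), (108, "HIGH_BER"),
    (300, "LINK_DOWN"), (304, "HIGH_BER"), (500, "FAN_FAIL"),
]
patterns = frequent_alarm_pairs(log, window_s=30, min_count=3)
```

A pair that recurs often enough, such as one alarm type regularly following another, is the kind of pattern that could be turned into a correlation rule.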
CONCLUSION
This paper has described how data mining tools and
techniques can be used in telecommunication
companies to discover and extract useful patterns from
very large volumes of data. These patterns can help in identifying
telecommunication patterns, catching fraudulent
activities, improving service quality and resource
utilization, and facilitating multi-dimensional data analysis
to improve the understanding of customer behavior.
REFERENCES
Abraham A. 1998. Intelligent Systems: Architecture
and Perspectives, Recent Advances in Intelligent
Paradigms and Applications, Abraham A., Jain L. and
Kacprzyk J. (Eds), Studies in Fuzziness and Soft
Computing, Springer Verlag, Germany, ISBN
3790815381, Chapter 1, pp. 1-35.
Berry M. and Linoff G. 2000. Mastering Data Mining:
The Art and Science of Customer Relationship
Management, John Wiley and Sons Inc, New York.
Berry M. and Linoff G. 2004. Data Mining Techniques
for Marketing, Sales, and Customer Relationship
Management, 2nd Edition, Wiley
Publishing Inc, Indianapolis.
Chang Y. T. 2009. Applying Data Mining to Telecom
Churn Management, International Journal of Reviews
in Computing, 1(10): 67-77.
Cortes C. and Pregibon, D. 2001. Signature-based
Methods for Data Streams, Data Mining and
Knowledge Discovery, 5(3):167-182.
Costea A. 2006. The Analysis of the
Telecommunications sector by the means of Data
Mining Techniques, Journal of Applied Quantitative
Methods 1 (2): 144-150.
Domingos P. 1999. Metacost: A General method for
making classification cost sensitive. Proceedings 5th
ACM Sigkdd Conf. Knowledge Discovery and Data
Mining (KDD ‘99). ACM press, pp 155-164
Estevez P. A., Held C. M. and Perez C. A. 2006.
Subscription Fraud Prevention in Telecommunications
using Fuzzy Rules and Neural Networks, Expert
Systems with Applications. 31 (2): 337-344.
Fawcett T. and Provost F. 1997. Adaptive Fraud
Detection, Data Mining and Knowledge Discovery. 1
(3): 291-316.
Fayyad U., Piatetsky-Shapiro G. and Smyth P. 1996.
Knowledge discovery and data mining: Toward a
unifying framework. Proceedings of the Second
International Conference on Knowledge Discovery
and Data Mining (KDD’96). E. Simoudis, J. Han, and
U. Fayyad (eds), AAAI Press, Portland, Oregon,
August 2-4, pp: 82-88.
Han J and Kamber M. 2001. Data Mining: Concepts
and Techniques, Academic Press.
Hills S., Agarwal D., Bell R. and Volinsky C. 2006.
Building an Effective Representation for Dynamic
Networks, Journal of Computational and Graphical
Statistics. 15 (3): 584-608.
Jiawei H. and Micheline K. 2000. Data Mining:
Concepts and Techniques, 1st Edition, Morgan
Kaufmann.
Liebwitz J. 1998. Expert System Application to
Telecommunications, John Wiley and Sons Inc. New
York.
Mani D. R., Drew J., Betz A. and Datta P. 1999.
Statistics and data mining techniques for lifetime value
modeling. Proceedings 5th ACM Sigkdd Conf.
Knowledge Discovery and Data Mining (KDD ‘99).
ACM press, pp 94-103.
Ranjan J. 2007. Application of Data Mining
Techniques in Pharmaceutical Industry, Journal of
Theoretical and Applied Information Technology.
3(3): 61-67.
Rosset S., Neumann E., Eick U. and Vatnik M. 2003.
Customer lifetime value model for decision support.
Data Mining and Knowledge Discovery. 7 (3): 321-339.
SAS Institute 2000. Best Price in Churn Prediction, A
SAS Institute White Paper
Setnes M. and Roubos J. 1999. Transparent fuzzy
modeling using fuzzy clustering and GA’s, In
NAFIPS, New York, USA, pp 198-202.
Setnes M., Babuska R. and Verbruggen H.B. 1998.
Complexity Reduction in Fuzzy Modeling,
Mathematics and Computers in Simulation.
Thearling K. 1999. An introduction of Data Mining,
Direct Marketing Magazine, 28-31
Wang K. et al. 2005. Mining Customer Value: From
Association Rules to Direct Marketing, Int’l J. Data
Mining and Knowledge Discovery (DMKDJ). 11 (1):
57-80.
Weiss G. and Hirsh H. 1998. Learning to predict rare
events in event sequences, In: R. Agrawal and P.
Stolorz (Eds.). Proceedings of the Fourth International
Conference on Knowledge Discovery and Data
Mining. Menlo Park, CA: AAAI Press, pp. 359-369.
Yen J. and Wang L. 2001. Simplifying Fuzzy Rule-based Models using Orthogonal Transformation
Methods, IEEE Trans. on Systems, Man and
Cybernetics, 2(31) pp 199-206.
Zadeh L. 1998. Roles of Soft Computing and Fuzzy
Logic in the Conception, Design and Deployment of
Information/Intelligent Systems, in: Computational
Intelligence: Soft Computing and Fuzzy-Neuro
Integration with Applications. O. Kaynak et al. (Eds),
Springer Verlag, Germany, pp. 1-9.