Proceedings of the 42nd Hawaii International Conference on System Sciences - 2009
Real-time Knowledge Discovery and Dissemination for Intelligence Analysis
Bhavani Thuraisingham,
Latifur Khan,
Murat Kantarcioglu,
Sonia Chib
The University of Texas at Dallas
lkhan@utdallas.edu
bhavani.thuraisingham@utdallas.edu
muratk@utdallas.edu
sonia.chib@student.utdallas.edu
Jiawei Han
University of Illinois at Urbana-Champaign
hanj@cs.uiuc.edu
Sang Son
University of Virginia
son@virginia.edu
Abstract
This paper describes the issues and challenges for
real-time knowledge discovery and then discusses
approaches and challenges for real-time data mining
and stream mining. Our goal is to extract accurate
information to support the emergency responder, the
war fighter, as well as the intelligence analyst in a
timely manner.
1. Introduction
Data mining/knowledge discovery is the process
of posing various queries and extracting useful and
often previously unknown and unexpected
information, patterns, and trends from large
quantities of data, generally stored in databases.
These data could be accumulated over a long period
of time or they could be large data sets accumulated
simultaneously from heterogeneous sources such as
different sensor types. The goals of data mining
include improving marketing capabilities, detecting
abnormal patterns, and predicting the future based on
past experiences and current trends. There is clearly a
need for this technology for many applications in
government and industry.
Much of the focus on data mining has been for
analytical applications. However there is a clear need
to mine data for applications that have to meet timing
constraints. For example, a government agency may
need to determine whether a terrorist activity will
happen within a certain time or a financial institution
may need to give out financial quotes and estimates
within a certain time. Consider for example a medical
application where the surgeons and radiologists have
to work together during an operation. Here, the
radiologist has to analyze the images in real-time and
give inputs to the surgeon. In the case of military
applications, images and video may arrive from the
war zone. These images have to be analyzed in real-time so that advice is given to the war fighter. In the
case of counter-terrorism applications, the system has
to analyze the data about the passenger from the time
the passenger gets ticketed until the plane is boarded,
and give proper advice to the security agent. For all
of these applications there is an urgent need to mine
the data and extract knowledge in real-time.
Therefore, we need tools and techniques for real-time
data mining. The challenge is to determine which
data to analyze and which data to discard for future
analysis in non real-time.
This paper describes approaches for real-time
knowledge discovery. In particular, the following techniques are discussed: real-time knowledge discovery techniques, including real-time clustering, machine learning, association rule mining, and outlier and trend detection; and real-time stream mining techniques, including clustering evolving streams and multidimensional stream cube analysis.
The organization of this paper is as follows. A motivating scenario for real-time data mining is discussed in Section 2. Some technical issues in real-time data mining are given in Section 3. Adapting data mining techniques to meet real-time constraints, including multiple real-time data mining algorithms, is described in Section 4. Stream data mining, which is an important aspect of mining sensor data in real time, is discussed in Section 5. The paper is concluded in Section 6.
2. Motivational Scenario
In the crisis management domain, there is a strong need for coordinated activities among multiple organizations (such as the Army, Navy, and Air Force; coalition members such as the US, UK, and Australia; or health organizations such as the CDC (Centers for Disease Control and Prevention), HRSA (Health Resources and Services Administration), and hospitals) that must work with one another to determine enemy locations at a specific time or to monitor and manage the spread of infectious diseases. Here, organizations
such as local, state and federal governments or
coalition members have to work together and
integrate the diverse sources of information.
For time-critical applications such as crisis
management, there are two major components. One is
data sharing and the other is decision support. The
data sharing component consists of data integration
where data from structured and unstructured sources
have to be integrated. The integrated data has to be
analyzed using data mining and data warehousing
techniques. For real-time applications, active data
management techniques will ensure that triggers are
enforced on the data so that incidents are propagated
in a timely manner. Finally the integrated data, the
results obtained from activating triggers as well as
the results from data mining are collected and
analyzed by a decision support system to give advice
to the incident commanders and decision makers.
One can carry out decision support analysis using
partial data in which case the analysis as well as
integration will be carried out incrementally. This is
the bottom-up approach, where it is assumed that not
all of the data is available at one time. The top-down
approach integrates all of the data first and then
carries out the mining and subsequently decision
support analysis. For both approaches, the
organizations may enforce their security policies
including confidentiality policies, privacy policies,
dissemination policies, and trust policies. Real-time
data sharing and security have conflicting
requirements. The challenge is to enforce the various
policies and at the same time share the data so that
real-time decision making can be carried out.
In defense applications, organizations such as Air
Force, Navy, Army, Central Intelligence Agency and
the National Security Agency have to share
intelligence and military data. Each organization may
enforce its own policies. However the data has to be
sent to the war fighter in real-time to make life and
death decisions. For example, as a result of the
analysis the outcome may be that the enemy forces
will be in Location X between 6 and 6:30pm. This
information has to be sent to the war fighter so that
he has enough time to attack the enemy if need be. In
addition to techniques for enforcing policies we also
need to make tradeoffs between security and timely
processing. The soldiers and the analysts have to
work together to get timely and accurate information.
One issue with real-time analysis is that we may not
be able to get the complete and accurate picture as
well as meet timing constraints. In such a situation, is partial information better than no information? This depends on the situation under consideration. If the plan is to, say, monitor the enemy, then even partial information about the enemy's location will be extremely valuable. However, if the plan is to attack the enemy, then we will need a complete and accurate picture of the situation.
We are investigating secure data sharing across
agencies for the Air Force Office of Scientific
Research [THUR06]. However, our research has focused on the non-real-time data integration and data analysis aspects. While this is very important,
there is also a strong need to provide a dependable
real-time analysis environment so that the emergency
responder as well as the war fighter gets the right
information at the right time.
3. Technical Issues
To study effective real-time data mining and data stream mining, the first issue is to examine what kinds of patterns and knowledge should be mined from huge sets of data. In general, the kinds
of knowledge to be mined can be categorized into the
following classes: (1) multidimensional summary,
which characterizes and compares summary
information of different groups of data in
multidimensional space; (2) clustering, which
automatically partitions data into multiple groups
based on the similarities among data; (3)
classification, which builds up predictive models
based on the training and testing data sets; (4)
frequent pattern mining for association and
correlation analysis, which mines frequent patterns
for discovering association rules and correlation
relationships among data; (5) mining changes, trends,
sequential patterns and evolutions with time in the
data sets and data streams; and (6) outlier detection,
which finds singular or small groups of data that
deviate substantially from the properties of the
majority of the data. Both real-time and stream data
mining involve the accomplishment of one or more
of the above data mining tasks.
In recent research, we have performed in-depth
investigation on real-time and data stream mining for
multidimensional analysis, classification and cluster
analysis. In this paper we will discuss some aspects
of real-time and data stream mining in the following
two major directions: (1) exploration of real-time and
data stream mining in several strategically important
applications, and (2) development of real-time and
data stream mining methods for trend and outlier
detection.
4. Real-time Knowledge Discovery
Techniques
We believe that the most critical issue in real-time
data mining is that at any moment in time the
algorithm should be able to stop and produce
intermediate output with an indication of how close
to optimum the intermediate output is. This will
enable a real-time data miner to determine how much
weight to put on the conclusions drawn from the
intermediate output. The mining should be done efficiently, with high scalability, and incrementally, as new data sets are added into the system or existing data sets are modified dynamically. Stream data mining adds an additional but critical constraint: a data stream in most cases can be scanned only once, because the huge size and the sequential nature of a data stream make it unrealistic to repeatedly scan or randomly access it. Therefore, in our discussion, we will focus on how to develop efficient, scalable, and incremental data mining methods for each knowledge discovery task listed above and discuss how to extend these methods to the data stream environment through the development of single-scan mining methods.
In this section, we will discuss each of the data
mining tasks and how to extend the major methods
for real-time data mining, with an emphasis on
efficiency, scalability and incremental mining
methods.
4.1. Real-time Multidimensional Summary
For the first data mining task, multidimensional summary, which characterizes and compares summary information of different groups of data in multidimensional space, the data cube has been a popular approach for multidimensional aggregation, and efficient methods have been developed for computing sparse, dense, iceberg, and closed cubes (ZDN97, BeRa99, HPDW01, XHLW03, XHSL06). Online analytical processing can also be performed on multidimensional databases of rather high dimensionality by partial materialization (LHG04). These methods can be generalized to semi-structured databases as well.
Moreover, incremental computation can be performed on data cubes with the above-mentioned methods. Thus these computational methods set up a foundation for efficient and incremental computation over multidimensional databases.
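
As a concrete, purely illustrative sketch of this kind of incremental multidimensional summarization, the code below maintains count and sum aggregates for every group-by combination of a few dimensions as records arrive. The dimension names and the measure are invented for the example; a real data cube implementation would materialize only selected cuboids rather than all of them.

```python
from collections import defaultdict
from itertools import combinations

# Incrementally maintain (count, sum) aggregates for every group-by combination
# of a few dimensions -- a brute-force stand-in for a small data cube.
DIMENSIONS = ("region", "sensor_type", "hour")      # hypothetical dimensions

cube = defaultdict(lambda: [0, 0.0])                # cell -> [count, sum of measure]

def insert(record, measure):
    """Update every cuboid cell that the incoming record falls into."""
    for r in range(len(DIMENSIONS) + 1):
        for dims in combinations(DIMENSIONS, r):
            cell = tuple((d, record[d]) for d in dims)
            cube[cell][0] += 1
            cube[cell][1] += measure

insert({"region": "NA", "sensor_type": "acoustic", "hour": 14}, 3.2)
insert({"region": "EU", "sensor_type": "acoustic", "hour": 14}, 1.1)

# Multidimensional summary query: average measure for acoustic sensors,
# aggregated over all regions and hours.
count, total = cube[(("sensor_type", "acoustic"),)]
print(count, total / count)
```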
4.2. Real-time Clustering
In clustering, the goal is to partition a dataset so
that similar points belong to the same cluster and
dissimilar points belong to different clusters. One of
the most widely used clustering criteria is the squared
error distortion, where the goal is to minimize the
average squared distance from a point to its nearest
center.
While many clustering algorithms are available,
the real-time nature of the problem calls for
algorithms that make measurable progress towards
the optimum solution. The reason is that if an
individual is to make a real-time decision based on a
data mining algorithm that has not run to completion,
then it is important to know how far the algorithm is
from the final solution. This will enable the decision-maker to determine how much weight to put on the
conclusion drawn from the incomplete run of the data
mining algorithm. Ideally, the final solution is the
optimum solution, but in the case of clustering, no
efficient algorithm is known for finding the optimum
solution.
On the other hand, constant factor
approximation algorithms are known. For example,
the algorithm given in (AGKM04) produces a
solution that is no worse than nine times the optimum
squared error distortion. Another desirable feature of
this algorithm is that it begins with a solution that is
at most 2n times the optimum, where n is the number
of points and, with each successive iteration, the
quality of the solution improves by a (1 − ε/k) factor, where ε is an input parameter and k is the number of clusters. It can be shown that after roughly O((k/ε) log n) steps the cost of the solution is at most nine times the optimum. But, more relevant to this context, if a real-time decision is made after the algorithm has completed i iterations, then the cost of the solution is within a factor of (1 − ε/k)^i · 2n of the optimum.
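
The sketch below illustrates this anytime usage pattern, under the stated assumption that the underlying iterative algorithm carries an (AGKM04)-style guarantee; plain Lloyd-type iterations are used only as a placeholder for the local-search step, and the reported factor is simply the bound (1 − ε/k)^i · 2n from above.

```python
import random

def squared_error(points, centers):
    """Sum of squared distances from each 1-D point to its nearest center."""
    return sum(min((p - c) ** 2 for c in centers) for p in points)

def anytime_cluster(points, k, eps, deadline_iters, max_iters=50):
    """Iterative clustering that can be interrupted at any iteration and then
    reports its current cost together with the worst-case factor
    (1 - eps/k)**i * 2n implied by an (AGKM04)-style guarantee (assumed here;
    the Lloyd-style improvement step below is only a stand-in)."""
    n = len(points)
    centers = random.sample(points, k)
    bound = 2 * n
    for i in range(1, max_iters + 1):
        # placeholder improvement step: reassign points, then recompute centers
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: (p - c) ** 2)
            clusters[nearest].append(p)
        centers = [sum(pts) / len(pts) for pts in clusters.values() if pts]
        bound = (1 - eps / k) ** i * 2 * n
        if i >= deadline_iters:          # real-time deadline: stop early
            break
    return centers, squared_error(points, centers), bound

pts = [random.gauss(mu, 0.5) for mu in (0.0, 5.0, 10.0) for _ in range(30)]
centers, cost, worst_case_factor = anytime_cluster(pts, k=3, eps=0.5, deadline_iters=5)
print(sorted(centers), cost, worst_case_factor)
```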
A more detailed study of clustering algorithms
that make measurable progress towards the optimum
is needed. Further, the problem of clustering a
streaming graph in real-time is quite unexplored. We
also need to investigate extensions of such algorithms
to the data stream setting in the case where the goal is
to find the best clusters (GMMO00) as well as the
best evolving clusters (AHWY03).
4.3. Real-time Machine Learning
In the classification problem, one is given a
collection of labeled data, e.g., positive and negative
examples, and the goal is to learn a function that
distinguishes the two classes. Standard examples of
machine learning algorithms include learning
halfspaces, decision trees and other function classes
(Mit97).
In the real-time version of this problem, the
machine learning algorithm must make measurable
progress towards the optimum solution. A consistent
learning algorithm, i.e., one that maintains a
hypothesis that is always consistent with the data
seen so far, is one example of an algorithm that is
always making progress towards the target concept.
For instance, in the context of learning halfspaces,
the algorithm of (MT94) always cuts the number of
consistent concepts in half.
We need to further investigate such algorithms
that make steady progress towards the final target
concept so that we can quantify how informed the
real-time decision is. In this setting, the number of
consistent concepts may serve as a measure of how
much weight to put on the decision.
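
A toy illustration of this idea, not taken from (MT94), is given below: the concept class is 1-D thresholds on a small grid, and the size of the version space (the set of concepts still consistent with the labeled examples seen so far) is reported after each example as the proposed measure of confidence.

```python
# Toy measure of progress for consistent learning: the concept class is the set
# of thresholds t in {0, ..., 100}, labelling x positive iff x >= t.  After each
# labelled example, the size of the version space (thresholds still consistent
# with everything seen so far) indicates how much weight a real-time decision
# based on the current hypothesis deserves.

def consistent(t, x, label):
    return (x >= t) == label

version_space = set(range(101))                  # all thresholds initially possible
stream = [(30, False), (70, True), (55, True), (40, False)]   # hypothetical labelled data

for x, label in stream:
    version_space = {t for t in version_space if consistent(t, x, label)}
    hypothesis = min(version_space)              # any consistent threshold will do
    print(f"after ({x}, {label}): {len(version_space)} consistent concepts remain, "
          f"current threshold guess = {hypothesis}")
```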
4.4. Association Rules/Frequent Itemsets in
Real Time
Association rules uncover patterns of the form
ABC->D, with the interpretation that whenever
events A and B and C occur, D also occurs. A
common first step employed in the discovery of such
rules is the identification of maximal frequent
itemsets, subsets of values that occur frequently in a
dataset. The Apriori algorithm (AS94) gives one way
of identifying the maximal frequent itemsets.
One of the limitations of this algorithm in the
real-time context is that the algorithm does not make
measurable progress. In other words, if all the frequent itemsets are of length m, the algorithm does not produce any itemsets until it has iterated through all 2^m − 1 subsets of the frequent itemset that are of length 1, 2, 3, …, m − 1. Such an approach may not be
suitable in the real-time context since no useful
output may be produced when needed.
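
To make the level-by-level behavior concrete, here is a minimal levelwise Apriori-style sketch (without the usual subset-pruning optimization); intermediate output becomes available only as each level completes, which is exactly the delay discussed above.

```python
def apriori_levels(transactions, min_support):
    """Levelwise Apriori-style sketch: yields the frequent itemsets of each
    length as soon as that level completes, so a real-time consumer sees
    output only at level boundaries."""
    def support(c):
        return sum(c <= t for t in transactions)

    level = {frozenset([i]) for t in transactions for i in t}
    level = {c for c in level if support(c) >= min_support}
    length = 1
    while level:
        yield length, level                      # intermediate output
        # join frequent k-itemsets into candidate (k+1)-itemsets
        candidates = {a | b for a in level for b in level if len(a | b) == length + 1}
        level = {c for c in candidates if support(c) >= min_support}
        length += 1

data = [frozenset(t) for t in (["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"])]
for length, itemsets in apriori_levels(data, min_support=2):
    print(length, sorted(sorted(s) for s in itemsets))
```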
Alternate algorithms do exist that may be more
appropriate. For example, (FK96) give an algorithm
that outputs frequent itemsets with boundable delay
in between itemsets. In particular, if there are s maximal frequent and minimal infrequent itemsets in total, then the algorithm spends s^{o(log s)} time between successive itemsets. No better bound on the delay is known to date.
We need to study other boundable delay
algorithms for the problem of finding frequent
itemsets, closed patterns, and constraint-based mining
(NLHP98). For an application that makes decisions
in real-time, boundable delay algorithms have the
advantage of making as much information available
to the individual as possible.
4.5. Real-time Trend Detection
For the fifth data mining task, mining changes,
trends, sequential patterns and evolutions with time
in the data sets, there have been many studies as well.
For example, for mining sequential patterns, efficient
and scalable algorithms, such as GSP (AS96), PrefixSpan (Pei+04), SPADE (Zaki01), and CloSpan (YaHa03), have been developed and have shown high performance on large data sets.
Methods for
incremental mining of such patterns, such as IncSpan
(CYH04), have been developed as well. We will
further develop such methods for mining changes,
trends and evolutions in real-time for large datasets.
An alternate problem is to cluster real-time
sequences of activity, e.g., movement through a
sensor network, into a collection of Markov models,
where the vertices correspond to sensors and the
weight on edge (u, v) indicates the probability that there
is movement from vertex u to v. These models are
useful both for understanding normal patterns of
activity as well as triggering alarms in the event of an
unusual real-time sequence.
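
A minimal sketch of this idea is shown below: maximum-likelihood transition probabilities are estimated from hypothetical movement traces through sensors, and a low log-likelihood under the resulting model can be used to flag an unusual sequence. The sensor identifiers and the probability floor are illustrative only.

```python
import math
from collections import defaultdict

def estimate_markov_model(sequences):
    """Maximum-likelihood transition probabilities: the weight on edge (u, v)
    is count(u -> v) / count(u -> anything)."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for u, v in zip(seq, seq[1:]):
            counts[u][v] += 1
    return {u: {v: c / sum(nxt.values()) for v, c in nxt.items()}
            for u, nxt in counts.items()}

def log_likelihood(model, seq, floor=1e-9):
    """Score a new sequence; a very low score can trigger an unusual-activity alarm."""
    return sum(math.log(model.get(u, {}).get(v, floor)) for u, v in zip(seq, seq[1:]))

# hypothetical movement traces through sensors s1..s3
normal_traces = [["s1", "s2", "s3"], ["s1", "s2", "s3"], ["s1", "s3"]]
model = estimate_markov_model(normal_traces)
print(model)                                        # transition probabilities per sensor
print(log_likelihood(model, ["s1", "s2", "s3"]))    # likely sequence: near zero
print(log_likelihood(model, ["s3", "s1"]))          # unseen transition: very negative
```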
The problem of inferring such maximum-likelihood Markov models has been studied in the past (KMBRT04). However, these algorithms have many limitations; the most notable is that we do not
currently know how to quantify the progress that
these algorithms make as they infer the collection of
Markov models. As indicated, evaluating this
progress is important for a real-time miner.
4.6. Real-Time Outlier Detection
Finally, the last data mining task is outlier detection, which finds singular or small groups of data that deviate substantially from the properties of the majority of the data. It is critically important to find outliers and alarming incidents in many applications, such as homeland security and accidents or disasters (fire, explosion, etc.). There have been studies on mining outliers and local outliers in large data sets (KnNg98, SLZ01, BKNS00, JTHW06). In general, outlier detection methods need to be further developed for different applications. We will devote more effort to investigating real-time outlier detection, especially for spatiotemporal data sets.
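
As one simple baseline in this family, the sketch below flags a point as an outlier if it has too few neighbors within a given radius, in the spirit of distance-based outlier detection (KnNg98); the radius and neighbor threshold are illustrative parameters, and the readings are hypothetical.

```python
def distance_outliers(points, radius, min_neighbors):
    """Flag a point as an outlier if fewer than `min_neighbors` other points
    lie within `radius` of it (a minimal distance-based sketch)."""
    outliers = []
    for i, p in enumerate(points):
        neighbors = sum(
            1 for j, q in enumerate(points)
            if i != j and sum((a - b) ** 2 for a, b in zip(p, q)) <= radius ** 2)
        if neighbors < min_neighbors:
            outliers.append(p)
    return outliers

readings = [(1.0, 1.1), (1.2, 0.9), (0.9, 1.0), (1.1, 1.2), (9.5, 0.2)]
print(distance_outliers(readings, radius=1.0, min_neighbors=2))   # -> [(9.5, 0.2)]
```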
5. Stream Data Mining
Recent emerging applications call for
investigation of stream data, where data takes the
form of continuous, ordered, potentially infinite data
streams, as opposed to finite, statically stored data
sets. Huge volumes of dynamically changing data
are in the form of data streams, including time-series
data, scientific and engineering data, and data
produced in other dynamic environments, such as
power supply, network traffic, stock exchange,
telecommunication, Web clicking, weather or
environment monitoring, and so on.
Analysis of stream data poses great challenges to
data mining research, due to the unique features of
data streams, such as huge (and possibly infinite)
volume, dynamically changing behavior, flowing in and out in a fixed order, allowing only one scan of
the entire data stream, and demanding fast (often
real-time) responses. Moreover, a lot of stream data
resides at the primitive abstraction level, and it is
necessary to perform multi-level, multi-dimensional
analysis on such data to find interesting patterns at
appropriate levels of abstraction and with appropriate
dimension combinations.
Stream data modeling, stream data management,
and continuous stream query processing are under
popular investigation (Babc+02, BaWi01, GKMS01,
GMMO00, GKS01b, DGGR02). Moreover, mining
data streams, including classification of stream data
(DoHu00, HSD01), clustering data streams
(GMMO00, Ocal+02), and finding frequent patterns
in data streams (MaMo02) are also under intensive
study. Nevertheless, most existing studies have not
paid enough attention to one critical fact: most data
streams reside at a rather low level of abstraction and
are multi-dimensional in nature, whereas most
analysts are interested in finding characteristics,
rules, unusual patterns, and dynamic changes (such as
trends and outliers) at relatively high levels of
abstraction and in multi-dimensional space.
Let's examine an example. Suppose that a Web
server receives a huge volume of online Web service
requests, in the form of data streams. Usually, such
stream data resides at a rather low level, consisting of
time (down to seconds), Web page address (down to
concrete URL), user IP address (down to detailed
machine IP address), etc. However, an analyst may
be interested in (1) finding trends happening in the
data streams at certain high levels of abstraction, such
as “the Web clicking traffic in North America on Mideast news in the last 15 minutes is 40% higher than the average over the last 24 hours”, (2) detecting network
intrusions and traffic bursting, by multi-dimensional
clustering or frequent pattern analysis, and (3)
dynamically modeling the Web traversal behavior for
certain suspicious groups of users to protect network
security and swiftly react to cyber-attacks. On the
other hand, from the point of view of a Web analysis
provider, given a large volume of fast changing
Internet traffic streams, and with limited main
memory and computational power, it is only realistic
to analyze the changes of Web usage at certain high
levels, discover unusual events and patterns, and drill
down to some more detailed levels for in-depth
analysis, when needed. Moreover, it is critical to
have such analysis performed automatically or semi-automatically, and in almost real time, in order to
make timely responses.
All these applications demand the
development of mechanisms for multi-dimensional
analysis and mining of changes, trends, and unusual
patterns of data streams, with low cost and fast
response time. In our recent studies, we have
developed a data stream cube architecture for
multidimensional stream data analysis (Chen+02,
Han+05). Moreover, we have studied how to cluster
dynamic data streams (AHWY03, AHWY04,
AHWY05), classify data streams (WFYH03,
AHWY04b, AHWY06), and find frequent
patterns in data streams (LLHH06). We briefly
introduce our efficient methods for effective online
mining of data streams and outline further research
issues.
5.1. Multidimensional data stream analysis
using stream cubes
For multi-dimensional analysis of data streams,
we proposed a stream cube architecture (Chen+02,
Han+05) for efficient and effective multi-dimensional
regression analysis in time series data streams. The
major challenges in constructing a stream data cube are the huge volume of data, the potentially even larger size of the stream data cube, and the incremental update of such a stream cube. The
proposed architecture has the following features: (1)
tilted time window for registering recent data at fine
granularity, and more distant ones at coarser
granularity; (2) two critical layers: a minimal interest
layer and an observation layer, which are used to save a minimal number of cuboids while achieving the power of high-level observation and low-level drilling; and (3) an efficient data structure for partial
computation of data cubes, which uses an H-tree
structure (HPDW01) to partially compute data cubes,
saving storage space, pre-computation effort, and online computation time. The stream data cubes so
constructed are much smaller than those constructed
from the raw stream data but will still be effective for
many multi-dimensional stream data analysis tasks.
This architecture has demonstrated its efficiency and
scalability in our construction of the MAIDS system
(Mining Alarming Incidents in Data Streams)
(Cai+04) and its subsequent experiments with various
kinds of stream data. The architecture will be further
developed in this project, especially for the analysis
of correlations, trends and outliers in the context of
data streams, to facilitate multi-dimensional mining
of various kinds of patterns in data streams.
5.2. Clustering evolving data streams
For efficient clustering of stream data, typical
clustering methods proposed in machine learning,
statistics and data mining communities (ABKS99,
KHK99, Agga+99, AGGR98, BFR98, GRS98,
NgHa94, JaDu88, KaRo90, ZRL96) do not work
since they require multiple scans of a database. For
single-scan clustering of data streams, new
algorithms such as (GMMO00, Ocal+02) have been
proposed. Unfortunately, these stream clustering
methods cannot be used to discover stream evolution
behavior since they suffer from two deficiencies.
First, they assume that the clusters are to be
computed over the entire data stream. Although such
an assumption may be useful in some applications, a
data stream should often be viewed as an infinite
process consisting of data which continuously
evolves with time. As a result, the underlying clusters
may also change considerably with time. The nature
of the clusters may vary with both the moment at
which they are computed as well as the time horizon
over which they are measured. For example, a data
analyst may wish to compare clusters occurring now with those in the last hour, day, week, or month. These
clusters could be considerably different. Without
considering multiple tilted time windows, it is
impossible to perform such comparison to find the
dynamics and evolution of data streams.
Second, they assume that incremental clustering
can be performed by maintaining only a small
number of existing clusters (such as k in the k-median or k-means method). That is, the level of
granularity to be preserved across windows is at the
level of macro-clusters. However, with the fast
evolving streaming data, it is obvious that some
objects originally belonging to cluster C1 may need
to be moved to another cluster C2 or even form a
new, independent cluster C3. Since the macro-cluster
level does not register any detailed information about
such groups of objects, it is impossible to make
correct judgment on whether such annihilated groups
of objects should be moved to another cluster or form
a new one. That is, the amount of information maintained by a k-means or k-median type approach is too coarse in granularity; once two cluster centers are joined, there is no way to split the clusters informatively when required by changes in the stream at a later stage. Therefore, the quality of
stream clustering could suffer from the lack of
sufficient information.
Based on these observations, we proposed a new
stream clustering method, called CluStream (AHWY03), based on two major ideas, micro-clustering and the tilted time frame, which support efficient bookkeeping in data streams and overcome the two problems noted above: lack of historical information and lack of an appropriate level of granularity.
First, we need to explore the concept of microclustering (ZRL96) to ensure that clustering can be
adapted to the dynamic changes of data streams by
using additional information at a relatively fine level
of granularity.
More concretely, we maintain
statistical information about the data locality in terms
of microclusters, defined as a direct extension of the
cluster feature vector (ZRL96). The microclusters
have an additivity property which makes them a
natural choice for data streams. When new data
objects stream in, they can either be absorbed by
an existing microcluster, or form a new one,
depending on their relative distance from the nearest microcluster center. The creation of new microclusters may lead to the merging of some nearby existing microclusters or the removal of some small, outdated microclusters (treated as old outliers). These operations support incremental maintenance and dynamic evolution of microclusters in data streams.
The summary information in the microclusters can be
used for creation of macroclusters (i.e., the high-level
k clusters for an input parameter k) dynamically, based on an efficient clustering algorithm such as k-means and on application requirements. Such a microcluster-to-macrocluster framework will provide an efficient, high-quality clustering mechanism for dynamically changing data streams.
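
A minimal sketch of such a microcluster, modeled on the (N, LS, SS) cluster feature vector of (ZRL96), is given below; the absorption radius and the assignment rule are illustrative simplifications rather than the exact CluStream procedure.

```python
class MicroCluster:
    """Cluster feature (N, LS, SS) in the BIRCH/CluStream style; additivity
    means two microclusters can be merged by adding their components."""
    def __init__(self, point):
        self.n = 1
        self.ls = list(point)                    # per-dimension linear sum
        self.ss = [x * x for x in point]         # per-dimension squared sum

    def absorb(self, point):
        self.n += 1
        for i, x in enumerate(point):
            self.ls[i] += x
            self.ss[i] += x * x

    def merge(self, other):
        self.n += other.n
        for i in range(len(self.ls)):
            self.ls[i] += other.ls[i]
            self.ss[i] += other.ss[i]

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        # root-mean-square deviation from the centroid, from the CF statistics
        c = self.centroid()
        var = sum(max(0.0, ss / self.n - ci * ci) for ss, ci in zip(self.ss, c))
        return (var / len(self.ls)) ** 0.5

def assign(microclusters, point, max_radius=1.0):
    """Absorb the point into the nearest microcluster if it stays compact,
    otherwise start a new microcluster (an illustrative rule only)."""
    if microclusters:
        nearest = min(microclusters, key=lambda m: sum(
            (x - c) ** 2 for x, c in zip(point, m.centroid())))
        trial = MicroCluster(point)
        trial.merge(nearest)                     # what the cluster would become
        if trial.radius() <= max_radius:
            nearest.absorb(point)
            return
    microclusters.append(MicroCluster(point))

mcs = []
for p in ([0.1, 0.2], [0.15, 0.25], [5.0, 5.1]):
    assign(mcs, p)
print(len(mcs), [m.centroid() for m in mcs])     # two microclusters
```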
Second, we need to adopt the tilted time frame, where microclusters are stored as snapshots in time. In the
tilted time frame, the snapshots are maintained at
different granularity depending upon their level of
recency. This framework facilitates the fading of
aged microclusters and discovery of the dynamic
changes of the clusters with time. Since the time windows are tilted, the amount of information to be registered is limited, and the approach thus strikes a balance between efficiency and the ability to detect stream dynamics.
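
The sketch below illustrates one simple way to realize such a tilted (pyramidal) snapshot store: level i keeps only a few snapshots taken at clock ticks divisible by 2^i, so recent history is dense and older history progressively coarser. The base and per-level capacity are illustrative parameters, not the exact settings used in (AHWY03).

```python
class TiltedTimeFrame:
    """Simplified pyramidal/tilted snapshot store: level i keeps at most
    `per_level` snapshots taken at clock ticks divisible by base**i, so recent
    history stays dense while older history becomes progressively coarser."""
    def __init__(self, base=2, per_level=2):
        self.base = base
        self.per_level = per_level
        self.levels = {}                          # level -> list of (tick, snapshot)

    def store(self, tick, snapshot):
        level = 0
        while tick > 0 and tick % (self.base ** (level + 1)) == 0:
            level += 1
        bucket = self.levels.setdefault(level, [])
        bucket.append((tick, snapshot))
        if len(bucket) > self.per_level:          # fade the oldest snapshot at this level
            bucket.pop(0)

    def snapshot_ticks(self):
        return sorted(t for bucket in self.levels.values() for t, _ in bucket)

ttf = TiltedTimeFrame()
for tick in range(1, 33):
    ttf.store(tick, snapshot={"tick": tick})      # a real snapshot would hold microclusters
print(ttf.snapshot_ticks())                       # dense recent ticks, sparse older ones
```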
Notice that the proposed method requires us to
register some additional information with respect to
the tilted time frame and the finer level of
granularity.
One may wonder whether this overhead outweighs the potential benefit. Based on our initial experiments, even with a very moderate number of microclusters (only about 10·k, where k is the number of macroclusters), the quality of clustering can be enhanced substantially, with only minimal overhead in computation time (AHWY03). The clustering quality enhancement by this approach is striking for dynamically changing data, such as network intrusion data. Therefore, with the current availability of fast computers and considerably sized main memory (usually more than one million times larger in bytes than k, where k is the desired number of macroclusters), it is realistic to trade some computational resources for the quality and flexibility of stream clustering.
With the above framework, i.e., micro-clustering and the tilted time window, several other clustering paradigms, such as density-based clustering (ABKS99), Gaussian distribution-based clustering (HiKe98), and distance- and connection-integrated hierarchical clustering (KHK99), can be re-examined in the context of stream clustering. Efficient and effective stream clustering algorithms can be developed following those paradigms as well. We need to investigate efficient and effective methods for stream clustering based on the ideas outlined above and construct high-quality methods for clustering dynamic, data-intensive streams to discover the dynamics of data streams.
5.3. Dynamic modeling and classification of
data streams
As an essential data mining task, classification
has been one of the most popularly studied subjects in
machine learning, statistics, and data mining
(Quin86, WiFr00, Mitc97, HTF01, HaKa06). There
have been many well-developed classifiers, such as
decision trees, naive Bayesian classifiers, Bayesian
networks, nearest neighbor, neural networks, and
support vector machines.
In many studies,
researchers have found that each classifier has
advantages for certain types of data sets (LLS00).
For stream data, there is a necessity to dynamically
construct classifiers based on the historical and
current information, which poses the following
challenges.
1. The classifier construction process should be fast
and dynamic because the data may arrive at a high
rate, with dramatic changes. For example, thousands
of packets can be collected from sensor nets every
second, and millions of customers may make
purchases every day.
2. The classifier should also evolve over time since
the labels of similar records may change from time to
time. As a result, how to keep track of this type of
evolution and how to discover the cause that leads to
the evolution is an important and difficult problem.
To build fast, adaptive, and evolving classifiers in
the stream data environment, one needs to reexamine the broad spectrum of available classifiers to
see what kinds of classifiers may be good candidates
for stream classification. Some of the known
classifiers, such as neural networks and support
vector machines, may not be good for single-scan,
very fast model reconstruction while handling the
huge volumes of stream data. Recently, (DoHu00,
HSD01) developed a few efficient decision tree
induction algorithms for stream data, such as VFDT
and CVFDT. However, these stream classification
methods are not adequate for handling data streams
and discovering stream evolution regularity, based on
the following analysis.
First, decision trees have been popularly adopted
as the first choice in the classification of stream data,
such as in (DoHu00, HSD01), for their simplicity and ease of explanation.
However, it is difficult to
dynamically and drastically change decision trees due
to the costly reconstruction once they have been built.
In many real applications, there are frequent,
dynamic changes in stream data, such as in stock data
analysis, traffic or weather modeling, and so on. In
addition, a large amount of raw data is needed to
build a decision tree. According to the model
proposed in (HSD01), it is necessary to keep such
raw data in memory or on disk since they may be
used later for updating the statistics when old records
leave the window and for reconstructing parts of the
tree. If the concept drifts often, the related data needs
to be scanned multiple times so that the decision tree
can be kept updated. This is usually unaffordable for
streaming data. Also, after detecting the drift in the
model, it may take a long time to accumulate
sufficient data to build an accurate decision tree
(HSD01). Any drift taking place during that period
either cannot be caught or will make the tree
unstable.
Second, based on an observation similar to that in stream clustering, without storing historical
information, it is impossible to detect the evolution
regularity of classification models for dynamic data
streams. In practice, one may need to compare the
models constructed at different time frames and see
what has changed. Moreover, it is often necessary to
put different weights on the data arriving at different
time frames so that aged data can be weighted less
and be gradually faded out. This poses the necessity
for a tilted time frame model.
Based on the above analysis, we proposed two
major components for the construction and dynamic
evolution of classification models: (1) tilted time
frame, and (2) a classification method that is more
adaptable to the changes of data.
First, a tilted time frame similar to that described
above is introduced in our stream classification
model.
This ensures that essential historical
information can be saved for fading aged data and for
comparing models at different times to discover the
evolution of classification models.
Second, instead of refining the decision tree
model, we integrate a classification method that is
more adaptive to the changes of data. We developed
and tested three methods for classification of data
streams: (1) ensemble classification (WFYH03), (2)
naïve Bayesian classifier with boosting (YYHW05),
and (3) k-nearest neighbor approach (AHWY06).
The first such test is to construct an ensemble classifier (WFYH03) that exploits the power of combining multiple classifiers, each being a simple classification model adaptive to dynamic changes of data streams. Our experiments show the high accuracy and efficiency of such a classification method despite the dynamic changes of data streams.
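
A simplified sketch of this chunk-based ensemble idea is given below; it keeps only the most recent models and votes without the accuracy-based weighting used in (WFYH03), and the small decision trees are just one possible choice of base classifier. The synthetic data and drift point are invented for the example.

```python
import random
from collections import deque
from sklearn.tree import DecisionTreeClassifier

class ChunkEnsemble:
    """Chunk-based ensemble sketch: train a small model on each incoming chunk,
    keep only the most recent `capacity` models, and predict by majority vote so
    that models trained on outdated concepts fade out of the ensemble."""
    def __init__(self, capacity=5):
        self.models = deque(maxlen=capacity)

    def update(self, X_chunk, y_chunk):
        self.models.append(DecisionTreeClassifier(max_depth=3).fit(X_chunk, y_chunk))

    def predict(self, x):
        votes = [m.predict([x])[0] for m in self.models]
        return max(set(votes), key=votes.count)

ens = ChunkEnsemble()
for chunk_id in range(10):
    X = [[random.random()] for _ in range(50)]
    boundary = 0.3 if chunk_id < 5 else 0.7        # the concept drifts mid-stream
    y = [int(x[0] > boundary) for x in X]
    ens.update(X, y)
print(ens.predict([0.5]))                          # reflects the recent concept
```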
The second test is to integrate a naïve Bayesian
classifier (YYHW05), since it is easy to construct and
adapt. The naïve Bayesian classifier, in essence,
maintains a set of probability distributions P(ai | v)
where ai and v are the attribute value and the class
label, respectively. To classify a record with several
attribute values, it is assumed that the conditional
probability distributions of these values are
independent of each other. Thus, one can simply
multiply the conditional probabilities together and
label the record with the class label of the greatest
probability. Despite its simplicity, the accuracy of the naïve Bayesian classifier is comparable to that of other classifiers, such as decision trees (Mitc97, DoPa97).
To further enhance the accuracy of the stream classifier, a boosting technique similar to that described in (FrSc97) can be introduced. The boosting method retains a portion of the recent data as an independent testing set and then puts additional weight on misclassified cases, that is, it increases the weight when similar new cases stream in. Our experimental results show that this method leads to a high-quality stream classifier.
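
The following sketch shows an incremental naïve Bayesian classifier of this kind, maintaining the counts behind P(v) and P(a_i | v) so the model can be updated one record at a time; the Laplace smoothing and the example attributes are additions for illustration and are not part of the description above.

```python
import math
from collections import defaultdict

class StreamingNaiveBayes:
    """Incremental naive Bayes sketch: keeps the counts behind P(v) and
    P(a_i | v) so the model is updated one record at a time."""
    def __init__(self):
        self.class_counts = defaultdict(int)                  # v -> count
        self.attr_counts = defaultdict(                       # v -> i -> a -> count
            lambda: defaultdict(lambda: defaultdict(int)))
        self.attr_values = defaultdict(set)                   # i -> seen values

    def update(self, attributes, label):
        self.class_counts[label] += 1
        for i, a in enumerate(attributes):
            self.attr_counts[label][i][a] += 1
            self.attr_values[i].add(a)

    def predict(self, attributes):
        total = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for v, cv in self.class_counts.items():
            score = math.log(cv / total)
            for i, a in enumerate(attributes):
                num = self.attr_counts[v][i][a] + 1            # Laplace smoothing
                den = cv + len(self.attr_values[i])
                score += math.log(num / den)
            if score > best_score:
                best_label, best_score = v, score
        return best_label

nb = StreamingNaiveBayes()
for attrs, label in [(("tcp", "http"), "normal"), (("tcp", "http"), "normal"),
                     (("udp", "dns"), "attack")]:
    nb.update(attrs, label)
print(nb.predict(("udp", "dns")))    # -> attack
```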
The third test is to construct a k-nearest neighbor classifier (AHWY06) that performs microclustering in a manner similar to clustering evolving data streams and then performs classification dynamically based on k such precomputed and incrementally updated microclusters. This method integrates stream-based microclustering with online classification based on k nearest neighbors and leads to high accuracy and high efficiency/scalability in stream classification.
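
A minimal sketch of classifying against labeled microclusters, rather than raw records, is shown below; the centroids and labels are hypothetical, and a full implementation would first select the microclusters from the appropriate tilted-time horizon.

```python
def classify_by_microclusters(point, labeled_microclusters, k=3):
    """On-demand k-NN sketch: each entry is (centroid, class_label); the query
    point takes the majority label of its k nearest microcluster centroids
    instead of scanning the raw stream records."""
    nearest = sorted(labeled_microclusters,
                     key=lambda mc: sum((x - c) ** 2 for x, c in zip(point, mc[0])))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

# hypothetical labeled microcluster centroids maintained from the stream
summaries = [([0.1, 0.2], "normal"), ([0.2, 0.1], "normal"), ([0.0, 0.3], "normal"),
             ([5.0, 5.2], "attack"), ([4.8, 5.1], "attack")]
print(classify_by_microclusters([4.9, 5.0], summaries))   # -> attack
```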
6. Conclusion
In this paper we have discussed issues,
approaches, challenges and directions for Real-time
and Stream Data Mining. In particular, we discuss
real-time clustering and outlier detection approaches
as well as stream data mining and multidimensional
analysis. For each topic we discuss some approaches
we have developed and the research that needs to be
done.
Real-time knowledge discovery and dissemination is a new direction. Not only do we
need to extract useful patterns, but we also need to extract such patterns in a timely manner so that the emergency responder, the war fighter, or the intelligence analyst has this information to make
appropriate decisions. At one extreme, we can determine all possibilities ahead of time and execute them at run time. At the other extreme, we can extract the nuggets and develop the models entirely at run time. We can also develop solutions that carry out partial computations ahead of time and the rest at run time.
Several techniques such as real-time query
approximation have been proposed by the real-time
research community. The knowledge discovery
community has developed numerous data mining and
machine learning algorithms. This paper attempts to
integrate the solutions proposed by both
communities.
7. References
[ABKS99] M. Ankerst, M. Breunig, H.-P. Kriegel, and
J. Sander. OPTICS: Ordering points to identify the
clustering structure. In Proc. 1999 ACM-SIGMOD Int.
Conf. Management of Data (SIGMOD'99), pages 49-60,
Philadelphia, PA, June 1999.
[AGGR98] R. Agrawal, J. Gehrke, D. Gunopulos, and P.
Raghavan.
Automatic subspace clustering of high
dimensional data for data mining applications. In Proc.
1998 ACM-SIGMOD Int. Conf. Management of Data
(SIGMOD'98), pages 94-105, Seattle, WA, June 1998.
[AGKM04] Arya, Garg, Khandekar, Meyerson, Munagala,
and Pandit. Local Search Heuristics for k-Median and
Facility Location Problems. SIAM Journal on Computing,
vol. 33, 2004.
[AHWY03] C. C. Aggarwal, J. Han, J. Wang, and P. S.
Yu. A framework for clustering evolving data streams. In
Proc. 2003 Int. Conf. Very Large Data Bases (VLDB'03),
pages 81-92, Berlin, Germany, Sept. 2003.
[AHWY04a] C. Aggarwal, J. Han, J. Wang, and P. S. Yu.
A framework for projected clustering of high dimensional
data streams. In Proc. 2004 Int. Conf. Very Large Data
Bases (VLDB'04), pages 852-863, Toronto, Canada, Aug.
2004.
[AHWY04b] C. Aggarwal, J. Han, J. Wang, and P. S. Yu.
On demand classification of data streams. In Proc. 2004
ACM SIGKDD Int. Conf. Knowledge Discovery in
Databases (KDD'04), pages 503-508, Seattle, WA, Aug.
2004.
[AHWY05] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu.
On efficient algorithms for high dimensional projected
clustering of data streams. Data Mining and Knowledge
Discovery, 10:251-273, 2005.
[AHWY06] C. Aggarwal, J. Han, J. Wang, and P. S. Yu.
A framework for on demand classification of evolving data
streams. In IEEE Transactions on Knowledge and Data
Engineering, 18(5):577-789, 2006.
[APW+ 99] C. C. Aggarwal, C. Procopiuc, J. Wolf, P. S.
Yu, and J.-S. Park. Fast algorithms for projected clustering.
In Proc. 1999 ACM-SIGMOD Int. Conf. Management of
Data (SIGMOD'99), pages 61-72, Philadelphia, PA, June
1999.
[AS94] R. Agrawal and R. Srikant. Fast algorithms for
mining association rules. In Proc. 1994 Int. Conf. Very
Large Data Bases (VLDB'94), pages 487-499, Santiago,
Chile, Sept. 1994.
[BBD+ 02] B. Babcock, S. Babu, M. Datar, R. Motwani,
and J. Widom. Models and issues in data stream systems.
In Proc. 2002 ACM Symp. Principles of Database Systems
(PODS'02), pages 1-16, Madison, WI, June 2002.
[BFR98] P. Bradley, U. Fayyad, and C. Reina. Scaling
clustering algorithms to large databases. In Proc. 1998 Int.
Conf. Knowledge Discovery and Data Mining (KDD'98),
pages 9-15, New York, NY, Aug. 1998.
[BKNS00] M. M. Breunig, H.-P. Kriegel, R. Ng, and
J. Sander. LOF: Identifying density-based local outliers.
In Proc. 2000 ACM-SIGMOD Int. Conf. Management of
Data (SIGMOD'00), pages 93-104, Dallas, TX, May 2000.
[BR99] K. Beyer and R. Ramakrishnan. Bottom-up
computation of sparse and iceberg cubes. In Proc.
1999 ACM-SIGMOD Int. Conf. Management of Data
(SIGMOD'99), pages 359-370, Philadelphia, PA, June
1999.
[Bur98] C. J. C. Burges. A tutorial on support vector
machines for pattern recognition. Data Mining and
Knowledge Discovery, 2:121-168, 1998.
[BW01] S. Babu and J. Widom. Continuous queries over
data streams. SIGMOD Record, 30:109-120, 2001.
[CCP+ 04] Y. D. Cai, D. Clutter, G. Pape, J. Han, M.
Welge, and L. Auvil. MAIDS: Mining alarming incidents
from data streams. In Proc. 2004 ACM-SIGMOD Int.
Conf. Management of Data (SIGMOD'04), pages 919-920,
Paris, France, June 2004.
[CDH+ 02] Y. Chen, G. Dong, J. Han, B. W. Wah, and J.
Wang. Multi-dimensional regression analysis of time-series
data streams. In Proc. 2002 Int. Conf. Very Large Data
Bases (VLDB'02), pages 323-334, Hong Kong, China,
Aug. 2002.
[CKS+88] P. Cheeseman, J. Kelly, M. Self, J. Stutz, W. Taylor, and D. Freeman. AutoClass: a Bayesian classification system. In Proc. 1988 Int. Conf. Machine Learning, pages 54-64, San Mateo, CA, 1988.
[CYH04] H. Cheng, X. Yan, and J. Han. IncSpan:
Incremental mining of sequential patterns in large
database. In Proc. 2004 ACM SIGKDD Int. Conf.
Knowledge Discovery in Databases (KDD'04), pages 527-532, Seattle, WA, Aug. 2004.
[HCD+ 05] J. Han, Y. Chen, G. Dong, J. Pei, B. W. Wah,
J. Wang, and Y. D. Cai. Stream cube: An architecture for
multi-dimensional analysis of data streams. Distributed and
Parallel Databases, 18:173-197, 2005.
[DGGR02] A. Dobra, M. Garofalakis, J. Gehrke, and R.
Rastogi. Processing complex aggregate queries over data
streams.
In Proc. 2002 ACM-SIGMOD Int. Conf.
Management of Data (SIGMOD'02), pages 61-72,
Madison, WI, June 2002.
[Hec96] D. Heckerman. Bayesian networks for knowledge
discovery. In U. M. Fayyad, G. Piatetsky-Shapiro, P.
Smyth, and R. Uthurusamy, editors, Advances in
Knowledge Discovery and Data Mining, pages 273-305.
MIT Press, 1996.
[DH00] P. Domingos and G. Hulten. Mining high-speed
data streams. In Proc. 2000 ACM SIGKDD Int. Conf.
Knowledge Discovery in Databases (KDD'00), pages 71-80, Boston, MA, Aug. 2000.
[HK98] A. Hinneburg and D. A. Keim. An efficient
approach to clustering in large multimedia databases with
noise. In Proc. 1998 Int. Conf. Knowledge Discovery and
Data Mining (KDD'98), pages 58-65, New York, NY, Aug.
1998.
[DHS01] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern
Classification (2nd ed.). John Wiley & Sons, 2001.
[HK06] J. Han and M. Kamber. Data Mining: Concepts and
Techniques (2nd ed.). Morgan Kaufmann, 2006.
[DP97] P. Domingos and M. J. Pazzani. On the optimality
of the simple bayesian classifier under zero-one loss.
Machine Learning, 29:103-130, 1997.
[HPDW01] J. Han, J. Pei, G. Dong, and K. Wang.
Efficient computation of iceberg cubes with complex
measures. In Proc. 2001 ACM-SIGMOD Int. Conf.
Management of Data (SIGMOD'01), pages 1-12, Santa
Barbara, CA, May 2001.
[FK96] M. Fredman and L. Khachiyan. On the Complexity
of Dualization of Monotone Disjunctive Normal Forms.
Journal of algorithms, vol. 21, 1996.
[FS97] Y. Freund and R. E. Schapire. A decision-theoretic
generalization of on-line learning and an application to
boosting. J. Computer and System Sciences, 55:119-139,
1997.
[GKMS01] A. C. Gilbert, Y. Kotidis, S. Muthukrishnan,
and M. Strauss. Surfing wavelets on streams: One-pass
summaries for approximate aggregate queries. In Proc.
2001 Int. Conf. on Very Large Data Bases (VLDB'01),
pages 79-88, Rome, Italy, Sept. 2001.
[GKS01] S. Guha, N. Koudas, and K. Shim. Data streams and histograms. In Proc. 2001 ACM Symposium on Theory of Computing (STOC'01), pages 471-475, Crete, Greece, July 2001.
[GMMO00] S. Guha, N. Mishra, R. Motwani, and L.
O'Callaghan. Clustering data streams. In Proc. 2000
Symp. Foundations of Computer Science (FOCS'00),
pages 359-366, Redondo Beach, CA, 2000.
[GRG98] J. Gehrke, R. Ramakrishnan, and V. Ganti.
RainForest: A framework for fast decision tree construction
of large datasets. In Proc. 1998 Int. Conf. Very Large Data
Bases (VLDB'98), pages 416-427, New York, NY, Aug.
1998.
[GRS98] S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'98), pages 73-84, Seattle, WA, June 1998.
[HPY00] J. Han, J. Pei, and Y. Yin. Mining frequent
patterns without candidate generation. In Proc. 2000
ACM-SIGMOD Int. Conf. Management of Data
(SIGMOD'00), pages 1-12, Dallas, TX, May 2000.
[HSD01] G. Hulten, L. Spencer, and P. Domingos. Mining
time-changing data streams. In Proc. 2001 ACM SIGKDD
Int. Conf. Knowledge Discovery in Databases (KDD'01),
San Francisco, CA, Aug. 2001.
[HTF01] T. Hastie, R. Tibshirani, and J. Friedman. The
Elements of Statistical Learning: Data Mining, Inference,
and Prediction. Springer-Verlag, 2001.
[JD88] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
[JTHW06] W. Jin, A. K. H. Tung, J. Han, and W. Wang.
Ranking outliers using symmetric neighborhood
relationship. In Proc. 2006 Pacific-Asia Conf. Knowledge
Discovery and Data Mining (PAKDD'06), Singapore, April
2006.
[KHK99] G. Karypis, E.-H. Han, and V. Kumar.
CHAMELEON: A hierarchical clustering algorithm using
dynamic modeling. COMPUTER, 32:68-75, 1999.
[KMBRT04] P. Kontkanen, P. Myllymaki, W. Buntine, J.
Rissanen, H. Tirri. An MDL Framework for Data
Clustering. MIT Press, 2004.
[KN98] E. Knorr and R. Ng. Algorithms for mining
distance-based outliers in large datasets. In Proc. 1998 Int.
Conf. Very Large Data Bases (VLDB'98), pages 392-403,
New York, NY, Aug. 1998.
[KR90] L. Kaufman and P. J. Rousseeuw. Finding Groups
in Data: An Introduction to Cluster Analysis. John Wiley &
Sons, 1990.
[LHG04] X. Li, J. Han, and H. Gonzalez. High-dimensional OLAP: A minimal cubing approach. In Proc.
2004 Int. Conf. Very Large Data Bases (VLDB'04), pages
528-539, Toronto, Canada, Aug. 2004.
[LLHH06] H. Liu, Y. Lu, J. Han, and J. He. Error-adaptive
and time-aware maintenance of frequency counts over
data streams. In Proc. 2006 Int. Conf. on Web-Age
Information Management (WAIM'06), pages 484-495,
Hong Kong, China, June, 2006.
[LLS00] T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A
comparison of prediction accuracy, complexity, and
training time of thirty-three old and new classification
algorithms. Machine Learning, 40:203-228, 2000.
[Mit97] T. M. Mitchell. Machine Learning. McGraw-Hill,
1997.
[MM02] G. Manku and R. Motwani. Approximate
frequency counts over data streams. In Proc. 2002 Int.
Conf. Very Large Data Bases (VLDB'02), pages 346-357,
Hong Kong, China, Aug. 2002.
[MT94] Maass and Turan. Algorithms and Lower Bounds
for On-Line Learning of Geometrical Concepts. Machine
Learning, vol. 14, 1994.
[NH94] R. Ng and J. Han. Efficient and effective
clustering method for spatial data mining. In Proc. 1994
Int. Conf. Very Large Data Bases (VLDB'94), pages 144-155, Santiago, Chile, Sept. 1994.
[NLHP98] R. Ng, L. V. S. Lakshmanan, J. Han, and A.
Pang. Exploratory mining and pruning optimizations of
constrained associations rules. In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'98),
pages 13-24, Seattle, WA, June 1998.
[OMM+ 02] L. O'Callaghan, A. Meyerson, R. Motwani, N.
Mishra, and S. Guha. Streaming-data algorithms for high-quality clustering. In Proc. 2002 Int. Conf. Data Engineering (ICDE'02), pages 685-696, San Francisco, CA,
Apr. 2002.
[PHMA+ 04] J. Pei, J. Han, B. Mortazavi-Asl, J. Wang, H.
Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. Mining
sequential patterns by pattern-growth: The prefixspan
approach. IEEE Trans. Knowledge and Data Engineering,
16:1424-1440, 2004.
[Qui86] J. R. Quinlan. Induction of decision trees.
Machine Learning, 1:81-106, 1986.
[SA96] R. Srikant and R. Agrawal. Mining sequential
patterns: Generalizations and performance improvements.
In Proc. 5th Int. Conf. Extending Database Technology
(EDBT'96), pages 3-17, Avignon, France, Mar. 1996.
[SLZ01] S. Shekhar, C. T. Lu, and P. Zhang. Detecting
graph-based spatial outlier: Algorithms and applications (a
summary of results). In Proc. 2001 ACM SIGKDD
Int. Conf. Knowledge Discovery in Databases (KDD'01),
San Francisco, CA, Aug. 2001.
[SS02] B. Schoelkopf and A. J. Smola. Learning with
Kernels:
Support Vector Machines, Regularization,
Optimization, and Beyond. MIT Press, 2002.
[THUR06] B. Thuraisingham, Assured Information
Sharing: Volume 1: Overview, UTD Technical Report,
2006.
[TSK05] P. Tan, M. Steinbach, and V. Kumar.
Introduction to Data Mining. Addison Wesley, 2005.
[WF05] I. H. Witten and E. Frank. Data Mining: Practical
Machine Learning Tools and Techniques (2nd ed.).
Morgan Kaufmann, 2005.
[WFYH03] H. Wang, W. Fan, P. S. Yu, and J. Han. Mining
concept-drifting data streams using ensemble classifiers. In
Proc. 2003 ACM SIGKDD Int. Conf. Knowledge
Discovery and Data Mining (KDD'03), pages 226-235,
Washington, DC, Aug. 2003.
[XHLW03] D. Xin, J. Han, X. Li, and B. W. Wah. Star-cubing: Computing iceberg cubes by top-down and
bottom-up integration. In Proc. 2003 Int. Conf. Very Large
Data Bases (VLDB'03), pages 476-487, Berlin, Germany,
Sept. 2003.
[XHSL06] D. Xin, J. Han, Z. Shao, and H. Liu. C-cubing: Efficient computation of closed cubes by
aggregation-based checking. In Proc. 2006 Int. Conf. Data
Engineering (ICDE'06), Atlanta, Georgia, April 2006.
[YH03] X. Yan and J. Han. CloseGraph: Mining closed frequent graph patterns. In Proc. 2003 ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (KDD'03), pages 286-295, Washington, DC, Aug. 2003.
[YYH03] H. Yu, J. Yang, and J. Han. Classifying large
data sets using SVM with hierarchical clusters. In Proc.
2003 ACM SIGKDD Int. Conf. Knowledge Discovery and
Data Mining (KDD'03), pages 306-315, Washington, DC,
Aug. 2003.
[YYHW05] J. Yang, X. Yan, J. Han, and W. Wang. Discovering evolutionary classifier over high speed non-static stream. In S. Bandyopadhyay, U. Maulik, L. B. Holder, and D. J. Cook (eds.), Advanced Methods for Knowledge Discovery from Complex Data, pages 337-363, Springer, 2005.
[Zak01] M. Zaki. SPADE: An efficient algorithm for
mining frequent sequences. Machine Learning, 40:31-60,
2001.
[ZDN97] Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. In Proc. 1997 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'97), pages 159-170, Tucson, AZ, May 1997.
[ZH02] M. J. Zaki and C. J. Hsiao. CHARM: An efficient
algorithm for closed itemset mining. In Proc. 2002 SIAM
Int. Conf. Data Mining (SDM'02), pages 457-473,
Arlington, VA, April 2002.
[ZRL96] T. Zhang, R. Ramakrishnan, and M. Livny.
BIRCH : an efficient data clustering method for very large
databases. In Proc. 1996 ACM-SIGMOD Int. Conf.
Management of Data (SIGMOD'96), pages 103-114,
Montreal, Canada, June 1996.