Streaming Models and
Algorithms for Communication
and Information Networks
Brian Thompson (joint work with James Abello)
Outline
Introduction and Motivation
A Streaming Model
Our Approach
Algorithms
Experimental Results
Conclusions and Future Work
Problem Description
Data: A network (G, T)
G = (V,E) is a graph
T is a set of time-stamped events corresponding to nodes
or edges in G
Goals:
Identify recent correlated activity
Measure influence between entities
Challenges:
Scalability – networks may be very large, limited space
Efficiency – high data rate, time-sensitive information
Variability – entities have different temporal dynamics
Related Work
Time-evolving graph model - sequence of “snapshots”
[Figure: graph snapshots at t=1, t=2, t=3, t=4]
Time series analysis
[Figure: IP traffic (MB per hour)]
Related Work
Cascade model – set of seed nodes, information
(product, news, virus) propagates through network
Data Model
G is a graph
[Figure: example graph on nodes Alice, Bob, Cheng, Devika, Elina]
T is a set of time-stamped events corresponding to
nodes or edges in G

Source    Recipient    Content                  Timestamp
Alice     (public)     “Fire at 2nd & Main!”    Tuesday, 9:25am
Bob       Cheng        (private message)        Tuesday, 9:27am
Cheng     (public)     “RT @Alice Fire ...”     Tuesday, 9:28am
Data Model (Node-centric)
[Figure: node-centric view of the example graph (Alice, Bob, Cheng, Devika, Elina)]
Data Model (Edge-centric)
[Figure: edge-centric view of the example graph (Alice, Bob, Cheng, Devika, Elina)]
Renewal Theory
A renewal process Φ is a continuous-time Markov
process where state transitions occur with holding times
sampled independently from a positive distribution 𝜇.
Let 𝑆1 , 𝑆2 , … be samples from 𝜇, and consider a sequence
of events corresponding to those holding times.
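[Figure: timeline of $T_\Phi$ starting at 0 with events $t_1, \dots, t_5$ and the holding time $S_3$ marked]
We call the $S_i$ inter-arrival times, and refer to the sequence
$T_\Phi = \{\, t_i = \sum_{j=1}^{i} S_j \,\}$
as the discrete-event sequence for Φ.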
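As a minimal sketch of the definition above, the event times are just the running sums of the holding times. The function name `discrete_event_sequence` and the exponential choice of 𝜇 in the example are illustrative assumptions, not part of the talk.

```python
import random
from itertools import accumulate

def discrete_event_sequence(inter_arrival_times):
    """Return event times t_i = S_1 + ... + S_i for holding times S_1, S_2, ..."""
    if any(s <= 0 for s in inter_arrival_times):
        raise ValueError("holding times must be positive")
    return list(accumulate(inter_arrival_times))

# Example: exponential holding times (a hypothetical choice of mu).
rng = random.Random(0)
samples = [rng.expovariate(1.0) for _ in range(5)]
events = discrete_event_sequence(samples)
```

Any positive distribution could replace the exponential one here; only the cumulative-sum step comes from the definition.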
Renewal Theory
The age of a renewal process Φ at time 𝑡 is the amount
of time elapsed since the last event:
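$Age_\Phi(t) = \begin{cases} t - \max\{\, t_i : t_i < t \,\} & \text{if } t \ge t_1 \\ \infty & \text{otherwise} \end{cases}$
[Figure: timeline of $T_\Phi$ with events $t_1, \dots, t_5$, showing $Age_\Phi(t)$ at a time $t$]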
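The age function can be evaluated with a binary search over the sorted event times. This is a sketch under the assumption that events are stored in a sorted list; the name `age` is illustrative. Following the strict inequality in the definition, an event occurring exactly at time t is not counted.

```python
import math
from bisect import bisect_left

def age(event_times, t):
    """Time elapsed since the most recent event strictly before t.

    event_times must be sorted in increasing order.
    Returns math.inf when no event precedes t.
    """
    k = bisect_left(event_times, t)  # number of events with t_i < t
    if k == 0:
        return math.inf
    return t - event_times[k - 1]
```

For example, with events at times 1, 3, and 6, `age([1, 3, 6], 4)` looks back to the event at time 3.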
The REWARDS Model
REneWal theory Approach for Real-time Data Streams
We model a stream of communication data from a node
or across an edge as a renewal process
[Figure: inter-arrival time distribution with xmin and xmax marked, and the corresponding discrete-event sequence t1, ..., t5]
The REWARDS Model
REneWal theory Approach for Real-time Data Streams
Given a stream of time-stamped events, we estimate
the parameters of the renewal process for each node
or edge based on the inter-arrival times
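One way to maintain inter-arrival statistics per node or edge as events stream in is sketched below. The slides do not say which parameters are estimated, so tracking the count, mean, minimum, and maximum of the inter-arrival times is an assumption, as are the class name `InterArrivalStats` and the toy stream.

```python
from collections import defaultdict

class InterArrivalStats:
    """Running summary of inter-arrival times for one node or edge."""

    def __init__(self):
        self.last = None          # timestamp of the most recent event
        self.count = 0            # number of inter-arrival times seen
        self.total = 0.0          # sum of inter-arrival times
        self.xmin = float("inf")  # smallest inter-arrival time seen
        self.xmax = 0.0           # largest inter-arrival time seen

    def update(self, timestamp):
        """Fold one new event into the summary in constant time."""
        if self.last is not None:
            gap = timestamp - self.last
            self.count += 1
            self.total += gap
            self.xmin = min(self.xmin, gap)
            self.xmax = max(self.xmax, gap)
        self.last = timestamp

    def mean(self):
        return self.total / self.count if self.count else None

# Hypothetical stream of (entity, timestamp) pairs in time order.
stats = defaultdict(InterArrivalStats)
stream = [("alice", 0.0), ("bob", 1.0), ("alice", 2.0), ("alice", 5.0)]
for entity, ts in stream:
    stats[entity].update(ts)
```

Each update touches a fixed number of fields, so the per-event cost does not grow with the length of the stream.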
Recency
Goal: highlight recent activity
Key idea: more recent = more relevant
[Figure: event timelines for User: alice1337 and User: bob_iz_kewl, from 8:00 am through 10:00 am and 12:00 pm to NOW]
Challenge: The most frequent communicators will
always seem “recent”, overshadowing others’ behavior.
We call this time-scale bias.
Recency
We can overcome time-scale bias by using the
REWARDS Model
We first derive the limit distribution $F_\Phi^{Age*}$ of the $Age$ function:
$F_\Phi^{Age*}(\tau) = \lim_{t \to \infty} \Pr[\, Age_\Phi(t) \le \tau \,]$
We define the recency of Φ at time $t$ to be:
$Rec_\Phi(t) = 1 - F_\Phi^{Age*}(Age_\Phi(t))$
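A sketch of how recency might be computed from observed inter-arrival times follows. It relies on the standard renewal-theory result that the limiting age distribution is the equilibrium distribution, $F^{Age*}(\tau) = \frac{1}{E[S]} \int_0^\tau (1 - F_S(x))\,dx$, which for an empirical sample reduces to the mean of $\min(S_i, \tau)$ divided by the mean of $S_i$. Plugging in the empirical distribution is an assumption made here for illustration; the talk does not state how the authors estimate the limit distribution, and the function names are hypothetical.

```python
import math

def limiting_age_cdf(iats, tau):
    """Limiting age CDF implied by the empirical inter-arrival times iats."""
    if not iats or sum(iats) <= 0:
        raise ValueError("need at least one positive inter-arrival time")
    if tau == math.inf:
        return 1.0
    if tau <= 0:
        return 0.0
    # Integral of the empirical survival function over [0, tau],
    # normalized by the total observed inter-arrival time.
    return sum(min(s, tau) for s in iats) / sum(iats)

def recency(iats, current_age):
    """Rec = 1 - F_age(current_age): 1 right after an event, falling toward 0."""
    return 1.0 - limiting_age_cdf(iats, current_age)
```

With inter-arrival times that are all 2 time units, an age of 0.5 gives a recency of 0.75, and an infinite age gives 0.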
Recency
$Rec_\Phi$ is a decreasing function on every interval $(t_i, t_{i+1})$.
It also satisfies the uniformity property: for any renewal
process Φ, the limit distribution of 𝑅𝑒𝑐Φ is Uniform(0,1).
[Figure: Recency of Edge <3,22> in Bluetooth Dataset]
Recency effectively normalizes the age of a process
relative to its own temporal dynamics, making our
approach robust to differences in time scale between
networks or between entities within the same network.
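The uniformity property is an instance of the probability integral transform. As a short worked step, assuming the age $A$ is drawn from the limit distribution and that its CDF $F$ is continuous:

```latex
% Assumes A follows the limit distribution and its CDF F is continuous.
\begin{align*}
\Pr[\,1 - F(A) \le u\,]
  &= \Pr[\,F(A) \ge 1 - u\,] \\
  &= 1 - \Pr[\,F(A) < 1 - u\,] \\
  &= 1 - (1 - u) = u, \qquad 0 \le u \le 1.
\end{align*}
```

So $1 - F(A)$ is Uniform(0,1) regardless of the time scale of the underlying process, which is what lets recency values from different entities be compared directly.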
Delay
Goal: measure influence of entity A on entity B
Key idea: study pairwise (A,B)-gaps
[Figure: event timelines for User: alice1337 and User: bob_iz_kewl, from 8:00 am through 10:00 am and 12:00 pm to NOW]
Challenge: More frequent communicators will tend to
have shorter “gaps”.
Another example of time-scale bias.
Delay
Given renewal processes Φ and Ψ, we say the ordered
pair of events $(\phi_i, \psi_j)$ is adjacent if $t(\phi_i) < t(\psi_j)$ and
$\nexists\, t \in T_\Phi \cup T_\Psi : t(\phi_i) < t < t(\psi_j)$. We refer to the
elapsed time $t(\psi_j) - t(\phi_i)$ as the pairwise gap. We
denote by $Gap_{\Phi,\Psi}(t)$ the most recent such gap at time $t$.
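The adjacency definition can be sketched as a single pass over the merged event sequence: a (Φ, Ψ) pair is adjacent when a Φ event is immediately followed by a Ψ event with nothing in between. The function name `pairwise_gaps` and the choice to skip simultaneous events are assumptions for illustration.

```python
def pairwise_gaps(phi_times, psi_times):
    """Return (completion_time, gap) for each adjacent (phi, psi) event pair.

    A pair is adjacent when a phi event is immediately followed by a psi
    event in the merged, time-ordered sequence. Pairs with identical
    timestamps are skipped, since adjacency requires t(phi) < t(psi).
    """
    merged = sorted([(t, 0) for t in phi_times] + [(t, 1) for t in psi_times])
    gaps = []
    for (t_prev, src_prev), (t_next, src_next) in zip(merged, merged[1:]):
        if src_prev == 0 and src_next == 1 and t_prev < t_next:
            gaps.append((t_next, t_next - t_prev))
    return gaps
```

The most recent gap at time t is then the last entry whose completion time does not exceed t.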
If Φ and Ψ are independent processes, then we can
derive the limit distribution $F_{\Phi,\Psi}^{Gap*}$ of pairwise gaps
between consecutive (Φ, Ψ) event pairs.
We define the (Φ, Ψ)-delay at time $t$ to be:
$Del_{\Phi,\Psi}(t) = 1 - F_{\Phi,\Psi}^{Gap*}(Gap_{\Phi,\Psi}(t))$
Delay
$Del_{\Phi,\Psi}$ is a constant function on every interval $(t_i, t_{i+1})$,
and also satisfies the uniformity property: for any pair of
independent renewal processes Φ and Ψ, the limit
distribution of $Del_{\Phi,\Psi}$ is Uniform(0,1).
By comparing an observed gap to the theoretical joint
distribution of inter-arrival times for Φ and Ψ, delay
effectively normalizes the gap relative to the temporal
dynamics of Φ and Ψ individually.
Similarly to the recency function, this makes our
approach robust to differences in time scale between
networks or between entities within the same network.
Divergence
Based on the Kolmogorov-Smirnov statistic:
Compares the EDF $F_n(x)$ to a hypothetical CDF $F(x)$:
$KS(F_n \,\|\, F) = \sup_x |F_n(x) - F(x)|$
[Figure: $F_n(x)$ and $F(x)$ plotted on [0, 1], with KS = 0.32]
Recency divergence compares recency values for a set
of nodes or edges to the CDF for Uniform(0,1)
Delay divergence compares delay values for a set of
edges, or for all (A,B)-gaps, to the CDF for Uniform(0,1)
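A minimal sketch of the Kolmogorov-Smirnov statistic against Uniform(0,1) is shown below. It computes only the plain KS distance; the divergence reported in the talk is described as "based on" this statistic and may be scaled or transformed differently, so the function `ks_uniform` should be read as an illustration rather than the authors' definition.

```python
def ks_uniform(values):
    """KS distance between the empirical CDF of values and Uniform(0,1).

    values are expected to lie in [0, 1], e.g. recency or delay values.
    """
    xs = sorted(values)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one value")
    d = 0.0
    for i, x in enumerate(xs):
        # The EDF jumps from i/n to (i+1)/n at x; the uniform CDF equals x.
        d = max(d, (i + 1) / n - x, x - i / n)
    return d
```

A set of recency values clustered near 1, such as `[0.9, 0.95, 1.0]`, yields a large distance because far more mass sits near 1 than a uniform sample would place there.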
Streaming Node-Centric Algorithm
• Goal: Flag times at which a node exhibits anomalous
activity (indicated by an unusually high concentration
of recent outgoing communication)
• Approach: Since the recency function is decreasing
between consecutive communications, measure the
recency divergence at a node only at times at which
new activity occurs
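The event-driven evaluation described above might look like the following sketch. Everything here is an assumption made for illustration: the class `NodeMonitor`, the empirical estimate of the limiting age distribution, the plain two-sided KS distance as the divergence, and the requirement that timestamps on each edge strictly increase. Storing every inter-arrival time keeps the example short but is not constant space.

```python
from collections import defaultdict

def recency(iats, age):
    """1 - F(age), with F the limiting age CDF implied by the empirical IATs."""
    return 1.0 - sum(min(s, age) for s in iats) / sum(iats)

def ks_uniform(values):
    """KS distance between the empirical CDF of values and Uniform(0,1)."""
    xs = sorted(values)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

class NodeMonitor:
    """Tracks outgoing edges of one node; divergence is computed per event."""

    def __init__(self):
        self.last = {}                 # neighbor -> last event timestamp
        self.iats = defaultdict(list)  # neighbor -> inter-arrival times

    def observe(self, neighbor, ts):
        """Record an outgoing event and return the recency divergence,
        or None if no edge has an inter-arrival history yet."""
        if neighbor in self.last:
            self.iats[neighbor].append(ts - self.last[neighbor])
        self.last[neighbor] = ts
        recs = [recency(gaps, ts - self.last[e]) for e, gaps in self.iats.items()]
        return ks_uniform(recs) if recs else None

# Hypothetical usage: node with two neighbors, each contacted every 2 units.
monitor = NodeMonitor()
for nbr, ts in [("b", 0), ("c", 1), ("b", 2), ("c", 3), ("b", 4), ("c", 5)]:
    divergence = monitor.observe(nbr, ts)
```

A caller would compare the returned value against a threshold of its choosing to flag a time as anomalous; a one-sided statistic may be a better fit if only an excess of high recency values is of interest.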
The MCD Algorithm
Maximal Component Divergence Algorithm
• Goal: Identify subgraphs with correlated behavior
• Recency divergence to find recent anomalous activity
• Delay divergence to identify spheres of influence
Challenge: How do we overcome the combinatorial explosion?
The MCD Algorithm
Maximal Component Divergence Algorithm
1. Calculate edge weights using recency or delay function
2. Gradually decrease the threshold, updating
components and divergence values as necessary
3. Output: Disjoint components with max divergence
[Figure: example graph on vertices V1 through V5 with edge weights 0.9, 0.75, 0.7, 0.5, 0.3, 0.1]

θ       Component             Div(C)
0.9     {V1,V2}               2.908
0.75    {V1,V2,V3}            2.723
0.7     {V1,V2,V3}            6.132
0.5     {V4,V5}               1.143
0.3     {V1,V2,V3,V4,V5}      2.380
0.1     {V1,V2,V3,V4,V5}      1.882

[Figure: components over V1 through V5 annotated with divergence values 2.9, 2.7, 6.1, 1.1, 2.4]
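The three steps above can be sketched with a union-find structure: edges are added in order of decreasing weight (a falling threshold), each newly formed or updated component is scored, and disjoint components are then picked by score. This is a sketch under stated assumptions, not the authors' implementation: the divergence function is supplied by the caller, the greedy selection of disjoint components is a guess at what "max divergence" means, and the list concatenation on merge does not achieve the O(m log m) bound quoted in the talk. The edge list and the toy divergence in the usage lines are hypothetical.

```python
def mcd_sketch(edges, divergence):
    """edges: iterable of (u, v, weight); divergence: maps the list of edge
    weights inside a component to a score. Returns [(score, vertices), ...]
    for disjoint components, highest score first."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    members, weights, candidates = {}, {}, []
    # Lowering the threshold is equivalent to adding edges by decreasing weight.
    for u, v, w in sorted(edges, key=lambda e: e[2], reverse=True):
        ru, rv = find(u), find(v)
        verts = members.pop(ru, {u})
        ws = weights.pop(ru, [])
        if ru != rv:  # the edge joins two components
            verts = verts | members.pop(rv, {v})
            ws = ws + weights.pop(rv, [])
            parent[rv] = ru
        ws = ws + [w]
        members[ru], weights[ru] = verts, ws
        candidates.append((divergence(ws), frozenset(verts)))

    # Greedy choice (an assumption): best-scoring components that do not overlap.
    chosen, used = [], set()
    for score, comp in sorted(candidates, key=lambda c: c[0], reverse=True):
        if used.isdisjoint(comp):
            chosen.append((score, comp))
            used |= comp
    return chosen

# Hypothetical edge layout and toy divergence, for illustration only.
example_edges = [("V1", "V2", 0.9), ("V2", "V3", 0.75), ("V1", "V3", 0.7),
                 ("V4", "V5", 0.5), ("V3", "V4", 0.3), ("V1", "V5", 0.1)]
result = mcd_sketch(example_edges, lambda ws: len(ws) * min(ws))
```

With this toy input the selected components are {V1, V2, V3} and {V4, V5}; the scores come from the toy divergence and are not meant to reproduce the Div(C) values in the table.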
Sample Output
MCD      θ       #V(C)    E-frac    %E(C)    %E(G)
14.57    0.07    54       53/212    0.25     0.08
12.84    0.08    32       31/88     0.35     0.08
3.70     0.10    6        5/7       0.71     0.10
2.97     0.18    5        4/4       1.00     0.14
1.91     0.05    7        6/41      0.15     0.04
Robustness to Time Scale
• Simulation: R-MAT model, 128 vertices, average degree 16
• IATs for edge activity sampled from Bounded Pareto
distributions, rate parameter between 10 minutes and 1 week
• Every 5 days, a randomly selected node has anomalous
activity at 10x its normal rate
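Inter-arrival times from a Bounded Pareto distribution can be drawn by inverting its CDF, $F(x) = \frac{1 - (x_{min}/x)^{\alpha}}{1 - (x_{min}/x_{max})^{\alpha}}$ on $[x_{min}, x_{max}]$. The sketch below assumes this standard parameterization; the shape and bounds in the usage lines are hypothetical and are not the values used in the simulation.

```python
import random

def bounded_pareto_quantile(u, alpha, xmin, xmax):
    """Inverse CDF of the Bounded Pareto distribution at u in [0, 1)."""
    ratio = (xmin / xmax) ** alpha
    return xmin * (1.0 - u * (1.0 - ratio)) ** (-1.0 / alpha)

def sample_bounded_pareto(rng, alpha, xmin, xmax):
    """Draw one inter-arrival time by inverse-transform sampling."""
    return bounded_pareto_quantile(rng.random(), alpha, xmin, xmax)

# Hypothetical parameters: shape 1.5, bounds of 1 and 100 time units.
rng = random.Random(1)
iats = [sample_bounded_pareto(rng, 1.5, 1.0, 100.0) for _ in range(1000)]
```

Feeding the running sums of such samples into the model gives a synthetic event stream for one edge.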
Robustness to Time Scale
• Conclusion: While it takes longer for anomalous
activity to be recognized at nodes with lower rates,
the magnitude of the peak seems to be independent
of activity rate but highly correlated with degree
Accuracy and Precision
• Simulation: star network, 100 trials with only normal activity
and 100 trials including a period of anomalous activity
• ROC curves show accuracy and precision for several
methods for distinguishing between the two scenarios
• Conclusion: Especially when variability is introduced, our
approach outperforms the WtdDeg and Z-Score metrics
Detection Latency
• Data: Enron corpus, 1k nodes, 2k edges, 4k timestamps
• Compare our approach with the GraphScope algorithm
• Conclusion: The two algorithms seem to identify similar
times of anomalous activity, but our approach based on
the REWARDS model has a shorter response time
Anomaly Detection in IP Traffic
• Data: LBNL network trace, > 9 million timestamps during
one hour on December 15, 2004
• Compare our approach with total network volume and
with “scanning activity” labeled by LBNL analysts
Anomaly Detection in IP Traffic
• Three of the four times of highest recency divergence correspond to
labeled scanning activity
• The peak in scanning activity at 12:07pm is primarily due
to an increase in DNS and NBNS lookups
• The peak at 12:26pm was not flagged by the analysts
since the sequence of IP addresses was not monotonic
Complexity Analysis
Dataset: Twitter messages, Nov. 2008 – Oct. 2009
(263k nodes, 308k edges, 1.1 million timestamps)
Updates: O(1) per communication
MCD Algorithm: O(m log m), where m = # of edges;
can be approximated in effectively O(m) time
[Figure: Runtime for MCD Algorithm; runtime in milliseconds (0 to 2000) versus number of live edges (0 to 60,000)]
Future Work
Incorporate duration of communication and other node
or edge attributes into our model
Make use of geographical and textual content
Use gap divergence to infer links, compare to approach
of Gomez-Rodriguez et al.
Develop streaming algorithm to identify emerging trends
Acknowledgements
Part of this work was conducted at Lawrence Livermore
National Laboratory, under the guidance of Tina Eliassi-Rad.
This project is partially supported by a DHS Career
Development Grant, under the auspices of CCICADA,
a DHS Center of Excellence.