Jing Gao¹, Feng Liang¹, Wei Fan²,
Chi Wang¹, Yizhou Sun¹, Jiawei Han¹
¹University of Illinois, ²IBM TJ Watson
Debapriya Basu
Determine outliers in information networks
Compare various algorithms that do the same
E.g., the Internet, social networking sites
Nodes – characterized by feature values
Links – represent relations between nodes
Outliers – anomalies, novelties
Different kinds of outliers
◦ Global
◦ Contextual
[Figure: example network of nodes V1 to V10 plotted against Salary (in $1000), axis values from 10 to 160, with a global outlier marked]
Unified model considering both nodes and
links
Community discovery and outlier detection
are related processes
Treat each object as a multivariate data point
Use K components to describe normal
community behavior and one component to
denote outliers
Introduce a hidden variable z_i at each object
indicating its community
Treat network information as a graph
Model the graph as a hidden Markov random field
on z_i
Find a local minimum of the posterior energy of the
model (a local maximum of the posterior probability).
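[Figure: model overview showing community labels Z (including the outlier label), node features X, link structure W, and K, the number of communities. Example model parameters: high-income community (mean: 116k, std: 35k), low-income community (mean: 20k, std: 12k)]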
Symbol                     Definition
I = {1, ..., i, ..., M}    Indices of the objects
V = {v_1, ..., v_M}        Set of objects
S = {s_1, ..., s_M}        Given attributes of objects
W (M × M) = [w_ij]         Adjacency matrix containing the weights of the links
Z = {z_1, ..., z_M}        Random variables for the hidden labels of objects
X = {x_1, ..., x_M}        Random variables for the observed data
N_i (i ∈ I)                Neighborhood of object v_i
1, ..., k, ..., K          Indices of normal communities
Θ = {Θ_1, ..., Θ_K}        Random variables for the model parameters
◦ The random variables X are conditionally independent given their labels:
P(X = S | Z) = ∏_{i ∈ I} P(x_i = s_i | z_i)
◦ The k-th normal community is characterized by a set of parameters Θ_k:
P(x_i = s_i | z_i = k) = P(x_i = s_i | Θ_k)
◦ Outliers are characterized by a uniform distribution:
P(x_i = s_i | z_i = 0) = ρ_0
◦ A Markov random field is defined over the hidden variables Z:
P(z_i | z_{I−{i}}) = P(z_i | z_{N_i})
◦ The equivalent Gibbs distribution is P(Z) = (1/H_1) exp(−U(Z)),
where H_1 is a normalizing constant and U(Z) is the sum of clique potentials.
◦ The goal is to find the configuration of Z that maximizes
P(X = S | Z) P(Z) for a given Θ
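Maximizing P(X = S | Z) P(Z) is the same as minimizing its negative logarithm, an energy. The sketch below evaluates such an energy for a candidate labeling under several assumptions that the slides do not spell out: features are one-dimensional and Gaussian, U(Z) is a sum of pairwise potentials that reward linked normal objects sharing a label (weighted by `lam`), and the outlier label carries a constant cost `a0` standing in for −log ρ_0. The function name and arguments are illustrative.

```python
import math

def posterior_energy(S, Z, W, theta, a0, lam):
    """Return -log[P(X=S|Z) P(Z)] up to additive constants.

    S: list of 1-D feature values; Z: labels (0 = outlier, 1..K = community);
    W: symmetric weight matrix; theta: {k: (mean, std)};
    a0: cost of the outlier label (assumed stand-in for -log rho_0);
    lam: weight of the link term (assumed pairwise potential).
    """
    energy = 0.0
    for s, z in zip(S, Z):
        if z == 0:
            energy += a0
        else:
            mean, std = theta[z]
            # Gaussian negative log-density without the constant 0.5*log(2*pi)
            energy += math.log(std) + (s - mean) ** 2 / (2 * std ** 2)
    M = len(S)
    for i in range(M):
        for j in range(i + 1, M):
            # linked normal objects in the same community lower the energy
            if Z[i] != 0 and Z[i] == Z[j]:
                energy -= lam * W[i][j]
    return energy

# Toy usage with salary-like values (in $1000)
theta = {1: (20.0, 12.0), 2: (116.0, 35.0)}
S = [20.0, 110.0]
W = [[0, 1], [1, 0]]
good = posterior_energy(S, [1, 2], W, theta, a0=5.0, lam=1.0)
bad = posterior_energy(S, [2, 1], W, theta, a0=5.0, lam=1.0)
```

Here `good` is lower than `bad`, because each object is assigned to the community whose mean is close to its value.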
Continuous Data
◦ Is modeled as Gaussian distribution
◦ Model parameters: mean, standard deviation
Text Data
◦ Is modeled as Multinomial distribution
◦ Model parameters: probability of a word
appearing in a community
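For reference, the two feature models above can be written out as follows. The symbols \mu_k, \sigma_k, \beta_{kl} and d_{il} are notational assumptions introduced here, not taken from the slides.

```latex
% Continuous feature: community k has mean \mu_k and standard deviation \sigma_k
P(x_i = s_i \mid z_i = k)
  = \frac{1}{\sqrt{2\pi}\,\sigma_k}
    \exp\!\left(-\frac{(s_i - \mu_k)^2}{2\sigma_k^2}\right)

% Text feature: d_{il} is the count of word l in object i,
% \beta_{kl} is the probability of word l appearing in community k
P(x_i = s_i \mid z_i = k) \propto \prod_{l} \beta_{kl}^{\,d_{il}},
  \qquad \sum_{l} \beta_{kl} = 1
```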
Θ: model parameters, Z: community labels
Initialize Z, then alternate:
◦ PARAMETER ESTIMATION: given Z, find Θ that maximizes P(X|Z)
◦ INFERENCE: given Θ, find Z that maximizes P(Z|X)
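The alternation between the two steps can be sketched as a small generic loop. This is only a skeleton: `alternate`, `estimate` and `infer` are hypothetical names, the stopping rule (labels unchanged) is an assumption, and the toy stand-ins below use plain nearest-mean assignment with no links and no outlier label.

```python
def alternate(Z0, estimate, infer, max_iters=50):
    """Alternate parameter estimation and inference until labels stop changing."""
    Z = list(Z0)
    theta = estimate(Z)              # parameter estimation: given Z, fit Theta
    for _ in range(max_iters):
        new_Z = infer(theta, Z)      # inference: given Theta, relabel objects
        if new_Z == Z:               # labels are stable, so stop
            break
        Z = new_Z
        theta = estimate(Z)
    return Z, theta

# Toy stand-ins (hypothetical): 1-D values, two groups.
S = [1.0, 2.0, 9.0, 10.0]

def estimate(Z):
    # mean of the values currently assigned to each label
    return {k: sum(s for s, z in zip(S, Z) if z == k) / Z.count(k) for k in set(Z)}

def infer(theta, Z):
    # assign each value to the label with the nearest mean
    return [min(theta, key=lambda k: abs(s - theta[k])) for s in S]

Z_final, theta_final = alternate([1, 2, 1, 2], estimate, infer)
```

Starting from a deliberately poor initialization, the loop settles after one relabeling with the two small values in one group and the two large values in the other.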
Calculate model parameters
◦ maximum likelihood estimation
Continuous
◦ mean: sample mean of the community
◦ standard deviation: square root of the sample
variance of the community
Text
◦ probability of a word appearing in the
community: empirical probability
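A minimal sketch of these estimates, assuming one-dimensional continuous features, text given as lists of words, and label 0 reserved for outliers (which are excluded from the estimates). The function names are illustrative. The variance uses the 1/n form, which is the maximum likelihood estimate.

```python
import math
from collections import Counter

def estimate_gaussian(S, Z, K):
    """Sample mean and standard deviation for each normal community 1..K."""
    theta = {}
    for k in range(1, K + 1):
        members = [s for s, z in zip(S, Z) if z == k]
        if not members:
            continue  # empty community: leave its parameters undefined
        mean = sum(members) / len(members)
        var = sum((s - mean) ** 2 for s in members) / len(members)
        theta[k] = (mean, math.sqrt(var))
    return theta

def estimate_word_probs(docs, Z, K):
    """Empirical probability of each word within each normal community."""
    probs = {}
    for k in range(1, K + 1):
        counts = Counter(w for doc, z in zip(docs, Z) if z == k for w in doc)
        total = sum(counts.values())
        if total:
            probs[k] = {w: c / total for w, c in counts.items()}
    return probs

# Toy usage: the last object is labeled as an outlier and is ignored
theta = estimate_gaussian([10.0, 30.0, 100.0, 140.0, 500.0], [1, 1, 2, 2, 0], K=2)
probs = estimate_word_probs([["data", "mining"], ["data"], ["vision"]], [1, 1, 2], K=2)
```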
Calculate z_i values
◦ Given the model parameters,
◦ Iteratively update the community labels of nodes at each
timestep
◦ Select the label that maximizes P(z_i | x_i, z_{N_i})
Calculate P(z_i | x_i, z_{N_i}) values
◦ Depends on both the node features and the community labels
of neighbors if z_i indicates a normal community
◦ If the probability of a node belonging to any community is
low enough, label it as an outlier
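One way to realize this update is a sequential sweep that gives each node the label with the lowest local cost. The sketch assumes one-dimensional Gaussian features, a link term that subtracts `lam` times the total weight of neighbors already carrying the candidate label, and a fixed cost `a0` for the outlier label; these forms and the name `icm_sweep` are assumptions for illustration, not taken from the slides.

```python
import math

def icm_sweep(S, Z, W, theta, a0, lam):
    """One pass over all nodes; returns the updated label list (0 = outlier)."""
    Z = list(Z)
    M = len(S)
    for i in range(M):
        best_label, best_cost = 0, a0  # outlier cost ignores features and links
        for k, (mean, std) in theta.items():
            # feature term: Gaussian negative log-density (constant dropped)
            cost = math.log(std) + (S[i] - mean) ** 2 / (2 * std ** 2)
            # link term: neighbors currently labeled k make label k cheaper
            cost -= lam * sum(W[i][j] for j in range(M) if j != i and Z[j] == k)
            if cost < best_cost:
                best_label, best_cost = k, cost
        Z[i] = best_label
    return Z

# Toy usage: node 4 (salary 70) is linked only to the low-income nodes 0 and 1
S = [20.0, 25.0, 110.0, 120.0, 70.0]
W = [
    [0, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
]
theta = {1: (20.0, 12.0), 2: (116.0, 35.0)}
labels = icm_sweep(S, [1, 1, 2, 2, 1], W, theta, a0=4.0, lam=1.0)
# labels -> [1, 1, 2, 2, 0]
```

In this toy network a salary of 70 is unremarkable globally, but it fits neither the community its links point to nor, without link support, the other one well enough to beat the threshold, so the node is labeled an outlier.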
Setting hyperparameters
◦ a0 = threshold for labeling an object as an outlier
◦ λ = confidence in the network
◦ K = number of communities
Initialization
◦ Outliers are initially grouped into clusters.
◦ This is eventually corrected by later iterations.
Data Generation
◦ Generate continuous data based on Gaussian
distributions and generate labels according to the
model
◦ Define r: percentage of outliers, K: number of
communities
Baseline models
◦ GLODA: global outlier detection (based on node
features only)
◦ DNODA: local outlier detection (check the feature
values of direct neighbors)
◦ CNA: partition data into communities based on
links and then conduct outlier detection in each
community
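The data generation step could be sketched as follows. The slides give only the outline, so every sampling choice here is an assumption made for illustration: each object gets a "home" community that drives its links, normal objects draw a one-dimensional feature from their community's Gaussian, outliers draw a feature uniformly from `value_range`, and links appear with probability `p_in` within a community and `p_out` across communities. Note that `r` is passed as a fraction rather than a percentage.

```python
import random

def generate_network(M, K, r, means, stds, value_range, p_in=0.5, p_out=0.05, seed=0):
    """Return (S, W, home, outliers) for a synthetic network (illustrative only)."""
    rng = random.Random(seed)
    home = [rng.randint(1, K) for _ in range(M)]          # community used for links
    outliers = set(rng.sample(range(M), round(r * M)))    # r = fraction of outliers
    S = []
    for i in range(M):
        if i in outliers:
            S.append(rng.uniform(*value_range))            # feature ignores community
        else:
            S.append(rng.gauss(means[home[i] - 1], stds[home[i] - 1]))
    W = [[0] * M for _ in range(M)]
    for i in range(M):
        for j in range(i + 1, M):
            p = p_in if home[i] == home[j] else p_out
            if rng.random() < p:
                W[i][j] = W[j][i] = 1                      # undirected, unweighted link
    return S, W, home, sorted(outliers)

S, W, home, outliers = generate_network(
    M=100, K=5, r=0.05,
    means=[20.0, 60.0, 100.0, 140.0, 180.0], stds=[5.0] * 5,
    value_range=(0.0, 200.0),
)
```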
[Figure: bar chart comparing GLODA, DNODA, CNA, and CODA on the synthetic data for four settings (r = 1%, K = 5; r = 5%, K = 5; r = 1%, K = 8; r = 5%, K = 8); vertical axis from 0 to 0.8]
Communities
◦ data mining, artificial intelligence, database,
information analysis
Subnetwork of conferences
◦ Links: percentage of common authors between two
conferences
◦ Node features: publication titles in the conference
Subnetwork of authors
◦ Links: co-authorship relationship
◦ Node features: titles of publications by an author
Community outliers: CVPR, CIKM
Community Outliers
Community Outlier Detection
QUESTIONS
On Community Outliers and their Efficient Detection in Information Networks – Gao, Liang, Fan, Wang, Sun, Han
Outlier Detection – Irad Ben-Gal
Automated Detection of Outliers in Real-World Data – Last, Kandel
Outlier Detection for High Dimensional Data – Aggarwal, Yu