Hierarchical Exponential-Family Random Graph Models With Local Dependence
Michael Schweinberger
Department of Statistics, Pennsylvania State University, University Park, PA, USA
Mark S. Handcock
Department of Statistics, University of California, Los Angeles, CA, USA
Summary. Dependent phenomena, such as relational, spatial, and temporal phenomena,
tend to be characterized by local dependence in the sense that units which are close in a
well-defined sense are dependent. In contrast to spatial and temporal phenomena, however,
relational phenomena tend to lack a natural dependence structure in the sense that it is unknown which units are close and thus dependent. We develop here a novel class of hierarchical exponential-family models which addresses the lack of a natural dependence structure
of relational phenomena and which has important advantages. First, it respects the local
nature of relational phenomena by assuming that there is an underlying local dependence
structure, which may or may not be observed. Second, it constitutes a simple and flexible
statistical framework for modeling a wide range of relational phenomena characterized by local dependence. Third, by restricting dependence to be local, it reduces the degenerate behavior of conventional exponential-family models based on notions of Markov dependence.
We follow a Bayesian approach to hierarchical exponential-family models based on auxiliary-variable Markov chain Monte Carlo methods. We demonstrate the advantages of hierarchical
exponential-family models over conventional exponential-family models by applying them to
the network of terrorists behind the Bali bombing in 2002 as well as a classic data set.
Keywords: social networks, stochastic block models, statistical exponential families, undirected graphical models
1. Introduction
Discrete, relational data arise in the social and health sciences, biology, computer science,
and other fields (Kolaczyk, 2009). Examples are terrorist networks (e.g., Koschade, 2006),
communication and collaboration networks arising in the study of disasters (e.g., Petrescu-Prahova and Butts, 2008), and contact networks arising in the study of the spread of disease
(e.g., Jones and Handcock, 2003).
We consider here discrete, relational data which can be represented by a graph y with a
set of nodes and a set of edges (representing relationships between, e.g., animals, humans,
computers), where edges may or may not be directed and take on discrete values. In line
with convention, we assume that the set of nodes is fixed while the graph y is an outcome
of a random graph Y with sample space Y.
Discrete, relational data can be modeled by discrete exponential families of distributions
of the form
\[ P_\theta(Y = y) = \exp\left[\langle \theta, s(y) \rangle - \psi(\theta)\right], \quad y \in \mathcal{Y}, \tag{1} \]
where ⟨θ, s(y)⟩ denotes the inner product of a d-vector of natural parameters θ and a
d-vector of sufficient statistics s(y), and ψ(θ) is the log partition function given by
\[ \psi(\theta) = \log \sum_{y' \in \mathcal{Y}} \exp\left[\langle \theta, s(y') \rangle\right], \quad \theta \in \Theta, \tag{2} \]
where the natural parameter space is given by $\Theta = \{\theta \in \mathbb{R}^d : \psi(\theta) < \infty\}$. Exponential-family random graph models (ERGMs) of the form (1) were pioneered by Holland and
Leinhardt (1981); Frank and Strauss (1986); Wasserman and Pattison (1996). ERGMs are
widely used for at least two reasons. First, ERGMs are exponential families with well-known, desirable properties (Barndorff-Nielsen, 1978). Second, scientists are interested in
a wide range of dependencies, including, but not limited to, transitive closure (Wasserman
and Pattison, 1996), and ERGMs admit simple representations of such dependencies.
E-mail: michael.schweinberger@stat.psu.edu
Despite these attractive properties, many ERGMs are plagued by the so-called model
degeneracy problem: the subset of parameter values corresponding to non-degenerate distributions tends to be negligible (Strauss, 1986; Jonasson, 1999; Snijders, 2002; Handcock,
2003a,b; Park and Newman, 2005; Rinaldo et al., 2009; Butts, 2011; Schweinberger, 2011;
Chatterjee and Diaconis, 2011). The model degeneracy problem tends to obstruct Markov
chain Monte Carlo simulation of data and Monte Carlo maximum likelihood estimation
of parameters (Snijders, 2002; Handcock, 2003a,b; Rinaldo et al., 2009). In practice, the
model degeneracy problem tends to result in striking lack of fit (Snijders, 2002; Handcock,
2003a,b; Hunter et al., 2008).
Strauss (1986) was the first to point out that the model degeneracy problem is rooted in
the model, and so is its solution. To address the model degeneracy problem, Snijders et al.
(2006); Hunter and Handcock (2006) introduced curved ERGMs. While curved ERGMs
have been applied with some success (Hunter et al., 2008), curved ERGMs do not admit
simple representations of dependencies and the interpretation of parameters is challenging,
making the application of curved ERGMs restrictive from a scientific point of view.
The purpose of the present paper is two-fold. First, we argue that in the absence of a
natural dependence structure many ERGMs tend to induce strong dependence and model
degeneracy. Second, we address the lack of a natural dependence structure of ERGMs
by developing a novel class of hierarchical ERGMs with an underlying local dependence
structure, which may or may not be observed. Some important advantages are that hierarchical ERGMs (1) respect the local nature of graphs; (2) admit simple representations
of dependencies as long as dependencies are local; and (3) reduce the model degeneracy
problem.
The paper is structured as follows. Section 2 discusses the model degeneracy problem
of ERGMs. Section 3 introduces hierarchical ERGMs with local dependence and discusses
special cases of interest. Section 4 discusses Bayesian inference based on auxiliary-variable
Markov chain Monte Carlo methods. Section 5 compares ERGMs and hierarchical ERGMs
by using prior predictive checks as well as posterior predictive checks.
2. The model degeneracy problem of ERGMs
A class of ERGMs of special interest is the groundbreaking class of ERGMs with Markov
dependence due to Frank and Strauss (1986). ERGMs with Markov dependence demonstrate why ERGMs are appealing and, at the same time, give insight into the roots of the
model degeneracy problem and possible solutions. Throughout, for historical reasons and
Hierarchical ERGMs
convenience, we consider undirected, binary graphs y defined on n nodes, where the edges
yij ∈ {0, 1} satisfy the linear constraints yij = yji (all i < j) and yii = 0 (all i), and the
sample space Y corresponds to the set of undirected, binary graphs y defined on n nodes.
We note that it is straightforward to extend the developments of Sections 2–4 to directed,
binary and non-binary graphs with finite sample spaces.
2.1. ERGMs with Markov dependence
Motivated by the nearest neighbor definition in statistical physics (Ising, 1925) and spatial statistics (Besag, 1974), Frank and Strauss (1986) called two dyads {i, j} and {k, l}
neighbors if {i, j} and {k, l} share a node and assumed that, if {i, j} and {k, l} are not
neighbors, then Yij and Ykl are independent conditional on the rest of random graph Y. By
the Hammersley-Clifford theorem (Besag, 1974; Frank and Strauss, 1986), the probability
mass function (PMF) of random graph Y can be written as
\[ P_\theta(Y = y) = \exp\left[\theta_1 s_1(y) + \sum_{k=2}^{n-1} \theta_k s_k(y) + \theta_n s_n(y) - \psi(\theta)\right], \tag{3} \]
where $s_1(y) = \sum_{i<j} y_{ij}$ is the number of edges, $s_k(y) = \sum_i \sum_{j_1 < \cdots < j_k} y_{ij_1} \cdots y_{ij_k}$ is the number of k-stars, and $s_n(y) = \sum_{i<j<k} y_{ij} y_{jk} y_{ik}$ is the number of triangles.
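The sufficient statistics in (3) are simple subgraph counts. As a minimal sketch (the function name `markov_statistics` and the adjacency-matrix input are illustrative assumptions, not part of the paper), they can be computed for a small undirected graph as follows; the k-star count uses the fact that a node with d incident edges is the centre of "d choose k" k-stars.

```python
from itertools import combinations
from math import comb

def markov_statistics(adj):
    """Return (edges, {k: number of k-stars}, triangles) for a symmetric 0/1 adjacency matrix."""
    n = len(adj)
    degrees = [sum(row) for row in adj]
    edges = sum(degrees) // 2
    # A k-star centred at node i is a choice of k of the edges incident to i.
    k_stars = {k: sum(comb(d, k) for d in degrees) for k in range(2, n)}
    triangles = sum(adj[i][j] * adj[j][k] * adj[i][k]
                    for i, j, k in combinations(range(n), 3))
    return edges, k_stars, triangles

# Hypothetical example: a triangle on nodes 0, 1, 2 plus a pendant edge between 2 and 3.
adj = [[0, 1, 1, 0],
       [1, 0, 1, 0],
       [1, 1, 0, 1],
       [0, 0, 1, 0]]
edges, k_stars, triangles = markov_statistics(adj)
print(edges, k_stars, triangles)
```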
2.2. Properties of ERGMs with Markov dependence
ERGMs with Markov dependence possess both appealing and problematic properties.
ERGMs with Markov dependence are appealing from a statistical point of view, being
exponential-family models (Barndorff-Nielsen, 1978) and undirected graphical models (Lauritzen, 1996), motivated by and related to models in spatial statistics (Besag, 1974). At
the same time, ERGMs with Markov dependence are appealing from a scientific point of
view, because scientists have long considered k-stars and triangles—the sufficient statistics
of ERGMs with Markov dependence—to be fundamental functions of graphs (Wasserman
and Pattison, 1996).
However, the neighborhood assumption underlying ERGMs with Markov dependence is
problematic (Strauss, 1986): for any given dyad {i, j}, the number of neighbors is 2(n − 2)
and thus increases with n. The large and growing neighborhoods indicate that ERGMs with
Markov dependence, while inspired by the Ising model in statistical physics (Ising, 1925)
and its relatives in spatial statistics (Besag, 1974), resemble the so-called “unphysical”
mean-field Ising model with large and growing neighborhoods (Baxter, 2007, Chapter 3)
rather than the classic Ising model with small and bounded neighborhoods (Ising, 1925).
The comparison with the “unphysical” mean-field Ising model suggests that ERGMs with
Markov dependence may induce strong dyad-dependence and may be problematic provided
n is large.
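The neighbor count 2(n − 2) can be checked by brute force. The sketch below (the helper `count_neighbor_dyads` is a hypothetical name) enumerates all dyads sharing exactly one node with a given dyad, which is the Frank and Strauss (1986) neighbor relation described above.

```python
from itertools import combinations

def count_neighbor_dyads(n, dyad=(0, 1)):
    """Count dyads {k, l} that share exactly one node with the given dyad."""
    target = set(dyad)
    return sum(1 for other in combinations(range(n), 2)
               if len(target & set(other)) == 1)

# The count grows linearly in n, in contrast to the bounded neighborhoods of a lattice.
counts = {n: count_neighbor_dyads(n) for n in (5, 17, 100)}
print(counts)
```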
We show in Section 2.3 that ERGMs with large and growing neighborhoods indeed tend
to be problematic and discuss implications in terms of statistical inference in Section 2.4.
2.3. Model degeneracy
To demonstrate that ERGMs with large and growing neighborhoods tend to be problematic,
we consider ERGMs of the form
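\[ P_\theta(Y_n = y_n) = \exp\left[\theta_1 s_1(y_n) + \theta_2 s_2(y_n) - \psi_n(\theta)\right], \quad y_n \in \mathcal{Y}_n, \tag{4} \]
where
\[ \psi_n(\theta) = \log \sum_{y_n' \in \mathcal{Y}_n} \exp\left[\theta_1 s_1(y_n') + \theta_2 s_2(y_n')\right], \quad \theta \in \Theta, \tag{5} \]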
where the subscript n acknowledges the dependence of the random graph $Y_n$ and the sample space $\mathcal{Y}_n$ on n. The natural
parameter space Θ is given by $\Theta = \mathbb{R}^2$, because the sample space $\mathcal{Y}_n$ is finite (Barndorff-Nielsen, 1978, pp. 115–116). Let $\mu_n : \Theta \to \text{int}(C_n)$ be the vector of mean-value parameters
defined by $\mu_n(\theta) = E_\theta[s(Y_n)]$, where $\text{int}(C_n)$ denotes the interior of the convex hull of
$\{s(y_n) : y_n \in \mathcal{Y}_n\}$ (Barndorff-Nielsen, 1978, p. 121).
We consider here sufficient statistics of the form $s_1(y_n) = \sum_{i<j} y_{ij}$ and $s_2(y_n) = \sum_{i<j} y_{ij} f_{ij}(y_n)$, where $f_{ij} : \mathcal{Y}_n \to \mathbb{R}$. A prominent example is given by the ERGM with the number of edges and triangles, which is a special case of (3) and (4) with $f_{ij}(y_n) = \sum_{k:\, k>j>i} y_{ik} y_{jk}$. We assume $f_{ij}(y_n) = 0$ for n = 1, 2 and $f_{ij}(y_n) \geq 0$ for all n > 2, which
covers the number of triangles as well as other statistics based on counts of subgraphs of
size k ≥ 3. A sequence of graphs y1 , y2 , . . . is called monotone if y1 ∈ Y1 and, for any
n > 1, yn ∈ Yn is obtained from yn−1 ∈ Yn−1 by adding one node and up to n − 1 edges
between the n − 1 existing nodes and the additional node.
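Proposition. If there exists a monotone sequence of graphs y1 , y2 , . . . and, for any
constant C > 0, however large, there exists a constant $n_C > 1$ such that
\[ \frac{s_2(y_n) - s_2(y_{n-1})}{n - 1} > C \quad \text{for all } n > n_C, \tag{6} \]
then, given any θ ∈ {θ ∈ Θ : θ2 < 0},
\[ \frac{\mu_{2n}(\theta)}{\sup_{\theta \in \Theta} \mu_{2n}(\theta)} \longrightarrow 0 \quad \text{as } n \longrightarrow \infty \tag{7} \]
and, given any θ ∈ {θ ∈ Θ : θ2 > 0},
\[ \frac{\mu_{2n}(\theta)}{\sup_{\theta \in \Theta} \mu_{2n}(\theta)} \longrightarrow 1 \quad \text{as } n \longrightarrow \infty. \tag{8} \]
✷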
A proof of the proposition can be found in the appendix. The problem with ERGMs
with large and growing neighborhoods is that the growth of the sufficient statistic s2 (yn )
outpaces the growth of the number of possible edges. As n increases, sequences of graphs
which are extreme in terms of s2 (yn ) are allowed to amass more and more probability mass
and dominate all others in terms of probability mass. An example is given by the ERGM
with the number of edges and triangles: the sequence of complete graphs y1 , y2 , . . . , where
all possible edges are present, implies $s_2(y_n) - s_2(y_{n-1}) = \binom{n-1}{2}$ for all n > 2, which
is quadratic rather than linear in n. As n increases, graphs with at least a fraction ε of
triangles, where ε ∈ (0, 1) is arbitrary, attract either less and less probability mass (provided
θ2 < 0) or more and more probability mass (provided θ2 > 0), pushing the mean-value
parameter to the boundary of the mean-value parameter space. As a result, if n is large,
then the subset of the natural parameter space mapping to mean-value parameters which
are not close to the boundary of the mean-value parameter space tends to be negligible.
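The complete-graph example can be checked numerically. The sketch below (the function name is an illustrative assumption) computes the triangle increment for complete graphs and the ratio appearing in condition (6), which equals (n − 2)/2 and therefore exceeds any fixed constant C once n is large enough.

```python
from math import comb

def triangle_increment_complete(n):
    """s2(y_n) - s2(y_{n-1}) when y_n is the complete graph and s2 counts triangles."""
    return comb(n, 3) - comb(n - 1, 3)

# The increment equals C(n-1, 2); dividing by n - 1 gives the left-hand side of (6).
ratios = {n: triangle_increment_complete(n) / (n - 1) for n in (5, 17, 100)}
print(ratios)
```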
Figure 1 demonstrates how negligible the viable subset of the natural parameter space
is when n is as small as 17 and the sufficient statistics s1 (yn ) and s2 (yn ) are given by the
number of edges and triangles, respectively. It plots Markov chain Monte Carlo sample estimates of mean-value parameters µ1n (θ)/ supθ∈Θ µ1n (θ) and µ2n (θ)/ supθ∈Θ µ2n (θ) against
natural parameter θ2 , where natural parameter θ1 is fixed at the maximum likelihood estimate θ̂1 = −0.147 of θ1 under the ERGM restricted by θ2 = 0, which was estimated from
the terrorist network described in Section 5.2.1. The Markov chain Monte Carlo sample
estimates were evaluated at 201 points in the interval [−5, 5]: at every one of the 201 points,
a Markov chain was started at the network of n = 17 terrorists with a burn-in of 100,000
iterations and a post-burn-in of 100 million iterations, saving every 10,000-th post-burn-in
draw. As expected, the mean-value parameter µ2n (θ) is close to its infimum (provided
θ2 < 0) or supremum (provided θ2 > 0). It is worth noting that the proposition says nothing
about the behavior of the mean-value parameter µ1n (θ), though the number of edges s1 (yn )
and the number of triangles s2 (yn ) are dependent and thus the pathological behavior of
µ2n (θ) tends to be reflected in pathological behavior of µ1n (θ), as Figure 1 demonstrates.
[Figure 1: two panels, each with horizontal axis θ2 (ticks from −4 to 4) and vertical axis from 0.0 to 1.0; left panel µ1 (θ)/ sup µ1 (θ), right panel µ2 (θ)/ sup µ2 (θ).]
Fig. 1. Markov chain Monte Carlo sample estimates of mean-value parameters µ1n (θ)/ supθ∈Θ µ1n (θ) (left) and µ2n (θ)/ supθ∈Θ µ2n (θ) (right) plotted against natural parameter θ2 , where natural parameter θ1 is given by θ1 = −0.147
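The qualitative behavior described above can be reproduced on a much smaller computational budget. The following is a rough sketch, not the simulation behind Figure 1: it assumes a single-edge-toggle Metropolis sampler, starts from the empty graph rather than the observed terrorist network, and runs only a few thousand iterations; the function name and all settings other than n = 17 and θ1 = −0.147 are illustrative assumptions.

```python
import math
import random
from itertools import combinations

def simulate_edge_triangle(n, theta1, theta2, steps, seed=0):
    """Estimate E[s1]/max(s1) and E[s2]/max(s2) for the edge-triangle ERGM by Metropolis sampling."""
    rng = random.Random(seed)
    nbrs = [set() for _ in range(n)]          # adjacency sets; the chain starts at the empty graph
    dyads = list(combinations(range(n), 2))
    edges = triangles = 0
    edge_sum = tri_sum = kept = 0
    for step in range(steps):
        i, j = rng.choice(dyads)              # symmetric proposal: toggle one dyad
        common = len(nbrs[i] & nbrs[j])       # triangles created or destroyed by the toggle
        sign = -1 if j in nbrs[i] else 1
        log_ratio = sign * (theta1 + theta2 * common)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            if sign == 1:
                nbrs[i].add(j)
                nbrs[j].add(i)
            else:
                nbrs[i].discard(j)
                nbrs[j].discard(i)
            edges += sign
            triangles += sign * common
        if step >= steps // 2:                # discard the first half as burn-in
            edge_sum += edges
            tri_sum += triangles
            kept += 1
    return edge_sum / (kept * math.comb(n, 2)), tri_sum / (kept * math.comb(n, 3))

# Scaled mean-value estimates at three values of theta2, with theta1 fixed at -0.147.
results = {t2: simulate_edge_triangle(17, -0.147, t2, 20000) for t2 in (-2.0, 0.0, 2.0)}
for t2, (edge_frac, tri_frac) in results.items():
    print(t2, round(edge_frac, 3), round(tri_frac, 3))
```

Under these assumptions the scaled triangle mean sits near 0 for negative θ2 and near 1 for positive θ2, with the edge mean dragged along in the positive case.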
In the negligible subset of the natural parameter space corresponding to non-degenerate
distributions, distributions tend to resemble two-component mixture distributions, where
one component distribution corresponds to the distribution indexed by θ1 and θ2 = 0—
under which edges are i.i.d. Bernoulli random variables—and the other component distribution corresponds to a near-degenerate distribution. In the special case of the ERGM
with the number of edges and triangles, that was first shown by Jonasson (1999) and complemented by Park and Newman (2005); Chatterjee and Diaconis (2011); see in addition
Snijders (2002); Handcock (2003a,b); Hunter et al. (2008); Rinaldo et al. (2009); Butts
(2011); Schweinberger (2011).
In conclusion, ERGMs with large and growing neighborhoods tend to place hardly any
probability mass on graphs which resemble real-world graphs.
2.4. Implications of model degeneracy in terms of statistical inference
Degenerate ERGMs tend to be problematic in terms of statistical inference.
Maximum likelihood estimates of natural parameter vector θ cannot be obtained by
direct maximization of the log likelihood function, because the log likelihood function of
many ERGMs is intractable (e.g., Frank and Strauss, 1986). A widely used approach is
to obtain Monte Carlo maximum likelihood estimates of θ by maximizing a Monte Carlo
approximation of the log likelihood function based on a Monte Carlo sample of graphs
(Geyer and Thompson, 1992; Handcock, 2003a,b; Hunter and Handcock, 2006). Suppose
that the observed value of the vector of sufficient statistics s(yn ) is in the interior of the
convex hull of {s(yn ) : yn ∈ Yn }, implying that the maximum likelihood estimate of θ exists
and is unique (Barndorff-Nielsen, 1978, p. 151). Let Sn ⊂ Yn be a subset of graphs generated
by Monte Carlo methods under a starting value of θ. The negligible subset of the natural
parameter space corresponding to non-degenerate distributions suggests that finding good
starting values of θ is hard and in practice many starting values generate samples Sn close
to the boundary of the convex hull of {s(yn ) : yn ∈ Yn }. As a result, the observed value
of s(yn ) may not be in the interior of the convex hull of {s(yn ) : yn ∈ Sn } and thus the
Monte Carlo maximum likelihood estimate of θ may not exist even though the maximum
likelihood estimate of θ does exist (Handcock, 2003a,b; Rinaldo et al., 2009). In practice,
non-existence of Monte Carlo maximum likelihood estimates results in computational failure
and computational failure has been observed in a wide range of applications (e.g., Handcock,
2003a,b; Rinaldo et al., 2009).
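The non-existence problem can be illustrated with the standard importance-sampling approximation of the log likelihood ratio, ℓ(θ) − ℓ(θ0) ≈ ⟨θ − θ0, s(y)⟩ − log[(1/m) Σ exp⟨θ − θ0, s(yi)⟩], where y1, . . . , ym are sampled under θ0. The sketch below uses a made-up one-dimensional sample of statistics (the values 1, 2, 3 and the function name are assumptions for illustration only): when the observed statistic lies outside the convex hull of the sampled statistics, the approximation increases without bound and has no maximizer.

```python
import math

def mc_loglik_ratio(theta, theta0, s_obs, sampled_stats):
    """Monte Carlo approximation of l(theta) - l(theta0) from statistics sampled under theta0."""
    delta = [t - t0 for t, t0 in zip(theta, theta0)]
    dots = [sum(d * s for d, s in zip(delta, stat)) for stat in sampled_stats]
    m = max(dots)                               # log-sum-exp shift for numerical stability
    log_mean = m + math.log(sum(math.exp(x - m) for x in dots) / len(dots))
    return sum(d * s for d, s in zip(delta, s_obs)) - log_mean

samples = [(1.0,), (2.0,), (3.0,)]              # hypothetical sampled statistics, hull = [1, 3]
grid = (1.0, 5.0, 25.0)
inside = [mc_loglik_ratio((t,), (0.0,), (2.0,), samples) for t in grid]   # observed value in hull
outside = [mc_loglik_ratio((t,), (0.0,), (5.0,), samples) for t in grid]  # observed value outside
print(inside)
print(outside)
```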
A Bayesian approach (Koskinen et al., 2010; Caimo and Friel, 2011) with proper priors
ensures proper posteriors, but fails to address one of the most important issues which the
model degeneracy problem raises. The model degeneracy problem is rooted in the family of
distributions {Pθ , θ ∈ Θ} and, if a family of distributions includes no member which places
much probability mass on graphs resembling real-world graphs, then neither a Bayesian
approach nor any other approach to statistical inference can produce it.
In practice, no matter which approach to statistical inference is adopted, the model
degeneracy problem tends to result in striking lack of fit (Snijders, 2002; Handcock, 2003a,b;
Hunter et al., 2008).
2.5. Conclusions
The most important conclusion is that the neighborhood assumption underlying ERGMs
with Markov dependence is problematic and that ERGMs with Markov dependence tend
to be degenerate provided n is large. Therefore, the application of ERGMs with Markov
dependence to large graphs is not advisable. It is debatable what “large” means, but
simulations (e.g., Handcock, 2003a; Rinaldo et al., 2009) suggest that ERGMs with Markov
dependence should not be applied to graphs with n ≫ 10 nodes and $\binom{n}{2}$ ≫ 45 possible
edges.
3. Hierarchical ERGMs
We develop a novel class of hierarchical ERGMs motivated by two observations:
• In a wide range of applications in the social and health sciences and biology, it is
believed that the expected numbers of edges of nodes are either bounded or grow only
slowly as a function of the number of nodes n, implying that graphs tend to be sparse
(e.g., Jonasson, 1999; Krivitsky et al., 2011). Therefore, dependence tends to be local
in the sense that dependence is limited to small subsets of edges (e.g., Pattison and
Robins, 2002). If there is uncertainty about which subsets of edges are dependent, it
makes sense to express the uncertainty by specifying a family of distributions on the
set of possible dependence structures.
• By restricting dependence to subgraphs, we (1) respect the sparse and local nature of
graphs; (2) admit simple representations of dependencies as long as dependencies are
local; and (3) reduce the model degeneracy of ERGMs.
We introduce hierarchical ERGMs in Section 3.1, describe priors in Section 3.2, and
discuss special cases of interest in Section 3.3.
3.1. Model
The class of hierarchical ERGMs introduced here is based on two fundamental assumptions.
It is worth noting that, in line with convention, we consider the set of nodes to be fixed and
the graph to be random.
The first assumption states that there is an underlying local neighborhood structure.
Assumption 1: local neighborhood structure. The set of nodes is partitioned into
K local neighborhoods, indexed by integers 1, . . . , K. The memberships of nodes to local
neighborhoods are governed by
\[ X_i \mid \pi_1, \ldots, \pi_K \overset{\text{iid}}{\sim} \text{Multinomial}(1; \pi_1, \ldots, \pi_K), \quad i = 1, \ldots, n, \tag{9} \]
where Xi denotes the vector of membership indicators of node i.✷
The membership indicators X = (X1 , . . . , Xn ) induce a partition of the set of nodes N
into subsets N1 , . . . , NK and a partition of the set of edge variables Y = {Yij : i ∈ N, j ∈ N}
into subsets Y(kl) = {Yij : i ∈ Nk , j ∈ Nl }.
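As a minimal sketch of Assumption 1 and the induced partition, the code below samples memberships from a multinomial distribution and groups the dyads of an undirected graph into within-neighborhood blocks (k, k) and between-neighborhood blocks (k, l) with k < l. The function names, the 0-based neighborhood labels, and the probabilities (0.5, 0.3, 0.2) are illustrative assumptions.

```python
import random
from itertools import combinations

def sample_memberships(n, pi, seed=0):
    """Draw one neighborhood label per node, i.i.d. with probabilities pi."""
    rng = random.Random(seed)
    return rng.choices(range(len(pi)), weights=pi, k=n)

def partition_dyads(z):
    """Group dyads {i, j} by the unordered pair of neighborhood labels of i and j."""
    blocks = {}
    for i, j in combinations(range(len(z)), 2):
        key = (min(z[i], z[j]), max(z[i], z[j]))
        blocks.setdefault(key, []).append((i, j))
    return blocks

z = sample_memberships(n=8, pi=[0.5, 0.3, 0.2])
blocks = partition_dyads(z)
for (k, l), dyads in sorted(blocks.items()):
    kind = "within" if k == l else "between"
    print(kind, (k, l), len(dyads))
```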
The second assumption states that, conditional on the local neighborhood structure,
edges within local neighborhoods are dependent, while edges between local neighborhoods
are independent.
Assumption 2: local dyad-dependence, global dyad-independence. The conditional PMF of random graph Y given local neighborhood structure X can be factorized
into within- and between-neighborhood PMFs:
\[ P_\theta(Y = y \mid X = x) = \prod_{k=1}^{K} P_\theta(Y_{(kk)} = y_{(kk)} \mid X = x) \times \prod_{k<l}^{K} P_\theta(Y_{(kl)} = y_{(kl)} \mid X = x). \tag{10} \]
We assume that the between-neighborhood PMFs can be factorized into dyad-bound PMFs:
\[ P_\theta(Y_{(kl)} = y_{(kl)} \mid X = x) = \prod_{i \in N_k,\, j \in N_l} P_\theta(Y_{ij} = y_{ij} \mid X = x), \tag{11} \]
while the within-neighborhood PMFs are not assumed to be factorizable.✷
Remark: local dependence. The restriction of dependence to local neighborhoods
serves to respect the sparse and local nature of graphs on the one hand and to reduce
the model degeneracy of ERGMs on the other hand. An important advantage is that
dependence, such as transitive closure, is admissible within local neighborhoods.
Remark: local neighborhood structure. Suitable local neighborhood structure
may or may not be observed.
If suitable local neighborhood structure is observed, then it should be used. However,
the emphasis is on suitable local neighborhood structure. If the observed number of local
neighborhoods K is small relative to the number of nodes n and thus some local neighborhoods are large, then the improvement in goodness of fit relative to ERGMs may be small
and thus the observed local neighborhood structure may not be useful.
If no suitable local neighborhood structure is observed, then the uncertainty about the
unknown number of local neighborhoods K needs to be addressed. One possible approach
is to express the uncertainty about K by specifying a prior for K (e.g., Richardson and
Green, 1997). An alternative approach is based on non-parametric priors (e.g., Ferguson,
1973). We follow a non-parametric approach, which we describe in Section 3.2. It is worth
noting that, while the number of local neighborhoods K needs to be large so that the
local neighborhoods can be small, there is no need to impose strong prior restrictions on
the size of the local neighborhoods. If, for a given local neighborhood structure x and
observed graph y, the conditional probability Pθ (Y = y | X = x) is negligible under all
possible values of θ (e.g., when all nodes are members of the same local neighborhood and
Pθ (Y = y | X = x) is near-degenerate), then the marginal posterior probability of x given
y tends to be negligible relative to other values of X which make more sense in light of y.
Indeed, in our experience, the marginal posterior probability of problematic
local neighborhood structures (e.g., with local neighborhoods that are too large) relative to less
problematic local neighborhood structures tends to be negligible. We present two examples
in Section 5.2.
Remark: parameterizations. Exponential parameterizations of the between- and
within-neighborhood PMFs are convenient, though other parameterizations may be used as
well. The between-neighborhood PMFs can be written as
\[ P_\theta(Y_{ij} = y_{ij} \mid X = x) = \exp\left[\langle \theta_{B,ij}, s_B(y_{ij}) \rangle - \psi_{B,ij}(\theta_{B,ij})\right], \tag{12} \]
where $s_B(y_{ij})$ is a vector of between-neighborhood sufficient statistics, $\theta_{B,ij}$ is a vector of
between-neighborhood natural parameters, and $\psi_{B,ij}(\theta_{B,ij})$ is the between-neighborhood
log partition function,
\[ \psi_{B,ij}(\theta_{B,ij}) = \log \sum_{y_{ij}' \in \{0,1\}} \exp\left[\langle \theta_{B,ij}, s_B(y_{ij}') \rangle\right]. \tag{13} \]
The between-neighborhood sufficient statistics $s_B(y_{ij})$ may be functions of edges $y_{ij}$ and covariates. It is worth noting that the exponential parameterization of the between-neighborhood
PMFs is equivalent to a logit model with linear predictor $\langle \theta_{B,ij}, s_B(1) - s_B(0) \rangle$.
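The equivalence with a logit model follows in one step from (12) and (13). As a short worked derivation using only the symbols defined above:
\[ P_\theta(Y_{ij} = 1 \mid X = x) = \frac{\exp\langle \theta_{B,ij}, s_B(1) \rangle}{\exp\langle \theta_{B,ij}, s_B(0) \rangle + \exp\langle \theta_{B,ij}, s_B(1) \rangle} = \frac{1}{1 + \exp\left[-\langle \theta_{B,ij}, s_B(1) - s_B(0) \rangle\right]}, \]
so that
\[ \log \frac{P_\theta(Y_{ij} = 1 \mid X = x)}{P_\theta(Y_{ij} = 0 \mid X = x)} = \langle \theta_{B,ij}, s_B(1) - s_B(0) \rangle. \]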
The within-neighborhood PMFs can be written as
\[ P_\theta(Y_{(kk)} = y_{(kk)} \mid X = x) = \exp\left[\langle \theta_{W,k}, s_W(y_{(kk)}) \rangle - \psi_{W,k}(\theta_{W,k})\right], \tag{14} \]
where $s_W(y_{(kk)})$ is a vector of within-neighborhood sufficient statistics, $\theta_{W,k}$ is a vector of
within-neighborhood natural parameters, and $\psi_{W,k}(\theta_{W,k})$ is the within-neighborhood log
partition function,
\[ \psi_{W,k}(\theta_{W,k}) = \log \sum_{y_{(kk)}' \in \mathcal{Y}_{(kk)}} \exp\left[\langle \theta_{W,k}, s_W(y_{(kk)}') \rangle\right], \tag{15} \]
where Y(kk) is the sample space of y(kk) . The within-neighborhood sufficient statistics
sW (y(kk) ) may include interactions, such as the number of triangles within local neighborhood k, and functions of covariates.
The exponential parameterization of the between- and within-neighborhood PMFs implies that the conditional PMF of Y given X can be written as
\[ P_\theta(Y = y \mid X = x) = \exp\left[\langle \eta(\theta), s(y) \rangle - \psi(\theta)\right], \tag{16} \]
where the vector of parameters η(θ) is a linear function of the vectors of between- and
within-neighborhood parameters, the vector of sufficient statistics s(y) is a linear function
of between- and within-neighborhood vectors of sufficient statistics, and the log partition
function ψ(θ) is given by
\[ \psi(\theta) = \sum_{k<l}^{K} \sum_{i \in N_k,\, j \in N_l} \psi_{B,ij}(\theta_{B,ij}) + \sum_{k=1}^{K} \psi_{W,k}(\theta_{W,k}). \tag{17} \]
Remark: parameter constraints. In the interest of model parsimony, it is sometimes
desirable to constrain parameters. We consider here the constraints θB,ij = θB (all i < j) on
the between-neighborhood parameter vectors θB,ij , which are of secondary interest. The
within-neighborhood parameter vectors θW,k , which govern the dependence within local
neighborhoods and are thus of primary interest, are left unconstrained.
3.2. Prior
The class of hierarchical ERGMs introduced in Section 3.1 aims to reduce model degeneracy
and improve goodness of fit relative to ERGMs. To accomplish that, the local neighborhoods must be small and thus the number of local neighborhoods large. We consider here
a non-parametric approach based on stick-breaking priors (Ishwaran and James, 2001),
which allows the number of non-empty local neighborhoods a posteriori to be large, while
encouraging it a priori to be small.
Suppose that there is an infinite number of local neighborhoods and that nodes belong
to local neighborhood k = 1, 2, . . . with probability πk , k = 1, 2, . . . , where
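\[ \pi_1 = V_1, \tag{18} \]
\[ \pi_k = V_k \prod_{j=1}^{k-1} (1 - V_j), \quad k = 2, 3, \ldots, \tag{19} \]
where
\[ V_k \mid \alpha \overset{\text{iid}}{\sim} \text{Beta}(1, \alpha), \quad k = 1, 2, \ldots, \tag{20} \]
where α > 0 is a parameter and $\sum_{k=1}^{\infty} \pi_k = 1$ with probability 1 (Ishwaran and James,
2001).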
The between- and within-neighborhood parameter vectors θB and θW,k index exponential families and therefore conjugate priors exist (Diaconis and Ylvisaker, 1979), though
direct sampling from the resulting full conditional distributions is infeasible. In the absence
of computational advantages, multivariate Gaussian priors are convenient alternatives:
\[ \begin{aligned} \theta_B \mid \mu_B, \Sigma_B^{-1} &\sim \text{MVN}(\mu_B, \Sigma_B^{-1}), \\ \theta_{W,k} \mid \mu_W, \Sigma_W^{-1} &\overset{\text{iid}}{\sim} \text{MVN}(\mu_W, \Sigma_W^{-1}), \quad k = 1, 2, \ldots, \end{aligned} \tag{21} \]
where $\mu_B$ and $\mu_W$ are mean parameter vectors and $\Sigma_B^{-1}$ and $\Sigma_W^{-1}$ are precision matrices of
suitable order.
Last, to acknowledge the uncertainty about the critical hyper-parameters α, $\mu_W$, and
$\Sigma_W^{-1}$, we assign conjugate Gamma, multivariate Gaussian, and Wishart hyper-priors to α,
$\mu_W$, and $\Sigma_W^{-1}$, respectively.
3.3. Special cases
Special cases of interest are the (stochastic) block models of Wang and Wong (1987); Strauss
and Ikeda (1990); Nowicki and Snijders (2001) and the related models of Handcock et al.
(2007); Airoldi et al. (2008); Koskinen (2009).
Wang and Wong (1987) assumed that there is a known partition of the set of nodes
and that the conditional PMF of Y given the partition can be factorized into dyad-bound
PMFs. Nowicki and Snijders (2001) dropped the assumption that the partition is known,
but kept the assumption that the conditional PMF of Y can be factorized into dyad-bound
PMFs. Handcock et al. (2007); Airoldi et al. (2008) introduced more general models than
Nowicki and Snijders (2001), while retaining the assumption that the conditional PMF of Y
can be factorized into dyad-bound PMFs. All of these models assume that the conditional
PMF of Y can be factorized into dyad-bound PMFs, which makes direct modeling of a wide
range of dependencies—including, but not limited to, transitive closure—impossible.
Strauss and Ikeda (1990) assumed that the partition is known and that edges within
observed local neighborhoods are governed by ERGMs with Markov dependence. While the
models of Strauss and Ikeda (1990) admit dependence within observed local neighborhoods,
the usefulness of the models is limited, because in most applications suitable local neighborhood structure—suitable in the sense that the number of local neighborhoods is large and
the local neighborhoods are small—is not observed. Last, Koskinen (2009) assumed that
the partition is unknown and that the conditional PMF of Y does not factorize. However,
Koskinen (2009) attempted to capture unobserved heterogeneity rather than to address the
model degeneracy problem.
4. Bayesian inference
We follow a Bayesian approach to hierarchical ERGMs. A Bayesian approach to hierarchical
ERGMs must overcome multiple obstacles. The most serious obstacle is the fact that with
positive probability one or more local neighborhoods k contain $n_k \gg 5$ nodes and thus
one or more within-neighborhood log partition functions, which are log sums of $\exp\left[\binom{n_k}{2} \log 2\right]$
terms (see (15)), are intractable. To facilitate posterior computations, we approximate the
prior and augment the posterior.
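To make the intractability concrete, the sketch below evaluates a within-neighborhood log partition function of the form (15) by brute-force enumeration. It assumes, purely for illustration, that the within-neighborhood sufficient statistics are the numbers of edges and triangles; the function name and parameter values are hypothetical. Enumeration is feasible only for very small neighborhoods, since the number of terms is 2 raised to the number of dyads.

```python
import math
from itertools import combinations, product

def log_partition_within(n_k, theta_edges, theta_triangles):
    """Brute-force log partition function over all undirected graphs on n_k nodes."""
    dyads = list(combinations(range(n_k), 2))
    index = {d: t for t, d in enumerate(dyads)}
    triples = [(index[(a, b)], index[(a, c)], index[(b, c)])
               for a, b, c in combinations(range(n_k), 3)]
    log_terms = []
    for y in product((0, 1), repeat=len(dyads)):     # 2 ** C(n_k, 2) graphs
        edges = sum(y)
        triangles = sum(y[p] * y[q] * y[r] for p, q, r in triples)
        log_terms.append(theta_edges * edges + theta_triangles * triangles)
    m = max(log_terms)                               # log-sum-exp shift for stability
    return m + math.log(sum(math.exp(t - m) for t in log_terms))

psi = log_partition_within(4, -0.5, 0.3)             # 64 terms, instantaneous
n_terms = {n_k: 2 ** math.comb(n_k, 2) for n_k in (4, 5, 10)}
print(psi, n_terms)
```

With the triangle parameter set to zero the edges are independent, so the result can be checked against the closed form C(n_k, 2) · log(1 + exp(θ)).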
We describe the approximation of the prior in Section 4.1, discuss the augmentation of
the posterior and sampling from the augmented posterior in Section 4.2, and address the
non-identifiability of within-neighborhood parameter vectors and membership indicators in
Section 4.3.
4.1. Prior truncation
The stick-breaking prior of Section 3.2 can be approximated by a truncated stick-breaking
prior along the lines of Ishwaran and James (2001), which facilitates posterior computations.
We choose a maximum number of local neighborhoods, denoted by Kmax . Some general
advice concerning the choice of Kmax is given by Ishwaran and James (2001). We are
here more concerned with the goodness of fit of the model than the approximation of the
stick-breaking prior and choose Kmax accordingly. In practice, we choose Kmax by either
I. trying out multiple values of Kmax and comparing the goodness of fit of the model; II.
exploiting on-the-ground knowledge; or III. setting Kmax = n. We demonstrate strategies
I and II in Section 5.2.
Given Kmax , the membership probabilities π = (π1 , . . . , πKmax ) are constructed by
truncated stick-breaking (Ishwaran and James, 2001):
\[ \pi_1 = V_1, \tag{22} \]
\[ \pi_k = V_k \prod_{j=1}^{k-1} (1 - V_j), \quad k = 2, \ldots, K_{\max}, \tag{23} \]
where
\[ V_k \mid \alpha \overset{\text{iid}}{\sim} \text{Beta}(1, \alpha), \quad k = 1, \ldots, K_{\max} - 1, \qquad V_{K_{\max}} = 1, \tag{24} \]
where α > 0 is a parameter and $V_{K_{\max}} = 1$ ensures $\sum_{k=1}^{K_{\max}} \pi_k = 1$. The truncated stick-breaking construction of π implies that π is generalized Dirichlet distributed, which is
conjugate to multinomial sampling (Ishwaran and James, 2001).
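The truncated construction (22)–(24) translates directly into a few lines of code. The following sketch draws one vector of membership probabilities; the function name and the values α = 2 and Kmax = 10 are illustrative assumptions.

```python
import random

def truncated_stick_breaking(alpha, k_max, seed=0):
    """Draw (pi_1, ..., pi_Kmax) by truncated stick-breaking with V_k ~ Beta(1, alpha)."""
    rng = random.Random(seed)
    v = [rng.betavariate(1.0, alpha) for _ in range(k_max - 1)] + [1.0]  # V_Kmax = 1
    pi, remaining = [], 1.0
    for v_k in v:
        pi.append(v_k * remaining)     # pi_k = V_k * prod_{j<k} (1 - V_j)
        remaining *= 1.0 - v_k         # length of stick left after breaking off pi_k
    return pi

pi = truncated_stick_breaking(alpha=2.0, k_max=10)
print([round(p, 3) for p in pi], sum(pi))
```

Setting the last stick-breaking proportion to 1 assigns all remaining mass to the final neighborhood, which is what makes the probabilities sum to 1.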
The (hyper)priors of α, $\mu_W$, $\Sigma_W^{-1}$, $\theta_{W,1}, \ldots, \theta_{W,K_{\max}}$, and $\theta_B$ are equivalent to the
(hyper)priors described in Section 3.2.
4.2. Posterior augmentation
Under the truncated prior described in Section 4.1, the posterior is of the form
\[ p(\alpha, \mu_W, \Sigma_W^{-1}, \pi, \theta_B, \theta_W, x \mid y) \propto p(\alpha, \mu_W, \Sigma_W^{-1}, \pi, \theta_B, \theta_W) \times P_\pi(X = x)\, P_\theta(Y = y \mid X = x), \tag{25} \]
where the truncated prior is of the form
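\[ p(\alpha, \mu_W, \Sigma_W^{-1}, \pi, \theta_B, \theta_W) = p(\alpha)\, p(\mu_W)\, p(\Sigma_W^{-1})\, p(\pi \mid \alpha)\, p(\theta_B) \times \prod_{k=1}^{K_{\max}} p(\theta_{W,k} \mid \mu_W, \Sigma_W^{-1}), \tag{26} \]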
where θW = (θW,1 , . . . , θW,Kmax ) denotes the within-neighborhood parameter vectors.
Owing to the fact that the conditional PMF of Y is not, in general, tractable, the
posterior is doubly intractable, implying that standard Markov chain Monte Carlo methods
(e.g., Metropolis-Hastings) cannot be used to sample from the posterior. Auxiliary-variable
Markov chain Monte Carlo methods for sampling from doubly intractable posteriors arising
in complete-data problems were introduced by Møller et al. (2006) and extended by Murray
et al. (2006); Koskinen et al. (2010); Liang (2010); Caimo and Friel (2011). We extend
them from the complete-data problems considered there to the incomplete-data problem
considered here.
To facilitate posterior computations, we augment $\alpha$, $\mu_W$, $\Sigma_W^{-1}$, $\pi$, $\theta_B$, $\theta_W$, $X$, and
$Y$ by auxiliary variables $\theta_W^\star$, $X^\star$, and $Y^\star$. The auxiliary variable $Y^\star$ can be interpreted
as an auxiliary random graph, $X^\star$ can be interpreted as an auxiliary local neighborhood
structure, and $\theta_W^\star$ can be interpreted as auxiliary within-neighborhood parameter vectors.
We assume that the joint distribution of $\alpha$, $\mu_W$, $\Sigma_W^{-1}$, $\pi$, $\theta_B$, $\theta_W$, $X$, $Y$, $\theta_W^\star$, $X^\star$, and $Y^\star$
is of the form
\[
\begin{aligned}
p(\alpha, \mu_W, \Sigma_W^{-1}, \pi, \theta_B, \theta_W, x, y, \theta_W^\star, x^\star, y^\star)
&= p(\alpha, \mu_W, \Sigma_W^{-1}, \pi, \theta_B, \theta_W) \, P_\pi(X = x) \, P_\theta(Y = y \mid X = x) \\
&\quad \times q(\theta_W^\star, x^\star \mid \pi, \theta_B, \theta_W, x, y) \, P_{\theta^\star}(Y^\star = y^\star \mid X^\star = x^\star),
\end{aligned} \tag{27}
\]
where $q(\theta_W^\star, x^\star \mid \pi, \theta_B, \theta_W, x, y)$ is a suitable auxiliary distribution, the conditional
distributions of $Y$ and $Y^\star$ belong to the same exponential family of distributions, and
$\theta^\star = (\theta_B, \theta_W^\star)$. The augmented posterior is of the form
\[
p(\alpha, \mu_W, \Sigma_W^{-1}, \pi, \theta_B, \theta_W, x, \theta_W^\star, x^\star, y^\star \mid y)
\propto p(\alpha, \mu_W, \Sigma_W^{-1}, \pi, \theta_B, \theta_W, x, y, \theta_W^\star, x^\star, y^\star). \tag{28}
\]
Integrating out the auxiliary variables $\theta_W^\star$, $X^\star$, and $Y^\star$ results in the posterior of $\alpha$, $\mu_W$,
$\Sigma_W^{-1}$, $\pi$, $\theta_B$, $\theta_W$, and $X$. While sampling from the posterior (25) is infeasible, sampling
from the augmented posterior (28) and integrating out the auxiliary variables $\theta_W^\star$, $X^\star$, and
$Y^\star$ turns out to be feasible.
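We focus here on auxiliary-variable Markov chain Monte Carlo updates of $\theta_W$ and $X$
and provide details concerning $\alpha$, $\mu_W$, $\Sigma_W^{-1}$, $\pi$, and $\theta_B$ in Supplement A. A basic
auxiliary-variable Metropolis-Hastings update of $\theta_W$ and $x$ can be described as follows.
(1) Sample $\theta_W^\star$, $X^\star$, and $Y^\star$:
(1.1) Sample $\theta_W^\star, X^\star \mid \pi, \theta_B, \theta_W, X = x, Y = y \sim q(\theta_W^\star, x^\star \mid \pi, \theta_B, \theta_W, x, y)$.
(1.2) Sample $Y^\star \mid \theta^\star, X^\star = x^\star \sim P_{\theta^\star}(Y^\star = y^\star \mid X^\star = x^\star)$, where $\theta^\star = (\theta_B, \theta_W^\star)$.
(2) Propose to swap the values of $(\theta_W, x)$ and $(\theta_W^\star, x^\star)$ and accept the proposal with
probability min(1, h), where
\[
\begin{aligned}
h &= \frac{\prod_{k=1}^{K_{\max}} p(\theta_{W,k}^\star \mid \mu_W, \Sigma_W^{-1}) \, P_\pi(X = x^\star) \, P_{\theta^\star}(Y = y \mid X = x^\star)}
{\prod_{k=1}^{K_{\max}} p(\theta_{W,k} \mid \mu_W, \Sigma_W^{-1}) \, P_\pi(X = x) \, P_\theta(Y = y \mid X = x)} \\
&\quad \times \frac{q(\theta_W, x \mid \pi, \theta_B, \theta_W^\star, x^\star, y) \, P_\theta(Y^\star = y^\star \mid X^\star = x)}
{q(\theta_W^\star, x^\star \mid \pi, \theta_B, \theta_W, x, y) \, P_{\theta^\star}(Y^\star = y^\star \mid X^\star = x^\star)}.
\end{aligned} \tag{29}
\]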
Remark: acceptance probability. The acceptance probability (29) of the auxiliary-variable
Metropolis-Hastings update depends on the intractable within-neighborhood log
partition functions through the ratios
\[
\frac{P_{\theta^\star}(Y = y \mid X = x^\star) \, P_\theta(Y^\star = y^\star \mid X^\star = x)}
{P_{\theta^\star}(Y^\star = y^\star \mid X^\star = x^\star) \, P_\theta(Y = y \mid X = x)}. \tag{30}
\]
Since the conditional distributions of $Y$ and $Y^\star$ belong to the same exponential family
of distributions, all intractable within-neighborhood log partition functions in (29) cancel.
Therefore, the acceptance probability of the auxiliary-variable Metropolis-Hastings algorithm operating on the augmented state space is tractable, whereas the acceptance probability of Metropolis-Hastings algorithms operating on the original state space is intractable.
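The cancellation can be illustrated on a deliberately simplified toy problem. The sketch below is not the sampler of this section: it assumes a single parameter θ, no neighborhood structure, an edges-only model P_θ(y) ∝ exp(θ s(y)) whose auxiliary graph can be drawn exactly, a symmetric Gaussian proposal, and a N(0, 25) prior. All names and the data values (30 observed edges among 136 dyads) are hypothetical. The point is that the log acceptance ratio uses only sufficient statistics and never the partition function.

```python
import numpy as np

def log_prior(theta, sd=5.0):
    # N(0, sd^2) log density up to an additive constant (assumed prior).
    return -0.5 * (theta / sd) ** 2

def exchange_update(theta, s_obs, n_dyads, rng, step=0.3):
    """One auxiliary-variable (exchange-type) update for the toy model."""
    theta_star = theta + step * rng.normal()  # symmetric proposal
    # Auxiliary graph drawn exactly under theta_star: dyads are independent
    # Bernoulli, so its edge count is binomial.
    p_star = 1.0 / (1.0 + np.exp(-theta_star))
    s_aux = rng.binomial(n_dyads, p_star)
    # Partition functions cancel; only sufficient statistics remain.
    log_h = (log_prior(theta_star) - log_prior(theta)
             + (theta_star - theta) * (s_obs - s_aux))
    return theta_star if np.log(rng.uniform()) < log_h else theta

rng = np.random.default_rng(1)
theta, draws = 0.0, []
for t in range(6000):
    theta = exchange_update(theta, s_obs=30, n_dyads=136, rng=rng)
    if t >= 1000:  # discard burn-in
        draws.append(theta)
```

In this toy model the partition function is actually available in closed form, so the draws can be checked against the exact posterior; in the models considered here it is not, which is what makes the cancellation essential.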
Remark: sampling $\theta_W^\star$ and $X^\star$. In Step (1.1), large moves from $\theta_W$ and $x$ may
result in low acceptance rates of the auxiliary-variable Metropolis-Hastings algorithm. We
therefore consider local moves from $\theta_W$ and $x$ by changing one or more within-neighborhood
parameter vectors or one or more memberships. Local moves from $\theta_W$ may be generated
from Gaussians centered at the present values, whereas local moves from $x$ may be generated from the full conditional distributions of memberships. It is worth noting that the full
conditional distributions of memberships are not, in general, tractable, because the within-neighborhood
log partition functions $\psi_{W,k}(\theta_{W,k})$ of local neighborhoods $k$ with $n_k \gg 5$
nodes are intractable. To construct auxiliary distributions which approximate the full conditional distributions of memberships, we approximate the intractable within-neighborhood
log partition functions $\psi_{W,k}(\theta_{W,k})$ by variational methods (Wainwright and Jordan, 2008).
Details are provided in Supplement B.
Remark: sampling $Y^\star$. Two remarks are in order.
First, local moves from $\theta_W$ and $x$ require no more than local sampling of $Y^\star$, i.e.,
sampling subgraphs. Consider moving from $(\theta_W, x)$ to $(\theta_W^\star, x^\star)$, where $\theta_W^\star$ deviates from
$\theta_W$ in $\theta_{W,k}$ and $x^\star = x$. Then the ratio of the probability masses of $y^\star$ in acceptance
probability (29) reduces to
\[
\frac{P_\theta(Y^\star = y^\star \mid X^\star = x)}{P_{\theta^\star}(Y^\star = y^\star \mid X^\star = x^\star)}
= \frac{P_\theta(Y^\star = y^\star \mid X^\star = x)}{P_{\theta^\star}(Y^\star = y^\star \mid X^\star = x)}
= \frac{P_\theta(Y^\star_{(kk)} = y^\star_{(kk)} \mid X^\star = x)}{P_{\theta^\star}(Y^\star_{(kk)} = y^\star_{(kk)} \mid X^\star = x)}. \tag{31}
\]
In other words, to evaluate the acceptance probability (29), we need to sample a (small)
subgraph rather than the whole (large) graph.
Second, direct sampling of subgraphs is infeasible and exact sampling schemes are not
known (but see the work in progress by Butts, 2012). However, Liang (2010) demonstrated
that it is admissible to sample auxiliary variables by suitable reversible Markov chains
with the observed data as initial state and the desired distribution as target distribution.
The argument extends from the complete-data problem considered by Liang (2010) to the
incomplete-data problem considered here, though we omit details. We follow the same
approach in the incomplete-data problem considered here, i.e., sample subgraphs by suitable
reversible Markov chains. The construction of suitable, reversible Markov chains is discussed
by, e.g., Snijders (2002); Hunter and Handcock (2006); Handcock et al. (2010).
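A subgraph sampler of this kind can be sketched as follows. The code is a minimal illustration under stated assumptions, not the sampler used in hergm or ergm: it assumes an undirected within-neighborhood model with edge and triangle terms as in (34), a single-dyad toggle proposal, and a chain started at a given subgraph (for instance the observed one). The function name and arguments are hypothetical.

```python
import numpy as np

def sample_subgraph(y_init, theta_edges, theta_tri, n_steps, rng):
    """Metropolis-Hastings toggles on a symmetric 0/1 adjacency matrix."""
    y = y_init.copy()
    n = y.shape[0]
    for _ in range(n_steps):
        # Pick a dyad {i, j} uniformly at random (a symmetric proposal).
        i, j = rng.choice(n, size=2, replace=False)
        sign = 1 - 2 * y[i, j]            # +1 when adding, -1 when deleting
        shared = int(np.dot(y[i], y[j]))  # common neighbors = change in triangles
        # Log ratio of unnormalized probabilities after versus before the toggle.
        log_ratio = sign * (theta_edges + theta_tri * shared)
        if np.log(rng.uniform()) < log_ratio:
            y[i, j] = y[j, i] = 1 - y[i, j]
    return y

rng = np.random.default_rng(0)
y0 = np.zeros((6, 6), dtype=int)  # placeholder initial subgraph
y_aux = sample_subgraph(y0, theta_edges=-1.0, theta_tri=0.5, n_steps=2000, rng=rng)
```

Because only change statistics enter the ratio, each step costs time proportional to the neighborhood size rather than to the size of the whole graph.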
4.3. Non-identifiability of within-neighborhood parameters and membership indicators
A Bayesian Markov chain Monte Carlo approach along the lines of Section 4.2 suffers from
the so-called label-switching problem (Stephens, 2000). The label-switching problem is
rooted in the invariance of the likelihood function to switching the labels of local neighborhoods, resulting in non-identifiable within-neighborhood parameters θW,1 , . . . , θW,Kmax and
membership indicators X1 , . . . , Xn . As a result, in un-processed Markov chain Monte Carlo
samples from the posterior, the labels of local neighborhoods may have switched multiple
times and statistical inference which depends on the labels of local neighborhoods cannot
be based on un-processed samples. We follow the Bayesian decision-theoretic approach of
Stephens (2000) to undo the label-switching, but introduce a stochastic version of the relabeling algorithm of Stephens (2000), which is based on Simulated Annealing (Liu, 2008)
and reduces computing time when Kmax is moderate or large. Details are provided in
Supplement C.
5. Comparing ERGMs and hierarchical ERGMs
A natural approach to comparing ERGMs and hierarchical ERGMs is based on model
predictions, because degenerate ERGMs tend to be incapable of generating graphs which
resemble real-world graphs and in practice it is thus imperative to inspect model predictions
(e.g., Hunter et al., 2008).
Prior predictive checks can give tentative answers to two central questions: Do hierarchical ERGMs place much prior predictive mass on graphs which resemble real-world graphs?
Do hierarchical ERGMs place much prior predictive mass on extreme graphs? In short, can
hierarchical ERGMs a priori be recommended as models of data?
Posterior predictive checks complement prior predictive checks by assessing the goodness
of fit of hierarchical ERGMs and answering the question of whether hierarchical ERGMs
can be recommended a posteriori given data.
We compare ERGMs and hierarchical ERGMs capturing transitive closure, because
transitive closure is one of the most interesting and most problematic forms of dependence.
The ERGM considered here is of the form
\[
P_\theta(Y = y) \propto \exp\left[\theta_1 s_1(y) + \theta_2 s_2(y)\right], \tag{32}
\]
where the sufficient statistics are given by the number of edges $s_1(y) = \sum_{i<j} y_{ij}$ and the number of triangles $s_2(y) = \sum_{i<j<h} y_{ij} y_{jh} y_{ih}$.
Its natural companion is the hierarchical ERGM with between-neighborhood PMFs
\[
P_\theta(Y_{(kl)} = y_{(kl)} \mid X = x) \propto \exp\left[\theta_B \, s_B(y_{(kl)})\right], \tag{33}
\]
where the sufficient statistic is given by the number of edges $y_{ij}$ between local neighborhoods
$k$ and $l$, and within-neighborhood PMFs
\[
P_\theta(Y_{(kk)} = y_{(kk)} \mid X = x) \propto \exp\left[\theta_{W,k,1} \, s_{W,k,1}(y_{(kk)}) + \theta_{W,k,2} \, s_{W,k,2}(y_{(kk)})\right], \tag{34}
\]
where the sufficient statistics are given by the number of edges $y_{ij}$ and triangles $y_{ij} y_{jh} y_{ih}$
within local neighborhood $k$.
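For concreteness, the edge and triangle counts can be computed from a symmetric 0/1 adjacency matrix with zero diagonal as sketched below. The function names and the membership vector `x` are illustrative and not taken from any package.

```python
import numpy as np

def edge_triangle_counts(y):
    """Number of edges and triangles of an undirected graph."""
    y = np.asarray(y)
    edges = int(y.sum() // 2)                  # each edge appears twice
    triangles = int(np.trace(y @ y @ y) // 6)  # each triangle gives 6 closed 3-walks
    return edges, triangles

def within_counts(y, x, k):
    """Edge and triangle counts within local neighborhood k."""
    idx = np.flatnonzero(np.asarray(x) == k)
    return edge_triangle_counts(np.asarray(y)[np.ix_(idx, idx)])

# Two disjoint triangles, one per neighborhood.
y = np.zeros((6, 6), dtype=int)
for a, b in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    y[a, b] = y[b, a] = 1
x = np.array([0, 0, 0, 1, 1, 1])
```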
We used the R packages ergm (Handcock et al., 2010), Bergm (Caimo and Friel, 2010),
and hergm to obtain the results presented here.
5.1. Prior predictive checks
The prior predictive distribution under ERGM (32) can be written as
\[
P(Y = y) = \int p(\theta) \, P_\theta(Y = y) \, d\theta, \tag{35}
\]
where p(θ) denotes the prior. Based on experience, values of θ1 outside of (−5, 0) and
values of θ2 outside of (0, 5) index near-degenerate distributions. Therefore, we choose
independent, uniform priors given by θ1 ∼ Uniform(−5, 0) and θ2 ∼ Uniform(0, 5).
Fig. 2. Prior predictions of the number of edges (left) and triangles (right) under ERGM (32) with
n = 100 and N = 4,950
Fig. 3. Prior predictions of the number of edges (left) and triangles (right) under the hierarchical
ERGM corresponding to (33) and (34) with n = 100 and N = 4,950
The prior predictive distribution under the hierarchical ERGM corresponding to (33)
and (34) can be written as
\[
\begin{aligned}
P(Y = y) = \int \cdots \int \sum_{x \in \mathcal{X}} \; & p(\alpha, \mu_W, \Sigma_W^{-1}, \pi, \theta_B, \theta_W) \\
& \times P_\pi(X = x) \, P_\theta(Y = y \mid X = x) \, d\alpha \, d\mu_W \, d\Sigma_W^{-1} \, d\pi \, d\theta_B \, d\theta_W,
\end{aligned} \tag{36}
\]
where $p(\alpha, \mu_W, \Sigma_W^{-1}, \pi, \theta_B, \theta_W)$ denotes the prior. To make the prior of $\theta_W$ under the
hierarchical ERGM comparable to the prior of $\theta$ under the ERGM, we assign independent, uniform priors $\theta_{W,k,1} \sim \text{Uniform}(-5, 0)$ and $\theta_{W,k,2} \sim \text{Uniform}(0, 5)$ to the within-neighborhood
parameters $\theta_{W,k,1}$ and $\theta_{W,k,2}$, respectively. To respect the sparse and local
nature of graphs, we assume that the between-neighborhood parameter $\theta_B$ is governed
by the prior $\theta_B \sim N(-5, 1)$, i.e., there tend to be more edges within than between local
neighborhoods. The prior for $\pi$ is given by the truncated stick-breaking prior with $\alpha = 5$.
We consider graphs with n = 100 nodes and N = 4,950 edge variables and let the
maximum number of local neighborhoods be Kmax = 50. The prior predictive distributions
can be sampled by Markov chain Monte Carlo methods: direct sampling of parameters and
between-neighborhood subgraphs is straightforward, while within-neighborhood subgraphs
can be sampled by Markov chain Monte Carlo methods along the lines of Hunter and
Handcock (2006). Monte Carlo samples of size 10,000 were generated from the prior of the
ERGM and hierarchical ERGM and, for every one of the draws from the prior, a prediction
was generated by a Markov chain of length 100,000, accepting the final draw of the Markov
chain as a draw from the prior predictive distribution.
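The directly sampled part of a prior predictive draw can be sketched as follows. The code covers only the memberships and the between-neighborhood edges; within-neighborhood subgraphs are left empty here and would still have to be filled in by Markov chain Monte Carlo. It relies on the fact that (33) factorizes into independent Bernoulli edge variables with success probability exp(θB) / (1 + exp(θB)), and it reads the 1 in N(−5, 1) as unit variance. The function name is hypothetical.

```python
import numpy as np

def draw_between_structure(n, k_max, alpha, rng):
    """Draw memberships x, theta_B, and between-neighborhood edges."""
    # Truncated stick-breaking weights, as in (22)-(24).
    v = rng.beta(1.0, alpha, size=k_max)
    v[-1] = 1.0
    pi = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    pi = pi / pi.sum()  # guard against rounding error before sampling
    x = rng.choice(k_max, size=n, p=pi)
    theta_b = rng.normal(-5.0, 1.0)
    p_between = 1.0 / (1.0 + np.exp(-theta_b))
    # Independent Bernoulli edges for dyads whose nodes are in different neighborhoods.
    y = np.zeros((n, n), dtype=int)
    iu, ju = np.triu_indices(n, k=1)
    keep = (x[iu] != x[ju]) & (rng.uniform(size=iu.size) < p_between)
    y[iu[keep], ju[keep]] = 1
    return x, theta_b, y + y.T

x, theta_b, y_between = draw_between_structure(
    n=100, k_max=50, alpha=5.0, rng=np.random.default_rng(0))
```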
Figure 2 shows prior predictions of the number of edges and triangles under ERGM
(32). The bulk of the prior predictive mass is placed on extreme graphs with few edges
and triangles and graphs with almost all possible edges and triangles. In contrast, Figure
3 shows that the hierarchical ERGM corresponding to (33) and (34) places much prior
predictive mass on graphs which resemble real-world graphs: i.e., graphs where the average
number of edges of nodes given by $\sum_{i,j} y_{ij} / n$ ranges from 2 to 20 and where the number of
triangles is a small multiple of the number of edges, which covers a wide range of real-world
graphs. In addition, the prior predictive distribution is bimodal under the ERGM, while
unimodal under the hierarchical ERGM.
A tentative answer to the questions raised at the start of Section 5 is therefore: hierarchical ERGMs are capable of generating graphs which resemble real-world graphs, in
contrast to ERGMs, and can thus be recommended a priori as models of real-world graphs.
5.2. Posterior predictive checks
To compare ERGMs and hierarchical ERGMs in terms of posterior predictions, we select
two data sets: the terrorist network behind the Bali bombing in 2002 as well as the classic
Sampson network.
The posterior predictive distribution under ERGM (32) given data y can be written as
\[
P(\dot{Y} = \dot{y} \mid Y = y) = \int p(\theta \mid y) \, P_\theta(\dot{Y} = \dot{y}) \, d\theta, \tag{37}
\]
where p(θ | y) denotes the posterior. The posterior predictive distribution under the
hierarchical ERGM corresponding to (33) and (34) can be written as
\[
\begin{aligned}
P(\dot{Y} = \dot{y} \mid Y = y) = \int \cdots \int \sum_{x \in \mathcal{X}} \; & p(\alpha, \mu_W, \Sigma_W^{-1}, \pi, \theta_B, \theta_W, x \mid y) \\
& \times P_\theta(\dot{Y} = \dot{y} \mid X = x) \, d\alpha \, d\mu_W \, d\Sigma_W^{-1} \, d\pi \, d\theta_B \, d\theta_W,
\end{aligned} \tag{38}
\]
where $p(\alpha, \mu_W, \Sigma_W^{-1}, \pi, \theta_B, \theta_W, x \mid y)$ denotes the posterior. Independent priors $\theta_i \sim N(0, 25)$
are used in the case of the ERGM and independent priors $\alpha \sim \text{Gamma}(1, 1)$,
$\mu_{W,i} \sim N(0, 1)$, and $\sigma_{W,i}^{-2} \sim \text{Gamma}(10, 10)$ in the case of the hierarchical ERGM. 120,000
draws from the posterior predictive distribution of the ERGM were generated by the Markov
chain Monte Carlo algorithm of Caimo and Friel (2011), with a burn-in of 20,000 and saving
every 10th post-burn-in draw, and 1,200,000 draws from the posterior predictive distribution of the hierarchical ERGM were generated by the Markov chain Monte Carlo algorithm
of Section 4, with a burn-in of 200,000 and saving every 100th post-burn-in draw.
5.2.1. Terrorist network behind Bali bombing in 2002
The structure of terrorist networks is of interest with an eye to understanding how terrorists
communicate, identifying cells (i.e., subsets of terrorists), isolating cells, and dismantling
them. We consider here the network of terrorists behind the 2002 bombing in Bali, Indonesia,
which killed 202 people (Koschade, 2006). The 17 terrorists who carried out the bombing were
members of the Southeast Asian al-Qaeda affiliate Jemaah Islamiyah. The terrorist network
can be represented by a graph with n = 17 nodes and N = 136 edge variables, where Yij = 1
if terrorists i and j were in contact prior to the bombing and Yij = 0 otherwise. The terrorist
network is shown in Figure 4.
Fig. 4. Terrorist network behind Bali bombing in 2002. The posterior membership probabilities are
represented by colored pie charts. Node labels: Octavia, Arnasan, Azahari, Dulmatin, Junaedi, Hidayat, Feri, Ghoni, Sarijo, Samudra, Rauf, Imron, Patek, Idris, Muklas, Mubarok, Amrozi
We start by determining the maximum number of local neighborhoods Kmax to truncate
the prior. Using strategy I sketched in Section 4.1, we compare the hierarchical ERGM
corresponding to (33) and (34) with up to Kmax = 1, 2, 3, 4, 5 local neighborhoods in terms
of predictive power. Predictive power is taken to be the root mean square deviation of
the predicted number of triangles. According to Figure 5, the hierarchical ERGM with
Kmax = 2 is superior to the hierarchical ERGM with Kmax = 1, which is equivalent to
ERGM (32). The hierarchical ERGM with Kmax = 3 in turn is superior to the hierarchical
ERGM with Kmax = 2, but increasing Kmax from 3 to 5 does not increase the predictive
power.
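The comparison criterion can be written down in a few lines. The sketch assumes that the deviation is measured between posterior predictive draws of the triangle count and the observed triangle count; the numbers in the example are made up for illustration and are not results from this paper.

```python
import numpy as np

def rmsd(predicted, observed):
    """Root mean square deviation of predictions from an observed value."""
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((predicted - observed) ** 2)))

# Hypothetical predicted triangle counts for two values of Kmax.
observed_triangles = 100
predictions = {1: [20, 400, 650, 30], 3: [90, 110, 105, 95]}
scores = {k_max: rmsd(p, observed_triangles) for k_max, p in predictions.items()}
```

A smaller value indicates predictions that are closer to the observed graph, so values of Kmax are compared by looking for the point beyond which the score stops decreasing.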
We compare ERGM (32) and the hierarchical ERGM corresponding to (33) and (34) with
up to Kmax = 5 local neighborhoods in terms of the posterior predictive distribution of the
number of edges and triangles, shown in Figures 6 and 7. Under the ERGM, the posterior
predictive distribution is bimodal. In contrast, the posterior predictive distribution under
the hierarchical ERGM is unimodal and places most mass on graphs which are close to the
observed graph in terms of the number of edges and triangles. We note that, while other
statistics may be used to compare the ERGM and hierarchical ERGM in terms of goodness
of fit, the choice of goodness of fit statistics may not influence the main conclusions much.
The fact that the ERGM places so much mass on dense graphs with almost all edges and
triangles indicates that the ERGM fits much worse than the hierarchical ERGM, no matter
which goodness of fit statistics are chosen, because the topology of graphs which are local
in nature—such as the observed graph—stands in sharp contrast to the topology of dense
graphs in terms of connectivity, centrality, transitivity, and other interesting features of
graphs (e.g., Kolaczyk, 2009).
The posterior of $\alpha$, $\mu_{W,1}$, $\mu_{W,2}$, $\sigma_{W,1}^{-1}$, and $\sigma_{W,2}^{-1}$ is shown in Table 1. The mean parameters
$\mu_{W,1}$ and $\mu_{W,2}$ governing the within-neighborhood parameters tend to be both positive (and
more so the mean parameter $\mu_{W,2}$ governing the within-neighborhood triangle parameters),
which is not surprising in the light of the large number of edges and triangles within local
neighborhoods.
Fig. 5. Terrorist network: hierarchical ERGM corresponding to (33) and (34) with up to Kmax =
1, 2, 3, 4, 5 local neighborhoods: root mean square deviation of predicted number of triangles plotted
against maximum number of local neighborhoods Kmax
Table 1. Terrorist network: posterior of parameters $\alpha$, $\mu_{W,1}$, $\mu_{W,2}$, $\sigma_{W,1}^{-1}$, and $\sigma_{W,2}^{-1}$
parameter              .05 quantile   .50 quantile   .95 quantile   odds of parameter being positive
$\alpha$                    .36           1.32           3.43           ∞
$\mu_{W,1}$               -1.03            .45           2.00           2.22
$\mu_{W,2}$                -.27            .91           2.22           8.74
$\sigma_{W,1}^{-1}$         .55            .98           1.59           ∞
$\sigma_{W,2}^{-1}$         .55            .98           1.57           ∞
Last, while the primary purpose of introducing local neighborhoods is the desire to
address the model degeneracy and striking lack of fit of ERGMs, predictions of the memberships to local neighborhoods may be of interest as well, e.g., to identify cells. The
pie charts in Figure 4 represent the posterior membership probabilities reported by the
stochastic relabeling algorithm described in Supplement C. The 5 green-colored terrorists
turn out to be the 5 members of the so-called support group, which was supposed to
support the so-called main group consisting of all other terrorists. The members of the
main group tend to be black-colored, with the exception of Amrozi and Mubarok who are
more red-colored than black-colored. Indeed, while Amrozi and Mubarok belonged to the
main group, both resided elsewhere and were almost isolated from the rest of the main
group (Koschade, 2006). Most interesting is the membership of Feri. He was a member of
the main group and was the suicide bomber who initiated the attack. Feri arrived two days
before the attack, whereas all other members of the main group had arrived days or weeks
earlier and in fact started leaving the night Feri arrived (Koschade, 2006). As a result,
Feri had limited opportunities to communicate with others. In particular, Feri was the one
and only member of the main group who did not communicate with the three commanders
Muklas (the Jemaah Islamiyah head of operations in Singapore and Malaysia), Samudra
(the field commander), and Idris (the logistics commander) (Koschade, 2006). Therefore,
the network position of Feri is unique and the uncertainty about his membership is reflected
in the posterior membership probability distribution.
Fig. 6. Terrorist network: posterior predictions of the number of edges (left) and triangles (right)
under ERGM (32); vertical lines represent observed numbers
Fig. 7. Terrorist network: posterior predictions of the number of edges (left) and triangles (right) under
the hierarchical ERGM corresponding to (33) and (34); vertical lines represent observed numbers
In conclusion, the hierarchical ERGM framework admits the specification of models
capturing simple and interesting features of the terrorist network and, under the parameterization of the hierarchical ERGM with within-neighborhood edges and triangles, posterior membership predictions are consistent with on-the-ground knowledge of the terrorist
network.
5.2.2. Classic Sampson network
The Sampson network (de Nooy et al., 2005, pp. 87–95) is a classic data set used by, e.g.,
Frank and Strauss (1986); Strauss and Ikeda (1990); Handcock (2003a); Caimo and Friel
(2011). Sampson studied social relations among a group of novices who were preparing
to enter a monastic order. The network corresponds to N = 306 relationships among
n = 18 novices measured at three time points. We consider here the following directed
edge variables Yij : if novice i liked novice j at any of the three time points, then Yij = 1,
otherwise Yij = 0. The network is plotted in Figure 8.
Fig. 8. Sampson network. The posterior membership probabilities are represented by colored pie
charts. Node labels: Winf, Boni, Mark, Albert, Simp, Greg, Elias, Hugh, Ambrose, John, Victor, Basil, Louis, Amand, Bonaven, Berth, Romul, Peter
A natural extension of ERGM (32) to directed graphs is given by
\[
P_\theta(Y = y) \propto \exp\left[\theta_1 s_1(y) + \theta_2 s_2(y) + \theta_3 s_3(y)\right], \tag{39}
\]
where the sufficient statistics are the number of edges $y_{ij}$, mutual edges $y_{ij} y_{ji}$, and transitive triples $y_{ij} y_{jh} y_{ih}$, and its natural companion is given by the hierarchical ERGM with
between-neighborhood PMFs
\[
P_\theta(Y_{(kl)} = y_{(kl)} \mid X = x) \propto \exp\left[\theta_{B,1} \, s_{B,1}(y_{(kl)}) + \theta_{B,2} \, s_{B,2}(y_{(kl)})\right], \tag{40}
\]
where the sufficient statistics are given by the number of edges $y_{ij}$ and mutual edges $y_{ij} y_{ji}$
between local neighborhoods $k$ and $l$, and within-neighborhood PMFs
\[
P_\theta(Y_{(kk)} = y_{(kk)} \mid X = x) \propto \exp\left[\theta_{W,k,1} \, s_{W,k,1}(y_{(kk)}) + \theta_{W,k,2} \, s_{W,k,2}(y_{(kk)}) + \theta_{W,k,3} \, s_{W,k,3}(y_{(kk)})\right], \tag{41}
\]
where the sufficient statistics are given by the number of edges $y_{ij}$, mutual edges $y_{ij} y_{ji}$,
and transitive triples $y_{ij} y_{jh} y_{ih}$ within local neighborhood $k$. Since experts argue that the
novices are divided into 3 or 4 groups (de Nooy et al., 2005, pp. 87–95), we follow strategy
II sketched in Section 4.1 and set Kmax = 5, which can be considered to be an upper bound
on the number of local neighborhoods.
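The three directed statistics can be computed from a 0/1 adjacency matrix with zero diagonal as sketched below, where `y[i, j] = 1` means that i nominates j. The function name and the three-node example are illustrative.

```python
import numpy as np

def directed_counts(y):
    """Edges, mutual edges, and transitive triples of a directed graph."""
    y = np.asarray(y)
    edges = int(y.sum())
    mutual = int((y * y.T).sum() // 2)     # each mutual dyad is counted twice
    # (y @ y)[i, h] counts two-paths i -> j -> h; keep those closed by i -> h.
    transitive = int(((y @ y) * y).sum())
    return edges, mutual, transitive

# Example: 0 -> 1, 1 -> 2, 0 -> 2 forms one transitive triple.
y = np.zeros((3, 3), dtype=int)
y[0, 1] = y[1, 2] = y[0, 2] = 1
counts = directed_counts(y)

# Number of directed edge variables among n = 18 novices.
n = 18
n_edge_variables = n * (n - 1)
```

The zero diagonal guarantees that only triples of distinct nodes contribute to the transitive triple count.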
Figures 9 and 10 show posterior predictions of the number of edges, mutual edges, and
transitive triples. The contrast between the ERGM and the hierarchical ERGM in terms of
goodness of fit is at least as striking as in the case of the terrorist network in Section 5.2.1.
The problematic nature of the ERGM is underlined by the posterior of the number
of non-empty local neighborhoods of the hierarchical ERGM. Figure 11 shows that the
posterior places negligible mass on partitions of the set of nodes where all nodes are assigned
to one local neighborhood, which corresponds to ERGM (39). In addition, the posterior
mode is 3, which is in line with expert knowledge (de Nooy et al., 2005, pp. 87–95).
The local neighborhoods correspond, once again, to physical groups: the posterior membership probabilities shown in Figure 8 agree with the three-group division of novices into
“Loyals,” “Turks,” and “Outcasts” advocated by most experts (de Nooy et al., 2005, pp.
87–95).
Fig. 9. Sampson network: posterior predictions of the number of edges (left), mutual edges (middle),
and transitive triples (right) under ERGM (39); vertical lines represent observed numbers
Fig. 10. Sampson network: posterior predictions of the number of edges (left), mutual edges (middle), and transitive triples (right) under the hierarchical ERGM corresponding to (40) and (41); vertical
lines represent observed numbers
6. Discussion
The most important conclusion is that Markov dependence along the lines of Frank and
Strauss (1986) and other forms of network dependence are not problematic as long as dependence is sufficiently local. We have introduced hierarchical ERGMs which allow dependence
to be sufficiently local and we have demonstrated that hierarchical ERGMs can be recommended as models of data, both a priori and a posteriori. Hierarchical ERGMs can be
expected to be superior to ERGMs in terms of goodness of fit as long as the data set in
question is sparse and local in nature. As we have pointed out, many data sets in the social
and health sciences and biology are indeed sparse and local in nature, though technological
networks (e.g., the World Wide Web, Twitter) may be an exception.
The class of hierarchical ERGMs introduced here can be considered to be the first
model of the “next generation of social network models” (Snijders, 2007, p. 324): i.e., the
first model which combines latent structure models (e.g., Nowicki and Snijders, 2001; Hoff
et al., 2002; Schweinberger and Snijders, 2003; Handcock et al., 2007) and ERGMs in a
way that takes advantage of the strengths of ERGMs—i.e., the power of ERGMs to model
dependencies—while reducing the weaknesses of ERGMs—i.e., the fact that Markov dependence along the lines of Frank and Strauss (1986) is more global than local in nature. We
note that a partition of the set of nodes N can be considered to constitute a latent, discrete
space. Let $d : N \times N \mapsto \mathbb{R}_0^+$ be a distance function such that d(i, j) = 0 if and only if i = j,
d(i, j) = 1 if i and j are distinct members of the same local neighborhood, and d(i, j) = 2 otherwise.
Then d satisfies reflexivity, symmetry, and the triangle inequality and is thus a metric, and
the probability of an observed graph depends on d. In contrast to simple latent structure
models (e.g., Nowicki and Snijders, 2001; Hoff et al., 2002; Schweinberger and Snijders, 2003;
Handcock et al., 2007), which assume dyads to be independent conditional on d, hierarchical
ERGMs assume dyads to be locally dependent while globally independent conditional on
d. It is evident that other metrics may be used, but the hierarchical ERGM framework is
a simple starting point and the local dyad-dependence and global dyad-independence have
conceptual and computational advantages.
Fig. 11. Sampson network: posterior of number of non-empty local neighborhoods under the hierarchical ERGM corresponding to (40) and (41)
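The claim that d is a metric can be checked numerically for any given partition. The brute-force sketch below uses illustrative function names and a made-up membership vector.

```python
import itertools
import numpy as np

def neighborhood_distance(x):
    """d(i, i) = 0, d(i, j) = 1 within a neighborhood, 2 otherwise."""
    x = np.asarray(x)
    d = np.where(x[:, None] == x[None, :], 1, 2)
    np.fill_diagonal(d, 0)
    return d

def is_metric(d):
    """Check identity of indiscernibles, symmetry, and the triangle inequality."""
    n = d.shape[0]
    identity = all((d[i, j] == 0) == (i == j)
                   for i in range(n) for j in range(n))
    symmetric = bool(np.array_equal(d, d.T))
    triangle = all(d[i, j] <= d[i, h] + d[h, j]
                   for i, j, h in itertools.product(range(n), repeat=3))
    return identity and symmetric and triangle

d = neighborhood_distance([0, 0, 1, 1, 2])
```

The triangle inequality holds because every distance between distinct nodes is 1 or 2, so the sum of any two such distances is at least 2.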
Simulation and statistical inference for hierarchical ERGMs are implemented in the R
package hergm, which will be made available in the future. Owing to the fact that in
most applications suitable local neighborhood structure is not observed and the posterior
is doubly intractable, statistical inference for hierarchical ERGMs is expensive. Despite
the expensive computations, we believe that hierarchical ERGMs are simple and attractive
alternatives to ERGMs and superior in terms of goodness of fit as long as the data set in
question is sparse and local in nature.
Acknowledgements
We acknowledge support from the Netherlands Organisation for Scientific Research (NWO
grant 446-06-029) (MS), the National Institutes of Health (NIH grant 1R01HD052887-01A2)
(MS), and the Office of Naval Research (ONR grant N00014-08-1-1015) (MS, MSH). We
are grateful to Johan Koskinen for valuable comments and suggestions on drafts of the
manuscript.
Proof of proposition (7) and (8)
Let $y_1, y_2, \dots$ be a monotone sequence of graphs. By assumption, $f_{ij}(y_n) = 0$ for $n = 1, 2$
and $f_{ij}(y_n) \geq 0$ for all $n > 2$, implying
\[
s_2(y_n) = 0 \text{ for } n = 1, 2, \qquad s_2(y_n) - s_2(y_{n-1}) \geq 0 \text{ for all } n > 2. \tag{42}
\]
By (6), there exists a monotone sequence of graphs $y_1, y_2, \dots$ and, for any $C > 0$, however
large, there exists $n_C > 1$ such that
\[
s_2(y_n) = s_2(y_n) - s_2(y_2) = \sum_{m=3}^{n} \left[s_2(y_m) - s_2(y_{m-1})\right] > C \, (n - 1) \, (n - n_C) \text{ for all } n > n_C. \tag{43}
\]
Therefore, the sufficient statistic $s_2(y_n)$ is unstable in the sense of Schweinberger (2011)
whereas $s_1(y_n)$ is stable, and (7) and (8) follow from Theorem 3 and Corollary 2 of Schweinberger (2011).
References
Airoldi, E., D. Blei, S. Fienberg, and E. Xing (2008). Mixed membership stochastic blockmodels. Journal of Machine Learning Research 9, 1981–2014.
Barndorff-Nielsen, O. E. (1978). Information and Exponential Families in Statistical Theory.
New York: Wiley.
Baxter, R. J. (2007). Exactly solved models in statistical mechanics. New York: Dover.
Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems. Journal
of the Royal Statistical Society, Series B 36, 192–225.
Butts, C. T. (2011). Bernoulli graph bounds for general random graph models. Sociological
Methodology 41, 299–345.
Butts, C. T. (2012). Manuscript in preparation. University of California, Irvine.
Caimo, A. and N. Friel (2010). R package version 1.2. Bayesian inference for exponential
random graph models. http://CRAN.R-project.org/package=Bergm.
Caimo, A. and N. Friel (2011). Bayesian inference for exponential random graph models.
Social Networks 33, 41–55.
Chatterjee, S. and P. Diaconis (2011). Estimating and understanding exponential random
graph models. Technical report, Courant Institute of Mathematical Sciences, New York
University.
de Nooy, W., A. Mrvar, and V. Batagelj (2005). Exploratory Social Network Analysis with
Pajek. New York: Cambridge University Press.
Diaconis, P. and D. Ylvisaker (1979). Conjugate priors for exponential families. Annals of
Statistics 7, 269–281.
Ferguson, T. (1973). A Bayesian analysis of some nonparametric problems. Annals of
Statistics 1, 209–230.
Frank, O. and D. Strauss (1986). Markov graphs. Journal of the American Statistical
Association 81 (395), 832–842.
Geyer, C. J. and E. A. Thompson (1992). Constrained Monte Carlo maximum likelihood
for dependent data. Journal of the Royal Statistical Society, Series B 54, 657–699.
Handcock, M. (2003a). Assessing degeneracy in statistical models of social networks. Technical report, Center for Statistics and the Social Sciences, University of Washington.
http://www.csss.washington.edu/Papers.
Handcock, M. (2003b). Statistical models for social networks: Inference and degeneracy.
In R. Breiger, K. Carley, and P. Pattison (Eds.), Dynamic Social Network Modeling and
Analysis: Workshop Summary and Papers. Washington, D.C.: National Academies Press.
Handcock, M. S., D. R. Hunter, C. T. Butts, S. M. Goodreau, M. Morris, and P. Krivitsky (2010). R package ergm version 2.2-2: A Package to Fit, Simulate and Diagnose
Exponential-Family Models for Networks. http://CRAN.R-project.org/package=hergm.
Handcock, M. S., A. E. Raftery, and J. M. Tantrum (2007). Model-based clustering for
social networks. Journal of the Royal Statistical Society, Series A 170, 301–354. with
discussion.
Hoff, P. D., A. E. Raftery, and M. S. Handcock (2002). Latent space approaches to social
network analysis. Journal of the American Statistical Association 97, 1090–1098.
Holland, P. W. and S. Leinhardt (1981). An exponential family of probability distributions
for directed graphs. Journal of the American Statistical Association 76 (373), 33–65.
Hunter, D. R., S. M. Goodreau, and M. S. Handcock (2008). Goodness of fit of social
network models. Journal of the American Statistical Association 103 (481), 248–258.
Hunter, D. R. and M. S. Handcock (2006). Inference in curved exponential family models
for networks. Journal of Computational and Graphical Statistics 15, 565–583.
Ishwaran, H. and L. F. James (2001). Gibbs sampling methods for stick-breaking priors.
Journal of the American Statistical Association 96 (453), 161–173.
Ising, E. (1925). Beitrag zur Theorie des Ferromagnetismus. Zeitschrift für Physik A 31,
253–258.
Jonasson, J. (1999). The random triangle model. Journal of Applied Probability 36, 852–876.
Jones, J. H. and M. Handcock (2003). Social networks: Sexual contacts and epidemic
thresholds. Nature 423, 605–606.
Kolaczyk, E. D. (2009). Statistical Analysis of Network Data: Methods and Models.
Springer.
Koschade, S. (2006). A social network analysis of Jemaah Islamiyah: The applications to
counter-terrorism and intelligence. Studies in Conflict and Terrorism 29, 559–575.
Koskinen, J. H. (2009). Using latent variables to account for heterogeneity in exponential
family random graph models. In S. M. Ermakov, V. B. Melas, and A. N. Pepelyshev
(Eds.), Proceedings of the 6th St. Petersburg Workshop on Simulation Vol. II, pp. 845–
849.
Koskinen, J. H., G. L. Robins, and P. E. Pattison (2010). Analysing exponential random
graph (p-star) models with missing data using Bayesian data augmentation. Statistical
Methodology 7 (3), 366–384.
Krivitsky, P. N., M. S. Handcock, and M. Morris (2011). Adjusting for network size and
composition effects in exponential-family random graph models. Statistical Methodology 8,
319–339.
Lauritzen, S. (1996). Graphical Models. Oxford, UK: Oxford University Press.
Liang, F. (2010). A double Metropolis-Hastings sampler for spatial models with intractable
normalizing constants. Journal of Statistical Computation and Simulation 80, 1007–1022.
Liu, J. S. (2008). Monte Carlo Strategies in Scientific Computing. New York: Springer.
Møller, J., A. N. Pettitt, R. Reeves, and K. K. Berthelsen (2006). An efficient Markov
chain Monte Carlo method for distributions with intractable normalising constants.
Biometrika 93, 451–458.
Murray, I., Z. Ghahramani, and D. J. MacKay (2006). MCMC for doubly-intractable
distributions. In Proceedings of the 22nd Annual Conference on Uncertainty in Artificial
Intelligence (UAI-06), pp. 359–366. AUAI Press.
Nowicki, K. and T. A. B. Snijders (2001). Estimation and prediction for stochastic
blockstructures. Journal of the American Statistical Association 96 (455), 1077–1087.
Park, J. and M. E. J. Newman (2005). Solution for the properties of a clustered network.
Physical Review E 72, 026136.
Pattison, P. and G. Robins (2002). Neighborhood-based models for social networks. In
R. M. Stolzenberg (Ed.), Sociological Methodology, Volume 32, Chapter 9, pp. 301–337.
Boston: Blackwell Publishing.
Petrescu-Prahova, M. and C. Butts (2008). Emergent Coordinators in the World Trade
Center Disaster. International Journal of Mass Emergencies and Disasters 28, 133–168.
Richardson, S. and P. J. Green (1997). On Bayesian analysis of mixtures with an unknown
number of components. Journal of the Royal Statistical Society, Series B 59, 731–792.
Rinaldo, A., S. E. Fienberg, and Y. Zhou (2009). On the geometry of discrete exponential
families with application to exponential random graph models. Electronic Journal of
Statistics 3, 446–484.
Schweinberger, M. (2011). Instability, sensitivity, and degeneracy of discrete exponential
families. Journal of the American Statistical Association 106 (496), 1361–1370.
Schweinberger, M. and T. A. B. Snijders (2003). Settings in social networks: A measurement
model. In R. M. Stolzenberg (Ed.), Sociological Methodology, Volume 33, Chapter 10, pp.
307–341. Boston & Oxford: Basil Blackwell.
Snijders, T. A. B. (2002). Markov chain Monte Carlo estimation of exponential random
graph models. Journal of Social Structure 3, 1–40.
Snijders, T. A. B. (2007). Contribution to the discussion of Handcock, M.S., Raftery, A.E.,
and J.M. Tantrum, Model-based clustering for social networks. Journal of the Royal
Statistical Society, Series A 170, 322–324.
Snijders, T. A. B., P. E. Pattison, G. L. Robins, and M. S. Handcock (2006). New
specifications for exponential random graph models. Sociological Methodology 36, 99–153.
Stephens, M. (2000). Dealing with label-switching in mixture models. Journal of the Royal
Statistical Society, Series B 62, 795–809.
Strauss, D. (1986). On a general class of models for interaction. SIAM Review 28, 513–527.
Strauss, D. and M. Ikeda (1990). Pseudolikelihood estimation for social networks. Journal
of the American Statistical Association 85 (409), 204–212.
Wainwright, M. J. and M. I. Jordan (2008). Graphical models, exponential families, and
variational inference. Foundations and Trends in Machine Learning 1, 1–305.
Wang, Y. J. and G. Y. Wong (1987). Stochastic blockmodels for directed graphs. Journal
of the American Statistical Association 82 (397), 8–19.
Wasserman, S. and P. Pattison (1996). Logit models and logistic regression for social
networks: I. An introduction to Markov graphs and p∗. Psychometrika 61, 401–425.