Hierarchical Exponential-Family Random Graph Models With Local Dependence

Michael Schweinberger
Department of Statistics, Pennsylvania State University, University Park, PA, USA

Mark S. Handcock
Department of Statistics, University of California, Los Angeles, CA, USA

Summary. Dependent phenomena, such as relational, spatial, and temporal phenomena, tend to be characterized by local dependence in the sense that units which are close in a well-defined sense are dependent. In contrast to spatial and temporal phenomena, however, relational phenomena tend to lack a natural dependence structure in the sense that it is unknown which units are close and thus dependent. We develop here a novel class of hierarchical exponential-family models which addresses the lack of a natural dependence structure of relational phenomena and which has important advantages. First, it respects the local nature of relational phenomena by assuming that there is an underlying local dependence structure, which may or may not be observed. Second, it constitutes a simple and flexible statistical framework for modeling a wide range of relational phenomena characterized by local dependence. Third, by restricting dependence to be local, it reduces the degenerate behavior of conventional exponential-family models based on notions of Markov dependence. We follow a Bayesian approach to hierarchical exponential-family models based on auxiliary-variable Markov chain Monte Carlo methods. We demonstrate the advantages of hierarchical exponential-family models over conventional exponential-family models by applying them to the network of terrorists behind the Bali bombing in 2002 as well as a classic data set.

Keywords: social networks, stochastic block models, statistical exponential families, undirected graphical models

1. Introduction

Discrete, relational data arise in the social and health sciences, biology, computer science, and other fields (Kolaczyk, 2009).
Examples are terrorist networks (e.g., Koschade, 2006), communication and collaboration networks arising in the study of disasters (e.g., Petrescu-Prahova and Butts, 2008), and contact networks arising in the study of the spread of disease (e.g., Jones and Handcock, 2003). We consider here discrete, relational data which can be represented by a graph $y$ with a set of nodes and a set of edges (representing relationships between, e.g., animals, humans, computers), where edges may or may not be directed and take on discrete values. In line with convention, we assume that the set of nodes is fixed while the graph $y$ is an outcome of a random graph $Y$ with sample space $\mathcal{Y}$.

Discrete, relational data can be modeled by discrete exponential families of distributions of the form

$$P_\theta(Y = y) = \exp\left[\langle \theta, s(y) \rangle - \psi(\theta)\right], \quad y \in \mathcal{Y}, \tag{1}$$

where $\langle \theta, s(y) \rangle$ denotes the inner product of a $d$-vector of natural parameters $\theta$ and a $d$-vector of sufficient statistics $s(y)$, and $\psi(\theta)$ is the log partition function given by

$$\psi(\theta) = \log \sum_{y' \in \mathcal{Y}} \exp\left[\langle \theta, s(y') \rangle\right], \quad \theta \in \Theta, \tag{2}$$

where the natural parameter space is given by $\Theta = \{\theta \in \mathbb{R}^d : \psi(\theta) < \infty\}$. Exponential-family random graph models (ERGMs) of the form (1) were pioneered by Holland and Leinhardt (1981); Frank and Strauss (1986); Wasserman and Pattison (1996). ERGMs are widely used for at least two reasons. First, ERGMs are exponential families with well-known, desirable properties (Barndorff-Nielsen, 1978). Second, scientists are interested in a wide range of dependencies, including, but not limited to, transitive closure (Wasserman and Pattison, 1996), and ERGMs admit simple representations of such dependencies.

E-mail: michael.schweinberger@stat.psu.edu
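To make (1) and (2) concrete, the following Python sketch evaluates the log partition function by brute-force enumeration of all undirected, binary graphs on a handful of nodes. It is an illustration only and not taken from the paper: the choice of sufficient statistics (number of edges and number of triangles) and the function names `sufficient_stats`, `all_graphs`, `log_partition`, and `log_pmf` are assumptions made here. Enumeration visits $2^{n(n-1)/2}$ graphs and is therefore feasible only for very small $n$.

```python
import itertools
import math


def sufficient_stats(y, n):
    """Return s(y) = (number of edges, number of triangles).

    y maps each dyad (i, j) with i < j to 0 or 1 (illustrative encoding).
    """
    edges = sum(y.values())
    triangles = sum(
        y[(i, j)] * y[(j, k)] * y[(i, k)]
        for i, j, k in itertools.combinations(range(n), 3)
    )
    return (edges, triangles)


def all_graphs(n):
    """Yield every undirected binary graph on n nodes as a dyad -> {0, 1} dict."""
    dyads = list(itertools.combinations(range(n), 2))
    for values in itertools.product((0, 1), repeat=len(dyads)):
        yield dict(zip(dyads, values))


def log_partition(theta, n):
    """psi(theta) = log sum_{y'} exp<theta, s(y')>, as in (2), by enumeration."""
    inner = [
        sum(t * s for t, s in zip(theta, sufficient_stats(y, n)))
        for y in all_graphs(n)
    ]
    m = max(inner)  # shift by the maximum for a stable log-sum-exp
    return m + math.log(sum(math.exp(v - m) for v in inner))


def log_pmf(y, theta, n):
    """log P_theta(Y = y) = <theta, s(y)> - psi(theta), as in (1)."""
    s = sufficient_stats(y, n)
    return sum(t * v for t, v in zip(theta, s)) - log_partition(theta, n)
```

With $\theta = (0, 0)$ and $n = 4$ all $2^6 = 64$ graphs are equally likely, so `log_partition((0.0, 0.0), 4)` equals $6 \log 2$ up to floating-point error.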
Despite these attractive properties, many ERGMs are plagued by the so-called model degeneracy problem: the subset of parameter values corresponding to non-degenerate distributions tends to be negligible (Strauss, 1986; Jonasson, 1999; Snijders, 2002; Handcock, 2003a,b; Park and Newman, 2005; Rinaldo et al., 2009; Butts, 2011; Schweinberger, 2011; Chatterjee and Diaconis, 2011). The model degeneracy problem tends to obstruct Markov chain Monte Carlo simulation of data and Monte Carlo maximum likelihood estimation of parameters (Snijders, 2002; Handcock, 2003a,b; Rinaldo et al., 2009). In practice, the model degeneracy problem tends to result in striking lack of fit (Snijders, 2002; Handcock, 2003a,b; Hunter et al., 2008). Strauss (1986) was the first to point out that the model degeneracy problem is rooted in the model, and so is its solution. To address the model degeneracy problem, Snijders et al. (2006); Hunter and Handcock (2006) introduced curved ERGMs. While curved ERGMs have been applied with some success (Hunter et al., 2008), curved ERGMs do not admit simple representations of dependencies and the interpretation of parameters is challenging, making the application of curved ERGMs restrictive from a scientific point of view. The purpose of the present paper is two-fold. First, we argue that in the absence of a natural dependence structure many ERGMs tend to induce strong dependence and model degeneracy. Second, we address the lack of a natural dependence structure of ERGMs by developing a novel class of hierarchical ERGMs with an underlying local dependence structure, which may or may not be observed. Some important advantages are that hierarchical ERGMs (1) respect the local nature of graphs; (2) admit simple representations of dependencies as long as dependencies are local; and (3) reduce the model degeneracy problem. The paper is structured as follows. Section 2 discusses the model degeneracy problem of ERGMs. 
Section 3 introduces hierarchical ERGMs with local dependence and discusses special cases of interest. Section 4 discusses Bayesian inference based on auxiliary-variable Markov chain Monte Carlo methods. Section 5 compares ERGMs and hierarchical ERGMs by using prior predictive checks as well as posterior predictive checks.

2. The model degeneracy problem of ERGMs

A class of ERGMs of special interest is the groundbreaking class of ERGMs with Markov dependence due to Frank and Strauss (1986). ERGMs with Markov dependence demonstrate why ERGMs are appealing and, at the same time, give insight into the roots of the model degeneracy problem and possible solutions. Throughout, for historical reasons and convenience, we consider undirected, binary graphs $y$ defined on $n$ nodes, where the edges $y_{ij} \in \{0, 1\}$ satisfy the linear constraints $y_{ij} = y_{ji}$ (all $i < j$) and $y_{ii} = 0$ (all $i$), and the sample space $\mathcal{Y}$ corresponds to the set of undirected, binary graphs $y$ defined on $n$ nodes. We note that it is straightforward to extend the developments of Sections 2-4 to directed, binary and non-binary graphs with finite sample spaces.

2.1. ERGMs with Markov dependence

Motivated by the nearest neighbor definition in statistical physics (Ising, 1925) and spatial statistics (Besag, 1974), Frank and Strauss (1986) called two dyads $\{i, j\}$ and $\{k, l\}$ neighbors if $\{i, j\}$ and $\{k, l\}$ share a node and assumed that, if $\{i, j\}$ and $\{k, l\}$ are not neighbors, then $Y_{ij}$ and $Y_{kl}$ are independent conditional on the rest of random graph $Y$. By the Hammersley-Clifford theorem (Besag, 1974; Frank and Strauss, 1986), the probability mass function (PMF) of random graph $Y$ can be written as

$$P_\theta(Y = y) = \exp\left[\theta_1 s_1(y) + \sum_{k=2}^{n-1} \theta_k s_k(y) + \theta_n s_n(y) - \psi(\theta)\right], \tag{3}$$

where $s_1(y) = \sum_{i<j} y_{ij}$ is the number of edges, $s_k(y) = \sum_{i} \sum_{j_1 < \cdots < j_k} y_{ij_1} \cdots y_{ij_k}$ is the number of $k$-stars, and $s_n(y) = \sum_{i<j<k} y_{ij} y_{jk} y_{ik}$ is the number of triangles.

2.2. Properties of ERGMs with Markov dependence

ERGMs with Markov dependence possess both appealing and problematic properties. ERGMs with Markov dependence are appealing from a statistical point of view, being exponential-family models (Barndorff-Nielsen, 1978) and undirected graphical models (Lauritzen, 1996), motivated by and related to models in spatial statistics (Besag, 1974). At the same time, ERGMs with Markov dependence are appealing from a scientific point of view, because scientists have long considered $k$-stars and triangles (the sufficient statistics of ERGMs with Markov dependence) to be fundamental functions of graphs (Wasserman and Pattison, 1996).

However, the neighborhood assumption underlying ERGMs with Markov dependence is problematic (Strauss, 1986): for any given dyad $\{i, j\}$, the number of neighbors is $2(n - 2)$ and thus increases with $n$. The large and growing neighborhoods indicate that ERGMs with Markov dependence, while inspired by the Ising model in statistical physics (Ising, 1925) and its relatives in spatial statistics (Besag, 1974), resemble the so-called "unphysical" mean-field Ising model with large and growing neighborhoods (Baxter, 2007, Chapter 3) rather than the classic Ising model with small and bounded neighborhoods (Ising, 1925). The comparison with the "unphysical" mean-field Ising model suggests that ERGMs with Markov dependence may induce strong dyad-dependence and may be problematic provided $n$ is large. We show in Section 2.3 that ERGMs with large and growing neighborhoods indeed tend to be problematic and discuss implications in terms of statistical inference in Section 2.4.

2.3. Model degeneracy

To demonstrate that ERGMs with large and growing neighborhoods tend to be problematic, we consider ERGMs of the form

$$P_\theta(Y_n = y_n) = \exp\left[\theta_1 s_1(y_n) + \theta_2 s_2(y_n) - \psi_n(\theta)\right], \quad y_n \in \mathcal{Y}_n, \tag{4}$$

where

$$\psi_n(\theta) = \log \sum_{y_n' \in \mathcal{Y}_n} \exp\left[\theta_1 s_1(y_n') + \theta_2 s_2(y_n')\right], \quad \theta \in \Theta, \tag{5}$$

where the subscript $n$ acknowledges the dependence of $Y_n$ and $\mathcal{Y}_n$ on $n$.
The natural parameter space $\Theta$ is given by $\Theta = \mathbb{R}^2$, because the sample space $\mathcal{Y}_n$ is finite (Barndorff-Nielsen, 1978, pp. 115–116). Let $\mu_n : \Theta \mapsto \text{int}(C_n)$ be the vector of mean-value parameters defined by $\mu_n(\theta) = E_\theta[s(Y_n)]$, where $\text{int}(C_n)$ denotes the interior of the convex hull of $\{s(y_n) : y_n \in \mathcal{Y}_n\}$ (Barndorff-Nielsen, 1978, p. 121).

We consider here sufficient statistics of the form $s_1(y_n) = \sum_{i<j} y_{ij}$ and $s_2(y_n) = \sum_{i<j} y_{ij} f_{ij}(y_n)$, where $f_{ij} : \mathcal{Y}_n \mapsto \mathbb{R}$. A prominent example is given by the ERGM with the number of edges and triangles, which is a special case of (3) and (4) with $f_{ij}(y_n) = \sum_{k: k>j>i} y_{ik} y_{jk}$. We assume $f_{ij}(y_n) = 0$ for $n = 1, 2$ and $f_{ij}(y_n) \geq 0$ for all $n > 2$, which covers the number of triangles as well as other statistics based on counts of subgraphs of size $k \geq 3$. A sequence of graphs $y_1, y_2, \dots$ is called monotone if $y_1 \in \mathcal{Y}_1$ and, for any $n > 1$, $y_n \in \mathcal{Y}_n$ is obtained from $y_{n-1} \in \mathcal{Y}_{n-1}$ by adding one node and up to $n - 1$ edges between the $n - 1$ existing nodes and the additional node.

Proposition. If there exists a monotone sequence of graphs $y_1, y_2, \dots$ and, for any constant $C > 0$, however large, there exists a constant $n_C > 1$ such that

$$\frac{s_2(y_n) - s_2(y_{n-1})}{n - 1} > C \quad \text{for all } n > n_C, \tag{6}$$

then, given any $\theta \in \{\theta \in \Theta : \theta_2 < 0\}$,

$$\frac{\mu_{2n}(\theta)}{\sup_{\theta \in \Theta} \mu_{2n}(\theta)} \longrightarrow 0 \quad \text{as } n \longrightarrow \infty \tag{7}$$

and, given any $\theta \in \{\theta \in \Theta : \theta_2 > 0\}$,

$$\frac{\mu_{2n}(\theta)}{\sup_{\theta \in \Theta} \mu_{2n}(\theta)} \longrightarrow 1 \quad \text{as } n \longrightarrow \infty. \tag{8}$$

□

A proof of the proposition can be found in the appendix.

The problem with ERGMs with large and growing neighborhoods is that the growth of the sufficient statistic $s_2(y_n)$ outpaces the growth of the number of possible edges. As $n$ increases, sequences of graphs which are extreme in terms of $s_2(y_n)$ are allowed to amass more and more probability mass and dominate all others in terms of probability mass. An example is given by the ERGM with the number of edges and triangles: the sequence of complete graphs $y_1, y_2, \dots$, where all possible edges are present, implies $s_2(y_n) - s_2(y_{n-1}) = \binom{n-1}{2}$ for all $n > 2$, which is quadratic rather than linear in $n$. As $n$ increases, graphs with at least a fraction $\epsilon$ of triangles, where $\epsilon \in (0, 1)$ is arbitrary, attract either less and less probability mass (provided $\theta_2 < 0$) or more and more probability mass (provided $\theta_2 > 0$), pushing the mean-value parameter to the boundary of the mean-value parameter space. As a result, if $n$ is large, then the subset of the natural parameter space mapping to mean-value parameters which are not close to the boundary of the mean-value parameter space tends to be negligible.

Fig. 1. Markov chain Monte Carlo sample estimates of mean-value parameters $\mu_{1n}(\theta) / \sup_{\theta \in \Theta} \mu_{1n}(\theta)$ (left) and $\mu_{2n}(\theta) / \sup_{\theta \in \Theta} \mu_{2n}(\theta)$ (right) plotted against natural parameter $\theta_2$, where natural parameter $\theta_1$ is given by $\theta_1 = -.147$

Figure 1 demonstrates how negligible the viable subset of the natural parameter space is when $n$ is as small as 17 and the sufficient statistics $s_1(y_n)$ and $s_2(y_n)$ are given by the number of edges and triangles, respectively. It plots Markov chain Monte Carlo sample estimates of mean-value parameters $\mu_{1n}(\theta) / \sup_{\theta \in \Theta} \mu_{1n}(\theta)$ and $\mu_{2n}(\theta) / \sup_{\theta \in \Theta} \mu_{2n}(\theta)$ against natural parameter $\theta_2$, where natural parameter $\theta_1$ is fixed at the maximum likelihood estimate $\hat{\theta}_1 = -.147$ of $\theta_1$ under the ERGM restricted by $\theta_2 = 0$, which was estimated from the terrorist network described in Section 5.2.1. The Markov chain Monte Carlo sample estimates were evaluated at 201 points in the interval $[-5, 5]$: at every one of the 201 points, a Markov chain was started at the network of $n = 17$ terrorists with a burn-in of 100,000 iterations and a post-burn-in of 100 million iterations, saving every 10,000-th post-burn-in draw.
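The kind of simulation behind Figure 1 can be sketched in Python with a single-dyad Metropolis sampler for the ERGM with the number of edges and triangles. This is a rough illustration under stated assumptions, not the authors' implementation: the chain is started at the empty graph (the terrorist network is not reproduced here), it runs far fewer iterations than reported above and does not thin, the function name `mean_value_ratios` and its arguments are hypothetical, and the suprema of the mean-value parameters are taken to be the largest attainable counts $\binom{n}{2}$ and $\binom{n}{3}$.

```python
import math
import random
from itertools import combinations


def mean_value_ratios(theta1, theta2, n=17, burn_in=20_000, n_draws=50_000, seed=0):
    """Estimate mu_1n / sup mu_1n and mu_2n / sup mu_2n for the edge-triangle ERGM."""
    rng = random.Random(seed)
    nbrs = [set() for _ in range(n)]          # adjacency sets, empty starting graph
    dyads = list(combinations(range(n), 2))
    edges = triangles = 0                     # running values of s1 and s2
    sum_edges = sum_triangles = 0
    for step in range(burn_in + n_draws):
        i, j = rng.choice(dyads)              # propose toggling dyad {i, j}
        common = len(nbrs[i] & nbrs[j])       # triangles created or destroyed
        present = j in nbrs[i]
        sign = -1 if present else 1
        # log of the Metropolis ratio: theta1 * (change in s1) + theta2 * (change in s2)
        log_ratio = sign * (theta1 + theta2 * common)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            if present:
                nbrs[i].remove(j)
                nbrs[j].remove(i)
            else:
                nbrs[i].add(j)
                nbrs[j].add(i)
            edges += sign
            triangles += sign * common
        if step >= burn_in:
            sum_edges += edges
            sum_triangles += triangles
    return (sum_edges / (n_draws * math.comb(n, 2)),
            sum_triangles / (n_draws * math.comb(n, 3)))


if __name__ == "__main__":
    for theta2 in (-2.0, 0.0, 2.0):
        print(theta2, mean_value_ratios(-0.147, theta2))
```

Running the sketch for a grid of $\theta_2$ values gives a coarse impression of how quickly the normalized triangle mean moves from near its lower end to near its upper end; the short chains used here make the estimates much noisier than those in the figure.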
As expected, the mean-value parameter $\mu_{2n}(\theta)$ is close to its infimum (provided $\theta_2 < 0$) or supremum (provided $\theta_2 > 0$). It is worth noting that the proposition says nothing about the behavior of the mean-value parameter $\mu_{1n}(\theta)$, though the number of edges $s_1(y_n)$ and the number of triangles $s_2(y_n)$ are dependent and thus the pathological behavior of $\mu_{2n}(\theta)$ tends to be reflected in pathological behavior of $\mu_{1n}(\theta)$, as Figure 1 demonstrates.

In the negligible subset of the natural parameter space corresponding to non-degenerate distributions, distributions tend to resemble two-component mixture distributions, where one component distribution corresponds to the distribution indexed by $\theta_1$ and $\theta_2 = 0$ (under which edges are i.i.d. Bernoulli random variables) and the other component distribution corresponds to a near-degenerate distribution. In the special case of the ERGM with the number of edges and triangles, that was first shown by Jonasson (1999) and complemented by Park and Newman (2005); Chatterjee and Diaconis (2011); see in addition Snijders (2002); Handcock (2003a,b); Hunter et al. (2008); Rinaldo et al. (2009); Butts (2011); Schweinberger (2011).

In conclusion, ERGMs with large and growing neighborhoods tend to place hardly any probability mass on graphs which resemble real-world graphs.

2.4. Implications of model degeneracy in terms of statistical inference

Degenerate ERGMs tend to be problematic in terms of statistical inference. Maximum likelihood estimates of natural parameter vector $\theta$ cannot be obtained by direct maximization of the log likelihood function, because the log likelihood function of many ERGMs is intractable (e.g., Frank and Strauss, 1986). A widely used approach is to obtain Monte Carlo maximum likelihood estimates of $\theta$ by maximizing a Monte Carlo approximation of the log likelihood function based on a Monte Carlo sample of graphs (Geyer and Thompson, 1992; Handcock, 2003a,b; Hunter and Handcock, 2006).
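A common way to write the Monte Carlo approximation of Geyer and Thompson (1992) is as a log likelihood ratio relative to a starting value $\theta_0$: $\ell(\theta) - \ell(\theta_0) \approx \langle \theta - \theta_0, s(y) \rangle - \log \frac{1}{m} \sum_{i=1}^{m} \exp[\langle \theta - \theta_0, s(y_i) \rangle]$, where $y_1, \dots, y_m$ are graphs sampled under $\theta_0$. The Python sketch below evaluates this expression from the sampled sufficient statistics; the function name `mc_loglik_ratio` and its arguments are hypothetical and chosen here for illustration.

```python
import math


def mc_loglik_ratio(theta, theta0, s_obs, sampled_stats):
    """Monte Carlo approximation of l(theta) - l(theta0).

    s_obs is the observed vector of sufficient statistics and sampled_stats
    holds the sufficient statistics of graphs simulated under theta0.
    """
    diff = [t - t0 for t, t0 in zip(theta, theta0)]
    inner_obs = sum(d * s for d, s in zip(diff, s_obs))
    inner = [sum(d * s for d, s in zip(diff, stat)) for stat in sampled_stats]
    m = max(inner)  # shift by the maximum for a stable log-mean-exp
    log_mean = m + math.log(sum(math.exp(v - m) for v in inner) / len(inner))
    return inner_obs - log_mean
```

If the observed statistics lie outside the convex hull of the sampled statistics, this function can be made arbitrarily large by moving $\theta$ in a suitable direction, which is the non-existence problem discussed in this section.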
Suppose that the observed value of the vector of sufficient statistics s(yn ) is in the interior of the convex hull of {s(yn ) : yn ∈ Yn }, implying that the maximum likelihood estimate of θ exists and is unique (Barndorff-Nielsen, 1978, p. 151). Let Sn ⊂ Yn be a subset of graphs generated by Monte Carlo methods under a starting value of θ. The negligible subset of the natural parameter space corresponding to non-degenerate distributions suggests that finding good starting values of θ is hard and in practice many starting values generate samples Sn close to the boundary of the convex hull of {s(yn ) : yn ∈ Yn }. As a result, the observed value of s(yn ) may not be in the interior of the convex hull of {s(yn ) : yn ∈ Sn } and thus the Monte Carlo maximum likelihood estimate of θ may not exist even though the maximum likelihood estimate of θ does exist (Handcock, 2003a,b; Rinaldo et al., 2009). In practice, non-existence of Monte Carlo maximum likelihood estimates results in computational failure and computational failure has been observed in a wide range of applications (e.g., Handcock, 2003a,b; Rinaldo et al., 2009). A Bayesian approach (Koskinen et al., 2010; Caimo and Friel, 2011) with proper priors ensures proper posteriors, but fails to address one of the most important issues which the model degeneracy problem raises. The model degeneracy problem is rooted in the family of distributions {Pθ , θ ∈ Θ} and, if a family of distributions includes no member which places much probability mass on graphs resembling real-world graphs, then neither a Bayesian approach nor any other approach to statistical inference can produce it. In practice, no matter which approach to statistical inference is adopted, the model degeneracy problem tends to result in striking lack of fit (Snijders, 2002; Handcock, 2003a,b; Hunter et al., 2008). 2.5. 
Conclusions

The most important conclusion is that the neighborhood assumption underlying ERGMs with Markov dependence is problematic and that ERGMs with Markov dependence tend to be degenerate provided $n$ is large. Therefore, the application of ERGMs with Markov dependence to large graphs is not advisable. It is debatable what "large" means, but simulations (e.g., Handcock, 2003a; Rinaldo et al., 2009) suggest that ERGMs with Markov dependence should not be applied to graphs with $n \gg 10$ nodes and $\binom{n}{2} \gg 45$ possible edges.

3. Hierarchical ERGMs

We develop a novel class of hierarchical ERGMs motivated by two observations:

• In a wide range of applications in the social and health sciences and biology, it is believed that the expected numbers of edges of nodes are either bounded or grow only slowly as a function of the number of nodes $n$, implying that graphs tend to be sparse (e.g., Jonasson, 1999; Krivitsky et al., 2011). Therefore, dependence tends to be local in the sense that dependence is limited to small subsets of edges (e.g., Pattison and Robins, 2002). If there is uncertainty about which subsets of edges are dependent, it makes sense to express the uncertainty by specifying a family of distributions on the set of possible dependence structures.

• By restricting dependence to subgraphs, we (1) respect the sparse and local nature of graphs; (2) admit simple representations of dependencies as long as dependencies are local; and (3) reduce the model degeneracy of ERGMs.

We introduce hierarchical ERGMs in Section 3.1, describe priors in Section 3.2, and discuss special cases of interest in Section 3.3.

3.1. Model

The class of hierarchical ERGMs introduced here is based on two fundamental assumptions. It is worth noting that, in line with convention, we consider the set of nodes to be fixed and the graph to be random. The first assumption states that there is an underlying local neighborhood structure.

Assumption 1: local neighborhood structure.
The set of nodes is partitioned into $K$ local neighborhoods, indexed by integers $1, \dots, K$. The memberships of nodes to local neighborhoods are governed by

$$X_i \mid \pi_1, \dots, \pi_K \overset{\text{iid}}{\sim} \text{Multinomial}(1; \pi_1, \dots, \pi_K), \quad i = 1, \dots, n, \tag{9}$$

where $X_i$ denotes the vector of membership indicators of node $i$. □

The membership indicators $X = (X_1, \dots, X_n)$ induce a partition of the set of nodes $N$ into subsets $N_1, \dots, N_K$ and a partition of the set of edge variables $Y = \{Y_{ij} : i \in N, j \in N\}$ into subsets $Y_{(kl)} = \{Y_{ij} : i \in N_k, j \in N_l\}$. The second assumption states that, conditional on the local neighborhood structure, edges within local neighborhoods are dependent, while edges between local neighborhoods are independent.

Assumption 2: local dyad-dependence, global dyad-independence. The conditional PMF of random graph $Y$ given local neighborhood structure $X$ can be factorized into within- and between-neighborhood PMFs:

$$P_\theta(Y = y \mid X = x) = \prod_{k=1}^{K} P_\theta(Y_{(kk)} = y_{(kk)} \mid X = x) \times \prod_{k<l}^{K} P_\theta(Y_{(kl)} = y_{(kl)} \mid X = x). \tag{10}$$

We assume that the between-neighborhood PMFs can be factorized into dyad-bound PMFs:

$$P_\theta(Y_{(kl)} = y_{(kl)} \mid X = x) = \prod_{i \in N_k,\, j \in N_l} P_\theta(Y_{ij} = y_{ij} \mid X = x), \tag{11}$$

while the within-neighborhood PMFs are not assumed to be factorizable. □

Remark: local dependence. The restriction of dependence to local neighborhoods serves to respect the sparse and local nature of graphs on the one hand and to reduce the model degeneracy of ERGMs on the other hand. An important advantage is that dependence, such as transitive closure, is admissible within local neighborhoods.

Remark: local neighborhood structure. Suitable local neighborhood structure may or may not be observed. If suitable local neighborhood structure is observed, then it should be used. However, the emphasis is on suitable local neighborhood structure.
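The partition induced by Assumptions 1 and 2 can be illustrated with a short Python sketch that draws memberships as in (9) and groups the dyads into within- and between-neighborhood subsets. The function names `sample_memberships` and `partition_dyads`, the membership probabilities, and the zero-based neighborhood labels are assumptions made for this illustration; the sketch does not sample edges.

```python
import random
from itertools import combinations


def sample_memberships(n, pi, rng):
    """Draw one neighborhood label in {0, ..., K-1} per node, as in (9)."""
    return rng.choices(range(len(pi)), weights=pi, k=n)


def partition_dyads(x):
    """Group dyads {i, j} by the neighborhood pair (k, l), k <= l, of their nodes.

    Keys (k, k) collect within-neighborhood dyads, on which dependence is
    allowed; keys (k, l) with k < l collect between-neighborhood dyads, which
    are assumed independent given the memberships.
    """
    blocks = {}
    for i, j in combinations(range(len(x)), 2):
        key = (min(x[i], x[j]), max(x[i], x[j]))
        blocks.setdefault(key, []).append((i, j))
    return blocks


rng = random.Random(1)
x = sample_memberships(10, [0.5, 0.3, 0.2], rng)
blocks = partition_dyads(x)
```

Every dyad falls into exactly one block, so the block sizes add up to $\binom{n}{2}$; a within-neighborhood block of a neighborhood with $n_k$ nodes holds $\binom{n_k}{2}$ dyads.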
If the observed number of local neighborhoods K is small relative to the number of nodes n and thus some local neighborhoods are large, then the improvement in goodness of fit relative to ERGMs may be small and thus the observed local neighborhood structure may not be useful. If no suitable local neighborhood structure is observed, then the uncertainty about the unknown number of local neighborhoods K needs to be addressed. One possible approach is to express the uncertainty about K by specifying a prior for K (e.g., Richardson and Green, 1997). An alternative approach is based on non-parametric priors (e.g., Ferguson, 1973). We follow a non-parametric approach, which we describe in Section 3.2. It is worth noting that, while the number of local neighborhoods K needs to be large so that the local neighborhoods can be small, there is no need to impose strong prior restrictions on the size of the local neighborhoods. If, for a given local neighborhood structure x and observed graph y, the conditional probability Pθ (Y = y | X = x) is negligible under all possible values of θ—e.g., when all nodes are members of the same local neighborhood and Pθ (Y = y | X = x) is near-degenerate, then the marginal posterior probability of x given y tends to be negligible relative to other values of X which make more sense in light of y. We have indeed made the experience that the marginal posterior probability of problematic local neighborhood structures (e.g., with too large local neighborhoods) relative to less problematic local neighborhood structures tends to be negligible. We present two examples in Section 5.2. Remark: parameterizations. Exponential parameterizations of the between- and within-neighborhood PMFs are convenient, though other parameterizations may be used as well. 
The between-neighborhood PMFs can be written as

$$P_\theta(Y_{ij} = y_{ij} \mid X = x) = \exp\left[\langle \theta_{B,ij}, s_B(y_{ij}) \rangle - \psi_{B,ij}(\theta_{B,ij})\right], \tag{12}$$

where $s_B(y_{ij})$ is a vector of between-neighborhood sufficient statistics, $\theta_{B,ij}$ is a vector of between-neighborhood natural parameters, and $\psi_{B,ij}(\theta_{B,ij})$ is the between-neighborhood log partition function,

$$\psi_{B,ij}(\theta_{B,ij}) = \log \sum_{y_{ij}' \in \{0,1\}} \exp\left[\langle \theta_{B,ij}, s_B(y_{ij}') \rangle\right]. \tag{13}$$

The between-neighborhood sufficient statistics $s_B(y_{ij})$ may be functions of edges $y_{ij}$ and covariates. It is worth noting that the exponential parameterization of the between-neighborhood PMFs is equivalent to a logit model with linear predictor $\langle \theta_{B,ij}, s_B(1) - s_B(0) \rangle$. The within-neighborhood PMFs can be written as

$$P_\theta(Y_{(kk)} = y_{(kk)} \mid X = x) = \exp\left[\langle \theta_{W,k}, s_W(y_{(kk)}) \rangle - \psi_{W,k}(\theta_{W,k})\right], \tag{14}$$

where $s_W(y_{(kk)})$ is a vector of within-neighborhood sufficient statistics, $\theta_{W,k}$ is a vector of within-neighborhood natural parameters, and $\psi_{W,k}(\theta_{W,k})$ is the within-neighborhood log partition function,

$$\psi_{W,k}(\theta_{W,k}) = \log \sum_{y_{(kk)}' \in \mathcal{Y}_{(kk)}} \exp\left[\langle \theta_{W,k}, s_W(y_{(kk)}') \rangle\right], \tag{15}$$

where $\mathcal{Y}_{(kk)}$ is the sample space of $y_{(kk)}$. The within-neighborhood sufficient statistics $s_W(y_{(kk)})$ may include interactions, such as the number of triangles within local neighborhood $k$, and functions of covariates.

The exponential parameterization of the between- and within-neighborhood PMFs implies that the conditional PMF of $Y$ given $X$ can be written as

$$P_\theta(Y = y \mid X = x) = \exp\left[\langle \eta(\theta), s(y) \rangle - \psi(\theta)\right], \tag{16}$$

where the vector of parameters $\eta(\theta)$ is a linear function of the vectors of between- and within-neighborhood parameters, the vector of sufficient statistics $s(y)$ is a linear function of between- and within-neighborhood vectors of sufficient statistics, and the log partition function $\psi(\theta)$ is given by

$$\psi(\theta) = \sum_{k<l}^{K} \sum_{i \in N_k,\, j \in N_l} \psi_{B,ij}(\theta_{B,ij}) + \sum_{k=1}^{K} \psi_{W,k}(\theta_{W,k}). \tag{17}$$

Remark: parameter constraints. In the interest of model parsimony, it is sometimes desirable to constrain parameters.
We consider here the constraints $\theta_{B,ij} = \theta_B$ (all $i < j$) on the between-neighborhood parameter vectors $\theta_{B,ij}$, which are of secondary interest. The within-neighborhood parameter vectors $\theta_{W,k}$, which govern the dependence within local neighborhoods and are thus of primary interest, are left unconstrained.

3.2. Prior

The class of hierarchical ERGMs introduced in Section 3.1 aims to reduce model degeneracy and improve goodness of fit relative to ERGMs. To accomplish that, the local neighborhoods must be small and thus the number of local neighborhoods large. We consider here a non-parametric approach based on stick-breaking priors (Ishwaran and James, 2001), which allows the number of non-empty local neighborhoods a posteriori to be large, while encouraging it a priori to be small.

Suppose that there is an infinite number of local neighborhoods and that nodes belong to local neighborhood $k = 1, 2, \dots$ with probability $\pi_k$, $k = 1, 2, \dots$, where

$$\pi_1 = V_1 \tag{18}$$

$$\pi_k = V_k \prod_{j=1}^{k-1} (1 - V_j), \quad k = 2, 3, \dots, \tag{19}$$

where

$$V_k \mid \alpha \overset{\text{iid}}{\sim} \text{Beta}(1, \alpha), \quad k = 1, 2, \dots, \tag{20}$$

where $\alpha > 0$ is a parameter and $\sum_{k=1}^{\infty} \pi_k = 1$ with probability 1 (Ishwaran and James, 2001).

The between- and within-neighborhood parameter vectors $\theta_B$ and $\theta_{W,k}$ index exponential families and therefore conjugate priors exist (Diaconis and Ylvisaker, 1979), though direct sampling from the resulting full conditional distributions is infeasible. In the absence of computational advantages, multivariate Gaussian priors are convenient alternatives:

$$\begin{aligned} \theta_B \mid \mu_B, \Sigma_B^{-1} &\sim \text{MVN}(\mu_B, \Sigma_B^{-1}) \\ \theta_{W,k} \mid \mu_W, \Sigma_W^{-1} &\overset{\text{iid}}{\sim} \text{MVN}(\mu_W, \Sigma_W^{-1}), \quad k = 1, 2, \dots, \end{aligned} \tag{21}$$

where $\mu_B$ and $\mu_W$ are mean parameter vectors and $\Sigma_B^{-1}$ and $\Sigma_W^{-1}$ are precision matrices of suitable order. Last, to acknowledge the uncertainty about the critical hyper-parameters $\alpha$, $\mu_W$, and $\Sigma_W^{-1}$, we assign conjugate Gamma, multivariate Gaussian, and Wishart hyper-priors to $\alpha$, $\mu_W$, and $\Sigma_W^{-1}$, respectively.

3.3.
Special cases Special cases of interest are the (stochastic) block models of Wang and Wong (1987); Strauss and Ikeda (1990); Nowicki and Snijders (2001) and the related models of Handcock et al. (2007); Airoldi et al. (2008); Koskinen (2009). Wang and Wong (1987) assumed that there is a known partition of the set of nodes and that the conditional PMF of Y given the partition can be factorized into dyad-bound PMFs. Nowicki and Snijders (2001) dropped the assumption that the partition is known, but kept the assumption that the conditional PMF of Y can be factorized into dyad-bound PMFs. Handcock et al. (2007); Airoldi et al. (2008) introduced more general models than Nowicki and Snijders (2001), while retaining the assumption that the conditional PMF of Y can be factorized into dyad-bound PMFs. All of these models assume that the conditional PMF of Y can be factorized into dyad-bound PMFs, which makes direct modeling of a wide range of dependencies—including, but not limited to, transitive closure—impossible. Strauss and Ikeda (1990) assumed that the partition is known and that edges within observed local neighborhoods are governed by ERGMs with Markov dependence. While the models of Strauss and Ikeda (1990) admit dependence within observed local neighborhoods, the usefulness of the models is limited, because in most applications suitable local neighborhood structure—suitable in the sense that the number of local neighborhoods is large and the local neighborhoods are small—is not observed. Last, Koskinen (2009) assumed that the partition is unknown and that the conditional PMF of Y does not factorize. However, Koskinen (2009) attempted to capture unobserved heterogeneity rather than to address the model degeneracy problem. 4. Bayesian inference We follow a Bayesian approach to hierarchical ERGMs. A Bayesian approach to hierarchical ERGMs must overcome multiple obstacles. 
The most serious obstacle is the fact that with positive probability one or more local neighborhoods $k$ contain $n_k \gg 5$ nodes and thus one or more within-neighborhood log partition functions, which are log sums of $\exp[\binom{n_k}{2} \log 2]$ terms (see (15)), are intractable. To facilitate posterior computations, we approximate the prior and augment the posterior. We describe the approximation of the prior in Section 4.1, discuss the augmentation of the posterior and sampling from the augmented posterior in Section 4.2, and address the non-identifiability of within-neighborhood parameter vectors and membership indicators in Section 4.3.

4.1. Prior truncation

The stick-breaking prior of Section 3.2 can be approximated by a truncated stick-breaking prior along the lines of Ishwaran and James (2001), which facilitates posterior computations. We choose a maximum number of local neighborhoods, denoted by $K_{\max}$. Some general advice concerning the choice of $K_{\max}$ is given by Ishwaran and James (2001). We are here more concerned with the goodness of fit of the model than the approximation of the stick-breaking prior and choose $K_{\max}$ accordingly. In practice, we choose $K_{\max}$ by either

I. trying out multiple values of $K_{\max}$ and comparing the goodness of fit of the model;
II. exploiting on-the-ground knowledge; or
III. setting $K_{\max} = n$.

We demonstrate strategies I and II in Section 5.2.

Given $K_{\max}$, the membership probabilities $\pi = (\pi_1, \dots, \pi_{K_{\max}})$ are constructed by truncated stick-breaking (Ishwaran and James, 2001):

$$\pi_1 = V_1 \tag{22}$$

$$\pi_k = V_k \prod_{j=1}^{k-1} (1 - V_j), \quad k = 2, \dots, K_{\max}, \tag{23}$$

where

$$\begin{aligned} V_k \mid \alpha &\overset{\text{iid}}{\sim} \text{Beta}(1, \alpha), \quad k = 1, \dots, K_{\max} - 1 \\ V_{K_{\max}} &= 1, \end{aligned} \tag{24}$$

where $\alpha > 0$ is a parameter and $V_{K_{\max}} = 1$ ensures $\sum_{k=1}^{K_{\max}} \pi_k = 1$. The truncated stick-breaking construction of $\pi$ implies that $\pi$ is generalized Dirichlet distributed, which is conjugate to multinomial sampling (Ishwaran and James, 2001). The (hyper)priors of $\alpha$, $\mu_W$, $\Sigma_W^{-1}$, $\theta_{W,1}, \dots, \theta_{W,K_{\max}}$, and $\theta_B$ are equivalent to the (hyper)priors described in Section 3.2.

4.2. Posterior augmentation

Under the truncated prior described in Section 4.1, the posterior is of the form

$$p(\alpha, \mu_W, \Sigma_W^{-1}, \pi, \theta_B, \theta_W, x \mid y) \propto p(\alpha, \mu_W, \Sigma_W^{-1}, \pi, \theta_B, \theta_W) \times P_\pi(X = x)\, P_\theta(Y = y \mid X = x), \tag{25}$$

where the truncated prior is of the form

$$p(\alpha, \mu_W, \Sigma_W^{-1}, \pi, \theta_B, \theta_W) = p(\alpha)\, p(\mu_W)\, p(\Sigma_W^{-1})\, p(\pi \mid \alpha)\, p(\theta_B) \times \prod_{k=1}^{K_{\max}} p(\theta_{W,k} \mid \mu_W, \Sigma_W^{-1}), \tag{26}$$

where $\theta_W = (\theta_{W,1}, \dots, \theta_{W,K_{\max}})$ denotes the within-neighborhood parameter vectors. Owing to the fact that the conditional PMF of $Y$ is not, in general, tractable, the posterior is doubly intractable, implying that standard Markov chain Monte Carlo methods (e.g., Metropolis-Hastings) cannot be used to sample from the posterior. Auxiliary-variable Markov chain Monte Carlo methods for sampling from doubly intractable posteriors arising in complete-data problems were introduced by Møller et al. (2006) and extended by Murray et al. (2006); Koskinen et al. (2010); Liang (2010); Caimo and Friel (2011). We extend them from the complete-data problems considered there to the incomplete-data problem considered here.

To facilitate posterior computations, we augment $\alpha$, $\mu_W$, $\Sigma_W^{-1}$, $\pi$, $\theta_B$, $\theta_W$, $X$, and $Y$ by auxiliary variables $\theta_W^\star$, $X^\star$, and $Y^\star$. The auxiliary variable $Y^\star$ can be interpreted as an auxiliary random graph, $X^\star$ can be interpreted as an auxiliary local neighborhood structure, and $\theta_W^\star$ can be interpreted as auxiliary within-neighborhood parameter vectors.
We assume that the joint distribution of $\alpha$, $\mu_W$, $\Sigma_W^{-1}$, $\pi$, $\theta_B$, $\theta_W$, $X$, $Y$, $\theta_W^\star$, $X^\star$, and $Y^\star$ is of the form

$$\begin{aligned} p(\alpha, \mu_W, \Sigma_W^{-1}, \pi, \theta_B, \theta_W, x, y, \theta_W^\star, x^\star, y^\star) = {} & p(\alpha, \mu_W, \Sigma_W^{-1}, \pi, \theta_B, \theta_W) \, P_\pi(X = x) \, P_\theta(Y = y \mid X = x) \\ & \times q(\theta_W^\star, x^\star \mid \pi, \theta_B, \theta_W, x, y) \, P_{\theta^\star}(Y^\star = y^\star \mid X^\star = x^\star), \end{aligned} \tag{27}$$

where $q(\theta_W^\star, x^\star \mid \pi, \theta_B, \theta_W, x, y)$ is a suitable, auxiliary distribution, the conditional distributions of $Y$ and $Y^\star$ belong to the same exponential family of distributions, and $\theta^\star = (\theta_B, \theta_W^\star)$. The augmented posterior is of the form

$$p(\alpha, \mu_W, \Sigma_W^{-1}, \pi, \theta_B, \theta_W, x, \theta_W^\star, x^\star, y^\star \mid y) \propto p(\alpha, \mu_W, \Sigma_W^{-1}, \pi, \theta_B, \theta_W, x, y, \theta_W^\star, x^\star, y^\star). \tag{28}$$

Integrating out the auxiliary variables $\theta_W^\star$, $X^\star$, and $Y^\star$ results in the posterior of $\alpha$, $\mu_W$, $\Sigma_W^{-1}$, $\pi$, $\theta_B$, $\theta_W$, and $X$. While sampling from the posterior (25) is infeasible, sampling from the augmented posterior (28) and integrating out the auxiliary variables $\theta_W^\star$, $X^\star$, and $Y^\star$ turns out to be feasible.

We focus here on auxiliary-variable Markov chain Monte Carlo updates of $\theta_W$ and $X$ and provide details concerning $\alpha$, $\mu_W$, $\Sigma_W^{-1}$, $\pi$, and $\theta_B$ in Supplement A. A basic auxiliary-variable Metropolis-Hastings update of $\theta_W$ and $x$ can be described as follows.

(1) Sample $\theta_W^\star$, $X^\star$, and $Y^\star$:
(1.1) Sample $\theta_W^\star, X^\star \mid \pi, \theta_B, \theta_W, X = x, Y = y \sim q(\theta_W^\star, x^\star \mid \pi, \theta_B, \theta_W, x, y)$.
(1.2) Sample $Y^\star \mid \theta^\star, X^\star = x^\star \sim P_{\theta^\star}(Y^\star = y^\star \mid X^\star = x^\star)$, where $\theta^\star = (\theta_B, \theta_W^\star)$.
(2) Propose to swap the values of $(\theta_W, x)$ and $(\theta_W^\star, x^\star)$ and accept the proposal with probability $\min(1, h)$, where

$$\begin{aligned} h = {} & \frac{\prod_{k=1}^{K_{\max}} p(\theta_{W,k}^\star \mid \mu_W, \Sigma_W^{-1}) \, P_\pi(X = x^\star) \, P_{\theta^\star}(Y = y \mid X = x^\star)}{\prod_{k=1}^{K_{\max}} p(\theta_{W,k} \mid \mu_W, \Sigma_W^{-1}) \, P_\pi(X = x) \, P_\theta(Y = y \mid X = x)} \\ & \times \frac{q(\theta_W, x \mid \pi, \theta_B, \theta_W^\star, x^\star, y) \, P_\theta(Y^\star = y^\star \mid X^\star = x)}{q(\theta_W^\star, x^\star \mid \pi, \theta_B, \theta_W, x, y) \, P_{\theta^\star}(Y^\star = y^\star \mid X^\star = x^\star)}. \end{aligned} \tag{29}$$

Remark: acceptance probability. The acceptance probability (29) of the auxiliary-variable Metropolis-Hastings update depends on the intractable within-neighborhood log partition functions through the ratios

$$\frac{P_{\theta^\star}(Y = y \mid X = x^\star) \, P_\theta(Y^\star = y^\star \mid X^\star = x)}{P_{\theta^\star}(Y^\star = y^\star \mid X^\star = x^\star) \, P_\theta(Y = y \mid X = x)}. \tag{30}$$

Since the conditional distributions of $Y$ and $Y^\star$ belong to the same exponential family of distributions, all intractable within-neighborhood log partition functions in (29) cancel. Therefore, the acceptance probability of the auxiliary-variable Metropolis-Hastings algorithm operating on the augmented state space is tractable, whereas the acceptance probability of Metropolis-Hastings algorithms operating on the original state space is intractable.

Remark: sampling $\theta_W^\star$ and $X^\star$. In Step (1.1), large moves from $\theta_W$ and $x$ may result in low acceptance rates of the auxiliary-variable Metropolis-Hastings algorithm. We therefore consider local moves from $\theta_W$ and $x$ by changing one or more within-neighborhood parameter vectors or one or more memberships. Local moves from $\theta_W$ may be generated from Gaussians centered at the present values, whereas local moves from $x$ may be generated from the full conditional distributions of memberships. It is worth noting that the full conditional distributions of memberships are not, in general, tractable, because the within-neighborhood log partition functions $\psi_{W,k}(\theta_{W,k})$ of local neighborhoods $k$ with $n_k \gg 5$ nodes are intractable. To construct auxiliary distributions which approximate the full conditional distributions of memberships, we approximate the intractable within-neighborhood log partition functions $\psi_{W,k}(\theta_{W,k})$ by variational methods (Wainwright and Jordan, 2008). Details are provided in Supplement B.

Remark: sampling $Y^\star$. Two remarks are in order. First, local moves from $\theta_W$ and $x$ require no more than local sampling of $Y^\star$, i.e., sampling subgraphs. Consider moving from $(\theta_W, x)$ to $(\theta_W^\star, x^\star)$, where $\theta_W^\star$ deviates from $\theta_W$ in $\theta_{W,k}$ and $x^\star = x$. Then the ratio of the probability masses of $y^\star$ in acceptance probability (29) reduces to

$$\frac{P_\theta(Y^\star = y^\star \mid X^\star = x)}{P_{\theta^\star}(Y^\star = y^\star \mid X^\star = x^\star)} = \frac{P_\theta(Y^\star = y^\star \mid X^\star = x)}{P_{\theta^\star}(Y^\star = y^\star \mid X^\star = x)} = \frac{P_\theta(Y^\star_{(kk)} = y^\star_{(kk)} \mid X^\star = x)}{P_{\theta^\star}(Y^\star_{(kk)} = y^\star_{(kk)} \mid X^\star = x)}. \tag{31}$$

In other words, to evaluate the acceptance probability (29), we need to sample a (small) subgraph rather than the whole (large) graph. Second, direct sampling of subgraphs is infeasible and exact sampling schemes are not known (but see the work in progress by Butts, 2012). However, Liang (2010) demonstrated that it is admissible to sample auxiliary variables by suitable reversible Markov chains with the observed data as initial state and the desired distribution as target distribution. The argument extends from the complete-data problem considered by Liang (2010) to the incomplete-data problem considered here, though we omit details. We follow the same approach in the incomplete-data problem considered here, i.e., sample subgraphs by suitable reversible Markov chains. The construction of suitable, reversible Markov chains is discussed by, e.g., Snijders (2002); Hunter and Handcock (2006); Handcock et al. (2010).

4.3. Non-identifiability of within-neighborhood parameters and membership indicators

A Bayesian Markov chain Monte Carlo approach along the lines of Section 4.2 suffers from the so-called label-switching problem (Stephens, 2000). The label-switching problem is rooted in the invariance of the likelihood function to switching the labels of local neighborhoods, resulting in non-identifiable within-neighborhood parameters $\theta_{W,1}, \dots, \theta_{W,K_{\max}}$ and membership indicators $X_1, \dots, X_n$. As a result, in un-processed Markov chain Monte Carlo samples from the posterior, the labels of local neighborhoods may have switched multiple times and statistical inference which depends on the labels of local neighborhoods cannot be based on un-processed samples.
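The invariance behind label-switching can be illustrated with a deliberately simplified model. The Python sketch below uses a toy likelihood with independent edges (one within-neighborhood edge probability per neighborhood and a common between-neighborhood edge probability). This is an assumption made for brevity and is not the hierarchical ERGM itself; all names and numbers are hypothetical. Permuting the neighborhood labels of the memberships and of the within-neighborhood parameters jointly leaves the log-likelihood unchanged, so the labels carry no information.

```python
import itertools
import math

def log_likelihood(y, x, p_within, p_between):
    """Toy block-model log-likelihood; y maps node pairs to 0/1, x[i] is the label of node i."""
    total = 0.0
    for (i, j), y_ij in y.items():
        p = p_within[x[i]] if x[i] == x[j] else p_between
        total += math.log(p) if y_ij else math.log(1.0 - p)
    return total

def relabel(x, p_within, perm):
    """Apply the label permutation perm to memberships and parameters jointly."""
    new_x = [perm[k] for k in x]
    new_p = [0.0] * len(p_within)
    for k, p in enumerate(p_within):
        new_p[perm[k]] = p
    return new_x, new_p

# Hypothetical example: 6 nodes in 3 neighborhoods.
x = [0, 0, 0, 1, 1, 2]
p_within = [0.8, 0.6, 0.5]
y = {(i, j): int(x[i] == x[j]) for i, j in itertools.combinations(range(6), 2)}
y[(2, 3)] = 1  # one between-neighborhood edge

base = log_likelihood(y, x, p_within, 0.05)
for perm in itertools.permutations(range(3)):
    x_perm, p_perm = relabel(x, p_within, perm)
    assert math.isclose(base, log_likelihood(y, x_perm, p_perm, 0.05))
```

Relabeling only the memberships, without the parameters, does change the likelihood in general, which is why summaries that depend on labels require relabeled samples.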
We follow the Bayesian decision-theoretic approach of Stephens (2000) to undo the label-switching, but introduce a stochastic version of the relabeling algorithm of Stephens (2000), which is based on Simulated Annealing (Liu, 2008) and reduces computing time when $K_{\max}$ is moderate or large. Details are provided in Supplement C.

5. Comparing ERGMs and hierarchical ERGMs

A natural approach to comparing ERGMs and hierarchical ERGMs is based on model predictions, because degenerate ERGMs tend to be incapable of generating graphs which resemble real-world graphs and in practice it is thus imperative to inspect model predictions (e.g., Hunter et al., 2008). Prior predictive checks can give tentative answers to two central questions: Do hierarchical ERGMs place much prior predictive mass on graphs which resemble real-world graphs? Do hierarchical ERGMs place much prior predictive mass on extreme graphs? In short, can hierarchical ERGMs a priori be recommended as models of data? Posterior predictive checks complement prior predictive checks by assessing the goodness of fit of hierarchical ERGMs and answering the question of whether hierarchical ERGMs can be recommended a posteriori given data.

We compare ERGMs and hierarchical ERGMs capturing transitive closure, because transitive closure is one of the most interesting and most problematic forms of dependence. The ERGM considered here is of the form

$$P_\theta(Y = y) \propto \exp\left[\theta_1 s_1(y) + \theta_2 s_2(y)\right], \tag{32}$$

where the sufficient statistics are given by the number of edges $y_{ij}$ and triangles $y_{ij} y_{jh} y_{ih}$.
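As a small illustration of the two sufficient statistics in (32), the following Python sketch counts edges and triangles of an undirected graph stored as a symmetric 0/1 adjacency matrix and evaluates the exponent $\theta_1 s_1(y) + \theta_2 s_2(y)$, i.e., the log-probability up to the intractable normalizing constant. The function names, the example graph, and the parameter values are hypothetical.

```python
import itertools

def count_edges(adj):
    """s1(y): number of pairs i < j with y_ij = 1."""
    n = len(adj)
    return sum(adj[i][j] for i, j in itertools.combinations(range(n), 2))

def count_triangles(adj):
    """s2(y): number of triples i < j < h with y_ij * y_jh * y_ih = 1."""
    n = len(adj)
    return sum(adj[i][j] * adj[j][h] * adj[i][h]
               for i, j, h in itertools.combinations(range(n), 3))

def log_unnormalized(adj, theta1, theta2):
    """Exponent of (32); the log normalizing constant is not included."""
    return theta1 * count_edges(adj) + theta2 * count_triangles(adj)

# Hypothetical 4-node graph: triangle 0-1-2 plus the pendant edge 2-3.
adj = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
print(count_edges(adj), count_triangles(adj))  # 4 1
print(log_unnormalized(adj, -1.0, 0.5))        # -3.5
```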
Its natural companion is the hierarchical ERGM with between-neighborhood PMFs

$$P_\theta(Y_{(kl)} = y_{(kl)} \mid X = x) \propto \exp\left[\theta_B \, s_B(y_{(kl)})\right], \tag{33}$$

where the sufficient statistic is given by the number of edges $y_{ij}$ between local neighborhoods $k$ and $l$, and within-neighborhood PMFs

$$P_\theta(Y_{(kk)} = y_{(kk)} \mid X = x) \propto \exp\left[\theta_{W,k,1} \, s_{W,k,1}(y_{(kk)}) + \theta_{W,k,2} \, s_{W,k,2}(y_{(kk)})\right], \tag{34}$$

where the sufficient statistics are given by the number of edges $y_{ij}$ and triangles $y_{ij} y_{jh} y_{ih}$ within local neighborhood $k$.

We used the R packages ergm (Handcock et al., 2010), Bergm (Caimo and Friel, 2010), and hergm to obtain the results presented here.

5.1. Prior predictive checks

The prior predictive distribution under ERGM (32) can be written as

$$P(Y = y) = \int p(\theta) \, P_\theta(Y = y) \, d\theta, \tag{35}$$

where $p(\theta)$ denotes the prior. Based on experience, values of $\theta_1$ outside of $(-5, 0)$ and values of $\theta_2$ outside of $(0, 5)$ index near-degenerate distributions. Therefore, we choose independent, uniform priors given by $\theta_1 \sim \text{Uniform}(-5, 0)$ and $\theta_2 \sim \text{Uniform}(0, 5)$.

Fig. 2. Prior predictions of the number of edges (left) and triangles (right) under ERGM (32) with n = 100 and N = 4,950

Fig. 3. Prior predictions of the number of edges (left) and triangles (right) under the hierarchical ERGM corresponding to (33) and (34) with n = 100 and N = 4,950

The prior predictive distribution under the hierarchical ERGM corresponding to (33) and (34) can be written as

$$\begin{aligned} P(Y = y) = \int \cdots \int \sum_{x \in \mathcal{X}} & p(\alpha, \mu_W, \Sigma_W^{-1}, \pi, \theta_B, \theta_W) \\ & \times P_\pi(X = x) \, P_\theta(Y = y \mid X = x) \, d\alpha \, d\mu_W \, d\Sigma_W^{-1} \, d\pi \, d\theta_B \, d\theta_W, \end{aligned} \tag{36}$$

where $p(\alpha, \mu_W, \Sigma_W^{-1}, \pi, \theta_B, \theta_W)$ denotes the prior. To make the prior of $\theta_W$ under the hierarchical ERGM comparable to the prior of $\theta$ under the ERGM, we assign independent, uniform priors $\theta_{W,k,1} \sim \text{Uniform}(-5, 0)$ and $\theta_{W,k,2} \sim \text{Uniform}(0, 5)$ to the within-neighborhood parameters $\theta_{W,k,1}$ and $\theta_{W,k,2}$, respectively. To respect the sparse and local nature of graphs, we assume that the between-neighborhood parameter $\theta_B$ is governed by the prior $\theta_B \sim N(-5, 1)$, i.e., there tend to be more edges within than between local neighborhoods. The prior for $\pi$ is given by the truncated stick-breaking prior with $\alpha = 5$. We consider graphs with $n = 100$ nodes and $N = 4{,}950$ edge variables and let the maximum number of local neighborhoods be $K_{\max} = 50$.

The prior predictive distributions can be sampled by Markov chain Monte Carlo methods: direct sampling of parameters and between-neighborhood subgraphs is straightforward, while within-neighborhood subgraphs can be sampled by Markov chain Monte Carlo methods along the lines of Hunter and Handcock (2006). Monte Carlo samples of size 10,000 were generated from the prior of the ERGM and hierarchical ERGM and, for every one of the draws from the prior, a prediction was generated by a Markov chain of length 100,000, accepting the final draw of the Markov chain as a draw from the prior predictive distribution.

Figure 2 shows prior predictions of the number of edges and triangles under ERGM (32). The bulk of the prior predictive mass is placed on extreme graphs with few edges and triangles and graphs with almost all possible edges and triangles. In contrast, Figure 3 shows that the hierarchical ERGM corresponding to (33) and (34) places much prior predictive mass on graphs which resemble real-world graphs: i.e., graphs where the average number of edges of nodes, given by $\sum_{i,j} y_{ij} / n$, ranges from 2 to 20 and where the number of triangles is a small multiple of the number of edges, which covers a wide range of real-world graphs.
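The part of a prior predictive draw that can be sampled directly is easy to sketch. The Python code below draws $\pi$ by truncated stick-breaking as in (22) to (24), draws memberships from $\pi$, draws $\theta_B \sim N(-5, 1)$, and then samples between-neighborhood edges as independent Bernoulli variables with success probability $\exp(\theta_B) / (1 + \exp(\theta_B))$, which is what the edge-only PMF (33) implies. Within-neighborhood subgraphs are deliberately left out, since they require Markov chain Monte Carlo sampling. The function names and the random seed are assumptions made for illustration; this is not the hergm implementation.

```python
import itertools
import math
import random

def truncated_stick_breaking(alpha, k_max, rng):
    """Draw pi = (pi_1, ..., pi_Kmax) with V_k ~ Beta(1, alpha) and V_Kmax = 1."""
    v = [rng.betavariate(1.0, alpha) for _ in range(k_max - 1)] + [1.0]
    pi, remaining = [], 1.0
    for v_k in v:
        pi.append(v_k * remaining)  # pi_k = V_k * prod_{j<k} (1 - V_j)
        remaining *= 1.0 - v_k
    return pi

def draw_between_edges(n, alpha, k_max, rng):
    """Draw pi, memberships, theta_B, and the between-neighborhood edges."""
    pi = truncated_stick_breaking(alpha, k_max, rng)
    x = rng.choices(range(k_max), weights=pi, k=n)
    theta_b = rng.gauss(-5.0, 1.0)  # mean -5, standard deviation 1
    p_edge = 1.0 / (1.0 + math.exp(-theta_b))
    edges = [(i, j) for i, j in itertools.combinations(range(n), 2)
             if x[i] != x[j] and rng.random() < p_edge]
    return pi, x, theta_b, edges

rng = random.Random(1)  # arbitrary seed for reproducibility
pi, x, theta_b, edges = draw_between_edges(100, 5.0, 50, rng)
print(len(pi), len(set(x)), len(edges))
```

Because $\theta_B$ is centered at $-5$, the between-neighborhood edge probability is typically below one percent, which reflects the sparsity between neighborhoods described above.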
In addition, the prior predictive distribution is bimodal under the ERGM, while unimodal under the hierarchical ERGM. A tentative answer to the questions raised at the start of Section 5 is therefore: hierarchical ERGMs are capable of generating graphs which resemble real-world graphs, in contrast to ERGMs, and can thus be recommended a priori as models of real-world graphs.

5.2. Posterior predictive checks

To compare ERGMs and hierarchical ERGMs in terms of posterior predictions, we select two data sets: the terrorist network behind the Bali bombing in 2002 as well as the classic Sampson network.

The posterior predictive distribution under ERGM (32) given data $y$ can be written as

$$P(\dot{Y} = \dot{y} \mid Y = y) = \int p(\theta \mid y) \, P_\theta(\dot{Y} = \dot{y}) \, d\theta, \tag{37}$$

where $p(\theta \mid y)$ denotes the posterior. The posterior predictive distribution under the hierarchical ERGM corresponding to (33) and (34) can be written as

$$\begin{aligned} P(\dot{Y} = \dot{y} \mid Y = y) = \int \cdots \int \sum_{x \in \mathcal{X}} & p(\alpha, \mu_W, \Sigma_W^{-1}, \pi, \theta_B, \theta_W, x \mid y) \\ & \times P_\theta(\dot{Y} = \dot{y} \mid X = x) \, d\alpha \, d\mu_W \, d\Sigma_W^{-1} \, d\pi \, d\theta_B \, d\theta_W, \end{aligned} \tag{38}$$

where $p(\alpha, \mu_W, \Sigma_W^{-1}, \pi, \theta_B, \theta_W, x \mid y)$ denotes the posterior. Independent priors $\theta_i \sim N(0, 25)$ are used in the case of the ERGM and independent priors $\alpha \sim \text{Gamma}(1, 1)$, $\mu_{W,i} \sim N(0, 1)$, and $\sigma_{W,i}^{-2} \sim \text{Gamma}(10, 10)$ in the case of the hierarchical ERGM. A total of 120,000 draws from the posterior predictive distribution of the ERGM were generated by the Markov chain Monte Carlo algorithm of Caimo and Friel (2011), with a burn-in of 20,000 and saving every 10th post-burn-in draw, and 1,200,000 draws from the posterior predictive distribution of the hierarchical ERGM were generated by the Markov chain Monte Carlo algorithm of Section 4, with a burn-in of 200,000 and saving every 100th post-burn-in draw.

5.2.1. Terrorist network behind Bali bombing in 2002

The structure of terrorist networks is of interest with an eye to understanding how terrorists communicate, to identify cells (i.e., subsets of terrorists), to isolate cells, and to dismantle them.
We consider here the network of terrorists behind the bombing in Bali, Indonesia in 2002, which killed 202 people (Koschade, 2006). The 17 terrorists who carried out the bombing were members of the Southeast Asian al-Qaeda affiliate Jemaah Islamiyah. The terrorist network can be represented by a graph with $n = 17$ nodes and $N = 136$ edge variables, where $Y_{ij} = 1$ if terrorists $i$ and $j$ were in contact prior to the bombing and $Y_{ij} = 0$ otherwise. The terrorist network is shown in Figure 4.

Fig. 4. Terrorist network behind Bali bombing in 2002. The posterior membership probabilities are represented by colored pie charts. Node labels: Octavia, Arnasan, Azahari, Dulmatin, Junaedi, Hidayat, Feri, Ghoni, Sarijo, Samudra, Rauf, Imron, Patek, Idris, Muklas, Mubarok, Amrozi

We start by determining the maximum number of local neighborhoods $K_{\max}$ to truncate the prior. Using strategy I sketched in Section 4.1, we compare the hierarchical ERGM corresponding to (33) and (34) with up to $K_{\max} = 1, 2, 3, 4, 5$ local neighborhoods in terms of predictive power. Predictive power is taken to be the root mean square deviation of the predicted number of triangles. According to Figure 5, the hierarchical ERGM with $K_{\max} = 2$ is superior to the hierarchical ERGM with $K_{\max} = 1$, which is equivalent to ERGM (32). The hierarchical ERGM with $K_{\max} = 3$ in turn is superior to the hierarchical ERGM with $K_{\max} = 2$, but increasing $K_{\max}$ from 3 to 5 does not increase the predictive power.

We compare ERGM (32) and the hierarchical ERGM corresponding to (33) and (34) with up to $K_{\max} = 5$ local neighborhoods in terms of the posterior predictive distribution of the number of edges and triangles, shown in Figures 6 and 7. Under the ERGM, the posterior predictive distribution is bimodal. In contrast, the posterior predictive distribution under the hierarchical ERGM is unimodal and places most mass on graphs which are close to the observed graph in terms of the number of edges and triangles.
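Strategy I can be sketched in a few lines of Python. The code below assumes that the root mean square deviation is taken between posterior predictive draws of the number of triangles and the observed number of triangles, which is our reading of the criterion and not a definition quoted from the paper. The prediction values and the observed count are made-up toy numbers, not results for the terrorist network, and the function names are hypothetical.

```python
import math

def rmsd(predicted, observed):
    """Root mean square deviation of predicted counts from one observed count."""
    return math.sqrt(sum((p - observed) ** 2 for p in predicted) / len(predicted))

def best_k_max(predictions_by_k, observed):
    """Return the K_max with the smallest RMSD and all RMSDs; ties go to the smallest K_max."""
    scores = {k: rmsd(draws, observed) for k, draws in predictions_by_k.items()}
    best = min(sorted(scores), key=lambda k: scores[k])
    return best, scores

# Made-up numbers purely for illustration; not results from the paper.
observed_triangles = 10
toy_predictions = {1: [0, 40, 2, 45], 2: [6, 15, 9, 14], 3: [9, 11, 10, 12]}
best, scores = best_k_max(toy_predictions, observed_triangles)
print(best, {k: round(v, 2) for k, v in scores.items()})
```

In practice, one would also check whether the improvement from a larger $K_{\max}$ is large enough to matter, as in the comparison of $K_{\max} = 3$ and $K_{\max} = 5$ above.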
We note that, while other statistics may be used to compare the ERGM and hierarchical ERGM in terms of goodness of fit, the choice of goodness of fit statistics may not influence the main conclusions much. The fact that the ERGM places so much mass on dense graphs with almost all edges and triangles indicates that the ERGM fits much worse than the hierarchical ERGM, no matter which goodness of fit statistics are chosen, because the topology of graphs which are local in nature, such as the observed graph, stands in sharp contrast to the topology of dense graphs in terms of connectivity, centrality, transitivity, and other interesting features of graphs (e.g., Kolaczyk, 2009).

Fig. 5. Terrorist network: hierarchical ERGM corresponding to (33) and (34) with up to Kmax = 1, 2, 3, 4, 5 local neighborhoods: root mean square deviation of predicted number of triangles plotted against maximum number of local neighborhoods Kmax

The posterior of $\alpha$, $\mu_{W,1}$, $\mu_{W,2}$, $\sigma_{W,1}^{-1}$, and $\sigma_{W,2}^{-1}$ is shown in Table 1. The mean parameters $\mu_{W,1}$ and $\mu_{W,2}$ governing the within-neighborhood parameters tend to be both positive, and more so the mean parameter $\mu_{W,2}$ governing the within-neighborhood triangle parameters, which is not surprising in the light of the large number of edges and triangles within local neighborhoods.

Table 1. Terrorist network: posterior of parameters $\alpha$, $\mu_{W,1}$, $\mu_{W,2}$, $\sigma_{W,1}^{-1}$, and $\sigma_{W,2}^{-1}$

parameter              .05 quantile   .50 quantile   .95 quantile   odds of parameter being positive
$\alpha$               .36            1.32           3.43           ∞
$\mu_{W,1}$            -1.03          .45            2.00           2.22
$\mu_{W,2}$            -.27           .91            2.22           8.74
$\sigma_{W,1}^{-1}$    .55            .98            1.59           ∞
$\sigma_{W,2}^{-1}$    .55            .98            1.57           ∞

Last, while the primary purpose of introducing local neighborhoods is the desire to address the model degeneracy and striking lack of fit of ERGMs, predictions of the memberships to local neighborhoods may be of interest as well, e.g., to identify cells.
The pie charts in Figure 4 represent the posterior membership probabilities reported by the stochastic relabeling algorithm described in Supplement C. The 5 green-colored terrorists turn out to be the 5 members of the so-called support group, which was supposed to support the so-called main group consisting of all other terrorists. The members of the main group tend to be black-colored, with the exception of Amrozi and Mubarok who are more red-colored than black-colored. Indeed, while Amrozi and Mubarok belonged to the main group, both resided elsewhere and were almost isolated from the rest of the main group (Koschade, 2006). Most interesting is the membership of Feri. He was a member of the main group and was the suicide bomber who initiated the attack. Feri arrived two days before the attack, whereas all other members of the main group had arrived days or weeks earlier and in fact started leaving the night Feri arrived (Koschade, 2006). As a result, Feri had limited opportunities to communicate with others. In particular, Feri was the one and only member of the main group who did not communicate with the three commanders Muklas (the Jemaah Islamiyah head of operations in Singapore and Malaysia), Samudra (the field commander), and Idris (the logistics commander) (Koschade, 2006). Therefore, the network position of Feri is unique and the uncertainty about his membership is reflected in the posterior membership probability distribution.

Fig. 6. Terrorist network: posterior predictions of the number of edges (left) and triangles (right) under ERGM (32); vertical lines represent observed numbers

Fig. 7. Terrorist network: posterior predictions of the number of edges (left) and triangles (right) under the hierarchical ERGM corresponding to (33) and (34); vertical lines represent observed numbers

In conclusion, the hierarchical ERGM framework admits the specification of models capturing simple and interesting features of the terrorist network and, under the parameterization of the hierarchical ERGM with within-neighborhood edges and triangles, posterior membership predictions are consistent with on-the-ground knowledge of the terrorist network.

5.2.2. Classic Sampson network

The Sampson network (de Nooy et al., 2005, pp. 87–95) is a classic data set used by, e.g., Frank and Strauss (1986); Strauss and Ikeda (1990); Handcock (2003a); Caimo and Friel (2011). Sampson studied social relations among a group of novices who were preparing to enter a monastic order. The network corresponds to $N = 306$ relationships among $n = 18$ novices measured at three time points. We consider here the following directed edge variables $Y_{ij}$: if novice $i$ liked novice $j$ at any of the three time points, then $Y_{ij} = 1$, otherwise $Y_{ij} = 0$. The network is plotted in Figure 8.

Fig. 8. Sampson network. The posterior membership probabilities are represented by colored pie charts. Node labels: Winf, Boni, Mark, Albert, Simp, Greg, Elias, Hugh, Ambrose, John, Victor, Basil, Louis, Amand, Bonaven, Berth, Romul, Peter

A natural extension of ERGM (32) to directed graphs is given by

$$P_\theta(Y = y) \propto \exp\left[\theta_1 s_1(y) + \theta_2 s_2(y) + \theta_3 s_3(y)\right], \tag{39}$$

where the sufficient statistics are the number of edges $y_{ij}$, mutual edges $y_{ij} y_{ji}$, and transitive triples $y_{ij} y_{jh} y_{ih}$, and its natural companion is given by the hierarchical ERGM with between-neighborhood PMFs

$$P_\theta(Y_{(kl)} = y_{(kl)} \mid X = x) \propto \exp\left[\theta_{B,1} \, s_{B,1}(y_{(kl)}) + \theta_{B,2} \, s_{B,2}(y_{(kl)})\right], \tag{40}$$

where the sufficient statistics are given by the number of edges $y_{ij}$ and mutual edges $y_{ij} y_{ji}$ between local neighborhoods $k$ and $l$, and within-neighborhood PMFs

$$P_\theta(Y_{(kk)} = y_{(kk)} \mid X = x) \propto \exp\left[\theta_{W,k,1} \, s_{W,k,1}(y_{(kk)}) + \theta_{W,k,2} \, s_{W,k,2}(y_{(kk)}) + \theta_{W,k,3} \, s_{W,k,3}(y_{(kk)})\right], \tag{41}$$

where the sufficient statistics are given by the number of edges $y_{ij}$, mutual edges $y_{ij} y_{ji}$, and transitive triples $y_{ij} y_{jh} y_{ih}$ within local neighborhood $k$. Since experts argue that the novices are divided into 3 or 4 groups (de Nooy et al., 2005, pp. 87–95), we follow strategy II sketched in Section 4.1 and set $K_{\max} = 5$, which can be considered to be an upper bound on the number of local neighborhoods.

Figures 9 and 10 show posterior predictions of the number of edges, mutual edges, and transitive triples. The contrast between the ERGM and the hierarchical ERGM in terms of goodness of fit is at least as striking as in the case of the terrorist network in Section 5.2.1.

The problematic nature of the ERGM is underlined by the posterior of the number of non-empty local neighborhoods of the hierarchical ERGM. Figure 11 shows that the posterior places negligible mass on partitions of the set of nodes where all nodes are assigned to one local neighborhood, which corresponds to ERGM (39). In addition, the posterior mode is 3, which is in line with expert knowledge (de Nooy et al., 2005, pp. 87–95).
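The three statistics of the directed ERGM (39) can be computed as in the following Python sketch. The summation ranges are an assumption on our part, since only the products are stated above: edges are counted over ordered pairs $i \neq j$, mutual edges over unordered pairs, and transitive triples over ordered triples of distinct nodes. The function name and the example graph are hypothetical.

```python
import itertools

def directed_statistics(adj):
    """Return (edges, mutual edges, transitive triples) of a directed 0/1 adjacency matrix."""
    n = len(adj)
    edges = sum(adj[i][j] for i, j in itertools.permutations(range(n), 2))
    mutual = sum(adj[i][j] * adj[j][i]
                 for i, j in itertools.combinations(range(n), 2))
    transitive = sum(adj[i][j] * adj[j][h] * adj[i][h]
                     for i, j, h in itertools.permutations(range(n), 3))
    return edges, mutual, transitive

# Hypothetical 3-node graph with edges 0->1, 1->0, 0->2, 1->2.
adj = [
    [0, 1, 1],
    [1, 0, 1],
    [0, 0, 0],
]
print(directed_statistics(adj))  # (4, 1, 2)
```

In the example, the two transitive triples are $(0, 1, 2)$ and $(1, 0, 2)$: in both cases the first node sends edges to the other two and the second node sends an edge to the third.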
The local neighborhoods correspond, once again, to physical groups: the posterior membership probabilities shown in Figure 8 agree with the three-group division of novices into “Loyals,” “Turks,” and “Outcasts” advocated by most experts (de Nooy et al., 2005, pp. 87–95).

Fig. 9. Sampson network: posterior predictions of the number of edges (left), mutual edges (middle), and transitive triples (right) under ERGM (39); vertical lines represent observed numbers

Fig. 10. Sampson network: posterior predictions of the number of edges (left), mutual edges (middle), and transitive triples (right) under the hierarchical ERGM corresponding to (40) and (41); vertical lines represent observed numbers

6. Discussion

The most important conclusion is that Markov dependence along the lines of Frank and Strauss (1986) and other forms of network dependence are not problematic as long as dependence is sufficiently local. We have introduced hierarchical ERGMs which allow dependence to be sufficiently local and we have demonstrated that hierarchical ERGMs can be recommended as models of data, both a priori and a posteriori. Hierarchical ERGMs can be expected to be superior to ERGMs in terms of goodness of fit as long as the data set in question is sparse and local in nature. As we have pointed out, many data sets in the social and health sciences and biology are indeed sparse and local in nature, though technological networks (e.g., the World Wide Web, Twitter) may be an exception.

The class of hierarchical ERGMs introduced here can be considered to be the first model of the “next generation of social network models” (Snijders, 2007, p. 324): i.e., the first model which combines latent structure models (e.g., Nowicki and Snijders, 2001; Hoff et al., 2002; Schweinberger and Snijders, 2003; Handcock et al., 2007) and ERGMs in a way that takes advantage of the strengths of ERGMs (i.e., the power of ERGMs to model dependencies) while reducing the weaknesses of ERGMs (i.e., the fact that Markov dependence along the lines of Frank and Strauss (1986) is more global than local in nature). We note that a partition of the set of nodes $\mathcal{N}$ can be considered to constitute a latent, discrete space. Let $d : \mathcal{N} \times \mathcal{N} \mapsto \mathbb{R}_0^+$ be a distance function such that $d(i, j) = 0$ if and only if $i = j$, $d(i, j) = 1$ if $i$ and $j$ are distinct members of the same local neighborhood, and $d(i, j) = 2$ otherwise. Then $d$ satisfies reflexivity, symmetry, and the triangle inequality and is thus a metric, and the probability of an observed graph depends on $d$. In contrast to simple latent structure models (e.g., Nowicki and Snijders, 2001; Hoff et al., 2002; Schweinberger and Snijders, 2003; Handcock et al., 2007), which assume dyads to be independent conditional on $d$, hierarchical ERGMs assume dyads to be locally dependent while globally independent conditional on $d$. It is evident that other metrics may be used, but the hierarchical ERGM framework is a simple starting point and the local dyad-dependence and global dyad-independence have conceptual and computational advantages.

Fig. 11. Sampson network: posterior of number of non-empty local neighborhoods under the hierarchical ERGM corresponding to (40) and (41)

Simulation and statistical inference for hierarchical ERGMs are implemented in the R package hergm, which will be made available in the future. Owing to the fact that in most applications suitable local neighborhood structure is not observed and the posterior is doubly intractable, statistical inference for hierarchical ERGMs is expensive.
Despite the expensive computations, we believe that hierarchical ERGMs are simple and attractive alternatives to ERGMs and superior in terms of goodness of fit as long as the data set in question is sparse and local in nature.

Acknowledgements

We acknowledge support from the Netherlands Organisation for Scientific Research (NWO grant 446-06-029) (MS), the National Institutes of Health (NIH grant 1R01HD052887-01A2) (MS), and the Office of Naval Research (ONR grant N00014-08-1-1015) (MS, MSH). We are grateful to Johan Koskinen for valuable comments and suggestions on drafts of the manuscript.

Proof of proposition (7) and (8)

Let $y_1, y_2, \dots$ be a monotone sequence of graphs. By assumption, $f_{ij}(y_n) = 0$ for $n = 1, 2$ and $f_{ij}(y_n) \geq 0$ for all $n > 2$, implying

$$s_2(y_n) = 0 \text{ for } n = 1, 2, \qquad s_2(y_n) - s_2(y_{n-1}) \geq 0 \text{ for all } n > 2. \tag{42}$$

By (6), there exists a monotone sequence of graphs $y_1, y_2, \dots$ and, for any $C > 0$, however large, there exists $n_C > 1$ such that

$$s_2(y_n) = s_2(y_n) - s_2(y_2) = \sum_{m=3}^{n} \left[s_2(y_m) - s_2(y_{m-1})\right] > C \, (n - 1) \, (n - n_C) \text{ for all } n > n_C. \tag{43}$$

Therefore, the sufficient statistic $s_2(y_n)$ is unstable in the sense of Schweinberger (2011) whereas $s_1(y_n)$ is stable, and (7) and (8) follow from Theorem 3 and Corollary 2 of Schweinberger (2011).

References

Airoldi, E., D. Blei, S. Fienberg, and E. Xing (2008). Mixed membership stochastic blockmodels. Journal of Machine Learning Research 9, 1981–2014.

Barndorff-Nielsen, O. E. (1978). Information and Exponential Families in Statistical Theory. New York: Wiley.

Baxter, R. J. (2007). Exactly Solved Models in Statistical Mechanics. New York: Dover.

Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society, Series B 36, 192–225.

Butts, C. T. (2011). Bernoulli graph bounds for general random graph models. Sociological Methodology 41, 299–345.

Butts, C. T. (2012). Manuscript in preparation. University of California, Irvine.

Caimo, A. and N. Friel (2010). Bayesian inference for exponential random graph models. R package version 1.2. http://CRAN.R-project.org/package=Bergm.

Caimo, A. and N. Friel (2011). Bayesian inference for exponential random graph models. Social Networks 33, 41–55.

Chatterjee, S. and P. Diaconis (2011). Estimating and understanding exponential random graph models. Technical report, Courant Institute of Mathematical Sciences, New York University.

de Nooy, W., A. Mrvar, and V. Batagelj (2005). Exploratory Social Network Analysis with Pajek. New York: Cambridge University Press.

Diaconis, P. and D. Ylvisaker (1979). Conjugate priors for exponential families. Annals of Statistics 7, 269–281.

Ferguson, T. (1973). A Bayesian analysis of some nonparametric problems. Annals of Statistics 1, 209–230.

Frank, O. and D. Strauss (1986). Markov graphs. Journal of the American Statistical Association 81 (395), 832–842.

Geyer, C. J. and E. A. Thompson (1992). Constrained Monte Carlo maximum likelihood for dependent data. Journal of the Royal Statistical Society, Series B 54, 657–699.

Handcock, M. (2003a). Assessing degeneracy in statistical models of social networks. Technical report, Center for Statistics and the Social Sciences, University of Washington. http://www.csss.washington.edu/Papers.

Handcock, M. (2003b). Statistical models for social networks: Inference and degeneracy. In R. Breiger, K. Carley, and P. Pattison (Eds.), Dynamic Social Network Modeling and Analysis: Workshop Summary and Papers. Washington, D.C.: National Academies Press.

Handcock, M. S., D. R. Hunter, C. T. Butts, S. M. Goodreau, M. Morris, and P. Krivitsky (2010). R package ergm version 2.2-2: A Package to Fit, Simulate and Diagnose Exponential-Family Models for Networks. http://CRAN.R-project.org/package=ergm.

Handcock, M. S., A. E. Raftery, and J. M. Tantrum (2007). Model-based clustering for social networks. Journal of the Royal Statistical Society, Series A 170, 301–354 (with discussion).

Hoff, P. D., A. E. Raftery, and M. S. Handcock (2002). Latent space approaches to social network analysis. Journal of the American Statistical Association 97, 1090–1098.

Holland, P. W. and S. Leinhardt (1981). An exponential family of probability distributions for directed graphs. Journal of the American Statistical Association 76 (373), 33–65.

Hunter, D. R., S. M. Goodreau, and M. S. Handcock (2008). Goodness of fit of social network models. Journal of the American Statistical Association 103 (481), 248–258.

Hunter, D. R. and M. S. Handcock (2006). Inference in curved exponential family models for networks. Journal of Computational and Graphical Statistics 15, 565–583.

Ishwaran, H. and L. F. James (2001). Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association 96 (453), 161–173.

Ising, E. (1925). Beitrag zur Theorie des Ferromagnetismus. Zeitschrift für Physik A 31, 253–258.

Jonasson, J. (1999). The random triangle model. Journal of Applied Probability 36, 852–876.

Jones, J. H. and M. Handcock (2003). Social networks: Sexual contacts and epidemic thresholds. Nature 423, 605–606.

Kolaczyk, E. D. (2009). Statistical Analysis of Network Data: Methods and Models. Springer.

Koschade, S. (2006). A social network analysis of Jemaah Islamiyah: The applications to counter-terrorism and intelligence. Studies in Conflict and Terrorism 29, 559–575.

Koskinen, J. H. (2009). Using latent variables to account for heterogeneity in exponential family random graph models. In S. M. Ermakov, V. B. Melas, and A. N. Pepelyshev (Eds.), Proceedings of the 6th St. Petersburg Workshop on Simulation Vol. II, pp. 845–849.

Koskinen, J. H., G. L. Robins, and P. E. Pattison (2010). Analysing exponential random graph (p-star) models with missing data using Bayesian data augmentation. Statistical Methodology 7 (3), 366–384.

Krivitsky, P. N., M. S. Handcock, and M. Morris (2011). Adjusting for network size and composition effects in exponential-family random graph models. Statistical Methodology 8, 319–339.

Lauritzen, S. (1996). Graphical Models. Oxford, UK: Oxford University Press.

Liang, F. (2010). A double Metropolis-Hastings sampler for spatial models with intractable normalizing constants. Journal of Statistical Computation and Simulation 80, 1007–1022.

Liu, J. S. (2008). Monte Carlo Strategies in Scientific Computing. New York: Springer.

Møller, J., A. N. Pettitt, R. Reeves, and K. K. Berthelsen (2006). An efficient Markov chain Monte Carlo method for distributions with intractable normalising constants. Biometrika 93, 451–458.

Murray, I., Z. Ghahramani, and D. J. MacKay (2006). MCMC for doubly-intractable distributions. In Proceedings of the 22nd Annual Conference on Uncertainty in Artificial Intelligence (UAI-06), pp. 359–366. AUAI Press.

Nowicki, K. and T. A. B. Snijders (2001). Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association 96 (455), 1077–1087.

Park, J. and M. E. J. Newman (2005). Solution for the properties of a clustered network. Physical Review E 72, 026136.

Pattison, P. and G. Robins (2002). Neighborhood-based models for social networks. In R. M. Stolzenberg (Ed.), Sociological Methodology, Volume 32, Chapter 9, pp. 301–337. Boston: Blackwell Publishing.

Petrescu-Prahova, M. and C. Butts (2008). Emergent coordinators in the World Trade Center disaster. International Journal of Mass Emergencies and Disasters 28, 133–168.

Richardson, S. and P. J. Green (1997). On Bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society, Series B 59, 731–792.

Rinaldo, A., S. E. Fienberg, and Y. Zhou (2009). On the geometry of discrete exponential families with application to exponential random graph models. Electronic Journal of Statistics 3, 446–484.

Schweinberger, M. (2011). Instability, sensitivity, and degeneracy of discrete exponential families. Journal of the American Statistical Association 106 (496), 1361–1370.

Schweinberger, M. and T. A. B. Snijders (2003). Settings in social networks: A measurement model. In R. M. Stolzenberg (Ed.), Sociological Methodology, Volume 33, Chapter 10, pp. 307–341. Boston & Oxford: Basil Blackwell.

Snijders, T. A. B. (2002). Markov chain Monte Carlo estimation of exponential random graph models. Journal of Social Structure 3, 1–40.

Snijders, T. A. B. (2007). Contribution to the discussion of Handcock, M.S., Raftery, A.E., and J.M. Tantrum, Model-based clustering for social networks. Journal of the Royal Statistical Society, Series A 170, 322–324.

Snijders, T. A. B., P. E. Pattison, G. L. Robins, and M. S. Handcock (2006). New specifications for exponential random graph models. Sociological Methodology 36, 99–153.

Stephens, M. (2000). Dealing with label-switching in mixture models. Journal of the Royal Statistical Society, Series B 62, 795–809.

Strauss, D. (1986). On a general class of models for interaction. SIAM Review 28, 513–527.

Strauss, D. and M. Ikeda (1990). Pseudolikelihood estimation for social networks. Journal of the American Statistical Association 85 (409), 204–212.

Wainwright, M. J. and M. I. Jordan (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning 1, 1–305.

Wang, Y. J. and G. Y. Wong (1987). Stochastic blockmodels for directed graphs. Journal of the American Statistical Association 82 (397), 8–19.

Wasserman, S. and P. Pattison (1996). Logit models and logistic regression for social networks: I. An introduction to Markov graphs and p∗. Psychometrika 61, 401–425.
