Sharing Clusters Among Related Groups: Hierarchical Dirichlet Processes
Y. W. Teh, M. I. Jordan, M. J. Beal & D. M. Blei, NIPS 2004
Presented by Yuting Qi, ECE Dept., Duke Univ., 08/26/05

Overview
Motivation; Dirichlet processes; hierarchical Dirichlet processes; inference; experimental results; conclusions.

Motivation
Multi-task learning as clustering. Goal: share clusters among multiple related (model-based) clustering problems. Approach: a hierarchical, nonparametric Bayesian DP mixture model — learn a generative model over the data, treating the class labels as hidden variables.

Dirichlet Processes
Let (Θ, B) be a measurable space, G0 a probability measure on that space, and α0 a positive real number. A Dirichlet process is the distribution of a random probability measure G over (Θ, B) such that, for every finite measurable partition (A1, ..., Ar) of Θ,
  (G(A1), ..., G(Ar)) ~ Dir(α0 G0(A1), ..., α0 G0(Ar)).
We write G ~ DP(α0, G0). A draw G from a DP is discrete with probability one: G = Σk βk δ(θk), where the atoms θk ~ G0 are i.i.d. and the weights βk are random and depend on α0; values drawn from G are therefore generally not distinct. Properties: E[G(A)] = G0(A) and Var[G(A)] = G0(A)(1 − G0(A)) / (α0 + 1).

Chinese Restaurant Processes
The CRP (the Pólya urn scheme): let φ1, ..., φi−1 be i.i.d. random variables distributed according to G, let θ1, ..., θK be the distinct values they take on, and let nk be the number of φi′ equal to θk, 0 < i′ < i. Integrating out G, the predictive distribution is
  φi | φ1, ..., φi−1 ~ Σk nk / (i − 1 + α0) δ(θk) + α0 / (i − 1 + α0) G0.
(This slide is from "Chinese Restaurants and Stick-Breaking: An Introduction to the Dirichlet Process", NLP Group, Stanford, Feb. 2005.)

DP Mixture Model
One of the most important applications of the DP: a nonparametric prior on the components of a mixture model:
  G ~ DP(α0, G0), φi | G ~ G, xi | φi ~ F(φi).
Why not apply a DP draw directly to density estimation? Because G is discrete; it is therefore used as a prior over mixture components rather than as the density itself.

HDP – Problem Statement
We have J groups of data {Xj}, j = 1, ..., J; each group Xj = {xji}, i = 1, ..., nj, is modeled with a mixture model whose mixing proportions are specific to that group. All groups share the same set of mixture components (the underlying clusters θk), but each group combines the components in its own proportions. Goal: discover the distribution of components within each group and across groups.

HDP – General Representation
G0 is the global probability measure, G0 ~ DP(r, H), with concentration parameter r and base measure H. Gj is the probability distribution for group j, Gj ~ DP(α0, G0). φji is the hidden parameter of the distribution F(φji) corresponding to observation xji. The overall model is a two-level DP:
  G0 | r, H ~ DP(r, H); Gj | α0, G0 ~ DP(α0, G0); φji | Gj ~ Gj; xji | φji ~ F(φji).
G0 places nonzero mass only on the atoms θk, which are i.i.d. draws from H: G0 = Σk βk δ(θk). Since each Gj is drawn from a DP whose base measure G0 is discrete, every Gj reuses the same atoms with group-specific weights: Gj = Σk πjk δ(θk).

HDP – Chinese Restaurant Franchise
First level (within each group, a DP mixture):
  Gj ~ DP(α0, G0), φji | Gj ~ Gj, xji | φji ~ F(φji).
Let φj1, ..., φj,i−1 be i.i.d. draws from Gj, let ψj1, ..., ψjTj be the distinct values (tables) they take on, and let njt be the number of φji′ equal to ψjt, 0 < i′ < i.
Second level (across groups, sharing components): the base measure of each group is itself a draw from a DP:
  ψjt | G0 ~ G0, G0 ~ DP(r, H).
Let θ1, ..., θK be the distinct values (dishes) taken on by the ψjt over all j, t, and let mk be the number of tables (j, t) with ψjt = θk. Values of φji are thus shared among groups. Integrating out G0 yields a CRP at the second level as well: a new table is served an existing dish θk with probability proportional to mk, or a new dish drawn from H with probability proportional to r.
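To make the two-level Chinese restaurant franchise concrete, here is a minimal generative sketch in Python. It is not code from the paper: the function name, the use of NumPy, and the standard-normal base measure H are all illustrative assumptions.

```python
import numpy as np

def crp_franchise(n_per_group, alpha0, r, base_draw, rng):
    """Generative sketch of the Chinese restaurant franchise (HDP prior).

    n_per_group : number of customers (data points) in each group
    alpha0      : group-level concentration (table choice within a restaurant)
    r           : top-level concentration (dish choice across restaurants)
    base_draw   : function rng -> one draw theta ~ H (assumed base measure)
    """
    dishes = []          # global atoms theta_k ~ H
    m = []               # m_k: number of tables (over all groups) serving dish k
    assignments = []     # per group: dish index k_jt for each customer's table
    for n_j in n_per_group:
        tables = []      # n_jt: customers seated at each table in this restaurant
        table_dish = []  # k_jt: dish served at each table
        labels = []
        for _ in range(n_j):
            # Sit at table t w.p. prop. to n_jt, or open a new table w.p. prop. to alpha0.
            probs = np.array(tables + [alpha0], dtype=float)
            t = rng.choice(len(probs), p=probs / probs.sum())
            if t == len(tables):
                # New table: order a dish from the global CRP over dishes.
                dprobs = np.array(m + [r], dtype=float)
                k = rng.choice(len(dprobs), p=dprobs / dprobs.sum())
                if k == len(dishes):      # brand-new dish: fresh draw theta ~ H
                    dishes.append(base_draw(rng))
                    m.append(0)
                m[k] += 1
                tables.append(0)
                table_dish.append(k)
            tables[t] += 1
            labels.append(table_dish[t])
        assignments.append(labels)
    return dishes, assignments

rng = np.random.default_rng(0)
# Illustrative H: standard normal over component parameters.
dishes, z = crp_franchise([50, 50, 50], alpha0=1.0, r=1.0,
                          base_draw=lambda rng: rng.normal(), rng=rng)
print(len(dishes), [sorted(set(g)) for g in z])
```

The dish labels returned per group play the role of the shared hidden cluster assignments: the same dish index can appear in several groups, which is exactly the component sharing the HDP is designed to induce.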
Inference – MCMC
Gibbs sampling of the posterior in the Chinese restaurant franchise: instead of dealing with φji and ψjt directly to obtain p(φ, ψ | X), we obtain p(t, k, θ | X) by sampling the index variables t, k, and θ, where t = {tji}, with tji the table index that φji is associated with (φji = ψj,tji), and k = {kjt}, with kjt the index of the dish that ψjt takes its value on (ψjt = θkjt). Knowing the prior given by the CRP franchise, the posterior is sampled iteratively: sample t, then k, then θ.

Experiments on Synthetic Data
Data description: three groups of data; each group is a Gaussian mixture; different groups can share the same clusters; each cluster contributes 50 2-D data points, with independent features.
[Figure: original data in the x(1)–x(2) plane, seven Gaussian clusters shared across the groups. Group 1 draws from clusters [1, 2, 3, 7], Group 2 from [3, 4, 5, 7], Group 3 from [5, 6, 1, 7].]
HDP definition: here F(xji | φji) is a Gaussian distribution with φji = {μji, σji}; each φji takes one of the values θk = {μk, σk}, k = 1, 2, .... The base measure H is a joint Normal-Gamma distribution: μ ~ N(m, σ/β) and σ⁻¹ ~ Gamma(a, b), where m, β, a, b are given hyperparameters. Goal: model each group as a Gaussian mixture, and model the cluster distribution over groups.

Results on Synthetic Data
Global distribution:
[Figure: estimated underlying distribution (components over all groups) in the x(1)–x(2) plane, together with the corresponding global mixing proportions by component index.]
The number of components is open-ended; only part of it is shown.
Mixture within each group:
[Figure: estimated mixing proportions over the data within each of the three groups, by component index.]
The number of components in each group is also open-ended; only part of it is shown.

Conclusions & Discussions
This hierarchical Bayesian method can automatically determine the appropriate number of mixture components. A set of DPs coupled through their shared base measure achieves component sharing among groups. The DPs serve as nonparametric priors; this is not nonparametric density estimation.
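For readers who want to reproduce the flavor of the synthetic-data experiment above, here is a minimal data-generation sketch. It is hypothetical: the slides give the group-to-cluster memberships and the 50-points-per-cluster design, but not the cluster means or variances, so the locations and scale below are placeholder assumptions.

```python
import numpy as np

# Seven shared 2-D Gaussian clusters; three groups that each mix four of them;
# 50 points per cluster; independent features (diagonal covariance).
rng = np.random.default_rng(0)

means = {1: (1, 1), 2: (3, 1), 3: (5, 1), 4: (7, 3),
         5: (5, 5), 6: (3, 5), 7: (4, 3)}   # assumed cluster locations
scale = 0.3                                  # assumed per-axis std dev

groups = {1: [1, 2, 3, 7],                   # memberships from the slides
          2: [3, 4, 5, 7],
          3: [5, 6, 1, 7]}

data = {}
for g, clusters in groups.items():
    pts = [rng.normal(means[k], scale, size=(50, 2)) for k in clusters]
    data[g] = np.vstack(pts)                 # 200 x 2 array per group

for g, x in data.items():
    print(f"group {g}: {x.shape[0]} points from clusters {groups[g]}")
```

Because clusters 1, 3, 5, and 7 each appear in more than one group, an HDP mixture fit to such data should recover a single shared component for each of them, with group-specific mixing proportions — the behavior reported in the results above.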