The 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining
26-29 August, 2012, Kadir Has University, Istanbul, Turkey
On Finding Fine-Granularity User Communities by Profile Decomposition
Seulki Lee, Minsam Ko, Keejun Han, Jae-Gil Lee
Department of Knowledge Service Engineering
KAIST (Korea Advanced Institute of Science and Technology)
{seulki15, minsam.ko, brianhan87}@gmail.com, jaegil@kaist.ac.kr
2
Table of Contents
• Introduction
• DecompClus Algorithm
• Evaluation
• Related Work
• Conclusion
3
Community Discovery
▪ Community discovery is one of the most popular tasks in social network analysis.
▪ Many real-world applications rely on community discovery:
• Advertisement to common-interest groups
• Recommendation of potential collaborators in workplaces
4
Relationships in Social Networks
▪ A social network is modeled as a huge graph.
• A node is a user.
• An edge is a relationship between users.
▪ Two types of relationships in a social network:
• Explicit relationship: follower/following, friend
• Implicit relationship: users who do not know each other but have similar interests
▪ We focus on the implicit relationship.
5
Extracting implicit relationships
▪ To extract implicit relationships, a user is typically represented by his/her profile, and the similarity between user profiles is measured.
▪ The form of the profile depends on the social network and the application.
• In DBLP, the profile is the list of papers the user wrote.
• In Twitter, the profile is the list of tweets the user posted.
[Figure: the similarity between User A’s profile and User B’s profile is taken as their implicit relationship.]
6
Limitation of a Single Profile
▪ Generally, a user is described by only a single profile, which oversimplifies the user’s multiple characteristics.
▪ This problem results in the loss of meaningful communities.
▪ Example: although User A and User B share an interest in photography, the overall similarity between the two users is not very high.
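The dilution effect can be illustrated with a small sketch. The two users, their terms, and the term counts below are hypothetical and are not taken from the authors' data; plain term counts with cosine similarity stand in for whatever profile representation an application uses.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical users: both are interested in photography,
# but each also has a different, larger interest.
a_photo = Counter({"photo": 3, "lens": 2})
a_hiking = Counter({"outdoor": 6, "hiking": 5})
b_photo = Counter({"photo": 2, "lens": 3})
b_art = Counter({"art": 6, "museum": 5})

whole_a = a_photo + a_hiking  # User A described by a single profile
whole_b = b_photo + b_art     # User B described by a single profile

overall = cosine(whole_a, whole_b)  # similarity of the whole profiles
shared = cosine(a_photo, b_photo)   # similarity of the photography parts only
print(f"whole profiles: {overall:.3f}, photography parts: {shared:.3f}")
```

With these made-up counts the whole-profile similarity is much lower than the similarity of the photography parts alone, which is the situation the slide describes.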
7
DecompClus
▪ We propose DecompClus, a community discovery method based on profile decomposition, which divides a profile into sub-profiles.
• Step 1: Profile decomposition (profiles → sub-profiles)
• Step 2: Sub-profile clustering (sub-profiles → communities)
[Figure: profiles mixing term groups such as "outdoor, hiking, …", "photo, lens, …", "photo, color, …", and "art, museum, …" are split into sub-profiles, which are then grouped into communities.]
9
Overall Procedure of DecompClus
10
Step 1: Profile Decomposition (1/2)
▪ A network of unit items (e.g., papers or tweets) is constructed for each user’s profile.
• A node (item) is represented by a term vector with TF-IDF weights.
• An edge is weighted by the cosine similarity between two nodes.
[Figure: User A’s profile drawn as a network of items i1 to i7.]
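A minimal sketch of this construction follows, assuming each item is already tokenized into terms. The slides do not say which TF-IDF variant is used, so the raw-count TF and log(N/df) IDF below are assumptions, as are the function names and the sample profile.

```python
import math
from collections import Counter
from itertools import combinations

def tfidf_vectors(docs):
    """Map each tokenized item to a sparse {term: tf * idf} vector."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed variant: raw term frequency times log(N / document frequency).
        vectors.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def item_network(docs):
    """Return {(i, j): weight} for every item pair with non-zero similarity."""
    vectors = tfidf_vectors(docs)
    edges = {}
    for i, j in combinations(range(len(vectors)), 2):
        weight = cosine(vectors[i], vectors[j])
        if weight > 0:
            edges[(i, j)] = weight
    return edges

# Hypothetical profile: four items, each reduced to its terms.
profile = [
    ["photo", "lens", "camera"],
    ["photo", "lens", "color"],
    ["outdoor", "hiking", "trail"],
    ["outdoor", "hiking", "camp"],
]
edges = item_network(profile)
for (i, j), weight in sorted(edges.items()):
    print(f"item {i} and item {j}: {weight:.3f}")
```

Only the pairs that share terms receive an edge here, so the two photography items and the two hiking items form separate connected parts of the network.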
11
Step 1: Profile Decomposition (2/2)
▪ Clustering is performed on this small network.
• We adopt a clustering algorithm based on modularity optimization, which tries to detect high-modularity partitions of networks [V. D. Blondel et al., 2008].
▪ Each cluster becomes a sub-profile.
[Figure: User A’s profile → User A’s sub-profiles.]
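To make the idea of modularity optimization concrete, here is a simplified sketch. It is not the full method of Blondel et al.: it performs only a greedy local-moving pass (no graph aggregation) and recomputes modularity from scratch for every candidate move, which is acceptable only for small per-user networks. The function names and the example edge weights are hypothetical.

```python
from collections import defaultdict

def modularity(edges, community):
    """Weighted Newman modularity of a partition.

    edges: {(u, v): weight} for an undirected graph with at least one edge
    and no self-loops. community: {node: community label}.
    """
    m = sum(edges.values())
    internal = defaultdict(float)  # edge weight inside each community
    degree = defaultdict(float)    # summed weighted degree of each community
    for (u, v), w in edges.items():
        degree[community[u]] += w
        degree[community[v]] += w
        if community[u] == community[v]:
            internal[community[u]] += w
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

def local_moving(edges):
    """Move single nodes into neighboring communities while modularity improves."""
    nodes = sorted({n for edge in edges for n in edge})
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    community = {n: n for n in nodes}  # start from singleton communities
    best = modularity(edges, community)
    improved = True
    while improved:
        improved = False
        for n in nodes:
            original = community[n]
            best_label = original
            candidates = {community[x] for x in neighbors[n]} - {original}
            for label in sorted(candidates):
                community[n] = label
                q = modularity(edges, community)
                if q > best + 1e-12:
                    best, best_label = q, label
            community[n] = best_label
            if best_label != original:
                improved = True
    return community, best

# Hypothetical item network: two tight groups joined by one weak edge.
edges = {
    ("i1", "i2"): 0.9, ("i1", "i3"): 0.8, ("i2", "i3"): 0.85,
    ("i4", "i5"): 0.9, ("i4", "i6"): 0.8, ("i5", "i6"): 0.85,
    ("i3", "i4"): 0.1,
}
community, q = local_moving(edges)
print(community, round(q, 4))
```

On this toy network the pass groups i1 to i3 and i4 to i6 separately; under the slide's procedure each such group would become one sub-profile.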
12
Step 2: Sub-Profile Clustering (1/2)
▪ A network of sub-profiles is constructed by accumulating the sub-profiles of every user.
• A node (sub-profile) is represented by a term vector with TF-IDF weights.
• An edge is weighted by the cosine similarity between two nodes.
[Figure: a network whose nodes are the sub-profiles of Users A to E; User A contributes two sub-profiles.]
13
Step 2: Sub-Profile Clustering (2/2)
▪ Clustering is performed on the network of sub-profiles.
• The same clustering method is used to group sub-profiles.
▪ Now, each cluster becomes a user community.
[Figure: the sub-profiles of Users A to E grouped into Community C1 and Community C2; User A has a sub-profile in each.]
▪ A user can belong to multiple communities (e.g., User A is in C1 and C2).
• DecompClus thus discovers an overlapping community structure using a non-overlapping clustering method.
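The step from non-overlapping sub-profile clusters to overlapping user communities can be sketched as a simple mapping. The sub-profile identifiers and the assignment below are hypothetical; only the fact that User A ends up in both C1 and C2 comes from the slide.

```python
from collections import defaultdict

def user_communities(owner, cluster_of):
    """Turn a partition of sub-profiles into (possibly overlapping) user communities.

    owner: {sub_profile_id: user}
    cluster_of: {sub_profile_id: community label}; each sub-profile has one label.
    """
    members = defaultdict(set)
    for sub_profile, label in cluster_of.items():
        members[label].add(owner[sub_profile])
    return dict(members)

# Hypothetical sub-profiles: User A has two, every other user has one.
owner = {"a1": "A", "a2": "A", "b1": "B", "c1": "C", "d1": "D", "e1": "E"}
# Hypothetical non-overlapping clustering of those sub-profiles.
cluster_of = {"a1": "C1", "b1": "C1", "d1": "C1",
              "a2": "C2", "c1": "C2", "e1": "C2"}

communities = user_communities(owner, cluster_of)
memberships_of_a = sorted(c for c, users in communities.items() if "A" in users)
print(communities)
print("User A belongs to:", memberships_of_a)
```

Each sub-profile sits in exactly one cluster, yet a user with several sub-profiles inherits several memberships, which is how the overlap arises.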
16
Experimental Set-up (1/3)
▪ Evaluation methods
• Quantitative evaluation: verify that DecompClus finds more tightly connected communities
  ◦ Modularity value
  ◦ Intra-similarity
  ◦ Inter-similarity
• Qualitative evaluation: explain how the communities found by our method differ semantically from those found by the compared method
  ◦ Defining the theme of each community
  ◦ Case studies (see the paper)
  ◦ Visualization
17
Experimental Set-up (2/3)
▪ CiteULike
• Social bookmarking service for scholarly papers
• http://www.citeulike.org/faq/data.adp
▪ Dataset
• # of users = 122
• # of articles = 25,089
• # of unique stemmed tags = 16,161
• Half of the users have more than one interest
[Figure: distribution of users according to their tags, with three groups: tag like 'data_mining%' or 'mining%' or 'knowledge_discovery%'; tag like 'social_network%' or 'socialnetwork%'; tag like 'recommend%'.]
18
Experimental Set-up (3/3)
▪ Implementation
• Gephi library: open-source software for visualizing and analyzing large network graphs
▪ Baseline
• Follows almost the same procedure as DecompClus.
• Uses only one overall profile per user.
[Figure: Baseline clusters whole profiles (e.g., "photo, lens, …, outdoor, hiking, …") directly into communities, without decomposing them.]
19
Discovered Communities
Baseline
Community ID    # of users
BC1             57
BC2             65

DecompClus
Community ID    # of users
DC1             80
DC2             53
DC3             91
DC4             84

▪ # of communities
• DecompClus finds more communities than Baseline does (4 vs. 2).
▪ # of users per community
• The communities discovered by DecompClus have more members on average than those of Baseline (77 vs. 61), because DecompClus allows a user to belong to multiple communities at the same time. The Baseline sizes sum to 122, the number of users, whereas the DecompClus sizes sum to 308.
20
Quantitative Evaluation
• DecompClus achieves better values than Baseline on the following metrics.
• Modularity value: the strength of the division of a network into modules
• Intra-similarity: the average similarity within a community
• Inter-similarity: the average similarity between communities
[Figure: three bar charts comparing Baseline and DecompClus on modularity, intra-similarity, and inter-similarity. Bar value labels that survive from the charts: 0.0734, 0.0279, 0.4534, 0.3604, 0.0133, 0.0035.]
▪ In DecompClus, the connections between members within a community are denser, whereas the connections between members of different communities are sparser.
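The two similarity metrics can be sketched as follows. The slides define them only as averages, so the exact averaging used here (pooling all node pairs inside the same community, and all node pairs in different communities) is an assumption, as are the function name and the sample values. The nodes would be whatever units were clustered, for example sub-profiles.

```python
from itertools import combinations

def intra_inter_similarity(similarity, community):
    """Average pairwise similarity within and across communities.

    similarity: {(u, v): value}, each unordered pair stored once in either
    order; pairs that are missing count as 0.
    community: {node: community label}.
    """
    def sim(u, v):
        return similarity.get((u, v), similarity.get((v, u), 0.0))

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    intra, inter = [], []
    for u, v in combinations(sorted(community), 2):
        if community[u] == community[v]:
            intra.append(sim(u, v))
        else:
            inter.append(sim(u, v))
    return mean(intra), mean(inter)

# Hypothetical similarities between four clustered nodes.
similarity = {("a", "b"): 0.8, ("c", "d"): 0.6, ("b", "c"): 0.1}
community = {"a": 1, "b": 1, "c": 2, "d": 2}

intra, inter = intra_inter_similarity(similarity, community)
print(f"intra-similarity: {intra:.3f}, inter-similarity: {inter:.3f}")
```

A partition with high intra-similarity and low inter-similarity matches the slide's description of dense connections inside communities and sparse connections between them.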
21
Qualitative Evaluation (1/2)
▪ DecompClus preserves the themes defined by Baseline.
▪ DecompClus finds new communities that are not found by Baseline.

Baseline
ID     THEME
BC1    Data mining & Recommendation
BC2    Social Network

DecompClus
ID     THEME
DC1    Data mining & Recommendation
DC2    Semantic Web (newly found)
DC3    Data mining & Bioinformatics (newly found)
DC4    Social Network
22
Qualitative Evaluation (2/2)
▪ In DecompClus, a user’s minor interests are not assimilated into his/her major interests, so new communities that consist of users’ minor interests can be discovered.
[Figures: distribution of articles related to “Semantic web”; distribution of articles related to “Bioinformatics”.]
23
Visualization
▪ The community structure produced by DecompClus is more clearly distinguishable.
[Figure: community visualizations drawn with the ForceAtlas2 layout provided by Gephi.]
25
Related Work (1/2)
▪ Comparison with related areas

Approach                              # of profiles per user   Mapping in clustering (node : community)   Result
Non-overlapping community discovery   One profile              1:1                                        A user belongs to one community
Overlapping community discovery       One profile              1:N                                        A user belongs to multiple communities
DecompClus                            Multiple sub-profiles    1:1                                        A user belongs to multiple communities
26
Related Work (2/2)
▪ Non-overlapping community discovery
• Newman’s method [Newman and Girvan, 2004]
• Multi-level graph partitioning method [Karypis and Kumar, 1995]
• Attribute augmented graph [Zhou et al., 2006]
• Bayesian generative models [Wang, 2006]
▪ Overlapping community discovery
• CPM (clique percolation method) [Palla et al., 2005]
• Connectedness and local optimality [Goldberg et al., 2010]
• Label propagation [Gregory, 2009]
27
Conclusion
▪ A novel concept of profile decomposition, which enables us to detect fine-granularity user communities with implicit relationships
▪ A new approach to discovering overlapping communities with non-overlapping community discovery algorithms
▪ We demonstrate, using a real data set, that our algorithm effectively discovers user communities from social media data.
THANK YOU !!
29
Case Studies
Case 1
• Users who become members of multiple communities through profile decomposition
For example, User A’s profile:
Baseline: User A, with one overall profile, is assigned to exactly one of Community BC1 (data mining & recommendation) and Community BC2 (social network).
DecompClus: User A’s three sub-profiles (sub-profile1 to sub-profile3) fall into three different communities.
• user model, recommender, personalization, user profiling, knn, data mining, … → Community DC1 (data mining & recommendation)
• semantics, semantic web, rdf, ontology, social semantic web, … → Community DC2 (semantic web)
• social network analysis, social search, graphs, … → Community DC4 (social network)
• None of User A’s sub-profiles is shown in Community DC3 (data mining & bioinformatics).
In our data set, there are a total of 99 users (81.1%) like User A.
30
Case Studies
Case 2
• Users who become members of the communities newly discovered by DecompClus
For example, User B’s profile:
Baseline: User B, with one overall profile, is assigned to exactly one of Community BC1 (data mining & recommendation) and Community BC2 (social network).
DecompClus: User B’s sub-profile1 falls into a newly discovered community.
• statistics, cancer, genomics, gene, sequencing, virus, bacteria, database, classification, … → Community DC3 (data mining & bioinformatics)
• The other DecompClus communities shown are DC1 (data mining & recommendation), DC2 (semantic web), and DC4 (social network).
There are a total of 9 users (7.3%) like User B.