Isograph: Neighbourhood Graph Construction Based on Geodesic Distance for
Semi-Supervised Learning
Marjan Ghazvininejad, Mostafa Mahdieh, Hamid R. Rabiee, Parisa Khanipour Roshan and Mohammad Hossein Rohban
DML Research Lab, Department of Computer Engineering, Sharif University of Technology
Tehran, Iran
Contact Email: rabiee@sharif.edu
Abstract—Semi-supervised learning based on manifolds has been the focus of extensive research in recent years. Appropriate neighbourhood graph construction is a key component of a successful semi-supervised classification method. Previous graph construction methods fail when there are pairs of data points that have a small Euclidean distance but are far apart over the manifold. To overcome this problem, we start with an arbitrary neighbourhood graph and iteratively update the edge weights using estimates of the geodesic distances between points. Moreover, we provide theoretical bounds on the values of the estimated geodesic distances. Experimental results on real-world data show significant improvement compared to previous graph construction methods.

Keywords—Semi-supervised Learning, Manifold, Geodesic Distance, Graph Construction
I. INTRODUCTION
A. Semi-supervised Learning
The costly and time-consuming process of data labeling, and the abundance of relatively cheap unlabeled data at hand, are two features of real-world applications that have caused a recent surge of interest in Semi-Supervised Learning (SSL) methods. Text and image mining are common examples of applications in which SSL plays an important role ([1], [2], [3]). SSL methods utilize both labeled and unlabeled data to improve the generalization ability of the learner in such applications. Using the unlabeled data, one can estimate the distribution of the data in feature space, which can greatly improve classification.
In order to use the unlabeled data for label inference more effectively, certain assumptions must be made about the general geometric properties of the data. In many applications, high-dimensional data points are actually samples from a low-dimensional subspace of the ambient feature space. In these cases, we can make use of the Manifold/Cluster assumption, which is among the most practical assumptions in SSL [4]. The Manifold/Cluster assumption holds in many real-world datasets in general, and in image datasets in particular [5].
Weighted discrete graphs are a suitable representation of manifolds. A manifold can be represented by sampling a finite number of points from the manifold as the graph vertices and putting edges between nearby points on the manifold. As the underlying manifold is unknown in real data, and smoothness estimation of the labeling function relies heavily on the manifold model, graph construction plays an important role in this problem. Consequently, several methods have been proposed to construct a suitable graph representing the manifold. These graph construction methods output a weighted graph in which each data point is a vertex and the edge weights represent the distances between the endpoints. The constructed graph is then used to infer the labels of the unlabeled data points. Appropriate construction of the neighbourhood graph therefore plays a key role in manifold-based SSL. This argument is further discussed in Subsection II-C.
Some recent work in the Semi-Supervised Learning literature has focused on proposing graph construction methods that best represent the manifold structure. k-NN and ε-ball are two classical methods of graph construction [6]. Several schemes have been proposed to improve the k-NN graph construction method. Moreover, Jebara et al. proposed the b-matching algorithm [7], which, unlike the k-NN method, produces a balanced graph, i.e. all nodes have the same number of neighbours. The effectiveness of this method has been corroborated by both theoretical and experimental justifications [8].
However, most graph construction methods have used the Euclidean distance between data points as their distance measure. Unfortunately, this approach is misleading at times, since two points with a small Euclidean distance may be situated far apart on the manifold. In this case two points are connected while in fact they are distant from each other on the manifold. Such edges are called shortcut edges [9], and a graph containing them does not represent the manifold structure correctly. This situation can be prevented by using a distance measure that reflects the distance between data points more faithfully. Approximating the correct distance between data points enables us to determine the neighbourhood of a data point more precisely.
Cukierski et al. have proposed a method to identify shortcut edges via the Betweenness Centrality measure [9]. Betweenness Centrality is related to the number of graph shortest paths (between any two vertices) that pass through a specific edge. They argue intuitively that shortcut edges are likely to have high Betweenness Centrality, and use this observation to remove such edges. However, to the best of our knowledge, this argument has not been justified theoretically.
In this paper, we introduce a novel algorithm that detects shortcut edges and adjusts their weights with the aim of approaching the real distance between the points on the manifold. The graph constructed by this algorithm is based on the intrinsic distance between points; hence we name it Isograph. We provide a solid theoretical foundation for our work, and our algorithm in fact emerges from the theorems. Experiments on benchmark datasets show promising results compared to previous state-of-the-art work. Notably, our method can be initialized from any arbitrary graph, so it can easily be combined with many previous graph construction methods.
Finally, note that the well-known Isomap algorithm also estimates the geodesic distances between points, with considerable results [10]. However, Isomap cannot properly recover the geodesic distance between nearby points; it concentrates on finding geodesic distances between far-apart points, and therefore cannot be used directly in graph construction methods.
The remainder of the paper is organized as follows. In Section II we introduce the notation used throughout the paper and provide basic definitions in this field; this section can be safely skipped by the experienced reader. Next, in Sections III and IV, we explain the motivation of our algorithm and the basic idea of geodesic distance. This is followed by the precise problem setting in Section V. In Section VI we present our proposed method and give theoretical justifications for it. Finally, in Section VII, the experimental results of applying our method on both synthetic and real-world datasets are presented.
Figure 1. Four of the five given data points are labeled. The goal is to predict the label of the unlabeled data point A) without any prior knowledge, B) knowing the Manifold/Cluster assumption [4].

II. BASICS AND NOTATIONS

A. Classification Setting

Consider the set of possible classes {−1, +1}, and let the feature space be the D-dimensional real-valued space ℝ^D. We denote the labeled data as X_L = {x_1, ..., x_l} with corresponding labels y = (y_1, ..., y_l), where x_i ∈ ℝ^D and y_i ∈ {−1, +1}. The unlabeled dataset is given as X_U = {x_{l+1}, ..., x_{l+u}}, where x_i ∈ ℝ^D.

In this setting X_L, X_U and y are given, and the goal is to find an estimate f = (f_1, ..., f_{l+u}) of the labels of all the points. The value of f_i is a real value between −1 and 1, and larger values of f_i correspond to stronger membership in class +1. In practice these values are mapped to {−1, +1} after inference is done. This classification problem can be generalized to the case of possible classes {1, 2, ..., c} with c ≥ 2 using the one-against-all method.

B. Manifold/Cluster assumption

In Figure 1 A the goal is to predict the label of the unlabeled point. Without any other prior knowledge, the best prediction is class −, as the two points nearest the unknown point are from this class. Now suppose we somehow know that the data points are distributed only on the curve shown in Figure 1 B, and we expect the labels of adjacent points on this curve to be similar. In this new setting the label + is a better candidate, as the two adjacent points of the unlabeled point are both from class +. The assumption just mentioned is called the Manifold/Cluster assumption and generalizes to d-dimensional spaces. The Manifold/Cluster assumption in fact consists of two parts: the Manifold assumption and the Cluster assumption. The Manifold assumption states that data points lie on a d-dimensional manifold (denoted by ℳ) in the D-dimensional feature space (d ≪ D). The Cluster assumption states that the labels of the points vary smoothly on ℳ. We will use the term "Manifold assumption" instead of Manifold/Cluster assumption in the rest of this paper.

Suppose f is a labeling function on ℳ, i.e. a function from ℳ to ℝ. The smoothness of f is formally defined as [11]:

S(f) = ∫_ℳ ‖∇f‖² dμ_ℳ    (1)

Actually S(f) captures the concept of roughness rather than smoothness, but we will call it smoothness as previous authors have done. The Manifold assumption also states that S(f) must have a small value.
C. Neighbourhood graph

In SSL algorithms, we need a discrete representation of the manifold, as we only have a finite number of data points; graphs are a suitable such representation. To build a graph from the given data points, we consider one vertex for each data point and add edges between data points that are adjacent on the manifold, hence the name neighbourhood graph. If the underlying manifold is known, constructing such a graph is straightforward. The challenging situation occurs when we do not have the manifold, which is the case in real-world problems. This problem is called "graph construction".
An example of a complex manifold together with the
neighbourhood graph constructed on the manifold is shown
in Figure 2.
Figure 2. A curved 2D manifold in the 3D feature space. The data points are shown as black dots, and the neighbourhood graph edges are shown as lines connecting data points [12].

We denote the constructed neighbourhood graph by G = (V, E), where V = V(G) is the set of vertices of the graph and E = E(G) is the set of edges. Each edge of the graph represents a neighbourhood relationship, and the edge weights are the distances between the corresponding endpoints. The weight of edge e = (u, v) is denoted by w(e) or w(u, v) throughout the paper. For simplicity we assume w(u, v) = ∞ if no edge exists between u and v in G.

We choose the neighbours of each vertex and the weights of the edges so as to approximate the manifold structure. Several methods have been proposed for graph construction, each trying to provide an appropriate approximation of the manifold structure. We introduce some of these methods in the following.

1) Classical graph construction methods: k-NN and ε-ball are two classical methods of graph construction [6]. In the k-NN graph construction method, each data point is connected to its k nearest neighbours, and the weights are the Euclidean distances between the endpoints. As we always add the reverse edges to make the graph symmetric, the degree of some vertices may become much greater than k.

In the ε-ball method, each data point is connected to the data points within distance ε of it, and the weight of each edge is the distance between its endpoints. There is no constraint on the degree of vertices in this method. If ε is too small the resulting graph will be too sparse, and a large ε will result in too many irrelevant edges, so finding a suitable ε is a hard task. Hence k-NN is used more frequently in practice.

2) b-matching: b-matching is a well-known state-of-the-art graph construction method that has seen active research in recent years [8], [7]. b-matching creates a balanced graph in which all vertices have equal degree b. This method works well when samples are distributed non-uniformly in the feature space. Theoretical foundations for this method have been presented, and it is reported to improve on k-NN graph construction when tested on digit recognition and text classification tasks. In the experimental results section, we will show that Isograph can improve the graph generated by b-matching.
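To make the constructions above concrete, the following minimal Python sketch builds a weighted k-NN graph as just described. It is an illustration under our problem setting, not the implementation used in our experiments; the helper name knn_graph and the (n, D) data array X are assumptions of the sketch.

import numpy as np

def knn_graph(X, k):
    # W[i, j] holds the Euclidean distance if (i, j) is a k-NN edge,
    # and np.inf otherwise (no edge), matching the convention above.
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    W = np.full((n, n), np.inf)
    for i in range(n):
        nbrs = np.argsort(dist[i])[1:k + 1]  # k nearest, skipping i itself
        W[i, nbrs] = dist[i, nbrs]
    # Adding the reverse edges symmetrizes the graph, which is why some
    # vertices may end up with degree greater than k.
    return np.minimum(W, W.T)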
D. Label Inference

We have used distances as edge weights, but a related concept, namely similarity, is needed for semi-supervised label inference. Similarity is the converse of distance: when the distance between two data points is low, their similarity is high, and vice versa. Similarity can be derived from distance in a few ways, among them the Gaussian similarity. Let W be the similarity matrix corresponding to graph G, that is, W_ij is the similarity between vertices i and j. Then the Gaussian similarity is defined as

W_ij = exp(−d(i, j)² / σ²)    (2)

We mentioned that smoothness is defined as:

S(f) = ∫_ℳ ‖∇f‖² dμ_ℳ

As we have only a finite number of points on the manifold and need to infer f only at these points, we can approximate the smoothness restricted to these points as [11]:

Ŝ(f) = Σ_{i,j=1}^{l+u} W_ij (f_i − f_j)²    (3)

The label inference process is based on finding an f which minimizes a mixture of both Ŝ(f) and the error of f on the labeled data. It is easy to show that Ŝ(f) can be written in the following quadratic form:

Ŝ(f) = fᵀ L f    (4)

with L = D − W, where D is the diagonal degree matrix (i.e. D_ii = Σ_{j=1}^{l+u} W_ij). L is known as the graph Laplacian. The inference minimization problem is formally defined as:

f* = argmin_f ‖Cf − y‖² + γ fᵀ L f    (5)

C = [I_{l×l} 0_{l×u}] is a selection matrix, i.e. Cf keeps only the labeled indices of f; therefore ‖Cf − y‖² represents the difference between y and f. An algorithm of running time O(n³) can compute the solution of this problem, where n is the number of data points.
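The following minimal sketch assembles Equations 2-5, assuming the first l entries of f correspond to the labeled points. The helper name infer_labels is hypothetical; this is an illustrative direct solve, not our experiment code.

import numpy as np

def infer_labels(W_dist, y, l, sigma=1.0, gamma=0.02):
    # W_dist: (n, n) edge-distance matrix (np.inf where no edge);
    # y: (l,) array of +1/-1 labels for the first l points.
    n = W_dist.shape[0]
    W = np.exp(-W_dist**2 / sigma**2)  # Equation 2; exp(-inf) = 0 kills non-edges
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W     # graph Laplacian, Equation 4
    C = np.zeros((l, n))               # selection matrix picking labeled entries
    C[np.arange(l), np.arange(l)] = 1.0
    # Setting the gradient of Equation 5 to zero gives the O(n^3) linear
    # system (C^T C + gamma L) f = C^T y.
    f = np.linalg.solve(C.T @ C + gamma * L, C.T @ y)
    return np.sign(f)                  # map back to {-1, +1}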
III. MOTIVATION

As previously mentioned, shortcut edges connect points of the graph which are close to each other according to the Euclidean distance but have a large geodesic distance on the manifold. An example of such an edge and the underlying manifold is shown in Figure 3. These edges may be disastrous to the label inference process. According to the Manifold assumption, we expect close data points on the manifold to have similar labels. This condition may be violated in the case of shortcut edges, since the adjacent data points are actually far from each other on the manifold. Therefore, it is crucial to find such edges and reduce their impact on the inference process.

Figure 3. Part of a one-dimensional manifold showing the shortcut edge between u and v.

We expect a graph with fewer shortcut edges to perform better in classification; therefore, shortcut edge detection is a key problem in neighbourhood graph construction. This paper aims at detecting such edges and removing them or adjusting their weights in an appropriate manner.

IV. GEODESIC DISTANCE

In the plane, the shortest path between two points is the straight line connecting them, but on general manifolds, such as the sphere, this line does not lie on the manifold. Therefore, we need a new concept to define the distance between points on manifolds. Geodesic curves are curves lying on the manifold that connect points by the shortest path (Figure 4).

Figure 4. Geodesic curve between two points on a manifold [13].

Definition 1. For any two points p and q on the manifold ℳ, we define d_ℳ(p, q) as the length of the shortest curve between p and q lying on ℳ.

Proposition 1. For any p, q ∈ ℳ: d(p, q) ≤ d_ℳ(p, q), where d(·, ·) is the metric of the ambient space.

This is intuitively clear, but can be proven rigorously using straight line segments for length estimation.

V. PROBLEM SETTING

In this section, we introduce the assumptions on which we base our algorithms. These comprise the Manifold assumption and a sampling condition. The Manifold assumption is a popular assumption in Semi-Supervised Learning, and the sampling condition is a reasonable condition that is common in the manifold learning literature [14]. In our problem setting, we have the following assumptions:

1) The data points lie on a d-dimensional manifold, denoted by ℳ.
2) Sampling condition: The manifold ℳ is sampled as follows: there exists δ ∈ ℝ such that for any point p ∈ ℳ, there exists a data point q among the labeled or unlabeled data points such that d_ℳ(p, q) ≤ δ. We refer to the least such δ as δ(ℳ). (ℳ is assumed to be bounded. This is reasonable because usually, in a machine representation, the feature space is finite and ℳ is a subset of the feature space.)

VI. PROPOSED METHOD

In this section, we first convey the intuition behind the proposed method with a basic algorithm. We then add more details and practical modifications to it and introduce our final algorithm. One major improvement of the final algorithm over the baseline is that it adjusts the weights of shortcut edges, instead of naively removing all edges suspected of being shortcuts.

A. The baseline algorithm

In the baseline algorithm, we mainly try to detect the shortcut edges. An edge (u, v) of the neighbourhood graph is a shortcut edge if and only if w(u, v) ≪ d_ℳ(u, v).

Looking back at Figure 3, we can observe an important feature of the shortcut edge (u, v): when we remove this edge, the shortest path between u and v that contains only small edges must pass nearly along the curved manifold, and therefore this path has many edges. This is the key intuition behind our algorithm, which we explain more precisely in the following.

Suppose we start with an initial graph G obtained by any graph construction method. For any edge e = (u, v) ∈ E(G) with weight w, we consider the subgraph containing the edges with weights less than w (the small edges previously mentioned). Assume that the shortest path between u and v is a long path in this subgraph. All edges on this path have smaller weights than the edge e; therefore, we expect this path to represent the geodesic curve between u and v better than edge e does. As a result, the geodesic distance may be better estimated using this path. If the number of edges on such a path is large enough, e.g. greater than two, it is probable that (u, v) connects points that are far apart on the manifold, and therefore (u, v) is probably a shortcut edge. In the following, we prove that the threshold length of two is an appropriate criterion for detecting shortcuts.

This procedure is not applied to the edges of the Minimum Spanning Tree of the initial graph G (MST(G)), because preserving the edges of MST(G) is necessary for graph construction in the proposed algorithm: a disconnected graph is disastrous to the process of label inference, so to ensure the connectivity of G we do not remove any of the edges in MST(G). The MST is a suitable choice because it prefers smaller edges, which are less likely to be shortcut edges.
Require: An initial graph G built with a graph construction method (e.g. k-NN)
Ensure: Shortcuts of graph G are removed
1: Let G_f be the full graph on the sampling, i.e. the graph which contains edges e = (u, v) for all u, v ∈ V(G)
2: for all e = (u, v) ∈ E(G) − E(MST(G)) in ascending order of distance do
3:   G_{u,v} ← the subgraph of G_f with edge weights less than w(u, v)
4:   L ← length of the shortest path in G_{u,v} between u and v
5:   if L > 2 then
6:     Remove edge e from E(G)
7:   end if
8: end for
Algorithm 1: The baseline algorithm
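For concreteness, a minimal Python sketch of Algorithm 1 follows, using networkx. It assumes G is a weighted undirected nx.Graph on vertices 0..n-1 whose edge attribute 'weight' holds Euclidean distances, and X is the (n, D) coordinate array (needed because the algorithm searches paths in the full graph on the sampling). The helper name is ours; this is an illustration, not our experiment code.

import itertools
import networkx as nx
import numpy as np

def baseline_remove_shortcuts(G, X):
    n = len(X)
    # Pairwise Euclidean distances: the full graph G_f on the sampling.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    mst = {frozenset(e) for e in nx.minimum_spanning_tree(G).edges()}
    # Non-MST edges in ascending order of weight (MST edges keep G connected).
    candidates = sorted((e for e in G.edges() if frozenset(e) not in mst),
                        key=lambda e: G.edges[e]['weight'])
    for u, v in candidates:
        w_uv = G.edges[u, v]['weight']
        # G_{u,v}: edges of the full graph strictly shorter than w(u, v).
        H = nx.Graph([(a, b) for a, b in itertools.combinations(range(n), 2)
                      if D[a, b] < w_uv])
        try:
            hops = nx.shortest_path_length(H, u, v)  # number of edges on the path
        except (nx.NetworkXNoPath, nx.NodeNotFound):
            continue  # u, v disconnected among the small edges; keep the edge
        if hops > 2:
            G.remove_edge(u, v)  # a long detour of smaller edges: shortcut
    return G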
To justify the correctness of our algorithm, we should show that the baseline algorithm preserves an edge (u, v) ∈ E(G) if w(u, v) is close enough to d_ℳ(u, v), and removes it otherwise.

We already know from Proposition 1 that w(u, v) ≤ d_ℳ(u, v) always holds. In the following theorems, we first justify that if d_ℳ(u, v) is not too much larger than w(u, v), the edge is not removed by the baseline algorithm.

Theorem 1. If d_ℳ(u, v) < 2w(u, v) − 2δ(ℳ), where δ(ℳ) is defined in Section V, then the baseline algorithm will preserve edge (u, v).

Proof: This theorem is a special case of Theorem 3, which is proved in Appendix A.

To complete the justification, we further show that if d_ℳ(u, v) is much larger than w(u, v), edge (u, v) will be removed by the baseline algorithm. To do so, we first need to define some concepts.
Definition 2.
1) Consider all unit-speed geodesic curves C completely lying on ℳ. The minimum radius of curvature r_0 = r_0(ℳ) is defined by

1/r_0 = max_{C,t} ‖C̈(t)‖

where C̈(t) denotes the second derivative of C with respect to t [14].
2) The minimum branch separation s_0 = s_0(ℳ) is defined as the largest positive number for which d(x, y) < s_0 implies d_ℳ(x, y) ≤ π r_0 for every x, y ∈ ℳ, where r_0 is the minimum radius of curvature [14].

Definition 3. Manifold ℳ is called geodesically convex if there exists a mathematically geodesic curve C between any two arbitrary points x, y ∈ ℳ with length d_ℳ(x, y) [14]. A mathematically geodesic curve C on manifold ℳ is a curve whose geodesic curvature is zero at every point of the curve [15]. This condition is needed only for the next theorem.

Figure 5. The geodesic path between the two endpoints of edge e = (u, v) is shown by the dashed line. The geodesic paths between the pairs u, w and w, v are shown by solid curves.
Theorem 2. If ℳ is a geodesically convex manifold and there exist u, v ∈ ℳ with d(u, v) < s_0 and d_ℳ(u, v) ≥ (2/(1 − λ_0)) d(u, v), then the baseline algorithm removes edge e = (u, v), where λ_0 is a constant for a given manifold ℳ and can be computed as λ_0 = π² s_0(ℳ)² / (96 r_0(ℳ)²).

Proof: Suppose the baseline algorithm does not remove edge e. Then, according to the baseline algorithm, the length of the shortest path between u and v in G_{u,v} equals two (it cannot be one because we omit (u, v)), and hence there exist edges e_1 = (u, w) and e_2 = (w, v) such that d(u, w) < d(u, v) and d(w, v) < d(u, v) (Figure 5).

From [14], we know that for any 0 < λ < 1, if the points x, y of a geodesically convex manifold ℳ satisfy the conditions

d(x, y) ≤ (2/π) r_0 √(24λ),   d(x, y) < s_0    (6)

then we have d_ℳ(x, y) ≥ d(x, y) ≥ (1 − λ) d_ℳ(x, y).

Taking x = u, y = v and λ = λ_0, it can easily be verified that the conditions in Equation 6 are satisfied in our case. As d(u, w) < d(u, v), the conditions in (6) also hold for x = u, y = w, λ = λ_0. Therefore we have d(u, w) ≥ (1 − λ_0) d_ℳ(u, w). Combining this result with the previously known relation d(u, v) > d(u, w), we conclude that d(u, v) > d(u, w) ≥ (1 − λ_0) d_ℳ(u, w). A similar conclusion can be drawn by taking x = v, y = w and λ = λ_0: d(u, v) > d(w, v) ≥ (1 − λ_0) d_ℳ(w, v). Summing up these two relations and applying the triangle inequality, we reach the following conclusion:

d(u, v) > (1/2)(1 − λ_0)(d_ℳ(w, u) + d_ℳ(w, v)) ≥ ((1 − λ_0)/2) d_ℳ(u, v)

This contradicts the assumption that d(u, v) ≤ ((1 − λ_0)/2) d_ℳ(u, v). Therefore, the supposition is false: the baseline algorithm does remove edge e = (u, v), and the proof is complete.
B. Shortcomings

The baseline algorithm has two drawbacks. First, if d_ℳ(u, v) is small (i.e. close to 2δ(ℳ)), Theorem 1 cannot guarantee that edge (u, v) is not removed, even when w(u, v) ≈ d_ℳ(u, v). If d_ℳ(u, v) < 2δ(ℳ), the δ(ℳ) term in the inequality of Theorem 1 has greater influence than d_ℳ(u, v). Consequently, the algorithm may wrongly remove edges, because the precondition of the theorem is at risk of being violated. An example of this situation occurs on a plane-shaped manifold, where d(u, v) is exactly equal to d_ℳ(u, v) for all u, v ∈ V(G). Even though no shortcut edge exists in this case, the baseline algorithm may remove some of the edges of G.

Secondly, although the baseline algorithm is able to pinpoint the large difference between w(u, v) and d_ℳ(u, v) for a shortcut edge, it naively removes such edges. The classification result will improve if these edges have a very small effect on inference instead of being removed; that is, adjusting the edge weights in an appropriate manner is a better solution. This way, we can estimate the structure of the manifold more accurately.

These shortcomings are overcome in the proposed algorithm, namely Isograph, which is described in the next subsection.
C. An improved algorithm: Isograph

We now propose the Isograph algorithm to overcome the shortcomings described in the previous subsection. This algorithm is a modified version of the baseline algorithm with two improvements:

• To overcome the problem with small values of d_ℳ(u, v), Isograph leaves all edges with w(u, v) ≤ θ unchanged. If we choose θ such that θ > 2δ(ℳ), then, since we know d_ℳ(u, v) ≥ w(u, v), we get d_ℳ(u, v) > 2δ(ℳ) for every edge that is processed, which solves the first problem.

• To overcome the second shortcoming, Isograph maintains an estimated value d̂_ℳ(u, v) for each edge (u, v) ∈ E(G), and if d̂_ℳ(u, v) is too far from d_ℳ(u, v), instead of removing this shortcut edge, it increases the edge weight d̂_ℳ(u, v) to make it a better estimate of the geodesic distance. Therefore, the same graph structure is retained with better edge weights, which might in turn cause other edges to be updated in subsequent iterations. In Theorems 3 and 4, we show that updating over multiple iterations increases the edge weights towards the geodesic distance, so that a more accurate estimate of the geodesic distance is achieved.

As previously mentioned in Theorem 1, it can be proven that for any edge (u, v) detected as a shortcut by the baseline algorithm, we have:

d_ℳ(u, v) ≥ 2(w(u, v) − δ(ℳ))

Therefore, we may use the following update rule for the edge weights:

d̂_ℳ(u, v) ← 2(w(u, v) − δ(ℳ))

Later, in Theorem 4, we will show that this is indeed an appropriate update rule which gives a better estimate of d_ℳ(u, v). Using the θ constraint and updating the edge weights iteratively with this rule, we arrive at Isograph.
Require: An initial graph G built with a graph construction method (e.g. k-NN)
Ensure: Adjusted edge weights d̂_ℳ^(t)(u, v), ∀(u, v) ∈ E(G)
1: for all e = (u, v) ∈ E(G) do
2:   d̂_ℳ^(1)(u, v) ← w(u, v)
3: end for
4: for t = 1 ... NumberOfIterations do
5:   for all e = (u, v) ∈ E(G) − E(MST(G)) do
6:     if d̂_ℳ^(t)(u, v) ≥ θ then
7:       G_{u,v} ← the subgraph of G with edge weights less than d̂_ℳ^(t)(u, v) (see the footnote below)
8:       L ← length of the shortest path in G_{u,v} between u and v
9:       if L > 2 then
10:        d̂_ℳ^(t+1)(u, v) ← 2(d̂_ℳ^(t)(u, v) − δ(ℳ))
11:      end if
12:    end if
13:  end for
14: end for
Algorithm 2: Isograph (the proposed algorithm)

Footnote: in fact, we also add to G_{u,v} any pair (x, y) ∉ E(G) such that d(x, y) ≤ d̂_ℳ^(t)(u, v).
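A minimal Python sketch of Algorithm 2 follows, under the same assumptions as the baseline sketch (weighted nx.Graph on vertices 0..n-1, coordinate array X); theta and delta stand for θ and the estimated δ(ℳ), with θ > 2δ assumed. The helper name isograph is ours and the code is illustrative, not our experiment implementation.

import networkx as nx
import numpy as np

def isograph(G, X, delta, theta, n_iter=3):
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    mst = {frozenset(e) for e in nx.minimum_spanning_tree(G).edges()}
    for _ in range(n_iter):
        updates = {}
        for u, v in G.edges():
            if frozenset((u, v)) in mst:
                continue  # MST edges are never modified (connectivity)
            d_hat = G.edges[u, v]['weight']
            if d_hat < theta:
                continue  # short edges are left unchanged (first improvement)
            # G_{u,v}: edges of G with estimate below d_hat; per the footnote,
            # non-edges (x, y) with d(x, y) <= d_hat are added as well.
            H = nx.Graph([(a, b) for a, b, dd in G.edges(data='weight')
                          if dd < d_hat])
            H.add_edges_from((a, b) for a in range(n) for b in range(a + 1, n)
                             if not G.has_edge(a, b) and D[a, b] <= d_hat)
            try:
                hops = nx.shortest_path_length(H, u, v)
            except (nx.NetworkXNoPath, nx.NodeNotFound):
                continue
            if hops > 2:
                # Suspected shortcut: raise the estimate (update rule)
                updates[(u, v)] = 2 * (d_hat - delta)  # instead of deleting
        for e, w_new in updates.items():
            G.edges[e]['weight'] = w_new
    return G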
In the following theorems, we prove that the loop invariant

∀(u, v) ∈ E(G):  d(u, v) ≤ d̂_ℳ^(t)(u, v) ≤ d_ℳ(u, v)    (7)

holds throughout the procedure of Isograph. In addition, we show that the difference between the real and the estimated values of d_ℳ(u, v) decreases as the edge weights are updated in each iteration. The theorems show that the estimated geodesic distance always lies between the Euclidean distance and the real geodesic distance; therefore, we may increase the edge weights iteratively without worrying about exceeding the true distance.

Theorem 3. Assuming the loop invariant (Equation 7) holds at some time instance, if d_ℳ(u, v) < 2 d̂_ℳ(u, v) − 2δ(ℳ), then Isograph will not update edge e = (u, v) (line 10 of Algorithm 2).

Theorem 4. At any point throughout the procedure of Isograph, Equation 7 holds.

The proofs of these theorems are included in Appendix A.

Lemma 1. If edge e = (u, v) is updated at iteration t, then

d̂_ℳ^(t+1)(u, v) > d̂_ℳ^(t)(u, v)    (8)

Proof: Edge e is updated, so:

d̂_ℳ^(t+1)(u, v) = 2(d̂_ℳ^(t)(u, v) − δ(ℳ))

We know that d̂_ℳ^(t)(u, v) ≥ θ > 2δ(ℳ). Adding d̂_ℳ^(t)(u, v) to both sides of this relation, we have:

2 d̂_ℳ^(t)(u, v) > 2δ(ℳ) + d̂_ℳ^(t)(u, v)

Therefore

2(d̂_ℳ^(t)(u, v) − δ(ℳ)) > d̂_ℳ^(t)(u, v)  ⟹  d̂_ℳ^(t+1)(u, v) > d̂_ℳ^(t)(u, v)
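To see the growth guaranteed by Lemma 1 concretely, the following toy Python loop iterates the update rule with θ > 2δ(ℳ); the numbers are made up purely for illustration.

# Toy illustration of Lemma 1: with theta > 2*delta, each application of
# the update rule strictly increases the estimate. Numbers are made up.
delta, d_hat = 0.5, 1.2   # requires d_hat >= theta > 2*delta = 1.0
for t in range(4):
    print(t, round(d_hat, 3))
    d_hat = 2 * (d_hat - delta)
# prints 1.2, 1.4, 1.8, 2.6: strictly increasing; in Isograph the update
# is applied only while Theorem 3 permits it, so by Theorem 4 the
# estimate never exceeds d_M(u, v).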
D. Practical modifications

In Isograph, we have explicitly used δ(ℳ) and θ. From a sampling of the manifold ℳ, we cannot exactly recover its underlying geometry: there are many manifolds passing through the same sample points, each with a different value of δ(ℳ), so computing δ(ℳ) is naturally an ill-posed problem. However, Lemma 2 gives a lower bound for δ(ℳ).

Lemma 2. If the maximum-weight edge in the minimum spanning tree (MST) of the neighbourhood graph has weight w_mst, then δ(ℳ) ≥ w_mst/2, where δ(ℳ) is defined in Section V.

The proof of this lemma is given in Appendix A. As Lemma 2 indicates, δ(ℳ) ≥ max_{e ∈ MST(G)} w(e)/2. However, since estimating δ(ℳ) is an ill-posed problem, in order to estimate it we must suppose that the data provided to our algorithms lies on a manifold with some intuitively reasonable constraints, i.e. we must assume some prior knowledge about δ(ℳ).

Another issue is that in many cases the sampling might be sparse in only a small region of ℳ. As δ(ℳ) is defined as a global parameter of the sampling, this results in a large value of δ(ℳ), whereas the local value of δ may be much smaller in many regions. In Theorem 3, where δ(ℳ) entered our formulation, we do not need a global bound on δ(ℳ), so we can use a local bound instead. We assume that the local value of δ has the same order of magnitude as w_mst in Lemma 2, i.e. δ = α w_mst/2.

Sparsity may be greater in some parts of the manifold than in others, so an adaptive method of estimating δ would clearly help Isograph. The α w_mst/2 estimate of δ is not adaptive; to mitigate this, we address the problem indirectly. The edges of the k_s-NN graph, where k_s ≪ k, are rarely shortcut edges, so it is reasonable not to modify these edges at all. This proved to be effective in practice.
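A minimal sketch of the δ estimate just described (α w_mst/2, with w_mst the heaviest MST edge from Lemma 2) follows; the helper name estimate_delta is ours and the code is illustrative only.

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def estimate_delta(X, alpha=0.5):
    # Pairwise Euclidean distances of the sample points.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    mst = minimum_spanning_tree(D)   # sparse matrix of MST edge weights
    w_mst = mst.toarray().max()      # heaviest MST edge (Lemma 2)
    return alpha * w_mst / 2.0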
VII. EXPERIMENTAL RESULTS

In this section we present the experimental results and demonstrate the effectiveness of our neighbourhood graph construction method when applied on the k-NN graph and on the output of b-matching.
A. Synthetic Datasets

To evaluate our proposed method, we generated three synthetic datasets. We can assess the effectiveness of our algorithm by illustrating the edges that are detected as shortcuts. Each dataset lies on a different 2D manifold embedded in a 3D space: Swiss roll, Step and Ellipsoid (Figure 6). The data points (200 per dataset) are generated by uniform i.i.d. sampling on the manifold surface, and each point is translated by an independent random offset. The k-NN method is used to construct the neighbourhood graph, where k is selected as the smallest value for which a considerable number of shortcut edges emerge. The parameter k_s (Subsection VI-D) is set to 5 for all graphs.

An effective shortcut edge detection algorithm eliminates the edges connecting two irrelevant points (i.e. edges with d_ℳ ≫ w), while maintaining edges lying on the manifold, no matter how long such edges may be. These properties follow from Theorems 2 and 3, respectively. In Figure 6 it is easy to observe that our algorithm has both of these properties, and therefore is effective. For a better illustration, the graph edges of each manifold are partitioned and shown in two separate figures: edges which are detected as shortcuts, and those preserved.
B. Real World Experiments

In order to evaluate the proposed method, four standard datasets consistent with the Manifold assumption were selected: MNIST, USPS, Caltech 101 and Corel. USPS and MNIST are digit recognition datasets and the others are image categorization datasets. For the Caltech and Corel datasets, a subset of classes was selected and the CEDD feature set introduced by [16] was extracted; for MNIST and USPS the feature vector is the (low-resolution) image itself. Principal Component Analysis (PCA) was applied to all datasets for noise removal.

For each dataset, ten random samplings containing 2000 points of the whole dataset were generated, and cross-validation was used to partition each sampling into labeled and unlabeled points, such that there were ten labeled points per class on average. The values of γ, k_s, α and NumberOfIterations were set to 0.02, 3, 0.5 and 3 respectively, for all experiments. In the first experiment, we applied Isograph on the 10-NN graph.
Figure 8. Charts comparing the accuracy of Isograph applied on the k-NN graph with plain k-NN graph construction on MNIST, USPS, Caltech and Corel (accuracy (%) vs. k, for 0-1 kNN, 0-1 kNN+Isograph, kNN+ML and kNN+ML+Isograph).
Figure 6. Shortcut detection in k-NN graphs of three noisy synthetic datasets: Ellipsoid (k = 20), Step (k = 22) and Swiss roll (k = 13). Figures in the right column illustrate the edges detected as shortcuts, and therefore updated by the Isograph algorithm; figures in the left column show the edges that are maintained.
Figure 7. Shortcut edges detected by Isograph and the path found by the
algorithm
To illustrate the effectiveness of Isograph in detecting shortcut edges, we selected some of the updated edges and plotted the path found by Isograph between their endpoints (Figure 7). The first and the last pictures in each row represent the endpoints of an edge. Although each such edge was in the 10-NN graph, its endpoints belong to two different classes. Therefore, our algorithm improves the graph structure by updating the edge between them.
In the second experiment we applied Isograph on the k-NN graph and measured the accuracy of the classifier built using the resulting neighbourhood graph. The results are presented in Figure 8 for the mentioned datasets. We have run our algorithm in two settings:
1) Binary: In this setting we use only unit weights for the edges. The k-NN approach to graph construction in the binary setting is to connect each vertex to its k nearest neighbours with unit weight. We call this the "0-1 k-NN graph" to distinguish it from the weighted k-NN graph. Isograph can be applied in binary graph construction in the following way: we build the weighted k-NN graph, run Isograph on it, and, after all iterations have finished, remove any edges that were updated by Isograph. We name the result "0-1 k-NN+Isograph". Note that this is different from using the baseline algorithm: edges are not removed during Isograph, so they can influence the geodesic distance estimates of other edges; hence, potentially more shortcut edges are updated.
2) Weighted: We compare Isograph with the "k-NN+ML" graph in this setting. The k-NN+ML graph is constructed by creating the weighted k-NN graph and applying the similarity of Equation 2. To build "k-NN+ML+Isograph", we applied Isograph on the weighted k-NN graph and then used Equation 2. In both weighted methods, Marginal Likelihood (ML) was used to find the best σ for creating the similarity matrix W. A sketch of how these four graph variants are assembled follows.
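The following illustrative Python sketch wires the four variants compared in Figure 8 together, reusing the hypothetical helpers knn_graph and isograph from the earlier sketches; it is schematic wiring, not our experiment code.

import numpy as np

def build_variants(G_knn, G_iso, sigma):
    # G_knn: the weighted k-NN graph; G_iso: its Isograph-adjusted copy.
    gauss = lambda w: np.exp(-w**2 / sigma**2)  # Equation 2
    variants = {}
    # 0-1 kNN: unit weight on every k-NN edge.
    variants['0-1 kNN'] = {e: 1.0 for e in G_knn.edges()}
    # 0-1 kNN+Isograph: drop every edge whose weight Isograph updated
    # (untouched edges keep their exact original float weight).
    variants['0-1 kNN+Isograph'] = {
        e: 1.0 for e in G_knn.edges()
        if G_iso.edges[e]['weight'] == G_knn.edges[e]['weight']}
    # kNN+ML and kNN+ML+Isograph: Gaussian similarity on the original
    # and on the adjusted distances; sigma chosen by marginal likelihood.
    variants['kNN+ML'] = {e: gauss(G_knn.edges[e]['weight'])
                          for e in G_knn.edges()}
    variants['kNN+ML+Isograph'] = {e: gauss(G_iso.edges[e]['weight'])
                                   for e in G_knn.edges()}
    return variants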
All four of the graph constructions above can be built with any arbitrary graph construction method in place of k-NN. For instance, we combined Isograph with b-matching and show that the classification accuracy is superior to plain b-matching on all four datasets mentioned above, with robustness with respect to the parameter b (Figure 9).
Figure 9. b-matching combined with Isograph in the 0-1 and ML settings on MNIST, USPS, Caltech and Corel (accuracy (%) vs. k, for 0-1 bMatching, 0-1 bMatching+Isograph, bMatching+ML and bMatching+ML+Isograph).
These figures show a steady improvement from Isograph in all the settings presented, and the improvements are robust to k. As we can see, on MNIST and USPS, k-NN+Isograph considerably improves the results, which shows that Isograph detects the shortcut edges effectively. On these two datasets k-NN+ML+Isograph works better for small values of k, but performance degrades slightly for larger values of k. This phenomenon can be explained by the fact that for small values of k, the k-NN graph uses shorter edges, which probably have a smaller difference between their w and d_ℳ, so a maximum of three iterations is enough to reach their correct weights. However, when k increases, even though we still detect the shortcut edges correctly, we cannot fully update their weights within the given number of iterations, as the change in each iteration is limited to a factor of roughly two.

On the Corel dataset, ML improves the results of 0-1 k-NN and 0-1 k-NN+Isograph. In this setting, the weights play an important role in inferring the labels correctly; therefore, the difference between the weighted and 0-1 variants is considerable. In contrast, on Caltech, ML weighting does not have a positive effect, and we see the best results with 0-1 k-NN+Isograph. This might be due to unequal numbers of labeled data from each class; note, however, that Isograph still improves the 0-1 graph.
VIII. CONCLUSIONS

In this paper, we showed that using geodesic distance instead of Euclidean distance improves the neighbourhood graph. To this end, we proposed an unsupervised method (Isograph) to estimate the geodesic distance between points, and we provided bounds on the values of the geodesics estimated by Isograph. As Isograph can be combined with other graph construction methods, we combined it with k-NN and b-matching and presented results on real-world datasets, which show the steady effectiveness of Isograph. The effectiveness of using geodesic distance in the graph construction procedure and the convergence of the Isograph algorithm are subjects of future theoretical analysis. Better local estimation of δ may lead to better geodesic distance estimation. Furthermore, labeled data may be employed to improve the shortcut detection procedure.
IX. APPENDIX A: PROOFS OF SOME OF THE THEOREMS

Figure 10. The shortest curve lying on ℳ connecting u and v; m is the midpoint of the curve.

Theorem 3. Assuming the loop invariant (Equation 7) holds at some time instance, if d_ℳ(u, v) < 2 d̂_ℳ(u, v) − 2δ(ℳ), then Isograph will preserve edge e = (u, v).

Proof: Consider a shortest curve C on ℳ starting from u and ending at v (Figure 10). Let m be the midpoint of curve C, that is, the point that halves the length of C. From the sampling condition we know that there exists a point w in the sampling such that d_ℳ(m, w) ≤ δ(ℳ).

We first show that w cannot coincide with u or v. From the loop invariant assumption we have d̂_ℳ(u, v) ≤ d_ℳ(u, v). As we assumed d_ℳ(u, v)/2 + δ(ℳ) < d̂_ℳ(u, v), we get

δ(ℳ) < d_ℳ(u, v)/2 = d_ℳ(u, m) = d_ℳ(m, v)

This means that w can be neither u nor v, because d_ℳ(m, w) ≤ δ(ℳ).

Now, by the triangle inequality, we have:

d_ℳ(u, w) ≤ d_ℳ(u, m) + d_ℳ(m, w)

Using d_ℳ(u, m) = d_ℳ(u, v)/2 and d_ℳ(m, w) ≤ δ(ℳ), we get

d_ℳ(u, w) ≤ d_ℳ(u, v)/2 + δ(ℳ)

Finally, plugging the last inequality into the assumption d_ℳ(u, v)/2 + δ(ℳ) < d̂_ℳ(u, v), we reach

d̂_ℳ(u, w) ≤ d_ℳ(u, w) < d̂_ℳ(u, v)

In a similar way we have d̂_ℳ(v, w) < d̂_ℳ(u, v). Therefore, edges (u, w) and (v, w) are both in E(G_{u,v}), the path u, w, v has length two, and Isograph preserves edge (u, v) thanks to point w.
Theorem 4. At any point throughout the procedure of Isograph, Equation 7 holds.

Proof: We show that Equation 7 is a loop invariant; that is, we must show:

1) Equation 7 is true when d̂_ℳ is initialized at the beginning of the algorithm.
2) Assuming Equation 7 holds at some time instance, it still holds after an edge is updated.

Item one is true because d̂_ℳ^(1)(u, v) = d(u, v), and by Proposition 1 we have d(u, v) ≤ d_ℳ(u, v).

We now prove item two. Suppose the loop invariant holds at some time t. We must show that d̂_ℳ^(t+1)(u, v) ≤ d_ℳ(u, v). By (the contrapositive of) Theorem 3, if an edge is updated we must have d̂_ℳ^(t)(u, v) ≤ d_ℳ(u, v)/2 + δ(ℳ), so

d̂_ℳ^(t+1)(u, v) = 2(d̂_ℳ^(t)(u, v) − δ(ℳ)) ≤ d_ℳ(u, v)

The lower bound d(u, v) ≤ d̂_ℳ^(t+1)(u, v) also continues to hold, since by Lemma 1 the update only increases the edge weight.
Lemma 2. If the maximum-weight edge in the minimum spanning tree (MST) of the neighbourhood graph has weight w_mst, we have δ(ℳ) ≥ w_mst/2, where δ(ℳ) is defined in Section V.

Proof: Let (u, v) be the edge with maximum weight in the MST. Suppose that removing edge (u, v) results in two connected components C_1 and C_2. Define d_ℳ(x, C_1) as the minimum distance of point x to the points in C_1, i.e. d_ℳ(x, C_1) = min_{y ∈ C_1} d_ℳ(x, y); d_ℳ(x, C_2) is defined in a similar way.

Now, let C be any curve on ℳ between u and v. For any point x on this curve we compute f(x) = d_ℳ(x, C_1) − d_ℳ(x, C_2). We know f(u) < 0 and f(v) > 0 and f is continuous, so by the intermediate value theorem there exists a point x* on curve C such that f(x*) = 0.

Let x_1 be the point of C_1 that has minimum distance from x*, and define x_2 similarly. By the sampling condition, δ(ℳ) ≥ d_ℳ(x*, x_1) = d_ℳ(x*, x_2), so we have

2δ(ℳ) ≥ d_ℳ(x*, x_1) + d_ℳ(x*, x_2) ≥ d_ℳ(x_1, x_2) ≥ d(x_1, x_2)

As (u, v) is an edge of the MST, (u, v) is the shortest edge between any vertex in C_1 and any vertex in C_2. Therefore

2δ(ℳ) ≥ d(x_1, x_2) ≥ d(u, v) = w_mst
REFERENCES

[1] R. Ando and T. Zhang, "A high-performance semi-supervised learning method for text chunking," in Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pp. 1–9, 2005.
[2] S. Basu, M. Bilenko, and R. Mooney, "A probabilistic framework for semi-supervised clustering," in Proceedings of the Tenth ACM International Conference on Knowledge Discovery and Data Mining, pp. 59–68, 2004.
[3] S. Hoi and M. Lyu, "A semi-supervised active learning framework for image retrieval," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 302–309, 2005.
[4] O. Chapelle, B. Schölkopf, and A. Zien, Semi-Supervised Learning, vol. 2. MIT Press, Cambridge, MA, 2006.
[5] M. Belkin and P. Niyogi, "Semi-supervised learning on Riemannian manifolds," Machine Learning, vol. 56, no. 1, pp. 209–239, 2004.
[6] X. Zhu, J. Lafferty, and R. Rosenfeld, Semi-Supervised Learning with Graphs. PhD thesis, 2005.
[7] T. Jebara, J. Wang, and S. Chang, "Graph construction and b-matching for semi-supervised learning," in Proceedings of the 26th Annual International Conference on Machine Learning, pp. 441–448, ACM, 2009.
[8] B. Huang and T. Jebara, "Loopy belief propagation for bipartite maximum weight b-matching," Artificial Intelligence and Statistics, 2007.
[9] W. Cukierski and D. Foran, "Using betweenness centrality to identify manifold shortcuts," in IEEE International Conference on Data Mining Workshops, pp. 949–958, 2008.
[10] J. Tenenbaum, V. de Silva, and J. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, no. 5500, p. 2319, 2000.
[11] M. Belkin and P. Niyogi, "Problems of learning on manifolds," The University of Chicago, 2003.
[12] M. Hein and U. von Luxburg, "Introduction to graph-based semi-supervised learning."
[13] J. Odegard, "Dimensionality reduction methods for molecular motion."
[14] M. Bernstein, V. de Silva, J. Langford, and J. Tenenbaum, "Graph approximations to geodesics on embedded manifolds," tech. rep., Department of Psychology, Stanford University, 2000.
[15] M. do Carmo, Riemannian Geometry. Birkhäuser, 1992.
[16] Y. Chen and J. Wang, "Image categorization by learning and reasoning with regions," Journal of Machine Learning Research, vol. 5, pp. 913–939, 2004.