Math 273a: Optimization
Proximal Operator and Proximal-Point Algorithm
Wotao Yin
Department of Mathematics, UCLA
online discussions on piazza.com
Outline
• Concept
• Definition
• Examples
• Optimization and operator-theoretic properties
• Algorithm
• Interpretations
Why proximal method?
• Newton's method:
  • for C^2-smooth, unconstrained problems
  • allows only modest step sizes
• Gradient method:
  • for C^1-smooth, unconstrained problems
  • allows large step sizes and sometimes distributed implementations
• Proximal method:
  • for smooth and nonsmooth, constrained and unconstrained problems
  • but only for structured problems
  • allows large step sizes and distributed implementations
Why proximal method?
• Newton's method
  • uses a low-level (explicit) operation: x^{k+1} = x^k − λ H^{-1}(x^k) ∇f(x^k)
• Gradient method
  • uses a low-level (explicit) operation: x^{k+1} = x^k − λ ∇f(x^k)
• Proximal methods
  • use a high-level (implicit) operation: x^{k+1} = prox_{λf}(x^k)
  • prox_{λf} is itself an optimization problem
  • it is simple only for structured f, but there are many such f
Proximal operator
Assumptions and Notation
1. f : R^n → R ∪ {∞} is a closed, proper, convex function
2. f is proper if dom f ≠ ∅
3. f is closed if its epigraph is a closed set
   ⇒ prox_{λf} is well defined and single-valued for λ > 0
4. the superscript ∗ (e.g., x*) denotes a global minimizer of some function
5. dom f is the domain of f, i.e., the set of points where f(x) is finite
Proximal operator
Definition
• The proximal operator prox_f : R^n → R^n of a function f is defined by

    prox_f(v) = argmin_{x ∈ R^n} f(x) + (1/2)‖x − v‖²

• The scaled proximal operator prox_{λf} : R^n → R^n is defined by

    prox_{λf}(v) = argmin_{x ∈ R^n} f(x) + (1/(2λ))‖x − v‖²
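As a concrete illustration (not from the slides), here is a minimal numerical sketch of the scaled proximal operator, assuming NumPy and SciPy are available; the helper name prox and the test function are hypothetical choices.

import numpy as np
from scipy.optimize import minimize

def prox(f, v, lam=1.0):
    """Numerically evaluate prox_{lam*f}(v) by solving
    argmin_x f(x) + (1/(2*lam)) * ||x - v||^2 with a generic solver.
    Only a sketch: for structured f, the closed forms shown later are used instead."""
    obj = lambda x: f(x) + np.sum((x - v) ** 2) / (2.0 * lam)
    return minimize(obj, x0=np.asarray(v, dtype=float)).x

# Example: for f(x) = ||x||^2, prox_{lam*f}(v) = v / (1 + 2*lam).
v = np.array([3.0, -4.0])
print(prox(lambda x: np.dot(x, x), v, lam=0.5))   # ≈ v / 2 = [1.5, -2.0]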
Special case: Projection
• Consider a closed convex set C ≠ ∅
• Let ι_C be the indicator function of C: ι_C(x) = 0 if x ∈ C; ∞ otherwise. Then

    prox_{ι_C}(x) = argmin_y ι_C(y) + (1/2)‖y − x‖²
                  = argmin_{y ∈ C} (1/2)‖y − x‖² =: P_C(x)

[Figure: a point x, its projection prox_{ι_C}(x) onto C, and the reflection refl_{ι_C}(x); the origin is marked 0.]
• By generalizing ι_C to f, we generalize P_C to prox_f.
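For instance (a sketch, not from the slides; the box constraint is an assumed example), the prox of the indicator of a box reduces to coordinatewise clipping:

import numpy as np

# prox of the indicator of the box C = [lo, hi]^n is the Euclidean projection onto C
def prox_indicator_box(x, lo, hi):
    return np.clip(x, lo, hi)

print(prox_indicator_box(np.array([2.0, -3.0, 0.5]), -1.0, 1.0))   # [ 1.  -1.   0.5]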
Proximal “step size”
    prox_{λf}(v) = argmin_{x ∈ R^n} f(x) + (1/(2λ))‖x − v‖²

• λ > 0 acts as a "step size":
  • λ ↑ ∞ ⇒ prox_{λf}(v) → argmin_{x ∈ R^n} f(x)
    (if there are multiple minimizers, pick the one closest to v)
  • λ ↓ 0 ⇒ prox_{λf}(v) → P_{dom f}(v), where

      P_{dom f}(v) = argmin_{x ∈ R^n} { (1/2)‖v − x‖² : f(x) is finite }

• Given v, (prox_{λf}(v) − v) is not linear in λ, so λ behaves like a step size but is not literally one
Examples
• Linear function: let a ∈ R^n, b ∈ R, and f(x) = a^T x + b. Then

    prox_{λf}(v) = argmin_{x ∈ R^n} (a^T x + b) + (1/(2λ))‖x − v‖²

  has the first-order optimality condition

    a + (1/λ)(prox_{λf}(v) − v) = 0  ⇐⇒  prox_{λf}(v) = v − λa

• Application: proximal operator of the linear approximation of f
  • let f^(1)(x) = f(x^0) + ⟨∇f(x^0), x − x^0⟩
  • then prox_{λf^(1)}(x^0) = x^0 − λ∇f(x^0) is a gradient step with step size λ (see the sketch below)
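A quick numerical check of the claim above (a sketch, assuming NumPy and SciPy; the test function f(x) = Σ x⁴ is an arbitrary choice): the prox of the linearization f^(1) at x^0 recovers the gradient step x^0 − λ∇f(x^0).

import numpy as np
from scipy.optimize import minimize

f      = lambda x: np.sum(x ** 4)          # an arbitrary smooth test function
grad_f = lambda x: 4 * x ** 3
x0, lam = np.array([1.0, 2.0]), 0.1

f1  = lambda x: f(x0) + grad_f(x0) @ (x - x0)                 # linearization f^(1) at x0
obj = lambda x: f1(x) + np.sum((x - x0) ** 2) / (2 * lam)     # prox_{lam*f1} objective
print(minimize(obj, x0).x)     # ≈ x0 - lam * grad_f(x0) = [0.6, -1.2]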
Examples
• Quadratic function: let A ∈ S_+^n be a symmetric positive semidefinite matrix, b ∈ R^n, and

    f(x) = (1/2) x^T A x − b^T x + c.

  The proximal operator

    prox_{λf}(v) = argmin_{x ∈ R^n} f(x) + (1/(2λ))‖x − v‖²

  has the first-order optimality condition (writing v* = prox_{λf}(v)):

    (Av* − b) + (1/λ)(v* − v) = 0
    ⇐⇒ v* = (λA + I)^{-1}(λb + v)
    ⇐⇒ v* = (λA + I)^{-1}(λb + λAv + v − λAv)
    ⇐⇒ v* = v + (A + (1/λ)I)^{-1}(b − Av)

  This gives an iterative-refinement method for least-squares problems (a numerical check of the two equivalent forms appears after this slide):

    prox_{λf}(v) = v + (A + (1/λ)I)^{-1}(b − Av)

• Application: proximal operator of the quadratic approximation of f
  • let f^(2)(x) = f(x^0) + ⟨∇f(x^0), x − x^0⟩ + (1/2)(x − x^0)^T ∇²f(x^0)(x − x^0)
                = (1/2) x^T A x − b^T x + c, where
    • A = ∇²f(x^0)
    • b = (∇²f(x^0))^T x^0 − ∇f(x^0)
  • by letting v = x^0, we get

      prox_{λf^(2)}(x^0) = x^0 − (∇²f(x^0) + (1/λ)I)^{-1} ∇f(x^0)

  • this is a modified-Hessian Newton update (Levenberg–Marquardt update)
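A minimal numerical check of the two equivalent expressions above (a sketch, assuming NumPy; the random test data are arbitrary):

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4)); A = A @ A.T        # symmetric positive semidefinite
b, v, lam = rng.standard_normal(4), rng.standard_normal(4), 2.0

p1 = np.linalg.solve(lam * A + np.eye(4), lam * b + v)          # (lam*A + I)^{-1}(lam*b + v)
p2 = v + np.linalg.solve(A + np.eye(4) / lam, b - A @ v)        # v + (A + I/lam)^{-1}(b - A v)
print(np.allclose(p1, p2))   # True: both forms give prox_{lam*f}(v)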
Examples
• ℓ1-norm: for x ∈ R^n, let f(x) = ‖x‖_1. Then

    prox_{λf}(x) = sign(x) ⊙ max(|x| − λ, 0) = x − P_{[−λ,λ]^n}(x)

  (elementwise), an operator often written as shrink(x, λ).
• ℓ2-norm: let f(x) = ‖x‖_2. Then

    prox_{λf}(x) = shrink_{‖·‖}(x, λ) = max(‖x‖ − λ, 0) · x/‖x‖ = x − P_{B(0,λ)}(x),

  where we let 0/0 = 0 when x = 0 (both operators are sketched in code after this list).
More examples:
• ℓ∞-norm
• ℓ2,1-norm
• Unitarily invariant matrix norms: Frobenius norm, nuclear norm, maximal singular value
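A sketch of the ℓ1 and ℓ2 proximal operators above, assuming NumPy:

import numpy as np

def prox_l1(x, lam):
    # prox of lam*||.||_1: elementwise soft-thresholding, shrink(x, lam)
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prox_l2(x, lam):
    # prox of lam*||.||_2: shrink the norm, with 0/0 := 0
    nrm = np.linalg.norm(x)
    return np.zeros_like(x) if nrm <= lam else (1.0 - lam / nrm) * x

print(prox_l1(np.array([3.0, -0.5, 1.2]), 1.0))   # [ 2.  -0.   0.2]
print(prox_l2(np.array([3.0, 4.0]), 1.0))         # [2.4  3.2]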
Properties
Proposition (separable sums)
Suppose that f(x, y) = φ(x) + ψ(y) is a block-separable function. Then

    prox_{λf}(v, w) = (prox_{λφ}(v), prox_{λψ}(w)).

• Note: we have already observed this with f(x) = Σ_{i=1}^n |x_i| (see the sketch below).
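A small illustration of the proposition (a sketch, assuming NumPy; the choice f(x, y) = ‖x‖_1 + (1/2)‖y‖² is an arbitrary example):

import numpy as np

lam = 0.5
# prox of lam*||.||_1 (soft-thresholding) and prox of lam*(0.5*||.||^2), applied blockwise
prox_phi = lambda v: np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)
prox_psi = lambda w: w / (1.0 + lam)

v, w = np.array([2.0, -0.3]), np.array([1.0, 4.0])
# prox_{lam*f}(v, w) for f(x, y) = ||x||_1 + 0.5*||y||^2 is just the pair of block proxes:
print(prox_phi(v), prox_psi(w))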
Properties
Theorem (minimizer = fixed point)
Let λ > 0. A point x* ∈ R^n is a minimizer of f if, and only if, prox_{λf}(x*) = x*.
Proximal-point algorithm (PPA)
• Iteration:

    x^{k+1} = prox_{λf}(x^k)

• Convergence:
  • For strongly convex f, prox_{λf} is a contraction, i.e., there exists C_0 < 1 such that

      ‖prox_{λf}(x) − prox_{λf}(y)‖ ≤ C_0 ‖x − y‖.

    Let x = x^k and y = x*. We get

      ‖x^{k+1} − x*‖ = ‖prox_{λf}(x^k) − prox_{λf}(x*)‖ ≤ C_0 ‖x^k − x*‖.

    Iterating this gives

      ‖x^{k+1} − x*‖ ≤ C_0^{k+1} ‖x^0 − x*‖,

    so x^k converges to x* linearly (see the sketch below).
  • For general convex f, prox_{λf} is firmly nonexpansive, i.e.,

      ‖prox_{λf}(x) − prox_{λf}(y)‖² ≤ ‖x − y‖² − ‖(x − prox_{λf}(x)) − (y − prox_{λf}(y))‖².

    From this inequality, one can show:
    • x^k → x* weakly; this remains true if the computation error is summable
    • fixed-point residual: ‖prox_{λf}(x^k) − x^k‖² = o(1/k²)
    • objective: f(x^k) − f(x*) = o(1/k)
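A minimal sketch of the proximal-point iteration on a strongly convex quadratic (assuming NumPy; the random problem data are arbitrary), which exhibits the linear convergence discussed above:

import numpy as np

# f(x) = 0.5 x^T A x - b^T x with A symmetric positive definite (strongly convex),
# so prox_{lam*f}(x) = (I + lam*A)^{-1} (x + lam*b).
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5)); A = A @ A.T + np.eye(5)
b = rng.standard_normal(5)
x_star = np.linalg.solve(A, b)          # the minimizer

x, lam = np.zeros(5), 1.0
for k in range(50):
    x = np.linalg.solve(np.eye(5) + lam * A, x + lam * b)   # x^{k+1} = prox_{lam*f}(x^k)

print(np.linalg.norm(x - x_star))       # tiny (near machine precision): linear convergence to x*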
Proximal operator and resolvent
Definition
Given a mapping T, (I + λT)^{-1} is called the resolvent of T.
Proposition
Suppose that f has subdifferential ∂f. Then

    prox_{λf} = (I + λ∂f)^{-1},

and both sides are single-valued.
Interpretation: implicit (sub)gradient
    x^{k+1} = prox_{λf}(x^k)
    ⇐⇒ x^{k+1} = (I + λ∂f)^{-1}(x^k)
    ⇐⇒ x^k ∈ (I + λ∂f)(x^{k+1})
    ⇐⇒ x^k ∈ x^{k+1} + λ∂f(x^{k+1})
    ⇐⇒ x^{k+1} = x^k − λ ∇̃f(x^{k+1})

• Notation: ∇̃f is the subgradient mapping uniquely defined by prox_{λf}
• Let y^{k+1} = ∇̃f(x^{k+1}) ∈ ∂f(x^{k+1}). Plugging in the formula for x^{k+1}, we get

    y^{k+1} ∈ ∂f(x^k − λ y^{k+1}).

• Given x^k and λ, computing prox_{λf}(x^k) ⇐⇒ solving for x^{k+1} (primal approach)
  ⇐⇒ solving for y^{k+1} (dual approach). A numerical sketch of the implicit step follows.
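A minimal sketch of the implicit-gradient reading, assuming NumPy, for a quadratic f(x) = (1/2)x^T A x − b^T x as in the earlier example: the prox step computed via the resolvent satisfies x^{k+1} = x^k − λ∇f(x^{k+1}).

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3)); A = A @ A.T + np.eye(3)   # symmetric positive definite
b = rng.standard_normal(3)
x, lam = rng.standard_normal(3), 0.5

# prox via the resolvent: x_next = (I + lam*A)^{-1} (x + lam*b)
x_next = np.linalg.solve(np.eye(3) + lam * A, x + lam * b)
grad_at_next = A @ x_next - b
print(np.allclose(x_next, x - lam * grad_at_next))   # True: an implicit (backward) gradient step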
Summary: proximal operator
• Conceptually simple, easy to understand and derive; a standard tool for
nonsmooth and/or constrained optimization
• Works for any λ > 0; more stable than gradient descent
• Gives a fixed-point optimality condition and a convergent algorithm
• "Sits at a high level of abstraction"
• Interpretations: generalized projection, implicit gradient, backward Euler
• Closed-form or quick solutions for many basic functions
Next few lectures:
• The prox operation is applied where it is easily evaluated, so it often appears as a step in
other algorithms, especially operator-splitting algorithms
• Under separable structures, it is amenable to parallel and distributed algorithms, with
interesting applications in machine learning, signal processing, compressed sensing, and
large-scale modern convex problems