On Convergence Analysis of Network-GIANT: An approximate Hessian-based fully distributed optimization algorithm (original) (raw)
Luca Schenato \IEEEmembershipFellow,IEEE and Subhrakanti Dey \IEEEmembershipFellow, IEEEThis publication has emanated from research conducted with the financial support of Swedish Research Council (VR) Grant 2023-04232.Souvik Das and Subhrakanti Dey are with the Department of Electrical Engineering, Uppsala University, Sweden. (e-mail: {souvik.das, subhrakanti.dey}@angstrom.uu.se).Luca Schenato is with the Department of Information Engineering, University of Padova, Italy. e-mail: (schenato@dei.unipd.it).
Abstract
In this paper, we present a detailed convergence analysis of a recently developed approximate Newton-type fully distributed optimization method for smooth, strongly convex local loss functions, called Network-GIANT (adapted from the Federated learning algorithm GIANT possessing mixed linear-quadratic convergence properties), which has been empirically illustrated to show faster linear convergence properties compared to its gradient-based counterparts, and several other existing second order distributed algorithms, while having the same communication complexity (per iteration) as its first order distributed counterparts. By using consensus based parameter updates, and a local Hessian based descent direction at the individual nodes with gradient tracking, we first explicitly characterize a global linear convergence rate for Network-GIANT when the local objective functions are smooth and strongly convex, which can be computed as the spectral radius of a 3×33\times 3 matrix dependent on the Lipschitz continuity (LL) and strong convexity (μ\mu) parameters of the objective functions, and the spectral norm (σ\sigma) of the underlying undirected graph represented by a doubly stochastic consensus matrix. We provide an explicit bound on the step size parameterη\eta, below which this spectral radius is guaranteed to be less than 11. Furthermore, we derive a mixed linear-quadratic inequality based upper bound for the optimality gap norm, which allows us to conclude that, under small step size values, asymptotically, as the algorithm approaches the global optimum, it achieves a locally linear convergence rate of 1−η(1−γμ)1-\eta(1-\frac{\gamma}{\mu}) for Network-GIANT, provided the Hessian approximation error γ\gamma (between the harmonic mean of the local Hessians and the global hessian (the arithmetic mean of the local Hessians) is smaller than μ\mu. This asymptotically linear convergence rate of≈1−η\approx 1-\eta explains the faster convergence rate of Network-GIANT for the first time. Numerical experiments are carried out with a reduced CovType dataset for binary logistic regression over a variety of graphs to illustrate the above theoretical results.
1 Introduction
Distributed optimization and learning generally refer to the paradigm where multiple agents optimize a global cost function or train a global model collaboratively, often by optimizing their local cost functions or training a local model based on their local data sets. In applications such as networked autonomous systems, Internet of Things (IoT), collaborative robotics and smart manufacturing, moving large amounts of locally collected data to a central processing unit can be expensive from both communication and storage point of view. Additionally, distributed optimization/learning prevents sharing of raw data, thus guaranteeing data privacy to a certain extent. Distributed/decentralized optimization or learning can occur over a network consisting of a central server with multiple local nodes (Federated Learning), or over a fully distributed setting without a central server, where nodes only communicate to their neighbours. There have been significant progress in recent years in both settings - see [1, 2, 3] for an extensive survey on these topics focusing on their key challenges and opportunities.
In this work, we will primarily focus on the topic of fully distributed optimization where nodes collaboratively optimize a global cost which is a sum of local cost functions. We will also restrict ourselves to convex, and in particular, strongly convex cost functions. In this context, distributed optimization or learning algorithms proceed in an iterative fashion, where each node updates its local estimate of the global optimization variable based on consensus-type averaging of its own information and the information received from its neighbours, combined with (typically) a gradient-descent type algorithm with a suitably chosen step size (constant or diminishing with time). Without going into a detailed literature survey of gradient based distributed optimization algorithms, we refer the readers to [4, 5]. For strongly convex functions, it is well known that gradient based techniques achieve linear (exponential) convergence with a constant step size. Replicating this in a fully distributed setting requires a gradient-tracking algorithm where each node keeps a vector that keeps a local estimate of the global gradient [6]. A detailed convergence analysis was carried out in [6] to illustrate linear convergence of gradient-tracking based distributed first order optimization algorithms for smooth and strongly convex functions over undirected graphs, whereas a sublinear convergence rate O(1/k)O(1/k) was guaranteed for smooth convex functions that are not strongly convex. A gradient-tracking based distributed first-order optimization with an additional distributed eigenvector tracking was presented in [7], with a provable linear convergernce rate.
It is well known that for strongly convex and smooth functions, centralized optimization algorithms using curvature (or Hessian) information of the cost function, such as the Newton-Raphson method, achieves faster locally quadratic convergence in the vicinity of the optimum. In the context of fully distributed optimization algorithms utilizing Newton-type methods, achieving quadratic convergence is difficult, primarily due to the fact that communicating local Hessian information to neighboring nodes is infeasible for large model size or parameter dimension, and the underlying consensus algorithm essentially slows down the convergence to at best a linear rate. Computation of the Hessian and its inversion is another bottleneck, which has motivated a plethora of recent approximate Newton-type Federated learning algorithms [8, 9, 10, 11], which can also display a mixed linear-quadratic convergence rate. In the fully distributed regime, earlier examples of second order optimization algorithms include [12, 13, 14, 15] that approximate the Hessian based on Taylor series expansion and historical data. More recent works, such as [16] developed a fully distributed version of DANE [17] (termed as Network-DANE) to overcome the high computational costs of the existing algorithms, and [18] captured the second order information of the objective functions based on an augmented Lagrangian function along with gradient diffusion technique to come up with a distributed algorithm leveraging the Hessian information of the objective function. Recent papers such as [19] and [20] have shown however, that the superlinear convergence properties of Federated second order optimization algorithms such as GIANT can be retrieved only when exact averages of gradients or parameters can be computed via finite-time exact consensus[19], or through distributed finite-time set consensus [20], both of which can require up to O(N)O(N) (where NN is the number of nodes) consensus rounds between two successive optimization iterations, thus making them infeasible for large networks.
In this article, we focus on a recently developed fully distributed optimization algorithm called Network-GIANT\mathrm{Network\text{-}GIANT} [21], which combined gradient tracking with consensus based updates and a descent direction computed by the product of the inverse local Hessian and the gradient tracking vector at each node. Built on an extension of the approximate Newton-type federated learning algorithm GIANT\mathrm{GIANT} [22], it was shown that Network-GIANT\mathrm{Network\text{-}GIANT} leverages gradient tracking to achieve semi-global exponential convergence to the exact optimal solution for a sufficiently small step-size, and enjoys a low communication overhead of O(n)O(n) per iteration at each node, where nn is the dimension of the decision space, thus making it comparable to first-order distributed optimization techniques. The authors of [21] empirically also demonstrated a superior convergence rate compared to gradient-tracking based first order methods, as well as a number of other second order distributed optimization algorithms, including the well-known Network-DANE [16]. Although [21] proved semi-global exponential convergence using a two-time-scale separation principle that guarantees linear convergence and empirically verified its superior performance, an explicit expression for the actual convergence rate was not established. In addition, while a superior linear convergence was empirically demonstrated compared to its first order optimization counterparts, there was no analytical justification for this observation. In this paper, we set out to provide a deeper analysis of the convergence of Network-GIANT and provide partial answers to some of the above unanswered questions.
Contributions
The main contributions are highlighted below:
- •
We first provide a global linear convergence result with an explicitly computable rate for Network-GIANT\mathrm{Network\text{-}GIANT} for a sufficiently small step size η\eta. In a similar vein to [6], utilizing a linear convergence result for a damped Newton algorithm, we establish a linear inequality for a three-dimensional system consisting of the norms of consensus error, gradient tracking error and the optimality gap error. We show that the linear convergence rate is explicitly given by the spectral radius of an associated 3×33\times 3 matrix dependent on the step size η\eta, Lipschitz continuity (LL) and strong convexity parameter (μ\mu) of the local cost functions, and the spectral norm (σ\sigma) of the doubly-stochastic consensus matrix of the underlying graph connecting the nodes. - •
We also provide an explicit (albeit conservative) upper bound on the step size below which Network-GIANT\mathrm{Network\text{-}GIANT} is guaranteed to converge linearly (i.e. the aforementioned spectral radius is less than 11). - •
Finally, we derive a mixed linear-quadratic inequality based upper bound for the optimality gap norm, under the condition that the error due to Hessian approximation is sufficiently small. Exploring this result further, we argue that under a small step-size η\eta assumption, the gradient-tracking and consensus error norms decay much faster than the optimality gap error, and asymptotically as the algorithm approaches the optimum, Network-GIANT\mathrm{Network\text{-}GIANT} displays a network-independent approximate local linear convergence rate of 1−η1-\eta, which explains analytically the superior convergence rate of Network-GIANT\mathrm{Network\text{-}GIANT} compared to its first-order counterparts. - •
We illustrate the above findings through a distributed logistic-regression example with a reduced CovType dataset over regular expander type and Erdős–Rényi random graphs with varying degrees of connectivity.
Furthermore, we refer the readers to [21] for a thorough comparison of Network-GIANT\mathrm{Network\text{-}GIANT} with other state-of-the-art fully distributed second-order algorithms.
2 Problem Formulation
We consider a fully distributed unconstrained optimization problem over a communication network that can be modeled by a connected and time-invariant graph 𝒢=(𝒩,ℰ)\mathcal{G}=(\mathcal{N},\mathcal{E}), where 𝒩={1,2,…,N}\mathcal{N}=\{1,2,...,N\} denotes the set of nodes and ℰ⊆𝒩×𝒩\mathcal{E}\subseteq\mathcal{N}\times\mathcal{N} the set of bidirectional edges connecting the nodes. Nodes can only directly communicate with their 1-hop neighbors. The network can be equivalently described by a weight matrix W∈ℝN×NW\in\mathbb{R}^{N\times N}, where the element wijw_{ij} is positive if there is an edge (i,j)∈ℰ(i,j)\in\mathcal{E} and zero otherwise. The consensus matrix WW is chosen to be symmetric and doubly stochastic, which implies that W𝟏=𝟏W\mathbf{1}=\mathbf{1} (here 𝟏∈ℝN×1\mathbf{1}\in\mathbb{R}^{N\times 1} is a column vector of all 1’s), and the eigenvalues of WW lie in (−1,1](-1,1]. It is possible to construct this matrix in a distributed way, for example using the Metropolis algorithm [23]. We will use the following conventions for vector and matrix norms: the vector norm ∥.∥\left\lVert.\right\rVert represents the 22-norm, whereas for a matrix, ∥.∥\left\lVert.\right\rVert denotes the Frobenius norm. The matrix 22-norm (or spectral norm), if used, will be denoted by ∥.∥2\left\lVert.\right\rVert_{2}. It is a well known fact that σ=‖W−1N𝟏𝟏T‖2\sigma=\left\lVert W-\frac{1}{N}\mathbf{1}\mathbf{1}^{T}\right\rVert_{2} satisfies0<σ<10<\sigma<1.
The unconstrained optimization problem is a fully distributed version of the Federated optimization problem considered in [22]: The unconstrained global optimization problem is defined as
| f(x⋆)=minx∈ℝn{f(x)=1N∑i=1Nfi(x)},f(x^{\star})=\min_{x\in\mathbb{R}^{n}}\left\{f(x)=\frac{1}{N}\sum_{i=1}^{N}f_{i}(x)\right\}, | ((1)) |
|---|
where f(x)f_{(}x) denotes the local loss function for the ii-th node.
We make the following standard assumption regarding the local loss functions fi(⋅)f_{i}(\cdot):
Assumption 1
The local objective fi(⋅)f_{i}(\cdot) is μ\mu-strongly convex and LL-smooth, i.e. there exists a constant LL such that ∀x,x′∈ℝn\forall x,x^{\prime}\in\mathbb{R}^{n}, and for all i∈{1,2,…,N}i\in\{1,2,\ldots,N\}, ‖∇fi(x)−∇fi(x′)‖2⩽L‖x−x′‖\left\lVert\nabla f_{i}(x)-\nabla f_{i}(x^{\prime})\right\rVert_{2}\leqslant L\left\lVert x-x^{\prime}\right\rVert, and μ𝕀⩽∇2fi(x)⩽L𝕀\mu\mathbb{I}\leqslant\nabla^{2}f_{i}(x)\leqslant L\mathbb{I}, where 𝕀\mathbb{I} denotes an identity matrix of sizeN×NN\times N, Note that this assumption implies that the global objective function f(x)f(x) is also μ\mu-strongly convex and LL-smooth.
Assumption 1 guarantees the uniqueness of the minimizer of ((1)) and is standard in second-order optimization, and can often be guaranteed by the addition of a regularizer to the cost function if it is not strictly convex.
We define the following matrices 𝐱k,𝐬k,∇k∈ℝN×n\mathbf{x}_{k},\mathbf{s}_{k},\nabla_{k}\in\mathbb{R}^{N\times n}, such that:
| 𝐱k=[x1kx2k⋮xNk],𝐬k=[s1ks2k⋮sNk],∇k=[∇f1(x1k)∇f2(x2k)⋮∇fN(xNk)].\displaystyle\mathbf{x}_{k}=\begin{bmatrix}x_{1}^{k}\\ x_{2}^{k}\\ \vdots\\ x_{N}^{k}\end{bmatrix},\,\mathbf{s}_{k}=\begin{bmatrix}s_{1}^{k}\\ s_{2}^{k}\\ \vdots\\ s_{N}^{k}\end{bmatrix},\,\nabla_{k}=\begin{bmatrix}\nabla f_{1}(x_{1}^{k})\\ \nabla f_{2}(x_{2}^{k})\\ \vdots\\ \nabla f_{N}(x_{N}^{k})\end{bmatrix}. | ((3)) |
|---|
In order to avoid Kronecker-product based cumbersome notations, we define the relevant vectors in the row-vector format throughout the paper. We assume that each node ii keeps a local copy xik∈ℝ1×nx_{i}^{k}\in\mathbb{R}^{1\times n} of the global state vector𝐱k\mathbf{x}_{k} at the kk-th iteration, and a vector sik∈ℝ1×ns_{i}^{k}\in\mathbb{R}^{1\times n} that tracks the global average gradient∇f(𝐱k)=1N∑i=1N∇fi(xik)\nabla f(\mathbf{x}_{k})=\frac{1}{N}\sum_{i=1}^{N}\nabla f_{i}(x_{i}^{k}), where ∇fi(xik)∈ℝ1×n\nabla f_{i}(x_{i}^{k})\in\mathbb{R}^{1\times n}, denotes the local gradient at the node ii. We also define the global Hessian as∇2f(𝐱𝐤)=1N∇2fi(xik)\nabla^{2}f(\mathbf{x_{k}})=\frac{1}{N}\nabla^{2}f_{i}(x_{i}^{k}), and with a slight abuse of notation∇2f(x~)=∇2f(𝐱𝐤)∣xik=x~T\nabla^{2}f(\tilde{x})=\nabla^{2}f(\mathbf{x_{k}})\mid_{x_{i}^{k}=\tilde{x}^{T}} for further analysis.
3 Algorithm description
The Network-GIANT algorithm is given by the two following consensus-based updates at each node ii at iteration kk, one for the parameter vector xikx_{i}^{k}, and the other one being the familiar update for the gradient-tracking vector siks_{i}^{k} [6], as given below:
| xik+1=∑j=1nwijxik−ηsik∇2fi(xik)−1\displaystyle x_{i}^{k+1}=\sum_{j=1}^{n}w_{ij}x_{i}^{k}-\eta s_{i}^{k}{\nabla^{2}f_{i}(x_{i}^{k})}^{-1} | |
|---|---|
| sik+1=∑j=1nwijsik+∇fi(xik+1)−∇fi(xik),si0=∇fi(xi0),\displaystyle s_{i}^{k+1}=\sum_{j=1}^{n}w_{ij}s_{i}^{k}+\nabla f_{i}(x_{i}^{k+1})-\nabla f_{i}(x_{i}^{k}),\;s_{i}^{0}=\nabla f_{i}(x_{i}^{0}), | ((4)) |
where Hi~k=∇2fi(xik)\tilde{H_{i}}^{k}=\nabla^{2}f_{i}(x_{i}^{k}) is the local Hessian at the ii-th node at the kk-the iteration, which is invertible due to the strong convexity assumption on fi(x)f_{i}(x). 0<η<10<\eta<1 is a stepsize that needs to be chosen carefully.
Using the matrix notation from ((3)), one can write the above algorithm compactly as follows:
| 𝐱k+1=W𝐱k−η𝐲k\displaystyle\mathbf{x}_{k+1}=W\mathbf{x}_{k}-\eta\mathbf{y}_{k} | |
|---|---|
| 𝐬k+1=W𝐬k+∇k+1−∇k,𝐬0=∇0.\displaystyle\mathbf{s}_{k+1}=W\mathbf{s}_{k}+\nabla_{k+1}-\nabla_{k},\;\mathbf{s}_{0}=\nabla_{0}. | ((5)) |
Here 𝐲k=[y1ky2k…,yNk]T\mathbf{y}_{k}=[y_{1}^{k}\,y_{2}^{k}\,\ldots,y_{N}^{k}]^{T}, where yik=sik∇2fi(xik)−1y_{i}^{k}=s_{i}^{k}{\nabla^{2}f_{i}(x_{i}^{k})}^{-1}.
For the purpose of the subsequent convergence analysis of the algorithm given in ((4)), we also define the following average quantities x¯k=𝟏T𝐱kN=1N∑i=1Nxik,s¯k=𝟏T𝐬kN=1N∑i=1Nsik\bar{x}_{k}=\frac{\mathbf{1}^{T}\mathbf{x}_{k}}{N}=\frac{1}{N}\sum_{i=1}^{N}x_{i}^{k},\bar{s}_{k}=\frac{\mathbf{1}^{T}\mathbf{s}_{k}}{N}=\frac{1}{N}\sum_{i=1}^{N}s_{i}^{k} and gk=∇f(𝐱k)=𝟏T∇kNg_{k}=\nabla f(\mathbf{x}_{k})=\frac{\mathbf{1}^{T}\nabla_{k}}{N}. Using a similar inductive proof as in [6], it follows that s¯k=gk\bar{s}^{k}=g_{k}, that is, the average gradient tracking vector tracks the average global gradient at all steps, when initialized as 𝐬0=∇0\mathbf{s}_{0}=\nabla_{0}.
4 Convergence Analysis
In this section, we provide two main results regarding the convergence of the Network-GIANT algorithm given by ((5)). First, under Assumptions 1(a) and 1(b), we establish that for a sufficiently small step size η\eta, the Network-GIANT algorithm enjoys a linear convergence rate, implying that the optimality gap ‖x¯k−x⋆‖||\bar{x}^{k}-x^{\star}|| decays exponentially with the iteration number kk. We are able to provide an explicit expression for an upper bound η¯\bar{\eta} such that for η<η¯\eta<\bar{\eta}, this result holds true. In the second result, with additional assumptions on the Lipschitz continuity of the Hessians of the local loss functions and a small enough approximation error bound for the error between the true Hessian of f(x)f(x), and the one given by the harmonic mean of the local Hessians, we prove a mixed linear-quadratic inequality based upper bound for the optimality gap in the Network-GIANT algorithm.
4.1 Linear Convergence Analysis of Network-GIANT
Before we proceed further, we quote the following results that have been established in [6] under Assumption 1. We also define the three error terms: ‖𝐱k−𝟏x¯k‖\left\lVert\mathbf{x}_{k}-\mathbf{1}\bar{x}_{k}\right\rVert, ‖𝐬k−𝟏gk‖\left\lVert\mathbf{s}_{k}-\mathbf{1}g_{k}\right\rVert, and‖x¯k−x⋆‖\left\lVert\bar{x}_{k}-x^{\star}\right\rVert, which correspond to the norms of the consensus error, gradient tracking error, and theoptimality gap, respectively.
Lemma 1
Under Assumption 1, we have
(i)
| ‖∇k−∇k−1‖⩽L‖𝐱k−𝐱k−1‖,\displaystyle\left\lVert\nabla_{k}-\nabla_{k-1}\right\rVert\leqslant L\left\lVert\mathbf{x}_{k}-\mathbf{x}_{k-1}\right\rVert, | ((6)) |
|---|---|
| ‖gk−gk−1‖⩽LN ‖𝐱k−𝐱k−1‖.\displaystyle\left\lVert g_{k}-g_{k-1}\right\rVert\leqslant\frac{L}{\mathchoice{{\hbox{$\displaystyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=6.83331pt,depth=-5.46667pt}}}{{\hbox{$\textstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=6.83331pt,depth=-5.46667pt}}}{{\hbox{$\scriptstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=4.78334pt,depth=-3.82669pt}}}{{\hbox{$\scriptscriptstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=3.41667pt,depth=-2.73335pt}}}}\left\lVert\mathbf{x}_{k}-\mathbf{x}_{k-1}\right\rVert. | ((7)) |
(ii)
| ‖𝐬k‖⩽‖𝐬k−𝟏gk‖+L‖𝐱k−𝟏x⋆‖,\displaystyle\left\lVert\mathbf{s}_{k}\right\rVert\leqslant\left\lVert\mathbf{s}_{k}-\mathbf{1}g_{k}\right\rVert+L\left\lVert\mathbf{x}_{k}-\mathbf{1}x^{\star}\right\rVert, | ((8)) |
|---|---|
| ⩽‖𝐬k−𝟏gk‖+L‖𝐱k−𝟏x¯k‖+LN ‖x¯k−x⋆‖.\displaystyle\leqslant\left\lVert\mathbf{s}_{k}-\mathbf{1}g_{k}\right\rVert+L\left\lVert\mathbf{x}_{k}-\mathbf{1}\bar{x}_{k}\right\rVert+L\mathchoice{{\hbox{$\displaystyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=6.83331pt,depth=-5.46667pt}}}{{\hbox{$\textstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=6.83331pt,depth=-5.46667pt}}}{{\hbox{$\scriptstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=4.78334pt,depth=-3.82669pt}}}{{\hbox{$\scriptscriptstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=3.41667pt,depth=-2.73335pt}}}\left\lVert\bar{x}_{k}-x^{\star}\right\rVert. |
We will also need another result concerning the global linear convergence of a damped Newton method, stated as follows.
Lemma 2
Consider a column vector z∈ℝnz\in\mathbb{R}^{n}, and a function u(z)u(z) that satisfies the smoothness and convexity assumptions in Assumption 1 (a)-(b). A damped Newton iterative update to find z⋆z^{\star}, the unique minimum of u(z)u(z) can be written as
| z+=z−η[∇2u(z)]−1∇u(z), 0<η<1.\displaystyle z^{+}=z-\eta[\nabla^{2}u(z)]^{-1}\nabla u(z),\;0<\eta<1. | ((9)) |
|---|
It can be shown that if η⩽μL\eta\leqslant\frac{\mu}{L}, then
| ‖z+−z⋆‖⩽(1−ημL)‖z−z⋆‖.\displaystyle\left\lVert z^{+}-z^{\star}\right\rVert\leqslant(1-\eta\frac{\mu}{L})\left\lVert z-z^{\star}\right\rVert. | ((10)) |
|---|
Proof 4.1.
Define Hz=[∇2u(z)]−1∫01∇2(z⋆+λ(z−z⋆)dλH_{z}=[\nabla^{2}u(z)]^{-1}\int_{0}^{1}\nabla^{2}(z^{\star}+\lambda(z-z^{\star})d\lambda. Then one can show that
| [∇2u(z)]−1∇u(z)=[∇2u(z)]−1(∇u(z)−∇u(z⋆))\displaystyle[\nabla^{2}u(z)]^{-1}\nabla u(z)=[\nabla^{2}u(z)]^{-1}(\nabla u(z)-\nabla u(z^{\star})) | |
|---|---|
| =([∇2u(z)]−1∫01∇2(z⋆+λ(z−z⋆)dλ)(z−z⋆)\displaystyle=\left([\nabla^{2}u(z)]^{-1}\int_{0}^{1}\nabla^{2}(z^{\star}+\lambda(z-z^{\star})d\lambda\right)(z-z^{\star}) | |
| =Hz(z−z⋆)\displaystyle=H_{z}(z-z^{\star}) | ((11)) |
which implies
| z+−z⋆=(𝕀−ηHz)(z−z⋆)z^{+}-z^{\star}=(\mathbb{I}-\eta H_{z})(z-z^{\star}) |
|---|
Using μL𝕀⩽Hz⩽Lμ𝕀\frac{\mu}{L}\mathbb{I}\leqslant H_{z}\leqslant\frac{L}{\mu}\mathbb{I}, it then follows that
| (1−ηLμ)𝕀⩽(𝕀−ηHz)⩽(1−ημL)𝕀.(1-\eta\frac{L}{\mu})\mathbb{I}\leqslant(\mathbb{I}-\eta H_{z})\leqslant(1-\eta\frac{\mu}{L})\mathbb{I}. |
|---|
Therefore, if η⩽μL\eta\leqslant\frac{\mu}{L}, we obtain0⩽𝕀−ηHz⩽(1−ημL)𝕀0\leqslant\mathbb{I}-\eta H_{z}\leqslant(1-\eta\frac{\mu}{L})\mathbb{I}, implying‖𝕀−ηHz‖2⩽(1−ημL)\left\lVert\mathbb{I}-\eta H_{z}\right\rVert_{2}\leqslant(1-\eta\frac{\mu}{L}), from which the proof follows.
Using the above two lemmas, one can show the following result:
Theorem 4.3.
Consider the NETWORK-GIANT algorithm given by ((5)), implemented with a doubly stochastic consensus matrix WW with a spectral norm σ=‖W−1N𝟏𝟏T‖2<1\sigma=\left\lVert W-\frac{1}{N}\mathbf{1}\mathbf{1}^{T}\right\rVert_{2}<1. Assuming η⩽μL\eta\leqslant\frac{\mu}{L}, and that Assumption 1 (a)-(b) holds, we can show the following inequality:
| [‖𝐱k+1−𝟏x¯k+1‖‖𝐬k+1−𝟏gk+1‖N ‖x¯k+1−x⋆‖]⩽G(η)[‖𝐱k−𝟏x¯k‖‖𝐬k−𝟏gk‖N ‖x¯k−x⋆‖]\displaystyle\begin{bmatrix}\left\lVert\mathbf{x}_{k+1}-\mathbf{1}\bar{x}_{k+1}\right\rVert\\ \left\lVert\mathbf{s}_{k+1}-\mathbf{1}g_{k+1}\right\rVert\\ \mathchoice{{\hbox{$\displaystyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=6.83331pt,depth=-5.46667pt}}}{{\hbox{$\textstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=6.83331pt,depth=-5.46667pt}}}{{\hbox{$\scriptstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=4.78334pt,depth=-3.82669pt}}}{{\hbox{$\scriptscriptstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=3.41667pt,depth=-2.73335pt}}}\left\lVert\bar{x}_{k+1}-x^{\star}\right\rVert\end{bmatrix}\leqslant G(\eta)\begin{bmatrix}\left\lVert\mathbf{x}_{k}-\mathbf{1}\bar{x}_{k}\right\rVert\\ \left\lVert\mathbf{s}_{k}-\mathbf{1}g_{k}\right\rVert\\ \mathchoice{{\hbox{$\displaystyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=6.83331pt,depth=-5.46667pt}}}{{\hbox{$\textstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=6.83331pt,depth=-5.46667pt}}}{{\hbox{$\scriptstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=4.78334pt,depth=-3.82669pt}}}{{\hbox{$\scriptscriptstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=3.41667pt,depth=-2.73335pt}}}\left\lVert\bar{x}_{k}-x^{\star}\right\rVert\end{bmatrix} | |
|---|---|
| G(η)=[σ+ηLμημηLμ2L+ηL2μσ+ηLμηL2μηLμημ1−ημL]\displaystyle G(\eta)=\begin{bmatrix}\sigma+\eta\frac{L}{\mu}&\frac{\eta}{\mu}&\eta\frac{L}{\mu}\\ 2L+\eta\frac{L^{2}}{\mu}&\sigma+\eta\frac{L}{\mu}&\eta\frac{L^{2}}{\mu}\\ \eta\frac{L}{\mu}&\frac{\eta}{\mu}&1-\eta\frac{\mu}{L}\end{bmatrix} | ((12)) |
Proof 4.4.
See Appendix A for a proof.
It is clear from ((12)) that the three error norms will go to zero exponentially fast as long as the spectral radiusρ(G(η))<1\rho(G(\eta))<1. Since G(η)G(\eta) is a matrix with all positive elements, from the Perron-Frobenius theorem, the largest eigenvalue is real and positive. In the next theorem, we provide an upper bound on the step size η\eta that ensures that the spectral radius ρ(G(η))<1\rho(G(\eta))<1.
Theorem 4.5.
If one chooses η<η¯=(1−σ)22(2−σ)(κ+κ3)\eta<\bar{\eta}=\frac{(1-\sigma)^{2}}{2(2-\sigma)\left(\kappa+\kappa^{3}\right)}, whereκ=Lμ\kappa=\frac{L}{\mu} is the condition number of the global Hessian, we can show that ρ(G(η))<1\rho(G(\eta))<1.
Proof 4.6.
See Appendix B.
4.2 A Mixed Linear-Quadratic Inequality for the Optimality Gap
Motivated by the conservativeness of the global linear convergence results in the previous subsection, we investigate why we obtain substantially faster convergence rates for Network-GIANT in numerical results, than those predicted by Theorems 1 and 2. To this end, we show that under some further assumptions, one can obtain a mixed linear-quadratic convergence rate of the optimality gap norm ‖x¯k−x⋆‖\left\lVert\bar{x}_{k}-x^{\star}\right\rVert. Before we proceed, we define the true global Hessian (arithmetic mean of the local Hessians) asHtr(𝐱𝐤)=1N∑i=1NH~ikH_{tr}(\mathbf{x_{k}})=\frac{1}{N}\sum_{i=1}^{N}\tilde{H}_{i}^{k}, and the approximate Hessian given by the harmonic mean of the local Hessians asHapp(𝐱𝐤)=(1N∑i=1N(H~ik)−1)−1H_{app}(\mathbf{x_{k}})=\left(\frac{1}{N}\sum_{i=1}^{N}(\tilde{H}_{i}^{k})^{-1}\right)^{-1}.
We make the following two assumptions, the first of which is standard in the analysis of Newton-type optimization algorithms.
Assumption 2
We assume that the Hessians of all the local cost functions are Lipschitz continuous, i.e.
| ‖∇2fi(x)−∇2fi(y)‖⩽L¯‖x−y‖,∀i.\displaystyle\left\lVert\nabla^{2}f_{i}(x)-\nabla^{2}f_{i}(y)\right\rVert\leqslant\bar{L}\left\lVert x-y\right\rVert,\forall i. | ((13)) |
|---|
The second assumption is on the tightness of the approximation of global hessian by the harmonic mean of the local Hessians.
Assumption 3
There exists a sufficiently small γ\gamma such that‖Htr(𝐱)−Happ(𝐱)‖⩽γ,∀𝐱\left\lVert H_{tr}(\mathbf{x})-H_{app}(\mathbf{x})\right\rVert\leqslant\gamma,\;\forall\mathbf{x}.
Theorem 4.11.
Under Assumptions 1, 2, and 3, one can show that if γ<μ\gamma<\mu, andη<1\eta<1, given specific error norms ‖x¯k−x⋆‖,‖𝐱k−𝟏x¯k‖,‖𝐬k−𝟏gk‖\left\lVert\bar{x}_{k}-x^{\star}\right\rVert,\left\lVert\mathbf{x}_{k}-\mathbf{1}\bar{x}_{k}\right\rVert,\left\lVert\mathbf{s}_{k}-\mathbf{1}g_{k}\right\rVert at iteration kk, the optimality gap error norm ‖x¯k+1−x⋆‖\left\lVert\bar{x}_{k+1}-x^{\star}\right\rVert at iteration k+1k+1 satisfies the following mixed linear-quadratic inequality:
| ‖x¯k+1−x⋆‖⩽(1−η(1−γμ))‖x¯k−x⋆‖\displaystyle\left\lVert\bar{x}_{k+1}-x^{\star}\right\rVert\leqslant\left(1-\eta(1-\frac{\gamma}{\mu})\right)\left\lVert\bar{x}_{k}-x^{\star}\right\rVert | |
|---|---|
| +ηL¯μN ‖𝐱k−𝟏x¯k‖‖x¯k−x⋆‖+ηL¯2μ‖x¯k−x⋆‖2\displaystyle\;\;\;\;\;\;\;+\frac{\eta\bar{L}}{\mu\mathchoice{{\hbox{$\displaystyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=6.83331pt,depth=-5.46667pt}}}{{\hbox{$\textstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=6.83331pt,depth=-5.46667pt}}}{{\hbox{$\scriptstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=4.78334pt,depth=-3.82669pt}}}{{\hbox{$\scriptscriptstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=3.41667pt,depth=-2.73335pt}}}}\left\lVert\mathbf{x}_{k}-\mathbf{1}\bar{x}_{k}\right\rVert\left\lVert\bar{x}_{k}-x^{\star}\right\rVert+\frac{\eta\bar{L}}{2\mu}\left\lVert\bar{x}_{k}-x^{\star}\right\rVert^{2} | |
| +ημN ‖𝐬k−𝟏gk‖+ηLμN ‖𝐱k−𝟏x¯k‖.\displaystyle\;\;\;\;\;\;\;+\frac{\eta}{\mu\mathchoice{{\hbox{$\displaystyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=6.83331pt,depth=-5.46667pt}}}{{\hbox{$\textstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=6.83331pt,depth=-5.46667pt}}}{{\hbox{$\scriptstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=4.78334pt,depth=-3.82669pt}}}{{\hbox{$\scriptscriptstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=3.41667pt,depth=-2.73335pt}}}}\left\lVert\mathbf{s}_{k}-\mathbf{1}g_{k}\right\rVert+\frac{\eta L}{\mu\mathchoice{{\hbox{$\displaystyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=6.83331pt,depth=-5.46667pt}}}{{\hbox{$\textstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=6.83331pt,depth=-5.46667pt}}}{{\hbox{$\scriptstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=4.78334pt,depth=-3.82669pt}}}{{\hbox{$\scriptscriptstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=3.41667pt,depth=-2.73335pt}}}}\left\lVert\mathbf{x}_{k}-\mathbf{1}\bar{x}_{k}\right\rVert. | ((14)) |
Proof 4.12.
See Appendix C.
It was illustrated with a comprehensive set of numerical results in[21] that Network-GIANT outperforms first-order optimization algorithms based on gradient tracking and a number of other distributed optimization methods using different types of Hessian approximation techniques, such as Network-DANE [27], and various other Network-Newton methods including the one presented in [28]. While it is generally difficult to compare exact convergence rates of various distributed Newton type methods, Theorem4.11 provides an explanation why Network-GIANT provides a faster linear convergence rate than the gradient tracking based first order algorithm, e.g. in [6].
An intuitive explanation follows like this: note that when the step-size η\eta is small, from Theorem 1, it is clear that the matrix G(η)G(\eta) is diagonally dominant, and the consensus error norm and the gradient tracking error norm decay faster with an approximate linear rate of σ<1\sigma<1, while the optimality gap norm decays at a slower rate of 1−ημL1-\eta\frac{\mu}{L}. Thus it is clear that when the consensus error norm and the gradient tracking error norm are sufficiently small, the linear quadratic convergence result in Theorem 3 above indicates that the optimality gap norm at time iteration k+1k+1 depends on a linear and a quadratic term of the optimality gap norm at iteration kk. Clearly, the quadratic term will only dominate in the early stages when the optimality gap is larger (see [29] for a similar convergence behaviour for a centralized approximate Newton method). Asymptotically, as the optimality gap decreases, the linear term will dominate and the asymptotic convergence rate is given by (1−η(1−γμ))(1-\eta(1-\frac{\gamma}{\mu})). When the hessian approximation error γ\gamma is sufficiently small compared to the strong convexity constant μ\mu, then this linear convergence rate is approximately (1−η)(1-\eta), which is much faster than the corresponding rate of the gradient tracking based first order method, given by approximately (1−ημ)(1-\eta\mu) in the third diagonal entry of G¯(η)\bar{G}(\eta).
In the next section, we will illustrate these somewhat qualitative arguments through numerical illustrations via a logistic regression based binary classification algorithm based on a CovType data set. We will compare Network-GIANT and the gradient tracking based first order method from [6], as well as a Nesterov-acceleration based gradient tracking algorithm ACC-NGD-SC [30]. Note that to avoid repetition, we do not include comparative performance results with other second order approximate Newton algorithms that were already reported in [21].
5 Numerical Analysis
This section empirically studies the convergence properties of Network-GIANT\mathrm{Network\text{-}GIANT}. Extensive comparisons of Network-GIANT\mathrm{Network\text{-}GIANT} with other first-order and second-order distributed algorithms were reported in [21, §V] and will, therefore, not be included in the numerical analysis of this article. Here, we pose the following research questions:
- (a)
How does the connectivity of the underlying graph affect the convergence rate? - (b)
What is the rate of convergence of the algorithm, when the consensus steps of the decision variable and the gradient tracker, as described in Theorem 22, are combined with the mixed linear-quadratic optimality gap error norm?
We observed that for a graph with higher connectivity, with a small spectral norm, ensures faster convergence in comparison to a graph with less connectivity and a spectral norm close to 1. Moreover, when η\eta is small, we numerically verified that Network-GIANT\mathrm{Network\text{-}GIANT} achieves a faster convergence, given by 1−η1-\eta in contrast to its gradient tracking based first order counterpart.
Experimental setup
For the experiments, we considered a binary logistic classification problem given by the distributed optimization problem
| minx∈ℝn1N∑i=1N1m∑j=1mlog(1+exp(−vji(x⊤uji)))+~λ2‖x‖2,\displaystyle\min_{x\in\mathbb{R}^{n}}\frac{1}{N}\sum_{i=1}^{N}\frac{1}{m}\sum_{j=1}^{m}\log\Big(1+\exp\big(-v^{i}_{j}(x^{\top}u^{i}_{j})\big)\Big)+\frac{\tilde{}\lambda}{2}\left\lVert x\right\rVert^{2}, |
|---|
where λ~\tilde{\lambda} is the regularizer chosen to be 0.010.01 across all experiments. We performed distributed classification on a reduced CovType dataset [31], consisting of 566602566602 sample points and n=10n=10 features. After random reshuffling to minimize bias, the samples (uji,vji)(u_{j}^{i},v_{j}^{i}) were uniformly distributed across the NN nodes with (mm samples each). The variables x(0)\textbf{x}(0) were initialized uniformly randomly, and we imposed ∇fi(xi(0))=si(0)\nabla f_{i}(x_{i}(0))=s_{i}(0) for each i={1,2,…,N}i=\{1,2,\ldots,N\}.
The experiments were conducted on dd-regular expander graphs and Erdős-Rényi graphs (denoted by G(N,p))\big(\text{denoted by }G(N,p)\big) with N=20N=20 nodes and different graph parameters, summarized in Table 1. Here dd and pp denote the degree of the expander graph and the edge/connection probability associated with the Erdős-Rényi graph, respectively, and σ\sigma is the spectral radius of these graphs.
Table 1: Spectral norm (σ\sigma) for different graph parameters.
We assumed that each agent can exchange data with their 11-hop neighbors. The consensus weight matrices of these graphs were constructed using the Metropolis-Hastings algorithm [32].
Results and discussions
In the following experiments, the true optimal solution x∗x^{\ast} and the corresponding optimal value f(x∗)f(x^{\ast}), used as baselines, are obtained from the centralized Newton-Raphson algorithm with backtracking line search. Figure 1 displays the convergence of different errors, namely, the consensus error (CE), the gradient-tracking error (Grad-T), and the optimality gap error (OE), for two different graph configurations of dd-regular expander graphs. The details regarding the spectral norm and the graph parameters are summarized in Table 1, and the corresponding stepsizes were picked according to Table 2.
Table 2: Stepsize η\eta for different graph configurations.
We observed that for small stepsize (η\eta) choices, and for an expander graph with degree d=14d=14 and an Erdős-Rényi graph with connection probability p=0.75p=0.75 (both of which exhibit a higher degree of connectivity and a lower spectral norm), the consensus error dynamics and the gradient-tracking error achieve a faster convergence rate in comparison with the optimality gap error dynamics decaying at a slower rate of 1−ημL1-\eta\frac{\mu}{L}. This is consistent with the observation reported in the assertion of [21, Theorem 2]. For graphs with a lower degree of connectivity, faster convergence is not distinctly observed, as shown in Figure 1. A similar behavior was observed for the Erdős-Rényi graph with a lower connection probability as well, which we omit here for brevity.
Figure 1: Errors associated with Network-GIANT\mathrm{Network\text{-}GIANT}. Here the acronyms, ‘CE’, ‘OE’, and ‘Grad-T’, are short forms for consensus error, optimality gap error, and gradient-tracking error, respectively.
We further observe that when the consensus and gradient tracking errors converge to sufficiently small values, the linear term in the optimality error gap dominates the quadratic term at a later stage of consensus when x¯k\bar{x}_{k} lies in a neighborhood of x∗x^{\ast}.
Empirical evidence for improved rate of convergence
Figures 3 and 5 illustrate that when η\eta is small, the ratio rkr_{k}, defined by
| rk≔‖𝐱k+1−x∗‖‖𝐱k−x∗‖⩽1−η+ηγμ≈1−η for each k,r_{k}\coloneqq\frac{\left\lVert\mathbf{x}_{k+1}-x^{\ast}\right\rVert}{\left\lVert\mathbf{x}_{k}-x^{\ast}\right\rVert}\leqslant 1-\eta+\frac{\eta\gamma}{\mu}\approx 1-\eta\text{ for each }k, | ((15)) |
|---|
provided that the Hessian approximation error γ\gamma is sufficiently small. Two observations are noteworthy here. First, we report that for both the graph models, (rk)k∈ℕ(r_{k})_{k\in\mathbb{N}} converges asymptotically approximately to the limiting value of 1−η1-\eta, provided η\eta and the Hessian approximation error are small. This is consistent with our discussion preceding §V. For a larger stepsize value η=0.07\eta=0.07 and a lower graph connectivity d=4d=4 in Figures 3 and 5 (both top figures), the above convergence property no longer remains valid.
Figure 3: Rate of convergence, as defined in ((15)), of Network-GIANT\mathrm{Network\text{-}GIANT} for dd-regular expander graph with d=4d=4 (top) and d=14d=14 (bottom).
Similar behavior was observed for larger values of stepsizes beyond 0.070.07, although the overall Network-GIANT algorithm still converges. However, for graphs with a higher degree of connectivity, a locally asymptotic approximate convergence rate of 1−η1-\eta is seen in Figures 3 and 5 (both bottom figures).
Figure 5: Rate of convergence, as defined in ((15)), of Network-GIANT\mathrm{Network\text{-}GIANT} for Erdős–Rényi graph with p=0.2p=0.2 (top) and p=0.75p=0.75 (bottom).
Finally, we observed a slightly improved rate of convergence when the spectral norm of the graph is small, even for larger values of η\eta. This trend is consistent for both the dd-regular expander graph and the Erdős-Rényi graph. An exact characterization of the region of (σ,η\sigma,\eta) where this locally approximate asymptotic convergence rate of1−η1-\eta for the optimality gap holds, remains an open problem for future investigation.
Comparison with state-of-the-art algorithms
For comparing Network-GIANT\mathrm{Network\text{-}GIANT} with the gradient-tracking based first-order method developed in [6], termed as ‘GradTrack’ for convenience, and a Nesterov-accelerated version of the gradient tracking based optimization algorithm (ACC-NGD-SC) [30], we considered the full CovType dataset, and solve the multinomial logistic classification problem [33, Chapter 4]. Specifically, we solve the distributed optimization problem ((1)), where local objective function is defined by
| fi(x)≔−1mi∑j=1mi∑l=1M𝟙{vji=l}log(φl(uji;x))+λ‖x‖2,f_{i}(x)\coloneqq-\frac{1}{m_{i}}\sum_{j=1}^{m_{i}}\sum_{l=1}^{M}\mathds{1}_{\{v^{i}_{j}=l\}}\log\Big(\varphi_{l}(u^{i}_{j};x)\Big)+\lambda\left\lVert x\right\rVert^{2}, |
|---|
λ>0\lambda>0 is the regularization penalty, and
| φl(uji;x)={exp((xl)⊤uji)1+∑l=1M−1exp((xl)⊤uji)for l=1,…,M−1,11+∑l=1M−1exp((xl)⊤uji)for l=M.\displaystyle\varphi_{l}(u^{i}_{j};x)=\begin{cases}\frac{\exp\big((x^{l})^{\top}u^{i}_{j}\big)}{1+\sum_{l=1}^{M-1}\exp\big((x^{l})^{\top}u^{i}_{j}\big)}\,&\text{for }l=1,\ldots,M-1,\\ \frac{1}{1+\sum_{l=1}^{M-1}\exp\big((x^{l})^{\top}u^{i}_{j}\big)}\,&\text{for }l=M.\end{cases} |
|---|
Here the parameters x≔(x1x2⋯xM−1)⊤x\coloneqq(x^{1}\,x^{2}\,\cdots\,x^{M-1})^{\top} are (M−1)×(d+1)(M-1)\times(d+1)-dimensional matrices, where xl∈ℝN+1x^{l}\in\mathbb{R}^{N+1} for each l∈{1,…,M}l\in\{1,\ldots,M\}. The full CovType dataset consists of 581012581012 data points with 5454-dimensional feature vectors distributed among 77 classes. For these experiments, the regularizer λ\lambda was chosen to be 0.0010.001. The stepsizes for different algorithms were chosen according to Table 3.
Table 3: Stepsize η\eta for different algorithms and graph models.
(a) dd-regular expander graph with d=4d=4 and d=14d=14.
(b) Erdős-Rényi graph with p=0.2p=0.2 and p=0.75p=0.75.
Figure 6: Comparison of Network-GIANT\mathrm{Network\text{-}GIANT} with GradTrack and ACC-NGD-SC.
For ACC-NGD-SC [30], we considered a grid of [0.035,0.3][0.035,0.3], and the step-sizes, η∈[0.035,0.3]\eta\in[0.035,0.3], corresponding the values reported in Table 3 corresponding to the best result obtained. The momentum parameter was chosen to be α=μη \alpha=\mathchoice{{\hbox{$\displaystyle\sqrt{\mu\eta\,}$}\lower 0.4pt\hbox{\vrule height=4.30554pt,depth=-3.44446pt}}}{{\hbox{$\textstyle\sqrt{\mu\eta\,}$}\lower 0.4pt\hbox{\vrule height=4.30554pt,depth=-3.44446pt}}}{{\hbox{$\scriptstyle\sqrt{\mu\eta\,}$}\lower 0.4pt\hbox{\vrule height=3.01389pt,depth=-2.41113pt}}}{{\hbox{$\scriptscriptstyle\sqrt{\mu\eta\,}$}\lower 0.4pt\hbox{\vrule height=2.15277pt,depth=-1.72223pt}}} as stated in [30], where μ\mu is the strong convexity parameter. For the dataset considered, μ≃0.002\mu\simeq 0.002 was computed empirically.111In our experiments, we observed that a larger α\alpha can be appropriately chosen to obtain a damped response; see Figure 8 for more details. It is observed that Network-GIANT comfortably outperforms the ACC-NGD-SC algorithm as well, illustrating that using approximate curvature information can provide significant improvement in convergence rate in distributed optimization.
Figure 8: Comparison when α>μη \alpha>\mathchoice{{\hbox{$\displaystyle\sqrt{\mu\eta\,}$}\lower 0.4pt\hbox{\vrule height=3.87498pt,depth=-3.1pt}}}{{\hbox{$\textstyle\sqrt{\mu\eta\,}$}\lower 0.4pt\hbox{\vrule height=3.87498pt,depth=-3.1pt}}}{{\hbox{$\scriptstyle\sqrt{\mu\eta\,}$}\lower 0.4pt\hbox{\vrule height=2.71248pt,depth=-2.17pt}}}{{\hbox{$\scriptscriptstyle\sqrt{\mu\eta\,}$}\lower 0.4pt\hbox{\vrule height=1.93748pt,depth=-1.55pt}}} was chosen, keeping all other parameter identical, and considered α=0.05\alpha=0.05. A damped response for ACC-NGD-SC [30] is reported.
6 Conclusions
This paper focused on derivation of a number of detailed convergence results of a recently developed approximate Newton-type fully distributed optimization algorithm called Network-GIANT. For strongly convex local cost functions with the step size below an explicit bound, a global linear convergence result is shown, which can be explicitly computed as the spectral radius of a 3×33\times 3 matrix, the elements of which depend on the cost function Lipschitz continuity (LL) and strong convexity (μ\mu) parameters, the step size (η\eta) and the spectral norm (σ\sigma) of the underlying consensus matrix. A mixed linear-quadratic inequality based iterative upper bound is also established for the optimality gap, which indicates that for small step sizes and a negligible Hessian approximation error, Network-GIANT achieves an asymptotic approximate local linear convergence rate of 1−η1-\eta, which provides a mathematical justification as to why Network-GIANT is empirically seen to achieve a faster convergence rate than its first order gradient-tracking based counterparts. Ongoing and future work will focus on extensions of Network-GIANT to graphs with compressed and noisy communication, incorporating acceleration techniques such as Nesterov-acceleration and heavy-ball type methods, and more general (e.g. directed) graphs.
{appendices}
7 Proof of Theorem 1
We prove three norm inequalities in ((12)) individually.
Consensus error norm:
First, note that, using the state/parameter update equation of ((5)), we have (using W𝟏=𝟏W\mathbf{1}=\mathbf{1})
| 𝐱k+1−𝟏x¯k+1=W𝐱k−𝟏x¯k−η(𝕀−1N𝟏𝟏T)𝐲k,\mathbf{x}_{k+1}-\mathbf{1}\bar{x}_{k+1}=W\mathbf{x}_{k}-\mathbf{1}\bar{x}_{k}-\eta(\mathbb{I}-\frac{1}{N}\mathbf{1}\mathbf{1}^{T})\mathbf{y}_{k}, |
|---|
where𝕀\mathbb{I} denotes the identity matrix of order N×NN\times N. Noting that the 2-norm ‖𝕀−1N𝟏𝟏T‖2⩽1\left\lVert\mathbb{I}-\frac{1}{N}\mathbf{1}\mathbf{1}^{T}\right\rVert_{2}\leqslant 1 for all n>1n>1, ‖𝐲k‖⩽1μ‖𝐬k‖\left\lVert\mathbf{y}_{k}\right\rVert\leqslant\frac{1}{\mu}\left\lVert\mathbf{s}_{k}\right\rVert, and ‖W𝐱k−𝟏x¯k‖⩽σ‖𝐱k−𝟏x¯k‖\left\lVert W\mathbf{x}_{k}-\mathbf{1}\bar{x}_{k}\right\rVert\leqslant\sigma\left\lVert\mathbf{x}_{k}-\mathbf{1}\bar{x}_{k}\right\rVert, it follows that
| ‖𝐱k+1−𝟏x¯k+1‖⩽σ‖𝐱k−𝟏x¯k‖+ημ‖𝐬k‖.\left\lVert\mathbf{x}_{k+1}-\mathbf{1}\bar{x}_{k+1}\right\rVert\leqslant\sigma\left\lVert\mathbf{x}_{k}-\mathbf{1}\bar{x}_{k}\right\rVert+\frac{\eta}{\mu}\left\lVert\mathbf{s}_{k}\right\rVert. |
|---|
Finally using equation (ii) of Lemma 1, we have
| ‖𝐱k+1−𝟏x¯k+1‖⩽(σ+ηLμ)‖𝐱k−𝟏x¯k‖\displaystyle\left\lVert\mathbf{x}_{k+1}-\mathbf{1}\bar{x}_{k+1}\right\rVert\leqslant(\sigma+\eta\frac{L}{\mu})\left\lVert\mathbf{x}_{k}-\mathbf{1}\bar{x}_{k}\right\rVert |
|---|
| +ημ‖𝐬k−𝟏gk‖+ηLμN ‖x¯k−x⋆‖\displaystyle+\frac{\eta}{\mu}\left\lVert\mathbf{s}_{k}-\mathbf{1}g_{k}\right\rVert+\eta\frac{L}{\mu}\mathchoice{{\hbox{$\displaystyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=6.83331pt,depth=-5.46667pt}}}{{\hbox{$\textstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=6.83331pt,depth=-5.46667pt}}}{{\hbox{$\scriptstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=4.78334pt,depth=-3.82669pt}}}{{\hbox{$\scriptscriptstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=3.41667pt,depth=-2.73335pt}}}\left\lVert\bar{x}_{k}-x^{\star}\right\rVert |
Gradient tracking error norm:
Next, using a result from [6], we have
| ‖𝐬k+1−𝟏gk+1‖⩽σ‖𝐬k−𝟏gk‖+L‖𝐱k+1−𝐱k‖.\left\lVert\mathbf{s}_{k+1}-\mathbf{1}g_{k+1}\right\rVert\leqslant\sigma\left\lVert\mathbf{s}_{k}-\mathbf{1}g_{k}\right\rVert+L\left\lVert\mathbf{x}_{k+1}-\mathbf{x}_{k}\right\rVert. |
|---|
Next, by using
| 𝐱k+1−𝐱k=(W−𝕀)𝐱k−η𝐲k=(W−𝕀)(𝐱k−𝟏x¯k)−η𝐲k,\mathbf{x}_{k+1}-\mathbf{x}_{k}=(W-\mathbb{I})\mathbf{x}_{k}-\eta\mathbf{y}_{k}\\ =(W-\mathbb{I})(\mathbf{x}_{k}-\mathbf{1}\bar{x}_{k})-\eta\mathbf{y}_{k}, |
|---|
and ‖W−𝕀‖2⩽2\left\lVert W-\mathbb{I}\right\rVert_{2}\leqslant 2, we get (after substituting for‖𝐲k‖⩽1μ‖𝐬k‖\left\lVert\mathbf{y}_{k}\right\rVert\leqslant\frac{1}{\mu}\left\lVert\mathbf{s}_{k}\right\rVert, and some rearranging)
| ‖𝐬k+1−𝟏gk+1‖⩽(2L+ηL2μ)‖𝐱k−𝟏x¯k‖\displaystyle\left\lVert\mathbf{s}_{k+1}-\mathbf{1}g_{k+1}\right\rVert\leqslant(2L+\eta\frac{L^{2}}{\mu})\left\lVert\mathbf{x}_{k}-\mathbf{1}\bar{x}_{k}\right\rVert |
|---|
| +(σ+ηLμ)‖𝐬k−𝟏gk‖+ηL2μN ‖x¯k−x⋆‖\displaystyle+(\sigma+\eta\frac{L}{\mu})\left\lVert\mathbf{s}_{k}-\mathbf{1}g_{k}\right\rVert+\eta\frac{L^{2}}{\mu}\mathchoice{{\hbox{$\displaystyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=6.83331pt,depth=-5.46667pt}}}{{\hbox{$\textstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=6.83331pt,depth=-5.46667pt}}}{{\hbox{$\scriptstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=4.78334pt,depth=-3.82669pt}}}{{\hbox{$\scriptscriptstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=3.41667pt,depth=-2.73335pt}}}\left\lVert\bar{x}_{k}-x^{\star}\right\rVert |
Optimality gap norm:
To simplify the following analysis, we use x~k=x¯kT,x~⋆=(x⋆)T\tilde{x}_{k}=\bar{x}^{T}_{k},\tilde{x}^{\star}=(x^{\star})^{T}, and s~ik=(sik)T\tilde{s}_{i}^{k}=(s_{i}^{k})^{T}. Once again, averaging both sides of the state/parameter update equation in ((5)), and using the shorthand notation H~ik=∇2fi(xik)\tilde{H}_{i}^{k}=\nabla^{2}f_{i}(x_{i}^{k}) for the local Hessians (note that these matrices are positive definite symmetric), we can write
| x~k+1−x~⋆=x~k−x~⋆−η1N∑i=1N(H~ik)−1s~ik\displaystyle\tilde{x}_{k+1}-\tilde{x}^{\star}=\tilde{x}_{k}-\tilde{x}^{\star}-\eta\frac{1}{N}\sum_{i=1}^{N}(\tilde{H}_{i}^{k})^{-1}\tilde{s}_{i}^{k} |
|---|
| =x~k−x~⋆−η1N∑i=1N(H~ik)−1(s~ik−∇f(x~k)+∇f(x~k))\displaystyle=\tilde{x}_{k}-\tilde{x}^{\star}-\eta\frac{1}{N}\sum_{i=1}^{N}(\tilde{H}_{i}^{k})^{-1}(\tilde{s}_{i}^{k}-\nabla f(\tilde{x}_{k})+\nabla f(\tilde{x}_{k})) |
where ∇f(x~k)=(∇f(x¯k))T\nabla f(\tilde{x}_{k})=(\nabla f(\bar{x}_{k}))^{T}. Define
| H¯ik=(H~ik)−1∇f(x~k)\displaystyle\bar{H}_{i}^{k}=(\tilde{H}_{i}^{k})^{-1}\nabla f(\tilde{x}_{k}) |
|---|
| =(H~ik)−1∫01∇2f(x~⋆+λ(x~k−x~⋆))𝑑λ(x~k−x~⋆)\displaystyle=(\tilde{H}_{i}^{k})^{-1}\int_{0}^{1}\nabla^{2}f(\tilde{x}^{\star}+\lambda(\tilde{x}_{k}-\tilde{x}^{\star}))d\lambda(\tilde{x}_{k}-\tilde{x}^{\star}) |
Consequently, one can write
| x~k+1−x~⋆=(𝕀−η∑H¯ikN)(x~k−x~⋆)\displaystyle\tilde{x}_{k+1}-\tilde{x}^{\star}=(\mathbb{I}-\eta\frac{\sum\bar{H}_{i}^{k}}{N})(\tilde{x}_{k}-\tilde{x}^{\star}) |
|---|
| −η1N∑i=1N(H~ik)−1(s~ik−∇f(x~k))\displaystyle\;\;\;\;\;-\eta\frac{1}{N}\sum_{i=1}^{N}(\tilde{H}_{i}^{k})^{-1}(\tilde{s}_{i}^{k}-\nabla f(\tilde{x}_{k})) |
Using a similar technique as in the Proof of Lemma 2, one can show that μL𝕀⩽H¯ik⩽Lμ𝕀\frac{\mu}{L}\mathbb{I}\leqslant\bar{H}_{i}^{k}\leqslant\frac{L}{\mu}\mathbb{I}, and hence
| (1−ηLμ)𝕀⩽𝕀−η∑H¯ikN⩽(1−ημL)𝕀\left(1-\eta\frac{L}{\mu}\right)\mathbb{I}\leqslant\mathbb{I}-\eta\frac{\sum\bar{H}_{i}^{k}}{N}\leqslant\left(1-\eta\frac{\mu}{L}\right)\mathbb{I} |
|---|
It follows then that if η⩽μL\eta\leqslant\frac{\mu}{L}, we have
| 0⩽𝕀−η∑H¯ikN⩽(1−ημL)𝕀,0\leqslant\mathbb{I}-\eta\frac{\sum\bar{H}_{i}^{k}}{N}\leqslant\left(1-\eta\frac{\mu}{L}\right)\mathbb{I}, |
|---|
and finally, ‖𝕀−η∑H¯ikN‖2⩽(1−ημL)\left\lVert\mathbb{I}-\eta\frac{\sum\bar{H}_{i}^{k}}{N}\right\rVert_{2}\leqslant(1-\eta\frac{\mu}{L}).
This allows us to write
| ‖x~k+1−x~⋆‖⩽(1−ημL)‖x~k−x~⋆‖\displaystyle\left\lVert\tilde{x}_{k+1}-\tilde{x}^{\star}\right\rVert\leqslant(1-\eta\frac{\mu}{L})\left\lVert\tilde{x}_{k}-\tilde{x}^{\star}\right\rVert |
|---|
| +η‖1N∑i=1N(H~ik)−1(s~ik−∇f(x~k))‖\displaystyle\;\;\;\;\;\;\;\;+\eta\left\lVert\frac{1}{N}\sum_{i=1}^{N}(\tilde{H}_{i}^{k})^{-1}(\tilde{s}_{i}^{k}-\nabla f(\tilde{x}_{k}))\right\rVert |
Denoting Sa=η‖1N∑i=1N(H~ik)−1(s~ik−∇f(x~k))‖S_{a}=\eta\left\lVert\frac{1}{N}\sum_{i=1}^{N}(\tilde{H}_{i}^{k})^{-1}(\tilde{s}_{i}^{k}-\nabla f(\tilde{x}_{k}))\right\rVert, using the strong convexity of the local cost functions and the fact that the norm of the transpose of a vector is the same as the norm of the vector, it is easy to show that
| Sa⩽ημ1N∑i=1N∥sik−∇f(x¯k))∥\displaystyle S_{a}\leqslant\frac{\eta}{\mu}\frac{1}{N}\sum_{i=1}^{N}\left\lVert s_{i}^{k}-\nabla f(\bar{x}_{k}))\right\rVert |
|---|
| ⩽ημ1N∑i=1N∥sik−∇f(𝐱k)+∇f(𝐱k)−∇f(x¯k))∥\displaystyle\leqslant\frac{\eta}{\mu}\frac{1}{N}\sum_{i=1}^{N}\left\lVert s_{i}^{k}-\nabla f(\mathbf{x}_{k})+\nabla f(\mathbf{x}_{k})-\nabla f(\bar{x}_{k}))\right\rVert |
| ⩽ημ(1N∑i=1N∥sik−∇f(𝐱k)∥+∥∇f(𝐱k)−∇f(x¯k))∥)\displaystyle\leqslant\frac{\eta}{\mu}\left(\frac{1}{N}\sum_{i=1}^{N}\left\lVert s_{i}^{k}-\nabla f(\mathbf{x}_{k})\right\rVert+\left\lVert\nabla f(\mathbf{x}_{k})-\nabla f(\bar{x}_{k}))\right\rVert\right) |
| ⩽ημ1N∑i=1N‖sik−gk‖+ημLN ‖𝐱k−𝟏x¯k‖\displaystyle\leqslant\frac{\eta}{\mu}\frac{1}{N}\sum_{i=1}^{N}\left\lVert s_{i}^{k}-g_{k}\right\rVert+\frac{\eta}{\mu}\frac{L}{\mathchoice{{\hbox{$\displaystyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=6.83331pt,depth=-5.46667pt}}}{{\hbox{$\textstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=6.83331pt,depth=-5.46667pt}}}{{\hbox{$\scriptstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=4.78334pt,depth=-3.82669pt}}}{{\hbox{$\scriptscriptstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=3.41667pt,depth=-2.73335pt}}}}\left\lVert\mathbf{x}_{k}-\mathbf{1}\bar{x}_{k}\right\rVert |
where we have used the Lipschitz continuity of the gradients.
Finally, noting again that ‖x~k−x~⋆‖=‖x¯k−x⋆‖\left\lVert\tilde{x}_{k}-\tilde{x}^{\star}\right\rVert=\left\lVert\bar{x}_{k}-x^{\star}\right\rVert, and
| 1N∑i=1N‖sik−gk‖⩽1N∑i=1N‖sik−gk‖2 =‖𝐬k−𝟏gk‖N ,\frac{1}{N}\sum_{i=1}^{N}\left\lVert s_{i}^{k}-g_{k}\right\rVert\leqslant\mathchoice{{\hbox{$\displaystyle\sqrt{\frac{1}{N}\sum_{i=1}^{N}{\left\lVert s_{i}^{k}-g_{k}\right\rVert}^{2}\,}$}\lower 0.4pt\hbox{\vrule height=9.8611pt,depth=-7.88892pt}}}{{\hbox{$\textstyle\sqrt{\frac{1}{N}\sum_{i=1}^{N}{\left\lVert s_{i}^{k}-g_{k}\right\rVert}^{2}\,}$}\lower 0.4pt\hbox{\vrule height=8.49002pt,depth=-6.79205pt}}}{{\hbox{$\scriptstyle\sqrt{\frac{1}{N}\sum_{i=1}^{N}{\left\lVert s_{i}^{k}-g_{k}\right\rVert}^{2}\,}$}\lower 0.4pt\hbox{\vrule height=6.49002pt,depth=-5.19205pt}}}{{\hbox{$\scriptscriptstyle\sqrt{\frac{1}{N}\sum_{i=1}^{N}{\left\lVert s_{i}^{k}-g_{k}\right\rVert}^{2}\,}$}\lower 0.4pt\hbox{\vrule height=4.94444pt,depth=-3.95558pt}}}=\frac{\left\lVert\mathbf{s}_{k}-\mathbf{1}g_{k}\right\rVert}{\mathchoice{{\hbox{$\displaystyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=6.83331pt,depth=-5.46667pt}}}{{\hbox{$\textstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=6.83331pt,depth=-5.46667pt}}}{{\hbox{$\scriptstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=4.78334pt,depth=-3.82669pt}}}{{\hbox{$\scriptscriptstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=3.41667pt,depth=-2.73335pt}}}}, |
|---|
we have
| N ‖x¯k+1−x⋆‖⩽(1−ημL)N ‖x¯k−x⋆‖\displaystyle\mathchoice{{\hbox{$\displaystyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=6.83331pt,depth=-5.46667pt}}}{{\hbox{$\textstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=6.83331pt,depth=-5.46667pt}}}{{\hbox{$\scriptstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=4.78334pt,depth=-3.82669pt}}}{{\hbox{$\scriptscriptstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=3.41667pt,depth=-2.73335pt}}}\left\lVert\bar{x}_{k+1}-x^{\star}\right\rVert\leqslant(1-\eta\frac{\mu}{L})\mathchoice{{\hbox{$\displaystyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=6.83331pt,depth=-5.46667pt}}}{{\hbox{$\textstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=6.83331pt,depth=-5.46667pt}}}{{\hbox{$\scriptstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=4.78334pt,depth=-3.82669pt}}}{{\hbox{$\scriptscriptstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=3.41667pt,depth=-2.73335pt}}}\left\lVert\bar{x}_{k}-x^{\star}\right\rVert |
|---|
| +ηLμ‖𝐱k−𝟏x¯k‖+ημ‖𝐬k−𝟏gk‖\displaystyle+\frac{\eta L}{\mu}\left\lVert\mathbf{x}_{k}-\mathbf{1}\bar{x}_{k}\right\rVert+\frac{\eta}{\mu}\left\lVert\mathbf{s}_{k}-\mathbf{1}g_{k}\right\rVert |
Combining the three inequalities for the three error norms, we obtain the inequality in ((12)).
8 Proof of Theorem 2
In order to analyze the spectral radius of the matrix G(η)G(\eta) defined in ((12)), we first recall that since G(η)G(\eta) is a positive matrix, its spectral radius is real valued and positive. Therefore, we need to investigate under what conditions on the step-size η\eta, this real eigenvalue lies between 0 and 11. First, note that at η=0\eta=0, the eigenvalues G(0)G(0) are σ,σ\sigma,\sigma, and 11. Clearly, the spectral radius ρ(G(η))=1\rho(G(\eta))=1 at η=0\eta=0. From continuity of eigenvalues as a function of the elements of the matrix, we know that ρ(G(η))\rho(G(\eta)) is a continuous function of η\eta.
Next, we evaluate the derivative of ρ(G(η))\rho(G(\eta)) with respect to η\eta at η=0\eta=0. Define α=1−ημL\alpha=1-\eta\frac{\mu}{L}, and β=σ+ηLμ\beta=\sigma+\eta\frac{L}{\mu}, and δ=ηLμ\delta=\eta\frac{L}{\mu} for simplifications. By analyzing the characteristic polynomial of G(η)G(\eta), one can write that an eigenvalueλ\lambda of G(η)G(\eta) satisfies the following equation
| (λ−α)(λ−β)2−(λ−α)(2δ+δ2)−2δ2(λ−β)\displaystyle(\lambda-\alpha)(\lambda-\beta)^{2}-(\lambda-\alpha)(2\delta+\delta^{2})-2\delta^{2}(\lambda-\beta) |
|---|
| −2δ3−2δ2=0\displaystyle\;\;\;\;\;\;\;\;\;\qquad\qquad\qquad\qquad\qquad-2\delta^{3}-2\delta^{2}=0 |
Taking the derivative with respect to η\eta, and substituting λ=1\lambda=1, and δ=0\delta=0 at η=0\eta=0, one can get after some algebra
| (∂λ∂η+μL)(1−σ)2=0.(\frac{\partial\lambda}{\partial\eta}+\frac{\mu}{L})(1-\sigma)^{2}=0. |
|---|
Since σ<1\sigma<1, this implies that ∂λ∂η<0\frac{\partial\lambda}{\partial\eta}<0 at η=0\eta=0. Therefore, from continuity of eigenvalues as a function of the matrix elements, it follows that the largest eigenvalue remains real and decreases as η\eta increases slightly from 0. Since the other two eigenvalues are σ<1\sigma<1 at η=0\eta=0, we obtain thatρ(G(η))<1\rho(G(\eta))<1 for a sufficiently small η>0\eta>0.
Next, we check for what values of η\eta, we have a solution λ=1\lambda=1. Clearly, one such value, as we saw earlier, is η=0\eta=0. By substituting λ=1\lambda=1 (and consequently, (λ−α)=ημL(\lambda-\alpha)=\eta\frac{\mu}{L}, (λ−β)=1−σ−ηLμ(\lambda-\beta)=1-\sigma-\eta\frac{L}{\mu}) the characteristic polynomial, and discarding the solution η=0\eta=0, we get
| μL((1−σ−ηLμ)2−ηLμ(2+ηLμ))\displaystyle\frac{\mu}{L}\left((1-\sigma-\eta\frac{L}{\mu})^{2}-\eta\frac{L}{\mu}(2+\eta\frac{L}{\mu})\right) |
|---|
| −η2L3μ3−ηL2μ2(1−σ−ηLμ)\displaystyle-\eta^{2}\frac{L^{3}}{\mu^{3}}-\eta\frac{L^{2}}{\mu^{2}}(1-\sigma-\eta\frac{L}{\mu}) |
| +1μ((1−σ−ηLμ)(−ηL2μ−2ηL2μ−η2L3μ2)=0\displaystyle+\frac{1}{\mu}\left((1-\sigma-\eta\frac{L}{\mu})(-\eta\frac{L^{2}}{\mu}-2\eta\frac{L^{2}}{\mu}-\eta^{2}\frac{L^{3}}{\mu^{2}}\right)=0 |
After some algebraic manipulation, surprisingly, the above equation yields a single solutionη¯=(1−σ)μL2(2−σ)(1+(Lμ)2)\bar{\eta}=\frac{(1-\sigma)\frac{\mu}{L}}{2(2-\sigma)(1+(\frac{L}{\mu})^{2}}). Since at η=0\eta=0, the eigenvalues are 1,σ,σ1,\sigma,\sigma, and the spectral radius is less than 11 in the immediate vicinity of η>0\eta>0, and there is only one other value η=η¯\eta=\bar{\eta}, where an eigenvalue can be 11, we conclude that the magnitudes of all the eigenvalues (and hence the spectral radius) are less than 11 for 0<η<η¯0<\eta<\bar{\eta}. By noting that η¯<μL\bar{\eta}<\frac{\mu}{L} (thus satisfying the global convergence condition on the step size for the damped Newton method), and substituting κ=Lμ\kappa=\frac{L}{\mu}, we complete the proof of Theorem 2.
9 Proof of Theorem 3
Similar to the optimality error norm proof of Theorem 1, we work with the transposed notations x~k,x~⋆\tilde{x}_{k},\tilde{x}^{\star},s~ik\tilde{s}_{i}^{k}, and ∇f(x~k)\nabla f(\tilde{x}_{k}). We adapt a similar proof technique as presented in [24]. As before, we can write
| x~k+1−x~⋆\displaystyle\tilde{x}_{k+1}-\tilde{x}^{\star} |
|---|
| =x~k−x~⋆−η1N∑i=1N(H~ik)−1(s~ik−∇f(x~k)+∇f(x~k))\displaystyle=\tilde{x}_{k}-\tilde{x}^{\star}-\eta\frac{1}{N}\sum_{i=1}^{N}(\tilde{H}_{i}^{k})^{-1}(\tilde{s}_{i}^{k}-\nabla f(\tilde{x}_{k})+\nabla f(\tilde{x}_{k})) |
| =x~k−x~⋆−η1N∑i=1N(H~ik)−1(s~ik−∇f(x~k))⏟S1\displaystyle=\tilde{x}_{k}-\tilde{x}^{\star}-\underbrace{\eta\frac{1}{N}\sum_{i=1}^{N}(\tilde{H}_{i}^{k})^{-1}(\tilde{s}_{i}^{k}-\nabla f(\tilde{x}_{k}))}_{S_{1}} |
| −η1N∑i=1N(H~ik)−1∇f(x~k))⏟S2\displaystyle\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;-\underbrace{\eta\frac{1}{N}\sum_{i=1}^{N}(\tilde{H}_{i}^{k})^{-1}\nabla f(\tilde{x}_{k}))}_{S_{2}} |
We note that S2S_{2} can be written as
| S2=η(Happ(𝐱𝐤))−1∫01∇2f(x~⋆+λ(x~k−x~⋆))𝑑λ(x~k−x~⋆)S_{2}=\eta(H_{app}(\mathbf{x_{k}}))^{-1}\int_{0}^{1}\nabla^{2}f(\tilde{x}^{\star}+\lambda(\tilde{x}_{k}-\tilde{x}^{\star}))d\lambda(\tilde{x}_{k}-\tilde{x}^{\star}) |
|---|
This allows us to write
| x~k+1−x~⋆\displaystyle\tilde{x}_{k+1}-\tilde{x}^{\star} |
|---|
| =[𝕀−η(Happ(𝐱𝐤))−1∫01∇2f(x~⋆+λ(x~k−x~⋆))𝑑λ⏟δ(x~k)]\displaystyle=\left[\mathbb{I}-\eta(H_{app}(\mathbf{x_{k}}))^{-1}\underbrace{\int_{0}^{1}\nabla^{2}f(\tilde{x}^{\star}+\lambda(\tilde{x}_{k}-\tilde{x}^{\star}))d\lambda}_{\delta(\tilde{x}_{k})}\right] |
| ×(x~k−x~⋆)+S1\displaystyle\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\times(\tilde{x}_{k}-\tilde{x}^{\star})+S_{1} |
| =(1−η)(x~k−x~⋆)+η(Happ(𝐱𝐤))−1[Happ(𝐱𝐤)−δ(x~k)]\displaystyle=(1-\eta)(\tilde{x}_{k}-\tilde{x}^{\star})+\eta(H_{app}(\mathbf{x_{k}}))^{-1}\left[H_{app}(\mathbf{x_{k}})-\delta(\tilde{x}_{k})\right] |
| ×(x~k−x~⋆)+S1\displaystyle\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\times(\tilde{x}_{k}-\tilde{x}^{\star})+S_{1} |
It follows then from strong convexity and η<1\eta<1
| ‖x~k+1−x~⋆‖⩽((1−η)+ημ‖Happ(𝐱𝐤)−δ(x~k)‖)\displaystyle\left\lVert\tilde{x}_{k+1}-\tilde{x}^{\star}\right\rVert\leqslant\left((1-\eta)+\frac{\eta}{\mu}\left\lVert H_{app}(\mathbf{x_{k}})-\delta(\tilde{x}_{k})\right\rVert\right) | |
|---|---|
| ×‖x~k−x~⋆‖+‖S1‖\displaystyle\qquad\qquad\qquad\qquad\times\left\lVert\tilde{x}_{k}-\tilde{x}^{\star}\right\rVert+\left\lVert S_{1}\right\rVert | ((16)) |
One can now show (recalling that ‖vT‖=‖v‖\left\lVert v^{T}\right\rVert=\left\lVert v\right\rVert for a vector vv)
| ‖Happ(𝐱𝐤)−δ(x~k)‖\displaystyle\left\lVert H_{app}(\mathbf{x_{k}})-\delta(\tilde{x}_{k})\right\rVert |
|---|
| ⩽‖Happ(𝐱𝐤)−Htr(𝐱𝐤)‖+‖Htr(𝐱𝐤)−δ(x~k)‖\displaystyle\leqslant\left\lVert H_{app}(\mathbf{x_{k}})-H_{tr}(\mathbf{x_{k}})\right\rVert+\left\lVert H_{tr}(\mathbf{x_{k}})-\delta(\tilde{x}_{k})\right\rVert |
| ⩽γ+∫01‖∇2f(𝐱k)−∇2f(x~k)‖𝑑λ\displaystyle\leqslant\gamma+\int_{0}^{1}\left\lVert\nabla^{2}f(\mathbf{x}_{k})-\nabla^{2}f(\tilde{x}_{k})\right\rVert d\lambda |
| +∫01‖∇2f(x~k)−∇2f(x~∗+λ(x~k−x~∗))‖𝑑λ\displaystyle\qquad\qquad+\int_{0}^{1}\left\lVert\nabla^{2}f(\tilde{x}_{k})-\nabla^{2}f(\tilde{x}^{*}+\lambda(\tilde{x}_{k}-\tilde{x}^{*}))\right\rVert d\lambda |
| ⩽γ+L¯N ‖𝐱k−𝟏x¯k‖+L¯2‖x¯k−x⋆‖\displaystyle\leqslant\gamma+\frac{\bar{L}}{\mathchoice{{\hbox{$\displaystyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=6.83331pt,depth=-5.46667pt}}}{{\hbox{$\textstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=6.83331pt,depth=-5.46667pt}}}{{\hbox{$\scriptstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=4.78334pt,depth=-3.82669pt}}}{{\hbox{$\scriptscriptstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=3.41667pt,depth=-2.73335pt}}}}\left\lVert\mathbf{x}_{k}-\mathbf{1}\bar{x}_{k}\right\rVert+\frac{\bar{L}}{2}\left\lVert\bar{x}_{k}-x^{\star}\right\rVert |
where the last term in the last inequality is due to the Lipschitz continuity of the global Hessian.
Using the above result in ((16)), we obtain (observing that since γ<μ\gamma<\mu, and η<1\eta<1, η(1−γμ)<1\eta(1-\frac{\gamma}{\mu})<1)
| ‖x¯k+1−x⋆‖⩽(1−η(1−γμ))‖x¯k−x⋆‖\displaystyle\left\lVert\bar{x}_{k+1}-x^{\star}\right\rVert\leqslant\left(1-\eta(1-\frac{\gamma}{\mu})\right)\left\lVert\bar{x}_{k}-x^{\star}\right\rVert |
|---|
| +ηL¯μN ‖𝐱k−𝟏x¯k‖‖x¯k−x⋆‖+ηL¯2μ‖x¯k−x⋆‖2\displaystyle\;\;\;\;\;\;\;+\frac{\eta\bar{L}}{\mu\mathchoice{{\hbox{$\displaystyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=6.83331pt,depth=-5.46667pt}}}{{\hbox{$\textstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=6.83331pt,depth=-5.46667pt}}}{{\hbox{$\scriptstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=4.78334pt,depth=-3.82669pt}}}{{\hbox{$\scriptscriptstyle\sqrt{N\,}$}\lower 0.4pt\hbox{\vrule height=3.41667pt,depth=-2.73335pt}}}}\left\lVert\mathbf{x}_{k}-\mathbf{1}\bar{x}_{k}\right\rVert\left\lVert\bar{x}_{k}-x^{\star}\right\rVert+\frac{\eta\bar{L}}{2\mu}\left\lVert\bar{x}_{k}-x^{\star}\right\rVert^{2} |
| +‖S1‖\displaystyle\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad+\left\lVert S_{1}\right\rVert |
Finally, noting that‖S1‖\left\lVert S_{1}\right\rVert is the same as the quantity SaS_{a} defined in the Proof of Theorem 1, we can obtain the final expression for the upper bound on ‖x¯k+1−x⋆‖\left\lVert\bar{x}_{k+1}-x^{\star}\right\rVert given in ((14)).
References
- [1] L. Yuan, Z. Wang, L. Sun, P. S. Yu, and C. G. Brinton, “Decentralized federated learning: A survey and perspective,” IEEE Internet of Things Journal, vol. 11, no. 21, pp. 34 617–34 638, 2024.
- [2] C. Liu, N. Bastianello, W. Huo, Y. Shi, and K. H. Johansson, “A survey on secure decentralized optimization and learning,” arXiv preprint arXiv:2408.08628, 2024.
- [3] T. Yang, X. Yi, J. Wu, Y. Yuan, D. Wu, Z. Meng, Y. Hong, H. Wang, Z. Lin, and K. H. Johansson, “A survey of distributed optimization,” Annual Reviews in Control, vol. 47, pp. 278–305, 2019. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1367578819300082
- [4] A. Nedić and J. Liu, “Distributed optimization for control,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 1, no. 1, pp. 77–103, 2018.
- [5] A. Nedic, “Distributed gradient methods for convex machine learning problems in networks: Distributed Optimization,” IEEE Signal Processing Magazine, vol. 37, no. 3, pp. 92–101, 2020.
- [6] G. Qu and N. Li, “Harnessing smoothness to accelerate distributed optimization,” IEEE Transactions on Control of Network Systems, vol. 5, no. 3, pp. 1245–1260, 2018.
- [7] C. Xi, V. S. Mai, R. Xin, E. H. Abed, and U. A. Khan, “Linear convergence in optimization over directed graphs with row-stochastic matrices,” IEEE Transactions on Automatic Control, vol. 63, no. 10, pp. 3558–3565, 2018.
- [8] C. T. Dinh, N. H. Tran, T. D. Nguyen, W. Bao, A. R. Balef, B. B. Zhou, and A. Y. Zomaya, “DONE: distributed approximate Newton-type method for federated edge learning,” IEEE Transactions on Parallel and Distributed Systems, vol. 33, no. 11, pp. 2648–2660, 2022.
- [9] G. Sharma and S. Dey, “On analog distributed approximate Newton with determinantal averaging,” in 2022 IEEE 33rd Annual International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC). IEEE, 2022, pp. 1–7.
- [10] M. Safaryan, R. Islamov, X. Qian, and P. Richtárik, “FedNL: Making Newton-type methods applicable to federated learning,” in 39th International Conference on Machine Learning, ICML 2022, 2022.
- [11] M. Sen, A. Qin, and K. M. C, “FAGH: Accelerating federated learning with approximated global hessian,” arXiv e-prints, pp. arXiv–2403, 2024.
- [12] A. Mokhtari, Q. Ling, and A. Ribeiro, “Network Newton distributed optimization methods,” IEEE Transactions on Signal Processing, vol. 65, no. 1, pp. 146–161, 2016.
- [13] D. Bajovic, D. Jakovetic, N. Krejic, and N. K. Jerinkic, “Newton-like method with diagonal correction for distributed optimization,” SIAM Journal on Optimization, vol. 27, no. 2, pp. 1171–1203, 2017.
- [14] N. Bof, R. Carli, G. Notarstefano, L. Schenato, and D. Varagnolo, “Multiagent Newton–Raphson optimization over lossy networks,” IEEE Transactions on Automatic Control, vol. 64, no. 7, pp. 2983–2990, 2018.
- [15] J. Zhang, Q. Ling, and A. M.-C. So, “A Newton tracking algorithm with exact linear convergence for decentralized consensus optimization,” IEEE Transactions on Signal and Information Processing over Networks, vol. 7, pp. 346–358, 2021.
- [16] H. Ye, C. He, and X. Chang, “Accelerated distributed approximate Newton method,” IEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 11, pp. 8642–8653, 2022.
- [17] O. Shamir, N. Srebro, and T. Zhang, “Communication-efficient distributed optimization using an approximate Newton-type method,” in International conference on machine learning. PMLR, 2014, pp. 1000–1008.
- [18] Z. Qu, X. Li, L. Li, and Y. Hong, “Distributed second-order method with diffusion strategy,” in 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC). IEEE, 2023, pp. 2022–2027.
- [19] A. Maritan, G. Sharma, S. Dey, and L. Schenato, “Fully-distributed optimization with Network Exact Consensus-GIANT,” in 2024 IEEE 25th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC). IEEE, 2024, pp. 436–440.
- [20] J. Zhang, K. You, and T. Başar, “Distributed adaptive Newton methods with global superlinear convergence,” Automatica, vol. 138, p. 110156, 2022. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0005109821006865
- [21] A. Maritan, G. Sharma, L. Schenato, and S. Dey, “Network-GIANT: Fully distributed Newton-type optimization via harmonic hessian consensus,” in 2023 IEEE Globecom Workshops (GC Wkshps), 2023, pp. 902–907.
- [22] S. Wang, F. Roosta, P. Xu, and M. W. Mahoney, “GIANT: Globally improved approximate Newton method for distributed optimization,” Advances in Neural Information Processing Systems, vol. 31, 2018.
- [23] L. Xiao, S. Boyd, and S.-J. Kim, “Distributed average consensus with least-mean-square deviation,” Journal of parallel and distributed computing, vol. 67, no. 1, pp. 33–46, 2007.
- [24] N. Agarwal, B. Bullins, and E. Hazan, “Second-order stochastic optimization for machine learning in linear time,” Journal of Machine Learning Research, vol. 18, no. 116, pp. 1–40, 2017. [Online]. Available: http://jmlr.org/papers/v18/16-491.html
- [25] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, March 2004.
- [26] B. Polyak, “Introduction to Optimization,” Translations Series in Mathematics and Engineering. New York: Optimization Software Inc. Publications Division, 1987.
- [27] B. Li, S. Cen, Y. Chen, and Y. Chi, “Communication-efficient distributed optimization in networks with gradient tracking and variance reduction,” The Journal of Machine Learning Research, vol. 21, no. 1, 2020.
- [28] A. Mokhtari, Q. Ling, and A. Ribeiro, “Network Newton distributed optimization methods,” IEEE Transactions on Signal Processing, 2016.
- [29] M. A. Erdogdu and A. Montanari, “Convergence rates of sub-sampled Newton methods,” in Advances in Neural Information Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, Eds., vol. 28. Curran Associates, Inc., 2015.
- [30] G. Qu and N. Li, “Accelerated distributed Nesterov gradient descent,” IEEE Transactions on Automatic Control, vol. 65, no. 6, pp. 2566–2581, 2019.
- [31] D. Dua and C. Graff, “UCI Machine Learning Repository,” 2019, DOI: http://archive.ics.uci.edu/ml.
- [32] L. Xiao, S. Boyd, and S.-J. Kim, “Distributed average consensus with least-mean-square deviation,” Journal of parallel and distributed computing, vol. 67, no. 1, pp. 33–46, 2007.
- [33] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, 2nd ed., ser. Springer Series in Statistics. Springer, New York, 2009, Data Mining, Inference, and Prediction. [Online]. Available: https://doi.org/10.1007/978-0-387-84858-7
{IEEEbiography}
[
]Souvik Das is a postdoctoral researcher in the Department of Electrical Engineering at Uppsala University, Sweden. He received his doctoral degree from the Centre for Systems and Control at the IIT Bombay in 2025. His research interests are broadly in the field of optimization and optimization-based control, and security and privacy of cyber-physical systems.
{IEEEbiography}
[
]Luca Schenato (Fellow, IEEE) received the Dr.Eng. degree in electrical engineering from the University of Padova, Padova, Italy, in 1999, and the Ph.D. degree in electrical engineering and computer sciences from the University of California (UC), Berkeley, Berkeley, CA, USA, in 2003. He was a Postdoctoral Researcher in 2004 and Visiting Professor during 2013–2014 with UC Berkeley. He is currently a Full Professor with the Information Engineering Department, University of Padova. His interests include networked control systems, multi-agent systems, wireless sensor networks, distributed optimisation and synthetic biology. Luca Schenato has been awarded the 2004 Researchers Mobility Fellowship by the Italian Ministry of Education, University and Research (MIUR), the 2006 Eli Jury Award in U.C. Berkeley and the EUCA European Control Award in 2014, and IEEE Fellow in 2017. He served as Associate Editor for IEEE Trans. on Automatic Control from 2010 to 2014 and he is he is currently Senior Editor for IEEE Trans. on Control of Network Systems and Associate Editor for Automatica.
{IEEEbiography}
[
]Subhrakanti Dey (Fellow, IEEE) received the Ph.D. degree from the Department of Systems Engineering, Research School of Information Sciences and Engineering, Australian National University, Canberra, in 1996. He is currently a Professor and Head of the Signals and Systems division in the Dept of Electrical Engineering at Uppsala University, Sweden. He has also held professorial positions at NUI Maynooth, Ireland and University of Melbourne, Australia. His current research interests include networked control systems, distributed machine learning and optimization, and detection and estimation theory for wireless sensor networks. He is currently a Senior Editor for IEEE Transactions of Control of Network Systems and IEEE Transactions on Signal and Information Processing over Networks, and an Associate Editor for Automatica. He also served as a Senior Editor for IEEE Control Systems Letter until 2025.