A Homogeneous Second-Order Descent Ascent Algorithm for Nonconvex-Strongly Concave Minimax Problems

This work is supported by the National Key R & D Program of China (Nos. 2025YFA1017801 and 2025YFA1017800) and the National Natural Science Foundation of China under grant 12471294.
Jia-Hao Chen, Department of Mathematics, College of Sciences, Shanghai University, Shanghai 200444, P.R. China. Email: chenjiahao@shu.edu.cn
Zi Xu (corresponding author), Department of Mathematics, College of Sciences, Shanghai University, Shanghai 200444, P.R. China. Email: xuzi@shu.edu.cn
Hui-Ling Zhang, LSEC, ICMSEC, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China. Email: zhanghl1209@shu.edu.cn
Abstract
This paper introduces a novel Homogeneous Second-order Descent Ascent (HSDA) algorithm for nonconvex-strongly concave minimax optimization problems. At each iteration, HSDA computes a search direction by solving a single homogenized eigenvalue subproblem built from the gradient and Hessian of the objective function. This formulation guarantees a descent direction with sufficient negative curvature even when the Hessian is nearly positive semidefinite, a key feature that helps the iterates escape saddle points. We prove that HSDA finds an $\mathcal{O}(\varepsilon,\sqrt{\varepsilon})$-second-order stationary point within $\tilde{\mathcal{O}}(\varepsilon^{-3/2})$ iterations, matching the best known dependence on $\varepsilon$ among second-order methods for this problem class. To address large-scale applications, we further design an inexact variant (IHSDA) that preserves the single-loop structure while solving the subproblem approximately via a Lanczos procedure. With high probability, IHSDA attains an $\mathcal{O}(\varepsilon,\sqrt{\varepsilon})$-second-order stationary point with the same $\tilde{\mathcal{O}}(\varepsilon^{-3/2})$ iteration complexity, and its total Hessian-vector product cost is bounded by $\tilde{\mathcal{O}}(\varepsilon^{-7/4})$. Experiments on synthetic minimax problems and adversarial training tasks confirm the practical effectiveness and robustness of the proposed algorithms.
1 Introduction
In this paper, we consider the following unconstrained minimax problem:
$$\min_{x\in\mathbb{R}^{n}}\max_{y\in\mathbb{R}^{m}}f(x,y), \tag{P}$$
where $f(x,y):\mathbb{R}^{n}\times\mathbb{R}^{m}\to\mathbb{R}$ is a continuously differentiable function that is strongly concave in $y$ but possibly nonconvex in $x$. For convenience, we denote
$$\mathcal{F}(x):=\max_{y\in\mathbb{R}^{m}}f(x,y). \tag{1.1}$$
Such a structure captures a wide range of machine learning applications, including adversarial training and distributionally robust optimization gao2023distributionally ; sanjabi2018convergence ; sinha2017certifying , reinforcement learning, domain adaptation, and AUC maximization ganin2016domain ; qiu2020single ; ying2016stochastic .
To solve the minimax problem (P), three main classes of optimization algorithms have been developed: zeroth-order, first-order, and second-order methods, which utilize the function value, gradient, and Hessian of the objective function, respectively. Compared to zeroth- and first-order approaches, second-order methods have garnered considerable attention owing to their faster convergence rates. Moreover, they are more effective at escaping saddle points and avoiding poor local minima, thereby increasing the likelihood of converging to a globally optimal solution. This paper focuses on second-order optimization algorithms for solving (P).
For nonconvex-strongly concave minimax problems, first-order algorithms can obtain an $\varepsilon$-first-order stationary point in $\tilde{\mathcal{O}}(\kappa_{y}^{2}\varepsilon^{-2})$ iterations jin2020local ; lin2020gradient ; lu2020hybrid ; rafique2022weakly ; xu2023unified , where $\kappa_{y}$ denotes the condition number of $f(x,\cdot)$. Acceleration frameworks further improve the iteration complexity to $\tilde{\mathcal{O}}(\sqrt{\kappa_{y}}\,\varepsilon^{-2})$ lin2020near ; zhang2021complexity ; Li2021ComplexityLB .
There are few studies on second-order algorithms for solving the nonconvex-strongly concave minimax problem (P). Existing second-order algorithms fall into two categories: cubic regularization Newton type algorithms luo2022finding ; chen2021cubic and trust-region type algorithms yao2024two ; wang2025gradient . Building upon the cubic regularization (CR) framework, Luo et al. luo2022finding proposed the Minimax Cubic Newton (MCN) algorithm, which alternates between a cubic-regularized Newton step in the minimization variable and an ascent step in the maximization variable, achieving an iteration complexity of $\mathcal{O}(\varepsilon^{-3/2})$ to reach an $\mathcal{O}(\varepsilon,\sqrt{\varepsilon})$-second-order stationary point. They further introduced an inexact variant (IMCN) that solves the cubic subproblem via gradient-based iterations and approximates Hessian inverse operations using Chebyshev polynomial expansions, relying solely on Hessian-vector products. Within the same line of work, Chen et al. chen2021cubic developed the Inexact Cubic-LocalMinimax (ICLM) algorithm, which attains the same order of iteration complexity. In the trust-region family of methods, Yao and Xu yao2024two proposed MINIMAX-TR, a fixed-radius inexact trust-region method that finds an $\mathcal{O}(\varepsilon,\sqrt{\varepsilon})$-second-order stationary point within $\mathcal{O}(\varepsilon^{-3/2})$ iterations. To enhance practical performance, they also designed MINIMAX-TRACE, which adaptively adjusts the trust-region radius through contraction and expansion steps while maintaining the same theoretical iteration complexity. More recently, Wang and Xu wang2025gradient introduced a gradient norm regularized trust-region method (GRTR) and a Levenberg-Marquardt type negative-curvature method (LMNegCur) for nonconvex-strongly concave minimax problems. GRTR achieves an iteration complexity of $\tilde{\mathcal{O}}(\varepsilon^{-3/2})$, and its inexact variant IGRTR preserves this rate with a total Hessian-vector product cost of $\tilde{\mathcal{O}}(\varepsilon^{-7/4})$. LMNegCur and its inexact counterpart ILMNegCur offer analogous convergence guarantees.
Collectively, these advances highlight the ongoing development of second-order methods tailored to nonconvex-strongly concave minimax optimization and motivate the algorithm design pursued in this work.
1.1 Contributions
In this paper, we propose a homogeneous second-order descent ascent (HSDA) algorithm whose outer iteration solves a single homogenized eigenvalue subproblem, constructed from the gradient and Hessian of the value function, to obtain an iteration direction. This homogenized formulation guarantees, even when the Hessian is nearly positive semidefinite, a descent direction with sufficient negative curvature for the value function. We prove that HSDA finds an $\mathcal{O}(\varepsilon,\sqrt{\varepsilon})$-second-order stationary point within $\tilde{\mathcal{O}}(\varepsilon^{-3/2})$ iterations, matching the best known iteration complexity for second-order methods in this setting chen2021cubic ; luo2022finding ; wang2025gradient . For large-scale problems, we develop an inexact version (IHSDA) that approximately solves the homogenized eigenvalue subproblem via a Lanczos procedure with a carefully controlled residual. IHSDA preserves the same $\tilde{\mathcal{O}}(\varepsilon^{-3/2})$ outer iteration complexity and, with high probability, reaches an $\mathcal{O}(\varepsilon,\sqrt{\varepsilon})$-second-order stationary point, while its total number of Hessian-vector products is upper bounded by $\tilde{\mathcal{O}}(\varepsilon^{-7/4})$.
Unlike recent inexact trust-region schemes (e.g., IGRTR and ILMNegCur wang2025gradient ), which require solving both a regularized Newton system and an $n$-dimensional extremal-eigenvalue problem, HSDA solves only a single $(n+1)$-dimensional extremal-eigenvalue problem in a lifted space at each iteration. We further show (in Section 3) that the homogenized eigenvalue subproblems typically exhibit better conditioning than the regularized Newton/trust-region systems underlying IGRTR/ILMNegCur. Hence, HSDA and IHSDA offer an alternative second-order framework whose inner subproblem is structurally simpler, yet which attains the same $\tilde{\mathcal{O}}(\varepsilon^{-7/4})$ Hessian-vector product complexity as IGRTR and ILMNegCur.
Notation. We adopt the following notation throughout the paper: $[a;b]$ and $[a,b]$ denote vertical and horizontal concatenation, respectively; $\operatorname{sgn}(\cdot)$ is the sign function, defined by $\operatorname{sgn}(a)=-1$ if $a<0$ and $\operatorname{sgn}(a)=1$ if $a\geqslant 0$. For a vector $a\in\mathbb{R}^{n}$ and $0\leqslant j\leqslant n$, $a_{[1:j]}$ denotes the subvector formed by its first $j$ entries. The symbol $\|\cdot\|$ denotes the Euclidean norm for vectors and the induced $\ell_{2}$ operator norm for matrices. The eigenvalues of a matrix $A\in\mathbb{R}^{n\times n}$ are ordered as $\lambda_{1}(A),\lambda_{2}(A),\dots,\lambda_{\max}(A)$ in nondecreasing order. The identity matrix of dimension $n$ is written as $I_{n}$, or simply $I$ when the dimension is clear. For a function $f(x,y):\mathbb{R}^{n}\times\mathbb{R}^{m}\to\mathbb{R}$, $\nabla_{x}f(x,y)$ and $\nabla_{y}f(x,y)$ denote its partial gradients with respect to $x$ and $y$, respectively; the full gradient is $\nabla f(x,y):=(\nabla_{x}f(x,y),\nabla_{y}f(x,y))$. Second-order partial derivatives are denoted by $\nabla_{xx}^{2}f(x,y)$, $\nabla_{xy}^{2}f(x,y)$, $\nabla_{yx}^{2}f(x,y)$, and $\nabla_{yy}^{2}f(x,y)$. Complexity notations $\mathcal{O}(\cdot),\Omega(\cdot),\Theta(\cdot)$ hide only absolute constants independent of problem parameters, while $\tilde{\mathcal{O}}(\cdot)$ additionally hides logarithmic factors. We also define the value function $\mathcal{F}(x):=\max_{y\in\mathbb{R}^{m}}f(x,y)$, the maximizer $y^{\ast}(x):=\arg\max_{y\in\mathbb{R}^{m}}f(x,y)$, the partial gradient $g(x,y):=\nabla_{x}f(x,y)$, and $H(x,y):=\bigl[\nabla_{xx}^{2}f-\nabla_{xy}^{2}f(\nabla_{yy}^{2}f)^{-1}\nabla_{yx}^{2}f\bigr](x,y)$.
2 A Homogeneous Second-Order Descent Ascent Algorithm
In this section, we propose a Homogeneous Second-order Descent Ascent (HSDA) algorithm for solving the nonconvex-strongly concave minimax problem (P). HSDA is inspired by the Homogeneous Second-order Descent Method (HSODM) zhang2025homogeneous , a second-order framework originally designed for unconstrained minimization problems of the form $\min_{x\in\mathbb{R}^{n}}\mathcal{F}(x)$. At each iteration, HSODM obtains a search direction by solving a homogenized eigenvalue subproblem of the form
$$\min_{\|[u;v]\|\leqslant 1}\ \begin{bmatrix}u\\ v\end{bmatrix}^{\top}\begin{bmatrix}\nabla^{2}\mathcal{F}(x)&\nabla\mathcal{F}(x)\\ \nabla\mathcal{F}(x)^{\top}&-\alpha\end{bmatrix}\begin{bmatrix}u\\ v\end{bmatrix},$$
where $\alpha\geqslant 0$ is a prescribed parameter. As demonstrated in zhang2025homogeneous , this subproblem is an eigenvalue problem whose optimal solution $[u_{t};v_{t}]$ is a unit eigenvector associated with the smallest eigenvalue of the homogenized matrix. Building on this framework, we propose HSDA as a generalized and inexact extension of HSODM tailored to the minimax setting in (P) with $\mathcal{F}(x)=\max_{y\in\mathbb{R}^{m}}f(x,y)$. At each iteration, HSDA incorporates two key algorithmic components:
- Inexact inner maximization via gradient ascent: Given $x_{t}$, we perform $N_{t}$ steps of Nesterov's accelerated gradient ascent nesterov2018lectures on $y$ to obtain an approximate maximizer
$$y_{t}\approx y^{\ast}(x_{t})=\arg\max_{y\in\mathbb{R}^{m}}f(x_{t},y).$$
Using $y_{t}$, we then construct the corresponding inexact first- and second-order information of $\mathcal{F}$ at $x_{t}$:
$$g_{t}:=g(x_{t},y_{t}),\quad H_{t}:=H(x_{t},y_{t}).$$
- Direction generation via homogenized eigenvector computation: Given the inexact gradient estimate $g_{t}$, Hessian estimate $H_{t}$, and a parameter $\alpha>0$, we compute the vector $[u_{t};v_{t}]$ with $u_{t}\in\mathbb{R}^{n}$ and $v_{t}\in\mathbb{R}$ by solving the following homogenized eigenvalue subproblem:
$$\min_{\|[u;v]\|\leqslant 1}\ \begin{bmatrix}u\\ v\end{bmatrix}^{\top}G_{t}(\alpha)\begin{bmatrix}u\\ v\end{bmatrix},\qquad G_{t}(\alpha):=\begin{bmatrix}H_{t}&g_{t}\\ g_{t}^{\top}&-\alpha\end{bmatrix}. \tag{2.1}$$
The search direction $s_{t}$ is then defined as
$$s_{t}=\begin{cases}\dfrac{u_{t}}{v_{t}},&|v_{t}|\geqslant\omega,\\[9pt] \operatorname{sgn}\big(-g_{t}^{\top}u_{t}\big)\,u_{t},&|v_{t}|<\omega.\end{cases} \tag{2.2}$$
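As a minimal sketch of the direction computation in (2.1) and (2.2), the snippet below forms the homogenized matrix densely and uses a full symmetric eigendecomposition. The function name `hsda_direction` and the example data are illustrative assumptions, not part of the paper, and a dense solver is only sensible for small $n$.

```python
import numpy as np

def hsda_direction(H, g, alpha, omega):
    """Illustrative helper (hypothetical name): solve (2.1), then apply rule (2.2)."""
    n = g.shape[0]
    # Homogenized matrix G_t(alpha) = [[H, g], [g^T, -alpha]].
    G = np.zeros((n + 1, n + 1))
    G[:n, :n] = H
    G[:n, n] = g
    G[n, :n] = g
    G[n, n] = -alpha
    # eigh returns eigenvalues in ascending order with orthonormal eigenvectors.
    eigvals, eigvecs = np.linalg.eigh(G)
    w = eigvecs[:, 0]              # unit eigenvector of the smallest eigenvalue
    u, v = w[:n], w[n]
    if abs(v) >= omega:
        s = u / v                  # first branch of (2.2)
    else:
        sign = 1.0 if -(g @ u) >= 0 else -1.0   # sgn as defined in the Notation
        s = sign * u               # second branch of (2.2)
    delta = -eigvals[0]            # delta_t = -lambda_1(G_t(alpha))
    return s, u, v, delta

# Example data chosen for illustration only.
H = np.diag([1.0, -2.0])
g = np.array([0.5, 0.1])
s, u, v, delta = hsda_direction(H, g, alpha=0.1, omega=0.25)
print("direction:", s, "delta:", delta)
```

Note that $u_{t}/v_{t}$ is invariant to the sign of the eigenvector returned by the solver, and the `sgn` factor removes the same ambiguity in the second branch.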
The detailed algorithm is formally stated in Algorithm 1.
Algorithm 1 A Homogeneous Second-Order Descent Ascent (HSDA) Algorithm
Step 1: Input $x_{1}$, $y_{0}$, $\eta_{1}>0$, $\eta_{2}>0$, $\omega\in(0,1/2)$, $\{N_{t}\geqslant 1\}$, $\varepsilon>0$, $\alpha>0$, $\Lambda>0$, and set $t=1$.
Step 2: Update $y_{t}$:
(2a): Set $i=0$, $y_{i}^{t}=\tilde{y}_{i}^{t}=y_{t-1}$.
(2b): Update $y_{i}^{t}$ and $\tilde{y}_{i}^{t}$:
$$\begin{aligned} y_{i+1}^{t}&=\tilde{y}_{i}^{t}+\eta_{1}\nabla_{y}f\big(x_{t},\tilde{y}_{i}^{t}\big),\\ \tilde{y}_{i+1}^{t}&=y_{i+1}^{t}+\eta_{2}\big(y_{i+1}^{t}-y_{i}^{t}\big). \end{aligned}$$
(2c): If $i\geqslant N_{t}-1$, set $y_{t}=y_{N_{t}}^{t}$; otherwise set $i=i+1$ and go to Step (2b).
Step 3: Compute
$$g_{t}=\nabla_{x}f(x_{t},y_{t}),\qquad H_{t}=\big[\nabla^{2}_{xx}f-\nabla^{2}_{xy}f(\nabla^{2}_{yy}f)^{-1}\nabla^{2}_{yx}f\big](x_{t},y_{t})$$
and solve the following homogeneous subproblem to obtain $[u_{t};v_{t}]$:
$$[u_{t};v_{t}]=\arg\min_{\|[u;v]\|\leqslant 1}\ \begin{bmatrix}u\\ v\end{bmatrix}^{\top}\begin{bmatrix}H_{t}&g_{t}\\ g_{t}^{\top}&-\alpha\end{bmatrix}\begin{bmatrix}u\\ v\end{bmatrix}.$$
Step 4: Update the direction $s_{t}$:
$$s_{t}=\begin{cases}\dfrac{u_{t}}{v_{t}},&|v_{t}|\geqslant\omega,\\[9pt] \operatorname{sgn}\big(-g_{t}^{\top}u_{t}\big)\,u_{t},&|v_{t}|<\omega.\end{cases}$$
Step 5: If some stationary condition or stopping criterion is satisfied, set $x_{t+1}=x_{t}+s_{t}$ and terminate. Otherwise, compute $\tau_{t}=\Lambda/\|s_{t}\|$, update $x_{t+1}=x_{t}+\tau_{t}s_{t}$, set $t=t+1$ and go to Step 2.
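The steps of Algorithm 1 can be sketched in a few lines of NumPy. Everything problem-specific below is an assumption made for illustration: the toy objective $f(x,y)=\sum_i (x_i^4/4-x_i^2/2)+x^{\top}y-\|y\|^2$, the parameter values, and in particular the stopping rule $|v_t|>\sqrt{1/(1+\Lambda^2)}$, which is borrowed from the case split used in the analysis because Step 5 leaves the criterion unspecified.

```python
import numpy as np

# Toy problem (assumption, not from the paper): strongly concave in y (mu = 2),
# nonconvex in x, with value function F(x) = sum(x^4/4 - x^2/4).
def grad_x(x, y):
    return x**3 - x + y

def grad_y(x, y):
    return x - 2.0 * y

def schur_hessian(x, y):
    n = x.size
    Hxx = np.diag(3.0 * x**2 - 1.0)
    Hxy = np.eye(n)
    Hyy = -2.0 * np.eye(n)
    return Hxx - Hxy @ np.linalg.solve(Hyy, Hxy.T)

def hsda(x, y, alpha, Lam, omega, eta1, eta2, N, max_iter=200):
    n = x.size
    v_stop = 1.0 / np.sqrt(1.0 + Lam**2)   # assumed stopping threshold
    for t in range(max_iter):
        # Step 2: N accelerated ascent steps on y, warm-started at the previous y.
        y_prev = y_tilde = y
        for _ in range(N):
            y_next = y_tilde + eta1 * grad_y(x, y_tilde)
            y_tilde = y_next + eta2 * (y_next - y_prev)
            y_prev = y_next
        y = y_prev
        # Step 3: inexact gradient/Hessian and the homogenized eigenproblem.
        g, H = grad_x(x, y), schur_hessian(x, y)
        G = np.block([[H, g[:, None]], [g[None, :], np.array([[-alpha]])]])
        w = np.linalg.eigh(G)[1][:, 0]
        u, v = w[:n], w[n]
        # Step 4: direction rule (2.2).
        if abs(v) >= omega:
            s = u / v
        else:
            s = (1.0 if -(g @ u) >= 0 else -1.0) * u
        # Step 5: full step and stop, or normalized step of length Lam.
        if abs(v) > v_stop:
            return x + s, y, t + 1
        x = x + (Lam / np.linalg.norm(s)) * s
    return x, y, max_iter

x_final, y_final, iters = hsda(np.array([0.2]), np.array([0.0]), alpha=0.05,
                               Lam=0.1, omega=0.25, eta1=0.5, eta2=0.0, N=1)
print(x_final, iters)
```

On this one-dimensional instance the iterate moves in steps of length $\Lambda$ toward the minimizer $1/\sqrt{2}$ of $\mathcal{F}$ and finishes with a single full step; for this toy problem one ascent step with $\eta_1=1/2$, $\eta_2=0$ already returns $y^{\ast}(x_t)=x_t/2$ exactly.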
In the following subsection, we establish that the proposed HSDA algorithm attains an $\mathcal{O}(\varepsilon,\sqrt{\varepsilon})$-second-order stationary point of $\mathcal{F}(x)$ for problem (P) with an iteration complexity of $\tilde{\mathcal{O}}(\varepsilon^{-3/2})$.
2.1 Complexity Analysis
Throughout this paper, we work under the following assumptions on $f(x,y)$, which ensure that the value function $\mathcal{F}(x)$ is well-defined and has the required smoothness.
Assumption 2.1
The function $f(x,y)$ satisfies the following conditions:
- (1) $f(x,y)$ is $\mu$-strongly concave in $y$ for any fixed $x$.
- (2) The gradient $\nabla f(\cdot,\cdot)$ is $\ell_{1}$-Lipschitz continuous in the joint variable: for all $(x,y),(x^{\prime},y^{\prime})\in\mathbb{R}^{n}\times\mathbb{R}^{m}$,
$$\big\|\nabla f(x,y)-\nabla f(x^{\prime},y^{\prime})\big\|\leqslant\ell_{1}\big\|(x,y)-(x^{\prime},y^{\prime})\big\|.$$
- (3) The Hessian blocks $\nabla^{2}_{xx}f(x,y)$, $\nabla^{2}_{xy}f(x,y)$, $\nabla^{2}_{yx}f(x,y)$, and $\nabla^{2}_{yy}f(x,y)$ are $\ell_{2}$-Lipschitz continuous.
- (4) The value function $\mathcal{F}(x)=\max_{y\in\mathbb{R}^{m}}f(x,y)$ has a finite lower bound $\mathcal{F}_{\inf}$.
Under these assumptions, Lemma 2.1 establishes the Lipschitz continuity of $\nabla\mathcal{F}$ and $\nabla^{2}\mathcal{F}$.
Lemma 2.1 (chen2021cubic)
Under Assumption 2.1, the following properties hold, where $\kappa:=\ell_{1}/\mu$:
- (1) The maximizer $y^{*}(x)=\arg\max_{y\in\mathbb{R}^{m}}f(x,y)$ is well-defined for every $x$ and is $\kappa$-Lipschitz continuous.
- (2) $\nabla\mathcal{F}(x)=\nabla_{x}f\big(x,y^{*}(x)\big)$, and $\nabla\mathcal{F}$ is $L_{1}$-Lipschitz continuous with $L_{1}:=(\kappa+1)\ell_{1}$.
- (3) $H(x,y)$ is $L_{H}$-Lipschitz continuous with $L_{H}:=\ell_{2}(1+\kappa)^{2}$, and $\|H(x,y)\|\leqslant L_{1}$.
- (4) It holds that $\nabla^{2}\mathcal{F}(x)=H\big(x,y^{*}(x)\big)$. Consequently, the Hessian $\nabla^{2}\mathcal{F}(x)$ is Lipschitz continuous with constant $L_{2}:=\ell_{2}(1+\kappa)^{3}$.
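Property (4) can be checked numerically on a quadratic toy problem. The sketch below is an illustration under assumptions: the objective $f(x,y)=\tfrac12 x^{\top}Ax+x^{\top}By-\tfrac{\mu}{2}\|y\|^{2}$, the random data, and the helper names are all hypothetical. It compares the Schur-complement matrix $H(x,y^{*}(x))$ with a finite-difference Hessian of $\mathcal{F}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, mu = 3, 2, 2.0
A = rng.standard_normal((n, n))
A = 0.5 * (A + A.T)                      # symmetric, possibly indefinite
B = rng.standard_normal((n, m))

def y_star(x):
    return B.T @ x / mu                  # closed-form maximizer for this toy f

def f(x, y):
    return 0.5 * x @ A @ x + x @ B @ y - 0.5 * mu * (y @ y)

def F(x):
    return f(x, y_star(x))               # value function

def schur_H(x, y):
    # H(x, y) = f_xx - f_xy (f_yy)^{-1} f_yx; constant for a quadratic f.
    Hxx, Hxy, Hyy = A, B, -mu * np.eye(m)
    return Hxx - Hxy @ np.linalg.solve(Hyy, Hxy.T)

def fd_hessian(fun, x, h=1e-3):
    # Central second differences; exact up to rounding for a quadratic.
    k = x.size
    E = np.eye(k) * h
    Hfd = np.empty((k, k))
    for i in range(k):
        for j in range(k):
            Hfd[i, j] = (fun(x + E[i] + E[j]) - fun(x + E[i] - E[j])
                         - fun(x - E[i] + E[j]) + fun(x - E[i] - E[j])) / (4 * h * h)
    return Hfd

x = rng.standard_normal(n)
H_schur = schur_H(x, y_star(x))
H_fd = fd_hessian(F, x)
print(np.max(np.abs(H_schur - H_fd)))
```

For this quadratic, $H(x,y)=A+BB^{\top}/\mu$ for every $(x,y)$, so the check also illustrates why the Schur complement, rather than $\nabla^{2}_{xx}f$ alone, carries the curvature of $\mathcal{F}$.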
Lemma 2.1 directly yields the upper bounds presented in Lemma 2.2.
Lemma 2.2 (nesterov2018lectures)
Under Assumption 2.1, for all $x,x^{\prime}\in\mathbb{R}^{n}$ the gradient and Hessian of $\mathcal{F}$ satisfy the following inequalities:
$$\left\|\nabla\mathcal{F}(x^{\prime})-\nabla\mathcal{F}(x)-\nabla^{2}\mathcal{F}(x)(x^{\prime}-x)\right\|\leqslant\frac{L_{2}}{2}\|x^{\prime}-x\|^{2},$$
$$\left|\mathcal{F}(x^{\prime})-\mathcal{F}(x)-\nabla\mathcal{F}(x)^{\top}(x^{\prime}-x)-\frac{1}{2}(x^{\prime}-x)^{\top}\nabla^{2}\mathcal{F}(x)(x^{\prime}-x)\right|\leqslant\frac{L_{2}}{6}\|x^{\prime}-x\|^{3},$$
where $L_{2}$ is the Lipschitz constant of $\nabla^{2}\mathcal{F}$ given in Lemma 2.1.
We next recall the standard definitions of first- and second-order stationary points used in minimax optimization luo2022finding .
Definition 1
A point $x$ is an $\varepsilon$-first-order stationary point of $\mathcal{F}(x)$ when $\|\nabla\mathcal{F}(x)\|\leqslant\varepsilon$.
Definition 2
A point $x$ is an $\mathcal{O}(\varepsilon,\sqrt{\varepsilon})$-second-order stationary point of $\mathcal{F}(x)$ when
$$\|\nabla\mathcal{F}(x)\|\leqslant c_{1}\varepsilon\quad\text{and}\quad\nabla^{2}\mathcal{F}(x)\succcurlyeq-c_{2}\sqrt{\varepsilon}\,I,$$
where the positive constants $c_{1}$ and $c_{2}$ do not depend on $\varepsilon$.
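Definition 2 translates into a simple numerical test once a gradient and Hessian of $\mathcal{F}$ are available. The helper below is a hypothetical illustration; its name and the default constants $c_1=c_2=1$ are assumptions, since the definition only requires that the constants be independent of $\varepsilon$.

```python
import numpy as np

def is_second_order_stationary(grad, hess, eps, c1=1.0, c2=1.0):
    """Check ||grad|| <= c1*eps and lambda_min(hess) >= -c2*sqrt(eps)."""
    lam_min = np.linalg.eigvalsh(hess)[0]    # eigenvalues come back in ascending order
    first_order = np.linalg.norm(grad) <= c1 * eps
    second_order = lam_min >= -c2 * np.sqrt(eps)
    return bool(first_order and second_order)

# Mild negative curvature (-0.05 >= -sqrt(0.01)) passes; strong curvature does not.
ok = is_second_order_stationary(np.zeros(2), np.diag([1.0, -0.05]), eps=0.01)
bad = is_second_order_stationary(np.zeros(2), np.diag([1.0, -0.5]), eps=0.01)
```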
The following lemma demonstrates that by selecting an appropriate step size and performing sufficiently many inner gradient-ascent updates on yy, we can obtain approximations of the gradient and Hessian of the value function within a desired accuracy.
Lemma 2.3 (wang2025gradient)
Suppose that Assumption 2.1 holds. For any $\varepsilon_{1},\varepsilon_{2}>0$, set the inner ascent step sizes as $\eta_{1}=1/\ell_{1}$, $\eta_{2}=(\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)$, and define $A=\min\left\{\frac{\varepsilon_{1}}{\ell_{1}},\ \frac{\varepsilon_{2}}{2L_{H}}\right\}$, where $L_{H}$ is the Lipschitz constant of $H(x,y)$ given in Lemma 2.1. If the iteration counts $\{N_{t}\}$ for the $y$-updates in Algorithm 1 satisfy
$$\begin{aligned} N_{1}&\geqslant 2\sqrt{\kappa}\,\log\left(\frac{\sqrt{\kappa+1}\,\|y_{0}-y^{*}(x_{1})\|}{A}\right),\\ N_{t}&\geqslant 2\sqrt{\kappa}\,\log\left(\frac{\sqrt{\kappa+1}\,\big(A+\kappa\|x_{t}-x_{t-1}\|\big)}{A}\right),\qquad t\geqslant 2, \end{aligned}$$
then for every $t\geqslant 1$ the following error bounds hold:
$$\left\|y_{t}-y^{*}(x_{t})\right\|\leqslant A,\qquad\big\|\nabla\mathcal{F}(x_{t})-g_{t}\big\|\leqslant\varepsilon_{1},\qquad\big\|\nabla^{2}\mathcal{F}(x_{t})-H_{t}\big\|\leqslant\varepsilon_{2}.$$
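To make the inner iteration counts in Lemma 2.3 concrete, the following sketch evaluates $A$ and the lower bound on $N_{t}$ for given constants. The function names and the numerical values are illustrative assumptions; the formulas are those stated in the lemma, rounded up to an integer and clipped at one step.

```python
import math

def accuracy_radius(eps1, eps2, ell1, ell2, kappa):
    """A = min(eps1/ell1, eps2/(2 L_H)) with L_H = ell2 (1 + kappa)^2 (Lemma 2.1)."""
    L_H = ell2 * (1.0 + kappa) ** 2
    return min(eps1 / ell1, eps2 / (2.0 * L_H))

def inner_steps(kappa, A, dist):
    """Smallest integer N >= 2 sqrt(kappa) log(sqrt(kappa+1) * dist / A), at least 1.

    dist is ||y_0 - y*(x_1)|| for t = 1 and A + kappa*||x_t - x_{t-1}|| for t >= 2.
    """
    bound = 2.0 * math.sqrt(kappa) * math.log(math.sqrt(kappa + 1.0) * dist / A)
    return max(1, math.ceil(bound))

# Illustrative constants (assumptions): kappa = 4, ell1 = 2, ell2 = 1.
kappa, ell1, ell2 = 4.0, 2.0, 1.0
A = accuracy_radius(1e-2, 1e-1, ell1, ell2, kappa)
N_t = inner_steps(kappa, A, A + kappa * 0.1)   # outer step length 0.1
print(A, N_t)
```

Since every non-terminal outer step of Algorithm 1 has length $\|x_{t+1}-x_{t}\|=\Lambda$, the bound for $t\geqslant 2$ is the same at every iteration and depends on the target accuracy only through the logarithm.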
The following lemma characterizes the optimality conditions for the homogenized eigenvalue subproblem (2.1).
Lemma 2.4 (zhang2025homogeneous)
The vector $[u_{t};v_{t}]$ is a solution to the homogenized eigenvalue subproblem (2.1) if and only if there exists a dual scalar $\delta_{t}$ such that
$$\begin{bmatrix}H_{t}+\delta_{t}I&g_{t}\\ g_{t}^{\top}&-\alpha+\delta_{t}\end{bmatrix}\succcurlyeq 0, \tag{2.3a}$$
$$(H_{t}+\delta_{t}I)u_{t}=-v_{t}g_{t},\quad g_{t}^{\top}u_{t}=v_{t}(\alpha-\delta_{t}), \tag{2.3b}$$
$$\delta_{t}\geqslant\alpha>0,\quad\big\|[u_{t};v_{t}]\big\|=1. \tag{2.3c}$$
Furthermore, $-\delta_{t}$ equals the smallest eigenvalue of the homogenized matrix $G_{t}(\alpha)$, i.e., $-\delta_{t}=\lambda_{1}\big(G_{t}(\alpha)\big)$, and $[u_{t};v_{t}]$ is a corresponding unit eigenvector. Moreover, when $g_{t}\neq 0$, the inequality $\delta_{t}\geqslant\alpha>0$ in (2.3c) can be strengthened to the strict form $\delta_{t}>\alpha>0$.
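The conditions (2.3a) to (2.3c) can be verified numerically for a random instance. The sketch below is a sanity check under assumptions (random symmetric $H_t$, random $g_t$, and $\alpha=0.3$ are chosen for illustration); it reads $\delta_t$ and $[u_t;v_t]$ off the smallest eigenpair of $G_t(\alpha)$ and evaluates the residuals of each condition.

```python
import numpy as np

rng = np.random.default_rng(1)
n, alpha = 4, 0.3
M = rng.standard_normal((n, n))
H = 0.5 * (M + M.T)                      # symmetric stand-in for H_t
g = rng.standard_normal(n)               # stand-in for g_t (nonzero)

# Homogenized matrix G_t(alpha) and its smallest eigenpair.
G = np.block([[H, g[:, None]], [g[None, :], np.array([[-alpha]])]])
vals, vecs = np.linalg.eigh(G)
delta = -vals[0]
u, v = vecs[:n, 0], vecs[n, 0]

# (2.3a): G + delta*I is positive semidefinite (smallest eigenvalue ~ 0).
psd_gap = np.linalg.eigvalsh(G + delta * np.eye(n + 1))[0]
# (2.3b): the two blocks of the eigenvector equation.
res_block = np.linalg.norm((H + delta * np.eye(n)) @ u + v * g)
res_scalar = abs(g @ u - v * (alpha - delta))
# (2.3c): unit norm and delta >= alpha (strict here because g != 0).
norm_w = np.linalg.norm(vecs[:, 0])
print(psd_gap, res_block, res_scalar, delta - alpha)
```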
We now proceed to analyze the iteration complexity of the proposed HSDA algorithm. Our analysis begins by establishing a descent property for the case where $|v_{t}|\leqslant\sqrt{1/(1+\Lambda^{2})}$.
Lemma 2.5
Suppose that Assumption 2.1 holds. Assume $\Lambda\leqslant\sqrt{2}/2$ and $\omega\in(0,1/2)$. For the case $|v_{t}|\leqslant\sqrt{1/(1+\Lambda^{2})}$, we have
$$\mathcal{F}(x_{t+1})-\mathcal{F}(x_{t})\leqslant\Lambda\varepsilon_{1}+\frac{\Lambda^{2}}{2}(\varepsilon_{2}-\alpha)+\frac{L_{2}}{6}\Lambda^{3}. \tag{2.4}$$
Proof
We first prove that when $|v_{t}|\leqslant\sqrt{1/(1+\Lambda^{2})}$, the direction $s_{t}$ satisfies
$$\|s_{t}\|\geqslant\Lambda. \tag{2.5}$$
When $|v_{t}|<\omega$, it follows from $s_{t}=\operatorname{sgn}(-g_{t}^{\top}u_{t})u_{t}$ in Algorithm 1 and (2.3c) that
$$\|s_{t}\|=\|u_{t}\|=\sqrt{1-|v_{t}|^{2}}\geqslant\sqrt{1-\omega^{2}}\geqslant\sqrt{3}/2\geqslant\Lambda.$$
When $\omega\leqslant|v_{t}|\leqslant\sqrt{1/(1+\Lambda^{2})}$, Algorithm 1 and (2.3c) give
$$\|s_{t}\|=\|u_{t}/v_{t}\|=\sqrt{1-|v_{t}|^{2}}/|v_{t}|\geqslant\Lambda.$$
Therefore, (2.5) holds, and thus $\tau_{t}=\Lambda/\|s_{t}\|\in(0,1]$.
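The lower bound in the case $\omega\leqslant|v_{t}|\leqslant\sqrt{1/(1+\Lambda^{2})}$ is stated without detail. One way to verify it is the following short check, where the auxiliary function $\varphi$ is introduced here only for this purpose and is not notation from the paper:

```latex
\varphi(r) := \frac{\sqrt{1-r^{2}}}{r} = \sqrt{\frac{1}{r^{2}}-1}
\quad\text{is decreasing on } (0,1],
\qquad
\varphi\!\left(\frac{1}{\sqrt{1+\Lambda^{2}}}\right)
  = \sqrt{(1+\Lambda^{2})-1} = \Lambda .
```

Hence $|v_{t}|\leqslant 1/\sqrt{1+\Lambda^{2}}$ gives $\|s_{t}\|=\varphi(|v_{t}|)\geqslant\Lambda$, and the same monotonicity shows $\|s_{t}\|<\Lambda$ when $|v_{t}|>1/\sqrt{1+\Lambda^{2}}$.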
Denote $E_{t}:=\tau_{t}g_{t}^{\top}s_{t}+\frac{\tau_{t}^{2}}{2}s_{t}^{\top}H_{t}s_{t}$. It follows from Step 5 in Algorithm 1, the $L_{2}$-Lipschitz continuity of $\nabla^{2}\mathcal{F}(x)$, and Lemma 2.3 that
$$\begin{aligned} &\mathcal{F}(x_{t+1})-\mathcal{F}(x_{t})\\ \leqslant{}&\tau_{t}\nabla\mathcal{F}(x_{t})^{\top}s_{t}+\frac{\tau_{t}^{2}}{2}s_{t}^{\top}\nabla^{2}\mathcal{F}(x_{t})s_{t}+\frac{L_{2}}{6}\tau_{t}^{3}\|s_{t}\|^{3}\\ ={}&E_{t}+\tau_{t}(\nabla\mathcal{F}(x_{t})-g_{t})^{\top}s_{t}+\frac{\tau_{t}^{2}}{2}s_{t}^{\top}\big(\nabla^{2}\mathcal{F}(x_{t})-H_{t}\big)s_{t}+\frac{L_{2}}{6}\tau_{t}^{3}\|s_{t}\|^{3}\\ \leqslant{}&E_{t}+\tau_{t}\|\nabla\mathcal{F}(x_{t})-g_{t}\|\,\|s_{t}\|+\frac{\tau_{t}^{2}}{2}\|\nabla^{2}\mathcal{F}(x_{t})-H_{t}\|\,\|s_{t}\|^{2}+\frac{L_{2}}{6}\tau_{t}^{3}\|s_{t}\|^{3}\\ \leqslant{}&E_{t}+\varepsilon_{1}\Lambda+\frac{\varepsilon_{2}}{2}\Lambda^{2}+\frac{L_{2}}{6}\Lambda^{3}. \end{aligned} \tag{2.6}$$
When $|v_{t}|<\omega$ and $g_{t}\neq 0$, by (2.3b), (2.3c) and $s_{t}=\operatorname{sgn}(-g_{t}^{\top}u_{t})u_{t}$ in Algorithm 1, we obtain
$$s_{t}^{\top}H_{t}s_{t}=-\delta_{t}\|s_{t}\|^{2}-v_{t}^{2}(\alpha-\delta_{t}),\quad g_{t}^{\top}s_{t}=|v_{t}|(\alpha-\delta_{t}). \tag{2.7}$$
Therefore, using (2.7), we get
$$\begin{aligned} E_{t}&=\tau_{t}g_{t}^{\top}s_{t}+\frac{\tau_{t}^{2}}{2}s_{t}^{\top}H_{t}s_{t}\\ &=\tau_{t}|v_{t}|(\alpha-\delta_{t})-\frac{\tau_{t}^{2}}{2}\delta_{t}\|s_{t}\|^{2}-\frac{\tau_{t}^{2}}{2}v_{t}^{2}(\alpha-\delta_{t})\\ &\leqslant\tau_{t}v_{t}^{2}(\alpha-\delta_{t})-\frac{\tau_{t}^{2}}{2}\delta_{t}\|s_{t}\|^{2}-\frac{\tau_{t}^{2}}{2}v_{t}^{2}(\alpha-\delta_{t})\\ &=\Bigl(\tau_{t}-\frac{\tau_{t}^{2}}{2}\Bigr)v_{t}^{2}(\alpha-\delta_{t})-\frac{\tau_{t}^{2}}{2}\delta_{t}\|s_{t}\|^{2}, \end{aligned}$$
where the inequality follows from $|v_{t}|<\omega<1$ and $\alpha-\delta_{t}<0$. Since $\tau_{t}\in(0,1]$, we have $\tau_{t}-\tau_{t}^{2}/2\geqslant 0$. Combining this with (2.3c), we get
$$\left(\tau_{t}-\frac{\tau_{t}^{2}}{2}\right)(\alpha-\delta_{t})\leqslant 0. \tag{2.8}$$
Furthermore, using $\tau_{t}=\Lambda/\|s_{t}\|$ and (2.8), we get
$$E_{t}\leqslant-\frac{\tau_{t}^{2}}{2}\delta_{t}\|s_{t}\|^{2}=-\delta_{t}\frac{\Lambda^{2}}{2}\leqslant-\frac{\Lambda^{2}}{2}\alpha. \tag{2.9}$$
When $|v_{t}|<\omega$ and $g_{t}=0$, we have $E_{t}=\frac{\tau_{t}^{2}}{2}s_{t}^{\top}H_{t}s_{t}$. In this case (2.3b) gives $s_{t}^{\top}H_{t}s_{t}=-\delta_{t}\|s_{t}\|^{2}$, so the same argument yields
$$E_{t}\leqslant-\frac{\Lambda^{2}}{2}\,\alpha. \tag{2.10}$$
When $\omega\leqslant|v_{t}|\leqslant\sqrt{1/(1+\Lambda^{2})}$, we have $s_{t}=u_{t}/v_{t}$. Substituting this into (2.3b) yields
$$s_{t}^{\top}H_{t}s_{t}=-g_{t}^{\top}s_{t}-\delta_{t}\|s_{t}\|^{2},\qquad g_{t}^{\top}s_{t}=\alpha-\delta_{t}\leqslant 0. \tag{2.11}$$
Since $\tau_{t}\in(0,1]$, it holds that $\tau_{t}-\tau_{t}^{2}/2\geqslant 0$, and consequently
$$\left(\tau_{t}-\frac{\tau_{t}^{2}}{2}\right)g_{t}^{\top}s_{t}\leqslant 0. \tag{2.12}$$
Using (2.11) and (2.12), we obtain
$$\begin{aligned} E_{t}&=\tau_{t}g_{t}^{\top}s_{t}+\frac{\tau_{t}^{2}}{2}s_{t}^{\top}H_{t}s_{t}=\Bigl(\tau_{t}-\frac{\tau_{t}^{2}}{2}\Bigr)g_{t}^{\top}s_{t}-\frac{\tau_{t}^{2}}{2}\delta_{t}\|s_{t}\|^{2}\\ &\leqslant-\frac{\tau_{t}^{2}}{2}\delta_{t}\|s_{t}\|^{2}\leqslant-\frac{\Lambda^{2}}{2}\alpha. \end{aligned} \tag{2.13}$$
Combining (2.6) with (2.9), (2.10) and (2.13) completes the proof.
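The key estimate of the proof, $E_{t}\leqslant-\Lambda^{2}\alpha/2$ whenever $|v_{t}|\leqslant\sqrt{1/(1+\Lambda^{2})}$, can be checked numerically. The sketch below is illustrative only: the helper name `model_decrease` and the two small instances are assumptions, chosen so that one instance falls in the branch $|v_{t}|<\omega$ and the other in the branch $\omega\leqslant|v_{t}|\leqslant\sqrt{1/(1+\Lambda^{2})}$.

```python
import numpy as np

def model_decrease(H, g, alpha, Lam, omega):
    """Return (E_t, |v_t|, tau_t) for the direction defined by (2.1)-(2.2)."""
    n = g.size
    G = np.block([[H, g[:, None]], [g[None, :], np.array([[-alpha]])]])
    w = np.linalg.eigh(G)[1][:, 0]           # eigenvector of the smallest eigenvalue
    u, v = w[:n], w[n]
    if abs(v) >= omega:
        s = u / v
    else:
        s = (1.0 if -(g @ u) >= 0 else -1.0) * u
    tau = Lam / np.linalg.norm(s)
    E = tau * (g @ s) + 0.5 * tau**2 * (s @ H @ s)
    return E, abs(v), tau

Lam, omega = 0.5, 0.25                        # Lam <= sqrt(2)/2, omega in (0, 1/2)
# Instance with strong negative curvature: lands in the branch |v_t| < omega.
E1, v1, tau1 = model_decrease(np.diag([1.0, -2.0]), np.array([0.5, 0.1]),
                              alpha=0.1, Lam=Lam, omega=omega)
# One-dimensional instance: lands in the branch omega <= |v_t| <= 1/sqrt(1+Lam^2).
E2, v2, tau2 = model_decrease(np.array([[-0.23]]), np.array([-0.123]),
                              alpha=0.05, Lam=Lam, omega=omega)
print(E1, v1, tau1)
print(E2, v2, tau2)
```

In both instances the computed $\tau_{t}$ lies in $(0,1]$ and $E_{t}$ is below $-\Lambda^{2}\alpha/2$, in line with (2.5), (2.9) and (2.13).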
We now consider the case where $|v_{t}|>\sqrt{1/(1+\Lambda^{2})}$. The subsequent lemma establishes explicit bounds for both $\|\nabla\mathcal{F}(x_{t+1})\|$ and $\nabla^{2}\mathcal{F}(x_{t+1})$.
Lemma 2.6
Suppose that Assumption 2.1 holds and that $\Lambda\leqslant\sqrt{2}/2$. For the case $|v_{t}|>\sqrt{1/(1+\Lambda^{2})}$, the following bounds hold:
$$\big\|\nabla\mathcal{F}(x_{t+1})\big\|\leqslant 2(L_{1}+\alpha)\Lambda^{3}+\frac{L_{2}}{2}\Lambda^{2}+(\varepsilon_{2}+\alpha)\Lambda+3\varepsilon_{1}, \tag{2.14a}$$
$$\nabla^{2}\mathcal{F}(x_{t+1})\succcurlyeq-\Big\{2(L_{1}+\alpha)\Lambda^{2}+\alpha+\varepsilon_{2}+L_{H}\big[(1+\kappa)\Lambda+2A\big]\Big\}I. \tag{2.14b}$$
Proof
We first examine the case gt≠0g_{t}\neq 0 to derive an upper bound for ‖∇ℱ(xt+1)‖\big\|\nabla\mathcal{F}(x_{t+1})\big\|. The analysis begins by estimating ‖gt‖\|g_{t}\|. Given the condition |vt|>1/(1+Λ2)|v_{t}|>\sqrt{1/(1+\Lambda^{2})}, (2.3c) implies
| ‖st‖=‖utvt‖=1−|vt|2|vt|<Λ.\|s_{t}\|=\Big\|\frac{u_{t}}{v_{t}}\Big\|=\frac{\sqrt{1-|v_{t}|^{2}}}{|v_{t}|}<\Lambda. | (2.15) | | ----------------------------------------------------------------------------------------------------------------------------------- | ------ |
Combining (2.3b) with the upper bound (2.15) yields
$$
H_{t}s_{t}+g_{t}=-\delta_{t}s_{t},
\tag{2.16a}
$$
$$
\delta_{t}-\alpha=-g_{t}^{\top}s_{t}\leqslant\|g_{t}\|\|s_{t}\|\leqslant\Lambda\|g_{t}\|.
\tag{2.16b}
$$
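The relations (2.15), (2.16a) and (2.16b) can be checked numerically on a small instance. The following is a minimal sketch, not the authors' implementation: it assumes the homogenized matrix has the block form $[H_t, g_t; g_t^{\top}, -\alpha]$ (as in Step 3 of Algorithm 2), uses a dense eigensolver, and the helper name `homogenized_eigenpair` and all test data are hypothetical.

```python
import numpy as np

def homogenized_eigenpair(H, g, alpha):
    """Return (delta, u, v), where -delta is the smallest eigenvalue of
    G = [[H, g], [g^T, -alpha]] and [u; v] is a unit eigenvector for it."""
    n = g.size
    G = np.block([[H, g[:, None]], [g[None, :], np.array([[-alpha]])]])
    eigvals, eigvecs = np.linalg.eigh(G)  # eigenvalues in ascending order
    return -eigvals[0], eigvecs[:n, 0], eigvecs[n, 0]

# Hypothetical instance: positive definite H and a small gradient g,
# which places the eigenvector in the regime |v| > sqrt(1 / (1 + Lam^2)).
rng = np.random.default_rng(0)
n, alpha, Lam = 5, 0.1, 0.5
A = rng.standard_normal((n, n))
H = A @ A.T + np.eye(n)
g = rng.standard_normal(n)
g *= 0.05 / np.linalg.norm(g)

delta, u, v = homogenized_eigenpair(H, g, alpha)
s = u / v  # unaffected by the sign ambiguity of the eigenvector

in_large_v_case = abs(v) > np.sqrt(1.0 / (1.0 + Lam**2))
res_216a = np.linalg.norm(H @ s + g + delta * s)                # residual of (2.16a)
res_216b = abs((delta - alpha) + g @ s)                         # equality in (2.16b)
res_215 = abs(np.linalg.norm(s) - np.sqrt(1 - v**2) / abs(v))   # identity in (2.15)
print(in_large_v_case, res_216a, res_216b, res_215)
```

Both identities follow from reading the eigenvalue equation block by block and dividing by $v_t$, which is why the residuals vanish up to rounding.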
Define the quadratic function
| h(m):=m2+(gt⊤Htgt‖gt‖2+α)m−‖gt‖2.h(m):=m^{2}+\left(\frac{g_{t}^{\top}H_{t}g_{t}}{\|g_{t}\|^{2}}+\alpha\right)m-\|g_{t}\|^{2}. | | ------------------------------------------------------------------------------------------------------------------------------------------------ |
The equation h(m)=0h(m)=0 has two real roots of opposite signs; denote its positive root by m2m_{2}. We now show that h(δt−α)⩾0h\left(\delta_{t}-\alpha\right)\geqslant 0. To this end, consider the matrix
| Q(k):=[Ht+(k+α)Igtgt⊤k].Q(k):=\begin{bmatrix}H_{t}+(k+\alpha)I&g_{t}\\ g_{t}^{\top}&k\end{bmatrix}. |
|---|
From the optimality condition, we have Q(δt−α)≽0Q(\delta_{t}-\alpha)\succcurlyeq 0 and δt−α>0\delta_{t}-\alpha>0. Applying the Schur complement with respect to the scalar block k=δt−αk=\delta_{t}-\alpha gives
| Ht+δtI−1δt−αgtgt⊤≽0.H_{t}+\delta_{t}I-\frac{1}{\delta_{t}-\alpha}g_{t}g_{t}^{\top}\succcurlyeq 0. |
|---|
Premultiplying and postmultiplying the above inequality by the unit vector gt/‖gt‖g_{t}/\|g_{t}\| yields
| gt⊤Htgt‖gt‖2+δt−‖gt‖2δt−α⩾0.\frac{g_{t}^{\top}H_{t}g_{t}}{\|g_{t}\|^{2}}+\delta_{t}-\frac{\|g_{t}\|^{2}}{\delta_{t}-\alpha}\geqslant 0. | | ------------------------------------------------------------------------------------------------------------------------------------------------------------ |
Because δt−α>0\delta_{t}-\alpha>0, multiplying both sides by δt−α\delta_{t}-\alpha leads to
| (δt−α)(gt⊤Htgt‖gt‖2+(δt−α)+α)−‖gt‖2=(δt−α)2+(gt⊤Htgt‖gt‖2+α)(δt−α)−‖gt‖2⩾0,(\delta_{t}-\alpha)\!\left(\frac{g_{t}^{\top}H_{t}g_{t}}{\|g_{t}\|^{2}}+(\delta_{t}-\alpha)+\alpha\right)-\|g_{t}\|^{2}=(\delta_{t}-\alpha)^{2}+\left(\frac{g_{t}^{\top}H_{t}g_{t}}{\|g_{t}\|^{2}}+\alpha\right)(\delta_{t}-\alpha)-\|g_{t}\|^{2}\geqslant 0, | | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
Hence δt−α⩾m2\delta_{t}-\alpha\geqslant m_{2} (since m2m_{2} is the positive root of h(m)=0h(m)=0). Combining this with the bound from (2.16b), we obtain
| h(Λ‖gt‖)=Λ2‖gt‖2+(gt⊤Htgt‖gt‖2+α)Λ‖gt‖−‖gt‖2⩾0.h(\Lambda\|g_{t}\|)=\Lambda^{2}\|g_{t}\|^{2}+\left(\frac{g_{t}^{\top}H_{t}g_{t}}{\|g_{t}\|^{2}}+\alpha\right)\Lambda\|g_{t}\|-\|g_{t}\|^{2}\geqslant 0. | | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
Rearranging the inequality and using the condition Ht≼L1IH_{t}\preccurlyeq L_{1}I (i.e. gt⊤Htgt/‖gt‖2⩽L1{g_{t}^{\top}H_{t}g_{t}}/{\|g_{t}\|^{2}}\leqslant L_{1}) together with Λ⩽2/2\Lambda\leqslant\sqrt{2}/2, we obtain
| ‖gt‖⩽(gt⊤Htgt‖gt‖2+α)Λ1−Λ2⩽(L1+α)Λ1−Λ2⩽2(L1+α)Λ.\|g_{t}\|\leqslant\frac{\left(\frac{g_{t}^{\top}H_{t}g_{t}}{\|g_{t}\|^{2}}+\alpha\right)\Lambda}{1-\Lambda^{2}}\leqslant\frac{(L_{1}+\alpha)\Lambda}{1-\Lambda^{2}}\leqslant 2(L_{1}+\alpha)\Lambda. | (2.17) | | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------ |
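The argument leading to (2.17) can be sanity-checked on a small example. This is an illustrative sketch under stated assumptions: the data are hypothetical, $L_1$ is taken as the largest eigenvalue of the sample $H_t$, and the instance is chosen so that $\|s_t\|<\Lambda$ holds.

```python
import numpy as np

rng = np.random.default_rng(1)
n, alpha, Lam = 5, 0.1, 0.5
A = rng.standard_normal((n, n))
H = A @ A.T + np.eye(n)            # hypothetical H_t
g = rng.standard_normal(n)
g *= 0.05 / np.linalg.norm(g)      # hypothetical g_t with small norm
g_norm = np.linalg.norm(g)

# Smallest eigenvalue -delta of the homogenized matrix, and k = delta - alpha.
G = np.block([[H, g[:, None]], [g[None, :], np.array([[-alpha]])]])
delta = -np.linalg.eigvalsh(G)[0]
k = delta - alpha

# Q(k) equals G + delta * I, hence it is positive semidefinite.
Q = np.block([[H + (k + alpha) * np.eye(n), g[:, None]],
              [g[None, :], np.array([[k]])]])
min_eig_Q = np.linalg.eigvalsh(Q)[0]

# Schur complement of Q(k) with respect to the scalar block k.
schur = H + delta * np.eye(n) - np.outer(g, g) / k
min_eig_schur = np.linalg.eigvalsh(schur)[0]

# Quadratic h(m) and its positive root m2.
c = g @ H @ g / g_norm**2 + alpha

def h(m):
    return m**2 + c * m - g_norm**2

m2 = (-c + np.sqrt(c**2 + 4 * g_norm**2)) / 2

# The two upper bounds on ||g_t|| in (2.17).
L1 = np.linalg.eigvalsh(H)[-1]
bound_mid = c * Lam / (1 - Lam**2)
bound_final = 2 * (L1 + alpha) * Lam
print(k, min_eig_Q, min_eig_schur, h(k), m2, bound_mid, bound_final)
```

Because $Q(\delta_t-\alpha)$ is singular, its Schur complement has a zero eigenvalue, so `min_eig_schur` is expected to be zero only up to rounding.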
From (2.15) and (2.16b), it follows that
| δt‖st‖=(α+(δt−α))‖st‖⩽(α+Λ‖gt‖)‖st‖⩽αΛ+Λ2‖gt‖.\delta_{t}\|s_{t}\|=\bigl(\alpha+(\delta_{t}-\alpha)\bigr)\|s_{t}\|\leqslant(\alpha+\Lambda\|g_{t}\|)\|s_{t}\|\leqslant\alpha\Lambda+\Lambda^{2}\|g_{t}\|. | | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
Combining this with (2.16a), we obtain
| ‖Htst+gt‖=δt‖st‖⩽αΛ+Λ2‖gt‖.\|H_{t}s_{t}+g_{t}\|=\delta_{t}\|s_{t}\|\leqslant\alpha\Lambda+\Lambda^{2}\|g_{t}\|. | | -------------------------------------------------------------------------------------------------------------------------------------- |
To bound ‖gt+1‖\|g_{t+1}\|, we note that
$$
\begin{aligned}
\|g_{t+1}\| &\leqslant\|g_{t+1}-H_{t}s_{t}-g_{t}\|+\|H_{t}s_{t}+g_{t}\| \\
&\leqslant\|g_{t+1}-H_{t}s_{t}-g_{t}\|+\alpha\Lambda+\|g_{t}\|\Lambda^{2} \\
&\leqslant\|g_{t+1}-H_{t}s_{t}-g_{t}\|+\alpha\Lambda+2(L_{1}+\alpha)\Lambda^{3},
\end{aligned}
$$
where the last line uses inequality (2.17). Furthermore, using (2.15) together with Lemmas 2.2 and 2.3, we have
$$
\begin{aligned}
\left\|g_{t+1}-H_{t}s_{t}-g_{t}\right\|\leqslant{}&\left\|\nabla\mathcal{F}(x_{t+1})-\nabla^{2}\mathcal{F}(x_{t})s_{t}-\nabla\mathcal{F}(x_{t})\right\|+\left\|g_{t+1}-\nabla\mathcal{F}(x_{t+1})\right\| \\
&+\left\|\nabla^{2}\mathcal{F}(x_{t})-H_{t}\right\|\left\|s_{t}\right\|+\left\|\nabla\mathcal{F}(x_{t})-g_{t}\right\| \\
\leqslant{}&\frac{L_{2}}{2}\left\|s_{t}\right\|^{2}+\varepsilon_{2}\left\|s_{t}\right\|+2\varepsilon_{1} \\
\leqslant{}&\frac{L_{2}}{2}\Lambda^{2}+\varepsilon_{2}\Lambda+2\varepsilon_{1}.
\end{aligned}
\tag{2.18}
$$
Consequently,
| ‖gt+1‖⩽2(L1+α)Λ3+L22Λ2+(ε2+α)Λ+2ε1.\|g_{t+1}\|\leqslant 2(L_{1}+\alpha)\Lambda^{3}+\frac{L_{2}}{2}\Lambda^{2}+(\varepsilon_{2}+\alpha)\Lambda+2\varepsilon_{1}. | | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
Finally, applying Lemma 2.3 gives
| ‖∇ℱ(xt+1)‖⩽‖gt+1‖+ε1⩽2(L1+α)Λ3+L22Λ2+(ε2+α)Λ+3ε1.\left\|\nabla\mathcal{F}\left(x_{t+1}\right)\right\|\leqslant\left\|g_{t+1}\right\|+\varepsilon_{1}\leqslant 2(L_{1}+\alpha)\Lambda^{3}+\frac{L_{2}}{2}\Lambda^{2}+(\varepsilon_{2}+\alpha)\Lambda+3\varepsilon_{1}. | (2.19) | | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------ |
We now establish a lower bound for the Hessian ∇2ℱ(xt+1)\nabla^{2}\mathcal{F}(x_{t+1}). From (2.3a), we have Ht+δtI≽0H_{t}+\delta_{t}I\succcurlyeq 0. Combining this with (2.16b) and (2.17) yields
| Ht≽−δtI≽−(Λ‖gt‖+α)I≽−2(L1+α)Λ2I−αI.H_{t}\succcurlyeq-\delta_{t}I\succcurlyeq-(\Lambda\|g_{t}\|+\alpha)I\succcurlyeq-2(L_{1}+\alpha)\Lambda^{2}I-\alpha I. | (2.20) | | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------ |
To bound Ht+1H_{t+1}, we first use the LHL_{H}-Lipschitz continuity of H(x,y)H(x,y):
$$
\begin{aligned}
H_{t+1} &\succcurlyeq H_{t}-\|H_{t+1}-H_{t}\|I \\
&\succcurlyeq H_{t}-L_{H}\bigl(\|x_{t+1}-x_{t}\|+\|y_{t+1}-y_{t}\|\bigr)I.
\end{aligned}
$$
Since $x_{t+1}=x_{t}+s_{t}$ with $\|s_{t}\|<\Lambda$, and using the $\kappa$-Lipschitz continuity of $y^{\star}(x)$ together with the bound $\|y_{t}-y^{\star}(x_{t})\|\leqslant A$ from Lemma 2.3, we obtain
$$
\begin{aligned}
H_{t+1} &\succcurlyeq H_{t}-L_{H}\Bigl(\|s_{t}\|+\|y_{t+1}-y^{\star}(x_{t+1})\|+\|y^{\star}(x_{t+1})-y^{\star}(x_{t})\|+\|y^{\star}(x_{t})-y_{t}\|\Bigr)I \\
&\succcurlyeq H_{t}-L_{H}\bigl[(1+\kappa)\|s_{t}\|+2A\bigr]I \\
&\succcurlyeq H_{t}-L_{H}\bigl[(1+\kappa)\Lambda+2A\bigr]I.
\end{aligned}
\tag{2.21}
$$
Substituting the lower bound for $H_{t}$ from (2.20) into (2.21) gives
| Ht+1≽−2(L1+α)Λ2I−αI−LH[(1+κ)Λ+2A]I.H_{t+1}\succcurlyeq-2(L_{1}+\alpha)\Lambda^{2}I-\alpha I-L_{H}\left[(1+\kappa)\Lambda+2A\right]I. | (2.22) |
|---|
Finally, applying the Hessian approximation error from Lemma 2.3, we obtain the desired lower bound for the exact Hessian:
| ∇2ℱ(xt+1)\displaystyle\nabla^{2}\mathcal{F}(x_{t+1}) | ≽Ht+1−ε2I\displaystyle\succcurlyeq H_{t+1}-\varepsilon_{2}I |
|---|---|
| ≽−(2(L1+α)Λ2+α+ε2+LH[(1+κ)Λ+2A])I.\displaystyle\succcurlyeq-\Bigl(2(L_{1}+\alpha)\Lambda^{2}+\alpha+\varepsilon_{2}+L_{H}\bigl[(1+\kappa)\Lambda+2A\bigr]\Bigr)I. | (2.23) |
We now consider the case gt=0g_{t}=0. The argument for establishing an upper bound on ‖∇ℱ(xt+1)‖\big\|\nabla\mathcal{F}(x_{t+1})\big\| proceeds analogously to the case gt≠0g_{t}\neq 0 with the simplification that gt=0g_{t}=0. Therefore, we obtain
| ‖gt+1‖⩽L22Λ2+(ε2+α)Λ+2ε1.\|g_{t+1}\|\leqslant\frac{L_{2}}{2}\Lambda^{2}+(\varepsilon_{2}+\alpha)\Lambda+2\varepsilon_{1}. | | ------------------------------------------------------------------------------------------------------------------------------------------ |
Applying Lemma 2.3 then yields
| ‖∇ℱ(xt+1)‖⩽‖gt+1‖+ε1⩽L22Λ2+(ε2+α)Λ+3ε1.\left\|\nabla\mathcal{F}\left(x_{t+1}\right)\right\|\leqslant\left\|g_{t+1}\right\|+\varepsilon_{1}\leqslant\frac{L_{2}}{2}\Lambda^{2}+(\varepsilon_{2}+\alpha)\Lambda+3\varepsilon_{1}. | (2.24) | | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------ |
Combining (2.19) with (2.24), we have (2.14a). The lower bound for the Hessian proceeds analogously. Under gt=0g_{t}=0, we have
| Ht+1≽−αI−LH[(1+κ)Λ+2A]I.H_{t+1}\succcurlyeq-\alpha I-L_{H}\bigl[(1+\kappa)\Lambda+2A\bigr]I. |
|---|
Consequently,
| ∇2ℱ(xt+1)≽−(α+ε2+LH[(1+κ)Λ+2A])I.\nabla^{2}\mathcal{F}(x_{t+1})\succcurlyeq-\Bigl(\alpha+\varepsilon_{2}+L_{H}\bigl[(1+\kappa)\Lambda+2A\bigr]\Bigr)I. | (2.25) |
|---|
Finally, combining (2.23) with the Hessian bound (2.25) establishes (2.14b), which completes the proof.
We are now prepared to establish the iteration complexity of the HSDA algorithm. Let ε>0\varepsilon>0 be the target accuracy. We define the first iteration at which an 𝒪(ε,ε)\mathcal{O}(\varepsilon,\sqrt{\varepsilon})-second-order stationary point is reached as
| T(ε):=min{t∣‖∇ℱ(xt+1)‖⩽c1εand∇2ℱ(xt+1)≽−c2εI},T(\varepsilon):=\min\Bigl\{t\,\Bigm|\,\|\nabla\mathcal{F}(x_{t+1})\|\leqslant c_{1}\varepsilon\ \text{and}\ \nabla^{2}\mathcal{F}(x_{t+1})\succcurlyeq-c_{2}\sqrt{\varepsilon}\,I\Bigr\}, | (2.26) | | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------ |
where c1c_{1} and c2c_{2} are positive constants independent of ε\varepsilon. In parallel, we introduce a verifiable stopping index based on the eigenvector component vtv_{t}:
| T~(ε):=min{t∣|vt|>1/(1+Λ2)}.\widetilde{T}(\varepsilon):=\min\Bigl\{t\,\Bigm|\,|v_{t}|>\sqrt{1/(1+\Lambda^{2})}\Bigr\}. | (2.27) | | ---------------------------------------------------------------------------------------------------------------------------------------- | ------ |
The following theorem shows that once $|v_{t}|>\sqrt{1/(1+\Lambda^{2})}$, the next iterate $x_{t+1}=x_{t}+s_{t}$ is already an $\mathcal{O}(\varepsilon,\sqrt{\varepsilon})$-second-order stationary point; consequently, $T(\varepsilon)\leqslant\widetilde{T}(\varepsilon)$.
Theorem 2.1
Suppose Assumption 2.1 holds, and set the parameters as $\alpha=\sqrt{L_{2}\varepsilon}$, $\varepsilon_{1}=\varepsilon/12$, $\varepsilon_{2}=\sqrt{L_{2}\varepsilon}/12$, $\Lambda=\sqrt{\varepsilon/L_{2}}$, $\omega\in(0,1/2)$, with $0<\varepsilon\leqslant\min\{L_{2}/2,\,1\}$. Then the iterate $x_{\widetilde{T}(\varepsilon)+1}$ satisfies
$$
\begin{aligned}
&\|\nabla\mathcal{F}(x_{\widetilde{T}(\varepsilon)+1})\|\leqslant\Bigl(\frac{\sqrt{2}\,L_{1}}{L_{2}}+\frac{17}{6}\Bigr)\varepsilon, \\
&\nabla^{2}\mathcal{F}(x_{\widetilde{T}(\varepsilon)+1})\succcurlyeq-\Bigl[\frac{\sqrt{2}\,L_{1}}{\sqrt{L_{2}}}+\frac{13}{6}\sqrt{L_{2}}+\frac{L_{H}(1+\kappa)}{\sqrt{L_{2}}}\Bigr]\sqrt{\varepsilon}\,I,
\end{aligned}
\tag{2.28}
$$
and furthermore,
| T(ε)⩽T~(ε)⩽ 1+24L25(ℱ(x1)−ℱinf)ε−3/2.T(\varepsilon)\ \leqslant\ \widetilde{T}(\varepsilon)\ \leqslant\ 1+\frac{24\sqrt{L_{2}}}{5}\bigl(\mathcal{F}(x_{1})-\mathcal{F}_{\inf}\bigr)\varepsilon^{-3/2}. | (2.29) |
|---|
Proof
We first prove the stationarity bounds in (2.28). By definition of $\widetilde{T}(\varepsilon)$, we have $|v_{\widetilde{T}(\varepsilon)}|>\sqrt{1/(1+\Lambda^{2})}$. Moreover, $\varepsilon\leqslant L_{2}/2$ gives $\Lambda=\sqrt{\varepsilon/L_{2}}\leqslant\sqrt{2}/2$, so Lemma 2.6 applies and yields
| ‖∇ℱ(xT~(ε)+1)‖⩽2(L1+α)Λ3+L22Λ2+(ε2+α)Λ+3ε1.\|\nabla\mathcal{F}(x_{\widetilde{T}(\varepsilon)+1})\|\leqslant 2(L_{1}+\alpha)\Lambda^{3}+\frac{L_{2}}{2}\Lambda^{2}+(\varepsilon_{2}+\alpha)\Lambda+3\varepsilon_{1}. | | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
Using the parameter values specified in Theorem 2.1, we substitute into the gradient bound to obtain
| ‖∇ℱ(xT~(ε)+1)‖\displaystyle\|\nabla\mathcal{F}(x_{\widetilde{T}(\varepsilon)+1})\| | ⩽2(L1+L2ε)(εL2)3/2+L22(εL2)+(L2ε12+L2ε)εL2+ε4\displaystyle\leqslant 2\bigl(L_{1}+\sqrt{L_{2}\varepsilon}\bigr)\Bigl(\frac{\varepsilon}{L_{2}}\Bigr)^{3/2}+\frac{L_{2}}{2}\Bigl(\frac{\varepsilon}{L_{2}}\Bigr)+\Bigl(\frac{\sqrt{L_{2}\varepsilon}}{12}+\sqrt{L_{2}\varepsilon}\Bigr)\sqrt{\frac{\varepsilon}{L_{2}}}+\frac{\varepsilon}{4} | | --------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | =2L1L23/2ε3/2+2L2ε2+116ε.\displaystyle=\frac{2L_{1}}{L_{2}^{3/2}}\varepsilon^{3/2}+\frac{2}{L_{2}}\varepsilon^{2}+\frac{11}{6}\varepsilon. | |
Since $0<\varepsilon\leqslant\min\{L_{2}/2,\,1\}$, we have $\varepsilon^{3/2}\leqslant\sqrt{L_{2}/2}\,\varepsilon$ and $2\varepsilon^{2}/L_{2}\leqslant\varepsilon$, and therefore
| ‖∇ℱ(xT~(ε)+1)‖⩽(2L1L2+176)ε.\|\nabla\mathcal{F}(x_{\widetilde{T}(\varepsilon)+1})\|\leqslant\Bigl(\frac{\sqrt{2}\,L_{1}}{L_{2}}+\frac{17}{6}\Bigr)\varepsilon. | | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
Moreover, Lemma 2.6 provides the Hessian lower bound
| ∇2ℱ(xT~(ε)+1)≽−{2(L1+α)Λ2+α+ε2+LH[(1+κ)Λ+2A]}I,\nabla^{2}\mathcal{F}(x_{\widetilde{T}(\varepsilon)+1})\succcurlyeq-\Bigl\{2(L_{1}+\alpha)\Lambda^{2}+\alpha+\varepsilon_{2}+L_{H}\bigl[(1+\kappa)\Lambda+2A\bigr]\Bigr\}I, |
|---|
with $A=\min\{\varepsilon_{1}/\ell_{1},\ \varepsilon_{2}/(2L_{H})\}$. Since $A\leqslant\varepsilon_{2}/(2L_{H})$, we have $L_{H}[(1+\kappa)\Lambda+2A]\leqslant L_{H}(1+\kappa)\Lambda+\varepsilon_{2}$. Substituting the parameter choices from Theorem 2.1 into the above bound gives
| 2(L1+α)Λ2+α+ε2+LH[(1+κ)Λ+2A]\displaystyle~2(L_{1}+\alpha)\Lambda^{2}+\alpha+\varepsilon_{2}+L_{H}\bigl[(1+\kappa)\Lambda+2A\bigr] | |
|---|---|
| ⩽\displaystyle\leqslant | 2(L1+L2ε)εL2+L2ε+2⋅L2ε12+LH(1+κ)L2ε.\displaystyle~2\bigl(L_{1}+\sqrt{L_{2}\varepsilon}\bigr)\frac{\varepsilon}{L_{2}}+\sqrt{L_{2}\varepsilon}+2\cdot\frac{\sqrt{L_{2}\varepsilon}}{12}+\frac{L_{H}(1+\kappa)}{\sqrt{L_{2}}}\sqrt{\varepsilon}. |
Since 0<ε⩽min{L2/2, 1}0<\varepsilon\leqslant\min\{L_{2}/2,\,1\}, the right-hand side of the previous inequality can be simplified, yielding
| ∇2ℱ(xT~(ε)+1)≽−[2L1L2+136L2+LH(1+κ)L2]εI,\nabla^{2}\mathcal{F}(x_{\widetilde{T}(\varepsilon)+1})\succcurlyeq-\Bigl[\frac{\sqrt{2}\,L_{1}}{\sqrt{L_{2}}}+\frac{13}{6}\sqrt{L_{2}}+\frac{L_{H}(1+\kappa)}{\sqrt{L_{2}}}\Bigr]\sqrt{\varepsilon}\,I, |
|---|
which establishes the Hessian bound in (2.28). Consequently, we have $T(\varepsilon)\leqslant\widetilde{T}(\varepsilon)$.
We now proceed to bound $\widetilde{T}(\varepsilon)$. By the definition of $\widetilde{T}(\varepsilon)$, for any $t<\widetilde{T}(\varepsilon)$ we have $|v_{t}|\leqslant\sqrt{1/(1+\Lambda^{2})}$. Hence Lemma 2.5 is applicable and gives
| ℱ(xt+1)−ℱ(xt)⩽Λε1+(ε2−α)Λ22+L26Λ3.\mathcal{F}(x_{t+1})-\mathcal{F}(x_{t})\leqslant\Lambda\varepsilon_{1}+(\varepsilon_{2}-\alpha)\frac{\Lambda^{2}}{2}+\frac{L_{2}}{6}\Lambda^{3}. |
|---|
Substituting the parameter choices from Theorem 2.1 into this inequality leads to the per‑iteration decrease
| ℱ(xt+1)−ℱ(xt)⩽−524ε3/2L2,∀t<T~(ε).\mathcal{F}(x_{t+1})-\mathcal{F}(x_{t})\leqslant-\frac{5}{24}\,\frac{\varepsilon^{3/2}}{\sqrt{L_{2}}},\qquad\forall\,t<\widetilde{T}(\varepsilon). | (2.30) |
|---|
Summing (2.30) over t=1,…,T~(ε)−1t=1,\ldots,\widetilde{T}(\varepsilon)-1 and noting that the total possible decrease in ℱ\mathcal{F} is at most ℱ(x1)−ℱinf\mathcal{F}(x_{1})-\mathcal{F}_{\inf}, we obtain
| ℱ(x1)−ℱinf⩾∑t=1T~(ε)−1(ℱ(xt)−ℱ(xt+1))⩾(T~(ε)−1)524ε3/2L2.\mathcal{F}(x_{1})-\mathcal{F}_{\inf}\geqslant\sum_{t=1}^{\widetilde{T}(\varepsilon)-1}\bigl(\mathcal{F}(x_{t})-\mathcal{F}(x_{t+1})\bigr)\geqslant(\widetilde{T}(\varepsilon)-1)\,\frac{5}{24}\,\frac{\varepsilon^{3/2}}{\sqrt{L_{2}}}. |
|---|
Rearranging gives the desired upper bound on T~(ε)\widetilde{T}(\varepsilon) in (2.29). Together with the already proved relation T(ε)⩽T~(ε)T(\varepsilon)\leqslant\widetilde{T}(\varepsilon), the proof is complete.
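The constants in Theorem 2.1 can be re-derived numerically by substituting the parameter choices into the bounds of Lemmas 2.5 and 2.6. The sketch below is only an arithmetic check; the values of $\varepsilon$, $L_1$, $L_2$ and the initial gap $\mathcal{F}(x_1)-\mathcal{F}_{\inf}$ are hypothetical.

```python
import math

def theorem_parameters(eps, L2):
    """Parameter choices of Theorem 2.1 for a target accuracy eps."""
    alpha = math.sqrt(L2 * eps)
    return {"alpha": alpha, "eps1": eps / 12, "eps2": alpha / 12,
            "Lam": math.sqrt(eps / L2)}

def gradient_bound(eps, L1, L2):
    """Right-hand side of (2.14a) under the parameters of Theorem 2.1."""
    p = theorem_parameters(eps, L2)
    return (2 * (L1 + p["alpha"]) * p["Lam"] ** 3 + L2 / 2 * p["Lam"] ** 2
            + (p["eps2"] + p["alpha"]) * p["Lam"] + 3 * p["eps1"])

def per_iteration_change(eps, L2):
    """Right-hand side of the Lemma 2.5 bound used to obtain (2.30)."""
    p = theorem_parameters(eps, L2)
    return (p["Lam"] * p["eps1"] + (p["eps2"] - p["alpha"]) * p["Lam"] ** 2 / 2
            + L2 / 6 * p["Lam"] ** 3)

def iteration_bound(eps, L2, gap):
    """Upper bound (2.29) on the stopping index, with gap = F(x_1) - F_inf."""
    return 1 + 24 * math.sqrt(L2) / 5 * gap * eps ** -1.5

# Hypothetical problem constants satisfying eps <= min(L2 / 2, 1).
eps, L1, L2, gap = 0.01, 3.0, 2.0, 1.0
closed_form = 2 * L1 / L2 ** 1.5 * eps ** 1.5 + 2 * eps ** 2 / L2 + 11 / 6 * eps
print(gradient_bound(eps, L1, L2), closed_form)
print(per_iteration_change(eps, L2), -5 / 24 * eps ** 1.5 / math.sqrt(L2))
print(iteration_bound(eps, L2, gap))
```

The decrease coefficient arises as $1/12-11/24+1/6=-5/24$, and the gradient coefficient as $1/2+13/12+1/4=11/6$ before the higher-order terms are absorbed.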
Remark 2.1
The bound (2.29) in Theorem 2.1 shows that HSDA attains an 𝒪(ε,ε)\mathcal{O}(\varepsilon,\sqrt{\varepsilon})-second-order stationary point within 𝒪(ε−3/2)\mathcal{O}(\varepsilon^{-3/2}) outer iterations. This iteration complexity matches the best known rates for second-order methods in nonconvex-strongly concave minimax optimization; see, for example, luo2022finding ; wang2025gradient .
3 Inexact Homogeneous Second-Order Descent Ascent Algorithm
The complexity analysis in Section 2 assumes that the homogenized eigenvalue subproblem (2.1) is solved exactly at every outer iteration. In practice, however, this assumption can be prohibitive: for large-scale problems, computing the smallest eigenpair of the homogenized matrix Gt(α)G_{t}(\alpha) typically requires expensive matrix factorizations or many iterations of a Krylov-type eigensolver. To overcome this limitation, we propose in this section an inexact homogeneous second-order descent ascent (IHSDA) algorithm, which solves the homogenized subproblem only approximately via a Lanczos procedure with carefully controlled residual. We prove that IHSDA retains the single‑loop structure of HSDA and achieves the same outer-iteration complexity.
Unlike the exact HSDA method, IHSDA does not solve the homogenized eigenvalue subproblem (2.1) exactly. Instead, it uses a Lanczos-based procedure comprising two main steps:
- Inexact Solution via Lanczos Iteration. The homogenized subproblem (2.1) is solved approximately using the Lanczos Method with Skewed Randomization (zhang2025homogeneous, Algorithm 4). This yields a Ritz pair $(-\zeta_{t},[\hat{u}_{t};\hat{v}_{t}])$ of $G_{t}(\alpha_{t})$ with Ritz residual $[k_{t};\varrho_{t}]$, satisfying
$$
G_{t}(\alpha_{t})\begin{bmatrix}\hat{u}_{t}\\ \hat{v}_{t}\end{bmatrix}+\zeta_{t}\begin{bmatrix}\hat{u}_{t}\\ \hat{v}_{t}\end{bmatrix}=\begin{bmatrix}k_{t}\\ \varrho_{t}\end{bmatrix},\quad|\delta_{t}-\zeta_{t}|\leqslant e_{t},\quad k_{t}^{\top}\hat{u}_{t}+\varrho_{t}\hat{v}_{t}=0,\quad\big\|[\hat{u}_{t};\hat{v}_{t}]\big\|=1,
\tag{3.1}
$$
where $-\delta_{t}:=\lambda_{1}\big(G_{t}(\alpha_{t})\big)$ is the true smallest eigenvalue and the approximation accuracy satisfies $|\delta_{t}-\zeta_{t}|\leqslant e_{t}$ for a prescribed tolerance $e_{t}$ defined later.
- Direction Generation and Safeguard. The search direction $s_{t}$ is generated from the approximate eigenvector $[\hat{u}_{t};\hat{v}_{t}]$ using the same classification rule as in the exact HSDA algorithm (see (2.2)). If $|\hat{v}_{t}|>\sqrt{1/(1+\Lambda^{2})}$ and the Lanczos residual in (3.1) is sufficiently small, the next iterate $x_{t+1}$ can be certified as an $\mathcal{O}(\varepsilon,\sqrt{\varepsilon})$-second-order stationary point; otherwise, a large residual triggers an increase of the parameter $\alpha$ and a re-computation of the Ritz pair to obtain a more accurate approximation of the smallest eigenpair (see Lemma 3.2 for details).
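To make the quantities in (3.1) concrete, the following sketch computes a Ritz pair and its residual with a plain Lanczos iteration. It is not the Lanczos Method with Skewed Randomization of the cited Algorithm 4: it uses a random start vector and full reorthogonalization, and the function name `lanczos_ritz_pair` and the test matrix are hypothetical. It assumes the block form of $G_t(\alpha_t)$ given in Step 3 of Algorithm 2.

```python
import numpy as np

def lanczos_ritz_pair(G, num_steps, rng):
    """Plain Lanczos with full reorthogonalization.

    Returns (zeta, z, residual), where -zeta is the Rayleigh quotient of the
    unit Ritz vector z and residual = G z + zeta z."""
    dim = G.shape[0]
    Q = np.zeros((dim, num_steps))
    q = rng.standard_normal(dim)
    q /= np.linalg.norm(q)
    diag, offdiag = [], []
    for j in range(num_steps):
        Q[:, j] = q
        w = G @ q
        diag.append(q @ w)
        w -= Q[:, : j + 1] @ (Q[:, : j + 1].T @ w)  # orthogonalize against all q's
        beta = np.linalg.norm(w)
        if j == num_steps - 1 or beta < 1e-12:
            break
        offdiag.append(beta)
        q = w / beta
    # Tridiagonal projection of G onto the Krylov subspace.
    T = np.diag(diag) + np.diag(offdiag, 1) + np.diag(offdiag, -1)
    _, Y = np.linalg.eigh(T)
    z = Q[:, : len(diag)] @ Y[:, 0]   # Ritz vector for the smallest Ritz value
    z /= np.linalg.norm(z)
    zeta = -(z @ G @ z)
    residual = G @ z + zeta * z
    return zeta, z, residual

# Hypothetical homogenized matrix G = [[H, g], [g^T, -alpha]].
rng = np.random.default_rng(0)
n, alpha = 30, 0.1
M = rng.standard_normal((n, n))
H = (M + M.T) / 2
g = rng.standard_normal(n)
G = np.block([[H, g[:, None]], [g[None, :], np.array([[-alpha]])]])
delta = -np.linalg.eigvalsh(G)[0]

zeta, z, residual = lanczos_ritz_pair(G, 10, rng)
u_hat, v_hat = z[:n], z[n]
k_t, rho_t = residual[:n], residual[n]
orthogonality = k_t @ u_hat + rho_t * v_hat   # should vanish, as in (3.1)
print(delta - zeta, np.linalg.norm(residual), orthogonality)
```

Taking $-\zeta_t$ as the Rayleigh quotient of the Ritz vector makes the residual orthogonal to that vector by construction, which is the third relation in (3.1); the gap $\delta_t-\zeta_t$ is nonnegative because a Rayleigh quotient never falls below the smallest eigenvalue.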
Replacing Step 3 of the HSDA algorithm with the inexact Lanczos procedure described above yields the complete IHSDA algorithm for solving problem (P), which is stated formally in Algorithm 2.
Algorithm 2 Inexact Homogeneous Second-Order Descent Ascent (IHSDA) Algorithm
Step 1: Input $x_{1}$, $y_{0}$, $\eta_{1}>0$, $\eta_{2}>0$, $L_{1}>0$, $L_{2}>0$, $B_{g}>0$, $\omega\in(1/4,1/2)$, $\{N_{t}\geqslant 1\}$, $\varepsilon>0$, $\Lambda>0$, and set $t=1$.
Step 2: Update yty_{t}:
(2a): Set i=0i=0, yit=y~it=yt−1y_{i}^{t}=\tilde{y}_{i}^{t}=y_{t-1}.
(2b): Update yity_{i}^{t} and y~it\tilde{y}_{i}^{t}:
| yi+1t\displaystyle y_{i+1}^{t} | =y~it+η1∇yf(xt,y~it),\displaystyle=\tilde{y}_{i}^{t}+\eta_{1}\nabla_{y}f\big(x_{t},\tilde{y}_{i}^{t}\big), |
|---|---|
| y~i+1t\displaystyle\tilde{y}_{i+1}^{t} | =yi+1t+η2(yi+1t−yit).\displaystyle=y_{i+1}^{t}+\eta_{2}\big(y_{i+1}^{t}-y_{i}^{t}\big). |
(2c): If i⩾Nt−1i\geqslant N_{t}-1, set yt=yNtty_{t}=y_{N_{t}}^{t} and go to Step 3; otherwise set i=i+1i=i+1 and go to Step (2b).
Step 3: Compute
| gt=∇xf(xt,yt),Ht=[∇xx2f−∇xy2f(∇yy2f)−1∇yx2f](xt,yt).g_{t}=\nabla_{x}f(x_{t},y_{t}),\qquad H_{t}=\big[\nabla^{2}_{xx}f-\nabla^{2}_{xy}f(\nabla^{2}_{yy}f)^{-1}\nabla^{2}_{yx}f\big](x_{t},y_{t}). |
|---|
Set et=L2εe_{t}=\sqrt{L_{2}\varepsilon} and αt=L2ε\alpha_{t}=\sqrt{L_{2}\varepsilon}, and Gt(αt):=[Htgtgt⊤−αt]G_{t}(\alpha_{t}):=\begin{bmatrix}H_{t}&g_{t}\\ g_{t}^{\top}&-\alpha_{t}\end{bmatrix}.
(3a) Apply (zhang2025homogeneous, Algorithm 4) to compute a Ritz pair $(-\zeta_{t},[\hat{u}_{t};\hat{v}_{t}])$ of $G_{t}(\alpha_{t})$ with Ritz residual $[k_{t};\varrho_{t}]$ satisfying
$$
G_{t}(\alpha_{t})\begin{bmatrix}\hat{u}_{t}\\ \hat{v}_{t}\end{bmatrix}+\zeta_{t}\begin{bmatrix}\hat{u}_{t}\\ \hat{v}_{t}\end{bmatrix}=\begin{bmatrix}k_{t}\\ \varrho_{t}\end{bmatrix},\quad|\delta_{t}-\zeta_{t}|\leqslant e_{t},\quad k_{t}^{\top}\hat{u}_{t}+\varrho_{t}\hat{v}_{t}=0,\quad\big\|[\hat{u}_{t};\hat{v}_{t}]\big\|=1.
\tag{3.2}
$$
If |v^t|⩽1/(1+Λ2)|\hat{v}_{t}|\leqslant\sqrt{1/(1+\Lambda^{2})}, go to Step 4;
(3b) If ‖kt‖⩽ε/2\|k_{t}\|\leqslant\varepsilon/2, set xt+1=xt+u^tv^tx_{t+1}=x_{t}+\dfrac{\hat{u}_{t}}{\hat{v}_{t}} and terminate. Otherwise, set
| αt=3L2ε+2‖gt‖Λ+(L1+ζt)Λ2,et=min{ε4,L2ε5/264(L1+αt+Bg)2},\alpha_{t}=3\sqrt{L_{2}\varepsilon}+2\|g_{t}\|\Lambda+(L_{1}+\zeta_{t})\Lambda^{2},\qquad e_{t}=\min\!\left\{\frac{\varepsilon}{4},\ \frac{\sqrt{L_{2}}\,\varepsilon^{5/2}}{64\,(L_{1}+\alpha_{t}+B_{g})^{2}}\right\}, | | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
and go to Step 3(a).
Step 4: Update the direction sts_{t}:
| st={u^tv^t,|v^t|⩾ω,sgn(−gt⊤u^t)u^t,|v^t|<ω.s_{t}=\begin{cases}\dfrac{\hat{u}_{t}}{\hat{v}_{t}},&\qquad|\hat{v}_{t}|\geqslant\omega,\\[9.0pt] \operatorname{sgn}\!\big(-g_{t}^{\top}\hat{u}_{t}\big)\hat{u}_{t},&\qquad|\hat{v}_{t}|<\omega.\end{cases} | | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
Step 5: Compute τt=Λ/‖st‖\tau_{t}=\Lambda/\|s_{t}\|, update xt+1=xt+τtstx_{t+1}=x_{t}+\tau_{t}s_{t}, set t=t+1t=t+1, and go to Step 2.
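Steps 3 to 5 of Algorithm 2 can be sketched as follows. This is a simplified illustration, not the authors' implementation: a dense exact eigensolver stands in for the Lanczos routine, so the Ritz residual is zero and the $\alpha_t$ update of Step (3b) is never triggered. The function name `ihsda_direction_step`, the tie-breaking rule for $\operatorname{sgn}(0)$, and the toy data are assumptions.

```python
import numpy as np

def ihsda_direction_step(x, g, H, alpha, Lam, omega):
    """One pass through Steps 3-5 with an exact eigensolver in place of Lanczos.

    Returns (x_next, terminated). With an exact eigenpair the residual k_t is
    zero, so the test ||k_t|| <= eps / 2 in Step (3b) always passes."""
    n = x.size
    G = np.block([[H, g[:, None]], [g[None, :], np.array([[-alpha]])]])
    _, eigvecs = np.linalg.eigh(G)          # eigenvalues in ascending order
    u_hat, v_hat = eigvecs[:n, 0], eigvecs[n, 0]

    # Step 3: a large |v_hat| means the full step u_hat / v_hat is accepted.
    if abs(v_hat) > np.sqrt(1.0 / (1.0 + Lam**2)):
        return x + u_hat / v_hat, True

    # Step 4: choose the direction according to |v_hat| versus omega.
    if abs(v_hat) >= omega:
        s = u_hat / v_hat
    else:
        sign = np.sign(-(g @ u_hat))
        if sign == 0:                        # assumption: treat sgn(0) as +1
            sign = 1.0
        s = sign * u_hat

    # Step 5: rescale so that the step has length Lam.
    tau = Lam / np.linalg.norm(s)
    return x + tau * s, False

Lam, omega, alpha = 0.5, 0.3, 0.1
x0 = np.zeros(2)

# Toy quadratic F(x) = 0.5 x^T H x at its saddle point x = 0 (gradient H x = 0).
H_saddle = np.diag([1.0, -1.0])
x_saddle_next, stopped_saddle = ihsda_direction_step(
    x0, H_saddle @ x0, H_saddle, alpha, Lam, omega)

# Positive definite curvature with a small gradient: the full step is taken.
H_pd = np.diag([2.0, 3.0])
g_small = np.array([0.01, 0.0])
x_pd_next, stopped_pd = ihsda_direction_step(x0, g_small, H_pd, alpha, Lam, omega)
print(x_saddle_next, stopped_saddle, x_pd_next, stopped_pd)
```

In the saddle example the eigenvector has $\hat{v}_t=0$, so the step follows the negative-curvature direction with length $\Lambda$ and decreases the toy objective; in the positive definite example $|\hat{v}_t|$ is close to one and the routine stops after the full step.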
In the following subsection, we prove that IHSDA finds an $\mathcal{O}(\varepsilon,\sqrt{\varepsilon})$-second-order stationary point of $\mathcal{F}(x)$ for problem (P) with an outer iteration complexity of $\mathcal{O}(\varepsilon^{-3/2})$. Furthermore, we show that, with high probability, the total number of Hessian-vector products is bounded by $\tilde{\mathcal{O}}\big(\varepsilon^{-7/4}\big)$.
3.1 Complexity Analysis
For our subsequent analysis, we adopt the following standard assumption commonly used in the complexity analysis of second-order methods Cartis2011ARC ; Royer2018ComplexityAO .
Assumption 3.1
There exists a constant Bg>0B_{g}>0, independent of tt, such that
| ‖g(xt,yt)‖⩽Bg,∀t⩾1.\|g(x_{t},y_{t})\|\leqslant B_{g},\qquad\forall\,t\geqslant 1. | | --------------------------------------------------------------------------------------------- |
We now proceed to derive a quantitative decrease bound for the value function (1.1) under the condition |v^t|⩽1/(1+Λ2)|\hat{v}_{t}|\leqslant\sqrt{1/(1+\Lambda^{2})}.
Lemma 3.1
Under Assumption 2.1, let $\omega\in(1/4,1/2)$ and $\Lambda\leqslant\sqrt{2}/2$. Then for any $\varepsilon>0$, whenever $|\hat{v}_{t}|\leqslant\sqrt{1/(1+\Lambda^{2})}$, the following decrease bound holds with probability at least $1-4p$ (where $p\in(\exp(-n),1)$):
| ℱ(xt+1)−ℱ(xt)⩽4|ϱt|−αt2Λ2+Λε1+Λ22ε2+L26Λ3.\mathcal{F}(x_{t+1})-\mathcal{F}(x_{t})\leqslant 4|\varrho_{t}|-\frac{\alpha_{t}}{2}\Lambda^{2}+\Lambda\varepsilon_{1}+\frac{\Lambda^{2}}{2}\varepsilon_{2}+\frac{L_{2}}{6}\Lambda^{3}. | | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
Proof
Let Et:=τtgt⊤st+τt22st⊤HtstE_{t}:=\tau_{t}g_{t}^{\top}s_{t}+\frac{\tau_{t}^{2}}{2}s_{t}^{\top}H_{t}s_{t}. As in the analysis of HSDA, we have τt∈(0,1]\tau_{t}\in(0,1] and
| ℱ(xt+1)−ℱ(xt)⩽Et+Λε1+Λ22ε2+L26Λ3.\mathcal{F}(x_{t+1})-\mathcal{F}(x_{t})\leqslant E_{t}+\Lambda\varepsilon_{1}+\frac{\Lambda^{2}}{2}\varepsilon_{2}+\frac{L_{2}}{6}\Lambda^{3}. | (3.3) |
|---|
Case 1: |v^t|⩽ω|\hat{v}_{t}|\leqslant\omega. From (3.1) we obtain
| u^t⊤Htu^t=kt⊤u^t−ζt‖u^t‖2−v^tgt⊤u^t,gt⊤u^t=ϱt+v^t(αt−ζt).\hat{u}_{t}^{\top}H_{t}\hat{u}_{t}=k_{t}^{\top}\hat{u}_{t}-\zeta_{t}\|\hat{u}_{t}\|^{2}-\hat{v}_{t}g_{t}^{\top}\hat{u}_{t},\quad g_{t}^{\top}\hat{u}_{t}=\varrho_{t}+\hat{v}_{t}(\alpha_{t}-\zeta_{t}). | | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
Since st=sgn(−gt⊤u^t)u^ts_{t}=\operatorname{sgn}(-g_{t}^{\top}\hat{u}_{t})\hat{u}_{t}, we have ‖st‖=‖u^t‖\|s_{t}\|=\|\hat{u}_{t}\| and with τt‖st‖=Λ\tau_{t}\|s_{t}\|=\Lambda, also τt‖u^t‖=Λ\tau_{t}\|\hat{u}_{t}\|=\Lambda. Therefore,
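$$
\begin{aligned}
E_{t} &= \tau_{t}g_{t}^{\top}s_{t}+\frac{\tau_{t}^{2}}{2}s_{t}^{\top}H_{t}s_{t} \\
&= \tau_{t}\operatorname{sgn}(-g_{t}^{\top}\hat{u}_{t})\,g_{t}^{\top}\hat{u}_{t}+\frac{1}{2}\tau_{t}^{2}\hat{u}_{t}^{\top}H_{t}\hat{u}_{t} \\
&= -\tau_{t}|g_{t}^{\top}\hat{u}_{t}|+\frac{1}{2}\tau_{t}^{2}k_{t}^{\top}\hat{u}_{t}-\frac{1}{2}\tau_{t}^{2}\hat{v}_{t}g_{t}^{\top}\hat{u}_{t}-\frac{1}{2}\tau_{t}^{2}\zeta_{t}\|\hat{u}_{t}\|^{2} \\
&\leqslant -\tau_{t}|g_{t}^{\top}\hat{u}_{t}|+\frac{1}{2}\tau_{t}^{2}k_{t}^{\top}\hat{u}_{t}+\frac{1}{2}\tau_{t}^{2}|\hat{v}_{t}|\,|g_{t}^{\top}\hat{u}_{t}|-\frac{1}{2}\tau_{t}^{2}\zeta_{t}\|\hat{u}_{t}\|^{2} \\
&= -\frac{1}{2}\tau_{t}^{2}\hat{v}_{t}\varrho_{t}-\Bigl(\tau_{t}-\frac{1}{2}\tau_{t}^{2}|\hat{v}_{t}|\Bigr)|g_{t}^{\top}\hat{u}_{t}|-\frac{\zeta_{t}}{2}\Lambda^{2}.
\end{aligned}
\tag{3.4}
$$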
Since $\tau_{t}\leqslant 1$ and $|\hat{v}_{t}|\leqslant\omega<1$, we have $\tau_{t}^{2}|\hat{v}_{t}|\leqslant\tau_{t}\leqslant 1$. Combining this with (3.3) and (3.1) yields
\[
\mathcal{F}(x_{t+1})-\mathcal{F}(x_{t})\leqslant|\varrho_{t}|-\frac{\zeta_{t}}{2}\Lambda^{2}+\Lambda\varepsilon_{1}+\frac{\Lambda^{2}}{2}\varepsilon_{2}+\frac{L_{2}}{6}\Lambda^{3}.
\tag{3.5}
\]
Case 2: $\omega\leqslant|\hat{v}_{t}|\leqslant\sqrt{1/(1+\Lambda^{2})}$. Here $s_{t}=\hat{u}_{t}/\hat{v}_{t}$. Using (3.1) we obtain
\[
s_{t}^{\top}H_{t}s_{t}+g_{t}^{\top}s_{t}=-\zeta_{t}\|s_{t}\|^{2}+\frac{k_{t}^{\top}\hat{u}_{t}}{\hat{v}^{2}_{t}},\quad g_{t}^{\top}s_{t}=-\zeta_{t}+\alpha_{t}+\frac{\varrho_{t}}{\hat{v}_{t}}.
\]
From the orthogonality relation $k_{t}^{\top}\hat{u}_{t}+\varrho_{t}\hat{v}_{t}=0$ in (3.1), it follows that
\[
\begin{aligned}
E_{t}&=\tau_{t}g_{t}^{\top}s_{t}+\frac{\tau_{t}^{2}}{2}s_{t}^{\top}H_{t}s_{t}\\
&=\tau_{t}g_{t}^{\top}s_{t}+\frac{1}{2}\tau_{t}^{2}\left(\frac{k_{t}^{\top}\hat{u}_{t}}{\hat{v}_{t}^{2}}-g_{t}^{\top}s_{t}-\zeta_{t}\|s_{t}\|^{2}\right)\\
&=\left(\tau_{t}-\frac{1}{2}\tau_{t}^{2}\right)\left(\frac{\varrho_{t}}{\hat{v}_{t}}+\alpha_{t}-\zeta_{t}\right)+\frac{\tau_{t}^{2}}{2}\left(\frac{k_{t}^{\top}\hat{u}_{t}}{\hat{v}_{t}^{2}}\right)-\frac{\zeta_{t}}{2}\Lambda^{2}\\
&=\left(\tau_{t}-\frac{1}{2}\tau_{t}^{2}\right)(\alpha_{t}-\zeta_{t})-(\tau_{t}^{2}-\tau_{t})\frac{\varrho_{t}}{\hat{v}_{t}}-\frac{\zeta_{t}}{2}\Lambda^{2}.
\end{aligned}
\]
Since $\tau_{t}\in(0,1]$ and $|\hat{v}_{t}|\geqslant\omega\geqslant 1/4$,
\[
-(\tau_{t}^{2}-\tau_{t})\frac{\varrho_{t}}{\hat{v}_{t}}\leqslant\left|\frac{\varrho_{t}}{\omega}\right|\leqslant 4|\varrho_{t}|.
\]
Hence,
\[
E_{t}\leqslant\left(\tau_{t}-\frac{1}{2}\tau_{t}^{2}\right)(\alpha_{t}-\zeta_{t})+4|\varrho_{t}|-\frac{\zeta_{t}}{2}\Lambda^{2}.
\tag{3.6}
\]
Moreover, by Theorems 5 and 6 of zhang2025homogeneous, with probability at least $1-4p$ we have
\[
\zeta_{t}\geqslant\alpha_{t}.
\tag{3.7}
\]
Substituting (3.6) and (3.7) into (3.3) gives
\[
\mathcal{F}(x_{t+1})-\mathcal{F}(x_{t})\leqslant 4|\varrho_{t}|-\frac{\zeta_{t}}{2}\Lambda^{2}+\Lambda\varepsilon_{1}+\frac{\Lambda^{2}}{2}\varepsilon_{2}+\frac{L_{2}}{6}\Lambda^{3}.
\tag{3.8}
\]
Since (3.5) provides a tighter (i.e., smaller) upper bound, we unify the analysis of both cases by adopting (3.8) as a common estimate. Using $\zeta_{t}\geqslant\alpha_{t}$ from (3.7) (which holds with the stated probability), we obtain
\[
\begin{aligned}
\mathcal{F}(x_{t+1})-\mathcal{F}(x_{t})&\leqslant 4|\varrho_{t}|-\frac{\zeta_{t}}{2}\Lambda^{2}+\Lambda\varepsilon_{1}+\frac{\Lambda^{2}}{2}\varepsilon_{2}+\frac{L_{2}}{6}\Lambda^{3}\\
&\leqslant 4|\varrho_{t}|-\frac{\alpha_{t}}{2}\Lambda^{2}+\Lambda\varepsilon_{1}+\frac{\Lambda^{2}}{2}\varepsilon_{2}+\frac{L_{2}}{6}\Lambda^{3},
\end{aligned}
\]
which completes the proof.
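As an informal sanity check (not part of the original analysis), the algebra of Case 2 can be verified numerically. The sketch below assumes that the Ritz condition (3.1) takes the form $G_{t}(\alpha_{t})[\hat{u}_{t};\hat{v}_{t}]+\zeta_{t}[\hat{u}_{t};\hat{v}_{t}]=[k_{t};\varrho_{t}]$ with $-\zeta_{t}$ the Rayleigh quotient of the unit Ritz vector; this form is consistent with the identities used above but is an assumption here, as are the random test data and the hypothetical choice of $\Lambda$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha = 6, 0.3

# Random symmetric H and gradient g (illustrative data only).
A = rng.standard_normal((n, n))
H = (A + A.T) / 2
g = rng.standard_normal(n)

# Homogenized matrix G(alpha) = [[H, g], [g^T, -alpha]] (assumed block form).
G = np.block([[H, g[:, None]], [g[None, :], np.array([[-alpha]])]])

# An arbitrary unit vector stands in for an inexact Lanczos output.
z = rng.standard_normal(n + 1)
z /= np.linalg.norm(z)
u_hat, v_hat = z[:n], z[n]

# Ritz value and residual under the assumed form of (3.1).
zeta = -(z @ G @ z)
residual = G @ z + zeta * z
k, rho = residual[:n], residual[n]

# Orthogonality relation k^T u_hat + rho * v_hat = 0.
orth = k @ u_hat + rho * v_hat

# Case 2 step: s = u_hat / v_hat, scaled so that tau * ||s|| = Lam.
s = u_hat / v_hat
Lam = 0.5 * np.linalg.norm(s)          # hypothetical radius giving tau = 0.5
tau = Lam / np.linalg.norm(s)

E_direct = tau * (g @ s) + 0.5 * tau**2 * (s @ H @ s)
E_closed = ((tau - 0.5 * tau**2) * (alpha - zeta)
            - (tau**2 - tau) * rho / v_hat
            - 0.5 * zeta * Lam**2)

print(orth, E_direct, E_closed)
```

Under these assumptions the two expressions for $E_{t}$ agree up to rounding, which is exactly the last equality in the Case 2 chain.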
We now consider the case $|\hat{v}_{t}|>\sqrt{1/(1+\Lambda^{2})}$. The following lemma shows that, under appropriate parameter choices, one of two outcomes must occur with high probability: either the next iterate is already an $\mathcal{O}(\varepsilon,\sqrt{\varepsilon})$-second-order stationary point, or, after increasing the parameter $\alpha_{t}$ and re-solving the homogenized eigenvalue subproblem (2.1), the Ritz residual becomes sufficiently small.
Lemma 3.2
Under Assumptions 2.1 and 3.1, consider the case $|\hat{v}_{t}|>\sqrt{1/(1+\Lambda^{2})}$. Set $\Lambda=\sqrt{\varepsilon/L_{2}}$, $\varepsilon_{1}=\varepsilon/12$ and $\varepsilon_{2}=\sqrt{L_{2}\varepsilon}/12$, with $\varepsilon\leqslant\min\bigl\{L_{2}^{3}/36,\,L_{2}/2,\,1\bigr\}$. Then the following hold:
- (1) If the Ritz residual satisfies $\|k_{t}\|\leqslant\varepsilon/2$, then the next iterate $x_{t+1}$ is an $\mathcal{O}(\varepsilon,\sqrt{\varepsilon})$-second-order stationary point.
- (2) Otherwise, once $\alpha_{t}$ has been increased and the subproblem (2.1) re-solved, whenever $|\hat{v}_{t}|>\sqrt{1/(1+\Lambda^{2})}$ we have $\|k_{t}\|\leqslant\varepsilon/2$ with probability at least $1-4p$.
Proof
Proof of (1). Without loss of generality, assume $\hat{v}_{t}>0$, since the sign of the approximate eigenvector $[\hat{u}_{t};\hat{v}_{t}]$ can be flipped without affecting any subsequent derivation. We show that when $\|k_{t}\|\leqslant\varepsilon/2$, we have $\lambda_{1}\big(\nabla^{2}\mathcal{F}(x_{t+1})\big)\geqslant\mathcal{O}(-\sqrt{\varepsilon})$ and $\|\nabla\mathcal{F}(x_{t+1})\|\leqslant\mathcal{O}(\varepsilon)$. From the Ritz condition (3.1) we obtain
\[
-\zeta_{t}=-\alpha_{t}\hat{v}_{t}^{2}+2\hat{v}_{t}g_{t}^{\top}\hat{u}_{t}+\hat{u}_{t}^{\top}H_{t}\hat{u}_{t}.
\tag{3.9}
\]
Using $\|\hat{u}_{t}\|^{2}+\hat{v}_{t}^{2}=1$, (3.9) can be rewritten as
\[
\begin{aligned}
(\zeta_{t}-\alpha_{t})\hat{v}_{t}^{2}&=-2\hat{v}_{t}g_{t}^{\top}\hat{u}_{t}-\Bigl(\zeta_{t}+\frac{\hat{u}_{t}^{\top}H_{t}\hat{u}_{t}}{\|\hat{u}_{t}\|^{2}}\Bigr)\|\hat{u}_{t}\|^{2}\\
&\leqslant 2\hat{v}_{t}\sqrt{1-\hat{v}_{t}^{2}}\,\|g_{t}\|-\Bigl(\zeta_{t}+\frac{\hat{u}_{t}^{\top}H_{t}\hat{u}_{t}}{\|\hat{u}_{t}\|^{2}}\Bigr)\|\hat{u}_{t}\|^{2}\\
&\leqslant 2\hat{v}_{t}\sqrt{1-\hat{v}_{t}^{2}}\,\|g_{t}\|-\bigl(\zeta_{t}+\lambda_{1}(H_{t})\bigr)(1-\hat{v}_{t}^{2}).
\end{aligned}
\tag{3.10}
\]
Moreover, using $\Lambda\geqslant\sqrt{1-\hat{v}_{t}^{2}}/\hat{v}_{t}$ and $\lambda_{1}(H_{t})\leqslant L_{1}$, (3.10) yields
\[
\zeta_{t}-\alpha_{t}\leqslant 2\Lambda\|g_{t}\|+\bigl|\lambda_{1}(H_{t})+\zeta_{t}\bigr|\Lambda^{2}\leqslant 2\Lambda\|g_{t}\|+(L_{1}+\zeta_{t})\Lambda^{2}.
\tag{3.11}
\]
Since $H_{t}+\delta_{t}I\succcurlyeq 0$ and $\delta_{t}\leqslant\zeta_{t}+e_{t}\leqslant\zeta_{t}+\alpha_{t}$, we have $\lambda_{1}(H_{t})+\delta_{t}\geqslant 0$. Combining this with (3.11) gives
\[
\lambda_{1}(H_{t})+2\alpha_{t}+2\|g_{t}\|\Lambda+(L_{1}+\zeta_{t})\Lambda^{2}\geqslant 0.
\tag{3.12}
\]
When $\|k_{t}\|\leqslant\varepsilon/2$, the argument parallels that of Lemma 2.6: we have $\|s_{t}\|=\|\hat{u}_{t}/\hat{v}_{t}\|\leqslant\Lambda$ and
\[
H_{t+1}\succcurlyeq-\big(2\alpha_{t}+2\|g_{t}\|\Lambda+(L_{1}+\zeta_{t})\Lambda^{2}+L_{H}\left[(1+\kappa)\Lambda+2A\right]\big)I.
\tag{3.13}
\]
A simple norm estimate yields
\[
\begin{aligned}
\zeta_{t}\leqslant\|G_{t}(\alpha_{t})\|&\leqslant\max_{\|[u;v]\|=1}\left\lvert\begin{bmatrix}u\\ v\end{bmatrix}^{\top}\begin{bmatrix}H_{t}&0\\ 0&-\alpha_{t}\end{bmatrix}\begin{bmatrix}u\\ v\end{bmatrix}\right\rvert+\max_{\|[u;v]\|=1}\left\lvert\begin{bmatrix}u\\ v\end{bmatrix}^{\top}\begin{bmatrix}0&g_{t}\\ g_{t}^{\top}&0\end{bmatrix}\begin{bmatrix}u\\ v\end{bmatrix}\right\rvert\\
&\leqslant\max\{L_{1},\alpha_{t}\}+\|g_{t}\|\leqslant L_{1}+\alpha_{t}+B_{g}.
\end{aligned}
\tag{3.14}
\]
Inserting (3.14) and the parameter choices into (3.13), then applying Lemma 2.3 (which controls the Hessian approximation error), we obtain after elementary simplifications
\[
\begin{aligned}
\nabla^{2}\mathcal{F}(x_{t+1})&\succcurlyeq-\big(2\alpha_{t}+2\|g_{t}\|\Lambda+(L_{1}+\zeta_{t})\Lambda^{2}+\varepsilon_{2}+L_{H}\left[(1+\kappa)\Lambda+2A\right]\big)I\\
&\succcurlyeq-\big(2\alpha_{t}+2B_{g}\Lambda+(2L_{1}+\alpha_{t}+B_{g})\Lambda^{2}+2\varepsilon_{2}+L_{H}(1+\kappa)\Lambda\big)I\\
&=-\Bigl(\Bigl(\frac{13}{6}\sqrt{L_{2}}+\frac{2B_{g}+L_{H}(1+\kappa)}{\sqrt{L_{2}}}\Bigr)\sqrt{\varepsilon}+\frac{2L_{1}+B_{g}}{L_{2}}\varepsilon+\frac{1}{\sqrt{L_{2}}}\varepsilon^{3/2}\Bigr)I\\
&\succcurlyeq-\Bigl(\Bigl(\frac{13}{6}\sqrt{L_{2}}+\frac{2B_{g}+L_{H}(1+\kappa)}{\sqrt{L_{2}}}\Bigr)\sqrt{\varepsilon}+\frac{2L_{1}+B_{g}}{\sqrt{L_{2}}}\sqrt{\varepsilon}+\sqrt{L_{2}}\sqrt{\varepsilon}\Bigr)I\\
&=-\Bigl(\frac{19}{6}\sqrt{L_{2}}+\frac{2L_{1}+3B_{g}+L_{H}(1+\kappa)}{\sqrt{L_{2}}}\Bigr)\sqrt{\varepsilon}\,I.
\end{aligned}
\]
We now bound the gradient norm $\|\nabla\mathcal{F}(x_{t+1})\|$. Using the second-order Lipschitz continuity of $\nabla^{2}\mathcal{F}$, together with (2.1) and (3.1), we have
\[
\begin{aligned}
\|g_{t+1}\|&\leqslant\|g_{t+1}-g_{t}-H_{t}s_{t}\|+\|g_{t}+H_{t}s_{t}\|\\
&=\|g_{t+1}-g_{t}-H_{t}s_{t}\|+\|k_{t}/\hat{v}_{t}-\zeta_{t}s_{t}\|\\
&\leqslant\frac{L_{2}}{2}\Lambda^{2}+\varepsilon_{2}\Lambda+2\varepsilon_{1}+\frac{\|k_{t}\|}{\omega}+|\zeta_{t}|\Lambda.
\end{aligned}
\tag{3.15}
\]
Theorem 6 of zhang2025homogeneous guarantees that with probability at least $1-4p$, $|\varrho_{t}|\leqslant\varepsilon^{2}/(16L_{2}^{2})$. Moreover, (3.1) implies the scalar identity $\zeta_{t}=\alpha_{t}+\varrho_{t}/\hat{v}_{t}-g_{t}^{\top}s_{t}$. In addition, in the regime $|\hat{v}_{t}|>\sqrt{1/(1+\Lambda^{2})}$, we have $1/|\hat{v}_{t}|<\sqrt{1+\Lambda^{2}}=\sqrt{1+\varepsilon/L_{2}}\leqslant\sqrt{2}$, where we used $0<\varepsilon\leqslant L_{2}$. Hence
\[
\Big|\frac{\varrho_{t}}{\hat{v}_{t}}\Big|\leqslant\frac{\varepsilon^{2}}{16L_{2}^{2}}\cdot\frac{1}{|\hat{v}_{t}|}\leqslant\frac{\sqrt{2}\varepsilon^{2}}{16L_{2}^{2}}.
\]
Combining this bound with the expression for $\zeta_{t}$ yields
\[
|\zeta_{t}|\leqslant|\alpha_{t}|+\Big|\frac{\varrho_{t}}{\hat{v}_{t}}\Big|+|g_{t}^{\top}s_{t}|\leqslant\sqrt{L_{2}\varepsilon}+\frac{\sqrt{2}\varepsilon^{2}}{16L_{2}^{2}}+B_{g}\Lambda.
\tag{3.16}
\]
Substituting (3.16) and the parameter values into (3.15), and again invoking Lemma 2.3 for the gradient error, we arrive at
\[
\begin{aligned}
\|\nabla\mathcal{F}(x_{t+1})\|&\leqslant\frac{L_{2}}{2}\Lambda^{2}+\varepsilon_{2}\Lambda+3\varepsilon_{1}+\frac{\|k_{t}\|}{\omega}+|\zeta_{t}|\Lambda\\
&=\frac{5}{6}\varepsilon+\frac{\|k_{t}\|}{\omega}+|\zeta_{t}|\Lambda\\
&\leqslant\frac{5}{6}\varepsilon+\frac{\varepsilon}{2\omega}+\Bigl(\sqrt{L_{2}\varepsilon}+\frac{\sqrt{2}}{16L_{2}^{2}}\varepsilon^{2}+B_{g}\Lambda\Bigr)\Lambda\\
&=\Bigl(\frac{11}{6}+\frac{1}{2\omega}+\frac{B_{g}}{L_{2}}\Bigr)\varepsilon+\frac{\sqrt{2}}{16L_{2}^{5/2}}\varepsilon^{5/2}\\
&\leqslant\Bigl(\frac{23}{6}+\frac{B_{g}}{L_{2}}+\frac{\sqrt{2}}{16L_{2}}\Bigr)\varepsilon.
\end{aligned}
\]
This shows that $x_{t+1}$ is already an $\mathcal{O}(\varepsilon,\sqrt{\varepsilon})$-second-order stationary point.
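The norm estimate (3.14) used in the proof of (1) can be illustrated numerically. The following is a minimal sketch on random data, assuming the block form $G_{t}(\alpha_{t})=[H_{t},\,g_{t};\,g_{t}^{\top},\,-\alpha_{t}]$ and an exact smallest eigenpair, so that $\zeta_{t}=-\lambda_{1}(G_{t}(\alpha_{t}))$; the helper name `homogenized_matrix` and all numerical values are hypothetical. Here $\|H_{t}\|$ plays the role of $L_{1}$ and $\|g_{t}\|$ the role of $B_{g}$.

```python
import numpy as np

def homogenized_matrix(H, g, alpha):
    """Assemble G(alpha) = [[H, g], [g^T, -alpha]] (assumed block form)."""
    n = H.shape[0]
    G = np.empty((n + 1, n + 1))
    G[:n, :n] = H
    G[:n, n] = g
    G[n, :n] = g
    G[n, n] = -alpha
    return G

rng = np.random.default_rng(1)
n, alpha = 8, 0.2
A = rng.standard_normal((n, n))
H = (A + A.T) / 2                      # random symmetric test matrix
g = rng.standard_normal(n)
G = homogenized_matrix(H, g, alpha)

eigvals = np.linalg.eigvalsh(G)        # ascending order
zeta_exact = -eigvals[0]               # zeta for an exact eigenpair
spec_norm_G = np.linalg.norm(G, 2)     # spectral norm of G

# Triangle inequality on the block-diagonal and off-diagonal parts.
bound = max(np.linalg.norm(H, 2), alpha) + np.linalg.norm(g)

print(zeta_exact, spec_norm_G, bound)
```

The printed values are ordered as in (3.14): the Ritz value is at most the spectral norm of the homogenized matrix, which is at most the sum of the two block norms.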
Proof of (2). We first show that $\lambda_{2}(G_{t}(\alpha_{t}))-\lambda_{1}(G_{t}(\alpha_{t}))\geqslant\sqrt{L_{2}\varepsilon}$. From (3.12) with the initial choice $\alpha_{t}=\sqrt{L_{2}\varepsilon}$ we have
\[
\lambda_{1}(H_{t})+2\sqrt{L_{2}\varepsilon}+2\|g_{t}\|\Lambda+(L_{1}+\zeta_{t})\Lambda^{2}\geqslant 0.
\tag{3.17}
\]
After updating $\alpha_{t}$, i.e., setting $\alpha_{t}=3\sqrt{L_{2}\varepsilon}+2\|g_{t}\|\Lambda+(L_{1}+\zeta_{t})\Lambda^{2}$, the Cauchy interlacing theorem together with (3.17) yields
\[
\lambda_{2}(G_{t}(\alpha_{t}))-\lambda_{1}(G_{t}(\alpha_{t}))\geqslant\lambda_{1}(H_{t})+\alpha_{t}\geqslant\sqrt{L_{2}\varepsilon}.
\tag{3.18}
\]
By Lemma 13 of zhang2025homogeneous, we have
\[
\|k_{t}\|\leqslant\phi_{t}e_{t}+2\big(\max\{L_{1},\alpha_{t}\}+\|g_{t}\|\big)\,\sqrt{\frac{e_{t}}{\lambda_{2}(G_{t}(\alpha_{t}))-\lambda_{1}(G_{t}(\alpha_{t}))}},
\]
where $e_{t}$ is the prescribed accuracy of the Lanczos run and $\phi_{t}\in[0,1]$. Thus, using (3.18), we have
\[
\begin{aligned}
\|k_{t}\|&\leqslant\phi_{t}e_{t}+2\big(\max\{L_{1},\alpha_{t}\}+\|g_{t}\|\big)\,\sqrt{\frac{e_{t}}{\lambda_{2}(G_{t}(\alpha_{t}))-\lambda_{1}(G_{t}(\alpha_{t}))}}\\
&\leqslant\phi_{t}e_{t}+2\bigl(L_{1}+\alpha_{t}+B_{g}\bigr)\,\sqrt{\frac{e_{t}}{\sqrt{L_{2}\varepsilon}}}\\
&\leqslant\frac{\varepsilon}{2},
\end{aligned}
\]
which completes the proof.
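The eigenvalue-gap bound (3.18) rests on two standard facts: Cauchy interlacing gives $\lambda_{2}(G_{t}(\alpha_{t}))\geqslant\lambda_{1}(H_{t})$, and the last diagonal entry gives $\lambda_{1}(G_{t}(\alpha_{t}))\leqslant-\alpha_{t}$. A short numerical illustration follows; it is only a sketch on random data, with the block form of $G_{t}(\alpha_{t})$ and the `margin` values chosen here as assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10
A = rng.standard_normal((n, n))
H = (A + A.T) / 2                      # random symmetric (indefinite) H
g = rng.standard_normal(n)
lam1_H = np.linalg.eigvalsh(H)[0]      # smallest eigenvalue of H

gaps = []
for margin in (0.1, 1.0, 5.0):
    # Choose alpha so that lambda_1(H) + alpha equals the given margin.
    alpha = -lam1_H + margin
    G = np.block([[H, g[:, None]], [g[None, :], np.array([[-alpha]])]])
    ev = np.linalg.eigvalsh(G)         # ascending eigenvalues of G(alpha)
    gap = ev[1] - ev[0]                # lambda_2(G) - lambda_1(G)
    gaps.append((margin, gap))
    print(f"margin={margin:4.1f}  gap={gap:.4f}")
```

In each row the observed gap is at least the margin $\lambda_{1}(H_{t})+\alpha_{t}$, which is why enlarging $\alpha_{t}$ in the proof forces a gap of at least $\sqrt{L_{2}\varepsilon}$.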
We are now ready to present the main complexity result of this section: a high-probability bound on the iteration complexity of the IHSDA algorithm. Let $\varepsilon>0$ be the target accuracy. Recall that $T(\varepsilon)$ denotes the first iteration at which an $\mathcal{O}(\varepsilon,\sqrt{\varepsilon})$-second-order stationary point is obtained. Additionally, we define a verifiable stopping index based on the approximate eigenvector component $\hat{v}_{t}$:
\[
\hat{T}(\varepsilon):=\min\Bigl\{t\,\Bigm|\,|\hat{v}_{t}|>\sqrt{1/(1+\Lambda^{2})}\ \text{and}\ \|k_{t}\|\leqslant\varepsilon/2\Bigr\},
\tag{3.19}
\]
where $k_{t}$ is as defined in (3.1). By Lemma 3.2, $\hat{T}(\varepsilon)$ is finite with high probability and satisfies $T(\varepsilon)\leqslant\hat{T}(\varepsilon)$.
Theorem 3.1
Under Assumptions 2.1 and 3.1, define
\[
K_{\varepsilon}:=1+6\sqrt{L_{2}}\bigl(\mathcal{F}(x_{1})-\mathcal{F}_{\inf}\bigr)\,\varepsilon^{-3/2}.
\]
Then the outer-iteration counts of IHSDA satisfy
\[
T(\varepsilon)\leqslant\hat{T}(\varepsilon)\leqslant K_{\varepsilon},
\tag{3.20}
\]
and with probability at least $(1-4p)^{2K_{\varepsilon}}$, the algorithm returns an $\mathcal{O}(\varepsilon,\sqrt{\varepsilon})$-second-order stationary point.
Proof
We first bound $\hat{T}(\varepsilon)$. Fix any outer iteration $t<\hat{T}(\varepsilon)$. By Lemma 3.2, the Ritz pair used in this iteration satisfies $|\hat{v}_{t}|\leqslant\sqrt{1/(1+\Lambda^{2})}$. Recall that
\[
\Lambda=\sqrt{\varepsilon/L_{2}},\quad\alpha_{t}=\sqrt{L_{2}\varepsilon},\quad\varepsilon_{1}=\varepsilon/12,\quad\varepsilon_{2}=\sqrt{L_{2}\varepsilon}/12,
\]
with $0<\varepsilon\leqslant\min\{L_{2}^{3}/36,\,L_{2}/2,\,1\}$. By Theorem 6 of zhang2025homogeneous, with probability at least $1-4p$, we have $|\varrho_{t}|\leqslant\varepsilon^{2}/(16L_{2}^{2})$. On this high-probability event, Lemma 3.1 yields
\[
\mathcal{F}(x_{t+1})-\mathcal{F}(x_{t})\leqslant 4|\varrho_{t}|-\frac{\alpha_{t}}{2}\Lambda^{2}+\Lambda\varepsilon_{1}+\frac{\Lambda^{2}}{2}\varepsilon_{2}+\frac{L_{2}}{6}\Lambda^{3}.
\]
If the Ritz pair is computed with a larger parameter $\alpha_{t}$, the right-hand side can only become smaller, so the bound remains valid. Substituting the parameter values gives
\[
\begin{aligned}
\mathcal{F}(x_{t+1})-\mathcal{F}(x_{t})&\leqslant\frac{\varepsilon^{2}}{4L_{2}^{2}}-\frac{1}{2}\sqrt{L_{2}\varepsilon}\,\frac{\varepsilon}{L_{2}}+\sqrt{\frac{\varepsilon}{L_{2}}}\cdot\frac{\varepsilon}{12}+\frac{\varepsilon}{2L_{2}}\cdot\frac{\sqrt{L_{2}\varepsilon}}{12}+\frac{L_{2}}{6}\Bigl(\frac{\varepsilon}{L_{2}}\Bigr)^{3/2}\\
&=-\frac{5}{24}\frac{\varepsilon^{3/2}}{\sqrt{L_{2}}}+\frac{\varepsilon^{2}}{4L_{2}^{2}}\;\leqslant\;-\frac{1}{6}\,\frac{\varepsilon^{3/2}}{\sqrt{L_{2}}},
\end{aligned}
\]
where the last inequality uses $\varepsilon\leqslant L_{2}^{3}/36$.
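The arithmetic behind this per-iteration decrease is easy to confirm numerically. The sketch below simply plugs the stated parameter values into the right-hand side of Lemma 3.1 and compares it with the closed form $-\frac{5}{24}\varepsilon^{3/2}/\sqrt{L_{2}}+\varepsilon^{2}/(4L_{2}^{2})$ and with the target $-\frac{1}{6}\varepsilon^{3/2}/\sqrt{L_{2}}$; the sample values of $L_{2}$ and $\varepsilon$ are arbitrary illustrations that satisfy $\varepsilon\leqslant\min\{L_{2}^{3}/36,L_{2}/2,1\}$.

```python
import math

def decrease_bound(eps, L2):
    """Right-hand side of Lemma 3.1 with the parameter choices of Theorem 3.1."""
    Lam = math.sqrt(eps / L2)
    alpha = math.sqrt(L2 * eps)
    eps1 = eps / 12
    eps2 = math.sqrt(L2 * eps) / 12
    rho_abs = eps**2 / (16 * L2**2)    # high-probability bound on |rho_t|
    return (4 * rho_abs - alpha / 2 * Lam**2 + Lam * eps1
            + Lam**2 / 2 * eps2 + L2 / 6 * Lam**3)

def closed_form(eps, L2):
    return -5 / 24 * eps**1.5 / math.sqrt(L2) + eps**2 / (4 * L2**2)

def target(eps, L2):
    return -eps**1.5 / (6 * math.sqrt(L2))

checks = []
for L2, eps in [(2.0, 0.2), (2.0, 1e-2), (2.0, 1e-4), (10.0, 1.0), (10.0, 0.1)]:
    assert eps <= min(L2**3 / 36, L2 / 2, 1.0)   # admissible accuracy
    checks.append((decrease_bound(eps, L2), closed_form(eps, L2), target(eps, L2)))

for bound, closed, tgt in checks:
    print(f"bound={bound:.3e}  closed form={closed:.3e}  target={tgt:.3e}")
```

For every admissible pair the bound matches the closed form and does not exceed the target, with equality approached only as $\varepsilon$ tends to $L_{2}^{3}/36$.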
Summing this per-iteration decrease over $t=1,\ldots,\hat{T}(\varepsilon)-1$ and noting that the total possible decrease of $\mathcal{F}$ is at most $\mathcal{F}(x_{1})-\mathcal{F}_{\inf}$, we obtain
\[
\mathcal{F}(x_{1})-\mathcal{F}_{\inf}\geqslant\sum_{t=1}^{\hat{T}(\varepsilon)-1}\bigl(\mathcal{F}(x_{t})-\mathcal{F}(x_{t+1})\bigr)\geqslant(\hat{T}(\varepsilon)-1)\,\frac{1}{6}\,\frac{\varepsilon^{3/2}}{\sqrt{L_{2}}},
\]
and therefore
\[
\hat{T}(\varepsilon)\leqslant 1+6\sqrt{L_{2}}\bigl(\mathcal{F}(x_{1})-\mathcal{F}_{\inf}\bigr)\varepsilon^{-3/2}.
\]
By the definition of $\hat{T}(\varepsilon)$ in (3.19) together with Lemma 3.2, we have $T(\varepsilon)\leqslant\hat{T}(\varepsilon)$. Combining this with the bound above establishes (3.20).
We next establish the high-probability statement. Lemma 3.2 states that whenever $|\hat{v}_{t}|>\sqrt{1/(1+\Lambda^{2})}$, either (i) $\|k_{t}\|\leqslant\varepsilon/2$ already holds, or (ii) after at most one additional Ritz-pair computation we obtain $\|k_{t}\|\leqslant\varepsilon/2$ with probability at least $1-4p$. Hence each outer iteration involves at most two calls to the Lanczos routine. Over at most $K_{\varepsilon}$ outer iterations, the total number of Lanczos invocations is at most $2K_{\varepsilon}$. Consequently, the probability that every Lanczos call succeeds is at least $(1-4p)^{2K_{\varepsilon}}$. On this event, the definition of $\hat{T}(\varepsilon)$ guarantees that at iteration $t=\hat{T}(\varepsilon)$ we have $|\hat{v}_{t}|>\sqrt{1/(1+\Lambda^{2})}$ and $\|k_{t}\|\leqslant\varepsilon/2$. Lemma 3.2 then implies that the next iterate is an $\mathcal{O}(\varepsilon,\sqrt{\varepsilon})$-second-order stationary point, which completes the proof.
Remark 3.1
We remark that, by Bernoulli's inequality, $(1-4p)^{2K_{\varepsilon}}\geqslant 1-8K_{\varepsilon}p$ whenever $p<1/4$. Since $p\in(\exp(-n),1)$, this condition can be satisfied by taking $n$ sufficiently large, for instance under the mild requirement $n\geqslant\mathcal{O}(-\log\varepsilon)$. Consequently, the guarantee "with probability at least $(1-4p)^{2K_{\varepsilon}}$" stated in the theorem can be replaced by the simpler, slightly weaker statement "with probability at least $1-8K_{\varepsilon}p$". Moreover, combining the iteration bound (3.20) with the per-call complexity estimates of (zhang2025homogeneous, Theorem 6 and Lemma 12) yields the following bound on the total number of Hessian-vector products:
\[
\widetilde{\mathcal{O}}\Bigl(L_{2}^{1/4}\bigl(\mathcal{F}(x_{1})-\mathcal{F}_{\inf}\bigr)\varepsilon^{-7/4}\sqrt{\max\{L_{1},\alpha_{t}\}+B_{g}}\Bigr),
\]
where $\widetilde{\mathcal{O}}(\cdot)$ hides logarithmic factors in $n$, $p^{-1}$, and $\varepsilon^{-1}$.
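To give a feel for the quantities in Theorem 3.1 and Remark 3.1, the following sketch evaluates $K_{\varepsilon}$ and compares the exact success probability $(1-4p)^{2K_{\varepsilon}}$ with its Bernoulli lower bound $1-8K_{\varepsilon}p$. The function names and the values of $L_{2}$, $\mathcal{F}(x_{1})-\mathcal{F}_{\inf}$, $\varepsilon$ and $p$ are hypothetical and chosen only for illustration.

```python
import math

def outer_iteration_bound(eps, L2, initial_gap):
    """K_eps = 1 + 6 * sqrt(L2) * (F(x_1) - F_inf) * eps^(-3/2)."""
    return 1 + 6 * math.sqrt(L2) * initial_gap * eps ** (-1.5)

def success_lower_bounds(p, K):
    """Exact product bound and its Bernoulli relaxation (valid for p < 1/4)."""
    exact = (1 - 4 * p) ** (2 * K)
    bernoulli = 1 - 8 * K * p
    return exact, bernoulli

# Hypothetical problem constants, for illustration only.
L2, initial_gap, eps, p = 1.0, 1.0, 1e-2, 1e-6

K = outer_iteration_bound(eps, L2, initial_gap)
exact, bernoulli = success_lower_bounds(p, K)

print(f"K_eps              = {K:.1f}")
print(f"(1-4p)^(2K)        = {exact:.6f}")
print(f"1 - 8 K p          = {bernoulli:.6f}")
```

The Bernoulli form is slightly smaller than the exact product but shows directly that $p$ must be of order $1/K_{\varepsilon}$, that is $\mathcal{O}(\varepsilon^{3/2})$, for the guarantee to be meaningful.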
We now compare the computational effort required to solve the inner subproblems in gradient-norm-regularized second-order methods and in IHSDA, highlighting the structural advantages of the homogeneous formulation.
Algorithms such as IGRTR and ILMNegCur wang2025gradient for minimizing $\mathcal{F}(x)$ require solving a regularized Newton system
\[
\bigl(H_{t}+\varepsilon_{N}I\bigr)d_{t}=-g_{t},
\tag{3.21}
\]
where the perturbation parameter $\varepsilon_{N}$ is chosen on the order of $\|g_{t}\|^{1/2}$. Computing an $\varepsilon$-accurate solution of (3.21) takes at most
\[
\mathcal{O}\left(\sqrt{\kappa\bigl(H_{t}+\varepsilon_{N}I\bigr)}\,\log\frac{1}{\varepsilon}\right),\qquad\kappa\bigl(H_{t}+\varepsilon_{N}I\bigr):=\frac{\lambda_{\max}(H_{t})+\varepsilon_{N}}{\lambda_{1}(H_{t})+\varepsilon_{N}},
\]
iterations, and the spectral condition number $\kappa\bigl(H_{t}+\varepsilon_{N}I\bigr)$ can become arbitrarily large when $\varepsilon_{N}\to 0$.
In contrast, IHSDA replaces the linear system (3.21) with the homogenized eigenvalue subproblem defined by the matrix $G_{t}(\alpha_{t})$ from (2.1). The Lanczos method applied to this subproblem requires at most
\[
\mathcal{O}\left(\sqrt{\kappa_{L}\bigl(G_{t}(\alpha_{t})\bigr)}\,\log\frac{1}{\varepsilon}\right),\quad\kappa_{L}\bigl(G_{t}(\alpha_{t})\bigr):=\frac{\lambda_{\max}\bigl(G_{t}(\alpha_{t})\bigr)-\lambda_{1}\bigl(G_{t}(\alpha_{t})\bigr)}{\lambda_{2}\bigl(G_{t}(\alpha_{t})\bigr)-\lambda_{1}\bigl(G_{t}(\alpha_{t})\bigr)},
\]
iterations to deliver an approximate smallest eigenpair with accuracy $\varepsilon$. A key advantage of the homogeneous approach is that the Lanczos condition number $\kappa_{L}\bigl(G_{t}(\alpha_{t})\bigr)$ is always bounded. Indeed, for any $\alpha_{t}>0$, it follows from (He2025HomogeneousSD, Theorem 2.1) that
\[
\kappa_{L}\bigl(G_{t}(\alpha_{t})\bigr)\leqslant\frac{2\bigl(\lambda_{\max}(H_{t})-\alpha_{t}-\lambda_{1}\bigl(G_{t}(\alpha_{t})\bigr)\bigr)}{-\lambda_{\max}(H_{t})+\alpha_{t}+\sqrt{\bigl(\lambda_{\max}(H_{t})+\alpha_{t}\bigr)^{2}+\|g_{t}\|^{2}/n}}<\infty.
\tag{3.22}
\]
Thus, unlike the condition number of the regularized Newton system (3.21), which can blow up as $\varepsilon_{N}\to 0$, the homogenized subproblem remains well-conditioned for any fixed $\alpha_{t}>0$.
To quantify the improvement, consider the degenerate case $\lambda_{1}(H_{t})=0$. Then (He2025HomogeneousSD, Theorem 2.1) also implies
\[
\frac{\kappa_{L}\bigl(G_{t}(\alpha_{t})\bigr)}{\kappa\bigl(H_{t}+\varepsilon_{N}I\bigr)}\leqslant\mathcal{O}\left(\frac{\varepsilon_{N}}{\|g_{t}\|^{2}/\bigl(\lambda_{\max}(H_{t})+\alpha_{t}\bigr)+\alpha_{t}}\right).
\tag{3.23}
\]
The ratio (3.23) directly compares the conditioning of the two inner subproblems. In particular, if $\varepsilon_{N}=\alpha_{t}\to 0$ while $\|g_{t}\|$ stays bounded away from zero, then the right-hand side of (3.23) converges to zero, indicating that the homogenized subproblem can be much better conditioned in this regime. In contrast, when $\|g_{t}\|\to 0$ and $\varepsilon_{N}$ and $\alpha_{t}$ are kept at the same scale, the denominator in (3.23) is dominated by $\alpha_{t}$ and the ratio remains of constant order, so the two condition numbers are comparable. This behavior aligns with the practical choice $\varepsilon_{N}=\|g_{t}\|^{1/2}$ adopted in gradient-norm-regularized methods Doikov2024GradientRN; He2025HomogeneousSD; Mishchenko2023RegularizedNM, yet the homogeneous formulation guarantees a bounded condition number even when the gradient is very small, a regime where Newton-type systems often become ill-conditioned.
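The contrast between the two condition numbers in the degenerate case $\lambda_{1}(H_{t})=0$ can be seen on a small example. The sketch below is purely illustrative: the diagonal test matrix, the all-ones gradient, the function names and the choice $\varepsilon_{N}=\alpha_{t}$ are assumptions made here, and the block form of $G_{t}(\alpha_{t})$ is taken to be $[H_{t},\,g_{t};\,g_{t}^{\top},\,-\alpha_{t}]$.

```python
import numpy as np

def kappa_newton(H, eps_N):
    """Spectral condition number of the regularized Newton matrix H + eps_N * I."""
    ev = np.linalg.eigvalsh(H)
    return (ev[-1] + eps_N) / (ev[0] + eps_N)

def kappa_lanczos(H, g, alpha):
    """Lanczos condition number (lam_max - lam_1) / (lam_2 - lam_1) of G(alpha)."""
    G = np.block([[H, g[:, None]], [g[None, :], np.array([[-alpha]])]])
    ev = np.linalg.eigvalsh(G)
    return (ev[-1] - ev[0]) / (ev[1] - ev[0])

# Degenerate test case: lambda_1(H) = 0, gradient bounded away from zero.
H = np.diag([0.0, 1.0, 2.0, 3.0, 4.0])
g = np.ones(5)

rows = []
for eps in (1e-2, 1e-4, 1e-6):
    # Same scale for both regularizers: eps_N = alpha = eps.
    rows.append((eps, kappa_newton(H, eps), kappa_lanczos(H, g, eps)))

for eps, k_newton, k_lanczos in rows:
    print(f"eps={eps:.0e}  kappa(H+eps I)={k_newton:.3e}  kappa_L(G)={k_lanczos:.3f}")
```

As the regularization shrinks, the Newton condition number grows like $1/\varepsilon_{N}$ while the Lanczos condition number stays of moderate size, in line with the first regime discussed after (3.23).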
4 Numerical Results
In this section, we conduct numerical experiments to demonstrate the practical performance of the proposed HSDA algorithm and its inexact variant IHSDA. We compare them with several existing methods: Gradient Descent Ascent (GDA), the IMCN algorithm luo2022finding, the MINIMAX-TRACE algorithm yao2024two, and the IGRTR algorithm wang2025gradient. The experiments are performed on two representative problem classes: a low-dimensional synthetic nonconvex-strongly concave minimax problem, and an adversarial training task on the MNIST dataset. All code is implemented in Python 3.11 and executed on a laptop equipped with an Apple M1 processor and 16 GB of memory.
4.1 A synthetic nonconvex-strongly concave minimax problem
We begin with a low-dimensional synthetic nonconvex-strongly concave minimax problem introduced in luo2022finding:
\[
\min_{x\in\mathbb{R}^{3}}\max_{y\in\mathbb{R}^{2}}f(x,y)=w(x_{3})-\frac{y_{1}^{2}}{40}+x_{1}y_{1}-\frac{5y_{2}^{2}}{2}+x_{2}y_{2},
\tag{4.1}
\]
where $x=[x_{1},x_{2},x_{3}]^{\top}$ and $y=[y_{1},y_{2}]^{\top}$.
The scalar function w(⋅)w(\cdot) is a nonconvex, W-shaped piecewise cubic function defined by a slope parameter ε>0\varepsilon>0 and a length parameter L>1L>1:
$$w(x)=\begin{cases}
\sqrt{\varepsilon}\bigl(x+(L+1)\sqrt{\varepsilon}\bigr)^{2}-\dfrac{1}{3}\bigl(x+(L+1)\sqrt{\varepsilon}\bigr)^{3}-c_{\varepsilon}, & x\leqslant -L\sqrt{\varepsilon},\\
\varepsilon x+\dfrac{\varepsilon^{3/2}}{3}, & -L\sqrt{\varepsilon}<x\leqslant-\sqrt{\varepsilon},\\
-\sqrt{\varepsilon}x^{2}-\dfrac{x^{3}}{3}, & -\sqrt{\varepsilon}<x\leqslant 0,\\
-\sqrt{\varepsilon}x^{2}+\dfrac{x^{3}}{3}, & 0<x\leqslant\sqrt{\varepsilon},\\
-\varepsilon x+\dfrac{\varepsilon^{3/2}}{3}, & \sqrt{\varepsilon}<x\leqslant L\sqrt{\varepsilon},\\
\sqrt{\varepsilon}\bigl(x-(L+1)\sqrt{\varepsilon}\bigr)^{2}+\dfrac{1}{3}\bigl(x-(L+1)\sqrt{\varepsilon}\bigr)^{3}-c_{\varepsilon}, & L\sqrt{\varepsilon}\leqslant x,
\end{cases}$$
with $c_{\varepsilon}:=\frac{1}{3}(3L+1)\varepsilon^{3/2}$.
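As a concrete check of the piecewise definition, the following Python sketch evaluates $w$ with the values $\varepsilon=0.01$ and $L=5$ reported for the experiment as defaults, and measures the jump of $w$ across each breakpoint. The names `w` and `max_jump` are illustrative and are not taken from the authors' code.

```python
import math

def w(x, eps=0.01, L=5.0):
    """W-shaped piecewise cubic, transcribed branch by branch."""
    s = math.sqrt(eps)                      # sqrt(eps)
    c = (3.0 * L + 1.0) * eps**1.5 / 3.0    # c_eps
    if x <= -L * s:
        u = x + (L + 1.0) * s
        return s * u**2 - u**3 / 3.0 - c
    if x <= -s:
        return eps * x + eps**1.5 / 3.0
    if x <= 0.0:
        return -s * x**2 - x**3 / 3.0
    if x <= s:
        return -s * x**2 + x**3 / 3.0
    if x <= L * s:
        return -eps * x + eps**1.5 / 3.0
    u = x - (L + 1.0) * s
    return s * u**2 + u**3 / 3.0 - c

def max_jump(eps=0.01, L=5.0, h=1e-9):
    """Largest difference of w across the five breakpoints (step h on each side)."""
    s = math.sqrt(eps)
    breakpoints = (-L * s, -s, 0.0, s, L * s)
    return max(abs(w(b + h, eps, L) - w(b - h, eps, L)) for b in breakpoints)

print(max_jump())  # a tiny value, consistent with w being continuous
```

With these defaults the two minima of $w$ sit at $x=\pm(L+1)\sqrt{\varepsilon}=\pm 0.6$, where $w=-c_{\varepsilon}$, and the origin is a local maximum with $w(0)=0$.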
In the experiment, we set $\varepsilon=0.01$, $\mu=0.05$, and $L=5$. Two different initial points are used:
$$(x_{1},y_{1})=([0.1,0.1,0.1]^{\top},[0,0]^{\top}),\qquad(x_{2},y_{2})=([1.0,0.1,0.1]^{\top},[0,0]^{\top}).$$
The first point $(x_{1},y_{1})$ lies near the strict saddle point $([0,0,0]^{\top},[0,0]^{\top})$ of (4.1), while the second point $(x_{2},y_{2})$ is intentionally chosen farther from this saddle to examine the algorithms' global behavior. Figure 1 and Figure 2 display the performance of the five algorithms on this problem. The horizontal axis records the iteration index $t$; the left and right vertical axes show the optimality gap $\mathcal{F}(x_{t})-\mathcal{F}^{\star}$ and the gradient norm $\|\nabla\mathcal{F}(x_{t})\|_{2}$, respectively.
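The strict-saddle claim can be verified by hand. Assuming $\mathcal{F}$ denotes the value function $\max_{y}f(x,y)$, the inner maximization of (4.1) has the closed-form solution $y^{\star}(x)=(20x_{1},\,x_{2}/5)$, so that $\mathcal{F}(x)=w(x_{3})+10x_{1}^{2}+x_{2}^{2}/10$. At the origin the gradient vanishes, while the curvature along $x_{3}$ equals $w''(0)=-2\sqrt{\varepsilon}<0$. The Python sketch below checks this with finite differences; all function names are illustrative and not taken from the authors' code.

```python
import math

EPS, L = 0.01, 5.0                       # values reported for the experiment
S = math.sqrt(EPS)
C = (3.0 * L + 1.0) * EPS**1.5 / 3.0     # c_eps

def w(x):
    """W-shaped piecewise cubic entering (4.1)."""
    if x <= -L * S:
        u = x + (L + 1.0) * S
        return S * u**2 - u**3 / 3.0 - C
    if x <= -S:
        return EPS * x + EPS**1.5 / 3.0
    if x <= 0.0:
        return -S * x**2 - x**3 / 3.0
    if x <= S:
        return -S * x**2 + x**3 / 3.0
    if x <= L * S:
        return -EPS * x + EPS**1.5 / 3.0
    u = x - (L + 1.0) * S
    return S * u**2 + u**3 / 3.0 - C

def f(x, y):
    """Objective of (4.1)."""
    return w(x[2]) - y[0]**2 / 40.0 + x[0] * y[0] - 2.5 * y[1]**2 + x[1] * y[1]

def y_star(x):
    """Inner maximizer, from -y1/20 + x1 = 0 and -5*y2 + x2 = 0."""
    return (20.0 * x[0], x[1] / 5.0)

def value(x):
    """Value function max_y f(x, y), evaluated at the closed-form maximizer."""
    return f(x, y_star(x))

def grad_fd(x, h=1e-6):
    """Central-difference gradient of the value function."""
    g = []
    for i in range(3):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((value(xp) - value(xm)) / (2.0 * h))
    return g

def curvature_fd(x, i, h=1e-4):
    """Second difference of the value function along coordinate i."""
    xp, xm = list(x), list(x)
    xp[i] += h
    xm[i] -= h
    return (value(xp) - 2.0 * value(x) + value(xm)) / h**2

origin = [0.0, 0.0, 0.0]
print(grad_fd(origin))                              # numerically zero gradient
print([curvature_fd(origin, i) for i in range(3)])  # third entry is negative
```

The curvatures along $x_{1}$ and $x_{2}$ are positive ($20$ and $0.2$), so the only escape direction at the origin is along $x_{3}$, which a purely gradient-based method started near this point picks up only slowly.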
Figure 1: Numerical results of the tested algorithms on the synthetic W-shaped minimax example (4.1) with initialization $(x_{1},y_{1})=([0.1,0.1,0.1]^{\top},[0,0]^{\top})$.
Figure 2: Numerical results of the tested algorithms on the synthetic W-shaped minimax example (4.1) with initialization $(x_{2},y_{2})=([1.0,0.1,0.1]^{\top},[0,0]^{\top})$.
Since problem (4.1) contains a strict saddle point and the GDA algorithm struggles to escape once trapped near it, the GDA curves in Figure 1 and Figure 2 remain almost flat, showing very little decrease in either the objective gap or the gradient norm. In contrast, the other four algorithms effectively escape the saddle region and achieve substantial progress. The proposed HSDA algorithm exhibits the fastest decrease in both the optimality gap $\mathcal{F}(x_{t})-\mathcal{F}^{\star}$ and the gradient norm $\|\nabla\mathcal{F}(x_{t})\|_{2}$. For both initializations, it reduces the objective gap to about $10^{-4}$ and the gradient norm to about $10^{-2}$ within roughly a dozen iterations. The GRTR algorithm also converges rapidly, but its trajectories display more pronounced oscillations. The MCN algorithm reduces these quantities more slowly, yet its convergence path is comparatively smooth. With the chosen parameters, the MINIMAX-TRACE algorithm converges more slowly, and its curves are more oscillatory than those of HSDA and GRTR.
4.2 Adversarial training on MNIST
We next examine an adversarial training task studied in chen2021cubic , whose goal is to train a classifier that remains robust against small input perturbations. Using the MNIST dataset with 50,000 training and 10,000 test samples, we solve the finite-sum minimax problem
$$\min_{x}\;\max_{y=\{y_{i}\}_{i=1}^{n}}\frac{1}{n}\sum_{i=1}^{n}\Bigl[\ell\bigl(h_{x}(y_{i}),b_{i}\bigr)-\lambda\,\|y_{i}-a_{i}\|_{2}^{2}\Bigr], \tag{4.2}$$
where $x$ collects the parameters of a convolutional neural network $h_{x}(\cdot)$, $(a_{i},b_{i})$ denotes the $i$th image-label pair, and $y_{i}\in\mathbb{R}^{784}$ is an adversarial version of $a_{i}$. Following chen2021cubic , we take $\ell$ to be the cross-entropy loss and set $\lambda=2$.
The network $h_{x}$ is a simple convolutional architecture: a single convolutional block (one input channel, one output channel, kernel size $3$, stride $4$, padding $1$, followed by a sigmoid activation) produces a $7\times 7$ feature map; this map is flattened to a $49$-dimensional vector and passed through a linear layer with $10$ outputs. All network parameters are stacked into a vector $x\in\mathbb{R}^{d_{x}}$; each adversarial variable $y_{i}$ is initialized at the original image $a_{i}$.
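The stated feature-map size follows from the usual output-size formula for a strided convolution applied to $28\times 28$ MNIST images (recall $784=28^{2}$). The short Python sketch below reproduces the arithmetic. The floor convention and the presence of bias terms in both layers are assumptions, since the text does not specify them, so the resulting value of $d_{x}$ is indicative only.

```python
def conv_out(size, kernel, stride, padding):
    """Output length along one axis of a strided convolution (floor convention)."""
    return (size + 2 * padding - kernel) // stride + 1

side = conv_out(28, kernel=3, stride=4, padding=1)  # spatial side of the feature map
flat = side * side                                  # one output channel, flattened

# Parameter count, ASSUMING both layers carry bias terms (not stated in the text).
conv_params = 3 * 3 * 1 * 1 + 1      # 3x3 kernel, 1 input and 1 output channel, bias
linear_params = flat * 10 + 10       # dense layer to 10 classes, bias
d_x = conv_params + linear_params

print(side, flat, d_x)
```

Under these assumptions the model has only a few hundred parameters, which keeps Hessian-vector products with respect to $x$ cheap compared with the $784$-dimensional adversarial variables $y_{i}$.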
We apply IHSDA together with GDA, IMCN luo2022finding , IGRTR wang2025gradient , and ILMNegCur wang2025gradient to (4.2). All methods are run in mini-batch mode with batch size $64$. For the inner maximization over $y$, every algorithm approximates the maximizer by a limited number of (possibly accelerated) gradient-ascent steps. For IHSDA we set the strong-concavity parameter in the $y$-direction to $\mu=1$, the Lipschitz constant of $\nabla_{y}f$ to $\ell=10$, and the Hessian Lipschitz constant of the value function to $L_{2}=0.2$. The homogeneous second-order subproblem in each outer iteration is solved approximately by a Lanczos procedure limited to at most $80$ iterations. IMCN is implemented according to the description in luo2022finding , while IGRTR and ILMNegCur follow the specifications in wang2025gradient .
Figure 3 presents the results. Panel (a) plots test accuracy against wall-clock time, and panel (b) shows the approximate objective function value of (4.2) versus the outer iteration index. On this problem, the four second-order algorithms achieve higher test accuracies than GDA within the same time budget. IHSDA typically reaches a test accuracy around $80\%$ and attains the lowest objective values among the compared methods. IMCN, IGRTR, and ILMNegCur are competitive and follow closely. GDA improves more gradually and remains below about $70\%$ accuracy over the plotted range. Overall, the curves indicate that exploiting second-order information through the HSDA framework is beneficial for this adversarial training task, and that the resulting IHSDA method performs on par with existing second-order schemes.
Figure 3: Numerical results of the tested algorithms for solving (4.2).
5 Conclusions
In this paper, we have introduced a Homogeneous Second-Order Descent Ascent (HSDA) algorithm and its inexact variant (IHSDA) for solving nonconvex-strongly concave minimax problems. The algorithms leverage a homogenized eigenvalue subproblem to compute a search direction that ensures sufficient descent even when the Hessian of the value function is nearly positive semidefinite.
We prove that both HSDA and IHSDA (the latter with high probability) find an $\mathcal{O}(\varepsilon,\sqrt{\varepsilon})$-second-order stationary point within at most $\tilde{\mathcal{O}}(\varepsilon^{-3/2})$ outer iterations, matching the best known iteration complexity for existing second-order methods in this setting. For the practical IHSDA variant, which solves the subproblem approximately via a Lanczos procedure, we further establish a high-probability bound of $\tilde{\mathcal{O}}(\varepsilon^{-7/4})$ for the total number of Hessian-vector products.
The numerical experiments on synthetic minimax problems and adversarial training tasks confirm the efficiency and robustness of the proposed methods. A natural and promising direction for future work is to extend the homogeneous second-order framework beyond the nonconvex-strongly concave setting, e.g., to more general minimax structures that appear in modern machine learning applications.
Data Availability
No datasets were generated or analysed during the current study.
Declarations
The authors declare that they have no conflict of interest.
References
- (1) C. Cartis, N. I. M. Gould, and P. L. Toint. _Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results._ Mathematical Programming, 127(2):245–295, 2011.
- (2) F. E. Curtis, D. P. Robinson, and M. Samadi. _A trust-region algorithm with a worst-case iteration complexity of $\mathcal{O}(\varepsilon^{-3/2})$ for nonconvex optimization._ Mathematical Programming, 162(1):1–32, 2017.
- (3) Z. Chen, Z. Hu, Q. Li, Z. Wang, and Y. Zhou. _A cubic regularization approach for finding local minimax points in nonconvex minimax optimization._ Transactions on Machine Learning Research, pages 2835–8856, 2023.
- (4) N. Doikov and Y. Nesterov. _Gradient regularization of Newton method with Bregman distances._ Mathematical Programming, 204(1):1–25, 2024.
- (5) Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. _Domain-adversarial training of neural networks._ Journal of Machine Learning Research, 17(59):1–35, 2016.
- (6) R. Gao and A. Kleywegt. _Distributionally robust stochastic optimization with Wasserstein distance._ Mathematics of Operations Research, 48(2):603–655, 2023.
- (7) G. H. Golub and C. F. Van Loan. _Matrix Computations._ Johns Hopkins University Press, 2013.
- (8) C. He, Y. Jiang, C. Zhang, D. Ge, B. Jiang, and Y. Ye. _Homogeneous second-order descent framework: a fast alternative to Newton-type methods._ Mathematical Programming, pages 1–62, 2025.
- (9) C. Jin, P. Netrapalli, and M. I. Jordan. _What is local optimality in nonconvex-nonconcave minimax optimization?_ International Conference on Machine Learning, PMLR, pages 4880–4899, 2020.
- (10) H. Li, Y. Tian, J. Zhang, and A. Jadbabaie. _Complexity lower bounds for nonconvex-strongly-concave min-max optimization._ Advances in Neural Information Processing Systems, 34:1792–1804, 2021.
- (11) T. Lin, C. Jin, and M. I. Jordan. _On gradient descent ascent for nonconvex-concave minimax problems._ International Conference on Machine Learning, PMLR, pages 6083–6093, 2020.
- (12) T. Lin, C. Jin, and M. I. Jordan. _Near-optimal algorithms for minimax optimization._ Conference on Learning Theory, PMLR, pages 2738–2779, 2020.
- (13) S. Lu, I. Tsaknakis, M. Hong, and Y. Chen. _Hybrid block successive approximation for one-sided non-convex min-max problems: algorithms and applications._ IEEE Transactions on Signal Processing, 68:3676–3691, 2020.
- (14) L. Luo, Y. Li, and C. Chen. _Finding second-order stationary points in nonconvex-strongly concave minimax optimization._ Advances in Neural Information Processing Systems, 35:36667–36679, 2022.
- (15) K. Mishchenko. _Regularized Newton method with global convergence._ SIAM Journal on Optimization, 33(3):1440–1462, 2023.
- (16) Y. Nesterov. _Lectures on Convex Optimization._ Springer, 2018.
- (17) Y. Nesterov and B. T. Polyak. _Cubic regularization of Newton method and its global performance._ Mathematical Programming, 108(1):177–205, 2006.
- (18) S. Qiu, Z. Yang, X. Wei, J. Ye, and Z. Wang. _Single-timescale stochastic nonconvex-concave optimization for smooth nonlinear TD learning._ arXiv preprint arXiv:2008.10103, 2020.
- (19) H. Rafique, M. Liu, Q. Lin, and T. Yang. _Weakly-convex-concave min-max optimization: provable algorithms and applications in machine learning._ Optimization Methods and Software, 37(3):1087–1121, 2022.
- (20) C. W. Royer and S. J. Wright. _Complexity analysis of second-order line-search algorithms for smooth nonconvex optimization._ SIAM Journal on Optimization, 28(2):1448–1477, 2018.
- (21) M. Sanjabi, J. Ba, M. Razaviyayn, and J. D. Lee. _On the convergence and robustness of training GANs with regularized optimal transport._ Advances in Neural Information Processing Systems, 31, 2018.
- (22) A. Sinha, H. Namkoong, R. Volpi, and J. Duchi. _Certifying some distributional robustness with principled adversarial training._ arXiv preprint arXiv:1710.10571, 2017.
- (23) J. Wang and Z. Xu. _Gradient norm regularization second-order algorithms for solving nonconvex-strongly concave minimax problems._ Journal of Scientific Computing, 105(2):1–31, 2025.
- (24) Z. Xu, H. Zhang, Y. Xu, and G. Lan. _A unified single-loop alternating gradient projection algorithm for nonconvex-concave and convex-nonconcave minimax problems._ Mathematical Programming, 201(1):635–706, 2023.
- (25) T. Yao and Z. Xu. _Two trust region type algorithms for solving nonconvex-strongly concave minimax problems._ SCIENTIA SINICA Mathematica, 55:1–18, 2025.
- (26) Y. Ying, L. Wen, and S. Lyu. _Stochastic online AUC maximization._ Advances in Neural Information Processing Systems, 29, 2016.
- (27) C. Zhang, C. He, Y. Jiang, C. Xue, B. Jiang, D. Ge, and Y. Ye. _A homogeneous second-order descent method for nonconvex optimization._ Mathematics of Operations Research, 2025.
- (28) S. Zhang, J. Yang, C. Guzmán, N. Kiyavash, and N. He. _The complexity of nonconvex-strongly concave minimax optimization._ Uncertainty in Artificial Intelligence, PMLR, pages 482–492, 2021.