A Homogeneous Second-Order Descent Ascent Algorithm for Nonconvex-Strongly Concave Minimax Problems

This work is supported by the National Key R&D Program of China (Nos. 2025YFA1017801 and 2025YFA1017800) and the National Natural Science Foundation of China under grant 12471294.

Jia-Hao Chen: Department of Mathematics, College of Sciences, Shanghai University, Shanghai 200444, P.R. China. Email: chenjiahao@shu.edu.cn

Zi Xu (corresponding author): Department of Mathematics, College of Sciences, Shanghai University, Shanghai 200444, P.R. China. Email: xuzi@shu.edu.cn

Hui-Ling Zhang: LSEC, ICMSEC, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China. Email: zhanghl1209@shu.edu.cn


Abstract

This paper introduces a novel Homogeneous Second-order Descent Ascent (HSDA) algorithm for nonconvex-strongly concave minimax optimization problems. At each iteration, HSDA computes a search direction by solving a homogenized eigenvalue subproblem built from the gradient and Hessian of the objective function. This formulation guarantees a descent direction with sufficient negative curvature even in near-positive-semidefinite Hessian regimes, a key feature that enhances escape from saddle points. We prove that HSDA finds an $\mathcal{O}(\varepsilon,\sqrt{\varepsilon})$-second-order stationary point within $\tilde{\mathcal{O}}(\varepsilon^{-3/2})$ iterations, matching the optimal $\varepsilon$-order iteration complexity among second-order methods for this problem class. To address large-scale applications, we further design an inexact variant (IHSDA) that preserves the single-loop structure while solving the subproblem approximately via a Lanczos procedure. With high probability, IHSDA achieves the same $\tilde{\mathcal{O}}(\varepsilon^{-3/2})$ iteration complexity and attains an $\mathcal{O}(\varepsilon,\sqrt{\varepsilon})$-second-order stationary point, with the total Hessian-vector product cost bounded by $\tilde{\mathcal{O}}(\varepsilon^{-7/4})$. Experiments on synthetic minimax problems and adversarial training tasks confirm the practical effectiveness and robustness of the proposed algorithms.

1 Introduction

In this paper, we consider the following unconstrained minimax problem:

$$\min_{x\in\mathbb{R}^{n}}\max_{y\in\mathbb{R}^{m}} f(x,y), \tag{P}$$

where $f(x,y):\mathbb{R}^{n}\times\mathbb{R}^{m}\to\mathbb{R}$ is a continuously differentiable function that is strongly concave in $y$ but possibly nonconvex in $x$. For convenience, we denote

$$\mathcal{F}(x):=\max_{y\in\mathbb{R}^{m}} f(x,y). \tag{1.1}$$

Such a structure captures a wide range of machine learning applications, including adversarial training and distributionally robust optimization gao2023distributionally ; sanjabi2018convergence ; sinha2017certifying , reinforcement learning, domain adaptation, and AUC maximization ganin2016domain ; qiu2020single ; ying2016stochastic .

To solve the minimax problem (P), three main classes of optimization algorithms have been developed: zeroth-order, first-order, and second-order methods, which utilize the function value, gradient, and Hessian of the objective function, respectively. Compared to zeroth- and first-order approaches, second-order methods have garnered considerable attention owing to their faster convergence rates. Moreover, they are more effective at escaping saddle points and avoiding poor local minima, thereby increasing the likelihood of converging to a globally optimal solution. This paper focuses on second-order optimization algorithms for solving (P).

For nonconvex-strongly concave minimax problems, first-order algorithms can obtain an $\varepsilon$-first-order stationary point in $\tilde{\mathcal{O}}(\kappa_{y}^{2}\varepsilon^{-2})$ iterations jin2020local ; lin2020gradient ; lu2020hybrid ; rafique2022weakly ; xu2023unified , where $\kappa_{y}$ denotes the condition number of $f(x,\cdot)$. Acceleration frameworks further improve the iteration complexity to $\tilde{\mathcal{O}}(\sqrt{\kappa_{y}}\,\varepsilon^{-2})$ lin2020near ; zhang2021complexity ; Li2021ComplexityLB .

There are few studies on second-order algorithms for solving nonconvex-strongly concave minimax optimization problems (P). Existing second-order algorithms can be divided into two categories: cubic regularization Newton type algorithms luo2022finding ; chen2021cubic and trust-region type algorithms yao2024two ; wang2025gradient . Building upon the cubic regularization (CR) framework, Luo et al. luo2022finding proposed the Minimax Cubic Newton (MCN) algorithm, which alternates between a cubic-regularized Newton step in the minimization variable and an ascent step in the maximization variable, achieving an iteration complexity of $\mathcal{O}(\varepsilon^{-3/2})$ to reach an $\mathcal{O}(\varepsilon,\sqrt{\varepsilon})$-second-order stationary point. They further introduced an inexact variant (IMCN) that solves the cubic subproblem via gradient-based iterations and approximates Hessian inverse operations using Chebyshev polynomial expansions, relying solely on Hessian-vector products. Within the same line of work, Chen et al. chen2021cubic developed the Inexact Cubic-LocalMinimax (ICLM) algorithm, which attains the same order of iteration complexity. In the trust-region family of methods, Yao and Xu yao2024two proposed MINIMAX-TR, a fixed-radius inexact trust-region method that finds an $\mathcal{O}(\varepsilon,\sqrt{\varepsilon})$-second-order stationary point within $\mathcal{O}(\varepsilon^{-3/2})$ iterations. To enhance practical performance, they also designed MINIMAX-TRACE, which adaptively adjusts the trust-region radius through contraction and expansion steps while maintaining the same theoretical iteration complexity. More recently, Wang and Xu wang2025gradient introduced a gradient norm regularized trust-region method (GRTR) and a Levenberg-Marquardt type negative-curvature method (LMNegCur) for nonconvex-strongly concave minimax problems. GRTR achieves an iteration complexity of $\tilde{\mathcal{O}}(\varepsilon^{-3/2})$, and its inexact variant IGRTR preserves this rate while reducing the number of Hessian-vector products to $\tilde{\mathcal{O}}(\varepsilon^{-7/4})$. LMNegCur and its inexact counterpart ILMNegCur offer analogous convergence guarantees.

Collectively, these advances highlight the ongoing development of second-order methods tailored to nonconvex-strongly concave minimax optimization and motivate the algorithm design pursued in this work.

1.1 Contributions

In this paper, we propose a homogeneous second-order descent ascent (HSDA) algorithm whose outer iteration solves a single homogenized eigenvalue subproblem, constructed from the gradient and Hessian of the value function, to obtain an iteration direction. This homogenized formulation guarantees, even when the Hessian is nearly positive semidefinite, a descent direction with sufficient negative curvature for the value function. We prove that HSDA finds an $\mathcal{O}(\varepsilon,\sqrt{\varepsilon})$-second-order stationary point within $\tilde{\mathcal{O}}(\varepsilon^{-3/2})$ iterations, matching the best known iteration complexity for second-order methods in this setting chen2021cubic ; luo2022finding ; wang2025gradient . For large-scale problems, we develop an inexact version (IHSDA) that approximately solves the homogenized eigenvalue subproblem via a Lanczos procedure with a carefully controlled residual. IHSDA preserves the same $\tilde{\mathcal{O}}(\varepsilon^{-3/2})$ outer iteration complexity and, with high probability, reaches an $\mathcal{O}(\varepsilon,\sqrt{\varepsilon})$-second-order stationary point, while its total number of Hessian-vector products is upper bounded by $\tilde{\mathcal{O}}(\varepsilon^{-7/4})$.

Unlike recent inexact trust-region schemes (e.g., IGRTR and ILMNegCur wang2025gradient ), which require solving both a regularized Newton system and an $n$-dimensional extremal-eigenvalue problem, HSDA solves only a single $(n+1)$-dimensional extremal-eigenvalue problem in a lifted space at each iteration. We further show (in Section 3) that the homogenized eigenvalue subproblems typically exhibit better conditioning than the regularized Newton/trust-region systems underlying IGRTR/ILMNegCur. Hence, HSDA and IHSDA offer an alternative second-order framework whose inner subproblem is structurally simpler, yet achieves the same $\tilde{\mathcal{O}}(\varepsilon^{-7/4})$ Hessian-vector product complexity as IGRTR and ILMNegCur.

Notation. We adopt the following notation throughout the paper: $[a;b]$ and $[a,b]$ denote vertical and horizontal concatenation, respectively; $\operatorname{sgn}(\cdot)$ is the sign function, defined by $\operatorname{sgn}(a)=-1$ if $a<0$ and $\operatorname{sgn}(a)=1$ if $a\geqslant 0$. For a vector $a\in\mathbb{R}^{n}$ and $0\leqslant j\leqslant n$, $a_{[1:j]}$ denotes the subvector formed by its first $j$ entries. The symbol $\|\cdot\|$ denotes the Euclidean norm for vectors and the induced $\ell_{2}$ operator norm for matrices. The eigenvalues of a matrix $A\in\mathbb{R}^{n\times n}$ are ordered as $\lambda_{1}(A),\lambda_{2}(A),\dots,\lambda_{\max}(A)$ in nondecreasing order. The identity matrix of dimension $n$ is written as $I_{n}$, or simply $I$ when the dimension is clear. For a function $f(x,y):\mathbb{R}^{n}\times\mathbb{R}^{m}\to\mathbb{R}$, $\nabla_{x}f(x,y)$ and $\nabla_{y}f(x,y)$ denote its partial gradients with respect to $x$ and $y$, respectively; the full gradient is $\nabla f(x,y):=(\nabla_{x}f(x,y),\nabla_{y}f(x,y))$. Second-order partial derivatives are denoted by $\nabla_{xx}^{2}f(x,y)$, $\nabla_{xy}^{2}f(x,y)$, $\nabla_{yx}^{2}f(x,y)$, and $\nabla_{yy}^{2}f(x,y)$. The complexity notations $\mathcal{O}(\cdot)$, $\Omega(\cdot)$, $\Theta(\cdot)$ hide only absolute constants independent of problem parameters, while $\tilde{\mathcal{O}}(\cdot)$ additionally hides logarithmic factors.
We also define the value function $\mathcal{F}(x):=\max_{y\in\mathbb{R}^{m}}f(x,y)$, the maximizer $y^{\ast}(x):=\arg\max_{y\in\mathbb{R}^{m}}f(x,y)$, the partial gradient $g(x,y):=\nabla_{x}f(x,y)$, and $H(x,y):=\bigl[\nabla_{xx}^{2}f-\nabla_{xy}^{2}f(\nabla_{yy}^{2}f)^{-1}\nabla_{yx}^{2}f\bigr](x,y)$.
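As an illustration of the matrix $H(x,y)$ defined above, the following minimal sketch evaluates the Schur-complement expression with a linear solve instead of an explicit inverse. The helper name `schur_hessian` and the toy quadratic $f(x,y)=\tfrac12 x^\top A x + x^\top B y - \tfrac12 y^\top C y$ are assumptions made only for this example; for that quadratic the value function is $\mathcal{F}(x)=\tfrac12 x^\top (A+BC^{-1}B^\top)x$, so the two matrices computed below should agree.

```python
import numpy as np

def schur_hessian(Hxx, Hxy, Hyy):
    """Return Hxx - Hxy Hyy^{-1} Hyx, using a linear solve (Hyx = Hxy^T)."""
    return Hxx - Hxy @ np.linalg.solve(Hyy, Hxy.T)

# Hypothetical toy quadratic: f(x, y) = 0.5 x^T A x + x^T B y - 0.5 y^T C y,
# with C positive definite, so f is strongly concave in y.
rng = np.random.default_rng(0)
n, m = 4, 3
A = rng.standard_normal((n, n))
A = 0.5 * (A + A.T)                      # symmetric, possibly indefinite
B = rng.standard_normal((n, m))
M = rng.standard_normal((m, m))
C = M @ M.T + np.eye(m)                  # positive definite

# Second-order blocks of f: Hxx = A, Hxy = B, Hyy = -C.
H = schur_hessian(A, B, -C)

# Hessian of the value function F(x) = 0.5 x^T (A + B C^{-1} B^T) x.
H_value = A + B @ np.linalg.solve(C, B.T)
```

In large-scale settings one would typically avoid forming $H$ at all and work with Hessian-vector products, which is the regime the inexact variant targets.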

2 A Homogeneous Second-Order Descent Ascent Algorithm

In this section, we propose a Homogeneous Second-order Descent Ascent (HSDA) algorithm for solving the nonconvex-strongly concave minimax problem (P). HSDA is inspired by the Homogeneous Second-order Descent Method (HSODM) zhang2025homogeneous , a second-order framework originally designed for unconstrained minimization problems of the form $\min_{x\in\mathbb{R}^{n}}\mathcal{F}(x)$. At each iteration, HSODM obtains a search direction by solving a homogenized eigenvalue subproblem of the form:

$$\min_{\|[u;v]\|\leqslant 1}\ \begin{bmatrix}u\\ v\end{bmatrix}^{\top}\begin{bmatrix}\nabla^{2}\mathcal{F}(x)&\nabla\mathcal{F}(x)\\ \nabla\mathcal{F}(x)^{\top}&-\alpha\end{bmatrix}\begin{bmatrix}u\\ v\end{bmatrix},$$

where $\alpha\geqslant 0$ is a prescribed parameter. As demonstrated in zhang2025homogeneous , the resulting subproblem is an eigenvalue problem, where the optimal solution $[u_{t};v_{t}]$ is a unit eigenvector associated with the smallest eigenvalue of the homogenized matrix. Building on this framework, we propose an HSDA algorithm as a generalized and inexact extension of HSODM tailored to the minimax setting in (P) with $\mathcal{F}(x)=\max_{y\in\mathbb{R}^{m}}f(x,y)$. At each iteration, HSDA incorporates two key algorithmic components: (i) an inner momentum-based gradient ascent loop on $y$ that produces $y_{t}$ (Step 2 of Algorithm 1), and (ii) a homogenized eigenvalue subproblem, built from $g_{t}$ and $H_{t}$, whose solution determines the search direction (Steps 3 and 4 of Algorithm 1).

The detailed algorithm is formally stated in Algorithm 1.

Algorithm 1 A Homogeneous Second-Order Descent Ascent (HSDA) Algorithm

Step 1: Input $x_{1}$, $y_{0}$, $\eta_{1}>0$, $\eta_{2}>0$, $\omega\in(0,1/2)$, $\{N_{t}\geqslant 1\}$, $\varepsilon>0$, $\alpha>0$, $\Lambda>0$, and set $t=1$.

Step 2: Update $y_{t}$:

(2a): Set $i=0$, $y_{i}^{t}=\tilde{y}_{i}^{t}=y_{t-1}$.

(2b): Update $y_{i}^{t}$ and $\tilde{y}_{i}^{t}$:

$$\begin{aligned}
y_{i+1}^{t} &= \tilde{y}_{i}^{t}+\eta_{1}\nabla_{y}f\big(x_{t},\tilde{y}_{i}^{t}\big),\\
\tilde{y}_{i+1}^{t} &= y_{i+1}^{t}+\eta_{2}\big(y_{i+1}^{t}-y_{i}^{t}\big).
\end{aligned}$$

(2c): If $i\geqslant N_{t}-1$, set $y_{t}=y_{N_{t}}^{t}$; otherwise set $i=i+1$ and go to Step (2b).

Step 3: Compute

$$g_{t}=\nabla_{x}f(x_{t},y_{t}),\qquad H_{t}=\big[\nabla^{2}_{xx}f-\nabla^{2}_{xy}f(\nabla^{2}_{yy}f)^{-1}\nabla^{2}_{yx}f\big](x_{t},y_{t})$$

and solve the following homogeneous subproblem to obtain $[u_{t};v_{t}]$:

$$[u_{t};v_{t}]=\arg\min_{\|[u;v]\|\leqslant 1}\begin{bmatrix}u\\ v\end{bmatrix}^{\top}\begin{bmatrix}H_{t}&g_{t}\\ g_{t}^{\top}&-\alpha\end{bmatrix}\begin{bmatrix}u\\ v\end{bmatrix}.$$

Step 4: Update the direction $s_{t}$:

$$s_{t}=\begin{cases}\dfrac{u_{t}}{v_{t}}, & |v_{t}|\geqslant\omega,\\[9pt] \operatorname{sgn}\big(-g_{t}^{\top}u_{t}\big)\,u_{t}, & |v_{t}|<\omega.\end{cases}$$

Step 5: If some stationary condition or stopping criterion is satisfied, set $x_{t+1}=x_{t}+s_{t}$ and terminate. Otherwise, compute $\tau_{t}=\Lambda/\|s_{t}\|$, update $x_{t+1}=x_{t}+\tau_{t}s_{t}$, set $t=t+1$ and go to Step 2.
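To make Steps 3 to 5 concrete, here is a minimal sketch of one outer update for given $g_{t}$ and $H_{t}$. It is an illustration under stated assumptions, not the authors' implementation: the function name `hsda_step` is hypothetical, the subproblem is solved exactly with a dense symmetric eigensolver (the inexact variant would use a Lanczos procedure instead), the stopping test of Step 5 is omitted, and the sketch assumes $s_{t}\neq 0$.

```python
import numpy as np

def hsda_step(x, g, H, alpha, omega, Lam):
    """One outer update (Steps 3-5 of Algorithm 1) with an exact eigen-solve."""
    n = g.size
    # Homogenized matrix [[H, g], [g^T, -alpha]] of size (n+1) x (n+1).
    G = np.block([[H, g[:, None]], [g[None, :], np.array([[-alpha]])]])
    eigvals, eigvecs = np.linalg.eigh(G)      # eigenvalues in ascending order
    w = eigvecs[:, 0]                         # unit eigenvector for the smallest eigenvalue
    u, v = w[:n], w[n]
    if abs(v) >= omega:
        s = u / v
    else:
        sign = 1.0 if -(g @ u) >= 0 else -1.0  # sgn(a) = 1 for a >= 0, as in the Notation
        s = sign * u
    tau = Lam / np.linalg.norm(s)             # fixed step length: ||tau * s|| = Lam
    delta = -eigvals[0]                       # dual scalar, delta = -lambda_1(G)
    return x + tau * s, s, delta

# Small illustrative instance (values chosen arbitrarily for the example).
H = np.diag([1.0, -2.0])
g = np.array([0.5, 0.1])
x = np.zeros(2)
alpha, omega, Lam = 0.1, 0.4, 0.5
x_new, s, delta = hsda_step(x, g, H, alpha, omega, Lam)
```

Because the $(n+1,n+1)$ entry of the homogenized matrix is $-\alpha<0$, its smallest eigenvalue is negative, so the minimizer over the unit ball lies on the unit sphere and coincides with a unit eigenvector; this is why a single eigen-solve suffices in the sketch.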

In the following subsection, we establish that the proposed HSDA algorithm attains an $\mathcal{O}(\varepsilon,\sqrt{\varepsilon})$-second-order stationary point of $\mathcal{F}(x)$ for problem (P) with an iteration complexity of $\tilde{\mathcal{O}}(\varepsilon^{-3/2})$.

2.1 Complexity Analysis

Throughout this paper, we work under the following assumptions on $f(x,y)$, which ensure that the value function $\mathcal{F}(x)$ is well-defined and has the required smoothness.

Assumption 2.1

The function $f(x,y)$ satisfies the following assumptions:

Under these assumptions, Lemma 2.1 establishes the Lipschitz continuity of $\nabla\mathcal{F}$ and $\nabla^{2}\mathcal{F}$.

Lemma 2.1 (chen2021cubic)

Under Assumption 2.1, the following properties hold:

Lemma 2.1 directly yields the upper bounds presented in Lemma 2.2.

Lemma 2.2 (nesterov2018lectures)

Under Assumption 2.1, for all $x,x^{\prime}\in\mathbb{R}^{n}$ the gradient and Hessian of $\mathcal{F}$ satisfy the following inequalities:

$$\begin{aligned}
\left\|\nabla\mathcal{F}(x^{\prime})-\nabla\mathcal{F}(x)-\nabla^{2}\mathcal{F}(x)(x^{\prime}-x)\right\| &\leqslant\frac{L_{2}}{2}\|x^{\prime}-x\|^{2},\\
\left|\mathcal{F}(x^{\prime})-\mathcal{F}(x)-\nabla\mathcal{F}(x)^{\top}(x^{\prime}-x)-\frac{1}{2}(x^{\prime}-x)^{\top}\nabla^{2}\mathcal{F}(x)(x^{\prime}-x)\right| &\leqslant\frac{L_{2}}{6}\|x^{\prime}-x\|^{3},
\end{aligned}$$

where $L_{2}$ is the Lipschitz constant of $\nabla^{2}\mathcal{F}$ given in Lemma 2.1.

We next recall the standard definitions of first- and second-order stationary points used in minimax optimization luo2022finding .

Definition 1

A point $x$ is an $\varepsilon$-first-order stationary point of $\mathcal{F}(x)$ when $\|\nabla\mathcal{F}(x)\|\leqslant\varepsilon$.

Definition 2

A point $x$ is an $\mathcal{O}(\varepsilon,\sqrt{\varepsilon})$-second-order stationary point of $\mathcal{F}(x)$ when

$$\|\nabla\mathcal{F}(x)\|\leqslant c_{1}\varepsilon\quad\text{and}\quad\nabla^{2}\mathcal{F}(x)\succcurlyeq-c_{2}\sqrt{\varepsilon}\,I,$$

where the positive constants $c_{1}$ and $c_{2}$ do not depend on $\varepsilon$.
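Definition 2 translates directly into a numerical test. The sketch below is illustrative only: the function name `is_second_order_stationary` and the default values $c_{1}=c_{2}=1$ are assumptions of this example, and the gradient and Hessian of $\mathcal{F}$ are taken as given inputs.

```python
import numpy as np

def is_second_order_stationary(grad, hess, eps, c1=1.0, c2=1.0):
    """Check ||grad|| <= c1*eps and hess >= -c2*sqrt(eps)*I (Definition 2)."""
    grad_ok = np.linalg.norm(grad) <= c1 * eps
    # hess >= -c2*sqrt(eps)*I is equivalent to lambda_min(hess) >= -c2*sqrt(eps).
    curv_ok = np.linalg.eigvalsh(hess)[0] >= -c2 * np.sqrt(eps)
    return bool(grad_ok and curv_ok)

eps = 1e-2
grad = np.array([1e-3, 0.0])
# Mild negative curvature (-0.01 >= -sqrt(eps) = -0.1): accepted.
near_minimum = is_second_order_stationary(grad, np.diag([1.0, -0.01]), eps)
# Strong negative curvature (-1 < -0.1): a saddle point, rejected.
saddle = is_second_order_stationary(grad, np.diag([1.0, -1.0]), eps)
```

The second condition is what distinguishes this notion from first-order stationarity: both test points have the same small gradient, but only the first passes the curvature check.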

The following lemma demonstrates that by selecting an appropriate step size and performing sufficiently many inner gradient-ascent updates on yy, we can obtain approximations of the gradient and Hessian of the value function within a desired accuracy.

Lemma 2.3 (wang2025gradient)

Suppose that Assumption 2.1 holds. For any $\varepsilon_{1},\varepsilon_{2}>0$, set the inner ascent step sizes as $\eta_{1}=1/\ell_{1}$, $\eta_{2}=(\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)$ and define $A=\min\left\{\frac{\varepsilon_{1}}{\ell_{1}},\ \frac{\varepsilon_{2}}{2L_{H}}\right\}$, where $L_{H}$ is the Lipschitz constant of $H(x,y)$ given in Lemma 2.1. If the iteration counts $\{N_{t}\}$ for $y$-updates in Algorithm 1 satisfy

$$\begin{aligned}
N_{1} &\geqslant 2\sqrt{\kappa}\log\left(\frac{\sqrt{\kappa+1}\,\|y_{0}-y^{*}(x_{1})\|}{A}\right),\\
N_{t} &\geqslant 2\sqrt{\kappa}\log\left(\frac{\sqrt{\kappa+1}\,\big(A+\kappa\|x_{t}-x_{t-1}\|\big)}{A}\right),\qquad t\geqslant 2,
\end{aligned}$$

then for every $t\geqslant 1$ the following error bounds hold:

$$\left\|y_{t}-y^{*}(x_{t})\right\|\leqslant A,\qquad\big\|\nabla\mathcal{F}(x_{t})-g_{t}\big\|\leqslant\varepsilon_{1},\qquad\big\|\nabla^{2}\mathcal{F}(x_{t})-H_{t}\big\|\leqslant\varepsilon_{2}.$$
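The inner loop whose length Lemma 2.3 controls (Step 2 of Algorithm 1) can be sketched as follows. This is a minimal illustration under explicit assumptions: the helper name `inner_ascent` and the toy concave quadratic are hypothetical, and the sketch takes $\kappa=\ell_{1}/\mu$, with $\mu$ the strong-concavity modulus and $\ell_{1}$ the gradient Lipschitz constant in $y$, which is an assumption of this example.

```python
import numpy as np

def inner_ascent(grad_y, y0, eta1, eta2, N):
    """N momentum-based gradient ascent steps on y (Step 2 of Algorithm 1)."""
    y, y_tilde = y0.copy(), y0.copy()
    for _ in range(N):
        y_next = y_tilde + eta1 * grad_y(y_tilde)   # ascent step from the extrapolated point
        y_tilde = y_next + eta2 * (y_next - y)      # momentum extrapolation
        y = y_next
    return y

# Hypothetical toy problem for fixed x: maximize b^T y - 0.5 y^T C y.
C = np.diag([1.0, 10.0])          # mu = 1, ell_1 = 10 (assumed constants)
b = np.array([1.0, -2.0])
grad_y = lambda y: b - C @ y
y_star = np.linalg.solve(C, b)    # exact maximizer y*(x)

kappa = 10.0 / 1.0
eta1 = 1.0 / 10.0
eta2 = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)

y0 = np.zeros(2)
err_short = np.linalg.norm(inner_ascent(grad_y, y0, eta1, eta2, 5) - y_star)
err_long = np.linalg.norm(inner_ascent(grad_y, y0, eta1, eta2, 100) - y_star)
```

In Algorithm 1 the loop is warm-started from $y_{t-1}$, which is why the bound on $N_{t}$ for $t\geqslant 2$ depends on $\|x_{t}-x_{t-1}\|$ rather than on the initial distance to the maximizer.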

The following lemma characterizes the optimality conditions for the homogenized eigenvalue subproblem (2.1) solved in Step 3 of Algorithm 1.

Lemma 2.4 (zhang2025homogeneous)

The vector $[u_{t};v_{t}]$ is a solution to the homogenized eigenvalue subproblem (2.1) if and only if there exists a dual scalar $\delta_{t}$ such that

$$\begin{bmatrix}H_{t}+\delta_{t}I&g_{t}\\ g_{t}^{\top}&-\alpha+\delta_{t}\end{bmatrix}\succcurlyeq 0, \tag{2.3a}$$

$$(H_{t}+\delta_{t}I)u_{t}=-v_{t}g_{t},\qquad g_{t}^{\top}u_{t}=v_{t}(\alpha-\delta_{t}), \tag{2.3b}$$

$$\delta_{t}\geqslant\alpha>0,\qquad\big\|[u_{t};v_{t}]\big\|=1. \tag{2.3c}$$

Furthermore, $-\delta_{t}$ equals the smallest eigenvalue of the homogenized matrix $G_{t}(\alpha):=\begin{bmatrix}H_{t}&g_{t}\\ g_{t}^{\top}&-\alpha\end{bmatrix}$, i.e., $-\delta_{t}=\lambda_{1}\big(G_{t}(\alpha)\big)$, and $[u_{t};v_{t}]$ is a corresponding unit eigenvector. Moreover, when $g_{t}\neq 0$, the inequality $\delta_{t}\geqslant\alpha>0$ in (2.3c) can be strengthened to the strict form $\delta_{t}>\alpha>0$.
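The conditions of Lemma 2.4 can be checked numerically on a random instance. The sketch below is only a sanity check under assumed data (a random symmetric matrix and a random vector standing in for $H_{t}$ and $g_{t}$); the residual names `res_a`, `res_b1`, `res_b2` are introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, alpha = 5, 0.3
M = rng.standard_normal((n, n))
H = 0.5 * (M + M.T)                     # stand-in for H_t (symmetric, indefinite in general)
g = rng.standard_normal(n)              # stand-in for g_t (nonzero)

# Homogenized matrix G_t(alpha) and its smallest eigenpair.
G = np.block([[H, g[:, None]], [g[None, :], np.array([[-alpha]])]])
eigvals, eigvecs = np.linalg.eigh(G)    # ascending eigenvalues, orthonormal eigenvectors
delta = -eigvals[0]                     # dual scalar: delta = -lambda_1(G)
u, v = eigvecs[:n, 0], eigvecs[n, 0]

# (2.3a): smallest eigenvalue of G + delta*I, zero up to rounding.
res_a = np.linalg.eigvalsh(G + delta * np.eye(n + 1))[0]
# (2.3b): residuals of the two stationarity equations.
res_b1 = np.linalg.norm((H + delta * np.eye(n)) @ u + v * g)
res_b2 = abs(g @ u - v * (alpha - delta))
# (2.3c): unit norm of [u; v].
norm_uv = np.sqrt(u @ u + v * v)
```

Both equations in (2.3b) are simply the block rows of the eigenvalue equation $G_{t}(\alpha)[u_{t};v_{t}]=-\delta_{t}[u_{t};v_{t}]$, which is why the residuals vanish up to rounding.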

We now proceed to analyze the iteration complexity of the proposed HSDA algorithm. Our analysis begins by establishing a descent property for the case where $|v_{t}|\leqslant\sqrt{1/(1+\Lambda^{2})}$.

Lemma 2.5

Suppose that Assumption 2.1 holds. Assume $\Lambda\leqslant\sqrt{2}/2$ and $\omega\in(0,1/2)$. For the case $|v_{t}|\leqslant\sqrt{1/(1+\Lambda^{2})}$, we have

$$\mathcal{F}(x_{t+1})-\mathcal{F}(x_{t})\leqslant\Lambda\varepsilon_{1}+\frac{\Lambda^{2}}{2}(\varepsilon_{2}-\alpha)+\frac{L_{2}}{6}\Lambda^{3}. \tag{2.4}$$
Proof

We first prove that when $|v_{t}|\leqslant\sqrt{1/(1+\Lambda^{2})}$, the direction $s_{t}$ satisfies

$$\|s_{t}\|\geqslant\Lambda. \tag{2.5}$$

When $|v_{t}|<\omega$, Step 4 of Algorithm 1 gives $s_{t}=\operatorname{sgn}(-g_{t}^{\top}u_{t})u_{t}$, and (2.3c) yields

$$\|s_{t}\|=\|u_{t}\|=\sqrt{1-|v_{t}|^{2}}\geqslant\sqrt{1-\omega^{2}}\geqslant\sqrt{3}/2\geqslant\Lambda.$$

When $\omega\leqslant|v_{t}|\leqslant\sqrt{1/(1+\Lambda^{2})}$, according to Algorithm 1 and (2.3c), we get

$$\|s_{t}\|=\|u_{t}/v_{t}\|=\sqrt{1-|v_{t}|^{2}}\,/\,|v_{t}|\geqslant\Lambda.$$

Therefore, (2.5) holds, and thus $\tau_{t}=\Lambda/\|s_{t}\|\in(0,1]$.

Denote $E_{t}:=\tau_{t}g_{t}^{\top}s_{t}+\frac{\tau_{t}^{2}}{2}s_{t}^{\top}H_{t}s_{t}$. It follows from Step 5 in Algorithm 1, the $L_{2}$-Lipschitz continuity of $\nabla^{2}\mathcal{F}(x)$, and Lemma 2.3 that

$$\begin{aligned}
&\mathcal{F}(x_{t+1})-\mathcal{F}(x_{t})\\
\leqslant{}&\tau_{t}\nabla\mathcal{F}(x_{t})^{\top}s_{t}+\frac{\tau_{t}^{2}}{2}s_{t}^{\top}\nabla^{2}\mathcal{F}(x_{t})s_{t}+\frac{L_{2}}{6}\tau_{t}^{3}\|s_{t}\|^{3}\\
={}&E_{t}+\tau_{t}(\nabla\mathcal{F}(x_{t})-g_{t})^{\top}s_{t}+\frac{\tau_{t}^{2}}{2}s_{t}^{\top}\big(\nabla^{2}\mathcal{F}(x_{t})-H_{t}\big)s_{t}+\frac{L_{2}}{6}\tau_{t}^{3}\|s_{t}\|^{3}\\
\leqslant{}&E_{t}+\tau_{t}\|\nabla\mathcal{F}(x_{t})-g_{t}\|\,\|s_{t}\|+\frac{\tau_{t}^{2}}{2}\|\nabla^{2}\mathcal{F}(x_{t})-H_{t}\|\,\|s_{t}\|^{2}+\frac{L_{2}}{6}\tau_{t}^{3}\|s_{t}\|^{3}\\
\leqslant{}&E_{t}+\varepsilon_{1}\Lambda+\frac{\varepsilon_{2}}{2}\Lambda^{2}+\frac{L_{2}}{6}\Lambda^{3},
\end{aligned}\tag{2.6}$$

where the last inequality uses $\tau_{t}\|s_{t}\|=\Lambda$.

When $|v_{t}|<\omega$ and $g_{t}\neq 0$, by (2.3b), (2.3c) and $s_{t}=\operatorname{sgn}(-g_{t}^{\top}u_{t})u_{t}$ in Algorithm 1, we obtain

$$s_{t}^{\top}H_{t}s_{t}=-\delta_{t}\|s_{t}\|^{2}-v_{t}^{2}(\alpha-\delta_{t}),\qquad g_{t}^{\top}s_{t}=|v_{t}|(\alpha-\delta_{t}). \tag{2.7}$$

Therefore, using (2.7), we get

$$\begin{aligned}
E_{t}&=\tau_{t}g_{t}^{\top}s_{t}+\frac{\tau_{t}^{2}}{2}s_{t}^{\top}H_{t}s_{t}\\
&=\tau_{t}|v_{t}|(\alpha-\delta_{t})-\frac{\tau_{t}^{2}}{2}\delta_{t}\|s_{t}\|^{2}-\frac{\tau_{t}^{2}}{2}v_{t}^{2}(\alpha-\delta_{t})\\
&\leqslant\tau_{t}v_{t}^{2}(\alpha-\delta_{t})-\frac{\tau_{t}^{2}}{2}\delta_{t}\|s_{t}\|^{2}-\frac{\tau_{t}^{2}}{2}v_{t}^{2}(\alpha-\delta_{t})\\
&=\Bigl(\tau_{t}-\frac{\tau_{t}^{2}}{2}\Bigr)v_{t}^{2}(\alpha-\delta_{t})-\frac{\tau_{t}^{2}}{2}\delta_{t}\|s_{t}\|^{2},
\end{aligned}$$

where the inequality follows from $|v_{t}|<\omega<1$ (so that $v_{t}^{2}\leqslant|v_{t}|$) and $\alpha-\delta_{t}<0$. Since $\tau_{t}\in(0,1]$, we have $\tau_{t}-\tau_{t}^{2}/2\geqslant 0$. Further combining this with (2.3c), we get

$$\left(\tau_{t}-\frac{\tau_{t}^{2}}{2}\right)(\alpha-\delta_{t})\leqslant 0. \tag{2.8}$$

Furthermore, using $\tau_{t}=\Lambda/\|s_{t}\|$ and (2.8), we get

$$E_{t}\leqslant-\frac{\tau_{t}^{2}}{2}\delta_{t}\|s_{t}\|^{2}=-\delta_{t}\frac{\Lambda^{2}}{2}\leqslant-\frac{\Lambda^{2}}{2}\alpha. \tag{2.9}$$

When $|v_{t}|<\omega$ and $g_{t}=0$, we have $E_{t}=\frac{\tau_{t}^{2}}{2}s_{t}^{\top}H_{t}s_{t}$. In this case, it can be similarly proven that

$$E_{t}\leqslant-\frac{\Lambda^{2}}{2}\,\alpha. \tag{2.10}$$

When $\omega\leqslant|v_{t}|\leqslant\sqrt{1/(1+\Lambda^{2})}$, we have $s_{t}=u_{t}/v_{t}$. Substituting this into (2.3b) yields

$$s_{t}^{\top}H_{t}s_{t}=-g_{t}^{\top}s_{t}-\delta_{t}\|s_{t}\|^{2},\qquad g_{t}^{\top}s_{t}=\alpha-\delta_{t}\leqslant 0. \tag{2.11}$$

Since $\tau_{t}\in(0,1]$, it holds that $\tau_{t}-\tau_{t}^{2}/2\geqslant 0$, and consequently

$$\left(\tau_{t}-\frac{\tau_{t}^{2}}{2}\right)g_{t}^{\top}s_{t}\leqslant 0. \tag{2.12}$$

Using (2.11) and (2.12), we obtain

$$\begin{aligned}
E_{t}&=\tau_{t}g_{t}^{\top}s_{t}+\frac{\tau_{t}^{2}}{2}s_{t}^{\top}H_{t}s_{t}=\Bigl(\tau_{t}-\frac{\tau_{t}^{2}}{2}\Bigr)g_{t}^{\top}s_{t}-\frac{\tau_{t}^{2}}{2}\delta_{t}\|s_{t}\|^{2}\\
&\leqslant-\frac{\tau_{t}^{2}}{2}\delta_{t}\|s_{t}\|^{2}\leqslant-\frac{\Lambda^{2}}{2}\alpha.
\end{aligned}\tag{2.13}$$

Combining (2.6) with (2.9), (2.10) and (2.13) completes the proof.
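The key intermediate bound in this proof, $E_{t}\leqslant-\frac{\Lambda^{2}}{2}\alpha$ whenever $|v_{t}|\leqslant\sqrt{1/(1+\Lambda^{2})}$, can be probed numerically. The following sketch is a sanity check on random data, not part of the analysis: the helper name `model_decrease`, the random instances, and the parameter values are all assumptions of this example, and the eigenproblem is solved exactly.

```python
import numpy as np

def model_decrease(g, H, alpha, omega, Lam):
    """Return E_t = tau*g^T s + 0.5*tau^2*s^T H s and |v_t| for an exact eigen-solve."""
    n = g.size
    G = np.block([[H, g[:, None]], [g[None, :], np.array([[-alpha]])]])
    _, vecs = np.linalg.eigh(G)
    u, v = vecs[:n, 0], vecs[n, 0]
    if abs(v) >= omega:
        s = u / v
    else:
        s = (1.0 if -(g @ u) >= 0 else -1.0) * u
    tau = Lam / np.linalg.norm(s)
    return tau * (g @ s) + 0.5 * tau**2 * (s @ H @ s), abs(v)

rng = np.random.default_rng(2)
alpha, omega, Lam = 0.2, 0.4, 0.5          # omega < 1/2 and Lam <= sqrt(2)/2, as in Lemma 2.5
threshold = np.sqrt(1.0 / (1.0 + Lam**2))

checked, worst_gap = 0, -np.inf
for _ in range(200):
    M = rng.standard_normal((4, 4))
    H = 0.5 * (M + M.T)
    g = rng.standard_normal(4)
    E, abs_v = model_decrease(g, H, alpha, omega, Lam)
    if abs_v <= threshold:                 # the case covered by Lemma 2.5
        checked += 1
        # The lemma's proof gives E + 0.5*Lam^2*alpha <= 0.
        worst_gap = max(worst_gap, E + 0.5 * Lam**2 * alpha)
```

A nonpositive `worst_gap` (up to rounding) over the sampled instances is consistent with (2.9), (2.10) and (2.13); it does not replace the proof, and the approximation errors $\varepsilon_{1},\varepsilon_{2}$ of Lemma 2.3 do not appear because the sketch works with $g_{t}$ and $H_{t}$ directly.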

We now consider the case where $|v_{t}|>\sqrt{1/(1+\Lambda^{2})}$. The subsequent lemma establishes explicit bounds for both $\|\nabla\mathcal{F}(x_{t+1})\|$ and $\nabla^{2}\mathcal{F}(x_{t+1})$.

Lemma 2.6

Under Assumption 2.1 and for the case $|v_{t}|>\sqrt{1/(1+\Lambda^{2})}$, we further assume $\Lambda\leqslant\sqrt{2}/2$. Then the following holds:

$$\big\|\nabla\mathcal{F}(x_{t+1})\big\|\leqslant 2(L_{1}+\alpha)\Lambda^{3}+\frac{L_{2}}{2}\Lambda^{2}+(\varepsilon_{2}+\alpha)\Lambda+3\varepsilon_{1}, \tag{2.14a}$$

$$\nabla^{2}\mathcal{F}(x_{t+1})\succcurlyeq-\Big\{2(L_{1}+\alpha)\Lambda^{2}+\alpha+\varepsilon_{2}+L_{H}\big[(1+\kappa)\Lambda+2A\big]\Big\}I. \tag{2.14b}$$

Proof

We first examine the case $g_{t}\neq 0$ to derive an upper bound for $\big\|\nabla\mathcal{F}(x_{t+1})\big\|$. The analysis begins by estimating $\|g_{t}\|$. Given the condition $|v_{t}|>\sqrt{1/(1+\Lambda^{2})}$, (2.3c) implies

$$\|s_{t}\|=\Big\|\frac{u_{t}}{v_{t}}\Big\|=\frac{\sqrt{1-|v_{t}|^{2}}}{|v_{t}|}<\Lambda. \tag{2.15}$$

Combining (2.3b) with the upper bound (2.15) yields

$$H_{t}s_{t}+g_{t}=-\delta_{t}s_{t}, \tag{2.16a}$$

$$\delta_{t}-\alpha=-g_{t}^{\top}s_{t}\leqslant\|g_{t}\|\,\|s_{t}\|\leqslant\Lambda\|g_{t}\|. \tag{2.16b}$$

Define the quadratic function

$$h(m):=m^{2}+\left(\frac{g_{t}^{\top}H_{t}g_{t}}{\|g_{t}\|^{2}}+\alpha\right)m-\|g_{t}\|^{2}.$$

The equation $h(m)=0$ has two real roots of opposite signs; denote its positive root by $m_{2}$. We now show that $h\left(\delta_{t}-\alpha\right)\geqslant 0$. To this end, consider the matrix

$$Q(k):=\begin{bmatrix}H_{t}+(k+\alpha)I&g_{t}\\ g_{t}^{\top}&k\end{bmatrix}.$$

From the optimality condition, we have $Q(\delta_{t}-\alpha)\succcurlyeq 0$ and $\delta_{t}-\alpha>0$. Applying the Schur complement with respect to the scalar block $k=\delta_{t}-\alpha$ gives

$$H_{t}+\delta_{t}I-\frac{1}{\delta_{t}-\alpha}g_{t}g_{t}^{\top}\succcurlyeq 0.$$

Premultiplying and postmultiplying the above inequality by the unit vector $g_{t}/\|g_{t}\|$ yields

$$\frac{g_{t}^{\top}H_{t}g_{t}}{\|g_{t}\|^{2}}+\delta_{t}-\frac{\|g_{t}\|^{2}}{\delta_{t}-\alpha}\geqslant 0.$$

Because $\delta_{t}-\alpha>0$, multiplying both sides by $\delta_{t}-\alpha$ leads to

$$(\delta_{t}-\alpha)\left(\frac{g_{t}^{\top}H_{t}g_{t}}{\|g_{t}\|^{2}}+(\delta_{t}-\alpha)+\alpha\right)-\|g_{t}\|^{2}=(\delta_{t}-\alpha)^{2}+\left(\frac{g_{t}^{\top}H_{t}g_{t}}{\|g_{t}\|^{2}}+\alpha\right)(\delta_{t}-\alpha)-\|g_{t}\|^{2}\geqslant 0,$$

which is exactly $h(\delta_{t}-\alpha)\geqslant 0$. Hence $\delta_{t}-\alpha\geqslant m_{2}$ (since $m_{2}$ is the positive root of $h(m)=0$ and $\delta_{t}-\alpha>0$). Combining this with the bound from (2.16b), we have $\Lambda\|g_{t}\|\geqslant\delta_{t}-\alpha\geqslant m_{2}$, and because $h$ is nonnegative on $[m_{2},\infty)$, we obtain

$$h(\Lambda\|g_{t}\|)=\Lambda^{2}\|g_{t}\|^{2}+\left(\frac{g_{t}^{\top}H_{t}g_{t}}{\|g_{t}\|^{2}}+\alpha\right)\Lambda\|g_{t}\|-\|g_{t}\|^{2}\geqslant 0.$$

Rearranging the inequality and using the condition $H_{t}\preccurlyeq L_{1}I$ (i.e., $g_{t}^{\top}H_{t}g_{t}/\|g_{t}\|^{2}\leqslant L_{1}$) together with $\Lambda\leqslant\sqrt{2}/2$, we obtain

$$\|g_{t}\|\leqslant\frac{\left(\frac{g_{t}^{\top}H_{t}g_{t}}{\|g_{t}\|^{2}}+\alpha\right)\Lambda}{1-\Lambda^{2}}\leqslant\frac{(L_{1}+\alpha)\Lambda}{1-\Lambda^{2}}\leqslant 2(L_{1}+\alpha)\Lambda. \tag{2.17}$$
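For completeness, the rearrangement behind (2.17) can be written out step by step. The shorthand $c_{t}$ is introduced here only for illustration and does not appear in the original argument.

```latex
\begin{aligned}
h(\Lambda\|g_t\|)\geqslant 0
&\iff \Lambda^{2}\|g_t\|^{2}+c_t\,\Lambda\|g_t\|-\|g_t\|^{2}\geqslant 0,
 \qquad c_t:=\frac{g_t^{\top}H_t g_t}{\|g_t\|^{2}}+\alpha,\\
&\iff (1-\Lambda^{2})\,\|g_t\|\leqslant c_t\,\Lambda
 \qquad (\text{divide by } \|g_t\|>0),\\
&\iff \|g_t\|\leqslant \frac{c_t\,\Lambda}{1-\Lambda^{2}}
 \qquad (1-\Lambda^{2}\geqslant \tfrac{1}{2} \text{ because } \Lambda\leqslant \sqrt{2}/2).
\end{aligned}
```

The remaining two inequalities then follow from $c_{t}\leqslant L_{1}+\alpha$ and $1/(1-\Lambda^{2})\leqslant 2$.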

From (2.15) and (2.16b), it follows that

$$\delta_{t}\|s_{t}\|=\bigl(\alpha+(\delta_{t}-\alpha)\bigr)\|s_{t}\|\leqslant(\alpha+\Lambda\|g_{t}\|)\|s_{t}\|\leqslant\alpha\Lambda+\Lambda^{2}\|g_{t}\|.$$

Combining this with (2.16a), we obtain

$$\|H_{t}s_{t}+g_{t}\|=\delta_{t}\|s_{t}\|\leqslant\alpha\Lambda+\Lambda^{2}\|g_{t}\|.$$

To bound $\|g_{t+1}\|$, we note that

$$\begin{aligned}
\|g_{t+1}\| &\leqslant\|g_{t+1}-H_{t}s_{t}-g_{t}\|+\|H_{t}s_{t}+g_{t}\|\\
&\leqslant\|g_{t+1}-H_{t}s_{t}-g_{t}\|+\alpha\Lambda+\|g_{t}\|\Lambda^{2}\\
&\leqslant\|g_{t+1}-H_{t}s_{t}-g_{t}\|+\alpha\Lambda+2(L_{1}+\alpha)\Lambda^{3},
\end{aligned}$$

where the last line uses inequality (2.17). Furthermore, using (2.15) together with Lemmas 2.2 and 2.3, we have

$$\begin{aligned}
\left\|g_{t+1}-H_{t}s_{t}-g_{t}\right\|\leqslant{}&\left\|\nabla\mathcal{F}(x_{t+1})-\nabla^{2}\mathcal{F}(x_{t})s_{t}-\nabla\mathcal{F}(x_{t})\right\|+\left\|g_{t+1}-\nabla\mathcal{F}(x_{t+1})\right\|\\
&+\left\|\nabla^{2}\mathcal{F}(x_{t})-H_{t}\right\|\left\|s_{t}\right\|+\left\|\nabla\mathcal{F}(x_{t})-g_{t}\right\|\\
\leqslant{}&\frac{L_{2}}{2}\left\|s_{t}\right\|^{2}+\varepsilon_{2}\left\|s_{t}\right\|+2\varepsilon_{1}\\
\leqslant{}&\frac{L_{2}}{2}\Lambda^{2}+\varepsilon_{2}\Lambda+2\varepsilon_{1}.
\end{aligned} \tag{2.18}$$

Consequently,

$$\|g_{t+1}\|\leqslant 2(L_{1}+\alpha)\Lambda^{3}+\frac{L_{2}}{2}\Lambda^{2}+(\varepsilon_{2}+\alpha)\Lambda+2\varepsilon_{1}.$$

Finally, applying Lemma 2.3 gives

$$\left\|\nabla\mathcal{F}(x_{t+1})\right\|\leqslant\left\|g_{t+1}\right\|+\varepsilon_{1}\leqslant 2(L_{1}+\alpha)\Lambda^{3}+\frac{L_{2}}{2}\Lambda^{2}+(\varepsilon_{2}+\alpha)\Lambda+3\varepsilon_{1}. \tag{2.19}$$

We now establish a lower bound for the Hessian $\nabla^{2}\mathcal{F}(x_{t+1})$. From (2.3a), we have $H_{t}+\delta_{t}I\succcurlyeq 0$. Combining this with (2.16b) and (2.17) yields

$$H_{t}\succcurlyeq-\delta_{t}I\succcurlyeq-(\Lambda\|g_{t}\|+\alpha)I\succcurlyeq-2(L_{1}+\alpha)\Lambda^{2}I-\alpha I. \tag{2.20}$$

To bound $H_{t+1}$, we first use the $L_{H}$-Lipschitz continuity of $H(x,y)$:

$$\begin{aligned}
H_{t+1} &\succcurlyeq H_{t}-\|H_{t+1}-H_{t}\|I\\
&\succcurlyeq H_{t}-L_{H}\bigl(\|x_{t+1}-x_{t}\|+\|y_{t+1}-y_{t}\|\bigr)I.
\end{aligned}$$

Since $x_{t+1}=x_{t}+s_{t}$ with $\|s_{t}\|<\Lambda$, and using the $\kappa$-Lipschitz continuity of $y^{\star}(x)$ together with the bound $\|y_{t}-y^{\star}(x_{t})\|\leqslant A$ from Lemma 2.3, we obtain

$$\begin{aligned}
H_{t+1} &\succcurlyeq H_{t}-L_{H}\Bigl(\|s_{t}\|+\|y_{t+1}-y^{\star}(x_{t+1})\|+\|y^{\star}(x_{t+1})-y^{\star}(x_{t})\|+\|y^{\star}(x_{t})-y_{t}\|\Bigr)I\\
&\succcurlyeq H_{t}-L_{H}\bigl[(1+\kappa)\|s_{t}\|+2A\bigr]I\\
&\succcurlyeq H_{t}-L_{H}\bigl[(1+\kappa)\Lambda+2A\bigr]I.
\end{aligned} \tag{2.21}$$

Substituting the lower bound for $H_{t}$ from (2.20) into (2.21) gives

$$H_{t+1}\succcurlyeq-2(L_{1}+\alpha)\Lambda^{2}I-\alpha I-L_{H}\left[(1+\kappa)\Lambda+2A\right]I. \tag{2.22}$$

Finally, applying the Hessian approximation error from Lemma 2.3, we obtain the desired lower bound for the exact Hessian:

$$\begin{aligned}
\nabla^{2}\mathcal{F}(x_{t+1}) &\succcurlyeq H_{t+1}-\varepsilon_{2}I\\
&\succcurlyeq-\Bigl(2(L_{1}+\alpha)\Lambda^{2}+\alpha+\varepsilon_{2}+L_{H}\bigl[(1+\kappa)\Lambda+2A\bigr]\Bigr)I.
\end{aligned} \tag{2.23}$$

We now consider the case $g_{t}=0$. The argument for establishing an upper bound on $\|\nabla\mathcal{F}(x_{t+1})\|$ proceeds analogously to the case $g_{t}\neq 0$, with the simplification that $g_{t}=0$. Therefore, we obtain

$$\|g_{t+1}\|\leqslant\frac{L_{2}}{2}\Lambda^{2}+(\varepsilon_{2}+\alpha)\Lambda+2\varepsilon_{1}.$$

Applying Lemma 2.3 then yields

$$\left\|\nabla\mathcal{F}(x_{t+1})\right\|\leqslant\left\|g_{t+1}\right\|+\varepsilon_{1}\leqslant\frac{L_{2}}{2}\Lambda^{2}+(\varepsilon_{2}+\alpha)\Lambda+3\varepsilon_{1}. \tag{2.24}$$

Combining (2.19) with (2.24), we have (2.14a). The lower bound for the Hessian proceeds analogously. Under $g_{t}=0$, we have

$$H_{t+1}\succcurlyeq-\alpha I-L_{H}\bigl[(1+\kappa)\Lambda+2A\bigr]I.$$

Consequently,

$$\nabla^{2}\mathcal{F}(x_{t+1})\succcurlyeq-\Bigl(\alpha+\varepsilon_{2}+L_{H}\bigl[(1+\kappa)\Lambda+2A\bigr]\Bigr)I. \tag{2.25}$$

Finally, combining the Hessian bound (2.23) for the case $g_{t}\neq 0$ with the bound (2.25) for the case $g_{t}=0$ establishes (2.14b), which completes the proof.

We are now prepared to establish the iteration complexity of the HSDA algorithm. Let $\varepsilon>0$ be the target accuracy. We define the first iteration at which an $\mathcal{O}(\varepsilon,\sqrt{\varepsilon})$-second-order stationary point is reached as

$$T(\varepsilon):=\min\Bigl\{t\,\Bigm|\,\|\nabla\mathcal{F}(x_{t+1})\|\leqslant c_{1}\varepsilon\ \text{and}\ \nabla^{2}\mathcal{F}(x_{t+1})\succcurlyeq-c_{2}\sqrt{\varepsilon}\,I\Bigr\}, \tag{2.26}$$

where $c_{1}$ and $c_{2}$ are positive constants independent of $\varepsilon$. In parallel, we introduce a verifiable stopping index based on the eigenvector component $v_{t}$:

$$\widetilde{T}(\varepsilon):=\min\Bigl\{t\,\Bigm|\,|v_{t}|>\sqrt{1/(1+\Lambda^{2})}\Bigr\}. \tag{2.27}$$

The following theorem shows that once $|v_{t}|>\sqrt{1/(1+\Lambda^{2})}$, the next iterate $x_{t+1}=x_{t}+s_{t}$ is already an $\mathcal{O}(\varepsilon,\sqrt{\varepsilon})$-second-order stationary point; consequently, $T(\varepsilon)\leqslant\widetilde{T}(\varepsilon)$.
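The stopping test in (2.27) can be read as a step-length test. The short derivation below is a sketch under two assumptions that are not restated in this passage: the eigenvector $[u_{t};v_{t}]$ is normalized to unit length (as in (3.2) for the inexact variant), and the step taken is $s_{t}=u_{t}/v_{t}$ with $v_{t}\neq 0$.

```latex
|v_t|>\sqrt{\frac{1}{1+\Lambda^{2}}}
\iff \frac{1}{v_t^{2}}<1+\Lambda^{2}
\iff \frac{1-v_t^{2}}{v_t^{2}}<\Lambda^{2}
\iff \frac{\|u_t\|^{2}}{v_t^{2}}<\Lambda^{2}
\iff \Bigl\|\frac{u_t}{v_t}\Bigr\|<\Lambda .
```

Under these assumptions the condition $|v_{t}|>\sqrt{1/(1+\Lambda^{2})}$ says exactly that the unscaled step already lies strictly inside the ball of radius $\Lambda$, which is consistent with the bound $\|s_{t}\|<\Lambda$ used in the proof of the preceding lemma.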

Theorem 2.1

Suppose Assumption 2.1 holds, and set the parameters as $\alpha=\sqrt{L_{2}\varepsilon}$, $\varepsilon_{1}=\varepsilon/12$, $\varepsilon_{2}=\sqrt{L_{2}\varepsilon}/12$, $\Lambda=\sqrt{\varepsilon/L_{2}}$, $\omega\in(0,1/2)$, with $0<\varepsilon\leqslant\min\{L_{2}/2,\,1\}$. Then the iterate $x_{\widetilde{T}(\varepsilon)+1}$ satisfies

$$\begin{aligned}
&\|\nabla\mathcal{F}(x_{\widetilde{T}(\varepsilon)+1})\|\leqslant\Bigl(\frac{\sqrt{2}\,L_{1}}{L_{2}}+\frac{17}{6}\Bigr)\varepsilon,\\
&\nabla^{2}\mathcal{F}(x_{\widetilde{T}(\varepsilon)+1})\succcurlyeq-\Bigl[\frac{\sqrt{2}\,L_{1}}{\sqrt{L_{2}}}+\frac{13}{6}\sqrt{L_{2}}+\frac{L_{H}(1+\kappa)}{\sqrt{L_{2}}}\Bigr]\sqrt{\varepsilon}\,I,
\end{aligned} \tag{2.28}$$

and furthermore,

$$T(\varepsilon)\ \leqslant\ \widetilde{T}(\varepsilon)\ \leqslant\ 1+\frac{24\sqrt{L_{2}}}{5}\bigl(\mathcal{F}(x_{1})-\mathcal{F}_{\inf}\bigr)\varepsilon^{-3/2}. \tag{2.29}$$
Proof

We first prove the stationarity bounds in (2.28). By definition of $\widetilde{T}(\varepsilon)$, we have $|v_{\widetilde{T}(\varepsilon)}|>\sqrt{1/(1+\Lambda^{2})}$. Applying Lemma 2.6 under this condition yields

$$\|\nabla\mathcal{F}(x_{\widetilde{T}(\varepsilon)+1})\|\leqslant 2(L_{1}+\alpha)\Lambda^{3}+\frac{L_{2}}{2}\Lambda^{2}+(\varepsilon_{2}+\alpha)\Lambda+3\varepsilon_{1}.$$

Substituting the parameter values specified in Theorem 2.1 into the gradient bound, we obtain

$$\begin{aligned}
\|\nabla\mathcal{F}(x_{\widetilde{T}(\varepsilon)+1})\| &\leqslant 2\bigl(L_{1}+\sqrt{L_{2}\varepsilon}\bigr)\Bigl(\frac{\varepsilon}{L_{2}}\Bigr)^{3/2}+\frac{L_{2}}{2}\Bigl(\frac{\varepsilon}{L_{2}}\Bigr)+\Bigl(\frac{\sqrt{L_{2}\varepsilon}}{12}+\sqrt{L_{2}\varepsilon}\Bigr)\sqrt{\frac{\varepsilon}{L_{2}}}+\frac{\varepsilon}{4}\\
&=\frac{2L_{1}}{L_{2}^{3/2}}\varepsilon^{3/2}+\frac{2}{L_{2}}\varepsilon^{2}+\frac{11}{6}\varepsilon.
\end{aligned}$$

Since $0<\varepsilon\leqslant\min\{L_{2}/2,\,1\}$, the first two terms are bounded by $(\sqrt{2}\,L_{1}/L_{2})\,\varepsilon$ and $\varepsilon$, respectively, so that

$$\|\nabla\mathcal{F}(x_{\widetilde{T}(\varepsilon)+1})\|\leqslant\Bigl(\frac{\sqrt{2}\,L_{1}}{L_{2}}+\frac{17}{6}\Bigr)\varepsilon.$$
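The passage from the $\varepsilon^{3/2}$ and $\varepsilon^{2}$ terms to a bound that is linear in $\varepsilon$ uses only $\varepsilon\leqslant L_{2}/2$. A worked version of that step, written out here for illustration:

```latex
\begin{aligned}
\frac{2L_1}{L_2^{3/2}}\,\varepsilon^{3/2}
 &=\frac{2L_1}{L_2^{3/2}}\,\sqrt{\varepsilon}\,\varepsilon
 \leqslant \frac{2L_1}{L_2^{3/2}}\sqrt{\frac{L_2}{2}}\;\varepsilon
 =\frac{\sqrt{2}\,L_1}{L_2}\,\varepsilon,\\
\frac{2}{L_2}\,\varepsilon^{2}
 &\leqslant \frac{2}{L_2}\cdot\frac{L_2}{2}\,\varepsilon=\varepsilon,\\
\frac{\sqrt{2}\,L_1}{L_2}\,\varepsilon+\varepsilon+\frac{11}{6}\,\varepsilon
 &=\Bigl(\frac{\sqrt{2}\,L_1}{L_2}+\frac{17}{6}\Bigr)\varepsilon .
\end{aligned}
```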

Moreover, Lemma 2.6 provides the Hessian lower bound

$$\nabla^{2}\mathcal{F}(x_{\widetilde{T}(\varepsilon)+1})\succcurlyeq-\Bigl\{2(L_{1}+\alpha)\Lambda^{2}+\alpha+\varepsilon_{2}+L_{H}\bigl[(1+\kappa)\Lambda+2A\bigr]\Bigr\}I,$$

with $A=\min\{\varepsilon_{1}/\ell_{1},\ \varepsilon_{2}/(2L_{H})\}$. Since $A\leqslant\varepsilon_{2}/(2L_{H})$, we have $L_{H}[(1+\kappa)\Lambda+2A]\leqslant L_{H}(1+\kappa)\Lambda+\varepsilon_{2}$. Substituting the parameter choices from Theorem 2.1 into the above bound gives

$$\begin{aligned}
&2(L_{1}+\alpha)\Lambda^{2}+\alpha+\varepsilon_{2}+L_{H}\bigl[(1+\kappa)\Lambda+2A\bigr]\\
\leqslant{}&2\bigl(L_{1}+\sqrt{L_{2}\varepsilon}\bigr)\frac{\varepsilon}{L_{2}}+\sqrt{L_{2}\varepsilon}+2\cdot\frac{\sqrt{L_{2}\varepsilon}}{12}+\frac{L_{H}(1+\kappa)}{\sqrt{L_{2}}}\sqrt{\varepsilon}.
\end{aligned}$$

Since $0<\varepsilon\leqslant\min\{L_{2}/2,\,1\}$, the right-hand side of the previous inequality can be simplified, yielding

$$\nabla^{2}\mathcal{F}(x_{\widetilde{T}(\varepsilon)+1})\succcurlyeq-\Bigl[\frac{\sqrt{2}\,L_{1}}{\sqrt{L_{2}}}+\frac{13}{6}\sqrt{L_{2}}+\frac{L_{H}(1+\kappa)}{\sqrt{L_{2}}}\Bigr]\sqrt{\varepsilon}\,I,$$

which establishes the Hessian bound in (2.28). Consequently, we have $T(\varepsilon)\leqslant\widetilde{T}(\varepsilon)$.

We now proceed to bound $\widetilde{T}(\varepsilon)$. By the definition of $\widetilde{T}(\varepsilon)$, for any $t<\widetilde{T}(\varepsilon)$, we have $|v_{t}|\leqslant\sqrt{1/(1+\Lambda^{2})}$. Hence Lemma 2.5 is applicable and gives

$$\mathcal{F}(x_{t+1})-\mathcal{F}(x_{t})\leqslant\Lambda\varepsilon_{1}+(\varepsilon_{2}-\alpha)\frac{\Lambda^{2}}{2}+\frac{L_{2}}{6}\Lambda^{3}.$$

Substituting the parameter choices from Theorem 2.1 into this inequality leads to the per-iteration decrease

$$\mathcal{F}(x_{t+1})-\mathcal{F}(x_{t})\leqslant-\frac{5}{24}\,\frac{\varepsilon^{3/2}}{\sqrt{L_{2}}},\qquad\forall\,t<\widetilde{T}(\varepsilon). \tag{2.30}$$
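The constant $-5/24$ in (2.30) comes from evaluating the three terms of the Lemma 2.5 bound at $\alpha=\sqrt{L_{2}\varepsilon}$, $\varepsilon_{1}=\varepsilon/12$, $\varepsilon_{2}=\sqrt{L_{2}\varepsilon}/12$ and $\Lambda=\sqrt{\varepsilon/L_{2}}$. The arithmetic, spelled out for illustration:

```latex
\begin{aligned}
\Lambda\varepsilon_1 &= \sqrt{\frac{\varepsilon}{L_2}}\cdot\frac{\varepsilon}{12}
 = \frac{1}{12}\,\frac{\varepsilon^{3/2}}{\sqrt{L_2}},\\
(\varepsilon_2-\alpha)\frac{\Lambda^{2}}{2} &= -\frac{11}{12}\sqrt{L_2\varepsilon}\cdot\frac{\varepsilon}{2L_2}
 = -\frac{11}{24}\,\frac{\varepsilon^{3/2}}{\sqrt{L_2}},\\
\frac{L_2}{6}\Lambda^{3} &= \frac{L_2}{6}\cdot\frac{\varepsilon^{3/2}}{L_2^{3/2}}
 = \frac{1}{6}\,\frac{\varepsilon^{3/2}}{\sqrt{L_2}},\\
\frac{1}{12}-\frac{11}{24}+\frac{1}{6} &= \frac{2-11+4}{24} = -\frac{5}{24}.
\end{aligned}
```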

Summing (2.30) over $t=1,\ldots,\widetilde{T}(\varepsilon)-1$ and noting that the total possible decrease in $\mathcal{F}$ is at most $\mathcal{F}(x_{1})-\mathcal{F}_{\inf}$, we obtain

$$\mathcal{F}(x_{1})-\mathcal{F}_{\inf}\geqslant\sum_{t=1}^{\widetilde{T}(\varepsilon)-1}\bigl(\mathcal{F}(x_{t})-\mathcal{F}(x_{t+1})\bigr)\geqslant(\widetilde{T}(\varepsilon)-1)\,\frac{5}{24}\,\frac{\varepsilon^{3/2}}{\sqrt{L_{2}}}.$$

Rearranging gives the desired upper bound on $\widetilde{T}(\varepsilon)$ in (2.29). Together with the already proved relation $T(\varepsilon)\leqslant\widetilde{T}(\varepsilon)$, the proof is complete.
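To make the parameter choices and the bound (2.29) concrete, the following minimal Python sketch evaluates them numerically. The function names and the sample values of `L2`, `eps` and `gap` (standing for $\mathcal{F}(x_{1})-\mathcal{F}_{\inf}$) are illustrative assumptions, not part of the paper.

```python
import math

def hsda_parameters(L2, eps):
    """Parameter choices of Theorem 2.1 for a target accuracy eps."""
    if not 0 < eps <= min(L2 / 2, 1.0):
        raise ValueError("Theorem 2.1 requires 0 < eps <= min(L2/2, 1)")
    return {
        "alpha": math.sqrt(L2 * eps),
        "eps1": eps / 12,
        "eps2": math.sqrt(L2 * eps) / 12,
        "Lam": math.sqrt(eps / L2),
    }

def per_iteration_decrease(L2, eps):
    """Right-hand side of the Lemma 2.5 bound under those parameters."""
    p = hsda_parameters(L2, eps)
    Lam = p["Lam"]
    return Lam * p["eps1"] + (p["eps2"] - p["alpha"]) * Lam**2 / 2 + L2 / 6 * Lam**3

def iteration_bound(L2, eps, gap):
    """Upper bound (2.29) on the stopping index; gap plays the role of F(x_1) - F_inf."""
    return 1 + 24 * math.sqrt(L2) / 5 * gap * eps ** (-1.5)

# Illustrative values only.
L2, eps, gap = 4.0, 1e-2, 10.0
print(per_iteration_decrease(L2, eps), -5 / 24 * eps**1.5 / math.sqrt(L2))
print(iteration_bound(L2, eps, gap))
```

The two numbers on the first printed line should agree up to rounding, which is a quick numerical check of the constant in (2.30).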

Remark 2.1

The bound (2.29) in Theorem 2.1 shows that HSDA attains an $\mathcal{O}(\varepsilon,\sqrt{\varepsilon})$-second-order stationary point within $\mathcal{O}(\varepsilon^{-3/2})$ outer iterations. This iteration complexity matches the best known rates for second-order methods in nonconvex-strongly concave minimax optimization; see, for example, luo2022finding ; wang2025gradient .

3 Inexact Homogeneous Second-Order Descent Ascent Algorithm

The complexity analysis in Section 2 assumes that the homogenized eigenvalue subproblem (2.1) is solved exactly at every outer iteration. In practice, however, this assumption can be prohibitive: for large-scale problems, computing the smallest eigenpair of the homogenized matrix $G_{t}(\alpha)$ typically requires expensive matrix factorizations or many iterations of a Krylov-type eigensolver. To overcome this limitation, we propose in this section an inexact homogeneous second-order descent ascent (IHSDA) algorithm, which solves the homogenized subproblem only approximately via a Lanczos procedure with a carefully controlled residual. We prove that IHSDA retains the single-loop structure of HSDA and achieves the same outer-iteration complexity.

Unlike the exact HSDA method, IHSDA avoids solving the homogenized eigenvalue subproblem (2.1) exactly. It instead employs a Lanczos procedure to obtain an approximate solution, which comprises two main steps:

By substituting Step 3 of the HSDA algorithm with the inexact Lanczos procedure described above, we obtain the complete IHSDA algorithm for solving problem (P). The detailed algorithm is formally stated in Algorithm 2.

Algorithm 2 Inexact Homogeneous Second-Order Descent Ascent (IHSDA) Algorithm

Step 1: Input $x_{1}$, $y_{0}$, $\eta_{1}>0$, $\eta_{2}>0$, $L_{1}>0$, $L_{2}>0$, $B_{g}>0$, $\omega\in(1/4,1/2)$, $\{N_{t}\geqslant 1\}$, $\varepsilon>0$, $\Lambda>0$, and set $t=1$.

Step 2: Update $y_{t}$:

(2a): Set $i=0$, $y_{i}^{t}=\tilde{y}_{i}^{t}=y_{t-1}$.

(2b): Update $y_{i}^{t}$ and $\tilde{y}_{i}^{t}$:

$$\begin{aligned}
y_{i+1}^{t} &=\tilde{y}_{i}^{t}+\eta_{1}\nabla_{y}f\big(x_{t},\tilde{y}_{i}^{t}\big),\\
\tilde{y}_{i+1}^{t} &=y_{i+1}^{t}+\eta_{2}\big(y_{i+1}^{t}-y_{i}^{t}\big).
\end{aligned}$$

(2c): If $i\geqslant N_{t}-1$, set $y_{t}=y_{N_{t}}^{t}$ and go to Step 3; otherwise set $i=i+1$ and go to Step (2b).

Step 3: Compute

$$g_{t}=\nabla_{x}f(x_{t},y_{t}),\qquad H_{t}=\big[\nabla^{2}_{xx}f-\nabla^{2}_{xy}f(\nabla^{2}_{yy}f)^{-1}\nabla^{2}_{yx}f\big](x_{t},y_{t}).$$

Set $e_{t}=\sqrt{L_{2}\varepsilon}$ and $\alpha_{t}=\sqrt{L_{2}\varepsilon}$, and define $G_{t}(\alpha_{t}):=\begin{bmatrix}H_{t}&g_{t}\\ g_{t}^{\top}&-\alpha_{t}\end{bmatrix}$.

(3a) Apply (zhang2025homogeneous, Algorithm 4) to compute a Ritz pair $(-\zeta_{t},[\hat{u}_{t};\hat{v}_{t}])$ of $G_{t}(\alpha_{t})$ with Ritz residual $[k_{t};\varrho_{t}]$, which satisfies

$$G_{t}(\alpha_{t})\begin{bmatrix}\hat{u}_{t}\\ \hat{v}_{t}\end{bmatrix}+\zeta_{t}\begin{bmatrix}\hat{u}_{t}\\ \hat{v}_{t}\end{bmatrix}=\begin{bmatrix}k_{t}\\ \varrho_{t}\end{bmatrix},\quad|\delta_{t}-\zeta_{t}|\leqslant e_{t},\quad k_{t}^{\top}\hat{u}_{t}+\varrho_{t}\hat{v}_{t}=0,\quad\big\|[\hat{u}_{t};\hat{v}_{t}]\big\|=1. \tag{3.2}$$

If $|\hat{v}_{t}|\leqslant\sqrt{1/(1+\Lambda^{2})}$, go to Step 4; otherwise, continue with Step (3b).

(3b) If $\|k_{t}\|\leqslant\varepsilon/2$, set $x_{t+1}=x_{t}+\dfrac{\hat{u}_{t}}{\hat{v}_{t}}$ and terminate. Otherwise, set

$$\alpha_{t}=3\sqrt{L_{2}\varepsilon}+2\|g_{t}\|\Lambda+(L_{1}+\zeta_{t})\Lambda^{2},\qquad e_{t}=\min\!\left\{\frac{\varepsilon}{4},\ \frac{\sqrt{L_{2}}\,\varepsilon^{5/2}}{64\,(L_{1}+\alpha_{t}+B_{g})^{2}}\right\},$$

and go to Step (3a).

Step 4: Update the direction $s_{t}$:

$$s_{t}=\begin{cases}\dfrac{\hat{u}_{t}}{\hat{v}_{t}},&\qquad|\hat{v}_{t}|\geqslant\omega,\\[9.0pt] \operatorname{sgn}\!\big(-g_{t}^{\top}\hat{u}_{t}\big)\hat{u}_{t},&\qquad|\hat{v}_{t}|<\omega.\end{cases}$$

Step 5: Compute $\tau_{t}=\Lambda/\|s_{t}\|$, update $x_{t+1}=x_{t}+\tau_{t}s_{t}$, set $t=t+1$, and go to Step 2.
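The building blocks of one outer iteration (Steps 2 to 5, along the branch that goes to Step 4) can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: a dense `numpy.linalg.eigh` call stands in for the Lanczos routine of Step (3a), the residual test and restart of Step (3b) are omitted, and all function names, as well as the convention $\operatorname{sgn}(0)=+1$, are hypothetical choices made here.

```python
import numpy as np

def ascent_inner_loop(grad_y, x, y_prev, eta1, eta2, n_steps):
    """Step 2: n_steps accelerated gradient-ascent updates on y, warm-started at y_prev."""
    y = y_prev.copy()
    y_tilde = y_prev.copy()
    for _ in range(n_steps):
        y_next = y_tilde + eta1 * grad_y(x, y_tilde)  # ascent step from the extrapolated point
        y_tilde = y_next + eta2 * (y_next - y)        # momentum extrapolation
        y = y_next
    return y

def homogenized_matrix(H, g, alpha):
    """G_t(alpha) = [[H, g], [g^T, -alpha]] as in Step 3."""
    return np.block([[H, g[:, None]], [g[None, :], np.array([[-alpha]])]])

def leftmost_eigvec(G):
    """Dense stand-in for the Lanczos call (an assumption, not the paper's solver).

    Returns (zeta, u_hat, v_hat), where -zeta is the smallest eigenvalue of G
    and [u_hat; v_hat] is a corresponding unit eigenvector.
    """
    vals, vecs = np.linalg.eigh(G)  # eigenvalues are returned in ascending order
    w = vecs[:, 0]
    return -vals[0], w[:-1], w[-1]

def direction(u_hat, v_hat, g, omega):
    """Step 4: u_hat / v_hat if |v_hat| >= omega, otherwise a signed copy of u_hat."""
    if abs(v_hat) >= omega:
        return u_hat / v_hat
    # Convention chosen here: sgn(0) is taken as +1 so the direction never vanishes.
    sign = 1.0 if -(g @ u_hat) >= 0.0 else -1.0
    return sign * u_hat

def normalized_step(x, s, Lam):
    """Step 5: x + tau * s with tau = Lam / ||s||."""
    return x + (Lam / np.linalg.norm(s)) * s

# Toy data, for illustration only.
rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
H, g = (M + M.T) / 2, rng.standard_normal(4)
zeta, u_hat, v_hat = leftmost_eigvec(homogenized_matrix(H, g, alpha=0.3))
s = direction(u_hat, v_hat, g, omega=0.4)
x_next = normalized_step(np.zeros(4), s, Lam=0.1)
print(zeta, np.linalg.norm(x_next))
```

With an exact eigensolver, the smallest eigenvalue of $G_{t}(\alpha_{t})$ is at most its last diagonal entry $-\alpha_{t}$, so $\zeta_{t}\geqslant\alpha_{t}$ holds deterministically in this sketch; for the Lanczos-based solver the paper establishes the corresponding inequality only with high probability.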

In the following subsection, we prove that IHSDA finds an $\mathcal{O}(\varepsilon,\sqrt{\varepsilon})$-second-order stationary point of $\mathcal{F}(x)$ for problem (P) with an outer iteration complexity of $\mathcal{O}(\varepsilon^{-3/2})$. Furthermore, we derive a high-probability upper bound of $\tilde{\mathcal{O}}\big(\varepsilon^{-7/4}\big)$ on the total number of Hessian-vector products.

3.1 Complexity Analysis

For our subsequent analysis, we adopt the following standard assumption commonly used in the complexity analysis of second-order methods Cartis2011ARC ; Royer2018ComplexityAO .

Assumption 3.1

There exists a constant $B_{g}>0$, independent of $t$, such that

$$\|g(x_{t},y_{t})\|\leqslant B_{g},\qquad\forall\,t\geqslant 1.$$

We now proceed to derive a quantitative decrease bound for the value function (1.1) under the condition $|\hat{v}_{t}|\leqslant\sqrt{1/(1+\Lambda^{2})}$.

Lemma 3.1

Under Assumption 2.1, let $\omega\in(1/4,1/2)$ and $\Lambda\leqslant\sqrt{2}/2$. Then for any $\varepsilon>0$, and whenever $|\hat{v}_{t}|\leqslant\sqrt{1/(1+\Lambda^{2})}$, the following decrease bound holds with probability at least $1-4p$ (where $p\in(\exp(-n),1)$):

$$\mathcal{F}(x_{t+1})-\mathcal{F}(x_{t})\leqslant 4|\varrho_{t}|-\frac{\alpha_{t}}{2}\Lambda^{2}+\Lambda\varepsilon_{1}+\frac{\Lambda^{2}}{2}\varepsilon_{2}+\frac{L_{2}}{6}\Lambda^{3}.$$

Proof

Let $E_{t}:=\tau_{t}g_{t}^{\top}s_{t}+\frac{\tau_{t}^{2}}{2}s_{t}^{\top}H_{t}s_{t}$. As in the analysis of HSDA, we have $\tau_{t}\in(0,1]$ and

$$\mathcal{F}(x_{t+1})-\mathcal{F}(x_{t})\leqslant E_{t}+\Lambda\varepsilon_{1}+\frac{\Lambda^{2}}{2}\varepsilon_{2}+\frac{L_{2}}{6}\Lambda^{3}. \tag{3.3}$$

Case 1: $|\hat{v}_{t}|\leqslant\omega$. From (3.1) we obtain

$$\hat{u}_{t}^{\top}H_{t}\hat{u}_{t}=k_{t}^{\top}\hat{u}_{t}-\zeta_{t}\|\hat{u}_{t}\|^{2}-\hat{v}_{t}g_{t}^{\top}\hat{u}_{t},\quad g_{t}^{\top}\hat{u}_{t}=\varrho_{t}+\hat{v}_{t}(\alpha_{t}-\zeta_{t}).$$

Since $s_{t}=\operatorname{sgn}(-g_{t}^{\top}\hat{u}_{t})\hat{u}_{t}$, we have $\|s_{t}\|=\|\hat{u}_{t}\|$ and, with $\tau_{t}\|s_{t}\|=\Lambda$, also $\tau_{t}\|\hat{u}_{t}\|=\Lambda$. Therefore,

$$\begin{aligned}
E_{t} &=\tau_{t}g_{t}^{\top}s_{t}+\frac{\tau_{t}^{2}}{2}s_{t}^{\top}H_{t}s_{t}\\
&=\tau_{t}\operatorname{sgn}(-g_{t}^{\top}\hat{u}_{t})g_{t}^{\top}\hat{u}_{t}+\frac{1}{2}\tau_{t}^{2}\hat{u}_{t}^{\top}H_{t}\hat{u}_{t}\\
&=-\tau_{t}|g_{t}^{\top}\hat{u}_{t}|+\frac{1}{2}\tau_{t}^{2}k_{t}^{\top}\hat{u}_{t}-\frac{1}{2}\tau_{t}^{2}\hat{v}_{t}g_{t}^{\top}\hat{u}_{t}-\frac{1}{2}\tau_{t}^{2}\zeta_{t}\|\hat{u}_{t}\|^{2}\\
&\leqslant-\tau_{t}|g_{t}^{\top}\hat{u}_{t}|+\frac{1}{2}\tau_{t}^{2}k_{t}^{\top}\hat{u}_{t}+\frac{1}{2}\tau_{t}^{2}|\hat{v}_{t}|\,|g_{t}^{\top}\hat{u}_{t}|-\frac{1}{2}\tau_{t}^{2}\zeta_{t}\|\hat{u}_{t}\|^{2}\\
&=-\frac{1}{2}\tau_{t}^{2}\hat{v}_{t}\varrho_{t}-\Bigl(\tau_{t}-\frac{1}{2}\tau_{t}^{2}|\hat{v}_{t}|\Bigr)|g_{t}^{\top}\hat{u}_{t}|-\frac{\zeta_{t}}{2}\Lambda^{2}.
\end{aligned}$$

Since $\tau_{t}\leqslant 1$ and $|\hat{v}_{t}|\leqslant\omega<1$, we have $\tau_{t}^{2}|\hat{v}_{t}|\leqslant\tau_{t}\leqslant 1$. Combining this with (3.3) and (3.1) yields

$$\mathcal{F}(x_{t+1})-\mathcal{F}(x_{t})\leqslant|\varrho_{t}|-\frac{\zeta_{t}}{2}\Lambda^{2}+\Lambda\varepsilon_{1}+\frac{\Lambda^{2}}{2}\varepsilon_{2}+\frac{L_{2}}{6}\Lambda^{3}. \tag{3.5}$$

Case 2: $\omega\leqslant|\hat{v}_{t}|\leqslant\sqrt{1/(1+\Lambda^{2})}$. Here $s_{t}=\hat{u}_{t}/\hat{v}_{t}$. Using (3.1) we obtain

$$s_{t}^{\top}H_{t}s_{t}+g_{t}^{\top}s_{t}=-\zeta_{t}\|s_{t}\|^{2}+\frac{k_{t}^{\top}\hat{u}_{t}}{\hat{v}^{2}_{t}},\quad g_{t}^{\top}s_{t}=-\zeta_{t}+\alpha_{t}+\frac{\varrho_{t}}{\hat{v}_{t}}.$$

From the orthogonality relation $k_{t}^{\top}\hat{u}_{t}+\varrho_{t}\hat{v}_{t}=0$ in (3.1), it follows that

$$\begin{aligned}
E_{t} &=\tau_{t}g_{t}^{\top}s_{t}+\frac{\tau_{t}^{2}}{2}s_{t}^{\top}H_{t}s_{t}\\
&=\tau_{t}g_{t}^{\top}s_{t}+\frac{1}{2}\tau_{t}^{2}\left(\frac{k_{t}^{\top}\hat{u}_{t}}{\hat{v}_{t}^{2}}-g_{t}^{\top}s_{t}-\zeta_{t}\|s_{t}\|^{2}\right)\\
&=\left(\tau_{t}-\frac{1}{2}\tau_{t}^{2}\right)\!\left(\frac{\varrho_{t}}{\hat{v}_{t}}+\alpha_{t}-\zeta_{t}\right)+\frac{\tau_{t}^{2}}{2}\!\left(\frac{k_{t}^{\top}\hat{u}_{t}}{\hat{v}_{t}^{2}}\right)-\frac{\zeta_{t}}{2}\Lambda^{2}\\
&=\left(\tau_{t}-\frac{1}{2}\tau_{t}^{2}\right)(\alpha_{t}-\zeta_{t})-(\tau_{t}^{2}-\tau_{t})\frac{\varrho_{t}}{\hat{v}_{t}}-\frac{\zeta_{t}}{2}\Lambda^{2}.
\end{aligned}$$

Since $\tau_{t}\in(0,1]$ and $|\hat{v}_{t}|\geqslant\omega\geqslant 1/4$,

$$-(\tau_{t}^{2}-\tau_{t})\frac{\varrho_{t}}{\hat{v}_{t}}\leqslant\left|\frac{\varrho_{t}}{\omega}\right|\leqslant 4|\varrho_{t}|.$$

Hence,

$$E_{t}\leqslant\left(\tau_{t}-\frac{1}{2}\tau_{t}^{2}\right)(\alpha_{t}-\zeta_{t})+4|\varrho_{t}|-\frac{\zeta_{t}}{2}\Lambda^{2}. \tag{3.6}$$

Moreover, by Theorems 5 and 6 of zhang2025homogeneous , with probability at least $1-4p$ we have

$$\zeta_{t}\geqslant\alpha_{t}. \tag{3.7}$$

Substituting (3.6) and (3.7) into (3.3) gives

$$\mathcal{F}(x_{t+1})-\mathcal{F}(x_{t})\leqslant 4|\varrho_{t}|-\frac{\zeta_{t}}{2}\Lambda^{2}+\Lambda\varepsilon_{1}+\frac{\Lambda^{2}}{2}\varepsilon_{2}+\frac{L_{2}}{6}\Lambda^{3}. \tag{3.8}$$

Since (3.5) provides a tighter (i.e., smaller) upper bound, we unify the analysis of both cases by adopting (3.8) as a common estimate. Using $\zeta_{t}\geqslant\alpha_{t}$ from (3.7) (which holds with the stated probability), we obtain

$$\begin{aligned}
\mathcal{F}(x_{t+1})-\mathcal{F}(x_{t}) &\leqslant 4|\varrho_{t}|-\frac{\zeta_{t}}{2}\Lambda^{2}+\Lambda\varepsilon_{1}+\frac{\Lambda^{2}}{2}\varepsilon_{2}+\frac{L_{2}}{6}\Lambda^{3}\\
&\leqslant 4|\varrho_{t}|-\frac{\alpha_{t}}{2}\Lambda^{2}+\Lambda\varepsilon_{1}+\frac{\Lambda^{2}}{2}\varepsilon_{2}+\frac{L_{2}}{6}\Lambda^{3},
\end{aligned}$$

which completes the proof.

We now consider the case $|\hat{v}_{t}|>\sqrt{1/(1+\Lambda^{2})}$. The following lemma shows that, under appropriate parameter choices, one of two outcomes occurs with high probability: either the next iterate is already an $\mathcal{O}(\varepsilon,\sqrt{\varepsilon})$-second-order stationary point, or, after increasing the parameter $\alpha_{t}$ and re-solving the homogenized eigenvalue subproblem (2.1), the Ritz residual becomes sufficiently small.

Lemma 3.2

Under Assumptions 2.1 and 3.1, consider the case $|\hat{v}_{t}|>\sqrt{1/(1+\Lambda^{2})}$. Set $\Lambda=\sqrt{\varepsilon/L_{2}}$, $\varepsilon_{1}=\varepsilon/12$ and $\varepsilon_{2}=\sqrt{L_{2}\varepsilon}/12$, with $\varepsilon\leqslant\min\{L_{2}^{3}/36,\,L_{2}/2,\,1\}$. Then the following hold:

  (1) If the Ritz residual satisfies $\|k_{t}\|\leqslant\varepsilon/2$, then the next iterate $x_{t+1}$ is an $\mathcal{O}(\varepsilon,\sqrt{\varepsilon})$-second-order stationary point.
  (2) Otherwise, after increasing $\alpha_{t}$ and re-solving the subproblem (2.1), with probability at least $1-4p$ we have $\|k_{t}\|\leqslant\varepsilon/2$.
Proof

Proof of (1). Without loss of generality, we assume $\hat{v}_{t}>0$, since the sign of the approximate eigenvector $[\hat{u}_{t};\hat{v}_{t}]$ can be flipped without affecting any subsequent derivation. We show that when $\|k_{t}\|\leqslant\varepsilon/2$, $\lambda_{1}\big(\nabla^{2}\mathcal{F}(x_{t+1})\big)\geqslant\mathcal{O}(-\sqrt{\varepsilon})$ and $\|\nabla\mathcal{F}(x_{t+1})\|\leqslant\mathcal{O}(\varepsilon)$. From the Ritz condition (3.1) we obtain

$$-\zeta_{t}=-\alpha_{t}\hat{v}_{t}^{2}+2\hat{v}_{t}g_{t}^{\top}\hat{u}_{t}+\hat{u}_{t}^{\top}H_{t}\hat{u}_{t}. \tag{3.9}$$

Using $\|\hat{u}_{t}\|^{2}+\hat{v}_{t}^{2}=1$, (3.9) can be rewritten as

$$\begin{aligned}
(\zeta_{t}-\alpha_{t})\hat{v}_{t}^{2} &=-2\hat{v}_{t}g_{t}^{\top}\hat{u}_{t}-\Bigl(\zeta_{t}+\frac{\hat{u}_{t}^{\top}H_{t}\hat{u}_{t}}{\|\hat{u}_{t}\|^{2}}\Bigr)\|\hat{u}_{t}\|^{2}\\
&\leqslant 2\hat{v}_{t}\sqrt{1-\hat{v}_{t}^{2}}\,\|g_{t}\|-\Bigl(\zeta_{t}+\frac{\hat{u}_{t}^{\top}H_{t}\hat{u}_{t}}{\|\hat{u}_{t}\|^{2}}\Bigr)\|\hat{u}_{t}\|^{2}\\
&\leqslant 2\hat{v}_{t}\sqrt{1-\hat{v}_{t}^{2}}\,\|g_{t}\|-\bigl(\zeta_{t}+\lambda_{1}(H_{t})\bigr)(1-\hat{v}_{t}^{2}).
\end{aligned} \tag{3.10}$$

Moreover, using $\Lambda\geqslant\sqrt{1-\hat{v}_{t}^{2}}/\hat{v}_{t}$ and $\lambda_{1}(H_{t})\leqslant L_{1}$, (3.10) yields

$$\zeta_{t}-\alpha_{t}\leqslant 2\Lambda\|g_{t}\|+\bigl|\lambda_{1}(H_{t})+\zeta_{t}\bigr|\Lambda^{2}\leqslant 2\Lambda\|g_{t}\|+(L_{1}+\zeta_{t})\Lambda^{2}. \tag{3.11}$$

Since $H_{t}+\delta_{t}I\succcurlyeq 0$, we have $\lambda_{1}(H_{t})+\delta_{t}\geqslant 0$, where $\delta_{t}\leqslant\zeta_{t}+e_{t}\leqslant\zeta_{t}+\alpha_{t}$. Combining these bounds with (3.11) gives

$$\lambda_{1}(H_{t})+2\alpha_{t}+2\|g_{t}\|\Lambda+(L_{1}+\zeta_{t})\Lambda^{2}\geqslant 0. \tag{3.12}$$

When $\|k_{t}\|\leqslant\varepsilon/2$, the argument parallels that of Lemma 2.6: we have $\|s_{t}\|=\|\hat{u}_{t}/\hat{v}_{t}\|\leqslant\Lambda$ and

$$H_{t+1}\succcurlyeq-\big(2\alpha_{t}+2\|g_{t}\|\Lambda+(L_{1}+\zeta_{t})\Lambda^{2}+L_{H}\left[(1+\kappa)\Lambda+2A\right]\big)I. \tag{3.13}$$

A simple norm estimate yields

$$\begin{aligned}
\zeta_{t}\leqslant\|G_{t}(\alpha_{t})\| &\leqslant\max_{\|[u;v]\|=1}\left\lvert\begin{bmatrix}u\\ v\end{bmatrix}^{\top}\begin{bmatrix}H_{t}&0\\ 0&-\alpha_{t}\end{bmatrix}\begin{bmatrix}u\\ v\end{bmatrix}\right\rvert+\max_{\|[u;v]\|=1}\left\lvert\begin{bmatrix}u\\ v\end{bmatrix}^{\top}\begin{bmatrix}0&g_{t}\\ g_{t}^{\top}&0\end{bmatrix}\begin{bmatrix}u\\ v\end{bmatrix}\right\rvert\\
&\leqslant\max\{L_{1},\alpha_{t}\}+\|g_{t}\|\leqslant L_{1}+\alpha_{t}+B_{g}.
\end{aligned} \tag{3.14}$$

Inserting (3.14) and the parameter choices into (3.13), then applying Lemma 2.3 (which controls the Hessian approximation error), we obtain after elementary simplifications

$$\begin{aligned}
\nabla^{2}\mathcal{F}(x_{t+1}) &\succcurlyeq-\big(2\alpha_{t}+2\|g_{t}\|\Lambda+(L_{1}+\zeta_{t})\Lambda^{2}+\varepsilon_{2}+L_{H}\left[(1+\kappa)\Lambda+2A\right]\big)I\\
&\succcurlyeq-\big(2\alpha_{t}+2B_{g}\Lambda+(2L_{1}+\alpha_{t}+B_{g})\Lambda^{2}+2\varepsilon_{2}+L_{H}(1+\kappa)\Lambda\big)I\\
&=-\Bigl(\Bigl(\frac{13}{6}\sqrt{L_{2}}+\frac{2B_{g}+L_{H}(1+\kappa)}{\sqrt{L_{2}}}\Bigr)\sqrt{\varepsilon}+\frac{2L_{1}+B_{g}}{L_{2}}\varepsilon+\frac{1}{\sqrt{L_{2}}}\varepsilon^{3/2}\Bigr)I\\
&\succcurlyeq-\Bigl(\Bigl(\frac{13}{6}\sqrt{L_{2}}+\frac{2B_{g}+L_{H}(1+\kappa)}{\sqrt{L_{2}}}\Bigr)\sqrt{\varepsilon}+\frac{2L_{1}+B_{g}}{\sqrt{L_{2}}}\sqrt{\varepsilon}+\sqrt{L_{2}}\sqrt{\varepsilon}\Bigr)I\\
&=-\Bigl(\frac{19}{6}\sqrt{L_{2}}+\frac{2L_{1}+3B_{g}+L_{H}(1+\kappa)}{\sqrt{L_{2}}}\Bigr)\sqrt{\varepsilon}\,I.
\end{aligned}$$

We now bound the gradient norm $\|\nabla\mathcal{F}(x_{t+1})\|$. Using the Lipschitz continuity of the Hessian $\nabla^{2}\mathcal{F}$, together with (2.1) and (3.1), we have

$$\begin{aligned}
\|g_{t+1}\| &\leqslant\|g_{t+1}-g_{t}-H_{t}s_{t}\|+\|g_{t}+H_{t}s_{t}\|\\
&=\|g_{t+1}-g_{t}-H_{t}s_{t}\|+\|k_{t}/\hat{v}_{t}-\zeta_{t}s_{t}\|\\
&\leqslant\frac{L_{2}}{2}\Lambda^{2}+\varepsilon_{2}\Lambda+2\varepsilon_{1}+\frac{\|k_{t}\|}{\omega}+|\zeta_{t}|\Lambda.
\end{aligned} \tag{3.15}$$

Theorem 6 of zhang2025homogeneous guarantees that, with probability at least $1-4p$, $|\varrho_{t}|\leqslant\varepsilon^{2}/(16L_{2}^{2})$. Moreover, (3.1) implies the scalar identity $\zeta_{t}=\alpha_{t}+\varrho_{t}/\hat{v}_{t}-g_{t}^{\top}s_{t}$. In addition, in the regime $|\hat{v}_{t}|>\sqrt{1/(1+\Lambda^{2})}$, we have $1/|\hat{v}_{t}|<\sqrt{1+\Lambda^{2}}=\sqrt{1+\varepsilon/L_{2}}\leqslant\sqrt{2}$, where we used $0<\varepsilon\leqslant L_{2}$. Hence

$$\Big|\frac{\varrho_{t}}{\hat{v}_{t}}\Big|\leqslant\frac{\varepsilon^{2}}{16L_{2}^{2}}\cdot\frac{1}{|\hat{v}_{t}|}\leqslant\frac{\sqrt{2}\,\varepsilon^{2}}{16L_{2}^{2}}.$$

Combining this bound with the expression for $\zeta_{t}$ yields

$$|\zeta_{t}|\leqslant|\alpha_{t}|+\Big|\frac{\varrho_{t}}{\hat{v}_{t}}\Big|+|g_{t}^{\top}s_{t}|\leqslant\sqrt{L_{2}\varepsilon}+\frac{\sqrt{2}\,\varepsilon^{2}}{16L_{2}^{2}}+B_{g}\Lambda. \tag{3.16}$$

Substituting (3.16) and the parameter values into (3.15), and again invoking Lemma 2.3 for the gradient error, we arrive at

$$\begin{aligned}
\|\nabla\mathcal{F}(x_{t+1})\| &\leqslant\frac{L_{2}}{2}\Lambda^{2}+\varepsilon_{2}\Lambda+3\varepsilon_{1}+\frac{\|k_{t}\|}{\omega}+|\zeta_{t}|\Lambda\\
&=\frac{5}{6}\varepsilon+\frac{\|k_{t}\|}{\omega}+|\zeta_{t}|\Lambda\\
&\leqslant\frac{5}{6}\varepsilon+\frac{\varepsilon}{2\omega}+\Bigl(\sqrt{L_{2}\varepsilon}+\frac{\sqrt{2}}{16L_{2}^{2}}\varepsilon^{2}+B_{g}\Lambda\Bigr)\Lambda\\
&=\Bigl(\frac{11}{6}+\frac{1}{2\omega}+\frac{B_{g}}{L_{2}}\Bigr)\varepsilon+\frac{\sqrt{2}}{16L_{2}^{5/2}}\varepsilon^{5/2}\\
&\leqslant\Bigl(\frac{23}{6}+\frac{B_{g}}{L_{2}}+\frac{\sqrt{2}}{16L_{2}}\Bigr)\varepsilon.
\end{aligned}$$

This shows that $x_{t+1}$ is already an $\mathcal{O}(\varepsilon,\sqrt{\varepsilon})$-second-order stationary point.

Proof of (2). We first show that $\lambda_{2}(G_{t}(\alpha_{t}))-\lambda_{1}(G_{t}(\alpha_{t}))\geqslant\sqrt{L_{2}\varepsilon}$. From (3.12) with the initial choice $\alpha_{t}=\sqrt{L_{2}\varepsilon}$ we have

$$\lambda_{1}(H_{t})+2\sqrt{L_{2}\varepsilon}+2\|g_{t}\|\Lambda+(L_{1}+\zeta_{t})\Lambda^{2}\geqslant 0. \tag{3.17}$$

After updating $\alpha_{t}$ to $\alpha_{t}=3\sqrt{L_{2}\varepsilon}+2\|g_{t}\|\Lambda+(L_{1}+\zeta_{t})\Lambda^{2}$, the Cauchy interlacing theorem gives $\lambda_{2}(G_{t}(\alpha_{t}))\geqslant\lambda_{1}(H_{t})$, and evaluating the Rayleigh quotient of $G_{t}(\alpha_{t})$ at the last coordinate vector gives $\lambda_{1}(G_{t}(\alpha_{t}))\leqslant-\alpha_{t}$. Together with (3.17), this yields

$$\lambda_{2}(G_{t}(\alpha_{t}))-\lambda_{1}(G_{t}(\alpha_{t}))\geqslant\lambda_{1}(H_{t})+\alpha_{t}\geqslant\sqrt{L_{2}\varepsilon}. \tag{3.18}$$

By Lemma 13 of zhang2025homogeneous , we have

$$\|k_{t}\|\leqslant\phi_{t}e_{t}+2\big(\max\{L_{1},\alpha_{t}\}+\|g_{t}\|\big)\,\sqrt{\frac{e_{t}}{\lambda_{2}(G_{t}(\alpha_{t}))-\lambda_{1}(G_{t}(\alpha_{t}))}},$$

where $e_{t}$ is the prescribed accuracy of the Lanczos run and $\phi_{t}\in[0,1]$. Thus, using (3.18), we have

$$\begin{aligned}
\|k_{t}\| &\leqslant\phi_{t}e_{t}+2\big(\max\{L_{1},\alpha_{t}\}+\|g_{t}\|\big)\,\sqrt{\frac{e_{t}}{\lambda_{2}(G_{t}(\alpha_{t}))-\lambda_{1}(G_{t}(\alpha_{t}))}}\\
&\leqslant\phi_{t}e_{t}+2\bigl(L_{1}+\alpha_{t}+B_{g}\bigr)\,\sqrt{\frac{e_{t}}{\sqrt{L_{2}\varepsilon}}}\\
&\leqslant\frac{\varepsilon}{2},
\end{aligned}$$

which completes the proof.
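The algebraic identities used in the proof of Lemma 3.2 can be checked numerically. The sketch below assembles the homogenized matrix in the block form that appears in (3.14), $G_{t}(\alpha_{t})=\begin{bmatrix}H_{t}&g_{t}\\ g_{t}^{\top}&-\alpha_{t}\end{bmatrix}$, and evaluates a Ritz residual. It assumes that the Ritz condition (3.1), which is not restated in this section, reads $G_{t}(\alpha_{t})[\hat{u}_{t};\hat{v}_{t}]+\zeta_{t}[\hat{u}_{t};\hat{v}_{t}]=[k_{t};\varrho_{t}]$ with $-\zeta_{t}$ equal to the Rayleigh quotient; this form is inferred from (3.9), (3.15), and the scalar identity for $\zeta_{t}$. The helper names and the test data are hypothetical.

```python
import numpy as np


def homogenized_matrix(H, g, alpha):
    """Assemble G(alpha) = [[H, g], [g^T, -alpha]] (block form as in (3.14))."""
    n = H.shape[0]
    G = np.zeros((n + 1, n + 1))
    G[:n, :n] = H
    G[:n, n] = g
    G[n, :n] = g
    G[n, n] = -alpha
    return G


def ritz_quantities(H, g, alpha, w):
    """Return (zeta, k, rho, s) for a vector w = [u_hat; v_hat].

    Assumed form of the Ritz condition (inferred, not quoted from the paper):
        G w + zeta * w = [k; rho],  with  -zeta = w^T G w  and  ||w|| = 1.
    """
    G = homogenized_matrix(H, g, alpha)
    w = w / np.linalg.norm(w)
    zeta = -float(w @ G @ w)      # Rayleigh quotient, matching (3.9)
    r = G @ w + zeta * w          # residual vector [k; rho]
    u_hat, v_hat = w[:-1], w[-1]
    s = u_hat / v_hat             # step s = u_hat / v_hat
    return zeta, r[:-1], float(r[-1]), s


# Illustrative data (hypothetical): an indefinite Hessian and a gradient.
H = np.array([[2.0, -1.0, 0.0],
              [-1.0, 1.0, 0.5],
              [0.0, 0.5, -0.5]])
g = np.array([1.0, -2.0, 0.5])
alpha = 0.3

# Mimic an inexact eigen-solver by perturbing the exact leftmost eigenvector.
eigvals, eigvecs = np.linalg.eigh(homogenized_matrix(H, g, alpha))
w = eigvecs[:, 0] + 1e-3 * np.ones(4)
zeta, k, rho, s = ritz_quantities(H, g, alpha, w)
v_hat = (w / np.linalg.norm(w))[-1]

# Identities used in (3.15) and just before (3.16).
lhs_vec = g + H @ s
rhs_vec = k / v_hat - zeta * s
zeta_from_identity = alpha + rho / v_hat - g @ s
print(np.linalg.norm(lhs_vec - rhs_vec), abs(zeta - zeta_from_identity))
```

Both printed discrepancies should be at the level of rounding error, since the identities hold for any unit vector with $\hat{v}_{t}\neq 0$, not only for accurate Ritz vectors.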

We are now ready to present the main complexity result of this section: a high-probability bound on the iteration complexity of the IHSDA algorithm. Let $\varepsilon>0$ be the target accuracy. Recall that $T(\varepsilon)$ denotes the first iteration at which an $\mathcal{O}(\varepsilon,\sqrt{\varepsilon})$-second-order stationary point is obtained. Additionally, we define a verifiable stopping index based on the approximate eigenvector component $\hat{v}_{t}$:

$$\hat{T}(\varepsilon):=\min\Bigl\{t\,\Bigm|\,|\hat{v}_{t}|>\sqrt{1/(1+\Lambda^{2})}\ \text{and}\ \|k_{t}\|\leqslant\varepsilon/2\Bigr\}, \tag{3.19}$$

where $k_{t}$ is as defined in (3.1). By Lemma 3.2, $\hat{T}(\varepsilon)$ is finite with high probability and satisfies $T(\varepsilon)\leqslant\hat{T}(\varepsilon)$.

Theorem 3.1

Under Assumptions 2.1 and 3.1, define

$$K_{\varepsilon}:=1+6\sqrt{L_{2}}\bigl(\mathcal{F}(x_{1})-\mathcal{F}_{\inf}\bigr)\,\varepsilon^{-3/2}.$$

Then the outer-iteration counts of IHSDA satisfy

$$T(\varepsilon)\leqslant\hat{T}(\varepsilon)\leqslant K_{\varepsilon}, \tag{3.20}$$

and with probability at least $(1-4p)^{2K_{\varepsilon}}$, the algorithm returns an $\mathcal{O}(\varepsilon,\sqrt{\varepsilon})$-second-order stationary point.

Proof

We first bound $\hat{T}(\varepsilon)$. Fix any outer iteration $t<\hat{T}(\varepsilon)$. By Lemma 3.2, the Ritz pair used in this iteration satisfies $|\hat{v}_{t}|\leqslant\sqrt{1/(1+\Lambda^{2})}$. Recall that

$$\Lambda=\sqrt{\varepsilon/L_{2}},\quad\alpha_{t}=\sqrt{L_{2}\varepsilon},\quad\varepsilon_{1}=\varepsilon/12,\quad\varepsilon_{2}=\sqrt{L_{2}\varepsilon}/12,$$

with $0<\varepsilon\leqslant\min\{L_{2}^{3}/36,\,L_{2}/2,\,1\}$. By Theorem 6 of zhang2025homogeneous, with probability at least $1-4p$, we have $|\varrho_{t}|\leqslant\varepsilon^{2}/(16L_{2}^{2})$. On this high-probability event, Lemma 3.1 yields

$$\mathcal{F}(x_{t+1})-\mathcal{F}(x_{t})\leqslant 4|\varrho_{t}|-\frac{\alpha_{t}}{2}\Lambda^{2}+\Lambda\varepsilon_{1}+\frac{\Lambda^{2}}{2}\varepsilon_{2}+\frac{L_{2}}{6}\Lambda^{3}.$$

If the Ritz pair is computed with a larger parameter $\alpha_{t}$, the right-hand side can only become smaller, so the bound remains valid. Substituting the parameter values gives

$$\begin{aligned}
\mathcal{F}(x_{t+1})-\mathcal{F}(x_{t}) &\leqslant\frac{\varepsilon^{2}}{4L_{2}^{2}}-\frac{1}{2}\sqrt{L_{2}\varepsilon}\,\frac{\varepsilon}{L_{2}}+\sqrt{\frac{\varepsilon}{L_{2}}}\cdot\frac{\varepsilon}{12}+\frac{\varepsilon}{2L_{2}}\cdot\frac{\sqrt{L_{2}\varepsilon}}{12}+\frac{L_{2}}{6}\Bigl(\frac{\varepsilon}{L_{2}}\Bigr)^{3/2}\\
&=-\frac{5}{24}\frac{\varepsilon^{3/2}}{\sqrt{L_{2}}}+\frac{\varepsilon^{2}}{4L_{2}^{2}}\;\leqslant\;-\frac{1}{6}\,\frac{\varepsilon^{3/2}}{\sqrt{L_{2}}},
\end{aligned}$$

where the last inequality uses $\varepsilon\leqslant L_{2}^{3}/36$.

Summing this per-iteration decrease over $t=1,\ldots,\hat{T}(\varepsilon)-1$ and noting that the total possible decrease of $\mathcal{F}$ is at most $\mathcal{F}(x_{1})-\mathcal{F}_{\inf}$, we obtain

$$\mathcal{F}(x_{1})-\mathcal{F}_{\inf}\geqslant\sum_{t=1}^{\hat{T}(\varepsilon)-1}\bigl(\mathcal{F}(x_{t})-\mathcal{F}(x_{t+1})\bigr)\geqslant(\hat{T}(\varepsilon)-1)\,\frac{1}{6}\,\frac{\varepsilon^{3/2}}{\sqrt{L_{2}}},$$

and therefore

$$\hat{T}(\varepsilon)\leqslant 1+6\sqrt{L_{2}}\bigl(\mathcal{F}(x_{1})-\mathcal{F}_{\inf}\bigr)\varepsilon^{-3/2}.$$

By the definition of $\hat{T}(\varepsilon)$ in (3.19) together with Lemma 3.2, we have $T(\varepsilon)\leqslant\hat{T}(\varepsilon)$. Combining this with the bound above establishes (3.20).

We next establish the high-probability statement. Lemma 3.2 states that whenever $|\hat{v}_{t}|>\sqrt{1/(1+\Lambda^{2})}$, either (i) $\|k_{t}\|\leqslant\varepsilon/2$ already holds, or (ii) after at most one additional Ritz-pair computation we obtain $\|k_{t}\|\leqslant\varepsilon/2$ with probability at least $1-4p$. Hence each outer iteration involves at most two calls to the Lanczos routine. Over at most $K_{\varepsilon}$ outer iterations, the total number of Lanczos invocations is at most $2K_{\varepsilon}$. Consequently, the probability that every Lanczos call succeeds is at least $(1-4p)^{2K_{\varepsilon}}$. On this event, the definition of $\hat{T}(\varepsilon)$ guarantees that at iteration $t=\hat{T}(\varepsilon)$ we have $|\hat{v}_{t}|>\sqrt{1/(1+\Lambda^{2})}$ and $\|k_{t}\|\leqslant\varepsilon/2$. Lemma 3.2 then implies that the next iterate is an $\mathcal{O}(\varepsilon,\sqrt{\varepsilon})$-second-order stationary point, which completes the proof.
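The arithmetic behind the per-iteration decrease and the iteration bound $K_{\varepsilon}$ can be reproduced with a short script. The following is a minimal sketch: the function names are hypothetical, and the numerical values of $L_{2}$, $\varepsilon$, and $\mathcal{F}(x_{1})-\mathcal{F}_{\inf}$ are illustrative assumptions chosen only to satisfy $\varepsilon\leqslant\min\{L_{2}^{3}/36,L_{2}/2,1\}$.

```python
from fractions import Fraction
import math


def decrease_coefficient():
    """Exact coefficient of eps^{3/2}/sqrt(L2) after substituting the parameters.

    Terms, in order: -alpha*Lambda^2/2, Lambda*eps1, Lambda^2*eps2/2, L2*Lambda^3/6.
    """
    return (-Fraction(1, 2) + Fraction(1, 12)
            + Fraction(1, 24) + Fraction(1, 6))


def per_iteration_bound(eps, L2):
    """Right-hand side of the decrease estimate, including the 4|rho_t| term."""
    lam = math.sqrt(eps / L2)            # Lambda
    alpha = math.sqrt(L2 * eps)          # alpha_t
    eps1 = eps / 12.0
    eps2 = math.sqrt(L2 * eps) / 12.0
    rho = eps ** 2 / (16.0 * L2 ** 2)    # upper bound on |rho_t|
    return (4.0 * rho - 0.5 * alpha * lam ** 2 + lam * eps1
            + 0.5 * lam ** 2 * eps2 + L2 * lam ** 3 / 6.0)


def outer_iteration_bound(eps, L2, gap):
    """K_eps = 1 + 6*sqrt(L2)*gap*eps^{-3/2}, where gap = F(x_1) - F_inf."""
    return 1.0 + 6.0 * math.sqrt(L2) * gap * eps ** (-1.5)


# Hypothetical values for illustration only.
L2, eps, gap = 2.0, 0.01, 1.0
target = -(eps ** 1.5) / (6.0 * math.sqrt(L2))
print(decrease_coefficient())                    # -5/24
print(per_iteration_bound(eps, L2) <= target)    # decrease of at least eps^{3/2}/(6 sqrt(L2))
print(outer_iteration_bound(eps, L2, gap))
```

At the boundary value $\varepsilon=L_{2}^{3}/36$ the two sides of the last inequality in the decrease estimate coincide, which is why this condition appears in the parameter restrictions.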

Remark 3.1

We remark that, by Bernoulli's inequality, $(1-4p)^{2K_{\varepsilon}}\geqslant 1-8K_{\varepsilon}p$ whenever $p<1/4$. Since $p\in(\exp(-n),1)$, this condition can be satisfied by taking $n$ sufficiently large, for instance, under the mild requirement $n\geqslant\mathcal{O}(-\log\varepsilon)$. Consequently, the high-probability guarantee "with probability at least $(1-4p)^{2K_{\varepsilon}}$" stated in the theorem implies the simpler, slightly weaker guarantee "with probability at least $1-8K_{\varepsilon}p$". Moreover, combining the iteration bound (3.20) with the per-call complexity estimates of (zhang2025homogeneous, Theorem 6 and Lemma 12) yields the following bound on the total number of Hessian-vector products:

$$\widetilde{\mathcal{O}}\!\Bigl(L_{2}^{1/4}\bigl(\mathcal{F}(x_{1})-\mathcal{F}_{\inf}\bigr)\varepsilon^{-7/4}\sqrt{\max\{L_{1},\alpha_{t}\}+B_{g}}\Bigr),$$

where $\widetilde{\mathcal{O}}(\cdot)$ hides logarithmic factors in $n$, $p^{-1}$, and $\varepsilon^{-1}$.
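To illustrate how close the Bernoulli lower bound in Remark 3.1 is to the product-form probability, the sketch below evaluates both for a given failure parameter $p$ and iteration count. The function name and the sample values are hypothetical, and the count is rounded up to an integer number of Lanczos calls, which is an assumption of this sketch.

```python
import math


def success_probability_bounds(p, K):
    """Return ((1-4p)^(2K), 1-8Kp) for K outer iterations.

    Each outer iteration makes at most two Lanczos calls, and each call is
    assumed to succeed with probability at least 1-4p (requires p < 1/4).
    """
    if not 0.0 < p < 0.25:
        raise ValueError("p must lie in (0, 1/4)")
    calls = 2 * math.ceil(K)              # at most two calls per iteration
    exact = (1.0 - 4.0 * p) ** calls
    bernoulli = 1.0 - 4.0 * p * calls     # Bernoulli lower bound
    return exact, bernoulli


# Hypothetical values of p and K_eps for illustration.
exact, lower = success_probability_bounds(p=1e-6, K=1000)
print(exact, lower, exact >= lower)
```

For small $8K_{\varepsilon}p$ the two quantities nearly coincide; the linear bound becomes vacuous once $8K_{\varepsilon}p\geqslant 1$, so $p$ must scale at least like $1/K_{\varepsilon}$ for it to be informative.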

We now compare the computational effort required to solve the inner subproblems in gradient-norm-regularized second-order methods and in IHSDA, highlighting the structural advantages of the homogeneous formulation.

Algorithms such as IGRTR and ILMNegCur wang2025gradient for minimizing $\mathcal{F}(x)$ require solving a regularized Newton system

$$\bigl(H_{t}+\varepsilon_{N}I\bigr)d_{t}=-g_{t}, \tag{3.21}$$

where the perturbation parameter $\varepsilon_{N}$ is chosen on the order of $\|g_{t}\|^{1/2}$. Computing an $\varepsilon$-accurate solution of (3.21) takes at most

$$\mathcal{O}\!\left(\sqrt{\kappa\bigl(H_{t}+\varepsilon_{N}I\bigr)}\,\log\frac{1}{\varepsilon}\right),\qquad\kappa\bigl(H_{t}+\varepsilon_{N}I\bigr):=\frac{\lambda_{\max}(H_{t})+\varepsilon_{N}}{\lambda_{1}(H_{t})+\varepsilon_{N}},$$

iterations, and the spectral condition number $\kappa\bigl(H_{t}+\varepsilon_{N}I\bigr)$ can become arbitrarily large when $\varepsilon_{N}\to 0$.

In contrast, IHSDA replaces the linear system (3.21) with the homogenized eigenvalue subproblem defined by the matrix $G_{t}(\alpha_{t})$ from (2.1). The Lanczos method applied to this subproblem requires at most

$$\mathcal{O}\!\left(\sqrt{\kappa_{L}\bigl(G_{t}(\alpha_{t})\bigr)}\,\log\frac{1}{\varepsilon}\right),\quad\kappa_{L}\bigl(G_{t}(\alpha_{t})\bigr):=\frac{\lambda_{\max}\bigl(G_{t}(\alpha_{t})\bigr)-\lambda_{1}\bigl(G_{t}(\alpha_{t})\bigr)}{\lambda_{2}\bigl(G_{t}(\alpha_{t})\bigr)-\lambda_{1}\bigl(G_{t}(\alpha_{t})\bigr)},$$

iterations to deliver an approximate smallest eigenpair with accuracy $\varepsilon$. A key advantage of the homogeneous approach is that the Lanczos condition number $\kappa_{L}\bigl(G_{t}(\alpha_{t})\bigr)$ is always bounded. Indeed, for any $\alpha_{t}>0$, it follows from (He2025HomogeneousSD, Theorem 2.1) that

$$\kappa_{L}\bigl(G_{t}(\alpha_{t})\bigr)\leqslant\frac{2\bigl(\lambda_{\max}(H_{t})-\alpha_{t}-\lambda_{1}\bigl(G_{t}(\alpha_{t})\bigr)\bigr)}{-\lambda_{\max}(H_{t})+\alpha_{t}+\sqrt{\bigl(\lambda_{\max}(H_{t})+\alpha_{t}\bigr)^{2}+\|g_{t}\|^{2}/n}}<\infty. \tag{3.22}$$

Thus, unlike the condition number of the regularized Newton system (3.21), which can blow up as $\varepsilon_{N}\to 0$, the homogenized subproblem remains well-conditioned for any fixed $\alpha_{t}>0$.

To quantify the improvement, consider the degenerate case $\lambda_{1}(H_{t})=0$. Then (He2025HomogeneousSD, Theorem 2.1) also implies

$$\frac{\kappa_{L}\bigl(G_{t}(\alpha_{t})\bigr)}{\kappa\bigl(H_{t}+\varepsilon_{N}I\bigr)}\leqslant\mathcal{O}\!\left(\frac{\varepsilon_{N}}{\|g_{t}\|^{2}/\bigl(\lambda_{\max}(H_{t})+\alpha_{t}\bigr)+\alpha_{t}}\right). \tag{3.23}$$

The ratio (3.23) directly compares the conditioning of the two inner subproblems. In particular, if $\varepsilon_{N}=\alpha_{t}\to 0$ while $\|g_{t}\|$ stays bounded away from zero, then the right-hand side of (3.23) converges to zero, indicating that the homogenized subproblem can be much better conditioned in this regime. In contrast, when $\|g_{t}\|\to 0$ and $\varepsilon_{N}$ and $\alpha_{t}$ are kept at the same scale, the denominator in (3.23) is dominated by $\alpha_{t}$ and the ratio remains of constant order, so the two condition numbers are comparable. This behavior aligns with the practical choice $\varepsilon_{N}=\|g_{t}\|^{1/2}$ adopted in gradient-norm-regularized methods Doikov2024GradientRN ; He2025HomogeneousSD ; Mishchenko2023RegularizedNM , yet the homogeneous formulation guarantees a bounded condition number even when the gradient is very small, a regime where Newton-type systems often become ill-conditioned.
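The comparison between the two condition numbers can be observed on a small example. The sketch below computes $\kappa(H_{t}+\varepsilon_{N}I)$ and $\kappa_{L}(G_{t}(\alpha_{t}))$ with dense eigenvalue routines for a degenerate Hessian with $\lambda_{1}(H_{t})=0$ and a gradient bounded away from zero, taking $\varepsilon_{N}=\alpha_{t}$. The matrix, the gradient, and the function names are hypothetical choices for illustration, and $G_{t}(\alpha_{t})$ is assembled in the block form appearing in (3.14).

```python
import numpy as np


def newton_condition_number(H, eps_N):
    """kappa(H + eps_N I) = (lambda_max(H) + eps_N) / (lambda_1(H) + eps_N)."""
    lam = np.linalg.eigvalsh(H)           # eigenvalues in ascending order
    return (lam[-1] + eps_N) / (lam[0] + eps_N)


def lanczos_condition_number(H, g, alpha):
    """kappa_L(G) = (lambda_max(G) - lambda_1(G)) / (lambda_2(G) - lambda_1(G))."""
    n = H.shape[0]
    G = np.block([[H, g.reshape(n, 1)],
                  [g.reshape(1, n), np.array([[-alpha]])]])
    mu = np.linalg.eigvalsh(G)
    return (mu[-1] - mu[0]) / (mu[1] - mu[0])


# Degenerate Hessian (lambda_1(H) = 0) with a non-vanishing gradient.
H = np.diag([0.0, 1.0, 2.0])
g = np.array([1.0, 1.0, 1.0])

for scale in (1e-2, 1e-4, 1e-6):
    kappa = newton_condition_number(H, eps_N=scale)
    kappa_L = lanczos_condition_number(H, g, alpha=scale)
    print(f"eps_N = alpha = {scale:.0e}: kappa = {kappa:.3e}, "
          f"kappa_L = {kappa_L:.3f}, ratio = {kappa_L / kappa:.3e}")
```

In this example the Newton condition number grows like $1/\varepsilon_{N}$ while $\kappa_{L}$ stays of order one, consistent with the regime described after (3.23); the example says nothing about the regime $\|g_{t}\|\to 0$.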

4 Numerical Results

In this section, we conduct numerical experiments to demonstrate the practical performance of the proposed HSDA algorithm and its inexact variant IHSDA. We compare them with several existing methods: Gradient Descent Ascent (GDA), the IMCN algorithm luo2022finding , the MINIMAX-TRACE algorithm yao2024two , and the IGRTR algorithm wang2025gradient . The experiments are performed on two representative problem classes: a low-dimensional synthetic nonconvex-strongly concave minimax problem, and an adversarial training task on the MNIST dataset. All code is implemented in Python 3.11 and executed on a laptop equipped with an Apple M1 processor and 16 GB of memory.

4.1 A synthetic nonconvex-strongly concave minimax problem

We begin with a low-dimensional synthetic nonconvex-strongly concave minimax problem introduced in luo2022finding :

$$\min_{x\in\mathbb{R}^{3}}\max_{y\in\mathbb{R}^{2}}f(x,y)=w(x_{3})-\frac{y_{1}^{2}}{40}+x_{1}y_{1}-\frac{5y_{2}^{2}}{2}+x_{2}y_{2}, \tag{4.1}$$

where $x=[x_{1},x_{2},x_{3}]^{\top}$ and $y=[y_{1},y_{2}]^{\top}$.

The scalar function $w(\cdot)$ is a nonconvex, W-shaped piecewise cubic function defined by a slope parameter $\varepsilon>0$ and a length parameter $L>1$:

$$w(x)=\begin{cases}
\sqrt{\varepsilon}\bigl(x+(L+1)\sqrt{\varepsilon}\bigr)^{2}-\dfrac{1}{3}\bigl(x+(L+1)\sqrt{\varepsilon}\bigr)^{3}-c_{\varepsilon}, & x\leqslant-L\sqrt{\varepsilon},\\[3pt]
\varepsilon x+\dfrac{\varepsilon^{3/2}}{3}, & -L\sqrt{\varepsilon}<x\leqslant-\sqrt{\varepsilon},\\[3pt]
-\sqrt{\varepsilon}x^{2}-\dfrac{x^{3}}{3}, & -\sqrt{\varepsilon}<x\leqslant 0,\\[3pt]
-\sqrt{\varepsilon}x^{2}+\dfrac{x^{3}}{3}, & 0<x\leqslant\sqrt{\varepsilon},\\[3pt]
-\varepsilon x+\dfrac{\varepsilon^{3/2}}{3}, & \sqrt{\varepsilon}<x\leqslant L\sqrt{\varepsilon},\\[3pt]
\sqrt{\varepsilon}\bigl(x-(L+1)\sqrt{\varepsilon}\bigr)^{2}+\dfrac{1}{3}\bigl(x-(L+1)\sqrt{\varepsilon}\bigr)^{3}-c_{\varepsilon}, & L\sqrt{\varepsilon}<x,
\end{cases}$$

with $c_{\varepsilon}:=\frac{1}{3}(3L+1)\varepsilon^{3/2}$.

In the experiment, we set $\varepsilon=0.01$, $\mu=0.05$, and $L=5$. Two different initial points are used:

$$(x_{1},y_{1})=([0.1,0.1,0.1]^{\top},[0,0]^{\top}),\qquad(x_{2},y_{2})=([1.0,0.1,0.1]^{\top},[0,0]^{\top}).$$
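For reference, the test problem (4.1) is simple enough to write down directly. The sketch below implements $w$, the objective $f$, and the value function $\mathcal{F}(x)=\max_{y}f(x,y)$. Because $f(x,\cdot)$ is a separable concave quadratic, the inner maximizer is available in closed form, $y^{\star}(x)=(20x_{1},\,x_{2}/5)$, so that $\mathcal{F}(x)=w(x_{3})+10x_{1}^{2}+x_{2}^{2}/10$. The function names are hypothetical, and the closed form is a derivation added here, not a formula quoted from the paper.

```python
import math


def w(x, eps=0.01, L=5.0):
    """W-shaped piecewise cubic w(x) with slope parameter eps and length L."""
    r = math.sqrt(eps)
    c = (3.0 * L + 1.0) * eps ** 1.5 / 3.0    # c_eps
    if x <= -L * r:
        z = x + (L + 1.0) * r
        return r * z ** 2 - z ** 3 / 3.0 - c
    if x <= -r:
        return eps * x + eps ** 1.5 / 3.0
    if x <= 0.0:
        return -r * x ** 2 - x ** 3 / 3.0
    if x <= r:
        return -r * x ** 2 + x ** 3 / 3.0
    if x <= L * r:
        return -eps * x + eps ** 1.5 / 3.0
    z = x - (L + 1.0) * r
    return r * z ** 2 + z ** 3 / 3.0 - c


def f(x, y, eps=0.01, L=5.0):
    """Objective (4.1) with x in R^3 and y in R^2."""
    return (w(x[2], eps, L) - y[0] ** 2 / 40.0 + x[0] * y[0]
            - 5.0 * y[1] ** 2 / 2.0 + x[1] * y[1])


def best_response(x):
    """Closed-form maximizer of the concave quadratic f(x, .)."""
    return (20.0 * x[0], x[1] / 5.0)


def value_function(x, eps=0.01, L=5.0):
    """F(x) = max_y f(x, y) = w(x_3) + 10 x_1^2 + x_2^2 / 10."""
    return f(x, best_response(x), eps, L)


# The two initial x-points used in the experiment.
for x0 in ([0.1, 0.1, 0.1], [1.0, 0.1, 0.1]):
    print(x0, best_response(x0), value_function(x0))
```

The pieces of $w$ agree at every breakpoint, so the branch chosen exactly at a breakpoint does not affect the value.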

The first point $(x_{1},y_{1})$ lies near the strict saddle point $([0,0,0]^{\top},[0,0]^{\top})$ of (4.1), while the second point $(x_{2},y_{2})$ is intentionally chosen farther from this saddle to examine the algorithms' global behavior. Figure 1 and Figure 2 display the performance of the five algorithms on this problem. The horizontal axis records the iteration index $t$; the left and right vertical axes show the optimality gap $\mathcal{F}(x_{t})-\mathcal{F}^{\star}$ and the gradient norm $\|\nabla\mathcal{F}(x_{t})\|_{2}$, respectively.

Figure 1: Numerical results of the tested algorithms on the synthetic W-shaped minimax example (4.1) with initialization $(x_{1},y_{1})=([0.1,0.1,0.1]^{\top},[0,0]^{\top})$.

Figure 2: Numerical results of the tested algorithms on the synthetic W-shaped minimax example (4.1) with initialization $(x_{2},y_{2})=([1.0,0.1,0.1]^{\top},[0,0]^{\top})$.

Since problem (4.1) contains a strict saddle point and the GDA algorithm struggles to escape once trapped near it, the GDA curves in Figure 1 and Figure 2 remain almost flat, showing very little decrease in either the objective gap or the gradient norm. In contrast, the other four algorithms effectively escape the saddle region and achieve substantial progress. The proposed HSDA algorithm exhibits the fastest decrease in both the optimality gap $\mathcal{F}(x_{t})-\mathcal{F}^{\star}$ and the gradient norm $\|\nabla\mathcal{F}(x_{t})\|_{2}$. For both initializations, it reduces the objective gap to about $10^{-4}$ and the gradient norm to about $10^{-2}$ within roughly a dozen iterations. The GRTR algorithm also converges rapidly, but its trajectories display more pronounced oscillations. The MCN algorithm reduces these quantities more slowly, yet its convergence path is comparatively smooth. With the chosen parameters, the MINIMAX-TRACE algorithm converges more slowly, and its curves are more oscillatory than those of HSDA and GRTR.

4.2 Adversarial training on MNIST

We next examine an adversarial training task studied in chen2021cubic , whose goal is to train a classifier that remains robust against small input perturbations. Using the MNIST dataset with 50,000 training and 10,000 test samples, we solve the finite-sum minimax problem

$$\min_{x}\;\max_{y=\{y_{i}\}_{i=1}^{n}}\frac{1}{n}\sum_{i=1}^{n}\Bigl[\ell\bigl(h_{x}(y_{i}),b_{i}\bigr)-\lambda\,\|y_{i}-a_{i}\|_{2}^{2}\Bigr], \tag{4.2}$$

where $x$ collects the parameters of a convolutional neural network $h_{x}(\cdot)$, $(a_{i},b_{i})$ denotes the $i$th image-label pair, and $y_{i}\in\mathbb{R}^{784}$ is an adversarial version of $a_{i}$. Following chen2021cubic , we take $\ell$ to be the cross-entropy loss and set $\lambda=2$.

The network $h_{x}$ is a simple convolutional architecture: a single convolutional block (one input channel, one output channel, kernel size $3$, stride $4$, padding $1$, followed by a sigmoid activation) produces a $7\times 7$ feature map; this map is flattened to a $49$-dimensional vector and passed through a linear layer with $10$ outputs. All network parameters are stacked into a vector $x\in\mathbb{R}^{d_{x}}$; each adversarial variable $y_{i}$ is initialized at the original image $a_{i}$.
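The stated feature-map size follows from the standard convolution output-size formula applied to $28\times 28$ inputs (consistent with $y_{i}\in\mathbb{R}^{784}$). The short sketch below checks this arithmetic and counts the trainable parameters. The function names are hypothetical, and the presence of bias terms in both layers is an assumption, since the text does not specify it; the resulting $d_{x}$ is therefore indicative only.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a 2-D convolution (floor convention)."""
    return (size + 2 * padding - kernel) // stride + 1


def parameter_count(in_ch=1, out_ch=1, kernel=3, n_classes=10,
                    image_size=28, stride=4, padding=1, bias=True):
    """Feature-map side, flattened feature count, and parameter count d_x.

    Assumption: when bias=True, both the convolution and the linear layer
    carry bias terms (not stated in the text).
    """
    side = conv_output_size(image_size, kernel, stride, padding)
    features = out_ch * side * side
    conv = out_ch * in_ch * kernel * kernel + (out_ch if bias else 0)
    linear = features * n_classes + (n_classes if bias else 0)
    return side, features, conv + linear


side, features, d_x = parameter_count()
print(side, features, d_x)
```

Under the bias assumption the network has 510 parameters, and 499 without biases; in either case the outer variable $x$ is low-dimensional compared with the adversarial variables $y_{i}\in\mathbb{R}^{784}$.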

We apply IHSDA together with GDA, IMCN luo2022finding, IGRTR wang2025gradient, and ILMNegCur wang2025gradient to (4.2). All methods are run in mini-batch mode with batch size $64$. For the inner maximization over $y$, every algorithm approximates the maximizer by a limited number of (possibly accelerated) gradient-ascent steps. For IHSDA we set the strong-concavity parameter in the $y$-direction to $\mu=1$, the Lipschitz constant of $\nabla_{y}f$ to $\ell=10$, and the Hessian Lipschitz constant of the value function to $L_{2}=0.2$. The homogeneous second-order subproblem in each outer iteration is solved approximately by a Lanczos procedure limited to at most $80$ iterations. IMCN is implemented according to the description in luo2022finding, while IGRTR and ILMNegCur follow the specifications in wang2025gradient.
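To illustrate the inner gradient-ascent step, the following toy sketch runs plain (non-accelerated) gradient ascent on a single adversarial variable. It is not the experimental code. As a loudly stated assumption, the cross-entropy term is replaced by a fixed linear surrogate $\langle g, y\rangle$, so the inner objective becomes $\phi(y)=\langle g,y\rangle-\lambda\|y-a\|_{2}^{2}$ with the closed-form maximizer $y^{\star}=a+g/(2\lambda)$. The function name, the vectors `a` and `g`, and the step count are hypothetical; the step size $0.1$ mirrors $1/\ell$ with $\ell=10$.

```python
# Toy sketch: gradient ascent on phi(y) = <g, y> - lam * ||y - a||^2.
# ASSUMPTION: the loss term is replaced by a fixed linear surrogate <g, y>;
# all names and numbers below are illustrative, not taken from the paper.

def inner_gradient_ascent(a, g, lam=2.0, step=0.1, num_steps=20):
    """Run a fixed number of gradient-ascent steps, starting from y = a."""
    y = list(a)  # initialize the adversarial variable at the clean input
    for _ in range(num_steps):
        # Gradient of phi at y: g - 2 * lam * (y - a)
        grad = [g_j - 2.0 * lam * (y_j - a_j) for g_j, y_j, a_j in zip(g, y, a)]
        y = [y_j + step * d_j for y_j, d_j in zip(y, grad)]
    return y

a = [0.0, 0.5, 1.0]    # hypothetical clean input
g = [1.0, -2.0, 0.4]   # hypothetical fixed loss gradient
y_hat = inner_gradient_ascent(a, g)

# Closed-form maximizer of the toy objective with lam = 2.
y_star = [a_j + g_j / (2.0 * 2.0) for a_j, g_j in zip(a, g)]
max_err = max(abs(u - v) for u, v in zip(y_hat, y_star))
print(max_err)
```

In this toy setting each step contracts the error by the factor $1-2\lambda\cdot 0.1=0.6$, which is why a limited number of ascent steps already gives a usable approximation of the maximizer.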

Figure 3 presents the results. Panel (a) plots test accuracy against wall-clock time, and panel (b) shows the approximate objective function value of (4.2) versus the outer iteration index. On this problem, the four second-order algorithms achieve higher test accuracies than GDA within the same time budget. IHSDA typically reaches a test accuracy around $80\%$ and attains the lowest objective values among the compared methods. IMCN, IGRTR, and ILMNegCur are competitive and follow closely. GDA improves more gradually and remains below about $70\%$ accuracy over the plotted range. Overall, the curves indicate that exploiting second-order information through the HSDA framework is beneficial for this adversarial training task, and that the resulting IHSDA method performs on par with existing second-order schemes.

Figure 3: Numerical results of the tested algorithms for solving (4.2).

5 Conclusions

In this paper, we have introduced a Homogeneous Second-Order Descent Ascent (HSDA) algorithm and its inexact variant (IHSDA) for solving nonconvex-strongly concave minimax problems. The algorithms leverage a homogenized eigenvalue subproblem to compute a search direction that ensures sufficient descent even when the Hessian of the value function is nearly positive semidefinite.

We prove that both HSDA and IHSDA (the latter with high probability) find an $\mathcal{O}(\varepsilon,\sqrt{\varepsilon})$-second-order stationary point within at most $\tilde{\mathcal{O}}(\varepsilon^{-3/2})$ outer iterations, matching the best known iteration complexity for existing second-order methods in this setting. For the practical IHSDA variant, which solves the subproblem approximately via a Lanczos procedure, we further establish a high-probability bound of $\tilde{\mathcal{O}}(\varepsilon^{-7/4})$ for the total number of Hessian-vector products.

The numerical experiments on synthetic minimax problems and adversarial training tasks confirm the efficiency and robustness of the proposed methods. A natural and promising direction for future work is to extend the homogeneous second-order framework beyond the nonconvex-strongly concave setting, e.g., to more general minimax structures that appear in modern machine learning applications.

Data Availability

No datasets were generated or analysed during the current study.

Declarations

The authors declare that they have no conflict of interest.

References