Random-reshuffled SARAH does not need full gradient computations

Beznosikov, Aleksandr; Takáč, Martin

doi:10.1007/s11590-023-02081-x


Original Paper
Published: 11 December 2023

Volume 18, pages 727–749, (2024)

Optimization Letters


Abstract

The StochAstic Recursive grAdient algoritHm (SARAH) is a variance-reduced variant of Stochastic Gradient Descent that needs a full gradient of the objective function from time to time. In this paper, we remove the need for full gradient computations. This is achieved by using a randomized reshuffling strategy and aggregating the stochastic gradients obtained in each epoch. The aggregated stochastic gradients serve as an estimate of the full gradient in the SARAH algorithm. We provide a theoretical analysis of the proposed approach and conclude with numerical experiments that demonstrate its efficiency.


References

Ahn, K., Yun, C., Sra, S.: SGD with shuffling: optimal rates without component convexity and large epoch requirements. Adv. Neural Inf. Process. Syst. 33, 17526–17535 (2020)
Allen-Zhu, Z.: Katyusha: the first direct acceleration of stochastic gradient methods. In: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1200–1205 (2017)
Allen-Zhu, Z., Yuan, Y.: Improved SVRG for non-strongly-convex or sum-of-non-convex objectives. In: International Conference on Machine Learning, pp. 1080–1089. PMLR (2016)
Bengio, Y.: Practical recommendations for gradient-based training of deep architectures. In: Neural Networks: Tricks of the Trade, 2nd ed., pp. 437–478. Springer (2012)
Bottou, L.: Curiously fast convergence of some stochastic gradient descent algorithms. In: Proceedings of the Symposium on Learning and Data Science, Paris, vol. 8, pp. 2624–2633. Citeseer (2009)
Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60(2), 223–311 (2018)
Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. TIST 2(3), 1–27 (2011)
Cohen, M., Diakonikolas, J., Orecchia, L.: On acceleration with noise-corrupted gradients. In: International Conference on Machine Learning, pp. 1019–1028. PMLR (2018)
Cutkosky, A., Orabona, F.: Momentum-based variance reduction in non-convex sgd. arXiv preprint arXiv:1905.10018 (2019)
Defazio, A., Bach, F., Lacoste-Julien, S.: Saga: a fast incremental gradient method with support for non-strongly convex composite objectives. In: Advances in Neural Information Processing Systems, pp. 1646–1654 (2014)
Fang, C., Li, C.J., Lin, Z., Zhang, T.: Spider: near-optimal non-convex optimization via stochastic path integrated differential estimator. arXiv preprint arXiv:1807.01695 (2018)
Ghadimi, S., Lan, G., Zhang, H.: Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Math. Program. 155(1–2), 267–305 (2016)
Gurbuzbalaban, M., Ozdaglar, A., Parrilo, P.A.: On the convergence rate of incremental aggregated gradient algorithms. SIAM J. Optim. 27(2), 1035–1048 (2017)
Hendrikx, H., Xiao, L., Bubeck, S., Bach, F., Massoulie, L.: Statistically preconditioned accelerated gradient method for distributed optimization. In: International Conference on Machine Learning, pp. 4203–4227. PMLR (2020)
Hu, W., Li, C.J., Lian, X., Liu, J., Yuan, H.: Efficient smooth non-convex stochastic compositional optimization via stochastic recursive gradient descent (2019)
Huang, X., Yuan, K., Mao, X., Yin, W.: An improved analysis and rates for variance reduction under without-replacement sampling orders. In: Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W. (eds.) Advances in Neural Information Processing Systems, vol. 34, pp. 3232–3243. Curran Associates Inc, Red Hook (2021)
Jain, P., Nagaraj, D., Netrapalli, P.: SGD without replacement: sharper rates for general smooth convex functions. arXiv preprint arXiv:1903.01463 (2019)
Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. Adv. Neural Inf. Process. Syst. 26, 315–323 (2013)
Khaled, A., Richtárik, P.: Better theory for SGD in the nonconvex world. arXiv preprint arXiv:2002.03329 (2020)
Koloskova, A., Doikov, N., Stich, S.U., Jaggi, M.: Shuffle SGD is always better than SGD: improved analysis of SGD with arbitrary data orders. arXiv preprint arXiv:2305.19259 (2023)
Li, B., Ma, M., Giannakis, G.B.: On the convergence of Sarah and beyond. In: International Conference on Artificial Intelligence and Statistics, pp. 223–233. PMLR (2020)
Li, Z., Bao, H., Zhang, X., Richtárik, P.: Page: a simple and optimal probabilistic gradient estimator for nonconvex optimization. In: International Conference on Machine Learning, pp. 6286–6295. PMLR (2021)
Li, Z., Richtárik, P.: Zerosarah: efficient nonconvex finite-sum optimization with zero full gradient computation. arXiv preprint arXiv:2103.01447 (2021)
Liu, D., Nguyen, L.M., Tran-Dinh, Q.: An optimal hybrid variance-reduced algorithm for stochastic composite nonconvex optimization. arXiv preprint arXiv:2008.09055 (2020)
Mairal, J.: Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM J. Optim. 25(2), 829–855 (2015)
Malinovsky, G., Sailanbayev, A., Richtárik, P.: Random reshuffling with variance reduction: new analysis and better rates. arXiv preprint arXiv:2104.09342 (2021)
Mishchenko, K., Khaled Ragab Bayoumi, A., Richtárik, P.: Random reshuffling: simple analysis with vast improvements. Adv. Neural Inf. Process. Syst. 33 (2020)
Mokhtari, A., Gurbuzbalaban, M., Ribeiro, A.: Surpassing gradient descent provably: a cyclic incremental method with linear convergence rate. SIAM J. Optim. 28(2), 1420–1447 (2018)
Moulines, E., Bach, F.: Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In: Shawe-Taylor, J., Zemel, R., Bartlett, P., Pereira, F., Weinberger, K. (eds.) Advances in Neural Information Processing Systems, vol. 24. Curran Associates Inc, Red Hook (2011)
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course, vol. 87. Springer, New York (2003)
Nguyen, L.M., Liu, J., Scheinberg, K., Takáč, M.: SARAH: a novel method for machine learning problems using stochastic recursive gradient. In: International Conference on Machine Learning, pp. 2613–2621. PMLR (2017)
Nguyen, L.M., Liu, J., Scheinberg, K., Takáč, M.: Stochastic recursive gradient algorithm for nonconvex optimization. arXiv preprint arXiv:1705.07261 (2017)
Nguyen, L.M., Nguyen, P.H., Richtárik, P., Scheinberg, K., Takáč, M., van Dijk, M.: New convergence aspects of stochastic gradient algorithms. J. Mach. Learn. Res. 20, 176–1 (2019)
Nguyen, L.M., Scheinberg, K., Takáč, M.: Inexact SARAH algorithm for stochastic optimization. Optim. Methods Softw. 36(1), 237–258 (2021)
Nguyen, L.M., Tran-Dinh, Q., Phan, D.T., Nguyen, P.H., Van Dijk, M.: A unified convergence analysis for shuffling-type gradient methods. J. Mach. Learn. Res. 22(1), 9397–9440 (2021)
Park, Y., Ryu, E.K.: Linear convergence of cyclic saga. Optim. Lett. 14(6), 1583–1598 (2020)
Polyak, B.T.: Introduction to optimization
Qian, X., Qu, Z., Richtárik, P.: Saga with arbitrary sampling. In: International Conference on Machine Learning, pp. 5190–5199. PMLR (2019)
Recht, B., Ré, C.: Parallel stochastic gradient algorithms for large-scale matrix completion. Math. Program. Comput. 5(2), 201–226 (2013)
Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 400–407 (1951)
Schmidt, M., Le Roux, N., Bach, F.: Minimizing finite sums with the stochastic average gradient. Math. Program. 162(1–2), 83–112 (2017)
Shalev-Shwartz, S., Ben-David, S.: Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, Cambridge (2014)
Shalev-Shwartz, S., Singer, Y., Srebro, N., Cotter, A.: Pegasos: primal estimated sub-gradient solver for SVM. Math. Program. 127(1), 3–30 (2011)
Stich, S.U.: Unified optimal analysis of the (stochastic) gradient method. arXiv preprint arXiv:1907.04232 (2019)
Sun, R.Y.: Optimization for deep learning: an overview. J. Oper. Res. Soc. China 8(2), 249–294 (2020)
Sun, T., Sun, Y., Li, D., Liao, Q.: General proximal incremental aggregated gradient algorithms: Better and novel results under general scheme. Adv. Neural Inf. Process. Syst. 32 (2019)
Takáč, M., Bijral, A., Richtárik, P., Srebro, N.: Mini-batch primal and dual methods for SVMs. In: International Conference on Machine Learning, pp. 1022–1030. PMLR (2013)
Tropp, J.A.: User-friendly tail bounds for sums of random matrices. Found. Comput. Math. 12, 389–434 (2012)
Vanli, N.D., Gurbuzbalaban, M., Ozdaglar, A.: A stronger convergence result on the proximal incremental aggregated gradient method. arXiv preprint arXiv:1611.08022 (2016)
Vaswani, S., Bach, F., Schmidt, M.: Fast and faster convergence of SGD for over-parameterized models and an accelerated perceptron. In: The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1195–1204. PMLR (2019)
Yang, Z., Chen, Z., Wang, C.: Accelerating mini-batch SARAH by step size rules. Inf. Sci. 558, 157–173 (2021)
Ying, B., Yuan, K., Sayed, A.H.: Variance-reduced stochastic learning under random reshuffling. IEEE Trans. Signal Process. 68, 1390–1408 (2020)
Ying, B., Yuan, K., Vlaski, S., Sayed, A.H.: Stochastic learning under random reshuffling with constant step-sizes. IEEE Trans. Signal Process. 67(2), 474–489 (2018)


Acknowledgements

The work of A. Beznosikov was supported by a grant for research centers in the field of artificial intelligence, provided by the Analytical Center for the Government of the Russian Federation in accordance with the subsidy agreement (agreement identifier 000000D730321P5Q0002) and the agreement with the Moscow Institute of Physics and Technology dated November 1, 2021 No. 70-2021-00138. This work was partially conducted while A. Beznosikov was a visiting research assistant at Mohamed bin Zayed University of Artificial Intelligence (MBZUAI).

Author information

Authors and Affiliations

Moscow Institute of Physics and Technology (MIPT), Moscow, Russian Federation
Aleksandr Beznosikov
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Masdar City, Abu Dhabi, UAE
Aleksandr Beznosikov & Martin Takáč


Corresponding author

Correspondence to Aleksandr Beznosikov.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1: Additional experimental results

See Figs. 4 and 5.

Appendix 2: RR-SARAH

RR-SARAH (Algorithm 2) is a modification of the original SARAH that uses random reshuffling. Unlike Algorithm 1, it uses the full gradient $\nabla P$.

Theorem 2

Suppose that Assumption 1 holds. Consider RR-SARAH (Algorithm 2) with the choice of $\eta$ such that

$$\begin{aligned} \eta \le \min \left[ \frac{1}{8n L}; \frac{1}{8n^{2} \delta } \right] . \end{aligned}$$

(6)

Then, we have

$$\begin{aligned} P(w_{s+1}) - P^*&\le \left( 1 - \frac{\eta \mu (n+1)}{2}\right) \left( P(w_s) - P^*\right) . \end{aligned}$$

Corollary 2

Fix $\varepsilon$, and let us run RR-SARAH with $\eta$ from (6). Then we can obtain an $\varepsilon$-approximate solution (in terms of $P(w) - P^* \le \varepsilon$) after

$$\begin{aligned} S = \mathcal {O}\left( \left[ n \cdot \frac{L}{ \mu } + n^2 \cdot \frac{\delta }{\mu } \right] \log \frac{1}{\varepsilon }\right) \quad \text {calls of terms }f_i. \end{aligned}$$
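As a small illustration of how (6) and Theorem 2 translate into numbers, the following sketch computes the largest step size allowed by (6) and the number of epochs after which the contraction factor $1 - \eta \mu (n+1)/2$ brings an initial gap below $\varepsilon$. The function names and the argument `initial_gap` (standing for $P(w_0) - P^*$) are hypothetical and introduced only for this example.

```python
import math


def rr_sarah_step_size(n, L, delta):
    """Largest step size allowed by (6); assumes n >= 1, L > 0 and delta > 0."""
    return min(1.0 / (8 * n * L), 1.0 / (8 * n**2 * delta))


def epochs_for_accuracy(n, L, delta, mu, initial_gap, eps):
    """Smallest S with (1 - eta*mu*(n+1)/2)**S * initial_gap <= eps.

    Uses the per-epoch contraction of Theorem 2 with eta taken from (6).
    """
    if initial_gap <= eps:
        return 0
    eta = rr_sarah_step_size(n, L, delta)
    rate = 1.0 - eta * mu * (n + 1) / 2.0
    return math.ceil(math.log(eps / initial_gap) / math.log(rate))
```

With $\eta$ at the bound in (6), the epoch count scales like $(L/\mu + n\delta/\mu)\log(1/\varepsilon)$, and multiplying by a per-epoch cost proportional to $n$ is consistent with the order stated in Corollary 2.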

Appendix 3: Missing proofs for Sect. 3 and “Appendix 2”

Before starting the proofs, note that $\delta$-similarity from Assumption 1 gives $(\delta /2)$-smoothness of the function $(f_i - P)$ for any $i \in [n]$. This implies $\delta$-smoothness of the function $(f_i - f_j)$ for any $i,j \in [n]$:

$$\begin{aligned}&\Vert \nabla f_i(w_1) - \nabla f_j(w_1) - (\nabla f_i(w_2) - \nabla f_j(w_2))\Vert \\&\quad \le \Vert \nabla f_i(w_1) - \nabla P(w_1) - (\nabla f_i(w_2) - \nabla P(w_2))\Vert \\&\qquad + \Vert \nabla P(w_1) - \nabla f_j(w_1) - (\nabla P(w_2) - \nabla f_j(w_2))\Vert \\&\quad \le 2\cdot (\delta /2)\Vert w_1 - w_2\Vert = \delta \Vert w_1 - w_2\Vert . \end{aligned}$$

(7)

Next, we introduce additional notation for simplicity. If we consider Algorithm 1 in iteration $s \ne 0$, one can note that the update rule is nothing more than

$$\begin{aligned} w_s&= w^0_s = w^{n+1}_{s-1}, \\ v_s&= v^0_s = \frac{1}{n} \sum \limits _{i=1}^{n} \nabla f_{\pi ^{i}_{s-1}} (w^{i}_{s-1}), \\ w^1_s&= w^0_s - \eta v^0_s, \end{aligned}$$

(8)

$$\begin{aligned} v^{i}_s&= v^{i-1}_s + \nabla f_{\pi ^{i}_{s}} (w^{i}_{s}) - \nabla f_{\pi ^{i}_{s}} (w^{i-1}_{s}), \end{aligned}$$

(9)

$$\begin{aligned} w^{i+1}_s&= w^{i}_s - \eta v^{i}_s . \end{aligned}$$

(10)

This notation is used throughout the proofs. For Algorithm 2, the same notation applies with $v_s = v^0_s = \nabla P(w_s)$.
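The updates (8)–(10) can be sketched in code as follows. This is a minimal illustration, not the authors' implementation: it assumes least-squares components $f_i(w) = \frac{1}{2}(a_i^\top w - b_i)^2$, all function names are hypothetical, and the initial estimate $v_0 = 0$ is an assumption of this sketch, since the excerpt only describes iterations $s \ne 0$. Setting `use_full_gradient=True` gives the Algorithm 2 variant with $v_s = \nabla P(w_s)$; otherwise $v_s$ is the average of the gradients collected during the previous epoch, as in (8).

```python
import numpy as np


def objective(A, b, w):
    """P(w) = (1/n) * sum_i f_i(w) with f_i(w) = 0.5 * (A[i] @ w - b[i])**2."""
    return 0.5 * np.mean((A @ w - b) ** 2)


def component_grad(A, b, i, w):
    """Gradient of the single component f_i at w."""
    return (A[i] @ w - b[i]) * A[i]


def full_grad(A, b, w):
    """Gradient of P at w."""
    return A.T @ (A @ w - b) / A.shape[0]


def run_epoch(A, b, w, v, eta, rng):
    """Apply (8)-(10) once; return w_{s+1} and the average of the sampled gradients."""
    n = A.shape[0]
    w_prev, w_cur = w, w - eta * v                  # w^0_s and w^1_s
    grad_sum = np.zeros_like(w)
    for idx in rng.permutation(n):                  # pi^1_s, ..., pi^n_s
        g_cur = component_grad(A, b, idx, w_cur)
        v = v + g_cur - component_grad(A, b, idx, w_prev)   # update (9)
        grad_sum += g_cur
        w_prev, w_cur = w_cur, w_cur - eta * v               # update (10)
    return w_cur, grad_sum / n


def optimize(A, b, w0, eta, epochs, use_full_gradient, seed=0):
    """Run several epochs and record P(w_s) after each one."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)                            # assumption: v_0 = 0
    history = [objective(A, b, w)]
    for _ in range(epochs):
        if use_full_gradient:
            v = full_grad(A, b, w)                  # Algorithm 2: v_s = grad P(w_s)
        w, v_avg = run_epoch(A, b, w, v, eta, rng)
        if not use_full_gradient:
            v = v_avg                               # Algorithm 1: aggregated gradients
        history.append(objective(A, b, w))
    return w, history
```

With $v_0 = 0$ the first epoch of the reshuffled variant leaves $w$ unchanged and simply accumulates the component gradients at $w_0$, so the estimate used in the second epoch equals $\nabla P(w_0)$ without a separate full-gradient routine.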

Lemma 1

Under Assumption 1, for Algorithms 1 and 2 with $\eta$ from (5) the following holds

$$\begin{aligned} P(w_{s+1})&\le P(w_s) - \frac{\eta (n + 1)}{2} \Vert \nabla P(w_s)\Vert ^2 + \frac{\eta (n + 1)}{2} \left\| \nabla P(w_s) - \frac{1}{n+1} \sum \limits _{i=0}^{n} v^i_s\right\| ^2. \end{aligned}$$

Proof

Using L-smoothness of function P (Assumption 1 (i)), we have

$$\begin{aligned} P(w_{s+1})&\le P(w_s) + \langle \nabla P(w_s), w_{s+1} - w_s \rangle + \frac{L}{2} \Vert w_{s+1} - w_s \Vert ^2 \\&= P(w_s) - \eta (n+1) \left\langle \nabla P(w_s), \frac{1}{n+1} \sum \limits _{i=0}^{n} v^i_s \right\rangle + \frac{\eta ^2 (n+1)^2 L}{2} \left\| \frac{1}{n+1} \sum \limits _{i=0}^{n} v^i_s\right\| ^2 \\&= P(w_s) - \frac{\eta (n+1)}{2} \left( \Vert \nabla P(w_s)\Vert ^2 + \left\| \frac{1}{n+1} \sum \limits _{i=0}^{n} v^i_s\right\| ^2 - \left\| \nabla P(w_s) - \frac{1}{n+1} \sum \limits _{i=0}^{n} v^i_s\right\| ^2\right) \\&\quad + \frac{\eta ^2 (n+1)^2 L}{2} \left\| \frac{1}{n+1} \sum \limits _{i=0}^{n} v^i_s\right\| ^2 \\&= P(w_s) - \frac{\eta (n+1)}{2} \Vert \nabla P(w_s)\Vert ^2 - \frac{\eta (n+1)}{2} (1 - \eta (n+1) L) \left\| \frac{1}{n+1} \sum \limits _{i=0}^{n} v^i_s\right\| ^2\\&\quad + \frac{\eta (n+1)}{2} \left\| \nabla P(w_s) - \frac{1}{n+1} \sum \limits _{i=0}^{n} v^i_s\right\| ^2. \end{aligned}$$

With $\eta \le \frac{1}{8nL} \le \frac{1}{(n+1)L}$, we get

$$\begin{aligned} P(w_{s+1})&\le P(w_s) - \frac{\eta (n+1)}{2} \Vert \nabla P(w_s)\Vert ^2 + \frac{\eta (n+1)}{2} \left\| \nabla P(w_s) - \frac{1}{n+1} \sum \limits _{i=0}^{n} v^i_s\right\| ^2. \end{aligned}$$

This completes the proof. $\square$

Lemma 2

Under Assumption 1, for Algorithms 1 and 2 the following holds

$$\begin{aligned} \left\| \nabla P(w_s) - \frac{1}{n+1} \sum \limits _{i=0}^{n} v^i_s\right\| ^2 \le 2\Vert \nabla P(w_s) - v_s \Vert ^2 + \left( \frac{4L^2}{n+1} + 4 \delta ^2 n\right) \sum \limits _{i=1}^n \Vert w^{i}_s - w_s\Vert ^2. \end{aligned}$$

Proof

To begin with, we prove that for any $k = n, \ldots , 0$, it holds

$$\begin{aligned} \sum \limits _{i=k}^{n} v^i_s&= \sum \limits _{i=k+1}^n \left( (n+1-i)\cdot \left[ \nabla f_{\pi ^{i}_s} (w^{i}_s) -\nabla f_{\pi ^{i}_s} (w^{i-1}_s) \right] \right) + (n - k +1)v^k_s. \end{aligned}$$

(11)

One can prove it by mathematical induction. For $k = n$, we have $\sum _{i=n}^{n} v^i_s = v^n_s$. Suppose (11) holds for $k$; let us prove it for $k-1$:

$$\begin{aligned} \sum \limits _{i=k-1}^{n} v^i_s&= v^{k-1}_s + \sum \limits _{i=k}^{n} v^i_s \\&= v^{k-1}_s + \sum \limits _{i=k+1}^n \left( (n+1-i)\cdot \left[ \nabla f_{\pi ^{i}_s} (w^{i}_s) -\nabla f_{\pi ^{i}_s} (w^{i-1}_s) \right] \right) + (n - k +1)v^k_s \\&= v^{k-1}_s + \sum \limits _{i=k+1}^n \left( (n+1-i)\cdot \left[ \nabla f_{\pi ^{i}_s} (w^{i}_s) -\nabla f_{\pi ^{i}_s} (w^{i-1}_s) \right] \right) \\&\quad + (n - k +1) \left[ v^{k-1}_s + \nabla f_{\pi ^{k}_{s}} (w^{k}_{s}) - \nabla f_{\pi ^{k}_{s}} (w^{k-1}_{s})\right] \\&= \sum \limits _{i= k}^n \left( (n+1-i)\cdot \left[ \nabla f_{\pi ^{i}_s} (w^{i}_s) -\nabla f_{\pi ^{i}_s} (w^{i-1}_s) \right] \right) + (n - k + 2) v^{k-1}_s . \end{aligned}$$

Here we additionally used (9). This completes the proof of (11). In particular, (11) with $k = 0$ gives

$$\begin{aligned}&\Bigg \Vert \nabla P(w_s) - \frac{1}{n+1} \sum \limits _{i=0}^{n} v^i_s\Bigg \Vert ^2 \\&\quad = \frac{1}{(n+1)^2}\left\| (n+1)\nabla P(w_s) - \sum \limits _{i=0}^{n} v^i_s\right\| ^2 \\&\quad = \frac{1}{(n+1)^2}\Bigg \Vert (n+1)\nabla P(w_s) \\&\qquad - \sum \limits _{i=1}^n \left( (n+1-i)\cdot \left[ \nabla f_{\pi ^{i}_s} (w^{i}_s) -\nabla f_{\pi ^{i}_s} (w^{i-1}_s) \right] \right) - (n +1)v^0_s\Bigg \Vert ^2. \end{aligned}$$

Using $\Vert a + b\Vert ^2 \le 2\Vert a\Vert ^2 + 2\Vert b\Vert ^2$, we get

$$\begin{aligned}&\Bigg \Vert \nabla P(w_s) - \frac{1}{n+1} \sum \limits _{i=0}^{n} v^i_s\Bigg \Vert ^2 \\&\quad \le 2 \Vert \nabla P(w_s) - v^0_s\Vert ^2 \\&\qquad + \frac{2}{(n+1)^2}\left\| \sum \limits _{i=1}^n \left( (n+1-i)\cdot \left[ \nabla f_{\pi ^{i}_s} (w^{i}_s) -\nabla f_{\pi ^{i}_s} (w^{i-1}_s) \right] \right) \right\| ^2. \end{aligned}$$

(12)

Again using mathematical induction, we prove for $k = n, \ldots , 1$ the following estimate:

$$\begin{aligned}&\Bigg \Vert \sum \limits _{i=1}^n \left( (n+1-i)\cdot \left[ \nabla f_{\pi ^{i}_s} (w^{i}_s) -\nabla f_{\pi ^{i}_s} (w^{i-1}_s) \right] \right) \Bigg \Vert ^2 \\&\quad \le 2 L^2 (n+1) \sum \limits _{i=k+1}^n \Vert w^{i}_s - w_s\big \Vert ^2 + 2 \delta ^2 (n+1) \sum \limits _{i=k+1}^n (n + 1 - i)^2 \Vert w_s - w^{i-1}_s \Vert ^2 \\&\quad +\frac{n+1}{k+1} \Bigg \Vert (n-k)\nabla f_{\pi ^{k}_s} (w_s) + \nabla f_{\pi ^{k}_s} (w^{k}_s) - (n-k+1)\nabla f_{\pi ^{k}_s} (w^{k-1}_s) \\&\quad + \sum \limits _{i=1}^{k-1} \left( (n+1-i)\cdot \left[ \nabla f_{\pi ^{i}_s} (w^{i}_s) -\nabla f_{\pi ^{i}_s} (w^{i-1}_s) \right] \right) \Bigg \Vert ^2. \end{aligned}$$

(13)

For $k = n$, the statement holds automatically. Suppose (13) holds for $k$; let us prove it for $k-1$:

$$\begin{aligned}&\Bigg \Vert \sum \limits _{i=1}^n \left( (n+1-i)\cdot \left[ \nabla f_{\pi ^{i}_s} (w^{i}_s) -\nabla f_{\pi ^{i}_s} (w^{i-1}_s) \right] \right) \Bigg \Vert ^2 \\&\quad \le 2 L^2 (n+1) \sum \limits _{i=k+1}^n \Vert w^{i}_s - w_s\Vert ^2 + 2 \delta ^2 (n+1) \sum \limits _{i=k+1}^n (n - i + 1)^2 \Vert w_s - w^{i-1}_s \Vert ^2 \\&\qquad +\frac{n+1}{k+1} \Bigg \Vert (n-k)\nabla f_{\pi ^{k}_s} (w_s) + \nabla f_{\pi ^{k}_s} (w^{k}_s) - (n-k+1)\nabla f_{\pi ^{k}_s} (w^{k-1}_s)\\&\qquad +\sum \limits _{i=1}^{k-1} \left( (n+1-i)\cdot \left[ \nabla f_{\pi ^{i}_s} (w^{i}_s) -\nabla f_{\pi ^{i}_s} (w^{i-1}_s) \right] \right) \Bigg \Vert ^2 \\&\quad =2 L^2 (n+1) \sum \limits _{i=k+1}^n \Vert w^{i}_s - w_s\Vert ^2 + 2 \delta ^2 (n+1) \sum \limits _{i=k+1}^n (n - i + 1)^2 \Vert w_s - w^{i-1}_s \Vert ^2 \\&\qquad +\frac{n+1}{k+1} \Bigg \Vert \nabla f_{\pi ^{k}_s} (w^{k}_s) - \nabla f_{\pi ^{k}_s} (w_s) + (n-k+1)\nabla f_{\pi ^{k}_s} (w_s) - (n-k+1)\nabla f_{\pi ^{k}_s} (w^{k-1}_s) \\&\qquad + (n-k+2)\cdot \left[ \nabla f_{\pi ^{k-1}_s} (w^{k-1}_s) -\nabla f_{\pi ^{k-1}_s} (w^{k-2}_s) \right] \\&\qquad + \sum \limits _{i=1}^{k-2} \left( (n+1-i)\cdot \left[ \nabla f_{\pi ^{i}_s} (w^{i}_s) -\nabla f_{\pi ^{i}_s} (w^{i-1}_s) \right] \right) \Bigg \Vert ^2 \\&\quad =2 L^2 (n+1) \sum \limits _{i=k+1}^n \Vert w^{i}_s - w_s\Vert ^2 + 2 \delta ^2 (n+1) \sum \limits _{i=k+1}^n (n - i + 1)^2 \Vert w_s - w^{i-1}_s \Vert ^2 \\&\qquad +\frac{n+1}{k+1} \Bigg \Vert \nabla f_{\pi ^{k}_s} (w^{k}_s) - \nabla f_{\pi ^{k}_s} (w_s) \\&\qquad + (n-k+1) \cdot \left[ \nabla f_{\pi ^{k}_s} (w_s) - \nabla f_{\pi ^{k-1}_s} (w_s) - \nabla f_{\pi ^{k}_s} (w^{k-1}_s) + \nabla f_{\pi ^{k-1}_s} (w^{k-1}_s) \right] \\&\qquad + (n-k+1)\nabla f_{\pi ^{k-1}_s} (w_s) + \nabla f_{\pi ^{k-1}_s} (w^{k-1}_s) -(n-k+2) \nabla f_{\pi ^{k-1}_s} (w^{k-2}_s) \\&\qquad + \sum \limits _{i=1}^{k-2} \left( (n+1-i)\cdot \left[ \nabla f_{\pi ^{i}_s} (w^{i}_s) -\nabla f_{\pi ^{i}_s} (w^{i-1}_s) \right] \right) \Bigg \Vert ^2. \end{aligned}$$

Using $\Vert a + b\Vert ^2 \le (1 + c)\Vert a\Vert ^2 + (1+ 1/c)\Vert b\Vert ^2$ with $c = k$, we have

$$\begin{aligned}&\Bigg \Vert \sum \limits _{i=1}^n \left( (n+1-i)\cdot \left[ \nabla f_{\pi ^{i}_s} (w^{i}_s) -\nabla f_{\pi ^{i}_s} (w^{i-1}_s) \right] \right) \Bigg \Vert ^2 \\&\quad \le 2 L^2 (n+1) \sum \limits _{i=k+1}^n \Vert w^{i}_s - w_s\Vert ^2 + 2 \delta ^2 (n+1) \sum \limits _{i=k+1}^n (n - i + 1)^2 \Vert w_s - w^{i-1}_s \Vert ^2 \\&\qquad +(n+1) \Big \Vert \nabla f_{\pi ^{k}_s} (w^{k}_s) - \nabla f_{\pi ^{k}_s} (w_s) \\&\qquad + (n-k+1) \cdot \left[ \nabla f_{\pi ^{k}_s} (w_s) - \nabla f_{\pi ^{k-1}_s} (w_s) - \nabla f_{\pi ^{k}_s} (w^{k-1}_s) + \nabla f_{\pi ^{k-1}_s} (w^{k-1}_s) \right] \Big \Vert ^2 \\&\qquad + \frac{n+1}{k} \Bigg \Vert (n-k+1)\nabla f_{\pi ^{k-1}_s} (w_s) + \nabla f_{\pi ^{k-1}_s} (w^{k-1}_s) -(n-k+2) \nabla f_{\pi ^{k-1}_s} (w^{k-2}_s) \\&\qquad + \sum \limits _{i=1}^{k-2} \left( (n+1-i)\cdot \left[ \nabla f_{\pi ^{i}_s} (w^{i}_s) -\nabla f_{\pi ^{i}_s} (w^{i-1}_s) \right] \right) \Bigg \Vert ^2. \end{aligned}$$

With $\Vert a + b\Vert ^2 \le 2\Vert a\Vert ^2 + 2\Vert b\Vert ^2$ and Assumption 1 ($\delta$-similarity (7) and L-smoothness), one can obtain

$$\begin{aligned}&\Bigg \Vert \sum \limits _{i=1}^n \left( (n+1-i)\cdot \left[ \nabla f_{\pi ^{i}_s} (w^{i}_s) -\nabla f_{\pi ^{i}_s} (w^{i-1}_s) \right] \right) \Bigg \Vert ^2 \\&\quad \le 2 L^2 (n+1) \sum \limits _{i=k+1}^n \Vert w^{i}_s - w_s\Vert ^2 + 2 \delta ^2 (n+1) \sum \limits _{i=k+1}^n (n - i + 1)^2 \Vert w_s - w^{i-1}_s \Vert ^2 \\&\qquad +2(n+1) \Vert \nabla f_{\pi ^{k}_s} (w^{k}_s) - \nabla f_{\pi ^{k}_s} (w_s) \Vert ^2 \\&\qquad + 2 (n+1) (n-k+1)^2 \Vert \nabla f_{\pi ^{k}_s} (w_s) - \nabla f_{\pi ^{k-1}_s} (w_s) - \nabla f_{\pi ^{k}_s} (w^{k-1}_s) + \nabla f_{\pi ^{k-1}_s} (w^{k-1}_s) \Vert ^2 \\&\qquad + \frac{n+1}{k} \Bigg \Vert (n-k+1)\nabla f_{\pi ^{k-1}_s} (w_s) + \nabla f_{\pi ^{k-1}_s} (w^{k-1}_s) -(n-k+2) \nabla f_{\pi ^{k-1}_s} (w^{k-2}_s) \\&\qquad + \sum \limits _{i=1}^{k-2} \left( (n+1-i)\cdot \left[ \nabla f_{\pi ^{i}_s} (w^{i}_s) -\nabla f_{\pi ^{i}_s} (w^{i-1}_s) \right] \right) \Bigg \Vert ^2 \\&\quad \le 2 L^2 (n+1) \sum \limits _{i=k}^n \Vert w^{i}_s - w_s\Vert ^2 + 2 \delta ^2 (n+1) \sum \limits _{i=k}^n (n - i + 1)^2 \Vert w_s - w^{i-1}_s \Vert ^2 \\&\qquad + \frac{n+1}{k} \Bigg \Vert (n-k+1)\nabla f_{\pi ^{k-1}_s} (w_s) + \nabla f_{\pi ^{k-1}_s} (w^{k-1}_s) -(n-k+2) \nabla f_{\pi ^{k-1}_s} (w^{k-2}_s) \\&\qquad + \sum \limits _{i=1}^{k-2} \left( (n+1-i)\cdot \left[ \nabla f_{\pi ^{i}_s} (w^{i}_s) -\nabla f_{\pi ^{i}_s} (w^{i-1}_s) \right] \right) \Bigg \Vert ^2. \end{aligned}$$

This completes the proof of (13). In particular, (13) with $k = 1$ gives

$$\begin{aligned}&\Bigg \Vert \sum \limits _{i=1}^n \left( (n+1-i)\cdot \left[ \nabla f_{\pi ^{i}_s} (w^{i}_s) -\nabla f_{\pi ^{i}_s} (w^{i-1}_s) \right] \right) \Bigg \Vert ^2 \\&\quad \le 2 L^2 (n+1) \sum \limits _{i=2}^n \Vert w^{i}_s - w_s\big \Vert ^2 + 2 \delta ^2 (n+1) \sum \limits _{i=2}^n (n - i + 1)^2 \Vert w_s - w^{i-1}_s \Vert ^2 \\&\qquad +\frac{n+1}{2} \Vert (n-1)\nabla f_{\pi ^{1}_s} (w_s) + \nabla f_{\pi ^{1}_s} (w^{1}_s) - n\nabla f_{\pi ^{1}_s} (w^{0}_s)\Vert ^2. \end{aligned}$$

With (8) and L-smoothness of function $f_{\pi ^{1}_s}$ (Assumption 1 (i)), we have

$$\begin{aligned}&\Bigg \Vert \sum \limits _{i=1}^n \left( (n+1-i)\cdot \left[ \nabla f_{\pi ^{i}_s} (w^{i}_s) -\nabla f_{\pi ^{i}_s} (w^{i-1}_s) \right] \right) \Bigg \Vert ^2 \\&\quad \le 2 L^2 (n+1) \sum \limits _{i=2}^n \Vert w^{i}_s - w_s\big \Vert ^2 + 2 \delta ^2 (n+1) \sum \limits _{i=2}^n (n - i + 1)^2 \Vert w_s - w^{i-1}_s \Vert ^2 \\&\qquad +\frac{n+1}{2} \Vert \nabla f_{\pi ^{1}_s} (w^{1}_s) - \nabla f_{\pi ^{1}_s} (w_s)\Vert ^2 \\&\quad \le 2 L^2 (n+1) \sum \limits _{i=1}^n \Vert w^{i}_s - w_s\big \Vert ^2 + 2 \delta ^2 (n+1) \sum \limits _{i=2}^n (n - i + 1)^2 \Vert w_s - w^{i-1}_s \Vert ^2 \\&\quad \le \left( 2 L^2 (n+1) + 2 \delta ^2 (n+1) n^2 \right) \sum \limits _{i=1}^n \Vert w^{i}_s - w_s\big \Vert ^2. \end{aligned}$$

(14)

Substituting (14) into (12) completes the proof. $\square$
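The telescoping identity (11) used in this proof depends only on the recursion (9), so it can be checked numerically with arbitrary vectors standing in for the gradient differences. The sketch below does this; the function name and the random stand-ins `d[i]` are assumptions made for illustration.

```python
import numpy as np


def check_identity_11(n=6, dim=4, seed=0):
    """Return the largest deviation between both sides of (11) over k = 0, ..., n."""
    rng = np.random.default_rng(seed)
    v = [rng.normal(size=dim)]                                 # v^0_s
    # d[i] stands in for grad f_{pi^i}(w^i) - grad f_{pi^i}(w^{i-1}); d[0] is unused.
    d = [None] + [rng.normal(size=dim) for _ in range(n)]
    for i in range(1, n + 1):
        v.append(v[i - 1] + d[i])                              # recursion (9)
    max_err = 0.0
    for k in range(n + 1):
        lhs = sum(v[i] for i in range(k, n + 1))
        rhs = (n - k + 1) * v[k]
        for i in range(k + 1, n + 1):
            rhs = rhs + (n + 1 - i) * d[i]
        max_err = max(max_err, float(np.linalg.norm(lhs - rhs)))
    return max_err
```

The returned deviation should be at the level of floating-point rounding for any choice of `n`, `dim`, and `seed`.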

Lemma 3

Under Assumption 1, for Algorithms 1 and 2 with $\eta$ from (5) the following holds for $i \in [n]$

$$\begin{aligned} \Vert v^i_s\Vert ^2 \le \Vert v^{i-1}_s\Vert ^2. \end{aligned}$$

Proof

With (9) and (10), we have

$$\begin{aligned} \Vert v^{i}_s \Vert ^2&= \Vert v^{i-1}_s + \nabla f_{\pi ^{i}_{s}} (w^{i}_{s}) - \nabla f_{\pi ^{i}_{s}} (w^{i-1}_{s}) \Vert ^2 \\&= \Vert v^{i-1}_s \Vert ^2 + \Vert \nabla f_{\pi ^{i}_{s}} (w^{i}_{s}) - \nabla f_{\pi ^{i}_{s}} (w^{i-1}_{s}) \Vert ^2 + 2\langle v^{i-1}_s, \nabla f_{\pi ^{i}_{s}} (w^{i}_{s}) - \nabla f_{\pi ^{i}_{s}} (w^{i-1}_{s})\rangle \\&= \Vert v^{i-1}_s \Vert ^2 + \Vert \nabla f_{\pi ^{i}_{s}} (w^{i}_{s}) - \nabla f_{\pi ^{i}_{s}} (w^{i-1}_{s}) \Vert ^2 + \frac{2}{\eta }\langle w^{i-1}_s - w^i_s, \nabla f_{\pi ^{i}_{s}} (w^{i}_{s}) - \nabla f_{\pi ^{i}_{s}} (w^{i-1}_{s})\rangle . \end{aligned}$$

Assumption 1 (i) on convexity and L-smoothness of $f_{\pi ^{i}_{s}}$ gives (see also Theorem 2.1.5 from [30])

$$\begin{aligned} \Vert v^{i}_s \Vert ^2&= \Vert v^{i-1}_s + \nabla f_{\pi ^{i}_{s}} (w^{i}_{s}) - \nabla f_{\pi ^{i}_{s}} (w^{i-1}_{s}) \Vert ^2 \\&= \Vert v^{i-1}_s \Vert ^2 + \Vert \nabla f_{\pi ^{i}_{s}} (w^{i}_{s}) - \nabla f_{\pi ^{i}_{s}} (w^{i-1}_{s}) \Vert ^2 + 2\langle v^{i-1}_s, \nabla f_{\pi ^{i}_{s}} (w^{i}_{s}) - \nabla f_{\pi ^{i}_{s}} (w^{i-1}_{s})\rangle \\&= \Vert v^{i-1}_s \Vert ^2 + \Vert \nabla f_{\pi ^{i}_{s}} (w^{i}_{s}) - \nabla f_{\pi ^{i}_{s}} (w^{i-1}_{s}) \Vert ^2 + \frac{2}{\eta }\langle w^{i-1}_s - w^i_s, \nabla f_{\pi ^{i}_{s}} (w^{i}_{s}) - \nabla f_{\pi ^{i}_{s}} (w^{i-1}_{s})\rangle \\&\le \Vert v^{i-1}_s \Vert ^2 + \left( 1 - \frac{2}{L \eta } \right) \Vert \nabla f_{\pi ^{i}_{s}} (w^{i}_{s}) - \nabla f_{\pi ^{i}_{s}} (w^{i-1}_{s}) \Vert ^2. \end{aligned}$$

Taking into account that $\eta \le \frac{1}{8nL} \le \frac{1}{2\,L}$, we finish the proof. $\square$
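The monotonicity in Lemma 3 can be observed on a toy problem. The following sketch assumes least-squares components $f_i(w) = \frac{1}{2}(a_i^\top w - b_i)^2$, which are convex and $L$-smooth with $L = \max_i \Vert a_i\Vert ^2$; for these components the gradient difference in (9) equals $a_i a_i^\top (w^{i}_s - w^{i-1}_s)$, so $b$ does not appear. The function name is hypothetical.

```python
import numpy as np


def estimator_norms(A, w, v, eta, perm):
    """Run the inner loop (9)-(10) and return ||v^0||, ||v^1||, ..., ||v^n||."""
    norms = [float(np.linalg.norm(v))]
    w_prev, w_cur = w, w - eta * v                 # w^0_s and w^1_s
    for idx in perm:
        a = A[idx]
        # grad f_idx(w_cur) - grad f_idx(w_prev) for a least-squares component
        v = v + (a @ (w_cur - w_prev)) * a
        w_prev, w_cur = w_cur, w_cur - eta * v
        norms.append(float(np.linalg.norm(v)))
    return norms
```

For example, with random data, any starting estimate `v`, and `eta = 1 / (8 * n * L)`, the returned sequence should be non-increasing, in line with the lemma.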

Proof of Theorem 2

For RR-SARAH $v_s = \nabla P(w_s)$, then by Lemma 2, we get

$$\begin{aligned} \left\| \nabla P(w_s) - \frac{1}{n+1} \sum \limits _{i=0}^{n} v^i_s\right\| ^2&\le \left( \frac{4L^2}{n+1} + 4 \delta ^2 n\right) \sum \limits _{i=1}^n \Vert w^{i}_s - w_s\Vert ^2. \end{aligned}$$

Combining with Lemma 1, one can obtain

$$\begin{aligned} P(w_{s+1})&\le P(w_s) - \frac{\eta (n+1)}{2} \Vert \nabla P(w_s)\Vert ^2 + \frac{\eta (n+1)}{2} \left( \frac{4L^2}{n+1} + 4 \delta ^2 n\right) \sum \limits _{i=1}^n \Vert w^{i}_s - w_s\Vert ^2. \end{aligned}$$

Next, we work with $\sum \nolimits _{i=1}^n \Vert w^{i}_s - w_s\Vert ^2$. By Lemma 3 and the update for $w^i_s$ ((8) and (10)), we get

$$\begin{aligned} \sum \limits _{i=1}^n \Vert w^{i}_s - w_s\Vert ^2&= \eta ^2 \sum \limits _{i=1}^n \left\| \sum \limits _{k=0}^{i-1} v^k_s \right\| ^2 \le \eta ^2 \sum \limits _{i=1}^n i \sum \limits _{k=0}^{i-1} \left\| v^k_s \right\| ^2 \le \eta ^2 \sum \limits _{i=1}^n i \sum \limits _{k=0}^{i-1} \left\| v_s \right\| ^2 \\&\le \eta ^2 \left\| v_s \right\| ^2 \sum \limits _{i=1}^n i \sum \limits _{k=0}^{i-1} 1 \le \eta ^2 n^3 \left\| v_s \right\| ^2 = \eta ^2 n^3 \left\| \nabla P(w_s) \right\| ^2. \end{aligned}$$

(15)

Hence,

$$\begin{aligned} P(w_{s+1})&\le P(w_s) - \frac{\eta (n+1)}{2} \Vert \nabla P(w_s)\Vert ^2 + \frac{\eta (n+1)}{2} \left( \frac{4L^2}{n+1} + 4 \delta ^2 n\right) \cdot \eta ^2 n^3 \left\| \nabla P(w_s)\right\| ^2 \\&\le P(w_s) - \frac{\eta (n+1)}{2} \left( 1 - \left( \frac{4L^2}{n+1} + 4 \delta ^2 n\right) \cdot \eta ^2 n^3\right) \Vert \nabla P(w_s)\Vert ^2. \end{aligned}$$

With $\eta \le \min \left\{ \frac{1}{8n L}; \frac{1}{8n^{2} \delta } \right\}$, we get

$$\begin{aligned} P(w_{s+1}) - P^*&\le P(w_s) - P^* - \frac{\eta (n+1)}{4} \Vert \nabla P(w_s)\Vert ^2. \end{aligned}$$

Strong-convexity of P ends the proof:

$$\begin{aligned} P(w_{s+1}) - P^*&\le \left( 1 - \frac{\eta (n+1) \mu }{2}\right) \left( P(w_s) - P^*\right) . \end{aligned}$$

$\square$
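For completeness, the two estimates behind the last two displays of this proof can be spelled out. This is a sketch that uses only the step-size bound (6) and the standard consequence of $\mu$-strong convexity, $\Vert \nabla P(w)\Vert ^2 \ge 2\mu \left( P(w) - P^*\right)$:

$$\begin{aligned} \left( \frac{4L^2}{n+1} + 4 \delta ^2 n\right) \cdot \eta ^2 n^3&\le \frac{4L^2 n^3}{n+1} \cdot \frac{1}{64 n^2 L^2} + 4 \delta ^2 n^4 \cdot \frac{1}{64 n^4 \delta ^2} = \frac{n}{16(n+1)} + \frac{1}{16} \le \frac{1}{8} \le \frac{1}{2}, \\ \frac{\eta (n+1)}{4} \Vert \nabla P(w_s)\Vert ^2&\ge \frac{\eta (n+1) \mu }{2} \left( P(w_s) - P^*\right) . \end{aligned}$$

The first line shows that the bracket multiplying $\Vert \nabla P(w_s)\Vert ^2$ is at least $1/2$, and the second converts the gradient-norm decrease into the linear rate.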

Proof of Theorem 1

For Shuffled-SARAH $v_s = \frac{1}{n} \sum _{i=1}^{n} \nabla f_{\pi ^{i}_{s-1}} (w^{i}_{s-1})$, then by Lemma 2, we get

$$\begin{aligned}&\left\| \nabla P(w_s) - \frac{1}{n+1} \sum \limits _{i=0}^{n} v^i_s\right\| ^2 \\&\quad \le \left( \frac{4L^2}{n+1} + 4 \delta ^2 n\right) \sum \limits _{i=1}^n \Vert w^{i}_s - w_s\Vert ^2 + 2 \left\| \frac{1}{n} \sum \limits _{i=1}^{n} \left[ \nabla f_{\pi ^{i}_{s-1}}(w_s) - \nabla f_{\pi ^{i}_{s-1}} (w^{i}_{s-1}) \right] \right\| ^2 \\&\quad \le \left( \frac{4L^2}{n+1} + 4\delta ^2 n \right) \sum \limits _{i=1}^n \Vert w^{i}_s - w_s\Vert ^2 + \frac{2L^2}{n}\sum \limits _{i=1}^n \left\| w^{i}_{s-1} - w_s\right\| ^2. \end{aligned}$$

(16)

With $\sum \nolimits _{i=1}^n \Vert w^{i}_s - w_s\Vert ^2$ we can work in the same way as in the proof of Theorem 2. It remains to deal with $\sum _{i=1}^n \left\| w^{i}_{s-1} - w_s\right\| ^2$. Using Lemma 3 and the update for $w^i_s$ ((8) and (10)), we get

$$\begin{aligned} \sum \limits _{i=1}^n \Vert w^{i}_{s-1} - w_s\Vert ^2&= \eta ^2 \sum \limits _{i=1}^{n} \left\| \sum \limits _{k=1}^{n+ 1 -i} v^{n + 1-k}_{s-1} \right\| ^2 \le \eta ^2 \sum \limits _{i=1}^n (n+1 -i) \sum \limits _{k=1}^{n+ 1 -i} \left\| v^{n+1-k}_{s-1} \right\| ^2 \\&\le \eta ^2 \sum \limits _{i=1}^n (n+1 -i)\sum \limits _{k=1}^{n+ 1 -i} \left\| v_{s-1} \right\| ^2 \\&\le \eta ^2 \left\| v_{s-1} \right\| ^2 \sum \limits _{i=1}^n (n+1 -i) \sum \limits _{k=1}^{n+1 -i} 1 \\&\le \eta ^2 n^3 \left\| v_{s-1} \right\| ^2. \end{aligned}$$

(17)

Combining the results of Lemma 1 with (16), (15) and (17), one can obtain

$$\begin{aligned} P(w_{s+1})&\le P(w_s) - \frac{\eta (n+1)}{2} \Vert \nabla P(w_s)\Vert ^2 \\&\quad + \frac{\eta (n+1)}{2} \left[ \left( \frac{4L^2}{n+1} + 4\delta ^2 n \right) \cdot \eta ^2 n^3 \left\| v_s\right\| ^2 + \frac{2L^2}{n}\cdot \eta ^2 n^3 \left\| v_{s-1} \right\| ^2\right] \\&= P(w_s) - \frac{\eta (n+1)}{4} \Vert \nabla P(w_s)\Vert ^2 \\&\quad + \frac{\eta (n+1)}{2} \left[ \left( \frac{4L^2}{n+1} + 4\delta ^2 n \right) \cdot \eta ^2 n^3 \left\| v_s\right\| ^2 + \frac{2L^2}{n}\cdot \eta ^2 n^3 \left\| v_{s-1} \right\| ^2\right] \\&\quad - \frac{\eta (n+1)}{4} \Vert \nabla P(w_s)\Vert ^2 \\&\le P(w_s) - \frac{\eta (n+1)}{4} \Vert \nabla P(w_s)\Vert ^2 \\&\quad + \frac{\eta (n+1)}{2} \left[ \left( \frac{4L^2}{n+1} + 4\delta ^2 n \right) \cdot \eta ^2 n^3 \left\| v_s\right\| ^2 + \frac{2L^2}{n}\cdot \eta ^2 n^3 \left\| v_{s-1} \right\| ^2\right] \\&\quad - \frac{\eta (n+1)}{8} \Vert v_s \Vert ^2 + \frac{\eta (n+1)}{4} \Vert v_s - \nabla P(w_s)\Vert ^2 \\&\le P(w_s) - \frac{\eta (n+1)}{4} \Vert \nabla P(w_s)\Vert ^2 \\&\quad + \frac{\eta (n+1)}{2} \left[ \left( \frac{4L^2}{n+1} + 4\delta ^2 n \right) \cdot \eta ^2 n^3 \left\| v_s\right\| ^2 + \frac{2L^2}{n}\cdot \eta ^2 n^3 \left\| v_{s-1} \right\| ^2\right] \\&\quad - \frac{\eta (n+1)}{8} \Vert v_s \Vert ^2 + \frac{\eta (n+1)}{4} \cdot \frac{2L^2}{n} \cdot \eta ^2 n^3 \left\| v_{s-1} \right\| ^2. \end{aligned}$$

The last step is deduced in the same way as (16) and (17). A small rearrangement gives

$$\begin{aligned} P(w_{s+1}) - P^*&\le P(w_s) - P^* - \frac{\eta (n+1)}{4} \Vert \nabla P(w_s)\Vert ^2 \\&\quad - \frac{\eta (n+1)}{8} \left( 1 - \left( \frac{16L^2}{n+1} + 16\delta ^2 n \right) \cdot \eta ^2 n^3 \right) \Vert v_s \Vert ^2 \\&\quad + \eta (n+1)\cdot \frac{2L^2}{n} \cdot \eta ^2 n^3 \left\| v_{s-1} \right\| ^2. \end{aligned}$$

With the choice of $\eta \le \min \left\{ \frac{1}{8n L}; \frac{1}{8n^{2} \delta } \right\}$, we have

$$\begin{aligned}&P(w_{s+1}) - P^* + \frac{\eta (n+1)}{16}\Vert v_s \Vert ^2 \\&\quad \le P(w_s) - P^* - \frac{\eta (n+1)}{4} \Vert \nabla P(w_s)\Vert ^2 + \frac{\eta (n+1)}{16}\cdot \frac{32L^2}{n} \cdot \eta ^2 n^3 \left\| v_{s-1} \right\| ^2. \end{aligned}$$

Again using that $\eta \le \frac{1}{8nL}$, we obtain $32\,L^2 \eta ^2 n^2 \le \left( 1 - \frac{\eta (n+1) \mu }{2}\right)$ and

$$\begin{aligned}&P(w_{s+1}) - P^* + \frac{\eta (n+1)}{16}\Vert v_s \Vert ^2 \\&\quad \le P(w_s) - P^* - \frac{\eta (n+1)}{4} \Vert \nabla P(w_s)\Vert ^2 + \left( 1 - \frac{\eta (n+1) \mu }{2}\right) \cdot \frac{\eta (n+1)}{16}\left\| v_{s-1} \right\| ^2. \end{aligned}$$

Strong-convexity of P ends the proof. $\square$
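The final step can be made explicit as follows. This is a sketch: the potential $\Phi _s$ is notation introduced here for illustration, and the statement of Theorem 1 itself is not reproduced in this appendix. Using $\Vert \nabla P(w_s)\Vert ^2 \ge 2\mu \left( P(w_s) - P^*\right)$ in the previous display gives

$$\begin{aligned} \Phi _s&:= P(w_s) - P^* + \frac{\eta (n+1)}{16}\Vert v_{s-1} \Vert ^2, \\ P(w_s) - P^* - \frac{\eta (n+1)}{4} \Vert \nabla P(w_s)\Vert ^2&\le \left( 1 - \frac{\eta (n+1) \mu }{2}\right) \left( P(w_s) - P^*\right) , \\ \Phi _{s+1}&\le \left( 1 - \frac{\eta (n+1) \mu }{2}\right) \Phi _s. \end{aligned}$$

So the potential contracts by the same factor per epoch as the function gap does in Theorem 2.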

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Beznosikov, A., Takáč, M. Random-reshuffled SARAH does not need full gradient computations. Optim Lett 18, 727–749 (2024). https://doi.org/10.1007/s11590-023-02081-x


Received: 25 September 2022
Accepted: 06 November 2023
Published: 11 December 2023
Issue Date: April 2024
DOI: https://doi.org/10.1007/s11590-023-02081-x
