1 Introduction

We consider a recursive stochastic scheme called “stochastic gradient Langevin dynamics” (SGLD), first suggested by Welling and Teh [17]. Let \(\lambda >0\) be the stepsize, let the measurable function \(H:{\mathbb {R}}^d\times {\mathbb {R}}^m\rightarrow {\mathbb {R}}^d\) be the updating function, and define the \({\mathbb {R}}^d\)-valued stochastic process \(\theta _n\), \(n\ge 1\), recursively by

$$\begin{aligned} \theta _{n+1}=\theta _n-\lambda H(\theta _n,Y_n)+\sqrt{2\lambda }\xi _{n+1}. \end{aligned}$$
(1)

Here \(\xi _n\), \(n\ge 1\) is a sequence of independent standard d-dimensional Gaussian random variables, and \(Y_n\), \(n\in {\mathbb {Z}}\) is an \({\mathbb {R}}^m\)-valued strict-sense stationary process, independent of \((\xi _{n})_{n\in {\mathbb {N}}}\), which represents the data stream fed into this procedure. Furthermore, we assume (for simplicity) that the initial value \(\theta _{0}\in {\mathbb {R}}^d\) is deterministic.
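For concreteness, the recursion (1) can be sketched in a few lines of Python. This is a minimal illustration only; the update function H, the data stream and the parameter values below are user-supplied placeholders, not objects defined in this paper.

```python
import numpy as np

def sgld(theta0, H, data_stream, lam, n_steps, rng=None):
    """Iterate the SGLD recursion (1):
    theta_{n+1} = theta_n - lam * H(theta_n, Y_n) + sqrt(2 * lam) * xi_{n+1}."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.array(theta0, dtype=float)
    iterates = [theta.copy()]
    for n in range(n_steps):
        y = data_stream(n)                     # Y_n: one element of the (possibly dependent) data stream
        xi = rng.standard_normal(theta.shape)  # xi_{n+1}: standard d-dimensional Gaussian noise
        theta = theta - lam * H(theta, y) + np.sqrt(2.0 * lam) * xi
        iterates.append(theta.copy())
    return np.array(iterates)
```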

The algorithm (1) is used for approximate sampling from high-dimensional probability distributions that are not necessarily log-concave. More precisely, let \(U:{\mathbb {R}}^d\rightarrow {\mathbb {R}}_+\) be differentiable with derivative \(h=\nabla U\) such that \(h(\theta )=E[H(\theta ,Y_0)]\), \(\theta \in {\mathbb {R}}^{d}\). Assume U has a unique minimum at \(\theta ^{\dagger }\). For \(\lambda \) small and n large, \(\text {Law}(\theta _n)\) is expected to be close to the probability defined by

$$\begin{aligned} \pi (A)=\frac{\int _A e^{-U(\theta )}d\theta }{\int _{{\mathbb {R}}^d} e^{-U(\theta ')}\textrm{d}\theta '},\ A\in {\mathcal {B}}({\mathbb {R}}^d), \end{aligned}$$

see e.g. [1, 8, 17]. If \(\sqrt{2\lambda }\) in (1) is replaced by \(\sqrt{2\lambda /\beta }\) for some \(\beta >0\) then the procedure samples from a distribution with density proportional to \(e^{-\beta U(x)}\) which means, for \(\beta \) large, that

$$\begin{aligned} E[\theta _{n}]\approx \int _{{\mathbb {R}}^{d}}x\,\pi (dx)\approx \theta ^{\dagger }, \end{aligned}$$
(2)

for n large enough and \(\lambda \) small enough. (In this paper we keep \(\beta =1\) for simplicity.)

Example 1.1

We consider a regularized logistic regression where \(m\ge 2\), \(d:=m-1\) and \((Q_{n},Z_{n})\in \{0,1\}\times {\mathbb {R}}^m\), \(n\in {\mathbb {Z}}\) is a stationary sequence of random variables. The purpose is to optimize the regression parameters \(\theta \in {\mathbb {R}}^d\) in such a way that the functional

$$\begin{aligned} U(\theta ):=-E\left[ \ln [\sigma ^{Q_{0}}(\langle \theta ,Z_{0}\rangle )(1-\sigma (\langle \theta ,Z_{0}\rangle ))^{1-Q_{0}}]\right] +c|\theta |^{2} \end{aligned}$$

is minimized, where \(\sigma (x)=1/(1+e^{-x})\) is the sigmoid function and \(c>0\) is a constant. One thus tries to guess the binary variable Q from the variables Z. We then have

$$\begin{aligned} H^{i}(\theta ,(q,z))=-(q-\sigma (\langle \theta ,z\rangle ))z^{i}+2c \theta ^{i} \end{aligned}$$

for all \(i=1,\ldots ,d\).
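For illustration, the update function of this example can be coded directly from the formula above and plugged into the sgld sketch given after (1); the regularization constant below and the data format are placeholders, not values used in the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def H_logistic(theta, y, c=0.1):
    """Update function of Example 1.1:
    H(theta, (q, z)) = -(q - sigmoid(<theta, z>)) * z + 2 * c * theta,
    where y = (q, z) with q in {0, 1}; c = 0.1 is an arbitrary placeholder value."""
    q, z = y
    return -(q - sigmoid(np.dot(theta, z))) * z + 2.0 * c * theta
```

Feeding H_logistic together with a stream of (possibly dependent) labelled samples \((Q_n,Z_n)\) into the sgld routine produces the iterates \(\theta _n\) whose averages are studied below.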

As can be easily verified, this updating function satisfies Assumption 2.1. The SGLD algorithm in this context could be applied to standard sentiment analysis problems where, based on the occurrences of key words (represented by the coordinates of Z), it should be decided whether a given review on a webshop is positive or not (\(Q=1\) or \(Q=0\)), see e.g. [3].

Review data continuously arrive and often exhibit temporal dependencies and non-i.i.d. characteristics. This is because customers’ reviews can be influenced by previous reviews, current trends, or the changing sentiment of other customers, leading to dependencies between reviews. Consequently, the occurrence of certain key words and the overall sentiment may not be independent across reviews. For such sentiment analysis problems, variants of stochastic gradient descent are commonly used. However, due to the lack of convexity, it is worth considering the use of SGLD.

Furthermore, sentiment analysis faces the challenge of concept drift, which refers to the situation where the underlying sentiment distribution of the data changes over time. This could be due to various factors, such as changes in product features, external events, or trends. The SGLD algorithm is capable of adapting to concept drift scenarios by continuously updating the model parameters as new data arrives.

One would try to numerically approximate the integral in (2) by

$$\begin{aligned} \frac{\theta _{0}+\ldots +\theta _{n-1}}{n}. \end{aligned}$$

However, to guarantee the consistency of such a procedure, one needs to establish a corresponding law of large numbers.
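In code, this estimator is simply the running average of the iterates; a minimal sketch, reusing the hypothetical sgld routine above:

```python
import numpy as np

def ergodic_average(iterates, phi=lambda x: x):
    """Approximate the pi-integral of phi by (phi(theta_0) + ... + phi(theta_{n-1})) / n."""
    values = np.array([phi(theta) for theta in iterates[:-1]])  # iterates holds theta_0, ..., theta_n
    return values.mean(axis=0)
```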

In the case where \(\lambda \) in (1) is replaced by \(\lambda _{n}\) with a decreasing sequence \(\lambda _n\), \(n\ge 0\), under suitable assumptions, the averages

$$\begin{aligned} \frac{\sum _{k=0}^{n-1}\lambda _k \phi (\theta _k)}{\sum _{k=0}^{n-1}\lambda _k} \end{aligned}$$
(3)

converge almost surely to \(\int _{{\mathbb {R}}^d}\phi (z)\pi (dz)\) for appropriate functions \(\phi \) as shown in [13], where a related central limit theorem is also established.

In the case of fixed \(\lambda \), [16] estimated the \(L^2\) distance of the averages from the mean of \(\pi \). Both these papers, like most available studies, assume that \(Y_n\), \(n\in {\mathbb {Z}}\) are i.i.d. This does not hold true in several applications, prominently in the case of financial time series, see e.g. [10], where stochastic approximation schemes were treated in a setting with possibly dependent data. See also [1, 8, 11, 14] for more about SGLD with dependent data.

When the \(Y_{n}\) are independent, \(\theta _{n}\) is a Markov chain. However, the case of general stationary \(Y_{n}\) is an order of magnitude more involved mathematically since \(\theta _{n}\) is only a Markov chain in a random environment, see Sect. 2 for details.

In this article, we establish a law of large numbers (LLN) for averages of the form (3) when a fixed stepsize \(\lambda >0\) is employed. Additionally, we establish an invariance principle. These results serve as crucial theoretical guarantees for the consistency of estimates such as (3), and form the foundation for constructing confidence intervals for these estimates. Our work builds upon and extends the findings in [6], where an LLN and a CLT were shown for the stochastic gradient method with dependent data, specifically in the special case of a linear updating rule.

Our arguments are based on results of [7] which require establishing mixing properties for the process \(\theta _{t}\). The recent paper [15] is closely related to this part of our work: it shows mixing for a certain class of processes. That setting, however, does not cover ours since the strong minorization property \({\textbf{A}}{\textbf{2}}\) in [15] does not hold for our processes.

Section 2 states and explains our main results. Their proof in Sect. 3 is presented in a series of subsections.

2 The Main Result

First we formulate our working assumption on the stochastic iterative scheme given by (1).

Assumption 2.1

There are \(\Delta ,b>0\) such that, for all \(\theta \in {\mathbb {R}}^d\) and \(y\in {\mathbb {R}}^m\),

$$\begin{aligned} \langle H(\theta ,y),\theta \rangle \ge \Delta \Vert \theta \Vert ^2-b, \end{aligned}$$
(4)

and for some \(K>\Delta /\sqrt{2}\),

$$\begin{aligned} \Vert H(\theta ,y)\Vert \le K( \Vert \theta \Vert +\Vert y\Vert +1). \end{aligned}$$
(5)

Furthermore, we assume that the process \((Y_t)_{t\in {\mathbb {Z}}}\) is strictly stationary, and there is \(M>0\) such that

$$\begin{aligned} \Vert Y_{0}\Vert \le M\ \text {a.s.}\end{aligned}$$
(6)

Condition (4) is a standard dissipativity requirement; (5) is also mild and holds, for instance, for Lipschitz-continuous H. By stationarity, (6) implies uniform boundedness of the data stream. This may look stringent from the mathematical point of view, but it is unproblematic in practice for two main reasons. First, many real-world applications involve data that are naturally bounded within certain ranges. For example, pixel values in images are confined to specific ranges (e.g., 0 to 255 for grayscale images). Second, scaling the data to a compact domain is a common preprocessing step in machine learning. In conclusion, the assumptions we have made are met by a wide range of learning problems of considerable practical importance.

Next, we briefly recall the main concepts of \(\alpha \)-mixing. Throughout this paper the probability space is \((\Omega , {\mathcal {F}}, {\mathbb {P}})\), and for any two sub-\(\sigma \)-algebras \({\mathcal {G}},{\mathcal {H}}\subset {\mathcal {F}}\), we define the measure of dependence

$$\begin{aligned} \alpha ({\mathcal {G}},{\mathcal {H}}) = \sup \limits _{G\in {\mathcal {G}}, H\in {\mathcal {H}}} \left| {\mathbb {P}}(G\cap H)-{\mathbb {P}}(G){\mathbb {P}}(H)\right| . \end{aligned}$$
(7)

Furthermore, for an arbitrary sequence of random variables \((W_t)_{t\in {\mathbb {Z}}}\), we define the \(\sigma \)-algebras \({\mathcal {F}}_{t,s}^W:=\sigma \left( W_k,\,t\le k\le s\right) \), \(-\infty \le t\le s\le \infty \), and introduce the dependence coefficients

$$\begin{aligned} \alpha _j^W (n) = \alpha \left( {\mathcal {F}}_{-\infty ,j}^W,{\mathcal {F}}_{j+n,\infty }^W\right) ,\,\,j\in {\mathbb {Z}}. \end{aligned}$$

The mixing coefficient of W is \(\alpha ^W (n)= \sup _{j\in {\mathbb {Z}}}\alpha _j^W (n)\), \(n\ge 1\) which is obviously non-increasing in n. Note that, for strictly stationary W, \(\alpha _j^W (n)\) does not depend on j, and thus \(\alpha ^W (n)=\alpha _0^W (n)\). We say that W is \(\alpha \)-mixing if \(\lim _{n\rightarrow \infty }\alpha ^W (n)=0\).
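For intuition only, the measure of dependence (7) between the \(\sigma \)-algebras generated by two discrete random variables can be computed by brute force over all events. The following small sketch (not used in any proof) illustrates the definition.

```python
import itertools
import numpy as np

def alpha_discrete(joint):
    """Compute alpha(sigma(X), sigma(Y)) from (7) for discrete X, Y with joint pmf matrix `joint`.
    All 2^k * 2^l pairs of events are enumerated, so this is feasible only for small supports."""
    joint = np.asarray(joint, dtype=float)
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    k, l = joint.shape
    best = 0.0
    for g in itertools.product([False, True], repeat=k):      # event G = {X falls in the set marked by g}
        for h in itertools.product([False, True], repeat=l):  # event H = {Y falls in the set marked by h}
            g_arr, h_arr = np.array(g), np.array(h)
            p_joint = joint[np.ix_(g_arr, h_arr)].sum()
            best = max(best, abs(p_joint - px[g_arr].sum() * py[h_arr].sum()))
    return best

# Independent coordinates give alpha = 0 (up to rounding).
print(alpha_discrete(np.outer([0.5, 0.5], [0.3, 0.7])))
```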

Assumption 2.2

For some \(\epsilon >0\), the \(\alpha \)-mixing coefficients \(\alpha ^{Y}(n)\), \(n\in {\mathbb {N}}\) satisfy

$$\begin{aligned} \sum _{n=1}^{\infty }\alpha ^{Y}(n)^{1-\epsilon }<\infty . \end{aligned}$$

In [11] it was established (under somewhat weaker conditions than Assumption 2.1) that \(\text {Law}(\theta _{n})\) converges in total variation to a limiting probability \(\mu _{\lambda }\) as \(n\rightarrow \infty \). A rate estimate of the order \(\exp (-n^{1/3})\) was obtained. Clearly, \(\mu _{\lambda }\) differs from \(\pi \) and the bias is \(O(\sqrt{\lambda })\) under suitable conditions, see [8].

In this paper, using results of [5], we prove an exponential convergence rate of \(\text {Law}(\theta _{n})\) to \(\mu _{\lambda }\) under Assumption 2.1. More importantly, a functional central limit theorem is established under the additional Assumption 2.2. In the sequel, \(\phi :{\mathbb {R}}^d\rightarrow {\mathbb {R}}\) denotes an at most polynomially growing measurable function i.e. for fixed but arbitrary constants \(c_\phi ,r>0\),

$$\begin{aligned} |\phi (\theta )|\le c_\phi (1+\Vert \theta \Vert ^r),\,\,\theta \in {\mathbb {R}}^{d}. \end{aligned}$$
(8)

Our main results are summarized in the next two theorems.

Theorem 2.3

Let Assumption 2.1 be in force, and \(0<\lambda \le \frac{\Delta }{K^2}\) be fixed. Then there is a strictly stationary process \((\theta _t^*)_{t\in {\mathbb {N}}}\) on \({\mathbb {R}}^d\) and there are constants \(c,\kappa >0\) depending only on \(\lambda \), \(\Delta \), b, K and M such that for any \(k\in {\mathbb {N}}\) and indices \(0\le i_1<\ldots <i_k\),

$$\begin{aligned} d_{\text {TV}}\left( \text {Law}((\theta _{i_1+n},\ldots ,\theta _{i_k+n})),\text {Law}((\theta _{i_1}^*,\ldots ,\theta _{i_k}^*))\right) \le c e^{-\kappa n}. \end{aligned}$$

Furthermore, we have

$$\begin{aligned} \frac{1}{n}{\sum _{j=1}^{n}\phi (\theta _{j})}\rightarrow {\mathbb {E}}(\phi (\theta _0^*)),\,\, n\rightarrow \infty \end{aligned}$$

almost surely and in \(L^{p}\), for all \(p\ge 1\) provided that \((Y_t)_{t\in {\mathbb {Z}}}\) is ergodic.

Theorem 2.4

Under Assumptions 2.1 and 2.2, for \(0<\lambda \le \frac{\Delta }{K^2}\), the process \(X_n:=\phi (\theta _n)-{\mathbb {E}}(\phi (\theta _n))\), \(n\in {\mathbb {N}}\) satisfies the invariance principle, i.e. for \(S_n=X_1+\ldots +X_n\), \({\mathbb {E}}S_n^2/n\rightarrow \sigma ^2\) for some \(\sigma \ge 0\), and the sequence of random functions

$$\begin{aligned} B_n(t) = \frac{S_{\lfloor nt\rfloor }}{\sqrt{n}},\,\,t\in [0,1],\,n\ge 1 \end{aligned}$$

is weakly convergent to \(\sigma B_{t}\), \(t\in [0,1]\) on D[0, 1] (the Skorokhod space endowed with the Skorokhod topology) as \(n\rightarrow \infty \). Here \(B_{t}\), \(t\in [0,1]\) is a standard Brownian motion.

Remark 2.5

As we shall see later (cf. Corollary 3.2 and Lemma 3.11), it is also true that \({\mathbb {E}}(\phi (\theta _n))\rightarrow {\mathbb {E}}(\phi (\theta _0^*))\) exponentially fast as \(n\rightarrow \infty \), hence the biased sequence \(X_n':=\phi (\theta _n)-{\mathbb {E}}(\phi (\theta _0^*))\) also satisfies the invariance principle.
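In practice, Theorem 2.4 is what licenses confidence intervals for the ergodic averages. Below is a minimal sketch assuming a batch-means estimator of \(\sigma ^2\), a standard device that is not prescribed or analysed in this paper.

```python
import numpy as np

def batch_means_ci(values, n_batches=20, z=1.96):
    """Approximate 95% confidence interval for the mean of phi(theta_1), ..., phi(theta_n).
    The asymptotic variance sigma^2 of Theorem 2.4 is estimated by batch means (an assumption,
    not part of the paper)."""
    values = np.asarray(values, dtype=float)
    n = len(values) - len(values) % n_batches          # drop the remainder so batches are equal-sized
    batch_size = n // n_batches
    batch_means = values[:n].reshape(n_batches, batch_size).mean(axis=1)
    mean = values[:n].mean()
    sigma2_hat = batch_size * batch_means.var(ddof=1)  # estimates sigma^2 = lim E S_n^2 / n
    half_width = z * np.sqrt(sigma2_hat / n)
    return mean - half_width, mean + half_width
```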

3 Proofs

Throughout the rest of the paper, we use the notation \({\mathcal {X}}:= {\mathbb {R}}^d\) and \({\mathcal {Y}}:={\mathbb {R}}^m\); moreover, \({\mathcal {B}}({\mathcal {X}})\) will be used for the standard Borel \(\sigma \)-algebra of \({\mathcal {X}}\). As pointed out in Sect. 5 of [11] and also in [14], the recursive stochastic scheme (1) can be considered as a Markov chain in an exogenous random environment (MCRE) which means that there is a parametric kernel \(Q:{\mathcal {Y}}\times {\mathcal {X}}\times {\mathcal {B}}({\mathcal {X}})\rightarrow [0,1]\) such that

$$\begin{aligned} {\mathbb {P}}(\theta _{t+1}\in A\mid (\theta _i)_{0\le i\le t},\,(Y_j)_{j\in {\mathbb {Z}}}) = Q(Y_t,\theta _t,A) \end{aligned}$$

almost surely, for all \(A\in {\mathcal {B}}({\mathcal {X}})\). In our case, the transition kernel is given by

$$\begin{aligned} Q(y,\theta ,A) = {\mathbb {P}}\left( \theta -\lambda H(\theta ,y)+\sqrt{2\lambda }\xi _0 \in A\right) ,\,\,y\in {\mathbb {R}}^m,\theta \in {\mathbb {R}}^d,A\in {\mathcal {B}}({\mathcal {X}}), \end{aligned}$$

where \(\xi _0\) is as in the recursion (1) i.e. a standard d-dimensional Gaussian random variable.
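Since \(\xi _0\) is a standard d-dimensional Gaussian, the kernel is explicit: \(Q(y,\theta ,\cdot )\) is the Gaussian law with mean \(\theta -\lambda H(\theta ,y)\) and covariance matrix \(2\lambda I_d\), that is,

$$\begin{aligned} Q(y,\theta ,A)=\int _A \frac{1}{(4\pi \lambda )^{d/2}}\exp \left( -\frac{\Vert z-\theta +\lambda H(\theta ,y)\Vert ^2}{4\lambda }\right) \textrm{d}z,\,\,A\in {\mathcal {B}}({\mathcal {X}}). \end{aligned}$$

This explicit Gaussian form is what makes the drift and minorization arguments of Sect. 3.1 below possible.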

Here we give a brief explanation of the proof strategy. First, we fix a trajectory of Y (that is, we consider the “quenched” version of the process) and, using a standard representation of MCREs by iterated random functions, we deduce an upper estimate for the probability that realizations of the chain starting from different, possibly random, initial values remain uncoupled (Lemma 3.10). To achieve this, we demonstrate that small sets, where coupling can occur with positive probability, are visited frequently enough with large probability. The so-called “annealed” version of this crucial result (Lemma 3.11) allows us to establish that the process \((\theta _t)_{t\in {\mathbb {N}}}\) inherits the mixing properties of the environment (Lemma 3.14). The proof of Theorem 2.3 also heavily relies on this inequality. We actually prove a bit more: we show that there exists an almost surely finite random time at which suitable versions of \((\theta _t)_{t\in {\mathbb {N}}}\) and \((\theta _t^*)_{t\in {\mathbb {N}}}\) are coupled to each other. Finally, the proof of the invariance principle (Theorem 2.4) boils down to verifying the conditions of Corollary 1 in Herrndorf’s paper [7]: we verify that the mixing coefficients decrease sufficiently fast and that the covariance function of the process \(\theta \) converges to its stationary counterpart.

3.1 Drift and Minorization Conditions for \(\theta _{t}\)

In this subsection, we establish suitable versions of the standard drift and minorization conditions, known from the theory of Markov chains (see e.g. [12]), for \((\theta _{t})_{t\in {\mathbb {N}}}\). According to the next lemma, there is an \(a=a(\lambda )>0\) such that for the Lyapunov function \(V(\theta ) = \exp (a\Vert \theta \Vert ^2)\) and for the parametric kernel Q, a Foster–Lyapunov-type drift condition holds.

Lemma 3.1

For any \(0<\lambda <\frac{\Delta }{K^2}\), there exists \(a>0\) such that for \(V(\theta )=\exp (a \Vert \theta \Vert ^2)\),

$$\begin{aligned}{}[Q(y)V](\theta ):=\int _{{\mathcal {X}}} V(z)\,Q(y,\theta ,\textrm{d}z) \le \gamma V(\theta )+C,\,\,\Vert y\Vert \le M \end{aligned}$$

holds with constants \(\gamma \in (0,1)\) and \(C\ge 1\).

Proof

We can write

$$\begin{aligned}{}[Q(y)V](\theta )&= {\mathbb {E}}\left[ \exp \left( a\Vert \theta -\lambda H(\theta ,y)+\sqrt{2\lambda }\xi _0\Vert ^2\right) \right] \\&= \frac{1}{(1-4\lambda a)^{d/2}}\exp \left( {a\frac{\Vert \theta -\lambda H(\theta ,y)\Vert ^2}{1-4\lambda a}}\right) , \end{aligned}$$

where by Assumption 2.1 (expanding the square and using (4), (5) together with the elementary inequality \((u+v)^2\le 2u^2+2v^2\)), for \(\Vert y\Vert \le M\), we have

$$\begin{aligned} \Vert \theta -\lambda H(\theta ,y)\Vert ^2 \le (2K^2\lambda ^2-2\Delta \lambda +1)\Vert \theta \Vert ^2 + 2(\lambda b+\lambda ^2 K^2 (1+M)^2). \end{aligned}$$

For \(0<\lambda <\frac{\Delta }{K^2}\), \(0<2K^2\lambda ^2-2\Delta \lambda +1<1\) hence we can choose \(a>0\) so small that

$$\begin{aligned} \frac{2K^2\lambda ^2-2\Delta \lambda +1}{1-4\lambda a}<1. \end{aligned}$$

To sum up, we obtained that there are \(c_1,c_2>0\) such that \(c_2<a\) and \([Q(y)V](\theta )\le c_1 \exp (c_2 \Vert \theta \Vert ^2)\) hence for \(r>0\) large enough \(\gamma :=c_1 e^{-(a-c_2)r^2}<1\), and thus

$$\begin{aligned}{}[Q(y)V](\theta )\le \gamma V(\theta )+C \end{aligned}$$

holds with \(C=c_1 e^{c_2 r^{2}}\), which completes the proof. \(\square \)
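For completeness, the Gaussian moment identity used in the first equality of the above proof can be checked coordinatewise: for a scalar standard Gaussian \(\xi \), \(\mu \in {\mathbb {R}}\) and \(0<4\lambda a<1\), completing the square gives

$$\begin{aligned} {\mathbb {E}}\left[ e^{a(\mu +\sqrt{2\lambda }\xi )^2}\right] =\frac{1}{\sqrt{2\pi }}\int _{{\mathbb {R}}}e^{a(\mu +\sqrt{2\lambda }x)^2-x^2/2}\,\textrm{d}x=\frac{1}{\sqrt{1-4\lambda a}}\exp \left( \frac{a\mu ^2}{1-4\lambda a}\right) , \end{aligned}$$

and taking the product over the d independent coordinates yields the expression for \([Q(y)V](\theta )\).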

Corollary 3.2

By induction, it easily follows that for any collection \(\{y_a,y_{a+1},\ldots ,y_b\}\subset {\mathcal {Y}}\), \(\Vert y_i\Vert \le M\), \(a\le i\le b\), we have

$$\begin{aligned}{}[Q(y_b)\ldots Q(y_a)V](\theta ):=[Q(y_b)[\ldots [Q(y_a)V]\ldots ]](\theta )\le \gamma ^{b-a+1}V(\theta )+\frac{C}{1-\gamma } \end{aligned}$$
(9)

hence by the tower rule, we can estimate further and obtain

$$\begin{aligned} \sup _{t\in {\mathbb {N}}} {\mathbb {E}}(V (\theta _t'))\le {\mathbb {E}}(V (\theta _0'))+\frac{C}{1-\gamma }<\infty \end{aligned}$$

for initial values \(\theta _0'\) satisfying \( {\mathbb {E}}(V (\theta _0'))<\infty \).

From now on, let us fix a \(\lambda \in (0,\Delta /K^2)\) and \(a>0\) as in Lemma 3.1.

In the theory of Markov chains, the Foster-Lyapunov condition is often accompanied by a minorization condition on suitable “small sets”. In the current model we do have such a minorization condition on every compact set. (In other words, compact sets are small.) To see this, for fixed \(R>0\), \(\Vert \theta \Vert \le R\) and \(\Vert y \Vert \le M\), we can write

$$\begin{aligned} Q(y,\theta ,A)&= \int _{{\mathcal {X}}} \mathbbm {1}_{\theta -\lambda H(\theta ,y)+\sqrt{2\lambda }z\in A}f_{\xi _0}(z)\,\textrm{d}z \\&= \int _{{\mathcal {X}}} \mathbbm {1}_{u\in A}\frac{1}{(2\lambda )^{d/2}}f_{\xi _0}\left( \frac{1}{\sqrt{2\lambda }}(\theta -\lambda H(\theta ,y)-u) \right) \,\textrm{d}u \\&\ge m_{R,M,\lambda ,K} \times \textrm{Leb} ( A\cap \{x\mid \Vert x\Vert \le R\}), \end{aligned}$$

where \(f_{\xi _0}\) is the probability density function of \(\xi _0\) and the positive constant \(m_{R,M,\lambda ,K}\) is given by

$$\begin{aligned} m_{R,M,\lambda ,K} = \frac{1}{(2\lambda )^{d/2}}\inf \left\{ f_{\xi _0}(z)\bigg | z\in {\mathcal {X}},\,\Vert z\Vert \le \frac{(\lambda K+2)R+\lambda K (M+1)}{\sqrt{2\lambda }} \right\} . \end{aligned}$$

We note this observation in the next lemma.

Lemma 3.3

For every \(R>0\), there is a Borel probability measure \(\nu _R\) on \({\mathcal {B}}({\mathcal {X}})\) and a coefficient \(\tilde{\alpha }_R\in (0,1)\) such that for \(\Vert y\Vert \le M\) and \(\Vert \theta \Vert \le R\),

$$\begin{aligned} Q(y,\theta ,A) \ge \tilde{\alpha }_R \nu _R (A),\,\,A\in {\mathcal {B}}({\mathcal {X}}). \end{aligned}$$
(10)

3.2 Stationary Initialization

We need to show that, for a suitable random initial state \(\theta _0^*\), the recursion (1) admits a strict-sense stationary version \((\theta _{t}^*)_{t\in {\mathbb {N}}}\).

Let \({\mathcal {M}}^Y\) be the set of Borel probability laws on \({\mathcal {X}}\times {\mathcal {Y}}^{\mathbb {Z}}\) whose second marginal equals the law of \((Y_t)_{t\in {\mathbb {Z}}}\), and let \({\mathcal {M}}_b^Y\) denote the set of those \(\mu \in {\mathcal {M}}^Y\) for which the process \((\theta '_t)_{t\in {\mathbb {N}}}\) started from some random initial state \(\theta '_0\) with \(\text {Law}((\theta '_0,(Y_t)_{t\in {\mathbb {Z}}}))=\mu \) satisfies

$$\begin{aligned} \sup _{t\in {\mathbb {N}}} {\mathbb {P}}(\Vert \theta '_t\Vert \ge n)\rightarrow 0,\,n\rightarrow \infty . \end{aligned}$$
(11)

By Corollary 3.2 and the Markov inequality, for every random variable \((\theta '_0,(Y_t)_{t\in {\mathbb {Z}}})\) with law in \({\mathcal {M}}^Y\) and \({\mathbb {E}}(V(\theta _0'))<\infty \), (11) holds hence \(\text {Law}((\theta '_0,(Y_t)_{t\in {\mathbb {Z}}}))\in {\mathcal {M}}_b^Y\). In particular, for any deterministic \(\theta _0\in {\mathcal {X}}\), \(\delta _{\theta _0}\otimes \text {Law}((Y_t)_{t\in {\mathbb {Z}}})\in {\mathcal {M}}_b^Y\), where \(\delta _{\theta _0}\) stands for the Dirac measure concentrated on \(\theta _0\). It follows that \({\mathcal {M}}_b^Y\ne \emptyset \).

Lemma 3.4

For each \(\text {Law}((\theta '_0,(Y_t)_{t\in {\mathbb {Z}}}))\in {\mathcal {M}}_b^Y\), there exists a limiting probability \(\mu ^*\) on \({\mathcal {B}}({\mathcal {X}}\times {\mathcal {Y}}^{\mathbb {Z}})\) such that

$$\begin{aligned} d_{\text {TV}}(\text {Law}((\theta '_t,(Y_{k+t})_{k\in {\mathbb {Z}}})),\mu ^*)\rightarrow 0,\,t\rightarrow \infty . \end{aligned}$$

In addition, \(\mu ^*\) does not depend on the choice of \((\theta '_0,(Y_t)_{t\in {\mathbb {Z}}})\). If \(\text {Law}((\theta ^{*}_{0},(Y_{k})_{k\in {\mathbb {Z}}}))=\mu ^{*}\) then the process \((\theta _t^*,(Y_{k+t})_{k\in {\mathbb {Z}}})\), \(t\in {\mathbb {N}}\) is (strict-sense) stationary.

Proof

The statement follows from Rásonyi and Gerencsér’s recent result, Theorem 3.10 in [5]. They also prove that \(\text {Law}((\theta _t^*,(Y_{k+t})_{k\in {\mathbb {Z}}}))=\mu ^*\) for each \(t\in {\mathbb {N}}\). Since \((\theta ^{*}_t,(Y_{t+k})_{k\in {\mathbb {Z}}})\), \(t\in {\mathbb {N}}\) is a time-homogeneous Markovian process, strong stationarity follows. \(\square \)

Remark 3.5

By Corollary 3.2 and Lemma 3.4, for any arbitrary but deterministic \(\theta _0\in {\mathcal {X}}\), we have

$$\begin{aligned} {\mathbb {E}}(V(\theta _0^*)) = \lim _{\Sigma \rightarrow \infty } {\mathbb {E}}(\min (\Sigma ,V(\theta _0^*))) = \lim _{\Sigma \rightarrow \infty } \lim _{t\rightarrow \infty } {\mathbb {E}}(\min (\Sigma ,V(\theta _t))) \le V(\theta _0)+\frac{C}{1-\gamma }, \end{aligned}$$

and thus, by the strong stationarity of \((Y_t)_{t\in {\mathbb {Z}}}\), it immediately follows that \(\mu ^*\in {\mathcal {M}}_b^Y\).

Remark 3.6

With the above form of the drift and minorization conditions in hand and using a recent result of Truquet’s (Theorem 1 in [15]), we could as well have deduced the existence of a stationary process \((\theta _t^*)_{t\in {\mathbb {Z}}}\) satisfying

$$\begin{aligned} {\mathbb {P}}(\theta _{t+1}^*\in A\mid (\theta _{i}^*)_{i\le t}, (Y_j)_{j\in {\mathbb {Z}}}) = Q(Y_t,\theta _{t}^*,A),\,\,A\in {\mathcal {B}}({\mathcal {X}}),\,\,t\in {\mathbb {Z}}. \end{aligned}$$

However, we will need a bit more. We aim to show that there is a coupling between the iterations \((\theta _t)_{t\in {\mathbb {N}}}\) initialized with the deterministic value \(\theta _{0}\in {\mathcal {X}}\) and an appropriate version of \((\theta _t^*)_{t\in {\mathbb {Z}}}\). This is why we preferred the approach presented in [5].

It is also shown in [15] that the distribution of \((\theta _t^*)_{t\in {\mathbb {Z}}}\) is unique, moreover the process \((\theta _t^*, Y_t)_{t\in {\mathbb {Z}}}\) is ergodic provided that \((Y_t)_{t\in {\mathbb {Z}}}\) is ergodic. The latter will be very important for us, since the proof of Theorem 2.4 relies on this result.

In addition, Truquet proved that under a milder form of the drift and minorization conditions (see Assumptions A2 and A3 in [15]), \(\text {Law}(\theta _t)\rightarrow \text {Law}(\theta _0^*)\) in total variation as \(t\rightarrow \infty \). However, as Truquet remarked, the assumptions of [15] do not yield a rate of convergence for \(\text {Law}(\theta _t)\).

In the rest of this subsection, we present an alternative approach yielding a slightly stronger result on the convergence of \((\text {Law}(\theta _t))_{t\in {\mathbb {N}}}\), using the results of the recent paper [4]. The reader may skip this part without loss of continuity. The \((1+V)\)-weighted total variation distance for any pair of Borel probability measures \(\mu ,\nu \) on \({\mathcal {B}}({\mathcal {X}})\) is defined by

$$\begin{aligned} d_{\text {TV}}^{1+V}(\mu ,\nu ):= \int _{{\mathcal {X}}} (1+V(\theta ))|\mu -\nu |(\textrm{d}\theta ). \end{aligned}$$

Lemma 3.7

There exist constants \(c_1,c_2>0\) such that for \(V(\theta ) =e^{\frac{a}{2}\Vert \theta \Vert ^2}\),

$$\begin{aligned} d_{\text {TV}}^{1+V}(\text {Law}(\theta _n),\text {Law}(\theta _{0}^*))\le c_1 e^{-c_2 n},\,n\in {\mathbb {N}}. \end{aligned}$$

Proof

With the above choice of V, we have \({\mathbb {E}}(V(\theta _{0})^2+V(\theta _1)^2)<\infty \), hence the moment condition on initial values, i.e. Assumption 2.6 in [4], is in force, and the other assumptions of [4] are also clearly met (with the quantities \(\lambda ,\alpha ,K\) constant and with \(\ell \equiv 0\) since Y is bounded). Hence Theorem 2.11 of [4] implies the convergence of \(\text {Law}(\theta _t)\) towards the limiting distribution \(\text {Law}(\theta _{0}^*)\) at a geometric rate in \(d_{\text {TV}}^{1+V}\). \(\square \)

Corollary 3.8

It is clear from the definition of \(d_{\text {TV}}^{1+V}\) and from Lemma 3.7 that for any \(\phi \) satisfying (8),

$$\begin{aligned} {\mathbb {E}}(\phi (\theta _n))\rightarrow {\mathbb {E}}(\phi (\theta _0^*)),\,\,n\rightarrow \infty . \end{aligned}$$

In particular, \({\mathbb {E}}(\Vert \theta _n\Vert ^p)\rightarrow {\mathbb {E}}(\Vert \theta _0^*\Vert ^p)\), as \(n\rightarrow \infty \), for every \(1\le p<\infty \).

3.3 Coupling Construction

Let \(R>0\) be a constant to be fixed later, and let \((\varepsilon _t)_{t\in {\mathbb {Z}}}\) be a sequence of i.i.d. uniform random variables on [0, 1], independent of \((Y_k)_{k\in {\mathbb {Z}}}\) and also independent of \((\xi _n)_{n\ge 1}\). The next lemma is a standard representation result for parametric kernels satisfying the minorization condition (10).

Lemma 3.9

Under the minorization condition (c.f. (10) in Lemma 3.3), there exists a measurable function \(T:{\mathcal {Y}}\times {\mathcal {X}}\times [0,1]\rightarrow {\mathcal {X}}\) such that

$$\begin{aligned} {\mathbb {P}}(T (y,\theta ,\varepsilon _0)\in A) = Q(y,\theta ,A), \end{aligned}$$

for all \(\theta \in {\mathcal {X}}\), \(A\in {\mathcal {B}}({\mathcal {X}})\) and \(y\in {\mathcal {Y}}\) such that \(\Vert y\Vert \le M\). Furthermore, for \(u\in [0,\tilde{\alpha }_R]\),

$$\begin{aligned} T (y,\theta _1,u) = T (y,\theta _2,u),\,\,\Vert y\Vert \le M,\,\theta _1,\theta _2\in \{\theta \mid \Vert \theta \Vert \le R\}. \end{aligned}$$

Proof

For the proof, we refer the reader to Lemma 7.1 in [11]. \(\square \)

We suppress the dependence of the mappings T on \(\varepsilon _t\) in the notation and simply write \(T_t(y)\theta := T(y, \theta , \varepsilon _t)\). For \(s\in {\mathbb {Z}}\) and \(\theta \in {\mathcal {X}}\), define the family of auxiliary processes

$$\begin{aligned} Z_{s,t}^{\theta ,\textbf{y}} = \theta ,\,t\le s, \quad Z_{s,t}^{\theta ,\textbf{y}} = T_t(y_{t-1})Z_{s,{t-1}}^{\theta ,\textbf{y}},\,t>s, \end{aligned}$$
(12)

where \(\textbf{y}=(\ldots ,y_{-1},y_0,y_1,\ldots )\in {\mathcal {Y}}^{\mathbb {Z}}\) is a fixed trajectory. Clearly, for any random variable \((\theta '_0,(Y_k)_{k\in {\mathbb {Z}}})\) and \(s\in {\mathbb {N}}\), \(Z_{s,t}^{\theta '_s,\textbf{Y}}\), \(t\ge s\) is a version of the process \((\theta '_t)_{t\in {\mathbb {N}}}\) defined through the iterative scheme (1), starting from \(\theta '_0\) and driven by \((Y_k)_{k\in {\mathbb {Z}}}\). Furthermore, the process \(Z_{s,t}^{\theta _0,\textbf{y}}\), \(t\ge s\) is a time-inhomogeneous Markov chain that follows the dynamics of \(\theta _t\), \(t\in {\mathbb {N}}\) with the environment being “frozen”. Since the process \((Y_k)_{k\in {\mathbb {Z}}}\) is almost surely bounded by \(M>0\), we can restrict ourselves to trajectories \(\textbf{y}\in {\mathcal {Y}}^{\mathbb {Z}}\) satisfying \(\sup _{k\in {\mathbb {Z}}}\Vert y_k\Vert \le M\), and thus \(Z_{s,t}^{\theta _0,\textbf{y}}\), \(t\ge s\) is a Harris recurrent chain. The next lemma controls the coupling time between processes starting from different initial values.
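To make the role of the shared uniforms \(\varepsilon _t\) concrete, here is a purely illustrative Python sketch of one coupled transition in the spirit of Lemma 3.9, using the explicit Gaussian form of Q: when both chains lie in the small set \(\{\Vert \theta \Vert \le R\}\) and the shared uniform falls below \(\tilde{\alpha }_R\), they receive a common draw from \(\nu _R\) (the uniform law on the ball, cf. the computation preceding Lemma 3.3) and coalesce; otherwise each chain moves with the residual kernel. This is a generic Nummelin-type splitting with auxiliary randomness, not the exact function T constructed in [11].

```python
import math
import numpy as np

def sample_uniform_ball(d, R, rng):
    """Draw from nu_R, the uniform distribution on the ball {||x|| <= R}."""
    direction = rng.standard_normal(d)
    direction /= np.linalg.norm(direction)
    return R * rng.uniform() ** (1.0 / d) * direction

def sample_residual(mean, lam, R, alpha_tilde, rng):
    """Rejection sampler for the residual kernel (Q - alpha_tilde * nu_R) / (1 - alpha_tilde),
    where Q = N(mean, 2*lam*I); alpha_tilde is assumed to satisfy the minorization (10)."""
    d = len(mean)
    nu_dens = math.gamma(d / 2 + 1) / (math.pi ** (d / 2) * R ** d)  # density of nu_R on the ball
    while True:
        z = mean + np.sqrt(2.0 * lam) * rng.standard_normal(d)       # proposal from Q itself
        q_dens = (4.0 * math.pi * lam) ** (-d / 2) * np.exp(-np.sum((z - mean) ** 2) / (4.0 * lam))
        nu_val = nu_dens if np.linalg.norm(z) <= R else 0.0
        if rng.uniform() <= 1.0 - alpha_tilde * nu_val / q_dens:     # accept with prob 1 - alpha*nu/q
            return z

def coupled_step(theta1, theta2, y, H, lam, R, alpha_tilde, rng):
    """One coupled transition: marginally each chain moves according to Q(y, theta, .);
    the shared regeneration draw forces coalescence on the small set."""
    eps = rng.uniform()                                  # the shared epsilon_t of the construction
    z_common = sample_uniform_ball(len(theta1), R, rng)  # shared regeneration draw from nu_R
    new = []
    for theta in (theta1, theta2):
        mean = theta - lam * H(theta, y)
        if np.linalg.norm(theta) <= R and eps <= alpha_tilde:
            new.append(z_common.copy())
        elif np.linalg.norm(theta) <= R:
            new.append(sample_residual(mean, lam, R, alpha_tilde, rng))
        else:
            new.append(mean + np.sqrt(2.0 * lam) * rng.standard_normal(len(theta)))
    return new[0], new[1]
```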

Lemma 3.10

Let \(\theta _1,\theta _2\in {\mathcal {X}}\) be arbitrary but fixed and \(\textbf{y}\in {\mathcal {Y}}^{\mathbb {Z}}\) such that \(\sup _{k\in {\mathbb {Z}}}\Vert y_k\Vert \le M\). Then there exist constants \(\kappa >0\) and \(N\in {\mathbb {N}}\) depending only on \(\lambda \), \(\Delta \), b, K and M such that for \(n\ge N\),

$$\begin{aligned} {\mathbb {P}}(Z_{0,n}^{\theta _1,\textbf{y}}\ne Z_{0,n}^{\theta _2,\textbf{y}})\le \frac{V(\theta _1)+V(\theta _2)+3}{2}e^{-\kappa n}. \end{aligned}$$

Proof

First, we fix \(\gamma<\gamma '<1\) and choose \(R>0\) so large that \(2C<(\gamma '-\gamma )e^{\frac{a}{2}R^2}\). Furthermore, we introduce the notations \(\overline{Z}_n:=\left( Z_{0,n}^{\theta _1,\textbf{y}}, Z_{0,n}^{\theta _2,\textbf{y}}\right) \), \(\Vert \overline{Z}_n\Vert :=\max \left( \left\Vert Z_{0,n}^{\theta _1,\textbf{y}} \right\Vert , \left\Vert Z_{0,n}^{\theta _2,\textbf{y}} \right\Vert \right) \) and the sequence of successive visiting times

$$\begin{aligned} \sigma _0:=0,\,\sigma _{k+1} = \min \left\{ t>\sigma _k\bigg | \Vert \overline{Z}_t\Vert \le R \right\} ,\,k\in {\mathbb {N}}\end{aligned}$$

that are obviously \(\sigma (\varepsilon _t,t\in {\mathbb {Z}})\)-stopping times. Note that on \(\{ \Vert \overline{Z}_t\Vert >R\}\) we have

$$\begin{aligned} \gamma (V( Z_{0,t}^{\theta _1,\textbf{y}})+V( Z_{0,t}^{\theta _2,\textbf{y}}))+2C \le \gamma ' (V( Z_{0,t}^{\theta _1,\textbf{y}})+V( Z_{0,t}^{\theta _2,\textbf{y}})) \end{aligned}$$

and thus for \(k\ge 1\) and \(s\ge 0\), we obtain

$$\begin{aligned} {\mathbb {P}}(\sigma _{k+1}-\sigma _{k}>s\mid \overline{Z}_{\sigma _k})&\le {\mathbb {E}}\left( {\mathbb {P}}(\Vert \overline{Z}_{\sigma _k+s}\Vert>R\mid \overline{Z}_{\sigma _{k}+s-1}) \prod _{j=1}^{s-1} \mathbbm {1}_{\Vert \overline{Z}_{\sigma _k+j}\Vert>R} \bigg | \overline{Z}_{\sigma _k}\right) \\&\le \gamma ' {\mathbb {E}}\left( \frac{V( Z_{0,\sigma _k+s-1}^{\theta _1,\textbf{y}})+V( Z_{0,\sigma _k+s-1}^{\theta _2,\textbf{y}})}{e^{\frac{a}{2}R^2}} \prod _{j=1}^{s-2} \mathbbm {1}_{\Vert \overline{Z}_{\sigma _k+j}\Vert >R} \bigg | \overline{Z}_{\sigma _k}\right) \end{aligned}$$

Iteration of this argument leads to the following estimate.

$$\begin{aligned} {\mathbb {P}}(\sigma _{k+1}-\sigma _{k}>s\mid \overline{Z}_{\sigma _k})&\le (\gamma ')^{s-1}\frac{\gamma V( Z_{0,\sigma _k}^{\theta _1,\textbf{y}})+\gamma V( Z_{0,\sigma _k}^{\theta _2,\textbf{y}})+2C}{e^{\frac{a}{2}R^2}} \\&\le (\gamma ')^{s-1} \frac{2\gamma e^{\frac{a}{2}R^2}+2C}{e^{\frac{a}{2}R^2}} \le (\gamma ')^{s-1} (\gamma '+\gamma )\le 2(\gamma ')^s. \end{aligned}$$

Along similar lines, we can show that

$$\begin{aligned} {\mathbb {P}}(\sigma _1>s){} & {} \le (\gamma ')^{s} \left[ e^{-\frac{a}{2}R^2}(V(\theta _1)+V(\theta _2)) + 1-\frac{\gamma }{\gamma '} \right] \\{} & {} \le (e^{-\frac{a}{2}R^2}(V(\theta _1)+V(\theta _2)) + 1)(\gamma ')^{s}. \end{aligned}$$

Let us fix \(\gamma ''\) such that \(\gamma '<\gamma ''<1\). For the generating function of the time elapsed between the kth and \((k+1)\)th visits, we get

$$\begin{aligned}{} & {} {\mathbb {E}}\left( \frac{1}{(\gamma '')^{\sigma _{k+1}-\sigma _k}}\bigg | {\mathcal {F}}_{-\infty ,\sigma _k}^\varepsilon \right) = \sum _{j=1}^{\infty } \frac{1}{(\gamma '')^{j}} {\mathbb {P}}(\sigma _{k+1}-\sigma _{k}=j\mid \overline{Z}_{\sigma _k})\\{} & {} \quad \le \sum _{j=1}^{\infty } \frac{2 (\gamma ')^{j-1}}{(\gamma '')^{j}} = \frac{2}{\gamma ''-\gamma '},\,k\ge 1, \end{aligned}$$

and similarly, for \(k=0\),

$$\begin{aligned} {\mathbb {E}}\left( \frac{1}{(\gamma '')^{\sigma _{1}}}\right) \le \frac{e^{-\frac{a}{2}R^2}(V(\theta _1)+V(\theta _2)) + 1}{\gamma ''-\gamma '} \end{aligned}$$

hence by the Markov inequality and the tower rule, for \(0<m<n\), we obtain

$$\begin{aligned} {\mathbb {P}}(\sigma _m\ge n)&\le (\gamma '')^n {\mathbb {E}}\left( \frac{1}{(\gamma '')^{\sigma _m}}\right) = (\gamma '')^n {\mathbb {E}}\left( {\mathbb {E}}\left( \frac{1}{(\gamma '')^{\sigma _m-\sigma _{m-1}}} \bigg |{\mathcal {F}}_{-\infty ,\sigma _{m-1}}^\varepsilon \right) \frac{1}{(\gamma '')^{\sigma _{m-1}}}\right) \\&\le (\gamma '')^n\frac{2}{\gamma ''-\gamma '}{\mathbb {E}}\left( \frac{1}{(\gamma '')^{\sigma _{m-1}}}\right) \le \ldots \\&\le \frac{e^{-\frac{a}{2}R^2}(V(\theta _1)+V(\theta _2)) + 1}{2} \left( \frac{2^m}{(\gamma ''-\gamma ')^m}\right) (\gamma '')^n. \end{aligned}$$

Again we fix a constant \(\gamma '''\) such that \(\gamma ''<\gamma '''<1\), and define

$$\begin{aligned} m_n:= \left\lfloor n\frac{\log \gamma '''-\log \gamma ''}{\log 2-\log (\gamma ''-\gamma ')}\right\rfloor . \end{aligned}$$

Obviously, for n so large that \(m_n\ge 1\), we have

$$\begin{aligned} {\mathbb {P}}(\sigma _{m_n}\ge n)\le \frac{e^{-\frac{a}{2}R^2}(V(\theta _1)+V(\theta _2)) + 1}{2} (\gamma ''')^n. \end{aligned}$$

Next, we estimate the probability of no coupling on the event that the small set is visited at least \(m_n\) times. According to Lemma 3.9, for \(j=1,\ldots ,m_n\), \(\theta \mapsto T(y,\theta ,\varepsilon _{\sigma _j+1})\) is constant on the ball \(\{\theta \mid \Vert \theta \Vert \le R\}\) with probability at least \(\tilde{\alpha }_R\) hence we can write

$$\begin{aligned} {\mathbb {P}}(Z_{0,n}^{\theta _1,\textbf{y}}\ne Z_{0,n}^{\theta _2,\textbf{y}},\sigma _{m_n}<n) \le {\mathbb {P}}(\varepsilon _{\sigma _j+1}>\tilde{\alpha }_R;\, j= 1,\ldots , m_n) = (1-\tilde{\alpha }_R)^{m_n}, \end{aligned}$$

where we used that for every j, \(\varepsilon _{\sigma _j+1}\) is independent of \({\mathcal {F}}_{-\infty ,\sigma _j}^\varepsilon \).

At last, we combine this estimate with the one obtained for the tail probability of the visiting times, and arrive at

$$\begin{aligned} {\mathbb {P}}(Z_{0,n}^{\theta _1,\textbf{y}}\ne Z_{0,n}^{\theta _2,\textbf{y}})&\le {\mathbb {P}}(Z_{0,n}^{\theta _1,\textbf{y}}\ne Z_{0,n}^{\theta _2,\textbf{y}},\sigma _{m_n}<n) + {\mathbb {P}}(\sigma _{m_n}\ge n)\\&\le (1-\tilde{\alpha }_R)^{m_n} + \frac{e^{-\frac{a}{2}R^2}(V(\theta _1)+V(\theta _2)) + 1}{2} (\gamma ''')^n. \end{aligned}$$

Since both terms on the right-hand side decay geometrically in n, the claimed bound follows for suitable \(\kappa >0\) and \(N\in {\mathbb {N}}\), which completes the proof.

\(\square \)

The following annealed version of Lemma 3.10 will be important later.

Lemma 3.11

Let \(\theta _{1},\theta _{2}\) be random variables independent of \({\mathcal {F}}^{\varepsilon }_{m+1,\infty }\) for some \(m\in {\mathbb {N}}\). Then

$$\begin{aligned} {\mathbb {P}}(Z_{m,n}^{\theta _1,\textbf{Y}}\ne Z_{m,n}^{\theta _2,\textbf{Y}})\le \frac{{\mathbb {E}}(V(\theta _1))+{\mathbb {E}}(V(\theta _2))+3}{2}e^{-\kappa (n-m)},\,\,n\ge N+m. \end{aligned}$$

Proof

Estimating the conditional probability using Lemma 3.10 above yields

$$\begin{aligned}&{\mathbb {P}}(Z_{m,n}^{\theta _1,\textbf{Y}}\ne Z_{m,n}^{\theta _2,\textbf{Y}}\mid \textbf{Y}=\textbf{y},\,\theta _1=x_1,\,\theta _2=x_2) = {\mathbb {P}}(Z_{m,n}^{x_1,\textbf{y}}\ne Z_{m,n}^{x_2,\textbf{y}})\\&\quad = {\mathbb {P}}\left( Z_{0,n-m}^{x_1,S^m\textbf{y}}\ne Z_{0,n-m}^{x_2,S^m\textbf{y}} \right) \le \frac{V(x_1)+V(x_2)+3}{2}e^{-\kappa (n-m)},\,\, n\ge N+m, \end{aligned}$$

where \(S^{m}\textbf{y}\) denotes the m-fold left-shifted trajectory, i.e. \(\left( S^m\textbf{y}\right) _k=y_{k+m}\), \(k\in {\mathbb {Z}}\). Finally, we take expectations and obtain the claimed inequality. \(\square \)

3.4 Mixing Properties

In what follows, we show that mixing properties of the exogenous environment transfer to the process \(\theta _t\), \(t\in {\mathbb {N}}\). For any system of sub-\(\sigma \)-algebras \({\mathcal {A}}_i\subset {\mathcal {F}}\), \(i\in I\), we use the notation \(\bigvee _{i\in I}{\mathcal {A}}_i\) for the \(\sigma \)-algebra generated by the system \(({\mathcal {A}}_i)_{i\in I}\).

Lemma 3.12

Suppose \({\mathcal {A}}_n\) and \({\mathcal {B}}_n\), \(n=1,2,\ldots \) are sub-\(\sigma \)-algebras of \({\mathcal {F}}\) such that the \(\sigma \)-algebras \({\mathcal {A}}_n\vee {\mathcal {B}}_n\), \(n=1,2,\ldots \) are pairwise independent. Then

$$\begin{aligned} \alpha \left( \bigvee _{n=1}^\infty {\mathcal {A}}_n,\bigvee _{n=1}^\infty {\mathcal {B}}_n\right) \le \sum _{n=1}^\infty \alpha ({\mathcal {A}}_n, {\mathcal {B}}_n). \end{aligned}$$

Proof

The proof can be found in [2, Lemma 8 on page 13]. \(\square \)

Remark 3.13

We need the special case when \({\mathcal {A}}_1,{\mathcal {A}}_2,{\mathcal {B}}_1,{\mathcal {B}}_2\subset {\mathcal {F}}\) are \(\sigma \)-algebras such that \({\mathcal {A}}_1\vee {\mathcal {B}}_1\) and \({\mathcal {A}}_2\vee {\mathcal {B}}_2\) are independent and, in addition, \({\mathcal {A}}_2\) and \({\mathcal {B}}_2\) are independent of each other. For this, Lemma 3.12 gives \(\alpha ({\mathcal {A}}_1\vee {\mathcal {A}}_2,{\mathcal {B}}_1\vee {\mathcal {B}}_2)\le \alpha ({\mathcal {A}}_1,{\mathcal {B}}_1)\). By the definition of the measure of dependence between \(\sigma \)-algebras (7), the reverse inequality trivially holds hence

$$\begin{aligned} \alpha ({\mathcal {A}}_1\vee {\mathcal {A}}_2,{\mathcal {B}}_1\vee {\mathcal {B}}_2)= \alpha ({\mathcal {A}}_1,{\mathcal {B}}_1). \end{aligned}$$

The next lemma provides an upper bound for the strong mixing coefficient of the chain \((\theta _t)_{t\in {\mathbb {N}}}\) given \(\alpha ^Y\).

Lemma 3.14

For the dependence coefficient \(\alpha _j^\theta (n)\), we have the following upper estimate

$$\begin{aligned} \alpha _j^\theta (n) \le \alpha ^Y (\lfloor n/2\rfloor ) + \left( V(\theta _{0})+\frac{3}{2}+\frac{C}{2(1-\gamma )}\right) e^{-\frac{\kappa }{2}n},\,\,j\ge 0,\,n\ge 2N, \end{aligned}$$

where \(\kappa \) and N are as in Lemma 3.10.

Proof

We introduce the notations \(\theta _{\rightarrow t}:=(\theta _0,\theta _1,\ldots ,\theta _t)\) and \(\theta _{t\rightarrow }:=(\theta _t,\theta _{t+1},\ldots )\), and also \(Z_{s,\rightarrow t}^{\theta _0,\textbf{y}}:=(Z_{s,s}^{\theta _0,\textbf{y}},\ldots ,Z_{s,t}^{\theta _0,\textbf{y}})\), \(Z_{s,t\rightarrow }^{\theta _0,\textbf{y}}:=(Z_{s,t}^{\theta _0,\textbf{y}},Z_{s,t+1}^{\theta _0,\textbf{y}},\ldots )\). Let \(A\in {\mathcal {F}}_{0,j}^\theta \) and \(B\in {\mathcal {F}}_{j+n,\infty }^\theta \) be arbitrary events. Then by the definition of the generated \(\sigma \)-algebra, there exist \(A^{\mathcal {X}}\in {\mathcal {B}}({\mathcal {X}}^{j+1})\) and \(B^{\mathcal {X}}\in {\mathcal {B}}({\mathcal {X}}^{\mathbb {N}})\) such that

$$\begin{aligned} A=\left\{ \omega \in \Omega \bigg | \theta _{\rightarrow j}(\omega )\in A^{\mathcal {X}}\right\} \,\,\text {and}\,\,B=\left\{ \omega \in \Omega \bigg |\theta _{j+n\rightarrow }(\omega )\in B^{\mathcal {X}}\right\} . \end{aligned}$$

So, for any \(r_n\) satisfying \(0\le r_n\le n-N\) we can write

$$\begin{aligned} \begin{aligned} |{\mathbb {P}}(A\cap B)-{\mathbb {P}}(A){\mathbb {P}}(B)|&= |\textrm{Cov}(\mathbbm {1}_{\theta _{\rightarrow j}\in A^{\mathcal {X}}},\mathbbm {1}_{\theta _{j+n\rightarrow }\in B^{\mathcal {X}}})| = \left| \textrm{Cov}(\mathbbm {1}_{Z_{0,\rightarrow j}^{\theta _0,\textbf{Y}}\in A^{\mathcal {X}}},\mathbbm {1}_{Z_{0, j+n\rightarrow }^{\theta _0,\textbf{Y}}\in B^{\mathcal {X}}}) \right| \\&\le \left| \textrm{Cov}(\mathbbm {1}_{Z_{0,\rightarrow j}^{\theta _0,\textbf{Y}}\in A^{\mathcal {X}}},\mathbbm {1}_{Z_{j+r_n, j+n\rightarrow }^{\theta _0,\textbf{Y}}\in B^{\mathcal {X}}}) \right| + {\mathbb {P}}\left( Z_{0, j+n}^{\theta _0,\textbf{Y}}\ne Z_{j+r_n, j+n}^{\theta _0,\textbf{Y}}\right) . \end{aligned} \end{aligned}$$
(13)

Observe that \(\mathbbm {1}_{Z_{0,\rightarrow j}^{\theta _0,\textbf{Y}}\in A^{\mathcal {X}}}\) is \({\mathcal {F}}_{-\infty ,j-1}^Y\vee {\mathcal {F}}_{1,j}^\varepsilon \)-measurable and \(\mathbbm {1}_{Z_{j+r_n, j+n\rightarrow }^{\theta _0,\textbf{Y}}\in B^{\mathcal {X}}}\) is \({\mathcal {F}}_{j+r_n,j+n-1}^Y\vee {\mathcal {F}}_{j+r_n+1,j+n}^\varepsilon \)-measurable, moreover \({\mathcal {F}}_{-\infty ,j-1}^Y\vee {\mathcal {F}}_{j+r_n,j+n-1}^Y\) is independent of \({\mathcal {F}}_{1,j}^\varepsilon \vee {\mathcal {F}}_{j+r_n+1,j+n}^\varepsilon \), and also the \(\sigma \)-algebras \({\mathcal {F}}_{1,j}^\varepsilon \) and \({\mathcal {F}}_{j+r_n+1,j+n}^\varepsilon \) are independent of each other hence by Remark 3.13 and the stationarity of \((Y_k)_{k\in {\mathbb {Z}}}\), we have

$$\begin{aligned} \begin{aligned} \left| \textrm{Cov}(\mathbbm {1}_{Z_{0,\rightarrow j}^{\theta _0,\textbf{Y}}\in A^{\mathcal {X}}},\mathbbm {1}_{Z_{j+r_n, j+n\rightarrow }^{\theta _0,\textbf{Y}}\in B^{\mathcal {X}}}) \right|&\le \alpha \left( {\mathcal {F}}_{-\infty ,j-1}^Y\vee {\mathcal {F}}_{1,j}^\varepsilon , {\mathcal {F}}_{j+r_n,j+n-1}^Y\vee {\mathcal {F}}_{j+r_n+1,j+n}^\varepsilon \right) \\&\le \alpha \left( {\mathcal {F}}_{-\infty ,j-1}^Y, {\mathcal {F}}_{j+r_n,j+n-1}^Y \right) \le \alpha _{j-1}^Y (r_n+1) \\&=\alpha ^Y (r_n+1). \end{aligned} \end{aligned}$$
(14)

By Lemma 3.11 and Corollary 3.2, we can estimate the second term on the right-hand side of (13)

$$\begin{aligned}&{\mathbb {P}}\left( Z_{0, j+n}^{\theta _0,\textbf{Y}}\ne Z_{j+r_n, j+n}^{\theta _0,\textbf{Y}}\right) = {\mathbb {P}}\left( Z_{j+r_n, j+n}^{\theta _{j+r_n},\textbf{Y}}\ne Z_{j+r_n, j+n}^{\theta _0,\textbf{Y}}\right) \\&\quad \le \frac{{\mathbb {E}}(V(\theta _{j+r_n}))+V(\theta _{0})+3}{2}e^{-\kappa (n-r_n)} \\&\quad \le \left( V(\theta _{0})+\frac{3}{2}+\frac{C}{2(1-\gamma )}\right) e^{-\kappa (n-r_n)}. \end{aligned}$$

Combining this with (14), and taking the supremum on the left-hand side of (13) yields

$$\begin{aligned} \alpha _j^\theta (n)\le \alpha ^Y (r_n+1) + \left( V(\theta _{0})+\frac{3}{2}+\frac{C}{2(1-\gamma )}\right) e^{-\kappa (n-r_n)} \end{aligned}$$

for any \(0\le r_n\le n-N\). By choosing \(r_n = \lfloor n/2\rfloor \), we obtain the desired inequality

$$\begin{aligned} \alpha _j^\theta (n) \le \alpha ^Y (\lfloor n/2\rfloor ) + \left( V(\theta _{0})+\frac{3}{2}+\frac{C}{2(1-\gamma )}\right) e^{-\frac{\kappa }{2}n},\,\,j\ge 0,\,n\ge 2N. \end{aligned}$$

\(\square \)

3.5 Proof of Theorem 2.3

Lemma 3.15

The sequence \((\phi (\theta _t))_{t\in {\mathbb {N}}}\) is uniformly \(L^p\)-bounded for every \(1\le p<\infty \), that is,

$$\begin{aligned} c_p:=\sup _{t\in {\mathbb {N}}}{\mathbb {E}}^{1/p}(|\phi (\theta _t)|^p)<\infty . \end{aligned}$$
(15)

Proof

Using \(x^s\le \Gamma (s+1) e^x\), \(x,s\ge 0\) and (8), by Corollary 3.2, we can write

$$\begin{aligned}&{\mathbb {E}}^{1/p}(|\phi (\theta _t)|^p) \le c_{\phi }\left( 1+\Gamma \left( \frac{rp}{2}+1\right) ^{1/p}{\mathbb {E}}^{1/p} (V(\theta _t))\right) \\&\quad \le c_{\phi }\left( 1+\Gamma \left( \frac{rp}{2}+1\right) ^{1/p}\left( V(\theta _0)+\frac{C}{1-\gamma }\right) ^{1/p}\right) , \end{aligned}$$

where the upper bound does not depend on t hence (15) clearly holds. \(\square \)

Proof of Theorem 2.3

Let \((\theta _0^*, (Y_k)_{k\in {\mathbb {Z}}})\) be a random variable such that \(\text {Law}(\left( (\theta _0^*, (Y_k)_{k\in {\mathbb {Z}}})\right) )=\mu ^*\). For \(k\in {\mathbb {N}}\), \(0\le i_1<\ldots <i_k\) and \(A\in {\mathcal {B}}({\mathcal {X}}^k)\) arbitrary, we can write

$$\begin{aligned} {\mathbb {P}}\left( \left( \theta _{i_1+n},\ldots ,\theta _{i_k+n}\right) \in A \right)&= {\mathbb {P}}\left( \left( Z_{0,i_1+n}^{\theta _0,\textbf{Y}},\ldots ,Z_{0,i_k+n}^{\theta _0,\textbf{Y}}\right) \in A \right) \\&\le {\mathbb {P}}\left( \left( Z_{0,i_1+n}^{\theta _0^*,\textbf{Y}},\ldots ,Z_{0,i_k+n}^{\theta _0^*,\textbf{Y}} \right) \in A \right) + {\mathbb {P}}\left( Z_{0,i_1+n}^{\theta _0,\textbf{Y}}\ne Z_{0,i_1+n}^{\theta _0^*,\textbf{Y}} \right) \\&\le {\mathbb {P}}\left( \left( \theta _{i_1+n}^*,\ldots ,\theta _{i_k+n}^*\right) \in A \right) + {\mathbb {P}}\left( Z_{0,i_1+n}^{\theta _0,\textbf{Y}}\ne Z_{0,i_1+n}^{\theta _0^*,\textbf{Y}} \right) . \end{aligned}$$

By interchanging the role of \((\theta _t)_{t\in {\mathbb {N}}}\) and \((\theta _t^*)_{t\in {\mathbb {N}}}\), we obtain

$$\begin{aligned}&\left| {\mathbb {P}}\left( \left( \theta _{i_1+n},\ldots ,\theta _{i_k+n}\right) \in A \right) - {\mathbb {P}}\left( \left( \theta _{i_1+n}^*,\ldots ,\theta _{i_k+n}^*\right) \in A \right) \right| \\&\quad \le {\mathbb {P}}\left( Z_{0,i_1+n}^{\theta _0,\textbf{Y}}\ne Z_{0,i_1+n}^{\theta _0^*,\textbf{Y}} \right) ,\,\,A\in {\mathcal {B}}({\mathcal {X}}^k). \end{aligned}$$

Next, we take the supremum over \(A\in {\mathcal {B}}({\mathcal {X}}^k)\) on the left-hand side, and then by Lemma 3.11 and Remark 3.5, we arrive at

$$\begin{aligned}&d_{\text {TV}}\left( \text {Law}((\theta _{i_1+n},\ldots ,\theta _{i_k+n})),\text {Law}((\theta _{i_1}^*,\ldots ,\theta _{i_k}^*))\right) \le {\mathbb {P}}\left( Z_{0,i_1+n}^{\theta _0,\textbf{Y}}\ne Z_{0,i_1+n}^{\theta _0^*,\textbf{Y}} \right) \\&\quad \le \frac{V(\theta _0)+{\mathbb {E}}(V(\theta _0^*))+3}{2}e^{-\kappa (i_1+n)} \\&\quad \le \left( V(\theta _0)+\frac{3}{2}+\frac{C}{2(1-\gamma )}\right) e^{-\kappa n},\,\,n\ge N, \end{aligned}$$

where \(N\in {\mathbb {N}}\) is as in Lemmas 3.10 and 3.11.

In what follows, we proceed with the proof of the law of large numbers both in the strong and in the \(L^p\) sense. Again by Lemma 3.11 and Remark 3.5, there exists an almost surely finite random variable \(\tau \) such that

$$\begin{aligned} Z_{0,n}^{\theta _0,\textbf{Y}}= Z_{0,n}^{\theta _0^*,\textbf{Y}},\,n\ge \tau . \end{aligned}$$

Furthermore, for the tail distribution of \(\tau \),

$$\begin{aligned} {\mathbb {P}}(\tau \ge n)\le {\mathbb {P}}\left( Z_{0,n}^{\theta _0,\textbf{Y}}\ne Z_{0,n}^{\theta _0^*,\textbf{Y}}\right) \le \left( V(\theta _0)+\frac{3}{2}+\frac{C}{2(1-\gamma )}\right) e^{-\kappa n},\,\,n\ge N \end{aligned}$$

holds with constants \(\kappa >0\) and \(N\in {\mathbb {N}}\) as in Lemmas 3.10 and 3.11.

When the data stream \((Y_k)_{k\in {\mathbb {Z}}}\) is ergodic, by Remark 3.6, the process \(Z_{0,n}^{\theta _0^*,\textbf{Y}}\), \(n\in {\mathbb {N}}\) is also ergodic, moreover by Remark 3.5, \({\mathbb {E}}(\phi (\theta _0^*))<\infty \) for any \(\phi :{\mathcal {X}}\rightarrow {\mathbb {R}}\) satisfying (8) hence by Birkhoff’s ergodic theorem,

$$\begin{aligned} \frac{\phi (Z_{0,0}^{\theta _0^*,\textbf{Y}})+\ldots +\phi (Z_{0,n-1}^{\theta _0^*,\textbf{Y}})}{n} \rightarrow {\mathbb {E}}(\phi (\theta _0^*)),\,n\rightarrow \infty ,\,{\mathbb {P}}-\text {a.s.}\end{aligned}$$

Combining this with the above result on the almost surely finite coupling time yields the strong law of large numbers for \(\phi \left( Z_{0,n}^{\theta _0,\textbf{Y}}\right) \), \(n\in {\mathbb {N}}\). As we mentioned earlier, the discrete-time processes \((\theta _n)_{n\in {\mathbb {N}}}\) and \(\left( Z_{0,n}^{\theta _0,\textbf{Y}}\right) _{n\in {\mathbb {N}}}\) are versions of each other hence the strong law of large numbers holds for \((\phi (\theta _n))_{n\in {\mathbb {N}}}\), as well.

Finally, by Lemma 3.15, for every \(1\le p<\infty \), the sequence \(\left| \frac{1}{n}(\phi (\theta _0)+\ldots +\phi (\theta _{n-1}))\right| ^p\), \(n\ge 1\) is uniformly integrable, and thus the law of large numbers holds in the \(L^p\) sense as well, which completes the proof. \(\square \)

3.6 Proof of Theorem 2.4

The subsequent lemma establishes a stability result for the autocovariance function of the sequence \((\phi (\theta _k))_{k\in {\mathbb {N}}}\). Additionally, it provides an explicit upper bound for \(\sup _{k\in {\mathbb {N}}}|\textrm{Cov}(\phi (\theta _k),\phi (\theta _{k+l}))|\) in terms of the \(\alpha \)-mixing coefficient of Y. For the sake of readability, the proof is deferred to Appendix 1.

Lemma 3.16

The autocovariance function of \((\phi (\theta _k))_{k\in {\mathbb {N}}}\) has the following properties.

i) For every \(l\in {\mathbb {N}}\), \(\textrm{Cov}(\phi (\theta _k),\phi (\theta _{k+l}))\rightarrow \textrm{Cov}(\phi (\theta _0^*),\phi (\theta _l^*))\), as \(k\rightarrow \infty \).

ii) There exists a constant \(\Lambda >0\) depending only on \(\Delta \), b, K, M and \(\lambda \) such that for any \(k,l\in {\mathbb {N}}\),

$$\begin{aligned} |\textrm{Cov}(\phi (\theta _k),\phi (\theta _{k+l}))|\le \Lambda \left( \alpha ^Y (\lfloor l/2\rfloor )^{1-\epsilon }+e^{-\frac{\kappa }{4} \lfloor l/2\rfloor } \right) , \end{aligned}$$

where \(\epsilon >0\) is as in Assumption 2.2.

Proof of Theorem 2.4

We are going to verify the conditions of Corollary 1 in Herrndorf’s paper [7]. By Lemma 3.15, for \(X_n:=\phi (\theta _n)-{\mathbb {E}}(\phi (\theta _n))\), \(n\in {\mathbb {N}}\), and for any \(1\le p<\infty \),

$$\begin{aligned} \sup _{n\in {\mathbb {N}}}{\mathbb {E}}^{1/p}(|X_n|^p)<\infty . \end{aligned}$$

Next, we prove that for \(S_n:=X_1+\ldots +X_n\), \(\lim _{n\rightarrow \infty } {\mathbb {E}}S_n^2/n = \sigma ^2\) holds for some \(\sigma \ge 0\). For this, we consider the decomposition

$$\begin{aligned} \frac{1}{n}{\mathbb {E}}S_n^2= \frac{1}{n}\sum _{k=1}^{n}{\mathbb {E}}(X_k^2) + \frac{2}{n}\sum _{1\le k<l\le n} {\mathbb {E}}(X_k X_l), \end{aligned}$$
(16)

where by point i) in Lemma 3.16, \({\mathbb {E}}(X_k^2)\rightarrow {\mathbb {D}}^2 (\phi (\theta _0^*))\) as \(k\rightarrow \infty \) hence the first term on the right-hand side of (16) converges to \({\mathbb {D}}^2 (\phi (\theta _0^*))\). Regarding the second term, we introduce \(A_{n,l}:=\frac{1}{n}\sum _{k=1}^{n-l} {\mathbb {E}}(X_k X_{k+l})\), \(1\le l<n\), and define

$$\begin{aligned} b_n:=\frac{1}{n}\sum _{1\le k<l\le n} {\mathbb {E}}(X_k X_l)= \sum _{l=1}^{n-1}\frac{1}{n}\sum _{k=1}^{n-l}{\mathbb {E}}(X_k X_{k+l})= \sum _{l=1}^{n-1} A_{n,l}. \end{aligned}$$

By point ii) in Lemma 3.16, we have

$$\begin{aligned} |A_{n,l}|\le \frac{1}{n}\sum _{k=1}^{n-l} |{\mathbb {E}}(X_k X_{k+l})|\le \Lambda \left( \alpha ^Y (\lfloor l/2\rfloor )^{1-\epsilon }+e^{-\frac{\kappa }{4} \lfloor l/2\rfloor } \right) \end{aligned}$$
(17)

hence due to Assumption 2.2, for any \(\delta >0\), there exists \(\tilde{N}_\delta \in {\mathbb {N}}\) such that \(\sum _{l=\tilde{N}_{\delta }}^{n-1} |A_{n,l}|<\delta \), \(n>\tilde{N}_{\delta }\), and thus for \(m,n>\tilde{N}_{\delta }\), we have

$$\begin{aligned} |b_n-b_m|\le \sum _{l=1}^{\tilde{N}_{\delta }-1}|A_{n,l}-A_{m,l}| + \sum _{l=\tilde{N}_{\delta }}^{n-1} |A_{n,l}| + \sum _{l=\tilde{N}_{\delta }}^{m-1} |A_{m,l}| < \sum _{l=1}^{\tilde{N}_{\delta }-1}|A_{n,l}-A_{m,l}| + 2\delta . \end{aligned}$$

By point (i) in Lemma 3.16, for every \(1\le l<\tilde{N}_{\delta }\), \(A_{n,l}\rightarrow \textrm{Cov}(\phi (\theta _0^*),\phi (\theta _l^*))\), as \(n\rightarrow \infty \), and since \(\delta >0\) was arbitrary, we obtain that \((b_n)_{n\ge 1}\) is a Cauchy sequence.

At last, by Lemma 3.14, for the mixing coefficient \(\alpha _j^X(n)\), we have

$$\begin{aligned} \alpha _j^X(n)= & {} \alpha _j^{\phi \circ \theta }(n)\le \alpha _j^\theta (n) \le \alpha ^Y (\lfloor n/2\rfloor ) + \left( V(\theta _{0})+\frac{3}{2}+\frac{C}{2(1-\gamma )}\right) e^{-\frac{\kappa }{2}n},\,\,\\{} & {} j\ge 0,\,n\ge 2N, \end{aligned}$$

and thus for \(\alpha ^X(n):=\sup _{j\in {\mathbb {N}}}\alpha _j^X(n)\), \(n\in {\mathbb {N}}\), \(\sum _{n=0}^\infty \alpha ^X(n)^{1-\epsilon }<\infty \) holds.

To sum up, we have shown that all the conditions of Corollary 1 in [7] are satisfied hence we can conclude that, if \(\sigma >0\) then the sequence of random functions \((B_n)_{n\ge 1}\) given by

$$\begin{aligned} B_n(t) = \frac{S_{\lfloor nt\rfloor }}{\sigma \sqrt{n}},\,\,t\in [0,1],\,n\ge 1 \end{aligned}$$

is weakly convergent to a standard Brownian motion B on D[0, 1] endowed with the Skorokhod topology, which completes the proof. If \(\sigma =0\) then \({\mathbb {E}}(S_{n}^{2})/n\rightarrow 0\) implies \({S_{\lfloor nt\rfloor }}/{\sqrt{n}}\rightarrow 0\) in probability, for all \(t\in [0,1]\), hence also in D[0, 1]. \(\square \)