1 Introduction

Nonlocal continuity equations on spaces of probability measures arise as macroscopic mathematical models of multi-agent dynamical systems describing the time evolution of large ensembles (beams, crowds, swarms, populations, networks) of structurally identical objects (e.g., elementary particles, people, animals, “neurons” of natural or artificial neural networks, etc.). The main idea is to treat the many-particle dynamics as a whole by focusing on its “statistical” behavior, assuming that the agents are homotypic and, therefore, indistinguishable.

Passing to the limit in the number of agents, a large set of individuals (described by a system of many similar ODEs) is replaced by its continual probability distribution, called the “mean field” (driven by a single transport PDE). This idea, rooted in statistical mechanics [28], has been found useful in different areas of applied mathematics such as mathematical biology [20, 21, 27, 39], modeling of pedestrian and urban traffic [25, 26, 42], mathematical neuroscience [37] and even theoretical foundations of artificial intelligence [13, 40, 49, 50], just to name a few.

Recent results in analysis on the space of measures, achieved in the works of Ambrosio, Gigli, Lott, Otto, Santambrogio, Savaré, Villani, and others, have proved fruitful for mathematical control theory, largely spurred by the variety of the mentioned applications and the needs of control engineering. The starting point was the derivation of a mathematically rigorous “mean field limit” of the classical multi-agent optimal control problem [29, 30] (see also [12, 31]). Over the subsequent few years, the cornerstones of classical optimal control theory—such as Pontryagin’s maximum principle (PMP) [7, 9,10,11, 25, 44,45,46] and the dynamic programming method [5, 6, 23, 38]—were extended to the area of mean field control.

The mean field PMP, which is the focus of the present paper, has been obtained at different levels of generality by various mathematical strategies. A particular version was first derived in [44] for a specific “shepherd’s” problem over the local continuity equation and, subsequently, for a general linear problem with relaxed controls [45]; these versions of PMP are mainly reconstructed from the differential properties of flows of the driving vector field by standard analytical methods such as Filippov’s lemma. A result of a similar spirit for another particular local problem was recently obtained in [13] as a specification of a more general PMP [11] by an original technique of generalized Lagrange multipliers on the convex subset of Radon measures with unit mass. Notice that, in the local case, PMP takes the familiar form, as it is formulated in terms of a decoupled optimality system with an explicit backward adjoint equation—a non-conservative transport PDE.

The first result in this line was obtained in [8] for a particular “bi-level” optimization problem; the natural strategy was to pass to the limit in the usual PMP conditions for conventional control problems obtained by “finite-agent” approximations in Dobrushin’s framework. Similar arguments, based on a finite-dimensional approximation and Ekeland’s variational principle, were used in [47] to prove an impulsive version of PMP for a nonlocal transport equation whose states are measure-valued curves of bounded variation. For general (non-impulsive) nonlocal transport equations, PMP was first proved by an appropriate extension of the classical technique of needle-shaped control variations for problems without [11] and with [9] additional state constraints. Another approach, relying on an appropriate linearization of the nonlocal dynamics, was further proposed in [10]. A different method to derive necessary optimality conditions for mean-field control problems was suggested in [16], exploiting an appropriate generalization of the Karush–Kuhn–Tucker conditions. Also, in [19, 34, 50], alternative versions of the mean-field PMP were obtained for stochastic optimal control problems.

1.1 Numerical Solution: Mainstream Approaches and Their Pitfalls

The use of the existing analytical methods is limited to the simplest mean-field control problems, while the transfer of these results to the numerical context is fraught with critical technical difficulties. Here, PMP would be a promising footing if it were not for a number of significant flaws. The key drawback is the coupling of the associated Hamiltonian system. The state of this Hamiltonian equation—a measure on the cotangent bundle of the state space—is always singular, even if the solution of the primal continuity equation—a measure on the state space—has a density. This makes it impossible to solve the Hamiltonian system by standard numerical schemes, and, consequently, the existing forms of PMP do not provide a descent algorithm.

In the finite-dimensional case, a wide range of direct and indirect numerical methods is described in numerous works. For nonlocal continuity equations, by contrast, the numerical solution of optimal control problems remains an open question, central to transferring mean-field control theory to the practice of control engineering. The mainstream approaches are represented by the following two families:

  1.

    Semi-direct (finite-particle) method: Approximation of the initial distribution by a discrete measure and transformation of a distributed control system to a high-dimensional ODE. The resulting finite-dimensional control problem is solved directly or using special techniques such as, e.g., “random batch” methods [35].

  2.

    Direct method: Total discretization of a nonlocal equation and reduction of a variational problem to mathematical programming.

In practice, both of the mentioned approaches typically lead to unsatisfactory results. The first one returns us to a high-dimensional classical optimal control problem afflicted by the “curse of dimensionality”; in fact, this approach abandons the very heart of the mean-field approximation along with all the benefits of statistical averaging, drawing us back to the need to keep track of every individual representative of a large population. The second approach leads to a complex (high-dimensional, nonlinear, and non-convex) mathematical programming problem, which is not always satisfactorily solved even by commercial solvers. Here, the main difficulty is the presence of nonlocal terms depending on the density distribution over the entire spatial grid, which makes the computations much more demanding. This feature also leads to a dramatic loss in the efficiency of parallelization, since integration steps require interprocessor communications of the “all-to-all” pattern.

In contrast to the classical setting, the bibliography on indirect numerical algorithms for optimal mean-field control is scarce. There are only a few results [2, 13, 43, 48, 49], all focusing on particular problems and relying on adequate necessary optimality conditions. The work [43] deals with the so-called “shepherd’s problem”, where one has to steer a population of non-interacting individuals to a given target set; the proposed numerical algorithm is based on a specific form of PMP. On the conceptual level, the algorithm of [13] (named in the cited paper a “shooting method”) is a variant of the classical Krylov–Chernous’ko algorithm—probably the first indirect algorithm based on PMP in the history of optimal control. The convergence of the algorithm essentially depends on the convexity of the cost functional and is not guaranteed in general, even in the finite-dimensional case \(\mu _t = \delta _{x(t)}\). An alternative algorithm was proposed in [49] for the linear problem of ensemble control, employing an exact formula for the increment of the cost functional and feedback control variations. In [48], a version of the gradient descent method was constructed for a mean-field optimal control problem over a nonlocal Fokker–Planck–Kolmogorov equation modeling interactions in a Kuramoto-type model: the first variation of the objective functional and the adjoint equation are obtained by a formal Lagrange method owing to the model specifics. Finally, to the best of our knowledge, there are no results of this sort for the general \(\mu \)-nonlinear problem.

1.2 Goals, Contribution, and Organization of the Paper

In the present work, we put forth an indirect numerical method for optimal mean-field control. Namely, we design a PMP-based indirect deterministic numerical algorithm with backtracking line search for a class of optimal ensemble control problems involving nonlocal continuity equations in the space of probability measures. The method can be viewed as an adequate version of the classical gradient descent method and demonstrates encouraging results in a series of numerical experiments. To our knowledge, this is the first indirect descent algorithm for mean-field control problems that are nonlinear in the measure.

The derivation of the algorithm is based on a set of new theoretical results, which are of independent interest. First, we derive the linearized form of the original nonlocal transport PDE. In contrast to [10], our arguments apply to nonlocal perturbations of the vector field and therefore cover the case when the control is injected into the nonlocal term of the dynamics. As a byproduct, we compute the first variation of the cost functional within the class of weak variations of the control function. Another contribution is a new, equivalent articulation of PMP, in which the Hamiltonian equation on the cotangent bundle of the state space is decoupled into the primal (forward) and dual (backward) parts; the dual system turns out to be a system of nonlocal linear balance laws (continuity equations with sources).

The rest of the paper is organized as follows. A statement of the optimal control problem is presented in Sect. 1.3. Section 2 collects the necessary notation and several noteworthy facts from topology, analysis, and differential calculus over the space of probability measures. In Sect. 2.6, we introduce the concept of a flow of a nonlocal vector field and calculate a “directional derivative” of the flow along a nonlocal vector field. Sections 3–5 dwell on a simplified version of the stated optimization problem, where the running cost is dropped and the driving vector field is affine in the control variable; this technical simplification is not critical but enables us to shorten the presentation of the main results.

In Sect. 3, we exhibit two standard representations of the increment of the cost functional. The first one is formulated in the language of flows of nonlocal vector fields, while the second is written down in terms of the mentioned Hamiltonian system. In Sect. 4, noting that neither of these representations is suitable for numerical purposes, we derive a third version of the cost increment, which relies on the notion of the adjoint equation. The corresponding numerical algorithm is presented in Sect. 5. We study the convergence of the algorithm, discuss certain principal aspects of its technical implementation, and demonstrate its modus operandi by treating a simple but illustrative case, namely, an aggregation problem for a mean-field Kuramoto-type oscillatory model. Finally, in Sect. 6, the obtained results are extended to the general problem, involving the running cost and a nonlinear dependence on the control variable.

1.3 Problem Statement

Given the data \( V:I\times {{\mathbb {R}}}^{n}\times \mathscr {P}_{2}({{\mathbb {R}}}^{n})\times U\rightarrow {{\mathbb {R}}}^{n}\), \(L:I\times \mathscr {P}_{2}({{\mathbb {R}}}^{n})\times U\rightarrow {{\mathbb {R}}}\), \(\ell :\mathscr {P}_{2}({{\mathbb {R}}}^{n})\rightarrow {{\mathbb {R}}},\) consider the following optimal control problem (P) on a fixed finite time interval \(I\doteq [0,T]\):

$$\begin{aligned} \text {Minimize}\quad \mathscr {I}[u]\doteq \int _{0}^{T}L\left( t,\mu _{t},u(t)\right) \,dt + \ell (\mu _T)\quad \text {subject to} \end{aligned}$$
(1)
$$\begin{aligned} \partial _t\mu _t + \textrm{div}_x \left( V_{t}\left( x,\mu _{t},u(t)\right) \, \mu _t\right) = 0,\quad \mu _0=\vartheta , \end{aligned}$$
(2)
$$\begin{aligned} u \in {\mathscr {U}}. \end{aligned}$$
(3)

We assume that control signals are functions \(t \mapsto u(t)\) of time variable only, and take values in a given set \(U \subseteq {{\mathbb {R}}}^m\), i.e., \({\mathscr {U}} \doteq L^\infty (I; U)\), where \(L^\infty \) is equipped with the weak* topology \(\sigma (L^\infty , L^1)\).

Optimization problems of this sort appear in the framework of multi-agent dynamical systems, where the measure \(\mu _t\) represents the spatial distribution of agents at time t. The specified class of controls implies that u acts simultaneously on all agents (one can imagine that we are able to influence a common environment of the agents rather than each agent individually). An important example of a nonlocal vector field is

$$\begin{aligned} V_{t}(x,\mu ,u) = f_{t}(x,u) + \int K_{t}(x-y,u)\,d\mu (y), \end{aligned}$$
(4)

where f models an external force pushing the agents and K stands for their internal interaction. Typical terminal cost functionals are

$$\begin{aligned} \ell _{1}(\mu )&= \int l(x)\,d\mu (x) + \iint W(x,y) \,d\mu (x)\,d\mu (y),\\ \ell _{2}(\mu )&= \frac{1}{2}\left|\int x\,d\mu (x)-m_{T}\right|^{2}. \end{aligned}$$

Here, \( \ell _{1} \) represents the potential (l) and interaction (W) energy terms, while \( \ell _{2} \) is related to the averaged control problem [51], where the goal is to bring the expectation of the distribution \(\mu \) to some target position \(m_T\). Finally, common versions of the running cost are

$$\begin{aligned} L_{1}(t,\mu ,u) = \frac{1}{2}\vert u\vert ^{2},\quad L_{2}(t,\mu ,u) = \frac{1}{2}\left|\int x\,d\mu (x) -m(t) \right|^{2}. \end{aligned}$$

\( L_1 \) represents the “total energy” of the control action, and \( L_2 \) captures the problem of following a desired path \( t\mapsto m(t) \).
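For numerical intuition, all of the functionals above become finite sums once \(\mu \) is replaced by an empirical measure \(\frac{1}{N}\sum _{i}\delta _{x_{i}}\). The following sketch (NumPy; the particle data, time grid, and function names are hypothetical illustrations, not part of the algorithm developed below) evaluates \( \ell _2 \) and a rectangle-rule discretization of the integral of \( L_1 + L_2 \):

```python
import numpy as np

def ell2(X, m_T):
    """Terminal cost ell_2 on the empirical measure of the particle
    cloud X (shape (N, n)): half the squared distance between its
    mean and the target m_T."""
    return 0.5 * np.sum((X.mean(axis=0) - m_T) ** 2)

def running_cost(U, dt, m_path, X_path):
    """Rectangle-rule discretization of the integral of L1 + L2, with
    L1 = |u|^2 / 2 and L2 = |mean(mu_t) - m(t)|^2 / 2.
    U: (T, m) control samples; m_path: (T, n) desired mean path;
    X_path: (T, N, n) particle positions along the trajectory."""
    L1 = 0.5 * np.sum(U ** 2, axis=1)
    L2 = 0.5 * np.sum((X_path.mean(axis=1) - m_path) ** 2, axis=1)
    return float(np.sum((L1 + L2) * dt))
```

For instance, a two-particle cloud at \(0\) and \(2\) on the line has empirical mean \(1\), so \( \ell _2 \) vanishes for the target \(m_T = 1\).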

2 Preliminaries

In this section, we introduce some notation and recall several useful facts from analysis on the metric space of probability measures.

2.1 Notation

Throughout the paper, we use the following notation:

  • \( |\cdot |\) the Euclidean norm on \( \mathbb {R}^{n} \).

  • \( {\varvec{B}}_r\subset \mathbb {R}^{n} \) the closed ball of radius r centered at the origin.

  • \( f_{\sharp }\mu \) the pushforward of a measure \( \mu \in {\mathscr {P}}(\mathbb {R}^n) \) under a Borel map \( f:\mathbb {R}^n\rightarrow \mathbb {R}^m \).

  • \(\textrm{spt}\mu \) the support of a measure \(\mu \).

  • \( \mathbb {M}^{m,n} \) the space of matrices A with m rows and n columns.

  • \(x =\begin{pmatrix} x^{1}\\ \vdots \\ x^{n} \end{pmatrix}\) an n-dimensional column vector, i.e., \( x\in \mathbb {M}^{n,1} = {{\mathbb {R}}}^{n}\).

  • \(p = \begin{pmatrix} p_{1}&\cdots&p_{n} \end{pmatrix}\) an n-dimensional row vector, i.e., \( p \in \mathbb {M}^{1,n}=({{\mathbb {R}}}^{n})^{*} \).

  • A vector field f on \( {{\mathbb {R}}}^{n} \) is a family of n real-valued functions \( f^{i}=f^{i}(t,x) \), \( i=1,\ldots ,n \).

  • A vector field f on \( {{\mathbb {R}}}^{n}\times ({{\mathbb {R}}}^{n})^{*} \) is a family of 2n real-valued functions \( f^{i}=f^{i}(t,x,p) \), \( f_{i}=f_{i}(t,x,p) \), \( i=1,\ldots ,n \).

  • \( \textrm{div}_x f = \sum _{i=1}^{n} \partial _{x^{i}}f^{i} \) divergence of the vector field \( f=f(t,x) \) in x.

  • \( \textrm{div}_{(x,p)} f=\sum _{i=1}^{n}\left( \partial _{x^{i}}f^{i}+\partial _{p_{i}}f_{i}\right) \) divergence of the vector field \( f=f(t,x,p) \) in (x, p).

  • \(D_xf = \begin{pmatrix} \partial _{x^{1}}f^{1} & \cdots & \partial _{x^{n}}f^{1}\\ \vdots & \ddots & \vdots \\ \partial _{x^{1}}f^{n} & \cdots & \partial _{x^{n}}f^{n} \end{pmatrix}\) derivative of the vector field \( f=f(t,x) \) in x.

  • \(\nabla _{x}\psi = \begin{pmatrix} \partial _{x^{1}}\psi&\cdots&\partial _{x^{n}}\psi \end{pmatrix}\) gradient of a real-valued function \( \psi =\psi (t,x,p) \) in x.

  • \(\nabla _{p}\psi = \begin{pmatrix} \partial _{p_{1}}\psi \\ \vdots \\ \partial _{p_{n}}\psi \end{pmatrix}\) gradient of a real-valued function \( \psi =\psi (t,x,p) \) in p.

Below, we will also deal with vector measures whose values belong to \( \mathbb {M}^{1,n} \), i.e., \(\nu = \begin{pmatrix} \nu _{1}&\cdots&\nu _{n} \end{pmatrix}\), where \( \nu _{1},\ldots ,\nu _{n} \) are Radon measures on \( {{\mathbb {R}}}^{n} \). Given \( \varphi :{{\mathbb {R}}}^{n}\rightarrow {{\mathbb {R}}}^{n} \), we set \( \displaystyle \langle \nu ,\varphi \rangle = \int \varphi \cdot \,d\nu \doteq \sum _{i=1}^{n} \int \varphi ^{i}\,d\nu _{i}\).

Let X be a Polish space. From measures on X one can construct several important topological spaces: \( {\mathscr {M}}(X)\supset {\mathscr {P}}(X) \supset {\mathscr {P}}_{2}(X) \supset {\mathscr {P}}_{c}(X). \) Here \( \mathscr {M}(X) \) consists of all signed Radon measures, \( {\mathscr {P}}(X) \) of all probability measures, \( {\mathscr {P}}_{2}(X) \) of all probability measures with finite second moments, \( {\mathscr {P}}_{c}(X) \) of all compactly supported probability measures. Below, the Wasserstein distance [1] on \( {\mathscr {P}}_{2}(X) \) is always denoted by \( W_{2} \).

Given a Radon measure \( \mu \) on \( \mathbb {R}^n \), denote by \( L_{\mu }^{p}(\mathbb {R}^{n};\mathbb {R}^{m}) \) the space of all \( \mu \)-measurable maps (equivalence classes) \( f:{{\mathbb {R}}}^{n}\rightarrow {{\mathbb {R}}}^{m} \) such that \( \Vert f\Vert _{L^p_\mu }\doteq \left( \int \vert f \vert ^{p}\,d\mu \right) ^{1/p}<\infty \). If \( \mu \) is the n-dimensional Lebesgue measure \( {\mathscr {L}}^{n} \), we simply write \( L^p(\mathbb {R}^{n};\mathbb {R}^{m}) \).
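For concreteness, on the real line \( W_2 \) between two equal-weight empirical measures admits a closed form: the one-dimensional optimal plan is monotone, so \( W_2 \) reduces to the \( L^2 \) distance of the (discretized) quantile functions. A minimal sketch (NumPy; the helper name is ours):

```python
import numpy as np

def w2_1d(xs, ys):
    """W2 distance between the empirical measures (1/N) sum delta_{xs[i]}
    and (1/N) sum delta_{ys[i]} on the real line.  In one dimension the
    optimal transport plan is monotone, so W2 equals the l2 distance
    between the sorted samples (discretized quantile functions)."""
    xs = np.sort(np.asarray(xs, dtype=float))
    ys = np.sort(np.asarray(ys, dtype=float))
    assert xs.shape == ys.shape, "equal numbers of equal-weight atoms"
    return float(np.sqrt(np.mean((xs - ys) ** 2)))
```

As a sanity check, translating a measure by a constant c changes nothing but the quantile function, so the distance is exactly |c|.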

2.2 The Space \(\mathscr {P}_{c}({\mathbb {R}}^{n})\) and Functions of Probability Measures

The main arena of our paper is the space \( \mathscr {P}_{c}({{\mathbb {R}}}^{n}) \) endowed with the so-called final topology.

Definition 2.1

Let \(({\mathscr {X}}_n, \tau _n)\) be a sequence of topological spaces such that \({\mathscr {X}}_n\subset {\mathscr {X}}_{n+1}\) with continuous inclusion for every n, and let \({\mathscr {X}} = \cup _n{\mathscr {X}}_n\). The final topology is the finest topology \(\tau \) on \({\mathscr {X}}\) for which all the inclusions \(\textrm{id}_n:{\mathscr {X}}_n\rightarrow {\mathscr {X}}\) are continuous.

In our case, \(({\mathscr {X}}_n,\tau _n) \doteq ({\mathscr {P}}({\varvec{B}}_n),W_2)\) and \({\mathscr {X}} \doteq {\mathscr {P}}_c({\mathbb {R}}^n)\). The final topology \( \tau \) on \( \mathscr {P}_{c}({{\mathbb {R}}}^{n}) \) enjoys the following properties [33]:

  • \(\mu _n \xrightarrow {\tau } \mu \) if and only if \(\mu _n\xrightarrow {W_{2}} \mu \) in \(\mathscr {P}({\varvec{B}}_N)\) for some N,

  • if \({\mathscr {K}} \subset {\mathscr {P}}_c({\mathbb {R}}^n)\) is compact, then \({\mathscr {K}}\subset \mathscr {P}({\varvec{B}}_N)\) for some N,

  • \(\tau \) is a Hausdorff topology but it is not induced by any distance.

Below, we will constantly deal with mappings \( \Phi :I\times {{\mathbb {R}}}^{n}\times \mathscr {P}_{c}({{\mathbb {R}}}^{n})\rightarrow {{\mathbb {R}}}^{m} \) of a particular regularity. Recall the respective

Definition 2.2

Let \( \Phi \) be a map \(I\times {{\mathbb {R}}}^{n}\times \mathscr {P}_{c}({{\mathbb {R}}}^{n})\rightarrow {{\mathbb {R}}}^{m} \). We say that

  1.

    \( \Phi \) is a Carathéodory map if and only if \( t\mapsto \Phi (t,x,\mu ) \) is measurable for each \( (x,\mu ) \), and \( (x,\mu )\mapsto \Phi (t,x,\mu ) \) is sequentially continuous for each t.

  2.

    \( \Phi \) is locally bounded if its restriction to any compact subset of \( I\times {{\mathbb {R}}}^{n}\times \mathscr {P}_{c}({{\mathbb {R}}}^{n}) \) is bounded.

  3.

    \( \Phi \) is locally Lipschitz if and only if, for each t, the restriction of \( (x,\mu )\mapsto \Phi (t,x,\mu ) \) to any compact set \( \mathscr {K} \subset {{\mathbb {R}}}^{n}\times \mathscr {P}_{c}({{\mathbb {R}}}^{n}) \) is Lipschitz with some constant \( L_{\mathscr {K}} \), independent of t.

  4.

    \( \Phi \) is sublinear if and only if there exists \( C>0 \) such that \( \left|\Phi (t,x,\mu )\right|\le C \left( 1+\vert x\vert \right) \) for all t, x, \( \mu \).

Thanks to the outlined properties of the final topology, the definitions of local boundedness and local Lipschitz continuity can be given in the following equivalent form:

2\('\).

\( \Phi \) is locally bounded if and only if, for any compact \( \Omega \subset {{\mathbb {R}}}^{n} \), there exists \( C_{\Omega }>0 \) such that \( \left|\Phi (t,x,\mu )\right|\le C_{\Omega } \) for all \( t\in I \), \( x\in \Omega \), \( \mu \in \mathscr {P}(\Omega ) \);

3\('\).

\( \Phi \) is locally Lipschitz if and only if, for any compact \( \Omega \subset {{\mathbb {R}}}^{n} \), there exists \( L_{\Omega }>0 \) such that \( \left|\Phi (t,x,\mu ) - \Phi (t,x',\mu ')\right|\le L_{\Omega } \left( \vert x-x'\vert + W_{2}(\mu ,\mu ') \right) \) for all \( t\in I \), \( x,x'\in \Omega \), \( \mu ,\mu '\in \mathscr {P}(\Omega ) \).

2.3 Derivatives in the Space of Probability Measures

There are several concepts of derivative of a function \({\mathscr {P}} \rightarrow {{\mathbb {R}}}\). In this paper, we shall employ the notion of “intrinsic derivative” [18].

Definition 2.3

(\(\mathscr {C}^{1}\) maps) A function \( F :\mathscr {P}_{c}({{\mathbb {R}}}^{n})\rightarrow {{\mathbb {R}}}\) is said to be of class \( \mathscr {C}^{1} \) if and only if there exists a sequentially continuous, locally bounded map \( \frac{\delta F}{\delta \mu }:{\mathscr {P}}_{c}({{\mathbb {R}}}^{n})\times {{\mathbb {R}}}^{n}\rightarrow {{\mathbb {R}}}\) such that

$$\begin{aligned} F(\mu ') - F(\mu ) = \int _{0}^{1}\int \frac{\delta F}{\delta \mu }\left( (1-t)\mu +t\mu ',y\right) \,d(\mu '-\mu )(y)\,dt\qquad \forall \mu ,\mu '\in \mathscr {P}_{c}({{\mathbb {R}}}^{n}). \end{aligned}$$

Since \( \frac{\delta F}{\delta \mu } \) is defined up to an additive constant, we adopt the normalization convention

$$\begin{aligned} \int \frac{\delta F}{\delta \mu }(\mu ,y)\,d\mu (y) = 0\qquad \forall \mu \in \mathscr {P}_{c}({{\mathbb {R}}}^{n}). \end{aligned}$$

Definition 2.4

Let \( \frac{\delta F}{\delta \mu } \) be \( \mathscr {C}^{1} \) in y. Then the intrinsic derivative \( D_{\mu }F:\mathscr {P}_{c}({{\mathbb {R}}}^{n})\times {{\mathbb {R}}}^{n}\rightarrow {{\mathbb {R}}}^{n} \) is defined by \( D_{\mu }F \doteq D_{y}\frac{\delta F}{\delta \mu } \).

Some important properties of the intrinsic derivative are gathered in the following proposition, which combines the statements of Propositions 2.2–2.4 from [17].

Proposition 2.1

Let \( F:\mathscr {P}_{c}({{\mathbb {R}}}^{n})\rightarrow {{\mathbb {R}}}\) be \( \mathscr {C}^{1} \), \( \frac{\delta F}{\delta \mu } \) be \( \mathscr {C}^{1} \) in y, and \( D_{\mu }F \) be sequentially continuous and locally bounded. Then, the following holds:

  1.

    For any Borel measurable, locally bounded map \( \varphi :{{\mathbb {R}}}^{n}\rightarrow {{\mathbb {R}}}^{n} \), the function \( s\mapsto F\left( (\textrm{id}+s\varphi )_{\sharp }\mu \right) \) is differentiable at zero, and

    $$\begin{aligned} \frac{d}{ds}\Big \vert _{s=0}F\left( (\textrm{id}+s\varphi )_{\sharp }\mu \right) = \int D_{\mu }F(\mu ,y)\cdot \varphi (y)\,d\mu (y). \end{aligned}$$
    (5)
  2.

    Given a compact set \( \Omega \subset {\mathbb {R}}^n \), the restriction of F to \( \mathscr {P}(\Omega ) \) satisfies

    $$\begin{aligned} \left|F(\mu ') - F(\mu ) - \iint D_{\mu }F(\mu ,y)\cdot (y-x)\,d\Pi (x,y) \right| \le o \left( \left( \iint \vert x-y\vert ^{2}\,d\Pi (x,y) \right) ^{1/2} \right) , \end{aligned}$$

    for any \( \mu ,\mu '\in \mathscr {P}(\Omega ) \) and any transport plan \( \Pi \) between \( \mu \) and \( \mu ' \).

  3.

    The quantity \( \frac{\delta F}{\delta \mu } \) can be calculated as follows:

    $$\begin{aligned} \frac{\delta F}{\delta \mu }(\mu ,y) = \lim _{h\rightarrow 0+}\frac{1}{h}\left( F\left( (1-h)\mu +h\delta _{y}\right) -F(\mu )\right) . \end{aligned}$$

The first property links the intrinsic derivative with a “directional” derivative, where \( \varphi \) plays the role of direction. The second one relates the notion of intrinsic derivative with the so-called localized Wasserstein derivative [10]:

Definition 2.5

(localized Wasserstein derivative) We say that \( F:\mathscr {P}_{2}({{\mathbb {R}}}^{n})\rightarrow {{\mathbb {R}}}\) is locally differentiable at \( \mu \in \mathscr {P}_{2}({{\mathbb {R}}}^{n}) \) if there exists a tangent vector \( \xi \in \textrm{Tan}_{\mu }{\mathscr {P}}_{2}({{\mathbb {R}}}^{n}) \) such that, for any compact set \( \Omega \supset \textrm{spt}\mu \), the restriction of F to \( \mathscr {P}(\Omega ) \) satisfies

$$\begin{aligned} F(\mu ') - F(\mu ) = \iint \langle \xi (x),y-x\rangle \,d\Pi (x,y) + o\left( \left( \iint \vert x-y\vert ^{2}\,d\Pi (x,y)\right) ^{1/2}\right) , \end{aligned}$$

for any \( \mu '\in {\mathscr {P}}(\Omega ) \) and any transport plan \( \Pi \) between \( \mu \) and \( \mu ' \). Such \( \xi \) is uniquely defined and called the localized Wasserstein derivative of F at \( \mu \).

Recall that the tangent space \(\textrm{Tan}_{\mu }{\mathscr {P}}_{2}({{\mathbb {R}}}^{n})\) to \({\mathscr {P}}_{2}({{\mathbb {R}}}^{n})\) at \( \mu \in \mathscr {P}_{2}({{\mathbb {R}}}^{n}) \) is introduced as

$$\begin{aligned} \textrm{Tan}_{\mu }{\mathscr {P}}_{2}({{\mathbb {R}}}^{n}) = \overline{\left\{ \nabla \varphi \;\text {s.t.}\;\varphi \in \mathscr {C}^{\infty }_{c}({{\mathbb {R}}}^{n})\right\} }^{L^{2}_{\mu }} \subset L^{2}_{\mu } = L^{2}_{\mu }({{\mathbb {R}}}^{n};{{\mathbb {R}}}^{n}). \end{aligned}$$

Proposition 2.1 says that any \( \mathscr {C}^{1} \) functional on the space of probability measures with sequentially continuous and locally bounded intrinsic derivative \( D_{\mu }F \) is locally differentiable at any \( \mu \in \mathscr {P}_{c}({{\mathbb {R}}}^{n}) \), and the projection of \( D_{\mu }F(\mu ,\cdot ) \) onto \( \textrm{Tan}_{\mu }{\mathscr {P}}_{2}({{\mathbb {R}}}^{n}) \) coincides with the corresponding localized Wasserstein derivative.

The third assertion of Proposition 2.1 offers a convenient tool for the practical calculation of the intrinsic derivative. We illustrate this machinery with the following paradigmatic example.

Example 1

Let \( K:{{\mathbb {R}}}^{n}\rightarrow {{\mathbb {R}}}^{n} \) be a \( \mathscr {C}^{1} \) map. For fixed \( x\in {{\mathbb {R}}}^{n} \), let us compute the intrinsic derivative of the functional \(\displaystyle \mu \mapsto F(x,\mu ) \doteq (K*\mu )(x) \doteq \int K(x-y)\,d\mu (y).\) By observing that

$$\begin{aligned} F\left( x,(1-t)\mu +t\delta _{y}\right) = (1-t)\int K(x-z)\,d\mu (z) + t K(x-y), \end{aligned}$$

the flat derivative is easily found as

$$\begin{aligned} \frac{\delta F}{\delta \mu }(x,\mu ,y) = K(x-y) -\int K(x-z)\,d\mu (z), \end{aligned}$$

which gives: \( D_{\mu }F(x,\mu ,y) = D_{y}\frac{\delta F}{\delta \mu }(x,\mu ,y) = -DK(x-y). \)
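Since \( F(x,\cdot ) \) is affine in \( \mu \), the difference quotient in the third assertion of Proposition 2.1 is independent of h and reproduces the flat derivative exactly; this is easy to confirm numerically on a discrete measure. A sketch for n = 1 with a hypothetical scalar kernel \( K=\sin \) (all names are ours):

```python
import numpy as np

K = lambda v: np.sin(v)                  # hypothetical C^1 kernel, n = 1

def F(x, mu):
    """F(x, mu) = (K * mu)(x) for mu given as equal-weight atoms."""
    return np.mean(K(x - mu))

def flat_derivative(x, mu, y):
    """Closed form from Example 1: K(x - y) - int K(x - z) dmu(z)."""
    return K(x - y) - F(x, mu)

def quotient(x, mu, y, h):
    """(F(x, (1-h) mu + h delta_y) - F(x, mu)) / h; by affinity of F
    in the measure argument this equals the flat derivative for all h."""
    return ((1 - h) * F(x, mu) + h * K(x - y) - F(x, mu)) / h
```

Evaluating `quotient` for several values of h against `flat_derivative` confirms the limit formula of Proposition 2.1(3) in this linear case.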

Recall another useful fact:

Lemma 2.1

Let F be the same as in Proposition 2.1. Then F is locally Lipschitz.

Proof

Fix a compact set \( \Omega \) and two measures \( \mu ,\mu '\in \mathscr {P}(\Omega ) \). Denote by \( \Pi \) an optimal plan between \( \mu \) and \( \mu ' \) and let \( \mu _{t}=(1-t)\mu +t\mu ' \). Then, we have

$$\begin{aligned} F(\mu ') - F(\mu ) = \int _{0}^{1} \iint \left[ \frac{\delta F}{\delta \mu }\left( \mu _{t},y\right) - \frac{\delta F}{\delta \mu }\left( \mu _{t},x\right) \right] \,d\Pi (x,y)\,dt. \end{aligned}$$

The difference in the square brackets equals

$$\begin{aligned} \int _{0}^{1}D_{y}\frac{\delta F}{\delta \mu }\left( \mu _{t},(1-s)y + sx\right) (y-x)\,ds. \end{aligned}$$

Hence the statement follows from the local boundedness of \( D_{\mu }F = D_{y}\frac{\delta F}{\delta \mu } \). \(\square \)

Definition 2.6

We say that \( F:\mathscr {P}_{c}({{\mathbb {R}}}^{n})\rightarrow {{\mathbb {R}}}\) is of class \( \mathscr {C}^{1,1} \) if F is \( \mathscr {C}^{1} \), \( \frac{\delta F}{\delta \mu } \) is \( \mathscr {C}^{1} \) in y, and the intrinsic derivative \( D_{\mu }F \) is locally Lipschitz and locally bounded.

2.4 Nonlocal Vector Fields and Their Flows

A time-dependent nonlocal vector field is a map \(V:I\times {{\mathbb {R}}}^{n}\times \mathscr {P}_{c}({{\mathbb {R}}}^{n})\rightarrow {{\mathbb {R}}}^{n} \). If the dependence on \(\mu \in \mathscr {P}_{c}({{\mathbb {R}}}^{n})\) is trivial, we say that V is a local vector field (or simply a “vector field”). The basic regularity of nonlocal vector fields is understood in the sense of Definition 2.2.

It is well-known that the local transport PDEs can be studied using their characteristic flows. Recall the following

Definition 2.7

We say that a vector field \( v:I\times {{\mathbb {R}}}^{n}\rightarrow {{\mathbb {R}}}^{n} \) is of class \( \mathscr {C}^{1,1} \) if

  1.

    v is a locally bounded Carathéodory map;

  2.

    v is \( \mathscr {C}^{1} \) in x for each t;

  3.

    \( D_{x}v:I\times {{\mathbb {R}}}^{n}\rightarrow \mathbb {M}^{n,n} \) is Carathéodory, locally bounded and locally Lipschitz.

Any sublinear \( {\mathscr {C}}^{1,1} \) vector field v generates a unique continuous map \( P:I\times I\times {{\mathbb {R}}}^{n}\rightarrow {{\mathbb {R}}}^{n} \), named the flow of v; this map is defined so that, for each \( t_{0}\in I \) and \( x\in {{\mathbb {R}}}^{n} \), \(t \mapsto P_{t_{0},t}(x)\) is a solution of the Cauchy problem

$$\begin{aligned} \partial _t P_{t_{0},t}(x) = v_{t}\left( P_{t_{0},t}(x)\right) ,\quad P_{t_{0},t_{0}}(x) = x. \end{aligned}$$

For any \( t_{0},t\in I \) the map \( P_{t_{0},t}:{{\mathbb {R}}}^{n}\rightarrow {{\mathbb {R}}}^{n} \) is a \( \mathscr {C}^{2} \) diffeomorphism. Moreover, it satisfies the semigroup property: \(P_{t_{1},t_{2}}\circ P_{t_{0},t_{1}} = P_{t_{0},t_{2}}\) for all \(t_{0},t_{1},t_{2}\in I.\)
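Numerically, the flow and its semigroup property are straightforward to observe by integrating the characteristic ODE. A minimal sketch (classical RK4 integrator; the linear field v(t, x) = -x is a hypothetical test case with exact flow \( P_{t_{0},t}(x)=e^{-(t-t_{0})}x \)):

```python
import numpy as np

def flow(v, t0, t1, x, steps=200):
    """Approximate the flow P_{t0,t1}(x) of dx/dt = v(t, x) by the
    classical fourth-order Runge-Kutta scheme."""
    h = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        k1 = v(t, x)
        k2 = v(t + h / 2, x + h / 2 * k1)
        k3 = v(t + h / 2, x + h / 2 * k2)
        k4 = v(t + h, x + h * k3)
        x = x + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return x

v = lambda t, x: -x   # exact flow: x -> exp(-(t1 - t0)) * x
```

Composing `flow(v, 0.5, 1.0, flow(v, 0.0, 0.5, x))` and comparing with `flow(v, 0.0, 1.0, x)` illustrates the semigroup property \(P_{t_{1},t_{2}}\circ P_{t_{0},t_{1}} = P_{t_{0},t_{2}}\) up to discretization error.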

In fact, the concept of flow can be extended to the case of nonlocal vector fields. To this end, we modify Definition 2.7 as follows:

Definition 2.8

We say that a nonlocal vector field \( V:I\times {{\mathbb {R}}}^{n}\times \mathscr {P}_{c}({{\mathbb {R}}}^{n})\rightarrow {{\mathbb {R}}}^{n} \) is of class \( \mathscr {C}^{1,1} \) if

  1)

    V is a locally bounded Carathéodory map;

  2)

    V is \( \mathscr {C}^{1} \) in x for each t and \( \mu \), and \( \mathscr {C}^{1} \) in \( \mu \) for each t and x;

  3)

    both \( D_{x}V:I\times {{\mathbb {R}}}^{n}\times \mathscr {P}_{c}({{\mathbb {R}}}^{n})\rightarrow \mathbb {M}^{n,n} \) and \( D_{\mu }V:I\times {{\mathbb {R}}}^{n}\times \mathscr {P}_{c}({{\mathbb {R}}}^{n})\times {{\mathbb {R}}}^{n}\rightarrow \mathbb {M}^{n,n} \) are Carathéodory, locally bounded and locally Lipschitz.

Now, observe that any sublinear \( \mathscr {C}^{1,1} \) nonlocal vector field V generates a unique sequentially continuous function \( X:I \times I\times {{\mathbb {R}}}^{n}\times \mathscr {P}_{c}({{\mathbb {R}}}^{n})\rightarrow {{\mathbb {R}}}^{n} \) such that, for each \( x\in {{\mathbb {R}}}^{n} \) and \( \vartheta \in \mathscr {P}_{c}({{\mathbb {R}}}^{n}) \), \(t \mapsto X^{\vartheta }_{t_0,t}(x)\) is a solution of the ODE

$$\begin{aligned} \partial _t X^{\vartheta }_{t_0,t}(x) = V_{t}\left( X^{\vartheta }_{t_0,t}(x),X^{\vartheta }_{t_0,t\sharp }\vartheta \right) ,\quad X^{\vartheta }_{t_0,t_0}(x) = x. \end{aligned}$$

We abbreviate \(X^\vartheta _t = X^\vartheta _{0,t}\) and stress that \( \mu _{t}=(X^{\vartheta }_{t})_\sharp \vartheta \) is the unique solution of the nonlocal continuity equation

$$\begin{aligned} \partial _{t}\mu _{t} + \textrm{div}_x\left( V_{t}(\cdot ,\mu _{t})\mu _{t}\right) = 0,\quad \mu _{0} = \vartheta . \end{aligned}$$

We call the map X the flow of the nonlocal vector field V.

Notice that, for a given \( \vartheta \), we can define \( v_{t}(x)\doteq V_{t}(x,X^{\vartheta }_{t\sharp }\vartheta ) \) and denote by P the flow of v. It is clear that \( X^{\vartheta }_{0,t} = P_{0,t} \). We will use this fact several times below.

The outlined facts (existence of the flow, well-posedness of the nonlocal continuity equation, and the representation formula for its solution) are well known; see, e.g., [11, 29, 41].
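The representation above lends itself to a straightforward particle approximation: replacing \( \vartheta \) by the empirical measure of N samples turns the nonlocal ODE into a coupled system of N ordinary ODEs. The following sketch integrates this system by the explicit Euler method; the consensus-type field \( V_{t}(x,\mu ) = \int (y-x)\,d\mu (y) \) is an illustrative assumption, not a construction from the text.

```python
import numpy as np

def nonlocal_flow(V, particles, t0, t1, steps=1000):
    """Explicit Euler for the nonlocal ODE: each particle x_i follows
    dx_i/dt = V(t, x_i, ensemble), the ensemble standing in for the
    empirical measure (1/N) sum_j delta_{x_j}."""
    x = np.array(particles, dtype=float)
    dt = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        # V is evaluated at all particles at once (vectorized in x).
        x = x + dt * V(t, x, x)
        t += dt
    return x

# Assumed consensus-type field: V_t(x, mu) = \int (y - x) dmu(y).
def V(t, x, ensemble):
    return ensemble.mean() - x

x0 = np.array([-1.0, 0.0, 2.0, 5.0])
xT = nonlocal_flow(V, x0, 0.0, 3.0)
```

For this field the particle mean is preserved exactly by each Euler step, while the spread decays like \(e^{-t}\), which gives a quick sanity check of the scheme.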

2.5 \(\mathscr {O}_\textrm{loc}(\lambda ^{2})\) Families of Vector Fields

In this section, we discuss some differential properties of nonlocal vector fields and their flows.

Definition 2.9

Let \( \Phi ^{\lambda }:\mathscr {X} \rightarrow {{\mathbb {R}}}^{m} \), \( \lambda \in [0,1] \), be a family of functions on a topological space \( \mathscr {X} \). We say that \( \Phi ^{\lambda } \) is an \( \mathscr {O}_\textrm{loc}(\lambda ^{2}) \) family and write

$$\begin{aligned} \Phi ^{\lambda } = \mathscr {O}_\textrm{loc}(\lambda ^{2})\quad \text {or}\quad \Phi ^{\lambda }(x) = \mathscr {O}_\textrm{loc}(x;\lambda ^{2}) \end{aligned}$$

if for any compact set \(\mathscr {K} \subset \mathscr {X} \) there exists \( C_{\mathscr {K}}>0 \) such that \( \left|\Phi ^{\lambda }(x)\right|\le C_{\mathscr {K}}\lambda ^{2} \) for all \( \lambda \in [0,1] \) and \( x\in \mathscr {K} \).

In particular, a family of nonlocal vector fields \( V^{\lambda }:I\times {{\mathbb {R}}}^{n}\times \mathscr {P}_{c}({{\mathbb {R}}}^{n})\rightarrow {{\mathbb {R}}}^{n}\) is \( \mathscr {O}_\textrm{loc}(\lambda ^{2}) \) if, for any compact \( \Omega \subset {{\mathbb {R}}}^{n} \), there exists \( C_{\Omega }>0 \) such that \( \left| V^{\lambda }_{t}(x,\mu )\right| \le C_{\Omega }\lambda ^{2} \), for all \( \lambda \in [0,1] \), \( t\in I \), \( x\in \Omega \), \( \mu \in \mathscr {P}(\Omega ) \).

Lemma 2.2

Let V be a nonlocal vector field of class \( \mathscr {C}^{1,1} \). Then, for any locally bounded \( \varphi ,\psi :{{\mathbb {R}}}^{n}\rightarrow {{\mathbb {R}}}^{n}\), one has

$$\begin{aligned}&V_{t} \left( x+\lambda \varphi (x), (\textrm{id}+\lambda \psi )_{\sharp }\mu \right) - V_{t}(x,\mu ) - \lambda D_{x}V_{t}(x,\mu )\varphi (x)\\&\quad - \lambda \int D_{\mu }V_{t}(x,\mu ,y)\psi (y)\,d\mu (y) = \mathscr {O}_\textrm{loc}(t,x,\mu ;\lambda ^{2}),\quad {\lambda \in [0,1]}. \end{aligned}$$

Moreover, the constant \( C_{\Omega } \) that guarantees the estimate

$$\begin{aligned} \mathscr {O}_\textrm{loc}(t,x,\mu ;\lambda ^{2})\le C_{\Omega }\lambda ^{2}\qquad \forall (t,x,\mu )\in I\times \Omega \times \mathscr {P}(\Omega ) \end{aligned}$$

depends only on the data

$$\begin{aligned} r&\doteq \max \left\{ \vert x\vert +\max \{\vert \varphi (x)\vert ,\vert \psi (x)\vert \}\;:\;x\in \Omega \right\} ,\nonumber \\ L_{r}&= \max _{t \in I}\left\{ \max _{{\varvec{B}}_{r}\times \mathscr {P}({\varvec{B}}_{r})}\textrm{lip}(D_{x}V_{t}), \max _{{\varvec{B}}_{r}\times \mathscr {P}({\varvec{B}}_{r})\times {\varvec{B}}_{r}}\textrm{lip}(D_{\mu }V_{t})\right\} . \end{aligned}$$
(6)

Proof

We split the proof into several steps.

1. Fix a compact set \( \Omega \) and a triple \( (t,x,\mu )\in I\times \Omega \times \mathscr {P}(\Omega ) \). Consider the identity:

$$\begin{aligned}&V_{t} \left( x+\lambda \varphi (x), (\textrm{id}+\lambda \psi )_{\sharp }\mu \right) - V_{t}(x,\mu )\\&\quad =V_{t} \left( x+\lambda \varphi (x), (\textrm{id}+\lambda \psi )_{\sharp }\mu \right) - V_{t} \left( x, (\textrm{id}+\lambda \psi )_{\sharp }\mu \right) \\&\qquad + V_{t} \left( x, (\textrm{id}+\lambda \psi )_{\sharp }\mu \right) - V_{t}(x,\mu ). \end{aligned}$$

By the mean value theorem, the first difference on the right-hand side takes the form

$$\begin{aligned} \lambda \int _{0}^{1}D_{x}V_{t} \left( x + s\lambda \varphi (x), (\textrm{id}+ \lambda \psi )_{\sharp }\mu \right) \varphi (x)\,ds, \end{aligned}$$

and the second one yields

$$\begin{aligned}&\int _{0}^{1} \int \frac{\delta V_{t}}{\delta \mu } \left( x, \mu _{\lambda ,\tau },y \right) \,d\left( (\textrm{id}+\lambda \psi )_{\sharp }\mu - \mu \right) (y)\,d\tau \\&\quad =\int _{0}^{1}\int \left[ \frac{\delta V_{t}}{\delta \mu } \left( x, \mu _{\lambda ,\tau },y+\lambda \psi (y) \right) - \frac{\delta V_{t}}{\delta \mu } \left( x, \mu _{\lambda ,\tau },y \right) \right] \,d\mu (y)\,d\tau \\&\quad =\lambda \int _{0}^{1}\int _{0}^{1}\int D_{\mu }V_{t} \left( x, \mu _{\lambda ,\tau }, y+s\lambda \psi (y)\right) \psi (y)\,d\mu (y)\,ds\,d\tau , \end{aligned}$$

where \( \mu _{\lambda ,\tau } = (1-\tau )\mu +\tau (\textrm{id}+\lambda \psi )_{\sharp }\mu \).

2. Let r be as in (6). Then \( x +\lambda \varphi (x) \in {\varvec{B}}_{r} \) and \( (\textrm{id}+\lambda \psi )_{\sharp }\mu \in \mathscr {P}({\varvec{B}}_{r}) \) for all \( \lambda \in [0,1] \), \( x\in \Omega \), \( \mu \in \mathscr {P}(\Omega ) \). Since \( D_{x}V \) and \( D_{\mu }V \) are locally Lipschitz,

$$\begin{aligned}&\left|D_{x}V_{t} \left( x + s\lambda \varphi (x), (\textrm{id}+ \lambda \psi )_{\sharp }\mu \right) \varphi (x) - D_{x}V_{t}(x,\mu )\varphi (x)\right|\nonumber \\&\quad \le \lambda L_{r} \left( \left|\varphi (x) \right|^{2}+ \Vert \psi \Vert _{L^{2}_{\mu }}\vert \varphi (x)\vert \right) \end{aligned}$$
(7)
$$\begin{aligned}&\left| \int D_{\mu }V_{t} \left( x, \mu _{\lambda ,\tau }, y+s\lambda \psi (y)\right) \psi (y)\,d\mu (y)-\int D_{\mu }V_{t}(x,\mu ,y)\psi (y)\,d\mu (y) \right| \nonumber \\&\quad \le L_{r} \left( W_{2}(\mu ,\mu _{\lambda ,\tau })\Vert \psi \Vert _{L^{1}_{\mu }} +\lambda \Vert \psi \Vert ^{2}_{L^{2}_{\mu }} \right) . \end{aligned}$$
(8)

3. Let us estimate \(W_{2} \left( \mu ,\mu _{\lambda ,\tau } \right) \). To this end, recall that

$$\begin{aligned} W_{2}^{2}\left( (1- \tau )\mu _{0}+\tau \mu _{1},\nu \right) \le (1-\tau )W_{2}^{2}(\mu _{0},\nu ) + \tau W_{2}^{2}(\mu _{1},\nu ), \end{aligned}$$
(9)

for all \( \mu _{0},\mu _{1},\nu \in \mathscr {P}_{2}({{\mathbb {R}}}^{n})\) and all \( \tau \in [0,1] \). This inequality becomes evident if we note that, for any \( \Pi _{0}\in \Gamma _{o}(\mu _{0},\nu ) \) and \( \Pi _{1}\in \Gamma _{o}(\mu _{1},\nu ) \), the convex combination \( (1-\tau )\Pi _{0}+\tau \Pi _{1} \) is a transport plan between \( (1- \tau )\mu _{0}+\tau \mu _{1}\) and \( \nu \). In our case, (9) implies that

$$\begin{aligned} W_{2} \left( \mu ,\mu _{\lambda ,\tau } \right) \le \sqrt{\tau }W_{2} \left( \mu , (\textrm{id}+\lambda \psi )_{\sharp }\mu \right) \le \sqrt{\tau }\lambda \Vert \psi \Vert _{L^{2}_{\mu }}. \end{aligned}$$

The statement now follows from (7), (8) and the inequalities \( \vert \varphi \vert \le r \), \( \vert \psi \vert \le r \) on \( \Omega \). \(\square \)
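The statement of Lemma 2.2 is easy to probe numerically on an explicit example. In the sketch below, the field \( V(x,\mu ) = x\int y\,d\mu (y) \) and the perturbations \( \varphi ,\psi \) are illustrative assumptions; for this V one has \( D_{x}V(x,\mu ) = \int y\,d\mu (y) \) and \( D_{\mu }V(x,\mu ,y) = x \), and the remainder should scale like \( \lambda ^{2} \).

```python
import numpy as np

# Check the first-order expansion of Lemma 2.2 for the assumed field
# V(x, mu) = x * \int y dmu(y):  D_xV = \int y dmu,  D_muV(x,mu,y) = x.
rng = np.random.default_rng(0)
ys = rng.normal(size=200)            # particles representing mu
x = 0.7
phi = lambda z: np.cos(z)            # perturbation of the state
psi = lambda z: z**2                 # perturbation of the measure

def remainder(lam):
    m = ys.mean()
    # V(x + lam*phi(x), (id + lam*psi)# mu) - V(x, mu)
    lhs = (x + lam * phi(x)) * (ys + lam * psi(ys)).mean() - x * m
    # lam * (D_xV * phi(x) + \int D_muV * psi dmu)
    first_order = lam * (m * phi(x) + x * psi(ys).mean())
    return abs(lhs - first_order)

ratio = remainder(0.2) / remainder(0.1)   # ~4 for an O(lam^2) remainder
```

Here the remainder is exactly \( \lambda ^{2}\varphi (x)\int \psi \,d\mu \), so the ratio equals 4 up to floating-point error.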

Arguments similar to those of the previous proof lead to the following slight modification of Lemma 2.2.

Lemma 2.3

Let V be a nonlocal vector field of class \( \mathscr {C}^{1,1} \) and \( X:I\times {{\mathbb {R}}}^{n}\times \mathscr {P}_{c}({{\mathbb {R}}}^{n})\rightarrow {{\mathbb {R}}}^{n}\) be a sequentially continuous and locally bounded map such that \( x\mapsto X^{\mu }_{t}(x)\) is bijective for all t and \( \mu \). Then, for any locally bounded Carathéodory maps \( \varphi ,\psi :I\times {{\mathbb {R}}}^{n}\times \mathscr {P}_{c}({{\mathbb {R}}}^{n})\rightarrow {{\mathbb {R}}}^{n} \), we have

$$\begin{aligned}&V_{t} \left( X_{t}^{\mu }(x) +\lambda \varphi ^{\mu }_{t}(x), (X_{t}^{\mu }+\lambda \psi ^{\mu }_{t})_{\sharp }\mu \right) - V_{t}\left( X_{t}^{\mu }(x),X_{t\sharp }^{\mu }\mu \right) \\&\qquad - \lambda D_{x}V_{t}\left( X_{t}^{\mu }(x),X_{t\sharp }^{\mu }\mu \right) \varphi ^{\mu }_{t}(x)\\&\qquad - \lambda \int D_{\mu }V_{t}\left( X_{t}^{\mu }(x),X_{t\sharp }^{\mu }\mu ,X_{t}^{\mu }(y)\right) \psi ^{\mu }_{t}(y)\,d\mu (y)\\&\quad = \mathscr {O}_\textrm{loc}(t,x,\mu ;\lambda ^{2}). \end{aligned}$$

Moreover, the constant \( C_{\Omega } \) which guarantees the estimate

$$\begin{aligned} \mathscr {O}_\textrm{loc}(t,x,\mu ;\lambda ^{2})\le C_{\Omega }\lambda ^{2}\qquad \forall (t,x,\mu )\in I\times \Omega \times \mathscr {P}(\Omega ) \end{aligned}$$

depends only on

$$\begin{aligned} r \doteq \max \left\{ \left| X_{t}^{\mu }(x)\right| +\max \{\vert \varphi ^{\mu }_{t}(x)\vert ,\vert \psi ^{\mu }_{t}(x)\vert \}\;:\;(t,x,\mu )\in I\times \Omega \times \mathscr {P}(\Omega )\right\} \end{aligned}$$
(10)

and \( L_{r} \) which bounds, for all t, the Lipschitz constants of \( D_{x}V_{t} \) and \( D_{\mu }V_{t} \):

$$\begin{aligned} \textrm{lip}(D_{x}V_{t})\le L_{r}\text { on } {\varvec{B}}_{r}\times \mathscr {P}({\varvec{B}}_{r}),\quad \textrm{lip}(D_{\mu }V_{t})\le L_{r} \text { on } {\varvec{B}}_{r}\times \mathscr {P}({\varvec{B}}_{r})\times {\varvec{B}}_{r}. \end{aligned}$$

The following presents a refined version of the formula (5) for the intrinsic derivative.

Lemma 2.4

Let \( F:\mathscr {P}_{c}({{\mathbb {R}}}^{n})\rightarrow {{\mathbb {R}}}\) be of class \( \mathscr {C}^{1,1} \), and \( \Phi ^{\lambda }:{{\mathbb {R}}}^{n}\rightarrow {{\mathbb {R}}}^{n} \), \( \lambda \in [0,1] \), be a family of Borel maps which can be expanded as follows:

$$\begin{aligned} \Phi ^{\lambda }(x) = \Phi ^{0}(x) + \lambda \varphi (x) + \mathscr {O}_\textrm{loc}(x;\lambda ^{2}), \end{aligned}$$
(11)

for some \( \varphi :\mathbb {R}^n\rightarrow \mathbb {R}^n \). Then,

$$\begin{aligned} F\left( \Phi ^{\lambda }_{\sharp }\mu \right) - F(\Phi ^{0}_{\sharp }\mu ) - \lambda \int D_{\mu }F\left( \Phi ^{0}_{\sharp }\mu ,\Phi ^{0}(y)\right) \varphi (y)\,d\mu (y) = \mathscr {O}_\textrm{loc}(\mu ;\lambda ^{2}). \end{aligned}$$

Proof

In view of Lemma 2.3, it suffices to show that

$$\begin{aligned} F\left( \Phi ^{\lambda }_{\sharp }\mu \right) - F\left( (\Phi ^{0}+\lambda \varphi )_{\sharp }\mu \right) = \mathscr {O}_\textrm{loc}(\mu ; \lambda ^{2}). \end{aligned}$$

According to Lemma 2.1, F is locally Lipschitz. Hence, for any compact \( \Omega \subset {{\mathbb {R}}}^{n} \), there exists \( L_{\Omega }>0 \) such that

$$\begin{aligned}&\Big \vert F\left( \Phi ^{\lambda }_{\sharp }\mu \right) - F\left( (\Phi ^{0}+\lambda \varphi )_{\sharp }\mu \right) \Big \vert \le L_{\Omega }W_{2} \left( \Phi ^{\lambda }_{\sharp }\mu , (\Phi ^{0} +\lambda \varphi )_{\sharp }\mu \right) \\&\quad \le L_{\Omega }\left\| \Phi ^{\lambda } - \Phi ^0 - \lambda \varphi \right\| _{L^{2}_{\mu }}, \end{aligned}$$

for all \( \mu \in \mathscr {P}(\Omega ) \). Now, by using (11), we complete the proof. \(\square \)

2.6 Derivative of the Flow

Recall that, for a fixed initial measure, any sublinear \( \mathscr {C}^{1,1} \) nonlocal vector field (n.v.f.) V generates a map X that can be thought of as its flow. We shall study the flow \( X^{\lambda } \) of the perturbed n.v.f. \( V^{\lambda } = V + \lambda W \), where W is also sublinear \( \mathscr {C}^{1,1} \), and \( \lambda \in [0,1] \).

The results of this section, which provide the linearization of the nonlocal flow, are largely similar to those of [10, sec. 3.2] (both in their statements and their proofs). However, in contrast to [10], we accept here nonlocal perturbations of the vector field. On the other hand, we impose slightly more restrictive assumptions, enabling us to expand the nonlocal flow up to the term of order \(O(\lambda ^2)\) rather than \(o(\lambda )\) as demonstrated in [10]. This fact will play a crucial role in establishing the convergence of our numerical algorithm in Sect. 5.1.

Theorem 2.1

Let V, W be sublinear \( \mathscr {C}^{1,1} \) nonlocal vector fields, X be the flow of V, and \( X^{\lambda } \) be the flow of \( V^{\lambda }\doteq V+\lambda W \), where \( \lambda \in [0,1] \). Then

$$\begin{aligned} X^{\lambda }- X - \lambda w = \mathscr {O}_\textrm{loc}(\lambda ^{2}), \end{aligned}$$

where \( w:I\times \mathbb {R}^n \times {\mathscr {P}}_c(\mathbb {R}^n)\rightarrow \mathbb {R}^n \) satisfies the differential equation

$$\begin{aligned} \partial _tw^{\vartheta }_{t}\left( x\right)&= D_xV_{t}\left( X^{\vartheta }_{t}(x),X^{\vartheta }_{t\sharp }\vartheta \right) w^{\vartheta }_t(x)\nonumber \\&\quad + \int D_{\mu }V_{t}\left( X^{\vartheta }_{t}(x),X^{\vartheta }_{t\sharp }\vartheta , X^{\vartheta }_{t}(y)\right) w^{\vartheta }_{t}\left( y\right) \,d\vartheta (y) \nonumber \\&\quad + W_{t}\left( X^{\vartheta }_{t}(x) ,X^{\vartheta }_{t\sharp }\vartheta \right) \end{aligned}$$
(12)

and the initial condition

$$\begin{aligned} w^{\vartheta }_{0}(x) = 0. \end{aligned}$$
(13)

Moreover, the constant \( C_{\Omega } \) which guarantees the estimate

$$\begin{aligned} \mathscr {O}_\textrm{loc}(t,x,\vartheta ;\lambda ^{2})\le C_{\Omega }\lambda ^{2}\qquad \forall (t,x,\vartheta )\in I\times \Omega \times \mathscr {P}(\Omega ) \end{aligned}$$

depends only on the constants \( \rho \), \( C_{\rho } \), r, \( L_{r} \) defined below by (14), (15), (17), (18).

Remark 2.1

First, notice that \( \vartheta \) in (12) can be considered as a parameter. Thus, (12) can be viewed as a “linear transport equation with a nonlocal source term”. One can easily show (for example, by a fixed-point argument) that (12), (13) has a unique continuous solution w (see also [10, 11], where such a solution is constructed explicitly for the case \( W_{t}(x,\mu )\equiv W_{t}(x) \)). Moreover, w is sequentially continuous as a function of t, x, \( \vartheta \).
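On a particle discretization, the linearized system (12), (13) can be integrated alongside the flow, which gives a direct numerical check of the expansion in Theorem 2.1. The field \( V_{t}(x,\mu ) = -x^{3} + \int y\,d\mu (y) \) and the perturbation \( W_{t}(x,\mu ) = \sin x \) below are illustrative assumptions; for them \( D_{x}V = -3x^{2} \) and \( D_{\mu }V \equiv 1 \).

```python
import numpy as np

N, T, steps = 50, 1.0, 2000
dt = T / steps
rng = np.random.default_rng(0)
x0 = rng.normal(size=N)

def V(x, xs):                 # assumed field: V(x, mu) = -x^3 + \int y dmu
    return -x**3 + xs.mean()

def flow(lam):                # Euler flow of V + lam * W, W(x, mu) = sin(x)
    x = x0.copy()
    for _ in range(steps):
        x = x + dt * (V(x, x) + lam * np.sin(x))
    return x

# w from (12)-(13) on particles: D_xV = -3x^2, D_muV = 1, so the
# nonlocal term \int D_muV w dtheta becomes mean(w).
x, w = x0.copy(), np.zeros(N)
for _ in range(steps):
    w = w + dt * (-3 * x**2 * w + w.mean() + np.sin(x))
    x = x + dt * V(x, x)

def remainder(lam):           # max_i |X^lam_T - X_T - lam * w_T|
    return np.max(np.abs(flow(lam) - x - lam * w))

ratio = remainder(0.1) / remainder(0.05)   # ~4 for an O(lam^2) remainder
```

Since both flows and the linearization are discretized by the same Euler scheme, the discrete remainder is also second order in \( \lambda \), so halving \( \lambda \) should divide it by roughly four.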

Before presenting the proof, note that our assumptions on V and W imply that there exists \( C>0 \) such that \( \left| V^{\lambda }_{t}(x,\mu )\right| \le C\left( 1+\vert x\vert \right) \), for all t, x, \( \mu \), \( \lambda \). This means that \( \left| X^{\lambda ,\vartheta }_{t}(x) \right| \le e^{Ct}(Ct+\vert x\vert ) \) for all t, x, \( \vartheta \), \( \lambda \). As a consequence, \( (t,x,\vartheta )\mapsto \left( t,X^{\lambda ,\vartheta }_{t}(x), X^{\lambda ,\vartheta }_{t\sharp }\vartheta \right) \) maps \( I\times \Omega \times \mathscr {P}(\Omega ) \) into \( I\times {\varvec{B}}_{\rho }\times \mathscr {P}({\varvec{B}}_{\rho }) \), where

$$\begin{aligned} \rho \doteq \max \left\{ e^{CT}(CT+\vert x\vert ):\; x\in \Omega \right\} . \end{aligned}$$
(14)

Using the local boundedness of \( D_{x}V \), \( D_{\mu }V \) and W, we can find \( C_{\rho }>0 \) such that

$$\begin{aligned} \vert D_{x}V\vert \le C_{\rho },\quad \vert D_{\mu }V\vert \le C_{\rho },\quad \vert W\vert \le C_{\rho } \quad \text {on}\quad I\times {\varvec{B}}_{\rho }\times \mathscr {P}({\varvec{B}}_{\rho }). \end{aligned}$$
(15)

Now, it follows from (12), (13) that

$$\begin{aligned} \vert w\vert \le C_{\rho }e^{2C_{\rho }T}\quad \text {on}\quad I\times \Omega \times \mathscr {P}(\Omega ). \end{aligned}$$
(16)

This implies that \( (t,x,\vartheta )\mapsto \left( t,(X^{\lambda ,\vartheta }_{t}+\lambda w^{\vartheta }_{t})(x), (X^{\lambda ,\vartheta }_{t}+\lambda w^{\vartheta }_{t})_{\sharp }\vartheta \right) \) maps \( I\times \Omega \times \mathscr {P}(\Omega ) \) into \( I\times {\varvec{B}}_{r}\times \mathscr {P}({\varvec{B}}_{r}) \), where

$$\begin{aligned} r\doteq \rho + C_{\rho }e^{2C_{\rho }T}. \end{aligned}$$
(17)

Finally, since \( V^{\lambda } \), \( D_{x}V^{\lambda } \) and \( D_{\mu }V^{\lambda } \) are locally Lipschitz, we choose \( L_{r}>0 \) such that

$$\begin{aligned}&\textrm{lip}(V^{\lambda }_{t})\le L_{r},\quad \textrm{lip}(D_{x}V^{\lambda }_{t})\le L_{r}\quad \text {on}\quad {\varvec{B}}_{r}\times \mathscr {P}({\varvec{B}}_{r}),\nonumber \\&\textrm{lip}(D_{\mu }V^{\lambda }_{t})\le L_{r}\quad \text {on}\quad {\varvec{B}}_{r}\times \mathscr {P}({\varvec{B}}_{r})\times {\varvec{B}}_{r}, \end{aligned}$$
(18)

for all \( t\in I \) and \( \lambda \in [0,1] \).

Fix a compact set \( \Omega \subset {{\mathbb {R}}}^{n} \) and a measure \( \vartheta \in \mathscr {P}(\Omega ) \). From now on, we will omit the index \( \vartheta \) in \( X^{\lambda ,\vartheta }_{t} \) and \( w_{t}^{\vartheta } \). Consider the following set:

$$\begin{aligned} \mathscr {X}(\Omega ) = \left\{ \varphi \in C^{0}(I\times \Omega ;{{\mathbb {R}}}^{n}):\vert \varphi _{t}(x)\vert \le e^{Ct}(Ct+\vert x\vert )\right\} , \end{aligned}$$

and equip it with the norm \( \Vert \varphi \Vert _{\sigma } = \max _{I\times \Omega } e^{-\sigma t}\vert \varphi _{t}(x)\vert \), \(\sigma >0\). Since \( \Vert \cdot \Vert _{\sigma } \) is equivalent to the standard \( \sup \) norm, \( \mathscr {X}(\Omega ) \) becomes a complete metric space.

Finally, for any \( \lambda \in [0,1] \) and \( \varphi \in \mathscr {X}(\Omega ) \), we define

$$\begin{aligned} \mathscr {F}(\lambda ,\varphi )(t,x) = x + \int _{0}^{t}V_{\tau }^{\lambda }\left( \varphi _{\tau }(x), \varphi _{\tau \sharp }\vartheta \right) \,d\tau ,\quad t\in I,\; x\in \Omega . \end{aligned}$$

One can easily check that \( \mathscr {F} \) maps \( [0,1]\times \mathscr {X}(\Omega ) \) to \( \mathscr {X}(\Omega ) \).

Lemma 2.5

The map \( \varphi \mapsto \mathscr {F}(\lambda ,\varphi ) \) is contractive in the \( \sigma \)-norm for all sufficiently large \( \sigma \). Moreover, the corresponding Lipschitz constant \( \kappa < 1 \) does not depend on \( \lambda \).

Proof

Let r be defined by (17). Given \( \varphi ,\psi \in \mathscr {X}(\Omega ) \), we have

$$\begin{aligned} \left| \mathscr {F}(\lambda ,\varphi ) - \mathscr {F}(\lambda ,\psi )\right| (t,x)&\le \int _{0}^{t}\left| V^{\lambda }_{\tau }\left( \varphi _{\tau }(x),\varphi _{\tau \sharp }\vartheta \right) - V^{\lambda }_{\tau }\left( \psi _{\tau }(x),\psi _{\tau \sharp }\vartheta \right) \right| \,d\tau \\&\le L_{r}\int _{0}^{t}\left( \left\| \varphi _{\tau }-\psi _{\tau }\right\| _{\mathscr {C}^{0}(\Omega ;{{\mathbb {R}}}^{n})}+\left\| \varphi _{\tau }-\psi _{\tau }\right\| _{L^{2}_{\vartheta }}\right) \,d\tau , \end{aligned}$$

for any \( t\in I \), \( x\in \Omega \), \( \lambda \in [0,1] \). Since \(\left\| \varphi _{\tau }-\psi _{\tau }\right\| _{L^{2}_{\vartheta }}\le \left\| \varphi _{\tau }-\psi _{\tau }\right\| _{\mathscr {C}^{0}(\Omega ;{{\mathbb {R}}}^{n})},\) we obtain:

$$\begin{aligned} \left\| \mathscr {F}(\lambda ,\varphi )_{t} - \mathscr {F}(\lambda ,\psi )_{t}\right\| _{\mathscr {C}^{0}(\Omega ;{{\mathbb {R}}}^{n})}&\le 2L_{r}\int _{0}^{t}\left\| \varphi _{\tau }-\psi _{\tau }\right\| _{\mathscr {C}^{0}(\Omega ;{{\mathbb {R}}}^{n})}\,d\tau . \end{aligned}$$

Then, for all \( t\in I \),

$$\begin{aligned}&e^{-\sigma t}\left\| \mathscr {F}(\lambda ,\varphi )_{t} - \mathscr {F}(\lambda ,\psi )_{t}\right\| _{\mathscr {C}^{0}(\Omega ;{{\mathbb {R}}}^{n})} \le 2L_{r}e^{-\sigma t}\int _{0}^{t}e^{\sigma \tau }\left\| \varphi -\psi \right\| _{\sigma }\,d\tau \\&\quad \le \frac{2L_{r}}{\sigma }\left\| \varphi -\psi \right\| _{\sigma }, \end{aligned}$$

which means that \( \mathscr {F}(\lambda ,\cdot ) \) is contractive for any \( \sigma >2L_{r} \). \(\square \)
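The contraction argument above suggests a constructive scheme: Picard iteration of \( \mathscr {F}(\lambda ,\cdot ) \) converges geometrically to the flow. A minimal sketch, assuming the consensus field \( V(x,\mu ) = \int y\,d\mu (y) - x \) (an illustrative choice) and a particle-and-time-grid discretization:

```python
import numpy as np

N, steps, T = 30, 400, 1.0
dt = T / steps
rng = np.random.default_rng(1)
x0 = rng.normal(size=N)

# phi[k, i] approximates phi(t_k, x_i); theta is the empirical measure.
phi = np.tile(x0, (steps + 1, 1))          # initial guess: frozen paths
gaps = []
for _ in range(12):
    # V(phi, phi# theta) at every grid time, for V(x, mu) = mean - x.
    drift = phi[:-1].mean(axis=1, keepdims=True) - phi[:-1]
    # Picard update: phi_new(t, x) = x + \int_0^t V(...) ds
    # (left-endpoint quadrature of the time integral).
    new = np.vstack([x0, x0 + dt * np.cumsum(drift, axis=0)])
    gaps.append(np.max(np.abs(new - phi)))
    phi = new
```

The successive gaps decay like \( (Lt)^{k}/k! \), mirroring the factorial gain behind the weighted \( \sigma \)-norm trick.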

Proof of Theorem 2.1

Let \(\sigma \) be chosen so that \(\sigma >2L_r\) as in the proof of Lemma 2.5. By definition, \( X^{\lambda } \) is a fixed point of \( \mathscr {F}(\lambda ,\cdot ) \) for any \( \lambda \in [0,1] \). Therefore, by Theorem A.2.1 in [15],

$$\begin{aligned} \left\| X^{\lambda } - X - \lambda w \right\| _{\sigma }\le \frac{1}{(1-\kappa )}\left\| \mathscr {F}\left( \lambda ,X + \lambda w \right) -X - \lambda w \right\| _{\sigma }, \end{aligned}$$
(19)

where \( \kappa \doteq 2L_{r}/\sigma < 1 \).

It remains to estimate the right-hand side of (19). Since \( X = \mathscr {F}(0,X) \), we obtain

$$\begin{aligned}&\mathscr {F}\left( \lambda ,X + \lambda w \right) (t,x) - X_{t}(x) \\&\quad = \int _{0}^{t} \left[ V_{\tau }\left( X_{\tau }(x)+\lambda w_{\tau }(x),\left( X_{\tau }+\lambda w_{\tau }\right) _{\sharp }\vartheta \right) -V_{\tau }\left( X_{\tau }(x),X_{\tau \sharp }\vartheta \right) \right] \,d\tau \\&\qquad + \lambda \int _{0}^{t} W_{\tau }\left( X_{\tau }(x)+\lambda w_{\tau }(x),\left( X_{\tau }+\lambda w_{\tau }\right) _{\sharp }\vartheta \right) \,d\tau . \end{aligned}$$

Lemma 2.3 demonstrates that the first integrand is equal to

$$\begin{aligned}&\lambda D_{x}V_{\tau }\left( X_{\tau }(x),X_{\tau \sharp }\vartheta \right) w_{\tau }(x) + \lambda \int D_{\mu }V_{\tau }\left( X_{\tau }(x),X_{\tau \sharp }\vartheta ,X_{\tau }(y)\right) w_{\tau }\left( y\right) \,d\vartheta (y)\\&\quad +\mathscr {O}_\textrm{loc}(t,x,\vartheta ;\lambda ^{2}), \end{aligned}$$

and the second one can be rewritten as \( \lambda W_{\tau } \left( X_{\tau }(x), X_{\tau \sharp }\vartheta \right) + \mathscr {O}_\textrm{loc}(t,x,\vartheta ;\lambda ^{2})\). Now, the statement follows from (12). The fact that \( C_{\Omega } \) depends only on \( \rho \), \( C_{\rho } \), r, \( L_{r} \) is a consequence of (17), (18) and the second part of Lemma 2.3. \(\square \)

3 Increment Formula

Now, we turn to the analysis of the increment of the cost functional along an adequate class of control variations. The theory of Pontryagin’s maximum principle is commonly built around the class of needle-shaped variations. However, in the control-affine case considered here, the latter can be replaced by a simpler class of weak control variations.

3.1 Problem Specification

In this section, in order to simplify the presentation, we assume that the driving vector field V is affine in the control variable u, i.e.,

$$\begin{aligned} V_{t}(x, \mu , u) = V^0_{t}(x, \mu ) + \sum _{j=1}^{m} V^j_{t}(x, \mu ) \, u_j,\quad {u_j\in {\mathbb {R}}}, \end{aligned}$$
(20)

and the running cost is identically zero, i.e., \(L \equiv 0\). Later, in Sect. 6 we will discuss how to deal with the general case. We begin by listing our basic assumptions.

Assumption \(({\varvec{A}}_{1})\):

  1. V takes the form (20), where all \( V^{j} \) with \( 0\le j\le m \) are of class \( \mathscr {C}^{1,1} \);

  2. \( U\subset {{\mathbb {R}}}^{m} \) is compact and convex;

  3. \( \ell :\mathscr {P}_{c}({{\mathbb {R}}}^{n})\rightarrow {{\mathbb {R}}}\) is of class \( \mathscr {C}^{1,1} \).

Assumption \( ({\varvec{A}}_{2}) \): all maps \( D_{x}V^{j} \) with \( 0\le j\le m \) are continuously differentiable in x and their derivatives are locally bounded.

Remark 3.1

We use Assumption \( ({\varvec{A}}_{2}) \) only once: to show that, for any fixed control function \( u\in \mathscr {U} \), the solution w of (12), (13) which corresponds to \( V_{t}(x,\mu )\doteq V_{t}(x,\mu ,u(t)) \) is \( \mathscr {C}^{1} \) in x. Indeed, \( t\mapsto w^{\vartheta }_{t}(x) \) satisfies the ODE: \(\displaystyle \frac{d}{dt}w_{t} = A(t,x)w_{t} + b(t,x),\) where both functions

$$\begin{aligned} A(t,x)&\doteq D_xV_{t}\left( X^{\vartheta }_{t}(x),X^{\vartheta }_{t\sharp }\vartheta \right) , \text{ and }\\ b(t,x)&\doteq \int D_{\mu }V_{t}\left( X_{t}^{\vartheta }(x),X^{\vartheta }_{t\sharp }\vartheta , X^{\vartheta }_{t}(y)\right) w^{\vartheta }_{t}\left( y\right) \,d\vartheta (y) + W_{t}\left( X^{\vartheta }_{t}(x) ,X^{\vartheta }_{t\sharp }\vartheta \right) \end{aligned}$$

are continuously differentiable. Hence, w is \( {\mathscr {C}}^1 \) in x by standard ODE theory.

3.2 Increment Formula I

Further in this section, \( \vartheta \) is supposed to be fixed, so we will omit it in the arguments of X and w.

Let us fix a pair of control functions \( u,{\bar{u}}\in \mathscr {U} \), \(u \ne {\bar{u}}\). We call u a reference control and \( {\bar{u}} \) a target control. A weak variation of u towards \( {\bar{u}} \) is the convex combination

$$\begin{aligned} u^{\lambda } \doteq u + \lambda ({\bar{u}} - u), \quad \lambda \in [0,1]. \end{aligned}$$
(21)

In view of (20), the variation (21) implies the following perturbation of the reference vector field \( V_{t}(x,\mu ) \doteq V_{t}\left( x,\mu ,u(t)\right) \):

$$\begin{aligned} V^\lambda _t(x, \mu ) \doteq&V_{t}\left( x, \mu ,u^{\lambda }(t)\right) = V_{t}(x, \mu ) + \lambda W_t(x, \mu ),\\ W_t(x, \mu ) \doteq&\sum _{j=1}^{m} V^j_{t}(x, \mu ) \, \left( {\bar{u}}_j(t) - u_j(t)\right) . \end{aligned}$$

Note that, by Assumption \( ({\varvec{A}}_{1}) \), there exists \( C>0 \) such that \( \left| V^{\lambda }_{t}(x,\mu ) \right| \le C(1+\vert x\vert ) \), for all t, x, \( \mu \), \( \lambda \), u, \( {\bar{u}} \). This means that \( \rho \) from (14) can be chosen independently of \( u, {\bar{u}}\in \mathscr {U} \). Again, by Assumption \( ({\varvec{A}}_{1}) \), we can find \( C_{\rho } \) which guarantees, for all \( u,{\bar{u}}\in \mathscr {U} \), the estimate (15), then construct r by (17) and find \( L_{r} \) such that (18) holds for all \( u,{\bar{u}}\in \mathscr {U} \). Now, Theorem 2.1 implies that

$$\begin{aligned} X^{\lambda }_{T} - X_{T} - \lambda w_{T} = \mathscr {O}_\textrm{loc}(x,\vartheta ,u,{\bar{u}};\lambda ^{2}), \end{aligned}$$

where w is a solution of (12), (13). Here, we think of \( \mathscr {U} \) as a compact topological space equipped with the weak-\( * \) topology \( \sigma (L^{\infty },L^{1}) \).

Since \( \mathscr {I}[u] = \ell (X_{T\sharp }\vartheta ) \), \( \mathscr {I}[u^{\lambda }] = \ell (X^{\lambda }_{T\sharp }\vartheta ) \), and \( \vartheta \) is fixed, we can use Lemma 2.4 to get

Proposition 3.1

Under assumptions \( ({\varvec{A}}_{1}) \), \( ({\varvec{A}}_{2}) \) one has

$$\begin{aligned} \mathscr {I}[u^{\lambda }] - \mathscr {I}[u] = \lambda \int D_{\mu }\ell \left( X_{T\sharp }\vartheta ,X_{T}(y)\right) \, w_T\left( y\right) \,d\vartheta (y) + \mathscr {O}(u,{\bar{u}};\lambda ^{2}), \end{aligned}$$
(22)

where w is a solution of the linear problem (12), (13).

Here we write \( \mathscr {O} \) instead of \( \mathscr {O}_\textrm{loc}\) because \( \mathscr {U} \) is already compact.

Our next goal is to rewrite this formula in a “constructive” form, namely, in terms of the Hamiltonian system associated with our optimal control problem.

3.3 Hamiltonian System

The Hamiltonian system associated with Problem (P) (see (1)-(3)) is merely a continuity equation on the cotangent bundle of \( {{\mathbb {R}}}^n \), i.e., on the space \({{\mathbb {R}}}^{n}\times ({{\mathbb {R}}}^n)^* \simeq {{\mathbb {R}}}^{2n}\) consisting of pairs \((x,p)\), where x is the primal and p is the dual state variable. In our case, this equation takes the form

$$\begin{aligned} \partial _t \gamma _t + \textrm{div}_{(x,p)}\left( \textbf{H}_{t}\left( \cdot , \cdot ,\gamma _t, u(t)\right) \gamma _t\right) =0, \end{aligned}$$
(23)
$$\begin{aligned} \textbf{H}_{t}(x, p, \gamma , u)\doteq \begin{pmatrix} \displaystyle V_{t}(x, \pi ^1_\sharp \gamma , u)\\ \displaystyle - p\, D_x V_t(x, \pi ^1_\sharp \gamma ,u) - \iint q \, D_\mu V_t(y, \pi ^1_\sharp \gamma , u, x)\,d\gamma (y,q) \end{pmatrix}. \end{aligned}$$
(24)

This equation is supplemented with the terminal condition

$$\begin{aligned} \gamma _{T} = \left( \textrm{id},-D_{\mu }\ell (\mu _T)\right) _{\sharp }\mu _{T}, \end{aligned}$$
(25)

where \( \mu _{t} \) satisfies (2). The standard well-posedness result for nonlocal continuity equations (see, e.g., [46]) guarantees that (23), (25) has a unique solution \( \gamma _{t} \). Moreover, the projection of \( \gamma _{t} \) onto the x space coincides with \( \mu _{t} \):

$$\begin{aligned} \pi ^{1}_{\sharp }\gamma _{t} = \mu _{t}\quad \forall t\in I. \end{aligned}$$
(26)
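In particle form, the Hamiltonian system (23)-(25) becomes a forward sweep for the x-components followed by a backward sweep for the p-components, with \( \gamma _{t} \) the empirical measure of the pairs \( (x_{i},p_{i}) \). A sketch under illustrative assumptions: \( V_{t}(x,\mu ) = \int y\,d\mu (y) - x \) (so \( D_{x}V = -1 \), \( D_{\mu }V \equiv 1 \)) and \( \ell (\mu ) = \tfrac{1}{2}\int y^{2}\,d\mu (y) \) (so \( D_{\mu }\ell (\mu ,y) = y \)).

```python
import numpy as np

N, steps, T = 40, 500, 1.0
dt = T / steps
rng = np.random.default_rng(2)
xs = [rng.normal(size=N)]
for _ in range(steps):                   # forward sweep: mu_t
    x = xs[-1]
    xs.append(x + dt * (x.mean() - x))
p = -xs[-1]                              # terminal condition (25): p = -D_mul
for k in range(steps, 0, -1):            # backward sweep for p:
    # dp/dt = -p D_xV - (1/N) sum_j p_j D_muV = p - mean(p) here
    p = p - dt * (p - p.mean())
```

By construction, the x-marginal of the particle pairs is the particle approximation of \( \mu _{t} \), in line with (26); for this field, the means of both x and p are invariants of the two sweeps, which serves as a consistency check.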

3.4 Increment Formula II

Let us go back to (22). First, recalling that \(\mu _T\doteq X_{T\sharp }\vartheta \), we express the integral entering its right-hand side as follows:

$$\begin{aligned}&\int D_{\mu }\ell (\mu _T, x)\, w_T\left( X_{T}^{-1}(x)\right) \,d\mu _{T}(x)\\&\quad = -\iint p \, w_T\left( X_{T}^{-1}(x)\right) \,d\left[ \left( \textrm{id},-D_{\mu }\ell (\mu _T)\right) _{\sharp }\mu _{T}\right] (x,p)\\&\quad = -\iint p \, w_T\left( X_{T}^{-1}(x)\right) \,d\gamma _{T}(x,p). \end{aligned}$$

By [1, Lemma 8.1.2], the following version of the classical Newton-Leibniz formula holds for any function \(\psi \in \mathscr {C}^1(I\times {{\mathbb {R}}}^{2n})\):

$$\begin{aligned}&\iint \psi _T \,d\gamma _{T} - \iint \psi _0 \,d\gamma _{0} = \int _{0}^{T}\Big (\iint \Xi _t(x, p) \,d\gamma _{t}(x,p)\Big )\,dt, \end{aligned}$$
(27)
$$\begin{aligned}&\Xi _t(x, p) \doteq \partial _{t}\psi _t(x,p) + \nabla _{x}\psi _t(x,p) \, V_t\left( x, \mu _t\right) \nonumber \\&\quad -\Big [p\, D_x V_t\left( x, \mu _t\right) + \iint q \, D_\mu V_t\left( y, \mu _t,x \right) \,d\gamma _t(y,q)\Big ] \nabla _{p}\psi _t(x,p). \end{aligned}$$
(28)

Remark 3.1 allows us to take \( \psi _t(x,p) \doteq p \cdot w_t\left( X_{t}^{-1}(x)\right) \) in the above expression. Recall that \( X_{t}(x) = P_{0,t}(x) \), where P is the flow of the nonautonomous vector field \( v_{t}(x) = V_{t}\left( x,\mu _{t},u(t)\right) \); in particular, \( X_{t}^{-1} = P_{t,0} \), and we can use the standard rules of flow differentiation ([15, Theorem 2.3.3]) to perform the calculations:

$$\begin{aligned} \partial _{t}\psi _t(x,p)&= p \left[ \partial _{t}w_t \left( P_{t,0}(x)\right) + D_x w_t\left( P_{t,0}(x)\right) \partial _t P_{t,0}(x)\right] \\&= p \left[ \partial _{t}w_t\left( P_{t,0}(x)\right) - D_x w_t \left( P_{t,0}(x)\right) \, D_{x}P_{t,0}(x) \, V_t\left( x, \mu _t\right) \right] \\&= p \, \partial _{t}w_t\left( P_{t,0}(x)\right) - \nabla _{x}\psi _t(x,p) \cdot V_t\left( x, \mu _t\right) ,\\ \nabla _x \psi _t(x,p)&= p \, D_x w_t\left( P_{t,0}(x)\right) \, D_xP_{t,0}(x),\quad \nabla _p \psi _t(x,p) = w_t\left( P_{t,0}(x)\right) . \end{aligned}$$

Then,

$$\begin{aligned} \iint \Xi _t \,d\gamma _{t} =&\iint \left[ p \, \partial _{t}w_t\left( X_{t}^{-1}(x)\right) - p \, D_x V_t\left( x, \mu _t\right) w_t\left( X_{t}^{-1}(x)\right) \right] \,d\gamma _t(x, p)\\&- \iint \Big [\iint qD_\mu V_t\left( y, \mu _t,x\right) \,d\gamma _t(y,q)\Big ] \, w_t\left( X_{t}^{-1}(x)\right) \,d\gamma _t(x, p). \end{aligned}$$

In view of (12), the right-hand side reduces to

$$\begin{aligned}&\iint p \, W_t\left( x,\mu _{t}\right) \,d\gamma _t(x, p)\\&\quad +\iint p \Big [\int D_{\mu }V_t(x,\mu _{t},y)\, w_{t}\left( X_{t}^{-1}(y)\right) \,d\mu _{t}(y)\Big ]\,d\gamma _t(x, p)\\&\quad - \iint \Big [\iint q \, D_\mu V_t\left( y, \mu _t,x\right) \,d\gamma _t(y,q)\Big ] \, w_t\left( X_{t}^{-1}(x)\right) \,d\gamma _t(x, p). \end{aligned}$$

Renaming the variables \((x, p) \leftrightarrow (y, q)\) in the latter term shows that the last two terms cancel out. Hence,

$$\begin{aligned} \iint \Xi _t \,d\gamma _{t} = \iint p \, W_t\left( x,\mu _{t}\right) \,d\gamma _t(x, p). \end{aligned}$$

Finally, noticing that \( \psi _0 \doteq p \, w_0 \equiv 0 \) and

$$\begin{aligned} \lambda \iint \psi _T \,d\gamma _{T}&= \lambda \iint p \, w_T\left( X_{T}^{-1}(x)\right) \,d\gamma _{T}(x,p) \\&= - \left( \mathscr {I}[u^{\lambda }] - \mathscr {I}[u] - \mathscr {O}(u,{\bar{u}};\lambda ^{2})\right) , \end{aligned}$$

then using (27) and the definition of \(W_t\), we have

Proposition 3.2

Under assumptions \( ({\varvec{A}}_{1}) \), \( ({\varvec{A}}_{2}) \) one has

$$\begin{aligned} \mathscr {I}[u^\lambda ] - \mathscr {I}[u] = -\lambda \int _0^T\left( H_{t}\left( \gamma _t, {\bar{u}}(t)\right) - H_{t}\left( \gamma _t, u(t)\right) \right) \,dt+\mathscr {O}(u,{\bar{u}};\lambda ^{2}), \end{aligned}$$
(29)

where

$$\begin{aligned} H_{t}(\gamma , u) \doteq \iint p \, V_{t}(x, \pi ^1_\sharp \gamma ,u)\,d{\gamma (x,p)} \end{aligned}$$
(30)

and \(t \mapsto \gamma _t\) is a solution of the Hamiltonian system (23)–(25).
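The increment formula (29) can be verified end to end on a toy control-affine model. All data below are illustrative assumptions: \( n = m = 1 \), \( V_{t}(x,\mu ,u) = \int y\,d\mu (y) - x + u \) (so (20) holds with \( V^{0} = \text{mean} - x \) and \( V^{1} = 1 \)), \( \ell (\mu ) = \tfrac{1}{2}\int y^{2}\,d\mu (y) \), reference control \( u \equiv 0 \), and target control \( {\bar{u}}(t) = \sin t \).

```python
import numpy as np

N, steps, T, lam = 60, 800, 1.0, 0.1
dt = T / steps
rng = np.random.default_rng(3)
x0 = rng.normal(size=N)
u = np.zeros(steps)
ubar = np.sin(np.linspace(0.0, T, steps))

def cost(ctrl):                          # I[u] = l(X_T# theta) on particles
    x = x0.copy()
    for k in range(steps):
        x = x + dt * (x.mean() - x + ctrl[k])
    return 0.5 * np.mean(x**2)

xs = [x0.copy()]                         # forward sweep, reference control
for k in range(steps):
    x = xs[-1]
    xs.append(x + dt * (x.mean() - x + u[k]))
p = -xs[-1]                              # terminal condition (25)
pmeans = np.empty(steps)
for k in range(steps - 1, -1, -1):       # backward sweep (discrete adjoint)
    pmeans[k] = p.mean()
    p = p - dt * (p - p.mean())          # D_xV = -1, D_muV = 1 here

# H_t(gamma, ubar) - H_t(gamma, u) = (iint p dgamma) (ubar - u) for this V,
# so the right-hand side of (29) reads:
rhs = -lam * dt * np.sum(pmeans * (ubar - u))
lhs = cost(u + lam * (ubar - u)) - cost(u)
```

For this linear-quadratic model the backward sweep is the exact adjoint of the forward Euler scheme, so the gap between the two sides of (29) is purely the \( O(\lambda ^{2}) \) term and shrinks fourfold when \( \lambda \) is halved.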

3.5 Pontryagin’s Maximum Principle

A consequence of the increment formula (29) is the following version of Pontryagin’s maximum principle.

Theorem 3.1

(PMP in terms of Hamiltonian system) Assume that \(({\varvec{A}}_{1,2})\) hold, and \( \vartheta \in {\mathscr {P}}_{c}(\mathbb {R}^{n}) \). Let \((\mu ,u)\) be an optimal pair for (P). Then u(t) satisfies, for a.e. \(t\in I\), the maximum condition

$$\begin{aligned} H_{t}\left( \gamma _t, u(t)\right) = \max _{\upsilon \in U} H_{t}(\gamma _t, \upsilon ), \end{aligned}$$
(31)

where \(\gamma \) is the unique solution of the Hamiltonian system (23)–(25) and \( H_{t} \) is defined by (30).

Proof

Since u is optimal, we have \( \mathscr {I}[u^{\lambda }] - \mathscr {I}[u] \ge 0 \) for any target control \( {\bar{u}} \). Now, the increment formula implies that

$$\begin{aligned} \sup _{{\bar{u}} \in \mathscr {U}}\int _0^TH_{t}\left( \gamma _t, {\bar{u}}(t)\right) \,dt = \int _{0}^{T} H_{t}\left( \gamma _t, u(t)\right) \,dt. \end{aligned}$$

On the other hand,

$$\begin{aligned} \sup _{{\bar{u}} \in \mathscr {U}}\int _0^TH_{t}\left( \gamma _t, {\bar{u}}(t)\right) \,dt \le \int _{0}^{T}\max _{\upsilon \in U}H_{t}\left( \gamma _t, \upsilon \right) \,dt. \end{aligned}$$
(32)

Let \( \psi (t,\upsilon ) \doteq H_{t}(\gamma _{t},\upsilon ) \) and \( \alpha (t) \doteq \max _{\upsilon \in U}\psi (t,\upsilon ) \). It is easy to check that \( \psi \) is a Carathéodory map. Since \( \alpha (t)\in \psi (t,U) \) for a.e. \( t\in I \), we deduce from Filippov’s lemma [4, Theorem 8.2.10] that there exists \( {\tilde{u}}\in \mathscr {U} \) satisfying \( \alpha (t) = \psi (t,{\tilde{u}}(t)) \) for a.e. \( t\in I \). Hence, the inequality in (32) can be replaced by the equality, which completes the proof. \(\square \)

Remark 3.2

Pontryagin’s maximum principle in the form of Theorem 3.1 is essentially the same as in [10, 12]. However, in those papers the driving vector field has a specific form: it is the sum of a nonlocal drift term and an external Lipschitz vector field \(u=u(t,x)\) playing the role of the control action. In our case, the control is a measurable function of the time variable only, \(u=u(t)\), which may enter the nonlocal term itself, thus enabling us, e.g., to govern convolution kernels as in (4). Finally, note that Theorem 3.1 can be derived from the (most general) version of PMP recently obtained in [7], which relies on the so-called Lagrangian interpretation [24] of the mean-field control problem (P).

Remark 3.3

We conclude this section by stressing two obvious drawbacks of the presented form of the necessary optimality condition, which are critical for its numerical implementation.

  1. 1.

    Equation (23) is defined on a space of dimension 2n, which makes its numerical solution computationally demanding even for \(n=2\).

  2. 2.

    Even if \(\mu \) is absolutely continuous, \(\gamma \) is not. In other words, \(\gamma \) never takes the form \(\rho _t \, {\mathscr {L}}^{2n}\) with a density function \((t,x,p) \mapsto \rho _t(x,p)\). This is due to the fact that \(\gamma _T\) is supported on the graph of the map \(x \mapsto -D_{\mu }\ell (\mu _T)(x)\), which is always an \({\mathscr {L}}^{2n}\)-null set. As a consequence, system (23) cannot be solved by standard numerical methods for hyperbolic PDEs, which apply only when densities exist.

These issues motivate the development of a new version of Theorem 3.1, which is obtained by extracting the “adjoint system” from the Hamiltonian PDE (23).

4 Adjoint Equation

In this section, we shall see that the Hamiltonian system (23) can be decoupled into primal and dual parts, just as in classical optimal control theory. This fact will allow us to rewrite the increment formula and Pontryagin’s maximum principle in an equivalent form, suitable for numerics.

4.1 Derivation

Formula (29) suggests taking, as the adjoint trajectory, the family of signed vector (namely, row-vector) measures defined by

$$\begin{aligned} \langle \nu _{t},\varphi \rangle \doteq \int p\,\varphi (x) \,d\gamma _{t}(x,p), \quad \varphi \in \mathscr {C}^1({{\mathbb {R}}}^n;{{\mathbb {R}}}^{n}), \end{aligned}$$
(33)

where \( \gamma _{t} \) is the solution of (23)–(25). Indeed, return to representation (27), (28) and specify the class of test functions \(\psi \) as follows:

$$\begin{aligned} \psi _t(x, p)= p\,\varphi _t(x), \quad \varphi \in \mathscr {C}^1_{c}\left( (0,T)\times {{\mathbb {R}}}^{n};{{\mathbb {R}}}^{n}\right) . \end{aligned}$$

In this case, the left-hand side of (27) vanishes, which implies

$$\begin{aligned} \int _{0}^{T}\Big (\iint \Xi _t \,d\gamma _{t}\Big )\,dt = 0, \end{aligned}$$
(34)

where \(\Xi \) is defined in (28). In terms of \(\nu \), the terms of the integral in the left-hand side of (34) can be represented as follows:

$$\begin{aligned} \iint \partial _{t}\psi _t(x,p) \,d\gamma _{t}(x,p)&= \iint p\,\partial _{t} \varphi _t(x)\,d\gamma _{t}(x,p) = \int \partial _{t}\varphi _t(x)\cdot \,d\nu _{t}(x),\\ \iint \nabla _{x}\psi _t(x,p) \, V_t(x, \mu _t)\,d\gamma _{t}(x,p)&= \iint p \,D_x\varphi _t(x) \, V_t(x, \mu _t)\,d\gamma _{t}(x,p)\\&=\int D_x\varphi _t(x) \, V_t(x, \mu _t)\cdot \,d\nu _{t}(x),\\ \iint p\,D_{x}V_{t}(x,\mu _{t})\,\nabla _{p}\psi _t(x,p)\,d\gamma _{t}(x,p)&= \int D_{x}V_{t}(x,\mu _{t})\, \varphi _t(x)\cdot \,d\nu _{t}(x), \end{aligned}$$

and, according to (26),

$$\begin{aligned}&\iint \bigg (\iint q D_\mu V_t\left( y, \mu _t,x\right) \,d\gamma _t(y,q)\bigg )\nabla _p\psi _t(x,p)\,d\gamma _t(x,p)\\&\quad = \iint \bigg (\int D_\mu V_t\left( y, \mu _t,x\right) \varphi _t(x)\cdot \,d\nu _t(y)\bigg )\,d\gamma _t(x,p)\\&\quad = \int \bigg (\int D_\mu V_t\left( y, \mu _t,x\right) \varphi _{t}(x)\cdot \,d\nu _t(y)\bigg )\,d\mu _t(x). \end{aligned}$$

Substituting these expressions into (34), we obtain

$$\begin{aligned} 0&\equiv \int _0^T \Big [\int \left( \partial _{t}\varphi _t(x) + D_x\varphi _t(x) \, V_t(x, \mu _t)\right) \cdot \,d\nu _t(x)\Big ] \,dt\,\nonumber \\&\quad -\int _0^T \!\int D_x V_t\left( x, \mu _t\right) \,\varphi _{t}(x)\cdot \,d\nu _t(x)\,dt\nonumber \\&\quad - \int _0^T\int \!\Big (\!\int D_\mu V_t\left( y, \mu _t,x\right) \varphi _{t}(x)\cdot \,d\nu _t(y)\Big )\,d\mu _t(x)\,dt. \end{aligned}$$
(35)

The choice \( \varphi (x) = (0,\ldots ,\varphi ^{i}(x),\ldots 0)^{{T}} \), where only i-th component of \( \varphi \) is nonzero, shows that this is merely the weak formulation of the following system of balance laws:

$$\begin{aligned} \partial _{t}\nu _{i} + \textrm{div}_x(v \, \nu _i)&= \sum _{j} \left[ \Big (\int m^{j}_{i} \,d\nu _j\Big )\mu - \partial _{x_{i}}v^{j} \, \nu _{j}\right] , \quad 1\le i\le n. \end{aligned}$$
(36)

Here, for the sake of readability, we omit the lower index t of \(\nu _t\) and abbreviate

$$\begin{aligned} v_{t}(x)\doteq V_{t}\left( x,\mu _{t},u(t)\right) ,\qquad \int m^{j}_{i}\,d\nu _{j} \doteq \int m^{j}_{i}(t,x,y)\,d(\nu _{j})_{t}(y), \end{aligned}$$

where \( m^{j}_{i} = m^{j}_{i}(t,x,y) \) are elements of the matrix \( D_{\mu }V_{t}\left( y,\mu _{t},u(t),x\right) \).

At the final time instant T, one has

$$\begin{aligned} \langle \nu _{T},\varphi \rangle= & {} \iint p\,\varphi (x)\,d\left[ (\textrm{id},-D_\mu \ell (\mu _T))_{\sharp }\mu _{T}\right] (x,p)\\= & {} -\int D_\mu \ell (\mu _T)(x)\,\varphi (x)\,d\mu _{T}(x), \end{aligned}$$

which can be rewritten in terms of the Radon-Nikodym derivative as

$$\begin{aligned} \frac{d\nu _{T}}{d\mu _{T}} = -D_{\mu }\ell (\mu _{T}). \end{aligned}$$
(37)

Definition 4.1

We call the backward system (36), (37) of nonlocal linear PDEs the adjoint system associated to the optimal control problem (P).

4.2 Well-Posedness

We observe that there exists a solution of the adjoint system, namely, the one defined by (33). Let us show that this solution is unique. Basically, the adjoint system (36) is a system of linear balance laws with sources of the form

$$\begin{aligned} \varsigma _{i} \doteq \sum _{j=1}^{n} \Big [\Big (\int m^{j}_{i}\,d\nu _{j}\Big ) \mu -\partial _{x_{i}}v^{j}\nu _{j} \Big ]. \end{aligned}$$
(38)

To proceed, recall basic properties [47] of the linear balance law

$$\begin{aligned} \partial _t \rho _t + \textrm{div}_{x}(f_t\,\rho _t) = \varsigma _t \end{aligned}$$
(39)

with a Carathéodory, locally Lipschitz, sublinear vector field \( f_{t} \) and an integrable source \( \varsigma _{t} \).

Definition 4.2

A curve \(\varsigma :I\rightarrow \mathscr {M}({{\mathbb {R}}}^n)\) is called integrable if for any Borel set \( A\subset {{\mathbb {R}}}^{n} \) the map \(t\mapsto \varsigma _{t}(A)\) is measurable and \( \int _{0}^{T}\Vert \varsigma _{t}\Vert _{TV}\,d t < +\infty \), where \( \Vert \cdot \Vert _{TV} \) denotes the total variation norm on \( \mathscr {M}({{\mathbb {R}}}^{n}) \).

For integrable curves we can define a notion of integral in the usual way: \(\displaystyle \left( \int _{0}^{t}\varsigma _{s}\,ds\right) (A) \doteq \int _{0}^{t}\varsigma _{s}(A)\,ds\), for all Borel sets \( A\subset {{\mathbb {R}}}^{n} \).

Definition 4.3

A curve \( \rho \in C\big (I;{\mathscr {M}}({{\mathbb {R}}}^{n})\big ) \) is called a solution of (39) if and only if, for any test function \( \varphi \in C_{c}^{\infty }({{\mathbb {R}}}^{n})\) and a.e. \( t\in I \), one has

$$\begin{aligned} \frac{d}{dt}\int \varphi \,d \rho _{t} = \int \nabla \varphi \cdot f_{t}\,d \rho _{t} + \int \varphi \,d \varsigma _{t}. \end{aligned}$$

Theorem 4.1

Under our assumptions, there exists a unique solution of (39) with the initial condition \(\rho _0=\xi \). Moreover, it can be expressed by

$$\begin{aligned} \rho _t = P_{0,t\sharp }\xi + \int _0^t P_{s,t\sharp }\varsigma _s\,d s,\quad t\in I, \end{aligned}$$
(40)

where P is the flow of \( f_{t} \).
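As a sanity check, the representation (40) can be verified on a one-dimensional example. The sketch below uses an assumed linear field \(f_t(x)=ax\) and point masses \( \xi =\delta _{x_0} \), \( \varsigma _s=\delta _{y_0} \) (choices made for illustration only), and compares the first moment produced by (40) with the exact solution of the corresponding moment ODE.

```python
import numpy as np

# Sketch: verify (40), rho_t = P_{0,t#} xi + int_0^t P_{s,t#} sigma_s ds,
# for f_t(x) = a*x, xi = delta_{x0}, sigma_s = delta_{y0} (constant in s).
# The flow is P_{s,t}(x) = x * exp(a*(t-s)), so the first moment of rho_t is
#     m(t) = x0*exp(a*t) + int_0^t y0*exp(a*(t-s)) ds,
# which must agree with the solution of m' = a*m + y0, m(0) = x0.

a, x0, y0, t = 0.7, 1.3, -0.5, 2.0

# first moment via (40), trapezoidal quadrature in s
s = np.linspace(0.0, t, 20001)
vals = y0 * np.exp(a * (t - s))
integral = float(np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(s)))
m_formula = x0 * np.exp(a * t) + integral

# exact solution of the moment ODE m' = a*m + y0
m_exact = x0 * np.exp(a * t) + y0 * (np.exp(a * t) - 1.0) / a

print(abs(m_formula - m_exact))
```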

The following lemma collects several well-known properties of the total variation norm (since their proofs are quite standard, we omit them for brevity).

Lemma 4.1

Let \( \rho \in \mathscr {M}({{\mathbb {R}}}^{n}) \) and \( \varsigma :I\rightarrow \mathscr {M}({{\mathbb {R}}}^{n}) \) be an integrable curve. Then,

  1. 1.

    for any Borel measurable bijective map \( f:{{\mathbb {R}}}^{n}\rightarrow {{\mathbb {R}}}^{n} \),

    $$\begin{aligned} \Vert f_{\sharp }\rho \Vert _{TV} = \Vert \rho \Vert _{TV}; \end{aligned}$$
  2. 2.

    for all \( t\in I \),

    $$\begin{aligned} \Big \Vert \int _{0}^{t}\varsigma _{s}\,ds \Big \Vert _{TV}\le \int _{0}^{t}\Vert \varsigma _{s}\Vert _{TV}\,ds; \end{aligned}$$
  3. 3.

    if \( \textrm{spt}\rho \) is contained in a compact set \( \Omega \), then for any \( \varphi \in \mathscr {C}^{0}({{\mathbb {R}}}^{n}) \)

    $$\begin{aligned} \Vert \varphi \rho \Vert _{TV} \le \Vert \varphi \Vert _{\mathscr {C}^{0}(\Omega )}\,\Vert \rho \Vert _{TV}; \end{aligned}$$
  4. 4.

    if \( \textrm{spt}\rho \) is contained in a compact set \( \Omega \), then for any \( K \in \mathscr {C}^{0}({{\mathbb {R}}}^{2n}) \)

    $$\begin{aligned} \left| \int K(x,y)\,d\rho (y)\right| \le \Vert K\Vert _{{\mathscr {C}^{0}}{(\Omega ^{2})}}\,\Vert \rho \Vert _{TV}\quad \forall x\in \Omega . \end{aligned}$$

The well-posedness of the adjoint system is established by the following result, where \(\mathscr {M}_{c}({{\mathbb {R}}}^{n})\) denotes the subset of \(\mathscr {M}({{\mathbb {R}}}^{n})\) composed of signed measures with compact support.

Proposition 4.1

Under assumptions \(({\varvec{A}}_{1,2})\), the adjoint system (36) with the terminal condition \( \nu _{T} = \xi \), \( \xi \in \left[ \mathscr {M}_{c}({{\mathbb {R}}}^{n})\right] ^n\), has a unique solution.

Proof

Take two terminal measures \( \xi , \xi ' \in \left[ \mathscr {M}_{c}({{\mathbb {R}}}^{n})\right] ^n \) and denote by \( \nu _{t} \) and \( \nu _{t}' \) the corresponding (potentially, non-unique) trajectories of (36). Then, from Theorem 4.1 and Lemma 4.1, it follows that

$$\begin{aligned} \left\| (\nu _{i}-\nu _{i}')_t\right\| _{TV}\le \left\| \xi _{i}-\xi _{i}'\right\| _{TV} + \int _0^t \left\| (\varsigma _i-\varsigma _{i}')_{s}\right\| _{TV}\,d s, \end{aligned}$$

where \( \varsigma \) and \( \varsigma ' \) are the corresponding sources defined by (38). Since

$$\begin{aligned} \varsigma _i-\varsigma _{i}' = \sum _{j=1}^{n} \partial _{x_{i}}v^{j}(\nu _{j} - \nu _{j}') - \sum _{j=1}^{n}\Big (\int m^{j}_{i}\,d(\nu _{j} - \nu _{j}')\Big ) \mu , \end{aligned}$$

we obtain, again by Lemma 4.1,

$$\begin{aligned} \left\| \varsigma _i-\varsigma _{i}'\right\| _{TV} \le C^{1}_{\Omega }\sum _{j=1}^{n} \left\| \nu _{j} - \nu _{j}'\right\| _{TV} + C^{2}_{\Omega }\Vert \mu \Vert _{TV} \sum _{j=1}^{n}\left\| \nu _{j} - \nu _{j}'\right\| _{TV}, \end{aligned}$$

where \( \Omega \) is a compact set containing the supports of the measures \( \nu _{t} \), \( \nu _{t}' \), \( \mu _{t} \), \( t\in I \) (one can show that there is such a set by reasoning as in [47]), \( C^{1}_{\Omega } \) is an upper bound of \( \sum _{i,j}\vert \partial _{x_{i}}v^{j}\vert \) on \( I\times \Omega \) and \( C^{2}_{\Omega } \) is an upper bound of \( \sum _{i,j}\vert m^{j}_{i}\vert \) on \( I\times \Omega \times \Omega \).

By letting \( r(t) = \displaystyle \sum _{i=1}^{n}\Vert ( \nu _{i}-\nu _{i}')_{t}\Vert _{TV} \), we obtain

$$\begin{aligned} r(t) \le \sum _{i=1}^{n}\left\| \xi _{i}-\xi _{i}'\right\| _{TV} + n\int _{0}^{t}\left( C^{1}_{\Omega } + C^{2}_{\Omega }\Vert \mu \Vert _{TV}\right) r(s)\,ds. \end{aligned}$$

Now, Grönwall’s lemma gives the uniqueness. \(\square \)

4.3 Increment Formula III

The increment formula (29) and Pontryagin’s maximum principle (Theorem 3.1) are trivially reformulated in terms of a solution to the adjoint system.

Theorem 4.2

(Increment formula) Assume that \(({\varvec{A}}_{1,2})\) hold, and \( \vartheta \in {\mathscr {P}}_{c}(\mathbb {R}^{n}) \). Let \( u,{\bar{u}}\in \mathscr {U} \) and \( u^{\lambda } = u+\lambda ({\bar{u}} - u) \), \( \lambda \in [0,1] \), be the weak variation of u. Then,

$$\begin{aligned} \mathscr {I}[u^\lambda ] - \mathscr {I}[u] = -\lambda \int _0^T\left( {\textbf{H}}_{t}\left( \mu _t,\nu _{t}, {\bar{u}}(t)\right) - {\textbf{H}}_{t}\left( \mu _{t}, \nu _t, u(t)\right) \right) \,dt + \mathscr {O}(u,{\bar{u}};\lambda ^{2}), \end{aligned}$$
(41)

where

$$\begin{aligned} {\textbf{H}}_{t}(\mu , \nu , u) \doteq \int V_{t}(x, \mu , u)\cdot \,d\nu (x). \end{aligned}$$
(42)

Theorem 4.3

(PMP in terms of the adjoint system) Assume that \(({\varvec{A}}_{1,2})\) hold, and \( \vartheta \in {\mathscr {P}}_{c}(\mathbb {R}^{n}) \). Let \((\mu ,u)\) be an optimal pair for (P). Then u satisfies, for a.e. \(t\in I\), the maximum condition

$$\begin{aligned} {\textbf{H}}_{t}\left( \mu _t,\nu _{t}, u(t)\right) = \max _{\upsilon \in U} {\textbf{H}}_{t}(\mu _t,\nu _{t}, \upsilon ), \end{aligned}$$
(43)

where \(\nu \) is a unique solution of the adjoint system (36), (37) and \( {{\textbf {H}}}_{t} \) is defined by (42).

Remark 4.1

Since the adjoint system (36), (37) has a unique solution \( \nu _{t} \), it must coincide with the one given by (33). In particular, \( \nu _{t} \) acts on test functions \( \varphi \in \mathscr {C}^{1}({{\mathbb {R}}}^{n};{{\mathbb {R}}}^{n}) \) by the rule

$$\begin{aligned} \langle \nu _{t},\varphi \rangle =\int p \, \varphi (x)\,d\gamma _{t}(x,p) = \int \Big (\int p\,d\gamma _{t}^{x}(p)\Big )\varphi (x)\,d\mu _{t}(x), \end{aligned}$$
(44)

where \( \gamma ^{x}_{t} \) is the disintegration of \( \gamma _{t} \) with respect to \( \mu _{t} \) (see [1, Theorem 5.3.1]). If the initial measure \( \vartheta \) is absolutely continuous with respect to the Lebesgue measure \( \mathscr {L}^{n} \), then so are all \( \mu _{t} \), \( t\in I \) (thanks to the representation \( \mu _{t} = X_{t\sharp }\vartheta \)). Now, (44) implies that every \( \nu _{t} \), \( t\in I \), must be absolutely continuous as well.

The discussed fact has important consequences, which answer the challenges outlined by Remark 3.3:

  1. 1.

    In contrast to the Hamiltonian continuity equation (23), the adjoint system is amenable to standard numerical methods.

  2. 2.

    While handling the adjoint equation, we deal with a system of \(n+1\) first-order hyperbolic PDEs, each one “living” on \({{\mathbb {R}}}^n\). Solving this system is less computationally expensive than treating a single equation on \({{\mathbb {R}}}^{2n}\).

4.4 Linear Case

Now, we establish a connection between Theorem 4.3 and the well-known version of PMP for \(\mu \)-independent vector fields \(V_{t}(x,\mu ,u)=V_{t}(x,u)\) (see, e.g., [13, 45]). For such fields, the role of the adjoint state is played by a solution \((t, x)\mapsto \psi _t(x)\) of a single non-conservative transport equation

$$\begin{aligned} \partial _t \psi + \nabla _{x} \psi \, V_{t} =0. \end{aligned}$$
(45)

It is reasonable to expect that, under sufficient regularity, the adjoint system (36) boils down to (45). This ansatz is confirmed by the following proposition.

Proposition 4.2

Assume that \((\mathbf {A_{1,2}})\) hold, \( x\mapsto \frac{\delta \ell }{\delta \mu }(\mu ,x) \) is of class \( \mathscr {C}^{2} \), \( \vartheta \in {\mathscr {P}}_{c}(\mathbb {R}^{n}) \) and \( V_{t}(x,\mu ,u) = V_{t}(x,u) \). Let \(u \in {\mathscr {U}}\), and \( \mu _{t} \) and \( \nu _{t} \) be the corresponding solutions of (2) and (36), (37), respectively. Then, for a.e. \(t\in I\),

$$\begin{aligned} \frac{\,d\nu _{t}}{\,d\mu _{t}} = \nabla _{x}\psi _t, \end{aligned}$$
(46)

where \((t, x) \mapsto \psi _t(x)\) is a solution of the transport equation (45) with the terminal condition

$$\begin{aligned} \psi _T = - \frac{\delta \ell }{\delta \mu }(\mu _{T}). \end{aligned}$$
(47)

Proof

It is clear that the representation (46), (47) does agree with the terminal condition (37), since

$$\begin{aligned} -\nabla _x \psi _T \doteq D_{x}\frac{\delta \ell }{\delta \mu }(\mu _T) \doteq D_{\mu }\ell (\mu _T). \end{aligned}$$

Due to the uniqueness of a solution to (36), we only need to formally check that \(\nu _i \doteq \partial _{x_{i}} \psi \, \mu \), \( 1 \le i \le n \), meets the identity (35) with the vector field \(V_t(x, \mu _{t}, u(t)) = v_t(x)\).

A solution of (45), (47) can be written explicitly as \( \psi _{t}(x) = -\frac{\delta \ell }{\delta \mu }(\mu _{T}, P_{t,T}(x)) \), where P is the flow of v. This formula, together with our assumptions, implies that \( \psi \) admits the partial derivatives \( \partial _{x_i}\psi \), \( \partial _{x_i}\partial _{x_j}\psi \) and \( \partial _{x_i}\partial _{t}\psi \) for all \( 1\le i,j \le n \). These derivatives are at least measurable in t, continuous in x and locally bounded. Take the standard mollification kernel \( \eta _{\varepsilon }:\mathbb {R}\rightarrow \mathbb {R} \) and consider the convolution

$$\begin{aligned} \psi ^{\varepsilon }_{t}(x) = \int \eta _{\varepsilon }(t-s)\psi _{s}(x)\,ds. \end{aligned}$$

It is easy to see that \( \partial _{x_i}\partial _t\psi ^{\varepsilon } = \partial _t\partial _{x_i}\psi ^{\varepsilon } \) and \( \partial _{\alpha }\psi ^{\varepsilon } \rightarrow \partial _{\alpha }\psi \) as \( \varepsilon \rightarrow 0 \) in the sense that

$$\begin{aligned} \int _{I} \sup _{x} \left| \partial _{\alpha } \psi ^{\varepsilon } - \partial _{\alpha }\psi \right| \,dt \rightarrow 0, \end{aligned}$$
(48)

where \( \partial _{\alpha } \) denotes any of the derivatives \( \partial _{x_i} \), \( \partial _{x_i}\partial _{x_j} \), \( \partial _{x_i}\partial _t \). Let \( \nu ^{\varepsilon }_i \doteq \partial _{x_i}\psi ^{\varepsilon }\mu \). Then, we can formally write

$$\begin{aligned} \partial _t\nu _i^{\varepsilon } = \partial _{t}(\partial _{x_i}\psi ^{\varepsilon }\mu ) =\partial _{x_i}(\partial _t\psi ^{\varepsilon })\mu - \partial _{x_i} \psi ^\varepsilon \sum _j \partial _{x_j}(v^j\mu ). \end{aligned}$$

More precisely, for any test function \( \varphi \in C^\infty _c(I \times \mathbb {R}^n;\mathbb {R}) \), we have

$$\begin{aligned} -\left\langle \partial _{x_i}\psi ^{\varepsilon }\partial _{x_j}(v^j \mu ),\varphi \right\rangle&= -\left\langle \partial _{x_j}(v^j\mu ),\varphi \, \partial _{x_i}\psi ^{\varepsilon }\right\rangle \\&=\left\langle v^j\mu , \partial _{x_j}\varphi \, \partial _{x_i}\psi ^{\varepsilon } + \varphi \, \partial _{x_i}\partial _{x_j}\psi ^{\varepsilon } \right\rangle \\&= \left\langle v^j\nu _i^{\varepsilon },\partial _{x_j}\varphi \right\rangle + \left\langle \mu , \varphi \, \partial _{x_i}(v^j \, \partial _{x_j}\psi ^{\varepsilon }) \right\rangle - \left\langle \nu _j^{\varepsilon }, \varphi \ \partial _{x_i} v^j\right\rangle . \end{aligned}$$

Therefore,

$$\begin{aligned}&\left\langle \partial _t\nu ^{\varepsilon }_i,\varphi \right\rangle = \left\langle \mu , \partial _{x_i}\left( \partial _t\psi ^{\varepsilon } + \sum _{j}v^j \, \partial _{x_j}\psi ^{\varepsilon }\right) \, \varphi \right\rangle \\ {}&- \left\langle \sum _{j}\partial _{x_j}(v^j\nu ^{\varepsilon }_{i}),\varphi \right\rangle - \left\langle \sum _{j}\partial _{x_i} v^j \nu ^{\varepsilon }_{j},\varphi \right\rangle . \end{aligned}$$

It remains to use (48) for passing to the limit as \( \varepsilon \rightarrow 0 \). The first term in the right-hand side vanishes thanks to (45), so we get

$$\begin{aligned} \partial _t\nu _i + \sum _j\partial _{x_j}(v^j\nu _i) = - \sum _j\partial _{x_i}v^j\nu _j, \end{aligned}$$

in the sense of distributions. Since \( D_{\mu }V = 0 \), we conclude that \( \nu \) does satisfy (36). \(\square \)
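The explicit formula \( \psi _{t}(x) = -\frac{\delta \ell }{\delta \mu }(\mu _{T}, P_{t,T}(x)) \) used in the proof is just the method of characteristics for (45). It can be illustrated numerically; the sketch below takes an assumed autonomous field \( v(x)=\sin x \) and an assumed terminal datum g standing in for \( -\frac{\delta \ell }{\delta \mu }(\mu _{T},\cdot ) \), and checks the transport residual by finite differences.

```python
import numpy as np

# Sketch: psi_t(x) = g(P_{t,T}(x)) solves the non-conservative transport
# equation  d_t psi + v(x) d_x psi = 0  with psi_T = g, where P_{t,T}(x)
# flows x from time t to time T along x' = v(x).  Both v and g are
# illustrative choices, not taken from the paper.

T = 1.0
v = lambda x: np.sin(x)
g = lambda x: np.cos(x)          # stands in for -(delta ell / delta mu)(mu_T, .)

def flow(t, x, n=2000):
    """RK4 integration of x' = v(x) from time t to time T."""
    h = (T - t) / n
    for _ in range(n):
        k1 = v(x); k2 = v(x + 0.5*h*k1); k3 = v(x + 0.5*h*k2); k4 = v(x + h*k3)
        x = x + (h / 6.0) * (k1 + 2*k2 + 2*k3 + k4)
    return x

psi = lambda t, x: g(flow(t, x))

# finite-difference check of  d_t psi + v d_x psi = 0  at a sample point
t0, x0, h = 0.3, 0.7, 1e-5
residual = ((psi(t0 + h, x0) - psi(t0 - h, x0)) / (2 * h)
            + v(x0) * (psi(t0, x0 + h) - psi(t0, x0 - h)) / (2 * h))
print(abs(residual))
```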

5 Descent Method

Now, we are able to construct an algorithm for the numerical solution of Problem (P) with vector field as in (20). Note that similar algorithms were earlier proposed for solving classical [3] and stochastic [2] optimal control problems.

5.1 Algorithm

Let u be a reference control, \( \mu _{t} \) and \( \nu _{t} \) be the corresponding trajectory and co-trajectory. Construct the target control as follows:

$$\begin{aligned} {\bar{u}}(t) \doteq \arg \max _{\upsilon \in U} {\textbf{H}}_{t}(\mu _{t},\nu _{t},\upsilon ) = \arg \max _{\upsilon \in U}\sum _{j=1}^{m}\upsilon _{j}\int V^{j}_{t}(x,\mu _{t}) \cdot \,d\nu _{t}(x). \end{aligned}$$

The increment formula (41) shows that \( {\bar{u}} - u \) is a descent direction. Let us introduce the functional

$$\begin{aligned} \mathscr {E}[u]&\doteq \left\langle {\bar{u}}- u, d[u] \right\rangle _{L^{2}}{\doteq \int _0^T \langle {\bar{u}}-u,d[u]\rangle \,dt}, \text{ where }\nonumber \\ d_j[u](t)&\doteq \int V^j_{t}(x, \mu _t[u])\cdot \,d\nu _{t}[u](x), \quad t \in I, \quad 1\le j \le m. \end{aligned}$$
(49)

It is clear that \(\mathscr {E}[u] \ge 0\) and \(\mathscr {E}[u] = 0\) implies that the pair \((\mu [u], u)\) satisfies the PMP. In other words, \( \mathscr {E} \) measures the “non-extremality” of u.

Now, we can use the descent direction \(\bar{u}-u\) for developing the following version of the classical backtracking algorithm.

Algorithm 1

Descent method with backtracking (k-th iteration)
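The scheme has the structure of a conditional-gradient (Frank–Wolfe type) method: the step size \( \lambda ^{k} \) is shrunk until the descent test \( \mathscr {I}[u^{k}+\lambda ({\bar{u}}^{k}-u^{k})]-\mathscr {I}[u^{k}]\le -c\lambda \mathscr {E}[u^{k}] \) holds. A finite-dimensional sketch of this iteration logic, in which a toy quadratic cost and the box \( U=[-1,1] \) stand in for the PDE solves and the actual control set:

```python
import numpy as np

# Sketch of the descent method with backtracking on a toy discretized problem:
# U = [-1, 1], cost I[u] = 0.5 * sum((u - z)^2) * dt for a fixed target z.
# Here d[u] = (z - u) * dt plays the role of (49), so that
# I[u + lam*w] - I[u] = -lam*<w, d> + O(lam^2); the PDE solves of the actual
# method are replaced by these closed-form stand-ins.

rng = np.random.default_rng(0)
N, dt = 50, 1.0 / 50
z = 0.9 * np.tanh(rng.normal(0.0, 1.0, N))    # minimizer, interior of U^N

I = lambda u: 0.5 * float(np.sum((u - z) ** 2)) * dt
d_of = lambda u: (z - u) * dt                 # stand-in for (49)

c, theta = 0.01, 0.5
u = np.zeros(N)                               # initial control u^0
costs, gaps = [], []
for k in range(2000):
    d = d_of(u)
    u_bar = np.where(d >= 0.0, 1.0, -1.0)     # pointwise argmax over U
    E = float(np.dot(u_bar - u, d))           # non-extremality measure (49)
    costs.append(I(u)); gaps.append(E)
    if E < 1e-10:
        break
    lam = 1.0                                 # backtracking: shrink lam until
    while I(u + lam * (u_bar - u)) - I(u) > -c * lam * E:   # descent test holds
        lam *= theta
    u = u + lam * (u_bar - u)

print(gaps[0], gaps[-1])
```

The monotone decrease of the cost and the decay of \( \mathscr {E}[u^{k}] \), as asserted by Theorem 5.1, are visible in the recorded histories.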

The convergence analysis of the algorithm is provided by the following theorem.

Theorem 5.1

For any initial control \(u^0\in \mathscr {U}\), the sequence \(\{u^k\}\) generated by the algorithm

  1. 1.

    is monotone in the sense that

    $$\begin{aligned} \mathscr {E}[u^k]\ne 0 \ \Rightarrow \ \mathscr {I}[u^{k+1}] < \mathscr {I}[u^{k}], \quad \text{ and } \quad \mathscr {E}[u^k]= 0 \ \Rightarrow \ \mathscr {I}[u^{k+1}] = \mathscr {I}[u^{k}] \end{aligned}$$
  2. 2.

    converges in the sense that \( \mathscr {E}[u^k]\rightarrow 0\) as \(k\rightarrow \infty . \)

Proof

Let \( e^{k}:= \mathscr {E}[u^{k}] \) and assume that \( e^{k}\not \rightarrow 0 \). In this case, there exists \( \varepsilon >0 \) such that \( e^{k}\ge \varepsilon \) for all indices k from some countable set \( K\subset {\mathbb {N}} \). By the choice of \( \lambda ^k \), we obtain, for all \( k\in K \),

$$\begin{aligned} \mathscr {I}[u^{k}] - \mathscr {I}[u^{k+1}] \ge c\lambda ^{k} \left\langle {\bar{u}}^{k} - u^{k}, d^{k} \right\rangle _{L^{2}} = c\lambda ^{k}e^{k}. \end{aligned}$$

This shows that \( (\lambda ^{k})_{k\in K} \rightarrow 0 \), because otherwise \( \mathscr {I}[u^{k}] \rightarrow -\infty \) up to a subsequence. Hence, for all sufficiently large \( k\in K \), we have \( \lambda ^{k}\le \theta \), so that \( \lambda := \lambda ^{k}/\theta \le 1 \) is an admissible trial step. On the other hand, this trial step must have been rejected at Step 4 of the algorithm, i.e., \( \mathscr {I}\left[ u^{k}+\lambda ({\bar{u}}^{k}-u^{k})\right] -\mathscr {I}[u^{k}]\ge c\lambda \langle u^{k}-{\bar{u}}^{k}, d^{k} \rangle _{L^2}, \) that is,

$$\begin{aligned} -\frac{c\lambda ^{k}}{\theta }\left\langle {\bar{u}}^{k} - u^{k}, d^{k} \right\rangle _{L^{2}}&\le \mathscr {I}\left[ u^{k}+(\lambda ^{k}/\theta )({\bar{u}}^{k} - u^{k})\right] - \mathscr {I}[u^{k}]\\&= - \frac{\lambda ^{k}}{\theta }e^{k} + \mathscr {O}\left( \left( \lambda ^{k}/\theta \right) ^{2}\right) , \end{aligned}$$

where we use the increment formula (41) to get the last equality. Hence \( \frac{(1-c)\lambda ^{k}}{\theta }e^{k}\le C\left( \frac{\lambda ^{k}}{\theta }\right) ^{2},\) for some \( C>0 \), or equivalently,

$$\begin{aligned} (1-c)\varepsilon \le (1-c)e^{k}\le C\frac{\lambda ^{k}}{\theta }. \end{aligned}$$

Since the right-hand side tends to zero, we come to a contradiction. \(\square \)

5.2 Implementation

In the algorithm described in Sect. 5.1, the primal and adjoint equations are solved numerically. If the original problem is periodic in space and the driving vector field has a convolutional structure (4), then, for the numerical integration, one gives preference to so-called spectral methods [14].

Assume that the initial measure \( \vartheta \in \mathscr {P}_{c}(\mathbb {R}) \) is absolutely continuous. This implies that the corresponding trajectories \( \mu _t \) and \( \nu _t \) are absolutely continuous as well, and all ingredients of the algorithm can be recast in terms of their densities \(\rho _t\) and \(\zeta _t\), respectively. Moreover, since \( \vartheta \) is compactly supported, there exists a segment [a, b] such that \( \textrm{spt}\mu _t \subset [a,b]\) and \(\textrm{spt}\nu _t\subset [a,b]\) for all \(t\in I. \) This implies that \( \mu _{t} \) and \( \nu _t \) can be considered as measures on the circle \( \mathbb {S}^{1} \) (i.e., the measures can be viewed as \(2\pi \)-periodic in x).

The primal and the adjoint equations can be written in the form:

$$\begin{aligned}{} & {} \partial _{t}\rho + \partial _{x}\left( V(x,\rho )\rho \right) = S(x,\rho ),\nonumber \\{} & {} \quad V(x,\rho ) = f(x)+(K*\rho )(x),\quad S(x,\rho ) = g(x)\rho (x)+h(x)(M*\rho )(x),\nonumber \\ \end{aligned}$$
(51)

where \( f,g,h,K,M:I \times \mathbb {S}^{1}\rightarrow \mathbb {R} \) are given functions.

Suppose that all the densities are of the class \( L^{2}(\mathbb {S}^{1}) \). Upon substitution of the truncated Fourier series

$$\begin{aligned} \rho (x,t)=\sum _{n=-\mathscr {N}/2}^{\mathscr {N}/2} \widehat{\rho }_n(t) \textrm{e}^{inx}, \end{aligned}$$
(52)

in (51), the partial differential equation transforms into the system of ODEs

$$\begin{aligned} \frac{d\widehat{\rho }_{n}}{dt} = -in (\widehat{V\rho })_{n}+ \widehat{S}_{n},\quad n\in \mathbb {Z},\quad \vert n \vert \le \mathscr {N}/2, \end{aligned}$$
(53)

where the hat over nonlinear terms denotes their Fourier coefficients, and

$$\begin{aligned} \widehat{\rho }_{n}(t) = \frac{1}{2\pi }\int _{0}^{2\pi }\rho (x,t)e^{-inx}\,dx, \end{aligned}$$
(54)

stands for the Fourier coefficients of \(\rho (x,t)\).

The system (53) can be integrated by any appropriate numerical method (e.g., a Runge–Kutta method). Transformations between the physical and spectral (Fourier) spaces are computed by the fast Fourier transform (FFT). Multiplications of fields are usually computed in the physical space, while derivatives and convolutions are evaluated in the Fourier space.
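A minimal sketch of this workflow, assuming the simplest instance of (51): constant velocity \(V\equiv c\) and no source, so that the exact solution is the shifted initial profile \( \rho _0(x-ct) \).

```python
import numpy as np

# Sketch of the spectral workflow on the simplest instance of (51):
# V(x) = c constant, no source, so the exact solution is rho0(x - c*t).
# In Fourier space, the PDE reduces to the decoupled ODEs
#     d rho_hat_n / dt = -i * n * c * rho_hat_n,
# which are advanced by RK4, as in (53).

Nx, c, T = 64, 1.0, 0.5
x = 2 * np.pi * np.arange(Nx) / Nx
rho0 = 2.0 + np.sin(x) + 0.8 * np.cos(2 * x)      # smooth periodic profile
r = np.fft.fft(rho0)
n = np.fft.fftfreq(Nx, d=1.0 / Nx)                # integer wavenumbers

rhs = lambda r: -1j * n * c * r                   # right-hand side of (53)

dt_, steps = 1e-3, 500                            # RK4 up to time T = 0.5
for _ in range(steps):
    k1 = rhs(r); k2 = rhs(r + 0.5*dt_*k1); k3 = rhs(r + 0.5*dt_*k2); k4 = rhs(r + dt_*k3)
    r = r + (dt_ / 6) * (k1 + 2*k2 + 2*k3 + k4)

rho_T = np.real(np.fft.ifft(r))
exact = 2.0 + np.sin(x - c * T) + 0.8 * np.cos(2 * (x - c * T))
err = float(np.max(np.abs(rho_T - exact)))
print(err)
```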

5.3 Numerical Experiment

As an example, we consider the paradigmatic model of Kuramoto [36], which describes an assembly of pairwise interacting homotypic oscillators. Specifically, we consider an optimization problem in the spirit of [48], in which the goal is to synchronize a continuous oscillatory network by a given time moment T.

The prototypic ODE representing the dynamics of N oscillators takes the form

$$\begin{aligned} \dot{x}_i = \omega _i + u_1 +u_2 \, \frac{1}{N}\sum _{j=1}^N\sin (x_j-x_i-\alpha ), \quad 1 \le i \le N. \end{aligned}$$
(55)

Here, \(x_i(t) \in {\mathbb {S}}^1\) and \(\omega _i \in \mathbb {R}\) are the phase and natural frequency of the ith oscillator, respectively, and \(\alpha \) is the phase shift. Control inputs are \(t \mapsto u(t) \doteq (u_1(t), u_2(t))\), where \(u_1\) affects the angular velocity, and \(u_2\) modulates the connectivity of the network.

As in [48], we assume that all oscillators have a common natural frequency \(\omega \), which, in this case, can be specified as \(\omega =0\). As the number of oscillators \(N \rightarrow \infty \), the limiting mean-field version of (55) is described by the curve \(t \mapsto \mu _t \in \mathscr {P}({\mathbb {S}}^1)\) satisfying the nonlocal continuity equation driven by the vector field

$$\begin{aligned} V(x, \mu , u) \doteq u_1+u_2\int _{0}^{2\pi } \sin (y -x -\alpha )\,d\mu (y)\doteq u_1 V^1 + u_2 V^2(x, \mu ) . \end{aligned}$$
(56)

Consider the problem of steering the ensemble to a given phase \(x_{0} + 2\pi n\), \(n \in {\mathbb {Z}}\):

$$\begin{aligned} \min \mathscr {I}[u]&= \ell (\mu _T) \doteq \,\displaystyle \int _{0}^{2\pi } J(x,x_{0}) \,d\mu _T(x),\\ J(x,y)&\doteq \frac{1}{2}(\sin x-\sin y)^2+\,\frac{1}{2}(\cos x -\cos y)^2 = 1 - \cos (x-y). \end{aligned}$$

To specify the adjoint equation, we compute (see Example 1 in Sect. 2.3):

$$\begin{aligned}{} & {} D_{x}V(x,\mu ,u) = -u_2\int _{0}^{2\pi } \cos (x -y +\alpha )\,d\mu (y),\\ {}{} & {} \quad D_{\mu }V(y,\mu ,u,x) = u_2 \, \cos (y -x +\alpha ), \end{aligned}$$

and \( D_{\mu }\ell (\mu ;x) = \sin (x-x_0) \). Then, (36) becomes

$$\begin{aligned} \partial _t \nu + \partial _x\left( v \, \nu \right) =u_2\left[ (K_1*\nu )\mu + (K_2*\mu )\nu \right] , \end{aligned}$$
(57)

where \( v_{t}(x) = V_{t}(x,\mu _{t},u(t)) \), \( K_1(x) = \cos (-x+\alpha ) \), \( K_2(x) = \cos (x+\alpha ) \).

Let us associate \( \mu \) and \( \nu \) with their densities represented as in (52) in terms of the Fourier coefficients \( \widehat{a}_n \) and \( \widehat{b}_n \), respectively. To represent the PDE (2) in the Fourier space (i.e. in the form (53)), notice that the only non-vanishing Fourier coefficients of (56) are \(\widehat{V}_0=u_1,\ \widehat{V}_1=i \pi u_2\widehat{\mu }_1\exp (i\alpha )\), and the complex conjugate of the latter one is \(\widehat{V}_{-1}\). This form of \(V(x,\mu ,u)\) enables us to compute the r.h.s. of (53) exclusively in the Fourier space, with no recourse to the physical space, in contrast with the pseudospectral treatment required for a generic \(V(x,\mu ,u)\).
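The stated structure of the Fourier coefficients of (56) is easy to confirm numerically. The sketch below uses an assumed sample density and checks \( \widehat{V}_0=u_1 \), \( \widehat{V}_1=i\pi u_2\widehat{\mu }_1 e^{i\alpha } \), and the vanishing of all other modes.

```python
import numpy as np

# Sketch: Fourier coefficients of the Kuramoto field (56) for a sample density.
# Expected: V_hat[0] = u1, V_hat[1] = i*pi*u2*mu_hat_1*exp(i*alpha), and all
# other modes zero.  The density below is an arbitrary smooth choice.

M = 256
x = 2 * np.pi * np.arange(M) / M
u1, u2, alpha = 0.3, 1.1, 0.4
rho = (2 + np.sin(x) + 0.8 * np.cos(2 * x)) / (4 * np.pi)   # sample density

# V(x_i) = u1 + u2 * int_0^{2pi} sin(y - x_i - alpha) rho(y) dy
# (the rectangle rule is exact here: the integrand is band-limited)
S = np.sin(x[None, :] - x[:, None] - alpha)
V = u1 + u2 * (2 * np.pi / M) * (S @ rho)

V_hat = np.fft.fft(V) / M                         # coefficients of e^{i n x}
mu1 = np.sum(rho * np.exp(-1j * x)) / M           # mu_hat_1
expected = 1j * np.pi * u2 * mu1 * np.exp(1j * alpha)
print(V_hat[0], V_hat[1], expected)
```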

In the Fourier space, the nonlocal continuity equation reads

$$\begin{aligned} \frac{d\widehat{a}_n}{dt} = -inu_{1}\widehat{a}_n + \pi n u_{2} \left( \widehat{a}_{1}\widehat{a}_{n-1}\textrm{e}^{i\alpha } - \widehat{a}_{-1}\widehat{a}_{n+1}\textrm{e}^{-i\alpha } \right) ,\quad n\in \mathbb {Z}, \end{aligned}$$
(58)

while the adjoint equation (36) and the terminal condition (37) become:

$$\begin{aligned} \frac{d\widehat{b}_{n}}{dt}&= -inu_{1}\widehat{b}_n +\pi n u_2 \left( \widehat{a}_1\widehat{b}_{n-1}\textrm{e}^{i\alpha }- \widehat{a}_{-1}\widehat{b}_{n+1}\textrm{e}^{-i\alpha } \right) \nonumber \\&\quad + \pi u_2\left( (\widehat{a}_1\widehat{b}_{n-1}+\widehat{b}_{-1}\widehat{a}_{n+1})\textrm{e}^{i\alpha }+ (\widehat{a}_{-1}\widehat{b}_{n+1}+\widehat{b}_{1}\widehat{a}_{n-1})\textrm{e}^{-i\alpha }\right) ,\nonumber \\ \widehat{b}_n(T)&= \frac{i}{2}\left( \widehat{a}_{n-1}(T)\textrm{e}^{-ix_{0}} - \widehat{a}_{n+1}(T)\textrm{e}^{ix_0} \right) ,\quad n\in \mathbb {Z}. \end{aligned}$$
(59)
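The truncated system (58) can be integrated directly in the Fourier space. A minimal sketch, with assumed constant controls and the initial density of Fig. 1, monitoring two structural invariants: \( \widehat{a}_0 \) is conserved (the right-hand side of (58) vanishes for \(n=0\)), and the Hermitian symmetry \( \widehat{a}_{-n}=\overline{\widehat{a}_n} \), which keeps the density real, is preserved.

```python
import numpy as np

# Sketch: the truncated Kuramoto system (58), with assumed constant controls
# and the initial density of Fig. 1.  Monitored invariants: a_0 is constant
# (the right-hand side of (58) vanishes for n = 0) and the Hermitian symmetry
# a_{-n} = conj(a_n), which keeps the density real.

Nm = 32                                   # modes n = -Nm..Nm, index shift n+Nm
u1, u2, alpha = 0.5, 1.0, 0.0             # assumed constant controls

a = np.zeros(2 * Nm + 1, dtype=complex)
def setmode(n, val):
    a[n + Nm] = val
    a[-n + Nm] = np.conj(val)
setmode(0, 1 / (2 * np.pi))               # unit total mass
setmode(1, -1j / (8 * np.pi))             # sin(x) component of rho_0
setmode(2, (0.4 + 0.1j) / (4 * np.pi))    # 0.8*cos(2x) - 0.2*sin(2x) component

def rhs(a):
    out = np.zeros_like(a)
    for n in range(-Nm, Nm + 1):          # truncation: a_{+-(Nm+1)} = 0
        am1 = a[n - 1 + Nm] if n - 1 >= -Nm else 0.0
        ap1 = a[n + 1 + Nm] if n + 1 <= Nm else 0.0
        out[n + Nm] = (-1j * n * u1 * a[n + Nm]
                       + np.pi * n * u2 * (a[1 + Nm] * am1 * np.exp(1j * alpha)
                                           - a[-1 + Nm] * ap1 * np.exp(-1j * alpha)))
    return out

dt_, steps = 1e-3, 1000                   # RK4 time stepping up to t = 1
for _ in range(steps):
    k1 = rhs(a); k2 = rhs(a + 0.5*dt_*k1); k3 = rhs(a + 0.5*dt_*k2); k4 = rhs(a + dt_*k3)
    a = a + (dt_ / 6) * (k1 + 2*k2 + 2*k3 + k4)

mass_err = abs(a[Nm] - 1 / (2 * np.pi))
sym_err = float(np.max(np.abs(a - np.conj(a[::-1]))))
print(mass_err, sym_err)
```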

In order to compute the transformation \(\rho (x_j,t)\mapsto \widehat{\rho }_k(t)\) and its inverse, we employ the forward and backward FFTs implemented in the library FFTW [32].

The problem is considered under the control constraint \( u_1^2+u_2^2\le 2 \); for the kth iteration, the corresponding target control \(({\bar{u}}_1^k, {\bar{u}}_2^k)\) provided by (50) takes the form: \(\frac{\sqrt{2}}{\sqrt{(d_1^k)^2 + (d_2^k)^2}}(d_1^k, d_2^k)\), and the control-update rule reads: \(u^{k+1} = u^k+\lambda ^k({\bar{u}}^k - u^k)\).
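The closed form of the target control is simply the maximizer of a linear function over the disk \( u_1^2+u_2^2\le 2 \): the maximum of \( u\cdot d \) is attained at \( \sqrt{2}\,d/\vert d\vert \). A quick sketch checking this against a brute-force search over the boundary, for an assumed sample value of d:

```python
import numpy as np

# Sketch: for the constraint u1^2 + u2^2 <= 2, the maximizer of the linear map
# (u1, u2) -> u1*d1 + u2*d2 is sqrt(2)*d/|d|, as used for the target control.
# The check compares this closed form with a brute-force search over the disk
# boundary, for a sample value of d (not taken from the paper).

d = np.array([0.8, -1.7])                     # sample value
u_star = np.sqrt(2) * d / np.linalg.norm(d)   # closed-form target control

phi = np.linspace(0, 2 * np.pi, 100001)
cand = np.sqrt(2) * np.stack([np.cos(phi), np.sin(phi)])   # boundary points
best = cand[:, np.argmax(d @ cand)]

gap = float(d @ u_star - d @ best)
print(gap)
```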

Some computational results are presented by Fig. 1.

Fig. 1

Numerical solution of the synchronization problem for \(T=6\), \(\alpha =0\), \(x_0 = \pi \), the initial control \(u_1^0=\sqrt{2}\sin (2\pi t),\ u_2^0 = \sqrt{2}\cos (2\pi t)\), and the initial distribution \(\rho _{0}(x) = ( 2+\sin x + 0.8\cos 2x -0.2 \sin 2 x )/(4\pi )\). The trajectory \(\rho ^k_t\) produced by the algorithm is depicted in (a) at \(t=0\) and in (b) at \(t=T\). The adjoint trajectory \(\zeta ^k_t\) at the same time moments is presented in (c). The corresponding controls \(u^k_1(t)\) and \(u^k_2(t)\) are shown in (d). For these controls, the cost functional is \(\mathscr {I}[u^k]\approx 8\cdot 10^{-3}\) (cf. \(\mathscr {I}[u^0]=1\)). In computations, we used \(\mathscr {N}=2048\) spatial Fourier harmonics, and \(c = 0.01\), \(\theta = 0.5\) for determination of \(\lambda ^k\). The systems of ODEs (58) and (59) were solved by the 4th-order Runge–Kutta method with constant time step \(\tau =0.001\); stopping criterion: \(\lambda ^{k}<10^{-2}\)

Remark 5.1

Let us stress several differences between the problem solved here and the one addressed in [48]. First of all, the authors of [48] consider the so-called mean-field type controls, i.e., they assume that \(u\) depends not only on \(t\) but also on \(x\). It is clear that this choice greatly improves the controllability of the system. Moreover, the system in [48] is subject to common noise, which also contributes to the controllability. Indeed, let the initial density be given by \( \rho _{0} = 1 + \sin kx \), \( k \ge 2 \). Then the convolution in (56) vanishes, which means that our control options reduce to shifting the wave \( \rho _0 \) back and forth. On the other hand, in the presence of common noise, the Fourier coefficient corresponding to \( \sin x \) immediately becomes nonzero and, as a result, the system is self-synchronizing for any positive \( u_2 \). A similar effect can be observed if we solve (2), (56) with a discretization scheme that involves numerical diffusion (such as the classical Lax–Friedrichs method).
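The vanishing of the coupling term for higher modes is easy to verify numerically. The sketch below assumes the convolution in (56) is of Kuramoto type, \((K*\rho )(x)=\int _0^{2\pi }\sin (x-y)\,\rho (y)\,dy\) — an assumption made purely for illustration, since (56) is not reproduced in this section — and checks that such a kernel annihilates every Fourier mode of \(\rho \) except the first.

```python
import numpy as np

# Numerical check (under the Kuramoto-type assumption stated above) that the
# kernel sin(x - y) only "sees" the first Fourier mode of rho: the convolution
# vanishes identically on rho0 = 1 + sin(k x) whenever k >= 2.
N = 1024
x = 2 * np.pi * np.arange(N) / N

def kuramoto_convolution(rho):
    # rectangle-rule quadrature, spectrally accurate on a periodic grid
    return np.array([np.sum(np.sin(xi - x) * rho) * (2 * np.pi / N) for xi in x])

conv_k2 = kuramoto_convolution(1 + np.sin(2 * x))   # k = 2: vanishes
conv_k1 = kuramoto_convolution(1 + np.sin(x))       # k = 1: equals -pi*cos(x)

assert np.max(np.abs(conv_k2)) < 1e-10
assert np.allclose(conv_k1, -np.pi * np.cos(x))
```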

6 General Case

In this section, we discuss a natural extension of the obtained results to the case of a nonlinear dependence on the control and to the general cost functional (1).

6.1 Nonlinear Dependence on Control

To handle the case of a nonlinear dependence \(u \mapsto V_t(x, \mu , u)\), we resort to the standard technique based on the extension of the original class \({\mathscr {U}}\) of control signals to the broader space \( \widetilde{{\mathscr {U}}} \doteq \left\{ \eta \in {\mathscr {P}}(I\times U): \, \left[ (t, u) \mapsto t\right] _{\sharp }\eta = \frac{1}{T}{\mathscr {L}}^1\right\} \) of Young measures [22]. It is well known that such an extension linearizes the vector field w.r.t. the driving signal and, in a certain sense, reduces the general model to the above control-affine case. Recall that i) \({\mathscr {U}}\) is dense in \(\widetilde{{\mathscr {U}}}\) due to the embedding \(u \mapsto \eta \), \(\eta _t=\delta _{u(t)}\), where \(t \mapsto \eta _t \in {\mathscr {P}}(U)\) is the weakly measurable family of probability measures obtained by disintegration of \(\eta \) w.r.t. \(\frac{1}{T}{\mathscr {L}}^1\); and ii) \(\widetilde{{\mathscr {U}}}\) is compact in the topology of weak convergence of probability measures (and therefore in any metric \(W_p\), \(p\ge 1\)) as soon as \(U\) is compact, thanks to the classical Prohorov theorem.

This passage, which is routine in mathematical control theory, leads to the following relaxation of the original dynamics (2):

$$\begin{aligned} \partial _t\mu _t + \textrm{div}\left( {\widetilde{V}}_{t}\left( \cdot ,\mu _{t},\eta _t\right) \, \mu _t\right)&= 0,\quad \mu _0=\vartheta ,\nonumber \\ {\widetilde{V}}_{t}\left( x,\mu _{t},\eta _t\right)&\doteq \int _{U} V_{t}\left( x,\mu _{t},u\right) \,d\eta _t(u)\doteq \langle \eta _t, V_{t}\left( x,\mu _{t},\cdot \right) \rangle ; \end{aligned}$$
(60)

the original cost functional is reformulated accordingly as \( \widetilde{{\mathscr {I}}}[\eta ] = \ell ({\tilde{\mu }}_T), \) where \({\tilde{\mu }}_t\) is the solution of (60).
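The linearity of (60) in \(\eta \) can be made concrete on a finitely supported surrogate. In the sketch below, `V` is an illustrative control-nonlinear field (not the field of the paper), and a slice \(\eta _t\) is a discrete probability measure on \(U\):

```python
import numpy as np

# Sketch of the relaxed field in (60) for a finitely supported Young-measure
# slice eta_t = sum_i w_i * delta_{u_i}.  The field V below is an illustrative
# control-nonlinear choice, not the one appearing in the paper.
def V(x, u):
    return np.sin(x) * u**2               # nonlinear in the control u

def V_relaxed(x, support, weights):
    # <eta_t, V(x, .)> = sum_i w_i V(x, u_i): linear in eta_t
    # even though V is nonlinear in u
    return sum(w * V(x, u) for u, w in zip(support, weights))

# a Dirac slice eta_t = delta_{u(t)} recovers the original dynamics
assert np.isclose(V_relaxed(1.0, [0.7], [1.0]), V(1.0, 0.7))
```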

Observing that the dependence \(\omega \mapsto {\widetilde{V}}_{t}\left( x,\mu ,\omega \right) \) is linear, we invite the reader to consider the weak variation \( \eta ^\lambda = \eta + \lambda ({\bar{\eta }} - \eta ) \) and the respective cost increment \(\widetilde{{\mathscr {I}}}[\eta ^\lambda ] - \widetilde{{\mathscr {I}}}[\eta ]\) in place of (21) and (22), and to reproduce the arguments of Sects. 3 and 4. By doing this, one ensures that the resulting increment formula and the necessary condition for the optimality of a Young measure \(\eta \) keep the form of Theorems 4.2 and 4.3, where V and \({\textbf{H}}_{t}\) are replaced by \({\widetilde{V}}\) and \(\widetilde{{\textbf{H}}}_{t}\), respectively, \( \displaystyle \widetilde{{\textbf{H}}}_{t}(\mu , \nu , \omega ) \doteq \int _{U}{\textbf{H}}_{t}(\mu , \nu , u) \,d\omega (u), \) and the maximum condition (43) becomes

$$\begin{aligned} \widetilde{{\textbf{H}}}_{t}({\tilde{\mu }}_t, {\tilde{\nu }}_t, \eta _t) = \max _{\omega \in {\mathscr {P}}(U)}\widetilde{{\textbf{H}}}_{t}({\tilde{\mu }}_t, {\tilde{\nu }}_t, \omega ) \quad&\Leftrightarrow \quad \textrm{spt}(\eta _t) \subseteq \arg \max _{\upsilon \in U} {\textbf{H}}_{t}({\tilde{\mu }}_t,{\tilde{\nu }}_{t}, \upsilon ), \end{aligned}$$

where \({\tilde{\nu }}_t\) is the backward adjoint solution associated with \(\eta \). Now, if the addressed control-nonlinear problem (P) does have a usual minimizer \(u\in \mathscr {U}\), then the PMP for u is recovered by taking \(\eta \) such that \(\eta _t =\delta _{u(t)}\).
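The equivalence in the displayed maximum condition reflects an elementary fact: the average of \({\textbf{H}}_{t}\) against a probability measure never exceeds its maximum, with equality exactly when the measure charges only the argmax set. A finite-dimensional sketch, with hypothetical Hamiltonian values on a control grid:

```python
import numpy as np

# Elementary fact behind the maximum condition: for probability weights w on
# a finite control grid, sum_i w_i H(u_i) <= max_i H(u_i), with equality iff
# w is supported on the argmax set.  H holds hypothetical values H_t(u_i).
H = np.array([0.1, 0.9, 0.4, 0.9])
w_uniform = np.full(4, 0.25)
assert np.dot(w_uniform, H) < H.max()     # not supported on argmax: strict gap

w_opt = np.array([0.0, 0.5, 0.0, 0.5])    # supported on the argmax set {1, 3}
assert np.isclose(np.dot(w_opt, H), H.max())
```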

6.2 Running Cost

If the map \(u \mapsto L_t(x, \mu , u)\) is affine, one easily adapts the PMP by reformulating the dynamics (24) of the Hamiltonian PDE and the Hamiltonian (42) as

$$\begin{aligned} \textbf{H}\doteq \begin{pmatrix} \displaystyle V_{t}\\ \displaystyle - p\, D_x \, V_t - \iint q \, D_\mu V_t\,d\gamma + D_x \, L_{t} + \iint D_\mu L_t\,d\gamma \end{pmatrix}, \end{aligned}$$
(61)

and \(\displaystyle {\textbf{H}}_{t} \doteq \int V_{t}\cdot \,d\nu -L_{t}.\) Further details can be found, e.g., in [10]. The general u-nonlinear case is treated by the relaxation technique exhibited in Sect. 6.1.