1 Introduction

We study the problem of determining an unknown nonlinearity f from data in a parameter-dependent dynamical system

$$\begin{aligned} \begin{aligned}&\dot{u}=F(\lambda ,u)+f(\alpha ,u) \qquad{} & {} \text {in } (0,T) \times \Omega \\&u(0)=u_0{} & {} \text {on }\Omega . \end{aligned} \end{aligned}$$
(1)

Here, the state u is a function on a finite time interval (0, T) and a bounded Lipschitz domain \(\Omega \), and \(\dot{u}\) denotes the first-order time derivative. In (1), both F and f are nonlinear Nemytskii operators in \(\lambda , \alpha , u\); these Nemytskii operators are induced by nonlinear, time-dependent functions \( [F(\lambda ,u)](t): = F(t,\lambda ,u(t))\) and \( [f (\alpha ,u)](t,x): = f (\alpha ,u(t,x)), \) where we consistently abuse notation in this manner throughout the paper; see also Lemmas 2 and 4. We assume that F was specified beforehand from an underlying physical model, that the terms \(\lambda \), \(u_0\) are physical parameters (with \(\lambda =\lambda (x)\) depending only on space), and that \(\alpha \) is a finite-dimensional parameter arising in the nonlinearity. Furthermore, the model (1) is equipped with Dirichlet or Neumann boundary conditions.

Some examples of partial differential equations (PDEs) of the form (1) are diffusion models \(\dot{u}=\Delta u + f(\alpha ,u)\) with a nonlinear reaction term \(f(\alpha ,u)\) as follows [30] (see also the code sketch after the list):

  • \(f(\alpha ,u)= -\alpha u(1 - u)\): Fisher equation in heat and mass transfer, combustion theory.

  • \(f(\alpha ,u)= -\alpha u(1 - u)(\alpha - u), 0<\alpha <1\): Fitzhugh–Nagumo equation in population genetics.

  • \(f(\alpha ,u)=-u/(1+\alpha _1 u+\alpha _2 u^2)\), \(\alpha = (\alpha _1,\alpha _2)\), \( \alpha _1>0, \alpha _1^2<4\alpha _2 \): Enzyme kinetics.

  • \(f(\alpha ,u) = f(u) =-u|u|^p\), \( p\ge 1 \): Irreversible isothermal reaction, temperature in radiating bodies.
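For concreteness, these reaction terms translate directly into code; the following Python sketch (purely illustrative and not part of our analysis, with vectorized NumPy evaluation assumed) collects them as functions of u and the parameter \(\alpha \).

```python
import numpy as np

def fisher(u, alpha):
    """Fisher-type reaction term f(alpha, u) = -alpha * u * (1 - u)."""
    return -alpha * u * (1.0 - u)

def fitzhugh_nagumo(u, alpha):
    """Bistable reaction term f(alpha, u) = -alpha * u * (1 - u) * (alpha - u), 0 < alpha < 1."""
    return -alpha * u * (1.0 - u) * (alpha - u)

def enzyme_kinetics(u, alpha1, alpha2):
    """f(alpha, u) = -u / (1 + alpha1*u + alpha2*u^2) with alpha1 > 0, alpha1^2 < 4*alpha2."""
    return -u / (1.0 + alpha1 * u + alpha2 * u ** 2)

def isothermal_reaction(u, p=1):
    """f(u) = -u * |u|^p with p >= 1 (no parameter alpha)."""
    return -u * np.abs(u) ** p
```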

The underlying assumption of this work is that in some cases, the nonlinearity f is unknown due to simplifications or inaccuracies in the modeling process or due to undiscovered physical laws. In such situations, our goal is to learn f from data. In order to realize this in practice, we need to use a parametric representation. For this, we choose neural networks, which have become widely used in computer science and applied mathematics due to their excellent representation properties, see for instance [19] for the classical universal approximation theorem, [27] for recent results indicating superior approximation properties of neural networks with particular activations (potentially at the cost of stability) and [3, 9] for general, recent overviews on the topic. Learning the nonlinearity f thus reduces to identifying parameters \(\theta \) of a neural network \(\mathcal {N}_\theta \) such that \(\mathcal {N}_\theta \approx f\), rendering the problem of learning a nonlinearity to be a parameter identification problem of a particular form.

For the majority of this paper, the nonlinearity f will therefore not appear directly; instead, f will consistently be replaced by its neural network representation \(\mathcal {N}_\theta \), and our focus will be on showing the properties of \(\mathcal {N}_\theta \), rather than those of f.

A main point of our approach, motivated by feasibility in applications, is that the nonlinearity must be learned solely from indirect, noisy measurements \(y^\delta \approx Mu\) of the state, with M a linear measurement operator. More precisely, we assume to have K different measurements

$$\begin{aligned} {y^k = Mu^k \qquad k=1{,\ldots ,\,} K} \end{aligned}$$
(2)

of different states \(u^k\) available, where the different states correspond to solutions of the system (1) with different, unknown parameters \((\lambda ^k,\alpha ^k,u_0^k)\), but the same, unknown nonlinearity f, which is assumed to be part of the ground truth model. The simplest form of M is a full observation of the states over time and space, i.e. \(M=\text {Id}\), as e.g. in (theoretical) population genetics. In other contexts, M could consist of discrete observations of u at certain time instances, i.e. \(Mu={(u(t_i,\cdot ))_{i=1}^{n_T}}, t_i\in (0,T)\), as in material science [31] or systems biology [4] (see also Corollary 32), or of a Fourier transform as in MRI acquisition [2]. In most cases, M is linear, as is assumed here.
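As an illustration of the discrete-observation case, the following Python sketch (our own; it assumes the state is stored on a uniform time grid, which is not required by the analysis) applies a snapshot operator \(Mu = (u(t_i,\cdot ))_{i=1}^{n_T}\) to a discretized state.

```python
import numpy as np

def snapshot_operator(u, t_grid, t_obs):
    """Discrete measurement M u = (u(t_i, .))_{i=1}^{n_T}.

    u      : array of shape (n_t, n_x), the state on a time grid
    t_grid : array of shape (n_t,), the time grid underlying u
    t_obs  : observation times t_i in (0, T)
    Returns an array of shape (len(t_obs), n_x) with the observed snapshots.
    """
    indices = [int(np.argmin(np.abs(t_grid - ti))) for ti in t_obs]
    return u[indices, :]
```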

Our approach to address this problem is to use an all-at-once formulation that avoids constructing the parameter-to-state map (see for instance [21]). That is, we aim to identify all unknowns by solving a minimization problem of the form

$$\begin{aligned} \min _{\begin{array}{c} (\lambda ^k,\alpha ^k,u_0^k,u^k)_k {\subset X \times \mathbb {R}^m \times U_0\times \mathcal {V}} \\ \theta {\in \Theta } \end{array}}{} & {} \sum _{k=1}^K \Vert \mathcal {G}(\lambda ^k,\alpha ^k,u_0^k,u^k,\theta ) - (0,0,y^k) \Vert ^2_{{{\mathcal {W}\times H}\times \mathcal {Y}}} \nonumber \\{} & {} \quad + \mathcal {R}_1(\lambda ^k,\alpha ^k,u_0^k,u^k) + \mathcal {R}_2(\theta ), \end{aligned}$$
(3)

where we refer to Sect. 2 for details on the function spaces involved. Here, \(\mathcal {G}\) is a forward operator that incorporates the PDE model, the initial conditions and the measurement operator via

$$\begin{aligned} \mathcal {G}(\lambda ,\alpha ,u_0,u,\theta )=(\dot{u}-F(\lambda ,u)- \mathcal {N} _\theta (\alpha ,u),u(0)-u_0,Mu), \end{aligned}$$

and \(\mathcal {R}_1\), \(\mathcal {R}_2\) are suitable regularization functionals.

Once a parameter \(\hat{\theta }\) has been learned such that \(\mathcal {N}_{\hat{\theta }}\) accurately approximates f in (1), one can use the learning-informed model in other parameter identification problems by solving

$$\begin{aligned} \min _{ (\lambda ,\alpha ,u_0, u){\in X \times \mathbb {R}^m \times U_0\times \mathcal {V}} } \Vert \mathcal {G}(\lambda ,\alpha ,u_0,u,\hat{\theta }) - (0,0,y) \Vert ^2_{{{\mathcal {W}\times H}\times \mathcal {Y}}} + \mathcal {R}_1(\lambda ,\alpha ,u_0,u) \end{aligned}$$
(4)

for a new measured datum \(y \approx Mu\).

Existing research towards learning PDEs and all-at-once identification Exploring governing PDEs from data is an active topic in many areas of science and engineering. With advances in computational power and mathematical tools, there have been numerous recent studies on data-driven discovery of hidden physical laws. One prominent technique is to construct a rich dictionary of possible functions, such as polynomials, derivatives etc., and to then use sparse regression to determine the candidates that most accurately represent the data [5, 33, 35]. This sparse identification approach yields a completely explicit form of the differential equation, but requires a sufficiently rich library of candidate functions specified beforehand. In this work, we take the viewpoint that PDEs are constructed from principal physical laws. Since it preserves the underlying equation and learns only the unknown components of the model, e.g. f in (1), our suggested approach is capable of refining approximate models while staying more faithful to the underlying physics.

Besides the machine learning part, the model itself may contain unknown physical parameters belonging to some function space. This means that if the nonlinearity f is successfully learned, one can insert it into the model; one thus has a learning-informed PDE, and can then proceed via classical parameter identification. The latter problem was studied in [11] for stationary PDEs, where f is learned from training pairs (u, f(u)). That work emphasizes the analysis of the error propagating from the neural-network-based approximation of f to the parameter-to-state map and the reconstructed parameter.

In reality, one does not have direct access to the true state u, but only partial or coarse observations of u under some noise contamination. This affects the creation of training data pairs (u, f(u)) with \(f(u)=\dot{u}-F(u)\) for the process of learning f, e.g. in [11]. Indeed, with a coarse measurement of u, for instance \(u\in L^2((0,T)\times \Omega )\), one cannot evaluate \(\dot{u}\), nor terms such as \(\Delta u\) that may appear in F(u). Moreover, with discrete observations, e.g. a snapshot \(y={(u(t_i,\cdot ))_{i=1}^{n_T}}, t_i\in (0,T)\), one is unable to compute \(\dot{u}\) for the training data.

For this reason, we propose an all-at-once approach to identify the nonlinearity f, state u and physical parameter simultaneously. In comparison to [11], our approach bypasses the training process for f, and accounts for discrete data measurements. The all-at-once formulation avoids constructing the parameter-to-state map, which is nonlinear and often involves restrictive conditions [16, 20, 21, 23, 28]. Additionally, we here consider time-dependent PDE models.

For discovering nonlinearities in evolutionary PDEs, the work in [7] suggests an optimal control problem for nonlinearities expressed in terms of neural networks. Note that the unknown state still needs to be determined through a control-to-state map, i.e. via the classical reduced approach, as opposed to the new all-at-once approach.

While [7, 11] are the recent publications that are most related to our work, we also mention the very recent preprint [12] on an extension of [11] that appeared independently and after the original submission of our work. Furthermore, there is a wealth of literature on the topic of deep learning emerging in the last decade; for an authoritative review on machine learning in the context of inverse problems, we refer to [1]. For the regularization analysis, we follow the well known theory put forth in [13, 22, 26, 37]. It is worthwhile to note that since this work, to the knowledge of the authors, is the first attempt at applying an all-at-once approach to learning-informed PDEs, our focus will be on this novel concept itself, rather than on obtaining minimal regularity assumptions on the involved functions, in particular on the activation functions. In subsequent work, we might further improve upon this by considering, e.g., existing techniques from a classical optimal control setting with non-smooth equations [6] or techniques to deal with non-smoothness in the context of training neural networks [8].

Contributions Besides introducing the general setting of identifying nonlinearities in PDEs via indirect, parameter-dependent measurements, the main contributions of our work are as follows: Exploiting an all-at-once setting of handling both the state and the parameters explicitly as unknowns, we provide well-posedness results for the resulting learning- and learning-informed parameter identification problems. This is achieved for rather general, nonlinear PDEs and under local Lipschitz assumptions on the activation function of the involved neural network. Further, for the learning-informed parameter identification setting, we ensure the tangential cone condition on the neural-network part of our model. Together with suitable PDEs, this yields local uniqueness results as well as local convergence results of iterative solution methods for the parameter identification problem. We also provide a concrete application of our framework for parabolic problems, where we motivate our function-space setting by a unique existence result on the learning-informed PDE. Finally, we consider a case study in a Hilbert space setting, where we compute function-space derivatives of our objective functional to implement the Landweber method as solution algorithm. Using this algorithm, and also a parallel setting based on the ADAM algorithm [25], we provide numerical results that confirm feasibility of our approach in practice. Code is made available at https://github.com/hollerm/pde_learning.

Organization of the paper Section 2 introduces learning-informed parameter identification and the abstract setting. Section 3 examines existence, stability and solution methods for the minimization problem. Section 4 focuses on the learning-informed PDE, and analyzes some problem settings. Finally, in Sect. 5 we present a complete case study, from setup to numerical results.

2 Problem Setting

2.1 Notation and Basic Assertions

Throughout this work, \(\Omega \subset \mathbb {R}^d\) will always be a bounded Lipschitz domain, where additional smoothness will be required and specified as necessary. We use standard notation for spaces of continuous, integrable and Sobolev functions with values in Banach spaces, see for instance [10, 32], in particular [32, Sect. 7.1] for Sobolev-Bochner spaces and associated concepts such as time-derivatives of Banach-space valued functions. For an exponent \(p \in [1,\infty ]\), we denote by \(p^*\) the conjugate exponent given as \(p^* = p/(p-1)\) if \(p \in (1,\infty )\), \(p^* = \infty \) if \(p=1\) and \(p^* = 1\) if \(p=\infty \). For \(l \in \mathbb {N}\), we denote by

$$\begin{aligned} W^{l,p}(\Omega )\hookrightarrow L^q(\Omega )\end{aligned}$$

the continuous embedding of \(W^{l,p}(\Omega )\) into \(L^q(\Omega )\), which exists for \(q\preceq \frac{dp}{d-lp}\), where the notation \(\preceq \) means: if \(lp<d\), then \(q\le \frac{dp}{d-lp}\); if \(lp=d\), then \(q<\infty \); and if \(lp> d\), then \( q=\infty \). An example of such an embedding, which will be used frequently in Sect. 4, is \(H^1(\Omega )\hookrightarrow L^6(\Omega )\) for \(d=3\). We further denote by \(C_{W^{l,p}\rightarrow L^q}\) the operator norm of the corresponding continuous embedding operator.

We also use \(\hookrightarrow \hookrightarrow \) to denote the compact embedding (see [32, Theorem 1.21])

(5)

The notation C indicates generic positive constants. Given any Banach spaces X, Y, we denote by \(\Vert \cdot \Vert _{{X\rightarrow Y}}\) the operator norm \(\Vert \cdot \Vert _{{\mathcal {L}(X,Y)}}\), and by \(\langle \cdot ,\cdot \rangle _{{X,X^*}}\) the duality pairing between X and its dual \(X^*\). We write \(\mathcal {C}_\text {locLip}({X,Y})\) for the space of locally Lipschitz continuous functions between X and Y. Furthermore, \(A\cdot B\) denotes the Frobenius inner product between generic matrices A, B, while AB stands for matrix multiplication, and \(A^T\) stands for the transpose of A. The notation \(\mathcal {B}^X_\rho (x^\dagger )\) denotes the ball in X with center \(x^\dagger \) and radius \(\rho >0\). For functions mapping between Banach spaces, by the term weak continuity we will always refer to weak-weak continuity, i.e., continuity w.r.t. weak convergence in both the domain and the image space.

2.2 The Dynamical System

For the general setting considered in this work, we use the following set of definitions and assumptions. A concrete application where these abstract assumptions are satisfied can be found in Sect. 4 below.

Assumption 1

  • The space X (parameter space) is a reflexive Banach space. The spaces V (state space), W (image space under the model operator), Y (observation space) and \(\tilde{V} \) are separable, reflexive Banach spaces. In view of initial conditions, we further require \(U_0 \) (initial data space) to be a reflexive Banach space, and H to be a separable, reflexive Banach space.

  • We assume the following embeddings:

    (6)

    Further, \(\tilde{V}\) will always be such that either \(L^{\hat{p}}(\Omega )\hookrightarrow \tilde{V}\) or \(\tilde{V} \hookrightarrow L^{\hat{p}}(\Omega )\).

  • The function

    $$\begin{aligned} F:(0,T)\times X \times V\rightarrow W \end{aligned}$$

    is such that for any fixed parameter \(\lambda \in X\), \(F(\cdot ,\lambda ,\cdot ): (0,T) \times V \rightarrow W\) meets the Carathéodory conditions, i.e., \(F(\cdot ,\lambda ,v)\) is measurable with respect to t for all \(v\in V\) and \(F(t,\lambda ,\cdot ) \) is continuous with respect to v for almost every \(t\in (0,T)\). Moreover, for almost all \(t\in (0,T)\) and all \(\lambda \in X\), \(v\in V\), the growth condition

    $$\begin{aligned} \Vert F(t,\lambda ,v)\Vert _W\le \mathcal {B}(\Vert \lambda \Vert _X,\Vert v\Vert _H)(\gamma (t)+\Vert v\Vert _V) \end{aligned}$$
    (7)

    is satisfied for some \(\mathcal {B}:\mathbb {R}^2\rightarrow \mathbb {R}\) such that \(b \mapsto \mathcal {B}(a,b)\) is increasing for each \(a \in \mathbb {R}\), and \(\gamma \in L^2(0,T)\).

  • We define the overall state space and image space including time dependence as

    $$\begin{aligned} \mathcal {V}=L^2(0,T;V)\cap H^1(0,T;{\widetilde{V}}), \quad \mathcal {W}=L^2(0,T;W), \end{aligned}$$
    (8)

    respectively with the norms \(\Vert u\Vert _\mathcal {V}:=\sqrt{\int _0^T \Vert u(t)\Vert _V^2+\Vert \dot{u}(t)\Vert _{\widetilde{V}}^2\,\textrm{d}t}\) and \(\Vert u\Vert _\mathcal {W}:=\sqrt{\int _0^T \Vert u(t)\Vert _W^2 \,\textrm{d}t}\).

  • We define the overall observation space including time as

    $$\begin{aligned} \mathcal {Y}=L^2(0,T;Y), \end{aligned}$$

    with the norm \(\Vert y\Vert _\mathcal {Y}:=\sqrt{\int _0^T \Vert y(t)\Vert _Y^2 \,\textrm{d}t}\) and the corresponding measurement operator

    $$\begin{aligned} M \in \mathcal {L}(\mathcal {V},\mathcal {Y}). \end{aligned}$$
    (9)
  • We further assume the following embeddings for the state space:

    $$\begin{aligned} \mathcal {V}\hookrightarrow L^\infty ((0,T)\times \Omega ), \quad \mathcal {V}\hookrightarrow C(0,T;H). \end{aligned}$$

The embeddings in (6) are natural in the context of PDEs. The state space V usually carries enough smoothness that its image under the relevant spatial differential operators belongs to W. As a motivation for \(\mathcal {V}\hookrightarrow C(0,T;H)\), we mention the abstract setting in [32, Lemma 7.3] (see Appendix A). Note that due to \(\mathcal {V}\hookrightarrow C(0,T;H)\), clearly \(U_0=H\) is a feasible choice for the initial-data space; for the sake of generality, only \(U_0\hookrightarrow H\) is assumed in (6).

Under Assumption 1, the function F induces a Nemytskii operator on the overall spaces.

Lemma 2

Let Assumption 1 hold. Then the function \(F:(0,T)\times X \times V\rightarrow W \) induces a well-defined Nemytskii operator \(F: X \times \mathcal {V}\rightarrow \mathcal {W}\) given as

$$\begin{aligned}{}[F(\lambda ,u)](t) = F(t,\lambda ,u(t)). \end{aligned}$$
(10)

Proof

Under the Carathéodory assumption, \(t \mapsto F(t,\lambda ,u(t))\) is Bochner measurable for every \(\lambda \in X\) and \(u \in \mathcal {V}\). For such \(\lambda ,u\), we further estimate

$$\begin{aligned} \int _0^T \Vert F(t,\lambda ,u(t)) \Vert ^2_W {\,\textrm{d}t}&\le 2 \int _0^T \mathcal {B}(\Vert \lambda \Vert _X,\Vert {u}(t)\Vert _H)^2(\gamma (t)^2+\Vert {u}(t)\Vert ^2_V) {\,\textrm{d}t}\\&\le 2\mathcal {B}(\Vert \lambda \Vert _X,\Vert {u}\Vert _{C(0,T;H)})^2( \Vert \gamma \Vert ^2_{L^2(0,T)} + \Vert u\Vert _{\mathcal {V}}^2 ) < \infty \end{aligned}$$

by \(b \mapsto \mathcal {B}(\Vert \lambda \Vert ,b)\) being increasing and by the embedding \( \mathcal {V}\hookrightarrow C(0,T;H)\). This allows us to conclude that \(t \mapsto F(t,\lambda ,u(t))\) is Bochner integrable (see [10, Theorem II.2.2]) and that the Nemytskii operator \(F: X \times \mathcal {V}\rightarrow \mathcal {W}\) is well-defined. \(\square \)

Note that we use the same notation for the function \(F:(0,T) \times X \times V \rightarrow W\) and the corresponding Nemytskii operators.

2.3 Basics of Neural Networks

As outlined in the introduction, the unknown nonlinearity f will be represented by a neural network. In this work, we use a rather standard, feed-forward form of neural networks defined as follows.

Definition 3

A neural network \( \mathcal {N} _\theta \) of depth \(L \in \mathbb {N}\) with architecture \((n_i)_{i=0}^{L}\) is a function \( \mathcal {N} _\theta :\mathbb {R}^{n_0 } \rightarrow \mathbb {R}^{n_L }\) of the form

$$\begin{aligned} \mathcal {N} _\theta (x) = L_{\theta _L}\circ \ldots \circ L_{\theta _1}(x) \end{aligned}$$

where, for \(z \in \mathbb {R}^{n_{l-1}}\), the layer mapping \(L_{\theta _l}:\mathbb {R}^{n_{l-1}} \rightarrow \mathbb {R}^{n_{l}} \) is given as

$$\begin{aligned} L_{\theta _l}(z):= \sigma ( \omega ^l z + \beta ^l ) \text { for}\, l=1,\ldots ,L-1, \qquad L_{\theta _L}(z):= \omega ^L z + \beta ^L. \end{aligned}$$

Here, \(\omega ^l \in \mathcal {L}( \mathbb {R}^{n_{l-1}},\mathbb {R}^{n_{l}})\), \(\beta ^l \in \mathbb {R}^{n_{l}}\), \(\theta _l = (\omega ^l,\beta ^l)\) summarizes all the parameters of the l-th layer and \(\sigma \) is a pointwise nonlinearity that is fixed. Given a depth \(L \in \mathbb {N}\) and architecture \((n_i)_{i=0}^{L}\), we also use \(\Theta \) to denote the finite dimensional vector space containing all possible parameters \(\theta _1,\ldots ,\theta _L\) of neural networks with this architecture.

In this work, neural networks will be used to approximate the nonlinearity \(f:\mathbb {R}^{m+1} \rightarrow \mathbb {R}\). Consequently, we always deal with neural networks \(\mathcal {N}_\theta :\mathbb {R}^{m+1} \rightarrow \mathbb {R}\), i.e., \(n_0=m+1\) and \(n_L = 1\).
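For illustration, the following NumPy sketch (ours, independent of the implementation accompanying this paper; tanh is used merely as an example activation) evaluates a network as in Definition 3 with architecture \((m+1,\ldots ,1)\): affine layers \(\omega ^l z + \beta ^l\) with the activation applied in all but the last layer.

```python
import numpy as np

def sigma(x):
    """Pointwise activation; tanh is chosen here purely as an example."""
    return np.tanh(x)

def neural_net(z, theta):
    """Evaluate N_theta(z) for theta = [(omega_1, beta_1), ..., (omega_L, beta_L)].

    omega_l has shape (n_l, n_{l-1}) and beta_l has shape (n_l,); the activation
    is applied in layers 1, ..., L-1 and omitted in the last layer.
    """
    for omega, beta in theta[:-1]:
        z = sigma(omega @ z + beta)
    omega_L, beta_L = theta[-1]
    return omega_L @ z + beta_L

# Example: architecture (m+1, 10, 10, 1) for approximating f(alpha, u) with m = 1
rng = np.random.default_rng(0)
sizes = [2, 10, 10, 1]
theta = [(rng.standard_normal((sizes[l], sizes[l - 1])), rng.standard_normal(sizes[l]))
         for l in range(1, len(sizes))]
value = neural_net(np.array([0.5, 0.2]), theta)   # input (alpha, u)
```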

As such, rather than showing that f induces a well-defined Nemytskii operator, we instead show that \(\mathcal {N}_\theta \) does so. A sufficient condition for this to be true is the continuity of the activation function \(\sigma \), as the following Lemma shows.

Lemma 4

Assume that \(\sigma \in \mathcal {C}(\mathbb {R},\mathbb {R})\). Then, with the setting of Assumption 1, \( \mathcal {N} _\theta :\mathbb {R}^{m} \times \mathbb {R}\rightarrow \mathbb {R}\) as in Definition 3 induces a well-defined Nemytskii operator \( \mathcal {N} _\theta : \mathbb {R}^m \times \mathcal {V}\rightarrow L^2(0,T;L^{{{\hat{p}}}}(\Omega )) \) via

$$\begin{aligned} {[} \mathcal {N} _\theta (\alpha ,u)](t)(x) = \mathcal {N} _\theta (\alpha ,u(t,x)), \end{aligned}$$

regarding \(u \in \mathcal {V}\) as \(u \in L^\infty ((0,T)\times \Omega )\) by the embedding \( \mathcal {V}\hookrightarrow L^\infty ((0,T)\times \Omega )\). Further, using the embedding \(L^2(0,T;L^{{{\hat{p}}}}(\Omega )) \hookrightarrow \mathcal {W}\), \( \mathcal {N} _\theta \) induces a well-defined Nemytskii operator \( \mathcal {N} _\theta :\mathbb {R}^m \times \mathcal {V}\rightarrow \mathcal {W}\).

Proof

We first fix \(\alpha \in \mathbb {R}^m\). By continuity of \(\sigma \), \( \mathcal {N} _\theta \) is also continuous and, for \(u \in L^\infty ((0,T) \times \Omega )\), \(\sup _{t,x} | \mathcal {N} _\theta (\alpha ,u(t,x))| < \infty \); thus, \( \mathcal {N} _\theta (\alpha ,u(t,\cdot )) \in L^{\hat{p}}(\Omega )\) for almost every \(t \in (0,T)\). It then follows by standard measurability arguments that the mapping \(t \mapsto \int _\Omega \mathcal {N} _\theta (\alpha ,u(t,x))w^*(x)\,\textrm{d}x\) is measurable for every \(w^* \in L^{{\hat{p}}^*}(\Omega )\). Using separability and the Pettis theorem [10, Theorem II.1.2], it follows that \(t \mapsto \mathcal {N} _\theta (\alpha ,u(t,\cdot )) \in L^{\hat{p}}(\Omega )\) is Bochner measurable. This, together with \(\sup _{t,x} | \mathcal {N} _\theta (\alpha ,u(t,x))| < \infty \) as before, implies that the Nemytskii operator \( \mathcal {N} _\theta :\mathbb {R}^m \times \mathcal {V}\rightarrow L^2(0,T;L^{\hat{p}}(\Omega )) \) is well defined. The remaining assertions follow immediately from \(L^{\hat{p}}(\Omega ) \hookrightarrow W\). \(\square \)

We again use the same notation for \( \mathcal {N} _\theta :\mathbb {R}^{m} \times \mathbb {R}\rightarrow \mathbb {R}\) and the corresponding Nemytskii operator.

2.4 The Learning Problem

As the nonlinearity f is represented by a neural network \(\mathcal {N}_\theta :\mathbb {R}^{m+1} \rightarrow \mathbb {R}\), we rewrite the PDE model (1) into the form

$$\begin{aligned}&e:X\times \mathbb {R}^m \times U_0 \times \mathcal {V}\times \Theta \rightarrow \mathcal {W}\times H,\nonumber \\&\quad e(\lambda ,\alpha ,u_0,u,\theta )=(\dot{u}-F(\lambda ,u)- \mathcal {N} _\theta (\alpha ,u),u(0)-u_0), \end{aligned}$$
(11)

and introduce the forward operator \(\mathcal {G}\), which incorporates the observation operator M, as

$$\begin{aligned}&\mathcal {G}:X \times \mathbb {R}^m \times U_0 \times \mathcal {V}\times \Theta \rightarrow \mathcal {W}\times H \times \mathcal {Y}, \nonumber \\&\mathcal {G}(\lambda ,\alpha ,u_0,u,\theta )=(e(\lambda ,\alpha ,u_0,u,\theta ),Mu). \end{aligned}$$
(12)

Here, \(U_0\) and H are the spaces related to the initial condition and the trace operator, that is, one has unknown initial data \(u_0\in U_0\) and trace operator \((\cdot )_{t=0}: \mathcal {V}\ni u\mapsto u(0)\in H\). With \(U_0\hookrightarrow H\) as assumed in (6), one has \(u(0)-u_0\in H\).

The minimization problem for the learning process is then given by

$$\begin{aligned} \min _{\begin{array}{c} (\lambda ^k,\alpha ^k,u_0^k,u^k)_k \subset X \times \mathbb {R}^m \times U_0\times \mathcal {V}\\ \theta \in \Theta \end{array}}{} & {} \sum _{k=1}^K \Vert \mathcal {G}(\lambda ^k,\alpha ^k,u_0^k,u^k,\theta ) - (0,0,y^k) \Vert ^2_{\mathcal {W}\times H\times \mathcal {Y}} \nonumber \\{} & {} \quad +\mathcal {R}_1(\lambda ^k,\alpha ^k,u_0^k,u^k) + \mathcal {R}_2(\theta ), \end{aligned}$$
(13)

where \(\mathcal {R}_1:X \times \mathbb {R}^m \times U_0 \times \mathcal {V}\rightarrow [0,\infty ]\) and \(\mathcal {R}_2:\Theta \rightarrow [0,\infty ]\) are suitable regularization functionals.

Assume now that the particular parameter \(\hat{\theta }\) has been learned. As in (4), one can then address other parameter identification problems, given a new measured datum \(y\approx Mu\), by solving

$$\begin{aligned} \min _{ (\lambda ,\alpha ,u_0, u)\in X \times \mathbb {R}^m \times U_0\times \mathcal {V}} \Vert \mathcal {G}(\lambda ,\alpha ,u_0,u,\hat{\theta }) - (0,0,y) \Vert ^2_{\mathcal {W}\times H\times \mathcal {Y}} + \mathcal {R}_1(\lambda ,\alpha ,u_0,u). \end{aligned}$$
(14)
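To make the structure of (13) and (14) concrete in a discrete setting, the following Python sketch (purely illustrative; the grids, quadrature weights and the callables F_model, network, M, R1, R2 are our own stand-ins, not part of the analysis) evaluates the data-fidelity and regularization terms for a single measurement k.

```python
import numpy as np

def all_at_once_objective(lam, alpha, u0, u, theta, y, dt, dx,
                          F_model, network, M, R1, R2):
    """Discrete analogue of the all-at-once functional (13)/(14) for one measurement.

    u is the state on an (n_t, n_x) space-time grid; F_model, network, M, R1 and R2
    are user-supplied callables standing in for F, N_theta, the measurement operator
    and the regularizers.  Norms are approximated by simple quadrature.
    """
    # PDE residual  e_1 = du/dt - F(lambda, u) - N_theta(alpha, u)
    dudt = np.gradient(u, dt, axis=0)
    e_pde = dudt - F_model(lam, u) - network(alpha, u, theta)
    # initial-condition residual  e_2 = u(0, .) - u_0
    e_init = u[0, :] - u0
    # data misfit  M u - y
    e_data = M(u) - y
    misfit = (np.sum(e_pde ** 2) * dt * dx      # ~ ||e_1||^2 in W
              + np.sum(e_init ** 2) * dx        # ~ ||e_2||^2 in H
              + np.sum(e_data ** 2))            # ~ ||Mu - y||^2 in Y, weights depend on M
    return misfit + R1(lam, alpha, u0, u) + R2(theta)
```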

3 Learning-Informed Parameter Identification

3.1 Well-Posedness of Minimization Problems

We start our analysis by studying existence theory for the optimization problems (13) and (14), where the unknown nonlinearity is replaced by a neural network approximation. To this aim, we first establish weak closedness of the forward operator. In what follows, the architecture of the network \( \mathcal {N} \) is considered fixed.

Lemma 5

Let Assumption 1 hold. Then, if \(\sigma \in \mathcal {C}_\text {locLip}(\mathbb {R},\mathbb {R})\), \( \mathcal {N} :\mathbb {R}^m \times \mathcal {V}\times \Theta \rightarrow \mathcal {W}\) is weakly continuous. Further, if either

$$\begin{aligned} F(t,\cdot ):X \times H \rightarrow W \text { is weakly continuous for a.e. }t \in (0,T) \end{aligned}$$
(15)

or

(16)

and \((-F)\) is pseudomonotone in the sense that for almost all \(t\in (0,T),\)

$$\begin{aligned}&\left. \begin{aligned} (u_k(t),\lambda _k) \overset{H\times X}{\rightharpoonup }\ (u,\lambda )&\\ \underset{k\rightarrow \infty }{\liminf }\ \langle F(t,\lambda _k,u_k(t)),u_k(t)-u(t)\rangle _{W,W^*}\ge 0&\end{aligned}\right\} \nonumber \\&\Rightarrow {\left\{ \begin{array}{ll} \forall v \in W^*: \langle F(t,\lambda ,u(t)),u(t)-v \rangle _{W,W^*}\\ \ge \underset{k\rightarrow \infty }{\limsup }\langle F(t,\lambda _k,u_k(t)),u_k(t)-v \rangle _{W,W^*}, \end{array}\right. } \end{aligned}$$
(17)

then F is weakly closed. Moreover, if \( \mathcal {N} \) is weakly continuous and F is weakly closed, then \(\mathcal {G}\) as in (12) is weakly closed.

Proof

We first consider weak closedness of \(\mathcal {G}\). To this aim, recall that \(\mathcal {G}\) is given as

$$\begin{aligned} \mathcal {G}(\lambda ,\alpha ,u_0,u,\theta )=(\dot{u}-F(\lambda ,u)- \mathcal {N} _\theta (\alpha ,u),u(0)-u_0,Mu). \end{aligned}$$

First note that \( M\in \mathcal {L}(\mathcal {V},\mathcal {Y})\) by (9). Weak closedness of \(((\cdot )_{t=0},\text {Id}):\mathcal {V}\times U_0 \rightarrow H\) follows from weak continuity of \(\text {Id}:U_0 \rightarrow H\) as \(U_0 \hookrightarrow H\), and from weak-weak continuity of \((\cdot )_{t=0}:\mathcal {V}\rightarrow H\) which follows from \(\Vert u(0)\Vert _H \le \sup _{t\in [0,T]}\Vert u(t)\Vert _H \le C\Vert u\Vert _\mathcal {V}\) for \(C>0\) and \(\mathcal {V}\hookrightarrow C(0,T;H)\). Weak continuity of \(\frac{d}{dt}:\mathcal {V}\rightarrow \mathcal {W}\) results from the choice of norms in the respective spaces. Thus, weak closedness of \(\mathcal {G}\) follows when F is weakly closed and \( \mathcal {N} \) is weakly continuous.

Weak continuity of \( \mathcal {N} \). First, we observe that \( \mathcal {N} : \mathbb {R}^m \times \mathbb {R}\times \Theta \rightarrow \mathbb {R}, (\alpha ,y,\theta ) \mapsto \mathcal {N} _\theta (\alpha ,y)\) is in \(\mathcal {C}_\text {locLip}(\Theta \times \mathbb {R}^m \times \mathbb {R},\mathbb {R})\), since the activation function \(\sigma \) is locally Lipschitz continuous. For a sequence \((\alpha _n,u_n,\theta _n)_n\) converging weakly to \((\alpha ,u,\theta )\) in \(\mathbb {R}^m \times \mathcal {V}\times \Theta \), we observe that by the embedding \(\mathcal {V}\hookrightarrow L^\infty ((0,T)\times \Omega )\), \(\sup _{t,x} \Vert (\alpha _n,u_n(t,x),\theta _n)\Vert < M\) for some \(M>0\).

Now the embeddings imply in particular that \(\mathcal {V}\) embeds compactly into \(L^2(0,T;L^{\hat{p}}(\Omega ))\) (in case \(\widetilde{V}\hookrightarrow L^{\hat{p}}(\Omega )\), this follows from the assumed compact embedding together with [32, Lemma 7.7] (see Appendix A); in the other case, \(L^{\hat{p}}(\Omega )\hookrightarrow \widetilde{V}\), it follows directly from [32, Lemma 7.7]). Based on this, we deduce \(u_n\rightarrow u\) in \(L^2(0,T;L^{\hat{p}}(\Omega ))\). Then

$$\begin{aligned}&\Vert \mathcal {N} (\alpha _n,u_n,\theta _n)- \mathcal {N} (\alpha ,u,\theta )\Vert _\mathcal {W}=\underset{ \begin{array}{c} w^* \in \mathcal {W}^*, \\ \Vert w^*\Vert _{\mathcal {W}^*}\le 1 \end{array}}{\sup } \langle \mathcal {N} (\alpha _n,u_n,\theta _n)- \mathcal {N} (\alpha ,u,\theta ),w^* \rangle _{\mathcal {W},\mathcal {W}^*}\nonumber \\&\quad =\underset{ \begin{array}{c} w^* \in \mathcal {W}^*, \\ \Vert w^*\Vert _{\mathcal {W}^*}\le 1 \end{array}}{\sup } \int _0^T \int _ \Omega \left( \mathcal {N} (\alpha _n,u_n(t,x),\theta _n)- \mathcal {N} (\alpha ,u(t,x),\theta )\right) w^*(t,x) \,\textrm{d}x\,\textrm{d}t \nonumber \\&\quad \le L(M) \underset{\begin{array}{c} w^* \in \mathcal {W}^*, \\ \Vert w^*\Vert _{\mathcal {W}^*}\le 1 \end{array}}{\sup }\int _0^T \int _\Omega \left( |\alpha _n - \alpha | + |u_n(t,x)-u(t,x)| + |\theta _n - \theta | \right) |w^*(t,x)| \,\textrm{d}x\,\textrm{d}t \nonumber \\&\quad \le C L(M) \underset{\begin{array}{c} w^* \in \mathcal {W}^*, \\ \Vert w^*\Vert _{\mathcal {W}^*}\le 1 \end{array}}{\sup } \left( \Vert u_n-u\Vert _{L^2(0,T;L^{\hat{p}}(\Omega ))} + |\alpha _n - \alpha | + |\theta _n-\theta | \right) \Vert w^*\Vert _{L^2(0,T;L^{\frac{{\hat{p}}}{{\hat{p}}-1}}(\Omega ))} \nonumber \\&\quad \le C L(M) \left( \Vert u_n-u\Vert _{L^2(0,T;L^{\hat{p}}(\Omega ))} + |\alpha _n - \alpha | + |\theta _n-\theta | \right) \overset{n\rightarrow \infty }{\rightarrow }0 \end{aligned}$$
(18)

as \(W^*\hookrightarrow L^{\frac{{\hat{p}}}{{\hat{p}}-1}}(\Omega )\), \(u_n\overset{n\rightarrow \infty }{\rightarrow }u\) in \(L^2(0,T;L^{\hat{p}}(\Omega ))\), and \( \mathcal {N} (\alpha ,u_n(t,\cdot ),\theta _n) \in L^{\hat{p}}(\Omega )\), as argued in the proof of Lemma 4. Above, L(M) denotes the Lipschitz constant of \((\alpha ,y,\theta ) \mapsto \mathcal {N} _\theta (\alpha ,y)\) in the ball with radius M, and \({\hat{p}}/({\hat{p}}-1) = \infty \) in case \({\hat{p}}=1\). This shows that here, we even obtain weak-strong continuity of \(\mathcal {N}_\theta \), which is stronger than weak-weak continuity, as required.

Weak closedness of F. To show weak closedness of the Nemytskii operator \(F:X \times \mathcal {V}\rightarrow \mathcal {W}\), we consider two cases. We first consider the case that \(F(t,\cdot )\) is weakly continuous. To this aim, take \((\lambda _n,u_n)_n\) to be a sequence weakly converging to \((\lambda ,u)\) in \(X \times \mathcal {V}\). As \(\mathcal {V}\hookrightarrow C(0,T;H)\), we have \(u_n\overset{C(0,T;H)}{\rightharpoonup }u\) as \(n \rightarrow \infty \). Now, we show \(u_n(t)\overset{H}{\rightharpoonup }u(t)\) for all \(t\in (0,T)\) via the fact that the point-wise evaluation function \((\cdot )(t):\mathcal {V}\rightarrow H\) for any \( t\in [0,T]\) is linear and bounded, thus weak-weak continuous. Indeed, its linearity is clear and boundedness follows from

$$\begin{aligned} \langle ({\tilde{u}})(t),h^*\rangle _{H,H^*}\le \underset{ {\tilde{t}}\in [0,T]}{\max }\langle ({\tilde{u}})({\tilde{t}}),h^*\rangle _{H,H^*}\le \Vert {\tilde{u}}\Vert _{C(0,T;H)}\Vert h^*\Vert _{H^*}\le C\Vert {\tilde{u}}\Vert _\mathcal {V}\Vert h^*\Vert _{H^*}. \end{aligned}$$

From this, we obtain \(u_n(t)\overset{H}{\rightharpoonup }u(t)\), thus having \( (u_n(t),\lambda _n)\overset{H\times X}{\rightharpoonup }(u(t),\lambda )\) for all \(t \in (0,T).\) Using the growth condition (7), we now estimate

$$\begin{aligned}&\langle F(\lambda _n,u_n)-F(\lambda ,u),w^* \rangle _{\mathcal {W},\mathcal {W}^*}= \int _0^T \langle F(\lambda _n,u_n)(t)-F(\lambda ,u)(t),w^*(t) \rangle _{W,W^*}\,\textrm{d}t \nonumber \\&\quad =: \int _0^T \epsilon _n(t)\,\textrm{d}t \nonumber \\&\quad \le \int _0^T(\Vert F(\lambda _n,u_n)(t)\Vert _W+\Vert F(\lambda ,u)(t)\Vert _W)\Vert w^*(t)\Vert _{W^*}\,\textrm{d}t \nonumber \\&\quad \le \left( \mathcal {B}(\Vert \lambda _n\Vert _X,\sup _t \Vert u_n(t)\Vert _H)(\Vert \gamma \Vert _{L^2(0,T)} +\Vert u_n\Vert _\mathcal {V})\right. \nonumber \\&\qquad \left. \,+\mathcal {B}(\Vert \lambda \Vert _X,\sup _t \Vert u(t)\Vert _H)(\Vert \gamma \Vert _{L^2(0,T)}+\Vert u\Vert _\mathcal {V})\right) \Vert w^*\Vert _{\mathcal {W}^*}\nonumber \\&\quad \le C(\Vert \lambda \Vert _X,\Vert u\Vert _\mathcal {V})\Vert w^*\Vert _{\mathcal {W}^*}, \end{aligned}$$
(19)

where \(C(\Vert \lambda \Vert _X,\Vert u\Vert _\mathcal {V})>0\) can be obtained independently from n due to \(\mathcal {V}\hookrightarrow C(0,T;H)\), \(\mathcal {B}\) being increasing, and boundedness of \(((u_n,\lambda _n))_n \) in \(\mathcal {V}\times X\). Since F is assumed to be weakly continuous on \(H\times X\), when \(n\rightarrow \infty \) we have \(\epsilon _n(t)\rightarrow 0\) pointwise in t. Hence, applying Lebesgue’s Dominated Convergence Theorem yields convergence of the time integral to 0, thus weak convergence of \(F(\lambda _n,u_n)\) to \(F(\lambda ,u)\) in \(\mathcal {W}\) as claimed. Accordingly, if the condition (15) holds, we obtain weak-weak continuity of F.

Now we consider the second case, i.e. (16)-(17), for weak closedness of F. Assume that as in (16), \( H\hookrightarrow W^*\) and that \(-F\) is pseudomonotone as in (17). Given \((u_n,\lambda _n) \overset{\mathcal {V}\times X}{\rightharpoonup }\ (u,\lambda )\) and \(F({\lambda _n,u_n})\overset{\mathcal {W}}{\rightharpoonup }\ g\), it follows from [32, Lemma 7.7] (see Appendix A) that \(u_{n}\rightarrow u \) strongly in \(L^2(0,T;H)\). By the embedding \(H \hookrightarrow W^*\), it also holds that \(u_{n} \rightarrow u\) in \(\mathcal {W}^*\). With \(\xi _{n}{(t)}:= |\langle F(t,\lambda _{n},u_{n}(t)),u_{n}(t)-u(t)\rangle _{W,W^*}|\), we obtain

$$\begin{aligned} \int _0^T |\xi _{n}(t)|\,\textrm{d}t \le \Vert F(\lambda _{n},u_{n})\Vert _\mathcal {W}\Vert u_{n}-u\Vert _{\mathcal {W}^*} \le C\Vert u_{n}-u\Vert _{\mathcal {W}^*}\overset{n\rightarrow \infty }{\rightarrow }0. \end{aligned}$$
(20)

By moving to a subsequence indexed by \((n_k)_k\), we thus have \(\xi _{n_k}(t)\rightarrow 0 \) as \(k\rightarrow \infty \) for almost every \(t\in (0,T)\). As \(\underset{k\rightarrow \infty }{\liminf }\ \xi _{n_k}(t)= 0\), pseudomonotonicity (as in (17)) implies that for any \(v\in \mathcal {W}^*\),

$$\begin{aligned} \langle F(t,\lambda ,u(t)),u(t)-v(t) \rangle _{W,W^*}\ge \underset{k\rightarrow \infty }{\limsup }\langle F(t,\lambda _{n_k},u_{n_k}(t)),u_{n_k}(t)-v(t) \rangle _{W,W^*}. \end{aligned}$$

Further, from the Fatou–Lebesgue theorem, we get

$$\begin{aligned}&\langle F(\lambda ,u),u-v \rangle _{\mathcal {W},\mathcal {W}^*}= \int _0^T \langle F(t,\lambda ,u(t)),u(t)-v(t) \rangle _{W,W^*}\,\textrm{d}t\\&\quad \ge \int _0^T\underset{k\rightarrow \infty }{\limsup }\ \langle F(t,\lambda _{n_k},u_{n_k}(t)),u_{n_k}(t)-v(t) \rangle _{W,W^*}\,\textrm{d}t \\&\quad \ge \underset{k\rightarrow \infty }{\liminf }\ \int _0^T\langle F(t,\lambda _{n_k},u_{n_k}(t)),u_{n_k}(t)-v(t) \rangle _{W,W^*}\,\textrm{d}t\\&\quad \ge \underset{k\rightarrow \infty }{\liminf }\int _0^T \langle F(\lambda _{n_k},u_{n_k}(t)),u_{n_k}(t)-u(t) \rangle _{W,W^*}\,\textrm{d}t\\&\qquad +\underset{k\rightarrow \infty }{\liminf }\int _0^T \langle F(\lambda _{n_k},u_{n_k}(t)),u(t)-v(t) \rangle _{W,W^*}\,\textrm{d}t\\&\quad = \underset{k\rightarrow \infty }{\lim }\int _0^T \langle F(\lambda _{n_k},u_{n_k}(t)),u_{n_k}(t)-u(t) \rangle _{W,W^*}\,\textrm{d}t\\&\qquad +\underset{k\rightarrow \infty }{\lim }\int _0^T \langle F(\lambda _{n_k},u_{n_k}(t)),u(t)-v(t) \rangle _{W,W^*}\,\textrm{d}t\\&\quad =0+\langle g,u-v \rangle _{\mathcal {W},\mathcal {W}^*}, \end{aligned}$$

where the last estimate follows from (20) and from weak convergence of \(F(\lambda _n,u_n)\) to g in \(\mathcal {W}\). As this estimate is valid for any \(v\in \mathcal {W}^*\), we conclude that F is weakly closed on \(X\times \mathcal {V}\), that is,

$$\begin{aligned} F({\lambda ,u})=g. \end{aligned}$$

\(\square \)

Existence of a solution to (13) and (14) now follows from a standard application of the direct method [13, 37], using weak-closedness of \(\mathcal {G}\) and weak lower semi-continuity of the involved quantities.

Proposition 6

(Existence) Let the assumptions of Lemma 5 hold, and assume that \(\mathcal {R}_1, \mathcal {R}_2 \) are nonnegative, weakly lower semi-continuous and such that the sublevel sets of \((\lambda ,\alpha ,u_0,u,\theta ) \mapsto \mathcal {R}_1(\lambda ,\alpha ,u_0,u) + \mathcal {R}_2(\theta )\) are weakly precompact. Then the minimization problems (13) and (14) admit a solution.

Remark 7

(Stability) We note that under the assumptions of Proposition 6, also stability for the minimization problems (13) and (14) follows with standard arguments, see for instance [17, Theorem 3.2]. Here, stability means that for a sequence of data \((y_n)_n\) converging to some y, any corresponding sequence of solutions admits a weakly convergent subsequence, and any limit of such a weakly convergent subsequence is a solution of the original problem with data y.

Next we deal with the minimization problem (13) in the limit case where the given data converge to a noise-free ground truth and the PDE is fulfilled exactly. Our result in this context is a direct extension of classical results as provided for instance in [17], but since variants of this result will also be of interest, we provide a short proof.

Proposition 8

(Limit case) With the assumption of Proposition 6 and parameters \(\beta ^e,\beta ^M>0\), consider the parametrized learning problem

$$\begin{aligned}{} & {} \min _{\begin{array}{c} (\lambda ^k,\alpha ^k,u_0^k,u^k)_k \subset X \times \mathbb {R}^m \times U_0\times \mathcal {V}\\ \theta \in \Theta \end{array}} \sum _{k=1}^K \beta ^e\Vert e(\lambda ^k,\alpha ^k,u_0^k,u^k,\theta ) \Vert ^2_{\mathcal {W}\times H} + \beta ^M\Vert Mu^k-y^k \Vert ^2_{\mathcal {Y}} \nonumber \\{} & {} \quad + \mathcal {R}_1(\lambda ^k,\alpha ^k,u_0^k,u^k) + \mathcal {R}_2(\theta ), \end{aligned}$$
(21)

and assume that, for \(((y^\dagger )^k)_k \in \mathcal {Y}^K\), there exists \((\hat{\lambda }^k,\hat{\alpha }^k,{\hat{u}}_0^k,{\hat{u}}^k)_k \in X \times \mathbb {R}^m \times U_0\times \mathcal {V}\) and \(\hat{\theta }\in \Theta \) such that \( e(\hat{\lambda }^k,\hat{\alpha }^k,{\hat{u}}_0^k,{\hat{u}}^k,\hat{\theta }) = 0 \) , \( M{\hat{u}}^k = (y^\dagger )^k\), \(\mathcal {R}_1(\hat{\lambda }^k,\hat{\alpha }^k,{\hat{u}}_0^k,{\hat{u}}^k)< \infty \) for all k and \( \mathcal {R}_2(\hat{\theta })< \infty \).

Then, for any sequence \((y_n)_n = (y_n^1,\ldots ,y_n^K)_n \) in \(\mathcal {Y}^K\) with \(\sum _{k=1}^K\Vert y^k_n - {(y^\dagger )^k} \Vert ^2_{\mathcal {Y}}:= \delta _n^2 \rightarrow 0\) and parameters \(\beta _n^e,\beta _n^M\) such that

$$\begin{aligned} \beta ^e_n \rightarrow \infty , \, \beta ^M_n \rightarrow \infty \text { and }\beta ^M_n \delta ^2_n \rightarrow 0 \end{aligned}$$

as \(n \rightarrow \infty \), any sequence of solutions \(((\lambda _n^k,\alpha _n^k,(u_0^k)_n,u_n^k)_k,\theta _n)_n\) of (21) with parameters \(\beta _n^e,\beta _n^M\) and data \(y_n\) admits a weakly convergent subsequence, and any limit of such a subsequence is a solution to

$$\begin{aligned}{} & {} \min _{\begin{array}{c} (\lambda ^k,\alpha ^k,u_0^k,u^k)_k \subset X \times \mathbb {R}^m \times U_0\times \mathcal {V}\\ \theta \in \Theta \end{array}} \sum _{k=1}^K\mathcal {R}_1(\lambda ^k,\alpha ^k,u_0^k,u^k) + \mathcal {R}_2(\theta ) \nonumber \\{} & {} \quad \text {s.t. for all }k: {\left\{ \begin{array}{ll} e(\lambda ^k,\alpha ^k,u_0^k,u^k,\theta ) = 0 \\ Mu^k = (y^\dagger )^k \end{array}\right. } \end{aligned}$$
(22)

If, further, the solution to (22) is unique, then the entire sequence \(((\lambda _n^k,\alpha _n^k,(u_0^k)_n,u_n^k)_k,\theta _n)_n\) weakly converges to the solution of (22).

Proof

With \((\hat{\lambda }^k,\hat{\alpha }^k,{\hat{u}}_0^k,{\hat{u}}^k)_k \) and \(\hat{\theta }\) arbitrary such that \(e(\hat{\lambda }^k,\hat{\alpha }^k,{\hat{u}}_0^k,{\hat{u}}^k,\hat{\theta }) = 0 \) and \( M{\hat{u}}^k = (y^\dagger )^k\), and \(((\lambda _n^k,\alpha _n^k,(u_0^k)_n,u_n^k)_k,\theta _n)_n\) any sequence of solutions to (21) with parameters \(\beta ^e_n,\beta ^M_n\), by optimality it holds that

$$\begin{aligned}{} & {} \sum _{k=1}^K \beta ^e_n\Vert e(\lambda ^k_n,\alpha ^k_n,(u_0^k)_n,u^k_n,\theta _n) \Vert ^2_{\mathcal {W}\times H} + \beta ^M_n \Vert Mu_n^k-y_n^k \Vert ^2_{\mathcal {Y}} + \mathcal {R}_1(\lambda _n^k,\alpha _n^k,(u_0^k)_n,u^k_n) \nonumber \\{} & {} \quad + \mathcal {R}_2(\theta _n) \le {\sum _{k=1}^K }\mathcal {R}_1(\hat{\lambda }^k,\hat{\alpha }^k,{\hat{u}}_0^k,{\hat{u}}^k) + \beta _n^M\delta _n^2 + \mathcal {R}_2(\hat{\theta }) \end{aligned}$$
(23)

By weak precompactness of the sublevel sets of \(\mathcal {R}_1\) and \(\mathcal {R}_2\) and convergence of \(\beta _n^M\delta _n^2\) to zero it thus follows that \(((\lambda _n^k,\alpha _n^k,(u_0^k)_n,u_n^k)_k,\theta _n)_n\) admits a weakly convergent subsequence in \((X \times \mathbb {R}^m \times U_0\times \mathcal {V})^K\times \Theta \).

Now let \(((\lambda ^k,\alpha ^k,u_0^k,u^k)_k,\theta )\) be the limit of such a weakly convergent subsequence, which we again denote by \(((\lambda _n^k,\alpha _n^k,(u_0^k)_n,u_n^k)_k,\theta _n)_n\). Closedness of \(\mathcal {G}\) together with lower semi-continuity of the norm \(\Vert \cdot \Vert _{\mathcal {W}\times H}\) and the estimate (23) (possibly moving to another non-relabeled subsequence) then yields that both

$$\begin{aligned}&\sum _{k=1}^K \Vert e(\lambda ^k,\alpha ^k,u_0^k,u^k,\theta ) \Vert ^2_{\mathcal {W}\times H}\\&\quad \le \liminf _n \sum _{k=1}^K \Vert e(\lambda ^k_n,\alpha ^k_n,(u_0^k)_n,u^k_n,\theta _n) \Vert ^2_{\mathcal {W}\times H} \\&\quad \le \liminf _n {\sum _{k=1}^K }\mathcal {R}_1(\hat{\lambda }^k,\hat{\alpha }^k,{\hat{u}}_0^k,{\hat{u}}^k)/\beta ^e_n + \beta _n^M(\delta _n^2 /\beta ^e_n) + \mathcal {R}_2(\hat{\theta })/\beta ^e_n = 0 \end{aligned}$$

and

$$\begin{aligned}&\Vert Mu^k-(y^\dagger )^k \Vert ^2_{\mathcal {Y}} \le \liminf _n \Vert Mu_n^k-y_n^k \Vert ^2_{\mathcal {Y}}\nonumber \\&\quad \le \liminf _n \mathcal {R}_1(\hat{\lambda }^k,\hat{\alpha }^k,{\hat{u}}_0^k,{\hat{u}}^k)/\beta ^M_n + \mathcal {R}_2(\hat{\theta })/\beta ^M_n + \delta _n^2 = 0. \end{aligned}$$
(24)

This shows that \(e(\lambda ^k,\alpha ^k,u_0^k,u^k,\theta ) = 0 \) and \( Mu^k = (y^\dagger )^k\) for all k. Again using the estimate (23), now together with weak lower semi-continuity of \(\mathcal {R}_1,\mathcal {R}_2\), we further obtain that

$$\begin{aligned} \mathcal {R}_1(\lambda ^k,\alpha ^k,u_0^k,u^k) + \mathcal {R}_2(\theta )&\le \liminf _n \mathcal {R}_1(\lambda _n^k,\alpha _n^k,(u_0^k)_n,u_n^k) + \mathcal {R}_2(\theta _n)\\&\le \liminf _n \mathcal {R}_1(\hat{\lambda }^k,\hat{\alpha }^k,{\hat{u}}_0^k,{\hat{u}}^k) + \mathcal {R}_2(\hat{\theta }) + \beta _n^M\delta _n^2 \\&= \mathcal {R}_1(\hat{\lambda }^k,\hat{\alpha }^k,{\hat{u}}_0^k,{\hat{u}}^k) + \mathcal {R}_2(\hat{\theta }). \end{aligned}$$

Since \((\hat{\lambda }^k,\hat{\alpha }^k,{\hat{u}}_0^k,{\hat{u}}^k)_k \) and \(\hat{\theta }\) were arbitrary solutions of \(e(\hat{\lambda }^k,\hat{\alpha }^k,{\hat{u}}_0^k,{\hat{u}}^k,\hat{\theta }) = 0 \) and \( M{\hat{u}}^k = (y^\dagger )^k\), it follows that \(((\lambda ^k,\alpha ^k,u_0^k,u^k)_k,\theta )\) solves (22) as claimed.

At last, in case the solution to (22) is unique, weak convergence of the entire sequence follows by a standard argument, using that any subsequence contains another subsequence that weakly converges to the same limit. \(\square \)
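For illustration, one admissible choice of the parameters in Proposition 8 (our own example; any schedule satisfying the stated rate conditions works equally well) is

$$\begin{aligned} \delta _n=\tfrac{1}{n}, \qquad \beta ^M_n=\delta _n^{-1}=n, \qquad \beta ^e_n=\delta _n^{-2}=n^2, \end{aligned}$$

for which indeed \(\beta ^e_n \rightarrow \infty \), \(\beta ^M_n \rightarrow \infty \) and \(\beta ^M_n \delta _n^2 = \tfrac{1}{n} \rightarrow 0\) as \(n \rightarrow \infty \).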

Remark 9

(Different limit cases) The above result considers the limit case of both fulfilling the PDE exactly and matching noise-free ground truth measurements. Variants can be easily obtained as follows: In case only the PDE should be fulfilled exactly, one can consider \(\beta ^M\) fixed and only \(\beta ^e\) converging to infinity (at an arbitrary rate), such that the resulting limit solution will be a solution of the reduced setting. Likewise, one can consider the case that \(\beta ^e\) is fixed and \(\beta ^M\) converges to infinity appropriately in dependence of the noise level \(\delta \), in which case the limit solution solves the all-at-once setting with the hard constraint \(Mu^k=(y^\dagger )^k\), see [18] for some general results in that direction. The corresponding assumption of existence of \(((\hat{\lambda }^k,\hat{\alpha }^k,{\hat{u}}_0^k,{\hat{u}}^k)_k,\hat{\theta })\) such that \(e(\hat{\lambda }^k,\hat{\alpha }^k,{\hat{u}}_0^k,{\hat{u}}^k,\hat{\theta }) = 0 \) and \( M{\hat{u}}^k = (y^\dagger )^k\) can be weakened in both cases accordingly.

Further, note that the convergence result as well as its variants can be deduced also for the learning-informed parameter identification problem (14) exactly the same way.

Remark 10

(Uniqueness of minimum-norm solution.) A sufficient condition for uniqueness of a minimum-norm solution, and thus for convergence of the entire sequence of minimizers as stated in Proposition 8, is the tangential cone condition together with existence of a solution \((\hat{\lambda }^k,\hat{u}_0^k,\hat{u}^k)\) to the PDE such that \(M\hat{u}^k = (y^\dagger )^k\), see [22, Proposition 2.1]. In Sect. 3.3 below, we discuss this condition in more detail and provide a result which, together with Remark 19, ensures this condition to hold for some particular choices of F and \( \mathcal {N} _\theta \). Regarding solvability of the PDE, we refer to Proposition 24 below, where a particular application is considered.

3.2 Differentiability of the Forward Operator

Solution methods for nonlinear optimization problems, like gradient descent or Newton-type methods, require uniform boundedness of the derivative of \(\mathcal {G}\). Differentiability of \(\mathcal {G}\) is a question of differentiability of F and \( \mathcal {N} \), which is discussed in the following. Note that there, and henceforth, we denote by \(H'(a): A \rightarrow B\) the Gâteaux derivative of a function \(H:A \rightarrow B\) and define Gâteaux differentiability in the sense of [37, Sect. 2.6], i.e., require \(H'(a)\) to be a bounded linear operator. The basis for differentiability of the forward operator is the following lemma, which is a direct extension of [37, Lemma 4.12].

Lemma 11

Let ABS be Banach spaces such that \(A \hookrightarrow S\). For \(\Sigma \subset \mathbb {R}^N\) open and bounded, and \(r \in [1,\infty )\), let \(\mathcal {A}\), \(\mathcal {B}\) be Banach spaces such that \(\mathcal {A}\hookrightarrow L^r(\Sigma ,A)\) and \(\mathcal {A}\hookrightarrow L^\infty (\Sigma ,S)\), and \(L^r(\Sigma ,B) \hookrightarrow \mathcal {B}\). Further, let \(H:\Sigma \times A \rightarrow B\) be a function such that \(H(z,\cdot )\) is Gâteaux differentiable for every \(z \in \Sigma \) with derivative \(H'(z,\cdot )\), and such that H is locally Lipschitz continuous in the sense that, for any \(M>0\) there exists \(L(M)>0\) such that for every \(a,\xi \in A\) with \(\max \{\Vert a\Vert _S,\Vert \xi \Vert _S\} \le M\)

$$\begin{aligned} \Vert H(z,a) - H(z,\xi )\Vert _B \le L(M)(\Vert a-\xi \Vert _A + (\max \{\Vert a\Vert _A,\Vert \xi \Vert _A \}+1) \Vert a-\xi \Vert _S) . \nonumber \\ \end{aligned}$$
(25)

If the Nemytskii operators \(H:\mathcal {A}\rightarrow \mathcal {B}\) given as \(H(a)(z) = H(z,a(z))\) and \(H':\mathcal {A}\rightarrow \mathcal {L}(\mathcal {A},\mathcal {B})\) given as \(H'(a)(\xi )(z) = H'(z,a(z))(\xi (z))\) are well defined, then \(H:\mathcal {A}\rightarrow \mathcal {B}\) is also Gâteaux differentiable with \(H'(a) \in \mathcal {L}(\mathcal {A},\mathcal {B})\) given as \(H'(a)(\xi )(z) = H'(z,a(z))(\xi (z))\). Further, \(H'\) is locally bounded in the sense that, for any bounded set \(\tilde{\mathcal {A}} \subset \mathcal {A}\), \( \sup _{a \in \tilde{\mathcal {A}}} \Vert H'(a)\Vert < \infty . \)

Proof

Fix \(M>0\) and \(z \in \Sigma \). Local Lipschitz continuity implies for any \(\tilde{a},\xi \in A\) with \(\Vert \tilde{a}\Vert _S+1\le M\),

$$\begin{aligned} \Vert H'(z,\tilde{a})\xi \Vert _{{B}} = \lim _{\delta \rightarrow 0 } \left\| \frac{H(z,\tilde{a}+\delta \xi ) - H(z,\tilde{a})}{\delta } \right\| _B \le L(M)(\Vert \xi \Vert _A + (\Vert \tilde{a}\Vert _A + 2) \Vert \xi \Vert _S) . \nonumber \\ \end{aligned}$$
(26)

Next, define \(h:[0,1] \rightarrow B\) as \(h(s) = H(z,a + \epsilon s \xi )\), for \(a \in A\) and \(\epsilon \in (0,1)\) such that \(\Vert a\Vert _S+2 \le M\), \(\epsilon \Vert \xi \Vert _S \le 1\). We note that h is differentiable and Lipschitz continuous (hence absolutely continuous), such that by the fundamental theorem of calculus for Bochner spaces, see [15, Theorem 2.2.17], \(h(1) - h(0) = \int _0^1\,h'(s) \,\textrm{d}s.\) This yields

$$\begin{aligned}{} & {} \left( \frac{1}{\epsilon }\Vert H(z,a+\epsilon \xi )-H(z,a)-\epsilon H'(z,a)\xi \Vert _B\right) ^r\\{} & {} \quad =\frac{1}{\epsilon ^r}\left\| \int _0^1 \epsilon H'(z,a + s \epsilon \xi ) \xi -\epsilon H'(z,a)\xi {\,\textrm{d}s} \right\| _B^r \\{} & {} \quad \le \left( \int _0^1 \left\| H'(z,a + s \epsilon \xi ) \xi \right\| _B+ \left\| H'(z,a)\xi \right\| _B \,\textrm{d}s \right) ^r \\{} & {} \quad \le \left( \int _0^1 \sup _{\tilde{s} \in [0,1]} 2\left\| H'(z,a + \tilde{s} \epsilon \xi ) \xi \right\| _B \,\textrm{d}s \right) ^r \\{} & {} \quad \le \sup _{\tilde{s}\in [0,1]} 2^r \Vert H'(z,a+\tilde{s}\epsilon \xi ) \xi \Vert _B^r \le 2^{2r-1}L(M)^r (\Vert \xi \Vert _A^r + (\Vert a\Vert _A+\Vert \xi \Vert _A+2)^r\Vert \xi \Vert _S^r). \end{aligned}$$

Now by \(\mathcal {A}\hookrightarrow L^\infty (\Sigma ,S)\), for \(a,\xi \in \mathcal {A}\), we can apply the above with \(M:= \sup _{z \in \Sigma }\Vert a(z)\Vert _S+2\) and \(\epsilon \) sufficiently small such that \(\epsilon \sup _{z \in \Sigma }\Vert \xi (z)\Vert _S \le 1\) and obtain

$$\begin{aligned} r_H(\epsilon ):= & {} \int _\Sigma \left( \frac{1}{\epsilon } \Vert H(z,a(z)+\epsilon \xi (z))-H(z,a(z))-\epsilon H'(z,a(z))\xi (z) \Vert _B \right) ^r\,\textrm{d}z\\{} & {} \le \int _\Sigma 2^{2r-1}L(M)^r (\Vert \xi (z)\Vert _A^r + (\Vert a(z)\Vert _A + \Vert \xi (z)\Vert _A+2)^r\Vert \xi (z)\Vert _S^r) dz \\{} & {} \le 2^{2r-1}L(M)^r\left( \Vert \xi \Vert _\mathcal {A}^r + \sup _{z\in \Sigma } \Vert \xi (z)\Vert _S^r \int _\Sigma (\Vert a(z)\Vert _A + \Vert \xi (z)\Vert _A + 2)^r \right) < \infty \end{aligned}$$

Using Lebesgue’s Dominated Convergence Theorem, we deduce \(\lim _{\epsilon \rightarrow 0} r_H(\epsilon )=0\), which, by \(L^r(\Sigma ,B) \hookrightarrow \mathcal {B}\), shows Gâteaux differentiability.

Local boundedness as claimed follows directly from choosing \(M:= \sup _{a \in \tilde{\mathcal {A}}} \sup _{z \in \Sigma }\Vert a(z)\Vert _S+1\) and integrating the rth power of (26) over \(\Sigma \). \(\square \)

Proposition 12

(Differentiability) Let Assumption 1 hold and let \(\sigma \in \mathcal {C}^1(\mathbb {R},\mathbb {R})\). Assume that for every \(t \in (0,T)\), the mapping \(F(t,\cdot ,\cdot ):X \times V \rightarrow W\) is jointly Gâteaux differentiable with respect to the second and third arguments, with \((t,\lambda ,u,\xi ,v) \mapsto F'(t,\lambda ,u)(\xi ,v)\) satisfying the Carathéodory conditions.

In addition, assume that F satisfies the following local Lipschitz continuity condition: For all \( M\ge 0\) there exists \(L(M)>0\), such that for all \(v_i \in V\) and \(\lambda _i \in X\), \(i=1,2\), with \(\max \{\Vert v_i\Vert _H, \Vert \lambda _i\Vert _X\} \le M\) and for almost every \(t \in (0,T)\),

$$\begin{aligned}{} & {} \Vert F(t,\lambda _1,v_1)-F(t,\lambda _2,v_2)\Vert _W \le L(M) (\Vert v_1-v_2\Vert _V\nonumber \\{} & {} \quad +(\max \{\Vert v_1\Vert _V,\Vert v_2\Vert _V\} + 1)( \Vert v_1-v_2\Vert _H +\Vert \lambda _1-\lambda _2\Vert _X)). \end{aligned}$$
(27)

Then \(\mathcal {G}:X \times \mathbb {R}^m \times U_0 \times \mathcal {V}\times \Theta \rightarrow \mathcal {W}\times H \times \mathcal {Y}\) is Gâteaux differentiable with

$$\begin{aligned}&\mathcal {G}'(\lambda ,\alpha ,u_0,u,\theta )\\ {}&= \begin{pmatrix} -F'_\lambda (\cdot ,\lambda ,u) &{} - \mathcal {N} '_\alpha (\alpha ,u,\theta ) &{} 0 &{} \frac{d}{dt}-F'_u(\cdot ,\lambda ,u) - \mathcal {N} '_u(\alpha ,u,\theta ) &{} - \mathcal {N} '_\theta (\alpha ,u,\theta )\\ 0 &{} 0 &{} -\text {Id} &{} (\cdot )_{t=0} &{} 0\\ 0 &{} 0 &{} 0 &{} M &{} 0 \end{pmatrix}. \end{aligned}$$

Furthermore, \(\mathcal {G}'(\cdot )\) is locally bounded in the sense specified in Lemma 11.

Proof

First note that it suffices to show corresponding differentiability and local boundedness assertions for the different components of \(\mathcal {G}\) given as \(u \mapsto \dot{u}\), F, \( \mathcal {N} \), \((u,u_0) \mapsto u(0) - u_0\) and M. For all except F and \( \mathcal {N} \), the corresponding assertions are immediate, hence we focus on the latter two.

Regarding F, this is an immediate consequence of Lemma 11 with \(A = X \times V\), \(B = W\), \(S = X \times H\), \(\Sigma = (0,T)\), \(r = 2\), \(\mathcal {A}= X \times \mathcal {V}\) with \(\Vert (\lambda ,v)\Vert _\mathcal {A}= \Vert \lambda \Vert _X + \Vert v\Vert _\mathcal {V}\), \(\mathcal {B}= \mathcal {W}\) and \(H(t,(\lambda ,v)) = F(t,\lambda ,v)\).

For \( \mathcal {N} \), this is again an immediate consequence of Lemma 11 with \(A =S= \mathbb {R}^m \times \mathbb {R}\times \Theta \), \(B = \mathbb {R}\), \(\Sigma = (0,T) \times \Omega \), \(r=\max \{2,{\hat{p}}\}\), \(\mathcal {A}= \mathbb {R}^m \times \mathcal {V}\times \Theta \) with \(\Vert (\alpha ,v,\theta )\Vert _\mathcal {A}= |\alpha | + \Vert v\Vert _\mathcal {V}+ |\theta |\), \(\mathcal {B}= \mathcal {W}\) and \(H((t,x),(\alpha ,v,\theta )) = \mathcal {N} _{\theta }(\alpha ,v(t,x))\). \(\square \)
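As a practical aside, the Gâteaux derivatives collected in \(\mathcal {G}'\) can be checked numerically once the forward operator is discretized. The following sketch (ours, for any discretized forward map G and candidate derivative dG supplied as Python callables; not part of the analysis) compares a directional derivative with a forward difference quotient.

```python
import numpy as np

def directional_derivative_check(G, dG, x, h, eps=1e-6):
    """Compare the directional derivative dG(x)[h] with a forward difference quotient.

    G  : callable mapping a parameter vector x to the residual vector G(x)
    dG : callable returning the directional derivative of G at x in direction h
    A small return value indicates that dG is consistent with G at x along h.
    """
    finite_diff = (G(x + eps * h) - G(x)) / eps
    analytic = dG(x, h)
    return np.linalg.norm(finite_diff - analytic) / max(np.linalg.norm(analytic), 1e-12)
```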

Remark 13

For stronger image spaces \(W\nsupseteq L^q(\Omega ), \forall q\in [1,\infty )\), differentiability of F remains valid if (27) holds, while differentiability of \( \mathcal {N} \) requires a smoother activation function, e.g., the one suggested in Remark 29 below.

3.3 Lipschitz Continuity and the Tangential Cone Condition

In this section, we focus on showing a rather strong Lipschitz-type result for the neural network. This property allows us to apply (finite-dimensional) gradient-based algorithms to learn the neural networks, where the Lipschitz constants of the network and of its derivatives are used to determine the step size. Moreover, by this Lipschitz continuity, the tangential cone condition for (14) can be verified. This condition, together with solvability of the learning-informed PDE, answers the important question of uniqueness of a minimizer in the limit case of (14), as mentioned in Remark 10.

For ease of notation, we assume in this lemma that the outer layer of the neural network has activation \(\sigma \), as in the lower layers. Adapting the proof for \(\sigma =\text {Id}\) in the last layer is straightforward.

Lemma 14

(Lipschitz properties of neural networks) Consider an L-layer neural network \( \mathcal {N} : \mathbb {R}^{m+1}\times \Theta \ni (z,\theta ) \mapsto \mathcal {N} _\theta ((z_1,\ldots ,z_m),z_{m+1})\in \mathbb {R}\), \(L\in \mathbb {N}\) (z taking the role of \((\alpha ,u(t,x))\) in Lemma 4). Denote by \( \mathcal {N} ^{i}_{\theta ^{i}}\) the i lowest layers of the neural network, depending only on z and on the i lowest-index pairs of parameters \(\theta ^{i}\), while \( \mathcal {N} ^0_{\theta ^0}(z):=z\in \mathbb {R}^{m+1}\).

Fix any subset \(\mathcal {B}\subseteq \mathbb {R}^{m+1}\times \Theta \). For each \(1\le i\le L\), define \(\mathcal {B}_i:=\{\omega ^i \mathcal {N} ^{i-1}_{\theta ^{i-1}}(z)+\beta ^i \mid (z,\theta )\in \mathcal {B}\}\), that is, the image of the i-th layer before applying the activation function. Assume that the activation function \(\sigma \in \mathcal {C}^1(\mathbb {R},\mathbb {R})\) associated to \(\mathcal {N}\) satisfies, for all \(1\le i\le L\), the Lipschitz inequalities

$$\begin{aligned} \left| \sigma (x) - \sigma (\tilde{x})\right| \le C_\sigma \left| x - \tilde{x}\right| , \qquad \left| \sigma '(x) - \sigma '(\tilde{x})\right| \le C'_\sigma \left| x - \tilde{x}\right| \end{aligned}$$

for all x, \(\tilde{x}\in {\mathcal {B}_i}\) and some positive constants \(C_\sigma \), \(C'_\sigma \), and that \(s_i:=\sup _{x\in {\mathcal {B}_i}}\left| \sigma '(x)\right| < \infty \).

Fix now a layer l, \(1\le l\le L\), as well as \((\tilde{z},\theta )\), \((z,\bar{\theta })\), \((z,\hat{\theta })\in \mathcal {B}\), where \(\bar{\theta }\) differs from \(\theta \) only in that its l-th weight is replaced by some \(\tilde{\omega ^l}\) and \(\hat{\theta }\) differs from \(\theta \) only in that its l-th bias is replaced by some \(\tilde{\beta ^l}\); explicitly,

$$\begin{aligned} (\bar{\theta }_j)_k = {\left\{ \begin{array}{ll}\tilde{\omega }^l, &{} (j,k) = (1,l), \\ (\theta _j)_k &{} \text {otherwise,}\end{array}\right. } \qquad (\hat{\theta }_j)_k = {\left\{ \begin{array}{ll}\tilde{\beta }^l, &{} (j,k) = (2,l), \\ (\theta _j)_k &{} \text {otherwise.}\end{array}\right. } \end{aligned}$$

Then \( \mathcal {N} \) satisfies the Lipschitz estimates

$$\begin{aligned} \begin{aligned} \left| \mathcal {N} (z,\theta ) - \mathcal {N} (\tilde{z},\theta )\right|&\le (C_\sigma )^L \left( \prod _{k=1}^L\left| \omega ^k\right| \right) \left| z - \tilde{z}\right| , \\ \left| \mathcal {N} (z,\theta ) - \mathcal {N} (z,\bar{\theta })\right|&\le (C_\sigma )^{L-l+1} \left( \prod _{k=l+1}^L\left| \omega ^k\right| \right) \left| \mathcal {N} ^{l-1}_{\theta ^{l-1}}(z)\right| \left| \omega ^l - \tilde{\omega }^l\right| , \\ \left| \mathcal {N} (z,\theta ) - \mathcal {N} (z,\hat{\theta })\right|&\le (C_\sigma )^{L-l+1}\left| \beta ^l - \tilde{\beta }^l\right| , \end{aligned} \end{aligned}$$
(28)

while its derivatives with regards to z, \(\omega ^l\) and \(\beta ^l\), respectively, satisfy the Lipschitz estimates

$$\begin{aligned} \left| \mathcal {N} '_z(z,\theta ) - \mathcal {N} '_z({\tilde{z}},\theta )\right|&\le C^z_1\left| \omega ^1\right| \left| z - \tilde{z}\right| , \end{aligned}$$
(29)
$$\begin{aligned} \left| \mathcal {N} '_{\omega ^l}(z,\theta ) - \mathcal {N} '_{\omega ^l}(z,\bar{\theta })\right|&\le C^{\omega ^l}_l\left| \mathcal {N} ^{l-1}_{\theta ^{l-1}}(z)\right| \left| \omega ^l - \tilde{\omega }^l\right| , \end{aligned}$$
(30)
$$\begin{aligned} \left| \mathcal {N} '_{\beta ^l}(z,\theta ) - \mathcal {N} '_{\beta ^l}(z,\hat{\theta })\right|&\le C^{\beta ^l}_l\left| \beta ^l - \tilde{\beta }^l\right| , \end{aligned}$$
(31)

where one defines \(C^z_{L+1}:=C^{\omega ^l}_{L+1}:=C^{\beta ^l}_{L+1}:=0\) and, by backward recursion for \(1\le i\le L\),

$$\begin{aligned} \begin{aligned} C^z_i&:= C'_\sigma (C_\sigma )^{i-1}\left( \prod _{k=i+1}^Ls_k\right) \left( \prod _{k=1}^{L}\left| \omega ^k\right| \right) +C^z_{i+1}s_i\left| \omega ^{i+1}\right| , \\ C^{\omega ^l}_i&:= C'_\sigma (C_\sigma )^{i-l}\left( \prod _{k=i+1}^Ls_k\right) \left( \prod _{k=l+1}^{L}\left| \omega ^k\right| \right) \left| \mathcal {N} ^{l-1}(z,\theta ^{l-1})\right| + C^{\omega ^l}_{i+1}s_i\left| \omega ^{i+1}\right| , \\ C^{\beta ^l}_i&:= C'_\sigma (C_\sigma )^{i-l}\left( \prod _{k=i+1}^Ls_k\right) \left( \prod _{k=l+1}^{L}\left| \omega ^k\right| \right) +C^{\beta ^l}_{i+1}s_i\left| \omega ^{i+1}\right| . \end{aligned} \end{aligned}$$
(32)

Proof

See Appendix B. \(\square \)
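For concreteness, the backward recursion (32) can be tabulated numerically once (assumed) bounds on the weight norms \(|\omega ^k|\), the suprema \(s_k\) and the constants \(C_\sigma \), \(C'_\sigma \) are at hand. The following Python sketch does nothing more than evaluate (32); it is not part of the analysis, and all numerical values in the example call are made up.

```python
import numpy as np

def lemma14_constants(w_norms, s_sup, C_sig, Cp_sig, l, N_prev_bound):
    """Tabulate the constants of (32) by backward recursion (illustration only).

    w_norms      : [|w^1|, ..., |w^L|]  assumed bounds on the layer weight norms
    s_sup        : [s_1, ..., s_L]      suprema of |sigma'| on the sets B_i
    C_sig, Cp_sig: Lipschitz constants of sigma and sigma' on the relevant sets
    l            : index of the perturbed layer (1-based), cf. (30)-(31)
    N_prev_bound : assumed bound on |N^{l-1}_{theta^{l-1}}(z)| over B
    Returns dicts {i: C^z_i}, {i: C^{omega^l}_i}, {i: C^{beta^l}_i}, i = 1..L.
    """
    L = len(w_norms)
    w = {i + 1: w_norms[i] for i in range(L)}
    s = {i + 1: s_sup[i] for i in range(L)}
    Cz, Cw, Cb = {L + 1: 0.0}, {L + 1: 0.0}, {L + 1: 0.0}   # C_{L+1} := 0
    for i in range(L, 0, -1):
        prod_s = np.prod([s[k] for k in range(i + 1, L + 1)])       # prod_{k=i+1}^L s_k
        prod_w_all = np.prod([w[k] for k in range(1, L + 1)])       # prod_{k=1}^L |w^k|
        prod_w_tail = np.prod([w[k] for k in range(l + 1, L + 1)])  # prod_{k=l+1}^L |w^k|
        w_next = w.get(i + 1, 0.0)   # |w^{i+1}|; irrelevant for i = L since C_{L+1} = 0
        Cz[i] = Cp_sig * C_sig**(i - 1) * prod_s * prod_w_all + Cz[i + 1] * s[i] * w_next
        Cw[i] = (Cp_sig * C_sig**(i - l) * prod_s * prod_w_tail * N_prev_bound
                 + Cw[i + 1] * s[i] * w_next)
        Cb[i] = Cp_sig * C_sig**(i - l) * prod_s * prod_w_tail + Cb[i + 1] * s[i] * w_next
    return Cz, Cw, Cb

# example call with made-up bounds for a 3-layer network, perturbing layer l = 2
Cz, Cw, Cb = lemma14_constants([1.5, 0.8, 1.2], [1.0, 1.0, 1.0], 1.0, 1.0, 2, 2.0)
```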

Remark 15

If \(\sigma '\) is locally Lipschitz continuous on \(\mathbb {R}\), the existence of \(C_\sigma \), \(C'_\sigma \) and the \(s_i\) is clear whenever \(\mathcal {B}\) is a bounded set. Thus, it is a direct consequence of Lemma 14 (or follows simply from the properties of the functions of which \( \mathcal {N} \) is composed) that the mapping \((z,\theta ) \mapsto \mathcal {N} (z,\theta )\), restricted to any bounded set, is bounded and Lipschitz continuous with Lipschitz continuous derivative. This is relevant for gradient-based optimization algorithms used to solve the learning problem (13), where Lipschitz continuity of the derivative of the objective function is a key ingredient for (local) convergence, see for instance [36] for a result in Hilbert spaces. In particular, Lipschitz continuity of \(\theta \mapsto \mathcal {N} (z,\theta )\) for fixed z is useful for the learning problem (13), where the exact \((\lambda ,u)\) is known. In this case, one simply learns the finite-dimensional parameter \(\theta \), so standard convergence results on gradient-based methods in finite-dimensional vector spaces apply, see, e.g., [34, Sect. 5.3].
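To illustrate how such Lipschitz information enters a gradient method for the learning problem with known state, the following sketch fits a one-hidden-layer network to samples of a known nonlinearity by plain gradient descent with the classical step size \(1/L_{\text {grad}}\), where \(L_{\text {grad}}\) is an assumed bound on the Lipschitz constant of the gradient of the least-squares objective; the network size, the sampled data and the value of \(L_{\text {grad}}\) are ad-hoc choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.uniform(-1.0, 1.0, size=(200, 1))     # sampled states, playing the role of u(t,x)
f_true = lambda v: -v * np.abs(v)             # e.g. f(u) = -u|u| (p = 1)
y = f_true(z)

# one hidden layer with tanh activation (C^1 with locally Lipschitz derivative)
W1, b1 = rng.normal(size=(16, 1)), np.zeros((16, 1))
W2, b2 = rng.normal(size=(1, 16)), np.zeros((1, 1))

def forward(v):
    a1 = np.tanh(W1 @ v.T + b1)
    return (W2 @ a1 + b2).T, a1

L_grad = 50.0            # assumed bound on the Lipschitz constant of the gradient
step = 1.0 / L_grad      # classical step-size rule for L-smooth objectives

for it in range(2000):
    pred, a1 = forward(z)
    r = pred - y                                      # residual of the least-squares fit
    gW2 = (r.T @ a1.T) / len(z)                       # backpropagation for 0.5*mean(r^2)
    gb2 = r.mean(axis=0, keepdims=True).T
    da1 = (W2.T @ r.T) * (1.0 - a1**2)
    gW1 = (da1 @ z) / len(z)
    gb1 = da1.mean(axis=1, keepdims=True)
    W1, b1, W2, b2 = W1 - step*gW1, b1 - step*gb1, W2 - step*gW2, b2 - step*gb2
```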

Based on these Lipschitz estimates, we can study the tangential cone condition for the problem (14), given a learned \(\mathcal {N}_\theta \). For this, we assume that \(\mathcal {N}_\theta (\alpha ,u)=\mathcal {N}_\theta (u)\).

Condition 16

(Tangential cone condition [22, Expression (2.4)]) We say that the tangential cone condition for a mapping \(G:\mathcal {D}(G)(\subseteq X)\rightarrow Y\) holds in a ball \(\mathcal {B}^X_\rho (x^\dagger )\), if there exists \(c_{tc}<1\) such that

$$\begin{aligned} \Vert G(x)-G({\tilde{x}})-G'(x)(x-{\tilde{x}})\Vert _{{Y}}\le c_{tc}\Vert G(x)-G({\tilde{x}})\Vert _{{Y}}\qquad \forall x,\tilde{x}\in \mathcal {B}_\rho (x^\dagger ). \end{aligned}$$

Here, \(G'(x)h\) denotes the directional derivative [24].

Analyzed in the all-at-once setting (14), the tangential cone condition reads as

$$\begin{aligned}{} & {} \Vert F(\lambda ,u)-F(\tilde{\lambda },{\tilde{u}} )-F_\lambda '(\lambda ,u)(\lambda -\tilde{\lambda }) -F_u'(\lambda ,u)(u-{\tilde{u}}) \nonumber \\{} & {} \quad +\mathcal {N}_\theta (u)-\mathcal {N}_\theta ({\tilde{u}})- \mathcal {N} _\theta '(u)(u-{\tilde{u}})\Vert _\mathcal {W}\nonumber \\{} & {} \le c_{tc} \Bigl (\Vert \dot{u}-\dot{{\tilde{u}}}-F(\lambda ,u)+F(\tilde{\lambda },{\tilde{u}})-\mathcal {N}_\theta (u) +\mathcal {N}_\theta ({\tilde{u}})\Vert _\mathcal {W}^2 \nonumber \\{} & {} \quad +\Vert u(0)-{\tilde{u}}(0)-u_0+{\tilde{u}}_0\Vert _{U_0}^2 +\Vert M(u-{\tilde{u}})\Vert _\mathcal {Y}^2\Bigr )^{1/2} \end{aligned}$$
(33)

for all \((\lambda ,u_0,u), (\tilde{\lambda },{\tilde{u}}_0,{\tilde{u}})\in B_\rho ^{X \times U_0\times \mathcal {V}}(\lambda ^\dagger ,u_0^\dagger ,u^\dagger )\), where \(F'\) and \( \mathcal {N} ' \) are the Gâteaux derivatives.

The tangential cone condition strongly depends on the PDE model F and on the architecture of \( \mathcal {N} \). By the triangle inequality, a sufficient condition for (33) to hold is that the tangential cone condition holds for F and for \( \mathcal {N} \) separately. The tangential cone condition in combination with solvability of the equation \(G(x)=0\) ensures uniqueness of a minimum-norm solution [22, Proposition 2.1] (see Appendix A). In the all-at-once formulation, solvability of the operator equation \(G(x)=0\) amounts to solvability of the learning-informed PDE together with exact measurements, i.e., \(\delta =0\). For solvability of the learning-informed PDE, we refer to Proposition 24 in Sect. 4. In the following, we focus on the tangential cone condition for the neural networks by studying Condition 16 for \(G:=\mathcal {N}_\theta .\)

Lemma 17

(Tangential cone condition for neural networks) The tangential cone condition in Condition 16 for \(G = \mathcal {N} _\theta {: \mathcal {V}\rightarrow \mathcal {W}}\) with fixed parameter \(\theta \) holds in any ball \( \mathcal {B} ^\mathcal {V}_\rho (u^\dagger ) \) if \(M=\text {Id}\), \( Y\hookrightarrow L^{\hat{p}}(\Omega )\) with \({\hat{p}}>0\) as in (6), \(\sigma \in \mathcal {C}^1(\mathbb {R},\mathbb {R})\), and \(\rho \), depending on the Lipschitz constant in Lemma 14, is sufficiently small.

Proof

Since \(\mathcal {V}\hookrightarrow L^\infty ((0,T)\times \Omega )\), for \(u, {\tilde{u}} \in \mathcal {B}_\rho ^\mathcal {V}(u^\dagger )\) we have for almost all \((t,x)\in (0,T)\times \Omega \) that \(u(t,x), \tilde{u}(t,x)\in \mathcal {B}\) for some bounded set \(\mathcal {B}\). Thus, we can use Lemma 14 with such a \(\mathcal {B}\), and in particular the estimate (62) for \(z=u(t,x)\), to obtain

$$\begin{aligned}&\Vert \mathcal {N}_\theta (u)-\mathcal {N}_\theta ( {\tilde{u}})-\mathcal {N}_\theta '(u)(u- {\tilde{u}})\Vert _\mathcal {W}\\&\quad =\left\| \int _0^1 \mathcal {N}_\theta '( {\tilde{u}}+\mu (u- {\tilde{u}}))\,d\mu (u- {\tilde{u}})-\mathcal {N}_\theta '(u)(u- {\tilde{u}})\right\| _\mathcal {W}\\&\quad \le C_{L^{\hat{p}}\rightarrow W}\left\| \int _0^1 \left( \mathcal {N}_\theta '( {\tilde{u}}+\mu (u- {\tilde{u}}))-\mathcal {N}_\theta '(u)\right) d\mu (u- {\tilde{u}})\right\| _{L^2(0,T;L^{\hat{p}}(\Omega ))}\\&\quad \le C_{L^{\hat{p}}\rightarrow W}C^z_1|\omega _1|\left\| \int _0^1 (1-\mu )\,d\mu |u- {\tilde{u}}|^2\right\| _{L^2(0,T;L^{\hat{p}}(\Omega ))}\\&\quad \le (1/2)C_{Y\rightarrow L^{\hat{p}}(\Omega )}C_{L^{\hat{p}}\rightarrow W} C^z_1|\omega _1|\Vert u- {\tilde{u}}\Vert _{L^\infty ((0,T)\times \Omega )}\Vert u- {\tilde{u}}\Vert _{\mathcal {Y}}\\&\quad \le \rho \, C_{\mathcal {V}\rightarrow L^\infty ((0,T)\times \Omega )} C_{Y\rightarrow L^{\hat{p}}(\Omega )}C_{L^{\hat{p}}(\Omega )\rightarrow W} C^z_1|\omega _1|\Vert u- {\tilde{u}}\Vert _{\mathcal {Y}}=: c_{tc}\Vert u- {\tilde{u}}\Vert _{\mathcal {Y}}\\&\quad = {c_{tc}\Vert M(u- {\tilde{u}})\Vert _{\mathcal {Y}}} \end{aligned}$$

where \(C^z_1|\omega _1|\) is the Lipschitz constant of \( \mathcal {N} '_u\) derived in Lemma 14, and \(c_{tc}<1\) if

$$\begin{aligned} \rho <1/\left( C_{\mathcal {V}\rightarrow L^\infty ((0,T)\times \Omega )} C_{Y\rightarrow L^{\hat{p}}(\Omega )}C_{L^{\hat{p}}(\Omega )\rightarrow W} C^z_1|\omega _1|\right) . \end{aligned}$$

We note that having full observation, i.e. \(M=\text {Id}\), is crucial for establishing the tangential cone condition, as it allows us to link the estimate from \(\Vert u- {\tilde{u}}\Vert _\mathcal {Y}\) to \(\Vert M(u- {\tilde{u}})\Vert _\mathcal {Y}\), yielding the last quantity on the right hand side of (33). The necessity of full observation has also been mentioned in [24]. \(\square \)

Now using [22, Proposition 2.1] together with Lemma 17, a uniqueness result follows.

Proposition 18

(Uniqueness of minimizer for the limit case of (14)) With \(\nu \ge 1\), consider the regularizer \(\mathcal {R}_1=\Vert \cdot \Vert ^\nu _{X\times \mathbb {R}^m \times U_0\times \mathcal {V}}\), and assume that the conditions in Lemma 17 are satisfied. Moreover, suppose that the tangential cone condition for F holds in \(\mathcal {B}_\rho ^{X\times U_0\times \mathcal {V}}(\lambda ^\dagger ,u_0^\dagger ,u^\dagger )\) and that the equation \(\mathcal {G}(\lambda ,u_0,u,\theta ) = 0\) with \(\mathcal {G}\) in (12) and \(\theta \) fixed is solvable in \(\mathcal {B}_\rho ^{X\times U_0\times \mathcal {V}}(\lambda ^\dagger ,u_0^\dagger ,u^\dagger )\). Then the limit case of the parameter identification problem (14) admits a unique minimizer in the ball \(\mathcal {B}_\rho ^{X\times U_0\times \mathcal {V}}(\lambda ^\dagger ,u_0^\dagger ,u^\dagger )\).

Remark 19

We refer to Sect. 4 below for solvability of the learning-informed PDE in an application. We refer to [24] for concrete choices of F and of function space settings such that the tangential cone condition can be verified.

Note that, while the tangential cone condition for the limit case of the parameter identification problem (14) can be confirmed as above, the same question for the learning problem (13) remains open.

4 Application

In this section, as a special case of the dynamical system (1), we examine a class of general parabolic problems given as

$$\begin{aligned}&\dot{u} - \nabla \cdot (a\nabla u) + cu - f({\alpha ,u}) = \varphi \quad{} & {} \text{ in } \Omega \times (0,T),\nonumber \\&u|_{\partial \Omega }=0{} & {} \text{ in } (0,T), \nonumber \\&u(0) = u_0{} & {} \text{ in } \Omega , \end{aligned}$$
(34)

where \(\Omega \subset \mathbb {R}^d\) is a bounded \(C^2\)-class domain, with \( d\in \{1,2,3 \}\) being relevant in practice. The nonlinearity f, which can be replaced by a neural network later, is assumed to be given as the Nemytskii operator \(f:\mathbb {R}^m\times \mathcal {V}\rightarrow \mathcal {W}\) [32, Sect. 1.3] of a pointwise function \(f:\mathbb {R}^m\times \mathbb {R}\rightarrow \mathbb {R}\), making use of the notation \( [f(\alpha ,u)](t,x) = f(\alpha ,u(t,x))\). We initially work with the following parameter spaces

$$\begin{aligned} \varphi \in X_\varphi := H^{-1}(\Omega ), \quad c\in X_c := L^2(\Omega ),\quad a\in X_a:= W^{1,Pa}(\Omega ),\quad u_0\in U_0:= H^2(\Omega ), \end{aligned}$$
(35)

where \(Pa >d\), and, for existence of a solution, we will require the constraints

$$\begin{aligned} 0<\underline{a}\le a(x)\le \overline{a}\ \quad \text {for a.e. } x\in \Omega . \end{aligned}$$
(36)

Thus, the overall parameter space X is given as \(X=X_\varphi \times X_c\times X_a\).

4.1 Unique Existence Results for (34)

Our next goal is to study unique existence of solutions to (34). The main purpose of this is to inspire a relevant choice of function space setting for the all-at-once setting of (13) and (14), even though unique existence is not required there. Also, a unique existence result is of interest for studying the reduced setting, where well-definedness of the parameter-to-state map is needed.

We will proceed in two steps: In the first step, we prove that (34) admits a unique solution

$$\begin{aligned} u\in W^{1,\infty ,\infty }(0,T;L^2(\Omega ),L^2(\Omega ))\cap W^{1,\infty ,2}(0,T;H^1_0(\Omega ),H^1_0(\Omega )) \end{aligned}$$

with \(W^{1,p,q}(0,T;V_1,V_2):=\{u\in L^p(0,T;V_1):\dot{u}\in L^q(0,T;V_2)\}\). Then, in the second step, we lift the regularity of u to the somewhat stronger space

$$\begin{aligned} u\in L^\infty ((0,T)\times \Omega )\end{aligned}$$

to achieve boundedness in time and space of the solution, which will later serve our purpose of working with a neural network acting pointwise. It is worth noting that the study of unique existence is first carried out for classes of general nonlinearities f satisfying some specific assumptions, such as pseudomonotonicity and a growth condition, see Lemmas 21 and 23 below. The nonlinearity f as a neural network will then be considered in Proposition 24 and Remark 25.

Before investigating (34), we summarize the unique existence theory as provided in [32, Theorems 8.18, 8.31] for the autonomous case.

Theorem 20

Let \({\hat{V}}\) be a Banach space and \({\hat{H}}\) a Hilbert space, and assume that for \({\hat{F}}:{\hat{V}}\rightarrow {\hat{V}}^*\), \(u_0 \in {\hat{H}}\) and \(\varphi \in {\hat{V}}^*\), with the Gelfand triple \({\hat{V}}\subseteq {\hat{H}}\cong {\hat{H}}^* \subseteq {\hat{V}}^*\), the following holds:

  S1.

    \({\hat{F}}\) is pseudomonotone.

  S2.

    \({\hat{F}}\) is semi-coercive, i.e.,

    $$\begin{aligned} \forall v \in {\hat{V}}: \langle {\hat{F}}(v),v\rangle _{{\hat{V}}^*,{\hat{V}}} \ge c_0|v|^2_{\hat{V}}-c_1|v|_{\hat{V}}-c_2\Vert v\Vert _{{\hat{H}}}^2 \end{aligned}$$

    for some \(c_0 > 0\) and some seminorm \(|.|_{{\hat{V}}}>0\) satisfying \(\forall v \in {\hat{V}}: \Vert v\Vert _{{\hat{V}}} \le c_{|.|}(|v|_{{\hat{V}}} + \Vert v\Vert _{{\hat{H}}})\).

  S3.

    \({\hat{F}}\), \(u_0\) and \(\varphi \) satisfy the regularity condition \({\hat{F}}(u_0)-\varphi \in {\hat{H}}\), \(u_0 \in {\hat{V}}\) and

    $$\begin{aligned} \langle {\hat{F}}(u)-{\hat{F}}(v), u-v\rangle _{{\hat{V}}^*,{\hat{V}}}\ge C_0|u-v|^2_{{\hat{V}}} -C_2\Vert u-v\Vert _{{\hat{H}}}^2 \end{aligned}$$

    for all \(u,v \in {\hat{V}}\) with some \(C_0>0.\)

Then the abstract Cauchy problem

$$\begin{aligned} \dot{u}(t) + {\hat{F}}(u(t)) = \varphi \, \qquad u(0) = u_0 \end{aligned}$$

has a unique solution \(u \in W^{1,\infty ,\infty }(0,T;{\hat{H}},{\hat{H}})\cap W^{1,\infty ,2}(0,T;{\hat{V}},{\hat{V}})\).

By verifying the conditions in Theorem 20, we now obtain unique existence as follows.

Lemma 21

(Unique existence) Let the nonlinearity \(f(\alpha ,\cdot ):H_0^1(\Omega ) \rightarrow H_0^1(\Omega )^*\) be given as the Nemytskii mapping of a measurable function \(f(\alpha ,\cdot ):\mathbb {R}\rightarrow \mathbb {R}\) that satisfies

$$\begin{aligned} \begin{aligned}&(-f(\alpha ,\cdot )):H_0^1(\Omega ) \rightarrow H_0^1(\Omega )^* \text { monotone and continuous}, \\&f(\alpha ,0)=0, \quad |f(\alpha ,v)|\le C_\alpha (1+|v|^5), \quad \text {for some } C_\alpha \ge 0. \end{aligned} \end{aligned}$$
(37)

Then, equation (34) with parameters \(\varphi \), c, a and \(u_0\) such that (35) and (36) hold admits a unique solution

$$\begin{aligned} u\in W^{1,\infty ,\infty }(0,T;L^2(\Omega ),L^2(\Omega ))\cap W^{1,\infty ,2}(0,T;H^1_0(\Omega ),H^1_0(\Omega )) \end{aligned}$$

Proof

We verify the conditions in Theorem 20 for \({\hat{H}}= L^2(\Omega )\), \({\hat{V}}= H_0^1(\Omega )\) with \(\Vert u\Vert _{{\hat{V}}}=\Vert \nabla u\Vert _{{\hat{H}}}\) and \({\hat{F}}(u):= -F(u) - f(\alpha ,u)\), where \(F:{\hat{V}}\rightarrow {\hat{V}}^*\) is given as

$$\begin{aligned} F(u) = \nabla \cdot (a\nabla u) - cu. \end{aligned}$$

First, note that due to measurability and the growth constraint, the Nemytskii mapping \(f(\alpha ,\cdot ):{\hat{V}}\rightarrow {\hat{V}}^*\), where we set \(f(\alpha ,u)(w):= \int _\Omega f(\alpha ,u(x))w(x) \,\textrm{d}x\) for \(w \in {\hat{V}}\), is indeed well-defined, since

$$\begin{aligned} \Vert f(\alpha ,v)\Vert _{{\hat{V}}^*}&= \underset{\Vert w\Vert _{\hat{V}}\le 1}{\sup } \int _\Omega f(\alpha ,{v}(x))w(x) \,\textrm{d}x \le { \underset{\Vert w\Vert _{\hat{V}}\le 1}{\sup } C_{\alpha }(|\Omega |^{5/6}+\Vert v^5\Vert _{L^{6/5}(\Omega )})\Vert w\Vert _{L^{6}(\Omega )}}\\&\le {C}C_{H^1\rightarrow L^6}(1+\Vert v^5\Vert _{L^{6/5}(\Omega )})\le {C}(C_{H^1\rightarrow L^6})^6(1+\Vert v\Vert _{H^1(\Omega )}^5). \end{aligned}$$

Since \(0<\underline{a}\le a\) almost everywhere on \(\Omega \) and \(c\in L^2(\Omega )\), the estimate

$$\begin{aligned} \langle cu,u \rangle _{{\hat{V}}^*,{\hat{V}}}&\le \Vert c\Vert _{L^2(\Omega )}\Vert u^{3/2}u^{1/2}\Vert _{L^2(\Omega )} \le (C_{H^1\rightarrow L^6})^{3/2}\Vert c\Vert _{L^2(\Omega )}\Vert u\Vert ^{3/2}_ {{\hat{V}}} \Vert u\Vert ^{1/2}_{{\hat{H}}} \nonumber \\&\le \frac{3}{4}\left( \underline{a}^{3/4}\Vert u\Vert ^{3/2}_{\hat{V}}\right) ^{4/3}+\frac{1}{4}\left( \frac{(C_{H^1\rightarrow L^6})^{3/2}\Vert c\Vert _{L^2(\Omega )}}{\underline{a}^{3/4}}\Vert u\Vert ^{1/2}_{\hat{H}}\right) ^{4}\nonumber \\&=\frac{3\underline{a}}{4}\Vert u\Vert _{\hat{V}}^2 + \frac{(C_{H^1\rightarrow L^6})^6}{4\underline{a}^3}\Vert c\Vert _{L^2(\Omega )}^4\Vert u\Vert _{\hat{H}}^2 \end{aligned}$$
(38)

yields

$$\begin{aligned}&\langle -\nabla \cdot (a\nabla u) + cu,u \rangle _{{\hat{V}}^*,{\hat{V}}}\\&\quad \ge \underline{a}\Vert u\Vert _{\hat{V}}^2-\left( \frac{3\underline{a}}{4}\Vert u\Vert _{\hat{V}}^2 + \frac{(C_{H^1\rightarrow L^6})^6}{4\underline{a}^3}\Vert c\Vert _{L^2(\Omega )}^4\Vert u\Vert _{\hat{H}}^2\right) = c_0\Vert u\Vert _{\hat{V}}^2 - c_2\Vert u\Vert _{\hat{H}}^2, \end{aligned}$$

with \(c_0:=\underline{a}/4\), \(c_2:=(C_{H^1\rightarrow L^6})^6\Vert c\Vert _{L^2(\Omega )}^4/4\underline{a}^3\). Together with monotonicity of \({-f(\alpha ,\cdot )}\) and \(f(\alpha ,0)=0\), one has \(\langle -f(\alpha ,u),u\rangle _{{\hat{V}}^*,{\hat{V}}}=\langle -f(\alpha ,u)+f(\alpha ,0), u-0\rangle _{{\hat{V}}^*,{\hat{V}}}\ge 0\). This implies semicoercivity as in S2 with \(c_0\), \(c_2\) as above and \(c_1 = 0\). Also, the second estimate in the regularity condition S3 now follows directly with

$$\begin{aligned} c_0=C_0, \quad C_2=c_2, \end{aligned}$$

where again we employ monotonicity of \(-f(\alpha ,\cdot )\).

In order to verify pseudomonotonicity S1, we first notice that \({\hat{F}}:{\hat{V}}\rightarrow {\hat{V}}^*\) is bounded, i.e., it maps bounded sets to bounded sets, and continuous, where the latter follows from continuity of F, which is immediate, and continuity of f, which holds by assumption. Using this, one can apply [14, Lemma 6.7] to conclude pseudomonotonicity if the following statement is true:

$$\begin{aligned}{}[\,\, u_n \overset{{\hat{V}}}{\rightharpoonup }\ u \quad {\text {and}}\quad \limsup _{n\rightarrow \infty }\,\langle {\hat{F}}(u_n)-{\hat{F}}(u),u_n-u \rangle _{{\hat{V}}^*,{\hat{V}}} \le 0\,\,] \quad \Rightarrow \quad u_n \overset{{\hat{V}}}{\rightarrow }u. \end{aligned}$$

The latter follows since, by compactness of the embedding \({\hat{V}}\hookrightarrow {\hat{H}}\), one gets for \(u_n \overset{{\hat{V}}}{\rightharpoonup }\ u\) that \(u_n\overset{{\hat{H}}}{\rightarrow }u\) and

$$\begin{aligned} 0&\ge \quad \limsup _{n\rightarrow \infty }\,\langle {\hat{F}}(u_n)-{\hat{F}}(u),u_n-u \rangle _{{\hat{V}}^*,{\hat{V}}} \nonumber \\&\ge c_0 \limsup _{n\rightarrow \infty }\Vert u_n-u\Vert _{\hat{V}}^2-c_2\lim _{n\rightarrow \infty }\Vert u_n-u\Vert _{\hat{H}}^2 \nonumber \\&=c_0\limsup _{n\rightarrow \infty }\Vert u_n-u\Vert _{\hat{V}}^2, \end{aligned}$$
(39)

which implies \(u_n \overset{{\hat{V}}}{\rightarrow }u\) as \(n\rightarrow \infty .\) With this, Theorem 20 implies unique existence of a solution

$$\begin{aligned} u\in W^{1,\infty ,\infty }(0,T;{\hat{H}},{\hat{H}})\cap W^{1,\infty ,2}(0,T;{\hat{V}},{\hat{V}}). \end{aligned}$$

\(\square \)

Note that, by embedding, \(u\in W^{1,\infty ,\infty }(0,T;{\hat{H}},{\hat{H}})\cap W^{1,\infty ,2}(0,T;{\hat{V}},{\hat{V}})\) implies that \(u\in L^\infty (0,T;{\hat{V}})\cap H^1(0,T;{\hat{V}})\). In a second step, we now aim to find suitable assumptions on the parameter spaces \(X_\varphi \), \(X_c\), \(X_a\) and \(U_0\) such that the regularity of the solution u of (34) obtained in the previous lemma is lifted to \(u\in L^\infty ((0,T)\times \Omega )\).

Remark 22

There are at least two ways to achieve this: One is to enhance space regularity of u from \(H^1(\Omega ) \) to \(W^{k,p}(\Omega )\) with \( kp>d\) such that \(W^{k,p} (\Omega ) \hookrightarrow C(\overline{\Omega })\) and we can ensure \(u \in L^\infty ( (0,T), C(\overline{\Omega })) \hookrightarrow L^\infty ((0,T)\times \Omega )\). The other possible approach is to ensure a \( W^{2,q}(\Omega )\)-space regularity with q sufficiently large such that \( u \in L^2((0,T),W^{2,q}(\Omega ))\cap H^1(0,T;W^{2,q}(\Omega ))\hookrightarrow C(0,T;L^\infty (\Omega ))\).

While the first approach might yield weaker conditions on k and p, it imposes a non-reflexive state space. The latter choice, on the other hand, fits better into our setting of reflexive spaces; thus we proceed with the latter choice.

Now our goal is to determine an exponent q such that, if \(u\in L^2(0,T;W^{2,q}(\Omega ))\cap H^1(0,T;H^1(\Omega ))\), it follows that \(u \in C(0,T;W^{1,2p}(\Omega ))\) with \(p>d/2\) such that \(W^{1,2p}(\Omega )\hookrightarrow L^\infty (\Omega )\) and ultimately \(u \in L^\infty ((0,T)\times \Omega )\). To this aim, first note that for \(u\in L^2(0,T;W^{2,q}(\Omega )\cap H^1_0(\Omega ))\,\cap \, H^1(0,T;H^1(\Omega ))\), by Friedrichs’s inequality, it follows that \(u \in C(0,T;L^{2p}(\Omega ))\) if

$$\begin{aligned} |{\nabla } u|^p \in C(0,T;L^2(\Omega )). \end{aligned}$$

To ensure the latter, we use that \(({\nabla u})^p \in L^2(0,T;W^{1,q/p}(\Omega ))\cap H^1(0,T;L^{2/p}(\Omega ))\) and that

$$\begin{aligned} L^2(0,T;W^{1,q/p}(\Omega ))\cap H^1(0,T;L^{2/p}(\Omega )) \hookrightarrow C(0,T;L^2(\Omega )) \end{aligned}$$

provided that \(dp>q\ge \frac{dp}{d+1}\) and \(\frac{p}{2}\le 1-\frac{p}{q}+\frac{1}{d}\). Indeed, in this case it follows that

$$\begin{aligned} L^{\frac{2}{p}}(\Omega )\hookrightarrow L^{\frac{dq}{dq-dp+q}}(\Omega )\hookrightarrow (W^{1,\frac{q}{p}}(\Omega ))^* \end{aligned}$$

such that the embedding into \(C(0,T;L^2(\Omega ))\) follows from [32, Lemma 7.3] (see Appendix A). Since \(2pd/(2d+2-dp)\ge dp/(d+1)\), it follows that we can ensure for \( p>d/2\) that \( u \in L^\infty ((0,T)\times \Omega )\) if

$$\begin{aligned} dp > q \ge \frac{2dp}{2d+2-dp}. \end{aligned}$$

This is fulfilled for \(p=d/2+\epsilon \) with \(\epsilon >0\) if \(dp > q\ge (2d^2+4\epsilon d)/(4+4d-d^2-2\epsilon d)\) and, more concretely, in case \(d=2\) for \(p=1+\epsilon \), \(q=(2+2\epsilon )/(2-\epsilon )\) and \(\epsilon \in (0,1)\), and in case \(d=3\) for \(p=3/2+\epsilon \), \(q = (18+12\epsilon )/(7-6\epsilon )\) and \(\epsilon \in (0,1/2)\).
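These elementary inequalities can be double-checked numerically; the following snippet (a pure sanity check, with a small tolerance for rounding) verifies \(dp> q\ge 2dp/(2d+2-dp)\) for the concrete choices of p and q stated above.

```python
def exponents_ok(d, eps):
    p = d / 2 + eps
    q_low = 2 * d * p / (2 * d + 2 - d * p)                        # lower bound on q
    q = (2 + 2 * eps) / (2 - eps) if d == 2 else (18 + 12 * eps) / (7 - 6 * eps)
    return q_low <= q + 1e-12 and q < d * p                        # dp > q >= 2dp/(2d+2-dp)

for d, eps in [(2, 0.25), (2, 0.9), (3, 0.1), (3, 0.45)]:
    print(d, eps, exponents_ok(d, eps))                            # all True
```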

Let us focus on the latter case of \(d=3\) and derive suitable assumptions on \(X_\varphi \), \(X_c\), \(X_a, U_0\) and f such that the solution u to (34) fulfills

$$\begin{aligned} \begin{aligned} u \in L^2(0,T;W^{2,q}(\Omega ))\cap H^1(0,T;H^1(\Omega ))\hookrightarrow C(0,T;W^{1,2p}(\Omega )), \end{aligned} \end{aligned}$$

where the embedding holds by our choice of q and p.

Lemma 23

(Lifted regularity) In addition to the assumptions of Lemma 21, assume that \(d=3\) and that, for positive numbers \(p,\epsilon ,q,\bar{q}\) and Pa with

$$\begin{aligned}{} & {} p=3/2+\epsilon ,\quad \min \{6,3p\}> q\ge \frac{18+12\epsilon }{7-6\epsilon },\quad \frac{q\bar{q}}{2\overline{q}-q} \le 2,\\{} & {} \quad \overline{q}\preceq \frac{3q}{3-q}, \quad Pa >\max \{3,\frac{q\bar{q}}{\overline{q}-q}\} \end{aligned}$$

it holds that

$$\begin{aligned}&c\in L^q(\Omega ),\quad a\in W^{1,Pa}(\Omega ), \text { and } 0<\underline{a}\le a(x)\le \overline{a}\,\,\, {\text {for almost all }} x\in \Omega ,\\&\varphi \in L^q(\Omega ),\quad u_0\in H^2(\Omega ),\\&|f(\alpha ,v)|< C_\alpha (1+|v|^{B}) \text { with }B < 6/q + 1. \end{aligned}$$

Then, the unique solution of (34) fulfills

$$\begin{aligned}&u\in L^2(0,T;W^{2,q}(\Omega ))\cap H^1(0,T;H^1(\Omega ))\nonumber \\ {}&\hookrightarrow C(0,T;W^{1,2p}(\Omega ))\hookrightarrow L^\infty ((0,T) \times \Omega ) \end{aligned}$$
(40)

Proof

From (34) we get

$$\begin{aligned} a\Delta u= \dot{u}-\nabla a\cdot \nabla u + cu + f(\alpha ,u)-\varphi , \end{aligned}$$
(41)

and, since \(\overline{q}\preceq \frac{3q}{3-q}\) ensures \(W^{1,q}(\Omega )\hookrightarrow L^{\overline{q}}(\Omega )\), we estimate the components of the right-hand side of (41), using parameters \(\delta ,\delta _1>0\) (which will be chosen small later on). Since \(q\le 6\),

$$\begin{aligned} \Vert \dot{u}\Vert _{L^q(\Omega )}\le C_{H^1\rightarrow L^q}\Vert \dot{u}\Vert _{H^1(\Omega )}. \end{aligned}$$
(42)

By \(q \le 6\) and \(c\in L^q(\Omega )\), using density, we can choose \(c_\infty \in L^\infty (\Omega )\) such that \(\Vert c-c_\infty \Vert _{L^q(\Omega )} \le \delta \) and obtain

$$\begin{aligned} \Vert cu\Vert _{L^q(\Omega )}&\le \Vert c_\infty u\Vert _{L^q(\Omega )}+\Vert (c-c_\infty ) u\Vert _{L^q(\Omega )} \nonumber \\&\le \Vert c_\infty \Vert _{L^\infty (\Omega )}\Vert u\Vert _{L^q(\Omega )}+\Vert c - c_\infty \Vert _{L^q(\Omega )}\Vert u\Vert _{L^\infty (\Omega )} \nonumber \\&\le C_{H^1\rightarrow L^q}\Vert c_\infty \Vert _{L^\infty (\Omega )}\Vert u\Vert _{H^1(\Omega )}+ C_{W^{2,q}\rightarrow L^\infty }\delta \Vert u\Vert _{W^{2,q}(\Omega )}. \end{aligned}$$
(43)

Now, by the assumption \(|f(\alpha ,v)| \le C_\alpha (1 + |v|^{B}) \) with \(B < 6/q+1\) (note that this also means \(B \le 5\)) and by possibly increasing B, we can assume that \(6/q< { B < 6/q+1}\) and select \(\beta := B-6/q \in (0,1)\), such that \(q(B-\beta ) = 6\). Applying Young’s inequality with an arbitrary positive factor \(\delta _1>0\), we have

$$\begin{aligned} \Vert f(\alpha ,u)\Vert _{L^q(\Omega )}&\le C_\alpha (1+\Vert |u|^B\Vert _{L^q(\Omega )}) \le C_\alpha \left( 1 + \Vert u\Vert ^\beta _{L^\infty (\Omega )}\Vert u^{B-\beta }\Vert _{L^q(\Omega )}\right) \nonumber \\&\le C_\alpha \left( 1+\beta \delta _1^{1/\beta }\Vert u\Vert _{L^\infty (\Omega )} +\dfrac{1-\beta }{\delta _1^{1/(1-\beta )}}\Vert u\Vert ^\frac{B-\beta }{1-\beta }_{L^{q(B-\beta )}(\Omega )} \right) \nonumber \\&\le C_\alpha \left( 1+ C_{W^{2,q}\rightarrow L^\infty }\beta \delta _1^{1/\beta }\Vert u\Vert _{W^{2,q}(\Omega )} +\dfrac{1-\beta }{\delta _1^{1/(1-\beta )}} C_{H^1\rightarrow L^{q(B-\beta )}} \Vert u\Vert ^{\frac{B-\beta }{1-\beta }}_{H^1(\Omega )} \right) \end{aligned}$$
(44)

Using \( a\in W^{1,Pa}(\Omega )\) with \(Pa\ge \frac{q\bar{q}}{\overline{q}-q}\) and \(\frac{q\bar{q}}{2\overline{q}-q} \le 2\), again using density, we can choose \(a_\infty \in W^{1,\infty }(\Omega )\) such that \(\Vert \nabla a-\nabla a_\infty \Vert _{L^{\frac{q\bar{q}}{\overline{q}-q}}(\Omega )}<\delta \) and obtain

$$\begin{aligned}&\Vert \nabla a\cdot \nabla u\Vert _{L^q(\Omega )}\le \Vert (\nabla a-\nabla a_\infty )\cdot \nabla u\Vert _{L^q(\Omega )}+\Vert \nabla a_\infty \cdot \nabla u\Vert _{L^q(\Omega )} \nonumber \\&\quad \le \Vert \nabla a-\nabla a_\infty \Vert _{L^{\frac{q\bar{q}}{\overline{q}-q}}(\Omega )} \Vert \nabla u\Vert _{L^{\overline{q}}(\Omega )}+\Vert \nabla a_\infty \Vert _{L^\infty (\Omega )}\Vert |\nabla u|^{1/2}|\nabla u|^{1/2}\Vert _{L^q(\Omega )} \nonumber \\&\quad \le \delta \Vert \nabla u\Vert _{L^{\overline{q}}(\Omega )} + \Vert \nabla a_\infty \Vert _{L^\infty (\Omega )}\left( \frac{\delta _1}{2}\Vert \nabla u\Vert _{L^{\overline{q}}(\Omega )}+\frac{1}{2\delta _1}\Vert \nabla u\Vert _{L^{\frac{q\bar{q}}{2\overline{q}-q}}(\Omega )} \right) \nonumber \\&\quad \le C_{W^{2,q}\rightarrow W^{1,\overline{q}}}\left( \delta +\frac{\delta _1}{2}\Vert \nabla a_\infty \Vert _{L^\infty (\Omega )}\right) \Vert u\Vert _{W^{2,q}(\Omega )} + C_{L^2\rightarrow L^{\frac{q\bar{q}}{2\overline{q}-q}}}\frac{\Vert \nabla a_\infty \Vert _{L^\infty (\Omega )}}{2\delta _1}\Vert u\Vert _{H^1(\Omega )} \end{aligned}$$
(45)

Using that also \(\varphi \in L^q(\Omega )\), taking the spatial \(L^q\)-norm in (41), estimating by the triangle inequality, and raising everything to the second power, we arrive at

$$\begin{aligned}&\underline{a}^2\Vert \Delta u\Vert ^2_{L^2( 0,T;L^q(\Omega ))}\le \Vert a\Delta u\Vert ^2_{L^2( 0,T;L^q(\Omega ))}\\&\quad \le 5\left( \Vert \dot{u}\Vert ^2_{L^2( 0,T;L^q(\Omega ))}+\Vert \nabla a \cdot \nabla u \Vert ^2_{L^2( 0,T;L^q(\Omega ))} + \Vert cu\Vert ^2_{L^2( 0,T;L^q(\Omega ))} \right. \\&\qquad \left. + \Vert f(\alpha ,u)\Vert ^2_{L^2( 0,T;L^q(\Omega ))} + \Vert \varphi \Vert ^2_{L^2( 0,T;L^q(\Omega ))} \right) \\&\quad \le 15\left( \Vert \dot{u}\Vert ^2_{L^2( 0,T;L^q(\Omega ))}+C_{c,a}\Vert u\Vert ^2_{L^2( 0,T;H^1(\Omega ))}+ T\Vert \varphi \Vert ^2_{L^q(\Omega )}\right. \\&\qquad \left. + TC_{B,\alpha ,\beta }\Vert u\Vert ^{2\frac{B-\beta }{1-\beta }}_{L^\infty ( 0,T;H^1(\Omega ))} +TC^2_\alpha +\tilde{\epsilon }\Vert \Delta u\Vert ^2_{L^2( 0,T;L^q(\Omega ))} \right) \end{aligned}$$

with

$$\begin{aligned} \tilde{\epsilon }:= \left[ C_{W^{2,q}\rightarrow L^\infty }\delta + C_\alpha C_{W^{2,q}\rightarrow L^\infty }\beta \delta _1^{1/\beta }+C_{W^{2,q}\rightarrow W^{1,\overline{q}}}\left( \delta +\frac{\delta _1}{2}\Vert \nabla a_\infty \Vert _{L^\infty (\Omega )}\right) \right] ^2. \end{aligned}$$

For sufficiently small \(\delta ,\delta _1\), this leads to

$$\begin{aligned}&0<\quad \left( \frac{\underline{a}^2}{2^6}-\tilde{\epsilon }\right) \Vert \Delta u\Vert ^2_{L^2(0,T;L^q(\Omega ))}\nonumber \\&\quad \le C_{c,a,\varphi ,B,\beta ,T}\left( \Vert \dot{u}\Vert ^2_{L^2(0,T;H^1_0(\Omega ))}+\Vert u\Vert ^2_{L^\infty (0,T;H^1_0(\Omega ))} \right) \quad < \infty . \end{aligned}$$
(46)

The facts that \(\nabla u\in L^2(0,T;L^2(\Omega ))\) and \(\Delta u\in L^2(0,T;L^q(\Omega ))\), \(q\ge 2\), as above imply \(\nabla u\in L^2(0,T;H^1(\Omega ))\), thus \(\nabla u\in L^2(0,T;L^q(\Omega ))\) for \(q\le 6\). This and (46) ensure that \(u\in L^2(0,T;W^{2,q}(\Omega ))\). By Lemma 21, \(u\in W^{1,\infty ,\infty }(0,T;{\hat{H}},{\hat{H}})\cap W^{1,\infty ,2}(0,T;{\hat{V}},{\hat{V}})\); thus, by embedding, \(u \in H^1(0,T;H^1(\Omega ))\). Consequently,

$$\begin{aligned} u\in L^2(0,T;W^{2,q}(\Omega ))\cap H^1(0,T;H^1(\Omega )). \end{aligned}$$

This, together with the argumentation after Remark 22 completes the proof. \(\square \)

The obtained unique existence result is now summarized in the following proposition.

Proposition 24

(i) The nonlinear parabolic PDE (34) with \(d=3\) admits the unique solution

$$\begin{aligned}&u\in L^2(0,T;W^{2,q}(\Omega ))\cap H^1(0,T;H^1(\Omega ))\nonumber \\ {}&\hookrightarrow C(0,T;W^{1,2p}(\Omega ))\hookrightarrow L^\infty ((0,T) \times \Omega ) \end{aligned}$$
(47)

if the following conditions are fulfilled:

$$\begin{aligned}&p=3/2+\epsilon \text { with }\epsilon>0, \\&\min \{6,3p\}> q\ge \frac{18+12\epsilon }{7-6\epsilon }, \quad \text {and}\quad \frac{q\bar{q}}{2\overline{q}-q} \le 2 \quad \text {with } \overline{q} \text { such that } \overline{q}\preceq \frac{3q}{3-q},\\&c\in L^q(\Omega ),\quad a\in W^{1,Pa}(\Omega ),\quad Pa >\max \{3,\tfrac{q\bar{q}}{\overline{q}-q}\},\quad \text {and } 0<\underline{a}\le a(x)\le \overline{a}\,\,\, {\text {for almost all }} x\in \Omega ,\\&\varphi \in L^q(\Omega ),\quad u_0\in H^2(\Omega ),\\&(-f(\alpha ,\cdot )) \text { is monotone, } f(\alpha ,0)=0, \text { and } |f(\alpha ,v)|< C_\alpha (1+|v|^{B}) \text { with }B < 6/q + 1. \end{aligned}$$

(ii) Moreover, the claim in (i) still holds in case \(f(\alpha ,\cdot )\) is replaced by a neural network \( \mathcal {N} _\theta (\alpha ,\cdot )\) with \(\sigma \in \mathcal {C}_\text {Lip}(\mathbb {R},\mathbb {R})\).

Proof

(i) Lemma 21 ensures that (34) admits a unique solution

$$\begin{aligned} u\in W^{1,\infty ,\infty }(0,T;L^2(\Omega ),L^2(\Omega ))\cap W^{1,\infty ,2}(0,T;H^1(\Omega ),H^1(\Omega )), \end{aligned}$$

such that in particular \(u\in L^\infty (0,T;H^1(\Omega ))\cap H^1(0,T;H^1(\Omega ))\). Lemma 23 ensures that the embeddings in (47) hold true, again by our choice of p and q.

(ii) Now consider the case that \(f(\alpha ,\cdot )\) is replaced by \( \mathcal {N} _\theta (\alpha ,\cdot )\) for some known \(\alpha , \theta \). With \(L_{\theta ,\alpha }\) the Lipschitz constant of \( \mathcal {N} _\theta (\alpha ,\cdot ):\mathbb {R}\rightarrow \mathbb {R}\), we first observe that, for \(v \in \mathbb {R}\),

$$\begin{aligned} | \mathcal {N} _\theta (\alpha ,v)| \le | \mathcal {N} _\theta (\alpha ,0)| + | \mathcal {N} _\theta (\alpha ,v)- \mathcal {N} _\theta (\alpha ,0)| \le | \mathcal {N} _\theta (\alpha ,0)| + L_{\theta ,\alpha }|v| \end{aligned}$$

such that the growth condition \(| \mathcal {N} _\theta (\alpha ,v)|< C_\alpha (1+|v|^{B})\) with \(B < 6/q + 1\) and in particular the growth condition of Lemma 23 holds. This shows in particular that the induced Nemytskii mapping \( \mathcal {N} _\theta (\alpha ,\cdot ):H^1(\Omega ) \rightarrow H^1(\Omega )^*\) is well-defined. Further, we can observe that, again for \(u,v \in H^1(\Omega )\),

$$\begin{aligned}&|\langle \mathcal {N}_\theta (\alpha ,u),u \rangle _{H^1(\Omega )^*,H^1(\Omega )}|\le L_{\theta ,\alpha }\Vert u\Vert ^2_{L^2(\Omega )}+| \mathcal {N} _{\theta ,\alpha }(0)|C_{H^1\rightarrow L^1}\Vert u\Vert _{H^1(\Omega )},\\&|\langle \mathcal {N}_\theta (\alpha ,u)-\mathcal {N}_\theta (\alpha ,v),u-v \rangle _{H^1(\Omega )^*,H^1(\Omega )}|\le L_{\theta ,\alpha }\Vert u-v\Vert ^2_{L^2(\Omega )}. \end{aligned}$$

Using these estimates, it is clear that the conditions S2 and S3 in Theorem 20 can be shown similarly as in Step 1 without requiring \(\mathcal {N}_\theta (\alpha ,0)=0\) or monotonicity of \( \mathcal {N} _\theta (\alpha ,\cdot )\). This completes the proof. \(\square \)

Remark 25

For neural networks, some examples fulfilling the conditions in Proposition 24, i.e. Lipschitz continuous activation functions, are the ReLU function \(\sigma (x)=\max \{0,x\}\), the tansig function \(\sigma (x)=\tanh (x)\), the sigmoid (or soft step) function \(\sigma (x)=\frac{1}{1+e^{-x}}\), the softsign function \(\sigma (x)=\frac{x}{1+|x|}\) and the softplus function \(\sigma (x)=\ln (1+e^x)\).
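As a quick numerical sanity check (not a proof), the difference quotients of these activation functions can be sampled to confirm that their slopes stay bounded by one, so that each of them is indeed Lipschitz continuous on \(\mathbb {R}\); the sampling range and resolution below are arbitrary.

```python
import numpy as np

activations = {
    "ReLU":     lambda x: np.maximum(0.0, x),
    "tanh":     np.tanh,
    "sigmoid":  lambda x: 1.0 / (1.0 + np.exp(-x)),
    "softsign": lambda x: x / (1.0 + np.abs(x)),
    "softplus": lambda x: np.log1p(np.exp(x)),
}
x = np.linspace(-20.0, 20.0, 400001)
for name, s in activations.items():
    slopes = np.abs(np.diff(s(x)) / np.diff(x))       # finite-difference slopes
    print(f"{name:9s} max |difference quotient| ~ {slopes.max():.3f}")   # all <= 1
```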

4.2 Well-Posedness for the All-at-Once Setting

With the result attained in Proposition 24, we are ready to determine the function spaces for the minimization problems (13), (14) in the all-at-once setting and explore further properties discussed in Sect. 3.

Remark 26

For minimization in the reduced setting, we usually invoke monotonicity in order to handle high nonlinearity (cf. Proposition 24). The minimization problems in the all-at-once setting, however, do not require this condition, thus allowing for more general classes of functions, e.g. by including in F another known nonlinearity \(\phi \) as in the following proposition.

Proposition 27

For \(d=3\) and \(\epsilon >0\) sufficiently small, define the spaces

$$\begin{aligned}&V= W^{2,q}(\Omega ), \quad {\widetilde{V}}=H^1(\Omega ),\quad H=W^{1,2p}(\Omega ),\\&W=L^q(\Omega ),\quad p=\frac{3}{2}+\epsilon , q=\frac{18+12\epsilon }{7-6\epsilon }, \end{aligned}$$

and Y such that \(V \hookrightarrow Y\), resulting in the following state-, image- and observation spaces

$$\begin{aligned} \mathcal {V}&=L^2(0,T;W^{2,q}(\Omega ))\cap H^1(0,T;H^1(\Omega )),\\ \mathcal {W}&=L^2(0,T;L^q(\Omega )),\quad \mathcal {Y}=L^2(0,T;Y). \end{aligned}$$

Further, define the corresponding parameter spaces \( U_0=H^2(\Omega )\), \(X = X_\varphi \times X_c \times X_a\), where

$$\begin{aligned}&X_\varphi =X_c= L^q(\Omega ),\quad X_a= \{a\in W^{1,Pa}(\Omega ), Pa >3\}, \end{aligned}$$

and let \(M \in \mathcal {L}(\mathcal {V},\mathcal {Y})\) be the observation operator.

Consider the minimization problems (13) and (14), with \({F: (0,T) \times X \times V} \rightarrow W\) given as

$$\begin{aligned} F(t,(\varphi ,c,a),u) {:=} \nabla \cdot (a\nabla u) - cu+\varphi + \phi (u), \end{aligned}$$

where \(\phi :V \rightarrow W\) is an additional known nonlinearity in F (cf. Remark 26); \(\phi \) is the induced Nemytskii mapping of a function \(\phi \in \mathcal {C}_\text {locLip}(\mathbb {R},\mathbb {R})\). The associated PDE is given as

$$\begin{aligned}&\dot{u} - \nabla \cdot (a\nabla u) + cu + \phi (u) + \mathcal {N}_\theta (\alpha ,u)= \varphi \quad{} & {} \text{ in } \Omega \times (0,T),\nonumber \\&u|_{\partial \Omega }=0{} & {} \text{ in } (0,T), \nonumber \\&u(0) = u_0{} & {} \text{ in } \Omega , \end{aligned}$$
(48)

with the activation functions \(\sigma \) of \(\mathcal {N}_\theta (\alpha ,u)\) satisfying \( \sigma \in \mathcal {C}_\text {locLip}(\mathbb {R},\mathbb {R})\), and with \(\mathcal {R}_1, \mathcal {R}_2 \) nonnegative, weakly lower semi-continuous and such that the sublevel sets of \((\lambda ,\alpha ,u_0,u,\theta ) \mapsto \mathcal {R}_1(\lambda ,\alpha ,u_0,u) + \mathcal {R}_2(\theta )\) are weakly precompact. Then, each of (13) and (14) admits a minimizer.

Proof

Our aim is to verify the assumptions proposed in Lemma 5, which lead to the result in Proposition 6. At first, we verify Assumption 1. The embeddings between the spaces defined above that are required there are an immediate consequence of our choice of p and q and standard Sobolev embeddings. The embeddings

$$\begin{aligned} \mathcal {V}\hookrightarrow L^\infty ((0,T)\times \Omega ),\quad \mathcal {V}\hookrightarrow C(0,T;H) \end{aligned}$$

follow from the discussion in Step 2 above, see also Proposition 24.

Note that well-definedness of the Nemytskii mappings as well as the growth condition (7) are consequences of the following arguments on weak continuity. We focus on weak continuity of \(F:\mathcal {V}\times (X_c,X_a,X_\varphi )\rightarrow \mathcal {W}, F(\lambda ,u):=\nabla \cdot (a\nabla u) - cu+\varphi + \phi (u)\) via weak continuity of the operator inducing it, as presented in Lemma 5. First, for the cu part we see that \((c,u)\mapsto cu\) is weakly continuous on \((X_c,H)\). Indeed, for \(c_n\rightharpoonup c\) in \(X_c\) and \(u_n\rightharpoonup u\) in H, thus \(u_n\rightarrow u\) in \(L^\infty (\Omega )\), one has for any \(w^*\in W^*= L^{q^*}(\Omega )\),

$$\begin{aligned}&\int _\Omega (cu-c_nu_n)w^*\,\textrm{d}x = \int _{{\Omega }} (c-c_n)uw^*\,\textrm{d}x +\int _{{\Omega }} c_n(u-u_n)w^*\,\textrm{d}x \quad \overset{n\rightarrow \infty }{\rightarrow }0 \end{aligned}$$

due to \(uw^*\in L^{q^*}(\Omega )\), \( \Vert c_n w^*\Vert _{L^1(\Omega )}\le C<\infty \) for all n and \(u_n\rightarrow u\) in \(L^\infty (\Omega )\).

For the \(\nabla \cdot (a\nabla u)\) part, \(H=W^{1,2p}(\Omega )\) is not strong enough to enable weak continuity of \((a,u)\mapsto \nabla \cdot (a\nabla u)\) on \((X_a,H)\); we therefore verify weak continuity of the Nemytskii operator directly. So, let \((a_n,u_n) \rightharpoonup (a,u)\) in \(X_a\times \mathcal {V}\); taking \(w^*\in L^2(0,T;L^{q^*}(\Omega ))\) we have

$$\begin{aligned} \int _{{\Omega \times (0,T)}}&(\nabla \cdot (a\nabla u)-\nabla \cdot (a_n\nabla u_n))w^*\,\textrm{d}x\,\textrm{d}t\\&=\int _{{\Omega \times (0,T)}} \nabla (a-a_n)\cdot \nabla u w^* \,\textrm{d}x\,\textrm{d}t + \int _{{\Omega \times (0,T)}} \nabla a_n\cdot \nabla (u-u_n) w^* \,\textrm{d}x\,\textrm{d}t\\&\quad +\int _{{\Omega \times (0,T)}} (a-a_n)\cdot \Delta u_n w^*\,\textrm{d}x\,\textrm{d}t\\&\quad +\int _{{\Omega \times (0,T)}} a\Delta (u-u_n) w^*\,\textrm{d}x\,\textrm{d}t \qquad \overset{n\rightarrow \infty }{\rightarrow }0 \end{aligned}$$

due to the following: we have \(\nabla u w^*\in L^2(0,T;L^{Pa^*}(\Omega )), \nabla a_n\rightharpoonup \nabla a\) in \(L^{Pa}(\Omega )\) in the first estimate, and \(u_n\rightarrow u\) in \(L^2(0,T;W^{1,18}(\Omega )), \Vert \nabla a_n w^*\Vert _{L^2(0,T;L^{18/17}(\Omega ))}\le C<\infty \) for all n in the second estimate. In the third estimate, one has \(a_n\rightarrow a\) in \(L^\infty (\Omega )\) and

$$\begin{aligned} \Vert \Delta u_n w^*\Vert _{{L^1}(0,T;L^1(\Omega ))}{\le \Vert \Delta u_n\Vert _{L^2(0,T;L^{q}(\Omega ))}\Vert w^*\Vert _{L^2(0,T;L^{q^*}(\Omega ))}}\le C<\infty \quad \text {for all}\, n. \end{aligned}$$

Finally, in the last estimate it is clear that \(aw^*\in L^2(0,T;L^{q^*}(\Omega ))\) and \(u_n \rightharpoonup u\) in \(L^2(0,T;W^{2,q}(\Omega ))\), implying \(\Delta u_n \rightharpoonup \Delta u\) in \(L^2(0,T;L^{q}(\Omega ))\).

For the term \(\phi \), by local Lipschitz continuity of \(\phi \) we attain weak-strong continuity of \(\phi \) on H:

$$\begin{aligned} \Vert \phi (u_n)-\phi (u)\Vert _W\le \Vert u_n-u\Vert _{L^\infty (\Omega )} L\left( \Vert u_n\Vert _{H},\Vert u\Vert _{H}\right) \quad \rightarrow 0\quad \text {for}\quad u_n\overset{H}{\rightharpoonup }u. \end{aligned}$$
(49)

Finally, the fact that the activation function \(\sigma \) satisfies \(\sigma \in \mathcal {C}_\text {locLip}(\mathbb {R},\mathbb {R})\) completes the verification that the result of Proposition 6 holds. \(\square \)

For the following results, we set \(\phi =0\).

Lemma 28

(Differentiability) In accordance with Proposition 12 and the framework of Proposition 27, setting \(\phi =0\), the model operator \(F:X\times \mathcal {V}\rightarrow \mathcal {W}\) is Gâteaux differentiable, as is the neural network \(\mathcal {N}_\theta :\mathbb {R}^m\times \mathcal {V}\rightarrow \mathcal {W}\) with \(\sigma \in \mathcal {C}^1(\mathbb {R},\mathbb {R})\).

Proof

With the setting in Proposition 27, we verify local Lipschitz continuity of \(F(\lambda ,u)=\nabla \cdot (a\nabla u)-cu+\varphi \) with \(\lambda = (\varphi ,c,a)\). To this aim, we estimate

$$\begin{aligned}&\Vert F(\lambda _1,u_1)-F(\lambda _2,u_2)\Vert _W\\&\quad =\Vert \nabla \cdot (a_1\nabla (u_1-u_2)) -\nabla \cdot ((a_2-a_1)\nabla u_2 )\\&\qquad -c_1(u_1-u_2)+(c_2-c_1)u_2 + \varphi _1-\varphi _2 \Vert _{L^q(\Omega )} \\&\quad \le \Vert \nabla a_1\Vert _{L^{Pa}(\Omega )}\Vert \nabla (u_1-u_2)\Vert _{L^{\overline{q}}(\Omega )} \\&\qquad +\Vert a_1-a_2\Vert _{L^\infty (\Omega )}\Vert \Delta u_1-\Delta u_2\Vert _{L^q(\Omega )} + \Vert \nabla (a_2-a_1)\Vert _{L^{Pa}(\Omega )}\Vert \nabla u_2\Vert _{L^{\overline{q}}(\Omega )}\\&\qquad +\Vert a_2-a_1\Vert _{L^\infty (\Omega )}\Vert \Delta u_2\Vert _{L^q(\Omega )} + \Vert c_1\Vert _{L^q(\Omega )}\Vert u_1-u_2\Vert _{L^\infty (\Omega )}\\&\qquad +\Vert c_2-c_1\Vert _{L^q(\Omega )}\Vert u_2\Vert _{L^\infty (\Omega )} + \Vert \varphi _1-\varphi _2\Vert _{L^q(\Omega )}\\&\quad \le L(\Vert u_1\Vert _H,\Vert u_2\Vert _H,\Vert \lambda _1\Vert _X,\Vert \lambda _2\Vert _X)\big (\Vert u_1-u_2\Vert _V+\Vert u_1-u_2\Vert _H\\&\qquad + (1+\Vert u_2\Vert _V)\Vert \lambda _1-\lambda _2\Vert _X \big ) \end{aligned}$$

with \(\overline{q}\preceq \frac{3q}{3-q}\). Also, Gâteaux differentiability of \(F:X\times V\rightarrow W\) as well as the Carathéodory assumptions are clear from this estimate and the bilinearity of F with respect to \((\lambda , u)\). Differentiability of \(\mathcal {N}_\theta \) with \(\sigma \in \mathcal {C}^1(\mathbb {R},\mathbb {R})\) has been shown in the last paragraph of the proof of Proposition 12. \(\square \)

When the image space \(\mathcal {W}\) is stronger, that is, \(W\nsupseteq L^q(\Omega )\) for all \(q\in [1,\infty )\) as discussed in Remark 13, we require smoother activation functions than those employed in Lemma 28 in order to ensure differentiability of \(\mathcal {N}_\theta \).

Remark 29

(Strong image space \(\mathcal {W}\) and smoother neural network) Consider the case where the unknown parameter is \(\varphi \), the parameters a, c are known, and the neural network \(\mathcal {N}_\theta \) has the smoother activation

$$\begin{aligned} \sigma \in \mathcal {C}^1_\text {locLip}(\mathbb {R},\mathbb {R}), \text { i.e. } \sigma '\in \mathcal {C}_\text {locLip}(\mathbb {R},\mathbb {R}). \end{aligned}$$

The minimization problems introduced in Proposition 27 have minimizers that belong to the Hilbert spaces

$$\begin{aligned}&\mathcal {V}=L^2(0,T;H^3(\Omega ))\cap H^1(0,T;H^1(\Omega )),\\&\quad \mathcal {W}=L^2(0,T;H^1(\Omega )), \quad \mathcal {Y}=L^2(0,T;Y),\\&V=H^3(\Omega ){\hookrightarrow Y}, \quad \widetilde{V}=H^1(\Omega ), \quad H=H^2(\Omega ), \quad W=H^1(\Omega ), \end{aligned}$$

and

$$\begin{aligned}&X_c= H^1(\Omega ),\quad X_a= H^2(\Omega ),\quad X_\varphi = H^1(\Omega ),\quad U_0=H^2(\Omega ). \end{aligned}$$

Proof

For fixed \(\theta , \alpha \), let us denote \(\mathcal {N}_\theta (\alpha ,\cdot )=: \mathcal {N} _\theta \). It is clear that this setting fulfills all the embeddings in Assumption 1. Weak-strong continuity of \( \mathcal {N} _\theta \) is derived from

$$\begin{aligned} \Vert \mathcal {N}_{\theta ,\alpha }(u_n)-\mathcal {N}_{\theta ,\alpha }(u)\Vert _\mathcal {W}^2&= \Vert \mathcal {N}_{\theta ,\alpha }(u_n)-\mathcal {N}_{\theta ,\alpha }(u)\Vert _{L^2(0,T;L^2(\Omega ))}^2\\&\quad +\Vert \nabla \mathcal {N}_{\theta ,\alpha }(u_n)-\nabla \mathcal {N}_{\theta ,\alpha }(u)\Vert _{L^2(0,T;L^2(\Omega ))}^2\\&=:A+B\quad \overset{n\rightarrow \infty }{\rightarrow }0, \end{aligned}$$

since with \(\mathcal {V}\hookrightarrow C(0,T;H^2(\Omega ))\) and \(\sigma \in \mathcal {C}^1_\text {locLip}(\mathbb {R},\mathbb {R})\), one has

$$\begin{aligned} A&\le C(L'_{\theta ,\alpha }(\Vert u\Vert _\mathcal {V}))^2 \Vert u_n-u\Vert ^2_{L^2(0,T;L^2(\Omega ))},\\ B&\le 2\Vert \mathcal {N}_{\theta ,\alpha }'(u_n)(\nabla u_n-\nabla u)\Vert _{L^2(0,T;L^2(\Omega ))}^2\\&\quad +2\Vert (\mathcal {N}_{\theta ,\alpha }'(u_n)-\mathcal {N}_{\theta ,\alpha }'(u))\nabla u\Vert _{L^2(0,T;L^2(\Omega ))}^2\\&\le 2\Vert \mathcal {N}_{\theta ,\alpha }'(u_n)\Vert ^2_{L^\infty ((0,T)\times \Omega )}\Vert \nabla u_n-\nabla u\Vert _{L^2(0,T;L^2(\Omega ))}^2 \\&\quad + 2(L''_{\theta ,\alpha }(\Vert u\Vert _\mathcal {V}))^2\Vert (u_n- u)\nabla u\Vert _{L^2(0,T;L^2(\Omega ))}^2\\&\le 2(L'_{\theta ,\alpha }(\Vert u\Vert _\mathcal {V}))^2 \Vert \nabla u_n-\nabla u\Vert ^2_{L^2(0,T;L^2(\Omega ))} \\&\quad + 2(L''_{\theta ,\alpha }(\Vert u\Vert _\mathcal {V}))^2\Vert \nabla u\Vert _{C(0,T;L^6(\Omega ))}^2 \Vert u_n-u\Vert ^2_{L^2(0,T;L^3(\Omega ))}, \end{aligned}$$

implying \(A+B \rightarrow 0\) for \(u_n\rightharpoonup u\) in \(\mathcal {V}\), with local Lipschitz constants \(L', L''\). This shows continuity of \( \mathcal {N} \) in u; continuity of \( \mathcal {N} \) in \((\alpha ,\theta )\) can be shown similarly. For F, when c and a are known and fixed, it is just a linear operator in u. Weak continuity of F hence follows from its boundedness, which can be confirmed in the same fashion as for A and B above. \(\square \)

To conclude this section, we consider a Hilbert space setting that will be relevant for our subsequent applications.

Remark 30

(Hilbert space framework for application) Another possible Hilbert space framework where the all-at-once setting is applicable is

$$\begin{aligned} \mathcal {V}&=H^1(0,T;H^2(\Omega ))\hookrightarrow C(0,T;H^2(\Omega )),\\ \mathcal {W}&=L^2(0,T;L^2(\Omega )),\quad \mathcal {Y}=L^2(0,T;Y),\\ V&=\widetilde{V}=H=H^2(\Omega ){\hookrightarrow Y}, \quad W=L^2(\Omega )\end{aligned}$$

where Y is a Hilbert space, and

$$\begin{aligned}&X_c= L^2(\Omega ),\quad X_a= H^2(\Omega ), \quad X_\varphi = L^2(\Omega ),\quad U_0=H^2(\Omega ). \end{aligned}$$

Verification of weak continuity and the growth condition for F can be carried out similarly as in Proposition 27; moreover, weak continuity of \((X_a\times H)\ni (a,u)\mapsto \nabla \cdot (a\nabla u)\in W\) can be confirmed like the part \((c,u)\mapsto cu\), without the need of evaluating the Nemytskii operator directly. This is the setting in which we will study the application (34) in detail.

5 Case Studies in Hilbert Space Framework

5.1 Setup for Case Studies

In this section, for the sake of simplicity of implementation, we carry out case studies for some minimization examples in a Hilbert space framework, where we drop the unknown \(\alpha \) and use the regularizers \(\mathcal {R}_1=\Vert \cdot \Vert ^2_{X\times U_0\times \mathcal {V}}\), \(\mathcal {R}_2=\Vert \cdot \Vert ^2_\Theta \).

Proposition 31

Consider the minimization problem (13) (or (14)) associated with the learning informed PDE

$$\begin{aligned}&\dot{u}-\nabla \cdot (a\nabla u) + cu - \varphi -\mathcal {N}_\theta (u)=:\dot{u} - F(\lambda ,u)- \mathcal {N} (u,\theta ) =0 \quad{} & {} \text{ in } \Omega \times (0,T)\\&u(0) = u_0{} & {} \text{ in } \Omega \end{aligned}$$

for \({\sigma \in }\, \mathcal {C}^1(\mathbb {R},\mathbb {R})\), \(M=\text {Id}\) in the Hilbert spaces

$$\begin{aligned}&\mathcal {V}=H^1(0,T;H^2(\Omega )\cap H^1_0(\Omega ))\hookrightarrow C(0,T;H^2(\Omega )),\qquad \mathcal {W}=\mathcal {Y}=L^2(0,T;L^2(\Omega )),\\&V=\widetilde{V}=H=H^2(\Omega )\cap H^1_0(\Omega ), \quad W=Y=L^2(\Omega ),\\&X_c= L^2(\Omega ),\quad X_a= H^2(\Omega ),\quad X_\varphi = L^2(\Omega ),\quad U_0=H^2(\Omega ). \end{aligned}$$

The following statements are true:

  (i)

    The minimization problem admits minimizers.

  (ii)

    The corresponding model operator \(\mathcal {G}\) is Gâteaux differentiable with locally bounded \(\mathcal {G}'\).

  (iii)

    The adjoint of the derivative operator is given by

    $$\begin{aligned}&\mathcal {G}'(\lambda ,u,\theta )^*: \mathcal {W}\times H\times \mathcal {Y}\rightarrow X\times \mathcal {V}\times \Theta \\&\mathcal {G}'(\lambda ,u,\theta )^*= \begin{pmatrix} -F'_\lambda (\lambda ,u)^* &{} 0 &{} 0\\ \left( \frac{d}{dt}-F'_u(\lambda ,u)- \mathcal {N} _u'(u,\theta )\right) ^* &{} (\cdot )_{t=0}^* &{} M^*\\ - \mathcal {N} _\theta '(u,\theta )^* &{} 0 &{}0 \end{pmatrix} =:(g_{i,j})_{i,j=1}^3 \end{aligned}$$

    with

    $$\begin{aligned}&F'_\lambda (\lambda ,u)^*:\mathcal {W}\rightarrow X, \qquad{} & {} F'_u(\lambda ,u)^*: \mathcal {W}\rightarrow \mathcal {V}, \qquad{} & {} (\cdot )_{t=0}^*:H\rightarrow \mathcal {V}\\&\mathcal {N} _\theta '(u,\theta )^*: \mathcal {W}\rightarrow \Theta ,\ {}{} & {} \mathcal {N} _u'(u,\theta )^*:\mathcal {W}\rightarrow \mathcal {V},{} & {} M^*: \mathcal {Y}\rightarrow \mathcal {V}. \end{aligned}$$

By defining \((-\Delta )^{-1}(-\Delta +\text {Id})^{-1}: L^2(\Omega )\ni k^z\mapsto {\widetilde{z}}\in H^2(\Omega )\cap H^1_0(\Omega )\) such that \({\widetilde{z}}\) solves

$$\begin{aligned} {\left\{ \begin{array}{ll} -\Delta {\widetilde{z}}&{}=z_1 \quad \text {in } \Omega \\ \quad {\widetilde{z}}&{}=0 \quad \text { on }\partial \Omega \end{array}\right. }, \qquad {\left\{ \begin{array}{ll} -\Delta z_1+z_1&{}=k^z \quad \text {in } \Omega \\ \qquad \quad z_1&{}=0 \quad \text { on }\partial \Omega , \end{array}\right. } \end{aligned}$$
(50)

we can write explicitly

$$\begin{aligned}&g_{2,2}: \quad (\cdot )^*_{t=0}h=h, \end{aligned}$$
(51)
$$\begin{aligned}&g_{2,3}: \quad M^*z(t)=\int _0^T(t+1)(-\Delta )^{-1}(-\Delta +\text {Id})^{-1}z(t)\,\textrm{d}t\nonumber \\&\quad -\int _0^t(t-s)(-\Delta )^{-1}(-\Delta +\text {Id})^{-1}z(s)\,ds, \end{aligned}$$
(52)
$$\begin{aligned}&g_{2,1}: \quad \left( \frac{d}{dt}-F'_u(\lambda ,u)- \mathcal {N} _u'(u,\theta )\right) ^*z(t) \nonumber \\&\qquad \quad =\int _0^T(t+1)(-\Delta )^{-1}(-\Delta +\text {Id})^{-1}{\widetilde{K}}z(t)\,\textrm{d}t\nonumber \\&\quad -\int _0^t(-\Delta )^{-1}(-\Delta +\text {Id})^{-1}[(t-s){\widetilde{K}}z(s)-z(s)]\,ds \nonumber \\&\qquad \quad \text {with } {\widetilde{K}}= -\nabla \cdot (a\nabla \cdot )+c- \mathcal {N} _u'(u,\theta )\text { and } \mathcal {N} _u' \text { is computed as in Lemma }~14, \end{aligned}$$
(53)
$$\begin{aligned}&g_{1,1}: \quad -F'_\lambda (\lambda ,u)^*z= {\left\{ \begin{array}{ll} \int _0^T z(t)u(t)\,\textrm{d}t \qquad &{}\text {for } \lambda =c\\ \int _0^T -z(t)\,\textrm{d}t \qquad &{}\text {for } \lambda =\varphi \\ \int _0^T (-\Delta )^{-1}(-\Delta +\text {Id})^{-1}(-\nabla \cdot (z\nabla u))(t)\,\textrm{d}t \qquad \quad &{} \text {for } \lambda =a,\\ \end{array}\right. } \end{aligned}$$
(54)

\(g_{3,1}\): one has the recursive procedure

$$\begin{aligned}&\delta _L := 1, \qquad \delta _{l-1}:= {a'}^T_{l-1} \omega ^T_l \delta _l, \quad \qquad l=L\ldots 2, \nonumber \\&\nabla _{\omega _{l-1}} \mathcal {N} (u,\theta )^*z= \int _0^T\int _\Omega \delta _{l-1} a_{l-2}^T \,z \,\textrm{d}x\,\textrm{d}t,\nonumber \\&\nabla _{\beta _{l-1}} \mathcal {N} (u,\theta )^*z= \int _0^T\int _\Omega \delta _{l-1}\,z \,\textrm{d}x\,\textrm{d}t, \end{aligned}$$
(55)

with \(a_l,a'_l\) detailed in the proof.

Proof

Assertion (i) follows from Remark 30. Using Proposition 12, Assertion (ii) can be shown similarly as in Lemma 28. The proof for Assertion (iii) is presented in Appendix B. \(\square \)
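For readers interested in implementing the recursion (55), the following sketch reproduces it for a fully connected network acting pointwise on quadrature samples of u and of the residual z. We assume here, as detailed in Appendix B, that \(a_l\) denotes the output of layer l and \(a'_l\) the derivative of the activation evaluated at the pre-activation of layer l; the discretization of the space-time integral by a single quadrature weight is a simplification for illustration only.

```python
import numpy as np

def nn_adjoint_gradients(z_res, u_vals, weights, biases, sigma, dsigma, dxdt):
    """Sketch of recursion (55): apply the adjoint of theta -> N(u, theta) to z.

    z_res, u_vals : residual z(t,x) and state u(t,x) sampled at P quadrature points
    weights, biases : lists [w_1, ..., w_L], [b_1, ..., b_L] (biases as column vectors)
    sigma, dsigma : activation and its derivative; delta_L = 1 mimics the linear output
    dxdt          : single quadrature weight for the space-time integral (simplification)
    """
    L = len(weights)
    a, aprime = [np.asarray(u_vals)[None, :]], [None]     # a_0 = input values
    for w, b in zip(weights, biases):                     # forward sweep
        pre = w @ a[-1] + b
        a.append(sigma(pre))
        aprime.append(dsigma(pre))
    grads_w, grads_b = [None] * L, [None] * L
    delta = np.ones_like(a[L])                            # delta_L := 1
    for l in range(L, 0, -1):                             # backward sweep
        grads_w[l - 1] = (delta * z_res) @ a[l - 1].T * dxdt   # int delta_l a_{l-1}^T z
        grads_b[l - 1] = np.sum(delta * z_res, axis=1, keepdims=True) * dxdt
        if l > 1:                                         # delta_{l-1} = a'_{l-1} (w_l^T delta_l)
            delta = aprime[l - 1] * (weights[l - 1].T @ delta)
    return grads_w, grads_b
```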

Corollary 32

(Discrete measurements) In the case of discrete measurements \(M_i:\mathcal {V}\rightarrow Y, M_i(u)=u(t_i), t_i\in (0,T)\), where the pointwise time evaluation is well-defined since \(\mathcal {V}\hookrightarrow C(0,T;H^2(\Omega ))\), the adjoint \(g_{2,3}\) is modified as follows. For \(h\in Y\),

$$\begin{aligned} (h,v(t_i))_{L^2(\Omega )}&= (\tilde{h},v(t_i))_{H^2(\Omega )}\\ {}&=\int _0^{t_i}(-{\ddot{u}^h}(t),v(t))_{H^2(\Omega )}\,\textrm{d}t+(\tilde{h},v(t_i))_{H^2(\Omega )}\\&=\int _0^{t_i}({\dot{u}^h}(t),\dot{v}(t))_{H^2(\Omega )}\,\textrm{d}t+(u^h(0),v(0))_{H^2(\Omega )}\\&\quad -({\dot{u}^h}(t_i)-\tilde{h}(t),v(t_i))_{H^2(\Omega )}+({\dot{u}^h}(0)-u^h(0),v(0))_{H^2(\Omega )}\\&=(u^h,v)_{H^1(0,t_i;H^2(\Omega ))}=(u^h,v)_\mathcal {V}, \end{aligned}$$

provided that \(u^h=\) const on \([t_i,T]\) in order to form the integral over the full time interval (0, T) in the last line. Above, \(h,\tilde{h}\) are respectively in place of \(k^z\) and \({\widetilde{z}}\) in (50); moreover, \(u^h\) solves

$$\begin{aligned} \begin{aligned}&{\ddot{u}^h}(t)=0 \qquad t\in (0,t_i)\\&{\dot{u}^h}(t_i)=\tilde{h}, \quad {\dot{u}^h}(0)-u^h(0)=0. \end{aligned} \end{aligned}$$
(56)

Thus we arrive at

$$\begin{aligned} (M_i)^*h=u^h(t)= {\left\{ \begin{array}{ll} (-\Delta )^{-1}(-\Delta +\text {Id})^{-1}h(t+1) \qquad &{} 0<t\le t_i\\ (-\Delta )^{-1}(-\Delta +\text {Id})^{-1}h(t_i+1) &{} t_i<t\le T. \end{array}\right. } \end{aligned}$$
(57)

This shows a numerical advantage of processing discrete observations in a Kaczmarz scheme, for instance in deterministic or stochastic optimization. To be specific, for each data point in the forward propagation, thanks to the all-at-once approach, no nonlinear model needs to be solved; in the backward propagation, for the same reason and by (57), one needs to compute the corresponding adjoint only on small time intervals.
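Schematically, such a Kaczmarz sweep could look as follows, where x collects the all-at-once unknowns and the restricted forward maps \(G_i\), their adjoints and the step size \(\mu \) are placeholders for an actual discretization; this is only meant to illustrate the cyclic structure over the observation times, not a concrete implementation.

```python
def kaczmarz_sweep(x, data, G_list, G_adj_list, mu=1.0, n_sweeps=10):
    """Schematic Landweber-Kaczmarz sweep over discrete time observations.

    x          : current all-at-once iterate (e.g. a discretized (u, theta))
    data       : [y_1, ..., y_n], the measurements at the times t_i
    G_list     : callables G_i(x), the forward map restricted to observation i
    G_adj_list : callables G_i_adj(x, r), adjoint of G_i'(x) applied to a residual r
    All of these are placeholders; the point is only that each sub-step touches a
    single observation time, cf. (57).
    """
    for _ in range(n_sweeps):
        for G_i, G_i_adj, y_i in zip(G_list, G_adj_list, data):
            residual = G_i(x) - y_i            # misfit of the i-th observation
            x = x - mu * G_i_adj(x, residual)  # gradient step for this block only
    return x
```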

5.2 Numerical Results

This section is dedicated to a range of numerical experiments carried out in two parallel settings: by way of analytic adjoints in Sect. 5.2.1, and with Pytorch in Sect. 5.2.2. While, in our experiments, we evaluate and compare the proposed method for different settings, such as varying the number of time measurements or noise, we highlight that the main purpose of these experiments is to show numerical feasibility of the proposed approach in principle, rather than providing highly optimized results. In particular, a tailored optimization of, e.g., regularization parameters and initialization strategies involved in our method might still be able to improve results significantly.

For both settings (analytic adjoints and Pytorch), we use the following learning-informed PDE as special case of the one considered in Proposition 31:

$$\begin{aligned} \begin{aligned}&\dot{u}-\Delta u - \varphi -\mathcal {N}_\theta (u) =0 \quad{} & {} \text{ in } \Omega \times (0,T)\\&u(0) = u_0=0{} & {} \text{ in } \Omega , \end{aligned} \end{aligned}$$
(58)

We deal with time-discrete measurements as in Corollary 32, i.e., we use a time-discrete measurement operator \(M:\mathcal {V}\rightarrow L^2(\Omega )^{n_T}\), with \(n_T \in \mathbb {N}\), given as \(M(u)_{t_i} = u(t_i)\) for \(t_0 = 0\) and \(t_i \in (0,T)\) with \(i=1,\ldots ,n_T-1\). We further let a noisy measurement of the initial state \(u_0\) be given at timepoint \(t=0\). Further, we consider two situations:

  1.

    The source \(\varphi \) in (58) is fixed; we estimate the state u and the nonlinearity \(\mathcal {N}_\theta \) only, yielding a model operator \(\mathcal {G}_\varphi :H^1(0,T;H^2(\Omega )\cap H^1_0(\Omega )) \times \Theta \rightarrow L^2(0,T;L^2(\Omega )) \times L^2(0,T;L^2(\Omega ))\) given as

    $$\begin{aligned} \mathcal {G}_\varphi (u,\theta ) = \begin{pmatrix} \dot{u}-\Delta u - \varphi - \mathcal {N}_\theta (u) \\ Mu \end{pmatrix}. \end{aligned}$$
  2.

    The source \(\varphi \) in (58) is unknown, and we estimate the state u, the source \(\varphi \) and the nonlinearity \(\mathcal {N}_\theta \). This results in a model operator \(\mathcal {G}:L^2(\Omega ) \times H^1(0,T;H^2(\Omega )\cap H^1_0(\Omega )) \times \Theta \rightarrow L^2(0,T;L^2(\Omega )) \times L^2(0,T;L^2(\Omega ))\) given as

    $$\begin{aligned} \mathcal {G}(\varphi ,u,\theta ) = \begin{pmatrix} \dot{u}-\Delta u - \varphi - \mathcal {N}_\theta (u) \\ Mu \end{pmatrix}. \end{aligned}$$

For these two settings, the special case of the learning problem (13) we consider here is given as

$$\begin{aligned} \min _{\begin{array}{c} (u^k)_k \in \mathcal {V}\\ \theta \in \Theta \end{array}} \sum _{k=1}^K \left( \Vert \mathcal {G}_\varphi (u^k,\theta ) - (0,y^k) \Vert ^2_{\mathcal {W}\times \mathcal {Y}} + \Vert u^k\Vert _{\mathcal {V}}^2 \right) + \Vert \theta \Vert _2 ^2, \end{aligned}$$
(59)

for state- and nonlinearity identification and

$$\begin{aligned} \min _{\begin{array}{c} (\varphi ^k,u^k)_k \in L^2(\Omega ) \times \mathcal {V}\\ \theta \in \Theta \end{array}} \sum _{k=1}^K \left( \Vert \mathcal {G}(\varphi ^k,u^k,\theta ) - (0,y^k) \Vert ^2_{\mathcal {W}\times \mathcal {Y}} + \Vert u^k\Vert _{\mathcal {V}}^2 + \Vert \varphi ^k\Vert _{L^2(\Omega )}^2 \right) + \Vert \theta \Vert _2 ^2 \end{aligned}$$
(60)

for state-, parameter- and nonlinearity identification.

It is clear that identifying both the nonlinearity and the state introduces some ambiguities; for instance, the PDE is invariant under adding a constant to the nonlinearity and subtracting the same constant from the source term. To account for that, we always correct such a constant offset in the evaluation of our results. As the following remark shows, at least if the state u is fixed appropriately, a constant shift is the only ambiguity that can occur.

Remark 33

(Offsets) Let \(\Omega _y:=u(\Omega \times (0,T))\) denote the range of u over all \(x\in \Omega , t\in (0,T)\), and assume that \(\frac{\partial }{\partial t}u(x,t)\ne 0\). Consider any solutions \(f: \Omega _y\rightarrow \mathbb {R}\), \(\varphi :\Omega \rightarrow \mathbb {R}\) of (34). Then all solutions of (34) are of the form

$$\begin{aligned} \tilde{f}(y):= f(y) + c, \qquad \tilde{\varphi }(x):= \varphi (x) - c, \qquad c\in \mathbb {R}. \end{aligned}$$

Indeed, assume \(\tilde{f}\), \(\tilde{\varphi }\) are solutions, and define \(g(y):=\tilde{f}(y) - f(y)\), \(\Phi (x):=\tilde{\varphi }(x) - \varphi (x)\). Since both pairs are solutions, one has \(0 = g(u(x,t)) + \Phi (x)\) for all \((x,t)\), so that

$$\begin{aligned} 0 = -\frac{\partial }{\partial t}\Phi (x) = \frac{\partial }{\partial t}g(u(x,t)) = g'(u(x,t))\frac{\partial }{\partial t}u(x,t). \end{aligned}$$

As \(\frac{\partial }{\partial t}u(x,t)\ne 0\) on \(\Omega \times (0,T)\), it follows that \(g'(y)\equiv 0\) on \(u(\Omega \times (0,T))\), that is, there is some \(c\in \mathbb {R}\) such that \(c=g(u(x,t))=-\Phi (x)\) for all \((x,t)\in \Omega \times (0,T)\).

Moreover, given any solutions f, \(\varphi \), setting

$$\begin{aligned} c:= \frac{\int _\Omega \varphi (x)\, \,\textrm{d}x - \int _{\Omega _y}f(y)\,dy}{|\Omega | + |\Omega _y|} \end{aligned}$$

yields solutions \(\tilde{f}(y):=f(y)+c\), \(\tilde{\varphi }(x):=\varphi (x)-c\) minimizing \(\Vert \tilde{\varphi }\Vert _{L^2(\Omega )}^2 + \Vert \tilde{f}\Vert _{L^2(\Omega _y)}^2\) among all such solutions.
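For completeness, this offset correction can be carried out numerically, for instance with the trapezoidal rule; the following lines are a hypothetical sketch assuming that f and \(\varphi \) are given as arrays on uniform grids of \(\Omega _y\) and \(\Omega \).

```python
import numpy as np

def correct_offset(f_vals, y_grid, phi_vals, x_grid):
    """Compute the constant c of Remark 33 via the trapezoidal rule and return
    the shifted pair (f + c, phi - c) minimizing the sum of squared L^2-norms."""
    c = (np.trapz(phi_vals, x_grid) - np.trapz(f_vals, y_grid)) \
        / ((x_grid[-1] - x_grid[0]) + (y_grid[-1] - y_grid[0]))
    return f_vals + c, phi_vals - c
```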

Remark 34

(Different measurement operators) In our experiments, we use a time-discrete measurement operator, and at the times at which data is measured, we assume measurements to be available on the entire spatial domain. As will be seen in the next two subsections, reconstruction of the nonlinearity is possible in this case even with rather few time measurements. A further extension of the measurement setup could be to use partial measurements also in space. While we expect similar results for approximately uniformly distributed partial measurements in space, highly localized measurements such as boundary measurements and measurements on subdomains are more challenging. In this case, we expect the reconstruction quality of the nonlinearity to depend strongly on the range of values the state u attains at the observed points, but given the analytical focus of our paper, we leave this topic to future research.

Discretization In all but one experiment (in which we test different spatial and temporal resolutions), we consider the time interval \([0,T]=[0,0.1]\), uniformly discretized with 50 time steps, and the spatial domain \(\Omega = (0,1)\), uniformly discretized with 51 grid points. The time derivative as well as the Laplace operator were discretized with central differences. For the neural network \(\mathcal {N}_\theta \), we consider a fully connected network with \(\tanh \) activation functions and three hidden layers of widths 2, 4 and 2 for all experiments. Note that this network architecture was chosen empirically by evaluating the approximation capacity of different architectures with respect to different nonlinear functions. For the sake of simplicity, we choose a simple, rather small architecture (satisfying the assumptions of our theory) for all experiments considered in this paper. In general, the architecture (together with regularization of the network parameters) must be chosen such that a balance between expressivity and the risk of overfitting is reached (see for instance [3, Sects. 1.2.2 and 3]), but a detailed evaluation of different architectures is not within the scope of our work.
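For concreteness, such a network could be set up in Pytorch as in the following sketch; the module name `Nonlinearity` is hypothetical, and only the architecture (fully connected, \(\tanh \) activations, hidden widths 2, 4 and 2, i.e. 29 trainable parameters) reflects the setting described above.

```python
import torch
import torch.nn as nn

class Nonlinearity(nn.Module):
    """Fully connected network u -> N_theta(u), tanh activations,
    hidden widths 2, 4 and 2 (29 trainable parameters in total)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, 2), nn.Tanh(),
            nn.Linear(2, 4), nn.Tanh(),
            nn.Linear(4, 2), nn.Tanh(),
            nn.Linear(2, 1),
        )

    def forward(self, u):
        # apply the scalar nonlinearity pointwise to all state values
        return self.net(u.unsqueeze(-1)).squeeze(-1)
```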

5.2.1 Implementation with Analytic Adjoints

Setup In what follows, we apply Landweber iteration to solve the minimization problem (13). The Landweber algorithm is implemented with the analytic adjoints computed in Proposition 31 and Corollary 32, ensuring that the backward propagation maps to the correct spaces.

PDE and adjoints. We employed finite difference methods to numerically compute the derivatives in the PDE model, as well as in the adjoints outlined in Proposition 31 and Corollary 32. In particular, central difference quotients were used to approximate time and space derivatives. For numerical integration, we applied the trapezoidal rule. The inverse operator \((-\Delta )^{-1}(-\Delta +\text {Id})^{-1}\) constructed in (50) is applied in each Landweber iteration.

Neural network. In the examples considered, \(f: u(x)\mapsto f(u(x))\) is a real-valued smooth function, hence the suggested simple architecture with three hidden layers of 2, 4 and 2 neurons is appropriate. As the reconstruction is carried out in the all-at-once setting, the network parameters were estimated simultaneously with the state. The iterative update of the network parameters is done in the recursive fashion of (55).

Data measurement. We work with measured data y given as a limited number of snapshots of u (see Corollary 32) and evaluate the examples both without noise and with \(\delta =3\%\) relative noise. The noise \(\epsilon \) is sampled from a standard Gaussian distribution \(\mathcal {N}(0,1)\), and the measured data is \(y=u+\delta \epsilon (\Vert u\Vert _2/\Vert \epsilon \Vert _2)\).
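A minimal sketch of this noise model (with a hypothetical helper function and a fixed seed for reproducibility) could read as follows.

```python
import numpy as np

def add_relative_noise(u, delta, seed=0):
    """Return y = u + delta * eps * (||u||_2 / ||eps||_2) with eps ~ N(0,1)."""
    eps = np.random.default_rng(seed).standard_normal(u.shape)
    return u + delta * eps * (np.linalg.norm(u) / np.linalg.norm(eps))
```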

Error. The error between the reconstruction and the ground truth was measured in the corresponding norms, i.e. the \(X_\varphi \)-norm for \(\varphi \) and the \(\mathcal {W}\)-norm for the PDE residual and the error in f. For u, the \(\mathcal {V}\)-norm would be the natural measure; for simplicity, we display the \(L^2\)-error.

Minimization problem. The regularization terms for u and \(\varphi \) are weighted equally, \(R_u=R_\varphi \), and the measurement operator is scaled as \(M_i(u)=10\,u(t_i)\) (cf. Corollary 32). We implement an adaptive Landweber step-size scheme: if the PDE residual decreases in the current step, the step size is accepted, otherwise it is bisected. For noisy data, the iterations are terminated once a stopping rule based on the discrepancy principle (cf. [22]) is satisfied.
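Schematically, the adaptive step-size rule and the discrepancy-principle stopping criterion can be combined as in the following sketch; here `forward_residual` and `adjoint_step` are placeholders for the discretized forward operator and the analytic adjoints of Proposition 31 and Corollary 32, and the numerical values of the step size, \(\tau \) and the iteration budget are illustrative assumptions.

```python
import numpy as np

def landweber(x0, forward_residual, adjoint_step, delta,
              step=1.0, tau=1.5, max_iter=5000):
    """Adaptive Landweber iteration: accept the step if the residual decreases,
    otherwise bisect the step size; stop via the discrepancy principle."""
    x = x0
    res = forward_residual(x)                      # stacked PDE and data residual
    while max_iter > 0 and np.linalg.norm(res) > tau * delta:
        x_new = x - step * adjoint_step(x, res)    # gradient-type update via adjoints
        res_new = forward_residual(x_new)
        if np.linalg.norm(res_new) < np.linalg.norm(res):
            x, res = x_new, res_new                # accept the step
        else:
            step *= 0.5                            # reject and bisect the step size
        max_iter -= 1
    return x
```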

Numerical results

Figure 1 discusses the example where only a few snapshots of u are measured; explicitly, we here have three measurements \(y_j=u(t_j), j=1, 25, 50\), i.e. \(n_T=3\). We test the performance using three datasets with differing source terms and states (i.e. \(K=3\) in (59)) but identical nonlinearity f. The top left panel (which we denote by panel (1, 1)) displays the three measurements of dataset \(u_1\); each line here represents a plot of \(u_1(t_i)\). The same plotting style applies to dataset 2 (panel (1, 2)) and dataset 3 (panel (1, 3)). The exact sources \(\varphi _i,i=1,2,3\), of the three equations are given in panel (2, 1). In panel (3, 2), the nonlinearity f is expressed via a network of 3 hidden layers with [2, 4, 2] neurons. In this example, we identify \(u_i, i=1,2,3\) (panels \((2,3-6)\)) and f (see Sect. 5.2.2 for more experiments, including recovery of physical parameters). The output errors in f (panel (3, 3)), u (panels \((3,4-6)\)) and the PDE residual (panel (2, 3)) indicate convergence of the cost functional to a minimizer. The noisy case is presented in Fig. 2.

Fig. 1

Numerical identification of state u and ground-truth nonlinearity \(f(u)=u^2-1\) in (58) for three different values of the source term \(\varphi \). In each case, three noise-free observations are given (\(n_T=3\)). Plots 1-3 and 4-6 in the top line show the given data and the ground truth state for the three equations, respectively. The content of the remaining plots is described in the titles

Fig. 2

Numerical identification of state u and ground-truth nonlinearity \(f(u)=u^2-1\) in (58) for three different values of the source term \(\varphi \). In each case, three observations (\(n_T=3\)) with 3% noise are given. Plots 1-3 and 4-6 in the top line show the given data and the ground truth state for the three equations, respectively. The content of the remaining plots is described in the titles

5.2.2 Implementation with Pytorch

The experiments of this section were carried out using the Pytorch [29] package to numerically solve (59) and (60). More specifically, we used the pre-implemented ADAM [25] algorithm with automatic differentiation, a learning rate of 0.01 and \(10^4\) iterations for all experiments. In case noise is added to the data, we use Gaussian noise with zero mean and different standard deviations denoted by \(\sigma \). The code is available at https://github.com/hollerm/pde_learning.
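To give an impression of this setup, the following is a simplified sketch of how a problem of type (59) could be minimized with ADAM in Pytorch for a single datum. The source term, measurement indices, data and regularization weights are placeholder assumptions, the state regularization is simplified to a plain \(L^2\)-norm (whereas the analysis uses the \(\mathcal {V}\)-norm), boundary and initial conditions are omitted for brevity, and the `Nonlinearity` module from the sketch in the discretization paragraph is reused.

```python
import torch

# Placeholder discretization and data (assumed): nt x nx space-time grid,
# source phi, measured time indices t_idx and data y_data of shape (len(t_idx), nx).
nt, nx, dt, dx = 50, 51, 0.1 / 49, 1.0 / 50
x = torch.linspace(0.0, 1.0, nx)
phi = torch.sin(torch.pi * x)                      # placeholder source term
t_idx, y_data = [1, 25, 49], torch.zeros(3, nx)    # placeholder measurements
alpha, beta = 1e-4, 1e-4                           # illustrative regularization weights

u = torch.zeros(nt, nx, requires_grad=True)        # unknown state
model = Nonlinearity()                             # unknown nonlinearity N_theta
opt = torch.optim.Adam([u] + list(model.parameters()), lr=0.01)

for it in range(10**4):
    opt.zero_grad()
    ut = (u[2:, 1:-1] - u[:-2, 1:-1]) / (2 * dt)                     # d/dt, central
    uxx = (u[1:-1, 2:] - 2 * u[1:-1, 1:-1] + u[1:-1, :-2]) / dx**2   # Laplacian
    pde = ut - uxx - phi[1:-1] - model(u[1:-1, 1:-1])                # residual of (58)
    loss = (pde.pow(2).sum() * dt * dx                               # PDE misfit
            + (u[t_idx] - y_data).pow(2).sum() * dx                  # data misfit
            + alpha * u.pow(2).sum() * dt * dx                       # simplified state reg.
            + beta * sum(p.pow(2).sum() for p in model.parameters()))  # ||theta||_2^2
    loss.backward()
    opt.step()
```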

Solving for state and nonlinearity In this paragraph we provide experiments for the learning problem with a single datum, where we solve for the state and the nonlinearity and test with increasing noise levels and decreasing numbers of observations. We refer to Fig. 3 for a visualization of selected results, and to Table 1 (top) for error measures for all tested parameter combinations.

It can be observed that reconstruction of the nonlinearity works reasonably well even for a rather low number of measurements together with a rather high noise level: the shape of the nonlinearity is reconstructed correctly in all cases except the one with three time measurements and a noise level of \(\sigma = 0.1\).

Solving for parameter, state and nonlinearity In this section, we provide experiments for the learning problem with a single datum, where we solve for the parameter, the state and the nonlinearity and test with increasing noise levels and decreasing numbers of observations. We refer to Fig. 4 for a visualization of selected results and to Table 1 (bottom) for error measures for all tested parameter combinations.

It can again be observed that the reconstruction works rather well, in this case for both the nonlinearity and the parameter. Nevertheless, due to the additional degrees of freedom, the reconstruction breaks down earlier than in the case of identifying just the state and the nonlinearity.

Varying the discretization level In this paragraph, we test the effect of different spatial and temporal resolutions of the state discretization. To this aim, we reproduce the experiment of line 3 of Fig. 4 (6 time measurements, \(\delta =0.03\), quadratic nonlinearity, solving for nonlinearity and state) with \(501 \times 500\) and \(5001 \times 5000\) gridpoints in space \(\times \) time (instead of \(51 \times 50\) as in the original example).

Fig. 3

Numerical identification of state u and ground-truth nonlinearity \(f(u)=u^2-1\) in (58) for decreasing numbers of discrete observations (lines 1-2, 3-4 and 5-6) and increasing noise levels (even lines versus odd lines). Left: Given data, center: recovered state, right: recovered nonlinearity (orange) compared to ground truth (blue) (Color figure online)

The result can be found in Fig. 5. As can be observed there, changing the resolution level has only a minor effect on the result, possibly slightly decreasing the reconstruction quality for the nonlinearity. We attribute this to the fact that the number of spatial grid points used for the measurement was increased accordingly; see also Remark 34 for a discussion of localized measurements.

Table 1 Summary of errors in recovering nonlinearity and state (top) and in recovering nonlinearity, state and parameter (bottom) for different noise levels and different numbers of discrete measurements (denoted by tmeas)

Reconstructing the nonlinearity from multiple samples In this paragraph we show numerically the effect of having different numbers of data points available, i.e., the effect of different numbers \(K\in \mathbb {N}\) in (60). We again consider the identification of state, parameter and nonlinearity and use three time measurements and a noise level of 0.08, a setting in which the identification of the nonlinearity breaks down when only a single datum is available.

Fig. 4

Numerical identification of state u, ground-truth nonlinearity \(f(u)=u^2-1\) and the parameter \(\varphi \) in (58) for decreasing numbers of discrete observations (lines 1–2, 3–4 and 5–6) and increasing noise levels (even lines versus odd lines). From left to right: Given data, recovered state, recovered nonlinearity (orange) compared to ground truth (blue), recovered parameter (orange) compared to ground truth (blue) and initialization (green) (Color figure online)

Fig. 5

Identical setting as in line 3 of Fig. 4 (6 time measurements, \(\delta =0.03\), quadratic nonlinearity, solving for nonlinearity and state), but with different spatial \(\times \) temporal resolution levels. Left to right: Plots 1 and 3: Approximate state obtained with \(501 \times 500\) and \(5001 \times 5000\) grid points, respectively. Plots 2 and 4: Recovered nonlinearity (orange) compared to ground truth (blue) for \(501 \times 500\) and \(5001 \times 5000\) grid points, respectively. The error in the nonlinearity is 3.60e\(-\)06 for \(501 \times 500\) gridpoints and 5.74e\(-\)06 for \(5001 \times 5000\) gridpoints (compare Table 1) (Color figure online)

Fig. 6

Numerical identification of state u, ground-truth nonlinearity \(f(u)=u^2-1\) and the parameter \(\varphi \) in (58) for an increasing number of measurement data. Top to bottom: 1,3 and 5 measurements. Left to right: recovered nonlinearity (orange) compared to ground truth (blue), ground truth parameters, recovered parameters (Color figure online)

Fig. 7

Comparison of different approximation methods. From top to bottom: Neural network, polynomial, trigonometric polynomial. From left to right: ground-truth nonlinearity \(f(u) = 2 - u \), \(f(u) = u^2 - 1\), \(f(u) = (u-0.1)(u-0.5)(141.6u-30)\) and \(f(u) = \cos (3\pi u)\)

As can be observed in Fig. 6, having multiple data samples improves the reconstruction quality as expected. It is worth noting that here, even though each single parameter is reconstructed rather imperfectly with strong oscillations, the nonlinearity is recovered reasonably well already for three data samples. This is to be expected, as the nonlinearity is shared among the different measurements, while the parameter differs.

Comparison of different approximation methods Here we evaluate the benefit of approximating the nonlinearity with a neural network, as compared to classical approximation methods. As a test example, we consider the identification of the state and the nonlinearity only, using a noise level of 0.03 and 10 discrete time measurements. We consider four different ground-truth nonlinearities: \(f(u) = 2 - u \) (linear), \(f(u) = u^2 - 1\) (square), \(f(u) = (u-0.1)(u-0.5)(141.6u-30)\) (polynomial) and \(f(u) = \cos (3\pi u)\) (cosine).

As approximation methods we use polynomials as well as trigonometric polynomials, where in both settings we allow for the same number (\(=29\)) of degrees of freedom as with the neural network approximation. For all methods, the same algorithm (ADAM) was used, and the regularization parameters for the state and the parameters of the nonlinearity were optimized by gridsearch to achieve the best performance.
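For reference, the two classical parametrizations can be set up with the same number of degrees of freedom, for instance as in the following sketch; the concrete choice of basis functions and their scaling are assumptions made for illustration and not necessarily those used in our experiments.

```python
import torch

class Polynomial(torch.nn.Module):
    """f(u) = sum_{j=0}^{28} c_j u^j, i.e. 29 coefficients."""
    def __init__(self, dof=29):
        super().__init__()
        self.coef = torch.nn.Parameter(torch.zeros(dof))

    def forward(self, u):
        basis = torch.stack([u**j for j in range(self.coef.numel())], dim=-1)
        return basis @ self.coef

class TrigPolynomial(torch.nn.Module):
    """f(u) = a_0 + sum_{j=1}^{14} a_j cos(j*pi*u) + b_j sin(j*pi*u), 29 coefficients."""
    def __init__(self, dof=29):
        super().__init__()
        self.coef = torch.nn.Parameter(torch.zeros(dof))
        self.order = (dof - 1) // 2

    def forward(self, u):
        basis = [torch.ones_like(u)]
        basis += [torch.cos(j * torch.pi * u) for j in range(1, self.order + 1)]
        basis += [torch.sin(j * torch.pi * u) for j in range(1, self.order + 1)]
        return torch.stack(basis, dim=-1) @ self.coef
```

Either of these modules could replace the neural network module in the Pytorch sketch above without further changes.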

The results can be seen in Fig. 7. While each method yields a good approximation in some cases, it can be observed that the polynomial approximation performs poorly both for the cosine nonlinearity and the polynomial nonlinearity (even though the degrees of freedom would be sufficient to represent the latter exactly). The trigonometric polynomial approximation, on the other hand, generally performs better, but produces some oscillations when approximating the square nonlinearity. The neural network approximation performs rather well for all types of nonlinearity, which might be interpreted to mean that neural-network approximation is preferable when no structural information on the ground-truth nonlinearity is available. It should be noted, however, that due to the non-convexity of the problem, this result depends on many factors such as the choice of initialization and the numerical algorithm.

6 Conclusion

We have considered the problem of learning a partially unknown PDE model from data, in a situation where the state is accessible only indirectly via incomplete, noisy observations of a parameter-dependent system with unknown physical parameters. The unknown part of the PDE model was assumed to be a nonlinearity acting pointwise, and was approximated via a neural network. Using an all-at-once formulation, the resulting minimization problem was analyzed and well-posedness was obtained for a general setting as well as for a concrete application. Furthermore, a tangential cone condition was ensured for the neural network part of the resulting learning-informed parameter identification problem, thereby providing the basis for local uniqueness and convergence results. Finally, numerical experiments using two different types of implementation strategies have confirmed the practical feasibility of the proposed approach.