
Replica symmetry breaking in supervised and unsupervised Hebbian networks


Published 9 April 2024 © 2024 The Author(s). Published by IOP Publishing Ltd
Citation: Linda Albanese et al 2024 J. Phys. A: Math. Theor. 57 165003. DOI: 10.1088/1751-8121/ad38b4


Abstract

Hebbian neural networks with multi-node interactions, often called Dense Associative Memories, have recently attracted considerable interest in the statistical mechanics community, as they have been shown to outperform their pairwise counterparts in a number of features, including resilience against adversarial attacks, pattern retrieval with extremely weak signals and supra-linear storage capacities. However, their analysis has so far been carried out within a replica-symmetric theory. In this manuscript, we relax the assumption of replica symmetry and analyse these systems at one step of replica-symmetry breaking, focusing on two different prescriptions for the interactions that we will refer to as supervised and unsupervised learning. We derive the phase diagram of the model using two different approaches, namely Parisi's hierarchical ansatz for the relationship between different replicas within the replica approach, and the so-called telescope ansatz within Guerra's interpolation method: our results show that replica-symmetry breaking does not alter the threshold for learning and slightly increases the maximal storage capacity. Further, we also derive analytically the instability line of the replica-symmetric theory, using a generalization of the De Almeida and Thouless approach.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 license. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

Since Hopfield's seminal work on the use of biologically inspired neural networks for associative memory and pattern recognition tasks [1], there have been many important contributions applying concepts from statistical mechanics, in particular spin-glass theory, to the study of neural networks with pairwise Hebbian interactions (for an overview of the vast literature, the reader is referred to the books [2–9] and to the recent reviews [10, 11]). This connection was first pointed out by Hopfield himself in [1], where he showed that neural networks with pairwise Hebbian interactions were particular realizations of spin glasses, and it was further developed by Amit et al [12] who showed that the free-energy landscape of such systems is characterized by a large number of local minima, corresponding to different patterns of information, retrieved by the network from different initial conditions. In addition to neural networks with pairwise Hebbian interactions, a large body of work has focused on extending Hebbian learning to multi-node interactions, both in the early days (see e.g. [13–15]) and more recently (see e.g. [16, 17]). These models are often referred to as dense associative memories [16], as they can store many more patterns than the number of neurons in the network. In particular, they can perform pattern recognition with a supra-linear storage load [15] and can work at very low signal-to-noise ratio, when compared to their pairwise counterparts [18]. In addition, they have been shown to be resilient to adversarial attacks [19] and combinations of such models can lead to exponential storage capacity [20].

In recent years, a duality between Hebbian neural networks and machine learning models has been pointed out. For example, the Hopfield model has been shown to be equivalent to a restricted Boltzmann machine [21], an archetypical model for machine learning [22], and sparse restricted Boltzmann machines have been mapped to Hopfield models with diluted patterns [23, 24]. Furthermore, restricted Boltzmann machines with generic priors have led to the definition of generalized Hopfield models [25–27] and neural networks with multi-node Hebbian interactions have recently been shown to be equivalent to higher-order Boltzmann machines [28, 29] and deep Boltzmann machines [30, 31]. As a result, multi-node Hebbian learning is receiving a second wave of interest since its foundation in the eighties [13, 14] as a paradigm to understand deep learning [17, 32].

In order to make this connection clearer, we should stress that neither the Hopfield model nor its generalized versions with multi-node Hebbian interactions are machine learning models, in that they are not trained from data and their couplings do not evolve in time according to a learning rule. Instead, their couplings are fixed from the outset to the values that would have been attained by a neural network trained according to the Hebbian rule, which adjusts the strengths of the connections between the neurons based on the input patterns that the network is exposed to. In this work, we consider two different routes for exposing the network to data, as considered in [33, 34]. In particular, we analyze the scenario where the input patterns are noisy or corrupted versions of pattern archetypes. In the first route, which we will call 'unsupervised', we expose the network to all of the input patterns together, without specifying the archetype that each pattern is meant to represent [35]. In the second route, which we will call 'supervised', we split the training dataset into different classes, corresponding to the archetypes, and the network is exposed to the examples in each class, class by class, in the same way that a teacher helps a student to learn one topic at a time, from different examples [36]. Beyond being more realistic than traditional Hebbian storing, this generalization allows us to pose a number of new questions, such as: is the network able, in either scenario, to retrieve the archetypes by itself, i.e. to find hidden and intrinsic structures in the training dataset? What is the minimal number of examples per archetype that the network has to experience, given the noise in the examples and the number of archetypes, in the two scenarios?

These questions have been addressed in neural networks with pairwise as well as multi-node Hebbian interactions, for both random uncorrelated datasets and structured datasets (including the well-known MNist or Fashion MNist) [34, 37, 38], within a replica-symmetric (RS) analysis, which is based on the assumption that different replicas (or copies) of the system are invariant under permutations. However, since neural networks are particular realizations of spin glasses, such symmetry is expected to be broken in certain regions of the parameter space, where the system develops a multitude of degenerate states. In a recent work [39], we devised a simple and rigorous method to detect the onset of the instability of RS theories in neural networks with pairwise and multi-node Hebbian interactions. Building on this work, we derive the RS instability lines for the two different protocols defined here, supervised and unsupervised, and we re-investigate the problem defined above, previously analysed within an RS assumption, by carrying out the analysis at one step of replica-symmetry breaking (1RSB). Since Parisi's seminal work on replica symmetry breaking [2], several alternative mathematical techniques have been proposed to investigate replica-symmetry broken phases [40–42]. In this work, we will use interpolation techniques, which were pioneered by Guerra in the context of his work on spin glasses [43] and were later applied to neural networks (e.g. [44]). One advantage of this approach is that it avoids certain heuristics required by the replica method and leads to simpler calculations. For completeness, we will also derive the same results using the replica approach, with the pedagogical aim of creating a bridge between the two methods, one being a cornerstone of the statistical physics community (Parisi's), the other constituting a golden niche in the mathematical physics community (Guerra's).

The paper is structured as follows. In section 2 we define a first model for dense associative memories (DAM) that we shall refer to as 'unsupervised' and we analyse it at one step of replica symmetry breaking via Guerra's interpolation method and replica techniques. In section 3 we define a second model for DAM, that we shall refer to as 'supervised', and we carry out the same analysis as in section 2. Finally, in section 4 we summarize and discuss our results. We relegate to the appendices the technical details, as well as the derivation of the instability line of the RS theory, which shows the importance of analysing these systems under the assumption of replica symmetry breaking.

2. 'Unsupervised' DAM

In this section we analyse the information processing capabilities of DAM in the so-called 'unsupervised' setting, as introduced in section 2.1, via two different approaches, namely Guerra's interpolation techniques (section 2.2) and Parisi's replica approach (section 2.3). Both analyses are carried out at the first step of RSB. Results are discussed in section 2.4. The instability line of the RS theory (equivalent to the de Almeida and Thouless line in the context of spin-glasses), showing the importance of working under the assumption of broken replica symmetry, is derived in appendix A.1.

2.1. Model and definitions

We consider a system of N neurons, modelled via Ising spins $\sigma_i \in \{-1, 1 \}$, with $i = 1,\ldots,N$, where neurons interact via P-node interactions. We assume to have K Rademacher archetypes, defined as N-dimensional vectors $\boldsymbol{\xi}^{\mu}$, with $\mu = 1,\ldots, K$, where each entry is drawn randomly and independently from the distribution

Equation (2.1): $\mathbb{P}(\xi_i^{\mu} = +1) = \mathbb{P}(\xi_i^{\mu} = -1) = \frac{1}{2}.$

Such archetypes are not provided to the network, instead the network will have to infer them by experiencing solely noisy or corrupted versions of them. In particular, we assume that for each archetype µ, M examples $\boldsymbol{\eta}^{\mu,a}$, with $a = 1, \ldots, M$, are available, which are corrupted versions of the archetype, such that

Equation (2.2): $\mathbb{P}(\eta_i^{\mu,a} \vert \xi_i^{\mu}) = \frac{1+r}{2}\,\delta_{\eta_i^{\mu,a},\,\xi_i^{\mu}} + \frac{1-r}{2}\,\delta_{\eta_i^{\mu,a},\,-\xi_i^{\mu}},$

where $r \in [0,1]$ controls the quality of the dataset, i.e. for r = 1 the example perfectly matches the archetype, while for r = 0 it is purely stochastic. To quantify the information content of the dataset, it is useful to introduce the variable

Equation (2.3): $\rho = \frac{1-r^2}{M r^2},$

that we shall refer to as the dataset entropy. We note that ρ vanishes either when the examples are identical to the archetypes (i.e. r → 1) or when the number of examples is infinite (i.e. $M \to \infty$), or both. If we regard the traditional DAM model as storing patterns and the unsupervised DAM model as learning patterns from stored examples, then ρ can be thought of as the parameter that quantifies the difference between storing and learning (a difference that grows with ρ).
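To make the data-generation protocol concrete, the following is a minimal numerical sketch of equations (2.1)–(2.3); the variable names and array shapes are our own choices, not part of the model definition.

import numpy as np

rng = np.random.default_rng(0)
N, K, M, r = 500, 10, 20, 0.2

# K Rademacher archetypes: entries are +/-1 with equal probability, eq. (2.1)
xi = rng.choice([-1, 1], size=(K, N))

# M noisy examples per archetype, eq. (2.2): each entry keeps the archetype's
# sign with probability (1 + r)/2 and flips it otherwise, so E[eta | xi] = r * xi
chi = rng.choice([1, -1], size=(K, M, N), p=[(1 + r) / 2, (1 - r) / 2])
eta = xi[:, None, :] * chi

# dataset entropy, eq. (2.3): vanishes as r -> 1 or M -> infinity
rho = (1 - r**2) / (M * r**2)
print(f"rho = {rho:.3f}")

Averaging the M examples of an archetype entrywise gives an unbiased estimator of $r\,\boldsymbol{\xi}^1$, which is why the combination $\hat{\eta}_M = (rM)^{-1}\sum_a \eta^{1,a}$ appears repeatedly below.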

Definition 1. The cost function (or Hamiltonian) of the 'unsupervised' DAM model is

Equation (2.4)

where the constant $\mathcal{R}^{P/2}$, with $\mathcal{R} = r^2(1+\rho)$ and ρ defined in (2.3), is included in the definition for mathematical convenience, while the factor $N^{P-1}$ ensures that the Hamiltonian is extensive in the network size N. On the left hand side (LHS), the label (P) denotes the order of the multi-node interactions and η is a short-hand for the collection of all the examples, for all the archetypes $\{\boldsymbol{\eta}^{\mu,a}\}_{\mu = 1,a = 1}^{K,M}$.

Remark 1. For P = 2, i.e. pairwise interactions, the unsupervised DAM model reduces to the Hopfield neural network in the unsupervised setting [33]. In addition, in the absence of dataset corruption, i.e. r = 1, $\mathcal{R} = 1$ and the model reduces to the standard Hopfield model, as analysed in [4, 12].

Definition 2. The partition function associated with the Hamiltonian (2.4) at inverse noise level $\beta\in \mathbb{R}^+$ is defined as

Equation (2.5)

where $\mathcal{B}^{(P)}_{N}(\boldsymbol{\sigma} \vert \boldsymbol{\eta})$ is referred to as the Boltzmann factor. At finite network size N, the free energy of the model $\mathcal{F}^{(P)}_{N}$ is given by

Equation (2.6)

where $\mathbb{E}[.]$ denotes the average over the realizations of the examples η, regarded as quenched.

We are interested in the behaviour of the system in the limit of large system size $N\to \infty$ and finite network load α, as specified by the following

Definition 3. In the thermodynamic limit $N \to \infty$, the load of the unsupervised DAM model is defined as

Equation (2.7): $\alpha = \lim_{N\to\infty} \frac{K}{N^{P-1}}.$

We will mostly focus on the so-called 'saturated' regime, where α > 0. The regime where the system is away from saturation can be inspected by taking the limit α → 0. For convenience, we also introduce the parameter γ, defined by

Equation (2.8): $\gamma = \frac{P!}{2}\,\alpha$, i.e. $\lim_{N\to\infty} K/N^{P-1} = 2\gamma/P!.$

The free energy in the thermodynamic limit is denoted as

Equation (2.9)

In the following, we focus on the ability of the network to retrieve a single archetype, say ν. Given the invariance of the Hamiltonian w.r.t. permutations of the archetypes, we will set without loss of generality ν = 1. Then, following standard procedures [4], we split the sum over µ in the Hamiltonian into the contribution from µ = 1 (regarded as the signal) and the contributions from the other archetypes µ > 1 (regarded as slow noise). Furthermore, we add to the exponent of the Boltzmann factor a term $J \sum_i \xi_i^1 \sigma_i$, so that the free energy can serve as a moment generating functional of the so-called Mattis magnetization $m_1$ (vide infra) by taking derivatives of $\ln\mathcal Z_N^{(P)}$ w.r.t. J. As this term is not part of the original Hamiltonian, J will be set to zero later on.
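Spelling out the generating-functional trick (a standard identity; we state it for the first moment only), the J-derivative of the quenched log-partition function produces the average Mattis magnetization, which is why J is set to zero only after differentiation:

$\frac{1}{N}\,\partial_J\, \mathbb{E}\ln \mathcal{Z}^{(P)}_N(\boldsymbol{\eta}; J)\Big\vert_{J = 0} = \frac{1}{N}\, \mathbb{E}\Big\langle \sum_{i = 1}^N \xi_i^1 \sigma_i \Big\rangle = \mathbb{E}\langle m_1 \rangle ,$

and higher moments of $m_1$ follow from higher derivatives.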

Starting from (2.5), the resulting partition function is

Equation (2.10)

where we have set $\beta^{^{\prime}} = 2\beta/ P!$ and for the first pattern (µ = 1) we used $P!\sum_{i_1 <\cdots < i_P} = \sum_{i_1\neq \cdots\neq i_P}$ and neglected terms which vanish in the thermodynamic limit. Focusing only on the last term in (2.10), we note that the product $\eta_{i_1}^{\mu,a}\ldots \eta_{i_P}^{\mu,a}$ is an i.i.d. binomial random variable for each $\mu, a$, and the sum over µ converges, by the central limit theorem (CLT), to a Gaussian variable with suitably defined mean and variance, i.e.

Equation (2.11)

where

Equation (2.12)

and

Equation (2.13)

This enables us to write

Equation (2.14)

with

Thus the partition function of the unsupervised DAM model reads as

Equation (2.15)

To make analytical progress, it is convenient to introduce the order parameters of the model.

Definition 4. The order parameters of the unsupervised DAM model are the Mattis magnetization $m_1$ of the archetype $\boldsymbol{\xi}^1$, the Mattis magnetizations $n_{1,a}$ of each example $a = 1,\ldots,M$ of the archetype $\boldsymbol{\xi}^1$, and the two-replica overlaps $q_{lm}$, which quantify the correlations between the variables σ in two copies l and m of the system sharing the same disorder (i.e. two replicas):

Equation (2.16)

Equation (2.17)

Equation (2.18)
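For reference, the order parameters can be computed from configurations as in the sketch below; the normalizations (1/N for $m_1$ and $q_{12}$, 1/(rN) for $n_{1,a}$) are our reading of the standard conventions and should be checked against (2.16)–(2.18).

import numpy as np

def order_parameters(sigma1, sigma2, xi1, eta1, r):
    """Empirical order parameters of definition 4.

    sigma1, sigma2 : (N,) spin arrays, two replicas sharing the same disorder
    xi1            : (N,) array, the archetype xi^1
    eta1           : (M, N) array, its M noisy examples
    """
    N = sigma1.size
    m1 = xi1 @ sigma1 / N          # Mattis magnetization, cf. (2.16)
    n1 = eta1 @ sigma1 / (r * N)   # example magnetizations, cf. (2.17)
    q12 = sigma1 @ sigma2 / N      # two-replica overlap, cf. (2.18)
    return m1, n1, q12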

In the next subsection we calculate the free energy of the system in the thermodynamic limit using Guerra's interpolation technique, assuming one step of replica symmetry breaking (1RSB).

2.2. 1RSB analysis via Guerra's interpolation technique

The key idea of Guerra's interpolation method is to define an auxiliary free energy, $\mathcal{F}^{(P)}(t)$, which is a function of a parameter $t \in [0,1]$ that interpolates between the free energy of the original DAM model (obtained at t = 1) and that of an exactly solvable one-body model (obtained at t = 0), whose effective fields mimic those of the original model. As the direct calculation of $\mathcal{F}^{(P)}(t = 1)$ is cumbersome, it is obtained by using the fundamental theorem of calculus: we first evaluate $\mathcal{F}^{(P)}(t = 0)$ and then obtain $\mathcal{F}^{(P)}(t = 1)$ as

Equation (2.19): $\mathcal{F}^{(P)}(t = 1) = \mathcal{F}^{(P)}(t = 0) + \int_0^1 \frac{\textrm{d}\mathcal{F}^{(P)}(t)}{\textrm{d}t}\,\textrm{d}t.$

To perform the analysis within the first step of replica symmetry breaking, we make the standard 1RSB assumption [2], stated below:

Assumption 1. Under the 1RSB assumption, the distribution of the two-replica overlap $q_{12}$, in the thermodynamic limit, displays two delta-peaks at the values $\bar{q}_1$ and $\bar{q}_2$, and the concentration on these two values is governed by the parameter $\theta \in [0,1]$, namely

Equation (2.20)

where

and $\mathbb{P}_N(\boldsymbol{\sigma}^{(1)},\boldsymbol{\sigma}^{(2)}|\boldsymbol{\eta}) = \mathcal{B}_N(\boldsymbol{\sigma}^{(1)}|\boldsymbol{\eta})\mathcal{B}_N(\boldsymbol{\sigma}^{(2)}|\boldsymbol{\eta})$, as replicas are conditionally independent, given the disorder η.

The Mattis magnetizations $m_1$ and $n_{1,a}$ are assumed to be self-averaging at their equilibrium values $\bar{m}$ and $\bar{n}$, respectively, namely

Equation (2.21)

Equation (2.22)

where $\mathbb{P}_N(m_1) = \mathbb{E}_{\boldsymbol{\eta}}\sum_{\boldsymbol{\sigma}}\mathcal{B}_N(\boldsymbol{\sigma}|\boldsymbol{\eta}) \delta (m_1-m_1(\boldsymbol{\sigma})) $ and $\mathbb{P}_N(n_{1,a}) = \mathbb{E}_{\boldsymbol{\eta}}\sum_{\boldsymbol{\sigma}}\mathcal{B}_N(\boldsymbol{\sigma}|\boldsymbol{\eta}) \delta (n_{1,a}-n_{1,a}(\boldsymbol{\sigma}))$, with $\mathbb{E}_{\boldsymbol{\eta}}f(\boldsymbol{\eta}) = \sum_{\boldsymbol{\eta}} f(\boldsymbol{\eta}) \mathbb{P}({\boldsymbol{\eta}})$.

Remark 2. We note that for θ = 0, equation (2.20) corresponds to the standard RS ansatz, with the distribution of the two-replica overlap being delta-peaked at $\bar{q} = \bar{q}_2$. As θ increases, the 1RSB ansatz is gradually introduced. The optimal value of Parisi's parameter θ can in principle be obtained by minimizing the free energy ${\mathcal F}_N^{(P)}$ with respect to θ. However, the explicit form of the extremization condition of the free energy with respect to θ is known to be complicated and is rarely used in practice, so that θ is often left as a free parameter (see e.g. [7]). In this work, we will fix θ by following a different procedure, namely by maximizing the region where the network accomplishes pattern retrieval, as done in [45] (see section 2.4 for a more detailed discussion).

Definition 5. Given the interpolating parameter $t \in [0,1]$, the constants $A_1, A_2, \psi \in \mathbb{R}$ to be fixed later on, and the i.i.d. standard Gaussian variables $Y_i^{(b)} \sim \mathcal{N}(0,1)$ for $i = 1, \ldots, N$ and $b = 1, 2$ (that must be averaged over as explained below), Guerra's 1RSB interpolating partition function for the unsupervised DAM model is given by

Equation (2.23)

Equation (2.24)

The index 2 on the LHS stands for the number of vectors ${\boldsymbol{Y}}^{(b)}$ that must be averaged over. Their number is equal to $k\!+\!1$ where k is the number of steps of RSB (here k = 1).

The average over the generalised Boltzmann factor associated with the interpolating partition function $\mathcal{Z}^{(P)}_2(\boldsymbol{\eta}^1,\boldsymbol{\lambda}, \boldsymbol{Y}; J, t)$ can be defined as

Equation (2.25)

Remark 3. We note that for t = 1, (2.24) recovers the original partition function (2.15), whereas for t = 0 it reduces to the partition function of a system of N non-interacting neurons, described by the Hamiltonian $-\beta^{^{\prime}} \mathcal{H}_{N}^{(1)}(\boldsymbol{\sigma} \vert \boldsymbol{\eta}, {\mathbf{Y}} ; J) = \sum_i h_i \sigma_i$, with local fields $h_i = J\xi_i^1+\psi r\mathcal{R}^{-1}\sum_{a = 1}^M \eta_i^{1,a}+\sum_{b = 1}^2 A_b Y_i^{(b)}$, and is readily evaluated. The parameters of the one-body model ψ, A1 and A2 must be chosen in such a way that the t-dependent terms in $d\mathcal{F}^{(P)}/dt$ cancel out in the thermodynamic limit, under the 1RSB assumption.
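The reason the t = 0 system is 'readily evaluated' is that its partition function factorizes over sites; explicitly (our spelling-out, under the stated one-body Hamiltonian),

$\mathcal{Z}^{(P)}_{2}\big\vert_{t = 0} = \sum_{\boldsymbol{\sigma}} \prod_{i = 1}^{N} e^{h_i \sigma_i} = \prod_{i = 1}^{N} 2\cosh (h_i),$

so that the one-body free energy reduces to quenched averages of $\ln 2\cosh(h_i)$, as in (2.41) and (2.42) below.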

In what follows, we will average over the fields ${\boldsymbol{Y}}^{(b)}$ for $b = 1,2$ recursively, as explained by Guerra in [43] and in the statements below. To this purpose, we define

Equation (2.26)

Equation (2.27)

Equation (2.28)

where $\mathbb{E}_{\boldsymbol{Y}^{(b)}}[.]$ denotes the average over the vectors ${\boldsymbol{Y}}^{(b)}$ for $b = 1, 2$. The interpolating quenched free energy associated with the partition function (2.24) is introduced as

Equation (2.29)

where $\mathbb{E}_0[.]$ denotes the average over the variables $\lambda^{\mu}_{i_1,\ldots,i_P}$ and $\eta_{i}^{1,a}$. In the thermodynamic limit, assuming the limit exists, we write

Equation (2.30)

Remark 4. We note that in the RS case, where $\mathcal Z_2^{(P)}(\boldsymbol{\eta}^1,\boldsymbol{\lambda}, \boldsymbol{Y}; J, t) \equiv \mathcal Z_1^{(P)}(\boldsymbol{\eta}^1,\boldsymbol{\lambda}, \boldsymbol{Y}^{(1)}; J, t)$ and $\boldsymbol{Y}^{(2)} = \boldsymbol{0}$, we have

Equation (2.31)

hence the average $\mathbb{E}_{\boldsymbol{Y}^{(1)}}$ acts on the same level as $\mathbb{E}_0$, i.e. outside the logarithm. When moving from the RS to the 1RSB scenario, an extra average is introduced, $\mathbb{E}_{\boldsymbol{Y}^{(2)}}$, which acts on a different level than $\mathbb{E}_0$ and $\mathbb{E}_{\boldsymbol{Y}^{(1)}}$, i.e. inside the logarithm. This reflects the presence of hierarchical valleys in the free energy landscape [46], which leads to the well-known multi-scale thermalization in spin glasses. The internal average $\mathbb{E}_{\boldsymbol{Y}^{(2)}}$ is related to thermalization within a valley at the lower level of the (two-level) hierarchy, whereas the external averages account for thermalization across valleys (i.e. at the higher level of the hierarchy). The structure of the hierarchy is captured by the parameter θ, which controls the amplitudes of the valleys at the lower level of the hierarchy and can be related to effective temperatures [47], as discussed in [48] for Parisi's replica approach and in [49] for Guerra's interpolation techniques. Note that the distinction between the two levels of the hierarchy is not made for ψ, which accounts for the signal contribution and is kept RS (as usually done for the magnetization in spin-glass models).
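The role of θ in this multi-scale structure can be illustrated numerically: for a fluctuating log-partition function, the inner average $\frac{1}{\theta}\ln \mathbb{E}_{Y^{(2)}}[\mathcal{Z}^{\theta}]$ interpolates between the annealed value $\ln\mathbb{E}[\mathcal{Z}]$ at θ = 1 and the quenched value $\mathbb{E}[\ln \mathcal{Z}]$ as θ → 0. The toy sketch below uses a synthetic Gaussian $\ln\mathcal{Z}$, not the model's actual $\mathcal{Z}_2$.

import numpy as np

rng = np.random.default_rng(1)
ln_Z = 0.5 * rng.standard_normal(1_000_000)  # toy fluctuating log-partition function

def inner_average(ln_Z, theta):
    # (1/theta) * ln E[Z^theta]: the average taken "inside the logarithm"
    return np.log(np.mean(np.exp(theta * ln_Z))) / theta

for theta in (1.0, 0.5, 0.1, 0.01):
    print(f"theta = {theta:5}: {inner_average(ln_Z, theta):+.5f}")
print(f"quenched E[ln Z]  : {ln_Z.mean():+.5f}")

For this toy example the exact value is 0.125 θ, which indeed shrinks to the quenched value 0 as θ → 0.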

Now, following Guerra's prescription [43], given two copies (or replicas) of the system, we define the following averages, corresponding to thermalization within the two different levels of the hierarchy

Equation (2.32)

Equation (2.33)

where

Equation (2.34)

At this point, we are able to state our second assumption, namely that, at each level $a = 1,2$ of the hierarchy, the two-replica overlap $q_{12}(\boldsymbol{\sigma}^{(1)},\boldsymbol{\sigma}^{(2)})$ self-averages around the value $\bar{q}_a$ of the corresponding peak in the overlap distribution $\mathbb{P}_N(q)$, as given in assumption 1:

Assumption 2. For any $t\in[0,1]$

Finally, we provide the explicit expression of the quenched free energy in terms of the control parameters in the next proposition.

Proposition 1. In the thermodynamic limit $N\to\infty$, within the 1RSB assumption (assumption 1) and under assumption 2, the quenched free energy for the unsupervised DAM model reads as

Equation (2.35)

where

Equation (2.36)

$\mathbb{E}_{(\boldsymbol{\eta}^1|\boldsymbol{\xi}^1)} = \prod\nolimits_{a = 1}^M \mathbb{E}_{(\eta^{1,a}|\xi^1)}$ and $\mathbb{E}_{\boldsymbol{\xi}^1}$ and $\mathbb{E}_{(\eta^{1,a}|\xi^1)}$ denote the expectations over the Bernoulli distributions (2.1) and (2.2) respectively. In the above, $\bar n$, $\bar{q}_1, \ \bar{q}_2$ fulfill the following self-consistency equations

Equation (2.37)

with

Equation (2.38)

and $\beta^{^{\prime}} = 2\beta/P!$. Furthermore, as $\bar{m} = -\beta^{^{\prime}}\nabla_J \mathcal{F}^{(P)}(J)|_{J = 0}$, we have

Equation (2.39)

In order to prove the aforementioned proposition, we need to show the following

Lemma 1. In the thermodynamic limit, the t-derivative of the interpolating quenched free energy (2.29), under the 1RSB assumption and under assumption 2, is given by

Equation (2.40)

The proof of this lemma is shown in appendix B.1. Now let us prove proposition 1:

Proof. Exploiting the fundamental theorem of calculus, we can relate the free energy of the original model $\mathcal{F}^{(P)}(J,t = 1)$ and the one stemming from the one-body terms $\mathcal{F}^{(P)}(J,t = 0)$ via (2.19). Since the t-derivative (2.40) does not depend on t, all we need to do is add it to the free energy $\mathcal{F}^{(P)}(J,t = 0)$, which only contains one-body terms. Let us start with the computation of the latter at finite size N:

Equation (2.41)

with the definition of $-\beta^{^{\prime}} \mathcal{H}_{N}^{(1)}(\boldsymbol{\sigma} \vert \boldsymbol{\eta}, \mathbf{Y}; J)$ given in remark 3.

Upon setting the constants to the values (B.8) stated in the proof of lemma 1, we have

Equation (2.42)

where $g(\beta, K,N, \boldsymbol{Y})$ reads as

Equation (2.43)

Recalling that $\lim\limits_{N\to \infty}K/N^{P-1} = 2\gamma/P!$, taking the thermodynamic limit of the above equations and inserting them in the fundamental theorem of calculus (2.19), we reach (2.35). Finally, by maximizing (2.35) w.r.t. the order parameters $\bar{n}, \bar{q}_1, \bar{q}_2$, we obtain the self-consistency equations (2.37), which completes the proof. □

2.3. 1RSB analysis via Parisi's replica trick

In this section, we derive the expression of the quenched free energy of the unsupervised DAM model provided in proposition 1 by using the replica method [2, 50], at the first step of RSB [2, 51]. The core of this approach consists in writing the quenched free energy, as defined in (2.6), as

Equation (2.44)

with $\beta^{^{\prime}} = 2 \beta/P!$ and where we have used the shorthand $\mathcal{Z}_N(J) = \mathcal{Z}_N^{(P)}(\boldsymbol{\eta};J)$ for the partition function, defined as

Equation (2.45)

$\mathbb{E}[.]$ denotes, as in (2.6), the average over the quenched disorder η and in accordance with the explanation provided in section 2.1, the inclusion of the last term in the round brackets of (2.45) is intended to make the free energy a moment-generating function for the Mattis magnetization $m_1(\boldsymbol{\sigma})$.

For integer n, the function $\mathcal{Z}_N^n$ is the partition function of a system of n identical replicas of the original system

Equation (2.46)

namely

Equation (2.47)

Splitting, as before, the signal term (µ = 1) from the remaining terms µ > 1, which act as a quenched noise on the learning and retrieval of pattern $\boldsymbol{\xi}^1$, we can write

Equation (2.48)

where $\mathbb{E}_{\boldsymbol{\xi}^\mu}[.] = \prod_i \mathbb{E}_{\xi_i^\mu}[.]$ denotes the average over the pattern $\boldsymbol{\xi}^\mu$, whose entries $\xi_i^\mu$ are drawn from the distribution (2.1), and $\mathbb{E}_{(\boldsymbol{\eta}^\mu|\boldsymbol{\xi}^\mu)} = \prod_{i,a}\mathbb{E}_{(\eta_i^{\mu,a}|\xi_i^\mu)}$, with $\mathbb{E}_{(\eta_i^{\mu,a}|\xi_i^\mu)}$ denoting the expectation over the conditional distribution (2.2). As before, exploiting the CLT to rewrite the last term in (2.48) via (2.14), we obtain

Equation (2.49)

Treating the two terms $\mathbb{E}_{\boldsymbol{\xi}^{1}}\mathbb{E}_{(\boldsymbol{\eta}^{1}|\boldsymbol{\xi}^{\tiny \mbox{1}})} \mathcal{Z}_\textrm{signal}^n(J)$ and $\mathbb{E}_{\boldsymbol{\lambda}} \mathcal{Z}_\textrm{noise}^n$ separately, we insert in the former the identity

and use the Fourier representation of the Dirac delta, obtaining

Equation (2.50)

For the noise term, performing first the expectation over λ

Equation (2.51)

inserting

and using the Fourier representation of the Dirac delta, we have

Equation (2.52)

Inserting equations (2.50) and (2.52) in (2.49), we obtain

Equation (2.53)

where

Equation (2.54)

and we have used the shorthand notation ${\boldsymbol{Q}} = \{q_{ab}\}_{a,b = 1}^n$ and similarly for $\boldsymbol{P}$, $\boldsymbol{N}$ and $\boldsymbol{Z}$. Next, we assume that the limit n → 0 in

Equation (2.55)

can be taken by analytic continuation and that the two limits $N\to\infty$ and n → 0 can be interchanged (provided they exist), so that the integrals in (2.53) can be performed by steepest descent. This leads to

Equation (2.56)

where $\boldsymbol{Q}^\star, \boldsymbol{P}^\star, \boldsymbol{Z}^\star, \boldsymbol{N}^\star$ are solutions of the saddle point equations

Equation (2.57)

Equation (2.58)

Equation (2.59)

Equation (2.60)

where the average $\langle \ldots \rangle_{\mathrm{eff}}$ is computed over the distribution $p_{\mathrm{eff}}(\boldsymbol{\sigma}) = e^{-\beta^{^{\prime}} H_{\mathrm{eff}}(\boldsymbol{\sigma})}/Z_{\mathrm{eff}}$ where $H_{\mathrm{eff}}(\boldsymbol{\sigma})$ is equal to the terms in the round brackets in the second line of (2.54). Inserting (2.59) and (2.60) in $H_{\mathrm{eff}}(\boldsymbol{\sigma})$, we obtain

Equation (2.61)

and, substituting in (2.56)

Equation (2.62)

where $q_{ab}^\star$ and $[n_{1,A}^{(a)}]^\star$ must be determined from (2.57) and (2.58). Since $p_{\mathrm{eff}}(\boldsymbol{\sigma})$ depends on $\boldsymbol{Q}$, $\boldsymbol{N}$ via (2.61), equations (2.57) and (2.58) constitute a set of self-consistency equations. Now, in order to proceed, we need to find the form of $\boldsymbol{Q}$ and $\boldsymbol{N}$ in the limit n → 0. We will make the 1RSB ansatz:

Equation (2.63)

Equation (2.64)
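For concreteness, the block structure behind the ansatz (2.63) is the standard Parisi form (our spelling-out, consistent with assumption 1 and with the n/θ groups appearing in (2.69)): the n replicas are arranged into n/θ groups of size θ, and

$q_{ab} = \begin{cases} 1 & a = b, \\ \bar{q}_2 & a \neq b \text{ in the same group}, \\ \bar{q}_1 & a, b \text{ in different groups}, \end{cases}$

so that, as n → 0, the fraction of replica pairs with overlap $\bar{q}_1$ tends to θ and that with overlap $\bar{q}_2$ to 1 − θ, reproducing (2.20).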

Using (2.63) to evaluate the first term in (2.62) and the first term in (2.61), we get

Equation (2.65)

Equation (2.66)

Upon inserting in (2.62) we obtain

Equation (2.67)

with

Equation (2.68)

where we have used the definition (2.36).

Next, we apply a Gaussian transformation to the last term in (2.67) to linearize the squared terms in the Hamiltonian (2.68)
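The Gaussian transformation in question is the standard Hubbard-Stratonovich identity (stated here in the form we need, for λ ⩾ 0),

$e^{\frac{\lambda}{2}x^2} = \int \frac{\textrm{d}u}{\sqrt{2\pi}}\, e^{-\frac{u^2}{2} + \sqrt{\lambda}\, x u},$

applied once with a global auxiliary field u and once per replica group with the fields $v_k$.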

Equation (2.69)

where we have set ${\mathcal{D}} u = \textrm{d}u \,e^{-u^2/2}/\sqrt{2\pi}$ and ${\mathcal{D}} v_k = \textrm{d}v_k \,e^{-v_k^2/2}/\sqrt{2\pi}$ for $k = 1, \ldots, n/\theta$. Now, writing

the sum over the spin configurations can be done explicitly, finding

Now applying the n → 0 limit [50] and exploiting the relation $\lim\limits_{n\to 0}\dfrac{1}{n} \ln\mathbb{E}_{Y^{(1)}}[f^n(Y^{(1)})] = \mathbb{E}_{Y^{(1)}}[\ln f(Y^{(1)})]$, we get

Finally, inserting this expression in (2.67) and denoting with $\mathbb{E}_{Y^{(1)}}[.]$ and $\mathbb{E}_{Y^{(2)}}[.]$ the Gaussian averages w.r.t. $Y^{(1)}$ and $Y^{(2)}$ respectively, we reach the following expression for the free energy

where $\bar{n},\bar{q}_1$ and $\bar{q}_2$ must fulfill the same self-consistency equations (2.37) obtained by using Guerra's interpolation technique, as reported in proposition 1.

2.4. Results

Solving numerically the self-consistency equations (2.37), we obtain the phase diagram shown in figure 1, in the parameter space (γ, $\tilde{\beta}$), where γ is the storage load defined in (2.8) and $\tilde\beta = \beta^{^{\prime}}/(1+\rho)^{P/2}$ is a scaled inverse temperature, with ρ the dataset entropy defined in (2.3). Panels show results for P = 4 and three different values of M, as shown in the legend. The result for $M\to \infty$ follows from the analysis carried out in section 2.4.1. In each panel, the grey curve marks the transition from the ergodic phase with $\bar{m} = \bar{q}_1 = \bar{q}_2 = 0$ (above) to the spin glass phase, where either $\bar{q}_1$ or $\bar{q}_2$ becomes non-zero (below). The black curve marks the transition from the retrieval phase with $\bar{m}\neq 0$ (left) to $\bar{m} = 0$ (right). The spin glass solution within the retrieval region is always unstable and is delimited by the dotted curve (we refer to this as the instability region). In figure 2 we show the results for $P = 6, 8, 10$. As P and M grow, the spin glass region shrinks and the retrieval region expands. For $M = +\infty$ the critical storage lines become equal to those of traditional DAM models, where archetypes, rather than their noisy examples, are encoded in the interactions [45]. This is as expected: when an infinite number of examples is provided, the system reaches the same performance as a neural network where the archetypes are stored in the interactions directly. For finite M, however, the network's ability to retrieve the archetypes degrades at values of the storage load which are lower than the storage capacity of traditional DAM models. Both figures 1 and 2 have been obtained for the value $\theta = \theta_\otimes$ which maximizes the retrieval region, as shown in figure 3 (left panel). In the latter, we show the line that separates the retrieval region from the spin-glass or the ergodic region, for different values of θ. For θ = 0 (which corresponds to the RS theory), the line exhibits a re-entrance, as commonly observed in spin glasses and associative memories [45, 50, 52, 53]. As θ is increased, the retrieval region expands, reaching its maximum at $\theta = \theta_\otimes$, where the re-entrance disappears completely, showing the same qualitative behaviour as in the Hopfield model [12] and traditional DAM [45]. Increasing θ above $\theta_\otimes$, the re-entrance appears again and gets more pronounced as θ approaches 1, where the transition line becomes identical to the RS (and the θ = 0) case, as expected. In figure 3 (right panel) we show the dependence of $\theta_\otimes$ on P. As P increases, the value of $\theta_{\otimes}$ decreases. This is as expected from spin-glass theory, as for large P, P-spin models are known to converge to the random energy model (REM) [54], which is RS. Interestingly, we find that $\theta_\otimes$ takes the same value as in the DAM model considered in [45], for all values of M; hence the optimal 1RSB breaking parameter is not influenced by the use of corrupted examples rather than archetypes.
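The numerical scheme behind these diagrams is a damped fixed-point iteration of the self-consistency equations, with Gaussian averages evaluated by quadrature. Below is a minimal sketch of that scheme; for brevity it is applied to a magnetization equation of the schematic form $m = \tanh(\tilde\beta m^{P-1})$ (a stand-in used purely for illustration, whereas the full system (2.37) couples $\bar n$, $\bar q_1$ and $\bar q_2$, with $\bar m$ given by (2.39), and requires nested Gaussian averages, e.g. via Gauss-Hermite quadrature).

import numpy as np

def fixed_point(update, x0, damping=0.5, tol=1e-12, max_iter=100_000):
    """Damped iteration x <- (1 - damping) * x + damping * update(x)."""
    x = x0
    for _ in range(max_iter):
        x_new = (1 - damping) * x + damping * update(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    raise RuntimeError("fixed-point iteration did not converge")

P, beta_tilde = 4, 3.0
m = fixed_point(lambda m: np.tanh(beta_tilde * m ** (P - 1)), x0=0.9)
print(f"m = {m:.6f}")  # retrieval branch; m = 0 solves the equation too

Scanning the control parameters and recording where the $\bar m \neq 0$ branch disappears (and comparing the free energies of competing solutions) traces the critical lines shown in the figures.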


Figure 1. Phase diagram of the unsupervised DAM model with P = 4, in the parameters space (γ, $\tilde{\beta}$), where γ is the storage load defined in (2.8) and $\tilde\beta = \beta^{^{\prime}}/(1+\rho)^{P/2}$ is the scaled inverse noise, where ρ is the dataset entropy defined in (2.3). Different panels refer to different sizes M of the training set, as shown in the legend, where $ \tilde M(r,P) = r^{-2P}$ and r has been set to r = 0.2. For high values of the noise $\tilde{\beta}^{-1}$, the system is ergodic, while lowering the noise two phases appear: a retrieval phase at small γ (below the black line), and a spin glass phase (below the grey line) at high values of γ. The spin-glass phase within the retrieval region (delimited by the dashed line) is unstable. As M increases, the spin glass region shrinks and the retrieval region expands.


Figure 2. Phase diagrams of the unsupervised DAM model for $P = 6, 8, 10$, and different values of M, as shown in the legends. $\tilde M(r,P)$ and r are as in figure 1. As P increases, the spin glass region shrinks and the retrieval region expands.


Figure 3. Left: critical line separating the retrieval (left) from the spin-glass or ergodic region (right) for different values of θ. For θ = 0, where the theory is RS, the retrieval region exhibits a re-entrance. As θ is increased, the retrieval region expands, reaching its maximum at $\theta = \theta_\otimes$, where the re-entrance disappears completely. Right: $\theta_\otimes$ versus P. As P increases, the value of $\theta_{\otimes}$ decreases, indicating that the larger P, the better the RS approximation is expected to be.


2.4.1. Limiting cases: $M\to \infty$ and $\beta\to \infty$.

Finally, we consider two instructive limiting cases. One is the limit $M\to\infty$, where the number of available examples is large. Although idealised, this scenario is becoming less utopian nowadays in a number of applications, and it is instructive, as an explicit relation between the archetype magnetization and the mean magnetization of the examples naturally emerges. The second scenario is the zero-noise limit $\beta\to\infty$, where the information processing capabilities of the network are expected to be maximal. We now state the following

Corollary 1. In the limit $M\to\infty$, the 1RSB self consistent equations for the order parameters of the unsupervised DAM model are

Equation (2.70)

where

Equation (2.71)

and $\beta^{^{\prime}} = 2\beta /P!$.

Proof. In the limit $M\to \infty$, we can apply the CLT to the sum of the examples defined in (2.36), to write

Equation (2.72)

where λ is a standard Gaussian variable $\lambda\sim \mathcal{N}(0,1)$.

Inserting (2.72) in (2.38) and in the expression for $\bar{n}$ provided in (2.37), we get

Equation (2.73)

and

Hence, by applying Stein's lemma to the standard Gaussian variable λ, where the lemma states that for a standard Gaussian variable $J\sim \mathcal{N}(0, 1)$ and a generic function f(J), for which the two expectations $\mathbb{E}\left( J f(J)\right)$ and $\mathbb{E}\left( \partial_J f(J)\right)$ both exist, one has

Equation (2.74): $\mathbb{E}\left( J f(J)\right) = \mathbb{E}\left( \partial_J f(J)\right),$

and explicitly averaging over ξ, we get the self-consistency equations in (2.70). □
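As a quick numerical sanity check of (2.74), a Monte Carlo estimate with the test function f(J) = tanh(J) (an arbitrary choice of ours) confirms the two sides agree:

import numpy as np

rng = np.random.default_rng(2)
J = rng.standard_normal(2_000_000)

lhs = np.mean(J * np.tanh(J))          # E[J f(J)]
rhs = np.mean(1.0 - np.tanh(J) ** 2)   # E[f'(J)], since (tanh J)' = 1 - tanh^2 J
print(f"E[J f(J)] = {lhs:.4f},  E[f'(J)] = {rhs:.4f}")

Both estimates agree to within Monte Carlo error.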

Finally, it has been shown in [38], within an RS analysis, that neglecting the second term in the equation for $\bar{n}$, given in (2.70), has negligible impact on the solutions of the self-consistency equations in the relevant regime of low noise, where the network works as an associative memory. As the equation for $\bar{n}$ is the same in the RS theory and in the 1RSB theory that we are considering here, the same truncation of $\bar{n}$ can be adopted here

Equation (2.75)

This leads to a simplified expression for g that is

Equation (2.76)

where we re-scaled the noise $\tilde\beta = \beta^{^{\prime}}/(1+\rho)^{P/2}$. Solving numerically the self consistency equations (2.70) where the equation for $\bar{n}$ is replaced with its truncated version (2.75), leads to the phase diagram shown by the lines $M = +\infty$ in figures 1 and 2.

Next, we turn to the analysis of the ground state, i.e. to the limit $\beta \to \infty$, and we state the following

Corollary 2. In the limit $\beta\to\infty$ and for $M\gg 1$, the 1RSB self consistency equations for the order parameters of the unsupervised DAM model are

Equation (2.77)

Equation (2.78)

where $D = \tilde\beta \theta$ and

Equation (2.79)

We report the proof of this corollary in appendix B.2.

Next we compute, by numerically solving (2.77) and (2.78), the ground-state critical storage capacity $\gamma_C$ beyond which a black-out scenario emerges, namely $\bar{m}\neq 0$ for $\gamma<\gamma_C$ and $\bar{m} = 0$ for $\gamma>\gamma_C$. In figure 4 we plot $\gamma_C$ as a function of the ratio between the number M of experienced examples and the minimum number $M_\otimes$ of examples required by the unsupervised DAM model (with P > 2) to correctly learn and retrieve an archetype, within the RS theory, which has been proved to be

Equation (2.80)

in [37]. In order to ascertain whether replica symmetry breaking alters such a threshold, we plot $\gamma_C$ within both the 1RSB assumption (blue line) and the RS theory (red line). We see that $\gamma_C$ becomes non-zero (meaning that retrieval can occur) at the same value $M = M_\otimes$ for the RS and the 1RSB theories. This shows that the minimum number of examples that a network needs to accomplish retrieval of the archetype is the same within the RS or the 1RSB assumption.


Figure 4. Critical storage capacity $\gamma_C$ in the ground state for the unsupervised DAM model, for example noise r = 0.2, as a function of the ratio $M/M_\otimes^\textrm{unsup}$, both for the RS (red line) and 1RSB (blue line) assumptions. We notice that $\gamma_C$ increases with M and, for $M \gg M_{\otimes}^\textrm{unsup}$, it saturates to the critical load ($\hat\gamma$) found numerically for the RS case in [33].


Results show that, as in the standard Hebbian storage [44, 53, 55], the phenomenon of replica symmetry breaking induces a mild improvement in terms of the maximal storage.

3. 'Supervised' DAM

In this section we analyse the information processing capabilities of the DAM model in the so-called 'supervised' setting, where the dataset given to the network is now split into different categories, one for each archetype $\boldsymbol{\xi}^\mu$, with $\mu = 1,\ldots,K$. Again, we analyze the model, defined in section 3.1, via both Guerra's interpolation techniques (section 3.2) and Parisi's replica approach (section 3.3), at the first step of RSB. In appendix A we derive the instability line of the RS theory.

3.1. Model and definitions

As in the unsupervised DAM model, we consider a network of N Ising neurons $\sigma_i \in \{-1, +1\}$, $i = 1, \ldots, N$, interacting via P-node interactions. We assume to have K Rademacher archetypes $\boldsymbol{\xi}^\mu \in \{-1, +1\}^N$, with $\mu = 1, \ldots, K$, defined as N-dimensional vectors with entries drawn randomly and independently from the distribution (2.1). In addition, we assume to have M examples $\eta^{\mu,a} \in \{-1, +1\}^N$ for each archetype, which are corrupted versions of the archetypes, with entries distributed according to (2.2). We shall refer to this model as the supervised DAM model.

Definition 6. The Hamiltonian of the supervised DAM model is

Equation (3.1)

where the constant $\mathcal{R}: = r^2 + \frac{1-r^2}{M}$ in the denominator of the r.h.s. is included for mathematical convenience, while the factor $N^{P-1}$ ensures that the Hamiltonian is ${\mathcal O}(N)$, as explained previously for the unsupervised model, see equation (2.4).

We highlight the hidden role of a teacher that, before providing the dataset to the network, has grouped examples pertaining to the same archetype together (hence the proliferation of summations in the cost function (3.1) with respect to (2.4)).

As before, we will focus (without loss of generality) on the ability of the network to store and retrieve the first archetype $\boldsymbol \xi^1$, hence the Mattis magnetization $m_1(\boldsymbol{\sigma})$ provided in (2.16) remains a relevant order parameter. However, the set of order parameters for the examples, previously given by $n_{1,a}(\boldsymbol{\sigma})$ with $a = 1,\ldots,M$ (see (2.17)), must now be replaced by a single order parameter

Equation (3.2)

Its probability distribution, in the thermodynamic limit, is assumed self-averaging, namely

Equation (3.3)

as for the Mattis magnetization, while the distribution for the overlap defined in (2.18) is still assumed bimodal as in assumption 1 (see (2.20)). As for the unsupervised model, we add an extra term in the cost function, namely $J \sum_i \xi_i^1 \sigma_i$, in order to generate the moments of the Mattis magnetization m1 by taking the derivatives of the quenched free energy w.r.t. J and, as this term is not part of the original Hamiltonian, it will be set to zero at the end of the calculations. Therefore, we write the partition function as

Equation (3.4)

where ρ is the dataset entropy defined in (2.3), $\beta^{^{\prime}}: = 2\beta/P!$ and for the first pattern, µ = 1, we have used the relation $P!\sum_{i_1 <\cdots< i_P} = \sum_{i_1, \cdots, i_P}$ and neglected terms which vanish in the thermodynamic limit. Next, as done for the unsupervised case, we apply the CLT to the variables in the square brackets in (3.4), namely

Equation (3.5)

where

Equation (3.6)

and

Equation (3.7)

with

Thus, we can write

where $\lambda_{i_1,\ldots,i_P}\sim \mathcal{N}(0,1)$.

Inserting all back in (3.4) we get

Equation (3.8)

In the next subsection we calculate the free energy of the system in the thermodynamic limit using Guerra's interpolation technique, assuming one step of replica symmetry breaking (1RSB).

3.2. 1RSB analysis via Guerra's interpolation technique

As for the unsupervised case, the plan is to construct an interpolation between the original model and a simpler one-body model, whose statistical features are as close as possible to those of the original one; we then solve the one-body model and finally obtain the solution of the original model via the fundamental theorem of calculus. Thus we define the following interpolating partition function.

Definition 7. Given the interpolating parameter $t \in [0,1]$, the constants $A_1, A_2, \psi \in \mathbb{R}$ to be set a posteriori, and the i.i.d. standard Gaussian variables $Y_i^{(b)} \sim \mathcal{N}(0,1)$ for $i = 1, \ldots, N$ and $b = 1,2$, Guerra's 1RSB interpolating partition function for the DAM model trained by a teacher is given by

Equation (3.9)

where $\mathcal{B}_{N,2}^{(P)}(\boldsymbol{\sigma} \vert \boldsymbol{\eta}^1,\boldsymbol{\lambda},\boldsymbol{Y}; J, t)$ is referred to as the Boltzmann factor.

As before, the introduction of the interpolating partition function gives rise to a generalized measure, average and interpolating quenched free energy (that we do not repeat here). All these generalizations retrieve the standard definitions when evaluated at t = 1. Following Guerra's method [43], we must average out the fields ${\mathbf{Y}}^{(1)}$ and ${\mathbf{Y}}^{(2)}$ recursively, in the interpolating free energy resulting from the interpolating partition function (3.9), as already done in the previous section for the unsupervised case, see equations (2.26)–(2.28). Proceeding in the same way and omitting obvious details due to the similarity of the proofs in the supervised and unsupervised cases, we state directly the next proposition.

Proposition 2. In the thermodynamic limit $N\to\infty$, within the 1RSB assumption and under assumption 2, the quenched free energy for the supervised DAM model reads as

Equation (3.10)

with $\bar n$, $\bar{q}_1, \ \bar{q}_2$ fulfilling the following self-consistency equations

Equation (3.11)

where

Equation (3.12)

and $\beta^{^{\prime}} = 2\beta /P!$. Furthermore, as $\bar{m} = -\beta^{^{\prime}} \nabla_J \mathcal{F}^{(P)}(J)|_{J = 0}$, we have

Equation (3.13)

The proof of the aforementioned proposition is lengthy but the steps to follow are identical to the ones provided for the unsupervised case.

3.3. 1RSB analysis via Parisi's replica trick

As stated in the previous section, the core of this approach consists in writing the logarithm of the partition function as $\lim\limits_{n\to 0} (\mathbb{E}\mathcal{Z}(J)^n-1)/n$ where

Equation (3.14)

Proceeding as in the unsupervised case, we compute separately the signal term and the noise term. The former, accounting for the examples pertaining to the first archetype, reads as

Equation (3.15)

while the latter, accounting for the examples pertaining to all the other archetypes but the first one, can be written as

Equation (3.16)

where now $\mathbb{E}_{\boldsymbol{\lambda}}$ is the Gaussian average w.r.t. λ. Performing the integration over λ and inserting the definitions of the order parameters as done for the unsupervised setting, we can write

Equation (3.17)

where

Equation (3.18)

and $\hat{\eta}_{M} = (rM)^{-1}\sum_{A = 1}^M\eta^{1,A}$. Assuming (as usual within the replica method) that the limits $N\to\infty$ and n → 0 can be interchanged, the integrals can be performed by steepest descent. This leads to

Equation (3.19)

where $\boldsymbol{Q}^\star, \boldsymbol{P}^\star, \boldsymbol{Z}^\star, \boldsymbol{N}^\star$ are solutions of the saddle point equations

Equation (3.20)

Equation (3.21)

Equation (3.22)

Equation (3.23)

which provide a set of self-consistency equations for $\boldsymbol{Q}$, $\boldsymbol{N}$, $\boldsymbol{P}$ and $\boldsymbol{Z}$. Using the last two equations to eliminate $\boldsymbol{P}$ and $\boldsymbol{Z}$ from the description, we finally obtain

Equation (3.24)

To make progress, we need to find the form of $\boldsymbol{Q}$ and $\boldsymbol{N}$ in the limit n → 0. Again, we use the 1RSB ansatz for the two-replica overlap provided in (2.63), while we assume that $n_{1}^{(a)}$ is self-averaging, i.e. $n_{1}^{(a)} = \bar{n}$. Proceeding analogously to the unsupervised case, after taking the limit n → 0 we find the same expressions for the quenched free energy (see (3.10)) and order parameters (see (3.11)) previously obtained by using Guerra's interpolation technique.

3.4. Limiting cases: $M\to \infty $ and $\beta \to \infty$

Now we state the next corollaries, concerning the large-dataset limit $M \to\infty$ and the ground-state limit $\beta \to \infty$. We omit their proofs, since these are trivial variations of those provided for the unsupervised case (cf. corollaries 1 and 2).

Corollary 3. In the large dataset limit $M\to \infty$, the 1RSB self-consistency equations for the order parameters of the supervised DAM model can be expressed as

Equation (3.25)

where the expression of g reads as

Equation (3.26)

Moreover, if we use the truncated expression for $\bar{n}$, namely

Equation (3.27)

we get the simplified expression of (3.26) that is

Equation (3.28)

where $\tilde\beta = \beta^{^{\prime}}/(1+\rho)^{P/2}$.

Corollary 4. The ground-state (i.e. $\beta \to \infty$) self-consistency equations for the order parameters of the theory in the large dataset limit (i.e. $M\gg 1$) and under the 1RSB assumption are

Equation (3.29)

Equation (3.30)

Equation (3.31)

where $D = \tilde\beta \theta$ and

Equation (3.32)

By numerically solving the self-consistency equations (3.25), we obtain the phase diagram shown in figure 5, for different values of P and different values of the dataset entropy ρ, as shown in the legend.


Figure 5. Phase diagram of the supervised DAM model for different values of the dataset entropy ρ, ranging from ρ = 0.1 (light grey), 0.05 (grey) up to 0 (black). The interpretation of the different regions is the same as in the unsupervised model. As ρ decreases (i.e. as more information is provided to the network), the instability region shrinks and the retrieval region expands, similarly to what was observed earlier upon increasing M.


In [38] it has been proven that the minimum number of examples $M_{\otimes}(r, \gamma)$ required by the supervised DAM model to correctly learn and retrieve an archetype is given by

Equation (3.33)

where γ ≠ 0. In figure 6 we plot the critical storage $\gamma_C$ in the ground state versus the ratio $M/M_\otimes$. As previously observed for the unsupervised setting, the phenomenon of replica symmetry breaking slightly increases the critical storage and does not alter the minimum number of examples required for learning.


Figure 6. Critical storage capacity $\gamma_C$ in the ground state for the supervised DAM model, for example noise r = 0.2, as a function of the ratio $M/M_\otimes^\textrm{sup}$, both for the RS (red line) and 1RSB (blue line) assumptions. Different panels show results for different values of $P = 4,\ 6,\ 8,\ 10$.


4. Conclusions and outlooks

This manuscript analyses the equilibrium behaviour of DAM trained with or without the supervision of a teacher, within a 1RSB assumption, thus extending previous analyses carried out at the RS level [37, 38]. The unsupervised and supervised settings differ in the choice of the couplings (which involve P nodes) and are given by

Equation (4.1)

Equation (4.2)

respectively, where $\left\lbrace\boldsymbol \eta^{\mu}_a \right\rbrace_{a = 1,{\ldots},M}^{\mu = 1,{\ldots},K}$ are perturbed versions of the unknown archetypes $\left\lbrace \boldsymbol{\xi}^\mu\right\rbrace^{\mu = 1, \ldots, K}$. The network does not experience the archetypes directly; instead, it has to infer them from the supplied examples. For both settings, we obtained explicit expressions for the quenched free energy and derived full phase diagrams. In doing so, we proved a full equivalence, at the 1RSB level, between two different approaches, namely Guerra's telescopic interpolation [43] and Parisi's RSB theory [2]. In addition, we derived (in appendix A) the de Almeida-Thouless line, which marks the onset of the instability of the replica symmetric description, and below which the 1RSB description should be preferred.

The main differences brought about by the RSB description, with respect to the RS one, consist in the disappearance of the instability region within the retrieval zone of the phase diagram close to saturation (as standard in glassy statistical mechanics [4]) and in a slight improvement of the value of the critical storage. Importantly, the threshold for learning, both in the supervised and unsupervised settings, is not influenced by replica symmetry breaking, i.e. the minimum number of examples required to infer the archetypes is the same in the RS and 1RSB descriptions. Interestingly, the optimal value of Parisi's parameter θ, which controls the distribution of the overlaps in the 1RSB scenario, is not influenced by the dataset entropy and is equal to that of the classical Hopfield model. From the mathematical viewpoint, possible future developments would be relaxing the constraint of a self-averaging Mattis magnetization and inspecting how the learning and retrieval properties of these networks change with different kinds of noise: in this work we have focused on multiplicative noise, however the use of additive noise has lately gained large popularity in generative models for machine learning (see for instance [56]). Furthermore, within the framework of multiplicative noise, an interesting outlook would be to consider corruptions of archetypes which also consist of blank (in addition to inverted) entries, as done recently in [57] for networks away from saturation. Their operation in the saturated regime and the effects of replica symmetry breaking have not yet been investigated: we plan to report soon on these topics.

Acknowledgments

All the authors acknowledge the stimulating research environment provided by the Alan Turing Institute's Theory and Methods Challenge Fortnights event Physics-informed Machine Learning. Albanese acknowledges Ermenegildo Zegna Founder's Scholarship, UMI (Unione Matematica Italiana), INdAM—GNFM Project (CUP E53C22001930001) and PRIN grant Stochastic Methods for Complex Systems N. 2017JFFHS for financial support and King's College London for kind hospitality. Alessandrelli acknowledges INdAM (Istituto Nazionale d'Alta Matematica) and Unisalento for support via PhD-AI. Barra's research is supported by MAECI via the BULBUL grant (Italy-Israel collaboration), Project N. F85F21006230001 Brain-inspired Ultra-fast and Ultra-sharp machines for assisted healthcare and by MUR via the PRIN-2022 grant Statistical Mechanics of Learning Machines: from algorithmic and information-theoretical limits to new biologically inspired paradigms, Project N. 20229T9EAT that are gratefully acknowledged.

Data availability statement

No new data were created or analysed in this study.

Appendix A: Instability of the RS solution: AT lines

In the main text we have analysed the equilibrium behaviour of supervised and unsupervised DAM models under the assumption of replica symmetry breaking. While such a phenomenon is expected in these models, a formal proof that the replica symmetric theory becomes unstable in certain ranges of the control parameters has not been provided in the literature. In this section we provide such a proof and we derive the critical line of the RS instability in the phase diagram, for the unsupervised and the supervised DAM models separately. We will use the method recently introduced in [39], which provides a simple alternative to the method originally introduced by de Almeida and Thouless in [58], as it does not require computing the so-called replicon (i.e. the smallest eigenvalue of the spectrum) of the Hessian of the quadratic fluctuations of the free energy around its RS value and it does not rely on the availability of an 'ansatz-free' expression for the free energy.

A.1. AT line for DAM in unsupervised setting

Following the method introduced in [39], we aim to determine the region in the phase diagram where the quenched free energy evaluated within the 1RSB approximation, that from now on we denote for convenience as $\mathcal F^{(P)}_{1RSB}(\bar{n}, \bar{q}_2,\bar{q}_1|\theta)$, is smaller than the free energy evaluated within the RS assumption, $\mathcal F^{(P)}_{RS}(\bar{n}^{^{\prime}}, \bar{q})$, in the limit θ → 1, where the transition from RS to RSB is expected to occur.

We start by recalling the expression for $\mathcal F^{(P)}_{1RSB}(\bar{n}, \bar{q}_2,\bar{q}_1|\theta)$, given in (2.35), with $\bar{n}$, $\bar{q}_1$ and $\bar{q}_2$ determined from the self-consistency equations (2.37), and by providing the expression for the quenched free energy within the RS assumption, as derived in [37]

Equation (A.1)

where $\mathbb{E} = \mathbb{E}_{\boldsymbol{\xi}^1}\mathbb{E}_{(\boldsymbol{\eta}^1|\boldsymbol{\xi}^1)}\mathbb{E}_z$, $\hat{\eta}_{\tiny M}: = \frac{1}{rM}\displaystyle\sum\nolimits_{a = 1}^{M}\eta^{1,a}$, and $\bar{n}^{^{\prime}}$ and $\bar{q}$ fulfill the following self-consistency equations

Equation (A.2)

Equation (A.3)

We note that for θ = 1, $\bar{q}_1 = \bar{q}$, $\bar{n} = \bar{n}^{^{\prime}}$ and the 1RSB expression for the quenched free energy reduces to the RS one. Now, we expand the 1RSB expression for the quenched free energy $\mathcal F^{(P)}_\textrm{1RSB}(\bar{n},\bar{q}_2,\bar{q}_1|\theta)$, as given in (2.35), around θ = 1, using $\mathcal F^{(P)}_\textrm{1RSB}(\bar{n},\bar{q}_2,\bar{q}_1|\theta)\vert_{\theta = 1} = \mathcal F^{(P)}_\textrm{RS}(\bar{n}^{^{\prime}},\bar{q})$

Equation (A.4)
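For orientation, (A.4) is the first-order Taylor expansion of the 1RSB free energy around θ = 1 (our spelling-out, consistent with how it is used in (A.13) and below):

$\mathcal F^{(P)}_\textrm{1RSB}(\bar{n},\bar{q}_2,\bar{q}_1\vert\theta) = \mathcal F^{(P)}_\textrm{RS}(\bar{n}^{^{\prime}},\bar{q}) + (\theta - 1)\, \partial_\theta \mathcal F^{(P)}_\textrm{1RSB}(\bar{n},\bar{q}_2,\bar{q}_1\vert\theta)\big\vert_{\theta = 1} + \mathcal{O}\big((\theta-1)^2\big).$

Since θ ⩽ 1, the sign of the θ-derivative at θ = 1 decides which ansatz yields the lower free energy.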

Since $\bar{q}_1$ and $\bar{q}_2$ are determined through the self-consistency equations (2.37), they depend on θ as well, so we have to expand (2.37) around θ = 1 too. Following [39] closely, we obtain

Equation (A.5)

Equation (A.6)

where $\bar{q}$ is the solution of (A.3) and $\tilde q_2(\bar{n}^{^{\prime}},\bar{q})$ is the solution of the following self-consistency equation

Equation (A.7)

where

Equation (A.8)

and the functions $A(\bar{n}^{^{\prime}},\tilde q_2(\bar{n}^{^{\prime}},\bar{q}), \bar{q})$ and $B(\bar{n}^{^{\prime}},\tilde q_2(\bar{n}^{^{\prime}},\bar{q}), \bar{q})$ are given by

Equation (A.9)

Equation (A.10)

respectively. Similarly, we can expand $\bar{n}$ as

Equation (A.11)

where

Equation (A.12)

In the limit θ → 1, equation (A.4) implies that $\mathcal F^{(P)}_\textrm{1RSB}(\bar{n}^{^{\prime}},\bar{q}_2(\bar{q}), \bar{q}|\theta)\lt \mathcal F^{(P)}_\textrm{RS}(\bar{n}^{^{\prime}},\bar{q})$ whenever $\partial_\theta \mathcal{F}_\textrm{1RSB} (\bar{n}^{^{\prime}},\bar{q}_2(\bar{q}), \bar{q}|\theta)\vert_{\theta = 1}\gt0$, since the prefactor $(\theta-1)$ is negative for $\theta \lt 1$. Next, we evaluate

Equation (A.13)

In order to determine the sign of the expression above, it is useful to note that $K(\bar{n}^{^{\prime}},\bar{q}, \bar{q}) = 0$: the last term of (A.8) vanishes, so the last two addends in (A.13) cancel each other. This is as expected, since $\bar{q}$ is an extremum of the RS free energy, which is recovered at θ = 1. Next, we study $K(\bar{n}^{^{\prime}},x, \bar{q})$ as a function of $x \in [0, \bar{q}]$ and locate its extrema. These are found from

Equation (A.14)

as

Equation (A.15)

where the last equality follows from algebraic manipulations of hyperbolic functions.

Given that $K(\bar{n}^{^{\prime}},x,\bar{q})$ vanishes at $x = \bar{q}$, if the extremum $x = \tilde q_2(\bar{n}^{^{\prime}},\bar{q})$ is global on the domain considered, then its nature fixes the sign of $K$ there: $K(\bar{n}^{^{\prime}},\tilde q_2(\bar{n}^{^{\prime}},\bar{q}), \bar{q})\gt0$ if $x = \tilde q_2(\bar{n}^{^{\prime}},\bar{q})$ is a maximum and $K(\bar{n}^{^{\prime}},\tilde q_2(\bar{n}^{^{\prime}},\bar{q}),\bar q)\lt0$ if it is a minimum; indeed, in the latter case $K(\bar{n}^{^{\prime}},\tilde q_2(\bar{n}^{^{\prime}},\bar{q}),\bar q) \lt K(\bar{n}^{^{\prime}},\bar{q},\bar{q}) = 0$. Evaluating

Equation (A.16)

we have that if the expression in the square brackets is positive, namely if the parameter $\gamma {\beta^{^{\prime}}}^2$ satisfies

Equation (A.17)

$K(\bar{n}^{^{\prime}},\tilde q_2(\bar{n}^{^{\prime}},\bar{q}), \bar{q})\lt0$ and $\mathcal F^{(P)}_\textrm{1RSB}(\bar{n}^{^{\prime}},\tilde q_2(\bar{n}^{^{\prime}},\bar{q}), \bar{q}|\theta)-\mathcal F^{(P)}_\textrm{RS}(\bar{n}^{^{\prime}},\bar{q}) = -(\theta-1)K(\bar{n}^{^{\prime}},\tilde q_2(\bar{n}^{^{\prime}},\bar{q}), \bar{q})/\beta^{^{\prime}}\lt0$, hence the RS theory is unstable. We plot the line (A.17) in figure 7, for different values of M, in the parameter space $(\gamma, \tilde\beta)$, where $\tilde\beta = \beta^{^{\prime}}/(1+\rho)^{P/2}$, together with the critical lines delimiting the retrieval region (top plots) and the spin-glass region (bottom plots), within the RS and the RSB theory, respectively. We note that (A.17) recovers the expression for the RS instability line found for standard DAM models in [39] in the limits r → 0 or $M \to \infty$, where $\rho_P$ and ρ vanish.
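As an aside, tracing an instability line of this kind numerically is straightforward. The sketch below illustrates the generic procedure (solve the RS self-consistency equations by damped fixed-point iteration, then test an AT-type sign condition on a grid of parameters); the update rules and the threshold used here are a Hopfield-like toy stand-in chosen for illustration only, not the actual equations (A.2), (A.3) and (A.17) of this appendix.

import numpy as np

# Gauss-Hermite quadrature for E_z[f(z)] with z ~ N(0,1):
# E_z[f(z)] ~ sum_k W_k f(Z_k), with Z_k = sqrt(2)*x_k, W_k = w_k/sqrt(pi).
x, w = np.polynomial.hermite.hermgauss(80)
Z, W = np.sqrt(2.0) * x, w / np.sqrt(np.pi)

def rs_fixed_point(beta, gamma, P=4, iters=2000, damp=0.5):
    """Damped fixed-point iteration for illustrative RS order parameters
    (n, q). These Hopfield-like updates are a toy stand-in, NOT the
    self-consistency equations (A.2) and (A.3) of this appendix."""
    n, q = 0.9, 0.9
    for _ in range(iters):
        h = beta * (n ** (P - 1) + np.sqrt(gamma * q ** (P - 1)) * Z)
        t = np.tanh(h)
        n_new, q_new = np.sum(W * t), np.sum(W * t ** 2)
        n = (1.0 - damp) * n + damp * n_new
        q = (1.0 - damp) * q + damp * q_new
    return n, q

def rs_unstable(beta, gamma, P=4):
    """AT-type sign test: flag RS instability when
    gamma * beta**2 * E_z[sech^4(h)] exceeds 1, mimicking the structure
    (not the content) of condition (A.17)."""
    n, q = rs_fixed_point(beta, gamma, P)
    h = beta * (n ** (P - 1) + np.sqrt(gamma * q ** (P - 1)) * Z)
    return gamma * beta ** 2 * np.sum(W / np.cosh(h) ** 4) > 1.0

# Example scan: print the (beta, gamma) grid points flagged as unstable.
for beta in (1.5, 2.5, 3.5):
    for gamma in (0.01, 0.05, 0.10):
        if rs_unstable(beta, gamma):
            print(f"RS unstable at beta={beta}, gamma={gamma}")

In practice, one would replace the toy updates with the actual equations (A.2) and (A.3) and the condition (A.17), and bisect in $\tilde\beta$ at fixed γ to trace the dashed line of figure 7.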

Figure 7. The AT line (marking the onset of instability of the RS theory) is shown as a dashed line in the parameter space of storage load γ and scaled noise $\tilde \beta = \beta^{^{\prime}}/(1+\rho)^{P/2}$ for the unsupervised DAM model, with P = 4. For comparison, we also show the critical lines delimiting the retrieval region (top plots) and the spin-glass region (bottom plots), within the RS and the RSB theory, respectively, as obtained in the phase diagram in figure 1. Different columns correspond to different numbers of examples, M, presented to the network, namely $M = M_{\otimes}^\textrm{unsup}$ (left), $M = 2M_{\otimes}^\textrm{unsup}$ (middle) and $M = +\infty$ (right). We note that the AT line coincides with the line delimiting the SG region within the RS theory.


A.2. AT line for DAM in supervised setting

In this section we derive the RS instability line for the supervised DAM model. We start by providing the expression for the quenched free energy within the RS assumption, as derived in [14] for the standard Dense Hopfield Model.

Equation (A.18)

with $\beta^{^{\prime}}: = 2\beta/P!$ and where $\bar{q}$ and $\bar{n}$ satisfy the self-consistency equations

Equation (A.19)

We also recall that the quenched free energy within the 1RSB approximation, $\mathcal F^{(P)}_\textrm{1RSB}(\bar{n},\bar{q}_1, \bar{q}_2 \vert \theta)$, is as given in (3.10), where the order parameters $\bar{q}_2$, $\bar{q}_1$ and $\bar{n}$ satisfy the set of self-consistency equations provided in (3.11). We note that for θ = 1, $\bar{q}_1 = \bar{q}$, $\bar{n} = \bar{n}^{^{\prime}}$ and the 1RSB expression for the quenched free energy reduces to the RS one. Now, we expand, to leading order in $\theta-1$, the 1RSB quenched free energy around its RS expression, as shown in (A.4). Since the self-consistency equations also depend on θ, we need to expand them too. We can write $\bar{q}_1$ as in (A.5), with $A(\bar{n}^{^{\prime}},\bar{q}_2(\bar{n}^{^{\prime}},\bar{q}),\bar{q})$ given in (A.9), and $\bar{q}_2$ as in (A.6), where $\tilde q_2(\bar{n}^{^{\prime}},\bar{q})$ is the solution of (A.7) and $B(\bar{n}^{^{\prime}},\bar{q}_2(\bar{n}^{^{\prime}},\bar{q}),\bar{q})$ is as given in (A.10) (we recall that the expression of $g(\bar{n}^{^{\prime}},\bar{q}_2(\bar{n}^{^{\prime}},\bar{q}),\bar{q})$ is now as given in (3.12)). With these expressions in hand, we can compute the derivative of $\mathcal F^{(P)}_{\mathrm{1RSB}}$ w.r.t. θ at θ = 1, as needed in (A.4)

Equation (A.20)

Again, we have that $K(\bar{n}^{^{\prime}},\bar{q}, \bar{q}) = 0$ (as $\bar{q}$ is an extremum of the RS free energy). Next, we inspect the sign of $K(\bar{n}^{^{\prime}},\tilde q_2(\bar{n}^{^{\prime}},\bar{q}), \bar{q})$. To this end, we study $K(\bar{n}^{^{\prime}},x,\bar{q})$ for $x \in [0, \bar{q}]$ and locate its extrema, which are found from

Equation (A.21)

as

Equation (A.22)

where the last equality follows from (A.7). Under the assumption that the extremum $x = \tilde q_2(\bar{n}^{^{\prime}},\bar{q})$ is global on the domain considered, we have that $K(\bar{n}^{^{\prime}},\tilde q_2(\bar{n}^{^{\prime}},\bar{q}), \bar{q})\gt0$ if $x = \tilde q_2(\bar{n}^{^{\prime}},\bar{q})$ is a maximum and $K(\bar{n}^{^{\prime}},\tilde q_2(\bar{n}^{^{\prime}},\bar{q}), \bar{q})\lt0$ if it is a minimum. In particular, if

Equation (A.23)

is positive, $K(\bar{n}^{^{\prime}},\tilde q_2(\bar{n}^{^{\prime}},\bar{q}), \bar{q})\lt0$ and $\mathcal F^{(P)}_\textrm{1RSB}(\bar{n}^{^{\prime}},\bar{q}_2(\bar{n}^{^{\prime}},\bar{q}),\bar{q}|\theta)\lt \mathcal F^{(P)}_\textrm{RS}(\bar{n}^{^{\prime}},\bar{q})$. This happens when the expression in the curly brackets of the equation above is negative, i.e. when the parameter $\gamma {\beta^{^{\prime}}}^2$ satisfies the inequality

Equation (A.24)

We note that (A.24) is functionally identical to the expression found in [39] for the RS instability line of the standard DAM model, the only difference being encoded in the term $g(\bar{n}^{^{\prime}},\tilde q_2(\bar{n}^{^{\prime}},\bar{q}), \bar{q})$, which is defined differently here, as it reflects the supervised protocol. In particular, in the limits r → 0 or $M\to \infty$, where ρ and $\rho_P$ vanish, (A.24) recovers the RS instability line of standard DAM models, as obtained in [39].
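Summarising, in both the unsupervised and the supervised setting the method of [39] reduces the stability check to a single sign condition; schematically,

$\mathcal F^{(P)}_\textrm{1RSB}-\mathcal F^{(P)}_\textrm{RS} = -(\theta-1)\,K\big(\bar{n}^{^{\prime}},\tilde q_2(\bar{n}^{^{\prime}},\bar{q}), \bar{q}\big)/\beta^{^{\prime}},$

so that, for θ slightly below 1, the 1RSB free energy falls below the RS one, and the RS theory is unstable, exactly when $K(\bar{n}^{^{\prime}},\tilde q_2(\bar{n}^{^{\prime}},\bar{q}), \bar{q})\lt0$, i.e. when the explicit conditions (A.17) and (A.24) hold.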

Appendix B: Proofs

B.1. Proof of lemma 1

Defining the shorthand $\mathcal{Z}_2(J,t) = \mathcal{Z}_2^{(P)}(\boldsymbol{\eta}^1,\boldsymbol{\lambda}, \boldsymbol{Y};J,t)$, we start by computing the derivative of equation (2.29) w.r.t. t

Equation (B.1)

where $\mathbb{E}_0 = \mathbb{E}_{\boldsymbol{\xi}^1}\mathbb{E}_{(\boldsymbol{\eta}^1|\boldsymbol{\xi}^1)}\mathbb{E}_{\boldsymbol{\lambda}}$ and $\mathcal{W}_{2,t}$ is defined in (2.34). Recalling the definition of $\mathcal Z_2(J, t)$ in (2.23) we have:

Equation (B.2)

where, in order to lighten the notation, we have set $\mathcal{B}^{(P)}_2(\boldsymbol{\sigma}|\boldsymbol{\eta}^1,\boldsymbol{\lambda},\boldsymbol{Y};J,t) = \mathcal{B}_2(\boldsymbol{\sigma};J,t)$. Inserting this expression in (B.1) and using the definition (2.25) for the average of a single replica of the system over the generalised Boltzmann factor $\mathcal{B}_2(\boldsymbol{\sigma};J,t)$, we obtain

Equation (B.3)

Next, denoting the combined average over the quenched disorder and the Boltzmann distribution as

Equation (B.4)

and applying Stein's lemma (2.74) to the standard Gaussian variables $Y_i^{(b)}$ and $\lambda^{\mu}_{i_1,\ldots,i_P}$, we get

Equation (B.5)
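For the reader's convenience, we recall that Stein's lemma (i.e. Gaussian integration by parts), invoked here in the form (2.74), states that for a standard Gaussian variable $z \sim \mathcal{N}(0,1)$ and any differentiable function f with $\mathbb{E}\,\vert f^{\prime}(z)\vert \lt \infty$,

$\mathbb{E}\left[z\,f(z)\right] = \mathbb{E}\left[f^{\prime}(z)\right].$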

Finally, we use the definitions (2.32) and (2.33) and we arrive at the compact expression:

Equation (B.6)

Next, we take the thermodynamic limit $N\to \infty$. Manipulating the expression for the moments $\langle q_{12}^P(\boldsymbol{\sigma}^{(1)},\boldsymbol{\sigma}^{(2)})\rangle_{t,a}$, with $a = 1,2$, by means of Newton's binomial theorem (a plausible reconstruction of this step is sketched at the end of this proof), we obtain, under assumption 2,

Equation (B.7)

for $a = 1,2$. Finally, we insert the above in (B.6) and we choose the constants $A_1,\ A_2, \ \psi$ in such a way that the terms dependent on $\langle q_{12}(\boldsymbol{\sigma}^{(1)},\boldsymbol{\sigma}^{(2)}) \rangle_{t,a}$ cancel out,

Equation (B.8)

This makes $\textrm{d}\mathcal{F}^{(P)}/\textrm{d}t$ independent of t; integrating over $t \in [0,1]$ then leads to the thesis (2.40).
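For completeness, we sketch the binomial-theorem step invoked above. A plausible reconstruction (based on the surrounding text, not on the paper's own display) writes the overlap power as an expansion around the reference value on which $q_{12}$ concentrates, here denoted generically $\bar{q}$:

$q_{12}^P = \big(\bar{q} + (q_{12}-\bar{q})\big)^P = \sum_{k = 0}^{P} \binom{P}{k}\,\bar{q}^{\,P-k}\,(q_{12}-\bar{q})^k,$

so that, when $q_{12}$ concentrates in the thermodynamic limit, the terms with $k\geqslant 2$ are subleading and only the constant and linear terms survive; this is the origin of the linear dependence on $\langle q_{12}(\boldsymbol{\sigma}^{(1)},\boldsymbol{\sigma}^{(2)}) \rangle_{t,a}$ that is cancelled by the choice of constants in (B.8).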

B.2. Proof of corollary 2

Let us start from the self-consistency equations in the $M\to\infty$ limit introduced in corollary 1. As $\tilde\beta\to\infty$ we have $\bar{q}_2\to 1$; therefore, in order to perform the limit, we introduce the reparametrization

Equation (B.9)

It is now useful to insert an additional term $\tilde\beta y$ in the expression of $g(\beta, \gamma, \boldsymbol{Y})$ in (2.76), which now reads as

Equation (B.10)

Using this new parameter y, we can recast the equation for $\bar{q}_2$ as a derivative of the magnetization

Equation (B.11)

where we have set $\Delta\bar{q} = \bar{q}_2-\bar{q}_1$ and used that, as $\tilde\beta\to\infty$, $\tilde\beta\theta\to D \in\mathbb{R}$. Thus, in the zero-temperature limit, the last three equations in (2.70) become

Equation (B.12)

Now, if we suppose $\Delta \bar{q}\ll 1$, then (B.10) reduces to

Equation (B.13)

where

Equation (B.14)

Performing the integral over $Y^{(2)}$, we arrive at equations (2.77) and (2.78).
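As a side remark, the reparametrization adopted in this proof is of the standard zero-temperature type familiar from spin-glass theory. A typical choice of this kind (stated here as an assumption, since the explicit form of (B.9) involves the model's details) keeps the following combinations finite as $\tilde\beta\to\infty$:

$\tilde\beta\,(1-\bar{q}_2) \to C \in \mathbb{R}_{+}, \qquad \tilde\beta\,\theta \to D \in \mathbb{R},$

the second of which is used explicitly above.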

Footnotes

  • It was shown in [34] that the conditional entropy, which quantifies the amount of information needed to reconstruct the original pattern $\boldsymbol{\xi}^\mu$ given the set of related examples $\{\eta^{\mu,a}\}_{a = 1,\ldots,M}$, is a monotonically increasing function of ρ.
