Abstract
Hebbian neural networks with multi-node interactions, often called Dense Associative Memories, have recently attracted considerable interest in the statistical mechanics community, as they have been shown to outperform their pairwise counterparts in a number of features, including resilience against adversarial attacks, pattern retrieval with extremely weak signals and supra-linear storage capacities. However, their analysis has so far been carried out within a replica-symmetric theory. In this manuscript, we relax the assumption of replica symmetry and analyse these systems at one step of replica-symmetry breaking, focusing on two different prescriptions for the interactions that we will refer to as supervised and unsupervised learning. We derive the phase diagram of the model using two different approaches, namely Parisi's hierarchical ansatz for the relationship between different replicas within the replica approach, and the so-called telescope ansatz within Guerra's interpolation method: our results show that replica-symmetry breaking does not alter the threshold for learning and slightly increases the maximal storage capacity. Further, we also derive analytically the instability line of the replica-symmetric theory, using a generalization of the De Almeida and Thouless approach.
Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 license. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
1. Introduction
Since Hopfield's seminal work on the use of biologically inspired neural networks for associative memory and pattern recognition tasks [1], there have been many important contributions applying concepts from statistical mechanics, in particular spin-glass theory, to the study of neural networks with pairwise Hebbian interactions (for an overview of the vast literature, the reader is referred to the books [2–9] and to the recent reviews [10, 11]). This connection was first pointed out by Hopfield himself in [1], where he showed that neural networks with pairwise Hebbian interactions were particular realizations of spin glasses, and it was further developed by Amit et al [12] who showed that the free-energy landscape of such systems is characterized by a large number of local minima, corresponding to different patterns of information, retrieved by the network from different initial conditions. In addition to neural networks with pairwise Hebbian interactions, a large body of work has focused on extending Hebbian learning to multi-node interactions, both in the early days (see e.g. [13–15]) and more recently (see e.g. [16, 17]). These models are often referred to as dense associative memories [16], as they can store many more patterns than the number of neurons in the network. In particular, they can perform pattern recognition with a supra-linear storage load [15] and can work at very low signal-to-noise ratio, when compared to their pairwise counterparts [18]. In addition, they have been shown to be resilient to adversarial attacks [19] and combinations of such models can lead to exponential storage capacity [20].
In recent years, a duality between Hebbian neural networks and machine learning models has been pointed out. For example, the Hopfield model has been shown to be equivalent to a restricted Boltzmann machine [21], an archetypical model for machine learning [22], and sparse restricted Boltzmann machines have been mapped to Hopfield models with diluted patterns [23, 24]. Furthermore, restricted Boltzmann machines with generic priors have led to the definition of generalized Hopfield models [25–27] and neural networks with multi-node Hebbian interactions have recently been shown to be equivalent to higher-order Boltzmann machines [28, 29] and deep Boltzmann machines [30, 31]. As a result, multi-node Hebbian learning is receiving a second wave of interest since its foundation in the eighties [13, 14] as a paradigm to understand deep learning [17, 32].
In order to make this connection clearer, we should stress that neither the Hopfield model nor its generalized versions with multi-node Hebbian interactions are machine learning models, in that they are not trained from data and their couplings do not evolve in time according to a learning rule. Instead, their couplings are fixed from the outset to the values that would have been attained by a neural network trained according to the Hebbian rule, which adjusts the strengths of the connections between the neurons based on the input patterns that the network is exposed to. In this work, we consider two different routes for exposing the network to data, as considered in [33, 34]. In particular, we analyse the scenario where the input patterns are noisy or corrupted versions of pattern archetypes. In the first route, which we will call 'unsupervised', we expose the network to all of the input patterns together, without specifying the archetype that each pattern is meant to represent [35]. In the second route, which we will call 'supervised', we split the training dataset into different classes, corresponding to the archetypes, and the network is exposed to the examples in each class, class by class, in the same way that a teacher helps a student to learn one topic at a time, from different examples [36]. Beyond being more realistic than traditional Hebbian storing, this generalization allows one to pose a number of new questions, such as: is the network able, in either scenario, to retrieve the archetypes by itself, i.e. to find hidden and intrinsic structures in the training dataset? What is the minimal number of examples per archetype that the network has to experience, given the noise in the examples and the number of archetypes, in the two scenarios?
These questions have been addressed in neural networks with pairwise as well as multi-node Hebbian interactions, for both random uncorrelated datasets and structured datasets (including the well-known MNIST and Fashion-MNIST) [34, 37, 38], within a replica-symmetric (RS) analysis, which is based on the assumption that different replicas (or copies) of the system are invariant under permutations. However, since neural networks are particular realizations of spin glasses, such symmetry is expected to be broken in certain regions of the parameter space, where the system develops a multitude of degenerate states. In a recent work [39], we devised a simple and rigorous method to detect the onset of the instability of RS theories in neural networks with pairwise and multi-node Hebbian interactions. Building on this work, we derive the RS instability lines for the two different protocols defined here, supervised and unsupervised, and we re-investigate the problem defined above, previously analysed within an RS assumption, by carrying out the analysis at one step of replica-symmetry breaking (1RSB). Since Parisi's seminal work on replica symmetry breaking [2], several alternative mathematical techniques have been proposed to investigate replica-symmetry-broken phases [40–42]. In this work, we will use interpolation techniques, which were pioneered by Guerra in the context of his work on spin glasses [43] and were later applied to neural networks (e.g. [44]). One advantage of this approach is that it avoids certain heuristics required by the replica method and leads to simpler calculations. For completeness, we will also derive the same results using the replica approach, with the pedagogical aim of creating a bridge between the two methods, one being a cornerstone of the statistical physics community (Parisi's), the other constituting a golden niche in the mathematical physics community (Guerra's).
The paper is structured as follows. In section 2 we define a first model for dense associative memories (DAM), which we shall refer to as 'unsupervised', and we analyse it at one step of replica symmetry breaking via Guerra's interpolation method and replica techniques. In section 3 we define a second model for DAM, which we shall refer to as 'supervised', and we carry out the same analysis as in section 2. Finally, in section 4 we summarize and discuss our results. We relegate to the appendices the technical details, as well as the derivation of the instability line of the RS theory, which shows the importance of analysing these systems under the assumption of replica symmetry breaking.
2. 'Unsupervised' DAM
In this section we analyse the information processing capabilities of DAM in the so-called 'unsupervised' setting, as introduced in section 2.1, via two different approaches, namely Guerra's interpolation techniques (section 2.2) and Parisi's replica approach (section 2.3). Both analyses are carried out at the first step of RSB. Results are discussed in section 2.4. The instability line of the RS theory (equivalent to the de Almeida and Thouless line in the context of spin-glasses), showing the importance of working under the assumption of broken replica symmetry, is derived in appendix A.1.
2.1. Model and definitions
We consider a system of N neurons, modelled via Ising spins σi ∈ {−1, +1}, with i = 1, …, N, where neurons interact via P-node interactions. We assume to have K Rademacher archetypes, defined as N-dimensional vectors ξ^µ, with µ = 1, …, K, where each entry is drawn randomly and independently from the distribution
Such archetypes are not provided to the network; instead, the network will have to infer them by experiencing solely noisy or corrupted versions of them. In particular, we assume that, for each archetype µ, M examples η^{µ,a}, with a = 1, …, M, are available, which are corrupted versions of the archetype, such that
where r ∈ [0, 1] controls the quality of the dataset, i.e. for r = 1 the example matches the archetype perfectly, while for r = 0 it is purely stochastic. To quantify the information content of the dataset, it is useful to introduce the variable
that we shall refer to as the dataset entropy. We note that ρ vanishes either when the examples are identical to the archetypes (i.e. r → 1) or when the number of examples is infinite (i.e. M → ∞), or both. If we regard the traditional DAM model as storing patterns and the unsupervised DAM model as learning patterns from stored examples, then ρ can be thought of as the parameter quantifying the difference between storing and learning (a difference that grows with ρ).
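As a concrete sketch, the snippet below generates noisy examples according to the corruption process above and computes the dataset entropy. The closed form ρ = (1 − r²)/(M r²) used here is an assumption (it reproduces the stated limits ρ → 0 for r → 1 or M → ∞, but the exact normalization in (2.3) should be checked against the definition):

```python
import numpy as np

def dataset_entropy(r: float, M: int) -> float:
    """Assumed form rho = (1 - r^2) / (M r^2): vanishes as r -> 1 or M -> infinity."""
    return (1.0 - r**2) / (M * r**2)

def noisy_examples(xi: np.ndarray, r: float, M: int, rng=None) -> np.ndarray:
    """Generate M corrupted copies of archetype xi: each entry keeps its sign
    with probability (1 + r) / 2 and is flipped otherwise."""
    rng = np.random.default_rng(rng)
    chi = rng.choice([1, -1], size=(M, xi.size), p=[(1 + r) / 2, (1 - r) / 2])
    return chi * xi  # eta^{mu,a}_i = chi^{mu,a}_i * xi^mu_i
```

For r = 1 every example coincides with the archetype, and increasing M at fixed r monotonically lowers ρ, matching the limits discussed above.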
Definition 1. The cost function (or Hamiltonian) of the 'unsupervised' DAM model is
where the constant , with and ρ defined in (2.3), is included in the definition for mathematical convenience, while the factor ensures that the Hamiltonian is extensive in the network size N. On the left hand side (LHS), the label (P) denotes the order of the multi-node interactions and η is a short-hand for the collection of all the examples, for all the archetypes .
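To make the structure of the cost function concrete, here is a minimal numerical sketch of the unsupervised energy: minus the sum, over all examples of all archetypes, of the P-th power of the example overlaps with the configuration. The constant prefactor discussed above is deliberately omitted (an assumption of this sketch), so only the scaling with N is illustrated:

```python
import numpy as np

def energy_unsupervised(sigma: np.ndarray, eta: np.ndarray, P: int = 4) -> float:
    """Sketch of the unsupervised DAM energy (normalization constant omitted).

    sigma : configuration of N Ising spins (+/-1), shape (N,)
    eta   : noisy examples, shape (K, M, N)
    """
    N = sigma.size
    m = eta @ sigma / N            # Mattis overlap of every example, shape (K, M)
    return float(-N * np.sum(m ** P))
```

With uncorrupted examples (r = 1) and σ aligned with one archetype, every overlap in that class equals 1, so the energy is extensive, of order −N, as the prefactor in (2.4) is designed to guarantee.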
Remark 1. For P = 2, i.e. pairwise interactions, the unsupervised DAM model reduces to the Hopfield neural network in the unsupervised setting [33]. In addition, in the absence of dataset corruption, i.e. r = 1, and the model reduces to the standard Hopfield model, as analysed in [4, 12].
Definition 2. The partition function associated to the Hamiltonian (2.4), at inverse noise level β, is defined as
where the exponential weight is referred to as the Boltzmann factor. At finite network size N, the free energy of the model is given by
where E_η denotes the average over the realizations of the examples η, regarded as quenched.
We are interested in the behaviour of the system in the limit of large system size and finite network load α, as specified by the following
Definition 3. In the thermodynamic limit N → ∞, the load of the unsupervised DAM model is defined as
We will mostly focus on the so-called 'saturated' regime, where α > 0. The regime where the system is away from saturation can be inspected by taking the limit α → 0. For convenience, we also introduce the parameter γ, defined by
The free energy in the thermodynamic limit is denoted as
In the following, we focus on the ability of the network to retrieve a single archetype, say ν. Given the invariance of the Hamiltonian w.r.t. permutations of the archetypes, we will set, without loss of generality, ν = 1. Then, following standard procedures [4], we split the sum over µ in the Hamiltonian into the contribution from µ = 1 (regarded as the signal) and the contributions from the other archetypes µ > 1 (regarded as slow noise). Furthermore, we add to the argument of the exponent of the Boltzmann factor a term J N m1, so that the free energy can serve as a moment-generating functional of the so-called Mattis magnetization m1 (vide infra), by taking derivatives of the free energy w.r.t. J. As this term is not part of the original Hamiltonian, J will be set to zero later on.
Starting from (2.5), the resulting partition function is
where we have set and for the first pattern (µ = 1) we used and neglected terms which vanish in the thermodynamic limit. Focusing only on the last term in (2.10), we note that the product is, for each µ, an i.i.d. binary random variable, and the sum over µ converges, by the central limit theorem (CLT), to a Gaussian variable with suitably defined mean and variance, i.e.
where
and
This enables us to write
with
Thus the partition function of the unsupervised DAM model reads as
To make analytical progress, it is convenient to introduce the order parameters of the model.
Definition 4. The order parameters of the unsupervised DAM model are the Mattis magnetization m1 of the archetype, the Mattis magnetizations of each example of the archetype, and the two-replica overlaps qlm, which quantify the correlations between the variables σ in two copies l and m of the system sharing the same disorder (i.e. two replicas):
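In code, these order parameters are simple scalar projections; a minimal sketch (the array shapes are assumptions of this illustration):

```python
import numpy as np

def mattis_magnetization(sigma: np.ndarray, xi: np.ndarray) -> float:
    """m = (1/N) sum_i xi_i sigma_i: overlap of configuration sigma with pattern xi."""
    return float(xi @ sigma) / sigma.size

def replica_overlap(sigma_l: np.ndarray, sigma_m: np.ndarray) -> float:
    """q_lm = (1/N) sum_i sigma^(l)_i sigma^(m)_i: two-replica overlap."""
    return float(sigma_l @ sigma_m) / sigma_l.size
```

Both quantities lie in [−1, 1]; perfect retrieval corresponds to |m| = 1, while identical replicas give q = 1.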
In the next subsection we calculate the free energy of the system in the thermodynamic limit using Guerra's interpolation technique, assuming one step of replica symmetry breaking (1RSB).
2.2. 1RSB analysis via Guerra's interpolation technique
The key idea of Guerra's interpolation method is to define an auxiliary free energy, which is a function of a parameter t ∈ [0, 1] that interpolates between the free energy of the original DAM model (obtained at t = 1) and the free energy of an exactly solvable one-body model (obtained at t = 0), whose effective fields mimic those of the original model. As the direct calculation of the free energy at t = 1 is cumbersome, it is obtained by using the fundamental theorem of calculus: we first evaluate the t-derivative of the interpolating free energy and then obtain the free energy of the original model as
To perform the analysis within the first step of replica symmetry breaking, we make the standard 1RSB assumption [2], stated below:
Assumption 1. Under the 1RSB assumption, the distribution of the two-replica overlap q12 displays, in the thermodynamic limit, two delta-peaks at the values q̄1 and q̄2, and the concentration on these two values is ruled by the parameter θ ∈ [0, 1], namely
where
and as replicas are conditionally independent, given the disorder η .
The Mattis magnetizations m1 and are assumed to be self-averaging at their equilibrium values and , respectively, namely
where and , with .
Remark 2. We note that for θ = 0, equation (2.20) corresponds to the standard RS ansatz, with the distribution of the two-replica overlap being delta-peaked at a single value. As θ increases, the 1RSB ansatz is gradually introduced. The optimal value of Parisi's parameter θ can in principle be obtained by minimizing the free energy with respect to θ. However, the explicit form of the extremization condition of the free energy with respect to θ is known to be complicated and hardly usable in practice, so that θ is often left as a free parameter (see e.g. [7]). In this work, we will fix θ by following a different procedure, namely by maximizing the region where the network accomplishes pattern retrieval, as done in [45] (see section 2.4 for a more detailed discussion).
Definition 5. Given the interpolating parameter t ∈ [0, 1], the constants ψ, A1 and A2 to be fixed later on, and the i.i.d. standard Gaussian variables (that must be averaged over as explained below), Guerra's 1RSB interpolating partition function for the unsupervised DAM model is given by
The index 2 on the LHS stands for the number of vectors that must be averaged over. Their number is equal to k + 1, where k is the number of steps of RSB (here k = 1).
The average w.r.t. the generalized Boltzmann factor associated to the interpolating partition function can be defined as
Remark 3. We note that for t = 1, (2.24) recovers the original partition function (2.15), whereas for t = 0 it reduces to the partition function of a system of N non-interacting neurons, described by a one-body Hamiltonian with local fields, which is readily evaluated. The parameters of the one-body model, ψ, A1 and A2, must be chosen in such a way that the t-dependent terms in the interpolating free energy cancel out in the thermodynamic limit, under the 1RSB assumption.
In what follows, we will average over the fields for recursively, as explained by Guerra in [43] and in the statements below. To this purpose, we define
where with we mean the average over the vectors for . The interpolating quenched free energy related to the partition function (2.24) is introduced as
where denotes the average over the variables 's and 's. In the thermodynamic limit, assuming the limit exists, we write
Remark 4. We note that in the RS case, where and , we have
hence the average over the Gaussian fields acts on the same level as the disorder average, i.e. outside the logarithm. When moving from the RS to the 1RSB scenario, an extra average is introduced, which acts on a different level, i.e. inside the logarithm. This reflects the presence of hierarchical valleys in the free energy landscape [46], which leads to the well-known multi-scale thermalization in spin glasses. The internal average is related to thermalization within a valley at the lower level of the (two-level) hierarchy, whereas the external averages account for thermalization across valleys (i.e. at the higher level of the hierarchy). The structure of the hierarchy is captured by the parameter θ, which controls the amplitudes of the valleys at the lower level of the hierarchy and can be related to effective temperatures [47], as discussed in [48] for Parisi's replica approach and in [49] for Guerra's interpolation techniques. Note that the distinction between the two levels of the hierarchy is not made for ψ, which accounts for the signal contribution and is kept RS (as usually done for the magnetization in spin-glass models).
Now, following Guerra's prescription [43], given two copies (or replicas) of the system, we define the following averages, corresponding to thermalization within the two different levels of the hierarchy
where
At this point, we are able to state our second assumption, namely that at each level of the hierarchy the two-replica overlap self-averages around the value of the corresponding peak in the overlap distribution, as given in assumption 1:
Finally, we provide the explicit expression of the quenched free energy in terms of the control parameters in the next
Proposition 1. In the thermodynamic limit N → ∞, within the 1RSB assumption and under assumption 2, the quenched free energy of the unsupervised DAM model reads as
where
and and denote the expectations over the Bernoulli distributions (2.1) and (2.2) respectively. In the above, , fulfill the following self-consistency equations
with
and . Furthermore, as , we have
In order to prove the aforementioned proposition, we need to show the following
Lemma 1. In the thermodynamic limit, the t derivative of the interpolating quenched free energy (2.29) under the 1RSB assumption and under assumption 2 is given by
The proof of this lemma is shown in appendix B.1. Now let us prove proposition 1:
Proof. Exploiting the fundamental theorem of calculus, we can relate the free energy of the original model to the one stemming from the one-body terms via (2.19). Since the t-derivative (2.40) does not depend on t, all we need is to add the expression above to the free energy at t = 0, which only contains one-body terms. Let us start with the computation of the latter at finite size N:
with the definition of given in remark 3.
Upon setting the constants to the values (B.8) stated in the proof of lemma 1, we have
where reads as
Recalling that , taking the thermodynamic limit of the above equations and inserting them into the fundamental theorem of calculus (2.19), we reach (2.35). Finally, by maximizing (2.35) w.r.t. the order parameters, we obtain the self-consistency equations (2.37), hence we reach the thesis. □
2.3. 1RSB analysis via Parisi's replica trick
In this section, we derive the expression of the quenched free energy of the unsupervised DAM model provided in proposition 1 by using the replica method [2, 50], at the first step of RSB [2, 51]. The core of this approach consists in writing the quenched free energy, as defined in (2.6), as
with and where we have used the shorthand for the partition function, defined as
E_η denotes, as in (2.6), the average over the quenched disorder η and, in accordance with the explanation provided in section 2.1, the inclusion of the last term in the round brackets of (2.45) is intended to make the free energy a moment-generating function for the Mattis magnetization m1.
For integer n, this function is the partition function of a system of n identical, non-interacting replicas of the original system,
namely
Splitting, as before, the signal term (µ = 1) from the remaining terms µ > 1, which act as a quenched noise on the learning and retrieval of pattern , we can write
where denotes the average over the pattern , whose entries are drawn from the distribution (2.1) and with denoting the expectation over the conditional distribution (2.2). As before, exploiting the CLT to rewrite the last term in (2.48) via (2.14), we obtain
Treating the two terms and separately, we insert in the former the identity
and use the Fourier representation of the Dirac delta, obtaining
For the noise term, performing first the expectation over λ
inserting
and using the Fourier representation of the Dirac delta, we have
Inserting equations (2.50) and (2.52) in (2.49), we obtain
where
and we have used the shorthand notation and similarly for P , N and Z . Next, we assume that the limit n → 0 in
can be taken by analytic continuation and that the two limits N → ∞ and n → 0 can be interchanged (provided they exist), so that the integrals in (2.53) can be performed by steepest descent. This leads to
where the order parameters are solutions of the saddle-point equations
where the average is computed over the distribution where is equal to the terms in the round brackets in the second line of (2.54). Inserting (2.59) and (2.60) in , we obtain
and, substituting in (2.56)
where and must be determined from (2.57) and (2.58). Since depends on Q , N via (2.61), equations (2.57) and (2.58) denote a set of self-consistency equations. Now, in order to proceed, we need to find the form of Q and N in the limit n → 0. We will make the 1RSB ansatz:
Using (2.63) to evaluate the first term in (2.62) and the first term in (2.61), we get
Upon inserting in (2.62) we obtain
with
where we have used the definition (2.36).
Next, we apply a Gaussian transformation to the last term in (2.67) to linearize the squared terms in the Hamiltonian (2.68)
where we have set and for . Now, writing
the sum over the spin configuration can be done explicitly, finding
Now applying the n → 0 limit [50] and exploiting the relation , we get
Finally, inserting this expression in (2.67) and denoting with and the Gaussian averages w.r.t. and respectively, we reach the following expression for the free energy
where the order parameters must fulfill the same self-consistency equations (2.37) obtained by using Guerra's interpolation technique, as reported in proposition 1.
2.4. Results
Solving numerically the self-consistency equations (2.37), we obtain the phase diagram shown in figure 1, in the parameter space (γ, ), where γ is the storage load defined in (2.8) and the second axis is a scaled inverse temperature, with ρ the dataset entropy defined in (2.3). Panels show results for P = 4 and three different values of M, as shown in the legend. The result for M → ∞ follows from the analysis carried out in section 2.4.1. In each panel, the grey curve marks the transition from the ergodic phase, where all order parameters vanish (above), to the spin-glass phase, where the two-replica overlap becomes non-zero (below). The black curve marks the transition from the retrieval phase (left) to the non-retrieval region (right). The spin-glass solution within the retrieval region is always unstable and is delimited by the dotted curve (we refer to this as the instability region). In figure 2 we show the results for other values of P and M, as shown in the legend. As P and M grow, the spin-glass region shrinks and the retrieval region expands. For M → ∞ the critical storage lines become equal to those of traditional DAM models, where archetypes, rather than their noisy examples, are encoded in the interactions [45]. This is as expected: when an infinite number of examples is provided, the system reaches the same performance as a neural network where the archetypes are stored in the interactions directly. For finite M, however, the network's ability to retrieve the archetypes degrades at values of the storage load which are lower than the storage capacity of traditional DAM models. Both figures 1 and 2 have been obtained for the value of θ which maximizes the retrieval region, as shown in figure 3 (left panel). In the latter, we show the line that separates the retrieval region from the spin-glass or the ergodic region, for different values of θ. For θ = 0 (which corresponds to the RS theory), the line exhibits a re-entrance, as commonly observed in spin glasses and associative memories [45, 50, 52, 53].
As θ is increased, the retrieval region expands, reaching its maximum at an optimal value of θ, where the re-entrance disappears completely, showing the same qualitative behaviour as in the Hopfield model [12] and in traditional DAM [45]. Increasing θ beyond this optimal value, the re-entrance appears again and gets more pronounced as θ approaches 1, where the transition line becomes identical to the RS (and the θ = 0) case, as expected. In figure 3 (right panel) we show the dependence of the optimal θ on P. As P increases, its value decreases. This is as expected from spin-glass theory, since for large P, P-spin models are known to converge to the random energy model (REM) [54], which is RS. Interestingly, we find that the optimal θ takes the same value as in the DAM model considered in [45], for all values of M; hence the optimal 1RSB breaking parameter is not influenced by the use of corrupted examples rather than archetypes.
2.4.1. Limiting cases: M → ∞ and β → ∞.
Finally, we consider two instructive limiting cases. One is the limit M → ∞, where the number of available examples is large. Although idealised, this scenario is becoming less utopian nowadays in a number of applications, and it is instructive as an explicit relation between the archetype magnetization and the mean magnetization of the examples naturally emerges. The second scenario is the zero-noise limit β → ∞, where the information processing capabilities of the network are expected to be maximal. We now state the following
Corollary 1. In the limit M → ∞, the 1RSB self-consistency equations for the order parameters of the unsupervised DAM model are
where
and .
Proof. In the limit M → ∞, we can apply the CLT to the sum of the examples defined in (2.36), to write
where λ is a standard Gaussian variable, λ ∼ N(0, 1).
Inserting (2.72) in (2.38) and in the expression for provided in (2.37), we get
and
Hence, by applying to the standard Gaussian variable λ Stein's lemma, which states that, for a standard Gaussian variable J ∼ N(0, 1) and a generic function f(J) for which the two expectations E[J f(J)] and E[f′(J)] both exist, one has E[J f(J)] = E[f′(J)],
and explicitly averaging over ξ we get the self-consistency equations in (2.70). □
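Stein's lemma is easy to verify by Monte Carlo; the sketch below checks E[λ f(λ)] = E[f′(λ)] for f = tanh, whose derivative is 1 − tanh² (the sample size is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = rng.standard_normal(1_000_000)  # samples of a standard Gaussian

lhs = np.mean(lam * np.tanh(lam))       # estimate of E[lambda f(lambda)]
rhs = np.mean(1.0 - np.tanh(lam) ** 2)  # estimate of E[f'(lambda)]
print(lhs, rhs)  # the two estimates agree up to Monte Carlo error
```

The choice f = tanh is natural here, since effective single-site magnetizations in mean-field theories take exactly this form.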
Finally, it has been shown in [38], within an RS analysis, that neglecting the second term in the equation for the overlap given in (2.70) has a negligible impact on the solutions of the self-consistency equations in the relevant regime of low noise, where the network works as an associative memory. As this equation is the same in the RS theory and in the 1RSB theory considered here, the same truncation can be adopted here
This leads to a simplified expression for g that is
where we have rescaled the noise. Solving numerically the self-consistency equations (2.70), with the equation for the overlap replaced by its truncated version (2.75), leads to the phase diagram shown by the lines in figures 1 and 2.
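Numerically, self-consistency equations such as (2.70) are fixed-point problems x = T(x) and are typically solved by damped iteration. The following generic sketch illustrates the scheme on a toy scalar equation, m = tanh(βm) (the Curie–Weiss magnetization), not on the full 1RSB system:

```python
import numpy as np

def fixed_point(T, x0, damping=0.5, tol=1e-10, max_iter=10_000):
    """Damped fixed-point iteration: x <- (1 - damping) * x + damping * T(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_new = (1 - damping) * x + damping * np.asarray(T(x), dtype=float)
        if np.max(np.abs(x_new - x)) < tol:
            return x_new
        x = x_new
    raise RuntimeError("fixed-point iteration did not converge")

beta = 2.0
m = fixed_point(lambda m: np.tanh(beta * m), x0=0.5)  # nontrivial solution for beta > 1
```

The damping stabilizes the iteration near phase boundaries, where the undamped map can oscillate; the same routine applies verbatim to a vector of order parameters.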
Next, we turn to the analysis of the ground state, i.e. to the limit β → ∞, and we state the following
Corollary 2. In the limit β → ∞, the 1RSB self-consistency equations for the order parameters of the unsupervised DAM model are
where and
We report the proof of this corollary in appendix B.2.
Next we compute, by numerically solving (2.77) and (2.78), the ground-state critical storage capacity γC beyond which a black-out scenario emerges, namely for and for . In figure 4 we plot γC as a function of the ratio between the number M of experienced examples and the minimum number of examples required by the unsupervised DAM model (with P > 2) to correctly learn and retrieve an archetype, within the RS theory, which has been proved to be
in [37]. In order to ascertain whether replica symmetry breaking alters such a threshold, we plot γC within both the 1RSB assumption (blue line) and the RS theory (red line). We see that γC becomes non-zero (meaning that retrieval can occur) at the same value of this ratio for the RS and the 1RSB theory. This shows that the minimum number of examples that a network needs to accomplish retrieval of the archetype is the same under the RS and the 1RSB assumptions.
Results show that, as in the standard Hebbian storage [44, 53, 55], the phenomenon of replica symmetry breaking induces a mild improvement in terms of the maximal storage.
3. 'Supervised' DAM
In this section we analyse the information processing capabilities of the DAM model in the so-called 'supervised' setting, where the dataset given to the network is now split into different categories, one for each archetype ξ^µ, with µ = 1, …, K. Again, we analyse the model, defined in section 3.1, via both Guerra's interpolation techniques (section 3.2) and Parisi's replica approach (section 3.3), at the first step of RSB. In the appendix, we also derive the instability line of the RS theory for this model.
3.1. Model and definitions
As in the unsupervised DAM model, we consider a network of N Ising neurons, interacting via P-node interactions. We assume to have K Rademacher archetypes, defined as N-dimensional vectors with entries drawn randomly and independently from the distribution (2.1). In addition, we assume to have M examples for each archetype, which are corrupted versions of the archetypes, with entries distributed according to (2.2). We shall refer to this model as the supervised DAM model.
Definition 6. The Hamiltonian of the supervised DAM model is
where the constant in the denominator of the r.h.s. is included for mathematical convenience and the factor ensures the Hamiltonian to be extensive in the network size N, as explained previously for the unsupervised model, see equation (2.4).
We highlight the hidden role of a teacher that, before providing the dataset to the network, has grouped examples pertaining to the same archetype together (hence the proliferation of summations in the cost function (3.1) with respect to (2.4)).
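The difference from the unsupervised energy can be sketched in code: with the teacher's grouping, the example overlaps are first averaged within each class and only then raised to the P-th power (the normalization constant is again omitted, as an assumption of this sketch):

```python
import numpy as np

def energy_supervised(sigma: np.ndarray, eta: np.ndarray, P: int = 4) -> float:
    """Sketch of the supervised DAM energy (normalization constant omitted).

    sigma : configuration of N Ising spins (+/-1), shape (N,)
    eta   : noisy examples grouped by class, shape (K, M, N)
    """
    N = sigma.size
    m_class = (eta @ sigma / N).mean(axis=1)  # class-averaged overlaps, shape (K,)
    return float(-N * np.sum(m_class ** P))
```

Averaging within a class before taking the power lets the example noise partially cancel, which is the mechanism behind the teacher's advantage discussed in the text.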
As before, we will focus (without loss of generality) on the ability of the network to store and retrieve the first archetype, hence the Mattis magnetization provided in (2.16) remains a relevant order parameter. However, the set of order parameters for the examples, previously given by the magnetizations of the single examples (see (2.17)), must now be replaced with a single order parameter
Its probability distribution, in the thermodynamic limit, is assumed self-averaging, namely
as for the Mattis magnetization, while the distribution for the overlap defined in (2.18) is still assumed bimodal as in assumption 1 (see (2.20)). As for the unsupervised model, we add an extra term in the cost function, namely , in order to generate the moments of the Mattis magnetization m1 by taking the derivatives of the quenched free energy w.r.t. J and, as this term is not part of the original Hamiltonian, it will be set to zero at the end of the calculations. Therefore, we write the partition function as
where ρ is the dataset entropy defined in (2.3), and for the first pattern, µ = 1, we have used the relation and neglected terms which vanish in the thermodynamic limit. Next, as done for the unsupervised case, we apply the CLT to the variables in the square brackets in (3.4), namely
where
and
with
Thus, we can write
where .
Inserting all back in (3.4) we get
In the next subsection we calculate the free energy of the system in the thermodynamic limit using Guerra's interpolation technique, assuming one step of replica symmetry breaking (1RSB).
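We also recall, schematically, how the source term J introduced above generates the moments of the Mattis magnetization (the normalization conventions below are our assumption, since the explicit formulas are not reproduced in this excerpt): perturbing the Hamiltonian with a field conjugate to m₁,

```latex
\mathcal{H}_J \;=\; \mathcal{H} \;-\; J N m_1, \qquad
\langle m_1\rangle \;=\; \lim_{J\to 0}\,\frac{1}{\beta N}\,
\frac{\partial}{\partial J}\,\mathbb{E}\ln\mathcal{Z}(J).
```

Higher moments follow from higher derivatives, and setting J = 0 at the end restores the original model.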
3.2. 1RSB analysis via Guerra's interpolation technique
As for the unsupervised case, the plan is to construct an interpolation between the original model and a simpler one-body model, whose statistical features are as close as possible to the original one, then solve the one body model and finally obtain the solution of the original model via the fundamental theorem of calculus. Thus we define the following interpolating partition function
Definition 7. Given the interpolating parameter , constants to be set a posteriori, and the i.i.d. standard Gaussian variables for and , the Guerra's 1-RSB interpolating partition function for the DAM model, trained by a teacher, is given by
where denotes the generalized Boltzmann factor.
As before, the introduction of the interpolating partition function gives rise to a generalized measure, average and interpolating quenched free energy (which we do not repeat here). All these generalizations recover the standard definitions when evaluated at t = 1. Following Guerra's method [43], we must average out the fields and recursively in the interpolating free energy resulting from the interpolating partition function (3.9), as already done in the previous section for the unsupervised case, see equations (2.26)–(2.28). Proceeding in the same way, and omitting details due to the similarity of the proofs in the supervised and unsupervised cases, we directly state the following result.
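Schematically, denoting by F(t) the interpolating quenched free energy built from (3.9), the method rests on the sum rule given by the fundamental theorem of calculus:

```latex
F(1) \;=\; F(0) \;+\; \int_0^1 \mathrm{d}t\;\partial_t F(t),
```

where F(1) is the free energy of the original model, F(0) that of the exactly solvable one-body model, and the constants in the interpolation are chosen so that the t-derivative is computable under the 1RSB assumption.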
Proposition 2. In the thermodynamic limit , within the 1RSB assumption and under assumption 2, the quenched free energy of the supervised DAM model reads
with , fulfilling the following self-consistency equations
where
and . Furthermore, as , we have
The proof of this proposition is lengthy, but the steps are identical to those provided for the unsupervised case.
3.3. 1RSB analysis via Parisi's replica trick
As stated in the previous section, the core of this approach consists in writing the quenched logarithm of the partition function as $\mathbb{E}\ln\mathcal{Z}=\lim_{n\to 0}\frac{1}{n}\ln\mathbb{E}\,\mathcal{Z}^{n}$, where
Proceeding as in the unsupervised case, we compute separately the signal term and the noise term. The former, accounting for the examples pertaining to the first archetype, reads as
while the latter accounting for the examples pertaining to all the other archetypes but the first one, can be written as
where now denotes the Gaussian average w.r.t. λ. Performing the integration over λ and inserting the definitions of the order parameters, as done for the unsupervised setting, we can write
where
and . Assuming (as usual within the replica method) that the limits $N\to\infty$ and $n\to 0$ can be interchanged, the integrals can be performed by steepest descent. This leads to
where , are solutions of the saddle point equations
which provide a set of self-consistency equations for Q, N, P and Z. Using the last two equations to eliminate P and Z from the description, we finally obtain
To make progress, we need to find the form of Q and N in the limit n → 0. Again, we use the 1RSB ansatz for the two-replica overlap provided in (2.63), while we assume that is self-averaging, i.e. . Proceeding analogously to the unsupervised case, after taking the limit n → 0 we find the same expressions for the quenched free energy (see (3.10)) and the order parameters (see (3.11)) previously obtained using Guerra's interpolation technique.
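For reference, the 1RSB overlap ansatz of (2.63) is assumed here to take the standard Parisi form: the n replicas are divided into n/θ groups of size θ, and the two-replica overlap takes only two values (schematically, with the notation below being our own),

```latex
q_{ab} \;=\;
\begin{cases}
\bar q_2, & a\neq b \ \text{in the same group},\\
\bar q_1, & a,\, b \ \text{in different groups},
\end{cases}
\qquad 0\leqslant \bar q_1 \leqslant \bar q_2 \leqslant 1,
```

with the block size θ ∈ [0,1] playing the role of the Parisi parameter after the continuation n → 0.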
3.4. Limiting cases: large dataset and ground state
We now state two corollaries concerning the large-dataset limit and the ground-state limit. We omit their proofs, since these are trivial variations of those provided for the unsupervised case (cf. corollaries 1 and 2).
Corollary 3. In the large dataset limit , the 1RSB self-consistency equations for the order parameters of the supervised DAM model can be expressed as
where the expression of g reads as
Moreover, if we use the truncated expression for , namely
we get the simplified expression of (3.26) that is
where .
Corollary 4. The ground-state (i.e. ) self-consistency equations for the order parameters of the theory in the large dataset limit (i.e. ) and under the 1RSB assumption are
where and
By numerically solving the self-consistency equations (3.25), we obtain the phase diagram shown in figure 5, for different values of P and of the dataset entropy ρ, as indicated in the legend.
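The numerical procedure is not spelled out in the text; a common and minimal approach is damped fixed-point iteration, with Gaussian averages evaluated by Gauss–Hermite quadrature. The sketch below applies this scheme to a toy single-equation stand-in (a mean-field magnetisation in weak Gaussian noise), not to the actual coupled system (3.25), whose explicit form is not reproduced in this excerpt.

```python
import numpy as np

# Gauss-Hermite quadrature for the probabilists' weight exp(-x^2/2);
# dividing the weights by sqrt(2*pi) turns sums into averages over N(0,1).
nodes, weights = np.polynomial.hermite_e.hermegauss(80)
weights = weights / np.sqrt(2.0 * np.pi)

def gaussian_avg(f):
    """E_z[f(z)] for z ~ N(0,1), evaluated by quadrature."""
    return float(np.sum(weights * f(nodes)))

def fixed_point(update, x0, damping=0.5, tol=1e-12, max_iter=5000):
    """Damped iteration x <- (1-d)*x + d*F(x), run until convergence."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_new = (1.0 - damping) * x + damping * np.asarray(update(x), dtype=float)
        if np.max(np.abs(x_new - x)) < tol:
            return x_new
        x = x_new
    return x

# Toy stand-in for the coupled order-parameter system: a single equation
# m = E_z[tanh(beta*(m + h0*z))] with inverse temperature beta and noise h0.
beta, h0 = 2.0, 0.1
m = fixed_point(lambda x: [gaussian_avg(lambda z: np.tanh(beta * (x[0] + h0 * z)))],
                x0=[0.9])[0]
```

For the full system one would iterate all order parameters jointly, scanning the storage load and temperature to trace the phase boundaries.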
In [38] it has been proven that the minimum number of examples required by the supervised DAM model to correctly learn and retrieve an archetype is given by
where γ ≠ 0. In figure 6 we plot the critical storage γC in the ground state versus the ratio . As previously observed for the unsupervised setting, replica symmetry breaking slightly increases the critical storage and does not alter the minimum number of examples required for learning.
4. Conclusions and outlooks
This manuscript analyses the equilibrium behaviour of DAMs trained with or without the supervision of a teacher, within a 1RSB assumption, thus extending previous analyses carried out at the RS level [37, 38]. The unsupervised and supervised settings differ in the choice of the couplings (which involve P nodes) and are given by
respectively, where are perturbed versions of the unknown archetypes . The network does not experience the archetypes directly; instead, it has to infer them from the supplied examples. For both settings, we obtained explicit expressions for the quenched free energy and derived full phase diagrams. In doing so, we proved a full equivalence, at the 1RSB level, between two different approaches, namely Guerra's telescopic interpolation [43] and Parisi's RSB theory [2]. In addition, we derived (in appendix A) the instability line of the RS theory, generalizing the de Almeida–Thouless approach.
The main differences brought about by the RSB description, with respect to the RS one, consist in the disappearance of the instability region within the retrieval zone of the phase diagram close to saturation (as standard in glassy statistical mechanics [4]) and in a slight improvement of the value of the critical storage. Importantly, the threshold for learning, in both the supervised and unsupervised settings, is not influenced by replica symmetry breaking, i.e. the minimum number of examples required to infer the archetypes is the same in the RS and 1RSB descriptions. Interestingly, the optimal value of Parisi's parameter θ, which controls the distribution of the overlaps in the 1RSB scenario, is not influenced by the dataset entropy and equals that of the classical Hopfield model. From the mathematical viewpoint, possible future developments include relaxing the constraint of a self-averaging Mattis magnetization and inspecting how the learning and retrieval properties of these networks change with different kinds of noise: in this work we have focused on multiplicative noise, but the use of additive noise has lately gained large popularity in generative models for machine learning (see for instance [56]). Furthermore, within the framework of multiplicative noise, an interesting outlook would be to consider corruptions of archetypes that also contain blank (in addition to inverted) entries, as done recently in [57] for networks away from saturation. Their operation in the saturated regime and the effects of replica symmetry breaking have not yet been investigated: we plan to report soon on these topics.
Acknowledgments
All the authors acknowledge the stimulating research environment provided by the Alan Turing Institute's Theory and Methods Challenge Fortnights event Physics-informed Machine Learning. Albanese acknowledges Ermenegildo Zegna Founder's Scholarship, UMI (Unione Matematica Italiana), INdAM—GNFM Project (CUP E53C22001930001) and PRIN grant Stochastic Methods for Complex Systems N. 2017JFFHS for financial support and King's College London for kind hospitality. Alessandrelli acknowledges INdAM (Istituto Nazionale d'Alta Matematica) and Unisalento for support via PhD-AI. Barra's research is supported by MAECI via the BULBUL grant (Italy-Israel collaboration), Project N. F85F21006230001 Brain-inspired Ultra-fast and Ultra-sharp machines for assisted healthcare and by MUR via the PRIN-2022 grant Statistical Mechanics of Learning Machines: from algorithmic and information-theoretical limits to new biologically inspired paradigms, Project N. 20229T9EAT that are gratefully acknowledged.
Data availability statement
No new data were created or analysed in this study.
Appendix A: Instability of the RS solution: AT lines
In the main text we have analysed the equilibrium behaviour of supervised and unsupervised DAM models under the assumption of replica symmetry breaking. While such a phenomenon is expected in these models, a formal proof that the replica symmetric theory becomes unstable in certain ranges of the control parameters has not been provided in the literature. In this section we provide such a proof and derive the critical line of the RS instability in the phase diagram, for the unsupervised and the supervised DAM models separately. We will use the method recently introduced in [39], which provides a simple alternative to the method originally introduced by de Almeida and Thouless in [58], as it does not require computing the so-called replicon (i.e. the smallest eigenvalue of the spectrum) of the Hessian of the quadratic fluctuations of the free energy around its RS value, nor does it rely on the availability of an 'ansatz-free' expression for the free energy.
A.1. AT line for DAM in unsupervised setting
Following the method introduced in [39], we aim to determine the region in the phase diagram where the quenched free energy evaluated within the 1RSB approximation, that from now on we denote for convenience as , is smaller than the free energy evaluated within the RS assumption, , in the limit θ → 1, where the transition from RS to RSB is expected to occur.
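Schematically, the comparison of [39] rests on the first-order expansion of the 1RSB free energy around θ = 1:

```latex
f^{1\mathrm{RSB}}(\theta) \;=\; f^{\mathrm{RS}}
\;+\;(\theta-1)\,\partial_\theta f^{1\mathrm{RSB}}\big|_{\theta=1}
\;+\;\mathcal{O}\!\big((\theta-1)^2\big),
```

so the RS solution is unstable wherever the linear term drives the 1RSB free energy below the RS one for θ slightly below 1.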
We start by recalling the expression for , given in (2.35), with , and determined from the self-consistency equations (2.37), and by providing the expression for the quenched free energy within the RS assumption, as derived in [37]
where , , and and fulfill the following self-consistency equations
We note that for θ = 1, , and the 1RSB expression for the quenched free energy reduces to the RS one. Now, we expand the 1RSB expression for the quenched free energy , as given in (2.35), around θ = 1, using
Since and are determined through the self-consistency equations (2.37), they depend on θ as well, so we have to expand (2.37) around θ = 1 too. Following [39] closely, we obtain
where is the solution of (A.3), is the solution of the following self-consistency equation
where
and the functions and are given by
respectively. Similarly, we can expand as
where
In the limit θ → 1, equation (A.4) implies that when we have that . Next, we evaluate
In order to determine the sign of the expression above, it is useful to note that , as the last term of (A.8) vanishes, so the last two addends in (A.13) cancel each other. This is expected, as is an extremum of the RS free energy, which is recovered for θ = 1. Next, we study as a function of and locate its extrema. These are found from
as
where the last equality follows from algebraic manipulations of hyperbolic functions.
Given that vanishes for , if the extremum is global in the domain considered, we must have that if is a maximum and if is a minimum. Evaluating
we have that if the expression in the square brackets is positive, namely if the parameter satisfies
and , hence the RS theory is unstable. We plot the line (A.17) in figure 7, for different values of M, in the parameter space , where , together with the critical lines delimiting the retrieval region (top plots) and the spin-glass region (bottom plots), within the RS and RSB theories, respectively. We note that the above recovers the expression for the RS instability line found in DAM models [39] in the limits r → 0 or , where ρP and ρ vanish.
A.2. AT line for DAM in supervised setting
In this section we derive the RS instability line for the supervised DAM model. We start by providing the expression for the quenched free energy within the RS assumption, as derived in [14] for the standard Dense Hopfield Model.
with and where and satisfy the self-consistency equations
We also recall that the quenched free energy within the 1RSB approximation, , is given in (3.10), where the order parameters , and satisfy the set of self-consistency equations provided in (3.11). We note that for θ = 1, , and the 1RSB expression for the quenched free energy reduces to the RS one. Now, we expand, to leading order in , the 1RSB quenched free energy around its RS expression, as shown in (A.4). Since the self-consistency equations also depend on θ, we need to expand them too. We can write as in (A.5), with given in (A.9), and as given in (A.6), where is the solution of (A.7) and is as given in (A.10) (we recall that the expression of is now as given in (3.12)). With these expressions in hand, we can compute the derivative of w.r.t. θ at θ = 1, as needed in (A.4)
Again, we have that (as is an extremum of the RS free energy). Next, we inspect the sign of . To this purpose, we study for and locate its extrema, which are found from
as
where the last equality follows from (A.7). Under the assumption that the extremum is global in the domain considered, we have that if is a maximum and if it is a minimum. In particular, if
is positive, and . This happens when the expression in the curly brackets of the equation above is negative, i.e. when the parameter satisfies the inequality
We note that (A.24) is functionally identical to the expression found in [39] for the RS instability line of the standard DAM model, the only difference being encoded in the term , which is here defined differently, as it reflects the supervised protocol. In particular, in the limit of r → 0 or , where ρ and ρP vanish, (A.24) retrieves the RS instability line of standard DAM models, as obtained in [39].
Appendix B: Proofs
B.1. Proof of lemma 1
Defining the shorthand , we start by computing the derivative of equation (2.29) w.r.t. t
where and is defined in (2.34). Recalling the definition of in (2.23) we have:
where, in order to lighten the notation, we have set . Inserting this expression in (B.1) and using the definition (2.25) for the average of a single replica of the system over the generalised Boltzmann factor , we obtain
Next, denoting the combined average over the quenched disorder and the Boltzmann distribution as
and applying Stein's lemma (2.74) to the standard Gaussian variables and , we get
Finally, we use the definitions (2.32) and (2.33) and we arrive at the compact expression:
Next, we take the thermodynamic limit , using assumption 2. By manipulating the expression for the moments , with , using Newton's binomial theorem
we obtain, in the thermodynamic limit, under assumption 2,
for . Finally, we insert the above in (B.6) and we choose the constants in such a way that the terms dependent on cancel out,
This makes t-independent and proves the statement (2.40).
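For reference, the two elementary facts invoked in the proof above are Stein's lemma (Gaussian integration by parts, cited as (2.74)) and Newton's binomial theorem:

```latex
\mathbb{E}\big[z\,g(z)\big]\;=\;\mathbb{E}\big[g'(z)\big],
\quad z\sim\mathcal{N}(0,1),
\qquad\qquad
(a+b)^{P}\;=\;\sum_{k=0}^{P}\binom{P}{k}\,a^{k}\,b^{P-k}.
```

For a centered Gaussian of variance σ², the first identity picks up a factor σ² on the right-hand side.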
B.2. Proof of corollary 2
Let us start from the self-consistency equations in the limit introduced in corollary 1. We recognize that as , we have ; therefore, in order to perform the limit, we introduce the reparametrization
It is now useful to insert an additional term in the expression of in (2.76), which now reads as
Using this new parameter y, we can recast the equation for as a derivative of the magnetization
where we have used and, as , . Thus, in the zero-temperature limit the last three equations in (2.70) become
Now, if we suppose , (B.10) reduces to
where
Performing the integral over , we arrive at equations (2.77) and (2.78).
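Although the explicit integrals are not reproduced in this excerpt, the prototypical zero-temperature Gaussian integral behind results such as (2.77) and (2.78) is the standard identity (stated here for reference, with a and q generic):

```latex
\lim_{\beta\to\infty}\int \frac{\mathrm{d}z}{\sqrt{2\pi}}\,e^{-z^{2}/2}\,
\tanh\!\big(\beta(a+\sqrt{q}\,z)\big)
\;=\;\int\! Dz\;\operatorname{sgn}\!\big(a+\sqrt{q}\,z\big)
\;=\;\operatorname{erf}\!\left(\frac{a}{\sqrt{2q}}\right).
```

This is how the error functions typical of ground-state self-consistency equations arise from finite-temperature hyperbolic tangents.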
Footnotes
- 7
It was shown in [34] that the conditional entropy , which quantifies the amount of information needed to reconstruct the original pattern given the set of related examples , is a monotonically increasing function of ρ.