1 Introduction

Since the end of Dennard scaling, computer architects have shifted to exploiting thread-level parallelism by integrating multiple processing elements (PE) as a system on chip. In area- and energy-constrained embedded systems, PEs may communicate via a shared scratchpad tightly coupled memory (TCM) to increase area utilization. A drawback of this approach is the introduction of access conflicts. Since the access latency of TCMs is usually limited to a single clock cycle, on-line conflict resolution and arbitration logic may degrade the clock frequency, and thus the system performance. The degradation becomes more severe when different run-time priorities have to be considered, as is the case in mobile communication platforms [1], for example.

Access interval prediction (AIP) has been proposed to resolve TCM access conflicts in advance [2, 3]. Memory accesses of PEs are predicted in the time domain so that an off-line arbiter speculatively allocates memory access time slots to one of them and configures the access logic accordingly. Logic on the path between the PEs and the TCM is kept to a minimum while priorities can be accounted for.

In an AIP system, mispredictions lead to misallocations. Therefore, a high prediction accuracy is crucial. Existing AIP schemes fall short: the access look-ahead buffer (ALB) [2] achieves average hit rates of only 50 % (cf. Sect. 6), while the maximum a-posteriori (MAP) estimator [3] has not been designed with hardware implementation in mind.

To find better-performing predictors, we turn to branch prediction for inspiration because it is a well-researched problem that is similar to AIP. In the former, the outcomes of conditional branches are predicted and instructions are executed speculatively based on the prediction. A plethora of high-precision branch predictors have been proposed [4]. Of these, the TAGE branch predictor (TAGE-BP) [5], which approximates prediction by partial matching (PPM) [6], is widely considered to be state of the art [4]. Fundamentally, both work by recording branch outcome sequences of multiple history lengths.

In this paper, we leverage the maturity of both prediction methods for AIP. We adapt them to work on sequences of interval lengths and evaluate their performance on access traces of MiBench [7] on a RISC-V core.

The rest of the paper is structured as follows: Sect. 2 provides an overview of the background and related work on memory sharing and arbitration. Afterwards, in Sect. 3, we review AIP and introduce the system model and the notation used throughout this paper. Section 4 explains PPM and its software implementation as a suffix tree. In Sect. 5, the TAGE-BP is adapted for AIP. We analyze the main modifications and show the trade-offs involved and make a comparison to other predictors in Sect. 6. We finish with conclusions in Sect. 7.

2 Background and Related Work

Shared memory systems are designed such that memory can be used by each master as seamlessly as if it had its own. A master is any entity that uses the shared memory, e.g., a central processing unit (CPU) or a dedicated accelerator. The memory system has to grant memory access to masters upon their request. As multi-ported static random access memory (SRAM) is prohibitively area-expensive, it also has to resolve conflicts when multiple masters request access concurrently.

Architectures such as the T-CREST [8] or MERASA [9] are able to arbitrate between masters with different priorities and ensure an upper bound for the transaction latency. However, their additional delay of several cycles renders them unsuitable for TCMs.

On-chip memory arbiters for TCMs [10–14] have instead been optimized for low latency, omitting additional prioritization logic. They achieve good average performance, but the performance of high-priority tasks is degraded.

The number of conflicts may be reduced by banking SRAM cells [2, 13, 14, 15]: If simultaneous requests target different memory banks they are not conflicting. This approach works especially well with compile-time information available.

AIP, on the other hand, is able to consider priorities utilizing only run-time information. High-priority tasks are sped up without degrading the clock frequency. Simple prediction schemes, such as the ALB [2], predict up to 50 % of accesses on data memory. The work in [3] showed that more sophisticated statistical methods may be able to reach accuracies over 90 %. Previous work [16] presented TAGE with a prediction accuracy over 97 % at moderate hardware requirements. In this work, we extend the investigation with PPM and improve the access identification in PPM and TAGE.

3 Access Interval Prediction

3.1 System Overview

Fig. 1
figure 1

AIP system model

Figure 1 depicts the example model of a system employing AIP that is used throughout the paper. Two masters, a high-priority master (HPM) and a low-priority master (LPM), share a TCM. Here, the HPM is the data port of a RISC-V core and the LPM is a data movement engine accessing the memory every cycle. When the HPM requests the memory, it should always be granted unhindered access by the arbiter.

To achieve this without elongating the path between the TCM and the masters, we add the AIP unit. It monitors the data memory interface of the HPM, which is made up of access and adr. access is set whenever an access (read or write) on the address adr takes place. In addition, the AIP unit intercepts the program counter pc of the core.

The AIP unit predicts the next access and sets pred whenever it determines that the HPM will request the data memory in the next cycle. If pred is set, the arbiter pre-allocates the memory to the HPM; otherwise, the memory is allocated to the LPM.

This simple model allows performance comparison of different prediction schemes. AIP can also be scaled to more complex priority hierarchies, since predictions can be made for multiple masters while the priority logic remains off-line.

3.2 Theory and Notation

Fig. 2
figure 2

Example waveform of access and prediction signals. At cycle \(k = K_{j+1}\), the AIP unit tries to predict the length of the interval \(I_{j+1}\) until the next access at \(K_{j+2}\). The prediction \(\hat{I}_{j+1}\) may be subject to the access history \(H_j\), the current address \(A_{j+1}\) and the current program counter \(P_{j+1}\). The dashed signals are not known at prediction time

We regard the program flow as a sequence of \(n_\textrm{access}\) access blocks. Each access block \(B_i = (A_i, I_i, P_i)\) contains the address \(A_i\) accessed at cycle \(K_i\), the length of the interval until the next access \(I_i = K_{i+1} - K_i\) and the value of the program counter at access time \(P_i = \texttt{pc}(K_i)\). The program stops with the last access at \(K_\textrm{end}\).

An excerpt of an example program is shown in Fig. 2. In this case, \(j+1\) access intervals were completed and can be summarized as history \(H_j = {(B_i)}_{i=0}^{j}\). After each access on an address \(A_i\), the predictor makes a prediction of the next interval \(\hat{I}_i\) based on \(H_{i-1}\), \(A_i\) and \(P_i\). Subsequently, the indicator signal pred is set at cycle \(k = K_i + \hat{I}_i - 1\).

If the actual cycle equals the predicted one, i.e., \(\texttt{access}(k) = \texttt{pred}(k-1) = 1\), an access has been predicted correctly. We denote by \(n_\textrm{hit}\) the number of correctly predicted accesses, by \(P_\textrm{hit}= \frac{n_\textrm{hit}}{n_\textrm{access}}\) the hit rate, and by \(P_\textrm{miss}= 1 - P_\textrm{hit}\) the miss rate.

When \(\texttt{pred}(k-1) = 1 \ne \texttt{access}(k)\), a false positive prediction occurs. The memory is mistakenly reserved for the HPM and memory utilization is impaired. The interval utilization, the percentage of idle cycles where the memory is not blocked because of false predictions, is determined from the number of false positives \(n_\textrm{fp}\) by

$$\begin{aligned} U_\textrm{I}= 1 - \frac{n_\textrm{fp}}{K_\textrm{end}+ 1 - n_\textrm{access}}. \end{aligned}$$
(1)
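These metrics can be computed directly from per-cycle traces of access and pred. The following minimal Python sketch illustrates the hit-rate definition and Eq. (1); the representation of the signals as 0/1 lists is our own choice for exposition, not part of the evaluated framework:

```python
def aip_metrics(access, pred):
    """Compute hit rate, miss rate, and interval utilization from
    per-cycle 0/1 traces of the access and pred signals."""
    n_access = sum(access)
    k_end = max(k for k, a in enumerate(access) if a)  # cycle of last access
    n_hit = n_fp = 0
    for k in range(1, k_end + 1):
        if pred[k - 1] and access[k]:
            n_hit += 1   # access(k) = pred(k-1) = 1: correctly predicted
        elif pred[k - 1]:
            n_fp += 1    # false positive: memory reserved in vain
    p_hit = n_hit / n_access
    u_i = 1 - n_fp / (k_end + 1 - n_access)  # Eq. (1)
    return p_hit, 1 - p_hit, u_i
```

A trace with accesses at cycles 0, 3, and 5 and predictions one cycle ahead of the latter two yields \(P_\textrm{hit}= 2/3\) with no false positives, hence \(U_\textrm{I}= 1\).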

4 Prediction by Partial Matching

Fig. 3
figure 3

PPM suffix tree for the program in (2) with a limit of \(N=3\)

Prediction by partial matching (PPM) originated in text compression [6]. The idea is to track, for each subsequence with a length \(n \le N\) in the history \(H_i\), the frequency of follow-up events. PPM then identifies the longest subsequence of \(H_{i-1}\) which is a suffix of \(H_i\). The prediction is given by the most frequent successor of the matching subsequence.

A suffix tree is a suitable data structure for implementation. An example for the history

$$\begin{aligned} H_i = (a,b,c,d,a,b,d,a,b,c,d,b,c,d,b) \end{aligned}$$
(2)

is depicted in Fig. 3. Every node in this tree with a value \(\sigma \) and a prediction \(\hat{\sigma }\) (cf. Fig. 3a) is the starting point of a subsequence of \(H_{i-1}\). To find the node associated with a sequence, one has to traverse the tree following the history sequence elements backwards. The top left node in Fig. 3b belongs to the sequence (a, b, d), for example. If a sequence is not present in \(H_{i-1}\) or longer than N, one will find its longest present suffix instead. The prediction attribute of the node yields the prediction for the found sequence.

For example, let the current history be as in (2) with the limit \(N = 3\). Applying the suffix tree traversal to our example \(H_i\) returns the node of the sequence \(s = (c,d,b)\) with length \(n = 3\) as the longest suffix of \(H_i\). Where it occurred in \(H_{i-1}\) (once in this case), it was most often succeeded by c. Therefore, c is predicted.
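The traversal and prediction logic can be sketched compactly in Python. The class below is an illustrative model, not the implementation evaluated later: it stores successor frequencies in a context table, which produces the same predictions as the full (unreduced) suffix tree without modeling its node structure.

```python
from collections import Counter, defaultdict

class PPMPredictor:
    """Order-N prediction by partial matching over a symbol sequence."""

    def __init__(self, n_max):
        self.n_max = n_max
        self.counts = defaultdict(Counter)  # context tuple -> successor counts
        self.history = []

    def update(self, symbol):
        """Record `symbol` as successor of every history suffix up to n_max."""
        h = self.history
        for n in range(min(self.n_max, len(h)) + 1):
            ctx = tuple(h[len(h) - n:])
            self.counts[ctx][symbol] += 1
        h.append(symbol)

    def predict(self):
        """Most frequent successor of the longest recorded history suffix."""
        h = self.history
        for n in range(min(self.n_max, len(h)), -1, -1):
            ctx = tuple(h[len(h) - n:])
            if ctx in self.counts:
                return self.counts[ctx].most_common(1)[0][0]
        return None
```

Feeding the history from (2) into this model reproduces the worked example: the longest matching suffix is (c, d, b) and the prediction is c.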

The removal of all redundant nodes, i.e., those that have a parent node with the same \(\hat{\sigma }\), renders a reduced tree as in Fig. 3c. Some nodes are only deactivated if they produce redundant predictions but have non-redundant children, such as the node representing (a, b). We define the tree size as the number of active nodes, as these are the ones for which predictions have to be stored. The tree size shrinks after reduction and deactivation while the predictions stay the same.

Table 1 Access trace of a simple example program
Fig. 4
figure 4

PPM suffix tree for the program in Table 1 and \(N = 3\)

We adapt PPM to fit AIP in a similar fashion as prior work has adapted PPM for branch prediction [17, 18]. At each access, we construct a modified history dubbed the observation window

$$\begin{aligned} C_i(N) = (I_{i-N}, I_{i-N+1}, \ldots , I_{i-1}, a_i) \end{aligned}$$
(3)

out of the last N intervals and an identifier \(a_{i}\) for the current access. Either the access address \(A_{i}\) or the program counter \(P_{i}\) or a combination of both \((A_{i}, P_{i})\) can serve as the access identifier. The observation window is used to traverse the suffix tree and to update it once \(I_{i}\) is known. While other schemes utilizing more past access identifiers are conceivable, [17] demonstrated that usage of only one address does not impair the prediction accuracy considerably, at least for branch prediction.
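As a concrete illustration of (3), the observation window can be assembled from the recorded interval lengths and the chosen access identifier. The helper below is hypothetical, written only for exposition:

```python
def observation_window(intervals, identifier, n):
    """C_i(N): the last N interval lengths followed by the current
    access identifier a_i (address, program counter, or both)."""
    assert len(intervals) >= n, "history must contain at least N intervals"
    return tuple(intervals[-n:]) + (identifier,)
```

With the interval history (3, 4, 1, 3, 4) and a program counter of 0xA0 as identifier, `observation_window([3, 4, 1, 3, 4], 0xA0, 3)` yields (1, 3, 4, 0xA0).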

Table 1 and Fig. 4 exemplify a program and the resulting full suffix tree for \(N = 3\). The first layer contains all occurred accesses and the child nodes contain the preceding intervals. When we want to predict \(\hat{I}_{13}\) with the suffix tree in Fig. 4, we land on the top left node belonging to the sequence \((3,4,a^{a})\) and obtain \(\hat{I} = 1\).

5 TAGE Access Interval Predictor

A detailed description and analysis of the original TAGE-BP can be found in [5]. TAGE is a hardware approximation of PPM: instead of matching sequences of any length, only a few lengths are selected following a geometric series, and instead of iterating through the different history lengths, a hash map is used for each selected length.

5.1 Design

Fig. 5
figure 5

TAGE-AIP with four components, each of which provides a subprediction based on the information indexed with the access identifier and the last intervals. For each subprediction, a tag check is performed and among the matches the one with the longest interval history is selected as the final prediction

The TAGE-AIP’s basic structure, as shown in Fig. 5, is similar to the TAGE-BP. It is made out of \(n_\textrm{T}\) components, of which every component t utilizes a different history length

$$\begin{aligned} L(t) = {\left\{ \begin{array}{ll} \lceil \alpha ^{t-1}L(1) - 0.5\rceil & \text {if } 1 \le t < n_\textrm{T}\\ 0 & \text {if } t = 0 \end{array}\right. } \end{aligned}$$
(4)

The spacing \(\alpha \) and the first length L(1) are chosen at design-time. For example, \(\alpha = 2.88\) and \(L(1) = 1\) lead to \(L(t) = (0, 1, 3, 8, 24, \ldots )\).
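For reference, the series in (4) can be generated with a few lines of Python (a sketch; the function name is ours):

```python
import math

def history_lengths(n_t, alpha, l1):
    """History lengths per Eq. (4): L(0) = 0 for the base component,
    L(t) = ceil(alpha**(t-1) * L(1) - 0.5) for the tagged components."""
    return [0] + [math.ceil(alpha ** (t - 1) * l1 - 0.5)
                  for t in range(1, n_t)]
```

`history_lengths(5, 2.88, 1)` reproduces the example series [0, 1, 3, 8, 24].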

The base table with \(n_\textrm{I}\) entries provides the fallback prediction when no tag matches. It employs only the address and no past intervals. Each interval \(I_i\) is stored at the address generated with a simple hash \(h(a_i)\) (e.g., bit slicing) of the access identifier. When the access \(a_j = a_i\) (or another for which \(h(a_j) = h(a_i)\)) occurs again, the base table emits \(\hat{I}_j(0) = I_i\).

The other tables consist of a memory bank with \(n_\textrm{I}\) entries from which the prediction information \(\iota _i(t)\), a tag and the useful counter u of bit-widths \(w_\mathrm {\iota }\), \(w_\textrm{t}\) and \(w_\textrm{u}\), respectively, are read. The subprediction \(\hat{I}_i(t)\) of width \(w_\textrm{I}\) is computed from \(\iota _i(t)\) by means of the subpredictor \(\Pi \). In contrast to the TAGE-BP, we consider a variety of subpredictors (cf. Sect. 5.3).

The table is addressed with a \(\log _2 n_\textrm{I}\)-bit wide hash \(h_\textrm{a}\) of the observation window \(C_i(L(t))\).

A different hash function \(h_\textrm{t} \ne h_\textrm{a}\) produces a tag from the same input \(C_i(L(t))\) to detect hash collisions of \(h_\textrm{a}\). Both \(h_\textrm{a}\) and \(h_\textrm{t}\) are based on circular shift registers similar to [17]. Only subpredictions for which the tag matches are considered. Of these, the component with the highest L(t) provides the final prediction \(\hat{I}_i\). The useful counter u is used as in [5] to keep track of entries that can be overwritten because they provide the same prediction as another entry with a shorter context. This is akin to the removal or deactivation of redundant nodes in the PPM suffix tree detailed in Sect. 4.

\(\hat{I}_i\) is finally compared with a cycle counter which is reset at every access. The predictor sets pred when the next cycle is \(\hat{I}_i\) cycles after the last access.
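The prediction path can be summarized in a behavioral Python model. This sketch omits the useful counters and the allocation policy of the real design and substitutes Python's built-in hash for the circular-shift hashes \(h_\textrm{a}\) and \(h_\textrm{t}\); it only illustrates how tag matching and the longest-history-wins selection interact:

```python
class TageAipModel:
    """Behavioral sketch of the TAGE-AIP lookup with a simplified update."""

    def __init__(self, n_entries, lengths):
        self.n = n_entries
        self.lengths = lengths                       # lengths[0] = 0: base table
        self.base = [0] * n_entries                  # tagless fallback intervals
        self.tables = [dict() for _ in lengths[1:]]  # index -> (tag, interval)

    def _ctx(self, ident, intervals, L):
        # observation window C_i(L(t)): identifier plus the last L intervals
        return (ident,) + tuple(intervals[-L:]) if L else (ident,)

    def _idx(self, ident, intervals, L):
        return hash(self._ctx(ident, intervals, L)) % self.n

    def _tag(self, ident, intervals, L):
        return hash(("tag",) + self._ctx(ident, intervals, L)) & 0xFF

    def predict(self, ident, intervals):
        pred = self.base[self._idx(ident, intervals, 0)]
        for table, L in zip(self.tables, self.lengths[1:]):
            entry = table.get(self._idx(ident, intervals, L))
            if entry and entry[0] == self._tag(ident, intervals, L):
                pred = entry[1]   # longer matching history overrides
        return pred

    def update(self, ident, intervals, actual):
        # Simplified: write the observed interval into every component.
        self.base[self._idx(ident, intervals, 0)] = actual
        for table, L in zip(self.tables, self.lengths[1:]):
            i = self._idx(ident, intervals, L)
            table[i] = (self._tag(ident, intervals, L), actual)
```

After training on a context, a lookup with the same identifier and interval history returns the stored interval through the component with the longest matching history.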

5.2 Interval Representation

Fig. 6
figure 6

Probability mass function of recorded access intervals

In the case of branch prediction, the event to be predicted, the branch outcome, has only two possible values, so a single bit suffices for its representation. More bits need to be allocated to represent intervals. In theory, the set of possible intervals \(\mathbb {N}^+\) is infinite. However, large intervals occur rarely, as can be seen in Fig. 6, which shows the probability P(I) of different intervals during the execution of selected MiBench programs. For example, only 6 % of intervals exceed \(I = 16\), and only 0.1 % exceed \(I = 32\).

When \(w_\textrm{I}\) bits are used, intervals up to length \(I_\textrm{max}= 2^{w_\textrm{I}}\) are encoded correctly, whereas overflowing intervals \(I > I_\textrm{max}\) are aliased to \(I \bmod I_\textrm{max}\). On the other hand, a larger \(w_\textrm{I}\) generally leads to a higher \(w_\mathrm {\iota }\) and thus imposes higher memory requirements. The storage requirement in bits for prediction information, tags and useful counters is

$$\begin{aligned} M = n_\textrm{T}n_\textrm{I}w_\mathrm {\iota }+ (n_\textrm{T}-1)n_\textrm{I}(w_\textrm{u}+ w_\textrm{t}). \end{aligned}$$
(5)
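Eq. (5) is straightforward to evaluate when exploring configurations. A short sketch follows; the parameter values in the usage example are illustrative, not the values of Table 3:

```python
def tage_memory_bits(n_t, n_i, w_iota, w_u, w_t):
    """Storage of Eq. (5): prediction information in all n_t components,
    tags and useful counters only in the n_t - 1 tagged components."""
    return n_t * n_i * w_iota + (n_t - 1) * n_i * (w_u + w_t)
```

For instance, a hypothetical configuration with \(n_\textrm{T}= 5\), \(n_\textrm{I}= 256\), \(w_\mathrm {\iota }= 8\), \(w_\textrm{u}= 2\), and \(w_\textrm{t}= 10\) needs `tage_memory_bits(5, 256, 8, 2, 10)` = 22528 bits, i.e., 2.75 kB.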

5.3 Subprediction and Update Method

Once an entry is allocated in a table at the address determined by \(h_\textrm{a}(C_k(L(t)))\), it is possible to save arbitrary information \(\iota (t)\) about the outcomes of the corresponding context. Based on this information, a subprediction \(\hat{I}_i(t) = \Pi (\iota (t))\) is computed whenever the current context matches \(C_i(L(t)) = C_k(L(t))\). When the subprediction is used as the final prediction, the entry is subsequently updated.

The TAGE-BP stores a saturating counter that is incremented when the branch is taken and decremented otherwise. This method effectively tracks which of the two possible outcomes happened the most often and selects it.

An equivalent for AIP is the maximum-likelihood (ML) subprediction method, which is similar to [3]. It can be realized by storing a counter for each possible interval and selecting the interval with the highest counter. The associated memory complexity is on the order of \(\mathcal {O}(2^{w_\textrm{I}})\). With the help of hashes and truncation, the memory requirement might be reduced at the cost of degraded performance.

Instead, we evaluate four approaches with linear memory complexity:

  1. AVG, where the prediction is the moving average of the last \(n=2\) outcomes of \(C_i(L(t))\).

  2. EXP1, where exponential smoothing is used with a smoothing factor \(\beta = 0.5\). The new prediction is the average of the last prediction and the actual interval \(\hat{I}_i(t):= \lfloor \beta I_i + (1-\beta )\hat{I}_i(t)\rfloor \).

  3. EXP2, which is similar to EXP1 but with a smoothing factor of \(\beta = 0.75\). The emphasis on newer intervals is therefore higher than in EXP1.

  4. KEEP, where a prediction is kept for the entire existence of an entry until it is overwritten. If a new entry is allocated for \(C_l(L(t))\) with the outcome \(I_l\), then the table predicts \(\hat{I}_i = I_l\) whenever \(h_\textrm{a}(C_i(L(t))) = h_\textrm{a}(C_l(L(t)))\).
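The smoothing update used by EXP1 and EXP2 reduces to shifts and adds in hardware for \(\beta \in \{0.5, 0.75\}\). A minimal sketch of the update rule (the function name is ours):

```python
def exp_update(pred, actual, beta):
    """Exponential-smoothing subpredictor update:
    new prediction = floor(beta * actual + (1 - beta) * old prediction).
    beta = 0.5 corresponds to EXP1, beta = 0.75 to EXP2."""
    return int(beta * actual + (1 - beta) * pred)
```

For example, with an old prediction of 4 and an observed interval of 8, EXP1 yields 6 while EXP2 yields 7, reflecting EXP2's stronger emphasis on the newer interval.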

6 Performance Evaluation

Table 2 Overview of the access traces
Fig. 7
figure 7

Predictor evaluation setup

We executed selected programs of the MiBench [7] embedded benchmark suite on the simple RISC-V processor model from gem5 [19] (TimingSimpleCPU) and traced access, adr, and pc during execution. MiBench offers a “small” and a “large” test case for each program. We usually opted for the “small” variant, but in some cases chose the “large” variant or our own reduced version of “small”, dubbed “tiny”, to keep the number of accesses of each trace in roughly the same order of magnitude. Table 2 documents the selected programs, variants, access counts, and program lengths. Technical limitations prevented the use of the entire MiBench suite.

For performance evaluation, we set up a simulation framework as in Fig. 7. A master simulator sets access, adr, and pc according to the traces. These signals are fed into the predictor-under-evaluation. A tracker compares access and pred and records \(n_\textrm{hit}\) and \(n_\textrm{fp}\).

6.1 Prediction by Partial Matching

Fig. 8
figure 8

Misprediction rate of PPM with \(N = 50\) and either the address, the program counter, or the combination of both as access identifiers

The performance of PPM with \(N = 50\) for different programs and for different access identifiers (cf. Sect. 4) is shown in Fig. 8. Identifying an access by its program counter rather than its address results in more accurate predictions for most programs. This is especially true for data-sensitive programs like the sorting algorithm “qsort”, where the access address often depends on the data content. As a result, address-based prediction performs poorly with these kinds of programs. On the other hand, it works slightly better for data-oblivious programs such as the hashing algorithm “sha”. The combination of both fares worse than just the program counter even though more information is employed. This suggests that the program counter generalizes well. On average, program-counter-based PPM achieves a hit rate of \(P_\textrm{hit}= 99\ \%\).

While PPM is not feasible for a hardware implementation in typical embedded systems, the results may serve as a de facto limit of prediction accuracy. TAGE approximates PPM with practical adjustments and seems thus like a promising candidate for a highly-accurate access interval predictor. When designing the TAGE predictor, one needs to minimize the degradation caused by the adjustments.

6.2 Tagged Geometric History Length

Fig. 9
figure 9

Misprediction rate of TAGE-AIP with either the address, the program counter, or the combination of both as access identifiers. The remaining parameters are the same as in Table 3

Fig. 10
figure 10

Performance impact of different \(w_\textrm{I}\) and subprediction methods. All other predictor parameters are as listed in Table 3

It is observable in Fig. 9 that TAGE-AIP’s performance is slightly worse than that of PPM. Programs that PPM scores better on are also more accurately predicted by TAGE-AIP, with “susan” as a notable exception. Overall, \(P_\textrm{hit}\) decreases by 2 % from PPM to TAGE-AIP.

Access Identification

Just like PPM, TAGE prediction benefits from program-counter-based access identification. The average hit rate is around 2 % higher when compared to \(a_i = A_i\) and around 3 % higher when compared to \(a_i = (A_i, P_i)\).

Interval Representation

Figure 10 illustrates the average miss rate and interval utilization for different \(w_\textrm{I}\) values, as well as the associated memory requirements. If \(w_\textrm{I}\) is too small, too many collisions occur and the predictor is unable to properly record and recall interval histories. Prediction performance, in terms of \(P_\textrm{miss}\) and \(U_\textrm{I}\), increases with \(w_\textrm{I}\) until a saturation point is hit at \(w_\textrm{I}= 6\) as almost none of the recorded traces exhibit intervals \(I > 2^6 = 64\).
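The aliasing that degrades small \(w_\textrm{I}\) can be made concrete with one plausible encoding of intervals (a sketch for exposition; the exact encoding of the evaluated design is not restated here):

```python
def stored_interval(interval, w_i):
    """One plausible w_i-bit encoding: intervals 1..2**w_i are kept exactly,
    while longer intervals alias back onto that range (cf. Sect. 5.2)."""
    i_max = 2 ** w_i
    return (interval - 1) % i_max + 1
```

With \(w_\textrm{I}= 4\), the intervals 3 and 19 become indistinguishable, which is exactly the kind of collision that disappears once \(w_\textrm{I}= 6\) covers almost all recorded intervals.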

Subprediction and Update Method

There is a noticeable trade-off between prediction accuracy and memory requirements. The method with the lowest \(P_\textrm{miss}\), ML, needs the most memory. AVG requires less memory but prediction accuracy is slightly impaired. EXP2 is slightly behind AVG but with a reduced memory footprint. Finally, EXP1 and KEEP are the most inaccurate. \(U_\textrm{I}\) is nearly the same for all methods.

We choose \(w_\textrm{I}= 6\) and EXP2 as this combination yields the lowest product \(MP_\textrm{miss}\). Other parameters were determined as in Table 3 by random search within the specified ranges. More sophisticated design space exploration (DSE) for the other parameters is outside the scope of this paper. The selected configuration achieves \(P_\textrm{hit}= 97.0\ \%\) and \(U_\textrm{I}= 99.5\ \%\) with \(M = 9.75\) kB of memory.
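The selection criterion above can be stated as a one-line minimization over candidate configurations (a sketch; the dictionary keys are our own naming):

```python
def best_config(configs):
    """Pick the configuration minimizing the product M * P_miss."""
    return min(configs, key=lambda c: c["M"] * c["p_miss"])
```

For two hypothetical candidates with (M, P_miss) of (10, 0.05) and (4, 0.1), the second wins since 0.4 < 0.5, favoring the cheaper table despite its higher miss rate.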

Table 3 Parameters of the TAGE access interval predictor
Table 4 Comparison of different access interval predictors

6.3 Performance Comparison

A comparison between different access interval predictors is drawn in Table 4. We re-implemented the ALB and the simple counter-based estimator (CBE) from [2] and stimulated them with the same traces as PPM and TAGE-AIP. TAGE-AIP predicts more accurately than both the ALB and the CBE, the latter of which does not rely on SRAM blocks. The improved hit rate comes at the price of higher memory complexity. Therefore, TAGE may be used where a high prediction accuracy is more important than a low predictor overhead. PPM is the most accurate of all but not feasible for hardware implementation. Our software implementation written in Python takes up to 5 GB of random-access memory (RAM) for the suffix tree. For reference, the TAGE software implementation consumes around 17 MB of RAM.

7 Conclusion and Outlook

In this paper, we derived access interval predictors from the state-of-the-art branch prediction methods PPM and TAGE-BP. We explored the adjustments that are needed to predict intervals instead of branch outcomes. Our first subject of interest was PPM as a theoretical precursor of the TAGE-BP. It was modified to work on sequences of access intervals and identifiers instead of branch outcomes and addresses. PPM was capable of predicting 99 % of access intervals. The results indicate that, for the purpose of AIP, an access is better identified by its program counter than by its address as in prior AIP research [2, 3, 16].

Starting from PPM, the TAGE-AIP makes several compromises to lower the implementation complexity to a reasonable level at the cost of a slightly degraded hit rate. For example, intervals need to be represented with a finite number of bits. In addition, the number of bits is higher than in branch prediction, where one bit suffices to encode all events. This raises memory requirements but does not hinder accurate predictions. In our case, 5–6 bits are enough for interval representation as intervals exceeding 32 or 64 cycles are rare.

Furthermore, different methods of storing interval information and computing a prediction from it were discussed. Keeping track of the frequency of each possible outcome, as is done in branch prediction by means of saturating up-down counters, is too memory-expensive in the case of AIP. Therefore, we investigated alternatives, of which a moving average of the last outcomes and exponential smoothing proved to be efficient.

The resulting TAGE-AIP reaches a hit rate of over 97 % while being implementable with moderate hardware requirements of 9.75 kB of SRAM. It offers numerous parameters which may be tuned to achieve the optimal trade-off between accuracy and cost required by the AIP use case at hand. Prediction for computing platforms not covered in this paper, e.g., ones based on x86, ARM, or Xtensa, may be better served by different TAGE-AIP configurations.

Finally, we made a comparison to other access interval predictors found in literature. Our proposed predictor achieves higher prediction accuracy than any other implementable predictor at the price of higher memory complexity.

It should be noted that while the TAGE-BP is known to be precise, it also has a high update cost. Therefore, further research will address the register-transfer-level design of the proposed TAGE-AIP and investigate the timing overhead. Another open topic is the performance of adaptations of other popular branch predictors, such as gshare, GEHL, or Perceptron [4].