1 Introduction

Metaheuristics are often the primary choice for hard optimization problems [1]. This kind of algorithm is used to solve complex problems when solutions that are good enough must be found within a limited execution time. In such scenarios, exact methods would simply take too long to execute.

Swarm Intelligence (SI) algorithms are important representatives of population-based metaheuristics. SI algorithms are inspired by the behavior of collections of individuals found in nature, such as a flock of birds, an ant colony, or a fish school. The key principle in SI is the simulation of many particles with simple individual behavior which, together, give rise to the emergence of solutions through ad hoc topologies and communication strategies. This principle can be observed in the most diverse SI algorithms, such as Particle Swarm Optimization (PSO) [2], Artificial Bee Colony (ABC) [3] and Ant Colony Optimization (ACO) [4]. Although each of these algorithms aims to solve different types of problems, they share many characteristics, e.g., movement operators, communication patterns and fitness evaluations.

The collective behavior is one of the main reasons why SI algorithms are able to achieve good results, but not without costs. For problems with a high number of dimensions and computationally expensive fitness functions, the execution might take a long time even when SI algorithms are used. Parallel implementations of these algorithms become necessary in such cases. Nowadays, many parallelization possibilities arise with the easy access to high performance hardware such as multi-core CPUs, graphics processing units (GPUs) and cluster architectures. Different approaches and strategies have been introduced, all aiming to speed up SI algorithms and solve very complex problems within a feasible amount of time.

Parallelization frameworks such as OpenMP [5], MPI [6] and CUDA [7] are often used by programmers to support the implementation of parallel SI algorithms for the most diverse hardware. Even so, the development of such applications can be complex and error prone. Besides the natural complexity of the algorithm, programmers must also be aware of problems inherent to parallel programs (e.g., race conditions, deadlocks, data transfers and hardware limitations). The generation of high performance parallel code requires some expertise from programmers, even when using such frameworks.

In this complex scenario, high-level parallelization frameworks such as Muesli [8] and MALLBA [9] were created to facilitate the development of parallel applications. By using such tools, the programmer benefits from pre-defined operations (namely skeletons) that can be used to build up an implementation. These skeletons represent commonly used parallelization patterns such as map and reduce. Parallelization aspects are hidden from the programmer, who only has to take care of the algorithmic aspects.

The parallelization of SI algorithms is not an easy task. Although having many independent individuals definitely makes SI algorithms suitable for parallelization, the communication steps require a great deal of attention, since all particles have to stop their individual movements and start the communication coherently. This is especially true for parallel implementations, where this phase can cause high costs, for example, when communication happens between several devices. Unnecessary communication and data transfers can create an overhead that penalizes the execution of the algorithm and consumes additional time.

The parallelization of such algorithms is nowadays supported by the high computational power available in modern hardware, such as multi-core CPUs or GPUs. The high processing power enables the deployment of many parallel threads (or even thousands in the case of GPUs), while it typically does not make sense to employ more than a few hundred particles in an SI algorithm [10]. Thus, a simple approach mapping each particle to a thread will not be able to exploit the full power of the hardware. Rather, a nested parallelization is required in order to achieve this.

Considering all this, the goal of this paper is to inspect the relevant literature and summarize the trending approaches regarding the parallelization of selected prominent SI algorithms, namely PSO, ABC and ACO. Papers aiming at both CPU and GPU implementations will be considered in this work. By extracting information from the relevant papers of recent years, we want to identify the parallelization tools used and the parallelization approaches taken. The objective is that, based on this study, we can guide future research to explore the identified gaps and further improve parallel approaches to swarm intelligence algorithms.

We assume the readers have some familiarity with hardware for parallel computing such as clusters [11] of multi-core processors [12] equipped with possibly several accelerators such as GPUs [13]. We also assume some familiarity with frameworks for parallel programming such as MPI [6] for clusters, OpenMP [5] for multi-core processors as well as CUDA [7] and OpenCL [14] for GPUs. Message passing frameworks such as MPI allow the different computing nodes of a cluster to asynchronously run independent tasks in parallel, which communicate via messages. Multi-core processors can run several threads in parallel, which communicate via a shared memory. Finally, GPUs typically allow running thousands of threads in an SIMD (single instruction multiple data) style, where the same instruction is executed on different data stored in a separate GPU memory. In order to use a GPU, data have to be exchanged between main (CPU) memory and GPU memory.

This document is structured as follows. Section 2 discusses related work. Section 3 introduces swarm intelligence algorithms. Section 4 presents the protocol used in the literature review. The main findings of this review are discussed in Sect. 5. Finally, the conclusion of this work, together with some outlook, is presented in Sect. 6.

2 Related Work

Tan’s work [10] is a survey of parallel implementations of Swarm Intelligence algorithms, but only GPU-based ones. It also includes other algorithms such as Genetic Algorithms (GA) and Differential Evolution (DE). While our focus is on the parallelization approach, Tan also devoted some attention to evaluation criteria, such as performance metrics.

Lalwani et al. presented a survey about parallel swarm optimization algorithms [15]. On the one hand, this survey covers CPU- and GPU-based implementations using well-known frameworks such as OpenMP, MPI and R. On the other hand, it is limited to the PSO algorithm.

Kroemer et al. also presented a survey on parallel PSO implementations [16]. Their survey is limited to GPU implementations, while this work intends to identify aspects regarding the parallelization on different hardware.

3 Swarm Intelligence Algorithms

Before we come to the parallelization aspects, let us summarize the considered SI approaches, namely, PSO, ABC and ACO. Considering the size of the SI family, the selection of these approaches was based on specific characteristics that influence parallelization.

The first difference regarding these algorithms is the kind of problems they intend to solve. While PSO and ABC were originally designed to solve continuous optimization problems, ACO was designed to solve combinatorial optimization problems, such as the Traveling Salesman Problem (TSP) [17]. Figure 1 illustrates both types of problems. The main difference is that, while a PSO particle may assume any position inside the continuous range established by the model (in this example, the Ackley function), a TSP solution built by an ant in ACO has to include the exact nodes available in the graph in order to be feasible.

Fig. 1 Continuous vs. discrete problems: the Ackley function and a TSP instance

Besides that, there are other aspects of the algorithms that can impact the parallelization process. For example, while the particles’ movements are guided by two opposing terms (namely cognitive and social) in PSO, bees have different roles that determine which function they should perform in ABC. Moreover, ACO differs from the other algorithms in the way it memorizes success. Instead of storing it in the ants themselves, pheromones are deposited in the environment, representing a shared memory of the colony’s previous successes.

By understanding these three algorithms and their differences, we cover aspects and parallelization scenarios that, conceptually, are also present in many other SI approaches, which allows the discussion to be extended to them.

3.1 Particle Swarm Optimization

Particle Swarm Optimization (PSO) is one of the most famous population-based metaheuristics. It was first introduced by Kennedy and Eberhart back in 1995 [2]. The inspiration comes from the behavior of a flock of birds. PSO simulates all birds of a flock where each particle, representing a bird, is allowed to move inside the search space aiming to find the optimal location. For each particle, its coordinates in the search space determine a candidate solution for the considered optimization problem. PSO incorporates the notion of speed (as a metaphor of success) and each particle is moving in the search space at its own pace and direction. Updates in the particles’ movements are performed considering two factors (see Eq. 1): own success (cognitive term) and global success (social term), expressed by the second and third summand in the equation, respectively.

$$\begin{aligned} v_{t+1} = \omega v_t + c_1 r_1 (b_t - x_t) + c_2 r_2 (g_t - x_t) \end{aligned}$$
(1)

where \(\omega \) is the inertia factor, \(c_1\) and \(c_2\) are the weights of the cognitive and social term, respectively, and \(r_1\) and \(r_2\) are random numbers drawn uniformly from [0, 1]. Moreover, \(b_t\), \(g_t\) and \(x_t\) are the local best position (found by the considered bird up to time t), the global best position (found by the whole flock up to time t) and the position of the considered bird at time t, respectively. A wrong setup of \(\omega \), \(c_1\) and \(c_2\) might lead to an early convergence of the algorithm or make the particles get stuck in local minima.

After updating the speed, each particle will update its position according to Eq. 2, where \(v_{t+1}\) is the speed at time \(t+1\), while \(x_t\) and \(x_{t+1}\) are the positions of the particle at time t and \(t+1\), respectively.

$$\begin{aligned} x_{t+1} = x_{t} + v_{t+1} \end{aligned}$$
(2)

After performing the speed and position updates, the fitness of each particle at its current position is calculated. The current local best fitness is then compared to the new value and possibly updated. Moreover, the current global best position will be updated, if a particle has obtained a better fitness than the best known one. After these updates, the mentioned steps are repeated, until a termination criterion is fulfilled. A fixed number of iterations is often used as termination criterion. In some cases, the stagnation of the fitness is used instead, which means that the iterations will stop when the swarm stops improving after a given number of unsuccessful attempts.

Determining the right swarm size and number of iterations is crucial for the success of PSO. As particles are initialized in random positions, having more of them means that diversity is increased and a larger area of the search space is covered. Also, the social factor, represented by the exchange of coordinates of the currently best solution, is important in PSO, since it directs the search into a promising direction. The more iterations an algorithm performs, the more steps are followed in this promising direction and the higher are the chances of finding an optimal (or at least very good) position. PSO can be applied to problems of different magnitudes and, just by adjusting these two parameters, it might be able to find a good solution.
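To make the two update equations concrete, the following minimal C++ sketch performs one PSO iteration over a swarm. The struct and function names (Particle, sphere, pso_step) are illustrative assumptions and not taken from any surveyed implementation; the sphere function merely stands in for an arbitrary fitness function.

```cpp
#include <vector>
#include <random>

// Minimal sketch of one PSO iteration for a swarm of particles in D dimensions.
struct Particle {
    std::vector<double> x, v, best;   // position, velocity, personal best position
    double bestFit;
};

double sphere(const std::vector<double>& x) {         // example fitness function
    double s = 0.0;
    for (double xi : x) s += xi * xi;
    return s;
}

void pso_step(std::vector<Particle>& swarm, std::vector<double>& gBest, double& gBestFit,
              double omega, double c1, double c2, std::mt19937& rng) {
    std::uniform_real_distribution<double> U(0.0, 1.0);
    for (auto& p : swarm) {
        for (size_t d = 0; d < p.x.size(); ++d) {
            double r1 = U(rng), r2 = U(rng);
            // Eq. (1): inertia + cognitive term + social term
            p.v[d] = omega * p.v[d] + c1 * r1 * (p.best[d] - p.x[d])
                                    + c2 * r2 * (gBest[d]  - p.x[d]);
            p.x[d] += p.v[d];                          // Eq. (2)
        }
        double f = sphere(p.x);                        // fitness evaluation
        if (f < p.bestFit) { p.bestFit = f; p.best = p.x; }    // update personal best
        if (f < gBestFit)  { gBestFit = f; gBest = p.x; }      // update global best
    }
}
```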

3.2 Artificial Bee Colony

The artificial bee colony (ABC) algorithm is inspired by the behavior of honey bee colonies and their ability to find food sources [3]. The algorithm is characterized by a division of labor, and the tasks are divided among bees of three different types: employed, onlooker and scout bees. In ABC, the number of food sources must be determined and, for each food source, there will be an employed bee assigned to it. Each employed bee will determine a new food source in the surroundings of the old food source in its memory. These new spots are shared within the hive, and onlooker bees select one of these food sources and search for a better place in the surroundings of the selected source.

A food source that has not improved for a given number of iterations will be abandoned. The employed bee responsible for this food source becomes a scout bee and starts exploring the search space for a new food source. Scout bees move freely through the search space without any guidance.

Initially, each employed bee is randomly placed in the search space and then evaluates the quality of the current food source/position. An employed bee is allowed to look for a better food source in the surroundings of its currently established position. Once per iteration, it chooses a random direction and evaluates the quality of the reached point. If the quality of the solution improves, this point becomes the new food source, replacing the old one.

Onlooker bees are associated with employed bees. Equation 3 shows the probability \(P_i\) that an onlooker bee is associated with the food source which is currently being explored by employed bee i. Food sources of higher quality are more likely to be chosen by onlooker bees. Each onlooker bee is allowed to search in the surroundings of its chosen food source. If it finds a better food source, it also replaces the respective food source it was assigned to.

$$\begin{aligned} P_i = \frac{F(\theta _i)}{\sum ^{S}_{k=1} F(\theta _k)} \end{aligned}$$
(3)

where \(F(\theta _i)\) is the fitness of the food source i and S is the total number of food sources.
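As an illustration, the selection according to Eq. 3 can be implemented as a simple roulette-wheel choice. The following C++ sketch is an assumption of how this step may look; select_food_source is a hypothetical helper, and the fitness values are assumed to be positive, with higher values indicating better food sources.

```cpp
#include <vector>
#include <random>

// Roulette-wheel selection of a food source by an onlooker bee according to Eq. (3).
// fitness[i] corresponds to F(theta_i); all values are assumed to be positive.
int select_food_source(const std::vector<double>& fitness, std::mt19937& rng) {
    double total = 0.0;
    for (double f : fitness) total += f;               // denominator of Eq. (3)
    std::uniform_real_distribution<double> U(0.0, total);
    double r = U(rng), acc = 0.0;
    for (size_t i = 0; i < fitness.size(); ++i) {
        acc += fitness[i];                              // P_i proportional to F(theta_i)
        if (r <= acc) return static_cast<int>(i);
    }
    return static_cast<int>(fitness.size()) - 1;        // fallback for rounding errors
}
```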

If a food source does not improve during a certain period, scout bees are activated and a new food source is assigned automatically. This behavior sets ABC apart from PSO. While in ABC a stagnation causes a switch from exploitation to exploration, PSO gradually changes its behavior from exploration to exploitation by continuously reducing the step size during the execution. Normally, PSO instances have larger step sizes at the beginning of the execution in order to explore the search space and, as the execution proceeds, the step size gets smaller, allowing exploitation.

3.3 Ant Colony Optimization

The ant colony optimization (ACO) algorithm mimics the behavior of ants in their search for food. While searching, ants leave a trail of pheromone that will attract other ants. In this process, the shortest path found is the one with the most pheromone, since ants will pass it more often while pheromone on other trails is likely to evaporate. Proposed by Dorigo [4], the Ant System algorithm (AS), an implementation of ACO, was first employed to solve the shortest path problem.

When applied to problems such as the traveling salesperson problem (TSP), ACO repeatedly executes two steps. The first one is the tour construction, where each ant constructs a full round trip exploring the pheromone on the edges between the nodes of the considered graph. Starting at the initial node, each ant always determines its next step to a neighbor node in a probabilistic way. Edges with a higher amount of pheromone have higher chances of being chosen. Therefore, at each step, the probability of following a certain edge has to be re-calculated for all possible edges (Eq. 4).

$$\begin{aligned} P^k_{ij} = \frac{ \tau _{ij} \eta _{ij} }{\sum _{l\in N}\tau _{il} \eta _{il}} \end{aligned}$$
(4)

where \(P^k_{ij}\) is the probability of ant k moving from node i to node j, \(\tau _{ij}\) represents the amount of pheromone between i and j, and \(\eta _{ij}\) is equal to \(1/d_{ij}\), where \(d_{ij}\) is the distance between i and j. \(\tau _{il}\) and \(\eta _{il}\) are defined analogously for the edge between nodes i and l. N is the set of not yet visited neighbors of node i.
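A single construction step according to Eq. 4 can be sketched as a weighted random choice over the unvisited neighbors. The C++ sketch below is illustrative only; choose_next_city and the matrix layout are our own assumptions, and, as in Eq. 4, the exponents commonly applied to the pheromone and heuristic terms are omitted.

```cpp
#include <vector>
#include <random>

// One construction step of ant k following Eq. (4): choose the next city j among the
// not-yet-visited neighbors of city i with probability proportional to tau[i][j] * (1/d[i][j]).
int choose_next_city(int i,
                     const std::vector<std::vector<double>>& tau,   // pheromone matrix
                     const std::vector<std::vector<double>>& dist,  // distance matrix
                     const std::vector<bool>& visited,
                     std::mt19937& rng) {
    std::vector<int> candidates;
    std::vector<double> weight;
    double total = 0.0;
    for (int j = 0; j < static_cast<int>(dist.size()); ++j) {
        if (j == i || visited[j]) continue;
        double w = tau[i][j] * (1.0 / dist[i][j]);      // numerator of Eq. (4)
        candidates.push_back(j);
        weight.push_back(w);
        total += w;                                      // denominator of Eq. (4)
    }
    if (candidates.empty()) return -1;                   // tour is complete
    std::uniform_real_distribution<double> U(0.0, total);
    double r = U(rng), acc = 0.0;
    for (size_t c = 0; c < candidates.size(); ++c) {
        acc += weight[c];
        if (r <= acc) return candidates[c];
    }
    return candidates.back();                            // fallback for rounding errors
}
```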

After each ant has finished constructing a complete round trip, it is time to update the pheromone. The edges on shorter trips receive more pheromone compared to those on longer ones. Also, the evaporation of pheromone must be considered, and edges that were not used will suffer a reduction of pheromone. After this phase, the iteration is concluded and, if the stopping criterion is not reached, the ants start creating new paths again.

In contrast to PSO, the success in ACO is not represented by the particles’ positions, but by pheromone levels on the edges. Thus, ACO is quite suitable for a parallelization, since the path construction is an independent task performed by each ant and the pheromone in the environment resembles shared memory. ACO also includes demanding tasks such as the calculation of probabilities for the next step and fitness calculations that can be parallelized.

4 Literature Review Protocol

This literature review includes recent approaches and concepts for the parallelization of prominent SI metaheuristics. In this section we present the protocol containing all steps and activities used to (1) collect the adequate papers that tackle the parallelization of swarm intelligence algorithms and (2) extract the relevant information.

With the main objective and research question defined, the next step was to define the keywords that should be used to build the search strings for each scientific base and also to assess the relevance of all gathered papers later on. For the parallelization theme, the keywords chosen are related to the main frameworks used nowadays for parallelization, namely, CUDA, OpenMP and MPI. The same rationale was used for the SI theme, where we focused on the prominent SI algorithms (i.e. PSO, ABC, FSS [18] and ACO). In addition to that, relevant terms such as parallel and high-performance were also used.

Besides keywords, inclusion and exclusion criteria were also defined in order to exclude papers that are not relevant for this work, such as papers that do not focus on the parallelization or papers that do not present concrete results.

The main scientific databases selected for this work were Scopus, Web of Science and Science Direct. Volume and relevance for Computer Science were the main criteria used in this selection. The keywords mentioned above were used to construct equivalent search strings for each scientific database.

The corresponding queries resulted in a list of initially more than 300 papers. After removing duplicates, we could start reading the papers, discard the ones that did not fulfill the inclusion criteria, and carefully extract the relevant information from the remaining ones. For this work, we decided to include only full papers written in English. Papers to be included must also have some focus on the parallel implementation and present concrete experimental results. Papers that did not meet these criteria were excluded from this work.

The remaining 87 papers were then organized in groups according to the SI algorithm used, so that they could be compared more easily. The following subsections discuss these papers according to the group they belong to. The subsections give a first, coarse-grained overview of the approaches. A detailed discussion will follow in the next section.

4.1 Continuous Optimization Findings

4.1.1 PSO Findings

Among the papers that aimed to solve continuous optimization problems, PSO was the SI algorithm that appeared most frequently. Of the 87 relevant papers found, 57 used classical PSO or an enhanced version of it. An overview of these works, including information about the parallelization frameworks used and the hardware environment, is displayed in Tables 1 and 2. For both tables, the columns under Hardware list the maximum number of GPUs and CPUs used by each work. Furthermore, the number of nodes is also displayed, so that works that aimed at a distributed memory environment can be identified. The parallelization tools used by each work are listed under Frameworks. High-level frameworks are explicitly named under the HL column, while works that used other tools such as MATLAB are marked under the Other column. For clarity, the PSO works were divided between two tables: Table 1 lists works that did not use GPUs, while Table 2 lists works that did.

Table 1 PSO Hardware and Framework - CPUs
Table 2 PSO Hardware and Framework - GPUs

Tables 3 and 4 display details about the parallelization strategies used by the authors. For example, Garcia-Nieto et al. [19] implemented parallel swarms with asynchronous communication between them. They also used a topology to link the swarms so that they could exchange particles. In the last column, the top speedup achieved over all experiments is displayed. This value can give an insight into what can be achieved when using a certain approach and tackling a specific problem. However, comparing speedups between different works might not be meaningful.

Table 3 PSO Parallelization Strategies - CPUs
Table 4 PSO Parallelization Strategies - GPUs

The papers listed in these tables show that there are many possibilities to explore regarding the parallelization of PSO, ranging from a simple single-swarm parallel implementation to more complex multi-swarm, multi-objective ones. Regarding the hardware, approaches aiming only at CPUs are just as present as the ones using GPUs, raising points to be discussed later in this work.

4.1.2 Other Algorithms

Nine of the 87 papers used algorithms for continuous optimization other than PSO. The Artificial Bee Colony algorithm was found in eight works ([20,21,22,23,24,25]), while the Fish School Search algorithm was used in [26]. Tables 5 and 6 display the information about these papers.

Table 5 Continuous Optimization - Hardware and Framework
Table 6 Continuous optimization - parallelization strategies

4.2 Discrete Optimization Findings

Regarding discrete optimization, ACO is the best known and most used SI algorithm. The vanilla version of ACO and its variants (e.g., Ant System) were among the algorithms used in the gathered works and are therefore listed together here. Tables 7 and 8 display the properties of the papers gathered in this review regarding ACO. In total, 21 papers describe the parallelization of the ACO algorithm.

Concerning Table 8, it is important to mention that the Probabilities column refers to works that also parallelize the calculation of the probabilities of taking a certain edge in the next movement, for example by assigning one thread to each calculation.

Table 7 ACO hardware and framework
Table 8 ACO Parallelization Strategies

5 Discussion

In this section, we will discuss the main aspects extracted from the papers listed previously. Algorithms mainly used for continuous optimization problems will be addressed in one subsection, and a second subsection will address discrete optimization problems. In order to structure the discussion, the different topics will be analyzed separately (as also done by Tan [10]).

5.1 Continuous Optimization

Most papers included in this literature review deal with continuous optimization problems. In these papers, algorithms such as Particle Swarm Optimization, Artificial Bee Colony and Fish School Search are used to solve a wide range of problems, from traditional benchmark functions to real-world optimization problems. Since these algorithms target the same type of problems, they share, for example, parallelization strategies and other aspects that are explained in the following.

5.1.1 Parallelization of Basic Operations

The execution of SI algorithms in general includes movement operators and fitness function calculations. Normally, movement operators are composed of simple operations which do not generate substantial computational costs and are hence not worth parallelizing. The fitness functions, in contrast, can get quite complex, and their evaluations often require a large part of the computational time. For example, in [23], the authors state that, in a sequential implementation of the ABC algorithm, 85% of the execution time consists of fitness evaluations. Therefore, a speedup can be achieved just by parallelizing this step of the execution. Singh et al. use this strategy and state that it is possible to achieve a 10x speedup when comparing their implementation (where only the fitness evaluation is performed in a CUDA kernel) to a fully sequential implementation [20, 27]. Unfortunately, the authors do not comment on the overhead generated by repeatedly transferring data between the CPU and the GPU. Nevertheless, the achieved speedup is impressive.
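To illustrate the structure of such a fitness-only offload (including the recurring CPU-GPU transfers whose overhead is mentioned above), the following CUDA sketch evaluates one particle per thread. It is a hedged illustration under our own assumptions (sphere function as fitness, evaluate_fitness and evaluate_on_gpu as hypothetical names), not the implementation of Singh et al.

```cuda
#include <cuda_runtime.h>

// One thread evaluates the fitness of one particle; positions are copied to the GPU
// and the fitness values are copied back in every iteration.
__global__ void evaluate_fitness(const double* pos, double* fit, int nParticles, int dim) {
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= nParticles) return;
    double s = 0.0;
    for (int d = 0; d < dim; ++d) {                    // sequential loop over dimensions
        double x = pos[p * dim + d];
        s += x * x;                                    // example: sphere function
    }
    fit[p] = s;
}

void evaluate_on_gpu(const double* hPos, double* hFit, int nParticles, int dim) {
    double *dPos, *dFit;
    cudaMalloc(&dPos, nParticles * dim * sizeof(double));
    cudaMalloc(&dFit, nParticles * sizeof(double));
    // Host-to-device transfer of all positions (a source of per-iteration overhead).
    cudaMemcpy(dPos, hPos, nParticles * dim * sizeof(double), cudaMemcpyHostToDevice);
    int threads = 256, blocks = (nParticles + threads - 1) / threads;
    evaluate_fitness<<<blocks, threads>>>(dPos, dFit, nParticles, dim);
    // Device-to-host transfer of the computed fitness values.
    cudaMemcpy(hFit, dFit, nParticles * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(dPos); cudaFree(dFit);
}
```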

5.1.2 Particle-level Parallelization

The straightforward approach to implementing SI metaheuristics in parallel is to employ a parallelization on the particle level, since the particles are rather independent in most of the algorithms. The authors of the large majority of the considered papers used this approach, as displayed in Tables 3, 4 and 6. This strategy is easy to implement and is capable of considerably reducing the execution time of the algorithm [25, 28,29,30,31,32,33,34,35,36,37,38,39,40].

When aiming at multi-core CPUs, OpenMP provides tools for easily implementing SI algorithms that are parallelized on the particle level. These implementations also profit from features present in modern chips, such as hyper-threading and other optimizations that allow the execution of several threads simultaneously on a single core. As SI algorithms repeatedly iterate over their particles using loops, directives such as parallel for can be applied, dividing the workload of those loops among all available cores. Consequently, speedups are proportional to and limited by the number of available cores (up to hyper-threading).
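A minimal sketch of this idea is given below, assuming a simple sphere benchmark as fitness function and the hypothetical helper evaluate_swarm; the critical section protecting the global best is one of several possible ways to avoid a race condition.

```cpp
#include <omp.h>
#include <vector>
#include <limits>

// Particle-level parallelization with OpenMP: the fitness evaluations of one
// iteration are split among the available CPU cores by a parallel for, and the
// global best is reduced inside a critical section.
double fitness(const std::vector<double>& x) {
    double s = 0.0;
    for (double xi : x) s += xi * xi;                  // example: sphere function
    return s;
}

void evaluate_swarm(const std::vector<std::vector<double>>& positions,
                    std::vector<double>& fit, double& gBestFit, int& gBestIdx) {
    gBestFit = std::numeric_limits<double>::max();
    gBestIdx = -1;
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < static_cast<int>(positions.size()); ++i) {
        fit[i] = fitness(positions[i]);                // independent work per particle
        #pragma omp critical                           // protect the shared global best
        {
            if (fit[i] < gBestFit) { gBestFit = fit[i]; gBestIdx = i; }
        }
    }
}
```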

This approach can be easily implemented and contributes to speedups for most applications. However, it must be taken into account that it might not deliver the best performance in more complex scenarios, for example when a large swarm size is needed or when the computations performed by each particle are too expensive. As each core processes its workload sequentially and there are no other resources available, performance improvements are limited by the number of CPU cores.

In order to exploit distributed memory architectures, this approach can be extended using MPI. Li et al. [41] presented an MPI-OpenMP-based multi-objective PSO. Although the work is distributed among several nodes, all particles belong to the same swarm. Internal divisions, named subspecies, are used to tackle the subproblems, which also reduces communication.

The same parallelization can be applied when aiming at GPUs. With a high number of cores available, GPUs have an interesting architecture for running SI algorithms, where the same instruction is applied in SIMD style to multiple data. A very simple implementation using this approach was presented by Dong et al. In their implementation, each particle was represented by one CUDA thread and, for their problem, just one CUDA block was enough to host the whole swarm [34]. Very good performance was achieved this way by using the memory available locally and the shared memory that belongs to the CUDA block.

Memory handling is normally a concern when developing for GPUs and using frameworks such as CUDA, especially when tackling bigger problems. Furthermore, there are other issues that the programmer has to consider in order to achieve high performance. One of them concerns CUDA's architecture features, such as the compute capability of the targeted GPU. This capability determines how many threads can be spawned simultaneously and which resource limitations apply to the threads and blocks in use. Exceeding these limitations generates overhead.

Another issue when using CUDA to implement such algorithms is the memory hierarchy. In order to optimize the execution of the algorithm, it is important to make good use of the memory available locally to each CUDA thread. The problem is that, although very fast to access, this per-thread memory (essentially the registers) is very small. If the computations of the particles are simple enough, they can benefit from it. Otherwise, other approaches have to be used, such as the use of shared memory and global memory.

Shared memory is used specifically if the stored data is to be shared among the threads that belong to the same CUDA block. It is not as fast as the per-thread memory, but it offers more space. In CUDA, the use of global memory is recommended to store data structures that are large and/or need to be kept during the entire execution of the program (e.g., the position array of all particles in PSO). It is important to note that moving up the memory hierarchy to these larger but slower memories means that more time is needed to access and transfer data. Therefore, some overhead must also be considered.

If there are as many particles as cores on the GPU, a particle-level parallelization can be very efficient. If there are fewer particles, some cores will run idle during the execution. Unfortunately, on modern GPUs, the particle-level parallelization of algorithms like PSO and FSS typically uses only a small fraction of all cores. Thus, a parallelization only on the particle level will be insufficient to fully exploit the hardware.

5.1.3 Fine-Grained Parallelization

With the possibility of spawning many more threads on modern GPUs, some authors used a parallelization on a lower level of the algorithm. Instead of (only) running particles in parallel, the internal operations of each particle run in parallel. In particular for high-dimensional optimization problems, it is worthwhile to consider all dimensions of a particle in parallel. This approach can be seen in [42,43,44,45,46].

The finer degree of parallelization mentioned above provides better occupancy of the GPU cores, since more threads can be spawned. Furthermore, this approach benefits from executing much shorter routines (so-called kernels) compared to a coarse-grained implementation which runs loops at the particle level. More threads and shorter routines are a good combination to deal with branching problems in CUDA. Note that branching of a computation on a GPU leads to a sequentialization of the branches due to the SIMD style of the computation.

The implementation of a fine-grained parallel algorithm is not as simple as a parallelization on the particle level. Some knowledge about the internal calculations of a particle is necessary, so that the workload can be divided into smaller shares. In the ideal scenario, these smaller shares can be processed independently, exactly as happens in PSO's speed update phase (Eq. 1), where the speed for each dimension is calculated independently. When this is not the case and threads must share information during a certain step, synchronization and data transfers become necessary. These operations generate overhead and therefore must be used with care, aiming to minimize the time spent on them. Furthermore, it is important to consider that not all operations can be parallelized at a finer degree. For example, some fitness functions are composed of just one simple term, which would not be worth parallelizing. In such cases, the parallelization still occurs on the particle level.
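As an illustration of a dimension-level parallelization of Eq. 1, the following CUDA sketch spawns one thread per (particle, dimension) pair. It is a hedged example under our own assumptions (flattened arrays, cuRAND states initialized elsewhere, the hypothetical kernel name update_velocity), not the code of the cited works.

```cuda
#include <curand_kernel.h>

// Fine-grained PSO speed update: one thread handles one dimension of one particle,
// so nParticles * dim threads are spawned instead of one thread per particle.
// The curand states are assumed to have been initialized in a separate kernel.
__global__ void update_velocity(double* v, const double* x, const double* pBest,
                                const double* gBest, curandState* states,
                                int nParticles, int dim,
                                double omega, double c1, double c2) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // global index over all dimensions
    if (idx >= nParticles * dim) return;
    int d = idx % dim;                                 // dimension handled by this thread
    double r1 = curand_uniform_double(&states[idx]);
    double r2 = curand_uniform_double(&states[idx]);
    // Eq. (1), computed independently for each dimension
    v[idx] = omega * v[idx] + c1 * r1 * (pBest[idx] - x[idx])
                            + c2 * r2 * (gBest[d]   - x[idx]);
}
```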

Another problem with this approach is regarding the number of threads spawned and the limitations imposed by the hardware. In CUDA, the number of threads per block is limited, meaning that using this approach can be problematic if the problem has too many dimensions.

Nevertheless, the fine-grained parallelization approach yields better performance in GPU environments than in CPU environments. The high number of cores present in a GPU and the way they are organized make it ideal for the execution of simple instructions over a great quantity of data, which is exactly the case here. The same approach aiming at a CPU environment would not profit as much, because it is restrained by the limited number of cores present in such hardware. Even though modern CPUs have cores with more advanced features, GPUs compensate in numbers, being well suited for this kind of job.

5.1.4 Parallel and Cooperative Swarms

Up to this point, all parallelization strategies mentioned lead to smaller execution times compared to a sequential implementation of the considered metaheuristic. Moreover, in both environments, GPUs and CPUs, the performance of the metaheuristic in terms of the quality of the computed solution of the considered optimization problem is equal or similar. However, the amount of computational power available on modern high performance computers allows the use of even more complex and advanced parallelization approaches, aiming at a satisfactory execution time but also at a better solution in the end.

One way to profit from such powerful hardware is to run multiple swarms simultaneously. This kind of implementation can be seen in [24, 47,48,49,50,51,52,53,54,55,56]. SI algorithms have a non-deterministic behavior; therefore, each run typically produces different results. Each instance of the algorithm starts with its particles placed in different positions of the search space, and this initial placement of particles is an important factor with direct influence on the success of the algorithm. Running several instances of the same algorithm in parallel increases diversity and, at the same time, the chances of achieving a better fitness.

Distributed memory environments make it possible to run multiple instances of SI algorithms as separate swarms in parallel without increasing the run time, which would not be the case in a sequential approach. In this scenario, communication and cooperation strategies between the swarms have been introduced in order to improve the quality of the obtained solution. Such approaches were presented in [19, 44, 48, 51, 57,58,59,60,61,62]. These works also show that cooperative parallel swarms achieve better fitness results than parallel independent swarms due to the exchange of information.

There are many ways to exchange information. The straightforward way is to share the best global position found so far with all swarms. This, however, reduces the diversity. A more complex approach is to introduce a topology connecting the swarms, which determines how they share information with neighbor swarms. This helps to avoid that all swarms are guided to the same point and converge prematurely. It also enhances the diversity among the swarms, since the shared information differs between them.

Some authors propose to transmit just the location of the best position found so far by the sending swarm. Other approaches perform an exchange of particles among swarms. Both approaches increase the capability of finding better positions in the search space. But there are no comparisons that could help determine which approach leads to better solutions.
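As an illustration of the first variant, combined with a topology so that not every swarm receives the same information, the following MPI sketch lets each process run one swarm and exchange its best position with its neighbors in a ring; exchange_best_ring is a hypothetical helper under our own assumptions, not taken from a surveyed paper.

```cpp
#include <mpi.h>
#include <vector>

// Cooperative parallel swarms with a ring topology: each MPI process runs its own
// swarm and, every few iterations, sends its best position to the next process in
// the ring while receiving the best position of the previous one. Both vectors are
// assumed to have the same size (the problem dimension).
void exchange_best_ring(std::vector<double>& myBest, std::vector<double>& neighborBest) {
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int next = (rank + 1) % size;                      // successor in the ring
    int prev = (rank - 1 + size) % size;               // predecessor in the ring
    MPI_Sendrecv(myBest.data(),       (int)myBest.size(),       MPI_DOUBLE, next, 0,
                 neighborBest.data(), (int)neighborBest.size(), MPI_DOUBLE, prev, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    // The receiving swarm can now compare neighborBest with its own best and adopt
    // it if it is better, which preserves more diversity than one globally shared best.
}
```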

Another approach is to run different SI algorithms in parallel. Such an approach has been investigated in [63]. The algorithms need not belong to the same family. As each optimization algorithm has different operators and communication patterns, such an approach can benefit from the strengths of each algorithm. Using this concept, more complex structures can be built. For example, in [63], the authors propose the idea of having multiple archipelagos. Each archipelago runs one algorithm, in this case PSO, a Genetic Algorithm (GA) and Simulated Annealing (SA). On an archipelago, each island runs an instance of the algorithm with different parameters. Islands communicate with each other using a given topology in an asynchronous way, and so do archipelagos.

It is important to choose suitable communication patterns when implementing such algorithms. In a distributed environment with several nodes, for example, data transfers have a huge impact on the execution time. An asynchronous communication strategy was proposed by Bourennani et al. in order to enable overlapping communications and computations [56]. In their proposed version of PSO, each particle uses its local best, the subswarm best and the best solution found so far among all subswarms to update its position. A master node is responsible for processing the global information. It works asynchronously, so that it does not prevent the subswarms from keeping their processes running.

Another important aspect concerns the efficiency of the algorithm and the occupancy of the processors. The programmer must keep in mind that different algorithms require different computation times for one iteration. Therefore, the time spent on a synchronization is actually determined by the slowest algorithm. For this reason, and also in order to reduce the communication overhead in general, it is wise to exchange information not in every step, but every n steps with \(n>1\) [63].

Also, several authors have observed that asynchronous communication is recommendable. With asynchronous communication, it does not matter if the algorithms are in different iterations. The exchange of information between them is always beneficial [64].

5.1.5 Distributed Memory Environments

Cooperative multi-swarm approaches are able to produce encouraging results. However, they require a considerable amount of hardware. The CUDA framework, for example, includes features that allow the development of applications that run on more than one GPU. Going even further, it is also possible to run an algorithm on several nodes of a cluster, each node having multiple GPUs [36].

When aiming at a CPU environment, two other low-level frameworks are often used, namely OpenMP and MPI. OpenMP consists of a set of compiler directives that control how parallelism is added to a program running on a multi-core CPU with shared memory. MPI, on the other hand, is a message passing API used for parallel and distributed computing. It is typically used on clusters of computing nodes, which mostly consist of multi-core CPUs (and possibly GPUs). These two frameworks can be combined. Examples of parallel SI algorithms using these frameworks can be found in [26, 49, 56, 64,65,66,67,68].

The master-worker model is often used in combination with these frameworks. When aiming at a single node shared memory environment, the programmer can create n threads using OpenMP from a master thread. Worker threads are responsible for updating the values of the particles, while the master thread is responsible for determining the globally best solution so far and providing it to all worker threads.

Additionally, the programmer can add MPI instructions and adapt the implementation to a distributed memory environment, as can be seen in [68]. Here, each node runs an MPI process so that the workload can be divided among the different nodes. The first approach is to divide the particles equally among the nodes. Additionally, OpenMP is used as mentioned above in order to take advantage of the parallelism inside each node. After executing the updates and calculating the new fitnesses, it is time to execute the communication steps. MPI offers the instruction MPI_Send to send information from the worker nodes to the central processor and MPI_Bcast for broadcasting global information from the central processor to the workers. In order to reduce the communication, each node calculates its local best and sends only these values, instead of sending all data of all particles it is responsible for.
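The following MPI sketch illustrates this exchange pattern: workers send only their local best, the master selects the global best and broadcasts it back. It is a hedged illustration with the hypothetical helper share_global_best, not the implementation of [68]; in practice, a reduction such as MPI_Allreduce with MPI_MINLOC could replace the explicit sends.

```cpp
#include <mpi.h>
#include <vector>

// Master-worker exchange of the global best: every process contributes only its
// local best fitness and position; rank 0 picks the best and broadcasts it.
void share_global_best(double localBestFit, const std::vector<double>& localBestPos,
                       double& globalBestFit, std::vector<double>& globalBestPos) {
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int dim = (int)localBestPos.size();

    if (rank != 0) {
        // Workers send their local best to the master.
        MPI_Send(&localBestFit, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        MPI_Send(localBestPos.data(), dim, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD);
    } else {
        // The master collects all local bests and keeps the best one.
        globalBestFit = localBestFit;
        globalBestPos = localBestPos;
        std::vector<double> pos(dim);
        double fit;
        for (int src = 1; src < size; ++src) {
            MPI_Recv(&fit, 1, MPI_DOUBLE, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Recv(pos.data(), dim, MPI_DOUBLE, src, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (fit < globalBestFit) { globalBestFit = fit; globalBestPos = pos; }
        }
    }
    // Broadcast the global best back to every process.
    globalBestPos.resize(dim);
    MPI_Bcast(&globalBestFit, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(globalBestPos.data(), dim, MPI_DOUBLE, 0, MPI_COMM_WORLD);
}
```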

The combination of master and workers enables a significant speedup compared to a sequential implementation. Nevertheless, such a distributed approach also has disadvantages. For simple problems and a small number of particles, just using OpenMP is faster. This behavior is caused by the communication costs in a distributed environment. In some cases, even a sequential execution is faster than one using MPI. On the other hand, with a higher number of particles, a hybrid approach is much faster than all other approaches. Therefore, in order to have an efficient execution, the programmer must be aware of the complexity and computational costs of the considered problem and the steps it includes.

Another point concerns the occupancy of the processors. Using frameworks such as MPI, the user is responsible for dividing the tasks among nodes and cores. Ideally, all nodes and cores are equally occupied. Achieving this is not always easy, and it gets more complicated if the execution environment has a heterogeneous structure.

5.1.6 Multi-core CPU vs. GPU

The number of possibilities for parallelizing swarm intelligence algorithms is indeed large. Besides the debate about which parallelization strategy to use, another current topic addresses the use of GPUs versus multi-core CPUs. Both have been widely used and shown to enable considerable speedups and high quality results [69].

Approaches using both types of hardware simultaneously have been also able to achieve impressive results. The use of OpenMP together with CUDA, for example, enables the concurrent processing on the CPU and GPU sides, increasing GPU utilization and reducing the overall execution time of the algorithm [63].

5.2 Discrete Optimization

Some algorithms of the SI family were tailored specifically to solve discrete problems, e.g., ACO. Despite the differences in the way these algorithms deal with the search space, many aspects are similar to those of algorithms operating in continuous search spaces, such as PSO. For example, parallelization at the particle level is also commonly used for these algorithms. An exception can be found in the work of Cicireli et al. [70], where the processes were divided with respect to the search space rather than the particles, although this territory partitioning can lead to an unbalanced workload.

Ant Colony Optimization is an algorithm of the swarm intelligence family, initially created to solve combinatorial problems such as the Traveling Salesperson Problem. From the papers gathered, 20 explore this algorithm or a variant of it. Despite having a rather different way of operating when compared to the previously mentioned algorithms used for continuous optimization, such as PSO and ABC, ACO is also very suitable for a parallelization. The strategies for parallelizing such algorithms, together with pros and cons, will be discussed in the following.

5.2.1 Naive Parallelization

Path creation is the most demanding task in the ACO algorithm when solving problems such as the TSP. In every step, an ant must calculate the probability of choosing each possible edge based on the distance and the amount of pheromone deposited on it. On the other hand, the path creation process of one ant is completely independent from that of the other ants. This property makes ACO suitable for parallelization in a very simple manner. As mentioned for continuous optimization techniques, a naive parallelization is also possible for ACO. Using one thread per ant was investigated in [71,72,73]. For both CPU and GPU environments, this approach is quite straightforward and capable of achieving considerable speedups, especially when the dimensionality of the optimization problem grows.

5.2.2 Fine-Grained Parallelization

A more advanced strategy for parallelization can be found in [74,75,76,77,78]. Instead of having one thread per ant, the parallelization is performed in a more refined way in order to make better use of the available cores if there are many more cores than ants. For example, Fingler et al. [74] use one CUDA thread block to represent one ant, and the threads of that block are used to execute the internal calculations of the ant in parallel. This is possible since many parts of the internal calculations are independent of each other, such as the calculation of probabilities during the path creation phase.

The path creation process is the most expensive step during the execution of ACO and therefore receives most of the attention during the parallelization of the algorithm. However, other steps, such as the pheromone deposit, also have potential for parallelization. Zhou et al. proposed a parallel approach for this step using CUDA's atomic operations. The use of atomic operations makes it possible to initiate several updates concurrently without race conditions in case two threads try to update the pheromone value of the same edge at the same time: conflicting updates are serialized by the atomic operations. The results show how this parallel approach outperforms the purely sequential pheromone updates [77].
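A minimal CUDA sketch of such an atomic pheromone deposit is shown below. It is an illustration under our own assumptions (one thread per tour edge, an Ant System-style deposit of Q divided by the tour length, the hypothetical kernel name deposit_pheromone), not the code of [77].

```cuda
// Parallel pheromone deposit using atomic operations: each thread deposits pheromone
// for one edge of one ant's tour; atomicAdd serializes conflicting updates when two
// ants used the same edge. (atomicAdd on double requires compute capability 6.0+.)
__global__ void deposit_pheromone(double* tau, const int* tours, const double* tourLength,
                                  int nAnts, int nCities, double Q) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per (ant, edge) pair
    if (idx >= nAnts * nCities) return;
    int ant  = idx / nCities;
    int step = idx % nCities;
    int from = tours[ant * nCities + step];
    int to   = tours[ant * nCities + (step + 1) % nCities];  // tour is a closed round trip
    double delta = Q / tourLength[ant];                // shorter tours deposit more pheromone
    atomicAdd(&tau[from * nCities + to], delta);       // conflicting updates are serialized
    atomicAdd(&tau[to * nCities + from], delta);       // symmetric TSP: update both directions
}
```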

The improvement in the execution times caused by a fine-grained implementation was also observed by Tan [10] in the surveyed GPU implementations. As a drawback of this strategy, the authors mention the impossibility of an efficient global synchronization on the hardware level, since CUDA supports it only between threads that belong to the same CUDA block. Such a synchronization would require dividing the work into two different CUDA kernels. In such a scenario, the threads would have to be spawned twice (once for each kernel) and, therefore, overhead would be generated.

5.2.3 Parallel Ant Colonies

Similarly to the multi-swarm approaches of e.g. PSO, a parallel multi-colony approach can be used with ACO, as it was done by Li [79] and Fingler et al. [74]. Instead of having one single colony, multiple colonies were deployed and processed in parallel. Fingler et al. state that it is very beneficial to have multiple instances of ACO running at the same time. The authors classify ACO as being an elitist algorithm, where only the most successful ants and the paths that they created are able to influence the behavior of the colony in the next iterations. Therefore, most of the work performed by other ants is basically discarded as they tend to follow the more successful paths. In order to avoid this problem, more possibilities can be generated by having multiple colonies running at the same time, expecting that they will have a bigger diversity of successful solutions.

Gao et al. proposed a multi-colony approach aiming at an environment with multiple GPUs [80]. The authors state that dividing the workload among available devices can be very beneficial to reduce the execution time of the algorithms. As each sub-colony runs independently on a GPU, an eventual synchronization between them is necessary in order to bring them to the same level regarding pheromone deposits. Synchronizing the pheromones between different GPUs can generate notable overhead and therefore the authors suggest that this step should take place sporadically.

Although ACO aims to solve discrete problems and the algorithm differs significantly from the ones used for continuous optimization (such as PSO and FSS), the parallelization strategies used in the gathered papers are similar for all of them. Independently of the platform or architecture, each of the mentioned strategies is able to reduce the execution time of the algorithms compared to a sequential execution. However, the actually achieved speedups have to be compared with care, since they depend on many factors, such as the quality of the implementations and the computational power of the hardware used. Nevertheless, all the described approaches provide inspiration for one's own parallel implementations of SI metaheuristics, and also for parallel implementations in other application areas, in particular areas where one level of parallelism is not sufficient to fully exploit the possibilities of the available hardware.

6 Conclusion

This document is the result of a literature review of the latest papers, up to 2021, that address the parallelization of swarm intelligence algorithms. The objective was to identify the most recent trends on this topic, including algorithms, parallelization frameworks and, most importantly, parallelization strategies. This literature review followed a systematic approach to gather the most relevant papers in the area available in the most relevant scientific databases. The research question, keywords, inclusion criteria and other parameters helped to guide the paper selection and the extraction of relevant information.

In the process, more than 300 papers were extracted from the three selected scientific databases. After the whole process including the removal of duplicates and checking the inclusion criteria, 85 papers remained. These papers were used for data extraction. They were also evaluated and their contributions were discussed. In particular, aspects regarding parallelization strategies of SI metaheuristics were analyzed and compared in order to give a clear view of the state of art and possible trends in the area.

The contribution of this document is not only a review that summarizes the latest papers about the considered topic. It can also serve as a reference when aiming for one's own parallel implementation of an SI metaheuristic or of other applications with similar characteristics, such as evolutionary and genetic algorithms. This is also true, more generally, for other application areas which require more than one level of parallelism in order to fully exploit the available hardware. Up to four levels of parallelism have been utilized: besides the obvious parallel treatment of particles, there is also a parallelization of internal operations (such as the computation of probabilities in ACO), a fine-grained parallelization across the dimensions of the search space (e.g., when evaluating the fitness of a particle), and, most interestingly, a division into parallel collaborating swarms. Additionally, we have raised some points that were not fully clarified by the authors of the selected papers and that can motivate further research, such as the use of hybrid swarms.

We can conclude that Swarm Intelligence metaheuristics are well suited for a parallelization. In all gathered papers, approaches were proposed, which enabled significant speedups. There were also variants of classical algorithms especially created for parallel environments. Also, several parallelization frameworks have been used; the most popular ones being CUDA, OpenMP and MPI, but there were also approaches using high-level approaches for parallel programming such as algorithmic skeletons.

The key point identified to improve the performance and reduce the execution time of SI algorithms was the wise use of resources. In order to solve problems in an efficient way, the programmer must be aware not only of the algorithm but also of the execution environment. A somewhat surprising observation was that having more resources, such as a distributed memory system with several nodes, does not guarantee a faster execution. For smaller problems, the costs of communication can be higher than those of all the other steps, so that the use of such hardware is not justified. On the other hand, approaches with parallel collaborative swarms turn out to be an interesting alternative that takes advantage of such distributed environments. The improvements compared to a single swarm can be huge, especially when considering hybrid swarms. From the latter, many new aspects emerge regarding communication patterns, topologies and so on.

Furthermore, we can state that the task of parallelization can be quite challenging for programmers due to the variety of approaches and different frameworks that can be used. Some experience is required in order to extract high performance from parallel applications and the hardware being used. Therefore, a further investigation of means that could ease the implementation process would be valuable, for example the combination of high-level parallelization frameworks and SI algorithms.

We conclude by stating that this article provided a solid perspective on the state of the art regarding the parallelization of swarm intelligence algorithms. Naturally, further research is needed to explore the identified gaps and to create new methods that capitalize on the synergistic combination of SI and parallel hardware platforms. Including more families of SI algorithms would also be interesting to broaden this study.