Introduction

Automatic Guided Vehicles (AGVs) are unmanned, battery-powered industrial vehicles widely used to automate intralogistics and production applications, substituting conveyors and manned industrial transport means. Their use increases the flexibility of production and allows fast reactions to changes in demand and the market. The automation of logistics also improves the quality of processes and reduces injuries. All of this results in reduced production costs and increased competitiveness. Thus, AGVs have become very popular with the advent of Industry 4.0.

The new capabilities provided by 5G networks—high availability, ultra-low latency, high bandwidth, and the allocation of computational resources closer to factories to reduce latencies and response times—have enabled the virtualization of the programmable logic controller (PLC) and of the control algorithms that run in AGVs. This virtualized PLC can now run as a service on a Multi-access Edge Computing (MEC) infrastructure in a 5G network.

The relocation of the AGV’s controller into the MEC provides important cost-saving benefits and new functionalities. In particular, this strategy simplifies and reduces the hardware within the AGV, since the hardware that computes the control strategies is deployed in the MEC and shared by all AGVs. In this way, AGVs can be modeled as processes executed on an off-the-shelf multitask computer. This analogy helps to understand why hardware resources are optimized and how the use of computation resources is maximized. This benefit is significant if we consider that in some factories, especially in the automotive sector, up to 200–300 AGVs can work simultaneously.

The advantages are not only related to cost-saving: all the benefits of virtualization—easy deployment, flexibility, replicability, and redundancy, among others—are extensible to the AGV sector. In addition, there is today a market trend towards the servitization of solutions; Infrastructure as a Service (IaaS), Platform as a Service (PaaS) and Software as a Service (SaaS) are only some examples of this trend. Therefore, this MEC-based solution enables the deployment of a new paradigm: AGVs as a Service (AaaS). Thus, AGVs can be considered a flexible resource for adapting to changes in production.

However, these benefits become drawbacks if a failure occurs in an AGV and the production line stops. In some sectors, such as the automotive industry, where around one car is produced per minute, a stop in a production line may cause losses of tens of thousands of euros per minute. Therefore, the capacity to predict in advance an AGV’s malfunction would provide great benefits. For this purpose, artificial intelligence and machine learning techniques, such as deep learning, can be a powerful tool, as they have proven their suitability for predicting complex events in many different domains (Lim & Zohren, 2021).

In this work, we propose applying deep learning techniques to predict in advance the guiding performance of an AGV whose PLC has been virtualized and is running in a 5G-based MEC infrastructure. We demonstrate in an industrial-grade environment that the AGV guiding performance can be forecast with sufficient anticipation by exclusively using features extracted from the packets of the AGV-PLC connection that are captured in a network switch. Therefore, there is no need to deploy any meters on the end-user equipment (AGV and PLC) to collect the packets, which greatly facilitates the deployment of the solution in real-time environments.

To demonstrate the validity of the solution, experiments with a real 5G network and an industrial AGV from the company ASTI Mobile Robotics (ASTI Mobile Robotics, 2020) were carried out in 5TONIC, an open laboratory for 5G experimentation. An extensive set of experiments was designed to evaluate the forecasting performance of deep learning models under different network conditions. To this end, we programmatically introduced several types of network disturbances, creating delay and jitter perturbations similar to those that may occur in a real 5G network.

To implement the forecasting models, we initially selected two traditional deep learning architectures (LSTM, CNN-1D) that have demonstrated their ability to forecast a variable by extracting non-linear relationships from multivariate time-series inputs (Lim & Zohren, 2021). In addition, we compared their performance against the state-of-the-art technique in NLP (Natural Language Processing) and Computer Vision sequence-to-sequence tasks, the Transformer neural network (Vaswani et al., 2017). Today, many research activities are being developed around this architecture, not only in NLP and Computer Vision, but also in other areas where complex sequence analysis is required, such as time series forecasting (Wen et al., 2022). In particular, we adapted to our problem a modification of the Transformer architecture for time series recently proposed in Zerveas et al. (2021). The rationale of this selection is that we want to evaluate whether Transformer networks, as representatives of state-of-the-art sequence-to-sequence models, are able to outperform traditional deep learning sequence models in the proposed problem.

Using these three architectures, we designed an extensive set of model configurations based on the selection of different architectures, time window sizes, input feature sets and feature aggregation mechanisms. The model configurations produced around 6,600 deep learning models that were trained and tested, and their results analysed to determine the influence of each configuration parameter on the forecasting performance.

We demonstrate that the real-time requirements of this use case can be fulfilled using a modest off-the-shelf PC workstation. In addition, we analyse the impact on the real-time performance when a specialised GPU with a high degree of parallelization is used instead of a CPU. Furthermore, and considering a more complex real-time scenario, we demonstrate the feasibility of using a single deep learning model to forecast in real-time the guidance error of a set of AGVs (256) when either a CPU or a GPU is used.

Finally, several scenarios that may require model retraining are discussed. In particular, we analyse data drift problems that arise during model operation in a production environment, possibly caused by changes in network disturbance patterns and in the physical components of the AGV. To cope with this problem and avoid costly retraining processes from scratch, we suggest using Transfer Learning techniques to perform incremental training and speed up the process when data drift problems appear.

The rest of the paper is organized as follows. Section “Related work” reviews related works. The three technology enablers that were applied in the use case are briefly introduced in the “Use case technology enablers” section. In the “System model” section the system model of the use case and its architecture are detailed. Section “Experiments” is devoted to the presentation of the experiments carried out, and in the “Results and analysis” section the findings of the experiments are analysed and discussed. Finally, conclusions and future work are presented at the end of the document in the “Conclusions” section.

Related work

The use of machine and deep learning techniques to anticipate the occurrence of interesting events in a factory is a recent and recurrent topic in many industrial sectors (e.g., Aktepe et al., 2021; Gürbüz et al., 2019; Tian et al., 2021; Wu et al., 2020; Zhao et al., 2020). The applicability of 5G in industry is something that was already conceived from the very beginning of the 5G design process at 3GPP (the standardization body in charge of it) and, therefore, its requirements are fully integrated into the 5G architecture. In Rao and Prasad (2018), the authors describe the impact of 5G on Industry 4.0 as a technology enabler, highlighting its impact on IoT and the use of massive Machine Type Communications (MTC) in smart factories, with examples in artificial intelligence and robotics. How industry should consider 5G in several AGV application areas, including management and control, is detailed in Oyekanlu et al. (2020). In that work, open challenges are outlined, such as the integration of MEC platforms and the use of AI in 5G-connected factories, which point to future research such as that presented in this paper.

Deep learning technologies have been successfully used in robotics applications in recent years, and recent surveys (Pierson & Gashler, 2017; Shabbir & Anwer, 2018) collect the proposed technologies and their applications in this area. These works focus on navigation, context awareness and fault diagnosis. Surmann et al. (2020) present an application where deep learning is used for autonomous self-learning robot navigation in an unknown environment without a map or a planner. In Kim et al. (2018) an end-to-end method for training convolutional neural networks for autonomous navigation of a mobile robot is proposed. A robot deployed with the proposed model can navigate in a real-world environment using only the camera, without relying on any other sensors. Wang et al. (2018) improve the environmental perception ability of mobile robots during semantic navigation by a three-layer perception framework based on transfer learning that includes three recognition models: place, rotation region, and side.

Deep learning techniques also improve the perception abilities of robots. In Cartucho et al. (2018) deep learning is applied to object recognition when robots are moving in real human environments. Shvets et al. (2018) use deep neural network architectures for semantic segmentation of robotic instruments, applied in robot-assisted surgery.

Fault diagnosis and forecasting is another interesting application of deep learning in robotics. To mention some examples, in Long et al. (2020) an attitude data-based intelligent fault diagnosis approach is proposed for multi-joint industrial robots. Based on the analysis of the transmission mechanism, the attitude change of the last joint is employed to reflect the transmission fault of robot components. The intelligent fault diagnosis model is designed considering the characteristics of attitude data, using a hybrid sparse auto-encoder (SAE) and support vector machine (SVM) approach. Gao et al. (2018) propose a fault-tolerant control method based on deep learning for multiple displacement sensor faults of a wheel-legged robot. In Tsai et al. (2021) a fault diagnosis method for underwater thruster propellers based on deep convolutional neural networks is proposed.

The application of deep learning techniques to forecast time series has been a very active field of research in the last decade, as these techniques have proved to be an effective solution given their capacity to automatically learn the temporal dependencies present in time series in different domains (Busseti et al., 2012; Gasparin et al., 2019; Han et al., 2019; Kotsiopoulos et al., 2021; Lara-Benítez et al., 2021; Morariu & Borangiu, 2018; Sezer et al., 2020). However, there is a lack of published research on this topic in the Industry 4.0 area (Gui & Xu, 2021; Thurow et al., 2019) and, to the best of our knowledge, this work is the first to combine Industry 4.0, 5G networks and deep learning to investigate the application of deep learning models to predict in advance AGV trajectory difficulties using a new generation of AGVs in which the PLC is virtualized and the communications between the AGV and PLC are deployed in a 5G network.

Transformers are a recently introduced class of deep learning models, first proposed for natural language translation (Vaswani et al., 2017), and are currently considered the state of the art for sequence-to-sequence tasks. Several recent works have proposed how this sequence-to-sequence architecture can be adapted to time series prediction problems (Wen et al., 2022). Furthermore, a very recent work proposes applying a modified version of the Transformer architecture to solve a regression problem using only the encoder part of a Transformer (Zerveas et al., 2021). However, to the best of our knowledge, our work is the first to apply the Transformer neural network architecture to an Industry 4.0 and 5G network forecasting scenario.

The use case considered for our analysis was initially presented in Vakaruk et al. (2021). In sharp contrast, this paper focuses on evaluating and comparing the forecasting performance of Transformer neural networks, the state of the art in sequence-to-sequence architectures, against traditional deep learning architectures (LSTM and 1D-CNN). Furthermore, in this work an in-depth analysis was conducted on the performance of different model configurations segregated by deep model architecture, time window size, feature set, and input feature aggregation, totalling 6,600 deep learning models trained and tested. In addition, this paper thoroughly reviews previous research in this area and discusses in detail the real-time issues that arose when the best performing deep learning models were deployed in a realistic environment (the 5TONIC laboratory). Finally, this paper explores quantitatively whether these deep learning models can be used to control not just a single AGV, but a fleet of AGVs. To do so, we considered 256 AGVs operating in parallel, with a single deep learning model collecting inputs from all AGVs and generating their predictions.

Use case technology enablers

Automatic guided vehicles

AGVs are unmanned vehicles that are utilized to replace manned industrial trucks and conveyors in the industrial sector. These self-driving vehicles have the potential to improve the efficiency and flexibility of industrial processes while also lowering human error and operational expenses. Due to the I4.0 approach, they have grown in popularity in recent years (Garcia et al., 2020).

The main application of AGVs in the industrial sector is warehousing and logistics, the management of the flow of products, materials and other stock in production processes. In 2017, their use increased by 162% with respect to 2016, according to data from the International Federation of Robotics 2018 Service Report (Robotic Industries Association, 2020); moreover, it is estimated that up to 485,000 new units will be sold between 2019 and 2020. Some relevant examples are Google and Alibaba, which invested 500M and 15B, respectively, in the automation of their logistics (Owen-Hill, 2020).

AGVs carry a variety of sensors for guidance and/or localisation, such as magnetic sensors, optical sensors, and lidars. Guidance information relates to the distance to a predetermined ground trajectory, whereas localization information refers to absolute coordinates and plane orientation. The guidance and localization of the AGV employed in the trials are not connected. A magnetic sensor provides the guidance by returning the distance between the sensor’s center and the center of a magnetic tape placed on the floor. An RFID reader, on the other hand, provides localization by identifying precise places on the ground.

Fig. 1 AGV’s coordinate system

The kinematics of the AGV that was used in the experiments is shown in Fig. 1. The traction unit is represented by the green square and the body of the AGV by the blue rectangle. The traction unit is linked to the body by an axle (orange circle) and can pivot around it. The distance between the wheels of the traction unit is denoted by \(L_h\) and the distance between the traction unit and the center of the rear axle by \(L_b\). The position of the centre of the rear wheels is denoted by \((x_b, y_b)\) and the orientation of the body in the inertial frame by \(\phi _b\). On the other hand, the position of the center of the traction unit is denoted by \((x_h, y_h)\) and the orientation of the traction unit in the inertial frame by \(\phi _h\). Under these conditions, and assuming that there is no slippage of the wheels, the evolution of the position and orientation of the AGV is given by Eqs. (1)–(4).

$$\begin{aligned} \dot{x}_b &= \frac{(v_l+v_r)}{2}\cos (\phi _h - \phi _b)\cos (\phi _b) \qquad (1)\\ \dot{y}_b &= \frac{(v_l+v_r)}{2}\cos (\phi _h - \phi _b)\sin (\phi _b) \qquad (2)\\ \dot{\phi }_b &= \frac{(v_l+v_r)}{2L_b}\sin (\phi _h - \phi _b) \qquad (3)\\ \dot{\phi }_h &= \frac{(v_r-v_l)}{L_h} \qquad (4) \end{aligned}$$

where \(v_l\) is the linear velocity of the left wheel of the traction unit in m/s and \(v_r\) is the linear velocity of the right wheel of the traction unit in m/s.
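As a worked illustration of these kinematics, the following Python sketch integrates Eqs. (1)–(4) with a simple forward Euler scheme; the geometry values, wheel velocities and time step are illustrative assumptions, not the dimensions of the real AGV.

```python
# Minimal Euler integration of the AGV kinematics in Eqs. (1)-(4).
# L_h, L_b, dt and the wheel velocities are assumed values for illustration.
import math

L_h, L_b = 0.4, 1.0   # wheel separation and traction-to-rear-axle distance (m)
dt = 0.01             # integration step (s)

def step(state, v_l, v_r):
    x_b, y_b, phi_b, phi_h = state
    v = (v_l + v_r) / 2.0
    x_b   += dt * v * math.cos(phi_h - phi_b) * math.cos(phi_b)  # Eq. (1)
    y_b   += dt * v * math.cos(phi_h - phi_b) * math.sin(phi_b)  # Eq. (2)
    phi_b += dt * (v / L_b) * math.sin(phi_h - phi_b)            # Eq. (3)
    phi_h += dt * (v_r - v_l) / L_h                              # Eq. (4)
    return (x_b, y_b, phi_b, phi_h)

state = (0.0, 0.0, 0.0, 0.0)
for _ in range(1000):                  # 10 s of motion with a slight turn
    state = step(state, v_l=0.52, v_r=0.50)
```

Such a simulation also illustrates the oscillatory behaviour discussed later: if the corrections applied through \(v_l\) and \(v_r\) arrive late, the heading error \(\phi _h - \phi _b\) grows before it is compensated.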

5G networks

The fifth generation network, or 5G, is the latest leap in mobile communications networks. In contrast to previous generations, 5G not only seeks to provide greater connectivity and speed, but also flexibility and tools to adjust the network to the needs of different industries and services. To this end, three categories are defined for the behaviour of the network: (i) accesses that require high bandwidth (eMBB or enhanced Mobile Broadband), such as augmented reality solutions; (ii) environments where speed is not a priority, but rather the density of equipment (mMTC or massive Machine Type Communications), with massive IoT environments where multiple sensors need connectivity; and (iii) services that require low latency in transmission (URLLC or Ultra Reliable Low Latency Communications), such as autonomous driving or sensitive industrial processes. Precisely, Industry 4.0 is one of the verticals (Annex A.2 in TS 22.104) (3GPP, 2020) considered to improve 5G technology in the latest Release 16 to address these needs. A clear example is the use of new generation AGVs that require stable and low-latency wireless connectivity for their operations.

One of the key elements in the architecture of a 5G network is the User Equipment (UE) that represents the 5G mobile terminal and has accessibility to different services in the network, through two distinct and interconnected domains: the New Generation Radio Access Network (NG-RAN) and the 5G Core (5GC). The NG-RAN provides the radio connectivity with the established service requirements between the UE and the 5G Core and is represented by the gNodeB (gNB) component. The gNB is designed to be distributed in a way that increases its capillarity (e.g., a small radio cell for a manufacturing plant floor) and can be separated into the Remote Radio Head (RRH), the Distributed Unit (DU) and the Central Unit (CU). The degree of disaggregation will depend on various factors such as the needs of the network operator, density of services and devices, geographical constraints, etc.

The 5G Core has been completely redesigned compared to previous monolithic generations to adopt a micro-services architecture, which makes it scalable and multi-vendor, following an Internet application architecture model. It is worth noting that in this new design there is a clear differentiation between the control plane, or network signaling, and the user plane. The control plane manages all functionalities related to connectivity, including registration, access control, resource and policy allocation and mobility. Communication between network functions (NFs) uses service-based interfaces, over HTTPS RESTful requests in most cases. The user plane ensures the connectivity of the services and applications themselves by encapsulating the traffic over the GPRS tunneling protocol (GTP). The main control and data plane network functions are summarized in Table 1.

Table 1 List of 5G Network functions

Two appealing capabilities of 5G networks regarding their applicability to Industry 4.0 are the concepts of network slicing and the Non-Public Network (NPN), both recently standardized in 3GPP Release 16 and enhanced in the upcoming Release 17.

Network slicing is defined as a mechanism to provide multi-tenancy and multi-services, with different resource allocations and isolation over the same 5G network. Therefore, in practical terms, it is possible that a 5G network operator uses the same infrastructure to provide different combinations of network features to Industry 4.0 clients, such as IIoT device onboarding, secondary authentication, TSN (Time-Sensitive Networking) integration, etc. NPN proposes different levels of integration between public and private 5G networks (Ordonez-Lucena et al., 2019), opening opportunities for alternative business models related to Industry 4.0, like deploying an isolated NPN into an industry factory campus.

To meet the low latency demands in the network (URLLC) it is necessary to bring network services closer to the user. To this end, an architecture called Multi-access Edge Computing (MEC) has been standardised in ETSI (Hu et al., 2015). The Edge computing capability is connected directly to the User Plane Function (UPF) closest to the customer (i.e., directly at the user plane) where low latency applications are offered. This approach enables Industry 4.0 services to offer the required functionalities such as the virtualized PLC guide system for the AGV described in our work.

Deep learning techniques

A time series is a sequence of observations in chronological order that are collected over fixed intervals of time. Time series can be divided into univariate or multivariate depending on the number of variables at each timestep. The traditional methods used for time series forecasting were mainly based on statistical models, such as ARIMA and exponential smoothing. However, models based on artificial neural networks have attracted much attention in the last decade, as they have shown better performance than statistical methods in many situations, especially due to their capacity and flexibility to map non-linear relationships from data and reduce many of the traditional data preprocessing steps (Raza & Khosravi, 2015). Furthermore, when a problem involves processing a multivariate time series, the traditional methods are not well suited, and therefore deep neural networks emerge as the most appropriate technique to apply.

In the rest of this subsection, we review the most relevant types of deep learning networks that can be used for time series forecasting.

Fully Connected Neural Network (FCNN) (also known as Multi-Layer Perceptron): It is the most basic type of feed-forward artificial neural network and the most popular among researchers due to its simplicity. Its nodes are grouped into layers with weighted connections, in which an input vector is transmitted from one layer to the next, terminating in an output vector. The first and last layers are known as input and output layers, respectively, and all layers in between are known as hidden layers. The FCNN is considered a deep learning model if it contains multiple hidden layers (Deng & Yu, 2014) that allow the FCNN to model complex nonlinear relations more efficiently compared to its shallow version, the Multilayer Perceptron (Bengio, 2009). The FCNN learning process is commonly carried out by iterative optimization using gradient descent (Rumelhart et al., 1985), which iteratively tunes the weights of the network by minimizing the prediction error. In the training of FCNNs, different problems can be encountered, such as local minima, vanishing gradients and overfitting. Some of the most commonly used techniques to address these problems are L1 and L2 regularization (Bengio et al., 2013) and dropout (Dahl et al., 2013) to deal with overfitting, and batch normalization (Ioffe & Szegedy, 2015) and the ReLU activation function to deal with vanishing gradients. In addition, the FCNN architecture is usually placed after other stages of deep architectures as the output of the model.

Convolutional Neural Networks (CNN): The design of the CNN architecture (LeCun et al., 1998) was inspired by the human visual system, and CNNs have even outperformed humans in some recognition problems. The most typical configuration of a deep CNN architecture consists of multiple convolution layers with non-linear activation functions, a pooling layer after each convolution and several FCNN layers at the end. Convolution layers are configured as a set of filters that process the input data by highlighting important information, taking into account the spatial relationship between the data. The pooling layers reduce the amount of data in the output of the convolutional filters while retaining most of the highlighted information. Multiple convolutional and pooling layers can extract more complex information through multiple levels of non-linearity. A final FCNN with a few hidden layers serves as a discriminator using the important information highlighted by the previous convolutional layers. Although CNN architectures are typically used in image classification problems, they can also be adapted for use with time series data sets and forecasting problems. The convolutional filters used in this case are 1-D filters that are very similar to finite impulse response (FIR) filters in digital signal processing (Lyons, 2011). Furthermore, 1D-CNN filters rely on two key assumptions for time series data: the filters can only use k data points from a defined lookback window to make forecasts, and the filters can only learn time-invariant relationships between the data in the lookback window.
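As an illustration, a 1D-CNN forecaster of this kind can be sketched in Keras as follows; the lookback length, filter counts and layer sizes are arbitrary choices for the example, not the configurations evaluated in our experiments.

```python
# Minimal 1D-CNN sketch for multivariate time-series forecasting (Keras).
# K (lookback steps) and F (input features) are assumed example values.
from tensorflow import keras
from tensorflow.keras import layers

K, F = 100, 8  # e.g., a 10 s window at 100 ms granularity with 8 features

model = keras.Sequential([
    layers.Input(shape=(K, F)),
    layers.Conv1D(32, kernel_size=5, activation="relu"),  # 1-D (FIR-like) filters
    layers.MaxPooling1D(pool_size=2),                     # keep highlighted info
    layers.Conv1D(64, kernel_size=3, activation="relu"),
    layers.GlobalAveragePooling1D(),
    layers.Dense(32, activation="relu"),                  # small FCNN discriminator
    layers.Dense(1),                                      # forecast value
])
model.compile(optimizer="adam", loss="mse")
```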

Recurrent Neural Networks (RNN): RNN architectures (Goodfellow et al., 2016) were designed to model long-term dependencies that are computationally infeasible for FCNN or CNN architectures. Sequentially formatted data are often modelled by RNN architectures with good results in applications such as: text translation, sound to text and temporal forecasting (Lim et al., 2020; Salinas et al., 2020; Rangapuram et al., 2018; Wang et al., 2019). The RNN has a similar architecture to the FCNN, but the neurons in the hidden layers of the RNN receive their own output together with the outputs of the previous layer as input. This recurrent connection can also be viewed as an internal memory state that is updated with each new observation at each time step. From a signal processing point of view, RNNs can be viewed as a nonlinear version of infinite impulse response (IIR) filters. It is worth noting that, although RNNs do not explicitly require a lookback window like CNN architectures, they can benefit from their use due to: a more efficient learning implementation, and their internal short-term memory.

Long Short-Term Memory (LSTM): The LSTM (Hochreiter & Schmidhuber, 1997) architecture overcomes the problems of RNNs with an infinite lookback window that allows it to learn the long-term dependencies of the data (Bengio et al., 1994; Hochreiter et al., 2001). This version of an RNN architecture has a gate-based mechanism that controls the state of each cell to store or delete information. Gates are a set of simple neurons combined with arithmetic functions that can be trained together with the rest of the neural cells during optimization. There are three LSTM gates: the “forgetting” gate controls whether information from the previous state is retained; the “input” gate controls how new information is stored taking into account the previous state weighted by the forgetting gate; the “output” gate controls the output of the LSTM cell using the internal state updated by the input gate. In particular, the LSTM architecture solves the gradient explosion and vanishing gradient problems of previous versions of the RNN architecture. In addition, the LSTM architecture also benefits from the use of a finite lookback window to optimize its training process, although this may be longer than in the RNN case. A hybrid deep learning model can also be built by placing a number of 1D-CNN layers before a number of LSTM layers, as sketched below. In this hybrid approach, the 1D-CNN acts as an initial filter to highlight important information and the LSTM learns long-term relationships from the highlighted information.
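A minimal sketch of this hybrid 1D-CNN + LSTM arrangement, under the same assumed window and feature sizes as the previous example, could look as follows:

```python
# Hybrid sketch: a 1D-CNN layer filters the input window, an LSTM layer then
# learns long-term relationships; sizes are illustrative assumptions.
from tensorflow import keras
from tensorflow.keras import layers

K, F = 100, 8

model = keras.Sequential([
    layers.Input(shape=(K, F)),
    layers.Conv1D(32, kernel_size=5, activation="relu"),  # highlight information
    layers.LSTM(64),                                      # long-term dependencies
    layers.Dense(1),                                      # forecast value
])
model.compile(optimizer="adam", loss="mse")
```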

Transformers (TRA): The Transformer is a deep learning encoder-decoder architecture recently proposed in Vaswani et al. (2017) that implements a sequence-to-sequence model and is today considered the state of the art for this type of model. The disruptive breakthrough of this neural network architecture is the use of multi-head self-attention layers, which have proven to be very effective in the fields of natural language processing (NLP) and computer vision. The original architecture consists of an encoder and a decoder block. In a typical NLP translation problem, the encoder gets as input a sentence in the original language and the decoder gets as input the same sentence in the target language. In turn, the decoder generates the next word in the target language at each iteration until the output sequence is completed. The encoder architecture starts with an Embedding layer for the source language that transforms words into numeric vectors, followed by a Positional Encoding layer that encodes the position of the words in the phrase. Then, there are N sequential blocks starting with a multi-head self-attention layer, followed by a dense non-linear hidden layer. The decoder follows a similar structure, starting also with an Embedding layer, but for the target language, and the same Positional Encoding layer. Then, there are N sequential blocks that start with a multi-head self-attention layer as in the encoder blocks, followed by another multi-head mixed-attention layer, followed in turn by a non-linear dense hidden layer.

This sequence-to-sequence architecture can be adapted to time series prediction problems; specifically, in Zerveas et al. (2021) it was shown that in classification and regression problems it is sufficient to use only the encoder part of the architecture, without the embedding part. It is worth noting that although the Transformer architecture does not employ layers that directly relate the data sequentially, as RNN, LSTM or 1D-CNN do, we have experimentally observed that the multi-head self-attention blocks replicate this behaviour by assigning different importance to the elements in the sequence. Therefore, being a more general and powerful architecture, Transformers can learn during training complex sequential behaviours such as those of recurrent (RNN, LSTM) or convolutional neural networks.
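The following sketch illustrates an encoder-only Transformer regressor in the spirit of Zerveas et al. (2021): the input features are linearly projected instead of embedded, a sinusoidal positional encoding is added, and a pooled encoder output feeds a regression head. The model width, number of heads and blocks, and the pooling head are simplifying assumptions, not the exact architecture we trained.

```python
# Encoder-only Transformer sketch for time-series regression (Keras).
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

K, F = 100, 8                  # assumed window length and feature count
d_model, heads, blocks = 64, 4, 2

inp = layers.Input(shape=(K, F))
x = layers.Dense(d_model)(inp)  # linear projection replaces word embedding

# Sinusoidal positional encoding, added to the projected inputs
pos = np.arange(K)[:, None] / np.power(
    10000, (2 * (np.arange(d_model) // 2)) / d_model)
pe = np.where(np.arange(d_model) % 2 == 0, np.sin(pos), np.cos(pos))
x = x + pe.astype("float32")

for _ in range(blocks):
    att = layers.MultiHeadAttention(num_heads=heads,
                                    key_dim=d_model // heads)(x, x)
    x = layers.LayerNormalization()(x + att)               # residual + norm
    ff = layers.Dense(4 * d_model, activation="relu")(x)   # non-linear hidden layer
    x = layers.LayerNormalization()(x + layers.Dense(d_model)(ff))

out = layers.Dense(1)(layers.GlobalAveragePooling1D()(x))  # regression head
model = keras.Model(inp, out)
model.compile(optimizer="adam", loss="mse")
```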

Forecasting models: Deep learning models used for forecasting typically predict future values given a logical grouping of temporal information that can be observed at the same time. Time series is the most frequent format for temporal information. The elements of a time series vector are separated by the same time interval and are organized in chronological order. Examples of temporal information are: the vital signs of a patient in medicine, the information transmitted through a router in computer networks, or the trajectory of an AGV. The simplest type of forecasting is known as one-step-ahead; such forecasting models can be represented by the formula \(y_{t+1} = f(y_t,...,y_{t-k}; x^i_t,..., x^i_{t-k})\). Here, \(f(\cdot)\) is the prediction function of a deep learning model (in our case) that uses the past y and x variables to predict the value of y one step ahead, which is represented as \(y_{t+1}\) in the formula. The y variables are called endogenous and are the output of the deep learning forecasting model. The x variables are called exogenous and are not predicted, but are used as input to the deep learning models together with the past y variables. The k value is the look-back window, which indicates the number of past steps of the variables that are used as input to the forecasting model.

A refinement of one-step-ahead forecasting is multi-horizon forecasting, which predicts multiple steps ahead instead of only one step of the endogenous variable. Multi-horizon forecasting can be performed with an infinite or finite horizon. In the case of an infinite horizon, a model is typically built to recursively predict one step ahead using the previously predicted steps. A small amount of error accumulates with each step, which can result in a larger error after a number of steps. In the case of a finite horizon, a model is trained to predict multiple steps ahead simultaneously; the predictions are limited to the size of the horizon, but the cumulative error problem of the infinite horizon method is avoided. Sometimes, an endogenous variable has a high variance between consecutive steps, which makes forecasting it almost impossible, similar to forecasting a random variable (Mozo et al., 2018). In that case, the endogenous variable can be smoothed or aggregated by a mean value over some future steps. For example, instead of predicting the energy consumption three hours ahead, the average consumption over the three future hours can be predicted. The aggregated endogenous variable loses accuracy, but gains robustness with a smaller variance between consecutive steps. This is the case for our forecasting model: instead of predicting the instantaneous value of the AGV guidance error, an aggregated value over an interval is forecast to avoid the aforementioned problem.

System model

In this section, we describe the use case under analysis and the components involved in it. Our work aims to demonstrate that it is possible to anticipate the behaviour of an AGV controlled by a remote PLC by using deep learning techniques even when the 5G network is suffering different degrees of degradation due to the appearance of perturbations such as delay, jitter and packet corruption. To this end, we use AGV guidance information obtained in real time from its guide error sensor as this variable indicates whether the AGV is in difficulty following the magnetic tape on the floor. Forecast information of the guide error variable can be used by external control systems to detect potentially dangerous situations and act in an anticipatory manner to avoid them.

AGVs are typically equipped with a PLC, actuators and sensors. Although the PLC is an integral part of standard AGVs, our AGV, as a novelty, is linked to a virtual PLC relocated at the edge of a 5G network in a MEC environment to obtain, among other benefits, cost savings, scalability and flexibility. In this context, the AGV and PLC require a communication channel with a low latency rate (i.e., the delay between sending and receiving information) to support the real-time interactions necessary to ensure that the sensor information from the AGV is delivered to the PLC on time and that the control commands sent by the PLC are received and correctly processed by the AGV in a timely manner. The control dialogue between the AGV and its virtual PLC is as follows: (i) the AGV sends the information from the sensors to the virtual PLC deployed in the MEC to be processed, (ii) the PLC computes the proper control commands and sends them back to the AGV to reduce the guide deviation, (iii) finally, the AGV receives and redirects this control information to the actuators. In this way, the AGV works only as a gateway of signals and all control decisions are taken in the MEC-deployed PLC. Note that the virtual PLC can be installed in a virtual machine or a container to be run as a typical MEC application in a 5G network.

This challenging scenario is made possible by a combination of 5G network technologies. First, 5G provides by design the URLLC connectivity (Ultra Reliable Low Latency Communications) that the AGV and its virtual PLC require for fast and reliable communication. In addition, the virtual PLC is deployed in a MEC platform to meet the required low latency.

Fig. 2 Architecture of the AGV use case and its main components: new generation AGV, virtualised PLC, 5G RAN, 5G MEC, 5G Core, ML module and IDS connectors

The architecture of the proposed use case is detailed in Fig. 2. The 5G network setup includes the radio access network (5G RAN) to provide access to the AGV in mobility. All signalling to authenticate and deliver IP connectivity is managed within the 5G Core, as described in the “5G networks” section. The 5G link allows replacing the internal PLC module of an AGV with a 5G access equipment that connects the AGV to a remote PLC running in a virtual machine (Master PLC VM). The low latency requirements of the PLC-AGV communication require the use of a MEC platform that hosts the virtual machine in which the remote PLC is deployed. In addition, the MEC platform contains several computing resources, including a component that provides access to the user data plane for different service demands, such as data collection with no additional probes. The computing resources are deployed using virtualization technology with a hypervisor and several Virtual Machines (VMs) where different functionalities are executed: a Master PLC to control AGVs, a packet aggregator to generate in real-time connection statistics to be input to machine learning components, machine and deep learning inference engines, and two software connectors to interact with manufacturing processes (e.g., dashboards and logistics process control systems). The deep learning-based forecasting model (ML engine) running in the MEC platform uses the information extracted in real-time from the AGV-PLC connection to predict in advance the AGV guide error.

A key design element of this use case is that no sensor equipment is required to be deployed on the AGV or PLC, as the variables to be input to the ML engine are obtained from the AGV-PLC connection packets that are collected from a network link. We collect network packets (including the packets of the PLC-AGV connection) by activating a port mirroring option in one of the switches that receives the traffic we want to collect. Port mirroring is used on network switches to send a copy of the network packets seen on one switch port (or an entire VLAN) to a network monitoring connection on another switch port. In our setup, the copy of the network packets is sent to the switch port where the machine learning VM is connected. The machine learning VM has a network interface from which it can read a copy of every packet transmitted on the network link where port mirroring was activated. Only packets transmitted between the AGV and the PLC are processed, and all other traffic crossing the link is discarded.

Two different sets of variables are extracted from the AGV-PLC connection to be used as input to the forecasting models: (i) connection statistics that can help to determine whether the network is suffering some degradation problems and (ii) the current values of several AGV sensors and, among them, the guide error to be forecast. The two sets of variables will be tested separately and combined to determine which option obtains the best performance when forecasting the guide error. It is worth noting that, although their performance may not be the best, using connection statistics exclusively would be extremely useful in public networks where the payload (i.e., the AGV variables) is encrypted for privacy reasons and therefore, only connection statistics are available to be input to ML engines.

The guide error is obtained from an AGV magnetic sensor that permanently measures the distance from the center of the sensor to the magnetic tape on the floor. The values are positive or negative depending on whether the sensor is to the left or right of the tape. Intuitively, if the values of the sensor (in absolute value) are large, we may infer that the AGV is having difficulties maintaining its trajectory with respect to the tape on the floor. Therefore, predicting such difficulties in advance allows corrective actions to be anticipated to avoid harmful situations (e.g., going off the circuit or running over a person).

The two software connectors deployed in the MEC platform (green boxes in Fig. 2) are based on the IDS Trusted Connector technology and aim to open the MEC infrastructure as a valid resource for Industry 4.0 verticals in 5G. Recently, the Boost 4.0 project (H2020 Boost-4.0, 2020) established a European Industrial Data Space (IDS) (Otto et al., 2016) to enable cooperation and collaboration between industries through a unified data ecosystem in which attendees can exchange their (raw, transformed, calculated, analysed, etc.) data in a secure and fast way. The Industrial Data Space (IDS) Trusted Connector is an open IoT edge gateway platform that provides a standardized way to communicate with external components, as it is built on open standards to avoid vendor lock-in. It implements the Trusted Connector of the Industrial Data Space Reference Architecture, following the DIN SPEC 27070 and ISO 62443-3 standards, and is used for the connection between sensors, cloud services, and other connectors (using a vast range of protocol adapters).

In our use case, the left IDS connector allows exporting the ML predictions of the guide error standard deviation to an external Operation Support System (OSS) in which human operators or a fully automated Logistics Process Control will process the forecast values to apply some preventive action to the AGV via the right IDS connector in case of need. Although a detailed treatment of this information is beyond the scope of this work, a typical scenario could detect if the guide error is outside the safety or quality parameters set for the facility (e.g., due to a degradation of the network). In this case, the Logistics Process Controller could reduce the service level of the logistics facility (productivity, throughput, etc.) to match the network service level, for example, by reducing the forward speed of the AGVs.

Finally, a degradation emulator is added to the use case to create perturbations in the network in a controlled manner. This element, inserted in the link between the MEC and the radio access, can degrade the connectivity of the AGV with the PLC to emulate different network problems that can appear on the shop floor (e.g., weak coverage, radio noise, network congestion) and that are not present in a clean environment.

Latency affects the performance of the system: it makes the movement more oscillatory, as the system reacts later to changes in the path’s curvature. As small guiding errors are not corrected in time, they grow quickly, and the corrections must be more aggressive, which increases the oscillations. If the network delay is too high, the oscillations are too large and the AGV leaves the path. Indeed, in the dynamic scenarios we tested, the network delay is increased linearly until the AGV goes off the route. In the static scenarios we tested, however, the network delay is enough to degrade the performance of the AGV, but not so high as to make the AGV leave the path.

Experiments

The use case implementing the scenario described in the “System model” section was set up in 5TONIC, an open laboratory focusing on 5G technologies, to provide a realistic deployment in a controlled scenario. 5TONIC was founded by Telefónica and has permanent and temporary infrastructures to set up specific experiments such as the one we consider in this work.

We set up in 5TONIC a realistic scenario for AGV experimentation consisting of a figure-8 circuit marked by a black magnetic tape on the floor of a \(300\,m^2\) room with a battery recharging point. In addition, a MEC infrastructure was set up to host the virtualized PLC and the rest of the services (data collector, packet aggregator, deep learning engine, and IDS connectors).

All traffic transmitted between the AGV and the PLC was mirrored to the VM where the packet aggregator and ML engines were running. To obtain data for training and testing the ML models, network packets were captured in this VM using the tcpdump tool and stored in standard pcap files. Various types and levels of network perturbations were applied during the experiments to ensure the robustness and generalization of the trained deep learning models. The Traffic Control (tc) Linux tool was run in the MEC to generate, in a controlled way, perturbations in the link between the AGV and PLC (delay, jitter, and packet drop and corruption).
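For reference, a capture of this kind can be launched programmatically as in the following sketch; the interface name, host addresses and output file are hypothetical placeholders, not the actual 5TONIC values.

```python
# Sketch: start tcpdump on the mirrored interface, keeping only the UDP
# traffic between the (hypothetical) AGV and PLC addresses.
import subprocess

subprocess.run(
    ["tcpdump", "-i", "ens3", "-w", "experiment.pcap",
     "udp", "and", "host", "10.0.0.10", "and", "host", "10.0.0.20"],
    check=True,
)
```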

In the following subsections we detail the set of experiments designed to collect the training and testing data, the preprocessing of such data before inputting them to ML models, and the process we followed for training and testing the deep learning models. Figure 3 summarizes the workflow followed to generate suitable data sets for training and testing the deep learning models. Finally, we describe the deployment of the whole system in a real-time scenario.

Fig. 3 Data collection, preprocessing and ML training workflow

Data generation experiments

The deep learning techniques we selected to forecast the guide error of an AGV are part of a broader family of machine learning methods with supervised representation learning. To apply these techniques, it is necessary to have labelled data sets for training and testing the models; therefore, we designed a set of experiments to obtain a variety of labelled data representative of situations in which different levels of network perturbations appeared in the AGV-PLC communication link.

In each experiment, the AGV drove around the figure-8 circuit repeatedly (at least 5 times), while different network perturbations were reproduced in the AGV-PLC link. The network perturbations were generated programmatically by applying different intensities of delay or jitter to the AGV-PLC connection, although delay and jitter were never generated simultaneously in any experiment. The delay was parameterized in microseconds (between 50 and 300 microseconds) and the jitter was created as a random delay following the Pareto normal distribution, with a mean between 50 and 300 microseconds and a standard deviation between 10 and 50 microseconds.

Three scenarios with different profiles of network perturbations were designed: (i) a clean scenario without any network perturbation, (ii) a static scenario with a constant level of perturbations, and (iii) a dynamic scenario where the level of perturbations introduced in the communication link increases linearly, also called ramping scenario.

In the static scenario, we model a situation with a constant level of network quality degradation with some periods without perturbations. During the first 30 s, the last 30 s, and 30 s in the middle of the experiment, no perturbations were generated, forcing the experiment into a stable situation. In this way, we allow the ML models to learn the transitions from quiet to noisy environments and vice versa. In addition, the ML model can learn what happens when the AGV reaches a stable situation, and how it reacts when sudden perturbations appear and disappear in the network. In the dynamic scenario, the perturbation level is increased every 2 min until the AGV goes off the circuit and stops. This scenario allows the ML models to learn the AGV behaviour when the AGV starts from a stable situation without network perturbations and the network conditions worsen progressively until the AGV goes off course.
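As an example of how such perturbation profiles can be generated, the following sketch drives the Linux tc/netem tool from Python to reproduce a ramping delay scenario; the interface name, step size and ramp duration are assumptions for illustration.

```python
# Sketch: ramp up a netem delay on the MEC-to-radio link every 2 minutes.
# IFACE and the parameter ranges are illustrative assumptions.
import subprocess
import time

IFACE = "eth0"

def set_perturbation(mean_us, jitter_us=0):
    cmd = ["tc", "qdisc", "replace", "dev", IFACE, "root", "netem",
           "delay", f"{mean_us}us"]
    if jitter_us:  # jitter as a random delay with the paretonormal distribution
        cmd += [f"{jitter_us}us", "distribution", "paretonormal"]
    subprocess.run(cmd, check=True)

for mean in range(50, 301, 50):   # increase the level until the AGV derails
    set_perturbation(mean)
    time.sleep(120)
subprocess.run(["tc", "qdisc", "del", "dev", IFACE, "root"], check=True)
```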

All experiments were repeated at least three times to ensure statistical significance. The communication between the AGV and the virtual PLC used the UDP protocol, and only the packets transmitted between the AGV and the PLC were captured, discarding the rest of the traffic crossing the link. The captured traffic was preprocessed using two programs: (i) Tstat, an open source traffic analysis tool that calculates in real-time a set of statistics associated with the AGV-PLC UDP flow, and (ii) an in-house program to extract the AGV variables from the payload of the UDP packets.

Using Tstat, we selected 7 features from the AGV-PLC flow: the interarrival time of the packets going from the client to the server and vice versa, and the number of packets transmitted by the client and server (their absolute values and some ratios). Note that in many network scenarios, the number of transmitted bytes is also selected as an input feature for ML tasks, but in this case, the size of the transmitted frames was a constant of 80 bytes, and therefore, it did not provide any relevant information as it was directly correlated with the number of packets. These features were selected as they had previously obtained good results when used for network traffic classification tasks (Draper-Gil et al., 2016). That work showed that these features provide a good balance between selecting a small number of input variables to generate compact and resource-efficient models and obtaining decent performance when used in machine learning models.

Note that the connection between AGV and PLC is based on the UDP protocol, which is datagram oriented and therefore does not establish error control or congestion control as the TCP protocol does. This means that the control information sent between the AGV and the PLC is not as abundant as that generated in a TCP connection where there is error and congestion control. Therefore, the number of features that the packet aggregator (Tstat) can extract from a communication over UDP is also limited by the nature of this transport protocol.

Future work could explore the possibility of finding a subset of these seven features that would allow machine learning models to be trained with equivalent levels of accuracy to those obtained in our work at a lower storage and computational cost. However, note that the seven features currently chosen are integer counters and ratios between counters, which implies that the storage cost is low even if the window size is large (150 bytes per time step of the window, 90Kbytes in the case of a 60 s window) and the computational cost to obtain them (a dozen additions and multiplications every 100 ms) is almost negligible when compared to the execution cost of our machine learning models during the inference phase.

The AGV variables extracted from the packet payload were (i) the guide error, which measures the distance in centimetres between the AGV and the black magnetic tape line on the floor, and (ii) the status flags (e.g., error code and charging status). After a preliminary analysis, we decided to store only the guide error, as the other variables were not likely to provide any information for the forecasting scenario. All features were stored in separate csv files per experiment, totalling 72 files of approximately 1 GB per file. The final set of features used for the forecasting scenario is summarized in Table 2.

Table 2 Features used as input to deep learning models

Data preprocessing

To exploit the time topology of the collected data, we transformed them into windows of lagged observations (e.g., \(t-1,t-2,\dots t-K\)) to be used as a time series of input variables in a supervised machine learning model that forecasts a future time step (\(t+R\)). This data transformation process was used first for training and testing different combinations of deep neural networks, and later during the real-time deployment of the best performing deep learning models.

In the following subsections, we detail the process to obtain a window-based time series from the collected data, the variable to be forecast based on a combination of future values of the guide error, the training/testing split, and an input data aggregation to increase the forecast performance of the deep learning models.

Time series input data

Collected data was aggregated and transformed into a time-series data set, in which each element of the new data set is a window of lagged observations (\(t-1,t-2,\dots t-K\)) with a granularity of 100 ms.

First, for all variables, each step of the time series was calculated as the mean of the values falling in each 100 ms interval, except for the AGV flags variable, for which the last value was kept. Furthermore, missing values were filled with the last observed value. This phase produced a new data set containing aggregated values for each variable at a granularity of 100 ms. Second, the elements of the time series data set were obtained by grouping the aggregated data from each experiment into time windows of K elements, so that each window contained K prior observations. It is important to note that this process was done for each experiment separately, as lagged observations from different experiments cannot be mixed in the same time window. Given that this two-phase process is applied to all variables, we obtain a three-dimensional data set in which axis 1 represents the set of windows of lagged observations, axis 2 contains the lagged observations of a concrete K-size window, and axis 3 contains the selected variables (network, AGV or combined). A sketch of this transformation is shown below.

The deep learning models were trained with time window sizes between 1 and 60 s to analyse the effect of considering larger or shorter intervals of prior values in the performance of the forecast. For example, assuming a granularity of 100 ms for each variable, a time window of 1 s would have a window of \(K=10\) steps.
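A compact sketch of this two-phase transformation (100 ms aggregation followed by K-step windowing) follows; the column names and the use of pandas/NumPy are assumptions about the implementation, applied to one experiment at a time.

```python
# Sketch: resample one experiment to 100 ms steps and build lagged windows.
import numpy as np
import pandas as pd

def to_windows(df: pd.DataFrame, K: int) -> np.ndarray:
    """df is indexed by capture timestamp; returns shape (n_windows, K, F)."""
    agg = df.resample("100ms").mean()              # phase 1: mean per interval
    if "agv_flags" in df.columns:                  # flags keep their last value
        agg["agv_flags"] = df["agv_flags"].resample("100ms").last()
    agg = agg.ffill()                              # fill gaps with last value
    vals = agg.to_numpy()
    # phase 2: one window of K lagged observations per time step
    return np.stack([vals[i - K:i] for i in range(K, len(vals))])
```

The resulting array directly matches the three axes described above: windows, lagged observations, and variables.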

Forecast variable

This use case aims to predict in advance the guide error of an AGV, as this variable is indicative of whether the AGV is having difficulties maintaining its trajectory with respect to the magnetic tape on the floor.

Forecasting the instantaneous value of the guide error is a difficult task due to the significant amount of noise that we observed the forecast variable may contain. Therefore, we decided to predict ahead of time a statistic of the guide error that can diminish the level of noise in the predicted variable. Given that, due to the inertia of a moving AGV, it takes up to 10 s in the worst case to stop it, we established the period from 10 to 15 s ahead as the useful range in which to predict the guide error statistic.

In addition, when the AGV has difficulty keeping its trajectory, the guide error variable exhibits fast and significant oscillations from positive to negative values and vice versa. Therefore, computing the mean of the sensor value would produce a useless statistic with values always near zero. Hence, we used the absolute values of the guide error variable to avoid positive and negative values compensating each other in the chosen statistic.

Finally, we observed that when the AGV is about to go off the circuit and cannot follow the magnetic tape due to a severe degradation in the network conditions, it starts receiving very abrupt trajectory correction commands from the PLC, which causes the absolute values of the guide error variable to exhibit large differences. In this regard, it is not really significant whether the guide error at a given instant is small or large. For example, in curves the guide error naturally increases, but the AGV navigates correctly. On the other hand, the standard deviation of the absolute values of the guide error can indicate whether the AGV oscillates too much around the path and the movement is erratic.

For the aforementioned reasons, we decided to select as the forecast variable the standard deviation of the absolute values of the guide error in the 10–15 s interval. Thus, the predictions are coarser-grained, but more relevant to detect ahead of time difficulties in the AGV trajectory.
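A sketch of how this target can be computed from the 100 ms series of the guide error is shown below; the function and argument names are hypothetical.

```python
# Sketch: forecast target = std of |guide error| in the 10-15 s future interval.
import numpy as np

def target_at(abs_guide_error: np.ndarray, t: int, step_ms: int = 100) -> float:
    """abs_guide_error: series of |guide error| at 100 ms granularity."""
    lo = t + 10_000 // step_ms    # 10 s ahead -> 100 steps
    hi = t + 15_000 // step_ms    # 15 s ahead -> 150 steps
    return float(np.std(abs_guide_error[lo:hi]))
```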

Testing and training data sets

When training machine and deep learning models, the available data are usually split into training and testing sets. In addition, a validation data set is usually defined for selecting the best combination of hyperparameters for each model.

For each network scenario (with different levels of network perturbation) we ran the AGV experiment three times (rounds) at different times. Each round of an experiment consisted of the AGV running through the 5TONIC lab circuit and being subjected to the network perturbations of that experiment. Therefore, each round was itself a different experiment, but of similar statistical nature as the conditions (perturbations) of the experiment were generated with the same statistical distribution for the three rounds.

It is worth noting that although the three rounds of each experiment used the same circuit: (i) in each round the network perturbations were generated at random (following a statistical distribution), and therefore their effect on the delay and jitter of the packets transmitted between the PLC and AGV was variable and non-deterministic; (ii) the PLC-AGV network packets were transmitted in a 5G network in which packets from other network connections compete with the PLC-AGV packets for transmission on the physical links. Therefore, the aggregation of PLC-AGV packets to obtain the connection statistics fed to the forecasting model was not done in a deterministic way, as it was affected by the times at which PLC-AGV packets were transmitted and received. In light of this non-deterministic scenario, we can conclude that it is almost impossible for the PLC-AGV connection statistics (i.e., data points), collected on the physical link each time a new packet arrives, to coincide in any of the three rounds of the same experiment.

We used the data points collected in the first two rounds of each experiment for training, and the data points from the third round exclusively for testing. Thus, the trained model is confronted in testing with new data points collected from a round that was never used for training. Moreover, since we tested the models with an entire round, we can evaluate the performance of a model in each part of the circuit and throughout the whole experiment.

The dataset composed of the first two rounds of each experiment was randomly split into three parts for training purposes: the first part (80%) was the Training dataset proper, the second (10%) was reserved as the Epoch Validation dataset, and the last (10%) was used as the Hyperparameter Validation dataset. The Training dataset was used to fit the deep learning models. The Epoch Validation dataset was used to stop the training process after a number of training epochs without improvement, or when an overfitting effect was detected on the training dataset with respect to the epoch validation dataset. The Hyperparameter Validation dataset allowed us to choose the best hyperparameter set for each architecture, feature combination and window size. Different combinations of hyperparameters were randomly generated using the Random Search heuristic. When the best hyperparameter combination had been found for each deep learning architecture, feature combination, and past window size, the Test dataset was used on all these models to select the one with the best forecasting performance.
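A minimal sketch of this split, assuming the data points of the first two rounds are stacked in a single array (the names and random seed are illustrative):

```python
import numpy as np

def split_training_rounds(data: np.ndarray, seed: int = 42):
    """Split the data points of rounds 1-2 into Training (80%),
    Epoch Validation (10%) and Hyperparameter Validation (10%) sets.
    The data points of round 3 are kept apart as the Test dataset."""
    idx = np.random.default_rng(seed).permutation(len(data))
    n_train = int(0.8 * len(idx))
    n_val = int(0.1 * len(idx))
    return (data[idx[:n_train]],                 # Training
            data[idx[n_train:n_train + n_val]],  # Epoch Validation
            data[idx[n_train + n_val:]])         # Hyperparameter Validation
```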

Note that the figures and tables presented in the Results section only show the results obtained with the Test data set as representative of a real deployment since these data points were not used in any phase of model training.

It should be noted that the proposed testing process is more conservative than one based on a typical K-fold cross-validation. In K-fold cross-validation, all available data are used to train and validate the model (using different slices for training and testing in each fold), which in our opinion does not give a true measure of how well the model will behave in production on completely new data, as the model is never tested against a totally new experiment. In contrast, our proposal uses two rounds of the experiment for training and a third round exclusively for testing, so the trained model is confronted in testing with data from a round that was never used for training. The result of testing the model with the third round is therefore much closer to the execution of the model in a production environment.

Finally, the input features were normalized using a standardization process (for each feature, subtracting its mean and dividing by its standard deviation) to improve the numerical stability of the models and to accelerate the training process.
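A minimal sketch of this standardization, assuming (as is standard practice, although not detailed above) that the statistics are computed on the training split only:

```python
import numpy as np

def standardize(train: np.ndarray, *others: np.ndarray):
    """Per-feature standardization: subtract the mean and divide by the
    standard deviation, both computed on the training split only."""
    mean, std = train.mean(axis=0), train.std(axis=0)
    std[std == 0] = 1.0  # guard against constant features
    return [(x - mean) / std for x in (train, *others)]
```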

Feature augmentation

In general, the addition of aggregated variables to the input tends to improve the robustness of classical machine learning models (Mozo et al., 2018). In this work, we explored the effect on the forecast performance of combining aggregated input variables with the original ones.

The aggregated variables of a feature X are generated as follows: for each element \(X_i\) of the time series, we compute the mean and standard deviation of X over the last M seconds before \(X_i\), M being a hyperparameter of the model. Note that, given a value for M, this aggregation process generates two new time series composed, respectively, of the calculated means and standard deviations. These two new time series are used together with the original time series of X to subsequently create the new time windows. This aggregation can be seen as an efficient way to accelerate the training of deep learning models, making it easier for them to extract these statistics from past values without spending either model parameters or training steps to learn them. As an illustration, in the case of mean aggregation, let \(\textrm{Xm}_i\) be the aggregated mean of the original variable X at time step i. The mean aggregation \(\textrm{Xm}_i\) is calculated on the prior M values of the variable X at time i as \(\textrm{Xm}_i = \textrm{mean}(X_{i-M},\ldots,X_{i-1},X_i)\).

The aggregation was not applied to ratio-type variables, such as the number of packets per second, as we observed that these variables do not vary significantly over time, and therefore the aggregated variables (mean and standard deviation) would probably not provide significant information to the models.

For the best performing models, we chose two different windows, M and M/2, to calculate the aggregated features.
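For illustration, a pandas-based sketch of this aggregation; the 100 ms granularity comes from our deployment, while the function and column names are illustrative:

```python
import pandas as pd

SAMPLES_PER_SECOND = 10  # 100 ms time-series granularity

def aggregate(x: pd.Series, m_seconds: float) -> pd.DataFrame:
    """Rolling mean and standard deviation of feature X over the last
    M seconds (the window ends at, and includes, the current value)."""
    w = int(m_seconds * SAMPLES_PER_SECOND)
    return pd.DataFrame({f"mean_{m_seconds}s": x.rolling(w).mean(),
                         f"std_{m_seconds}s": x.rolling(w).std()})

# For the best performing models, the aggregations over M and M/2 seconds
# were combined with the original series, e.g. for M = 15:
# augmented = pd.concat([x, aggregate(x, 15), aggregate(x, 7.5)], axis=1)
```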

Deep learning models

Aiming to find the deep learning model with the best performance for forecasting the guide error value of an AGV in real time, we trained and tested different combinations of deep learning architectures, feature sets, past window sizes and feature aggregations.

The first set of proposed deep learning architectures was based on our previous experience in deploying compact deep learning models to forecast network and datacenter events using time series of past events (Pastor et al., 2020; Mozo et al., 2018, 2019). In these works, Fully Connected Neural Networks (FCNN), one-dimensional Convolutional Neural Networks (1D-CNN), and Long Short-Term Memory (LSTM) networks were used as the basic building blocks of the models. Therefore, in our experiments we adopted these blocks to build the first set of deep learning architectures, which consisted of sequences of stacked 1D-CNN, LSTM and FCNN blocks. Using 1D-CNNs and LSTMs makes it possible to discover the topological relationships that can exist among the past and present values of the variables used as inputs to the models. It is worth noting that in recent works, CNNs have been shown to be efficient in revealing short-term relationships in time series variables, whereas LSTMs are very efficient in finding long-term relationships that can exist among these variables (Fu et al., 2019; Lang et al., 2019). Furthermore, it is usual to stack FCNN layers as a final stage to provide the model with universal classifier or regressor characteristics.

Additionally, we selected a variation of the Transformer architecture for time series regression problems, recently proposed in Zerveas et al. (2021), to compare the performance of typical deep learning architectures for time-series processing (LSTM and 1D-CNN) with the state-of-the-art architecture used in sequence-to-sequence problems. Although the Transformer architecture is not specifically designed to work with spatially or temporally related data, we have experimentally observed that its attention module can mimic the LSTM and 1D-CNN behaviour, which enables Transformers to extract topological information from time series in a similar way to LSTMs and 1D-CNNs. It is worth noting that in this type of problem, it has been observed experimentally that architectures that pay attention to the time steps of the input window are more effective than those that pay attention to the variables of the time series (Liu et al., 2021). Therefore, in contrast to the solution presented in Zerveas et al. (2021), we do not employ an input time series reduction layer, as instead of paying attention to variables, we pay attention to time steps.

Given that the architectures used in our experiments consist of a small number of stacked TRA, 1D-CNN, LSTM and FCNN blocks, and that their sizes were not excessively large, we did not apply complex Network Architecture Search (NAS) procedures (Ren et al., 2021) such as DARTS (Liu et al., 2018) or EAS (Ding et al., 2013) to optimise the structure of the deep neural network. Instead, we applied a more straightforward random-search approach to find the best combination of hyperparameters by testing random combinations of them. Keeping the structure of the neural networks fixed and small allowed us to explore small sets of hyperparameter variations (e.g., number of internal layers, filter sizes) using a simpler method such as the random search heuristic.

  1. Four configurations of deep learning architectures, successfully applied in other domains (Mozo et al., 2019), were proposed to evaluate their performance for forecasting the AGV guide error using time series of network and AGV variables: (i) a stack of 1D-CNN layers followed by several FCNN layers (called CNN); (ii) a stack of LSTM layers followed by several FCNN layers (called LSTM); (iii) a stack of Transformer Encoder-Only layers followed by several FCNN layers (called TRA); and (iv) a pipeline of 1D-CNN layers followed by LSTM layers and FCNN layers as the final stage (called MIX).

     The MIX configuration places the 1D-CNN layers at the beginning to preprocess the data with narrow filters that can smooth or highlight details or non-linear relationships present in nearby time steps (i.e., short-term relationships), while the LSTM layers are able to discover long-term relationships among the past observations of a relatively large time window (a minimal sketch of this configuration is given after the list).

  2. Three combinations of AGV and network variables were defined: (i) network variables (NET features), (ii) the AGV guide error variable (AGV features), and (iii) a combination of network and AGV variables (COMPLETE features). Table 2 details the features used in each set.

  3. Five time-window sizes (4, 7.5, 15, 30 and 60 s) were used to allow the model to learn the straights and curves of the circuit.

  4. Aggregations of the mean and standard deviation of several COMPLETE and AGV features were calculated on 15 and 30 s windows, respectively, as these windows produced the best forecasting results in the deep learning architectures without feature aggregation. The objective of this variation was to analyse whether feature aggregation could produce any benefit to the model and thus increase the forecasting performance of those best models.
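The following Keras sketch illustrates the MIX configuration referenced in the list above; the layer counts, filter sizes and unit counts are placeholders, since the actual values were chosen by the random search described below:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_mix(window_steps: int, n_features: int) -> tf.keras.Model:
    """Sketch of the MIX pipeline: 1D-CNN layers to capture short-term
    patterns, an LSTM layer for long-term patterns, and FCNN layers as
    the final regressor. All layer sizes are illustrative placeholders."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(window_steps, n_features)),
        layers.Conv1D(32, kernel_size=5, padding="same", activation="relu"),
        layers.Conv1D(32, kernel_size=5, padding="same", activation="relu"),
        layers.LSTM(64),
        layers.Dense(64, activation="relu"),
        layers.Dense(1),  # std of |guide error| in the [t+10, t+15] s interval
    ])
```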

The training of the deep learning models was performed with the Adam optimizer, using the default learning rate and an early stopping criterion of 20 epochs (the training was stopped when the model did not improve in 20 epochs). Using the Epoch Validation dataset to evaluate model improvement during training reduced the likelihood of overfitting that might have occurred if the Training dataset had been used for this purpose.
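A sketch of this training setup, continuing the build_mix example above; the MAE loss, the restore_best_weights flag and the variable names (x_train, x_epoch_val, etc.) are assumptions for illustration:

```python
model = build_mix(window_steps=150, n_features=12)  # 15 s window; 12 features is illustrative
model.compile(optimizer="adam", loss="mae")  # Adam with its default learning rate

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=20,
                                              restore_best_weights=True)
model.fit(x_train, y_train,
          validation_data=(x_epoch_val, y_epoch_val),  # Epoch Validation dataset
          epochs=500, callbacks=[early_stop])
```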

A set of hyperparameters was defined for each architectural configuration (i.e., CNN, LSTM, TRA and MIX): (i) architecture-related hyperparameters, such as the number of neurons, the number of layers, the number of heads in the case of Transformers, and the filter size in the case of 1D-CNNs; and (ii) regularization-related hyperparameters, such as the level of regularization and the dropout percentage. The use of Batch Normalization layers deals with the vanishing gradient problem that appears in deep neural networks, and thus its activation in a model was added as another hyperparameter of the architectural configuration. In the case of Transformers, the architecture is defined in blocks, often with the same configuration for each block. Normalization layers are part of the Transformer architecture by definition, but the effect of using Layer Normalization layers (used in the original architecture (Vaswani et al., 2017)) versus Batch Normalization layers (used in Zerveas et al., 2021) has also been explored in this work.

Due to the complexity of testing all combinations of hyperparameter values to find the best performing set, a Random Search heuristic was chosen. This algorithm generates a random configuration of hyperparameters, trains a deep learning model, and evaluates it using the Hyperparameter Validation dataset. This process is repeated iteratively until the results obtained do not improve or the time consumed exceeds a predefined limit (we established a limit of 50 rounds in our experiments). It should be noted that using a separate Hyperparameter Validation dataset helps reduce the possibility of model overfitting due to a specific combination of hyperparameter values. Table 3 details each hyperparameter, its range of values, and the probability distribution that the random search heuristic used to select the random values of the hyperparameter.
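A minimal sketch of the heuristic; the search space shown is illustrative (the real ranges are listed in Table 3), and train_and_evaluate is a hypothetical helper that fits a model on the Training dataset and scores it on the Hyperparameter Validation dataset:

```python
import random

SEARCH_SPACE = {                      # illustrative ranges; the real ones are in Table 3
    "n_layers": [1, 2, 3],
    "units": [32, 64, 128, 256],
    "dropout": lambda: random.uniform(0.0, 0.5),
    "batch_norm": [True, False],
}

def sample_config() -> dict:
    """Draw one random hyperparameter combination from the search space."""
    return {name: (spec() if callable(spec) else random.choice(spec))
            for name, spec in SEARCH_SPACE.items()}

best_cfg, best_mae = None, float("inf")
for _ in range(50):                   # the 50-round limit used in our experiments
    cfg = sample_config()
    mae = train_and_evaluate(cfg)     # hypothetical helper (train + validate)
    if mae < best_mae:
        best_cfg, best_mae = cfg, mae
```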

Table 3 Hyperparameters used for training deep neural network models

The first set of experiments was generated as follows: for each architectural configuration (CNN, LSTM, TRA and MIX), feature set (AGV, NET and COMPLETE) and time-window size (4, 7.5, 15, 30 and 60 s), 50 hyperparameter configurations (chosen by the random search algorithm) were evaluated, totalling 3,000 experiments.

A second set of experiments using aggregated features was run with the best performing time-window sizes of the first experiment (15 and 30 s). The feature aggregation process on the 15 s window generated three variations of aggregated features: aggregation over the past 7.5 s, over the past 15 s, and both combined. Similarly, the 30 s window generated another three variations: aggregation over the past 15 s, over the past 30 s, and both combined. This process generated a second set of experiments: for each architectural configuration (CNN, LSTM, TRA and MIX), feature set (AGV, NET and COMPLETE), time-window size (15 and 30 s), and aggregated set of features (3 combinations per window size: M/2, M and both aggregations), 50 hyperparameter configurations (chosen by the random search algorithm) were evaluated, totalling 3,600 experiments.

In both sets of experiments, we used the Hyperparameter Validation dataset to select the combination of hyperparameters that obtained the best performance among the 50 experiments run for each model configuration. Once the random search process was completed for a model configuration, the best performing hyperparameter combination was selected as the representative of this configuration. This process was repeated for each model configuration, and a final evaluation was performed on all the model configuration representatives using the Testing dataset, which was not used during the model training phase. Section “Results and analysis” shows the results obtained with the Testing dataset.

Deployment in 5TONIC lab

In this subsection we detail how the best performing deep learning model was deployed in the 5TONIC laboratory to forecast the AGV behaviour in real-time.

The components and architecture of the use case were previously introduced in the “System model” section, and the real-time deployment of the deep learning model was carried out using exactly these components. Recall that we trained an extensive set of deep learning model configurations (Sect. “Deep learning models”) and then used the Testing dataset to select the best performing model among all of them.

The real-time execution of the system is as follows: each time a packet of the AGV-PLC connection is transmitted and collected, (i) the AGV-PLC connection statistics and internal variables are updated, (ii) a time window of present and past values of these variables (features) is sent to the deep learning model, (iii) a forecast of the AGV guide error is generated by the deep learning model based on the time window values received, and finally (iv) the forecast value is sent to a remote monitoring system.
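A minimal sketch of this loop; all helper names (update_connection_stats, predict, publish_forecast) are hypothetical stand-ins for the Tstat processing, the model inference and the connector publishing described below:

```python
from collections import deque

WINDOW_STEPS = 150            # e.g., a 15 s window at 100 ms granularity
window = deque(maxlen=WINDOW_STEPS)

def on_packet(packet) -> None:
    """Hypothetical handler called for each captured AGV-PLC packet."""
    features = update_connection_stats(packet)  # (i) update statistics/variables
    window.append(features)                     # (ii) slide the time window
    if len(window) == WINDOW_STEPS:
        forecast = predict(list(window))        # (iii) deep learning inference
        publish_forecast(forecast)              # (iv) send to remote monitoring
```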

In the proposed real-time deployment, network packets are collected and processed using the Tstat tool to extract statistical information about the AGV-PLC connection. The AGV state variables are then extracted from the payload of each AGV-PLC packet. Depending on the configuration of the selected deep learning model, only the statistical network information (NET), the AGV internal information (AGV), or both (COMPLETE) are used. The data are pre-processed to create a time window, as explained in the “Data preprocessing” subsection, and fed into the deep learning model, which forecasts the standard deviation of the absolute guide error values in the interval 10–15 s ahead of the current instant.

Forecast values are sent through an IDS trusted connector communication channel to a remote ELK (Elasticsearch-Logstash-Kibana) system to display and monitor the AGV behaviour in advance. The trusted connector is a unidirectional communication system that ensures a secure connection for data streams between the source of the data and a remote receiver. The source part of the IDS connector is deployed on a Docker container in the MEC and receives the prediction results encapsulated in the MQTT protocol. The remote part of the connector is deployed outside the MEC, also on a Docker container, and feeds the ELK Logstash component using HTTP as the transport protocol. The Logstash component stores the data in the Elasticsearch database to be plotted later on the Kibana dashboard. It is worth noting that the remote connector can receive data from many source connectors simultaneously and that the source connector can receive data from more than one ML engine through the MQTT protocol. This design allows information (e.g., network statistical information, AGV internal state, and guide error predictions) coming from multiple AGVs to be sent in parallel to a remote Kibana dashboard using a single trusted connector.

An important requirement to be considered in the real-time deployment of the use case is the minimum frequency at which all system components must operate and communicate with each other. Given that the time series is input to the deep learning model with a step granularity of 100 ms, the Tstat aggregator, the predict function of the deep learning model, the ELK system and the IDS connector must operate and communicate with each other at least at this speed.

Deep learning models are usually invoked with a batch of inputs (e.g., 1024 or more) to make the prediction process more efficient, in particular when GPUs are used. In our real-time deployment, predictions need to be generated and transmitted to the Kibana dashboard as soon as a new time slot is created in the input window (every 100 ms). Therefore, the prediction function must be invoked every 100 ms with a single input, which is quite inefficient from a purely computational perspective, but necessary if the system is to work in real time. This limitation must be taken into account when evaluating the prediction speed of the deep learning model when inputs are provided one at a time. Only when several AGVs use the same deep learning model can a batch of inputs from different AGVs be applied to increase the prediction efficiency of the system.

It should be noted that the right IDS connector shown in Fig. 2 was not implemented in the first version of the real-time deployment, where the focus was on forecasting performance rather than on the corrective actions to be taken when a hazardous situation is detected in advance. In a future evolution of this use case, the right IDS connector will allow a Logistics Process Control (e.g., a human operator or an automated OSS) to react in anticipation of a dangerous situation by transmitting urgent commands to the PLC to bypass its normal operation (e.g., stop the AGV immediately).

Results and analysis

Fig. 4

MAE of the best model configurations. The X-axis represents the time window (4, 7.5, 15, 30 and 60 s) and the Y-axis the MAE value. For each window, the three bars show the best results segregated by feature set (AGV, NET, COMPLETE). The best (lowest) MAE value and the type of architectural model (LSTM, CNN, MIX or TRA) are shown at the top of each bar

In this section we present the results obtained by the representatives of each model configuration, segregated by deep learning architectural configuration, input feature set, time window size and aggregated features. The Testing dataset was used to compute the performance of each model. In the “Evaluation and comparison of model configurations” subsection we study the performance of the models segregated by deep learning architecture, input feature set, and time window size. The “Model improvement using aggregated features” subsection analyses the effectiveness of adding aggregated features to the input of the models. Finally, the feasibility of a real-time system deployment is discussed in the “Real time deployment” subsection.

Evaluation and comparison of model configurations

A total of 60 different deep learning model configurations were generated by combining deep learning architectures (CNN, LSTM, TRA and MIX), input feature sets (NET, AGV, COMPLETE), and time window sizes (4, 7.5, 15, 30 and 60 s). We sampled 50 random combinations of hyperparameter values for each model configuration, yielding a total of 3,000 trained deep learning models (see Sect. “Deep learning models” for more details).

In this section we evaluate the ability of these models to predict the standard deviation of the absolute value of the AGV guide error in the interval 10–15 s ahead of the current time. All models were compared to a naive predictor that acted as the baseline model. The naive predictor computed the standard deviation of the guide error values observed in the last 5 s (50 steps).
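A minimal sketch of this baseline (names are illustrative):

```python
import numpy as np

def naive_baseline(guide_error: np.ndarray, t: int) -> float:
    """Baseline predictor: std of the guide error values observed in the
    last 5 s (50 steps at 100 ms granularity) before instant t."""
    return float(np.std(guide_error[max(0, t - 50):t]))
```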

Figure 4 summarizes the obtained results. The time window sizes in seconds (4, 7.5, 15, 30 and 60) are represented on the X-axis and the smallest Mean Absolute Error (MAE) of the deep learning architectures (CNN, LSTM, TRA or MIX) is shown on the Y-axis. The results are segregated by feature set (AGV, COMPLETE and NET) and represented with bars in different colors and patterns. The best performing deep learning architecture (CNN, LSTM, TRA or MIX) and the obtained MAE are shown on top of each bar. Table 4 extends the information in Fig. 4, showing the architectural details of the best models segregated by time window size (7.5, 15 and 30 s) and by feature set (AGV, COMPLETE and NET). The best model configuration for each feature set is highlighted in yellow.

Table 4 Architecture configurations of the best models segregated by time window size (7.5, 15 and 30 s) and by feature set (AGV, COMPLETE, NET)
Fig. 5

MAE improvement over the baseline. The Y-axis represents the percentage of improvement over the baseline model (standard deviation of the observed guide error values in the last 5 s). The X-axis represents the time window (4, 7.5, 15, 30 and 60 s). For each time window, the three bars present the best results segregated by feature set (AGV, NET, COMPLETE). The best (highest) MAE improvement and the type of architectural model (LSTM, CNN, TRA or MIX) are shown at the top of each bar

Fig. 6

The symmetric mean absolute percentage error (SMAPE). The Y-axis represents a relative error in percentage, SMAPE (a value lower than 50% is considered an accurate prediction in industry). The X-axis represents the time window (4, 7.5, 15, 30 and 60 s). For each time window, the three bars present the best results segregated by feature set (AGV, NET, COMPLETE). The best (lowest) SMAPE and the type of architectural model (LSTM, CNN, TRA or MIX) are shown at the top of each bar

To highlight the differences of the model configurations against the baseline, Fig. 5 presents the same results as Fig. 4, but on the Y-axis we plot the percentage of improvement achieved with respect to the baseline model (\(\frac{\text{MAE}_{base}-\text{MAE}_{model}}{\text{MAE}_{base}} \times 100\)). Note that, in contrast to Fig. 4, in this figure the larger the bar, the greater the improvement.

In addition to MAE, we use a second metric to measure the relative percentage error with respect to the true value. Note that the typically used MAPE metric is not applicable to this problem, since the guide error takes the value zero on many occasions, which produces a division-by-zero error during the MAPE calculation. Therefore, in order to apply a relative metric that circumvents this problem, we selected a very similar metric called Symmetric MAPE (SMAPE) (Lago et al., 2021). Figure 6 shows a bar plot analogous to Fig. 5 but with the SMAPE values on the Y-axis. It is generally accepted that any SMAPE value lower than \(100\%\) implies an error lower than the true value and, specifically in the industrial field, a value lower than \(50\%\) implies a sufficiently accurate prediction (Blasco et al., 2013). As can be observed in Fig. 6, all the best models have a SMAPE value of less than \(50\%\).
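For reference, we note one common definition of SMAPE (the factor-of-two variant, bounded above by \(200\%\)), where \(A_i\) are the actual values and \(F_i\) the forecasts over \(n\) points:

\[ \textrm{SMAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\frac{2\,|F_i - A_i|}{|A_i| + |F_i|} \]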

In light of the MAE and SMAPE results (Figs. 4, 5 and 6), it can be concluded that (i) all the best models obtained significantly better results than the baseline model and (ii) the Transformer models obtained the best results in nearly all time windows and feature sets. Only when a 15 s time window was used did a 1D-CNN obtain the best result, although the best Transformer model came second, very close to the performance of the 1D-CNN (see Table 5).

In the “Deep learning models” subsection we mentioned that CNNs have been shown to be efficient in revealing short-term relationships in time series variables, whereas LSTMs are very efficient in finding long-term relationships among these variables. Therefore, we conjecture that the data in this problem have complex temporal relationships. Long-term relationships do not appear to be predominant, as no LSTM managed to be selected among the best models. The same reasoning can be applied to short-term relationships, as only two 1D-CNN models are among the best models. In some cases, there seems to be a mix of short- and long-term relationships, of which the MIX architectures (pipelines of 1D-CNN and LSTM) take some advantage. But in general, the data seem to contain a mixture of short-, medium- and long-term relationships, since it is the Transformer architecture, using the attention mechanism, that is the best performing architecture at identifying and extracting them in nearly all scenarios.

The Transformer models we trained demonstrate experimentally that they can replicate the behaviours of 1D-CNNs and LSTMs and, in addition, since they are more powerful, they can also find more complex relationships among the input variables in the time series. That is why Transformers emerged as winners in almost all combinations of time windows and feature sets in our experiments. Note that Transformers can mimic the LSTM behaviour by giving more importance to distant data in the time series, and the 1D-CNN behaviour is replicated by giving more importance to topologically closer data. This flexible and powerful behaviour is possible due to the attention mechanism of the Transformers, which allows the model to focus on different locations of the time series at the same time.

In general, we can conclude that the best performing models use the COMPLETE set of features and a time window of 15 s. A 1D-CNN model obtained the best MAE and SMAPE values, although a Transformer model obtained a very close result (see Table 5). This result can be explained by the inherent complexity of the Transformer architecture compared to 1D-CNNs, which affects the training times of the models in each architecture. As we trained the same number of model combinations per architecture, we would expect that by training more combinations of Transformer hyperparameters, and for a longer period of time, we would eventually find a Transformer model that outperforms the 1D-CNN that obtained the best result. Nevertheless, using only the AGV variable, a Transformer with a larger window of 30 s obtained a very close result, which indicates that in the absence of network information we only have to increase the time window to maintain an optimal performance.

It can be seen that, once the optimal MAE and SMAPE values are reached, no model improves as the size of the time window increases. A possible explanation is that enlarging the time window tends to give unnecessary importance to events that are far in the past.

It is worth observing that the best model configuration for each feature set is obtained with a different time window size (NET with 7.5 s, COMPLETE with 15 s and AGV with 30 s).

Given that the optimal window size for the NET feature set is 7.5 s, only recent NET information (i.e., network perturbations) seems to affect the performance of the model predictions. We conjecture that the difficulties an AGV experiences when guided by a remote PLC are less related to past network conditions than to recent ones. Therefore, future network conditions that could affect AGV guidance can be inferred without the need for very distant network information. Conversely, AGV guide error data going far back in time can help to more accurately identify where in the circuit the AGV is located and thus more accurately predict its future location. Knowing the future location of the AGV in advance can be crucial in determining whether the AGV will have more difficulty being guided by a remote PLC. For example, when reaching a curve, the remote PLC needs to send commands to change the direction of the AGV more frequently than when the AGV is on a straight part of the circuit. Therefore, severe difficulties with AGV guidance are more likely to arise when network disturbances appear in the middle of a curve than when the AGV is in the middle of a straight. Furthermore, when both types of features are used (COMPLETE feature set), a balanced window size (15 s) seems to be the most appropriate for the deep learning models to obtain an optimal forecast. The intermediate window size of 15 s allows the deep learning model to extract from both NET and AGV features the information relevant to forecasting the AGV guide error with higher precision.

For almost all time window sizes, the best results are always obtained using the COMPLETE feature set. This result suggests that the network features provide beneficial information for forecasting the behaviour of the AGV. This is partly because the difficulties that an AGV can suffer are subject to the network conditions affecting the PLC-AGV communication. Therefore, using network information can help to forecast perturbations in the PLC-AGV connection that will be directly translated into difficulties in the AGV guidance.

It is worth noting that we obtained an improvement of \(21\%\) over the baseline using only network features (NET configuration) with a time window of 7.5 s. This result is very useful to demonstrate that even in scenarios where the packet payload is encrypted (e.g., when the AGV and PLC are connected through a public network), we can anticipate the AGV behaviour with decent performance using only the features extracted from the network packets.

Table 5 summarizes the results for all combinations of architecture, feature set and window size, using MAE, with MSE and SMAPE as complementary error metrics. Recall that MSE is very efficient for learning outliers, while MAE is good at ignoring them. Furthermore, SMAPE allows the problem to be approached from a point of view relative to the actual value, in percentage (a lower percentage is better). Note that although in this study we decided to select the best models according to the MAE metric, models can be chosen according to other criteria.

Fig. 7

Forecasting behavior of the best prediction models for the three types of feature sets (COMPLETE, AGV and NET) in two experiments under different network disturbance levels. The models are referred to as MODEL-FEATURES-TIME_WINDOW. Experiment 1 (left column): a moderate jitter disturbance was generated in the network with 50 ms of mean delay and 10 ms of standard deviation. Experiment 2 (right column): incremental jitter was generated in the network using a standard deviation of 50 ms and an increasing mean value starting at 50 ms and ending at 250 ms, incremented in 50 ms steps after each circuit lap was completed. Guide error and predicted values are represented by blue and red lines, respectively. The vertical axis represents the value of the predicted variable: the standard deviation of the absolute value of the guide error between \(t+10\) and \(t+15\). The horizontal axis represents the relative time of the experiment in seconds since its start. The MAE and SMAPE values obtained by the model in the experiments are shown in parentheses

In addition to the analysis of the error metrics, it is interesting to compare the forecasting behavior over time of the best models per feature set in two paradigmatic scenarios. Figure 7 compares the model forecast (red line) with the actual value of the predicted variable (blue line) in two experiments where different levels of network disturbance were introduced: moderate and increasing jitter. These two experiments were specifically chosen from the total set of experiments to contrast the behavior of the models when moderate and increasing perturbations appear in the network. In fact, the second scenario confronts the forecast models with an extreme scenario of increasing network disturbances, which produces increasingly higher guide error values. The figures in the left column (7a, c, and e) represent the forecast and the actual values in the first experiment, where a moderate jitter was generated in the network with a mean of 50 ms and a standard deviation of 10 ms. Furthermore, the figures in the right column (7b, d, and f) show the forecast and the actual values obtained in the second experiment, a more aggressive jitter scenario, in which the mean of the introduced delay followed a ramp profile from 50 to 250 ms (incremented by 50 ms at the end of each lap) and the standard deviation was 50 ms. The models shown are the best models per feature set (highlighted in yellow in Table 4): (i) COMPLETE feature set: 1D-CNN (CNN) using a 15 s window; (ii) AGV feature set: Transformer (TRA) using a 30 s window; and (iii) NET feature set: Transformer (TRA) using a 7.5 s window.

In the first experiment, with moderate and constant jitter (Fig. 7, left column), the three input feature sets (COMPLETE, AGV and NET) allow the models to capture the trend of the guide error variable and to forecast it with decent performance. In particular, using the COMPLETE and AGV feature sets as input, the models (CNN and TRA) produce predictions ahead of time that are closer to the real values, which is reflected in lower MAE and SMAPE values than in the case of the Transformer model (TRA) that used only NET features. Despite obtaining the same MAE and SMAPE values, the Transformer model (TRA) using the AGV feature as input seems to follow the changes in the signal better than the 1D-CNN model using the COMPLETE feature set. The 1D-CNN model, in contrast, produces more conservative predictions around the mean values of the signal, which nevertheless allows it to obtain good MAE and SMAPE values. The Transformer model (TRA) using the NET feature set as input catches the general trend of the signal, but fails more frequently than the other two models to predict the changes in it.

In the second experiment, with increasing jitter (Fig. 7, right column), only the models that use the COMPLETE and AGV feature sets as input are able to forecast with precision the trend of increasing guide error values that appears around second 230, when the increasing jitter starts to severely affect the AGV guidance.

It is worth noting that although these two experiments are not representative of all the experiments conducted, they provide an interesting approximation of how the models behave when different input features are used. We can conclude that the two models using all the features (COMPLETE) (Fig. 7a, b) and the AGV feature (Fig. 7c, d) are able to capture the trend of the guide error variable in both experiments (constant and increasing jitter). Therefore, these two models, although they do not perfectly forecast the guide error variable, serve two key purposes in our work: (i) they predict the trend of the variable and, most importantly, (ii) they predict early enough when problems start to appear in the AGV (e.g., from second 230 in the second experiment) for it to be stopped in time. In other words, the forecast of the instantaneous value of the guide error is not as important as being able to predict in advance the trend of the values in order to act on the AGV in time (e.g., stop it, slow it down). In contrast, it can be observed that the Transformer model using only the NET feature set clearly fails to predict in advance the increasing trend of the guide error variable in the second experiment, where the network disturbances are constantly increasing. This scenario is undesirable, as these increasing guide error values might imply an AGV running out of the circuit, which could eventually produce a harmful situation.

Model improvement using aggregated features

In this section we analyse whether the use of aggregated features allows to improve the prediction performance. To this end, new deep learning models were trained using the original sets of input features (see Table 2) together with the aggregated features presented in “Feature augmentation” subsection.

We selected the mean and standard deviation of the prior values of each original feature to obtain the aggregated values. Two window sizes, 15 and 30 s, were selected to evaluate whether the aggregated features could help increase model performance. These two windows were selected because the COMPLETE and AGV feature sets obtained their best results with windows of 15 s and 30 s, respectively (Sect. “Evaluation and comparison of model configurations”, Fig. 4). For each architectural configuration (TRA, CNN, LSTM and MIX), feature set (AGV, NET and COMPLETE), time-window size (15 and 30 s), and aggregated set of features (3 combinations of aggregations per window size), 50 hyperparameter configurations (chosen by the random search algorithm) were evaluated, totalling 3,600 experiments.

Fig. 8

Comparison of the prediction error (MAE) for the best deep models using three feature sets (AGV, COMPLETE, and NET) with and without aggregated features. The X-axis divides the results into without aggregation (left) and with aggregation (right). The Y-axis represents the MAE value, which is also written on top of each bar. The results without aggregation (No Aggregated) are the same as in Fig. 4 for the 15 s time window (a) and the 30 s time window (b)

Fig. 9

Comparison of the symmetric mean absolute percentage error (SMAPE) for the best deep models using three feature sets (AGV, COMPLETE, and NET) with and without aggregated features. The X-axis divides the results into without aggregation (left) and with aggregation (right). The Y-axis represents the SMAPE value, which is also written on top of each bar. The results without aggregation (No Aggregated) are the same as in Fig. 6 for the 15 s time window (a) and the 30 s time window (b)

Figures 8a and 9a show the MAE and SMAPE values obtained by the best models using as input a 15 s window and two aggregations of 7.5 and 15 s. Figures 8b and 9b show the MAE and SMAPE values obtained by the best models using as input a 30 s window and two aggregations of 15 and 30 s. In preliminary experiments, we observed that the best results were obtained when the original features were combined with aggregations of two sizes: the total length of the time window and half of it (7.5 and 15 s for the 15 s window, and 15 and 30 s for the 30 s window). Therefore, we present only the comparison of the original features with this type of aggregation. In both figures, the three bars on the left represent the original results without aggregated features, and the three bars on the right represent the results when the aggregated features were added to the input features of the model. Recall that the aggregated time series contain past values farther back in time than those contained in the original window; this method is therefore a way of extending the window back in time with aggregated values instead of real values.

Analysing the MAE results in Fig. 8, we can conclude that the aggregated features did not help the best models of each feature set to improve on the original configuration (without feature aggregation). However, looking at the SMAPE values in Fig. 9, the feature aggregation process seems to have boosted the performance of the two MIX models (with time windows of 15 and 30 s, respectively) that use the AGV variable as input. This enhancement allowed the two MIX models to slightly outperform the best Transformer models in the two categories (AGV as input with time windows of 15 and 30 s). However, when the COMPLETE or NET feature sets were used as input, no model obtained better performance with aggregated features.

Therefore, we can conclude that the feature aggregation process, in which we add summarized past data, seems to be beneficial only for traditional deep learning models (1D-CNNs and LSTMs) when the AGV feature is used as input. On the contrary, the Transformer models always obtained their best results using only raw past data without any aggregation, independently of the input feature set and the time window size.

Although traditional machine learning models tend to increase their performance with the addition of aggregated features (Mozo et al., 2019), in the case of deep learning models the literature suggests that the addition of such features should be unnecessary, given the capabilities of deep learning models to find complex linear and nonlinear relationships between all input features. However, in our experiments we have seen that on some occasions (e.g., when the AGV feature set is used as input) the traditional deep learning models (1D-CNNs and LSTMs) have taken advantage of this pre-processing.

In light of these experiments, we draw two interesting conclusions. First, aggregated past information can help deep learning models to increase their forecasting performance, but, if available, past data without aggregation will produce better results, as deep learning models can extract more information from the raw data. Second, once the optimal window size is reached, adding past information, whether aggregated or not, does not improve model performance and, worse, in some cases decreases it.

Fig. 10

The first column shows the forecasting speed of the best models (in predictions per second) when a single AGV is operated and a specialised GPU (a) or a CPU (c) is used. The second column presents the forecasting speed of the best models (in batches of 256 predictions per second) using a GPU (b) or a CPU (d) when 256 AGVs are operated at the same time with a single model. The selected models are the best ones from Sect. “Evaluation and comparison of model configurations”, segregated by feature set (AGV, COMPLETE and NET) and time window (7.5, 15 and 30 s)

Real time deployment

In this section we detail some considerations regarding the real-time deployment of the previously described deep learning models. All measurements shown in this section were conducted on an off-the-shelf computer with 64 GB of RAM, an Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60 GHz with 12 cores, and a GTX 1080 Ti GPU with 12 GB of memory.

Recall that the granularity of all time series input to the models was 100 ms. Thus, a model that works in real time has to provide a prediction speed of at least 10 predictions per second. In this context, we analyse the speed performance of the best models presented in the “Evaluation and comparison of model configurations” section (Fig. 4), segregated by feature set and window size.

Figure 10a and c show the prediction speed (number of predictions per second) on a specialised GPU and on a CPU, respectively. The deep learning models are the same as those used in Fig. 4, but in this case we only present the best models for the 7.5, 15 and 30 s time windows. The results show that, using the aforementioned hardware, all deep learning models are able to make more than 10 predictions per second for a single AGV, even when running on a CPU instead of a specialised GPU. Considering that the minimum required speed was 10 predictions per second, we can conclude that the speed of all models is much higher (ten times in the worst case) than the minimum required for a real-time deployment, and therefore all the best models presented in Fig. 4 can be deployed in our real-time scenario.

In general, it can be seen that, regardless of whether they are executed on the CPU or the GPU, complex architectures such as Transformers are slower than the other deep learning models (1D-CNNs and MIX models). It is worth noting that, when a single AGV is deployed, all models except the 1D-CNNs are slower when running on the GPU than when executed on the CPU. This counter-intuitive result is due to the fact that the GPU has to move the input data from a RAM buffer to its internal memory for each prediction, which is costly, particularly when the move is done for a single input element rather than for the entire set of elements that can be put into a memory buffer. The exception to this behaviour is the 1D-CNN models, which take better advantage of the massively parallel GPU architecture to run much faster on it (twice as fast in the worst case) than on a CPU, even compensating for the aforementioned movement of input data from RAM to GPU buffers. Finally, we did not observe any significant impact of the window size or the number of input features on the speed of the models running on a GPU. In contrast, we observed that models running on the CPU tend to be slower when the time window or the number of input features is larger, as the CPU does not have a specialised architecture, like GPUs, to process the input data in parallel regardless of its size.

In a realistic production environment, the situation can be much more challenging if several AGVs are controlled simultaneously by a single deep learning model to predict their guide errors. To this end, we consider a new scenario with 256 AGVs working in parallel and a single deep learning model receiving their inputs to generate the corresponding predictions. We studied whether a single deep learning model, using the same hardware stated at the beginning of this section, can satisfy the required real-time speed of 10 predictions per second for all 256 AGVs. To process all predictions efficiently, we run the model prediction function on a batch of 256 inputs, each coming from a different AGV. In this way, we exploit the parallelization capabilities of the TensorFlow library to generate all the AGVs' predictions efficiently, in particular when the deep learning models are deployed on a specialised GPU.
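A minimal sketch of this batched inference, reusing the model from the earlier sketches; agvs and window_for are hypothetical stand-ins for the per-AGV window management:

```python
import numpy as np

# One input window per AGV, stacked into a single batch of 256 elements.
batch = np.stack([window_for(agv) for agv in agvs])  # shape: (256, steps, features)
forecasts = model.predict_on_batch(batch)            # one guide error forecast per AGV
```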

Figure 10b and d show the comparison of the batch prediction speed (number of predicted batches per second, using 256 as the batch size) on a GPU and a CPU, respectively. In addition, the last two columns of Table 5 compile the GPU and CPU prediction speeds (in batch predictions per second and in seconds per batch prediction), segregated by feature set (AGV, COMPLETE, NET), time window (7.5, 15 and 30 s) and model architecture (CNN, LSTM, TRA and MIX). It should be noted that each batch prediction generates 256 predictions; therefore, to compare the values of Fig. 10b and d (256 AGVs per model) with those of Fig. 10a and c (1 AGV per model), the batch figures should be multiplied by 256. Note that the units of the Y-axis in Fig. 10b and d are batch predictions per second, to highlight whether the obtained speed is enough for a real-time deployment (i.e., whether the number of predictions per second and per AGV is greater than 10).

When a GPU is used, the prediction speeds observed in Fig. 10b and in the penultimate column of Table 5 are always greater than 10 batch predictions per second, which guarantees a real-time deployment in the proposed scenario of a single model running on a GPU and dealing with 256 AGVs in parallel. However, it can be observed in Fig. 10d and in the last column of Table 5 that some CNN, Transformer and MIX models are not able to meet the real-time requirement of generating at least 10 predictions per second when the CPU is used. The increase in the amount of input data to be processed cannot be adequately managed by the CPU, as it does not have a massively parallel architecture to handle large amounts of data in parallel as GPUs do. The solution to this issue is simple and consists of running the problematic models on a specialised GPU or a more powerful CPU.

Table 5 Best models segregated by feature set (AGV, COMPLETE, and NET), time window (7.5, 15, and 30 s), and architectural network configuration (CNN, LSTM, MIX and TRA)

Finally, two important aspects of a real-time deployment need to be considered: (i) how to detect when a supervised model needs to be retrained due to changes in the statistical behaviour of the input data, and (ii) how to determine the cost and feasibility of such retraining.

Given that the network perturbations appearing in 5G networks are similar to the ones generated in our experiments, and that the physical AGV components (e.g., the guide error sensor) are not expected to suffer relevant degradation over long periods of time (months) that could change their physical response and thus alter the performance of the predictions, we presume that the deep learning models should not require retraining at short intervals (e.g., days or weeks). Therefore, only after several months have passed, or a substantial change in the environment has been introduced, would it be necessary to retrain the model to cope with the appearance of data drift problems.

Note that if the use case were moved to a production environment that included variations with respect to the training and testing setups, or if the model performance needed to be increased, additional data would likely need to be collected from the specific environment and applied to the previously trained deep models. In this situation, we suggest preferably using Transfer Learning techniques (Pan & Yang, 2009) to refine the models and avoid training them from scratch. These techniques avoid a complete retraining of deep learning models, as they only require a small amount of new data to be applied in an incremental training step to the current models to learn the variations of the new environment. It should be noted that Transfer Learning not only allows the existing model to be adapted to changes in the input data, but also allows the performance of the model to be improved to the required real-time values by increasing its complexity with new layers. Transfer Learning avoids a full retraining by reusing the current model parameters in two different ways: (i) as the starting values for the old parameters in a new training process (i.e., there is no need to initialise the parameters with random values, and therefore convergence to the new optimal values is likely to be much faster), or (ii) as fixed, non-trainable parameters, which also speeds up the retraining process, as only the new parameters have to be optimised during the training of the model.
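A minimal Keras sketch of option (ii), reusing the model from the earlier sketches; x_new and y_new stand for the small dataset collected in the new environment:

```python
import tensorflow as tf

# Option (ii): keep the already trained layers as fixed feature extractors
# and optimise only a small new head on data from the new environment.
base_layers = model.layers[:-1]       # drop the old output layer
for layer in base_layers:
    layer.trainable = False           # old parameters become non-trainable

refined = tf.keras.Sequential(base_layers + [
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),
])
refined.compile(optimizer="adam", loss="mae")
refined.fit(x_new, y_new, epochs=50)  # x_new/y_new: small dataset from the new setup
```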

Conclusions

This work presents an interesting use case that combines Industry 4.0, 5G networks and deep learning in a realistic real-time deployment (the 5TONIC laboratory). The performance of deep learning techniques to predict in advance the guide error variable of an AGV remotely controlled by a virtualized PLC has been evaluated. In this way, decisions can be made in advance when a dangerous or harmful situation is predicted to occur in the near future.

A new generation of AGVs remotely controlled by a virtualised PLC was selected for the experiments. The remote PLC was deployed in a 5G MEC infrastructure to guarantee minimum latencies during the communication with the AGV. We proposed the application of deep neural networks to analyse in advance the behaviour of the AGVs by capturing the packets of the PLC-AGV connection, without using any sensor in the user equipment (AGV or PLC), to facilitate the real-time deployment of the solution. To implement the forecasting models, we selected two traditional deep learning architectures (1D-CNN and LSTM) and the current state-of-the-art technique in sequence-to-sequence tasks, the Transformer neural network. We wanted to evaluate whether Transformer networks could outperform traditional deep learning sequence models in this problem.

In an extensive set of 80 experiments run in the 5TONIC laboratory, the communication between the AGV and the virtual PLC was subjected to different network degradation profiles. Network packets from the AGV-PLC communication were captured and preprocessed to be used as input to train and validate advanced deep learning forecast models. We trained and tested 6,600 deep learning models, segregated by model architecture (TRA, LSTM, CNN and MIX), feature set (AGV, NET, COMPLETE) and time window size (4, 7.5, 15, 30 and 60 s). In addition, we precomputed a set of aggregated statistics of past values to analyse whether this past information, added to the original input features, could increase the performance of the models. The deep learning models were trained to forecast the standard deviation of the guide error variable in the interval of [10, 15] s ahead of the current instant, since, due to its inertia, an AGV needs up to 10 s in the worst case to stop. This time is sufficient to brake the AGV or perform other corrective actions to prevent it from going off track.

In light of the results, we can conclude that (i) all the best models obtained significantly better results than the baseline model and (ii) the Transformer models obtained the best results in nearly all time windows and feature sets. Only when a 15 s time window was used did a 1D-CNN model obtain the best result, although the best Transformer model came second, very close to the performance of the 1D-CNN. Although the best models do not perfectly forecast the guide error variable, they serve two key purposes in our work: (i) they predict the trend of the guide error variable and, most importantly, (ii) they predict early enough when problems start to appear in the AGV for it to be stopped in time, avoiding any potentially harmful situation should the AGV leave the path. In other words, the forecast of the instantaneous value of the guide error is not as important as the fact that we can predict the trend of the values in advance, in order to raise an alarm and act on the AGV in time (e.g., stop it, slow it down).

The data of this problem seem to contain a mixture of short-, medium- and long-term relationships, since it is the Transformer architecture, using the attention mechanism, that is the best performing architecture at identifying and extracting them in nearly all scenarios. It is worth noting that although the Transformer architecture does not employ layers that directly relate the data sequentially, as RNNs, LSTMs or 1D-CNNs do, we have experimentally observed that the Transformer multi-head self-attention blocks can replicate this behaviour by giving different importance to the elements in the sequence. Therefore, being a more general and powerful architecture, Transformers can learn during training complex sequential behaviours such as those of recurrent (RNN, LSTM) or convolutional neural networks.

When using Transformers to forecast time series variables, there is a trade-off to consider: as they are much more complex than traditional deep learning models, they can take longer to train and their inference times are slower, but in return they are able to extract much more complex relationships between the variables in the time series.

It is worth noting that, although the models using the NET feature set obtained the worst results when compared to the models using the AGV or COMPLETE feature sets, the best model using the NET features obtained a \(21\%\) improvement over the baseline. This result encourages the use of this approach when the AGV-PLC communication is encrypted (for example, when transmitted over a public 5G network) and only network features can be extracted from the packets.

Adding aggregated past values to the input can help to increase model performance only when traditional deep learning models (1D-CNNs and LSTMs) are used and the AGV feature is used as input. In contrast, the Transformer models did not show any significant improvement when this pre-processing was applied to the input.

Using a modest off-the-shelf PC, we demonstrated that the real-time requirements of this use case (i.e., at least 10 predictions per second must be generated by the model) can be fulfilled without any problem, as in the worst case 67 predictions can be generated per second by a Transformer model running on the CPU. Using a specialised GPU does not provide any significant advantage when the model is used to forecast the guide error of a single AGV. When using the same model for a set of 256 AGVs, the high degree of parallelization of GPUs makes it possible to maintain the individual prediction speed of each AGV even though 256 predictions are generated at the same time. However, some models (Transformers, 1D-CNNs and MIX) are not able to meet the real-time requirement of generating at least 10 predictions per second when the CPU is used. To solve this issue, we suggest running the problematic models on a specialised GPU or a more powerful CPU.

We presume that the deep learning models we proposed for this use case will not require frequent retraining (e.g., every few days or weeks), since the network perturbations appearing in 5G networks are similar to the ones generated in our experiments, and the physical AGV components (e.g., the guide error sensor) are not expected to suffer relevant degradation over long periods of time (months). That said, we suggest adopting a Transfer Learning approach for incremental training to speed up the process when data drift problems appear.

As future work, the application of these techniques to other types of AGVs, such as automated forklifts, can be investigated. Given that the retraining of deep learning models deployed in a production environment is a process that must be repeated over time due to changes in the physical components, online learning strategies should be researched as a complementary technique to the proposed offline retraining, to avoid the occurrence of large errors that may affect the accuracy and precision of the deployed model. Furthermore, more complex deep neural network architectures can be evaluated using DARTS or EAS techniques to see whether they can improve on the performance obtained in our experiments. Additionally, investigating the application of well-established techniques (e.g., the Kolmogorov–Smirnov test and the Kullback–Leibler divergence) will allow the identification of data drift problems that may lead to retraining obsolete models.