Abstract
The last decade has seen the emergence of Deep Neural Networks (DNNs) as the de facto algorithm for various computer vision applications. In intelligent edge devices, sensor data streams acquired by the device are processed by a DNN application running either on the edge device itself or in the cloud. However, “edge-only” and “cloud-only” execution of State-of-the-Art DNNs may not meet an application’s latency requirements due to the limited compute, memory, and energy resources in edge devices, dynamically varying bandwidth of edge-cloud connectivity networks, and temporal variations in the computational load of cloud servers. This work investigates distributed (partitioned) inference across edge devices (mobile/end device) and cloud servers to minimize end-to-end DNN inference latency. We study the impact of temporally varying operating conditions and the underlying compute and communication architecture on the decision of whether to run the inference solely on the edge, entirely in the cloud, or by partitioning the DNN model execution between the two. Leveraging the insights gained from this study and the wide variation in the capabilities of various edge platforms that run DNN inference, we propose
1 INTRODUCTION
In recent years, Artificial Intelligence (AI) and specifically Deep Learning (DL) have become the dominant data analytics technology for cloud and edge computing. Intelligent applications based on DL, which are prevalent in various domains such as Computer Vision (CV) (e.g., face recognition, autonomous driving, video captioning, super resolution), Natural Language Processing (NLP) (e.g., machine translation, speech recognition, sentiment analysis), and recommendation systems (used by Facebook, Amazon, Netflix, LinkedIn, etc.), deeply impact our lives and have fundamentally altered the way we interact with computing [61]. The success of these applications can be attributed to the ever-improving computing power of cloud-based data centers and to the ever-decreasing cost and increasing ease of deploying DL-based solutions in various types of edge devices. However, cloud computing infrastructures have been increasingly challenged by the growth of these workloads [53, 76]. Therefore, DL-based edge intelligence has garnered significant attention, as it complements the cloud by alleviating the load on the backbone network and providing an agile response. Recent years have witnessed rapid growth of edge computing due to widespread research and innovation in Internet of Things (IoT) devices, embedded sensors, and smart systems coupled with ubiquitous wireless communication. Consequently, edge intelligence has enabled the democratization of AI to facilitate AI “for every person and every organization” [5]. Among the different technologies that encompass edge intelligence, we specifically focus on Deep Neural Network (DNN) inference.
Among the prevalent edge computing techniques for DL as shown in Figure 1, previous work [63] performs complete offloading in which the edge device offloads computation requests along with sensory data to the resource-rich cloud for DL inference. We refer to this approach as Cloud-only Inference (CoI). Although this technique allows the deployment of highly accurate but compute/memory-intensive DNNs in the cloud, high transmission cost, strict application latency demands, and lack of reliable network connectivity heavily impact the application efficiency. Many of these problems can be mitigated by using on-device or Edge-only Inference (EoI), where the entire DNN is executed on the edge device. Concerted efforts to accelerate energy-efficient EoI have focused on customizing and optimizing edge hardware and software, such as the design of highly optimized mobile CPUs/GPUs/ASICs/FPGAs and edge DL frameworks (TensorFlow Lite, Embedded Learning Library, Qualcomm Neural Processing SDK for AI, ARM CMSIS-NN, etc.). However, despite extensive research in this domain, most edge devices, being highly resource-constrained, can only run lightweight DL models, since complex and accurate State-of-the-Art (SOTA) models continuously exceed the compute capacity, memory, and energy budget of edge devices.
In addition to these two strategies at opposite ends of the edge DL spectrum, edge-cloud collaborative inference has also been explored in some recent works. DL model partitioning or model splitting is the basis of this strategy. Note that DL model partitioning is a special case of general edge-cloud partitioning and offloading of computation tasks [13] from the end users. In this work, we focus on DNN partitioning since it requires customized solutions that can take advantage of the unique characteristics of DL algorithms, as general techniques may not be able to exploit specific information about DNN architectures. Among the prevalent DNN partitioning approaches, in horizontal collaboration [82], individual layers are partitioned into distributable tasks that are executed in parallel using multiple edge devices. On the other hand, in vertical collaboration [37, 71], the DNN is partitioned at an intermediate layer according to one or more premeditated criteria, and data preprocessing followed by partitioned DL inference up to the chosen layer is executed on the edge device. Subsequently, intermediate data (feature maps) are transmitted to a cloud server where DL inference is performed on the remaining layers. In this work, we focus on the vertical partitioning of DNN models across edge-cloud platforms and aim to find the optimal partitioning layer to minimize the DL inference latency. To the best of our knowledge, most existing partitioning approaches involve substantial offline characterization/profiling of edge devices and cloud platforms, where DNN layers are profiled to generate performance prediction models. This approach faces several key challenges that limit its feasibility for pervasive edge intelligence. First, due to the significant diversity in the System-on-Chip (SoC) architectures in edge platforms [77] and DL software frameworks and third-party libraries, offline characterization becomes necessary for each new system configuration.
This limits the solution’s scalability to the large and ever-growing pool of edge systems. Second, the heterogeneity and non-monotonicity of the SOTA DNN layers and the size of the feature maps make the partitioning problem even more challenging. Third, different platforms may have been optimized for a specific type of DNN layer (convolution, pooling, fully-connected, normalization layers, etc.) due to their underlying hardware characteristics. This results in different layers being executed with widely varying efficiencies on these diverse platforms, some with extremely poor performance. Finally, operating conditions, such as wireless network bandwidth and the load on the cloud server that accommodates edge device requests, can vary significantly over time, thus affecting the optimal partitioning point for a particular DNN. All these factors indicate the deficiencies in existing edge-cloud DNN partitioning solutions.
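The vertical partitioning scheme described above can be made concrete with a minimal sketch. Here the model is treated as a plain chain of layer callables; the `run_partitioned` helper and this layer representation are our illustration (no specific DL framework is assumed), but the control flow mirrors the scheme: layers up to the partition point run on the edge, the intermediate feature map is handed over, and the remaining layers run in the cloud.

```python
from typing import Any, Callable, List

def run_partitioned(layers: List[Callable[[Any], Any]],
                    partition_point: int, x: Any) -> Any:
    """Execute layers[0:partition_point] on the edge, the rest in the cloud.

    partition_point == 0 corresponds to cloud-only inference (CoI);
    partition_point == len(layers) corresponds to edge-only inference (EoI).
    """
    # --- edge side: run the first `partition_point` layers ---
    for layer in layers[:partition_point]:
        x = layer(x)
    # In a real deployment, the intermediate feature map would be
    # serialized here and transmitted over the wireless link.
    feature_map = x
    # --- cloud side: run the remaining layers on the received data ---
    for layer in layers[partition_point:]:
        feature_map = layer(feature_map)
    return feature_map
```

Regardless of where the chain is cut, the final output is identical; only the latency and energy split between edge and cloud change, which is what the partitioning decision optimizes.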
To overcome these challenges, we explore the edge-cloud partitioning problem using platform-agnostic adaptive partitioning of DNNs (
The rest of the article is organized as follows: Section 2 presents a brief overview of the existing literature in edge-only inference, cloud-only inference, and edge-cloud collaborative inference. Section 3 covers the necessary background and motivation for partitioned DNN inference and the different factors that influence the latency and partitioning decision. This is followed by the key design methodologies and the associated rationale behind the design decisions of the adaptive DNN partitioning algorithm in Section 4. The experimental setup and methodology comprising the five compute platforms, together with six different DNN architectures used in this work are described in Section 5. The article will then review the experimental results of
2 RELATED WORK
Rapid progress in the fields of edge computing and deep learning has led to a thrust in the industry to push cognitive abilities to each and every application that we interact with in our daily lives. To this end, various approaches have been proposed to infuse intelligence into the plethora of edge/IoT devices surrounding us [74]. Before diving into the existing literature, we clarify the ambiguous nature of the definition of “edge devices” in the literature. Without loss of generality, we consider all IoT or embedded and mobile/client devices (such as autonomous vehicles, wearables, smartphones, drones, conversational assistants, etc.) that sit on the edge of the IoT hierarchy and sense and generate data for the DL application, as “edge devices”. Other computing platforms, including fog nodes, network edge servers, roadside units, and cloud-based data centers, are classified as “cloud servers” for the entirety of this article.
Traditionally, computation requests arising from DL-based applications running on edge devices are offloaded to powerful cloud servers in CoI mode. Therefore, edge devices are not subject to additional computing overhead, scheduling delays, and the need for resource optimization [9, 28] in this scenario. In contrast, in the EoI paradigm, breakthrough research in the development of powerful mobile CPUs/GPUs/ASICs/FPGAs [14, 32, 36, 50], hardware accelerators [2, 42, 65] coupled with DNN model optimization and compression strategies [46, 80] has enabled the deployment of complex DNNs on edge devices, thus contributing to the goal of democratizing AI. Furthermore, the rise in decentralized AI architectures based on federated learning and blockchain, data privacy, and security needs, in addition to bandwidth, latency, and cost issues, has driven industry research on edge AI [45]. However, both techniques face diverse challenges, as described in Section 1. The drawbacks pertaining to strict delay requirements, privacy and reliability issues for CoI, and limited compute/memory capability, energy consumption, and cost bottleneck in EoI have inspired the development of the collaborative edge-cloud computing paradigm.
In the collaborative sphere, previous works [44, 63] determine the optimal strategy to execute DL inference either on the edge device or on a remote server based on multiple criteria (such as DNN accuracy, inference latency, device energy, etc.). DL model segmentation or vertical model partitioning approaches [37, 40, 41, 71] have also been used to make the best use of the computing power of edge and cloud infrastructures to meet application latency demands and ensure DNN accuracy. A common limitation of these studies is the need for an offline characterization phase that is carried out to estimate the device-specific performance of different DNN layers and the energy consumption for the adopted distributed edge-cloud setup. For example, approaches in the literature use statistical modeling [37], analytical modeling [58], and application-specific profiling [17] of DNN layers, leveraging the results of the profiling phase to decide the partition point at DNN inference run time. Related research works have adopted different Mixed Integer Linear Programming techniques [15, 18, 21] that offer theoretical guarantees to find the optimal partition point. However, these are computationally expensive because the size of the partitioning problem is large. On the contrary, researchers have also proposed heuristic algorithms, such as Genetic Algorithms (GA) [48], Standard Particle Swarm Optimization with GA [7], Approximate Solver [31], Multipath DAG partition [15], and so on, to find approximate or suboptimal solutions quickly. Both classes of approaches require an offline profiling phase or prediction algorithms. However, as stated by the authors in Reference [17], disjointed layer-wise profiling or prediction algorithms based on DNN layer configurations are prone to estimation errors due to the non-monotonic acceleration provided by various hardware architectures and software frameworks to the consecutive execution of layers.
Therefore, an exponential number of profiling experiments would be needed if a DNN application has to be efficiently deployed on various kinds of mobile SoCs/edge devices [77], making such approaches impractical. Furthermore, diurnal performance variation is observed in large-scale data centers, and profiling-based evaluation methods cannot capture these variations, as demonstrated in Reference [66]. Moreover, hardware heterogeneity and complex software architectures in these cloud servers could also lead to different benchmark/profiling results at different times, which are therefore not representative statistics [16, 69]. These challenges limit the accuracy and scalability of the aforementioned profiling-based edge-cloud collaborative solutions. Although the authors of Reference [83] proposed a resource-aware online partitioning scheme that does not involve profiling, the partition points are randomly selected based on predetermined probabilities. Furthermore, the evaluation is limited to a single DNN (VGG16) and a single virtual hardware platform, and does not consider dynamic bandwidth variation.
Another direction of research solves the critical problem of resource autoscaling (InferLine, FA2) [10, 64] to minimize end-to-end (e2e) application latency by intelligently allocating resources or computation nodes/hardware accelerators (e.g., CPUs, GPUs, FPGAs, TPUs) in DL inference serving systems. These approaches cater to DNN inference pipelines consisting of multiple DNNs orchestrated with a Directed Acyclic Graph (DAG) in the application. The number of DNN instances and/or request batch size are dynamically configured for different hardware nodes to meet Service Level Agreement (SLA) guarantees for response time. Other related optimization techniques that improve resource utilization include parallel processing of requests and caching of past predictions [11], deadline-bound delay of requests [8], and so on. However, these approaches do not explicitly include any features for partitioning a single DNN model across multiple devices. In addition, most of them include profiling entire DNNs on different combinations of hardware and batch size to obtain latency and throughput statistics. Furthermore, the inference latency of SOTA models such as YOLOv5 [34] and EfficientDet [68] depends on the input size. Therefore, layer-wise DNN profiling on multiple hardware platforms using different input sizes will be extremely costly. On the other hand, early-exit inference [39, 70, 73] that allows DNN inference to exit using side branch classifiers depending on energy and accuracy requirements also reduces latency. We have provided a comparative analysis of some relevant previous work in Table 1. In contrast to the prior work, we have adopted an orthogonal and complementary heuristic-based approach to obtain a characterization-free, platform-agnostic, and adaptive solution for the collaborative edge-cloud inference paradigm.
To this end, we also investigate the impact of the variation in edge platform and communication medium, and dynamically varying wireless network conditions and server load, on the optimal partitioning decision.
| Features | PArtNNer | Neurosurgeon [37] | DeepDecision [63] | ADDA [73] | DADS [31] | Joint Optimize [15] | Resource-aware [83] |
|---|---|---|---|---|---|---|---|
| Characterization free | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| Supports EoI and CoI | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Supports partitioned inference | ✓ | ✓ | ✗ | ✓ | ✓ | ✓ | ✓ |
| Hardware agnostic | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| Bandwidth adaptive | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Latency reduction | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Energy reduction | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ |
| Maintains DNN accuracy | ✓ | ✓ | ✗ | ✗ | ✓ | ✓ | ✓ |
| Avoids exhaustive/random search | ✓ | ✓ | ✓ | ✗ | ✓ | ✓ | ✗ |
| Handles chain and DAG topology DNNs | ✓ | ✗ | ✓ | ✗ | ✓ | ✓ | ✗ |
3 MOTIVATION
As already stated in Sections 1 and 2, the SOTA approach to offering intelligent edge services to the end user is to perform all DL processing in the cloud (CoI), or to deploy cognitive DL capabilities inherently within the edge device that resides with the end user (EoI), or to provide collaborative edge-cloud intelligence using coordinated integration of computing, communication, and networking resources [74]. For the purpose of this work, we consider that a user has access to a particular edge device that communicates with a single cloud server. A classification or detection inference request that originates at the edge node can be executed only in an EoI, CoI, or collaborative manner. We assume that communication or offloading is not possible between multiple edge devices. In this section, we make the case for distributed DL inference in the form of a collaborative edge-cloud architecture. First, we motivate the partitioning of DNNs (vertically, i.e., along the edge and cloud) to satisfy real-time latency constraints (Section 3.1). Subsequently, we demonstrate that the partitioning decision is highly dependent on temporally varying operating conditions such as wireless network bandwidth and cloud server load (Section 3.2). Finally, we show that this decision is also highly influenced by the architectures of the underlying systems (edge/cloud) and communication subsystems (Section 3.3), which could lead to exorbitant device-specific profiling and characterization. Evidently, these challenging and diverse factors highlight the need for a platform-agnostic DNN partitioning system that can completely eliminate exhaustive profiling while simultaneously being adaptive to any change in environmental conditions.
3.1 Edge-Cloud Collaboration: Why Partition DNN?
Segmentation/partitioning of the DL model between the edge and the cloud is used in the edge computing architecture to satisfy different edge inference requirements, such as real-time latency, accuracy, energy efficiency, and so on. However, automatic selection of the partitioning point is challenging due to the heterogeneity and non-monotonicity of DNN architectures, the variation in DNN precision, the DNN processing framework, and the variation in dynamic operating conditions that affect the inference latency of these architectures [78]. There is no unique solution, and the Optimal Partition Point (OPP) that minimizes e2e latency varies from one DNN to another, as well as within a particular DNN. To demonstrate this variance in OPP, we show six subplots in Figure 2 corresponding to six DNNs, viz., AlexNet, InceptionV3, ResNet101, SqueezeNet1.1, MobileNetV2, and YOLOv3-Tiny (details on DNNs in Table 3) under identical operating conditions. Each subplot consists of a heat map showing the OPP for the corresponding DNN at different cloud server loads (y-axis) and wireless network bandwidth (x-axis). The color bar associated with each heat map shows all possible partitioning points for the respective DNN. The two extreme ends of the color bar, C and E, represent CoI (0 layers at the edge) and EoI (all layers at the edge), respectively, which vary from network to network. Apart from these two extremes, lighter shades indicate that the OPP is among the initial layers of the DNN, thus favoring the execution of most layers in the cloud. In contrast, darker shades suggest an OPP toward the end of the DNN, implying execution of most layers on the edge device. To improve the readability of the figure, the OPP for each combination of load and bandwidth is indicated in the heat map for each DNN.
As observed, the OPPs for AlexNet, InceptionV3, ResNet101, SqueezeNet1.1, MobileNetV2, and YOLOv3-Tiny are {20(E), 13, 3}, {20(E), 0(C)}, {39(E), 0(C)}, {17(E), 9, 0(C)}, {25(E), 10, 7, 5, 0(C)}, and {24(E), 8, 0(C)}, respectively, across all the operating conditions. As we can clearly see, the OPP distinctly varies across different DNNs. The OPPs in compute-heavy DNNs, such as InceptionV3 and ResNet101, lie at either extreme, while there are multiple OPPs for the other DNNs, which are comparatively smaller and designed for edge deployment. For example, at load 10 and bandwidth 5 Mbps, the OPP for the minimum latency of SqueezeNet1.1 is 9, while the same for ResNet101 is 0. This discussion highlights the diversity of OPP among DNNs, which contributes to one of the many challenges in designing an automated partitioning algorithm.
To further illustrate the benefits of partitioned or edge-cloud collaborative inference compared to always statically deciding to perform full DNN inference in the cloud or on the edge, we take a closer look at the heat map for SqueezeNet1.1, a highly optimized DNN for edge inference. In addition to the heat map, Figure 3 shows three different subplots for three different operating conditions, each highlighting the inference latency at all possible partition points. The x-axis and the y-axis in the stacked column charts represent the partition point or the number of layers (blocks) executed on edge and e2e inference latency, respectively. At each partition point, we show the total e2e latency (shown by a green line with markers) along with its breakdown into finer constituents, namely (i) edge latency shown in purple (time taken to execute the layers up to the partition point, including the layer indicated by the point, on the edge device), (ii) communication (comm) latency shown in orange (transmission time of the output feature maps from the partitioned layer), and (iii) cloud latency shown in maroon (time to run inference on the remaining layers of the DNN in the cloud). The top-right plot considers DNN inference at high server load 20 and network bandwidth 25 Mbps, and shows that the OPP corresponding to minimum latency is 0, i.e., all the layers executed in the cloud. In this case, the edge device uses its image sensor/camera module, consumes a small amount of time during image acquisition, and performs the necessary preprocessing before offloading the image to the cloud server. At load 1 and bandwidth 1 Mbps (bottom-left plot), the minimum e2e latency is observed when all layers are executed on the edge device. However, in the bottom-right plot, the minimum e2e latency is observed using partitioned inference where the OPP is 9. Similarly, the heat map also shows that partitioned execution performs best under many operating conditions.
Therefore, exploring and optimizing partitioned inference is an interesting research problem to solve. In the following two sections, we further investigate the impact of temporally varying operating conditions, as well as various system specifications, on the optimal partitioning point for six different SOTA DNNs used for computer vision applications on edge devices.
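The latency trade-off discussed above can be expressed as a small cost model. The following sketch is illustrative (the function names are ours and any per-layer numbers plugged into it are assumptions, not measured data): e2e latency at partition point p is the edge compute time of the first p layers, plus the transmission time of layer p's output feature map, plus the cloud compute time of the remaining layers; the OPP is simply the argmin over all candidate points.

```python
def e2e_latency(edge_lat, cloud_lat, fmap_bits, bandwidth_bps, p):
    """End-to-end latency when the first p layers run on the edge.

    edge_lat[i] / cloud_lat[i]: latency of layer i on the edge / cloud (s).
    fmap_bits[p]: size of the data transmitted at partition point p
    (fmap_bits[0] is the raw input; fmap_bits[-1] is the final result).
    p == 0 is cloud-only (CoI); p == len(edge_lat) is edge-only (EoI).
    """
    edge = sum(edge_lat[:p])                  # compute on the edge device
    comm = fmap_bits[p] / bandwidth_bps       # feature-map transmission
    cloud = sum(cloud_lat[p:])                # compute on the cloud server
    return edge + comm + cloud

def optimal_partition_point(edge_lat, cloud_lat, fmap_bits, bandwidth_bps):
    """Exhaustively pick the partition point that minimizes e2e latency."""
    n = len(edge_lat)
    return min(range(n + 1),
               key=lambda p: e2e_latency(edge_lat, cloud_lat,
                                         fmap_bits, bandwidth_bps, p))
```

With a toy three-layer profile, increasing the bandwidth shifts the OPP toward 0 (CoI), matching the trend visible in the heat maps; a heavily loaded cloud can be modeled by scaling `cloud_lat` up, which shifts the OPP toward EoI.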
3.2 Impact of Temporally Varying Operating Conditions
Cloud-based DL inference for edge intelligence, in CoI or collaborative mode, always comes with additional constraints, such as available network connectivity, bandwidth, wireless channel condition, congestion, contention, concurrent service requests, data center/server load, different hardware/software configurations, and server geolocation (examples of some popular cloud servers are Amazon’s AWS DL AMIs, Google Cloud ML, and IBM Watson for AI workloads), among others. The time-varying wireless environment affects the transmission time of sensor data (image, video, audio, text, etc.) in the case of CoI or intermediate DNN feature maps in collaborative mode. On the other hand, a highly loaded cloud server can negate the computational advantage of the cloud over the edge, leading to higher DL inference latency. Furthermore, as shown in previous work [62], the relative geographical locations of the edge device and the cloud server also affect connectivity and e2e inference latency. Edge applications using cloud services are affected by the round trip time, which is usually proportional to double the geographical distance [29, 67]. These dynamic factors ultimately result in different response times, thus changing the OPP that offers the best latency at the corresponding operating conditions. In Figure 2, we can clearly observe the variation in OPP with the alteration of the server load and the bandwidth of the wireless network for six DNNs. Intuitively, the increase in bandwidth reduces the communication latency of the DNN input/feature maps, consequently favoring more layers to be offloaded to the cloud (i.e., fewer layers on the edge), which shifts the partition point, as we can see in the heat maps. On the other hand, an increase in server load favors computing more layers at the edge. Similar observations are evident in Figure 3, which depicts the variation in OPP for SqueezeNet1.1 with respect to different network bandwidths and cloud server loads.
For example, the heat map on the top-left plot of this figure shows a change in OPP from 0 to 9 with an increase in server load from 15 to 20. The non-monotonic characteristics of the size of intermediate DNN feature maps directly result in a non-monotonic trend of communication latency, as evident in the stacked column charts in Figure 3. In all of the plots, the impact of variation in bandwidth on communication latency and, consequently, on the partitioning decision is clear. In addition to these variations, there is no guarantee on the delay and response time when accessing cloud services, which could result in long waiting times for edge devices aiming to offload DL processing to the cloud. Even with a loss in network connectivity, intelligent critical services must be provided in near-real time, which invokes the need to design a partitioning framework adaptive to both of these temporally varying operating conditions.
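The geographical-distance effect mentioned above admits a quick back-of-the-envelope bound. Assuming signals propagate in fiber at roughly two-thirds of the speed of light (a common rule of thumb; real RTTs are larger due to indirect routing, switching, and queuing delays), the propagation-only round-trip time is:

```python
# Rough lower bound on round-trip time from geographical distance alone.
# Assumption: light in fiber travels at about (2/3) * 3e8 m/s; actual RTTs
# exceed this bound because of routing, switching, and queuing delays.
C_FIBER_M_PER_S = 2.0e8

def min_rtt_seconds(distance_km: float) -> float:
    """Propagation-only RTT: the signal covers the distance twice."""
    return 2.0 * distance_km * 1000.0 / C_FIBER_M_PER_S
```

For instance, an edge device 1,000 km from its cloud server pays at least about 10 ms of RTT before any cloud computation even begins, which is why server geolocation enters the partitioning decision.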
3.3 Impact of Diverse Edge System Specifications
With an increasing trend to push DL inference to the edge to reduce latency, preserve privacy, and enable interactive use cases, various DL frameworks such as PyTorch Mobile, TensorFlow Lite, CoreML, MXNet, and so on, have released software modules in recent years. Many of these specialized libraries are used to optimize and deploy DNN models for DL inference on more than a billion mobile/edge devices [74, 77]. As the authors of Reference [77] point out: “These devices are comprised of over two thousand unique SoCs running in more than ten thousand smartphones and tablets”. As we have mentioned in Section 1, programmability and performance variation affect applications with real-time constraints. Specifically, hardware and software heterogeneity results in variation in latency and energy consumption for the same DNN across different edge devices and consequently changes the OPP when the same edge device is operating in edge-cloud collaborative mode. Different SoCs and software frameworks offer different degrees of acceleration to different types of DNN layers, and this forms the fundamental rationale behind this variation in inference latency. Furthermore, operating conditions such as the number of concurrent CPU-intensive threads on the edge device, the dynamic allocation of cores, and so on, which depend on the computing platform, affect latency, thus altering the OPP [78]. To demonstrate this variability, we performed several DNN inference experiments using six DNN benchmarks on two COTS compute platforms, namely RPi0 and RPi3, and developed computational models of three other platforms, namely NCS, Jetson, and ETPU, which closely match the inference performance on the corresponding actual hardware. Specifically, the NCS model considers a combination of RPi3 attached to the NCS over USB.
For the remainder of the article, DNN inference on any of these five devices will refer to execution on actual hardware for the two Raspberry Pi boards and to execution of the computational models for the rest. PyTorch was used as the primary software framework to conduct these experiments. More details on the DNN benchmarks and hardware-software setup can be found in Section 5. Next, we investigate the inference performance of these devices.
Consider a real-time edge application with strict latency constraints, where SqueezeNet1.1 is used as the underlying image classification DNN and the edge device is operating in edge-cloud collaborative mode. The e2e latency comprises the time taken by the edge device to capture the image, perform adequate data preprocessing (reduce data size, packetize, etc.), transmit data (image/feature map/result) to the cloud (if any), and the DNN inference latency on the edge or cloud, or both (only during partitioned execution). Previous work in the domain of collaborative inference [17, 78, 84] involves offline DNN pre-characterization/profiling that includes resource cost modeling of different DNN layers coupled with individual cost prediction models guided by network bandwidth, process latency, and energy consumption, for a specific set of edge devices and cloud server. We term this characterization information Oracle data, which will vary depending on the DNN architecture, layers, wireless standard and bandwidth, edge platform specifications such as memory bandwidth, Trillion Operations per Second (TOPS), and so on. We performed an offline analysis of the network bandwidth and cloud processing load over a fixed duration of the DNN application to generate the oracle data. Leveraging the oracle to decide the OPP during application run-time will always result in minimum e2e DL inference latency for the prevailing wireless network and server conditions. If the same application (and DNN) is executed on any other edge device/platform, the OPP might change, and determining the new OPP requires pre-characterized oracle data for the respective platform. Furthermore, a change in the wireless standard and, subsequently, the real-world network bandwidth/speed might change the OPP for the same platform. Figures 4 and 5 show this variance in OPP due to hardware heterogeneity and variation in communication standards for three traditional DNNs and three edge-optimized DNNs.
In both figures, each row represents data for a particular DNN, and each plot in a row shows the OPPs for three different communication standards for five devices. As we can observe in Figure 5, the OPPs for SqueezeNet1.1 running on the five edge platforms mentioned above, i.e., RPi3, RPi0, NCS, Jetson, ETPU, using the Wi-Fi5 standard at a fixed network bandwidth of 1 Mbps are {17, 9, 17, 17, 17}, respectively. Similarly, the OPPs for the same set of systems that use the 5G standard at a bandwidth of 250 Mbps are {0, 0, 0, 9, 17} and the Wi-Fi6 standard at a bandwidth of 1500 Mbps are {0, 0, 0, 0, 10}. As previously stated in Section 3.1, 0 represents CoI, and 17 indicates EoI for SqueezeNet1.1, while any intermediate value represents partitioned execution, where layers/blocks until the OPP (inclusive) are executed at the edge and the rest of the layers run on the cloud server using the transmitted feature maps. Looking at the graphs for MobileNetV2, another SOTA mobile-optimized DNN, we observe the variance in the OPP when we switch the edge platforms and the communication medium. Similar observations can be derived for YOLOv3-Tiny, a SOTA object detection DNN for edge AI applications. Clearly, there is no unanimous partition point that results in minimum latency across all these platforms and communication standards. Similar observations can be derived by inspecting the rest of the plots in Figures 4 and 5 corresponding to the other DNNs. These results allow us to derive the following takeaway: the OPP of any DNN is a function of the computing architecture and communication medium. These insights motivate us to design a platform-agnostic system that can rely on run-time measurements to derive the OPP that can minimize the e2e inference latency.
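To make the profiling-based alternative tangible, the SqueezeNet1.1 OPPs quoted above can be written down as the kind of lookup table an oracle-driven system would consume at run time. The dictionary layout and `lookup_opp` helper are our illustration; the values are those reported for Figure 5 (0 = CoI, 17 = EoI).

```python
# Pre-characterized OPPs for SqueezeNet1.1 at the bandwidths quoted in the
# text (Wi-Fi5 @ 1 Mbps, 5G @ 250 Mbps, Wi-Fi6 @ 1500 Mbps), keyed by
# (platform, communication standard). 0 = CoI, 17 = EoI.
SQUEEZENET_ORACLE = {
    ("RPi3", "Wi-Fi5"): 17, ("RPi0", "Wi-Fi5"): 9,  ("NCS", "Wi-Fi5"): 17,
    ("Jetson", "Wi-Fi5"): 17, ("ETPU", "Wi-Fi5"): 17,
    ("RPi3", "5G"): 0,      ("RPi0", "5G"): 0,      ("NCS", "5G"): 0,
    ("Jetson", "5G"): 9,    ("ETPU", "5G"): 17,
    ("RPi3", "Wi-Fi6"): 0,  ("RPi0", "Wi-Fi6"): 0,  ("NCS", "Wi-Fi6"): 0,
    ("Jetson", "Wi-Fi6"): 0, ("ETPU", "Wi-Fi6"): 10,
}

def lookup_opp(platform: str, standard: str) -> int:
    """Return the pre-characterized OPP; raises KeyError for any
    (platform, standard) combination that was never profiled."""
    return SQUEEZENET_ORACLE[(platform, standard)]
```

The scalability problem is immediately visible: every new (platform, standard) pair, let alone each additional DNN, bandwidth level, and server-load level, requires a fresh characterization run to populate the table, whereas a run-time-adaptive scheme avoids building it altogether.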
4 PARTNNER: DESIGN METHODOLOGY
As evident in Sections 3.2 and 3.3, partitioned DL inference using the edge-cloud collaborative framework faces multiple optimization challenges in the form of dynamic operating factors such as wireless network conditions and cloud server loads, as well as the characteristics of the underlying edge device, communication subsystem, and software divergence. To solve the DNN partitioning problem in the presence of the multifaceted challenges mentioned above, we propose
4.1 Adaptive DNN Partitioning Heuristic
We formulate DNN partitioning as a non-linear and non-convex optimization problem, since multiple locally optimal partitioning points might exist due to non-monotonic variations in the computation/memory requirements and data/feature-map sizes of individual DNN layers. By nature, a non-convex problem has multiple local minima and one global minimum, and solving it is generally NP-hard [6, 31]. The per-layer inference characteristics of DNNs follow this nature, as can be observed from the e2e latency (green lines) in Figure 3. For a particular edge platform, the system designer may not have all the underlying details of the specific DNN models to be deployed to provide intelligent edge services. Due to the constantly evolving design space of DNN architectures, more efficient and accurate DNNs could be used on the same platform in the future for the same application. This unpredictability of the DNN architecture presents the first challenge. Second, even with prior knowledge of the DNN architecture deployed on the platform, the temporally varying operating conditions (such as bandwidth and server load) pose substantial challenges, further compounding the non-convexity of the problem, as the OPP of a particular DNN at one operating state might no longer be optimal at another. This is evident from the discussions in Section 3 as well as from Figures 2 and 3. During our investigation of the problem in the context of varying edge platforms and communication standards, we observed that even under constant temporal conditions, there are unique and distinct OPPs for the same DNN on different edge platforms. These are addressed in Section 3.3 and Figures 4 and 5.
Therefore, the solution landscape of DNN partitioning shaped by the aforementioned factors admits no single global minimum that holds across all conditions, making it imperative for us to propose an efficient heuristic that is adaptive to any DNN architecture and to temporal variations in operating conditions, as well as agnostic to the underlying platform.
The proposed partitioning heuristic can work with any SOTA DNN, without the need for any kind of DNN profiling on the hardware. Unlike the prior partitioning approaches discussed in Section 2, we eliminate any offline pre-characterization phase of the edge and/or cloud platforms before the actual deployment of the DNN inference engine on the edge device. In contrast to the existing literature, we rely solely on run-time measurements of e2e inference latency on the edge device to guide the heuristic. Using these measurements, the heuristic adaptively steers the partition point toward the OPP.
4.1.1 Heuristic Terminologies.
Before explaining the proposed heuristics and associated algorithms, we define different metrics and terminologies. In this work, we have chosen a coarse-grained approach instead of fine-grained partitioning at per-layer granularity, as shown for the ResNet50 architecture in Figure 6. We have adopted block-level/module-level partitioning for DAG-topology DNNs, where individual layers in blocks may have multiple inputs and outputs, forming complex branching structures. These architectures have residual connections, concatenation, or element-wise addition operations. Consequently, they need edge-to-cloud transmission of output feature maps from multiple layers and proper handling of data dependencies due to the existence of multiple parallel paths (as seen in Figure 6) for successful collaborative inference. For simplicity, we only allow partitions after individual layers or blocks. Note that our solution can handle scenarios even when blocks receive multiple inputs. Figure 6 illustrates that a residual block, as seen in the ResNet class of networks, takes input feature maps from the previous block/layer, applies one or more convolution layers, and finally adds the final output to the original input (in an identity block) or to a transformed original input (in a bottleneck block). Similarly, DNNs such as SqueezeNet1.1, InceptionV3, and MobileNetV2 employ modules with internal concatenation operators that combine the outputs of two or more internal layers to generate the output of the module fed to the next layer/block. However, block-level partitioning does not affect chain-topology DNNs with simple architectures, such as AlexNet and VGG19_BN, where it boils down to fine-grained partitioning. The total number of partitioning points in a DNN using this approach is represented by N, which is different from the total number of layers.
For example, ResNet50 has 50 layers and 23 possible partition points \((P_{O}~|~P_{O} \in [0, 22])\), as shown in Figure 6, where 0 and 22 represent CoI and EoI, respectively, while intermediate values indicate collaborative inference. Similar information for other DNN benchmarks used in this work is enumerated in Table 3. This approach allows us to partition both chain topology DNNs and DAG topology DNNs. Other heuristic parameters are defined as follows:
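The indexing convention above can be sketched as follows (block names are illustrative):

```python
# Partition-point semantics: for a DNN with N blocks there are N+1 partition
# points (0..N). Blocks up to the partition point (inclusive) run on the edge,
# the rest in the cloud. P_O = 0 is cloud-only (CoI); P_O = N is edge-only (EoI).

def split_blocks(blocks, p_o):
    """Return (edge_blocks, cloud_blocks) for partition point p_o."""
    n = len(blocks)
    assert 0 <= p_o <= n, "partition point out of range"
    return blocks[:p_o], blocks[p_o:]

# ResNet50: 22 partitionable blocks, hence 23 partition points (0..22).
blocks = [f"block{i}" for i in range(1, 23)]
edge, cloud = split_blocks(blocks, 5)   # blocks 1..5 on the edge, 6..22 in the cloud
```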
– \(search\_top\): This input parameter is a Boolean flag (True/False) that instructs the algorithm to search for the starting partition point \(P_{S}\) from either the top or the bottom of the DNN. Here, top represents partition point 0, whereas bottom indicates partition point N. We assume that the communication bandwidth is reliable during deployment. As observed in Figure 2, higher bandwidth correlates with an OPP at \(\approx \!0\), i.e., more layers are computed in the cloud. Therefore, we set this value to True across the 15 system benchmarks and 6 DNNs. Note that the bandwidth can also be measured at the start to set this parameter accordingly.
– k: This input parameter is a percentile factor that decides the range of DNN blocks/layers from which \(P_{S}\) is selected randomly. \(k = 1\) indicates that \(P_{S}\) can be any of the permitted partitions, whereas higher values (\(k \in [2, 4]\)) reduce the search space of the random choice. \(search\_top\) and k together enable the initialization of the heuristic. For example, \(search\_top\) = True and \(k = 4\) indicate that the algorithm will select \(P_{S}\) from the top 25% of the partitions (Algorithm 2).
– \(\alpha\): This input parameter is the relative latency threshold that determines how eagerly the heuristic algorithm tries to find new partition points instead of staying at the previous point (Algorithm 3). A lower \(\alpha\) allows the algorithm to explore more often, whereas a higher \(\alpha\) makes the algorithm more conservative; i.e., only large disruptions in the latency will cause the algorithm to move the partition point. We have empirically selected \(\alpha \in (0, 0.1]\) to account for measurement variations, noise, environmental uncertainties, and other uncontrollable factors.
– \(near\_idx\) (\(ni\)): This input parameter sets the number of blocks/layers that the heuristic shifts in either direction (edge/cloud) if the latency difference exceeds the threshold \(\alpha\) (Algorithm 3). Essentially, \(ni\) determines the degree of feedback used by the algorithm to guide itself toward the OPP. We have empirically observed that \(ni \in [2, 3]\) leads to faster convergence to the OPP across all benchmarks. Higher values may degrade heuristic performance and, consequently, increase e2e latency.
– \(far\_idx\) (\(fi\)): This run-time variable is updated by the algorithm depending on past favorable or unfavorable decisions. The heuristic only uses \(fi\) to shift \(P_{O}\) if consecutive decisions are favorable (Algorithm 3).
– \(scale\_idx\) (\(si\)): This input parameter is a scaling factor that changes \(fi\) depending on the difference in latency (Algorithm 3). We use \(si\) to reward the heuristic to different degrees based on the frequency of favorable decisions. We empirically selected \(si \approx 2\) on average across all benchmarks. More details on the interaction between these parameters are provided in Section 4.2.
– \(part\_prob\): This input parameter decides the probability of exploration if \(\Delta lat\) does not exceed the threshold \(\alpha\). Setting this parameter properly ensures that the algorithm does not get stuck indefinitely at a local minimum. In our experiments, we set \(part\_prob \in [0.05, 0.15]\) (Algorithm 4).
– \(last\): This run-time variable records the last updated partitioning decision. The heuristic updates this variable (\(last \in \lbrace \mathsf {edge, cloud}\rbrace\)) if it decides to offload layers in the direction opposite to its previous decision (Algorithm 3). In addition, \(last\) may also be updated when the heuristic performs exploration (Algorithm 4).
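For reference, these parameters can be gathered into a single configuration record; the defaults below are drawn from the empirically chosen ranges reported above, but the exact values are deployment-specific assumptions:

```python
from dataclasses import dataclass

@dataclass
class HeuristicConfig:
    # Defaults are illustrative picks from the empirical ranges in Section 4.1.1;
    # actual values would be tuned per deployment.
    search_top: bool = True   # start searching P_S from the top (partition point 0)
    k: int = 4                # P_S drawn from the top 100/k % of partitions
    alpha: float = 0.05       # relative latency threshold, alpha in (0, 0.1]
    near_idx: int = 2         # ni: small shift of P_O, ni in [2, 3]
    scale_idx: float = 2.0    # si: scaling factor applied to far_idx
    part_prob: float = 0.10   # exploration probability, in [0.05, 0.15]
```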
4.1.2 Heuristic Operation.
The proposed heuristic is a run-time/online solution to the partitioning problem. The top-level Algorithm 1 is executed on the edge device together with the intelligent application that uses the DNN as its underlying algorithm. When the application is first launched, it initializes the heuristic with different empirically chosen parameters (Line 1) by calling the function Init in Algorithm 2. Depending on the input parameters, the first partition point \(P_{S}\) is randomly chosen among the N DNN blocks/layers (Lines 2–5), and the heuristic variables, viz., \(P_{O}, Ti_{prev}, Ti_{prev_2}, fi, last\), are initialized. Note that Algorithm 2 needs to run only once.
Following the initialization phase, Algorithm 1 calls the main partitioning heuristic (Algorithm 3) every time the parent application invokes a DL inference request on the edge device. The heuristic measures the e2e latency associated with the current request, \(Ti_{curr}\), only on the edge device (Line 2); this encapsulates all the computation (edge, cloud, or both) and the communication latency (if any) involved in a single DNN inference operation, as explained in Section 3.1. Subsequently, the relative latency difference \(\Delta lat\) is calculated using \(Ti_{curr}\) and \(Ti_{prev}\), the e2e latency corresponding to the last inference request. We also measure the relative latency difference of the previous partition decision, \(\Delta lat_{prev}\). Note that \(\Delta lat\) and \(\Delta lat_{prev}\) are initialized to 0 only at the first inference instance to avoid a division-by-zero error. This triggers exploration (Algorithm 4) to find \(P_{O}\). For each subsequent instance, the measured \(\Delta lat\) is compared with the predefined threshold (\(\alpha\)). \(\Delta lat > \alpha\) indicates that the last partition decision made (\(last\)) adversely affected latency. Thus, if the \(last\) decision was cloud, the heuristic decides to shift \(P_{O}\) toward the edge, i.e., compared to the present configuration, more layers will be executed on the edge device at the next inference request. On the contrary, if the \(last\) decision was edge, the heuristic offloads the computation of more layers to the cloud, thus relieving the edge of some of its existing computation load. This heuristic action is depicted in Lines 7–16, where \(P_{O}\) is shifted by \(ni\), essentially moving the computation of \(ni\) blocks/layers to the edge or cloud. On the other hand, if \(\Delta lat\) < \(\alpha\), the heuristic reinforces the previous partitioning decision, essentially shifting \(P_{O}\) in the same direction as indicated by \(last\).
To increase the stability of the heuristic, we compare \(\Delta lat_{prev}\) with \(\alpha\) and shift \(P_{O}\) by one of the shift parameters, \(ni\) or \(fi\) (Lines 19–27). Furthermore, with every favorable or unfavorable decision, \(fi\) is increased or decreased using \(si\), or reset directly to \(ni\). Finally, \(Ti_{prev}\) and \(Ti_{prev_2}\) are updated, and the heuristic ensures that the OPP is selected from the possible partitions.
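A condensed sketch of this update rule is given below. It is a simplification of Algorithm 3 (the exploration branch and the \(\Delta lat_{prev}\) stability check are omitted), with variable names following Section 4.1.1:

```python
def update_partition(p_o, ti_curr, ti_prev, last, ni, fi, si, alpha, n):
    """One step of a simplified partition-update rule (not the exact Algorithm 3).

    Returns the new (p_o, last, fi). P_O = 0 is CoI; P_O = n is EoI, so moving
    "toward the edge" increases P_O and "toward the cloud" decreases it.
    """
    dlat = (ti_curr - ti_prev) / ti_prev   # relative latency difference
    if dlat > alpha:
        # Last decision hurt latency: reverse direction by ni blocks.
        if last == "cloud":
            p_o, last = p_o + ni, "edge"   # move more blocks back to the edge
        else:
            p_o, last = p_o - ni, "cloud"  # offload more blocks to the cloud
        fi = ni                            # unfavorable decision: reset far shift
    elif dlat < -alpha:
        # Last decision helped: reinforce it using the (scaled) far shift.
        p_o = p_o + fi if last == "edge" else p_o - fi
        fi = int(fi * si)                  # reward consecutive favorable moves
    p_o = max(0, min(n, p_o))              # keep P_O among the valid partitions
    return p_o, last, fi
```

For example, if the last shift was toward the cloud and latency then rose by 20% (above \(\alpha = 0.05\)), the sketch moves `ni` blocks back to the edge and resets `fi`.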
4.1.3 Random Exploration Phase.
Throughout the heuristic operation, we promote random exploration of the non-convex partitioning search space of the operating DNN in order to cover its large extent. As shown in Line 5 of Algorithm 3, the heuristic calls the function Explore in Algorithm 4 when the latency difference between two consecutive inferences does not exceed the threshold. Depending on the random choice of the partition decision, governed by the parameter \(part\_prob\), \(P_{O}\) is shifted or left unchanged and \(last\) is updated (Lines 4–12). For example, two conflicting conditions, such as a simultaneous increase in network bandwidth and server load, might result in little to no change in the latency difference, as the first condition favors offloading to the cloud, whereas the second prefers execution on the edge. However, a partition point other than the current \(P_{O}\) might result in lower latency, which the heuristic can only discover by choosing a random partitioning decision. Note that \(part\_prob\) may be changed at run time. However, the value should not be set to 0, as this would stop the exploration and the algorithm might remain stuck in a sub-optimal minimum for the entirety of the inference application.
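The exploration step can be sketched analogously (a simplified rendering of Algorithm 4, not the authors' exact pseudocode):

```python
import random

def explore(p_o, last, ni, n, part_prob, rng=random):
    """With probability part_prob, randomly perturb P_O toward the edge or cloud.

    Simplified Algorithm 4 sketch: the perturbation size ni and the 50/50
    direction choice are illustrative assumptions.
    """
    if rng.random() < part_prob:
        if rng.random() < 0.5:
            p_o, last = min(n, p_o + ni), "edge"   # try moving blocks to the edge
        else:
            p_o, last = max(0, p_o - ni), "cloud"  # try offloading to the cloud
    return p_o, last
```

Passing an explicit `rng` (e.g., `random.Random(seed)`) keeps experiments reproducible; setting `part_prob = 0` disables exploration entirely, which, as noted above, risks getting stuck at a local minimum.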
4.1.4 Overall Flow.
The overall flow of the DNN partitioning heuristic, as explained in the previous subsections, is summarized in a single flow chart shown in Figure 7. The figure encompasses all algorithms in a synergistic way and gives a high-level overview of the heuristic's overall flow. For example, Algorithm 2 uses different heuristic parameters (as described in Section 4.1.1) to initialize \(P_{S}=P2\) along with other run-time variables for the representative DNN with \(N=5\) (Steps 1–2). The inference algorithm is executed and the e2e latency is measured (Steps 3–6). Algorithm 3 uses the shift parameters/variables (\(ni\) or \(fi\)) and \(last\) to update the OPP (\(P_{O}\)) to \(P3, P4\), or \(P1\) based on \(\Delta lat\) and \(\Delta lat_{prev}\), followed by an update of the run-time variables (Steps 7–9). Based on \(\Delta lat\), Algorithm 4 might perform random exploration to shift \(P_{O}\) toward the edge or cloud, followed by an update of \(last\) (Steps 7–9). The updated \(P_{O}\) is recorded, partitioned DNN inference is executed using the saved value for the next inference request, and the process continues for the entire lifetime of the application on the edge.
4.2 Rationale behind Heuristic Design Decisions
Finding a globally optimal solution for the non-convex partitioning problem is impractical due to the sheer size of the search space. The heuristic is devoid of any notion of the performance capability of the underlying edge device (since the proposed solution is profiling-/characterization-free). In addition, wireless network conditions and/or server load might vary at any time, completely altering the search space. The presence of other latent or unaccounted constraints, such as the number of concurrent CPU-intensive threads, variability in the deployed DNN precision, the DNN processing framework, dynamic allocation to big/little/hybrid cores, and so on [78, 79], affects the DNN inference latency and edge energy consumption and therefore adds to the already high-dimensional search space. Dealing with such multidimensional constraints requires the heuristic to be highly adaptive while still converging efficiently to the partition point that yields the globally minimum e2e inference latency.
As discussed in Section 2, the existing literature uses extensive profiling, computes processing and communication latency, and measures multiple influential factors [17, 37, 78]. These works use measurements to address the diversity in DNN models, hardware, software, and external constraints and create a library of profile characteristics for each edge and cloud platform involved in the deployment of collaborative inference. Eliminating this exhaustive profiling step stimulated us to adopt a real-time measurement-based approach and design the heuristic parameters, viz., \(near\_idx\), \(scale\_idx\), and \(far\_idx\), which guide the partition decision. The choice to scale \(fi\) up or down with favorable or unfavorable decisions, respectively (as seen in Algorithm 3), allows faster convergence and fewer spurious outlier decisions, since any non-convex solution inherently allows non-optimal guesses from time to time. The heuristic embraces a self-feedback mechanism (using \(si\)) by rewarding itself for every favorable decision that reduces latency, thus closely approximating the nature of accelerated gradient descent, a common method used to solve convex optimization problems efficiently.
As is evident, the only measurement metric that guides the partitioning decision is the run-time e2e latency on the edge device. Unlike previous works, e2e latency measurement at the edge is comparatively accurate and simple, and it covers all additional processing and unknown overheads on the edge device and cloud platform. Furthermore, the algorithms add minimal overhead to DNN inference because of their lightweight operations; the algorithm execution time is much shorter than the actual inference operation. Finally, the edge and cloud platforms can work asynchronously, as the clock skew between the possibly geographically distant counterparts involved in the inference has zero impact on the heuristic, since it does not depend on cloud-based measurements.
5 EXPERIMENTAL METHODOLOGY
In this section, we describe the components used in the experimental evaluation of PArtNNer.
5.1 System Benchmarks
We demonstrate the platform-agnostic feature of PArtNNer on five different edge compute platforms.
RPi0 [19] was selected as the first compute platform; it emulates SOTA tiny/wearable IoT devices with limited on-device compute capability for AI. RPi3 [20] was chosen as the second physical platform, representative of the wide variety of embedded devices and edge IoT platforms. In contrast to the resource-constrained RPi0, RPi3 can run different DNN models and is generally used as a baseline to compare AI hardware and DNN accelerators. Both of these platforms lack a dedicated GPU or coprocessor for DNN computation. Note that we performed real-world experiments on both hardware platforms in their vanilla configuration, i.e., without any hardware or software customization/optimization. The only modifications we made were disabling (i) the RPi camera module to free up memory, since this module by default allocates 128 MB to the Broadcom VideoCore GPU (only used for video encoding/decoding), and (ii) LightDM, the OS display manager, to free the processor of any additional computational overhead.
Taking into account the vast ecosystem of custom silicon (ASICs) for edge AI, we developed computational models for the remaining three hardware platforms, each including a GPU or DNN accelerator. (i) Intel® NCS [32, 33] contains the Intel Neural Compute Engine, a dedicated hardware accelerator. Since this is a USB plug-and-play AI device, the computational model considered an RPi3 connected to the NCS dongle, and this combination formed our third system benchmark. (ii) NVIDIA® Jetson [50] contains a streaming multiprocessor (128 CUDA cores) that allows parallel execution of multiple AI workloads. The presence of a GPU facilitates fast DNN inference, and this device was our fourth benchmark. (iii) Google Coral [25, 26] is an ASIC specifically designed to accelerate DL. The underlying hardware architecture of the ETPU supports fast matrix operations, thereby providing exceptional levels of acceleration for DNN inference. In addition, the fast data transfer rate between the ETPU and internal memory on the Coral board also contributes to this speedup. We leveraged several benchmark articles [1, 3, 24, 27, 50, 57] to design these computational models such that the reported e2e inference time of each DNN model closely matches that of the actual hardware. Further details on how these systems fit into the e2e experimental setup are discussed in Section 5.2.
The communication standards explored in this study include Wi-Fi5, 5G, and Wi-Fi6 [35, 55, 59], with measured maximum bandwidths of 25 Mbps, 2.5 Gbps, and 5 Gbps, respectively. The wireless module onboard RPi0 only supports Wi-Fi4. Therefore, we use the terminology “BWiFi” to denote 4G LTE/Wi-Fi4 (specifically for RPi0) and Wi-Fi5 for all other devices. Due to the similarity in the observed bandwidth across these standards, we use BWiFi in all relevant graphs in this article. During the evaluation of PArtNNer, the network bandwidth was varied within the measured range of each standard.
5.2 End-to-End System and Software Setup
Figure 8 depicts the e2e experimental setup used in this work, which consists of the edge device, the cloud server connected to a local PC through an SSH connection, a display connected to the edge device, and finally a Logitech Webcam C310 connected to the device via USB. The webcam was used for image acquisition and real-time DNN inference using PArtNNer.
We performed the partitioning experiments in real time on both Raspberry Pi boards and, therefore, obtained the oracle data and executed PArtNNer directly on the physical hardware.
5.3 DNN Benchmarks
To perform a comprehensive evaluation of PArtNNer, we selected six DNN benchmarks covering image classification and object detection, as listed in Table 3.
Note that the number of partitions differs from the number of layers due to our adoption of block-level partitioning. This approach not only handles complex data dependencies and ensures proper e2e collaborative inference execution, but also substantially reduces the search space of the heuristic. Among the classification DNNs, we considered three large-scale DNNs traditionally used on servers, namely AlexNet, InceptionV3, and ResNet101, and two edge/mobile-optimized DNNs, namely SqueezeNet1.1 and MobileNetV2, used for real-time edge AI applications. All of these models were pretrained on the ImageNet dataset [12]. We also assess PArtNNer on YOLOv3-Tiny, a SOTA object detection DNN for edge AI applications.
5.4 Inference Latency Measurement and Optimal Partitioning Point Calculation
An IoT application that runs image classification/detection comprises several pipelined stages. As part of our measurements to obtain the oracle inference latency during collaborative edge-cloud inference, we measure the latency of each constituent stage and accumulate them to obtain the e2e latency. Specifically, we use the publicly available Torchprof [75], a minimal-dependency library, to profile each DNN benchmark (Table 3) and obtain execution times at the granularity of a layer or block. In all our experiments, we assume that the DNN models are already loaded, i.e., the network is initialized and the pretrained weights are loaded from disk to memory, since model loading poses a considerable overhead to the e2e latency and also depends on the architecture/memory characteristics of the system. However, the model loading time only affects the first inference latency and, therefore, has a negligible effect on the heuristic's long-running operation.
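A per-stage timing harness of the kind described might look like the following sketch; the stage functions are placeholders for the actual pipeline stages, not the real implementation:

```python
import time

def timed(stage_fn, *args):
    """Run one pipeline stage and return (result, latency in ms)."""
    t0 = time.perf_counter()
    out = stage_fn(*args)
    return out, (time.perf_counter() - t0) * 1e3

# Placeholder stages of a collaborative-inference pipeline (identity functions
# standing in for real preprocessing, edge compute, transmission, cloud compute).
def preprocess(x):    return x
def edge_blocks(x):   return x   # blocks up to P_O, on the edge device
def transmit(fm):     return fm  # feature maps sent over the network
def cloud_blocks(fm): return fm  # remaining blocks, on the cloud server

def e2e_inference(x):
    """Accumulate per-stage latencies into the cumulative e2e latency (ms)."""
    total = 0.0
    for stage in (preprocess, edge_blocks, transmit, cloud_blocks):
        x, ms = timed(stage, x)
        total += ms
    return x, total
```

In the real setup, the stage bodies would run the actual model partitions and network transfer, while the accumulation logic stays the same.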
Note that the different constituent latencies are used only for the calculation of the e2e inference latency for EoI, CoI, and the oracle, which underpin the results and plots in Section 6. Conversely, PArtNNer itself relies only on the e2e latency measured at run time on the edge device.
As discussed in Section 3, the variation in server load also affects server performance as well as the DNN inference time. To simulate the load variation originating from multiple DNN inference requests from multiple edge devices, we used a Python-based load generator on the server.
Note that we had to repeat these experiments for fifteen different systems formed by all combinations of five edge platforms (Table 2) and three communication standards. The processing speeds and transmission times of intermediate feature maps for the NCS-, Jetson-, and ETPU-based systems were estimated from computational models, as indicated in Section 5.1. We have already shown the oracle partition points for an RPi3-based system using Wi-Fi5, for wireless bandwidth in the range {\(0{-}25\) Mbps} and server load {\(1{-}15\)}, for six DNNs in Figure 2 in Section 3. Finally, to gauge the quality of PArtNNer's partitioning decisions, we compare them against this oracle.
6 EXPERIMENTAL RESULTS AND DISCUSSIONS
In this section, we present the results obtained during the experimental evaluation of PArtNNer.
6.1 DL Inference Latency Improvement
Figure 9 shows how PArtNNer performs against CoI, EoI, and random partitioning, with the oracle as the reference.
Figure 10 depicts the same four-way comparison for all benchmark DNNs averaged over all 15 system benchmarks used in the study. This figure shows the efficacy of the proposed system, since PArtNNer consistently achieves e2e latency close to that of the oracle.
6.2 Adaptability to Dynamic Variations
An in-depth analysis of PArtNNer's adaptability to dynamic variations in server load and network bandwidth is presented next, using five representative trends.
Among the five representative trends, trend 0 depicts a scenario where both load and bandwidth are constant throughout the duration of the experiment. The oracle inference latency remains constant for all instances, since the oracle partition point does not change. On the contrary, PArtNNer starts from a random initial partition point and converges toward the oracle latency within a few inference instances.
Similar observations can also be derived from the remaining four plots in Figure 11, corresponding to trends 1 to 4. Trend 1 represents a scenario with monotonically increasing load and bandwidth. In trend 2, the server load increases stepwise after a few inference instances, while the bandwidth increases linearly during the first half and shows sinusoidal behavior in the latter half. Trend 3 shows the case where the bandwidth increases stepwise, while the load shows a mixture of linear and sinusoidal behavior. Finally, trend 4 shows a gradual upward trend followed by a downward pattern in load and a repetitive upward-downward triangular trend in bandwidth. As clearly seen in all the plots, PArtNNer adapts its partition point to these variations and closely tracks the oracle latency.
To allow further comparative analysis, we have adopted the relative error between the inference times of the oracle and any other mode of operation as the evaluation metric, as shown in Equation (1):
(1) \(\begin{equation} \varepsilon _{xo} = \frac{\sum _{k=1}^{I} (Ti_{x} - Ti_{oracle})}{\sum _{k=1}^{I} Ti_{oracle}}, \text{for } x \in \lbrace c, e, r, p\rbrace , \end{equation}\) where \(c, e, r, p\) represent CoI, EoI, random, and PArtNNer, respectively, and \(I\) is the total number of inference instances.
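With per-instance latency lists, Equation (1) reduces to a one-line aggregate; the latencies below are hypothetical:

```python
def relative_error(ti_x, ti_oracle):
    """epsilon_xo from Equation (1): aggregate excess latency relative to the oracle."""
    assert len(ti_x) == len(ti_oracle)
    excess = sum(x - o for x, o in zip(ti_x, ti_oracle))
    return excess / sum(ti_oracle)

# Hypothetical latencies (ms) over I = 4 inference instances.
oracle = [10.0, 10.0, 12.0, 8.0]
candidate = [11.0, 10.0, 13.0, 9.0]
print(relative_error(candidate, oracle))  # 0.075, i.e., 7.5% above the oracle
```

A value of 0 means the mode matched the oracle on aggregate; smaller is better.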
6.3 Platform Agnostic Behavior
Now, we demonstrate the performance benefits of PArtNNer across all fifteen system benchmarks, highlighting its platform-agnostic behavior.
6.4 Comparative Analysis for Multiple DNNs
Due to the exponential growth in DNN algorithm research, new network architectures are released quite frequently. Therefore, if a system designer intends to deploy a new DNN architecture for an edge AI application, profiling becomes necessary when following the related work in this area. However, in addition to being platform-agnostic, our proposed system is also invariant to the DNN model. We demonstrate this adaptability to any DNN architecture in Figure 14. Here, the comparison among CoI, EoI, random partitioning, and PArtNNer is shown for all six DNN benchmarks.
6.5 Convergence Study
Any optimization algorithm is gauged by how fast it converges to the globally optimal value. Since the globally optimal value changes with even a minor change in the temporally varying operating conditions, we evaluated the convergence of our proposed algorithms by measuring convergence to the optimum for each DNN under a trend with constant server load and network bandwidth, i.e., Trend 0. The number of consecutive inferences, out of a total of 1,024 inference instances, needed by PArtNNer to converge to the OPP is reported for each DNN benchmark.
In contrast, the cost of an exhaustive search for the OPP of a particular DNN is directly correlated with its number of partitioning points. A naive exhaustive partitioning algorithm using an offline profiling mechanism would evaluate the e2e latency at all viable points for a constant operating condition specific to a fixed edge-cloud platform pair. Switching to a different platform invokes the need for additional profiling. For example, offline profiling of ResNet101 on fifteen different platforms would require 40 \(\times\) 15, i.e., 600, profiling experiments for constant server load and bandwidth. In contrast, even without any profiling stage, PArtNNer converges to the OPP within a small number of inference instances.
7 ADAPTABILITY TO PARTITIONING FOR MINIMUM ENERGY
Although PArtNNer is designed to minimize e2e inference latency, the same measurement-driven heuristic can be adapted to partition for minimum edge energy consumption by replacing the latency feedback with run-time energy measurements.
8 DISCUSSIONS AND FUTURE WORK
In this section, we present the limitations of this work, the scope of applicability, and future research directions aimed at solving the challenges of collaborative inference for DL at the edge. This study focuses on deep learning application scenarios with the optimization objective of minimizing e2e latency when a single edge node (client device) and a single cloud server participate in collaborative inference. We do not consider environments consisting of multiple homogeneous/heterogeneous edge devices and multiple cloud servers, where the objective is to deploy multiple DNNs in an optimized way and perform resource autoscaling, as seen in the related work discussed in Section 2. However, these distributed serving architectures could potentially be used in conjunction with our partitioning technique as part of future work. We also do not consider general computation partitioning across edge, fog, and cloud as in other related work [13]. Nevertheless, the adaptation of PArtNNer to such settings is a promising direction for future work.
Our technique falls under the category of dynamic or online partitioning, which requires a system-wide (comprising edge and cloud) replication of DNN weights that is necessary for run-time flexibility. An alternative approach could be to communicate the weights from the edge to the cloud or vice versa. However, transmitting weights at run time would add substantial communication overhead and delay to every change of partition point.
One limitation of the proposed algorithm is the static nature of two critical parameters, viz., \(\alpha\) and \(part\_prob\). However, the algorithm can be extended to improve the exploration and convergence timelines by dynamically updating these parameters over the lifetime of the inference application. In addition, the size of the timing window over which the algorithm compares the latency difference with \(\alpha\) is only 2, which can lead to spikes in latency. Future work can explore different window sizes to calculate a moving average of latency measurements, which may eliminate such spikes. However, the window size essentially trades off stability vs. adaptability, and the decision also depends on the application scenario. Therefore, exploring these tradeoffs can be pursued as part of future work. Although the values of the algorithm parameters have been chosen logically and empirically, future work could establish simple guidelines for setting them before applying PArtNNer to a new deployment.
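The moving-average idea could be sketched with a fixed-size window (the window length is a tunable assumption):

```python
from collections import deque

class LatencySmoother:
    """Moving average over the last `window` e2e latency samples."""

    def __init__(self, window=4):
        # deque with maxlen discards the oldest sample automatically.
        self.samples = deque(maxlen=window)

    def add(self, ti):
        """Record one e2e latency measurement and return the smoothed value."""
        self.samples.append(ti)
        return sum(self.samples) / len(self.samples)
```

Comparing smoothed averages instead of the raw \(Ti_{curr}\)/\(Ti_{prev}\) pair would damp one-off spikes, at the cost of slower reaction to genuine shifts in operating conditions, which is precisely the stability vs. adaptability tradeoff noted above.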
9 CONCLUSION
In this article, we introduced PArtNNer, a platform-agnostic and profiling-free adaptive partitioning framework that distributes DNN inference between edge devices and cloud servers to minimize e2e inference latency under temporally varying network bandwidth and server load.
- [1] . 2019. Benchmarking Edge Computing: Comparing Google, Intel, and NVIDIA Accelerator Hardware. Medium. Retrieved May 15, 2022 from https://medium.com/@aallan/benchmarking-edge-computing-ce3f13942245Google Scholar
- [2] . 2016. Fused-layer CNN accelerators. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture (Taipei, Taiwan) (
MICRO-49 ). IEEE Press, Article22 , 12 pages.DOI: Google ScholarCross Ref - [3] . 2021. DeepEdgeBench: Benchmarking deep neural networks on edge devices. In 2021 IEEE International Conference on Cloud Engineering (IC2E) (San Francisco, CA, USA). IEEE, 20–30.
DOI: Google ScholarCross Ref - [4] . 2010. Network traffic characteristics of data centers in the wild. In Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement (Melbourne, Australia) (
IMC ’10 ). ACM, New York, NY, 267–280.DOI: Google ScholarDigital Library - [5] . 2016. Democratizing AI. Microsoft. Retrieved March 20, 2022 from https://news.microsoft.com/features/democratizing-aiGoogle Scholar
- [6] . 2016. Efficient multi-user computation offloading for mobile-edge cloud computing. IEEE/ACM Transactions on Networking 24, 14 (2016), 2795–2808.
DOI: Google ScholarDigital Library - [7] . 2022. Energy-efficient offloading for DNN-based smart IoT systems in cloud-edge environments. IEEE Transactions on Parallel and Distributed Systems 33, 3 (2022), 683–697.
DOI: Google ScholarCross Ref - [8] . 2021. Lazy batching: An SLA-aware batching system for cloud machine learning inference. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA) (Seoul, Korea (South)). IEEE Computer Society, Los Alamitos, CA, 493–506.
DOI: Google ScholarCross Ref - [9] . 2011. CloneCloud: Elastic execution between mobile device and cloud. In Proceedings of the 6th Conference on Computer systems (Salzburg, Austria) (
EuroSys ’11 ). ACM, New York, NY, 301–314.DOI: Google ScholarDigital Library - [10] . 2020. InferLine: Latency-aware provisioning and scaling for prediction serving pipelines. In Proceedings of the 11th ACM Symposium on Cloud Computing (Virtual Event, USA) (
SoCC ’20 ). ACM, New York, NY, 477–491.DOI: Google ScholarDigital Library - [11] . 2017. Clipper: A low-latency online prediction serving system. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17) (Boston, MA, USA). USENIX Association, 613–627. Retrieved from https://www.usenix.org/conference/nsdi17/technical-sessions/presentation/crankshawGoogle Scholar
- [12] 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops) (Miami, FL, USA). IEEE Computer Society, Los Alamitos, CA, 248–255.
- [13] 2022. Incentive mechanism and resource allocation for edge-fog networks driven by multi-dimensional contract and game theories. IEEE Open Journal of the Communications Society 3 (2022), 435–452.
- [14] 2017. A reconfigurable streaming deep convolutional neural network accelerator for Internet of Things. IEEE Transactions on Circuits and Systems I: Regular Papers 65, 1 (2017), 198–208.
- [15] 2021. Joint optimization of DNN partition and scheduling for mobile cloud computing. In Proceedings of the 50th International Conference on Parallel Processing (Lemont, IL, USA) (ICPP ’21). ACM, New York, NY, Article 21, 10 pages.
- [16] 2020. In datacenter performance, the only constant is change. In 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID ’20) (Melbourne, VIC, Australia). IEEE, 370–379.
- [17] 2018. Energy and performance efficient computation offloading for deep neural networks in a mobile cloud computing environment. In Proceedings of the 2018 Great Lakes Symposium on VLSI (Chicago, IL, USA) (GLSVLSI ’18). ACM, New York, NY, 111–116.
- [18] 2023. DNN partitioning for inference throughput acceleration at the edge. IEEE Access 11 (2023), 52236–52249.
- [19] 2017. Raspberry Pi Zero W. Retrieved May 15, 2022 from https://www.raspberrypi.org/products/raspberry-pi-zero-w
- [20] 2018. Raspberry Pi 3 Model B+. Retrieved May 15, 2022 from https://www.raspberrypi.org/products/raspberry-pi-3-model-b-plus
- [21] 2019. Deep neural network task partitioning and offloading for mobile edge computing. In 2019 IEEE Global Communications Conference (GLOBECOM) (Waikoloa, HI, USA). IEEE, 1–6.
- [22] 2020. Approximate inference systems (AxIS): End-to-end approximations for energy-efficient inference at the edge. In Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design. ACM, New York, NY, 7–12.
- [23] 2023. Energy-efficient approximate edge inference systems. ACM Trans. Embed. Comput. Syst. 22, 4, Article 77 (2023), 50 pages.
- [24] 2019. Machine Learning Edge Devices: Benchmark Report. Tryolabs. Retrieved May 15, 2022 from https://tryolabs.com/blog/machine-learning-on-edge-devices-benchmark-report
- [25] 2019. Coral Dev Board. Retrieved May 15, 2022 from https://coral.ai/products/dev-board
- [26] 2019. Edge TPU. Retrieved May 15, 2022 from https://cloud.google.com/edge-tpu
- [27] 2022. Edge TPU Performance Benchmarks. Retrieved May 15, 2022 from https://coral.ai/docs/edgetpu/benchmarks
- [28] 2012. COMET: Code offload by migrating execution transparently. In 10th USENIX Symposium on Operating Systems Design and Implementation (Hollywood, CA, USA) (OSDI ’12). USENIX Association, 93–106.
- [29] 2023. Does Location Matter in Cloud Computing? Ridge Cloud. Retrieved March 25, 2023 from https://www.ridge.co/blog/location-in-cloud-computing
- [30] 2021. What Realistic Speeds Will I Get with Wi-Fi 5 and Wi-Fi 6? Increase Broadband Speed. Retrieved April 5, 2022 from https://www.increasebroadbandspeed.co.uk/realistic-speeds-wi-fi-5-and-wi-fi-6
- [31] 2019. Dynamic adaptive DNN surgery for inference acceleration on the edge. In IEEE INFOCOM 2019—IEEE Conference on Computer Communications. IEEE, 1423–1431.
- [32] 2018. Intel® Neural Compute Stick 2. Retrieved May 15, 2022 from https://software.intel.com/content/www/us/en/develop/hardware/neural-compute-stick.html
- [33] 2019. Intel® Movidius™ Myriad™ X Vision Processing Unit 4GB. Retrieved May 15, 2022 from https://www.intel.com/content/www/us/en/products/sku/125926/intel-movidius-myriad-x-vision-processing-unit-4gb/specifications.html
- [34] 2022. YOLOv5 by Ultralytics. Online; last accessed August 5, 2022.
- [35] 2022. What Are Wi-Fi 6 and Wi-Fi 6E? Trusted Reviews. Retrieved November 5, 2022 from https://www.trustedreviews.com/news/wifi-6-routers-speed-3442712
- [36] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. 2017. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture (Toronto, ON, Canada) (ISCA ’17). ACM, New York, NY, 1–12.
- [37] 2017. Neurosurgeon: Collaborative intelligence between the cloud and mobile edge. SIGARCH Comput. Archit. News 45, 1 (2017), 615–629.
- [38] 2018. WonderShaper—A Tool to Limit Network Bandwidth in Linux. Tecmint: Linux Howtos, Tutorials & Guides. Retrieved April 31, 2022 from https://www.tecmint.com/wondershaper-limit-network-bandwidth-in-linux
- [39] 2018. Edge intelligence: On-demand deep learning model co-inference with device-edge synergy. In Proceedings of the 2018 Workshop on Mobile Edge Communications (Budapest, Hungary). ACM, New York, NY, 31–36.
- [40] 2018. Auto-tuning neural network quantization framework for collaborative inference between the cloud and edge. In Artificial Neural Networks and Machine Learning—ICANN 2018. Springer International Publishing, Cham, Switzerland, 402–411.
- [41] 2018. Learning IoT in edge: Deep learning for the Internet of Things with edge computing. IEEE Network 32, 1 (2018), 96–101.
- [42] 2020. A survey of AI accelerators for edge environment. In World Conference on Information Systems and Technologies: Trends and Innovations in Information Systems and Technologies. Springer International Publishing, Cham, Switzerland, 35–44.
- [43] 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (Zurich, Switzerland) (Lecture Notes in Computer Science, Vol. 8693). Springer International Publishing, Cham, Switzerland, 740–755.
- [44] 2018. EdgeEye: An edge service framework for real-time intelligent video analytics. In Proceedings of the 1st International Workshop on Edge Systems, Analytics and Networking (Munich, Germany) (EdgeSys ’18). ACM, New York, NY, 1–6.
- [45] 2018. Google Learn2Compress Moves AI Processing to Mobile and IoT Devices. Retrieved April 25, 2022 from https://www.bizety.com/2018/09/17/google-learn2compress-moves-ai-processing-to-mobile-and-iot-devices
- [46] 2019. A survey of related research on compression and acceleration of deep neural networks. Journal of Physics: Conference Series 1213, 5 (2019), 052003.
- [47] 2020. 5G Speed: 5G vs 4G Performance Compared. Tom’s Guide. Retrieved April 5, 2022 from https://www.tomsguide.com/features/5g-vs-4g
- [48] 2022. Partitioning DNNs for optimizing distributed inference performance on cooperative edge devices: A genetic algorithm approach. Applied Sciences 12, 20, Article 10619 (2022), 14 pages.
- [49] 2022. QoS Bandwidth Management. Retrieved April 31, 2022 from https://docs.paloaltonetworks.com/pan-os/9-1/pan-os-admin/quality-of-service/qos-concepts/qos-bandwidth-management
- [50] 2019. Jetson Nano Brings AI Computing to Everyone. Retrieved May 15, 2022 from https://developer.nvidia.com/blog/jetson-nano-ai-computing/
- [51] 2020. NVIDIA Jetson Linux Developer Guide. Retrieved May 15, 2022 from https://docs.nvidia.com/jetson/l4t/index.html
- [52] 2021. GPU Management and Deployment: Multi-Process Service. Retrieved April 31, 2022 from https://docs.nvidia.com/deploy/mps/index.html
- [53] 2020. Survey: Most Data Centers Don’t Meet the Needs of Their Users. Data Center Explorer, NetworkWorld. Retrieved March 20, 2022 from https://www.networkworld.com/article/3533998/survey-most-data-centers-dont-meet-the-needs-of-their-users.html
- [54] 2019. Raspberry Pi Documentation: Processors. Raspberry Pi Ltd. Retrieved May 15, 2022 from https://www.raspberrypi.com/documentation/computers/processors.html
- [55] 2020. 5G Speed: How Fast Is 5G? Verizon News Center. Retrieved April 5, 2022 from https://www.verizon.com/about/our-company/5g/5g-speed-how-fast-is-5g
- [56] 2022. Google Coral Edge TPU Explained in Depth. Retrieved May 15, 2022 from https://qengineering.eu/google-corals-tpu-explained.html
- [57] 2023. Deep Learning with Raspberry Pi and Alternatives in 2023. Retrieved March 15, 2023 from https://qengineering.eu/deep-learning-with-raspberry-pi-and-alternatives.html
- [58] 2017. Paleo: A performance model for deep neural networks. In 5th International Conference on Learning Representations (ICLR) (Toulon, France). OpenReview.net. Retrieved from https://openreview.net/forum?id=SyVVJ85lg
- [59] 2022. Everything You Need to Know About 5G. Retrieved April 5, 2022 from https://www.qualcomm.com/5g/what-is-5g
- [60] 2021. Special session: Approximate TinyML systems: Full system approximations for extreme energy-efficiency in intelligent edge devices. In 2021 IEEE 39th International Conference on Computer Design (ICCD ’21) (Storrs, CT, USA). IEEE, 13–16.
- [61] 2023. Efficient hardware acceleration of emerging neural networks for embedded machine learning: An industry perspective. In Embedded Machine Learning for Cyber-Physical, IoT, and Edge Computing: Hardware Architectures. Springer International Publishing, Cham, Switzerland, 121–172.
- [62] 2017. Delivering deep learning to mobile devices via offloading. In Proceedings of the Workshop on Virtual Reality and Augmented Reality Network (Los Angeles, CA, USA) (VR/AR Network ’17). ACM, New York, NY, 42–47.
- [63] 2018. DeepDecision: A mobile deep learning framework for edge video analytics. In IEEE INFOCOM 2018—IEEE Conference on Computer Communications (Honolulu, HI, USA). IEEE, 1421–1429.
- [64] 2022. FA2: Fast, accurate autoscaling for serving deep learning inference with SLA guarantees. In 2022 IEEE 28th Real-Time and Embedded Technology and Applications Symposium (RTAS) (Milano, Italy). IEEE, 146–159.
- [65] 2021. AI accelerator survey and trends. In 2021 IEEE High Performance Extreme Computing Conference (HPEC) (Waltham, MA, USA). IEEE, 1–9.
- [66] 2022. Evaluating performance variations cross cloud data centres using multiview comparative workload traces analysis. Connection Science 34, 1 (2022), 1582–1608.
- [67] 2022. Does Data Center Location Matter for Cloud Services? Retrieved April 25, 2022 from https://ussignal.com/blog/does-data-center-location-matter-for-cloud-services
- [68] 2020. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (Seattle, WA, USA). IEEE Computer Society, Los Alamitos, CA, 10778–10787.
- [69] 2019. Data Center Performance Analysis: Challenges and Practices. Medium. Retrieved April 20, 2022 from https://alibabatech.medium.com/data-center-performance-analysis-challenges-and-practices-c5c9a2b5e5a9
- [70] 2016. BranchyNet: Fast inference via early exiting from deep neural networks. In 2016 23rd International Conference on Pattern Recognition (ICPR) (Cancun, Mexico). IEEE, 2464–2469.
- [71] 2017. Distributed deep neural networks over the cloud, the edge and end devices. In 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS ’17) (Atlanta, GA, USA). IEEE, 328–339.
- [72] 2019. How to Limit Bandwidth on Linux to Better Test Your Applications. TechRepublic. Retrieved April 31, 2022 from https://www.techrepublic.com/article/how-to-limit-bandwidth-on-linux-to-better-test-your-applications
- [73] 2019. ADDA: Adaptive distributed DNN inference acceleration in edge computing environment. In 2019 IEEE 25th International Conference on Parallel and Distributed Systems (ICPADS ’19). IEEE, 438–445.
- [74] 2020. Convergence of edge computing and deep learning: A comprehensive survey. IEEE Communications Surveys & Tutorials 22, 2 (2020), 869–904.
- [75] 2020. torchprof. Retrieved April 10, 2021 from https://github.com/awwong1/torchprof
- [76] 2020. Is Edge Computing the Answer to a Data Center Overload? Tech Wire Asia. Retrieved March 20, 2022 from https://techwireasia.com/2020/04/is-edge-computing-the-answer-to-a-data-center-overload
- [77] Carole-Jean Wu, David Brooks, Kevin Chen, Douglas Chen, Sy Choudhury, Marat Dukhan, Kim Hazelwood, Eldad Isaac, Yangqing Jia, Bill Jia, Tommer Leyvand, Hao Lu, Yang Lu, Lin Qiao, Brandon Reagen, Joe Spisak, Fei Sun, Andrew Tulloch, Peter Vajda, Xiaodong Wang, Yanghan Wang, Bram Wasti, Yiming Wu, Ran Xian, Sungjoo Yoo, and Peizhao Zhang. 2019. Machine learning at Facebook: Understanding inference at the edge. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA ’19) (Washington, DC, USA). IEEE, 331–344.
- [78] 2019. DNNTune: Automatic benchmarking DNN models for mobile-cloud computing. ACM Trans. Archit. Code Optim. 16, 4, Article 49 (2019), 26 pages.
- [79] 2020. A Note on Latency Variability of Deep Neural Networks for Mobile Inference. arXiv:2003.00138. Retrieved from https://arxiv.org/abs/2003.00138
- [80] 2017. DeepIoT: Compressing deep neural network structures for sensing systems with a compressor-critic framework. In Proceedings of the 15th ACM Conference on Embedded Network Sensor Systems (Delft, Netherlands) (SenSys ’17). ACM, New York, NY, Article 4, 14 pages.
- [81] 2019. Toward efficient compute-intensive job allocation for green data centers: A deep reinforcement learning approach. In 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS ’19) (Dallas, TX, USA). IEEE, 634–644.
- [82] 2017. LAVEA: Latency-aware video analytics on edge computing platform. In 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS ’17) (Atlanta, GA, USA). IEEE, 2573–2574.
- [83] 2022. Towards resource-aware DNN partitioning for edge devices with heterogeneous resources. In GLOBECOM 2022—2022 IEEE Global Communications Conference (Rio de Janeiro, Brazil). IEEE, 5649–5655.
- [84] 2018. ECRT: An edge computing system for real-time image-based object tracking. In Proceedings of the 16th ACM Conference on Embedded Networked Sensor Systems (Shenzhen, China) (SenSys ’18). ACM, New York, NY, 394–395.
Index Terms
- PArtNNer: Platform-Agnostic Adaptive Edge-Cloud DNN Partitioning for Minimizing End-to-End Latency