research-article

Open Access

PArtNNer: Platform-Agnostic Adaptive Edge-Cloud DNN Partitioning for Minimizing End-to-End Latency

Authors:
Soumendu Kumar Ghosh

Purdue University, USA

Purdue University, USA

0000-0001-6776-1427
View Profile

,
Arnab Raha

Intel Corporation, USA

Intel Corporation, USA

0000-0002-8848-1069
View Profile

,
Vijay Raghunathan

Purdue University, USA

Purdue University, USA

0000-0003-4713-5386
View Profile

,
Anand Raghunathan

Purdue University, USA

Purdue University, USA

0000-0002-4624-564X
View Profile

Authors Info & Claims

ACM Transactions on Embedded Computing Systems Volume 23 Issue 1Article No.: 6pp 1–38https://doi.org/10.1145/3630266

Published:10 January 2024Publication History

ACM Transactions on Embedded Computing Systems

Abstract

The last decade has seen the emergence of Deep Neural Networks (DNNs) as the de facto algorithm for various computer vision applications. In intelligent edge devices, sensor data streams acquired by the device are processed by a DNN application running on either the edge device itself or in the cloud. However, “edge-only” and “cloud-only” execution of State-of-the-Art DNNs may not meet an application’s latency requirements due to the limited compute, memory, and energy resources in edge devices, dynamically varying bandwidth of edge-cloud connectivity networks, and temporal variations in the computational load of cloud servers. This work investigates distributed (partitioned) inference across edge devices (mobile/end device) and cloud servers to minimize end-to-end DNN inference latency. We study the impact of temporally varying operating conditions and the underlying compute and communication architecture on the decision of whether to run the inference solely on the edge, entirely in the cloud, or by partitioning the DNN model execution among the two. Leveraging the insights gained from this study and the wide variation in the capabilities of various edge platforms that run DNN inference, we propose PArtNNer, a platform-agnostic adaptive DNN partitioning algorithm that finds the optimal partitioning point in DNNs to minimize inference latency. PArtNNer can adapt to dynamic variations in communication bandwidth and cloud server load without requiring pre-characterization of underlying platforms. Experimental results for six image classification and object detection DNNs on a set of five commercial off-the-shelf compute platforms and three communication standards indicate that PArtNNer results in 10.2× and 3.2× (on average) and up to 21.1× and 6.7× improvements in end-to-end inference latency compared to execution of the DNN entirely on the edge device or entirely on a cloud server, respectively. Compared to pre-characterization-based partitioning approaches, PArtNNer converges to the optimal partitioning point 17.6× faster.

1 INTRODUCTION

In recent years, Artificial Intelligence (AI) and specifically Deep Learning (DL) have become the dominant data analytics technology for cloud and edge computing. Intelligent applications based on DL, which are prevalent in various domains such as Computer Vision (CV) (e.g., face recognition, autonomous driving, video captioning, super resolution), Natural Language Processing (NLP) (e.g., machine translation, speech recognition, sentiment analysis), recommendation systems (used by Facebook, Amazon, Netflix, LinkedIn, etc.), deeply impact our lives and have fundamentally altered the way we interact with computing [61]. The success of these applications can be attributed to the ever-improving computing power of cloud-based data centers and to the ever-decreasing cost and increasing ease of deploying DL-based solutions in various types of edge devices. However, cloud computing infrastructures have been increasingly challenged by the growth of these workloads [53, 76]. Therefore, DL-based edge intelligence has garnered significant attention, as it complements the cloud by alleviating the backbone network and providing an agile response. Recent years have witnessed rapid growth of edge computing due to widespread research and innovation in Internet of Things (IoT) devices, embedded sensors, and smart systems coupled with ubiquitous wireless communication. Consequently, edge intelligence has enabled the democratization of AI to facilitate AI “for every person and every organization” [5]. Among the different technologies that encompass edge intelligence, we specifically focus on Deep Neural Network (DNN) inference.

Among the prevalent edge computing techniques for DL as shown in Figure 1, previous work [63] performs complete offloading in which the edge device offloads computation requests along with sensory data to the resource-rich cloud for DL inference. We refer to this approach as Cloud-only Inference (CoI). Although this technique allows the deployment of highly accurate but compute/memory-intensive DNNs in the cloud, high transmission cost, strict application latency demands, and lack of reliable network connectivity heavily impact the application efficiency. Many of these problems can be mitigated by using on-device or Edge-only Inference (EoI), where the entire DNN is executed on the edge device. Congruent efforts to accelerate energy-efficient EoI have focused on customizing and optimizing edge hardware and software, such as the design of highly optimized mobile CPU/GPU/ASIC/FPGA and edge DL frameworks (TensorFlow Lite, Embedded Learning Library, Qualcomm Neural Processing SDK for AI, ARM CMSIS-NN, etc.). However, despite extensive research in this domain, most edge devices, being highly resource-constrained, can only run lightweight DL models, since complex and accurate State-of-the-Art (SOTA) models continuously exceed the compute capacity, memory, and energy budget of edge devices.

Fig. 1. Edge computing techniques for DL inference. All DNN layers are executed on edge in edge-only inference (mode 1) and on cloud in cloud-only inference (mode 2). Few layers run on edge and rest of the layers run on cloud in collaborative mode (mode 3).

In addition to these two strategies at opposite ends of the edge DL spectrum, edge-cloud collaborative inference has also been explored in some recent works. DL model partitioning or model splitting is the basis of this strategy. Note that DL model partitioning is a special case of general edge-cloud partitioning and offloading of computation tasks [13] from the end users. In this work, we focus on DNN partitioning since it requires customized solutions that can take advantage of the unique characteristics of DL algorithms, as general techniques may not be able to take advantage of specific information about DNN architectures. Among the prevalent DNN partitioning approaches, in horizontal collaboration [82], individual layers are partitioned into distributable tasks that are executed in parallel using multiple edge devices. On the other hand, in vertical collaboration [37, 71], DNN is partitioned at an intermediate layer according to one or more premeditated criteria, and data preprocessing followed by partitioned DL inference is executed on the edge device until the chosen layer. Subsequently, intermediate data (feature maps) are transmitted to a cloud server where DL inference is performed on the remaining layers. In this work, we focus on the vertical partitioning of DNN models across edge-cloud platforms and aim to find the optimal partitioning layer to minimize the DL inference latency. To the best of our knowledge, most existing partitioning approaches involve substantial offline characterization/profiling of edge devices and cloud platforms, where DNN layers are profiled to generate performance prediction models. This approach faces several key challenges that limit its feasibility for pervasive edge intelligence. First, due to the significant diversity in the System-on-Chip (SoC) architectures in edge platforms [77] and DL software frameworks and third-party libraries, offline characterization becomes necessary for each new system configuration. This limits the solution’s scalability to the large and ever-growing pool of edge systems. Second, the heterogeneity and non-monotony in the SOTA DNN layers and the size of the feature maps make the partitioning problem even more challenging. Third, different platforms may have been optimized for a specific type of DNN layer (convolution, pooling, fully-connected, normalization layers, etc.) due to their underlying hardware characteristics. This results in different layers being executed with widely varying efficiencies on these diverse platforms, some with extremely poor performance. Finally, existing operating conditions, such as wireless network bandwidth and cloud server load that accommodates edge device requests, can vary significantly over time, thus affecting the optimal partitioning point for a particular DNN. All these factors indicate the deficiencies in existing edge-cloud DNN partitioning solutions.

To overcome these challenges, we explore the edge-cloud partitioning problem using platform-agnostic adaptive partitioning of DNNs (PArtNNer) to minimize end-to-end (e2e) edge inference latency. Specifically, we focus on reducing latency for DL-based applications that are used for image recognition and object detection on edge devices. In a nutshell, this article makes the following key contributions:

–	Exploring the factors affecting partitioned DL inference—We experimentally demonstrate that the optimal partitioning point for any DNN (and, correspondingly, the group of layers to be executed on the edge device and/or the cloud) that minimizes the total end-to-end application latency depends on the system architecture, including (i) edge device/platform processing hardware, (ii) device’s communication subsystem, as well as temporally varying operating conditions, such as (iii) available wireless network bandwidth, and (iv) cloud server load. Based on these insights, we motivate the need for a partitioning solution that is both adaptive to dynamic factors and agnostic to the underlying system architecture and thus can eliminate the need for a detailed device-specific profiling/characterization phase.
–	Platform-agnostic adaptive partitioning system—We propose PArtNNer, a system to automatically partition DNNs in an edge-cloud collaborative execution scenario that is adaptive to dynamically varying operating conditions. PArtNNer does not require any pre-characterization of the underlying computing (edge and cloud) platforms and relies on run-time measurements only on the edge platform to derive the optimal partitioning point.
–	Multi-system and Multi-DNN evaluation—We experimentally evaluated PArtNNer on two Commercial Off-The-Shelf (COTS) compute platforms, viz., Raspberry Pi 0 (RPi0) and Raspberry Pi 3 Model B+ (RPi3) and used computational models from three widely used edge AI platforms, viz., Intel^® Neural Compute Stick 2 (NCS), NVIDIA^® Jetson Nano™ Developer Kit (Jetson), and Google Edge TPU Coral Dev Board (ETPU). We adopted three different wireless communication standards, viz., Wi-Fi5, 5G, and Wi-Fi6 for the performance validation of PArtNNer for each of these 5 devices. Using these diverse architectures, we prepared an evaluation framework comprising 15 distinct system configurations and six DNNs, viz., AlexNet, InceptionV3, ResNet101, SqueezeNet1.1, MobileNetV2, and YOLOv3-Tiny. The experimental results obtained using this test bench demonstrate that PArtNNer provides \({3.7}\times\)–\({21.1}\times\) and \({1.4}\times\)–\({6.7}\times\) improvements in end-to-end inference latency compared to traditional edge-only and cloud-only execution of the DNN. Compared to a naive technique that randomly selects an intermediate partitioning point, PArtNNer provides \({4.6}\times\)–\({11.5}\times\) improvement in end-to-end inference latency. The proposed algorithm converges to the optimal partition point \(17.6\times\) faster compared to an exhaustive search with characterization.

The rest of the article is organized as follows: Section 2 presents a brief overview of the existing literature in edge-only inference, cloud-only inference, and edge-cloud collaborative inference. Section 3 covers the necessary background and motivation for partitioned DNN inference and the different factors that influence the latency and partitioning decision. This is followed by the key design methodologies and the associated rationale behind the design decisions of the adaptive DNN partitioning algorithm in Section 4. The experimental setup and methodology comprising the five compute platforms, together with six different DNN architectures used in this work are described in Section 5. The article will then review the experimental results of PArtNNer in Section 6. Finally, Section 9 concludes the article with a short summary and guidelines for future research on collaborative edge-cloud inference systems.

2 RELATED WORK

Rapid progress in the fields of edge computing and deep learning has led to a thrust in the industry to push cognitive abilities to each and every application that we interact with in our daily lives. To this end, various approaches have been proposed to infuse intelligence into the plethora of edge/IoT devices surrounding us [74]. Before diving into the existing literature, we clarify the ambiguous nature of the definition of “edge devices” in the literature. Without loss of generality, we consider all IoT or embedded and mobile/client devices (such as autonomous vehicles, wearables, smartphones, drones, conversational assistants, etc.) that sit on the edge of the IoT hierarchy and sense and generate data for the DL application, as “edge devices”. Other computing platforms, including fog nodes, network edge servers, roadside units, and cloud-based data centers, are classified as “cloud servers” for the entirety of this article.

Traditionally, computation requests arising from DL-based applications running on edge devices are offloaded to powerful cloud servers in CoI mode. Therefore, edge devices are not subject to additional computing overhead, scheduling delays, and the need for resource optimization [9, 28] in this scenario. In contrast, in the EoI paradigm, breakthrough research in the development of powerful mobile CPUs/GPUs/ASICs/FPGA [14, 32, 36, 50], hardware accelerators [2, 42, 65] coupled with DNN model optimization and compression strategies [46, 80] has enabled the deployment of complex DNNs on edge devices, thus contributing to the goal of democratizing AI. Furthermore, the rise in decentralized AI architectures based on federated learning and blockchain, data privacy, and security needs, in addition to bandwidth, latency, and cost issues, has driven industry research on edge AI [45]. However, both techniques face diverse challenges, as described in Section 1. The drawbacks pertaining to strict delay requirements, privacy and reliability issues for CoI, and limited compute/memory capability, energy consumption, and cost bottleneck in EoI have inspired the development of the collaborative edge-cloud computing paradigm.

In the collaborative sphere, previous works [44, 63] determine the optimal strategy to execute DL inference either on the edge device or on a remote server based on multiple criteria (such as DNN accuracy, inference latency, device energy, etc.). DL model segmentation or vertical model partitioning approaches [37, 40, 41, 71] have also been used to make the best use of the computing power of edge and cloud infrastructures to meet application latency demands and ensure DNN accuracy. A common limitation of these studies is the need for an offline characterization phase that is carried out to estimate the device-specific performance of different DNN layers and the energy consumption for the adopted distributed edge-cloud setup. For example, statistical modeling [37], analytical modeling [58], and application-specific profiling [17] of DNN layers have been found in the literature that use the results of the profiling phase to decide the partition point during the DNN inference run time. Related research works have adopted different Mixed Integer Linear Programming techniques [15, 18, 21] that offer theoretical guarantees to find the optimal partition point. However, these are computationally expensive because the size of the partitioning problem is large. On the contrary, researchers have also proposed heuristic algorithms, such as Genetic Algorithms (GA) [48], Standard Particle Swarm Optimization with GA [7], Approximate Solver [31], Multipath DAG partition [15], and so on, to find approximate or suboptimal solutions quickly. Both classes of approaches require an offline profiling phase or prediction algorithms. However, as stated by the authors in Reference [17], disjointed layer-wise profiling or prediction algorithms based on DNN layer configurations are prone to estimation errors due to non-monotonic acceleration provided by various hardware architectures and software frameworks to consecutive execution of layers. Therefore, an exponential number of profiling experiments would be needed if a DNN application has to be efficiently deployed on various kinds of mobile SoCs/edge devices [77] making such approaches impractical. Furthermore, variation in diurnal performance is observed in large-scale data centers, and profiling-based evaluation methods cannot capture these variations, as the authors demonstrated in Reference [66]. Furthermore, hardware heterogeneity and complex software architectures in these cloud servers could also lead to different benchmark/profiling results at different times and therefore are not representative statistics [16, 69]. These challenges limit the accuracy and scalability of the aforementioned profiling-based edge cloud collaborative solutions. Although the authors [83] proposed a resource-aware online partitioning scheme that does not involve profiling, the partition points are randomly selected based on predetermined probabilities. Furthermore, the evaluation is limited to a single DNN (VGG16), a single virtual hardware platform, and without dynamic bandwidth variation.

Another direction of research solves the critical problem of resource autoscaling (InferLine, FA2) [10, 64] to minimize e2e application latency by intelligently allocating resources or computation nodes/hardware accelerators (e.g., CPUs, GPUs, FPGAs, TPUs) in DL inference serving systems. These approaches cater to DNN inference pipelines consisting of multiple DNNs orchestrated with a Directed Acyclic Graph (DAG) in the application. The number of DNN instances and/or request batch size are dynamically configured for different hardware nodes to meet Service Level Agreement (SLA) guarantees for response time. Other related optimization techniques that improve resource utilization include parallel processing of requests and caching of past predictions [11], deadline bound delay of requests [8], and so on. However, these approaches do not explicitly include any features for partitioning a single DNN model across multiple devices. In addition, most of them include profiling entire DNNs on different combinations of hardware and batch size to obtain latency and throughput statistics. Furthermore, the inference latency of SOTA models such as YOLOv5 [34], EfficientDet [68] depends on the input size. Therefore, layer-wise DNN profiling on multiple hardware platforms using different input sizes will be extremely costly. On the other hand, early-exit inference [39, 70, 73] that allows DNN inference to exit using side branch classifiers depending on energy and accuracy requirements also reduces latency. We have provided a comparative analysis of some relevant previous work in Table 1. In contrast to the prior work, we have adopted an orthogonal and complementary heuristic-based approach to obtain a characterization-free, platform-agnostic, and adaptive solution for the collaborative edge-cloud inference paradigm. To this end, we also investigate the impact of the variation in edge platform and communication medium, and dynamically varying wireless network conditions and server load, on the optimal partitioning decision.

Table 1.

	Frameworks
Features	PArtNNer	Neurosurgeon [37]	DeepDecision [63]	ADDA [73]	DADS [31]	Joint Optimize [15]	Resource-aware [83]
Characterization free	✓	✗	✗	✗	✗	✗	✓
Supports EoI and CoI	✓	✓	✓	✓	✓	✓	✓
Supports partitioned inference	✓	✓	✗	✓	✓	✓	✓
Hardware agnostic	✓	✗	✗	✗	✗	✗	✗
Bandwidth adaptive	✓	✓	✓	✓	✓	✓	✓
Latency reduction	✓	✓	✓	✓	✓	✓	✓
Energy reduction	✓	✓	✓	✗	✗	✗	✓
Maintains DNN accuracy	✓	✓	✗	✗	✓	✓	✓
Avoid exhaustive/random search	✓	✓	✓	✗	✓	✓	✗
Handles chain and DAG topology DNNs	✓	✗	✓	✗	✓	✓	✗

View Table

Table 1. Comparison of Related Works on Edge-Cloud Collaborative DNN Inference

3 MOTIVATION

As already stated in Sections 1 and 2, the SOTA approach to offering intelligent edge services to the end user is to perform all DL processing in the cloud (CoI), or to deploy cognitive DL capabilities inherently within the edge device that reside with the end user (EoI), or to provide collaborative edge-cloud intelligence using coordinated integration of computing, communication, and networking resources [74]. For the purpose of this work, we consider that a user has access to a particular edge device that communicates with a single cloud server. Classification or detection inference request that originates at the edge node can only be executed in an EoI, CoI, or collaborative manner. We assume that communication or offloading is not possible between multiple edge devices. In this section, we mainly reinforce distributed DL inference in the form of a collaborative edge-cloud architecture. First, we motivate the partitioning of DNNs (vertically, i.e., along the edge and cloud) to satisfy real-time latency constraints (Section 3.1). Subsequently, we demonstrate that the partitioning decision is highly dependent on temporally varying operating conditions such as wireless network bandwidth and cloud server load (Section 3.2). Finally, we show that this decision is also highly influenced by the architectures of the underlying systems (edge/cloud) and communication subsystems (Section 3.3), which could lead to exorbitant device-specific profiling and characterization. Evidently, these challenging and diverse factors highlight the need for a platform-agnostic DNN partitioning system that can completely eliminate exhaustive profiling while simultaneously being adaptive to any change in environmental conditions.

3.1 Edge-Cloud Collaboration: Why Partition DNN?

Segmentation/partitioning of the DL model between the edge and the cloud is used in the edge computing architecture to satisfy different edge inference requirements, such as real-time latency, accuracy, energy efficiency, and so on. However, automatic selection of the partitioning point is challenging due to the heterogeneity and non-monotony of DNN architectures, the variation in DNN precision, the DNN processing framework, and the variation in dynamic operating conditions that affect the inference latency of these architectures [78]. There is no unique solution, and the Optimal Partition Point (OPP) that minimizes e2e latency varies from one DNN to another, as well as within a particular DNN. To demonstrate this variance in OPP, we show six subplots in Figure 2 corresponding to six DNNs, viz., AlexNet, InceptionV3, ResNet101, SqueezeNet1.1, MobileNetV2, and YOLOv3-Tiny (details on DNNs in Table 3) under identical operating conditions. Each subplot consists of a heat map showing the OPP for the corresponding DNN at different cloud server loads (y-axis) and wireless network bandwidth (x-axis). The associated color bar with each heat map shows all possible partitioning points for the respective DNN. The two extreme ends of the colorbar, C and E, represent CoI (0 layers at the edge) and EoI (all layers at the edge), respectively, which vary from network to network. Apart from these two extremes, lighter shades indicate that the OPP is among the initial layers of DNN, thus favoring the execution of most layers in the cloud. In contrast, darker shades suggest OPP toward the end of the DNN, implying execution of most layers on the edge device. To improve the readability of the figure, the OPP for each combination of load and bandwidth is indicated in the heatmap for each DNN. As observed, the OPPs for AlexNet, InceptionV3, ResNet101, SqueezeNet1.1, MobileNetV2, and YOLOv3-Tiny are {\(20(E), 13, 3\)}, {\(20(E), 0(C)\)}, {\(39(E), 0(C)\)}, {\(17(E), 9, 0(C)\)}, {\(25(E), 10, 7, 5, 0(C)\)}, and {\(24(E), 8, 0(C)\)} respectively across all the operating conditions. As we can clearly see, the OPP distinctly varies across different DNNs. The OPPs in compute-heavy DNNs, such as InceptionV3 and ResNet101, lie at either extreme, while there are multiple OPPs for the other DNNs, which are comparatively smaller and designed for edge deployment. For example, at load 10 and bandwidth 5 Mbps, the OPP for the minimum latency of SqueezeNet1.1 is 9, while the same for ResNet101 is 0. This discussion highlights the diversity of OPP among DNNs, which contributes to one of the many challenges in designing an automated partitioning algorithm.

Table 2.

View Table

Table 2. Hardware Specifications of Different Edge Platforms used for PArtNNer [51, 54, 56]

Table 3.

View Table

Table 3. DNN Benchmarks used to Evaluate PArtNNer on Image Classification and Object Detection

Fig. 2. Variance in optimal partitioning points across six DNNs with variation in wireless network bandwidth and cloud server load on Raspberry Pi 3 Model B+. The numbers indicate the partition point which is same as the number of layers executed at the edge.

To further illustrate the benefits of partitioned or edge-cloud collaborative inference compared to always statically deciding to perform full DNN inference on cloud or edge, we closely look at the heatmap for SqueezeNet1.1, a highly optimized DNN for edge inference. In addition to the heatmap, Figure 3 shows three different subplots for three different operating conditions, each highlighting the inference latency at all possible partition points. The x-axis and the y-axis in the stacked column charts represent the partition point or the number of layers (blocks) executed on edge and e2e inference latency, respectively. At each partition point, we show the total e2e latency (shown by a green line with markers) along with its breakdown into finer constituents, namely (i) edge latency shown in purple (time taken to execute the layers up to the partition point, including the layer indicated by the point, on the edge device), (ii) communication (comm) latency shown in orange (transmission time of the output feature maps from the partitioned layer), and (iii) cloud latency shown in maroon (time to run inference on the remaining layers of the DNN in the cloud). The top-right plot considers DNN inference at high server load 20 and network bandwidth 25 Mbps, and shows that the OPP corresponding to minimum latency is 0, i.e., all the layers executed in the cloud. In this case, the edge device uses its image sensor/camera module, consumes a small amount of time during image acquisition, and performs the necessary preprocessing before offloading the image to the cloud server. At load 1 and bandwidth 1 Mbps (bottom-left plot), the minimum e2e latency is observed when all layers are executed on the edge device. However, in the bottom-right plot, the minimum e2e latency is observed using partitioned inference where the OPP is 9. Similarly, the heatmap also shows that partitioned execution performs best under many operating conditions. Therefore, exploring and optimizing partitioned inference is an interesting research problem to solve. In the following two sections, we further investigate the impact of temporally varying operating conditions, as well as various system specifications, on the optimal partitioning point for six different SOTA DNNs used for computer vision applications on edge devices.

Fig. 3. Variance in partition point in SqueezeNet1.1 showing minimum end-to-end latency using cloud-only, edge-only, and partitioned inference.

3.2 Impact of Temporally Varying Operating Conditions

Cloud-based DL inference for edge intelligence, in CoI or collaborative mode, always comes with additional constraints, such as available network connectivity, bandwidth, wireless channel condition, congestion, contention, concurrent service requests, data center/server load, different hardware/software configurations, and server geolocation (examples of some popular cloud servers are Amazon’s AWS DL AMIs, Google Cloud ML, and IBM Watson for AI workloads), among others. The time-varying wireless environment affects the transmission time of sensor data (image, video, audio, text, etc.) in the case of CoI or intermediate DNN feature maps in collaborative mode. On the other hand, a highly loaded cloud server can mitigate the computational advantage of cloud over edge, leading to slower DL inference latency. Furthermore, as shown in previous work [62], the relative geographical locations of the edge device and the cloud server also affect connectivity and e2e inference latency. Edge applications using cloud services are affected by the round trip time, which is usually proportional to double the geographical distance [29, 67]. These dynamic factors ultimately result in different response times, thus changing the OPP that offers the best latency at the corresponding operating conditions. In Figure 2, we can clearly observe the variation in OPP with the alteration of the server load and the bandwidth of the wireless network for six DNNs. Intuitively, the increase in bandwidth reduces the communication latency of the DNN input/feature maps, consequently favoring more layers to be offloaded to the cloud (i.e., fewer layers on the edge), which shifts the partition point, as we can see in the heat maps. On the other hand, an increase in server load favors computing more layers at the edge. Similar observations are evident in Figure 3, which depicts the variation in OPP for SqueezeNet1.1 with respect to different network bandwidths and cloud server loads. For example, the heat map on the top-left plot of this figure shows a change in OPP from 0 to 9 with an increase in server load from 15 to 20. The non-monotonic characteristics of the size of intermediate DNN feature maps directly result in a non-monotonic trend of communication latency, as evident in the stacked column charts in Figure 3. In all of the plots, the impact of variation in bandwidth on communication latency and, consequently, on the partitioning decision is clear. In addition to these variations, there is no guarantee on the delay and response time to access cloud services and could result in long waiting times for edge devices aiming to offload DL processing to the cloud. Even with a loss in network connectivity, intelligent critical services must be provided in near-real time, which invokes the need to design a partitioning framework adaptive to both temporally varying operating conditions.

3.3 Impact of Diverse Edge System Specifications

With an increasing trend to push DL inference to the edge to reduce latency, preserve privacy, and enable interactive use cases, various DL frameworks such as PyTorch Mobile, TensorFlow Lite, CoreML, MXNet, and so on, have released software modules in recent years. Many of these specialized libraries are used to optimize and deploy DNN models for DL inference on more than a billion mobile/edge devices [74, 77]. As the authors [77] point out: “These devices are comprised of over two thousand unique SoCs running in more than ten thousand smartphones and tablets”. As we have mentioned in Section 1, programmability and performance variation affect applications with real-time constraints. Specifically, hardware and software heterogeneity results in variation in latency and energy consumption for the same DNN across different edge devices and consequently changes the OPP when the same edge is operating in edge-cloud collaborative mode. Different SoCs and software frameworks offer different degrees of acceleration to different types of DNN layers, and this forms the fundamental rationale behind this variation in inference latency. Furthermore, operating conditions such as the number of concurrent CPU intensive threads on the edge device, the dynamic allocation in cores, and so on, which depends on the computing platform, affect latency, thus altering the OPP [78]. To demonstrate this variability, we performed several DNN inference experiments using six DNN benchmarks on two COTS compute platforms, namely RPi0 and RPi3, and developed computational models of three other platforms, namely NCS, Jetson, and ETPU, which closely matches the inference performance on the corresponding actual hardware. Specifically, the NCS model considers a combination of RPi3 attached to the NCS over USB. For the remainder of the article, DNN inference on any of these five devices will refer to execution on actual hardware for the two Raspberry Pi boards, whereas computational models for the rest. PyTorch was used as the primary software framework to conduct these experiments. More details on DNN benchmarks and hardware-software setup can be found in Section 5. Next, we investigate the inference performance of these devices.

Consider a real-time edge application with strict latency constraints, where SqueezeNet1.1 is used as the underlying image classification DNN and the edge device is operating in edge-cloud collaborative mode. The e2e latency comprises the time taken by the edge device to capture the image, perform adequate data preprocessing (reduce data size, packetize, etc.), transmit data (image/feature map/result) to the cloud (if any), and the DNN inference latency on the edge or cloud, or both (only during partitioned execution). Previous work in the domain of collaborative inference [17, 78, 84] involves offline DNN pre-characterization/profiling that includes resource cost modeling of different DNN layers coupled with individual cost prediction model guided by network bandwidth, process latency and energy consumption, for a specific set of edge devices and cloud server. We term this characterization information as Oracle data, which will vary depending on the DNN architecture, layers, wireless standard and bandwidth, edge platform specifications such as memory bandwidth, Trillion Operations per Second (TOPS) and so on. We performed an offline analysis of the network bandwidth and cloud processing load over a fixed duration of the DNN application to generate the oracle data. Leveraging the oracle to decide the OPP during application run-time will always result in minimum e2e DL inference latency for the prevailing wireless network and server conditions. If the same application (and DNN) is executed on any other edge device/platform, the OPP might change, given that we have the pre-characterized oracle data for the respective platforms. Furthermore, a change in the wireless standard and, subsequently, the real-world network bandwidth/speed might change the OPP for the same platform. Figures 4 and 5 show this variance in OPP due to hardware heterogeneity and variation in communication standards for three traditional DNNs and three edge-optimized DNNs. In both figures, each row represents data for a particular DNN, and each plot in a row shows the OPPs for three different communication standards for five devices. As we can observe in Figure 5, OPPs in SqueezeNet1.1 running on the five edge platforms mentioned above, i.e., RPi3, RPi0, NCS, Jetson, ETPU using the Wi-Fi5 standard at fixed network bandwidth of 1 Mbps is {\(17, 9, 17, 17, 17\)}, respectively. Similarly, the OPPs for the same set of systems that use the 5G standard at bandwidth of 250 Mbps is {\(0, 0, 0, 9, 17\)} and Wi-Fi6 standard at bandwidth of 1500 Mbps is {\(0, 0, 0, 0, 10\)}. As previously stated in Section 3.1, 0 represents CoI, and 17 indicates EoI for SqueezeNet1.1, while any intermediate value represents partitioned execution, where layers/blocks until the OPP (inclusive) are executed at the edge and the rest of the layers run on the cloud server using the transmitted feature maps. Looking at the graphs for MobileNetV2, another SOTA mobile-optimized DNN, we observe the variance in the OPP when we switch the edge platforms and the communication medium. Similar observations can be derived for YOLOv3-Tiny, a SOTA object detection DNN for edge AI applications. Clearly, there is no unanimous partition point that results in minimum latency across all these platforms and communication standards. Similar observations can be derived by inspecting the rest of the plots in Figures 4 and 5 corresponding to the other DNNs. These results allow us to derive the following takeaway: the OPP of any DNN is a function of the computing architecture and communication medium. These insights motivate us to design a platform-agnostic system that can rely on run-time measurements to derive the OPP that can minimize the e2e inference latency.

Fig. 4. Variance in partition point in three traditional DNNs across five different edge platforms and three communication subsystems.

Fig. 5. Variance in partition point in three edge-optimized DNNs across five different edge platforms and three communication subsystems.

4 PARTNNER: DESIGN METHODOLOGY

As evident in Sections 3.2 and 3.3, partitioned DL inference using the edge-cloud collaborative framework faces multiple optimization challenges in the form of dynamic operating factors such as wireless network conditions and cloud server loads, as well as the characteristics of the underlying edge device, communication subsystem, and software divergence. To solve the DNN partitioning problem in the presence of the multifaceted challenges mentioned above, we propose PArtNNer, a Platform-agnostic Adaptive system that automatically executes DNN partitioning, which is presented in this section along with the underlying guidelines that drive our design decisions.

4.1 Adaptive DNN Partitioning Heuristic

We formulate DNN partitioning as a non-linear and non-convex optimization problem, since multiple locally optimal partitioning points might exist due to non-monotonous variations in the computation/memory requirements and data/feature map size of individual DNN layers. By nature, a non-convex problem has multiple local minima and one global minima, which is generally an NP-hard [6, 31] problem. This nature is also followed by the DNN inference characteristics of the individual layers, as can be observed from the e2e latency (green lines) in Figure 3. For a particular edge platform, the system designer may not have all the underlying details of the specific DNN models to be deployed to provide intelligent edge services. Due to the constantly evolving nature of the design space of the DNN architectures, more efficient and accurate DNNs could be used on the same platform in the future for the same application. This unpredictability of the DNN architecture presents the first challenge. Second, even with prior knowledge of DNN architecture deployed on the platform, the temporally varying operating conditions (such as bandwidth and server load) pose substantial challenges, thereby contributing aggressively to the non-convexity of the problem, as the OPP in a particular DNN at some operating state might not be optimal anymore for some other state. This is evident from the discussions in Section 3 as well as from Figures 2 and 3. During our investigation of the problem in the context of varying edge platforms and communication standards, we observed that even under constant temporal conditions, there are unique and distinct OPPs for the same DNN for different edge platforms. These are addressed in Section 3.3 and Figures 4 and 5. Therefore, the solution landscape of DNN partitioning comprising the aforementioned factors results in non-existence of any global minima altogether and makes it imperative for us to propose an efficient heuristic that is adaptive to any DNN architecture and temporal variations in operating conditions, as well as agnostic to the underlying platform.

The proposed partitioning heuristic can work with any SOTA DNN, without the need for any kind of DNN profiling on the hardware. Unlike prior partitioning approaches as discussed in Section 2, we eradicate any offline pre-characterization phase of edge and/or cloud platforms before the actual deployment of the DNN inference engine on the edge device. In contrast to the existing literature, we adopted only run-time measurements of e2e inference latency solely on the edge device to guide the heuristic. Using these measurements, PArtNNer dynamically selects the optimal DNN partition point for the specific edge-cloud platform duo for the current operating conditions, as described in Algorithms 1, 2, 3, and 4.

4.1.1 Heuristic Terminologies.

Before explaining the proposed heuristics and associated algorithms, we define different metrics and terminologies. In this work, we have chosen a coarse-grained approach instead of fine-grained partitioning at a per-layer granularity, as shown in the ResNet50 architecture in Figure 6. We have adopted block-level/module-level partitioning for DAG topology DNNs where individual layers in blocks may have multiple inputs and outputs, forming complex branching structures. These architectures have residual connections, concatenation, or addition (element-wise) operations. Consequently, they need edge-to-cloud transmission of output feature maps from multiple layers and proper handling of data dependencies due to the existence of multiple parallel paths (as seen in Figure 6), for successful collaborative inference. For simplicity, we only allow partitions after individual layers or blocks. Note that our solution can handle scenarios even when blocks receive multiple inputs. Figure 6 illustrates that a residual block, as seen in ResNet class of networks, takes input feature maps from the previous block/layer, applies one or more convolution layers, and finally adds up the final output with the original input (in identity block) or transformed original input (in bottleneck block). Similarly, DNNs such as SqueezeNet1.1, InceptionV3, MobileNetV2, and so on employ modules with internal concatenation operators that add the output of two or more internal layers to generate the output of the module fed to the next layer/block. However, block-level partitioning does not affect chain topology DNNs with simple architecture, such as AlexNet and VGG19_BN, where it boils down to fine-grained partitioning. The total number of partitioning points in a DNN using this approach is represented by N, which is different from the total number of layers. For example, ResNet50 has 50 layers and 23 possible partition points \((P_{O}~|~P_{O} \in [0, 22])\), as shown in Figure 6, where 0 and 22 represent CoI and EoI, respectively, while intermediate values indicate collaborative inference. Similar information for other DNN benchmarks used in this work is enumerated in Table 3. This approach allows us to partition both chain topology DNNs and DAG topology DNNs. Other heuristic parameters are defined as follows:

Fig. 6. ResNet50 architecture depicting different flavors of DNN partitioning/segmentation for edge-cloud collaborative inference. P0 to P22 are partition points. Block-level partitioning is adopted in our work.

–	\(search\_top\): This input parameter is a Boolean flag (True/False) that instructs the algorithm to search for the starting partition point \(P_{S}\) from either the top or the bottom of the DNN. Here, top represents partition point 0, whereas, bottom indicates partition point N. We assume that the communication bandwidth is reliable during deployment. As observed in Figure 2, higher bandwidth correlates with OPP at \(\approx \!0\), i.e., more layers are computed in the cloud. Therefore, we set this value to True across the 15 system benchmarks and 6 DNNs. Note that the bandwidth can also be measured at the start to set this parameter accordingly.
–	k: This input parameter is a percentile factor that decides the range of DNN blocks/layers from which \(P_{S}\) is selected randomly. \(k = 1\) indicates that \(P_{S}\) can be any of the permitted partitions, whereas higher values (\(k \in [2, 4]\)) will reduce the search space of random choice. \(search\_top\) and k together enable the initialization of heuristic. e.g., \(search\_top\) = True and \(k = 4\) indicates that the algorithm will select \(P_{S}\) from the top 25% of the partitions (Algorithm 2).
–	\(\alpha\): This input parameter is the relative latency threshold that determines how eagerly the heuristic algorithm tries to find new partition points compared to staying at the previous point (Algorithm 3). A lower \(\alpha\) will allow the algorithm to explore more often, whereas a higher \(\alpha\) will make the algorithm more conservative; i.e., only large disruptions in the latency will cause the algorithm to move the partition point. We have empirically selected \(\alpha \in (0, 0.1]\) to account for measurement variations, noise, environmental uncertainties, and other uncontrollable factors.
–	\(near\_idx\) (\(ni\)): This input parameter influences the number of blocks/layers that the heuristic shifts in either direction (edge/cloud) if the latency difference exceeds the threshold \(\alpha\) (Algorithm 3). Essentially, \(ni\) determines the degree of feedback used by the algorithm to guide itself toward OPP. We have empirically observed that \(ni \in [2, 3]\) leads to faster convergence to OPP across all benchmarks. Higher values may result in a drop in heuristic performance, consequently, a higher e2e latency.
–	\(far\_idx\) (\(fi\)): This run-time variable is updated by the algorithm depending on past favorable or unfavorable decisions. The heuristic only uses \(fi\) to shift \(P_{O}\) if consecutive decisions are favorable (Algorithm 3).
–	\(scale\_idx\) (\(si\)): This input parameter is a scaling factor that changes \(fi\) depending on the difference in latency. (Algorithm 3). We use \(si\) to reward the heuristic to different degrees based on the frequency of favorable decisions (Algorithm 3). We empirically selected \(si \approx 2\) on average across all benchmarks. More details on the interaction between these parameters are provided in Section 4.2.
–	\(part\_prob\): This input parameter decides the probability of exploration if \(\Delta lat\) does not exceed the threshold \(\alpha\). Setting this parameter properly ensures that the algorithm does not get stuck indefinitely at local minima. In our experiments, we set \(part\_prob \in [0.15, 0.05]\) (Algorithm 4).
–	\(last\): This run-time variable records the last updated partitioning decision. The heuristic updates this variable (\(last \in \lbrace \mathsf {edge, cloud}\rbrace\)) if its decides to offload layers in a direction opposite to its previous decision (Algorithm 3). In addition, \(last\) may also be updated if the heuristics does the exploration (Algorithm 4).

4.1.2 Heuristic Operation.

The proposed heuristic is a run-time/online solution to the partitioning problem. The top level Algorithm 1 is executed on the edge device together with the intelligent application that uses DNN as the underlying algorithm. In the first instance of application launch, it initializes the heuristic with different empirically chosen parameters (Line 1), by calling the function Init in Algorithm 2. Depending on the input parameters, the first partition point \(P_{S}\) is randomly chosen among N DNN blocks/layers (Lines 2–5) and heuristic variables viz., \(P_{O}, Ti_{prev}, Ti_{prev_2}, fi, last\) are initialized. Note that Algorithm 2 needs to run only once.

Following the initialization phase, Algorithm 1 calls the main partitioning heuristic (Algorithm 3) every time the parent application invokes the DL inference request on the edge device. The heuristic measures the e2e latency associated with the current request, \(Ti_{curr}\), only on the edge device (Line 2), which encapsulates all the computation (edge, cloud, or both) and the communication latency (if any) involved in a single DNN inference operation, as explained in Section 3.1. Subsequently, the relative latency difference \(\Delta lat\) is calculated using \(Ti_{curr}\) and \(Ti_{prev}\), the e2e latency corresponding to the last inference request. We also measure the relative latency difference of the previous partition decision, \(\Delta lat_{prev}\). Note that \(\Delta lat\) and \(\Delta lat_{prev}\) are initialized to 0 only at the first inference instance to avoid division by zero error. This will trigger exploration (Algorithm 4) to find \(P_{O}\). For each subsequent instance, the measured \(\Delta lat\) is compared with the predefined threshold (\(\alpha\)). \(\Delta lat\) > \(\alpha\) indicates that the last partition decision made (\(last\)) adversely affected latency. Thus, if the \(last\) decision was cloud, the heuristic decides to shift \(P_{O}\) toward the edge, i.e., compared to the present configuration, more layers will be executed on the edge device at the next inference request. On the contrary, if the \(last\) decision was edge, heuristic offloads the computation of more layers to the cloud, thus alleviating the edge of some of the existing computation load. This heuristic action is depicted in Lines 7–16, where \(P_{O}\) is shifted by \(ni\), essentially moving the computation of \(ni\) blocks/layers to the edge or cloud. On the other hand, if \(\Delta lat\) < \(\alpha\), the heuristic reinforces the previous partitioning decision, essentially shifting \(P_{O}\) in the same direction as indicated by \(last\). To increase the stability of the heuristic, we compare \(\Delta lat_{prev}\) with \(\alpha\) and shift \(P_{O}\) by any of the shift parameters, \(ni\) or \(fi\) (Lines 19–27). Furthermore, along with every favorable or unfavorable decision, \(fi\) increases or decreases using \(si\), or is reset directly to \(ni\). Finally, \(Ti_{prev}\) and \(Ti_{prev_2}\) are updated, and the heuristic ensures that the OPP is selected from the possible partitions.

4.1.3 Random Exploration Phase.

We promote random exploration of the non-convex partitioning search space for the operating DNN throughout the heuristic operation to encompass the large search space. As shown in Line 5 of Algorithm 3, the heuristic calls the function Explore in Algorithm 4 when the latency difference does not exceed the threshold between two consecutive inferences. Depending on the random choice of the partition decision, according to the parameter \(part\_prob\), \(P_{O}\) is shifted or left unchanged and \(last\) is updated (Lines 4–12). For example, two conflicting conditions, such as an increase in network bandwidth and server load, might result in little to no change in latency difference, as the first condition favors offloading to the cloud, whereas the second one prefers execution on the edge. However, a partition point other than the current \(P_{O}\) might result in lower latency, which can only be explored by heuristic by choosing a random partitioning decision. Note that \(part\_prob\) may be changed at run-time. However, the value should not be set to 0 as this will stop the exploration and the algorithm might be stuck in a sub-optimal minima for the entirely of the inference application.

4.1.4 Overall Flow.

The overall flow of the DNN partitioning heuristic, as explained in the previous subsections, has been summarized in a single flow chart, shown in Figure 7. The figure encompasses all algorithms in a synergistic way and gives a high-level overview of the overall flow of the heuristic. For example, Algorithm 2 uses different heuristic parameters (as described in Section 4.1.1) to initialize \(P_{S}=P2\) along with other run-time variables for the representative DNN with \(N=5\) (Steps 1–2). Inference algorithm is executed and e2e latency is measured (Steps 3–6). Algorithm 3 uses shift parameters/variables (\(ni\) or \(fi\)) and \(last\) to update the OPP (\(P_{O}\)) to \(P3, P4\) or \(P1\) based on \(\Delta lat\) and \(\Delta lat_{prev}\), followed by update of run-time variables (Steps 7–9). Based on \(\Delta lat\), Algorithm 4 might perform random exploration to shift \(P_{O}\) toward the edge or cloud, followed by \(last\) update (Steps 7–9). The updated \(P_{O}\) is recorded and partitioned DNN inference is executed using the saved value for the next inference request and the process continues for the entire lifetime of the application on the edge.

Fig. 7. Flowchart depicting how all the algorithms work together in conjunction to perform platform-agnostic adaptive DNN partitioning. Steps 1 and 2 are only performed at start of the DNN inference application. Steps 3–6 are performed at every inference instance. Heuristic runs step 7 (1 out of 5 options). Steps 8 and 9 are selectively executed based on step 7.

4.2 Rationale behind Heuristic Design Decisions

Finding a globally optimal solution for the non-convex partitioning problem is impractical due to the sheer size of the search space. The heuristic is devoid of any notion of the performance capability of the underlying edge device (since the proposed solution is profiling-/characterization-free). In addition, wireless network conditions and/or server load might vary at any time, completely altering the search space. The presence of other latent or unaccounted constraints such as the number of concurrent CPU-intensive threads, variability in the deployed DNN precision, DNN processing framework, dynamic allocation in big/little/hybrid cores, and so on, [78, 79] affects the DNN inference latency and edge energy consumption, therefore, adds up to the already high-dimensional search space. Dealing with such multidimensional constraints necessitates the heuristic to be optimally adaptive and still be efficient in converging to the optimal partition point that gives globally minimum e2e inference latency.

As discussed in Section 2, existing literature uses extensive profiling and computes processing and communication latency and measures multiple influential factors [17, 37, 78]. These works use measurements to address the diversity in DNN models, hardware, software, and external constraints and create a library of profile characteristics for each edge and cloud platform involved in the deployment of collaborative inference. Eliminating this exhaustive profiling step stimulated us to adopt a real-time measurement-based approach and design the heuristic parameters, viz., \(near\_idx\), \(scale\_idx\) and \(far\_idx\), which can guide the partition decision. The choice to scale \(fi\) up or down with favorable or unfavorable decision, respectively (as seen in Algorithm 3) allows faster convergence and less spurious outlier decisions, since any non-convex solution inherently allows non-optimal guesses from time to time. The heuristic embraces a self-feedback mechanism (using \(si\)) by rewarding itself for every favorable decision that reduces latency, thus closely approximating the nature of accelerated gradient descent, a common method used to solve convex optimization problems efficiently.

As is evident, the only measurement metric that guides the partitioning decision is the run-time e2e latency on the edge device. Unlike previous works, e2e latency measurement at the edge is comparatively accurate and simple and covers all additional processing and unknown overheads on edge device and cloud platforms. Furthermore, the algorithms add minimal overhead to DNN inference because of lightweight operations. The algorithm execution time is much lesser than the actual inference operation. Finally, the edge and cloud platforms can work asynchronously as the clock skew between the possibly geographically distant counterparts involved in the inference has zero impact on the heuristic, due to its non-dependence of cloud-based measurements.

5 EXPERIMENTAL METHODOLOGY

In this section, we describe the components used in the experimental evaluation of PArtNNer at both the hardware and software levels. Specifically, we highlight key details about different hardware platforms, communication standards, and DNN models. These factors together impact the e2e latency and govern heuristic performance.

5.1 System Benchmarks

We demonstrate the platform-agnostic feature of PArtNNer by evaluating its performance on five different edge computing platforms and three different communication architectures. To encompass the vast spectrum of various mobile SoCs and edge platforms [77] used for IoT and AI applications, five representative systems were selected from the spectrum of low to high compute efficiency. Table 2 presents the hardware specifications of these systems.

RPi0 [19] was selected as the first compute platform that emulates SOTA tiny/wearable IoT devices, presenting limited on-device compute capabilities for AI. RPi3 [20] was chosen as the second physical platform, representative of the wide variety of embedded devices and edge IoT platforms. In contrast to resource-constrained RPi0, RPi3 can run different DNN models and is generally used as a baseline to compare AI hardware and DNN accelerators. Both of these platforms lack a dedicated GPU or coprocessor for DNN computation. Note that we performed real-world experiments on both these hardware platforms in vanilla format, i.e., without any hardware or software customization/optimization. The only modifications we made include the disablement of (i) the RPi camera module to free up memory, since this module by default allocates 128 MB to the Broadcom VideoCore GPU (only used for video encoding/decoding) and (ii) LightDM, the OS display manager, to free the processor of any additional computational overhead.

Taking into account the vast ecosystem of custom silicon (ASIC) for edge AI, we developed computational models for the remaining three hardware platforms, each including GPU or DNN accelerator. (i) Intel^® NCS [32, 33] consists of Intel Neural Compute Engine, a dedicated hardware accelerator. Since this is a USB plug & play AI device, the computational model considered RPi3 connected to the NCS dongle, and this combination formed our third system benchmark. (ii) NVIDIA^® Jetson [50] consists of a streaming multiprocessor (128 CUDA cores) that allows parallel execution of multiple AI workloads. The presence of a GPU facilitates fast DNN inference, and this device was our fourth benchmark. (iii) Google Coral [25, 26] is an ASIC specifically designed to accelerate DL. The underlying hardware architecture of ETPU supports fast matrix operations, thereby providing exceptional levels of acceleration for DNN inference. In addition, the fast data transfer rate between ETPU and internal memory in this Coral board also contributes to this speed up. We leveraged several benchmark articles [1, 3, 24, 27, 50, 57] to design these computational models such that the reported e2e inference time of the DNN model matches very closely with that of the actual hardware. Further details on how these systems fit into the e2e experimental setup are discussed in Section 5.2.

The communication standards explored for this study include Wi-Fi5, 5G, and Wi-Fi6 [35, 55, 59] with a measured maximum bandwidth of 25 Mbps, 2.5 Gbps, and 5 Gbps, respectively. The wireless module onboard Rpi0 only supports Wi-Fi4. Therefore, we use the terminology “BWiFi” to denote 4G LTE/Wi-Fi4 (specifically for RPi0) and Wi-Fi5 for all other devices. Due to the similarity in the observed bandwidth across these three standards, we used BWiFi in all relevant graphs in this article. During the evaluation of PArtNNer under these standards, we adopted improvements in data rates from the available literature [35, 47]. Note that in practical scenarios, the actual bandwidth observed in the network varies due to several factors, such as bandwidth sharing, network congestion, distance from the router, and so on as mentioned in References [30, 47]. Therefore, we used the experimentally measured data rate in all of our experiments.

5.2 End-to-End System and Software Setup

Figure 8 depicts the e2e experimental setup used in this work that consists of the edge device, the cloud server that is connected to a local PC through an SSH connection, display connected to the edge device, and finally a Logitech Webcam C310 connected to the device via USB. The webcam was used for image acquisition and real-time DNN inference using PArtNNer. The cloud server used for collaborative inference consists of Intel^® Xeon^® Silver 4114 CPUs equipped with NVIDIA^® GeForce GTX 1080Ti GPUs. Note that the local PC in Figure 8 is only used as a console for the server and does not participate in the inference process. As mentioned in Section 1, we use image classification and object detection as the edge AI application and use their corresponding datasets for validation and inference. For our experiments, we first needed to obtain the optimal partition points of Oracle (defined in Section 3.3) for each DNN and, the corresponding oracle inference time for different network bandwidths and server loads, and the underlying system architectures. The process of obtaining oracle data is described in Section 5.4. The oracle data allowed us to have a reference to gauge the performance of PArtNNer. Note that the oracle can choose the OPP in a DNN in every instance, resulting in a minimum inference latency since the oracle data consists of optimal choices under any environmental or operating condition.

Fig. 8. End-to-end experimental setup using Raspberry Pi 3 Model B+ as the edge, Intel Xeon + GPU as the cloud server (PC only used to control the server using SSH). PArtNNer on the RPi3 runs partitioned inference using real-time images captured using camera connected to RPi3 over USB.

We performed the partitioning experiments in real time on both Raspberry Pi boards and, therefore, obtained the oracle data and executed PArtNNer directly on the edge device. In contrast, the computational models for the three custom systems were based on the relative speedup of the inference calculations available in multiple benchmark articles. However, the computation times reported in different benchmarks [27, 50, 57] by the parent organization (Intel^® for NCS, NVIDIA^® for Jetson, and Google for Edge TPU) consider the time measured to execute the DNN model only and do not include the input data preprocessing time (required for operations such as downscaling images to fit the model input specification, etc.), and these generally vary from system to system and also depend on the underlying application. Furthermore, the software frameworks used on the platforms vary across the systems, as specified in Section 3.3. Due to these factors, we adapt the results from References [1, 3, 24], as the authors account for this variance in the native benchmarks. Furthermore, we also included preprocessing times and communication time (if any) in all our experimental validation data, since the partitioning decision depends on the e2e latency, which comprises multiple constituents other than the execution time of the DNN model. The software frameworks/utilities used in this work are PyTorch used as the DL framework for edge and cloud platforms, Torchvision, Pillow, and OpenCV used for image processing, and Traffic Control (tc) [72] and WonderShaper [38] used to modulate Quality of Service and control network bandwidth at the wireless interface of edge devices.

5.3 DNN Benchmarks

To perform a comprehensive evaluation of PArtNNer, we consider a reference suite of six popular DNN architectures that we used to investigate the impact of multiple factors on partitioned DNN inference (Section 3) as well as for the main partitioning heuristic (Section 4.1). Table 3 lists five image classification models and one object detection model, together with specifications such as accuracy, number of parameters, and number of layers. Since we adopted block-level partitioning as explained in Section 4.1.1, we also reported the total allowed number of partitions and the corresponding layer/block identifier (ID) for different types of layers/blocks in these DNN models.

Note that # partitions is different from # layers due to our adoption of block-level partitioning. This approach not only handles complex data dependencies and ensures proper e2e collaborative inference execution, but also substantially reduces the search space of the heuristic. Among the classification DNNs, we considered three large-scale DNNs traditionally used on servers, such as AlexNet, InceptionV3, ResNet101, and two edge/mobile optimized DNNs such as SqueezeNet1.1 and MobileNetV2 used for real-time edge AI applications. All of these models were pretrained on the ImageNet dataset [12]. We also assess PArtNNer on YOLOv3-Tiny, a SOTA object detection model targeted for edge device, which is trained on the MS COCO dataset [43].

5.4 Inference Latency Measurement and Optimal Partitioning Point Calculation

An IoT application that runs image classification/detection comprises several pipelined stages. As part of our measurements to obtain oracle inference latency during collaborative edge-cloud inference, we measure the latency of each constituent stage and ultimately obtain the total cumulative time to obtain the e2e latency. Specifically, we use publicly available Torchprof [75], a minimal dependency library, to profile each DNN benchmark (Table 3) to obtain the execution times at the granularity of a layer or block. In all our experiments, we assume that the DNN models are already loaded, i.e., the network is initialized, and the pretrained weights are loaded from the disk to the memory, since model loading poses a considerable overhead to the e2e latency and also depends on the architecture/memory characteristics of the system. However, the model loading time only affects the first inference latency and, therefore, has a negligible effect on the PArtNNer algorithm. Each inference experiment (on a single image i.e., batch size of 1) was performed 100 times before an average was calculated and the first run was discarded. Let us examine the inference stages and the corresponding e2e latency in each of the three execution scenarios outlined in Section 3, namely EoI, CoI, and partitioned inference.

–	Edge only Inference. First, the edge device uses the camera sensor (webcam) to capture the image at the baseline image resolution (res) of \(640\times 480\). The CPU or processing unit on the edge device applies appropriate preprocessing steps; (i) aspect-preserving resize and central-crop the image to res based on the DNN input specification, viz., \(416\times 416\) for YOLOv3-Tiny, \(299\times 299\) for InceptionV3 and \(224\times 224\) for the rest of the classification DNNs, (ii) image to tensor (FP32 format), and (iii) normalize the resulting tensor. Subsequently, the on-device processor, along with the accelerator (if available) on the edge device, performs DNN inference by executing all layers. These steps contribute to the e2e inference latency (edge latency) for the EoI scenario.
–	Cloud only Inference. The edge device performs image acquisition, adequate preprocessing, and packetization. Note that we perform the preprocessing on the edge device in order to reduce the image resolution, which in turn reduces the amount of data to be transmitted. These steps contribute to edge latency. The data are then transmitted to the cloud server, leading to comm latency. The server runs the entire DNN inference and returns the result back to the edge (cloud latency). The total time measured from the start of image acquisition to the final application result gives the e2e CoI inference latency.
–	Partitioned Inference. For this edge-cloud collaborative mode, the edge device performs all the aforementioned steps, from image acquisition to normalization. Consider a DNN partitioned at \(P_{O}\); the edge device performs inference from layer 0 to layer/block \(P_{O}\) (inclusive) and serializes and packetizes intermediate feature maps from \(P_{O}\). All of these steps contribute to edge latency. Subsequently, the fmap data is transmitted to the cloud, leading to comm latency. Finally, the cloud de-serializes the data and runs inference on the rest of the layers and transmits the result back, thus contributing to cloud latency. All of these times cumulatively add up to the e2e inference latency.

Note that different constituent latencies are used only for the calculation of the e2e inference latency for EoI, CoI, and oracle and are used for the results and plots in Section 6. Conversely, PArtNNer works in real time and only uses run-time measurements. This takes into account not only all variations in operating conditions and system architectures, but also any other unexpected factors that affect e2e latency, as discussed in Section 4.2. Therefore, the e2e latency indicated in Figure 7, i.e., \(Ti_{curr}\) in Algorithm 3 is measured from the moment the image capture operation is initiated until the moment when the inference result is generated on the edge device or obtained from the cloud. To evaluate e2e latency at different wireless network bandwidths under three communication standards, we performed multiple experiments implementing two approaches; viz., we used the Quality of Service (QoS) bandwidth management [49] option available in most common routers. This feature allowed us to control the traffic flow between the edge device and the cloud server and set the maximum network bandwidth for different experiments. Furthermore, we used the Linux utilities, namely tc and WonderShaper, to limit bandwidth usage of the DNN inference application. Note that although DNNs can handle approximate input/minor errors in the input image [22, 23, 60], we set the packet loss probability in tc to \(0\%\), as we intend to partition DNN without any impact on accuracy, unlike previous work [71].

As discussed in Section 3, the variation in server load also affects server performance, as well as the DNN inference time. To simulate the load variation originating from multiple DNN inference requests from multiple edge devices, we used the Python multiprocessing library to dispatch multiple concurrent processes, each running inference tasks using separate model instances of the same DNN architecture. In this work, we assume that multiple requests from application users might arrive at the server asynchronously; hence batched inference is not considered, resulting in the need to instantiate separate DNN models for each of the individual requests. Unless batched together, a single instantiated network, if allowed to run inference on multiple input data concurrently, will result in corruption of shared data in the memory, which in this case are the intermediate feature maps. Moreover, the maximum number of DNN models clustered on a single GPU, and consequently, the number of concurrent processes is bound by the GPU architecture, GPU memory (11 GB in our setup) on the cloud server along with the size of the DNN model. In fact, virtualization is usually not supported on most commercially available GPUs. The only way to run concurrent models on GPU is to either divide the GPU cores and GPU memory into multiple groups or use time-sharing or GPU serialization with independent CUDA contexts, one for each process/model. The latter allows the GPU to execute activity from one inference process and context-switch to another process when the current activity becomes idle. The latest development in this domain is the NVIDIA^® Multi-Process Service (MPS) [52], which allows co-operative multi-process CUDA applications. However, the maximum number of allowed CUDA contexts is 16 (pre-Volta MPS server) or 48 (Volta MPS server). Taking into account these limitations, we used the following variants of server load (i.e., number of parallel inference instances or CUDA contexts), viz., {\(1, 5, 10, 15, 20\)} in all our experiments. In addition, we also implemented a FIFO queue to maintain the server at maximum load to ensure scalability, and thus our system setup has the capability to handle the number of inference requests more than the maximum limit. We performed individual experiments at each of the server loads mentioned above to obtain the oracle information for all DNNs.

Note that we had to repeat these experiments for fifteen different systems formed by all combinations of five edge platforms (Table 2) and three communication standards. The processing speed and transmission times of intermediate feature maps for NCS, Jetson, and ETPU-based systems were estimated from computational models, as indicated in Section 5.1. We have already shown the partition points of the oracle in a RPi3-based system using Wi-Fi5 for wireless bandwidth in the range {\(0{-}25\) Mbps} and server load {\(1{-}15\)} for six DNNs in Figure 2 in Section 3. Finally, to gauge the quality of PArtNNer, we synthetically created 25 different dynamic trends of network bandwidth and server load that vary over time, as shown in some related work [4, 81]. The maximum measured bandwidth (as stated in Section 5.1) was used to generate these trends, while the maximum load ranged between 15–20 based on the DNN model. The results and plots in Section 6 are based on these synthetic trends.

6 EXPERIMENTAL RESULTS AND DISCUSSIONS

In this section, we present the results obtained during the experimental evaluation of PArtNNer. We have subdivided the results into five different subsections. First, we show the latency benefits of PArtNNer compared with existing cloud-only and edge-only inference approaches, as well as the oracle inference latency, for all benchmark systems and DNNs (Section 6.1). Furthermore, we compare PArtNNer with a random partitioning scheme that randomly chooses a partition point among all layers of the DNN architecture. To obtain the e2e latency in this system, we calculated the latency at each possible partitioning point and took the average. Next, a study of the adaptability of PArtNNer to temporally varying operating conditions is presented (Section 6.2), which validates its effectiveness in navigating the non-convex partitioning search space. We then present the platform-agnostic behavior of our proposed system by showing a relative comparison of PArtNNer with oracle inference for the different edge systems (Section 6.3). Furthermore, we also present a comparative analysis of our system across multiple DNNs that highlights the redundancy of the DNN profiling phase (Section 6.4). Finally, we present a convergence study of PArtNNer that shows its ability to solve the complex optimization problem of DNN partitioning across temporal and system variations (Section 6.5).

6.1 DL Inference Latency Improvement

Figure 9 shows how PArtNNer (represented by red bars) performs when the edge device is running in an edge-cloud collaborative mode to execute DL inference compared with (i) always offloading sensor data for complete DNN inference on the cloud server (CoI, represented by blue bars), (ii) always running the full DNN locally on the edge device using the available edge CPU/GPU/accelerator embedded in the platform (EoI, represented by orange bars), (iii) random partitioning system (represented by green bars), and (iv) oracle that has the knowledge of the best partitioning point for all possible operating conditions (represented by purple bars) and always provides minimum latency. The experimental results have been used to represent this four-way comparison in Figure 9 for five edge platforms (Table 2) and three different communication standards for six DNN benchmarks (Table 3). As we can observe from the subplots for AlexNet, PArtNNer provides \(3.8\times\) on average and up to \(17\times\) latency improvement over CoI across 15 different benchmark systems. On the other hand, it achieves \(6.6\times\) on average and up to \(33.4\times\) latency improvement over EoI across the same system configurations. Note that a maximum PArtNNer improvement over CoI is observed when the underlying compute platform is Google Coral Board with ETPU (the fastest among the five benchmark platforms studied), using Wi-Fi5 (802.11ac) communication technology. Clearly, PArtNNer intelligently modulates the partitioning decision to execute inference mostly on the fast ETPU rather than offloading to the cloud, thus providing the maximum improvement. Whereas, PArtNNer running on Raspberry Pi 0 using Wi-Fi6, the fastest communication standard, shows the maximum improvement over EoI, as it most often decides to offload the inference request to the cloud rather than executing on the slow edge platform. Note that PArtNNer achieves \(5.6\times\) and \(12\times\) average latency improvement over EoI for SqueezeNet1.1 and MobileNetV2, respectively, highlighting its efficacy for edge-optimized DNNs. A similar performance trend can be clearly observed for all DNNs in the other subplots.

Fig. 9. Comparison of inference latency using PArtNNer vs cloud-only inference, edge-only inference, random inference, and oracle inference, for six DNNs for different computing and communication configurations. Latency is normalized w.r.t. to oracle. Note that lower inference latency is better.

Figure 10 depicts the same four-way comparison for all benchmark DNNs averaged over all 15 system benchmarks used in the study. This figure shows the efficacy of the proposed system, since PArtNNer partition point and, consequently, inference latency closely track the oracle, differing only by a factor of \(1.1\times\)–\(2.3\times\) across the six DNN benchmarks. On the contrary, cloud-only, edge-only, random inference latency performs poorly, viz., \(2.8\times\)–\(9.0\times\), \(4.9\times\)–\(26.2\times\), and \(5.9\times\)–\(15.1\times\), respectively, compared with the oracle. As observed in Figures 9 and 10, PArtNNer provides substantial improvements in inference latency across a wide range of systems and DNNs.

Fig. 10. Comparison of inference latency (averaged over all system configurations) using PArtNNer vs cloud-only inference, edge-only inference, random inference, and oracle inference. Lower latency is better.

6.2 Adaptability to Dynamic Variations

In-depth analysis of PArtNNer performance is essential to evaluate its adaptability to constantly varying dynamic operating conditions. In this section, we consider a representative inference system comprising RPi3 combined with Intel NCS as the underlying computing platform using Wi-Fi6 as the communication standard and ResNet101 as the DNN. We measured the e2e latency of executing a single image classification inference for 1,024 consecutive instances when this system operates in an edge-cloud collaborative mode using PArtNNer. Figure 11 shows the latency for 5 of 25 generated synthetic trends mentioned in Section 5.4 with dynamically varying network bandwidth and server load. To assess the efficacy of PArtNNer, we compare its inference latency, shown by the green line, with the inference latency of the system operating with oracle knowledge, shown by the blue line. As stated in Sections 5 and 6.1, the oracle, due to its extensive knowledge, always selects the OPP, \(P_{O}\) for any bandwidth-load combination and results in the minimum inference latency for each inference instance.

Fig. 11. Comparison of PArtNNer and oracle inference latency for five distinct trends of dynamically varying cloud server load and network bandwidth on ResNet101 running on Raspberry Pi 3 coupled with Intel® Neural Compute Stick 2 on Wi-Fi6 communication standard.

Among the five representative trends, trend 0 depicts a scenario where both load and bandwidth are constant throughout the duration of the experiment. The oracle inference latency remains constant for all instances since the oracle partition point does not change. On the contrary, PArtNNer starts at a random partition point, \(P_{S}\), selected by Algorithm 2 resulting in a higher inference latency than the oracle. However, the underlying PArtNNer algorithm changes quickly and intelligently to the optimal partition point \(P_{O}\), resulting in the minimum inference latency for the constant environment state. As we can observe from the figure, the blue line is superimposed on the green line, as the PArtNNer latency mostly tracks the oracle latency throughout the 1,024 instances with minimal deviations. Since the algorithm does not have any notion of server load or bandwidth, it randomly triggers the exploration phase (Algorithm 4) to investigate the possibility of achieving a different partition point, which could result in even lower inference latency than the current value. Due to unchanged operating conditions, any other partition point other than current \(P_{O}\) results in a higher inference latency, which leads PArtNNer to revert back to \(P_{O}\) using Algorithm 3. The green spikes in Figure 11 represent this random exploration phase.

Similar observations can also be derived from the rest of the four plots in Figure 11, corresponding to trends 1 to 4. Trend 1 represents a scenario with monotonically increasing load and bandwidth. In trend 2, the server load increases stepwise after a few inference instances, while the bandwidth increases linearly during the first half and shows sinusoidal behavior for the latter half. Trend 3 shows the case where the bandwidth increases stepwise, while the load shows a mixture of linear and sinusoidal nature. Finally, trend 4 shows a gradual upward trend followed by a downward pattern in load and a repetitive upward-downward triangular trend in bandwidth. As clearly seen in all the plots, PArtNNer inference latency closely tracks the inference latency of the oracle for all five trends, thus validating the adaptive efficiency of the proposed system to meet the different challenges that arise from dynamically varying conditions, as discussed in Section 3.2. Moreover, the large spikes toward the end in the plots for trends 2–4 indicate that consecutive partition points can lead to a substantial increase in latency, with either sharp or gradual change in the operating parameters. PArtNNer handles such scenarios effectively by quickly switching to \(P_{O}\) after encountering any high-latency instance, as it can weigh both the current measured latency and the latency of previous inference instances to intelligently select any partition point (Algorithm 3).

To allow further comparative analysis, we have adopted relative error between the inference times of the oracle and any other mode of operation as the evaluation metric, as shown in Equation (1): (1) \(\begin{equation} \varepsilon _{xo} = \frac{\sum _{k=1}^{I} (Ti_{x} - Ti_{oracle})}{\sum _{k=1}^{I} Ti_{oracle}}, \text{for } x \in \lbrace c, e, r, p\rbrace , \end{equation}\) where \(c, e, r, p\) represents CoI, EoI, random, and PArtNNer, respectively, while I is the total number of inference instances during the application lifetime. A lower value of this error metric represents better performance in terms of inference latency. For ResNet101 inference in the RPi3 + NCS-Wi-Fi6 system, \(\varepsilon _{po}^{avg} = 0.05\) across the 25 trends of operating conditions, while \(\varepsilon _{po}^{max} = 0.28\). Compared with this, \(\varepsilon _{co}^{avg} = 0.32\) and \(\varepsilon _{po}^{max} = 1.3\) for CoI, while \(\varepsilon _{eo}^{avg} = 1.76\) \(\varepsilon _{po}^{max} = 3.66\) for EoI. The metrics for the random partitioning case are \(\varepsilon _{ro}^{avg} = 1.12\) and \(\varepsilon _{po}^{max} = 2.02\). The error metrics for all the trends corresponding to this configuration of the system are plotted along with their standard deviation in Figure 12. Evidently, the lowest value of the metric is observed for PArtNNer, which demonstrates the benefits of PArtNNer compared with the other inference modes. For this system, PArtNNer performs \(6\times\) and \(34\times\) better than conventional CoI and EoI, respectively, thus facilitating real-time pervasive edge intelligence.

Fig. 12. Comparison of relative error between inference latency of oracle with other modes of operation, viz., CoI, EoI, Random, and PArtNNer. The results correspond to Raspberry Pi 3 + NCS2 system operating at WiFi6 communication standard.

6.3 Platform Agnostic Behavior

Now, we demonstrate the performance benefits of PArtNNer when applied to various types of computing and communication standards. Figure 13 depicts a comparative evaluation among PArtNNer and oracle for fifteen different systems running InceptionV3 in edge-cloud collaborative mode. The plots in each column of the figure represent PArtNNer performance on distinct hardware platforms. For example, column one is for Raspberry Pi 3, column five for Coral Dev Board, and so on. In contrast, each row represents a distinct communication standard, e.g., row zero, one, and two represent Wi-Fi5, 5G, and Wi-Fi6, respectively. Note that the y-axis values (in red) are scaled with respect to the maximum measured bandwidth during experiments, the y-axis in orange denotes the variation in server load, while the y-axis in green represents the actual measured e2e inference latency. All plots correspond to one particular dynamic trend, where the load gradually increases and the bandwidth follows a triangular pattern, i.e., it starts from a very low value, gradually increases to the maximum, and then wanes out. This kind of trend emulates the real-world behavior of the operating constraints that govern typical edge AI services today. Evidently, even in the presence of dynamic variations in cloud server load and network bandwidth, as shown in the trend, PArtNNer inference latency (green line) closely matches the oracle inference latency (blue line) for all systems. Note that spikes in PArtNNer latency for RPi0 and RPi3 indicate more frequent exploration compared to the other three devices, which can be easily adjusted by controlling the probability of exploration \(part\_prob\) in Algorithm 4. Using Equation (1), the calculated error metric for PArtNNer running InceptionV3, when averaged across all systems, gives \(\varepsilon _{po}^{sys\_avg} = 0.31\). For comparison, the mean values for CoI and EoI (\(\varepsilon _{co}^{sys\_avg}\), \(\varepsilon _{eo}^{sys\_avg}\)) are 2.08 and 30.58, while the maximum values \(\varepsilon _{po}^{sys\_max}\), \(\varepsilon _{co}^{sys\_max}\) and \(\varepsilon _{eo}^{sys\_max}\) are 1.18, 15.66, and 194, respectively. Note that PArtNNer provides the lowest value for these error metrics, indicating its advantage compared to the cloud-only and edge-only execution models.

Fig. 13. Comparison of PArtNNer and oracle inference latency of InceptionV3 for fifteen distinct systems with different computing and communication configuration for one trend.

6.4 Comparative Analysis for Multiple DNNs

Due to the exponential growth in DNN algorithm research, new network architectures are released quite frequently. Therefore, if any system designer intends to deploy a new DNN architecture for edge AI application, profiling becomes a necessary option following related work in this area. However, in addition to being platform-agnostic, our proposed system is also invariant to the DNN model. We demonstrate this adaptability to any DNN architecture in Figure 14. Here, the comparison among PArtNNer and oracle is shown for a specific system, viz., Coral Dev Board-Wi-Fi6 configuration, and one dynamically varying trend for the six benchmark DNNs. As observed, PArtNNer follows the oracle for all models, validating our claim that DNN-specific profiling of individual layers on any system is unwarranted. The error metrics (relative error w.r.t. oracle inference latency) for PArtNNer, CoI, and EoI (\(\varepsilon _{po}^{net\_avg}\), \(\varepsilon _{co}^{net\_avg}\), \(\varepsilon _{eo}^{net\_avg}\)) averaged over all six DNNs for this system configuration are 0.04, 1.2, and 0.14, while the maximum values are 0.10, 4.51, and 0.37, respectively. Note that lower values of the relative error metric indicate faster inference latency. Across the six DNNs and the fifteen system benchmarks, PArtNNer provides \(1.4\times\)–\(6.7\times\) and \(3.7\times\)–\(21.1\times\) latency acceleration compared to cloud-only and edge-only execution. Evidently, the error metrics and the latency speedup indicate that PArtNNer performs substantially better than CoI and EoI.

Fig. 14. Comparison of PArtNNer and oracle inference latency for six benchmarks DNNs for one particular trend on Coral Dev Board using Wi-Fi6 communication standard.

6.5 Convergence Study

Any optimization algorithm is gauged by how fast it converges to the globally optimal value. Since the globally optimal value changes with even a minor change in the temporally varying operating conditions, we evaluated the convergence of our proposed algorithms by gauging local minima among the DNN for a trend with constant server load and network bandwidth, i.e., Trend 0. The number of consecutive inferences, out of a total of 1,024 inference instances, needed by PArtNNer algorithms to reach the optimal oracle partition point is termed the convergence metric \(cm\), which is used here to evaluate PArtNNer performance for six DNNs and fifteen systems. As depicted in Figure 15, the average convergence metric across all systems for a particular DNN, \(cm_{avg}^{sys}\) (shown by diamond markers), ranges from \(11.15{-}53.5\), where the lower value indicates better convergence. If we consider InceptionV3 (DNN-1 in the graph), PArtNNer requires only \(10{-}12\) time steps/execution instances to converge to the oracle in 13 system configurations and \(\approx 55\) steps to reach a suboptimal point with a latency difference of 0.002 seconds (\(0.4\%\) > oracle latency). In the graph, DNN-2, ResNet101 and DNN-4, MobileNetV2 have a higher divergence among the different systems, since the number of possible partitioning points is higher compared to the other DNN benchmarks (see Table 3). The metric averaged over all systems and all DNNs, \(cm_{avg}^{net}\) (lower value is better) gives a value of 26.5. These numbers prove that our non-convex solution for DNN partitioning is highly optimized.

Fig. 15. Convergence Metric for different DNNs for all fifteen benchmark system configurations. Shaded markers indicate individual system’s metric, whereas, diamond markers shows the average across all systems.

In contrast, an exhaustive search to select the OPP in a particular DNN is directly correlated with the number of partitioning points. A naive exhaustive partitioning algorithm using an offline profiling mechanism would evaluate the e2e latency at all viable points for a constant operating condition specific for fixed edge-cloud platforms. Switching to a different platform invokes the need for additional profiling. For example, performing offline profiling on ResNet101 on fifteen different platforms would require 40 \(\times\) 15 i.e., 600 profiling experiments for constant server load and bandwidth. In contrast, even without any profiling stage, the PArtNNer converges to the optimal partition point within 29.7 inference instances on average across all systems, thus providing \(20\times\) speed up in convergence. Similarly, \(6\!\!\times -24\times\) speed-up is obtained for the other DNN benchmarks, whereas \(17.6\times\) speed up is observed on average across all DNNs.

7 ADAPTABILITY TO PARTITIONING FOR MINIMUM ENERGY

Although PArtNNer is designed to minimize the e2e latency for DNN inference on the edge device, we can easily extend the algorithms to optimize instead \(E_{edge-sys}\), energy consumption of the edge platform when performing a single DNN inference, as shown in the following equation: (2) \(\begin{equation} E_{edge-sys} = P_{compute} * Ti_{compute} + P_{comm} * Ti_{comm} + P_{other} * (Ti_{compute} + Ti_{comm} + Ti_{const}), \end{equation}\) where \(P_{compute}\) represents the compute energy consumed by the edge platform to perform the DNN inference computation, while \(P_{comm}\) comprises the energy of the communication subsystem spent on packetizing the data to be transmitted (if any) and the power to transmit the same to the cloud. Similarly, \(Ti_{compute}\) and \(Ti_{comm}\) indicate the total computation and communication times spent on the edge device, which are both components of e2e latency as elaborated in Section 3.1. \(P_{other}\) is the remaining energy consumed by the edge device, in other subsystems that remain roughly constant over time, while \(Ti_{const}\) is the constant time overhead for performing the necessary precondition and postprocessing. All these constants (\(P_{compute}\), \(P_{comm}\), \(P_{other}\), and \(Ti_{const}\)) are unique and measurable (or present in the datasheet) for a particular system and can be obtained beforehand and added to the PArtNNer algorithms. Since ensuring energy efficiency is indeed one of the primary constraints in the implementation of DNN-based intelligent services on highly energy-constrained IoT devices such as smart watches, drones, and so on, the system designer could intend to run edge-cloud collaborative inference and assign a higher priority to preserve energy from the edge device rather than to minimize e2e latency. In these scenarios, the designer can maintain the same approach as the proposed system and simply switch the objective function to obtain the optimal partition point.

8 DISCUSSIONS AND FUTURE WORK

In this section, we present limitations of this work, the scope of applicability, and future research directions aimed at solving the challenges of collaborative inference for DL at the edge. This study focuses on deep learning application scenarios with the optimization objective of minimizing e2e latency when a single edge node (client device) and a single cloud server participate in collaborative inference. We do not consider environments consisting of multiple homogeneous/heterogeneous edge devices and multiple cloud servers where the objective is to deploy multiple DNNs in an optimized way and perform resource autoscaling as seen in related work discussed in Section 2. However, these distributed serving architectures could potentially be used in conjunction with our partitioning technique as part of future work. We also do not consider general computation partitioning across edge, fog, and cloud as in other related work [13]. Nevertheless, the adaption of PArtNNer to target general partitioning can be explored in the future.

Our technique falls under the category of dynamic or online partitioning, which requires a system-wide (comprising edge and cloud) replication of DNN weights that is necessary for run-time flexibility. An alternative approach could be to communicate the weights from the edge to the cloud or vice versa. However, PArtNNer is better since this alternative incurs a high communication overhead. Another limitation of this study is the transmission of intermediate feature maps with full precision (32 bits). Future research can explore different compression techniques such as quantization of intermediate feature maps and analyze the tradeoffs between communication latency/energy savings and DNN accuracy.

One limitation of the proposed algorithm is the static nature of two critical parameters, viz., \(\alpha\) and \(part\_prob\). However, the algorithm can be extended to improve the exploration and convergence timelines by dynamically updating these parameters over the lifetime of the inference application. In addition, the size of the timing window over which the algorithm compares the latency difference with \(\alpha\) is only 2, leading to spikes in latency. Future work can also explore different window sizes to calculate the moving average of latency measurements, which may eliminate such spikes. However, window size essentially trades off stability vs. adaptability. In addition, the decision also depends on the application scenario. Therefore, exploring these tradeoffs can be pursued as part of future work. Although the values of the algorithm parameters have been decided logically and empirically, future work could follow some simple guidelines before applying PArtNNer to their own application scenario. For example, if the designer is aware of high edge vs. cloud performance gap, \(search\_top\) could be set to True and k could be set to a high value. Starting partition will be closer to the cloud, thereby helping in faster convergence. On the other hand, if the communication network delay is highly fluctuating, \(\alpha\) could be set close to 0 and \(near\_index\) may be set closer to 1 to make the algorithm more reactive to changes in inference latency. \(part\_prob\) could be set at a high value initially and reduced after the completion of a certain number of exploration rounds.

9 CONCLUSION

In this article, we introduced PArtNNer, a platform-agnostic adaptive partitioning system that partitions DNNs for edge-cloud collaborative execution to minimize the end-to-end DL inference latency. We demonstrated that the underlying compute platform, the device’s communication subsystem, and temporally varying operating conditions affect the decision to run inference either entirely on the edge or solely on the cloud or in an edge-cloud collaborative mode by partitioning the DNN execution among the two. Experimental evaluation across six different State-of-the-Art image classification and object detection DNN benchmarks and five different COTS edge platforms and three communication standards demonstrated that the system, adaptive to all the varying factors, successfully finds optimal partition point to achieve significant latency benefits over the status quo approach of cloud-only and edge-only inference, without needing any pre-characterization of edge/cloud platforms and DNN deployed for the intelligent application on the edge device. To facilitate the goal of democratizing AI in this era of ubiquitous edge intelligence, PArtNNer provides up to \(21.1\times\) and \(6.7\times\) improvements in end-to-end inference latency compared to the execution of the entire DNN on the edge device or on the cloud server, respectively.

REFERENCES

[1] Allan Alasdair. 2019. Benchmarking Edge Computing: Comparing Google, Intel, and NVIDIA Accelerator Hardware. Medium. Retrieved May 15, 2022 from https://medium.com/@aallan/benchmarking-edge-computing-ce3f13942245Google Scholar
Reference 1Reference 2
[2] Alwani Manoj, Chen Han, Ferdman Michael, and Milder Peter. 2016. Fused-layer CNN accelerators. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture (Taipei, Taiwan) (MICRO-49). IEEE Press, Article 22, 12 pages. DOI:Google ScholarCross Ref
Reference
[3] Baller Stephan Patrick, Jindal Anshul, Chadha Mohak, and Gerndt Michael. 2021. DeepEdgeBench: Benchmarking deep neural networks on edge devices. In 2021 IEEE International Conference on Cloud Engineering (IC2E) (San Francisco, CA, USA). IEEE, 20–30. DOI:Google ScholarCross Ref
Reference 1Reference 2
[4] Benson Theophilus, Akella Aditya, and Maltz David A.. 2010. Network traffic characteristics of data centers in the wild. In Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement (Melbourne, Australia) (IMC ’10). ACM, New York, NY, 267–280. DOI:Google ScholarDigital Library
Reference
[5] Center Microsoft News. 2016. Democratizing AI. Microsoft. Retrieved March 20, 2022 from https://news.microsoft.com/features/democratizing-aiGoogle Scholar
Reference
[6] Chen Xu, Jiao Lei, Li Wenzhong, and Fu Xiaoming. 2016. Efficient multi-user computation offloading for mobile-edge cloud computing. IEEE/ACM Transactions on Networking 24, 14 (2016), 2795–2808. DOI:Google ScholarDigital Library
Reference
[7] Chen Xing, Zhang Jianshan, Lin Bing, Chen Zheyi, Wolter Katinka, and Min Geyong. 2022. Energy-efficient offloading for DNN-based smart IoT systems in cloud-edge environments. IEEE Transactions on Parallel and Distributed Systems 33, 3 (2022), 683–697. DOI:Google ScholarCross Ref
Reference
[8] Choi Yujeong, Kim Yunseong, and Rhu Minsoo. 2021. Lazy batching: An SLA-aware batching system for cloud machine learning inference. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA) (Seoul, Korea (South)). IEEE Computer Society, Los Alamitos, CA, 493–506. DOI:Google ScholarCross Ref
Reference
[9] Chun Byung-Gon, Ihm Sunghwan, Maniatis Petros, Naik Mayur, and Patti Ashwin. 2011. CloneCloud: Elastic execution between mobile device and cloud. In Proceedings of the 6th Conference on Computer systems (Salzburg, Austria) (EuroSys ’11). ACM, New York, NY, 301–314. DOI:Google ScholarDigital Library
Reference
[10] Crankshaw Daniel, Sela Gur-Eyal, Mo Xiangxi, Zumar Corey, Stoica Ion, Gonzalez Joseph, and Tumanov Alexey. 2020. InferLine: Latency-aware provisioning and scaling for prediction serving pipelines. In Proceedings of the 11th ACM Symposium on Cloud Computing (Virtual Event, USA) (SoCC ’20). ACM, New York, NY, 477–491. DOI:Google ScholarDigital Library
Reference
[11] Crankshaw Daniel, Wang Xin, Zhou Guilio, Franklin Michael J., Gonzalez Joseph E., and Stoica Ion. 2017. Clipper: A low-latency online prediction serving system. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17) (Boston, MA, USA). USENIX Association, 613–627. Retrieved from https://www.usenix.org/conference/nsdi17/technical-sessions/presentation/crankshawGoogle Scholar
Reference
[12] Deng Jia, Dong Wei, Socher Richard, Li Li-Jia, Li Kai, and Fei-Fei Li. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops) (Miami, FL, USA). IEEE Computer Society, Los Alamitos, CA, 248–255. DOI:Google ScholarCross Ref
Reference
[13] Diamanti Maria, Charatsaris Panagiotis, Tsiropoulou Eirini Eleni, and Papavassiliou Symeon. 2022. Incentive mechanism and resource allocation for edge-fog networks driven by multi-dimensional contract and game theories. IEEE Open Journal of the Communications Society 3 (2022), 435–452. DOI:Google ScholarCross Ref
Reference 1Reference 2
[14] Du Li, Du Yuan, Li Yilei, Su Junjie, Kuan Yen-Cheng, Liu Chun-Chen, and Chang Mau-Chung Frank. 2017. A reconfigurable streaming deep convolutional neural network accelerator for Internet of Things. IEEE Transactions on Circuits and Systems I: Regular Papers 65, 1 (2017), 198–208. DOI:Google ScholarCross Ref
Reference
[15] Duan Yubin and Wu Jie. 2021. Joint optimization of DNN partition and scheduling for mobile cloud computing. In Proceedings of the 50th International Conference on Parallel Processing (Lemont, IL, USA) (ICPP ’21). ACM, New York, NY, Article 21, 10 pages. DOI:Google ScholarDigital Library
Reference 1Reference 2Reference 3
[16] Duplyakin Dmitry, Uta Alexandru, Maricq Aleksander, and Ricci Robert. 2020. In datacenter performance, the only constant is change. In 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID ’20) (Melbourne, VIC, Australia). IEEE, 370–379. DOI:Google ScholarCross Ref
Reference
[17] Eshratifar Amir Erfan and Pedram Massoud. 2018. Energy and performance efficient computation offloading for deep neural networks in a mobile cloud computing environment. In Proceedings of the 2018 on Great Lakes Symposium on VLSI (Chicago, IL, USA) (GLSVLSI ’18). ACM, New York, NY, 111–116. DOI:Google ScholarDigital Library
Navigate to
Reference 1
Reference 2
Reference 3
Reference 4
[18] Feltin Thomas, Marché Léo, Cordero-Fuertes Juan-Antonio, Brockners Frank, and Clausen Thomas H.. 2023. DNN partitioning for inference throughput acceleration at the edge. IEEE Access 11 (2023), 52236–52249. DOI:Google ScholarCross Ref
Reference
[19] Foundation Raspberry Pi. 2017. Raspberry Pi Zero W. Retrieved May 15, 2022 from https://www.raspberrypi.org/products/raspberry-pi-zero-wGoogle Scholar
Reference
[20] Foundation Raspberry Pi. 2018. Raspberry Pi 3 Model B+. Retrieved May 15, 2022 from https://www.raspberrypi.org/products/raspberry-pi-3-model-b-plusGoogle Scholar
Reference
[21] Gao Mingjin, Cui Wenqi, Gao Di, Shen Rujing, Li Jun, and Zhou Yiqing. 2019. Deep neural network task partitioning and offloading for mobile edge computing. In 2019 IEEE Global Communications Conference (GLOBECOM) (Waikoloa, HI, USA). IEEE, 1–6. DOI:Google ScholarDigital Library
Reference
[22] Ghosh Soumendu Kumar, Raha Arnab, and Raghunathan Vijay. 2020. Approximate inference systems (AxIS): End-to-end approximations for energy-efficient inference at the edge. In Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design. ACM, New York, NY, 7–12. DOI:Google ScholarDigital Library
Reference
[23] Ghosh Soumendu Kumar, Raha Arnab, and Raghunathan Vijay. 2023. Energy-efficient approximate edge inference systems. ACM Trans. Embed. Comput. Syst. 22, 4, Article 77 (2023), 50 pages. DOI:Google ScholarDigital Library
Reference
[24] González Juan Pablo. 2019. Machine Learning Edge Devices: Benchmark Report. Tyro Labs. Retrieved May 15, 2022 from https://tryolabs.com/blog/machine-learning-on-edge-devices-benchmark-reportGoogle Scholar
Reference 1Reference 2
[25] Google. 2019. Coral Dev Board. Retrieved May 15, 2022 from https://coral.ai/products/dev-boardGoogle Scholar
Reference
[26] Google. 2019. Edge TPU. Retrieved May 15, 2022 from https://cloud.google.com/edge-tpuGoogle Scholar
Reference
[27] Google. 2022. Edge TPU Performance Benchmarks. Retrieved May 15, 2022 from https://coral.ai/docs/edgetpu/benchmarksGoogle Scholar
Reference 1Reference 2
[28] Gordon Mark S., Jamshidi D. Anoushe, Mahlke Scott, Mao Z. Morley, and Chen Xu. 2012. \(COMET\): Code offload by migrating execution transparently. In 10th USENIX Symposium on Operating Systems Design and Implementation (Hollywood, CA, USA) (OSDI’12). USENIX Association, 93–106. Retrieved from Google ScholarDigital Library
Reference
[29] Grand Rachel. 2023. Does Location Matter In Cloud Computing? Ridge Cloud. Retrieved March 25, 2023 from https://www.ridge.co/blog/location-in-cloud-computingGoogle Scholar
Reference
[30] Heath Mark. 2021. What Realistic Speeds Will I Get With Wi-Fi 5 and Wi-Fi 6? Increase Broadband Speed. Retrieved April 5, 2022 from https://www.increasebroadbandspeed.co.uk/realistic-speeds-wi-fi-5-and-wi-fi-6Google Scholar
Reference
[31] Hu Chuang, Bao Wei, Wang Dan, and Liu Fengming. 2019. Dynamic adaptive DNN surgery for inference acceleration on the edge. In IEEE INFOCOM 2019—IEEE Conference on Computer Communications. IEEE, 1423–1431. DOI:Google ScholarDigital Library
Reference 1Reference 2Reference 3
[32] Intel. 2018. Intel^® Neural Compute Stick 2. Retrieved May 15, 2022 from https://software.intel.com/content/www/us/en/develop/hardware/neural-compute-stick.htmlGoogle Scholar
Reference 1Reference 2
[33] Intel. 2019. Intel^® Movidius™ Myriad™ X Vision Processing Unit 4GB. Retrieved May 15, 2022 from https://www.intel.com/content/www/us/en/products/sku/125926/intel-movidius-myriad-x-vision-processing-unit-4gb/specifications.htmlGoogle Scholar
Reference
[34] Jocher Glenn. 2022. YOLOv5 by Ultralytics. DOI:Online; last accessed Aug 5, 2022.Google ScholarCross Ref
Reference
[35] Jones Ryan. 2022. What are Wi-Fi 6 and Wi-Fi 6E? Trusted Reviews. Retrieved November 5, 2022 from https://www.trustedreviews.com/news/wifi-6-routers-speed-3442712Google Scholar
Reference 1Reference 2
[36] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. 2017. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture (Toronto, ON, Canada) (ISCA’17). ACM, New York, NY, 1–12. DOI:Google ScholarDigital Library
Reference
[37] Kang Yiping, Hauswald Johann, Gao Cao, Rovinski Austin, Mudge Trevor, Mars Jason, and Tang Lingjia. 2017. Neurosurgeon: Collaborative intelligence between the cloud and mobile edge. SIGARCH Comput. Archit. News 45, 1 (2017), 615–629. DOI:Google ScholarDigital Library
Navigate to
Reference 1
Reference 2
Reference 3
Reference 4
Reference 5
[38] Kili Aaron. 2018. WonderShaper—A Tool to Limit Network Bandwidth in Linux. Tecmint: Linux Howtos, Tutorials & Guides. Retrieved April 31, 2022 from https://www.tecmint.com/wondershaper-limit-network-bandwidth-in-linuxGoogle Scholar
Reference
[39] Li En, Zhou Zhi, and Chen Xu. 2018. Edge intelligence: On-demand deep learning model co-inference with device-edge synergy. In Proceedings of the 2018 Workshop on Mobile Edge Communications (Budapest, Hungary). ACM, New York, NY, 31–36. DOI:Google ScholarDigital Library
Reference
[40] Li Guangli, Liu Lei, Wang Xueying, Dong Xiao, Zhao Peng, and Feng Xiaobing. 2018. Auto-tuning neural network quantization framework for collaborative inference between the cloud and edge. In Artificial Neural Networks and Machine Learning—ICANN 2018. Springer International Publishing, Cham, Switzerland, 402–411. DOI:Google ScholarCross Ref
Reference
[41] Li He, Ota Kaoru, and Dong Mianxiong. 2018. Learning IoT in edge: Deep learning for the Internet of Things with edge computing. IEEE Network 32, 1 (2018), 96–101. DOI:Google ScholarCross Ref
Reference
[42] Li Wenbin and Liewig Matthieu. 2020. A survey of AI accelerators for edge environment. In World Conference on Information Systems and Technologies: Trends and Innovations in Information Systems and Technologies. Springer International Publishing, Cham, Switzerland, 35–44. DOI:Google ScholarCross Ref
Reference
[43] Lin Tsung-Yi, Maire Michael, Belongie Serge, Hays James, Perona Pietro, Ramanan Deva, Dollár Piotr, and Zitnick C. Lawrence. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (Zurich, Switzerland) (Lecture Notes in Computer Science), Vol. 8693. Springer International Publishing, Cham, Switzerland, 740–755. DOI:Google ScholarCross Ref
Reference
[44] Liu Peng, Qi Bozhao, and Banerjee Suman. 2018. EdgeEye: An edge service framework for real-time intelligent video analytics. In Proceedings of the 1st International Workshop on Edge Systems, Analytics and Networking (Munich, Germany) (EdgeSys’18). ACM, New York, NY, 1–6. DOI:Google ScholarDigital Library
Reference
[45] LLC Bizety. 2018. Google Learn2Compress Moves AI Processing to Mobile and IoT Devices. Retrieved April 25, 2022 from https://www.bizety.com/2018/09/17/google-learn2compress-moves-ai-processing-to-mobile-and-iot-devicesGoogle Scholar
Reference
[46] Long Xin, Ben Zongcheng, and Liu Yan. 2019. A survey of related research on compression and acceleration of deep neural networks. Journal of Physics: Conference Series 1213, 5 (2019), 052003. DOI:Google ScholarCross Ref
Reference
[47] McGarry Caitlin. 2020. 5G Speed: 5G vs 4G Performance Compared. Tom’s Guide. Retrieved April 5, 2022 from https://www.tomsguide.com/features/5g-vs-4gGoogle Scholar
Reference 1Reference 2
[48] Na Jun, Zhang Handuo, Lian Jiaxin, and Zhang Bin. 2022. Partitioning DNNs for optimizing distributed inference performance on cooperative edge devices: A genetic algorithm approach. Applied Sciences 12, 20, Article 10619 (2022), 14 pages. DOI:Google ScholarCross Ref
Reference
[49] Networks Palo Alto. 2022. QoS Bandwidth Management. Retrieved April 31, 2022 from https://docs.paloaltonetworks.com/pan-os/9-1/pan-os-admin/quality-of-service/qos-concepts/qos-bandwidth-managementGoogle Scholar
Reference
[50] NVIDIA. 2019. Jetson Nano Brings AI Computing to Everyone. Retrieved May 15, 2022 from https://developer.nvidia.com/blog/jetson-nano-ai-computing/Google Scholar
Navigate to
Reference 1
Reference 2
Reference 3
Reference 4
[51] NVIDIA. 2020. NVIDIA Jetson Linux Developer Guide. Retrieved May 15, 2022 from https://docs.nvidia.com/jetson/l4t/index.htmlGoogle Scholar
Reference
[52] NVIDIA. 2021. GPU Management and Deployment: Multi-Process Service. Retrieved April 31, 2022 from https://docs.nvidia.com/deploy/mps/index.htmlGoogle Scholar
Reference
[53] Patrizio Andy. 2020. Survey: Most Data Centers Don’t Meet the Needs of their Users. Data Center Explorer, NetworkWorld. Retrieved March 20, 2022 from https://www.networkworld.com/article/3533998/survey-most-data-centers-dont-meet-the-needs-of-their-users.htmlGoogle Scholar
Reference
[54] Pi Raspberry. 2019. Raspberry Pi Documentation: Processors. Raspberry Pi Ltd. Retrieved May 15, 2022 from https://www.raspberrypi.com/documentation/computers/processors.htmlGoogle Scholar
Reference
[55] Platforms Network &. 2020. 5G Speed: How Fast is 5G? Verizon News Center. Retrieved April 5, 2022 from https://www.verizon.com/about/our-company/5g/5g-speed-how-fast-is-5gGoogle Scholar
Reference
[56] Q-engineering. 2022. Google Coral Edge TPU Explained in Depth. Retrieved May 15, 2022 from https://qengineering.eu/google-corals-tpu-explained.htmlGoogle Scholar
Reference
[57] Q-engineering. 2023. Deep Learning with Raspberry Pi and Alternatives in 2023. Retrieved March 15, 2023 from https://qengineering.eu/deep-learning-with-raspberry-pi-and-alternatives.htmlGoogle Scholar
Reference 1Reference 2
[58] Qi Hang, Sparks Evan R., and Talwalkar Ameet. 2017. Paleo: A performance model for deep neural networks. In 5th International Conference on Learning Representations (ICLR) (Toulon, France). OpenReview.net. Retrieved from https://openreview.net/forum?id=SyVVJ85lgGoogle Scholar
Reference
[59] Qualcomm. 2022. Everything you Need to know about 5G. Retrieved April 5, 2022 from https://www.qualcomm.com/5g/what-is-5gGoogle Scholar
Reference
[60] Raha Arnab, Ghosh Soumendu, Mohapatra Debabrata, Mathaikutty Deepak A., Sung Raymond, Brick Cormac, and Raghunathan Vijay. 2021. Special session: Approximate TinyML systems: Full system approximations for extreme energy-efficiency in intelligent edge devices. In 2021 IEEE 39th International Conference on Computer Design (ICCD ’21) (Storrs, CT, USA). IEEE, 13–16. DOI:Google ScholarCross Ref
Reference
[61] Raha Arnab, Sung Raymond, Ghosh Soumendu Kumar, Gupta Praveen Kumar, Mathaikutty Deepak A., Cheema Umer I., Hyland Kevin, Brick Cormac, and Raghunathan Vijay. 2023. Efficient hardware acceleration of emerging neural networks for embedded machine learning: An industry perspective. In Embedded Machine Learning for Cyber-Physical, IoT, and Edge Computing: Hardware Architectures, Pasricha Sudeep and Shafique Muhammad (Eds.). Springer International Publishing, Cham, Switzerland, 121–172. DOI:Google ScholarCross Ref
Reference
[62] Ran Xukan, Chen Haoliang, Liu Zhenming, and Chen Jiasi. 2017. Delivering deep learning to mobile devices via offloading. In Proceedings of the Workshop on Virtual Reality and Augmented Reality Network (Los Angeles, CA, USA) (VR/AR Network ’17). ACM, New York, NY, 42–47. DOI:Google ScholarDigital Library
Reference
[63] Ran Xukan, Chen Haolianz, Zhu Xiaodan, Liu Zhenming, and Chen Jiasi. 2018. DeepDecision: A mobile deep learning framework for edge video analytics. In IEEE INFOCOM 2018-IEEE Conference on Computer Communications (Honolulu, HI, USA). IEEE, 1421–1429. DOI:Google ScholarDigital Library
Reference 1Reference 2Reference 3
[64] Razavi Kamran, Luthra Manisha, Koldehofe Boris, Mühlhäuser Max, and Wang Lin. 2022. FA2: Fast, accurate autoscaling for serving deep learning inference with SLA guarantees. In 2022 IEEE 28th Real-Time and Embedded Technology and Applications Symposium (RTAS) (Milano, Italy). IEEE, 146–159. DOI:Google ScholarCross Ref
Reference
[65] Reuther Albert, Michaleas Peter, Jones Michael, Gadepally Vijay, Samsi Siddharth, and Kepner Jeremy. 2021. AI accelerator survey and trends. In 2021 IEEE High Performance Extreme Computing Conference (HPEC) (Waltham, MA). IEEE, 1–9. DOI:Google ScholarCross Ref
Reference
[66] Ruan Li, Xu Xiangrong, Xiao Limin, Ren Lei, Min-Allah Nasro, and Xue Yunzhi. 2022. Evaluating performance variations cross cloud data centres using multiview comparative workload traces analysis. Connection Science 34, 1 (2022), 1582–1608. DOI:Google ScholarCross Ref
Reference
[67] Signal US. 2022. Does Data Center Location Matter for Cloud Services? Retrieved April 25, 2022 from https://ussignal.com/blog/does-data-center-location-matter-for-cloud-servicesGoogle Scholar
Reference
[68] Tan Mingxing, Pang Ruoming, and Le Quoc V.. 2020. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (Seattle, WA, USA). IEEE Computer Society, Los Alamitos, CA, 10778–10787. DOI:Google ScholarCross Ref
Reference
[69] Tech Alibaba. 2019. Data Center Performance Analysis: Challenges and Practices. Medium. Retrieved April 20, 2022 from https://alibabatech.medium.com/data-center-performance-analysis-challenges-and-practices-c5c9a2b5e5a9Google Scholar
Reference
[70] Teerapittayanon Surat, McDanel Bradley, and Kung Hsiang-Tsung. 2016. BranchyNet: Fast inference via early exiting from deep neural networks. In 2016 23rd International Conference on Pattern Recognition (ICPR) (Cancun, Mexico). IEEE, 2464–2469. DOI:Google ScholarCross Ref
Reference
[71] Teerapittayanon Surat, McDanel Bradley, and Kung Hsiang-Tsung. 2017. Distributed deep neural networks over the cloud, the edge and end devices. In 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS ’17) (Atlanta, GA, USA). IEEE, 328–339. DOI:Google ScholarCross Ref
Reference 1Reference 2Reference 3
[72] Wallen Jack. 2019. How to Limit Bandwidth on Linux to Better Test your Applications. Tech Republic. Retrieved April 31, 2022 from https://www.techrepublic.com/article/how-to-limit-bandwidth-on-linux-to-better-test-your-applicationsGoogle Scholar
Reference
[73] Wang Huitian, Cai Guangxing, Huang Zhaowu, and Dong Fang. 2019. ADDA: Adaptive distributed DNN inference acceleration in edge computing environment. In 2019 IEEE 25th International Conference on Parallel and Distributed Systems (ICPADS ’19). IEEE, 438–445. DOI:Google ScholarCross Ref
Reference 1Reference 2
[74] Wang Xiaofei, Han Yiwen, Leung Victor C. M., Niyato Dusit, Yan Xueqiang, and Chen Xu. 2020. Convergence of edge computing and deep learning: A comprehensive survey. IEEE Communications Surveys & Tutorials 22, 2 (2020), 869–904. DOI:Google ScholarCross Ref
Reference 1Reference 2Reference 3
[75] Wong Alexander William. 2020. torchprof. Retrieved from https://github.com/awwong1/torchprofOnline; last accessed April 10, 2021.Google Scholar
Reference
[76] Wong Emily. 2020. Is Edge Computing the Answer to a Data Center Overload? Tech Wire Asia. Retrieved March 20, 2022 from https://techwireasia.com/2020/04/is-edge-computing-the-answer-to-a-data-center-overloadGoogle Scholar
Reference
[77] Carole-Jean Wu, David Brooks, Kevin Chen, Douglas Chen, Sy Choudhury, Marat Dukhan, Kim Hazelwood, Eldad Isaac, Yangqing Jia, Bill Jia, Tommer Leyvand, Hao Lu, Yang Lu, Lin Qiao, Brandon Reagen, Joe Spisak, Fei Sun, Andrew Tulloch, Peter Vajda, Xiaodong Wang, Yanghan Wang, Bram Wasti, Yiming Wu, Ran Xian, Sungjoo Yoo, and Peizhao Zhang. 2019. Machine learning at Facebook: Understanding inference at the edge. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA ’19) (Washington, DC, USA). IEEE, 331–344. DOI:Google ScholarCross Ref
Navigate to
Reference 1
Reference 2
Reference 3
Reference 4
Reference 5
[78] Xia Chunwei, Zhao Jiacheng, Cui Huimin, Feng Xiaobing, and Xue Jingling. 2019. DNNTune: Automatic benchmarking DNN models for mobile-cloud computing. ACM Trans. Archit. Code Optim. 16, 4, Article 49 (2019), 26 pages. DOI:Google ScholarDigital Library
Navigate to
Reference 1
Reference 2
Reference 3
Reference 4
Reference 5
[79] Yang Luting, Lu Bingqian, and Ren Shaolei. 2020. A Note on Latency Variability of Deep Neural Networks for Mobile Inference. arXiv:2003.00138. Retrieved from https://arxiv.org/abs/2003.00138Google Scholar
Reference
[80] Yao Shuochao, Zhao Yiran, Zhang Aston, Su Lu, and Abdelzaher Tarek. 2017. DeepIoT: Compressing deep neural network structures for sensing systems with a compressor-critic framework. In Proceedings of the 15th ACM Conference on Embedded Network Sensor Systems (Delft, Netherlands) (SenSys ’17). ACM, New York, NY, Article 4, 14 pages. DOI:Google ScholarDigital Library
Reference
[81] Yi Deliang, Zhou Xin, Wen Yonggang, and Tan Rui. 2019. Toward efficient compute-intensive job allocation for green data centers: A deep reinforcement learning approach. In 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS ’19) (Dallas, TX, USA). IEEE, 634–644. DOI:Google ScholarCross Ref
Reference
[82] Yi Shanhe, Hao Zijiang, Zhang Qingyang, Zhang Quan, Shi Weisong, and Li Qun. 2017. LAVEA: Latency-aware video analytics on edge computing platform. In 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS ’17) (Atlanta, GA, USA). IEEE, 2573–2574. DOI:Google ScholarDigital Library
Reference
[83] Zawish Muhammad, Abraham Lizy, Dev Kapal, and Davy Steven. 2022. Towards resource-aware DNN partitioning for edge devices with heterogeneous resources. In GLOBECOM 2022—2022 IEEE Global Communications Conference (Rio de Janeiro, Brazil). IEEE, 5649–5655. DOI:Google ScholarCross Ref
Reference 1Reference 2
[84] Zhao Zhihe, Jiang Zhehao, Ling Neiwen, Shuai Xian, and Xing Guoliang. 2018. ECRT: An edge computing system for real-time image-based object tracking. In Proceedings of the 16th ACM Conference on Embedded Networked Sensor Systems (Shenzhen, China) (SenSys ’18). ACM, New York, NY, 394–395. DOI:Google ScholarDigital Library
Reference

Index Terms

PArtNNer: Platform-Agnostic Adaptive Edge-Cloud DNN Partitioning for Minimizing End-to-End Latency
1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Neural networks
  2. Embedded and cyber-physical systems
    1. Embedded systems
2. Computing methodologies
  1. Artificial intelligence

Recommendations

Multi-Compression Scale DNN Inference Acceleration based on Cloud-Edge-End Collaboration
Edge intelligence has emerged as a promising paradigm to accelerate DNN inference by model partitioning, which is particularly useful for intelligent scenarios that demand high accuracy and low latency. However, the dynamic nature of the edge environment ...
Read More
An adaptive DNN inference acceleration framework with end–edge–cloud collaborative computing
Abstract
Deep Neural Networks (DNNs) based on intelligent applications have been intensively deployed on mobile devices. Unfortunately, resource-constrained mobile devices cannot meet stringent latency requirements due to a large amount of ...
Highlights
- An adaptive DNN inference acceleration framework is proposed to reduce DNN inference latency in the end–edge–cloud computing environment.
Read More
A Survey on End-Edge-Cloud Orchestrated Network Computing Paradigms: Transparent Computing, Mobile Edge Computing, Fog Computing, and Cloudlet

Sending data to the cloud for analysis was a prominent trend during the past decades, driving cloud computing as a dominant computing paradigm. However, the dramatically increasing number of devices and data traffic in the Internet-of-Things (IoT) era ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
ACM Transactions on Embedded Computing Systems Volume 23, Issue 1
January 2024
406 pages
ISSN:1539-9087
EISSN:1558-3465
DOI:10.1145/3613501
Editor:
Tulika Mitra
National University of Singapore, Singapore
Issue’s Table of Contents
Copyright © 2024 Copyright held by the owner/author(s).
This work is licensed under a Creative Commons Attribution-NonCommercial International 4.0 License.
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States

Journal Family
ACM Journals for the Design of Smart and Connected Systems
Publication History
- Published: 10 January 2024
- Online AM: 27 October 2023
- Accepted: 7 May 2023
- Revised: 8 April 2023
- Received: 11 October 2022
Published in tecs Volume 23, Issue 1

Check for updates
Author Tags
Deep learning
edge inference
collaborative AI
real-time computing
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 0
  Total Citations
  View Citations
- 781
  Total Downloads
- Downloads (Last 12 months)781
- Downloads (Last 6 weeks)248
Other Metrics
View Author Metrics
Cited By
This publication has not been cited yet

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

PArtNNer: Platform-Agnostic Adaptive Edge-Cloud DNN Partitioning for Minimizing End-to-End Latency

ACM Transactions on Embedded Computing Systems

Abstract

1 INTRODUCTION

2 RELATED WORK

3 MOTIVATION

3.1 Edge-Cloud Collaboration: Why Partition DNN?

3.2 Impact of Temporally Varying Operating Conditions

3.3 Impact of Diverse Edge System Specifications

4 PARTNNER: DESIGN METHODOLOGY

4.1 Adaptive DNN Partitioning Heuristic

4.1.1 Heuristic Terminologies.

4.1.2 Heuristic Operation.

4.1.3 Random Exploration Phase.

4.1.4 Overall Flow.

4.2 Rationale behind Heuristic Design Decisions

5 EXPERIMENTAL METHODOLOGY

5.1 System Benchmarks

5.2 End-to-End System and Software Setup

5.3 DNN Benchmarks

5.4 Inference Latency Measurement and Optimal Partitioning Point Calculation

6 EXPERIMENTAL RESULTS AND DISCUSSIONS

6.1 DL Inference Latency Improvement

6.2 Adaptability to Dynamic Variations

6.3 Platform Agnostic Behavior

6.4 Comparative Analysis for Multiple DNNs

6.5 Convergence Study

7 ADAPTABILITY TO PARTITIONING FOR MINIMUM ENERGY

8 DISCUSSIONS AND FUTURE WORK

9 CONCLUSION

REFERENCES

Cited By

Index Terms

Recommendations

Multi-Compression Scale DNN Inference Acceleration based on Cloud-Edge-End Collaboration

An adaptive DNN inference acceleration framework with end–edge–cloud collaborative computing

A Survey on End-Edge-Cloud Orchestrated Network Computing Paradigms: Transparent Computing, Mobile Edge Computing, Fog Computing, and Cloudlet

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Journal Family

Publication History

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media