TY - JOUR
T1 - CrowdMesh
T2 - A Dynamic Model Parallel Training System on Mobile Devices
AU - Wang, Ziqi
AU - Guo, Bin
AU - Liu, Sicong
AU - Yu, Zhiwen
AU - Zhang, Daqing
N1 - Publisher Copyright:
© 2025 IEEE. All rights reserved.
PY - 2025/1/1
Y1 - 2025/1/1
N2 - With the rapid development of artificial intelligence (AI), integrating deep neural networks (DNNs) into mobile and embedded devices has become an important trend. This integration significantly enhances the ability of these devices to collect and analyze perceptual data. Traditionally, the integration paradigm relies on cloud-based training and deployment on mobile devices. However, the dynamic characteristics of real-world perceptual data, along with its privacy concerns, require training on the device. Despite its advantages, on-device training is bottlenecked by the limited computing resources of mobile devices. To address this issue, parallel distributed training across mobile device clusters has become a feasible paradigm. However, the inherent mobility of these devices not only increases the likelihood of training interruptions but also exacerbates the inefficiency caused by data imbalance. These challenges make traditional cloud-based model parallelization methods unsuitable for mobile environments. To overcome these limitations, this article proposes a novel model parallel system, CrowdMesh, designed for mobile device clusters. CrowdMesh consists of three key modules: 1) an alliance-game-based dynamic cluster start-up and construction module; 2) a computation-cost-based DL model parallelism module; and 3) a device-mobility-aware parameter propagation module. Together, these modules address the training interruptions and restarts caused by device mobility, as well as the efficiency issues caused by data imbalance. Experimental results show that CrowdMesh outperforms existing cloud- and device-based model parallelization baselines across various training tasks and deep learning models, reducing training latency by more than 18%.
AB - With the rapid development of artificial intelligence (AI), integrating deep neural networks (DNNs) into mobile and embedded devices has become an important trend. This integration significantly enhances the ability of these devices to collect and analyze perceptual data. Traditionally, the integration paradigm relies on cloud-based training and deployment on mobile devices. However, the dynamic characteristics of real-world perceptual data, along with its privacy concerns, require training on the device. Despite its advantages, on-device training is bottlenecked by the limited computing resources of mobile devices. To address this issue, parallel distributed training across mobile device clusters has become a feasible paradigm. However, the inherent mobility of these devices not only increases the likelihood of training interruptions but also exacerbates the inefficiency caused by data imbalance. These challenges make traditional cloud-based model parallelization methods unsuitable for mobile environments. To overcome these limitations, this article proposes a novel model parallel system, CrowdMesh, designed for mobile device clusters. CrowdMesh consists of three key modules: 1) an alliance-game-based dynamic cluster start-up and construction module; 2) a computation-cost-based DL model parallelism module; and 3) a device-mobility-aware parameter propagation module. Together, these modules address the training interruptions and restarts caused by device mobility, as well as the efficiency issues caused by data imbalance. Experimental results show that CrowdMesh outperforms existing cloud- and device-based model parallelization baselines across various training tasks and deep learning models, reducing training latency by more than 18%.
KW - Artificial Intelligence of Things
KW - distributed training
KW - mobile computing
UR - https://www.scopus.com/pages/publications/105019679691
U2 - 10.1109/JIOT.2025.3623109
DO - 10.1109/JIOT.2025.3623109
M3 - Article
AN - SCOPUS:105019679691
SN - 2327-4662
VL - 12
SP - 55166
EP - 55181
JO - IEEE Internet of Things Journal
JF - IEEE Internet of Things Journal
IS - 24
ER -