NSF RINGS: Scalable and Resilient Networked Learning Systems

Investigators: Gustavo de Veciana (ECE, UT Austin) and Haris Vikalo (ECE, UT Austin)
Students and Participants:

Monica Ribero (PhD Graduated)
Parikshit Hegde (PhD Graduated)
Hasan Beytur (PhD student)
Ahmed Aydin (PhD student)

Support: This material is based upon work supported by the National Science Foundation under Grant No. 2148224 and is supported in part by funds from OUSD R&E, NIST, and industry partners as specified in the Resilient & Intelligent NextG Systems (RINGS) program.

Goal: Next-generation learning systems enabling applications in, e.g., healthcare, energy, banking, AR/VR design and car/robot navigation, will be privacy-driven, distributed and large-scale, resulting in substantially increased exposure to network congestion/failures. This research proposal centers on developing new, as well as expanding traditional, engineering principles for the design of resilient and scalable networked learning systems. To explore these challenges, we specifically leverage Federated Learning (FL) based systems as a model learning framework.

The proposed research centers on four interrelated themes wherein we combine the development of theoretical underpinnings, architecture, applications and protocol design.

In Theme 1 we study how to achieve resilience to uncertainty in FL systems experiencing intermittent client availability and time-varying network capacity. We propose to explore a novel approach which effectively "learns how to learn" in an uncertain, resource-constrained environment.
In Theme 2 we address scalability challenges encountered in large-scale FL by relying on clustering of "exchangeable" clients. In particular, we move from client-centric to efficient cluster-centric system management, leveraging multicast-based estimation and tracking of cluster populations combined with probabilistic scheduling of clients in the clusters. This offers new avenues to scalability and resiliency as well as potential privacy enhancements.
Theme 3 builds on ideas from rate-distortion theory and scalable video coding, exploring the use of scalable/layered (learned) model compression as a basis for adaptive congestion-aware FL. A key idea here is that more aggressive compression leads to faster delivery, which motivates the search for a tradeoff sweet spot where FL performs more updates but with poorer (noisier) models. This work exemplifies the synergy of ideas from information, queueing and learning theory towards achieving resilience and adaptability. We further propose the design and use of overlay Data Aggregation Networks (DANs), which exploit the aggregative character of FL client model updates via in-network update aggregation and associated data compression. This can be viewed as the "dual" of Content Delivery Network (CDN) overlays, which are a core element in managing cost and performance in current network infrastructure.
The final theme, Theme 4, recognizes that the basis for FL applications is client participation and thus brings into focus the joint incentivization of clients and management of limited resources in uncertain environments.
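
As a rough illustration of the in-network aggregation idea behind DANs in Theme 3, the following minimal Python sketch averages client updates over a small overlay tree. It is not the project's implementation: the Node class, the aggregate and global_average functions, and the two-level topology are all hypothetical names and choices made for this example. Each overlay node forwards only a partial sum and a client count, so the root receives one message per child rather than one per client, yet recovers the same average.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical overlay node (an assumption made for illustration)."""
    updates: list = field(default_factory=list)   # update vectors from attached clients
    children: list = field(default_factory=list)  # downstream aggregator nodes


def aggregate(node):
    """Return (elementwise sum of updates, client count) for the subtree at node."""
    total, count = None, 0
    partials = [(u, 1) for u in node.updates]
    partials += [aggregate(child) for child in node.children]
    for vec, n in partials:
        if n == 0:
            continue  # an empty subtree contributes nothing
        total = list(vec) if total is None else [a + b for a, b in zip(total, vec)]
        count += n
    return total, count


def global_average(root):
    """Average of all client updates, computed from the root's partial sums."""
    total, count = aggregate(root)
    return [x / count for x in total]


# Three clients attached to two edge aggregators, which report to the root.
edge_a = Node(updates=[[1.0, 2.0], [3.0, 4.0]])
edge_b = Node(updates=[[5.0, 6.0]])
root = Node(children=[edge_a, edge_b])
avg = global_average(root)
print(avg)
```

In this toy topology the root receives two partial sums instead of three raw updates. The sketch ignores node failures, delay, and compression, which are the aspects the project actually studies.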

Selected Publications to date

Federated learning under intermittent client availability and time-varying capacity constraints.
M. Ribero, H. Vikalo, and G. de Veciana. IEEE Journal of Selected Topics in Signal Processing, 17(1):98-111, January 2023.

Network Adaptive Federated Learning: Congestion and Lossy Compression
P. Hegde, G. de Veciana and A. Mokhtari. Proceedings of IEEE INFOCOM, May 2023, pp. 1-10. An extended version is available.

Mohawk: Mobility and heterogeneity-aware dynamic community selection for hierarchical federated learning.
A.-J. Farcas, M. Lee, R. Kompella, H. Latapie, G. de Veciana and R. Marculescu. In Proc. 8th ACM/IEEE Conference on Internet of Things Design and Implementation, pages 1-12, May 2023.

Federated learning at scale: Addressing client intermittency and resource constraints.
M. Ribero, H. Vikalo, and G. de Veciana. IEEE Journal of Selected Topics in Signal Processing, pages 1-14, July 2024.

Clustered federated learning via gradient partitioning.
Heasung Kim, Hyeji Kim, and G. de Veciana. In Proc. ICML, pages 1-11, July 2024.

Optimal aggregation via overlay trees: Delay-MSE tradeoffs under failures.
Parikshit Hegde and Gustavo de Veciana. Proc. ACM Meas. Anal. Comput. Syst. (POMACS), 8(3):1-37, December 2024.