NSF RINGS: Scalable and Resilient Networked Learning Systems
Investigators:
Gustavo de Veciana (ECE, UT Austin),
and Haris Vikalo (ECE, UT Austin)
Students and Participants:
- Parikshit Hegde (PhD student)
- Hasan Beytur (PhD student)
- Ahmed Aydin (PhD student)
Additional collaborations with Aryan Mokhtari (ECE, UT Austin) and Monica Ribero (Google).
Support:
This material is based upon work supported by the National Science Foundation
under Grant No. 2148224 and is supported in part by funds from OUSD R&E, NIST,
and industry partners as specified in the Resilient & Intelligent NextG Systems (RINGS) program.
Goal:
Next-generation learning systems enabling applications in, e.g., healthcare, energy, banking, AR/VR design
and car/robot navigation, will be privacy-driven, distributed and large-scale, resulting in substantially increased
exposure to network congestion/failures. This research centers on developing new engineering principles,
and extending traditional ones, for the design of resilient and scalable networked learning
systems. To explore these challenges, we specifically leverage Federated Learning (FL) based systems as a
model learning framework.
The proposed research centers on four interrelated themes wherein we combine the development of theoretical
underpinnings, architecture, applications and protocol design.
- In Theme 1 we study how to achieve resilience to uncertainty in FL systems experiencing intermittent
client availability and time-varying network capacity. We propose to explore a novel approach which
effectively 'learns how to learn' in an uncertain/resource-constrained environment.
- In Theme 2 we address scalability challenges encountered in large-scale FL by relying on clustering
of 'exchangeable' clients. In particular, we move from client-centric to efficient cluster-centric system management
leveraging multicast-based estimation/tracking of cluster populations, combined with probabilistic scheduling of
clients in the clusters. This offers new avenues to scalability and resiliency as well as potential privacy
enhancements.
- Theme 3 builds on ideas from rate-distortion theory and scalable video coding, exploring the use
of scalable/layered (learned) model compression as a basis for adaptive congestion-aware FL. A key idea here
is recognizing that aggressive compression leads to faster delivery, which motivates the search for a tradeoff
sweet spot where FL performs more updates but with poorer (noisier) models. This work exemplifies the
synergy of ideas from information, queueing and learning theory in achieving resilience and adaptability.
We further propose the design and use of overlay Data Aggregation Networks (DANs) which exploit the aggregative
character of FL client model updates via in-network update aggregation and associated data compression.
This can be viewed as the 'dual' of Content Delivery Network (CDN) overlays, which are a core element in managing
cost and performance in current network infrastructure.
- The final theme, Theme 4, recognizes that the basis for
FL applications is client participation and thus brings into focus the joint incentivization of clients and management
of limited resources in uncertain environments.
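The compression tradeoff described in Theme 3 can be illustrated with a toy calculation. The sketch below is not the project's model: it assumes a fixed-capacity link, uniform quantization noise on [-1, 1], and a hypothetical linear error recursion e_{k+1} = c * e_k + noise. All function names and parameter values are illustrative assumptions.

```python
def round_time(num_params, bits_per_param, capacity_bps):
    """Seconds to deliver one compressed update over a fixed-capacity link."""
    return num_params * bits_per_param / capacity_bps


def quantization_noise(bits_per_param):
    """Variance of uniform quantization on [-1, 1] with 2**b levels (toy model)."""
    step = 2.0 / (2 ** bits_per_param)
    return step ** 2 / 12.0


def error_after_budget(bits_per_param, num_params=1_000_000,
                       capacity_bps=10_000_000, time_budget_s=60.0,
                       contraction=0.9, initial_error=1.0):
    """Error left after the time budget under e_{k+1} = c * e_k + noise.

    The recursion and its constants are assumptions chosen for illustration.
    """
    rounds = int(time_budget_s // round_time(num_params, bits_per_param, capacity_bps))
    # Fixed point of the recursion: the floor that noisy updates cannot beat.
    noise_floor = quantization_noise(bits_per_param) / (1.0 - contraction)
    # Closed form of the recursion after `rounds` steps.
    return contraction ** rounds * (initial_error - noise_floor) + noise_floor


# Sweep the compression level and pick the one with the lowest final error.
errors = {b: error_after_budget(b) for b in range(1, 17)}
best_bits = min(errors, key=errors.get)
print(best_bits, errors[best_bits])
```

Under these assumptions the best setting lies strictly between the extremes: very coarse quantization allows many rounds but stalls at a high noise floor, while very fine quantization leaves too few rounds within the time budget.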
Overall, the proposed research centers on new forms of network
intelligence and adaptability which aim to address scalability across the device-to-edge-to-cloud continuum.
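The in-network aggregation idea behind DANs rests on the fact that a weighted average of client updates can be computed from partial sums. The sketch below is a minimal illustration, assuming weighted model averaging as the aggregation rule; the helper names (to_partial, merge, finalize) and the two-node topology are hypothetical and not taken from the project.

```python
from typing import List, Tuple

# A partial aggregate: (weighted sum of updates, total weight).
Partial = Tuple[List[float], float]


def to_partial(update: List[float], weight: float) -> Partial:
    """Wrap one client update so that it can be merged with others."""
    return [weight * x for x in update], weight


def merge(partials: List[Partial]) -> Partial:
    """Combine partial aggregates; any intermediate node can run this."""
    total = [0.0] * len(partials[0][0])
    total_weight = 0.0
    for vec, weight in partials:
        for i, x in enumerate(vec):
            total[i] += x
        total_weight += weight
    return total, total_weight


def finalize(partial: Partial) -> List[float]:
    """Turn the merged aggregate into the weighted average at the server."""
    vec, weight = partial
    return [x / weight for x in vec]


# Four clients, weighted for example by local dataset size (made-up values).
clients = [([1.0, 2.0], 10.0), ([3.0, 4.0], 30.0),
           ([5.0, 6.0], 20.0), ([7.0, 8.0], 40.0)]
partials = [to_partial(u, w) for u, w in clients]

flat = finalize(merge(partials))        # server sees all four updates
node_a = merge(partials[:2])            # aggregation node A
node_b = merge(partials[2:])            # aggregation node B
hierarchical = finalize(merge([node_a, node_b]))  # server sees two messages
print(flat, hierarchical)
```

Because merging is associative, the server obtains the same average from two aggregated messages as from four individual updates, which is where the reduction in upstream traffic comes from.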
Publications to date
Federated Learning Under Intermittent Client Availability and Time-Varying Communication Constraints
M. Ribero, H. Vikalo and G. de Veciana.
IEEE Journal of Selected Topics in Signal Processing, 17 (1), 2023, pp: 98-111.
Network Adaptive Federated Learning: Congestion and Lossy Compression
P. Hegde, G. de Veciana and A. Mokhtari.
Proceedings of IEEE INFOCOM, May 2023, pp: 1-10. An extended version is also available.
Federated Learning at Scale: Addressing Client Intermittency and Resource Constraints
M. Ribero, H. Vikalo and G. de Veciana.
In submission.