Enabling efficient, globally distributed machine learning

A group of researchers at U-M is working on the full big data stack for training machine learning models on millions of devices worldwide.
Graphic of devices around the world.
Federated learning enables the distributed training of machine learning models on millions of devices around the world.

A new technique to scale up the training of machine learning models has received a big boost from two papers by the labs of Profs. Mosharaf Chowdhury and Harsha Madhyastha. Called federated learning (FL), the technique distributes the models’ training over the devices of thousands of participants, saving costs on data transfer and storage and allowing more data privacy for participants. The researchers recently published a new method for better participant selection, as well as the largest benchmarking dataset to date for FL models.

Still a fairly new technology, FL operates by training models directly on end-user devices, like smartphones, while orchestrated by a centralized coordinator.

“By training in-situ on data where it is generated we can train on larger real-world data,” explains PhD student Fan Lai, the lead author on both papers. “This also allows us to mitigate privacy risks and high communication and storage costs associated with collecting the raw data from end-user devices into the cloud.”

While FL is theoretically able to make model training more resource efficient, it faces several challenges to be applied in practical use cases. Training across multiple devices means that there’s no certainty about the resources available, and uncertainties like user connection speeds and device specs make the pool of data options varied in quality. Early FL implementations circumvent this complexity by selecting participants randomly.

To address these performance limitations, Lai, working with Chowdhury, Madhyastha, and undergrad researcher Xiangfeng Zhu, introduced Oort. Oort is the first system to eschew random participant selection, instead providing a way to guide selections toward higher-utility options. The high-utility participants that Oort looks for provide a double bonus: data that will help improve model performance, and devices that can run faster. Together, this helps both reduce the number of training rounds necessary and cuts down the duration of each round.

“During the course of training, which often lasts for hundreds of rounds, Oort adaptively explores and exploits these high utility clients from the millions of possible clients to jointly optimize these two aspects,” says Lai.

The team’s evaluation showed that, compared to existing participant selection mechanisms, Oort improves time-to-accuracy performance by 1.2-14.1 times, and final model accuracy by 1.3%-9.8%.

In a followup paper, Lai, Chowdhury, Zhu, and master’s student Yinwei Dai published FedScale, an FL evaluation platform packaged with the largest benchmarking dataset for federated learning released to date. The platform can simulate the behavior of millions of FL clients on a few GPUs/CPUs, providing different practical FL metrics without the need for large-scale deployment.

The FedScale datasets are unique to FL settings, covering different aspects of realistic FL deployments like the execution speed, real distribution of the training data, and availability of clients over time.

“Over the course of the last couple years, we have collected dozens of datasets,” says Lai. “The raw data are mostly publically available, but hard to use because they are in various sources and formats.” 

The team partitioned them using real-world client identifiers, and then integrated the new datasets into their framework. Now, FedScale can serve various FL tasks, like image classification, object detection, language modeling, speech recognition, or machine translation.

These two projects are part of SymbioticLab’s larger goal of making federated learning broadly practical.

“We aim to build the distributed software stack for federated learning and data analytics,” Lai says. FedScale and Oort address the first two layers of this stack: prototype evaluation with standardized benchmarking, and the ability to orchestrate the millions of remote clients involved in the computation. In a previous work called Sol (published at NSDI 2020), they developed the capability to utilize the distributed devices efficiently. 

“After building this stack,” says Lai, “we can run the different phases of the ML pipeline, like preprocessing, training, evaluation, and model debugging.”

Explore:
Harsha Madhyastha; Mosharaf Chowdhury; Networked & Distributed Computing; Research News