Tech Champion: Xabier Lareo
Training, testing and validating machine-learning models requires data, data that is sometimes dispersed amongst many, even millions of, parties (devices). Federated learning is a relatively new way of developing machine-learning models in which each federated device shares its local model parameters instead of the whole dataset used to train them. The federated learning topology defines how parameters are shared. In a centralised topology, the parties send their model parameters to a central server, which uses them to train a central model and in turn sends updated parameters back to the parties. In other topologies, such as the peer-to-peer or hierarchical ones, the parties share their parameters with a subset of their peers.
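In the centralised topology, the server typically combines the parties' parameters by weighted averaging, as in the well-known FedAvg scheme. A minimal sketch, assuming (for illustration only) that each party's update is a flat list of floats accompanied by its local sample count:

```python
# Minimal sketch of one aggregation round in centralised federated
# learning (FedAvg-style weighted averaging). The parties and their
# parameter values below are hypothetical.

def federated_average(updates):
    """Average party parameters, weighted by local dataset size."""
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    return [
        sum(params[i] * n for params, n in updates) / total
        for i in range(dim)
    ]

# Three parties with different dataset sizes send local parameters
# to the central server instead of their raw training data.
updates = [
    ([0.2, 0.4], 100),  # party A
    ([0.4, 0.6], 300),  # party B
    ([0.0, 0.2], 100),  # party C
]
global_params = federated_average(updates)
```

The server would then send `global_params` back to the parties for the next local training round.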
Federated learning is a potential solution for developing machine-learning models that require huge or very dispersed datasets. However, it is not a one-size-fits-all solution for machine-learning scenarios.
Federated learning still has open issues that scientists and engineers are working hard to solve, some of which are detailed below.
- Communication efficiency: federated learning involves numerous data transfers. Consequently, the central server or parties receiving the parameters need to be resilient to communication failures and delays. Ensuring efficient communication and synchronisation amongst the federated devices remains a relevant issue.
- Device heterogeneity: computing capacities of the federated parties are often heterogeneous and sometimes unknown to the other parties or central server. It is still difficult to ensure the training tasks will work within a heterogeneous set of devices.
- Data heterogeneity: federated parties’ datasets can be very heterogeneous in terms of data quantity, quality and diversity. It is difficult to measure beforehand the statistical heterogeneity of the training datasets and to mitigate the potential negative impacts such heterogeneity might have.
- Privacy: privacy-enhancing technologies need to be implemented efficiently to avoid information leakages from the shared model parameters.
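One widely studied mitigation for the privacy issue above is to perturb an update before it is shared, in the style of differential privacy: clip the update's norm, then add calibrated noise. A minimal sketch, in which the clipping bound and noise scale are illustrative values rather than a tuned, formally analysed mechanism:

```python
import random

def privatise_update(params, clip=1.0, noise_scale=0.1, rng=None):
    """Clip the update's L2 norm, then add Gaussian noise to each
    coordinate so the shared parameters leak less about local data.
    The clip and noise_scale defaults are illustrative, not tuned."""
    rng = rng or random.Random(0)
    norm = sum(p * p for p in params) ** 0.5
    factor = min(1.0, clip / norm) if norm > 0 else 1.0
    clipped = [p * factor for p in params]
    return [p + rng.gauss(0.0, noise_scale) for p in clipped]

# The raw update has L2 norm 5.0, so it is scaled down to norm 1.0
# before noise is added and the result is sent to the aggregator.
noisy = privatise_update([3.0, 4.0])
```

In a real deployment the noise scale would be derived from a privacy budget; the sketch only shows where the perturbation sits in the pipeline.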
Positive foreseen impacts on data protection:
- Decentralisation: by leveraging distributed datasets, federated learning avoids data centralisation and allows the parties to have better control over the processing of their personal data.
- Data minimisation: federated learning reduces the amount of personal data transferred and processed by third parties for machine-learning model training.
- International cooperation: when the shared parameters are anonymous, federated learning facilitates the training of models with data coming from different jurisdictions.
Negative foreseen impacts on data protection:
- Interpretability: machine-learning developers often rely on the analysis of the training dataset to interpret the model behaviour. The developers using federated learning do not have access to the full training dataset, which can reduce the models’ interpretability.
- Fairness: some federated learning settings may facilitate bias toward some parties, for example towards devices hosting the most common data types.
- Security issues: the distributed nature of federated learning facilitates some types of attacks (e.g. model poisoning). Classic defence mechanisms do not currently provide sufficient protection in a federated learning setup. Ad hoc defence methods still have to be developed and tested.
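Robust aggregation rules are one family of defences studied against model poisoning: replacing the weighted mean with, for example, a coordinate-wise median limits how far a single malicious party can pull the global model. A minimal sketch with hypothetical updates:

```python
def coordinatewise_median(updates):
    """Aggregate parameter updates by taking the median of each
    coordinate, limiting the influence of any single poisoned party."""
    dim = len(updates[0])
    aggregated = []
    for i in range(dim):
        values = sorted(u[i] for u in updates)
        mid = len(values) // 2
        if len(values) % 2:
            aggregated.append(values[mid])
        else:
            aggregated.append((values[mid - 1] + values[mid]) / 2)
    return aggregated

# A poisoned party sends an extreme update; the per-coordinate median
# stays close to the honest values, unlike a plain mean would.
honest = [[1.0, 2.0], [2.0, 3.0]]
poisoned = [[100.0, -100.0]]
aggregated = coordinatewise_median(honest + poisoned)
```

Median-based aggregation is only one of the defences under study; as noted above, no current mechanism is considered sufficient on its own.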
Our three picks of suggested readings:
- T. Li, A. K. Sahu, A. Talwalkar and V. Smith, Federated Learning: Challenges, Methods, and Future Directions, IEEE Signal Processing Magazine 37, 2020.
- Q. Li, Z. Wen and B. He, A Survey on Federated Learning Systems: Vision, Hype and Reality for Data Privacy and Protection, arXiv abs/1907.09693, 2021.
- P. Kairouz et al., Advances and Open Problems in Federated Learning, Foundations and Trends in Machine Learning, Vol. 14, No. 1-2, 2021.