CHAPTER 4

Horizontal Federated Learning

In this chapter, we introduce horizontal federated learning (HFL), covering the concept, architecture, application examples, and related works, as well as open research challenges.

4.1    THE DEFINITION OF HFL

HFL, a.k.a. sample-partitioned federated learning or example-partitioned federated learning [Kairouz et al., 2019], can be applied in scenarios in which the datasets at different sites share the same feature space but differ in sample space, as illustrated in Figure 4.1. It resembles the situation in which data is horizontally partitioned in a tabular view. In fact, the word “horizontal” comes from the term “horizontal partition,” which is widely used in the context of the traditional tabular view of a database (e.g., rows of a table are horizontally partitioned into different groups, and each row contains the complete set of data features). For example, two regional banks may have very different user groups from their respective regions, and the intersection of their user sets is very small. However, their business models are very similar, so the feature spaces of their datasets are the same. Formally, we summarize the conditions for HFL as:

$$X_i = X_j, \quad Y_i = Y_j, \quad I_i \neq I_j, \qquad \forall\, D_i, D_j,\ i \neq j, \qquad (4.1)$$

where the data feature space and label space pair of the two parties, i.e., $(X_i, Y_i)$ and $(X_j, Y_j)$, are assumed to be the same, whereas the user identifiers $I_i$ and $I_j$ are assumed to be different; $D_i$ and $D_j$ denote the datasets of the ith party and the jth party, respectively.
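To make the condition concrete, the following minimal sketch (hypothetical column names and user IDs, pandas assumed available) checks that two tabular datasets share the same feature and label columns while their sample (user) sets are disjoint:

```python
import pandas as pd

# Two hypothetical regional banks with the same business model:
# identical feature/label columns, (almost) disjoint customer IDs.
bank_a = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "age": [34, 51, 28],
    "balance": [1200.0, 560.0, 9800.0],
    "defaulted": [0, 1, 0],          # label column
})
bank_b = pd.DataFrame({
    "user_id": ["u7", "u8", "u9"],
    "age": [45, 23, 61],
    "balance": [300.0, 760.0, 4300.0],
    "defaulted": [0, 0, 1],
})

same_feature_space = set(bank_a.columns) == set(bank_b.columns)
id_overlap = set(bank_a["user_id"]) & set(bank_b["user_id"])

# HFL applies when the feature/label space matches but the sample space differs.
print("Same feature and label space:", same_feature_space)  # True
print("Overlapping user IDs:", len(id_overlap))              # 0 (or very small)
```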

Security of an HFL system. An HFL system typically assumes honest participants and security against an honest-but-curious server [Phong et al., 2018, Bonawitz et al., 2017]. That is, only the server can compromise the user privacy and data security of the participants.

Shokri and Shmatikov [2015] proposed a collaborative deep learning (DL) scheme where participants train models independently and share only subsets of model parameter updates, which is a special form of HFL. In 2016, researchers at Google proposed an HFL-based solution for Android smartphone model updates [McMahan et al., 2016a]. In this framework, a single Android smartphone updates the model parameters locally and uploads the model parameters to the Android cloud, thus jointly training the federated model together with other Android smartphones.

A secure aggregation scheme for protecting the privacy of the user model updates under this federated learning framework was introduced in Bonawitz et al. [2017]. More recently, Phong et al. [2018] applied additively homomorphic encryption for model parameter aggregation to provide security against an untrustworthy central server.


Figure 4.1: Illustration of HFL, a.k.a. sample-partitioned federated learning [Yang et al., 2019].

In Smith et al. [2017], a multi-task style federated learning system is proposed to allow multiple sites to complete different tasks, while sharing knowledge and preserving security. Their proposed multi-task learning model can also address the issues of high communication costs, stragglers, and fault tolerance.

In McMahan et al. [2016a], the authors proposed a secure client-server structure where the federated learning system partitions data by users, and allows models built at client devices to collaborate at the server site to build a global federated model. The process of model building ensures that there is no data leakage. Likewise, in Konecný et al. [2016b], the authors proposed methods to reduce the communication cost to facilitate the training of federated models based on data distributed over mobile clients. More recently, a compression approach called Deep Gradient Compression (DGC) [Lin et al., 2018] was proposed to greatly reduce the communication bandwidth required in large-scale distributed model training.

Security proofs have been provided in these works. Recently, another security model considering malicious users [Hitaj et al., 2017] has also been proposed, posing additional privacy challenges. Note that, at the end of federated training, the aggregated model and all the model parameters are exposed to all participants.

4.2    ARCHITECTURE OF HFL

In this section, we describe two popular architectures for HFL systems, namely the client-server architecture and the peer-to-peer architecture.

4.2.1    THE CLIENT-SERVER ARCHITECTURE

A typical client-server architecture of an HFL system is shown in Figure 4.2, which is also known as master-worker architecture. In this system, K participants (also known as clients or users or parties) with the same data structure collaboratively train a machine learning (ML) model with the help of a server (also known as parameter server or aggregation server or coordinator). A typical assumption is that the participants are honest whereas the server is honest-but-curious. Therefore, the aim is to prevent leakage of information from any participants to the server [Phong et al., 2018]. The training process of such an HFL system usually consists of the following four steps.

•  Step 1: Participants locally compute training gradients, mask a selection of gradients with encryption [Phong et al., 2018], differential privacy [Abadi et al., 2016], or secret sharing [Bonawitz et al., 2017] techniques, and send the masked results to the server.

•  Step 2: The server performs secure aggregation, e.g., by taking a weighted average.

•  Step 3: The server sends back the aggregated results to the participants.

•  Step 4: The participants update their respective models with the decrypted gradients.


Figure 4.2: Exemplary client-server architecture for an HFL system [Yang et al., 2019].

Iterations through the above steps continue until the loss function converges, until the maximum number of allowable iterations is reached, or until the maximum allowable training time is reached. This architecture is independent of the specific ML algorithm (e.g., logistic regression or DNN), and all participants will share the same final model parameters.

Note that in the above steps, it is described that the participants send gradients to the server, which in turn, aggregates the received gradients. We call this approach gradient averaging [Tang et al., 2019, Su and Chen, 2018]. Gradient averaging is also known as synchronous stochastic gradient descent (SGD) or federated SGD (FedSGD) [McMahan et al., 2016a, Chen et al., 2017]. Alternatively, instead of gradients, the participants can share model weights. That is, participants locally compute model weights and send them to the server [Phong and Phuong, 2019]. The server aggregates the received local model weights and sends the aggregated results back to the participants. We call this approach model averaging [McMahan et al., 2016a, Yu et al., 2018, Xu et al., 2018]. In the extreme case, in which model parameters are averaged after each weight update carried out locally at the participants and the participants all start with the same initial model weights, model averaging is equivalent to gradient averaging [Su and Chen, 2018, McMahan et al., 2016a]. We summarize the comparison between gradient averaging and model averaging in Table 4.1. Note that both gradient averaging and model averaging are referred to as federated averaging (FedAvg) in McMahan et al. [2016a].
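The two approaches differ only in what the server aggregates. The following minimal NumPy sketch shows the server-side step under both flavors, with illustrative function names and weighting by the local dataset sizes $n_k$; masking and encryption are omitted:

```python
import numpy as np

def aggregate(updates, num_samples):
    """Weighted average of per-participant updates (gradients or model weights).

    updates:     list of 1-D arrays, one per participant
    num_samples: list of local dataset sizes n_k used as aggregation weights
    """
    weights = np.asarray(num_samples, dtype=float)
    weights /= weights.sum()
    return sum(w * u for w, u in zip(weights, updates))

# Gradient averaging: the server averages gradients, then applies one SGD step.
def server_step_gradient_averaging(global_model, local_gradients, num_samples, lr=0.1):
    avg_grad = aggregate(local_gradients, num_samples)
    return global_model - lr * avg_grad

# Model averaging: participants train locally and the server averages the weights.
def server_step_model_averaging(local_models, num_samples):
    return aggregate(local_models, num_samples)

# Toy usage with a 3-parameter model and two participants.
w = np.zeros(3)
grads = [np.array([0.2, -0.1, 0.3]), np.array([0.1, 0.0, -0.2])]
models = [np.array([0.5, 0.4, 0.1]), np.array([0.3, 0.2, 0.7])]
print(server_step_gradient_averaging(w, grads, num_samples=[100, 300]))
print(server_step_model_averaging(models, num_samples=[100, 300]))
```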

Table 4.1: Comparison between gradient averaging and model averaging [Tang et al., 2019, Su and Chen, 2018]

•  Gradient averaging. Advantages: accurate gradient information; guaranteed convergence. Disadvantages: heavy communication overhead; requires a reliable connection.

•  Model averaging. Advantages: not bound to SGD; tolerant of update loss; infrequent synchronization. Disadvantages: no guarantee of convergence; possible performance loss.

The above architecture is able to prevent data leakage against a semi-honest server, if gradient aggregation is performed with secure multi-party computation [Bonawitz et al., 2017] or additively homomorphic encryption [Phong et al., 2018]. However, it may be vulnerable to attacks by a malicious participant training a Generative Adversarial Network (GAN) in the collaborative learning process [Hitaj et al., 2017].

The client-server architecture appears similar to the architecture of a distributed machine learning (DML) system, especially the data-parallel paradigm (see also Section 3.2) of DML. HFL also resembles geo-distributed machine learning (GDML) [Xu et al., 2018, Cano et al., 2016, Hsieh et al., 2017]. A parameter server [Li et al., 2014, Ho et al., 2013] is a typical element in DML. As a tool to accelerate the training process, the parameter server stores data on distributed working nodes and allocates data and computing resources through a central scheduling node, so as to train the model more efficiently. For HFL, a data owner is a working party. It has full autonomy to operate on its local data, and can decide when and how to join and contribute to an HFL system. In the parameter server paradigm [Li et al., 2014, Ho et al., 2013], the central node always takes control, whereas an HFL system is faced with a more complex learning environment. Further, HFL takes data privacy protection into account during model training. Effective measures to protect data privacy help cope with the increasingly stringent user privacy and data security requirements of the near future. Finally, in an HFL system, the data held by the different participants are not identically distributed in most practical applications, whereas in a DML system, the data held by the different computing nodes normally follow the same distribution.

4.2.2    THE PEER-TO-PEER ARCHITECTURE

In addition to the client-server architecture discussed above, an HFL system can also make use of the peer-to-peer architecture shown in Figure 4.3 [Zantedeschi et al., 2019, Chang et al., 2017, 2018, Phong and Phuong, 2019]. Under the peer-to-peer architecture, there is no central server or coordinator. In such scenarios, the K participants of an HFL system are also called trainers or distributed trainers or workers. Each trainer is responsible for training the same ML or DL model (e.g., a DNN model) using only its local data. Further, the trainers need secure channels to transfer the model weights to each other. To ensure secure communications between any two trainers, security measures, such as public key based encryption schemes, can be adopted.


Figure 4.3: Exemplary peer-to-peer architecture for an HFL system.

Since there is no central server, the trainers must agree on the order of sending and receiving model weights in advance. There are mainly two ways to do this.

•  Cyclic transfer. In the cyclic transfer mode, the trainers are organized into a chain. The first trainer (i.e., the top of the chain) sends its current model weights to its downstream trainer. Each trainer receives model weights from its upstream trainer, updates the received model weights using mini-batches of training data from its own dataset, and then sends the updated model weights to its downstream trainer, e.g., from trainer 1 to trainer 2, from trainer 2 to trainer 3, …, from trainer (K – 1) to trainer K, and from trainer K back to trainer 1. This procedure is repeated until the model weights converge, until the maximum number of iterations is reached, or until the maximum allowable training time is reached.

•  Random transfer. The kth trainer selects a number i from {1, …, K} \ {k} uniformly at random, and sends its model weights to trainer i. When the ith trainer receives the model weights from the kth trainer, it updates the received model weights using mini-batches of training data from its own dataset. Then, the ith trainer also selects a number j from {1, …, K} \ {i} uniformly at random, and sends its model weights to trainer j. This procedure takes place concurrently among the K trainers until the trainers agree that the model weights have converged or until the maximum allowable training time is reached. This method is also known as Gossip Learning [Hardy et al., 2018, Hegedüs et al., 2019].

Sharing model weights is used as an example in the above descriptions. It is also possible for the trainers to share gradients, such as using a gossip SGD-based approach, see, e.g., Liu et al. [2018] and Daily et al. [2018], for more details.
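A minimal sketch of the cyclic transfer mode on a toy least-squares task; the trainer data, model, learning rate, and number of cycles are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Each trainer holds its own (X, y) data for a toy linear-regression task.
trainers = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(4)]

def local_update(w, X, y, lr=0.01, batch_size=5):
    """One pass of mini-batch SGD on the trainer's local data."""
    idx = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        b = idx[start:start + batch_size]
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)
        w = w - lr * grad
    return w

# Cyclic transfer: weights travel trainer 1 -> 2 -> ... -> K -> 1 -> ...
w = np.zeros(3)
for _ in range(10):                  # 10 full cycles around the ring
    for X, y in trainers:            # each trainer updates and passes downstream
        w = local_update(w, X, y)
print("final weights after cyclic transfer:", w)
```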

Compared with the client-server architecture, the obvious advantage of the peer-to-peer architecture is that it removes the central server (also known as the parameter server, aggregation server, or coordinator), which may not be available in practical applications, and it eliminates the risk of leaking information to such a server. However, there are also disadvantages. For instance, in the cyclic transfer mode, since there is no central server, model weights are updated serially rather than in parallel, which takes more time to train a model.

4.2.3    GLOBAL MODEL EVALUATION

In HFL, model training and evaluation are carried out distributively at each participant, and no party can access the datasets of the other participants. As a consequence, each participant can easily test the model performance on its local testing dataset to obtain the local model performance, but it takes more effort to obtain the global model performance across all participants. Here, local model performance means the performance of an HFL model examined on the local test dataset of a single participant, and global model performance refers to the performance of an HFL model evaluated on the test datasets of all the participants. Model performance may be expressed in terms of accuracy, precision, recall, and area under the receiver operating characteristic curve (AUC), etc. For ease of elaboration, we use a two-class classification task as an example to explain how we can obtain the global model performance in HFL.

For the client-server architecture, the participants and the coordinator can cooperate to obtain the global model performance. Both during the model training process and after model training is completed, the global model performance can be obtained according to the following steps (a minimal sketch follows the list).

•  Step 1: The kth participant evaluates the performance of the current HFL model using its local test dataset. For the two-class classification task, this step generates the local model evaluation results $(N_{TP}^k, N_{FP}^k, N_{TN}^k, N_{FN}^k)$, where $N_{TP}^k$, $N_{FP}^k$, $N_{TN}^k$, and $N_{FN}^k$ denote the number of true positive results, the number of false positive results, the number of true negative results, and the number of false negative results, respectively, for k = 1, 2, …, K.

•  Step 2: The kth participant sends the local model evaluation results $(N_{TP}^k, N_{FP}^k, N_{TN}^k, N_{FN}^k)$ to the coordinator, for k = 1, 2, …, K.

•  Step 3: After collecting the local model evaluation results of the K participants, i.e., $\{(N_{TP}^k, N_{FP}^k, N_{TN}^k, N_{FN}^k)\}_{k=1}^{K}$, the coordinator can calculate the global model performance. For example, for the two-class classification task, the global model recall can be computed as $\mathrm{recall} = \sum_{k=1}^{K} N_{TP}^k \big/ \big(\sum_{k=1}^{K} N_{TP}^k + \sum_{k=1}^{K} N_{FN}^k\big)$.

•  Step 4: The coordinator then sends the computed global model performance (e.g., accuracy, precision, recall, and AUC, etc.) back to all the participants.
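A minimal sketch of this procedure for the two-class case, using made-up local confusion-matrix counts; in a real deployment each participant would compute its own counts on its private test set and only send these four numbers to the coordinator:

```python
# Local evaluation results (N_TP, N_FP, N_TN, N_FN) reported by K participants.
local_counts = [
    {"tp": 40, "fp": 5, "tn": 50, "fn": 5},    # participant 1
    {"tp": 18, "fp": 2, "tn": 25, "fn": 5},    # participant 2
    {"tp": 70, "fp": 10, "tn": 90, "fn": 10},  # participant 3
]

# Step 3: the coordinator sums the counts and derives global metrics.
tp = sum(c["tp"] for c in local_counts)
fp = sum(c["fp"] for c in local_counts)
tn = sum(c["tn"] for c in local_counts)
fn = sum(c["fn"] for c in local_counts)

global_recall = tp / (tp + fn)
global_precision = tp / (tp + fp)
global_accuracy = (tp + tn) / (tp + fp + tn + fn)

# Step 4: the coordinator broadcasts these global metrics back to all participants.
print(f"recall={global_recall:.3f} precision={global_precision:.3f} "
      f"accuracy={global_accuracy:.3f}")
```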

For the peer-to-peer architecture, since there is no central coordinator, it is more complicated to obtain the global model performance. One possible way is to pick one of the trainers to serve as a temporary coordinator. Then, we can follow the procedure described above for the client-server architecture to obtain the global model performance for the peer-to-peer architecture. This method is recommended for evaluating the final HFL model after training is completed. However, if we apply this method during the training process, it would overburden the temporary coordinator, which may not be acceptable if the trainers are mobile or IoT devices with limited resources (e.g., battery). One possible way to remedy this issue is to ask the trainers to take turns acting as the temporary coordinator.

4.3    THE FEDERATED AVERAGING ALGORITHM

In McMahan et al. [2016a,b], the federated averaging (FedAvg) algorithm was employed for federated model training in HFL systems. We review the FedAvg algorithm and its secured version in this section, assuming a client-server architecture. Note that the FedAvg algorithm is also known as parallel restarted SGD and local SGD [Yu et al., 2018, Haddadpour et al., 2019], as opposed to parallel mini-batch SGD.

4.3.1    FEDERATED OPTIMIZATION

The optimization problem arising from federated learning is referred to as federated optimization [Li et al., 2019, McMahan et al., 2016b], to distinguish it from conventional distributed optimization. In fact, federated optimization has several key properties that differentiate it from a conventional distributed optimization problem [McMahan et al., 2016b, Xu et al., 2018, Cano et al., 2016].

•  Non-independent and identically distributed (non-IID) datasets. For distributed optimization within a data center, it is possible to ensure that different computing nodes have IID datasets so that all local updates look very similar. In federated optimization, this cannot be guaranteed. The data owned by different participants may follow completely different distributions, i.e., we cannot make IID assumptions about the decentralized datasets in federated learning [Li et al., 2019, Liu et al., 2018, Sattler et al., 2019]. For example, while similar participants might have similar local training data, two randomly picked participants might produce very different model weight updates or gradient updates.

•  Unbalanced number of data points. For distributed optimization within a data center, it is possible to divide the data equally among the computing nodes. However, in realistic scenarios, different participants usually have very different volumes of training datasets [Chen et al., 2019, Li et al., 2018, Duan, 2019]. For example, some participants may hold only a handful of data points, while others might have large amounts of data.

•  Huge number of participants. For distributed optimization within a data center, the number of parallel computing nodes can easily be controlled. However, since ML or DL generally requires a lot of data, the applications that use federated learning may need to involve many participants, especially with mobile devices [Bonawitz and Eichner et al., 2019]. Every one of these participants can theoretically participate in federated learning, making it far more distributed than that within a data center.

•  Slow and unreliable communication links. In a data center, it is expected that nodes can communicate quickly with each other and that packets are almost never lost. However, in federated learning, communication between clients and the server relies on existing Internet connections. For example, uploads (from client to server) are typically much slower than downloads, especially if the connection is from a mobile terminal. Some clients might also temporarily lose their connections to the Internet [Tang et al., 2019, Hartmann, 2019].

To address the above challenges faced in federated optimization, McMahan et al. first adopted the FedAvg algorithm for federated optimization [McMahan et al., 2016b]. The focus of FedAvg [McMahan et al., 2016a,b] is on non-convex objective functions commonly seen when training DNNs. FedAvg is applicable to any finite-sum objective function of the following form:

$$\min_{w \in \mathbb{R}^d} f(w), \quad \text{where} \quad f(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w), \qquad (4.2)$$

where n denotes the number of data points and $w \in \mathbb{R}^d$ represents the model parameters (e.g., the model weights of a DNN) of dimension d.

For an ML or DL problem, we typically take fi(w) = L(xi, yi; w), which is the loss of the prediction on sample (xi, yi) for the given model parameters w, where xi and yi denote the i th data point and the corresponding label, respectively.

Assume that there are K participants (also known as data owners or clients) in an HFL system, with Dk denoting the dataset owned by the kth participant and Pk being the set of indexes of data points in Dk. Define nk = |Pk| as the cardinality of Pk. That is, the kth participant holds nk training data points. As a result, considering there are K participants, Equation (4.2) can be rewritten as

$$f(w) = \sum_{k=1}^{K} \frac{n_k}{n} F_k(w), \quad \text{where} \quad F_k(w) = \frac{1}{n_k}\sum_{i \in P_k} f_i(w). \qquad (4.3)$$

When the data points owned by the K participants are independent and identically distributed (IID), then we have $\mathbb{E}_{P_k}[F_k(w)] = f(w)$, where the expectation $\mathbb{E}_{P_k}[\cdot]$ is taken over the set of data points owned by the kth participant. This IID assumption is typically made by distributed optimization algorithms in the DML paradigm. If the IID assumption does not hold, which is known as the non-IID setting described above, the loss function Fk(·) maintained at the kth participant could be an arbitrarily bad approximation of the function f(·) [Goodfellow et al., 2016, Zhao et al., 2018, Sattler et al., 2019].
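In code, the decomposition in Equation (4.3) is simply a weighted sum of local empirical risks. A minimal NumPy sketch with a squared-error loss and an illustrative, unbalanced partition of the data:

```python
import numpy as np

rng = np.random.default_rng(1)

def local_risk(w, X, y):
    """F_k(w): mean squared error on one participant's local dataset."""
    return np.mean((X @ w - y) ** 2)

# K = 3 participants with unbalanced, possibly non-IID local datasets.
datasets = [(rng.normal(size=(n, 2)), rng.normal(size=n)) for n in (10, 40, 200)]
n_total = sum(len(y) for _, y in datasets)

def global_risk(w):
    """f(w) = sum_k (n_k / n) * F_k(w), as in Equation (4.3)."""
    return sum(len(y) / n_total * local_risk(w, X, y) for X, y in datasets)

w = np.array([0.5, -0.3])
print("global objective f(w) =", global_risk(w))
```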

SGD and its variants (e.g., mini-batch gradient descent) are the most popular optimization algorithms for DL [Zhang et al., 2019]. Many advances on DL can be understood as adapting the structure of the model (and hence the loss function) to be more amenable to optimization by simple gradient-based methods [Goodfellow et al., 2016, Zhang et al., 2019]. In light of the widespread applications of DL, it is natural that we also develop new algorithms for federated optimization starting from SGD [McMahan et al., 2016b].

SGD can be applied naively to federated optimization, where a single mini-batch gradient calculation (e.g., on a randomly selected subset of participants) is performed during each round of federated training. Here, “one round” refers to the operations of sending updates from the participants to the server and from the server back to the participants, i.e., including Steps 1–4 of Figure 4.2. This approach is computationally efficient, but requires a very large number of communication rounds to produce satisfactory models, e.g., even using an advanced approach like batch normalization (BN) [Ioffe and Szegedy, 2015], training on the MNIST dataset requires 50,000 rounds with mini-batches of size 60 [McMahan et al., 2016b].

For DML with parallel training within data centers or computing clusters, communication costs are relatively small and computational costs dominate. Recent approaches focus on applying graphics processing units (GPUs) to lower these costs. In contrast, in federated learning, communication costs dominate, as communication takes place over the Internet or wide area networks (WANs), possibly even over wireless and mobile networks.

In federated learning, a single on-site dataset is usually small, compared to the total dataset size, and modern terminals (such as smartphones) have relatively fast processors, even including GPUs. As a result, computation cost is negligible compared to communication costs for many model types in federated learning. Hence, we may use additional computation to decrease the number of rounds of communication needed to train a model. The following are two primary ways to add computation [McMahan et al., 2016b].

•  Increased parallelism. We can engage more participants working independently in between client-server communication rounds.

•  Increased computation on each participant. Rather than performing a simple computation like a gradient calculation, each client performs a more complex calculation in between communication rounds, such as performing multiple model weight updates over a training epoch.

4.3.2    THE FEDAVG ALGORITHM

As articulated in McMahan et al. [2016b], the FedAvg algorithm family allows us to add computation using both approaches outlined above. The amount of computation is controlled by three key parameters, namely: (1) ρ, the fraction of clients that perform computation during each round; (2) S, the number of training steps each client performs over its local dataset during each round (i.e., the number of local epochs); and (3) M, the mini-batch size used for the client updates. We use M = ∞ to indicate that the full local dataset is treated as a single mini-batch.

We may set M = ∞ and S = 1 to produce a form of SGD with a varying mini-batch size. This algorithm selects a ρ-fraction of participants during each round, and computes the gradient and the loss function over all the data held by these participants. Therefore, in this algorithm, ρ controls the global batch size, with ρ = 1 corresponding to full-batch gradient descent using all data held by all participants. Since we still select batches by using all the data on the chosen participants, we refer to this simple baseline algorithm as FederatedSGD. While the batch selection mechanism is different from selecting a batch by choosing individual examples uniformly at random, the batch gradients g computed by the FederatedSGD algorithm still satisfy E[g] = ∇f(w), provided that the datasets held at different participants are IID.

It is commonly assumed that the coordinator or server has the initial ML model, and the participants know the settings of the optimizer. For a typical implementation of distributed gradient descent with a fixed learning rate η, in the t th round of global model weight update, the k th participant computes gk = ∇ Fk(wt), the average gradient on its local data points at the current model weight wt, and the coordinator aggregates these gradients and applies the update of model weights according to McMahan et al. [2016b]:

$$w_{t+1} \leftarrow w_t - \eta \sum_{k=1}^{K} \frac{n_k}{n}\, g_k, \qquad (4.4)$$

where $\sum_{k=1}^{K} \frac{n_k}{n} g_k = \nabla f(w_t)$, provided that the data points held at different participants are IID. The coordinator can then send the updated model weights $w_{t+1}$ back to the participants. Alternatively, the coordinator can send the averaged gradient $\bar{g}_t = \sum_{k=1}^{K} \frac{n_k}{n} g_k$ back to the participants, and the participants calculate the updated model weights $w_{t+1}$ according to Equation (4.4). This method is called gradient averaging [Tang et al., 2019, Su and Chen, 2018].

Algorithm 4.1 The FedAvg Algorithm (adapted from McMahan et al. [2016b])


It is straightforward to show that an equivalent approach is given by [McMahan et al., 2016b]

$$\forall k: \quad w_{t+1}^{k} \leftarrow w_t - \eta\, g_k, \qquad (4.5)$$
$$w_{t+1} \leftarrow \sum_{k=1}^{K} \frac{n_k}{n}\, w_{t+1}^{k}. \qquad (4.6)$$

That is, each client locally takes one step (or multiple steps) of gradient descent on the current model weights $w_t$ using its local data according to Equation (4.5), and sends the locally updated model weights $w_{t+1}^{k}$ to the server. The server then takes a weighted average of the resulting models according to Equation (4.6), and sends the aggregated model weights $w_{t+1}$ back to the participants. This method is called model averaging [McMahan et al., 2016b, Yu et al., 2018].

The model averaging variant of the FedAvg algorithm is summarized in Algorithm 4.1. Once the algorithm is written in this way, it is natural to ask what happens when each participant iterates the local update of Equation (4.5) multiple times before the averaging step. For a participant with $n_k$ local data points, the number of local updates per round is given by $u_k = S \cdot n_k / M$, where $S$ is the number of local epochs and $M$ is the mini-batch size.
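To make the procedure concrete, the following minimal NumPy simulation sketches the model-averaging training loop with client fraction ρ, local epoch count S, and mini-batch size M as introduced above. The toy linear-regression data, learning rate, and round count are illustrative assumptions, and the sketch is not a reproduction of the pseudo-code in Algorithm 4.1.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic HFL setup: K clients, each with its own local dataset.
K, d = 10, 5
true_w = rng.normal(size=d)
clients = []
for _ in range(K):
    n_k = rng.integers(20, 200)                      # unbalanced data volumes
    X = rng.normal(size=(n_k, d))
    y = X @ true_w + 0.1 * rng.normal(size=n_k)
    clients.append((X, y))

def client_update(w, X, y, S=5, M=32, lr=0.05):
    """S local epochs of mini-batch SGD (mini-batch size M) on one client."""
    w = w.copy()
    for _ in range(S):
        idx = rng.permutation(len(y))
        for start in range(0, len(y), M):
            b = idx[start:start + M]
            grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)
            w -= lr * grad
    return w

w_global = np.zeros(d)
rho = 0.3                                            # fraction of clients per round
for t in range(50):                                  # communication rounds
    m = max(1, int(rho * K))
    selected = rng.choice(K, size=m, replace=False)
    local_weights, local_sizes = [], []
    for k in selected:
        X, y = clients[k]
        local_weights.append(client_update(w_global, X, y))
        local_sizes.append(len(y))
    # Server: weighted model averaging over the selected clients.
    sizes = np.asarray(local_sizes, dtype=float)
    w_global = sum((n / sizes.sum()) * w for n, w in zip(sizes, local_weights))

print("distance to true model:", np.linalg.norm(w_global - true_w))
```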

However, for general non-convex objectives, model averaging in the model weight space may produce an arbitrarily bad model, and it may not even converge at all [Tang et al., 2019, Su and Chen, 2018]. Luckily, recent work indicates that in practice, the loss surfaces of sufficiently over-parameterized DNNs are surprisingly well behaved, and in particular less prone to bad local minima than previously thought [Goodfellow et al., 2015]. When we start two models from the same random initialization and then train each independently on a different subset of the data (as described above), it has been found empirically that the model averaging based approach works surprisingly well [McMahan et al., 2016a,b, Yu et al., 2018]. The success of dropout training may provide some intuition for the success of the federated model averaging scheme. Dropout training can be interpreted as averaging models of different architectures that share model weights, and the inference-time scaling of the model weights is analogous to the model averaging used in FedAvg [Srivastava et al., 2014].

4.3.3    THE SECURED FEDAVG ALGORITHM

The plain FedAvg algorithm in the form of Algorithm 4.1 exposes plaintext intermediate results, such as gradients from an optimization algorithm like SGD or model weights of DNNs, to the coordinator. No security guarantee is provided against the coordinator, and the leakage of gradients or model weights may actually leak important data and model information [Phong et al., 2018] if the data structure is also exposed. We can leverage privacy-preserving techniques, such as the widely used methods described in Chapter 2, to ensure user privacy and data security in FedAvg.

Algorithm 4.2 The Secured FedAvg Algorithm (model averaging with AHE)


As an illustrative example, we use additively homomorphic encryption (AHE) [Acar et al., 2018] (e.g., the Paillier algorithm [Paillier, 1999] or the learning with errors (LWE) based encryption [Phong et al., 2018]) to enhance the security of the FedAvg algorithm.

Recall that AHE is a semi-homomorphic encryption algorithm that only supports addition and scalar multiplication operations (i.e., additive homomorphism and multiplication by a plaintext scalar [Paillier, 1999]). For ease of reference, we summarize here the key properties of AHE. Let ⟦u⟧ and ⟦v⟧ denote the homomorphic encryption of u and v, respectively. With AHE, the following holds (see Section 2.4.2):

•  Addition: Decsk(⟦u⟧ ⊕ ⟦v⟧) = Decsk(⟦u + v⟧), where “⊕” may represent multiplication of the ciphertexts (see, e.g., Paillier [1999]).

•  Scalar multiplication: Decsk(⟦u⟧ ⊙ n) = Decsk(⟦u · n⟧), where “⊙” may represent raising the ciphertext to the power n (see, e.g., Paillier [1999]).

Thanks to these two nice properties of AHE, we can directly apply AHE to the FedAvg algorithm to ensure security against the coordinator/server.
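As an illustration of how these properties enable aggregation of encrypted model weights, the following sketch uses the python-paillier package (`phe`), which is an assumed dependency here. It shows only coordinate-wise encryption and the coordinator-side additive aggregation; key distribution, weighting by $n_k$, and the efficiency considerations discussed below are omitted.

```python
import numpy as np
from phe import paillier  # pip install phe

# Participants share a key pair generated outside the coordinator;
# the coordinator only ever sees ciphertexts.
public_key, private_key = paillier.generate_paillier_keypair(n_length=1024)

local_weights = [np.array([0.5, -1.2, 3.0]),
                 np.array([0.7, -1.0, 2.6])]
K = len(local_weights)

# Each participant encrypts its local model weights coordinate by coordinate.
encrypted = [[public_key.encrypt(float(x)) for x in w] for w in local_weights]

# Coordinator: add ciphertexts (additive homomorphism) and scale by 1/K
# (scalar multiplication), without ever decrypting individual contributions.
aggregated = []
for column in zip(*encrypted):
    total = column[0]
    for ciphertext in column[1:]:
        total = total + ciphertext      # ciphertext addition
    aggregated.append(total * (1.0 / K))  # scalar multiplication

# Participants decrypt the aggregated (averaged) model weights.
averaged = np.array([private_key.decrypt(c) for c in aggregated])
print("decrypted average:", averaged)   # ~ [0.6, -1.1, 2.8]
```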

Specifically, by comparing Algorithm 4.1 with Algorithm 4.2, it can be observed that security measures, such as AHE, can easily be added on top of the original FedAvg algorithm to provide secure federated learning. It was shown in Phong et al. [2018] that, under certain conditions, the secured FedAvg algorithm in Algorithm 4.2 leaks no information of the participants to an honest-but-curious coordinator, provided that the underlying homomorphic encryption scheme is chosen-plaintext attack (CPA) secure. In other words, Algorithm 4.2 ensures honest-but-curious security against the coordinator. With AHE, the data and the model itself are never transmitted in plaintext form. Hence, there is almost no possibility of leakage at the raw data level. However, the encryption and decryption operations increase the computational complexity, and the transmission of ciphertexts introduces extra communication overhead. Another drawback of AHE is that polynomial approximations (e.g., using first-order Taylor approximations for loss and gradient computations) need to be performed in order to evaluate nonlinear functions. As a result, there is a trade-off between accuracy and privacy. Security measures for the FedAvg algorithm need further study.

4.4    IMPROVEMENT OF THE FEDAVG ALGORITHM

4.4.1    COMMUNICATION EFFICIENCY

In the implementation of the FedAvg algorithm, each participant needs to send a full model weight update to the server during each round of federated training. As modern DNN models can easily have millions of parameters, sending so many values to a coordinator leads to huge communication costs, which grow with the number of participants and iteration rounds. When there are a large number of participants, uploading model weights from the participants to the coordinator becomes the bottleneck of federated learning. To reduce communication costs, several methods have been proposed to improve communication efficiency. One example is Konecný et al. [2016a], in which two strategies for computing model weight updates are proposed.

•  Sketched updates. Participants compute a normal model weight update and then compress it locally. The compressed model weight update is typically an unbiased estimator of the true update, meaning that they agree on average. One possible way of performing model weight update compression is probabilistic quantization (a minimal sketch is given after this list). Participants then send the compressed updates to the coordinator.

•  Structured updates. During the training process, the model weight update is restricted to be of a form that allows for an efficient compression. For example, the model weight may be forced to be sparse or low-rank, or model weight update is computed within a restricted space that can be parameterized using a smaller number of variables. The optimization process then finds the best possible update for this form.
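A minimal sketch of a probabilistic (stochastic) binary quantizer of the kind usable for sketched updates. It is unbiased, i.e., the expected dequantized update equals the original update; bit-packing and subsampling are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_quantize(update):
    """Quantize each coordinate to {low, high} so that E[dequantized] = update."""
    low, high = update.min(), update.max()
    # Probability of picking `high` grows linearly with the coordinate's value.
    p_high = (update - low) / (high - low + 1e-12)
    bits = rng.random(update.shape) < p_high
    return bits, low, high                      # what would be sent to the server

def dequantize(bits, low, high):
    return np.where(bits, high, low)

update = rng.normal(size=10_000)
bits, low, high = stochastic_quantize(update)
recovered = dequantize(bits, low, high)
print("mean of original :", update.mean())
print("mean of recovered:", recovered.mean())   # close to the original on average
```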

Han et al. [2015] studied DNN model compression and proposed a three-stage pipeline for carrying out model weight compression. First, the DNN connections are pruned to remove redundancy, keeping only the most meaningful connections. Second, the weights are quantized so that multiple connections share the same weight values and only the effective weights need to be stored. Finally, Huffman coding is applied to take advantage of the biased distribution of the effective weights.

When model weights are shared in federated learning, we can use model weight compression to reduce communication costs. Similarly, when gradients are shared in federated learning, we can use gradient compression to bring down communication overhead. One well-known gradient compression method is the deep gradient compression (DGC) approach [Lin et al., 2018]. DGC employs four methods, namely: (1) momentum correction, (2) local gradient clipping, (3) momentum factor masking, and (4) warm-up training. Lin et al. [2018] applied DGC to image classification, speech recognition, and language modeling tasks. The results showed that DGC can achieve gradient compression ratios of 270 to 600 times without compromising model accuracy. Therefore, DGC can be employed to reduce the communication bandwidth required for sharing gradients or model weights and to facilitate large-scale federated DL or federated learning.
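At its core, DGC transmits only the largest-magnitude gradient coordinates in each round and accumulates the remaining ones locally for later rounds. A simplified top-k sparsification sketch, without DGC's momentum correction, gradient clipping, masking, or warm-up:

```python
import numpy as np

def sparsify_topk(grad, residual, compress_ratio=0.01):
    """Keep only the top-k coordinates of (grad + residual); accumulate the rest.

    Returns the sparse values/indices to transmit and the updated local residual.
    """
    accumulated = grad + residual
    k = max(1, int(compress_ratio * accumulated.size))
    idx = np.argpartition(np.abs(accumulated), -k)[-k:]   # indices of top-k magnitudes
    values = accumulated[idx]
    new_residual = accumulated.copy()
    new_residual[idx] = 0.0                               # transmitted part is removed
    return values, idx, new_residual

rng = np.random.default_rng(0)
residual = np.zeros(100_000)
grad = rng.normal(size=100_000)
values, idx, residual = sparsify_topk(grad, residual)
print(f"sent {values.size} of {grad.size} coordinates "
      f"({grad.size / values.size:.0f}x fewer values)")
```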

Besides compression, quantization is another efficient method for reducing communication overhead in federated learning [Konecný et al., 2016a, Reisizadeh et al., 2019]. For example, the signSGD-based approach proposed in Chen et al. [2019] has very low per-iteration communication overhead, since it employs one-bit quantization per gradient dimension. Chen et al. [2019] developed a novel gradient correction mechanism that perturbs the local gradients with noise and then applies one-bit quantization, which can also be seen as a special gradient compression scheme.
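A generic one-bit (sign) quantizer of this kind transmits one bit per gradient coordinate plus a single scaling factor; this is an illustrative sketch, not the exact scheme of Chen et al. [2019]:

```python
import numpy as np

def one_bit_quantize(grad):
    """Encode each coordinate with one bit (its sign) plus one scalar (mean magnitude)."""
    scale = np.mean(np.abs(grad))      # single float sent alongside the sign bits
    signs = np.sign(grad).astype(np.int8)
    return signs, scale

def one_bit_dequantize(signs, scale):
    return scale * signs.astype(np.float64)

rng = np.random.default_rng(0)
grad = rng.normal(size=8)
signs, scale = one_bit_quantize(grad)
print("original   :", np.round(grad, 3))
print("dequantized:", np.round(one_bit_dequantize(signs, scale), 3))
```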

It is also possible for clients to avoid uploading irrelevant model updates to the server, so as to reduce communication overhead, provided that the training convergence can still be guaranteed [Wang et al., 2019, Hsieh et al., 2017]. For example, Wang et al. [2019] proposed to provide clients with the feedback information regarding the global tendency of model updating. Each client checks whether its local model update aligns with the global tendency and whether it is relevant enough to global model improvement. In this way, each client can decide whether or not it shall upload its local model update to the server. This can be seen as a special case of client selection.

4.4.2    CLIENT SELECTION

In the original work of McMahan et al. [2016b], client selection was recommended to reduce communication cost and time taken for each global training round. However, no approach was proposed for client selection. Nishio and Yonetani [2018] introduced a method for client selection, which consists of two steps. The first step is the resource check step. That is, the coordinator sends queries to a random number of participants to ask about their local resources and the size of data that are relevant to the training task. In the second step, the coordinator uses this information to estimate the time required for each participant to compute model weight update locally, and the time to upload the update. The coordinator then determines which participants to select based on these estimates. The coordinator wants to select as many participants as possible given a specific time budget for one global federated training round.
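As a simplified illustration of the second step, the coordinator below estimates each queried client's round time from the reported resources (all numbers are hypothetical) and keeps the clients that can finish within the round's time budget. This is a greedy sketch in the spirit of Nishio and Yonetani [2018], not their exact protocol.

```python
# Resource information reported by the queried clients (hypothetical numbers).
clients = [
    {"id": "a", "num_samples": 500,  "flops_per_s": 2e9, "uplink_mbps": 5.0},
    {"id": "b", "num_samples": 2000, "flops_per_s": 1e9, "uplink_mbps": 1.0},
    {"id": "c", "num_samples": 300,  "flops_per_s": 4e9, "uplink_mbps": 10.0},
    {"id": "d", "num_samples": 1500, "flops_per_s": 5e8, "uplink_mbps": 0.5},
]

FLOPS_PER_SAMPLE = 1e6        # assumed cost of one local training pass per sample
UPDATE_SIZE_MBITS = 8.0       # assumed size of one model-weight update

def estimated_round_time(c):
    compute = c["num_samples"] * FLOPS_PER_SAMPLE / c["flops_per_s"]
    upload = UPDATE_SIZE_MBITS / c["uplink_mbps"]
    return compute + upload   # seconds

# Keep the fastest clients whose estimated round time fits the per-round budget
# (clients run in parallel, so each selected client must finish within the budget).
TIME_BUDGET_S = 5.0
ranked = sorted(clients, key=estimated_round_time)
selected = [c["id"] for c in ranked if estimated_round_time(c) <= TIME_BUDGET_S]
print("selected clients:", selected)
```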

4.5    RELATED WORKS

The most recent Google workshop on federated learning brought together world-class researchers and practitioners and presented the newest development of federated learning [Google, 2019], such as agnostic federated learning [Mohri et al., 2019], federated transfer learning (see Chapter 6), the incentive mechanism design for federated learning (see Chapter 7), the privacy, security, and fairness aspects of federated learning (see, e.g., Agarwal et al. [2018], Pillutla et al. [2019], Melis et al. [2018], Ma et al. [2019]), as well as a lecture on using the open-source platform TensorFlow Federated [TFF, 2019] for research and deployment of federated learning. We review here some examples of the related studies.

Communication is one of the major challenges of federated learning. In Wang and Joshi [2019], an adaptive communication strategy, termed AdaComm, was proposed to tackle the problem of random communication delays encountered in federated learning. AdaComm first starts with infrequent model averaging to save communication bandwidth, to deal with random delays, and to improve convergence speed. It then increases the communication frequency in order to achieve better model performance and a low error floor. Wang and Joshi [2019] presented a theoretical analysis of the error-runtime trade-off for the periodic-averaging SGD algorithm, where each participant performs local updates and their models are averaged periodically (e.g., after every τ iterations). Wang and Joshi [2019] is the first work to analyze the convergence of periodic-averaging SGD in terms of error with respect to wall-clock time, instead of the number of iterations, while taking into account the effect of computation and communication delays on the runtime per iteration. AdaComm is a communication-efficient SGD algorithm for federated learning, which is particularly suitable for mobile applications.

As was discussed in Section 4.3, while the FedAvg algorithm works well for certain non-convex objective functions (also known as cost or loss functions) under IID settings, it could produce unpredictable results for generally non-convex objective functions with non-IID datasets [Goodfellow et al., 2015]. In Xie et al. [2019], a new asynchronous solution was proposed to improve the flexibility and scalability of federated optimization with non-IID training data. The key idea is that the server and the clients conduct model updates asynchronously. The server immediately updates the global model whenever it receives a local model from a client. The communication between the server and the clients is non-blocking. Xie et al. [2019] further analyzed the convergence of their proposed asynchronous approach for a restricted family of non-convex problems under non-IID settings. It was also demonstrated with numerical examples that the asynchronous algorithm enjoys fast convergence and tolerates staleness. Several mixing hyperparameters are introduced to control the trade-off between the convergence rate and variance reduction according to the staleness. However, the hyperparameters of the proposed asynchronous algorithm could be difficult to tune in practice.

The coordinator in HFL is a potential privacy leak and some scholars actually prefer to remove the coordinator. For example, Zantedeschi et al. [2019] considered federated learning under the peer-to-peer architecture, i.e., without the central coordinator, and proposed an optimization procedure that leverages a collaboration graph describing the relationships between the tasks of the participants. The collaboration graph and the ML models are jointly learned. The fully decentralized solution of Zantedeschi et al. [2019] alternates between (i) training nonlinear ML models given the graph in a greedy boosting manner, and (ii) updating the collaboration graph (with controlled sparsity) when given the ML models. Further, the participants exchange messages only with a small number of peers (their direct neighbors in the graph and a few more random participants), thus ensuring the scalability of this fully decentralized solution.

HFL was originally proposed and has been promoted by Google for B2C (business-to-consumer) applications, i.e., mainly for collaborative ML model training employing mobile devices [Konecný et al., 2016b, Yang et al., 2018, Hard et al., 2018], and especially for scenarios with a large number of mobile devices [Bonawitz and Eichner et al., 2019]. While Google is advocating and developing HFL for mobile applications, HFL has been applied to a variety of practical scenarios, particularly to B2B (business-to-business) applications. For instance, during the Google federated learning workshop [Google, 2019], Prof. Dawn Song from UC Berkeley talked about applying federated learning to anomaly detection, such as fraud detection [Song, 2019, Hynes et al., 2018], and there are several recent works along this direction; see, e.g., Nguyen et al. [2019] and Preuveneers et al. [2018]. Chapters 8 and 10 will provide more examples of practical applications of HFL.

4.6    CHALLENGES AND OUTLOOK

There are some examples of commercial deployment of HFL systems, such as the one carried out by Google on mobile devices, i.e., the Gboard system [Bonawitz and Eichner et al., 2019]. However, HFL is still in its infancy, and widespread adoption of HFL still faces numerous challenges [Yang et al., 2019, Li, Wen, and He, 2019, Li et al., 2019]. Here, we briefly discuss some of the major challenges.

The first major challenge is the inability to inspect training data, which leads to one of the central problems, namely choosing the hyperparameters of an ML or DL model and setting the optimizers, particularly for training DNN models. It is common to assume that the coordinator or the server has the initial model and knows how to train the model. However, in practice, since we do not collect any data in advance, it is almost impossible to choose the right hyperparameters for a DNN model and set up an optimizer beforehand. Here, the hyperparameters may include the number of layers of a DNN, the number of nodes in each layer of a DNN, the structure of the convolutional neural networks (CNNs), the structure of the recurrent neural networks (RNNs), the output layer of a DNN, and the activation functions, etc. Optimizer options may include which optimizer to use, batch sizes, and learning rates. For example, even the learning rate is difficult to determine, since we have no information about the gradient magnitude at each participant. Trying out many different learning rates in production would take time and could worsen the development experience. To address this challenge, the simulation-based approach suggested by Hartmann [2018, 2019] appears to be promising.

The second challenge is how to effectively motivate companies and organizations to participate in HFL. Traditionally, large companies and organizations have been trying to collect data and create data silos so as to be more competitive in the AI age. If they join HFL, their competitors may benefit from their data, which could cost these large companies their market dominance. As a result, motivating large companies to adopt HFL could be difficult. To resolve this challenge, we need to devise effective data protection policies, appropriate incentive mechanisms, and business models for HFL.

When applied to mobile devices, it will also be difficult to convince device owners to allow their devices to participate in federated learning. Sufficient incentives and benefits, such as the potential for a better user experience, need to be demonstrated to mobile users to draw their interest in contributing their devices to federated learning.

The third challenge is how to prevent cheating behaviors by participants. It is commonly assumed that the participants are honest. However, in real-life scenarios, honesty can only be enforced through regulations and laws. For example, one participating party may fraudulently report the number of data points it can contribute for model training, or report false testing results of the trained model, in order to gain more rewards. Since we are not able to inspect the datasets of any participant, it is difficult to detect such cheating behaviors. To address this challenge, we need to design a holistic approach for protecting the rights and interests of the honest participants.

To realize the full potential of HFL, much research is still needed. In addition to addressing the aforementioned challenges, we need to study mechanisms for managing the training process. For example, since model training and evaluation are carried out locally at each participant, we need to develop new ways to avoid over-fitting and to trigger early stopping. Another interesting direction is how to handle participants with different reliability levels. For example, some participants may leave during the training process of HFL due to interrupted network connections or other issues. As a result, we need smart solutions to replace the dropped participants with new participants without affecting the training process and model accuracy, and especially without affecting the convergence speed of model training. Finally, we need to develop efficient mechanisms to defend against model poisoning attacks (such as targeted backdoor attacks) in federated learning systems.
