This chapter provides an overview of the architecture, procedure flow, sequence of messages, and basics of model aggregation in a federated learning (FL) system. As discussed in Chapter 2, What Is Federated Learning?, the conceptual basics of the FL framework are quite simple and easy to understand. However, implementing a real FL framework requires a good understanding of both AI and distributed systems.
The content of this chapter is based on the most standard foundation of FL systems, which is used in the hands-on exercises later in the book. First, we will introduce the building blocks of FL systems, such as an aggregator with an FL server, an agent with an FL client, a database server, and the communication between these components. The architecture introduced in this chapter is designed in a decoupled way, so further enhancement of the system is relatively easier than with an FL system that bundles everything onto one machine. Then, we will explain the flow of FL operation from initialization to aggregation.
Finally, we will examine the way an FL system is scaled with a horizontal design of decentralized FL setups.
This chapter covers the following topics:

- Understanding the FL system architecture
- Understanding the FL system flow – from initialization to continuous operation
- Basics of model aggregation
- Furthering scalability with a horizontal design
FL systems are distributed systems, dispersed across servers and distributed clients. Here, we will define a representative architecture of an FL system with the following components: an aggregator with an FL server, an agent with an FL client, and a database:
Figure 3.1 shows the typical overall architecture consisting of a single cluster aggregator and a database server, as well as multiple distributed agents:
Figure 3.1 – Overall architecture of an FL system
One advantage of the FL system’s architecture is that users do not have to send private raw data to the server, especially a server owned by a third party. Instead, they only have to send locally trained models to the aggregator. The locally trained models can take a variety of formats, such as the weights of the entire ML model, the changes in weights (gradients), or even a subset of them. Another advantage is a reduced communication load, because the users only have to exchange models, which are usually much lighter than the raw data.
A cluster aggregator consists of an FL server module, an FL state manager module, and a model aggregation module, as in Figure 3.1. We simply call a cluster aggregator with an FL server an aggregator. While these modules are the foundation of the aggregator, advanced modules can be added to further ensure the security and flexibility of ML model aggregation. Some of the advanced modules are not implemented in the simple-fl GitHub repository provided with this book’s exercises, because the main purpose of this book is to understand the basic structure and system flow of an FL system. In the aggregator system, the following modules related to the FL server, the FL state manager, and model aggregation are the keys to implementing the aggregator-side functionalities.
A distributed agent consists of an FL client module that includes the communication handler and client libraries as well as local ML applications connected to the FL system through the FL client libraries:
A database server consists of a database query handler and a database as storage. The database server can reside on the server side, such as in the cloud, and is tied closely to the aggregators, although the recommended design separates this database server from the aggregator servers to decouple the functionalities and enhance the system’s simplicity and resilience. The functionality of the database query handler and sample database tables are as follows:
Note
In the code sample of the simple-fl repository, only the database tables related to the local models and cluster models are covered to simplify the explanation of the entire FL process.
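To make the local model and cluster model tables more concrete, here is a minimal schema sketch using SQLite. The table and column names are illustrative assumptions for this example, not the actual schema used by the simple-fl repository:

```python
import sqlite3

# Hypothetical schema sketch: table and column names are illustrative,
# not the actual schema of the simple-fl repository.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE local_models (
    model_id    TEXT PRIMARY KEY,   -- unique ID of the uploaded local model
    agent_id    TEXT NOT NULL,      -- agent that trained and uploaded it
    round       INTEGER NOT NULL,   -- FL round in which it was uploaded
    created_at  TEXT NOT NULL,      -- generation timestamp
    num_samples INTEGER,            -- size of the local training set
    weights     BLOB                -- serialized model parameters
);
CREATE TABLE cluster_models (
    model_id      TEXT PRIMARY KEY, -- unique ID of the aggregated model
    aggregator_id TEXT NOT NULL,    -- aggregator that synthesized it
    round         INTEGER NOT NULL,
    created_at    TEXT NOT NULL,
    weights       BLOB
);
""")
```

Keeping the model metadata (round, timestamps, sample counts) alongside the serialized weights is what later allows the aggregator to decide which models belong to which round.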
Now that the basic architecture of the FL system has been introduced, next, we will talk about how to enhance the FL system’s architecture if the computation resources are limited on the agent-side devices.
Sometimes, the computational capability of local user devices is limited – ML training may be difficult on those devices, but inference or prediction can be made possible just by downloading the global model. In these cases, an FL platform may set up an additional intermediate server layer consisting of devices such as smartphones, tablets, or edge servers. For example, in a healthcare AI application, users manage their health information on their smart watches, which can be transferred to their smart tablets or synced with laptops. On those devices, it is easy to retrain ML models and integrate the distributed agent functionalities.
Therefore, the system architecture needs to be modified or redesigned depending on the applications into which the FL system is integrated, and the concept of intermediate servers can be applied using distributed agents to realize FL processes. We do not have to modify the interactions and communication mechanisms between the aggregators and the intermediate servers. Just by implementing APIs between the user devices and the intermediate servers, FL will be possible in most use cases.
Figure 3.2 illustrates the interaction between the aggregators, intermediate servers, and user devices:
Figure 3.2 – An FL system with intermediate servers
Now that we have learned about the basic architecture and components of an FL system, let us look into how an FL system operates in the following section.
Each distributed agent belongs to an aggregator that is managed by an FL server, where ML model aggregation is conducted to synthesize a global model that is then sent back to the agents. An agent uses its local data to train an ML model and then uploads the trained model to the corresponding aggregator. The concept sounds straightforward, so let us look at the entire flow of those processes in a bit more detail.
We also define a cluster global model, which we simply call a cluster model or global model: an aggregated ML model of the local models collected from distributed agents.
Note
In the next two chapters, we will guide you on how to implement the procedure and sequence of messages discussed in this chapter. However, some of the system operation perspectives, such as an aggregator or agent system registration in the database, are not introduced in the code sample of the simple-fl repository in order to simplify the explanation of the entire FL process.
The sequence of the initialization processes is quite simple. The initialization and registration processes need to happen in the order of database, aggregator, and agents. The overall registration sequence of an aggregator and an agent with a database is depicted in Figure 3.3 as follows:
Figure 3.3 – The process of aggregator and agent registration in the database server
Here is the initialization and registration procedure of each component in the FL system:
In particular, when the aggregator receives a participation message from an agent, it goes through the following procedure, as in Figure 3.4. The key steps after receiving the participation request are (i) checking whether the agent is trusted, or whether the agent is already registered, and (ii) checking whether the initial global model is already registered. If (i) is met, the registration process continues. If the (initial) global model is already registered, the agent will receive the global model and can start using it for the local training process on the agent side.
The agent participation and registration process at an aggregator side is depicted in Figure 3.4:
Figure 3.4 – The registration process of an agent by an aggregator
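The two checks described above can be sketched in a few lines of Python. This is a minimal illustration of the logic, not the simple-fl implementation; the names `TRUSTED_AGENTS`, `AGENT_REGISTRY`, and `GLOBAL_MODELS` are hypothetical:

```python
# Hypothetical in-memory state standing in for the database server.
TRUSTED_AGENTS = {"agent-001", "agent-002"}  # agents allowed to participate
AGENT_REGISTRY = {"agent-001"}               # agents already registered
GLOBAL_MODELS = {}                           # round -> global model (empty until a base model exists)

def handle_participation(agent_id):
    """Return (accepted, global_model_or_None) for a participation request."""
    # (i) Is the agent trusted, or already registered?
    if agent_id not in TRUSTED_AGENTS and agent_id not in AGENT_REGISTRY:
        return False, None
    AGENT_REGISTRY.add(agent_id)
    # (ii) Is an initial global model already registered?
    if GLOBAL_MODELS:
        latest_round = max(GLOBAL_MODELS)
        # The agent can immediately start local training with this model.
        return True, GLOBAL_MODELS[latest_round]
    # Registered, but the agent must wait for (or upload) a base model.
    return True, None
```

If check (ii) fails, the agent is still registered; it simply waits until a base model becomes available, as described in the next section.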
Now that we understand the initialization and registration process of the FL system components, let us move on to the basic configuration of the ongoing FL process, which is about uploading the initial ML model.
The next step in running an FL process is to register the initial ML model, whose architecture will be used in the entire, continuous process of FL by all the aggregators and agents. The initial model can be distributed by the company that owns the ML application and FL servers, which will likely provide the initial base model as part of the aggregator configuration.
We call the initial ML model used as a reference for model aggregation a base model. We also call the agent that uploads the initial base model an initial agent. The base model info could include the ML model itself, the time it was generated, and the initial performance data. The process of initializing the base model can be seen in Figure 3.5:
Figure 3.5 – Base model upload process for the initial agent
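A minimal sketch of the base model info and its one-time registration might look as follows. The field names and the dictionary-based "database" are illustrative assumptions, not the simple-fl message format:

```python
from dataclasses import dataclass, field
from typing import Optional
import time

# Hypothetical container for the base model info described above.
@dataclass
class BaseModelInfo:
    model: dict                               # e.g. layer name -> parameter list
    generated_at: float = field(default_factory=time.time)
    initial_performance: Optional[float] = None  # e.g. validation accuracy

def upload_base_model(db, info):
    """Register the base model as the round-0 global model if none exists yet."""
    if 0 in db:
        raise ValueError("base model already registered")
    db[0] = info
    return info
```

Guarding against a second registration matters because the base model fixes the model architecture that every aggregator and agent will use for the rest of the FL process.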
Now, the FL process is ready to be conducted. Next, we will learn about the FL cycle, which is a very core part of the FL process.
In this section, we will give an example with only a single agent and aggregator, but in real operations, agent environments vary and are dispersed across distributed devices. The following is the list of processes for how the local models are uploaded, aggregated, stored, and sent back to the agents as a global model:
Figure 3.6 shows the typical overall process of how FL is continuously conducted between an agent, an aggregator, and a database:
Figure 3.6 – Overview of the continuous FL cycle
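The continuous cycle above can be summarized as a loop. This is a high-level sketch only; `train_local` and `aggregate` are placeholders standing in for real local training, communication, and aggregation logic:

```python
# A high-level sketch of the continuous FL cycle. All callables are
# placeholders for real training, communication, and aggregation code.
def fl_cycle(global_model, agents, aggregate, train_local, num_rounds=3):
    """Run num_rounds of: local training -> upload -> aggregate -> redistribute."""
    history = []
    for rnd in range(num_rounds):
        # 1. Each agent trains on its local data, starting from the global model.
        local_models = [train_local(global_model, agent) for agent in agents]
        # 2. The aggregator combines the uploaded local models into a new
        #    global (cluster) model, which would also be stored in the database.
        global_model = aggregate(local_models)
        history.append(global_model)
        # 3. The new global model is sent back to the agents for the next round.
    return global_model, history
```

In a real deployment, steps 1-3 run on different machines and overlap in time; the loop here only shows their logical ordering.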
Now that we understand the overall procedure of the FL process, we will look into the different round management approaches in the FL process next: synchronous FL and asynchronous FL.
When the model aggregation happens at the aggregator, there are multiple criteria related to how many local models it needs to collect from which agents. In this section, we will briefly talk about the differences between synchronous and asynchronous FL, which have been discussed in a lot of literature, such as https://iqua.ece.toronto.edu/papers/ningxinsu-iwqos22.pdf, so please refer to it to learn about these concepts further.
Synchronous FL requires the aggregator to select the agents that need to send their local models for each round in order to proceed with model aggregation. This synchronous FL approach is simple to design and implement and is suitable for FL applications that require a clear selection of agents. However, if the number of agents becomes too large, the aggregator may have to wait a long time to wrap up the current round, as the computational capability of the agents can vary and some of them may be slow to upload, or fail to upload, their local models. Such slow or dysfunctional agents are known as stragglers in distributed ML, and they motivate the use of the asynchronous FL mode.
Asynchronous FL does not require the aggregator to select the agents that have to upload their local models. Instead, it opens the door for any trusted agent to upload a model at any time. Furthermore, the aggregator can wrap up the federation round whenever it wants to generate the global model, with or without criteria such as a minimum number of local models that need to be collected, or a predefined interval or deadline for which the aggregator waits to receive local models from the agents before the aggregation for that round happens. This asynchronous FL approach gives the FL system much more flexibility in model aggregation for each FL round, but the design may be more complicated than the simple synchronous aggregation framework.
When managing the FL rounds, you need to consider the practicalities of running rounds, such as scheduling and dealing with delayed responses, the minimum levels of participation required, the details of example stores, using the downloaded or trained models for improved inference in the applications on the edge devices, and dealing with bad or slow agents.
We will look into the FL process and procedure flow next, focusing on the aggregator side.
An aggregator runs two threads: one to accept and cache the local models, and another to aggregate the collected local ML models. In this section, we describe those procedures.
The aggregator side process of accepting and caching local ML models is depicted in Figure 3.7 and explained as follows:
This procedure keeps running until the termination conditions are satisfied or the operator of the FL system opts to stop the process. Figure 3.7 depicts the procedure of accepting and caching local ML models:
Figure 3.7 – Procedure for accepting and caching local ML models
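The accepting-and-caching thread can be sketched with a thread-safe queue as the model cache. The queue-based buffering here is an implementation choice for illustration, not the simple-fl design:

```python
import queue
import threading

# A sketch of the accepting-and-caching thread: it pulls uploaded local
# models from a communication channel and buffers them until the
# aggregation thread consumes them. The queue is an illustrative choice.
model_buffer = queue.Queue()

def accept_and_cache(upload_channel, stop_event):
    """Accept local models from agents and cache them until aggregation."""
    while not stop_event.is_set():
        try:
            agent_id, local_model = upload_channel.get(timeout=0.1)
        except queue.Empty:
            continue
        # In a real system the model would also be validated and persisted here.
        model_buffer.put((agent_id, local_model))
```

The separate aggregation thread then drains `model_buffer` whenever its round criteria are met, so slow uploads never block acceptance of other models.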
Once the local ML models are accepted and cached, the FL system moves on to the next procedure: aggregating the local models.
The aggregator-side procedure of aggregating local ML models depicted in Figure 3.8 is as follows:
If the time for the FL round expires before enough participating agents have uploaded their models, the round can either be abandoned or forced to aggregate the local models collected so far.
Figure 3.8 shows the aggregator’s process from checking the aggregation criteria to synthesizing the global model when enough models are collected:
Figure 3.8 – Model synthesis routine: aggregating local ML models
Aggregating local models to generate the global model has been explained. Now, let us look into the agent-side FL cycle, including the retraining process of the local ML models.
In the distributed agent, the following state transition happens and is repeated for the continuous operation of the FL cycle:
Figure 3.9 depicts the state transition of an agent to adapt and update the ML model:
Figure 3.9 – Agent-side state transition to adapt and update the ML model
Next, we touch on model interpretation based on deviation from baseline outputs, which is used for anomaly detection and for preventing model degradation.
We can also provide an interpretation framework by looking at the output of each local model. The following procedure can be considered to ensure the local model is always good to use and can be deployed in production:
Now that the process of the FL has been explained, let us look into the basics of model aggregation, which comprise the critical part of FL.
Aggregation is a core concept within FL. In fact, the strategies employed to aggregate models are the key theoretical driver for the performance of FL systems. The purpose of this section is to introduce the high-level concepts of aggregation within the context of an FL system – the underlying theory and examples of advanced aggregation strategies will be discussed in greater depth in Chapter 7, Model Aggregation.
Let’s revisit the aggregator-side cycle discussed in the Understanding the FL system flow – from initialization to continuous operation section, at the point in the process where the agents assigned to a certain aggregator have finished training locally and have transmitted these models back to this aggregator. The goal of any aggregation strategy, or any way of aggregating these models together, is to produce new models that gradually increase in performance across all of the data collected by the constituent agents.
An important point to remember is that FL is, by definition, a restricted version of the distributed learning setting, in which the data collected locally by each agent cannot be directly accessed by other agents. If this restriction were not in place, a model could trivially be made to perform well on all of the data by collecting the data from each agent and training on the joint dataset; thus, it makes sense to treat this centrally trained model as the target model for an FL approach. At a high level, we can consider this unrestricted distributed learning scenario as aggregation before model training (where, in this case, aggregation refers to combining the data from each agent). Because FL does not allow data to be accessed by other agents, we consider the scenario as aggregation after model training instead; in this context, aggregation refers to the combination of the intelligence captured by each of the trained models from their differing local datasets. To summarize, the goal of an aggregation strategy is to combine models in a way that eventually leads to a generalized model whose performance approaches that of the respective centrally trained model.
To make some of these ideas more concrete, let’s take an initial look at one of the most well-known and straightforward aggregation strategies, known as Federated Averaging (FedAvg). The FedAvg algorithm is performed as follows: let $w_k$ be the parameters of the model from agent $k$ of the $K$ agents, each with a local dataset of size $n_k$. Also, let $n$ be the total dataset size, defined as $n = \sum_{k=1}^{K} n_k$. Then, FedAvg returns the following ML model as the aggregated model:

$$w = \sum_{k=1}^{K} \frac{n_k}{n} w_k$$
Essentially, we perform FedAvg over a set of models by taking the weighted average of the models, with weights proportional to the size of the dataset used to train the model. As a result, the types of models to which FedAvg can be applied are models that can be represented as some set of parameter values. Deep neural networks are currently the most notable of these kinds of models – most of the results analyzing the performance of FedAvg work with deep learning models.
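The weighted average described above takes only a few lines with NumPy. This is a minimal sketch of the FedAvg formula itself, treating each model as a flat parameter vector; real implementations operate layer by layer on deep network weights:

```python
import numpy as np

# A minimal FedAvg sketch: weight each agent's parameters by its share of
# the total training data, matching the formula in the text.
def fedavg(models, dataset_sizes):
    """models: list of parameter vectors; dataset_sizes: samples per agent."""
    n = sum(dataset_sizes)
    return sum((n_k / n) * np.asarray(w_k)
               for w_k, n_k in zip(models, dataset_sizes))

# Two agents with 100 and 300 samples -> aggregation weights 0.25 and 0.75.
w = fedavg([[0.0, 4.0], [4.0, 0.0]], [100, 300])
# -> array([3., 1.])
```

Note that an agent with more training data pulls the aggregate model proportionally closer to its own parameters.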
It is rather surprising that this relatively simple approach can lead to generalization in the resulting model. We can visually examine what FedAvg looks like within a toy two-dimensional parameter space to observe the benefits of the aggregation strategy:
Figure 3.10 – Two-dimensional parameter space with local models from two agents (the circle and square) and a target model (the black x)
Let’s consider a case where we have two newly initialized models (the circle and square points) belonging to separate agents. The space in the preceding figure represents the parameter space of the models, where each toy model is defined by two parameters. As the models are trained, these points will move in the parameter space – the goal is to approach a local optimum in the parameter space, generally corresponding to the aforementioned centrally trained model:
Figure 3.11 – Change in local model parameters without aggregation
Each model converges to separate dataset-specific optima (two x points from the circle and square) that do not generalize. Because each agent only has access to a subset of the data, the local optima reached by training each model locally will differ from the true local optima; this difference depends on how similar the underlying data distributions are for each agent. If the models are only trained locally, the resulting models will likely not generalize over all of the data:
Figure 3.12 – Adding aggregation moves the local model parameters to the average for both models at each step, leading to convergence at the target model
Applying FedAvg at each movement step allows us to create an aggregate model that eventually comes close to the true local optima in the parameter space.
This example demonstrates the basic capability of FedAvg to produce generalized models. However, working with real models (such as highly parameterized deep learning models) introduces additional complexity that is handled by FedAvg but not by simpler approaches. For example, we might wonder why we don’t simply fully train each local model and only average at the end; while this approach would work in this toy case, it has been observed that averaging only once with real models leads to poor performance across all of the data. The FedAvg process allows for a more robust way to reach the generalized model within high-dimensional parameter spaces.
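The toy two-dimensional example from Figures 3.10 to 3.12 can be reproduced numerically. The optima, step size, and iteration count below are made-up values chosen purely for illustration:

```python
import numpy as np

# A toy reproduction of the 2D parameter-space example: two agents each
# step toward their own (different) dataset-specific optimum, and FedAvg
# averages the parameters after every step. All numbers are illustrative.
target = np.array([1.0, 1.0])  # the "centrally trained" target model
optima = [np.array([2.0, 0.0]),  # agent A's dataset-specific optimum
          np.array([0.0, 2.0])]  # agent B's dataset-specific optimum

def local_step(w, opt, lr=0.5):
    """One local training step: move toward the agent's own optimum."""
    return w + lr * (opt - w)

w_global = np.array([0.0, 0.0])
for _ in range(20):
    local_models = [local_step(w_global, opt) for opt in optima]
    w_global = np.mean(local_models, axis=0)  # FedAvg with equal dataset sizes
# w_global converges toward the midpoint of the two optima, i.e. the target (1, 1).
```

Without the averaging step, each model would instead converge to its own optimum at (2, 0) or (0, 2), mirroring the non-generalizing outcome in Figure 3.11.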
This section only aims to give an overview of aggregation in FL; Chapter 7, Model Aggregation, contains more detailed explanations and examples for aggregation in different scenarios.
We now understand the entire process of how the FL system works with basic model aggregation. In some applications, the FL system may have to support a huge number of agents to realize its scalability. The following section will give you some idea about how to scale more smoothly, especially with a decentralized horizontal design.
In this section, we will look into how to further improve scalability when we need to support a large number of devices and users.
There are practical cases where centralized FL provides control, ease of maintenance and deployment, and low communication overhead. If the number of agents is not large, it makes more sense to stick with centralized FL than decentralized FL. However, when the number of participating agents becomes quite large, it may be worth looking into horizontal scaling with a decentralized FL architecture. The latest auto-scaling frameworks, such as Kubernetes (https://kubernetes.io/), can integrate nicely with the topic discussed in this section, although actual integration and implementation with Kubernetes is beyond the scope of this book.
There will be some use cases where many aggregators are needed to cluster groups of agents and create a global model on top of those many aggregators. Google uses a centralized approach for this, as in the paper Towards Federated Learning at Scale, although setting up a centralized node to manage multiple aggregators may have some resilience issues. The idea is simple: periodically aggregate all the cluster models at some central master node.
On the other hand, we can realize the decentralized way of aggregating cluster models created by multiple aggregators. The architecture for that is based on two crucial ideas:
To create semi-global models, decentralized cluster aggregators exchange their aggregated cluster models with each other and approximate optimal global models. The cluster aggregators can also use a database to periodically collect other cluster models to generate the semi-global models. This framework allows for the absorption of training results from diverse sets of users dispersed across many aggregators by synthesizing the most updated global models without a master node concept.
Based on this decentralized architecture, the robustness of the entire FL system can be enhanced, as the semi-global model can be independently computed at each cluster aggregator. The FL system can be scaled further, as each cluster aggregator is responsible for creating its own semi-global model by itself – not via the master node of those aggregators – and therefore, decentralized semi-global model formation comes with resiliency and mobility.
We can even decouple the database that stores the uploaded local models, cluster global models, and semi-global models. By introducing a distributed database into the FL system, the entire system could be made more scalable, resilient, and secure together with some failover mechanism.
For example, each cluster aggregator stores the cluster model in a distributed database. The cluster aggregators can retrieve cluster models of other aggregators by pulling the models periodically from the databases. At each cluster aggregator, a semi-global ML model is generated by synthesizing the pulled models.
Figure 3.13 illustrates the overall architecture of the decentralized horizontal design of a multi-aggregator FL system:
Figure 3.13 – Architecture of a decentralized FL system with multiple aggregators (horizontal design)
Now that we have discussed how to enhance the FL system with a horizontal design using the semi-global model concept, next, we will look at distributed database frameworks to further ensure scalability and resiliency.
Furthermore, the accountability of model updates can be provided by storing historical model data in a distributed database. The InterPlanetary File System (IPFS) and blockchains are well-known distributed data stores that can ensure the accountability of global model updates. After a cluster aggregator generates a semi-global model based on other cluster models, the semi-global model is stored in a distributed database. The distributed database manages the information of those models with a unique identifier. To maintain all the models consistently, including local, cluster, and semi-global models, each ML model is assigned a globally unique identifier, such as a hash value, which could be realized using the concept of a Chord Distributed Hash Table (Chord DHT). The Chord DHT is a scalable peer-to-peer lookup protocol for internet applications.
The cluster aggregator can store metadata on the cluster models, such as timestamps and hash identifiers. This gives us further accountability for model synthesis by ensuring the cluster models have not been altered. It is also possible to identify a set of aggregators that are sending harmful cluster models that could corrupt the semi-global models, once the malicious models are detectable. These models can be filtered by analyzing the patterns of the weights of the cluster model, or its deviation from the other cluster models when the difference is too large to be trusted.
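One way to derive such a tamper-evident identifier is to hash the serialized model together with its metadata. SHA-256 is used here as a stand-in for whatever hash function a Chord DHT deployment would use; the payload layout is an assumption for illustration:

```python
import hashlib
import json

# A sketch of assigning a globally unique, content-derived identifier to a
# model by hashing its serialized parameters plus metadata. SHA-256 stands
# in for the hash function a Chord DHT deployment would use.
def model_identifier(weights, metadata):
    """Return a hex ID derived from a (weights, metadata) pair."""
    payload = json.dumps({"weights": weights, "meta": metadata}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

mid1 = model_identifier([0.1, 0.2], {"round": 3, "aggregator": "agg-1"})
mid2 = model_identifier([0.1, 0.2001], {"round": 3, "aggregator": "agg-1"})
# Any alteration of the weights yields a different ID, so tampering is detectable.
```

Because the identifier is derived from the content itself, any aggregator can recompute it to verify that a stored cluster model has not been altered.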
The distributed database stores all the volatile state information of the distributed FL system, so the FL system can be restored from the distributed database in the case of failure. The cluster aggregators also exchange their cluster models at a certain interval defined by the system operator. Therefore, the mapping table between cluster models and aggregators needs to be logged in the database, together with meta-information on the local, cluster, and semi-global models, such as the generation time of those models and the size of the training samples.
Distributed agents can broadcast participation messages to connectable aggregators when they want to join their FL process. The participation messages can contain the unique ID of the agent. One of the cluster aggregators then returns a cluster aggregator ID, potentially the value generated based on a common hash function, to which the agent should belong. Figure 3.14 depicts how the agent is assigned to a certain cluster aggregator using a hash function:
Figure 3.14 – The sequence of an agent joining one of the cluster aggregators in an FL system
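The hash-based assignment in Figure 3.14 can be sketched as follows. The modulo mapping over a sorted aggregator list is one illustrative scheme, not the only possible one (a full Chord DHT would place IDs on a ring instead):

```python
import hashlib

# A sketch of assigning an agent to a cluster aggregator with a common
# hash function. The modulo mapping is an illustrative choice; a Chord DHT
# would place identifiers on a ring instead.
def assign_aggregator(agent_id, aggregator_ids):
    """Deterministically map an agent ID to one of the aggregator IDs."""
    digest = hashlib.sha256(agent_id.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(aggregator_ids)
    return sorted(aggregator_ids)[index]  # sort so every node agrees on the order

aggregators = ["agg-1", "agg-2", "agg-3"]
chosen = assign_aggregator("agent-42", aggregators)
```

Since every cluster aggregator applies the same hash function to the agent ID in the broadcast participation message, any of them can answer with the same assignment.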
In the following section, we will look into how the semi-global model is generated based on aggregating the multiple cluster global models.
After the agent is assigned to a specific cluster aggregator, the agent starts to participate in the FL process. The agent requests the base ML model if one is registered – otherwise, it needs to upload the base model to start local training. The procedure of uploading local models and generating cluster and semi-global models will continue until the agent or aggregator is disconnected from the system. The sequence of the local and cluster model upload process, aggregation process, and semi-global model synthesis and pulling is illustrated in Figure 3.15:
Figure 3.15 – The sequence of the semi-global model synthesis processes from uploading local models to pulling semi-global models
Let’s look at semi-global model synthesis using the flowchart between the agent, aggregator, and distributed database.
The aggregator receives a local model from an agent. On receiving the local model, a model filtering process decides whether to accept the uploaded model or not. This filtering can be implemented using many different methods, such as a basic scheme of checking the difference between the weights of the global and local models. If the model is not valid, it is simply discarded.
Then, a cluster model is created by aggregating all the accepted local models. The aggregator stores the cluster model in a database, as well as simultaneously retrieving the cluster models generated by other cluster aggregators. A semi-global model is then synthesized from those cluster models and will be used in the agents that are assigned to the cluster aggregator.
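The two steps above, filtering an uploaded local model and synthesizing a semi-global model from pulled cluster models, can be sketched as follows. The weight-distance threshold and plain averaging are illustrative choices, not the book's prescribed methods:

```python
import numpy as np

# Sketches of (1) basic model filtering by weight distance from the current
# global model and (2) semi-global model synthesis by averaging cluster
# models pulled from the distributed database. Threshold and averaging
# scheme are illustrative choices.
def accept_local_model(local_w, global_w, max_distance=10.0):
    """Reject local models whose weights drift too far from the global model."""
    diff = np.asarray(local_w) - np.asarray(global_w)
    return float(np.linalg.norm(diff)) <= max_distance

def synthesize_semi_global(own_cluster_model, pulled_cluster_models):
    """Average our own cluster model with cluster models from other aggregators."""
    all_models = [np.asarray(own_cluster_model)]
    all_models += [np.asarray(m) for m in pulled_cluster_models]
    return np.mean(all_models, axis=0)

semi = synthesize_semi_global([1.0, 1.0], [[3.0, 3.0]])
# -> array([2., 2.])
```

In practice, the filtering threshold would need tuning per application, since legitimate local models on non-IID data can also drift substantially from the global model.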
Figure 3.16 shows how the cluster aggregator proceeds with cluster and semi-global model synthesis using a distributed database:
Figure 3.16 – The procedure and flow of semi-global model synthesis
An aggregator does not need to retrieve all the cluster models generated at each round to create a semi-global model. Even when each aggregator synthesizes its semi-global model from a randomly selected subset of cluster models, the global model can eventually converge. Using this approach, the robustness and independence of the aggregators are enhanced by relaxing the conditions for creating the global model at every update. This framework can also resolve the computation and communication bottlenecks typical of centralized FL systems.
In this chapter, we discussed the potential architecture, procedure flow, and message sequences within an FL system. The typical FL system architecture consists of an aggregator, agents, and a database server. These three components are constantly communicating with each other to exchange system information and ML models to achieve model aggregation.
The key to implementing a good FL system is decoupling the critical components and carefully designing the interfaces between them. We focused on the aspect of the simplicity of its design so that further enhancement can be achieved by just adding additional components to the systems. Horizontal decentralized design can also help implement a scalable FL system.
In the following chapter, we will discuss the implementation details of achieving FL on the server side. As some critical aspects of the functionalities have been introduced in this chapter, you will be able to implement the basic system and smoothly run the simulation with some ML applications.
Some of the concepts discussed in this chapter can be explored further by reading the following papers: