CHAPTER 9

Federated Reinforcement Learning

In this chapter, we introduce federated reinforcement learning (FRL), covering the basics of reinforcement learning, distributed reinforcement learning, horizontal FRL and vertical FRL, as well as application examples of FRL.

9.1    INTRODUCTION TO REINFORCEMENT LEARNING

Reinforcement Learning (RL) is a branch of machine learning (ML) that deals mainly with sequential decision-making [Sutton and Barto, 1998]. An RL problem usually consists of a dynamic environment and an agent (or agents) that interacts with the environment. Once the agent selects an action based on the current state, the environment evolves to a new state and presents a reward that evaluates the performance of the agent. The agent seeks to achieve a goal in the environment by making sequential decisions. Traditional RL problems can be formulated as Markov Decision Processes (MDPs), in which the agent has to solve a sequential decision-making problem to maximize a value function, i.e., the expected sum of discounted rewards.

As shown in Figure 9.1, the agent observes the environment state and then selects an action based on that state. The agent receives a reward from the environment based on the selected action. In MDPs, the next state of the environment depends only on the current state and the action selected by the agent (the Markov property). The action with the “highest expected reward” usually refers to the action that puts the agent in the state with the highest potential to gain rewards in the future. The agent thus moves in state-action-reward-state (SARS) cycles; a minimal code sketch of this interaction loop follows Figure 9.1.


Figure 9.1: The state-action-reward-state cycle.
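
To make this cycle concrete, the following is a minimal sketch of the interaction loop in Python; the toy transition and reward functions, as well as the state and action counts, are placeholders of our own and do not correspond to any particular application.

```python
import random

# A toy MDP: 3 states, 2 actions; transitions and rewards are placeholders.
NUM_STATES, NUM_ACTIONS = 3, 2

def env_step(state, action):
    """Environment evolves: returns (next_state, reward) given (state, action)."""
    next_state = (state + action) % NUM_STATES      # placeholder dynamics
    reward = 1.0 if next_state == 0 else 0.0        # placeholder reward signal
    return next_state, reward

def agent_act(state):
    """A (random) policy: maps the observed state to an action."""
    return random.randrange(NUM_ACTIONS)

state = 0
for t in range(5):                                   # state-action-reward-state cycle
    action = agent_act(state)                        # agent selects an action
    next_state, reward = env_step(state, action)     # environment returns reward and next state
    print(f"t={t}: s={state}, a={action}, r={reward}, s'={next_state}")
    state = next_state
```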

The difficulties of this problem lie in the following issues.

•  An agent has limited knowledge about the optimal actions for a given state of the environment. Even a single round of decision-making in an RL problem with a continuous action space requires the agent to solve an optimization problem over a continuous space, which may demand a huge computational effort.

•  The agent’s actions can affect the future states of the environment, thereby affecting the options and opportunities available to the agent later on. Therefore, in sequential decision-making problems, the agent should not greedily choose actions even if those actions yield good rewards in the short term. That is, the agent has to trade off the current reward against future reward expectations.

•  Selecting optimal actions requires taking into account indirect, delayed consequences of the actions, and thus may require foresight or planning.

Other than the agent and the environment, one can identify four key sub-elements of an RL system: policy, reward signal, value function, and optionally, a model of the environment.

9.1.1    POLICY

A policy defines how the agent chooses an action in a given state. Roughly speaking, it is a mapping function (or conditional distribution) that takes the current state of the environment as input and returns an optimal (or sub-optimal) action. A policy is the core of an RL agent, as it determines the mapping from environment states to agent behavior; this mapping can be deterministic or stochastic.
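
As a small illustration, a deterministic policy can be written as a direct lookup from state to action, while a stochastic policy defines a conditional distribution over actions and samples from it; the state names and probabilities below are purely hypothetical.

```python
import random

# Deterministic policy: one fixed action per state (hypothetical mapping).
deterministic_policy = {"low_pressure": "increase_fuel", "high_pressure": "reduce_fuel"}

# Stochastic policy: a conditional distribution over actions given the state.
stochastic_policy = {
    "low_pressure": {"increase_fuel": 0.8, "reduce_fuel": 0.2},
    "high_pressure": {"increase_fuel": 0.1, "reduce_fuel": 0.9},
}

def act(state, stochastic=True):
    if not stochastic:
        return deterministic_policy[state]
    actions, probs = zip(*stochastic_policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(act("low_pressure"))
```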

9.1.2    REWARD

A reward defines the immediate feedback from the environment to the agent in an RL problem. At each time step, after the agent acts according to its current policy, the environment presents a reward that reflects the performance of the agent’s action. The agent’s sole objective is to maximize the total reward over the long run. The reward signal is the primary basis for adjusting the policy.

9.1.3    VALUE FUNCTION

Whereas the reward signal tells the agent what is good in an immediate sense, a value function predicts the expected future rewards the agent may accumulate starting from the current state. Roughly speaking, the value function is a way to measure the performance of an action under a given state, and the purpose of estimating values is to accumulate higher rewards. In traditional value-based RL methods, an action should be chosen based on the highest value rather than the highest reward, since the value function evaluates the reward expectations in the long run. Rewards are given directly by the environment, but values must be estimated and re-estimated from the sequence of observations an agent makes over its entire lifetime.
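
Concretely, the value of a state is the expected discounted return starting from that state. The helper below, a sketch assuming a discount factor of 0.9, computes the discounted return of a single observed reward sequence; Monte Carlo methods estimate the value function by averaging this quantity over many episodes.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of discounted rewards G = r_1 + gamma*r_2 + gamma^2*r_3 + ..."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: one short episode of rewards; the value of the start state is the
# expectation of this quantity over many such episodes.
print(discounted_return([0.0, 0.0, 1.0]))   # 0.81 with gamma = 0.9
```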

9.1.4    MODEL OF THE ENVIRONMENT

For some reinforcement learning systems, there can be a model of the environment. The model of the environment is a virtual model that mimics the behavior of the environment. For example, given a state-action pair, the model might predict the resultant next state and next reward; with these predictions, possible future situations can be considered before they actually happen. Methods for solving reinforcement learning problems that use models are called model-based methods. However, many algorithms assume that such a model is not available and estimate the policy and value function by trial and error. These methods are known as model-free methods.

9.1.5    RL BACKGROUND EXAMPLE

In the following, we present a detailed background example for RL: the optimal control of coal-fired boilers in power plants. Coal-fired boiler systems transform the energy in coal first into steam heat and then into electricity, and they are highly dynamic in nature. Stochastic factors may come from unpredictable changes in demand, equipment conditions, the calorific value of coal, etc. Figure 9.2 presents a basic framework for applying RL methods to the optimal control of coal-fired boiler systems.


Figure 9.2: RL framework for coal-fired boiler control.

As can be seen from Figure 9.2, in order to train an RL agent for the optimal control of a coal-fired boiler system, the following interactions have to be conducted (a minimal code sketch follows the list).

1.  RL agent gets observations. The RL agent can obtain the observations of the coal-fired boiler, including boiler temperatures, flue gas oxygen content, steam pressure, etc.

2.  RL agent takes actions. Then, based on the learned knowledge of the RL agent, an action is presented to the control system of the coal-fired boiler. The action includes the speed of the coal conveying belt, primary air volume, secondary airflow, etc.

3.  Coal-fired boiler evolves. Finally, the coal-fired boiler accepts the agent’s actions and evolves into another condition.
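
These three interactions can be sketched as a simple control loop. The sensor and actuator names below mirror the quantities mentioned above, but the values and the placeholder policy are hypothetical and only illustrate the data flow.

```python
def read_boiler_sensors():
    # Step 1: observations of the boiler (hypothetical values).
    return {"boiler_temperature": 540.0, "flue_gas_oxygen": 3.2, "steam_pressure": 16.8}

def rl_agent_policy(observation):
    # Step 2: the agent maps observations to control actions (placeholder policy).
    return {"coal_belt_speed": 0.6, "primary_air_volume": 0.4, "secondary_airflow": 0.5}

def apply_to_boiler(action):
    # Step 3: the boiler accepts the actions and evolves; a real system would
    # return the new condition and a reward (e.g., burning efficiency) here.
    print("applying control action:", action)

obs = read_boiler_sensors()
act = rl_agent_policy(obs)
apply_to_boiler(act)
```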

9.2    REINFORCEMENT LEARNING ALGORITHMS

RL algorithms can be categorized according to the following key elements.

Model-based vs. Model-free: Model-based methods attempt to build a virtual model of the environment first and then act according to the best policy derived from that model. Model-free methods assume that a model of the environment cannot be built and estimate the policy and value function by trial and error.

Value based vs. Policy based: Value-based methods attempt to learn a value function and infer an optimal policy from it. Policy-based methods directly search in the space of the policy parameters to find an optimal policy.

Monte Carlo Update vs. Temporal Difference (TD) Update: Monte Carlo updates evaluate a policy using the reward accumulated over an entire episode. They are straightforward to implement, but they require a large number of episodes to converge and suffer from high variance when estimating the value function. Instead of using the total accumulated reward, TD updates compute a temporal-difference error, i.e., the difference between the new estimate and the old estimate of the value function, to update the policy. This kind of update needs only the most recent transitions and reduces the variance. However, it introduces bias, as the global view of the whole episode is not considered.

On-policy vs. Off-policy: On-policy methods use the current policy to generate actions and update the current policy itself accordingly. Off-policy methods use a different exploratory policy to generate actions and the target policy is updated based on these actions.

Table 9.1 is a summary of some popular RL algorithms and their categories. Two TD algorithms that have been widely used to solve RL problems are State-Action-Reward-State-Action (SARSA) and Q-Learning.

SARSA is an on-policy TD algorithm [Rummery and Niranjan, 1994]. It is on-policy since it follows the same policy to select the next action used in the update. It learns an action-value function instead of a state-value function. The policy evaluation step uses the temporal-difference error of the action-value function, analogous to that of the value function.

Q-Learning is an off-policy TD algorithm [Watkins and Dayan, 1992]. It is off-policy since its update target selects the next action greedily rather than following the behavior policy. The Q-function is updated using a target policy that is directly greedy with respect to the current Q-function.
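
The two algorithms share the same tabular structure and differ only in the TD target: SARSA bootstraps from the action actually chosen by the current policy, whereas Q-Learning bootstraps from the greedy action. A minimal sketch of the two update rules is given below; the learning rate, discount factor, and table sizes are assumed values.

```python
import numpy as np

num_states, num_actions = 5, 2
alpha, gamma = 0.1, 0.95          # learning rate and discount factor (assumed)
Q = np.zeros((num_states, num_actions))

def sarsa_update(s, a, r, s_next, a_next):
    # On-policy TD target: uses the next action chosen by the same policy.
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

def q_learning_update(s, a, r, s_next):
    # Off-policy TD target: uses the greedy action in the next state.
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

sarsa_update(0, 1, 1.0, 2, 0)
q_learning_update(0, 1, 1.0, 2)
```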

9.3    DISTRIBUTED REINFORCEMENT LEARNING

RL algorithms can be interpreted as playing a game many times and learning from a huge number of trials. This can be very time-consuming if only one agent is used to explore a huge state-action space. If we have multiple copies of the agent and the environment, the problem can be solved more efficiently in a distributed fashion. The distributed RL paradigm can be either asynchronous or synchronous.

Table 9.1: A summary of popular RL algorithms and their categories


9.3.1    ASYNCHRONOUS DISTRIBUTED REINFORCEMENT LEARNING

In the asynchronous scenario, multiple agents explore their own environments separately, and a global set of parameters is updated asynchronously. This allows a large number of actors to learn collaboratively. However, due to the delays of some agents, this approach may suffer from the stale (old) gradient problem.

Asynchronous Advantage Actor-Critic (A3C) is an algorithm proposed by Google DeepMind in 2016 [Mnih et al., 2016]. It creates up to 16 (or 32) copies of the agent and the environment on a single CPU when learning the Atari benchmark games. As the algorithm is highly parallelized, it is able to learn many of the Atari benchmark games very quickly on inexpensive CPU hardware.

General Reinforcement Learning Architecture (Gorila) [Nair et al., 2015] is another asynchronous framework for large-scale distributed reinforcement learning. Multiple agents are created and separated into different roles, including actors and learners. Actors only generate experience by acting in the environment. The collected experience is stored in a shared replay memory. Learners only train by sampling from the replay memory.
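
The actor/learner separation in Gorila can be sketched with two concurrent roles sharing a replay memory. The single-machine sketch below uses Python threads and placeholder transitions purely to illustrate the data flow (actors write experience, learners sample from it); it omits the parameter server and the actual network training.

```python
import random
import threading
import time
from collections import deque

replay_memory = deque(maxlen=10_000)   # shared replay memory
lock = threading.Lock()

def actor(num_steps=200):
    """Actor: generates experience by acting in the environment (placeholder transitions)."""
    for _ in range(num_steps):
        transition = (random.random(), random.randrange(2), random.random())  # (state, action, reward)
        with lock:
            replay_memory.append(transition)
        time.sleep(0.001)

def learner(num_updates=20, batch_size=32):
    """Learner: trains only by sampling from the shared replay memory."""
    for _ in range(num_updates):
        with lock:
            sample_size = min(batch_size, len(replay_memory))
            batch = random.sample(list(replay_memory), sample_size)
        # ... compute gradients from `batch` and push them to a parameter server ...
        time.sleep(0.01)

threads = [threading.Thread(target=actor), threading.Thread(target=learner)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```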

9.3.2    SYNCHRONOUS DISTRIBUTED REINFORCEMENT LEARNING

Synchronous Stochastic Optimization (Sync-Opt) [Chen et al., 2017] addresses the problem of slow, straggling agents that hold up synchronous RL training. To avoid waiting for these agents, each training round only waits for a preset number of agents to return, and the slowest few agents are dropped.

Advantage Actor-Critic (A2C) [Clemente et al., 2017] is a synchronous variant of the well-known A3C. It works in the same way, except that it synchronizes all agents between update rounds. OpenAI reports in a blog post that synchronous A2C outperforms asynchronous A3C [Mnih et al., 2016].
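
The straggler-dropping idea of Sync-Opt can be sketched as follows: in each synchronous round, the coordinator launches all workers but aggregates only the first K results to return, dropping the slowest agents. The worker count, gradient size, and simulated delays below are assumptions, and the per-worker "gradient" is a stand-in for a real rollout and backward pass.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

NUM_WORKERS, K, GRAD_DIM = 8, 6, 4     # wait for only K of the NUM_WORKERS agents (assumed values)

def worker_gradient(worker_id):
    """One agent's rollout and gradient computation (placeholder)."""
    time.sleep(random.uniform(0.01, 0.2))                 # stragglers take longer
    return [random.gauss(0.0, 1.0) for _ in range(GRAD_DIM)]

with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
    futures = [pool.submit(worker_gradient, i) for i in range(NUM_WORKERS)]
    # Synchronous round: aggregate the first K results and drop the slowest agents.
    first_k = [f.result() for f, _ in zip(as_completed(futures), range(K))]

avg_gradient = [sum(g[i] for g in first_k) / K for i in range(GRAD_DIM)]
print(avg_gradient)     # the synchronous update would use this averaged gradient
```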

9.4    FEDERATED REINFORCEMENT LEARNING

Deep reinforcement learning (DRL) faces many technical and non-technical issues in real-world implementations. One of the most critical is how to prevent information leakage and preserve agent privacy during training. This concern leads to privacy-preserving versions of RL, namely Federated Reinforcement Learning (FRL). Here we categorize FRL research into Horizontal Federated Reinforcement Learning (HFRL) and Vertical Federated Reinforcement Learning (VFRL).

9.4.1    BACKGROUND AND CATEGORIZATION

This section illustrates the background of HFRL and VFRL. We focus on presenting the detailed background, problem settings, and possible frameworks for both HFRL and VFRL using the real-life industrial background example presented in Section 9.1.5.

FRL Background

In RL research, a commonly studied problem is the design of feedback controllers for discrete- and continuous-time dynamical systems that combine features of adaptive control and optimal control. Such feedback control problems include self-driving systems, autonomous helicopter control, the optimal control of industrial systems, etc.

Note that Zhuo et al. [2019] presented several real-life FRL examples.

1.  In manufacturing, factories produce different components. Their decision policies are private and will not be shared with each other. On the other hand, building high-quality individual decision policies is often difficult due to limited business and a lack of reward signals (for some agents). It is thus helpful for factories to learn decision policies cooperatively under the condition that private data are not given away.

2.  Another example is building medical treatment policies for hospital patients. Patients may be treated in some hospitals without providing feedback about the treatments (i.e., no reward signal for these treatment decision policies). In addition, data records about patients are private and may not be shared among hospitals. It is thus necessary to learn treatment policies for hospitals through Federated DRL.

In the following subsections, for the sake of consistency, we explain the detailed background, problem settings, and frameworks of HFRL and VFRL based on the coal-fired boiler example.

Horizontal Federated Reinforcement Learning

Parallel RL [Kretchmar, 2002, Grounds and Kudenko, 2005] has long been studied in the RL research community. In parallel RL, multiple agents are assumed to perform the same task (with the same rewards for states and actions), possibly in different environments. Note that most parallel RL settings rely on transferring agents’ experience or gradients, and such operations clearly cannot be conducted when privacy must be preserved. HFRL therefore adopts the basic settings of parallel RL with the privacy-preserving objective as an extra constraint (for both the server and the agents). A basic framework for conducting HFRL is presented in Figure 9.3.


Figure 9.3: Example architecture of HFRL framework.

As can be seen in Figure 9.3, HFRL contains multiple parallel RL agents (we show two agents for brevity) for different coal-fired boiler systems, which may be geographically distributed. The RL agents share the same task of optimally controlling their corresponding coal-fired boiler systems. A federated server takes the role of aggregating the models from the different RL agents. The basic steps for conducting HFRL are as follows (a minimal code sketch of this loop follows the list).

•  Step 1: All participant RL agents train their own RL models according to Figure 9.2 locally and independently, without exchanging any experience data, parameter gradients, or losses.

•  Step 2: RL agents send their masked model parameters to the server.

•  Step 3: The federated server decrypts the masked models from the different RL agents and applies an aggregation method to obtain a federated model.

•  Step 4: Federated server sends the federated model to the RL agents.

•  Step 5: RL agents update the local model.
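
A minimal sketch of Steps 1-5 is given below, with plain parameter averaging standing in for the masking and secure aggregation described above; the local training step and model sizes are placeholders.

```python
import numpy as np

NUM_AGENTS, MODEL_SIZE, NUM_ROUNDS = 2, 10, 3              # assumed sizes
local_models = [np.zeros(MODEL_SIZE) for _ in range(NUM_AGENTS)]

def local_rl_training(params):
    """Step 1: local, independent RL training (placeholder parameter update)."""
    return params + np.random.normal(0.0, 0.1, size=params.shape)

for _ in range(NUM_ROUNDS):
    # Step 2: agents upload their (in practice, masked) model parameters.
    uploaded = [local_rl_training(m) for m in local_models]
    # Step 3: the federated server aggregates the models; plain averaging stands in
    # for the secure aggregation over masked parameters described in the text.
    federated_model = np.mean(uploaded, axis=0)
    # Steps 4-5: the server sends the federated model back and each agent updates locally.
    local_models = [federated_model.copy() for _ in range(NUM_AGENTS)]

print(local_models[0])
```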

In the literature, researchers have started to pay attention to HFRL. Liu et al. [2019] proposed Lifelong Federated Reinforcement Learning (LFRL) for autonomous navigation, where the main task is to let robots share their experience so that they can effectively use prior knowledge and quickly adapt to changes in the environment. The main idea of this work can be summarized in three steps.

1.  Independent learning. Each robot executes its own navigation task in its own environment. Note that the environments can be different, related, or unrelated. The basic idea is to conduct lifelong learning locally so that each robot learns to avoid diverse types of obstacles.

2.  Knowledge fusion. The knowledge and skills extracted by the robots from defined or undefined environments are then merged by a knowledge fusion process, which produces a final model.

3.  Agent network update. The parameters of the agents’ networks are updated regularly. Thus, the knowledge gained by different agents can be shared through these parameters.

Ren et al. [2019] presented a framework that deploys multiple DRL agents on edge nodes to guide the decisions of IoT devices. In order to aggregate knowledge better while reducing the transmission costs between the IoT devices and edge nodes, the authors employed federated learning to train the DRL agents in a distributed fashion. They conducted extensive experiments to demonstrate the effectiveness of the proposed HFRL scheme for distributed IoT devices.

Nadiger et al. [2019] thoroughly described an overall architecture for HFRL, which comprises the grouping policy, the learning policy, and the federation policy for participant RL agents. The authors further demonstrated the effectiveness of the proposed architecture on the Atari game Pong, showing a median improvement of 17% in personalization time.

Although the privacy-preserving objective may present more challenges, we can benefit from HFRL in the following ways.

•  To avoid non-i.i.d. samples. A single agent commonly encounters non-i.i.d. samples during the learning process. One of the main reasons is that, in a single-agent setting, newly gained experience can be strongly correlated with previous experience, which breaks the i.i.d. data assumption. By aggregating experience from many agents, HFRL helps to build a more accurate and stable reinforcement learning system.

•  To enhance sample efficiency. Another drawback of conventional RL methods is their poor ability to quickly build stable and accurate models from limited samples (known as the low sample efficiency problem), which prevents conventional RL methods from being applied in many real-world applications. Under HFRL, we can aggregate the knowledge extracted by different agents from non-identical environments to address the low sample efficiency problem.

•  To accelerate the learning process. This benefit follows from the above two advantages as a by-product. With the FL framework aggregating the knowledge learned by different agents, experience drawn from more diverse samples can accelerate RL training and achieve better results.

Vertical Federated Reinforcement Learning

Recalling the optimal control problem of coal-fired boiler systems, the working condition of a boiler obviously depends not only on controllable factors, but also on unobtainable (or unpredictable) factors. For example, meteorological conditions may greatly affect the burning efficiency and steam output of coal-fired boilers. In order to train a more reasonable and robust RL agent, it is natural to extract knowledge from meteorological data. Unfortunately, professional equipment for real-time and accurate measurement of local meteorological data may not be affordable for small power plants. Moreover, the owner of the power plant may not be interested in the raw meteorological data itself, but in the value extracted from it. Therefore, in order to train a more robust RL agent, it is natural for the owner of the power plant to cooperate with the meteorological data management department. In return, the meteorological data management department gets paid without directly revealing any real-time meteorological data. This cooperative framework falls into the category of VFRL.

In VFRL, different RL agents maintain non-identical observations of the same environment. Each RL agent maintains a corresponding action policy (some agents may have no action policy). The main goal of the cooperative framework is to train a more effective RL agent with the mixed knowledge extracted from the observations of the different cooperative agents. During the training and inference processes, any direct transfer of raw data is forbidden. The following presents a possible framework for VFRL, which we call Federated DQN.

As can be seen from Figure 9.4, we call the RL agent that obtains the reward from the environment the Q-network agent (agent A in Figure 9.4), and all other agents cooperative RL agents. The procedure is as follows (a code sketch of these steps follows Figure 9.4):

•  Step 1. All participant RL agents take actions according to their current environment observations and the knowledge extracted so far. Note that some agents may take no action and only maintain their own observations of the environment.

•  Step 2. RL agents obtain the corresponding feedback from the environment, including the new environment observations, the reward, etc.

•  Step 3. The cooperative RL agents compute mid-products by feeding the obtained observations into their own neural networks, and then send the masked mid-products to the Q-network agent.

•  Step 4. The Q-network agent decrypts all the mid-products and trains the Q-network with the current losses through back-propagation.

•  Step 5. The Q-network agent sends back the masked weight gradients to the cooperative agents.

•  Step 6. Each cooperative agent decrypts the gradients and updates its own network.


Figure 9.4: Federated-DQN framework.
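
The exchange of mid-products and gradients in Steps 3-6 can be sketched as a split network: cooperative agent B computes an embedding of its own observation and sends it to the Q-network agent A, which trains on the combined features and returns the gradient with respect to B's mid-product so that B can update its own network locally. The PyTorch sketch below omits the encryption/masking steps, uses a placeholder TD target, and assumes arbitrary layer sizes; it is our illustration of the data flow, not the exact architecture of Zhuo et al. [2019].

```python
import torch
import torch.nn as nn

obs_dim_a, obs_dim_b, hidden, num_actions = 4, 3, 8, 2    # assumed dimensions
net_a = nn.Sequential(nn.Linear(obs_dim_a + hidden, 16), nn.ReLU(), nn.Linear(16, num_actions))
net_b = nn.Linear(obs_dim_b, hidden)                      # cooperative agent's network
opt_a = torch.optim.Adam(net_a.parameters(), lr=1e-3)
opt_b = torch.optim.Adam(net_b.parameters(), lr=1e-3)

obs_a, obs_b = torch.randn(1, obs_dim_a), torch.randn(1, obs_dim_b)

# Step 3: agent B computes its mid-product and "sends" a detached copy to agent A.
h_b = net_b(obs_b)
h_b_sent = h_b.detach().requires_grad_(True)

# Step 4: agent A trains the Q-network on its own observation plus B's mid-product.
q_values = net_a(torch.cat([obs_a, h_b_sent], dim=1))
td_target = torch.tensor([[1.0]])                         # placeholder for r + gamma * max Q(next)
loss = (q_values[:, :1] - td_target).pow(2).mean()
opt_a.zero_grad()
loss.backward()                                           # also fills h_b_sent.grad
opt_a.step()

# Step 5: agent A sends back the gradient w.r.t. the mid-product.
grad_for_b = h_b_sent.grad

# Step 6: agent B backpropagates that gradient through its own network and updates.
opt_b.zero_grad()
h_b.backward(grad_for_b)
opt_b.step()
```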

In the literature, an existing work falling into the VFRL category is Zhuo et al. [2019], which investigates cooperative multi-agent RL under privacy-preserving requirements on agent data, gradients, and models. The FRL framework studied there corresponds to the VFRL architecture presented above (which we hereafter refer to as VFRL). The authors also presented detailed real-life systems where VFRL is meaningful.

Modeling and exploiting the behaviors in systems with multiple cooperative or adversarial agents has long been an interesting challenge for the RL community [Mao et al., 2019, Foerster et al., 2016]. In the FL setting, agents may perform heterogeneous tasks, with different states, actions, or rewards (some may have no rewards or actions). The main goal of each VFRL agent is to construct a stable and accurate RL model cooperatively or competitively without direct exchange of experience (including states, actions, and rewards) or the corresponding gradients. Compared with multi-agent RL, the advantages of VFRL can be summarized as follows.

•  To avoid agent and user information leakage. In the coal-fired boiler example, a straightforward benefit for the meteorological data management department is that it can help enhance production efficiency without any leakage of raw real-time meteorological data. This can be cast as a service that can be offered to all potential external users.

•  To enhance RL performance. With proper knowledge extraction methods adopted, we can train a more reasonable and robust RL agent to enhance efficiency. VFRL is advantageous in the sense that it can enable a learning system to leverage such information while preserving privacy.

9.5    CHALLENGES AND OUTLOOK

As an emerging framework for preserving the privacy of different parties and preventing information leakage during training and inference, federated learning has attracted increasing research attention in the past few years. The following outlines the challenges and research directions for FRL.

•  New privacy-preserving paradigms. Note that the above-cited FRL work relies either on exchanging parameters or on adding Gaussian noise, both of which are fragile when faced with adversarial (deceiving) agents or even attackers. More reliable paradigms can be integrated into FRL, such as Differential Privacy, Secure Multi-Party Computation, Homomorphic Encryption, etc.

•  Transfer FRL. Although we did not devote a separate category to transfer FRL, its importance makes it a meaningful research direction. In conventional RL research, transferring experience, knowledge, parameters, or gradients from already-learned tasks to new ones constitutes a research frontier. It is generally accepted in the RL community that learning from prior knowledge is even more challenging than merely learning from samples.

•  FRL Mechanisms. From the above-cited work, it is clear that all existing FRL research falls into the category of deep reinforcement learning. Considering the constraints imposed by FL, developing new RL mechanisms (with traditional or deep learning methods) would be of great value and constitutes another challenging frontier.
