© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
M. Paluszek et al., Practical MATLAB Deep Learning, https://doi.org/10.1007/978-1-4842-7912-0_15

15. Reinforcement Learning

Michael Paluszek1, Stephanie Thomas2, and Eric Ham2
(1)
Plainsboro, NJ, USA
(2)
Princeton, NJ, USA
 

15.1 Introduction

Reinforcement learning is a machine learning approach in which an intelligent agent learns to take actions to maximize a reward. We will apply this to the design of a Titan landing control system. Reinforcement learning is a tool to approximate solutions that could have been obtained by dynamic programming, but whose exact solutions are computationally intractable [3].

We’ll start the chapter by modeling the point-mass dynamics for the vehicle. We will then build a simulation. We’ll test the simulation with the vehicle starting in orbit around Titan, a moon of Saturn, and then with the vehicle in level flight. We will then use optimization to find a trajectory. Finally, we’ll design a reinforcement learning algorithm to find a trajectory.

15.2 Titan Lander

The problem for which we will use reinforcement learning is a powered landing on Titan. Titan is the second-largest moon in the solar system and is larger than the planet Mercury. Titan has a thick atmosphere, which makes it feasible to fly an airplane using aerodynamic forces for lift. Figure 15.1 shows an image of Titan and its atmospheric density.

We will use a model similar to that in Chapter 9, terrain-based navigation, except that we will simplify it to the planar case. In addition, we will assume that the mass of the vehicle does not vary. Our propulsion system uses a nuclear fusion–powered turboramjet that consumes very little fuel; it heats the incoming atmosphere with the energy from the fusion reaction. The point-mass vehicle equations of motion differ from Chapter 9 in that we treat the moon as spherical and let gravity vary in strength as the vehicle descends. The equations are written in cylindrical coordinates:
$$\dot{r} = u_r$$
(15.1)
$$\dot{u}_r = \frac{v^2}{r} - \frac{\mu}{r^2} + a_r$$
(15.2)
$$\dot{u}_t = -\frac{u_t u_r}{r} + a_t$$
(15.3)
where r is the radial position, $$u_r$$ is the radial velocity, $$u_t$$ is the tangential velocity, μ is the gravitational parameter of Titan, and $$a_r$$ and $$a_t$$ are the accelerations in those directions. We don’t integrate the tangential position or angle since our problem only involves landing. If you wanted to land at a particular spot, after planning your trajectory, you would just pick an orbital position to begin the descent so that you end at the right spot.
In the diagram in Figure 15.2, the velocity magnitude v is
$$v = \sqrt{u_r^2 + u_t^2}$$
(15.4)
Figure 15.1

The composite infrared image of Titan is courtesy of NASA.

Figure 15.2

Aircraft model showing lift, drag, and gravity.

The flight path angle γ is not used in this formulation. The gravitational acceleration, which appears in Equation 15.2, is
$$g = \frac{\mu}{r^2}$$
(15.5)
We use a very simple aerodynamic model. The lift coefficient $$c_L$$ is defined as
$$c_L = c_{L_\alpha}\alpha$$
(15.6)
The lift coefficient is in reality a nonlinear function of the angle of attack α. There is a maximum angle of attack above which the wing stalls and lift is lost. The coefficient also varies with flight speed. For our simple model, we will assume a flat plate with $$c_{L_\alpha} = 2\pi$$. The drag coefficient is
$$c_D = c_{D_0} + k c_L^2$$
(15.7)
where k is
$$k = \frac{1}{\pi A_R \epsilon}$$
(15.8)
$$A_R$$ is the aspect ratio, and ε is the Oswald efficiency factor, which is typically between 0.8 and 0.95; it measures how efficiently lift is coupled to drag. The aspect ratio is the ratio of the wingspan to the chord (the length from the front to the back of the wing). This equation is known as the drag polar. When there is no lift, the drag coefficient is $$c_{D_0}$$; $$c_D$$ varies as the square of the lift coefficient. For hypersonic flow, above Mach 4, the following model, based on Newtonian impact theory, applies:
$$L = \rho S V^2 \sin^2\alpha\cos\alpha$$
$$D = \rho S V^2\left(\sin^3\alpha + \frac{0.075}{R_E^{1/5}\sqrt{M}} + C_{D_\pi}\right)$$
(15.9)
where $$C_{D_\pi}$$ is approximately 0.001. This would need to be blended with the previous model when the Mach number drops below four. For the purposes of this chapter, we will stick to the low-speed model. The lift-to-drag ratio is
$$\frac{c_L}{c_{D_0} + k c_L^2}$$
(15.10)
which is a maximum when
$$c_L = \sqrt{\frac{c_{D_0}}{k}}$$
(15.11)
This is the lift coefficient at which we would like to operate in steady flight since it’s where we get the most lift for the least drag.
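As a quick check, the following short MATLAB calculation evaluates Equations 15.6 through 15.11 for the flat-plate model. The values of $$c_{D_0}$$, the aspect ratio, and the Oswald efficiency factor are illustrative assumptions, not the numbers used later in the chapter.

% Best lift-to-drag condition for the flat-plate model (illustrative values)
cLAlpha = 2*pi;                       % flat-plate lift curve slope (1/rad)
cD0     = 0.006;                      % assumed zero-lift drag coefficient
aR      = 6;                          % assumed aspect ratio
e       = 0.9;                        % assumed Oswald efficiency factor

k      = 1/(pi*aR*e);                 % induced drag factor, Equation 15.8
cLBest = sqrt(cD0/k);                 % best lift coefficient, Equation 15.11
alpha  = cLBest/cLAlpha;              % corresponding angle of attack (rad), Equation 15.6
lOverD = cLBest/(cD0 + k*cLBest^2);   % maximum lift-to-drag ratio, Equation 15.10
fprintf('c_L = %5.3f  alpha = %5.2f deg  L/D = %5.1f\n', cLBest, alpha*180/pi, lOverD)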
The dynamic pressure, the pressure due to the motion of the aircraft, is
$$q = \frac{1}{2}\rho\left(u_r^2 + u_t^2\right)$$
(15.12)
where $$\sqrt{u_r^2 + u_t^2}$$ is the speed and ρ is the atmospheric density. The magnitudes of lift and drag are
$$L = q c_L s$$
(15.13)
$$D = q c_D s$$
(15.14)
where s is the wetted area, the area of the vehicle that contributes to lift and drag. The lift force is normal to the velocity vector, and the drag force is along the velocity vector. The unit velocity vector is
$$\vec{u} = \frac{1}{\sqrt{u_r^2 + u_t^2}}\left[\begin{array}{c}u_r\\u_t\end{array}\right]$$
(15.15)
We use the arrow, $$\vec{\ }$$, to denote a vector, which in MATLAB is a multielement array. The drag vector is
$$\vec{D} = D\vec{u}$$
(15.16)
The lift vector is a 90-degree rotation and is
$$\vec{L} = L\left[\begin{array}{rr}0 & 1\\-1 & 0\end{array}\right]\vec{u}$$
(15.17)
In both cases, the radial component is the top element, and the bottom is the tangential component. The thrust vector is
$$\vec{T} = T\left[\begin{array}{c}\sin\alpha\\\cos\alpha\end{array}\right]$$
(15.18)

We start in a circular orbit. To begin reentry, we increase the angle of attack to increase the drag (remember, we are using a simple aerodynamics model), which causes the ramjet to descend. We then adjust the angle of attack and thrust to reach our desired landing point. In this chapter, we will design one trajectory using optimal control, and then we will use reinforcement learning to design a trajectory. Our objective is to reach the Titan surface with zero velocity.

15.3 Modeling the Titan Atmosphere

15.3.1 Problem

We want to compute the atmospheric density of Titan.

15.3.2 Solution

Use a database of Titan densities and altitudes and interpolate.

15.3.3 How It Works

The function TitanAtmosphere computes the Titan atmospheric density and temperature.
The density is computed through linear interpolation over the tabulated altitudes, so intermediate altitudes return intermediate values. If no outputs are requested, the function produces a double-y plot.
The commands yyaxis right and yyaxis left select which y-axis subsequent plot commands draw against. Typing TitanAtmosphere will result in two plots. The demo plots are shown in Figure 15.3.
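A minimal sketch of the interpolation and the double-y plot is shown below. The tabulated density and temperature profiles here are placeholders with roughly the right character; the actual function uses its own Titan data.

% Sketch of the interpolation and the double-y plot (placeholder profiles)
hTable   = (0:10:1000)';                 % altitude points (km), assumed spacing
rhoTable = 5.4*exp(-hTable/20);          % density (kg/m^3), placeholder shape
tTable   = 94 - 0.05*hTable;             % temperature (K), placeholder shape

rho = interp1( hTable, rhoTable, 123 );  % linear interpolation at 123 km altitude
fprintf('Density at 123 km: %g kg/m^3\n', rho)

figure('Name','TitanAtmosphere')
yyaxis left
plot( hTable, rhoTable ), ylabel('Density (kg/m^3)')
yyaxis right
plot( hTable, tTable ),   ylabel('Temperature (K)')
xlabel('Altitude (km)'), grid on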
Figure 15.3

Titan and its atmospheric density. The speed of sound and temperature curves have the same shape.

15.4 Simulating the Aircraft

15.4.1 Problem

We want to numerically integrate the trajectory given aircraft parameters and the controls.

15.4.2 Solution

Write a function with a built-in integrator to integrate the equations of motion. Write a second function that returns the state derivative.

15.4.3 How It Works

The function RHS2DTitan models the two-dimensional dynamics of the vehicle. It includes gravity and aerodynamics. It is specialized for Titan.
Constants are at the beginning of the function. The next block of code returns the default data structure, which is built by a subfunction. Forces are converted to kN since the units of the states are km. This is followed by the dynamics code. If you run the function with no inputs, it prints the default data structure in the command window.

Most of the data structure is in the struct statement. The cell array is added afterward. If it were added in the struct call, MATLAB would create a three-element structure array, one element for each cell array entry. The default data structure gives the user an initial state and suggested state names for plotting, and it puts the vehicle in a circular orbit.
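A minimal sketch of how such a default data structure might be assembled is shown below. The field names and numerical values are illustrative assumptions, not the actual contents of RHS2DTitan.

% Illustrative default data structure; field names and values are assumptions
rT = 2575;                               % Titan radius (km)
mu = 8978;                               % Titan gravitational parameter (km^3/s^2)
d  = struct('mu',mu,'rTitan',rT,'s',20,'cD0',0.006,...
            'cLAlpha',2*pi,'k',1/(pi*6*0.9),...
            'x0',[rT+200; 0; sqrt(mu/(rT+200))]);  % circular orbit at 200 km altitude

% The cell array is added after the struct call. Placing it inside struct would
% create a three-element structure array, one element per cell entry.
d.states = {'r (km)' 'u_r (km/s)' 'u_t (km/s)'};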

A second function, Simulation2DTitan, runs the simulation. Its inputs are the same as those of RHS2DTitan with the addition of the control input, which is a 2-by-n array of angles of attack and thrusts.
We use ode113 internally to give us flexibility in the size of the control time steps. ode113 keeps the integration errors below specified tolerances; we use the default tolerances in this case.

The first line rearranges the order of the inputs of RHS2DTitan to be compatible with ode113. ode113 returns all of the intermediate values in an n-by-3 array, so we keep only the last row. A sketch of this step is shown below.
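The sketch assumes RHS2DTitan has the signature xDot = RHS2DTitan( x, t, d ), while ode113 expects a function of the form f(t,x).

% One control step of the simulation (sketch)
rhs     = @(t,x) RHS2DTitan( x, t, d );   % reorder the inputs for ode113
[~, xI] = ode113( rhs, [0 dT], x );       % integrate over one control interval
x       = xI(end,:)';                     % keep only the last row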

The built-in demo starts in a circular orbit with a high angle of attack which starts reentry.
Figure 15.4 shows the results of the demo. As expected, the aircraft begins reentry.
Figure 15.4

Simulation using the built-in demo.

15.5 Simulating Level Flight

15.5.1 Problem

We want to fly our aircraft level to the Titan surface.

15.5.2 Solution

Write a script calling the Simulation2DTitan function with the aircraft at the optimal angle of attack and sufficient thrust to maintain level flight.

15.5.3 How It Works

For level flight, we will operate at the angle of attack corresponding to the optimal lift coefficient in Equation 15.11, with enough thrust to overcome drag. The velocity needs to be fast enough to generate sufficient lift to balance gravity at the desired altitude. Let’s assume that we want to fly at 1 km altitude above the surface of Titan. The script LevelFlight.m sets up and runs the simulation. We need the equilibrium thrust, velocity, and angle of attack. The easiest way to find them is with a search where the cost is the magnitude of the acceleration vector.

We first set up the simulation.
We then use fminsearch to find the angle of attack, thrust, and velocity that make the accelerations as small as possible.
The cost function calls RHS2DTitan and computes the state derivative. In equilibrium flight, this should be zero, so the magnitude of the acceleration vector is used as the cost. A sketch of the search is shown below. We then run the simulation with the equilibrium values.
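In this sketch, the fields d.alpha and d.thrust and the no-argument call that returns the default data structure are assumptions about RHS2DTitan; the initial guess values are illustrative.

% Search for the equilibrium angle of attack, thrust, and speed (sketch)
d    = RHS2DTitan;                         % default data structure (assumed no-argument call)
h    = 1;                                  % altitude above the surface (km)
r    = d.rTitan + h;
u0   = [0.05; 1; 1.5];                     % guess: alpha (rad), thrust (kN), u_t (km/s)
uEq  = fminsearch( @(u) EquilibriumCost( u, r, d ), u0 );

function c = EquilibriumCost( u, r, d )
% Magnitude of the acceleration vector; zero in equilibrium flight
d.alpha  = u(1);
d.thrust = u(2);
xDot     = RHS2DTitan( [r; 0; u(3)], 0, d );
c        = norm( xDot(2:3) );
end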
Figure 15.5

Level flight simulations. The results are not perfect due to numerical errors. 1 km and 100 km altitude results are shown.

The script prints the important parameters, states, and controls for 1 km altitude in the command window.

We also ran it at 100 km. The angle of attack is lower and the thrust is higher.

Figure 15.5 shows the results for both cases. The vehicle maintains level flight.

15.6 Optimal Trajectory

15.6.1 Problem

We want to produce an optimal trajectory to go from a starting point in an orbit to the surface.

15.6.2 Solution

Write a script using fmincon to produce an optimal trajectory.

15.6.3 How It Works

We will solve the constrained optimal control problem. The constraint is the end state, which must be
$$x = \left[\begin{array}{c}r_{\mathrm{Titan}}\\0\\0\end{array}\right]$$
(15.19)
We have three different cost calculations that could be minimized; fmincon minimizes whichever one is selected. The first is the dynamic pressure:
$$q = \frac{1}{2}\rho v^2$$
(15.20)
This is the aerodynamic load on the vehicle. Stagnation temperature is the second:
$$T_0 = T_a\left(1 + \frac{1}{2}f(\gamma - 1)M^2\right)$$
(15.21)
where $$T_a$$ is the ambient temperature, f is a recovery factor, γ is the ratio of specific heats, and M is the Mach number, which is the speed divided by the speed of sound. The last is the heating rate [45]:
$$r \approx \sqrt{\rho}\,v^3$$
(15.22)
In all cases, the cost will be the mean of these quantities over the trajectory.
The first step is to find a trajectory that reaches the surface. We write a script Landing.m for that purpose.

We start in a circular orbit, set the angle of attack to zero, and use zero thrust. We hit the ground within 15 minutes with a high vertical velocity and some tangential velocity. Figure 15.6 shows the trajectory. It shows that a good initial guess is zero angle of attack and a 15-minute duration.

Optimization using fmincon requires a constraint function and a cost function. The cost will be the mean dynamic pressure given in Equation 15.20. The constraint will be the landing condition. Both require that we simulate from the start until ground contact. We don’t care when ground contact occurs. We break up time into segments of equal length.

The controls will be piecewise continuous along the trajectory. The number of segments will impact the accuracy of the solution and the speed of the solution. More segments mean higher accuracy but can slow convergence.
Figure 15.6

Trajectory with no thrust and zero angle of attack.

The constraint function integrates the equations of motion until the vehicle hits the ground. Both the constraint and the cost functions use Simulation2DTitan; sketches of both are given below.
One of three costs can be selected: mean stagnation temperature, mean dynamic pressure, or mean heating rate. Each is computed by integrating the equations of motion over the trajectory.
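Sketches of what the constraint and cost functions might look like are shown below. The Simulation2DTitan argument order and returned state-history shape, the TitanAtmosphere output list, and the names LandingConstraint and LandingCost are assumptions about the supporting code; integrating only over the supplied time grid is a simplification.

function [cIneq,cEq] = LandingConstraint( u, d, t, x0 )
% Terminal constraint (sketch): end on the surface at rest
x     = Simulation2DTitan( x0, t, d, u );   % assumed to return the 3-by-n state history
cIneq = [];                                 % no inequality constraints
cEq   = [ x(1,end) - d.rTitan               % altitude is zero
          x(2,end)                          % radial velocity is zero
          x(3,end) ];                       % tangential velocity is zero
end

function c = LandingCost( u, d, t, x0, costType )
% Mean cost along the trajectory (sketch)
x = Simulation2DTitan( x0, t, d, u );
h = x(1,:) - d.rTitan;                      % altitude (km)
v = 1e3*sqrt( x(2,:).^2 + x(3,:).^2 );      % speed (m/s)
[rho,temp,aS] = TitanAtmosphere( h );       % density, temperature, speed of sound (assumed outputs)
gamma = 1.4;                                % ratio of specific heats (assumed value)
f     = 1;                                  % recovery factor (assumed value)
switch costType
  case 'dynamic pressure'
    c = mean( 0.5*rho.*v.^2 );                       % Equation 15.20
  case 'stagnation temperature'
    m = v./aS;                                       % Mach number
    c = mean( temp.*(1 + 0.5*f*(gamma - 1)*m.^2) );  % Equation 15.21
  otherwise                                          % heating rate
    c = mean( sqrt(rho).*v.^3 );                     % Equation 15.22
end
end

In practice, each function would live in its own file so that both are visible to fmincon.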
As you can see, the process is quite numerically intensive since we are integrating the equations of motion twice, once for the constraint and once for the cost. The optimization script ties these together; a sketch of the setup follows.
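This sketch uses the hypothetical LandingCost and LandingConstraint functions above. The segment count, duration guess, and control bounds are illustrative assumptions.

% Optimization setup (sketch); parameter values are illustrative
d    = RHS2DTitan;                          % default data structure (assumed no-argument call)
x0   = d.x0;                                % start in the circular orbit
n    = 40;                                  % number of control segments
tEnd = 15*60;                               % duration guess from Figure 15.6 (s)
w    = exp( -0.1*(0:n-1) );                 % exponentially decreasing step weights
t    = tEnd*cumsum(w)/sum(w);               % time at the end of each segment
u0   = zeros(2,n);                          % initial guess: angle of attack and thrust

cost = @(u) LandingCost( u, d, t, x0, 'heating rate' );
con  = @(u) LandingConstraint( u, d, t, x0 );
lB   = zeros(2,n);                          % angle of attack and thrust >= 0
uB   = [ (pi/4)*ones(1,n); 10*ones(1,n) ];  % alpha <= 45 deg, thrust <= 10 kN

opts = optimoptions('fmincon','Display','iter','MaxFunctionEvaluations',2e4);
uOpt = fmincon( cost, u0, [], [], [], [], lB, uB, con, opts );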
You’ll note that we use a non-uniform time distribution. This is because near the ground we expect that the controller will need to make decisions more quickly. The function has a default of a linearly decreasing step size.
The sequence is shown in Figure 15.7. Other sequences are possible. We use an exponentially decreasing step size.
Figure 15.7

Exponentially distributed step.

When you run the landing script, fmincon prints iteration data in the command window. The first column is the iteration number, and the second is the total function count. f(x) is the cost, in this case the heating rate. The Feasibility column shows how well the solution matches the landing constraint of zero altitude and zero velocity; ideally, it is zero when the constraint is met. The first-order optimality measures how close the solution is to an optimal solution. The norm of the step is the norm (essentially the magnitude) of the control step being taken.

The end is shown as follows:

Figure 15.8 shows the optimal solution. It varies the angle of attack near the end. The mean heating rate drops by a third, and the constraints are met.

The solution shows that we should make the time steps even smaller near the end.
Figure 15.8

Optimal landing.

15.7 Reinforcement Example

15.7.1 Problem

We want to produce a trajectory to go from a starting point in an orbit to the surface.

15.7.2 Solution

Use reinforcement learning to produce the angle of attack that minimizes the mean dynamic pressure while reaching zero altitude at zero velocity.

15.7.3 How It Works

Figure 15.9 shows the pattern for reinforcement learning. It is essentially a trial-and-error approach. The reinforcement learning algorithm implements a policy that it updates through experimentation. In our problem, the goal is a policy for the angle of attack that allows the lander to reach the surface with zero velocity. The model is the same as the one used in the optimization approach in the previous section. As with the optimization approach, no a priori control algorithm is used; the reinforcement learning algorithm builds its policy from multiple attempts.

We will implement a Deep Deterministic Policy Gradient (DDPG) algorithm [21]. It is a model-free, online, off-policy reinforcement learning method that works with continuous observations and actions. Off-policy agents store their experiences in a buffer and use that buffer to update the policy. A DDPG agent is an actor-critic reinforcement learning agent that searches for an optimal policy.

The first step is to make a class to encapsulate the environment. The properties are constants; calculations are not allowed in the properties block. The reward scale sets the scaling of the position and velocity errors.
Figure 15.9

Reinforcement learning.

The next part is the necessary methods. These include the class constructor, the step method, and the reset method, which returns the initial observation. The step method performs one integration time step.
The remainder of the class consists of methods to set class members and graphics members. An important part is the reward. The reward is an exponential function of the magnitude of the rate error when the lander is near the ground; otherwise, it is the heating rate.
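A skeleton of such a class, derived from rl.env.MATLABEnvironment, is sketched below. The property names, the reward details, and the use of an RHS2DTitan data structure are illustrative assumptions, not the book’s exact class.

classdef TitanLanderEnv < rl.env.MATLABEnvironment
  % Sketch of the environment class; property and field names are illustrative
  properties
    RTitan      = 2575        % Titan radius (km)
    Mu          = 8978        % Titan gravitational parameter (km^3/s^2)
    DT          = 20          % control time step (s)
    RewardScale = [1 0.01]    % scaling of the position and velocity errors
    RHSData     = []          % data structure for RHS2DTitan, set by the user
    State       = [0;0;0]     % [r; u_r; u_t]
  end
  methods
    function this = TitanLanderEnv( obsInfo, actInfo )
      this = this@rl.env.MATLABEnvironment( obsInfo, actInfo );
    end
    function [obs,reward,isDone,info] = step( this, action )
      % One integration time step followed by the reward calculation
      d          = this.RHSData;
      d.alpha    = action;                       % the action is the angle of attack
      rhs        = @(t,x) RHS2DTitan( x, t, d );
      [~,x]      = ode113( rhs, [0 this.DT], this.State );
      this.State = x(end,:)';
      obs        = this.State;
      h          = obs(1) - this.RTitan;         % altitude (km)
      v          = norm( obs(2:3) );             % speed (km/s)
      isDone     = h <= 0;
      if h < 1                                   % near the ground: reward the rate error
        reward = exp( -(this.RewardScale(1)*abs(h) + this.RewardScale(2)*v) );
      else                                       % otherwise: penalize the heating rate
        reward = -sqrt( TitanAtmosphere(h) )*(1e3*v)^3; % first output assumed to be density
      end
      info = [];
    end
    function obs = reset( this )
      r          = this.RTitan + 200;            % start in a 200 km circular orbit
      this.State = [r; 0; sqrt(this.Mu/r)];
      obs        = this.State;
    end
  end
end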
The training is done in a script. The agent, which includes both the actor and the critic, is created in the call to rlDDPGAgent. The training sets the weights for the actor and the critic.

We print out the observation information. The observations are the lander state. We don’t set any observation limits.

The action information is listed next. The action is the angle of attack.

The limits on the angle of attack are 0 and 45 degrees.
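A sketch of the specification, agent creation, and training calls is shown below. The environment class name matches the sketch above, the observation name LanderStates matches the field used later when examining the sim output, and the training options are illustrative assumptions.

% Create the specifications, environment, and agent, then train (sketch)
obsInfo = rlNumericSpec([3 1],'Name','LanderStates');      % r, u_r, u_t; no limits
actInfo = rlNumericSpec([1 1],'Name','angle of attack',...
                        'LowerLimit',0,'UpperLimit',pi/4); % 0 to 45 degrees
env         = TitanLanderEnv( obsInfo, actInfo );          % environment class sketched above
env.RHSData = RHS2DTitan;                                  % default data structure (assumed)
agent       = rlDDPGAgent( obsInfo, actInfo );             % default actor and critic networks

opts = rlTrainingOptions('MaxEpisodes',500,'MaxStepsPerEpisode',50,...
                         'StopTrainingCriteria','EpisodeCount','StopTrainingValue',500);
trainStats = train( agent, env, opts );
experience = sim( env, agent );                            % run the trained agent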

The actor network is shown in Figure 15.10. The features are the three states. There are nine layers. This is the default network; you can substitute any neural network you would like. The actor network takes the state observation and creates the action, which is to vary the angle of attack.

The first layer is the “feature” input, which takes the three measurements. These are passed to a 128-neuron fully connected layer that uses ReLU (rectified linear units) as the activation function. This is followed by another fully connected layer with the same activation function. The output layer is a fully connected layer with two neurons followed by a hyperbolic tangent activation layer, whose output is the control of the angle of attack.
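An equivalent layer stack, written out by hand, might look like the following. This single-output variant assumes the action is the scalar angle of attack; the scaling layer mapping the tanh output into the 0 to 45 degree range is an assumption, not the default network’s exact structure.

% Hand-written actor network with a single scalar output (sketch)
actorLayers = [
  featureInputLayer(3,'Name','observation')              % the three states
  fullyConnectedLayer(128,'Name','fc1')
  reluLayer('Name','relu1')
  fullyConnectedLayer(128,'Name','fc2')
  reluLayer('Name','relu2')
  fullyConnectedLayer(1,'Name','fc3')                     % one output: angle of attack
  tanhLayer('Name','tanh1')
  scalingLayer('Name','scale1','Scale',pi/8,'Bias',pi/8)  % map [-1,1] to [0, 45 deg]
  ];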
Figure 15.10

Actor network. The input node is an array.

The critic network shown in Figure 15.11 takes the states and the actions as inputs.
Figure 15.11

Critic network. The input nodes are two arrays.

The training reports results in the command window. Each episode is 50 steps; that is, the landing trajectory is broken into 50 parts. The best possible reward is zero.

Figure 15.12 shows the training window.

Episode Q0 is an estimate of the long-term reward at the start of each episode, using only the initial observation. The window also shows the reward for the current episode and the average reward. The highest reward, in this case, is zero.

You can halt training at any time, and the script will complete and run the simulation. The output of sim, experience, is a fairly complex data structure containing the simulation results.
Figure 15.12

Reinforcement learning training window. You can halt training at any time.

experience.Observation.LanderStates gives the time series for the simulation.
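For example, the state history can be pulled out of the timeseries and plotted; the Titan radius used to convert to altitude is an assumed value.

% Extract and plot the lander state history from the experience output
xTS = experience.Observation.LanderStates;   % timeseries of the observations
t   = xTS.Time;
x   = squeeze( xTS.Data );                   % 3-by-n array of [r; u_r; u_t]
plot( t, x(1,:) - 2575 ), grid on            % altitude above an assumed 2575 km radius
xlabel('Time (s)'), ylabel('Altitude (km)')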

The landing is shown in Figure 15.13. The solution is different from the optimal solution. We do not give a negative reward for increasing the radius, so the agent is free to climb to higher altitudes.
Figure 15.13

Time history.
