15.1 Introduction
Reinforcement learning is a machine learning approach in which an intelligent agent learns to take actions to maximize a reward. We will apply this to the design of a Titan landing control system. Reinforcement learning is a tool for approximating solutions that could, in principle, be obtained by dynamic programming but that are computationally intractable to compute exactly [3].
We’ll start the chapter by modeling the point-mass dynamics for the vehicle. We will then build a simulation. We’ll test the simulation with the vehicle starting in orbit around Titan, a moon of Saturn, and with the vehicle in level flight. We will then use optimization to find a trajectory. Finally, we’ll design a reinforcement learning algorithm to find a trajectory.
15.2 Titan Lander
The problem for which we will use reinforcement learning is a powered landing on Titan. Titan is the second-largest moon in the solar system and is larger than the planet Mercury. Titan has a thick atmosphere, which makes it feasible to fly an aircraft using aerodynamic forces for lift. Figure 15.1 shows an image of Titan and its atmospheric density.
We start in a circular orbit. To begin reentry, we increase the angle of attack to increase the drag, which causes the ramjet to descend. Remember, we are using a simple aerodynamics model. We then adjust the angle of attack and thrust to reach our desired landing point. In this chapter, we will design one trajectory using optimal control, and then we will use reinforcement learning to design a trajectory. Our objective will be to reach the Titan surface with zero velocity.
15.3 Modeling the Titan Atmosphere
15.3.1 Problem
We want to compute the atmospheric density of Titan.
15.3.2 Solution
Use a database of Titan densities and altitudes and interpolate.
15.3.3 How It Works
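The density is found by table lookup with interpolation. The following is a minimal sketch of that approach, assuming the database is stored as altitude and density vectors; the numbers shown are placeholders rather than the values in the actual database.

% Placeholder altitude/density table for Titan; the recipe uses a database of
% tabulated values. Altitudes are in km, densities in kg/m^3 (notional only).
hTable   = [0 20 40 60 80 100 150 200];
rhoTable = [5.4 3.0 1.5 0.7 0.3 0.1 0.01 0.001];

h   = 1;                                                % altitude of interest (km)
rho = exp(interp1(hTable, log(rhoTable), h, 'pchip'));  % interpolate in log space

Interpolating the logarithm of the density is a reasonable choice because the density falls off roughly exponentially with altitude.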
15.4 Simulating the Aircraft
15.4.1 Problem
We want to numerically integrate the trajectory given aircraft parameters and the controls.
15.4.2 Solution
Write a function with a built-in integrator to integrate the equations of motion. Write a second function that returns the state derivative.
15.4.3 How It Works
Most of the data structure is created in the struct statement. The cell array is added afterward; if it were passed to struct directly, MATLAB would make a three-element structure array, one for each cell array element. The default data structure gives the user an initial state and suggested state names for plotting, and it puts the vehicle in a circular orbit.
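The following is a minimal sketch of the pattern; the field and state names are assumptions for illustration.

% Build the data structure first, then attach the cell array of state names.
% Passing the cell array directly to struct would create one struct element
% per cell, which is not what we want.
d        = struct('thrust', 0, 'alpha', 0, 'x', zeros(3,1));
d.states = {'r (km)' 'u (km/s)' 'v (km/s)'};      % added after the struct call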
The first line rearranges the order of the inputs to RHS2DTitan to be compatible with ode113. ode113 returns all of the intermediate values in an n-by-three array, so we grab only the last row.
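A minimal sketch of that code follows, assuming RHS2DTitan(x, t, d) is the state-derivative function and dT is the simulation time step.

rHS    = @(t, x) RHS2DTitan(x, t, d);   % rearrange the inputs for ode113
[~, x] = ode113(rHS, [0 dT], d.x);      % integrate over one time step dT
d.x    = x(end, :)';                    % keep only the last (final) state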
15.5 Simulating Level Flight
15.5.1 Problem
We want to fly our aircraft level to the Titan surface.
15.5.2 Solution
Write a script calling the Simulation2DTitan function with the aircraft at the optimal angle of attack and sufficient thrust to maintain level flight.
15.5.3 How It Works
For level flight, we will operate at the optimal angle of attack, given in Equation 15.11, with enough thrust to overcome drag. The velocity needs to be fast enough to generate sufficient lift to balance gravity at the desired altitude. Let’s assume that we want to fly at 1 km altitude above the surface of Titan. The script LevelFlight.m sets up and runs the simulation. We need equilibrium thrust, velocity, and the angle of attack. The easiest way to do this is with a search where the cost is the magnitude of the acceleration vector.
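The following is a minimal sketch of that search using fminsearch; the helper TrimCost, the state indices, and the initial guesses are assumptions for illustration.

% Hypothetical trim search: find the thrust, velocity, and angle of attack
% that zero the acceleration at the desired altitude.
p0 = [100; 1.5; 0.05];                  % guess: [thrust (N); velocity (km/s); alpha (rad)]
p  = fminsearch(@(p) TrimCost(p, d), p0);

function c = TrimCost(p, d)
  d.thrust = p(1);                      % candidate thrust
  d.x(2)   = p(2);                      % candidate velocity (assumed state index)
  d.alpha  = p(3);                      % candidate angle of attack
  xDot     = RHS2DTitan(d.x, 0, d);     % state derivative from the dynamics
  c        = norm(xDot(2:3));           % magnitude of the acceleration vector
end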
The following prints out in the command window. It shows important parameters, states, and controls for 1 km altitude.
We also ran it at 100 km. The angle of attack is lower and the thrust is higher.
Figure 15.5 shows the results for both cases. The vehicle maintains level flight.
15.6 Optimal Trajectory
15.6.1 Problem
We want to produce an optimal trajectory to go from a starting point in an orbit to the surface.
15.6.2 Solution
Write a script using fmincon to produce an optimal trajectory.
15.6.3 How It Works
We start in a circular orbit and set the angle of attack and thrust to zero. We hit the ground within 15 minutes with a high vertical velocity and some tangential velocity. Figure 15.6 shows the trajectory. It shows that a good initial guess is zero angle of attack and a 15-minute duration.
Optimization using fmincon requires a constraint function and a cost function. The cost will be the mean dynamic pressure given in Equation 15.20. The constraint will be the landing condition. Both require that we simulate from the start until ground contact; we don’t care exactly when ground contact occurs. We break up time into segments of equal length.
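The following is a minimal sketch of the fmincon setup; LandingCost and LandingConstraint are hypothetical wrappers that run the simulation with the angle-of-attack segments and return the cost and the terminal landing constraint, respectively. The segment count and bounds are illustrative.

nSeg     = 20;                                    % equal-length control segments
alpha0   = zeros(nSeg, 1);                        % initial guess: zero angle of attack
lB       = zeros(nSeg, 1);                        % lower bound (0 deg)
uB       = (45*pi/180)*ones(nSeg, 1);             % upper bound (45 deg)
opts     = optimoptions('fmincon', 'Display', 'iter');
alphaOpt = fmincon(@(a) LandingCost(a, d), alpha0, [], [], [], [], ...
                   lB, uB, @(a) LandingConstraint(a, d), opts);

LandingConstraint returns [c, ceq], with ceq equal to the altitude and velocities at ground contact, so fmincon drives them to zero.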
When you run the landing script, you will get the following output in the command window. The first column is the iteration number, and the second column is the total function count. f(x) is the cost, in this case, the heating rate. The fourth column, Feasibility, shows how well the solution matches the landing constraint of zero altitude and velocity; ideally, it is zero when the constraint is met. The first-order optimality measures how close the solution is to an optimal one. The norm of the step is the norm (essentially the magnitude) of the control step being taken.
The end is shown as follows:
Figure 15.8 shows the optimal solution. It varies the angle of attack near the end. The mean heating rate drops by a third, and the constraints are met.
15.7 Reinforcement Example
15.7.1 Problem
We want to produce a trajectory to go from a starting point in an orbit to the surface.
15.7.2 Solution
Use reinforcement learning to produce the angle of attack that minimizes the mean dynamic pressure while reaching zero altitude with zero velocity.
15.7.3 How It Works
Figure 15.9 shows the pattern for reinforcement learning. It is essentially a trial-and-error approach. The reinforcement learning algorithm implements a policy that it updates using experimentation. In our problem, the goal is a policy for the angle of attack that allows the lander to reach the surface with zero velocity. The model is the same as was used in the optimization approach in the previous section. As with the optimization approach, no a priori control algorithm is supplied; the reinforcement learning algorithm builds its own policy from multiple attempts.
We will implement a Deep Deterministic Policy Gradient (DDPG) algorithm [21]. It is a model-free, online, off-policy reinforcement learning method. It works with continuous observations and actions. Off-policy agents create a buffer of observations and use that to update the policy. A DDPG agent is an actor-critic reinforcement learning agent that searches for an optimal policy.
We print out the observation information. The observations are the lander state; we don’t set any observation limits. The action information is listed next. The action is the angle of attack.
The limits on the angle of attack are 0 and 45 degrees.
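The following is a minimal sketch of the observation and action specifications and the default agent creation, assuming a three-element lander state; the names are illustrative.

obsInfo      = rlNumericSpec([3 1]);              % the three lander states
obsInfo.Name = 'LanderStates';                    % no observation limits set
actInfo      = rlNumericSpec([1 1], 'LowerLimit', 0, 'UpperLimit', 45*pi/180);
actInfo.Name = 'AngleOfAttack';                   % action limited to 0-45 deg
agent        = rlDDPGAgent(obsInfo, actInfo);     % default actor and critic networks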
The actor network is shown in Figure 15.10 and listed as follows. The features are the three states, and there are nine layers. This is the default network; you can substitute any neural network you would like. The actor network takes the state observation and produces the action, which is the angle of attack.
The training reports results in the command window. Each episode is 50 steps; that is, the landing trajectory is broken into 50 parts. The best reward is zero.
Figure 15.12 shows the training window.
Episode Q0 is an estimate of the long-term reward at the start of each episode, using only the initial observation. The window also shows the reward for the current episode and the average reward. The highest reward, in this case, is zero.
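The following is a minimal sketch of the training call, assuming the environment env wraps the landing simulation; the option values are illustrative.

trainOpts = rlTrainingOptions( ...
    'MaxEpisodes',          500, ...              % illustrative episode count
    'MaxStepsPerEpisode',   50, ...               % 50 steps per landing trajectory
    'StopTrainingCriteria', 'AverageReward', ...
    'StopTrainingValue',    0, ...                % the best reward is zero
    'Plots',                'training-progress'); % opens the training window
trainStats = train(agent, env, trainOpts);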
experience.Observation.LanderStates gives the time series for the simulation.
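The following is a minimal sketch of running the trained agent and extracting the trajectory; the field name follows the text above.

experience = sim(env, agent);                     % run the trained policy once
x          = experience.Observation.LanderStates; % timeseries of the lander states
plot(x.Time, squeeze(x.Data)')                    % plot the state history vs. time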