Assessments

Chapter 3 

  • What's a stochastic policy?
    • It's a policy defined in terms of a probability distribution over actions, given a state.
  • How can a return be defined in terms of the return at the next time step?
    • The return can be defined recursively: the return at time t is the immediate reward plus the discounted return at the next time step, G_t = r_{t+1} + γG_{t+1}.
  • Why is the Bellman equation so important?
    • Because it provides a general formula for computing the value of a state from the immediate reward and the value of the subsequent state.
  • What are the limiting factors of DP algorithms?
    • Their complexity explodes with the number of states, so they are practical only for problems with a limited state space. The other constraint is that the dynamics of the environment have to be fully known.
  • What's policy evaluation?
    • It's an iterative method that computes the value function of a given policy using the Bellman equation.
  • How do policy iteration and value iteration differ?
    • Policy iteration alternates between policy evaluation and policy improvement; value iteration, instead, combines the two in a single update using the max operator (see the sketch below).
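
Below is a minimal value-iteration sketch for a small, fully known tabular MDP. The layout of the transition model P and the reward table R, as well as the function names, are illustrative assumptions rather than code from the chapter.

```python
import numpy as np

def value_iteration(n_states, n_actions, P, R, gamma=0.99, tol=1e-6):
    # Assumes P[s][a] is a list of (probability, next_state) pairs and R[s][a] is a float
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # Bellman optimality backup: evaluation and improvement collapsed into one max
            q_values = [R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                        for a in range(n_actions)]
            best = max(q_values)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # The greedy policy is recovered by taking the argmax of the same backup
    policy = [int(np.argmax([R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                             for a in range(n_actions)]))
              for s in range(n_states)]
    return V, policy
```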

 

Chapter 4

  • What's the main property of the MC method used in RL?
    • It estimates the value function as the average of the returns obtained from a state.
  • Why are MC methods offline?
    • Because they update the state value only when the complete trajectory is available. Thus they have to wait until the end of the episode.
  • What are the two main ideas of TD learning?
    • They combine the ideas of sampling and bootstrapping
  • What are the differences between MC and TD?
    • MC methods learn from full trajectories, whereas TD methods learn at every step, acquiring knowledge from incomplete trajectories as well.
  • Why is exploration important in TD learning?
    • Because the TD update is done only on the state-action pairs that are visited, so, in the absence of an exploration strategy, some of them may never be tried. As a result, good policies may never be discovered.
  • Why is Q-learning off-policy?
    • Because the Q-learning update is done independently of the behavior policy: the target is built with the greedy policy through the max operator (see the sketch below).
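
Below is a minimal tabular Q-learning sketch, assuming a Gym-style environment with small discrete state and action spaces and the classic four-value step API; env, n_states, and n_actions are illustrative assumptions.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy behavior policy provides the exploration
            a = env.action_space.sample() if np.random.rand() < eps else int(np.argmax(Q[s]))
            s_next, r, done, _ = env.step(a)
            # Off-policy target: bootstrap with the greedy action through the max operator
            target = r + gamma * (0.0 if done else np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```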

Chapter 5

  • When does the deadly triad problem arise?
    • When off-policy learning is combined with function approximation and bootstrapping.
  • How does DQN overcome these instabilities?
    • By using a replay buffer and separate online and target networks.
  • What's the moving target problem?
    • It's a problem that arises when the target values aren't fixed and they change as the network is optimized.
  • How is the moving target problem mitigated in DQN?
    • By introducing a target network that is updated less frequently than the online network.
  • What's the optimization procedure used in DQN?
    • A mean squared error loss function is minimized through stochastic gradient descent, an iterative method that performs gradient updates on mini-batches (see the sketch below).
  • What's the definition of a state-action advantage value function?
    • It's the difference between the action-value function and the state-value function: A(s, a) = Q(s, a) - V(s).
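
Below is a minimal sketch of the DQN target and loss computation, assuming PyTorch; online_net, target_net, and the minibatch sampled from the replay buffer are illustrative assumptions rather than the chapter's code.

```python
import torch
import torch.nn.functional as F

def dqn_loss(online_net, target_net, batch, gamma=0.99):
    states, actions, rewards, next_states, dones = batch
    # Q(s, a) from the online network for the actions that were actually taken
    q_values = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # Targets built with the target network stay fixed between synchronizations,
    # which mitigates the moving target problem
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q
    return F.mse_loss(q_values, targets)

# Every C gradient steps the target network is synchronized with the online one:
# target_net.load_state_dict(online_net.state_dict())
```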

Chapter 6

  • How do PG algorithms maximize the objective function?
    • They do it by taking a step in the direction of the gradient of the objective function (gradient ascent). The step is proportional to the return.
  • What's the main intuition behind PG algorithms?
    • Encourage good actions and dissuade the agent from the bad ones.
  • Why does REINFORCE remain unbiased when a baseline is introduced?
    • Because, in expectation, the term involving the baseline is zero: the baseline doesn't depend on the action, so its contribution to the gradient vanishes (see the sketch below).
  • To which broader class of algorithms does REINFORCE belong?
    • It is a Monte Carlo method as it relies on full trajectories like MC methods do.
  • How does the critic in AC methods differ from a value function used as a baseline in REINFORCE?
    • Although the learned function is the same, the critic uses the approximated value function to bootstrap the action-state value, whereas in REINFORCE (but also in AC) it is used as a baseline to reduce the variance.
  • If you had to develop an algorithm for an agent that has to learn to move, would you prefer REINFORCE or AC?
    • You should first try an actor-critic algorithm as the agent has to learn a continuous task.
  • Could you use an n-step Actor-Critic algorithm as a REINFORCE algorithm?
    • Yes, you could, as long as n is greater than the maximum possible number of steps in the environment.
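
Below is a minimal sketch of the REINFORCE loss with a baseline, assuming PyTorch and that the per-step log-probabilities and rewards of one complete episode have already been collected; the names are illustrative placeholders.

```python
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    # Compute the discounted return G_t for every time step of the episode
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    # A simple baseline: the mean return of the episode. Subtracting it keeps the
    # gradient unbiased (its expected contribution is zero) while reducing variance.
    advantages = returns - returns.mean()
    # Maximize E[log pi(a|s) * (G_t - b)] by minimizing its negative
    return -(torch.stack(log_probs) * advantages).sum()
```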

Chapter 7

  • How can a policy neural network control a continuous agent?
    • One way to do it is to predict the mean and the standard deviation that describe a Gaussian distribution. The standard deviation could either be conditioned on a state (the input of the neural network) or be a standalone parameter.
  • What's the KL divergence?
    • It's a measure of the proximity between two probability distributions.
  • What's the main idea behind TRPO?
    • To optimize the new objective function in a region near the old probability distribution.
  • How is the KL divergence used in TRPO?
    • It is used as a hard constraint to limit the divergence between the old and the new policy.
  • What's the main benefit of PPO?
    • It relies only on first-order optimization, which makes the algorithm simpler, while achieving better sample efficiency and performance.
  • How does PPO achieve good sample efficiency?
    • It runs several minibatch updates on the same data, exploiting it better (see the sketch below).
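
Below is a minimal sketch of PPO's clipped surrogate objective, assuming PyTorch and that the new and old log-probabilities and the advantages of a minibatch are already available; all names are illustrative.

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    # Probability ratio between the new policy and the one that collected the data
    ratio = torch.exp(new_log_probs - old_log_probs)
    # Clipping keeps the update in a region near the old policy, playing a role
    # similar to TRPO's KL constraint while using only first-order optimization
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```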

Chapter 8

  • What is the primary limitation of Q-learning algorithms?
    • The action space has to be discrete and small in order to compute the global maximum.
  • Why are stochastic policy gradient algorithms sample inefficient?
    • Because they are on-policy and need new data every time the policy changes.
  • How does deterministic policy gradient overcome the maximization problem?
    • DPG models the policy as a deterministic function that predicts a single deterministic action, and the deterministic policy gradient theorem provides a way to compute the gradient used to update the policy.
  • How does DPG guarantee enough exploration?
    • By adding noise to the deterministic policy or by learning a different behavior policy.
  • What does DDPG stand for? And what is its main contribution?
    • DDPG stands for Deep Deterministic Policy Gradient. It's an algorithm that adapts the deterministic policy gradient to work with deep neural networks, using new strategies to stabilize and speed up learning.
  • Which problems does TD3 propose to minimize?
    • The overestimation bias common in Q-learning and high-variance estimates.
  • What new mechanisms does TD3 employ?
    • To reduce the overestimation bias, it uses clipped double Q-learning, while it addresses the variance problem with delayed policy updates and a target policy smoothing regularization technique (see the sketch below).
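
Below is a minimal sketch of TD3's target computation, combining clipped double Q-learning with target policy smoothing; it assumes PyTorch and that the target actor and the two target critics already exist. All names are illustrative.

```python
import torch

def td3_targets(rewards, next_states, dones, target_actor, target_q1, target_q2,
                gamma=0.99, noise_std=0.2, noise_clip=0.5, act_limit=1.0):
    with torch.no_grad():
        next_actions = target_actor(next_states)
        # Target policy smoothing: add clipped noise to the target action
        noise = torch.clamp(torch.randn_like(next_actions) * noise_std,
                            -noise_clip, noise_clip)
        next_actions = torch.clamp(next_actions + noise, -act_limit, act_limit)
        # Clipped double Q-learning: take the minimum of the two target critics
        next_q = torch.min(target_q1(next_states, next_actions),
                           target_q2(next_states, next_actions))
        return rewards + gamma * (1.0 - dones) * next_q
```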

Chapter 9

  • Would you use a model-based or a model-free algorithm if you had only 10 games to train your agent to play checkers?
    • I would use a model-based algorithm. The model of checkers is known, and planning on it is a feasible task.
  • What are the disadvantages of model-based algorithms?
    • Overall, they require more computational power and achieve lower asymptotic performance than model-free algorithms.
  • If a model of the environment is unknown, how can it be learned?
    • Once a dataset has been collected through interactions with the real environment, the dynamics model can be learned in the usual supervised way (see the sketch below).
  • Why are data aggregation methods used?
    • Because the first interactions with the environment are usually done with a naive policy that doesn't explore all of it. Further interactions with a more refined policy are required to refine the model of the environment.
  • How does ME-TRPO stabilize training?
    • ME-TRPO employs two main features: an ensemble of models and early stopping techniques.
  • Why does an ensemble of models improve policy learning?
    • Because predictions made by an ensemble of models take into account the uncertainty of any single model.
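
Below is a minimal sketch of learning a dynamics model in a supervised way, assuming PyTorch and a dataset of (state, action, next state) tensors collected by interacting with the real environment; the network architecture and names are illustrative.

```python
import torch
import torch.nn as nn

def train_dynamics_model(states, actions, next_states, epochs=100, lr=1e-3):
    # The model predicts the next state from the current state and action
    in_dim = states.shape[1] + actions.shape[1]
    model = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                          nn.Linear(128, states.shape[1]))
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    inputs = torch.cat([states, actions], dim=1)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(inputs), next_states)
        loss.backward()
        opt.step()
    return model
```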

Chapter 10

  • Is imitation learning considered a reinforcement learning technique?
    • No, because the underlying frameworks are different. The objective of IL isn't to maximize the reward as in RL.
  • Would you use imitation learning to build an unbeatable agent in Go?
    • Probably not, because it requires an expert to learn from, and if the agent has to be the best player in the world, there's no expert good enough to imitate.
  • What's the full name of DAgger?
    • Dataset Aggregation
  • What's the main strength of DAgger?
    • It overcomes the problem of distribution mismatch by having the expert actively teach the learner how to recover from errors (see the sketch below).
  • Where would you use IRL instead of IL?
    • In problems where the reward function is easier to learn and where there's the necessity to learn a policy better than that of the expert.
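
Below is a minimal sketch of the DAgger loop, assuming a Gym-style environment, an expert exposing an act(state) method, and a supervised learner with fit and predict methods; all of these names are illustrative placeholders.

```python
def dagger(env, expert, learner, iterations=10, steps_per_iter=1000):
    dataset_states, dataset_actions = [], []
    for _ in range(iterations):
        state = env.reset()
        for _ in range(steps_per_iter):
            # The learner's policy chooses the actions that are actually executed...
            action = learner.predict(state)
            # ...but the expert labels every visited state, teaching the learner
            # how to recover from its own mistakes
            dataset_states.append(state)
            dataset_actions.append(expert.act(state))
            state, _, done, _ = env.step(action)
            if done:
                state = env.reset()
        # Aggregate the new data and retrain the learner on the whole dataset
        learner.fit(dataset_states, dataset_actions)
    return learner
```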

Chapter 11

  • What are two alternative algorithms to reinforcement learning for solving sequential decision problems?
    • evolution strategies and genetic algorithms
  • What are the processes that give birth to new individuals in evolutionary algorithms?
    • Mutation, which alters the genes of a parent, and crossover, which combines the genetic information of two parents.
  • What is the source of inspiration of evolutionary algorithms like genetic algorithms?
    • Evolutionary algorithms are principally inspired by biological evolution.
  • How does CMA-ES evolve evolution strategies?
    • CMA-ES samples new candidates from a multivariate normal distribution whose covariance matrix is adapted to the population.
  • What's one advantage and one disadvantage of evolution strategies?
    • One advantage is that they are derivative-free methods, while a disadvantage is that they are sample inefficient.
  • What's the trick used in the "Evolution Strategies as a Scalable Alternative to Reinforcement Learning" paper to reduce the variance?
    • They propose mirrored sampling: for each perturbation, an additional mutation with the perturbation of the opposite sign is also evaluated (see the sketch below).
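
Below is a minimal sketch of one evolution strategy update with mirrored (antithetic) sampling, assuming a fitness function f(params) that returns a scalar score; the names and hyperparameters are illustrative.

```python
import numpy as np

def es_step(params, f, pop_size=50, sigma=0.1, lr=0.01):
    noise = np.random.randn(pop_size, params.size)
    # Mirrored sampling: evaluate each perturbation with both signs,
    # which reduces the variance of the gradient estimate
    rewards_pos = np.array([f(params + sigma * eps) for eps in noise])
    rewards_neg = np.array([f(params - sigma * eps) for eps in noise])
    # Estimate the gradient from the reward differences and take an ascent step
    grad = noise.T @ (rewards_pos - rewards_neg) / (2 * pop_size * sigma)
    return params + lr * grad
```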

Chapter 12

  • What's the exploration-exploitation dilemma?
    • It's the decision problem of whether it's better to explore, in order to make better decisions in the future, or to exploit the best current option.
  • What are two exploration strategies that we already used in previous RL algorithms?
    • ε-greedy and a strategy that introduces some additional noise into the policy.
  • What's UCB?
    • Upper Confidence Bound is an optimistic exploration algorithm that estimates an upper confidence bound for each action value and selects the action that maximizes the quantity in equation (12.3) (see the sketch below).
  • Is Montezuma's Revenge or Multi-armed bandit problem more difficult to solve?
    • Montezuma's Revenge is much more difficult than the multi-armed bandit problem, if only because the latter is stateless while the former has an astronomical number of possible states. Montezuma's Revenge also has more complexity intrinsic to the game.
  • How does ESBAS tackle the problem of online RL algorithm selection?
    • By employing a meta-algorithm that learns which algorithm among a fixed portfolio performs better in a given circumstance.
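
Below is a minimal sketch of UCB-style action selection for a multi-armed bandit, assuming arrays holding the running mean reward and the pull count of each arm; the exploration constant c and the names are illustrative.

```python
import numpy as np

def ucb_select(values, counts, t, c=2.0):
    values = np.asarray(values, dtype=float)
    counts = np.asarray(counts, dtype=float)
    # Pull each arm at least once before applying the bound
    if np.any(counts == 0):
        return int(np.argmin(counts))
    # Optimism in the face of uncertainty: estimated value plus an upper confidence bound
    ucb = values + c * np.sqrt(np.log(t) / counts)
    return int(np.argmax(ucb))
```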

Chapter 13

  • How would you rank DQN, A2C, and ES based on their sample efficiency?
    • DQN is the most sample efficient, followed by A2C and ES.
  • What would their rank be if rated on the training time and 100 CPUs are available?
    • ES would probably be the fastest to train, followed by A2C and DQN.
  • Would you start debugging an RL algorithm on CartPole or Montezuma's Revenge?
    • CartPole. You should start debugging an algorithm on an easy task.
  • Why is it better to use multiple seeds when comparing multiple deep RL algorithms?
    • The results of a single trial can be highly volatile due to the stochasticity of the neural network and of the environment. By averaging over multiple random seeds, the results approximate the average case (see the sketch below).
  • Does the intrinsic reward help the exploration of an environment?
    • Yes, because the intrinsic reward is a sort of exploration bonus that increases the curiosity of the agent to visit novel states.
  • What's transfer learning?
    • It's the task of efficiently transferring knowledge between two environments.
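
Below is a minimal sketch of comparing an algorithm across multiple random seeds, assuming a train(env_name, seed) function that returns the list of episode returns of one run; the function name and the choice of score are illustrative.

```python
import numpy as np

def evaluate_over_seeds(train, env_name, seeds=(0, 1, 2, 3, 4)):
    final_returns = []
    for seed in seeds:
        returns = train(env_name, seed=seed)
        # Score each run with the average return of its last episodes
        final_returns.append(np.mean(returns[-10:]))
    # Report mean and standard deviation across seeds rather than a single trial
    return float(np.mean(final_returns)), float(np.std(final_returns))
```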