Why explore?

The trajectories are sampled following a policy that can be stochastic or deterministic. In the case of a deterministic policy, each time a trajectory is sampled, the visited states will always be the same, and the update of the value function will take into account only this limited set of states. This considerably limits your knowledge of the environment. It is like learning from a teacher who never changes their opinion on a subject: you will be stuck with those ideas without ever learning about others.

Thus, exploring the environment is crucial if you want to achieve good results, as it ensures that no better policy remains undiscovered.

On the other hand, if a policy is designed so that it constantly explores the environment without taking into account what it has already learned, converging to a good policy becomes very difficult, perhaps even impossible. This balance between exploration and exploitation (behaving according to the best policy currently available) is called the exploration-exploitation dilemma and will be considered in greater detail in Chapter 12, Developing an ESBAS Algorithm.
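
As a minimal illustration of this trade-off (a sketch, not code from this book), consider epsilon-greedy action selection: with a small probability the agent tries a random action, otherwise it follows its current best estimate. The function name eps_greedy_action, the q_values array, and the eps parameter are assumptions chosen for this example.

```python
import numpy as np

def eps_greedy_action(q_values, eps=0.1):
    """Illustrative epsilon-greedy selection (hypothetical helper, not from the book).

    With probability eps, pick a random action (exploration);
    otherwise pick the action with the highest estimated value (exploitation).
    """
    if np.random.uniform(0, 1) < eps:
        return np.random.randint(len(q_values))  # explore: try something new
    return int(np.argmax(q_values))              # exploit: use current knowledge

# Example usage: with eps=0.1, roughly 10% of choices ignore the current
# estimates, so the agent keeps visiting states a purely greedy
# (deterministic) policy would never reach.
q = np.array([1.0, 2.5, 0.3])
action = eps_greedy_action(q, eps=0.1)
```

Setting eps to 0 recovers the purely exploitative (deterministic) behavior described above, while eps equal to 1 gives constant, undirected exploration; practical values sit in between and are often decayed over time.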
