Why explore?

The trajectories are sampled following a policy that can be stochastic or deterministic. In the case of a deterministic policy, each time a trajectory is sampled, the visited states will always be the same, and the update of the value function will take into account only this limited set of states. This considerably limits your knowledge of the environment. It is like learning from a teacher who never changes their opinion on a subject: you will be stuck with those ideas without ever learning about others.

Thus, exploring the environment is crucial if you want to achieve good results, as it ensures that no better policy remains undiscovered.

On the other hand, if a policy is designed so that it constantly explores the environment without taking into account what it has already learned, converging to a good policy becomes very difficult, perhaps even impossible. This balance between exploration and exploitation (behaving according to the best policy currently available) is called the exploration-exploitation dilemma and will be considered in greater detail in Chapter 12, Developing an ESBAS Algorithm.
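
As a minimal illustration of this trade-off (a sketch, not code from this book), consider epsilon-greedy action selection: with a small probability the agent tries a random action, otherwise it follows its current best estimate. The function name eps_greedy_action, the q_values array, and the eps parameter are assumptions chosen for this example.

```python
import numpy as np

def eps_greedy_action(q_values, eps=0.1):
    """Illustrative epsilon-greedy selection (hypothetical helper, not from the book).

    With probability eps, pick a random action (exploration);
    otherwise pick the action with the highest estimated value (exploitation).
    """
    if np.random.uniform(0, 1) < eps:
        return np.random.randint(len(q_values))  # explore: try something new
    return int(np.argmax(q_values))              # exploit: use current knowledge

# Example usage: with eps=0.1, roughly 10% of choices ignore the current
# estimates, so the agent keeps visiting states a purely greedy
# (deterministic) policy would never reach.
q = np.array([1.0, 2.5, 0.3])
action = eps_greedy_action(q, eps=0.1)
```

Setting eps to 0 recovers the purely exploitative (deterministic) behavior described above, while eps equal to 1 gives constant, undirected exploration; practical values sit in between and are often decayed over time.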
