Chapter 3: Finite Markov Decision Processes

Basic Definitions

The Markov decision process (MDP) is the most basic formulation of a sequential decision-making problem under the assumption of the Markov property: the current state and action are sufficient to determine the distribution of the next state and reward.

  1. State: The state must include information about all aspects of the past agent-environment interaction that make a difference for the future.
  2. Action: The actions are the choices made by the agent, through which it influences its future states and rewards.
  3. Reward: The reward defines what we want to achieve instead of how we want to achieve it.
  4. Dynamics: p(s', r | s, a) \doteq \Pr \{ S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a \}, which completely characterizes the environment of a finite MDP.
  5. Return: The return is defined as some function of the reward sequence, typically the (discounted) sum of future rewards
    For episodic tasks, we have G_t \doteq R_{t+1} + R_{t+2} + \cdots + R_T
    For continuing tasks, we have G_t \doteq \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} with discount rate 0 \le \gamma < 1
    They can be unified under the same framework as G_t \doteq \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k (allowing T = \infty or \gamma = 1, but not both) by adding an absorbing state with zero reward to the terminal of episodic tasks
    The recursive form of return is G_t = R_{t+1} + \gamma G_{t+1}, which forms the basis of the Bellman equations (a small computational sketch follows this list)
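
To make the return definitions concrete, here is a minimal Python sketch (the reward sequence and discount rate below are made up for illustration) that computes the discounted return both directly from the definition and via the recursion G_t = R_{t+1} + \gamma G_{t+1}:

```python
def discounted_return(rewards, gamma=0.9):
    """Direct definition: G_t = sum_k gamma^k * R_{t+k+1}."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

def returns_from_rewards(rewards, gamma=0.9):
    """Compute G_t for every step using the recursion G_t = R_{t+1} + gamma * G_{t+1}."""
    G = 0.0
    returns = []
    for r in reversed(rewards):  # work backwards from the end of the episode
        G = r + gamma * G
        returns.append(G)
    return list(reversed(returns))

# Hypothetical reward sequence from a short episode
rewards = [0.0, 0.0, 1.0, 0.0, 5.0]
print(discounted_return(rewards))      # G_0 computed directly from the definition
print(returns_from_rewards(rewards))   # [G_0, G_1, ...] via the recursion; the first entry matches
```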

Further notes:

  1. In the RL book, the reward obtained from taking action A_t in state S_t at time step t is denoted as R_{t+1} instead of R_t;
  2. RL beyond MDP assumption is an important research topic (also discussed in the RL book)
  3. The representation of the states and actions has a great influence on the learning process, but is beyond the scope of the RL book (many recent works actually focus on this topic)
  4. The RL book focuses on scalar reward signals, but there are also some recent works on multi-objective reward signals in vector form

Policies and Value Functions

A value function gives the expected return of a state or a state-action pair.
A policy is a mapping from states to the probabilities of selecting each possible action.
Value functions are defined w.r.t. particular policies, i.e., v_\pi (s) \doteq \mathbb{E}_\pi \big [ G_t \mid S_t = s \big ] and q_\pi (s, a) \doteq \mathbb{E}_\pi \big [ G_t \mid S_t = s, A_t = a \big ].
Based on the simple relationship G_t = R_{t+1} + \gamma G_{t+1}, we can derive the Bellman equations, which express the relationship between the value of a state (state-action pair) and the values of its successor states (state-action pairs), i.e., v_\pi (s) = \sum_a \pi (a | s) \sum_{s', r} p(s', r | s, a) \big [ r + \gamma v_\pi (s') \big ] \\ q_\pi (s, a) = \sum_{s', r} p(s', r | s, a) \big [ r + \gamma \sum_{a'} \pi (a' | s') q_\pi (s', a') \big ]. The value function is the unique solution to its Bellman equation and can be computed by solving a set of linear equations (see the sketch below). Notice that the assumption here is that the system dynamics p(s', r | s, a) are fully known.
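
For concreteness, here is a minimal Python/NumPy sketch of exact policy evaluation on a small made-up MDP (the transition probabilities P, rewards R, and policy pi below are arbitrary and only for illustration); it solves the linear system implied by the Bellman equation:

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9

# Made-up dynamics: P[a, s, s'] = p(s' | s, a), R[a, s] = expected immediate reward
P = np.array([
    [[0.8, 0.2, 0.0], [0.0, 0.6, 0.4], [0.3, 0.0, 0.7]],  # action 0
    [[0.1, 0.9, 0.0], [0.5, 0.0, 0.5], [0.0, 0.2, 0.8]],  # action 1
])
R = np.array([
    [1.0, 0.0, 2.0],   # action 0
    [0.0, 1.0, -1.0],  # action 1
])

# A fixed stochastic policy pi[s, a] = pi(a | s)
pi = np.array([
    [0.5, 0.5],
    [0.2, 0.8],
    [0.9, 0.1],
])

# Marginalize over the policy: P_pi[s, s'] and r_pi[s]
P_pi = np.einsum('sa,asn->sn', pi, P)
r_pi = np.einsum('sa,as->s', pi, R)

# Bellman equation in matrix form: v = r_pi + gamma * P_pi v  <=>  (I - gamma * P_pi) v = r_pi
v_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Action values follow from the state values: q_pi(s, a) = r(s, a) + gamma * sum_s' p(s'|s,a) v_pi(s')
q_pi = R.T + gamma * np.einsum('asn,n->sa', P, v_pi)
print("v_pi:", v_pi)
print("q_pi:", q_pi)
```
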
Another useful tool for visualizing these recursive relationships among value functions is the backup diagram.

Optimal Policies and Optimal Value Functions

Definition of a "better" policy: \pi \ge \pi' if and only if v_\pi (s) \ge v_{\pi'} (s) for all s \in \mathcal{S}.
There always exists an optimal value function and corresponding optimal policies (potentially more than one) for finite MDPs. Intuitively, if a policy is not optimal, we can always improve the value of some state s by changing the policy at this specific state. The improvement in the value of s will then propagate back to the values of all states that can reach s in the state transition graph. In this way, we can always obtain a better policy and gradually reach an optimal policy.
Based on the following simple relations, i.e., v_\pi (s) = \sum_a \pi (a | s) q_\pi (s, a) \rightarrow v_*(s) = \max_{a \in \mathcal{A}(s)} q_*(s, a) \\ \quad q_\pi (s, a) = \sum_{s', r} p(s', r | s, a) \big [ r + \gamma v_\pi (s') \big ] \rightarrow q_*(s, a) = \sum_{s', r} p(s', r | s, a) \big [ r + \gamma v_*(s') \big ], we have the Bellman optimality equations without reference to any specific policy: v_*(s) = \max_{a \in \mathcal{A}(s)} \sum_{s', r} p(s', r | s, a) \big [ r + \gamma v_*(s') \big ] \\ q_*(s, a) = \sum_{s', r} p(s', r | s, a) \big [ r + \gamma \max_{a'} q_*(s', a') \big ].
The optimal policy can then be derived by acting greedily with respect to the optimal value function: a one-step lookahead over v_* (using the dynamics), or simply an argmax over q_*(s, a).
Solving the Bellman optimality equation amounts to solving a system of nonlinear equations, under the assumptions of fully known system dynamics and the Markov property. Even if these two assumptions are satisfied, solving the equations exactly is computationally infeasible when the state space is very large. Consequently, different RL methods mainly focus on how to solve the Bellman optimality equation approximately (a tabular value-iteration sketch for the small, exactly solvable case follows below).
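
As a concrete tabular illustration, here is a minimal value-iteration sketch in Python/NumPy, reusing the same made-up dynamics P and R as in the policy-evaluation sketch above; it repeatedly applies the Bellman optimality backup and then extracts a greedy policy. This only works because the toy state space is tiny and the dynamics are fully known:

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9

# Same made-up dynamics as in the policy-evaluation sketch
P = np.array([
    [[0.8, 0.2, 0.0], [0.0, 0.6, 0.4], [0.3, 0.0, 0.7]],  # action 0
    [[0.1, 0.9, 0.0], [0.5, 0.0, 0.5], [0.0, 0.2, 0.8]],  # action 1
])
R = np.array([
    [1.0, 0.0, 2.0],   # action 0
    [0.0, 1.0, -1.0],  # action 1
])

v = np.zeros(n_states)
for _ in range(10_000):
    # q[s, a] = r(s, a) + gamma * sum_s' p(s' | s, a) v(s')
    q = R.T + gamma * np.einsum('asn,n->sa', P, v)
    v_new = q.max(axis=1)                  # Bellman optimality backup
    if np.max(np.abs(v_new - v)) < 1e-10:  # stop once the values have converged
        v = v_new
        break
    v = v_new

greedy_policy = q.argmax(axis=1)           # greedy action in each state
print("v_*:", v)
print("greedy policy:", greedy_policy)
```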

Further notes:
The MDP formulation of RL makes it closely related to (stochastic) optimal control.

Reinforcement learning adds to MDPs a focus on approximation and incomplete information for realistically large problems.

The online nature of reinforcement learning makes it possible to approximate optimal policies in ways that put more effort into learning to make good decisions for frequently encountered states, at the expense of less effort for infrequently encountered states.
