AI Planning
The world is:
1) dynamic
2) stochastic
3) partially observable
Actions:
1) take time
2) have continuous effects
1 Classical planning:
• States: described by the propositions that are currently true
• Actions: general state transformations described by
sets of pre- and post-conditions
• Represents a state-transition system (but more
Regression (backward search): from the goal state to the initial state
Progression (forward search): from the initial state to the goal state
Search methods: BFS and DFS
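Forward search with BFS can be sketched as follows. This is a minimal illustration, not a reference planner: the `Action` tuple, the `forward_bfs` function, and the three-room robot domain are hypothetical names and data chosen for the example. States are sets of true propositions, and actions carry precondition, add, and delete lists.

```python
from collections import deque
from typing import FrozenSet, List, NamedTuple, Optional


class Action(NamedTuple):
    name: str
    pre: FrozenSet[str]      # propositions that must hold before the action
    add: FrozenSet[str]      # propositions the action makes true
    delete: FrozenSet[str]   # propositions the action makes false


def forward_bfs(init: FrozenSet[str], goal: FrozenSet[str],
                actions: List[Action]) -> Optional[List[str]]:
    """Breadth-first forward search from the initial state to a goal state."""
    frontier = deque([(init, [])])
    visited = {init}
    while frontier:
        state, plan = frontier.popleft()
        if goal <= state:                 # every goal proposition holds
            return plan
        for a in actions:
            if a.pre <= state:            # the action is applicable
                nxt = (state - a.delete) | a.add
                if nxt not in visited:    # avoid revisiting states
                    visited.add(nxt)
                    frontier.append((nxt, plan + [a.name]))
    return None                           # the goal is unreachable


# Hypothetical toy domain: a robot moves between rooms A, B, and C.
ACTIONS = [
    Action("move_A_B", frozenset({"at_A"}), frozenset({"at_B"}), frozenset({"at_A"})),
    Action("move_B_C", frozenset({"at_B"}), frozenset({"at_C"}), frozenset({"at_B"})),
    Action("move_B_A", frozenset({"at_B"}), frozenset({"at_A"}), frozenset({"at_B"})),
]

plan = forward_bfs(frozenset({"at_A"}), frozenset({"at_C"}), ACTIONS)
print(plan)  # ['move_A_B', 'move_B_C']
```

Replacing `popleft()` with `pop()` turns the queue into a stack and gives DFS; BFS returns a shortest plan, while DFS does not guarantee one.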
STRIPS( s : state, g : goal )
returns: a sequence of actions that transforms s into g
1. Calculate the difference set d = g - s.
2. If d is empty, return an empty plan.
3. Choose an action a whose add-list has the most formulas contained in g.
4. p' = STRIPS( s, precondition of a )
5. Compute the new state s' by applying p' and a to s.
6. p = STRIPS( s', g )
7. return p'; a; p
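The recursion above can be sketched in Python. This is an illustrative simplification under stated assumptions: the `Action` tuple, `apply_plan`, and the room domain are hypothetical; the action is scored against the remaining difference d rather than all of g so that each choice makes progress; there is no backtracking, and a depth limit (also an assumption) guards against endless recursion.

```python
from typing import FrozenSet, List, NamedTuple


class Action(NamedTuple):
    name: str
    pre: FrozenSet[str]
    add: FrozenSet[str]
    delete: FrozenSet[str]


def apply_plan(state: FrozenSet[str], plan: List[Action]) -> FrozenSet[str]:
    """Apply the actions in order, checking each precondition."""
    for a in plan:
        if not a.pre <= state:
            raise ValueError(f"{a.name} is not applicable")
        state = (state - a.delete) | a.add
    return state


def strips(s: FrozenSet[str], g: FrozenSet[str],
           actions: List[Action], depth: int = 20) -> List[Action]:
    d = g - s                                   # difference set
    if not d:
        return []                               # nothing left to achieve
    if depth == 0:
        raise RuntimeError("depth limit reached")
    # Assumption: only consider actions that add something still missing.
    candidates = [a for a in actions if a.add & d]
    if not candidates:
        raise RuntimeError("no action achieves the remaining goals")
    a = max(candidates, key=lambda act: len(act.add & d))
    p1 = strips(s, a.pre, actions, depth - 1)   # plan to reach a's precondition
    s1 = apply_plan(s, p1 + [a])                # new state after p1 and a
    p2 = strips(s1, g, actions, depth - 1)      # plan for what remains
    return p1 + [a] + p2


# Hypothetical toy domain: rooms A -> B -> C.
ACTIONS = [
    Action("move_A_B", frozenset({"at_A"}), frozenset({"at_B"}), frozenset({"at_A"})),
    Action("move_B_C", frozenset({"at_B"}), frozenset({"at_C"}), frozenset({"at_B"})),
]

init, goal = frozenset({"at_A"}), frozenset({"at_C"})
plan = strips(init, goal, ACTIONS)
print([a.name for a in plan])  # ['move_A_B', 'move_B_C']
```

Because the sketch commits to the first best-scoring action and never backtracks, it can fail on problems where subgoals interfere with one another.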
Refineplan( P : Plan set )
1. If P is empty, fail.
2. If a minimal candidate of P is a solution, return it. End.
3. Select a refinement strategy R.
4. Apply R to P to get a new plan set P'.
5. Call Refineplan( P' ).
Termination is ensured if R is complete and monotonic.
• State space refinement: e.g. STRIPS
• Plan space refinement: e.g. least commitment
• Task refinement: e.g. HTN
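The Refineplan loop can be sketched generically. Everything concrete here is an assumption made for illustration: plans are fully ground tuples of operator names (so each plan is its own minimal candidate), the refinement strategy simply appends one operator to every plan, the toy arithmetic problem (reach 6 from 1 using `inc` and `dbl`) is hypothetical, and a round limit stands in for the termination argument.

```python
from typing import Callable, Optional, Set, Tuple

Plan = Tuple[str, ...]


def refine_plan(plans: Set[Plan],
                is_solution: Callable[[Plan], bool],
                refine: Callable[[Set[Plan]], Set[Plan]],
                max_rounds: int = 50) -> Optional[Plan]:
    """Iterative form of the recursive Refineplan procedure."""
    for _ in range(max_rounds):
        if not plans:                  # step 1: empty plan set, fail
            return None
        for p in sorted(plans):        # step 2: test the candidates
            if is_solution(p):
                return p
        plans = refine(plans)          # steps 3-4: apply the strategy R
    return None                        # round limit reached


# Hypothetical toy problem: start at 1, reach 6.
OPS = {"inc": lambda x: x + 1, "dbl": lambda x: x * 2}


def run(plan: Plan, start: int = 1) -> int:
    value = start
    for name in plan:
        value = OPS[name](value)
    return value


def extend(plans: Set[Plan]) -> Set[Plan]:
    """Refinement strategy: append every operator to every plan."""
    return {p + (name,) for p in plans for name in OPS}


solution = refine_plan({()}, lambda p: run(p) == 6, extend)
print(solution, run(solution))
```

Swapping `extend` for a different function changes the refinement strategy without touching the loop, which is the point of the Refineplan abstraction.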
2 Stochastic environment
In a stochastic environment, we use a Markov decision process (MDP) to model the problem and to plan.
Conventional tree search runs into problems here:
1) The branching factor is too large.
2) The tree is very deep.
3) Many states are visited more than once.
Usage: robot navigation, i.e. planning a route from x to y in a stochastic environment.
s ∈ {states};
a ∈ {actions(s)}; state transition: T(s, a, s');
Fully observable: S, A
Stochastic state transition: T(s, a, s') = Pr(s' | s, a)
Reward in state s: R(s). This is a short-term, primitive value used to judge a policy.
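As a minimal sketch of these ingredients, an MDP can be stored in plain dictionaries. The three-state corridor x -> m -> y, the action names `go` and `back`, and all probabilities and rewards below are hypothetical values chosen only for illustration.

```python
import random
from typing import Dict, List, Tuple

# T[(s, a)] lists (s', Pr(s' | s, a)) pairs for a hypothetical corridor x -> m -> y.
T: Dict[Tuple[str, str], List[Tuple[str, float]]] = {
    ("x", "go"): [("m", 0.8), ("x", 0.2)],    # moving may fail and stay put
    ("m", "go"): [("y", 0.8), ("m", 0.2)],
    ("m", "back"): [("x", 1.0)],
}

# R[s] is the immediate reward in state s; y is the goal, with no actions.
R: Dict[str, float] = {"x": -1.0, "m": -1.0, "y": 10.0}


def actions(s: str) -> List[str]:
    """Actions available in state s."""
    return sorted(a for (s0, a) in T if s0 == s)


def step(s: str, a: str, rng: random.Random) -> str:
    """Sample a successor state s' according to Pr(s' | s, a)."""
    successors, probs = zip(*T[(s, a)])
    return rng.choices(successors, weights=probs, k=1)[0]


rng = random.Random(0)
print(actions("m"))              # ['back', 'go']
print(step("m", "back", rng))    # x (the only possible successor)
```

Each row of `T` must sum to 1, since it is a probability distribution over successor states.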
How do we find a plan in an MDP, i.e. how do we solve an MDP?
Here we introduce some further quantities needed to solve an MDP.
Policy: π(s) -> the action taken in state s.
Value of a state (node): Vπ(s), the expected total reward obtained by starting in state s and following policy π. This is a long-term evaluation function of the policy. Reinforcement learning aims for the best V(s), not the best R(s).
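The objective:
Find a policy π(s) that maximizes the expected discounted total reward: $E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t)\right] \to \max$
Discount factor: $\gamma$, with $0 \le \gamma \le 1$; a reward received t steps in the future is weighted by $\gamma^t$.
Value function: $V^{\pi}(s) = E_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t) \mid s_0 = s\right]$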
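To make the discounted sum concrete, here is a small sketch that computes the discounted total reward of one observed reward sequence. The function name `discounted_return` and the sample rewards are assumptions for illustration; the value function is the expectation of this quantity over all trajectories the policy can produce.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma**t * rewards[t], accumulated from the last reward backward."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical trajectory: two steps costing -1 each, then a reward of 10.
g = discounted_return([-1.0, -1.0, 10.0], gamma=0.9)
print(round(g, 4))  # 6.2, i.e. -1 + 0.9 * (-1) + 0.81 * 10
```

With gamma = 1 the result is the plain sum of rewards; smaller values of gamma make the agent prefer rewards that arrive sooner.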
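Planning = calculating value functions
How do we compute the maximum of this function?
Use value iteration to get optimal policies.
For every state s, value iteration computes the next estimate by choosing the action that maximizes the expected value of the successor state: $V_{k+1}(s) = R(s) + \gamma \max_a \sum_{s'} \Pr(s' \mid s, a)\, V_k(s')$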
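Value iteration can be sketched as below, assuming the update V(s) <- R(s) + γ max_a Σ_s' Pr(s' | s, a) V(s') with rewards attached to states. The corridor MDP, its probabilities and rewards, the tolerance, and the function names are all hypothetical choices made for this example; states without actions are treated as terminal.

```python
from typing import Dict, List, Tuple

Transitions = Dict[Tuple[str, str], List[Tuple[str, float]]]

# Hypothetical corridor x -> m -> y; y is terminal (no actions).
STATES = ["x", "m", "y"]
T: Transitions = {
    ("x", "go"): [("m", 0.8), ("x", 0.2)],
    ("m", "go"): [("y", 0.8), ("m", 0.2)],
    ("m", "back"): [("x", 1.0)],
}
R = {"x": -1.0, "m": -1.0, "y": 10.0}


def q_value(s: str, a: str, V: Dict[str, float]) -> float:
    """Expected value of the successor state after taking a in s."""
    return sum(p * V[s2] for s2, p in T[(s, a)])


def value_iteration(gamma: float = 0.9, tol: float = 1e-6) -> Dict[str, float]:
    V = {s: 0.0 for s in STATES}
    while True:
        new_V = {}
        for s in STATES:
            acts = [a for (s0, a) in T if s0 == s]
            # Terminal states have no actions: their future value is 0.
            best = max((q_value(s, a, V) for a in acts), default=0.0)
            new_V[s] = R[s] + gamma * best
        delta = max(abs(new_V[s] - V[s]) for s in STATES)
        V = new_V
        if delta < tol:                # values have stopped changing
            return V


def greedy_policy(V: Dict[str, float]) -> Dict[str, str]:
    """Pick, in each non-terminal state, the action with the best expected value."""
    policy = {}
    for s in STATES:
        acts = [a for (s0, a) in T if s0 == s]
        if acts:
            policy[s] = max(acts, key=lambda a: q_value(s, a, V))
    return policy


V = value_iteration()
policy = greedy_policy(V)
print({s: round(v, 3) for s, v in V.items()})
print(policy)  # {'x': 'go', 'm': 'go'}
```

The optimal policy is read off the converged values by acting greedily, which is why planning in an MDP reduces to computing the value function.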