Reinforcement Learning Week 14 Course Notes

This week

  • Watch the CCC (Coordinating, Communicating, and Coaching) lesson.
  • Readings: Ziebart et al. (2008), Babes et al. (2011), Griffith et al. (2013), Cederborg et al. (2015)*, Roberts (2006), Bhat (2007)*.
CCC
Coordinating and communicating
The decentralized partially observable Markov decision process (Dec-POMDP)
  • A Dec-POMDP combines perspectives from game theory and MDPs.
  • Multiple agents work toward a common reward. (If each agent has its own separate reward, the model is instead a POSG, a partially observable stochastic game.)
DEC-POMDPs properties
DEC-POMDPs example
  • Two agents: each knows where it is but not where the other is. When both are in the same room, they win.
  • Strategy: go to a shared room. But an agent's knowledge of its own position could be wrong (the world is partially observable).
Communicating and Coaching
  • Agent 1 wants to set up some kind of reward function that induces the other agent to do something (e.g., get the apple for me).

Inverse Reinforcement Learning

Inverse Reinforcement Learning
  • Inverse reinforcement learning: the agent observes the environment and a set of behaviors (demonstrations), then infers a reward function that explains those inputs.
MLIRL: Maximum Likelihood Inverse Reinforcement Learning (a rough sketch below).
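A rough sketch of what MLIRL does, as I understand it from the Babes et al. (2011) reading: assume a small tabular MDP whose reward is linear in state features, give the learner a Boltzmann (softmax) policy over its Q-values, and do gradient ascent on the log-likelihood of the demonstrated (state, action) pairs. All names and parameters here (soft_value_iteration, beta, the finite-difference gradient) are illustrative assumptions, not the lecture's actual code.

```python
import numpy as np

def soft_value_iteration(R, P, gamma=0.95, beta=5.0, iters=100):
    """Boltzmann ('soft') policy for a tabular MDP.
    R: reward per state, shape (nS,).  P: list of (nS, nS) transition
    matrices, one per action.  Returns pi with pi[s, a] = p(a | s)."""
    nA, nS = len(P), len(R)
    Q = np.zeros((nS, nA))
    for _ in range(iters):
        expQ = np.exp(beta * (Q - Q.max(axis=1, keepdims=True)))
        pi = expQ / expQ.sum(axis=1, keepdims=True)
        V = (pi * Q).sum(axis=1)                       # expected value under pi
        Q = np.stack([R + gamma * P[a] @ V for a in range(nA)], axis=1)
    expQ = np.exp(beta * (Q - Q.max(axis=1, keepdims=True)))
    return expQ / expQ.sum(axis=1, keepdims=True)

def mlirl(demos, phi, P, steps=100, lr=0.1, eps=1e-4):
    """Gradient ascent on the log-likelihood of demonstrated (state, action)
    pairs, with reward R(s) = theta . phi(s).  Numerical gradient for brevity."""
    theta = np.zeros(phi.shape[1])

    def log_lik(th):
        pi = soft_value_iteration(phi @ th, P)
        return sum(np.log(pi[s, a]) for s, a in demos)

    for _ in range(steps):
        grad = np.zeros_like(theta)
        for i in range(len(theta)):                    # finite differences
            e = np.zeros_like(theta)
            e[i] = eps
            grad[i] = (log_lik(theta + e) - log_lik(theta - e)) / (2 * eps)
        theta += lr * grad
    return theta                                       # learned reward weights
```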
MLIRL result
CCC

Policy Shaping

Policy Shaping
  • If a human gives feedback (commentary) on whether the agent's action is good or bad, s/he is doing policy shaping.
  • Could policy shaping be realized through reward shaping, i.e., by replacing an action's reward with a new reward?
  • The agent needs a mechanism for combining what it learns from the environment with the commentary when deciding on a policy (not just following the commentary, because the commentary might not always be right).
Quiz 1: Policy Shaping
  • If the human is always correct, then given the feedback, what is the probability that each action (x, y, or z) is optimal?
  • The answers are on the quiz slide.
Quiz 2: Policy Shaping
  • What if the human is right only with probability 0.8?
  • Counting method:
    • Saying x is optimal is like saying y and z are not optimal.
    • Since the human is correct with probability 0.8, the (unnormalized) probabilities of x, y, and z being optimal are 0.8, 0.2, and 0.2.
    • Normalizing these numbers gives 2/3, 1/6, 1/6.
Policy shaping probability calculation
  • Δa is the net feedback count computed from the feedback data for action a (d_a); C is the probability that the person giving the commentary is correct.
  • The formula on the slide uses Δa and C to compute the probability that action a is optimal (a sketch of it is below).
  • Note: the final probability still needs to be normalized against the probabilities of the other actions.
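A small sketch of that calculation, assuming the formula is the one from Griffith et al. (2013): Pr(a is optimal | Δa) = C^Δa / (C^Δa + (1 − C)^Δa), normalized across actions. The Δ values below are a hypothetical reading of Quiz 2, treating "x is optimal" as one 'good' label for x and one 'bad' label each for y and z.

```python
def p_optimal(delta_a, C):
    """Probability that action a is optimal, given its net feedback count
    delta_a (number of 'good' labels minus 'bad' labels) and the probability
    C that the human's feedback is correct."""
    return C ** delta_a / (C ** delta_a + (1 - C) ** delta_a)

# Hypothetical Quiz 2 reading: C = 0.8; x gets +1, y and z each get -1.
raw = {a: p_optimal(d, 0.8) for a, d in {"x": +1, "y": -1, "z": -1}.items()}
total = sum(raw.values())
print({a: round(p / total, 3) for a, p in raw.items()})
# -> {'x': 0.667, 'y': 0.167, 'z': 0.167}, i.e. 2/3, 1/6, 1/6 after normalizing
```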
Quiz 3: How do we combine information from multiple sources in policy shaping?
  • In the policy-shaping setting, information comes from multiple sources.
  • E.g., π_A and π_H are the policy information coming from the agent exploring the world and from the human giving feedback, respectively.
  • Some algorithms decrease the importance of π_H as time goes on; note that the human's uncertainty is already incorporated through C.
  • The way to combine the two sources is to compute the probability that both policies pick the same action: a_opt = argmax_a p(a|π_1) · p(a|π_2) (see the sketch below).
    • In the quiz, x_opt = 1/15, y_opt = 1/60, and z_opt = 2/15, so we should choose z as the optimal action.
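A tiny sketch of that combination rule. π_H is taken from Quiz 2; π_A = (0.1, 0.1, 0.8) is back-calculated by me so the products reproduce the quiz values, i.e. it is an assumption, not something given on the slide.

```python
# Hypothetical distributions: pi_H from Quiz 2; pi_A chosen so the products
# match the quiz values 1/15, 1/60, 2/15.
pi_H = {"x": 2 / 3, "y": 1 / 6, "z": 1 / 6}   # from the human's feedback
pi_A = {"x": 0.1, "y": 0.1, "z": 0.8}         # from the agent's own exploration

combined = {a: pi_A[a] * pi_H[a] for a in pi_A}   # p(both sources pick a)
best = max(combined, key=combined.get)
print(combined)   # ~ {'x': 0.067, 'y': 0.017, 'z': 0.133}
print(best)       # -> 'z'
```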

Drama Management

Drama management world
  • Ways a human can communicate intent to an agent:
    • demonstration: show the agent the correct actions (inverse RL)
    • reward shaping: give rewards for the agent's actions
    • policy shaping: give commentary on the agent's actions
  • In drama management, the author conveys his intent to the agent (the drama manager) so that the agent can steer the story toward the experience the author wants.
Drama Management: what's a story
  • A story can be defined as a trajectory through plot points.
Trajectories as MDP
  • The slide above maps MDP elements onto trajectory-MDP elements.
  • Problems:
    • The number of possible state sequences (trajectories) is huge, hyper-exponential in the number of plot points (see the quick count below).
    • Since an MDP maximizes expected reward, treating the story as an MDP only makes the author happy and forces the player to experience one particular story.
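A quick count illustrating the blow-up, under the assumption that a story is an ordered sequence of distinct plot points drawn from a pool of 10 (the pool size is hypothetical):

```python
from math import perm  # Python 3.8+

n = 10  # hypothetical number of plot points
trajectories = sum(perm(n, k) for k in range(1, n + 1))
print(trajectories)  # 9,864,100 ordered sequences from only 10 plot points
```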
TTD-MDP: Targeted Trajectory Distribution MDPs
  • p(t'|a, t) is the probability that the player, at trajectory t, after story action a is taken, ends up in trajectory t'. P(T) is a target distribution over trajectories.
  • The actions here are not the player's actions but the story actions.
  • The optimal policy is the one that leads to (as closely as possible) the target trajectory distribution P(T); a toy sketch follows this list.
  • The computation time is linear, depending on the length of the story.
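A toy sketch of that idea, with a simplification of my own: at each trajectory node, pick the single story action whose successor distribution is closest (in KL divergence) to the target's conditional distribution. The real TTD-MDP solution optimizes over stochastic action choices, and all the numbers below are made up.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions given as aligned lists."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def pick_story_action(target_next, trans_next):
    """target_next[t2]: desired probability of moving to successor trajectory t2.
    trans_next[a][t2]: p(t2 | current trajectory, story action a).
    Returns the action whose successor distribution best matches the target."""
    succs = list(target_next)
    target = [target_next[t2] for t2 in succs]
    return min(trans_next,
               key=lambda a: kl(target, [trans_next[a][t2] for t2 in succs]))

# Hypothetical numbers: two story actions, three possible next trajectories.
target = {"t1": 0.5, "t2": 0.3, "t3": 0.2}
trans = {"a": {"t1": 0.6, "t2": 0.3, "t3": 0.1},
         "b": {"t1": 0.2, "t2": 0.3, "t3": 0.5}}
print(pick_story_action(target, trans))  # -> 'a'
```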

What have we learned

recap
2015-11-18 First draft completed
