A computational approach to learning whereby an agent tries to maximize the total amount of reward it receives while interacting with a complex and uncertain environment. - Sutton and Barto
Action: move LEFT or RIGHT
https://www.youtube.com/watch?v=WXuK6gekU1Y
A chess player makes a move: the choice is informed both by planning (anticipating possible replies and counter-replies) and by intuitive judgment of the resulting positions.
A gazelle calf struggles to stand; 30 minutes later it can run at 36 kilometers per hour.
Portfolio management.
Playing Atari games
Action: move UP or DOWN
From Andrej Karpathy's blog: http://karpathy.github.io/2016/05/31/rl/
https://www.youtube.com/watch?v=gn4nRCC9TwQ
https://ai.googleblog.com/2016/03/deep-learning-for-robots-learning-from.html
https://www.youtube.com/watch?v=jwSbzNHGflM
https://www.youtube.com/watch?v=ixmE5nt2o88
The agent learns to interact with the environment
All goals of the agent can be described by the maximization of expected cumulative reward.
+/- reward for winning or losing a game
+/- reward for running with its mom or being eaten
+/- reward for each profit or loss in $
+/- reward for increasing or decreasing scores
Objective of the agent: select a series of actions to maximize total future rewards
Actions may have long term consequences
Reward may be delayed
Trade-off between immediate reward and long-term reward
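To make the trade-off between immediate and long-term reward concrete, here is a minimal sketch that computes a discounted return for made-up reward sequences (the discount factor is introduced formally below; the numbers are purely illustrative):

def discounted_return(rewards, gamma=0.9):
    # G = r_1 + gamma * r_2 + gamma^2 * r_3 + ...
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

print(discounted_return([0, 0, 0, 10]))  # one delayed reward: 0.9**3 * 10 = 7.29
print(discounted_return([1, 1, 1, 1]))   # steady small rewards: 1 + 0.9 + 0.81 + 0.729 = 3.439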
The history is the sequence of observations, actions, and rewards:
$H_t = O_1, R_1, A_1, \ldots, A_{t-1}, O_t, R_t$
What happens next depends on the history
The state is a function of the history used to determine what happens next:
$S_t = f(H_t)$
Environment state and agent state
$S_t^e = f^e(H_t)$, $S_t^a = f^a(H_t)$
Full observability: the agent directly observes the environment state; formally this is a Markov decision process (MDP):
$O_t = S_t^e = S_t^a$
Partial observability: the agent observes the environment only indirectly; formally this is a partially observable Markov decision process (POMDP).
An RL agent may include one or more of these components:
Value function: expected discounted sum of future rewards under a particular policy $\pi$
Discount factor weights immediate vs future rewards
Used to quantify goodness/badness of states and actions
$v_{\pi}(s) \doteq \mathbb{E}_{\pi}[G_t \mid S_t = s] = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right], \text{ for all } s \in \mathcal{S}$
Q-function (could be used to select among actions)
$q_{\pi}(s, a) \doteq \mathbb{E}_{\pi}[G_t \mid S_t = s, A_t = a] = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right]$
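As a small illustration of "selecting among actions" with a Q-function, a greedy choice from a tabular estimate is just an argmax (the Q-values below are made up purely for illustration):

import numpy as np

Q = np.array([[0.1, 0.5],   # hypothetical Q-values: row = state, column = action
              [0.7, 0.2]])

def greedy_action(state):
    return int(np.argmax(Q[state]))  # pick the action with the highest estimated value

print(greedy_action(0), greedy_action(1))  # -> 1 0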
A model predicts what the environment will do next
Predict the next state: $P_{ss'}^{a} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$
Definition of MDP
$P^a$ is the dynamics/transition model for each action:
$P(S_{t+1} = s' \mid S_t = s, A_t = a)$
$R$ is the reward function: $R(S_t = s, A_t = a) = \mathbb{E}[R_t \mid S_t = s, A_t = a]$
Discount factor $\gamma \in [0, 1]$
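Putting these pieces together, here is a minimal sketch of a two-state MDP written with plain Python dictionaries, plus iterative policy evaluation of a fixed policy (the states, actions, rewards, and policy are invented purely for illustration):

gamma = 0.9
# P[s][a] is a list of (next_state, probability); R[s][a] is the expected reward.
P = {"s0": {"stay": [("s0", 1.0)], "go": [("s1", 1.0)]},
     "s1": {"stay": [("s1", 1.0)], "go": [("s0", 1.0)]}}
R = {"s0": {"stay": 0.0, "go": 1.0},
     "s1": {"stay": 2.0, "go": 0.0}}
pi = {"s0": "go", "s1": "stay"}  # a fixed deterministic policy

# Iterative policy evaluation: v(s) <- R(s, pi(s)) + gamma * sum_s' P(s'|s, pi(s)) * v(s')
v = {s: 0.0 for s in P}
for _ in range(200):
    v = {s: R[s][pi[s]] + gamma * sum(p * v[s2] for s2, p in P[s][pi[s]]) for s in P}
print(v)  # roughly {'s0': 19.0, 's1': 20.0}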
Credit: David Silver's slides
The agent only experiences what happens for the actions it tries!
How should an RL agent balance its actions?
There is often an exploration-exploitation trade-off (see the ε-greedy sketch after the examples below)
Restaurant Selection
Online Banner Advertisements
Oil Drilling
Game Playing
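A minimal ε-greedy sketch on a made-up 3-armed bandit (think of the arms as restaurants, banner ads, or drilling sites); all numbers are invented for illustration:

import numpy as np

rng = np.random.default_rng(0)
true_means = [1.0, 2.0, 1.5]   # hypothetical expected rewards, unknown to the agent
q_est = np.zeros(3)            # running estimate of each arm's value
counts = np.zeros(3)
epsilon = 0.1

for t in range(1000):
    if rng.random() < epsilon:
        a = int(rng.integers(3))       # explore: try a random arm
    else:
        a = int(np.argmax(q_est))      # exploit: pull the best arm so far
    r = true_means[a] + rng.normal()   # noisy reward
    counts[a] += 1
    q_est[a] += (r - q_est[a]) / counts[a]  # incremental mean update

print(q_est)  # estimates approach true_means, with most pulls on arm 1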
https://github.com/metalbubble/RLexample
https://github.com/openai/retro
import gym

env = gym.make("Taxi-v2")
observation = env.reset()
agent = load_agent()  # placeholder: load_agent() stands for a trained policy mapping observations to actions
for step in range(100):
    action = agent(observation)
    observation, reward, done, info = env.step(action)
    if done:
        observation = env.reset()  # start a new episode when the current one ends
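A note on versions: in classic gym releases the snippet above works as written; in the newer gymnasium API, env.reset() returns (observation, info) and env.step() returns (observation, reward, terminated, truncated, info), and the Taxi environment is registered as Taxi-v3 rather than Taxi-v2, so the calls have to be adjusted accordingly.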
https://gym.openai.com/envs/#classic_control
https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py
import gym
env = gym.make("CartPole-v0")
env.reset()
env.render() # display the rendered scene
action = env.action_space.sample()
observation, reward, done, info = env.step(action)
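Extending the single env.step call above into a complete random-policy rollout (a minimal sketch, assuming the same classic gym API):

import gym

env = gym.make("CartPole-v0")
observation = env.reset()
total_reward = 0.0
for t in range(200):
    env.render()                            # optional visualization
    action = env.action_space.sample()      # random policy: push the cart left or right
    observation, reward, done, info = env.step(action)
    total_reward += reward
    if done:                                # the pole fell over or the cart left the track
        break
print("episode return:", total_reward)
env.close()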
Cross-Entropy Method (CEM)
https://gist.github.com/kashif/5dfa12d80402c559e060d567ea352c06
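For reference, a compact sketch of CEM on CartPole with a linear threshold policy; this is a simplified illustration in the spirit of the gist above, not a copy of it:

import gym
import numpy as np

def run_episode(env, theta, max_steps=200):
    obs = env.reset()
    total = 0.0
    for _ in range(max_steps):
        action = 1 if np.dot(theta, obs) > 0 else 0   # linear threshold policy
        obs, reward, done, _ = env.step(action)
        total += reward
        if done:
            break
    return total

env = gym.make("CartPole-v0")
mu, sigma = np.zeros(4), np.ones(4)   # Gaussian over the 4 policy parameters
n_samples, n_elite = 50, 10

for it in range(20):
    thetas = mu + sigma * np.random.randn(n_samples, 4)        # sample candidate policies
    returns = np.array([run_episode(env, th) for th in thetas])
    elite = thetas[np.argsort(returns)[-n_elite:]]             # keep the top performers
    mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-3   # refit the sampling distribution
    print(it, returns.mean())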
import gym
env = gym.make("Pong-v0")
env.reset()
env.render() # display the rendered scene
python my_random_agent.py Pong-v0
python pg-pong.py
Loading weights: pong_bolei.p (model trained overnight)
observation = env.reset()
cur_x = prepro(observation)  # preprocess the raw frame into a flat float vector
x = cur_x - pre_x if pre_x is not None else np.zeros_like(cur_x)  # difference frame captures motion; pre_x starts as None
pre_x = cur_x
aprob, h = policy_forward(x)  # forward the policy network: probability of moving UP, plus hidden activations
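The prepro call above is not shown on the slide; here is a sketch following the preprocessing described in Karpathy's post (crop, downsample, erase background, binarize; treat the exact constants as assumptions):

import numpy as np

def prepro(frame):
    # Turn a 210x160x3 uint8 Atari frame into a flat 80*80 float vector.
    frame = frame[35:195]        # crop away the scoreboard and bottom border
    frame = frame[::2, ::2, 0]   # downsample by 2 and keep a single color channel
    frame[frame == 144] = 0      # erase background color 1
    frame[frame == 109] = 0      # erase background color 2
    frame[frame != 0] = 1        # paddles and ball become 1
    return frame.astype(np.float32).ravel()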
Randomized action:
action = 2 if np.random.uniform() < aprob else 3 # roll the dice!
h = np.dot(W1, x)  # hidden layer pre-activations
h[h < 0] = 0  # ReLU nonlinearity: threshold at zero
logp = np.dot(W2, h)  # logit (log-odds) of going UP
p = 1.0 / (1.0 + np.exp(-logp))  # sigmoid gives the probability of going UP
How do we optimize W1 and W2?
Policy gradient! (To be introduced in a future lecture.)
http://karpathy.github.io/2016/05/31/rl
https://github.com/cuhkrlcourse/RLexample
Please read Sutton and Barto: Chapters 1 and 3.