Reinforcement learning notes for my undergraduate thesis project. Many RL terms become ambiguous when expressed in Chinese, so these notes are written in English. Link to the David Silver reinforcement learning course videos
Slides for the David Silver reinforcement learning course.
Lecture One
Abstract
- it is all about decision making
- no supervisor, only a reward signal (not supervised learning)
- feedback is delayed, not instantaneous
- the key role: the agent (the "brain")
The process
- background: the environment (the earth in the slide diagram); input: observation & reward -> output: action
- history and state:
History: $$H_t = O_1, R_1, A_1, \dots, A_{t-1}, O_t, R_t$$
State of the agent: $$S_t^a = f(H_t)$$
State of the environment: $$S_t^e$$ (the environment's private representation, generally not directly visible to the agent)
Key concepts
- Markov state
- an RL agent may include one or more of:
- policy: the agent's behaviour function, mapping the current state to an action
- value function: a prediction of future reward, used to evaluate states and actions
- model: the agent's internal representation of how the "world" behaves (predicting the next state and reward)
- exploitation (use what is already known to be good) vs. exploration (try something new to gather information)
The classification of agents
- Value Based: no explicit policy; the agent keeps a value function and acts by choosing the action with maximal value
- Policy Based: no value function; the policy maps states directly to actions
- Actor-Critic (AC): an actor (the policy) plus a critic (the value function)
Let us review immediately!
Lecture Two
Introduction to MDPs (Markov Decision Processes)
Almost all RL problems can be formalised as MDPs.
Markov Property and Markov Process
the future is independent of the past given the present
Markov Property
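Formally (in the course's notation), a state $S_t$ is Markov if and only if
$$P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_1, \dots, S_t]$$
i.e. the state captures all relevant information from the history.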
Markov Process
State Transition Matrix
the state transition probability from a Markov state s to a successor state s'
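For reference, the definition is
$$\mathcal{P}_{ss'} = P[S_{t+1} = s' \mid S_t = s]$$
and the state transition matrix $\mathcal{P}$ collects these probabilities, with each row summing to 1.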
Markov Reward Process
A Markov reward process is a Markov chain with values, i.e. a tuple $(S, P, R, \gamma)$ of states, transition probabilities, a reward function, and a discount factor $\gamma$.
Return ($G_t$)
The return $G_t$ is the total discounted reward from time-step t.
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
From the equation above, a discount factor close to 0 makes the evaluation "myopic" (immediate rewards dominate), while a discount factor close to 1 makes it "far-sighted".
Value Function: Expectation
the value function v(s) gives the long-term value of state s.
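Written as an expectation (the standard definition),
$$v(s) = E[G_t \mid S_t = s]$$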
Then we can derive the Bellman equation for MRPs:
In an MRP the reward depends only on the state, so the value function can be decomposed into the immediate reward plus the discounted value of the successor state:
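In symbols, the decomposition reads
$$v(s) = E[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s] = \mathcal{R}_s + \gamma \sum_{s' \in S} \mathcal{P}_{ss'} v(s')$$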
Bellman Equation for MRPs in matrix form:
The Bellman equation can be expressed concisely using matrices:
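With the value function written as a column vector over states, this is
$$v = \mathcal{R} + \gamma \mathcal{P} v$$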
The Bellman equation for an MRP is a linear equation, so it can be solved directly:
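Rearranging the matrix form gives the closed-form solution
$$v = (I - \gamma \mathcal{P})^{-1} \mathcal{R}$$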
But the time complexity of the direct solution is $O(n^3)$ in the number of states, so for large MRPs we instead use iterative methods (see the sketch after this list):
- Dynamic Programming
- Monte Carlo evaluation
- Temporal-Difference(TD) Learning
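As a concrete illustration of both routes, here is a minimal Python/NumPy sketch; the three-state MRP, its rewards, and $\gamma$ are made up for the example:

```python
import numpy as np

# A tiny made-up MRP: 3 states, row-stochastic transition matrix P,
# reward vector R, discount factor gamma.
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.0, 0.0, 1.0]])  # state 2 is absorbing
R = np.array([1.0, 2.0, 0.0])
gamma = 0.9

# Direct solution of v = R + gamma * P v, i.e. v = (I - gamma P)^{-1} R
# (O(n^3) in the number of states).
v_direct = np.linalg.solve(np.eye(3) - gamma * P, R)

# Iterative solution: repeatedly apply the Bellman backup v <- R + gamma * P v.
v_iter = np.zeros(3)
for _ in range(1000):
    v_iter = R + gamma * P @ v_iter

print(v_direct)  # both approaches agree up to numerical tolerance
print(v_iter)
```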
Let us look at some more concepts in preparation!
Markov Decision Processes
A Markov Decision Process $(S, A, P, R, \gamma)$ is a Markov reward process with decisions. It is an environment in which all states are Markov.
Policies
A policy $\pi$ is a distribution over actions given states.
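In symbols,
$$\pi(a \mid s) = P[A_t = a \mid S_t = s]$$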
For an MDP we work with both a state-value function and an action-value function, defined below:
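The standard definitions (in the course's notation) are
$$v_\pi(s) = E_\pi[G_t \mid S_t = s], \qquad q_\pi(s, a) = E_\pi[G_t \mid S_t = s, A_t = a]$$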
Just like the Bellman equation for MRPs, we can derive the Bellman expectation equation from the state-value and action-value functions:
We finally obtain Bellman expectation equations for $v_\pi$ and $q_\pi$; the matrix form is:
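For reference, the equations are
$$v_\pi(s) = \sum_{a \in A} \pi(a \mid s)\, q_\pi(s, a), \qquad q_\pi(s, a) = \mathcal{R}_s^a + \gamma \sum_{s' \in S} \mathcal{P}_{ss'}^a v_\pi(s')$$
and the matrix form (which, like the MRP case, can be solved directly) is
$$v_\pi = \mathcal{R}^\pi + \gamma \mathcal{P}^\pi v_\pi, \qquad v_\pi = (I - \gamma \mathcal{P}^\pi)^{-1} \mathcal{R}^\pi$$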
Optimal Value Functions
The next question: how can we judge how good a policy is? This is captured by the optimal value functions:
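They give the best possible performance over all policies:
$$v_*(s) = \max_\pi v_\pi(s), \qquad q_*(s, a) = \max_\pi q_\pi(s, a)$$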
Bellman Optimality Equation
We can derive the Bellman optimality equation from the optimal value functions above:
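In symbols,
$$v_*(s) = \max_a q_*(s, a), \qquad q_*(s, a) = \mathcal{R}_s^a + \gamma \sum_{s' \in S} \mathcal{P}_{ss'}^a v_*(s')$$
Because of the max, this equation is non-linear and has no general closed-form solution.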
We can solve the optimality equation with iterative methods such as:
- Value iteration
- Policy iteration
- Q-Learning
- Sarsa
Extensions of MDPs
- Infinite MDPs
- Reductions of POMDPs
Infinite MDPs
POMDPs:
A POMDP (partially observable MDP) can be seen as a hidden Markov model with actions added!
Let us review immediately!
Lecture 3 : Planning by Dynamic Programming
Introduction
- Dynamic: a sequential or temporal component to the problem
- Programming: optimisation, as in "mathematical programming" (optimising a "program", i.e. a policy)
Dynamic programming applies to problems with two properties, and MDPs satisfy both:
- optimal substructure
- overlapping subproblems
There are two main uses of DP in an MDP:
for prediction:
- Input: MDP & policy $\pi$
- Output: value function $v_\pi$
for control:
- Input: MDP
- Output: optimal value function $v_*$ & optimal policy $\pi_*$
Iterative Policy Evaluation
How do we evaluate a given policy $\pi$?
- solution: iterative application of the Bellman expectation backup, $v_0 \to v_1 \to \dots \to v_\pi$, converging to the true value function
- each sweep backs up the value of every state from the values of its successor states, working back from the end of the problem.
Each backup is just a policy-weighted average, over the actions, of the immediate reward plus the discounted successor values (see the sketch below).
For the details, see the manuscript!
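A minimal sketch of this backup in Python/NumPy, assuming the MDP is given as arrays `P[s, a, s']` (transition probabilities), `R[s, a]` (expected immediate rewards) and `pi[s, a]` (the policy); these array layouts are my own convention for illustration, not something fixed by the course:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, theta=1e-6):
    """Iteratively apply the Bellman expectation backup until the value function stops changing.

    P[s, a, s'] : transition probabilities
    R[s, a]     : expected immediate reward for taking action a in state s
    pi[s, a]    : probability that the policy takes action a in state s
    """
    n_states = P.shape[0]
    v = np.zeros(n_states)                     # v_0 = 0 everywhere
    while True:
        # For every state: policy-weighted average over actions of
        # (immediate reward + discounted value of successor states).
        v_new = np.sum(pi * (R + gamma * P @ v), axis=1)
        if np.max(np.abs(v_new - v)) < theta:  # stop when the largest change is tiny
            return v_new
        v = v_new
```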
Policy Iteration:
- evaluate the policy $\pi$ (e.g. fill the gridworld/maze with values) to get $v_\pi$:
$$v_\pi(s) = E_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s]$$
- improve the policy by acting greedily with respect to $v_\pi$:
$$\pi' = \mathrm{greedy}(v_\pi)$$
- Policy improvement:
For the details, see the manuscript!
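A sketch of the full policy-iteration loop under the same assumed array layout as above (evaluation by Bellman expectation backups, then greedy improvement):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, theta=1e-6):
    """Alternate policy evaluation and greedy policy improvement until the policy stops changing."""
    n_states, n_actions = R.shape
    pi = np.full((n_states, n_actions), 1.0 / n_actions)  # start from the uniform random policy
    while True:
        # 1. Policy evaluation: Bellman expectation backups until convergence.
        v = np.zeros(n_states)
        while True:
            v_new = np.sum(pi * (R + gamma * P @ v), axis=1)
            if np.max(np.abs(v_new - v)) < theta:
                break
            v = v_new
        # 2. Policy improvement: act greedily w.r.t. q(s, a) = R(s, a) + gamma * sum_s' P(s, a, s') v(s').
        q = R + gamma * P @ v
        pi_new = np.eye(n_actions)[np.argmax(q, axis=1)]  # one-hot greedy policy
        if np.array_equal(pi_new, pi):
            return pi, v
        pi = pi_new
```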
Value Iteration
The solution $v_*(s)$ can be found by iterating one-step lookaheads; intuitively, it starts from the final rewards and works backwards:
Here is an example:
For the details, see the manuscript!
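A sketch of value iteration under the same assumed array layout; it never maintains an explicit policy, only extracting a greedy one at the end:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-6):
    """Apply the Bellman optimality backup v(s) <- max_a [R(s,a) + gamma * sum_s' P(s,a,s') v(s')] until convergence."""
    n_states, n_actions = R.shape
    v = np.zeros(n_states)
    while True:
        v_new = np.max(R + gamma * P @ v, axis=1)  # one-step lookahead, then take the best action
        if np.max(np.abs(v_new - v)) < theta:
            break
        v = v_new
    pi = np.argmax(R + gamma * P @ v, axis=1)      # greedy policy extracted from the converged values
    return v, pi
```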
Synchronous Dynamic Programming Algorithms
Iterative policy evaluation and policy iteration can be viewed together: both are built around an explicit policy and the Bellman expectation backup, so they are essentially the same family of methods (value iteration instead uses the Bellman optimality backup).
Asynchronous Dynamic Programming
- In-Place Dynamic Programming
the main difference from synchronous DP is that in-place DP keeps only one copy of the value function instead of two, reducing storage.
- Prioritised sweeping
- Real-time dynamic programming