Reinforcement learning notes for my undergraduate thesis project. Many RL terms become ambiguous when expressed in Chinese, so these notes are written in English. Link to the David Silver reinforcement learning course videos
Slides for the David Silver reinforcement learning course.
Lecture One
Abstract
- it is all about decision making
- no supervisor, only a reward signal (not supervised learning)
- feedback is delayed, not instantaneous
- the key role: the agent (the "brain")
The process
- background: the environment (the earth in the slide diagram); input: observation & reward -> output: action
- history and state:
History: $$H_t = O_1, R_1, A_1, \dots, A_{t-1}, O_t, R_t$$
State of the agent: $$S_t^a = f(H_t)$$
State of the environment: $$S_t^e$$ (the environment's private representation, generally not directly visible to the agent)
Key concepts
- Markov state
- an RL agent may include one or more of:
- policy: the agent's behaviour function, mapping the current state to an action
- value function: a prediction of future reward, used to evaluate states and actions
- model: the agent's internal representation of how the "world" behaves (predicting the next state and reward)
- exploitation (use what is already known to be good) vs. exploration (try something new to gather information)
The classification of agents
- Value Based: no explicit policy; the agent keeps a value function and acts by choosing the action with maximal value
- Policy Based: no value function; the policy maps states directly to actions
- Actor-Critic (AC): an actor (the policy) plus a critic (the value function)
Let us review immediately!
Lecture Two
Introduction to MDPs (Markov Decision Processes)
Almost all RL problems can be formalised as MDPs.
Markov Property and Markov Process
the future is independent of the past given the present
Markov Property
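Formally (in the course's notation), a state $S_t$ is Markov if and only if
$$P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_1, \dots, S_t]$$
i.e. the state captures all relevant information from the history.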
Markov Process
State Transition Matrix
the state transition probability from a Markov state s to a successor state s'
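For reference, the definition is
$$\mathcal{P}_{ss'} = P[S_{t+1} = s' \mid S_t = s]$$
and the state transition matrix $\mathcal{P}$ collects these probabilities, with each row summing to 1.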
Markov Reward Process
A Markov reward process is a Markov chain with values, i.e. a tuple $(S, P, R, \gamma)$ of states, transition probabilities, a reward function, and a discount factor $\gamma$.
Return ($G_t$)
The return $G_t$ is the total discounted reward from time-step t.
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
From the equation above, a discount factor close to 0 makes the evaluation "myopic" (immediate rewards dominate), while a discount factor close to 1 makes it "far-sighted".
Value Function: Expectation
the value function v(s) gives the long-term value of state s.
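Written as an expectation (the standard definition),
$$v(s) = E[G_t \mid S_t = s]$$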
Then we can derive the Bellman equation for MRPs:
In an MRP the reward depends only on the state, so the value function can be decomposed into the immediate reward plus the discounted value of the successor state:
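In symbols, the decomposition reads
$$v(s) = E[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s] = \mathcal{R}_s + \gamma \sum_{s' \in S} \mathcal{P}_{ss'} v(s')$$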
Bellman Equation for MRPs in matrix form:
The Bellman equation can be expressed concisely using matrices:
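With the value function written as a column vector over states, this is
$$v = \mathcal{R} + \gamma \mathcal{P} v$$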
The Bellman equation for an MRP is a linear equation, so it can be solved directly:
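Rearranging the matrix form gives the closed-form solution
$$v = (I - \gamma \mathcal{P})^{-1} \mathcal{R}$$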
But the time complexity of the direct solution is $O(n^3)$ in the number of states, so for large MRPs we instead use iterative methods (see the sketch after this list):
- Dynamic Programming
- Monte Carlo evaluation
- Temporal-Difference(TD) Learning
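As a concrete illustration of both routes, here is a minimal Python/NumPy sketch; the three-state MRP, its rewards, and $\gamma$ are made up for the example:

```python
import numpy as np

# A tiny made-up MRP: 3 states, row-stochastic transition matrix P,
# reward vector R, discount factor gamma.
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.0, 0.0, 1.0]])  # state 2 is absorbing
R = np.array([1.0, 2.0, 0.0])
gamma = 0.9

# Direct solution of v = R + gamma * P v, i.e. v = (I - gamma P)^{-1} R
# (O(n^3) in the number of states).
v_direct = np.linalg.solve(np.eye(3) - gamma * P, R)

# Iterative solution: repeatedly apply the Bellman backup v <- R + gamma * P v.
v_iter = np.zeros(3)
for _ in range(1000):
    v_iter = R + gamma * P @ v_iter

print(v_direct)  # both approaches agree up to numerical tolerance
print(v_iter)
```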
Let us look at some more concepts in preparation!
Markov Decision Processes
A Markov Decision Process $(S, A, P, R, \gamma)$ is a Markov reward process with decisions. It is an environment in which all states are Markov.
Policies
A policy $\pi$ is a distribution over actions given states.
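In symbols,
$$\pi(a \mid s) = P[A_t = a \mid S_t = s]$$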
For an MDP we work with both a state-value function and an action-value function, defined below:
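The standard definitions (in the course's notation) are
$$v_\pi(s) = E_\pi[G_t \mid S_t = s], \qquad q_\pi(s, a) = E_\pi[G_t \mid S_t = s, A_t = a]$$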
Just like the Bellman equation for MRPs, we can derive the Bellman expectation equation from the state-value and action-value functions:
We finally obtain Bellman expectation equations for $v_\pi$ and $q_\pi$; the matrix form is:
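For reference, the equations are
$$v_\pi(s) = \sum_{a \in A} \pi(a \mid s)\, q_\pi(s, a), \qquad q_\pi(s, a) = \mathcal{R}_s^a + \gamma \sum_{s' \in S} \mathcal{P}_{ss'}^a v_\pi(s')$$
and the matrix form (which, like the MRP case, can be solved directly) is
$$v_\pi = \mathcal{R}^\pi + \gamma \mathcal{P}^\pi v_\pi, \qquad v_\pi = (I - \gamma \mathcal{P}^\pi)^{-1} \mathcal{R}^\pi$$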
Optimal Value Functions
The next question: how can we judge how good a policy is? This is captured by the optimal value functions:
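They give the best possible performance over all policies:
$$v_*(s) = \max_\pi v_\pi(s), \qquad q_*(s, a) = \max_\pi q_\pi(s, a)$$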
Bellman Optimality Equation
We can derive the Bellman optimality equation from the optimal value functions above:
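In symbols,
$$v_*(s) = \max_a q_*(s, a), \qquad q_*(s, a) = \mathcal{R}_s^a + \gamma \sum_{s' \in S} \mathcal{P}_{ss'}^a v_*(s')$$
Because of the max, this equation is non-linear and has no general closed-form solution.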
We can solve the optimality equation with iterative methods such as:
- Value iteration
- Policy iteration
- Q-Learning
- Sarsa
Extensions of MDPs
- Infinite MDPs
- Reductions of POMDPs
Infinite MDPs
POMDPs:
A POMDP (partially observable MDP) can be seen as a hidden Markov model with actions added!
Let us review immediately!
Lecture 3 : Planning by Dynamic Programming
Introduction
- Dynamic: a sequential or temporal component to the problem
- Programming: optimisation, as in "mathematical programming" (optimising a "program", i.e. a policy)
Dynamic programming applies to problems with two properties, and MDPs satisfy both:
- optimal substructure
- overlapping subproblems
There are two main uses of DP in an MDP:
for prediction:
- Input: MDP & policy $\pi$
- Output: value function $v_\pi$
for control:
- Input: MDP
- Output: optimal value function $v_*$ & optimal policy $\pi_*$
Iterative Policy Evaluation
How do we evaluate a given policy $\pi$?
- solution: iterative application of the Bellman expectation backup, $v_0 \to v_1 \to \dots \to v_\pi$, converging to the true value function
- each sweep backs up the value of every state from the values of its successor states, working back from the end of the problem.
Each backup is just a policy-weighted average, over the actions, of the immediate reward plus the discounted successor values (see the sketch below).
For the details, see the manuscript!
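A minimal sketch of this backup in Python/NumPy, assuming the MDP is given as arrays `P[s, a, s']` (transition probabilities), `R[s, a]` (expected immediate rewards) and `pi[s, a]` (the policy); these array layouts are my own convention for illustration, not something fixed by the course:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, theta=1e-6):
    """Iteratively apply the Bellman expectation backup until the value function stops changing.

    P[s, a, s'] : transition probabilities
    R[s, a]     : expected immediate reward for taking action a in state s
    pi[s, a]    : probability that the policy takes action a in state s
    """
    n_states = P.shape[0]
    v = np.zeros(n_states)                     # v_0 = 0 everywhere
    while True:
        # For every state: policy-weighted average over actions of
        # (immediate reward + discounted value of successor states).
        v_new = np.sum(pi * (R + gamma * P @ v), axis=1)
        if np.max(np.abs(v_new - v)) < theta:  # stop when the largest change is tiny
            return v_new
        v = v_new
```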
Policy Iteration:
- evaluate the policy $\pi$ (e.g. fill the gridworld/maze with values) to get $v_\pi$:
$$v_\pi(s) = E_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s]$$
- improve the policy by acting greedily with respect to $v_\pi$:
$$\pi' = \mathrm{greedy}(v_\pi)$$
- Policy improvement:
For the details, see the manuscript!
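A sketch of the full policy-iteration loop under the same assumed array layout as above (evaluation by Bellman expectation backups, then greedy improvement):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, theta=1e-6):
    """Alternate policy evaluation and greedy policy improvement until the policy stops changing."""
    n_states, n_actions = R.shape
    pi = np.full((n_states, n_actions), 1.0 / n_actions)  # start from the uniform random policy
    while True:
        # 1. Policy evaluation: Bellman expectation backups until convergence.
        v = np.zeros(n_states)
        while True:
            v_new = np.sum(pi * (R + gamma * P @ v), axis=1)
            if np.max(np.abs(v_new - v)) < theta:
                break
            v = v_new
        # 2. Policy improvement: act greedily w.r.t. q(s, a) = R(s, a) + gamma * sum_s' P(s, a, s') v(s').
        q = R + gamma * P @ v
        pi_new = np.eye(n_actions)[np.argmax(q, axis=1)]  # one-hot greedy policy
        if np.array_equal(pi_new, pi):
            return pi, v
        pi = pi_new
```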
Value Iteration
The solution $v_*(s)$ can be found by iterating one-step lookaheads; intuitively, it starts from the final rewards and works backwards:
Here is an example:
For the details, see the manuscript!
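A sketch of value iteration under the same assumed array layout; it never maintains an explicit policy, only extracting a greedy one at the end:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-6):
    """Apply the Bellman optimality backup v(s) <- max_a [R(s,a) + gamma * sum_s' P(s,a,s') v(s')] until convergence."""
    n_states, n_actions = R.shape
    v = np.zeros(n_states)
    while True:
        v_new = np.max(R + gamma * P @ v, axis=1)  # one-step lookahead, then take the best action
        if np.max(np.abs(v_new - v)) < theta:
            break
        v = v_new
    pi = np.argmax(R + gamma * P @ v, axis=1)      # greedy policy extracted from the converged values
    return v, pi
```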
Synchronous Dynamic Programming Algorithms
Iterative policy evaluation and policy iteration can be viewed together: both are built around an explicit policy and the Bellman expectation backup, so they are essentially the same family of methods (value iteration instead uses the Bellman optimality backup).
Asynchronous Dynamic Programming
- In-Place Dynamic Programming
the main difference from synchronous DP is that in-place DP keeps only one copy of the value function instead of two, reducing storage.
- Prioritised sweeping
- Real-time dynamic programming