Reinforcement Learning - David Silver (Lecture 1 to Lecture 3)

Reinforcement learning notes for my undergraduate thesis project. Many reinforcement learning terms become ambiguous when expressed in Chinese, so these notes are written in English. David Silver's reinforcement learning course videos
David Silver's reinforcement learning course slides

Lecture One

Abstract

  • it is all about decision making
  • no supervisor, only a reward signal (this is not supervised learning)
  • feedback is delayed, not instantaneous
  • the key role: the agent (the "brain")

The process

  • background: the environment (the "earth"); inputs: observation & reward -> output: action
  • history and state:
    History: $$H_t = O_1, R_1, A_1, \ldots, A_{t-1}, O_t, R_t$$
    State of the agent: $$S_t^a = f(H_t)$$
    State of the environment: $$S_t^e$$ (the environment's private representation)

Key ideas and fundamental assumptions

  1. Markov state
  2. components an agent may include:
    • policy: decides the agent's behaviour given the current state
    • value function: predicts the future reward of the agent's actions
    • model: the agent's internal representation of how the "world" behaves (state & reward)
  3. exploitation (use what is already known) vs. exploration (try something new)

The classification of agent

  • Value Based: learn a value function only; the policy is implicit (e.g. take the action with $\max V$)
  • Policy Based: learn a policy that maps states to actions directly, without a value function
  • Actor Critic (AC): actor (i.e. policy) plus critic (i.e. value function)

Let us review immediately!


Lecture Two

Introduction to MDPs (Markov Decision Processes)

Almost all RL problems can be formalised as MDPs.

Markov Property and Markov Process

the future is independent of the past given the present

Markov Property

Markov Process

State Transition Matrix

The state transition probability from a Markov state $s$ to a successor state $s'$:
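$$\mathcal{P}_{ss'} = P[S_{t+1} = s' \mid S_t = s]$$

Collecting every pair $(s, s')$ gives the state transition matrix $\mathcal{P}$, each of whose rows sums to 1.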

Markov Reward Process

A Markov reward process is a Markov chain with values: a tuple $(\mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma)$ of states, transition probabilities, rewards, and a discount factor $\gamma$.

Return ($G_t$)

The return $G_t$ is the total discounted reward from time-step $t$:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
We can see from the equation above that if the discount factor is close to 0 the return is "myopic" (immediate rewards dominate), whereas if it is close to 1 the return is "far-sighted" (delayed rewards count almost in full).
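As a quick numerical illustration of this, here is a minimal sketch; the reward sequence below is made up for the example:

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

rewards = [1.0, 1.0, 1.0, 10.0]           # hypothetical reward sequence
print(discounted_return(rewards, 0.1))    # "myopic": the late reward of 10 barely counts
print(discounted_return(rewards, 0.99))   # "far-sighted": the late reward counts almost fully
```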

Value Function: Expectation

The value function $v(s)$ gives the long-term value of state $s$:
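$$v(s) = E[G_t \mid S_t = s]$$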

From the definition of the return we can derive the Bellman equation for MRPs.

The reward depends only on the state, so the value function can be decomposed into the immediate reward plus the discounted value of the successor state:
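$$v(s) = E[R_{t+1} + \gamma\, v(S_{t+1}) \mid S_t = s] = \mathcal{R}_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}\, v(s')$$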

Bellman Equation for MRPs in Matrix Form

The Bellman equation can be expressed concisely using matrices:
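$$v = \mathcal{R} + \gamma \mathcal{P} v$$

where $v$ and $\mathcal{R}$ are column vectors with one entry per state.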

Since the Bellman equation is linear, it can be solved directly:
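$$v = (I - \gamma \mathcal{P})^{-1} \mathcal{R}$$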

But the time complexity of the direct solution is $O(n^3)$ in the number of states, so for larger problems we turn to iterative methods (a small numerical check of the direct solution is sketched after this list):

  • Dynamic Programming
  • Monte Carlo evaluation
  • Temporal-Difference(TD) Learning
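As a sanity check of the direct solution above, here is a minimal numpy sketch; the 3-state transition matrix and reward vector are made-up toy numbers:

```python
import numpy as np

# Toy 3-state MRP, just to exercise the closed form v = (I - gamma*P)^{-1} R.
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.0, 0.8],
              [0.0, 0.0, 1.0]])           # state transition matrix (rows sum to 1)
R = np.array([1.0, 2.0, 0.0])             # expected immediate reward per state
gamma = 0.9

# solve the linear system instead of forming the inverse explicitly
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)
```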

Let us look at some more concepts in preparation!

Markov Decision Processes

A Markov Decision Process $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$ is a Markov reward process with decisions. It is an environment in which all states are Markov.

Policies

A policy $\pi$ is a distribution over actions given states:
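$$\pi(a \mid s) = P[A_t = a \mid S_t = s]$$

A policy fully defines the behaviour of an agent, and for an MDP it depends only on the current state, not on the whole history.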

For an MDP we work with both a state-value function and an action-value function, defined as follows:
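$$V^{\pi}(s) = E_{\pi}[G_t \mid S_t = s], \qquad Q^{\pi}(s, a) = E_{\pi}[G_t \mid S_t = s, A_t = a]$$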

Just like the Bellman equation for MRPs, we can derive the Bellman expectation equation by decomposing the state-value and action-value functions.

This finally gives Bellman expectation equations for both $V^{\pi}$ and $Q^{\pi}$, together with a matrix form:
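$$V^{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, Q^{\pi}(s, a), \qquad Q^{\pi}(s, a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a\, V^{\pi}(s')$$

$$V^{\pi} = \mathcal{R}^{\pi} + \gamma \mathcal{P}^{\pi} V^{\pi} \quad\Rightarrow\quad V^{\pi} = (I - \gamma \mathcal{P}^{\pi})^{-1} \mathcal{R}^{\pi}$$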

Optimal Value Functions

The next question: how do we judge the performance of a policy?
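The optimal value functions give the best achievable performance over all policies:

$$V^{*}(s) = \max_{\pi} V^{\pi}(s), \qquad Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$$

A policy that attains them is an optimal policy, and one always exists for a finite MDP.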

Bellman Optimality Equation

We can derive the Bellman optimality equation from the definitions above:
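$$V^{*}(s) = \max_{a} Q^{*}(s, a), \qquad Q^{*}(s, a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a\, V^{*}(s')$$

Unlike the Bellman expectation equation, the $\max$ makes this equation non-linear, so in general there is no closed-form solution.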





We can solve this equation with:

  • Value iteration
  • Policy iteration
  • Q-Learning
  • Sarsa

Extensions of MDPs

  • Infinite MDPs
  • Reductions of POMDPs

infinite MDPs

POMDP:

A POMDP can be seen as a hidden Markov model with actions added!

Let us review immediately!



Lecture 3 : Planning by Dynamic Programming

Introduction

  • Dynamic: the problem has a sequential or temporal component
  • Programming: we are optimising a "program", i.e. a policy; DP applies when the problem has:
    • optimal substructure
    • overlapping subproblems

  There are two applications of DP:

  1. for prediction:

    • Input: MDP & policy $\pi$
    • Output: value function $v_\pi$
  2. for control:

    • Input: MDP
    • Output: optimal value function $v_*$ & optimal policy $\pi_*$

Iterative Policy Evaluation

How to evaluate $\pi$?

  • Solution: iteratively apply the Bellman expectation backup to converge to the true value function ($v_0 \to v_1 \to \ldots \to v_\pi$)
  • values propagate backwards from the terminal states over successive iterations.

  Each backup is just a probability-weighted average over the actions the policy may choose and the successor states they lead to.

For the details, see the manuscript!
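A minimal Python sketch of this evaluation loop; the MDP layout (`P[s][a]` as a list of `(prob, next_state, reward)` triples) and the function name are my own assumptions, not something from the lecture:

```python
import numpy as np

def policy_evaluation(P, policy, gamma=0.9, theta=1e-6):
    """Synchronous Bellman expectation backups until the values stop changing."""
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        V_new = np.zeros(n_states)
        for s in range(n_states):
            # weighted average over actions (policy) and successor states (dynamics)
            V_new[s] = sum(policy[s][a] * sum(p * (r + gamma * V[s2])
                                              for p, s2, r in P[s][a])
                           for a in range(len(P[s])))
        if np.max(np.abs(V_new - V)) < theta:
            return V_new
        V = V_new
```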

Policy Iteration:

  1. evaluate the policy $\pi$ (e.g. fill the gridworld maze with values) to get $v_\pi$:
    $$v_\pi(s) = E[R_{t+1} + \gamma R_{t+2} + \ldots \mid S_t = s]$$
  2. improve the policy by acting greedily with respect to $v_\pi$:
    $$\pi' = \text{greedy}(v_\pi)$$

  Policy improvement: acting greedily with respect to $v_\pi$ never makes the policy worse, so repeating evaluation and improvement converges to the optimal policy $\pi^*$.

For the details, see the manuscript!
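Putting the two steps together, here is a sketch of policy iteration built on the `policy_evaluation` sketch above (same assumed `P[s][a]` layout; `n_actions` is a hypothetical parameter):

```python
import numpy as np

def policy_iteration(P, n_actions, gamma=0.9):
    """Alternate policy evaluation and greedy policy improvement until stable."""
    n_states = len(P)
    policy = np.ones((n_states, n_actions)) / n_actions   # start from a uniform random policy
    while True:
        V = policy_evaluation(P, policy, gamma)            # step 1: evaluate the policy
        stable = True
        for s in range(n_states):
            # step 2: act greedily with respect to v_pi
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in range(n_actions)]
            best = int(np.argmax(q))
            if policy[s][best] != 1.0:
                stable = False
            policy[s] = np.eye(n_actions)[best]            # deterministic greedy policy
        if stable:
            return policy, V
```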

Value Iteration

The solution $v_*(s)$ can be found by one-step lookahead: start with the final rewards and work backwards.

Here is an example:

For the details, see the manuscript!
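A matching sketch of value iteration on the same assumed `P[s][a] -> [(prob, next_state, reward)]` layout; it repeats the one-step lookahead (Bellman optimality backup) until the values stop changing:

```python
import numpy as np

def value_iteration(P, gamma=0.9, theta=1e-6):
    """Bellman optimality backup: v(s) <- max_a sum_{s'} p * (r + gamma * v(s'))."""
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        V_new = np.zeros(n_states)
        for s in range(n_states):
            V_new[s] = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                           for a in range(len(P[s])))
        if np.max(np.abs(V_new - V)) < theta:
            return V_new
        V = V_new
```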

Synchronous Dynamic Programming Algorithms

Iterative policy evaluation and policy iteration can be viewed as one body of knowledge: both are organised around a policy (the Bellman expectation backup plus, for policy iteration, greedy improvement), so they are essentially the same idea.

Asynchronous Dynamic Programming

  • In-Place Dynamic Programming

The main difference from synchronous DP is how many copies of the value function are stored: in-place DP keeps only one array instead of two, which reduces storage (a short sketch of the difference follows this list).

  • Prioritised sweeping

  • Real-time dynamic programming
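As mentioned above, here is a short sketch of that storage difference; the `backup` argument stands for any single-state Bellman backup and is a hypothetical helper, not something from the lecture:

```python
import numpy as np

def synchronous_sweep(V, backup):
    """Synchronous DP: read from the old copy, write to a new one (two arrays)."""
    V_new = np.empty_like(V)
    for s in range(len(V)):
        V_new[s] = backup(s, V)   # every read sees the old values
    return V_new

def in_place_sweep(V, backup):
    """In-place DP: a single array; later states immediately see the updated values."""
    for s in range(len(V)):
        V[s] = backup(s, V)
    return V
```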
