Zhengyuan Zhu

Reinforcement Learning - David Silver (Lecture 1 to Lecture 3)

These are reinforcement learning notes for my undergraduate graduation project. Many reinforcement learning terms become ambiguous when expressed in Chinese, so these notes are written in English. References: the David Silver Reinforcement Learning Course Video and the David Silver Reinforcement Learning Course Materials.

Lecture One

Abstract

The process

Markov Reward Process

A Markov reward process is a Markov chain with values, defined by the tuple $(S, P, R, \gamma)$: a state space, a state-transition probability matrix, a reward function, and the discount factor $\gamma$.

Return ($G_t$)

The return $G_t$ is the total discounted reward from time-step $t$.

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

We can see from the equation above that if the discount factor is close to 0 the return is "myopic", whereas if the discount factor is close to 1 the return is "far-sighted".
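As a minimal sketch, the backward recursion $G_t = R_{t+1} + \gamma G_{t+1}$ makes the return easy to compute for a finite reward sequence (the reward values below are made up for illustration):

```python
def discounted_return(rewards, gamma):
    """Compute G_t = R_{t+1} + gamma*R_{t+2} + ... for a finite reward list."""
    g = 0.0
    # iterate backwards: G_t = R_{t+1} + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, 0.0))  # myopic: only the first reward counts -> 1.0
print(discounted_return(rewards, 1.0))  # far-sighted: plain sum -> 3.0
```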

Value Function:Expectation

The value function $v(s)$ gives the long-term value of state $s$:

$$v(s) = \mathbb{E}[G_t \mid S_t = s]$$

From this we can deduce the Bellman equation for MRPs:

$$v(s) = \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s]$$

Since the reward depends only on the state, the Bellman equation decomposes the value into the immediate reward plus the discounted value of the successor state:

$$v(s) = \mathcal{R}_s + \gamma \sum_{s' \in S} \mathcal{P}_{ss'}\, v(s')$$

Bellman Equation for MRPs in Matrix Form

The Bellman equation can be expressed concisely using matrices:

$$v = \mathcal{R} + \gamma \mathcal{P} v$$

The Bellman equation is linear, so we can solve it directly:

$$v = (I - \gamma \mathcal{P})^{-1} \mathcal{R}$$

But the time complexity is $O(n^3)$ for $n$ states, so for large problems we solve it with iterative methods (e.g. dynamic programming, Monte-Carlo evaluation, temporal-difference learning):
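A minimal sketch of both solution methods on a hypothetical 2-state MRP (the numbers in `P` and `R` are made up for illustration):

```python
# A hypothetical 2-state MRP: P, R, and gamma are chosen for illustration.
P = [[0.9, 0.1],
     [0.2, 0.8]]          # transition matrix P[s][s']
R = [1.0, 2.0]            # expected immediate reward per state
gamma = 0.9

# Direct solution v = (I - gamma*P)^{-1} R, written out for the 2x2 case.
a = 1 - gamma * P[0][0]; b = -gamma * P[0][1]
c = -gamma * P[1][0];    d = 1 - gamma * P[1][1]
det = a * d - b * c
v_direct = [(d * R[0] - b * R[1]) / det,
            (-c * R[0] + a * R[1]) / det]

# Iterative solution: repeatedly apply the backup v <- R + gamma*P*v.
v = [0.0, 0.0]
for _ in range(1000):
    v = [R[s] + gamma * sum(P[s][t] * v[t] for t in range(2))
         for s in range(2)]

print(v_direct, v)  # the two methods agree
```

Both converge to the same fixed point because the Bellman backup is a $\gamma$-contraction.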

Before moving on, let us look at a few more concepts in preparation!

Markov Decision Processes

A Markov decision process $(S, A, P, R, \gamma)$ is a Markov reward process with decisions. It is an environment in which all states are Markov.

Policies

A policy $\pi$ is a distribution over actions given states:

$$\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$$

For an MDP we must work with both the state-value function and the action-value function, defined as:

$$V^{\pi}(s) = \mathbb{E}_{\pi}[G_t \mid S_t = s], \qquad Q^{\pi}(s,a) = \mathbb{E}_{\pi}[G_t \mid S_t = s, A_t = a]$$

Just like the Bellman equation for MRPs, we can deduce the Bellman expectation equation from the state-value and action-value functions:

$$V^{\pi}(s) = \sum_{a \in A} \pi(a \mid s)\, Q^{\pi}(s,a), \qquad Q^{\pi}(s,a) = \mathcal{R}_s^a + \gamma \sum_{s' \in S} \mathcal{P}_{ss'}^a V^{\pi}(s')$$

Substituting one into the other gives separate Bellman expectation equations for $V^{\pi}$ and $Q^{\pi}$, and the matrix form is:

$$V^{\pi} = \mathcal{R}^{\pi} + \gamma \mathcal{P}^{\pi} V^{\pi}$$
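As a sketch, averaging over the policy turns the MDP into an MRP with $\mathcal{R}^{\pi}$ and $\mathcal{P}^{\pi}$, which we can then evaluate; the 2-state, 2-action MDP and policy below are hypothetical, made up for illustration:

```python
# Hypothetical 2-state, 2-action MDP.
# P[s][a][s'] : transition probability, R[s][a] : expected reward.
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.0, 1.0]]]
R = [[1.0, 0.0],
     [2.0, -1.0]]
pi = [[0.5, 0.5],
      [1.0, 0.0]]          # pi[s][a] = probability of action a in state s
gamma = 0.9

# Averaging over the policy turns the MDP into an MRP:
#   R_pi[s]     = sum_a pi(a|s) R(s,a)
#   P_pi[s][s'] = sum_a pi(a|s) P(s'|s,a)
R_pi = [sum(pi[s][a] * R[s][a] for a in range(2)) for s in range(2)]
P_pi = [[sum(pi[s][a] * P[s][a][t] for a in range(2)) for t in range(2)]
        for s in range(2)]

# Solve V_pi = R_pi + gamma * P_pi * V_pi by iteration.
v = [0.0, 0.0]
for _ in range(1000):
    v = [R_pi[s] + gamma * sum(P_pi[s][t] * v[t] for t in range(2))
         for s in range(2)]

# Recover Q_pi(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V_pi(s').
q = [[R[s][a] + gamma * sum(P[s][a][t] * v[t] for t in range(2))
      for a in range(2)] for s in range(2)]
print(v, q)
```

At the fixed point the two value functions satisfy $V^{\pi}(s) = \sum_a \pi(a \mid s)\, Q^{\pi}(s,a)$, which is a handy consistency check.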

Optimal Value Functions

A question arises: how can we judge the performance of our policy? The optimal value functions give the best achievable performance over all policies:

$$V^{*}(s) = \max_{\pi} V^{\pi}(s), \qquad Q^{*}(s,a) = \max_{\pi} Q^{\pi}(s,a)$$

Optimal Bellman Equation

From the definitions above we can deduce the Bellman optimality equation:

$$V^{*}(s) = \max_{a} Q^{*}(s,a), \qquad Q^{*}(s,a) = \mathcal{R}_s^a + \gamma \sum_{s' \in S} \mathcal{P}_{ss'}^a V^{*}(s')$$

Because of the max operator, the Bellman optimality equation is non-linear and has no closed-form solution in general; we solve it with iterative methods such as value iteration, policy iteration, Q-learning, and Sarsa.

Extensions of MDPs

Infinite MDPs

POMDPs:

A partially observable MDP can be seen as a hidden Markov model with actions added!

Let us review immediately!

Lecture 3 : Planning by Dynamic Programming

Introduction

For prediction: given an MDP and a policy $\pi$, compute the value function $V^{\pi}$.

For control: given an MDP, compute the optimal value function $V^{*}$ and an optimal policy $\pi^{*}$.

Iterative Policy Evaluation

How do we evaluate $\pi$? We apply the Bellman expectation equation as an iterative backup:

$$V_{k+1}(s) = \sum_{a \in A} \pi(a \mid s)\left(\mathcal{R}_s^a + \gamma \sum_{s' \in S} \mathcal{P}_{ss'}^a V_k(s')\right)$$

Each backup is just a weighted average, over the probability of each action, of the one-step lookahead values.

For the details, see the manuscript!

Policy Iteration

For the details, see the manuscript!
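A minimal policy-iteration sketch on a hypothetical 2-state, 2-action MDP (the numbers are made up for illustration): each round evaluates the current deterministic policy, then improves it greedily, and stops once the policy is stable:

```python
# Policy iteration on a hypothetical 2-state, 2-action MDP.
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.0, 1.0]]]    # P[s][a][s'] transition probabilities
R = [[1.0, 0.0],
     [2.0, -1.0]]                 # R[s][a] expected rewards
gamma = 0.9
policy = [0, 0]                   # deterministic: one action index per state

for _ in range(20):               # outer loop: evaluate, then improve
    # Policy evaluation: iterate v <- R^pi + gamma * P^pi * v to convergence.
    v = [0.0, 0.0]
    for _ in range(500):
        v = [R[s][policy[s]] + gamma * sum(P[s][policy[s]][t] * v[t]
                                           for t in range(2))
             for s in range(2)]
    # Policy improvement: act greedily with respect to v.
    new_policy = [max(range(2), key=lambda a: R[s][a] + gamma *
                      sum(P[s][a][t] * v[t] for t in range(2)))
                  for s in range(2)]
    if new_policy == policy:      # a stable policy is optimal
        break
    policy = new_policy

print(policy, v)
```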

Value Iteration

The solution $v_*(s)$ can be found by one-step lookahead; value iteration starts with the final rewards and works backwards:

$$v_{k+1}(s) = \max_{a} \left(\mathcal{R}_s^a + \gamma \sum_{s' \in S} \mathcal{P}_{ss'}^a v_k(s')\right)$$

Here is an example:

For the details, see the manuscript!
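A minimal value-iteration sketch on a hypothetical 2-state, 2-action MDP (the numbers are made up for illustration): repeatedly apply the Bellman optimality backup, then read off a greedy policy:

```python
# Value iteration on a hypothetical 2-state, 2-action MDP.
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.0, 1.0]]]    # P[s][a][s'] transition probabilities
R = [[1.0, 0.0],
     [2.0, -1.0]]                 # R[s][a] expected rewards
gamma = 0.9

v = [0.0, 0.0]
for _ in range(1000):
    # Bellman optimality backup: v(s) <- max_a [R(s,a) + gamma sum_s' P v(s')]
    v = [max(R[s][a] + gamma * sum(P[s][a][t] * v[t] for t in range(2))
             for a in range(2))
         for s in range(2)]

# A greedy policy with respect to the converged v* is optimal.
policy = [max(range(2), key=lambda a: R[s][a] + gamma *
              sum(P[s][a][t] * v[t] for t in range(2)))
          for s in range(2)]
print(v, policy)
```

Unlike policy iteration there is no explicit intermediate policy; only the value function is iterated.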

Synchronous Dynamic Programming Algorithms

We can view iterative policy evaluation and policy iteration as one body of knowledge: both are built around the policy, and they are the same in essence.

Asynchronous Dynamic Programming

The main difference from the synchronous methods above is the number of copies of the value function kept: asynchronous DP backs up states in place, individually and in any order, so only one copy is needed, which reduces storage.

