Recap

- The goal of reinforcement learning


 - Evaluating the objective

    - Because the expectation is hard to compute exactly, approximate it with samples (Monte-Carlo sampling)

 


Policy Gradient

 - Direct policy differentiation : update the parameters directly using the policy gradient

    - However, policy gradient does not work well (due to high variance)


 - Evaluating the policy gradient

    - REINFORCE algorithm (Monte-Carlo policy gradient)

REINFORCE algorithm
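
Written out from the definitions above (a standard form of the estimator, not copied verbatim from the slides), the REINFORCE update is:

```latex
\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}
  \left(\sum_{t=1}^{T}\nabla_\theta \log \pi_\theta(a_{i,t}\mid s_{i,t})\right)
  \left(\sum_{t=1}^{T} r(s_{i,t},a_{i,t})\right),
\qquad
\theta \leftarrow \theta + \alpha\,\nabla_\theta J(\theta)
```

i.e. 1) sample trajectories by running pi_theta, 2) estimate the gradient from the samples, 3) take a gradient ascent step.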


    - Comparison to maximum likelihood

Comparison between policy gradient and maximum likelihood

        * Policy gradient : maximum likelihood weighted by the accumulated reward

        * Maximum likelihood weights all samples equally, but the policy gradient weights each sample by its accumulated reward

            → Good stuff is made more likely, bad stuff is made less likely

             Simply formalizes the notion of "trial and error"

 

Policy gradient with automatic differentiation
Pseudocode example (with discrete actions)
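
A minimal sketch of this "weighted maximum likelihood" trick in PyTorch, assuming a network `policy_net` that maps observations to action logits and pre-collected tensors `states`, `actions`, `returns` (all of these names are placeholders, not the lecture's code):

```python
import torch
import torch.nn.functional as F

def policy_gradient_loss(policy_net, states, actions, returns):
    """Surrogate loss whose gradient equals the policy gradient estimator.

    states  : float tensor [N*T, obs_dim]
    actions : long tensor  [N*T]   sampled discrete actions
    returns : float tensor [N*T]   total (or reward-to-go) return for each sample
    """
    logits = policy_net(states)                                   # [N*T, num_actions]
    log_probs = F.log_softmax(logits, dim=-1)                     # log pi_theta(a|s) for all actions
    taken = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)  # log pi_theta of the chosen actions
    # Negative log-likelihood weighted by the returns (treated as constants).
    return -(taken * returns.detach()).mean()

# usage sketch:
# loss = policy_gradient_loss(policy_net, states, actions, returns)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```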

        * (Example) Gaussian policies

Gaussian policies for continuous actions

               - Maximum likelihood estimate of µ and Σ

Multivariate Gaussian distribution
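
For a multivariate Gaussian policy whose mean is produced by a network f_theta (the case referenced above), the log-likelihood being weighted is:

```latex
\pi_\theta(a_t\mid s_t)=\mathcal{N}\!\big(f_\theta(s_t),\,\Sigma\big)
\quad\Rightarrow\quad
\log \pi_\theta(a_t\mid s_t)
  = -\tfrac{1}{2}\big(f_\theta(s_t)-a_t\big)^{\!\top}\Sigma^{-1}\big(f_\theta(s_t)-a_t\big) + \text{const}
```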


    - Partial observability

        * Markov property is not actually used

        * However, policy gradient works just fine by replacing states with observations

 


    - What is wrong with the policy gradient?


Reducing variance

  1) Causality : the future does not affect the past

Using causality just decreases the variance; the estimator stays unbiased
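
Concretely, causality means each log-probability term is weighted only by the rewards that come after it (the reward-to-go), which removes terms and lowers variance without adding bias:

```latex
\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T}
  \nabla_\theta \log \pi_\theta(a_{i,t}\mid s_{i,t})\,
  \underbrace{\left(\sum_{t'=t}^{T} r(s_{i,t'},a_{i,t'})\right)}_{\hat{Q}_{i,t}\ \text{(reward-to-go)}}
```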

 

 2) Baselines : subtract a baseline from the reward (still unbiased)

Baselines do not bias the gradient

    - The variance-minimizing baseline : the optimal baseline can be derived by setting the derivative of the variance to zero

Analyzing variance : Baselines change the variance
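
For reference, the unbiasedness argument and the variance-minimizing baseline, with g(tau) = grad_theta log pi_theta(tau), work out as:

```latex
% subtracting a baseline b does not bias the gradient:
E\big[\nabla_\theta \log \pi_\theta(\tau)\,b\big]
  = b\int \pi_\theta(\tau)\,\nabla_\theta \log \pi_\theta(\tau)\,d\tau
  = b\,\nabla_\theta \int \pi_\theta(\tau)\,d\tau = b\,\nabla_\theta 1 = 0

% setting dVar/db = 0 gives the variance-minimizing baseline:
b^{*} = \frac{E\big[g(\tau)^2\, r(\tau)\big]}{E\big[g(\tau)^2\big]}
```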


On-policy

 - Policy gradient is on-policy : sample inefficient

      (On-policy) : each time the policy is changed, even a little bit, we need to generate new samples  // from lec4

On-policy policy gradient can be inefficient


 - Off-policy learning & importance sampling : derive off-policy policy gradient with importance sampling

When theta = theta', differentiating the importance-sampled objective recovers the standard (on-policy) policy gradient
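
The importance sampling identity underlying this, and the trajectory ratio it produces (the initial-state and transition terms cancel, leaving only a product of policy ratios):

```latex
E_{\tau\sim p_\theta(\tau)}\big[f(\tau)\big]
  = E_{\tau\sim \bar{p}(\tau)}\!\left[\frac{p_\theta(\tau)}{\bar{p}(\tau)}\,f(\tau)\right],
\qquad
\frac{p_\theta(\tau)}{p_{\theta'}(\tau)}
  = \frac{\prod_{t=1}^{T}\pi_\theta(a_t\mid s_t)}{\prod_{t=1}^{T}\pi_{\theta'}(a_t\mid s_t)}
```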


 - The off-policy policy gradient : when theta != theta'


 - A first-order approximation for IS (preview)

Anyway, we want to use previous samples


  - Policy gradient in practice

The gradient has high variance in practice (use larger batch sizes; learning rates are hard to tune)


< Summary of today's lecture >

1. The policy gradient algorithm : REINFORCE (Monte-Carlo policy gradient)

2. What does the policy gradient do?

  - trial and error (good stuff is made more likely, bad stuff is made less likely)

  - Gradient has high variance

3. Basic variance reduction

  - causality : the future does not affect the past → weight each term by the reward-to-go Q^hat_{i,t}

  - baselines : subtract a baseline b from the return (Q - b) without biasing the gradient

4. On-policy & Off-policy policy gradient

  - Policy gradient is on-policy → can't just skip sampling step → how to use previous samples?

  - Can derive an off-policy variant using importance sampling (theta = theta' and theta != theta' cases)

  - The policy gradient can be implemented with automatic differentiation as a weighted maximum likelihood

Recap

Goal : find pi_theta


Definition

- Reward functions r(s,a) : tell us which states and actions are better

Reward functions


 - Markov chain M={S,T} : a simple graphical model with no notion of an agent or actions

Markov chain M={S,T}

 - Markov decision process M={S,A,T,r}

Markov decision process M={S,A,T,r}

 - Partially observed MDP M={S,A,O,T,ℇ,r}

Partially observed MDP M={S,A,O,T,ℇ,r}


- The goal of reinforcement learning

The goal of reinforcement learning is to find optimal theta

    - Finite horizon case : state-action marginal p_theta(s_t,a_t)

Finite horizon case : argmax of the sum of expected rewards over T timesteps

        * Each state-action marginal is determined by the transition probability p(s'|s,a) and the policy pi_theta(a|s)

    - Infinite horizon case : does p(s_t,a_t) converge to a stationary distribution?

In the infinite horizon case, p_theta(s_t,a_t) generally converges to a stationary distribution

        * Stationary distribution : a distribution that is left unchanged by the transition operator; it corresponds to an eigenvector whose eigenvalue is one

            (Stationary) time-independent

            (Example) If μ = Tμ, then μ is a stationary distribution of the transition operator T; equivalently, μ is an eigenvector of T with eigenvalue λ = 1

                 (T : transition matrix/operator, μ : the state-action distribution as a column vector)

        * So the objective is simply divided by T, and in the limit it becomes the expected reward under the stationary distribution (both objectives are written out below)
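
Both objectives, written out:

```latex
% finite horizon (state-action marginal p_\theta(s_t,a_t)):
\theta^{*} = \arg\max_{\theta}\; \sum_{t=1}^{T} E_{(s_t,a_t)\sim p_\theta(s_t,a_t)}\big[r(s_t,a_t)\big]

% infinite horizon (p_\theta(s_t,a_t) \to stationary distribution \mu = \mathcal{T}\mu):
\theta^{*} = \arg\max_{\theta}\; E_{(s,a)\sim p_\theta(s,a)}\big[r(s,a)\big]
```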

Expectations and stochastic systems

    - Why do we care about expectations?

        * The reward function itself may be discrete / non-smooth, so it cannot be differentiated → cannot be optimized directly

        * However, its expectation under the policy's distribution is smooth in theta


Algorithms

 - The anatomy of a reinforcement learning algorithm

A simple example

    - Which parts are expensive? : depends on model and method of collecting data


 - How do we deal with all these expectations?

If Q(s,a) is known, it is easy to improve pi_theta(a|s)

 

    - Definition : Q-function

Q-function

    - Definition : value function

Value-function
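
Written out, the two definitions:

```latex
Q^{\pi}(s_t,a_t) = \sum_{t'=t}^{T} E_{\pi_\theta}\big[r(s_{t'},a_{t'})\,\big|\,s_t,a_t\big]
\qquad
V^{\pi}(s_t) = E_{a_t\sim\pi(a_t\mid s_t)}\big[Q^{\pi}(s_t,a_t)\big]
```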

 

    → Using Q-functions and value functions

     (Idea 1) If we have a policy pi and we know Q^pi(s,a), then we can improve pi

Idea 1. Improve policy

     (Idea 2) Compute a gradient to increase the probability of good actions a

Idea 2. Gradient-based improvement


Types of RL algorithms

 - The goal of reinforcement learning : find the optimal theta (policy)

 

 - Types of RL algorithms

    - Model-based algorithms : estimate the transition model, and then use it (for planning / to improve a policy / something else)

Model-based RL algorithms

    - Value-based algorithms : estimate value function or Q-function of the optimal policy

Value function based RL algorithms

    - Policy gradients : directly differentiate the above objective

Direct policy gradients

    - Actor-critic : estimate value function or Q-function of the current policy, use it to improve policy

Actor-critic : value function + policy gradients


Tradeoffs

 - Why so many RL algorithms?

    - Different tradeoffs

        * Sample efficiency : How many samples do we need to get a good policy?

            - Most important question : is the algorithm off-policy?

              (Off policy) can improve the policy without generating new samples from that policy

              (On policy) each time the policy is changed, even a little bit, we need to generate new samples

Comparison : Sample efficiency

        * Stability & ease of use

            - Reinforcement learning often does not use gradient descent

 

    - Different assumptions

        * Stochastic or deterministic?

           (Deterministic) a=π(s)

           (Stochastic) π(a|s)=P(A=a|S=s)

        * Episodic or infinite horizon?

        * Continuous or discrete?

 

    - Different things are easy or hard in different settings

        * Easier to represent the policy/model?

 

    - Examples of specific algorithms

Examples of specific algorithms


< Summary of today's lecture >

1. Definition of a Markov decision process : M={S,A,T,r}

2. Definition of reinforcement learning problem

    - Goal : find optimal theta

        - Infinite horizon (stationary distribution) / finite horizon (state-action marginal) cases

3. Anatomy of an RL algorithm : generate samples → estimate the return / fit a model (value function, Q-function) → improve the policy (e.g. a gradient step)

4. Brief overview of RL algorithm types

    - Model-based : MCTS (Monte-Carlo tree search), dynamic programming

    - Value-based : Q-learning       // near deterministic policy

    - Policy gradient : REINFORCE(Monte-Carlo policy gradient)

    - Actor-Critic : A3C, SAC 


Terminology & notation

State : Markov, observation : non-Markov

    - o_1 : a cheetah is visible, o_2 : the cheetah is occluded by a car → o_1 is needed (in addition to o_2) to predict o_3, so observations are non-Markovian

    - Sequential decision problems : the problem of selecting an action a_t by observing o_t every timestep t


Imitation Learning

    - Solving the control problem via supervised learning; the same thing as behavioral cloning

Imitation learning, behavioral cloning

    - Does it work?

Left : No, Right : Yes

               (Left) No in theory : distributional drift (mismatch) occurs, and small accumulated errors can cause big mistakes

               (Right) Yes in practice : collecting more data can give stability


    - Can we make it work more often?

        - Can we make p_data(o_t) = p_pi_theta(o_t)? → use DAgger (Dataset Aggregation)

DAgger : make p_train (p_data) = p_test (p_pi_theta)

            * Step 3 (asking a human/expert to label the collected observations) is the bottleneck
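
A sketch of the DAgger loop (the `train`, `run_policy`, and `label` methods are placeholders for whatever training, rollout, and labeling machinery is used; step 3 is the bottleneck noted above):

```python
def dagger(expert_data, policy, expert, n_iterations):
    """Dataset Aggregation sketch: make p_train(o_t) approach p_pi_theta(o_t)."""
    dataset = list(expert_data)                           # D = {(o_t, a_t)} from the expert
    for _ in range(n_iterations):
        policy.train(dataset)                             # 1) train pi_theta(a|o) on D
        observations = policy.run_policy()                # 2) run pi_theta to collect observations
        labels = [expert.label(o) for o in observations]  # 3) ask the expert to label them (bottleneck)
        dataset += list(zip(observations, labels))        # 4) aggregate: D <- D ∪ D_pi
    return policy
```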


        - (More general analysis) Does DAgger improve distributional drift over behavioral cloning?

A cost function for imitation learning

            * Each action that mismatches the expert incurs a cost of 1

Upper bound for behavioral cloning

            * Behavioral cloning : O(εT^2)

            * DAgger : O(εT)
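
With per-step mistake probability ε and horizon T, the expected total cost behaves as:

```latex
\text{behavioral cloning:}\quad E\Big[\sum_{t} c(s_t,a_t)\Big] = O(\epsilon T^{2})
\qquad
\text{DAgger:}\quad E\Big[\sum_{t} c(s_t,a_t)\Big] = O(\epsilon T)
```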


    - Can we make it work without more data?

        - Need to mimic expert behavior very accurately without overfitting

        - But, why might we fail to fit the expert?

            1) Non-Markovian behavior

                * Human behavior depends on all past observations

                * How can we use the whole history? Use RNN or LSTM

                * But this can cause the 'causal confusion' problem

                  (Example) A brake indicator light turns on each time the brake is applied. The more history the model sees, the stronger the spurious association between the indicator light and braking becomes, even though the light is an effect of braking rather than the reason to brake

 

            2) Multimodal behavior

(Solutions)

a) Output a mixture of Gaussians (see the sketch after this list)

b) Latent variable models : inject a latent (noise) variable into the input instead of changing the output distribution

c) Autoregressive discretization : discretize the continuous action space one dimension at a time (so the number of bins does not blow up with the action dimensionality), sampling each action dimension sequentially conditioned on the previous ones
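
To illustrate option (a), a minimal sketch of a mixture-of-Gaussians output head whose log-likelihood a cloning loss could maximize (the class and layer names are placeholders, not lecture code):

```python
import torch
import torch.nn as nn

class MixtureOfGaussiansHead(nn.Module):
    """Outputs a K-component diagonal Gaussian mixture over continuous actions (sketch)."""
    def __init__(self, feature_dim, action_dim, n_components=5):
        super().__init__()
        self.n_components = n_components
        self.action_dim = action_dim
        self.logits = nn.Linear(feature_dim, n_components)              # mixture weights
        self.means = nn.Linear(feature_dim, n_components * action_dim)  # component means
        self.log_stds = nn.Linear(feature_dim, n_components * action_dim)

    def log_prob(self, features, actions):
        # features: [B, feature_dim], actions: [B, action_dim]
        B = features.shape[0]
        log_w = torch.log_softmax(self.logits(features), dim=-1)        # [B, K]
        mu = self.means(features).view(B, self.n_components, self.action_dim)
        std = self.log_stds(features).view(B, self.n_components, self.action_dim).exp()
        dist = torch.distributions.Normal(mu, std)
        # Sum per-dimension log-probs, then log-sum-exp over mixture components.
        comp_log_prob = dist.log_prob(actions.unsqueeze(1)).sum(-1)     # [B, K]
        return torch.logsumexp(log_w + comp_log_prob, dim=-1)           # [B]

# cloning loss sketch: loss = -head.log_prob(features, expert_actions).mean()
```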

 

 


< Summary of today's lecture >

1. Definition of sequential decision problems : the problem of selecting an action a_t by observing o_t every timestep t

2. Imitation learning : supervised learning for decision making

    1) Does direct imitation work? : generally not, but it sometimes works with enough data

    2) How can we make it work more often?

        - DAgger : collect more data from the policy's own distribution and have the expert label it

        - Why mimicking the expert can fail to fit:

            * Non-Markovian behavior : use an RNN or LSTM to incorporate the history

            * Multimodal behavior : use a Gaussian mixture model / latent variable model / autoregressive discretization
