Terminology & notation

State : Markovian, observation : generally non-Markovian

    - o_1 : a cheetah; o_2 : the cheetah occluded by a car; o_1 is required to predict o_3 → observations are non-Markovian

    - Sequential decision problem : the problem of selecting an action a_t after observing o_t at every timestep t


Imitation Learning

    - Solving a control problem via supervised learning; its simplest form is behavioral cloning, which directly regresses expert actions from observations
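
A minimal behavioral-cloning sketch in PyTorch (the names `PolicyNet`, `expert_obs`, and `expert_actions` are hypothetical placeholders, not from the lecture): the policy is just a regression model fit to expert (o_t, a_t) pairs.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """pi_theta(a_t | o_t): maps an observation to a continuous action."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs):
        return self.net(obs)

def behavioral_cloning(policy, expert_obs, expert_actions, epochs=100, lr=1e-3):
    """Plain supervised regression of expert actions on expert observations."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        pred = policy(expert_obs)                      # predicted a_t
        loss = nn.functional.mse_loss(pred, expert_actions)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return policy
```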

Imitation learning : behavioral cloning

    - Does it work?

        - In general, no : distributional drift (mismatch) occurs — small errors accumulate, drive the policy into observations the training data never covered, and lead to big mistakes

        - Sometimes, yes : collecting more data that also covers mistakes and corrections (e.g. side-camera images labeled with corrective steering) stabilizes it


    - Can we make it work more often?

        - Can we make p_data(o_t) = p_pi_theta(o_t)? → use DAgger

DAgger (Dataset Aggregation) : make p_train (p_data) equal p_test (p_pi_theta) by collecting training data from the policy's own distribution

            1. Train pi_theta(a_t|o_t) on human data D = {o_1, a_1, ..., o_N, a_N}

            2. Run pi_theta(a_t|o_t) to collect on-policy observations D_pi = {o_1, ..., o_M}

            3. Ask a human to label D_pi with expert actions a_t

            4. Aggregate D ← D ∪ D_pi and repeat

            * Step 3 (human labeling) is the bottleneck
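
A sketch of the DAgger loop under assumed interfaces (`env.reset()`/`env.step()`, an `expert_label` function standing in for the human of step 3, and a `train_policy` supervised trainer are all hypothetical):

```python
def dagger(policy, env, expert_label, train_policy, n_iters=10, horizon=200):
    """Dataset Aggregation: train on observations the *policy* visits,
    labeled with the *expert's* actions, so p_train approaches p_pi_theta."""
    dataset = []  # aggregated (observation, expert action) pairs
    for _ in range(n_iters):
        # Step 1: fit pi_theta on everything labeled so far
        if dataset:
            policy = train_policy(policy, dataset)

        # Step 2: run pi_theta to collect on-policy observations
        obs, visited = env.reset(), []
        for _ in range(horizon):
            visited.append(obs)
            obs, done = env.step(policy(obs))
            if done:
                break

        # Step 3: ask the expert (a human) to label them -- the bottleneck
        labeled = [(o, expert_label(o)) for o in visited]

        # Step 4: aggregate and repeat
        dataset.extend(labeled)
    return policy
```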


        - (More general analysis) Does DAgger actually reduce distributional drift compared to behavioral cloning?

A cost function for imitation learning

            * c(s_t, a_t) = 0 if a_t = pi*(s_t), 1 otherwise — each mismatch with the expert incurs a cost of 1

Upper bound on the expected total cost (horizon T, per-step error probability ϵ)

            * Behavioral cloning : O(ϵT^2) — one early mistake can derail everything that follows

            * DAgger : O(ϵT)
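
A sketch of the standard "tightrope walker" argument behind these bounds, assuming the learned policy errs with probability at most ϵ on states from its training distribution:

```latex
\[
  c(\mathbf{s}_t, \mathbf{a}_t) =
  \begin{cases}
    0 & \text{if } \mathbf{a}_t = \pi^{\star}(\mathbf{s}_t) \\
    1 & \text{otherwise}
  \end{cases}
\]
% Behavioral cloning: if the first mistake happens at step t, the policy
% can fall off-distribution and err for the remaining T - t steps, so
\[
  \mathbb{E}\!\left[\sum_{t=1}^{T} c(\mathbf{s}_t, \mathbf{a}_t)\right]
  \;\le\; \sum_{t=1}^{T} \epsilon \,(T - t + 1)
  \;=\; O(\epsilon T^{2}).
\]
% DAgger: training states come from p_{\pi_\theta} itself, so every step
% contributes at most \epsilon and the expected total cost is O(\epsilon T).
```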


    - Can we make it work without more data?

        - Need to mimic the expert's behavior very accurately without overfitting

        - But why might we fail to fit the expert?

            1) Non-Markovian behavior

                * Human behavior depends on all past observations, not just the current one

                * How can we use the whole history? Use an RNN or LSTM over the observation sequence (see the sketch below)

                * But this can cause the 'causal confusion' problem

                  (Example) A dashboard light turns on each time you apply the brake. The more history the model sees, the stronger the spurious association between the light and braking becomes — even though the light is an effect of braking, not its cause
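
A minimal sketch of a history-conditioned policy (layer sizes and the `[batch, time, obs_dim]` input shape are my assumptions):

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """pi_theta(a_t | o_1, ..., o_t): an LSTM summarizes the whole history."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, act_dim)

    def forward(self, obs_seq):
        # obs_seq: [batch, time, obs_dim]; the hidden state carries the past
        out, _ = self.lstm(obs_seq)
        return self.head(out[:, -1])  # action for the most recent timestep
```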

 

            2) Multimodal behavior

                * The expert's action distribution can have several modes (e.g. swerving left or right around an obstacle); a single Gaussian averages the modes and outputs a bad in-between action

(Solutions — a sketch of (a) appears after this list)

a) Output a mixture of Gaussians (mixture density network) : predict means, variances, and mixture weights

b) Latent variable models : keep a simple output distribution but inject a latent noise variable into the input (e.g. a conditional VAE)

c) Autoregressive discretization : discretizing all action dimensions jointly needs a number of bins exponential in the action dimension, so instead discretize and sample one action dimension at a time, conditioning each dimension on the previously sampled ones
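
A sketch of solution (a), a mixture-of-Gaussians output head (the diagonal covariances, component count, and layer sizes are my assumptions):

```python
import torch
import torch.nn as nn

class MixtureGaussianPolicy(nn.Module):
    """Outputs a k-component Gaussian mixture over actions, so the policy
    can keep several distinct expert behaviors instead of averaging them."""
    def __init__(self, obs_dim: int, act_dim: int, k: int = 5, hidden: int = 64):
        super().__init__()
        self.k, self.act_dim = k, act_dim
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.logits = nn.Linear(hidden, k)               # mixture weights
        self.means = nn.Linear(hidden, k * act_dim)      # component means
        self.log_stds = nn.Linear(hidden, k * act_dim)   # diagonal log-stds

    def forward(self, obs):
        h = self.body(obs)
        mix = torch.distributions.Categorical(logits=self.logits(h))
        comp = torch.distributions.Independent(
            torch.distributions.Normal(
                self.means(h).view(-1, self.k, self.act_dim),
                self.log_stds(h).view(-1, self.k, self.act_dim).exp(),
            ),
            1,
        )
        return torch.distributions.MixtureSameFamily(mix, comp)

# Training maximizes the likelihood of expert actions:
#   loss = -policy(obs).log_prob(expert_actions).mean()
```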

 

 


< Summary of today's lecture >

1. Definition of sequential decision problems : the problem of selecting an action a_t after observing o_t at every timestep t

2. Imitation learning : supervised learning for decision making

    1) Does direct imitation work? : generally not, but with enough data it sometimes works

    2) How can we make it work more often?

        - DAgger : collect more data, from the policy's own distribution

        - Fix the reasons we fail to fit the expert:

            * Non-Markovian behavior : use an RNN or LSTM to condition on the history

            * Multimodal behavior : use a Gaussian mixture model / latent variable model / autoregressive discretization
