Recap

- The goal of reinforcement learning


 - Evaluating the objective

    - Because the expectation is hard to compute exactly, approximate it with samples (Monte-Carlo sampling)

 


Policy Gradient

 - Direct policy differentiation : update the parameters directly using the policy gradient

    - However, policy gradient does not work well (due to high variance)


 - Evaluating the policy gradient

    - REINFORCE algorithm (Monte-Carlo policy gradient)

REINFORCE algorithm
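
Written out from the definitions above (a standard form of the estimator, not copied verbatim from the slides), the REINFORCE update is:

```latex
\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}
  \left(\sum_{t=1}^{T}\nabla_\theta \log \pi_\theta(a_{i,t}\mid s_{i,t})\right)
  \left(\sum_{t=1}^{T} r(s_{i,t},a_{i,t})\right),
\qquad
\theta \leftarrow \theta + \alpha\,\nabla_\theta J(\theta)
```

i.e. 1) sample trajectories by running pi_theta, 2) estimate the gradient from the samples, 3) take a gradient ascent step.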


    - Comparison to maximum likelihood

Comparison between policy gradient and maximum likelihood

        * Policy gradient : maximum likelihood weighted by the accumulated reward

        * Maximum likelihood weights all samples equally, but the policy gradient weights each sample by its accumulated reward

            → Good stuff is made more likely, bad stuff is made less likely

             Simply formalizes the notion of "trial and error"

 

Policy gradient with automatic differentiation
Pseudocode example (with discrete actions)
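
A minimal sketch of this "weighted maximum likelihood" trick in PyTorch, assuming a network `policy_net` that maps observations to action logits and pre-collected tensors `states`, `actions`, `returns` (all of these names are placeholders, not the lecture's code):

```python
import torch
import torch.nn.functional as F

def policy_gradient_loss(policy_net, states, actions, returns):
    """Surrogate loss whose gradient equals the policy gradient estimator.

    states  : float tensor [N*T, obs_dim]
    actions : long tensor  [N*T]   sampled discrete actions
    returns : float tensor [N*T]   total (or reward-to-go) return for each sample
    """
    logits = policy_net(states)                                   # [N*T, num_actions]
    log_probs = F.log_softmax(logits, dim=-1)                     # log pi_theta(a|s) for all actions
    taken = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)  # log pi_theta of the chosen actions
    # Negative log-likelihood weighted by the returns (treated as constants).
    return -(taken * returns.detach()).mean()

# usage sketch:
# loss = policy_gradient_loss(policy_net, states, actions, returns)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```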

        * (Example) Gaussian policies

Gaussian policies for continuous actions

               - Maximum likelihood estimate of µ and Σ

Multivariate Gaussian distribution
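
For a multivariate Gaussian policy whose mean is produced by a network f_theta (the case referenced above), the log-likelihood being weighted is:

```latex
\pi_\theta(a_t\mid s_t)=\mathcal{N}\!\big(f_\theta(s_t),\,\Sigma\big)
\quad\Rightarrow\quad
\log \pi_\theta(a_t\mid s_t)
  = -\tfrac{1}{2}\big(f_\theta(s_t)-a_t\big)^{\!\top}\Sigma^{-1}\big(f_\theta(s_t)-a_t\big) + \text{const}
```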


    - Partial observability

        * Markov property is not actually used

        * However, policy gradient works just fine by replacing states with observations

 


    - What is wrong with the policy gradient?


Reducing variance

  1) Causality : the future does not affect the past

Using causality just decreases the variance; the estimator stays unbiased
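
Concretely, causality means each log-probability term is weighted only by the rewards that come after it (the reward-to-go), which removes terms and lowers variance without adding bias:

```latex
\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T}
  \nabla_\theta \log \pi_\theta(a_{i,t}\mid s_{i,t})\,
  \underbrace{\left(\sum_{t'=t}^{T} r(s_{i,t'},a_{i,t'})\right)}_{\hat{Q}_{i,t}\ \text{(reward-to-go)}}
```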

 

 2) Baselines : subtract a baseline from the reward (still unbiased)

Baselines do not bias the gradient

    - The variance-minimizing baseline : the optimal baseline can be derived by setting the derivative of the variance to zero

Analyzing variance : Baselines change the variance
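
For reference, the unbiasedness argument and the variance-minimizing baseline, with g(tau) = grad_theta log pi_theta(tau), work out as:

```latex
% subtracting a baseline b does not bias the gradient:
E\big[\nabla_\theta \log \pi_\theta(\tau)\,b\big]
  = b\int \pi_\theta(\tau)\,\nabla_\theta \log \pi_\theta(\tau)\,d\tau
  = b\,\nabla_\theta \int \pi_\theta(\tau)\,d\tau = b\,\nabla_\theta 1 = 0

% setting dVar/db = 0 gives the variance-minimizing baseline:
b^{*} = \frac{E\big[g(\tau)^2\, r(\tau)\big]}{E\big[g(\tau)^2\big]}
```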


On-policy

 - Policy gradient is on-policy : sample inefficient

      (On-policy) : each time the policy is changed, even a little bit, we need to generate new samples  // from lec4

On-policy policy gradient can be inefficient


 - Off-policy learning & importance sampling : derive off-policy policy gradient with importance sampling

When theta = theta', differentiating the importance-sampled objective recovers the standard (on-policy) policy gradient
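
The importance sampling identity underlying this, and the trajectory ratio it produces (the initial-state and transition terms cancel, leaving only a product of policy ratios):

```latex
E_{\tau\sim p_\theta(\tau)}\big[f(\tau)\big]
  = E_{\tau\sim \bar{p}(\tau)}\!\left[\frac{p_\theta(\tau)}{\bar{p}(\tau)}\,f(\tau)\right],
\qquad
\frac{p_\theta(\tau)}{p_{\theta'}(\tau)}
  = \frac{\prod_{t=1}^{T}\pi_\theta(a_t\mid s_t)}{\prod_{t=1}^{T}\pi_{\theta'}(a_t\mid s_t)}
```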


 - The off-policy policy gradient : when theta != theta'


 - A first-order approximation for IS (preview)

Anyway, we want to use previous samples


  - Policy gradient in practice

The gradient has high variance in practice (use larger batch sizes; learning rates are hard to tune)


< Summary of today's lecture >

1. The policy gradient algorithm : REINFORCE (Monte-Carlo policy gradient)

2. What does the policy gradient do?

  - trial and error (good stuff is made more likely, bad stuff is made less likely)

  - Gradient has high variance

3. Basic variance reduction

  - causality : the future does not affect the past → weight each term by the reward-to-go Q^hat_{i,t}

  - baselines : subtract a baseline b from the return (Q - b) without biasing the gradient

4. On-policy & Off-policy policy gradient

  - Policy gradient is on-policy → can't just skip sampling step → how to use previous samples?

  - Can derive an off-policy variant using importance sampling (theta = theta' and theta != theta' cases)

  - The policy gradient can be implemented with automatic differentiation as a weighted maximum likelihood

Recap

Goal : find pi_theta


Definition

- Reward functions r(s,a) : tell us which states and actions are better

Reward functions


 - Markov chain M={S,T} : a simple graphical model with no notion of an agent or actions

Markov chain M={S,T}

 - Markov decision process M={S,A,T,r}

Markov decision process M={S,A,T,r}

 - Partially observed MDP M={S,A,O,T,ℇ,r}

Partially observed MDP M={S,A,O,T,ℇ,r}


- The goal of reinforcement learning

The goal of reinforcement learning is to find optimal theta

    - Finite horizon case : state-action marginal p_theta(s_t,a_t)

Finite horizon case : argmax of the sum of expected rewards over T timesteps

        * Each state-action marginal is determined by the transition probability p(s'|s,a) and the policy pi_theta(a|s)

    - Infinite horizon case : does p(s_t,a_t) converge to a stationary distribution?

In the infinite horizon case, p_theta(s_t,a_t) generally converges to a stationary distribution

        * Stationary distribution : a distribution that is left unchanged by the transition operator; it corresponds to an eigenvector whose eigenvalue is one

            (Stationary) time-independent

            (Example) If μ = Tμ, then μ is a stationary distribution of the transition operator T; equivalently, μ is an eigenvector of T with eigenvalue λ = 1

                 (T : transition matrix/operator, μ : the state-action distribution as a column vector)

        * So the objective is simply divided by T, and in the limit it becomes the expected reward under the stationary distribution (both objectives are written out below)
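
Both objectives, written out:

```latex
% finite horizon (state-action marginal p_\theta(s_t,a_t)):
\theta^{*} = \arg\max_{\theta}\; \sum_{t=1}^{T} E_{(s_t,a_t)\sim p_\theta(s_t,a_t)}\big[r(s_t,a_t)\big]

% infinite horizon (p_\theta(s_t,a_t) \to stationary distribution \mu = \mathcal{T}\mu):
\theta^{*} = \arg\max_{\theta}\; E_{(s,a)\sim p_\theta(s,a)}\big[r(s,a)\big]
```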

Expectations and stochastic systems

    - Why do we care about expectations?

        * The reward function itself may be discrete / non-smooth, so it cannot be differentiated → cannot be optimized directly

        * However, its expectation under the policy's distribution is smooth in theta


Algorithms

 - The anatomy of a reinforcement learning algorithm

A simple example

    - Which parts are expensive? : depends on model and method of collecting data


 - How do we deal with all these expectations?

If Q(s,a) is known, it is easy to improve pi_theta(a|s)

 

    - Definition : Q-function

Q-function

    - Definition : value function

Value-function
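
Written out, the two definitions:

```latex
Q^{\pi}(s_t,a_t) = \sum_{t'=t}^{T} E_{\pi_\theta}\big[r(s_{t'},a_{t'})\,\big|\,s_t,a_t\big]
\qquad
V^{\pi}(s_t) = E_{a_t\sim\pi(a_t\mid s_t)}\big[Q^{\pi}(s_t,a_t)\big]
```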

 

    → Using Q-functions and value functions

     (Idea 1) If we have a policy pi and we know Q^pi(s,a), then we can improve pi

Idea 1. Improve policy

     (Idea 2) Compute a gradient to increase the probability of good actions a

Idea 2. Gradient-based improvement


Types of RL algorithms

 - The goal of reinforcement learning : find the optimal theta (policy)

 

 - Types of RL algorithms

    - Model-based algorithms : estimate the transition model, and then use it (for planning / to improve a policy / something else)

Model-based RL algorithms

    - Value-based algorithms : estimate value function or Q-function of the optimal policy

Value function based RL algorithms

    - Policy gradients : directly differentiate the above objective

Direct policy gradients

    - Actor-critic : estimate value function or Q-function of the current policy, use it to improve policy

Actor-critic : value function + policy gradients


Tradeoffs

 - Why so many RL algorithms?

    - Different tradeoffs

        * Sample efficiency : How many samples do we need to get a good policy?

            - Most important question : is the algorithm off-policy?

              (Off policy) can improve the policy without generating new samples from that policy

              (On policy) each time the policy is changed, even a little bit, we need to generate new samples

Comparison : Sample efficiency

        * Stability & ease of use

            - Reinforcement learning often does not use gradient descent

 

    - Different assumptions

        * Stochastic or deterministic?

           (Deterministic) a=π(s)

           (Stochastic) π(a|s)=P(A=a|S=s)

        * Episodic or infinite horizon?

        * Continuous or discrete?

 

    - Different things are easy or hard in different settings

        * Easier to represent the policy/model?

 

    - Examples of specific algorithms

Examples of specific algorithms


< Summary of today's lecture >

1. Definition of a Markov decision process : M={S,A,T,r}

2. Definition of reinforcement learning problem

    - Goal : find optimal theta

        - Infinite horizon (stationary distribution) / finite horizon (state-action marginal) cases

3. Anatomy of an RL algorithm : generate samples → estimate the return / fit a model (value function, Q-function) → improve the policy (e.g. a gradient step)

4. Brief overview of RL algorithm types

    - Model-based : MCTS (Monte-Carlo tree search), dynamic programming

    - Value-based : Q-learning       // near deterministic policy

    - Policy gradient : REINFORCE(Monte-Carlo policy gradient)

    - Actor-Critic : A3C, SAC 


Terminology & notation

State : Markov, observation : non-Markov

    - o_1 : a cheetah is visible, o_2 : the cheetah is occluded by a car → o_1 is needed (in addition to o_2) to predict o_3, so observations are non-Markovian

    - Sequential decision problems : the problem of selecting an action a_t by observing o_t every timestep t


Imitation Learning

    - Solving the control problem via supervised learning; the same thing as behavioral cloning

Imitation learning, behavioral cloning

    - Does it work?

Left : No, Right : Yes

               (Left) No in theory : distributional drift (mismatch) occurs, and small accumulated errors can cause big mistakes

               (Right) Yes in practice : collecting more data can give stability


    - Can we make it work more often?

        - Can we make p_data(o_t) = p_pi_theta(o_t)? → use DAgger (Dataset Aggregation)

DAgger : make p_train (p_data) = p_test (p_pi_theta)

            * Step 3 (asking a human/expert to label the collected observations) is the bottleneck
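
A sketch of the DAgger loop (the `train`, `run_policy`, and `label` methods are placeholders for whatever training, rollout, and labeling machinery is used; step 3 is the bottleneck noted above):

```python
def dagger(expert_data, policy, expert, n_iterations):
    """Dataset Aggregation sketch: make p_train(o_t) approach p_pi_theta(o_t)."""
    dataset = list(expert_data)                           # D = {(o_t, a_t)} from the expert
    for _ in range(n_iterations):
        policy.train(dataset)                             # 1) train pi_theta(a|o) on D
        observations = policy.run_policy()                # 2) run pi_theta to collect observations
        labels = [expert.label(o) for o in observations]  # 3) ask the expert to label them (bottleneck)
        dataset += list(zip(observations, labels))        # 4) aggregate: D <- D ∪ D_pi
    return policy
```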


        - (More general analysis) Does DAgger improve distributional drift over behavioral cloning?

A cost function for imitation learning

            * Each action that mismatches the expert incurs a cost of 1

Upper bound for behavioral cloning

            * Behavioral cloning : O(εT^2)

            * DAgger : O(εT)
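
With per-step mistake probability ε and horizon T, the expected total cost behaves as:

```latex
\text{behavioral cloning:}\quad E\Big[\sum_{t} c(s_t,a_t)\Big] = O(\epsilon T^{2})
\qquad
\text{DAgger:}\quad E\Big[\sum_{t} c(s_t,a_t)\Big] = O(\epsilon T)
```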


    - Can we make it work without more data?

        - Need to mimic expert behavior very accurately without overfitting

        - But, why might we fail to fit the expert?

            1) Non-Markovian behavior

                * Human behavior depends on all past observations

                * How can we use the whole history? Use RNN or LSTM

                * But this can cause the 'causal confusion' problem

                  (Example) A brake indicator light turns on each time the brake is applied. The more history the model sees, the stronger the spurious association between the indicator light and braking becomes, even though the light is an effect of braking rather than the reason to brake

 

            2) Multimodal behavior

(Solutions)

a) Output a mixture of Gaussians (see the sketch after this list)

b) Latent variable models : inject a latent (noise) variable into the input instead of changing the output distribution

c) Autoregressive discretization : discretize the continuous action space one dimension at a time (so the number of bins does not blow up with the action dimensionality), sampling each action dimension sequentially conditioned on the previous ones
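
To illustrate option (a), a minimal sketch of a mixture-of-Gaussians output head whose log-likelihood a cloning loss could maximize (the class and layer names are placeholders, not lecture code):

```python
import torch
import torch.nn as nn

class MixtureOfGaussiansHead(nn.Module):
    """Outputs a K-component diagonal Gaussian mixture over continuous actions (sketch)."""
    def __init__(self, feature_dim, action_dim, n_components=5):
        super().__init__()
        self.n_components = n_components
        self.action_dim = action_dim
        self.logits = nn.Linear(feature_dim, n_components)              # mixture weights
        self.means = nn.Linear(feature_dim, n_components * action_dim)  # component means
        self.log_stds = nn.Linear(feature_dim, n_components * action_dim)

    def log_prob(self, features, actions):
        # features: [B, feature_dim], actions: [B, action_dim]
        B = features.shape[0]
        log_w = torch.log_softmax(self.logits(features), dim=-1)        # [B, K]
        mu = self.means(features).view(B, self.n_components, self.action_dim)
        std = self.log_stds(features).view(B, self.n_components, self.action_dim).exp()
        dist = torch.distributions.Normal(mu, std)
        # Sum per-dimension log-probs, then log-sum-exp over mixture components.
        comp_log_prob = dist.log_prob(actions.unsqueeze(1)).sum(-1)     # [B, K]
        return torch.logsumexp(log_w + comp_log_prob, dim=-1)           # [B]

# cloning loss sketch: loss = -head.log_prob(features, expert_actions).mean()
```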

 

 


< Summary of today's lecture >

1. Definition of sequential decision problems : the problem of selecting an action a_t by observing o_t every timestep t

2. Imitation learning : supervised learning for decision making

    1) Does direct imitation work? : generally not, but it sometimes works with enough data

    2) How can we make it work more often?

        - DAgger : collect more data from the policy's own distribution and have the expert label it

        - Why mimicking the expert can fail to fit:

            * Non-Markovian behavior : use an RNN or LSTM to incorporate the history

            * Multimodal behavior : use a Gaussian mixture model / latent variable model / autoregressive discretization
