Recap
- The goal of reinforcement learning
- Evaluating the objective
- Because the exact expectation is hard to compute, approximate it with samples (Monte-Carlo sampling); see the sketch below
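A minimal sketch of the Monte-Carlo estimate of the objective, J(θ) ≈ (1/N) Σ_i Σ_t r(s_{i,t}, a_{i,t}). The `sample_trajectory` helper and its return format are assumptions made for illustration, not part of the lecture's code.

```python
import numpy as np

def estimate_objective(sample_trajectory, policy, num_samples=100):
    """Monte-Carlo estimate of J(theta) = E_{tau ~ pi_theta}[ sum_t r(s_t, a_t) ].

    `sample_trajectory(policy)` is a hypothetical helper that runs one rollout
    under the current policy and returns the list of per-step rewards.
    """
    returns = []
    for _ in range(num_samples):
        rewards = sample_trajectory(policy)   # one rollout under pi_theta
        returns.append(sum(rewards))          # total reward of that trajectory
    return np.mean(returns)                   # average over the N sampled trajectories
```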
Policy Gradient
- Direct policy differentiation : update the weights directly using the policy gradient
- However, the plain policy gradient often does not work well in practice (high variance)
- Evaluating the policy gradient
- REINFORCE algorithm (Monte-Carlo policy gradient); see the sketch after this list
- Comparison to maximum likelihood
* Policy gradient : a maximum likelihood gradient weighted by the accumulated reward
* Maximum likelihood weights every sample equally, but the policy gradient weights each sample by its accumulated reward
→ Good stuff is made more likely, bad stuff is made less likely
→ Simply formalizes the notion of "trial and error"
* (Example) Gaussian policies
- Maximum likelihood estimate of µ and Σ
- Partial observability
* The Markov property is not actually used in the derivation
* So the policy gradient works just fine if we replace states with observations
- What is wrong with the policy gradient?
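To make the REINFORCE estimator and the "weighted maximum likelihood" comparison above concrete, here is a minimal PyTorch-style sketch with a Gaussian policy (mean from a small network, learned state-independent log-std). The network sizes, tensor shapes, and batching are assumptions, not the lecture's reference implementation.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """pi_theta(a|s) = N(mu_theta(s), diag(sigma^2)) with a learned, state-independent log-std."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mu_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def log_prob(self, obs, act):
        dist = torch.distributions.Normal(self.mu_net(obs), self.log_std.exp())
        return dist.log_prob(act).sum(-1)                 # log pi_theta(a_t | s_t)

def reinforce_loss(policy, obs, act, traj_return):
    """Surrogate loss whose gradient is the REINFORCE (Monte-Carlo policy gradient) estimate:
    grad J(theta) ~ (1/N) sum_i [ sum_t grad log pi_theta(a_it|s_it) ] * r(tau_i).

    obs: [N, T, obs_dim], act: [N, T, act_dim], traj_return: [N] (total reward per trajectory).
    Maximum likelihood would weight every trajectory equally; the policy gradient
    weights each trajectory's log-likelihood by its accumulated reward.
    """
    log_probs = policy.log_prob(obs, act)                 # [N, T]
    weighted = log_probs.sum(dim=1) * traj_return         # reward-weighted log-likelihood
    return -weighted.mean()                               # minimize the negative to ascend J
```

This is exactly the "trial and error" view: trajectories with high return get their log-likelihood pushed up more than trajectories with low return.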
Reducing variance
1) Causality : the future does not affect the past
2) Baselines : subtract a baseline from the reward (the gradient stays unbiased)
- The variance-minimizing baseline : the optimal baseline can be derived analytically (see the sketch below)
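A small NumPy sketch of the two variance-reduction tricks. The reward-to-go computation implements the causality trick (Q^hat_{i,t}); the baseline here is the simple average return, which keeps the gradient unbiased but is not the variance-optimal baseline derived in the lecture.

```python
import numpy as np

def rewards_to_go(rewards, gamma=1.0):
    """Causality: weight log pi(a_t|s_t) only by rewards from time t onward (Q_hat_{i,t})."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # accumulate from the end of the trajectory
        rtg[t] = running
    return rtg

def subtract_baseline(returns):
    """Baseline: subtracting a constant b leaves the gradient unbiased;
    the average return is a simple choice (not the variance-optimal baseline)."""
    return returns - np.mean(returns)
```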
On-policy
- Policy gradient is on-policy : sample inefficient
(On-policy) : each time the policy is changed, even a little bit, we need to generate new samples // from lec4
- Off-policy learning & importance sampling : derive off-policy policy gradient with importance sampling
- The off-policy policy gradient : when theta != theta'
- A first-order approximation for importance sampling (preview)
- Policy gradient in practice
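A sketch of how the pieces above are typically implemented with automatic differentiation: a "pseudo-loss" whose gradient equals the policy gradient, plus an off-policy variant that reuses samples from an old policy theta' by weighting each trajectory with the importance ratio prod_t pi_theta(a_t|s_t) / pi_theta'(a_t|s_t). The tensor shapes and the placement of detach() are assumptions about one possible setup, not the lecture's reference code.

```python
import torch

def pg_pseudo_loss(log_probs, q_hat):
    """On-policy pseudo-loss: autodiff of this gives the policy gradient.
    log_probs, q_hat: [N, T]; q_hat is reward-to-go (minus a baseline), treated as a constant."""
    return -(log_probs * q_hat.detach()).sum(dim=1).mean()

def off_policy_pseudo_loss(log_probs_new, log_probs_old, traj_return):
    """Off-policy variant: samples come from pi_theta', so each trajectory is weighted
    by the importance ratio prod_t pi_theta(a_t|s_t) / pi_theta'(a_t|s_t).
    The ratio is detached so the gradient flows only through log pi_theta, which matches
    the importance-sampled policy gradient estimator."""
    is_weight = torch.exp((log_probs_new - log_probs_old).sum(dim=1)).detach()   # [N]
    return -(is_weight * log_probs_new.sum(dim=1) * traj_return).mean()
```

In practice this is just the reward-weighted maximum-likelihood loss from above, which is why a standard automatic-differentiation framework can compute the policy gradient without any custom gradient code.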
< Summary of today's lecture >
1. The policy gradient algorithm : REINFORCE (Monte-Carlo policy gradient) algorithm
2. What does the policy gradient do?
- trial and error (good stuff is made more likely, bad stuff is made less likely)
- Gradient has high variance
3. Basic variance reduction
- causality : the future does not affect the past; use the reward to go Q^hat_{i,t}
- baselines : subtract a baseline b from the reward, (Q - b); the gradient stays unbiased
4. On-policy & Off-policy policy gradient
- Policy gradient is on-policy → can't just skip sampling step → how to use previous samples?
- Can derive an off-policy variant using importance sampling (theta = theta' and theta != theta' cases)
- The policy gradient can be implemented with automatic differentiation as a weighted maximum likelihood