1. Overview

 - A common and important goal of statistics : To learn about a large group from data on some of its members (i.e., a sample)

 - Data are observations that have been collected

    - Population : the complete collection of all elements (subjects) to be studied

    - Sample : a sub-collection of members selected from a population

 - Focus in this course : To learn how to use sample data to form conclusions about a population

 

2. Types of data

 - Parameter and statistic

    - Parameter : a measurement describing some characteristic of a population

    - Statistic : a measurement describing some characteristic of a sample

 - A common way of classifying data

    - Quantitative vs. Qualitative data

        * Quantitative data : consist of numbers representing counts or measurements (ex. 52 kg)

        * Qualitative(or categorical or attribute) data : can be distinguished by some non-numeric characteristic (ex. overweight)

    - Discrete vs. continuous data 

        * Discrete data : the number of possible values is either a finite number or a countable number (counted)

        * Continuous(numerical) data : infinitely many possible values (measured)

 - Another common way of classifying data : to use four levels of measurements

    - Nominal : categories, non-ordering (ex. colors)

    - Ordinal : categories, ordering (ex. course grades)

    - Interval : differences between any two data values are meaningful, but there is no natural zero starting point (ex. temperatures in ℃, calendar years)

    - Ratio : the interval level with a natural zero starting point, where zero represents 'none' (ex. weights, since 0 kg represents no weight; ages)

 

3. Design of experiments

 - Two distinct sources for obtaining data

    - Observational study : observe subjects without attempting to modify them

        (Different types of observational studies)

            * Cross-sectional study : data are collected at one point in time

            * Retrospective study : data are collected from the past

            * Prospective(or cohort) study : data are collected in the future from groups sharing common factors

    - Experiment : apply some treatment (ex. clinical trial)

        - Results of experiments are sometimes ruined because of confounding

          (Confounding) occurs when effects of variables are somehow mixed so that the individual effects of the variables cannot be identified(i.e., confusion of variable effects)

        - Thus, try to plan the experiment so that confounding does not occur

          (Three key issues in design of experiment)

            * Issue 1 : Controlling Effects of Variables

                - Blinding : a technique in which the subject doesn't know whether he or she is receiving a treatment or a placebo

                    - single-blind : subjects don't know whether they are getting the treatment or the placebo

                    - double-blind : neither the subject nor the experimenters know who is receiving a particular treatment

                - Blocks :  a group of subjects that are known to be similar in the ways that might affect the outcome of the experiment

                - Randomization : assign subjects to groups through a process of random selection

                    - Completely randomized design : treatments are assigned to the subjects by using a completely random assignment process

                    - Randomized block design : use randomization to assign subjects to treatments separately within each block

 

          * Issue 2 : Sample size & replication

                - Sample size : large enough so that we can see the true nature of any effects, and obtain the sample using an appropriate method

                - Replication : repetition or duplication of an experiment for reliability (an odd number of repetitions is often preferred so that a majority outcome exists)

 

          * Issue 3 : Randomization & sampling strategies

             (Sampling method)

                - Random sampling : each individual member has the same chance of being selected

                - Simple random sampling : every possible sample of the same size n has the same chance of being chosen

                - Systematic sampling : randomly select a starting point and then select every k-th element in the population

                - Stratified sampling : subdivide the population into at least two different subgroups then draw a sample from each subgroup

                - Cluster sampling : divide the population area into sections then randomly select some of those clusters
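
A minimal numpy sketch of these sampling strategies on made-up data (the population, the subgroup labels, and all sizes are illustrative assumptions, not a recipe for a real survey):

```python
import numpy as np

rng = np.random.default_rng(0)
population = np.arange(1000)                 # hypothetical population of 1000 members
strata = rng.integers(0, 4, size=1000)       # hypothetical subgroup label for each member

# Simple random sampling : every possible sample of size n is equally likely
srs = rng.choice(population, size=50, replace=False)

# Systematic sampling : random starting point, then every k-th element
k = 20
start = rng.integers(0, k)
systematic = population[start::k]

# Stratified sampling : draw a sample from each subgroup
stratified = np.concatenate([
    rng.choice(population[strata == g], size=10, replace=False)
    for g in range(4)
])

# Cluster sampling : divide the population into sections, then select whole clusters
clusters = np.array_split(population, 25)    # 25 clusters of 40 members each
chosen = rng.choice(len(clusters), size=5, replace=False)
cluster_sample = np.concatenate([clusters[i] for i in chosen])
```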

             (Sampling error) the difference between a sample result and the true population result (∵ a sample is only a subset of the population)


Summary

 - Goal of statistics : to learn about a population(all subjects) from sample(part of a population)

 - Types of data

    * Parameter (from a population) & statistic (from a sample)

    * Common way of classifying data : Quantitative(counts, discrete and continuous) vs. Qualitative(non-numeric)

    * Another common way of classifying data : nominal(categories, non-ordering) / ordinal(categories, ordering) / interval(differences between values are meaningful, no zero starting point) / ratio(interval with a natural zero starting point)

- Design of experiment

    * Two distinct sources for obtaining data : Observational study(no modification) and experiment(apply treatment)

        - Different types of observational studies : cross-sectional(at one point in time) / retrospective(past) / prospective(future)

        - Results of experiments are sometimes ruined because of confounding

            * Three key issues in design of experiment : controlling effects of variables, sample size & replication, and randomization & sampling method

                - controlling effects of variables : blinding(single or double) / blocks / randomization(completely or randomized block)

                - sampling method : random sampling, simple random sampling(size n), systematic sampling, stratified sampling, cluster sampling

- Density Estimation : how to estimate the distribution of the underlying variable from the distribution of the observed data

    * In machine learning, estimating the density of a variable x means estimating the pdf(probability density function) of x

    * The density estimation method can be divided into parametric method and non-parametric method

        (parametric density estimation) Estimate model parameters for predefined pdf using data

            (example) Under the assumption that daily traffic follows a normal distribution, we only need to estimate its mean and variance from the data

        (non-parametric density estimation) However, a model is often not given in reality → the unknown pdf has to be estimated from the data alone. The simplest non-parametric estimate is the histogram : normalize the histogram obtained from the observed data and use it as the pdf
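
A minimal sketch of both approaches on synthetic data (the "daily traffic" sample below is simulated, and the bin count is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50.0, scale=10.0, size=1000)   # simulated "daily traffic" observations

# Parametric estimation : assume a normal pdf, then estimate its parameters from the data
mu_hat = data.mean()
sigma_hat = data.std(ddof=1)

def normal_pdf(x):
    return np.exp(-0.5 * ((x - mu_hat) / sigma_hat) ** 2) / (sigma_hat * np.sqrt(2 * np.pi))

# Non-parametric estimation : a normalized histogram used as a piecewise-constant pdf
heights, edges = np.histogram(data, bins=30, density=True)   # density=True makes the total area 1
# heights[i] is the estimated density on the interval [edges[i], edges[i+1])
```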

 

- Kernel Density Estimation(KDE) : how to estimate a density using a kernel function

    - drawbacks of the histogram : discontinuities at bin edges, dependence on bin size, inefficiency for high-dimensional data

    - KDE : improves on the histogram approach to non-parametric density estimation by using a kernel function

        * Kernel function : a non-negative function that is symmetric about the origin and integrates to 1 (ex. Gaussian, uniform)

        * Important issues when using KDE : which kernel function to use and which bandwidth h to apply (a short sketch follows the references below)

        (ref) https://en.wikipedia.org/wiki/Kernel_(statistics)#Kernel_functions_in_common_use

        (ref) https://en.wikipedia.org/wiki/Kernel_density_estimation
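
A minimal Gaussian-kernel KDE sketch (the bimodal sample and the hand-picked bandwidth h are illustrative; in practice h is often chosen by a rule of thumb or cross-validation):

```python
import numpy as np

def gaussian_kernel(u):
    # symmetric about the origin, integrates to 1
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def kde(x, data, h):
    # average of kernels centered on each observation, scaled by the bandwidth h
    x = np.atleast_1d(x)[:, None]            # shape (m, 1) for broadcasting against data (n,)
    return gaussian_kernel((x - data) / h).mean(axis=1) / h

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])  # bimodal sample
grid = np.linspace(-6, 6, 200)
density = kde(grid, data, h=0.4)             # smooth estimate; larger h gives a smoother curve
```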

 


(figure) left : comparison of a histogram with KDE / right : effect of the kernel bandwidth h

 

source : https://darkpgmr.tistory.com/147

Recap

- The goal of reinforcement learning


 - Evaluating the objective

    - Because the exact expectation is hard to compute, approximate it with samples(Monte-Carlo sampling)
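
A minimal sketch of this Monte Carlo approximation (the `sample_trajectory` callable and the toy rollout are illustrative stand-ins for a real environment interaction, not a real API):

```python
import numpy as np

def estimate_objective(sample_trajectory, n_trajectories=100):
    """Monte Carlo estimate of J(theta) = E[ sum_t r(s_t, a_t) ].

    sample_trajectory : any callable that runs one rollout under the current
    policy and returns the per-step rewards (an assumed helper).
    """
    returns = [np.sum(sample_trajectory()) for _ in range(n_trajectories)]
    return np.mean(returns)   # the average of sampled returns approximates the expectation

# Toy usage with fake rewards standing in for a real rollout:
rng = np.random.default_rng(0)
print(estimate_objective(lambda: rng.normal(1.0, 0.5, size=10)))
```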

 


Policy Gradient

 - Direct policy differentiation : updates weights directly using policy gradient

    - However, the plain policy gradient often does not work well (due to high variance)


 - Evaluating the policy gradient

    - REINFORCE algorithm(Monte-Carlo policy gradient)

REINFORCE algorithm
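
For reference, the estimator that the slide above summarizes (the standard REINFORCE gradient, written in LaTeX notation; alpha is the learning rate):

```latex
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N}
  \left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \right)
  \left( \sum_{t=1}^{T} r(s_{i,t}, a_{i,t}) \right),
\qquad \theta \leftarrow \theta + \alpha \, \nabla_\theta J(\theta)
```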


    - Comparison to maximum likelihood

Comparison between policy gradient and maximum likelihood

        * Policy gradient : a maximum likelihood weighted by the accumulated reward

        * Maximum likelihood gives every sample equal weight, but the policy gradient reweights sample probabilities according to the accumulated reward

            → Good stuff is made more likely, bad stuff is made less likely

             Simply formalizes the notion of "trial and error"

 

Policy gradient with automatic differentiation
Pseudocode example(with discrete actions)
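
A minimal PyTorch-style sketch of that pseudocode idea for discrete actions (the network sizes, tensor shapes, and optimizer settings are illustrative assumptions, not the course's reference implementation):

```python
import torch
import torch.nn as nn

# Illustrative policy network : maps 4-dim states to logits over 2 discrete actions
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def policy_gradient_step(states, actions, returns):
    """states: (N, 4) float, actions: (N,) long, returns: (N,) float (treated as constants)."""
    log_probs = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
    loss = -(log_probs * returns).mean()   # weighted negative log-likelihood ("pseudo-loss")
    optimizer.zero_grad()
    loss.backward()                        # automatic differentiation produces the policy gradient
    optimizer.step()
```

With all returns equal to 1 this reduces to the ordinary maximum-likelihood (cross-entropy) update, which is exactly the comparison drawn above.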

        * (Example) Gaussian policies

Gaussian policies for continuous actions

               - Maximum likelihood estimate of µ and Σ

Multivariate Gaussian distribution
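
For continuous actions, the same weighted log-likelihood works with a Gaussian policy; a minimal sketch (the mean network and the learned diagonal covariance are illustrative assumptions):

```python
import torch
import torch.nn as nn

mean_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))   # outputs the action mean
log_std = nn.Parameter(torch.zeros(2))                                    # learned log standard deviation

def gaussian_log_prob(states, actions):
    dist = torch.distributions.Normal(mean_net(states), log_std.exp())
    return dist.log_prob(actions).sum(dim=-1)   # sum over action dimensions (diagonal Sigma)

# As in the discrete case: loss = -(gaussian_log_prob(states, actions) * returns).mean()
```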


    - Partial observability

        * Markov property is not actually used

        * However, policy gradient works just fine by replacing states with observations

 


    - What is wrong with the policy gradient?


Reducing variance

  1) Causality : the future does not affect the past

Just decrease the variance

 

 2) Baselines : subtract a baseline from the reward (still unbiased)

Baselines do not change the gradient

    - The baseline that minimizes variance : an optimal baseline can be derived by setting the derivative of the variance with respect to the baseline to zero

Analyzing variance : Baselines change the variance
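
A minimal numpy sketch of both variance-reduction ideas (the shapes and the simple average-return baseline are illustrative; the optimal baseline mentioned above additionally weights by squared gradient magnitudes):

```python
import numpy as np

def reward_to_go(rewards, gamma=1.0):
    """Causality : the log-prob at step t is weighted only by rewards from t onward."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

def baselined_returns(returns):
    """Baseline : subtracting the average return recenters the weights without biasing the gradient."""
    return returns - returns.mean()   # returns : (N,) array of sampled total returns
```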


On-policy

 - Policy gradient is on-policy : sample inefficient

      (On-policy) : each time the policy is changed, even a little bit, we need to generate new samples  // from lec4

On-policy policy gradient can be inefficient


 - Off-policy learning & importance sampling : derive off-policy policy gradient with importance sampling

When theta = theta', we can derive the off-policy policy gradient with importance sampling


 - The off-policy policy gradient : when theta != theta'


 - A first-order approximation for IS (preview)

Anyway, we want to use previous samples
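
A minimal sketch of the importance weight that lets samples from the old policy theta' be reused (the per-step log-probabilities are assumed to be already available):

```python
import numpy as np

def importance_weight(logp_new, logp_old):
    """Trajectory-level ratio prod_t pi_theta(a_t|s_t) / pi_theta'(a_t|s_t).

    logp_new, logp_old : (T,) arrays of per-step log-probabilities under the
    current policy theta and the old sampling policy theta'. Working in log
    space keeps the product numerically stable.
    """
    return np.exp(np.sum(logp_new) - np.sum(logp_old))

# The off-policy gradient reweights each old trajectory by this ratio; because it is a
# product over T steps, its variance can explode, which motivates the first-order
# approximation previewed above.
```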


  - Policy gradient in practice

Gradient has high variance


< Summary of today's lecture >

1. The policy gradient algorithm : REINFORCE(Monte-Carlo policy gradient) algorithm

2. What does the policy gradient do?

  - trial and error (good stuff is made more likely, bad stuff is made less likely)

  - Gradient has high variance

3. Basic variance reduction

  - causality : the future does not affect the past → weight by the reward-to-go Q^hat_{i,t}

  - baselines : subtract a baseline b from the reward, Q - b

4. On-policy & Off-policy policy gradient

  - Policy gradient is on-policy → we can't just skip the sampling step after each update → how can we use previous samples?

  - Can derive an off-policy variant using importance sampling (theta = theta' and theta != theta' cases)

  - Policy gradient can be implemented with automatic differentiation as a weighted maximum likelihood

Recap

Goal : find pi_theta


Definition

- Reward functions r(s,a) : tell us which actions are better

Reward functions


 - Markov chain M={S,T} : a simple graphical model with no notion of agents or actions

Markov chain M={S,T}

 - Markov decision process M={S,A,T,r}

Markov decision process M={S,A,T,r}

 - Partially observed MDP M={S,A,O,T,ℇ,r}

Partially observed MDP M={S,A,O,T,ℇ,r}


- The goal of reinforcement learning

The goal of reinforcement learning is to find optimal theta

    - Finite horizon case : state-action marginal p_theta(s_t, a_t)

Finite horizon case : argmax of the sum of the expected rewards over T timesteps

        * Each state-action pair is determined by the transition probability p(s_{t+1} | s_t, a_t) and the policy pi_theta(a_t | s_t)

    - Infinite horizon case : does p(s_t,a_t) converge to a stationary distribution?

Infinite horizon case can generally converge to stationary distribution

        * Stationary distribution : a distribution that is unchanged by the effect of some matrix or operator. Stationary distributions correspond to eigenvectors whose eigenvalue is unity.

            (Stationary) time-independent

            (Example) If μ = T μ is satisfied (i.e., T μ = λ μ with λ = 1), then μ is the stationary distribution of the transition operator T

                 (T : the state transition matrix/operator (square matrix), μ : the eigenvector with eigenvalue λ = 1)

        * So the objective is simply divided by T; as T → ∞ it reduces to the expected reward under the stationary distribution
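
A tiny numpy check of that statement on a made-up 3-state chain: the stationary distribution is the eigenvector of the transition matrix with eigenvalue 1.

```python
import numpy as np

# Column-stochastic transition matrix T : T[i, j] = p(next state i | current state j)
T = np.array([[0.90, 0.20, 0.10],
              [0.05, 0.70, 0.30],
              [0.05, 0.10, 0.60]])

eigvals, eigvecs = np.linalg.eig(T)
idx = np.argmin(np.abs(eigvals - 1.0))   # pick the eigenvalue closest to 1
mu = np.real(eigvecs[:, idx])
mu = mu / mu.sum()                       # normalize to a probability distribution

assert np.allclose(T @ mu, mu)           # mu = T mu : unchanged by the transition operator
```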

Expectations and stochastic systems

    - Why we care about expectations?

        * Since the reward function may be discrete (non-smooth), it cannot be differentiated directly → cannot be optimized directly

        * However, the expectation under the distribution p_theta is smooth in theta


Algorithms

 - The anatomy of a reinforcement learning algorithm

A simple example

    - Which parts are expensive? : depends on model and method of collecting data


 - How do we deal with all these expectations?

If Q(s,a) is known, it is easy to improve pi_theta(a|s)

 

    - Definition : Q-function

Q-function

    - Definition : value function

Value-function

 

    → Using Q-functions and value functions

     (Idea 1) If we have pi(policy) and we know Q^pi(s,a), then we can improve pi

Idea 1. Improve policy

     (Idea 2) Compute a gradient to increase the probability of good actions a

Idea 2. Gradient-based improvement
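
A minimal tabular sketch of Idea 1 (the Q-values below are made up):

```python
import numpy as np

# Hypothetical Q^pi(s, a) table : 3 states x 2 actions
Q = np.array([[1.0, 2.5],
              [0.3, 0.1],
              [4.0, 4.2]])

# Idea 1 : the greedy policy pi'(s) = argmax_a Q^pi(s, a) is at least as good as pi
greedy_policy = Q.argmax(axis=1)   # -> array([1, 0, 1])

# Idea 2 instead keeps a parameterized policy and nudges its parameters to raise
# the probability of high-Q actions (the gradient-based route used by actor-critic)
```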


Types of RL algorithms

 - The goal of reinforcement learning : find optimal theta(policy)

 

 - Types of RL algorithms

    - Model-based algorithms : estimate the transition model, and then use it (for planning / to improve a policy / something else)

Model-based RL algorithms

    - Value-based algorithms : estimate value function or Q-function of the optimal policy

Value function based RL algorithms

    - Policy gradients : directly differentiate the above objective

Direct policy gradients

    - Actor-critic : estimate value function or Q-function of the current policy, use it to improve policy

Actor-critic : value function + policy gradients


Tradeoffs

 - Why so many RL algorithms?

    - Different tradeoffs

        * Sample efficiency : How many samples do we need to get a good policy?

            - Most important question : is the algorithm off-policy?

              (Off policy) can improve the policy without generating new samples from that policy

              (On policy) each time the policy is changed, even a little bit, we need to generate new samples

Comparison : Sample efficiency

        * Stability & ease of use

            - Reinforcement learning often does not use gradient descent

 

    - Different assumptions

        * Stochastic or deterministic?

           (Deterministic) a=π(s)

           (Stochastic) π(a│s)=P(A=a|S=s)  

        * Episodic or infinite horizon?

        * Continuous or discrete?

 

    - Different things are easy or hard in different settings

        * Easier to represent the policy/model?

 

    - Examples of specific algorithms

Examples of specific algorithms


< Summary of today's lecture >

1. Definition of a Markov decision process : M={S,A,T,r}

2. Definition of reinforcement learning problem

    - Goal : find optimal theta

        - Infinite horizon case(stationary distribution) / finite horizon case(state-action marginal)

3. Anatomy of a RL algorithm : {Generate samples - Model evaluation(value-function, Q-function) - Policy improvement(Gradient descent-based)}

4. Brief overview of RL algorithm types

    - Model-based : MCTS(Monte-Carlo tree search), Dynamic programming

    - Value-based : Q-learning       // near deterministic policy

    - Policy gradient : REINFORCE(Monte-Carlo policy gradient)

    - Actor-Critic : A3C, SAC 
