17 November 2016

Overview

Brief Recap

  • Bayesian Statistics operates under a fundamentally different paradigm to classical (Frequentist) Statistics
  • The Frequentist probability of an event occurring is given by its relative frequency over an infinite number of repeated experiments
  • Frequentists have difficulty answering questions where experiments cannot be done, e.g. What is the probability that the world will end tomorrow? or What is the probability Trump will be elected president?

Brief Recap

  • Bayesian probability operates on a subjective belief basis: everything can be assigned a probability
  • That probability or belief might be personal to you
  • Is it a better paradigm? For many problems it is more flexible but certain types of M-open problems (e.g. hypothesis testing) are trickier to handle. This is because Bayesian Statistics requires us to define the entire model space.

The posterior distribution of \(\theta\) given \(X\) is given by:

\[ p(\theta|X) = \frac{ p(X|\theta) p(\theta) }{ p(X) } \]

The likelihood is the probability of seeing the data given a model and its parameters

The prior is your personal belief of what the distribution of those parameters look like (before seeing data)

The posterior is the probability distribution of those parameters (after seeing data)
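As a concrete illustration of how prior and likelihood combine (a conjugate toy example, not from the original slides): with a Beta prior on a coin's bias and Binomial data, the posterior is available in closed form, so no sampler is needed.

```r
# Conjugate example: Beta prior + Binomial likelihood -> Beta posterior
a <- 1; b <- 1          # Beta(1, 1) prior: uniform belief about the coin's bias
heads <- 7; tails <- 3  # observed data: 7 heads in 10 flips

# Posterior is Beta(a + heads, b + tails)
post_a <- a + heads
post_b <- b + tails
post_mean <- post_a / (post_a + post_b)
post_mean
## [1] 0.6666667
```

For almost any model more complicated than this, no such closed form exists, which is what motivates the approximate inference discussed next.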

Modelling vs inference

Modelling

  • Think up a mechanistic description of how the world works…
  • Consider measurement error and noise
  • Write down \(p(\theta | X) \propto p(X|\theta)p(\theta)\)
  • Iterate and profit

Inference

  • Traditionally complicated, error prone, can take months
  • Very rarely \(p(\theta | X)\) available in closed form: approximate inference (MCMC, VB, EP, etc…)
  • Intellectually interesting…
  • …but fundamentally says nothing about \(\theta\)
  • Worse, inference often affects modelling!

Hello, Stan!

Stan is a program (named after Stanislaw Ulam) to perform Bayesian inference using Hamiltonian Monte Carlo (HMC)

\(\rightarrow\) lets you focus on modelling and not worry (too much!) about inference

Stan

Building blocks of a Stan file:

Data

What is the (fixed) data required?

Parameters

What (unknown) parameters will we infer?

Model

How do the parameters and data relate? Specified with probability density / mass functions

Can also have transformed data, transformed parameters and generated quantities blocks

Example: linear regression

library(ggplot2)

N <- 100
x <- runif(N)
y <- -1 + 3 * x + rnorm(N, 0, 0.2)
qplot(x, y)

Linear regression

Linear models:

\[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \]

The noise has zero mean and is uncorrelated with \(x_i\) (homoskedastic):

\[ \epsilon_i \sim N(0, \tau^{-1}) \]

Bayesian prior structure:

\[ \beta_0, \beta_1 \sim N(0, \sigma^2_0) \] \[ \tau \sim \text{Gamma}(a, b) \]

Remember:
\[ \tau = \frac{1}{\sigma^2}\]

Linear regression: priors

Let's say (pretend?) we don't know much:

\[ \beta_0, \beta_1 \sim N(0, 100^2) \]

qplot(rnorm(1000, 0, 100))

Linear regression: priors (cont)

\[ \tau \sim \text{Gamma}(0.1, 0.1) \]

qplot(rgamma(1000, 0.1, 0.1)) 

Beware Gamma distribution parametrisation (shape-rate-scale)
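To see why the parametrisation matters, compare R's shape-rate default against the shape-scale alternative (a quick check, not from the original slides):

```r
# R's rgamma/dgamma default to (shape, rate); scale = 1/rate is the alternative
# Gamma(shape = 2, rate = 4) has mean shape/rate = 0.5
d1 <- dgamma(1, shape = 2, rate = 4)
d2 <- dgamma(1, shape = 2, scale = 1/4)  # same distribution, scale parametrisation
all.equal(d1, d2)
## [1] TRUE

# Mixing them up silently gives a different distribution entirely:
dgamma(1, shape = 2, scale = 4) == d1
## [1] FALSE
```

Stan's `gamma(a, b)` is shape-rate, matching R's default but not every textbook's.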

Linear regression in Stan

Two steps

1. Write the model file

What data we require, and how our model relates this data to the unknown parameters (and how the parameters relate to each other)

2. Inference

Stan does this for you, but we still have various decisions to make: e.g. HMC vs SVI, how many iterations, etc

Constructing the model file: data

Model file part 1: what data do we require?

data {
  int<lower = 0> N;
  real y[N];
  real x[N];
}

Many different data types:

  • Standard: real, vector, matrix, etc
  • Constrained: real<lower=0, upper=1>
  • Structured: cov_matrix, ordered vectors

Constructing the model file: parameters

Model file part 2: what parameters do we have?

\[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \;\;\epsilon_i \sim \text{Normal}(0, \tau^{-1})\]

Want to infer \(\beta_0\), \(\beta_1\) and \(\tau\):

parameters {
  real beta0;
  real beta1;
  real<lower = 0> tau;
}

Remember in Bayesian inference there's no real difference between data and parameters!

So (almost) every statement in the data block is valid here too

Constructing the model file: model

Model file part 3: how do the parameters relate to the data?

\[ \tau \sim \text{Gamma}(a, b) \\ \beta_0, \beta_1 \sim N(0, \sigma^2_0) \\ \epsilon_i \sim \text{Normal}(0, \tau^{-1}) \\ y_i = \beta_0 + \beta_1 x_i + \epsilon_i\]

This is encoded in Stan via:

model {
  beta0 ~ normal(0, 100);
  beta1 ~ normal(0, 100);
  tau ~ gamma(0.1, 0.1);
  for(i in 1:N) {
    y[i] ~ normal(beta0 + beta1 * x[i], 1 / sqrt(tau));
  }
}

Constructing the model file: model

model {
  beta0 ~ normal(0, 100);
  beta1 ~ normal(0, 100);
  tau ~ gamma(0.1, 0.1);
  for(i in 1:N) {
    y[i] ~ normal(beta0 + beta1 * x[i], 1 / sqrt(tau));
  }
}

General form:

random variable ~ distribution(more random variables)

This supports a wide range of distributions:

  • normal
  • gamma
  • student
  • exponential
  • many more…

Model file: some notes

1. Supports in-place transformations! (but beware: non-linear transformations of parameters require a Jacobian adjustment)

1 / tau ~ inv_gamma(...);

2. Iteration with for loops as per R

for(g in 1:G) {
  ...
}

3. Assignment

Recently changed from <- to =. Useful for less-cluttered code, e.g.

real mu[N];
for(i in 1:N) {
  mu[i] = beta0 + beta1 * x[i];
  y[i] ~ normal(mu[i], 1 / sqrt(tau));
}

Complete model

model_str <- '
data {
  int<lower = 0> N;
  real y[N];
  real x[N];
}
parameters {
  real beta0;
  real beta1;
  real<lower = 0> tau;
}
model {
  beta0 ~ normal(0, 100);
  beta1 ~ normal(0, 100);
  tau ~ gamma(0.1, 0.1);
  for(i in 1:N) {
    y[i] ~ normal(beta0 + beta1 * x[i], 1 / sqrt(tau));
  }
}
'

Part 2: Inference!

Step 1: compile the Stan model

library(rstan)
linear_model <- stan_model(model_code = model_str)

For simple models it's easy to pass the model specification as a string. For more complex models, it's better to store it as a separate file:

complex_model <- stan_model("my_complex_model.stan")

Part 2: Inference!

Step 2: run the model

Put data into list format:

data_list <- list(N = length(y), x = x, y = y)

Inference using sampling:

fit <- sampling(linear_model,
                data = data_list, 
                iter = 4000, # Number of MCMC iterations
                warmup = 2000, # How many iterations at beginning to ignore
                chains = 4,  # Number of chains to sample
                thin = 4) # Only keep every fourth sample
## 
## SAMPLING FOR MODEL 'd286bea5be64d4f60da9f8f18551ab8a' NOW (CHAIN 1).
## 
## Chain 1, Iteration:    1 / 4000 [  0%]  (Warmup)
## Chain 1, Iteration:  400 / 4000 [ 10%]  (Warmup)
## Chain 1, Iteration:  800 / 4000 [ 20%]  (Warmup)
## Chain 1, Iteration: 1200 / 4000 [ 30%]  (Warmup)
## Chain 1, Iteration: 1600 / 4000 [ 40%]  (Warmup)
## Chain 1, Iteration: 2000 / 4000 [ 50%]  (Warmup)
## Chain 1, Iteration: 2001 / 4000 [ 50%]  (Sampling)
## Chain 1, Iteration: 2400 / 4000 [ 60%]  (Sampling)
## Chain 1, Iteration: 2800 / 4000 [ 70%]  (Sampling)
## Chain 1, Iteration: 3200 / 4000 [ 80%]  (Sampling)
## Chain 1, Iteration: 3600 / 4000 [ 90%]  (Sampling)
## Chain 1, Iteration: 4000 / 4000 [100%]  (Sampling)
##  Elapsed Time: 0.168416 seconds (Warm-up)
##                0.175169 seconds (Sampling)
##                0.343585 seconds (Total)
## 
## 
## SAMPLING FOR MODEL 'd286bea5be64d4f60da9f8f18551ab8a' NOW (CHAIN 2).
## 
## Chain 2, Iteration:    1 / 4000 [  0%]  (Warmup)
## Chain 2, Iteration:  400 / 4000 [ 10%]  (Warmup)
## Chain 2, Iteration:  800 / 4000 [ 20%]  (Warmup)
## Chain 2, Iteration: 1200 / 4000 [ 30%]  (Warmup)
## Chain 2, Iteration: 1600 / 4000 [ 40%]  (Warmup)
## Chain 2, Iteration: 2000 / 4000 [ 50%]  (Warmup)
## Chain 2, Iteration: 2001 / 4000 [ 50%]  (Sampling)
## Chain 2, Iteration: 2400 / 4000 [ 60%]  (Sampling)
## Chain 2, Iteration: 2800 / 4000 [ 70%]  (Sampling)
## Chain 2, Iteration: 3200 / 4000 [ 80%]  (Sampling)
## Chain 2, Iteration: 3600 / 4000 [ 90%]  (Sampling)
## Chain 2, Iteration: 4000 / 4000 [100%]  (Sampling)
##  Elapsed Time: 0.166997 seconds (Warm-up)
##                0.170728 seconds (Sampling)
##                0.337725 seconds (Total)
## 
## 
## SAMPLING FOR MODEL 'd286bea5be64d4f60da9f8f18551ab8a' NOW (CHAIN 3).
## 
## Chain 3, Iteration:    1 / 4000 [  0%]  (Warmup)
## Chain 3, Iteration:  400 / 4000 [ 10%]  (Warmup)
## Chain 3, Iteration:  800 / 4000 [ 20%]  (Warmup)
## Chain 3, Iteration: 1200 / 4000 [ 30%]  (Warmup)
## Chain 3, Iteration: 1600 / 4000 [ 40%]  (Warmup)
## Chain 3, Iteration: 2000 / 4000 [ 50%]  (Warmup)
## Chain 3, Iteration: 2001 / 4000 [ 50%]  (Sampling)
## Chain 3, Iteration: 2400 / 4000 [ 60%]  (Sampling)
## Chain 3, Iteration: 2800 / 4000 [ 70%]  (Sampling)
## Chain 3, Iteration: 3200 / 4000 [ 80%]  (Sampling)
## Chain 3, Iteration: 3600 / 4000 [ 90%]  (Sampling)
## Chain 3, Iteration: 4000 / 4000 [100%]  (Sampling)
##  Elapsed Time: 0.165873 seconds (Warm-up)
##                0.156197 seconds (Sampling)
##                0.32207 seconds (Total)
## 
## 
## SAMPLING FOR MODEL 'd286bea5be64d4f60da9f8f18551ab8a' NOW (CHAIN 4).
## 
## Chain 4, Iteration:    1 / 4000 [  0%]  (Warmup)
## Chain 4, Iteration:  400 / 4000 [ 10%]  (Warmup)
## Chain 4, Iteration:  800 / 4000 [ 20%]  (Warmup)
## Chain 4, Iteration: 1200 / 4000 [ 30%]  (Warmup)
## Chain 4, Iteration: 1600 / 4000 [ 40%]  (Warmup)
## Chain 4, Iteration: 2000 / 4000 [ 50%]  (Warmup)
## Chain 4, Iteration: 2001 / 4000 [ 50%]  (Sampling)
## Chain 4, Iteration: 2400 / 4000 [ 60%]  (Sampling)
## Chain 4, Iteration: 2800 / 4000 [ 70%]  (Sampling)
## Chain 4, Iteration: 3200 / 4000 [ 80%]  (Sampling)
## Chain 4, Iteration: 3600 / 4000 [ 90%]  (Sampling)
## Chain 4, Iteration: 4000 / 4000 [100%]  (Sampling)
##  Elapsed Time: 0.174429 seconds (Warm-up)
##                0.176195 seconds (Sampling)
##                0.350624 seconds (Total)

Inference (cont)

What does sampling return?

str(fit, max.level = 2)
## Formal class 'stanfit' [package "rstan"] with 10 slots
##   ..@ model_name: chr "d286bea5be64d4f60da9f8f18551ab8a"
##   ..@ model_pars: chr [1:4] "beta0" "beta1" "tau" "lp__"
##   ..@ par_dims  :List of 4
##   ..@ mode      : int 0
##   ..@ sim       :List of 12
##   ..@ inits     :List of 4
##   ..@ stan_args :List of 4
##   ..@ stanmodel :Formal class 'stanmodel' [package "rstan"] with 5 slots
##   ..@ date      : chr "Sun Nov 13 17:20:37 2016"
##   ..@ .MISC     :<environment: 0x7fd0e76f0df8>

Inference (cont)

How do you extract parameter estimates? Use extract!

For \(\beta_0\):

beta0_trace <- rstan::extract(fit, "beta0")$beta0
str(beta0_trace)
##  num [1:2000(1d)] -0.992 -0.981 -0.953 -0.923 -0.939 ...
##  - attr(*, "dimnames")=List of 1
##   ..$ iterations: NULL

Can get a posterior mean estimate using mean (for a multidimensional parameter, use colMeans):

beta0_map <- mean(beta0_trace)
beta0_map
## [1] -0.9704332
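The trace contains the full posterior, so credible intervals come for free via quantile. A sketch (using a stand-in simulated trace so the snippet is self-contained; with the fit above you would pass beta0_trace directly):

```r
# 95% credible interval from posterior draws
set.seed(1)
beta0_trace <- rnorm(2000, mean = -0.97, sd = 0.04)  # stand-in for the extracted draws
ci <- quantile(beta0_trace, probs = c(0.025, 0.975))
ci
```

Unlike a frequentist confidence interval, this is a direct probability statement: given the model, \(\beta_0\) lies in this interval with 95% posterior probability.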

Inference (cont)

How does this compare to the truth?

qplot(beta0_trace, geom = 'density') +
  geom_vline(xintercept = -1, colour = 'darkblue') +
  geom_vline(xintercept = beta0_map, colour = 'darkred')

MCMC diagnostics

But in practice we rarely (never?) know the truth…

…so examine MCMC diagnostics

Issues with MCMC samplers:

  1. Not enough iterations -> unrepresentative of target distribution
  2. Correlated draws -> less precise than independent draws

Love MCMC diagnostics? See BDA3 §11.4
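The second issue can be quantified: the more autocorrelated the chain, the fewer "effectively independent" draws it contains. An illustrative sketch on a synthetic AR(1) chain (this is a rough textbook calculation, not Stan's actual ESS estimator, which truncates the autocorrelation sum more carefully):

```r
# ESS ~ N / (1 + 2 * sum of autocorrelations), truncated at the first non-positive lag
set.seed(42)
N <- 4000
chain <- as.numeric(arima.sim(model = list(ar = 0.8), n = N))  # strongly autocorrelated draws

rho <- acf(chain, lag.max = 100, plot = FALSE)$acf[-1]  # drop lag 0 (always 1)
cut <- which(rho <= 0)[1]
if (!is.na(cut)) rho <- rho[seq_len(cut - 1)]           # keep the initial positive run
ess <- N / (1 + 2 * sum(rho))
ess  # far fewer than 4000 "independent" draws
```

For an AR(1) chain with coefficient 0.8 the theoretical ESS is roughly \(N(1-0.8)/(1+0.8) \approx N/9\), which is why thinning and long runs matter for sticky chains.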

Practical MCMC diagnostics

Day-to-day good enough:

  • Traces of log-likelihood and parameters
  • Autocorrelation plots
  • Effective sample size
  • Monte Carlo standard error
  • Gelman's R-hat

In practice, examining parameter traces and autocorrelation is usually sufficient

Diagnostics: trace examination

Diagnostics: trace examination

stan_trace(fit, "lp__")

Diagnostics: trace examination

Bad convergence:

Diagnostics: trace examination

stan_trace(fit, pars = c("beta0", "beta1", "tau"))

Diagnostics: autocorrelation

How correlated is each draw with a given lag along the Markov chain?

(Image source: WoMMBAT — a user interface for hierarchical Bayesian estimation of working memory capacity)

Diagnostics: autocorrelation

stan_ac(fit, "lp__")

Diagnostics: autocorrelation

stan_ac(fit)

Diagnostics: MCSE

Still trying to estimate \(\hat{\theta}\) from a (finite) sample \(\rightarrow\) Monte Carlo Standard Error (MCSE)

Qn: how does the MCSE compare to the posterior variances?

stan_mcse(fit)

Diagnostics: ESS

Given the draws aren't truly independent, what is the effective sample size (ESS)? (one for each parameter)

stan_ess(fit)

Diagnostics: Gelman-Rubin R-hat

Ratio of between-chain to within-chain sample variance

stan_rhat(fit)

If \(\hat{R}\) is far from 1, more iterations may help
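The statistic itself is easy to compute by hand. A minimal sketch of the basic (non-split) version, assuming equal-length chains (Stan's own \(\hat{R}\) additionally splits each chain in half):

```r
# Gelman-Rubin R-hat for equal-length chains (basic, non-split version)
rhat <- function(chains) {            # chains: iterations x chains matrix
  n <- nrow(chains)
  W <- mean(apply(chains, 2, var))    # mean within-chain variance
  B <- n * var(colMeans(chains))      # between-chain variance
  var_hat <- (n - 1) / n * W + B / n  # pooled estimate of the posterior variance
  sqrt(var_hat / W)
}

set.seed(2)
mixed <- matrix(rnorm(4000), nrow = 1000)  # four well-mixed chains
stuck <- matrix(rnorm(4000, mean = rep(c(0, 5), each = 2000)),
                nrow = 1000)               # two chains stuck near 0, two near 5
rhat(mixed)  # close to 1
rhat(stuck)  # well above 1: the chains disagree about the target
```

When all chains explore the same distribution, between- and within-chain variance agree and \(\hat{R} \approx 1\); disagreement between chains inflates the ratio.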

Discrete parameters

Stan can't do inference over discrete parameters (HMC needs gradients, which don't exist for discrete quantities). However, in many cases we can marginalise them out:

If \(\theta\) is a continuous parameter and \(\phi\) is discrete, we would normally target the unnormalised density \(p(x | \theta, \phi)p(\theta)p(\phi)\). Instead, target

\[ p(x | \theta)p(\theta) = \sum_{k \in \Omega(\phi)} p(x | \theta, \phi = k)p(\theta)p(\phi = k) \]

Messing with the log-likelihood directly

Using target += to manipulate the log-likelihood:

y[i] ~ normal(mu[i], 1 / sqrt(tau));

is equivalent (up to an additive constant in the log density) to

target += normal_lpdf(y[i] | mu[i], 1 / sqrt(tau));

The | is a recent syntax change to Stan.

Discrete parameters

Example: finite mixture model

The responsibility parameters (whether data sample \(i\) belongs to cluster \(k\)) are discrete.

parameters {
  real y;
}
model {
  target += log_sum_exp(log(0.3) + normal_lpdf(y | -1, 2),
                        log(0.7) + normal_lpdf(y | 3, 1));
}

Life pro tip: use log_sum_exp wherever possible (and work with log probability densities).
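The same trick is easy to reproduce in R (a hypothetical helper mirroring Stan's built-in, to show why working on the log scale matters):

```r
# Numerically stable log(sum(exp(x))): shift by the max before exponentiating
log_sum_exp <- function(x) {
  m <- max(x)
  m + log(sum(exp(x - m)))
}

# The naive version underflows: exp(-1000) is exactly 0 in double precision
log(sum(exp(c(-1000, -1000))))   # -Inf
log_sum_exp(c(-1000, -1000))     # -1000 + log(2) = -999.3069
```

Mixture-model likelihoods routinely involve densities this small, so summing raw probabilities silently destroys the answer while the log-scale version is exact.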

Stochastic Variational Inference

fit_vb <- vb(linear_model, data = data_list)
## 
## This is Automatic Differentiation Variational Inference.
## 
## (EXPERIMENTAL ALGORITHM: expect frequent updates to the procedure.)
## 
## Gradient evaluation took 1.6e-05 seconds
## 1000 iterations under these settings should take 0.016 seconds.
## Adjust your expectations accordingly!
## 
## Begin eta adaptation.
## Iteration:   1 / 250 [  0%]  (Adaptation)
## Iteration:  50 / 250 [ 20%]  (Adaptation)
## Iteration: 100 / 250 [ 40%]  (Adaptation)
## Iteration: 150 / 250 [ 60%]  (Adaptation)
## Iteration: 200 / 250 [ 80%]  (Adaptation)
## Success! Found best value [eta = 1] earlier than expected.
## 
## Begin stochastic gradient ascent.
##   iter       ELBO   delta_ELBO_mean   delta_ELBO_med   notes 
##    100     -2e+01             1.000            1.000
##    200         -7             1.091            1.182
##    300          2             2.360            1.182
##    400          1             1.960            1.182
##    500          2             1.628            1.000
##    600        0.8             1.496            1.000
##    700          2             1.374            0.836
##    800         -1             1.530            1.000
##    900          2             1.569            1.000
##   1000     -1e+01             1.529            1.165
##   1100       -0.9             2.465            1.182   MAY BE DIVERGING... INSPECT ELBO
##   1200          3             2.480            1.328   MAY BE DIVERGING... INSPECT ELBO
##   1300          2             2.002            1.165   MAY BE DIVERGING... INSPECT ELBO
##   1400          2             1.929            1.165   MAY BE DIVERGING... INSPECT ELBO
##   1500        0.3             2.471            1.328   MAY BE DIVERGING... INSPECT ELBO
##   1600          1             2.462            1.328   MAY BE DIVERGING... INSPECT ELBO
##   1700          1             2.405            1.328   MAY BE DIVERGING... INSPECT ELBO
##   1800          2             2.168            1.165   MAY BE DIVERGING... INSPECT ELBO
##   1900          2             2.000            0.745   MAY BE DIVERGING... INSPECT ELBO
##   2000         -1             2.188            0.745   MAY BE DIVERGING... INSPECT ELBO
##   2100         -1             1.168            0.253   MAY BE DIVERGING... INSPECT ELBO
##   2200        0.4             1.483            0.253   MAY BE DIVERGING... INSPECT ELBO
##   2300       -0.3             1.702            0.745   MAY BE DIVERGING... INSPECT ELBO
##   2400         -3             1.788            0.887   MAY BE DIVERGING... INSPECT ELBO
##   2500       -0.1             3.002            0.887   MAY BE DIVERGING... INSPECT ELBO
##   2600          2             3.036            1.083   MAY BE DIVERGING... INSPECT ELBO
##   2700          2             3.044            1.083   MAY BE DIVERGING... INSPECT ELBO
##   2800          2             3.026            1.083   MAY BE DIVERGING... INSPECT ELBO
##   2900          2             3.018            1.083   MAY BE DIVERGING... INSPECT ELBO
##   3000          2             2.720            0.887   MAY BE DIVERGING... INSPECT ELBO
##   3100        0.8             2.859            1.083   MAY BE DIVERGING... INSPECT ELBO
##   3200        0.7             2.420            0.887   MAY BE DIVERGING... INSPECT ELBO
##   3300          2             2.251            0.628   MAY BE DIVERGING... INSPECT ELBO
##   3400        0.9             2.268            0.628   MAY BE DIVERGING... INSPECT ELBO
##   3500          2             0.524            0.428   MAY BE DIVERGING... INSPECT ELBO
##   3600          2             0.440            0.240
##   3700          1             0.487            0.428
##   3800          2             0.502            0.428   MAY BE DIVERGING... INSPECT ELBO
##   3900          2             0.495            0.428
##   4000          2             0.521            0.428   MAY BE DIVERGING... INSPECT ELBO
##   4100       -0.6             0.868            0.428   MAY BE DIVERGING... INSPECT ELBO
##   4200          2             0.999            0.620   MAY BE DIVERGING... INSPECT ELBO
##   4300          2             0.969            0.428   MAY BE DIVERGING... INSPECT ELBO
##   4400          2             0.882            0.331   MAY BE DIVERGING... INSPECT ELBO
##   4500          2             0.854            0.317   MAY BE DIVERGING... INSPECT ELBO
##   4600          2             0.845            0.317   MAY BE DIVERGING... INSPECT ELBO
##   4700          2             0.797            0.222   MAY BE DIVERGING... INSPECT ELBO
##   4800          2             0.812            0.317   MAY BE DIVERGING... INSPECT ELBO
##   4900          2             0.834            0.317   MAY BE DIVERGING... INSPECT ELBO
##   5000          2             0.812            0.272   MAY BE DIVERGING... INSPECT ELBO
##   5100          2             0.320            0.189
##   5200          1             0.284            0.189
##   5300          2             0.298            0.189
##   5400          2             0.286            0.151
##   5500          1             0.385            0.272
##   5600          2             0.411            0.373
##   5700          3             0.424            0.373
##   5800          1             0.495            0.417
##   5900          1             0.469            0.417
##   6000        0.9             0.499            0.417
##   6100          2             0.534            0.464   MAY BE DIVERGING... INSPECT ELBO
##   6200          2             0.443            0.417
##   6300          2             0.419            0.397
##   6400          2             0.443            0.397
##   6500          2             0.339            0.307
##   6600          2             0.315            0.268
##   6700          2             0.295            0.230
##   6800          2             0.199            0.172
##   6900          2             0.200            0.172
##   7000          2             0.168            0.120
##   7100          2             0.129            0.115
##   7200          2             0.152            0.115
##   7300          2             0.149            0.115
##   7400        0.4             0.543            0.115   MAY BE DIVERGING... INSPECT ELBO
##   7500          2             0.615            0.172   MAY BE DIVERGING... INSPECT ELBO
##   7600          2             0.615            0.177   MAY BE DIVERGING... INSPECT ELBO
##   7700          2             0.632            0.201   MAY BE DIVERGING... INSPECT ELBO
##   7800          2             0.664            0.241   MAY BE DIVERGING... INSPECT ELBO
##   7900          2             0.669            0.241   MAY BE DIVERGING... INSPECT ELBO
##   8000          2             0.686            0.243   MAY BE DIVERGING... INSPECT ELBO
##   8100          2             0.679            0.243   MAY BE DIVERGING... INSPECT ELBO
##   8200          2             0.656            0.241   MAY BE DIVERGING... INSPECT ELBO
##   8300          2             0.670            0.243   MAY BE DIVERGING... INSPECT ELBO
##   8400          2             0.274            0.243
##   8500          2             0.218            0.243
##   8600          1             0.241            0.244
##   8700          2             0.262            0.296
##   8800          2             0.227            0.244
##   8900          3             0.240            0.244
##   9000          2             0.231            0.244
##   9100          2             0.241            0.244
##   9200          2             0.243            0.244
##   9300          2             0.211            0.197
##   9400          2             0.195            0.151
##   9500          2             0.184            0.138
##   9600          2             0.145            0.132
##   9700          2             0.125            0.132
##   9800          2             0.139            0.138
##   9900          2             0.136            0.138
##   10000          2             0.122            0.132
## Informational Message: The maximum number of iterations is reached! The algorithm may not have converged.
## This variational approximation is not guaranteed to be meaningful.
## 
## Drawing a sample of size 1000 from the approximate posterior... 
## COMPLETED.

SVI (cont)

beta0_trace_vb <- rstan::extract(fit_vb, "beta0")$beta0
beta0_map_vb <- mean(beta0_trace_vb)

qplot(beta0_trace_vb, geom = 'density') +
  geom_vline(xintercept = -1, colour = 'darkblue') +
  geom_vline(xintercept = beta0_map_vb, colour = 'darkred')

Resources

  • Bayesian Data Analysis 3 (textbook)
  • mc-stan.org
  • loo: leave-one-out cross-validation
  • rstanarm: applied regression modelling