17 November 2016
The posterior distribution of \(\theta\) given \(X\) is given by:
\[ p(\theta|X) = \frac{ p(X|\theta) p(\theta) }{ p(X) } \]
The likelihood is the probability of seeing the data given a model and its parameters
The prior is your personal belief about what the distribution of those parameters looks like (before seeing data)
The posterior is the probability distribution of those parameters (after seeing data)
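These three pieces can be made concrete with a grid approximation in base R. The coin-flip numbers below (7 heads in 10 flips, flat prior) are an illustrative assumption, not from the slides:

```r
# Grid approximation of a posterior for a coin's heads probability theta.
theta  <- seq(0.001, 0.999, length.out = 1000)  # grid over the parameter
prior  <- dbeta(theta, 1, 1)                    # prior: flat over [0, 1]
lik    <- dbinom(7, size = 10, prob = theta)    # likelihood of 7 heads in 10 flips
unnorm <- lik * prior                           # numerator of Bayes' rule
post   <- unnorm / sum(unnorm)                  # normalise: the p(X) step
post_mean <- sum(theta * post)
# Conjugacy gives the posterior Beta(8, 4) exactly, whose mean is 8 / 12
```

Conjugate pairs like this make the normalising constant tractable by hand; Stan exists for the (common) case where it isn't.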
Stan is a program (named after Stanislaw Ulam) to perform Bayesian inference using Hamiltonian Monte Carlo (HMC)
\(\rightarrow\) lets you focus on modelling and not worry (too much!) about inference
Building blocks of a Stan file:
What is the (fixed) data required?
What (unknown) parameters will we infer?
How do the parameters and data relate? Specified with probability density / mass functions
Can also have transformed data, transformed parameters and generated quantities
library(ggplot2)
N <- 100
x <- runif(N)
y <- -1 + 3 * x + rnorm(N, 0, 0.2)
qplot(x, y)
Linear models:
\[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \]
The noise has zero mean, is uncorrelated with \(x_i\), and has constant variance (homoskedastic):
\[ \epsilon_i \sim N(0, \tau^{-1}) \]
Bayesian prior structure:
\[ \beta_0, \beta_1 \sim N(0, \sigma^2_0) \] \[ \tau \sim \text{Gamma}(a, b) \]
Let's pretend we don't know much:
\[ \beta_0, \beta_1 \sim N(0, 100^2) \]
qplot(rnorm(1000, 0, 100))
\[ \tau \sim \text{Gamma}(0.1, 0.1) \]
qplot(rgamma(1000, 0.1, 0.1))
Beware the Gamma distribution's parametrisation (shape/rate vs shape/scale)
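A quick base-R sanity check of that pitfall (values here are illustrative): R's dgamma uses shape and rate by default, and scale is the reciprocal of rate.

```r
# dgamma(x, shape, rate) and dgamma(x, shape, scale = 1/rate) are the
# same density; quietly swapping rate for scale changes the prior.
x <- 2
d_rate  <- dgamma(x, shape = 0.1, rate = 0.1)
d_scale <- dgamma(x, shape = 0.1, scale = 1 / 0.1)  # scale = 10: identical
# A Gamma(shape, rate) variable has mean shape / rate, so Gamma(0.1, 0.1)
# has mean 1 but very large variance (shape / rate^2 = 10).
```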
Two steps:
1. Model specification: what data we require, and how our model relates this data to the unknown parameters (and how the parameters relate to each other)
2. Inference: Stan does this for you, but we still have various decisions to make: e.g. HMC vs variational inference, how many iterations, etc.
Model file part 1: what data do we require?
data {
int<lower = 0> N;
real y[N];
real x[N];
}
Many different data types:
real, vector, matrix, etc.
Constrained types, e.g. real<lower = 0, upper = 1>
Special structures: cov_matrix, ordered vectors
Model file part 2: what parameters do we have?
\[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \;\;\epsilon_i \sim \text{Normal}(0, \tau^{-1})\]
Want to infer \(\beta_0\), \(\beta_1\) and \(\tau\):
parameters {
real beta0;
real beta1;
real<lower = 0> tau;
}
Remember in Bayesian inference there's no real difference between data and parameters!
So (almost) every statement in the data block is valid here too
Model file part 3: how do the parameters relate to the data?
\[ \tau \sim \text{Gamma}(a, b) \\ \beta_0, \beta_1 \sim N(0, \sigma^2_0) \\ \epsilon_i \sim \text{Normal}(0, \tau^{-1}) \\ y_i = \beta_0 + \beta_1 x_i + \epsilon_i\]
This is encoded in Stan via:
model {
beta0 ~ normal(0, 100);
beta1 ~ normal(0, 100);
tau ~ gamma(0.1, 0.1);
for(i in 1:N) {
y[i] ~ normal(beta0 + beta1 * x[i], 1 / sqrt(tau));
}
}
General form:
random variable ~ distribution(more random variables)
This supports a wide range of distributions:
normal
gamma
student_t
exponential
Transformed variables can appear on the left, e.g. 1 / tau ~ inv_gamma(...);
for loops as per R:
for(g in 1:G) {
  ...
}
Assignment recently changed from <- to =. Useful for less-cluttered code, e.g.
real mu[N];
for(i in 1:N) {
  mu[i] = beta0 + beta1 * x[i];
  y[i] ~ normal(mu[i], 1 / sqrt(tau));
}
model_str <- '
data {
int<lower = 0> N;
real y[N];
real x[N];
}
parameters {
real beta0;
real beta1;
real<lower = 0> tau;
}
model {
beta0 ~ normal(0, 100);
beta1 ~ normal(0, 100);
tau ~ gamma(0.1, 0.1);
for(i in 1:N) {
y[i] ~ normal(beta0 + beta1 * x[i], 1 / sqrt(tau));
}
}
'
library(rstan)
linear_model <- stan_model(model_code = model_str)
For simple models it's easy to pass the model specification as a string. For more complex models, it's better to store it as a separate file:
complex_model <- stan_model("my_complex_model.stan")
Put data into list format:
data_list <- list(N = length(y), x = x, y = y)
Inference using sampling:
fit <- sampling(linear_model,
data = data_list,
iter = 4000, # Number of MCMC iterations
warmup = 2000, # How many iterations at beginning to ignore
chains = 4, # Number of chains to sample
thin = 4) # Only keep every fourth sample
## 
## SAMPLING FOR MODEL 'd286bea5be64d4f60da9f8f18551ab8a' NOW (CHAIN 1).
## 
## Chain 1, Iteration:    1 / 4000 [  0%]  (Warmup)
## Chain 1, Iteration:  400 / 4000 [ 10%]  (Warmup)
## Chain 1, Iteration:  800 / 4000 [ 20%]  (Warmup)
## Chain 1, Iteration: 1200 / 4000 [ 30%]  (Warmup)
## Chain 1, Iteration: 1600 / 4000 [ 40%]  (Warmup)
## Chain 1, Iteration: 2000 / 4000 [ 50%]  (Warmup)
## Chain 1, Iteration: 2001 / 4000 [ 50%]  (Sampling)
## Chain 1, Iteration: 2400 / 4000 [ 60%]  (Sampling)
## Chain 1, Iteration: 2800 / 4000 [ 70%]  (Sampling)
## Chain 1, Iteration: 3200 / 4000 [ 80%]  (Sampling)
## Chain 1, Iteration: 3600 / 4000 [ 90%]  (Sampling)
## Chain 1, Iteration: 4000 / 4000 [100%]  (Sampling)
##  Elapsed Time: 0.168416 seconds (Warm-up)
##                0.175169 seconds (Sampling)
##                0.343585 seconds (Total)
## 
## (analogous progress output for chains 2-4 omitted)
What does sampling return?
str(fit, max.level = 2)
## Formal class 'stanfit' [package "rstan"] with 10 slots
##   ..@ model_name: chr "d286bea5be64d4f60da9f8f18551ab8a"
##   ..@ model_pars: chr [1:4] "beta0" "beta1" "tau" "lp__"
##   ..@ par_dims  :List of 4
##   ..@ mode      : int 0
##   ..@ sim       :List of 12
##   ..@ inits     :List of 4
##   ..@ stan_args :List of 4
##   ..@ stanmodel :Formal class 'stanmodel' [package "rstan"] with 5 slots
##   ..@ date      : chr "Sun Nov 13 17:20:37 2016"
##   ..@ .MISC     :<environment: 0x7fd0e76f0df8>
How do you extract parameter estimates? Use extract!
For \(\beta_0\):
beta0_trace <- rstan::extract(fit, "beta0")$beta0
str(beta0_trace)
## num [1:2000(1d)] -0.992 -0.981 -0.953 -0.923 -0.939 ...
##  - attr(*, "dimnames")=List of 1
##   ..$ iterations: NULL
Can get a posterior mean estimate using mean (for a multidimensional parameter, colMeans):
beta0_map <- mean(beta0_trace)
beta0_map
## [1] -0.9704332
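A point estimate alone hides uncertainty; the same trace also gives a credible interval via quantile. The sketch below uses a simulated stand-in for beta0_trace so it runs on its own:

```r
set.seed(1)
fake_trace <- rnorm(2000, mean = -0.97, sd = 0.02)  # stand-in for beta0_trace
post_mean  <- mean(fake_trace)
ci95 <- quantile(fake_trace, probs = c(0.025, 0.975))  # 95% credible interval
```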
How does this compare to the truth?
qplot(beta0_trace, geom = 'density') +
  geom_vline(xintercept = -1, colour = 'darkblue') +
  geom_vline(xintercept = beta0_map, colour = 'darkred')
But we normally (read: never) know the truth…
…so examine MCMC diagnostics
Issues with MCMC samplers:
Day-to-day good enough:
In practice, examining parameter traces and autocorrelation is usually sufficient
Bad:
Good:
stan_trace(fit, "lp__")
Bad convergence:
stan_trace(fit, pars = c("beta0", "beta1", "tau"))
How correlated is each draw with a given lag along the Markov chain?
stan_ac(fit, "lp__")
stan_ac(fit)
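The idea can be checked by hand on a simulated AR(1) "chain" (an illustrative stand-in for real MCMC output, not part of the slides):

```r
# An AR(1) series with rho = 0.9: successive draws are highly correlated,
# so each new draw carries little fresh information.
set.seed(1)
n <- 5000
chain <- numeric(n)
for (t in 2:n) chain[t] <- 0.9 * chain[t - 1] + rnorm(1)
rho <- acf(chain, lag.max = 10, plot = FALSE)$acf[, 1, 1]
round(rho[1:4], 2)  # lags 0-3: roughly 1, 0.9, 0.9^2, 0.9^3
```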
Still trying to estimate \(\hat{\theta}\) from a (finite) sample \(\rightarrow\) Monte Carlo Standard Error (MCSE)
Qn: how does the MCSE compare to the posterior variances?
stan_mcse(fit)
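For independent draws the MCSE of a posterior-mean estimate is just sd / sqrt(n); a simulated stand-in (assumed, not from the fit) illustrates the comparison to the posterior standard deviation:

```r
set.seed(1)
draws <- rnorm(2000)                      # stand-in for independent posterior draws
mcse  <- sd(draws) / sqrt(length(draws))  # Monte Carlo standard error
sd(draws)                                 # posterior sd: far larger than the MCSE
# Correlated MCMC draws inflate the MCSE by a factor of sqrt(n / n_eff).
```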
Given the draws aren't truly independent, what is the effective sample size (ESS)? (one for each parameter)
stan_ess(fit)
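The ESS idea can be sketched by hand: for lag-\(k\) autocorrelations \(\rho_k\), \(n_{\text{eff}} = n / (1 + 2\sum_k \rho_k)\). The AR(1) chain below is a stand-in, and the crude positive-lag truncation is an assumption (real estimators such as rstan's are more careful):

```r
set.seed(1)
n <- 20000; rho <- 0.9
chain <- numeric(n)
for (t in 2:n) chain[t] <- rho * chain[t - 1] + rnorm(1)
ac <- acf(chain, lag.max = 100, plot = FALSE)$acf[-1, 1, 1]  # lags 1..100
n_eff <- n / (1 + 2 * sum(pmax(ac, 0)))  # crude positive-lag truncation
# For AR(1), theory gives n_eff = n * (1 - rho) / (1 + rho), i.e. ~1053 here
c(estimate = n_eff, theory = n * (1 - rho) / (1 + rho))
```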
Ratio of between-chain to within-chain sample variance
stan_rhat(fit)
If \(\hat{R}\) is far from 1, more iterations may help
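A hand-rolled (basic, non-split) version of the statistic on simulated well-mixed chains; note rstan's stan_rhat splits each chain in half first, so this is only a sketch:

```r
set.seed(1)
m <- 4; n <- 1000
chains <- replicate(m, rnorm(n))     # n x m matrix of stand-in draws
B <- n * var(colMeans(chains))       # between-chain variance
W <- mean(apply(chains, 2, var))     # within-chain variance
var_plus <- (n - 1) / n * W + B / n  # pooled posterior-variance estimate
rhat <- sqrt(var_plus / W)           # close to 1 when chains agree
```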
Stan can't do inference over discrete parameters (it can't compute gradients with respect to them). However, in many cases we can marginalise them out:
If \(\theta\) is a continuous parameter and \(\phi\) is discrete, we would normally target the unnormalised density \(p(x | \theta, \phi)p(\theta)p(\phi)\). Instead, target
\[ p(x | \theta)p(\theta) = \sum_{k \in \Omega(\phi)} p(x | \theta, \phi = k)p(\theta)p(\phi = k) \]
Using target += to increment the log posterior density directly:
y[i] ~ normal(mu[i], 1 / sqrt(tau));
is entirely equivalent to
target += normal_lpdf(y[i] | mu[i], 1 / sqrt(tau));
The | is a recent syntax change to Stan.
Example: finite mixture model
The responsibility parameters (whether data sample \(i\) belongs to cluster \(k\)) are discrete.
parameters {
  real y;
}
model {
  target += log_sum_exp(log(0.3) + normal_lpdf(y | -1, 2),
                        log(0.7) + normal_lpdf(y | 3, 1));
}
Life pro tip: use log_sum_exp wherever possible (and work with log probability densities).
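Base R has no built-in log_sum_exp; below is a standard stable implementation, checked against the mixture density from the example above:

```r
# Subtracting the max before exponentiating avoids overflow/underflow.
log_sum_exp <- function(x) {
  m <- max(x)
  m + log(sum(exp(x - m)))
}
# Naive log(sum(exp(...))) underflows to -Inf here; the stable version is exact:
log_sum_exp(c(-1000, -1000))  # -1000 + log(2)
# The mixture 0.3 * N(-1, 2) + 0.7 * N(3, 1) evaluated at y = 0:
y <- 0
lp <- log_sum_exp(c(log(0.3) + dnorm(y, -1, 2, log = TRUE),
                    log(0.7) + dnorm(y,  3, 1, log = TRUE)))
```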
fit_vb <- vb(linear_model, data = data_list)
## 
## This is Automatic Differentiation Variational Inference.
## 
## (EXPERIMENTAL ALGORITHM: expect frequent updates to the procedure.)
## 
## Gradient evaluation took 1.6e-05 seconds
## 1000 iterations under these settings should take 0.016 seconds.
## Adjust your expectations accordingly!
## 
## Begin eta adaptation.
## Iteration:   1 / 250 [  0%]  (Adaptation)
## Iteration: 200 / 250 [ 80%]  (Adaptation)
## Success! Found best value [eta = 1] earlier than expected.
## 
## Begin stochastic gradient ascent.
##   iter       ELBO   delta_ELBO_mean   delta_ELBO_med   notes 
##    100     -2e+01      1.000            1.000
##    200         -7      1.091            1.182
##   1100       -0.9      2.465            1.182   MAY BE DIVERGING... INSPECT ELBO
## (intermediate iterations, with repeated MAY BE DIVERGING warnings, omitted)
##  10000          2      0.122            0.132
## Informational Message: The maximum number of iterations is reached! The algorithm may not have converged.
## This variational approximation is not guaranteed to be meaningful.
## 
## Drawing a sample of size 1000 from the approximate posterior...
## COMPLETED.
beta0_trace_vb <- rstan::extract(fit_vb, "beta0")$beta0
beta0_map_vb <- mean(beta0_trace_vb)
qplot(beta0_trace_vb, geom = 'density') +
  geom_vline(xintercept = -1, colour = 'darkblue') +
  geom_vline(xintercept = beta0_map_vb, colour = 'darkred')