Probabilistic pseudotime inference using Bayesian Gaussian Process Latent Variable Models

A simple example

Let’s first trying to fit a probabilistic trajectory to a PCA representation of the Trapnell et al. (2014) dataset. This is included in the package under monocle_pca:

library(pseudogp)
library(ggplot2)
data(monocle_le)

ggplot(data.frame(monocle_le)) + geom_point(aes(x = X1, y = X2)) + theme_bw()

Then we can do a simple fit using the fitPseudotime function provided by pseudogp:

le_fit <- fitPseudotime(monocle_le, smoothing_alpha = 30, smoothing_beta = 6, iter = 1000, chains = 1)

However, this takes a long time for a vignette, so we can use the le_fit example bundled with the package as our reference:

data(le_fit)

We can plot the posterior curve in the reduced space:

posteriorCurvePlot(monocle_le, le_fit)

we can plot individual traces by setting posterior_mean = FALSE:

posteriorCurvePlot(monocle_le, le_fit, posterior_mean = FALSE)

and we can visualise the uncertainty using a boxplot

posteriorBoxplot(le_fit)

Exracting posterior samples from the `stanfit` object

All the posterior samples of pseudotime and kernel parameters are contained in the object returned by rstan::stan. We can use rstan’s built in plotting functions to examine the traces of the kernel parameters too. For example, we can look at a boxplot of the lambda parameters or plot the trace of the sigma (noise) parameters:

plot(le_fit, pars="lambda")

rstan::traceplot(le_fit, pars="sigma")

We can also extract the pseudotime traces from the object using rstan:extract:

pst <- extract(le_fit, pars="t")$t
print(str(pst))

##  num [1:500, 1:155] 0.1106 0.1466 0.09 0.0906 0.0164 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ iterations: NULL
##   ..$           : NULL
## NULL

We can then use ggplot2 to plot posterior distributions to get a handle on the uncertainty:

set.seed(1L)
to_sample <- sample(nrow(monocle_le), 4)
traces <- data.frame(pst[,to_sample])
names(traces) <- paste0("Cell",1:4)
traces_melted <- reshape2::melt(traces, variable.name="Cell", value.name="Pseudotime")
pstplt <- ggplot(traces_melted) + geom_density(aes(x = Pseudotime, fill = Cell), alpha = 0.5) +
  theme_bw()
pstplt

We can also find the MAP pseudotime estimates using the posterior.mode function from the MCMCglmm package:

library(MCMCglmm)
library(coda)
tmcmc <- mcmc(traces)
tmap <- posterior.mode(tmcmc)
for(i in 1:length(tmap)) {
  pstplt <- pstplt + geom_vline(xintercept = tmap[i], linetype = 2)
}
pstplt

Choice of reduced dimension representation

Gaussian Process Latent Variable Models are brilliant, but not magic. There needs to be some structure in the representation you choose to get a consistent curve fit. Let’s look at an example of a representation with ‘structure’ and one without.

x <- runif(100, -1, 1)
y_structured <- rnorm(100, x^2, sd = 0.1)
y_unstructured <- rnorm(100, x^2, sd = 1)
dfplt <- data.frame(x, y_structured, y_unstructured)
dfmelt <- reshape2::melt(dfplt, id.vars = "x", value.name = "y")
ggplot(dfmelt) + geom_point(aes(x=x, y=y)) + facet_wrap(~variable) + theme_bw()

This method will consistently fit the structured data, but not the unstructured. To test for ‘consistency’, call fitPseudotime with multiple chains (by using the chains parameter) and then plot the posterior curves (posteriorCurvePlot automatically handles multiple chains and colours each differently). If the curves from each chain line up then you have a consistent fit. If not, look to a different representation of your data or select genes to encourage structure.

Fitting to multiple reduced dimensional representations simultaneously

## first let's load the data
data(monocle_le)
data(monocle_pca)
data(monocle_tsne)

X <- data.frame(rbind(monocle_le, monocle_pca, monocle_tsne))
names(X) <- c("x1", "x2")
X$cell <- rep(1:nrow(monocle_le), times = 3)
X$representation <- rep(c("LE", "PCA", "tSNE"), each = nrow(monocle_le))
ggplot(X) + geom_point(aes(x = x1, y = x2)) + facet_wrap( ~ representation) + theme_bw()

Now we can prepare data and call the stan fit. This is left unevaluated here due to the significant runtime. However, all plotting functions can be called as before.

data <- list(LE = monocle_le, PCA = monocle_pca, tSNE = monocle_tsne)
fit <- fitPseudotime(data, chains = 2, iter = 1000, smoothing_alpha = 12, smoothing_beta = 3)

Technical details

devtools::session_info()

##  setting  value                       
##  version  R version 3.2.3 (2015-12-10)
##  system   x86_64, darwin13.4.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_GB.UTF-8                 
##  tz       Europe/London               
##  date     2016-04-06                  
## 
##  package      * version    date       source                           
##  ape          * 3.4        2015-11-29 CRAN (R 3.2.3)                   
##  assertthat     0.1        2013-12-06 CRAN (R 3.2.0)                   
##  coda         * 0.18-1     2015-10-16 CRAN (R 3.2.0)                   
##  codetools      0.2-14     2015-07-15 CRAN (R 3.2.3)                   
##  colorspace     1.2-6      2015-03-11 CRAN (R 3.2.0)                   
##  corpcor        1.6.8      2015-07-08 CRAN (R 3.2.0)                   
##  cowplot        0.6.1.9999 2016-03-16 Github (wilkelab/cowplot@1f87149)
##  cubature       1.1-2      2013-02-25 CRAN (R 3.2.0)                   
##  DBI            0.3.1      2014-09-24 CRAN (R 3.2.0)                   
##  devtools       1.9.1      2015-09-11 CRAN (R 3.2.0)                   
##  digest         0.6.9      2016-01-08 CRAN (R 3.2.3)                   
##  dplyr          0.4.3      2015-09-01 CRAN (R 3.2.0)                   
##  evaluate       0.8        2015-09-18 CRAN (R 3.2.0)                   
##  formatR        1.2.1      2015-09-18 CRAN (R 3.2.0)                   
##  ggplot2      * 2.1.0      2016-03-01 CRAN (R 3.2.4)                   
##  gridExtra      2.2.1      2016-02-29 CRAN (R 3.2.4)                   
##  gtable         0.2.0      2016-02-26 CRAN (R 3.2.3)                   
##  htmltools      0.3        2015-12-29 CRAN (R 3.2.3)                   
##  inline         0.3.14     2015-04-13 CRAN (R 3.2.0)                   
##  knitr          1.12.3     2016-01-22 CRAN (R 3.2.3)                   
##  labeling       0.3        2014-08-23 CRAN (R 3.2.0)                   
##  lattice        0.20-33    2015-07-14 CRAN (R 3.2.3)                   
##  lazyeval       0.1.10     2015-01-02 CRAN (R 3.2.0)                   
##  magrittr       1.5        2014-11-22 CRAN (R 3.2.0)                   
##  Matrix       * 1.2-3      2015-11-28 CRAN (R 3.2.3)                   
##  matrixStats  * 0.50.1     2015-12-15 CRAN (R 3.2.3)                   
##  MCMCglmm     * 2.22.1     2016-01-30 CRAN (R 3.2.3)                   
##  memoise        0.2.1      2014-04-22 CRAN (R 3.2.0)                   
##  munsell        0.4.3      2016-02-13 CRAN (R 3.2.3)                   
##  nlme           3.1-122    2015-08-19 CRAN (R 3.2.3)                   
##  plyr           1.8.3      2015-06-12 CRAN (R 3.2.0)                   
##  princurve    * 1.1-12     2013-04-25 CRAN (R 3.2.0)                   
##  pseudogp     * 0.1        2016-04-06 local (/@b476b43)                
##  R6             2.1.2      2016-01-26 CRAN (R 3.2.3)                   
##  RColorBrewer * 1.1-2      2014-12-07 CRAN (R 3.2.0)                   
##  Rcpp         * 0.12.4     2016-03-26 CRAN (R 3.2.4)                   
##  reshape2       1.4.1      2014-12-06 CRAN (R 3.2.0)                   
##  rmarkdown      0.9.5      2016-02-22 CRAN (R 3.2.3)                   
##  rstan        * 2.9.0-3    2016-02-12 CRAN (R 3.2.3)                   
##  scales         0.4.0      2016-02-26 CRAN (R 3.2.3)                   
##  stringi        1.0-1      2015-10-22 CRAN (R 3.2.0)                   
##  stringr        1.0.0      2015-04-30 CRAN (R 3.2.0)                   
##  tensorA        0.36       2010-12-01 CRAN (R 3.2.0)                   
##  yaml           2.1.13     2014-06-12 CRAN (R 3.2.0)

Probabilistic pseudotime inference using Bayesian Gaussian Process Latent Variable Models

Kieran Campbell

2016-04-06

Introduction

A simple example

Exracting posterior samples from the `stanfit` object

Choice of reduced dimension representation

Fitting to multiple reduced dimensional representations simultaneously

Technical details

Probabilistic pseudotime inference using Bayesian Gaussian Process Latent Variable Models

Kieran Campbell

2016-04-06

Introduction

A simple example

Exracting posterior samples from the stanfit object

Choice of reduced dimension representation

Fitting to multiple reduced dimensional representations simultaneously

Technical details

Exracting posterior samples from the `stanfit` object