Contents

1 Introduction

Ouija is a probabilistic framework that allows for interpretable learning of single-cell pseudotimes using only small panels of marker genes. Ouija

Under the hood, Ouija uses Bayesian non-linear factor analysis with priors on the factor loading matrix to specify gene behaviour. Inference is performed using the Stan probabilistic programming language.

2 A basic example

2.1 Data input

Ouija takes input in three forms:

  1. A cell-by-gene expression matrix of non-negative values. We recommend using log2(TPM + 1) or log2(RPKM + 1) as this is what the mean-variance relationship in the model is designed for.
  2. A SingleCellExperiment (from the SingleCellExperiment package)

Here we can use some synthetic data bundled with the package. This contains a gene expression matrix example_gex comprising 11 genes and 400 cells, of which the first 9 are switch-like and the final 2 are transient

data(example_gex)
example_gex[1:3, ]
##      switch_1 switch_2 switch_3 switch_4 switch_5 switch_6 switch_7 switch_8
## [1,] 0.000000 6.584901 0.000000 6.272541 12.34626 3.371506 2.712104 2.624922
## [2,] 7.261585 9.766147 6.902371 0.000000  0.00000 0.000000 0.000000 0.000000
## [3,] 4.605835 1.452838 0.000000 8.157470 14.55081 6.789900 9.762273 9.125681
##       switch_9 transient_1 transient_2
## [1,] 10.039922           0    0.000000
## [2,]  0.000000           0    0.000000
## [3,]  8.999462           0    4.836451

We can further create example SingleCellExperiment that Ouija can use as input:

single_cell_set <- SingleCellExperiment(assays = list(exprs = t(example_gex)))

For the two input cases above Ouija can equivalently use ouija(example_gex, ...) or ouija(single_cell_set, ...). By default Ouija uses the logcounts assay for a SingleCellExperiment, though this can be changed using the single_cell_experiment_assay argument to the ouija function.

2.2 Response types

Using Ouija we can model genes as either exhibiting monotonic up or down regulation (known as switch-like behaviour), or transient behaviour where the gene briefly peaks. By default Ouija assumes all genes exhibit switch-like behaviour (and don’t worry if you get it wrong - the noise model means incorrectly specifying a transient gene as switch-like has minimal effect).

In this example we can infer the behaviour type from the gene names:

response_type <- sapply(strsplit(colnames(example_gex), "_"), `[`, 1)

2.3 Fitting with Ouija

In order to fit the pseudotimes simply call ouija passing in the expected response types. Note that if no response types are provided then they are all assumed to be switch-like by default. The input to Ouija is either a cell-by-gene matrix of non-negative expression values, or an ExpressionSet that has similar values in exprs(eset).

For this vignette we’ll reduce the number of iterations to 500 and the number of cells to 200 to speed up the build time, though we recommend around 4000 in practice.

oui <- ouija(example_gex[sample(seq_len(nrow(example_gex)), 200), ], 
             response_type, iter = 500)
print(oui)
## A Ouija fit with 200 cells and 11 marker genes 
## Inference type:  Hamiltonian Monte Carlo 
## MCMC info: 500 iterations on 1 chains 
## (Gene behaviour) Switch/transient: 9 / 2

It’s good practice to look at the trace and aurocorrelation of the (log)-likelihood to make sure the distribution has (roughly) converged. More advanced diagnostics may be accessed through the rstan package applied to oui$fit.

plot_diagnostics(oui)

2.4 Examining results

Ouija comes with three plotting functions to help understand the fitted pseudotimes.

We can plot the gene expression over pseudotime along with the maximum a posteriori (MAP) estimates of the mean function (the sigmoid or Gaussian transient function) using the plot_expression function:

plot_expression(oui)

We can also visualise when in the trajectory gene regulation behaviour occurs, either in the form of the switch time or the peak time (for switch-like or transient genes) using the plot_switch_times and plot_transient_times functions:

plot_switch_times(oui)

plot_peak_times(oui)

2.5 Extracting useful quantities

For downstream analysis it is useful to extract maximum a posteriori estimates of several quantities. These can be accessed through various convenience functions:

tmap <- map_pseudotime(oui) # MAP pseudotimes
t0map <- switch_times(oui) # MAP switch times
pmap <- peak_times(oui) # MAP peak times
kmap <- switch_strengths(oui) # MAP switch strengths

2.6 Determining regulation order of genes

A common analysis is to work out the regulation orderings of genes. For example, is gene A upregulated before gene B? Does gene C peak before the downregulation of gene D? Ouija answers these questions in terms of a Bayesian hypothesis test of whether the difference in regulation timing (either switch time or peak time) is significantly different to 0. This is collated using the gene_regulation function:

gene_regs <- gene_regulation(oui)
head(gene_regs)
## # A tibble: 6 x 7
## # Groups:   label, gene_A [6]
##   label        gene_A  gene_B  mean_difference lower_95 upper_95 significant
##   <chr>        <chr>   <chr>             <dbl>    <dbl>    <dbl> <lgl>      
## 1 switch_1 - … switch… switch…         -0.215    -0.697    0.312 F          
## 2 switch_1 - … switch… switch…         -0.155    -0.560    0.247 F          
## 3 switch_1 - … switch… switch…         -0.188    -0.597    0.228 F          
## 4 switch_1 - … switch… switch…          0.0412   -0.367    0.471 F          
## 5 switch_1 - … switch… switch…          0.0100   -0.352    0.459 F          
## 6 switch_1 - … switch… switch…          0.234    -0.152    0.684 F

As can be seen, this returns a data.frame with 7 columns:

  • label The two genes being compared in a string format (handy for plotting)
  • gene_A The first gene being compared
  • gene_B The second gene being compared
  • mean_difference The mean difference in regulation timing across all MCMC traces
  • lower_95 The lower bound of the 95% credible interval for the difference in regulation timing
  • upper_95 The corresopnding upper bound
  • significant A logical describing whether the difference in regulation timings is significant - true if the posterior credible interval does not overlap 0

We can graph the posterior differences in gene regulation including whether they are significantly different as per our example above:

ggplot(gene_regs, aes(y = label, x = mean_difference, color = significant)) +
  geom_point() +
  geom_errorbarh(aes(xmin = lower_95, xmax = upper_95)) +
  xlab("Mean difference in regulation time") +
  ylab("Gene pair")

2.7 Identifying metastable states with consistency matrices

A further common analysis to be performed with Ouija is the identification of “metastable states”, or discrete cell types along continuous pseudotemporal trajectories. The basic idea is that as cells differentiate they may be “stable” for part of the trajectory, which counts as a cell type.

To identify these Ouija forms a consistency matrix. If there are \(N\) cells, the consistency matrix is the \(N\) by \(N\) matrix where the entry in the \(i^{th}\) row and \(j^{th}\) column is the empirical probability that cell \(i\) is before cell \(j\). The intuition is that if there are phases of pseudotime where the consistency matrix is around 0.5 then there is large uncertainty as to whether one cell is ordered before another, meaning all the cells are essentially at the same point in pseudotime defining a cell state. If however the consistency matrix is at 0 or 1, the cells are transitioning along pseudotime with little uncertainty as to their ordering implying a continuous progression.

We can visualise the consistency matrix via a call to plot_consistency and calculate it via a call to consistency_matrix:

cmo <- consistency_matrix(oui)
plot_consistency(oui)

In order to identify the clusters along pseudotime we call cluster_consistency on the confusion matrix. This applies Gaussian Mixture Modelling through the mclust package to the first principal component of the consistency matrix. The number of clusters chosen is that which maximises the BIC through iterative search, which defaults to 2:9 though this can be changed through the n_clusters argument:

cell_classifications <- cluster_consistency(cmo)

We can visualise this either as a scatter plot or density plot:

map_pst <- map_pseudotime(oui)
df_class <- data.frame(map_pst, cell_classifications)
ggplot(df_class, aes(x = map_pst, y = cell_classifications)) +
  geom_point() +
  xlab("MAP pseudotime") +
  ylab("Cell classification")

ggplot(df_class, aes(x = map_pst, fill = factor(cell_classifications))) +
  geom_density() +
  theme(legend.position = 'top') +
  scale_fill_discrete(name = "Cell classification") +
  xlab("MAP pseudotime")

3 Advanced usage

3.1 Incorporating prior information on gene behaviour

Because Ouija uses a parametric model of gene expression with interpretable parameters under a Bayesian framework we can encode prior beliefs or information as informative Bayesian priors. For example, the switch time parameter \(t_0\) tells us where in the trajectory a gene is up or down regulated, with 0 being the beginning of the trajectory and 1 the end. By default, a weak prior is placed on \(t0\) with a mean at 0.5. However, a researcher may know a given gene is downregulated early in the (differentiation) trajectory. In such case we may wish to place an informative prior, such as a prior mean of 0.1 indicating downregulation happens early.

In general the parameters we may reasonably wish to place prior information on are the switch strengths, switch times, and peak times. In each case the parameter has a Gaussian prior distribution with

\[ \theta \sim \mathcal{N}(\mu, \sigma) \]

where \(\mu\) tells us what the prior information is and \(\sigma\) signifies how strong this prior information is.

For example, if we’re a gene is strongly upregulated, we might place a \(\mathcal{N}(10, 0.1)\) prior on one of the switch strengths; if we think it’s strongly upregulated but aren’t as sure we might place a \(\mathcal{N}(10, 5)\) prior on it.

We pass this information to Ouija through arguments to the ouija function. Here we specify the vectors of the \(\mu\) and \(\sigma\) variables. These must be same length as the corresponding number of genes - e.g. the length of \(\mu\) for switch-strength must be the same as the number of genes.

The arguments to the ouija function that specify the prior means and variances for each of the parameters are given in the table below.

Parameter Symbol Prior mean argument \(\mu\) Prior stdev argument \(\sigma\)
Switch strength \(k\) switch_strengths switch_strength_sd
Switch time \(t_0\) switch_times switch_time_sd
Peak time \(p\) peak_times peak_time_sd

3.2 Accessing the STAN fit

The underlying STAN fit sits directly in the fit slot of the object returned by ouija. Therefore, you can directly visualise anything related to this fit, e.g. using the functions stan_hist and stan_trace, along with accessing posterior samples through the extract function.

3.3 Inference types

Stan now supports two types of inference:

  • Hamiltonian Monte Carlo (HMC) - full MCMC inference where gradient information of the log-posterior is used to “guide” the random walk through the parameter space
  • Automatic Differentiation Variational Bayes (ADVI or simply VI) - approximate inference where the KL divergence to an approximate distribution is minimised

In general, HMC will provide more accurate inference with approximately correct posterior variance for all parameters. However, VB is orders of magnitude quicker than HMC and while it may underestimate posterior variance, anecdotally it seems just as good as HMC for discovering posterior pseudotimes.

These inference types may be invoked using the inference_type argument:

oui_vb <- ouija(synth_gex, response_types,
                inference_type = "vb")

4 Technical info

sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: OS X El Capitan 10.11.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] bindrcpp_0.2                rstan_2.17.3               
##  [3] StanHeaders_2.17.2          ouija_0.99.0               
##  [5] Rcpp_0.12.15                scater_1.6.0               
##  [7] SingleCellExperiment_0.99.4 SummarizedExperiment_1.7.10
##  [9] DelayedArray_0.3.21         matrixStats_0.52.2         
## [11] GenomicRanges_1.29.15       GenomeInfoDb_1.13.5        
## [13] IRanges_2.11.19             S4Vectors_0.15.14          
## [15] ggplot2_2.2.1               Biobase_2.37.2             
## [17] BiocGenerics_0.23.3         BiocStyle_2.5.39           
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-131            bitops_1.0-6            bit64_0.9-7            
##  [4] progress_1.1.2          rprojroot_1.2           tensorA_0.36           
##  [7] tools_3.4.3             backports_1.1.1         utf8_1.1.3             
## [10] R6_2.2.2                vipor_0.4.5             DBI_0.7                
## [13] lazyeval_0.2.0          colorspace_1.3-2        tidyselect_0.2.3       
## [16] gridExtra_2.3           prettyunits_1.0.2       bit_1.1-12             
## [19] compiler_3.4.3          cli_1.0.0               labeling_0.3           
## [22] bookdown_0.4            scales_0.5.0            stringr_1.2.0          
## [25] digest_0.6.12           rmarkdown_1.6           XVector_0.17.2         
## [28] pkgconfig_2.0.1         htmltools_0.3.6         limma_3.33.14          
## [31] MCMCglmm_2.25           rlang_0.1.6             RSQLite_2.0            
## [34] shiny_1.0.5             bindr_0.1               mclust_5.4             
## [37] dplyr_0.7.4             inline_0.3.14           RCurl_1.95-4.8         
## [40] magrittr_1.5            GenomeInfoDbData_0.99.1 Matrix_1.2-12          
## [43] ggbeeswarm_0.6.0        munsell_0.4.3           ape_5.0                
## [46] viridis_0.4.1           stringi_1.1.6           yaml_2.1.16            
## [49] edgeR_3.19.8            zlibbioc_1.23.0         rhdf5_2.21.6           
## [52] plyr_1.8.4              grid_3.4.3              blob_1.1.0             
## [55] crayon_1.3.4            shinydashboard_0.6.1    lattice_0.20-35        
## [58] cowplot_0.9.2           locfit_1.5-9.1          knitr_1.16             
## [61] pillar_1.1.0            rjson_0.2.15            cubature_1.3-11        
## [64] corpcor_1.6.9           codetools_0.2-15        reshape2_1.4.2         
## [67] biomaRt_2.33.4          XML_3.98-1.9            glue_1.2.0             
## [70] evaluate_0.10           data.table_1.10.4-2     httpuv_1.3.5           
## [73] purrr_0.2.4             tidyr_0.8.0             gtable_0.2.0           
## [76] assertthat_0.2.0        mime_0.5                xtable_1.8-2           
## [79] coda_0.19-1             viridisLite_0.2.0       tibble_1.4.2           
## [82] AnnotationDbi_1.39.4    beeswarm_0.2.3          memoise_1.1.0          
## [85] tximport_1.5.1