Contents

1 Overview

clonealign assigns cells measured using single-cell RNA-seq to their clones of origin using copy number data. This is especially useful when clones are inferred from shallow single-cell DNA-seq, in which case the copy number state of each clone is known, but the precise SNV structure is unknown.

To assign cells to clones, clonealign makes the assumption that

\[ \text{gene expression} \propto \text{number of gene copies} \]

This is demonstrated in the figure below.

Mathematically we have an \(N \times G\) matrix \(Y\) of raw gene expression counts (from RNA-seq) for \(N\) cells and \(G\) genes, where \(y_{ng}\) is the counts to gene \(g\) in cell \(c\). We also have a \(G \times C\) matrix \(\Lambda\) of copy number variation for \(C\) clones, where \(\lambda_{gc}\) is the copy number of gene \(g\) in clone \(c\). We introduce a clone-assigning categorical variable \(\pi_n\) for each cell, where

\[z_n = c \text{ if cell $n$ on clone $c$} \]

then clonealign models the conditional expected counts in a gene and cell as

\[ E[y_{ng} | z_n=c] = \frac{\lambda_{g,c} f(\mu_g) e^{\psi_n \cdot w_g}}{ \sum_{g'}\lambda_{g',c} f(\mu_{g'}) e^{\psi_n \cdot w_{g'}}} s_n \]

where \(s_n\) is a cell-specific size factor, \(\mu_g\) is the per-chromosome expression (normalized so that \(\mu_1 = 1\) for model identifiability), \(f\) is a function that maps copy number to a multiplicative factor of expression, and \(\psi\) and \(w\) are cell and gene specific random effects. As of clonealign 2.0, the noise distribution is assumed to be multinomial. Inference is performed using reparametrization-gradient variational inference to calculate a posterior distribution of the clone assignments \(z_n\) and of all other model parameters.

2 Installation

clonealign is built upon Google’s Tensorflow using the Tensorflow R package provided by Rstudio. To install tensorflow, run

install.packages("tensorflow")
tensorflow::install_tensorflow(extra_packages ="tensorflow-probability", version="1.13.1")

You can confirm the installation succeeded by running

sess = tf$Session()
hello <- tf$constant('Hello, TensorFlow!')
sess$run(hello)

For more details see the Rstudio page on tensorflow installation.

clonealign can then be installed using the devtools package via

devtools::install_github("kieranrcampbell/clonealign")

3 Basic usage

3.1 Data preparation

By default, clonealign requires two inputs:

  • Gene expression data as raw counts. This can be in the form of a SingleCellExperiment, SummarizedExperiment or cell by gene matrix
  • Copy number profiles for each clone and gene (where the genes must be the same as those measured in the expression data). This can be in the form of a data.frame, DataFrame or matrix

Bundled with the package is an example SingleCellExperiment for 100 genes and 200 cells:

library(clonealign)
#> Registered S3 method overwritten by 'R.oo':
#>   method        from       
#>   throw.default R.methodsS3
data(example_sce)
example_sce
#> class: SingleCellExperiment 
#> dim: 100 200 
#> metadata(0):
#> assays(1): counts
#> rownames(100): gene_1 gene_2 ... gene_99 gene_100
#> rowData names(3): A B C
#> colnames(200): cell_1 cell_2 ... cell_199 cell_200
#> colData names(0):
#> reducedDimNames(0):
#> spikeNames(0):

This has raw integer counts in the assays slot as required for input to clonealign:

assay(example_sce, "counts")[1:5, 1:5]
#>        cell_1 cell_2 cell_3 cell_4 cell_5
#> gene_1      0      0      0      0      0
#> gene_2      0      0      0      0      0
#> gene_3      1      1      1      0      0
#> gene_4      0      0      0      0      0
#> gene_5      1      0      0      1      0

The CNV data is stored in the rowData of the SingleCellExperiment for 3 clones (A, B, and C) and crucially the same genes as the expression data:

cnv_data <- rowData(example_sce)[, c("A", "B", "C")]
stopifnot(nrow(cnv_data) == nrow(example_sce)) # Make sure genes match up
head(cnv_data)
#> DataFrame with 6 rows and 3 columns
#>                A         B         C
#>        <integer> <integer> <integer>
#> gene_1         1         2         2
#> gene_2         2         1         1
#> gene_3         3         2         2
#> gene_4         3         2         2
#> gene_5         2         2         3
#> gene_6         2         1         1

We next run the preprocess_for_clonealign that performs the following filtering steps:

  • Removes genes with fewer than min_counts_per_gene counts mapped
  • Removes cells with fewer than min_counts_per_cell counts mapped
  • If remove_outlying_genes is TRUE, removes genes whose mean expression is more than nmads MADs away from the overall mean
  • Removes genes whose maximum copy number is greater than max_copy_number
  • If remove_genes_same_copy_number is TRUE, removes genes whose copy number is the same across all clones

Note that it is critical before running clonealign that genes on the sex chromosomes are removed, as these violate the gene dosage assumption due to e.g. X inactivation. To get a list of genes and their respective chromosomes, we suggest using the annotables package.

ca_data <- preprocess_for_clonealign(example_sce, cnv_data)

3.2 Model fitting

The model is fitted with a basic call to run_clonealign, which calls the clonealign function across a range of parameter initializations and picks the model fit that achieves the best ELBO.

cal <- run_clonealign(ca_data$gene_expression_data, ca_data$copy_number_data, seed = 123)
#> Constructing tensorflow graph
#> Set session seed to 969167 (disabled GPU)
#> Removing 1 genes with low counts
#> Optimizing ELBO
#> 
#> ELBO converged or reached max iterations
#> Computing final ELBO
#> Constructing tensorflow graph
#> Set session seed to 188942 (disabled GPU)
#> Removing 1 genes with low counts
#> Optimizing ELBO
#> 
#> ELBO converged or reached max iterations
#> Computing final ELBO
#> Constructing tensorflow graph
#> Set session seed to 134058 (disabled GPU)
#> Removing 1 genes with low counts
#> Optimizing ELBO
#> 
#> ELBO converged or reached max iterations
#> Computing final ELBO
#> Constructing tensorflow graph
#> Set session seed to 124022 (disabled GPU)
#> Removing 1 genes with low counts
#> Optimizing ELBO
#> 
#> ELBO converged or reached max iterations
#> Computing final ELBO
#> Constructing tensorflow graph
#> Set session seed to 685285 (disabled GPU)
#> Removing 1 genes with low counts
#> Optimizing ELBO
#> 
#> ELBO converged or reached max iterations
#> Computing final ELBO
#> Constructing tensorflow graph
#> Set session seed to 226318 (disabled GPU)
#> Removing 1 genes with low counts
#> Optimizing ELBO
#> 
#> ELBO converged or reached max iterations
#> Computing final ELBO
#> Constructing tensorflow graph
#> Set session seed to 365209 (disabled GPU)
#> Removing 1 genes with low counts
#> Optimizing ELBO
#> 
#> ELBO converged or reached max iterations
#> Computing final ELBO
#> Constructing tensorflow graph
#> Set session seed to 648795 (disabled GPU)
#> Removing 1 genes with low counts
#> Optimizing ELBO
#> 
#> ELBO converged or reached max iterations
#> Computing final ELBO
#> Constructing tensorflow graph
#> Set session seed to 985797 (disabled GPU)
#> Removing 1 genes with low counts
#> Optimizing ELBO
#> 
#> ELBO converged or reached max iterations
#> Computing final ELBO
#> ELBOs:  -562.642123413086 -562.63835144043 -562.625122070312 -562.864599609375 -562.871282958984 -562.859097290039 -562.870913696289 -562.893487548828 -562.904473876953

Note we have set a random seed, which in turn will set seeds for each of the runs to ensure reproducibility.

print(cal)
#> A clonealign_fit for 6 cells, 66 genes, and 3 clones
#> To access clone assignments, call x$clone
#> To access ML parameter estimates, call x$ml_params

We can plot the ELBO to ensure convergence:

qplot(seq_along(cal$convergence_info$elbo), cal$convergence_info$elbo, geom = c("point", "line")) +
  labs(x = "Iteration", y = "ELBO")

The maximum likelihood estimates of the clone assignments can be access through the clone slot:

clones <- cal$clone
table(clones)
#> clones
#> A 
#> 6

and we can make a heatmap of the clonal assignment probabilities:

pheatmap::pheatmap(cal$ml_params$clone_probs)

This can easily be added to the SingleCellExperiment for visualization with scater, after subsetting to the retained cells from the preprocess_for_clonealign call

example_sce <- example_sce[, ca_data$retained_cells]
example_sce$clone <- clones
example_sce <- normalize(example_sce)
#> Warning in .local(object, ...): using library sizes as size factors
plotPCA(example_sce, ncomponents = 3, colour_by = "clone")

The clone assignments in clones can then be used for the desired downstream analysis, such as differential expression or SNV analysis.

3.3 Plotting results

The plot_clonealign function can be used to check the sanity of the fitted clones by ensuring that gene expression does correlate to the inferred copy number. For this we require the input SingleCellExperiment, copy number matrix, and clone assignments. Note that the SingleCellExperiment requires columns in rowData corresponding to the chromosome, start and end position of each feature (gene). These can conveniently be gathered using the getBMFeatureAnnos function in scater, e.g.

sce <- getBMFeatureAnnos(sce, filters = "ensembl_gene_id",
                         attributes = c("ensembl_gene_id", "start_position", "end_position"),
                         feature_symbol = "hgnc_symbol",
                         feature_id = "ensembl_gene_id",
                         dataset = "hsapiens_gene_ensembl")

For now we’ll set these to made-up values:

gene_position <- as_data_frame(cnv_data) %>% 
  dplyr::mutate(gene = seq_len(nrow(cnv_data))) %>% 
  dplyr::arrange(A, B, C) %>% 
  dplyr::mutate(position = seq_len(nrow(cnv_data))) %>% 
  dplyr::arrange(gene) %>% 
  .$position
#> Warning: `as_data_frame()` is deprecated, use `as_tibble()` (but mind the new semantics).
#> This warning is displayed once per session.

rowData(example_sce)$chromosome <- "1"
rowData(example_sce)$start_pos <- gene_position
rowData(example_sce)$end_pos <- gene_position

We can then plot the expression estimates using the plot_clonealign function:

plot_clonealign(example_sce, cal$clone, cnv_data,
                chromosome = "1",
                chr_str = "chromosome",
                start_str = "start_pos",
                end_str = "end_pos")
#> Joining, by = "ensembl_gene_id"
#> Joining, by = c("clone", "state")
#> Joining, by = c("state", "clone")

where the *_str identifies the columns of rowData(example_sce) to look for the chromosome names and feature start and end positions.

3.4 Evaluating the clone assignment

clonealign will assign single-cell RNA-seq to clones no matter how good the fit and no matter whether the clones actually exist in the expression data. A simple yet effective metric is to evaluate the correlation of the expression with copy number after assignment. If the clone assignment has been unsuccessful, then this should be a distribution scattered around a mean close to zero. This is computed in the correlations slot of the clonealign_fit.

While we have too few genes in this example to critically evaluate, ideally most of this distribution should sit above 0 in real data.

4 Advanced options

4.1 Controlling Variational Inference

Inference is performed using reparametrization gradient variational inference which uses the evidence lower bound (ELBO) to monitor convergence. This is controlled using the rel_tol parameter. When the difference

\[ \Delta ELBO = \frac{ELBO_{\text{new}} - ELBO_{\text{old}}}{|ELBO_{\text{old}}|} \] falls below rel_tol, the optimization algorithm is considered converged. The maximum number of iterations to acheive this is set using the max_iter parameter.

Inference is performed using Adam (Kingma and Ba (2014)) which is controlled by three parameters:

  • learning_rate the learning rate
  • rel_tol the relative difference in the ELBO below which the optimization will be considered converged (see above)
  • max_iter the maximum number of Adam iterations to perform (see above)

4.2 Accessing posterior / maximum-likelihood parameter estimates

The object returned by a call to clonealign contains a clone slot for the maximum likelihood (ML) clone assignment for each cell. The ML estimates of the other parameters can be found in the cal$ml_params slot:

names(cal$ml_params)
#> [1] "mu"          "clone_probs" "s"           "alpha"       "psi"        
#> [6] "W"           "chi"

The slot clone_probs gives the probability that each cell is assigned to each clone:

head(cal$ml_params$clone_probs)
#>              A            B            C
#> [1,] 0.9989527 0.0005769770 0.0004702821
#> [2,] 0.9987974 0.0007222228 0.0004802904
#> [3,] 0.9990749 0.0004102710 0.0005147131
#> [4,] 0.9991068 0.0004590424 0.0004341093
#> [5,] 0.9990491 0.0003608988 0.0005899800
#> [6,] 0.9990216 0.0004625435 0.0005158331

while mu and phi give the maximum likelihood estimates of the \(\mu\) and \(\phi\) parameters from the model.

5 Technical

sessionInfo()
#> R version 3.6.0 (2019-04-26)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Debian GNU/Linux 9 (stretch)
#> 
#> Matrix products: default
#> BLAS/LAPACK: /usr/lib/libopenblasp-r0.2.19.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] clonealign_1.99.2           dplyr_0.8.3                
#>  [3] scater_1.12.0               ggplot2_3.2.1              
#>  [5] SingleCellExperiment_1.6.0  SummarizedExperiment_1.14.0
#>  [7] DelayedArray_0.10.0         BiocParallel_1.18.0        
#>  [9] matrixStats_0.54.0          Biobase_2.44.0             
#> [11] GenomicRanges_1.36.0        GenomeInfoDb_1.20.0        
#> [13] IRanges_2.18.0              S4Vectors_0.22.0           
#> [15] BiocGenerics_0.30.0         BiocStyle_2.12.0           
#> 
#> loaded via a namespace (and not attached):
#>  [1] bitops_1.0-6             fs_1.3.1                
#>  [3] usethis_1.5.0            devtools_2.0.2          
#>  [5] RColorBrewer_1.1-2       progress_1.2.2          
#>  [7] rprojroot_1.3-2          tools_3.6.0             
#>  [9] backports_1.1.4          R6_2.4.0                
#> [11] irlba_2.3.3              vipor_0.4.5             
#> [13] lazyeval_0.2.2           colorspace_1.4-1        
#> [15] withr_2.1.2              tidyselect_0.2.5        
#> [17] gridExtra_2.3            prettyunits_1.0.2       
#> [19] processx_3.4.1           compiler_3.6.0          
#> [21] cli_1.1.0                BiocNeighbors_1.2.0     
#> [23] desc_1.2.0               labeling_0.3            
#> [25] bookdown_0.9             scales_1.0.0            
#> [27] callr_3.2.0              tfruns_1.4              
#> [29] stringr_1.4.0            digest_0.6.20           
#> [31] R.utils_2.9.0            rmarkdown_1.12          
#> [33] XVector_0.24.0           base64enc_0.1-3         
#> [35] pkgconfig_2.0.2          htmltools_0.3.6         
#> [37] sessioninfo_1.1.1        rlang_0.4.0             
#> [39] rstudioapi_0.10          DelayedMatrixStats_1.6.0
#> [41] jsonlite_1.6             tensorflow_1.14.0       
#> [43] R.oo_1.22.0              RCurl_1.95-4.12         
#> [45] magrittr_1.5             BiocSingular_1.0.0      
#> [47] GenomeInfoDbData_1.2.1   Matrix_1.2-17           
#> [49] Rcpp_1.0.2               ggbeeswarm_0.6.0        
#> [51] munsell_0.5.0            reticulate_1.13         
#> [53] viridis_0.5.1            R.methodsS3_1.7.1       
#> [55] whisker_0.3-2            stringi_1.4.3           
#> [57] yaml_2.2.0               zlibbioc_1.30.0         
#> [59] plyr_1.8.4               pkgbuild_1.0.3          
#> [61] grid_3.6.0               crayon_1.3.4            
#> [63] lattice_0.20-38          cowplot_0.9.4           
#> [65] hms_0.5.0                zeallot_0.1.0           
#> [67] knitr_1.22               ps_1.3.0                
#> [69] pillar_1.4.2             reshape2_1.4.3          
#> [71] pkgload_1.0.2            glue_1.3.1              
#> [73] evaluate_0.13            remotes_2.0.4           
#> [75] BiocManager_1.30.5       vctrs_0.2.0             
#> [77] testthat_2.1.1           tidyr_0.8.3             
#> [79] gtable_0.3.0             purrr_0.3.2             
#> [81] assertthat_0.2.1         xfun_0.6                
#> [83] rsvd_1.0.0               viridisLite_0.3.0       
#> [85] pheatmap_1.0.12          tibble_2.1.3            
#> [87] beeswarm_0.2.3           memoise_1.1.0

References

Kingma, Diederik P, and Jimmy Ba. 2014. “Adam: A Method for Stochastic Optimization.” arXiv Preprint arXiv:1412.6980.