After going through this tutorial you should be able to:
SingleCellExperiment
objectsSingleCellExperiment
objectsscater
cellassign
This document compiled: http://kieranrcampbell.github.io/r-workshop-march-2019
suppressPackageStartupMessages({
library(scater) # BioConductor
library(SingleCellExperiment) # BioConductor
library(DropletUtils) # BioConductor
library(tidyverse) # CRAN
library(here) # CRAN
library(DT) # CRAN
library(pheatmap) # CRAN
})
## Warning: package 'tibble' was built under R version 3.5.2
## Warning: package 'purrr' was built under R version 3.5.2
## Warning: package 'pheatmap' was built under R version 3.5.2
knitr::opts_chunk$set(echo = TRUE,
cache = TRUE)
If you need to install any of these, you can do so by:
install.packages(c("tidyverse", "here", "DT", "pheatmap", "BiocManager"))
BiocManager::install(c("scater", "SingleCellExperiment", "DropletUtils"))
It is highly recommended you do so before the tutorial as this can take significant time!
To install cellassign, we need to install Google’s Tensorflow framework. If this doesn’t work in the tutorial - don’t worry, you won’t need it for 80% of what we cover.
install.packages("tensorflow", repos = "http://cran.us.r-project.org")
##
## The downloaded binary packages are in
## /var/folders/wh/6v_w5j853g11zfhlb7f2rcvw0000gn/T//RtmpbGDyNc/downloaded_packages
tensorflow::install_tensorflow()
## Creating virtualenv for TensorFlow at /Users/kierancampbell/.virtualenvs/r-tensorflow
## Installing TensorFlow ...
##
## Installation complete.
install.packages("devtools", repos = "http://cran.us.r-project.org") # If not already installed
##
## There is a binary version available but the source version is
## later:
## binary source needs_compilation
## devtools 1.13.6 2.0.1 FALSE
## installing the source package 'devtools'
devtools::install_github("Irrationone/cellassign")
## Skipping install of 'cellassign' from a github remote, the SHA1 (3ce59232) has not changed since last install.
## Use `force = TRUE` to force installation
We’re going to use some re-processed data from Single cell RNA sequencing of human liver reveals distinct intrahepatic macrophage populations (patient 1 specifically). If you git cloned this repository (http://github.com/kieranrcampbell/https://github.com/kieranrcampbell/r-workshop-march-2019) this can be found in the directory data/outs/filtered_gene_bc_matrices/GRCh38
:
data_dir <- here("data", "outs", "filtered_gene_bc_matrices", "GRCh38")
print(data_dir)
## [1] "/Users/kierancampbell/ubc/r-workshop-march-2019/data/outs/filtered_gene_bc_matrices/GRCh38"
print(dir(data_dir))
## [1] "barcodes.tsv" "genes.tsv" "matrix.mtx"
10X data typically has a barcodes file (indicating cell barcodes), a genes file (indicating a mapping between genes and indices) and the actual expression data as raw counts in matrix.mtx
.
We can read these in to a SingleCellExperiment
object using the read10xCounts
function
sce <- read10xCounts(data_dir)
sce
## class: SingleCellExperiment
## dim: 33694 1767
## metadata(0):
## assays(1): counts
## rownames(33694): ENSG00000243485 ENSG00000237613 ...
## ENSG00000277475 ENSG00000268674
## rowData names(2): ID Symbol
## colnames: NULL
## colData names(2): Sample Barcode
## reducedDimNames(0):
## spikeNames(0):
SingleCellExperiment
packageOverall idea: think of your count matrix where the columns are cells and the rows are genes.
So when you see things like rowData
think geneData
, and colData
think cellData
!
Getting data dimensions
nrow(sce) # Number of rows = genes
## [1] 33694
ncol(sce) # Number of columns = cells
## [1] 1767
print(sce)
## class: SingleCellExperiment
## dim: 33694 1767
## metadata(0):
## assays(1): counts
## rownames(33694): ENSG00000243485 ENSG00000237613 ...
## ENSG00000277475 ENSG00000268674
## rowData names(2): ID Symbol
## colnames: NULL
## colData names(2): Sample Barcode
## reducedDimNames(0):
## spikeNames(0):
sce[, c(1,3,5)] # Subset to cells 1, 3, 5
## class: SingleCellExperiment
## dim: 33694 3
## metadata(0):
## assays(1): counts
## rownames(33694): ENSG00000243485 ENSG00000237613 ...
## ENSG00000277475 ENSG00000268674
## rowData names(2): ID Symbol
## colnames: NULL
## colData names(2): Sample Barcode
## reducedDimNames(0):
## spikeNames(0):
sce[c(2,4,6), ] # Subset to genes 2, 4, 6
## class: SingleCellExperiment
## dim: 3 1767
## metadata(0):
## assays(1): counts
## rownames(3): ENSG00000237613 ENSG00000238009 ENSG00000239906
## rowData names(2): ID Symbol
## colnames: NULL
## colData names(2): Sample Barcode
## reducedDimNames(0):
## spikeNames(0):
rownames
= gene names
head(rownames(sce))
## [1] "ENSG00000243485" "ENSG00000237613" "ENSG00000186092" "ENSG00000238009"
## [5] "ENSG00000239945" "ENSG00000239906"
colnames
= cell names = barcodes (sometimes)
head(colnames(sce))
## NULL
Column data (cell specific metadata)
head(colData(sce))
## DataFrame with 6 rows and 2 columns
## Sample
## <character>
## 1 /Users/kierancampbell/ubc/r-workshop-march-2019/data/outs/filtered_gene_bc_matrices/GRCh38
## 2 /Users/kierancampbell/ubc/r-workshop-march-2019/data/outs/filtered_gene_bc_matrices/GRCh38
## 3 /Users/kierancampbell/ubc/r-workshop-march-2019/data/outs/filtered_gene_bc_matrices/GRCh38
## 4 /Users/kierancampbell/ubc/r-workshop-march-2019/data/outs/filtered_gene_bc_matrices/GRCh38
## 5 /Users/kierancampbell/ubc/r-workshop-march-2019/data/outs/filtered_gene_bc_matrices/GRCh38
## 6 /Users/kierancampbell/ubc/r-workshop-march-2019/data/outs/filtered_gene_bc_matrices/GRCh38
## Barcode
## <character>
## 1 AAACCTGCAGTAAGCG-1
## 2 AAACCTGGTCCGTTAA-1
## 3 AAACCTGGTTCTGTTT-1
## 4 AAACCTGTCCTCATTA-1
## 5 AAACGGGAGTAGGCCA-1
## 6 AAACGGGAGTGTACTC-1
I might want to set the column names to the barcode:
colnames(sce) <- colData(sce)$Barcode
head(colnames(sce))
## [1] "AAACCTGCAGTAAGCG-1" "AAACCTGGTCCGTTAA-1" "AAACCTGGTTCTGTTT-1"
## [4] "AAACCTGTCCTCATTA-1" "AAACGGGAGTAGGCCA-1" "AAACGGGAGTGTACTC-1"
And I can subset on barcode:
sce[, "AAACCTGCAGTAAGCG-1"]
## class: SingleCellExperiment
## dim: 33694 1
## metadata(0):
## assays(1): counts
## rownames(33694): ENSG00000243485 ENSG00000237613 ...
## ENSG00000277475 ENSG00000268674
## rowData names(2): ID Symbol
## colnames(1): AAACCTGCAGTAAGCG-1
## colData names(2): Sample Barcode
## reducedDimNames(0):
## spikeNames(0):
Row data (gene specific metadata)
head(rowData(sce))
## DataFrame with 6 rows and 2 columns
## ID Symbol
## <character> <character>
## ENSG00000243485 ENSG00000243485 RP11-34P13.3
## ENSG00000237613 ENSG00000237613 FAM138A
## ENSG00000186092 ENSG00000186092 OR4F5
## ENSG00000238009 ENSG00000238009 RP11-34P13.7
## ENSG00000239945 ENSG00000239945 RP11-34P13.8
## ENSG00000239906 ENSG00000239906 RP11-34P13.14
reducedDims
is where our PCA,UMAP,tSNE representations will live - but we haven’t made them yet
reducedDims(sce)
## List of length 0
## names(0):
sizeFactors
is where our cell size factors will live - but we haven’t calculated them yet
head(sizeFactors(sce))
## NULL
The ability to have multiple assays
is one of the unique advantages of the SingleCellExperiment approach - I can carry around counts
, logcounts
, and any other weird data transformation I might like. Right now we only have raw counts, because that’s what we’ve read in:
names(assays(sce))
## [1] "counts"
assay(sce, "counts")
counts(sce)
I can make my own:
assay(sce, "my_super_strange_assay") <- sin(as.matrix(counts(sce)))
names(assays(sce))
## [1] "counts" "my_super_strange_assay"
class(assay(sce, "my_super_strange_assay"))
## [1] "matrix"
Note the beauty of SingleCellExperiments is that subsetting is consistent: if I want to subset only some cells and genes:
sce_subset <- sce[c(1,3,5), c(2,4,6,8)]
Then everything else is subset too!
print(dim(counts(sce_subset)))
## [1] 3 4
print(length(sizeFactors(sce_subset)))
## [1] 0
print(dim(rowData(sce_subset)))
## [1] 3 2
So the approach may seem like a lot of work up front to just carry around a count matrix and some metadata, but this sort of consistent subsetting makes it much harder (but still not impossible) to introduce bugs into your analysis.
First we do some key things to our data:
rowData(sce)$ensembl_gene_id <- rownames(sce)
sce <- getBMFeatureAnnos(sce,
filters = "ensembl_gene_id",
attributes = c("ensembl_gene_id", "hgnc_symbol", "entrezgene",
"start_position", "end_position", "chromosome_name"),
dataset = "hsapiens_gene_ensembl")
## Batch submitting query [>-------------------------] 3% eta: 1m Batch
## submitting query [>-------------------------] 4% eta: 1m Batch submitting
## query [=>------------------------] 6% eta: 50s Batch submitting
## query [=>------------------------] 7% eta: 45s Batch submitting
## query [=>------------------------] 9% eta: 42s Batch submitting
## query [==>-----------------------] 10% eta: 39s Batch submitting
## query [==>-----------------------] 12% eta: 37s Batch submitting
## query [==>-----------------------] 13% eta: 35s Batch submitting
## query [===>----------------------] 15% eta: 34s Batch submitting
## query [===>----------------------] 16% eta: 33s Batch submitting
## query [====>---------------------] 18% eta: 32s Batch submitting
## query [====>---------------------] 19% eta: 31s Batch submitting
## query [====>---------------------] 21% eta: 30s Batch submitting
## query [=====>--------------------] 22% eta: 33s Batch submitting
## query [=====>--------------------] 24% eta: 32s Batch submitting
## query [=====>--------------------] 25% eta: 31s Batch submitting
## query [======>-------------------] 26% eta: 31s Batch submitting
## query [======>-------------------] 28% eta: 30s Batch submitting
## query [=======>------------------] 29% eta: 30s Batch submitting
## query [=======>------------------] 31% eta: 29s Batch submitting
## query [=======>------------------] 32% eta: 28s Batch submitting
## query [========>-----------------] 34% eta: 27s Batch submitting
## query [========>-----------------] 35% eta: 27s Batch submitting
## query [=========>----------------] 37% eta: 26s Batch submitting
## query [=========>----------------] 38% eta: 25s Batch submitting
## query [=========>----------------] 40% eta: 24s Batch submitting
## query [==========>---------------] 41% eta: 23s Batch submitting
## query [==========>---------------] 43% eta: 23s Batch submitting
## query [==========>---------------] 44% eta: 22s Batch submitting
## query [===========>--------------] 46% eta: 22s Batch submitting
## query [===========>--------------] 47% eta: 21s Batch submitting
## query [============>-------------] 49% eta: 20s Batch submitting
## query [============>-------------] 50% eta: 20s Batch submitting
## query [============>-------------] 51% eta: 19s Batch submitting
## query [=============>------------] 53% eta: 18s Batch submitting
## query [=============>------------] 54% eta: 18s Batch submitting
## query [==============>-----------] 56% eta: 17s Batch submitting
## query [==============>-----------] 57% eta: 16s Batch submitting
## query [==============>-----------] 59% eta: 16s Batch submitting
## query [===============>----------] 60% eta: 15s Batch submitting
## query [===============>----------] 62% eta: 15s Batch submitting
## query [===============>----------] 63% eta: 14s Batch submitting
## query [================>---------] 65% eta: 13s Batch submitting
## query [================>---------] 66% eta: 13s Batch submitting
## query [=================>--------] 68% eta: 12s Batch submitting
## query [=================>--------] 69% eta: 12s Batch submitting
## query [=================>--------] 71% eta: 11s Batch submitting
## query [==================>-------] 72% eta: 10s Batch submitting
## query [==================>-------] 74% eta: 10s Batch submitting
## query [===================>------] 75% eta: 9s Batch submitting
## query [===================>------] 76% eta: 9s Batch submitting
## query [===================>------] 78% eta: 8s Batch submitting
## query [====================>-----] 79% eta: 8s Batch submitting
## query [====================>-----] 81% eta: 7s Batch submitting
## query [====================>-----] 82% eta: 7s Batch submitting
## query [=====================>----] 84% eta: 6s Batch submitting
## query [=====================>----] 85% eta: 5s Batch submitting
## query [======================>---] 87% eta: 5s Batch submitting
## query [======================>---] 88% eta: 4s Batch submitting
## query [======================>---] 90% eta: 4s Batch submitting
## query [=======================>--] 91% eta: 3s Batch submitting
## query [=======================>--] 93% eta: 3s Batch submitting
## query [=======================>--] 94% eta: 2s Batch submitting
## query [========================>-] 96% eta: 2s Batch submitting
## query [========================>-] 97% eta: 1s Batch submitting
## query [=========================>] 99% eta: 1s Batch submitting query
## [==========================] 100% eta: 0s
# Calculate size factors
sce <- scran::computeSumFactors(sce, BPPARAM = MulticoreParam(10))
## Warning in FUN(...): encountered negative size factor estimates
# Compute log normal expression values
sce <- normalize(sce)
names(assays(sce))
## [1] "counts" "my_super_strange_assay"
## [3] "logcounts"
head(sizeFactors(sce))
## [1] 0.033215558 0.071659554 0.009291482 3.408452925 2.399221881 0.106118551
sce <- runPCA(sce)
sce <- runUMAP(sce)
reducedDims(sce)
## List of length 2
## names(2): PCA UMAP
Just give me my PCA!
head(reducedDims(sce)[['PCA']])
## PC1 PC2
## AAACCTGCAGTAAGCG-1 -6.822310 -1.921039
## AAACCTGGTCCGTTAA-1 -3.898654 -2.770704
## AAACCTGGTTCTGTTT-1 -7.841220 -1.896710
## AAACCTGTCCTCATTA-1 7.392807 2.167728
## AAACGGGAGTAGGCCA-1 12.452015 -4.042698
## AAACGGGAGTGTACTC-1 -4.069171 -1.635552
I like to add the symbols to the rownames:
rownames(sce) <- paste0(rowData(sce)$Symbol, "_", rownames(sce))
head(rownames(sce))
## [1] "RP11-34P13.3_ENSG00000243485" "FAM138A_ENSG00000237613"
## [3] "OR4F5_ENSG00000186092" "RP11-34P13.7_ENSG00000238009"
## [5] "RP11-34P13.8_ENSG00000239945" "RP11-34P13.14_ENSG00000239906"
We next need to work out which genes are mitochondrial and ribosomal as these work well for QC:
# Get Mitochondrial genes for QC:
mt_genes <- which(rowData(sce)$chromosome_name == "MT")
ribo_genes <- grepl("^RP[LS]", rowData(sce)$Symbol)
feature_ctrls <- list(mito = rownames(sce)[mt_genes],
ribo = rownames(sce)[ribo_genes])
lapply(feature_ctrls, head)
## $mito
## [1] "MT-ND1_ENSG00000198888" "MT-ND2_ENSG00000198763"
## [3] "MT-CO1_ENSG00000198804" "MT-CO2_ENSG00000198712"
## [5] "MT-ATP8_ENSG00000228253" "MT-ATP6_ENSG00000198899"
##
## $ribo
## [1] "RPL22_ENSG00000116251" "RPL11_ENSG00000142676"
## [3] "RPS6KA1_ENSG00000117676" "RPS8_ENSG00000142937"
## [5] "RPL5_ENSG00000122406" "RPS27_ENSG00000177954"
And call the calcualteQCMetrics
function in scater:
sce <- calculateQCMetrics(sce, feature_controls = feature_ctrls)
datatable(head(as.data.frame(colData(sce))))
My personal favourite plot to QC scRNA-seq data:
plotColData(sce, x = "total_features_by_counts", y = "pct_counts_mito")
Typically retain cells that have < 10% mitochondrial transcripts and > 1000 features, but this is dataset dependent - for example, tumour cells typically have higher metabolic burden, leading to higher % mitochondrial (we typically use 20% as a filter then).
plotPCA(sce, colour_by = "pct_counts_mito")
## Warning: 'add_ticks' is deprecated.
## Use '+ geom_rug(...)' instead.
For this going to use a simple threshold of 10% mitochondrial. Importantly, we re-compute the QC metrics, size factors and normalization
mito_thresh <- 10
sce_qc <- sce[, sce$pct_counts_mito < mito_thresh]
sce_qc <- scran::computeSumFactors(sce_qc, BPPARAM = MulticoreParam(10))
sce_qc <- normalize(sce_qc)
sce_qc <- calculateQCMetrics(sce_qc, feature_controls = feature_ctrls)
sce_qc <- runPCA(sce_qc)
sce_qc <- runUMAP(sce_qc)
# plotPCA(sce_qc, colour_by = "total_features_by_counts")
plotUMAP(sce_qc, colour_by = "pct_counts_mito")
## Warning: 'add_ticks' is deprecated.
## Use '+ geom_rug(...)' instead.
plotScater(sce)
plotScater(sce_qc)
plotHighestExprs(sce_qc)
I can call
plotPCA(sce, colour_by = "x")
plotUMAP(sce, colour_by = "x")
plotTSNE(sce, colour_by = "x")
where x
is:
colData(sce)
(= the cell specific data) to colour by metadatarownames(sce)
to colour by expressionplotPCA(sce, colour_by = "SAA1_ENSG00000173432")
## Warning: 'add_ticks' is deprecated.
## Use '+ geom_rug(...)' instead.
plotColData(sce_qc, x = "total_counts", y = "pct_counts_mito")
plotExpression(sce_qc,
x = "total_counts",
features = "GAPDH_ENSG00000111640")
CellAssign is our new method to assign cells to known cell types. It relies on assuming cell types over-express their own markers, e.g. an epithelial tumour cell should overexpress EPCAM relative to other cell types.
library(cellassign)
In this example, the data we have just performed QC and exploratory analysis of is liver cells, that we expect to contain a certain number of cell types. To begin, we specify a list, where each item corresponds to a set of marker genes for a given cell type:
liver_marker_list <- list(
Hepatocytes = c("ALB", "HAMP", "ARG1", "PCK1", "AFP", "BCHE"),
LSECs = c("CALCRL", "FCGR2B", "VWF"),
Cholangiocytes = c("KRT19", "EPCAM", "CLDN4", "CLDN10", "SOX9", "MMP7", "CXCL1", "CFTR", "TFF2", "KRT7", "CD24"),
`Hepatic Stellate Cells` = c("ACTA2", "COL1A1", "COL1A2", "COL3A1", "DCN", "MYL9"),
Macrophages = c("CD68", "MARCO", "FCGR3A", "LYZ", "PTPRC"),
`ab T cells` = c("CD2", "CD3D", "TRAC", "IL32", "CD3E", "PTPRC"),
`gd T cells` = c("NKG7", "FCGR3A", "HOPX", "GNLY", "KLRF1", "CMC1", "CCL3", "PTPRC"),
`NK cells` = c("GZMK", "KLRF1", "CCL3", "CMC1", "NKG7", "PTPRC"),
`Plasma cells` = c("CD27", "IGHG1", "CD79A", "IGHG2", "PTPRC", "IGKC"),
`Mature B cells` = c("MS4A1", "LTB", "CD52", "IGHD", "CD79A", "PTPRC", "IGKC"),
`Erythroid cells` = c("HBB", "SLC25A37", "CA1", "ALAS2")
)
To begin, we use cellassign
’s marker_list_to_mat
function to convert this into a (binary) cell type by marker matrix:
mgi <- marker_list_to_mat(liver_marker_list, include_other = FALSE)
pheatmap(mgi)
We then make sure all of these markers exist in our SingleCellExperiment:
marker_in_sce <- match(rownames(mgi), rowData(sce_qc)$Symbol)
stopifnot(all(!is.na(marker_in_sce)))
And finally we subset sce
to just the marker genes:
sce_marker <- sce_qc[marker_in_sce, ]
stopifnot(all.equal(rownames(mgi), rowData(sce_marker)$Symbol))
We then call cellassign
passing in the SingleCellExperiment
, marker info, the size factors we’ve calculated, as well as various other options:
counts(sce_marker) <- as.matrix(counts(sce_marker))
print(dim(sce_marker))
## [1] 56 331
print(dim(mgi))
## [1] 56 11
fit <- cellassign(
exprs_obj = sce_marker,
marker_gene_info = mgi,
s = sizeFactors(sce_qc),
shrinkage = TRUE,
max_iter_adam = 50,
min_delta = 2,
verbose = TRUE
)
## 50 L old: -49198.13364057; L new: -12797.9958710856; Difference (%): 0.739868264829217
## 50 L old: -12797.9958710856; L new: -12240.3944291952; Difference (%): 0.0435694344260704
## 50 L old: -12240.3944291952; L new: -11957.9641964594; Difference (%): 0.0230736218811856
## 50 L old: -11957.9641964594; L new: -11788.814823868; Difference (%): 0.014145331915394
## 50 L old: -11788.814823868; L new: -11721.4394635294; Difference (%): 0.00571519371075677
## 50 L old: -11721.4394635294; L new: -11697.3827048211; Difference (%): 0.0020523723884898
## 50 L old: -11697.3827048211; L new: -11687.9651451312; Difference (%): 0.000805099732778434
## 50 L old: -11687.9651451312; L new: -11681.5548969882; Difference (%): 0.000548448601906885
## 50 L old: -11681.5548969882; L new: -11676.25864483; Difference (%): 0.000453385889548398
## 50 L old: -11676.25864483; L new: -11671.4496068477; Difference (%): 0.000411864633057001
## 50 L old: -11671.4496068477; L new: -11666.892848577; Difference (%): 0.000390419221623123
## 50 L old: -11666.892848577; L new: -11662.6015473256; Difference (%): 0.000367818690637439
## 50 L old: -11662.6015473256; L new: -11658.0910340668; Difference (%): 0.000386750180952966
## 50 L old: -11658.0910340668; L new: -11653.7533749585; Difference (%): 0.000372072845859803
## 50 L old: -11653.7533749585; L new: -11649.4509262191; Difference (%): 0.000369189959747849
## 50 L old: -11649.4509262191; L new: -11645.0699987986; Difference (%): 0.00037606299629168
## 50 L old: -11645.0699987986; L new: -11640.7297622289; Difference (%): 0.000372710217296541
## 50 L old: -11640.7297622289; L new: -11636.2901717828; Difference (%): 0.000381384203296699
## 50 L old: -11636.2901717828; L new: -11631.8513559524; Difference (%): 0.000381463143740781
## 50 L old: -11631.8513559524; L new: -11627.3750041921; Difference (%): 0.000384835708719969
fit
## A cellassign fit for 331 cells, 56 genes, 11 cell types with 0 covariates
## To access MLE cell types, call x$cell_type
## To access MLE parameter estimates, call x$mle_params
Add the cell types to the sce:
sce_qc$cell_type <- fit$cell_type
plotUMAP(sce_qc, colour_by = "cell_type")
## Warning: 'add_ticks' is deprecated.
## Use '+ geom_rug(...)' instead.
acol <- data.frame(`cellassign cell type` = sce_qc$cell_type)
rownames(acol) <- colnames(sce_qc)
pheatmap(as.matrix(logcounts(sce_marker)),
annotation_col = acol)
pheatmap(fit$mle_params$gamma)