This vignette outlines how to take clone specific copy number data at a region resolution (e.g. 500kb bins) and convert this to gene resolution as is required by clonealign
. We leave this as a guide to be implemented separately rather than precomputed functions as there are several choices that are situation specific, such as choice of genome and chromosome naming (1,2,3 vs chr1,chr2,chr3).
An example data frame containing clone and region specific data is included in the package:
data(df_cnv)
print(head(df_cnv))
#> # A tibble: 6 x 5
#> chr start end copy_number clone
#> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 1 1 12600000 1 A
#> 2 1 12600001 60150000 2 A
#> 3 1 60150001 61200000 3 A
#> 4 1 61200001 77100000 2 A
#> 5 1 77100001 78750000 3 A
#> 6 1 78750001 156000000 2 A
This contains the following columns:
chr
: chromosomestart
, end
: start and end positions on the chromosomecopy_number
: copy number of the segmentclone
: the clone for which this is the copy numberNext, we load the database of genes. In this example we choose the hg19 annotation, since our correspodning single-cell RNA-seq is aligned to hg19, and the copy number data (above) has been aligned to hg19. Thus we set
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
It is important we choose the correct genome here. For example, if instead we were working with mice and had aligned to mm9 then we would set
txdb <- TxDb.Mmusculus.UCSC.mm9.knownGene
We can then load the corresponding gene annotations via
g <- genes(txdb, single.strand.genes.only=FALSE)
g
#> GRangesList object of length 23459:
#> $1
#> GRanges object with 1 range and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> [1] chr19 58858172-58874214 -
#>
#> $10
#> GRanges object with 1 range and 0 metadata columns:
#> seqnames ranges strand
#> [1] chr8 18248755-18258723 +
#>
#> $100
#> GRanges object with 1 range and 0 metadata columns:
#> seqnames ranges strand
#> [1] chr20 43248163-43280376 -
#>
#> ...
#> <23456 more elements>
#> -------
#> seqinfo: 93 sequences (1 circular) from hg19 genome
One issue that creates perennial worldwide angst is that our chromosome names are 1,2,3,… whereas those with gene annotation are chr1, chr2, chr3,… We fix this by changing our original data frame as follows:
df_cnv <- mutate(df_cnv, chr = paste0("chr", chr))
We can then convert this to a GRanges
object:
cnv_gr <- makeGRangesFromDataFrame(df_cnv, keep.extra.columns = TRUE)
cnv_gr
#> GRanges object with 435 ranges and 2 metadata columns:
#> seqnames ranges strand | copy_number clone
#> <Rle> <IRanges> <Rle> | <numeric> <character>
#> [1] chr1 1-12600000 * | 1 A
#> [2] chr1 12600001-60150000 * | 2 A
#> [3] chr1 60150001-61200000 * | 3 A
#> [4] chr1 61200001-77100000 * | 2 A
#> [5] chr1 77100001-78750000 * | 3 A
#> ... ... ... ... . ... ...
#> [431] chr21 1-38850000 * | 2 C
#> [432] chr21 38850001-48150000 * | 1 C
#> [433] chr22 1-51450000 * | 2 C
#> [434] chrX 1-155400000 * | 2 C
#> [435] chrY 1-59400000 * | 0 C
#> -------
#> seqinfo: 24 sequences from an unspecified genome; no seqlengths
Then we compute the overlaps between the gene annotation and the copy number data:
olaps <- findOverlaps(g, cnv_gr)
olaps
#> Hits object with 70601 hits and 0 metadata columns:
#> queryHits subjectHits
#> <integer> <integer>
#> [1] 1 136
#> [2] 1 278
#> [3] 1 424
#> [4] 2 74
#> [5] 2 214
#> ... ... ...
#> [70597] 23458 199
#> [70598] 23458 343
#> [70599] 23459 145
#> [70600] 23459 287
#> [70601] 23459 433
#> -------
#> queryLength: 23459 / subjectLength: 435
Here, the first column represents the index of each gene in g
(that can be accessed with queryHits(olaps)
), while the second column represents the index of each copy number region (subjectHits(olaps)
). We can then convert this into a gene and copy number data frame:
df_gene <- data_frame(entrezgene = names(g)[queryHits(olaps)],
copy_number = mcols(cnv_gr)$copy_number[subjectHits(olaps)],
clone = mcols(cnv_gr)$clone[subjectHits(olaps)])
Next, we’d like to map on ensembl gene ids:
entrezgene_ensembl_map <- as.list(org.Hs.egENSEMBL)
entrezgene_ensembl_map <- lapply(entrezgene_ensembl_map, `[`, 1)
df_gene <- dplyr::filter(df_gene, entrezgene %in% names(entrezgene_ensembl_map)) %>%
dplyr::mutate(ensembl_gene_id = unlist(entrezgene_ensembl_map[entrezgene])) %>%
dplyr::select(ensembl_gene_id, entrezgene, copy_number, clone) %>%
drop_na()
df_gene
#> # A tibble: 68,546 x 4
#> ensembl_gene_id entrezgene copy_number clone
#> <chr> <chr> <dbl> <chr>
#> 1 ENSG00000121410 1 2 A
#> 2 ENSG00000121410 1 2 B
#> 3 ENSG00000121410 1 2 C
#> 4 ENSG00000156006 10 1 A
#> 5 ENSG00000156006 10 1 B
#> 6 ENSG00000156006 10 1 C
#> 7 ENSG00000196839 100 1 B
#> 8 ENSG00000196839 100 1 C
#> 9 ENSG00000196839 100 1 A
#> 10 ENSG00000170558 1000 2 A
#> # ... with 68,536 more rows
We may find non-unique mappings. This can be due to genes spanning breakpoints or multi-mappings to e.g. pseudo-autosomal regions. To fix this, we retain only genes that are uniquely mapped:
df_gene <- count(df_gene, ensembl_gene_id) %>%
filter(n == length(unique(df_gene$clone))) %>%
inner_join(df_gene) %>%
dplyr::select(-n)
#> Joining, by = "ensembl_gene_id"
#> Warning: Column `ensembl_gene_id` has different attributes on LHS and RHS of
#> join
Clonealign requires a gene by clone matrix as input, so to create this we’ll use the spread
function from tidyr
:
df_gene_expanded <- spread(df_gene, clone, copy_number)
head(df_gene_expanded)
#> # A tibble: 6 x 5
#> ensembl_gene_id entrezgene A B C
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 ENSG00000000003 7105 1 2 2
#> 2 ENSG00000000005 64102 1 2 2
#> 3 ENSG00000000419 8813 1 1 1
#> 4 ENSG00000000457 57147 3 3 3
#> 5 ENSG00000000460 55732 3 3 3
#> 6 ENSG00000000938 2268 2 2 2
We can turn this into the required matrix:
cnv_mat <- dplyr::select(df_gene_expanded, -ensembl_gene_id, -entrezgene) %>%
as.matrix()
rownames(cnv_mat) <- df_gene_expanded$ensembl_gene_id
In general, we should select as input to clonealign:
keep_gene <- rowMins(cnv_mat) <= 6 & rowVars(cnv_mat) > 0
cnv_mat <- cnv_mat[keep_gene,]
print(head(cnv_mat))
#> A B C
#> ENSG00000000003 1 2 2
#> ENSG00000000005 1 2 2
#> ENSG00000001497 1 2 2
#> ENSG00000002822 2 3 3
#> ENSG00000003096 1 2 2
#> ENSG00000004809 5 4 4
This is now ready for input to clonealign
. If for example we had a SingleCellExperiment
named sce
as input with ensembl gene ids as the rownames
, we would subset accordingly:
sce <- sce[rownames(cnv_mat),]
and use these as input to clonealign:
clonealign(sce, cnv_mat,...)
sessionInfo()
#> R version 3.5.1 (2018-07-02)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Debian GNU/Linux 9 (stretch)
#>
#> Matrix products: default
#> BLAS: /usr/lib/openblas-base/libblas.so.3
#> LAPACK: /usr/lib/libopenblasp-r0.2.19.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] parallel stats4 stats graphics grDevices utils datasets
#> [8] methods base
#>
#> other attached packages:
#> [1] org.Hs.eg.db_3.7.0
#> [2] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
#> [3] GenomicFeatures_1.34.1
#> [4] AnnotationDbi_1.44.0
#> [5] forcats_0.3.0
#> [6] stringr_1.3.1
#> [7] purrr_0.2.5
#> [8] readr_1.2.1
#> [9] tidyr_0.8.2
#> [10] tibble_1.4.2
#> [11] tidyverse_1.2.1
#> [12] bindrcpp_0.2.2
#> [13] clonealign_0.99.0
#> [14] dplyr_0.7.8
#> [15] scater_1.10.0
#> [16] ggplot2_3.1.0
#> [17] SingleCellExperiment_1.4.0
#> [18] SummarizedExperiment_1.12.0
#> [19] DelayedArray_0.8.0
#> [20] BiocParallel_1.16.2
#> [21] matrixStats_0.54.0
#> [22] Biobase_2.42.0
#> [23] GenomicRanges_1.34.0
#> [24] GenomeInfoDb_1.18.1
#> [25] IRanges_2.16.0
#> [26] S4Vectors_0.20.1
#> [27] BiocGenerics_0.28.0
#> [28] BiocStyle_2.10.0
#>
#> loaded via a namespace (and not attached):
#> [1] ggbeeswarm_0.6.0 colorspace_1.3-2
#> [3] rprojroot_1.3-2 XVector_0.22.0
#> [5] base64enc_0.1-3 fs_1.2.6
#> [7] rstudioapi_0.8 remotes_2.0.2
#> [9] bit64_0.9-7 fansi_0.4.0
#> [11] lubridate_1.7.4 xml2_1.2.0
#> [13] knitr_1.20 pkgload_1.0.2
#> [15] jsonlite_1.5 Rsamtools_1.34.0
#> [17] broom_0.5.0 tfruns_1.4
#> [19] HDF5Array_1.10.0 BiocManager_1.30.4
#> [21] compiler_3.5.1 httr_1.3.1
#> [23] backports_1.1.2 assertthat_0.2.0
#> [25] Matrix_1.2-14 lazyeval_0.2.1
#> [27] cli_1.0.1 htmltools_0.3.6
#> [29] prettyunits_1.0.2 tools_3.5.1
#> [31] gtable_0.2.0 glue_1.3.0
#> [33] GenomeInfoDbData_1.2.0 reshape2_1.4.3
#> [35] Rcpp_1.0.0 cellranger_1.1.0
#> [37] Biostrings_2.50.1 nlme_3.1-137
#> [39] rtracklayer_1.42.1 DelayedMatrixStats_1.4.0
#> [41] xfun_0.4 ps_1.2.1
#> [43] testthat_2.0.1 rvest_0.3.2
#> [45] devtools_2.0.1 XML_3.98-1.16
#> [47] zlibbioc_1.28.0 scales_1.0.0
#> [49] hms_0.4.2 rhdf5_2.26.0
#> [51] RColorBrewer_1.1-2 yaml_2.2.0
#> [53] memoise_1.1.0 reticulate_1.10
#> [55] gridExtra_2.3 biomaRt_2.38.0
#> [57] stringi_1.2.4 RSQLite_2.1.1
#> [59] tensorflow_1.10 desc_1.2.0
#> [61] pkgbuild_1.0.2 rlang_0.3.0.1
#> [63] pkgconfig_2.0.2 bitops_1.0-6
#> [65] evaluate_0.12 lattice_0.20-35
#> [67] Rhdf5lib_1.4.1 bindr_0.1.1
#> [69] GenomicAlignments_1.18.0 labeling_0.3
#> [71] bit_1.1-14 cowplot_0.9.3
#> [73] processx_3.2.0 tidyselect_0.2.5
#> [75] plyr_1.8.4 magrittr_1.5
#> [77] bookdown_0.7 R6_2.3.0
#> [79] DBI_1.0.0 pillar_1.3.0
#> [81] haven_2.0.0 whisker_0.3-2
#> [83] withr_2.1.2 RCurl_1.95-4.11
#> [85] modelr_0.1.2 crayon_1.3.4
#> [87] utf8_1.1.4 rmarkdown_1.10
#> [89] viridis_0.5.1 progress_1.2.0
#> [91] usethis_1.4.0 grid_3.5.1
#> [93] readxl_1.1.0 blob_1.1.1
#> [95] callr_3.0.0 digest_0.6.18
#> [97] munsell_0.5.0 beeswarm_0.2.3
#> [99] viridisLite_0.3.0 vipor_0.4.5
#> [101] sessioninfo_1.1.1