Contents

1 Overview

This vignette outlines how to take clone specific copy number data at a region resolution (e.g. 500kb bins) and convert this to gene resolution as is required by clonealign. We leave this as a guide to be implemented separately rather than precomputed functions as there are several choices that are situation specific, such as choice of genome and chromosome naming (1,2,3 vs chr1,chr2,chr3).

2 Converting region-based copy number to gene based

2.1 Example dataset

An example data frame containing clone and region specific data is included in the package:

data(df_cnv)

print(head(df_cnv))
#> # A tibble: 6 x 5
#>   chr      start       end copy_number clone
#>   <chr>    <dbl>     <dbl>       <dbl> <chr>
#> 1 1            1  12600000           1 A    
#> 2 1     12600001  60150000           2 A    
#> 3 1     60150001  61200000           3 A    
#> 4 1     61200001  77100000           2 A    
#> 5 1     77100001  78750000           3 A    
#> 6 1     78750001 156000000           2 A

This contains the following columns:

  1. chr: chromosome
  2. start, end: start and end positions on the chromosome
  3. copy_number: copy number of the segment
  4. clone: the clone for which this is the copy number

2.2 Loading the gene database and making sure chromosome names match

Next, we load the database of genes. In this example we choose the hg19 annotation, since our correspodning single-cell RNA-seq is aligned to hg19, and the copy number data (above) has been aligned to hg19. Thus we set

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

It is important we choose the correct genome here. For example, if instead we were working with mice and had aligned to mm9 then we would set

txdb <- TxDb.Mmusculus.UCSC.mm9.knownGene

We can then load the corresponding gene annotations via

g <- genes(txdb, single.strand.genes.only=FALSE)
g
#> GRangesList object of length 23459:
#> $1 
#> GRanges object with 1 range and 0 metadata columns:
#>       seqnames            ranges strand
#>          <Rle>         <IRanges>  <Rle>
#>   [1]    chr19 58858172-58874214      -
#> 
#> $10 
#> GRanges object with 1 range and 0 metadata columns:
#>       seqnames            ranges strand
#>   [1]     chr8 18248755-18258723      +
#> 
#> $100 
#> GRanges object with 1 range and 0 metadata columns:
#>       seqnames            ranges strand
#>   [1]    chr20 43248163-43280376      -
#> 
#> ...
#> <23456 more elements>
#> -------
#> seqinfo: 93 sequences (1 circular) from hg19 genome

One issue that creates perennial worldwide angst is that our chromosome names are 1,2,3,… whereas those with gene annotation are chr1, chr2, chr3,… We fix this by changing our original data frame as follows:

df_cnv <- mutate(df_cnv, chr = paste0("chr", chr))

We can then convert this to a GRanges object:

cnv_gr <- makeGRangesFromDataFrame(df_cnv, keep.extra.columns = TRUE)
cnv_gr
#> GRanges object with 435 ranges and 2 metadata columns:
#>         seqnames            ranges strand | copy_number       clone
#>            <Rle>         <IRanges>  <Rle> |   <numeric> <character>
#>     [1]     chr1        1-12600000      * |           1           A
#>     [2]     chr1 12600001-60150000      * |           2           A
#>     [3]     chr1 60150001-61200000      * |           3           A
#>     [4]     chr1 61200001-77100000      * |           2           A
#>     [5]     chr1 77100001-78750000      * |           3           A
#>     ...      ...               ...    ... .         ...         ...
#>   [431]    chr21        1-38850000      * |           2           C
#>   [432]    chr21 38850001-48150000      * |           1           C
#>   [433]    chr22        1-51450000      * |           2           C
#>   [434]     chrX       1-155400000      * |           2           C
#>   [435]     chrY        1-59400000      * |           0           C
#>   -------
#>   seqinfo: 24 sequences from an unspecified genome; no seqlengths

2.3 Finding overlaps between gene and region based annotation

Then we compute the overlaps between the gene annotation and the copy number data:

olaps <- findOverlaps(g, cnv_gr)
olaps
#> Hits object with 70601 hits and 0 metadata columns:
#>           queryHits subjectHits
#>           <integer>   <integer>
#>       [1]         1         136
#>       [2]         1         278
#>       [3]         1         424
#>       [4]         2          74
#>       [5]         2         214
#>       ...       ...         ...
#>   [70597]     23458         199
#>   [70598]     23458         343
#>   [70599]     23459         145
#>   [70600]     23459         287
#>   [70601]     23459         433
#>   -------
#>   queryLength: 23459 / subjectLength: 435

Here, the first column represents the index of each gene in g (that can be accessed with queryHits(olaps)), while the second column represents the index of each copy number region (subjectHits(olaps)). We can then convert this into a gene and copy number data frame:

df_gene <- data_frame(entrezgene = names(g)[queryHits(olaps)], 
           copy_number = mcols(cnv_gr)$copy_number[subjectHits(olaps)],
           clone = mcols(cnv_gr)$clone[subjectHits(olaps)])

Next, we’d like to map on ensembl gene ids:

entrezgene_ensembl_map <- as.list(org.Hs.egENSEMBL)
entrezgene_ensembl_map <- lapply(entrezgene_ensembl_map, `[`, 1)

df_gene <- dplyr::filter(df_gene, entrezgene %in% names(entrezgene_ensembl_map)) %>% 
  dplyr::mutate(ensembl_gene_id = unlist(entrezgene_ensembl_map[entrezgene])) %>% 
  dplyr::select(ensembl_gene_id, entrezgene, copy_number, clone) %>% 
  drop_na()

df_gene
#> # A tibble: 68,546 x 4
#>    ensembl_gene_id entrezgene copy_number clone
#>    <chr>           <chr>            <dbl> <chr>
#>  1 ENSG00000121410 1                    2 A    
#>  2 ENSG00000121410 1                    2 B    
#>  3 ENSG00000121410 1                    2 C    
#>  4 ENSG00000156006 10                   1 A    
#>  5 ENSG00000156006 10                   1 B    
#>  6 ENSG00000156006 10                   1 C    
#>  7 ENSG00000196839 100                  1 B    
#>  8 ENSG00000196839 100                  1 C    
#>  9 ENSG00000196839 100                  1 A    
#> 10 ENSG00000170558 1000                 2 A    
#> # ... with 68,536 more rows

We may find non-unique mappings. This can be due to genes spanning breakpoints or multi-mappings to e.g. pseudo-autosomal regions. To fix this, we retain only genes that are uniquely mapped:

df_gene <- count(df_gene, ensembl_gene_id) %>% 
  filter(n == length(unique(df_gene$clone))) %>% 
  inner_join(df_gene) %>% 
  dplyr::select(-n)
#> Joining, by = "ensembl_gene_id"
#> Warning: Column `ensembl_gene_id` has different attributes on LHS and RHS of
#> join

2.4 Creating and filtering the input for clonealign

Clonealign requires a gene by clone matrix as input, so to create this we’ll use the spread function from tidyr:

df_gene_expanded <- spread(df_gene, clone, copy_number)
head(df_gene_expanded)
#> # A tibble: 6 x 5
#>   ensembl_gene_id entrezgene     A     B     C
#>   <chr>           <chr>      <dbl> <dbl> <dbl>
#> 1 ENSG00000000003 7105           1     2     2
#> 2 ENSG00000000005 64102          1     2     2
#> 3 ENSG00000000419 8813           1     1     1
#> 4 ENSG00000000457 57147          3     3     3
#> 5 ENSG00000000460 55732          3     3     3
#> 6 ENSG00000000938 2268           2     2     2

We can turn this into the required matrix:

cnv_mat <- dplyr::select(df_gene_expanded, -ensembl_gene_id, -entrezgene) %>% 
  as.matrix()

rownames(cnv_mat) <- df_gene_expanded$ensembl_gene_id

In general, we should select as input to clonealign:

  1. Genes whose copy number varies between clones
  2. Genes whose minimum copy number is ~ 6 - this is arbitrary, but we expect dosage mechanisms to tail off somewhere in this region
keep_gene <- rowMins(cnv_mat) <= 6 & rowVars(cnv_mat) > 0

cnv_mat <- cnv_mat[keep_gene,]

print(head(cnv_mat))
#>                 A B C
#> ENSG00000000003 1 2 2
#> ENSG00000000005 1 2 2
#> ENSG00000001497 1 2 2
#> ENSG00000002822 2 3 3
#> ENSG00000003096 1 2 2
#> ENSG00000004809 5 4 4

This is now ready for input to clonealign. If for example we had a SingleCellExperiment named sce as input with ensembl gene ids as the rownames, we would subset accordingly:

sce <- sce[rownames(cnv_mat),]

and use these as input to clonealign:

clonealign(sce, cnv_mat,...)

3 Technical

sessionInfo()
#> R version 3.5.1 (2018-07-02)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Debian GNU/Linux 9 (stretch)
#> 
#> Matrix products: default
#> BLAS: /usr/lib/openblas-base/libblas.so.3
#> LAPACK: /usr/lib/libopenblasp-r0.2.19.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] org.Hs.eg.db_3.7.0                     
#>  [2] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
#>  [3] GenomicFeatures_1.34.1                 
#>  [4] AnnotationDbi_1.44.0                   
#>  [5] forcats_0.3.0                          
#>  [6] stringr_1.3.1                          
#>  [7] purrr_0.2.5                            
#>  [8] readr_1.2.1                            
#>  [9] tidyr_0.8.2                            
#> [10] tibble_1.4.2                           
#> [11] tidyverse_1.2.1                        
#> [12] bindrcpp_0.2.2                         
#> [13] clonealign_0.99.0                      
#> [14] dplyr_0.7.8                            
#> [15] scater_1.10.0                          
#> [16] ggplot2_3.1.0                          
#> [17] SingleCellExperiment_1.4.0             
#> [18] SummarizedExperiment_1.12.0            
#> [19] DelayedArray_0.8.0                     
#> [20] BiocParallel_1.16.2                    
#> [21] matrixStats_0.54.0                     
#> [22] Biobase_2.42.0                         
#> [23] GenomicRanges_1.34.0                   
#> [24] GenomeInfoDb_1.18.1                    
#> [25] IRanges_2.16.0                         
#> [26] S4Vectors_0.20.1                       
#> [27] BiocGenerics_0.28.0                    
#> [28] BiocStyle_2.10.0                       
#> 
#> loaded via a namespace (and not attached):
#>   [1] ggbeeswarm_0.6.0         colorspace_1.3-2        
#>   [3] rprojroot_1.3-2          XVector_0.22.0          
#>   [5] base64enc_0.1-3          fs_1.2.6                
#>   [7] rstudioapi_0.8           remotes_2.0.2           
#>   [9] bit64_0.9-7              fansi_0.4.0             
#>  [11] lubridate_1.7.4          xml2_1.2.0              
#>  [13] knitr_1.20               pkgload_1.0.2           
#>  [15] jsonlite_1.5             Rsamtools_1.34.0        
#>  [17] broom_0.5.0              tfruns_1.4              
#>  [19] HDF5Array_1.10.0         BiocManager_1.30.4      
#>  [21] compiler_3.5.1           httr_1.3.1              
#>  [23] backports_1.1.2          assertthat_0.2.0        
#>  [25] Matrix_1.2-14            lazyeval_0.2.1          
#>  [27] cli_1.0.1                htmltools_0.3.6         
#>  [29] prettyunits_1.0.2        tools_3.5.1             
#>  [31] gtable_0.2.0             glue_1.3.0              
#>  [33] GenomeInfoDbData_1.2.0   reshape2_1.4.3          
#>  [35] Rcpp_1.0.0               cellranger_1.1.0        
#>  [37] Biostrings_2.50.1        nlme_3.1-137            
#>  [39] rtracklayer_1.42.1       DelayedMatrixStats_1.4.0
#>  [41] xfun_0.4                 ps_1.2.1                
#>  [43] testthat_2.0.1           rvest_0.3.2             
#>  [45] devtools_2.0.1           XML_3.98-1.16           
#>  [47] zlibbioc_1.28.0          scales_1.0.0            
#>  [49] hms_0.4.2                rhdf5_2.26.0            
#>  [51] RColorBrewer_1.1-2       yaml_2.2.0              
#>  [53] memoise_1.1.0            reticulate_1.10         
#>  [55] gridExtra_2.3            biomaRt_2.38.0          
#>  [57] stringi_1.2.4            RSQLite_2.1.1           
#>  [59] tensorflow_1.10          desc_1.2.0              
#>  [61] pkgbuild_1.0.2           rlang_0.3.0.1           
#>  [63] pkgconfig_2.0.2          bitops_1.0-6            
#>  [65] evaluate_0.12            lattice_0.20-35         
#>  [67] Rhdf5lib_1.4.1           bindr_0.1.1             
#>  [69] GenomicAlignments_1.18.0 labeling_0.3            
#>  [71] bit_1.1-14               cowplot_0.9.3           
#>  [73] processx_3.2.0           tidyselect_0.2.5        
#>  [75] plyr_1.8.4               magrittr_1.5            
#>  [77] bookdown_0.7             R6_2.3.0                
#>  [79] DBI_1.0.0                pillar_1.3.0            
#>  [81] haven_2.0.0              whisker_0.3-2           
#>  [83] withr_2.1.2              RCurl_1.95-4.11         
#>  [85] modelr_0.1.2             crayon_1.3.4            
#>  [87] utf8_1.1.4               rmarkdown_1.10          
#>  [89] viridis_0.5.1            progress_1.2.0          
#>  [91] usethis_1.4.0            grid_3.5.1              
#>  [93] readxl_1.1.0             blob_1.1.1              
#>  [95] callr_3.0.0              digest_0.6.18           
#>  [97] munsell_0.5.0            beeswarm_0.2.3          
#>  [99] viridisLite_0.3.0        vipor_0.4.5             
#> [101] sessioninfo_1.1.1