Abstract

Instructions on how to obtain 489 cell type gene signatures from Aran et al., 2017.

Introduction

The xCellData package provides an R / Bioconductor resource for obtaining and representing the 489 cell type gene signatures published by Aran, Hu, and Butte (2017).

This package uses the unisets Sets class to represent the collection of signatures. However, the data itself is distributed with the package as a GMT file, which other packages can parse and import into their own classes (e.g. the GSEABase GeneSetCollection or the GeneSet tbl_geneset).
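Because GMT is a plain tab-separated text format (set name, description, then one column per member gene), it can also be read without any dedicated package. The following is a minimal base R sketch, not the parser used by xCellData: `read_gmt()` is a hypothetical helper, and the file content is toy data written to a temporary file rather than the GMT file shipped with the package.

```r
# Toy GMT content: set name, description, then member genes, tab-separated.
# The set names and descriptions below are illustrative only.
gmt_lines <- c(
    "toy_set_1\ttoy_source_1\tC1QA\tC1QB\tCCL13",
    "toy_set_2\ttoy_source_2\tIL2RA\tRGS1"
)
gmt_file <- tempfile(fileext = ".gmt")
writeLines(gmt_lines, gmt_file)

# Hypothetical helper (not part of xCellData): read a GMT file into a
# named list of character vectors, one vector of genes per set.
read_gmt <- function(path) {
    fields <- strsplit(readLines(path), "\t", fixed = TRUE)
    # Drop the first two fields (name and description) to keep the genes.
    sets <- lapply(fields, function(x) x[-(1:2)])
    names(sets) <- vapply(fields, `[`, character(1), 1)
    sets
}

signatures <- read_gmt(gmt_file)
```

The result is an ordinary named list, which is convenient for quick inspection but carries none of the metadata handling offered by the dedicated classes mentioned above.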

Data preprocessing

The script used to download and preprocess the data is distributed with the package. You can find it at the following location:

system.file(package = "xCellData", "scripts", "makeData.R")
## [1] "/home/travis/R/Library/xCellData/scripts/makeData.R"

Briefly, the script downloads “Additional file 3: The 489 cell type gene signatures. (XLSX 417 kb)” from the https://genomebiology.biomedcentral.com website and reformats the content of the published Microsoft Excel file into a GMT text file.
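The reformatting step can be pictured as writing one tab-separated line per signature. The sketch below is an illustration under stated assumptions, not the content of makeData.R: `write_gmt()` is a hypothetical helper, the input is a toy named list rather than the published Excel file, and the description field is filled with a placeholder.

```r
# Hypothetical helper (not taken from makeData.R): write a named list of
# gene vectors as GMT, i.e. one line per set with tab-separated fields:
# set name, description, then the member genes.
write_gmt <- function(sets, path, description = "NA") {
    lines <- vapply(names(sets), function(nm) {
        paste(c(nm, description, sets[[nm]]), collapse = "\t")
    }, character(1))
    writeLines(lines, path)
    invisible(path)
}

# Toy signatures; names and memberships are illustrative only.
toy_sets <- list(
    toy_set_1 = c("C1QA", "C1QB"),
    toy_set_2 = c("IL2RA", "RGS1", "LAIR2")
)
out_file <- tempfile(fileext = ".gmt")
write_gmt(toy_sets, out_file)
gmt_text <- readLines(out_file)
```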

Workflow

Loading the data

We use the xCellData() function to parse the GMT file distributed with the package into a unisets Sets object.

library(xCellData)
library(unisets)
xsig <- xCellData()
xsig
## Sets with 20803 relations between 5079 elements and 489 sets
##             element          set
##         <character>  <character>
##     [1]        C1QA   aDC_HPCA_1
##     [2]        C1QB   aDC_HPCA_1
##     [3]       CCL13   aDC_HPCA_1
##     [4]       CCL17   aDC_HPCA_1
##     [5]       CCL19   aDC_HPCA_1
##     ...         ...          ...
## [20799]       IL2RA Tregs_HPCA_3
## [20800]       KCNA2 Tregs_HPCA_3
## [20801]       LAIR2 Tregs_HPCA_3
## [20802]      MCF2L2 Tregs_HPCA_3
## [20803]        RGS1 Tregs_HPCA_3
## -----------
## elementInfo: IdVector with 0 metadata
##     setInfo: IdVector with 1 metadata (source)

Using the data

The signatures may then be used for downstream analyses such as cell type annotation.

For instance, the Sets object can be split into a list of signatures, for use in functions such as lapply.

as.list(xsig)
## List of length 489
## names(489): Adipocytes_ENCODE_1 ... pro B-cells_NOVERSHTERN_3
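As a sketch of how such a list might be used with lapply, the example below compares each signature with a query gene set. It assumes that each list element behaves like a character vector of gene symbols; the list `signatures` and the vector `query` are toy stand-ins, not data from the package.

```r
# Toy stand-in for as.list(xsig): a named list of gene symbol vectors.
signatures <- list(
    toy_set_1 = c("C1QA", "C1QB", "CCL13", "CCL17"),
    toy_set_2 = c("IL2RA", "KCNA2", "LAIR2", "RGS1")
)
# Hypothetical query, e.g. marker genes detected for a cluster of cells.
query <- c("C1QA", "CCL13", "RGS1")

# Genes shared between the query and each signature.
shared <- lapply(signatures, intersect, y = query)

# Fraction of each signature that is covered by the query.
coverage <- vapply(names(signatures), function(nm) {
    length(shared[[nm]]) / length(signatures[[nm]])
}, numeric(1))
```

A simple overlap fraction like this is only a rough indicator; dedicated enrichment or scoring methods are usually preferable for actual cell type annotation.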

One may also inspect the number of genes in each signature.

dat <- setLengths(xsig)
hist(
    dat, breaks = 100, xlim = c(0, max(dat)),
    main = "Distribution of signature sizes", xlab = "Number of genes"
)
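Signature sizes can also guide filtering, for instance to set aside very small signatures before downstream analyses. The sketch below assumes that the sizes are available as a named integer vector; `sizes` is toy data and the cutoff `min_size` is an arbitrary value chosen for illustration.

```r
# Toy stand-in for per-signature sizes (assumed to be a named integer vector).
sizes <- c(toy_set_1 = 4L, toy_set_2 = 12L, toy_set_3 = 57L)

# Numeric summary of the size distribution.
size_summary <- summary(sizes)

# Names of the signatures with at least `min_size` genes (arbitrary cutoff).
min_size <- 10L
large_sets <- names(sizes)[sizes >= min_size]
```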

Examples of packages that use xCellData include:

Session information

## R Under development (unstable) (2020-07-13 r78833)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.6 LTS
## 
## Matrix products: default
## BLAS:   /home/travis/R-bin/lib/R/lib/libRblas.so
## LAPACK: /home/travis/R-bin/lib/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
## [1] unisets_0.99.0      S4Vectors_0.27.9    BiocGenerics_0.35.2
## [4] xCellData_0.0.1     BiocStyle_2.17.0   
## 
## loaded via a namespace (and not attached):
##  [1] SummarizedExperiment_1.19.4 xfun_0.15                  
##  [3] reshape2_1.4.4              lattice_0.20-41            
##  [5] vctrs_0.3.0                 htmltools_0.5.0            
##  [7] rtracklayer_1.49.2          yaml_2.2.1                 
##  [9] blob_1.2.1                  XML_3.99-0.3               
## [11] rlang_0.4.7                 pkgdown_1.5.1              
## [13] DBI_1.1.0                   BiocParallel_1.23.0        
## [15] bit64_0.9-7                 matrixStats_0.56.0         
## [17] GenomeInfoDbData_1.2.3      plyr_1.8.6                 
## [19] stringr_1.4.0               zlibbioc_1.35.0            
## [21] Biostrings_2.57.1           memoise_1.1.0              
## [23] evaluate_0.14               Biobase_2.49.0             
## [25] knitr_1.29                  IRanges_2.23.6             
## [27] GenomeInfoDb_1.25.0         AnnotationDbi_1.51.0       
## [29] GSEABase_1.51.1             Rcpp_1.0.4.6               
## [31] xtable_1.8-4                backports_1.1.8            
## [33] BiocManager_1.30.10         DelayedArray_0.15.1        
## [35] desc_1.2.0                  graph_1.67.1               
## [37] annotate_1.67.0             XVector_0.29.1             
## [39] fs_1.4.1                    bit_1.1-15.2               
## [41] Rsamtools_2.5.1             digest_0.6.25              
## [43] stringi_1.4.6               bookdown_0.20              
## [45] GenomicRanges_1.41.1        rprojroot_1.3-2            
## [47] grid_4.1.0                  tools_4.1.0                
## [49] bitops_1.0-6                magrittr_1.5               
## [51] RCurl_1.98-1.2              RSQLite_2.2.0              
## [53] crayon_1.3.4                MASS_7.3-51.6              
## [55] Matrix_1.2-18               assertthat_0.2.1           
## [57] rmarkdown_2.3               R6_2.4.1                   
## [59] GenomicAlignments_1.25.1    compiler_4.1.0

References

Aran, Dvir, Zicheng Hu, and Atul J. Butte. 2017. “xCell: Digitally Portraying the Tissue Cellular Heterogeneity Landscape.” Genome Biology 18 (1): 220. https://doi.org/10.1186/s13059-017-1349-1.