Using unisets with Bioconductor annotation packages
vignettes/bioc-annotation.Rmd
Abstract
Importing gene set relationships from Bioconductor annotation packages.
Annotation packages are available from Bioconductor for a range of model species. Users may browse BiocViews “AnnotationData” on the Bioconductor website or search packages programmatically using the command below.
BiocManager::available("^org\\.")
Here, we load the human gene annotations.
library(org.Hs.eg.db)
Go3AnnDbBimap objects (from the AnnotationDbi package) are maps between Entrez gene identifiers and Gene Ontology (GO) identifiers. Those objects may be directly converted to Sets objects as demonstrated below.
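A minimal sketch of such a conversion, assuming that the import() function applied to org.Hs.egGO2ALLEGS later in this vignette accepts org.Hs.egGO in the same way. The requireNamespace() guard is an addition for illustration, so that the snippet is skipped when the packages are not installed.

```r
# Sketch only: assumes unisets::import() accepts org.Hs.egGO in the same way
# as it accepts org.Hs.egGO2ALLEGS later in this vignette.
have_pkgs <- requireNamespace("unisets", quietly = TRUE) &&
    requireNamespace("org.Hs.eg.db", quietly = TRUE)

if (have_pkgs) {
    library(unisets)
    library(org.Hs.eg.db)
    # Convert the Entrez-to-GO map into a Sets object and print its summary.
    go_sets <- import(org.Hs.egGO)
    print(go_sets)
}
```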
## GOSets with 322301 relations between 20488 elements and 18177 sets
## element set | evidence ontology
## <character> <character> | <factor> <factor>
## [1] 1 GO:0002576 | TAS BP
## [2] 1 GO:0008150 | ND BP
## [3] 1 GO:0043312 | TAS BP
## [4] 2 GO:0001869 | IDA BP
## [5] 2 GO:0002576 | TAS BP
## ... ... ... . ... ...
## [322297] 111089941 GO:0004571 | IMP MF
## [322298] 111089941 GO:0005509 | IEA MF
## [322299] 111240474 GO:0005515 | IPI MF
## [322300] 112441434 GO:0005515 | IPI MF
## [322301] 113219467 GO:0030533 | IEA MF
## -----------
## elementInfo: EntrezIdVector with 0 metadata
## setInfo: GOIdVector with 4 metadata (GOID, DEFINITION, ...)
Notice how the "element" information is typed as EntrezIdVector, allowing the type of identifier to affect downstream methods (e.g., pathway analyses). The EntrezIdVector class directly inherits from the IdVector class, and benefits from all the methods associated with the parent class.
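The inheritance principle described above can be illustrated with a toy S4 hierarchy. The classes and generics below (ToyIdVector, ToyEntrezIdVector, nIds, idType) are hypothetical names invented for this sketch; they are not part of unisets and only mimic the relationship between IdVector and EntrezIdVector.

```r
library(methods)

# Hypothetical toy classes; the names are chosen to avoid clashing with unisets.
setClass("ToyIdVector", slots = c(ids = "character"))
setClass("ToyEntrezIdVector", contains = "ToyIdVector")

# Defined for the parent class only: subclasses inherit it.
setGeneric("nIds", function(x) standardGeneric("nIds"))
setMethod("nIds", "ToyIdVector", function(x) length(x@ids))

# Defined for both classes: dispatch picks the most specific method.
setGeneric("idType", function(x) standardGeneric("idType"))
setMethod("idType", "ToyIdVector", function(x) "unspecified")
setMethod("idType", "ToyEntrezIdVector", function(x) "Entrez")

generic_ids <- new("ToyIdVector", ids = c("a", "b"))
entrez_ids <- new("ToyEntrezIdVector", ids = c("1", "2", "111089941"))

is(entrez_ids, "ToyIdVector")  # the subclass is also a ToyIdVector
nIds(entrez_ids)               # method inherited from the parent class
idType(generic_ids)            # behaviour defined for the parent class
idType(entrez_ids)             # behaviour specialised for the subclass
```

The subclass reuses nIds() without redefining it, while idType() shows how a downstream method could behave differently once the identifier type is known.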
It is also useful to note that the conversion of Go3AnnDbBimap Gene Ontology maps to unisets objects automatically fetches metadata for each GO identifier from the GO.db package, if installed. The metadata is stored in the mcols (metadata-columns) slot of the setInfo slot of the object returned. This metadata can be accessed using the accessor method of the same name.
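A minimal sketch of that access, under two assumptions: that setInfo() is the accessor for the slot of the same name, and that the metadata columns of its result are read with mcols() from S4Vectors. The requireNamespace() guard is again an addition for illustration.

```r
# Sketch only: the accessor chain mcols(setInfo(...)) is an assumption based
# on the slot names mentioned in the text.
have_pkgs <- requireNamespace("unisets", quietly = TRUE) &&
    requireNamespace("org.Hs.eg.db", quietly = TRUE)

if (have_pkgs) {
    library(unisets)
    library(org.Hs.eg.db)
    go_sets <- import(org.Hs.egGO)
    # Per-set information first, then the metadata columns attached to it.
    go_metadata <- S4Vectors::mcols(setInfo(go_sets))
    print(go_metadata)
}
```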
## DataFrame with 18177 rows and 2 columns
## ONTOLOGY
## <character>
## GO:0002576 BP
## GO:0008150 BP
## GO:0043312 BP
## GO:0001869 BP
## GO:0007597 BP
## ... ...
## GO:0047349 MF
## GO:0070567 MF
## GO:0047547 MF
## GO:0047613 MF
## GO:0050337 MF
## TERM
## <character>
## GO:0002576 platelet degranulation
## GO:0008150 biological_process
## GO:0043312 neutrophil degranulation
## GO:0001869 negative regulation of complement activation, lectin pathway
## GO:0007597 blood coagulation, intrinsic pathway
## ... ...
## GO:0047349 D-ribitol-5-phosphate cytidylyltransferase activity
## GO:0070567 cytidylyltransferase activity
## GO:0047547 2-methylcitrate dehydratase activity
## GO:0047613 aconitate decarboxylase activity
## GO:0050337 thiosulfate-thiol sulfurtransferase activity
We may then visualize the distribution of set sizes, on a log10 scale.
library(ggplot2)
library(cowplot)
ggplot(data.frame(setLengths=setLengths(go_sets))) +
    geom_histogram(aes(setLengths), bins=100, color="black", fill="grey") +
    scale_x_log10() +
    labs(y="Sets", x="Genes")
org.Hs.egGO is an R object that provides mappings between Entrez gene identifiers and the GO identifiers that they are directly associated with. Neither this mapping nor its reverse mapping associates the child terms from the GO ontology with the gene; only the directly evidenced terms are represented.
In contrast, org.Hs.egGO2ALLEGS is an R object that provides mappings between a given GO identifier and all of the Entrez Gene identifiers annotated at that GO term or at one of its child nodes in the GO ontology. Thus, this mapping is much larger and more inclusive than org.Hs.egGO2EG, the reverse map of org.Hs.egGO.
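The difference between direct and inclusive mappings can be sketched with a toy ontology in base R. Everything below is hypothetical (term names, gene names, and the helper ancestors_and_self); it only illustrates how genes annotated at a child term are also counted at every ancestor term, and does not reproduce how the annotation packages are built.

```r
# Toy ontology: each term lists its direct parent terms (hypothetical identifiers).
parents <- list(
    "GO:root"  = character(0),
    "GO:child" = "GO:root",
    "GO:leaf"  = "GO:child"
)

# Direct annotations only, analogous to org.Hs.egGO2EG.
direct <- list(
    "GO:root"  = character(0),
    "GO:child" = "geneA",
    "GO:leaf"  = c("geneB", "geneC")
)

# Collect a term and all of its ancestors by walking up the parent links.
ancestors_and_self <- function(term) {
    seen <- character(0)
    todo <- term
    while (length(todo) > 0) {
        current <- todo[1]
        todo <- todo[-1]
        if (!current %in% seen) {
            seen <- c(seen, current)
            todo <- c(todo, parents[[current]])
        }
    }
    seen
}

# Propagate each gene to its annotated term and to every ancestor of that term,
# analogous to org.Hs.egGO2ALLEGS.
propagated <- lapply(parents, function(x) character(0))
for (term in names(direct)) {
    for (target in ancestors_and_self(term)) {
        propagated[[target]] <- union(propagated[[target]], direct[[term]])
    }
}

lengths(direct)      # relations per term, direct annotations only
lengths(propagated)  # relations per term after propagation to ancestors
```

Terms near the root accumulate the genes of all their descendants, which is why the inclusive map contains many more relations than the direct one.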
Below, we use the length method to show the number of relations between genes and GO terms imported from the org.Hs.egGO2ALLEGS map.
go_sets <- import(org.Hs.egGO2ALLEGS)
format(length(go_sets), big.mark = ",")
## 'select()' returned 1:1 mapping between keys and columns
## Coercing evidence to factor
## Coercing ontology to factor
## [1] "3,558,563"
We can also examine the count of relations associated with each evidence code in each Gene Ontology namespace.
ggplot(as.data.frame(go_sets)) +
    geom_bar(aes(evidence)) +
    facet_wrap(~ontology, ncol = 1) +
    coord_flip() +
    # scale_y_continuous(labels = function(x){ format(as.integer(x), big.mark = ",") }) +
    scale_y_continuous(labels = scales::comma) +
    theme(axis.text.y = element_text(size=rel(0.7)))
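The counts behind such a bar chart can also be tabulated directly. The sketch below uses a hand-made data frame holding only the ten relations displayed in the first GOSets printout of this vignette, as a small stand-in for as.data.frame(go_sets); the object names relations and counts are illustrative.

```r
# The ten relations displayed in the first GOSets printout of this vignette.
relations <- data.frame(
    element = c("1", "1", "1", "2", "2",
                "111089941", "111089941", "111240474", "112441434", "113219467"),
    set = c("GO:0002576", "GO:0008150", "GO:0043312", "GO:0001869", "GO:0002576",
            "GO:0004571", "GO:0005509", "GO:0005515", "GO:0005515", "GO:0030533"),
    evidence = c("TAS", "ND", "TAS", "IDA", "TAS", "IMP", "IEA", "IPI", "IPI", "IEA"),
    ontology = c("BP", "BP", "BP", "BP", "BP", "MF", "MF", "MF", "MF", "MF"),
    stringsAsFactors = TRUE
)

# Cross-tabulate relations by evidence code (rows) and ontology (columns).
counts <- table(relations$evidence, relations$ontology)
counts
```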
## R version 4.0.0 (2020-04-24)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.4
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] cowplot_1.0.0 ggplot2_3.3.0 unisets_0.99.0
## [4] org.Hs.eg.db_3.11.1 AnnotationDbi_1.51.0 IRanges_2.23.4
## [7] S4Vectors_0.27.5 Biobase_2.49.0 BiocGenerics_0.35.2
## [10] BiocStyle_2.17.0
##
## loaded via a namespace (and not attached):
## [1] bit64_0.9-7 assertthat_0.2.1
## [3] BiocManager_1.30.10 blob_1.2.1
## [5] GenomeInfoDbData_1.2.3 Rsamtools_2.5.0
## [7] yaml_2.2.1 pillar_1.4.4
## [9] RSQLite_2.2.0 backports_1.1.6
## [11] lattice_0.20-41 glue_1.4.0
## [13] digest_0.6.25 GenomicRanges_1.41.1
## [15] XVector_0.29.0 colorspace_1.4-1
## [17] htmltools_0.4.0 Matrix_1.2-18
## [19] plyr_1.8.6 GSEABase_1.51.0
## [21] XML_3.99-0.3 pkgconfig_2.0.3
## [23] bookdown_0.18 zlibbioc_1.35.0
## [25] xtable_1.8-4 GO.db_3.11.1
## [27] scales_1.1.0 BiocParallel_1.23.0
## [29] tibble_3.0.1 annotate_1.67.0
## [31] farver_2.0.3 ellipsis_0.3.0
## [33] withr_2.2.0 SummarizedExperiment_1.19.2
## [35] magrittr_1.5 crayon_1.3.4
## [37] memoise_1.1.0 evaluate_0.14
## [39] fs_1.4.1 MASS_7.3-51.5
## [41] graph_1.67.0 tools_4.0.0
## [43] lifecycle_0.2.0 matrixStats_0.56.0
## [45] stringr_1.4.0 munsell_0.5.0
## [47] DelayedArray_0.15.1 Biostrings_2.57.0
## [49] compiler_4.0.0 pkgdown_1.5.1.9000
## [51] GenomeInfoDb_1.25.0 rlang_0.4.6
## [53] grid_4.0.0 RCurl_1.98-1.2
## [55] labeling_0.3 bitops_1.0-6
## [57] rmarkdown_2.1 gtable_0.3.0
## [59] DBI_1.1.0 reshape2_1.4.4
## [61] R6_2.4.1 GenomicAlignments_1.25.0
## [63] knitr_1.28 rtracklayer_1.49.1
## [65] bit_1.1-15.2 rprojroot_1.3-2
## [67] desc_1.2.0 stringi_1.4.6
## [69] Rcpp_1.0.4.6 vctrs_0.2.4
## [71] xfun_0.13