Abstract

This vignette shows how to import gene set relationships from Bioconductor annotation packages into unisets classes.

Getting started

Bioconductor annotation packages

Annotation packages are available from Bioconductor for a range of model species. Users may browse the biocViews term “AnnotationData” on the Bioconductor website, or list the organism-level annotation packages programmatically using the command below.

BiocManager::available("^org\\.")
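
The argument passed to available() is a regular expression. As a small offline illustration of what that pattern selects, the sketch below applies it to a hand-written vector of package names. The vector is made up for this example (the last two names are hypothetical) and is not the output of BiocManager::available().

```r
# Hand-written package names, for illustration only.
pkgs <- c("org.Hs.eg.db", "org.Mm.eg.db", "GO.db", "reorg.tools", "orgTools")

# "^org\\." keeps only the names that start with the literal prefix "org."
# (the backslashes escape the dot, which would otherwise match any character).
org_pkgs <- grep("^org\\.", pkgs, value = TRUE)
org_pkgs
```

Names that merely contain "org." further along, or that start with "org" without the dot, are not matched.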

Here, we load the human gene annotations.

library(org.Hs.eg.db)

Importing to unisets classes

Gene Ontology

Go3AnnDbBimap objects (from the AnnotationDbi package) are maps between Entrez gene identifiers and Gene Ontology (GO) identifiers. These objects can be converted directly to Sets objects, as demonstrated below.

library(unisets)
go_sets <- import(org.Hs.egGO)
go_sets
## GOSets with 322301 relations between 20488 elements and 18177 sets
##              element         set | evidence ontology
##          <character> <character> | <factor> <factor>
##      [1]           1  GO:0002576 |      TAS       BP
##      [2]           1  GO:0008150 |      ND        BP
##      [3]           1  GO:0043312 |      TAS       BP
##      [4]           2  GO:0001869 |      IDA       BP
##      [5]           2  GO:0002576 |      TAS       BP
##      ...         ...         ... .      ...      ...
## [322297]   111089941  GO:0004571 |      IMP       MF
## [322298]   111089941  GO:0005509 |      IEA       MF
## [322299]   111240474  GO:0005515 |      IPI       MF
## [322300]   112441434  GO:0005515 |      IPI       MF
## [322301]   113219467  GO:0030533 |      IEA       MF
## -----------
## elementInfo: EntrezIdVector with 0 metadata
##     setInfo: GOIdVector with 4 metadata (GOID, DEFINITION, ...)

Notice how the "element" information is typed as EntrezIdVector, which allows the type of identifier to affect downstream methods (e.g., pathway analyses). The EntrezIdVector class inherits directly from the IdVector class, and benefits from all the methods associated with the parent class.
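
The inheritance pattern described above can be sketched with plain S4 classes. The class and generic names below (ToyIdVector, ToyEntrezIdVector, nIds, idType) are invented for this illustration and are not part of unisets. The example only shows the general mechanism: a subclass reuses methods defined for its parent, and can also provide specialised ones.

```r
library(methods)

# Toy classes, named for this example only; these are not the unisets classes.
setClass("ToyIdVector", representation(ids = "character"))
setClass("ToyEntrezIdVector", contains = "ToyIdVector")

# A generic with a method defined only for the parent class.
setGeneric("nIds", function(x) standardGeneric("nIds"))
setMethod("nIds", "ToyIdVector", function(x) length(x@ids))

# A second generic where the subclass overrides the parent's behaviour.
setGeneric("idType", function(x) standardGeneric("idType"))
setMethod("idType", "ToyIdVector", function(x) "generic identifier")
setMethod("idType", "ToyEntrezIdVector", function(x) "Entrez gene identifier")

entrez <- new("ToyEntrezIdVector", ids = c("1", "2", "9"))
nIds(entrez)               # method inherited from ToyIdVector
idType(entrez)             # method specialised for ToyEntrezIdVector
is(entrez, "ToyIdVector")  # the subclass is also an instance of the parent
```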

It is also useful to note that converting Go3AnnDbBimap Gene Ontology maps to unisets objects automatically fetches metadata for each GO identifier from the GO.db package, if installed. The metadata is stored in the mcols (metadata columns) slot of the setInfo slot of the returned object, and can be accessed using the accessor methods of the same names.

mcols(setInfo(go_sets))[, c("ONTOLOGY", "TERM")]
## DataFrame with 18177 rows and 2 columns
##               ONTOLOGY
##            <character>
## GO:0002576          BP
## GO:0008150          BP
## GO:0043312          BP
## GO:0001869          BP
## GO:0007597          BP
## ...                ...
## GO:0047349          MF
## GO:0070567          MF
## GO:0047547          MF
## GO:0047613          MF
## GO:0050337          MF
##                                                                    TERM
##                                                             <character>
## GO:0002576                                       platelet degranulation
## GO:0008150                                           biological_process
## GO:0043312                                     neutrophil degranulation
## GO:0001869 negative regulation of complement activation, lectin pathway
## GO:0007597                         blood coagulation, intrinsic pathway
## ...                                                                 ...
## GO:0047349          D-ribitol-5-phosphate cytidylyltransferase activity
## GO:0070567                                cytidylyltransferase activity
## GO:0047547                         2-methylcitrate dehydratase activity
## GO:0047613                             aconitate decarboxylase activity
## GO:0050337                 thiosulfate-thiol sulfurtransferase activity

We may then visualize the distribution of set sizes, on a log10 scale.

library(ggplot2)
library(cowplot)
ggplot(data.frame(setLengths=setLengths(go_sets))) +
    geom_histogram(aes(setLengths), bins=100, color="black", fill="grey") +
    scale_x_log10() + labs(y="Sets", x="Genes")
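
To make explicit what is being plotted, the sketch below computes set sizes by hand on a small made-up table of relations, counting the elements recorded for each set. It is a base R illustration on invented rows, not a description of how setLengths() is implemented.

```r
# Made-up relations between gene identifiers and GO identifiers.
relations <- data.frame(
    element = c("1", "1", "2", "2", "3"),
    set = c("GO:0002576", "GO:0008150", "GO:0002576", "GO:0008150", "GO:0008150"),
    stringsAsFactors = FALSE
)

# Group the elements by set, then count the elements in each group.
set_sizes <- lengths(split(relations$element, relations$set))
set_sizes

# The histogram uses a log10 x-axis, so sizes are compared on this scale.
log10(set_sizes)
```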

org.Hs.egGO is an R object that provides mappings between Entrez gene identifiers and the GO identifiers that they are directly associated with. Neither this mapping nor its reverse mapping associates the child terms from the GO ontology with the gene. Only the directly evidenced terms are represented here.

In contrast, org.Hs.egGO2ALLEGS is an R object that provides mappings between a given GO identifier and all of the Entrez Gene identifiers annotated at that GO term or at one of its child nodes in the GO ontology. Thus, this mapping is much larger and more inclusive than org.Hs.egGO2EG, the reverse map of org.Hs.egGO.
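
The difference between direct and inclusive annotations can be illustrated on a toy ontology. The sketch below, written in base R with made-up term and gene names, starts from direct annotations only and adds each gene to every ancestor of its directly annotated term. For simplicity it assumes that each term has a single parent, whereas GO terms can have several. It mimics the idea behind the inclusive map and makes no claim about how the annotation packages build it.

```r
# Toy ontology: each term points to its parent (NA marks the root).
parent <- c(root = NA_character_, immune = "root", degranulation = "immune")

# Direct annotations only: term -> genes annotated at exactly that term.
direct <- list(
    root = "geneC",
    immune = "geneB",
    degranulation = "geneA"
)

# Return a term followed by all of its ancestors, up to the root.
with_ancestors <- function(term) {
    out <- character(0)
    while (!is.na(term)) {
        out <- c(out, term)
        term <- parent[[term]]
    }
    out
}

# Propagate each directly annotated gene to the term and to all its ancestors.
all_genes <- setNames(vector("list", length(parent)), names(parent))
for (term in names(direct)) {
    for (anc in with_ancestors(term)) {
        all_genes[[anc]] <- union(all_genes[[anc]], direct[[term]])
    }
}

lengths(direct)     # one gene per term
lengths(all_genes)  # ancestors accumulate the genes of their descendants
```

In this toy case the root term ends up with every gene, which is why the inclusive map contains many more relations than the direct one.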

Below, we use the length method to show the number of relations between genes and GO terms imported from the org.Hs.egGO2ALLEGS map.

go_sets <- import(org.Hs.egGO2ALLEGS)
## 'select()' returned 1:1 mapping between keys and columns
## Coercing evidence to factor
## Coercing ontology to factor
format(length(go_sets), big.mark=",")
## [1] "3,558,563"

We can also examine the count of relations associated with each evidence code in each Gene Ontology namespace.

ggplot(as.data.frame(go_sets)) +
    geom_bar(aes(evidence)) + facet_wrap(~ontology, ncol = 1) + coord_flip() +
    # scale_y_continuous(labels = function(x){ format(as.integer(x), big.mark = ",") }) +
    scale_y_continuous(labels = scales::comma) +
    theme(axis.text.y = element_text(size=rel(0.7)))
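
The bar heights in this plot are counts of relations for each combination of evidence code and ontology. As a sketch of the same tabulation without graphics, the example below cross-tabulates a small made-up data frame in base R. The column names evidence and ontology mirror those of the imported object, while the rows are invented.

```r
# Made-up relations, keeping only the two columns used for counting.
toy <- data.frame(
    evidence = c("TAS", "IEA", "IEA", "IDA", "IEA", "IPI"),
    ontology = c("BP", "BP", "MF", "BP", "MF", "MF"),
    stringsAsFactors = TRUE
)

# Cross-tabulate: one row per ontology, one column per evidence code.
counts <- table(ontology = toy$ontology, evidence = toy$evidence)
counts

# Long format: one row per (ontology, evidence) pair, as drawn in the bar chart.
counts_long <- as.data.frame(counts, responseName = "relations")
counts_long
```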

Session info

## R version 4.0.0 (2020-04-24)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.4
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] cowplot_1.0.0        ggplot2_3.3.0        unisets_0.99.0      
##  [4] org.Hs.eg.db_3.11.1  AnnotationDbi_1.51.0 IRanges_2.23.4      
##  [7] S4Vectors_0.27.5     Biobase_2.49.0       BiocGenerics_0.35.2 
## [10] BiocStyle_2.17.0    
## 
## loaded via a namespace (and not attached):
##  [1] bit64_0.9-7                 assertthat_0.2.1           
##  [3] BiocManager_1.30.10         blob_1.2.1                 
##  [5] GenomeInfoDbData_1.2.3      Rsamtools_2.5.0            
##  [7] yaml_2.2.1                  pillar_1.4.4               
##  [9] RSQLite_2.2.0               backports_1.1.6            
## [11] lattice_0.20-41             glue_1.4.0                 
## [13] digest_0.6.25               GenomicRanges_1.41.1       
## [15] XVector_0.29.0              colorspace_1.4-1           
## [17] htmltools_0.4.0             Matrix_1.2-18              
## [19] plyr_1.8.6                  GSEABase_1.51.0            
## [21] XML_3.99-0.3                pkgconfig_2.0.3            
## [23] bookdown_0.18               zlibbioc_1.35.0            
## [25] xtable_1.8-4                GO.db_3.11.1               
## [27] scales_1.1.0                BiocParallel_1.23.0        
## [29] tibble_3.0.1                annotate_1.67.0            
## [31] farver_2.0.3                ellipsis_0.3.0             
## [33] withr_2.2.0                 SummarizedExperiment_1.19.2
## [35] magrittr_1.5                crayon_1.3.4               
## [37] memoise_1.1.0               evaluate_0.14              
## [39] fs_1.4.1                    MASS_7.3-51.5              
## [41] graph_1.67.0                tools_4.0.0                
## [43] lifecycle_0.2.0             matrixStats_0.56.0         
## [45] stringr_1.4.0               munsell_0.5.0              
## [47] DelayedArray_0.15.1         Biostrings_2.57.0          
## [49] compiler_4.0.0              pkgdown_1.5.1.9000         
## [51] GenomeInfoDb_1.25.0         rlang_0.4.6                
## [53] grid_4.0.0                  RCurl_1.98-1.2             
## [55] labeling_0.3                bitops_1.0-6               
## [57] rmarkdown_2.1               gtable_0.3.0               
## [59] DBI_1.1.0                   reshape2_1.4.4             
## [61] R6_2.4.1                    GenomicAlignments_1.25.0   
## [63] knitr_1.28                  rtracklayer_1.49.1         
## [65] bit_1.1-15.2                rprojroot_1.3-2            
## [67] desc_1.2.0                  stringi_1.4.6              
## [69] Rcpp_1.0.4.6                vctrs_0.2.4                
## [71] xfun_0.13