An introduction to the unisets package
Abstract
Introduction to the unisets package.
The goal of the unisets package is to provide a collection of S4 classes to store relationships between elements and sets, with a particular emphasis on gene sets.
This basic example shows how to create a Sets object that stores simple associations between elements and sets, along with optional metadata associated with each relation:
library(unisets)
sets_list <- list(
  geneset1 = c("A", "B"),
  geneset2 = c("B", "C", "D")
)
relations_table <- DataFrame(
  element = unlist(sets_list),
  set = rep(names(sets_list), lengths(sets_list)),
  extra1 = rep(c("ABC", "DEF"), c(3L, 2L)),
  extra2 = seq(0, 1, length.out = 5L)
)
base_sets <- Sets(relations_table)
base_sets
## Sets with 5 relations between 4 elements and 2 sets
## element set | extra1 extra2
## <character> <character> | <character> <numeric>
## [1] A geneset1 | ABC 0.00
## [2] B geneset1 | ABC 0.25
## [3] B geneset2 | ABC 0.50
## [4] C geneset2 | DEF 0.75
## [5] D geneset2 | DEF 1.00
## -----------
## elementInfo: IdVector with 0 metadata
## setInfo: IdVector with 0 metadata
Metadata for each element and set can be provided as separate IdVector objects. The IdVector class stores a vector of identifiers as a character vector, and associated metadata as a DataFrame.
element_data <- IdVector(ids = c("A", "B", "C", "D"))
mcols(element_data) <- DataFrame(
  GeneStat1 = c(1, 2, 3, 4),
  GeneInfo1 = c("a", "b", "c", "d")
)
set_data <- IdVector(ids = c("geneset1", "geneset2"))
mcols(set_data) <- DataFrame(
  SetStat1 = c(100, 200),
  SetInfo1 = c("abc", "def")
)
base_sets <- Sets(relations_table, element_data, set_data)
base_sets
## Sets with 5 relations between 4 elements and 2 sets
## element set | extra1 extra2
## <character> <character> | <character> <numeric>
## [1] A geneset1 | ABC 0.00
## [2] B geneset1 | ABC 0.25
## [3] B geneset2 | ABC 0.50
## [4] C geneset2 | DEF 0.75
## [5] D geneset2 | DEF 1.00
## -----------
## elementInfo: IdVector with 2 metadata (GeneStat1, GeneInfo1)
## setInfo: IdVector with 2 metadata (SetStat1, SetInfo1)
The elementInfo and setInfo slots each store an IdVector that describes the identifier and metadata associated with each unique element and set, respectively. Those metadata can be directly accessed and updated using the corresponding accessor methods.
elementInfo(base_sets)
## IdVector of length 4 with 4 unique identifiers
## Ids: A, B, C, D
## Metadata: GeneStat1, GeneInfo1 (2 columns)
setInfo(base_sets)
## IdVector of length 2 with 2 unique identifiers
## Ids: geneset1, geneset2
## Metadata: SetStat1, SetInfo1 (2 columns)
Note that relations between elements and sets are internally stored as an S4Vectors Hits object. This container efficiently represents edges between a set of left nodes and a set of right nodes, with optional metadata that describe each edge.
To do so, the DataFrame provided as the relations argument of the Sets constructor is split into two pieces of information. First, the "element" and "set" columns are extracted and substituted by the index of the matching identifier in the elementInfo and setInfo slots, to create the from and to slots of a Hits object, respectively. If elementInfo and setInfo are not supplied, the corresponding slots are automatically constructed from the unique values found in the "element" and "set" columns. Second, the remaining columns are stored as metadata columns of the Hits object.
relations(base_sets)
## Hits object with 5 hits and 2 metadata columns:
## from to | extra1 extra2
## <integer> <integer> | <character> <numeric>
## [1] 1 1 | ABC 0.00
## [2] 2 1 | ABC 0.25
## [3] 2 2 | ABC 0.50
## [4] 3 2 | DEF 0.75
## [5] 4 2 | DEF 1.00
## -------
## nLnode: 4 / nRnode: 2
mcols(relations(base_sets))
## DataFrame with 5 rows and 2 columns
## extra1 extra2
## <character> <numeric>
## 1 ABC 0.00
## 2 ABC 0.25
## 3 ABC 0.50
## 4 DEF 0.75
## 5 DEF 1.00
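The index substitution described above can be illustrated with a minimal base R sketch. It does not use the unisets or S4Vectors internals; the variable names (relations_df, element_ids, set_ids, relation_metadata) are made up for this illustration, and only the general idea of matching identifiers to positions is taken from the text.

```r
# Toy relation table with the same identifiers as the example above.
relations_df <- data.frame(
  element = c("A", "B", "B", "C", "D"),
  set = c("geneset1", "geneset1", "geneset2", "geneset2", "geneset2"),
  extra1 = c("ABC", "ABC", "ABC", "DEF", "DEF"),
  stringsAsFactors = FALSE
)

# Unique identifiers play the role of the elementInfo and setInfo slots.
element_ids <- unique(relations_df$element)
set_ids <- unique(relations_df$set)

# Replace each identifier by its position among the unique identifiers.
from <- match(relations_df$element, element_ids)
to <- match(relations_df$set, set_ids)

# The remaining columns are kept as per-relation metadata.
metadata_columns <- setdiff(colnames(relations_df), c("element", "set"))
relation_metadata <- relations_df[, metadata_columns, drop = FALSE]

from
to
```

In this sketch, from and to reproduce the integer columns printed by relations(base_sets), while relation_metadata plays the role of the metadata columns.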
Conveniently, the as method can be used to format relations and associated metadata as a DataFrame, substituting hits with their corresponding element and set identifiers. Metadata for relations, elements, and sets are returned as DataFrame objects nested in the "relationData", "elementInfo", and "setInfo" columns.
as(base_sets, "DataFrame")
## DataFrame with 5 rows and 5 columns
## element set relationData elementInfo setInfo
## <IdVector> <IdVector> <DataFrame> <DataFrame> <DataFrame>
## 1 A geneset1 ABC:0.00 1:a 100:abc
## 2 B geneset1 ABC:0.25 2:b 100:abc
## 3 B geneset2 ABC:0.50 2:b 200:def
## 4 C geneset2 DEF:0.75 3:c 200:def
## 5 D geneset2 DEF:1.00 4:d 200:def
Similarly, as.data.frame can be used to obtain a flattened data.frame, with columns "element", "set", and any column in the relation metadata columns.
as.data.frame(base_sets)
## element set extra1 extra2
## 1 A geneset1 ABC 0.00
## 2 B geneset1 ABC 0.25
## 3 B geneset2 ABC 0.50
## 4 C geneset2 DEF 0.75
## 5 D geneset2 DEF 1.00
Classes derived from Hits may add additional constraints on the relations to define special types of relationships between elements and sets.
For instance, the FuzzyHits class is a direct extension of the Hits class where the metadata accompanying each relation must include at least a column called "membership" that holds the “membership function”, a numeric value in the interval [0,1] that provides a measure of partial membership between elements and sets.
Correspondingly, the FuzzySets class is a direct extension of the Sets class where the relations slot must contain FuzzyHits. As such, FuzzySets can be constructed exactly like Sets, with the only additional constraint that the relations table must contain a "membership" column with numeric values in the interval [0,1].
relations_table$membership <- round(runif(nrow(relations_table)), 2)
fuzzy_sets <- FuzzySets(relations_table, element_data, set_data)
fuzzy_sets
## FuzzySets with 5 relations between 4 elements and 2 sets
## element set | extra1 extra2 membership
## <character> <character> | <character> <numeric> <numeric>
## [1] A geneset1 | ABC 0.00 0.42
## [2] B geneset1 | ABC 0.25 0.87
## [3] B geneset2 | ABC 0.50 0.80
## [4] C geneset2 | DEF 0.75 0.95
## [5] D geneset2 | DEF 1.00 0.16
## -----------
## elementInfo: IdVector with 2 metadata (GeneStat1, GeneInfo1)
## setInfo: IdVector with 2 metadata (SetStat1, SetInfo1)
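The constraint on the "membership" column can be sketched as a small base R check. This is not the validity method of the FuzzyHits class; check_membership is a hypothetical helper written for this illustration, and treating missing values as invalid is an assumption of the sketch rather than documented behaviour.

```r
# Hypothetical helper (not part of unisets): returns TRUE when the relation
# table satisfies the constraint described above, or a message otherwise.
check_membership <- function(relations_df) {
  if (!"membership" %in% colnames(relations_df)) {
    return("relations must include a 'membership' column")
  }
  m <- relations_df$membership
  # Assumption of this sketch: missing values are rejected.
  if (!is.numeric(m) || anyNA(m) || any(m < 0 | m > 1)) {
    return("'membership' must hold numeric values in the interval [0,1]")
  }
  TRUE
}

valid <- data.frame(
  element = c("A", "B"), set = "geneset1", membership = c(0.42, 0.87)
)
invalid <- data.frame(
  element = c("A", "B"), set = "geneset1", membership = c(0.42, 1.5)
)

check_membership(valid)
check_membership(invalid)
```

Returning either TRUE or a message mirrors the convention commonly used by S4 validity functions in R.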
The membership function associated with each relation can be directly obtained and modified using the corresponding accessor.
membership(fuzzy_sets)
## [1] 0.42 0.87 0.80 0.95 0.16
As with Sets, the relations accessor returns fuzzy relations as FuzzyHits, while the as method may be used to format the information as a DataFrame. Both include the "membership" column, as a metadata column and nested under "relationData", respectively.
relations(fuzzy_sets)
## FuzzyHits object with 5 hits and 3 metadata columns:
## from to | extra1 extra2 membership
## <integer> <integer> | <character> <numeric> <numeric>
## [1] 1 1 | ABC 0.00 0.42
## [2] 2 1 | ABC 0.25 0.87
## [3] 2 2 | ABC 0.50 0.80
## [4] 3 2 | DEF 0.75 0.95
## [5] 4 2 | DEF 1.00 0.16
## -------
## nLnode: 4 / nRnode: 2
as(fuzzy_sets, "DataFrame")
## DataFrame with 5 rows and 5 columns
## element set relationData elementInfo setInfo
## <IdVector> <IdVector> <DataFrame> <DataFrame> <DataFrame>
## 1 A geneset1 ABC:0.00:0.42 1:a 100:abc
## 2 B geneset1 ABC:0.25:0.87 2:b 100:abc
## 3 B geneset2 ABC:0.50:0.80 2:b 200:def
## 4 C geneset2 DEF:0.75:0.95 3:c 200:def
## 5 D geneset2 DEF:1.00:0.16 4:d 200:def
The GOSets class is another direct extension of the Sets class where the relations slot must contain GOHits. Similarly to FuzzyHits, the GOHits class extends the Hits class, but with the distinct constraint that the metadata of each relation must include at least two columns called "evidence" and "ontology", holding the Gene Ontology evidence code and ontology code, respectively.
Examples of GOSets usage are described in a dedicated vignette.
The subset method can be applied to Sets objects and derivatives (e.g. FuzzySets, GOSets), using a logical expression that indicates the relations to keep. The expression may refer to the "element" and "set" columns as well as any metadata associated with the relations.
## Sets with 1 relation between 1 element and 1 set
## element set | extra1 extra2
## <character> <character> | <character> <numeric>
## [1] B geneset1 | ABC 0.25
## -----------
## elementInfo: IdVector with 2 metadata (GeneStat1, GeneInfo1)
## setInfo: IdVector with 2 metadata (SetStat1, SetInfo1)
Similarly, the subset method can also be applied to objects derived from Sets, such as FuzzySets, in which case the logical expression may also refer to the additional "membership" metadata that is guaranteed by the class validity method.
subset(fuzzy_sets, set == "geneset2" & membership > 0.3)
## FuzzySets with 2 relations between 2 elements and 1 set
## element set | extra1 extra2 membership
## <character> <character> | <character> <numeric> <numeric>
## [1] B geneset2 | ABC 0.50 0.80
## [2] C geneset2 | DEF 0.75 0.95
## -----------
## elementInfo: IdVector with 2 metadata (GeneStat1, GeneInfo1)
## setInfo: IdVector with 2 metadata (SetStat1, SetInfo1)
Note that the default behaviour of the subset method is to drop elements and sets that are not represented in the relations from the elementInfo and setInfo slots, respectively. This behaviour can be controlled using the drop argument, which accepts a single logical value.
## [1] "geneset1"
## [1] "geneset1" "geneset2"
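The calls that produced the two outputs above are not shown here. The effect of dropping or keeping unused sets can nevertheless be sketched in base R, without the unisets API; the variable names (all_set_ids, kept, sets_dropped, sets_kept) are made up for this illustration.

```r
# Toy relation table with the same identifiers as the example above.
relations_df <- data.frame(
  element = c("A", "B", "B", "C", "D"),
  set = c("geneset1", "geneset1", "geneset2", "geneset2", "geneset2"),
  stringsAsFactors = FALSE
)
all_set_ids <- c("geneset1", "geneset2")

# Keep only the relations that involve geneset1.
kept <- relations_df[relations_df$set == "geneset1", ]

# Analogue of dropping: keep only sets still represented in the relations.
sets_dropped <- all_set_ids[all_set_ids %in% kept$set]

# Analogue of not dropping: keep every set identifier, even without relations.
sets_kept <- all_set_ids

sets_dropped
sets_kept
```

Keeping unused identifiers can be useful when downstream code expects the same universe of sets before and after subsetting.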
It is possible to extract the gene sets as a list, for use with functions such as lapply.
as(fuzzy_sets, "list")
## List of length 2
## names(2): geneset1 geneset2
It is also possible to visualize membership between genes and gene sets as a matrix.
Notably, Sets objects produce a logical matrix of binary membership that indicates whether each element is associated at least once with each set:
base_matrix <- as(base_sets, "matrix")
base_matrix
## geneset1 geneset2
## A TRUE FALSE
## B TRUE TRUE
## C FALSE TRUE
## D FALSE TRUE
In contrast, FuzzySets objects produce a double matrix displaying the membership function for each relation. Relations that are not described in the FuzzySets are filled with NA, to contrast with relations explicitly associated with a membership function of 0.
membership(fuzzy_sets)[1] <- 0
fuzzy_matrix <- as(fuzzy_sets, "matrix")
fuzzy_matrix
## geneset1 geneset2
## A 0.00 NA
## B 0.87 0.80
## C NA 0.95
## D NA 0.16
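The distinction between an absent relation (NA) and a relation with a membership of 0 can be reproduced with a base R sketch. It does not reflect how unisets implements the coercion; it only shows one way to fill a matrix from a relation table, using made-up variable names and the membership values printed above.

```r
# Toy relation table with the membership values shown above.
relations_df <- data.frame(
  element = c("A", "B", "B", "C", "D"),
  set = c("geneset1", "geneset1", "geneset2", "geneset2", "geneset2"),
  membership = c(0, 0.87, 0.80, 0.95, 0.16),
  stringsAsFactors = FALSE
)
element_ids <- unique(relations_df$element)
set_ids <- unique(relations_df$set)

# Start from a matrix of NA: no relation is described yet.
fuzzy_matrix <- matrix(
  NA_real_,
  nrow = length(element_ids), ncol = length(set_ids),
  dimnames = list(element_ids, set_ids)
)

# Fill only the cells described by a relation, using (row name, column name)
# pairs as a two-column character index.
fuzzy_matrix[cbind(relations_df$element, relations_df$set)] <-
  relations_df$membership

fuzzy_matrix
```

The cell for element A and geneset1 holds an explicit 0, while the three undescribed pairs remain NA.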
It is possible to convert incidence matrices into objects derived from the Sets class.
Notably, the Sets class is suitable for logical matrices indicating binary membership.
as(base_matrix, "Sets")
## Sets with 5 relations between 4 elements and 2 sets
## element set
## <character> <character>
## [1] A geneset1
## [2] B geneset1
## [3] B geneset2
## [4] C geneset2
## [5] D geneset2
## -----------
## elementInfo: IdVector with 0 metadata
## setInfo: IdVector with 0 metadata
Similarly, the FuzzySets class is suitable for double matrices indicating the membership function for each relation. Importantly, relations described as NA are not imported into the FuzzySets object (consistently with the as.matrix method described above). In contrast, relations with a membership function of 0 are imported and described as such.
fuzzy_matrix[1, 1] <- 0
as(fuzzy_matrix, "FuzzySets")
## Dropping relations with NA membership function
## FuzzySets with 5 relations between 4 elements and 2 sets
## element set | membership
## <character> <character> | <numeric>
## [1] A geneset1 | 0.00
## [2] B geneset1 | 0.87
## [3] B geneset2 | 0.80
## [4] C geneset2 | 0.95
## [5] D geneset2 | 0.16
## -----------
## elementInfo: IdVector with 0 metadata
## setInfo: IdVector with 0 metadata
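The reverse conversion, from a matrix to a relation table that skips NA cells but keeps zeros, can be sketched in base R as follows. This is an illustration of the rule stated above, not the unisets coercion method; the variable names are made up.

```r
# Membership matrix equivalent to the one printed earlier.
fuzzy_matrix <- matrix(
  c(0, 0.87, NA, NA, NA, 0.80, 0.95, 0.16),
  nrow = 4,
  dimnames = list(c("A", "B", "C", "D"), c("geneset1", "geneset2"))
)

# Positions of the non-missing cells, as (row, column) index pairs.
idx <- which(!is.na(fuzzy_matrix), arr.ind = TRUE)

# One relation per non-missing cell; cells equal to 0 are retained.
relations_df <- data.frame(
  element = rownames(fuzzy_matrix)[idx[, "row"]],
  set = colnames(fuzzy_matrix)[idx[, "col"]],
  membership = fuzzy_matrix[idx],
  stringsAsFactors = FALSE
)

relations_df
```

Testing with is.na rather than comparing to 0 is what keeps the explicit zero for element A while discarding the three undescribed pairs.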
The count of relations between elements and sets can be obtained using the length method.
length(base_sets)
## [1] 5
The count of unique elements and sets can be obtained using the nElements and nSets methods.
nElements(base_sets)
## [1] 4
nSets(base_sets)
## [1] 2
The size of each gene set can be obtained using the setLengths method.
setLengths(fuzzy_sets)
## geneset1 geneset2
## 2 3
Conversely, the number of sets associated with each gene is returned by the elementLengths function.
elementLengths(fuzzy_sets)
## A B C D
## 1 2 1 1
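The same counts can be derived from a plain relation table in base R, which may help to see what the two functions summarize. This sketch tabulates relations per identifier and is not the unisets implementation; using factor levels so that identifiers without relations would be reported with a count of 0 is a choice made for this illustration.

```r
# Toy relation table with the same identifiers as the example above.
relations_df <- data.frame(
  element = c("A", "B", "B", "C", "D"),
  set = c("geneset1", "geneset1", "geneset2", "geneset2", "geneset2"),
  stringsAsFactors = FALSE
)
element_ids <- c("A", "B", "C", "D")
set_ids <- c("geneset1", "geneset2")

# Number of relations per set (analogue of setLengths).
set_sizes <- table(factor(relations_df$set, levels = set_ids))

# Number of relations per element (analogue of elementLengths).
element_counts <- table(factor(relations_df$element, levels = element_ids))

set_sizes
element_counts
```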
The identifiers of elements and sets can be inspected and renamed using the ids accessor on the IdVector object returned by each of the elementInfo or setInfo accessors.
ids(elementInfo(base_sets)) <- paste0("Gene", seq_len(nElements(base_sets)))
ids(elementInfo(base_sets))
## [1] "Gene1" "Gene2" "Gene3" "Gene4"
## [1] "Geneset1" "Geneset2"
A common representation of gene sets is the GMT format, which is a non-rectangular format where each line is a set. The first column is the name of the set, the second column is a description of the source of the set (such as a URL), and the third column onwards are the elements of the set, such that each set may have a variable number of elements.
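The layout of a GMT file can be made concrete with a minimal base R parser. This is a sketch of the format only, not the unisets import code; the two example lines and their example.org URLs are made-up placeholders.

```r
# Two hypothetical GMT lines: set name, source, then a variable number of
# elements, all separated by tabs.
gmt_lines <- c(
  "geneset1\thttp://example.org/geneset1\tA\tB",
  "geneset2\thttp://example.org/geneset2\tB\tC\tD"
)

# Split each line into its tab-separated fields.
fields <- strsplit(gmt_lines, "\t", fixed = TRUE)

# First field: set name. Second field: source. Remaining fields: elements.
set_names <- vapply(fields, `[`, character(1), 1)
set_sources <- vapply(fields, `[`, character(1), 2)
elements <- lapply(fields, function(x) x[-(1:2)])

# Flatten into one row per relation between an element and a set.
relations_df <- data.frame(
  element = unlist(elements),
  set = rep(set_names, lengths(elements)),
  stringsAsFactors = FALSE
)

relations_df
set_sources
```

Because each line may carry a different number of elements, the file cannot be read with a rectangular reader such as read.table without extra handling, which is why the sketch splits lines individually.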
Importing from and exporting to GMT files is performed using the generic import and export methods, which recognize the “.gmt” file extension as a trigger to import from and export to the GMT file format. Alternatively, the import.gmt and export.gmt functions may be used to explicitly import from and export to the GMT file format.
Any object that inherits from the Sets class may be exported to the GMT file format. However, any information that is not supported by the GMT file format will be lost during the export. Reciprocally, the import function produces a Sets object, which adequately represents all the information present in the GMT file format.
gmt_file <- system.file(package = "unisets", "extdata", "example.gmt")
base_sets_from_gmt <- import(gmt_file)
base_sets_from_gmt
## Sets with 674 relations between 633 elements and 4 sets
## element set
## <character> <character>
## [1] JUNB HALLMARK_TNFA_SIGNALING_VIA_NFKB
## [2] CXCL2 HALLMARK_TNFA_SIGNALING_VIA_NFKB
## [3] ATF3 HALLMARK_TNFA_SIGNALING_VIA_NFKB
## [4] NFKBIA HALLMARK_TNFA_SIGNALING_VIA_NFKB
## [5] TNFAIP3 HALLMARK_TNFA_SIGNALING_VIA_NFKB
## ... ... ...
## [670] STK38L HALLMARK_MITOTIC_SPINDLE
## [671] YWHAE HALLMARK_MITOTIC_SPINDLE
## [672] RAPGEF5 HALLMARK_MITOTIC_SPINDLE
## [673] CEP72 HALLMARK_MITOTIC_SPINDLE
## [674] CSNK1D HALLMARK_MITOTIC_SPINDLE
## -----------
## elementInfo: IdVector with 0 metadata
## setInfo: IdVector with 1 metadata (source)
The source of each set (second column of the GMT file) is also stored as set metadata, accessible via setInfo, which returns an IdVector object.
setInfo(base_sets_from_gmt)
## IdVector of length 4 with 4 unique identifiers
## Ids: HALLMARK_TNFA_SIGNALING_VIA_NFKB, HALLMARK_HYPOXIA, HALLMARK_CHOLESTEROL_HOMEOSTASIS, HALLMARK_MITOTIC_SPINDLE
## Metadata: source (1 column)
To access the underlying DataFrame of metadata, the mcols accessor can additionally be applied.
mcols(setInfo(base_sets_from_gmt))
## DataFrame with 4 rows and 1 column
## source
## <character>
## HALLMARK_TNFA_SIGNALING_VIA_NFKB http://www.broadinstitute.org/gsea/msigdb/cards/HALLMARK_TNFA_SIGNALING_VIA_NFKB
## HALLMARK_HYPOXIA http://www.broadinstitute.org/gsea/msigdb/cards/HALLMARK_HYPOXIA
## HALLMARK_CHOLESTEROL_HOMEOSTASIS http://www.broadinstitute.org/gsea/msigdb/cards/HALLMARK_CHOLESTEROL_HOMEOSTASIS
## HALLMARK_MITOTIC_SPINDLE http://www.broadinstitute.org/gsea/msigdb/cards/HALLMARK_MITOTIC_SPINDLE
## elementMetadata(setInfo(base_sets_from_gmt)) # equivalent to above
To export Sets objects in GMT file format, the export generic may be used if the file extension is “.gmt”. Alternatively, data in GMT format may be exported to files with different extensions (e.g., “.txt”) using the export.gmt function. Note that if a "source" column is not found in the set metadata (i.e., mcols(setInfo(x))), this value will be filled with "unisets" in the exported file.
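For illustration, writing GMT-formatted lines with a default source can be sketched in base R. This is not the export.gmt implementation; the variable names are made up, and only the fallback value "unisets" is taken from the text above.

```r
# Toy gene sets, as a named list of element identifiers.
sets_list <- list(
  geneset1 = c("A", "B"),
  geneset2 = c("B", "C", "D")
)

# No source information is available in this toy example.
set_sources <- NULL

# Fall back to a default value when the source is missing.
if (is.null(set_sources)) {
  set_sources <- rep("unisets", length(sets_list))
}

# One line per set: name, source, then tab-separated elements.
gmt_lines <- paste(
  names(sets_list),
  set_sources,
  vapply(sets_list, paste, character(1), collapse = "\t"),
  sep = "\t"
)

# Write to a temporary file with a non-".gmt" extension, then read it back.
out_file <- tempfile(fileext = ".txt")
writeLines(gmt_lines, out_file)
readLines(out_file)
```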
Bug reports can be posted as issues in the unisets GitHub repository. The GitHub repository is the primary source for development versions of the package, where new functionality is added over time. The authors appreciate well-considered suggestions for improvements or new features, or even better, pull requests.
If you use unisets for your analysis, please cite it as shown below:
citation("unisets")
##
## To cite package 'unisets' in publications use:
##
## Kevin Rue-Albrecht and Robert Amezquita (2019). unisets: Collection
## of Classes to Store Gene Sets. https://github.com/kevinrue/unisets,
## http://kevinrue.github.io/unisets.
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {unisets: Collection of Classes to Store Gene Sets},
## author = {Kevin Rue-Albrecht and Robert Amezquita},
## year = {2019},
## note = {https://github.com/kevinrue/unisets, http://kevinrue.github.io/unisets},
## }
## R version 4.0.0 (2020-04-24)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.4
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] unisets_0.99.0 S4Vectors_0.27.5 BiocGenerics_0.35.2
## [4] BiocStyle_2.17.0
##
## loaded via a namespace (and not attached):
## [1] SummarizedExperiment_1.19.2 xfun_0.13
## [3] reshape2_1.4.4 lattice_0.20-41
## [5] vctrs_0.2.4 htmltools_0.4.0
## [7] rtracklayer_1.49.1 yaml_2.2.1
## [9] blob_1.2.1 XML_3.99-0.3
## [11] rlang_0.4.6 pkgdown_1.5.1.9000
## [13] DBI_1.1.0 BiocParallel_1.23.0
## [15] bit64_0.9-7 matrixStats_0.56.0
## [17] GenomeInfoDbData_1.2.3 plyr_1.8.6
## [19] stringr_1.4.0 zlibbioc_1.35.0
## [21] Biostrings_2.57.0 memoise_1.1.0
## [23] evaluate_0.14 Biobase_2.49.0
## [25] knitr_1.28 IRanges_2.23.4
## [27] GenomeInfoDb_1.25.0 AnnotationDbi_1.51.0
## [29] GSEABase_1.51.0 Rcpp_1.0.4.6
## [31] xtable_1.8-4 backports_1.1.6
## [33] BiocManager_1.30.10 DelayedArray_0.15.1
## [35] desc_1.2.0 graph_1.67.0
## [37] annotate_1.67.0 XVector_0.29.0
## [39] fs_1.4.1 bit_1.1-15.2
## [41] Rsamtools_2.5.0 digest_0.6.25
## [43] stringi_1.4.6 bookdown_0.18
## [45] GenomicRanges_1.41.1 rprojroot_1.3-2
## [47] grid_4.0.0 tools_4.0.0
## [49] bitops_1.0-6 magrittr_1.5
## [51] RCurl_1.98-1.2 RSQLite_2.2.0
## [53] crayon_1.3.4 MASS_7.3-51.5
## [55] Matrix_1.2-18 assertthat_0.2.1
## [57] rmarkdown_2.1 R6_2.4.1
## [59] GenomicAlignments_1.25.0 compiler_4.0.0