Abstract

A discussion of concepts associated with cell types and transcriptional signatures.

Compiled date: 2020-06-01

Last edited: 2018-03-08

License: MIT + file LICENSE

Definition of cell identity

Discuss:

  • cell type is a continuum (e.g. differentiation, pseudotime)
  • on a similar note, all cells in an organism ultimately originate from a common progenitor (itself derived from two cells, and so on)
  • cell “states” may be defined similarly to cell “types” (e.g. markers of activated/resting cells)

Definitions

Definitions of absolute and relative markers and signatures

Absolute markers (also known as “pan markers”) may be defined as molecules (e.g., proteins, transcripts) known to be present or absent in a given population of cells, irrespective of their expression in other cells of the same sample. For instance, T helper lymphocytes can be defined by the presence of surface proteins Cd3 and Cd4, while T cytotoxic lymphocytes are defined by the presence of surface proteins Cd3 and Cd8. To assess whether a cell is likely a T helper lymphocyte, one does not need to know the markers of T cytotoxic lymphocytes, nor that the two cell types have the Cd3 marker in common.

Relative markers (also known as “key markers”) may be defined by differential expression against other cells in the same sample. For instance, in a biological sample including both T helper and T cytotoxic lymphocytes, differential expression between the two lymphocyte subsets would highlight Cd4 protein as a (relative) marker of T helper lymphocytes and Cd8 protein as a (relative) marker of T cytotoxic lymphocytes. In contrast, Cd3 protein may be considered a marker of either cell type only if its expression level or frequency differs significantly between the two cell types.
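The contrast between Cd3 and Cd4/Cd8 can be sketched with a toy comparison in base R. The protein levels, the object names, and the cut-off of 2 are all made up for illustration, and a plain difference of means stands in for a proper differential expression test.

```r
# Hypothetical protein levels (markers in rows, six cells in columns).
protein_levels <- rbind(
    Cd3 = c(10, 11, 9, 10, 10, 11),
    Cd4 = c(8, 9, 10, 0, 1, 0),
    Cd8 = c(0, 1, 0, 9, 8, 10)
)
cell_subset <- rep(c("T helper", "T cytotoxic"), each = 3)

# Difference in mean level between the two subsets, per marker.
mean_helper <- rowMeans(protein_levels[, cell_subset == "T helper"])
mean_cytotoxic <- rowMeans(protein_levels[, cell_subset == "T cytotoxic"])
difference <- mean_helper - mean_cytotoxic

# Arbitrary cut-off, chosen for illustration only.
relative_markers <- list(
    "T helper" = names(difference)[difference > 2],
    "T cytotoxic" = names(difference)[difference < -2]
)
relative_markers
```

In this toy data set Cd3 is present at similar levels in both subsets, so it is reported as a relative marker of neither.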

Note that sets of absolute markers may also be trimmed to markers specific to each cell population (i.e., excluding markers present in other signatures), to increase either stringency (due to their specificity) or sensitivity (due to their smaller number). For instance, since Cd3 protein is a marker of both T helper and T cytotoxic lymphocytes, one may wish to exclude it from both signatures, in a manner similar to, yet more stringent than, relative markers. Such markers may be called relative sets of absolute markers, as they are composed of absolute markers compared to one another in order to identify specific subsets of markers.
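The trimming described above might be sketched as follows in base R. The signatures reuse the markers named in the text, and the object names are hypothetical.

```r
# Toy absolute signatures, using the protein markers named in the text.
absolute_markers <- list(
    "T helper" = c("Cd3", "Cd4"),
    "T cytotoxic" = c("Cd3", "Cd8")
)

# Markers that appear in more than one signature.
marker_counts <- table(unlist(absolute_markers))
shared_markers <- names(marker_counts)[marker_counts > 1]

# Keep only the markers specific to each population.
specific_markers <- lapply(absolute_markers, setdiff, y = shared_markers)
specific_markers
```

This sketch assumes that each marker is listed at most once per signature, so that a count above 1 means the marker is shared between signatures.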

Definitions of qualitative and quantitative signatures

Qualitative signatures may be defined as those comprising lists of gene identifiers that are known to be either present or absent in a given population of cells, without any quantitative information (i.e., neither absolute nor relative, see below).

Quantitative signatures may be defined as either full transcriptional profiles of cell populations, or gene lists accompanied by either absolute or relative quantitative gene expression information (e.g., counts, transcripts per million).

In addition, semi-quantitative signatures may be defined as signatures accompanied by summarized gene expression data (e.g., gene rank by decreasing TPM).
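As a minimal illustration of the three kinds of signature, the same hypothetical TPM values can be kept as they are, reduced to presence or absence, or summarized as ranks. The values and object names are invented for this sketch.

```r
# Hypothetical TPM values for a handful of genes in one cell population.
tpm <- c(LYZ = 850.2, CD14 = 120.5, MS4A1 = 0, IL7R = 3.1)

# Quantitative signature: the values themselves.
quantitative <- tpm

# Qualitative signature: presence or absence (here, any non-zero value).
qualitative <- tpm > 0

# Semi-quantitative signature: rank by decreasing TPM (1 = most expressed).
semi_quantitative <- rank(-tpm, ties.method = "min")
semi_quantitative
```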

Applications

The origin of markers and signatures dictates how they should be used in downstream classification tasks.

Markers

Absolute markers have the advantage of allowing the immediate characterization of any cell or cluster, without the need for a reference cell or cluster. In this case, each cell or cluster may be screened for the presence of absolute markers, and assigned an identity independently of all other cells in the sample.
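A minimal sketch of such a screen is given below, in base R. The expression matrix, the gene symbols standing in for the protein markers, the helper `assign_identity()`, and the detection threshold of 0 are all assumptions made for illustration.

```r
# Hypothetical expression matrix (genes in rows, cells in columns).
expr <- matrix(
    c(5, 0, 2,   # CD3E
      3, 0, 0,   # CD4
      0, 0, 4,   # CD8A
      0, 7, 0),  # MS4A1
    nrow = 4, byrow = TRUE,
    dimnames = list(c("CD3E", "CD4", "CD8A", "MS4A1"),
                    c("cell1", "cell2", "cell3"))
)

# Toy absolute signatures: all listed markers must be detected.
signatures <- list(
    "T helper" = c("CD3E", "CD4"),
    "T cytotoxic" = c("CD3E", "CD8A"),
    "B cell" = c("MS4A1")
)

# Screen a single cell, independently of all other cells.
# Returns NA when no signature, or more than one, is matched.
assign_identity <- function(cell, signatures, threshold = 0) {
    detected <- names(cell)[cell > threshold]
    matches <- vapply(signatures,
                      function(markers) all(markers %in% detected),
                      logical(1))
    if (sum(matches) == 1) names(signatures)[matches] else NA_character_
}

identities <- apply(expr, 2, assign_identity, signatures = signatures)
identities
```

Each column is processed on its own, so adding or removing cells does not change the identity assigned to the others.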

Relative markers can be advantageous when the general cell type composition of a given sample is known in advance, and the problem is only to distribute a predefined set of identities expected in a given sample to a similar number of cell clusters. In this case, differential expression analysis may be performed between clusters of cells in the new sample, and markers of each cluster may be compared to a similar reference sample to assign identities defined in the reference sample to each of the cells or clusters in the new sample. However, cells within a given sample are generally the result of sorting (e.g., FACS) and enriching a population of interest on a set of (protein) markers. In those cases, the markers used for sorting the cells generally appear as highly expressed in all cells, making them difficult or impossible to identify as relative markers.

Relative sets of absolute markers may be advantageous when dealing with closely related cell populations or novel populations defined by new markers relative to canonical markers and cell types.

Signatures

Qualitative signatures may provide particularly fast and convenient ways to apply FACS-like “gating” strategies to the definition of cell identities. The main challenge for such signatures is to define thresholds below and above which transcripts may be considered absent or present (a natural default threshold being 0). The main advantage of qualitative signatures is that the presence or absence of any transcript should be considerably more stable than its fluctuating expression level.

Quantitative signatures may provide considerably more precise information on the relative expression level of markers (and other) genes. However, this additional information carries additional constraints and caveats with it, namely:

  • the naturally fluctuating level of transcripts in individual samples means that independent data sets will never produce exactly identical quantitative signatures
  • in relation to this, quantitative signatures should ideally require users to process and quantitate their new data set identically to the reference data set used to define those signatures; such requirements generally limit reproducibility between researchers and software versions, in addition to greatly hindering the definition, distribution, and interpretation of “gold standard” signatures.

Semi-quantitative signatures may provide a compromise between qualitative and quantitative signatures, summarizing fluctuating quantitative signatures into more stable semi-quantitative summary metrics (e.g., gene rank by expression level). For instance, such signatures may be used to identify the most correlated reference sample to any cell or cluster in a new data set, allowing (to some extent) the comparison of even imperfectly correlated quantitative measurements (e.g., TPM and CPM).
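One way to sketch this idea is a rank-based (Spearman) correlation between a query profile and a set of reference profiles. The profiles and object names below are hypothetical, and Spearman correlation is only one possible choice of rank-based metric.

```r
# Hypothetical reference profiles (TPM) for two cell types.
reference <- cbind(
    "Monocyte" = c(LYZ = 900, CD14 = 300, MS4A1 = 1, CD79A = 2),
    "B cell"   = c(LYZ = 20, CD14 = 1, MS4A1 = 400, CD79A = 350)
)

# Hypothetical query cluster, quantified on a different scale (e.g., CPM).
query <- c(LYZ = 45, CD14 = 12, MS4A1 = 0, CD79A = 1)

# Spearman correlation compares gene ranks rather than raw values.
rank_cor <- cor(query, reference[names(query), ], method = "spearman")

# Reference cell type with the highest rank correlation.
best_match <- colnames(reference)[which.max(rank_cor)]
best_match
```

Because only ranks are compared, the query and reference values do not need to share the same unit, as long as the ordering of genes is broadly preserved.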

Representation of signatures

Lists of marker names

At its simplest, a signature could list positive markers that are known to be expressed in each cell type. For instance, the PBMC signatures used in the Seurat PBMC 3k tutorial may be represented as follows:

pbmc3k_markers_list <- list(
    "CD4 T cells" = c("IL7R"),
    "CD14+ Monocytes" = c("CD14", "LYZ"),
    "B cells" = c("MS4A1"),
    "CD8 T cells" = c("CD8A"),
    "FCGR3A+ Monocytes" = c("FCGR3A", "MS4A7"),
    "NK cells" = c("GNLY", "NKG7"),
    "Dendritic Cells" = c("FCER1A", "CST3"),
    "Megakaryocytes" = c("PPBP")
)
pbmc3k_markers_list
#> $`CD4 T cells`
#> [1] "IL7R"
#> 
#> $`CD14+ Monocytes`
#> [1] "CD14" "LYZ" 
#> 
#> $`B cells`
#> [1] "MS4A1"
#> 
#> $`CD8 T cells`
#> [1] "CD8A"
#> 
#> $`FCGR3A+ Monocytes`
#> [1] "FCGR3A" "MS4A7" 
#> 
#> $`NK cells`
#> [1] "GNLY" "NKG7"
#> 
#> $`Dendritic Cells`
#> [1] "FCER1A" "CST3"  
#> 
#> $Megakaryocytes
#> [1] "PPBP"
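A named list of this kind is easy to reshape with base R. As a sketch, the list can be flattened into a two-column table of marker-to-cell-type relations, which makes reverse look-ups straightforward. The names `markers`, `marker_table` and the column names are chosen for this example only.

```r
# Same content as the list above, repeated so the example is self-contained.
markers <- list(
    "CD4 T cells" = c("IL7R"),
    "CD14+ Monocytes" = c("CD14", "LYZ"),
    "B cells" = c("MS4A1"),
    "CD8 T cells" = c("CD8A"),
    "FCGR3A+ Monocytes" = c("FCGR3A", "MS4A7"),
    "NK cells" = c("GNLY", "NKG7"),
    "Dendritic Cells" = c("FCER1A", "CST3"),
    "Megakaryocytes" = c("PPBP")
)

# One row per (marker, cell type) relation.
marker_table <- data.frame(
    marker = unlist(markers, use.names = FALSE),
    cell_type = rep(names(markers), lengths(markers)),
    stringsAsFactors = FALSE
)

# Reverse look-up: which cell type(s) list NKG7 as a marker?
marker_table$cell_type[marker_table$marker == "NKG7"]
```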

Using GSEABase GeneSetCollection

The GSEABase package provides infrastructure for enumerating pathways and their contents and facilitating translations among the different nomenclatures that are used to describe contents of pathways (source).

The above list may be more formally (and elegantly) represented as follows:

library(GSEABase)
pbmc3k_markers_gsc <- GeneSetCollection(list(
    GeneSet(setName="CD4 T cells", c("IL7R")),
    GeneSet(setName="CD14+ Monocytes", c("CD14", "LYZ")),
    GeneSet(setName="B cells", c("MS4A1")),
    GeneSet(setName="CD8 T cells", c("CD8A")),
    GeneSet(setName="FCGR3A+ Monocytes", c("FCGR3A", "MS4A7")),
    GeneSet(setName="NK cells", c("GNLY", "NKG7")),
    GeneSet(setName="Dendritic Cells", c("FCER1A", "CST3")),
    GeneSet(setName="Megakaryocytes", c("PPBP"))
))
pbmc3k_markers_gsc
#> GeneSetCollection
#>   names: CD4 T cells, CD14+ Monocytes, ..., Megakaryocytes (8 total)
#>   unique identifiers: IL7R, CD14, ..., PPBP (12 total)
#>   types in collection:
#>     geneIdType: NullIdentifier (1 total)
#>     collectionType: NullCollection (1 total)

Note that generic R lists can easily be packaged into GSEABase GeneSetCollection objects, for instance:

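# Wrap each named character vector into a GeneSet.
# The identifiers are gene symbols, hence SymbolIdentifier().
pbmc3k_markers_gsc <- GeneSetCollection(mapply(function(geneIds, geneSetId) {
        GeneSet(geneIds, geneIdType=SymbolIdentifier(),
                collectionType=NullCollection(),
                setName=geneSetId)
    }, pbmc3k_markers_list, names(pbmc3k_markers_list), SIMPLIFY=FALSE))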
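The reverse conversion is also possible. The sketch below packages a small list and then recovers it with the `geneIds()` accessor. The two-set list and the object names are illustrative, and the names of the recovered list are set explicitly from the collection rather than relying on the accessor to carry them over.

```r
library(GSEABase)

# A small named list of marker symbols (subset of the list used above).
markers_list <- list(
    "NK cells" = c("GNLY", "NKG7"),
    "Megakaryocytes" = c("PPBP")
)

# Package the list, one GeneSet per named element.
gene_sets <- lapply(names(markers_list), function(set_name) {
    GeneSet(markers_list[[set_name]], setName = set_name,
            geneIdType = SymbolIdentifier())
})
markers_gsc <- GeneSetCollection(gene_sets)

# Recover a plain list of identifiers, named by gene set.
recovered <- geneIds(markers_gsc)
names(recovered) <- names(markers_gsc)
recovered
```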

Using unisets Sets

The unisets package is being developed using S4Vectors Hits to associate identifiers in a vector of elements to identifiers in a vector of sets.

The package can be installed as follows:

devtools::install_github("kevinrue/unisets")

Conveniently, this package supports multiple source formats to create gene sets, including the list described above.

library(unisets)
inputList <- list(
    "CD4 T cells" = c("IL7R"),
    "CD14+ Monocytes" = c("CD14", "LYZ"),
    "B cells" = c("MS4A1"),
    "CD8 T cells" = c("CD8A"),
    "FCGR3A+ Monocytes" = c("FCGR3A", "MS4A7"),
    "NK cells" = c("GNLY", "NKG7"),
    "Dendritic Cells" = c("FCER1A", "CST3"),
    "Megakaryocytes" = c("PPBP")
)
pbmc3k_markers_sets <- as(inputList, "Sets")
pbmc3k_markers_sets
#> Sets with 12 relations between 12 elements and 8 sets
#>          element             set
#>      <character>     <character>
#>  [1]        IL7R     CD4 T cells
#>  [2]        CD14 CD14+ Monocytes
#>  [3]         LYZ CD14+ Monocytes
#>  [4]       MS4A1         B cells
#>  [5]        CD8A     CD8 T cells
#>  ...         ...             ...
#>  [8]        GNLY        NK cells
#>  [9]        NKG7        NK cells
#> [10]      FCER1A Dendritic Cells
#> [11]        CST3 Dendritic Cells
#> [12]        PPBP  Megakaryocytes
#> -----------
#> elementInfo: IdVector with 0 metadata
#>     setInfo: IdVector with 0 metadata

Combinations of positive and negative markers using the GeneColorSet class

“Colored” gene sets, implemented in the GSEABase GeneColorSet class, can store information about the relation between each gene and a given phenotype. Here, “colors” are factor levels used to represent the “state” of each gene (e.g., expression levels “up”, “down”, or “unchanged”), and a phenotypic consequence (e.g., the phenotype is “enhanced” or “reduced”). For the purpose of signatures associated with individual cell types, we may consider the identity of a differentiated cell as the phenotype of interest.

In particular, individual genes may be positively or negatively associated with certain populations of cells. Such genes may be called positive or negative markers for the associated cell population.

For instance, the differentiation of monocytes into mature F4/80hi CX3CR1hi MHCII+ macrophages in the murine colonic mucosa has been described as a “waterfall”, an image that captures the concomitant downregulation of Ly6C and upregulation of MHCII (Tamoutounour et al. 2012).

Conceptually, the transcriptional signatures of Ly6Chi monocytes and mature F4/80hi CX3CR1hi MHCII+ macrophages may be represented as follows:

colored_gsc <- GeneSetCollection(list(
    GeneColorSet(
        setName="Monocytes", c("Ly6c1", "MHCII"), phenotype=c("Ly6Chi"),
        geneColor=factor(c("high", "-")),
        phenotypeColor=factor(c(TRUE, TRUE))),
    GeneColorSet(
        setName="Macrophages", c("Ly6c1", "MHCII"), phenotype=c("mature F4/80hi CX3CR1hi MHCII+"),
        geneColor=factor(c("low", "+")),
        phenotypeColor=factor(c(TRUE, TRUE)))
))
colored_gsc
#> GeneSetCollection
#>   names: Monocytes, Macrophages (2 total)
#>   unique identifiers: Ly6c1, MHCII (2 total)
#>   types in collection:
#>     geneIdType: NullIdentifier (1 total)
#>     collectionType: NullCollection (1 total)

The monocyte signature can be extracted for further inspection.

colored_gsc[["Monocytes"]]
#> setName: Monocytes 
#> geneIds: Ly6c1, MHCII (total: 2)
#> geneIdType: Null
#> collectionType: Null 
#> phenotype: Ly6Chi 
#> geneColor: high, -
#>   levels: -, high
#> phenotypeColor: TRUE, TRUE
#>   levels: TRUE
#> details: use 'details(object)'

It is also possible to examine the relationship of individual markers with respect to the “Ly6Chi Monocyte” phenotype.

colored_gsc[["Monocytes"]][["Ly6c1"]]
#>         geneId      geneColor phenotypeColor 
#>        "Ly6c1"         "high"         "TRUE"
colored_gsc[["Monocytes"]][["MHCII"]]
#>         geneId      geneColor phenotypeColor 
#>        "MHCII"            "-"         "TRUE"
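To illustrate how positive and negative markers might be applied downstream, the following base R sketch matches cells against signatures that state the expected state of each marker. It is independent of GSEABase. Collapsing “high”/“low” and “+”/“-” into detected or not detected is a simplification, and the detection calls are invented.

```r
# Hypothetical signatures with positive (TRUE) and negative (FALSE) markers,
# loosely following the Ly6C / MHCII example above.
signatures <- list(
    "Monocytes"   = c(Ly6c1 = TRUE, MHCII = FALSE),
    "Macrophages" = c(Ly6c1 = FALSE, MHCII = TRUE)
)

# Hypothetical detection calls for two cells (TRUE = marker detected).
cells <- rbind(
    cellA = c(Ly6c1 = TRUE, MHCII = FALSE),
    cellB = c(Ly6c1 = FALSE, MHCII = TRUE)
)

# A cell matches a signature when every marker has the expected state.
# The result has one row per cell and one column per signature.
matches <- sapply(signatures, function(expected) {
    apply(cells[, names(expected), drop = FALSE], 1,
          function(observed) all(observed == expected))
})
matches
```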

Session info

#> R Under development (unstable) (2020-05-29 r78616)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 16.04.6 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/travis/R-bin/lib/R/lib/libRblas.so
#> LAPACK: /home/travis/R-bin/lib/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] unisets_0.99.0       GSEABase_1.51.1      graph_1.67.1        
#>  [4] annotate_1.67.0      XML_3.99-0.3         AnnotationDbi_1.51.0
#>  [7] IRanges_2.23.6       S4Vectors_0.27.9     Biobase_2.49.0      
#> [10] BiocGenerics_0.35.2  BiocStyle_2.17.0    
#> 
#> loaded via a namespace (and not attached):
#>  [1] SummarizedExperiment_1.19.4 xfun_0.14                  
#>  [3] reshape2_1.4.4              lattice_0.20-41            
#>  [5] vctrs_0.3.0                 htmltools_0.4.0            
#>  [7] rtracklayer_1.49.1          yaml_2.2.1                 
#>  [9] blob_1.2.1                  rlang_0.4.6                
#> [11] pkgdown_1.5.1               DBI_1.1.0                  
#> [13] BiocParallel_1.23.0         bit64_0.9-7                
#> [15] matrixStats_0.56.0          GenomeInfoDbData_1.2.3     
#> [17] plyr_1.8.6                  stringr_1.4.0              
#> [19] zlibbioc_1.35.0             Biostrings_2.57.0          
#> [21] memoise_1.1.0               evaluate_0.14              
#> [23] knitr_1.28                  GenomeInfoDb_1.25.0        
#> [25] Rcpp_1.0.4.6                xtable_1.8-4               
#> [27] backports_1.1.7             BiocManager_1.30.10        
#> [29] DelayedArray_0.15.1         desc_1.2.0                 
#> [31] XVector_0.29.1              fs_1.4.1                   
#> [33] bit_1.1-15.2                Rsamtools_2.5.0            
#> [35] digest_0.6.25               stringi_1.4.6              
#> [37] bookdown_0.19               grid_4.1.0                 
#> [39] GenomicRanges_1.41.1        rprojroot_1.3-2            
#> [41] tools_4.1.0                 bitops_1.0-6               
#> [43] magrittr_1.5                RCurl_1.98-1.2             
#> [45] RSQLite_2.2.0               crayon_1.3.4               
#> [47] Matrix_1.2-18               MASS_7.3-51.6              
#> [49] assertthat_0.2.1            rmarkdown_2.2              
#> [51] R6_2.4.1                    GenomicAlignments_1.25.0   
#> [53] compiler_4.1.0

References

Tamoutounour, S., S. Henri, H. Lelouard, B. de Bovis, C. de Haar, C. J. van der Woude, A. M. Woltman, et al. 2012. “CD64 Distinguishes Macrophages from Dendritic Cells in the Gut and Reveals the Th1-Inducing Role of Mesenteric Lymph Node Macrophages During Colitis.” Eur J Immunol 42 (12): 3150–66. https://doi.org/10.1002/eji.201242847.