Example data set

First, load the SingleCellExperiment object of 10X Genomics 2,700 peripheral blood mononuclear cells (PBMC) preprocessed in a previous vignette.

library(SingleCellExperiment)
sce <- readRDS("pbmc3k_tutorial.sce.rds")
sce
## class: SingleCellExperiment 
## dim: 32738 2638 
## metadata(0):
## assays(1): counts
## rownames(32738): ENSG00000243485 ENSG00000237613 ...
##   ENSG00000215616 ENSG00000215611
## rowData names(3): ENSEMBL_ID Symbol_TENx Symbol
## colnames: NULL
## colData names(12): Sample Barcode ... Date_published seurat.ident
## reducedDimNames(2): PCA TSNE
## spikeNames(0):

Gene signatures

Declare the predefined gene signatures used in the Seurat - Guided Clustering Tutorial.

library(unisets)
geneLists <- list(
    "CD4 T cells"=c("IL7R"),
    "CD14+ Monocytes"=c("CD14", "LYZ"),
    "B cells"=c("MS4A1"),
    "CD8 T cells"=c("CD8A"),
    "FCGR3A+ Monocytes"=c("FCGR3A", "MS4A7"),
    "NK cells"=c("GNLY", "NKG7"),
    "Dendritic Cells"=c("FCER1A", "CST3"),
    "Megakaryocytes"=c("PPBP")
)
basesets <- as(geneLists, "Sets")
basesets
## Sets with 12 relations between 12 elements and 8 sets
##          element             set
##      <character>     <character>
##  [1]        IL7R     CD4 T cells
##  [2]        CD14 CD14+ Monocytes
##  [3]         LYZ CD14+ Monocytes
##  [4]       MS4A1         B cells
##  [5]        CD8A     CD8 T cells
##  ...         ...             ...
##  [8]        GNLY        NK cells
##  [9]        NKG7        NK cells
## [10]      FCER1A Dendritic Cells
## [11]        CST3 Dendritic Cells
## [12]        PPBP  Megakaryocytes
## -----------
## elementInfo: IdVector with 0 metadata
##     setInfo: IdVector with 0 metadata

Applying signatures to a data set

The predict method

The basesets object can immediately be supplied to the predict S3 method, in combination with the SummarizedExperiment object to annotate. However, the cell type signatures described in the Seurat - Guided Clustering Tutorial use gene symbol identifiers, which do not match the Ensembl gene identifiers used in rownames(sce). Here, to make the gene identifiers compatible between the data set and the signatures, we simply replace the Ensembl gene identifiers by the "Symbol_TENx" gene metadata, after making duplicated values unique.

library(hancock)
rownames(sce) <- make.unique(rowData(sce)[["Symbol_TENx"]])
sce <- predict(basesets, sce, method="ProportionPositive", cluster.col="seurat.ident")

Prediction outputs

The updated SummarizedExperiment object is returned, with a number of additional metadata.

The key output is the cell identity predicted for each cell in the object. This information is stored in colData(sce)[["hancock"]][["prediction"]], or sce$hancock$prediction, in short.

table(sce$hancock$prediction)
## 
##       CD4 T cells   CD14+ Monocytes           B cells       CD8 T cells 
##              1151               479               342               308 
## FCGR3A+ Monocytes          NK cells   Dendritic Cells    Megakaryocytes 
##               157               155                32                14

With method="ProportionPositive", all cells in each cluster are annotated with the predominant signature in their associated cluster. As such, the cell type prediction may be tabulated against the cluster membership of each cell. This view can potentially highlight multiple clusters labelled by the same predicted label as a sign of overclustering.

table(sce$hancock$prediction, sce$seurat.ident)
##                    
##                        0    1    2    3    4    5    6    7
##   CD4 T cells       1151    0    0    0    0    0    0    0
##   CD14+ Monocytes      0  479    0    0    0    0    0    0
##   B cells              0    0  342    0    0    0    0    0
##   CD8 T cells          0    0    0  308    0    0    0    0
##   FCGR3A+ Monocytes    0    0    0    0  157    0    0    0
##   NK cells             0    0    0    0    0  155    0    0
##   Dendritic Cells      0    0    0    0    0    0   32    0
##   Megakaryocytes       0    0    0    0    0    0    0   14

In addition to the final predictions, the updated SummarizedExperiment object also contain some additional information tracing to the process of prediction. That information can be found in metadata(sce)[["hancock"]].

names(metadata(sce)[["hancock"]])
## [1] "GeneSets"                    "method"                     
## [3] "packageVersion"              "ProportionPositiveByCluster"
## [5] "TopSignatureByCluster"

In particular:

  • GeneSets stores the object that provided the gene set signatures used for the prediction
  • method stores the name of the method used to make the predictions
  • packageVersion stores the version of the hancock package used to make the predictions

Specifically, method="ProportionPositive" adds the following metadata items:

  • ProportionPositiveByCluster stores the matrix indicating the proportion of cells in each cluster that are positive for each signature
  • TopSignatureByCluster stores a named vector that indicates the predominant signature for each cluster. It is shown below.
metadata(sce)[["hancock"]][["TopSignatureByCluster"]]
##                 0                 1                 2                 3 
##       CD4 T cells   CD14+ Monocytes           B cells       CD8 T cells 
##                 4                 5                 6                 7 
## FCGR3A+ Monocytes          NK cells   Dendritic Cells    Megakaryocytes 
## 8 Levels: CD4 T cells CD14+ Monocytes B cells ... Megakaryocytes

Visualizing predictions

Heat map

It is possible to visualize the proportion of cells positive for each signature in each cluster as a heat map. Note how the diagonal (displaying the tutorial annotations) shows high proportions of cells matching the annotated signature.

plotProportionPositive(sce, cluster_rows=FALSE, cluster_columns=FALSE)

We can also let signatures and clusters of cells cluster in the heat map. Unsurprisingly, the subpopulations of monocytes show some similarity, as do the subpopulations of CD4 positive and CD8 positive T cells.

plotProportionPositive(sce)

Reduced dimension

With a bit more code, we can also display the expression of each marker on the result of t-SNE dimensionality reduction.

library(ggplot2)
library(cowplot)
ggList <- list()
for (markerName in ids(elementInfo(basesets))) {
    ggDataFrame <- data.frame(
        reducedDim(sce, "TSNE"),
        logcounts=log2(assay(sce, "counts")[markerName, ] + 1)
    )
    cellIdentity <- subset(as.data.frame(basesets), element == markerName, "set", drop=TRUE)
    gg <- ggplot(ggDataFrame, aes(tSNE_1, tSNE_2, color=logcounts)) +
        geom_point(size=0.1) +
        scale_color_viridis_c() +
        labs(title=markerName, subtitle=cellIdentity, x=NULL, y=NULL) +
        theme(
            axis.text = element_blank(), axis.ticks = element_blank(),
            text = element_text(size=rel(3)), legend.text = element_text(size=rel(2)),
            legend.position = "bottom")
    ggList[[markerName]] <- gg
}
cowplot::plot_grid(plotlist = ggList)