First of all, let us define:
sampleQCtheme <-
theme_minimal() +
theme(
axis.ticks.x = element_blank(), axis.text.x = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "bottom", legend.box = "vertical")
groupQCtheme <-
theme_bw() +
theme(
axis.text.x = element_text(angle=90, hjust = 1, vjust = 0.5, size = rel(0.6)),
legend.position = "bottom", legend.box = "vertical",
panel.grid.minor = element_blank(),
plot.title = element_text(hjust = 0.5)
)
col.time <- brewer.pal(9, "Set3")[c(2,7:8)]
col.infection <- brewer.pal(12, "Paired")[c(9,4,2,5)]
col.status <- brewer.pal(12, "Paired")[c(9,1,7,3,5)]
names(col.time) <- levels(sce.all$Time)[1:3]
names(col.infection) <- levels(sce.all$Infection)
names(col.status) <- levels(sce.all$Status)
data.frame that will be used to store outliers identified in each of the quality control filters defined below:outliers <- data.frame(row.names = colnames(sce.sc))
Blank samples are designed to detect major technical issues during library preparation; in particular contamination, as only ERCC spike-in features are expected in those samples.
The scater calculateQCMetrics method may then be used to calculate the relevant QC metrics:
sce.blank <- calculateQCMetrics(
sce.blank,
exprs_values = "counts",
feature_controls = list(ERCC = isSpike(sce.blank, "ERCC"))
)
## Warning in calculateQCMetrics(sce.blank, exprs_values = "counts",
## feature_controls = list(ERCC = isSpike(sce.blank, : spike-in set 'ERCC'
## overwritten by feature_controls set of the same name
## Note that the names of some metrics have changed, see 'Renamed metrics' in ?calculateQCMetrics.
## Old names are currently maintained for back-compatibility, but may be removed in future releases.
In the SingleCellExperiment object augmnented by QC metrics, let us visualise the most abundant features which displays the expected over-representation of ERCC spike-in features:
plotHighestExprs(sce.blank, n = 100)

GC content is species-specific; it can be estimated from the annotated transcriptome or the sequenced data. Therefore, GC content excessively different from the majority of samples can indicate issues during library preparation or contamination.
Let us identify as outliers single cells that display a GC content excessively different from the median of single cell samples, here beyond 3 MADs:
outliers$GC <- isOutlier(sce.sc$FastQC_percent_gc, 3)
table(outliers$GC)
##
## FALSE TRUE
## 372 1
Let us visualise outliers, along with blank and bulk samples for reference:

For reference:
Technical issues introduced during library preparation may result in an excessively low alignment rate. Nevertheless, libraries with slightly lower alignment rates may still yield a library size comparable to the majority of samples.
For those reasons reasons, let us identify as outliers single cells with an alignment rate excessively lower than the median of single cell samples, here beyond 3 MADs:
outliers$AlignedPercent <- isOutlier(sce.sc$Bowtie.2_overall_alignment_rate, 3, "lower")
table(outliers$AlignedPercent)
##
## FALSE TRUE
## 353 20
Let us visualise outliers, along with blank and bulk samples for reference:

For reference:
An excessively small count of aligned reads assigned to annotated genomics features (i.e., library size) may indicate technical issues during library preparation or low quality samples. Moreover, library size excessively different from the majority of samples may introduce bias due to under- or over-sampling of RNA fragments, and negatively affect the performance of between-sample normalisation algorithms.
Let us identify as outliers single cells with a count of assigned read pairs (i.e. library size) excessively different from the median of single cell samples, here beyond 3 MADs:
outliers$AssignedCount <- isOutlier(sce.sc$featureCounts_Assigned, 3, log = TRUE)
table(outliers$AssignedCount)
##
## FALSE TRUE
## 352 21
Let us visualise outliers, along with blank and bulk samples, for reference:

For reference:
Similarly, unusually low assignment rate of reads to annotated genomic features may indicate technical issue; for instance during poly-A selection.
Let us identify as outliers single cells with a rate of assigned reads excessively lower than the median of single cell samples, here beyond 3 MADs:
outliers$AssignedPercent <- isOutlier(sce.sc$featureCounts_percent_assigned, 3)
table(outliers$AssignedPercent)
##
## FALSE TRUE
## 357 16
Let us visualise outliers, along with blank and bulk samples, for reference:

For reference:
Let us display the count of cells identified as outliers for each of the filters so far:
fastqc.fr <- FilterRules(colnames(outliers))
fastqc.fr
## FilterRules of length 4
## names(4): GC AlignedPercent AssignedCount AssignedPercent
| Fail | Pass | |
|---|---|---|
| 373 | 373 | |
| GC | 1 | 372 |
| AlignedPercent | 20 | 353 |
| AssignedCount | 21 | 352 |
| AssignedPercent | 16 | 357 |
| 1 | 347 |
Note: In the above table, the first row simply indicates the total count of cells before any filtering; conversely, the last row indicates the count of cells that failed or passed, respectively, all filters.
The following sections require the estimation of several QC metrics that leverage control features and control samples, as means to identify outliers among single cells.
Let us apply the scater calculateQCMetrics to calculate the relevant QC metrics in the single cell data set, using set of spike-in control features:
sce.sc <- calculateQCMetrics(
sce.sc,
feature_controls = list(
ERCC = isSpike(sce.sc, "ERCC"),
MT = as.logical(seqnames(sce.sc) == "MT")
)
)
## Warning in calculateQCMetrics(sce.sc, feature_controls = list(ERCC =
## isSpike(sce.sc, : spike-in set 'ERCC' overwritten by feature_controls set
## of the same name
## Note that the names of some metrics have changed, see 'Renamed metrics' in ?calculateQCMetrics.
## Old names are currently maintained for back-compatibility, but may be removed in future releases.
For reference purpose in upcoming figures, let us also calculate the relevant QC metrics in the complete data set:
sce.all <- calculateQCMetrics(
sce.all,
feature_controls = list(
ERCC = isSpike(sce.sc, "ERCC"),
MT = as.logical(seqnames(sce.sc) == "MT")
),
cell_controls = list(
Bulk = which(sce.all$Status == "BULK"),
Blank = which(sce.all$Status == "Blank")
)
)
## Warning in calculateQCMetrics(sce.all, feature_controls = list(ERCC =
## isSpike(sce.sc, : spike-in set 'ERCC' overwritten by feature_controls set
## of the same name
## Note that the names of some metrics have changed, see 'Renamed metrics' in ?calculateQCMetrics.
## Old names are currently maintained for back-compatibility, but may be removed in future releases.
The library size refers to the total count of reads assigned to annotated features in each individual sample.
Here, let us simply show the direct relation between log10_total_counts computed by calculateQCMetrics and the featureCounts_Assigned counts detected by MultiQC in an earlier section above:
with(
data.frame(colData(sce.sc)),
all(abs(log10_total_counts == log10(featureCounts_Assigned+1)))
)
## [1] TRUE
As a result, this metric is identical to the featureCounts_Assigned metric described above; and thereby would unnecessarily identify the same outliers.
Library complexity refers to the proportion of assigned reads accounted for by the \(N\) most abundant features within each individual sample, with values for \(N\) usually ranging from tens to hundreds. It is expected to observe a lower library complexity in single cells relative to bulk samples, due to higher dropout events in single cells, and the union of multiple phenotypes within bulk samples. In contrast, blank samples are expected to have extremely low library complexity, as only up to 92 ERCC spike-ins sequences are expected in those libraries, in contrast to thousands of genomic features in biological samples.
In the SingleCellExperiment object augmnented by QC metrics, let us visualise the library complexity profile against experimental phenotype information:
plotScater(
sce.all, block1 = "Status", block2 = "Infection",
colour_by = "Time", nfeatures = 300, exprs_values = "counts"
)

First, from a QC perspective:
exposed:D23580:6h,exposed:LT2:4h, infected:D23580:4h, infected:D23580:6h, infected:LT2:6h) show suspiciously low library complexity; suggestive of technical issues during library prepration (e.g., amplification bias, dropout events). Those single cells must therefore be treated as outliers, and be removed from the data set prior to feature selection, normalisation, and subsequent analyses.Now, from a biological perspective, interesting observations may be made:
Time in exposed and infected samples in cells exposed to or infected by bacteria, in contrast to control uninfected cells. This may indicate a polarisation of single cells toward specialised phenotype(s) after infection or exposure (e.g., cytokine production, antigen presentation).Mock (i.e., uninfected) single cells tend to show a consistent, relatively high library complexityLT2 and D23580 infections show a gradual decrease of library complexity over time. This observation might be more pronounced for D23580.Although abundantly expressed features are an expected aspect of cell biology (e.g., myosin in muscle cells, clusters of differentiation in immune cells) which may contribute to the identification and classification of cell types, overly abundant features in RNA-Seq libraries may indicate issues during library preparation; in particular, cDNA amplification bias. Moreover, when a small number of highly abundant features account for an excessive proportion of sequenced reads, expression levels for the vast majority of other genes may not be estimated with sufficient accuracy.
Let us identify as outliers single cells in which the 50, 100, and 200 most abundant features account for a proportion of library size excessively higher than the median of single cell samples, here beyond 3 MADs:
outliers$top50 <- isOutlier(sce.sc$pct_counts_top_50_features, 3)
table(outliers$top50)
##
## FALSE TRUE
## 367 6
outliers$top100 <- isOutlier(sce.sc$pct_counts_top_100_features, 3)
table(outliers$top100)
##
## FALSE TRUE
## 367 6
outliers$top200 <- isOutlier(sce.sc$pct_counts_top_200_features, 3)
table(outliers$top200)
##
## FALSE TRUE
## 367 6
Let us visualise outliers, along with blank and bulk samples, for reference:

For reference:

For reference:
Note: Two infected:D23580 samples appear near (below) the cut-off threshold. This observation can be related to the overall library complexity figure shown earlier.

For reference:
Note: Two infected:D23580 samples appear near (below) the cut-off threshold. This observation can be related to the overall library complexity figure shown earlier.
A notion closely related to library complexity is the count of features detected within each individual sample. Although the definition of “detection level” may be debated, we will consider here any feature with non-zero count.
Abnormal counts of detected features may be indicative of technical issue during RNA capture and library preparation; for instance, excessively low counts of detected features may indicate RNA degradation, while excessively high counts of detected features may indicate capture of multiple cells.
Owing to the considerable amount of variation that results from the relaxed definition of detection level above, let us apply a somewhat stringent cut-off of 2 MADs to identify as outliers single cells with a count of detected features excessively lower than the median of single cell samples, here beyond 2 MADs::
outliers$FeaturesCount <- isOutlier(sce.sc$total_features, 2, "lower")
table(outliers$FeaturesCount)
##
## FALSE TRUE
## 361 12
Let us visualise outliers, along with blank and bulk samples, for reference:

For reference:
An identical–yet very small–quantity of ERCC spike-in features is added to each sample prior to library preparation. As a result, technical variation and issues during library preparation may be estimated from the proportion of reads assigned to those spike-in features; an excessive proportion of reads assigned to ERCC spike-in features may indicate eukaryotic RNA degradation, while excessive variation may indicate cDNA amplication bias.
Due to the limited amount of variation that results from the generally low levels of ERCC spike-in features detected, let us apply a somewhat relaxed cut-off of 5 MADs to identify as outliers single cells an ERCC content excessively higher than the median of single cell samples:
outliers$ERCC <- isOutlier(sce.sc$pct_counts_ERCC, 5, "higher")
table(outliers$ERCC)
##
## FALSE TRUE
## 357 16
Let us visualise those outliers, along with blank and bulk samples for reference:

For reference:
Similarly to ERCC spike-in features, mitochondrial RNA may be used as an endogenous source of control features; the limited count of genes encoded on the mitochondrial chromosome relative to the human chromosomes implies that, even at their highest levels of expression, mitochondrial genes are expected to account for a low proportion of the total read count in each sample. Consequently, abnormally high proportions of reads assigned to mitochondrial gene features may be indicative of technical issue during RNA capture and library preparation, or abnormal count of mitochondria within the single cells.
Conversely, an unusually low proportion of reads assigned to mitochondrial gene features may also indicate issues related to RNA capture and library preparation.
Let us identify as outliers single cells for which the proportion of reads assigned to mitochondrial gene features is excessively higher than the median of single cell samples:
outliers$MT <- isOutlier(sce.sc$pct_counts_MT, 3, "higher")
table(outliers$MT)
##
## FALSE TRUE
## 362 11
Let us visualise those outliers, along with blank and bulk samples for reference:

For reference:
Let us display the count of cells identified as outliers for each of the filters in this section:
scater.fr <- FilterRules(setdiff(colnames(outliers), names(fastqc.fr)))
scater.fr
## FilterRules of length 6
## names(6): top50 top100 top200 FeaturesCount ERCC MT
| Fail | Pass | |
|---|---|---|
| 373 | 373 | |
| top50 | 6 | 367 |
| top100 | 6 | 367 |
| top200 | 6 | 367 |
| FeaturesCount | 12 | 361 |
| ERCC | 16 | 357 |
| MT | 11 | 362 |
| 5 | 355 |
Let us now collect the results of all the QC metrics obtained in the previous sections, and display the count of cells :
all.fr <- FilterRules(colnames(outliers))
all.fr
## FilterRules of length 10
## names(10): GC AlignedPercent AssignedCount ... FeaturesCount ERCC MT
| Fail | Pass | |
|---|---|---|
| 373 | 373 | |
| GC | 1 | 372 |
| AlignedPercent | 20 | 353 |
| AssignedCount | 21 | 352 |
| AssignedPercent | 16 | 357 |
| top50 | 6 | 367 |
| top100 | 6 | 367 |
| top200 | 6 | 367 |
| FeaturesCount | 12 | 361 |
| ERCC | 16 | 357 |
| MT | 11 | 362 |
| 0 | 342 |
In the table above, none of the cells fails all the filters; this demonstrates the importance of each individual filter, despite the apparent relatedness and overlap between many of the metrics.
As a result, let us identify the cells that passed all of the above filters:
idx.pass <- (rowSums(evalSeparately(all.fr, outliers)) == 0)
Secondly, we may also examine the count of single cell samples, according to the number of QC filters that they have failed:
| Filters | Cells |
|---|---|
| 0 | 342 |
| 1 | 13 |
| 2 | 2 |
| 3 | 5 |
| 5 | 1 |
| 6 | 4 |
| 9 | 6 |
We may summarise experimental phenotype information for the outliers samples, or alternatively, the samples that successfully passed all filters:
summary(data.frame(colData(sce.sc))[
!idx.pass,
c("Infection", "Status", "Time", "Lane", "Plate")
])
## Infection Status Time Lane Plate
## Mock :11 Uninfected:11 2h: 5 305264:20 P1: 6
## STM-LT2 : 7 Exposed :11 4h:15 305265:11 P2:14
## STM-D23580:13 Infected : 9 6h:11 P3: 9
## P4: 2
summary(data.frame(colData(sce.sc))[
idx.pass,
c("Infection", "Status", "Time", "Lane", "Plate")
])
## Infection Status Time Lane Plate
## Mock : 80 Uninfected: 80 2h:119 305264:165 P1:86
## STM-LT2 :134 Exposed :127 4h:110 305265:177 P2:79
## STM-D23580:128 Infected :135 6h:113 P3:86
## P4:91
Let us define an order of sample that groups single cells by experimental phenotypes:
cells.order <- with(colData(sce.sc), order(Time, Status, Infection))
We may then display the QC metrics failed by each single cell as red cells in an heat map:

## quartz_off_screen
## 2
## quartz_off_screen
## 2
Let us finally retain the single cells that passed all QC metrics into a new SCESet:
sce.pass <- sce.sc[,idx.pass]
dim(sce.pass)
## [1] 58084 342
For each experimental groups, we may display the count of cells that passed the above QC are as follow, as well as the initial count of cells, for reference :
| Initial | Passed | Failed | Time | Infection | Status |
|---|---|---|---|---|---|
| 30 | 30 | 0 | 2h | Mock | Uninfected |
| 31 | 21 | 10 | 4h | Mock | Uninfected |
| 30 | 29 | 1 | 6h | Mock | Uninfected |
| 23 | 23 | 0 | 2h | STM-LT2 | Exposed |
| 23 | 21 | 2 | 4h | STM-LT2 | Exposed |
| 23 | 22 | 1 | 6h | STM-LT2 | Exposed |
| 24 | 23 | 1 | 2h | STM-LT2 | Infected |
| 24 | 23 | 1 | 4h | STM-LT2 | Infected |
| 24 | 22 | 2 | 6h | STM-LT2 | Infected |
| 23 | 20 | 3 | 2h | STM-D23580 | Exposed |
| 23 | 23 | 0 | 4h | STM-D23580 | Exposed |
| 23 | 18 | 5 | 6h | STM-D23580 | Exposed |
| 24 | 23 | 1 | 2h | STM-D23580 | Infected |
| 24 | 22 | 2 | 4h | STM-D23580 | Infected |
| 24 | 22 | 2 | 6h | STM-D23580 | Infected |