Phenotype information

Phenotype information is central to all aspects of the analysis. In the broad sense, it encompasses sample-level information beyond the experimental design (e.g., files of read counts, quality control metrics) to coordinate statistical analyses and data visualisation throughout downstream analyses. Defined in the earliest steps of an analytical workflow, it ensures traceability of sample data from files of raw data to every output and result.

Let us import experimental information:

pheno <- xlsx::read.xlsx("expdata/samples.xlsx", sheetName = "pheno")
colnames(pheno)

## [1] "File"      "Infection" "Status"    "Time"      "Lane"      "Plate"    
## [7] "Well"

To facilitate later use, let us retype some fields, reorder some factor levels, and add a set of additional fields:

Sample: contains a shortened unique identifier derived from the count file name for each sample,
Treatment: contains a factor that combines Infection and Status phenotypes
Group: contains a factor that combines all experimental phenotypes (i.e., Time, Infection and Status)

pheno$File <- as.character(pheno$File)
pheno$Infection <-
  factor(pheno$Infection, c("Mock","STM-LT2","STM-D23580","Blank"))
pheno$Status <-
  factor(pheno$Status, c("Uninfected","Exposed","Infected","Blank","Bulk"))
pheno$Lane <- as.factor(pheno$Lane)
pheno$Sample <- gsub("WTCHG_[[:digit:]]+_", "Cell_", pheno$File) # character
treatmentLevels <- c(
  "Mock_Uninfected",
  "STM-LT2_Exposed","STM-LT2_Infected",
  "STM-D23580_Exposed","STM-D23580_Infected",
  "Blank_Blank", "Mock_Bulk", "STM-LT2_Bulk", "STM-D23580_Bulk"
  )
pheno$Treatment <- with(pheno, factor(
  droplevels(interaction(Infection, Status, sep = "_")), treatmentLevels
))
# factor
pheno$Group <- with(pheno, factor(
  droplevels(interaction(pheno$Time, pheno$Treatment, sep = "_"))
))
# factor

QC metrics

Importantly, MultiQC was used to collate metrics for each sample at each step of preprocessing—from raw reads to assigned read counts—; those QC metrics provide key information to identify technical issues and sample outliers.

Let us import the QC metrics produced by MultiQC:

multiqc <- read.delim("MultiQC/final_pipeline_data/multiqc_general_stats.txt")
colnames(multiqc)

## [1] "Sample"                          "featureCounts_Assigned"         
## [3] "FastQC_percent_gc"               "Bowtie.2_overall_alignment_rate"
## [5] "featureCounts_percent_assigned"  "FastQC_total_sequences"         
## [7] "FastQC_percent_duplicates"       "FastQC_percent_fails"           
## [9] "FastQC_avg_sequence_length"

Let us shorten the sample identifier used by MultiQC to a minimal unique identifier (as done above for experimental phenotype information):

multiqc$Sample <- gsub("WTCHG_[[:digit:]]+_", "Cell_", multiqc$Sample)
rownames(multiqc) <- multiqc$Sample
colnames(multiqc)

## [1] "Sample"                          "featureCounts_Assigned"         
## [3] "FastQC_percent_gc"               "Bowtie.2_overall_alignment_rate"
## [5] "featureCounts_percent_assigned"  "FastQC_total_sequences"         
## [7] "FastQC_percent_duplicates"       "FastQC_percent_fails"           
## [9] "FastQC_avg_sequence_length"

Let us append those QC metrics to the phenotype information, that will later be attached to the read counts in a SingleCellExperiment (SingleCellExperiment):

pheno <- merge(pheno, multiqc, by = "Sample", sort = FALSE)

Counts

To import count data that is stored in a separate file for each sample, let us use the edgeR readDGE method and the phenotype information prepared above:

RG <- with(
  pheno,
  readDGE(File, "expdata/counts", c(1, 3), Group, Sample)
)
dim(RG)

## [1] 58084   384

Gene annotations

To ensure coherence between gene annotations and gene expression data, let us also import the records for genes and ERCC spike-ins molecules from the same GTF file that was supplied to featureCounts, as GRanges of the GenomicRanges package; in the next section, those annotations will be attached to the corresponding read counts in a SingleCellExperiment object (SingleCellExperiment).

First we import all the records from the composite GFF file that includes both endogenous gene features and ERCC spike-in molecules:

gtfData <- import.gff("expdata/genes_with_ERCC.gtf")
colnames(as.data.frame(gtfData))

##  [1] "seqnames"            "start"               "end"                
##  [4] "width"               "strand"              "source"             
##  [7] "type"                "score"               "phase"              
## [10] "gene_id"             "gene_version"        "gene_name"          
## [13] "gene_source"         "gene_biotype"        "havana_gene"        
## [16] "havana_gene_version" "transcript_id"

Finally, let us combine the two types of feature annotations, and reorder them to ensure that they match the order of features imported above:

names(gtfData) <- gtfData$gene_id
stopifnot(all(rownames(RG) %in% names(gtfData)))
gtfData <- gtfData[match(rownames(RG), names(gtfData))]
stopifnot(all(rownames(RG) == names(gtfData)))

In addition, let us also obtain the identifier of ERCC spike-in features, to annotate them as spike-in in the following section:

ERCCs <- grep("^ERCC-[[:digit:]]+$", rownames(RG), value = TRUE)

SingleCellExperiment

Let us assemble the count data, phenotype information, and feature information prepared above into SingleCellExperiment objects (SingleCellExperiment):

one object that contains all samples together
one for the blank samples
one for the bulk samples
one for the single cells

Note: This distinction will be important later, to compute and compared QC metrics within each sets of samples.

First, let us set the rownames of the phenotypes information data.frame to the unique sample identifier for consistency with the sample identifier associated with the count data imported above:

rownames(pheno) <- pheno$Sample

We may then create the four data sets, declaring in each of them the set of ERCC spike-in features:

All samples

pd.all <- DataFrame(pheno)
sce.all <- SingleCellExperiment(
  assays=list(counts=RG$counts), colData=pd.all, rowRanges=gtfData)
isSpike(sce.all, "ERCC") <- ERCCs
dim(sce.all)

## [1] 58084   384

Blanks

idx.blank <- pheno$Status == "Blank"
pd.blank <- DataFrame(droplevels(pheno[idx.blank,]))
sce.blank <- SingleCellExperiment(
  assays=list(counts=RG$counts[,idx.blank]),
  colData=pd.blank, rowRanges=gtfData)
isSpike(sce.blank, "ERCC") <- ERCCs
dim(sce.blank)

## [1] 58084     2

Bulks

idx.bulk<- pheno$Status == "Bulk"
pd.bulk <- DataFrame(droplevels(pheno[idx.bulk,]))
sce.bulk <- SingleCellExperiment(
  assays=list(counts=RG$counts[,idx.bulk]),
  colData=pd.bulk, rowRanges=gtfData)
isSpike(sce.bulk, "ERCC") <- ERCCs
dim(sce.bulk)

## [1] 58084     9

Single cells

idx.sc <- pheno$Status %in% c("Exposed", "Infected", "Uninfected")
pd.sc <- DataFrame(droplevels(pheno[idx.sc,]))
sce.sc <- SingleCellExperiment(
  assays=list(counts=RG$counts[,idx.sc]),
  colData=pd.sc, rowRanges=gtfData)
isSpike(sce.sc, "ERCC") <- ERCCs
dim(sce.sc)

## [1] 58084   373

Import of RNA-Sequencing read counts and associated metadata