Phenotype information is central to all aspects of the analysis. In the broad sense, it encompasses sample-level information beyond the experimental design (e.g., files of read counts, quality control metrics) to coordinate statistical analyses and data visualisation throughout downstream analyses. Defined in the earliest steps of an analytical workflow, it ensures traceability of sample data from files of raw data to every output and result.
Let us import experimental information:
pheno <- xlsx::read.xlsx("expdata/samples.xlsx", sheetName = "pheno")
colnames(pheno)
## [1] "File" "Infection" "Status" "Time" "Lane" "Plate"
## [7] "Well"
To facilitate later use, let us retype some fields, reorder some factor levels, and add a set of additional fields:
Sample: contains a shortened unique identifier derived from the count file name for each sample,Treatment: contains a factor that combines Infection and Status phenotypesGroup: contains a factor that combines all experimental phenotypes (i.e., Time, Infection and Status)pheno$File <- as.character(pheno$File)
pheno$Infection <-
factor(pheno$Infection, c("Mock","STM-LT2","STM-D23580","Blank"))
pheno$Status <-
factor(pheno$Status, c("Uninfected","Exposed","Infected","Blank","Bulk"))
pheno$Lane <- as.factor(pheno$Lane)
pheno$Sample <- gsub("WTCHG_[[:digit:]]+_", "Cell_", pheno$File) # character
treatmentLevels <- c(
"Mock_Uninfected",
"STM-LT2_Exposed","STM-LT2_Infected",
"STM-D23580_Exposed","STM-D23580_Infected",
"Blank_Blank", "Mock_Bulk", "STM-LT2_Bulk", "STM-D23580_Bulk"
)
pheno$Treatment <- with(pheno, factor(
droplevels(interaction(Infection, Status, sep = "_")), treatmentLevels
))
# factor
pheno$Group <- with(pheno, factor(
droplevels(interaction(pheno$Time, pheno$Treatment, sep = "_"))
))
# factor
Importantly, MultiQC was used to collate metrics for each sample at each step of preprocessing—from raw reads to assigned read counts—; those QC metrics provide key information to identify technical issues and sample outliers.
Let us import the QC metrics produced by MultiQC:
multiqc <- read.delim("MultiQC/final_pipeline_data/multiqc_general_stats.txt")
colnames(multiqc)
## [1] "Sample" "featureCounts_Assigned"
## [3] "FastQC_percent_gc" "Bowtie.2_overall_alignment_rate"
## [5] "featureCounts_percent_assigned" "FastQC_total_sequences"
## [7] "FastQC_percent_duplicates" "FastQC_percent_fails"
## [9] "FastQC_avg_sequence_length"
Let us shorten the sample identifier used by MultiQC to a minimal unique identifier (as done above for experimental phenotype information):
multiqc$Sample <- gsub("WTCHG_[[:digit:]]+_", "Cell_", multiqc$Sample)
rownames(multiqc) <- multiqc$Sample
colnames(multiqc)
## [1] "Sample" "featureCounts_Assigned"
## [3] "FastQC_percent_gc" "Bowtie.2_overall_alignment_rate"
## [5] "featureCounts_percent_assigned" "FastQC_total_sequences"
## [7] "FastQC_percent_duplicates" "FastQC_percent_fails"
## [9] "FastQC_avg_sequence_length"
Let us append those QC metrics to the phenotype information, that will later be attached to the read counts in a SingleCellExperiment (SingleCellExperiment):
pheno <- merge(pheno, multiqc, by = "Sample", sort = FALSE)
To import count data that is stored in a separate file for each sample, let us use the edgeR readDGE method and the phenotype information prepared above:
RG <- with(
pheno,
readDGE(File, "expdata/counts", c(1, 3), Group, Sample)
)
dim(RG)
## [1] 58084 384
To ensure coherence between gene annotations and gene expression data, let us also import the records for genes and ERCC spike-ins molecules from the same GTF file that was supplied to featureCounts, as GRanges of the GenomicRanges package; in the next section, those annotations will be attached to the corresponding read counts in a SingleCellExperiment object (SingleCellExperiment).
First we import all the records from the composite GFF file that includes both endogenous gene features and ERCC spike-in molecules:
gtfData <- import.gff("expdata/genes_with_ERCC.gtf")
colnames(as.data.frame(gtfData))
## [1] "seqnames" "start" "end"
## [4] "width" "strand" "source"
## [7] "type" "score" "phase"
## [10] "gene_id" "gene_version" "gene_name"
## [13] "gene_source" "gene_biotype" "havana_gene"
## [16] "havana_gene_version" "transcript_id"
Finally, let us combine the two types of feature annotations, and reorder them to ensure that they match the order of features imported above:
names(gtfData) <- gtfData$gene_id
stopifnot(all(rownames(RG) %in% names(gtfData)))
gtfData <- gtfData[match(rownames(RG), names(gtfData))]
stopifnot(all(rownames(RG) == names(gtfData)))
In addition, let us also obtain the identifier of ERCC spike-in features, to annotate them as spike-in in the following section:
ERCCs <- grep("^ERCC-[[:digit:]]+$", rownames(RG), value = TRUE)
Let us assemble the count data, phenotype information, and feature information prepared above into SingleCellExperiment objects (SingleCellExperiment):
Note: This distinction will be important later, to compute and compared QC metrics within each sets of samples.
First, let us set the rownames of the phenotypes information data.frame to the unique sample identifier for consistency with the sample identifier associated with the count data imported above:
rownames(pheno) <- pheno$Sample
We may then create the four data sets, declaring in each of them the set of ERCC spike-in features:
pd.all <- DataFrame(pheno)
sce.all <- SingleCellExperiment(
assays=list(counts=RG$counts), colData=pd.all, rowRanges=gtfData)
isSpike(sce.all, "ERCC") <- ERCCs
dim(sce.all)
## [1] 58084 384
idx.blank <- pheno$Status == "Blank"
pd.blank <- DataFrame(droplevels(pheno[idx.blank,]))
sce.blank <- SingleCellExperiment(
assays=list(counts=RG$counts[,idx.blank]),
colData=pd.blank, rowRanges=gtfData)
isSpike(sce.blank, "ERCC") <- ERCCs
dim(sce.blank)
## [1] 58084 2
idx.bulk<- pheno$Status == "Bulk"
pd.bulk <- DataFrame(droplevels(pheno[idx.bulk,]))
sce.bulk <- SingleCellExperiment(
assays=list(counts=RG$counts[,idx.bulk]),
colData=pd.bulk, rowRanges=gtfData)
isSpike(sce.bulk, "ERCC") <- ERCCs
dim(sce.bulk)
## [1] 58084 9
idx.sc <- pheno$Status %in% c("Exposed", "Infected", "Uninfected")
pd.sc <- DataFrame(droplevels(pheno[idx.sc,]))
sce.sc <- SingleCellExperiment(
assays=list(counts=RG$counts[,idx.sc]),
colData=pd.sc, rowRanges=gtfData)
isSpike(sce.sc, "ERCC") <- ERCCs
dim(sce.sc)
## [1] 58084 373