Prior to this vignette, paired-end reads (2x 75 bp) were processed as follows:
Note: Adapter and quality trimming using Trim Galore! (version 0.4.1; using cutadapt version 1.10) was tested; however, the high quality of the raw data, and the marginal improvement of alignment rate (< 1%) did not justify additional processing time and disk space.
Composite genome Fasta and GTF files were formed by concatenating the primary assembly of the Homo sapiens genome build GRCh38 (excluding haplotypes and patches) and the ERCC RNA Spike-In Mix:
cat $genomeFasta $erccFasta > $compositeFasta
cat $genomeGTF $erccGTF > $compositeGTF
The Fasta and GTF files for the human genome were obtained from the Ensembl FTP site release 86. Namely, the downloaded files are:
The ERCC Fasta and GTF files were obtained from ThermoFisher.
The composite genome was indexed as follows:
hisat2-build $compositeFasta $hisatIdx
Raw sequenced reads were quality-controlled using FastQC as follows:
fastqc \
--outdir $fastqcFolder \
--nogroup \
$fastqFile
An example FastQC report is embedded below, and linked here for full-screen layout.
Raw (i.e., not trimmed; see above) read pairs were aligned as follows:
hisat2 \
-x ${hisatIdx} \
-1 ${fastqFolder}/${cell}_1.fastq.gz \
-2 ${fastqFolder}/${cell}_2.fastq.gz \
-S $hisatDir/${cell}.sam \
1> $logDir/${cell}.o 2> $logDir/${cell}.e
The resulting SAM files were compressed to BAM format as follows:
samtools view -hb -o $hisatDir/${cell}.bam $hisatDir/${cell}.sam
Finally, SAM files were then removed as follows:
rm -f $hisatDir/${cell}.sam
Uniquely aligned read pairs were assigned to annotated genomic feaures and ERCC RNA spike-in features, and counted in an unstranded manner as follows:
featureCounts \
--primary \
--ignoreDup \
-p \
-a $compositeGTF \
-o $countDir/$cell \
$bamDir/${cell}.bam \
1> $logDir/$cell.out 2> $logDir/$cell.err