Overview

Prior to this vignette, paired-end reads (2x 75 bp) were processed as follows:

  • Quality control using FastQC (version 0.11.4),
  • Alignment to reference genome using HISAT2 (version 2.0.3b),
  • Assigned to annotated features using featureCounts (subread version 1.5.0-p2).

Note: Adapter and quality trimming using Trim Galore! (version 0.4.1; using cutadapt version 1.10) was tested; however, the high quality of the raw data, and the marginal improvement of alignment rate (< 1%) did not justify additional processing time and disk space.

Composite genome (Ensembl)

Composite genome Fasta and GTF files were formed by concatenating the primary assembly of the Homo sapiens genome build GRCh38 (excluding haplotypes and patches) and the ERCC RNA Spike-In Mix:

cat $genomeFasta $erccFasta > $compositeFasta
cat $genomeGTF $erccGTF > $compositeGTF

The Fasta and GTF files for the human genome were obtained from the Ensembl FTP site release 86. Namely, the downloaded files are:

  • Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
  • Homo_sapiens.GRCh38.86.chr.gtf.gz

The ERCC Fasta and GTF files were obtained from ThermoFisher.

Genome index (HISAT2)

The composite genome was indexed as follows:

hisat2-build $compositeFasta $hisatIdx

Read quality control (FastQC)

Raw sequenced reads were quality-controlled using FastQC as follows:

fastqc \
  --outdir $fastqcFolder \
  --nogroup \
  $fastqFile

An example FastQC report is embedded below, and linked here for full-screen layout.

Alignment (HISAT2)

Raw (i.e., not trimmed; see above) read pairs were aligned as follows:

hisat2 \
    -x ${hisatIdx} \
    -1 ${fastqFolder}/${cell}_1.fastq.gz \
    -2 ${fastqFolder}/${cell}_2.fastq.gz \
    -S $hisatDir/${cell}.sam \
    1> $logDir/${cell}.o 2> $logDir/${cell}.e

The resulting SAM files were compressed to BAM format as follows:

samtools view -hb -o $hisatDir/${cell}.bam $hisatDir/${cell}.sam

Finally, SAM files were then removed as follows:

rm -f $hisatDir/${cell}.sam

Read assignment (featureCounts)

Uniquely aligned read pairs were assigned to annotated genomic feaures and ERCC RNA spike-in features, and counted in an unstranded manner as follows:

featureCounts \
    --primary \
    --ignoreDup \
    -p \
    -a $compositeGTF \
    -o $countDir/$cell \
    $bamDir/${cell}.bam \
    1> $logDir/$cell.out 2> $logDir/$cell.err

MultiQC

Reports of the various preprocessing steps above were collated using MultiQC. The overall report in embedded below, and linked here for full-screen layout.