Pseudoalignment

and also regular old alignment

Bastian Schiffthaler, Nicolas Delhomme

Pseudoalignment vs. Traditional Alignment

  • Traditional aligners keep base-to-base mappings
    • STAR
    • HISAT2
  • Pseudoaligners find the most likely matches between two sets of sequences: query and reference
    • Kallisto
    • Salmon

Traditional mapping

  • Splice aware: align cDNA to genome index
  • Contiguous only: align DNA to genome, or cDNA to transcriptome index
  • Gapped: alignments can have gaps

Alignment considerations

  • Most aligners use fixed-length seeds to initiate a possible alignment
  • These seeds must align exactly
  • Seed length is a tradeoff between sensitivity (shorter seeds) and specificity (longer seeds)

Aligner       Length    Gapped   Splice-aware   3rd gen support?
BBMap         Any       Yes      Yes            Yes
Bowtie        <50       No       No             No
Bowtie2       ?         Yes*     No             No
BWA           Any       Yes      No             Partial
Minimap2      Any       Yes      No             Yes
GMAP/GSNAP    <300**    Yes      Yes            Yes
HISAT2***     Yes       Yes      Yes            No
STAR          Yes       Yes      Yes            No

*Not "true" gapped alignment

**GSNAP (value can be changed during compilation)

***Can use SNP info. Optimized for human, but can be adapted
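
The exact-match requirement for fixed-length seeds can be illustrated with a toy search. This is only a sketch under simple assumptions: `find_seed_hits` is a hypothetical helper, it scans the reference linearly instead of using an index, and it is not the algorithm of any aligner in the table.

```shell
# find_seed_hits REFERENCE READ SEED_LENGTH
# Prints every 1-based reference position where the first SEED_LENGTH
# bases of the read match exactly (hypothetical helper, linear scan).
find_seed_hits() {
  awk -v ref="$1" -v rd="$2" -v k="$3" 'BEGIN {
    seed = substr(rd, 1, k)                 # the seed is the read prefix
    for (i = 1; i + k - 1 <= length(ref); i++)
      if (substr(ref, i, k) == seed) print i
  }'
}

# Short seed: more candidate positions. Longer seed: fewer, more specific.
find_seed_hits ACGTACGTTTACGA ACGATT 3
find_seed_hits ACGTACGTTTACGA ACGATT 4
```

With this toy reference, the 3-base seed hits positions 1, 5 and 11, while the 4-base seed hits only position 11. A read with a mismatch inside the seed (for example ATGATT) produces no hit at all, which is the sensitivity cost of requiring exact seed matches.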

Traditional alignment: seeds (maximum mappable prefix)

Aligner benchmarks

https://www.ecseq.com/support/ngs/best-RNA-seq-aligner-comparison-of-mapping-tools

* The time shown includes the index loading step, which dominates for some tools. This step becomes less influential, or even negligible, when mapping real-life datasets (>10 million reads).

** By default, BBMap takes as much memory as the system provides. The minimum requirement for the genome used is 24 GB.

Settings?

  • Very rarely need to modify defaults
  • Seed length: short sequences
  • Perfect alignments only
  • STAR: various parameters related to memory consumption

The SAM format

Col   Name    Description
1     QNAME   Query template name
2     FLAG    Bitwise flag
3     RNAME   Reference sequence name
4     POS     1-based leftmost mapping position
5     MAPQ    Mapping quality
6     CIGAR   CIGAR string
7     RNEXT   Reference name of the mate
8     PNEXT   Position of the mate
9     TLEN    Observed template length
10    SEQ     Segment sequence
11    QUAL    Sequence PHRED quality
12+           Additional data: TAG:TYPE:VALUE

A SAM record

samtools view <Alignment SAM/BAM/CRAM>
FCC1L3GACXX:1:1308:5586:93026#  
99      
Potra000013     
27834   
254     
100M
=       
27953   
219     
CCCCGTTAGTACCATTTGAGTTCTCAACAGCCTGCTCCTGCTCCAATTTTCTCTTCTCCTTTTTCTTCTTCTTCTCTGATTTAGCATCCTCTGAAGCACC    
@@CFFDDFHDHFHGHHIIGIIIEGHIHGGIGII@HEHIIIGGII9?FGHIIIGGIGIIIGGGIIIIIIIIIIIIIHICHFEHEHFFFFFCEECCEEDDDD    
NH:i:1  
HI:i:1  
AS:i:196        
nM:i:1  
MD:Z:100        
NM:i:0
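
The FLAG value 99 in this record is a sum of bits. The sketch below decodes it; the bit meanings follow the SAM specification, while `decode_sam_flag` and the short bit names are made up for this illustration.

```shell
# decode_sam_flag FLAG
# Prints one name per bit that is set in FLAG. The function body runs in
# a subshell so its variables do not leak into the calling shell.
decode_sam_flag() (
  flag=$1
  for entry in \
    1:paired 2:proper_pair 4:unmapped 8:mate_unmapped \
    16:reverse 32:mate_reverse 64:first_in_pair 128:second_in_pair \
    256:secondary 512:qc_fail 1024:duplicate 2048:supplementary
  do
    bit=${entry%%:*}      # number before the colon
    name=${entry#*:}      # label after the colon
    if [ $(( flag & bit )) -ne 0 ]; then
      echo "$name"
    fi
  done
)

decode_sam_flag 99
```

99 = 1 + 2 + 32 + 64, so the record is a paired read in a proper pair, its mate maps to the reverse strand, and it is the first read of the pair. The other columns are consistent with this: assuming the mate also aligns over 100 bases, the template spans from POS 27834 to 27953 + 100, that is 219 bases, which matches TLEN. If samtools is installed, `samtools flags 99` should give an equivalent decoding.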

Pseudomapping

  • Extremely fast algorithms based on k-mers
  • Speed enables probabilistic estimation of confidence intervals (bootstrapping)
  • Salmon
  • Kallisto
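
The k-mer idea can be sketched as follows: a read is compatible with a transcript when every k-mer of the read occurs in that transcript, and the set of compatible transcripts forms the read's equivalence class. This is a toy illustration only. `compatible_transcripts` is a hypothetical helper that scans each transcript directly, whereas Kallisto and Salmon use precomputed indexes and further heuristics.

```shell
# compatible_transcripts K READ < transcripts.txt
# Input lines: "<name> <sequence>". Prints the names of all transcripts
# that contain every k-mer of READ (hypothetical helper, no index).
compatible_transcripts() {
  awk -v k="$1" -v q="$2" '
    {
      n = length(q) - k + 1            # number of k-mers in the read
      if (n < 1) next                  # read shorter than k: no evidence
      ok = 1
      for (i = 1; i <= n; i++)
        if (index($2, substr(q, i, k)) == 0) { ok = 0; break }
      if (ok) print $1
    }'
}

printf 't1 ACGTACGTGG\nt2 ACGTACGTTT\nt3 TTTTGGGGCC\n' |
  compatible_transcripts 5 ACGTACG
```

The read ACGTACG is compatible with t1 and t2, so it cannot distinguish them. The read TACGTGG would be compatible with t1 only. No base-to-base alignment is computed in either case.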

Kallisto

Salmon

Salmon bias corrections

https://mikelove.wordpress.com/2016/09/26/rna-seq-fragment-sequence-bias/

Salmon selective alignment (old PoC)

Select most likely transcript among a set of candidates

https://www.biorxiv.org/content/10.1101/138800v2

Salmon decoy-aware transcriptomes

Experimental datasets are generally more complex and include reads that originate from segments that are not part of the annotated transcripts.
  • Requires availability of genome sequence
  • Avoids spurious mappings of genomic sequences with high similarity to transcripts
    • Introns
    • Intergenic sequences
    • Unannotated transcripts

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02151-8#Sec21
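
A decoy-aware index needs two inputs: the list of decoy sequence names and a combined FASTA with the transcripts followed by the genome. The sketch below prepares both. It assumes uncompressed FASTA files, and `prepare_decoys` and all file names are hypothetical. The `salmon index` call in the trailing comment follows the Salmon documentation as I understand it and should be checked against `salmon index --help`.

```shell
# prepare_decoys GENOME_FASTA TRANSCRIPT_FASTA OUTDIR
# Writes OUTDIR/decoys.txt and OUTDIR/gentrome.fa (hypothetical helper).
prepare_decoys() {
  mkdir -p "$3" || return 1
  # Decoy names: genome FASTA headers without ">" and without description.
  grep '^>' "$1" | cut -d ' ' -f 1 | sed 's/^>//' > "$3/decoys.txt" || return 1
  # Transcripts must come first, the decoy (genome) sequences last.
  cat "$2" "$1" > "$3/gentrome.fa"
}

# Afterwards, with Salmon installed (not run here):
#   salmon index -t out/gentrome.fa -d out/decoys.txt -i out/salmon_index
```

Using the whole genome as decoy is the most thorough option, at the price of a larger index and more memory during indexing.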

Parameters?

  • Bias correction: switch all on, unless you suspect non-traditional sequences in your organism
  • K: keep the default, unless you map to a related species; a smaller k then trades specificity for sensitivity

Bootstrapping/Gibbs sampling and Estimating Confidence

Salmon re-samples counts in equivalence classes to estimate uncertainty in abundance estimation
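
The re-sampling step can be sketched as drawing one bootstrap replicate from a table of equivalence class counts. This is a toy version under simple assumptions: `bootstrap_counts` and the class names are hypothetical, and the sketch stops after the draw, whereas the abundance estimation would then be repeated on each replicate. In Salmon itself, the number of replicates is requested with `--numBootstraps` or `--numGibbsSamples` on `salmon quant`.

```shell
# bootstrap_counts SEED < counts.txt
# Input lines: "<class> <count>". Draws as many reads as the input holds,
# with replacement and with probability proportional to each class count,
# then prints the re-sampled count per class (hypothetical helper).
bootstrap_counts() {
  awk -v seed="$1" '
    { name[NR] = $1; count[NR] = $2; total += $2 }
    END {
      srand(seed)
      for (d = 0; d < total; d++) {
        r = rand() * total             # uniform in [0, total)
        acc = 0
        for (i = 1; i <= NR; i++) {
          acc += count[i]
          if (r < acc) { draw[i]++; break }
        }
      }
      for (i = 1; i <= NR; i++) print name[i], draw[i] + 0
    }'
}

printf 'ec1 60\nec2 30\nec3 0\nec4 10\n' | bootstrap_counts 42
```

Each replicate keeps the total number of reads but shifts counts between classes. The spread of the abundance estimates across many replicates is what quantifies the uncertainty.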

Applications

Terminus can collapse transcripts that have too much uncertainty in the abundance estimation into a group for which the abundance can be estimated accurately.

The group is analyzed as a unit.

Applications

Fishpond uses uncertainty estimates for differential transcript and gene expression.

When to choose what

  • Traditional (STAR)
    • Novel gene discovery
    • RNA-Seq variant discovery
    • Cancer (fusion transcripts) -> STAR-Fusion

  • Pseudo (Salmon)
    • Quantification of known transcripts
    • High speed, high accuracy
    • No interest in variants
    • No interest in discovering novel genes

Practical

  • Build a (non-decoy-aware) salmon index.
  • Quantify one library

Some hints:

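#!/usr/bin/env bash
# Show the available options for indexing and for read quantification
salmon index --help
salmon quant --help-reads
# Transcript sequences to index, and the directory with the Trimmomatic output
TRANSCRIPTS=~/raw_data/reference/Pabies1.0-all.phase.gff3.CDS.fa
SEQDATADIR=~/raw_data/trimmomatic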
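
One possible shape of the two steps is sketched below as small wrapper functions. Everything here is an assumption to be checked against the help output: the wrappers `build_index` and `quantify_library` are hypothetical, the library is assumed to be paired-end, `-l A` asks Salmon to detect the library type automatically, and the read file names in the example calls are invented.

```shell
# build_index TRANSCRIPT_FASTA INDEX_DIR
# Builds a plain (non-decoy-aware) Salmon index.
build_index() {
  salmon index -t "$1" -i "$2"
}

# quantify_library INDEX_DIR READS_1 READS_2 OUTPUT_DIR
# Quantifies one paired-end library against the index.
quantify_library() {
  salmon quant -i "$1" -l A -1 "$2" -2 "$3" -o "$4"
}

# Example calls (the read file names are hypothetical):
#   build_index "$TRANSCRIPTS" salmon_index
#   quantify_library salmon_index \
#     "$SEQDATADIR/sample_1.fq.gz" "$SEQDATADIR/sample_2.fq.gz" sample_quant
```

Wrapping the calls in functions keeps the index and the quantification step separate, so the index is built once and reused for every library.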
							
						

Tutorial!