Pseudoalignment

and also regular old alignment

Bastian Schiffthaler, Nicolas Delhomme

Pseudoalignment vs. Traditional Alignment

Traditional aligners keep base-to-base mappings

STAR
HISAT2

Pseudoaligners find the most likely matches between two sets of sequences: query and reference

Kallisto
Salmon

Traditional mapping

Splice aware: align cDNA to genome index
Contiguous only: align DNA to genome, or cDNA to transcriptome index
Gapped: alignments can have gaps

Alignment considerations

Most aligners use fixed-length seeds to initiate a possible alignment
These seeds must align exactly
Tradeoff between sensitivity and specificity
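
The seed idea can be sketched in a few lines of shell. This is a toy illustration only, not how any of the aligners listed here is implemented: the sequences, the seed length of 8, and the function name `seed_hits` are all made up for the example.

```shell
# Toy seed lookup: take the first seed_len bases of a read and test whether
# they occur exactly in the reference. All values here are illustrative.
seed_hits() {
    ref=$1; read_seq=$2; seed_len=$3
    seed=$(printf '%s' "$read_seq" | cut -c1-"$seed_len")
    case $ref in
        *"$seed"*) echo "seed $seed: exact hit, extend alignment" ;;
        *)         echo "seed $seed: no hit, read not placed" ;;
    esac
}

REF=ACGTTGCAAGGCTTACCGATAGGCT
seed_hits "$REF" GGCTTACCGTTAGG 8    # mismatch lies after the seed
seed_hits "$REF" GGCTAACCGATAGG 8    # mismatch lies inside the seed
```

Rerunning the second call with a seed length of 4 gives a hit again: shorter seeds are more sensitive but match more places by chance, which is the tradeoff named above.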

Aligner	Read length	Gapped	Splice-aware	Long-read (3rd gen) support?
BBMap	Any	Yes	Yes	Yes
Bowtie	<50	No	No	No
Bowtie2	?	Yes*	No	No
BWA	Any	Yes	No	Partial
Minimap2	Any	Yes	Yes	Yes
GMAP/GSNAP	<300**	Yes	Yes	Yes
HISAT2***	Yes	Yes	Yes	No
STAR	Yes	Yes	Yes	No

*Not "true" gapped alignment

**GSNAP only (the limit can be changed at compile time)

***Can use SNP info. Optimized for human, but can be adapted

Traditional alignment: seeds (maximum mappable prefix)

Aligner benchmarks

https://www.ecseq.com/support/ngs/best-RNA-seq-aligner-comparison-of-mapping-tools

* The time shown includes the index loading step, which dominates for some tools but becomes less influential (or even negligible) when mapping real-life datasets (>10 million reads).

**By default, BBMap takes as much memory as the system provides. The minimum requirement for the genome used is 24 GB.

Settings?

Very rarely need to modify defaults
Seed length: short sequences
Perfect alignments only
STAR: various parameters related to memory consumption

The SAM format

Col	Name	Description
1	QNAME	Query template name
2	FLAG	Bitwise flag
3	RNAME	Reference sequence name
4	POS	1-based leftmost mapping position
5	MAPQ	Mapping quality
6	CIGAR	CIGAR string
7	RNEXT	Reference name of mate
8	PNEXT	Position of the mate
9	TLEN	Observed template length
10	SEQ	Segment sequence
11	QUAL	Sequence PHRED quality
12+		Additional data: TAG:TYPE:VALUE
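
Because the eleven mandatory columns are tab-separated and always in this order, they can be addressed by position. The sketch below uses a made-up, minimal record (not real data) and a hypothetical helper named `sam_field`.

```shell
# A made-up, minimal SAM record (tab-separated); not taken from real data.
record=$(printf 'read1\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII\tNM:i:0')

# Columns 1-11 are mandatory, so they can be picked out by column number.
sam_field() {
    printf '%s\n' "$1" | cut -f "$2"
}

echo "RNAME: $(sam_field "$record" 3)"
echo "POS:   $(sam_field "$record" 4)"
echo "CIGAR: $(sam_field "$record" 6)"
echo "Tags:  $(sam_field "$record" 12-)"
```

On real files the same column numbers apply to each line printed by samtools view.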

A SAM record

samtools view <Alignment SAM/BAM/CRAM>

FCC1L3GACXX:1:1308:5586:93026#  
99      
Potra000013     
27834   
254     
100M
=       
27953   
219     
CCCCGTTAGTACCATTTGAGTTCTCAACAGCCTGCTCCTGCTCCAATTTTCTCTTCTCCTTTTTCTTCTTCTTCTCTGATTTAGCATCCTCTGAAGCACC    
@@CFFDDFHDHFHGHHIIGIIIEGHIHGGIGII@HEHIIIGGII9?FGHIIIGGIGIIIGGGIIIIIIIIIIIIIHICHFEHEHFFFFFCEECCEEDDDD    
NH:i:1  
HI:i:1  
AS:i:196        
nM:i:1  
MD:Z:100        
NM:i:0
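
The FLAG column (99 in the record above) packs several yes/no properties into one integer. A minimal sketch of decoding it by testing each bit follows; the function name `decode_flag` and the short labels are made up for the example, while the bit values follow the SAM specification.

```shell
# Decode a SAM FLAG by testing each bit (bit values as in the SAM specification).
decode_flag() {
    flag=$1
    for entry in 1:paired 2:proper_pair 4:unmapped 8:mate_unmapped \
                 16:reverse_strand 32:mate_reverse_strand \
                 64:first_in_pair 128:second_in_pair \
                 256:secondary 512:qc_fail 1024:duplicate 2048:supplementary
    do
        bit=${entry%%:*}     # number before the colon
        name=${entry#*:}     # label after the colon
        if [ $((flag & bit)) -ne 0 ]; then
            echo "$name"
        fi
    done
}

decode_flag 99
```

Since 99 = 64 + 32 + 2 + 1, the record is the first read of a properly paired fragment whose mate maps to the reverse strand. The TLEN of 219 is also consistent with the positions shown: 27953 + 100 - 27834 = 219, assuming the mate covers 100 reference bases (its CIGAR is not part of this record).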

Pseudomapping

Extremely fast algorithms based on k-mers
Speed enables probabilistic estimation of confidence intervals (bootstrapping)
Salmon
Kallisto
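
To make the k-mer idea concrete, here is a toy sketch, not the actual algorithm of Salmon or Kallisto: it splits a read into k-mers and lists which of two invented transcripts contain each one. The sequences, the value k = 4, and the helper name `kmers` are assumptions for illustration; both tools default to a much larger k (31).

```shell
# Print every k-mer of a sequence, one per line (toy example).
kmers() {
    printf '%s\n' "$1" |
        awk -v k="$2" '{ for (i = 1; i + k - 1 <= length($0); i++) print substr($0, i, k) }'
}

# Invented two-transcript "reference" and a single read.
T1=ACGTACGGA
T2=TTGTACGGC
READ=TACGGA

for kmer in $(kmers "$READ" 4); do
    hits=""
    case $T1 in *"$kmer"*) hits="$hits T1" ;; esac
    case $T2 in *"$kmer"*) hits="$hits T2" ;; esac
    echo "$kmer ->$hits"
done
```

Only T1 contains all k-mers of the read, so T1 is the only transcript compatible with it. No base-to-base alignment was computed to reach that conclusion.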

Kallisto

Salmon

Salmon bias corrections

https://mikelove.wordpress.com/2016/09/26/rna-seq-fragment-sequence-bias/

Salmon selective alignment (old PoC)

Select most likely transcript among a set of candidates

https://www.biorxiv.org/content/10.1101/138800v2

Salmon decoy-aware transcriptomes

Experimental datasets are generally more complex and include reads that originate from segments that are not part of the annotated transcripts.

Requires availability of genome sequence
Avoids spurious mappings of genomic sequences with high similarity to transcripts

Introns
Intergenic sequences
Unannotated transcripts

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02151-8#Sec21
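
One common way to prepare a decoy-aware index is to list the genome sequence names as decoys and to append the genome to the transcriptome. The sketch below does this on tiny invented FASTA files so that it runs anywhere. All file names are placeholders, and the final salmon call is shown as a comment rather than executed.

```shell
# Toy inputs so the sketch runs anywhere; replace with real FASTA files.
workdir=$(mktemp -d)
printf '>tx1 gene=a\nACGTACGT\n>tx2 gene=b\nTTGGCCAA\n' > "$workdir/transcripts.fa"
printf '>chr1 toy chromosome\nACGTACGTTTGGCCAAGGGG\n>chr2\nCCCCAAAA\n' > "$workdir/genome.fa"

# 1. Decoy list: names of all genome sequences (header text up to the first space).
grep '^>' "$workdir/genome.fa" | cut -d ' ' -f 1 | sed 's/^>//' > "$workdir/decoys.txt"

# 2. Combined FASTA: transcripts first, genome (the decoys) last.
cat "$workdir/transcripts.fa" "$workdir/genome.fa" > "$workdir/gentrome.fa"

cat "$workdir/decoys.txt"

# 3. The index would then be built along these lines (not run here):
# salmon index -t gentrome.fa -d decoys.txt -i salmon_index
```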

Parameters?

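Bias correction: enable all models (--seqBias, --gcBias, --posBias), unless you suspect non-traditional sequences in your organism
k-mer size (-k): keep the default (31), unless you map to a related species; a smaller k trades specificity for sensitivity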
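
As a rough sketch of how these choices appear on the command line, the snippet below assembles a salmon quant call with the three bias models switched on and prints it instead of running it. The index name, read files, and output directory are hypothetical placeholders.

```shell
# Assemble a `salmon quant` call with all three bias models switched on.
# Paths are hypothetical placeholders; the command is printed, not run.
INDEX=salmon_index
R1=sample_1.fq.gz
R2=sample_2.fq.gz
OUT=quant/sample

cmd="salmon quant -i $INDEX -l A -1 $R1 -2 $R2 --seqBias --gcBias --posBias -o $OUT"
echo "$cmd"

# k is fixed when the index is built, e.g. a smaller (odd) k for a related species:
# salmon index -t transcripts.fa -i salmon_index_k23 -k 23
```

Here -l A asks Salmon to detect the library type automatically. Changing k means building a new index, so it cannot be adjusted at quantification time.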

Bootstrapping/Gibbs sampling and Estimating Confidence

Salmon re-samples counts in equivalence classes to estimate uncertainty in abundance estimation
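
The resampling idea can be illustrated with a toy bootstrap, which is a simplification and not Salmon's actual implementation: reads are redrawn with replacement across equivalence classes in proportion to the observed counts. The counts (50, 30, 20), the seeds, and the function name `bootstrap_counts` are invented for the example.

```shell
# Toy bootstrap: resample reads with replacement across equivalence classes.
# The class counts and the seeds are invented for illustration.
bootstrap_counts() {
    # $1 = random seed, remaining arguments = observed read counts per class
    seed=$1; shift
    echo "$@" | awk -v seed="$seed" '{
        srand(seed)
        total = 0
        for (i = 1; i <= NF; i++) { obs[i] = $i; total += $i }
        for (r = 1; r <= total; r++) {
            # draw one read; class i is chosen with probability obs[i] / total
            x = rand() * total
            acc = 0
            for (i = 1; i <= NF; i++) {
                acc += obs[i]
                if (x < acc) { draw[i]++; break }
            }
        }
        for (i = 1; i <= NF; i++) printf "%d%s", draw[i], (i < NF ? " " : "\n")
    }'
}

for seed in 1 2 3; do
    bootstrap_counts "$seed" 50 30 20
done
```

Each output line is one bootstrap replicate. The spread of a class's count across replicates gives a feel for how uncertain its abundance is. In Salmon itself, the number of replicates is requested with --numBootstraps or --numGibbsSamples.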

Applications

Terminus can collapse transcripts that have too much uncertainty in the abundance estimation into a group for which the abundance can be estimated accurately.

The group is analyzed as a unit.

Applications

Fishpond uses uncertainty estimates for differential transcript and gene expression.

When to choose what

Traditional (STAR)

Novel gene discovery
RNA-Seq variant discovery
Cancer (fusion transcripts) -> STAR-Fusion

Pseudo (Salmon)

Quantification of known transcripts
High speed, high accuracy
No interest in variants
No interest in discovering novel genes

Practical

Build a (non-decoy-aware) salmon index.
Quantify one library

Some hints:

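#!/usr/bin/env bash
# Show the options for building an index and for quantifying reads
salmon index --help
salmon quant --help-reads
# Transcript sequences to index, and the directory holding the trimmed reads
TRANSCRIPTS=~/raw_data/reference/Pabies1.0-all.phase.gff3.CDS.fa
SEQDATADIR=~/raw_data/trimmomatic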

Tutorial!