Transcriptome Assembly

Bastian Schiffthaler, Nicolas Delhomme

And a lot of content courtesy of Matt MacManes (@macmanes)

Overview

Study Design
Sample Collection and Preservation
Library Generation
[Sequencing and Raw Data QC]
Assembly

Why assemble a transcriptome

You don't have a reference
The current reference is hot garbage
Your specific genes are missing
You trust no one and want to QC an existing reference

Goal for today

You can plan, or assist with, the design and computational aspects of a transcriptome assembly study.

You know the options and tools available for short-read assembly.

The burden of choice

Which Platform

Platform	Throughput	Accuracy	Read length	Molecule	Cost
Illumina	High	High	Short	cDNA	Low (excl. machine)
PacBio	Med	Med (high?)	Long	cDNA	High
Nanopore	Low	Med	Longer	RNA/cDNA	Low(ish)

Which Platform

Platform	Ideal Use
Illumina	Quantification/DE
PacBio	Assembly
Nanopore	Assembly/nt modification

Replication

Technical replicates aren't essential. Technical variation is well modeled by the Poisson distribution.
Biological replicates are essential (esp. for DE)
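
A rough sketch of the reasoning behind these two points. The Poisson part follows the slide; the quadratic dispersion term for biological replicates is an assumption borrowed from the negative binomial model that common DE tools use, not something stated here.

```latex
% Technical replicates: read count X for a gene with expected count \mu
X_{\text{tech}} \sim \mathrm{Poisson}(\mu), \qquad
\mathrm{Var}(X_{\text{tech}}) = \mu, \qquad
\mathrm{CV}_{\text{tech}} = \frac{1}{\sqrt{\mu}}

% Biological replicates (assumed negative binomial, dispersion \phi > 0)
\mathrm{Var}(X_{\text{bio}}) = \mu + \phi\,\mu^{2}, \qquad
\mathrm{CV}_{\text{bio}} = \sqrt{\frac{1}{\mu} + \phi}
```

With \mu = 100 the technical CV is already down to 10% and keeps shrinking with depth. The \phi term does not shrink with depth, so it can only be estimated from more biological replicates.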

“... at least six biological replicates should be used, rising to at least 12 when it is important to identify SDE genes for all fold changes.”

Marioni, John C., et al. "RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays." Genome Research 18.9 (2008): 1509-1517.

Schurch, Nicholas J., et al. "How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use?" RNA 22.6 (2016): 839-851.

Replication

For assembly, use one individual or one per treatment!

Sample Collection and Preservation

Transcription changes quickly and may not stop at death
RNA degrades (fast)
The details are taxon-specific, so talk with an expert

Library QC

Gel electrophoresis or BioAnalyzer

Preprocessing: Diginorm

Why bother to do this?

More depth is not always better. There is a "sweet spot"
Assembly is a hungry hungry hippo (in terms of compute resources). Fewer (but good) reads mean faster assemblies.

Haas, Brian J., et al. "De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis." Nature Protocols 8.8 (2013): 1494.

Trimming

Quality trimming is beneficial to correctness
BUT: Trimming is detrimental to completeness
Trim conservatively!
Adapter trimming is mandatory

Error Correction

Assembly

Many Assemblers

Trinity
SPAdes
Shannon
Trans-ABySS, Oases, SOAPdenovo-Trans, Bridger, BinPacker, IDBA-Tran

Which is best?

Which assembler is best

Assembly	Sum	Missing	Unique
All	14674 ± 3590
Spades55		-1739 ± 758	570 ± 266
Spades75		-2711 ± 2047	301 ± 195
Shannon		-4375 ± 3508	302 ± 241
Trinity		-1952 ± 803	520 ± 301

Practical!

We will assemble a set of spruce mini transcriptomes (they will be bad).

Your data is in ~/raw_data/simulated. Pick a library (pair). Your pipeline will be:

trimmomatic - Adapter and very mild quality trimming
rcorrector - Additional read error correction (k-mer based)
Trinity - Assembly (Illumina only)
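
The three steps above could be chained roughly as follows. This is a dry-run sketch that only prints the commands, so you can check them before running anything. The sample prefix, adapter file (TruSeq3-PE.fa), thread count, memory limit and the Q5/25 nt trimming thresholds are all placeholder assumptions, as are the output file names, which follow Rcorrector's usual `.cor.fq.gz` pattern but should be checked against what actually lands on disk. It also assumes a `trimmomatic` wrapper is on your PATH.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the pipeline commands instead of executing them.
# All names and thresholds below are placeholder assumptions, not course settings.
set -e

SAMPLE="${SAMPLE:-sample}"                  # hypothetical library prefix
RAW="${RAW:-$HOME/raw_data/simulated}"      # data directory named on the slide
ADAPTERS="${ADAPTERS:-TruSeq3-PE.fa}"       # adapter FASTA shipped with Trimmomatic (assumed kit)
THREADS="${THREADS:-4}"

trim_cmd() {
  # Adapter clipping plus very mild quality trimming (Q5 window), drop reads < 25 nt
  echo trimmomatic PE -threads "$THREADS" \
    "$RAW/${SAMPLE}_1.fq.gz" "$RAW/${SAMPLE}_2.fq.gz" \
    "${SAMPLE}_trim_1.fq.gz" "${SAMPLE}_unpaired_1.fq.gz" \
    "${SAMPLE}_trim_2.fq.gz" "${SAMPLE}_unpaired_2.fq.gz" \
    "ILLUMINACLIP:${ADAPTERS}:2:30:10" SLIDINGWINDOW:4:5 MINLEN:25
}

correct_cmd() {
  # Error correction on the trimmed pairs; results go to the directory given by -od
  echo perl "$HOME/run_rcorrector.pl" -t "$THREADS" \
    -1 "${SAMPLE}_trim_1.fq.gz" -2 "${SAMPLE}_trim_2.fq.gz" -od corrected
}

assemble_cmd() {
  # De novo assembly; Trinity wants "trinity" in the output directory name
  echo Trinity --seqType fq --CPU "$THREADS" --max_memory 10G \
    --left "corrected/${SAMPLE}_trim_1.cor.fq.gz" \
    --right "corrected/${SAMPLE}_trim_2.cor.fq.gz" \
    --output "trinity_${SAMPLE}"
}

trim_cmd
correct_cmd
assemble_cmd
```

Each step reads the previous step's paired output, so the unpaired Trimmomatic files are simply left behind. To execute for real, remove the `echo` in each function once the printed commands look right.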

Markdown follow-along

To run rcorrector

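#!/usr/bin/env bash
# Fetch the Rcorrector wrapper script into your home directory
cd ~
wget https://raw.githubusercontent.com/mourisl/Rcorrector/master/run_rcorrector.pl
# The wrapper looks for the rcorrector binary next to itself, so link it here
ln -s "$(which rcorrector)" ~/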
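
Once the wrapper and the link are in place, a quick check like the one below can save some head-scratching. It is only a sketch: `check_setup` is a hypothetical helper, and the read file names and output directory are placeholders for whichever library you picked.

```shell
#!/usr/bin/env bash
# Check that run_rcorrector.pl and the rcorrector binary sit side by side,
# then print the call for one paired library.
# R1, R2 and OUT are hypothetical placeholders; swap in your own files.
WRAPPER_DIR="${WRAPPER_DIR:-$HOME}"
R1="${R1:-sample_1.fq.gz}"
R2="${R2:-sample_2.fq.gz}"
OUT="${OUT:-rcorrector_out}"

check_setup() {
  # Succeeds only if both the wrapper and the binary (or its symlink) exist in $1
  dir="$1"
  [ -f "$dir/run_rcorrector.pl" ] && [ -e "$dir/rcorrector" ]
}

if check_setup "$WRAPPER_DIR"; then
  echo "Setup looks fine. To correct one paired library, run:"
  echo "perl $WRAPPER_DIR/run_rcorrector.pl -1 $R1 -2 $R2 -od $OUT -t 4"
else
  echo "Missing run_rcorrector.pl or rcorrector in $WRAPPER_DIR" >&2
fi
```

The `-1`/`-2` options take the two mates of a pair, `-od` sets the output directory and `-t` the number of threads.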