Transcriptome Assembly

Bastian Schiffthaler, Nicolas Delhomme

And a lot of content courtesy of Matt MacManes (@macmanes)

Overview

  • Study Design
  • Sample Collection and Preservation
  • Library Generation
  • [Sequencing and Raw Data QC]
  • Assembly

Why assemble a transcriptome

  • You don't have a reference
  • The current reference is hot garbage
  • Your specific genes are missing
  • You trust no one and want to QC an existing reference

Goal for today

You can plan, or assist in, the design and computational aspects of a transcriptome assembly study.

You know the options and tools available for short-read assembly.

The burden of choice

Which Platform

Platform   Throughput   Accuracy      Read Length   Molecule   Cost
Illumina   High         High          Short         cDNA       Low (excl. machine)
PacBio     Med          Med (high?)   Long          cDNA       High
Nanopore   Low          Med           Longer        RNA/cDNA   Low(ish)

Which Platform

Platform   Ideal Use
Illumina   Quantification/DE
PacBio     Assembly
Nanopore   Assembly/nt modification

Replication

  • Technical replicates aren't essential: technical variation is well modeled by the Poisson distribution.
  • Biological replicates are essential (esp. for DE)

“... at least six biological replicates should be used, rising to at least 12 when it is important to identify SDE genes for all fold changes.”

Marioni, John C., et al. "RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays." Genome Research 18.9 (2008): 1509-1517.

Schurch, Nicholas J., et al. "How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use?" RNA 22.6 (2016): 839-851.
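
The Poisson claim can be made concrete. As a sketch, assume the read count of one gene has mean λ across technical replicates and mean μ across biological replicates; the symbols and the negative binomial form (the model commonly used by DE tools) are assumptions added here, not taken from the slides:

```latex
\mathrm{Var}(X_{\text{tech}}) = \mathbb{E}[X_{\text{tech}}] = \lambda
\qquad \text{vs.} \qquad
\mathrm{Var}(X_{\text{bio}}) = \mu + \alpha \mu^{2}, \quad \alpha > 0
```

Technical noise is therefore predictable from the mean alone, whereas the overdispersion α can only be estimated from biological replicates.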

Replication

For assembly, use one individual or one per treatment!

Sample Collection and Preservation

  • Transcription changes quickly and may not stop at death
  • RNA degrades (fast)
  • The details are taxon-specific, so talk with an expert

Library QC

Gel electrophoresis or BioAnalyzer

Preprocessing: Diginorm

[Figure: Diginorm; panels: High Coverage, Moderate Coverage, Low Coverage]

Preprocessing: Diginorm

Why bother to do this?

  • More depth is not always better. There is a "sweet spot"
  • Assembly is a hungry hungry hippo (in terms of compute resources). Fewer (but good) reads mean faster assemblies.

Haas, Brian J., et al. "De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis." Nature protocols 8.8 (2013): 1494.
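
The "sweet spot" idea can be illustrated with a toy version of digital normalization: a read is kept only if the median count of its k-mers, among the reads kept so far, is still below a cutoff. The sketch below is a minimal illustration and not the actual Diginorm implementation; the function name diginorm, the one-sequence-per-line input, and the values k=4 and cutoff=2 are assumptions chosen for the example.

```shell
#!/usr/bin/env bash
set -eu

# Toy digital normalization (hypothetical helper, for illustration only).
# Usage: diginorm K CUTOFF < reads.txt   (one sequence per line)
diginorm() {
  awk -v k="$1" -v cutoff="$2" '
  {
    n = length($0) - k + 1            # number of k-mers in this read
    if (n < 1) next                   # read shorter than k: skip it
    # Look up how often each k-mer was seen in the reads kept so far
    for (i = 1; i <= n; i++) c[i] = count[substr($0, i, k)] + 0
    # Insertion sort of the n counts so the median can be read off
    for (i = 2; i <= n; i++) {
      v = c[i]
      for (j = i - 1; j >= 1 && c[j] > v; j--) c[j + 1] = c[j]
      c[j + 1] = v
    }
    median = c[int((n + 1) / 2)]      # lower median for even n
    if (median < cutoff) {
      print                           # coverage still low: keep the read
      for (i = 1; i <= n; i++) count[substr($0, i, k)]++
    }
  }'
}

# Four copies of one read plus one different read: the redundant copies
# beyond the cutoff are discarded, the rare read is kept.
printf '%s\n' ACGTTGCAAG ACGTTGCAAG ACGTTGCAAG GGGCCCTTTA ACGTTGCAAG |
  diginorm 4 2
```

Highly covered transcripts lose redundant reads, while reads from low-coverage transcripts pass through untouched, which is why the assembler ends up with fewer reads but loses little information.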

Trimming

  • Quality trimming is beneficial to correctness
  • BUT: Trimming is detrimental to completeness
  • Trim conservatively!
  • Adapter trimming is mandatory
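
The trimming advice above might translate into a Trimmomatic call along these lines. This is a sketch, not the course's exact command: the file names, the adapter file TruSeq3-PE.fa, the thread count, and the Phred cutoff of 5 (standing in for "conservative") are all assumptions. It also assumes a trimmomatic wrapper on the PATH; some installations need java -jar trimmomatic.jar instead. By default the script only prints the command; set DRY_RUN=0 to execute it.

```shell
#!/usr/bin/env bash
set -eu

# Assumed inputs and settings (none of these come from the slides)
R1=${R1:-sample_1.fq.gz}
R2=${R2:-sample_2.fq.gz}
ADAPTERS=${ADAPTERS:-TruSeq3-PE.fa}
THREADS=${THREADS:-4}
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the command, 0 = also execute it

# Print a command and, unless in dry-run mode, run it
run() {
  echo "+ $*"
  if [ "$DRY_RUN" -eq 0 ]; then "$@"; fi
}

trim() {
  prefix=${R1%_1.fq.gz}
  # Paired-end mode: two inputs, then paired/unpaired outputs per mate.
  # ILLUMINACLIP removes adapters; the remaining steps trim only bases
  # with very low quality (Phred < 5) and drop reads shorter than 25 nt.
  run trimmomatic PE -threads "$THREADS" "$R1" "$R2" \
    "${prefix}_trimmed_1.fq.gz" "${prefix}_unpaired_1.fq.gz" \
    "${prefix}_trimmed_2.fq.gz" "${prefix}_unpaired_2.fq.gz" \
    "ILLUMINACLIP:${ADAPTERS}:2:30:10" \
    SLIDINGWINDOW:4:5 LEADING:5 TRAILING:5 MINLEN:25
}

trim
```

Keeping the quality thresholds low preserves completeness, while the adapter step is always applied.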

Error Correction

Assembly

Many Assemblers

  • Trinity
  • SPAdes
  • Shannon
  • Trans-ABySS, Oases, SOAP, Bridger, BinPacker, IDBA-tran

Which is best?

Which assembler is best

Assembly   Sum            Missing       Unique
All        14674 ± 3590
Spades55   -              1739 ± 758    570 ± 266
Spades75   -              2711 ± 2047   301 ± 195
Shannon    -              4375 ± 3508   302 ± 241
Trinity    -              1952 ± 803    520 ± 301

Practical!

We will assemble a set of spruce mini transcriptomes (they will be bad).

Your data is in ~/raw_data/simulated. Pick a library (pair). Your pipeline will be:

  1. trimmomatic - Adapter and very mild quality trimming
  2. rcorrector - k-mer-based error correction
  3. Trinity - Assembly (Illumina only)

Markdown follow-along

To run rcorrector

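#!/usr/bin/env bash
# Download the Rcorrector wrapper script into the home directory
cd ~
wget https://raw.githubusercontent.com/mourisl/Rcorrector/master/run_rcorrector.pl
# Link the rcorrector binary next to the wrapper, which looks for it there
ln -s "$(which rcorrector)" ~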
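
With the wrapper in place, steps 2 and 3 of the pipeline could look roughly like this. It is a sketch under stated assumptions: the input names, thread count, memory limit, and output directories are made up for illustration, and Rcorrector is assumed to name its corrected files <input>.cor.fq.gz, so check the output directory if your version differs. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/usr/bin/env bash
set -eu

# Assumed inputs and settings (none of these come from the slides)
R1=${R1:-sample_trimmed_1.fq.gz}
R2=${R2:-sample_trimmed_2.fq.gz}
THREADS=${THREADS:-4}
MAX_MEM=${MAX_MEM:-10G}
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands, 0 = also execute them

# Print a command and, unless in dry-run mode, run it
run() {
  echo "+ $*"
  if [ "$DRY_RUN" -eq 0 ]; then "$@"; fi
}

# Assumed names of the corrected reads written by Rcorrector
C1="rcorrector_out/$(basename "$R1" .fq.gz).cor.fq.gz"
C2="rcorrector_out/$(basename "$R2" .fq.gz).cor.fq.gz"

# Step 2: error correction of the trimmed read pair
correct() {
  run perl "$HOME/run_rcorrector.pl" -1 "$R1" -2 "$R2" \
    -t "$THREADS" -od rcorrector_out
}

# Step 3: de novo assembly of the corrected reads
# (Trinity expects "trinity" to appear in the output directory name)
assemble() {
  run Trinity --seqType fq --left "$C1" --right "$C2" \
    --max_memory "$MAX_MEM" --CPU "$THREADS" --output trinity_out
}

correct
assemble
```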