Transcriptome Evaluation

Transcriptome Evaluation and Comparison

Bastian Schiffthaler, Nicolas Delhomme

Overview

What to assess
BUSCO
Detonate
TransRate
(RNAQuast)

What can we assess

Correctness - assembly errors
Completeness - missing pieces?

What not to use

Length based metrics: N50, mean length, ...
Why?

Detonate

Computes an assembly quality statistic from raw reads and assembled transcripts
Can run in reference guided and reference free modes
Computes Bayesian statistic based on the (mis)match of expected transcript coverage and observed transcript coverage
Useful to compare between runs, less useful to get a general idea
You MUST use the library used for the assembly

Example command:

Step 1: Estimate priors

#!/usr/bin/env/bash

mkdir ~/evaluation

rsem-eval-estimate-transcript-length-distribution \
  ~/raw_data/trinity_illumina_only/Trinity.fasta ~/evaluation/detonate.params

Step 2: calculate score

#!/usr/bin/env/bash
rsem-eval-calculate-score \
  -p 2 \
  --bowtie2 \
  --transcript-length-parameters ~/evaluation/detonate.params \
  --paired-end \
  ~/raw_data/simulated/ze_1_1.fastq.gz \
  ~/raw_data/simulated/ze_1_2.fastq.gz \
  ~/raw_data/trinity_illumina_only/Trinity.fasta \
  ~/evaluation/detonate \
  250 |& tee ~/evaluation/detonate.log

Detonate scores are always negative, but higher values (less negative) are better.

Score	-11585832.86
BIC_penalty	-3115.42
Prior_score_on_contig_lengths_(f_function_canceled)	-2952.68
Prior_score_on_contig_sequences	-489758.39
Data_likelihood_in_log_space_without_correction	-11090459.05
Correction_term_(f_function_canceled)	-452.68
Number_of_contigs	488
Expected_number_of_aligned_reads_given_the_data	333817.00
Number_of_contigs_smaller_than_expected_read/fragment_length	31
Number_of_contigs_with_no_read_aligned_to	5
Maximum_data_likelihood_in_log_space	-11006679.43
Number_of_alignable_reads	333817
Number_of_alignments_in_total	373089

TransRate

Reference free QA of transcript assemblies
Detects chimeras, structural errors, incomplete assembly parts and base errors.
Useful to get a general idea
You MUST use the library used for the assembly
Scores greater than 0.22 are acceptable (better than 1/2 of all published transcriptomes at the time of publication).

Example command:

#!/usr/bin/env bash
gzip -d -c ~/raw_data/simulated/ze_1_1.fastq.gz > ~/evaluation/ze_1_1.fq
gzip -d -c ~/raw_data/simulated/ze_1_2.fastq.gz > ~/evaluation/ze_1_2.fq

transrate \
  --assembly=~/raw_data/trinity_illumina_only/Trinity.fasta \
  --left=~/evaluation/ze_1_1.fq \
  --right=~/evaluation/ze_1_2.fq |& tee ~/evaluation/transrate.log

Transrate output

[ INFO] 2020-12-09 02:56:51 : Contig metrics:
[ INFO] 2020-12-09 02:56:51 : -----------------------------------
[ INFO] 2020-12-09 02:56:51 : n seqs                          488
[ INFO] 2020-12-09 02:56:51 : smallest                        203
[ INFO] 2020-12-09 02:56:51 : largest                        3393
[ INFO] 2020-12-09 02:56:52 : n bases                      353286
[ INFO] 2020-12-09 02:56:52 : mean len                     723.95
[ INFO] 2020-12-09 02:56:52 : n under 200                       0
[ INFO] 2020-12-09 02:56:52 : n over 1k                       101
[ INFO] 2020-12-09 02:56:52 : n over 10k                        0
[ INFO] 2020-12-09 02:56:52 : n with orf                      289
[ INFO] 2020-12-09 02:56:52 : mean orf percent              98.16
[ INFO] 2020-12-09 02:56:52 : n90                             357
[ INFO] 2020-12-09 02:56:52 : n70                             587
[ INFO] 2020-12-09 02:56:52 : n50                             894
[ INFO] 2020-12-09 02:56:52 : n30                            1473
[ INFO] 2020-12-09 02:56:52 : n10                            2380
[ INFO] 2020-12-09 02:56:52 : gc                             0.45
[ INFO] 2020-12-09 02:56:52 : bases n                           0
[ INFO] 2020-12-09 02:56:52 : proportion n                    0.0
[ INFO] 2020-12-09 02:56:52 : Contig metrics done in 0 seconds
[ INFO] 2020-12-09 02:56:52 : Calculating read diagnostics...

Transrate output

[ INFO] 2020-12-09 02:57:17 : Read mapping metrics:
[ INFO] 2020-12-09 02:57:17 : -----------------------------------
[ INFO] 2020-12-09 02:57:17 : fragments                    341812
[ INFO] 2020-12-09 02:57:17 : fragments mapped             335790
[ INFO] 2020-12-09 02:57:17 : p fragments mapped             0.98
[ INFO] 2020-12-09 02:57:17 : good mappings                333470
[ INFO] 2020-12-09 02:57:17 : p good mapping                 0.98
[ INFO] 2020-12-09 02:57:17 : bad mappings                   2320
[ INFO] 2020-12-09 02:57:17 : potential bridges                 7
[ INFO] 2020-12-09 02:57:17 : bases uncovered                5229
[ INFO] 2020-12-09 02:57:17 : p bases uncovered              0.01
[ INFO] 2020-12-09 02:57:17 : contigs uncovbase                31
[ INFO] 2020-12-09 02:57:17 : p contigs uncovbase            0.06
[ INFO] 2020-12-09 02:57:17 : contigs uncovered                 4
[ INFO] 2020-12-09 02:57:17 : p contigs uncovered            0.01
[ INFO] 2020-12-09 02:57:17 : contigs lowcovered                5
[ INFO] 2020-12-09 02:57:17 : p contigs lowcovered           0.01
[ INFO] 2020-12-09 02:57:17 : contigs segmented                 6
[ INFO] 2020-12-09 02:57:17 : p contigs segmented            0.01

[ INFO] 2020-12-09 02:57:17 : TRANSRATE ASSEMBLY SCORE     0.7492
[ INFO] 2020-12-09 02:57:17 : -----------------------------------
[ INFO] 2020-12-09 02:57:17 : TRANSRATE OPTIMAL SCORE      0.7881
[ INFO] 2020-12-09 02:57:17 : TRANSRATE OPTIMAL CUTOFF     0.4913
[ INFO] 2020-12-09 02:57:17 : good contigs                    474
[ INFO] 2020-12-09 02:57:17 : p good contigs                 0.97

BUSCO - Benchmarking Universal Single Copy Orthologs

Database of single copy orthologs for ancestral lines
Expect to find many of these in the assembly
Expect to not have duplicated matches... unless?
Scores of >0.75 probably good
Choice of database matters!

Example command

#!/usr/bin/env bash

cd ~/evaluation

busco \
  -i ~/raw_data/trinity_illumina_only/Trinity.fasta \
  -c 2 \
  -o busco_trinity \
  -m tran \
  --long \
  -l /db/busco/embryophyta_odb10 |& tee ~/evaluation/busco.log

BUSCO output

        --------------------------------------------------
	|Results from dataset embryophyta_odb10           |
	--------------------------------------------------
	|C:0.4%[S:0.3%,D:0.1%],F:0.4%,M:99.2%,n:1614      |
	|6	Complete BUSCOs (C)                       |
	|5	Complete and single-copy BUSCOs (S)       |
	|1	Complete and duplicated BUSCOs (D)        |
	|7	Fragmented BUSCOs (F)                     |
	|1601	Missing BUSCOs (M)                        |
	|1614	Total BUSCO groups searched               |
	--------------------------------------------------