Strandness of RNA-seq and Transcripts Explained
Two posts by Griffith Lab and by Hong Zheng are already pretty comprehensive in describing the strandness parameters of many tools. Here I make some notes for my own records.
Strandness of RNA-seq
Imagine we collected some RNA molecules, performed the sequencing experiments, and obtained a bunch of paired-end RNA-seq reads (R1 and R2 for the two ends of each pair). Now we want to know whether R1 or R2 has the same sequence as the original RNA (of course, the reads are shorter than the RNA molecule and may contain errors). This is called the strandness of RNA-seq.
As of today, the mainstream NGS technology (e.g. Illumina) is sequencing-by-synthesis (SBS). Consequently, R1 and R2 must come from different strands of the sequenced molecule, because they are synthesized from opposite directions. That leaves only 2 possibilities for strandness: (1) R1 is from the sense strand and R2 is from the anti-sense strand; or (2) R1 is from the anti-sense strand and R2 is from the sense strand. Here we call the strand with the same sequence as the RNA the sense strand. Likewise, the strand that is reverse-complementary to the RNA is called the anti-sense strand.
Naming conventions
Most naming conventions fall into two types: `rf`/`fr` for strandness and `first`/`second` for library type.
RF: The first read is Reverse sequence of RNA and the second read is Forward sequence of RNA.
FR: The first read is Forward sequence of RNA and the second read is Reverse sequence of RNA.
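The RF/FR definitions can be checked on a toy example. The sketch below is only an illustration: it assumes error-free reads that match the transcript exactly, and the names `reverse_complement` and `classify_pair` are made up for this note. Real reads need an aligner.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


def classify_pair(transcript: str, r1: str, r2: str) -> str:
    """Classify one read pair as 'FR', 'RF', or 'undetermined'.

    `transcript` is the RNA sequence written in DNA letters (T instead of U).
    Assumption: reads are exact sub-strings, with no sequencing errors.
    Palindromic reads are ambiguous and are reported as 'FR' by this sketch.
    """
    if r1 in transcript and reverse_complement(r2) in transcript:
        return "FR"  # R1 is forward, R2 is reverse
    if reverse_complement(r1) in transcript and r2 in transcript:
        return "RF"  # R1 is reverse, R2 is forward
    return "undetermined"


transcript = "GATTACAGGCTTCAAGTCCA"
forward_read = transcript[:8]                       # same sequence as the RNA
reverse_read = reverse_complement(transcript[-8:])  # rev-comp of the RNA

print(classify_pair(transcript, forward_read, reverse_read))  # FR
print(classify_pair(transcript, reverse_read, forward_read))  # RF
```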
The `first`/`second` library types can be somewhat confusing (at least I got confused a few times…). R1 of a `first`-stranded library has the same sequence as the anti-sense strand, and R1 of a `second`-stranded library has the same sequence as the sense strand. This is because the sequenced substrates in RNA-seq technologies are (most of the time) cDNAs. The first cDNA strand, which uses the RNA as template, is rev-comp to the RNA sequence. A second cDNA strand can then be synthesized to make the molecule double-stranded. Since cDNAs are double-stranded, technologies are further divided by which strand they sequence:
first: sequencing the first cDNA strand (anti-sense)
second: sequencing the second cDNA strand (sense).
Note that the first cDNA strand is reverse-complementary to the RNA molecule.
Parameters for some bioinformatics tools
Two easy-to-use tools for checking strandness are infer_experiment.py from RSeQC and check_strandness.
The table below assumes all reads are paired and sequenced inwards.
RNA | 5’ -> 3’ | 3’ -> 5’ | |
---|---|---|---|
DNA | coding/ sense strand | noncoding/ anti-sense | |
Reads | R1 is a sub-string of RNA, R2 is rev-comp | R2 is a sub-string of RNA, R1 is rev-comp | Both R1 and R2 may be the same or rev-comp of a sub-string of RNA |
infer_experiment.py | 1++,1--,2+-,2-+ | 1+-,1-+,2++,2-- | |
check_strandness | FR/fr-secondstrand | RF/fr-firststrand | |
Kallisto | --fr-stranded | --rf-stranded | |
Salmon | --libType ISF | --libType ISR | --libType IU |
Scallop | --library_type second | --library_type first | --library_type unstranded |
Stringtie | --fr | --rf | |
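For my own convenience, the table can also be kept as a small lookup. This is only a sketch: the flags are copied from the table above, while the names `TOOL_FLAGS`, `call_strandness`, `flag_for` and the 0.8 cutoff are my own assumptions and not part of any of these tools. The two fractions are the ones infer_experiment.py reports for "1++,1--,2+-,2-+" and "1+-,1-+,2++,2--".

```python
from typing import Optional

# Flags copied from the table above; tools without a listed flag are omitted.
TOOL_FLAGS = {
    "fr-secondstrand": {
        "kallisto": "--fr-stranded",
        "salmon": "--libType ISF",
        "scallop": "--library_type second",
        "stringtie": "--fr",
    },
    "fr-firststrand": {
        "kallisto": "--rf-stranded",
        "salmon": "--libType ISR",
        "scallop": "--library_type first",
        "stringtie": "--rf",
    },
    "unstranded": {
        "salmon": "--libType IU",
        "scallop": "--library_type unstranded",
    },
}


def call_strandness(frac_fr: float, frac_rf: float, min_fraction: float = 0.8) -> str:
    """Turn the two infer_experiment.py fractions into a library type.

    frac_fr: fraction of reads explained by "1++,1--,2+-,2-+"
    frac_rf: fraction of reads explained by "1+-,1-+,2++,2--"
    min_fraction is an arbitrary cutoff (assumption). Anything that is not
    clearly stranded is called "unstranded", which is a simplification.
    """
    if frac_fr >= min_fraction:
        return "fr-secondstrand"
    if frac_rf >= min_fraction:
        return "fr-firststrand"
    return "unstranded"


def flag_for(tool: str, strandness: str) -> Optional[str]:
    """Look up the flag for a tool, or None if the table lists no flag."""
    return TOOL_FLAGS[strandness].get(tool.lower())


library = call_strandness(0.02, 0.96)
print(library, flag_for("Salmon", library))
```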
Strand of transcripts
Transcripts/RNAs also have “strand” information (e.g. the 7th column, `strand`, in a gtf file). It needs to be clarified that:
- strandness of RNA-seq: we are talking about which strand of RNA/cDNA the seq reads are from, the complementary strand (i.e. rev-comp to RNA, like the first cDNA strand) or the direct-RNA strand (i.e. same sequence as RNA). The second strand of cDNA has the same sequence as the direct RNA, so computationally direct-RNA seq has the same strandness as “second-stranded”.
- strand of transcripts: we are talking about which strand of the genome the gene/transcript is from. In other words, whether the RNA aligns to the forward sequence of the genome (`+` strand, same sequence as genome.fa file) or the RNA aligns to the reverse-complementary sequence of the genome (`-` strand, rev-comp to sequence of genome.fa file).
The strand of a transcript can be inferred by using read alignment and strandness of reads, and vice versa.
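As a sketch of the first direction (read alignment plus strandness gives transcript strand), the function below reproduces the infer_experiment.py patterns such as "1++" (read 1 maps to `+`, gene on `+`). The function name and argument names are hypothetical, and paired, inward-facing reads are assumed.

```python
from typing import Optional


def infer_transcript_strand(library: str, read_number: int, read_on_reverse: bool) -> Optional[str]:
    """Infer the genomic strand of the transcript a read came from.

    library: "fr-secondstrand", "fr-firststrand", or "unstranded"
    read_number: 1 for R1, 2 for R2
    read_on_reverse: True if the read aligns to the '-' strand of the genome
    Returns '+', '-', or None when the library carries no strand information.
    """
    if library == "unstranded":
        return None
    if library not in ("fr-secondstrand", "fr-firststrand"):
        raise ValueError(f"unknown library type: {library}")
    if read_number not in (1, 2):
        raise ValueError("read_number must be 1 or 2")

    # In fr-secondstrand, R1 has the same sequence as the RNA (sense);
    # in fr-firststrand, R2 does.
    read_is_sense = (read_number == 1) == (library == "fr-secondstrand")

    aligned_strand = "-" if read_on_reverse else "+"
    if read_is_sense:
        return aligned_strand
    # An anti-sense read aligns to the strand opposite to the transcript.
    return "+" if aligned_strand == "-" else "-"


# "1++": R1 of an fr-secondstrand library on '+' implies a '+' transcript.
print(infer_transcript_strand("fr-secondstrand", 1, False))  # +
# "1+-": R1 of an fr-firststrand library on '+' implies a '-' transcript.
print(infer_transcript_strand("fr-firststrand", 1, False))   # -
```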
SAM tags ts, XS
Some splice-aware aligners output a SAM tag `ts` that indicates which transcript strand the read is from. This tag is usually inferred by checking the canonical intron splice motif of the reads without prior knowledge of the transcript information. Namely, a `+` ts tag means the read is from the same strand as the mRNA, and a `-` ts tag means the read is from the first cDNA strand (rev-comp to mRNA). Accordingly, R1 of a paired and fr-stranded RNA-seq sample is supposed to have all positive ts tags, while R2 of the same sample has all negative ts tags. In an unstranded sample, roughly half of the reads may be assigned positive ts and half negative ts.
Some splice-aware aligners (e.g. STAR with `--outSAMstrandField intronMotif`) output a SAM tag `XS` that indicates the strand of the RNA transcript. Namely, the XS tag should be the same as the `strand` information in a gtf file for the same transcript. The relationship between ts and XS can be worked out from the read alignment. Obviously, if a read aligns to the positive strand of the genome and it is from the positive strand of the RNA (positive ts), then the transcript should be from the positive strand of the genome (positive XS, positive strand in gtf). If a read aligns to the negative strand of the genome and it is from the negative strand of the RNA (negative ts), then the transcript should still be from the positive strand of the genome (positive XS, positive strand in gtf), i.e. the double negation cancels out.
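To make the "canonical intron splice motif" idea concrete, here is a minimal sketch that reads the transcript strand off an intron in the genome sequence. It only handles the most common GT...AG intron (which appears as CT...AC when the transcript is on the `-` strand), uses 0-based half-open coordinates, and is not how any particular aligner implements it; the function name is made up.

```python
from typing import Optional


def intron_motif_strand(genome: str, intron_start: int, intron_end: int) -> Optional[str]:
    """Guess the transcript strand from the intron's boundary dinucleotides.

    genome: forward-strand sequence (as in genome.fa)
    intron_start, intron_end: 0-based, half-open interval of the intron
    Returns '+', '-', or None if the motif is not the canonical GT...AG.
    """
    intron = genome[intron_start:intron_end].upper()
    if len(intron) < 4:
        return None
    first_two, last_two = intron[:2], intron[-2:]
    if (first_two, last_two) == ("GT", "AG"):
        return "+"  # donor GT and acceptor AG read on the forward strand
    if (first_two, last_two) == ("CT", "AC"):
        return "-"  # rev-comp of GT...AG, so the transcript is on '-'
    return None


print(intron_motif_strand("AAAGTCCCCAGTTT", 3, 11))  # +
print(intron_motif_strand("AAACTGGGGACTTT", 3, 11))  # -
```

Note that this value is relative to the genome, so it corresponds to the gtf strand rather than to the read-relative ts tag.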
In SAM/BAM format, bit 16 (0x10) of the SAM FLAG indicates whether a read aligns to the reverse strand of the genome (bit 16 set iff `-` strand; bit 16 unset iff `+` strand). Hence, treating both as Boolean values (bit 16 is true when set, ts is true when `+`), `XS` is negative iff bit 16 and ts are the same; `XS` is positive iff bit 16 and ts are different.
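The Boolean rule above fits in a few lines. This is a sketch with a made-up function name; it takes the integer FLAG and the ts value as plain Python values, and returns the genomic transcript strand.

```python
def transcript_strand_from_ts(flag: int, ts: str) -> str:
    """Combine SAM FLAG bit 16 with the ts tag to get the transcript strand.

    flag: integer SAM FLAG of the alignment
    ts: '+' or '-', the read-relative transcript strand tag
    Returns '+' or '-', the strand of the transcript on the genome.
    """
    if ts not in ("+", "-"):
        raise ValueError("ts must be '+' or '-'")
    read_on_reverse = bool(flag & 16)  # bit 16 set iff read aligns to '-'
    ts_positive = ts == "+"
    # Transcript strand is '+' iff the two Booleans differ.
    return "+" if read_on_reverse != ts_positive else "-"


print(transcript_strand_from_ts(0, "+"))   # +
print(transcript_strand_from_ts(16, "-"))  # + (double negation cancels out)
print(transcript_strand_from_ts(16, "+"))  # -
```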