RNASq - Seq Analysis Tools

This documents a set of tools, written for use in Python and using extensively the tools from the BioPython suite. It is intended not for genomic studies, but rather, for characterizing relatively short complete or RACE sequences that are expected to be based on an "expected" sequence but that nevertheless have significant sequence and/or length heterogeneity. We use the tools to study in vitro transcribed RNAs.

Introduction

This document describes a suite of Python tools for analysis of in vitro RNA-Seq data (not intended for genomic analyses). The basic usage follows an object oriented model. You start by defining a variable to contain information about a specific experimental data set.

myExptSetUp = Seqsetup(
)

Click

Parameters above, in order (noting how you can reference each in programming):

RNAset.filename - Illumina file name: fastq format (gzipped, or not)
RNAset.tseq - Expected (encoded) sequence (can be in DNA or RNA format)
RNAset.adptr5 - 3' end of the 5' adapter (default used in trimming; can be overridden at trimming)
RNAset.adptr3 - 5' end of the 3' adapter (default used in trimming; can be overridden at trimming)
RNAset.exptinfo - (einfo) special variable - see below - contains info on the transcription experiment
RNAset.pnotes - one line description of the experiment - what's it about?
RNAset.istemplate - True/False - is this a DNA sequence (typical use: TRUE if seq of template DNA)?
RNAset.SeqsUsed - dictionary variable - see below - contains sequences relative to the experiment
RNAset.dData - dictionary variable - see below - contains any other info to include

The default location for the sequence input file is in a directory called 'Data' one level up from the Python source code.
The default for output is in a directory called 'Output" one level up. These locations can be changed as follows:

config.Options["InputDir"] = "../Data"
config.Options["OutputDir"] = "../Output"

Note that here and below, all sequences should be written 5' to 3'

The following variables might be defined once (or twice, or three times) and then used in the definition of multiple sequencing data sets by passing the parameter as shown above.

einfo is a parameter set containing information on the experimental details. It can be setup using the following function:

einfo = Exptinfo(
)

SeqsUsed is a parameter set containing info about the DNA constructs used. All parameters are optional. It can be setup using the following syntax:

SeqsUsed = {
}

AlignSeq is used as a default by some functions for aligning sequences that might have been "frameshifted." So for example, if transcription started at +2, but you are interested in the 3' end of the transcript, that end will be shifted left by one. Referencing a sequence in the middle of the transcript, one can align all sequences by their common internal alignment sequences - now the 3' ends line up regardless of initiation variability. The value passed here is the default and typically can be overridden.

Routines that use this as a default: endAnalysis, getPhaseShifted (set keypos to 0); getSubseqBySeq, getSubseqByRelativeSeq, getWithMatched, getWithMatchWndw, RNAkeyseqPosAnal (set keyseq to False),

In your own programming, you can access these as: newvar = RNAset.SeqsUsed["Ntmpl"]

dData is a parameter set containing other information. All parameters are optional. It can be setup using the following syntax:

dData = {
}

In your own programming, you can access these as: newvar = RNAset.dData["Run Date"]

Click

After defining your new variable myExptSetUp using Seqsetup, you then use yExptSetUp to load (import) sequence data into a new object DNAset (you can have multiple objects for different sequences; this is the object oriented approach). The new variable (object) will then include both the raw sequence data and all of the information that was provided in the definition of the experimental data set.

DNAset = myExptSetUp.import_dataset();

places all of the Illumina sequences (in DNA format by default) into a new (object) variable called DNAset

DNAsetTrimmed = DNAset.trimAdaptors(None, None);

returns a new (object) variable containing sequences with the 5' and 3' adapters removed (trimmed). The two parameters specify the 5' and 3' adapter sequences used for trimming (specifying None or False for each says to use the sequences in the above definition of 'myExptSetUp'). Note that primer dimers (5' and 3' adapters directly ligated, with no intervening DNA) and sequences that do not contain both 5' and 3' adapter sequences are NOT returned - the console will report on these statistics.

Note that the sequences are still DNA format at this point.

RNAset = DNAsetTrimmed.toRNASet();

converts the trimmed DNA sequences into RNA (replaces T by U, and flags the set as RNA)

Finally, all of the above can be condensed into one line by chaining the objects together:

RNAset = myExptSetUp.import_dataset().trimAdaptors(None, None).toRNASet();

At this point, RNAset includes all of the RNA sequences, trimmed to remove adapter sequences. It also includes all of the information that was specified in the initial definition of 'myExptSetUp. This approach illustrates what you will be doing next.

For example, to filter the data, returning only sequences of length 10 bases or longer:

RNAsetNew = RNAset.getSpecificLengths(10,10000,'remove abortives');

Click

Typical usage involves first setting up an "experiment" by calling Seqsetup. This identifies the .fasta or .fastq file containing the sequences. It also sets various parameters, including the 5' and 3' adapter sequences and the expected RNA(DNA) sequence to be found between those adapters. Data files are expected to be found in the Data folder, one level above the code.

After defining the experiment with Seqsetup, the data are loaded into a NucleicSet object using the .import_dataset command below (this command is an object property of the Seqsetup variable and so is called as
newNucleicSet = SeqSet.import_dataset().

The generated NucleicSet variable can then be further processed using any of the 'getXxxx" commands below. These operate on the NucleicSet and return a new NucleicSet. Unless the primary data have already been processed, the first call is typically something like RNAset = newNucleicSet.trimAdaptors(None,None). For this function, the first two parameters are the 5' and 3' adaptor sequences to use in trimming. Passing None for each of these defaults to using the adaptors defined in the Seqsetup step.

The trimmed data can now be processed to, for example, extract only post-abortive sequences using a call like
postAbrtvSet = newNucleicSet.getSpecificLengths(9,2000,'post-abortives')
Note also that many other functions implicitely filter for minimum lengths. For example, .getPhaseShifted and .getPrimedExt select for sequences containing key sequences at specific or minimal length positions.

Many of the .getXXX functions also report analyses of the processing. In addition, there are Analysis functions that are solely for analysis. These are typically called as
RNAset.getGel('A','label RNA with alpha-ATP')

Note that most functions have a comment variable. This is simply for your use. Anything entered there will be printed above the results of the function (if this variable is set to "quiet", output will be suppressed.

Finally, output can be sent to a file by bracketing commands with StartCaptureToFile() and WriteCaptureToFile('_extra'). That file will be created in the Output folder, one level above the code.

What is ?

In the descriptors below, RNAset refers to a python object (a variable) that contains a set of sequencing data. This class of object contains a set of DNA/RNA sequences, but also contains a variety of information on those data. The 'get' functions below operate on one RNAset and return a new one

For example: newSet = RNAset.getSpecificLengths(8,10,'return 8mers, 9mers, and 10mers only')

Specifically,
xx.filename is the name of the file from which these data were read
xx.tseq = the expected RNA transcript sequence
xx.SeqsUsed is a python dictionary containing (5' to 3') sequences used in the experiment. Examples include:

xx.SeqsUsed['Tmplt'] - sequence of the DNA template strand (remember: 5' to 3')
xx.SeqsUsed['NTmpl'] - sequence of the DNA nontemplate strand
xx.SeqsUsed['5Prmr'] - the entire sequence of the 5' adaptor (not just the barcode)
xx.SeqsUsed['3Prmr'] - the entire sequence of the 3' adaptor
xx.SeqsUsed['AlignSeq'] - an internal sequence used to align sequences, typically a subset of tseq
xx.SeqsUsed['Toehld'] - the inverse complement of the template toehold if present (in other words, the expected RNA sequence, were the toehold to be transcribed)
xx.SeqsUsed['T2a'] - in toehold synthesis, the upstream (promoter) template strand (remember: 5' to 3')
xx.SeqsUsed['T2b'] - in toehold synthesis, the transcribed template strand (remember: 5' to 3')
xx.SeqsUsed['Target'] - in the 'jump" experiment, the target template strand (remember: 5' to 3')

xx.dData is a python dictionary containing misc info

For the :


    class NucleicSet(object):
    def __init__(self, dData, seqreclist, setnotes, history, tseq, exptinfo, istemplate, SeqsUsed, filename, xtra):
        self.dData = dData             # Dictionary with all sorts of data (currently unused, but in the future...)
        self.seqreclist = seqreclist   # list of SeqRecord objects, one for each sequence
        self.setnotes = setnotes       # string - notes about this set
        self.history = history         # the history of this data  (aka pnotes)
        self.tseq = tseq               # string - expected sequence
        self.exptinfo = exptinfo       # Exptinfo object - experimental details
        self.istemplate = istemplate   # boolean - is template DNA, not RNA
        self.SeqsUsed = SeqsUsed       # Dictionary of DNA (5' to 3') used in expt - 'fullseq' is full sequence, with 5' and 3' primers
        self.filename = filename       # original source file name (if any)
        self.xtra = xtra               # variable of class Xtra, for holding a variety of things analytical

Quick useful data:

xx.count() - the number of sequences in this set
xx.maxlength() - the length of the longest RNA in this set

Filter ("get") Functions

These functions return a subset of the sequences, depending on specific criteria. They do not modify the sequences returned, they just remove some.

get only specific lengths of RNA. For example, to look at transcripts past the abortive phase you might call: getSpecificLengths(tmpWTset, 9, 1000, 'post-abortives')

Interesting uses:

RNAset.getSpecificLengths(2,8,'get only short (abortive) RNAs')
RNAset.getSpecificLengths(9,10000,'remove short (abortive) RNAs')
RNAset.getSpecificLengths(len(RNAset.tseq,10000),len(RNAset.tseq,10000),'get only expected length (full length) RNAs')

tmptmp = RNAset.getSpecificLengths(12,20,'testing specific lengths') tmptmp.printMostCommon(2.0,'Most common seqs','testing print most common sequences') >[RMHD_S8_L001_R1_001].importrawdataset().trimAdaptors('CTCCAT','TGGAA').getSpecificLengths(12,20,'') testing specific lengths Return only seqs of length 12 to 20. >> 9473 of 21736 (43.6%) >[RMHD_S8_L001_R1_001].importrawdataset().trimAdaptors('CTCCAT','TGGAA').getSpecificLengths(12,20,'').printMostCommon(2.0,'Most common seqs','') testing print most common sequences Most common s GGACTACGTACGTCGACGCNNNNNNTGCGTCGAGTATTTA 9473 ....!....1....!....2 1364 14.4% GGACTACGTACGTCGACGCC 1200 12.7% GGACTACGTACGTCGACGC 762 8.0% GGACTACGTACGTCGAC 749 7.9% GGACTACGTACGTCGACGCT 598 6.3% GGACTACGTACGTC 504 5.3% GGACTACGTACGTCGACG 464 4.9% GGACTACGTACGTCGACC 315 3.3% GGACTACGTACGTCGACGCA 274 2.9% GGACTACGTACGTCGACGCG 234 2.5% GGACTACGTACGTCGACGT 232 2.4% GGACTACGTACGTCGA 223 2.4% GGACTACGTACG 192 2.0% GGACTACGTACGTCGATC

Tests each sequence for the occurence of the key sequence; when matched, returns the entire sequence
pos = 0 allows the sequence to be found anywhere
pos > 0 returns a match only if the sequence is found at position 'pos'

If keyseq is a sequence preceded by '-' (e.g. '-GGACTTA' or '-5Prmr'), returns the sequences that do NOT contain the key sequence

Interesting uses:

RNAset.getWithMatched(RNAset.tseq[:2],1,'get only RNAs with correct dinucleotide starting sequence')
RNAset.getWithMatched(RNAset.SeqsUsed["AlignSeq"],0,'get only RNAs with originally specified align sequence (anywhere)')
RNAset.getWithMatched(False,0,'get only RNAs with originally specified align sequence (anywhere)')
RNAset.getWithMatched(False,False,'get only RNAs with originally specified align sequence at the expected position')
RNAset.getWithMatched(RNAset.tseq[-4:],0,'get only RNAs with the last four encoded bases, at any position')

tmpWTset = RNAset.getWithMatched('TCGAC',12,'testing get with Matched specific') tmpWTset.printMostCommon(2.0,'Most common','testing print most common sequences') tmpWTset.printSampleSeqs(10) >[RMLD_S7_L001_R1_001].importrawdataset().trimAdaptors('TACATA','TGGAA').getWithMatched('TCGAC',12,'') testing get with Matched specific Filtering only sequences that contain TCGAC at position 12. >> 173 of 27608 (0.6%) >[RMLD_S7_L001_R1_001].importrawdataset().trimAdaptors('TACATA','TGGAA').getWithMatched('TCGAC',12,'').printMostCommon(2.0,'Most common','') testing print most common sequences Most common GGACTACGTACGTCGACGCNNNNNNTGCGTCGAGTATTTA 173 ....!....1....!.. 4 2.3% GACTACGTACGTCGACG >[RMLD_S7_L001_R1_001].importrawdataset().trimAdaptors('TACATA','TGGAA').getWithMatched('TCGAC',12,'').printSampleSeqs(10) GGACTACGTACGTCGACGCNNNNNNTGCGTCGAGTATTTA ....'....|....'....|....'....|....'....|....' GACTACGTACGTCGACG GGACTAGTACGTCGACGCTCCAGCTGCGTCGAGTATTT GGACACGTACGTCGACGCTGTC GGACTACGACGTCGACGCGACAAGTGCC GACTACGTACGTCGACGCTCGCAGC GGACACGTACGTCGACGCCGCGAGTGC GGCTACGTACGTCGAC GGACTACTACGTCGACGCCGATTCTGCG GGACTACGTCGTCGACGCGGA GACTACGTACGTCGACGCTGCC